Here are slides from a talk, coauthored with Richard Sproat, given at the Grapholinguistics in the 21st Century conference, on how we talk about writing systems in speech and language processing. We will try to get this into archival form soon.
On “from scratch”
For a variety of historical and sociocultural reasons, nearly all natural language processing (NLP) research involves processing of text, i.e., written documents (Gorman & Sproat 2022). Furthermore, most speech processing research uses written text either as input or output.
A great deal of speech or language processing treats words (however they are understood) as atomic, indivisible units rather than the “intricately structured objects linguists have long recognized them to be” (Gorman in press). But there has been a recent trend to instead work with individual Unicode codepoints, or even the individual bytes of a Unicode string encoded in UTF-8. When such systems are part of an “end-to-end” neural network, these systems are sometimes said to be “from scratch”; see, e.g., Gillick et al. 2016 and Li et al. 2019, who both use this exact phrase to describe their contributions. There is an implication that such systems, by bypassing the fraught notion of word, have somehow eliminated the need for linguistic insight altogether.
The expression “from scratch” makes an analogy to baking: it is as if we are making angel food cake by sifting flour, superfine sugar, and cream of tartar, rather than using the “just add water and egg whites” mixes from Betty Crocker. But this analogy understates just how much linguistic knowledge can be baked in (or perhaps “sifted in”) to writing systems. Writing systems are essentially a type of linguistic analysis (Sproat 2010), and like any language technology, they necessarily reify the analysis that underlies them.1 The linguistic analysis underlying a writing system may be quite naïve but may also encode sophisticated phonemic and/or morphemic insights. Thus written text, whether expressed as Unicode codepoints or UTF-8 bytes, may have quite a bit of linguistic knowledge sifted and folded in.
A familiar example of this kind of knowledge comes from English (Gorman in press). In this language, changes in vowel quality triggered by the addition of “level 1” suffixes like -ity are generally not indicated in written form. Thus sane [seɪn] and sanity [sæ.nɪ.ti], for instance, are spelled more similarly than they are pronounced (Chomsky and Halle 1968: 44f.), meaning that this vowel change need not be modeled when working with written text.
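To make the distinction between codepoints and UTF-8 bytes concrete, here is a minimal Python sketch. The example word and the variable names are chosen purely for illustration; the transcriptions are the ones given above, with the syllable boundaries dropped.

```python
import os

# "naïve", written with a precomposed i-diaeresis (U+00EF).
word = "na\u00efve"

# A sequence of Unicode codepoints: one integer per character.
codepoints = [ord(ch) for ch in word]
# A sequence of UTF-8 bytes: the non-ASCII character takes two bytes.
utf8_bytes = list(word.encode("utf-8"))
print(len(codepoints), codepoints)
print(len(utf8_bytes), utf8_bytes)

# The spellings of "sane" and "sanity" share a longer prefix than their
# transcriptions do, because the vowel change is not written.
spelling_prefix = os.path.commonprefix(["sane", "sanity"])
ipa_prefix = os.path.commonprefix(["seɪn", "sænɪti"])
print(repr(spelling_prefix), repr(ipa_prefix))
```

Either representation inherits whatever analysis the orthography encodes; switching from codepoints to bytes changes only the granularity of the symbols.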
Endnotes
- The Sumerian and Egyptian scribes were thus history’s first linguists, and history’s first language technologists.
References
Chomsky, N., and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. 2016. Multilingual language processing from bytes. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1296-1306.
Gorman, K. In press. Computational morphology. In Aronoff, M. and Fudeman, K., What is Morphology? 3rd edition. Blackwell.
Gorman, K., and Sproat, R. 2022. The persistent conflation of writing and language. Paper presented at Grapholinguistics in the 21st Century.
Li, B., Zhang, Y., Sainath, T., Wu, Y., and Chan, W. 2019. Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5621-5625.
Sproat, R. 2010. Language, Technology, and Society. Oxford University Press.
The computational revolution in linguistics
(Throughout this post, I have taken pains not to name any names. The beauty of subtweeting and other forms of subposting is that nobody knows for sure you’re the person being discussed unless you volunteer yourself. So, don’t.)
One of the more salient developments in linguistics as a discipline over the last two decades is the way in which computational knowledge has diffused into the field.1 Twenty years ago, there were but a handful of linguistics professors in North America who could perform elaborate corpus analyses, apply machine learning and statistical analysis, or extract acoustic measurements from an audio file. And, while it was in some ways quite robust, speech and language processing at the turn of the last century simply did not hold the same importance it does nowadays.
While some professors—including, to their credit, many of my mentors and colleagues—can be commended for having “skilled up” in the intervening years, this knowledge has, I am sad to say, mostly advanced one death (and subsequent tenure line renewal) at a time. This has negative consequences for linguistics students who want to train for or pivot to a career in the tech sector, since there are professors who were, in their time, computationally sophisticated, but lack the skills a rising computational linguist is expected to have mastered. In an era of contracting tenure rolls and other forms of casualization in the academy, this has the risk of pushing out legitimate, albeit staid, lines of linguistic inquiry in favor of areas favored by capitalists.2
Yet I believe that this upskilling has a lot to contribute to linguistics as a discipline. There are many core questions about language use, acquisition, variation, and change which are best answered with a computational simulation that forces us to be explicit about our assumptions, or a corpus study that tells us what people really said, or a statistical analysis that tells us whether our correlations are likely to be meaningful, or even a machine learning system that helps us rapidly label linguistic data.3 It is a boon to our field that linguists of any age can employ these tools when appropriate.
This is not to say that the transition has not been occasionally ugly. First, there are the occasional nasty turf wars over who exactly is a linguist.4 Second, the standards of quality for work in this area must be negotiated and imposed. While a syntax paper in NL&LT from even 30 years ago is easily readable today, the computational methods of even widely-praised papers from 15 or 20 years ago are, frankly, often quite sloppy. I have found it necessary to explain this to students who want to interact with this older work lest they lower their own methodological standards.
I discern at least a few common sloppy habits in this older computational work, focusing for the moment on computational cognitive models of linguistic behavior.
- If a proposed computational model is compared to some “baseline” or older model, this older model is usually an ancient associationist model from psychology. This older model naturally lacks much of the rich linguistic specifications of the proposed model, and naturally it fails to model the data. Deliberately picking a bad baseline is putting one’s finger on the scale.
- Comparison of different computational models is usually informal. One should instead use statistical model comparison methods.
- The dependent variable for modeling is often derived from poorly-designed human subjects experiments. The subjects in these experiments may be instructed to perform a task they are unlikely to be able to do consciously (i.e., the tasks are cognitively impenetrable). Unjustified assumptions about appropriate scales of measurement may have been made. Finally, the n’s are often needlessly small. Computational cognitive models demand high-quality measures of the behaviors they’re meant to model.
- Once the proposed model has been shown to be better than the baseline, it is reified far beyond what the evidence suggests. Computational cognitive modeling can at most show that certain explicit assumptions are consistent with the observed data; it cannot establish much beyond that.
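As one illustration of what a formal comparison might look like, here is a minimal sketch using the Akaike information criterion (AIC), which penalizes a model's log-likelihood by its number of free parameters. AIC is only one of several possible model comparison methods, and the log-likelihoods and parameter counts below are hypothetical values made up for the example, not results from any study.

```python
def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical fits of two models to the same data set; these numbers
# are invented purely for illustration.
baseline_aic = aic(log_likelihood=-520.0, num_params=3)
proposed_aic = aic(log_likelihood=-512.5, num_params=5)

# A positive difference favors the proposed model, even after it is
# penalized for its two additional parameters.
delta = baseline_aic - proposed_aic
print(baseline_aic, proposed_aic, delta)
```

The point is simply that the comparison yields a number that accounts for model complexity, rather than an impression that one curve looks closer to the data than another.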
The statistician Andrew Gelman writes that scientific discourse sometimes proceeds as if earlier published work has a greater claim to truth than later research that is critical of the original findings (which may or may not be published yet).5 Critical interpretation of this older computational work is increasingly called for, as our methodological standards continue to mature. I find reviewers (and literature-reviewers) overly deferential to prior work of dubious quality simply because of its priority.
Endnotes
- An under-appreciated element of this process is that it is simply easier to do linguistically-relevant things with computers than it was 20 years prior. For this, one should thank Python and R, NumPy and Scikit-learn, and of course tools like Praat and Parselmouth.
- I happen to think college education should not be merely vocational training.
- I happen to think most of these questions can be answered with a cheap laptop, and only a few require a CUDA-enabled GPU.
- I suspect this is mostly a response to the rapidly casualizing academy. Unfortunately, any question about whether we should be doing X in linguistics is misinterpreted as a question about whether people who do X deserve to have a job. This is a presupposition failure for me: I believe everyone deserves meaningful work, and that academic tenure is a model of labor relations that should be expanded beyond the academy.
- To free ourselves of this bias, Gelman proposes what he calls the time-reversal heuristic, in which one imagines the temporal order reversed (e.g., that the later failed replication is now the first published result on the matter) and then re-evaluates the evidence. Similar thinking is called for when interacting with older computational work.
When rule directionality does and does not matter
At the Graduate Center we recently hosted an excellent lecture by Jane Chandlee of Haverford College. Those familiar with her work may know that she’s been studying, for some time now, two classes of string-to-string functions called the input strictly local (ISL) and output strictly local (OSL) functions. These are generalizations of the familiar notion of the strictly local (SL) languages proposed by McNaughton and Papert (1971) many years ago. For definitions of ISL and OSL functions, see Chandlee et al. 2014 and Chandlee 2014. Chandlee and colleagues have been arguing, for some time now, that virtually all phonological processes are ISL, OSL, or both (note that their intersection is non-null).
In her talk, Chandlee attempted to formalize the notions of iterativity and non-iterativity in phonology with reference to ISL and OSL functions. One interesting side effect of this work is that one can, quite easily, determine what makes a phonological process direction-invariant or direction-specific. In FSTP (Gorman & Sproat 2021:§5.1.1) we describe three notions of rule directionality (ones which are quite a bit less general than Chandlee’s notions) from the literature, but conclude: “Note, however, that directionality of application has no discernable effect for perhaps the majority of rules, and can often be ignored.” (op. cit., 53) We didn’t bother to determine when this is the case, but Chandlee shows that the set of rules which are invariant to direction of application (in our sense) are exactly those which are ISL ∩ OSL; that is, they describe processes which are both ISL and OSL, in the sense that they are string-to-string functions (or maps, to use her term) which can be encoded either as ISL or OSL.
As Richard Sproat (p.c.) points out to me, there are weaker notions of direction-invariance we may care about in the context of grammar engineering. For instance, it might be the case that some rule is, strictly speaking, direction-specific, but the language of input strings is not expected to contain any relevant examples. I suspect this is also quite common.
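The difference between direction-specific and direction-invariant rules can be seen in a toy Python sketch. It is not an implementation of Chandlee's ISL and OSL definitions, nor of the rule compiler described in FSTP; the function `apply_rule` and the two rules are hypothetical. It applies a rule that rewrites a target symbol when it is immediately preceded by a context symbol, either simultaneously or one position at a time, from left to right or from right to left.

```python
def apply_rule(s, target, replacement, left_context, direction):
    """Rewrites target as replacement when preceded by left_context."""
    if direction == "simultaneous":
        # Contexts are checked against the input string only.
        return "".join(
            replacement
            if ch == target and i > 0 and s[i - 1] == left_context
            else ch
            for i, ch in enumerate(s)
        )
    chars = list(s)
    if direction == "ltr":
        indices = range(len(chars))
    else:  # "rtl"
        indices = range(len(chars) - 1, -1, -1)
    for i in indices:
        # Contexts are checked against the partially rewritten string.
        if chars[i] == target and i > 0 and chars[i - 1] == left_context:
            chars[i] = replacement
    return "".join(chars)


directions = ("simultaneous", "ltr", "rtl")

# a -> b / b __ : the output of the rule can create new contexts for it,
# so left-to-right application spreads and the other modes do not.
spreading = {d: apply_rule("baaa", "a", "b", "b", d) for d in directions}

# a -> b / c __ : the rule never creates or destroys its own context,
# so all three modes of application agree.
invariant = {d: apply_rule("cacaa", "a", "b", "c", d) for d in directions}

print(spreading)
print(invariant)
```

The last sentence of the paragraph above corresponds to the case where a rule like the first one is only ever applied to inputs, such as "ba", on which all three modes happen to agree.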
References
Chandlee, J. 2014. Strictly local phonological processes. Doctoral dissertation, University of Delaware.
Chandlee, J., Eyraud, R., and Heinz, J. 2014. Learning strictly local subsequential functions. Transactions of the Association for Computational Linguistics 2: 491-503.
Gorman, K., and Sproat, R. 2021. Finite-State Text Processing. Morgan & Claypool.
McNaughton, R., and Papert, S. A. 1971. Counter-Free Automata. MIT Press.
A* shortest string decoding for non-idempotent semirings
I recently completed some work, in collaboration with Google’s Cyril Allauzen, on a new algorithm for computing the shortest string through a weighted finite-state automaton. For so-called path semirings, the shortest string is given by the shortest path, but up until now, there was no general-purpose algorithm for computing the shortest string over non-idempotent semirings (like the log or probability semiring). Such an algorithm would make it much easier to decode with interpolated language models or elaborate channel models in a noisy-channel formalism. In this preprint, we propose such an algorithm using A* search and lazy (“on-the-fly”) determinization, and prove that it is correct. The algorithm in question is implemented in my OpenGrm-BaumWelch library by the baumwelchdecode command-line tool.
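To see why the shortest path does not suffice in a non-idempotent semiring, consider the following toy Python sketch. It is not the A* algorithm from the preprint; it simply enumerates the paths of a tiny, hypothetical automaton by hand, using weights in the probability semiring, where the weights of multiple paths labeled with the same string are added together.

```python
from collections import defaultdict

# Each pair is (string labeling the path, path probability). The string
# "ab" labels two distinct paths; these values are made up for the example.
paths = [("ab", 0.3), ("ab", 0.3), ("ac", 0.4)]

# The single best path is labeled "ac"...
best_path_string = max(paths, key=lambda path: path[1])[0]

# ...but summing over all paths with the same label, "ab" is the best string.
totals = defaultdict(float)
for string, prob in paths:
    totals[string] += prob
best_string = max(totals, key=totals.get)

print(best_path_string, best_string)
```

Brute-force enumeration like this is impossible for cyclic automata and impractical for large ones, since the number of paths can grow exponentially.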
WFST talk
I have posted a lightly-revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed-yet description of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms, the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.
Dutch names in LaTeX
One thing I recently figured out is a sensible way to handle Dutch names (i.e., those that begin with den, van or similar particles). Traditionally, these particles are part of the cited name in author-date citations (e.g., den Dikken 2003, van Oostendorp 2009) but are ignored when alphabetizing (thus, van Oostendorp is alphabetized under O, alongside Orgun & Sprouse and Otheguy, not under V between Vago and Vaux). This is not something handled automatically by tools like LaTeX and BibTeX, but it is relatively easy to annotate name particles like this so that they do the right thing.
First, place, at the top of your BibTeX file, the following:
@preamble{{\providecommand{\noopsort}[1]{}}}
Then, in the individual BibTeX entries, wrap the author field with this command like so:
author = {{\noopsort{Dikken}{den Dikken}}, Marcel},
This preserves the correct in-text author-date citations, but also gives the intended alphabetization in the bibliography.
Note of course that not all people with van (etc.) names in the Anglosphere treat the van as if it were a particle to be ignored; a few deliberately alphabetize their last name as if it begins with v.
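The intended ordering can be illustrated with a short Python sketch. This is not how BibTeX implements sorting; the particle list and the `sort_key` function are hypothetical, and only show the effect of ignoring a leading particle when alphabetizing.

```python
# Illustrative list of particles; it is not meant to be exhaustive.
PARTICLES = ("den ", "van ")


def sort_key(surname: str) -> str:
    """Returns a lowercased key with any leading particle stripped."""
    lowered = surname.lower()
    for particle in PARTICLES:
        if lowered.startswith(particle):
            return lowered[len(particle):]
    return lowered


surnames = ["Vaux", "van Oostendorp", "Otheguy", "den Dikken", "Vago", "Orgun"]
ordered = sorted(surnames, key=sort_key)
print(ordered)
```

Authors who do want their name alphabetized under v would simply be left out of this treatment, just as one would leave the \noopsort annotation off their BibTeX entries.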
X moment
A Reddit moment is an expression used to refer to a certain type of cringe ‘cringeworthy behavior or content’ judged characteristic of Redditors, habitual users of the forum website reddit.com. It seems hard to pin down what makes cringe Redditor-like, but discussion on Urban Dictionary suggests that one salient feature is a belief in one’s superiority, or the superiority of Redditors in general; a related feature is irl behavior that takes Reddit too seriously. The normal usage is as an interjection of sorts; presented with cringeworthy internet content (a screenshot or URL), one might simply respond “Reddit moment”.
However, Reddit isn’t the only community that can have a similar type of pejorative X moment. One can find many instances of crackhead moment, describing unpredictable or spazzy behavior. A more complicated example comes from a friend, who shared a link about a software developer who deliberately sabotaged a widely used JavaScript software library to protest the Russian invasion of Ukraine. JavaScript, and the Node.js community in particular, has been extremely vulnerable to both deliberate sabotage and accidental bricking ‘irreversible destruction of technology’, and naturally my friend sent the link with the commentary “js moment”. The one thing that seems to unite all X moment snowclones is a shared negative evaluation of the community in the common ground.
Evaluations from the past
In a literature review, speech and language processing specialists often feel tempted to report evaluation metrics like accuracy, F-score, or word error rate for systems described in the literature review. In my opinion, this is only informative if the prior and present work use the exact same data set(s) for evaluations. (Such results should probably be presented in a table along with results from the present work, not in the body of the literature review.) If instead, they were tested on some proprietary data set, an obsolete corpus, or a data set the authors of the present work have declined to evaluate on, this information is inactionable. Authors should omit this information, and reviewers and editors should insist that it be omitted.
It is also clear to me that these numbers are rarely meaningful as measures of how difficult a task is “generally”. To take an example from an unnamed 2019 NAACL paper (one guilty of the sin described above), word error rates on a single task in a single language range between 9.1% and 23.61% (note also the mixed precision). What could we possibly reason from this enormous spread of results across different data sets?
Country (dead)naming
Current events reminded me of an ongoing Discourse about how we ought to refer to the country Ukraine in English. William Taylor, US ambassador to the country under George W. Bush, is quoted on the subject in this Time magazine piece (“Ukraine, Not the Ukraine: The Significance of Three Little Letters”, March 5th, 2014; emphasis mine), which is circulating again today:
The Ukraine is the way the Russians referred to that part of the country during Soviet times … Now that it is a country, a nation, and a recognized state, it is just Ukraine.
Apparently they don’t fact-check claims like this, because this is utter nonsense. Russian doesn’t have definite articles, i.e., words like the. There is simply no straightforward way to express the contrast between the Ukraine and Ukraine in Russian (or in Ukrainian for that matter).
Now, it’s true that the before Ukraine has long been proscribed in English, but this seems to be more a matter of style—the the variant sounds archaic to my ear—than ideology. And, in Russian, there is variation between в Украине and на Украине, both of which I would translate as ‘in Ukraine’. My understanding is that both have been attested for centuries, but one (на) was more widely used during the Soviet era and thus the other (в) is thought to emphasize the country’s sovereignty in the modern era. As I understand it, that one preposition is indexical of Ukrainian nationalist sentiment and another is indexical of Russian revanchist-nationalist sentiment is more or less linguistically arbitrary in the Saussurean sense. Or, more weakly, the connotative differences between the two prepositions are subtle and don’t map cleanly onto the relevant ideologies. But I am not a native (or even competent) speaker of Russian so you should not take my word for it.
Taylor, in the Time article, continues to argue that US media should use the Ukrainian-style transliteration Kyiv instead of the Russian-style transliteration Kiev. This is a more interesting prescription, at least in that the linguistic claim (that Kyiv is the standard Ukrainian transliteration and Kiev is the standard Russian transliteration) is certainly true. However, it probably should be noted that dozens of other cities and countries in non-Anglophone Europe are known by their English exonyms, and no one seems to be demanding that Americans start referring to Wien [viːn] ‘Vienna’ or Moskva ‘Moscow’. In other words, Taylor’s prescription is a political exercise rather than a matter of grammatical correctness. (One can’t help but notice that Taylor is a retired neoconservative diplomat pleading for “political correctness”.)