On “from scratch”

For a variety of historical and sociocultural reasons, nearly all natural language processing (NLP) research involves processing of text, i.e., written documents (Gorman & Sproat 2022). Furthermore, most speech processing research uses written text either as input or output.

A great deal of speech or language processing treads words (however they are understood) as atomic, indivisible units rather than the “intricately structured objects linguists have long recognized them to be” (Gorman in press). But there has been a recent trend to instead work with individual Unicode codepoints, or even the individual bytes of a Unicode string encoded in UTF-8. When such systems are part of an “end-to-end” neural network, these systems are sometimes said to be “from scratch”; see, e.g., Gillick et al. 2016 and Li et al. 2019, who both use this exact phrase to describe their contributions. There is an implication that such systems, by bypassing the fraught notion of word, have somehow eliminated the need for linguistic insight altogether.

The expression “from scratch” makes an analogy to baking: it is as if we are making angel food cake by sifting flour, superfine sugar, and cream of tartar, rather than using the “just add water and egg whites” mixes from Betty Crocker. But this analogy understates just how much linguistic knowledge can be baked in (or perhaps “sifted in”) to writing systems. Writing systems are essentially a type of linguistic analysis (Sproat 2010), and like any language technology, they necessarily reify the analysis that underlies them.¹ The linguistic analysis underlying a writing system may be quite naïve but may also encode sophisticated phonemic and/or morphemic insights. Thus written text, whether expressed as Unicode codepoints or UTF-8 bytes, may have quite a bit of linguistic knowledge sifted and folded in.

A familiar and well-known example of this kind of knowledge comes from English (Gorman in press). In this language, changes in vowel quality triggered by the addition of “level 1” suffixes like -ity are generally not indicated in written form. Thus sane [seɪn] and sanity [sæ.nɪ.ti], for instance, are spelled more similarly than they are pronounced (Chomsky and Halle 1968: 44f.), meaning that this vowel change need not be modeled when working with written text.

Endnotes

The Sumerian and Egyptian scribes were thus history’s first linguists, and history’s first language technologists.

References

Chomsky, N., and Halle, M. 1968. Sound Pattern of English. Harper & Row.
Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. 2016. Multilingual language processing from bytes. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1296-1306.
Gorman, K.. In press. Computational morphology. In Aronoff, M. and Fudeman, K., What is Morphology? 3rd edition. Blackwell.
Gorman, K., and Sproat, R. 2022. The persistent conflation of writing and language. Paper presented at Grapholinguistics in the 21st Century.
Li, B., Zhang, Y., Sainath, T., Wu, Y., and Chan, W. 2019. Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5621-5625.
Sproat, R. 2010. Language, Technology, and Society. Oxford University Press.

The computational revolution in linguistics

(Throughout this post, I have taken pains not to name any names. The beauty of subtweeting and other forms of subposting is that nobody knows for sure you’re the person being discussed unless you volunteer yourself. So, don’t.)

One of the more salient developments in linguistics as a discipline over the last two decades is the way in which computational knowledge has diffused into the field.¹ 20 years ago, there were but a handful of linguistics professors in North America who could perform elaborate corpus analyses, apply machine learning and statistical analysis, or extract acoustic measurements from an audio file. And, while it was in some ways quite robust, speech and language processing at the turn of the last century simply did not hold the same importance it does nowadays.

While some professors—including, to their credit, many of my mentors and colleagues—can be commended for having “skilled up” in the intervening years, this knowledge has, I am sad to say, mostly advanced one death (and subsequent tenure line renewal) at a time. This has negative consequences for linguistics students who want to train for or pivot to a career in the tech sector, since there are professors who were, in their time, computationally sophisticated, but lack the skills a rising computational linguist is expected to have mastered. In an era of contracting tenure rolls and other forms of casualization in the academy, this has the risk of pushing out legitimate, albeit staid, lines of linguistic inquiry in favor of areas favored by capitalists.²

Yet I believe that this upskilling has a lot to contribute to linguistics as a discipline. There are many core questions about language use, acquisition, variation, and change which are best answered with a computational simulation that forces us to be explicit about our assumptions, or a corpus study that tells us what people really said, or a statistical analysis that tells us whether our correlations are likely to be meaningful, or even a machine learning system that helps us rapidly label linguistic data.³ It is a boon to our field that linguists of any age can employ these tools when appropriate.

This is not to say that the transition has not been occasionally ugly. First, there are the occasional nasty turf wars over who exactly is a linguist.⁴ Secondly, the standards of quality for work in this area must be negotiated and imposed. While a syntax paper in NL&LT from even 30 years ago are easily readable today, the computational methods of even widely-praised paper from 15 or 20 years ago are, frankly, often quite sloppy. I have found it necessary to explain this to students who want to interact with this older work lest they lower their own methodological standards.

I discern at least a few common sloppy habits in this older computational work, focusing for the moment on computational cognitive models of linguistic behavior.

If a proposed computational model is compared to some “baseline” or older model, this older model is usually an ancient associationist model from psychology. This older model naturally lacks much of the rich linguistic specifications of the proposed model, and naturally it fails to model the data. Deliberately picking a bad baseline is putting one’s finger on the scale.
Comparison of different computational models is usually informal. One should instead use statistical model comparison methods.
The dependent variable for modeling is often derived from poorly-designed human subjects experiments. The subjects in these experiments may be instructed to perform a task they are unlikely to be able to do consciously (i.e., the tasks are cognitively impenetrable). Unjustified assumptions about appropriate scales of measurement may have been made. Finally, the n‘s are often needlessly small. Computational cognitive models demand high-quality measures of the behaviors they’re meant to model.
Once the proposed model has been shown better than the baseline, it is reified far beyond what the evidence suggests. Computational cognitive modeling can at most show that certain explicit assumptions are consistent with the observed data: they cannot establish much beyond that.

The statistician Andrew Gelman writes that scientific discourse sometimes proceeds as if earlier published work has additional claim to truth than later research that is critical of the original findings (which may or may not be published yet).⁵ Critical interpretation of this older computational work is increasingly called for, as our methodological standards continue to mature. I find reviewers (and literature-reviewers) overly deferential to prior work of dubious quality simply because of its priority.

Endnotes

An under-appreciated element to this process is that it is is simply easier to do linguistically-relevant things with computers than it was 20 years prior. For this, one should thank Python and R, NumPy and Scikit-learn, and of course tools like Praat and Parselmouth.
I happen to think college education should not be merely vocational training.
I happen to think most of these questions can be answered with a cheap laptop, and only a few require a CUDA-enabled GPU.
I suspect this is mostly a response to the rapidly casualizing academy. Unfortunately, any question about whether we should be doing X in linguistics is misinterpreted as a question about whether people who do X deserve to have a job. This is a presupposition failure for me: I believe everyone deserves meaningful work, and that academic tenure is a model of labor relations that should be expanded beyond the academy.
To free ourselves of this bias, Gelman proposes what he calls the time-reversal heuristic, in which one imagines the temporal order reversed (e.g., that the later failed replication is now the first published result on the matter) and then re-evaluates the evidence. When interacting with older computational work, similar thinking is called for here.

When rule directionality does and does not matter

At the Graduate Center we recently hosted an excellent lecture by Jane Chandlee of Haverford College. Those familiar with her work may know that she’s been studying, for some time now, two classes of string-to-string functions called the input strictly local (ISL) and output strictly local (OSL) functions. These are generalizations of the familiar notion of the strictly local (SL) languages proposed by McNaughton and Papert (1971) many years ago. For definitions of ISL and OSL functions, see Chandlee et al. 2014 and Chandlee 2014. Chandlee and colleagues have been arguing, for some time now, that virtually all phonological processes are ISL, OSL, or both (note that their intersection is non-null).

In her talk, Chandlee attempted to formalize the notions of iterativity and non-iterativity in phonology with reference to ISL and OSL functions. One interesting side effect of this work is that one can, quite easily, determine what makes a phonological process direction-invariant or direction-specific. In FSTP (Gorman & Sproat 2021:§5.1.1) we describe three notions of rule directionality (ones which are quite a bit less general than Chandlee’s notions) from the literature, but conclude: “Note, however, that directionality of application has no discernable effect for perhaps the majority of rules, and can often be ignored.” (op. cit., 53) We didn’t bother to determine when this is the case, but Chandlee shows that the set of rules which are invariant to direction of application (in our sense) are exactly those which are ISL ∩ OSL; that is, they describe processes which are both ISL and OSL, in the sense that they are string-to-string functions (or maps, to use her term) which can be encoded either as ISL or OSL.

As Richard Sproat (p.c.) points out to me, there are weaker notions of direction-invariance we may care about in the context of grammar engineering. For instance, it might be the case that some rule is, strictly speaking, direction-specific, but the language of input strings is not expected to contain any relevant examples. I suspect this is quite common also.

References

Chandlee, J. 2014. Strictly local phonological processes. Doctoral dissertation, University of Delaware.
Chandlee, J., Eyraud, R., and Heinz, J. 2014. Learning strictly local subsequential functions. Transactions of the Association for Computational Linguistics 2: 491-503.
Gorman, K., and Sproat, R. 2021. Finite-State Text Processing. Morgan & Claypool.
McNaughton, R., and Papert, S. A. 1971. Counter-Free Automata. MIT Press.

A* shortest string decoding for non-idempotent semirings

I recently completed some work, in collaboration with Google’s Cyril Allauzen, on a new algorithm for computing the shortest string through weighted finite-state automaton. For so-called path semirings, the shortest string is given by the shortest path, but up until now, there was no general-purpose algorithm for computing the shortest string over non-idempotent semirings (like the log or probability semiring). Such an algorithm would make it much easier to decode with interpolated language models or elaborate channel models in a noisy-channel formalism. In this preprint, we propose such an algorithm using A* search and lazy (“on-the-fly”) determinization, and prove that it is correct. The algorithm in question is implemented in my OpenGrm-BaumWelch library by the baumwelchdecode command-line tool.

WFST talk

I have posted a lightly-revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed-yet description of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms, the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.

Dutch names in LaTeX

One thing I recently figured out is a sensible way to handle Dutch names (i.e., those that begin with den, van or similar particles. Traditionally, these particles are part of the cited name in author-date citations (e.g., den Dikken 2003, van Oostendorp 2009) but are ignored when alphabetizing (thus, van Oostendorp is alphabetized between Orgun & Sprouse and Otheguy, not between Vago and Vaux). This is not something handled automatically by tools like LaTeX and BibTeX, but it is relatively easy to annotate name particles like this so that they do the right thing.

First, place, at the top of your BibTeX file, the following:

@preamble{{\providecommand{\noopsort}[1]{}}}

Then, in the individual BibTeX entries, wrap the author field with this command like so:

 author = {{\noopsort{Dikken}{den Dikken}}, Marcel},

This preserves the correct in-text author-date citations, but also gives the intended alphabetization in the bibliography.

Note of course that not all people with van (etc.) names in the Anglosphere treat the van as if it were a particle to be ignored; a few deliberately alphabetize their last name as if it begins with v.

X moment

A Reddit moment is an expression used to refer to a certain type of cringe ‘cringeworthy behavior or content’ judged characteristic of Redditors, habitual users of the forum website reddit.com. It seems hard to pin down what makes cringe Redditor-like, but discussion on Urban Dictionary suggests that one salient feature is a belief in one’s superiority, or the superiority of Redditors in general; a related feature is irl behavior that takes Reddit too seriously. The normal usage is as an interjection of sorts; presented with cringeworthy internet content (a screenshot or URL), one might simply respond “Reddit moment”.

However, Reddit isn’t the only community that can have a similar type of pejorative X moment. One can find many instances of crackhead moment, describing unpredictable or spazzy behavior. A more complicated example comes from a friend, who shared a link about a software developer who deliberately sabotaged a widely used JavaScript software library to protest the Russian invasion of Ukraine. JavaScript, and the Node.js community in particular, has been extremely vulnerable to both deliberate sabotage and accidental bricking ‘irreversible destruction of technology’, and naturally my friend sent the link with the commentary “js moment”. The one thing that seems to unite all X moment snowclones is a shared negative evaluation of the community in the common ground.

Evaluations from the past

In a literature review, speech and language processing specialists often feel tempted to report evaluation metrics like accuracy, F-score, or word error rate for systems described in the literature review. In my opinion, this is only informative if the prior and present work use the exact same data set(s) for evaluations. (Such results should probably be presented in a table along with results from the present work, not in the body of the literature review.) If instead, they were tested on some proprietary data set, an obsolete corpus, or a data set the authors of the present work have declined to evaluate on, this information is inactionable. Authors should omit this information, and reviewers and editors should insist that it be omitted.

It is also clear to me that these numbers are rarely meaningful as measures of how difficult a task is “generally”. To take an example from an unnamed 2019 NAACL paper (one guilty of the sin described above), word error rates on a single task in a single language range between 9.1% and 23.61% (note also the mixed precision). What could we possibly reason from this enormous spread of results across different data sets?

Country (dead)naming

Current events reminded me of an ongoing Discourse about how we ought to refer to the country Ukraine in English. William Taylor, US ambassador to the country under George W. Bush, is quoted on the subject in this Time magazine piece (“Ukraine, Not the Ukraine: The Significance of Three Little Letters”, March 5th, 2014; emphasis mine), which is circulating again today:

The Ukraine is the way the Russians referred to that part of the country during Soviet times … Now that it is a country, a nation, and a recognized state, it is just Ukraine.

Apparently they don’t fact-check claims like this, because this is utter nonsense. Russian doesn’t have definite articles, i.e., words like the. There is simply no straightforward way to express the contrast between the Ukraine and Ukraine in Russian (or in Ukrainian for that matter).

Now, it’s true that the before Ukraine has long been proscribed in English, but this seems to be more a matter of style—the the variant sounds archaic to my ear—than ideology. And, in Russian, there is variation between в Украине and на Украине, both of which I would translate as ‘in Ukraine’. My understanding is that both have been attested for centuries, but one (на) was more widely used during the Soviet era and thus the other (в) is thought to emphasize the country’s sovereignty in the modern era. As I understand it, that one preposition is indexical of Ukrainian nationalist sentiment and another is indexical of Russian revanchist-nationalist sentiment is more or less linguistically arbitrary in the Saussurean sense. Or, more weakly, the connotative differences between the two prepositions are subtle and don’t map cleanly onto the relevant ideologies. But I am not a native (or even competent) speaker of Russian so you should not take my word for it.

Taylor, in the Time article, continues to argue that US media should use the Ukrainian-style transliteration Kyiv instead of the Russian-style transliteration Kiev. This is a more interesting prescription, at least in that the linguistic claim—that Kyiv is the standard Ukrainian transliteration and Kiev is the standard Russian transliteration—is certainly true. However, it probably should be noted that dozens of other cities and countries in non-Anglophone Europe are known by their English exonyms, and no one seems to be demanding that Americans start referring to Wien [viːn] ‘Vienna’ or Moskva ‘Moscow’. In other words Taylor’s prescription is a political exercise rather than a matter of grammatical correctness. (One can’t help but notice that Taylor is a retired neoconservative diplomat pleading for “political correctness”.)

On conspiracies

Kisseberth (1970) introduces the notion of conspiracies, cases in which a series of phonological rules in a single language “conspire” to create similar output configurations. Supposedly, Haj Ross chose the term “conspiracy”, and it is perhaps not an accident that the term he chose immediately reminds one of conspiracy theory, which has a strong negative connotation implying that the existence of the conspiracy cannot be proven. Kisseberth’s discovery of conspiracies motivated the rise of Optimality Theory (OT) two decades later—Prince & Smolensky (1993:1) refer to conspiracies as a “conceptual crisis” at the heart of phonological theory, and Zuraw (2003) explicitly links Kisseberth’s data to OT—but curiously, it seemingly had little effect on contemporary phonological theorizing. (A positivist might say that the theoretical technology needed to encode conspiratorial thinking simply did not exist at the time; a cynic might say that contemporaries did not take Kisseberth’s conspiratorial thinking seriously until it became easy to do so.) I discern two major objections to the logic of conspiracies: the evolutionary argument and the prosodic argument, which I’ll briefly review.

The evolutionary argument

What I am calling the evolutionary argument was first made by Kiparsky (1973:75f.) and is presented as an argument against OT by Hale & Reiss (2008:14). Roughly, if a series of rules lead to the same set of output configurations, they must be surface true, or they would not contribute to the putative conspiracy. Since surface-true rules are assumed to be easy to learn, especially relative to opaque rules are assumed to be difficult to learn, and since failure to learn rules would contribute to language change, grammars will naturally accumulate functionally related surface-true rules. I think we should question the assumption (au courant in 1973) that opacity is the end-all of what makes a rule difficult to acquire, but otherwise I find this basic logic sound.

The prosodic argument

At the time Kisseberth was writing, standard phonological theory included few of the prosodic primitives; even the notion of syllable was considered dubious. Subsequent revisions of the theory have introduced rich hierarchies of prosodic primitives. In particular, a subsequent generation of phonologists hypothesized that speakers “build” or “parse” sequences of segments into onsets and rimes, syllables, and feet, with repairs like stray erasure, i.e., deletion, of unsyllabified segmental or epenthesis used to resolve conflicts (McCarthy 1979, Steriade 1982, Itô 1986). It seems to me that this approach accounts for most of the facts of Yowlumne (formerly Yawelmani) reviewed by Kisseberth in his study:

there are no word-initial CC clusters
there are no word-final CC clusters
derived CCCs are resolved either by deletion or i-epenthesis
there are no CCC clusters in underlying form

The relevant observation that links all these facts is simply that Yowlumne does not permit branching onsets or codas, but more specifically, Yowlumne’s syllable-parsing algorithm does not build branching onsets or codas. This immediately accounts for facts #1-2. Assuming the logic of the McCarthy and contemporaries, #3 is also unsurprising: these clusters simply cannot be realized faithfully; the fact that there are multiple resolutions for the *CCC pathology is besides the point. And finally, adopting the logic that Prince & Smolensky (1993:54) were later to call Stampean occultation, the absence of underlying CCC clusters follows from the inability of them to surface, since the generalizations in question are all surface-true. (Here, we are treading closely to Kiparsky’s thoughts on the matter too.) Crucially, the analysis given above does not reify any surface constraints; the facts all follow from the feed-forward derivational structure of prosodically-informed phonological theory current a decade before Prince & Smolensky.

Conclusion

While Prince & Smolensky are right to say that OT provides a principled solution to Kisseberth’s notion of conspiracies, researchers in the ’70s and ’80s treated Kisseberth’s notion as epiphenomena of acquisition (Kiparsky) or prosodic structure-building (McCarthy and contemporaries). Perhaps, then, OT do not deserve credit for solving an unsolved problem in this regard. Of course, it remains to be seen whether the many implicit conjectures in these two objections can be sustained.

References

Hale, M. and Reiss, C. 2008. The Phonological Enterprise. Oxford University Press.
Kiparsky, P. 1973. Phonological representations. In O. Fujimura (ed.), Three Dimensions of Linguistic Theory, pages 1-135. TEC Corporation.
Kisseberth, C. W. 1970. On the functional unity of phonological rules. Linguistic Inquiry 1(3): 291-306.
Itô, J. 1986. Syllable theory in prosodic phonology. Doctoral dissertation, University of Massachusetts, Amherst. Published by Garland Publishers, 1988.
McCarthy, J. 1979. Formal problems in Semitic phonology and morphology. Doctoral dissertation, MIT. Published by Garland Publishers, 1985.
Prince, A., and Smolensky, P. 1993. Optimality Theory: constraint interaction in generative grammar. Rutgers Center for Cognitive Science Technical Report TR-2.
Steriade, D. 1982. Greek prosodies and the Nature of syllabification. Doctoral dissertation, MIT.
Zuraw, K. 2003. Optimality Theory in linguistics. In M. Arbib (ed.), Handbook of Brain Theory and Neural Networks, pages 819-822. 2nd edition. MIT Press.