For a variety of historical and sociocultural reasons, nearly all natural language processing (NLP) research involves processing of text, i.e., written documents (Gorman & Sproat 2022). Furthermore, most speech processing research uses written text either as input or output.
A great deal of speech and language processing treats words (however they are understood) as atomic, indivisible units rather than the “intricately structured objects linguists have long recognized them to be” (Gorman in press). But there has been a recent trend to instead work with individual Unicode codepoints, or even the individual bytes of a Unicode string encoded in UTF-8. When such systems are part of an “end-to-end” neural network, they are sometimes said to work “from scratch”; see, e.g., Gillick et al. 2016 and Li et al. 2019, who both use this exact phrase to describe their contributions. The implication is that such systems, by bypassing the fraught notion of the word, have somehow eliminated the need for linguistic insight altogether.
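The two sub-word views are not interchangeable: the same written word yields different unit inventories depending on whether it is seen as codepoints or as UTF-8 bytes. A minimal Python sketch (the word *naïve* is chosen purely for illustration):

```python
# The same string at two granularities: Unicode codepoints vs. UTF-8 bytes.
word = "naïve"

codepoints = list(word)
print(codepoints)        # ['n', 'a', 'ï', 'v', 'e'] -- 5 units

utf8_bytes = word.encode("utf-8")
print(list(utf8_bytes))  # [110, 97, 195, 175, 118, 101] -- 6 units;
                         # 'ï' (U+00EF) becomes the two bytes 0xC3 0xAF
```

Any non-ASCII character thus inflates the byte sequence relative to the codepoint sequence, which is one reason byte-level and codepoint-level models see rather different inputs for the same text.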
The expression “from scratch” makes an analogy to baking: it is as if we are making angel food cake by sifting flour, superfine sugar, and cream of tartar, rather than using the “just add water and egg whites” mixes from Betty Crocker. But this analogy understates just how much linguistic knowledge can be baked (or perhaps “sifted”) into writing systems. Writing systems are essentially a type of linguistic analysis (Sproat 2010), and like any language technology, they necessarily reify the analysis that underlies them.1 The linguistic analysis underlying a writing system may be quite naïve but may also encode sophisticated phonemic and/or morphemic insights. Thus written text, whether expressed as Unicode codepoints or UTF-8 bytes, may have quite a bit of linguistic knowledge sifted and folded in.
A familiar example of this kind of knowledge comes from English (Gorman in press). In this language, changes in vowel quality triggered by the addition of “level 1” suffixes like -ity are generally not indicated in written form. Thus sane [seɪn] and sanity [sæ.nɪ.ti], for instance, are spelled more similarly than they are pronounced (Chomsky and Halle 1968: 44f.), meaning that this vowel change need not be modeled when working with written text.
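The spelling/pronunciation asymmetry can be made concrete by comparing shared prefixes of the written forms and their (broad IPA) transcriptions; a small illustrative Python check, using `os.path.commonprefix` simply as a longest-common-prefix utility over strings:

```python
import os.path

# The spellings share a longer prefix than the pronunciations do:
# the written forms agree on "san", while the vowel change ([eɪ] vs. [æ])
# splits the transcriptions after the first segment.
spelled = ["sane", "sanity"]
pronounced = ["seɪn", "sænɪti"]

print(os.path.commonprefix(spelled))     # 'san'
print(os.path.commonprefix(pronounced))  # 's'
```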
Endnotes
- The Sumerian and Egyptian scribes were thus history’s first linguists, and history’s first language technologists.
References
Chomsky, N., and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. 2016. Multilingual language processing from bytes. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1296-1306.
Gorman, K. In press. Computational morphology. In Aronoff, M. and Fudeman, K., What is Morphology? 3rd edition. Blackwell.
Gorman, K., and Sproat, R. 2022. The persistent conflation of writing and language. Paper presented at Grapholinguistics in the 21st Century.
Li, B., Zhang, Y., Sainath, T., Wu, Y., and Chan, W. 2019. Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5621-5625.
Sproat, R. 2010. Language, Technology, and Society. Oxford University Press.
Comments

Ah, but flour is also not a naïve ingredient 🙂 Which grain is it from? Which parts of the seeds were milled? How finely? Cakes made from white wheat flour vs. whole spelt might be argued to differ more than Russian written in Cyrillic vs. transliterated into Latin characters; not to mention the different kinds of sugar, etc.
Nitpicky, sure, but maybe our notions of “scratch” are relativized to begin with.
I don’t disagree: some writing systems contain far more information than others, and the size of the units may matter too. For instance, in our g2p (grapheme-to-phoneme) work, it pays off enormously to simply run Hangul inputs through a deterministic romanizer.
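The deterministic step is possible because precomposed Hangul syllables (U+AC00-U+D7A3) decompose into their component jamo by fixed Unicode arithmetic, which NFD normalization performs; romanization is then a finite table lookup over jamo. A minimal sketch (the jamo-to-Latin table here is a tiny illustrative fragment, not a full romanizer):

```python
import unicodedata

def to_jamo(text: str) -> str:
    # NFD deterministically splits each precomposed Hangul syllable into
    # its leading consonant, vowel, and optional final consonant (coda).
    return unicodedata.normalize("NFD", text)

# Illustrative fragment of a jamo-to-Latin mapping (Revised Romanization);
# a real romanizer would cover the full jamo inventory.
JAMO_TO_LATIN = {
    "\u1112": "h", "\u1161": "a",  "\u11AB": "n",  # ᄒ, ᅡ, ᆫ
    "\u1100": "g", "\u1173": "eu", "\u11AF": "l",  # ᄀ, ᅳ, ᆯ
}

def romanize(text: str) -> str:
    return "".join(JAMO_TO_LATIN.get(j, j) for j in to_jamo(text))

print(romanize("한글"))  # hangeul
```

Because the syllable-to-jamo step is purely arithmetic, no language model or training data is needed for it; all the linguistic analysis was done centuries ago by the designers of the script.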