Is NLP stuck?

I can’t help but feel that NLP is once again stuck.

From about 2011 to 2019, I can identify a huge step forward just about every year. But the last thing that truly excited me is BERT, which came out in 2018 and was published in 2019. For those not in the know, the idea of BERT is to pre-train a gigantic language model, with either monolingual or multilingual data. The major pre-training task is masked language model prediction: we pretend some small percentage (usually 15%) of the words in a sentence are obscured by noise and try to predict what they were. Ancillary tasks like predicting whether two sentences are adjacent or not (or, if they are, what their order is) are also used, but appear to be non-essential. Pre-training (done a single time, at some expense, at BigCo HQ) produces a contextual encoder, a model which can embed words and sentences in ways that are useful for many downstream tasks. But then one can also take this encoder and fine-tune it to some other downstream task (an instance of transfer learning). It turns out that the combination of task-general pre-training using free-to-cheap ordinary text data and a small amount of task-specific fine-tuning using labeled data results in substantial performance gains over what came before. The BERT creators gave away both software and the pre-trained parameters (which would be expensive for an individual or a small academic lab to reproduce on their own), and an entire ecosystem of sharing pre-trained model parameters has emerged. I see this toolkit-development ecosystem as a sign of successful science.
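For readers who have not seen it spelled out, the masking step of the pre-training task can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function name `mask_tokens`, the whitespace tokenization, and the example sentence are all made up here, and the actual BERT recipe is more involved (it operates on subword pieces and does not always substitute the mask symbol for a selected token).

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rate=0.15, seed=0):
    """Hide roughly `rate` of the tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model would be asked to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so that short sentences still give a signal.
    n_masked = max(1, round(len(tokens) * rate))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets


# A made-up 13-token sentence; 15% of 13 rounds to 2 masked positions.
sentence = "the quick brown fox jumps over the lazy dog again and again today".split()
masked, targets = mask_tokens(sentence)
```

The model is then trained to recover `targets` from `masked`; no labels beyond the raw text are needed, which is why this kind of pre-training data is so cheap.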

From my limited perspective, very little has happened since then that is not just more BERTology, that is, exploiting BERT and similar models. The only alternative on the horizon, for the last 4 years now, is pre-trained large language models without the encoder component, of which the best known are the GPT family (now up to GPT-3). These models do one thing well: they take a text prompt and produce more text that seemingly responds to the prompt. However, whereas BERT and family are free to reuse, GPT-3’s parameters and software are both closed source and can only be accessed at scale by paying a licensing fee to Microsoft. That itself is a substantial regression compared to BERT. More importantly, though, the GPT family are far less expressive tools than BERT, since they don’t really support fine-tuning. (More precisely, I don’t see any difficult technical barriers to fine-tuning GPT-style models; it’s just not supported.) Thus they can really only be used for one thing: zero-shot text generation tasks, in which the task is “explained” to the model in the input prompt, and the output is also textual. Were it possible to simply write out, in plain English, what you want, and then get the output in a sensible text format, this of course would be revolutionary, but that’s not the case. Rather, GPT has spawned a cottage industry of prompt engineering. A prompt engineer, roughly, is someone who specializes in crafting prompts. It is of course impressive that this can be done at all, but just because an orangutan can be taught to make an adequate omelette doesn’t mean I am going to pay one to make breakfast. I simply don’t see how any of this represents an improvement over the BERT ecosystem, which at least has an easy-to-use free and open-source ecosystem. And as you might expect, GPT’s zero-shot approach is quite often much worse than what one would obtain using the light supervision of the BERT-style fine-tuning approach.

Phonological nihilism

One might argue that phonology is in something of a crisis period. Phonology seems to be going through early stages of grief for what I see as the failure of teleological, substance-rich, constraint-based, parallel-evaluation approaches to make headway, but the next paradigm shift is yet to become clear to us. I personally think that logical, substance-free, serialist approaches ought to represent our next i-phonology paradigm, with “evolutionary”-historical thinking providing the e-language context, but I may be wrong and an altogether different paradigm may be waiting in the wings. The thing that troubles me is that phonologists from these still-dominant constraint-based traditions seem to have less and less faith in the tenets of their theories, and in the worst case this expresses itself as a sort of nihilism. I discern two forms of this nihilism. The first is the phonologist who thinks we’re doing “word sudoku”, playing games of minimal description that produce generalizations without a shred of cognitive support. The second is the phonologist who thinks that everything is memorized, so that the actual domain of phonological generalization is just Psych 101 subject pool nonce word experiments. My pitch to both types of nihilists is the same: if you truly believe this, you ought to spend more time at the beach and less in the classroom, and save some space in the discourse for those of us who believe in something.

On the past tense debate; Part 3: the overestimation of overirregularization

One final, and still unresolved, issue in the past tense debate is the role of so-called overirregularization errors.

It is well-known that children acquiring English tend to overregularize irregular verbs; that is, they apply the regular -d suffix to verbs which in adult English form irregular pasts, producing, e.g., *thinked for thought. Maratsos (2000) estimates that children acquiring English very frequently overregularize irregular verbs; for instance, Abe, recorded roughly 45 minutes a week from ages 2;5 to 5;2, overregularizes rare irregular verbs as much as 58% of the time, and even the most frequent irregular verbs are overregularized 18% of the time. Abe appears to have been exceptional in that he had a very large receptive vocabulary for his age (as measured by the Peabody Picture Vocabulary Test), giving him more opportunities (and perhaps more grammatical motivation) for overregularization,1 but Maratsos estimates that less-precocious children have lower but overall similar rates of overregularization.

In contrast, it is generally agreed that overirregularization, or the application of irregular patterns (e.g., in English, of ablaut, shortening, etc.), is quite a bit rarer. The only serious attempt to count overirregularizations is by Xu & Pinker (1995; henceforth XP). They estimate that children produce such errors no more than 0.2% of the time, which would make overirregularizations roughly two orders of magnitude rarer than overregularizations. This is a substantial difference. If anything, I think that XP overestimate overirregularizations. For instance, XP count brang as an overirregularization, even though this form does exist quite robustly in adult English (though it is somewhat stigmatized). Furthermore, XP count *slep for slept as an overirregularization, though this is probably just ordinary (td)-deletion, a variable rule that is attested already in early childhood (Payne 1980). But by any account, overirregularization is extremely rare. The same is found in nonce word elicitation experiments such as those conducted by Berko (1958): both children and adults are loath to generate irregular past tenses for nonce verbs.2
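The “two orders of magnitude” figure is easy to check against the numbers quoted above. The snippet below is just that arithmetic; the variable and function names are mine, and it compares XP's upper bound with Abe's rates as reported by Maratsos, who notes that Abe is somewhat atypical, so the ratios should be read as rough.

```python
import math

# XP's upper bound on overirregularization: 0.2% of opportunities.
overirregularization_rate = 0.002

# Abe's overregularization rates, as quoted from Maratsos (2000).
overregularization_rates = {
    "frequent irregulars": 0.18,
    "rare irregulars": 0.58,
}


def orders_of_magnitude(larger, smaller):
    """Base-10 log of the ratio between two rates."""
    return math.log10(larger / smaller)


ratios = {
    label: rate / overirregularization_rate
    for label, rate in overregularization_rates.items()
}
magnitudes = {
    label: orders_of_magnitude(rate, overirregularization_rate)
    for label, rate in overregularization_rates.items()
}

for label in ratios:
    print(f"{label}: about {ratios[label]:.0f} times more frequent")
```

The ratios come out at roughly 90 and 290, i.e., between just under two and about two and a half orders of magnitude.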

This is a problem for most existing computational models. Nearly all of them—Albright & Hayes’ (2003) rule-based model (see their §4.5.3), O’Donnell’s (2015) rules-plus-storage system, and all analogical models and neural networks I am aware of—not only overregularize, like children do, but also overirregularize at rates far exceeding what children do. I submit that any computational model which produces substantial overirregularization is simply on the wrong track.

Endnotes

  1. It is amusing to note that Abe is now, apparently, a trial lawyer and partner at a white-shoe law firm.
  2. As I mentioned in a previous post, this is somewhat obscured by ratings tasks, but that’s further evidence we should disregard such tasks.

References

Albright, A. and Hayes, B. 2003. Rules vs. analogy in English past tenses: a computational/experimental study. Cognition 90(2): 119-161.
Berko, J. 1958. The child’s learning of English morphology. Word 14: 150-177.
Maratsos, M. 2000. More overregularizations after all: new data and discussion on Marcus, Pinker, Ullman, Hollander, Rosen & Xu. Journal of Child Language 27: 183-212.
O’Donnell, T. 2015. Productivity and Reuse in Language: a Theory of Linguistic Computation and Storage. MIT Press.
Payne, A. 1980. Factors controlling the acquisition of the Philadelphia dialect by out-of-state children. In W. Labov (ed.),  Locating Language in Time and Space, pages 143-178. Academic Press.
Xu, F. and Pinker, S. 1995. Weird past tense forms. Journal of Child Language 22(3): 531-556.

Thought experiment #3

[The semester is finally winding down and I am back to writing again.]

Let us suppose one encounters a language in which the only adjacent consonants are affricates like [tʃ, ts, tɬ].1 One might be tempted to argue that these affricates are in fact singleton contour phonemes2 and that the language does not permit true consonant clusters.3

Let us suppose instead that one finds a language in which word-internal nasal-stop clusters are common, but nasal-glide and nasal-liquid clusters are not found except at transparent morpheme boundaries.4 One then might be tempted to argue that in this language, nasal-stop clusters are in fact sequences of nasal followed by an oral consonant rather than singleton contour phonemes.

In my opinion, neither of these arguments “goes through”. They follow from nothing, or at least nothing that has been explicitly stated. Allow me to explain, but first, consider the following hypothetical:

The metrical system of Centaurian, the lingua franca of the hominid aliens of the Alpha Centauri system, historically formed weight-insensitive trochees, with final extrametricality for prosodic words with odd syllable count of more than one syllable. However, a small group of Centaurian exiles have been hurtling towards the Sol system at .05 parsecs a year (roughly 110 million MPH) for the last century or so. Because of their rapid speed of travel it is impossible for these pioneers to stay in communication with their homeworld, and naturally their language has undergone drift over that time. In particular, Pioneer Centaurian (as we’ll call it) has slowly but surely lost all the final extrametrical syllables of Classical Centaurian, and as a result there are no longer any 3-, 5-, 7- or 9- (etc.) syllable words in the Pioneer dialect.

As a result of a phonetically well grounded, “plausible”, Neogrammarian sound change, Pioneer Centaurian (PC) lacks long words with an odd number of syllables, though it still has 1-syllable words. What then is the status of this generalization in the grammar of PC speakers? The null hypothesis has to be that it has no status at all. Even though the lexical entries of PC have undergone changes, the metrical grammar of PC could easily be identical to Classical Centaurian: weight-insensitive trochees, with a now-vacuous rule of final extrametricality. Furthermore, it is quite possible that PC speakers have simply not noticed the relevant metrical facts, either consciously or subconsciously. Would PC speakers rate, say, 5-syllable nonce words as ill-formed possible words? No one knows. When PC speakers inevitably come in contact with English, will they be reluctant to borrow 5-syllable words like university or electricity into their language, or will they feel the need to append or delete a syllable to conform to their language’s lexicon? Once again, no one knows.
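To make “now-vacuous” concrete, here is a toy version of the Centaurian metrical grammar. Everything in it beyond the prose description is an assumption made for illustration: the syllables are invented, feet are built left to right, and a monosyllable (or a stray final syllable when extrametricality is switched off) is simply given a degenerate foot of its own.

```python
def parse(syllables, extrametricality=True):
    """Parse a word into binary trochees.

    Returns (feet, extrametrical). With extrametricality on, the final
    syllable of an odd-length word longer than one syllable is left unfooted.
    """
    if len(syllables) == 1:
        # Assumption: monosyllables get a degenerate foot.
        return [(syllables[0],)], []
    body = list(syllables)
    extrametrical = []
    if extrametricality and len(body) % 2 == 1:
        extrametrical = [body.pop()]
    feet = [tuple(body[i:i + 2]) for i in range(0, len(body), 2)]
    return feet, extrametrical


def truncate(syllables):
    """The sound change: drop whatever the classical grammar left unfooted."""
    feet, _ = parse(syllables)
    return [syllable for foot in feet for syllable in foot]


# Invented Classical Centaurian words of one to five syllables.
classical = [
    ["ta"],
    ["ta", "ka"],
    ["ta", "ka", "ri"],
    ["ta", "ka", "ri", "mo"],
    ["ta", "ka", "ri", "mo", "su"],
]
pioneer = [truncate(word) for word in classical]

# For every Pioneer word, the parse is the same with or without the rule.
rule_is_vacuous = all(parse(w, True) == parse(w, False) for w in pioneer)
```

In the Pioneer lexicon, `rule_is_vacuous` holds: nothing in the data a learner sees distinguishes a grammar that retains final extrametricality from one that has dropped it, and neither grammar says anything about the absence of odd-length words.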

The same is essentially true of the aforementioned language in which the only consonant clusters are affricates, or the aforementioned language in which nasal-consonant clusters are highly restricted. It might be the case that the grammar treats the former as single segments and the grammar treats the latter as clusters, but absolutely nothing presented thus far suggests it has to be true.

Let us refer to the idea that the grammar needs to encode phonotactic generalizations (somehow) as the phonotactic hypothesis. I have argued—though more for the sake of argument than out of genuine commitment—for a constrained version of this hypothesis; I note that any surface-true rule will rule out certain surface forms. Thus, if desired, one can derive—or perhaps more accurately, project—certain phonotactic generalizations by taking a free-ride on surface-true rules.5 But note: I have not argued that the phonotactic hypothesis is correct. Rather, I have simply provided a way to derive some phonotactic generalizations using entrenched grammatical machinery (i.e., phonological alternations). And this can only account for a subset of possible phonotactic generalizations.

Let us consider the language with word-initial affricates again. Linguists are often heard to say that one needs to posit phonotactic generalizations to “rule out” consonant clusters in this language. I disagree. Imagine that we have two grammars, G and G’. G has a set of URs, which includes contour phoneme affricates (/t͡ɬakaʔ-/ ‘people’, /t͡sopelik-/ ‘sweet’, etc., where the IPA tie bar symbolizes contour phonemes) but no consonant clusters. G also has a surface constraint on consonant clusters other than the affricates (which can be assumed to be contour phonemes, for sake of simplicity). G’ has the same set of URs, but lacks the surface constraint. Is there any reason to prefer G over G’? With the evidence given so far, I submit that there is not. Of course, there might be some grammatical patterns which, if otherwise unconstrained, would produce consonant clusters, in which case the phonotactic constraint of G may have some work to do. And there may be additional facts (perhaps the adaptation of loanwords, or wordlikeness judgments, though these data cannot be applied to this problem without making additional strong assumptions) which also militate in favor of G. But rarely if ever are these additional facts presented when positing G. Now let us consider a third grammar, G”. This grammar is the same as G’, except that the affricates are now represented as consonant clusters (/tɬakaʔ-/ ‘people’, /tsopelik-/ ‘sweet’, etc.) rather than contour phonemes. Is there any reason to prefer either G’ or G” given the facts available to us thus far? It seems to me there is not.
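The indistinguishability of G, G’, and G” can be seen in a toy setting. The sketch below assumes the simplest possible case: the grammars contain no rules at all, so surface forms are just spelled-out URs, and the tie bar is ignored at spell-out. The function names and the segment-list encoding are my own illustration, not anyone's proposed formalism.

```python
VOWELS = {"a", "e", "i", "o"}


def has_cluster(segments):
    """True if two consonant segments are adjacent."""
    return any(
        a not in VOWELS and b not in VOWELS
        for a, b in zip(segments, segments[1:])
    )


def surface_forms(urs, cluster_constraint):
    """Spell out each UR, discarding constraint violators if the constraint is active."""
    forms = set()
    for segments in urs:
        if cluster_constraint and has_cluster(segments):
            continue
        forms.add("".join(segments))
    return forms


# Affricates as single contour segments (one list element each).
contour_urs = [
    ["tɬ", "a", "k", "a", "ʔ"],
    ["ts", "o", "p", "e", "l", "i", "k"],
]
# The same morphemes with the affricates as two-segment clusters.
cluster_urs = [
    ["t", "ɬ", "a", "k", "a", "ʔ"],
    ["t", "s", "o", "p", "e", "l", "i", "k"],
]

g = surface_forms(contour_urs, cluster_constraint=True)
g_prime = surface_forms(contour_urs, cluster_constraint=False)
g_double_prime = surface_forms(cluster_urs, cluster_constraint=False)
```

All three sets are identical. The constraint in G never gets a chance to filter anything, and the choice between contour segments and clusters leaves no trace on the surface; only additional evidence of the kind mentioned above could separate the three grammars.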

This is a minor scandal for phonemic analysis. But it is not a purely philosophical issue: it is the same issue that children acquiring Nahuatl face. “Phonotacticians” have largely sidestepped these issues by making a completely implicit assumption that grammars (or perhaps, language learners) abhor a vacuum, in the sense that phonotactic constraints need to be posited to rule out that which does not occur. The problem is that there is often no reason to think these things would occur in the first place. If we assume that grammars do not abhor a vacuum—allowing us to rid ourselves of the increasingly complex machinery used to encode phonotactic generalizations not derived from alternations—we obtain exactly the same results in the vast majority of cases.

Endnotes

  1. One language with this property is Classical Nahuatl.
  2. Whatever that means! It’s not immediately clear, since there does not seem to be a fully-articulated theory that explains what it means for a single segment in underlying representation to correspond to multiple articulatory targets on the surface. Without such a theory this feels like mere phenomenological description.
  3. Recently, Gouskova & Stanton (2021) express this heuristic, which has antecedents going back to at least Trubetzkoy, as a simple computational model.
  4. One language which supposedly has this property is Gurindji (McConvell 1988), though I have only seen the relevant data reprinted in secondary sources. Thanks to Andrew Lamont (p.c.) for drawing my attention to this data. Note that in this language, the nasal-obstruent clusters undergo dissimilation when preceded by another nasal-obstruent cluster, which might, under certain assumptions, be a further argument that nasal-obstruent sequences are really clusters.
  5. See also Gorman 2013, particularly chapters 3-4.

References

Gorman, K. 2013. Generative phonotactics. Doctoral dissertation, University of Pennsylvania.
Gouskova, M. and Stanton, J. 2021. Learning complex segments. Language 97(1): 151-193.
McConvell, P. 1988. Nasal cluster dissimilation and constraints on phonological variables in Gurundji and related languages. Aboriginal Linguistics 1: 135-165.

On the past tense debate; Part 2: dual-route models are (still) incomplete

Dual-route models remain for the most part incompletely specified. Because crucial details are missing from their specification, they have generally not been implemented as computational cognitive models. Therefore, there is far less empirical rigor in dual-route thinking. To put it starkly, dual-route proponents have conducted expensive, elaborate brain imaging studies to validate their model but have not proposed a model detailed enough to implement on a $400 laptop.

The dual-route description of the English past tense can be given as such:

  1. Use associative memory to find a past tense form.
  2. If this lookup fails, or times out, append /-d/ and apply phonology.

Note that this ordering is critical: one cannot simply ask whether a verb is regular, since by hypothesis some or all regular verbs are not stored as such. And, as we know (Berko 1958), novel and nonce verbs are almost exclusively inflected with /-d/, consistent with the current ordering.1 This model equates, rightly, I think, the notion of regularity with the elsewhere condition. The problem is with the fuzziness in how one might reach condition (2). We do not have any notion of what it might mean for associative memory lookup to fail. Neural nets, for instance, certainly do not fail to produce an output, though they will happily produce junk in certain cases. Nor do we have much of a notion of how it might time out.
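It may help to see how little the two-step description pins down once one tries to write it out. In the sketch below, associative memory is a plain dictionary and “failure” is a dictionary miss; both are stipulations of mine, and they are exactly the pieces the dual-route literature leaves open. The rough phonemic transcriptions, the names, and the three-way allomorphy rule are likewise illustrative assumptions, and the time-out condition is not modeled at all.

```python
# Step 1: a stand-in for associative memory (stem -> stored past tense).
MEMORY = {
    "θɪŋk": "θɔt",  # think, thought
    "sɪŋ": "sæŋ",   # sing, sang
    "hɪt": "hɪt",   # hit, hit
}

VOICELESS = set("pkfθsʃ")


def lookup(stem):
    """Return the stored past tense, or None if nothing is stored."""
    return MEMORY.get(stem)


def regular_past(stem):
    """Step 2: append /-d/ and apply a simplified regular phonology."""
    last = stem[-1]
    if last in "td":
        return stem + "ɪd"  # epenthesis after alveolar stops
    if last in VOICELESS:
        return stem + "t"   # voicing assimilation
    return stem + "d"


def past_tense(stem):
    """Memory first; the regular rule applies only elsewhere."""
    stored = lookup(stem)
    return stored if stored is not None else regular_past(stem)


examples = {stem: past_tense(stem) for stem in ["θɪŋk", "wɔk", "hʌg", "wɑnt", "rɪk"]}
```

With a dictionary, failure is trivial to define. Replace the dictionary with anything that deserves the name associative memory, such as a neural network, and `lookup` never returns `None`; the sketch then has no principled way to reach step 2.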

I am aware of two serious attempts to spell out this crucial detail. The first is Baayen et al.’s 1997 visual word recognition study of Dutch plurals. They imagine that (1) and (2) are competing activation “routes” and that recognition occurs when either of the routes reaches activation threshold, as if both routes run in parallel. To actually fit their data, however, their model immediately spawns epicycles in the form of poorly justified hyperparameters (see their fn. 2) and as far as I know, no one has ever bothered to reuse or reimplement their model.2 The second is O’Donnell’s 2015 book, which proposes a cost-benefit analysis for storage vs. computation. However, this complex and clever model is not described in enough detail for a “clean room” implementation, and no software has been provided. What dual-route proponents owe us, in my opinion, is a toolkit. Without serious investment in formal computational description and reusable, reimplementable, empirically validated models, it is hard to take the dual-route proposal seriously.

Endnotes

  1. There’s a lot of work which obfuscates this point. An impression one might get from Albright & Hayes (2003) is that adult nonce word studies produce quite a bit of irregularity, but this is only true in their rating task and hardly at all in their “volunteering” (production) task, and a hybrid task finds much higher ratings for nonce irregulars. Schütze (2005) argues, convincingly, in my opinion, that this is because speakers use a different task model in rating tasks, one that is mostly irrelevant to what Albright & Hayes are studying.
  2. One might be tempted to fault Baayen et al. for using visual stimulus presentation (in a language with one of the more complex and opaque writing systems), or for using recognition as a proxy for production. While these are probably reasonable critiques today, visual word recognition was still the gold standard in 1997.

References

Albright, A. and Hayes, B. 2003. Rules vs. analogy in English past tenses: a computational/experimental study. Cognition 90(2): 119-161.
Baayen, R. H., Dijkstra, T., and Schreuder, R. 1997. Singulars and plurals in Dutch: evidence for a parallel dual-route model. Journal of Memory & Language 37(1): 94-117.
Berko, J. 1958. The child’s learning of English morphology. Word 14: 150-177.
O’Donnell, T. 2015. Productivity and Reuse in Language: a Theory of Linguistic Computation and Storage. MIT Press.
Schütze, C. 2005. Thinking about what we are asking speakers to do. In S. Kepser and M. Reis (eds.), Linguistic Evidence: Empirical, Theoretical, and Computational Perspectives, pages 457-485. Mouton de Gruyter.

On the past tense debate; Part 1: the RAWD approach

I have not had time to blog in a while, and I really don’t have much time now either. But here is a quick note (one of several, I anticipate) about the past tense debate.

It is common to talk as if connectionist approaches and dual-route models are the two opposing approaches to morphological irregularity, when in fact there are three approaches. Linguists since at least Bloch (1947)1 have claimed that regular, irregular, and semiregular patterns are all rule-governed and ontologically alike. Of course, the irregular and semiregular rules may require some degree of lexical conditioning, but phonologists have rightly never seen this as some kind of defect or scandal. Chomsky & Halle (1968), Halle (1977), Rubach (1984), and Halle & Mohanan (1985) all spend quite a bit of space developing these rules, using formalisms that should be accessible to any modern-day student of phonology. These rules all the way down (henceforth RAWD) approaches are empirically adequate and have been implemented computationally with great success: some prominent instances include Yip & Sussman 1997, Albright & Hayes 2003,2 and Payne 2022. It is malpractice to ignore these approaches.

One might think that RAWD has more in common with dual-route approaches than with connectionist thinking, but as Mark Liberman noted many years ago, that is not obviously the case. Mark Seidenberg, for instance, one of the most prominent Old Connectionists, has argued that there is a tendency for regulars and irregulars to share certain structural similarities. To take one example, semiregular slept does not look so different from stepped, and the many zero past tense forms (e.g., hit, bid) end in the same phones, [t, d], used to mark the regular past. While I am not sure this is a meaningful generalization, it clearly is something that both connectionist and RAWD models can encode.3 This is in contradistinction to dual-route models, which have no choice but to treat these observations as coincidences. Thus, as Mark notes, connectionists and RAWD proponents find themselves allied against dual-route models.

(Mark’s post, which I recommend, continues to draw a parallel between dual-routism and bi-uniqueness which will amuse anyone interested in the history of phonology.)
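To give a flavor of what a RAWD treatment of slept might look like, here is a toy derivation in which slept is formed by ordinary /-d/ suffixation plus a lexically conditioned shortening rule. This is a sketch under heavy assumptions, not a reconstruction of any published analysis: the rule formulations, the rough transcriptions, the set of marked stems, and all names are mine, and stems ending in [t, d] are left out because the toy has no epenthesis rule.

```python
VOICELESS = set("ptkfθsʃ")

# Only the one vowel alternation this toy example needs.
SHORTENING = {"i": "ɛ"}

# Stems lexically marked as undergoing shortening (sleep, keep).
SHORTENING_STEMS = {"slip", "kip"}


def suffix(stem):
    """Regular past tense suffixation: append /-d/."""
    return stem + "d"


def assimilate(form):
    """Devoice the final /d/ after a voiceless segment."""
    if form.endswith("d") and form[-2] in VOICELESS:
        return form[:-1] + "t"
    return form


def shorten(form, stem):
    """Shorten the stem vowel, but only for lexically marked stems."""
    if stem not in SHORTENING_STEMS:
        return form
    return "".join(SHORTENING.get(segment, segment) for segment in form)


def derive(stem):
    """Apply the rules in order: suffixation, assimilation, shortening."""
    return shorten(assimilate(suffix(stem)), stem)


derivations = {stem: derive(stem) for stem in ["slip", "stɛp", "kip", "bip"]}
```

Here sleep and step pass through the same suffixation and assimilation rules; the only thing special about sleep is a diacritic that triggers one more rule. The resemblance between slept and stepped is therefore built into the derivation rather than being a coincidence.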

Endnotes

  1. This is not exactly obscure work: Bloch taught at two Ivies and was later the president of the LSA. 
  2. To be fair, Albright & Hayes’s model does a rather poor job recapitulating the training data, though as they argue, it generalizes nonce words in a way consistent with human behavior.
  3. For instance, one might propose that slept is exceptionally subject to a vowel shortening rule of the sort proposed by Myers (1987) but otherwise regular.

References

Albright, A. and Hayes, B. 2003. Rules vs. analogy in English past tenses: a computational/experimental study. Cognition 90(2): 119-161.
Bloch, B. 1947. English verb inflection. Language 23(4): 399-418.
Chomsky, N., and Halle, M. 1968. Sound Pattern of English. Harper & Row.
Halle, M. 1977. Tenseness, vowel shift and the phonology of back vowels in Modern English. Linguistic Inquiry 8(4): 611-625.
Halle, M., and Mohanan, K. P. 1985. Segmental phonology of Modern English. Linguistic Inquiry 16(1): 57-116.
Myers, S. 1987. Vowel shortening in English. Natural Language & Linguistic Theory 5(4): 485-518.
Payne, S. R. 2022. When collisions are a good thing: the acquisition of morphological marking. Bachelor’s thesis, University of Pennsylvania. 
Pinker, S. 1999. Words and Rules: the Ingredients of Language. Basic Books.
Rubach, J. 1984. Segmental rules of English and cyclic phonology. Language 60(1): 21-54.
Yip, K., and Sussman, G. J. 1997. Sparse representations for fast, one-shot learning. In Proceedings of the 14th National Conference on Artificial Intelligence and 9th Conference on Innovative Applications of Artificial Intelligence, pages 521-527.

The Wordlikeness Project

We (myself, Karthik Durvasula, and Jimin Kahng) recently got the good news that our NSF collaborative research proposal has been funded. This work springs ultimately from my dissertation. There I argue (using a mix of logical argumentation and “archival” wordlikeness data mostly taken from appendices of previously published work) that the view of phonotactic grammar as statistical patterns or constraints projected from the lexicon is not strongly supported by the available data. My conclusions are perhaps weakened by the low overall quality of this archival data, which is drawn from various stimulus presentation modalities (i.e., auditory vs. orthographic) and response modalities (Likert scale vs. binary forced-choice vs. transcription). In the NSF study, we will be collecting wordlikeness data in English and Korean, manipulating these stimulus presentation and response modalities, and this data will be made publicly available under the name of the Wordlikeness Project. (Here we draw inspiration from the English Lexicon Project and spinoffs.) We will also be using this data for extensive computational modeling, to answer some of the questions raised in my dissertation and in Karthik and Jimin’s subsequent work.

Stop being weird about the Russian language

As you know, Russia is waging an unprovoked war on Ukraine. It should go without saying that my sympathies are with Ukraine, but of course both states are undemocratic, one-party kleptocracies and I have little hope for anything good coming from the conflict.

That’s all beside the point. Since the start of the war, I have had several conversations with linguists who suggested that the study of the Russian language, one of the most important languages in linguistic theorizing over the years, is now “cringe”. This is nonsense. First, official statistics show that a sizable minority of Ukrainian citizens identify as ethnically Russian, and that a substantial minority speak Russian as a first language (and this is probably skewed by social-desirability bias). Second, it is wrong to identify a language with any one nation. (It is “cringe” to use flag emojis to label languages; just use the ISO codes.) Third, it is foolish to equate the state with the people who live underneath it, particularly after the end of the kind of mass political movements that in earlier times could stop this kind of state violence. It is a basic corollary of the i-language view that children learn whatever languages they’re sufficiently exposed to, regardless of their location or of their caretakers’ politics. The iniquity of war does not travel from nation to language to its speakers. Stop being weird about it.

The end of defectivity

As of yesterday I have completed my series of defectivity case studies, at least for the time being. From these I propose the following tentative taxonomy:

It is not clear to me whether three categories are really needed. In both of the latter two, there seems to be some tight phonotactic constraint on inflectional variants which results in ungrammaticality and defectivity if not satisfied. In the two cases from Africa, these constraints are of a metrical nature and impact many lexemes; in the cases from Scandinavia, they concern stem-final consonant clusters and possible mutations to them. And this looks a lot like the case of Russian verbs. This just leaves Tagalog, which I think has simply been misanalyzed, and Turkish, where the only defective lexemes are a handful of subminimal borrowings.

I am aware of two other cases of interest: (various stages of) Sanskrit (Stump 2010) and Latvian. These are phenomenologically quite different from the ones I’ve discussed so far: both involve gaps in the paradigms of inflected pronouns. I do not find gaps in the distribution of functional elements to be nearly as shocking as the failure of, say, an otherwise unobjectionable Russian or Spanish verb to have a 1sg. form. I should mention that the constraint against contracting n’t to am in standard English (see Yang 2017:§3 and references therein) is also possibly an example of this form; I suppose it depends on whether or not n’t is really an inflectional affix or not.

References

Stump, G. 2010. Interactions between defectiveness and syncretism. In M. Baerman, G. G. Corbett, and D. Brown (eds.), Defective Paradigms: Missing Forms and What They Tell Us, pages 181-210. Oxford University Press.
Yang, C. 2017. How to wake up irregular (and speechless). In C. Bowern, L. Horn, and R. Zanuttini (eds.), On Looking into Words (and Beyond), pages 211-233. Language Science Press.