The computational revolution in linguistics

(Throughout this post, I have taken pains not to name any names. The beauty of subtweeting and other forms of subposting is that nobody knows for sure you’re the person being discussed unless you volunteer yourself. So, don’t.)

One of the more salient developments in linguistics as a discipline over the last two decades is the way in which computational knowledge has diffused into the field.1 Twenty years ago, there were but a handful of linguistics professors in North America who could perform elaborate corpus analyses, apply machine learning and statistical analysis, or extract acoustic measurements from an audio file. And, while it was in some ways quite robust, speech and language processing at the turn of this century simply did not hold the same importance it does nowadays.

While some professors—including, to their credit, many of my mentors and colleagues—can be commended for having “skilled up” in the intervening years, this knowledge has, I am sad to say, mostly advanced one death (and subsequent tenure line renewal) at a time. This has negative consequences for linguistics students who want to train for or pivot to a career in the tech sector, since there are professors who were, in their time, computationally sophisticated, but who lack the skills a rising computational linguist is expected to have mastered. In an era of contracting tenure rolls and other forms of casualization in the academy, this risks pushing out legitimate, albeit staid, lines of linguistic inquiry in favor of areas prized by capitalists.2

Yet I believe that this upskilling has a lot to contribute to linguistics as a discipline. There are many core questions about language use, acquisition, variation, and change which are best answered with a computational simulation that forces us to be explicit about our assumptions, or a corpus study that tells us what people really said, or a statistical analysis that tells us whether our correlations are likely to be meaningful, or even a machine learning system that helps us rapidly label linguistic data.3 It is a boon to our field that linguists of any age can employ these tools when appropriate.

This is not to say that the transition has not been occasionally ugly. First, there are the occasional nasty turf wars over who exactly is a linguist.4 Second, the standards of quality for work in this area must be negotiated and imposed. While a syntax paper in NL&LT from even 30 years ago is easily readable today, the computational methods of even widely praised papers from 15 or 20 years ago are, frankly, often quite sloppy. I have found it necessary to explain this to students who want to engage with this older work, lest they lower their own methodological standards.

I discern at least a few common sloppy habits in this older computational work, focusing for the moment on computational cognitive models of linguistic behavior.

  1. If a proposed computational model is compared to some “baseline” or older model, this older model is usually an ancient associationist model from psychology. This older model naturally lacks much of the rich linguistic specification of the proposed model, and naturally it fails to model the data. Deliberately picking a bad baseline is putting one’s finger on the scale.
  2. Comparisons between computational models are usually informal. One should instead use statistical model comparison methods; a sketch of what this might look like follows this list.
  3. The dependent variable for modeling is often derived from poorly-designed human subjects experiments. The subjects in these experiments may be instructed to perform a task they are unlikely to be able to do consciously (i.e., the tasks are cognitively impenetrable). Unjustified assumptions about appropriate scales of measurement may have been made. Finally, the n’s are often needlessly small. Computational cognitive models demand high-quality measures of the behaviors they’re meant to model.
  4. Once the proposed model has been shown to outperform the baseline, it is reified far beyond what the evidence suggests. Computational cognitive modeling can at most show that certain explicit assumptions are consistent with the observed data; it cannot establish much beyond that.
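
On the second point, here is a minimal sketch of what a more formal comparison could look like: a likelihood-ratio test between two nested models, given only their log-likelihoods and parameter counts. This is my own illustration, not drawn from any particular paper, and the numbers are invented; for non-nested models one would instead reach for something like AIC, BIC, or a bootstrap comparison.

```python
# A likelihood-ratio test for nested models (all values here are invented).
import scipy.stats


def likelihood_ratio_test(ll_null, k_null, ll_alt, k_alt):
    """Returns the p-value for rejecting the simpler (null) model."""
    statistic = 2.0 * (ll_alt - ll_null)
    degrees_of_freedom = k_alt - k_null
    return scipy.stats.chi2.sf(statistic, degrees_of_freedom)


# E.g., a baseline with 3 free parameters vs. a proposed model with 5.
print(likelihood_ratio_test(ll_null=-1402.7, k_null=3, ll_alt=-1388.1, k_alt=5))
```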

The statistician Andrew Gelman writes that scientific discourse sometimes proceeds as if earlier published work has a greater claim to truth than later research that is critical of the original findings (which may or may not be published yet).5 Critical interpretation of this older computational work is increasingly called for as our methodological standards continue to mature. I find reviewers (and literature-reviewers) overly deferential to prior work of dubious quality simply because of its priority.

Endnotes

  1. An under-appreciated element of this process is that it is simply easier to do linguistically-relevant things with computers than it was 20 years ago. For this, one should thank Python and R, NumPy and scikit-learn, and of course tools like Praat and Parselmouth.
  2. I happen to think college education should not be merely vocational training.
  3. I happen to think most of these questions can be answered with a cheap laptop, and only a few require a CUDA-enabled GPU.
  4. I suspect this is mostly a response to the rapidly casualizing academy. Unfortunately, any question about whether we should be doing X in linguistics is misinterpreted as a question about whether people who do X deserve to have a job. This is a presupposition failure for me: I believe everyone deserves meaningful work, and that academic tenure is a model of labor relations that should be expanded beyond the academy.
  5. To free ourselves of this bias, Gelman proposes what he calls the time-reversal heuristic, in which one imagines the temporal order reversed (e.g., that the later failed replication is now the first published result on the matter) and then re-evaluates the evidence. Similar thinking is called for when interacting with older computational work.

A* shortest string decoding for non-idempotent semirings

I recently completed some work, in collaboration with Google’s Cyril Allauzen, on a new algorithm for computing the shortest string through a weighted finite-state automaton. For so-called path semirings, the shortest string is given by the shortest path, but up until now there was no general-purpose algorithm for computing the shortest string over non-idempotent semirings (like the log or probability semiring). Such an algorithm would make it much easier to decode with interpolated language models or elaborate channel models in a noisy-channel formalism. In this preprint, we propose such an algorithm using A* search and lazy (“on-the-fly”) determinization, and prove that it is correct. The algorithm in question is implemented in my OpenGrm-BaumWelch library by the baumwelchdecode command-line tool.
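
To make the path/string distinction concrete, here is a toy Pynini sketch of my own; the lattice and its weights are invented, and this is not the algorithm from the preprint.

```python
# Over the tropical semiring (Pynini's default), the shortest string is the
# shortest path, which pynini.shortestpath already computes. Over a
# non-idempotent semiring such as the log semiring, the weights of all paths
# sharing a string must be summed, so the best string need not lie on the
# single best path; that is the case the new algorithm handles.
import math
import pynini

# Two distinct paths spell "there"; one path spells "their".
lattice = pynini.union(
    pynini.accep("their", weight=1.0),  # probability exp(-1.0) ≈ 0.368
    pynini.accep("there", weight=1.5),  # probability exp(-1.5) ≈ 0.223
    pynini.accep("there", weight=1.6),  # probability exp(-1.6) ≈ 0.202
)

# Tropical (min, +): the single best path spells "their".
print(pynini.shortestpath(lattice).string())

# But summing path probabilities per string, "there" (≈ 0.425) beats
# "their" (≈ 0.368), so the shortest *string* differs from the shortest path.
print(math.exp(-1.5) + math.exp(-1.6), math.exp(-1.0))
```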

WFST talk

I have posted a lightly-revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed description yet of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms: the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.

Evaluations from the past

In a literature review, speech and language processing specialists often feel tempted to report evaluation metrics like accuracy, F-score, or word error rate for the systems they describe. In my opinion, this is only informative if the prior and present work use the exact same data set(s) for evaluation. (Such results should probably be presented in a table alongside results from the present work, not in the body of the literature review.) If, instead, the earlier systems were tested on some proprietary data set, an obsolete corpus, or a data set the authors of the present work have declined to evaluate on, this information is not actionable. Authors should omit it, and reviewers and editors should insist that it be omitted.

It is also clear to me that these numbers are rarely meaningful as measures of how difficult a task is “generally”. To take an example from an unnamed 2019 NAACL paper (one guilty of the sin described above), word error rates on a single task in a single language range from 9.1% to 23.61% (note also the mixed precision). What could we possibly conclude from this enormous spread of results across different data sets?

On expanding acronyms

Student writers are often taught that acronyms should also be given in expanded form on first use. While this is a good rule of thumb in my opinion, there is an exception for any acronym whose expansion the author believes to be misleading about its referent, particularly when the acronym in question seems to have been coined after the fact and purely for the creator’s amusement.

“Many such cases.”

An author-date citation may be preferable to spelling out the silly acronym.

Logistic regression as the bare minimum. Or, Against naïve Bayes

When I teach introductory machine learning, I begin with (categorical) naïve Bayes classifiers. These are arguably the simplest possible supervised machine learning models, and can be explained quickly to anyone who understands probability and the method of maximum likelihood estimation. I then pivot and introduce logistic regression and its various forms. Ng and Jordan (2002) provide a nice discussion of how the two relate, and I encourage students to read their study.

Logistic regression is a more powerful technique than naïve Bayes. First, it is “easier” in some sense (Breiman 2001) to estimate the conditional distribution, as one does in logistic regression, than to model the joint distribution, as one does in naïve Bayes. Second, logistic regression can be learned using standard (online) stochastic gradient descent methods. Finally, it naturally supports the conventional regularization strategies needed to avoid overfitting. For these reasons, in 2022, I consider regularized logistic regression the bare minimum supervised learning method: the least sophisticated method that is possibly good enough. The pedagogical problem I then face is convincing students not to use naïve Bayes, given that it is obsolete (it is virtually always inferior to regularized logistic regression) and that tools like scikit-learn (Pedregosa et al. 2011) make it almost trivial to swap one machine learning method for the other.
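
To illustrate that last point, here is a minimal scikit-learn sketch of the swap. The corpus, featurization, and hyperparameters are arbitrary choices of mine (and I use multinomial rather than categorical naïve Bayes since the toy task is text classification), so treat it as a sketch rather than a recommendation.

```python
# Swapping naïve Bayes for regularized logistic regression in scikit-learn.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])

# LogisticRegression is L2-regularized by default; C is the inverse strength.
for classifier in (MultinomialNB(), LogisticRegression(C=1.0, max_iter=1000)):
    # Identical featurization; only the final estimator differs.
    model = make_pipeline(CountVectorizer(), classifier)
    scores = cross_val_score(model, data.data, data.target, cv=5)
    print(type(classifier).__name__, round(scores.mean(), 3))
```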

References

Breiman, Leo. 2001. Statistical modeling: the two cultures. Statistical Science 16:199-231.
Ng, Andrew Y., and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Proceedings of NeurIPS, pages 841-848.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, …, and Édouard Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12:2825-2830.

On “alternative” grammar formalisms

A common suggestion to graduate students in linguistics (computational or otherwise) is to study “alternative” grammar formalisms [not my term-KBG]. The implication is that the student is only familiar with formal grammars inspired by the supposedly-hegemonic generativist tradition—though it is not clear if we’re talking about the GB-lite of the Penn Treebank, the minimalist grammars (MGs) of Ed Stabler, or perhaps something else—and that the set of “alternatives” includes lexical-functional grammars (LFGs), tree-adjoining grammars (TAGs), combinatory categorial grammars (CCGs), head-driven phrase structure grammar (HPSG), or one of the various forms of construction grammar. I would never say that students should study less rather than more, but I am not convinced that this diversity of formalism is key to training well-rounded students. TAGs and CCGs are known to be strongly equivalent (Schiffer & Maletti 2021), and the major unification-based grammar systems (which include CCGs and HPSG, as well as formalized versions of construction grammar) are equivalent to MGs. I speculate that we should perhaps be emphasizing similarities rather than differences, insofar as those differences are not reflected in relative generative capacity.

Another way to assess the relative utility of alternative formalisms is to look at their actual use in wide-coverage computational grammars, since, as Chomsky (1981:6) says, it is possible to put systems to the test “only to the extent that we have grammatical descriptions that are reasonably compelling in some domain…”. Put another way, grammar frameworks both hegemonic and alternative can be assessed for coverage (which can be extensive in some languages and domains) or general utility, rather than for the often-spicy rhetoric of their proponents.

Finally, it is at least possible that some alternative frameworks are simply the losers of a multi-agent coordination game, and that some consolidation is desirable.

References

Chomsky, N. 1981. Lectures on Government and Binding. Foris.
Schiffer, L. K. and Maletti, A. 2021. Strong equivalence of TAG and CCG. Transactions of the Association for Computational Linguistics 9: 707-720.

Academic reviewing in NLP

It is obvious to me that NLP researchers are, on average, submitting manuscripts far earlier and more often than they ought to. The average manuscript I review is typo-laden, full of figures and tables that are far too small to read or that intrude on the margins, and accompanied by an unusable bibliography that the authors have clearly never inspected. Sometimes I receive manuscripts whose actual titles are transparently ungrammatical.

There are several reasons this is bad, but most of all it is a waste of reviewer time, since the reviewers have to point out (in triplicate or worse) minor issues that would have been flagged by proofreaders, advisors, or colleagues, had they been involved before submission. Then, once these issues are corrected, the reviewers are again asked to read the paper and confirm they have been addressed. This is work the authors could have done themselves, but which is instead pushed onto committees of unpaid volunteers.

The second issue is that the reviewer pool lacks relevant experience. I am regularly tasked with “meta-reviewing”, or critically summarizing the reviews. This is necessary in part because many, perhaps a majority, of the reviewers simply do not know how to review an academic paper, having not received instruction on this topic from their advisors or mentors, and their comments need to be recast in language that can be quickly understood by conference program committees.

[Moving from general to specific.]

I have recently been asked to review an uncommonly large collection of papers on the topic of prompt engineering. Several years ago, it became apparent that neural network language models, trained on enormous amounts of text data, could often provide locally coherent (though rarely globally coherent) responses to prompts or queries. The parade example of this type of model is GPT-2. For instance, if the prompt was:

Malfoy hadn’t noticed anything.

the model might continue:

“In that case,” said Harry, after thinking it over, “I suggest you return to the library.”

I assume this is because there’s fan fiction in the corpus, but I don’t really know. Now, it goes without saying that at no point will Facebook, say, launch a product in which a gigantic neural network is allowed to regurgitate Harry Potter fan fiction (!) at its users. However, researchers persist, for some reason (perhaps novelty), in trying to “engineer” clever prompts that produce subjectively “good” responses, rather than attempting to understand how any of this works. (It is not an overstatement to say that we have little idea why neural networks, and the methods we use to train them in particular, work at all.) What am I to do when asked to meta-review papers like this? I try to remain collegial, but I’m not sure this kind of work ought to exist at all. I consider GPT-2 a billionaire plaything, and a rather wasteful one at that, and it is hard for me to see how this line of work might make the world a better place.
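
For what it’s worth, eliciting such a continuation takes only a few lines. The sketch below assumes the Hugging Face transformers library and the publicly released gpt2 checkpoint; since it samples, the actual output will differ from run to run (and from the continuation quoted above).

```python
# Sampling a continuation from GPT-2 for the prompt quoted above.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Malfoy hadn't noticed anything."
result = generator(prompt, max_new_tokens=40, do_sample=True)
print(result[0]["generated_text"])
```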

The 24th century Universal Translator is unsupervised and requires minimal resources

The Star Trek: Deep Space Nine episode “Sanctuary” pretty clearly establishes that, by the 24th century, the Star Trek universe’s Universal Translator works in an unsupervised fashion and requires only what we in the real 21st century would consider a minimal monolingual corpus and a few hours of processing to translate Skrreean, a language new to Starfleet and friends. Free paper idea: how do the Universal Translator’s capabilities (in the 22nd through the 24th centuries, from Enterprise to the original series to the 24th-century shows) map onto known terms of art in machine translation in our universe?