I have posted a lightly-revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed-yet description of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms, the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.
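For readers who have not met the term: a semiring is idempotent when adding any weight to itself gives that weight back. A minimal sketch, using the standard tropical and log semirings over negative log weights as examples; the function names are illustrative and are not taken from the slides, Pynini, or BaumWelch.

```python
import math


def tropical_plus(a: float, b: float) -> float:
    """Tropical semiring addition: keep the better (smaller) weight."""
    return min(a, b)


def log_plus(a: float, b: float) -> float:
    """Log semiring addition: sum the two probabilities in negative log space."""
    return -math.log(math.exp(-a) + math.exp(-b))


w = 1.5
# Idempotent: adding w to itself returns w unchanged.
print(tropical_plus(w, w) == w)  # True
# Not idempotent: adding w to itself gives w - log(2), a different weight.
print(log_plus(w, w) == w)  # False
```

In the tropical case, summing over paths amounts to picking the best one; in the log case, every path contributes to the total.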
Category: NLP
Evaluations from the past
In a literature review, speech and language processing specialists often feel tempted to report evaluation metrics like accuracy, F-score, or word error rate for the systems they describe. In my opinion, this is only informative if the prior and present work use the exact same data set(s) for evaluation. (Such results should probably be presented in a table along with results from the present work, not in the body of the literature review.) If, instead, the prior systems were tested on some proprietary data set, an obsolete corpus, or a data set the authors of the present work have declined to evaluate on, this information is inactionable. Authors should omit this information, and reviewers and editors should insist that it be omitted.
It is also clear to me that these numbers are rarely meaningful as measures of how difficult a task is “generally”. To take an example from an unnamed 2019 NAACL paper (one guilty of the sin described above), word error rates on a single task in a single language range between 9.1% and 23.61% (note also the mixed precision). What could we possibly reason from this enormous spread of results across different data sets?
On expanding acronyms
Student writers are often taught that acronyms should also be given in expanded form on first use. While this is a good rule of thumb in my opinion, there is an exception for any acronym whose expansion the author believes to be misleading about its referent, particularly when the acronym in question seems to have been coined after the fact and purely for the creator’s amusement.
“Many such cases.”
An author-date citation may be preferable to spelling out the silly acronym.
Logistic regression as the bare minimum. Or, Against naïve Bayes
When I teach introductory machine learning, I begin with (categorical) naïve Bayes classifiers. These are arguably the simplest possible supervised machine learning models, and can be explained quickly to anyone who understands probability and the method of maximum likelihood estimation. I then pivot and introduce logistic regression and its various forms. Ng and Jordan (2002) provide a nice discussion of how the two relate, and I encourage students to read their study.
Logistic regression is a more powerful technique than naïve Bayes. First, it is “easier” in some sense (Breiman 2001) to estimate the conditional distribution, as one does in logistic regression, than to model the joint distribution, as one does in naïve Bayes. Second, logistic regression can be learned using standard (online) stochastic gradient descent methods. Finally, it naturally supports the conventional regularization strategies needed to avoid overfitting. For these reasons, in 2022, I consider regularized logistic regression the bare minimum supervised learning method, the least sophisticated method that is possibly good enough. The pedagogical-instructional problem I then face is trying to convince students not to use naïve Bayes: it is obsolete (it is virtually always inferior to regularized logistic regression), and tools like scikit-learn (Pedregosa et al. 2011) make it almost trivial to swap one machine learning method for the other.
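To make the second and third points concrete, here is a minimal sketch of binary logistic regression trained by online stochastic gradient descent with an L2 penalty. It uses only the standard library and is not scikit-learn's implementation; the feature names, toy data, and hyperparameter values are made up for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Example = Tuple[List[str], int]  # (active binary features, label in {0, 1})
BIAS = "<bias>"


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(weights: Dict[str, float], features: List[str]) -> float:
    """Returns P(label = 1 | features) under the current weights."""
    return sigmoid(sum(weights.get(f, 0.0) for f in [BIAS] + features))


def train(data: List[Example], epochs: int = 50, lr: float = 0.1,
          l2: float = 0.01, seed: int = 0) -> Dict[str, float]:
    """Fits the weights one example at a time (online SGD)."""
    rng = random.Random(seed)
    weights: Dict[str, float] = {}
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            error = score(weights, features) - label  # gradient of log loss
            for f in [BIAS] + features:
                w = weights.get(f, 0.0)
                # Simplification: the L2 term is applied only to the features
                # active in this example, and never to the bias.
                penalty = 0.0 if f == BIAS else l2 * w
                weights[f] = w - lr * (error + penalty)
    return weights


toy_data: List[Example] = [
    (["good", "fun"], 1), (["good"], 1), (["great", "fun"], 1),
    (["bad"], 0), (["bad", "boring"], 0), (["boring"], 0),
]
weights = train(toy_data)
print(score(weights, ["good"]) > 0.5, score(weights, ["boring"]) > 0.5)
```

Setting `l2` to zero recovers unregularized maximum likelihood estimation; on separable data like this toy set, the weights would then keep growing with more epochs, which is exactly the overfitting the penalty is meant to curb.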
References
Breiman, Leo. 2001. Statistical modeling: the two cultures. Statistical Science 16:199-231.
Ng, Andrew Y., and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Proceedings of NeurIPS, pages 841-848.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, …, and Édouard Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12:2825-2830.
On “alternative” grammar formalisms
A common suggestion to graduate students in linguistics (computational or otherwise) is to study “alternative” grammar formalisms [not my term-KBG]. The implication is that the student is only familiar with formal grammars inspired by the supposedly-hegemonic generativist tradition (though it is not clear if we’re talking about the GB-lite of the Penn Treebank, the minimalist grammars (MGs) of Ed Stabler, or perhaps something else) and that the set of “alternatives” includes lexical-functional grammars (LFGs), tree-adjoining grammars (TAGs), combinatory categorial grammars (CCGs), head-driven phrase structure grammars (HPSGs), or one of the various forms of construction grammar. I would never say that students should study less rather than more, but I am not convinced this diversity of formalism is key to training well-rounded students. TAGs and CCGs are known to be strongly equivalent (Schiffer & Maletti 2021), and the major unification-based grammar systems (which include CCGs and HPSGs, and formal forms of construction grammars too) are equivalent to MGs. I speculate that maybe we should be emphasizing similarities rather than differences insofar as those differences are not represented in relative generative capacity.
Another useful way to determine the relative utility of alternative formalisms is to look at their actual use in wide-coverage computational grammars, since as Chomsky (1981: 6) says, it is possible to put systems to the test “only to the extent that we have grammatical descriptions that are reasonably compelling in some domain…”. Or put another way, grammar frameworks both hegemonic and alternative can be assessed for coverage (which can be extensive, in some languages and domains) or general utility rather than for the often-spicy rhetoric of their proponents.
Finally, it is at least possible that some alternative frameworks are simply losers of a multi-agent coordination game and at least some consolidation is desirable.
References
Chomsky, N. 1981. Lectures on Government and Binding. Foris.
Schiffer, L. K. and Maletti, A. 2021. Strong equivalence of TAG and CCG. Transactions of the Association for Computational Linguistics 9: 707-720.
Academic reviewing in NLP
It is obvious to me that NLP researchers are, on average, submitting manuscripts far earlier and more often than they ought to. The average manuscript I review is typo-laden, full of figures and tables far too small to actually read or intruding on the margins, with an unusable bibliography that the authors have clearly never inspected. Sometimes I receive manuscripts whose actual titles are transparently ungrammatical.
There are several reasons this is bad, but most of all it is a waste of reviewer time, since the reviewers have to point out (in triplicate or worse) minor issues that would have been flagged by proof-readers, advisors, or colleagues, were they involved before submission. Then, once these issues are corrected, the reviewers are again asked to read the paper and confirm they have been addressed. This is work the authors could have done, but which instead is pushed onto committees of unpaid volunteers.
The second issue is that the reviewer pool lacks relevant experience. I am regularly tasked with “meta-reviewing”, or critically summarizing the reviews. This is necessary in part because many, perhaps a majority, of the reviewers simply do not know how to review an academic paper, having not received instruction on this topic from their advisors or mentors, and their comments need to be recast in language that can be quickly understood by conference program committees.
[Moving from general to specific.]
I have recently been asked to review an uncommonly large collection of papers on the topic of prompt engineering. Several years ago, it became apparent that neural network language models, trained on enormous amounts of text data, could often provide locally coherent (though rarely globally coherent) responses to prompts or queries. The parade example of this type of model is GPT-2. For instance, if the prompt was:
Malfoy hadn’t noticed anything.
“In that case,” said Harry, after thinking it over, “I suggest you return to the library.”
I assume this is because there’s fan fiction in the corpus, but I don’t really know. Now it goes without saying that at no point will Facebook, say, launch a product in which a gigantic neural network is allowed to regurgitate Harry Potter fan fiction (!) at its users. However, researchers persist for some reason (perhaps novelty) in trying to “engineer” clever prompts that produce subjectively “good” responses, rather than attempting to understand how any of this works. (It is not an overstatement to say that we have little idea why neural networks, and the methods we use to train them in particular, work at all.) What am I to do when asked to meta-review papers like this? I try to remain collegial, but I’m not sure this kind of work ought to exist at all. I consider GPT-2 a billionaire plaything, a rather wasteful one at that, and it is hard for me to see how this line of work might make the world a better place.
LIWC is a joke
It’s a bunch of lists of words some guy’s trying to sell. They’re not even very good lists. You can just make your own domain-appropriate word lists in your language of choice.
The 24th century Universal Translator is unsupervised and requires minimal resources
The Star Trek: Deep Space Nine episode “Sanctuary” pretty clearly establishes that by the 24th century, the Star Trek universe’s Universal Translator works in an unsupervised fashion and requires only a (what we in the real 21st century would consider) minimal monolingual corpus and a few hours of processing to translate Skrreean, a language new to Starfleet and friends. Free paper idea: how do the Universal Translator’s capabilities (in the 22nd through the 24th century, from Enterprise to the original series to the 24th century shows) map onto known terms of art in machine translation in our universe?
Moneyball Linguistics
[This is just a fun thought experiment. Please don’t get mad.]
The other day I had an intrusive thought: the phrase moneyball linguistics. Of course, as soon as I had a moment to myself, I had to sit down and think what this might denote. At first I imagined building out a linguistics program on a small budget like Billy Beane and the Oakland A’s. But it seems to me that linguistics departments aren’t really much like baseball teams: they’re only vaguely competitive (occasionally for graduate students or junior faculty), there’s no imperative to balance the roster, there’s no DL (or is it just sabbatical?), and so on, so the metaphor sort of breaks down. But the ideas of Beane and co. do seem to have some relevance to talking about individual linguists and labs. I don’t have OBP or slugging percentage for linguists, and I wouldn’t dare to propose anything so crude, but I think we can talk about linguists and their research as a sort of “cost center” and identify two major types of “costs” for the working linguist:
- cash (money, dough, moolah, chedda, cheese, skrilla, C.R.E.A.M., green), and
- carbon (…dioxide emissions).
I think it is a perfectly fine scientific approximation (not unlike competence vs. performance) to treat the linguistic universe as having a fixed amount of cash and carbon, so that we could use this thinking to build out a roster-department and come in just under the pay cap. While state research budgets do fluctuate (and while our imaginings of a better world should also include more science funding), it is hard to imagine that near-term political change in the West would substantially increase that funding. And similarly, while there is roughly 10^12 kg of carbon in the earth’s crust, climate scientists agree that the vast majority of it really ought to stay there. Finally, I should note that maybe we shouldn’t treat these as independent factors, given that there is a non-trivial amount of linguistics funding via petrodollars. But anyways, without further ado, let’s talk about some types of researchers and how they score on the cash-and-carbon rubric.
- Armchair research: The armchairist is clearly both low-cash (if you don’t count the sports coats) and low-carbon (if you don’t count the pipe smoke).
- Field work: While “the field” could be anywhere, even the reasonably affordable, accessible, and often charming Queens, the archetypical fieldworker is flying in, first on a jet and then maybe reaching their destination via helicopter or seaplane. Once you’re there, though, life in the field is often reasonably affordable, so this scores as low-cash, high-carbon.
- Experimental psycholinguistics: Experimental psycholinguists have reasonably high capital/startup costs (in the form of eyetracking devices, for instance) and steady marginal costs for running subjects: the subjects themselves may come from the Psych 101 pool but somebody’s gotta be paid to consent them and run them through the task. We’ll call this medium-cash, low-carbon.
- Neurolinguistics: The neurolinguistic imaging technique du jour, magnetoencephalography (or MEG), requires superconducting coils cooled to a chilly 4.2 K (roughly −452 °F); this in turn is accomplished with liquid helium. Not only is the cooling system expensive and power-hungry, the helium is mostly wasted (i.e., vented to the atmosphere). Helium is itself the second-most common element out there, but we are quite literally running out of the stuff here on Earth. So, MEG, at least, is high-cash, high-carbon.
- Computational linguistics: There was a time not so long ago when I would have said that computational linguists were a bunch of hacky-sackers filling up legal pads with Greek letters (the weirder the better) and typing some kind of line noise they call “Haskell” into ten-year-old Thinkpads. But nowadays, deep learning is the order of the day, and the substantial carbon impact of these methods is well-documented, or at least well-estimated (e.g., Strubell et al. 2019). Now, it probably should be noted that a lot of the worst offenders (BigCos and the Quebecois) locate their data centers near sources of plentiful hydroelectric power, but not all of us live within the efficient transmission zones for hydropower. And of course, graphics processing units are expensive too. So most computational linguistics is, increasingly, high-cash, high-carbon.
On a more serious note, just so you know, unless you run an MEG lab or are working on something called “GPT-G6”, chances are your biggest carbon contributions are the meat you eat, the cars you drive, and the short-haul jet flights you take, not other externalities of your research.
References
Strubell, E., Ganesh, A. and McCallum, A. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645-3650.
“I understood the assignment”
We do a lot of things downstream with the machine learning tools we build, but a model cannot always reasonably say it “understood the assignment” in the sense that the classifier is trained to do exactly what we are making it do.
Take, for example, Yuan and Liberman (2011), who study the realization of word-final ing in American English. This varies between a dorsal variant [ɪŋ] and a coronal variant [ɪn].¹ They refer to this phenomenon using the layman’s term g-dropping; I will use the notation (ing) to refer to all variants. They train Gaussian mixture models on this distinction, then enrich their pronunciation dictionary so that each word can be pronounced with or without g-dropping; it is as if the two variants are homographs. Then, they perform a conventional forced alignment; as a side effect, it determines which of the “homographs” was most likely used. This does seem to work, and is certainly very clever, but strikes me as a mild abuse of the forced alignment technique, since the model was not so much trained to distinguish between the two variants as to produce a global joint model over audio and phoneme sequences.
What would an approach to the g-dropping problem that better understood the assignment look like? One possibility would be to run ordinary forced alignment, with an ordinary dictionary, and then extract all instances of (ing). The alignment would, naturally, give us reasonably precise time boundaries for the relevant segments. These could then be submitted to a discriminative classifier (perhaps an LSTM) trained to distinguish the various forms of (ing). In this design, one can accurately say that the two components, aligner and classifier, understand the assignment. I expect that this would work quite a bit better than what Yuan and Liberman did, though that’s just conjecture at present.
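The extraction step in this design might look something like the sketch below. It assumes the aligner's output has already been loaded into simple `Word` and `Phone` records with ARPAbet-style labels; these names are hypothetical and do not correspond to any particular aligner's API. The discriminative classifier itself is left out.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Phone:
    label: str    # ARPAbet-style label without stress, e.g. "IH" or "NG"
    start: float  # seconds
    end: float


@dataclass
class Word:
    text: str
    phones: List[Phone]


def is_vowel(phone: Phone) -> bool:
    # In ARPAbet, only vowel labels begin with one of these letters.
    return phone.label[0] in "AEIOU"


def ing_spans(words: List[Word]) -> Iterator[Tuple[str, float, float]]:
    """Yields (word, start, end) for the final vowel + nasal of each (ing)."""
    for word in words:
        if not word.text.lower().endswith("ing"):
            continue
        # Assumed filter: monosyllables like "king" do not vary, so skip
        # words with fewer than two vowels.
        if sum(is_vowel(p) for p in word.phones) < 2:
            continue
        vowel, nasal = word.phones[-2:]
        yield word.text, vowel.start, nasal.end


aligned = [
    Word("walking", [Phone("W", 0.10, 0.18), Phone("AO", 0.18, 0.30),
                     Phone("K", 0.30, 0.36), Phone("IH", 0.36, 0.42),
                     Phone("NG", 0.42, 0.50)]),
    Word("king", [Phone("K", 0.60, 0.66), Phone("IH", 0.66, 0.75),
                  Phone("NG", 0.75, 0.85)]),
]
print(list(ing_spans(aligned)))  # [('walking', 0.36, 0.5)]
```

Each span would then be cut from the audio and passed to the classifier, so the aligner only has to find the segment and the classifier only has to label it.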
Some recent work by my student Angie Waller (published as Waller and Gorman 2020) involved an ensemble of two classifiers, one of which more clearly understood the assignment than the other. The task here was to detect reviews of professors which are objectifying, in the sense that they make off-topic, usually-positive, comments about the professors’ appearance. One classifier makes document-level classifications, and cannot be said to really understand the assignment. The other classifier attempts to detect “chunks” of objectifying text; if any such chunks are found, one can label the entire document as objectifying. While neither technique is particularly accurate (at the document level), the errors they make are largely uncorrelated, so an ensemble of the two obtains reasonably high precision, allowing us to track trends in hundreds of thousands of professor reviews over the last decade.
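A toy illustration of why uncorrelated errors help precision. The data below are invented, and the combination rule (flag a review only when both classifiers agree) is an assumption for the sake of the example; it is not necessarily the rule used in the paper.

```python
from typing import List


def precision(gold: List[bool], predicted: List[bool]) -> float:
    """Fraction of predicted positives that are truly positive."""
    hits = [g for g, p in zip(gold, predicted) if p]
    return sum(hits) / len(hits)


def document_label(chunk_flags: List[bool]) -> bool:
    """A document is flagged if any one of its chunks is flagged."""
    return any(chunk_flags)


# Made-up toy data: eight reviews, the first three truly objectifying.
gold = [True, True, True, False, False, False, False, False]
# Votes from a hypothetical document-level classifier.
doc_votes = [True, True, False, True, False, False, False, True]
# Per-chunk votes from a hypothetical chunk-level classifier.
chunk_votes = [
    [False, True], [True], [True, False], [False],
    [True, False], [True], [False, False], [False],
]
chunk_labels = [document_label(flags) for flags in chunk_votes]
# Assumed rule: the ensemble flags a review only when both agree.
ensemble = [d and c for d, c in zip(doc_votes, chunk_labels)]

print(precision(gold, doc_votes))     # 0.5
print(precision(gold, chunk_labels))  # 0.6
print(precision(gold, ensemble))      # 1.0
```

Because the two classifiers' false positives fall on different reviews here, requiring agreement removes all of them, at the cost of missing the one true positive that only the chunk classifier caught.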
Endnotes
- This doesn’t exhaust the logical possibilities of variation; for instance, for some speakers (including yours truly), there is a variant with a tense vowel followed by the coronal nasal.
References
Waller, A. and Gorman, K. 2020. Detecting objectifying language in online professor reviews. In Proceedings of the Sixth Workshop on Noisy User-Generated Text, pages 171-180.
Yuan, J. and Liberman, M. 2011. Automatic detection of “g-dropping” in American English using forced alignment. In IEEE Workshop on Automatic Speech Recognition & Understanding, pages 490-493.