How to write linguistic examples

There is a standard, well-designed way in which linguists write examples, and failure to use it in a paper about language is a strong shibboleth suggesting unfamiliarity with linguistics as a field. In brief, it is as follows:

When an example (affix, word, phrase, or sentence) appears in the body (i.e., the middle of a sentence):
- if written in Roman, it should be italicized.
- if written in non-Roman, but alphabetic scripts like Cyrillic, italicization is optional. (Cyrillic italics are, like the Russian cursive hand they’re based on, famously hard for Western amateurs like myself to read.)
- if written in a non-alphabetic script, it can just be written as is, though you’re welcome to experiment.
- Examples should never be underlined, bolded, or placed in single or double quotes, regardless of the script used.
When an example is set off from the body (i.e., as a numbered example or in a table), it need not be italicized.
Any non-English example should be immediately followed with a gloss.
- A gloss should always be single-quoted.
- Don’t intersperse words like “meaning”, as in “…kitab meaning ‘book’…”, just write “…kitab ‘book’…”
If using morph-by-morph or word-by-word glossing, follow the Leipzig glossing conventions.

How to write numbers

A lot of students—and increasingly, given how young the field of NLP is—don’t know how to write numbers in papers. Here are a few basic principles (some of these are loosely based off the APA guidelines):

Use the same number of decimals every time and don’t omit trailing zeros after the decimal. Thus “.50” or “.5000” and not “.5”.
Round to a small number of decimals: 2, 4, or 6 are all standard choices.
Omit leading zeros before the decimal if possible values of whatever quantity are always within [0, 1], thus you might say you got “.9823” accuracy.
(For LaTeX users) put the minus sign in math mode, too, or it’ll appear as a hyphen (ASCII char 45), which is quite a bit shorter and just looks wrong.
Use commas to separate the hundreds and thousands place (etc.) in large integers, and try not to use too many large exact integers; rounding is fine once they get large.
Expressions like “3k”, “1.3m” and “2b” are too informal; just write “3,000”, “1.3 million”, and “2 billion”.
Many evaluation metrics can either be written as (pseudo-)probabilities or percentages. Pick one or the other format and stick with it.

A few other points about tables with numbers (looking at you LaTeX users):

Right-align numbers in tables.
Don’t put two numbers (like mean and standard deviation or a range) in a single cell; the alignment will be all wrong. Just use more cells and tweak the intercolumnar spacing.
Don’t make the text of your tables smaller than the body text, which makes the table hard to read. Just redesign the table instead.

Moneyball Linguistics

[This is just a fun thought experiment. Please don’t get mad.]

The other day I had an intrusive thought: the phrase moneyball linguistics. Of course, as soon as I had a moment to myself, I had to sit down and think what this might denote. At first I imagined building out a linguistics program on a small budget like Billy Beane and the Oakland A’s. But it seems to me that linguistics departments aren’t really much like baseball teams—they’re only vaguely competitive (occasionally for graduate students or junior faculty), there’s no imperative to balance the roster, there’s no DL list (or is just sabbatical?), and so on—and the metaphor sort of breaks down. But the ideas of Beane and co. do seem to have some relevance to talking about individual linguists and labs. I don’t have OBP or slugging percentage for linguists, and I wouldn’t dare to propose anything so crude, but I think we can talk about linguists and their research as a sort of “cost center” and identify two major types of “costs” for the working linguist:

cash (money, dough, moolah, chedda, cheese, skrilla, C.R.E.A.M., green), and
carbon (…dioxide emissions).

I think it is a perfectly fine scientific approximation (not unlike competence vs. performance) to treat the linguistic universe as having a fixed amount of cash and carbon, so that we could use this thinking to build out a roster-department and come in just under the pay cap. While state research budgets do fluctuate—and while our imaginings of a better world should also include more science funding—it is hard to imagine near-term political change in the West would substantially increase it. And similarly, while there is roughly 10¹² kg of carbon in the earth’s crust, climate scientists agree that the vast majority of it really ought to stay there. Finally, I should note that maybe we shouldn’t treat these as independent factors, given that there is a non-trivial amount of linguistics funding via petrodollars. But anyways, without further ado, let’s talk about some types of researchers and how they score on the cash-and-carbon rubric.

Armchair research: The armchairist is clearly both low-cash (if you don’t count the sports coats) and low-carbon (if you don’t count the pipe smoke).
Field work: “The field” could be anywhere, even the reasonably affordable, accessible, and often charming Queens, the archetypical fieldworker is flying in, first on a jet and then maybe reaches their destination via helicopter or seaplane. Once you’re there though, life in the field is often reasonably affordable, so this scores as low-cash, high-carbon.
Experimental psycholinguistics: Experimental psycholinguists have reasonably high capital/startup costs (in the form of eyetracking devices, for instance) and steady marginal costs for running subjects: the subjects themselves may come from the Psych 101 pool but somebody’s gotta be paid to consent them and run them through the task. We’ll call this medium-cash, low-carbon.
Neurolinguistics: The neurolinguistic imaging technique du jour, magnetoencephalography (or MEG), requires superconducting coils cooled to a chilly 4.2 K (roughly −452 °F); this in turn is accomplished with liquid helium. Not only is the cooling system expensive and power-hungry, the helium is mostly wasted (i.e., vented to the atmosphere). Helium is itself the second-most common element out there, but we are quite literally running out of the stuff here on Earth. So, MEG, at least, is high-cash, high-carbon.
Computational linguistics: there was a time not so long ago when I would said that computational linguists were a bunch of hacky-sackers filling up legal pads with Greek letters (the weirder the better) and typing some kind of line noise they call “Haskell” into ten-year-old Thinkpads. But nowadays, deep learning is the order of the day, and the substantial carbon impact from these methods are well-documented, or at least well-estimated (e.g., Strubell et al. 2019). Now, it probably should be noted that a lot of the worst offenders (BigCos and the Quebecois) locate their data centers near sources of plentiful hydroelectric power, but not all of us live within the efficient transmission zones for hydropower. And of course, graphics processing units are expensive too. So most computational linguistics is, increasingly, high-cash, high-carbon.

On a more serious note, just so you know, unless you run an MEG lab or are working on something called “GPT-G6”, chances are your biggest carbon contributions are the meat you eat, the cars you drive, and the short-haul jet flights you take, not other externalities of your research.

References

Strubell, M., Ganesh, A. and McCallum, A. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645-3650.

My SYNC talk

The handout from my SYNC talk, on phonotactics, is available here. I plan to revise it in the next few months and then post it at LingBuzz at some point.

“I understood the assignment”

We do a lot of things downstream with the machine learning tool we build, but not always can a model reasonably say it “understood the assignment” in the sense that the classifier is trained to do exactly what it we are making it do.

Take for example, Yuan and Liberman (2011), who study the realization of word-final ing in American English. This varies between a dorsal variant [ɪŋ] and a coronal variant [ɪn].¹ They refer to this phenomenon using the layman’s term g-dropping; I will use the notation (ing) to refer to all variants. They train Gaussian mixture models on this distinction, then enrich their pronunciation dictionary so that each word can be pronounced with or without g-dropping; it is as if the two variants are homographs. Then, they perform a conventional forced alignment; as a side effect, it determines which of the “homographs” was most likely used. This does seem to work, and is certainly very clever, but strikes me as a mild abuse of the forced alignment technique, since the model was not so much trained to distinguish between the two variants as produce a global joint model over audio and phoneme sequences.

What would an approach to the g-dropping problem that better understood the assignment look like? One possibility would be to run ordinary forced alignment, with an ordinary dictionary, and then extract all instances of (ing). The alignment would, naturally, give us reasonably precise time boundaries for the relevant segments. These could then be submitted to a discriminative classifier (perhaps an LSTM) trained to distinguish the various forms of (ing). In this design, one can accurately say that the two components, aligner and classifier, understand the assignment. I expect that this would work quite a bit better than what Yuan and Liberman did, though that’s just conjecture at present.

Some recent work by my student Angie Waller (published as Waller and Gorman 2020), involved an ensemble of two classifiers, one which more clearly understood the assignment than the other. The task here was to detect reviews of professors which are objectifying, in the sense that they make off-topic, usually-positive, comments about the professors’ appearance. One classifier makes document-level classifications, and cannot be said to really understand the assignment. The other classifier attempts to detect “chunks” of objectifying text; if any such chunks are found, one can label the entire document as objectifying. While neither technique is particularly accurate (at the document level), the errors they make are largely uncorrelated so an ensemble of the two obtains reasonably high precision, allowing us to track trends in hundreds of thousands of professor reviews over the last decade.

Endnotes

This doesn’t exhaust the logical possibilities of variation; for instance, for some speakers (including yours truly), there is a variant with a tense vowel followed by the coronal nasal.

References

Waller, A. and Gorman, K. 2020. Detecting objectifying language in online professor reviews. In Proceedings of the Sixth Workshop on Noisy User-Generated Text, pages 171-180.
Yuan, J. and Liberman, M. 2011. Automatic detection of “g-dropping” in American English using forced alignment. In IEEE Workshop on Automatic Speech Recognition & Understanding, pages 490-493.

Anatomy of an analogy

I have posted a lightly-revised version of the handout of a talk I gave at Stony Brook University last November here on LingBuzz. In it, I argue that analogical leveling phenomena in Latin previously attributed to pressures against interparadigmatic analogy or towards phonological process overapplication are better understood as the result of Neogrammarian sound change, loss of productivy, and finally covert reanalysis.

Abbreviation paper is now up

Our Findings of EMNLP paper about abbreviation expansion is now up, along with the associated data released under an Apache 2.0 license.

Don’t take money from the John Templeton Foundation

Don’t take money from the John Templeton Foundation. They backed the murderous Chicago School economists, the genocidal architects of the war on Iraq, and are among the largest contributors to the climate change denial movement. That’s all.

Linguistics has its own Sokal affair

The Sokal affair was a minor incident in which physics professor Alan Sokal published a “hoax” (his term) paper in the cultural studies journal Social Text. Sokal’s intent was to demonstrate that reviewers and editors would approve of an article of utter nonsense so long as it obeyed certain preconceived notions, in this case that everything is a social construct. (It is, but that’s a story for another blog.)

The affair has been “read” many ways but it is generally understood to illustrate poor editorial standards at top humanities journals and/or the bankruptcy of the entire cultural studies enterprise. However, I don’t think we have any reason to suspect that either of these critiques are limited to cultural studies and adjacent fields.

I submit that the Pirahã recursion affair has many of the makings of a linguistic Sokal affair. But if anything, the outlook for linguistics is quite a bit worse than the Sokal story. By all accounts, Sokal’s hoax article was a minor scholarly event, and does not seem to have received much attention before it was revealed to be a hoax. In contrast, when Everett’s article first appeared in Current Anthropology in 2005, it received an enormous amount of attention from both scholars and the press, and ultimately led to to multiple books, including a sympathetic portrait of Everett and his work by none other than the late Tom Wolfe (bang! krrp!). Finally, nearly all of what Everett has written on the subject is manifest nonsense.

I believe many scholars in linguistics and adjacent fields found Everett’s claim compelling, and while I think linguists should have seen through the logical leaps and magical thinking in the Current Anthropology piece, it wasn’t until a few years later, after the exchange with Nevins et al. in Language, that the empirical issues (to put it mildly) with Everett’s claims came to light. But the key element which gave Everett’s work such influence is that, like Sokal intended his hoax to do, it played to the biases (anti-generativist, and particularly, anti-Noam Chomsky) of a wide swath of academics (and to a lesser degree, fans of US empire, like Tom Wolfe). In that regard, it scarcely matters whether Everett himself believes or believed what he wrote: we have all been hoaxed.

What phonotactics-free phonology is not

In my previous post, I showed how many phonological arguments are implicitly phonotactic in nature, using the analysis of the Latin labiovelars as an example. If we instead adopt a restricted view of phonotactics as derived from phonological processes, as I argue for in Gorman 2013, what specific forms of argumentation must we reject? I discern two such types:

Arguments from the distribution of phonemes in URs. Early generative phonologists posited sequence structure constraints, constraints on sequences found in URs (e.g, Stanley 1967, et seq.). This seems to reflect more the then-contemporary mania for information theory and lexical compression, ideas which appear to have lead nowhere and which were abandoned not long after. Modern forms of this argument may use probabilistic constraints instead of categorical ones, but the same critiques remain. It has never been articulated why these constraints, whether categorical or probabilistic, are considered key acquirenda. I.e., why would speakers bother to track these constraints, given that they simply recapitulate information already present in the lexicon. Furthermore, as I noted in the previous post, it is clear that some of these generalizations are apparent even to non-speakers of the language; for example, monolingual New Zealand English speakers have a surprisingly good handle on Māori phonotactics despite knowing few if any Māori words. Finally, as discussed elsewhere (Gorman 2013: ch. 3, Gorman 2014), some statistically robust sequence structure constraints appear to have little if any effect on speakers judgments of nonce word well-formedness, loanword adaptation, or the direction of language change.
Arguments based on the distribution of SRs not derived from neutralizing alternations. Some early generative phonologists also posited surface-based constraints (e.g., Shibatani 1973). These were posited to account for supposed knowledge of “wordlikeness” that could not be explained on the basis of constraints on URs. One example is that of German, which has across-the-board word-final devoicing of obstruents, but which clearly permits underlying root-final voiced obstruents in free stems (e.g., [gʀaːt]-[gʀaːdɘ] ‘degree(s)’ from /grad/). In such a language, Shibatani claims, a nonce word with a word-final voiced obstruent would be judged un-wordlike. Two points should be made here. First, the surface constraint in question derives directly from a neutralizing phonological process. Constraint-based theories which separate “disease” and “cure” posit a constraint against word-final obstruents, but in procedural/rule-based theories there is no reason to reify this generalization, which after all is a mere recapitulation of the facts of alternation, arguably more a more entrenched source of evidence for grammar construction. Secondly, Shibatani did not in fact validate his claim about German speakers’ in any systematic fashion. Some recent work by Durvasula & Kahng (2019) reports that speakers do not necessarily judge a nonce word to be ill-formed just because it fails to follow certain subtle allophonic principles.

References

Durvasula, K. and Kahng, J. 2019. Phonological acceptability is not isomorphic with phonological grammaticality of stimulus. Talk presented at the Annual Meeting on Phonology.
Gorman, K. 2013. Generative phonotactics. Doctoral dissertation, University of Pennsylvania.
Gorman, K. 2014. A program for phonotactic theory. In Proceedings of the Forty-Seventh Annual Meeting of the Chicago Linguistic Society: The Main Session, pages 79-93.
Shibatani, M. 1973. The role of surface phonetic constraints in generative phonology. Language 49(1): 87-106.
Stanley, R. 1967. Redundancy rules in phonology. Language 43(2): 393-436.