Linguistics has its own Sokal affair

The Sokal affair was a minor incident in which physics professor Alan Sokal published a “hoax” (his term) paper in the cultural studies journal Social Text. Sokal’s intent was to demonstrate that reviewers and editors would accept an article of utter nonsense so long as it flattered certain preconceived notions, in this case that everything is a social construct. (It is, but that’s a story for another blog.)

The affair has been “read” many ways but it is generally understood to illustrate poor editorial standards at top humanities journals and/or the bankruptcy of the entire cultural studies enterprise. However, I don’t think we have any reason to suspect that either of these critiques is limited to cultural studies and adjacent fields.

I submit that the Pirahã recursion affair has many of the makings of a linguistic Sokal affair. But if anything, the outlook for linguistics is quite a bit worse than the Sokal story. By all accounts, Sokal’s hoax article was a minor scholarly event, and does not seem to have received much attention before it was revealed to be a hoax. In contrast, when Everett’s article first appeared in Current Anthropology in 2005, it received an enormous amount of attention from both scholars and the press, and ultimately led to multiple books, including a sympathetic portrait of Everett and his work by none other than the late Tom Wolfe (bang! krrp!). Finally, nearly all of what Everett has written on the subject is manifest nonsense.

I believe many scholars in linguistics and adjacent fields found Everett’s claim compelling, and while I think linguists should have seen through the logical leaps and magical thinking in the Current Anthropology piece, it wasn’t until a few years later, after the exchange with Nevins et al. in Language, that the empirical issues (to put it mildly) with Everett’s claims came to light. But the key element which gave Everett’s work such influence is that, as Sokal intended his hoax to do, it played to the biases (anti-generativist, and particularly anti-Chomskyan) of a wide swath of academics (and, to a lesser degree, of fans of US empire, like Tom Wolfe). In that regard, it scarcely matters whether Everett himself believes or believed what he wrote: we have all been hoaxed.

Does GPT-3 have free speech rights?

I have some discomfort with this framing. It strikes me as unnecessarily frivolous about some serious questions. Here is an imagined dialogue.

Should GPT-3 have the right to free speech?

No. Software does not have rights, nor should it. Living things are the only agents in moral-ethical calculations. Free speech as it is currently construed should also be recognized as a civic myth of the United States, one not universally shared. Furthermore, it should be recognized that all rights, including the right to self-expression, can impinge upon the rights and dignity of others.

What if a court recognized a free-speech right for GPT-3?

Then that court would be illegitimate. However, it is very easy to imagine this happening in the States given that the US “civic myth” is commonly used to provide extraordinary legal protections to corporate entities.

What if that allowed it to spread disinformation?

Then the operator would be morally responsible for all consequences of that dissemination.

They’re going to tell you…

…at some very near point in the future, that there’s something inherently white supremacist about teaching and studying generative linguistics. They will never tell you how generative linguistics enforces white supremacy, but they will tell you that it represents a hegemonic power in the science of language (it does not, it is clearly just one way of knowing, spottily represented outside the Anglophone West) and that it competes for time and mindshare with other forms of linguistic knowledge (an unexamined austerity mindset). This rhetorical trick—the same one used to slander the socialist left across the democratic West from 2016 to the present—would simply not work on the generative community were they a militant, organized, self-assured vanguard rather than a casualized, disorganized, insecure community, one seriously committed to diversity in race and sexual orientation but largely uninterested in matters of class and power. And then, once you’ve accepted their framing, they’re going to sell you a radically empiricist psycho-computational mode of inquiry that is deeply incurious about language diversity, that cares not a whit for the agency of speakers, and trains students to serve the interests of the most powerful men in the world.

Words and what we should do about them

Every January, linguists and dialectologists gather for the annual meeting of the Linguistic Society of America and its sister societies. And, since 1990, attendees crowd into a conference room to vote for the American Dialect Society’s Word Of The Year (or WOTY for short). The guidelines for nominating and selecting the WOTY are deliberately underdetermined. There are no rules about what’s a word (and, increasingly, picks are not even words under any recognizable definition thereof), what makes a word “of the year” (should it be a new coinage? should its use be vigorous or merely on the rise? should it be stereotyped or notorious? should it reflect the cultural zeitgeist?), or even whether the journalists in the room are eligible to vote.

By my count, there are two major categories of WOTY winners over the last three decades: commentary on US and/or world political events, and technological jargon. I count 14 in the former category (1990’s bushlips, 1991’s mother of all, 2000’s chad, 2001’s 9-11, 2002’s WMD, 2004’s red state/blue state, 2005’s truthiness, 2007’s subprime and 2008’s bailout, 2011’s occupy, 2014’s #blacklivesmatter, 2016’s dumpster fire, 2017’s fake news, 2018’s tender-age shelter) and 9 in the latter (1993’s information superhighway, 1994’s cyber, 1995’s web, 1997’s millennium bug, 1998’s e-, 1999’s Y2K, 2009’s tweet, 2010’s app, 2012’s hashtag). But, as Allan Metcalf, former executive secretary of the American Dialect Society, writes in his 2004 book Predicting New Words: The Secrets of Their Success, terms which comment on a situation—rather than fill some denotational gap—rarely have much of a future. And looking back, not only do some of these picks fail to recapitulate the spirit of the era, but many (bushlips, newt, morph, plutoed) barely denote at all. Of those still recognizable, it is shocking how many refer to—avoidable—human tragedies: a presidential election decided by a panel of judges, two bloody US incursions into Iraq and the hundreds of thousands of civilian casualties that resulted, the subprime mortgage crisis and the unprecedented loss of black wealth that resulted, and unchecked violence by police and immigration officers against people of color and asylum-seekers.

Probably the clearest example of this is the 2018 WOTY, tender-age shelter. This ghoulish euphemism was not, in my memory, a prominent 2018 moment, so for the record: it refers to a Trump-era policy of separating asylum-seeking immigrants from their children. Thus, “they’re not child prisons, they’re…”. Ben Zimmer, who organizes the WOTY voting, opined that this was a case of bureaucratic language backfiring, but I disagree: there was no meaningful blowback. The policy remains in place, and the people who engineered it remain firmly in power for the foreseeable future, just as do the architects of and propagandists for the invasions of Iraq (one of whom happens to be a prominent linguist!), the subprime mortgage crisis, and so on. Tender-age shelter is of course by no means the first WOTY that attempts to call out right-wing double-talk, but as satire it fails. There’s no premise—it is not even in the common ground that the US linguistics community (or the professional societies which represent it) fervently desires an end to the aggressive detention and deportation of undocumented immigrants, which after all has been bipartisan policy for decades, and will likely remain so until at least 2024—and without this there is no irony to be found. Finally, it bespeaks a preoccupation with speech acts rather than dire material realities.

This is not the only dimension on which the WOTY community has failed to self-criticize. A large number of WOTY nominees (though few outright winners) of the last few years have clear origins in the African-American community (e.g., 2017 nominees wypipo, caucasity, and 🐐, 2018 nominees yeet and weird flex but OK, 2019 nominees Karen and woke). Presumably these terms become notable to the larger linguistics community via social media. It is certainly possible for the WOTY community to celebrate the language of people of color, but it is also possible to read this as exoticization. The voting audience, of course, is upper-middle-class and mostly white, and here these “words”, some quite well-established in the communities in which they originate, compete for novelty and notoriety against tech jargon and of-the-moment political satire. As scholars of color have noted, this could easily reinforce standard ideologies that view African-American English as a debased form of mainstream English rather than a rich, rule-governed system in its own right. In other words, the very means by which we as linguists engage in public-facing research risk reproducing linguistic discrimination:

How might linguistic research itself, in its questions, methods, assumptions, and norms of dissemination, reproduce or work against racism? (“LSA Statement on Race”, Charity Hudley & Mallinson 2019)

I conclude that the ADS should issue stringent guidance about what makes expressions “words”, and what makes them “of the year”. In particular, these guidelines should orient voters towards linguistic novelty, something the community is well-situated to assess.

Elizabeth Warren and the morality of the professional class

I am surprised by the outpouring of grief among my professional friends and colleagues engendered by Senator Elizabeth Warren’s exit from the presidential primary. I dare not tell them how they ought to feel, but the spectacle of grief makes me wonder whether my friends are selling themselves short: virtually all of them have lived, in my opinion, far more virtuous lives than the senator from Massachusetts.

First off, none of them have spent most of their professional lives as right-wing activists, as did Warren, a proud Republican until the late ’90s. As recently as 1991, Warren gave a keynote at a meeting of the Federalist Society, the shadowy anti-choice legal organization that gave us Justice Brett Kavanaugh and so many other young ultra-conservative judicial appointees.

Secondly, Warren spent decades lying about her Cherokee heritage, presumably for nothing more than professional gain. This is a stunningly racist personal behavior, one that greatly reinforces white supremacy by equating the almost-unimaginable struggles of indigenous peoples with plagiarized recipes and “high cheekbones”. Were any of my friends or colleagues caught lying so blatantly on a job application, they would likely be subject to immediate termination. It is shocking that Warren has not faced greater professional repercussions for this lapse in judgment.

Warren’s more recent history of regulatory tinkering around the most predatory elements of US capitalism, while important, is hardly appropriate penance for these two monumental personal-professional sins.

Action, not ritual

It is achingly apparent that an overwhelming amount of research in speech and language technologies considers exactly one human language: English. This is done so unthinkingly that some researchers seem to see the use of English data (and only English data) as obvious, so obvious as to require no comment. This is unfortunate in part because English is, typologically speaking, a bit of an outlier. For instance, it has uncommonly impoverished inflectional morphology, a particularly rigid word order, and a rather large vowel inventory. It is not hard to imagine how lessons learned designing for—or evaluating on—English data might not generalize to the rest of the world’s languages. In an influential paper, Bender (2009) encourages researchers to be more explicit about the languages studied, and this, framed as an imperative, has come to be called the Bender Rule.

This “rule”, and the aforementioned observations underlying it, have taken on an almost mythical interpretation. They can easily be seen as a ritual granting the authors a dispensation to continue their monolingual English research. But this is a mistake. English hegemony is not merely bad science, nor is it a mere scientific inconvenience—a threat to validity.

It is no accident of history that the scientific world is in some sense an English colony. Perhaps you live in a country that owes an enormous debt to a foreign bank, and the bankers are demanding cuts to social services or reductions in tariffs: then there’s an excellent chance the bankers’ first language is English and that your first language is something else. Or maybe, fleeing the chaos of austerity and intervention, you find yourself and your children in cages in a foreign land: chances are you are in Yankee hands. And it is no accident that the first large-scale treebank is a corpus of English rather than of Delaware or Nahuatl or Powhatan or even Spanish, nor that the entire boondoggle was paid for by the largest military apparatus the world has ever known.

Such material facts respond to just one thing: concrete actions. Rituals, indulgences, or dispensations will not do. We must not confuse the act of perceiving and naming the hegemon with the far more challenging act of actually combating it. It is tempting to see the material conditions dualistically, as a sin we can never fully cleanse ourselves of. But they are the past, and a more equitable world is only to be found in the future, a future of our own creation. It is imperative that we—as a community of scientists—take steps to build the future we want.

References

Bender, Emily M. 2009. Linguistically naïve != language independent: why NLP needs linguistic typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics, pages 26–32.

A first impression of Rome

Rome is a body, freshly exhumed, straddled on all sides by leering police.

Rome is dominated by ruins: the forum is a boneyard, the colosseum a headstone, the Aurelian walls fence in the hallowed ground. Few visible works predate the emperor Trajan, who lived to see Rome reach its territorial maximum and saw it begin its long decline. The archaeological mode in Rome is to unearth, to lay out before us whatever can be found, as it was found. Reconstruction is mostly confined to pathetic medieval attempts to extract from the pagan monumenta a bit of glory for Christ. There are the shrines—and later a church—built on the floor of the colosseum at a time when the inner corridors were turned into barns for sheep, and temples rededicated as shrines or churches, though these lack the fine marble facings and bronze ceilings they had in an earlier era.

The ongoing excavation of the Domus Aurea is the most striking Roman corpse. Nero was not so much a madman as the first disaster capitalist. After the great fire of 64 CE—a fire that may have been started by the emperor’s confederates—Nero seized a full third of Rome for a pleasure palace, a ‘golden house’. Years later it fell to Vespasian to drain Nero’s colossal artificial lake and construct in its place the amphitheatre we now call the colosseum. As for the halls of the Domus, Trajan—by comparison a great liberalizer and homo populi—stripped them of their marble facings and filled them with rubble, and they became the foundations, and the sewers, of a great public bath. The only hints of their one-time splendor are the fine frescoes that can be seen on weekend tours offered while the excavators rest.

Standing over the ancient Roman corpse is a huge mass of police, of which there are both far too many and far too many types. There is the esercito—the army—of whom it is said, “at least they are competent”. There are the carabinieri, who use the iconography of musketeers but whose portfolio includes roles played by state troopers and sheriffs, the DEA and the FBI, the Pinkertons and second-world paramilitaries. And there are also the polizia and the guardia. As far as one can tell, small bands of the various armed forces have been camping at their chosen corners for years, doing little more than smoking cigarettes and talking amongst themselves. The duplication of effort is exquisite, and not without a bit of apparent dispute over turf. One will not uncommonly find the entrance to a church guarded over by the esercito and the exit by the polizia. The effect would be chilling were the patient not so long dead, so long in the ground.

arXiv vs. LingBuzz

In the natural language processing community, there has been a bit of a kerfuffle about the ACL preprint policy, which essentially prevents you from submitting a manuscript to preprint aggregation websites like arXiv while the m.s. is also under review for a conference. I personally think this is a good policy: double-blind review is really important for fairness. This led me to reflect a bit on the outsized role that arXiv plays in natural language processing research. It is interesting to contrast arXiv with LingBuzz, a preprint aggregator for formal linguistics research.1 arXiv is visually ugly and cluttered, expensive (it somehow takes over $800,000 of the Simons Foundation’s money to run every year), and its submissions are subject to detailed, strict, carefully enforced editorial guidelines. In contrast, LingBuzz has a minimalistic text interface, is run and operated by a single professor (Michael Starke at the University of Tromsø), and its editorial guidelines are simple (they fit on a single page) and laxly enforced (mostly after the fact). Despite the laissez-faire attitude at LingBuzz, it has seen some rather contentious debates involving the usual trollish suspects (Postal, Everett, Behme, etc.), but it has managed to keep things under control. But what I really love about LingBuzz is that unlike arXiv, no linguist is under the impression that it is any sort of substitute for peer review, or that authors need to know about (and cite) late-breaking work only available on LingBuzz. I think NLP researchers should take a hint from this and stop pretending arXiv is a reasonable alternative to peer review.

Endnotes

1. There are a few other such repositories. The Rutgers Optimality Archive (ROA) was once a popular repository for pre-prints of Optimality Theory work, but its contents are re-syndicated on LingBuzz and Optimality Theory is largely dead anyways. There is also the Semantics Archive.

A minimalist project design for NLP

Let’s say you want to build a new tagger, a new named entity recognizer, a new dependency parser, or whatever. Or perhaps you just want to see how your coreference resolution engine performs on your new database of anime reviews. So how should you structure your project? Here’s my minimalist solution.

There are two principles that guide my design. The first one is modularity. Some of these components will get run many times, some won’t. If you’re doing model comparison—and you should be doing model comparison—some components will get swapped out for someone else’s code. This sort of thing is a major lift unless you opt for modularity. The second principle is filesystem state. The filesystem is your friend. If your embedding table eats up all your RAM and you have to restart, the filesystem will be in roughly the same state as when you left. The filesystem allows you to organize things into directories and subdirectories, and to give the pieces informative names; I like to record information about datasets and hyperparameter values in my file and directory names. So without further ado, here are the recommended scripts or applications to create when you’re starting off on a new project.

  1. split takes the full dataset and a random seed (which you should store for later) as input. The script reads the data in, randomly shuffles it, and then splits it into an 80% training set, a 10% development set, and a 10% test (i.e., evaluation) set, which it then outputs; a minimal sketch of this script follows the list. If you’re comparing to prior work that used a “standard split” you may want to have a separate script that generates that too, but I strongly recommend using randomly generated splits.
  2. train takes the training set as input and outputs a model file or directory. If you’re automating hyperparameter tuning you will also want to provide the development set as input; if not you will probably want to either add a bunch of flags to control the hyperparameters or allow the user to pass some kind of model configuration file (I like YAML for this).
  3. apply takes as input the model file(s) produced in (2) and the test set, and applies the model to the data, outputting a new hypothesized test data set (i.e., the model’s predictions). One open question is whether this ought to take only unlabeled data or should overwrite the existing labels: it depends.
  4. evaluate takes as input the gold test set and the hypothesized test data set generated in (3) and outputs the evaluation results (as text or in some structured data format—sometimes YAML is a good choice, other times TSV files will do). I recommend you test this with a small amount of data first.
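
To make (1) concrete, here is a minimal sketch of what a split script might look like in Python. The flag names, the one-example-per-line data format, and the hard-coded 80/10/10 proportions are my own assumptions; adapt them to your task.

```python
#!/usr/bin/env python
"""Splits a dataset into training, development, and test sets.

A sketch only: assumes one example per line and the 80/10/10
proportions suggested above."""

import argparse
import random


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("input", help="path to the full dataset")
    parser.add_argument("train", help="output path for the training set")
    parser.add_argument("dev", help="output path for the development set")
    parser.add_argument("test", help="output path for the test set")
    parser.add_argument(
        "--seed", type=int, required=True, help="random seed (store this!)"
    )
    args = parser.parse_args()
    random.seed(args.seed)
    with open(args.input) as source:
        lines = source.readlines()
    random.shuffle(lines)
    # Computes the boundaries of the 80%/10%/10% split.
    n = len(lines)
    train_end = int(n * 0.8)
    dev_end = int(n * 0.9)
    for path, subset in (
        (args.train, lines[:train_end]),
        (args.dev, lines[train_end:dev_end]),
        (args.test, lines[dev_end:]),
    ):
        with open(path, "w") as sink:
            sink.writelines(subset)


if __name__ == "__main__":
    main()
```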

That’s all there is to it. When you begin doing model comparison you may find yourself swapping out (2–3) for somebody else’s code, but make sure to still stick to the same evaluation script.
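
Since everything hinges on holding (4) constant across models, here is an equally minimal sketch of an evaluation script. Again, the one-label-per-line format is my assumption, and accuracy stands in for whatever metric your task actually calls for.

```python
#!/usr/bin/env python
"""Computes accuracy given gold and hypothesized label files.

A sketch only: assumes one label per line, in the same order
in both files."""

import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("gold", help="path to the gold test labels")
    parser.add_argument("hypo", help="path to the hypothesized labels")
    args = parser.parse_args()
    correct = 0
    total = 0
    with open(args.gold) as gold, open(args.hypo) as hypo:
        # Iterates over the two files line by line, in parallel.
        for gold_label, hypo_label in zip(gold, hypo):
            correct += gold_label.strip() == hypo_label.strip()
            total += 1
    print(f"accuracy:\t{correct / total:.4f}")


if __name__ == "__main__":
    main()
```

Whichever models you compare, and whoever wrote them, their predictions all flow through this one script, so the numbers remain comparable.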

I read “Language: The Cultural Tool”. You’ll never guess what happened next.

I recently obtained a copy of Daniel Everett’s pop-science paperback Language: The Cultural Tool (2012) from the Brooklyn Public Library. The chunky fonts of the cover made me think I was about to enter the world of a staunch iconoclast. But what I actually found was a laundry list of what you might call “grievance studies”—if that didn’t already mean something else—against a broadly generativist conception of language.

Everett, once a specialist in languages of the Amazon, does not draw so much from niche fieldwork as from splashy papers by non-linguists in high-impact pop-science journals like Nature and Science. Thanks to my colleague Richard Sproat, I have seen how those august organizations make their sausage: they either don’t let linguists referee, or if they do, they simply ignore their negative reviews. (Everett, as it happens, has glowing things to say about the latter paper even though it has nothing in particular to do with his titular thesis.) In general, the works cited draw from disparate areas that have received relatively little attention from specialists, so while Everett is a decent prose stylist,1 he is tilting at windmills for much of the book.

Everett often substitutes appeals to authority for actual arguments. For instance:2

Michael Tomasello, the Director of Psycholinguistics at the Max Planck Institute for Evolutionary Anthropology in Leipzig, says exactly this. A world leader in the study of cognitive development in canines and primates, including humans, he says simply ‘Universal grammar is dead.’ It was a good idea. It didn’t pan out. (p. 192)

That’s all we get on that point.

The other thing I was struck with were elementary factual errors that would have been cleaned up had literally any other linguist read the book before it went to press. Early on, Everett is discussing definitions of language. After describing the proposed definitions by Sweet and by Bloch and Trager, he quotes (p. 32) a passage from Noam Chomsky (the reference is neither given nor known to me):

A formal language is a (usually infinite) set of sequences of symbols (such sequences are “strings”) constructed by applying production rules to another sequence of symbols which initially contains just the start symbol.

Now, obviously this is not a definition of language as we understand it, but rather the start of a definition of the mathematical construct formal language, a notion which predates Chomsky by at least half a century. Everett is either deeply confused or is deliberately misleading his readers.3 The second howler I found is the following passage, now from Everett:

The late Professor George Zipf of Howard University formulated an explanation of the relative lengths of words that has come to be known as ‘Zipf’s Law.’ His law predicts that more frequent words will be shorter than less frequent words. (p. 106)

George Kingsley Zipf taught at Harvard University, not Howard University, and that’s not what Zipf’s Law denotes.4

There are other factual errors besides. For instance, we’re told that ejectives are not found in European languages, which is only true if we don’t consider Armenian, Georgian, etc. to be languages of Europe (p. 177). And Xhosa is described as a Khoisan language when in fact it is Bantu (p. 178).

And there’s the casually racist, classist, and sexist stuff. For instance, Everett posits that Pirahã children lack a theory of mind:

…many Pirahãs used to stare at me (some children still do) and talk about me in front of me—they didn’t believe I had a mind! (p. 165)

Okay. But maybe they were surprised rather than mentally deficient.

Later, Everett tells us:

…for many Ohio factory workers being overweight is less of a moral problem and more of a health problem—they do not value being at the right weight all that highly. (p. 300)

Okay. But the factories pretty much all closed down in Ohio years ago.

We’re told that in Wari’, a language of the Amazon, the word for ‘wife’, manaxi’, means literally ‘our hole’ or ‘our vagina’. Everett suggests that “some outsiders”—let’s call them “the libs”—might “jump to the facile conclusion that this is a crude and demeaning comparison”. What’s the right analysis, though?

Perhaps to the Wari’ reproduction and the family are such important values that they honor the wife and the vagina as the source of life. So it is the highest form of flattery to call the wife ‘our vagina’, the source of life. Is this a possible conclusion? Yes. Is it the right one? I don’t know. No one can know unless they undertake a systematic analysis of Wari’ culture… (p. 195)

Okay. But maybe Everett could have just asked his coauthor Barbara Kern, an anthropologist who lived among the Wari’ for over forty years and who speaks their language fluently.

Finally, we’re told that Banawá, another language of the Amazon, uses feminine as the default gender. Everett then proceeds to describe what I would call (from a non-relativist perspective) a brutal and essentializing coming-of-age ritual for pubescent Banawá girls. Are these facts related?

It is exactly by exploring such cultural values that we would try to build a connection between feminine identity and grammar in Banawá and other Arawan languages. I have not yet established such a link, but I am working on this. (p. 210).

Okay.

Footnotes

  1. Despite his affection for cheery-dreary Boomer cultural touchstones, that is. In the first few chapters he mentions “Under The Boardwalk”, the music of Cream, the plot of an episode of The Andy Griffith Show, and the murder trial of Phil Spector. Sorry, but I already have a Dad.
  2. For the record, this also gets Tomasello’s title wrong: he was “Co-director” of the Institute, not “the Director of Psycholinguistics”.
  3. As a colleague pointed out, Everett himself is a coauthor on a paper (Futrell et al. 2016) that claims that Pirahã, an Amazonian language, can be described by a regular language. This suggests that Everett understands the distinction between human languages, of which Pirahã is an instantiation, and formal languages, of which the regular languages are an instantiation, and is simply being disingenuous here. For what it’s worth, the argument in that paper is incoherent. The authors simply observe that their corpus can be described by a regular language, but so can any finite sample. This is a vacuous observation. That said, the study is not totally without value: the appendix contains an annotated corpus of Pirahã sentences.
  4. Zipf does observe something of the sort in his 1935 book The Psycho-biology of Language (p. 28f.), but “Zipf’s law” standardly refers to the inverse relationship between a word’s frequency and its frequency rank, not to word length at all.