A minimalist project design for NLP

Let’s say you want to build a new tagger, a new named entity recognizer, a new dependency parser, or whatever. Or perhaps you just want to see how your coreference resolution engine performs on your new database of anime reviews. So how should you structure your project? Here’s my minimalist solution.

There are two principles that guide my design. The first one is modularity. Some of these components will get run many times, some won’t. If you’re doing model comparison—and you should be doing model comparison—some components will get swapped out with someone else’s code. This sort of thing is a major lift unless you opt for modularity. The second principle is filesystem state. The filesystem is your friend. If your embedding table eats up all your RAM and you have to restart, the filesystem will be in roughly the same state as when you left. The filesystem allows you to organize things into directories and subdirectories, and give the pieces informative names; I like to record information about datasets and hyperparameter values in my file and directory names. So without further ado, here are the recommended scripts or applications to create when you’re starting off on a new project.

  1. split takes the full dataset and a random seed (which you should store for later) as input. The script reads the data in, randomly shuffles it, and then splits it into an 80% training set, a 10% development set, and a 10% test (i.e., evaluation) set, which it then outputs. If you’re comparing to prior work that used a “standard split” you may want to have a separate script that generates that too, but I strongly recommend using randomly generated splits.
  2. train takes the training set as input and outputs a model file or directory. If you’re automating hyperparameter tuning you will also want to provide the development set as input; if not you will probably want to either add a bunch of flags to control the hyperparameters or allow the user to pass some kind of model configuration file (I like YAML for this).
  3. apply takes as input the model file(s) produced in (2) and the test set, and applies the model to the data, outputting a new hypothesized test data set (i.e., the model’s predictions). One open question is whether this ought to take only unlabeled data or should overwrite the existing labels: it depends.
  4. evaluate takes as input the gold test set and the hypothesized test data set generated in (3) and outputs the evaluation results (as text or in some structured data format—sometimes YAML is a good choice, other times TSV files will do). I recommend you test this with a small amount of data first.

That’s all there is to it. When you begin doing model comparison you may find yourself swapping out (2-3) for somebody else’s code, but make sure to still stick to the same evaluation script.
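
To make steps (1) and (4) a bit more concrete, here is a minimal Python sketch, not a definitive implementation. The function names split and accuracy, the integer-arithmetic cut points, and the choice of plain accuracy as the metric are all illustrative assumptions; your task may well call for a different metric or a different data format.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split(data: Sequence[T], seed: int) -> Tuple[List[T], List[T], List[T]]:
    """Shuffles the data with a fixed seed, then splits it 80/10/10.

    Hypothetical helper: returns (train, dev, test) lists.
    """
    shuffled = list(data)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    train_end = len(shuffled) * 8 // 10
    dev_end = len(shuffled) * 9 // 10
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


def accuracy(gold: Sequence[str], hypo: Sequence[str]) -> float:
    """Computes the fraction of hypothesized labels that match the gold labels.

    Assumes one label per example, in the same order in both sequences.
    """
    if len(gold) != len(hypo):
        raise ValueError("gold and hypothesized data differ in length")
    if not gold:
        raise ValueError("no data to evaluate")
    correct = sum(g == h for g, h in zip(gold, hypo))
    return correct / len(gold)
```

Calling split twice with the same data and seed yields the same three sets, which is why it pays to record the seed (say, in a file or directory name) next to the split it produced.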

I read “Language: The Cultural Tool”. You’ll never guess what happened next.

I recently obtained a copy of Daniel Everett’s pop-science paperback Language: The Cultural Tool (2012) from the Brooklyn Public Library. The chunky fonts of the cover made me think I was about to enter the world of a staunch iconoclast. But what I actually found was a laundry list of what you might call “grievance studies”—if that didn’t already mean something else—against a broadly generativist conception of language.

Everett, once a specialist in languages of the Amazon, draws not so much from niche fieldwork as from splashy papers by non-linguists in high-impact pop-science journals like Nature and Science. Thanks to my colleague Richard Sproat, I have seen how those august organizations make their sausage: they either don’t let linguists referee, or if they do, they simply ignore their negative reviews. (Everett, as it happens, has glowing things to say about the latter paper even though it has nothing particular to do with his titular thesis.) In general, the works cited draw from disparate areas that have received relatively little attention from specialists, so while Everett is a decent prose stylist,1 he is tilting at windmills for much of the book.

Everett often substitutes appeals to authority for actual arguments. For instance:2

Michael Tomasello, the Director of Psycholinguistics at the Max Planck Institute for Evolutionary Anthropology in Leipzig, says exactly this. A world leader in the study of cognitive development in canines and primates, including humans, he says simply ‘Universal grammar is dead.’ It was a good idea. It didn’t pan out. (p. 192)

That’s all we get on that point.

I was also struck by elementary factual errors that would have been cleaned up had literally any other linguist read the book before it went to press. Early on, Everett is discussing definitions of language. After describing the proposed definitions by Sweet and by Bloch and Trager, he quotes (p. 32) a passage from Noam Chomsky (the reference is neither given nor known to me):

A formal language is a (usually infinite) set of sequences of symbols (such sequences are “strings”) constructed by applying production rules to another sequence of symbols which initially contains just the start symbol.

Now, obviously this is not a definition of language as we understand it but rather the start of a definition of the mathematical construct formal language, a notion which predates Chomsky by at least half a century. Everett is either deeply confused or is deliberately misleading his readers.3 The second howler I found is the following passage, now from Everett:

The late Professor George Zipf of Howard University formulated an explanation of the relative lengths of words that has come to be known as ‘Zipf’s Law.’ His law predicts that more frequent words will be shorter than less frequent words. (p. 106)

George Kingsley Zipf taught at Harvard University, not Howard University, and that’s not what Zipf’s Law denotes.4

There are several other factual errors. For instance, we’re told that ejectives are not found in European languages, which is only true if we don’t consider Armenian, Georgian, etc. languages of Europe (p. 177). And Xhosa is described as a Khoisan language when in fact it’s Bantu (p. 178).

And there’s the casually racist, classist, and sexist stuff. For instance, Everett posits that Pirahã children lack a theory of mind:

…many Pirahãs used to stare at me (some children still do) and talk about me in front of me—they didn’t believe I had a mind! (p. 165)

Okay. But maybe they were surprised rather than mentally deficient.

Later, Everett tells us:

…for many Ohio factory workers being overweight is less of a moral problem and more of a health problem—they do not value being at the right weight all that highly. (p. 300)

Okay. But the factories pretty much all closed down in Ohio years ago.

We’re told that in Wari‘, a language of the Amazon, the word for ‘wife’, manaxi’, means literally ‘our hole’ or ‘our vagina’. Everett suggests that “some outsiders”—let’s call them “the libs”—might “jump to the facile conclusion that this is a crude and demeaning comparison”. What’s the right analysis, though?

Perhaps to the Wari’ reproduction and the family are such important values that they honor the wife and the vagina as the source of life. So it is the highest form of flattery to call the wife ‘our vagina’, the source of life. Is this a possible conclusion? Yes. Is it the right one? I don’t know. No one can know unless they undertake a systematic analysis of Wari’ culture… (p. 195)

Okay. But maybe Everett could have just asked his coauthor Barbara Kern, an anthropologist who lived among the Wari’ for over forty years and who speaks their language fluently.

Finally, we’re told Banawá, another language of the Amazon, uses feminine as the default gender. Everett then proceeds to describe what I would call a (from a non-relativist perspective) brutal and essentializing coming-of-age ritual for pubescent Banawá girls. Are these facts related?

It is exactly by exploring such cultural values that we would try to build a connection between feminine identity and grammar in Banawá and other Arawan languages. I have not yet established such a link, but I am working on this. (p. 210).

Okay.

Footnotes

  1. Despite his affection for cheery-dreary Boomer cultural touchstones, that is. In the first few chapters he mentions “Under The Boardwalk”, the music of Cream, the plot of an episode of The Andy Griffith Show, and the murder trial of Phil Spector. Sorry, but I already have a Dad.
  2. For the record, this also gets Tomasello’s title wrong: he was “Co-director” of the Institute, not “the Director of Psycholinguistics”.
  3. As a colleague pointed out, Everett himself is a coauthor on a paper (Futrell et al. 2016) that claims that Pirahã, an Amazonian language, can be described by a regular language. This suggests that Everett understands the distinction between human languages, of which Pirahã is an instantiation, and formal languages, of which the regular languages are an instantiation, and is simply being disingenuous here. For what it’s worth, the argument in that paper is incoherent. The authors simply observe that their corpus can be described by a regular language, but so can any finite sample. This is a vacuous observation. That said, the study is not totally without value: the appendix contains an annotated corpus of Pirahã sentences.
  4. Zipf does observe something of the sort in his 1935 book The Psycho-biology of Language (p. 28f.), but “Zipf’s law” does not refer to word length at all.

The libfixes -pire, -spire, and -cuck

[CW: distasteful ideologies.]

A student at CUNY, Emily Campbell, recently brought two libfixes to my attention.

The first is -pire, presumably extracted from empire and found in the blend Fempire (an “investment cooperative for FIERCE women”) and in Trumpire, presumably a pejorative meaning something like ‘the world of the Trump family’. Both of these look blend-like in that the base provides a /m/.

In looking for more examples I also discovered a bunch of brand names in -spire, a libfix that appears to have been extracted from inspire. There is Artspire, an art festival, CitySpire, a New York City skyscraper which is more of a dome than a spire (n.), and the tech companies Fundspire, Jobspire, Pinspire, and WeSpire.

A linguistically more interesting example is -cuck. This originates in cuckold, an archaic pejorative referring to the husband of an adulterous woman. How did a (string) prefix become a suffix? Here’s my best guess. First, cuckold obtains a new and more transgressive sense as the name for a genre of pornography in which a (usually white) man is forced to watch as a straight man (usually non-white) has sex with his (usually white) wife or girlfriend. This new racist sense led to the blend cuckservative, a pejorative for white conservative Western politicians perceived to have betrayed their race (and perhaps also their donor base). While we might expect this would lead to a prefixal reanalysis (and a new libfix *cuck-), what seems to have happened first is that cuck was made into a free stem. In informal usage, to cuck (v.) is to embarrass, or more specifically emasculate, someone, and a cuck (n.) is someone perceived to be acting against their interests or the interests of their in-group; a class-, race-, or gender-traitor (though a conservative belief system is not necessarily presupposed). It didn’t take long before conservative politicians started using that one on each other. Later, with the fossilization of the incel narrative, we find the suffixal form -cuck as in words like wagecuck ‘wage-slave’ (“whadda schnook!”, I guess), Eurocuck, normcuck, or studycuck, all pejorative (though not necessarily racist).

-cel goes libfix

[CW: distasteful ideologies, misogyny, fat-shaming.]

It’s a familiar story, one we all know:

our protagonist, a young white man, can’t find a sexual partner because of feminism, his weak chin, his poor muscle tone…

Oh no, not that story: that’s misogynistic, objectifying nonsense. But that narrative, regressive as it is, has given us something novel, a new libfix.

The story begins with two closely-related coinages. The first, according to Wikipedia, is the creation of a semi-anonymous Canadian college student who created a blog to discuss her sexual inactivity. The title: “Alana’s Involuntary Celibacy Project”. Involuntary celibacy, in the community that arose, was first shortened to invcel, then incel. (The author, as it happened, ultimately realized she was queer and abandoned the community she’d created.)

In the years since, a community of men gathered on Reddit (and specifically the subreddit “r/incels”), blaming women for their celibacy, and in some cases advocating for sexual violence to recoup their imagined losses. They call themselves incel (n.).

Not all the celibate are aggrievedly so; some have chosen their lot voluntarily, and they, in the jargon of the incel community, are termed volcel (n.). It is not immediately clear that this is a widely-used term of self-identification (though it has its own subreddit, too), and it doesn’t seem to satisfy a lexical need that wasn’t already being served by more-precise, in-community terms like asexual or aromantic. But, it does pair nicely with incel, and it’s fun to apply this plunky neologism to the private lives of historical asexuals like Virgil, James Buchanan, or H. P. Lovecraft.

So far, what we’ve seen looks like a standard type of word formation: clipping (i.e., truncation) of both parts of a compound expression, which are then joined together to form a single word. In this case, the first syllable [1] of both words is preserved. This is not particularly novel: consider Amex (< American Express) or op-ed (< opinion editorial).

But as is often the case, the clipping in incel and volcel appears to have spawned a libfix, an affix-like formative extracted from the compound. Witness the recently coined heightcel, an involuntarily celibate short person, presumably one whose involuntary celibacy can be attributed to their diminutive stature. Here, -cel attaches not to a clipping like in- or vol-, but to a free stem, the noun height. Libfixation, at least as it should be defined, has begun.

There are many more. (I’m not linking to any “manosphere” sources.) A marcel is a married incel; a baldcel is a bald(ing) incel; a currycel is an incel of South Asian descent; a ricecel is an incel of East or Southeast Asian descent; a gingercel is a red-headed incel; and so on. There’s (ugh) fatcel, though there’s debate (in the incel community, at least) whether that’s more incel or volcel. And there’s even ironycel, someone (non-celibate, I suppose) who mocks incels.

Some of these -cel types foreground features that seem totally orthogonal to the sexual marketplace, suggesting some sort of gallows humor for outsiders, and for the mods: are we really to believe that some young man, somewhere, thinks he’d have a shot with Stacy if his wrists were just a bit thicker? But yet they keep coming.

[1] In volcel, it’s technically the first syllable plus the onset of the following unstressed syllable: [vɑl] < [vɑ.lənˌtɛ.ɹi].

[Some of my prior coverage of libfixation: Defining libfixes; Your libfix and blend report for May 2016; Your libfix and blend report for February 2018]

[Thanks to Twitter folks for some minor corrections.]

Sweet potato salad

Nerds love posting weird recipes; here’s mine.

Ingredients

Two sweet potatoes, skinned and cubed
1/2 cup balsamic vinegar
1/2 cup extra virgin olive oil
1 cup dry farro
A bag of baby kale
(Optional) a handful of fresh blueberries or strawberries
(Optional) chunks of chevre
Salt & pepper to taste

Preparation

Preheat oven to 425 degrees F. Wrap the cubed sweet potatoes loosely in foil and dress lightly with salt, pepper, and a dash of olive oil. Roast roughly 30 minutes (turning them at least once), until golden brown, and refrigerate.

Cook the farro in a rice cooker according to directions, and refrigerate.

Reduce the balsamic vinegar in a sauce pan over low heat, and set aside.

Wash the baby kale and combine with sweet potatoes, farro, and (optionally) fruit or chevre. Dress with equal parts balsamic vinegar and olive oil, and add salt and pepper to taste.

The history of “drain(ing) the swamp(s)”

In US political discourse, the phrase drain the swamp(s) usually refers to fighting corruption and undue influence. But the origins of the expression are quite far from this sense. The swamps in question are the Pontine Marshes (Pomptinae Paludes) to the south of Rome. Efforts to drain them have been made, on and off, for three millennia, and even predate Roman settlement in the region. The Appian Way (Via Appia, completed in 312 BCE), a famous ancient road, traversed the swamps, and major efforts (by the senators and consuls, by the emperors, and by the medieval popes) were required to keep the roadbed above water level. And of course the swamps’ waters are infested with malarial mosquitoes. Thus it is no surprise that many a historical Roman leader used “drain the swamps!” as a political slogan.

The most famous swamp drainer of all is Benito Mussolini, who tackled the marshes (now known as Agro Pontino) as part of a flashy, highly publicized infrastructure campaign. Once the project was completed (with untold workers succumbing to malaria in the process), 2,000 pro-fascist families from northern Italy were granted farmsteads in former swampland. But after the Allied invasion of Sicily, the Armistice of Cassibile, and the Nazi reinforcement of Italy, the Nazis stopped the pumps and opened the dikes, flooding the marshes with brackish water. While it’s not at all clear this tactic was effective at slowing down Allied advances, it certainly did help to spread malaria (at a time when quinine was in short supply) and it utterly devastated the region’s civilian population. It was an act of biological warfare against a now-hostile civilian population no longer aligned with the Nazi cause.

Propaganda poster for the “Agro Pontino” campaign.

Nowadays the swamp waters are relatively well-controlled; liberal application of the pesticide DDT in the mid-20th century helped to rein in the mosquito population, and the region has largely been repopulated.

Postscript: I want to be clear that I’m not saying that “drain the swamp” is always intended to index Mussolini (or whatever), just that many well-read Westerners will likely see use of this expression as “normalizing fascism”.

A Morris Halle memory

Morris Halle passed away earlier today. Morris was an absolute giant in the field of linguistics. His work in the 1950s and 1960s completely revolutionized phonological theory. He did this, primarily, by rejecting an axiom of the previous century’s work. The theory of phonology was so utterly transformed by his argument against the principle of biuniqueness that the very concept is rarely even taught in the 21st century. And this was just one of his earliest scientific contributions.

I could say a lot more about Morris’s work, but instead let me tell a short anecdote. In 2010 or so I happened to be in the Boston area and my advisor kindly arranged for me to meet Morris. After getting coffee we walked to his spare shared office. The only thing of note was a single wall-mounted bookshelf containing three books: Morris’s own Sound Pattern of Russian and Sound Pattern of English (with the dust cover removed so as to exhibit the unique bas-relief cover designed by Morris’s wife, a talented visual artist) and, of course, Walker’s rhyming dictionary. For whatever reason, we started to discuss Latin. Working with a legal pad, Morris first showed me a novel analysis of thematic vowels. Ignoring a few irregular (“athematic”) stems, all Latin verb stems have a characteristic final vowel: -ā- in the first conjugation, -ē- in the second, a vowel of varying quality (usually e or i) in the third, and -ī- in the fourth. In the first conjugation and most of the third conjugation, this vowel disappears in the first person singular active indicative form, which is marked with the suffix -ō. Thus for the second conjugation verb docēre ‘teach’, we have doceō ‘I teach’, with the theme vowel preserved, and similarly for the fourth conjugation. In contrast, for the first conjugation verb amāre ‘love’, we have amō ‘I love’, with the theme vowel omitted, and similarly for the majority of the third conjugation. This much I already knew. To me it was just one of those conjugational quirks one has to memorize when learning Latin, but Morris suggested that it was not necessarily so. What if, he argued, the first conjugation -ā- was deleted by a following -ō? (Certainly that rule is surface-true, except for a handful of Greek loanwords like chaos.) But what about the third conjugation? Morris suggested that he had long believed the underlying form of the third conjugation theme vowel was [+back], something like /ɨ/, and he proceeded to lay out the necessary allophonic rules, and finally a rule which deletes the first of two [+back] segments! I was floored.

I then showed him an analysis I was working on at the time. Once again ignoring a few irregulars, Latin masculine and feminine nouns of the third declension are characterized by a nominative singular suffix -s. When the noun stem is athematic and ends in /t, d/, this consonant is deleted in the nominative singular (e.g., frons, frontis ‘forehead’). I argued that this rule ought to be extended to also target /r/ so as to account for the so-called “rhotic” stems like honōs, honōris ‘honor’ (e.g., /honōr-s/ → [honōs]). To make this work, one must write the rule so that it bleeds its own application (see here for the full analysis), and as one of several opaque rules. This is something which is possible in the rule-application framework proposed by Morris and colleagues, but which cannot be straightforwardly implemented in more recent theoretical frameworks. I must have hesitated for a moment as I was talking through this, because Morris grabbed my hand and said to me: “Young man, remember always to speak clearly and to never apologize for your rule ordering.” And then he bid me adieu.

When should we call it “terrorism”?

According to White House Press Secretary Sarah Huckabee Sanders, a recent spate of serial bombings targeting prominent African-Americans in Austin, TX, has “no apparent nexus to terrorism at this time”. I want to make a pedantic lexicographic point about the definition of terrorism (and terrorist) regarding this. There is certainly a sense of terrorism which just involves random lethal violence against civilians, and by that definition this absolutely qualifies. But that is not the definition used by the state (or mass media). Rather, they favor an alternative sense which emphasizes the way in which the violence undermines the authority of the state. This is in fact encoded in the (deeply evil) PATRIOT Act, which defines terrorism as an attempt “…to influence the policy of a government by intimidation or coercion; or to affect the conduct of a government by mass destruction, assassination, or kidnapping.” Let’s assume, as seems likely though by no means certain, that the bomber(s) are white supremacists targeting African-American communities. You’d be hard-pressed to argue that terrorizing people of color undermines the authority of a deeply racist society and its institutions any more than, say, trafficking crack cocaine in African-American communities to support right-wing death squads abroad. Terrorizing people of color is absolutely in line with US domestic and foreign policy, and the language chosen by the White House (and parroted by the media) naturally reflects that.