“Python” is a proper name

In just the last few days I’ve seen a half dozen instances of the phrase python package or python script in published academic work. It’s disappointing to me that this got by the reviewers, action editors, and copy editors, since Python is obviously a proper name and should be in titlecase. (The fact that the interpreter command is python is irrelevant.)

Markdown isn’t good enough to replace LaTeX

I am generally sympathetic with calls to replace LaTeX with something else. LaTeX has terrible defaults, Unicode and font support is a constant problem, the syntax is deliberately obfuscatory, and actual generation is painfully slow (probably because the whole thing is a big pasta factory of interpreted code instead of a single static library).

But at the same time, I don’t think Markdown is really good enough to replace LaTeX. Of course one can use Pandoc to generate LaTeX from Markdown notes, and its output is often a decent thing to copy and paste into your LaTeX document. But Markdown just doesn’t solve any of the issues I mention, except making the syntax a tad more WYSIWYG than it would be otherwise. And Markdown is quite a bit worse at one thing: the extended syntax for tables is very hard to key in and still much less expressive than LaTeX’s actually pretty rational tabular environment.

Python hasn’t changed much

Since successfully sticking the landing for the migration from Python 2 (circa 3.6 or so), Python has been on a tear with a large number of small releases. These releases have cleaned up some warts in the “batteries included” modules and made huge improvements to the performance of the parser and run-time. There are also a few minor language features added; for instance, f-strings (which I like a lot) and the so-called walrus operator, mostly used for regular expression matching.
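For readers who haven’t run into these two features, here is a minimal sketch of both, using the regular expression case mentioned above. The variable names and the pattern are made up for illustration.

```python
import re

version = "3.8"

# f-string (Python 3.6+): expressions inside braces are interpolated.
message = f"The walrus operator arrived in Python {version}."

# Walrus operator (Python 3.8+): bind the match object and test it
# in a single expression, rather than assigning on a separate line.
pattern = re.compile(r"Python (\d+)\.(\d+)")
if (match := pattern.search(message)) is not None:
    major, minor = int(match.group(1)), int(match.group(2))
else:
    major = minor = None
```

Without the walrus, the match would have to be assigned on one line and tested on the next; that is about the extent of the change.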

When Python improvements (and they are improvements, IMO) are discussed on sites like Hacker News, there is a lot of fear and trepidation. I am not sure why. These are rather minor changes, and they will take years to diffuse through the Python community. Overall, very little has changed.

On getting fired

I probably shouldn’t say too much about this, but I am genuinely baffled why an extremely well-compensated tech employee would torpedo their career just to tell us that they think girls are bad at math, or that they think a language model is sentient. Even if true, what are the material consequences of these claims? What is the right framework for thinking about this? Is clout worth more than a job (“in this economy”)?

The computational revolution in linguistics

(Throughout this post, I have taken pains not to name any names. The beauty of subtweeting and other forms of subposting is that nobody knows for sure you’re the person being discussed unless you volunteer yourself. So, don’t.)

One of the more salient developments in linguistics as a discipline over the last two decades is the way in which computational knowledge has diffused into the field.1 Twenty years ago, there were but a handful of linguistics professors in North America who could perform elaborate corpus analyses, apply machine learning and statistical analysis, or extract acoustic measurements from an audio file. And, while it was in some ways quite robust, speech and language processing at the turn of the millennium simply did not hold the same importance it does nowadays.

While some professors—including, to their credit, many of my mentors and colleagues—can be commended for having “skilled up” in the intervening years, this knowledge has, I am sad to say, mostly advanced one death (and subsequent tenure line renewal) at a time. This has negative consequences for linguistics students who want to train for or pivot to a career in the tech sector, since there are professors who were, in their time, computationally sophisticated, but lack the skills a rising computational linguist is expected to have mastered. In an era of contracting tenure rolls and other forms of casualization in the academy, this has the risk of pushing out legitimate, albeit staid, lines of linguistic inquiry in favor of areas favored by capitalists.2

Yet I believe that this upskilling has a lot to contribute to linguistics as a discipline. There are many core questions about language use, acquisition, variation, and change which are best answered with a computational simulation that forces us to be explicit about our assumptions, or a corpus study that tells us what people really said, or a statistical analysis that tells us whether our correlations are likely to be meaningful, or even a machine learning system that helps us rapidly label linguistic data.3 It is a boon to our field that linguists of any age can employ these tools when appropriate.

This is not to say that the transition has not been occasionally ugly. First, there are the occasional nasty turf wars over who exactly is a linguist.4 Secondly, the standards of quality for work in this area must be negotiated and imposed. While a syntax paper in NL&LT from even 30 years ago is easily readable today, the computational methods of even widely-praised papers from 15 or 20 years ago are, frankly, often quite sloppy. I have found it necessary to explain this to students who want to interact with this older work lest they lower their own methodological standards.

I discern at least a few common sloppy habits in this older computational work, focusing for the moment on computational cognitive models of linguistic behavior.

  1. If a proposed computational model is compared to some “baseline” or older model, this older model is usually an ancient associationist model from psychology. This older model naturally lacks much of the rich linguistic specifications of the proposed model, and naturally it fails to model the data. Deliberately picking a bad baseline is putting one’s finger on the scale.
  2. Comparison of different computational models is usually informal. One should instead use statistical model comparison methods.
  3. The dependent variable for modeling is often derived from poorly-designed human subjects experiments. The subjects in these experiments may be instructed to perform a task they are unlikely to be able to do consciously (i.e., the tasks are cognitively impenetrable). Unjustified assumptions about appropriate scales of measurement may have been made. Finally, the n’s are often needlessly small. Computational cognitive models demand high-quality measures of the behaviors they’re meant to model.
  4. Once the proposed model has been shown better than the baseline, it is reified far beyond what the evidence suggests. Computational cognitive modeling can at most show that certain explicit assumptions are consistent with the observed data: they cannot establish much beyond that.
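
To make the second point concrete, here is a minimal sketch of one common formal comparison method, the Akaike information criterion (AIC). AIC is only one choice among several (likelihood ratio tests and cross-validation are others), and nothing above commits to it specifically. The log-likelihoods and parameter counts below are invented purely for illustration.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fits of two models to the same data (made-up numbers).
baseline_aic = aic(log_likelihood=-520.0, n_params=3)
proposed_aic = aic(log_likelihood=-505.0, n_params=7)

# A positive difference favors the proposed model even after it is
# penalized for its four extra parameters.
delta = baseline_aic - proposed_aic
```

The point is simply that the richer model has to pay for its extra parameters before it is declared the winner.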

The statistician Andrew Gelman writes that scientific discourse sometimes proceeds as if earlier published work has a stronger claim to truth than later research that is critical of the original findings (which may or may not be published yet).5 Critical interpretation of this older computational work is increasingly called for, as our methodological standards continue to mature. I find reviewers (and literature-reviewers) overly deferential to prior work of dubious quality simply because of its priority.

Endnotes

  1. An under-appreciated element to this process is that it is simply easier to do linguistically-relevant things with computers than it was 20 years prior. For this, one should thank Python and R, NumPy and Scikit-learn, and of course tools like Praat and Parselmouth.
  2. I happen to think college education should not be merely vocational training.
  3. I happen to think most of these questions can be answered with a cheap laptop, and only a few require a CUDA-enabled GPU.
  4. I suspect this is mostly a response to the rapidly casualizing academy. Unfortunately, any question about whether we should be doing X in linguistics is misinterpreted as a question about whether people who do X deserve to have a job. This is a presupposition failure for me: I believe everyone deserves meaningful work, and that academic tenure is a model of labor relations that should be expanded beyond the academy.
  5. To free ourselves of this bias, Gelman proposes what he calls the time-reversal heuristic, in which one imagines the temporal order reversed (e.g., that the later failed replication is now the first published result on the matter) and then re-evaluates the evidence. Similar thinking is called for when interacting with older computational work.

Please don’t send .docx or .xlsx files

.docx and .xlsx can only be read on a small subset of devices and only after purchasing a license. It is frankly a bit rude to expect everyone to have such licenses in 2022 given the proliferation of superior, and free, alternatives. If the document is static, read-only content, convert it to a PDF. If it’s something you want me to edit or comment on, or which will be changing with time, send me the document via Microsoft 365 or the equivalent Google offerings. Or a Git repo. Sorry to be grumpy but everyone should know this by now. If you’re still emailing these around, please stop.

Dutch names in LaTeX

One thing I recently figured out is a sensible way to handle Dutch names (i.e., those that begin with den, van, or similar particles). Traditionally, these particles are part of the cited name in author-date citations (e.g., den Dikken 2003, van Oostendorp 2009) but are ignored when alphabetizing (thus, van Oostendorp is alphabetized under O, near Orgun & Sprouse and Otheguy, not under V, between Vago and Vaux). This is not something handled automatically by tools like LaTeX and BibTeX, but it is relatively easy to annotate name particles like this so that they do the right thing.

First, place, at the top of your BibTeX file, the following:

@preamble{{\providecommand{\noopsort}[1]{}}}

Then, in the individual BibTeX entries, wrap the author field with this command like so:

 author = {{\noopsort{Dikken}{den Dikken}}, Marcel},

This preserves the correct in-text author-date citations, but also gives the intended alphabetization in the bibliography.
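
To illustrate the sorting convention itself, separately from the BibTeX mechanics, here is a minimal Python sketch. It sorts surnames by a key that skips a leading particle while leaving the displayed name untouched, which is roughly the effect the annotation is meant to produce. The particle list and the function name are made up for illustration and are far from exhaustive.

```python
# Hypothetical, non-exhaustive particle list; longer particles come first
# so that "van der " is tried before "van ".
PARTICLES = ("van der ", "van den ", "van ", "den ", "de ")


def sort_key(surname: str) -> str:
    """Return a lowercase sort key with any leading particle removed."""
    lowered = surname.lower()
    for particle in PARTICLES:
        if lowered.startswith(particle):
            return lowered[len(particle):]
    return lowered


surnames = ["Vaux", "van Oostendorp", "Otheguy", "Vago", "Orgun", "den Dikken"]
ordered = sorted(surnames, key=sort_key)
print(ordered)
```

The names are still printed with their particles; only the sort order ignores them.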

Note of course that not all people with van (etc.) names in the Anglosphere treat the van as if it were a particle to be ignored; a few deliberately alphabetize their last name as if it begins with v.

X moment

A Reddit moment is an expression used to refer to a certain type of cringe ‘cringeworthy behavior or content’ judged characteristic of Redditors, habitual users of the forum website reddit.com. It seems hard to pin down what makes cringe Redditor-like, but discussion on Urban Dictionary suggests that one salient feature is a belief in one’s superiority, or the superiority of Redditors in general; a related feature is irl behavior that takes Reddit too seriously. The normal usage is as an interjection of sorts; presented with cringeworthy internet content (a screenshot or URL), one might simply respond “Reddit moment”.

However, Reddit isn’t the only community that can have a similar type of pejorative X moment. One can find many instances of crackhead moment, describing unpredictable or spazzy behavior. A more complicated example comes from a friend, who shared a link about a software developer who deliberately sabotaged a widely used JavaScript software library to protest the Russian invasion of Ukraine. JavaScript, and the Node.js community in particular, has been extremely vulnerable to both deliberate sabotage and accidental bricking ‘irreversible destruction of technology’, and naturally my friend sent the link with the commentary “js moment”. The one thing that seems to unite all X moment snowclones is a shared negative evaluation of the community in the common ground.

Country (dead)naming

Current events reminded me of an ongoing Discourse about how we ought to refer to the country Ukraine in English. William Taylor, US ambassador to the country under George W. Bush, is quoted on the subject in this Time magazine piece (“Ukraine, Not the Ukraine: The Significance of Three Little Letters”, March 5th, 2014; emphasis mine), which is circulating again today:

The Ukraine is the way the Russians referred to that part of the country during Soviet times … Now that it is a country, a nation, and a recognized state, it is just Ukraine.

Apparently they don’t fact-check claims like this, because this is utter nonsense. Russian doesn’t have definite articles, i.e., words like the. There is simply no straightforward way to express the contrast between the Ukraine and Ukraine in Russian (or in Ukrainian for that matter).

Now, it’s true that the before Ukraine has long been proscribed in English, but this seems to be more a matter of style—the the variant sounds archaic to my ear—than ideology. And, in Russian, there is variation between в Украине and на Украине, both of which I would translate as ‘in Ukraine’. My understanding is that both have been attested for centuries, but one (на) was more widely used during the Soviet era and thus the other (в) is thought to emphasize the country’s sovereignty in the modern era. As I understand it, that one preposition is indexical of Ukrainian nationalist sentiment and another is indexical of Russian revanchist-nationalist sentiment is more or less linguistically arbitrary in the Saussurean sense. Or, more weakly, the connotative differences between the two prepositions are subtle and don’t map cleanly onto the relevant ideologies. But I am not a native (or even competent) speaker of Russian so you should not take my word for it.

Taylor, in the Time article, goes on to argue that US media should use the Ukrainian-style transliteration Kyiv instead of the Russian-style transliteration Kiev. This is a more interesting prescription, at least in that the linguistic claim (that Kyiv is the standard Ukrainian transliteration and Kiev is the standard Russian transliteration) is certainly true. However, it probably should be noted that dozens of other cities and countries in non-Anglophone Europe are known by their English exonyms, and no one seems to be demanding that Americans start referring to Wien [viːn] ‘Vienna’ or Moskva ‘Moscow’. In other words, Taylor’s prescription is a political exercise rather than a matter of grammatical correctness. (One can’t help but notice that Taylor is a retired neoconservative diplomat pleading for “political correctness”.)