Automatic batch sizing

Yoyodyne is my lab’s sequence-to-sequence library, intended to be a replacement for Fairseq, which is (essentially) abandonware. One matter of urgency for me in building Yoyodyne was to enable automatic hyperparameter tuning. This was accomplished by logging results to Weights & Biases (W&B). We can perform a random or Bayesian hyperparameter sweep using a “grid” specified via a YAML file, monitor progress on the W&B website, or even hit the API to grab the best hyperparameters. One issue that kept coming up, however, is that it is easy to hit out-of-memory (OOM) errors during this process. Here’s what we did about it:

OOMs are not purely due to model size: the model, batch, and gradients all need to fit into the same VRAM. PyTorch Lightning, which is a key part of the Yoyodyne backend, provides a function for automatically determining the maximum batch size that will not trigger an OOM. Basically, it works by starting with a low batch size (by default, 2), randomly drawing three batches of that size, and then attempting training (but in fact caching parameters so that no real training occurs). If this does not trigger an OOM, it doubles the batch size, and so on.1,2 You can enable this approach in Yoyodyne using the flag --find_batch_size max. You’d want to use this if you believe that a giant batch size is fine and you just want to fully saturate your GPU.
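The doubling procedure can be sketched roughly as follows. This is a toy illustration and not Lightning's or Yoyodyne's actual code: the names find_max_batch_size, fits, and max_trials are hypothetical, and fits stands in for "run a few trial training steps at this batch size and report whether they finished without an OOM".

```python
from typing import Callable


def find_max_batch_size(
    fits: Callable[[int], bool],
    init_size: int = 2,
    max_trials: int = 25,
) -> int:
    """Doubles the batch size until the next doubling would not fit.

    `fits` is a hypothetical callback; a real version would attempt a few
    training steps and catch the out-of-memory error.
    """
    if not fits(init_size):
        raise MemoryError(f"Initial batch size {init_size} does not fit")
    size = init_size
    for _ in range(max_trials):
        # Stop at the last size that was known to fit.
        if not fits(size * 2):
            break
        size *= 2
    return size


# Toy stand-in: pretend anything above 300 examples triggers an OOM.
print(find_max_batch_size(lambda size: size <= 300))  # prints 256
```

Because the search only ever doubles, the answer here (256) can be well short of the true limit (300); this is the imprecision that a binary search would remove.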

A slightly more sophisticated version of this, useful when you actually want to tune batch size, is enabled with the flag --find_batch_size opt. This also begins by doubling the size of randomly drawn batches, but here it halts once the doubling exceeds the value of the --batch_size flag. If the max batch size is larger than the requested size, the requested size is used as is; thus this acts as a soft check against OOMs. If, however, the max batch size is smaller than --batch_size, it instead solves for a new batch size, the largest batch size which is smaller than the max and which is a divisor of --batch_size. It then enables multiple rounds of gradient accumulation per update,3 thus losslessly simulating the desired batch size while using as much VRAM as possible. I can assure you this is a killer feature for neural network tuning.


  1. This is a little imprecise, and one can refine it by doing a binary search, but in practice it’s not worth the effort when working with ragged data.
  2. Whatever batch size was requested with the --batch_size flag is ignored.
  3. More formally, given desired batch size $b$ and a max batch size $n'$, it finds $a, n$ such that $a$ is the smallest integer, and $n$ is the largest integer, where $an = b$ and $n \leq n'$. This is computed via brute force; my implementation of an elegant solution based on the prime factorization was a bit slower.
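For concreteness, here is a minimal brute-force sketch of the computation described in footnote 3. The function name solve_batch_size is hypothetical and this is not necessarily how Yoyodyne implements it. It also assumes that a batch exactly as large as the max is acceptable (i.e., $n \leq n'$ rather than $n < n'$).

```python
def solve_batch_size(desired: int, max_size: int) -> tuple[int, int]:
    """Returns (a, n): accumulation steps and per-step batch size.

    n is the largest divisor of `desired` not exceeding `max_size`
    (an assumption; the post says "smaller than the max"), and
    a = desired // n, so that a * n == desired.
    """
    if desired < 1 or max_size < 1:
        raise ValueError("Batch sizes must be positive")
    # Scan downward from the largest admissible size; 1 always divides.
    for n in range(min(desired, max_size), 0, -1):
        if desired % n == 0:
            return desired // n, n
    raise AssertionError("unreachable")


print(solve_batch_size(1024, 300))  # prints (4, 256)
print(solve_batch_size(100, 30))    # prints (4, 25)
print(solve_batch_size(64, 128))    # prints (1, 64)
```

When the desired size already fits, this returns a single accumulation step and leaves the batch size alone. The worst case is a prime desired size larger than the max, which degrades to batches of one.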

An interesting semantic change: “raw dogging”

The term raw-dogging is a slightly-obscene, slangy term for engaging in unprotected sex, often used to celebrate that occasionally-risky behavior. However, this term has undergone an interesting semantic change in the last five or so years. I think the actuator of this chain of events is prolific Twitter user @jaboukie:

This is a straightforward, jocular, semantic extension, generalizing the sense of danger associated with unprotected sex to life itself. In its wake (it was a very popular tweet), I also saw a tweet about “raw dogging” to refer to riding the subway without headphones or sunglasses. Years later, I read a blind item about a US senator flying commercially from the States to Israel; apparently, according to his seat mate, during the long flight, he didn’t listen to music or podcasts, read, check email, nap, or watch a movie; he just…sat there, for hours and hours, like an absolute maniac. I haven’t been able to find this story, and I don’t remember whether it referred to raw-dogging, but I have since seen several stories discussing raw-dogging flights (e.g., this recent one in GQ). Discussions of raw-dogging in the commercial aviation sense largely recognize the act’s covert prestige: it is recognized as a curious and difficult task, one associated with machismo and/or maleness. The GQ article also quotes individuals who refer to stimulation-free commercial flying as barebacking, which traditionally refers to unprotected anal sex between men. (In contrast, raw-dogging in its original sense does not specify the sex act beyond some form of genital-genital penetration, nor does it specify the gender or sexual orientation of the participants.)

“Indic” considered harmful

Indic is an adjective referring to the Indo-Aryan languages such as Hindi-Urdu or Bengali. These languages are spoken mostly in the northern parts of India, as well as in Bangladesh, Pakistan, Sri Lanka, Nepal, and the Maldives. This term can be confusing, because hundreds of millions of people in the Indian subcontinent (and nearby island nations) speak non-Indic first languages: over 250 million people, particularly in the south of India and the north of Sri Lanka, speak Dravidian languages, which include Malayalam, Tamil, and Telugu. Austronesian, Tibeto-Burman, and Tai-Kadai languages, and many language isolates, are also spoken in India and the other nations of the subcontinent, as is English (and French, and Portuguese). Unfortunately, there is now a trend to use Indic to mean ‘languages of the subcontinent’. See here for a prominent example. This is a new sense for Indic, and while there is probably a need for a lexeme to express this notion (language of India or subcontinental language would work), reusing Indic, which already has a distinct and well-established sense, just adds unnecessary confusion.

A minor syntactic innovation in English: “BE crazy”

I recently became aware of an English syntactic construction I hadn’t noticed before. It involves the predicate BE crazy, which itself is nothing new, but here the subject of that predicate is, essentially, quoted speech from a second party. I myself am apparently a user of this variant. For example, a friend told me of someone who describes themselves (on an online dating platform) as someone who …likes travel and darts, and I responded, simply, Likes darts is crazy. That is to say, I am making some kind of assertion that the description “likes darts”, or perhaps the speech act of describing oneself as such, is itself a bit odd. Now in this case, the subject is simply the quotation (with the “travel and” part elided), and while this forms a constituent, a tensed VP, we don’t normally accept tensed VPs as the subjects of predicates. And I suspect constituenthood is not even required. So this is distinct from the ordinary use of BE crazy with a nominal subject.

I suspect, though I do not have the means to prove it, that this is a relatively recent innovation; I hear it from my peers (i.e., those of similar age, not my colleagues at work, who may be older) and students, but not often elsewhere. I also initially thought it might be associated with the Mid-Atlantic, but I am no longer so sure.

Your thoughts are welcome.

Vibe check: EACL 2024

I was honored to be able to attend EACL 2024 in Malta last month. The following is a brief, opinionated “vibe check” on NLP based on my experiences there. I had never been to an EACL, but it appealed to me because I’ve always respected the European speech & language processing community’s greater interest in multilingualism compared to what I’m familiar with in the US. And, because when or why else would I get to see Malta? The scale of EACL is a little more manageable than what I’m used to, and I was able to take in nearly every session and keynote. Beyond that, there wasn’t much difference. Here are some trends I noticed.

We’re doing prompt engineering, but we’re not happy about it

It’s hard to get a research paper out of prompt engineering. There really isn’t much to report, except the prompts used and the evaluation results. And, there doesn’t seem to be the slightest theory about how one ought to design a prompt, suggesting that the engineering part of the term is doing a lot of work. So, while I did see some papers (to be fair, mostly student posters) about prompt engineering, the interesting ones actually compared prompting against a custom-built solution.

There’s plenty of headroom for older technologies

I was struck by one of the demonstration papers, which was using fine-tuned BERT for the actual user-facing behaviors, but an SVM or some other type of simple linear model trained on the same data to provide “explainability”. I was also struck by the many papers I saw in which fine-tuned BERT or some other kind of custom-built solution outperformed prompting.

Architectural engineering is dead for now

I really enjoy learning about new “architectures”, i.e., ways to frame speech and language processing problems as a neural network. Unfortunately, I didn’t learn about any new ones this year. I honestly think the way forward, in the long term, will be to identify and eliminate the less-principled parts of our modeling strategies, and replace them with “neat”, perhaps even proof-theoretic, solutions, but I’m sad to say this is not a robust area.

Massive multilingualism needs new application areas

In the first half of Hinrich Schütze’s keynote, he discussed a massively multilingual study covering 1,500 languages in all. That itself is quite impressive. However, I was less impressed with the tasks targeted. One was an LM-based task (predicting the next word, or perhaps a masked word), evaluated with “pseudo-perplexity”. I’m not sure what pseudo-perplexity is but real perplexity isn’t good for much. The other task was predicting, for each verse from the Bible, the appropriate topic code; these topics are things like “recommendation”, “sin”, “grace”, or “violence”. Doing some kind of semantic prediction, at the verse/sentence level, at such scale might be interesting, but this particular instantiation seems to me to be of no use to anyone, and as I understand it, the labels were projected from those given by English annotators, which makes the task less interesting. Let me be clear, I am not calling out Prof. Schütze, for whom I have great respect—and the second half of his talk was very impressive—but I challenge researchers working at massively multilingual scale to think of tasks really worth doing!

We’ve always been at war with Eurasia

I saw at least two pro-Ukraine papers, both focused on the media environment (e.g., propaganda detection). I also saw a paper about media laws in Taiwan that raised some ethical concerns for me. It seems this may be one of those countries where truth is not a defense against charges of libel, and the application was helping the police enforce that illiberal policy. However, I am not at all knowledgeable about the political situation there and found their task explanation somewhat hard to follow, presumably because of my Taiwanese political illiteracy.

My papers

Adam Wiemerslage presented a paper coauthored with me and Katharina von der Wense in which we propose model-agnostic metrics for measuring hyperparameter sensitivity, the first of their kind. We then use these metrics to show that, at least for the character-scale transduction problems we study (e.g., grapheme-to-phoneme conversion and morphological generation), LSTMs really are less hyperparameter-sensitive than transformers, not to mention more accurate when properly tuned. (Our tuned LSTMs turn in SOTA performance on most of the languages and tasks.) I thought this was a very neat paper, but it didn’t get much burn from the audience.

I presented a paper coauthored with Cyril Allauzen describing a new algorithm for shortest-string decoding that makes fewer assumptions. Indeed, it allows one for the first time to efficiently decode traditional weighted finite automata trained with expectation maximization (EM). This was exciting to me because this is a problem that has bedeviled me for over 15 years, ever since I first noticed the conceptual gap. <whine>The experience getting this to press was a great frustration to me, however. It was first desk-rejected at a conference on grammatical inference (i.e., people who study things like formal language learning) on the grounds that it was too applied. On the other hand, the editors at TACL desk-rejected a draft of the paper on the grounds that no one does EM anymore, and didn’t respond when I pointed out that there were in fact two papers in the ACL 2023 main session about EM. So we submitted it to ARR. The first round of reviews was not much more encouraging. It was clear that these reviewers did not understand the important distinction between the shortest path and shortest string, even though the paper was almost completely self-contained, and were perhaps annoyed at being asked to read mathematics (even if it’s all basic algebra). One reviewer even dared to ask why one would bother, as we do, to prove that our algorithm is correct! To the area chair’s credit, they found better reviewers for the second round, and to those reviewers’ credit, they helped us improve the quality of the paper. However, the first question I got in the talk was basically a heckler asking why I’d bother to submit this kind of work to an ACL venue. Seriously though, where else should I have submitted it? It’s sound work.</whine>

“Segmented languages”

In a recent paper (Gorman & Sproat 2023), we complain about conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like “right-to-left language”, “syllabic language”, or “ideographic language” found in the literature. Thus we were surprised to find the following:

Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate (CER), instead of WER… (Gemini Team 2023:18)

Since the most salient feature of the writing systems used to write Mandarin, Japanese, Korean, and Thai is the absence of segmentation information (e.g., whitespace used to indicate word boundaries), presumably the authors mean to say that the data they are using has already been pre-segmented (by some unspecified means). But this is not a property of these languages, but rather of the available data.

[h/t: Richard Sproat]


Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint 2312.11805. URL:

Gorman, K. and Sproat, R. 2023. Myths about writing systems in speech & language technology. In Proceedings of the Workshop on Computation and Written Language, pages 1-5.

Optionality as acquirendum

A lot of work deals with the question of acquiring “optional” or “variable” grammatical rules, and my impression is that different communities are mostly talking at cross-purposes. I discern at least three ways linguists conceive of optionality as something which the child must acquire.

  1. Some linguists assume—I think without much evidence—that optionality is mere “free variation”, so that the learner simply needs to infer which rules bear a binary [optional] feature. This is an old idea, going back to at least Dell (1981); Rasin et al. (2021:35) explicitly state the problem in this form.
  2. Variationist sociolinguists focus on the differential rates at which grammatical rules apply. They generally recognize the acquirenda as essentially conditional probability distributions which give the probability of rule application in a given grammatical context. Bill Labov is a clear avatar of this strain of thinking (e.g., Labov 1989). David Adger and colleagues have attempted to situate this within modern syntactic frameworks (e.g., Adger 2006).
  3. Some linguists believe that optionality is not statable within a single grammar, and must reflect competing grammars. The major proponent of this approach is Anthony Kroch (e.g., Kroch 1989). While this conception might license some degree of “nihilism” about optionality, it also has led to some interesting work which hypothesizes substantive grammar-internal constraints on variation, as in the work of Laurel MacKenzie and colleagues (e.g., MacKenzie 2019). This work is also very good at ridding (2) of some of its unfortunate “externalist” thinking.

I have to reject (1) as overly simplistic. I find (2) and (3) both compelling in some way, but a lot of work remains to synthesize or adjudicate between them.


Adger, D. 2006. Combinatorial variability. Journal of Linguistics 42(3): 503-530.
Dell, F. 1981. On the learnability of optional phonological rules. Linguistic Inquiry 12(1): 31-37.
Kroch, A. 1989. Reflexes of grammar in patterns of language change. Language Variation & Change 1(1): 199-244.
Labov, W. 1989. The child as linguistic historian. Language Variation & Change 1(1): 85-97.
MacKenzie, L. 2019. Perturbing the community grammar: Individual differences and community-level constraints on sociolinguistic variation. Glossa 4(1): 28.
Rasin, E., Berger, I., Lan, N., Shefi, I., and Katzir, R. 2021. Approaching explanatory adequacy in phonology using Minimum Description Length. Journal of Language Modelling 9(1): 17-66.

Kill yr darlings…

…or at least make them more rigorous.

In the field of computational phonology, there were three mid-pandemic articles that presented elaborate computational “theories of everything” in phonology: Ellis et al. (2022), Rasin et al. (2021), and Yang & Piantadosi (2022).1 I am quite critical of all three offerings. All three provide computational models evaluated for their ability to acquire phonological patterns (with varying amounts of overheated rhetoric about what this means for generative grammar), and in each case, there is an utter lack of rigor. None of the papers prove, or even conjecture, anything hopeful or promising about the computational complexity of the proposed models, how long they take to converge (or if they do), or whether there is any bound on the kinds of mistakes the models might make once they converge. What they do instead is demonstrate that the models produce satisfactory results on toy problem sets. One might speculate that these three papers are the result of lockdown-era hyperfocus on thorny passion projects. But I think it’s unfortunate that the authors (and doubly so the reviewers and editors) considered these projects complete before providing a formal characterization of the proposed models’ substantive properties.2 By stating this critique here, I hopefully commit myself to align my actions with my values in my future work, and I challenge the aforementioned authors to study these properties.


  1. To be fair, Yang and Piantadosi claims to be a theory of not just phonology…
  2. I am permitted to state that I reviewed one of these papers—my review was “signed” and made public, along with the paper—and my review was politely negative. However, it was clear to me that the editor and other reviewers had a very high opinion of this work and there was no reason for me to fight the inevitable.


Ellis, K., Albright, A., Solar-Lezama, A., Tenenbaum, J. B., and O’Donnell, T. J. 2022. Synthesizing theories of human language with Bayesian program induction. Nature Communications 2022:1–13.
Rasin, E., Berger, I., Lan, N., Shefi, I. and Katzir, R. 2021. Approaching explanatory adequacy in phonology using Minimum Description Length. Journal of Language Modelling 9:17–66.
Yang, Y. and Piantadosi, S. T. 2022. One model for the learning of language. Proceedings of the National Academy of Sciences 119:e2021865119.