SPE & Lakoff on exceptionality

Recently I have attempted to review and synthesize different theories of what we might call lexical (or morpholexical or morpheme-specific) exceptionality. I am deliberately ignoring accounts that take this to be a property of segments via underspecification (or in a few cases, pre-specification, usually of prosodic-metrical elements like timing slots or moras), since I have my own take on that sort of thing under review now. Some takeaways from my reading thus far:

  • This is an understudied and undertheorized topic.
  • At the same time, it seems at least possible that some of these theories are basically equivalent.
  • Exceptionality and its theories play only a minor role in adjudicating between competing theories of phonological or morphological representation, despite their obvious relevance.
  • Also despite their obvious relevance, theories of exceptionality make little contact with theories of productivity and defectivity.

Since most of the readings are quite old, I will include PDF links when I have a digital copy available.

Today, I’m going to start off with Chomsky & Halle’s (1968) Sound Pattern of English (SPE), which has two passages dealing with exceptionality: §4.2.2 and §8.7. While I attempt to summarize these two passages as if they are one, they are not fully consistent with one another and I suspect they may have been written at different times or by different authors. Furthermore, it seemed natural for me to address, in this same post, some minor revisions proposed by Lakoff (1970: ch. 2). Lakoff’s book is largely about syntactic exceptionality, but the second chapter, in just six pages, provides important revisions to the SPE system. I myself have also taken some liberties filling in missing details.

Chomsky & Halle give a few examples of what they have in mind when they mention exceptionality. There is in English a rule which laxes vowels before consonant clusters, as in convene/conven+tion or (more fancifully) wide/wid+th. However, this generally does not occur when the consonant cluster is split by a “#” boundary, as in restrain#t.1 The second, and more famous, example involves the trisyllabic shortening of the sort triggered by the -ity suffix. Here laxing also occurs (e.g., serene/seren+ity, obscene/obscen+ity) though not in obese/obes+ity (obese being a backformation).2 As Lakoff (loc. cit.:13) writes of this example, “[n]o other fact about obese is correlated to the fact that it does not undergo this rule. It is simply an isolated fact.” Note that both of these examples involve underapplication, and the latter passage gives more obesity-like examples from Lightner’s phonology of Russian, where one rule applies only to “Russian” roots and another only to “Church Slavonic” roots.

SPE supposes that, by default, there is a feature associated with each rule. So, for instance, if there is a rule R, there exists a feature [±R] as well. A later passage likens these to features for syntactic category (e.g., [+Noun]), intrinsic morpho-semantic properties like animacy, declension or conjugation class features, and the lexical strata features introduced by Lees or Lightner in their grammars of Turkish and Russian. SPE imagines that URs may bear values for [R]. The conventions are then:

(1) Convention 1: If a UR is not specified [-R], introduce [+R] via redundancy rule.
(2) Convention 2: If a UR is [αR], propagate feature specification [αR] to each of its segments via redundancy rule.
(3) Convention 3: A rule R does not apply to segments which are [-R].

Returning to our two examples above, SPE proposes that obese is underlyingly [−Trisyllabic Shortening], which accounts for the lack of shortening in obesity. They also propose rules which insert these minus-rule features in the course of the derivation; for instance, it seems they imagine that the absence of laxing in restraint is the result of a rule like V → {−Laxing} / _ C#C, with a phonetic-morphological context.
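To make the bookkeeping concrete, here is a toy sketch of how Conventions 1-3 derive the serene/obese contrast. All of the names are invented, and the phonological content of the rule itself is abstracted away; only the diacritic machinery is modeled:

    RULE = "TrisyllabicShortening"

    def apply_conventions(morpheme):
        """Conventions 1 and 2: unless the UR is marked [-R], add [+R],
        then copy the morpheme-level value onto each segment."""
        value = morpheme.setdefault("rule_features", {}).setdefault(RULE, "+")
        for segment in morpheme["segments"]:
            segment[RULE] = value
        return morpheme

    def undergoes(segment):
        """Convention 3: rule R does not apply to a segment marked [-R]."""
        return segment.get(RULE, "+") == "+"

    serene = apply_conventions({"segments": [{"symbol": s} for s in "seren"]})
    obese = apply_conventions({"segments": [{"symbol": s} for s in "obes"],
                               "rule_features": {RULE: "-"}})

    def stem_vowel(morpheme):
        """Pretend the rule's target is the last vowel of the stem."""
        return [s for s in morpheme["segments"] if s["symbol"] in "aeiou"][-1]

    print(undergoes(stem_vowel(serene)))  # True: seren+ity undergoes shortening.
    print(undergoes(stem_vowel(obese)))   # False: obes+ity is exceptional.

Nothing here hinges on whether the diacritic lives on the morpheme or on its segments, which is precisely the issue taken up below.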

Subsequent work in the theory of exceptionality has mostly considered cases like obesity, in which the rule features are present underlyingly; with one exception, discussed below, the restraint-type analysis, in which rule features are introduced during the derivation, does not seem to have been further studied. It seems to me that the possibility of introducing minus-rule features in a certain phonetic context could be used to derive a rule that applies to unnatural classes. For example, imagine an English rule (call it Tensing) which tenses a vowel in the context of the anterior nasals {m, n} and the voiceless fricatives {f, θ, s, ʃ} but not voiced fricatives like {v, ð}.3 Under any conventional feature system, there is no natural class which includes {m, n, f, θ, s, ʃ} but not also {ŋ, v}, etc. However, one could derive the desired disjunctive effect by introducing a −Tensing specification when the vowel is followed by a dorsal, or by a voiced fricative. This might look something like this:

(4) No Tensing 1: [+Vocalic] → {−Tensing} / _ [+Dorsal]
(5) No Tensing 2: [+Vocalic] → {−Tensing} / _ [+Voice, +Obstruent, +Continuant]

This could continue for a while. For instance, I implied that Tensing does not apply before a stop, so we could insert a -Tensing specification when the following segment is [+Obstruent, -Continuant], or we could do something similar with a following oral sonorant, and so on. Then, the actual Tensing rule would need little (or even no) phonetic conditioning.
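A toy simulation shows that the cumulative effect is exactly the unnatural class {m, n, f, θ, s, ʃ}. The feature table is deliberately simplified and of my own devising, and the last two blocking contexts stand in for the additional rules just alluded to:

    # Deliberately simplified, illustrative feature table.
    FEATURES = {
        "m": dict(dorsal=0, voice=1, obstruent=0, continuant=0, nasal=1),
        "n": dict(dorsal=0, voice=1, obstruent=0, continuant=0, nasal=1),
        "ŋ": dict(dorsal=1, voice=1, obstruent=0, continuant=0, nasal=1),
        "f": dict(dorsal=0, voice=0, obstruent=1, continuant=1, nasal=0),
        "θ": dict(dorsal=0, voice=0, obstruent=1, continuant=1, nasal=0),
        "s": dict(dorsal=0, voice=0, obstruent=1, continuant=1, nasal=0),
        "ʃ": dict(dorsal=0, voice=0, obstruent=1, continuant=1, nasal=0),
        "v": dict(dorsal=0, voice=1, obstruent=1, continuant=1, nasal=0),
        "ð": dict(dorsal=0, voice=1, obstruent=1, continuant=1, nasal=0),
        "p": dict(dorsal=0, voice=0, obstruent=1, continuant=0, nasal=0),
        "k": dict(dorsal=1, voice=0, obstruent=1, continuant=0, nasal=0),
        "l": dict(dorsal=0, voice=1, obstruent=0, continuant=1, nasal=0),
    }

    def matches(segment, **specification):
        return all(FEATURES[segment][f] == v for f, v in specification.items())

    # Contexts in which [-Tensing] is inserted: (4), (5), and the further
    # rules for stops and oral sonorants alluded to above.
    blocking_contexts = [
        dict(dorsal=1),                            # (4)
        dict(voice=1, obstruent=1, continuant=1),  # (5)
        dict(obstruent=1, continuant=0),           # stops
        dict(obstruent=0, nasal=0),                # oral sonorants
    ]

    # Tensing then applies before whatever is left over.
    survivors = {seg for seg in FEATURES
                 if not any(matches(seg, **c) for c in blocking_contexts)}
    print(survivors == {"m", "n", "f", "θ", "s", "ʃ"})  # True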

To put it another way, rules like these allow a rule to apply to a set of segments which cannot be formed conjunctively from features, but can be formed via set difference.4 Is this undesirable? Is it logically distinct from the desirable “occluding” effect of bleeding in regular plural and past tense allomorphy in English (see Volonec & Reiss 2020:28f.)? I don’t know. The latter SPE passage seems to suggest it is undesirable: “…we have not found any convincing example to demonstrate the need for such rules [like my (4-5)–KBG]. Therefore we propose, tentatively, that rules such as [(4-5)], with the great increase in descriptive power that they provide, not be permitted in the phonology.” (loc. cit.: 375). They propose instead that only readjustment rules should be permitted to introduce rule features; otherwise rule feature specifications must be underlyingly present or introduced via redundancy rule.

As far as I can see, SPE does not give any detailed examples in which rule feature specifications are introduced via rule. Lakoff, however, does argue for this device. There are rules which seem to apply to only a subset of possible contexts; one example given is the umlaut-type plurals in English like foot/feet or goose/geese. Later in the book (loc. cit.: 126, fn. 59) the rules which generate such pairs are referred to as minor rules. Let us call the English umlauting rule simply Umlaut. Lakoff notes that if one simply applies the above conventions naïvely, it will be necessary to mark a huge number of nouns—at the very least, all nouns which have a [+Back] monophthong in the final syllable and which form a non-umlauting plural—as [-Umlaut]. This, as Lakoff notes, would wreak havoc on the feature-counting evaluation metric (see §8.1), and would treat what we intuitively recognize as exceptionality (forming an umlauting plural in English) as “more valued” than non-exceptionality. Even if one does not necessarily subscribe to the SPE evaluation metric, one may still feel that this has failed to truly encode the productivity distinction between minor rules and major rules that have exceptions. To address this, Lakoff proposes that there is another rule which introduces [-Umlaut], and that this rule (call it No Umlaut) applies immediately before Umlaut. Morphemes which actually undergo Umlaut are underlyingly [-No Umlaut]. Thus the UR of a noun with an umlauting plural, like foot, will be specified [-No Umlaut], and it will not undergo a rule like the following:

(6) No Umlaut: [ ] → {-Umlaut}

However, a noun with a regular plural, like juice, will undergo this rule, and thus the Umlaut rule will not apply to it, because it was marked [-Umlaut] by (6).
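For concreteness, here is the resulting double negation as a toy derivation (the function and variable names are mine, not Lakoff’s):

    def plural_type(noun, umlaut_exception=False):
        # Convention 1: default to [+R] for both rules; only exceptions to
        # No Umlaut (i.e., umlauting nouns like foot) carry a lexical mark.
        features = {"NoUmlaut": "-" if umlaut_exception else "+", "Umlaut": "+"}
        # Rule (6), ordered before Umlaut: if No Umlaut applies, it marks
        # the morpheme [-Umlaut], bleeding the Umlaut rule.
        if features["NoUmlaut"] == "+":
            features["Umlaut"] = "-"
        plural = "umlauting" if features["Umlaut"] == "+" else "regular"
        return f"{noun}: {plural} plural"

    print(plural_type("foot", umlaut_exception=True))  # foot: umlauting plural
    print(plural_type("juice"))                        # juice: regular plural

Only foot carries a lexical mark, so the evaluation metric now charges the exceptional noun, not the thousands of regular ones.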

One critique is in order here. It is not clear to me why SPE introduces (what I have called) Convention 2; Lakoff simply ignores it and proposes an alternative version of Convention 3 where target morphemes, rather than segments, must be [+R] to undergo rule R. Of his proposal, he writes: “This system makes the claim that exceptions to phonological rules are morphemic in nature, rather than segmental.” (loc. cit., 18) This claim, while not necessarily its 1970-era implementation, is very much in vogue today. There are some reasons to think that Convention 2 introduces unnecessary complexities, which I’ll discuss in a subsequent post. One example (SPE:374) makes it clear that for Chomsky & Halle, Convention 3 requires that, for rule R, the target be [+R], but later on, they briefly consider what if anything happens if segments in the environment (i.e., in the structural description but not the target) are [-R].5 They claim (loc. cit., 375) there are problems with allowing [-R] specifications in the environment to block application of R, but give no examples. To me, this seems like an issue created by Convention 2, when one could simply reject it and keep the rule features at the morpheme level.

I have since discovered that McCawley (1974:63) gives more or less the same critique of this convention in his review of SPE.

A correction: after rereading Zonneveld, I think Lakoff misrepresents the SPE theory slightly, and I repeated his misrepresentation. Lakoff writes that the SPE theory could have phonological rules that introduce minus-rule features. In fact, C&H say (374-375) that they have found no compelling examples of such rules and that they “propose, tentatively” that such rules “not be permitted in the phonology”; any such rules must be readjustment rules, which are assumed to precede all phonological rules. This means that (4-5) are probably ruled out. Lakoff’s mistake may reflect the fact that the 1970 book is a lightly-adapted version of his 1965 dissertation, for which he drew on a pre-publication version of SPE.

[This post, then, is the first in a series on theories of lexical exceptionality.]

Endnotes

  1. The modern linguist would probably not regard words like restraint as subject to this rule at all. Rather, they would probably assign #t to the “word” stratum (equivalent to the earlier “Level 2”) and place the shortening rule in the “stem” stratum (roughly equivalent to “Level 1”). Arguably, C&H have stated this rule more broadly than is strictly necessary to make the point.
  2. It is said that the exceptionality of this pair reflects its etymology: obese was backformed from the earlier obesity. I don’t really see how this explains anything synchronically, though.
  3. This is roughly the context in which Philadelphia short-a is tense, though the following consonant must be tautosyllabic and tautomorphemic with the vowel. Philadelphia short-a is, however, not a great example since it’s not at all clear to me that short-a tensing is a synchronic process.
  4. Formally, the set in question is something like [−Dorsal] ∖ [+Voice, +Consonantal, +Continuant, −Nasal].
  5. This issue is taken up in more detail by Kisseberth (1970); I’ll review his proposal in a subsequent post.

References

Chomsky, N. and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Kisseberth, C. W. 1970. The treatment of exceptions. Papers in Linguistics 2: 44-58.
Lakoff, G. 1970. Irregularity in Syntax. Holt, Rinehart and Winston.
McCawley, J. D. 1974. Review of Chomsky & Halle (1968), The Sound Pattern of English. International Journal of American Linguistics 40: 50-88.

Linguistic relativity and i-language

Elif Batuman’s autofiction novel The Idiot follows Selin, a Harvard freshman in the mid-1990s. Selin initially declares her major in linguistics and describes two of her classes in some detail. One is taught by a soft-spoken professor who is said to be passionate about Turkic phonetics (no clue who this might be: anybody?) and the other by an Italian semanticist who wears beautiful suits (maybe this is Gennaro Chierchia; not sure). Selin is taken aback by the stridency with which her professor (presumably the Turkic phonetician) rails against the Sapir-Whorf hypothesis—she regrets how the professor repeatedly mentions Whorf’s day job as a fire prevention specialist—and finds linguistic relativity so intuitive that she changes her major at the end of the book.

Batuman is not the only person to draw a connection between rejection of the stronger forms of the Sapir-Whorf hypothesis and generativism. Here’s the thing though: there is no real connection between these two ideas! Generativism has no particular stance on any of this. The only connection I see between these two ideas is that, when you adopt the i-language view, you simply have more interesting things to study. If you truly understand, say, poverty of the stimulus arguments, you just won’t feel the need to entertain intuitive-popular views of language because you’ll recognize that the human condition vis-à-vis language is much richer and much stranger than Whorf ever imagined.

Representation vs. explanation?

I have often wondered whether detailed representational formalism is somehow in conflict with genuine explanation in linguistics. I have been tangentially involved in the cottage industry that is applying the Tolerance Principle (Yang 2005, 2016) to linguistic phenomena, most notably morphological defectivity. In our paper on the subject (Gorman & Yang 2019), we are admittedly somewhat nonchalant about the representations in question, a nonchalance which is, frankly, characteristic of this microgenre.

In my opinion, however, our treatment of Polish defectivity is representationally elegant. (See here for a summary of the data.) In this language, fused case/number suffixes show suppletion based on the gender—in the masculine, animacy—of the stem, and there is lexically conditioned suppletion between -a and -u, the two allomorphs of the gen.sg. for masculine inanimate nouns. To derive defectivity, all we need to show is that Tolerance predicts that, in the masculine inanimate, there is no default suffix to realize the gen.sg. If there are two realization rules in competition, we can implement this by making both of them lexically conditioned, and leaving nouns which are defective in the gen.sg. off both lexical “lists”. We can even imagine, in theories with late insertion, that the grammatical crash is the result of uninterpretable gen.sg. features which are, in defective nouns, still present at LF.1
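For concreteness, the Tolerance calculation runs something like the following sketch; the counts are invented for illustration, not the ones from the paper.

    from math import log

    def tolerates(n: int, exceptions: int) -> bool:
        """Yang's threshold: a rule over n items is productive iff its
        exceptions do not exceed n / ln n."""
        return exceptions <= n / log(n)

    # Invented counts, purely for illustration.
    n_masc_inan = 1000               # masculine inanimate nouns
    n_a, n_u = 450, 550              # nouns taking -a and -u in the gen.sg.

    # If -a were the default rule, every -u noun would be an exception,
    # and vice versa; neither rule is tolerated, so there is no default.
    print(tolerates(n_masc_inan, n_masc_inan - n_a))  # False
    print(tolerates(n_masc_inan, n_masc_inan - n_u))  # False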

It is useful to contrast this with our less-elegant treatment of Spanish defectivity in the same paper. (See here for a summary of the data.) There we assume that there is some kind of grammatical competition for verbal stems between the rules that might be summarized as “diphthongize a stem vowel when stressed” and “do not change”. We group the two types of diphthongization (o to ue [we] and e to ie [je]) as a single change, even though it is not trivial to make these into a single change.2 This much at least has a venerable precedent, but what does it mean to treat diphthongization as a rule in the first place? The same tradition tends to treat the propensity to diphthongize as a phonological (i.e., perhaps via underspecification or prespecification, à la Harris 1985) or morphophonological property of the stem (a lexical diacritic à la Harris 1969, or competition between pseudo-suppletive stems à la Bermúdez-Otero 2013), and the phonological content of a stem is presumably stored in the lexicon, and not generated by any sort of rule.3 Rather, our Tolerance analysis seems to imply we have thrown in our lot with Albright and colleagues (Albright et al. 2001, Albright 2003) and Bybee & Pardo (1981), who analyze diphthongization as a purely phonological rule depending solely on the surface shape of the stem. This is despite the fact that we are bitterly critical of these authors for other reasons,4 and I would have preferred—aesthetically at least—to adopt an analysis where diphthongization is a latent property of particular stems.

At this point, I could say, perhaps, that the data—combined with our theoretical conception of the stem inventory portion of the lexicon as a non-generative system—is trying to tell me something about Spanish diphthongization, namely that Albright, Bybee, and colleagues are onto something, representationally speaking. But, compared with our analysis of Polish, it is not clear how these surface-oriented theories of diphthongization might generate grammatical crash. Abstracting from the details, Albright (2003) imagines that there are a series of competing rules for diphthongization, whose “strength” derives from the number of exemplars they cover. In his theory, the “best” rule can fail to apply if its strength is too low, but he does not propose any particular threshold and as we show in our paper, his notion of strength is poorly correlated with the actual gaps. Is it possible our analysis is onto something if Albright, Bybee, and colleagues are wrong about the representational basis for Spanish diphthongization?

Endnotes

  1. This case may still be a problem for Optimality Theory-style approaches to morphology, since Gen must produce some surface form.
  2. I don’t have the citation in front of me right now, but I believe J. Harris originally proposed that the two forms of diphthongization can be united insofar as both of them can be modeled as insertion of e triggering glide formation of the preceding mid vowel.
  3. For the same reason, I don’t understand what morpheme structure constraints are supposed to do exactly. Imagine, fancifully, that you had a mini-stroke and the lesion it caused damaged your grammar’s morpheme structure rule #3. How would anyone know? Presumably, you don’t have any lexical entries which violate MSC #3, and adults generally do not make up new lexical entries for the heck of it.
  4. These have to do with what we perceive as the poor quality of their experimental evidence, to be fair, not their analyses.

References

Albright, A., Andrade, A., and Hayes, B. 2001. Segmental environments of Spanish diphthongization. UCLA Working Papers in Linguistics 7: 117-151.
Albright, A. 2003. A quantitative study of Spanish paradigm gaps. In Proceedings of the 22nd West Coast Conference on Formal Linguistics, pages 1-14.
Bermúdez-Otero, R. 2013. The Spanish lexicon stores stems with theme vowels, not roots with inflectional class features. Probus 25: 3-103.
Bybee, J. L. and Pardo, E. 1981. On lexical and morphological conditioning of alternations: a nonce-probe experiment with Spanish verbs. Linguistics 19: 937-968.
Gorman, K. and Yang, C. 2019. When nobody wins. In F. Rainer, F. Gardani, H. C. Luschützky and W. U. Dressler (eds.), Competition in Inflection and Word Formation, pages 169-193. Springer.
Harris, J. W. 1969. Spanish Phonology. MIT Press.
Harris, J. W. 1985. Spanish diphthongisation and stress: a paradox resolved. Phonology 2: 31-45.

Automatic batch sizing

Yoyodyne is my lab’s sequence-to-sequence library, intended to be a replacement for Fairseq, which is (essentially) abandonware. One matter of urgency for me in building Yoyodyne was to enable automatic hyperparameter tuning. This was accomplished by logging results to Weights & Biases (W&B). We can perform a random or Bayesian hyperparameter sweep using a “grid” specified via a YAML file, monitor progress on the W&B website, or even hit the API to grab the best hyperparameters. One issue that kept coming up, however, is that it is easy to hit out-of-memory (OOM) errors during this process. Here’s what we did about it:
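For the curious, the W&B side of this is generic; abstracting away from the Yoyodyne-specific wiring (the project, entity, metric, and parameter names below are placeholders), a sweep looks something like this:

    import wandb

    sweep_config = {
        "method": "bayes",  # or "random"
        "metric": {"name": "val_accuracy", "goal": "maximize"},
        "parameters": {
            "learning_rate": {"values": [1e-4, 1e-3, 1e-2]},
            "dropout": {"values": [0.1, 0.2, 0.3]},
        },
    }

    def train():
        run = wandb.init()
        # ... build and train a model using run.config, logging the metric ...
        wandb.log({"val_accuracy": 0.0})  # placeholder

    sweep_id = wandb.sweep(sweep_config, project="my-project")
    wandb.agent(sweep_id, function=train, count=20)

    # Afterwards, hit the API for the best run's hyperparameters.
    best = wandb.Api().sweep(f"my-entity/my-project/{sweep_id}").best_run()
    print(best.config)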

OOMs are not purely due to model size: the model, batch, and gradients all need to fit into the same VRAM. PyTorch Lightning, which is a key part of the Yoyodyne backend, provides a function for automatically determining the maximum batch size that will not trigger an OOM. Basically, it works by starting with a low batch size (by default, 2), randomly drawing three batches of that size, and then attempting training (but in fact caching parameters so that no real training occurs). If this does not trigger an OOM, it doubles the batch size, and so on.1,2 You can enable this approach in Yoyodyne using the flag --find_batch_size max. You’d want to use this if you believe that a giant batch size is fine and you just want to fully saturate your GPU.
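Called directly (outside of Yoyodyne), and assuming a recent Lightning version with the Tuner API, the batch-size finder looks roughly like this; model here stands in for whatever LightningModule exposes a batch_size attribute:

    from lightning.pytorch import Trainer
    from lightning.pytorch.tuner import Tuner

    model = ...  # any LightningModule exposing a `batch_size` attribute

    trainer = Trainer(devices=1)
    tuner = Tuner(trainer)
    # Starts at init_val, runs steps_per_trial batches at each size, and
    # keeps doubling until it hits an OOM (or max_trials); the largest
    # size that fit is returned and written back to `model.batch_size`.
    max_batch_size = tuner.scale_batch_size(
        model, mode="power", init_val=2, steps_per_trial=3)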

A slightly more sophisticated version of this, useful when you actually want to tune batch size, is enabled with the flag --find_batch_size opt. This again begins by doubling the size of randomly drawn batches, but here it halts once the doubling exceeds the value of the --batch_size flag. If the max batch size is larger than the requested size, the requested size is used as is; thus this acts as a soft check against OOMs. If, however, the max batch size is smaller than --batch_size, it instead solves for a new batch size: the largest batch size which does not exceed the max and which is a divisor of --batch_size. It then enables multiple rounds of gradient accumulation per update,3 losslessly simulating the desired batch size while using as much of the available VRAM as possible. I can assure you this is a killer feature for neural network tuning.
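For concreteness, the divisor-solving step amounts to something like the following sketch (not the actual Yoyodyne code); the accumulation factor it returns can then be handed to Lightning via the Trainer’s accumulate_grad_batches argument.

    def solve_batch_size(desired: int, max_fitting: int) -> tuple[int, int]:
        """Finds the largest divisor of `desired` that is <= `max_fitting`,
        plus the corresponding number of gradient accumulation steps."""
        for size in range(min(desired, max_fitting), 0, -1):
            if desired % size == 0:
                return size, desired // size
        raise ValueError("batch size must be positive")

    # E.g., if --batch_size is 96 but only 40 examples fit in VRAM, train on
    # batches of 32 and accumulate gradients over 3 steps (32 × 3 = 96).
    print(solve_batch_size(96, 40))  # (32, 3)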

Endnotes

  1. This is a little imprecise, and one can refine it by doing a binary search, but in practice it’s not worth the effort when working with ragged data.
  2. Whatever batch size was requested with the --batch_size flag is ignored.
  3. More formally, given desired batch size $b$ and a max batch size $n'$, it finds the smallest integer $a$ and the largest integer $n$ such that $an = b$ and $n \leq n'$. This is computed via brute force; my implementation of an elegant solution based on the prime factorization was a bit slower.

An interesting semantic change: “raw dogging”

The term raw-dogging is a slightly-obscene, slangy term for engaging in unprotected sex, often used to celebrate that occasionally-risky behavior. However, this term has undergone an interesting semantic change in the last five or so years. I think the actuator of this chain of events is prolific Twitter user @jaboukie:

This is a straightforward, jocular, semantic extension, generalizing the sense of danger associated with unprotected sex to life itself. In its wake (it was a very popular tweet), I also saw a tweet about “raw dogging” to refer to riding the subway without headphones or sunglasses. Years later, I read a blind item about a US senator flying commercially from the States to Israel; apparently, according to his seat mate, during the long flight he didn’t listen to music or podcasts, read, check email, nap, or watch a movie: he just…sat there, for hours and hours, like an absolute maniac. I haven’t been able to find this story, and I don’t remember whether it referred to raw-dogging, but I have since seen several stories discussing raw-dogging flights (e.g., this recent one in GQ). Discussions of raw-dogging in the commercial aviation sense largely recognize the act’s covert prestige: it is recognized as a curious and difficult feat, one associated with machismo and/or maleness. The GQ article also quotes individuals who refer to stimulation-free commercial flying as barebacking, a term which traditionally refers to unprotected anal sex between men. (In contrast, raw-dogging in its original sense does not specify the sex act beyond some form of genital-genital penetration, nor does it specify the gender or sexual orientation of the participants.)

“Indic” considered harmful

Indic is an adjective referring to the Indo-Aryan languages such as Hindi-Urdu or Bengali. These languages are spoken mostly in the northern parts of India, as well as in Bangladesh, Pakistan, Sri Lanka, Nepal, and the Maldives. This term can be confusing, because hundreds of millions of people in the Indian subcontinent (and nearby island nations) speak non-Indic first languages: over 250 million people, particularly in the south of India and the north of Sri Lanka, speak Dravidian languages, which include Malayalam, Tamil, and Telugu. Austroasiatic, Tibeto-Burman, and Tai-Kadai languages, and many language isolates, are also spoken in India and the other nations of the subcontinent, as is English (and French, and Portuguese). Unfortunately, there is now a trend to use Indic to mean ‘languages of the subcontinent’. See here for a prominent example. This is a new sense for Indic, and while there is probably a need for such a lexeme to express the notion (language of India or subcontinental language would work), reusing Indic, which already has a distinct and well-established sense, just adds unnecessary confusion.

A minor syntactic innovation in English: “BE crazy”

I recently became aware of an English syntactic construction I hadn’t noticed before. It involves the predicate BE crazy, which itself is nothing new, but here the subject of that predicate is, essentially, quoted speech from a second party. I myself am apparently a user of this variant. For example, a friend told me of someone who describes themselves (on an online dating platform) as someone who …likes travel and darts, and I responded, simply, Likes darts is crazy. That is to say, I am making some kind of assertion that the description “likes darts”, or perhaps the speech act of describing oneself as such, is itself a bit odd. Now in this case, the subject is simply the quotation (with the “travel and” part elided), and while this forms a constituent, a tensed VP, we don’t normally accept tensed VPs as the subjects of predicates. And I suspect constituenthood is not even required. So this is distinct from the ordinary use of BE crazy with a nominal subject.

I suspect, though I do not have the means to prove, this is a relatively recent innovation; I hear it from my peers (i.e., those of similar age, not my colleagues at work, who may be older) and students, but not often elsewhere. I also initially thought it might be associated with the Mid-Atlantic but I am no longer so sure.

Your thoughts are welcome.

Vibe check: EACL 2024

I was honored to be able to attend EACL 2024 in Malta last month. The following is a brief, opinionated “vibe check” on NLP based on my experiences there. I had never been to an EACL, but it appealed to me because I’ve always respected the European speech & language processing community’s greater interest in multilingualism compared to what I’m familiar with in the US. And, because when or why else would I get to see Malta? The scale of EACL is a little more manageable than what I’m used to, and I was able to take in nearly every session and keynote. Beyond that, there wasn’t much difference. Here are some trends I noticed.

We’re doing prompt engineering, but we’re not happy about it

It’s hard to get a research paper out of prompt engineering. There really isn’t much to report, except the prompts used and the evaluation results. And, there doesn’t seem to be the slightest theory about how one ought to design a prompt, suggesting that the engineering part of the term is doing a lot of work. So, while I did see some papers (to be fair, mostly student posters) about prompt engineering, the interesting ones actually compared prompting against a custom-built solution.

There’s plenty of headroom for older technologies

I was struck by one of the demonstration papers, which used a fine-tuned BERT model for the actual user-facing behaviors, but an SVM or some other type of simple linear model, trained on the same data, to provide “explainability”. I was also struck by the many papers I saw in which fine-tuned BERT or some other kind of custom-built solution outperformed prompting.

Architectural engineering is dead for now

I really enjoy learning about new “architectures”, i.e., ways to frame speech and language processing problems as a neural network. Unfortunately, I didn’t learn about any new ones this year. I honestly think the way forward, in the long term, will be to identify and eliminate the less-principled parts of our modeling strategies, and replace them with “neat”, perhaps even proof-theoretic, solutions, but I’m sad to say this is not a robust area.

Massive multilingualism needs new application areas

In the first half of Hinrich Schütze’s keynote, he discussed a massively multilingual study covering 1,500 languages in all. That itself is quite impressive. However, I was less impressed with the tasks targeted. One was an LM-based task (predicting the next word, or perhaps a masked word), evaluated with “pseudo-perplexity”. I’m not sure what pseudo-perplexity is but real perplexity isn’t good for much. The other task was predicting, for each verse from the Bible, the appropriate topic code; these topics are things like “recommendation”, “sin”, “grace”, or “violence”. Doing some kind of semantic prediction, at the verse/sentence level, at such scale might be interesting, but this particular instantiation seems to me to be of no use to anyone, and as I understand it, the labels were projected from those given by English annotators, which makes the task less interesting. Let me be clear, I am not calling out Prof. Schütze, for whom I have great respect—and the second half of his talk was very impressive—but I challenge researchers working at massively multilingual scale to think of tasks really worth doing!

We’ve always been at war with Eurasia

I saw at least two pro-Ukraine papers, both focused on the media environment (e.g., propaganda detection). I also saw a paper about media laws in Taiwan that raised some ethical concerns for me. It seems this may be one of those countries where truth is not a defense against charges of libel, and the application was helping the police enforce that illiberal policy. However, I am not at all knowledgeable about the political situation there and found their task explanation somewhat hard to follow, presumably because of my Taiwanese political illiteracy.

My papers

Adam Wiemerslage presented a paper coauthored with me and Katharina von der Wense in which we propose model-agnostic metrics for measuring hyperparameter sensitivity, the first of their kind. We then use these metrics to show that, at least for the character-scale transduction problems we study (e.g., grapheme-to-phoneme conversion and morphological generation), LSTMs really are less hyperparameter-sensitive than transformers, not to mention more accurate when properly tuned. (Our tuned LSTMs turn in SOTA performance on most of the languages and tasks.) I thought this was a very neat paper, but it didn’t get much burn from the audience either.

I presented a paper coauthored with Cyril Allauzen describing a new algorithm for shortest-string decoding that makes fewer assumptions. Indeed, it allows one for the first time to efficiently decode traditional weighted finite automata trained with expectation maximization (EM). This was exciting to me because this is a problem that has bedeviled me for over 15 years, since I first noticed the conceptual gap. <whine>The experience getting this to press was a great frustration to me, however. It was first desk-rejected at a conference on grammatical inference (i.e., people who study things like formal language learning) on the grounds that it was too applied. On the other hand, the editors at TACL desk-rejected a draft of the paper on the grounds that no one does EM anymore, and didn’t respond when I pointed out that there were in fact two papers in the ACL 2023 main session about EM. So we submitted it to ARR. The first round of reviews was not much more encouraging. It was clear that these reviewers did not understand the important distinction between the shortest path and the shortest string, even though the paper was almost completely self-contained, and they were perhaps annoyed at being asked to read mathematics (even if it’s all basic algebra). One reviewer even dared to ask why one would bother, as we do, to prove that our algorithm is correct! To the area chair’s credit, they found better reviewers for the second round, and to those reviewers’ credit, they helped us improve the quality of the paper. However, the first question I got in the talk was basically a heckler asking why I’d bother to submit this kind of work to an ACL venue. Seriously though, where else should I have submitted it? It’s sound work.</whine>

“Segmented languages”

In a recent paper (Gorman & Sproat 2023), we complain about the conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like “right-to-left language”, “syllabic language”, or “ideographic language” found in the literature. Thus we were surprised to find the following:

Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate (CER), instead of WER… (Gemini Team 2024:18)

Since the most salient feature of the writing systems used to write Mandarin, Japanese, Korean, and Thai is the absence of segmentation information (e.g., whitespace used to indicate word boundaries), presumably the authors mean to say that the data they are using has already been pre-segmented (by some unspecified means). But this is not a property of these languages, but rather of the available data.

[h/t: Richard Sproat]

References

Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint 2312.11805. URL: https://arxiv.org/abs/2312.11805.

Gorman, K. and Sproat, R.. 2023. Myths about writing systems in speech & language technology. In Proceedings of the Workshop on Computation and Written Language, pages 1-5.