Translating lost languages using machine learning?

[The following is a guest post from my colleague Richard Sproat. This should go without saying, but: this post does not represent the opinions of anyone’s employer.]

In 2009 a paper appeared in Science by Rajesh Rao and colleagues that claimed to show, using “entropic evidence”, that the thus-far undeciphered Indus Valley symbol system was true writing and not, as colleagues and I had argued, a non-linguistic symbol system. Some other papers from Rao and colleagues followed, and there was also a paper in the Proceedings of the Royal Society by Rob Lee and colleagues that used a different “entropic” method to argue that symbols carved on stones by the Picts of Iron Age Scotland also represented language.

I, and others, were deeply skeptical (see e.g. here) that such methods could distinguish between true writing and symbol systems that, while having structure, encoded some sort of non-linguistic information. This skepticism was fed in part by our observation that completely random meaningless “symbol systems” could be shown to fall into the “linguistic” bin according to those measures. What, if anything, were such methods telling us about the difference between natural language and other systems that convey meaning? My skepticism led to a sequence of presentations and papers, culminating in this paper in Language, where I tried a variety of statistical methods, including those of the Rao and Lee teams, in an attempt to distinguish between samples of systems that were known to be true writing and systems known to be non-linguistic. None of these methods really worked, and I concluded that simple extrinsic measures based on the distribution of symbols, without knowing what the symbols denote, were unlikely to be of much use.
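To make the worry concrete, here is a minimal Python sketch of a bigram conditional entropy estimate, one common measure of this general kind. It is not the exact procedure of either the Rao or the Lee team, and the function name and the three toy “symbol systems” are invented purely for illustration.

```python
import math
import random
from collections import Counter

def conditional_entropy(seq):
    """Plug-in estimate of H(next | previous) in bits, from bigram counts."""
    bigrams = Counter(zip(seq, seq[1:]))
    contexts = Counter(seq[:-1])
    total = sum(bigrams.values())
    h = 0.0
    for (prev, nxt), n in bigrams.items():
        h -= (n / total) * math.log2(n / contexts[prev])
    return h

rng = random.Random(0)
symbols = list("abcdefgh")

# Maximally flexible: every symbol is equally likely after every other.
uniform = [rng.choice(symbols) for _ in range(5000)]
# Maximally rigid: a fixed cycle, so the next symbol is fully predictable.
rigid = symbols * 625
# Meaningless but skewed: independent draws from a Zipf-like distribution.
weights = [1 / (rank + 1) for rank in range(len(symbols))]
skewed = rng.choices(symbols, weights=weights, k=5000)

for name, seq in [("uniform", uniform), ("rigid", rigid), ("skewed", skewed)]:
    print(name, round(conditional_entropy(seq), 3))
```

On these toy inputs the rigid sequence scores zero, the uniform one scores close to the 3-bit maximum for eight symbols, and the meaningless skewed one falls in between, so an intermediate value by itself says little about whether a system encodes language.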

The upshot of this attempt at debunking Rao’s and Lee’s widely publicized work was that I convinced people who were already convinced and failed to convince those who were not. As icing on the cake, I was accused by Rao and Lee and colleagues of totally misrepresenting their work, which I most certainly had not done: indeed I was careful to consider all possible interpretations of their arguments, the problem being that their own interpretations of what they had done seemed to be rather fluid, changing as the criticisms changed; on the latter point see my reply, also in Language. This experience led me to pretty much give up the debunking business entirely, since people usually end up believing what they want to believe, and it is rare for people to admit they were wrong.

Still, there are times when one feels inclined to try to set the record straight, and one such instance is this recent announcement from MIT about work from Regina Barzilay and colleagues that purports to provide a machine-learning based system that “aims to help linguists decipher languages that have been lost to history.” The paper this press release is based on (to appear in the Transactions of the Association for Computational Linguistics) is of course more reserved than what the MIT public relations people produced, but is still misleading in a number of ways.

Before I get into that though, let me state at the outset that as with the work by Rao et al. and Lee et al. that I had critiqued previously, the issue here is not that Barzilay and colleagues do not have results, but rather what one concludes from their results. And to be fair, this new work is a couple of orders of magnitude more sophisticated than what Rao and his colleagues did.

In brief summary, Barzilay et al.’s approach is to take a text in an unknown ancient script, which may be unsegmented into words, along with phonetic transcriptions of a known language. In general the phonetic values of the unknown script are, well, not known, so candidate mappings are generated. (The authors also consider cases where some of the values are known, or can be guessed at, e.g. because the glyphs look like glyphs in known scripts.) The weights on the various mappings are learnable parameters, and the learning is also guided by phonological constraints such as assumed regularity of sound changes and rough preservation of the size of the phonemic inventory as languages change. (Of course, phoneme inventories can change a lot in size and details over a long history: Modern English has quite a different inventory from Proto-Indo-European. Still, since one’s best hope of a decipherment is to find languages that are reasonably closely related to the target, the authors’ assumption here may not be unreasonable.) The objective function for the learning aims to cover as much of the unknown text as possible while optimizing the quality of the extracted cognates. Their training is summarized in pseudocode on page 6 of their paper.

One can then compare the results of the algorithm when run with the unknown text, and a set of known languages, to see which of the known languages is the best model. The work is thus in many ways similar to earlier work by Kevin Knight and colleagues, which the present paper also cites.

In the experiments the authors used three ancient scripts: Ugaritic (12th century BCE), a close relative of Hebrew; Gothic, a 4th century CE East Germanic language that is also the earliest preserved Germanic tongue; and Iberian, a heretofore undeciphered script — or more accurately a collection of scripts — of the late pre-Common Era from the Iberian peninsula. (It is worth noting that Iberian was very likely to have been a mixed alphabetic-syllabic script, not a purely alphabetic one, which means that one is giving oneself a bit of a leg up if one bases one’s work on a transliteration of those texts into a purely alphabetic form.) The comparison known languages were Proto-Germanic, Old Norse, Old English, Latin, Spanish, Hungarian, Turkish, Basque, Arabic and Hebrew. (I note in passing that Latin and Spanish seem to be assigned by the authors to different language families!)

For Ugaritic, Hebrew came out as dramatically closer than other languages, and for Gothic, Proto-Germanic. For Iberian, no language was a dramatically better match, though Basque did seem to be somewhat closer. As they argue (p. 9):

The picture is quite different for Iberian. No language seems to have a pronounced advantage over others. This seems to accord with the current scholarly understanding that Iberian is a language isolate, with no established kinship with others.

“Scholarly understanding” may be an overstatement since the most one can say at this point is that there is scholarly disagreement on the relationships between the Iberian language(s) and known languages.

But, in any case, one problem is that since they only perform this experiment for three ancient scripts, two of which they are able to find clear relationships for, and the third not so clearly, it is not obvious what if anything one can conclude from this. The statistical sample is not such as to be overwhelming in its significance. Furthermore, in at least one case there is a serious danger of circularity: the closest match they find for Gothic is with Proto-Germanic, which shows a much better match than the other Germanic languages, Old Norse or Old English. But that is hardly surprising: Proto-Germanic reconstructions are heavily informed by Gothic, the earliest recorded example of a Germanic language. Indeed, if Gothic were truly an unknown language, and assuming that we had no access to a reconstructed protolanguage that depends in part on Gothic for its reconstruction, then we would be left with the two known Germanic languages in their set, Old English and Old Norse. This of course would be a more reasonable model in any case for the situation a real decipherer would encounter. But then the situation for Gothic becomes much less clear. Their Figure 4 plots various settings of their coverage threshold hyperparameter rcov against the obtained coverage. The more separated the curve for the language is above the rest, the better the method is able to distinguish the closest matched language from everything else. With this in mind, Hebrew is clearly a lot closer to Ugaritic than anything else. Iberian, as we noted, does not have a language that is obviously closest, though Basque is a contender. For Gothic, Proto-Germanic (PG) is a clear winner, but if one removes it, the closest two are now Old English (OE) and Old Norse (ON). Not bad, of course, but just eyeballing the plots, the situation is no longer as dramatic, and not clearly more dramatic than the situation for Iberian.

And as for Iberian, again, they note (p. 9) that “Basque somewhat stands out from the rest, which might be attributed to its similar phonological system with Iberian”. But what are they comparing against? Modern Basque is certainly different from its form 2000+ years ago, and indeed if one buys into recent work by Juliette Blevins, then Ancient Basque was phonologically quite a bit different from the modern language. Which in turn leaves one wondering what these results are telling us.

The abstract of the paper opens with the statement that:

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined.

Of course this is all perfectly true, but it rather understates the case when it comes to the real challenges faced in most cases of decipherment.

To wit:

Not only is the “closest … language” not usually known, but there may not even be a closest language. This appears to be the situation for Linear A where, even though there is a substantial amount of Linear A text, and the syllabary is very similar in appearance and was almost certainly the precursor to the deciphered Linear B, decipherment has remained elusive for 100 years in large measure because we simply do not know anything about the Eteocretan Language. It is also the situation for Etruscan. The authors of course claim their results support this conclusion for Iberian, and thereby imply that their method can help one decide whether there really is a closest language, and thus presumably whether it is worth wasting one’s time pursuing a given relationship. But as we have suggested above, the results seem equivocal on this point.

Even when it turns out that the text is in a language related to a known language, the way in which the script encodes that language may make the correspondences far less transparent than the known systems chosen for this paper. Gothic and Ugaritic are both segmental writing systems which presumably had a fairly straightforward grapheme-to-phoneme relation. And while Ugaritic is a “defective” writing system in that it fails to represent, e.g., most vowels, it is no different from Hebrew or Arabic in that regard. This makes it a great deal easier to find correspondences than, say, Linear B. Linear B was a syllabary, and it was a lousy way to write Greek. It failed to make important phonemic distinctions that Greek had, so that whereas Greek had a three-way voiced-voiceless-voiceless aspirate distinction in stops, Linear B for the most part could only represent place, not manner of articulation. It could not for the most part directly represent consonant clusters so that either these had to be broken up into CV units (e.g. knossos as ko-no-so) or some of the consonants ended up being unrepresented (e.g. sperma as pe-ma).

And all of this assumes the script was purely phonographic. Many ancient scripts, and all of the original independently invented scripts, included at least some amount of purely logographic (or, if you prefer, morphographic) and even semasiographic symbology, so that an ancient text was a mix of glyphs, some of which would relate to the sound, and others of which would relate to a particular morpheme or its meaning. And when sound was encoded, it was often quite unsystematic in the way in which it was encoded, certainly much less systematic than Gothic or Ugaritic were.

Then there is the issue of the amount of text available, which may be merely in the hundreds, or fewer, of tokens. And of course there are issues familiar in decipherment such as knowing when two glyphs in a pair of inscriptions that look similar to each other are indeed the same glyph, or not. Or as in the case of Mayan, where very different looking glyphs are actually calligraphic variants of the same glyph (see e.g. here in the section on “head glyphs”). The point here is that one often cannot be sure whether two glyphs in a corpus are instances of the same glyph, or not, until one has a better understanding of the whole system.

Of course, all of these might be addressed using computational methods as we gradually whittle away at the bigger problem. But it is important to stress that methods such as the one presented in this paper are really a very small piece in the overall task of decipherment.

We do need to say one more thing here about Linear B, since the authors of this paper claim that one of their previously reported systems (Luo, Cao and Barzilay, 2019) “can successfully decipher lost languages like … Linear B”. But if you look at what was done in that paper, they took a lexicon of Linear B words, and aligned them successfully to a nicely cleaned up lexicon of known Greek names noting, somewhat obliquely, that location names were important in the successful decipherment of Linear B. That is true, of course, but then again it wasn’t particularly the largely non-Greek Cretan place names that led to the realization that Linear B was Greek. One must remember that Michael Ventris, no doubt under the influence of Arthur Evans, was initially of the opinion that Linear B could not be Greek. It was only when the language that he was uncovering started to look more and more familiar, and clearly Greek words like ko-wo (korwos) ‘boy’ and i-qo (iqqos) ‘horse’ started to appear that the conclusion became inescapable. To simulate some of the steps that Ventris went through, one could imagine using something like the Luo et al. approach as follows. First guess that there might be proper names mentioned in the corpus, then use their algorithm to derive a set of possible phonetic values for the Linear B symbols, some of which would probably be close to being correct. Then use those along with something along the lines of what is presented in the newest paper to attempt to find the closest language from a set of candidates including Greek, and thereby hope one can extend the coverage. That would be an interesting program to pursue, but there is much that would need to be done to make it actually work, especially if we intend an honest experiment where we make as few assumptions as possible about what we know about the language encoded by the system. And, of course more generally this approach would fail entirely if the language were not related to any known language. 
In that case one would end up with a set of things that one could probably read, such as place names, and not much else — a situation not too dissimilar from that of Linear A. All of which is to say that what Luo et al. presented is interesting, but hardly counts as a “decipherment” of Linear B.

Of course Champollion is often credited with being the decipherer of Egyptian, whereas a more accurate characterization would be to say that he provided the crucial key to a process that unfolded over the ensuing century. (In contrast, Linear B was to a large extent deciphered within Ventris’ rather short lifetime — but then again Linear B is a much less complicated writing system than Egyptian.) If one were being charitable, then, one might compare Luo et al.’s results to those of Champollion, but then it is worth remembering that from that initial stage to a full decipherment of the system can still be a daunting task.

In summary, I think there are contributions in this work, and there would be no problem if it were presented as a method that provides a piece of what one would need in one’s toolkit if one wanted to (semi-) automate the process of decipherment. (In fact, computational methods have played thus far only a very minor role in real decipherment work, but one can hold out hope that they could be used more.) But everything apparently has to be hyped these days well beyond what the work actually does.

Needless to say, the press loves this sort of stuff, but are scientists mainly in the business of feeding exciting tidbits to the press? Apparently they often are: my paper that I referenced in the introduction that appeared in Language was initially submitted to Science as a reply to the paper by Rao and colleagues. This reply was rejected before it even made it out of the editorial office. The reason was pretty transparent: Rao and colleagues’ original paper purported to be a sexy “AI”-based approach that supposedly told us something interesting about an ancient civilization. My paper was a more mundane contribution showing that none of the proposed methods worked. Which one sells more copies?

In any event, with respect to the paper currently under discussion, hopefully my attempt here will have served at least to put things a bit more in perspective.

Acknowledgements: I thank Kyle Gorman and Alexander Gutkin for comments on earlier versions.

They’re going to tell you…

…at some very near point in the future, that there’s something inherently white supremacist about teaching and studying generative linguistics. They will never tell you how generative linguistics enforces white supremacy, but they will tell you that it represents a hegemonic power in the science of language (it does not, it is clearly just one way of knowing, spottily represented outside the Anglophone west) and that it competes for time and mindshare with other forms of linguistic knowledge (an unexamined austerity mindset). This rhetorical trick (the same one used to slander the socialist left across the democratic West from 2016 to the present) would simply not work on the generative community were they a militant, organized, self-assured vanguard rather than a casualized, disorganized, insecure community, one seriously committed to diversity in race and sexual orientation but largely uninterested in matters of class and power. And then, once you’ve accepted their framing, they’re going to sell you a radically empiricist psycho-computational mode of inquiry that is deeply incurious about language diversity, that cares not a whit for the agency of speakers, and trains students to serve the interests of the most powerful men in the world.

Asymmetries in Latin glide formation

Let us assume, as I have in the past, that the Classical Latin glides [j, w] are allophones of the short high monophthongs /i, u/. Then, any analysis of this allophony must address the following four asymmetries between [j] and [w]:

Intervocalic /i/ is [j.j], as in peior [pej.jor] ‘worse’; intervocalic /u/ is simple.
Intervocalically, /iu/ is realized as [jw], as in laeua [laj.wa] ‘left, leftwards’ (fem. nom.sg.), but /ui/ is realized as [wi], as in pauiō [pa.wi.oː] ‘I beat’.
/u/ preceded by a liquid and followed by a vowel is also realized as [w], as in ceruus [ker.wus] ‘stag’ and silua [sil.wa] ‘forest’, but /i/ is never realized as a glide in this position.
There are two cases in which [u] alternates with [w] (the deadjectival suffix /-u-/ is realized as [-w-] when preceded by a liquid, as in caluus [kal.wus] ‘bald’, and the perfect suffix /-u-/ is realized as [-w-] in “thematic” stems like cupīuī [ku.piː.wiː] ‘I desired’); there are no alternations between [i] and [j].

What rules give rise to these asymmetries?

Results of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion

The results of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion are now in, and are summarized in our task paper. A couple of bullet points:

Unsurprisingly, the best systems all used some form of ensembling.
Many of the best teams performed self-training and/or data augmentation experiments, but most of these experiments were performance-negative except in simulated low-resource conditions. Maybe we’ll do a low-resource challenge in a future year.
LSTMs and transformers are roughly neck-and-neck; one strong submission used a variant of hard monotonic attention.
Many of the best teams used some kind of pre-processing romanization strategy for Korean, the language with the worst baseline accuracy. We speculate why this helps in the task paper.
There were some concerns about data quality for three languages (Bulgarian, Georgian, and Lithuanian). We know how to fix them and will do so this summer, if time allows. We may also “re-issue” the challenge data with these fixes.

Optimizing three-way composition for decipherment problems

Knight et al. (2006) introduce a class of problems they call decipherment. In this scenario, we observe a “ciphertext” C, which we wish to decode. We imagine that there exists a corpus of “plaintext” P, and wish to recover the encipherment model G that transduces from P to C. All three components can be represented as (weighted) finite-state transducers: P is a language model over plaintexts, C is a list of strings, and G is an initially-uniform transducer from P to C. We can then estimate the parameters (i.e., arc weights) of G by holding P and C constant and applying the expectation maximization algorithm (Dempster et al. 1977).

Both training and prediction require us to repeatedly compute the “cascade”, the three-way composition P ○ G ○ C. First off, two-way composition is associative, so for all a, b, c : (a ○ b) ○ c = a ○ (b ○ c). However, given any n-way composition, some associations may be radically more efficient than others. Even were the time complexity of each possible composition known, it is still not trivial to compute the optimal association. Fortunately, in this case we are dealing with three-way composition, for which there are only two possible associations; we simply need to compare the two.¹
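For longer cascades, one way to search over associations is interval dynamic programming of the kind used for matrix-chain products. The sketch below assumes a caller-supplied cost estimate; the function names, the operand labels, and the crude "product of state counts" cost model are all hypothetical, since real composition cost also depends on label sorting and on which states are actually reachable.

```python
from functools import lru_cache
from math import prod

def best_association(names, cost):
    """Returns (total_cost, bracketing) minimizing the summed pairwise costs.

    cost(i, k, j) is a caller-supplied estimate of composing the result for
    operands i..k with the result for operands k+1..j.
    """
    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == j:
            return 0, names[i]
        best = None
        for k in range(i, j):
            left_cost, left = solve(i, k)
            right_cost, right = solve(k + 1, j)
            total = left_cost + right_cost + cost(i, k, j)
            if best is None or total < best[0]:
                best = (total, f"({left} ○ {right})")
        return best

    return solve(0, len(names) - 1)

# Hypothetical state counts for four machines a, b, c, d.
sizes = [5, 2, 5, 2]

def worst_case_cost(i, k, j):
    # Crude assumption: the work is the product of the two operand sizes, and
    # an intermediate result is as large as the product of its parts.
    return prod(sizes[i:j + 1])

total, bracketing = best_association(["a", "b", "c", "d"], worst_case_cost)
print(total, bracketing)
```

With three operands the inner loop only ever compares the two bracketings, which is why measuring both directly, as done here, is sufficient.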

Composition performance depends on the sorting properties of the relevant machines. In the simplest case, the inner loop of (two-way) composition consists of a complex game of “go fish” between a state in the left-hand side automaton and a state in the right-hand side automaton. One state enumerates over its input (respectively, output) labels and queries the other state’s output (respectively input) labels. In the case that the state in the automaton being queried has its arcs sorted according to the label values, a sublinear binary search is used; otherwise, linear-time search is required. Optimal performance obtains when the left-hand side of composition is sorted by output labels and the right-hand side is sorted by input labels.² Naturally, we also want to perform arc-sorting offline if possible.
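The matching step can be sketched in plain Python as follows. This is a simplified illustration rather than OpenFst's implementation: arcs are hypothetical (input label, output label, next state) tuples, epsilon labels and weights are ignored, and the function names are invented.

```python
from bisect import bisect_left

def match_state_linear(left_arcs, right_arcs):
    """Unsorted case: every left arc scans all of the right arcs."""
    return [(ilabel, r_olabel, (lnext, rnext))
            for ilabel, olabel, lnext in left_arcs
            for r_ilabel, r_olabel, rnext in right_arcs
            if olabel == r_ilabel]

def match_state_sorted(left_arcs, right_arcs):
    """Sorted case: binary search finds the first right arc with a given label."""
    # Sorting is shown inline for clarity; ideally it is done once, offline.
    right_sorted = sorted(right_arcs, key=lambda arc: arc[0])
    right_ilabels = [arc[0] for arc in right_sorted]
    pairs = []
    for ilabel, olabel, lnext in left_arcs:
        i = bisect_left(right_ilabels, olabel)
        # Walk forward over the run of arcs sharing the matched label.
        while i < len(right_sorted) and right_ilabels[i] == olabel:
            _, r_olabel, rnext = right_sorted[i]
            pairs.append((ilabel, r_olabel, (lnext, rnext)))
            i += 1
    return pairs

left = [(1, 10, 0), (2, 20, 1)]
right = [(20, 200, 5), (10, 100, 3), (20, 201, 6)]
print(match_state_sorted(left, right))
```

Both functions return the same paired arcs; the sorted variant simply avoids touching right-hand arcs whose labels cannot match.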

Finally, OpenFst, the finite-state library we use, implements composition as an on-the-fly operation: states in the composed FST are lazily computed and stored in an LRU cache.³ Assiduous use of the cache can make it feasible to compute very large compositions when it is not necessary to visit all states of the composed machine. Today I focus on associativity and assume optimal label sorting; caching will have to wait for another day.
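The idea of lazy expansion with a bounded cache can be illustrated with a toy intersection of two deterministic acceptors, a special case of composition. This is only a sketch of the general technique using Python's functools.lru_cache; the class, its fields, and the dictionary encoding of the machines are assumptions made for the example and do not mirror OpenFst's internals.

```python
from functools import lru_cache

class LazyProduct:
    """Pairs of states are expanded only when a caller asks for their arcs."""

    def __init__(self, left, right, cache_size=1024):
        # Each machine is a dict: state -> {label: next state}.
        self.left, self.right = left, right
        self.expansions = 0
        self.arcs = lru_cache(maxsize=cache_size)(self._expand)

    def _expand(self, state):
        self.expansions += 1
        l, r = state
        return tuple((label, (lnext, self.right[r][label]))
                     for label, lnext in self.left[l].items()
                     if label in self.right[r])

def reachable(machine, start):
    """Visits only the product states that can be reached from start."""
    seen, stack = {start}, [start]
    while stack:
        for _, nxt in machine.arcs(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Two three-state cycles over the label "a": nine state pairs are possible.
cycle = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 0}}
machine = LazyProduct(cycle, cycle)
seen = reachable(machine, (0, 0))
machine.arcs((0, 0))  # Served from the cache; no new expansion.
print(len(seen), machine.expansions)
```

Only three of the nine possible state pairs are ever built here, and repeated queries for a cached state cost nothing further until the cache evicts it.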

Our cascade consists of three weighted finite-state machines:

P is a language model expressed as a weighted label-sorted finite-state acceptor. The model is order 6, with Witten-Bell smoothing (Bell et al. 1990) and backoffs encoded using φ (i.e., “failure”) transitions, and has been shrunk to 1 million n-grams using relative entropy pruning (Stolcke 1998).
G is a uniform channel model encoded as a finite-state transducer. Because it is a non-deterministic transducer, it can be input-label-sorted or output-label-sorted, but not both.
C is an unweighted label-sorted string finite-state acceptor encoding a long ciphertext.

There are two possible associativities, which we illustrate using the OpenFst Python bindings.⁴ In the first, we use a left-associative composition. Offline, before composition, we input label-sort G:

In [5]: G.arcsort("ilabel")

Then, we perform both compositions, sorting the intermediate object by output label:

In [6]: %timeit -n 10 
...          partial = compose(P, G, connect=False).arcsort("olabel"); 
...          cascade = compose(partial, C, connect=False)
10 loops, best of 3: 41.6 s per loop

In our second design, we use the parallel right-associative construction. Offline, we output label-sort G:

In [7]: G.arcsort("olabel")

Then, we perform both compositions, sorting the intermediate object by input label:

In [8]: %timeit -n 10 
...          partial = compose(G, C, connect=False).arcsort("ilabel"); 
...          cascade = compose(P, partial, connect=False)
3 loops, best of 3: 38.5 s per loop

So we see a small advantage for the right-associative composition, which we take advantage of in OpenGrm-BaumWelch, freely available from the OpenGrm website.

Endnotes

¹ There exist FST algorithms for n-ary composition (Allauzen & Mohri 2009), but in practice one can achieve similar effects using composition filters (Allauzen et al. 2010) instead.
² Note that acceptors which are input label-sorted are implicitly output label-sorted and vice versa, and string FSTs are input and output label-sorted by definition.
³ In the case where one needs the entire composition at once, we can simply disable caching; in OpenFst, the result is also connected (i.e., trimmed) by default, but we disable that since we need to track the original state IDs.
⁴ The timeit module is used to estimate execution times irrespective of caching.

References

Allauzen, C., and Mohri, M. 2009. N-way composition of weighted finite-state transducers. International Journal of Foundations of Computer Science 20(4): 613–627.
Allauzen, C., Riley, M., and Schalkwyk, J. 2010. Filters for efficient composition of weighted finite-state transducers. In Implementation and Application of Automata: 15th International Conference, CIAA 2010, pages 28–38. Manitoba.
Bell, T. C., Cleary, J. G., and Witten, I. H. 1990. Text Compression. Englewood Cliffs, NJ: Prentice Hall.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1): 1–38.
Knight, K., Nair, A., Rathod, N., and Yamada, K. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499–506. Sydney.
Stolcke, A. 1998. Entropy-based pruning of backoff language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 270–274. Lansdowne, Virginia.

Words and what we should do about them

Every January linguists and dialectologists gather for the annual meeting of the Linguistic Society of America and its sister societies. And, since 1990, attendees crowd into a conference room to vote for the American Dialect Society’s Word Of The Year (or WOTY for short). The guidelines for nominating and selecting the WOTY are deliberately underdetermined. There are no rules about what’s a word (and, increasingly, picks are not even a word under any recognizable definition thereof), what makes a word “of the year” (should it be a new coinage? should its use be vigorous or merely on the rise? should it be stereotyped or notorious? should it reflect the cultural zeitgeist?) or even whether the journalists in the room are eligible to vote.

By my count, there are two major categories of WOTY winners over the last three decades: commentary on US and/or world political events, and technological jargon. I count 14 in the former category (1990’s bushlips, 1991’s mother of all, 2000’s chad, 2001’s 9-11, 2002’s WMD, 2004’s red state/blue state, 2005’s truthiness, 2007’s subprime and 2008’s bailout, 2011’s occupy, 2014’s #blacklivesmatter, 2016’s dumpster fire, 2017’s fake news, 2018’s tender-age shelter) and 9 in the latter (1993’s information superhighway, 1994’s cyber, 1995’s web, 1997’s millennium bug, 1998’s e-, 1999’s Y2K, 2009’s tweet, 2010’s app, 2012’s hashtag). But, as Allan Metcalf, former executive of the American Dialect Society, writes in his 2004 book Predicting New Words: The Secrets Of Their Success, terms which comment on a situation, rather than fill some denotational gap, rarely have much of a future. And looking back, some of these picks not only fail to recapitulate the spirit of the era but many (bushlips, newt, morph, plutoed) barely denote at all. Of those still recognizable, it is shocking how many refer to avoidable human tragedies: a presidential election decided by a panel of judges, two bloody US incursions into Iraq and the hundreds of thousands of civilian casualties that resulted, the subprime mortgage crisis and the unprecedented loss of black wealth that resulted, and unchecked violence by police and immigration officers against people of color and asylum-seekers.

Probably the clearest example of this is the 2018 WOTY, tender-age shelter. This ghoulish euphemism was not, in my memory, a prominent 2018 moment, so for the record, it refers to a Trump-era policy of separating asylum-seeking immigrants from their children. Thus, “they’re not child prisons, they’re…”. Ben Zimmer, who organizes the WOTY voting, opined that this was a case of bureaucratic language backfiring, but I disagree: there was no meaningful blowback. The policy remains in place, and the people who engineered the policy remain firmly in power for the foreseeable future, just as do the architects of and propagandists for the Iraqi invasions (one of whom happens to be a prominent linguist!), the subprime mortgage crisis, and so on. Tender-age shelter is of course by no means the first WOTY that attempts to call out right-wing double-talk, but as satire it fails. There’s no premise (it is not even in the common ground that the US linguistics community, or the professional societies who represent them, fervently desire an end to the aggressive detention and deportation of undocumented immigrants, which after all has been bipartisan policy for decades, and will likely remain so until at least 2024), and without this there is no irony to be found. Finally, it bespeaks a preoccupation with speech acts rather than dire material realities.

This is not the only dimension on which the WOTY community has failed to self-criticize. A large number of WOTY nominees (though few outright winners) of the last few years have clear origins in the African-American community (e.g., 2017 nominees wypipo, caucasity, and 🐐, 2018 nominees yeet and weird flex but OK, 2019 nominees Karen and woke). Presumably these terms become notable to the larger linguistics community via social media. It is certainly possible for the WOTY community to celebrate language of people of color, but it is also possible to read this as exotification. The voting audience, of course, is upper-middle-class and mostly-white, and here these “words”, some quite well-established in the communities in which they originate, compete for novelty and notoriety against tech jargon and of-the-moment political satire. As scholars of color have noted, this could easily reinforce standard ideologies that view African-American English as a debased form of mainstream English rather than a rich, rule-governed system in its own right. In other words, the very means by which we as linguists engage in public-facing research risk reproducing linguistic discrimination:

How might linguistic research itself, in its questions, methods, assumptions, and norms of dissemination, reproduce or work against racism? (“LSA Statement on Race”, Hudley & Mallinson 2019)

I conclude that the ADS should issue stringent guidance about what makes expressions “words”, and what makes them “of the year”. In particular, these guidelines should orient voters towards linguistic novelty, something the community is well-situated to assess.

Pynini 2020: State of the Sandwich

I have been meaning to describe some of the work I have been doing on Pynini, our weighted finite-state grammar development platform. For one, while I have been the primary contributor through the history of the project (Richard Sproat wrote the excellent path iteration library), we are now also getting many contributions from Lawrence Wolf-Sonkin (rewrite of the symbol table wrapper, type hints) and lots of usability and bug reports from the Google linguists.

We are currently on Pynini release 2.1.1. Here are some new features/improvements from the last few releases:

2.0.9: Adds an efficient multi-argument union.
2.0.9: Pynini (and the rest of OpenGrm) are available on Conda via Conda-Forge. This means that for most users, there is no longer any need to compile Pynini by hand; instead Pynini is compiled (for a variety of platforms) in the cloud, using a continuous integration framework.
2.1.0: Rewrites the string compiler so that symbol tables are no longer attached to compiled FSTs, eliminating the need for expensive symbol table merging and relabeling options.
2.1.0: Rewrites the FST and symbol table class hierarchies to better reflect the organization of lower-level APIs.
2.1.1: Adds PEP 484/PEP 561-compatible type stubs.

We also have removed or renamed quite a few features:

stringify is renamed string.
text is renamed print (cf. the command-line tool fstprint).
The defaults struct is removed, though it may be reintroduced as a context manager at some point.
The * infix operator, previously used for composition, is removed; use @ instead.
transducer’s arguments input_token_type and output_token_type are merged as token_type.

Finally, we have broken Python 2.7 compatibility as of 2.1.0; pywrapfst, the lower-level API, still has some degree of Python 2.7 compatibility, but this is probably the last release to maintain that property.

Idealizations gone wild

Generative grammar and information theory are products of the US post-war defense science funding boom, and it is no surprise that the former attempted to incorporate insights from the latter. Many early ideas in generative phonology—segment structure and morpheme structure rules and constraints (Stanley 1967), the notion of the evaluation metric (Aspects, §6), early debates on opacities, conspiracies, and the alternation condition—are clearly influenced by information theory. It is interesting to note that as early as 1975, Morris Halle regarded his substantial efforts in this area to have been a failure.

In the 1950’s I spent considerable time and energy on attempts to apply concepts of information theory to phonology. In retrospect, these efforts appear to me to have come to naught. For instance, my elaborate computations of the information content in bits of the different phonemes of Russian (Cherry, Halle & Jakobson 1953) have been, as far as I know, of absolutely no use to anyone working on problems in linguistics. And today the same negative conclusion appears to me to be warranted about all my other efforts to make use of information theory in linguistics. (Halle 1975: 532)

Thus, the mania for information theory in early generative grammar was exactly the sort of bandwagon effect that Claude Shannon, the inventor of information theory, had warned about decades earlier.

In the first place, workers in other fields should realize that the basic results of the subject are aimed at a very specific direction, a direction that is not necessarily relevant to such fields as psychology, economics, and other social sciences. (Shannon 1956)

Today, however, information theory is not exactly in disrepute in linguistics. First off, perplexity, a metric derived from information theory, is used as an intrinsic metric in certain natural language processing tasks, particularly language modeling.¹ Secondly, there have been attempts to revive information theory notions as an explanatory factor in the study of phonology (e.g., Goldsmith & Riggle 2012) and human morphological processing (e.g., Moscoso del Prado Martín et al. 2004). And recently, Mollica & Piantadosi (2019; henceforth M&P) dare to use information theory to measure the size of the grammar of English.
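As an aside on the first of these points: perplexity is two raised to the average number of bits of surprisal that a model assigns to each token of held-out text. The following is a minimal sketch of that definition, not any particular toolkit's implementation; the `perplexity` helper and the probabilities fed to it are made up for illustration.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity given the model probability assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average surprisal in bits, i.e., the per-token cross-entropy.
    cross_entropy = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2.0 ** cross_entropy


# Hypothetical probabilities, not the output of a real language model.
# A model that is always choosing uniformly among four options has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity means the model found the text less surprising, which is why it serves as an intrinsic (task-independent) metric.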

M&P’s program is fundamentally one of idealization. Now, I don’t have any problem per se with idealization. Idealization is an important part of the epistemic process in science, one without which there can be no scientific observation at all. Critics of idealizations (and of idealization itself) are usually concerned with the things an idealization abstracts away from; for instance, critics of Chomsky’s famous “ideal speaker-listener” (Aspects, p. 3f) note correctly that it ignores bilingual interference, working memory limitations, and random errors. But idealizations are not merely the infinitude of variables they choose to ignore (and when the object of study is an enormously complex polysensory, multifactorial system like the human capacity for language, one is simply not going to be able to study the entire system all at once); they are just as much defined by the factors they foreground and the affordances they create, and the constraints they impose on scientific inquiry.

In this case, an information theoretic characterization of grammars constrains us to conceive of our knowledge of language in terms of probability distributions. This is a step I am often uncomfortable with. It is, for example, certainly possible to conceive of speakers’ lexical knowledge as a sort of probability distribution over lexical items, but I am not sure that P(word) has much grammatical work to do except act as a model of the readily apparent observation that more frequent words can be recalled and recognized more rapidly than rare words. To be sure, studies like the aforementioned one by Moscoso del Prado Martín et al. attempt to connect information theoretic characterizations of the lexicon to behavioral results, but these studies are correlational and provide little in the way of mechanistic-causal explanation.
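To make the P(word) idealization concrete: under it, a word is said to carry −log₂ P(word) bits, so rare words count as more "informative" than frequent ones. Here is a minimal sketch, assuming relative frequency in a corpus as the estimate of P(word); the toy corpus and the `surprisal_bits` name are invented for illustration and do not come from any of the studies cited.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def surprisal_bits(corpus: Iterable[str]) -> Dict[str, float]:
    """Maps each word to -log2 of its relative frequency in the corpus."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


# Toy corpus (an assumption for illustration): 8 tokens, "the" occurs 3 times.
toy_corpus = "the cat saw the dog and the aardvark".split()
bits = surprisal_bits(toy_corpus)
print(bits["the"], bits["aardvark"])
```

In this toy corpus the frequent word "the" comes out at roughly 1.42 bits and the one-off "aardvark" at 3 bits. That restates the frequency asymmetry in different units; it does not by itself explain anything about recall or recognition.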

However, for the sake of argument, let us assume that the probabilistic characterization of grammatical knowledge is coherent. Why then should it be undertaken? M&P claim that the measurements this allows (grammar sizes, measured in bits) bear on a familiar debate. As they frame it:

…is the amount of information about language that is learned substantial (empiricism) or minimal (nativism)?

I don’t accept the terms of this debate. While I consider myself a nativist, I have formed no opinions about how many bits it takes to represent the grammar of English, which is by all accounts a rather complex object. The tradeoff between what is to be learned and what is innate is something that has been given extensive consideration in the nativist literature. Nativists recognize that the less there is to be learned, the more that has to have evolved in the rather short amount of time (in evolutionary terms) since we humans split off from our language-lacking primate cousins. But this tradeoff is strictly qualitative; were it possible to satisfactorily measure both evolutionary plausibility and grammar size, they would still be incommensurate quantities.

M&P proceed by computing the number of bits for various linguistic subsystems. They compute the information associated with phonemes (really, the acoustic cues to various features), the phonemic representation of wordforms, lexical semantics (mappings from words to meanings, here represented as a vector space as is the fashion), word frequency, and finally syntax. For each of these they provide lower bounds and upper bounds, though the upper bounds are in some cases constructed by adding an ad-hoc factor-of-two error to the lower bound. Finally, they sum these quantities, giving an estimate of roughly 1.5 megabytes. This M&P consider to be substantial. It is not at all clear why they feel this is the case, or how small a grammar would have to be to be “minimal”.

There is a lot to complain about in the details of M&P’s operationalizations. First, I am not certain that the systems they have identified are well-defined modules that would be recognizable to working linguists; for instance their phonemes module has next to nothing to do with my conception of phonological grammar. Secondly, it seems to me that by summing the bits needed to characterize each module, they are assuming a sort of “feed-forward”, non-interactive relationship between these components, and it is not clear that this is correct; for example, there are well-understood lexico-semantic constraints on verbs’ argument structure.

While I do not wish to go too far afield, it may be useful to consider in more detail their operationalization of syntax. For this module, they use a corpus of textbook example sentences, then compute the number of possible unlabeled binary branching trees that would cover each example. (For a sentence of n words, this quantity is the (n − 1)th Catalan number.) To turn this into a probability, they assume that one correct parse has been sampled from a uniform distribution over all possible binary trees for the given sentence. First, this assumption of uniformity is completely unmotivated. Secondly, since they assume there’s exactly one possible bracketing, and do not provide labels to non-terminals, they have no way of representing the ambiguity of sentences like Call John an ambulance. (Thanks to Brooke Larson for suggesting this example.) Anyone familiar with syntax will have no problem finding gaping faults with this operationalization.²
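The arithmetic behind this operationalization is easy to reproduce. The sketch below is my own reconstruction of the calculation as described above, not M&P's code, and the function names are hypothetical. It relies on the standard fact that the number of unlabeled binary-branching trees over n leaves is the Catalan number C(n − 1), and it takes the uniform-distribution assumption at face value.

```python
import math


def catalan(k: int) -> int:
    """The k-th Catalan number, C(k) = binomial(2k, k) / (k + 1)."""
    return math.comb(2 * k, k) // (k + 1)


def binary_bracketings(n_words: int) -> int:
    """Number of unlabeled binary-branching trees over n_words leaves."""
    if n_words < 1:
        raise ValueError("a sentence needs at least one word")
    return catalan(n_words - 1)


def bits_for_one_parse(n_words: int) -> float:
    """Bits needed to pick one tree, assuming all trees are equally likely."""
    return math.log2(binary_bracketings(n_words))


sentence = "Call John an ambulance"
n = len(sentence.split())
print(n, binary_bracketings(n), bits_for_one_parse(n))
```

For the four-word example this gives 5 possible bracketings, or about 2.32 bits. Nothing in the calculation distinguishes the two readings of the sentence, since it charges for exactly one unlabeled tree.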

M&P justify all this hastiness by comparing their work to the informal estimation approach known as a Fermi problem (they call them “Fermi calculations”). In the original framing, the quantity being estimated is the product of many terms, so assuming errors in estimation of each term are independent, the final estimate’s error is expected to grow logarithmically as the number of terms increases (roughly, this is because the logarithm of a product is equal to the sum of the logarithms of its terms). But in M&P’s case, the quantity being estimated is a sum, so the error will grow much faster, i.e., linearly as a function of the number of terms. Perhaps, as one reviewer writes, “you have to start somewhere”. But do we? If something is not worth doing well—and I would submit that measuring grammars, in all their richness, by comparing them to the storage capacity of obsolete magnetic storage media is one such thing—it seems to me to be not worth doing at all.

Footnotes

¹ Though not without criticism; in speech recognition, probably the most important application of language modeling, it is well-known that decreases in perplexity don’t necessarily give rise to decreases in word error rate.
² Why do M&P choose such a degenerate version of syntax? Because syntactic theory is “experimentally under-determined”, so they want to be “independent as possible from the specific syntactic formalism.”

References

Cherry, E. C., Halle, M., and Jakobson, R. 1953. Towards the logical description of languages in their phonemic aspect. Language 29(1): 34-46.
Chomsky, N. 1965. Aspects of the theory of syntax. Cambridge: MIT Press.
Goldsmith, J. and Riggle, J. 2012. Information theoretic approaches to phonology: the case of Finnish vowel harmony. Natural Language & Linguistic Theory 30(3): 859-896.
Halle, M. 1975. Confessio grammatici. Language 51(3): 525-535.
Mollica, F. and Piantadosi, S. T. 2019. Humans store about 1.5 megabytes of information during language acquisition. Royal Society Open Science 6: 181393.
Moscoso del Prado Martín, F., Kostić, A., and Baayen, R. H. 2004. Putting the bits together: an information theoretical perspective on morphological processing. Cognition 94(1): 1-18.
Shannon, C. E. 1956. The bandwagon. IRE Transactions on Information Theory 2(1): 3.
Stanley, R. 1967. Redundancy rules in phonology. Language 43(2): 393-436.

Elizabeth Warren and the morality of the professional class

I am surprised by the outpouring of grief engendered by Senator Elizabeth Warren’s exit from the presidential primary among my professional friends and colleagues. I dare not tell them how they ought to feel, but the spectacle of grief makes me wonder whether my friends are selling themselves short: virtually all of them have lived, in my opinion, far more virtuous lives than the senator from Massachusetts.

First off, none of them have spent most of their professional lives as right-wing activists, as did Warren, a proud Republican until the late ’90s. As recently as 1991, Warren gave a keynote at a meeting of the Federalist Society, the shadowy anti-choice legal organization that gave us Justice Brett Kavanaugh and so many other young ultra-conservative judicial appointees.

Secondly, Warren spent decades lying about her Cherokee heritage, presumably for nothing more than professional gain. This is a stunningly racist personal behavior, one that greatly reinforces white supremacy by equating the almost-unimaginable struggles of indigenous peoples with plagiarized recipes and “high cheekbones”. Were any of my friends or colleagues caught lying so blatantly on a job application, they would likely be subject to immediate termination. It is shocking that Warren has not faced greater professional repercussions for this lapse in judgment.

Warren’s more recent history of regulatory tinkering around the most predatory elements of US capitalism, while important, is hardly an appropriate penance for these two monumental personal-professional sins.

On the not-exactly-libfixes

In an early post I noted the existence of libfix-like elements where the newly liberated affix mirrors existing—though possibly semantically opaque—morphological boundaries. The example I gave was that of -giving, as in Spanksgiving and Friendsgiving. Clearly, this comes from Thanksgiving, which is etymologically (if not also synchronically) a compound of the plural noun Thanks and the gerund/progressive giving. It seems some morphological innovation has occurred because this gives rise to new coinages and the semantics of -giving is more circumscribed than the free stem giving: it necessarily refers to a harvest-time holiday, not merely to “giving”.

At the time I speculated that it was no accident that the morphological boundaries of the new libfix mimic those of the compound. Other examples I have since collected include -mare (< nightmare; e.g., writemare, editmare); -core (< hardcore; e.g., nerdcore, speedcore) and -step (< two-step; e.g., breakstep, dubstep), both of which refer to musical genres (Zimmer & Carson 2012); -gate (< Watergate; e.g., Climategate, Nipplegate, Troopergate) and -stock (< Woodstock; e.g., Madstock, Calstock), extracted from familiar toponyms; and -position (< exposition; e.g., sexposition, craposition), for which the most likely source can be analyzed as a Latinate “level 1” prefix attached to a bound stem. So, what do we think? Are these libfixes too? Does it matter that recutting mirrors the etymological (or even synchronic) segmentation of the source word?

References

B. Zimmer and C. E. Carson. 2012. Among the new words. American Speech 87(3): 350-368.