Generative grammar and reaction

I entered college in fall 2003, planning to major in psychology, but quickly fell in love with an introductory linguistics class taken to fulfill a general education requirement. I didn’t realize it at time, but in retrospect I think that the early “aughts” represented a time of reaction, in the political sense, to generative grammar (GG). A huge portion of the discourse of that era (roughly 2003-2010, and becoming more pronounced later in the decade) was dominated by debates oriented around opposition to GG. This includes:

  • Pullum & Scholz’s (2002) critique of poverty of the stimulus arguments,
  • various attempts to revive the past tense debate (e.g, Pinker 2006),
  • Evans & Levinson (2009) on “the myth of language universals”,
  • Gibson & Fedorenko (2010) on “weak quantitative standards”, and
  • the Pirahã recursion affair.

And there are probably others I can’t recall at present. In my opinion, very little was learned from any of these studies. In particular the work of Pullum & Scholz, Gibson & Fedorenko, and Everett falls apart with careful empirical scrutiny; for which see Legate & Yang 2002, the work of Jon Sprouse and colleagues, and Nevins et al. 2009, respectively; few seem to have been convinced by Pinker or Evans & Levinson. It is something of a surprise to me that these highly-contentious debates, some of which was even covered in the popular press, are now rarely read by young scholars.

I don’t know why opposition to GG was so stiff at the time but I do have a theory. The aughts were essentially an apocalyptic era, culturally and materially, and the crucial event of the decade is the invasion of Iraq by a US-led coalition. The invasion represented a failure of elites: it lacked a coherent political justification legible to the rest of the world, resulted in massive civilian casualties, and lead to institutional failures at home and nuclear proliferation abroad. And there are still-powerful voices in linguistics, intellectuals who have a responsibility “to speak the truth and to expose lies”, who were paid handsomely to manufacture consent for the Iraq war. In that context, it is not surprising that the received wisdom of GG, perceived as hegemonic and culturally associated with the anti-war left, came under heavy attack.

References

Evans, N., and Levinson, S.C. 2009. The myth of language universals: Language diversity and its importance for cognitive science. Behavioral & Brain Sciences 32: 429-492.
Gibson, E., and Fedorenko, E. 2010. Weak quantitative standards in linguistics research. Trends in Cognitive Science 14: P223-234.
Legate, J. A., and Yang, C. D. 2002. Empirical re-assessment of stimulus poverty arguments. The Linguistic Review 19: 151-162.
Nevins, A., Pesetsky, D., and Rodrigues, C. 2009. Pirahã exceptionality: a reassessment. Language 85: 355-4040.
Pinker, S. 2006. Whatever happened to the past tense debate? In Baković, E., Ito, J, and McCarthy, J. J. (ed.), Wondering at the Natural Fecundity of Things: Essays in Honor of Alan Prince, pages 221-238. BookSurge.
Pullum, G., and Scholz, B. 2002. Empirical assessment of stimulus poverty arguments. The Linguistic Review 19: 9-50.

Thought experiment #2

In an earlier post, I argued that for the logical necessity of admitting some kind of “magic” to account for lexically arbitrary behaviors like Romance metaphony or Slavic yers. In this post I’d like to briefly consider the consequences for the theory of language acquisition.

If mature adult representations have magic, infants’ hypothesis space must also include the possibility of positing magical URs (as Jim Harris argues for Spanish or Jerzy Rubach argues for Polish). What might happen the hypothesis space was not so specified? Consider the following thought experiment:

The Rigelians from Thought Experiment #1 did not do a good job sterilizing their space ships. (They normally just lick the flying saucer real good.) Specks of Rigelian dust carry a retrovirus that infects human infants and modifies their their faculty of language so that they no longer entertain magical analyses.

What then do we suppose might happen to Spanish and Polish patterns we previously identified as instances of magic? Initially, the primary linguistic data will not have changed, just the acquisitional hypothesis space. What kind of grammar will infected Spanish-acquiring babies acquire?

For Harris (and Rubach), the answer must be that infected babies cannot acquire the metaphonic patterns present in the PLD. Since there is reason to think (see, e.g., Gorman & Yang 2019:§3) that the diphthongization is the minority pattern in Spanish, it seems most likely that the children will acquire a novel grammar in which negar ‘to deny’ has an innovative non-alternating first person singular indicative *nego rather than niego ‘I deny’.

Not all linguists agree. For instance, Bybee & Pardo (1981; henceforth BP) claim that there is some local segmental conditioning on diphthongization, in the sense that Spanish speakers may be able to partially predict whether or not a stem diphthongizes on the basis of nearby segments.1 Similarly, Albright, Andrade, & Hayes (2001; henceforth AAH) develop a computational model which can extract generalizations of this sort.2 For instance, BP claim that an e followed by __r, __nt, or __rt are more likely to diphthongize, and AAH claim that a following stem-final __rr (the alveolar trill [r], not the alveolar tap [ɾ]) and a following __mb also favor diphthongization. BP are somewhat fuzzy about the representational status of these generalizations, but for AAH, who reject the magical segment analysis, they are expressed by a series of competing rules.

I am not yet convinced by this proposal. Neither BP nor AAH give the reader any general sense of the coverage of the segmental generalizations they propose (or in the case of AAH, that their computational model discovers): I’d like to know basic statistics like precision and recall for existing words. Furthermore, AAH note that their computational model sometimes needs to fall back on “word-specific rules” (their term), rules in which the segmental conditioning is an entire stem, and I’d like to know how often this is necessary.3 Rather than reporting coverage, BP and AAH instead correlate their generalizations with the results of wug-tasks (i.e., nonce word production tasks) by Spanish-speaking adults. The obvious objection here is that no evidenceor even an explicit linking hypothesislinks adults’ generalizations about nonce words in a lab to childrens’ generalizations about novel words in more naturalistic settings.

However, I want to extend an olive branch to linguists who are otherwise inclined to agree with BP and AAH. It is entirely possible that children do use local segmental conditioning to learn the patterns linguists analyzed with magical segments and/or morphs, even if we continue to posit magic segments or morphs. It is even possible that sensitivity to this segmental conditioning persists into adulthood as reflected in the aforementioned wug-tasks. Local segmental conditioning might be an example of domain-general pattern learning, and might be likened to sound symbolism—such as the well-known statistical tendency for English words beginning in gl– to relate to “light, vision, or brightness” (Charles Yang, p.c.)insofar as both types of patterns reduce apparent arbitrariness of the lexicon. I am also tempted to identify both local segmental conditioning and sound symbolism as examples of third factor effect (in the sense of Chomsky 2005). Chomsky identifies three factors in the design of language: the genetic endowment, “experience” (the primary linguistic data), and finally “[p]rinciples not specific to the faculty of language”. Some examples of third factorsas these principles not specific to the faculty of language are calledgiven in the paper include domain-general principles of “data processing” or “data analysis” and biological constraints, whether “architectural”, “computational”, or “developmental”. I submit that general-purpose pattern learning might be an example of of domain-general “data analysis”.

As it happens, we do have one way to probe the coverage of local segmental conditioning. Modern sequence-to-sequence neural networks, arguably the most powerful domain-general string pattern learning tool known to us, have been used for morphological generation tasks. For instance, in the CoNLL-SIGMORPHON 2017 shared task, neural networks are used to predict the inflected form of various words given some citation form  and a morphological specification. For instance, given the pair (dentar, V;IND;PRS;1;SG) the models have to predict diento ‘I am teething’. Very briefly, these models, as currently designed, are much like babies infected with the Rigelian retrovirus: their hypothesis space does not include “magic” segments or lexical diacritics and they must rely solely on local segmental conditioning. It is perhaps not surprising, then, that they misapply diphthongization in Spanish (e.g., *recolan for recuelan ‘they re-strain’; Gorman et al. 2019) or yer deletion in Polish, when presented with previously unseen lemmata. But it is an open question how closely these errors pattern like those made by children, or with adults’ behaviors in wug™-tasks.

Acknowledgments

I thank Charles Yang for drawing my attention to some of the issues discussed above.

Endnotes

  1. Similarly, Rysling (2016) argues that Polish yers are epenthesized to avoid certain branching codas, though she admits that their appearance is governed in part by magic (according to her analysis, exceptional morphs of the Gouskova/Pater variety).
  2. Later versions of this model developed by Albright and colleagues are better known for popularizing the notion of “islands of reliability”.
  3. Bill Idsardi (p.c.) raises the question of whether magical URs and morpholexical rules are extensionally equivalent. Good question.

References

Albright, A., Andrade, A., and Hayes, B. 2001. Segmental environments of Spanish diphthongization. UCLA Working Papers in Linguistics 7: 117-151.
Bybee, J., and Pardo, E. 1981. Morphological and lexical conditioning of rules: experimental evidence from Spanish. Linguistics 19: 937-968.
Chomsky, N. 2005. Three factors in language design. Linguistic Inquiry 36(1): 1-22.
Gorman, K. and Yang, C. 2019. When nobody wins. In Franz Rainer, Francesco Gardani, Hans Christian Luschützky and Wolfgang U. Dressler (ed.), Competition in inflection and word formation, pages 169-193. Springer.
Gorman, K., McCarthy, A.D., Cotterell, R., Vylomova, E., Silfverberg, M., Markowska, M. 2019. Weird inflects but okay: making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 140-151.
Rysling, A. 2016. Polish yers revisited. Catalan Journal of Linguistics 15: 121-143.

Why language resources should be dynamic

Virtually all the digital linguistic resources used in speech and language technology are static in the sense that

  1. One-time: they are generated once and never updated.
  2. Read-only: they provide no mechanisms for corrections, feature requests, etc.
  3. Closed-source: code and raw data used to generate the data are not released.

However, there are some benefits to designing linguistic resources dynamically, allowing them to be repeatedly regenerated and iteratively improved with the help of the research community. I’ll illustrate this with WikiPron (Lee et al. 2020), our database-cum-library for multilingual pronunciation data.

The data

Pronunctionary dictionaries are an important resource for speech technologies like automatic speech recognition and text-to-speech synthesis. Several teams have considered the possibility of mining pronunciation data from the internet, particularly from the free online dictionary Wiktionary, which by now contains millions of crowd-sourced pronunciations transcribed using the International Phonetic Alphabet. However, none of these prior efforts released any code, nor were their scrapes run repeatedly, so at best they represent of a single (2016, or 2011) slice of the data.

The tool

WikiPron is, first and foremost, a Python command-line tool for scraping pronunciation data from Wiktionary. Stable versions can be installed from PyPI using tools like pip. Once the tool is installed, users specify a language, optionally, a dialect, and various optional flags, and pronunciation data is printed to STDIN as a two-column TSV file. Since this requires an internet connection and may take a while, the system is even able to retry where it left off in case of connection hiccups. The code is carefully documented, tested, type-checked, reflowed, and linted using the CircleCI continuous integration system. 

The infrastructure

We also release, at least annually, a multilingual pronunciation dictionary created using WikiPron. This increases replicability, permits users to see the format and scale of the data WikiPron makes available, and finally allows casual users to bypass the command-line tool altogether. To do this, we provide the data/ directory, which contains data and code which automates “the big scrape”, the process by which we regenerate the multilingual pronunciation dictionary. It includes

  • the data for 335 (at time of writing) languages, dialects, scripts, etc.,
  • code for discovering languages supported by Wiktionary,
  • code for (re)scraping all languages,
  • code for (re)generating data summaries (both computer-readable TSV files and human-readable READMEs rendered by GitHub), and
  • integration tests that confirm the data summaries match the checked-in data,

as well as code and data used for various quality assurance processes. 

Dynamic language resources

In what sense is WikiPron a dynamic language resource? 

  1. It is many-time: it can be run as many times as one wants. Even “the big scrape” static data sets are updated more-than-annually.
  2. It is read-write: one can improve WikiPron data by correcting Wiktionary, and we provide instructions for contributors wishing to send pull requests to the tool.
  3. It is open-source: all code is licensed under the Apache 2.0 license; the data bears a Creative Commons Attribution-ShareAlike 3.0 Unported License inherited from Wiktionary.

Acknowledgements

Most of the “dynamic” features in WikiPron were implemented by CUNY Graduate Center PhD student Lucas Ashby and my colleague Jackson Lee; I have at best served as an advisor and reviewer.

References

Lee, J. L, Ashby, L. F.E., Garza, M. E., Lee-Sikka, Y., Miller, S., Wong, A.,
McCarthy, A. D., and Gorman, K. 2020. Massively multilingual pronunciation
mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228.

Thought experiment #1

A non-trivial portion of what we know about the languages we speak includes information about lexically-arbitrary behaviors, behaviors that are specific to certain roots and/or segments and absent in other superficially-similar roots and/or segments. One of the earliest examples is the failure of English words like obesity to undergo Chomsky & Halle’s (1968: 181) rule of trisyllabic shortening: compare sereneserenity to obese-obesity (Halle 1973: 4f.). Such phenomena are very common in the world’s languages. Some of the well-known examples include Romance mid-vowel metaphony and the Slavic fleeting vowels, which delete in certain phonological contexts.1

Linguists have long claimed (e.g., Harris 1969) one cannot predict whether a Spanish e or o in the final syllable of a verb stem will or will not undergo diphthongization (to ie or ue, respectively) when stress falls on the stem rather than the desinence. For instance negar ‘to deny’ diphthongizes (niego ‘I deny’, *nego) whereas the superficially similar pegar ‘to stick to s.t.’ does not (pego ‘I stick to s.t.’, *piego). There is no reason to suspect that the preceding segment (n vs. p) has anything to do with it; the Spanish speaker simply needs to memorize which mid vowels diphthongize.2 The same is arguably true of the Polish fleeting vowels known as yers, which delete in, among other contexts, the genitive singular (gen.sg.) of masculine nouns. Thus sen ‘dream’ has a gen.sg. snu, with deletion of the internal e, whereas the superficially similar basen ‘pool’ has a gen.sg. basenu, retaining the internal (Rubach 2016: 421). Once again, the Polish speaker needs to memorize whether or not each deletes.

So as to not presuppose a particular analysis, I will refer to segments with these unpredictable alternations—diphthongization in Spanish, deletion in Polish—as magical. Exactly how this magic ought to be encoded is unclear.3 One early approach was to exploit the feature system so that they were underlyingly distinct from non-magical segments. These “exploits” might include mapping magical segments onto gaps in the surface segmental inventory, underspecification, or simply introducing new features. Nowadays, phonologists are more likely to use prosodic prespecification. For instance, Rubach (1986) proposes that the Polish yers are prosodically defective compared to non-alternating e.4 Others have claimed that magic resides in the morph, not the segment.

Regardless of how the magic is encoded, it is a deductive necessity that it be encoded somehow. Clearly something is representationally different in negar and pegar, and sen and basen. Any account which discounts this will be descriptively inadequate. To make this a bit clearer, consider the following thought experiment:

We are contacted by a benign, intelligent alien race, carbon-based lifeforms from the Rigel system with feliform physical morphology and a fondness for catnip. Our scientists observe that they exhibit a strange behavior: when they imbibe fountain soda, their normally-green eyes turn yellow, and when they imbibe soda from a can, their eyes turn red. Scientists have not yet been able to determine the mechanisms underlying these behaviors.

What might we reason about the alien’s seemingly magical soda sense? If we adopt a sort of vulgar uniformitarianism—one which rejects outlandish explanation like time travel or mind-reading—then the only possible explanation remaining to us is that there really is something chemically distinct between the two classes of soda, and the Rigelian sensory system is sensitive to this difference.

Really, this deduction isn’t so different from the one made by linguists like Harris and Rubach: both observe different behaviors and posit distinct entities to explain them. Of course, there is something ontologically different between the two types of soda and the two types of Polish e. The former is a purely chemical difference; the latter arises  because the human language faculty turns primary linguistic data, through the epistemic process we call first language acquisition, into one type of meat (brain tissue), and that type of meat makes another type of meat (the articulatory apparatus) behave in a way that, all else held equal, will recapitulate the primary linguistic data. But both of these deductions are equally valid.

Endnotes

  1. Broadly-similar phenomena previously studied include fleeting vowels in Finnish, Hungarian, Turkish, and Yine, ternary voice contrasts in Turkish, possessive formation in Huichol, and passive formation in Māori.
  2. For simplicity I put aside the arguments by Pater (2009) and Gouskova (2012) that morphs, not segments, are magical. While I am not yet convinced by their arguments, everything I have to say here is broadly consistent with their proposal.
  3. This is yet another feature of language that is difficult to falsify. But as Ollie Sayeed once quipped, the language faculty did not evolve to satisfy a vulgar Popperian falsificationism.
  4. Specfically, Rubach assumes that the non-alternating e‘s have a prespecified mora, whereas the alternating e‘s do not.

References

Chomsky, N. and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Gouskova, M. 2012. Unexceptional segments. Natural Language & Linguistic Theory 30: 79-133.
Halle, M. 1973. Prolegomena to a theory of word formation. Linguistic Inquiry 4: 3-16.
Harris, J. 1969. Spanish Phonology. MIT Press.
Pater, J. 2009. Morpheme-specific phonology: constraint indexation and inconsistency resolution. In S. Parker (ed.), Phonological Argumentation: Essays on Evidence and Motivation, pages 123-154. Equinox.
Rubach, J. 1986. Abstract vowels in three-dimensional phonology: the yers. The Linguistic Review 5: 247-280.
Rubach, J. 2016. Polish yers: Representation and analysis. Journal of Linguistics 52: 421-466.

Does GPT-3 have free speech rights?

I have some discomfort with this framing. It strikes me as unnecessarily frivolous about some serious questions. Here is an imagined dialogue.

Should GPT-3 have the right to free speech?

No. Software does not have rights nor should it.  Living things are the only agents in moral-ethical calculations. Free speech as it currently is construed should also be recognized as a civic myth of the United States, one not universally recognized. Furthermore it should be recognized that all rights, including the right to self-expression, can impinge upon the rights and dignity of others.

What if a court recognized a free-speech right for GPT-3?

Then that court would be illegitimate. However, it is very easy to imagine this happening in the States given that the US “civic myth” is commonly used to provide extraordinary legal protections to corporate entities.

What if that allowed it to spread disinformation?

Then the operator would be morally responsible for all consequences of that dissemination.

Translating lost languages using machine learning?

[The following is a guest post from my colleague Richard Sproat. This should go without saying, but: this post does not represent the opinions of anyone’s employer.]

In 2009 a paper appeared in Science by Rajesh Rao and colleagues that claimed to show using “entropic evidence” that the thus far undeciphered Indus Valley symbol system was true writing not, as colleagues and I had argued, a non-linguistic symbol system. Some other papers from Rao and colleagues followed, and there was also a paper in the Proceedings of the Royal Society by Rob Lee and colleagues that used a different “entropic” method to argue that symbols carved on stones by the Picts of Iron Age Scotland also represented language. 

I, and others, were deeply skeptical (see e.g. here) that such methods could distinguish between true writing and symbol systems that, while having structure, encoded some sort of non-linguistic information. This skepticism was fed in part by our observation that completely random meaningless “symbol systems” could be shown to fall into the “linguistic” bin according to those measures. What if anything were such methods telling us about the difference between natural language and other systems that convey meaning? My skepticism led to a sequence of presentations and papers, culminating in this paper in Language, where I tried a variety of statistical methods, including those of the Rao and Lee teams, in an attempt to distinguish between samples of systems that were known to be true writing, and systems known to be non-linguistic. None of these methods really worked and I concluded that simple extrinsic measures based on the distribution of symbols without knowing what the symbols denote, were unlikely to be of much use.

The upshot of this attempt at debunking Rao’s and Lee’s widely publicized work was that I convinced people who were already convinced and failed to convince those who were not. As icing on the cake, I was accused by Rao and Lee and colleagues of totally misrepresenting their work, which I most certainly had not done: indeed I was careful to consider all possible interpretations of their arguments, the problem being that their own interpretations of what they had done seemed to be rather fluid, changing as the criticisms changed; on the latter point see my reply, also in Language. This experience led me to pretty much give up the debunking business entirely, since people usually end up believing what they want to believe, and it is rare for people to admit they were wrong.

Still, there are times when one feels inclined to try to set the record straight, and one such instance is this recent announcement from MIT about work from Regina Barzilay and colleagues that purports to provide a machine-learning based system that “aims to help linguists decipher languages that have been lost to history.” The paper this press release is based on (to appear in the Transactions of the Association for Computational Linguistics) is of course more reserved than what the MIT public relations people produced, but is still misleading in a number of ways.

Before I get into that though, let me state at the outset that as with the work by Rao et al. and Lee et al. that I had critiqued previously, the issue here is not that Barzilay and colleagues do not have results, but rather what one concludes from their results. And to be fair, this new work is a couple of orders of magnitude more sophisticated than what Rao and his colleagues did.

In brief summary, Barzilay et al’s approach is to take a text in an unknown ancient script, which may be unsegmented into words, along with phonetic transcriptions of a known language. In general the phonetic values of the unknown script are, well, not known, so candidate mappings are generated. (The authors also consider cases where some of the values are known, or can be guessed at, e.g. because the glyphs look like glyphs in known scripts.) The weights on the various mappings are learnable parameters, and the learning is also guided by phonological constraints such as assumed regularity of sound changes and rough preservation of the size of the phonemic inventory as languages change. (Of course, phoneme inventories can change a lot in size and details over a long history: Modern English has quite a different inventory from Proto-Indo-European. Still, since one’s best hope of a decipherment is to find languages that are reasonably closely related to the target, the authors’ assumption here may not be unreasonable.) The objective function for the learning aims to cover as much of the unknown text as possible while optimizing the quality of the extracted cognates. Their training is summarized in the following pseudocode from page 6 of their paper:

One can then compare the results of the algorithm when run with the unknown text, and a set of known languages, to see which of the known languages is the best model. The work is thus in many ways similar to earlier work by Kevin Knight and colleagues, which the present paper also cites.

In the experiments the authors used three ancient scripts: Ugaritic (12th century BCE), a close relative of Hebrew; Gothic, a 4th century CE East Germanic language that is also the earliest preserved Germanic tongue; and Iberian, a heretofore undeciphered script — or more accurately a collection of scripts — of the late pre-Common Era from the Iberian peninsula. (It is worth noting that Iberian was very likely to have been a mixed alphabetic-syllabic script, not a purely alphabetic one, which means that one is giving oneself a bit of a leg up if one bases one’s work on a transliteration of those texts into a purely alphabetic form.) The comparison known languages were Proto-Germanic, Old Norse, Old English, Latin, Spanish, Hungarian, Turkish, Basque, Arabic and Hebrew. (I note in passing that Latin and Spanish seem to be assigned by the authors to different language families!)

For Ugaritic, Hebrew came out as dramatically closer than other languages, and for Gothic, Proto-Germanic. For Iberian, no language was a dramatically better match, though Basque did seem to be somewhat closer. As they argue (p. 9):

The picture is quite different for Iberian. No language seems to have a pronounced advantage over others. This seems to accord with the current scholarly understanding that Iberian is a language isolate, with no established kinship with others.

“Scholarly understanding” may be an overstatement since the most one can say at this point is that there is scholarly disagreement on the relationships between the Iberian language(s) and known languages.

But, in any case, one problem is that since they only perform this experiment for three ancient scripts, two of which they are able to find clear relationships for, and the third not so clearly, it is not obvious what if anything one can conclude from this. The statistical sample is not such as to be overwhelming in its significance. Furthermore, in at least one case there is a serious danger of circularity: the closest match they find for Gothic is with Proto-Germanic, which shows a much better match than the other Germanic languages, Old Norse or Old English. But that is hardly surprising: Proto Germanic reconstructions are heavily informed by Gothic, the earliest recorded example of a Germanic language. Indeed, if Gothic were truly an unknown language, and assuming that we had no access to a reconstructed protolanguage that depends in part on Gothic for its reconstruction, then we would be left with the two known Germanic languages in their set, Old English and Old Norse. This of course would be a more reasonable model in any case for the situation a real decipherer would encounter. But then the situation for Gothic becomes much less clear. Below is their Figure 4, which plots various settings of their coverage threshold hyperparameter rcov against the obtained coverage. The more separated the curve for the language is above the rest, the better the method is able to distinguish the closest matched language from everything else. With this in mind, Hebrew is clearly a lot closer to Ugaritic than anything else. Iberian, as we noted, does not have a language that is obviously closest, though Basque is a contender. For Gothic, Proto-Germanic (PG) is a clear winner, but if one removed that the closest two are now Old English (OE) and Old Norse (ON). Not bad, of course, but just eyeballing the plots, the situation is no longer as dramatic, and not clearly more dramatic than the situation for Iberian.

And as for Iberian, again, they note (p. 9) that “Basque somewhat stands out from the rest, which might be attributed to its similar phonological system with Iberian”. But what are they comparing against? Modern Basque is certainly different from its form 2000+ years ago, and indeed if one buys into recent work by Juliette Blevins, then Ancient Basque was phonologically quite a bit different from the modern language. Which in turn leaves one wondering what these results are telling us.

The abstract of the paper opens with the statement that:

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined.

Of course this is all perfectly true, but it rather understates the case when it comes to the real challenges faced in most cases of decipherment. 

To wit:

Not only is the “closest … language” not usually known, but there may not even be a closest language. This appears to be the situation for Linear A where, even though there is a substantial amount of Linear A text, and the syllabary is very similar in appearance and was almost certainly the precursor to the deciphered Linear B, decipherment has remained elusive for 100 years in large measure because we simply do not know anything about the Eteocretan Language. It is also the situation for Etruscan. The authors of course claim their results support this conclusion for Iberian, and thereby imply that their method can help one decide whether there really is a closest language, and thus presumably whether it is worth wasting one’s time pursuing a given relationship. But as we have suggested above, the results seem equivocal on this point.

Even when it turns out that the text is in a language related to a known language, the way in which the script encodes that language may make the correspondences far less transparent than the known systems chosen for this paper. Gothic and Ugaritic are both segmental writing systems which presumably had a fairly straightforward grapheme-to-phoneme relation. And while Ugaritic is a “defective” writing system in that it fails to represent, e.g., most vowels, it is no different from Hebrew or Arabic in that regard. This makes it a great deal easier to find correspondences than, say, Linear B. Linear B was a syllabary, and it was a lousy way to write Greek. It failed to make important phonemic distinctions that Greek had, so that whereas Greek had a three-way voiced-voiceless-voiceless aspirate distinction in stops, Linear B for the most part could only represent place, not manner of articulation. It could not for the most part directly represent consonant clusters so that either these had to be broken up into CV units (e.g. knossos as ko-no-so) or some of the consonants ended up being unrepresented (e.g. sperma as pe-ma). 

And all of this assumes the script was purely phonographic. Many ancient scripts, and all of the original independently invented scripts, included at least some amount of purely logographic (or, if you prefer, morphographic) and even semasiographic symbology, so that an ancient text was a mix of glyphs, some of which would relate to the sound, and others of which would relate to a particular morpheme or its meaning. And when sound was encoded, it was often quite unsystematic in the way in which it was encoded, certainly much less systematic than Gothic or Ugaritic were.

Then there is the issue of the amount of text available, which may be merely in the hundreds, or fewer, of tokens. And of course there are issues familiar in decipherment such as knowing when two glyphs in a pair of inscriptions that look similar to each other are indeed the same glyph, or not. Or as in the case of Mayan, where very different looking glyphs are actually calligraphic variants of the same glyph (see e.g. here in the section on “head glyphs”). The point here is that one often cannot be sure whether two glyphs in a corpus are instances of the same glyph, or not, until one has a better understanding of the whole system.

Of course, all of these might be addressed using computational methods as we gradually whittle away at the bigger problem. But it is important to stress that methods such as the one presented in this paper are really a very small piece in the overall task of decipherment.

We do need to say one more thing here about Linear B, since the authors of this paper claim that one of their previously reported systems (Luo, Cao and Barzilay, 2019) “can successfully decipher lost languages like … Linear B”. But if you look at what was done in that paper, they took a lexicon of Linear B words, and aligned them successfully to a nicely cleaned up lexicon of known Greek names noting, somewhat obliquely, that location names were important in the successful decipherment of Linear B. That is true, of course, but then again it wasn’t particularly the largely non-Greek Cretan place names that led to the realization that Linear B was Greek. One must remember that Michael Ventris, no doubt under the influence of Arthur Evans, was initially of the opinion that Linear B could not be Greek. It was only when the language that he was uncovering started to look more and more familiar, and clearly Greek words like ko-wo (korwos) ‘boy’ and i-qo (iqqos) ‘horse’ started to appear that the conclusion became inescapable. To simulate some of the steps that Ventris went through, one could imagine using something like the Luo et al. approach as follows. First guess that there might be proper names mentioned in the corpus, then use their algorithm to derive a set of possible phonetic values for the Linear B symbols, some of which would probably be close to being correct. Then use those along with something along the lines of what is presented in the newest paper to attempt to find the closest language from a set of candidates including Greek, and thereby hope one can extend the coverage. That would be an interesting program to pursue, but there is much that would need to be done to make it actually work, especially if we intend an honest experiment where we make as few assumptions as possible about what we know about the language encoded by the system. And, of course more generally this approach would fail entirely if the language were not related to any known language. In that case one would end up with a set of things that one could probably read, such as place names, and not much else — a situation not too dissimilar from that of Linear A. All of which is to say that what Luo et al. presented is interesting, but hardly counts as a “decipherment” of Linear B. 

Of course Champollion is often credited with being the decipherer of Egyptian, whereas a more accurate characterization would be to say that he provided the crucial key to a process that unfolded over the ensuing century. (In contrast, Linear B was to a large extent deciphered within Ventris’ rather short lifetime — but then again Linear B is a much less complicated writing system than Egyptian.) If one were being charitable, then, one might compare Luo et al.’s results to those of Champollion, but then it is worth remembering that from that initial stage to a full decipherment of the system can still be a daunting task.

In summary, I think there are contributions in this work, and there would be no problem if it were presented as a method that provides a piece of what one would need in one’s toolkit if one wanted to (semi-) automate the process of decipherment. (In fact, computational methods have played thus far only a very minor role in real decipherment work, but one can hold out hope that they could be used more.) But everything apparently has to be hyped these days well beyond what the work actually does. 

Needless to say, the press loves this sort of stuff, but are scientists mainly in the business of feeding exciting tidbits to the press? Apparently they often are: my paper that I referenced in the introduction that appeared in Language was initially submitted to Science as a reply to the paper by Rao and colleagues. This reply was rejected before it even made it out of the editorial office. The reason was pretty transparent: Rao and colleagues’ original paper purported to be a sexy “AI”-based approach that supposedly told us something interesting about an ancient civilization. My paper was a more mundane contribution showing that none of the proposed methods worked. Which one sells more copies?

In any event, with respect to the paper currently under discussion, hopefully my attempt here will have served at least to put things a bit more in perspective.

Acknowledgements: I thank Kyle Gorman and Alexander Gutkin for comments on earlier versions.

They’re going to tell you…

…at some very near point in the future, that there’s something inherently white supremacist about teaching and studying generative linguistics. They will never tell you  how generative linguistics enforces white supremacy, but they will tell you that it represents a hegemonic power in the science of language (it does not, it is clearly just one way of knowing, spottily represented outside the Angophone west) and that it competes for time and mindshare with other forms of linguistic knowledge (an unexamined austerity mindset). This rhetorical trick—the same one used to slander the socialist left across the democratic West 2016-present—would simply not work on the generative community were they a militant, organized, self-assured vanguard rather than a casualized, disorganized, insecure community, one serously committed to diversity in race and sexual orientation but largely uninterested in matters of class and power. And then, once you’ve accepted their framing, they’re going to sell you a radically empiricist psycho-computational mode of inquiry that is deeply incurious about language diversity, that cares not a whit for the agency of speakers, and trains students to serve the interests of the most powerful men in the world.

Asymmetries in Latin glide formation

Let us assume, as I have in the past, that the Classical Latin glides [j, w] are allophones of the short high monophthongs /i, u/. Then, any analysis of this allophony must address the following four asymmetries between [j] and [w]:

  1. Intervocalical /i/ is [j.j], as in peior [pej.jor] ‘worse’; intervocalic /u/ is simple.
  2. Intervocalically, /iu/ is realized as [jw], as in laeua [laj.wa] ‘left, leftwards’ (fem. nom.sg.), but /ui/ is realized as [wi], as in pauiō [pa.wi.oː] ‘I beat’.
  3. /u/ preceded by a liquid and followed by a vowel is also realized as [w], as in ceruus [ker.wus] and silua [sil.wa] ‘forest’, but /i/ is never realized as a glide in this position.
  4. There are two cases in which [u] alternates with [w] (the deadjectival suffix /-u-/ is realized as /-w-/ when preceded by a liquid, as in caluus [cal.wus] ‘bald’, and the perfect suffix /-u-/ is realized as /-w-/ in “thematic” stems like cupīuī [ku.piː.wiː] ‘I desired’); there are no alternations between [i] and [j].

What rules gives rise to these asymmetries?

Results of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion

The results of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion are now in, and are summarized in our task paper. A couple bullet points:

  • Unsurprisingly, the best systems all used some form of ensembling.
  • Many of the best teams performed self-training and/or data augmentation experiments, but most of these experiments were performance-negative except in simulated low-resource conditions. Maybe we’ll do a low-resource challenge in a future year.
  • LSTMs and transformers are roughly neck-and-neck; one strong submission used a variant of hard monotonic attention.
  • Many of the best teams used some kind of pre-processing romanization strategy for Korean, the language with the worst baseline accuracy. We speculate why this helps in the task paper.
  • There were some concerns about data quality for three languages (Bulgarian, Georgian, and Lithuanian). We know how to fix them and will do so this summer, if time allows. We may also “re-issue” the challenge data with these fixes.