What phonotactics-free phonology is not

In my previous post, I showed how many phonological arguments are implicitly phonotactic in nature, using the analysis of the Latin labiovelars as an example. If we instead adopt a restricted view of phonotactics as derived from phonological processes, as I argue for in Gorman 2013, what specific forms of argumentation must we reject? I discern two such types:

  1. Arguments from the distribution of phonemes in URs. Early generative phonologists posited sequence structure constraints, constraints on sequences found in URs (e.g., Stanley 1967, et seq.). This seems to reflect more the then-contemporary mania for information theory and lexical compression, ideas which appear to have led nowhere and which were abandoned not long after. Modern forms of this argument may use probabilistic constraints instead of categorical ones, but the same critiques remain. It has never been articulated why these constraints, whether categorical or probabilistic, are considered key acquirenda; that is, why would speakers bother to track these constraints, given that they simply recapitulate information already present in the lexicon? Furthermore, as I noted in the previous post, it is clear that some of these generalizations are apparent even to non-speakers of the language; for example, monolingual New Zealand English speakers have a surprisingly good handle on Māori phonotactics despite knowing few if any Māori words. Finally, as discussed elsewhere (Gorman 2013: ch. 3, Gorman 2014), some statistically robust sequence structure constraints appear to have little if any effect on speakers’ judgments of nonce word well-formedness, loanword adaptation, or the direction of language change.
  2. Arguments based on the distribution of SRs not derived from neutralizing alternations. Some early generative phonologists also posited surface-based constraints (e.g., Shibatani 1973). These were posited to account for supposed knowledge of “wordlikeness” that could not be explained on the basis of constraints on URs. One example is that of German, which has across-the-board word-final devoicing of obstruents, but which clearly permits underlying root-final voiced obstruents in free stems (e.g., [gʀaːt]-[gʀaːdə] ‘degree(s)’ from /gʀaːd/). In such a language, Shibatani claims, a nonce word with a word-final voiced obstruent would be judged un-wordlike. Two points should be made here. First, the surface constraint in question derives directly from a neutralizing phonological process. Constraint-based theories which separate “disease” and “cure” posit a constraint against word-final voiced obstruents, but in procedural/rule-based theories there is no reason to reify this generalization, which after all merely recapitulates the facts of alternation, arguably a more entrenched source of evidence for grammar construction. Secondly, Shibatani did not in fact validate his claim about German speakers’ judgments in any systematic fashion. Some recent work by Durvasula & Kahng (2019) reports that speakers do not necessarily judge a nonce word to be ill-formed just because it fails to follow certain subtle allophonic principles.

References

Durvasula, K. and Kahng, J. 2019. Phonological acceptability is not isomorphic with phonological grammaticality of stimulus. Talk presented at the Annual Meeting on Phonology.
Gorman, K. 2013. Generative phonotactics. Doctoral dissertation, University of Pennsylvania.
Gorman, K. 2014. A program for phonotactic theory. In Proceedings of the Forty-Seventh Annual Meeting of the Chicago Linguistic Society: The Main Session, pages 79-93.
Shibatani, M. 1973. The role of surface phonetic constraints in generative phonology. Language 49(1): 87-106.
Stanley, R. 1967. Redundancy rules in phonology. Language 43(2): 393-436.

Towards a phonotactics-free phonology

Early generative phonology had surprisingly little to say about the theory of phonotactics. Chomsky and Halle (1965) claim that English speakers can easily distinguish between real words like brick, well-formed or “possible” nonce words like blick, and ill-formed or “impossible” nonce words like bnick. Such knowledge must be in part language-specific, since [bn] onsets are in some languages—Hebrew, for instance—totally unobjectionable. But few attempts were made at the time to figure out how to encode this knowledge.

Chomsky and Halle, and later Stanley (1967), propose sequence structure constraints (SSCs), generalizations which encode sequential redundancies in underlying representations.1 Chomsky and Halle (p. 100) hypothesize that such generalizations might account for the ill-formedness of bnick: perhaps English consonants preceded by a word-initial obstruent must be liquids: thus blick but not bnick. Shibatani (1973) claims that not all language-specific generalizations about (im)possible words can derive from restrictions on underlying representations, and that they must (instead or also) be expressed in terms of restrictions on surface form. For instance, in German, obstruent voicing is contrastive but neutralized word-finally; e.g., [gʀaːt]-[gʀaːtə] ‘ridge(s)’ vs. [gʀaːt]-[gʀaːdə] ‘degree(s)’. Yet Shibatani claims that German speakers judge word-final voiced obstruents, as in the hypothetical but unattested [gʀaːd], to be ill-formed. Similar claims were made by Clayton (1976). And that roughly exhausts the debate at the time. Many years later, Hale and Reiss, for instance, deny that this kind of knowledge is part of the narrow faculty of language:

Even if we, as linguists, find some generalizations in our description of the lexicon, there is no reason to posit these generalizations as part of the speaker’s knowledge of their language, since they are computationally inert and thus irrelevant to the input-output mappings that the grammar is responsible for. (Hale and Reiss 2008:17f.)

Charles Reiss (p.c.) once proposed to me a brief thought experiment. Imagine that you were to ask a naïve, non-linguist, monolingual English speaker to discern whether a short snippet of spoken language was, say, Māori or Czech. Would you not expect such a speaker to do far better than chance, even if they themselves do not know a single word of either language? Clearly, then, (at least some form of) phonotactic knowledge can be acquired extremely indirectly and effortlessly, without any substantial exposure to the language, and it does not imply any deep knowledge of the grammar(s) in question.2

In a broader historical context, though, early generativists’ relative lack of interest in phonotactic theory is something of an anomaly. Structuralist phonologists, in developing phonemicizations, were at least sometimes concerned about positing phonemes with restricted distributions. And for phonologists working in the strains of thinking that ultimately spawned Harmonic Grammar and Optimality Theory, phonotactic generalizations are to a considerable degree what phonological grammars are made of.

A phonological theory which rejects phonotactics as part of the narrow language faculty—as do Hale and Reiss—is one which makes different predictions than theories which do include it, if only because such an assumption necessarily excludes certain sources of evidence. Such a grammar cannot make reference to generalizations about distributions of phonemes that are not tied to allophonic principles or to alternations. Nor can it make reference to the distribution of contrast except in the presence of neutralizing phonological processes.

I illustrated this point very briefly in Gorman 2014 with a famous case from Sanskrit (the so-called diaspirate roots); here I’d like to provide a more detailed example using a language I know much better, namely Latin. Anticipating the conclusions drawn below, it seems that nearly all the arguments mustered in this well-known case are phonotactic in nature and are thus irrelevant in a phonotactics-free theory of phonology.

In Classical Latin, the orthographic sequence qu (or more specifically <QV>) denotes the sound [kw].3 Similarly, gu is ambiguously either [gu] as in exiguus [ek.si.gu.us] ‘scanty’ or [gw] as in anguis [aŋ.gwis] ‘snake’. For whatever reason, it seems that gu was pronounced as [gw] if and only if it was preceded by an n. It is not at all clear whether this should be regarded as an orthographic generalization, a phonological principle, or a mere accident of history.

How should the labiovelars qu and (post-nasal) gu be phonologized? This topic has been the subject of much speculation: Devine and Stephens (1977), for instance, devote half of a lengthy book to it. More recently, Cser’s (2020: 22f.) phonology of Latin reconsiders the evidence, revising an earlier presentation (Cser 2013) of these facts. In fact three possibilities are imaginable: qu, for instance, could be unisegmental /kʷ/, bisegmental /kw/, or even /ku/ (Watbled 2005), though as Cser correctly observes, the last of these does not seem to be workable. Cser reluctantly concludes that the question is not yet decidable. Let us consider this question briefly, departing from Cser’s theorizing only in the assumption of a phonotactics-free phonology.

  1. Frequency. Following Devine and Stephens, Cser notes that the lexical frequency of qu greatly exceeds that of k and of the glide [w] (written u) in general. They take this as evidence for unisegmental /kʷ, gʷ/.4 However, it is not at all clear to me why this ought to matter to the child acquiring Latin. In a phonotactics-free phonology, there is simply no reason for the learner to attend to this statistical discrepancy.
  2. Phonetic issues. Cser reviews testimonia from ancient grammarians suggesting that the “[w] element in <qu> was less consonant-like than other [w]s” (p. 23). However, as he points out, this is trivially handled in the unisegmental analysis and is a straightforward example of allophony in the bisegmental analysis.
  3. Geminates. Cser points out that the labiovelars, unlike all consonants but [w], fail to form intervocalic geminates. However, phonotactics-free phonology has no need to explain which underlying geminates are and are not allowed in the lexicon.
  4. Positional restrictions. Under a bisegmental interpretation, the labiovelars are “marked” in that obstruent-glide sequences are rare in Latin. On the other hand, under a unisegmental interpretation, the absence of word-final labiovelars is unexpected. However, neither observation has any status in phonotactics-free phonology.
  5. The question of [sw]. The sequence [sw] is attested initially in a few words (e.g., suāuis ‘sweet’). Is [sw] uni- or bisegmental? Cser notes that were one to adopt a unisegmental analysis for the labiovelars qu and gu, [sw] would be the only complex onset in which [w] may occur. However, an apparently restricted distribution for [w] has no evidentiary status in phonotactics-free phonology; it can only be a historical accident encoded implicitly in the lexicon.
  6. Verb root structure. Devine and Stephens claim that verb roots ending in a three-consonant sequence are unattested except for roots ending in a sonorant-labiovelar sequence (e.g., torquere ‘to turn’, tinguere ‘to dip’). While this is unexplained under a bisegmental analysis, it is an argument based on distributional restrictions, which have no status in phonotactics-free phonology.
  7. Voicing contrast in clusters. Voicing is contrastive in Latin nasal-labiovelar clusters; compare linquam ‘I will/would leave’ (1sg. fut./subj. act.) with linguam ‘tongue’ (acc.sg.). According to Cser, under the biphonemic analysis this would be the only context in which a CCC cluster has contrastive voicing, and “[t]his is certainly a fact that points towards the greater plausibility of the unisegmental interpretation of labiovelars” (p. 27). It is not clear that the distribution of voicing contrasts ought to be taken into account in a phonotactics-free theory, since there is no evidence for a process neutralizing voicing contrasts in word-internal trisegmental clusters.
  8. Alternations. In two verbs, qu alternates with cū [kuː] in the perfect participle (ppl.): loquī ‘to speak’ vs. its ppl. locūtus and sequī ‘to follow’ vs. its ppl. secūtus. Superficially this resembles alternations in which [lw, bw, gw] alternate with [luː, buː, guː] in the perfect participle. This suggests a bisegmental analysis, and since it is based on patterns of alternation, it is consistent with a phonotactics-free theory. On the other hand, qu also alternates with plain c [k]. For example, consider the verb coquere ‘to cook’, which has a ppl. coctus. Similarly, the verb relinquere ‘to leave’ has a ppl. relictus, though the loss of the Indo-European “nasal insert” (as it is known) found in the infinitive may suggest an alternative—possibly suppletive—analysis. Cser concludes, and I agree, that this evidence is ambiguous.
  9. ad-assimilation. The prefix ad- variably assimilates in place and manner to the following stem-initial consonant. Cser claims that this is rare with qu-initial stems (e.g., unassimilated adquirere ‘to acquire’ is far more frequent than assimilated acquirere in the corpus). This is suggestive insofar as ad-assimilation is extremely common with [k]-initial stems; it thus seems to weakly support the bisegmental analysis.5
  10. Diachronic considerations. Latin qu is a descendant of Indo-European *kʷ, one member of a larger labiovelar series. All members of this series appear to have been unisegmental in the proto-language. However, as Cser notes, this is simply not relevant to the synchronic status of qu and gu.
  11. Poetic licence. Rarely, the poets use a device known as diaeresis, the reading of [w] as [u] for the sake of the meter. Cser claims this does not obtain for qu. This is weak evidence for the unisegmental analysis, because the labial-glide portion of /kʷ/ would not obviously be in the scope of diaeresis.
  12. The distribution of gu. As noted above, the voiced labiovelar gu is lexically quite rare, and always preceded by n. In a phonological theory which attends to phonotactic constraints, this is an explanandum crying out for an explanans. Cser argues that it is particularly odd under the unisegmental analysis because there is no other segment so restricted. But in phonotactics-free phonology, there is no need to explain this accident of history.

Cser concludes that this series of arguments is largely inconclusive. He takes (7, 11) to be evidence for the unisegmental analysis, (3, 5, 8, 9) to be evidence for the bisegmental analysis, and all other points to be largely inconclusive. Reassessing the evidence in a phonotactics-free theory, only (9) and (11), both based on rather rare evidence, remain as possible arguments bearing on the status of the labiovelars. I too have to regard the evidence as inconclusive, though I am now on the lookout for diaeresis of qu and gu, and hope to obtain a better understanding of prefix-final consonant assimilation.

Clearly, working phonologists are heavily dependent on phonotactic arguments, and rejecting them as explanations would substantially limit the evidence base used in phonological inquiry.

Endnotes

  1. In part this must reflect the obsession with information theory in linguistics at the time. Of this obsession Halle (1975) would later write that this general approach was “of absolutely no use to anyone working on problems in linguistics” (532).
  2. As it happens, monolingual English-speaking New Zealanders are roughly as good at discriminating between “possible” and “impossible” Māori nonce words as are Māori speakers (Oh et al. 2020).
  3. I write this phonetically as [kw] rather than [kʷ] because it is unclear to me how the latter might differ phonetically from the former. These objections do not apply to the phonological transcription /kʷ/, however.
  4. Recently Gouskova and Stanton (2021) have revived this theory and applied it to a number of case studies in other languages. 
  5. It is at least possible that unassimilated spellings are “conservative” spelling conventions and do not reflect speech. If so, one may still wish to explain the substantial discrepancy in rates of (orthographic) assimilation to different stem-initial consonants and consonant clusters.

References

Chomsky, N. and Halle, M. 1965. Some controversial questions in phonological theory. Journal of Linguistics 1(2): 97-138.
Clayton, M. L. 1976. The redundance of underlying morpheme-structure conditions. Language 52(2): 295-313.
Cser, A. 2013. Segmental identity and the issue of complex segments. Acta Linguistica Hungarica 60(3): 247-264.
Cser, A. 2020. The Phonology of Classical Latin. John Wiley & Sons.
Devine, A. M. and Stephens, L. D. 1977. Two Studies in Latin Phonology. Anma Libri.
Gorman, K. 2013. Generative phonotactics. Doctoral dissertation, University of Pennsylvania.
Gorman, K. 2014. A program for phonotactic theory. In Proceedings of the Forty-Seventh Annual Meeting of the Chicago Linguistic Society: The Main Session, pages 79-93.
Gouskova, M. and Stanton, J. 2021. Learning complex segments. Language 97(1): 151-193.
Hale, M. and Reiss, C. 2008. The Phonological Enterprise. Oxford University Press.
Halle, M. 1975. Confessio grammatici. Language 51(3): 525-535.
Oh, Y., Simon, T., Beckner, C., Hay, J., King, J., and Needle, J. 2020. Non-Māori-speaking New Zealanders have a Māori proto-lexicon. Scientific Reports 10: 22318.
Shibatani, M. 1973. The role of surface phonetic constraints in generative phonology. Language 49(1): 87-106.
Stanley, R. 1967. Redundancy rules in phonology. Language 43(2): 393-436.
Watbled, J.-P. 2005. Théories phonologiques et questions de phonologie latine. In C. Touratier (ed.), Essais de phonologie latine, pages 25-57. Publications de l’Université de Provence.

Surprises for the new NLP developer

There are a couple of things that surprise students when they first begin to develop natural language processing applications.

  • Some things just take a while. A script that, say, preprocesses millions of sentences isn’t necessarily wrong because it takes a half hour.
  • You really do have to avoid wasting memory. If you’re processing a big file line by line,
    • you really can’t afford to read it all in at once, and
    • you should write out data as soon as you can (see the sketch after this list).
  • The OS and the standard library already know how to buffer IO; don’t fight them.
  • Whereas much software works with data in non-human-readable formats (e.g., wire formats, binary data) or human-hostile ones (XML), if you’re processing text files, you can just open them up and read them to see whether they’re roughly what you expected.
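To make the memory advice concrete, here is a minimal sketch of the streaming pattern in Python; the file names and the per-line transformation are hypothetical:

# Read one line at a time rather than slurping the whole file,
# and write each result out immediately instead of accumulating it.
with open("input.txt") as source, open("output.txt", "w") as sink:
    for line in source:
        processed = line.rstrip("\n").casefold()  # Hypothetical preprocessing.
        print(processed, file=sink)  # Buffering is handled for us.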

Should you assign it to a variable?

Let us suppose that you’re going to compute some value, and then send it to another function. Which snippet is better (using lhs as shorthand for a variable identifier and rhs() as shorthand for the right-hand side expression)?

lhs = rhs()
do_work(lhs)

do_work(rhs())

My short answer is it depends. Here is my informal set of heuristics:

    • Is a type-cast involved (e.g., in a statically typed language like C)? If so, assign it to a variable.
    • Would the variable name be a meaningful one that would provide non-obvious information about the nature of the computation? If so, assign it to a variable. For instance, if a long right-hand side expression computes perplexity, ppl = ... or perplex = ... is about as useful as an inline comment (see the sketch after this list).
    • Is the computation used again in the same scope? If so, assign it to a variable.
    • Is the right-hand side just a very complicated expression? If so, consider assigning it to a variable, and try to give it an informative name.
    • Otherwise, do not assign it to a variable.
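To illustrate the second heuristic, here is a sketch in Python; the numbers, and everything other than the do_work scaffolding from above, are made up for illustration:

import math

def do_work(value: float) -> None:
    print(f"perplexity: {value:.2f}")

total_log_prob = -4242.0  # Hypothetical summed log-probability.
num_tokens = 1000  # Hypothetical token count.

# The variable name documents what the right-hand side computes,
# much as an inline comment would.
ppl = math.exp(-total_log_prob / num_tokens)
do_work(ppl)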

Generative grammar and reaction

I entered college in fall 2003, planning to major in psychology, but quickly fell in love with an introductory linguistics class taken to fulfill a general education requirement. I didn’t realize it at the time, but in retrospect I think that the early “aughts” represented a time of reaction, in the political sense, to generative grammar (GG). A huge portion of the discourse of that era (roughly 2003-2010, and becoming more pronounced later in the decade) was dominated by debates oriented around opposition to GG. This includes:

  • Pullum & Scholz’s (2002) critique of poverty of the stimulus arguments,
  • various attempts to revive the past tense debate (e.g., Pinker 2006),
  • Evans & Levinson (2009) on “the myth of language universals”,
  • Gibson & Fedorenko (2010) on “weak quantitative standards”, and
  • the Pirahã recursion affair.

And there are probably others I can’t recall at present. In my opinion, very little was learned from any of these studies. In particular, the work of Pullum & Scholz, Gibson & Fedorenko, and Everett falls apart under careful empirical scrutiny (for which see Legate & Yang 2002, the work of Jon Sprouse and colleagues, and Nevins et al. 2009, respectively), and few seem to have been convinced by Pinker or Evans & Levinson. It is something of a surprise to me that these highly contentious debates, some of which were even covered in the popular press, are now rarely read by young scholars.

I don’t know why opposition to GG was so stiff at the time, but I do have a theory. The aughts were essentially an apocalyptic era, culturally and materially, and the crucial event of the decade was the invasion of Iraq by a US-led coalition. The invasion represented a failure of elites: it lacked a coherent political justification legible to the rest of the world, resulted in massive civilian casualties, and led to institutional failures at home and nuclear proliferation abroad. And there are still-powerful voices in linguistics, intellectuals who have a responsibility “to speak the truth and to expose lies”, who were paid handsomely to manufacture consent for the Iraq war. In that context, it is not surprising that the received wisdom of GG, perceived as hegemonic and culturally associated with the anti-war left, came under heavy attack.

References

Evans, N., and Levinson, S.C. 2009. The myth of language universals: Language diversity and its importance for cognitive science. Behavioral & Brain Sciences 32: 429-492.
Gibson, E., and Fedorenko, E. 2010. Weak quantitative standards in linguistics research. Trends in Cognitive Sciences 14: 223-234.
Legate, J. A., and Yang, C. D. 2002. Empirical re-assessment of stimulus poverty arguments. The Linguistic Review 19: 151-162.
Nevins, A., Pesetsky, D., and Rodrigues, C. 2009. Pirahã exceptionality: a reassessment. Language 85: 355-404.
Pinker, S. 2006. Whatever happened to the past tense debate? In Baković, E., Ito, J., and McCarthy, J. J. (ed.), Wondering at the Natural Fecundity of Things: Essays in Honor of Alan Prince, pages 221-238. BookSurge.
Pullum, G., and Scholz, B. 2002. Empirical assessment of stimulus poverty arguments. The Linguistic Review 19: 9-50.

Thought experiment #2

In an earlier post, I argued for the logical necessity of admitting some kind of “magic” to account for lexically arbitrary behaviors like Romance metaphony or the Slavic yers. In this post I’d like to briefly consider the consequences for the theory of language acquisition.

If mature adult representations have magic, infants’ hypothesis space must also include the possibility of positing magical URs (as Jim Harris argues for Spanish and Jerzy Rubach argues for Polish). What might happen if the hypothesis space were not so specified? Consider the following thought experiment:

The Rigelians from Thought Experiment #1 did not do a good job sterilizing their space ships. (They normally just lick the flying saucer real good.) Specks of Rigelian dust carry a retrovirus that infects human infants and modifies their faculty of language so that they no longer entertain magical analyses.

What then do we suppose might happen to the Spanish and Polish patterns we previously identified as instances of magic? Initially, the primary linguistic data will not have changed, just the acquisitional hypothesis space. What kind of grammar will infected Spanish-acquiring babies acquire?

For Harris (and Rubach), the answer must be that infected babies cannot acquire the metaphonic patterns present in the PLD. Since there is reason to think (see, e.g., Gorman & Yang 2019:§3) that the diphthongization is the minority pattern in Spanish, it seems most likely that the children will acquire a novel grammar in which negar ‘to deny’ has an innovative non-alternating first person singular indicative *nego rather than niego ‘I deny’.

Not all linguists agree. For instance, Bybee & Pardo (1981; henceforth BP) claim that there is some local segmental conditioning on diphthongization, in the sense that Spanish speakers may be able to partially predict whether or not a stem diphthongizes on the basis of nearby segments.1 Similarly, Albright, Andrade, & Hayes (2001; henceforth AAH) develop a computational model which can extract generalizations of this sort.2 For instance, BP claim that an e followed by __r, __nt, or __rt is more likely to diphthongize, and AAH claim that a following stem-final __rr (the alveolar trill [r], not the alveolar tap [ɾ]) and a following __mb also favor diphthongization. BP are somewhat fuzzy about the representational status of these generalizations, but for AAH, who reject the magical segment analysis, they are expressed by a series of competing rules.

I am not yet convinced by this proposal. Neither BP nor AAH give the reader any general sense of the coverage of the segmental generalizations they propose (or, in the case of AAH, that their computational model discovers): I’d like to know basic statistics like precision and recall for existing words. Furthermore, AAH note that their computational model sometimes needs to fall back on “word-specific rules” (their term), rules in which the segmental conditioning is an entire stem, and I’d like to know how often this is necessary.3 Rather than reporting coverage, BP and AAH instead correlate their generalizations with the results of wug-tasks (i.e., nonce word production tasks) performed by Spanish-speaking adults. The obvious objection here is that no evidence (or even an explicit linking hypothesis) links adults’ generalizations about nonce words in a lab to children’s generalizations about novel words in more naturalistic settings.

However, I want to extend an olive branch to linguists who are otherwise inclined to agree with BP and AAH. It is entirely possible that children do use local segmental conditioning to learn the patterns linguists analyze with magical segments and/or morphs, even if we continue to posit magic segments or morphs. It is even possible that sensitivity to this segmental conditioning persists into adulthood, as reflected in the aforementioned wug-tasks. Local segmental conditioning might be an example of domain-general pattern learning, and might be likened to sound symbolism, such as the well-known statistical tendency for English words beginning in gl- to relate to “light, vision, or brightness” (Charles Yang, p.c.), insofar as both types of patterns reduce the apparent arbitrariness of the lexicon. I am also tempted to identify both local segmental conditioning and sound symbolism as examples of third factor effects (in the sense of Chomsky 2005). Chomsky identifies three factors in the design of language: the genetic endowment, “experience” (the primary linguistic data), and finally “[p]rinciples not specific to the faculty of language”. Some examples of third factors, as these principles not specific to the faculty of language are called, given in the paper include domain-general principles of “data processing” or “data analysis” and biological constraints, whether “architectural”, “computational”, or “developmental”. I submit that general-purpose pattern learning might be an example of domain-general “data analysis”.

As it happens, we do have one way to probe the coverage of local segmental conditioning. Modern sequence-to-sequence neural networks, arguably the most powerful domain-general string pattern learning tools known to us, have been used for morphological generation tasks. For instance, in the CoNLL-SIGMORPHON 2017 shared task, neural networks are used to predict the inflected form of various words given a citation form and a morphological specification. For example, given the pair (dentar, V;IND;PRS;1;SG), the models have to predict diento ‘I am teething’. Very briefly, these models, as currently designed, are much like babies infected with the Rigelian retrovirus: their hypothesis space does not include “magic” segments or lexical diacritics, and they must rely solely on local segmental conditioning. It is perhaps not surprising, then, that they misapply diphthongization in Spanish (e.g., *recolan for recuelan ‘they re-strain’; Gorman et al. 2019) or yer deletion in Polish when presented with previously unseen lemmata. But it is an open question how closely these errors pattern like those made by children, or with adults’ behaviors in wug™-tasks.
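For concreteness, here is a sketch of what one such instance looks like as data, using the example from the text; the field names are mine, and the feature bundle follows the UniMorph-style tags used by the shared task:

# One generation instance: the model reads the lemma and the feature
# bundle and must output the inflected form.
instance = {
    "lemma": "dentar",  # Citation form.
    "features": "V;IND;PRS;1;SG",  # Morphological specification.
    "target": "diento",  # Inflected form to be predicted.
}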

Acknowledgments

I thank Charles Yang for drawing my attention to some of the issues discussed above.

Endnotes

  1. Similarly, Rysling (2016) argues that Polish yers are epenthesized to avoid certain branching codas, though she admits that their appearance is governed in part by magic (according to her analysis, exceptional morphs of the Gouskova/Pater variety).
  2. Later versions of this model developed by Albright and colleagues are better known for popularizing the notion of “islands of reliability”.
  3. Bill Idsardi (p.c.) raises the question of whether magical URs and morpholexical rules are extensionally equivalent. Good question.

References

Albright, A., Andrade, A., and Hayes, B. 2001. Segmental environments of Spanish diphthongization. UCLA Working Papers in Linguistics 7: 117-151.
Bybee, J., and Pardo, E. 1981. Morphological and lexical conditioning of rules: experimental evidence from Spanish. Linguistics 19: 937-968.
Chomsky, N. 2005. Three factors in language design. Linguistic Inquiry 36(1): 1-22.
Gorman, K. and Yang, C. 2019. When nobody wins. In F. Rainer, F. Gardani, H. C. Luschützky, and W. U. Dressler (ed.), Competition in Inflection and Word Formation, pages 169-193. Springer.
Gorman, K., McCarthy, A. D., Cotterell, R., Vylomova, E., Silfverberg, M., and Markowska, M. 2019. Weird inflects but OK: making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 140-151.
Rysling, A. 2016. Polish yers revisited. Catalan Journal of Linguistics 15: 121-143.

Why language resources should be dynamic

Virtually all the digital linguistic resources used in speech and language technology are static in the sense that

  1. One-time: they are generated once and never updated.
  2. Read-only: they provide no mechanisms for corrections, feature requests, etc.
  3. Closed-source: code and raw data used to generate the data are not released.

However, there are some benefits to designing linguistic resources dynamically, allowing them to be repeatedly regenerated and iteratively improved with the help of the research community. I’ll illustrate this with WikiPron (Lee et al. 2020), our database-cum-library for multilingual pronunciation data.

The data

Pronunciation dictionaries are an important resource for speech technologies like automatic speech recognition and text-to-speech synthesis. Several teams have considered the possibility of mining pronunciation data from the internet, particularly from the free online dictionary Wiktionary, which by now contains millions of crowd-sourced pronunciations transcribed using the International Phonetic Alphabet. However, none of these prior efforts released any code, nor were their scrapes run repeatedly, so at best they represent a single (2016, or 2011) slice of the data.

The tool

WikiPron is, first and foremost, a Python command-line tool for scraping pronunciation data from Wiktionary. Stable versions can be installed from PyPI using tools like pip. Once the tool is installed, users specify a language and, optionally, a dialect, along with various optional flags, and pronunciation data is printed to standard output as a two-column TSV file. Since this requires an internet connection and may take a while, the system is even able to retry where it left off in case of connection hiccups. The code is carefully documented, tested, type-checked, reflowed, and linted using the CircleCI continuous integration system.
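For example, a minimal scraping session via the Python API might look like the following; this is a sketch assuming the Config/scrape interface described in the WikiPron documentation, with “fra” (French) as the example language key:

import wikipron

# Scrape (word, pronunciation) pairs for French from Wiktionary.
config = wikipron.Config(key="fra")
for word, pron in wikipron.scrape(config):
    print(f"{word}\t{pron}")  # Two-column TSV, as with the CLI.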

The infrastructure

We also release, at least annually, a multilingual pronunciation dictionary created using WikiPron. This increases replicability, permits users to see the format and scale of the data WikiPron makes available, and finally allows casual users to bypass the command-line tool altogether. To do this, we provide the data/ directory, which contains the data and code that automate “the big scrape”, the process by which we regenerate the multilingual pronunciation dictionary. It includes

  • the data for 335 (at time of writing) languages, dialects, scripts, etc.,
  • code for discovering languages supported by Wiktionary,
  • code for (re)scraping all languages,
  • code for (re)generating data summaries (both computer-readable TSV files and human-readable READMEs rendered by GitHub), and
  • integration tests that confirm the data summaries match the checked-in data,

as well as code and data used for various quality assurance processes. 

Dynamic language resources

In what sense is WikiPron a dynamic language resource? 

  1. It is many-time: it can be run as many times as one wants. Even the static data sets from “the big scrape” are updated more than annually.
  2. It is read-write: one can improve WikiPron data by correcting Wiktionary, and we provide instructions for contributors wishing to send pull requests to the tool.
  3. It is open-source: all code is licensed under the Apache 2.0 license; the data bears a Creative Commons Attribution-ShareAlike 3.0 Unported License inherited from Wiktionary.

Acknowledgements

Most of the “dynamic” features in WikiPron were implemented by CUNY Graduate Center PhD student Lucas Ashby and my colleague Jackson Lee; I have at best served as an advisor and reviewer.

References

Lee, J. L., Ashby, L. F. E., Garza, M. E., Lee-Sikka, Y., Miller, S., Wong, A., McCarthy, A. D., and Gorman, K. 2020. Massively multilingual pronunciation mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228.

Thought experiment #1

A non-trivial portion of what we know about the languages we speak includes information about lexically-arbitrary behaviors, behaviors that are specific to certain roots and/or segments and absent in other superficially-similar roots and/or segments. One of the earliest examples is the failure of English words like obesity to undergo Chomsky & Halle’s (1968: 181) rule of trisyllabic shortening: compare serene-serenity to obese-obesity (Halle 1973: 4f.). Such phenomena are very common in the world’s languages. Some of the well-known examples include Romance mid-vowel metaphony and the Slavic fleeting vowels, which delete in certain phonological contexts.1

Linguists have long claimed (e.g., Harris 1969) that one cannot predict whether a Spanish e or o in the final syllable of a verb stem will or will not undergo diphthongization (to ie or ue, respectively) when stress falls on the stem rather than the desinence. For instance, negar ‘to deny’ diphthongizes (niego ‘I deny’, *nego) whereas the superficially similar pegar ‘to stick to s.t.’ does not (pego ‘I stick to s.t.’, *piego). There is no reason to suspect that the preceding segment (n vs. p) has anything to do with it; the Spanish speaker simply needs to memorize which mid vowels diphthongize.2 The same is arguably true of the Polish fleeting vowels known as yers, which delete in, among other contexts, the genitive singular (gen.sg.) of masculine nouns. Thus sen ‘dream’ has a gen.sg. snu, with deletion of the internal e, whereas the superficially similar basen ‘pool’ has a gen.sg. basenu, retaining the internal e (Rubach 2016: 421). Once again, the Polish speaker needs to memorize whether or not each e deletes.

So as to not presuppose a particular analysis, I will refer to segments with these unpredictable alternations—diphthongization in Spanish, deletion in Polish—as magical. Exactly how this magic ought to be encoded is unclear.3 One early approach was to exploit the feature system so that they were underlyingly distinct from non-magical segments. These “exploits” might include mapping magical segments onto gaps in the surface segmental inventory, underspecification, or simply introducing new features. Nowadays, phonologists are more likely to use prosodic prespecification. For instance, Rubach (1986) proposes that the Polish yers are prosodically defective compared to non-alternating e.4 Others have claimed that magic resides in the morph, not the segment.

Regardless of how the magic is encoded, it is a deductive necessity that it be encoded somehow. Clearly something is representationally different in negar and pegar, and sen and basen. Any account which discounts this will be descriptively inadequate. To make this a bit clearer, consider the following thought experiment:

We are contacted by a benign, intelligent alien race, carbon-based lifeforms from the Rigel system with feliform physical morphology and a fondness for catnip. Our scientists observe that they exhibit a strange behavior: when they imbibe fountain soda, their normally-green eyes turn yellow, and when they imbibe soda from a can, their eyes turn red. Scientists have not yet been able to determine the mechanisms underlying these behaviors.

What might we reason about the aliens’ seemingly magical soda sense? If we adopt a sort of vulgar uniformitarianism—one which rejects outlandish explanations like time travel or mind-reading—then the only possible explanation remaining to us is that there really is something chemically distinct between the two classes of soda, and that the Rigelian sensory system is sensitive to this difference.

Really, this deduction isn’t so different from the one made by linguists like Harris and Rubach: both observe different behaviors and posit distinct entities to explain them. Of course, there is something ontologically different between the two types of soda and the two types of Polish e. The former is a purely chemical difference; the latter arises because the human language faculty turns primary linguistic data, through the epistemic process we call first language acquisition, into one type of meat (brain tissue), and that type of meat makes another type of meat (the articulatory apparatus) behave in a way that, all else held equal, will recapitulate the primary linguistic data. But both of these deductions are equally valid.

Endnotes

  1. Broadly-similar phenomena previously studied include fleeting vowels in Finnish, Hungarian, Turkish, and Yine, ternary voice contrasts in Turkish, possessive formation in Huichol, and passive formation in Māori.
  2. For simplicity I put aside the arguments by Pater (2009) and Gouskova (2012) that morphs, not segments, are magical. While I am not yet convinced by their arguments, everything I have to say here is broadly consistent with their proposal.
  3. This is yet another feature of language that is difficult to falsify. But as Ollie Sayeed once quipped, the language faculty did not evolve to satisfy a vulgar Popperian falsificationism.
  4. Specifically, Rubach assumes that the non-alternating e’s have a prespecified mora, whereas the alternating e’s do not.

References

Chomsky, N. and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Gouskova, M. 2012. Unexceptional segments. Natural Language & Linguistic Theory 30: 79-133.
Halle, M. 1973. Prolegomena to a theory of word formation. Linguistic Inquiry 4: 3-16.
Harris, J. 1969. Spanish Phonology. MIT Press.
Pater, J. 2009. Morpheme-specific phonology: constraint indexation and inconsistency resolution. In S. Parker (ed.), Phonological Argumentation: Essays on Evidence and Motivation, pages 123-154. Equinox.
Rubach, J. 1986. Abstract vowels in three-dimensional phonology: the yers. The Linguistic Review 5: 247-280.
Rubach, J. 2016. Polish yers: Representation and analysis. Journal of Linguistics 52: 421-466.

Does GPT-3 have free speech rights?

I have some discomfort with this framing. It strikes me as unnecessarily frivolous about some serious questions. Here is an imagined dialogue.

Should GPT-3 have the right to free speech?

No. Software does not have rights, nor should it. Living things are the only agents in moral-ethical calculations. Free speech as it is currently construed should also be recognized as a civic myth of the United States, one not universally shared. Furthermore, it should be recognized that all rights, including the right to self-expression, can impinge upon the rights and dignity of others.

What if a court recognized a free-speech right for GPT-3?

Then that court would be illegitimate. However, it is very easy to imagine this happening in the States given that the US “civic myth” is commonly used to provide extraordinary legal protections to corporate entities.

What if that allowed it to spread disinformation?

Then the operator would be morally responsible for all consequences of that dissemination.