The role of phonotactics in language change

How does phonotactic knowledge influence the path taken by language change? As is often the case, the null hypothesis seems to be simply that it doesn’t. Perhaps speakers have projected a phonotactic constraint C into the grammar of Old English, but that doesn’t necessarily mean that Middle English will conform to C, or even that Middle English won’t freely borrow words that flagrantly violate C.

One case comes from the history of English. As is well known, Modern English /ʃ/ descends from Old English sk; modern instances of word-initial sk are mostly borrowed from Dutch (e.g., skipper) or Norse (e.g., ski); sky was borrowed from an Old Norse word meaning ‘cloud’ (which tells you a lot about the weather in the Danelaw). Furthermore, Old English forbade super-heavy rimes consisting of a long vowel followed by a consonant cluster. Because the one major source of /ʃ/ was sk, and because a word-final long vowel followed by sk was unheard of, V̄ʃ# was rare in Middle English, and word-final sequences of tense vowels followed by [ʃ] are still rare in Modern English (Iverson & Salmons 2005). Of course there are exceptions, but according to Iverson & Salmons, they tend to:

  • be markedly foreign (e.g., cartouche),
  • be proper names (e.g., LaRouche),
  • or convey an “affective, onomatopoeic quality” (e.g., sheesh, woosh).

However, it is reasonably clear that all of these were added during the Middle or Modern English period. Clearly, this constraint, which is still statistically robust (Gorman 2014a:85), did not prevent speakers from borrowing and coining exceptions to it. Still, it is hard to rule out any historical effect of the constraint: perhaps there would be more Modern English V̄ʃ# words otherwise.

Another case of interest comes from Latin. As is well known, Old Latin went through a near-exceptionless “Neogrammarian” sound change, a “primary split” or “conditioned merger” of intervocalic s with r. (The terminus ante quem, i.e., the latest possible date, for the actuation of this change is the 4th c. BCE.) This change had the effect of temporarily eliminating all traces of intervocalic s in late Old Latin (Gorman 2014b). From this fact, one might posit that speakers of this era of Latin projected a *VsV constraint, and that this constraint would prevent subsequent sound changes from reintroducing intervocalic s. But this is clearly not the case: in the 1st c. BCE, degemination of ss after diphthongs and long monophthongs reintroduced intervocalic s (e.g., caussa > classical causa ‘cause’). It is also clear that loanwords with intervocalic s were freely borrowed, and with the exception of the very early Greek borrowing tūs-tūris ‘incense’, none of them were adapted in any way to conform to a putative *VsV constraint:

(1) Greek loanwords: ambrosia ‘id.’, *asōtus ‘libertine’ (acc.sg. asōtum), basis ‘pedestal’, basilica ‘public hall’, casia ‘cinnamon’ (cf. cassia), cerasus ‘cherry’, gausapa ‘woolen cloth’, lasanum ‘cooking utensil’, nausea ‘id.’, pausa ‘pause’, philosophus ‘philosopher’, poēsis ‘poetry’, sarīsa ‘lance’, seselis ‘seseli’
(2) Celtic loanwords: gaesī ‘javelins’, omāsum ‘tripe’
(3) Germanic loanwords: glaesum ‘amber’, bisōntes ‘wild oxen’

References

Gorman, K. 2014a. A program for phonotactic theory. In Proceedings of the 47th Annual Meeting of the Chicago Linguistic Society, pages 79-93.
Gorman, K. 2014b. Exceptions to rhotacism. In Proceedings of the 48th Annual Meeting of the Chicago Linguistic Society, pages 279-293.
Iverson, G. K. and Salmons, J. C. 2005. Filling the gap: English tense vowel plus final /š/. Journal of English Linguistics 33: 1-15.

What phonotactics-free phonology is not

In my previous post, I showed how many phonological arguments are implicitly phonotactic in nature, using the analysis of the Latin labiovelars as an example. If we instead adopt a restricted view of phonotactics as derived from phonological processes, as I argue for in Gorman 2013, what specific forms of argumentation must we reject? I discern two such types:

  1. Arguments from the distribution of phonemes in URs. Early generative phonologists posited sequence structure constraints, constraints on sequences found in URs (e.g., Stanley 1967, et seq.). This seems to reflect more the then-contemporary mania for information theory and lexical compression, ideas which appear to have led nowhere and which were abandoned not long after. Modern forms of this argument may use probabilistic constraints instead of categorical ones, but the same critiques remain. It has never been articulated why these constraints, whether categorical or probabilistic, are considered key acquirenda; i.e., why would speakers bother to track these constraints, given that they simply recapitulate information already present in the lexicon? Furthermore, as I noted in the previous post, it is clear that some of these generalizations are apparent even to non-speakers of the language; for example, monolingual New Zealand English speakers have a surprisingly good handle on Māori phonotactics despite knowing few if any Māori words. Finally, as discussed elsewhere (Gorman 2013: ch. 3, Gorman 2014), some statistically robust sequence structure constraints appear to have little if any effect on speakers’ judgments of nonce word well-formedness, loanword adaptation, or the direction of language change.
  2. Arguments based on the distribution of SRs not derived from neutralizing alternations. Some early generative phonologists also posited surface-based constraints (e.g., Shibatani 1973). These were posited to account for supposed knowledge of “wordlikeness” that could not be explained on the basis of constraints on URs. One example is that of German, which has across-the-board word-final devoicing of obstruents, but which clearly permits underlying root-final voiced obstruents in free stems (e.g., [gʀaːt]-[gʀaːdə] ‘degree(s)’ from /grad/). In such a language, Shibatani claims, a nonce word with a word-final voiced obstruent would be judged un-wordlike. Two points should be made here. First, the surface constraint in question derives directly from a neutralizing phonological process. Constraint-based theories which separate “disease” and “cure” posit a constraint against word-final voiced obstruents, but in procedural/rule-based theories there is no reason to reify this generalization, which after all is a mere recapitulation of the facts of alternation, arguably a more entrenched source of evidence for grammar construction. Secondly, Shibatani did not in fact validate his claim about German speakers’ judgments in any systematic fashion. Some recent work by Durvasula & Kahng (2019) reports that speakers do not necessarily judge a nonce word to be ill-formed just because it fails to follow certain subtle allophonic principles.

References

Durvasula, K. and Kahng, J. 2019. Phonological acceptability is not isomorphic with phonological grammaticality of stimulus. Talk presented at the Annual Meeting on Phonology.
Gorman, K. 2013. Generative phonotactics. Doctoral dissertation, University of Pennsylvania.
Gorman, K. 2014. A program for phonotactic theory. In Proceedings of the 47th Annual Meeting of the Chicago Linguistic Society, pages 79-93.
Shibatani, M. 1973. The role of surface phonetic constraints in generative phonology. Language 49(1): 87-106.
Stanley, R. 1967. Redundancy rules in phonology. Language 43(2): 393-436.

Thought experiment #2

In an earlier post, I argued for the logical necessity of admitting some kind of “magic” to account for lexically arbitrary behaviors like Romance metaphony or Slavic yers. In this post I’d like to briefly consider the consequences for the theory of language acquisition.

If mature adult representations have magic, infants’ hypothesis space must also include the possibility of positing magical URs (as Jim Harris argues for Spanish and Jerzy Rubach argues for Polish). What might happen if the hypothesis space were not so specified? Consider the following thought experiment:

The Rigelians from Thought Experiment #1 did not do a good job sterilizing their space ships. (They normally just lick the flying saucer real good.) Specks of Rigelian dust carry a retrovirus that infects human infants and modifies their faculty of language so that they no longer entertain magical analyses.

What then do we suppose might happen to the Spanish and Polish patterns we previously identified as instances of magic? Initially, the primary linguistic data will not have changed, just the acquisitional hypothesis space. What kind of grammar will infected Spanish-acquiring babies acquire?

For Harris (and Rubach), the answer must be that infected babies cannot acquire the metaphonic patterns present in the PLD. Since there is reason to think (see, e.g., Gorman & Yang 2019:§3) that the diphthongization is the minority pattern in Spanish, it seems most likely that the children will acquire a novel grammar in which negar ‘to deny’ has an innovative non-alternating first person singular indicative *nego rather than niego ‘I deny’.

Not all linguists agree. For instance, Bybee & Pardo (1981; henceforth BP) claim that there is some local segmental conditioning on diphthongization, in the sense that Spanish speakers may be able to partially predict whether or not a stem diphthongizes on the basis of nearby segments.1 Similarly, Albright, Andrade, & Hayes (2001; henceforth AAH) develop a computational model which can extract generalizations of this sort.2 For instance, BP claim that an e followed by __r, __nt, or __rt is more likely to diphthongize, and AAH claim that a following stem-final __rr (the alveolar trill [r], not the alveolar tap [ɾ]) and a following __mb also favor diphthongization. BP are somewhat fuzzy about the representational status of these generalizations, but for AAH, who reject the magical segment analysis, they are expressed by a series of competing rules.

I am not yet convinced by this proposal. Neither BP nor AAH give the reader any general sense of the coverage of the segmental generalizations they propose (or in the case of AAH, that their computational model discovers): I’d like to know basic statistics like precision and recall for existing words. Furthermore, AAH note that their computational model sometimes needs to fall back on “word-specific rules” (their term), rules in which the segmental conditioning is an entire stem, and I’d like to know how often this is necessary.3 Rather than reporting coverage, BP and AAH instead correlate their generalizations with the results of wug-tasks (i.e., nonce word production tasks) by Spanish-speaking adults. The obvious objection here is that no evidence—or even an explicit linking hypothesis—links adults’ generalizations about nonce words in a lab to children’s generalizations about novel words in more naturalistic settings.
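For concreteness, here is the sort of coverage report I have in mind, sketched in Python. The single toy generalization and the stems below are invented placeholders, not BP’s or AAH’s actual rules or data; the point is only the bookkeeping.

```python
# Toy stand-in for a BP-style segmental generalization: predict
# diphthongization when the stem ends in rr, nt, or rt. (Invented for
# illustration; not BP's or AAH's actual generalizations.)
def predicts_diphthongization(stem):
    return stem.endswith(("rr", "nt", "rt"))

# Invented (stem, actually-diphthongizes) pairs.
lexicon = [
    ("querr", True),   # predicted and diphthongizes: true positive
    ("tent", True),    # true positive
    ("neg", True),     # diphthongizes but not predicted: false negative
    ("pert", False),   # predicted but does not: false positive
    ("peg", False),    # true negative
]

tp = sum(predicts_diphthongization(s) and d for s, d in lexicon)
fp = sum(predicts_diphthongization(s) and not d for s, d in lexicon)
fn = sum(not predicts_diphthongization(s) and d for s, d in lexicon)

precision = tp / (tp + fp)  # of the stems flagged, how many diphthongize?
recall = tp / (tp + fn)     # of the diphthongizing stems, how many flagged?
```

Reporting numbers like these over the full Spanish lexicon would tell us how much work the segmental generalizations are actually doing.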

However, I want to extend an olive branch to linguists who are otherwise inclined to agree with BP and AAH. It is entirely possible that children do use local segmental conditioning to learn the patterns linguists analyzed with magical segments and/or morphs, even if we continue to posit magic segments or morphs. It is even possible that sensitivity to this segmental conditioning persists into adulthood, as reflected in the aforementioned wug-tasks. Local segmental conditioning might be an example of domain-general pattern learning, and might be likened to sound symbolism—such as the well-known statistical tendency for English words beginning in gl– to relate to “light, vision, or brightness” (Charles Yang, p.c.)—insofar as both types of patterns reduce the apparent arbitrariness of the lexicon. I am also tempted to identify both local segmental conditioning and sound symbolism as examples of third factor effects (in the sense of Chomsky 2005). Chomsky identifies three factors in the design of language: the genetic endowment, “experience” (the primary linguistic data), and finally “[p]rinciples not specific to the faculty of language”. Some examples of third factors—as these principles not specific to the faculty of language are called—given in the paper include domain-general principles of “data processing” or “data analysis” and biological constraints, whether “architectural”, “computational”, or “developmental”. I submit that general-purpose pattern learning might be an example of domain-general “data analysis”.

As it happens, we do have one way to probe the coverage of local segmental conditioning. Modern sequence-to-sequence neural networks, arguably the most powerful domain-general string pattern learning tool known to us, have been used for morphological generation tasks. For instance, in the CoNLL-SIGMORPHON 2017 shared task, neural networks are used to predict the inflected form of various words given some citation form and a morphological specification. Thus, given the pair (dentar, V;IND;PRS;1;SG), the models have to predict diento ‘I am teething’. Very briefly, these models, as currently designed, are much like babies infected with the Rigelian retrovirus: their hypothesis space does not include “magic” segments or lexical diacritics and they must rely solely on local segmental conditioning. It is perhaps not surprising, then, that they misapply diphthongization in Spanish (e.g., *recolan for recuelan ‘they re-strain’; Gorman et al. 2019) or yer deletion in Polish when presented with previously unseen lemmata. But it is an open question how closely these errors pattern like those made by children, or with adults’ behaviors in wug™-tasks.
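To make the predicament concrete, here is a deliberately impoverished inflector sketched in Python. Like the shared-task models (though vastly simpler), it has no access to magic segments or lexical diacritics, only the segmental string, so whatever default it adopts it must err on one of the two verb classes from the text.

```python
# A toy inflector with no lexical diacritics: it derives the 1sg present
# indicative of an -ar verb from the segmental string alone. Since negar
# and pegar are segmentally parallel in the relevant respects, no such
# function can get both right.
def first_sg(infinitive):
    stem = infinitive[:-2]  # strip the -ar desinence
    return stem + "o"       # default: no diphthongization

assert first_sg("pegar") == "pego"  # correct: pego 'I stick to s.t.'
assert first_sg("negar") == "nego"  # wrong: should be niego 'I deny'
```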

Acknowledgments

I thank Charles Yang for drawing my attention to some of the issues discussed above.

Endnotes

  1. Similarly, Rysling (2016) argues that Polish yers are epenthesized to avoid certain branching codas, though she admits that their appearance is governed in part by magic (according to her analysis, exceptional morphs of the Gouskova/Pater variety).
  2. Later versions of this model developed by Albright and colleagues are better known for popularizing the notion of “islands of reliability”.
  3. Bill Idsardi (p.c.) raises the question of whether magical URs and morpholexical rules are extensionally equivalent. Good question.

References

Albright, A., Andrade, A., and Hayes, B. 2001. Segmental environments of Spanish diphthongization. UCLA Working Papers in Linguistics 7: 117-151.
Bybee, J., and Pardo, E. 1981. Morphological and lexical conditioning of rules: experimental evidence from Spanish. Linguistics 19: 937-968.
Chomsky, N. 2005. Three factors in language design. Linguistic Inquiry 36(1): 1-22.
Gorman, K. and Yang, C. 2019. When nobody wins. In Franz Rainer, Francesco Gardani, Hans Christian Luschützky and Wolfgang U. Dressler (ed.), Competition in inflection and word formation, pages 169-193. Springer.
Gorman, K., McCarthy, A.D., Cotterell, R., Vylomova, E., Silfverberg, M., Markowska, M. 2019. Weird inflects but okay: making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 140-151.
Rysling, A. 2016. Polish yers revisited. Catalan Journal of Linguistics 15: 121-143.

Thought experiment #1

A non-trivial portion of what we know about the languages we speak includes information about lexically-arbitrary behaviors, behaviors that are specific to certain roots and/or segments and absent in other superficially-similar roots and/or segments. One of the earliest examples is the failure of English words like obesity to undergo Chomsky & Halle’s (1968: 181) rule of trisyllabic shortening: compare serene-serenity to obese-obesity (Halle 1973: 4f.). Such phenomena are very common in the world’s languages. Some of the well-known examples include Romance mid-vowel metaphony and the Slavic fleeting vowels, which delete in certain phonological contexts.1

Linguists have long claimed (e.g., Harris 1969) that one cannot predict whether a Spanish e or o in the final syllable of a verb stem will or will not undergo diphthongization (to ie or ue, respectively) when stress falls on the stem rather than the desinence. For instance, negar ‘to deny’ diphthongizes (niego ‘I deny’, *nego) whereas the superficially similar pegar ‘to stick to s.t.’ does not (pego ‘I stick to s.t.’, *piego). There is no reason to suspect that the preceding segment (n vs. p) has anything to do with it; the Spanish speaker simply needs to memorize which mid vowels diphthongize.2 The same is arguably true of the Polish fleeting vowels known as yers, which delete in, among other contexts, the genitive singular (gen.sg.) of masculine nouns. Thus sen ‘dream’ has a gen.sg. snu, with deletion of the internal e, whereas the superficially similar basen ‘pool’ has a gen.sg. basenu, retaining the internal e (Rubach 2016: 421). Once again, the Polish speaker needs to memorize whether or not each e deletes.

So as to not presuppose a particular analysis, I will refer to segments with these unpredictable alternations—diphthongization in Spanish, deletion in Polish—as magical. Exactly how this magic ought to be encoded is unclear.3 One early approach was to exploit the feature system so that they were underlyingly distinct from non-magical segments. These “exploits” might include mapping magical segments onto gaps in the surface segmental inventory, underspecification, or simply introducing new features. Nowadays, phonologists are more likely to use prosodic prespecification. For instance, Rubach (1986) proposes that the Polish yers are prosodically defective compared to non-alternating e.4 Others have claimed that magic resides in the morph, not the segment.

Regardless of how the magic is encoded, it is a deductive necessity that it be encoded somehow. Clearly something is representationally different in negar and pegar, and sen and basen. Any account which discounts this will be descriptively inadequate. To make this a bit clearer, consider the following thought experiment:

We are contacted by a benign, intelligent alien race, carbon-based lifeforms from the Rigel system with feliform physical morphology and a fondness for catnip. Our scientists observe that they exhibit a strange behavior: when they imbibe fountain soda, their normally-green eyes turn yellow, and when they imbibe soda from a can, their eyes turn red. Scientists have not yet been able to determine the mechanisms underlying these behaviors.

What might we reason about the aliens’ seemingly magical soda sense? If we adopt a sort of vulgar uniformitarianism—one which rejects outlandish explanations like time travel or mind-reading—then the only possible explanation remaining to us is that there really is something chemically distinct between the two classes of soda, and the Rigelian sensory system is sensitive to this difference.

Really, this deduction isn’t so different from the one made by linguists like Harris and Rubach: both observe different behaviors and posit distinct entities to explain them. Of course, there is something ontologically different between the two types of soda and the two types of Polish e. The former is a purely chemical difference; the latter arises because the human language faculty turns primary linguistic data, through the epistemic process we call first language acquisition, into one type of meat (brain tissue), and that type of meat makes another type of meat (the articulatory apparatus) behave in a way that, all else held equal, will recapitulate the primary linguistic data. But both of these deductions are equally valid.

Endnotes

  1. Broadly-similar phenomena previously studied include fleeting vowels in Finnish, Hungarian, Turkish, and Yine, ternary voice contrasts in Turkish, possessive formation in Huichol, and passive formation in Māori.
  2. For simplicity I put aside the arguments by Pater (2009) and Gouskova (2012) that morphs, not segments, are magical. While I am not yet convinced by their arguments, everything I have to say here is broadly consistent with their proposal.
  3. This is yet another feature of language that is difficult to falsify. But as Ollie Sayeed once quipped, the language faculty did not evolve to satisfy a vulgar Popperian falsificationism.
  4. Specifically, Rubach assumes that the non-alternating e’s have a prespecified mora, whereas the alternating e’s do not.

References

Chomsky, N. and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Gouskova, M. 2012. Unexceptional segments. Natural Language & Linguistic Theory 30: 79-133.
Halle, M. 1973. Prolegomena to a theory of word formation. Linguistic Inquiry 4: 3-16.
Harris, J. 1969. Spanish Phonology. MIT Press.
Pater, J. 2009. Morpheme-specific phonology: constraint indexation and inconsistency resolution. In S. Parker (ed.), Phonological Argumentation: Essays on Evidence and Motivation, pages 123-154. Equinox.
Rubach, J. 1986. Abstract vowels in three-dimensional phonology: the yers. The Linguistic Review 5: 247-280.
Rubach, J. 2016. Polish yers: Representation and analysis. Journal of Linguistics 52: 421-466.

Idealizations gone wild

Generative grammar and information theory are products of the US post-war defense science funding boom, and it is no surprise that the former attempted to incorporate insights from the latter. Many early ideas in generative phonology—segment structure and morpheme structure rules and constraints (Stanley 1967), the notion of the evaluation metric (Aspects, §6), early debates on opacities, conspiracies, and the alternation condition—are clearly influenced by information theory. It is interesting to note that, as early as 1975, Morris Halle regarded his substantial efforts in this area as a failure.

In the 1950’s I spent considerable time and energy on attempts to apply concepts of information theory to phonology. In retrospect, these efforts appear to me to have come to naught. For instance, my elaborate computations of the information content in bits of the different phonemes of Russian (Cherry, Halle & Jakobson 1953) have been, as far as I know, of absolutely no use to anyone working on problems in linguistics. And today the same negative conclusion appears to me to be warranted about all my other efforts to make use of information theory in linguistics. (Halle 1975: 532)

Thus, the mania for information theory in early generative grammar was exactly the sort of bandwagon effect Claude Shannon, the inventor of information theory, warned about decades earlier.

In the first place, workers in other fields should realize that the basic results of the subject are aimed at a very specific direction, a direction that is not necessarily relevant to such fields as psychology, economics, and other social sciences. (Shannon 1956)

Today, however, information theory is not exactly in disrepute in linguistics. First off, perplexity, a metric derived from information theory, is used for intrinsic evaluation in certain natural language processing tasks, particularly language modeling.1 Secondly, there have been attempts to revive information theoretic notions as an explanatory factor in the study of phonology (e.g., Goldsmith & Riggle 2012) and human morphological processing (e.g., Moscoso del Prado Martı́n et al. 2004). And recently, Mollica & Piantadosi (2019; henceforth M&P) dare to use information theory to measure the size of the grammar of English.
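For readers unfamiliar with the metric: perplexity is just the exponentiated cross-entropy of a model on held-out data, as in the following minimal sketch (the probabilities here are invented for illustration).

```python
import math

def perplexity(token_probs):
    """Perplexity = 2 ** cross-entropy, where cross-entropy is the mean
    negative log2-probability the model assigns to each held-out token."""
    n = len(token_probs)
    cross_entropy = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** cross_entropy

# A model that assigns each of four tokens probability 1/4 is exactly
# as "perplexed" as a uniform choice among four alternatives.
assert abs(perplexity([0.25, 0.25, 0.25, 0.25]) - 4.0) < 1e-9
```

Lower perplexity means the model finds the held-out text less surprising, which is why it serves as an intrinsic measure for language modeling.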

M&P’s program is fundamentally one of idealization. Now, I don’t have any problem per se with idealization. Idealization is an important part of the epistemic process in science, one without which there can be no scientific observation at all. Critics of idealizations (and of idealization itself) are usually concerned with the things an idealization abstracts away from; for instance, critics of Chomsky’s famous “ideal speaker-listener” (Aspects, p. 3f) note correctly that it ignores bilingual interference, working memory limitations, and random errors. But idealizations are not merely the infinitude of variables they choose to ignore (and when the object of study is an enormously complex polysensory, multifactorial system like the human capacity for language, one is simply not going to be able to study the entire system all at once); they are just as much defined by the factors they foreground and the affordances they create, and the constraints they impose on scientific inquiry.

In this case, an information theoretic characterization of grammars constrains us to conceive of our knowledge of language in terms of probability distributions. This is a step I am often uncomfortable with. It is, for example, certainly possible to conceive of speakers’ lexical knowledge as a sort of probability distribution over lexical items, but I am not sure that P(word) has much grammatical work to do except act as a model of the readily apparent observation that more frequent words can be recalled and recognized more rapidly than rare words. To be sure, studies like the aforementioned one by Moscoso del Prado Martı́n et al. attempt to connect information theoretic characterizations of the lexicon to behavioral results, but these studies are correlational and provide little in the way of mechanistic-causal explanation.

However, for the sake of argument, let us assume that the probabilistic characterization of grammatical knowledge is coherent. Why then should it be undertaken? M&P claim that the measurements they will allow—grammar sizes, measured in bits—bear on a familiar debate. As they frame it:

…is the amount of information about language that is learned substantial (empiricism) or minimal (nativism)?

I don’t accept the terms of this debate. While I consider myself a nativist, I have formed no opinions about how many bits it takes to represent the grammar of English, which is by all accounts a rather complex object. The tradeoff between what is to be learned and what is innate is something that has been given extensive consideration in the nativist literature. Nativists recognize that the less there is to be learned, the more that has to have evolved in the rather short amount of time (in evolutionary terms) since we humans split off from our language-lacking primate cousins. But this tradeoff is strictly qualitative; were it possible to satisfactorily measure both evolutionary plausibility and grammar size, they would still be incommensurate quantities.

M&P proceed by computing the number of bits for various linguistic subsystems. They compute the information associated with phonemes (really, the acoustic cues to various features), the phonemic representation of wordforms, lexical semantics (mappings from words to meanings, here represented as a vector space as is the fashion), word frequency, and finally syntax. For each of these they provide lower bounds and upper bounds, though the upper bounds are in some cases constructed by adding an ad-hoc factor-of-two error to the lower bound. Finally, they sum these quantities, giving an estimate of roughly 1.5 megabytes. This M&P consider to be substantial. It is not at all clear why they feel this is the case, or how small a grammar would have to be to be “minimal”.
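The arithmetic of their estimate is simple enough to sketch. The module names below follow the text, but every number is an invented placeholder, not M&P’s published figure; the sketch only shows how per-module lower bounds, the ad-hoc factor-of-two upper bounds, and the bit-to-megabyte conversion fit together.

```python
# Hypothetical per-module lower bounds, in bits. These are placeholders
# invented for illustration, not M&P's actual estimates.
lower_bounds_bits = {
    "phonemes": 1e5,
    "wordform phonology": 4e5,
    "lexical semantics": 1e7,
    "word frequency": 1e6,
    "syntax": 7e5,
}

lower_bits = sum(lower_bounds_bits.values())
upper_bits = 2 * lower_bits      # the ad-hoc factor-of-two error bound
lower_mb = lower_bits / 8 / 1e6  # bits -> bytes -> megabytes
```

Note that summing the modules like this is only licensed if they are genuinely independent, non-interacting components.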

There is a lot to complain about in the details of M&P’s operationalizations. First, I am not certain that the systems they have identified are well-defined modules that would be recognizable to working linguists; for instance their phonemes module has next to nothing to do with my conception of phonological grammar. Secondly, it seems to me that by summing the bits needed to characterize each module, they are assuming a sort of “feed-forward”, non-interactive relationship between these components, and it is not clear that this is correct; for example, there are well-understood lexico-semantic constraints on verbs’ argument structure.

While I do not wish to go too far afield, it may be useful to consider in more detail their operationalization of syntax. For this module, they use a corpus of textbook example sentences, then compute the number of possible unlabeled binary branching trees that would cover each example. (For a sentence of n words, this quantity is the (n−1)th Catalan number.) To turn this into a probability, they assume that one correct parse has been sampled from a uniform distribution over all possible binary trees for the given sentence. First, this assumption of uniformity is completely unmotivated. Secondly, since they assume there’s exactly one possible bracketing, and do not provide labels to non-terminals, they have no way of representing the ambiguity of sentences like Call John an ambulance. (Thanks to Brooke Larson for suggesting this example.) Anyone familiar with syntax will have no problem finding gaping faults with this operationalization.2
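As a sanity check on the combinatorics, here is a small Python sketch of the quantity involved (my code, not M&P’s): the number of unlabeled binary bracketings of an n-word sentence, and the information content of one parse sampled uniformly from them.

```python
from math import comb, log2

def catalan(n):
    """The nth Catalan number."""
    return comb(2 * n, n) // (n + 1)

def parse_bits(num_words):
    """Bits needed to identify one parse under a uniform distribution
    over all unlabeled binary bracketings of a num_words-word sentence."""
    return log2(catalan(num_words - 1))

assert catalan(2) == 2       # a 3-word sentence has 2 bracketings
assert parse_bits(3) == 1.0  # so its parse costs exactly one bit
```

The count grows quickly (catalan(9), for a 10-word sentence, is already 4862), so under the uniformity assumption long sentences are charged a great many bits.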

M&P justify all this hastiness by comparing their work to the informal estimation approach known as a Fermi problem (they call them “Fermi calculations”). In the original framing, the quantity being estimated is the product of many terms, so assuming errors in estimation of each term are independent, the final estimate’s error is expected to grow logarithmically as the number of terms increases (roughly, this is because the logarithm of a product is equal to the sum of the logarithms of its terms). But in M&P’s case, the quantity being estimated is a sum, so the error will grow much faster, i.e., linearly as a function of the number of terms. Perhaps, as one reviewer writes, “you have to start somewhere”. But do we? If something is not worth doing well—and I would submit that measuring grammars, in all their richness, by comparing them to the storage capacity of obsolete magnetic storage media is one such thing—it seems to me to be not worth doing at all.

Footnotes

  1. Though not without criticism; in speech recognition, probably the most important application of language modeling, it is well-known that decreases in perplexity don’t necessarily give rise to decreases in word error rate.
  2. Why do M&P choose such a degenerate version of syntax? Because syntactic theory is “experimentally under-determined”, so they want to be “independent as possible from the specific syntactic formalism.”

References

Cherry, E. C., Halle, M., and Jakobson, R. 1953. Towards the logical description of languages in their phonemic aspect. Language 29(1): 34-46.
Chomsky, N. 1965. Aspects of the theory of syntax. MIT Press.
Goldsmith, J. and Riggle, J. 2012. Information theoretic approaches to phonology: the case of Finnish vowel harmony. Natural Language & Linguistic Theory 30(3): 859-896.
Halle, M. 1975. Confessio grammatici. Language 51(3): 525-535.
Mollica, F. and Piantadosi, S. P. 2019. Humans store about 1.5 megabytes of information during language acquisition. Royal Society Open Science 6: 181393.
Moscoso del Prado Martı́n, F., Kostić, A., and Baayen, R. H. 2004. Putting the bits together: an information theoretical perspective on morphological processing. Cognition 94(1): 1-18.
Shannon, C. E. 1956. The bandwagon. IRE Transactions on Information Theory 2(1): 3.
Stanley, R. 1967. Redundancy rules in phonology. Language 43(2): 393-436.

Fieldwork is hard.

Luo is a language of the Nilotic family spoken by about one million people in Nyanza Province in Kenya in east central Africa. Mr. Ben Blount, then a student at the University of California in Berkeley, went to Kenya in 1967 to make a study of the development of language in eight children encompassing the age range from 12 to 35 months. He intended to make his central procedure the collection on a regular schedule of large samples of spontaneous speech at home, usually with the mother as interpreter. In American and European families, at least of the middle class, it is usually possible to obtain a couple of hundred utterances in as little as a half an hour, at least it is so, once any shyness has passed. Among the Luo, things proved more difficult. In 54 visits of a half an hour or longer Mr. Blount was only able to obtain a total from all the children of 191 multi-word utterances. The problem was primarily one of Luo etiquette, which requires that small children be silent when adults come to visit, and the small children Mr. Blount visited could not throw off their etiquette even though their parents entreated them to speak for the visiting “European,” as Mr. Blount was called.

(Excerpt from A first language: The early stages by Roger Brown, p. 73. There’s a happy ending: Mr. Blount became Dr. Blount in 1969.)

On the Providence word gap intervention

A recent piece in the Boston Globe quoted my take on a grant to Providence, RI for a “word-gap” intervention. The quote expressed some skepticism about the grant’s goals, but omitted the part of my email where I explained why I felt that way. Readers of the piece might have gotten the impression that I had a less, uhm, nuanced take on the Providence grant than I do. So, here is a summary of my full email to Ben from which the quote was taken.

An ambitious proposal

First off, the Providence/LENA team should be congratulated on this successful grant application: I’m glad they got it and not something more “Bloombergian” (like, say, an experimental proposal to ban free-pizzas-with-beer deals in the interest of bulging hipster waistlines). And they deserve respect for getting approved for such an ambitious proposal: the cash involved is an order of magnitude larger than the average applied linguistics grant. And, perhaps most of all, I have a great deal of respect for any linguist who can convince a group of non-experts that their work is not only important but worth the opportunity cost. I also note that if materials from the Providence study are made publicly available (and they should be, in a suitably de-identified format, for the sake of the progress of the human race), my own research stands to benefit from this grant.

There is another sense in which the proposal is ambitious, however: the success of this intervention depends on a long chain of inferences. If any one of these is wrong, the intervention is unlikely to succeed. Here are what I see as the major assumptions under which the intervention is being funded.

Assumption I: There exists a “word gap” in lower-income children

I was initially skeptical of this claim because it is so similar to a discredited assumption of 20th-century educational theorists: that differences in school and standardized test performance were the result of the “linguistically impoverished” environment in which lower-class (and especially minority) speakers grew up.

This strikes me as quite silly: no one with even a tenuous acquaintance with African-American communities could fail to note the importance of verbal skills in those communities. Every African-American stereotype I can think of has one thing in common: an emphasis on verbal abilities. Here’s what Bill Labov, founder of sociolinguistics, had to say in his 1972 book, Language in the Inner City:

Black children from the ghetto area are said to receive little verbal stimulation, to hear very little well-formed language, and as a result are impoverished in their means of verbal expression…Unfortunately, these notions are based upon the work of educational psychologists who know very little about language and even less about black children. The concept of verbal deprivation has no basis in social reality. In fact, black children in the urban ghettos receive a great deal of verbal stimulation…and participate fully in a highly verbal culture. (p. 201)

I suspect that Labov may have dismissed the possibility of input deficits prematurely, just as I did. After all, it is an empirical hypothesis, and while Betty Hart and Todd Risley’s original study on differences in lexical input involved a small and perhaps-atypical sample, the correlation between socioeconomic status and lexical input has been replicated many times. So, there may be something to the “impoverishment theory” after all.

Assumption II: LENA can really estimate input frequency

Can we really count words using current speech technology? In a recent Language Log post, Mark Liberman speculated that counting words might be beyond the state of the art. While I have been unable to find much information on the researchers behind the grant or behind LENA, I don’t see any reason to doubt that the LENA Foundation has in fact built a useful state-of-the-art speech system that allows them to estimate input frequencies with great precision. One thing that gives me hope is that a technical report by LENA researchers provides estimates of average input frequency in English that are quite close to an estimate computed by developmentalist Dan Swingley (in a peer-reviewed journal) using entirely different methods.
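For readers curious what the underlying arithmetic looks like, here is a minimal sketch of extrapolating a daily word-input estimate from short transcript samples. The transcripts and the waking-hours figure below are invented for illustration; this is emphatically not LENA’s actual method, which works from raw audio rather than transcripts.

```python
# Crude extrapolation of daily word input from short transcript samples.
# All numbers here are made up for illustration.

def words_per_hour(transcripts, hours_sampled):
    """Estimate hourly word input: total words observed / hours of audio sampled."""
    total_words = sum(len(t.split()) for t in transcripts)
    return total_words / hours_sampled

# A few toy utterance transcripts, standing in for a short recorded sample.
samples = [
    "look at the doggie",
    "do you want more juice",
    "where did the ball go",
]

rate = words_per_hour(samples, hours_sampled=0.01)  # i.e., 36 seconds of audio
daily_estimate = rate * 12  # assume roughly 12 waking hours of exposure
print(round(daily_estimate))
```

The fragility of this kind of extrapolation (tiny samples, a guessed multiplier for waking hours) is one reason convergence between LENA’s estimates and Swingley’s independently derived ones is reassuring.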

Assumption III: The “word gap” can be solved by intervention

For children who are identified as “at risk”, the Providence intervention offers the following:

Families participating in Providence Talks would receive these data during a monthly coaching visit along with targeted coaching and information on existing community resources like read-aloud programs at neighborhood libraries or special events at local children’s museums.

Will this have a long-term effect? I simply don’t know of any work looking into this (though please comment if you’re aware of something relevant), so this too is a strong assumption.

Given that there is now money in the budget for coaching, why are LENA devices necessary? Would it be better if any concerned parent could get coaching?

And, finally, do the caretakers of the most at-risk children really have time to give to this intervention? I believe the most obvious explanation of the correlation between verbal input and socioeconomic status is that caretakers on the lower end of the socioeconomic scale have less time to give to their children’s education: this follows from the observation that child care quality is a strong predictor of cognitive abilities. If this is the case, then simply offering counseling will do little to eliminate the word gap, since the families most at risk are the least able to take advantage of the intervention.

Assumption IV: The “word gap” has serious life consequences

Lexical input is clearly important for language development: it is, in some sense, the sole factor determining whether a typically developing child acquires English or Yawelmani. And, we know the devastating consequences of impoverished lexical input.

But here we are at risk of falling for the all-too-common fallacy of equating predictors of variance in clinical populations with those in subclinical populations. While massively impoverished language input gives rise to clinical language deficits, it does not follow that differences in language skills among typically developing children can be eliminated by leveling the language-input playing field.

Word knowledge (as measured by verbal IQ, for instance) is correlated with many other measures of language attainment, but are increases in language skills enough to help an at-risk child to escape the ghetto (so to speak)?

This is the most ambitious assumption of the Providence intervention. Because there is such a strong correlation between lexical input and social class, it is very difficult to control for class while manipulating lexical input (and doing so would presumably be wildly unethical), so we know very little on this subject. I hope that the Providence study will shed some light on this question.

So what’s wrong with more words?

This is exactly what my mom wanted to know when I sent her a link to the Globe piece. She wanted to emphasize that I only got the highest-quality word-frequency distributions all throughout my critical period! I support, tentatively, the Providence initiative and wish them the best of luck; if these assumptions all turn out to be true, the organizers and scientists behind the grant will be real heroes to me.

But, that leads me to the only negative effect this intervention could have: if closing the word gap does little to influence long-term educational outcomes, it will have made concerned parents unduly anxious about the environment they provide for their children. And that just ain’t right.

(Disclaimer: I work for OHSU, where I’m supported by grants, but these are my professional and personal opinions, not those of my employer or funding agencies. That should be obvious, but you never know.)