On who is allowed to graduate

There is a convention I’ve seen at several institutions whereby a (usually PhD) student who already has a job or post-doc lined up is permitted to defend a dissertation that is less complete than would otherwise be accepted were they not up against a deadline. One suspects this sort of thing is applied in a rather biased fashion, but let’s suppose it were not. I cannot see any justification for it. It produces poor science, it is bad for departmental morale and esprit de corps, and it doesn’t prepare the student for future success in an environment where their advisor can no longer put a finger on the scale.

Now it is true that advisors or committee members, for whatever reason, occasionally try to squeeze a student for one more experiment that is more of a nice-to-have than essential to the argument of the thesis, but it is not clear why accepting a sub-par dissertation should be a remedy for this, or why such a remedy should only be available if you have a new job starting in two weeks.

Defectivity in Kinande

[This is part of a series of defectivity case studies.]

I have already written a bit about reduplication in Kinande; it too is an example of inflectional defectivity, and here I’ll focus on that fact.

In this language, most verbs participate in a form of reduplication with the semantics of roughly ‘to hurriedly V’ or ‘to repetitively V’. Mutaka & Hyman (1990; henceforth MH) argue that the reduplicant is a bisyllabic prefix. For instance, the reduplicated form of e-ri-gend-a ‘to leave’ is e-ri-gend-a-gend-a ‘to leave hurriedly’, where the first gend-a is the reduplicant. (In MH’s terms, e- is the “augment”, -ri- the “prefix”, and -a the “final vowel” morpheme.)

Certain verbal suffixes, known to Bantuists as extensions, may also be found in the reduplicant when the reduplicant would otherwise be less than bisyllabic. For instance, the passive suffix, underlyingly /-u-/, surfaces as [w] and is copied by reduplication. Thus for the verb root hum ‘beat’, the passive e-ri-hum-w-a reduplicates as e-ri-hum-w-a-hum-w-a. More interesting is that there are “unproductive” (MH’s term) extensions.1 Verbs bearing these extensions rarely have a compositional semantic relationship with their unextended form (if an unextended verb stem exists at all). For instance, luh-uk-a ‘take a rest’ may be semantically related to luh-a ‘be tired’, but there is no unextended *bát-a to go with bát-uk-a ‘move’.

Interesting things happen when we try to reduplicate unproductively extended monosyllabic verb roots. For some such verbs, the extension is not reduplicated; e.g., e-rí-bang-uk-a ‘to jump about’ has the reduplicated form e-rí-bang-a-bang-uk-a. This is the same behavior found for “productive” extensions. For others, the extension is reduplicated, producing a trisyllabic—instead of the normal bisyllabic—reduplicant; e.g., e-ri-hur-ut-a ‘to snore’ has the reduplicated form e-ri-hur-ut-a-hur-ut-a. Finally, there are some stems—all monosyllabic verb roots with unproductive extensions—which do not undergo reduplication at all; e.g., e-rí-bug-ul-a ‘to find’ does not reduplicate: neither *e-rí-bug-a-bug-ul-a nor *e-rí-bug-ul-a-bug-ul-a exists.
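Since nothing phonological distinguishes these three classes, the simplest summary is a lexical lookup. Here is a toy sketch in Python (not MH’s analysis): the stem segmentations are simplified, the class labels are my own, and the point is just that the behavior must be memorized per lexeme.

# Toy lexicon: each unproductively extended monosyllabic root is
# listed with its (apparently arbitrary) reduplication behavior.
BEHAVIOR = {
    "bang-uk": "skip_extension",   # e-rí-bang-a-bang-uk-a
    "hur-ut": "copy_extension",    # e-ri-hur-ut-a-hur-ut-a
    "bug-ul": "defective",         # no reduplicated form exists
}

def reduplicated_stem(stem: str) -> str | None:
    """Returns the reduplicated stem (augment and prefix omitted),
    or None if the verb is defective for reduplication."""
    behavior = BEHAVIOR[stem]
    if behavior == "defective":
        return None
    root, _, extension = stem.partition("-")
    if behavior == "skip_extension":
        # Copy only the root, padding to bisyllabicity with the final vowel.
        return f"{root}-a-{root}-{extension}-a"
    # Copy the root plus extension, yielding a trisyllabic reduplicant.
    return f"{stem}-a-{stem}-a"

assert reduplicated_stem("bang-uk") == "bang-a-bang-uk-a"
assert reduplicated_stem("hur-ut") == "hur-ut-a-hur-ut-a"
assert reduplicated_stem("bug-ul") is None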

While one could imagine that there are certain semantic restrictions on reduplication, as in Chaha, MH make no mention of any such restrictions in Kinande. If possible, we should rule this out as an explanation for the aforementioned defectivity.

Endnotes

  1. I will segment these with hyphens though it may make sense to regard some unproductive extensions as part of morphologically simplex stems.

References

Mutaka, N. and Hyman, L. M. 1990. Syllables and morpheme integrity in Kinande reduplication. Phonology 7: 73-119.

re.compile is otiose

Unlike its cousins Perl and Ruby, Python has no literal syntax for regular expressions. Whereas one can express the sheep language /baa+/ with a simple forward-slashed literal in Perl or Ruby, in Python one has to compile the expression using the function re.compile, which produces objects of type re.Pattern. Such objects have various methods for string matching.

import re

sheep = re.compile(r"baa+")
assert sheep.match("baaaaaaaa")

Except, one doesn’t actually have to compile regular expressions at all, as the documentation explains:

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

What this means is that in the vast majority of cases, re.compile is otiose (i.e., unnecessary). One can just define expression strings, and pass them to the equivalent module-level functions rather than using the methods of re.Pattern objects.

sheep = r"baa+"
assert re.match(sheep, "baaaaaaaa")

This, I would argue, is slightly easier to read, and certainly no slower. It also makes type annotations a bit more convenient, since str is easier to write than re.Pattern.
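If you want to convince yourself of the “no slower” claim, a quick micro-benchmark is easy to write. The following is a minimal sketch; the exact numbers will of course vary by machine and Python version, but given the caching behavior described above, the two should come out quite close.

import re
import timeit

compiled = re.compile(r"baa+")

# Module-level re.match hits the pattern cache after the first call.
print(timeit.timeit(lambda: re.match(r"baa+", "baaaaaaaa"), number=100_000))
# Calling the method on a precompiled pattern skips the cache lookup.
print(timeit.timeit(lambda: compiled.match("baaaaaaaa"), number=100_000))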

Now, I am sure there is some usage pattern which would favor explicit re.compile, but I have not encountered one in code worth profiling.

Defectivity in Polish

[This is part of a series of defectivity case studies.]

Gorman & Yang (2019), following up on a tip from Margaret Borowczyk (p.c.), discuss inflectional gaps in Polish declension. In this language, masculine genitive singular (gen.sg.) forms are marked either with -a or -u. The two gen.sg. suffixes have similar type frequency, and neither appears to be more default-like than the other. For instance, both allomorphs are used with loanwords. Because of this, it is generally agreed that the gen.sg. allomorphy is purely arbitrary and must be learned by rote, a process that continues into adulthood (e.g., Dąbrowska 2001, 2005).

Kottum (1981: 182) reports that his informants have no gen.sg. for masculine-gender toponyms like Dublin ‘id.’ (e.g., *Dublina/*Dublinu), Göteborg ‘Gothenburg’, and Tarnobrzeg ‘id.’, and Gorman & Yang (2019: 184) report that their informants do not have a gen.sg. for words like drut ‘wire’ (e.g., *druta/*drutu, though the latter is prescribed), rower ‘bicycle’, balon ‘balloon’, karabin ‘rifle’, autobus ‘bus’, and lotos ‘lotus flower’.

References

Dąbrowska, E. 2001. Learning a morphological system without a default: The Polish genitive. Journal of Child Language 28: 545-574.
Dąbrowska, E. 2005. Productivity and beyond: mastering the Polish genitive inflection. Journal of Child Language 32: 191-205.
Gorman, K. and Yang, C. 2019. When nobody wins. In F. Rainer, F. Gardani, H. C. Luschützky, and W. U. Dressler (eds.), Competition in Inflection and Word Formation, pages 169-193. Springer.
Kottum, S. S. 1981. The genitive singular form of masculine nouns in Polish. Scando-Slavica 27: 179-186.

Defectivity in Chaha

[This is part of a series of defectivity case studies.]

Rose (2000) describes a circumscribed form of defectivity in Chaha, a Semitic language spoken in Ethiopia. Throughout Ethio-Semitic, many verbs have a frequentative formed using a quadriliteral verbal template. Since few verb roots are quadriconsonantal—most are triconsonantal, some are biconsonantal—a sort of reduplication and/or spreading is used to fill in the template. In Tigrinya, for instance (p. 318), the frequentative template is of the form CɘCaCɘC. Thus the frequentative of the triconsonantal verb root √/grf/ ‘collect’ is [gɘrarɘf], with the root /r/ repeated, and for a biconsonantal verb root like √/ħt/ ‘ask’, the frequentative is [ħatatɘt], with three root /t/s.
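To make the template-filling procedure concrete, here is a naive procedural sketch in Python. This is my own rendering, not Rose’s formal analysis: it simply repeats a root consonant to fill the four C slots, and it ignores surface vowel adjustments.

def frequentative(root: str) -> str:
    """Fills the four C slots of the CɘCaCɘC template from a root of
    two to four consonants, repeating a consonant as needed."""
    c1, c4 = root[0], root[-1]
    middle = list(root[1:-1]) or [c4]  # biconsonantal roots repeat the final C
    while len(middle) < 2:
        middle.append(middle[-1])      # spread/copy to fill remaining slots
    c2, c3 = middle[0], middle[1]
    return f"{c1}ɘ{c2}a{c3}ɘ{c4}"

assert frequentative("grf") == "gɘrarɘf"   # √grf ‘collect’
# √ħt yields ħɘtatɘt here; Rose transcribes the surface form as
# [ħatatɘt], a first-vowel difference this sketch does not model.
assert frequentative("ħt") == "ħɘtatɘt"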

Rose contrasts this state of affairs with Chaha. In this language, the frequentative template CɨCɘCɘC cannot be satisfied by a biconsonantal root like √/tʼm/ ‘bend’ or √/Rd/ ‘burn’, and all such verbs lack a frequentative.1 The expected *[tʼɨmɘmɘm] and *[nɨdɘdɘd] are ill-formed, as are all other alternatives. Furthermore, no frequentatives of any sort can be formed with quadriconsonantal roots.

Rose notes that while there are often semantic reasons for a verb to lack a frequentative (e.g., stative and resultative verbs are generally not compatible with it), this does not seem applicable here.

Endnotes

  1. As Rose explains: “R represents a coronal sonorant which may be realized as [n] or [r] depending on context…” (p. 317).

References

Rose, S. 2000. Multiple correspondence in reduplication. In Proceedings of the 23rd Annual Meeting of the Berkeley Linguistics Society, pages 315-326.

Defectivity in English

[This is part of a small but growing series of defectivity case studies.]

English lexical verbs can have up to 5 distinct forms, and I am aware of just a few English verbs which are defective. (The following are all my personal judgments.)

  1. I can use begone as an imperative, though it has the form of a past participle (cf. gone and forgone). Is BEGO even a verb lexeme anymore?
  2. Fodor (1972), following Lakoff (1970 [1965]), notes that BEWARE has a limited distribution and never bears explicit inflection. For me, it can occur only as a positive imperative (e.g., beware the dog!), with or without emphatic do. I agree with Fodor that it is also bad under negation, but perhaps for unrelated reasons: e.g., *don’t beware… 
  3. FORGO lacks a simple past: forgo, forgoes, and forgoing are fine, as is the past participle forgone, but *forwent is bad as the preterite/simple past, and *forgoed is perhaps a bit worse.
  4. METHINK can only be used in the 3sg. present active indicative form methinks, and doesn’t allow for an explicit subject.
  5. STRIDE lacks a past participle (e.g., Hill 1976: 668, Pinker 1999: 136f., Pullum and Wilson 1977: 770): *stridden is bad, the simple past strode cannot be reused here, and I cannot use the regular *strided (under the relevant sense).

References

Fodor, J. D. 1972. Beware. Linguistic Inquiry 3: 528-534.
Hill, A. A. 1976. [Obituary:] Albert Henry Marckwardt. Language 52: 667-681.
Lakoff, G. 1970. Irregularity in Syntax. Holt, Rinehart and Winston.
Pinker, S. 1999. Words and Rules: The Ingredients of Language. Basic Books.
Pullum, G. K. and Wilson, D. 1977. Autonomous syntax and the analysis of auxiliaries. Language 53: 741-788.

Deriving the major rule/minor rule distinction

If lexemes can be left underspecified for a rule feature, with feature-filling implemented by unification (e.g., Bale et al. 2014), we ought to be able to derive the traditional distinction (e.g., Lakoff 1970) between major rules (those for which non-application is exceptional) and minor rules (those for which application is exceptional). On this view the distinction is purely descriptive: it reflects which unmarked rule feature a later feature-filling operation inserts upon lexical insertion.

Let us suppose we have a rule R, and that every formative is unified with {+R} upon lexical insertion. Then unification will fail only with formatives specified [−R], and these formatives will exhibit exceptional non-application. This describes the parade example of exceptions to a major rule: the failure of trisyllabic shortening in obesity (assuming obese is [−trisyllabic shortening]; see Chomsky & Halle 1968: §4.2.2).

Let us suppose instead that every formative is unified with {−R} upon lexical insertion. Then, unification will fail only with those formatives specified [+R], and these formatives will exhibit exceptional application, assuming they otherwise satisfy the phonological description of rule R. This describes minor rules.
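To make the idea concrete, here is a minimal sketch in Python, under my own toy encoding: a formative’s rule features are a dict, with absent keys underspecified. It is meant only to illustrate how a single feature-filling mechanism yields both rule types.

def fill(lexical: dict, default: dict) -> dict:
    """Feature-filling unification: default values are inserted for
    underspecified features; where a lexical specification conflicts
    with the default, unification fails and the lexical value stands."""
    result = dict(default)
    result.update(lexical)
    return result

def applies(formative: dict) -> bool:
    """Rule R applies iff the formative surfaces as [+R]."""
    return formative.get("R", False)

# Major rule: every formative is unified with {+R} upon insertion.
assert applies(fill({}, {"R": True}))                # unmarked: R applies
assert not applies(fill({"R": False}, {"R": True}))  # [-R]: exceptional non-application

# Minor rule: every formative is unified with {-R} upon insertion.
assert not applies(fill({}, {"R": False}))           # unmarked: R does not apply
assert applies(fill({"R": True}, {"R": False}))      # [+R]: exceptional application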

This (admittedly quite sketchy at present) idea seems to address Zonneveld’s (1978: 160f.) concern that Lakoff and contemporaries did not posit any way to encode whether a rule was major or minor, except “transderivationally” via inspection of successful derivations. It also places the major/minor distinction—correctly, I think—within the scope of a theory of productivity. More on this later.

References

Bale, A., Papillon, M., and Reiss, C. 2014. Targeting underspecified segments: a formal analysis of feature-changing and feature-filling rules. Lingua 148: 240-253.
Chomsky, N. and Halle, M. 1968. The Sound Pattern of English. Harper & Row.
Lakoff, G. 1970. Irregularity in Syntax. Holt, Rinehart and Winston.
Zonneveld, W. 1978. A Formal Theory of Exceptions in Generative Phonology. Peter de Ridder.

Linguistics’ contribution to speech & language processing

How does linguistics contribute to speech & language processing? While there exist some “linguist eliminationists” who wish to process speech audio or text “from scratch” without intermediate linguistic representations, it is generally recognized that linguistic representations are the end goal of many processing “tasks”. Of course some tasks involve poorly defined, or ill-posed, end-state representations—the detection of hate speech and of named entities, neither of which is particularly well-defined, linguistically or otherwise, comes to mind—but these are driven by apparent business value to be extracted rather than by serious goals of understanding speech or text.

The standard example for this kind of argument is syntax. It might be the case that syntactic representations are not as useful for textual understanding as was anticipated, and that useful features for downstream machine learning can apparently be induced using far simpler approaches, like the masked language modeling task used for pre-training in many neural models. But it’s not as if a terrorist cell of rogue linguists locked NLP researchers in their offices until they developed the field of natural language parsing. NLP researchers decided, of their own volition, to spend the last thirty years building models which could recover natural language syntax, and ultimately got pretty good at it, to the point where, I suspect, the remaining unresolved ambiguities mostly hinge on world knowledge that is rarely if ever made explicit.

Let us consider another example, less widely discussed: the phoneme. The phoneme was discovered in the late 19th century by Baudouin de Courtenay and Kruszewski. It has been around a very long time. In the century and a half since it emerged from the Polish academy, Poland itself has been a congress kingdom, a military dictatorship, and a republic (three times), and has been annexed by the Russian empire, the German Reich, and the Soviet Union. The phoneme is probably here to stay. It is, by any reasonable account, one of the most successful abstractions in the history of science.

It is no surprise, then, that the phoneme plays a major role in speech technologies. Not only did the first speech recognizers and synthesizers make explicit use of phonemic representations (as well as notions like allophones), so did the next five decades’ worth of recognizers and synthesizers. Conventional recognizers and synthesizers require large pronunciation lexicons mapping between orthographic and phonemic form, and as they get closer to speech, they convert these “context-independent” representations of phonemic sequences into “context-dependent” representations which can account for allophony and local coarticulation, exactly as any linguist would expect.

It is only in the last few years that it has even become possible to build a reasonably effective recognizer or synthesizer which doesn’t have an explicit phonemic level of representation. Such models instead use clever tricks and enormous amounts of data to induce implicit phonemic representations. We have every reason to suspect these implicit representations are quite similar to the explicit ones linguists would posit. For one, these implicit representations are keyed to orthographic characters, and as I wrote a month ago, “the linguistic analysis underlying a writing system may be quite naïve but may also encode sophisticated phonemic and/or morphemic insights.” If anything, that’s too weak: most writing systems I’m aware of encode either a precise phonemic analysis (possibly omitting a few details of low functional load, or using digraphs to get around limitations of the alphabet of choice) or a precise morphophonemic analysis (ditto). For Sapir (1925 et seq.), this was key evidence for the existence of phonemes! So whether or not implicit “phonemes” are better than explicit ones, speech technologists have converged on the same rational, mentalistic notions discovered by Polish linguists a century and a half ago.

So it is surprising to me that even those schooled in the art of speech processing view the contribution of linguistics to the field in a somewhat negative light. For instance, Paul Taylor, the founder of the TTS firm Phonetic Arts, published a Cambridge University Press textbook on TTS methods in 2009, and while it’s by now quite out of date, there’s no more-recent work of comparable breadth. Taylor spends the first five hundred (!) pages or so talking about linguistic phenomena like phonemes, allophones, prosodic phrases, and pitch accents—at the time, the state of the art in synthesis made use of explicit phonological representations—so it is genuinely a shock to me that Taylor chose to close the book with a chapter (Taylor 2009: ch. 18) about the irrelevance of linguistics. Here are a few choice quotes, with my commentary.

It is widely acknowledged that researchers in the field of speech technology and linguistics do not in general work together. (p. 533)

It may be “acknowledged”, but I don’t think it has ever been true. The number of linguists and linguistically-trained engineers working on FAANG speech products every day is huge. (Modern corporate “AI” is to a great degree just other people, mostly contractors in the Global South.) Taylor continues:

The first stated reason for this gap is the “aeroplanes don’t flap their wings” argument. The implication of this statement is that, even if we had a complete knowledge of how human language worked, it would not help us greatly because we are trying to develop these processes in machines, which have a fundamentally different architecture. (p. 533)

I do not expect that linguistics will provide deep insights about how to build TTS systems, but it clearly identified the relevant representational units for building such systems many decades ahead of time, just as mechanics provided the basis for mechanical engineering. This was true of Kempelen’s speaking machine (which predates phonemic theory, and so had to discover something like it) and Dudley’s Voder, as well as speech synthesizers in the digital age. So I guess I kind of think that speech synthesizers do flap their wings: parametric, unit selection, hybrid, and neural synthesizers are all big fat phoneme-realization machines. As is standard practice in the physical sciences, the simple elementary particles of phonological theory—phonemes, and perhaps features—were discovered quite early on, but the study of their ontology has taken up the intervening decades. And unlike the physical sciences, we cognitive scientists must some day also understand their epistemology (what Chomsky calls “Plato’s problem”) and ultimately their evolutionary history (“Darwin’s problem”) too. Taylor, as an engineer, need not worry himself about these further studies, but I think he is being wildly uncharitable about the nature of what he’s studying, and about the business value of having a well-defined hypothesis space of representations for his team to engineer within.

Taylor’s argument wouldn’t be complete without a caricature of the generative enterprise:

The most-famous camp of all is the Chomskian [sic] camp, started of course by Noam Chomsky, which advocates a very particular approach. Here data are not used in any explicit sense, quantitative experiments are not performed and little stress is put on explicit description of the theories advocated. (p. 534)

This is nonsense. Linguistic examples are data, in some cases better data than results from corpora or behavioral studies, as the work of Sprouse and colleagues has shown. No era of generativism was actively hostile to behavioral results; as early as the ’60s, generativist-aligned psycholinguists were experimentally testing the derivational theory of complexity and studying morphological decomposition in the lab. And I simply have never found that generativist theorizing lacks for formal explicitness; in phonology, for instance, the major alternatives to generativist thinking are exemplar theory—which isn’t even explicit enough to be wrong—and a sort of neo-connectionism—which ought not to work at all given extensive proof-theoretic studies of formal learnability and of the formal properties of stochastic gradient descent and backpropagation. Taylor goes on to suggest that the “curse of dimensionality” and issues of generalizability prevent the application of linguistic theory. Once again, though, the things we’re trying to represent are linguistic notions: machine learning using “features” or “phonemes”, explicit or implicit, is still linguistics.

Taylor concludes with some future predictions about how he hopes TTS research will evolve. His first is that textual analysis techniques from NLP will become increasingly important. Here the future has been kind to him: they are, but as the work of Sproat and colleagues has shown, we remain quite dependent on linguistic expertise—of a rather different and less abstract sort than the notion of the phoneme—to develop these systems.

References

Sapir, E. 1925. Sound patterns in language. Language 1: 37-51.
Taylor, P. 2009. Text-to-Speech Synthesis. Cambridge University Press.

Defectivity in Turkish; part 1: monosyllables

[This is part of a small but growing series of defectivity case studies.]

While there are some languages—like Greek or Russian—in which there are dozens or even hundreds of defective lexemes, in most cases defectivity is markedly constrained, conditioned both by morphological class or status and by lexical identity. This is somewhat in conflict with models which view defectivity as essentially “absolute phonotactic ungrammaticality” (e.g., Orgun & Sprouse 1999; henceforth OS), since the generalizations about which items are or are not defective are not primarily phonotactic. A good demonstration of the morphological-lexical nature of defectivity comes from Turkish.

As first reported (to my knowledge) by Itô & Hankamer (1989; henceforth IH), Turkish has just a small number of monosyllabic stems. For verbs, one forms the “simple” (active) imperative using the bare stem: e.g., ye ‘eat!’. However, one cannot form a passive imperative of monosyllabic verbs. For instance, for EAT, we would expect *yen (with -n being the expected allomorph of the passive imperative), but this is apparently ill-formed under the appropriate interpretation, with no obvious alternative.1 I say it this way because yen exists as the simple imperative ‘conquer!’. As IH note, this shows there is nothing phonotactically wrong with the ill-formed passive imperatives. Another example they give is kon ‘alight! (like a bird)’. We would expect it to have a passive imperative homophonous with the simple imperative, but it is apparently ill-formed under this interpretation. However, I find these two examples less than convincing, since one could imagine that homophony with another type of imperative is implicated in these judgments.

Something similar characterizes certain monosyllabic noun stems. Turkish has apparently borrowed the seven solfège syllables do, re, mi, etc. Of these, six are CV monosyllables, which we would expect to select the /-m/ allomorph of the 1sg. poss. suffix. However, these 1sg. poss. forms are apparently ill-formed; e.g., *do-m ‘my do’, *re-m ‘my re’, *mi-m ‘my mi’, and so on. One can nonetheless use these stems with the other declensional suffixes, which produce polysyllabic outputs; e.g., 1pl. poss. do-muz ‘our do’. The same facts hold for the names of the letters of the (Latin, post-1928) alphabet: e.g., de ‘the letter d’, but *de-m ‘my letter d’, and so on. OS report, however, that the one CVC solfège syllable, sol, has a well-formed 1sg. poss.: it selects the /-Im/ allomorph (where /I/ is a high vocalic archiphoneme subject to stem-controlled vowel harmony), which gives us the licit solüm [solʲym] ‘my sol’.2 The same facts hold of the 2sg. poss. ‘your __’, which for CV monosyllables would be realized as /-n/; e.g., *do-n ‘your do’.

From the above facts IH and OS conclude that there is an exceptionless constraint in Turkish such that monosyllabic derived forms produced by the grammar are ill-formed, with no possible “repair”. However, Selin Alkan (p.c.) draws my attention to at least one CV nominal stem which is well-formed in the 1sg. and 2sg. poss.: su ‘water’. For this stem, a [j] glide is inserted between the stem and the suffix, and the stem selects the -VC allomorphs of the possessive; e.g., su-y-um ‘my water’, su-y-un ‘your water’. This is surprising insofar as OS take pains (p. 195f.) to specifically rule out repair by epenthesis in 1sg. poss. forms!
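As a way of keeping the data straight, here is a toy sketch in Python of the 1sg. poss. facts just described. The stem list, the harmony table, and the treatment of sol (whose palatalized [lʲ] triggers front harmony) are my own simplifications, and su’s escape hatch is simply stipulated, since that is exactly the point: it is a lexical fact.

# Simplified backness/rounding harmony for the high archiphoneme /I/.
HARMONY = {"a": "ı", "ı": "ı", "o": "u", "u": "u",
           "e": "i", "i": "i", "ö": "ü", "ü": "ü"}
VOWELS = set(HARMONY)
FRONT_EXCEPTIONS = {"sol"}  # palatalized [lʲ] takes front harmony

def first_sg_poss(stem: str) -> str | None:
    """Returns the 1sg. poss. form of a noun stem, or None if defective."""
    if stem == "su":                          # lexical escape hatch: su-y-um
        return stem + "yum"
    if stem[-1] in VOWELS:                    # V-final stems select /-m/ ...
        if len([c for c in stem if c in VOWELS]) == 1:
            return None                       # ... but CV monosyllables gap
        return stem + "m"
    if stem in FRONT_EXCEPTIONS:
        return stem + "üm"
    last_vowel = [c for c in stem if c in VOWELS][-1]
    return stem + HARMONY[last_vowel] + "m"   # C-final stems select /-Im/

assert first_sg_poss("do") is None            # *do-m ‘my do’
assert first_sg_poss("sol") == "solüm"        # ‘my sol’
assert first_sg_poss("su") == "suyum"         # ‘my water’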

It would be nice to conclude that the only affected lexemes are transparent borrowings, but this does not seem to accord with the evidence from monosyllabic verbs. Still, the evidence from native stems is really quite weak, and the generalizations are clearly morphological (i.e., the restriction of the constraint to derived environments) and lexical (i.e., the fact that su has an “escape hatch”), something that has largely been ignored in previous attempts to describe defectivity in Turkish.

To move forward on this topic, it would be nice to know the following. How many verbs, if any, behave like ye or kon, and are any unexpectedly well-formed in the passive imperative? Are there other forms in the verbal paradigm that show “monosyllabism” gaps? Similarly, how many defective nouns (if any) are there beyond those already mentioned, and how many behave like su?

[h/t: Selin Alkan]

Endnotes

  1. IH note (p. 61, fn. 1) that passive imperatives “are somewhat odd in normal circumstances”. Therefore, they asked their informants to imagine they were directors giving instructions to actors, which apparently helped render these examples more felicitous.
  2. It seems plausible that -m and -n here are purely-phonological allomorphs of /-Im, -In/ respectively, but I am not sure.

References

Itô, J. and Hankamer, J. 1989. Notes on monosyllabism in Turkish. In J. Itô and J. Runner (ed.), Phonology at Santa Cruz, pages 61-69. Linguistics Research Center, University of California, Santa Cruz.
Orgun, C. O. and Sprouse, R. L. 1999. From MPARSE to CONTROL: deriving ungrammaticality. Phonology 16: 191-224.