More than one rule

[Leaving this as a note to myself to circle back.]

I’m just going to say it: some “rules” are probably two or three rules, because the idea that rules are defined by natural classes (and thus free of disjunctions) is more entrenched than our intuitions about whether a process in some language is really one rule, and we should be Galilean about this. Here are some phonological “rules” that are probably two or three different rules.

  • Indo-Iranian, Balto-Slavic families, and Albanian “ruki” (environment: preceding {w, j, k, r}): it is not clear to me if any of these languages actually need this as a synchronic rule at all.
  • Breton voiced stop lenition (change: /b/ to [v], /d/ to [z], /g/ to [x]): the devoicing of /g/ must be a separate rule. Hat tip: Richard Sproat. I believe there’s a parallel set of processes in German.
  • Lamba palatalization (change: /k/ to [tʃ], /s/ to [ʃ]): two rules, possibly with a Duke-of-York thing. Hat tip: Charles Reiss.
  • Mid-Atlantic (e.g., Philadelphia) English ae-tensing (environment: following tautosyllabic, same-stem {m, n, f, θ, s, ʃ}): let’s assume this is allophony; then the anterior nasal and voiceless fricative cases should be separate rules. It is possible the incipient restructuring of this as having a simple [+nasal] context provides evidence for the multi-rule analysis.
  • Latin glide formation (environment: complex). Front and back glides are formed from high short monophthongs in different but partially overlapping contexts.

Industry postdocs

I find the very idea of industry postdocs funny (funny-sad, though). Sure, it makes sense for the academy, with all of its scarcities, to make use of precarious, casualized post-graduate labor, but to extend this to the tech sector is vaguely monstrous. It’s extra funny (but funny-sad too) when you hear of a senior professor doing an industry postdoc at a company with a name like baz.ly during their sabbatical.

Neurolinguistic deprogramming

I venture to say most working linguists would reject outright strong versions of linguistic relativity and the Sapir-Whorf hypothesis, and would regard neuro-linguistic programming as pseudoscientific rubbish. This is of course in contrast to the general public: even the highly educated take linguistic relativity as an obvious description of human life. Yet it is not uncommon for the same linguists to endorse beliefs in the power of renaming that are hard to reconcile with the general disrepute of the vulgar Whorfian view that the power of renaming assumes.

For instance, George Lakoff’s work on “framing” in politics argued that renaming social programs was the one weird trick needed to get Howard Dean into the White House. While this seems quaint in retrospect, his proposal was widely debated at the time. Pinker’s (sigh) takedown is necessary reading. The problem, of course, is that Lakoff ought to have provided, and ought to have been expected to provide, any evidence at all for a view of language widely regarded as untutored by his colleagues.

The case of renaming languages is a grayer one. I believe that one ought to call people what they want to be called, and that if stakeholders would prefer their language to be referred to as Tohono Oʼodham rather than Pápago, I am and will remain happy to oblige.1 If African American Vernacular English is renamed to African American Language (as seems to be increasingly common in scholarship), I will gladly follow suit. But I can’t imagine how it could be the case that the renaming represents either a reconceptualization of the language itself or a change in how we study it. Indeed, it would be strange for the name of any language to reflect any interesting property of said language. French by any other name would still have V-to-T movement and liaison.

It may be that these acts of renaming have power. Indeed, I hope they do. But I have to suspect the opposite: they’re the sort of fiddling one does when one is out of power, when one is struggling to believe that a better world is possible. And if I’m wrong, who is better suited to show that than the trained linguist?

Endnotes

  1. Supposedly, the older name of the language comes from a pejorative used by a neighboring tribe, the Pima. Ba꞉bawĭkoʼa means, roughly, ‘tepary bean eater’. The Spanish colonizers adapted this as Pápago. I feel like the gloss sounds like a cutting insult in English too, so I get why this exonym has fallen into disrepute.

Filtering text at scale

[This post describes work in collaboration with Emily Charde.]

It is now commonplace for NLP applications to consume massive amounts of web text of unknown provenance. Applications which stand to benefit from this firehose of data, but at the same time don’t need it all, may require more attention paid to data quality in the form of high-precision methods to filter out redundancies and junk.

Gorman et al. (2021) follow standard practices for obtaining a “clean” subsample of web data: they filter sentence strings based on the presence of capitalization and sentential punctuation, length, and predictability as measured by a character language model. In an ongoing project on defectivity, we sought to do something similar at a much larger scale. This project was undertaken by myself in collaboration with Emily Charde, a graduate of our master’s program who worked as an RA on the project.

Our data for this project is drawn from CC-100, a recreation of the earlier CCNet corpus (Wenzek et al. 2020). CC-100 consists of strings from 2018 Common Crawl snapshots, already filtered somewhat and grouped by language using language ID tools. At rest, the CC-100 data is stored in enormous LZMA-compressed files, one per language/locale/script. The largest, English (naturally), occupies 82 GB despite this aggressive compression scheme.

We proceed as follows.

We first shard the data for each language into roughly 1 GB chunks, preserving the LZMA compression.
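As a rough illustration of this step (not the code we actually ran), a sharder can stream the compressed file line by line and start a new compressed shard whenever a size budget is used up. The function name `shard_xz`, the shard naming scheme, and the choice to count uncompressed bytes are all assumptions made for the sketch.

```python
import lzma


def shard_xz(source_path, shard_prefix, max_bytes=1 << 30):
    """Splits an LZMA-compressed text file into LZMA-compressed shards.

    Shards only break at line boundaries. max_bytes counts *uncompressed*
    bytes per shard, which is an assumption of this sketch.
    """
    shard_paths = []
    sink = None
    written = 0
    with lzma.open(source_path, "rb") as source:
        for line in source:
            # Open a new shard at the start, or once the budget is used up.
            if sink is None or written >= max_bytes:
                if sink is not None:
                    sink.close()
                path = f"{shard_prefix}.{len(shard_paths):05d}.xz"
                shard_paths.append(path)
                sink = lzma.open(path, "wb")
                written = 0
            sink.write(line)
            written += len(line)
    if sink is not None:
        sink.close()
    return shard_paths
```

Because the input is streamed, memory use stays flat no matter how large the source file is.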

We then perform sentence and word tokenization in parallel using mudpipe.py, a Python wrapper around the C++ command-line tool UDPipe 1 which automatically decompresses the LZMA files, invokes UDPipe, and recompresses the output CoNLL-U-formatted data, preserving disk space; since this is mostly IO-bound, mudpipe.py does this in parallel across the various shards (the “m” in “mudpipe” stands for “multiprocessing”). This script was originally developed by Yulia Spektor, another graduate student, for her 2020 master’s thesis (Spektor 2020). Applying mudpipe.py to English, Greek, and Russian (our three target languages) took a few weeks of compute time on a single desktop that otherwise would have sat idle. The resulting shards of compressed CoNLL-U sentences are somewhat larger, roughly 2 GB, presumably because of the additional markup.
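The decompress, process, and recompress loop can be sketched as follows. This is not mudpipe.py itself: it uses a thread pool rather than multiprocessing (the heavy lifting happens in the child process anyway), it reads each shard fully into memory, and `command` is a placeholder for whatever UDPipe invocation one actually uses.

```python
import lzma
import subprocess
from concurrent.futures import ThreadPoolExecutor


def process_shard(src, dst, command):
    """Decompresses src, pipes it through command, and recompresses to dst."""
    with lzma.open(src, "rb") as source:
        raw = source.read()  # Whole shard in memory; fine for a sketch.
    result = subprocess.run(
        command, input=raw, stdout=subprocess.PIPE, check=True
    )
    with lzma.open(dst, "wb") as sink:
        sink.write(result.stdout)
    return dst


def process_shards(pairs, command, workers=4):
    """Runs process_shard over (src, dst) pairs concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(process_shard, src, dst, command) for src, dst in pairs
        ]
        # Collecting results re-raises any exception from a worker.
        return [future.result() for future in futures]
```

Any command that reads text on standard input and writes to standard output can be slotted in, which makes the wrapper easy to try out on a toy filter before pointing it at the real tokenizer.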

We now turn to filtering in earnest. Whereas Gorman et al. were working with tens of millions of sentences of English, the CC-100 language samples contain many billions of sentences, so filtering based on percentiles, like those used by Gorman et al., must be performed out-of-core. We thus chose SQLite as our data store for this project, and envisioned that SQL would be a natural way to express filters.

Filtering was ultimately performed by a single Python script using the sqlite3 module from the standard library. This script runs through the tokenized shards produced by mudpipe.py, and ultimately produces a single LZMA-compressed, CoNLL-U format file for each language. Working incrementally, each shard is decompressed and the CoNLL-U format is parsed line by line. Once a sentence is obtained, we apply ordinary re regular expression filters. These expressions require each sentence to start with an uppercase letter of the appropriate script, to continue with more letters, space, or punctuation of the appropriate script, and finally to end with sentential punctuation (e.g., /[.!?]/). For instance, a Russian or Greek sentence that contains Latin characters is discarded. If quotation marks are present, they are required to “balance”. Sentences that fail one or more of these constraints are simply removed from further consideration. Additional data is extracted from the sentences that remain:

  • length in characters
  • length in tokens
  • bits per character (BPC) entropy according to an OpenGrm-NGram (Roark et al. 2012) 6-gram character language model
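To make the filters and statistics above concrete, here is a minimal sketch. The exact character classes, the quote-balancing check, and all names are assumptions. A toy add-one-smoothed character unigram model stands in for the OpenGrm-NGram 6-gram model, which is not reproduced here.

```python
import math
import re
from collections import Counter

# Permitted sentence-internal space and punctuation (an assumed inventory).
PUNCT = re.escape(" ,;:'\"()«»“”-")
PATTERNS = {
    "en": re.compile(f"[A-Z][A-Za-z{PUNCT}]*[.!?]"),
    "ru": re.compile(f"[А-ЯЁ][А-Яа-яЁё{PUNCT}]*[.!?]"),
}


def is_complete(sentence, lang):
    """Checks the script/shape regex and that quotation marks balance."""
    if not PATTERNS[lang].fullmatch(sentence):
        return False
    if sentence.count('"') % 2:
        return False
    return (
        sentence.count("“") == sentence.count("”")
        and sentence.count("«") == sentence.count("»")
    )


def train_char_unigram(corpus):
    """Returns a smoothed character probability function (toy stand-in LM)."""
    counts = Counter("".join(corpus))
    total = sum(counts.values())
    vocab = len(counts) + 1  # One extra slot for unseen characters.
    return lambda ch: (counts[ch] + 1) / (total + vocab)


def bits_per_char(sentence, prob):
    """Average negative log2 probability per character."""
    return -sum(math.log2(prob(ch)) for ch in sentence) / len(sentence)


def statistics(sentence, tokens, prob):
    """Returns (length in characters, length in tokens, BPC)."""
    return len(sentence), len(tokens), bits_per_char(sentence, prob)
```

Under these patterns, `is_complete("Кошка любит Google.", "ru")` is false because of the Latin characters, while the same check accepts an all-Cyrillic sentence of the same shape.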

The sentence and these three statistics are then stored in the SQLite database; we also use gzip compression, with the shortest possible compression window and no headers, to save temporary disk space. Accumulating this portion of the table takes quite some time, but it can be performed in parallel across shards or languages. We perform batches of 1m updates at a time. We experimented—well, Emily did, I watched—with various database PRAGMAs to improve performance, but none of these were clearly performance-positive.
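A minimal sketch of the storage step, under several assumptions: the table name `sentences` is hypothetical, the column names are borrowed from the query shown later in this post, and I am guessing that it is the sentence text that gets compressed. In Python's zlib, a negative `wbits` requests a raw DEFLATE stream with no header or trailer, and 9 is the smallest window size.

```python
import sqlite3
import zlib

SCHEMA = """
CREATE TABLE IF NOT EXISTS sentences (
    tokenlist BLOB NOT NULL,
    char_len INTEGER NOT NULL,
    word_len INTEGER NOT NULL,
    bpc REAL NOT NULL
)
"""
INSERT = (
    "INSERT INTO sentences (tokenlist, char_len, word_len, bpc) "
    "VALUES (?, ?, ?, ?)"
)


def pack(text):
    """Compresses text as raw DEFLATE with the smallest (512-byte) window."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -9)
    return compressor.compress(text.encode("utf-8")) + compressor.flush()


def unpack(blob):
    """Inverts pack; the maximum window can decode any smaller window."""
    return zlib.decompress(blob, wbits=-zlib.MAX_WBITS).decode("utf-8")


def store(conn, rows, batch_size=1_000_000):
    """Inserts (text, char_len, word_len, bpc) rows in committed batches."""
    conn.execute(SCHEMA)
    batch = []
    for text, char_len, word_len, bpc in rows:
        batch.append((pack(text), char_len, word_len, bpc))
        if len(batch) >= batch_size:
            conn.executemany(INSERT, batch)
            conn.commit()
            batch.clear()
    if batch:  # Flushes the final partial batch.
        conn.executemany(INSERT, batch)
        conn.commit()
```

For short strings like single sentences, header bytes are a noticeable fraction of the payload, which is presumably why dropping them is worth the trouble.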

Our next step is to actually filter the data. In an inner subquery, we compute quartiles for character length, token length, and BPC. Then in an outer subquery, we return the row IDs of every sentence which is in Q2 or Q3 (the middle two quartiles) for all three measures. That is, if a sentence has median BPC but is in the 80th percentile for character length, we remove it. This is highly conservative, but we have more than enough data, and we anticipate that at least character length and token length are highly correlated in any language. In the outermost query, we SELECT the sentences whose row IDs are returned by the outer subquery. This query is a work of art.

-- TABLE is a reserved word in SQL, so a concrete table name is used here
-- (ru_cc100 is the Russian table).
SELECT tokenlist FROM ru_cc100 WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
        NTILE(4) OVER (ORDER BY char_len) AS char_q,
        NTILE(4) OVER (ORDER BY word_len) AS word_q,
        NTILE(4) OVER (ORDER BY bpc) AS bpc_q
        FROM ru_cc100
    )
    WHERE (char_q BETWEEN 2 AND 3)
    AND (word_q BETWEEN 2 AND 3)
    AND (bpc_q BETWEEN 2 AND 3)
);
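To see how the quartile logic behaves, here is a toy run of essentially the same query from Python. The table name `sentences` and the data are made up, the row ID is aliased to `rid` to keep name resolution in the subquery unambiguous, and the sketch assumes the bundled SQLite is recent enough (3.25 or later) to support window functions.

```python
import sqlite3

QUERY = """
SELECT tokenlist FROM sentences WHERE rowid IN (
    SELECT rid FROM (
        SELECT rowid AS rid,
        NTILE(4) OVER (ORDER BY char_len) AS char_q,
        NTILE(4) OVER (ORDER BY word_len) AS word_q,
        NTILE(4) OVER (ORDER BY bpc) AS bpc_q
        FROM sentences
    )
    WHERE (char_q BETWEEN 2 AND 3)
    AND (word_q BETWEEN 2 AND 3)
    AND (bpc_q BETWEEN 2 AND 3)
);
"""


def middle_quartile_sentences(conn):
    """Returns sentences in Q2 or Q3 on all three measures."""
    return [row[0] for row in conn.execute(QUERY)]


def toy_connection():
    """Builds an in-memory table of eight made-up sentences."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sentences "
        "(tokenlist TEXT, char_len INTEGER, word_len INTEGER, bpc REAL)"
    )
    # Lengths grow with i; s3 has typical length but anomalously high BPC.
    bpcs = [2.0, 2.1, 9.0, 2.3, 2.4, 2.5, 2.6, 2.7]
    rows = [
        (f"s{i}", 10 * i, 2 * i, bpc) for i, bpc in enumerate(bpcs, start=1)
    ]
    conn.executemany("INSERT INTO sentences VALUES (?, ?, ?, ?)", rows)
    return conn
```

In the toy table, s3 sits in the middle quartiles for both length measures but in the top quartile for BPC, so it is dropped; a sentence has to be unremarkable on all three measures to survive.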

We then reserialize and recompress the remaining sentences into a new LZMA-compressed file. Here are some logging statements that give a sense of the scale (this is from Russian):

WARNING 2023-01-06 20:39:41,896: 1,576,171,212 input sentences processed
WARNING 2023-01-06 20:39:41,896: 362 sentences missing text
WARNING 2023-01-06 20:39:41,896: 539,046,034 sentences incomplete
WARNING 2023-01-06 20:39:41,896: 772,566 sentences fail LM composition
WARNING 2023-01-06 21:16:35,406: 1,036,352,250 sentences after primary filtration
WARNING 2023-01-08 09:14:13,110: 232,404,041 sentences after secondary filtration
INFO 2023-01-08 09:14:13,117: Writing to ../conllu/cc100/ru.cc100.filtered.conllu.xz...
INFO 2023-01-09 03:22:08,252: Dropping ru_cc100 table
INFO 2023-01-09 10:42:07,085: Filtering complete

To summarize: there were about 1.6b input sentences after mudpipe.py; of these, 362 (inexplicably, but it happens) had no text at all. Roughly a half billion of these are “incomplete”, meaning they failed the regular expression constraints. A bit less than one million “fail LM composition”; this usually indicates they contain odd, language-inappropriate characters, which were never seen in the (held-out) materials used to train the character LMs. This leaves us with just over one billion sentences for “secondary filtration”. Of these, 232m fall in the two middle quartiles for the length and entropy measures and are retained. As you can see, secondary filtration took an otherwise-idle desktop about 36 hours, with reserialization and recompression taking about 18 hours, and DB cleanup (not strictly necessary, but sort of like “be kind, rewind”) adding another 7 hours at the end. Not bad, though certainly this could be made to run much faster (possibly with a different database engine designed for parallel writes).
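The arithmetic behind this summary can be checked directly from the log lines. The snippet below only recomputes figures already given above; the variable names are mine. As a point of reference, if the three measures were statistically independent one would expect about 0.5 ** 3 = 12.5% of sentences to survive secondary filtration, and if they were perfectly correlated, 50%.

```python
from datetime import datetime

# Counts copied from the Russian log above.
total = 1_576_171_212
missing_text = 362
incomplete = 539_046_034
fail_lm = 772_566
after_secondary = 232_404_041

# What remains after the regular expression and LM composition checks.
after_primary = total - missing_text - incomplete - fail_lm

# Share of primary-filtered sentences that survive the quartile filter.
retention = after_secondary / after_primary


def hours_between(start, end):
    """Elapsed hours between two log timestamps (milliseconds dropped)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600


secondary_hours = hours_between("2023-01-06 21:16:35", "2023-01-08 09:14:13")
writing_hours = hours_between("2023-01-08 09:14:13", "2023-01-09 03:22:08")
cleanup_hours = hours_between("2023-01-09 03:22:08", "2023-01-09 10:42:07")

print(f"{after_primary:,} sentences after primary filtration")
print(f"{retention:.1%} retained by secondary filtration")
print(f"{secondary_hours:.1f}h, {writing_hours:.1f}h, {cleanup_hours:.1f}h")
```

The retention rate comes out between the two reference points, which fits the expectation that the length measures are strongly (but not perfectly) correlated with each other.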

In practice, we find that this produces data that is highly diverse but extremely clean. Should even more data ever be desired, one could easily imagine relaxing the quartile constraints a bit.

[Late-breaking addition: I should probably explain why we want median entropy text. If you sort the sentences of a large corpus by bits per character, you will see that the lowest-entropy sentences tend to be boilerplate and the highest-entropy sentences tend to be rubbish. So the middle is “just right” here.]

Acknowledgments

Support for this project was provided by a PSC-CUNY award, jointly funded by the Professional Staff Congress and the City University of New York.

References

Gorman, K., Kirov, C., Roark, B., and Sproat, R. 2021. Structured abbreviation expansion in context. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 995-1005.
Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., and Grave, E. 2020. CCNet: extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003-4012.
Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., and Tai, T. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61-66.
Spektor, Y. 2021. Detection and morphological analysis of novel Russian loanwords. Master’s thesis, Graduate Center, City University of New York.

Feature maximization and phonotactics

[This is a quick writing exercise for in-progress work with Charles Reiss. Sorry if it doesn’t make sense out of context.]

An anonymous reviewer asks:

I wonder how the author(s) would reconcile this learning model with the evidence that both children and adults seem to aggressively generalize phonotactic restrictions from limited data (e.g. just [p]) to larger, unobserved natural classes (e.g. [p f b v]). See e.g. the discussion in Linzen & Gallagher (2017). If those results are credible, they seem much more consistent with learning minimal feature specifications for natural classes than learning maximal ones.

First, note that Linzen & Gallagher’s study is a study of phonotactic learning, whereas our proposal concerns induction of phonological rules. We have been, independently but complementarily, quite critical of the naïve assumptions inherent in prior work on this topic (e.g., Gorman 2013, ch. 2; Reiss 2017, §6); we have both argued that knowledge of phonotactic generalizations may require much less grammatical knowledge than is generally believed.

Secondly, we note that Linzen & Gallagher’s subjects are (presumably; they were recruited on Mechanical Turk and were paid $0.65 USD for their efforts) adults briefly exposed to an artificial language. While we recognize that adult “artificial language learning” studies are common practice in psycholinguistics, it is not clear what such studies contribute to our understanding of phonotactic acquisition (whatever the phonotactic acquirenda turn out to be) by children robustly exposed to realistic languages in situ.

Third, the reviewer is incorrect; the result reported by Linzen & Gallagher (henceforth L&G) is not consistent with minimal generalization. Let us grant—for sake of argument—that our proposal about rule induction in children is relevant to their work on rapid phonotactic learning in adults. One hypothesis they entertain is that their participants will construct “minimal classes”:

For example, when acquiring the phonotactics of English, learners may first learn that both [b] and [g] are valid onsets for English syllables before they can generalize to other voiced stops (e.g., [d]). This generalization will be restricted to the minimal class that contained the attested onsets (i.e., voiced stops), at least until a voiceless stop onset is encountered.

If by a “minimal class” L&G are referring to a natural class which is consistent with the data and has an extension with the fewest members, then presumably they would endorse our proposal of feature maximization, since the class that satisfies this definition is the most fully specified empirically adequate class. However, it is an open question whether or not such a class would actually contain [d]. For instance, if one assumes that major place features are bivalent, then the intersection of the features associated with [b, g] will contain the specification [−coronal], which rules out [d].
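The [b, g] example can be worked through mechanically. The sketch below assumes a toy inventory and a toy bivalent feature system (both invented for illustration, not a serious feature theory), computes the most fully specified feature bundle shared by the observed segments, and then computes that bundle's extension.

```python
FEATURES = ("voice", "continuant", "labial", "coronal", "dorsal")

# Toy inventory: one +/- value per feature, in the order given in FEATURES.
_TABLE = {
    "p": "--+--",
    "b": "+-+--",
    "t": "---+-",
    "d": "+--+-",
    "k": "----+",
    "g": "+---+",
    "f": "-++--",
    "v": "+++--",
    "s": "-+-+-",
    "z": "++-+-",
}
SEGMENTS = {
    segment: {feature: value == "+" for feature, value in zip(FEATURES, values)}
    for segment, values in _TABLE.items()
}


def maximal_specification(observed):
    """Returns every feature specification shared by all observed segments."""
    bundles = [SEGMENTS[segment] for segment in observed]
    first = bundles[0]
    return {
        feature: value
        for feature, value in first.items()
        if all(bundle[feature] == value for bundle in bundles)
    }


def extension(specification):
    """Returns the segments consistent with a feature specification."""
    return sorted(
        segment
        for segment, bundle in SEGMENTS.items()
        if all(bundle[feature] == value
               for feature, value in specification.items())
    )
```

With these toy features, `maximal_specification(["b", "g"])` includes [−coronal], so its extension contains [b] and [g] but not [d]. Whether this holds in general depends entirely on the feature system assumed.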

Interestingly, the matter is similarly unclear if we interpret “minimal class” intensionally, in terms of the number of features, rather than in terms of the number of phonemes the class picks out. The (featurewise-)minimal specification for a single phone (as in the reviewer’s example) is the empty set, which would (it is generally assumed) pick out any segment. Then, we would expect any generalization which held of [p], as in the reviewer’s example, to generalize not just to other labial obstruents (as the reviewer suggests), but to any segment at all. Minimal feature specification cannot yield a generalization from [p] to any proper subset of segments, contra the anonymous reviewer and L&G. An adequate minimal specification which picks out [p] will pick out just [p]. L&G suggest that maximum entropy models of phonotactic knowledge may have this property, but do not provide a demonstration of this for any particular implementation of these models.

We thank the anonymous reviewer for drawing our attention to this study and the opportunity their comment has given us to clarify the scope of our proposal and to draw attention to a defect in L&G’s argumentation.

References

Gorman, K. 2013. Generative phonotactics. Doctoral dissertation, University of Pennsylvania.
Linzen, T., and Gallagher, G. 2017. Rapid generalization in phonotactic learning. Laboratory Phonology: Journal of the Association for Laboratory Phonology 8(1): 1-32.
Reiss, C. 2017. Substance free phonology. In S.J. Hannahs and A. Bosch (eds.), The Routledge Handbook of Phonological Theory, pages 425-452. Routledge.

Journal websites

It is now 2023, and virtually every journal I review for has a broken website, which further penalizes me for volunteer work I ought to be paid for. This is really unacceptable. Maybe some of the big publishers can take a tiny bite out of their massive revenues (Springer Nature apparently pulled down 1.72b USD in revenue in 2021) and invest it into actually testing their CRUD apps.

Large LMs and disinformation

I have never understood the idea that large LMs are uniquely positioned to enable the propagation of disinformation. Let us stipulate, for sake of argument, that large LMs can generate high-quality disinformation and that its artificial quality (i.e., not generated by human writers) cannot be reliably detected either by human readers or by computational means. At the same time, I know of no reason to suppose that large LMs can generate better (less detectable, more plausible) disinformation than can human writers. Then, it is hard to see what advantage there is to using large LMs for disinformation generation beyond a possible economic benefit realized by firing PR writers and replacing them with “prompt engineers”. Ignoring the dubious economics (copywriters are cheap, engineers are expensive), there is a presupposition that disinformation needs to scale, i.e., be generated in bulk, but I see no reason to suppose this either. Disinformation, it seems to me, comes to us either in the form of “big lies” from sources deemed reputable by journalists and lay audiences (think WMDs), or increasingly, from the crowds (think Qanon).

e- and i-France

It will probably not surprise the reader to see me claim that France and French are both sociopolitical abstractions. France is, like all states, an abstraction, and it is hard to point to physical manifestations of France the state. But we understand that states are a bundle of related institutions with (mostly) shared goals. These institutions give rise to our impression of the Fifth Republic, though at other times in history conflict between these institutions gave rise to revolution. But currently the defining institutions share a sufficient alignment that we can usefully talk as if they are one. This is not so different from the i-language perspective on languages. Each individual “French” speaker has a grammar projected by their brain, and these are (generally speaking) sufficiently similar that we can maintain the fiction that they are the same. The only difference I see is that linguists can give a rather explicit account of any given instance of i-French whereas it’s difficult to describe political institutions in similarly detailed terms (though this may just reflect my own ignorance about modern political science). In some sense, this explicitness at the i-language level makes e-French seem even more artificial than e-France.

Character-based speech technology

Right now everyone seems to be moving to character-based speech recognizers and synthesizers. A character-based speech recognizer is an ASR system in which there is no explicit representation of phones, just Unicode codepoints on the output side. Similarly, a character-based synthesizer is a TTS engine without an explicit mapping onto pronunciations, just orthographic inputs. It is generally assumed that the model ought to learn this sort of thing implicitly (and only as needed).

I genuinely don’t understand why this is supposed to be better. Phonemic transcription really does carry more information than orthography, in the vast majority of languages, and making it an explicit target is going to do a better job of guiding the model than hoping the system automatically self-organizes. Neural nets trained for language tasks often have an implicit representation of some linguistically well-defined feature, but they often do better when that feature is made explicit.

My understanding is that end-to-end systems have potential advantages over feed-forward systems when information and uncertainty from previous steps can be carried through to help later steps in the pipeline. But that doesn’t seem applicable here. Building these explicit mappings from words to pronunciations and vice versa is not all that hard, and the information used to resolve ambiguity is not particularly local. Cherry-picked examples aside, it is not at all clear that these models can handle locally conditioned pronunciation variants (the article a pronounced uh or aye), homographs (the two pronunciations of bass in English), or highly deficient writing systems (think Perso-Arabic) better than the ordinary pipeline approach. One has to suspect the long tail of these character-based systems is littered with nonsense.

RoboCop

I like a lot of different types of films, but my favorites are the subtextually rich, nuance-light action/science fiction films of the late 1970s, 1980s, and early 1990s, made by directors like Cameron, Carpenter, Cronenberg, McTiernan, Scott, and Verhoeven. Perhaps the most prescient of all of these is RoboCop (1987). The film’s feel is set by over-the-top comic sex and violence and silly diegetic TV clips. In less deft hands, it could easily have become the sort of campy farce best described (or perhaps, denigrated) as a “cult classic”. (This usually means a film is just bad.) But Verhoeven wields sex and violence like a master wields a paintbrush. (I take this to be a sort of self-critique of his childhood aesthetic appreciation of the violence he saw as a boy growing up in Nazi-occupied Holland, not far from the V-2 launch sites.) The film is thematically rich, so much so that one can easily forgive Verhoeven’s apparent decision to leave out (in what is probably the most “dated” element of the film) any overt criticism of policing as an institution. It is ruthlessly critical of what we’d now call neoliberalism, of corporatism, and has much to say about the nature of the self. The theme that strikes me as most prescient is how the film hinges on the very modern realization that, to a striking degree, what we call “AI” is fundamentally just “other people”, alienated and dehumanized by contractual labor relations. Verhoeven could somehow see this coming decades before anything that could reasonably be called AI.