Online poisoning

One of my working theories for why natural language processing feels unusually contentious at present is, yes, social media. The outspoken researchers speak, more or less constantly, to a large social media audience, and use this forum as the primary way to form and disseminate opinions. For instance, there is a very strong correlation between being an “ACL thought leader”, if not an officer, and tweeting often and aggressively. People of my age understand the addictive and corrosive nature of presenting oneself for online kudos (and jeers), but some people of the older generations lack the appropriate internet literacy to use these tools in moderation, and some people of the younger generations lack the maturity to do the same. Such people have online poisoning. Side-effects include outing oneself as the subject of a subtweet and complaining to a student’s advisor. If you have any of these symptoms, please log off immediately and touch grass.

Defectivity in Turkish; part 2: desideratives

[This is part of a series of defectivity case studies.]

Thanks to correspondence with one of the authors I recently became aware of another possible paradigm gap in Turkish. According to İleri & Demirok (2022), henceforth ID, Turkish speakers are uncertain about the form of 3rd person plural desideratives. In this language, desideratives are deverbal nominals which select for and agree with a genitive subject. The desiderative suffix is /-AsI-/, where the capital letters mark archiphonemes subject to root harmony, and the 3rd person plural (3pl.) possessive agreement suffix is /-lArI/. However, according to ID’s survey, Turkish speakers rate 3pl. desideratives formed from the root plus /-AsI-lArI/ as quite poor, and 3pl. desideratives are exceedingly rare in corpora, even compared to other desiderative forms.

ID relate this observation to something unexpected about the 3rd person singular (3sg.) desiderative. Desideratives select, and agree with, a genitive subject, and the ordinary 3sg. genitive agreement suffix is /-sI/, but in the 3sg. desiderative there is apparently haplology and we get just /-AsI/ (e.g., yapası, the 3sg. desiderative of ‘do’) instead of the expected */-AsI-sI/. They suggest that speakers may have reanalyzed /-AsI/ as a desiderative allomorph /-A/ followed by a 3sg. agreement suffix /-sI/, and thus predict that the 3pl. desiderative will be expressed by /-A-lArI/, though this too is judged to be quite bad (thus *yapasıları but also *yapaları). However, it is not immediately clear to me why ID expect speakers to hypothesize that the 3sg. desiderative allomorph should generalize to the 3pl.

This has a rather different flavor than the other defectivity case studies I’ve presented thus far. It could be that there simply are not enough desideratives in this person/number slot in the input, but I still don’t see what could be objectionable about /-AsI-lArI/. Another mystery is that their judgment task finds unexplained, very low acceptability for 2nd person plural desideratives (which seem to be of the form /-AsI-n/).

References

İleri, M. & Demirok, Ö. 2022. A paradigm gap in Turkish. In Proceedings of the Workshop on Turkic and Languages in Contact with Turkic 7, pages 1-15.

It’s time to retire “agglutinative”

A common trope in computational linguistics papers is the use of the technical term agglutinative as a synonym for rich inflectional morphology. This is not really what that term means. Properly, a language has agglutinative morphology just in case it has affixes, each of which has a single syntacto-semantic function. (To measure this properly, you probably need a richer, and more syntactically oriented, theory of morphology than is au courant among the kind of linguistic typologist who would think it interesting to measure this over a wide variety of languages in the first place, but that’s another issue.) Thus Russian, for instance, has rich inflectional morphology, but it is not at all agglutinative, because it is quite happy for the single suffix -ov to mark both the genitive and the plural, whereas the genitive plural in Hungarian is marked by two affixes.

I propose that we take agglutinative away from NLP researchers until they learn even a little bit about morphology. If you want to use the term, you need to state why agglutination, rather than the mere matter of lexemes having a large number of inflectional variants, is the thing you want to highlight. While I don’t think WALS is very good—certainly it’s overused in NLP—it nicely distinguishes between fusion (#20), exponence (#21), and synthesis (#22). This ought to allow one to distinguish between agglutination and synthesis with a carefully drawn sample, should one wish to.

A prediction

You didn’t build that. – Barack Obama, July 13, 2012

Connectionism originates in psychology, but the “old connectionists” are mostly gone, having largely failed to pass on their ideology to their trainees, and there really aren’t many “young connectionists” to speak of. But, I predict that in the next few years we’ll see a bunch of psychologists of language—the ones who define themselves by their opposition to internalism, innateness, and generativism—become some of the biggest cheerleaders for large language models (LLMs). In fact, psychologists have not made substantial contributions to neural network modeling in many years. Virtually all the work on improving neural networks over the last few decades has been done by computer scientists who cared not a whit whether they had anything to do with human brains or cognitive plausibility.1 (Sometimes they’ll put things like “…inspired by the human brain…” in the press releases, but we all know that’s just fluff.) At this point, psychology as a discipline has no more claim to neural networks than the Irish do to Gaul, and in the rather unlikely case that LLMs do end up furnishing deep truths about cognition, psychology as a discipline will have failed us by not following up on a promising lead. I think it will be particularly revealing if psychologists who previously worshipped at the Church of Bayes suddenly lose all interest in mathematical rigor and find themselves praying to the great Black Box. I want to say it now: if this happens—and I am starting to see signs that it will—those people will be cynics, haters, and trolls, and you shouldn’t pay them any mind.

Endnotes

  1. I am also critical of machine learning pedagogy, and it is therefore interesting to see that those same computer scientists pushing things forward don’t seem to care much for machine learning as an academic discipline either.

Noam and Bill are friends

One of the more confusing slanders against generativism is the belief that it has all somehow been undone by William Labov and the tradition of variationist sociolinguistics. I have bad news: Noam and Bill are friends. I saw them chopping it up once, in Philadelphia, and I have to assume they were making fun of functionalists. Bill has nice things to say about the generativist program in his classic paper on negative concord; Noam has some interesting comments about how the acquirenda probably involve multiple competing grammars in that Piaget lecture book. They both think functionalism is wildly overrated. And of course, the i-language perspective that Noam brings is absolutely essential to the dialogues about language ideologies, language change, stigma and stratification, and so forth that we associate with Bill.

More than one rule

[Leaving this as a note to myself to circle back.]

I’m just going to say it: some “rules” are probably two or three rules, because the idea that rules are defined by natural classes (and thus free of disjunctions) is more entrenched than our intuitions about whether a process in some language is really one rule, and we should be Galilean about this. Here are some phonological “rules” that are probably two or three different rules.

  • “Ruki” in the Indo-Iranian and Balto-Slavic families and in Albanian (environment: preceding {w, j, k, r}): it is not clear to me whether any of these languages actually needs this as a synchronic rule at all.
  • Breton voiced stop lenition (change: /b/ to [v], /d/ to [z], /g/ to [x]): the devoicing of /g/ must be a separate rule. Hat tip: Richard Sproat. I believe there’s a parallel set of processes in German.
  • Lamba palatalization (change: /k/ to [tʃ], /s/ to [ʃ]): two rules, possibly with a Duke-of-York thing. Hat tip: Charles Reiss.
  • Mid-Atlantic (e.g., Philadelphia) English ae-tensing (environment: following tautosyllabic, same-stem {m, n, f, θ, s, ʃ}): let’s assume this is allophony; then the anterior nasal and voiceless fricative cases should be separate rules. It is possible that the incipient restructuring of this as having a simple [+nasal] context provides evidence for the multi-rule analysis.
  • Latin glide formation (environment: complex). Front and back glides are formed from high short monophthongs in different but partially overlapping contexts.

Industry postdocs

I find the very idea of industry postdocs funny (funny-sad, though). Sure, it makes sense for the academy, with all of its scarcities, to make use of precarious, casualized post-graduate labor, but to extend this to the tech sector is vaguely monstrous. It’s extra funny (but funny-sad too) when you hear of a senior professor doing an industry postdoc at a company with a name like baz.ly during their sabbatical.

Neurolinguistic deprogramming

I venture to say most working linguists would reject—outright—strong versions of linguistic relativity and the Sapir-Whorf hypothesis, and would regard neuro-linguistic programming as pseudoscientific rubbish. This is of course in contrast to the general public: even the highly educated take linguistic relativity as an obvious description of human life. Yet it is not uncommon for the same linguists to endorse a belief in the power of renaming that is hard to reconcile with the general disrepute of the vulgar Whorfian view such a belief presupposes.

For instance, George Lakoff’s work on “framing” in politics argued that renaming social programs was the one weird trick needed to get Howard Dean into the White House. While this seems quaint in retrospect, his proposal was widely debated at the time. Pinker’s (sigh) takedown is necessary reading. The problem, of course, is that Lakoff ought to have provided, and ought to have been expected to provide, some evidence for a view of language his colleagues widely regard as untutored.

The case of renaming languages is a grayer one. I believe that one ought to call people what they want to be called, and that if stakeholders would prefer their language to be referred to as Tohono Oʼodham rather than Pápago, I am and will remain happy to oblige.1 If African American Vernacular English is renamed African American Language (as seems to be increasingly common in scholarship), I will gladly follow suit. But I can’t imagine how the renaming could represent either a reconceptualization of the language itself or a change in how we study it. Indeed, it would be strange for the name of any language to reflect any interesting property of said language. French by any other name would still have V-to-T movement and liaison.

It may be that these acts of renaming have power. Indeed, I hope they do. But I have to suspect the opposite: they’re the sort of fiddling one does when one is out of power, when one is struggling to believe that a better world is possible. And if I’m wrong, who is better suited to show that than the trained linguist?

Endnotes

  1. Supposedly, the older name of the language comes from a pejorative used by a neighboring tribe, the Pima. Ba꞉bawĭkoʼa means, roughly, ‘tepary bean eater’. The Spanish colonizers adapted this as Pápago. I feel like the gloss sounds like a cutting insult in English too, so I get why this exonym has fallen into disrepute.

Filtering text at scale

[This post describes work in collaboration with Emily Charde.]

It is now commonplace for NLP applications to consume massive amounts of web text of unknown provenance. Applications which stand to benefit from this firehose of data, but which don’t need all of it, may require more attention to data quality, in the form of high-precision methods for filtering out redundancies and junk.

Gorman et al. (2021) follow standard practices for obtaining a “clean” subsample of web data: they filter sentence strings based on the presence of capitalization and sentential punctuation, on length, and on predictability as measured by a character language model. In an ongoing project on defectivity, we sought to do something similar at a much larger scale. I undertook this project in collaboration with Emily Charde, a graduate of our master’s program who worked as an RA on the project.

Our data for this project is drawn from CC-100, a recreation of the earlier CCNet corpus (Wenzek et al. 2020). CC-100 consists of strings from 2018 Common Crawl snapshots, already filtered somewhat and grouped by language using language ID tools. At rest, the CC-100 data is stored in enormous LZMA-compressed files, one per language/locale/script. The largest, English (naturally), occupies 82 GB despite this aggressive compression scheme.

We proceed as follows.

We first shard the data for each language into roughly 1 GB chunks, preserving the LZMA compression.

We then perform sentence and word tokenization using mudpipe.py, a Python wrapper around the C++ command-line tool UDPipe 1. The wrapper automatically decompresses the LZMA files, invokes UDPipe, and recompresses the output CoNLL-U-formatted data, preserving disk space; since the work is mostly IO-bound, mudpipe.py processes the various shards in parallel (the “m” in “mudpipe” stands for “multiprocessing”). This script was originally developed by Yulia Spektor, another graduate student, for her master’s thesis (Spektor 2021). Applying mudpipe.py to English, Greek, and Russian (our three target languages) took a few weeks of compute time on a single desktop that otherwise would have sat idle. The resulting shards of compressed CoNLL-U sentences are somewhat larger, roughly 2 GB each, presumably because of the additional markup.

We now turn to filtering in earnest. Whereas Gorman et al. were working with tens of millions of sentences of English, the CC-100 language samples contain many billions of sentences, so filtering based on percentiles, like those used by Gorman et al., must be performed out-of-core. We thus chose SQLite as our data store for this project, and envisioned that SQL would be a natural way to express filters.

Filtering was ultimately performed by a single Python script using the sqlite3 standard library. This script runs through the tokenized shards produced by mudpipe.py and ultimately produces a single LZMA-compressed, CoNLL-U-format file for each language. Working incrementally, it decompresses each shard and parses the CoNLL-U format line by line. Once a sentence is obtained, we apply ordinary regular expression filters (via the re library). These expressions require each sentence to start with an uppercase letter of the appropriate script, to continue with more letters, spaces, or punctuation of the appropriate script, and to end with sentential punctuation (e.g., /[.!?]/). For instance, a Russian or Greek sentence that contains Latin characters is discarded. If quotation marks are present, they are required to “balance”. Sentences that fail one or more of these constraints are simply removed from further consideration. Additional data is extracted from the sentences that remain (a sketch of this filtering and of the statistics follows the list below):

  • length in characters
  • length in tokens
  • bits per character (BPC) entropy according to an OpenGrm-NGram (Roark et al. 2012) 6-gram character language model

The sentence and these three statistics are then stored in the SQLite database; we also use gzip compression, with the shortest possible compression window and no headers, to save temporary disk space. Accumulating this portion of the table takes quite some time, but it can be performed in parallel across shards or languages. We perform batches of 1m updates at a time. We experimented—well, Emily did, I watched—with various database PRAGMAs to improve performance, but none of these were clearly performance-positive.

Our next step is to actually filter the data. In an inner subquery, we compute quartiles for character length, token length, and BPC. Then in an outer subquery, we return the row IDs of every sentence which is in Q2 or Q3—the middle two quartiles—for all three measures. That is, if a sentence has median BPC but is in the 80th percentile for character length, we remove it. This is highly conservative, but we have more than enough data, and we anticipate that at least character length and token length are highly correlated in any language. In the outermost query, we SELECT the sentences whose row IDs are returned by the outer subquery. This query is a work of art.

SELECT tokenlist FROM table WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
        NTILE(4) OVER (ORDER BY char_len) AS char_q,
        NTILE(4) OVER (ORDER BY word_len) AS word_q,
        NTILE(4) OVER (ORDER BY bpc) AS bpc_q
        FROM table
    )
    WHERE (char_q BETWEEN 2 AND 3)
    AND (word_q BETWEEN 2 AND 3)
    AND (bpc_q BETWEEN 2 AND 3)
);

We then reserialize and recompress the remaining sentences into a new LZMA-compressed file.

WARNING 2023-01-06 20:39:41,896: 1,576,171,212 input sentences processed
WARNING 2023-01-06 20:39:41,896: 362 sentences missing text
WARNING 2023-01-06 20:39:41,896: 539,046,034 sentences incomplete
WARNING 2023-01-06 20:39:41,896: 772,566 sentences fail LM composition
WARNING 2023-01-06 21:16:35,406: 1,036,352,250 sentences after primary filtration
WARNING 2023-01-08 09:14:13,110: 232,404,041 sentences after secondary filtration
INFO 2023-01-08 09:14:13,117: Writing to ../conllu/cc100/ru.cc100.filtered.conllu.xz...
INFO 2023-01-09 03:22:08,252: Dropping ru_cc100 table
INFO 2023-01-09 10:42:07,085: Filtering complete

To summarize: there were about 1.6b input sentences after mudpipe.py; of these, 362 (inexplicably, but it happens) had no text at all. Roughly half a billion were “incomplete”, meaning they failed the regular expression constraints. A bit less than one million “fail LM composition”; this usually indicates that they contain odd, language-inappropriate characters never seen in the (held-out) materials used to train the character LMs. This leaves us with just over one billion sentences for “secondary filtration”. Of these, 232m fall in the middle two quartiles for the length and entropy measures and are retained. As you can see, secondary filtration took an otherwise-idle desktop about 36 hours, with reserialization and recompression taking about 18 hours, and DB cleanup (not strictly necessary, but sort of like “be kind, rewind”) adding another 7 hours at the end. Not bad, though certainly this could be made to run much faster (possibly with a different database engine designed for parallel writes).

In practice, we find that this produces data that is highly diverse but extremely clean. Should even more data ever be desired, one could easily imagine relaxing the quartile constraints a bit.

[Late-breaking addition: I should probably explain why we want median-entropy text. If one sorts the sentences of a large corpus by bits per character, the lowest-entropy sentences tend to be boilerplate and the highest-entropy sentences tend to be rubbish. So the middle is “just right” here.]

Acknowledgments

Support for this project was provided by a PSC-CUNY award, jointly funded by the Professional Staff Congress and the City University of New York.

References

Gorman, K., Kirov, C., Roark, B., and Sproat, R. 2021. Structured abbreviation expansion in context. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 995-1005.
Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., and Grave, E. 2020. CCNet: extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003-4012.
Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., and Tai, T. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61-66.
Spektor, Y. 2021. Detection and morphological analysis of novel Russian loanwords. Master’s thesis, Graduate Center, City University of New York.