[This post describes work in collaboration with Emily Charde.]
It is now commonplace for NLP applications to consume massive amounts of web text of unknown provenance. Applications that stand to benefit from this firehose of data, but don’t need all of it, may require more attention to data quality, in the form of high-precision methods for filtering out redundancies and junk.
Gorman et al. (2021) follow standard practices for obtaining a “clean” subsample of web data: they filter sentence strings based on the presence of capitalization and sentential punctuation, on length, and on predictability as measured by a character language model. In an ongoing project on defectivity, we sought to do something similar at a much larger scale. The work was undertaken in collaboration with Emily Charde, a graduate of our master’s program who worked as an RA on the project.
Our data for this project is drawn from CC-100, a recreation of the earlier CCNet corpus (Wenzek et al. 2020). CC-100 consists of strings from 2018 Common Crawl snapshots, already filtered somewhat and grouped by language using language ID tools. At rest, the CC-100 data is stored in enormous LZMA-compressed files, one per language/locale/script. The largest, English (naturally), occupies 82 GB despite this aggressive compression scheme.
We proceed as follows.
We first shard the data for each language into roughly 1 GB chunks, preserving the LZMA compression.
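As a point of reference, here is a minimal sketch of what that sharding might look like using nothing but the standard library. The filenames, the shard_xz helper, and the rotate-on-uncompressed-bytes heuristic are illustrative assumptions, not our actual script; in particular, the ~1 GB target refers to compressed shards, which this sketch only approximates via a rough compression ratio.

```python
import lzma

# Hypothetical rotation threshold: ~8 GB of raw text per shard, on the
# assumption that this compresses to very roughly 1 GB.
UNCOMPRESSED_BYTES_PER_SHARD = 8 * 1024**3


def shard_xz(path: str, prefix: str) -> None:
    """Splits one huge .xz file of sentences into smaller .xz shards."""
    index = 0
    written = 0
    sink = lzma.open(f"{prefix}-{index:05d}.xz", "wt", encoding="utf-8")
    with lzma.open(path, "rt", encoding="utf-8") as source:
        for line in source:
            if written >= UNCOMPRESSED_BYTES_PER_SHARD:
                sink.close()
                index += 1
                written = 0
                sink = lzma.open(f"{prefix}-{index:05d}.xz", "wt", encoding="utf-8")
            sink.write(line)
            written += len(line.encode("utf-8"))
    sink.close()


if __name__ == "__main__":
    shard_xz("en.txt.xz", "en-shard")
```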
We then perform sentence and word tokenization in parallel using mudpipe.py, a Python wrapper around UDPipe 1, a C++ command-line tool. The wrapper automatically decompresses the LZMA files, invokes UDPipe, and recompresses the output CoNLL-U-formatted data, preserving disk space; since this work is mostly IO-bound, mudpipe.py processes the shards in parallel (the “m” in “mudpipe” stands for “multiprocessing”). The script was originally developed by Yulia Spektor, another graduate student, for her master’s thesis (Spektor 2021). Applying mudpipe.py to English, Greek, and Russian (our three target languages) took a few weeks of compute time on a single desktop that would otherwise have sat idle. The resulting shards of compressed CoNLL-U sentences are somewhat larger, roughly 2 GB each, presumably because of the additional markup.
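For readers who want the flavor of it, here is a toy version of the mudpipe.py idea. This is not the actual script: the model path and shard-naming scheme are placeholders, and the udpipe invocation (binary name, --tokenize flag, positional model argument) is written from memory and should be checked against the UDPipe 1 documentation.

```python
import lzma
import multiprocessing
import shutil
import subprocess
import tempfile

MODEL = "english-ewt-ud.udpipe"  # Placeholder model path.


def process_shard(shard: str) -> str:
    """Decompresses a shard, runs UDPipe on it, and recompresses the output."""
    out = shard.replace(".txt.xz", ".conllu.xz")
    with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
        # Decompress the shard to a temporary file UDPipe can read.
        with lzma.open(shard, "rb") as source:
            shutil.copyfileobj(source, tmp)
        tmp.flush()
        # Stream UDPipe's CoNLL-U output straight into an .xz file, so only
        # compressed data ever sits on disk for long.
        proc = subprocess.Popen(
            ["udpipe", "--tokenize", MODEL, tmp.name],
            stdout=subprocess.PIPE,
        )
        with lzma.open(out, "wb") as sink:
            shutil.copyfileobj(proc.stdout, sink)
        proc.wait()
    return out


if __name__ == "__main__":
    shards = [f"en-shard-{i:05d}.txt.xz" for i in range(10)]
    # Each worker mostly waits on IO and on UDPipe itself, so a modest pool
    # keeps an otherwise idle desktop busy.
    with multiprocessing.Pool(4) as pool:
        for done in pool.imap_unordered(process_shard, shards):
            print(f"finished {done}")
```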
We now turn to filtering in earnest. Whereas Gorman et al. were working with tens of millions of English sentences, the CC-100 language samples contain billions of sentences, so percentile-based filters like theirs must be computed out-of-core. We thus chose SQLite as our data store for this project, envisioning that SQL would be a natural way to express the filters.
Filtering was ultimately performed by a single Python script using the sqlite3 standard library. This script runs through the tokenized shards produced by mudpipe.py and ultimately produces a single LZMA-compressed, CoNLL-U-format file for each language. Working incrementally, it decompresses each shard and parses the CoNLL-U format line by line. Once a sentence is obtained, we apply ordinary re regular expression filters. These expressions require each sentence to start with an uppercase letter of the appropriate script, to continue with more letters, spaces, or punctuation of the appropriate script, and to end with sentential punctuation (e.g., /[.!?]/). For instance, a Russian or Greek sentence that contains Latin characters is discarded. If quotation marks are present, they are required to “balance”. Sentences that fail one or more of these constraints are simply removed from further consideration. For the sentences that remain, we extract additional data (a rough sketch of these checks and statistics follows the list below):
- length in characters
- length in tokens
- bits per character (BPC) entropy according to an OpenGrm-NGram (Roark et al. 2012) 6-gram character language model
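Here is roughly what those checks and statistics might look like for English. The pattern, the helper names, and the exact character class are illustrative rather than our actual code (the real filters use per-script classes for Greek and Cyrillic as well), and the BPC computation is stubbed out, since it depends on the OpenGrm-NGram character model.

```python
import re

# Illustrative English-only pattern: an uppercase letter, then letters, digits,
# spaces, and ordinary punctuation, ending in sentential punctuation.
COMPLETE = re.compile(r"""[A-Z][A-Za-z0-9 ,;:'"“”()-]*[.!?]["”]?$""")


def balanced(sentence: str) -> bool:
    """Crude quote-balancing check (simplified here to count comparisons)."""
    return sentence.count('"') % 2 == 0 and sentence.count("“") == sentence.count("”")


def bits_per_character(sentence: str) -> float:
    """Stub: the real pipeline scores the sentence with an OpenGrm-NGram
    6-gram character model and normalizes by sentence length."""
    raise NotImplementedError


def extract(sentence: str, tokens: list[str]):
    """Returns (char length, token length, BPC) for sentences that pass the
    filters, and None for sentences to be discarded."""
    if not COMPLETE.match(sentence) or not balanced(sentence):
        return None
    return len(sentence), len(tokens), bits_per_character(sentence)
```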
The sentence and these three statistics are then stored in the SQLite database; to save temporary disk space, the sentence text is gzip-compressed, with the shortest possible compression window and no headers. Accumulating this portion of the table takes quite some time, but it can be performed in parallel across shards or languages. We perform batches of 1M updates at a time. We experimented—well, Emily did, I watched—with various database PRAGMAs to improve performance, but none of them was clearly performance-positive.
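A sketch of the loading step follows. The table and column names are made up for illustration, and the compression settings (zlib with a negative wbits value, i.e., headerless raw DEFLATE with the smallest window) are my guess at what “shortest possible compression window and no headers” amounts to in the standard library; treat this as a sketch, not the project’s actual schema or code.

```python
import sqlite3
import zlib

BATCH_SIZE = 1_000_000  # 1M rows per executemany/commit, as described above.


def compress(tokenlist: str) -> bytes:
    # Raw DEFLATE: a negative wbits value drops the zlib header, and -9 uses
    # the smallest window the format allows.
    compressor = zlib.compressobj(wbits=-9)
    return compressor.compress(tokenlist.encode("utf-8")) + compressor.flush()


def load(db: str, table: str, rows) -> None:
    """Inserts (tokenlist, char_len, word_len, bpc) rows in large batches.

    SQLite's implicit rowid on this table is what the quartile query below
    keys on.
    """
    con = sqlite3.connect(db)
    con.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(tokenlist BLOB, char_len INTEGER, word_len INTEGER, bpc REAL)"
    )
    batch = []
    for tokenlist, char_len, word_len, bpc in rows:
        batch.append((compress(tokenlist), char_len, word_len, bpc))
        if len(batch) == BATCH_SIZE:
            con.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", batch)
            con.commit()
            batch.clear()
    if batch:
        con.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", batch)
        con.commit()
    con.close()
```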
Our next step is to actually filter the data. In an inner subquery, we compute quartiles for character length, token length, and BPC. Then, in an outer subquery, we return the row IDs of every sentence that falls in Q2 or Q3, the middle two quartiles, for all three measures. That is, if a sentence has median BPC but is in the 80th percentile for character length, we remove it. This is highly conservative, but we have more than enough data, and we anticipate that character length and token length, at least, are highly correlated in any language. In the outermost query, we SELECT the sentences whose row IDs are returned by that subquery. This query is a work of art.
```sql
SELECT tokenlist FROM table WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
               NTILE(4) OVER (ORDER BY char_len) AS char_q,
               NTILE(4) OVER (ORDER BY word_len) AS word_q,
               NTILE(4) OVER (ORDER BY bpc) AS bpc_q
        FROM table
    )
    WHERE (char_q BETWEEN 2 AND 3)
      AND (word_q BETWEEN 2 AND 3)
      AND (bpc_q BETWEEN 2 AND 3)
);
```
We then reserialize and recompress the remaining sentences into a new LZMA-compressed file. Here are some logging statements that give a sense of the scale (this is from Russian):
```
WARNING 2023-01-06 20:39:41,896: 1,576,171,212 input sentences processed
WARNING 2023-01-06 20:39:41,896: 362 sentences missing text
WARNING 2023-01-06 20:39:41,896: 539,046,034 sentences incomplete
WARNING 2023-01-06 20:39:41,896: 772,566 sentences fail LM composition
WARNING 2023-01-06 21:16:35,406: 1,036,352,250 sentences after primary filtration
WARNING 2023-01-08 09:14:13,110: 232,404,041 sentences after secondary filtration
INFO 2023-01-08 09:14:13,117: Writing to ../conllu/cc100/ru.cc100.filtered.conllu.xz...
INFO 2023-01-09 03:22:08,252: Dropping ru_cc100 table
INFO 2023-01-09 10:42:07,085: Filtering complete
```
To summarize: there were about 1.6b input sentences after mudpipe.py; of these, 362 (inexplicably, but it happens) had no text at all. Roughly a half billion are “incomplete”, meaning they failed the regular expression constraints. A bit less than one million “fail LM composition”; this usually indicates that they contain odd, language-inappropriate characters never seen in the (held-out) materials used to train the character LMs. This leaves us with just over one billion sentences for “secondary filtration”. Of these, 232m fall in the middle two quartiles for all three length and entropy measures and are retained. As you can see, secondary filtration took an otherwise-idle desktop about 36 hours, with reserialization and recompression taking another 18 hours, and DB cleanup (not strictly necessary, but sort of like “be kind, rewind”) adding a final 7 hours. Not bad, though this could certainly be made to run much faster (possibly with a different database engine designed for parallel writes).
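For completeness, here is a rough sketch of that reserialization step, under the same assumptions about the schema and compression as the earlier sketches; the database filename is hypothetical, while the table name and output path follow the Russian log above.

```python
import lzma
import sqlite3
import zlib

# The quartile query shown earlier, with the table name from the Russian run.
QUERY = """
SELECT tokenlist FROM ru_cc100 WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
               NTILE(4) OVER (ORDER BY char_len) AS char_q,
               NTILE(4) OVER (ORDER BY word_len) AS word_q,
               NTILE(4) OVER (ORDER BY bpc) AS bpc_q
        FROM ru_cc100
    )
    WHERE (char_q BETWEEN 2 AND 3)
      AND (word_q BETWEEN 2 AND 3)
      AND (bpc_q BETWEEN 2 AND 3)
)
"""


def decompress(blob: bytes) -> str:
    # Inverse of the raw-DEFLATE compression used when loading the table.
    return zlib.decompress(blob, wbits=-15).decode("utf-8")


def write_filtered(db: str, path: str) -> None:
    """Streams the surviving sentences into a single .xz CoNLL-U file."""
    con = sqlite3.connect(db)
    with lzma.open(path, "wt", encoding="utf-8") as sink:
        for (blob,) in con.execute(QUERY):
            # Assumes each stored tokenlist is a complete CoNLL-U sentence
            # block ending in a newline; the extra newline separates sentences.
            sink.write(decompress(blob))
            sink.write("\n")
    con.close()


if __name__ == "__main__":
    write_filtered("cc100.db", "../conllu/cc100/ru.cc100.filtered.conllu.xz")
```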
In practice, we find that this produces data that is highly diverse but extremely clean. Should even more data ever be desired, one could easily imagine relaxing the quartile constraints a bit.
[Late-breaking addition: I should probably explain why we want median-entropy text. If you sort the sentences of a large corpus by bits per character, you will see that the lowest-entropy sentences tend to be boilerplate and the highest-entropy sentences tend to be rubbish. So the middle is “just right” here.]
Acknowledgments
Support for this project was provided by a PSC-CUNY award, jointly funded by the Professional Staff Congress and the City University of New York.
References
Gorman, K., Kirov, C., Roark, B., and Sproat, R. 2021. Structured abbreviation expansion in context. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 995-1005.
Wenzek, G., Lachaux, M.-A., Conneau, A., Chaudhary, V., Guzmán, F., Joulin, A., and Grave, E. 2020. CCNet: extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003-4012.
Roark, B., Sproat, R., Allauzen, C., Riley, M., Sorensen, J., and Tai, T. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61-66.
Spektor, Y. 2021. Detection and morphological analysis of novel Russian loanwords. Master’s thesis, Graduate Center, City University of New York.
Holy moly. Are you going to share the clean data somehow?
I can for English at least, sure. (We have WMT newscrawl and CC-100.)
What makes it difficult to supersize this to 100 or so languages is that not all of them have Universal Dependencies corpora (which is what we use to train the character LMs), that we have to write the regular expressions using a bit of language knowledge (there ought to be a database with codepoint ranges for each language/script pair), and that it just takes a long time (weeks, even) for the big languages. Maybe it would be faster if I used Postgres, since then we could do parallel writes from different workers.