Do you know why the following comparison (in Python 3.7) fails?
>>> s1 = "ड़"
>>> s2 = "ड़"
>>> s1 == s2
False
I’ll give you a hint:
>>> len(s1)
1
>>> len(s2)
2
Despite the two strings rendering identically, they are encoded differently. The string s1 is a single-codepoint sequence, whereas s2 contains two codepoints. Thus string comparison fails, whether it’s done at the level of bytes or of Unicode codepoints.
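One way to see the difference is to list the codepoints in each string. The sketch below assumes that s1 is the precomposed character and s2 the base-plus-nuqta sequence, and it writes both with explicit escapes so that nothing depends on how an editor or clipboard stores the literals. The helper name `describe` is my own invention.

```python
import unicodedata

# Written with escapes so the example does not depend on how the
# literals survive copy-and-paste.
s1 = "\u095c"        # assumed: the precomposed form
s2 = "\u0921\u093c"  # assumed: base consonant plus combining nuqta


def describe(s):
    """Returns a list of (codepoint, Unicode name) pairs for the string."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s]


print(describe(s1))
print(describe(s2))
# The UTF-8 byte sequences differ as well.
print(s1.encode("utf-8"), s2.encode("utf-8"))
```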
Some NLP researchers are aware of issues arising from faulty string encoding. Eckart de Castilho (2016), for example, describes a tool which automatically identifies misencoded pre-trained data, whereas Wu & Yarowsky (2018) report issues using an existing tool for transliteration on certain languages because of encoding issues. However, I suspect that far fewer NLP researchers are familiar with the aforementioned problem, which is specific to Unicode normalization. To put it simply, Unicode defines four normalization forms (and associated conversion algorithms) for strings, and the key distinction is between “composed” and “decomposed” forms of characters (using that term in a pretheoretic sense). The string s1 is composed into a single Unicode codepoint; s2 is decomposed into two.
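As a rough illustration, Python's standard `unicodedata` module exposes all four forms. One wrinkle worth knowing: the precomposed Devanagari nuqta consonants are on Unicode's composition exclusion list, so even the "composed" forms NFC and NFKC map them to the two-codepoint sequence. The variable names below are assumed stand-ins for s1 and s2.

```python
import unicodedata

composed = "\u095c"          # stand-in for s1
decomposed = "\u0921\u093c"  # stand-in for s2

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    a = unicodedata.normalize(form, composed)
    b = unicodedata.normalize(form, decomposed)
    # After normalization the two spellings compare equal.
    print(form, a == b, [f"U+{ord(c):04X}" for c in a])
```

Whichever form is chosen, the point is that both spellings end up as the same codepoint sequence.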
Unfortunately, three columns of the Hindi Dependency Treebank (hi_hdtb, commit 54c4c0f; Bhat et al. 2017, Palmer et al. 2009) have a chaotic mix of composed and decomposed representations. It seems most if not all of these have to do with the encoding of the six nuqta (‘dot’) consonants, which are usually found in borrowings from Arabic or Persian (via Urdu, presumably). In Devanagari these consonants are written by adding a dot to a phonetically similar native consonant; for instance ड [ɖə] plus the nuqta produces ड़ [ɽə]. As is usually the case in Unicode, there is more than one way to do it: you can either encode ड़ with a composed character (U+095C DEVANAGARI LETTER DDDHA) or with the native Devanagari character (U+0921 DEVANAGARI LETTER DDA) plus a combining character (U+093C DEVANAGARI SIGN NUKTA). In practical terms, this means that strings containing different encodings of <ṛa> (as it is sometimes transliterated) will be treated as totally separate during training and evaluation, except on the off chance that all associated tools perform Unicode normalization ahead of time.
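To get a feel for how mixed a file is, one could count the two spellings directly. This is only a sketch: the function name is hypothetical, the sample words are made up rather than drawn from the treebank, and I am assuming the precomposed nuqta consonants are exactly the block U+0958 through U+095F.

```python
import collections

NUKTA = "\u093c"
# Assumption: the precomposed nuqta consonants are the block U+0958..U+095F.
PRECOMPOSED = {chr(cp) for cp in range(0x0958, 0x0960)}


def count_nuqta_forms(lines):
    """Counts precomposed versus base-plus-nukta spellings in the lines."""
    counts = collections.Counter()
    for line in lines:
        for char in line:
            if char in PRECOMPOSED:
                counts["composed"] += 1
            elif char == NUKTA:
                counts["decomposed"] += 1
    return counts


# Made-up sample: the same word spelled both ways.
sample = ["\u0932\u0921\u093c\u0915\u093e", "\u0932\u095c\u0915\u093e"]
print(count_nuqta_forms(sample))
```

A file in which both counts are nonzero has the kind of mixture described above.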
This does have negative consequences for NLP. Consider the UDPipe system (Straka & Straková 2017) at the CoNLL 2017 shared task on dependency parsing (Zeman et al. 2017), for which the primary metric is labeled attachment score (LAS). I first attempted to replicate the UDPipe results for the Hindi Dependency Treebank. Using UDPipe 1.2.0, word2vec (commit 20c129a), the hyperparameters given in the authors’ supplementary materials, and the official evaluation script, I obtain LAS = 87.09 on the “gold tokenization” subtask. However, I can improve this simply by converting the training, development, and test data to a consistent normalization like so:
for FILE in *.conllu; do
    TMPFILE="$(mktemp)"
    uconv -x nfkc "${FILE}" > "${TMPFILE}"
    mv "${TMPFILE}" "${FILE}"
done
and then retraining. Here I have chosen to apply the NFKC (“compatibility composed”) normalization form. While Zeman et al. do not discuss the encoding of the labeled Universal Dependencies data, they do mention that they apply NFKC normalization to the additional raw data. But it doesn’t really matter in this case which you choose so long as you are consistent. After retraining, I obtain LAS = 87.38, or 0.29 points for free. I also ran a “mismatch” experiment, where the training and testing data have different normalization forms; naturally, this causes a slight degradation to LAS = 86.98.
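If ICU's `uconv` is not available, roughly the same in-place conversion can be sketched with Python's standard library. The function names here are hypothetical, and I have not verified that the output is byte-for-byte identical to what `uconv` produces.

```python
import pathlib
import unicodedata


def normalize_file(path, form="NFKC"):
    """Rewrites a UTF-8 text file in place under the given normalization form."""
    path = pathlib.Path(path)
    text = path.read_text(encoding="utf-8")
    path.write_text(unicodedata.normalize(form, text), encoding="utf-8")


def normalize_treebank(directory=".", pattern="*.conllu", form="NFKC"):
    """Normalizes every matching file in the directory; returns the paths."""
    paths = sorted(pathlib.Path(directory).glob(pattern))
    for path in paths:
        normalize_file(path, form)
    return paths
```

Calling `normalize_treebank()` from the directory holding the `.conllu` files mirrors the shell loop. Note that `read_text` and `write_text` use the platform's newline handling, so line endings may change on Windows.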
Straka & Straková (2017) report a separate set of experiments in which they have attempted to rebalance the training-development-test splits. Just to be sure, I repeated the above experiments using their original rebalancing script. With the baseline (mixed normalization) data, I can replicate their result exactly: LAS = 87.30. With a consistent NFKC normalization of training, development and test data, I get LAS = 87.50. And with a normalization mismatch between training and test data, I get LAS = 87.07, a slight degradation. Again, the improvement (0.20 points) is more or less free.
While I have not yet done a systematic audit, I found three other UD treebanks that have encoding issues. The ar_padt treebank has a non-canonical ordering of combining characters in the lemma column (the shaddah, which indicates geminates, should come before the fathah and not the other way around), but this is unlikely to have any major effect on model performance because it uses this non-canonical ordering consistently. The ko_kaist and ur_udtb treebanks also have minor inconsistencies.
Unfortunately, my corporate overlord doesn’t permit me to file a pull request here because the Hindi data is released under a CC BY-NC-SA license. But if you’re not so constrained, feel free to do so, and ping this thread once you have! And pay attention in the future.
References
Bhat, R. A., Bhatt, R., Farudi, A., Klassen, P., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D. M., Vaidya, A., Vishnu, S. R., and Xia, F. 2017. The Hindi/Urdu Treebank Project. In Ide, N., and Pustejovsky, J. (eds.), The Handbook of Linguistic Annotation, pages 659-698. Springer.
Eckart de Castilho, R. 2016. Automatic analysis of flaws in pre-trained NLP models. In 3rd International Workshop on Worldwide Language Service Infrastructure and 2nd Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies, pages 19-27.
Palmer, M., Bhatt, R., Narasimhan, B., Rambow, O., Sharma, D. M., and Xia, F. 2009. Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure. In ICON, pages 14-17.
Straka, M., and Straková, J. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88-99.
Wu, W. and Yarowsky, D. 2018. A comparative study of extremely low-resource transliteration of the world’s languages. In LREC, pages 938-943.
Zeman, D., Popel, M., Straka, M., Hajič, J., Nivre, J., Ginter, F., … and Li, J. 2017. CoNLL 2017 Shared Task: Multilingual parsing from raw text to Universal Dependencies. In CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1-19.