Learned tokenization

Conventional (i.e., non-neural, pre-BERT) NLP stacks tend to use rule-based systems for tokenizing sentences into words. One good example is spaCy, which provides rule-based tokenizers for the languages it supports. I am sort of baffled that this is considered a good idea for languages other than English, since it seems to me that most languages need machine learning even for this task to properly handle phenomena like clitics. If you like the spaCy interface (I admit it's very convenient) and work in Python, you may want to try the spacy-udpipe library, which exposes the UDPipe 1.5 models for Universal Dependencies 2.5; these in turn use learned tokenizers (and taggers, morphological analyzers, and dependency parsers, if you care) trained on high-quality Universal Dependencies data.
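As a rough sketch of what this looks like in practice (assuming the spacy-udpipe package's documented download/load interface, and using Spanish purely as an illustrative language with verb-attached clitics), the returned object behaves like an ordinary spaCy pipeline, but the tokenization underneath is done by the learned UDPipe model:

```python
# Minimal sketch, assuming the documented spacy_udpipe interface.
import spacy_udpipe

# Fetch the pretrained UDPipe model for the chosen language.
spacy_udpipe.download("es")

# Load it as a spaCy-style pipeline; tokenization (along with tagging,
# lemmatization, and dependency parsing) comes from the learned model.
nlp = spacy_udpipe.load("es")

doc = nlp("Dámelo ahora, por favor.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```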
