Professional organizations in linguistics

I am a member of the Linguistic Society of America (LSA) and the Association for Computational Linguistics (ACL), US-based professional organizations for linguists and computational linguists, respectively. (More precisely, I am usually a member. I think both memberships lapsed during the pandemic, and I renewed them once I started going to their respective conferences again.)

I attend LSA meetings when they’re conveniently located (next year’s is in Philly and we’re doing a workshop on Logical Phonology), and roughly one ACL-hosted meeting a year as well. As a (relatively) senior scholar I don’t find the former that useful (the scholarship is hit-or-miss and the LSA is dominated by a pandemonium of anti-generativists who are best just ignored), but the networking can be good. The *CL meetings tend to have more relevant science (or at least they did before prompt engineering…), but they’re expensive and rarely held in the Acela corridor.

While the LSA and the ACL are called professional organizations, their real purview is mostly to host conferences. The LSA does some other stuff, of course: they run Language and the institutes, occasionally engage in lobbying, etc. But they do not have much to say about the lives of workers in these fields. The LSA doesn’t tell you about the benefits of unionizing your workplace. The ACL doesn’t give you ethics tips about what to do if your boss wants you to spy on protestors. They don’t really help you get jobs in these fields either. They could; they just don’t.

There is an interesting contrast here with another professional organization I was once a member of: the Institute of Electrical and Electronics Engineers (IEEE, pronounced “eye triple-E”). Obviously, I am not an electrical engineer, but electrical engineering was historically the home of speech technology research, and their ASRU and SLT conferences are quite good in that field. During the year or so I was an IEEE member, I received their monthly magazine. Roughly half of it was just stories of general interest to electrical engineers; one that stuck with me argued that the laws of physics preclude the existence of the “directed energy weapons” claimed to cause Havana Syndrome. But the other half was specifically about the professional life of electrical engineers, including pieces about interviewing, the labor market outlook, and working conditions.

Imagine if Language had a quarterly professional column or if the ACL Anthology had a blog-post series…

Hiring season

It’s hiring season and your dean has approved your linguistics department for a new tenure line. Naturally, you’re looking to hire an exciting young “hyphenate” type who can, among other things, strengthen your computational linguistics offerings, help students transition into industry roles, and perhaps even incorporate generative AI into more mundane parts of your curriculum (sigh). There are two problems I see with this. First, most people applying for these positions don’t actually have relevant industry experience, so while they can certainly teach your students to code, they don’t know much about industry practices. Second, an awful lot of them would probably prefer to be full-time software engineers, all things considered, and are going to take leave—if not quit outright—if the opportunity ever becomes available. (“Many such cases.”) The only way to avoid this scenario, as I see it, is to find people who have already been software engineers and don’t want to be them anymore, and fortunately, there are several of us.

Hugging Face needs better curation

Hugging Face is, among other things, a platform for obtaining pre-trained neural network models. We use their tokenizers and transformers Python libraries in a number of projects. While these have a bit more abstraction than I like, and are arguably over-featured, they are fundamentally quite good and make it really easy to, e.g., add a pre-trained encoder. I also appreciate that the tokenizers are mostly compiled code (they’re Rust extensions, apparently), which in practice means that tokenization is IO-bound rather than CPU-bound.
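(For the curious, here is a tiny sketch of the Rust-backed tokenizers library in action; the model name is just an arbitrary example.)

    from tokenizers import Tokenizer

    # Load a pretrained tokenizer from the hub ("bert-base-cased" is just an example).
    tok = Tokenizer.from_pretrained("bert-base-cased")
    encoding = tok.encode("The cat sat on the mat.")
    print(encoding.tokens)  # Subword strings.
    print(encoding.ids)     # Their integer IDs.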

My use case mostly involves loading Hugging Face transformers and their tokenizers and using their encoding layers for fine-tuning. To load a model in transformers, one uses the function transformers.AutoModel.from_pretrained and provides the name of the model on Hugging Face as a string argument. If the model exists, but you don’t already have a local copy, Hugging Face will automatically download it for you (and stashes the assets in some hidden directory). One can do something similar with the tokenizers.AutoTokenizer, or one can request the tokenizer from the model instance.
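Concretely, the loading pattern looks something like the following sketch (the model name here is just an arbitrary example):

    from transformers import AutoModel, AutoTokenizer

    MODEL = "bert-base-cased"  # Any Hugging Face model name works the same way.

    # Downloads the weights on first use and caches them locally.
    model = AutoModel.from_pretrained(MODEL)
    tokenizer = AutoTokenizer.from_pretrained(MODEL)

    # Tokenize a sentence and run it through the encoder.
    batch = tokenizer("The cat sat on the mat.", return_tensors="pt")
    outputs = model(**batch)
    print(outputs.last_hidden_state.shape)  # (1, number of subwords, hidden size)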

Now you might think that this would make it easy to, say, write a command-line tool where the user can specify any Hugging Face model, but unfortunately, you’d be wrong. First off, a lot of models, including so-called token-free ones, lack a tokenizer. Why doesn’t ByT5, for instance, provide as its tokenizer a trivial Rust (or Python, even) function that returns bytes? In practice, one cannot support arbitrary Hugging Face models because one cannot count on them having a tokenizer. In this case, I see no alternative but to keep a list of supported models that lack their own tokenizer. Such a list is necessarily incomplete because the model hub continues to grow.
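The workaround amounts to something like the following sketch; the registry contents are hypothetical placeholders, not an actual list of affected models.

    from transformers import AutoTokenizer, ByT5Tokenizer

    # Hypothetical registry of models known to lack a usable hub tokenizer,
    # mapped to a suitable substitute; necessarily incomplete.
    TOKENIZERLESS = {
        "google/byt5-small": ByT5Tokenizer,
        "google/byt5-base": ByT5Tokenizer,
    }

    def load_tokenizer(model_name: str):
        """Loads the model's own tokenizer, falling back to the registry."""
        if model_name in TOKENIZERLESS:
            return TOKENIZERLESS[model_name]()
        return AutoTokenizer.from_pretrained(model_name)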

A similar problem comes with how parameters of the models are named. Most models are trained with dropout and support a dropout parameter, but the name of this parameter is inconsistent from model to model. In UDTube, for instance, dropout is a global parameter: it is applied to each hidden layer of the encoder (which requires us to access the guts of the Hugging Face model), and then again to the layer-pooled contextual subword embeddings just before they’re pooled into word embeddings. Most of the models we’ve looked at call the dropout probability of the encoder hidden_dropout_prob, but others call it dropout or dropout_rate. Because of this, we have to maintain a module which keeps track of what the hidden-layer dropout probability parameter is called.
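That module amounts to something like the following sketch (an illustration, not UDTube’s actual code; the parameter names shown are, to the best of my knowledge, the ones used by the corresponding Hugging Face configs).

    # Maps the encoder config class to the name of its hidden-layer dropout
    # probability parameter; necessarily incomplete.
    DROPOUT_FIELD = {
        "BertConfig": "hidden_dropout_prob",
        "RobertaConfig": "hidden_dropout_prob",
        "DistilBertConfig": "dropout",
        "T5Config": "dropout_rate",
    }

    def set_hidden_dropout(config, probability: float) -> None:
        """Sets the hidden-layer dropout probability on a Hugging Face config."""
        name = type(config).__name__
        if name not in DROPOUT_FIELD:
            raise ValueError(f"Unknown dropout parameter name for {name}")
        setattr(config, DROPOUT_FIELD[name], probability)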

I think this is basically a failure of curation. Hugging Face community managers should be out there fixing these gaps and inconsistencies, or at the very least publishing standards for such things. The company is valued at $4.5 billion; I would argue this is at least as important as their efforts with model cards and the like.

Announcing UDTube

In collaboration with CUNY master’s program graduate Daniel Yakubov, we have recently open-sourced UDTube, our neural morphological analyzer. UDTube performs what is sometimes called morphological analysis in context: it provides morphological analyses—coarse POS tagging, more-detailed morphosyntactic tagging, and lemmatization—to whole sentences using nearby words as context.

The UDTube model, developed in Yakubov 2024, is quite simple: it uses a pre-trained Hugging Face encoder to compute subword embeddings. We then take the last few layers of these embeddings and mean-pool them, then mean-pool the subword embeddings for those words which correspond to multiple subwords. The resulting encoding of the input is then fed to separate classifier heads for the different tasks (POS tagging, etc.). During training we fine-tune the pre-trained encoder in addition to fitting the classifier heads, and we make it possible to set separate optimizers, learning rates, and schedulers for the encoder and classifier modules.
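In isolation, the pooling step looks something like the following simplified sketch (not the actual UDTube implementation); word_ids is assumed to be the subword-to-word mapping returned by the tokenizer’s word_ids method.

    import torch

    def pool(hidden_states, word_ids, num_layers: int = 4) -> torch.Tensor:
        """Mean-pools the last few encoder layers, then subwords into words."""
        # hidden_states: tuple of (batch, subwords, hidden) tensors, one per layer,
        # obtained by calling the encoder with output_hidden_states=True.
        subwords = torch.stack(hidden_states[-num_layers:]).mean(dim=0)
        words = []
        for index in sorted({i for i in word_ids if i is not None}):
            columns = [j for j, w in enumerate(word_ids) if w == index]
            words.append(subwords[:, columns, :].mean(dim=1))
        # (batch, words, hidden).
        return torch.stack(words, dim=1)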

UDTube is built atop PyTorch and Lightning, and its command-line interface is made much simpler by the use of LightningCLI, a module which handles most of the interface work. One can configure the entire thing using YAML configuration files. CUDA GPUs and Apple silicon Macs (M1, etc., via MPS) can be used to accelerate training and inference, and should work out of the box. We also provide scripts to perform hyperparameter tuning using Weights & Biases. We believe that this model, with appropriate tuning, is probably state-of-the-art for morphological analysis in context.
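For those who haven’t used LightningCLI, the entry point for a tool like this is roughly the following sketch (the class names here are hypothetical, not UDTube’s actual ones):

    from lightning.pytorch.cli import LightningCLI

    # Hypothetical model and data module classes, for illustration only.
    from udtube_sketch import UDTubeDataModule, UDTubeModel

    def main() -> None:
        # LightningCLI provides fit/validate/test/predict subcommands and parses
        # --config path/to/config.yaml into model, data, and trainer arguments.
        LightningCLI(UDTubeModel, UDTubeDataModule)

    if __name__ == "__main__":
        main()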

UDTube is available under an Apache 2.0 license on GitHub and on PyPI.

References

Yakubov, D. 2024. How do we learn what we cannot say? Master’s thesis, CUNY Graduate Center.

Learned tokenization

Conventional (i.e., non-neural, pre-BERT) NLP stacks tend to use rule-based systems for tokenizing sentences into words. One good example is spaCy, which provides rule-based tokenizers for the languages it supports. I am sort of baffled that this is considered a good idea for languages other than English, since it seems to me that most languages need machine learning for even this task, to properly handle phenomena like clitics. If you like the spaCy interface—I admit it’s very convenient—and work in Python, you may want to try the spacy-udpipe library, which exposes the UDPipe 1.5 models for Universal Dependencies 2.5; these in turn use learned tokenizers (and taggers, morphological analyzers, and dependency parsers, if you care) trained on high-quality Universal Dependencies data.
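A minimal example of that library, using French (any supported language code works the same way):

    import spacy_udpipe

    # Downloads the UDPipe model on first use, then loads it as a spaCy pipeline.
    spacy_udpipe.download("fr")
    nlp = spacy_udpipe.load("fr")

    # Tokenization (note the clitics), lemmatization, and tagging are all learned.
    doc = nlp("Je l'ai vu hier.")
    for token in doc:
        print(token.text, token.lemma_, token.pos_)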