Python ellipses considered harmful

Python has a conventional object-oriented design, but it was slowly grafted onto the language, something which shows from time to time. Arguably, you see this in the convention that instance methods need self passed as their first argument, and class methods need cls as their first argument. Another place you see it is how Python does abstract classes. First, one can use the definitions in the built-in abc module, proposed in PEP 3119, to declare a class as abstract. But in practice most Pythonistas make a class abstract by declaring unimplemented instance methods. There are two conventional ways to do this, either with ellipses or by raising an exception, illustrated below.

class AbstractCandyFactory:
    def make_candy(self, batch_size: int): ...

class AbstractCandyFactory:
    def make_candy(self, batch_size: int):
        raise NotImplementedError

The latter is a bit more verbose, but there is actually a very good reason to prefer it to the former, elliptical version. With the exception version, if one forgets to implement make_candy—say, in a concrete subclass like SnickersFactory(AbstractCandyFactory)—an informative exception will be raised when make_candy is called on a SnickersFactory instance. With the elliptical version, however, the inherited method will be called, and of course it will do nothing because the method has no body. This will likely cause errors down the road, but they will not be nearly as easy to track down, because there is nothing directly linking the issue to the failure to override this method. For this reason alone, I consider ellipses used to declare abstract instance methods harmful.
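
A minimal sketch of the failure mode, using the elliptical base class above (the SnickersFactory definition here is hypothetical):

class SnickersFactory(AbstractCandyFactory):
    pass  # Oops: we forgot to override make_candy.

candy = SnickersFactory().make_candy(100)
# With the elliptical base class this silently returns None; with the
# NotImplementedError version it raises immediately, pointing straight at
# the missing override.
print(candy)  # None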

Announcing UDTube

In collaboration with CUNY master’s program graduate Daniel Yakubov, we have recently open-sourced UDTube, our neural morphological analyzer. UDTube performs what is sometimes called morphological analysis in context: it provides morphological analyses—coarse POS tagging, more-detailed morphosyntactic tagging, and lemmatization—to whole sentences using nearby words as context.

The UDTube model, developed in Yakubov 2024, is quite simple: it uses a pre-trained Hugging Face encoder to compute subword embeddings. We then mean-pool the last few layers of these embeddings, and mean-pool again over the subword embeddings of words that are split into multiple subwords. The resulting encoding of the input is then fed to separate classifier heads for the different tasks (POS tagging, etc.). During training we fine-tune the pre-trained encoder in addition to fitting the classifier heads, and we make it possible to set separate optimizers, learning rates, and schedulers for the encoder and classifier modules.
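
As a rough illustration of the pooling step (a sketch of the idea for a single sentence, not UDTube's actual code; the function and its arguments are my own):

import torch

def pool(hidden_states, word_ids, n_layers: int = 4) -> torch.Tensor:
    # hidden_states: per-layer tensors of shape (n_subwords, dim), e.g. from a
    # Hugging Face encoder called with output_hidden_states=True.
    # word_ids: for each subword, the index of the word it belongs to.
    subword_emb = torch.stack(hidden_states[-n_layers:]).mean(dim=0)  # mean over last layers
    n_words = max(word_ids) + 1
    # Mean over the subwords that make up each word.
    return torch.stack([
        subword_emb[[i for i, w in enumerate(word_ids) if w == word]].mean(dim=0)
        for word in range(n_words)
    ])  # (n_words, dim)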

UDTube is built atop PyTorch and Lightning, and its command-line interface is made much simpler by the use of LightningCLI, a module which handles most of the interface work. One can configure the entire thing using YAML configuration files. CUDA GPUs and Apple silicon Macs (M1 etc., via MPS) can be used to accelerate training and inference, and should work out of the box. We also provide scripts to perform hyperparameter tuning using Weights & Biases. We believe that this model, with appropriate tuning, is probably state-of-the-art for morphological analysis in context.

UDTube is available under an Apache 2.0 license on GitHub and on PyPI.

References

Yakubov, D. 2024. How do we learn what we cannot say? Master’s thesis, CUNY Graduate Center.

Learned tokenization

Conventional (i.e., non-neural, pre-BERT) NLP stacks tend to use rule-based systems for tokenizing sentences into words. One good example is spaCy, which provides rule-based tokenizers for the languages it supports. I am sort of baffled that this is considered a good idea for languages other than English, since it seems to me that most languages need machine learning even for this task, to properly handle phenomena like clitics. If you like the spaCy interface—I admit it's very convenient—and work in Python, you may want to try the spacy-udpipe library, which exposes the UDPipe 1.5 models for Universal Dependencies 2.5; these in turn use learned tokenizers (and taggers, morphological analyzers, and dependency parsers, if you care) trained on high-quality Universal Dependencies data.
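
Usage is close to vanilla spaCy; a minimal sketch (the French example, with its clitics, is mine):

import spacy_udpipe

spacy_udpipe.download("fr")   # fetches the pretrained French UDPipe model
nlp = spacy_udpipe.load("fr")
doc = nlp("Donne-le-moi !")   # the learned tokenizer splits off the clitics
for token in doc:
    print(token.text, token.lemma_, token.pos_)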

Automatic batch sizing

Yoyodyne is my lab’s sequence-to-sequence library, intended to be a replacement for Fairseq, which is (essentially) abandonware. One matter of urgency for me in building Yoyodyne was to enable automatic hyperparameter tuning. This was accomplished by logging results to Weights & Biases (W&B). We can perform a random or Bayesian hyperparameter sweep using a “grid” specified via a YAML file, monitor progress on the W&B website, or even hit the API to grab the best hyperparameters. One issue that kept coming up, however, is that it is easy to hit out-of-memory (OOM) errors during this process. Here’s what we did about it:

OOMs are not purely due to model size: the model, batch, and gradients all need to fit into the same VRAM. PyTorch Lightning, which is a key part of the Yoyodyne backend, provides a function for automatically determining the maximum batch size that will not trigger an OOM. Basically, it works by starting with a low batch size (by default, 2), randomly drawing three batches of that size, and then attempting training (but in fact caching parameters so that no real training occurs). If this does not trigger an OOM, it doubles the batch size, and so on.1,2 You can enable this approach in Yoyodyne using the flag --find_batch_size max. You’d want to use this if you believe that a giant batch size is fine and you just want to fully saturate your GPU.
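
In Lightning itself, the feature looks roughly like this (a sketch; model and datamodule stand in for whatever LightningModule and LightningDataModule you are actually training):

from lightning.pytorch import Trainer
from lightning.pytorch.tuner import Tuner

trainer = Trainer()
tuner = Tuner(trainer)
# Starts at batch size 2, draws a few batches, and doubles until an OOM is
# triggered, reporting the largest size that fit ("power" scaling).
max_batch_size = tuner.scale_batch_size(model, datamodule=datamodule, mode="power")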

A slightly more sophisticated version of this, useful when you actually want to tune batch size, is enabled with the flag --find_batch_size opt. This too begins by doubling the size of randomly drawn batches, but here it halts once the doubling exceeds the value of the --batch_size flag. If the max batch size is larger than the requested size, the requested size is used as is; thus this acts as a soft check against OOMs. If, however, the max batch size is smaller than --batch_size, it instead solves for a new batch size: the largest batch size which is smaller than the max and which is a divisor of --batch_size. It then enables multiple rounds of gradient accumulation per update,3 thus losslessly simulating the desired batch size while using as much VRAM as possible. I can assure you this is a killer feature for neural network tuning.
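
A brute-force sketch of that divisor search (see endnote 3; the function below is illustrative, not Yoyodyne's actual implementation):

def accumulation_schedule(desired: int, max_size: int) -> tuple[int, int]:
    # Finds (accumulation_steps, batch_size) such that
    # accumulation_steps * batch_size == desired and batch_size is the
    # largest divisor of the desired size not exceeding max_size.
    for batch_size in range(min(desired, max_size), 0, -1):
        if desired % batch_size == 0:
            return desired // batch_size, batch_size
    raise ValueError("batch sizes must be positive")

# E.g., with a desired batch size of 1024 but only room for 384 per step:
assert accumulation_schedule(1024, 384) == (4, 256)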

Endnotes

  1. This is a little imprecise, and one can refine it by doing a binary search, but in practice it’s not worth the effort when working with ragged data.
  2. Whatever batch size was requested with the --batch_size flag is ignored.
  3. More formally, given desired batch size $b$ and max batch size $n'$, it finds the pair $a, n$ such that $an = b$, $n \leq n'$, and $a$ is the smallest (equivalently, $n$ is the largest) integer satisfying these constraints. This is computed via brute force; my implementation of an elegant solution based on the prime factorization was a bit slower.

The Unicoder

I have long encouraged students to turn software demos (which work on their laptop, in their terminal, and maybe nowhere else) into simple web apps. Years ago I built a demo of what this might look like, using Python’s Flask library. The entire app is under 200 lines of Python (and jinja2 template), plus a bit of static HTML and CSS.

It turns out this little demonstration is actually quite useful for my research. For any given string, it gives you its full decomposition into Unicode codepoints, with optional Unicode normalization, whitespace stripping, and case-folding. This is very useful for debugging.
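
The core functionality is little more than a few standard-library calls; roughly (a sketch, not the app's actual code):

import unicodedata

def codepoints(s: str, form: str = "NFC", strip: bool = True, casefold: bool = True) -> None:
    # Optionally strips whitespace and case-folds, then normalizes and lists
    # each codepoint with its name.
    if strip:
        s = s.strip()
    if casefold:
        s = s.casefold()
    s = unicodedata.normalize(form, s)
    for char in s:
        print(f"U+{ord(char):04X}\t{unicodedata.name(char, 'UNKNOWN')}")

codepoints(" Şebnem\u00a0")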

The Unicoder, as it is called, is hosted on the free tier of Glitch. [Edit: it is now on Render.] (It used to also be on Heroku, but Salesforce is actively pushing people off that very useful platform.) Because of that, it takes about 10 seconds to “start up” (i.e., I assume the workers are put into some kind of hibernation mode) if it hasn’t been used in the last half hour or so. But, it’s very, very useful.

Debugging CUDA indexing errors

Perhaps you’ve seen pages of the following scary error:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [99,0,0], thread: [115,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

It turns out there is a relatively simple way to figure out what the indexing issue is. The internet suggests prepending

CUDA_LAUNCH_BLOCKING=1

to your command, but this doesn’t seem to help much either. There is a simpler solution: run whatever you’re doing on CPU. It’ll give you much nicer errors.
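
For instance, here is a toy reproduction (not from any real model) of the kind of bug that trips the assert:

import torch

embedding = torch.nn.Embedding(10, 4)
indices = torch.tensor([3, 12])  # 12 is out of bounds for a 10-row table

try:
    embedding(indices)  # on a CUDA device, this dies with the assert above
except IndexError as err:
    print(err)  # on CPU: "index out of range in self"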

The next toolkit 2: electric boogaloo

I just got back from the excellent Workshop on Model Theoretic Representations in Phonology. While I am not exactly a member of the “Delaware school”, i.e., the model-theoretic phonology (MTP) crowd, I am a big fan. In my talk, I contrasted the model-theoretic approach to an approach I called the black box approach, using neural networks and program synthesis solvers as examples of the latter. I likened the two styles to neat vs. scruffy, better is better vs. worse is better, rationalists vs. empiricists, and cowboys vs. aliens.

One lesson I drew from this comparison is the need for MTPists to develop high-quality software—the next toolkit 2. I didn’t say much during my talk about what I imagine this to be like, so I thought I’d leave my thoughts here. Several people—Alëna Aksënova, Hossep Dolatian, and Dakotah Lambert, for example—have developed interesting MTP-oriented libraries. While I do not want to give short shrift to their work, I think there are two useful models for the next toolkit: (my own) Pynini and PyTorch. Here is what I see as the key features:

  1. They are ordinary Python on the front-end. Of course, both have a C++ back-end, and PyTorch has a rarely used C++ API, but that’s purely a matter of performance; both have been slowly moving Python code into the C++ layer over the course of their development. The fact of the matter is that in 2022, just about anyone who can code at all can do so in Python.
  2. While both are devilishly complex, their design follows the principle of least surprise; there is only a bit of what Pythonistas call exuberant syntax (Pynini’s use of the @ operator, PyTorch’s use of _ to denote in-place methods).
  3. They have extensive documentation (both in-module and up to book length).
  4. They have extensive test suites.
  5. They are properly packaged and can be installed via PyPI (i.e., via pip) or Conda-Forge (via conda).
  6. They have corporate backing.

I understand that many in the MTP community are naturally—constitutionally, even—drawn to functional languages and literate programming. I think this should not be the initial focus. It should be ease of use, and for that it is hard to beat ordinary Python in 2022. Jupyter/Colab support is a great idea, though, and might satisfy the literate programming itch too.

re.compile is otiose

Unlike its cousins Perl and Ruby, Python has no literal syntax for regular expressions. Whereas one can express the sheep language /baa+/ with a simple forward-slashed literal in Perl and Ruby, in Python one has to compile it using the function re.compile, which produces objects of type re.Pattern. Such objects have various methods for string matching.

import re

sheep = re.compile(r"baa+")
assert sheep.match("baaaaaaaa")

Except, one doesn’t actually have to compile regular expressions at all, as the documentation explains:

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.

What this means is that in the vast majority of cases, re.compile is otiose (i.e., unnecessary). One can just define expression strings, and pass them to the equivalent module-level functions rather than using the methods of re.Pattern objects.

sheep = r"baa+"
assert re.match(sheep, "baaaaaaaa")

This, I would argue, is slightly easier to read, and certainly no slower. It also makes type annotations a bit more convenient, since str is easier to type than re.Pattern.

Now, I am sure there is some usage pattern which would favor explicit re.compile, but I have not encountered one in code worth profiling.

“Python” is a proper name

In just the last few days I’ve seen a half dozen instances of the phrase python package or python script in published academic work. It’s disappointing to me that this got by the reviewers, action editors, and copy editors, since Python is obviously a proper name and should be in titlecase. (The fact that the interpreter command is python is irrelevant.)

Python hasn’t changed much

Since successfully sticking the landing of the migration from Python 2 (circa 3.6 or so), Python has been on a tear, with a large number of small releases. These releases have cleaned up some warts in the “batteries included” modules and made huge improvements to the performance of the parser and run-time. A few minor language features have also been added: for instance, f-strings (which I like a lot) and the so-called walrus operator, mostly used for regular expression matching.
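
For instance, a toy example of those two features together:

import re

log_line = "loss=0.42 at step 1300"
# The walrus operator binds the match object inside the condition,
# and an f-string formats the captured group.
if (match := re.search(r"loss=(\d+\.\d+)", log_line)):
    print(f"current loss: {float(match.group(1)):.3f}")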

When Python improvements (and they are improvements, IMO) are discussed on sites like Hacker News, there is a lot of fear and trepidation. I am not sure why. These are rather minor changes, and they will take years to diffuse through the Python community. Overall, very little has changed.