Hugging Face is, among other things, a platform for obtaining pre-trained neural network models. We use their tokenizers and transformers Python libraries in a number of projects. While these have a bit more abstraction than I like, and are arguably over-featured, they are fundamentally quite good and make it really easy to, e.g., add a pre-trained encoder. I also appreciate that the tokenizers are mostly compiled code (they're Rust extensions, apparently), which in practice means that tokenization is I/O-bound rather than CPU-bound.
My use case mostly involves loading Hugging Face transformers and their tokenizers and using their encoding layers for fine-tuning. To load a model in transformers, one uses the function transformers.AutoModel.from_pretrained and provides the name of the model on Hugging Face as a string argument. If the model exists but you don't already have a local copy, Hugging Face will automatically download it for you (and stash the assets in some hidden directory). One can do something similar with transformers.AutoTokenizer, or one can request the tokenizer from the model instance.
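A minimal sketch of that pattern, with an arbitrary model name standing in for whatever the user might request:

```python
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-cased"  # any Hugging Face model name would do here
# Downloads and caches the weights the first time this is called.
model = AutoModel.from_pretrained(model_name)
# The tokenizer is fetched the same way, by name.
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Encode a sentence and run it through the encoder.
batch = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model(**batch)
```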
Now you might think that this would make it easy to, say, write a command-line tool where the user can specify any Hugging Face model, but unfortunately, you'd be wrong. First off, a lot of models, including so-called token-free ones, lack a tokenizer. Why doesn't ByT5, for instance, provide as its tokenizer a trivial Rust (or even Python) function that returns bytes? In practice, one cannot support arbitrary Hugging Face models because one cannot count on them having a tokenizer. In this case, I see no alternative but to keep a list of supported models that lack their own tokenizer; such a list is necessarily incomplete because the model hub continues to grow.
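A sketch of what such a list ends up looking like; the model names and the byte-level fallback here are illustrative stand-ins, not anything Hugging Face actually provides:

```python
from transformers import AutoTokenizer

# Hypothetical registry of models known to ship without a usable tokenizer.
NO_TOKENIZER = {"google/byt5-small", "google/byt5-base"}


def load_tokenizer(model_name: str):
    """Returns a tokenizer (or a byte-level stand-in) for model_name."""
    if model_name in NO_TOKENIZER:
        # A trivial byte-level fallback of the sort ByT5 could have shipped.
        return lambda text: list(text.encode("utf-8"))
    return AutoTokenizer.from_pretrained(model_name)
```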
A similar problem comes with how parameters of the models are named. Most models are trained with dropout and support a dropout parameter, but the name of this parameter is inconsistent from model to model. In UDTube, for instance, dropout is a global parameter applied to each hidden layer of the encoder (which requires us to access the guts of the Hugging Face model), and then again to the contextual subword embeddings just before they are pooled into word embeddings. Most of the models we've looked at call the encoder's dropout probability hidden_dropout_prob, but others call it dropout or dropout_rate. Because of this, we have to maintain a module that keeps track of what the hidden-layer dropout probability parameter is called.
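A sketch of the kind of bookkeeping this requires; the mapping below covers only a few model types and is illustrative rather than UDTube's actual module:

```python
# Maps a Hugging Face model_type to the name its config uses for the
# hidden-layer dropout probability; each supported encoder needs an entry.
HIDDEN_DROPOUT_PARAM = {
    "bert": "hidden_dropout_prob",
    "roberta": "hidden_dropout_prob",
    "bart": "dropout",
    "t5": "dropout_rate",
}


def set_hidden_dropout(config, probability: float) -> None:
    """Sets the hidden-layer dropout probability on a model config."""
    try:
        name = HIDDEN_DROPOUT_PARAM[config.model_type]
    except KeyError:
        raise ValueError(f"unsupported model type: {config.model_type}")
    setattr(config, name, probability)
```

Since the dropout modules are built when the model is constructed, the override has to reach from_pretrained (e.g., via the modified config) rather than being set afterwards.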
I think this is basically a failure of curation. Hugging Face's community managers should be out there fixing these gaps and inconsistencies, or at least publishing standards for such things. The company is valued at $4.5bn; I would argue this is at least as important as their efforts with model cards and the like.
Everything they do is quick and dirty, then left under-maintained for years. Their tutorials are inconsistent with their own packages and often include algorithmic errors. And here's a fun rant about tokenizers.
Some nightmarish details in that rant, thanks.