Hugging Face needs better curation

Hugging Face is, among other things, a platform for obtaining pre-trained neural network models. We use their tokenizers and transformers Python libraries in a number of projects. While these have a bit more abstraction than I like, and are arguably over-featured, they are fundamentally quite good and make it really easy to, e.g., add a pre-trained encoder. I also appreciate that the tokenizers are mostly compiled code (they're Rust extensions, apparently), which in practice means that tokenization is IO-bound rather than CPU-bound.

My use case mostly involves loading Hugging Face transformers and their tokenizers and using their encoding layers for fine-tuning. To load a model in transformers, one uses the function transformers.AutoModel.from_pretrained and provides the name of the model on Hugging Face as a string argument. If the model exists but you don't already have a local copy, Hugging Face will automatically download it for you (stashing the assets in some hidden directory). One can do something similar with transformers.AutoTokenizer, or one can request the tokenizer from the model instance.
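Concretely, the pattern looks something like the following minimal sketch; the model name is just an example, and any hub model works the same way.

```python
# Download (and cache) a pre-trained encoder and its tokenizer by name.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```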

Now you might think that this would make it easy to, say, write a command-line tool where the user can specify any Hugging Face model, but unfortunately, you'd be wrong. First off, a lot of models, including so-called token-free ones, lack a tokenizer. Why doesn't ByT5, for instance, provide as its tokenizer a trivial Rust (or Python, even) function that returns bytes? In practice, one cannot support arbitrary Hugging Face models because one cannot count on them having a tokenizer. In this case, I see no alternative but to keep a list of supported models that lack their own tokenizer, along the lines of the sketch below. Such a list is necessarily incomplete because the model hub continues to grow.
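The workaround looks roughly like this; the registry contents and the helper names are hypothetical, and the real list would be longer.

```python
from transformers import AutoTokenizer

# Hand-curated (and necessarily incomplete) registry of hub models known
# to lack a usable tokenizer.
TOKENIZERLESS = {"google/byt5-small", "google/byt5-base"}


def bytes_tokenize(text: str) -> list[int]:
    # The trivial "tokenizer" a byte-level model could ship: one ID per
    # UTF-8 byte.
    return list(text.encode("utf-8"))


def get_tokenizer(model_name: str):
    # Fall back to the byte-level function for models we know lack one.
    if model_name in TOKENIZERLESS:
        return bytes_tokenize
    return AutoTokenizer.from_pretrained(model_name)
```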

A similar problem arises with how model parameters are named. Most models are trained with dropout and support a dropout parameter, but the name of this parameter is inconsistent from model to model. In UDTube, for instance, dropout is a global parameter: it is applied to each hidden layer of the encoder (which requires us to access the guts of the Hugging Face model), and then again to the contextual subword embeddings just before they're pooled into word embeddings. Most of the models we've looked at call the dropout probability of the encoder hidden_dropout_prob, but others call it dropout or dropout_rate. Because of this, we have to maintain a module which keeps track of what the hidden-layer dropout probability parameter is called, roughly as sketched below.
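The module amounts to something like the following. The attribute names are real config fields (BERT-style models use hidden_dropout_prob, T5 uses dropout_rate, GPT-2 uses resid_pdrop), but the table here is illustrative and, again, necessarily incomplete.

```python
# Maps a transformers config's model_type to the name of its hidden-layer
# dropout probability parameter.
DROPOUT_PARAM = {
    "bert": "hidden_dropout_prob",
    "roberta": "hidden_dropout_prob",
    "t5": "dropout_rate",
    "gpt2": "resid_pdrop",
}


def set_encoder_dropout(config, p: float) -> None:
    # Look up the model-specific parameter name and set it on the config
    # before the model is instantiated from it.
    setattr(config, DROPOUT_PARAM[config.model_type], p)
```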

I think this is basically a failure of curation. Hugging Face community managers should be out there fixing these gaps and inconsistencies, or at the very least publishing standards for such things. They're valued at $4.5 billion; I would argue this is at least as important as their efforts with model cards and the like.

The dark triad professoriate

[I once again need to state that I am not responding to any person or recent event. But remember the Law of the Subtweet: if you see yourself in some negative description but are not explicitly named, you can just keep that to yourself.]

There is a long debate about the effects of birth order on stable personality traits. A recent article in PNAS1 claims the effects are near null once proper controls are in place; the commentary it's paired with suggests the whole thing is a zombie theory. Anyways, one of the claims I remember hearing was that older siblings were more likely to exhibit subclinical "Dark Triad" (DT) traits: Machiavellianism, narcissism, and psychopathy. Alas, this probably isn't true, but it is easy to tell a story about why this might be adaptive. Time for some game theory. In a zero-sum scenario, if you're the most mature (and biggest) of your siblings, you probably have more to gain from non-cooperative behaviors, and DT traits ought to select for said behaviors. A concrete (if contrived) example: you can either hog or share the toy, and the eldest is far more likely to get away with hogging.
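To make the asymmetry concrete, here is a toy payoff matrix with invented numbers (rows: the eldest's move; columns: the younger sibling's; cells: (eldest, younger) payoffs). With these payoffs, hogging strictly dominates for the eldest, while the younger sibling never does better than by sharing.

```latex
\[
\begin{array}{r|cc}
             & \text{share} & \text{hog} \\ \hline
\text{share} & (2, 2)       & (0, 1)     \\
\text{hog}   & (3, 0)       & (1, 0)
\end{array}
\]
```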

I wonder whether the scarcity of faculty positions (even if overstated, and it is) might also select for dark triad traits. I know plenty of evil Boomer professors, but not many who are actually DT, and if I had to guess, these traits (particularly the narcissism) are much more common in younger (Gen X and Millennial) cohorts. Then again, this could be age-grading, since anti-social behaviors peak in adolescence and decline afterwards.

Endnotes

  1. This is actually a “direct submission”, not one of those mostly-phony “Prearranged Editor” pieces. So it might be legit.