Professional organizations in linguistics

I am a member of the Linguistic Society of America (LSA) and the Association for Computational Linguistics (ACL), US-based professional organizations for linguists and computational linguists, respectively. (More precisely, I am usually a member. I think my memberships both lapsed during the pandemic and I renewed once I started going to their respective conferences again.)

I attend LSA meetings when they’re conveniently located (next year’s is in Philly and we’re doing a workshop on Logical Phonology), and roughly one ACL-hosted meeting a year as well. As a (relatively) senior scholar I don’t find the former that useful (the scholarship is hit-or-miss and the LSA is dominated by a pandemonium of anti-generativists who are best just ignored), but the networking can be good. The *CL meetings tend to have more relevant science (or at least they did before prompt engineering…) but they’re expensive and rarely held in the Acela corridor.

While the LSA and the ACL are called professional organizations, their real purview is mostly to host conferences. The LSA does some other stuff of course: they run Language, the institutes, and occasionally engage in lobbying, etc. But they do not have much to say about the lives of workers in these fields. The LSA doesn’t tell you about the benefits of unionizing your workplace. The ACL doesn’t give you ethics tips about what to do if your boss wants you to spy on protestors.  They don’t really help you get jobs in these fields either. They could; they just don’t.

There is an interesting contrast here with another professional organization I was once a member of: the Institute of Electrical and Electronics Engineers (IEEE, pronounced “aye Tripoli”). Obviously, I am not an electrical engineer, but electrical engineering was historically the home of speech technology research, and the IEEE’s ASRU and SLT conferences are quite good in that field. During the year or so I was an IEEE member, I received their monthly magazine. Roughly half of it was just stories of general interest to electrical engineers; one that stuck with me argued that the laws of physics preclude the existence of the “directed energy weapons” claimed to cause Havana Syndrome. But the other half was specifically about the professional life of electrical engineers, including pieces on interviewing, the labor market outlook, and working conditions.

Imagine if Language had a quarterly professional column or if the ACL Anthology had a blog-post series…

Hiring season

It’s hiring season and your dean has approved your linguistics department for a new tenure line. Naturally, you’re looking to hire an exciting young “hyphenate” type who can, among other things, strengthen your computational linguistics offerings, help students transition into industry roles, and perhaps even incorporate generative AI into more mundane parts of your curriculum (sigh). There are two problems I see with this. First, most people applying for these positions don’t actually have relevant industry experience, so while they can certainly teach your students to code, they don’t know much about industry practices. Second, an awful lot of them would probably prefer to be full-time software engineers, all things considered, and are going to take leave—if not quit outright—if the opportunity ever becomes available. (“Many such cases.”) The only way to avoid this scenario, as I see it, is to find people who have already been software engineers and no longer want to be; fortunately, there are several of us.

Hugging Face needs better curation

Hugging Face is, among other things, a platform for obtaining pre-trained neural network models. We use their tokenizers and transformers Python libraries in a number of projects. While these have a bit more abstraction than I like, and are arguably over-featured, they are fundamentally quite good and make it really easy to, e.g., add a pre-trained encoder. I also appreciate that the tokenizers are mostly compiled code (they’re Rust extensions, apparently), which in practice means that tokenization is IO-bound rather than CPU-bound.

My use case mostly involves loading Hugging Face transformers and their tokenizers and using their encoding layers for fine-tuning. To load a model in transformers, one uses the function transformers.AutoModel.from_pretrained and provides the name of the model on Hugging Face as a string argument. If the model exists but you don’t already have a local copy, Hugging Face will automatically download it for you (stashing the assets in a hidden directory). One can do something similar with transformers.AutoTokenizer, or one can request the tokenizer from the model instance.
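To make this concrete, here is a minimal sketch of the pattern; the model name is just an example, and any encoder on the hub would work similarly.

from transformers import AutoModel, AutoTokenizer

# Downloads (and caches) the model and its tokenizer if not already present.
model = AutoModel.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Encode a sentence and run it through the encoder.
batch = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model(**batch)
print(outputs.last_hidden_state.shape)  # (1, number of subwords, hidden size)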

Now you might think that this would make it easy to, say, write a command-line tool where the user can specify any Hugging Face model, but unfortunately, you’d be wrong. First off, a lot of models, including so-called token-free ones, lack a tokenizer. Why doesn’t ByT5, for instance, provide as its tokenizer a trivial Rust (or even Python) function that returns bytes? In practice, one cannot support arbitrary Hugging Face models because one cannot count on them having a tokenizer. In this case, I see no alternative but to keep a list of supported models that lack their own tokenizer. Such a list is necessarily incomplete because the model hub continues to grow.
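A sketch of the kind of workaround I have in mind follows; the model name in the registry is a placeholder, and which models actually belong there has to be determined empirically.

from transformers import AutoTokenizer

# Placeholder registry of models known to lack a usable tokenizer on the hub.
BYTE_LEVEL_MODELS = {"some-org/byte-level-model"}


def get_tokenizer(model_name: str):
    if model_name in BYTE_LEVEL_MODELS:
        # Trivial byte "tokenizer": UTF-8 bytes, no learned vocabulary.
        return lambda text: list(text.encode("utf-8"))
    return AutoTokenizer.from_pretrained(model_name)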

A similar problem comes from how the parameters of models are named. Most models are trained with dropout and support a dropout parameter, but the name of this parameter is inconsistent from model to model. In UDTube, for instance, dropout is a global parameter applied to each hidden layer of the encoder (which requires us to access the guts of the Hugging Face model), and then again to the contextual subword embeddings just before they’re pooled into word embeddings. Most of the models we’ve looked at call the dropout probability of the encoder hidden_dropout_prob, but others call it dropout or dropout_rate. Because of this, we have to maintain a module which keeps track of what the hidden-layer dropout probability parameter is called.
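The following is a schematic version of that kind of module, not UDTube’s actual code; the mapping shown is illustrative and necessarily incomplete.

from transformers import AutoConfig, AutoModel

# Maps Hugging Face model types to the name their config uses for the
# hidden-layer dropout probability (illustrative, not exhaustive).
HIDDEN_DROPOUT_KEYS = {
    "bert": "hidden_dropout_prob",
    "roberta": "hidden_dropout_prob",
    "t5": "dropout_rate",
    "bart": "dropout",
}


def load_encoder_with_dropout(model_name: str, dropout: float):
    config = AutoConfig.from_pretrained(model_name)
    key = HIDDEN_DROPOUT_KEYS.get(config.model_type)
    if key is None:
        raise ValueError(f"Unknown dropout parameter name for {config.model_type}")
    setattr(config, key, dropout)
    return AutoModel.from_pretrained(model_name, config=config)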

I think this is basically a failure of curation. Hugging Face community managers should be out there fixing these gaps and inconsistencies, or at least publishing standards for such things. They’re valued at $4.5 billion; I would argue this is at least as important as their efforts with model cards and the like.

Python ellipses considered harmful

Python has a conventional object-oriented design, but it was slowly grafted onto the language, something which shows from time to time. Arguably, you see this in the convention that instance methods need self passed as their first argument, and class methods need cls as their first argument. Another place you see it is in how Python does abstract classes. First, one can use definitions in the built-in abc module, proposed in PEP 3119, to declare a class as abstract. But in practice most Pythonistas make a class abstract by declaring unimplemented instance methods. There are two conventional ways to do this, either with ellipses or by raising an exception, as illustrated below.

class AbstractCandyFactory:
    # Ellipsis style: the abstract method simply has no body.
    def make_candy(self, batch_size: int): ...


class AbstractCandyFactory:
    # Exception style: calling the unimplemented method raises.
    def make_candy(self, batch_size: int):
        raise NotImplementedError

The latter is a bit more verbose, but there is actually a very good reason to prefer it to the former, elliptical version. With the exception version, if one forgets to implement make_candy—say, in a concrete subclass like SnickersFactory(AbstractCandyFactory)—an informative exception will be raised when make_candy is called on a SnickersFactory instance. In the elliptical form, however, the inherited method will be called, and it will of course do nothing because it has no body. This will likely cause errors down the road, but they will not be nearly as easy to track down because there is nothing to directly link the issue to the failure to override the method. For this reason alone, I consider the use of ellipses to declare abstract instance methods harmful.
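A small demonstration of the failure mode, reusing the hypothetical candy factory classes from above:

class AbstractCandyFactory:
    def make_candy(self, batch_size: int): ...


class SnickersFactory(AbstractCandyFactory):
    # Oops: we forgot to override make_candy.
    pass


factory = SnickersFactory()
# With the elliptical base class this silently returns None; had the base
# class raised NotImplementedError instead, this line would fail loudly,
# pointing directly at the missing override.
print(factory.make_candy(100))  # None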

Announcing UDTube

In collaboration with CUNY master’s program graduate Daniel Yakubov, we have recently open-sourced UDTube, our neural morphological analyzer. UDTube performs what is sometimes called morphological analysis in context: it provides morphological analyses—coarse POS tagging, more-detailed morphosyntactic tagging, and lemmatization—to whole sentences using nearby words as context.

The UDTube model, developed in Yakubov 2024, is quite simple: it uses a pre-trained Hugging Face encoder to compute subword embeddings. We then mean-pool the last few layers of these embeddings, and then mean-pool the subword embeddings of any word that corresponds to multiple subwords. The resulting encoding of the input is then fed to separate classifier heads for the different tasks (POS tagging, etc.). During training we fine-tune the pre-trained encoder in addition to fitting the classifier heads, and we make it possible to set separate optimizers, learning rates, and schedulers for the encoder and classifier modules.
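Schematically, the pooling works something like the sketch below (this is not UDTube’s actual code). It assumes hidden_states is the per-layer tuple returned by a Hugging Face encoder called with output_hidden_states=True, and word_ids is the subword-to-word mapping produced by the tokenizer (None for special tokens), for a single sentence.

import torch


def pool(hidden_states, word_ids, num_layers: int = 4) -> torch.Tensor:
    # Mean-pool the last few layers: (number of subwords, hidden size).
    subwords = torch.stack(hidden_states[-num_layers:]).mean(dim=0)[0]
    # Mean-pool the subwords belonging to each word.
    words, buffer, current = [], [], None
    for index, word_id in enumerate(word_ids):
        if word_id is None:
            continue
        if word_id != current and buffer:
            words.append(torch.stack(buffer).mean(dim=0))
            buffer = []
        current = word_id
        buffer.append(subwords[index])
    if buffer:
        words.append(torch.stack(buffer).mean(dim=0))
    return torch.stack(words)  # (number of words, hidden size).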

UDTube is built atop PyTorch and Lightning, and its command-line interface is made much simpler by the use of LightningCLI, a module which handles most of the interface work. One can configure the entire thing using YAML configuration files. CUDA GPUs and Apple silicon Macs (M1 etc., via MPS) can be used to accelerate training and inference, and should work out of the box. We also provide scripts for performing hyperparameter tuning with Weights & Biases. We believe that this model, with appropriate tuning, is probably state-of-the-art for morphological analysis in context.
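For the curious, a LightningCLI entry point looks roughly like the following; the class names here are hypothetical stand-ins, not UDTube’s actual ones. One then runs, e.g., python main.py fit --config config.yaml.

from lightning.pytorch.cli import LightningCLI

# Hypothetical model and data module classes standing in for UDTube's own.
from my_package import TaggingModel, ConlluDataModule


def main() -> None:
    # LightningCLI builds the argument parser, reads any YAML file passed via
    # --config, and instantiates the trainer, model, and data module from it.
    LightningCLI(TaggingModel, ConlluDataModule)


if __name__ == "__main__":
    main()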

UDTube is available under an Apache 2.0 license on GitHub and on PyPI.

References

Yakubov, D. 2024. How do we learn what we cannot say? Master’s thesis, CUNY Graduate Center.

Learned tokenization

Conventional (i.e., non-neural, pre-BERT) NLP stacks tend to use rule-based systems for tokenizing sentences into words. One good example is spaCy, which provides rule-based tokenizers for the languages it supports. I am somewhat baffled that this is considered a good idea for languages other than English, since it seems to me that most languages need machine learning even for this task, in order to properly handle phenomena like clitics. If you like the spaCy interface—I admit it’s very convenient—and work in Python, you may want to try the spacy-udpipe library, which exposes the UDPipe 1.5 models for Universal Dependencies 2.5; these in turn use learned tokenizers (and taggers, morphological analyzers, and dependency parsers, if you care) trained on high-quality Universal Dependencies data.
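If memory serves, basic usage looks something like this (the language code is just an example):

import spacy_udpipe

spacy_udpipe.download("en")  # Fetches the pre-trained UDPipe model.
nlp = spacy_udpipe.load("en")
doc = nlp("The cats' toys weren't cheap.")
for token in doc:
    print(token.text, token.lemma_, token.pos_)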

Automatic batch sizing

Yoyodyne is my lab’s sequence-to-sequence library, intended to be a replacement for Fairseq, which is (essentially) abandonware. One matter of urgency for me in building Yoyodyne was to enable automatic hyperparameter tuning. This was accomplished by logging results to Weights & Biases (W&B). We can perform a random or Bayesian hyperparameter sweep using a “grid” specified via a YAML file, monitor progress on the W&B website, or even hit the API to grab the best hyperparameters. One issue that kept coming up, however, is that it is easy to hit out-of-memory (OOM) errors during this process. Here’s what we did about it:

OOMs are not purely due to model size: the model, batch, and gradients all need to fit into the same VRAM. PyTorch Lightning, which is a key part of the Yoyodyne backend, provides a function for automatically determining the maximum batch size that will not trigger an OOM. Basically, it works by starting with a low batch size (by default, 2), randomly drawing three batches of that size, and then attempting training (but in fact caching parameters so that no real training occurs). If this does not trigger an OOM, it doubles the batch size, and so on.1,2 You can enable this approach in Yoyodyne using the flag --find_batch_size max. You’d want to use this if you believe that a giant batch size is fine and you just want to fully saturate your GPU.
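Under the hood this is Lightning’s batch size finder; roughly, one could call it directly as sketched here, where model and datamodule stand for any LightningModule and LightningDataModule.

from lightning.pytorch import LightningDataModule, LightningModule, Trainer
from lightning.pytorch.tuner import Tuner


def find_max_batch_size(model: LightningModule, datamodule: LightningDataModule) -> int:
    tuner = Tuner(Trainer())
    # Starts at init_val and doubles until an out-of-memory error is hit.
    return tuner.scale_batch_size(model, datamodule=datamodule, mode="power", init_val=2)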

A slightly more sophisticated version of this, useful when you actually want to tune batch size, is enabled with the flag --find_batch_size opt. This again begins by doubling the size of randomly drawn batches, but here it halts once the doubling exceeds the value of the --batch_size flag. If the max batch size is at least as large as the requested size, the requested size is used as is; thus this acts as a soft check against OOMs. If, however, the max batch size is smaller than --batch_size, it instead solves for a new batch size: the largest batch size which is no larger than the max and which is a divisor of --batch_size. It then enables multiple rounds of gradient accumulation per update,3 thus losslessly simulating the desired batch size while using as much VRAM as possible. I can assure you this is a killer feature for neural network tuning.
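A brute-force version of that computation (see endnote 3) might look like this:

def solve_batch_size(desired: int, max_size: int) -> tuple[int, int]:
    """Returns the actual batch size and the number of accumulation rounds."""
    for batch_size in range(min(desired, max_size), 0, -1):
        if desired % batch_size == 0:
            return batch_size, desired // batch_size
    raise ValueError("Batch sizes must be positive")


# E.g., a desired batch size of 128 with room for at most 48 yields batches of
# 32 with 4 rounds of gradient accumulation: 32 * 4 = 128.
print(solve_batch_size(128, 48))  # (32, 4)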

Endnotes

  1. This is a little imprecise, and one can refine it by doing a binary search, but in practice it’s not worth the effort when working with ragged data.
  2. Whatever batch size was requested with the --batch_size flag is ignored.
  3. More formally, given a desired batch size $b$ and a max batch size $n'$, it finds the largest $n \leq n'$ and the smallest $a$ such that $an = b$. This is computed via brute force; my implementation of an elegant solution based on the prime factorization was a bit slower.

“Segmented languages”

In a recent paper (Gorman & Sproat 2023), we complain about the conflation of writing systems with the languages they are used to write, highlighting the nonsense underlying common expressions like “right-to-left language”, “syllabic language”, or “ideographic language” found in the literature. Thus we were surprised to find the following:

Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate (CER), instead of WER… (Gemini Team 2023:18)

Since the most salient feature of the writing systems used to write Mandarin, Japanese, Korean, and Thai is the absence of segmentation information (e.g., whitespace used to indicate word boundaries), presumably the authors mean that the data they are using has already been pre-segmented (by some unspecified means). But this is a property not of these languages but of the available data.

[h/t: Richard Sproat]

References

Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint 2312.11805. URL: https://arxiv.org/abs/2312.11805.

Gorman, K. and Sproat, R. 2023. Myths about writing systems in speech & language technology. In Proceedings of the Workshop on Computation and Written Language, pages 1-5.

Self-taught C++

I have recently fielded a few requests from students about self-directed learning of C++, so I thought I’d combine my notes here. First, compared to Python, for instance, C++ is a very large language, both in terms of syntactic richness and the size of its standard library. Second, it has been popular for at least two decades longer than Python, so there is a lot of really dated material out there that doesn’t incorporate the huge positive changes made to the language in C++11.

I recommend two books. First and most important is the 4th edition of (C++ creator) Bjarne Stroustrup’s The C++ Programming Language. This is a gigantic hardback textbook that basically covers everything you need to know through C++11. It does not cover C++14, C++17, C++20, or C++23, but those are all pretty minor changes by comparison, and you’ll catch on. Stroustrup is actually a pretty good technical writer, too. (If a 5th edition ever comes out, get that one instead.) The other book I recommend is Scott Meyers’ Effective Modern C++, a smaller volume which focuses on the newer C++11 and C++14 features. Meyers’ book is structured as a series of essays about when and how to incorporate these new features.

There are two other things I recommend to aspiring C++ users. The first is a good style guide: C++ just isn’t very opinionated, but good code is. I definitely recommend the widely used Google C++ style guide, but I’m sure there are other good ones out there. The second is Godbolt, an incredible website that combines the functionality of a pastebin with an in-browser compiler.

Another quote from Ludlow

Indeed, when we look at other sciences, in nearly every case, the best theory is arguably not the one that reduces the number of components from four to three, but rather the theory that allows for the simplest calculations and greatest ease of use. This flies in the face of the standard stories we are told about the history of science. […] This way of viewing simplicity requires a shift in our thinking. It requires that we see simplicity criteria as having not so much to do with the natural properties of the world, as they have to do with the limits of us as investigators, and with the kinds of theories that simplify the arduous task of scientific theorizing for us. This is not to say that we cannot be scientific realists; we may very well suppose that our scientific theories approximate the actual structure of reality. It is to say, however, that barring some argument that “reality” is simple, or eschews machinery, etc., we cannot suppose that there is a genuine notion of simplicity apart from the notion of “simple for us to use.” […] Even if, for metaphysical reasons, we supposed that reality must be fundamentally simple, every science (with the possible exception of physics) is so far from closing the book on its domain it would be silly to think that simplicity (in the absolute sense) must govern our theories on the way to completion. Whitehead (1955, 163) underlined just such a point.

Nature appears as a complex system whose factors are dimly discerned by us. But, as I ask you, Is not this the very truth? Should we not distrust the jaunty assurance with which every age prides itself that it at last has hit upon the ultimate concepts in which all that happens can be formulated. The aim of science is to seek the simplest explanations of complex facts. We are apt to fall into the error of thinking that the facts are simple because simplicity is the goal of our quest. The guiding motto in the life of every natural philosopher should be, Seek simplicity and distrust it.

(Ludlow 2011:158-160)

References

Ludlow, P. 2011. The Philosophy of Generative Linguistics. Oxford University Press.
Whitehead, A. N. 1955. The Concept of Nature. Cambridge University Press.