Markdown isn’t good enough to replace LaTeX

I am generally sympathetic with calls to replace LaTeX with something else. LaTeX has terrible defaults, Unicode and font support is a constant problem, the syntax is deliberately obfuscatory, and actual generation is painfully slow (probably because the whole thing is a big pasta factory of interpreted code instead of a single static library).

But at the same time, I don’t think Markdown is really good enough to replace LaTeX. Of course one can use Pandoc to generate LaTeX from Markdown notes, and its output is often a decent thing to copy and paste into your LaTeX document. But Markdown just doesn’t solve any of the issues I mention, except making the syntax a tad more WYSIWYG than it would be otherwise. And Markdown is quite a bit worse at one thing: the extended syntax for tables is very hard to key in and still much less expressive than LaTeX’s actually pretty rational tabular environment.

Python hasn’t changed much

Since successfully sticking the landing for the migration from Python 2 (circa 3.6 or so), Python has been on a tear, with a large number of small releases. These releases have cleaned up some warts in the “batteries included” modules and made huge improvements to the performance of the parser and run-time. A few minor language features have also been added: for instance, f-strings (which I like a lot) and the so-called walrus operator, mostly used for regular expression matching.
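As a quick illustration of how these two features compose, here is a minimal sketch (the string and pattern are my own, purely for illustration):

import re

line = "epoch=17 loss=0.042"
# The walrus operator binds the match object and tests it in one expression...
if (match := re.search(r"loss=(\d+\.\d+)", line)):
    # ...and the f-string interpolates the captured group directly.
    print(f"current loss is {match.group(1)}")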

When Python improvements (and they are improvements, IMO) are discussed on sites like Hacker News, there is a lot of fear and trepidation. I am not sure why. These are rather minor changes, and they will take years to diffuse through the Python community. Overall, very little has changed.

Noam on neural networks

I just crashed a Zoom conference in which Noam Chomsky was the discussant. (What I have to say will be heavily paraphrased: I wasn’t taking notes.) One back-and-forth stuck with me. Someone asked Noam what people interested in language and cognition ought to study, other than linguistics itself. He mentioned various biological systems, and said, however, that they probably shouldn’t bother to study neural networks, since these have very little in common with intelligent biological systems (despite their branding as “neural” and “brain-inspired”). He stated that he is grateful for Zoom closed captions (he has some hearing loss), but that one should not conflate that technology with language understanding. He said, similarly, that he’s grateful for snow plows, but that one shouldn’t confuse such a useful technology with theories of the physical world.

For myself, I think they’re not uninteresting devices, and that linguists are uniquely situated to evaluate them (adversarially, I hope) as models of language. I also think they can be viewed as powerful black boxes for studying the limits of domain-general pattern learning. Sometimes we want to ask whether certain linguistic information is actually present in the input, and some of my work (e.g., Gorman et al. 2019) looks at that in some detail. But I do share some of the intuition that they are not likely to greatly expand our understanding of human language overall.

References

Gorman, K., McCarthy, A. D., Cotterell, R., Vylomova, E., Silfverberg, M., and Markowska, M. 2019. Weird inflects but OK: making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 140-151.

Lambda lifting in Python

Python really should have a way to lambda-lift a value e to a no-argument callable which returns e. Let us suppose that our e is denoted by the variable alpha. One can approximate such a lifting by declaring alpha_fnc = lambda: alpha. Python lambdas are slow compared to true currying functionality, like that provided by functools.partial and the functions of the operator module, but this basically works. The problem, however, is that lambda expressions in Python, unlike lambdas in, say, C++11, provide no way to capture the current value of a variable in the enclosing scope: free variables are looked up when the lambda is called, not when it is defined, so lambdas which refer to outer variables are context-dependent. The following interactive session illustrates the problem; a workaround is sketched after it.

In [1]: alpha_fnc = lambda: alpha

In [2]: alpha_fnc()
------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 alpha_fnc()

Input In [1], in <lambda>()
----> 1 alpha_fnc = lambda: alpha

NameError: name 'alpha' is not defined

In [3]: alpha = .5

In [4]: alpha_fnc()
Out[4]: 0.5

In [5]: alpha = .4

In [6]: alpha_fnc()
Out[6]: 0.4
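One workaround, sketched below, is the default-argument idiom: default values are evaluated once, at definition time, so the resulting lambda is insensitive to later reassignments of alpha.

alpha = 0.5

# The default value is evaluated here, freezing 0.5 into the lambda.
alpha_fnc = lambda alpha=alpha: alpha

alpha = 0.4
print(alpha_fnc())  # Still 0.5: the later reassignment has no effect.

functools.partial has the same property, binding its arguments when the partial object is constructed rather than when it is called.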

A* shortest string decoding for non-idempotent semirings

I recently completed some work, in collaboration with Google’s Cyril Allauzen, on a new algorithm for computing the shortest string through a weighted finite-state automaton. For so-called path semirings, the shortest string is given by the shortest path, but up until now there was no general-purpose algorithm for computing the shortest string over non-idempotent semirings (like the log or probability semiring). Such an algorithm would make it much easier to decode with interpolated language models or elaborate channel models in a noisy-channel formalism. In this preprint, we propose such an algorithm using A* search and lazy (“on-the-fly”) determinization, and prove that it is correct. The algorithm in question is implemented in my OpenGrm-BaumWelch library by the baumwelchdecode command-line tool.
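To see why idempotency matters, here is a small numerical sketch of my own (not from the preprint): in the tropical semiring, ⊕ is min, and combining a weight with itself changes nothing, but in the log semiring it does, so the weights of distinct paths that spell out the same string must be summed rather than merely compared.

import math

def log_plus(x: float, y: float) -> float:
    # ⊕ in the log semiring: -log(exp(-x) + exp(-y)).
    return -math.log(math.exp(-x) + math.exp(-y))

w = 1.5
print(min(w, w))       # 1.5: tropical ⊕ is idempotent.
print(log_plus(w, w))  # ~0.807 (= 1.5 - log 2): log ⊕ is not.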

Please don’t send .docx or .xlsx files

.docx and .xlsx files can only be read on a small subset of devices, and only after purchasing a license. It is frankly a bit rude to expect everyone to have such licenses in 2022, given the proliferation of superior, and free, alternatives. If the document is static, read-only content, convert it to a PDF. If it’s something you want me to edit or comment on, or which will be changing over time, send me the document via Microsoft 365 or the equivalent Google offerings. Or a Git repo. Sorry to be grumpy, but everyone should know this by now. If you’re still emailing these files around, please stop.

WFST talk

I have posted a lightly revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed description yet of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms: the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.
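For readers who have not used it, here is a minimal sketch of the Pynini method mentioned above, assuming Pynini is installed (the strings are just placeholders):

import pynini

# Build a small union of acceptors, then optimize it in place; optimization
# applies epsilon-removal, determinization, and minimization where these are
# well-defined for the machine in question.
union = pynini.union("cat", "cats", "dog")
union.optimize()
print(union.num_states())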

Evaluations from the past

In a literature review, speech and language processing specialists often feel tempted to report evaluation metrics like accuracy, F-score, or word error rate for systems described in the literature review. In my opinion, this is only informative if the prior and present work use the exact same data set(s) for evaluation. (Such results should probably be presented in a table along with results from the present work, not in the body of the literature review.) If, instead, the prior systems were tested on some proprietary data set, an obsolete corpus, or a data set the authors of the present work have declined to evaluate on, this information is unactionable. Authors should omit this information, and reviewers and editors should insist that it be omitted.

It is also clear to me that these numbers are rarely meaningful as measures of how difficult a task is “generally”. To take an example from an unnamed 2019 NAACL paper (one guilty of the sin described above), the reported word error rates on a single task in a single language range between 9.1% and 23.61% (note also the mixed precision). What could we possibly conclude from this enormous spread of results across different data sets?

Logistic regression as the bare minimum. Or, Against naïve Bayes

When I teach introductory machine learning, I begin with (categorical) naïve Bayes classifiers. These are arguably the simplest possible supervised machine learning models, and can be explained quickly to anyone who understands probability and the method of maximum likelihood estimation. I then pivot and introduce logistic regression and its various forms. Ng and Jordan (2002) provide a nice discussion of how the two relate, and I encourage students to read their study.

Logistic regression is a more powerful technique than naïve Bayes. First, it is “easier” in some sense (Breiman 2001) to estimate the conditional distribution, as one does in logistic regression, than to model the joint distribution, as one does in naïve Bayes. Second, logistic regression can be learned using standard (online) stochastic gradient descent methods. Finally, it naturally supports conventional regularization strategies needed to avoid overfitting. For these reasons, in 2022, I consider regularized logistic regression the bare minimum supervised learning method, the least sophisticated method that is possibly good enough. The pedagogical problem I then face is convincing students not to use naïve Bayes, which is obsolete (it is virtually always inferior to regularized logistic regression), especially given that tools like scikit-learn (Pedregosa et al. 2011) make it almost trivial to swap one machine learning method for the other.
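To make the “almost trivial to swap” point concrete, here is a minimal scikit-learn sketch (the toy documents and labels are mine, purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data.
docs = ["good movie", "great film", "bad movie", "terrible film"]
labels = [1, 1, 0, 0]

# A naïve Bayes baseline.
nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)
# Swapping in L2-regularized logistic regression is a one-line change.
lr = make_pipeline(CountVectorizer(), LogisticRegression()).fit(docs, labels)

print(nb.predict(["great movie"]))
print(lr.predict(["great movie"]))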

References

Breiman, Leo. 2001. Statistical modeling: the two cultures. Statistical Science 16:199-231.
Ng, Andrew Y., and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Proceedings of NeurIPS, pages 841-848.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, …, and Édouard Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12:2825-2830.

Academic reviewing in NLP

It is obvious to me that NLP researchers are, on average, submitting manuscripts far earlier and more often than they ought to. The average manuscript I review is typo-laden, full of figures and tables far too small to actually read or intruding on the margins, with an unusable bibliography that the authors have clearly never inspected. Sometimes I receive manuscripts whose actual titles are transparently ungrammatical.

There are several reasons this is bad, but most of all it is a waste of reviewer time, since the reviewers have to point out (in triplicate or worse) minor issues that would have been flagged by proofreaders, advisors, or colleagues, had they been involved before submission. Then, once these issues are corrected, the reviewers are asked to read the paper again and confirm that they have been addressed. This is work the authors could have done themselves, but which is instead pushed onto committees of unpaid volunteers.

The second issue is that the reviewer pool lacks relevant experience. I am regularly tasked with “meta-reviewing”, or critically summarizing the reviews. This is necessary in part because many, perhaps a majority, of the reviewers simply do not know how to review an academic paper, having not received instruction on this topic from their advisors or mentors, and their comments need to be recast in language that can be quickly understood by conference program committees.


I have recently been asked to review an uncommonly large collection of papers on the topic of prompt engineering. Several years ago, it became apparent that neural network language models, trained on enormous amounts of text data, could often provide locally coherent (though rarely globally coherent) responses to prompts or queries. The parade example of this type of model is GPT-2. For instance, if the prompt was:

Malfoy hadn’t noticed anything.

the model might continue:

“In that case,” said Harry, after thinking it over, “I suggest you return to the library.”

I assume this is because there’s fan fiction in the corpus, but I don’t really know. Now it goes without saying that at no point will Facebook, say, launch a product in which a gigantic neural network is allowed to regurgitate Harry Potter fan fiction (!) at its users. However, researchers persist, for some reason (perhaps novelty), in trying to “engineer” clever prompts that produce subjectively “good” responses, rather than attempting to understand how any of this works. (It is not an overstatement to say that we have little idea why neural networks, and the methods we use to train them in particular, work at all.) What am I to do when asked to meta-review papers like this? I try to remain collegial, but I’m not sure this kind of work ought to exist at all. I consider GPT-2 a billionaire plaything, and a rather wasteful one at that, and it is hard for me to see how this line of work might make the world a better place.