“Python” is a proper name

In just the last few days I’ve seen a half dozen instances of the phrase python package or python script in published academic work. It’s disappointing to me that this got by the reviewers, action editors, and copy editors, since Python is obviously a proper name and should be in titlecase. (The fact that the interpreter command is python is irrelevant.)

Markdown isn’t good enough to replace LaTeX

I am generally sympathetic with calls to replace LaTeX with something else. LaTeX has terrible defaults, Unicode and font support is a constant problem, the syntax is deliberately obfuscatory, and actual generation is painfully slow (probably because the whole thing is a big pasta factory of interpreted code instead of a single static library).

But at the same time, I don’t think Markdown is really good enough to replace LaTeX. Of course one can use Pandoc to generate LaTeX from Markdown notes, and its output is often a decent thing to copy and paste into your LaTeX document. But Markdown just doesn’t solve any of the issues I mention, except making the syntax a tad more WYSIWYG than it would be otherwise. And Markdown is quite a bit worse at one thing: the extended syntax for tables is very hard to key in and still much less expressive than LaTeX’s actually pretty rational tabular environment.

Python hasn’t changed much

Since successfully sticking the landing for the migration from Python 2 (circa 3.6 or so), Python has been on a tear with a large number of small releases. These releases have cleaned up some warts in the “batteries included” modules and made huge improvements to the performance of the parser and run-time. A few minor language features have also been added; for instance, f-strings (which I like a lot) and the so-called walrus operator, mostly used for regular expression matching.
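For readers who have not met these two features, here is a minimal sketch of both. The sample input line and the variable names are made up for illustration.

```python
import re

version = (3, 8)
# f-string: expressions inside the braces are evaluated and interpolated.
message = f"Python {version[0]}.{version[1]} added the walrus operator"

# Hypothetical input line, used only to have something to match against.
line = "released: 2019-10-14"

# Walrus operator: bind the match object and test it in a single expression,
# instead of assigning on one line and testing on the next.
if (match := re.search(r"(\d{4})-(\d{2})-(\d{2})", line)):
    year = int(match.group(1))
else:
    year = None
```

Without the walrus operator, the same check needs a separate assignment statement before the if, which is the pattern it is mostly used to tidy up.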

When Python improvements (and they are improvements, IMO) are discussed on sites like Hacker News, there is a lot of fear and trepidation. I am not sure why. These are rather minor changes, and they will take years to diffuse through the Python community. Overall, very little has changed.

Noam on neural networks

I just crashed a Zoom conference in which Noam Chomsky was the discussant. (What I have to say will be heavily paraphrased: I wasn’t taking notes.) One back-and-forth stuck with me. Someone asked Noam what people interested in language and cognition ought to study, other than linguistics itself. He mentioned various biological systems, and said, however, that they probably shouldn’t bother to study neural networks, since they have very little in common with intelligent biological systems (despite their branding as “neural” and “brain-inspired”). He stated that he is grateful for Zoom closed captions (he has some hearing loss), but that one should not conflate that with language understanding. He said, similarly, that he’s grateful for snow plows, but one shouldn’t confuse such a useful technology with theories of the physical world.

For myself, I think they’re not uninteresting devices, and that linguists are uniquely situated to evaluate them (adversarially, I hope) as models of language. I also think they can be viewed as powerful black boxes for studying the limits of domain-general pattern learning. Sometimes we want to ask whether certain linguistic information is actually present in the input, and some of my work (e.g., Gorman et al. 2019) looks at that in some detail. But I do share some intuition that they are not likely to greatly expand our understanding of human language overall.

References

Gorman, K., McCarthy, A. D., Cotterell, R., Vylomova, E., Silfverberg, M., and Markowska, M. 2019. Weird inflects but OK: making sense of morphological generation errors. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 140-151.

Lambda lifting in Python

Python really should have a way to lambda-lift a value e to a no-argument callable function which returns e. Let us suppose that our e is denoted by the variable alpha. One can approximate such a lifting by declaring alpha_fnc = lambda: alpha. Python lambdas are slow compared to true currying functionality, like that provided by functools.partial and the functions of the operator library, but it basically works. The problem, however, is that lambda declarations in Python, unlike in, say, C++11, have no mechanism to capture outer variables by value: a free variable like alpha is looked up each time the lambda is called, not when it is declared, so lambdas which refer to outer variables are context-dependent. The following interactive session illustrates the problem.

In [1]: alpha_fnc = lambda: alpha

In [2]: alpha_fnc()
------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [2], in ()
----> 1 alpha_fnc()

Input In [1], in ()
----> 1 alpha_fnc = lambda: alpha

NameError: name 'alpha' is not defined

In [3]: alpha = .5

In [4]: alpha_fnc()
Out[4]: 0.5

In [5]: alpha = .4

In [6]: alpha_fnc()
Out[6]: 0.4
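If what one wants is for the value to be frozen at the moment of lifting, there are a few common workarounds. The sketch below shows three of them; lift is a hypothetical helper name of my own, not something in the standard library, and none of this is as tidy as a built-in would be.

```python
import functools


def lift(value):
    """Returns a no-argument callable that always returns `value`.

    `lift` is a hypothetical helper name, not part of the standard library.
    """
    # `value` is local to this call, so rebinding the caller's variable
    # later cannot change what the returned lambda sees.
    return lambda: value


alpha = 0.5
alpha_fnc = lift(alpha)
# Default arguments are evaluated once, when the lambda is defined.
alpha_default = lambda alpha=alpha: alpha
# functools.partial stores the argument at construction time.
alpha_partial = functools.partial(lambda x: x, alpha)

alpha = 0.4  # Rebinding the name does not affect the lifted callables.
results = (alpha_fnc(), alpha_default(), alpha_partial())
```

All three callables keep returning the value alpha had when they were created. The factory function is probably the most readable of the three, and the default-argument trick is the most compact.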

A* shortest string decoding for non-idempotent semirings

I recently completed some work, in collaboration with Google’s Cyril Allauzen, on a new algorithm for computing the shortest string through a weighted finite-state automaton. For so-called path semirings, the shortest string is given by the shortest path, but up until now, there was no general-purpose algorithm for computing the shortest string over non-idempotent semirings (like the log or probability semiring). Such an algorithm would make it much easier to decode with interpolated language models or elaborate channel models in a noisy-channel formalism. In this preprint, we propose such an algorithm using A* search and lazy (“on-the-fly”) determinization, and prove that it is correct. The algorithm in question is implemented in my OpenGrm-BaumWelch library by the baumwelchdecode command-line tool.
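To see why the shortest path and the shortest string can come apart in the probability semiring, consider the toy example below. It is only a brute-force illustration over a made-up list of paths, not the A* algorithm from the preprint, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy automaton, flattened into (string, path probability)
# pairs. The string "b" is spelled out by two distinct paths.
paths = [("a", 0.4), ("b", 0.3), ("b", 0.3)]


def shortest_path_string(paths):
    """Returns the string labeling the single most probable path."""
    return max(paths, key=lambda pair: pair[1])[0]


def shortest_string(paths):
    """Returns the string with the greatest total probability.

    In the probability semiring, a string's weight is the sum of the
    weights of all the paths that spell it out.
    """
    totals = defaultdict(float)
    for string, prob in paths:
        totals[string] += prob
    return max(totals, key=totals.get)


best_path = shortest_path_string(paths)
best_string = shortest_string(paths)
```

Here the best single path spells "a", but "b" has more total probability once its two paths are summed, so the two notions disagree. Enumerating every path is hopeless for any realistic automaton, which is why a proper algorithm is needed.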

Please don’t send .docx or .xlsx files

.docx and .xlsx can only be read on a small subset of devices and only after purchasing a license. It is frankly a bit rude to expect everyone to have such licenses in 2022 given the proliferation of superior, and free, alternatives. If the document is static, read-only content, convert it to a PDF. If it’s something you want me to edit or comment on, or which will be changing with time, send me the document via Microsoft 365 or the equivalent Google offerings. Or a Git repo. Sorry to be grumpy but everyone should know this by now. If you’re still emailing these around, please stop.

WFST talk

I have posted a lightly revised slide deck from a talk I gave at Johns Hopkins University here. In it, I give my most detailed-yet description of the weighted finite-state transducer formalism and describe two reasonably interesting algorithms: the optimization algorithm underlying Pynini’s optimize method and Thrax’s Optimize function, and a new A*-based single shortest string algorithm for non-idempotent semirings underlying BaumWelch’s baumwelchdecode CLI tool.

Evaluations from the past

In a literature review, speech and language processing specialists often feel tempted to report evaluation metrics like accuracy, F-score, or word error rate for systems described in the literature review. In my opinion, this is only informative if the prior and present work use the exact same data set(s) for evaluations. (Such results should probably be presented in a table along with results from the present work, not in the body of the literature review.) If, instead, the prior systems were tested on some proprietary data set, an obsolete corpus, or a data set the authors of the present work have declined to evaluate on, this information is not actionable. Authors should omit this information, and reviewers and editors should insist that it be omitted.

It is also clear to me that these numbers are rarely meaningful as measures of how difficult a task is “generally”. To take an example from an unnamed 2019 NAACL paper (one guilty of the sin described above), word error rates on a single task in a single language range between 9.1% and 23.61% (note also the mixed precision). What could we possibly reason from this enormous spread of results across different data sets?

Logistic regression as the bare minimum. Or, Against naïve Bayes

When I teach introductory machine learning, I begin with (categorical) naïve Bayes classifiers. These are arguably the simplest possible supervised machine learning models, and can be explained quickly to anyone who understands probability and the method of maximum likelihood estimation. I then pivot and introduce logistic regression and its various forms. Ng and Jordan (2002) provide a nice discussion of how the two relate, and I encourage students to read their study.

Logistic regression is a more powerful technique than naïve Bayes. First, it is “easier” in some sense (Breiman 2001) to estimate the conditional distribution, as one does in logistic regression, than to model the joint distribution, as one does in naïve Bayes. Second, logistic regression can be learned using standard (online) stochastic gradient descent methods. Finally, it naturally supports conventional regularization strategies needed to avoid overfitting. For these reasons, in 2022, I consider regularized logistic regression the bare minimum supervised learning method, the least sophisticated method that is possibly good enough. The pedagogical-instructional problem I then face is trying to convince students not to use naïve Bayes, which is obsolete (it is virtually always inferior to regularized logistic regression), given that tools like scikit-learn (Pedregosa et al. 2011) make it almost trivial to swap one machine learning method for the other.
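To make the second and third points concrete, here is a from-scratch sketch of binary logistic regression trained by online stochastic gradient descent with an L2 penalty. It is not how scikit-learn implements it; the function names, hyperparameter values, and toy data are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(data, epochs=200, lr=0.1, l2=0.01, seed=0):
    """Fits binary logistic regression by online SGD with an L2 penalty.

    `data` is a list of (feature_vector, label) pairs, labels in {0, 1}.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    data = list(data)  # Copy so shuffling does not affect the caller.
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # Gradient of the log loss w.r.t. the logit.
            # One update per example; the l2 * w term is the regularizer.
            weights = [w - lr * (err * xi + l2 * w)
                       for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """Returns the more probable label for feature vector `x`."""
    logit = bias + sum(w * xi for w, xi in zip(weights, x))
    return int(sigmoid(logit) >= 0.5)


# Hypothetical toy data: the label is 1 when feature 0 exceeds feature 1.
train = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([3.0, 1.0], 1),
         ([0.0, 2.0], 0), ([0.5, 1.5], 0), ([1.0, 3.0], 0)]
weights, bias = train_logreg(train)
predictions = [predict(weights, bias, x) for x, _ in train]
```

The whole learner is one loop with a one-line update, and regularization costs a single extra term. That is part of why I think of it as the bare minimum.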

References

Breiman, Leo. 2001. Statistical modeling: the two cultures. Statistical Science 16:199-231.
Ng, Andrew Y., and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Proceedings of NeurIPS, pages 841-848.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, …, and Édouard Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12:2825-2830.