Robot autopsies

I don’t really understand the exuberance for studying whether neural networks know syntax. I have a lot to say about this issue—I’ll return to it later—but for today I’d like to briefly discuss this passage from a recent(ish) paper by Baroni (2022). The author expresses great surprise that few formal linguists have cited a particular paper (Linzen et al. 2016) about the ability of neural networks to learn long-distance agreement phenomena. (To be fair, Baroni is not a coauthor of said paper.) He then continues:

While it is possible that deep nets are relying on a completely different approach to language processing than the one encoded in human linguistic competence, theoretical linguists should investigate what are the building blocks making these systems so effective: if not for other reasons, at least in order to explain why a model that is supposedly encoding completely different priors than those programmed into the human brain should be so good at handling tasks, such as translating from a language into another, that should presuppose sophisticated linguistic knowledge. (Baroni 2022: 11).

This passage is a useful stepping-off point for my own views. I want to be clear: I am not “picking on” Baroni, who is far more senior than I am, and certainly better known, anyway; his is simply a particularly clearly written version of the claim, and I happen to disagree.

Baroni says it is “possible that deep nets are relying on a completely different approach to language processing…” than humans; I’d say it’s basically certain that they are. We simply have no reason to think they might be using similar mechanisms since humans and neural networks don’t contain any of the same ingredients. Any similarities will naturally be analogies, not homologies.

Without a strong reason to think neural models and humans share some kind of cognitive homologies, there is no reason for theoretical linguists to investigate them; as artifacts of human culture they are no more in the domain of study for theoretical linguists than zebra finches, carburetors, or the perihelion of Mercury. 

It is not even clear how one ought to poke into the neural black box. Complex networks are mostly resistant to the kind of proof-theoretic techniques that mathematical linguists (witness the Delaware school, or even just work by, say, Tesar) actually rely on, and most of the results we do have are both negative and of minimal applicability: for instance, we know that there always exists a single-layer network large enough to encode, with arbitrary precision, any function a multi-layer network encodes, but we have no way to figure out how big is big enough for a given function.
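
To make the flavor of such results concrete, here is the classical universal approximation theorem in roughly the form due to Cybenko (the notation is mine): for any continuous function f on a compact domain K and any tolerance ε > 0, there exist a width N, weights, and biases such that

\sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} \alpha_i \sigma(w_i \cdot x + b_i) \right| < \epsilon

for a suitable fixed nonlinearity σ. The result is purely existential: it gives no effective bound on N, which is exactly the sense in which we have no way to figure out how big is big enough.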

Probing and other interpretative approaches exist, but they have not yet proved themselves, and it is not clear that theoretical linguists have the relevant skills to push things forward anyway. Quality assurance and adversarial data generation are not exactly high-status jobs; how can Baroni demand that Cinque or Rizzi (to choose two of Baroni’s well-known countrymen) put down their chalk and start doing free or poorly paid QA for Microsoft?

Why should theoretical linguists of all people be charged with doing robot autopsies when the creators of the very same robots are alive and well? Either it’s easy and they’re refusing to do the work, or—and I suspect this is the case—it’s actually far beyond our current capabilities and that’s why little progress is being made.

I for one am glad that, for the time being, most linguists still have a little more self-respect. 

References

Baroni, M. 2022. On the proper role of linguistically oriented deep net analysis in linguistic theorising. In S. Lappin (ed.), Algebraic Structures in Natural Language, pages 1–16. Taylor & Francis.
Linzen, T., Dupoux, E., and Goldberg, Y. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4: 521–535.

Isaacson and Lewis

It’s amusing to me that Walter Isaacson and Michael Lewis—who happened to go to the same elite private high school in New Orleans, just a few years apart—are finally having their oeuvres as favorable stenographers for the rich and powerful reassessed more or less simultaneously. Isaacson clearly met his match with Elon Musk, a deeply incurious abuser who gave Isaacson quite minimal access; Lewis does seem to be one of a handful of people who actually believed in that effective altruism nonsense Sam Bankman-Fried was cooking up. Good riddance, I say.

Defectivity in English: more observations

[This is part of a series of defectivity case studies.]

In an earlier post I listed some defective verbs in my idiolect. After talking with our PhD student Aidan Malanoski, I have a couple additional generalizations to note.

  1. Aidan is fine with infinitival BEWARE (e.g., Caesar was told to beware the Ides of March). I am not sure about this myself. 
  2. Aidan points out that SCRAM, SHOO, and GO AWAY (we might call them, along with BEWARE, “imperative-dominant verbs”) have a similarly restricted distribution. Roughly, our judgments are:
  • imperatives ok: Scram! Shoo! Go away!
  • infinitives ok: Roaches started to scram when I turned the lights on. She shouted for the pigeons to shoo. The waiters couldn’t wait for them to go away.
  • gerunds marginal: Scramming would be a good idea right about now. Just going away might be the best thing.
  • other -ing participles degraded: (past continuous) Roaches started scramming when I turned the lights on. (small clause) I saw the police scramming.
  • simple pasts degraded: Roaches scrammed when I turned the lights on. (compositional reading only) He went away.

They point out that the same -ing surface forms may differ in acceptability depending on the construction, as the gerund and participle judgments above show. I also note that for me shooed [s.o.] away is fine as a transitive.

Why binarity is probably right

Consider the following passage, about phonological features:

I have not seen any convincing justification for the doctrine that all features must be underlyingly binary rather than ternary, quaternary, etc. The proponents of the doctrine often realize it needs defending, but the calibre of the defense is not unfairly represented by the subordinate clause devoted to the subject in SPE (297): ‘for the natural way of indicating whether or not an item belongs to a particular category is by means of binary features.’ The restriction to two underlying specifications creates problems and solves none. (Sommerstein 1977: 109)

Similarly, I recently had a conversation with someone who insisted that certain English multi-object constructions are better handled by assuming the possibility of ternary branching.

I disagree with Sommerstein, though: a logical defense of the assumption of binarity—both for the specification of phonological feature polarity and for the arity of syntactic trees—is so obvious that it fits on a single page. Roughly: (1) less than two is not enough, and (2) two is enough.

Less than two is not enough. This much should be obvious: theories in which features have only one value, or in which syntactic constituents cannot dominate more than one element, have no expressive power whatsoever.1,2

Two is enough. Every time we might desire to use a ternary feature polarity, or a ternary-branching non-terminal, there exists a weakly equivalent specification which uses binary polarity or binary branching, respectively, and more features or non-terminals. It is then up to the analyst to determine whether or not they are happy with the natural classes and/or constituents so obtained, but the possibility is always available. One opposed to this strategy has a duty to say why the hypothesized features or non-terminals are wrong.
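
To make the recoding strategy concrete, consider the textbook case of a putatively ternary height contrast among vowels. SPE-style feature theory replaces the single ternary feature with two binary ones:

  • high vowels: [+high, −low]
  • mid vowels: [−high, −low]
  • low vowels: [−high, +low]

The fourth combination, [+high, +low], is excluded on independent articulatory grounds, and the recoding brings with it the natural classes [−high] (mid and low vowels) and [−low] (high and mid vowels), which the analyst must then judge on their merits.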

Endnotes

  1. It is important to note in this regard that privative approaches to feature theory (as developed by Trubetzkoy and his disciples) are themselves special cases of the binary hypothesis which happen to treat absence as a non-referable. For instance, if we treat the set of nasals as a natural class (specified [Nasal]) but deny the existence of the (admittedly rather diverse) natural class [−Nasal]—and if we further insist that rules be defined in terms of natural classes, and deny the possibility of disjunctive specification—we are still working in a binary setting; we have merely added a stipulation that negated features cannot be referred to by rules.
  2. I put aside the issue of cumulativity of stress—a common critique in the early days—since nobody believes this is done by feature in 2023.

References

Sommerstein, A. 1977. Modern Phonology. Edward Arnold.

The different functions of probability in probabilistic grammar

I have long been critical of naïve interpretations of probabilistic grammar. To me, it seems like the major motivation for this approach derives from a naïve—I’d say overly naïve—linking hypothesis mapping between acceptability judgments, as elicited in Likert-scale acceptability tasks, and grammaticality. (See chapter 2 of my dissertation for a concrete argument against this.) On this interpretation, the probabilities are measures of wellformedness.

It occurs to me that there are a number of ontologically distinct interpretations of grammatical probabilities of the sort produced by “maxent”, i.e., logistic regression models.
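
For concreteness, the quantity in question is, in the formulation standard in this literature (the notation here is mine),

P(y \mid x) = \exp\left(-\sum_k w_k C_k(x, y)\right) / \sum_{y'} \exp\left(-\sum_k w_k C_k(x, y')\right)

where x is an input, y and y′ range over candidate outputs, the C_k are constraint violation counts, and the w_k are learned non-negative weights. Note that the formula itself is silent about what this is a probability of; that is precisely the ontological question.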

For instance, at M100 this weekend, I heard Bruce Hayes talk about another use of maximum entropy models: scansion. In poetic meters, there is variation in, say, whether the caesura is masculine (after a stressed syllable) or feminine (after an unstressed syllable), and the probabilities reflect that.1 However, I don’t think it makes sense to equate this with grammaticality, since we are talking about variation in highly self-conscious linguistic artifacts here and there is no reason to think one style of caesura is more grammatical than the other.2

And of course there is a third interpretation, in which the probabilities are production probabilities, representing actual variation in production, within a speaker or across multiple speakers.

It is not obvious to me that these facts all ought to be modeled the same way, yet the maxent community seems comfortable assuming a single cognitive model to cover all three scenarios. To state the obvious, it makes no sense for a cognitive model to account for interspeaker variation, because there is no such thing as “interspeaker cognition”; there are just individual mental grammars.

Endnotes

  1. This is a fabricated example because Hayes and colleagues mostly study English meter—something I know nothing about—whereas I’m interested in Latin poetry. I imagine English poetry has caesurae too but I’ve given it no thought yet.
  2. I am not trying to say that we can’t study grammar with poetry. Separately, I note (as I believe Paul Kiparsky did at the talk) that this model also assumes that the input text the poet is trying to fit to the meter has no role to play in constraining what happens.

Use the minus sign for feature specifications

LaTeX has a dizzying number of options for different types of horizontal dash. The following are available:

  • A single - is a short dash appropriate for hyphenated compounds (like encoder-decoder).
  • A single dash in math mode, $-$, is a longer minus sign.
  • A double -- is a longer “en-dash” appropriate for numerical ranges (like 3–5).
  • A triple --- is a long “em-dash” appropriate for interjections (like this—no, I mean like that).
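
For instance, the following snippet of body text (a minimal illustration) exercises all four:

an encoder-decoder model  % single dash: hyphen
a value of $-3$           % math mode: minus sign
pages 3--5                % double dash: en-dash
wait---what?              % triple dash: em-dash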

My plea to linguists is to actually use math mode and the minus sign when writing binary features. If you want to turn this into a simple macro, you can place the following in your preamble:

\newcommand{\feature}[2]{\ensuremath{#1}\textsc{#2}}

and then write \feature{-}{Back} for nicely formatted feature specifications.

Note that this issue has an exact parallel in Word and other WYSIWYG setups: there the fix is as simple as selecting the Unicode minus sign (U+2212) from the inventory of special characters (or just googling “Unicode minus sign” and copying and pasting what you find).

On alternative frameworks

[continuing an argument from earlier…]

I think that intellectual diversity and academic freedom are good baseline values, but they are not so obviously positive values in pedagogy. Students genuinely look to their instructors for information about which approaches to pursue, and not all frameworks (etc.) are of equal value. Let us suppose that I subscribe to theory P and you to theory Q (≠ P). Necessarily, one of the following must be true:

(1) P and Q are mere notational variants (i.e., weakly equivalent or better).
(2) P is “more right” than Q, or Q is “more right” than P.
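
(By weak equivalence in (1) I mean the usual formal-language-theoretic notion: P and Q are weakly equivalent iff L(P) = L(Q), i.e., they generate exactly the same sets of strings, though they may assign those strings different structural descriptions; strong equivalence would require the descriptions to match as well.)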

In the case of (1), the best pedagogical practice would probably be to continue to propagate whichever of {P, Q} is more widely used, more intellectually robust, etc. For instance, if Q is the more robust tradition, efforts to “port” insights and technologies from Q to P contribute little, and are often quite difficult in practice. Of course this doesn’t mean that P and its practitioners should be suppressed or anything of the sort, but there is no strong imperative to transmit P to young scholars.

In the case of (2), the best practice is also to focus pedagogy on whichever of the two is “more right”. Of course it is often useful to teach the intellectual history, and it is not always clear which theory is out ahead, but it is imperative to make sure students are conversant in the most promising approaches.

In the case of computational syntactic formalisms, weak equivalences hold between minimalist grammars (MGs, as formalized by Stabler and colleagues) and most of the so-called alternative formalisms. It is also quite clear to me that insights largely flow from minimalism and friends (broadly construed) to the alternative formalisms, and not the other way around. Finally, efforts to “port” these insights to the alternative formalisms have stalled, or perhaps are just many years behind the bleeding edge in syntactic theory. (1) therefore applies, and I simply don’t see a strong imperative to teach the alternative formalisms.

In the case of computational phonology, there is an emerging consensus that harmonic grammars (HGs) of the sort learned by “maxent” technologies have substantial pathologies compared to earlier formalisms, so that classic Optimality Theory (OT) is clearly “less wrong”. I am similarly sympathetic to arguments that global evaluation frameworks, including both HG and OT, are overly constrained with respect to opacity phenomena and overgenerate in other dimensions, and that these issues of expressivity are not shared by imperative (i.e., rule-based) formalisms or by the declarative (i.e., “weakly deterministic”) formalisms of the “Delaware school”. (2) thus applies here.

Of course I am not advocating for restrictions on academic freedom or speech in general; I make this argument only regarding best pedagogical practices, and I’m not sure I’m in the right here.

Trust me, I’m a linguist

Grice’s maxim of quantity requires that one give no more information than is strictly required. This is sometimes misunderstood as a firm constraint, but one intuition you may have—and which is nicely expressed by rational speech act (RSA) theory—is that apparent violations of this maxim by an otherwise cooperative speaker may actually tell you that seemingly irrelevant information is, in the mind of the speaker, of great relevance to the discourse.
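
A minimal sketch of the relevant inference, in my own notation: the RSA pragmatic listener infers a meaning m from an utterance u in proportion to how likely a cooperative speaker would be to choose u to convey m,

P_L(m \mid u) \propto P_S(u \mid m) \, P(m)

so an apparently over-informative u receives non-negligible speaker probability only for those meanings in which the extra information actually matters, and the listener’s posterior shifts toward just those meanings.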

I recently read two interviews in which the subject—crucially, not a working linguist—drew attention to their linguistics education.

The first is this excellent profile of Joss Sackler, a woman who married into the Sackler opioid fortune. To be fair, she does hold a PhD in Hispanic & Luso-Brazilian literatures & languages (dissertation here), which ought to qualify one as a linguist, but her response to the Town & Country reporter asking about some bad press is exactly the kind of non-sequitur a rational speaker ought not to make: “They’re going to regret fucking with a linguist. They already do.”

The second comes from interviews with Nicole Daedone, the co-founder of an organization (it’s hard to describe; just read about it if you’re interested) called OneTaste. I’ve now read several profiles of her (the first was in the New York Times, I think, years ago, but I can’t find it anymore), and each mentions that she studied linguistics in San Francisco; one source says she has a bachelor’s degree in “gender communications and semantics” from San Francisco State University, another that she was at some point working on a linguistics PhD there. The relevance was again unclear to me, but later I read the very interesting book Future Sex, which also profiles her. It contains a brief discussion of the lexical semantics of pussy; Daedone, who is (it’s complicated) a sex educator, proposes that it fills a lexical lacuna by providing a single term that refers to the human vulva and vagina as a whole.

This all makes me wonder whether, to the general public, linguistics really connotes brilliance and, perhaps, tenacity. It also makes me wonder whether one could actually wield one’s linguistics education as a shield against criticisms having nothing to do with language per se.

ACL Rolling Reviews don’t roll anymore

Recently I wanted to submit a paper to the ACL’s rolling reviews system. The idea of this system is that instead of people rushing to make somewhat arbitrary conference deadlines—in NLP, most everything is published at conferences rather than in books or journals—one can instead submit to an ever-running pool of reviewers and get quick comments. Furthermore, the preprints are available online and one can see the comments. Once you, the author, feel that you’ve received a satisfactory review, you can then “submit” your already-reviewed paper to a conference with the push of a button, and the organizers and area chairs put together a program from these papers. This seems like a good idea thus far, even if the very strong COI policy means that none of the papers I get assigned to review are interesting to me; rather, they are in adjacent (and boring) areas.

I was recently surprised to find—it’s not documented anywhere; I had to write tech support and wait to hear back—that there are now blackout periods of several weeks during which one cannot submit. I have no idea why this is. Granted, they have reduced the frequency of the cycles to six a year (or one every two months), but I don’t understand why I can’t, on July 1st, submit to the August 15th cycle. This makes no sense to me and seems to defeat the most important part of this initiative: the idea that you can submit work when it’s done rather than when certain stars align.

Myths about writing systems

In collaboration with Richard Sproat, I just finished a short position paper on “myths about writing systems” in NLP, which will appear in the proceedings of CAWL, the ACL Workshop on Computation and Written Language. I think it will be most useful to reviewers and editors who need a resource to combat nonsense like “Persian is a right-to-left language” and want to suggest a correction. Take a look here.