I wrote this quite a while ago, but here’s my handout on BLEU, a metric used to evaluate machine translation systems. Everything here is still just as applicable in the era of neural machine translation.
Author: Kyle Gorman
New Pynini tutorial
Pynini is my weighted finite-state transducer/grammar compilation library for Python, and O’Reilly Media recently published a short introductory tutorial on Pynini, cowritten with my colleague Richard Sproat.
Using the P2FA/FAVE-align SCOTUS acoustic models in Prosodylab-Aligner
Chris Landreth writes in with a tip on how to use the SCOTUS Corpus acoustic model (the one used in P2FA and FAVE-align) from within Prosodylab-Aligner. This is as simple as downloading the data and modifying the YAML configuration file and placing the model data in the right place. Here is the 16k model.
To use it, simply download into your working directory and then execute something like the following:
python3 -m aligner -r eng-SCOTUS-16k.zip -a yrdata -d eng.dict
Please let me know if you have any problems with that.
Libfix report for May 2016
Two bits of creative morphology I’ve been seeing around the city:
- Lime-a-rita: This trademark (of Anheuser-Busch InBev) isn’t just a redundant way to refer to a margarita (which has a lime base—a non-lime “margarita” is a barbarism), but rather a “light American lager” blended with additional lime-y-ness. I have to imagine this coinage, albeit rather corporate, was helped along by the existence of the truncation ‘rita, occasionally used in casual conversation by their most comitted devotees.
- -otto: I first came aware of this through pastotto, the suggested name for a dish of pasta (perhaps penne), fried in olive oil and butter and then cooked in stock, like risotto; according to popularizer Mark Bittman, this is an old trick. Now, that one looks a bit blend-y, given that the ris- part of risotto is really a reference to arborio rice, and that the final -a in the base pasta appears to be lost in the combination. But not so much for barleyotto, which satisfies even the most stringent criteria for libfix-hood.
Latent semantic analysis lecture
Here is an IPython notebook from a recent lecture I gave on Latent Semantic Analysis (LSA) in my natural language processing class (CS 562/662).
Understanding text encoding in Python 2 and Python 3
Computers were rather late to the word processing game. The founding mothers and fathers of computing were primarily interested in numbers. This is fortunate: after all, computers only know about numbers. But as Brian Kunde explains in his brief history of word processing, word processing existed long before digital computing, and the text processing has always been something of an afterthought.
Humans think of text as consisting of ordered sequence of “characters” (an ill-defined Justice-Stewart-type concept which I won’t attempt to clarify here). To manipulate text in digital computers, we have to have a mapping between the character set (a finite list of the characters the system recognizes) and numbers. Encoding is the process of converting characters to numbers, and decoding is (naturally) the process of converting numbers to characters. Before we get to Python, a bit of history.
ASCII and Unicode
There are only a few character sets that have any relevance to life in 2014. The first is ASCII (American Standard Code for Information Interchange), which was first published in 1963. This character set consists of 128 characters intended for use by an English audience. Of these 95 are printable, meaning that they correspond to lay-human notions about characters. On a US keyboard, these are (approximately) the alphanumeric and punctuation characters that can be typed with a single keystroke, or with a single keystroke while holding down the Shift key, space, tab, the two newline characters (which you get when you type return), and a few apocrypha. The remaining 33 are non-printable “control characters”. For instance, the first character in the ASCII table is the “null byte”. This is indicated by a ''
in C and other languages, but there’s no standard way to render it. Many control characters were designed for earlier, more innocent times; for instance, character #7 'a'
tells the receiving device to ring a cute little bell (which were apparently attached to teletype terminals); today your computer might make a beep, or the terminal window might flicker once, but either way, nothing is printed.
Of course, this is completely inadequate for anything but English (not to mention those users of superfluous diaresis…e.g., the editors of the New Yorker, Motörhead). However, each ASCII character takes up only 7 bits, leaving room for another 128 characters (since a byte has an integer value between 0-255, inclusive), and so engineers could exploited the remaining 128 characters to write the characters from different alphabets, alphasyllabaries, or syllabaries. Of these ASCII-based character sets, the best-known are ISO/IEC 8859-1, also known as Latin-1, and Windows-1252, also known as CP-1252. Unfortunately, this created more problems than it solved. That last bit just didn’t leave enough space for the many languages which need a larger character set (Japanese kanji being an obvious example). And even when there are technically enough code points left over, engineers working in different languages didn’t see eye-to-eye about what to do with them. As a result, the state of affairs made it impossible to, for example, write in French (ISO/IEC 8859-1) about Ukrainian (ISO/IEC 8859-5, at least before the 1990 orthography reform).
Clearly, fighting over scraps isn’t going to cut it in the global village. Enter the Unicode standard and its Universal Character Set (UCS), first published in 1991. Unicode is the platonic ideal of an character encoding, abstracting away from the need to efficiently convert all characters to numbers. Each character is represented by a single code with various metadata (e.g., A
is an “Uppercase Letter” from the “Latin” script). ASCII and its extensions map onto a small subset of this code.
Fortunately, not all encodings are merely shadows on the walls of a cave. The One True Encoding is UTF-8, which implements the entire UCS using an 8-bit code. There are other encodings, of course, but this one is ours, and I am not alone in feeling strongly that UTF-8 is the chosen encoding. At the risk of getting too far afield, here are two arguments for why you and everyone you know should just use UTF-8. First off, it is hardly matters much which UCS-compatible encoding we all use (the differences between them are largely arbitrary), but what does matter is that we all choose the same one. There is no general procedure for “sniffing” out the encoding of a file, and there’s nothing preventing you from coming up with a file that’s a French cookbook in one encoding, and a top-secret message in another. This is good for steganographers, but bad for the rest of us, since so many text files lack encoding metadata. When it comes to encodings, there’s no question that UTF-8 is the most popular Unicode encoding scheme worldwide, and is on its way to becoming the de-facto standard. Secondly, ASCII is valid UTF-8, because UTF-8 and ASCII encode the ASCII characters in exactly the same way. What this means, practically speaking, is you can achieve nearly complete coverage of the world’s languages simply by assuming that all the inputs to your software are UTF-8. This is a big, big win for us all.
Decode early, encode late
A general rule of thumb for developers is “decode early” (convert inputs to their Unicode representation), “encode late” (convert back to bytestrings). The reason for this is that in nearly any programming language, Unicode strings behave the way our monkey brains expect them to, but bytestrings do not. To see why, try iterating over non-ASCII bytestring in Python (more on the syntax later).
>>> for byte in b"año":
... print(byte)
...
a
?
?
o
There are two surprising things here: iterating over the bytestring returned more bytes then there are “characters” (goodbye, indexing), and furthermore the 2nd “character” failed to render properly. This is what happens when you let computers dictate the semantics to our monkey brains, rather than the other way around. Here’s what happens when we try the same with a Unicode string:
>>> for byte in u"año":
... print(byte)
...
a
ñ
o
The Python 2 & 3 string models
Before you put this all into practice, it is important to note that Python 2 and Python 3 use very different string models. The familiar Python 2 str
class is a bytestring. To convert it to a Unicode string, use the str.decode
instance method, which returns a copy of the string as an instance of the unicode
class. Similarly, you can make a str
copy of a unicode
instance with unicode.encode
. Both of these functions take a single argument: a string (either kind!) representing the encoding.
Python 2 provides specific syntax for Unicode string literals (which you saw above): the a lower-case u
prefix before the initial quotation mark (as in u"año"
).
When it comes to Unicode-awareness, Python 3 has totally flipped the script; in my opinion, it’s for the best. Instances of str
are now Unicode strings (the u""
syntax still works, but is vacuous). The (reduced) functionality of the old-style strings is now just available for instances of the class bytes
. As you might expect, you can create a bytes
instance by using the encode
method of a new-style str
. Python 3 decodes bytestrings as soon as they are created, and (re)encodes Unicode strings only at the interfaces; in other words, it gets the “early/late” stuff right by default. Your APIs probably won’t need to change much, because Python 3 treats UTF-8 (and thus ASCII) as the default encoding, and this assumption is valid more often than not.
If for some reason, you want a bytestring literal, Python has syntax for that, too: prefix the quotation marks delimiting the string with a lower-case b
(as in b"año"
; see above also).
tl;dr
Strings are ordered sequences of characters. But computers only know about numbers, so they are encoded as byte arrays; there are many ways to do this, but UTF-8 is the One True Encoding. To get the strings to have the semantics you expect as a human, decode a string to Unicode as early as possible, and encode it as bytes as late as possible. You have to do this explicitly in Python 2; it happens automatically in Python 3.
Further reading
For more of the historical angle, see Joel Spolsky’s The absolute minimum every software developer absolutely, positively must know About Unicode and character sets (no excuses!).
Simpler sentence boundary detection
Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank:
Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain steady at about 1,200 cars in 1990.
This sentence contains 4 periods, but only the last denotes a sentence boundary. It’s obvious that the first one in U.S. is unambiguously part of an acronym, not a sentence boundary, and the same is true of expressions like $12.53. But the periods at the end of Inc. and U.S. could easily have been on the left edge of a sentence boundary; it just turns out they’re not. Humans can use local context to determine that neither of these are likely to be sentence boundaries; for example, the verb expect selects two arguments (an object its U.S. sales and the infinitival clause to remain steady…), neither of which would be satisfied if U.S. was sentence-final. Similarly, not all question marks or exclamation points are sentence-final (strictu sensu):
He says the big questions–“Do you really need this much money to put up these investments? Have you told investors what is happening in your sector? What about your track record?–“aren’t asked of companies coming to market.
Much of the available data for natural language processing experiment—including the enormous Gigaword corpus—does not include annotations for sentence boundaries providence annotations for sentence boundaries. In Gigaword, for example, paragraphs and articles are annotated, but paragraphs may contain internal sentence boundaries, which are not indicated in any way. In natural language processing (NLP), this task is known as sentence boundary detection (SBD). [1] SBD is one of the earliest steps in many natural language processing (NLP) pipelines, and since errors at this step are very likely to propagate, it is particularly important to just Get It Right.
An important component of this problem is the detection of abbreviations and acronyms, since a period ending an abbreviation is generally not a sentence boundary. But some abbreviations and acronyms do sometimes occur in sentence-final position (for instance, in the Wall St. Journal portion of the Penn Treebank, there are 99 sentence-final instances of U.S.). In this context, English writers generally omit one period, a sort of orthographic haplology.
NLTK provides an implementation of Punkt (Kiss & Strunk 2006), an unsupervised sentence boundary detection system; perhaps because it is easily available, it has been widely used. Unfortunately, Punkt is simply not very accurate compared to other systems currently available. Promising early work by Riley (1989) suggested a different way: a supervised classifier (in Riley’s case, a decision tree). Gillick (2009) achieved the best published numbers on the “standard split” for this task using another classifier, namely a support vector machine (SVM) with a linear kernel; Gillick’s features are derived from the words to the left and right of a period. Gillick’s code has make available under the name Splitta.
I recently attempted to construct my own SBD system, loosely inspired by Splitta, but expanding the system to handle ellipses (…), question marks, exclamation points, or sentence-final punctuation marks. Since Gillick found no benefits from tweaking the hyperparameters of the SVM, I used a hyperparameter-free classifier, the averaged perceptron (Freund & Schapire 1999). After performing a stepwise feature ablation, I settled on a relatively small set of features, extracted as follows. Candidate boundaries are identified using a nasty regular expression. If the left or right contextual tokens match a regular expression for American English numbers (including prices, decimals, negatives, etc.), they are merged into a special token *NUMBER*
(per Kiss & Strunk 2006); a similar approach is used to convert various types of quotation marks into *QUOTE*
. The following features were then extracted:
- the identity of the punctuation mark
- identity of L and R (Reynar & Ratnaparkhi 1997, etc.)
- the joint identity of both L and R (Gillick 2009)
- does L contain a vowel? (Mikheev 2002)
- does L contain a period? (Grefenstette 1999)
- length of L (Riley 1989)
- case of L and R (Riley 1989)
This 8-feature system performed exceptionally well on the “standard split”, with an accuracy of .9955, an F-score of .9971, and just 46 errors in all. This is very comparable with the results I obtained with a fork of Splitta extended to handle ellipses, question marks, etc.; this forked system produced 55 errors.
I have made my system freely available as a Python 3 module (and command-line tool) under the name DetectorMorse. Both code and dependencies are pure Python, so it can be run using pypy3
, if you’re in a hurry.
Endnotes
[1] Or, sometimes, sentence boundary disambiguation, sentence segmentation, sentence splitting, sentence tokenization, etc.
References
Y. Freund & R.E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning 37(3): 277-296.
D. Gillick. 2009. Sentence boundary detection and the problem with the U.S. In Proc. NAACL-HLT, pages 241-244.
G. Grefenstette. 1999. Tokenization. In H. van Halteren (ed.), Syntactic wordclass tagging, pages 117-133. Dordrecht: Kluwer.
T. Kiss & J. Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics 32(4): 485-525.
A. Mikheev. 2002. Periods, capitalized words, etc. Computational Linguistics 28(3): 289-318.
J.C. Reynar & A. Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proc. 5th Conference on Applied Natural Language Processing, pages 16-19.
M.D. Riley. 1989. Some applications of tree-based modelling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop, pages 339-352.
UNIX AV club
SoX and FFmpeg are fast, powerful command-line tools for manipulating audio and video data, respectively. In this short tutorial, I’ll show how to use these tools for two very common tasks: 1) resampling and 2) (de)multiplexing. Both tools are available from your favorite package manager (like Homebrew or apt-get
).
SoX and friends
SoX is a suite of programs for manipulating audio files. Commands are of the form:
sox [flag ...] infile1 [...] outfile [effect effect-options] ...
That is, the command sox
, zero or more global flags, one or more input files, one output file, and then a list of “effects” to apply. Unlike most UNIX command-line programs, though, SoX actually cares about file extensions. If the input file is in.wav
it better be a WAV file; if the output file is out.flac
it will be encoded in the FLAC (“free lossless audio codec”) format.
The simplest invocation of sox
converts audio files to new formats. For instance, the following would use audio.wav to create a new FLAC file audio.flac with the same same bit depth and sample rate.
sox audio.wav audio.flac
Concatenating audio files is only slightly more complicated. The following would concatenate 01_Intro.wav
and 02_Untitled.wav
together into a new file concatenated.wav
.
sox 01_Intro.wav 02_Untitled.wav concatenated.wav
Resampling with SoX
But SoX really shines for resampling audio. For this, use the rate effect. The following would downsample the CD-quality (44.1 kHz) audio in CD.wav
to the standard sample rate used on telephones (8 kHz) and store the result in telephone.wav
sox CD.wav telephone.wav rate 8k
There are two additional effects you may want to invoke when resampling. First, you may want to “dither” the audio. As man sox
explains:
Dithering is a technique used to maximize the dynamic range of audio stored at a particular bit-depth. Any distortion introduced by quantization is decorrelated by adding a small amount of white noise to the signal. In most cases, SoX can determine whether the selected processing requires dither and will add it during output formatting if appropriate.
The following would resample to the telephone rate with dithering (if necessary).
sox CD.wav telephone.wav rate 8k dither -s
Finally, when resampling audio, you may want to invoke the gain effect to avoid clipping. This can be done using the -G
(“Gain”) global option.
sox -G CD.wav telephone.wav rate 8k dither -s
(De)multiplexing with SoX
The SoX remix
effect is useful for manipulating multichannel audio. The following would remix a multi-channel audio file stereo.wav
down to mono.
sox stereo.wav mono.wav remix -
We also can split a stereo file into two mono files.
sox stereo.wav left.wav remix 1
sox stereo.wav right.wav remix 2
Finally, we can merge two mono files together to create one stereo file using the -M
(“merge”) global option; this file should be identical to stereo.wav
.
sox -M left.wav right.wav stereo2.wav
Other SoX goodies
There are three other useful utilities in SoX: soxi
prints information extracted from audio file headers, play
uses the SoX libraries to play audio files, and rec
records new audio files using a microphone.
FFmpeg
The FFmpeg suite is to video files what SoX is to audio. Commands are of the form:
ffmpeg [flag ...] [-i infile1 ...] [-effect ...] [outfile]
The -acodec
and -vcodec
effects can be used to extract the audio and video streams from a video file, respectively; this process sometimes known as demuxing (short for “de-multiplexing”).
ffmpeg -i both.mp4 -acodec copy -vn audio.ac3
...
ffmpeg -i both.mp4 -vcodec copy -an video.h264
...
We can also mux (“multiplex”) them back together.
ffmpeg -i video.h264 -i audio.ac3 -vcodec copy -acodec copy both2.mp4
Hopefully that’ll get you started. Both programs have excellent manual pages; read them!
How "uh" and "um" differ
If you’ve been following the recent discussions on Language Log, then you know that there is a great deal of inter-speaker variation in the use of the fillers uh and um, despite their superficial similarity. In this post, I’ll discuss some published results, summarize some of the Language Log findings (with the obvious caveat that none of it has been subject to any sort of peer review) and explain what I think it all means for our understanding of the contrast between uh and um.
The function of uh and um
The vast majority of work on disfluencies (which include fillers like uh and um as well as repetitions, revisions, and false starts) assumes that uh and um are functionally equivalent, substitutable forms. But Clark and Fox Tree (2002) argue that they are subtly different. They claim that uh serves as signal minor delays and um signals major delays. The evidence for this is straightforward:
- Um is more often followed by a pause than uh.
- Pauses after ums tend to be longer than those occurring after uhs (though Mark has failed to replicate this in a much larger corpus, and I am inclined to defer to him).
- Um is more common than uh in utterance-initial position, the point at which speech planning demands are presumably at their greatest. [1]
From these results, though, it is not obvious that uh and um are qualitatively different. This has not prevented people (myself included) from making this jump. For example, Mark speculated a bit about this for the Atlantic: “People tend to use UM when they’re trying to decide what to say, and UH when they’re trying to decide how to say it.” This is plausible, but the evidence for differential functions of uh and um is lacking.
Intraspeaker differences in uh and um
Gender effects
The first—and probably most robust—finding, is that female speakers have a higher average um/uh ratio than males. This pattern was found in several corpora of American English available from the LDC (1 2 3). It also reported in a recent paper by Acton (2011), who looks two American English corpora. A higher um/uh ratio in females was also found in two corpora of British English. The first looks at data from the HCRC map task and the second at the conversational portion of the British National Corpus (BNC). The latter was earlier the subject of a study by Rayson et al. (1997), who found that that er (the British equivalent of uh) [2] was the one of the words most strongly associated with male (rather than female) speakers; the only word more “masculine” than uh was the expletive fucking.
Social class effects
The second finding is that um/uh ratio is correlated with social class: higher status speakers have a higher um/uh ratio. Once again, this was first reported by Rayson et al., who found that erm is more common in speakers with high-status occupations. Mark found a similar pattern in American English using educational attainment—rather than occupation—as a measure of social class.
Age effects
The third finding is that younger speakers have a higher um/uh ratio than older speakers. This was first reported by Rayson et al. (once again, studying the conversational portion of the BNC), who found that that er is much more common in speakers over the age of 35. Similar patterns are reported by Acton, and several Language Log correspondents (1 2 3 4).
Geographic effects
Finally, Jack Grieve looked at um/uh ratio geographically, and found that um was more common in the Midlands and the central southwest. I see two issues with this result, however. First, I don’t observe any geographic patterns in the raw data (ibid., in the comments section of that post); to my eye, the geographic patterns only emerge after aggressive smoothing; this may just be another case of Smoothers Gone Wild. Secondly, the data was taken from geocoded Twitter posts, not speech. As commenter “BK” asks: “do we have any reason to believe that writing ‘UM’ vs ‘UH’ in a tweet is at all correlated with the use of ‘UM’ vs ‘UH’ in speech?” Regrettably, I suspect the answer is no, but there still is probably something to be gleaned from tweeters’ stylistic use of these fillers.
Uh and um in children with autism
Our recent work on filler use in children with autism spectrum disorders (ASD) might provide us another way to get at the functional differentiation between uh and um. We [3] used a semi-structured corpus of diagnostic interviews of children ages 4-8, and find that children with ASD produce a much lower um–uh ratio than typically developing children matched for age and intelligence. Children with specific language impairment—a neurodevelopmental disorder characterized by language delays or deficits in the absence of other developmental or sensory impairments—have an um–uh ratio much closer to the typical children; this tells us it’s not about language impairment (something which is relatively common—but not specific to—children with ASD). We also find that um–uh ratio is correlated with the Communication Total Score of the Social Communication Questionnaire, a parent-reported measure of communication ability. At the very least, individuals who use more um are perceived to have better communication abilities by their parents. At best, use of um itself contributes to these perceptions.
How uh and um differ
To the sociolinguistic eye, the effects of gender, class, and age just described tell us a lot about uh and um. Given that women have a higher um/uh ratio than men, we expect that um is either the more prestigious variant, or the incoming variant, or both. This is what Labov calls the gender paradox: women consistently lead men in the use of prestige variants, and lead men in the adoption of innovative variants. Further evidence that um is the prestige variant comes from social class: higher status individuals have a higher um/uh ratio. Younger speakers have a higher um/uh ratio, suggesting that um is also the incoming variant. This is not the only possible interpretation, however; it may be that the variants are subject to age grading—meaning that speakers change their use of uh and um as they age—which does not entail that there is any change in progress. Given a change in apparent time—meaning that younger and older speakers use the variants at different rates—the only way to tell whether there is change in progress is to look at data collected at multiple time points. While the evidence is limited, it looks like both age grading and change in progress are occurring—they are not mutually exclusive, after all.
Unfortunately, some evidence from style shifting problematizes this view of um as a prestige variant. O’Connell and Kowal (2005) look at uh and um by analyzing the speech of professional TV and radio personalities interviewing Hillary Clinton. If um is the more prestigious variant, then we would expect a higher um/uh ratio in this formal context compared to the more casual styles recorded in other corpora. But in fact these experienced public speakers have a particularly low um/uh ratio. Hillary Clinton produced 640 uhs and 160 ums, for an um/uh ratio of 0.250; in contrast, Mark found that on average, female speakers in the Fisher corpora favored um more than 2-to-1.
So why is Hillary Clinton hating on um? Can an incoming variable be associated with women and the upper classes yet still avoided in formal contexts? Or are we simply wrong to think of uh and um as variants of a single variable? Is it possible that, given our limited understanding of the functional differences between uh and um, we have failed to account for associations between discourse demands and social groups (or speech styles)? Perhaps Clinton just needs uh more than we could ever know.
Endnotes
[1] This finding is so robust, it even holds in Dutch, which has very similar fillers to those of English (Swerts 1998).
[2] Note that, at least according to the Oxford English Dictionary, British er and erm are just orthographic variants of uh and um, respectively. That’s not to say that they’re pronounced identically, just that they are functionally equivalent.
[3] Early studies geared at speech researchers were conducted by Peter Heeman and Rebecca Lunsford. Other coauthors include Lindsay Olson, Alison Presmanes Hill, and Jan van Santen.
References
E.K. Acton. 2011. On gender differences in the distribution of um and uh. Penn Working Papers in Linguistics 17(2): 1-9.
H.H. Clark & J.E. Fox Tree. 2002. Using uh and um in spontaneous speaking. Cognition 84(1): 73-111.
D.C. O’Connell and S. Kowal. 2005. Uh and um revisited: Are they interjections for signaling delay? Journal of Psycholinguistic Research 34(6): 555-576.
P. Rayson, G. Leech, and M. Hodges. 1997. Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics 2(1): 133-152.
M. Swerts. 1998. Filled pauses as markers of discourse structure. Journal of Pragmatics 30(4): 485-496.