The 24th century Universal Translator is unsupervised and requires minimal resources

The Star Trek: Deep Space Nine episode “Sanctuary” pretty clearly establishes that by the 24th century, the Star Trek universe’s Universal Translator works in an unsupervised fashion and requires only a (what we in the real 21st century would consider) minimal monolingual corpus and a few hours of processing to translate Skrreean, a language new to Starfleet and friends. Free paper idea: how does the Universal Translator’s capabilities (in the 22nd through the 24th century, from Enterprise to the original series to the 24th century shows) map onto known terms of art in machine translation in our universe?

“I understood the assignment”

We do a lot of things downstream with the machine learning tool we build, but not always can a model reasonably say it “understood the assignment” in the sense that the classifier is trained to do exactly what it we are making it do.

Take for example, Yuan and Liberman (2011), who study the realization of word-final ing in American English. This varies between a dorsal variant [ɪŋ] and a coronal variant [ɪn].1 They refer to this phenomenon using the layman’s term g-dropping; I will use the notation (ing) to refer to all variants. They train Gaussian mixture models on this distinction, then enrich their pronunciation dictionary so that each word can be pronounced with or without g-dropping; it is as if the two variants are homographs. Then, they perform a conventional forced alignment; as a side effect, it determines which of the “homographs” was most likely used. This does seem to work, and is certainly very clever, but strikes me as a mild abuse of the forced alignment technique, since the model was not so much trained to distinguish between the two variants as produce a global joint model over audio and phoneme sequences.

What would an approach to the g-dropping problem that better understood the assignment look like? One possibility would be to run ordinary forced alignment, with an ordinary dictionary, and then extract all instances of (ing). The alignment would, naturally, give us reasonably precise time boundaries for the relevant segments. These could then be submitted to a discriminative classifier (perhaps an LSTM) trained to distinguish the various forms of (ing). In this design, one can accurately say that the two components, aligner and classifier, understand the assignment. I expect that this would work quite a bit better than what Yuan and Liberman did, though that’s just conjecture at present.

Some recent work by my student Angie Waller (published as Waller and Gorman 2020), involved an ensemble of two classifiers, one which more clearly understood the assignment than the other. The task here was to detect reviews of professors which are objectifying, in the sense that they make off-topic, usually-positive, comments about the professors’ appearance. One classifier makes document-level classifications, and cannot be said to really understand the assignment. The other classifier attempts to detect “chunks” of objectifying text; if any such chunks are found, one can label the entire document as objectifying. While neither technique is particularly accurate (at the document level), the errors they make are largely uncorrelated so an ensemble of the two obtains reasonably high precision, allowing us to track trends in hundreds of thousands of professor reviews over the last decade.

Endnotes

  1. This doesn’t exhaust the logical possibilities of variation; for instance, for some speakers (including yours truly), there is a variant with a tense vowel followed by the coronal nasal.

References

Waller, A. and Gorman, K. 2020. Detecting objectifying language in online professor  reviews. In Proceedings of the Sixth Workshop on Noisy User-Generated Text, pages 171-180.
Yuan, J. and Liberman, M. 2011. Automatic detection of “g-dropping” in American English using forced alignment. In IEEE Workshop on Automatic Speech Recognition & Understanding, pages 490-493.

Surprises for the new NLP developer

There are a couple things that surprise students when they first begin to develop natural language processing applications.

  • Some things just take a while. A script that, say, preprocesses millions of sentences isn’t necessarily wrong because it takes a half hour.
  • You really do have to avoid wasting memory. If you’re processing a big file line-by-line,
    • you really can’t afford to read it all in at once, and
    • you should write out data as soon as you can.
  • The OS and program already know how to buffer IO; don’t fight it.
  • Whereas so much software works with data in human non-readable (e.g., wire formats, binary data) or human-hostile (XML) formats, if you’re processing text files, you can just open the files up and read them to see if they’re roughly what you expected.

Should you assign it to a variable?

Let us suppose that you’re going to compute some value, and then send it to another function. Which snippet is better (using lhs as shorthand for a variable identifier and rhs() as shorthand for the right-hand side expression)?

lhs = rhs()
do_work(lhs)
do_work(rhs())

My short answer is it depends. Here is my informal set of heuristics:

    • Is a type-cast involved (e.g., in a statically typed language like C)? If so, assign it to a variable.
    • Would the variable name be a meaningful one that would provide non-obvious information about the nature of the computation? If so, assign it to a variable. For instance, if a long right-hand side expression computes perplexity, ppl = ... or perplex = ... is about as useful as an inline comment.
    • Is the computation used again in the same scope? If so, assign it to a variable.
    • Is the right-hand side just a very complicated expression? If so, consider assigning it to a variable, and try to give it an informative name.
    • Otherwise, do not assign it to a variable.

Why language resources should be dynamic

Virtually all the digital linguistic resources used in speech and language technology are static in the sense that

  1. One-time: they are generated once and never updated.
  2. Read-only: they provide no mechanisms for corrections, feature requests, etc.
  3. Closed-source: code and raw data used to generate the data are not released.

However, there are some benefits to designing linguistic resources dynamically, allowing them to be repeatedly regenerated and iteratively improved with the help of the research community. I’ll illustrate this with WikiPron (Lee et al. 2020), our database-cum-library for multilingual pronunciation data.

The data

Pronunctionary dictionaries are an important resource for speech technologies like automatic speech recognition and text-to-speech synthesis. Several teams have considered the possibility of mining pronunciation data from the internet, particularly from the free online dictionary Wiktionary, which by now contains millions of crowd-sourced pronunciations transcribed using the International Phonetic Alphabet. However, none of these prior efforts released any code, nor were their scrapes run repeatedly, so at best they represent of a single (2016, or 2011) slice of the data.

The tool

WikiPron is, first and foremost, a Python command-line tool for scraping pronunciation data from Wiktionary. Stable versions can be installed from PyPI using tools like pip. Once the tool is installed, users specify a language, optionally, a dialect, and various optional flags, and pronunciation data is printed to STDIN as a two-column TSV file. Since this requires an internet connection and may take a while, the system is even able to retry where it left off in case of connection hiccups. The code is carefully documented, tested, type-checked, reflowed, and linted using the CircleCI continuous integration system. 

The infrastructure

We also release, at least annually, a multilingual pronunciation dictionary created using WikiPron. This increases replicability, permits users to see the format and scale of the data WikiPron makes available, and finally allows casual users to bypass the command-line tool altogether. To do this, we provide the data/ directory, which contains data and code which automates “the big scrape”, the process by which we regenerate the multilingual pronunciation dictionary. It includes

  • the data for 335 (at time of writing) languages, dialects, scripts, etc.,
  • code for discovering languages supported by Wiktionary,
  • code for (re)scraping all languages,
  • code for (re)generating data summaries (both computer-readable TSV files and human-readable READMEs rendered by GitHub), and
  • integration tests that confirm the data summaries match the checked-in data,

as well as code and data used for various quality assurance processes. 

Dynamic language resources

In what sense is WikiPron a dynamic language resource? 

  1. It is many-time: it can be run as many times as one wants. Even “the big scrape” static data sets are updated more-than-annually.
  2. It is read-write: one can improve WikiPron data by correcting Wiktionary, and we provide instructions for contributors wishing to send pull requests to the tool.
  3. It is open-source: all code is licensed under the Apache 2.0 license; the data bears a Creative Commons Attribution-ShareAlike 3.0 Unported License inherited from Wiktionary.

Acknowledgements

Most of the “dynamic” features in WikiPron were implemented by CUNY Graduate Center PhD student Lucas Ashby and my colleague Jackson Lee; I have at best served as an advisor and reviewer.

References

Lee, J. L, Ashby, L. F.E., Garza, M. E., Lee-Sikka, Y., Miller, S., Wong, A.,
McCarthy, A. D., and Gorman, K. 2020. Massively multilingual pronunciation
mining with WikiPron. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4223-4228.

Results of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion

The results of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion are now in, and are summarized in our task paper. A couple bullet points:

  • Unsurprisingly, the best systems all used some form of ensembling.
  • Many of the best teams performed self-training and/or data augmentation experiments, but most of these experiments were performance-negative except in simulated low-resource conditions. Maybe we’ll do a low-resource challenge in a future year.
  • LSTMs and transformers are roughly neck-and-neck; one strong submission used a variant of hard monotonic attention.
  • Many of the best teams used some kind of pre-processing romanization strategy for Korean, the language with the worst baseline accuracy. We speculate why this helps in the task paper.
  • There were some concerns about data quality for three languages (Bulgarian, Georgian, and Lithuanian). We know how to fix them and will do so this summer, if time allows. We may also “re-issue” the challenge data with these fixes.

Optimizing three-way composition for decipherment problems

Knight et al. (2006) introduce a class of problems they call decipherment. In this scenario, we observe a “ciphertext” C , which we wish to decode. We imagine that there exists a corpus of “plaintext” P, and which to recover the encipherment model G that transduces from P to C. All three components can be represented as (weighted) finite-state transducers: P is a language model over plaintexts, C is a list of strings, and G is an initially-uniform transducer from P to C. We can then estimate the parameters (i.e.. arc weights) of G by holding P and C constant and applying the expectation maximization algorithm (Dempster et al. 1977).

Both training and prediction require us to repeatedly compute the “cascade”, the three-way composition P ○ G ○ C. First off, two-way composition is associative, so for all ab, c : (ab) ○ c = a ○ (b ○ c). However, given any n-way composition, some associations may be radically more efficient than others. Even were the time complexity of each possible composition known, it is still not trivial to compute the optimal association. Fortunately, in this case we are dealing with three-way composition, for which there are only two possible associations; we simply need to compare the two.1

Composition performance depends on the sorting properties of the relevant machines. In the simplest case, the inner loop of (two-way) composition consists of a complex game of “go fish” between a state in the left-hand side automaton and a state in the right-hand side automaton. One state enumerates over its input (respectively, output) labels and queries the other state’s output (respectively input) labels. In the case that the state in the automaton being queried has its arcs sorted according to the label values, a sublinear binary search is used; otherwise, linear-time search is required. Optimal performance obtains when the left-hand side of composition is sorted by output labels and the right-hand side is sorted by input labels.2 Naturally, we also want to perform arc-sorting offline if possible.

Finally, OpenFst, the finite-state library we use, implements composition as an on-the-fly operation: states in the composed FST are lazily computed and stored in an LRU cache.3 Assiduous use of the cache can make it feasible to compute very large compositions when it is not necessary to visit all state of the composed machine. Today I focus on associativity and assume optimal label sorting; caching will have to wait for another day.

Our cascade consists of three weighted finite-state machines:

  • P is a language model expressed as a weighted label-sorted finite-state acceptor. The model is order 6, with Witten-Bell smoothing (Bell et al. 1990) backoffs encoded using φ (i.e., “failure”) transitions, and has been shrunk to 1 million n-grams using relative entropy pruning (Stolcke 1998).
  • G is a uniform channel model encoded as a finite-state transducer. Because it is a non-deterministic transducer, it can be input-label-sorted or output-label sorted, but not both.
  • C is an unweighted label-sorted string finite-state acceptor encoding a long plaintext.

There are two possible associativities, which we illustrate using the OpenFst Python bindings.4 In the first, we use a left-associative composition. Offline, before composition, we input label-sort G:

In [5]: G.arcsort("ilabel")

Then, we perform both compositions, sorting the intermediate object by output label:

In [6]: %timeit -n 10 
...          partial = compose(P, G, connect=False).arcsort("olabel"); 
...          cascade = compose(partial, C, connect=False)
10 loops, best of 3: 41.6 s per loop

In our second design, we use the parallel right-associative construction. Offline, we output label-sort G:

In [7]: G.arcsort("olabel")

Then, we perform both compositions, sorting the intermediate object by input label:

In [8]: %timeit -n 10 
...          partial = compose(G, C, connect=False).arcsort("ilabel"); 
...          cascade = compose(P, partial, connect=False)
3 loops, best of 3: 38.5 s per loop

So we see a small advantage for the right-associative composition, which we take advantage of in OpenGrm-BaumWelch, freely available from the OpenGrm website.

Endnotes

  1. There exist FST algorithms for n-ary composition (Allauzen & Mohri 2009), but in practice one can achieve similar effects using composition filters (Allauzen et al. 2010) instead.
  2. Note that acceptors which are input label-sorted are implicitly output label-sorted and vice versa, and string FSTs are input and output label-sorted by definition.
  3. In the case where one needs the entire composition at once, we can simply disable caching; in OpenFst, the result is also connected (i.e., trimmed) by default, but we disable that since we need to track the original state IDs.
  4. The timeit module is used to estimate execution times irrespective of caching.

References

Allauzen, C., and Mohri, M.. 2009. N-way composition of weighted finite-state transducers. International Journal of Foundations of Computer Science 20(4): 613-627.
Allauzen, C., Riley, M. and Schalkwyk, J. 2010. Filters for efficient composition of weighted finite-state transducers. In Implementation and Application of Automata: 15th International Conference, CIAA 2010, pages 28-38. Manitoba.
Bell, T.C., Clearly, J. G., and Witten, I.H. 1990. Text Compression. Englewood Cliffs, NJ: Prentice Hall.
Dempster, A. P., Laird, N., M, and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1): 1-38.
Knight, K., Nair, A., Rathod, N, Yamada, K. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499-506. Sydney.
Stolcke, A. 1998. Entropy-based pruning of backoff language models. In Proceedings of the DARPA Broadcast News And Understanding Workshop, pages 270–274. Lansdowne, Virginia.

Pynini 2020: State of the Sandwich

I have been meaning to describe some of the work I have been doing on Pynini, our weighted finite-state grammar development platform. For one, while I have been the primary contributor through the history of the project (Richard Sproat wrote the excellent path iteration library), we are now also getting many contributions from Lawrence Wolf-Sonkin (rewrite of the symbol table wrapper, type hints) and lots of usability and bug reports from the Google linguists.

We are currently on Pynini release 2.1.1. Here are some new features/improvements from the last few releases:

  • 2.0.9: Adds an efficient multi-argument union.
  • 2.0.9: Pynini (and the rest of OpenGrm) are available on Conda via Conda-Forge. This means that for most users, there is no longer any need to compile Pynini by hand; instead Pynini is compiled (for a variety of platforms) in the cloud, using a continuous integration framework.
  • 2.1.0: Rewrites the string compiler so that symbol tables are no longer attached to compiled FSTs, eliminating the need for expensive symbol table merging and relabeling options.
  • 2.1.0: Rewrites the FST and symbol table class hierarchies to better reflect the organization of lower-level APIs.
  • 2.1.1: Adds PEP 484/PEP 561-compatible type stubs.

We also have removed or renamed quite a few features:

  • stringify is renamed string.
  • text is renamed print (cf. the command-line tool fstprint).
  • The defaults struct is removed, though it may be reintroduced as a context manager at some point.
  • The * infix operator, previously used for composition is removed; use @ instead.
  • transducer‘s arguments input_token_type and output_token_type are merged as token_type.

Finally, we have broken Python 2.7 compatibility as of 2.1.0; pywrapfst, the lower-level API, still has some degree of Python 2.7 compatibility, but this is probably the last release to maintain that property.