Knight et al. (2006) introduce a class of problems they call decipherment. In this scenario, we observe a “ciphertext” C, which we wish to decode. We imagine that there exists a corpus of “plaintext” P, and we wish to recover the encipherment model G that transduces from P to C. All three components can be represented as (weighted) finite-state transducers: P is a language model over plaintexts, C is a list of strings, and G is an initially-uniform transducer from P to C. We can then estimate the parameters (i.e., arc weights) of G by holding P and C constant and applying the expectation maximization algorithm (Dempster et al. 1977).
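To make this concrete, here is a minimal sketch, using the Pynini bindings for OpenFst, of one way to build an initially-uniform G. The single-state topology and toy alphabet sizes are my assumptions for illustration, not the construction used by Knight et al.:

    import pynini

    # Toy alphabet sizes; real plaintext and ciphertext alphabets would be
    # read from symbol tables. (Label 0 is reserved for epsilon.)
    NUM_PLAIN = 26
    NUM_CIPHER = 26

    G = pynini.Fst()
    state = G.add_state()
    G.set_start(state)
    G.set_final(state)
    uniform = pynini.Weight.one(G.weight_type())
    for ilabel in range(1, NUM_PLAIN + 1):
        for olabel in range(1, NUM_CIPHER + 1):
            # Every plaintext label can rewrite as every ciphertext label,
            # all with the same initial weight; EM then reestimates these.
            G.add_arc(state, pynini.Arc(ilabel, olabel, uniform, state))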
Both training and prediction require us to repeatedly compute the “cascade”, the three-way composition P ○ G ○ C. Two-way composition is associative, so for all a, b, c: (a ○ b) ○ c = a ○ (b ○ c). However, for an n-way composition, some associations may be radically more efficient to compute than others, and even if the time complexity of each pairwise composition were known in advance, it would still not be trivial to find the optimal association. Fortunately, we are here dealing with three-way composition, for which there are only two possible associations; we simply need to compare the two.1
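As a quick sanity check of the associativity claim (which says nothing about speed), one can verify on toy machines that both orders define the same relation; pynini.cross and pynini.accep, in a recent version of Pynini, are just convenient constructors here:

    import pynini

    # a maps "abc" to "def"; b maps "def" to "ghi"; c accepts "ghi".
    a = pynini.cross("abc", "def")
    b = pynini.cross("def", "ghi")
    c = pynini.accep("ghi")

    left = pynini.compose(pynini.compose(a, b), c)
    right = pynini.compose(a, pynini.compose(b, c))
    # Both associations map "abc" to "ghi".
    assert left.project("output").rmepsilon().string() == "ghi"
    assert right.project("output").rmepsilon().string() == "ghi"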
Composition performance depends on the sorting properties of the relevant machines. In the simplest case, the inner loop of (two-way) composition consists of a complex game of “go fish” between a state in the left-hand automaton and a state in the right-hand automaton. One state enumerates its input (respectively, output) labels and queries the other state’s output (respectively, input) labels. If the queried state’s arcs are sorted by the relevant label, a sublinear binary search can be used; otherwise, a linear-time scan is required. Optimal performance obtains when the left-hand side of composition is sorted by output labels and the right-hand side is sorted by input labels.2 Naturally, we also want to perform arc-sorting offline when possible.
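Here is a rough sketch, in plain Python, of that matching step; the flat arc representation and the function itself are mine for exposition, not OpenFst internals:

    import bisect

    # Arcs are modeled as (label, next_state) pairs, kept in a list.
    def matching_arcs(arcs, query_label, is_sorted):
        """Yields the arcs whose label equals query_label."""
        if is_sorted:
            # Sublinear: binary search for the first match (key= requires
            # Python 3.10+), then scan off any remaining matches.
            i = bisect.bisect_left(arcs, query_label, key=lambda arc: arc[0])
            while i < len(arcs) and arcs[i][0] == query_label:
                yield arcs[i]
                i += 1
        else:
            # Linear: examine every arc.
            for arc in arcs:
                if arc[0] == query_label:
                    yield arc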
Finally, OpenFst, the finite-state library we use, implements composition as an on-the-fly operation: states in the composed FST are computed lazily and stored in an LRU cache.3 Assiduous use of this cache can make it feasible to compute very large compositions when it is not necessary to visit all states of the composed machine. Today I focus on associativity and assume optimal label sorting; caching will have to wait for another day.
Our cascade consists of three weighted finite-state machines (a sketch of how one might assemble them follows this list):
- P is a language model expressed as a weighted, label-sorted finite-state acceptor. The model is of order 6, uses Witten-Bell smoothing (Bell et al. 1990), encodes backoffs with φ (i.e., “failure”) transitions, and has been shrunk to one million n-grams using relative entropy pruning (Stolcke 1998).
- G is a uniform channel model encoded as a finite-state transducer. Because it is a transducer rather than an acceptor, it can be input label-sorted or output label-sorted, but not both.
- C is an unweighted, label-sorted string finite-state acceptor encoding a long ciphertext.
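Concretely, the machines might be assembled as follows; the file names are hypothetical, and the toy Caesar-shifted string stands in for a long ciphertext:

    import pynini

    P = pynini.Fst.read("plaintext_lm.fst")  # the 6-gram language model
    G = pynini.Fst.read("channel.fst")       # the uniform channel model
    # String FSTs are input and output label-sorted by definition (see
    # endnote 2), so C needs no further sorting.
    C = pynini.accep("wkh txlfn eurzq ira")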
There are two possible associations, which we illustrate using the OpenFst Python bindings.4 In the first, we use left-associative composition. Offline, before composition, we input label-sort G:
In [5]: G.arcsort("ilabel")
Then, we perform both compositions, sorting the intermediate object by output label:
In [6]: %timeit -n 10 \
   ...: partial = compose(P, G, connect=False).arcsort("olabel"); \
   ...: cascade = compose(partial, C, connect=False)
10 loops, best of 3: 41.6 s per loop
In our second design, we use the analogous right-associative construction. Offline, we output label-sort G:
In [7]: G.arcsort("olabel")
Then, we perform both compositions, sorting the intermediate object by input label:
In [8]: %timeit -n 10 \
   ...: partial = compose(G, C, connect=False).arcsort("ilabel"); \
   ...: cascade = compose(P, partial, connect=False)
10 loops, best of 3: 38.5 s per loop
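For reference, the two cells above amount to the following stand-alone comparison (a sketch only; P, G, and C are as described earlier, and absolute timings will of course vary):

    import timeit

    from pynini import compose

    def left_assoc():
        partial = compose(P, G, connect=False).arcsort("olabel")
        return compose(partial, C, connect=False)

    def right_assoc():
        partial = compose(G, C, connect=False).arcsort("ilabel")
        return compose(P, partial, connect=False)

    G.arcsort("ilabel")   # the left-associative design sorts G by input label
    print("left: ", timeit.timeit(left_assoc, number=10))
    G.arcsort("olabel")   # the right-associative design sorts G by output label
    print("right:", timeit.timeit(right_assoc, number=10))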
So we see a small advantage for the right-associative composition, which we take advantage of in OpenGrm-BaumWelch, freely available from the OpenGrm website.
Endnotes
- There exist FST algorithms for n-ary composition (Allauzen & Mohri 2009), but in practice one can achieve similar effects using composition filters (Allauzen et al. 2010) instead.
- Note that acceptors which are input label-sorted are implicitly output label-sorted and vice versa, and string FSTs are input and output label-sorted by definition.
- In the case where one needs the entire composition at once, we can simply disable caching; in OpenFst, the result is also connected (i.e., trimmed) by default, but we disable that since we need to track the original state IDs.
- The timeit module is used to estimate execution times irrespective of caching.
References
Allauzen, C., and Mohri, M. 2009. N-way composition of weighted finite-state transducers. International Journal of Foundations of Computer Science 20(4): 613–627.
Allauzen, C., Riley, M., and Schalkwyk, J. 2010. Filters for efficient composition of weighted finite-state transducers. In Implementation and Application of Automata: 15th International Conference, CIAA 2010, pages 28–38. Winnipeg, Manitoba.
Bell, T. C., Cleary, J. G., and Witten, I. H. 1990. Text Compression. Englewood Cliffs, NJ: Prentice Hall.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1): 1–38.
Knight, K., Nair, A., Rathod, N., and Yamada, K. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 499–506. Sydney.
Stolcke, A. 1998. Entropy-based pruning of backoff language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 270–274. Lansdowne, Virginia.