Install OpenFst and OpenGrm-NGram:
$ conda install -c conda-forge openfst ngram
Download text data from the Wall St. Journal portion of the Penn Treebank corpus at the following URL:
https://www.wellformedness.com/courses/LING83800/Data/wsj.tar.gz
$ curl -O https://www.wellformedness.com/courses/LING83800/data/wsj.tar.gz
Then decompress it:
$ tar -xzf wsj.tar.gz
This creates a directory with three files; we'll use wsj_train.txt, which has one sentence per line.
$ head -1 wsj_train.txt # For demonstration purposes only.
Efforts by the Hong Kong Futures Exchange to introduce a new interest-rate futures contract continue to hit snags, despite the support the proposed instrument enjoys in the colony's financial community.
To build a character-level model, compile the text into a FAR (FST archive) of byte-labeled string FSTs, count 6-grams, apply Witten-Bell smoothing, and finally shrink the model with relative-entropy pruning:
$ farcompilestrings \
--fst_type=compact \
--token_type=byte \
wsj_train.txt \
wsj_train.far
$ farinfo wsj_train.far # For demonstration.
far type sttable
arc type standard
fst type compact_string
# of FSTs 34827
total # of states 4420561
total # of arcs 4385734
total # of final states 34827
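The totals reported by farinfo follow from how string FSTs are built: a line of n labels compiles to a linear acceptor with n arcs and n + 1 states, exactly one of which is final. Here is a minimal Python sketch of that bookkeeping, run on two made-up lines rather than the corpus. The helper name far_totals is hypothetical and not part of OpenFst.

```python
def far_totals(lines):
    """Returns (FSTs, states, arcs, final states) for byte-labeled string FSTs."""
    n_fsts = n_states = n_arcs = 0
    for line in lines:
        n_bytes = len(line.rstrip("\n").encode("utf-8"))
        n_fsts += 1
        n_arcs += n_bytes        # One arc per byte.
        n_states += n_bytes + 1  # A linear chain has one more state than arcs.
    return n_fsts, n_states, n_arcs, n_fsts  # One final state per string FST.


print(far_totals(["ab\n", "cde\n"]))  # (2, 7, 5, 2)
```

The same relationship holds in the farinfo output above: 4420561 - 4385734 = 34827, the number of FSTs.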
$ ngramcount --require_symbols=false --order=6 wsj_train.far wsj_train.cnt
$ ngrammake --method=witten_bell wsj_train.cnt wsj_train.lm
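Conceptually, ngramcount tallies n-grams up to the given order and ngrammake turns those counts into smoothed probabilities. The sketch below illustrates the idea on a toy string, using a bigram model and the textbook interpolated Witten-Bell estimate. It is an illustration under those assumptions, not a description of what ngrammake computes internally (its parameterization may differ), and every name in it is hypothetical.

```python
from collections import Counter, defaultdict


def witten_bell_bigram(text):
    """Returns p(w, h): textbook interpolated Witten-Bell bigram estimate."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    followers = defaultdict(set)  # Distinct symbols seen after each history.
    for h, w in bigrams:
        followers[h].add(w)
    history_counts = Counter(text[:-1])  # Times each symbol occurs as a history.
    total = sum(unigrams.values())

    def p(w, h):
        p_uni = unigrams[w] / total  # Lower-order (unigram) estimate.
        c_h = history_counts[h]
        if c_h == 0:
            return p_uni  # Unseen history: rely on the unigram estimate alone.
        t_h = len(followers[h])
        lam = c_h / (c_h + t_h)  # Weight on the higher-order estimate.
        return lam * bigrams[(h, w)] / c_h + (1 - lam) * p_uni

    return p


p = witten_bell_bigram("abracadabra")
print(sum(p(w, "a") for w in "abcdr"))  # Sums to 1 (up to rounding).
```

Histories followed by many distinct symbols give more weight to the lower-order estimate, which is the intuition behind the method.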
$ fstinfo wsj_train.lm # For demonstration: it's an ordinary FST.
fst type vector
arc type standard
input symbol table none
output symbol table none
# of states 402801
# of arcs 1361523
initial state 1
# of final states 8653
# of input/output epsilons 402800
# of input epsilons 402800
# of output epsilons 402800
input label multiplicity 1
output label multiplicity 1
# of accessible states 402801
# of coaccessible states 402801
# of connected states 402801
# of connected components 1
# of strongly conn components 6551
input matcher y
output matcher y
input lookahead n
output lookahead n
expanded y
mutable y
error n
acceptor y
input deterministic y
output deterministic y
input/output epsilons y
input epsilons y
output epsilons y
input label sorted y
output label sorted y
weighted y
cyclic y
cyclic at initial state n
top sorted n
accessible y
coaccessible y
string n
weighted cycles y
$ ngramshrink \
--method=relative_entropy \
--target_number_of_ngrams=100000 \
wsj_train.lm \
wsj_train.shrunk.lm
$ ngraminfo wsj_train.shrunk.lm # For demonstration.
# of states 33971
# of ngram arcs 99815
# of backoff arcs 33970
initial state 1
unigram state 0
# of final states 185
ngram order 6
# of 1-grams 87
# of 2-grams 2356
# of 3-grams 13808
# of 4-grams 32411
# of 5-grams 35891
# of 6-grams 15447
well-formed y
normalized y
To build a word-level model instead, one first needs to tokenize the data, which the word_tokenize.py script does. One may also wish to case-fold the data, though this is not done here. Whereas the character model uses each byte's integer value (its ASCII code, for ASCII text) as its arc label, the word model requires a symbol table mapping tokens to integer labels; ngramsymbols builds one automatically.
$ ./word_tokenize.py wsj_train.txt > wsj_train.tok
$ head -1 wsj_train.tok # For demonstration purposes only.
Efforts by the Hong Kong Futures Exchange to introduce a new interest-rate futures contract continue to hit snags , despite the support the proposed instrument enjoys in the colony 's financial community .
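The tokenization script itself is not reproduced here. As a rough stand-in, the following sketch uses a single regular expression to split off punctuation and common English clitics while keeping hyphenated words whole. The pattern and the function name are assumptions chosen to reproduce the sample line above; they are not the contents of word_tokenize.py, and the pattern does not handle contractions such as "n't" or decimal numbers.

```python
import re
import sys

# Clitics, words (optionally hyphenated), or single punctuation marks.
TOKEN = re.compile(r"'(?:s|re|ve|ll|d|m)\b|\w+(?:-\w+)*|[^\w\s]")


def tokenize(line):
    """Returns the line's tokens joined by single spaces."""
    return " ".join(TOKEN.findall(line))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: script.py INPUT > OUTPUT, mirroring the invocation above.
    with open(sys.argv[1], encoding="utf-8") as source:
        for line in source:
            print(tokenize(line))
```

To case-fold as well, one could call tokenize(line.casefold()) instead.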
$ ngramsymbols wsj_train.tok wsj_train.sym
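A symbol table is simply a mapping from token strings to integer labels. The sketch below builds one in the spirit of ngramsymbols. Reserving label 0 for epsilon follows OpenFst convention; the first-seen ordering and the trailing out-of-vocabulary entry are assumptions about the output, and the function name is hypothetical.

```python
def build_symbols(lines, epsilon="<epsilon>", unk="<unk>"):
    """Maps each whitespace-delimited token to an integer label."""
    table = {epsilon: 0}  # Label 0 is reserved for epsilon in OpenFst.
    for line in lines:
        for token in line.split():
            table.setdefault(token, len(table))
    table.setdefault(unk, len(table))  # Placeholder for out-of-vocabulary tokens.
    return table


symbols = build_symbols(["the colony 's community .", "the support ."])
for token, label in symbols.items():
    print(f"{token}\t{label}")
```

Each printed line pairs a token with its label, which is the general shape of an OpenFst text symbol table.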
$ farcompilestrings \
--fst_type=compact \
--symbols=wsj_train.sym \
--keep_symbols \
wsj_train.tok \
wsj_train.far
$ farinfo wsj_train.far # For demonstration.
far type sttable
arc type standard
fst type compact_string
# of FSTs 34827
total # of states 873020
total # of arcs 838193
total # of final states 34827
$ ngramcount \
--order=3 \
wsj_train.far \
wsj_train.cnt
$ ngrammake --method=kneser_ney wsj_train.cnt wsj_train.lm
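Kneser-Ney smoothing differs from Witten-Bell mainly in its lower-order distribution: a word's unigram probability reflects how many distinct words it follows, not how often it occurs. The sketch below computes only that continuation estimate on toy data. It omits discounting and higher orders, makes no claim about ngrammake's internals, and uses hypothetical names throughout.

```python
from collections import Counter


def continuation_probs(sentences):
    """Returns {word: P_continuation(word)} from tokenized sentences."""
    bigram_types = set()
    for sentence in sentences:
        tokens = sentence.split()
        bigram_types.update(zip(tokens, tokens[1:]))
    # Count the distinct left contexts of each word.
    left_contexts = Counter(w for _, w in bigram_types)
    return {w: n / len(bigram_types) for w, n in left_contexts.items()}


probs = continuation_probs([
    "in Hong Kong", "to Hong Kong", "from Hong Kong",
    "the support", "new support", "in support",
])
print(probs["Kong"], probs["support"])
```

Here "Kong" and "support" each occur three times, but "Kong" only ever follows "Hong", so its continuation probability is a third of that of "support".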
$ fstinfo wsj_train.lm # For demonstration: it's an ordinary FST.
fst type vector
arc type standard
input symbol table wsj_train.sym
output symbol table wsj_train.sym
# of states 351486
# of arcs 1292422
initial state 1
# of final states 8292
# of input/output epsilons 351485
# of input epsilons 351485
# of output epsilons 351485
input label multiplicity 1
output label multiplicity 1
# of accessible states 351486
# of coaccessible states 351486
# of connected states 351486
# of connected components 1
# of strongly conn components 4100
input matcher y
output matcher y
input lookahead n
output lookahead n
expanded y
mutable y
error n
acceptor y
input deterministic y
output deterministic y
input/output epsilons y
input epsilons y
output epsilons y
input label sorted y
output label sorted y
weighted y
cyclic y
cyclic at initial state n
top sorted n
accessible y
coaccessible y
string n
weighted cycles y
$ ngramshrink \
--method=relative_entropy \
--target_number_of_ngrams=100000 \
wsj_train.lm \
wsj_train.shrunk.lm
$ ngraminfo wsj_train.shrunk.lm # For demonstration.
# of states 21175
# of ngram arcs 99934
# of backoff arcs 21174
initial state 1
unigram state 0
# of final states 66
ngram order 3
# of 1-grams 40925
# of 2-grams 49061
# of 3-grams 10014
well-formed y
normalized y
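The ngraminfo figures for both shrunken models are internally consistent, which makes for a quick sanity check. The per-order n-gram counts sum to the pruning target, the same total is reached by adding the n-gram arcs to the final states (presumably because each final state stands for an end-of-string n-gram; that reading is an assumption), and every state but one has a backoff arc. A short Python sketch of the arithmetic, with the numbers copied from the outputs above:

```python
# Figures copied from the two ngraminfo outputs above.
models = {
    "character": {
        "states": 33971, "ngram_arcs": 99815, "backoff_arcs": 33970,
        "final_states": 185,
        "per_order": [87, 2356, 13808, 32411, 35891, 15447],
    },
    "word": {
        "states": 21175, "ngram_arcs": 99934, "backoff_arcs": 21174,
        "final_states": 66,
        "per_order": [40925, 49061, 10014],
    },
}

for name, m in models.items():
    total = sum(m["per_order"])
    print(name, total,
          total == m["ngram_arcs"] + m["final_states"],
          m["backoff_arcs"] == m["states"] - 1)
```

Both models come out at exactly 100000 n-grams, matching --target_number_of_ngrams.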