In this assignment, you will write code for parsing strings using a probabilistic context-free grammars and the CYK (Cocke-Younger-Kasami) algorithm.
trn_02-21.cps contains approximately 40,000 parse trees from sections 02-21 of the Wall St. Journal portion of the Penn Treebank (PTB). These trees have been processed with the following transforms:
NNPS) or digits (
CD) are replaced with special tokens (
pcfg.py contains the
PCFG class, designed specifically for performing grammar intersection. A
PCFG instance is initialized from a iterable of productions (e.g., from instances of
Tree instantiated using
tree.py). While using this class, we look up the "right hand side" (RHS) of a production, which provide the candidate "left hand sides" (LHSes) which expand to this RHS. In this implementation, there are three types of rules:
TOPexpanding to one nonterminal
All three are represented in separate (maximum likelihood) probability distributions, the latter two conditional. The rule probabilities themselves are stored as bitweights, i.e., negative base-2 log probabilities. The
BitWeight class (in
bitweight.py) overloads unary and binary mathematical operators so that they have the same semantics as do real-valued probabilities; e.g., performing multiplication using internal log-space addition. This effectively prevents floating-point underflow. Underflow is an acute problem in PCFG parsing, as parse probabilities—formally, it is the joint probability of the sentence being generated by the grammar using the "generator story" given by the parse—are exceptionally small under any non-trivial grammar.
To create the PCFG, execute the
pcfg.py, which will read the training data (
trn_02-21.cps) and serialize the resulting PCFG into a compressed text file
PCFG.pkl.gz. This process will take several minutes in all, but only needs to be completed once. You do not need to turn in anything for this part of the assignment.
Tip: While developing your parser, you may want to start small. Modify the scripts to induce a very small PCFG on just a few simple sentences, and then attempt to parse this PCFG in part 2.
cyk.py provides a class
Chart which represents a fixed-size CYK chart. It consists of a tuple of rows—representing the span size of a non-terminal at that position—each of which consists of a tuple of cells—representing a span starting at that horizontal position—and each cell is a dictionary where keys consist of a candidate non-terminal label and values are the associated probability of the production. In this implementation, unlike the lecture on CYK, the chart is the upper left-hand diagonal of a matrix; the first row contains preterminals.
Your assignment is to implement the probabilistic CYK recognition algorithm and print out the negative base-2 log probability according to the PCFG induced in Part 1. A sketch of this is provided in the function
CYK_chart. The function takes a list of case-folded tokens and a
PCFG object as arguments, and completes the chart, and returns the
Chart object and the joint probability. The function as provided to you initializes the chart, populates the first row (with candidate preterminals), and also computes the maximum joint probability for this sentence. You are to implement the middle part of this function, which deals with all non-terminal productions. Then use this to compute the negative base-2 log joint probabilities for the 100 sentences in
tst-100.tok. All of these sentences have been chosen because they are relatively short and have non-zero joint probabilities under the training PCFG.
What to turn in:
Tip: You may also find this animated CYK demo useful, as it uses the same orientation to the table here.
Tip: The best way to begin is to start small. Rather than building a grammar out of the entire WSJ treebank and starting with a long and complex sentence, try instead to induce a grammar from a single short, simple sentence, and then try and run your CYK implementation on that. Having a small sentence that you can parse by hand and whose rules are very simple will make it much easier to debug the various off-by-one errors, etc. that will inevitably crop up, and a small and simple PCFG will make it easier to debug the math.