MP7: Constituency parsing

In this assignment, you will write code for parsing strings using a probabilistic context-free grammars and the CYK (Cocke-Younger-Kasami) algorithm.

Part 1: Inducing a probabilistic context-free grammar

The file trn_02-21.cps contains approximately 40,000 parse trees from sections 02-21 of the Wall St. Journal portion of the Penn Treebank (PTB). These trees have been processed with the following transforms:

  1. Empty nodes are removed
  2. Trees are binarized according the Collins head rules
  3. Non-terminals are annotated with their sisters and parents (v = 2, h = 2)
  4. All terminals are case-folded, and terminals tagged as proper names (NNP, NNPS) or digits (CD) are replaced with special tokens (<NNP>, <NNPS>, <CD>, respectively)

The program pcfg.py contains the PCFG class, designed specifically for performing grammar intersection. A PCFG instance is initialized from a iterable of productions (e.g., from instances of Tree instantiated using tree.py). While using this class, we look up the "right hand side" (RHS) of a production, which provide the candidate "left hand sides" (LHSes) which expand to this RHS. In this implementation, there are three types of rules:

  1. The special start symbol TOP expanding to one nonterminal
  2. One nonterminal expanding to two nonterminals
  3. One preterminal expanding to one terminal

All three are represented in separate (maximum likelihood) probability distributions, the latter two conditional. The rule probabilities themselves are stored as bitweights, i.e., negative base-2 log probabilities. The BitWeight class (in bitweight.py) overloads unary and binary mathematical operators so that they have the same semantics as do real-valued probabilities; e.g., performing multiplication using internal log-space addition. This effectively prevents floating-point underflow. Underflow is an acute problem in PCFG parsing, as parse probabilities—formally, it is the joint probability of the sentence being generated by the grammar using the "generator story" given by the parse—are exceptionally small under any non-trivial grammar.

To create the PCFG, execute the pcfg.py, which will read the training data (trn_02-21.cps) and serialize the resulting PCFG into a compressed text file PCFG.pkl.gz. This process will take several minutes in all, but only needs to be completed once. You do not need to turn in anything for this part of the assignment.

Tip: While developing your parser, you may want to start small. Modify the scripts to induce a very small PCFG on just a few simple sentences, and then attempt to parse this PCFG in part 2.

Part 2: The CYK algorithm

The script cyk.py provides a class Chart which represents a fixed-size CYK chart. It consists of a tuple of rows—representing the span size of a non-terminal at that position—each of which consists of a tuple of cells—representing a span starting at that horizontal position—and each cell is a dictionary where keys consist of a candidate non-terminal label and values are the associated probability of the production. In this implementation, unlike the lecture on CYK, the chart is the upper left-hand diagonal of a matrix; the first row contains preterminals.

Your assignment is to implement the probabilistic CYK recognition algorithm and print out the negative base-2 log probability according to the PCFG induced in Part 1. A sketch of this is provided in the function CYK_chart. The function takes a list of case-folded tokens and a PCFG object as arguments, and completes the chart, and returns the Chart object and the joint probability. The function as provided to you initializes the chart, populates the first row (with candidate preterminals), and also computes the maximum joint probability for this sentence. You are to implement the middle part of this function, which deals with all non-terminal productions. Then use this to compute the negative base-2 log joint probabilities for the 100 sentences in tst-100.tok. All of these sentences have been chosen because they are relatively short and have non-zero joint probabilities under the training PCFG.

What to turn in:

  1. Your program for generating the joint probabilities
  2. A list of joint probabilities in some standard format
  3. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: You may also find this animated CYK demo useful, as it uses the same orientation to the table here.

Tip: The best way to begin is to start small. Rather than building a grammar out of the entire WSJ treebank and starting with a long and complex sentence, try instead to induce a grammar from a single short, simple sentence, and then try and run your CYK implementation on that. Having a small sentence that you can parse by hand and whose rules are very simple will make it much easier to debug the various off-by-one errors, etc. that will inevitably crop up, and a small and simple PCFG will make it easier to debug the math.