In this assignment, you will develop a few tools for working with context-free grammars. Both grammars and trees are often subjected to various transformations. This may be for convenience or computational efficiency, but in other cases, transforms are necessary to satisfy requirements of particular parsing algorithms. One example of the latter is the Chomsky normal form transformation is required by the CYK parsing algorithm. It is often simplest to transform trees and then extract rules from the transformed trees, as you will in this assignment. (Note that the mapping between trees and lists of productions is "lossy"—otherwise, parsing would be a lot easier.)
We have provided an incomplete module
tree.py which contains code for tree serialization (i.e., printing) and deserialization (i.e., reading from a text file), though once again you are welcome to write your own or improve upon the included code. Since these transforms are difficult to test by pure introspection, this module contains a number of unit tests, which can be run by executing the module (e.g.,
python tree.py). We have also provided
wsj-train.mrg.gz, a file containing a large sample of the Penn Treebank, in case you desire additional test data.
A unary production is a production in which a non-terminal "mother" expands to a single "daughter". It is conventional (and advantageous) to collapse them into a single node in most contexts. Let a mother be collapsible into its daughter if it is not the root node, its daughter is an "only child", and the daughter is neither terminal nor "pre-terminal" (the mother of a terminal). Your assignment is to write a function which takes a tree as input and returns a tree in which all such mothers have been collapsed. Your function should give the collapsed node a label that contains the labels of both the original mother and daughter, separated by some delimiter (for example,
+). For instance, the mother
S and its sole daughter
VP might be collapsed into
S+VP, like so:
What to turn in:
Tip: the easiest solution involves recursion.
A grammar is said to be in Chomsky normal form (CNF) if all production rules either produce two non-terminals or one terminal. Similarly, in trees generated by CNF grammars, each non-terminal has either two non-terminal daughters or one terminal daughter. Your next assignment is to transform all trees so that they can be generated by a CNF grammar. This is accomplished by introducing non-terminals until each non-terminal production is binary. This is to be accomplished by introducing new non-terminals on the right side of a ternary (or greater) branching. The labels of these new branches should combine the label of the mother and of the two new sisters. For instance, the tree
contains a ternary production, and therefore is not in Chomsky normal form. It can be made to conform to CNF by introducing a new non-terminal node
NP|<NNP&POS> which is to mother of
Longman 's and sister to
What to turn in:
Tip: The requirement that any rule which produces non-terminals produces exactly two non-terminals is satisfied by a grammar/tree in which unary productions have been collapsed. Therefore, you will want to collapse unary productions before applying this transform.
Your final assignment is not a tree transform (per se.) Rather, you are to to write a function which generates all productions (i.e., context-free rules) of a tree in Chomsky normal form. For instance, the tree
(TOP (S (VP (TO to) (VP (VB play) ) ) ) )
contains the following productions:
TOP -> S S -> VP VP -> TO VP VP -> VB TO -> "to" VB -> "play"
What to turn in:
Tip: once again, the easiest way to implement this to generate the top-level production and then apply the function recursively with a tail call.
Tip: the order of the productions doesn't matter, but the unit test assumes that productions were generated in depth-first order.
tree.from_stream, and describe what you did.