MP6: Tree transforms

In this assignment, you will develop a few tools for working with context-free grammars. Both grammars and trees are often subjected to various transformations. This may be for convenience or computational efficiency, but in other cases, transforms are necessary to satisfy requirements of particular parsing algorithms. One example of the latter is the Chomsky normal form transformation is required by the CYK parsing algorithm. It is often simplest to transform trees and then extract rules from the transformed trees, as you will in this assignment. (Note that the mapping between trees and lists of productions is "lossy"—otherwise, parsing would be a lot easier.)

We have provided an incomplete module tree.py which contains code for tree serialization (i.e., printing) and deserialization (i.e., reading from a text file), though once again you are welcome to write your own or improve upon the included code. Since these transforms are difficult to test by pure introspection, this module contains a number of unit tests, which can be run by executing the module (e.g., python tree.py). We have also provided wsj-train.mrg.gz, a file containing a large sample of the Penn Treebank, in case you desire additional test data.

Collapse unary productions

A unary production is a production in which a non-terminal "mother" expands to a single "daughter". It is conventional (and advantageous) to collapse them into a single node in most contexts. Let a mother be collapsible into its daughter if it is not the root node, its daughter is an "only child", and the daughter is neither terminal nor "pre-terminal" (the mother of a terminal). Your assignment is to write a function which takes a tree as input and returns a tree in which all such mothers have been collapsed. Your function should give the collapsed node a label that contains the labels of both the original mother and daughter, separated by some delimiter (for example, +). For instance, the mother S and its sole daughter VP might be collapsed into S+VP, like so:


(S
    (VP
        (DT He) 
        (VBP ran)
    )
)
            

Should become:


(S+VP
    (DT He) 
    (VBP ran)
)
            

What to turn in:

  1. Your program/function for collapsing unary productions on a tree
  2. Some sample terminal input/output showing precisely how your code is used
  3. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: the easiest solution involves recursion.

Chomsky normal form

A grammar is said to be in Chomsky normal form (CNF) if all production rules either produce two non-terminals or one terminal. Similarly, in trees generated by CNF grammars, each non-terminal has either two non-terminal daughters or one terminal daughter. Your next assignment is to transform all trees so that they can be generated by a CNF grammar. This is accomplished by introducing non-terminals until each non-terminal production is binary. This is to be accomplished by introducing new non-terminals on the right side of a ternary (or greater) branching. The labels of these new branches should combine the label of the mother and of the two new sisters. For instance, the tree

(NP 
    (NNP Shane) 
    (NNP Longman) 
    (POS 's)
)

contains a ternary production, and therefore is not in Chomsky normal form. It can be made to conform to CNF by introducing a new non-terminal node labeled NP|<NNP&POS> which is to mother of Longman 's and sister to Shane.

(NP
    (NNP Shane)
    (NP|<NNP&POS>
        (NNP Longman)
        (POS 's)
    )
)

What to turn in:

  1. Your program/function implementing the Chomsky normal form tree transformation
  2. Some sample terminal input/output showing precisely how your code is used
  3. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: The requirement that any rule which produces non-terminals produces exactly two non-terminals is satisfied by a grammar/tree in which unary productions have been collapsed. Therefore, you will want to collapse unary productions before applying this transform.

Generate productions

Your final assignment is not a tree transform (per se.) Rather, you are to to write a function which generates all productions (i.e., context-free rules) of a tree in Chomsky normal form. For instance, the tree

(TOP
    (S
        (VP
            (TO to)
            (VP
                (VB play)
            )
        )
    )
)

contains the following productions:

TOP -> S
S   -> VP
VP  -> TO VP
VP  -> VB
TO  -> "to"
VB  -> "play"

What to turn in:

  1. Your program/function for generating all productions of a tree
  2. Some sample terminal input/output showing precisely how your code is used
  3. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: once again, the easiest way to implement this to generate the top-level production and then apply the function recursively with a tail call.

Tip: the order of the productions doesn't matter, but the unit test assumes that productions were generated in depth-first order.

Bonus: improve tree.from_stream, and describe what you did.