MP2: Word similarity

In this assignment, you will use simple knowledge-driven and data-driven methods for estimating word similarity. These methods are widely used for information extraction and data mining.

Part 1: Knowledge-driven word similarity

WordNet is a lexical database of English which groups content words into sets of synonyms called synsets. The word dog, for example, is highly polysemous (it has many different senses) and belongs to 8 synsets:

  1. Synset('dog.n.01'): a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
  2. Synset('frump.n.01'): a dull unattractive unpleasant girl or woman
  3. Synset('dog.n.03'): informal term for a man
  4. Synset('cad.n.01'): someone who is morally reprehensible
  5. Synset('frank.n.02'): a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
  6. Synset('pawl.n.01'): a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
  7. Synset('andiron.n.01'): metal supports for logs in a fireplace
  8. Synset('chase.v.01'): go after with the intent to catch
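
For reference, a listing like the one above can be generated with the NLTK WordNet interface (assuming the WordNet data has already been downloaded with nltk.download('wordnet')):

    from nltk.corpus import wordnet as wn

    # Print every synset containing "dog", along with its dictionary gloss.
    for synset in wn.synsets('dog'):
        print(synset, '-', synset.definition())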

Synsets are linked by conceptual-semantic relationships which can be interpreted as a type of lexical ontology. WordNet implements several algorithms which use this ontology (and, in some cases, external resources) to estimate the similarity between synsets. Download wordsim.csv, a CSV file in which each line contains two words and a human rating of their similarity (in the range [0, 10]). For each word pair, use the WordNet API to look up the first synset of each word, then compute the path similarity between the two synsets. Finally, compute the Kendall tau rank correlation statistic between the human ratings and the path similarities.
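
A minimal sketch of this pipeline is shown below. It assumes wordsim.csv has no header row and that each row is word1,word2,rating; it also assumes every word has at least one synset and that every pair has a defined path similarity (if either assumption fails for your data, you will need to decide how to handle the missing values):

    import csv

    from nltk.corpus import wordnet as wn
    from scipy.stats import kendalltau

    human_ratings = []
    path_similarities = []
    with open("wordsim.csv") as source:
        for word1, word2, rating in csv.reader(source):
            # Take the first synset listed for each word, as the assignment asks.
            synset1 = wn.synsets(word1)[0]
            synset2 = wn.synsets(word2)[0]
            human_ratings.append(float(rating))
            path_similarities.append(synset1.path_similarity(synset2))

    # Rank correlation between the human judgments and path similarity.
    tau, p_value = kendalltau(human_ratings, path_similarities)
    print(f"Kendall tau: {tau:.3f} (p = {p_value:.3g})")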

What to turn in:

  1. Your program/function for calculating path similarity
  2. Some sample terminal input/output showing precisely how your code is used
  3. The Kendall tau rank correlation statistic between the human ratings and path similarity
  4. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: Use the NLTK WordNet API; familiarize yourself with the API by working through some of the examples in NLTK's documentation.

Tip: Use scipy.stats.kendalltau rather than implementing your own.

Bonus: Calculate other WordNet similarity measures between the word pairs. Note that some of the measures require the two words to have the same part of speech. For this, you may want to figure out the most likely tag for each word (using a unigram frequency distribution) and use that tag to select the best synset.
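
One possible sketch of that bonus step uses the tagged Brown corpus from NLTK as the unigram frequency source; the corpus choice, the tag-to-POS mapping, and the helper name best_synset below are all illustrative assumptions, not requirements (this also needs the brown and universal_tagset NLTK data packages):

    from nltk import ConditionalFreqDist
    from nltk.corpus import brown
    from nltk.corpus import wordnet as wn

    # Map the universal tagset onto WordNet's part-of-speech constants.
    TAG_TO_WORDNET = {"NOUN": wn.NOUN, "VERB": wn.VERB, "ADJ": wn.ADJ, "ADV": wn.ADV}

    # Unigram frequency distribution over tags, conditioned on the word.
    tag_dist = ConditionalFreqDist(
        (word.lower(), tag)
        for word, tag in brown.tagged_words(tagset="universal")
    )

    def best_synset(word):
        """First synset of `word` for its most frequent part of speech."""
        freqs = tag_dist[word.lower()]
        pos = TAG_TO_WORDNET.get(freqs.max()) if freqs else None
        synsets = wn.synsets(word, pos=pos) if pos else wn.synsets(word)
        return synsets[0] if synsets else None

With synsets chosen this way, measures such as wup_similarity, or lch_similarity (which requires matching parts of speech), can then be computed between the selected synsets.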

Part 2: Data-driven word similarity

Using the normalized Gigaword data you generated in MP1, compute co-occurrence frequencies for the word pairs in wordsim.csv, treating each sentence as a "document" (i.e., any two words that occur in the same sentence are said to co-occur). Adjust your frequency counts using Laplace ("add one") smoothing. Then, compute positive pointwise mutual information (PPMI) for each pair. Finally, compute the Kendall tau rank correlation statistic between the human ratings and PPMI.
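
Recall that PPMI(x, y) = max(log P(x, y) / (P(x) P(y)), 0). A minimal sketch of the smoothed computation follows; the function name and arguments are placeholders, and exactly how the add-one counts are renormalized into probabilities is a design decision you should make and describe in your write-up:

    import math

    def smoothed_ppmi(pair_count, count1, count2, num_docs):
        """Positive PMI from add-one ("Laplace") smoothed document counts.

        pair_count -- sentences ("documents") in which both words occur
        count1     -- sentences in which the first word occurs
        count2     -- sentences in which the second word occurs
        num_docs   -- total number of sentences
        """
        p_pair = (pair_count + 1) / (num_docs + 1)
        p_1 = (count1 + 1) / (num_docs + 1)
        p_2 = (count2 + 1) / (num_docs + 1)
        return max(math.log2(p_pair / (p_1 * p_2)), 0.0)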

What to turn in:

  1. Your program/function for calculating Laplace-smoothed PPMI
  2. Some sample terminal input/output showing precisely how your code is used
  3. The Kendall tau rank correlation statistic between the human ratings and PPMI
  4. Choose a few examples of the most similar token pairs according to PPMI; do these pairs also have high WordNet path similarity scores? Choose a few examples of the most similar pairs according to WordNet path similarity; do these pairs also have high PPMI scores? Which measure (WordNet path similarity or PPMI) do you think is better at capturing similarity? Why?
  5. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: You will need to apply the same normalization procedures you applied in MP1 to the word pairs in wordsim.csv.

Tip: The co-occurrence frequencies for computing PPMI should be structured as a dictionary with each pair of words as the key and the number of times the pair co-occurs in the same "document" as the value. As always, collections.defaultdict is your friend.
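
A sketch of that counting loop is below; it assumes `sentences` yields normalized token lists from your Gigaword data and `target_pairs` is a set of (word1, word2) tuples read from wordsim.csv (only those pairs need to be counted, which keeps the dictionary small):

    from collections import defaultdict

    pair_counts = defaultdict(int)
    word_counts = defaultdict(int)
    num_docs = 0

    for tokens in sentences:
        num_docs += 1
        types = set(tokens)  # each word counts at most once per "document"
        for word in types:
            word_counts[word] += 1
        for word1, word2 in target_pairs:
            if word1 in types and word2 in types:
                pair_counts[(word1, word2)] += 1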