In this assignment, you will use simple knowledge-driven and data-driven methods for estimating word similarity. These methods are widely used for information extraction and data mining.
WordNet is a lexical database of English which groups content words into sets of synonyms called synsets. The word dog, for example, is highly polysemous (it has many different senses) and belongs to 8 synsets:
Synset('dog.n.01'): a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Synset('frump.n.01'): a dull unattractive unpleasant girl or woman
Synset('dog.n.03'): informal term for a man
Synset('cad.n.01'): someone who is morally reprehensible
Synset('frank.n.02'): a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
Synset('pawl.n.01'): a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
Synset('andiron.n.01'): metal supports for logs in a fireplace
Synset('chase.v.01'): go after with the intent to catch
Synsets are linked by conceptual-semantic relationships, which together can be interpreted as a type of lexical ontology. WordNet implements several algorithms which use this ontology (and in some cases, external resources) to estimate the similarity between synsets. Download wordsim.csv, a CSV file in which each line contains two words and a human rating of their similarity (in the range [0, 10]). For each word pair, use the WordNet API to look up the first synset of each word, and then use the API to compute the path similarity between the two synsets. Finally, compute the Kendall tau rank correlation statistic between the human ratings and the path similarities.
What to turn in:
Tip: Use the NLTK WordNet API; familiarize yourself with the API by performing some of the examples from NLTK's documentation.
Tip: Use scipy.stats.kendalltau rather than implementing your own.
Bonus: Calculate other WordNet similarity measures between the word pairs. Note that some of the measures require the two words to have the same part of speech. For this, you may want to figure out the most likely tag for each word (using a unigram frequency distribution) and use that tag to select the best synset.
Using the normalized Gigaword data you generated in MP1, compute co-occurrence frequencies for the word pairs in wordsim.csv, treating each sentence as a "document" (i.e., any pair of words that occur in the same sentence is said to co-occur). Adjust your frequency counts using Laplace ("add-one") smoothing. Then, compute the positive pointwise mutual information (PPMI) for each pair. Finally, compute the Kendall tau rank correlation statistic between the human ratings and the PPMI values.
What to turn in:
Tip: You will need to apply the same normalization procedures you applied in MP1 to the word pairs in wordsim.csv.
Tip: The co-occurrence frequencies for computing PPMI should be structured as a dictionary with each pair of words as a key and the number of times the pair co-occurs in the same "document" as the value. As always, collections.defaultdict is your friend.
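The counting and PPMI steps above can be sketched as follows. This is only a sketch: the exact form of the add-one smoothing (here, adding 1 to every count and to the document total) is an assumption, and you should match it to your course's convention.

```python
import math
from collections import defaultdict


def count_cooccurrences(sentences, target_pairs):
    """Count document frequencies, treating each sentence as a 'document'.

    `sentences` is an iterable of token lists; `target_pairs` is the list
    of (word1, word2) pairs from wordsim.csv.
    """
    word_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    n_docs = 0
    for tokens in sentences:
        n_docs += 1
        seen = set(tokens)  # count each word at most once per document
        for w in seen:
            word_counts[w] += 1
        for pair in target_pairs:
            if pair[0] in seen and pair[1] in seen:
                pair_counts[pair] += 1
    return word_counts, pair_counts, n_docs


def ppmi(pair, word_counts, pair_counts, n_docs):
    """PPMI for one word pair, with a simple add-one smoothing variant."""
    w1, w2 = pair
    p_pair = (pair_counts[pair] + 1) / (n_docs + 1)
    p_w1 = (word_counts[w1] + 1) / (n_docs + 1)
    p_w2 = (word_counts[w2] + 1) / (n_docs + 1)
    pmi = math.log2(p_pair / (p_w1 * p_w2))
    return max(pmi, 0.0)  # "positive" PMI clips negative values to zero
```

Because PPMI clips negative PMI values to zero, many rare pairs will tie at 0.0; `scipy.stats.kendalltau` handles ties, but it is worth keeping in mind when interpreting the correlation.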