This assignment has two parts. In the first, you will implement the multiclass averaged perceptron for sparse, binary feature vectors. In the second, you will apply it to a novel text classification problem.
Your first assignment is to implement the multiclass averaged perceptron classifier as a Python class. A skeleton can be found in perceptron.py. Crucially, the instance method fit should perform training using the raw weights (e.g., w_t), whereas the instance method predict should make predictions using the averaged weights (e.g., w̄). For more advice on implementation, review the lecture on classifiers.
What to turn in:
Tip: Store weights in a hashtable of hashtables, where the outer key is the feature label and the inner key is the candidate label. As always, collections.defaultdict is your friend.
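To make the fit/predict distinction and the nested-hashtable layout concrete, here is a minimal sketch, not the reference solution. The class name, method signatures, and the `totals` and `counter` attributes are assumptions for illustration; the skeleton in perceptron.py may be organized differently. Averaging is done with a common bookkeeping trick: alongside each raw weight, keep a total weighted by the time step of each update, so that the averaged weight can be recovered as `weight - total / counter` (the mean of the weight over all time steps, counting the initial zero vector).

```python
from collections import defaultdict


class AveragedPerceptron:
    """Sketch of a multiclass averaged perceptron for sparse binary features."""

    def __init__(self, labels):
        self.labels = sorted(labels)
        # weights[feature][label] holds the raw weight (w_t).
        self.weights = defaultdict(lambda: defaultdict(float))
        # totals[feature][label] holds the sum of (update * time step).
        self.totals = defaultdict(lambda: defaultdict(float))
        self.counter = 1  # time step; incremented once per training example

    def _raw_score(self, features, label):
        return sum(self.weights[f][label] for f in features if f in self.weights)

    def _averaged_score(self, features, label):
        score = 0.0
        for f in features:
            if f in self.weights:
                score += self.weights[f][label] - self.totals[f][label] / self.counter
        return score

    def _update(self, features, label, delta):
        for f in features:
            self.weights[f][label] += delta
            self.totals[f][label] += delta * self.counter

    def fit(self, examples, epochs=5):
        """Train on (feature_set, gold_label) pairs using the raw weights."""
        for _ in range(epochs):
            for features, gold in examples:
                guess = max(self.labels, key=lambda y: self._raw_score(features, y))
                if guess != gold:
                    self._update(features, gold, +1.0)
                    self._update(features, guess, -1.0)
                self.counter += 1

    def predict(self, features):
        """Predict a label for one feature set using the averaged weights."""
        return max(self.labels, key=lambda y: self._averaged_score(features, y))


# Toy usage with made-up feature sets.
toy = [({"*bias*", "fiery"}, "aries"), ({"*bias*", "optimistic"}, "sagittarius")]
model = AveragedPerceptron(["aries", "sagittarius"])
model.fit(toy, epochs=5)
print(model.predict({"*bias*", "fiery"}))
```

Because the features are binary, a feature set is just a collection of strings, and each update adds or subtracts 1 for every feature present. This sketch does not shuffle the training data between epochs; that is a design choice you may want to revisit.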
Once you have completed your multiclass averaged perceptron class, you will apply it to a novel text classification problem.
There are twelve Zodiac signs, each of which is thought to be associated with a particular set of personality traits. For instance, Aries (those born between March 20th and April 20th) are said to be "fiery", "ambitious", and "impulsive", and Sagittarius (those born between November 22nd and December 22nd) are said to be "intellectual", "optimistic", and "impatient". The CSV file horo-train.csv contains a random sample of 20,000 horoscopes labeled for sign, one per line. Each horoscope "document" contains suggestions or predictions for each sign. The horoscopes are a mixture of daily, weekly, monthly, and yearly predictions. Some are specific to professional life, love life, or teenage life.
Your second assignment is to write a program (horo-class.py) which uses your implementation of the multiclass averaged perceptron to predict the Zodiac sign associated with each horoscope. Your program will take two command-line arguments: the first will be a CSV file of training data (i.e., horo-train.csv), and the second will be a CSV file of test data. The program should:
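extract features from the text column of the training data
train the classifier using the labels in the sign column of the training data
predict a sign for each horoscope in the text column of the test data
compare the predictions against the sign column of the test data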
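The overall flow of such a program can be sketched as follows. This is a hedged outline rather than a required design: it assumes each CSV file has a header row naming the text and sign columns, the function names are hypothetical, and a majority-class guesser stands in for the perceptron so that the example is self-contained.

```python
import csv
from collections import Counter


def read_rows(path):
    """Yield (text, sign) pairs; assumes a header row with these column names."""
    with open(path, newline="", encoding="utf-8") as source:
        for row in csv.DictReader(source):
            yield row["text"], row["sign"]


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def main(train_path, test_path):
    train = list(read_rows(train_path))
    test = list(read_rows(test_path))
    # Placeholder model: always predict the most frequent training sign.
    # In the real program, feature extraction plus the perceptron's
    # fit and predict calls would replace the next two lines.
    majority = Counter(sign for _, sign in train).most_common(1)[0][0]
    predicted = [majority for _ in test]
    gold = [sign for _, sign in test]
    return accuracy(gold, predicted)


# In horo-class.py, the two paths would come from the command line,
# e.g. main(sys.argv[1], sys.argv[2]) after importing sys.
```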
A small "development set", horo-dev.csv, has also been provided to you. You are to use this as test data while you are developing the classifier; it may help you to design your feature extraction function and select the number of epochs of training to perform.
What to turn in:
Once you turn in your classifier, I will evaluate your program's performance on a held-out test set. Since the data is approximately balanced across the 12 signs, a random classifier will achieve roughly 8% accuracy (≈ 1/12). You should be able to outperform this by at least a couple of points: my baseline system, using a very simple set of features, achieves 14% accuracy.
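If you want to check the chance figure against the label distribution of a particular file, a small helper along these lines may be useful. It is only a sketch; the function name is hypothetical and it takes any list of sign labels.

```python
from collections import Counter


def baselines(signs):
    """Return (uniform-guess accuracy, majority-class accuracy) for a label list."""
    counts = Counter(signs)
    uniform = 1 / len(counts)
    majority = counts.most_common(1)[0][1] / len(signs)
    return uniform, majority


# With 12 equally likely signs, uniform guessing is right 1 time in 12.
print(round(1 / 12, 3))
```

When the labels are perfectly balanced, the two numbers coincide; a large gap between them would indicate that the data is less balanced than expected.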
Tip: Don't forget about the bias term in your feature extraction code.
Tip: During feature extraction, consider applying at least some of the deterministic text normalization steps you used in MP1.
Tip: Some features you may want to consider include "personality trait" words (like "impulsive"), words referring to Zodiac signs (e.g., "Leo", but also "Leonine", etc.), words referring to planets, and words referring to months and seasons; you may even want to try to include any non-stopword word, or any word below some threshold of document frequency. This assignment is intentionally open-ended, so be creative! You are permitted to use any external data resource for your features.
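The tips above can be combined into a feature extraction function along the following lines. This is one possible starting point, not the baseline system: the function name, the feature-name prefixes, the tiny stopword list, and the choice of casefolding as the normalization step are all assumptions for illustration, and the trait words are simply the examples quoted earlier.

```python
import re

# Lowercase alphabetic tokens, optionally with an internal apostrophe.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")

# Illustrative only; a real stopword list would be much longer.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "you", "your"})

# The example trait words mentioned for Aries and Sagittarius.
TRAITS = frozenset(
    {"fiery", "ambitious", "impulsive", "intellectual", "optimistic", "impatient"}
)


def extract_features(text):
    """Map a horoscope to a set of binary feature names."""
    features = {"*bias*"}  # bias term, present in every document
    for token in TOKEN.findall(text.casefold()):
        if token in STOPWORDS:
            continue
        features.add(f"word={token}")
        if token in TRAITS:
            features.add("has_trait_word")
    return features


print(sorted(extract_features("You are Fiery and impulsive!")))
```

Returning a set keeps the features binary: a word that occurs five times in a horoscope contributes the same single feature as a word that occurs once.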