MP3: Text classification

This assignment has two parts. In the first, you will implement the multiclass averaged perceptron for sparse, binary feature vectors. In the second, you will apply it to a novel text classification problem.

The averaged perceptron

Your first assignment is to implement the multiclass averaged perceptron classifier as a Python class. A skeleton can be found in perceptron.py. Crucially, the instance method fit should perform training using the raw weights (e.g., w_t), whereas the instance method predict should make predictions using the averaged weights. For more advice on implementation, review the lecture on classifiers.
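To make the distinction between raw and averaged weights concrete, here is a minimal sketch, not the skeleton in perceptron.py. The class name, the fit(examples, epochs) signature, and the helper score are all assumptions made for illustration. It uses the naive form of averaging (summing the full weight table after every example), which is easy to read but too slow for real data, and it omits shuffling between epochs.

```python
from collections import defaultdict


def score(table, features, label):
    """Sums the weights of the active (binary) features for one label."""
    return sum(table[f].get(label, 0.0) for f in features if f in table)


class AveragedPerceptronSketch:
    """Illustrative only; the class name and signatures are assumptions."""

    def __init__(self, labels):
        self.labels = sorted(labels)
        self.weights = defaultdict(lambda: defaultdict(float))  # raw w_t
        self.totals = defaultdict(lambda: defaultdict(float))   # sum of w_t over t
        self.averaged = {}
        self.steps = 0

    def fit(self, examples, epochs=5):
        for _ in range(epochs):
            for features, gold in examples:
                # Training scores candidates with the raw weights.
                guess = max(self.labels,
                            key=lambda y: score(self.weights, features, y))
                if guess != gold:
                    for f in features:
                        self.weights[f][gold] += 1.0
                        self.weights[f][guess] -= 1.0
                # Naive averaging: add the current weights to the running sum.
                for f, row in self.weights.items():
                    for label, w in row.items():
                        self.totals[f][label] += w
                self.steps += 1
        self.averaged = {
            f: {label: total / self.steps for label, total in row.items()}
            for f, row in self.totals.items()
        }

    def predict(self, features):
        # Prediction scores candidates with the averaged weights.
        return max(self.labels,
                   key=lambda y: score(self.averaged, features, y))
```

A practical implementation would likely replace the inner accumulation loop with a lazy update that records when each weight last changed, so that each training step touches only the active features.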

What to turn in:

  1. Source for a module which contains a multiclass averaged perceptron class, according to the above specification
  2. Some sample code showing precisely how the class is to be invoked
  3. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: Store weights in a hashtable of hashtables, where the outer key is the feature label, and the inner key is the candidate label—as always, collections.defaultdict is your friend.
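The nested-hashtable layout from this tip might look like the following sketch. The feature strings (such as "word=impulsive" and "*bias*") and the function name score are made up for illustration. One detail to watch: indexing a defaultdict with a missing key inserts that key, so a scoring function that reads weights[feature][label] directly will quietly grow the table with every unseen feature. The version below checks membership first.

```python
from collections import defaultdict

# weights[feature][label] -> float; updates to missing entries start from 0.0.
weights = defaultdict(lambda: defaultdict(float))
weights["word=impulsive"]["Aries"] += 1.0
weights["word=optimistic"]["Sagittarius"] += 1.0
weights["*bias*"]["Aries"] -= 0.5


def score(features, label):
    """Sums weights for the active features without inserting new keys."""
    total = 0.0
    for feature in features:
        if feature in weights:  # membership test does not create an entry
            total += weights[feature].get(label, 0.0)
    return total


aries = score(["*bias*", "word=impulsive", "word=unseen"], "Aries")
```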

Text classification

Once you have completed your multiclass averaged perceptron class, you will apply it to a novel text classification problem.

There are twelve Zodiac signs, each of which is thought to be associated with a particular set of personality traits. For instance, Aries (those born between March 20th and April 20th) are said to be "fiery", "ambitious", and "impulsive", and Sagittarius (those born between November 22nd and December 22nd) are said to be "intellectual", "optimistic", and "impatient". The CSV file horo-train.csv contains a random sample of 20,000 horoscopes labeled for sign, one per line. Each horoscope "document" contains suggestions or predictions for each sign. The horoscopes are a mixture of daily, weekly, monthly, and yearly predictions. Some are specific to professional life, love life, or teenage life.

Your second assignment is to write a program (horo-class.py) which uses your implementation of the multiclass averaged perceptron to predict the Zodiac sign associated with each horoscope. Your program will take two command-line arguments: the first will be a CSV file of training data (i.e., horo-train.csv), and the second will be a CSV file of test data. The program should:

  1. read in the training data indicated by the first command-line argument
  2. extract classifier features from the text column of the training data
  3. train the classifier to predict the sign column of the training data
  4. read in the test data indicated by the second command-line argument
  5. extract classifier features from the text column of the test data
  6. use the trained classifier to classify the text features of the test data
  7. compute the accuracy of this classification by comparing the predicted labels to the true labels in the sign column of the test data
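The seven steps above can be sketched roughly as follows. This assumes the CSV files begin with a header row naming the text and sign columns, and that the classifier takes a list of (features, label) pairs in fit and a single feature collection in predict. Both are assumptions, as are the function names read_data, accuracy, and run. In the real program the two paths would come from sys.argv[1] and sys.argv[2].

```python
import csv


def read_data(path):
    """Returns parallel lists of texts and signs from a CSV with a header row."""
    texts, signs = [], []
    with open(path, newline="", encoding="utf-8") as source:
        for row in csv.DictReader(source):
            texts.append(row["text"])
            signs.append(row["sign"])
    return texts, signs


def accuracy(predicted, gold):
    """Fraction of predicted labels that match the true labels."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def run(train_path, test_path, extract, classifier):
    """Trains on one file and returns accuracy on the other."""
    train_texts, train_signs = read_data(train_path)             # step 1
    train_features = [extract(t) for t in train_texts]           # step 2
    classifier.fit(list(zip(train_features, train_signs)))       # step 3
    test_texts, test_signs = read_data(test_path)                # step 4
    test_features = [extract(t) for t in test_texts]             # step 5
    predicted = [classifier.predict(f) for f in test_features]   # step 6
    return accuracy(predicted, test_signs)                       # step 7
```

Passing the feature extraction function in as an argument keeps training and test data on exactly the same code path, which avoids a common source of mismatched features.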

A small "development set", horo-dev.csv, has also been provided to you. You are to use this as test data while you are developing the classifier; it may help you to design your feature extraction function and select the number of epochs of training to perform.

What to turn in:

  1. Your program for Zodiac classification, according to the above specification
  2. Some sample terminal input/output showing precisely how your program is used
  3. A paragraph describing the text features your classifier uses, as well as features you tried but rejected

Once you turn in your classifier, I will evaluate your program's performance on a held-out test set. Since the data is approximately balanced across the 12 signs, a random classifier will have roughly 8% accuracy (≈ 1/12). You should be able to outperform this by at least a couple of points: my baseline system, using a very simple set of features, achieves 14% accuracy.

Tip: Don't forget about the bias term in your feature extraction code.

Tip: During feature extraction, consider applying at least some of the deterministic text normalization steps you used in MP1.

Tip: Some features you may want to consider include "personality trait" words (like "impulsive"), words referring to Zodiac signs (e.g., "Leo", but also "Leonine", etc.), words referring to planets, and words referring to months and seasons; you may even want to try to include any non-stopword word, or any word below some threshold of document frequency. This assignment is intentionally open-ended, so be creative! You are permitted to use any external data resource for your features.
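As a starting point only, the tips above (bias term, text normalization, non-stopword words, and a document-frequency threshold) could be combined along these lines. Everything here is an assumption made for illustration: the stopword list is a tiny placeholder, the normalization is just case-folding plus a simple token pattern rather than whatever you did in MP1, and the feature names and function signatures are invented.

```python
import re
from collections import Counter

# Tiny placeholder list; a real one would come from an external resource.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is",
                       "you", "your"})
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Case-folds the text and keeps runs of letters (with inner apostrophes)."""
    return TOKEN.findall(text.casefold())


def document_frequency(texts):
    """Counts the number of documents each token occurs in."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return df


def extract_features(text, df=None, max_df=None):
    """Returns a set of binary features, always including a bias feature."""
    features = {"*bias*"}
    for token in tokenize(text):
        if token in STOPWORDS:
            continue
        # Optionally drop tokens at or above the document-frequency threshold.
        if df is not None and max_df is not None and df[token] >= max_df:
            continue
        features.add(f"word={token}")
    return features
```

The document frequencies should be computed from the training data only and then reused unchanged when extracting features from the development or test data.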