MP1: Text Normalization

In this MP you will perform simple forms of text normalization, "cleaning" a portion of the Gigaword corpus so that it can be used to estimate data-driven word similarity models. This MP is intended to provide practical experience with several simple but important NLP tasks (and, in particular, familiarity with their Python APIs), including XML parsing, sentence tokenization, word tokenization, stopword filtering, stemming, and case-folding. The text you produce will be used later, in MP2.

This directory contains news documents from the Xinhua News Agency, the official press organization of the People's Republic of China, drawn from the 5th edition of the Gigaword corpus (LDC2011T07). Each file is a compressed XML file containing multiple documents. Each document is wrapped in a <DOC> tag, which also has a type attribute. Immediately below the <DOC> tag in the hierarchy, the story text is separated from the headline and dateline by the <TEXT> tag. Finally, paragraphs of story text are marked by <P> tags. Your assignment is to write a script (gigaprep.py) that decompresses these files, extracts the text making up the body of the news stories, performs sentence tokenization and word tokenization, removes stopwords, removes non-alphabetic tokens, stems all tokens, and then case-folds, writing the resulting data into a single large text file. Your script should proceed as follows.
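
Schematically, each document inside one of these files has roughly the following shape (the id value and the headline/dateline element names are shown only for orientation and may differ in detail):

    <DOC id="XIN_ENG_199501.0001" type="story">
      <HEADLINE> ... </HEADLINE>
      <DATELINE> ... </DATELINE>
      <TEXT>
        <P> First paragraph of the story body. </P>
        <P> Second paragraph of the story body. </P>
      </TEXT>
    </DOC>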

For each of the provided XML files, your script should decompress the file into memory and then parse it using a standard XML parsing library; you are not allowed to use regular expressions or a purpose-built text munging function. From each parsed file, it should extract every paragraph (<P>) in the text body (<TEXT>) of each story document (<DOC type="story">).
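
A minimal sketch of this step, assuming the gzip module and lxml's XPath API as suggested in the tip at the end (the file pattern and variable names here are only illustrative):

    import glob
    import gzip

    from lxml import etree

    for path in sorted(glob.glob("xin_eng_*.xml.gz")):  # illustrative file pattern
        with gzip.open(path, "rb") as source:
            tree = etree.parse(source)  # decompress and parse in memory
        # Select only <P> elements inside the <TEXT> of <DOC type="story"> documents.
        for p in tree.xpath('//DOC[@type="story"]/TEXT/P'):
            paragraph = " ".join("".join(p.itertext()).split())  # collapse internal newlines
            ...  # hand `paragraph` to the sentence tokenizer (next step)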

For each paragraph, it should apply sentence tokenization to segment the paragraph into sentences.
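
For example, using NLTK's pretrained Punkt model (nltk.sent_tokenize is a convenience wrapper around it, and it requires the Punkt data files mentioned in the tip at the end; the sample text is illustrative):

    from nltk.tokenize import sent_tokenize  # wraps the pretrained Punkt model

    paragraph = ("Russian troops today seized the Chechen capital Grozny. "
                 "The Press Service of the Russian Government announced this tonight.")
    sentences = sent_tokenize(paragraph)  # -> a list of two sentence strings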

For each sentence, it should apply the Penn Treebank word tokenizer to segment the sentence into tokens.
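
A sketch of this step with NLTK's Treebank tokenizer, which expects one sentence at a time:

    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    tokens = tokenizer.tokenize("Russian troops today seized the Chechen capital Grozny.")
    # -> ['Russian', 'troops', 'today', 'seized', 'the', 'Chechen', 'capital', 'Grozny', '.']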

For each tokenized sentence, remove all tokens which are not purely alphabetic (i.e., which do not match /^[a-zA-Z]+$/) or which occur in the NLTK English stopword list. Stem all remaining tokens using the Porter stemmer. Finally, convert all tokens to uppercase and print each sentence to standard output. For example, the first sentence of the file xin_eng_199501.xml.gz reads:

Russian troops today seized the Chechen capital Grozny and have fully kept the city under their control, the Press Service of the Russian Government announced this night.

After the above processing pipeline, it should read:

RUSSIAN TROOP TODAY SEIZ CHECHEN CAPIT GROZNI FULLI KEPT CITI CONTROL PRESS SERVIC RUSSIAN GOVERN ANNOUNC NIGHT
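
A sketch of the filtering, stemming, and case-folding steps described above, assuming NLTK's English stopword list and Porter stemmer (the function name normalize is only illustrative):

    import re

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOPWORDS = frozenset(stopwords.words("english"))  # lowercase words; requires the stopwords data
    STEMMER = PorterStemmer()
    ALPHABETIC = re.compile(r"^[a-zA-Z]+$")

    def normalize(tokens):
        """Filter, stem, and uppercase the tokens of one sentence."""
        # Note: the NLTK stopword list is lowercased; compare with t.lower() if you
        # also want to drop capitalized stopwords such as a sentence-initial "The".
        kept = [t for t in tokens if ALPHABETIC.match(t) and t not in STOPWORDS]
        return [STEMMER.stem(t).upper() for t in kept]

    # e.g., print(" ".join(normalize(tokens))) for each tokenized sentence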

What to turn in:

  1. Your script for text processing
  2. Sample terminal output: the first 100 sentences of the corpus
  3. A paragraph describing your approach, mentioning any interesting or unexpected problems or bugs you ran into as well as how you got around them

Tip: Save your wrists by downloading the data files with command-line tools like curl or wget.

Tip: The reference implementation is a single script using the gzip module for decompression, lxml.etree.parse for XML parsing (and in particular, the XPath API), nltk.tokenize.punkt.PunktSentenceTokenizer for sentence tokenization, nltk.tokenize.TreebankWordTokenizer for word tokenization, nltk.corpus.stopwords.words("english") for the stopword list, and nltk.stem.snowball.SnowballStemmer("english") for the stemmer. If you choose to use these libraries, you may need to install lxml and nltk. The easiest way to do this is via pip. You will also need to install data files for Punkt. There are two ways to do this: