CSC 9010: Natural Language Processing
Spring, 2005, Thurs, 6:15-9:00
Dr. Mary-Angela Papalaskari
Dr. Paula Matuszek
Homework 5: N-Grams
Due Feb 24, 6:15PM.
Complete exercise 6.4, pages 232-233 in the text, using the NLTK.
In other words:
- Identify two small corpora of (hopefully stylistically different) text.
- Calculate unigram and bigram frequencies. (Tutorial 3 shows how to do this.)
- Normalize the bigram table as shown by equation 6.11 in the text.
- Compare the bigram stats for your two corpora, answering the questions on p.233.
You will probably also want to include some utilities to help you answer the questions, such as printing out the most common bigrams, the total number of unigrams and bigrams, etc.
You can ignore smoothing.
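The counting and normalization steps above can be sketched in plain Python. This is only an illustration, not the NLTK approach from Tutorial 3: it uses the standard library's Counter, a naive whitespace tokenizer, and hypothetical function names (tokenize, ngram_counts, normalize_bigrams). It also assumes that equation 6.11 is the usual relative-frequency estimate P(w2 | w1) = C(w1 w2) / C(w1); check this against the text before relying on it.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def ngram_counts(tokens):
    """Return (unigram counts, bigram counts) for a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
    return unigrams, bigrams


def normalize_bigrams(unigrams, bigrams):
    """Divide each bigram count by the count of its first word.

    Assumed form of the normalization: P(w2 | w1) = C(w1 w2) / C(w1).
    No smoothing is applied, so unseen bigrams simply have no entry.
    """
    return {(w1, w2): count / unigrams[w1]
            for (w1, w2), count in bigrams.items()}


if __name__ == "__main__":
    tokens = tokenize("the cat sat on the mat")
    unigrams, bigrams = ngram_counts(tokens)
    probs = normalize_bigrams(unigrams, bigrams)

    print("total unigrams:", sum(unigrams.values()))
    print("total bigrams:", sum(bigrams.values()))
    print("most common bigrams:", bigrams.most_common(3))
    print("P(cat | the) =", probs[("the", "cat")])
```

Running the same functions on each of your two corpora gives two normalized tables to compare side by side. A real corpus needs a better tokenizer than split(), which leaves punctuation attached to words.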
Some possible sources of text to process:
- The Gutenberg Project: www.gutenberg.org.
More than 13,000 public-domain books in many subjects and languages.
- The Baen Webscription Free Books: www.webscription.org/free.
Several dozen science fiction and fantasy novels by various authors.
- Google Groups: groups-beta.google.com.
Newsgroups on topics ranging from artificial intelligence to zippo.
- And of course, the corpora from the NLTK data.