CSC 9010: Text Mining Applications
Spring, 2012
Paula Matuszek
Adjunct Professor, Villanova
E-mail: Paula.Matuszek@villanova.edu or Paula.Matuszek@gmail.com
Phone: (610) 647-9789

Some example code that may be useful when working with NLTK.

corpus_in.py : Read in all non-hidden files in a directory and turn them into an NLTK corpus. (Thanks to Casey Burkhardt and Tom Carpenter)

Jan31Lab.py : code we went through in lab 3: read in files, create a feature set, run a Naive Bayes classifier, and examine the results. (This code is largely taken from http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html)
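The core pattern from that lab can be sketched like this (the feature extractor and toy data are my own illustrative stand-ins, not the lab's actual documents):

```python
import nltk

def word_features(words):
    """Bag-of-words feature set: each word present maps to True."""
    return {word: True for word in words}

# Toy labeled documents standing in for the lab's data (illustrative only).
labeled_docs = [(['fast', 'fun', 'great'], 'pos'),
                (['great', 'fun'], 'pos'),
                (['slow', 'dull', 'bad'], 'neg'),
                (['bad', 'dull'], 'neg')]
train_set = [(word_features(words), label) for words, label in labeled_docs]

# Train the classifier and inspect it.
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_features(['fun', 'great'])))
classifier.show_most_informative_features(5)
```

In the lab the labeled documents come from files and the data is split into training and test sets, with nltk.classify.accuracy used to evaluate the result.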

Feb 7 Lab.py : code we went through in lab 4: read in a set of files, create a feature set based on word counts and a class label based on the containing directory, then classify the documents in the corpus. (This code is largely taken from http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html)

Feb 14 Lab K-Means : k-means clustering
Feb 14 Lab Hierarchical : Hierarchical clustering
Feb 14 Lab EM : EM clustering

Creating a count vector : creating an array of bag-of-words vectors for clustering, using simple word counts, for nltk.Text objects
Creating a TF*IDF vector : creating an array of bag-of-words vectors for clustering, using TF*IDF weights, for nltk.Text objects
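Both kinds of vectors can be sketched with nltk.Text and nltk.TextCollection (the toy documents are illustrative stand-ins for the lab's corpus):

```python
from nltk import Text, TextCollection

# Toy documents standing in for the corpus (illustrative only).
docs = [Text('the cat sat on the mat'.split()),
        Text('the dog ate the bone'.split()),
        Text('the cat chased the dog'.split())]

# The vocabulary: every word that appears anywhere in the collection.
collection = TextCollection(docs)
vocabulary = sorted(set(collection))

# Count vectors: one bag-of-words vector per document, using word counts.
count_vectors = [[doc.count(word) for word in vocabulary] for doc in docs]

# TF*IDF vectors: same layout, but weighted with TextCollection.tf_idf.
tfidf_vectors = [[collection.tf_idf(word, doc) for word in vocabulary]
                 for doc in docs]
```

Note that a word appearing in every document (here "the") gets a TF*IDF weight of zero, since its inverse document frequency is log(N/N) = 0.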

Creating a vector from nltk.corpus objects : for many problems it's easier to use the nltk.corpus.reader classes to read in the files. It's a trade-off: the reader makes it easier to get the documents in and provides a method to generate a list of the words in the corpus, but the count and tf*idf methods are available only for nltk.Text objects. So this code uses a reader and then makes nltk.Text objects out of the word lists.

Note: many of these code examples are based on Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper, found at http://nltk.googlecode.com/svn/trunk/doc/book/book.html, or on the source code in the NLTK repository.