CSC 9010: Text Mining Applications
Adjunct Professor, Villanova
E-mail: Paula.Matuszek@villanova.edu or Paula.Matuszek@gmail.com
Phone: (610) 647-9789
Some example bits of code that may be useful when working with NLTK
corpus_in.py : Read in all
non-hidden files in a directory and turn them into an NLTK corpus.
(Thanks to Casey Burkhardt and Tom Carpenter)
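The same idea can be sketched with NLTK's PlaintextCorpusReader; the temporary directory, filenames, and contents below are toy stand-ins for your own document folder, and corpus_in.py's exact details may differ:

```python
import os
import tempfile
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Build a small demo directory (a stand-in for your own document folder).
root = tempfile.mkdtemp()
for name, text in [('a.txt', 'The cat sat.'),
                   ('b.txt', 'The dog ran.'),
                   ('.hidden', 'should be skipped')]:
    with open(os.path.join(root, name), 'w') as f:
        f.write(text)

# The fileid pattern '[^.].*' matches every file except hidden (dot) files.
corpus = PlaintextCorpusReader(root, r'[^.].*')

print(sorted(corpus.fileids()))   # the hidden file is excluded
print(corpus.words('a.txt'))      # tokenized words of one document
```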
Jan31Lab.py : code we went through in lab 3:
read in files, create a feature set, run a Naive Bayes classifier and
look at results. (This code is largely taken from
the NLTK book.)
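That workflow can be sketched in a few lines, using a contains(word) feature extractor in the style of the NLTK book; the vocabulary and the tiny training set below are invented for illustration:

```python
from nltk.classify import NaiveBayesClassifier

def document_features(words, vocabulary):
    # Binary contains(word) features over a fixed vocabulary,
    # in the style of the NLTK book's document classifier.
    present = set(words)
    return {'contains(%s)' % w: (w in present) for w in vocabulary}

vocabulary = ['good', 'great', 'bad', 'awful']

# Tiny invented training set: (feature dict, label) pairs.
train_set = [
    (document_features('a good great film'.split(), vocabulary), 'pos'),
    (document_features('good fun and great'.split(), vocabulary), 'pos'),
    (document_features('a bad awful film'.split(), vocabulary), 'neg'),
    (document_features('bad awful and boring'.split(), vocabulary), 'neg'),
]

classifier = NaiveBayesClassifier.train(train_set)
label = classifier.classify(document_features('great and good'.split(), vocabulary))
print(label)
classifier.show_most_informative_features(2)   # look at the results
```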
Feb 7 Lab.py : code we went through in lab 4:
Read in a set of files; create a feature set based on word count and a
class label based on directory, then classify the documents in the corpus.
(This code is largely taken from the NLTK book.)
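A minimal sketch of that lab, assuming a CategorizedPlaintextCorpusReader whose cat_pattern derives the class label from the directory name; the directory tree, documents, and word-count features below are invented:

```python
import os
import tempfile
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.classify import NaiveBayesClassifier

# Demo tree with one subdirectory per class (stand-ins for real data).
root = tempfile.mkdtemp()
docs = {'pos/p1.txt': 'good great fun', 'pos/p2.txt': 'great good',
        'neg/n1.txt': 'bad awful dull', 'neg/n2.txt': 'awful bad'}
for path, text in docs.items():
    os.makedirs(os.path.join(root, os.path.dirname(path)), exist_ok=True)
    with open(os.path.join(root, path), 'w') as f:
        f.write(text)

# cat_pattern pulls the category label out of the directory name.
corpus = CategorizedPlaintextCorpusReader(root, r'\w+/.*\.txt',
                                          cat_pattern=r'(\w+)/.*')

def count_features(fileid):
    # Word-count features: each word maps to how often it occurs.
    feats = {}
    for w in corpus.words(fileid):
        feats[w] = feats.get(w, 0) + 1
    return feats

labeled = [(count_features(f), corpus.categories(f)[0])
           for f in corpus.fileids()]
classifier = NaiveBayesClassifier.train(labeled)
print(classifier.classify({'good': 1, 'great': 2}))
```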
Feb 14 Lab K-Means : k-means clustering
Feb 14 Lab Hierarchical : Hierarchical clustering
Feb 14 Lab EM : EM clustering
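Of the three, k-means is the simplest to sketch with nltk.cluster; the 2-D toy vectors here stand in for the bag-of-words document vectors described below, and GAAClusterer (hierarchical) and EMClusterer are near drop-in alternatives:

```python
from numpy import array
from nltk.cluster import KMeansClusterer, euclidean_distance

# Four toy 2-D vectors forming two obvious groups; in practice these
# would be count or TF*IDF vectors built from documents.
vectors = [array(v) for v in [[0.0, 0.0], [0.1, 0.2],
                              [5.0, 5.1], [5.2, 4.9]]]

clusterer = KMeansClusterer(2, euclidean_distance, repeats=10,
                            avoid_empty_clusters=True)
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(assignments)   # one cluster id per input vector
```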
Creating a count vector : creating an array of
bag-of-words vectors for clustering, using simple word counts, for nltk.Text objects
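One way to build such count vectors, sketched with two toy documents wrapped as nltk.Text objects (Text.count gives the per-document occurrence count of a word):

```python
from nltk.text import Text

# Two tiny documents as nltk.Text objects.
texts = [Text('the cat sat on the mat'.split()),
         Text('the dog sat'.split())]

# A shared vocabulary in a fixed order, so the vectors are comparable.
vocab = sorted(set(w for t in texts for w in t.tokens))

# One bag-of-words count vector per document.
vectors = [[t.count(w) for w in vocab] for t in texts]
print(vocab)
print(vectors)
```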
Creating a TF*IDF vector : creating an array of
bag-of-words vectors for clustering, using TF*IDF, for nltk.Text objects
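A sketch using nltk.TextCollection, whose tf_idf method computes (count / document length) * log(N / document frequency); the three toy texts are invented:

```python
from nltk.text import Text, TextCollection

texts = [Text('the cat sat'.split()),
         Text('the dog sat'.split()),
         Text('the dog ran'.split())]

# TextCollection computes idf over the whole collection.
collection = TextCollection(texts)
vocab = sorted(set(w for t in texts for w in t.tokens))

# One TF*IDF vector per document.  A word that appears in every
# document (here 'the') gets idf = log(3/3) = 0, hence zero weight.
vectors = [[collection.tf_idf(w, t) for w in vocab] for t in texts]
print(vocab)
print(vectors[0])
```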
Creating a vector from nltk.corpus objects : for many
problems it's easier to use the nltk.corpus.reader classes to read in the
files. It's a trade-off; it's easier to get the documents in and there's a
method to generate a list of the words in the corpus. But the count
and tf*idf methods are only available for nltk.Text objects. So this code
uses a reader but then makes nltk.Text objects out of the word lists.
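That reader-to-Text conversion can be sketched like this, with toy files in a temporary directory standing in for a real document folder:

```python
import os
import tempfile
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.text import Text, TextCollection

# Toy files in a temporary directory (stand-ins for real documents).
root = tempfile.mkdtemp()
for name, text in [('a.txt', 'the cat sat'), ('b.txt', 'the dog ran')]:
    with open(os.path.join(root, name), 'w') as f:
        f.write(text)

# The reader makes it easy to get the documents in...
corpus = PlaintextCorpusReader(root, r'[^.].*')

# ...and wrapping each word list as nltk.Text restores the count()
# and tf_idf() machinery that is only available for Text objects.
texts = [Text(corpus.words(f)) for f in corpus.fileids()]
collection = TextCollection(texts)
print([len(t) for t in texts])
```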
Note: many of these code examples are based on Natural Language
Processing with Python by Steven Bird, Ewan Klein, and Edward Loper,
and on the source code in the NLTK repository.