CSC 5930/9010: Text Mining
Fall, 2013
Paula Matuszek
Adjunct Professor, Villanova
E-mail: or
Phone: (610) 647-9789

Description: The internet has changed the nature of problems people face in dealing with information. Online access to documents is now largely taken for granted; the amount of information available in text form is massive and still expanding. This has created an entirely new problem: how to deal with a flood of documents relevant to some question.

Search tools such as Google have become increasingly sophisticated at retrieving an appropriate document or piece of information. However, search alone cannot deal with knowledge that is spread across a large corpus of documents. Other tools are needed for that; several such technologies are grouped under the general term text mining. Text mining tools draw on techniques from natural language processing, data mining, machine learning, and other areas of AI, and are applied to large corpora of documents.

This course is an exploration of text mining. We will begin with a basic introduction to the field and then explore two tools in detail: NLTK (Natural Language Toolkit) and GATE (General Architecture for Text Engineering). Topics include using natural language processing to prepare text, categorization, clustering, summarization, and information extraction.

The class will have three components.

Text Processing with GATE (Version 6). Hamish Cunningham, Diana Maynard, Kalina Bontcheva. GATE, 2011.
ISBN-10: 0956599311
ISBN-13: 978-0956599315

Note that this book contains approximately the same material as the online user's manual at the GATE site. You may use the online manual instead if you prefer.

Course links:
Requirements and Grading for 5930
Requirements and Grading for 9010
Academic Integrity
Student Questionnaire

I will be on campus primarily to teach class; I can meet with you before or after class, or by arrangement at other times. Email is the best way to reach me.