CSC 9010: Text Mining Applications

Fall 2003, Thurs, 6:15-8:45

Paula Matuszek
Principal Computing Scientist, GlaxoSmithKline
E-mail: Paula_A_Matuszek@gsk.com
Phone: (610) 270-6851

Note: links to assignments and class presentations are on the syllabus page.
Syllabus

Description: The World Wide Web has changed the nature of the problems people face in dealing with information. Online access to documents is now largely taken for granted. This has stimulated a number of technical approaches to handling large amounts of text, as people look for ways to cope with the flood of information now available.

Several such technologies are grouped under the general term of text mining. Text mining tools use a combination of statistics, natural language processing, and other artificial intelligence techniques to classify, categorize and summarize documents, and to extract information from the documents into a usable form such as a semantic net or database.

This course will be a seminar in applying text mining tools. We will cover a basic introduction to the field of text mining, followed by hands-on experience with several kinds of text mining tools. We will install and apply tools in the areas of basic natural language processing, categorization, clustering, summarization and information extraction.

The class will have three components.

For the first component, I will present basic concepts and topics in the area, and assign homework intended to reinforce or add insight into these concepts.

For the second component, we will explore one or more specific text mining applications, starting with GATE.

For the third component, we will have presentations on additional text mining tools, applications and projects. These will include presentations or demos from some vendors, some that interest me, and some from the students.

Syllabus
Requirements and Grading
Academic Integrity
Student Questionnaire

I am usually on campus only to teach my class; I can meet with you before or after class, or by arrangement at other times. Email is the best way to reach me.

Prerequisites: 8301 (Design and Analysis of Algorithms) and one of: 8520, Special Topics--9010 Web Mining, or permission of instructor. Students should also feel comfortable downloading and installing software from the web.