CSC 9010, Spring, 2012
Special Topic: Text Mining Applications
Tues 6:15 - 9:00, Mendel G87
Dr. Paula Matuszek


This course will be taught as a seminar, and the syllabus may be modified significantly as the semester progresses.

Jan 17: Intro. Presentation.  Lab.  Assignment 1.

Jan 24: Text Features, NLTK. Presentation.  Lab. Assignment 2.

Jan 31, Feb 7:   Classifying documents. Presentation.  Lab.
Assignment 3. Project Info.

Feb 14:  Clustering. Presentation.  Lab. Assignment 4.

Feb 21:  Social Media Analytics, Chris Bounds.

Feb 21:  Information Extraction Introduction. Presentation.

Feb 28:  Guest Speaker: Susan LeBeau. Linguamatics and I2E. Assignment 5.

Mar 6: Spring break 

Mar 13, Mar 20:  Information Extraction; GATE. 
GATE Overview Presentation. GATE ANNIE PowerPoint. GATE ANNIE PDF.
GATE Lab. JAPE Lab PowerPoint. JAPE Lab PDF.
Assignment 6. Assignment 7.

Mar 27:  Information Retrieval and Search.
Information Retrieval Presentation PowerPoint. Information Retrieval Presentation PDF.

Apr 3:   Document Summarization.
Summarization Presentation PowerPoint. Summarization Presentation PDF.

Apr 10:  Machine Learning in GATE.
GATE ML Presentation PowerPoint. GATE ML Presentation PDF.

Apr 17:  Guest speaker: Cynthia Matuszek
Talking to Robots: How can robots learn to understand unstructured text?

Apr 24:   Projects.

Ron Boehm: I would like to work with my email corpus using the full-text search features in SQL Server 2012: uploading the corpus, extracting the features, extracting and stemming the bag of words (BOW), removing stop words, and investigating classification of the emails using the semantic search feature.
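
The preprocessing steps named in this proposal (building a bag of words, stemming it, and removing stop words) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SQL Server 2012 approach the project describes: the stop-word list, the `crude_stem` helper, and the `bag_of_words` function are all hypothetical names invented here, and the suffix stripping is a rough stand-in for a real stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}


def crude_stem(word):
    """Strip a few common suffixes in turn; a rough stand-in for a real stemmer."""
    for suffix in ("s", "ing", "ed"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count the results."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)
```

Calling `bag_of_words` on an email body returns a `Counter` that maps each stemmed term to its frequency, which could then serve as the feature vector for a classifier.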

Jeff Zurita: ReVerb Open Information Extraction. For my class presentation, I plan to examine the ReVerb Open Information Extraction software. ReVerb was created to overcome some limitations found in other IE software, such as TextRunner and WOE. The presentation will summarize a paper written by the authors of ReVerb that shows the limitations of TextRunner and WOE and what ReVerb implemented to overcome them. The paper, titled "Identifying Relations for Open Information Extraction", also presents experiments comparing ReVerb with several other programs on how effectively each performs coherent, meaningful information extraction (as opposed to uninformative extraction). In addition, I anticipate presenting the results of a small ReVerb demonstration along with my impressions of its performance.

Casey Burkhardt: For my project, I hope to use text mining techniques and a social networking API as a dynamic corpus to build a system that will generate adjective associations for a given named entity. The implementation will prompt a user for a keyword named entity, assemble a corpus of acceptable documents, analyze the parts of speech in the documents, and use a weighted bag-of-words approach to identify the most relevant adjectives associated with the named entity.

Chris Bounds: The focus of the project will be the real-world application of the analyses and text mining techniques that we are performing in class. My corpus consists of a year's worth of social media sound bites in which the topic is one of three major online vacation booking sites: Travelocity (main focus), Expedia, or HotWire. I plan to demonstrate how this data can be cleaned, parsed, structured, and displayed in a visually intuitive way, enabling companies to integrate social media data into their pre-existing marketing strategies and optimize their overall marketing mix.

May 1:   Projects.

Nakul Rathod: For my final project, I would like to train classifiers to distinguish between three different subject areas of computer science syllabi and other web pages (such as faculty teaching pages, lists of courses, etc.).

Jeff Montone: SAS

Tom Carpenter: A tool to do sentiment analysis on a specified topic over a specified period of time using tweets.

Randy Escoto: The goal of this project is to assemble a corpus of articles about software development methodologies and to answer interesting questions about that corpus. Questions to be explored will include trend analysis of methodologies and techniques, historical sequencing of methodologies, and cross-correlation of techniques associated with methodologies. Work on the project will focus primarily on the more interesting questions about the corpus rather than on detailing the formulation of the corpus itself.

Carmen Nigro: For my project I plan to explore the topic of concept linkage. I plan to present the topic to the class and then give a quick demo using a concept linkage tool, possibly c-link.

Anthony Dovelle: For my talk, I plan to discuss spam filtering methods currently in use by both individual users and more prominent software companies. I also plan to drill down and analyze the various text mining techniques used for spam filtering. Text-based spam filtering can be classified into three general schemes: word-based analysis, rule-based analysis (heuristics), and statistical analysis, each more complex than the last and each with its own advantages and drawbacks. Furthermore, I plan to implement each technique in its most basic form in order to explore the inner workings of the code for myself. To supplement these thrown-together code samples, I also plan to showcase other, more complex open-source implementations from various authors to demonstrate effective spam filtering.
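
As an illustration of the simplest of the three schemes named above, word-based analysis, the following Python sketch flags a message when enough of its tokens appear on a blocklist. It is a minimal example under stated assumptions, not the implementation planned for the talk: the `SPAM_WORDS` list, the `spam_score` and `is_spam` names, and the 0.2 threshold are all hypothetical choices made here for demonstration.

```python
import re

# Hypothetical blocklist chosen only for illustration.
SPAM_WORDS = {"viagra", "lottery", "winner", "free"}


def spam_score(message):
    """Return the fraction of tokens in the message that are on the blocklist."""
    tokens = re.findall(r"[a-z]+", message.lower())
    if not tokens:
        return 0.0
    return sum(t in SPAM_WORDS for t in tokens) / len(tokens)


def is_spam(message, threshold=0.2):
    """Flag the message when its blocklist fraction meets the (assumed) threshold."""
    return spam_score(message) >= threshold
```

A rule-based or statistical filter would replace the fixed blocklist with hand-written heuristics or with word probabilities learned from labeled mail, respectively.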