Dr. Paula Matuszek, Spring 2002
Principal Computing Scientist, GlaxoSmithKline
E-mail: Paula_A_Matuszek@gsk.com
Phone: (610) 270-6851

CSC 9010-001:  Special Topic:
Web Mining and Knowledge Discovery

This class has been moved. Beginning Jan 31 we will meet in Mendel G30.

Prereq: 8301 (Design and Analysis of Algorithms)

The World Wide Web has changed the nature of the problems people face in dealing with information. Simple access to information is no longer the major issue; far more information is available on the web than anyone can read. In addition, the growth of shopping and other online activity is generating a new kind of information that is potentially very useful. The major issue now is dealing with the resulting flood.

This course will be a seminar looking at some of the research around this issue. There will be three primary foci:

Locating documents. Search engines have become increasingly sophisticated, both in locating documents and in establishing relevance. We will review some of the methods that are used to efficiently spider and index documents and the methods that are used to rank documents for relevance.
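To give a flavor of the indexing and ranking side, here is a minimal sketch in Python. It assumes an inverted index and TF-IDF scoring, a common textbook approach and not necessarily the exact method we will cover or what any particular search engine uses. The function names `build_index` and `rank` and the sample documents are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> {doc_id: count of term in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def rank(query, index, num_docs):
    """Return doc ids ordered by a simple TF-IDF score (highest first)."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term occurs in no document
        # Rare terms get a higher weight than terms found in most documents.
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Break ties by doc id so the ordering is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical three-document collection.
docs = {
    "a": "web mining and text mining",
    "b": "web search engines",
    "c": "cooking recipes",
}
index = build_index(docs)
print(rank("web mining", index, len(docs)))
```

Real engines add much more (stemming, stop words, link analysis, length normalization), but the index-then-score shape is the same.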

Text Mining. Once you have 500 hits for your search, what then? The biggest portion of this course will be focused on text mining tools, which use a combination of statistics, natural language processing, and other artificial intelligence techniques to classify, categorize, and summarize documents, and to extract information from the documents into a usable form such as a semantic net or database.
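As one small illustration of the statistical side of classification, here is a sketch of a naive Bayes text classifier with add-one smoothing. This is just one technique among many, chosen because it is short; the names `train` and `classify`, the labels, and the training sentences are all invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data.
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
model = train(training)
print(classify("cheap pills", model))
```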

Mining clickstream and other web usage data. This is a very fast-growing area; we will look at the application of data-mining techniques to some questions such as finding problems in web-site design and improving marketing on web sites. We will also discuss privacy issues related to captured user data.
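For a sense of what a clickstream question looks like, here is a sketch that computes an exit rate per page: the fraction of views of a page that were the last click in a session. A high exit rate on a page that should lead somewhere (a shopping cart, say) can hint at a design problem. The log format of `(session_id, timestamp, page)` tuples and the function name `exit_rates` are assumptions made for this example; real server logs need parsing and session reconstruction first.

```python
from collections import Counter, defaultdict


def exit_rates(clicks):
    """Return {page: exits / views} from (session_id, timestamp, page) tuples."""
    sessions = defaultdict(list)
    for session_id, timestamp, page in clicks:
        sessions[session_id].append((timestamp, page))

    views = Counter()
    exits = Counter()
    for events in sessions.values():
        events.sort()  # order each session's clicks by timestamp
        for _, page in events:
            views[page] += 1
        exits[events[-1][1]] += 1  # last page seen in the session

    return {page: exits[page] / views[page] for page in views}


# Hypothetical click log for three sessions.
clicks = [
    ("s1", 1, "/home"), ("s1", 2, "/cart"),
    ("s2", 1, "/home"), ("s2", 2, "/cart"), ("s2", 3, "/checkout"),
    ("s3", 1, "/home"),
]
for page, rate in sorted(exit_rates(clicks).items()):
    print(page, round(rate, 2))
```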

The course will fall into two general sections. The first section will be a relatively formal presentation of some basic ideas in these areas. The idea is to give us all some background for understanding later topics and presentations. I am not going to go into a great deal of detail about algorithms and technical issues at this level. The second part of the course will be more exploratory; it may include guest speakers, student presentations, and demonstrations of projects and tools. Basically it will cover things that I think are interesting and relevant. I'm also open to suggestions or requests.

The textbook for the course is Finding Out About, by Richard Belew, ISBN # 0-521-63028-2. Be sure you get a copy with a CD, because we will probably use some of the data sets from it for assignments.

Note: links to assignments and class presentations are on the syllabus page.

Syllabus
Requirements and Grading
Academic Integrity
Student Questionnaire

I am usually on campus only to teach my class; I can meet with you before or after class, or by arrangement at other times.  Email is the best way to reach me.