Villanova University

Department of Computing Sciences


CSC 5930 Web Crawling


Professor:  Dr. Lillian N. Cassel

E-mail:  lillian.cassel@villanova.edu  (Please put "CSC5930" in the subject line so it will get my attention quickly)
Office hours:  Tuesday and Wednesday 3 to 5 pm.  Other times by arrangement.  If my door is open, you are welcome to come in any time.  


Semester: Fall 2011

Course Description: How do they do that?  Many web applications, from Google to travel sites to resource collections, present results found by crawling the Web for materials relevant to the application theme.  Crawling the Web involves technical issues, politeness conventions, characterization of materials, decisions about the breadth and depth of a search, and choices about what to present and how to display results.  This course will explore all of these issues.  In addition, we will address what happens after you crawl the Web and acquire a collection of pages.  You will decide on the questions, but some possibilities might include these:  What summer jobs are advertised on web sites in your favorite area?  What courses are offered in most (or few) computer science departments?  What theatres are showing what movies?  Students will develop a web site built by crawling at least some part of the Web to find appropriate materials, categorize them, and display them effectively.  Prerequisites: some programming experience; CSC 1051 or the equivalent.

Textbook:  We have no textbook for this class; however, some relevant references will be mentioned, and we will use many websites.

Course Goals:

Students will achieve sufficient understanding of the structure of the web and the processes of Web crawling and information extraction to be able to create web applications that depend on retrieving and managing information obtained from the Web.

Information about the course management:

Introduction : 

Every indication is that the best way to learn is to be actively involved in discovery, in creating your own knowledge. As a result, your active participation is a fundamental requirement of this course.   Lectures will generally be short and interspersed with exercises to demonstrate understanding and the ability to move to the next stage of activity.  There will be discussions and exploration carried out by groups during class periods. 

Additional information : 

Attendance: I assume that every student will attend every class unless I have heard previously that you have a reason for missing class. In all cases, you are responsible for discovering what was done in any class you miss. Just keeping up with class by hearing about happenings from other students is not sufficient, however. Your input into our discussions is important. Your absence not only hurts you; it deprives the rest of the class of the valuable contributions you would have made. A part of the grade is reserved for active participation in every class session.

Online component: 

This class will include regular classroom sessions and also some online sessions.  There will be some occasions when I participate in the class from a distance.  For these classes, you will need access to a computer and an internet connection sufficient to display live video.  If you have this type of resource, you may attend the class from your chosen location.  If you do not, the resources will be provided for you, and you will be expected to attend at Villanova.  In any case, your attendance and active participation will be required.

Writing:

Computer scientists and software engineers must communicate with people, as well as with computers.  During this class there will be opportunities for each student to present problems and solutions to the class as a whole.  There will not be any long papers.  Each student will do a significant presentation to the class about the project completed.

I like this quote, found on the department web page:

    Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
    — Martin Fowler
We aspire to be good programmers.


Grading Policy

Grades will be based on successful and timely completion of assignments and projects, and on active participation in class and in online discussions.  The plan is to have frequent small tests rather than a midterm and a final.  These serve to demonstrate that each student understands the material and is able to work alone or with others.  If these go well, we will not have any other exams.  If it appears that the forced focus of major exams is required, we will have them.



Course Web Presence

There are advantages and disadvantages to the Blackboard course management system used at Villanova.  I prefer that my materials be open and accessible, so I will use this public web site.  We will use the Blackboard site from time to time to take advantage of tools that are there.  Be sure to log in to the course site and become familiar with its use.


Semester Schedule

The following calendar will be developed as the class progresses. 


Week


Subject

Links to Slides

Notes/Assignments

Activity

1

8/30

The structure of the Web.  How does crawling work?

Introduction and background

Read: As We May Think  Written in 1945, this article by Vannevar Bush is credited with the original notions that led to hypertext, the Web, and digital libraries.
Read: The World Wide Web: Origins and Beyond (note: This was written in 1995, very early in the history of the Web, and updated 10 years later.)
Read: The Diverse and Exploding Digital Universe

What project do you want to do?  If you cannot decide, I can give you one, but it would be good to have one that is meaningful to you.  Be prepared to discuss your choice next week.

2

9/6

Architecture of a crawler.  A language well suited to the task

Crawling and Intro to Python

Program assignment described in class, due next week.  Read as much as needed of the Python tutorial or some examples from the Beginner's Guide

Discuss readings, and proposed projects.

3

9/13

Python for fetching from the web.
Visualization of web crawling.

Python for the web, WebSphinx

Nakul's WebSphinx lab at his site and a copy at the class site.

Quiz 1 -- about 30 minutes
Lab exercise using WebSphinx
Project proposal reviews

4

9/20

Creating a database to hold the fetched documents, dealing with messy HTML, reviewing an open-source crawler




5

9/27

Continued Python, crawler, database, BeautifulSoup, Nutch

Python snippets from the class
Google Python Class
Nutch crawling, step-by-step
Ideas for writing your own crawler



6
10/04
Introduction to Information Retrieval
IntroIR
Due:  Your own crawler.  See instructions
Quiz 2 -- About 30 minutes
Demonstration of crawlers


10/11
Fall Break



7
10/18
Elements of searching.
Indexing. 
Creating an index of your documents.
Indexing-1
Indexing-2
Indexing-3
Voice version of part 3

Due: Detailed report on your project plan
Who is on your team, or are you working alone?
What is the subject area of your project?  Where do you plan to crawl?  How will you specify what you want to fetch?  What do you plan to do with the materials that you retrieve?  What goals does your project serve? (See further description in Blackboard in the assignment upload spot)
I can give you project ideas if you need them.  If you have your own ideas, that is fine.  One way or the other, you must have a detailed plan by the first day of the second half of the term.
8
10/25
Continued indexing.  Vector space model for information retrieval
Indexing and vector space


9
11/1
Hands-on laboratory  Indexing and searching tools

Lucene and Solr are open source projects that provide all the indexing and searching capability that your project will require
Quiz 3 -- probably 45 minutes.  Introduction to Information retrieval through indexing and vector spaces.
10
11/8
Search engines
Search
Anatomy of a Search Engine - Sergey Brin and Larry Page
Google Page Rank
These links are by way of the materials made available by C. Lee Giles of Penn State.
11
11/15
Introducing the Semantic Web
Semantic Web
The Semantic Web in Breadth
The Semantic Web: An Introduction


12
11/22
Review and put things together

Class presentations of projects
Overview of the topics

13
11/29



Last quiz -- covers the whole semester -- about an hour.
14
12/6
Project presentations

Last class day

15






References: (This list will expand during the semester)
1) Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze.  Introduction to Information Retrieval.  Cambridge University Press, 2008.  Book website: http://informationretrieval.org  Most of the book is there and open to anyone.
2) Hatcher, Erik, and Otis Gospodnetić.  Lucene in Action.  Manning Publications Co.
3) Croft, W. Bruce, Donald Metzler, and Trevor Strohman.  Search Engines: Information Retrieval in Practice.  Addison Wesley, 2010.
4) Foord, Michael.  How to Fetch Internet Resources using the urllib Package.  http://docs.python.org/dev/howto/urllib2.html
5) urllib2 -- Extensible Library for Opening URLs.  http://docs.python.org/library/urllib2.html

Python Programming Project #1:
Write a program to do the following:
* Input a text string
* If the string has an anchor tag, extract the link, append it to a list of links
* If the string has HTML tags, strip the tags and print out the remaining string
* Finally, print out the list of links, one per line
Use Python for this assignment.  You may use whatever you wish for your project, but give Python a try for this much. 
You may collaborate on this assignment, but each person must be ready to explain any detail of the resulting program and its testing.
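To make the expected behavior concrete, here is a minimal sketch of one possible approach using only the Python standard library.  The regular expressions are deliberately simplistic (they assume double-quoted href attributes and will not handle all real-world HTML); a production version would use a proper parser.  The function name `process_line` is illustrative, not required.

```python
import re

def process_line(line, links):
    """Extract anchor-tag links from `line`, append them to `links`,
    strip any HTML tags, and print the remaining text.

    Simplistic sketch: assumes href values are in double quotes.
    Returns the stripped string for convenience.
    """
    # If the string has an anchor tag, extract the link target(s).
    links.extend(re.findall(r'<a\s[^>]*href="([^"]*)"', line, re.IGNORECASE))
    # If the string has HTML tags, strip them and print what remains.
    stripped = re.sub(r'<[^>]+>', '', line)
    print(stripped)
    return stripped

links = []
process_line('Visit <a href="http://example.com">our site</a> today.', links)
# Finally, print out the list of links, one per line.
for url in links:
    print(url)
```

A fuller solution would also read the input string from the user (e.g. with `input()`) and cope with single-quoted or unquoted attribute values.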

Your basic crawler (Individual work)

–Read a seed url

–Fetch the page

–Extract links from the page

•Put the links on a queue of pages to visit

–Extract the text from the page, stripping off the html code

•Deal with possibly bad html

•Put the extracted documents in a database for later analysis

–Take the next url from the queue and repeat

–How will you deal with robot exclusions?

–What will you do about rapid access to a server?
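One way the loop above might be organized is sketched below.  Everything here is illustrative, not a required design: the regexes are simplistic, `fetch` and `allowed` are injectable parameters (a real crawler would fetch with `urllib.request.urlopen` and answer the robots question with `urllib.robotparser`), the `delay` parameter is one crude answer to the rapid-access question, and the toy `pages` dictionary stands in for the live Web.  Storing `documents` in a real database (e.g. `sqlite3`) is left as part of the assignment.

```python
import re
import time
from collections import deque
from urllib.parse import urljoin

def extract_links(base_url, html):
    """Absolute URLs for every anchor tag on the page (simplistic regex)."""
    hrefs = re.findall(r'<a\s[^>]*href="([^"]*)"', html, re.IGNORECASE)
    return [urljoin(base_url, h) for h in hrefs]

def strip_tags(html):
    """Strip HTML tags, keeping the text for later analysis."""
    return re.sub(r'<[^>]+>', ' ', html).strip()

def crawl(seed, fetch, allowed=lambda url: True, delay=0.0, max_pages=10):
    """Breadth-first crawl from a seed URL.

    `fetch(url)` returns the page's HTML (injectable so the loop can be
    tested offline); `allowed(url)` is where a robots.txt check would go;
    `delay` throttles requests so no one server is hit too rapidly.
    Returns a dict mapping visited URLs to their extracted text.
    """
    queue, seen, documents = deque([seed]), {seed}, {}
    while queue and len(documents) < max_pages:
        url = queue.popleft()
        if not allowed(url):               # honor robot exclusions
            continue
        html = fetch(url)                  # fetch the page
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)         # queue of pages to visit
        documents[url] = strip_tags(html)  # store text for later analysis
        time.sleep(delay)                  # politeness delay
    return documents

# Toy in-memory "web" so the loop can run without a network connection.
pages = {
    'http://a/': '<p>Start</p> <a href="http://b/">next</a>',
    'http://b/': '<p>End</p>',
}
docs = crawl('http://a/', fetch=pages.get)
```

Note what the sketch leaves out: error handling for bad or missing pages, normalization of messy HTML (BeautifulSoup helps here), per-host rather than global delays, and persistence of `documents` to a database.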