Professor: Dr. Lillian N. Cassel
E-mail: lillian.cassel@villanova.edu
(Please put "CSC5930" in the subject line so it will get my
attention quickly)
Office hours: Tuesday and Wednesday 3 to 5 pm. Other
times by arrangement. If my door is open, you are welcome to
come in any time.
Semester: Fall 2011
Course Description: How do they do that? Many web applications, from Google to travel sites to resource collections, present results found by crawling the Web for materials relevant to the application's theme. Crawling the Web involves technical issues, politeness conventions, characterization of materials, decisions about the breadth and depth of a search, and choices about what to present and how to display results. This course will explore all of these issues. In addition, we will address what happens after you crawl the Web and acquire a collection of pages. You will decide on the questions, but some possibilities include: What summer jobs are advertised on web sites in your favorite area? What courses are offered in most (or few) computer science departments? What theatres are showing what movies? Students will develop a web site built by crawling at least some part of the Web to find appropriate materials, categorize them, and display them effectively.
Prerequisites: some programming experience (CSC 1051 or the equivalent).
Textbook:
We have no textbook for this class. However,
some relevant references will be mentioned. We will use
many websites.
Introduction:
Every indication is that the best way to learn is to be actively involved in discovery, in creating your own knowledge. As a result, your active participation is a fundamental requirement of this course. Lectures will generally be short and interspersed with exercises to demonstrate understanding and the ability to move to the next stage of activity. There will be discussions and exploration carried out by groups during class periods.
Additional information:
Attendance: I assume that every student will attend every class unless you have told me in advance that you have a reason for missing class. In all cases, you are responsible for finding out what was covered in any class you miss. Just keeping up with class by hearing about happenings from other students is not sufficient, however. Your input into our discussions is important. Your absence not only hurts you; it deprives the rest of the class of the valuable contributions you would have made. A part of the grade is reserved for active participation in every class session.
Online component:
This class will include regular classroom sessions and also some online sessions. There will be some occasions when I participate in the class from a distance. For these classes, you will need access to a computer and internet connection sufficient to display live video. If you have this type of resource, you may attend the class from your chosen location. If you do not have access to these resources, they will be provided for you and you will be expected to attend at Villanova. In any case, your attendance and active participation will be required.
Writing:
Computer scientists and software engineers must communicate with people as well as with computers. During this class there will be opportunities for each student to present problems and solutions to the class as a whole. There will not be any long papers. Each student will give a substantial presentation to the class about the completed project.
I like this quote, found on the department web page:
"Any fool can write code that a computer can understand. Good programmers write code that humans can understand." (Martin Fowler)
We aspire to be good programmers.
There are advantages and disadvantages to the Blackboard course management system used at Villanova. I prefer that my materials be open and accessible, so I will use this public web site. We will use the Blackboard site from time to time to take advantage of tools that are there. Be sure to log in to the course site and become familiar with its use.
The following calendar will be developed as the class
progresses.
Week | Date | Subject | Links to Slides | Notes/Assignments | Activity
1 | 8/30 | The structure of the Web. How does crawling work? | Read: As We May Think. Written in 1945, this article by Vannevar Bush is credited with the original notions that led to hypertext, the Web, and digital libraries. | What project do you want to do? If you cannot decide, I can give you one, but it would be good to have one that is meaningful to you. Be prepared to discuss your choice next week. |
2 | 9/6 | Architecture of a crawler. A language well suited to the task. | Program assignment described in class, due next week. Read as much as needed of the Python tutorial or some examples from the Beginner's Guide. | Discuss readings and proposed projects. |
3 | 9/13 | Python for fetching from the web. | Quiz 1 -- about 30 minutes | |
4 | 9/20 | Creating a database to hold the fetched documents, dealing with messy HTML, reviewing an open-source crawler | | |
5 | 9/27 | Continued Python, crawler, database, BeautifulSoup, Nutch | Python snippets from the class | |
6 | 10/04 | Introduction to Information Retrieval | IntroIR | Due: Your own crawler. See instructions. | Quiz 2 -- about 30 minutes. Demonstration of crawlers.
 | 10/11 | Fall Break | | |
7 | 10/18 | Elements of searching. Indexing. Creating an index of your documents. | Indexing-1; Indexing-2; Indexing-3; Voice version of part 3 | Due: Detailed report on your project plan. Who is on your team, or are you working alone? What is the subject area of your project? Where do you plan to crawl? How will you specify what you want to fetch? What do you plan to do with the materials that you retrieve? What goals does your project serve? (See further description in Blackboard in the assignment upload spot.) | I can give you project ideas if you need them. If you have your own ideas, that is fine. One way or the other, you must have a detailed plan by the first day of the second half of the term.
8 | 10/25 | Continued indexing. Vector space model for information retrieval. | Indexing and vector space | |
9 | 11/1 | Hands-on laboratory: indexing and searching tools | Lucene and Solr are open source projects that provide all the indexing and searching capability that your project will require. | Quiz 3 -- probably 45 minutes. Introduction to information retrieval through indexing and vector spaces. |
10 | 11/8 | Search engines | Survey; Search | Anatomy of a Search Engine (Sergey Brin and Larry Page); Google Page Rank | These links are by way of the materials made available by C. Lee Giles of Penn State.
11 | 11/15 | Introducing the Semantic Web | Semantic Web | The Semantic Web in Breadth; The Semantic Web: An Introduction |
12 | 11/22 | Review and put things together | Class presentations of projects. Overview of the topics. | |
13 | 11/29 | Last quiz -- whole semester -- hour or so. | | |
14 | 12/6 | Project presentations | Last class day | |
15 | | | | |
Your basic crawler (Individual work)
– Read a seed URL
– Fetch the page
– Extract links from the page
  • Put the links on a queue of pages to visit
– Extract the text from the page, stripping off the HTML code
  • Deal with possibly bad HTML
  • Put the extracted documents in a database for later analysis
– Take the next URL from the queue and repeat
– How will you deal with robot exclusions?
– What will you do about rapid access to a server?
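The steps above can be sketched in Python using only the standard library. This is one possible outline under stated assumptions, not the required solution: the names PageParser, fetch, robots_checker, and crawl are hypothetical, a plain dictionary stands in for the database, robot exclusions are checked against robots.txt text that you supply, and a fixed pause after every fetch is just one simple answer to the rapid-access question.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


class PageParser(HTMLParser):
    """Collects href values and visible text; html.parser tolerates messy HTML."""

    def __init__(self):
        super().__init__()
        self.links, self.text, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1          # ignore code and styling, not page text
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


def fetch(url):
    """Download a page over the network and return it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def robots_checker(robots_txt, agent="csc5930-demo"):
    """Build an allowed(url) test from robots.txt text (agent name is made up)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return lambda url: rules.can_fetch(agent, url)


def crawl(seed, fetch_page=fetch, allowed=lambda url: True, max_pages=10, delay=1.0):
    """Breadth-first crawl from a seed URL; returns {url: extracted text}."""
    queue, seen, store = deque([seed]), {seed}, {}
    while queue and len(store) < max_pages:
        url = queue.popleft()            # take the next URL from the queue
        if not allowed(url):             # respect robot exclusions
            continue
        try:
            html = fetch_page(url)
        except OSError:                  # network or HTTP failure: skip the page
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        store[url] = " ".join(parser.text)   # dict stands in for a database
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url   # absolute, no #fragment
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay)                # pause so we do not hammer the server
    return store
```

Because fetch_page and allowed are parameters, the loop can be tried on a few made-up pages before it is pointed at a real site. A real crawler would also read each host's own robots.txt and track delays per server, which this sketch leaves out.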