Information Retrieval

Fall 2015 Schedule

(Subject to change as needed)



Regular schedule: Each class will begin with a quiz covering the assigned work that was to be done prior to the class.  Any question asked on Piazza will be answered before any quiz.  Other questions will have to wait until after the quiz.


TA announced.  Our TA will be Hema Chandra Reddy Ethapu.  Schedule to be posted soon.

Week | Date | Topic | Pre-class | In-Class | Post-Class | Notes

1

8/26



Class introduction

Crash course in Python, NLTK


_______________________________

Install Anaconda

See the 45-minute intro to Python on YouTube: https://www.youtube.com/watch?v=N4mEzFDjqtA

Another resource: Python 3 tutorial: https://docs.python.org/3/tutorial/




Python/NLTK exercises.
Intro to IR





Confirm registration in Piazza.  Post one comment: something about the Python intro, about downloading Anaconda or another Python installation, or anything else you want.


Small programming project to confirm comfort with Python & NLTK basics:


* Repeat the analysis of average word length, etc. using the inaugural addresses instead of the Gutenberg files.  Then choose another corpus and repeat.
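
A minimal sketch of how this analysis might look. Which statistics "etc." covers is an assumption here (average word length and lexical diversity), and `text_stats` and `inaugural_report` are hypothetical helper names. The second function needs NLTK and its `inaugural` corpus installed.

```python
def text_stats(words):
    """Return simple statistics for a sequence of word tokens."""
    # Keep alphabetic tokens only so punctuation does not skew the averages.
    alpha = [w.lower() for w in words if w.isalpha()]
    if not alpha:
        return {"tokens": 0, "avg_word_len": 0.0, "lexical_diversity": 0.0}
    return {
        "tokens": len(alpha),
        "avg_word_len": sum(len(w) for w in alpha) / len(alpha),
        # Distinct words divided by total words.
        "lexical_diversity": len(set(alpha)) / len(alpha),
    }


def inaugural_report():
    """Print one line of statistics per inaugural address."""
    # Imported here so text_stats stays usable without NLTK.
    # Assumes nltk.download('inaugural') has been run once.
    from nltk.corpus import inaugural

    for fileid in inaugural.fileids():
        stats = text_stats(inaugural.words(fileid))
        print(fileid,
              round(stats["avg_word_len"], 2),
              round(stats["lexical_diversity"], 3))
```

To repeat the analysis on another set, pass the word list of any other NLTK corpus to `text_stats`.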


_____________________________

Anaconda is a Python distribution that includes all the libraries we will need.  It is free and easy to install.
Be sure to get the Python 3.4 version.

https://store.continuum.io/cshop/anaconda/

 

2

9/2

How much information? Where does information come from?

Boolean Retrieval

IIR: Read Chapter 1 through Section 2.2
Exercises 1.1, 1.2, 1.3, 1.5, 1.6, 2.1, 2.3


Quiz 1

Program: read documents, produce a term-document index list, then build postings lists from it
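
One possible shape for this program, as a sketch only. The tokenizer is deliberately crude, and the function names (`tokenize`, `term_doc_pairs`, `build_postings`, `intersect`) are illustrative choices rather than anything prescribed by the assignment.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_doc_pairs(docs):
    """docs maps doc_id -> text. Return a sorted list of unique (term, doc_id) pairs."""
    pairs = set()
    for doc_id, text in docs.items():
        for term in tokenize(text):
            pairs.add((term, doc_id))
    return sorted(pairs)


def build_postings(pairs):
    """Group sorted (term, doc_id) pairs into term -> sorted list of doc_ids."""
    postings = defaultdict(list)
    for term, doc_id in pairs:  # input is sorted, so each list stays sorted
        postings[term].append(doc_id)
    return dict(postings)


def intersect(p1, p2):
    """Merge two sorted postings lists: the Boolean AND of two terms."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result
```

For example, `build_postings(term_doc_pairs({1: "home sales", 2: "home prices"}))` maps `"home"` to `[1, 2]` and `"sales"` to `[1]`.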

 Finish programs if needed. 

Be sure to use Piazza for any questions. You may well have some problems with these exercises.  That is OK.  Remember that you can ask a question anonymously with respect to fellow students if you wish, though I will know who you are.  That is good, because you get credit for asking a question.

3

9/9

Web Crawling and Indexes

Rest of Chapter 2, Chapter 20
(Narrated slides are posted in Piazza)
Exercise: write robots.txt rules that block all crawlers from the cgi-bin directory and from the directory of user "admin"
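
One way to check a candidate answer is to feed it to Python's standard `urllib.robotparser`. This is a sketch under assumptions: the `/admin/` path for the "admin" user and the `example.com` host are placeholders, and your site layout may differ.

```python
from urllib.robotparser import RobotFileParser

# Candidate rules: "*" applies to every crawler.
# The /admin/ path is an assumed location for the "admin" user's directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""


def make_parser(robots_text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# can_fetch(useragent, url) reports whether that crawler may request the URL.
blocked = parser.can_fetch("AnyBot", "http://example.com/cgi-bin/search")
allowed = parser.can_fetch("AnyBot", "http://example.com/index.html")
```

With these rules `blocked` should be `False` and `allowed` should be `True`.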


Quiz 2 - Note each quiz will include one question from the previous week as well as one or two from the preparation for this week's class.

WebSphinx


Illustration of web crawling as a source of documents.

Write crawler and retrieve some web pages
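
A minimal breadth-first crawler sketch using only the standard library. The names `LinkExtractor`, `extract_links`, and `crawl` are illustrative. It assumes pages decode as UTF-8, and it does not consult robots.txt or pause between requests, both of which a real crawler should do.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links and drop any #fragment.
                    absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                    self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seed_url, max_pages=10):
    """Fetch up to max_pages pages breadth-first; return {url: html}."""
    frontier = deque([seed_url])
    seen = {seed_url}
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip pages that fail to load
        pages[url] = html
        for link in extract_links(html, url):
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Calling `crawl` with a seed URL of your choice returns a dictionary of fetched pages that can serve as the document collection for indexing.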

_______________________________________

 

 

 

4

9/16

Term weighting, vector space model, ranked retrieval, similarity metrics, TF-IDF weighting

Read Chapter 6 through Section 6.2

Implement the algorithm in Figure 6.4 in Python

Exercises  6.2, 6.10, 6.11

 

Quiz 3
Chapter 6
Presentation and discussion of Section 6.3

Present figure 6.4 code

Make up some data for the vector space code: create 5 small documents of about 40-50 words each.  You can cut and paste from a web page or a file you have, or just make them up.  Create a query that is appropriate for those documents.  Hand off your data to another team.  Write code for Figure 6.14.
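
A sketch of term-at-a-time cosine scoring in the spirit of Figure 6.14, for comparison with your own code. The weighting choices are assumptions: raw term frequency times log10 idf for both documents and queries, with scores divided by document vector length. The function names are illustrative.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs maps doc_id -> text. Return (postings, lengths, n_docs)."""
    postings = defaultdict(dict)  # term -> {doc_id: raw term frequency}
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf
    n_docs = len(docs)
    # Euclidean length of each document's tf-idf vector.
    squared = defaultdict(float)
    for term, plist in postings.items():
        idf = math.log10(n_docs / len(plist))
        for doc_id, tf in plist.items():
            squared[doc_id] += (tf * idf) ** 2
    lengths = {doc_id: math.sqrt(s) for doc_id, s in squared.items()}
    return postings, lengths, n_docs


def cosine_score(query, postings, lengths, n_docs, k=3):
    """Return the top k (doc_id, score) pairs for the query."""
    scores = defaultdict(float)
    for term, q_tf in Counter(tokenize(query)).items():
        plist = postings.get(term)
        if not plist:
            continue  # query term not in the collection
        idf = math.log10(n_docs / len(plist))
        w_tq = q_tf * idf
        for doc_id, tf in plist.items():
            scores[doc_id] += (tf * idf) * w_tq  # accumulate the dot product
    ranked = []
    for doc_id, score in scores.items():
        if lengths.get(doc_id, 0) > 0:
            ranked.append((score / lengths[doc_id], doc_id))  # length-normalize
    ranked.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in ranked[:k]]
```

Build the index once with `build_index(docs)` and pass its three results to `cosine_score` together with each query string.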

Complete your code.  Upload a clean copy of the code for testing with other data.

 

5

9/23

Indexing

Read Chapter 4, Sections 1-2 (all students)
All of Chapter 4 (graduate students)
Chapter 7, Section 1 (all students)

Quiz 4

Present initial plan for semester project

 

 

6

9/30

 Twitter API and information fetching 

Read background material

Decide on goals - What tweets to fetch

Exercises TBD 

Quiz 5

Fetch a collection of Tweets

Explore common themes over these tweets 

Repeat Twitter fetch and analysis with chosen topic 

 

7

10/7

Processing text

Chapter 3 of NLTK book.
Exercises 2, 6, 14, 16, 17


Quiz 6


Additional Exercises TBD 

 

 

 

10/14

Fall Break

 

 

 

 

8

10/21

Information Retrieval Evaluation

 IIR Chapter 8

Exercises 8.1, 8.2, 8.4

Quiz 7

 

 

9

10/28

 Text Classification and Naive Bayes

Chapter 13 

 Quiz 8

 

 

10

11/4

Support Vector Machines and Machine Learning on Documents 

Chapter 15

Install Weka

Quiz 9

Weka exercises

 

 

11

11/11

 Flat Clustering

Chapter 16 

Quiz 10

 

 

12

11/18

Latent Semantic Indexing  

Chapter 18 

Quiz 11

 

 

 

11/25

Thanksgiving Break

 

 

 

 

13

12/2

The Web as document source.  Link Analysis 

Chapter 21 

 Quiz 12

 

 

14

12/9

Project Presentations I 

 

 

 

 

15

12/15 

Project Presentations II

 

 

 

 

 

 

IIR = Introduction to Information Retrieval:  http://nlp.stanford.edu/IR-book/information-retrieval-book.html