CSE 6240: Web Search And Text Mining

Georgia Tech / Spring 2020

This course introduces the fundamentals of web mining. The topics covered fall broadly into text mining, network science, recommender systems, and social media analysis, with emphasis on both theoretical and empirical aspects. Students will be introduced to machine learning techniques and data mining tools suited to extracting insights from large-scale web-based datasets.

  • Instructor: Prof. Srijan Kumar [Twitter]
  • TAs: Roshan Pati, Arindum Roy
  • Office Hours:
    • Srijan Kumar: 10-11am Wednesday, Coda S1303
    • Roshan Pati: 3-4pm Thursday, Klaus 3rd floor Atrium
    • Arindum Roy: 3-4pm Tuesday, Klaus 3rd floor Atrium
  • Lectures: Monday/Wednesday, 3:00-4:15pm, Boggs B9
  • Piazza: Enroll here. Students should use Piazza for all course-related queries.
  • Announcement (01/07/2020): Due to huge demand from students, we have increased the class size to 85 students!

  • Announcement (01/09/2020): Sample dataset and project ideas are released. See the project section below.

  • Announcement (01/13/2020): We will hold a Hands-on Python Session during office hours from 3-4 PM on Tuesday, January 14, at the Klaus 3rd floor Atrium. An interactive tutorial is available here.

  • Announcement (01/15/2020): HW1 is now available on Canvas.

  • Announcement (01/19/2020): For the week starting Mon, Jan 20 only, Prof. Kumar will be conducting his office hours immediately after class on Wednesday, Jan 22, instead of his usual office hours time and location.

  • Announcement (01/20/2020): Project Rubric has now been released. You can find it here.
  • Announcement (01/27/2020): For the week starting Mon, January 27, office hours are changed as follows. Roshan will hold his office hours from 3-4 PM on Tuesday at Clough 342. Arindum will hold his office hours from 9-10 AM on Wednesday at the Klaus 3rd floor Atrium. Professor Kumar will hold his office hours from 2-3 PM on Thursday in his office (Coda S1303).
  • Announcement (01/29/2020): The Project Proposal deadline has been moved; proposals are now due on Wednesday, Feb 5. For the week starting Mon, Feb 3, Professor Kumar will hold his office hours from 10-11 AM on Monday, Feb 3 in his office (Coda S1303) to give students feedback on their proposals.
  • Announcement (02/05/2020): HW2 is now live on Canvas.

Schedule and Slides

The schedule is subject to change. Reading materials will be posted periodically below.

The time for all deadlines used in this course is 23:59 Eastern Time (11:59 PM ET).

Date | Description | Readings and Notes | Events | Deadlines
Mon Jan 6 | Introduction [Slides] | | |
Wed Jan 8 | IR basics and evaluation I [Slides] | | |
Mon Jan 13 | IR basics and evaluation II [Slides] | | |
Tue Jan 14 | Hands-on Python Session [Tutorial] | Note: This is optional to attend. Time: 3-4 PM. Location: Klaus 3rd Floor Atrium. Useful Links: NumPy Documentation, Scikit-Learn Classifier Examples, Text Pre-Processing Methods | |
Wed Jan 15 | Word Embeddings [Slides] | Original word2vec paper, word2vec blogpost | HW1 out on Canvas |
Mon Jan 20 | No class - MLK National Holiday | | | Project Teams due
Wed Jan 22 | Word Embeddings (continued) [Slides] | Skip gram with negative sampling, Original GloVe embedding paper, and project page | |
Mon Jan 27 | Language Models [Slides] | Chapter on Language Models from the book "Speech and Language Processing" | |
Wed Jan 29 | Language Models and IR [Slides] | | | HW1 due
Mon Feb 3 | Web Networks and Properties [Slides] | Graph structure in the Web | |
Wed Feb 5 | Random Graph Models [Slides] | Small world phenomenon, Collective dynamics of ‘small-world’ networks | HW2 out | Project Proposal due
Mon Feb 10 | Link Analysis (PageRank and HITS) [Slides] | Book chapter, PageRank paper, HITS paper | |
Wed Feb 12 | Personalized PageRank and Recommendations [Slides] | Random walk with restart, Pixie paper | |
Mon Feb 17 | Message Passing and Node Classification [Slides] | REV2 fraud detection paper | |
Wed Feb 19 | Belief Propagation and Applications [Slides] | Netprobe paper | |
Mon Feb 24 | Graph Representation Learning | | | HW2 due
Wed Feb 26 | Graph Neural Networks | | |
Mon Mar 2 | Temporal Graph Representation Learning | | |
Wed Mar 4 | Community Structure in Networks | | |
Mon Mar 9 | Recommender Systems I | | |
Wed Mar 11 | Recommender Systems II | | | Project Milestone due
Mon Mar 16 | No class - Spring Break | | |
Wed Mar 18 | No class - Spring Break | | |
Mon Mar 23 | Recommender Systems III | | HW3 out |
Wed Mar 25 | Neural User Modelling | | |
Mon Mar 30 | Social Media Analysis I | | |
Wed Apr 1 | Social Media Analysis II | | |
Mon Apr 6 | Antisocial Behavior I | | | HW3 due
Wed Apr 8 | Antisocial Behavior II | | |
Mon Apr 13 | Advanced Topic I | | |
Wed Apr 15 | Advanced Topic II | | |
Mon Apr 20 | Final Project Poster Session | | |
Wed Apr 22 | No classes - Exams start | | | Project Report due

Project

Sample datasets and projects are available here. You are welcome to use these datasets, but feel free to be creative and use other public datasets. The selected dataset and project topic should align with one or more of the topics taught in the class. The sample projects are meant to spark ideas and illustrate the expected style of work.
Project teams are due on Jan 20 (past due now) and proposals are due on Feb 5 (upcoming).

Project Rubric:

The course project is worth 55% of your overall grade for the course. The breakdown is as follows:

  • Proposal: 5% of your overall grade for the course
  • Milestone Report: 20% of your overall grade for the course
  • Final report and poster presentation: 30% of your overall grade for the course
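
As a rough illustration of how the three weights above combine, here is a minimal Python sketch. It assumes each component is scored as a fraction between 0 and 1 and that scores scale linearly into the course grade; the rubric does not state a scaling rule, and the function name `project_contribution` and the example scores are hypothetical.

```python
# Weights taken from the rubric above, in percentage points of the course grade.
PROJECT_WEIGHTS = {
    "proposal": 5,
    "milestone_report": 20,
    "final_report_and_poster": 30,
}


def project_contribution(scores):
    """Return the percentage points the project adds to the course grade.

    `scores` maps a component name to a fraction in [0, 1]; components that
    are missing count as 0. Linear scaling is an assumption made for
    illustration, not a stated course policy.
    """
    for name, fraction in scores.items():
        if name not in PROJECT_WEIGHTS:
            raise KeyError(f"unknown component: {name}")
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"score for {name} must be in [0, 1]")
    return sum(
        weight * scores.get(name, 0.0)
        for name, weight in PROJECT_WEIGHTS.items()
    )


# The three components together account for the stated 55% of the grade.
assert sum(PROJECT_WEIGHTS.values()) == 55

# Hypothetical scores, for illustration only.
example = {
    "proposal": 0.9,
    "milestone_report": 0.8,
    "final_report_and_poster": 1.0,
}
print(project_contribution(example))
```

Under these assumptions, a perfect score on every component contributes the full 55 points, and the final report and poster alone account for more than half of the project's weight.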

George H. Heilmeier, a former director of DARPA, came up with a list of questions known as “Heilmeier’s Catechism”, which we will follow for this project. The questions are as follows:

  • What are you trying to do? Articulate your objectives using absolutely no jargon.
  • How is it done today, and what are the limits of current practice?
  • What is new in your approach and why do you think it will be successful?
  • Who cares? If you are successful, what difference will it make?
  • What are the risks?
  • How much will it cost?
  • How long will it take?
  • What are the midterm and final “exams” to check for success?

We hope that students will be able to answer all of the questions above when they submit their final project report. It is okay if your results are unsatisfactory, as long as you can show that you have put in enough time and effort, clearly understand how your project relates to Web Search and Text Mining, and can communicate that understanding.

Proposal:

The project proposal is worth 5% of your overall grade for this course. We will evaluate your submission on how well it addresses the following questions.

  1. Introduction/Background/Motivation [20 points]

    This section will ideally deal with the following Heilmeier questions:

    • What are you trying to do? Articulate your objectives using absolutely no jargon.
    • Who cares? If you are successful, what difference will it make?
  2. Reaction to existing papers/technologies [10 points]

    This section will deal with the following:

    • How is it done today, and what are the limits of current practice?

    This section is your literature review. It should summarize and critique the papers (existing methods) you came across while researching your topic, including their shortcomings. Cite at least 3 works.

  3. Plan of Action [20 points]

    This section will deal with the following:

    • What is your proposed approach and why do you think it will be better than the existing work?
    • What are the risks and anticipated challenges that may be a roadblock for your project?
    • Which dataset will you use?
    • Which code repository will you start with, if any?

Deliverable: A 2-page document in ACM format containing all the sections outlined above. For a template, please use the ACM Conference Template, available on Overleaf.

Milestone Report:

Details will be given later.

Final Report:

Details will be given later.

Poster Presentation:

Details will be given later.