CS369M: Algorithms for Massive Data Set Analysis


Instructor: Michael Mahoney
  • Email: mmahoneyWXYZ AT ZYXW.cs.stanford.edu
  • Office hours: Most weeks, M 3:00pm to 4:30pm in Math building Rm. 383A (third floor). Alternatively, by appointment.

    Teaching Assistant: Alex Shkolnik
  • Email: ads2WXYZ AT ZYXW.stanford.edu
  • Office hours: Tue & Thu 3:00pm to 5:00pm, Location: Gates B28

    Class time and Location:
  • MW 11:00-12:15, Building Terman, Room 156. (First meeting is Monday, September 21, 2009.)


    Course description: Algorithmic and statistical methods for large-scale data analysis: matrix and graph algorithms; strengths and weaknesses of theoretical techniques for practical scientific and internet data analysis; overlap with related problems in statistics, optimization, numerical analysis, and machine learning. Representative topics include: Matrix problems (numerical and statistical perspectives; algorithmic approaches, including Johnson-Lindenstrauss lemma and randomized projection and sampling algorithms; novel matrix factorizations); Graph problems (graph partitioning algorithms, including spectral methods, flow-based methods, and recent geometric methods; local graph algorithms and approximate eigenvector computation); and Applications to machine learning and statistical data analysis (motivating applications; algorithmic basis of the RKHS method; geometric data analysis, regularization, and statistical inference; boosting and its relationship to conjugate gradient methods, duality, convexity, online learning, and approximation algorithms). Basics of implementing these ideas in medium and large-scale applications.

    Prerequisites: Basic understanding of Algorithms (e.g., CS161), Linear Algebra (e.g., Math51), and Probability Theory (e.g., CS109), or equivalent.

    Course requirements: Most likely, three homeworks (ca. 15-20% each), scribe two lectures (ca. 5%), and a major project (ca. 50%, which includes initial proposal, then intermediate report, then final report).

    Syllabus: pdf

    Primary references: Although there are many good books on related topics, there is not a single reference covering the topics we will cover. Thus, we will be reading primary sources from the literature. Relevant papers for each lecture are listed in the Lectures section. A more detailed list of relevant references is also provided below.

    Homeworks:

    Major project: pdf


    Lectures:

    Scribe Templates pdf tex

    Important Note: These scribed notes have been put up in real time to aid dissemination and complement class notes, but they have not been fully edited---Caveat emptor!


    Detailed list of references:

    Papers marked (*) should be read; they are either useful background or will provide the basis for the lectures. Other papers are provided for additional background; this may be useful for helping you to scribe up the lectures and also for initial pointers for your project.

    You should be able to find all of these online, given the title and authors. If there are any you can't find relatively easily, please let the instructor or TA know and we will provide a copy or pointer.

    A note on reading papers: Research papers (especially conference versions in computer science) have a tendency to be terse, hard-to-read, and error-prone. To make matters worse, papers in different areas often refer to similar concepts by very different names. A good way to start is to peel apart the paper like an onion: read the introduction, and maybe read the introduction of related papers subsequent to it, to get a general idea; then get a sense of the main technical results (e.g., the main theoretical or empirical claims); and then go into the details of what is proved and how it is proved (or of how the empirical claims are validated). It may be hard to understand every technical detail of certain papers you're reading, but with effort you should develop a good understanding of the main technical contributions, the gist of how they are proved or evaluated, and a detailed understanding of at least one or two results.