Michael Mahoney's new home page



Basic info

I am currently in the math department at Stanford University. You can reach me via electronic mail: mmahoney is the username, and cs dot stanford dot edu follows the at symbol. (Of course, my gmail address is still valid.) Also, some other (somewhat outdated) information can be found at my old web page at Yale computer science.


Recent items of interest:

  • We are going to run MMDS again! MMDS 2010 will take place on the campus of Stanford University on June 15-18, 2010. Details coming soon!
  • Recent teaching: Fall 2009: CS369M: Algorithms for Massive Data Set Analysis
  • We have made public many of the networks we have used in our "Community Structure" papers. Click here for the networks and for related information. The older networks are still here.
  • Our PNAS paper on "CUR Matrix Decompositions for Improved Data Analysis" has appeared as Proc. Natl. Acad. Sci. USA, 106, 697-702 (2009). Please email me if you can't access a copy from the PNAS web site.

  • Research interests

    • Algorithmic and statistical aspects of modern large-scale data analysis.
    • Design and analysis of algorithms for matrix, graph, and regression problems.
    • Statistical data analysis in large-scale scientific and Internet applications.
    • Applications to the analysis of large social and information networks.
    • Applications to DNA microarray and single nucleotide polymorphism data.

    Much of my current research focuses on geometric network analysis, i.e., using algorithms with a geometric or statistical flavor to analyze the structure and dynamics of large informatics graphs; developing approximate computation and regularization methods for large informatics graphs; and applications to community detection, clustering, and information dynamics in large social and information networks. I am also continuing work on matrix sampling algorithms, applications to DNA single nucleotide polymorphism (SNP) data, and large-scale statistical data analysis more generally.

    In the last few years, I developed and analyzed algorithms for large matrix, graph, and regression problems, and I applied these and related tools to the statistical data analysis of extremely large scientific and Internet data sets. For example, I worked on large-scale web analytics, machine learning, and query log analysis; applications of graph partitioning algorithms to clustering and community identification; and applications of randomized algorithms to hyperspectral medical image data, DNA microarray data, and DNA SNP data.

    In the past, I have also worked on developing and analyzing Monte Carlo algorithms for performing useful computations on extremely large matrices, e.g., the additive-error and relative-error CUR matrix decompositions. Recently, this work led to improved algorithms for two classical linear algebra problems: a faster algorithm for the overconstrained least squares approximation problem, and a better algorithm for the problem of choosing the best set of exactly k columns from a matrix. Past research has also included work in computational statistical mechanics on the development and analysis of the TIP5P model of liquid water, as well as work in both computational and experimental biophysics on proteins and protein-nucleic acid interactions.


    MMDS Workshops

    We started the MMDS Workshops on "Algorithms for Modern Massive Data Sets" to address algorithmic and statistical challenges in modern large-scale statistical data analysis.

    • We are going to run MMDS again! MMDS 2010 will take place on the campus of Stanford University on June 15-18, 2010. Details coming soon!
    • MMDS 2008 took place on June 25-28, 2008. MMDS 2008 grew out of our expectation for what the algorithmic and statistical foundations of large-scale data analysis should look like a generation from now. Click here for an article that appeared in SIGKDD Explorations and SIAM News about the meeting.
    • MMDS 2006 took place on June 21-24, 2006. MMDS 2006 was originally motivated by the complementary perspectives brought by numerical linear algebra and theoretical computer science to matrix algorithms in large-scale data applications. Click here for an article in SIAM News about the meeting.
    These MMDS meetings generated intense interdisciplinary interest and were a big success --- so keep an eye out for future MMDSs!

    Publications

    • Empirical Comparison of Algorithms for Network Community Detection,
    • J. Leskovec, K. J. Lang, and M. W. Mahoney,
      Proc. 19-th International WWW, 000-000 (2010)
    • A Spectral Algorithm for Improving Graph Partitions,
    • M. W. Mahoney, L. Orecchia, and N. K. Vishnoi,
      Technical Report, Preprint: arXiv:0912.0681 (2009) (arXiv).
    • Unsupervised Feature Selection for the k-means Clustering Problem,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Proceedings of the 2009 NIPS Conference.
    • Learning with Spectral Kernels and Heavy-Tailed Data,
    • M. W. Mahoney and H. Narayanan,
      Technical Report, Preprint: arXiv:0906.4539 (2009) (arXiv).
    • Empirical Evaluation of Graph Partitioning Using Spectral Embeddings and Flow,
    • K. J. Lang, M. W. Mahoney, and L. Orecchia,
      Proc. 8-th International SEA, 197-208 (2009).
    • CUR Matrix Decompositions for Improved Data Analysis,
    • M. W. Mahoney and P. Drineas,
      Proc. Natl. Acad. Sci. USA, 106, 697-702 (2009).
    • An Improved Approximation Algorithm for the Column Subset Selection Problem,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Technical Report, Preprint: arXiv:0812.4293 (2008) (arXiv).
      Proc. 20-th Annual SODA, 968-977 (2009) (ps, pdf).
      Journal version submitted for publication.
    • Algorithmic and Statistical Challenges in Modern Large-Scale Data Analysis are the Focus of MMDS 2008
    • M. W. Mahoney, L.-H. Lim, and G. E. Carlsson
      Technical Report, Preprint: arXiv:0812.3702 (2008) (arXiv),
      Appeared in SIGKDD Explorations (ps, pdf), SIAM News (ps, pdf), and ASA-SCGN Newsletter (ps, pdf), and abridged versions appeared in IMS Bulletin (ps, pdf) and AmStat News.
    • Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters,
    • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:0810.1355 (2008) (arXiv).
    • Unsupervised Feature Selection for Principal Components Analysis,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Proc. 14-th Annual SIGKDD, 61-69 (2008) (ps, pdf).
    • Statistical Properties of Community Structure in Large Social and Information Networks,
    • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
      Proc. 17-th International WWW, 695-704 (2008) (ps, pdf).
    • Faster Least Squares Approximation,
    • P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlos,
      Technical Report, Preprint: arXiv:0710.1435 (2007) (arXiv),
      Journal version submitted for publication.
    • PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations,
    • P. Paschou, E. Ziv, E. G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M. W. Mahoney, and P. Drineas,
      PLoS Genetics, 3, 1672-1686 (2007) (ps, pdf).
    • Relative-Error CUR Matrix Decompositions,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Technical Report, Preprint: arXiv:0708.3696 (2007) (arXiv),
      SIAM J. Matrix Analysis and Applications, 30, 844-881 (2008) (ps, pdf).
    • Feature Selection Methods for Text Classification,
    • A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney,
      Proc. 13-th Annual SIGKDD, 230-239 (2007) (ps, pdf).
    • Sampling Algorithms and Coresets for Lp Regression,
    • A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:0707.1714 (2007) (arXiv),
      Proc. 19-th Annual SODA, 932-941 (2008) (ps, pdf),
      SIAM J. Computing, 38, 2060-2078 (2009) (ps, pdf).
    • Web Information Retrieval and Linear Algebra Algorithms,
    • A. Frommer, M. W. Mahoney, and D. B. Szyld (Eds.),
      Proc. of Dagstuhl Seminar 07071, (2007) (web).
    • Intra- and interpopulation genotype reconstruction from tagging SNPs,
    • P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas,
      Genome Research, 17(1), 96-107 (2007) (ps, pdf).
    • Bridging the Gap Between Numerical Linear Algebra, Theoretical Computer Science, and Data Applications,
    • G. H. Golub, M. W. Mahoney, P. Drineas, and L.-H. Lim,
      SIAM News 39:8 October 2006 (ps, pdf).
    • Randomized Algorithms for Matrices and Massive Data Sets,
    • P. Drineas and M. W. Mahoney,
      Proc. 32-nd Annual VLDB, 1269 (2006) (ps, pdf).
    • Subspace Sampling and Relative-Error Matrix Approximation: Column-Row-Based Methods,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. 14-th Annual ESA, 304-314 (2006) (ps, pdf).
    • Subspace Sampling and Relative-Error Matrix Approximation: Column-Based Methods,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. 10-th Annual RANDOM, 316-326 (2006) (ps, pdf).
    • Tensor-CUR Decompositions For Tensor-Based Data,
    • M. W. Mahoney, M. Maggioni, and P. Drineas,
      Proc. 12-th Annual SIGKDD, 327-336 (2006) (ps, pdf),
      SIAM J. Matrix Analysis and Applications, 30, 957-987 (2008) (ps, pdf).
    • Polynomial Time Algorithm for Column-Row-Based Relative-Error Low-Rank Matrix Approximation,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Technical Report, DIMACS TR 2006-04 March 2006 (ps, pdf).
    • Sampling Algorithms for L2 Regression and Applications,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. 17-th Annual SODA, 1127-1136 (2006) (ps, pdf).
    • A Randomized Algorithm for a Tensor-Based Generalization of the Singular Value Decomposition,
    • P. Drineas and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1327, June 2005 (ps, pdf),
      Linear Algebra and its Applications, 420, 553-571 (2007) (ps, pdf).
    • On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-Based Learning,
    • P. Drineas and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1319, April 2005 (ps, pdf),
      Proc. 18-th Annual COLT, 323-337 (2005) (ps, pdf),
      J. Machine Learning Research, 6, 2153-2175 (2005) (ps, pdf).
    • Sampling Sub-problems of Heterogeneous Max-Cut Problems and Approximation Algorithms,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1283, April 2004 (ps, pdf),
      Proc. 22-nd Annual STACS, 57-68 (2005) (ps, pdf),
      Random Structures and Algorithms, 32:3, 307-333 (2008) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices III: Computing an Efficient Approximate Decomposition of a Matrix,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1271, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 184-206 (2006) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices II: Computing Low-Rank Approximations to a Matrix,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1270, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 158-183 (2006) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1269, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 132-157 (2006) (ps, pdf).
    • Rapid Mixing of Several Markov Chains for a Hard-Core Model,
    • R. Kannan, M. W. Mahoney, and R. Montenegro,
      Proc. 14-th Annual ISAAC, 663-675 (2003) (ps, pdf).
    • Quantum, Intramolecular Flexibility, and Polarizability Effects on the Reproduction of the Density Anomaly of Liquid Water by Simple Potential Functions,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 115, 10758-10768 (2001) (ps, pdf).
    • Rapid Estimation of Electronic Degrees of Freedom in Monte Carlo Calculations for Polarizable Models of Liquid Water,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 114, 9337-9349 (2001) (ps, pdf).
    • Diffusion Constant of the TIP5P Model of Liquid Water,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 114, 363-366 (2001) (ps, pdf).
    • A Five-Site Model for Liquid Water and the Reproduction of the Density Anomaly by Rigid, Nonpolarizable Potential Functions,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 112, 8910-8922 (2000) (ps, pdf).
    • Repression and Activation of Promoter-Bound RNA Polymerase Activity by Gal Repressor,
    • H. E. Choy, R. R. Hanger, T. Aki, M. Mahoney, K. Murakami, A. Ishihama, and S. Adhya,
      J. Mol. Biol. 272: 293-300, 1997 (ps, pdf).
    • Discrete Representations of the Protein C-alpha Chain,
    • X. F. de la Cruz, M. W. Mahoney, and B. K. Lee,
      Fold. & Des. 2: 223-234, 1997 (ps, pdf).

    Talks and presentations

    Overheads for several recent talks:

  • Community structure in large social and information networks (newer) (pdf, ppt)
  • Statistical leverage and improved matrix algorithms (newer and long) (pdf, ppt)
  • Overheads for several tutorial presentations:

  • Randomized Algorithms for Matrices and Massive Data Sets (at SIAM-SDM06 2006 and VLDB 2006) (ppt)
  • Randomized Algorithms for Matrices and Massive Data Sets (at ACM-SIGKDD 2005) (ppt)
  • Overheads for several other older talks:

  • Approximation Algorithms as Experimental Probes of Informatics Graphs (pdf, ppt)
  • Community structure in large social and information networks (talk from MMDS 2008) (pdf, ppt)
  • Community structure in large social and information networks (older) (pdf, ppt)
  • Statistical leverage and improved matrix algorithms (older and short) (pdf, ppt)
  • Sampling algorithms and core-sets for Lp regression and applications (pdf, ppt)
  • CUR Matrix Decompositions for Improved Data Analysis (talk from MMDS 2006) (pdf, ppt)
  • A Relative-Error CUR Decomposition for Matrices and Its Data Applications (pdf, ppt)
  • Sampling Algorithms for L2 Regression and Applications (talk from SODA 2006) (pdf, ppt)
  • Approximating a Gram Matrix for Improved Kernel-Based Learning (talk from COLT 2005) (ps, pdf)
  • Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis (newer) ( pdf, ppt)
  • Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis (older) ( pdf)
  • The CUR Matrix Decomposition with Applications to Algorithm Design and Massive Data Set Analysis (pdf)
  • Fast Monte Carlo Algorithms for Massive Data Sets and Approximating Max-Cut (ps, pdf)
  • Overheads for the TIP5P water talk:

  • The Computational Statistical Mechanics of Simple Models of Liquid Water ( pdf)
  • A few videos of talks:

  • "Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis" at 2005 IPAM Summer School.
  • "Statistical Leverage and Improved Matrix Algorithms" at 2009 ICML Workshops.


  • My old web page at Yale computer science has additional (somewhat outdated) information; but you should click on it (or Google search for it) if you are interested in extremely large homes.