Michael Mahoney's new home page



Basic info

I am in the math department at Stanford University. You can reach me via electronic mail: mmahoney is the username, and cs dot stanford dot edu follows the at symbol. (Of course, my gmail address is still valid.) Also, some other (somewhat outdated) information can be found at my old web page at Yale computer science.


Recent items of interest:

  • We will be running MMDS again! MMDS 2012 will take place on the campus of Stanford University on July 10-13, 2012. Details will be available soon!
  • My monograph on "Randomized Algorithms for Matrices and Data" is available in NOW's "Foundations and Trends in Machine Learning" series, and it is also available on the arXiv here.
  • The overview chapter on "Algorithmic and Statistical Perspectives on Large-Scale Data Analysis" is finally on the arXiv here; the book in which it will appear is in press; and a video of the associated talk is here.
  • Click here for information (including the slides and video!) on the Tutorial on "Geometric Tools for Identifying Structure in Large Social and Information Networks," given originally at ICML10 and KDD10 and subsequently at many other places. (The slides are also linked to below.)
  • Recent teaching: Fall 2009: CS369M: Algorithms for Massive Data Set Analysis
  • We have made public many of the networks we have used in our "Community Structure" papers. Click here for the networks and for related information; older networks are still here.
  • Our PNAS paper on "CUR Matrix Decompositions for Improved Data Analysis" has appeared as Proc. Natl. Acad. Sci. USA, 106, 697-702 (2009). Please email me if you can't access a copy from the PNAS web site.

  • Research interests

    • Algorithmic and statistical aspects of modern large-scale data analysis.
    • Design and analysis of algorithms for matrix, graph, and regression problems.
    • Applications to analytics on large social and information networks.
    • Applications to DNA microarray and single nucleotide polymorphism data.
    • Randomized algorithms for large linear algebra problems.

    Much of my current research focuses on geometric network analysis, i.e., using algorithms with a geometric or statistical flavor to analyze the structure and dynamics of large informatics graphs; developing approximate computation and regularization methods for large informatics graphs; and applications to community detection, clustering, and information dynamics in large social and information networks. I am also continuing work on matrix sampling algorithms, applications to DNA single nucleotide polymorphism (SNP) data, and large-scale statistical data analysis more generally.

    In the last few years, I developed and analyzed algorithms for large matrix, graph, and regression problems, and I applied these and related tools to the statistical data analysis of extremely large scientific and Internet data sets. For example, I worked on large-scale web analytics, machine learning, and query log analysis; applications of graph partitioning algorithms to clustering and community identification; and applications of randomized algorithms to hyperspectral medical image data, DNA microarray data, and DNA SNP data.

    In the past, I have also worked on developing and analyzing Monte Carlo algorithms for performing useful computations on extremely large matrices, e.g., the additive-error and relative-error CUR matrix decompositions. Recently, this work led to improved algorithms for two classical linear algebra problems: a faster algorithm for the overconstrained least squares approximation problem, and a better algorithm for the problem of choosing the best set of exactly k columns from a matrix. Past research has also included work in computational statistical mechanics on the development and analysis of the TIP5P model of liquid water, as well as work in both computational and experimental biophysics on proteins and protein-nucleic acid interactions.


    MMDS Workshops

    We started the MMDS Workshops on "Algorithms for Modern Massive Data Sets" to address algorithmic and statistical challenges in modern large-scale statistical data analysis.

    • We will be running MMDS again! MMDS 2012 will take place on the campus of Stanford University on July 10-13, 2012. Details will be available soon!
    • MMDS 2010 took place on the campus of Stanford University on June 15-18, 2010. MMDS 2010 addressed computation in large-scale scientific and internet data applications more generally. See the MMDS web page for details, including articles and all the speaker overheads!
    • MMDS 2008 took place on June 25-28, 2008. MMDS 2008 grew out of our expectation for what the algorithmic and statistical foundations of large-scale data analysis should look like a generation from now. Click here for an article that appeared in SIGKDD Explorations and SIAM News about the meeting.
    • MMDS 2006 took place on June 21-24, 2006. MMDS 2006 was originally motivated by the complementary perspectives brought by numerical linear algebra and theoretical computer science to matrix algorithms in large-scale data applications. Click here for an article in SIAM News about the meeting.
    These MMDS meetings generated intense interdisciplinary interest and were a big success --- so keep an eye out for future MMDSs!

    Publications

    • On the Hyperbolicity of Small-World Networks and Tree-Like Graphs,
    • W. Chen, W. Fang, G. Hu, M. W. Mahoney,
      Technical Report, Preprint: arXiv:1201.1717 (2012) (arXiv),
    • Stochastic Dimensionality Reduction for K-means Clustering,
    • C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas,
      Technical Report, Preprint: arXiv:1110.2897 (2011) (arXiv),
    • Regularized Laplacian Estimation and Fast Eigenvector Approximation,
    • P. O. Perry and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1110.1757 (2011) (arXiv),
      Proceedings of the 2011 NIPS Conference.
    • LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems,
    • X. Meng, M. A. Saunders, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1109.5981 (2011) (arXiv).
    • Fast approximation of matrix coherence and statistical leverage,
    • P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff,
      Technical Report, Preprint: arXiv:1109.3843 (2011) (arXiv).
    • Localization on low-order eigenvectors of data matrices,
    • M. Cucuringu and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1109.1355 (2011) (arXiv).
    • Efficient Genomewide Selection of PCA-Correlated tSNPs for Genotype Imputation,
    • A. Javed, P. Drineas, M. W. Mahoney, and P. Paschou,
      Annals of Human Genetics, 75, 707-722 (2011) ( pdf).
    • Randomized Algorithms for Matrices and Data,
    • M. W. Mahoney,
      Technical Report, Preprint: arXiv:1104.5557 (2011) (arXiv),
      NOW Publishers, Foundations and Trends in Machine Learning, Volume 3, Issue 2, 2011 (now),
    • Computation in Large-Scale Scientific and Internet Data Applications is a Focus of MMDS 2010,
    • M. W. Mahoney,
      Technical Report, Preprint: arXiv:1012.4231 (2010) (arXiv),
      Appeared in SIGKDD Explorations, SIGACT News, ASA-SCGN Newsletter, and IMS Bulletin.
    • CUR from a Sparse Optimization Viewpoint,
    • J. Bien, Y. Xu, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1011.0413 (2010) (arXiv),
      Proceedings of the 2010 NIPS Conference (ps, pdf).
    • Algorithmic and Statistical Perspectives on Large-Scale Data Analysis,
    • M. W. Mahoney,
      Technical Report, Preprint: arXiv:1010.1609 (2010) (arXiv),
      To appear in: Combinatorial Scientific Computing, U. Naumann and O. Schenk, eds., 2012.
    • Implementing regularization implicitly via approximate eigenvector computation,
    • M. W. Mahoney and L. Orecchia,
      Technical Report, Preprint: arXiv:1010.0703 (2010) (arXiv),
      Proceedings of the 28th ICML Conference, 121-128 (2011) ( pdf).
    • Approximating Higher-Order Distances Using Random Projections,
    • P. Li, M. W. Mahoney, and Y. She,
      Proceedings of the 26th UAI Conference, 312-321 (2010) (ps, pdf).
    • Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving,
    • P. Drineas and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1005.3097 (2010) (arXiv).
    • Empirical Comparison of Algorithms for Network Community Detection,
    • J. Leskovec, K. J. Lang, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1004.3539 (2010) (arXiv),
      Proc. 19-th International WWW, 631-640 (2010) (ps, pdf).
    • A Local Spectral Method for Graphs: with Applications to Improving Graph Partitions and Exploring Data Graphs Locally,
    • M. W. Mahoney, L. Orecchia, and N. K. Vishnoi,
      Technical Report, Preprint: arXiv:0912.0681 (2009) (arXiv).
    • Unsupervised Feature Selection for the k-means Clustering Problem,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Proceedings of the 2009 NIPS Conference (ps, pdf).
    • Learning with Spectral Kernels and Heavy-Tailed Data,
    • M. W. Mahoney and H. Narayanan,
      Technical Report, Preprint: arXiv:0906.4539 (2009) (arXiv).
    • Empirical Evaluation of Graph Partitioning Using Spectral Embeddings and Flow,
    • K. J. Lang, M. W. Mahoney, and L. Orecchia,
      Proc. 8-th International SEA, 197-208 (2009) (ps, pdf).
    • CUR Matrix Decompositions for Improved Data Analysis,
    • M. W. Mahoney and P. Drineas,
      Proc. Natl. Acad. Sci. USA, 106, 697-702 (2009) (ps, pdf).
    • An Improved Approximation Algorithm for the Column Subset Selection Problem,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Technical Report, Preprint: arXiv:0812.4293 (2008) (arXiv),
      Proc. 20-th Annual SODA, 968-977 (2009) (ps, pdf).
    • Algorithmic and Statistical Challenges in Modern Large-Scale Data Analysis are the Focus of MMDS 2008
    • M. W. Mahoney, L.-H. Lim, and G. E. Carlsson
      Technical Report, Preprint: arXiv:0812.3702 (2008) (arXiv),
      Appeared in SIGKDD Explorations (ps, pdf), SIAM News (ps, pdf), and ASA-SCGN Newsletter (ps, pdf), and abridged versions appeared in IMS Bulletin (ps, pdf) and AmStat News.
    • Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters,
    • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:0810.1355 (2008) (arXiv),
      Internet Mathematics, 6(1), 29-123 (2009).
    • Unsupervised Feature Selection for Principal Components Analysis,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Proc. 14-th Annual SIGKDD, 61-69 (2008) (ps, pdf).
    • Statistical Properties of Community Structure in Large Social and Information Networks,
    • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
      Proc. 17-th International WWW, 695-704 (2008) (ps, pdf).
    • Faster Least Squares Approximation,
    • P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlos,
      Technical Report, Preprint: arXiv:0710.1435 (2007) (arXiv),
      Numerische Mathematik, 117, 219-249 (2011).
    • PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations,
    • P. Paschou, E. Ziv, E. G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M. W. Mahoney, and P. Drineas,
      PLoS Genetics, 3, 1672-1686 (2007) (ps, pdf).
    • Relative-Error CUR Matrix Decompositions,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Technical Report, Preprint: arXiv:0708.3696 (2007) (arXiv),
      SIAM J. Matrix Analysis and Applications, 30, 844-881 (2008) (ps, pdf).
    • Feature Selection Methods for Text Classification,
    • A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney,
      Proc. 13-th Annual SIGKDD, 230-239 (2007) (ps, pdf).
    • Sampling Algorithms and Coresets for Lp Regression,
    • A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:0707.1714 (2007) (arXiv),
      Proc. 19-th Annual SODA, 932-941 (2008) (ps, pdf),
      SIAM J. Computing, 38, 2060-2078 (2009) (ps, pdf).
    • Web Information Retrieval and Linear Algebra Algorithms,
    • A. Frommer, M. W. Mahoney, and D. B. Szyld (Eds.),
      Proc. of Dagstuhl Seminar 07071, (2007) (web).
    • Intra- and interpopulation genotype reconstruction from tagging SNPs,
    • P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas,
      Genome Research, 17(1), 96-107 (2007) (ps, pdf).
    • Bridging the Gap Between Numerical Linear Algebra, Theoretical Computer Science, and Data Applications,
    • G. H. Golub, M. W. Mahoney, P. Drineas, and L.-H. Lim,
      SIAM News 39:8 October 2006 (ps, pdf).
    • Randomized Algorithms for Matrices and Massive Data Sets,
    • P. Drineas and M. W. Mahoney,
      Proc. 32-nd Annual VLDB, 1269 (2006) (ps, pdf).
    • Subspace Sampling and Relative-Error Matrix Approximation: Column-Row-Based Methods,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. 14-th Annual ESA, 304-314 (2006) (ps, pdf).
    • Subspace Sampling and Relative-Error Matrix Approximation: Column-Based Methods,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. 10-th Annual RANDOM, 316-326 (2006) (ps, pdf).
    • Tensor-CUR Decompositions For Tensor-Based Data,
    • M. W. Mahoney, M. Maggioni, and P. Drineas,
      Proc. 12-th Annual SIGKDD, 327-336 (2006) (ps, pdf),
      SIAM J. Matrix Analysis and Applications, 30, 957-987 (2008) (ps, pdf).
    • Polynomial Time Algorithm for Column-Row-Based Relative-Error Low-Rank Matrix Approximation,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Technical Report, DIMACS TR 2006-04 March 2006 (ps, pdf).
    • Sampling Algorithms for L2 Regression and Applications,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. 17-th Annual SODA, 1127-1136 (2006) (ps, pdf).
    • A Randomized Algorithm for a Tensor-Based Generalization of the Singular Value Decomposition,
    • P. Drineas and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1327, June 2005 (ps, pdf),
      Linear Algebra and its Applications, 420, 553-571 (2007) (ps, pdf).
    • On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-Based Learning,
    • P. Drineas and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1319, April 2005 (ps, pdf),
      Proc. 18-th Annual COLT, 323-337 (2005) (ps, pdf),
      J. Machine Learning Research, 6, 2153-2175 (2005) (ps, pdf).
    • Sampling Sub-problems of Heterogeneous Max-Cut Problems and Approximation Algorithms,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1283, April 2004 (ps, pdf),
      Proc. 22-nd Annual STACS, 57-68 (2005) (ps, pdf),
      Random Structures and Algorithms, 32:3, 307-333 (2008) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices III: Computing an Efficient Approximate Decomposition of a Matrix,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1271, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 184-206 (2006) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices II: Computing Low-Rank Approximations to a Matrix,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1270, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 158-183 (2006) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1269, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 132-157 (2006) (ps, pdf).
    • Rapid Mixing of Several Markov Chains for a Hard-Core Model,
    • R. Kannan, M. W. Mahoney, and R. Montenegro,
      Proc. 14-th Annual ISAAC, 663-675 (2003) (ps, pdf).
    • Quantum, Intramolecular Flexibility, and Polarizability Effects on the Reproduction of the Density Anomaly of Liquid Water by Simple Potential Functions,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 115, 10758-10768 (2001) (ps, pdf).
    • Rapid Estimation of Electronic Degrees of Freedom in Monte Carlo Calculations for Polarizable Models of Liquid Water,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 114, 9337-9349 (2001) (ps, pdf).
    • Diffusion Constant of the TIP5P Model of Liquid Water,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 114, 363-366 (2001) (ps, pdf).
    • A Five-Site Model for Liquid Water and the Reproduction of the Density Anomaly by Rigid, Nonpolarizable Potential Functions,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 112, 8910-8922 (2000) (ps, pdf).
    • Repression and Activation of Promoter-Bound RNA Polymerase Activity by Gal Repressor,
    • H. E. Choy, R. R. Hanger, T. Aki, M. Mahoney, K. Murakami, A. Ishihama, and S. Adhya,
      J. Mol. Biol. 272: 293-300, 1997 (ps, pdf).
    • Discrete Representations of the Protein C-alpha Chain,
    • X. F. de la Cruz, M. W. Mahoney, and B. K. Lee,
      Fold. & Des. 2: 223-234, 1997 (ps, pdf).

    Talks and presentations

    Several recent talks:

  • Extracting insight from large networks: implications of small-scale and large-scale structure (1 hour version) (pdf, ppt)
  • Approximate computation and implicit regularization in large-scale data analysis (short vrsn) (pdf, ppt)
  • Fast Approximation of Matrix Coherence and Statistical Leverage (pdf, ppt)
  • Linear Algebra and Machine Learning of Large Informatics Graphs (pdf, ppt)
  • Algorithmic and Statistical Perspectives on Large-Scale Data Analysis (pdf, ppt)
  • Community structure in large social and information networks (newer) (pdf, ppt)
  • Statistical leverage and improved matrix algorithms (newer and long) (pdf, ppt)
  • Several tutorial presentations:

  • Geometric Tools for Identifying Structure in Large Social and Information Networks (1.5 hr version at SAMSI Opening Workshop 2010) (pdf, ppt)
  • Geometric Tools for Identifying Structure in Large Social and Information Networks (3 hr version at ICML 2010 and KDD 2010) (pdf, ppt) (The pdf file in four pieces: here, here, here, and here.)
  • Randomized Algorithms for Matrices and Massive Data Sets (at SIAM-SDM06 2006 and VLDB 2006) (ppt)
  • Randomized Algorithms for Matrices and Massive Data Sets (at ACM-SIGKDD 2005) (ppt)
  • Several other older talks:

  • Looking for clusters in your data ... in theory and in practice (pdf, ppt)
  • Implementing regularization implicitly via approximate eigenvector computation (pdf, ppt)
  • Geometric Network Analysis Tools (talk from MMDS 2010) (pdf, ppt)
  • Approximation Algorithms as Experimental Probes of Informatics Graphs (pdf, ppt)
  • Community structure in large social and information networks (talk from MMDS 2008) (pdf, ppt)
  • Community structure in large social and information networks (older) (pdf, ppt)
  • Statistical leverage and improved matrix algorithms (older and short) (pdf, ppt)
  • Sampling algorithms and core-sets for Lp regression and applications (pdf, ppt)
  • CUR Matrix Decompositions for Improved Data Analysis (talk from MMDS 2006) (pdf, ppt)
  • A Relative-Error CUR Decomposition for Matrices and Its Data Applications (pdf, ppt)
  • Sampling Algorithms for L2 Regression and Applications (talk from SODA 2006) (pdf, ppt)
  • Approximating a Gram Matrix for Improved Kernel-Based Learning (talk from COLT 2005) (ps, pdf)
  • Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis (newer) ( pdf, ppt)
  • Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis (older) ( pdf)
  • CUR Matrix Decomposition with Applications to Algorithm Design and Massive Data Set Analysis (pdf)
  • Fast Monte Carlo Algorithms for Massive Data Sets and Approximating Max-Cut (ps, pdf)
  • The TIP5P Water talk:

  • The Computational Statistical Mechanics of Simple Models of Liquid Water ( pdf)
  • Videos of several talks:

  • "Approximate computation and implicit regularization in large-scale data analysis," or click here
    (at the Workshop on Beyond Worst-Case Analysis, September, 2011).
  • "Linear Algebra and Machine Learning of Large Informatics Graphs"
    (at the NIPS Workshops, December, 2010).
  • "Geometric Tools for Identifying Structure in Large Social and Information Networks"
    (90 minute version, tutorial at SAMSI Opening workshop, August 2010).
  • "Geometric Tools for Graph Mining of Large Social and Information Networks"
    (3 hour version, tutorial at KDD 2010, July 2010).
  • "Community Structure in Large Social and Information Networks"
    (at the Newton Institute in Cambridge, June 2010).
  • "Algorithmic and Statistical Perspectives on Large-Scale Data Analysis"
    (at the SF Bay Area DM-SIG Meeting, February 2010).
  • "Community Structure in Large Social and Information Networks," in avi or mpg (note: slow to download),
    (at the 2009 IIT Kanpur Processing Massive Data Sets Workshop, December 2009).
  • "Statistical Leverage and Improved Matrix Algorithms"
    (at the 2009 ICML Workshops, June 2009).
  • "Sampling Algorithms and Coresets for Lp Regression and Applications," in mpg (note: slow to download),
    (at the 2006 IIT Kanpur Data Streams Workshop, December 2006).
  • "Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis"
    (at the 2005 IPAM Summer School, July 2005).


  • My old web page at Yale computer science has additional (somewhat outdated) information; but you should click on it (or Google search for it) if you are interested in extremely large homes.