I am not at Stanford any more, so this web page is somewhat out of date.

I am currently at ICSI and the Department of Statistics at UC Berkeley. Click here for my new web page and for more information.

You can reach me via electronic mail: mmahoney is the username, and stat dot berkeley dot edu follows the at symbol. (Of course, my gmail address is still valid.)



Michael Mahoney's old home page



Basic info

I am in the math department at Stanford University. You can reach me via electronic mail: mmahoney is the username, and cs dot stanford dot edu follows the at symbol. (Of course, my gmail address is still valid.) Also, some other (somewhat outdated) information can be found at my old web page at Yale computer science.


Recent items of interest:

  • The scribed lecture notes for Stat260/CS294: Randomized Algorithms for Matrices and Data have been posted. Feedback welcome.
  • We will be running MMDS again! MMDS 2014 will take place on the campus of UC Berkeley on Tuesday, June 17 through Friday, June 20, 2014. See the main MMDS web page for more information on registration, speakers, etc., all of which will be available soon.
  • I'm moving to ICSI and the Department of Statistics at UC Berkeley.
  • Teaching, Fall 2013, at UC Berkeley: Stat260/CS294: Randomized Algorithms for Matrices and Data. A cleaned-up version of the lecture notes should be available soon.
  • Our NRC report on the "Frontiers in Massive Data Analysis" is out and available here.
  • I'm involved with running the program on "Theoretical Foundations of Big Data Analysis," to be held at the Simons Institute at UC Berkeley during the fall of 2013. Click here for details; in particular, note the link at the bottom of the page to information about fellowships for young researchers to participate.
  • We ran MMDS again. MMDS 2012 took place on the campus of Stanford University on July 10-13, 2012. Speaker videos and presentations are now available. For pdfs and videos of the speaker presentations, go to the main MMDS web page; or click here for the entire video collection.
  • The overview article on "Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis" associated with the invited talk at PODS 2012 meeting is on the arXiv here.
  • LSRN: An initial version of our code for LSRN, the randomized least-squares solver for parallel environments, is available here.
  • The monograph on "Randomized Algorithms for Matrices and Data" is available in NOW's "Foundations and Trends in Machine Learning" series here, and it is also available on the arXiv here.
  • Click here for information (including the slides and video!) on the Tutorial on "Geometric Tools for Identifying Structure in Large Social and Information Networks," given originally at ICML10 and KDD10 and subsequently at many other places. (The slides are also linked to below.)
  • The overview chapter on "Algorithmic and Statistical Perspectives on Large-Scale Data Analysis" is finally on the arXiv here; the book in which it will appear is in press; and a video of the associated talk is here.
  • Recent teaching: Fall 2009: CS369M: Algorithms for Massive Data Set Analysis
  • We have made public many of the networks we have used in our "Community Structure" papers. Click here for the networks and for related information; older networks are still here.
  • Our PNAS paper on "CUR Matrix Decompositions for Improved Data Analysis" has appeared as Proc. Natl. Acad. Sci. USA, 106, 697-702 (2009). Please email me if you can't access a copy from the PNAS web site.

  • Research interests

    • Algorithmic and statistical aspects of modern large-scale data analysis.
    • Design, analysis, and implementation of randomized algorithms for very large matrix, graph, and regression problems.
    • Implicit regularization and implicit optimization methods in scalable approximation algorithms.
    • Applications to analytics and vector space analytics on large social and information networks.
    • Applications to DNA microarray, single nucleotide polymorphism, astronomical, and medical imaging data.

    Much of my current research focuses on Randomized Numerical Linear Algebra, i.e., using random sampling and random projection methods to solve very large matrix-based problems; developing geometric network analysis tools, i.e., using scalable approximation algorithms with a geometric or statistical flavor to analyze the structure and dynamics of large informatics graphs; developing approximate computation and regularization methods for large informatics graphs; applications to community detection, clustering, and information dynamics in large social and information networks; and applications to DNA single nucleotide polymorphism (SNP) data, astronomical and medical imaging data, and large-scale statistical data analysis more generally.

    In the past, I developed and analyzed algorithms for large matrix, graph, and regression problems, and I applied these and related tools to the statistical data analysis of extremely large scientific and Internet data sets. For example, I worked on large-scale web analytics, machine learning, and query log analysis; applications of graph partitioning algorithms to clustering and community identification; and applications of randomized matrix algorithms to hyperspectral medical image data, DNA microarray data, and DNA SNP data.

    In the more distant past, I have also worked on developing and analyzing Monte Carlo algorithms for performing useful computations on extremely large matrices, e.g., the additive-error and relative-error CUR matrix decompositions. Past research has also included work in computational statistical mechanics on the development and analysis of the TIP5P model of liquid water, as well as work in both computational and experimental biophysics on proteins and protein-nucleic acid interactions.


    MMDS Workshops

    I run the MMDS meetings. We started the MMDS Workshops on "Algorithms for Modern Massive Data Sets" to address algorithmic and statistical challenges in modern large-scale statistical data analysis.

    • We will be running MMDS again! MMDS 2014 will take place on the campus of UC Berkeley on June 17-20, 2014. See the main MMDS web page for more information on registration, speakers, etc., all of which will be available soon.
    • MMDS 2012 took place on the campus of Stanford University on July 10-13, 2012. For pdfs and videos of the speaker presentations, go to the main MMDS web page; or click here for the entire video collection.
    • MMDS 2010 took place on the campus of Stanford University on June 15-18, 2010. MMDS 2010 addressed computation in large-scale scientific and internet data applications more generally. See the MMDS web page for details, including articles and all the speaker overheads!
    • MMDS 2008 took place on June 25-28, 2008. MMDS 2008 grew out of our expectation for what the algorithmic and statistical foundations of large-scale data analysis should look like a generation from now. Click here for an article that appeared in SIGKDD Explorations and SIAM News about the meeting.
    • MMDS 2006 took place on June 21-24, 2006. MMDS 2006 was originally motivated by the complementary perspectives brought by numerical linear algebra and theoretical computer science to matrix algorithms in large-scale data applications. Click here for an article in SIAM News about the meeting.
    These MMDS meetings generated intense interdisciplinary interest and were a big success --- so keep an eye out for future MMDSs!

    Publications

    2013
    • Tree-like Structure in Large Social and Information Networks,
    • A. B. Adcock, B. D. Sullivan, and M. W. Mahoney,
      Proc. of the 2013 IEEE ICDM, (2013).
    • Objective Identification of Informative Wavelength Regions in Galaxy Spectra,
    • C.-W. Yip, M. W. Mahoney, A. S. Szalay, I. Csabai, T. Budavari, R. F. G. Wyse, and L. Dobos,
      Technical Report, Preprint: arXiv:1312.0637 (2013) (arXiv),
      Accepted for publication, Astronomical Journal.
    • Evaluating OpenMP Tasking at Scale for the Computation of Graph Hyperbolicity,
    • A. B. Adcock, B. D. Sullivan, O. R. Hernandez, and M. W. Mahoney,
      Proc. of the 9-th IWOMP, 71-83 (2013).
    • Frontiers in Massive Data Analysis,
    • Committee on the Analysis of Massive Data, et al. (M. I. Jordan, et al.),
      The National Academies Press (2013) (web).
    • A Statistical Perspective on Algorithmic Leveraging,
    • P. Ma, M. W. Mahoney, and B. Yu,
      Technical Report, Preprint: arXiv:1306.5362 (2013) (arXiv),
      To appear in: Proc. of the 31st ICML Conference (2014),
      Journal version submitted for publication.
    • Robust Regression on MapReduce,
    • X. Meng, and M. W. Mahoney,
      Proc. of the 30th ICML Conference (2013) (pdf).
    • Quantile Regression for Large-scale Applications,
    • J. Yang, X. Meng, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1305.0087 (2013) (arXiv), (code),
      Proc. of the 30th ICML Conference (2013) (pdf),
      Accepted for publication, SIAM J. Scientific Computing.
    • Revisiting the Nystrom Method for Improved Large-Scale Machine Learning,
    • A. Gittens and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1303.1849 (2013) (arXiv), (code),
      Proc. of the 30th ICML Conference (2013) (pdf),
      Journal version submitted for publication.
    • Semi-supervised Eigenvectors for Large-scale Locally-biased Learning,
    • T. J. Hansen and M. W. Mahoney,
      Proc. of the 2012 NIPS Conference (pdf), (code),
      Technical Report, Preprint: arXiv:1304.7528 (2013) (arXiv),
      Accepted for publication, J. Machine Learning Research.
    2012
    • Low-distortion Subspace Embeddings in Input-sparsity Time and Applications to Robust Linear Regression,
    • X. Meng and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1210.3135 (2012) (arXiv),
      Proc. of the 45-th STOC, 91-100 (2013).
    • The Fast Cauchy Transform and Faster Robust Linear Regression,
    • K. L. Clarkson, P. Drineas, M. Magdon-Ismail, M. W. Mahoney, X. Meng, and D. P. Woodruff,
      Technical Report, Preprint: arXiv:1207.4684 (2012) (arXiv),
      Proc. of the 24-th Annual SODA, 466-477 (2013) (pdf),
      Journal version submitted for publication.
    • rCUR: an R package for CUR matrix decomposition,
    • A. Bodor, I. Csabai, M. W. Mahoney, and N. Solymosi,
      BMC Bioinformatics, 13:103 (2012) (pdf).
    • Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis,
    • M. W. Mahoney,
      Technical Report, Preprint: arXiv:1203.0786 (2012) (arXiv),
      Proc. of the 2012 ACM Symposium on Principles of Database Systems, 143-154, 2012 (pdf).
    • On the Hyperbolicity of Small-World and Tree-Like Random Graphs,
    • W. Chen, W. Fang, G. Hu, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1201.1717 (2012) (arXiv),
      Proc. of the 23-rd ISAAC 278-288 (2012) (pdf),
      Internet Mathematics, 9(4), 434-491 (2013).
    2011
    • Randomized Dimensionality Reduction for K-means Clustering,
    • C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas,
      Technical Report, Preprint: arXiv:1110.2897 (2011) (arXiv),
      Journal version submitted for publication.
    • Regularized Laplacian Estimation and Fast Eigenvector Approximation,
    • P. O. Perry and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1110.1757 (2011) (arXiv),
      Proc. of the 2011 NIPS Conference (pdf).
    • LSRN: A Parallel Iterative Solver for Strongly Over- or Under-Determined Systems,
    • X. Meng, M. A. Saunders, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1109.5981 (2011) (arXiv), (code),
      Accepted for publication, SIAM J. Scientific Computing.
    • Fast approximation of matrix coherence and statistical leverage,
    • P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff,
      Technical Report, Preprint: arXiv:1109.3843 (2011) (arXiv),
      Proc. of the 29th ICML Conference (2012) (pdf),
      J. Machine Learning Research, 13, 3475-3506 (2012) (pdf).
    • Localization on low-order eigenvectors of data matrices,
    • M. Cucuringu and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1109.1355 (2011) (arXiv).
    • Efficient Genomewide Selection of PCA-Correlated tSNPs for Genotype Imputation,
    • A. Javed, P. Drineas, M. W. Mahoney, and P. Paschou,
      Annals of Human Genetics, 75, 707-722 (2011) ( pdf).
    • Randomized Algorithms for Matrices and Data,
    • M. W. Mahoney,
      Foundations and Trends in Machine Learning, NOW Publishers, Volume 3, Issue 2, 2011 (now),
      TR version: Technical Report, Preprint: arXiv:1104.5557 (2011) (arXiv).
      (Abridged version in: Advances in Machine Learning and Data Mining for Astronomy, edited by M. J. Way, et al., pp. 647-672, 2012.)
    2010
    • Computation in Large-Scale Scientific and Internet Data Applications is a Focus of MMDS 2010,
    • M. W. Mahoney,
      Technical Report, Preprint: arXiv:1012.4231 (2010) (arXiv),
      Appeared in SIGKDD Explorations, SIGACT News, ASA-SCGN Newsletter, and IMS Bulletin.
    • CUR from a Sparse Optimization Viewpoint,
    • J. Bien, Y. Xu, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1011.0413 (2010) (arXiv),
      Proc. of the 2010 NIPS Conference (ps, pdf).
    • Algorithmic and Statistical Perspectives on Large-Scale Data Analysis,
    • M. W. Mahoney,
      Technical Report, Preprint: arXiv:1010.1609 (2010) (arXiv),
      In: Combinatorial Scientific Computing, pp. 427-469, edited by U. Naumann and O. Schenk, 2012.
    • Implementing regularization implicitly via approximate eigenvector computation,
    • M. W. Mahoney and L. Orecchia,
      Technical Report, Preprint: arXiv:1010.0703 (2010) (arXiv),
      Proc. of the 28th ICML Conference, 121-128 (2011) (pdf).
    • Approximating Higher-Order Distances Using Random Projections,
    • P. Li, M. W. Mahoney, and Y. She,
      Proc. of the 26th UAI Conference, 312-321 (2010) (ps, pdf),
      Technical Report, Preprint: arXiv:1203.3492 (2012) (arXiv).
    • Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving,
    • P. Drineas and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1005.3097 (2010) (arXiv).
    • Empirical Comparison of Algorithms for Network Community Detection,
    • J. Leskovec, K. J. Lang, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:1004.3539 (2010) (arXiv),
      Proc. of the 19-th International WWW, 631-640 (2010) (ps, pdf).
    2009
    • A Local Spectral Method for Graphs: with Applications to Improving Graph Partitions and Exploring Data Graphs Locally,
    • M. W. Mahoney, L. Orecchia, and N. K. Vishnoi,
      Technical Report, Preprint: arXiv:0912.0681 (2009) (arXiv),
      J. Machine Learning Research, 13, 2339-2365 (2012) (ps, pdf).
    • Unsupervised Feature Selection for the k-means Clustering Problem,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Proc. of the 2009 NIPS Conference (ps, pdf).
    • Learning with Spectral Kernels and Heavy-Tailed Data,
    • M. W. Mahoney and H. Narayanan,
      Technical Report, Preprint: arXiv:0906.4539 (2009) (arXiv).
    • Empirical Evaluation of Graph Partitioning Using Spectral Embeddings and Flow,
    • K. J. Lang, M. W. Mahoney, and L. Orecchia,
      Proc. of the 8-th International SEA, 197-208 (2009) (ps, pdf).
    • CUR Matrix Decompositions for Improved Data Analysis,
    • M. W. Mahoney and P. Drineas,
      Proc. Natl. Acad. Sci. USA, 106, 697-702 (2009) (ps, pdf).
    2008
    • An Improved Approximation Algorithm for the Column Subset Selection Problem,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Technical Report, Preprint: arXiv:0812.4293 (2008) (arXiv),
      Proc. of the 20-th Annual SODA, 968-977 (2009) (ps, pdf).
    • Algorithmic and Statistical Challenges in Modern Large-Scale Data Analysis are the Focus of MMDS 2008
    • M. W. Mahoney, L.-H. Lim, and G. E. Carlsson
      Technical Report, Preprint: arXiv:0812.3702 (2008) (arXiv),
      Appeared in SIGKDD Explorations (ps, pdf), SIAM News (ps, pdf), and ASA-SCGN Newsletter (ps, pdf), and abridged versions appeared in IMS Bulletin (ps, pdf) and AmStat News.
    • Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters,
    • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:0810.1355 (2008) (arXiv),
      Internet Mathematics, 6(1), 29-123 (2009).
    • Unsupervised Feature Selection for Principal Components Analysis,
    • C. Boutsidis, M. W. Mahoney, and P. Drineas,
      Proc. of the 14-th Annual SIGKDD, 61-69 (2008) (ps, pdf).
    • Statistical Properties of Community Structure in Large Social and Information Networks,
    • J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney,
      Proc. of the 17-th International WWW, 695-704 (2008) (ps, pdf).
    2007
    • Faster Least Squares Approximation,
    • P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlos,
      Technical Report, Preprint: arXiv:0710.1435 (2007) (arXiv),
      Numerische Mathematik, 117, 219-249 (2011).
    • PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations,
    • P. Paschou, E. Ziv, E. G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M. W. Mahoney, and P. Drineas,
      PLoS Genetics, 3, 1672-1686 (2007) (ps, pdf).
    • Relative-Error CUR Matrix Decompositions,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Technical Report, Preprint: arXiv:0708.3696 (2007) (arXiv),
      SIAM J. Matrix Analysis and Applications, 30, 844-881 (2008) (ps, pdf).
    • Feature Selection Methods for Text Classification,
    • A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney,
      Proc. of the 13-th Annual SIGKDD, 230-239 (2007) (ps, pdf).
    • Sampling Algorithms and Coresets for Lp Regression,
    • A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney,
      Technical Report, Preprint: arXiv:0707.1714 (2007) (arXiv),
      Proc. of the 19-th Annual SODA, 932-941 (2008) (ps, pdf),
      SIAM J. Computing, 38, 2060-2078 (2009) (ps, pdf).
    • Web Information Retrieval and Linear Algebra Algorithms,
    • A. Frommer, M. W. Mahoney, and D. B. Szyld (Eds.),
      Proc. of Dagstuhl Seminar 07071, (2007) (web).
    • Intra- and interpopulation genotype reconstruction from tagging SNPs,
    • P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas,
      Genome Research, 17(1), 96-107 (2007) (ps, pdf).
    2006
    • Bridging the Gap Between Numerical Linear Algebra, Theoretical Computer Science, and Data Applications,
    • G. H. Golub, M. W. Mahoney, P. Drineas, and L.-H. Lim,
      SIAM News 39:8 October 2006 (ps, pdf).
    • Randomized Algorithms for Matrices and Massive Data Sets,
    • P. Drineas and M. W. Mahoney,
      Proc. of the 32-nd Annual VLDB, 1269 (2006) (ps, pdf).
    • Subspace Sampling and Relative-Error Matrix Approximation: Column-Row-Based Methods,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. of the 14-th Annual ESA, 304-314 (2006) (ps, pdf).
    • Subspace Sampling and Relative-Error Matrix Approximation: Column-Based Methods,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. of the 10-th Annual RANDOM, 316-326 (2006) (ps, pdf).
    • Tensor-CUR Decompositions For Tensor-Based Data,
    • M. W. Mahoney, M. Maggioni, and P. Drineas,
      Proc. of the 12-th Annual SIGKDD, 327-336 (2006) (ps, pdf),
      SIAM J. Matrix Analysis and Applications, 30, 957-987 (2008) (ps, pdf).
    • Polynomial Time Algorithm for Column-Row-Based Relative-Error Low-Rank Matrix Approximation,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Technical Report, DIMACS TR 2006-04 March 2006 (ps, pdf).
    • Sampling Algorithms for L2 Regression and Applications,
    • P. Drineas, M. W. Mahoney, and S. Muthukrishnan,
      Proc. of the 17-th Annual SODA, 1127-1136 (2006) (ps, pdf).
    2005
    • A Randomized Algorithm for a Tensor-Based Generalization of the Singular Value Decomposition,
    • P. Drineas and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1327, June 2005 (ps, pdf),
      Linear Algebra and its Applications, 420, 553-571 (2007) (ps, pdf).
    • On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-Based Learning,
    • P. Drineas and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1319, April 2005 (ps, pdf),
      Proc. of the 18-th Annual COLT, 323-337 (2005) (ps, pdf),
      J. Machine Learning Research, 6, 2153-2175 (2005) (ps, pdf).
    2004
    • Sampling Sub-problems of Heterogeneous Max-Cut Problems and Approximation Algorithms,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1283, April 2004 (ps, pdf),
      Proc. of the 22-nd Annual STACS, 57-68 (2005) (ps, pdf),
      Random Structures and Algorithms, 32:3, 307-333 (2008) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices III: Computing an Efficient Approximate Decomposition of a Matrix,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1271, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 184-206 (2006) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices II: Computing Low-Rank Approximations to a Matrix,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1270, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 158-183 (2006) (ps, pdf).
    • Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication,
    • P. Drineas, R. Kannan, and M. W. Mahoney,
      Technical Report, YALEU/DCS/TR-1269, February 2004 (ps, pdf),
      SIAM J. Computing, 36, 132-157 (2006) (ps, pdf).
    2003
    • Rapid Mixing of Several Markov Chains for a Hard-Core Model,
    • R. Kannan, M. W. Mahoney, and R. Montenegro,
      Proc. of the 14-th Annual ISAAC, 663-675 (2003) (ps, pdf).
    2001
    • Quantum, Intramolecular Flexibility, and Polarizability Effects on the Reproduction of the Density Anomaly of Liquid Water by Simple Potential Functions,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 115, 10758-10768 (2001) (ps, pdf).
    • Rapid Estimation of Electronic Degrees of Freedom in Monte Carlo Calculations for Polarizable Models of Liquid Water,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 114, 9337-9349 (2001) (ps, pdf).
    • Diffusion Constant of the TIP5P Model of Liquid Water,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 114, 363-366 (2001) (ps, pdf).
    2000
    • A Five-Site Model for Liquid Water and the Reproduction of the Density Anomaly by Rigid, Nonpolarizable Potential Functions,
    • M. W. Mahoney and W. L. Jorgensen,
      J. Chem. Phys., 112, 8910-8922 (2000) (ps, pdf).
    1997
    • Repression and Activation of Promoter-Bound RNA Polymerase Activity by Gal Repressor,
    • H. E. Choy, R. R. Hanger, T. Aki, M. Mahoney, K. Murakami, A. Ishihama, and S. Adhya,
      J. Mol. Biol. 272: 293-300, 1997 (ps, pdf).
    • Discrete Representations of the Protein C-alpha Chain,
    • X. F. de la Cruz, M. W. Mahoney, and B. K. Lee,
      Fold. & Des. 2: 223-234, 1997 (ps, pdf).

    Talks and presentations

    Several recent talks:

  • Revisiting the Nystrom Method for Improved Large-Scale Machine Learning (pdf, ppt)
  • Randomized Regression in Parallel and Distributed Environments (talk from GraphLab 2013) (pdf)
  • Theory (and some practice) of Randomized Algorithms for Matrices and Data (talk from FOCS 2012 Workshop) (pdf, ppt)
  • Extracting insight from large networks: implications of small-scale and large-scale structure (pdf, ppt)
  • Several tutorial presentations:

  • Geometric Tools for Identifying Structure in Large Social and Information Networks (1.5 hr version at SAMSI Opening Workshop 2010, etc.) (pdf, ppt)
  • Geometric Tools for Identifying Structure in Large Social and Information Networks (2 hr version at ICASSP 2011, etc.) (pdf, ppt)
  • Geometric Tools for Identifying Structure in Large Social and Information Networks (3 hr version at ICML 2010 and KDD 2010, etc.) (pdf, ppt) (The pdf file in four pieces: here, here, here, and here.)
  • Randomized Algorithms for Matrices and Massive Data Sets (at SIAM-SDM06 2006 and VLDB 2006) (ppt)
  • Randomized Algorithms for Matrices and Massive Data Sets (at ACM-SIGKDD 2005) (ppt)
  • Several other older talks:

  • Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments (version from MMDS 2012) (pdf)
  • Sensors, networks, and massive data (pdf, ppt)
  • Randomized Algorithms for Matrices and Data (pdf, ppt)
  • Approximate computation and implicit regularization in large-scale data analysis (PODS vsn) (pdf, ppt)
  • Approximate computation and implicit regularization in large-scale data analysis (Stats vrsn1) (pdf, ppt)
  • Approximate computation and implicit regularization in large-scale data analysis (Short vrsn) (pdf, ppt)
  • Looking for clusters in your data ... in theory and in practice (pdf, ppt)
  • Fast Approximation of Matrix Coherence and Statistical Leverage (pdf, ppt)
  • Implementing regularization implicitly via approximate eigenvector computation (pdf, ppt)
  • Linear Algebra and Machine Learning of Large Informatics Graphs (pdf, ppt)
  • Geometric Network Analysis Tools (talk from MMDS 2010) (pdf, ppt)
  • Algorithmic and Statistical Perspectives on Large-Scale Data Analysis (pdf, ppt)
  • Community structure in large social and information networks (newer) (pdf, ppt)
  • Statistical leverage and improved matrix algorithms (newer and long) (pdf, ppt)
  • Approximation Algorithms as Experimental Probes of Informatics Graphs (pdf, ppt)
  • Community structure in large social and information networks (talk from MMDS 2008) (pdf, ppt)
  • Community structure in large social and information networks (older) (pdf, ppt)
  • Statistical leverage and improved matrix algorithms (older and short) (pdf, ppt)
  • Sampling algorithms and core-sets for Lp regression and applications (pdf, ppt)
  • CUR Matrix Decompositions for Improved Data Analysis (talk from MMDS 2006) (pdf, ppt)
  • A Relative-Error CUR Decomposition for Matrices and Its Data Applications (pdf, ppt)
  • Sampling Algorithms for L2 Regression and Applications (talk from SODA 2006) (pdf, ppt)
  • Approximating a Gram Matrix for Improved Kernel-Based Learning (talk from COLT 2005) (ps, pdf)
  • Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis (newer) ( pdf, ppt)
  • Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis (older) ( pdf)
  • CUR Matrix Decomposition with Applications to Algorithm Design and Massive Data Set Analysis (pdf)
  • Fast Monte Carlo Algorithms for Massive Data Sets and Approximating Max-Cut (ps, pdf)
  • The TIP5P Water talk:

  • The Computational Statistical Mechanics of Simple Models of Liquid Water ( pdf)
  • Videos of several talks:

  • "Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments"
    (at the Simons Big Data Workshop II, October 2013).
  • "Input-sparsity Time Algorithms for Embeddings and Regression Problems"
    (at the Simons Big Data Workshop I, September 2013).
  • "Past, Present and Future of Randomized Numerical Linear Algebra: Part I (PD) and Part II" (MM)
    (at the Simons Big Data Bootcamp, September 2013).
  • "Randomized Regression in Parallel and Distributed Environments"
    (at the GraphLab Workshop, July 2013).
  • "Sensors, networks, and massive data"
    (at Kavli FoS, November 2012).
  • "Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments"
    (at MMDS 2012, July 2012).
  • "Extracting insight from large networks: implications of small-scale and large-scale structure"
    (at the University of Marlyand, April 2012).
  • "Fast Approximation of Matrix Coherence and Statistical Leverage"
    (at the NIPS Workshops, December 2011).
  • "Approximate computation and implicit regularization in large-scale data analysis," or click here
    (at the Workshop on Beyond Worst-Case Analysis, September 2011).
  • "Linear Algebra and Machine Learning of Large Informatics Graphs"
    (at the NIPS Workshops, December 2010).
  • "Geometric Tools for Identifying Structure in Large Social and Information Networks"
    (90 minute version, tutorial at SAMSI Opening workshop, August 2010).
  • "Geometric Tools for Graph Mining of Large Social and Information Networks"
    (3 hour version, tutorial at KDD 2010, July 2010).
  • "Community Structure in Large Social and Information Networks"
    (at the Newton Institute in Cambridge, June 2010).
  • "Algorithmic and Statistical Perspectives on Large-Scale Data Analysis"
    (at the SF Bay Area DM-SIG Meeting, February 2010).
  • "Community Structure in Large Social and Information Networks," in avi or mpg (note: slow to download),
    (at the 2009 IIT Kanpur Processing Massive Data Sets Workshop, December 2009).
  • "Statistical Leverage and Improved Matrix Algorithms"
    (at the 2009 ICML Workshops, June 2009).
  • "Sampling Algorithms and Coresets for Lp Regression and Applications," in mpg (note: slow to download),
    (at the 2006 IIT Kanpur Data Streams Workshop, December 2006).
  • "CUR Matrix Decompositions for Improved Data Analysis"
    (at Johns Hopkins University, March 2006).
  • "Fast Monte Carlo Algorithms for Matrix Operations and Massive Data Set Analysis"
    (at the 2005 IPAM Summer School, July 2005).


  • My old web page at Yale computer science has additional (somewhat outdated) information; but you should click on it (or Google search for it) if you are interested in extremely large homes.