• We've been looking at how foundation models can help us build software systems, most recently:
  • Tri Dao is amazing and is headed to Princeton. He's the force of nature behind the widely used FlashAttention (usage).
  • Some thoughts on foundation models and data with Simran.
  • Combining foundation models with weak supervision techniques in UAI22, led by Dan and Mayee. Best Student Paper Runner-Up.
  • My DAC Sky Talk slides are here
  • In ICML22, we describe audio generation with state spaces, fast learning with sparse matrices, making contrastive learning robust, and a new method for robustness.
  • In ICLR22, we share some results on state space models, Domino, and sparsity.
  • SIGMOD keynote on Data-centric AI, Declarative ML, and Foundation Models in data; slides (YouTube).
  • In NeurIPS21, we share some of our results on sequence modeling, sparsity, NAS and introduce two benchmarking projects.
  • In ICML21, we describe some results on hyperbolic geometry, model evaluation, and stability.
  • In NAACL21, a pair of papers: a demonstration of Robustness Gym, and an industry-track paper comparing modern named entity linking systems.
  • In ICLR2021, model patching and self-supervision on medical images.
  • In NeurIPS 2020, memory units, hidden stratification, and non-Euclidean geometry.
  • Recent Software Releases
  • In ICML 2020, we describe our continuing work on weak supervision and data augmentation in two papers.
  • In ACL2020, we describe some of our continuing work on embeddings, compression, and geometry.
  • A bunch of great collaborations in Nature-family journals, clinical journals, and others.

  • We're looking for great postdocs jointly with the Mobilize Center.
  • Talk info: Apple NLU Summit, KDD Knowledge Graphs, KDD Converse, Triangle Computer Science Distinguished Lecture, JHU, MIDAS @ Michigan, Google Ads ML Keynote, Large-Scale Learning Keynote, Wisconsin MLOS, NDBC, Naver Labs.
  • In ICLR2020
  • Sparse recovery for Jacobi polynomials in ICALP20.
  • Charles leads the way on understanding the link between weak supervision and instrumental variables for causal inference in AISTATS20.
  • In CIDR20, a paper about our Overton work at Apple, including zero-code deep learning, weak supervision, and data slicing.
  • Exciting to see Gmail adopt Software 2.0.
  • Students and postdocs described their views on Software 2.0 and what's next.
  • Snorkel has a new home at Snorkel.org. Excited for all the great collaborations!
  • Teaching ML (CS229) in spring 2020.
  • Upcoming talks: Dagstuhl on ML meets Software Engineering, SysML (Keynote), OPsML@SysML, WWW BIG, WWW IDS, DAC Sky Talk, Duke
  • NeurIPS19. Preprints, blog posts, and code releases coming soon!
  • Talks: Berkeley RISE, Google FACT Keynote, Cornell 50th CIS Panel, ONR, AI for Health, PSB Keynote on AI and Ethics, CIDR2020, Amsterdam Data Science Meetup, USC.
  • In ICML19, we talk about learning structure with only weak supervision, a theory for data augmentation, and how to learn structured matrices that are provably fast using butterfly factorizations.
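    For a flavor of the butterfly idea: an n x n matrix is parameterized as a product of log2(n) sparse factors, each combining entry i with its FFT-style partner i XOR 2^k, so a matrix-vector multiply costs O(n log n) instead of O(n^2). A minimal numpy sketch, where the coefficient layout is illustrative rather than the paper's exact construction:

      import numpy as np

      # Illustrative sketch only; not the paper's implementation.
      def butterfly_matvec(factors, x):
          # factors[k] has shape (n, 2): for each output i, the two
          # coefficients applied to inputs i and its partner i ^ 2**k.
          for k, coef in enumerate(factors):
              partner = np.arange(len(x)) ^ (2 ** k)
              x = coef[:, 0] * x + coef[:, 1] * x[partner]
          return x

      n = 8                                               # a power of two
      factors = [np.random.randn(n, 2) for _ in range(3)]  # log2(8) = 3 factors
      y = butterfly_matvec(factors, np.random.randn(n))

    Each factor touches only 2n coefficients, which is where the (nearly) linear-time multiply comes from.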
  • In SIGMOD19, with folks at Google, we talk about lessons learned from Snorkel applied at Google in DryBell.

  • Some of the industrial engagements that we're most proud of: Software 2.0 products with Apple via Lattice, with Google Ads (blog), and with Intel via Snorkel. We're proud of all the folks who adopted Snorkel! Our technical ideas have been picked up as well, including Hogwild! in Microsoft's deep learning system (Wired), momentum correction for delay at Nvidia, and high-accuracy low-precision (HALP) training in Tencent's ImageNet-in-minutes work. Our work also led to classical analytics layers for companies like Oracle, Cloudera, and Pivotal. In benchmarking: GLUE, TAC-KBP, and better-than-volunteer accuracy in machine reading for paleobiology in 2014.

    To validate our ideas, we continue to build systems that we hope change the way people do science and improve society. This work is with great partners in areas including paleobiology (Nature), drug repurposing, genomics, material science, and the fight against human trafficking (60 minutes, Forbes, Scientific American, WSJ, BBC, and Wired). Our work is supporting investigations. In the past, we've worked with a neutrino telescope (IceCube Science cover and our modest contribution) and on economic indicators.

  • The DeepDive (one pager) project was commercialized as Lattice. As of 2017, Lattice is part of Apple. Our work on architectural changes for converged analytics and machine learning is commercialized as SambaNova Systems.
  • In AAAI19, Snorkel folks talk about Training Complex Models with Multi-Task Weak Supervision. We see this as a new and exciting way to build machine learning software.
  • In AISTATS19, Tri, Avner, and Jian try to explain why low-precision random Fourier features generalize better than Nyström features in the same amount of memory. This result is surprising since, if you measure by feature count, the reverse is true!
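    For context, here is a small full-precision sketch of random Fourier features for the RBF kernel; the paper studies a low-precision variant of features like these, compared to Nyström at equal memory. Names and defaults are illustrative:

      import numpy as np

      # Illustrative sketch of standard (full-precision) random Fourier features.
      def rff(X, n_features=256, gamma=1.0, seed=0):
          # Map X so that z(x) . z(y) approximates exp(-gamma * ||x - y||^2).
          rng = np.random.default_rng(seed)
          W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_features))
          b = rng.uniform(0, 2 * np.pi, size=n_features)
          return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)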
  • Paroma's thoughts about automating weak supervision in Reef, in VLDB19.
  • Alex and Braden's thoughts on the role of massive multitask learning and weak supervision in Software 2.0 in CIDR19.
  • Check out great workshops run in part by our students: Relational Representation Learning at NeurIPS, and the 2nd Learning with Limited Labeled Data and Graph Representation Learning at ICLR19.
  • Beliz, Albert, and Fred learn embeddings in mixed product spaces (hyperbolic, spherical, and Euclidean) in ICLR19.
  • Theo, Ihab, and crew have released HoloClean.
  • A draft writeup about bit centering, a technique to use accelerators for low-precision training, in ISMP18.
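    Roughly, bit centering keeps a full-precision "center" for the model, runs cheap low-precision inner steps on an offset from that center, and periodically folds the offset back in. A loose sketch of the pattern, with float16 standing in for the accelerator's low-precision arithmetic (the writeup has the real algorithm and analysis):

      import numpy as np

      # Illustrative sketch of the recentering pattern, not the paper's algorithm.
      def bit_centered_sgd(grad, w0, outer=20, inner=100, lr=0.01):
          center = np.asarray(w0, dtype=np.float64)            # high-precision anchor
          for _ in range(outer):
              delta = np.zeros_like(center, dtype=np.float16)  # cheap iterate
              for _ in range(inner):
                  g = grad(center + delta.astype(np.float64))
                  delta = (delta.astype(np.float64) - lr * g).astype(np.float16)
              center += delta.astype(np.float64)               # recenter in full precision
          return center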
  • A small writeup about Software 2.0 and Snorkel for KDD18.
  • In ICML18, hyperbolic embeddings embed structured knowledge in continuous space; see also a blog post about our work.
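    The core quantity is the Poincaré-ball geodesic distance, which blows up near the boundary of the ball; that exponential growth of volume with radius is what lets trees embed with low distortion. A minimal sketch:

      import numpy as np

      # Illustrative sketch of the Poincare-ball distance.
      def poincare_dist(u, v, eps=1e-9):
          # u and v are points strictly inside the unit ball.
          sq = np.sum((u - v) ** 2)
          denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
          return np.arccosh(1.0 + 2.0 * sq / (denom + eps))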
  • In ACL18, train your classifiers with natural language! Braden, Percy, and Stephanie show you how.
  • In SODA18, we characterize the largest class of matrices for which matrix-vector multiply runs in (nearly) linear time. Anna is exploring whether these matrices can be useful in deep learning in ICLR18. Thank you to the NSF for supporting this work!
  • Snorkel system paper in VLDB18 with its own blog. Software 2.0 madness is coming...
  • In SIGMOD18, Fonduer constructs knowledge bases from richly formatted data using visual and textual reasoning.
  • For PCA and other mildly nonconvex matrix problems, in AISTATS18, we show that a simple stochastic algorithm gets the optimal accelerated rate, and that the standard Polyak momentum scheme can't give acceleration in the stochastic case.

  • New Tradeoffs for Machine Learning Systems. The next generation of data systems needs to make fundamentally new tradeoffs. For example, we proved that many statistical algorithms can be run in parallel without locks (Hogwild! or SCD) or with lower precision. This leads to a fascinating systems tradeoff between statistical and hardware efficiency. These ideas have been picked up by web and enterprise companies for everything from recommendation to deep learning. There are limits to the robustness of these algorithms; see our ICML 2016 best paper.
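    For intuition, here is a minimal sketch of the Hogwild! pattern on least squares: several threads update one shared weight vector with no locks, and when updates are sparse and mostly disjoint the races rarely hurt convergence. The objective and thread layout here are illustrative, and CPython's GIL limits true parallelism; the point is the lock-free update:

      import numpy as np
      from threading import Thread

      # Illustrative sketch of lock-free (Hogwild!-style) SGD.
      def hogwild_sgd(X, y, n_threads=4, epochs=5, lr=0.01):
          w = np.zeros(X.shape[1])   # shared state, updated racily on purpose

          def worker(rows):
              for _ in range(epochs):
                  for i in rows:
                      g = (X[i] @ w - y[i]) * X[i]      # per-example gradient
                      np.subtract(w, lr * g, out=w)     # unsynchronized write

          chunks = np.array_split(np.random.permutation(len(y)), n_threads)
          threads = [Thread(target=worker, args=(c,)) for c in chunks]
          for t in threads: t.start()
          for t in threads: t.join()
          return w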

  • New Database Engines. We're thinking about how these new workloads change how one would build a database. We're building a new database, EmptyHeaded, that extends our theoretical work on optimal join processing. Multiway join algorithms are asymptotically and empirically faster than traditional database engines—by orders of magnitude. We're using it to unify database querying, graph patterns, linear algebra and inference, RDF processing, and more soon.
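    To make the flavor concrete, here is a toy version of the worst-case-optimal "generic join" idea on the triangle query R(a,b), S(b,c), T(a,c): intersect one attribute at a time rather than joining pairwise, so intermediate results never blow up. The dict-of-sets layout is illustrative; EmptyHeaded's engine is far more sophisticated:

      from collections import defaultdict

      # Illustrative sketch of a generic-join-style triangle query.
      def index(pairs):
          idx = defaultdict(set)
          for x, y in pairs:
              idx[x].add(y)
          return idx

      def triangles(R, S, T):
          r, s, t = index(R), index(S), index(T)
          out = []
          for a in r.keys() & t.keys():      # candidate a's from R and T
              for b in r[a] & s.keys():      # b's consistent with R(a,b) and S
                  for c in s[b] & t[a]:      # c's consistent with S(b,c), T(a,c)
                      out.append((a, b, c))
          return out

      # triangles([(1, 2)], [(2, 3)], [(1, 3)]) -> [(1, 2, 3)]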
  • Our course material from CS145 (intro databases) is available, and we'll continue to update it. We're aware of a handful of courses that are using these materials. Drop us a note if you do!
  • Recent/Upcoming keynotes and talks: EDBT17, UAI17, ABS East 2017, Cornell, Alibaba, CMU (SiValley), SystemX, KBCOM (WSDM), ITBB18.
  • Upcoming talks, keynotes, and meetings: March: MEMEX, SIMPLEX, EDBT, AI and the future of business, AAAI Spring; April: Computer Forum; May: SIGMOD, SystemX, SIOPT 2017; June: MongoDB World, STOC Theory Fest, NAS Kavli; August: UAI, DIMACS large-scale learning.
  • Nature Communications. Kun's paper about automated cancer prognosis is out! He shows that automated approaches can outperform human pathologists at lung cancer prognosis. Update: Kun wins the data parasite award for this work!
  • Elated that our group's work was honored by a MacArthur Foundation Fellowship. So excited for what's next!
  • Talks in Feb: Mobilize Center, Distributed Inference at AAAI.
  • SIGMOD and PODS
  • VLDB15. Honored to receive the VLDB Early Career Award for scalable analytics. Talk video.
  • ICML16
  • Upcoming Meetings and Talks
  • ICDT16. It’s all a matter of degree: Using degree information to optimize multiway joins by Manas Joglekar discusses one technique to use degree information to perform joins faster (asymptotically!).
  • SODA16. Weighted SGD for lp Regression with Randomized Preconditioning by Jiyan, Yin Lam Chow, Michael Mahoney, and me looks at some preconditioning methods to speed up SGD in theory and practice.
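    For flavor, here is a toy version of the importance-sampling half of the idea, on least squares rather than lp regression: sample rows with probability proportional to their squared norm and reweight the gradient so it stays unbiased. The paper's preconditioned scheme is considerably sharper than this:

      import numpy as np

      # Illustrative sketch of row-norm importance sampling for SGD.
      def weighted_sgd_ls(A, b, steps=1000, lr=0.1, seed=0):
          rng = np.random.default_rng(seed)
          p = np.sum(A ** 2, axis=1)
          p /= p.sum()                       # row-norm sampling distribution
          x = np.zeros(A.shape[1])
          for _ in range(steps):
              i = rng.choice(len(b), p=p)
              g = (A[i] @ x - b[i]) * A[i] / (len(b) * p[i])  # unbiased estimate
              x -= lr * g
          return x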
  • NIPS15. Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width by Chris De Sa explains a notion of width that allows one to bound mixing times for factor graphs.
  • NIPS15. Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms by Chris De Sa et al. derives results for low precision and non-convex Hogwild! (asynchronous) style algorithms.
  • VLDB15. Incremental Knowledge Base Construction Using DeepDive is our latest description of DeepDive.
  • New. Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
  • ICDE. Finland!
  • USC ML. Jan 26.
  • Berkeley. Feb 3.
  • Michigan. Mar 11.
  • ICDT. Mar 15-21.
  • Strata. Mar 28-31.
  • SIMPLEX. April 4-7.
  • Dagstuhl. Foundations of Databases. April 10-15. Slides for the EDBT/ICDT keynote on Joins and Convex Geometry; Frank explains our joins work very nicely.
  • Code
  • Application Overview Videos (See our YouTube channel, HazyResearch)