
I'm an assistant professor in the InfoLab and affiliated with the PPL and SAIL labs, and I work on the fundamentals of the next generation of data management systems (bio here). This means we work on databases, theory, and machine learning, and we worry about hardware trends. A major application of our work is to make it dramatically easier to build high-quality systems that process more of the world's dark data (SQL databases, text, and images). Recently, we've shown that our systems can even exceed human volunteer quality in reading scientific journal articles (featured in Nature).
 New Tradeoffs for Systems. The next generation of data systems need to make fundamentally new tradeoffs. For example, we proved that many statistical algorithms can be run in parallel without locks (Hogwild! or SCD) or with lower precision. This leads to a fascinating systems tradeoff between statistical and hardware efficiency. These ideas have been picked up by web and enterprise companies for everything from recommendation to deep learning.
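To make the lock-free idea concrete, here is a minimal Python sketch of Hogwild!-style SGD on a toy least-squares problem. The function name and defaults are illustrative, it uses NumPy and plain threads, and it does dense updates for brevity; the actual Hogwild! analysis concerns sparse updates, where conflicting writes are rare.

```python
import threading

import numpy as np

def hogwild_sgd(X, y, n_threads=4, epochs=20, lr=0.01):
    """Hogwild!-style lock-free parallel SGD for least squares.

    Every thread reads and writes the shared weight vector w with no
    locks at all; occasional lost or stale updates are tolerated, and
    for sparse problems the Hogwild! analysis shows they are provably
    harmless. (This toy version uses dense updates for brevity.)
    """
    n, d = X.shape
    w = np.zeros(d)  # shared state, deliberately unprotected

    def worker(rows):
        for _ in range(epochs):
            for i in rows:
                grad = (X[i] @ w - y[i]) * X[i]  # grad of 0.5*(x_i.w - y_i)^2
                w[:] -= lr * grad                # unsynchronized in-place write

    chunks = np.array_split(np.arange(n), n_threads)
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

On a consistent linear system this still converges despite the races; the statistical-vs-hardware tradeoff shows up in how much staleness and lost work the algorithm can absorb in exchange for synchronization-free throughput.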
 New Programming Models. The DeepDive system demonstrates that one can build high-quality applications that use machine learning without specifying an inference algorithm, which makes it usable by a wider range of people. Our goal for the last few years has been to dramatically reduce the time analysts spend specifying models, maintaining them, and collaboratively building models.
 New Database Engines. We're thinking about how these new workloads change how one would build a database. We're building a new database, EmptyHeaded, that extends our theoretical work on worst-case optimal join processing. Multiway join algorithms are asymptotically and empirically faster than traditional database engines, sometimes by orders of magnitude. We're using it to unify database querying, graph patterns, linear algebra and inference, RDF processing, and more soon.
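A minimal sketch of what "multiway" means here, using the classic triangle query in the generic-join style. This is a toy for exposition only: EmptyHeaded's actual engine compiles such plans to SIMD set intersections over trie layouts, which this code does not attempt.

```python
from collections import defaultdict

def triangles(R, S, T):
    """Generic-join style evaluation of the triangle query
    Q(a, b, c) = R(a, b), S(b, c), T(a, c).

    Binding one attribute at a time via set intersection keeps the work
    within the AGM bound O(N^{3/2}) for this query, whereas any pairwise
    join plan can materialize Theta(N^2) intermediate results.
    """
    Ra, Sb, Ta = defaultdict(set), defaultdict(set), defaultdict(set)
    for a, b in R:
        Ra[a].add(b)
    for b, c in S:
        Sb[b].add(c)
    for a, c in T:
        Ta[a].add(c)

    out = []
    for a in Ra.keys() & Ta.keys():      # a must appear in R and T
        for b in Ra[a] & Sb.keys():      # b must extend (a, .) in R and start S
            for c in Sb[b] & Ta[a]:      # c must close the triangle
                out.append((a, b, c))
    return out
```

The same attribute-at-a-time pattern is what lets one engine serve graph-pattern matching, RDF queries, and (with aggregation) linear-algebra-style workloads.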
DeepDive is tons of fun (one pager) and commercialized as Lattice. Our code is on github. Data is here. Twitter @HazyResearch sometimes.

News
 ICML16
 ICML. Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling. Chris De Sa observes that asynchrony can introduce bias in Gibbs sampling and gives some sufficient conditions for the bias to vanish. Also, some chains that take exponential time to mix without asynchrony can mix in polynomial time with asynchrony, and vice versa.
 OPTML. Parallel SGD: When does Averaging Help? with Jian Zhang, Christopher De Sa, Ioannis Mitiliagkas.
 SIGMOD and PODS
 SIGMOD16. EmptyHeaded: A Relational Algebra for Graph Processing by Chris Aberger discusses how to use SIMD hardware to support our worst-case optimal join algorithms to find graph patterns. It's fast! on github.
 SIGMOD16 Industrial. A paper describing DeepDive and its use in Lattice written largely by Mike Cafarella (CEO).
 PODS16. Rohan and Manas extend our new join algorithms to message passing and thus to fast matrix multiplication. The point is: standard worst-case optimal join algorithms are enough to get the best asymptotic runtimes for these problems. A step toward the vision of unifying relational and linear algebra systems using GHDs.
 HILDA. We describe our group's work on data programming and DDlite. We're really excited about the ability to quickly create high-quality extractors!
 ICDE. Finland!
 Panel on Dark Data.
 ICDE Workshop. Old Techniques for New Join Algorithms: A Case Study in RDF Processing. Chris Aberger and Susan Tu show how classical techniques apply to our new join algorithms and describe some ongoing modifications to EmptyHeaded that are critical for performance.
 Elated that our group's work was honored by a MacArthur Foundation Fellowship. So excited for what's next!
 Our course material from CS145 intro databases is here, and we'll continue to update it. We're aware of a handful of courses that are using these materials, drop us a note if you do! We hope to update them throughout the year.
 ICDT16. It’s all a matter of degree: Using degree information to optimize multiway joins by Manas Joglekar discusses one technique to use degree information to perform joins faster (asymptotically!).
 SODA16. Weighted SGD for lp Regression with Randomized Preconditioning by Jiyan, Yin Lam Chow, Michael Mahoney, and me looks at some preconditioning methods to speed up SGD in theory and practice.
 NIPS15. Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width by Chris De Sa explains a notion of width that allows one to bound mixing times for factor graphs.
 NIPS15. Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms by Chris De Sa et al. derives results for low-precision and nonconvex Hogwild!-style (asynchronous) algorithms.
 VLDB15. Incremental Knowledge Base Construction Using DeepDive is our latest description of DeepDive.
 VLDB15. Honored to receive the VLDB Early Career Award for scalable analytics. talk video.
 New. Increasing the parallelism in multiround MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.

Upcoming Meetings and Talks
 Wisconsin 50th. April 21.
 System X. May 10–12.
 ICDE. Dark data! May 16–20.
 Inside the Black Box. June 8.
 MMDS. June 21–24.
 SIGMOD. June 26–July 1.
 Randomized Linear Algebra. JAPAN. July 25–29.
 A messy, incomplete log of old updates is here.