
I'm an associate professor in the InfoLab who is affiliated with the Statistical Machine Learning Group, PPL, and SAIL (bio). I work on the foundations of the next generation of data analytics systems. These systems extend ideas from databases, machine learning, and theory, and our group is active in all areas. A major application of our work is to make it dramatically easier to build high-quality machine learning systems to process dark data including text, images, and video, e.g., Snorkel.
The DeepDive (one pager) project is commercialized as Lattice. Our code is on GitHub.

News
 Upcoming talks, keynotes, and meetings: March: MEMEX, SIMPLEX, EDBT, AI and the future of business, AAAI Spring; April: Computer Forum; May: SIGMOD, SystemX, SIOPT 2017; June: MongoDB World, STOC Theory Fest, NAS Kavli; August: UAI, DIMACS large-scale learning.
 Blog Posts. Students have been writing blog posts about our work. More soon...
 An argument for why weak supervision is a key technique to build ML systems. NIPS video
 Debugging generative models with Socratic Learning (manuscript)
 Some initial thoughts about how to supervise models using natural language alone.
 Trolling about tuning momentum for Deep Learning systems at large scale.
 Manuscripts.
 Stephen Bach learns the structure of graphical models without any hand-labeled data.
 Paroma Varma corrects generative models using discriminative models (blog)
 Sen, Xiao, Braden and Luke describe a system for multimodal extraction (blog)
 Theo, Xu, and Ihab use weak supervision to get mind-blowingly good results in data cleaning.
 Recurrence Width for Structured Dense Matrix Vector Multiplication with Albert Gu, Rohan Puttagunta, and Atri Rudra studies the problem of structured matrix-vector multiply, e.g., Fourier transforms, orthogonal polynomials, low displacement rank. This work unifies and, we think, simplifies many known algorithms and extends to a host of new cases. We are hopeful that we can develop one technique to unify all known cases of structured matrix-vector multiply.
 Nature Communications. Kun's paper about automated cancer prognosis is out! He shows that automated approaches can outperform human pathologists at lung cancer prognosis. Update: Kun wins the data parasite award for this work!
 Database Theory
 SIGMOD 2017. In SlimFast, Theo et al. describe a formal statistical model for data quality that has quality guarantees and runs efficiently on real datasets! blog post.
 ICDT 2017. Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
 PODS 2017. Benny Kimelfeld has a great framework for feature engineering based on the relational model.
 Machine Learning and Optimization
 NIPS 2016. Data Programming: Creating Large Training Sets, Quickly by Alex Ratner, Chris De Sa, Sen Wu, and Daniel Selsam. The idea is to have the user create a generative model as a byproduct of their actions. We use this model to create a large (but noisy) training set to train a discriminative model. The hope is that this makes programming easier, and it is now in use inside Snorkel. User studies were reported here when it was called DDlite. NIPS video
 NIPS 2016. Chris De Sa, Bryan He, and Ioannis continue to discover fundamental properties of Gibbs Sampling including scan order. This is used in our generative models in Snorkel and factor graph inference. video.
 NIPS 2016. Peng, Jiyan, Fred, and MICHAEL MAHONEY did some exciting work about subsampled Newton methods that work on more poorly conditioned problems than previous approaches, while retaining the speed benefits of subsampled Newton approaches.
 Allerton 2016. In a short note Ioannis, Ce, and Stefan show that asynchrony for SGD can be viewed as adding a momentum term. The analysis does not depend on whether the function is convex, which means it applies to deep learning. This makes me feel a little better about people running Hogwild! deep learning systems. But it does mean you need to tune your momentum. It is a key ingredient in Omnivore, our deep learning optimizer, which is described here. A tongue-in-cheek blog post about it here.
 ICML 2016. Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling. Chris De Sa observes that asynchrony can introduce bias in Gibbs sampling and gives some sufficient conditions for the bias to vanish. Also, some chains that take exponential time to mix without asynchrony can mix in polynomial time with asynchrony, and vice versa. Thank you to the committee for selecting this as best paper.
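 For readers new to the data programming item above (NIPS 2016), here is a minimal sketch of the workflow it describes: users write small labeling functions, their votes are combined into noisy labels, and those labels become a training set for a discriminative model. Everything in it is an assumption made for illustration: the toy task, the labeling function names, and especially the majority vote, which stands in for the learned generative model (the paper's model estimates labeling function accuracies rather than counting votes). It is not Snorkel's API.

```python
from typing import Callable, List, Tuple

# Label values: a labeling function may also abstain.
ABSTAIN, NEG, POS = 0, -1, 1

LabelingFunction = Callable[[str], int]


# Hypothetical labeling functions for a toy "does this sentence
# mention a spouse relation?" task.
def lf_married(text: str) -> int:
    return POS if "married" in text.lower() else ABSTAIN


def lf_wife_or_husband(text: str) -> int:
    lowered = text.lower()
    return POS if ("wife" in lowered or "husband" in lowered) else ABSTAIN


def lf_colleague(text: str) -> int:
    return NEG if "colleague" in text.lower() else ABSTAIN


def apply_lfs(lfs: List[LabelingFunction], texts: List[str]) -> List[List[int]]:
    """Build the label matrix: one row per example, one column per function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes: List[int]) -> int:
    """Simplified stand-in for the generative model: sign of the vote sum."""
    total = sum(votes)
    if total > 0:
        return POS
    if total < 0:
        return NEG
    return ABSTAIN


texts = [
    "Alice married Bob in 2010.",
    "Carol is a colleague of Dave.",
    "Erin and her husband Frank moved to Ohio.",
    "Grace met Heidi at a conference.",
]
lfs = [lf_married, lf_wife_or_husband, lf_colleague]

label_matrix = apply_lfs(lfs, texts)
noisy_labels = [majority_vote(row) for row in label_matrix]

# Examples where every function abstained are dropped; the rest would be
# handed to a discriminative model as (noisy) training data.
training_set: List[Tuple[str, int]] = [
    (text, label) for text, label in zip(texts, noisy_labels) if label != ABSTAIN
]
```

 The point of the design is that the user never labels individual examples; adding or editing a labeling function relabels the whole corpus at once.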
 Our course material from CS145 intro databases is available (send a note), and we'll continue to update it. We're aware of a handful of courses that are using these materials. Drop us a note if you do! We hope to update them throughout the year.
Research Details
 New Tradeoffs for Machine Learning Systems. The next generation of data systems needs to make fundamentally new tradeoffs. For example, we proved that many statistical algorithms can be run in parallel without locks (Hogwild! or SCD) or with lower precision. This leads to a fascinating systems tradeoff between statistical and hardware efficiency. These ideas have been picked up by web and enterprise companies for everything from recommendation to deep learning. There are limits to the robustness of these algorithms, see our ICML 2016 best paper.
 New Programming Models. Our goal for the last few years has been to dramatically reduce the time analysts spend specifying models, maintaining them, and collaboratively building models. Our new effort for lightweight dark data extraction is Snorkel, which is built on the idea of weak supervision and data programming, see our blog or video. These systems do not use traditional hand-labeled training data, which removes a fundamental obstacle in using machine learning tools.
 New Database Engines. We're thinking about how these new workloads change how one would build a database. We're building a new database, EmptyHeaded, that extends our theoretical work on optimal join processing. Multiway join algorithms are asymptotically and empirically faster than traditional database engines, by orders of magnitude. We're using it to unify database querying, graph patterns, linear algebra and inference, RDF processing, and more soon.
To validate our ideas, we continue to build systems that we hope change the way people do science and improve society. This work is with great partners in areas including paleobiology (Nature), drug repurposing, genomics, materials science, and the fight against human trafficking (60 Minutes, Forbes, Scientific American, WSJ, BBC, and Wired). Our work is supporting investigations. In the past, we've worked with a neutrino telescope (IceCube Science cover and our modest contribution) and on economic indicators.
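 To give a flavor of what a multiway join looks like, here is a small sketch of the triangle query evaluated one attribute at a time with set intersections, instead of joining two relations and then filtering against the third. This is an illustration under our own assumptions, not EmptyHeaded's implementation: the function names and the dictionary-of-sets index are hypothetical, and the sketch makes no claim to the asymptotic guarantees of the real algorithms.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[int, int]


def build_index(relation: Iterable[Edge]) -> Dict[int, Set[int]]:
    """Index a binary relation as: first attribute -> set of second attributes."""
    index: Dict[int, Set[int]] = defaultdict(set)
    for first, second in relation:
        index[first].add(second)
    return dict(index)


def triangles(
    r: Iterable[Edge], s: Iterable[Edge], t: Iterable[Edge]
) -> List[Tuple[int, int, int]]:
    """Return all (a, b, c) with (a, b) in r, (b, c) in s, and (a, c) in t."""
    r_idx, s_idx, t_idx = build_index(r), build_index(s), build_index(t)
    results = []
    # Bind a: it must appear as a first attribute in both r and t.
    for a in r_idx.keys() & t_idx.keys():
        # Bind b: it must follow a in r and start some tuple in s.
        for b in s_idx.keys() & r_idx[a]:
            # Bind c: it must follow b in s and follow a in t.
            for c in s_idx[b] & t_idx[a]:
                results.append((a, b, c))
    return sorted(results)


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
found = triangles(edges, edges, edges)
```

 A pairwise plan would first materialize every two-edge path and only then check the closing edge; the attribute-at-a-time plan prunes candidates at each binding, which is where the savings on cyclic queries come from.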