
DeepDive has been released, and more updates are coming soon! DeepDive is a generic probabilistic inference engine that uses a declarative language (SQL) to define factor graphs. DeepDive's most popular use case is knowledge base construction (KBC). Recently, some DeepDive-based KBC applications have exceeded human quality in both precision and recall. DeepDive was recently mentioned in Forbes as a top tool for data science.

New Results, Funding News, and Press
 DARPA. Thank you to DARPA Memex for supporting our work. We are really excited to be part of this program!
 NIH. Thank you to the NIH Big Data to Knowledge (BD2K) program for supporting our work on mobility data led by Scott Delp.
 Our paper with Michigan Econ (Shapiro and Levenstein) and Michigan CS people (Antenucci and Cafarella) about using Twitter to predict economic indicators is out. It has been picked up by The Economist's blog, the Wall Street Journal, the Boston Globe, the Washington Post, Patria (Czech), and Slate. A summary of this work has been selected to appear in the NBER digest in the August 14 issue.
 PaleoDB. Our assessment of PaleoDeepDive is here! On some extraction tasks, PaleoDeepDive exceeds human volunteer quality in both precision and recall. Thank you to PaleoDB and Shanan Peters for all their painstaking work! A draft about our approach to building KBC systems is here. It's all about debugging!
 VLDB14. See you in Hangzhou!
 VLDB14. We've just released a description of our sampling/inference engine DimmWitted: A Study of Main-Memory Statistical Analytics. DimmWitted is the successor to the Hogwild! and Elementary engines.
 Panel. The Role of Database Systems in the Era of Big Data.
 IMDM. Victor Bittorf will give an invited talk describing his work on Impala, on porting MADlib to Impala, and some thoughts about main-memory analytics as described here.
 Thank you to Microsoft for giving a nod to Hogwild!; they mentioned that their Adam system (for machine learning) is based on our Hogwild! approach. We love that they got the exclamation point into print, though we doubt anyone is more Hogwild! than we are! Our next versions of both the theory (ICML14) and systems work (VLDB14) are in print. More coming soon!
 SIGMOD/PODS. Our papers are about joins and feature selection.
 SIGMOD14 Materialization Optimizations for Feature Selection shows that using a DSL and some novel optimizations, we can get order-of-magnitude performance wins for feature engineering in R. Thank you to SIGMOD for selecting this as the best paper!
 PODS The Minesweeper paper is our attempt to go beyond worst-case analysis for join algorithms. We (Hung Ngo, Dung Nguyen, Atri Rudra, and I) develop a new algorithm that we call Minesweeper. The main idea is to formalize the amount of work any algorithm spends certifying (using a set of propositional statements) that the output set is complete (and not, say, a proper subset). We call this set of propositions the certificate. We manage to establish a dichotomy theorem for this stronger notion of complexity: if a query's hypergraph is what Ron Fagin calls beta-acyclic, then Minesweeper runs in time linear in the certificate; if a query is beta-cyclic, then on some instance any algorithm takes time that is superlinear in the certificate. The results get sharper and more fun. Also, Dung is a superhero and has implemented a variant of this algorithm.
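To give a flavor of the materialization idea behind the feature-selection paper, here is a small sketch (not the paper's actual system; the data, names, and greedy loop are invented for illustration): for least-squares feature selection, materialize X^T X and X^T y once, and every candidate subset can then be fit from small slices of those statistics instead of a fresh pass over the data.

```python
import numpy as np

# Sketch of materialization for least-squares feature selection:
# G = X^T X and b = X^T y are computed ONCE; any feature subset S is then
# fit from the |S| x |S| slice of G, with no further passes over X.
rng = np.random.default_rng(0)
n, d = 1000, 6
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5, 0.0]) + 0.01 * rng.normal(size=n)

G = X.T @ X            # materialized once: d x d
b = X.T @ y            # materialized once: length d
yTy = float(y @ y)     # materialized once: scalar

def fit_subset(S):
    """Least-squares coefficients for feature subset S, using only G and b."""
    return np.linalg.solve(G[np.ix_(S, S)], b[S])

def rss(S):
    """Residual sum of squares for subset S; at the optimum, RSS = y^T y - b_S^T w."""
    w = fit_subset(S)
    return yTy - float(b[S] @ w)

# Greedy forward selection driven entirely by the materialized statistics.
chosen, remaining = [], set(range(d))
for _ in range(3):
    best = min(remaining, key=lambda j: rss(chosen + [j]))
    chosen.append(best)
    remaining.discard(best)
print(chosen)  # picks the features with the largest true coefficients
```

The point of the sketch is only the reuse pattern: the expensive n x d scans happen once, and the combinatorial search over subsets touches only d x d state.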
 ICML14. Ji Liu, Stephen J. Wright, Victor Bittorf, Srikrishna Sridhar, and I have some new theory about An Asynchronous Parallel Stochastic Coordinate Descent Algorithm in ICML14. This is a Hogwild!-style algorithm but has more rapid convergence rates than Hogwild! for certain types of sparse data.
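The lock-free update pattern behind these Hogwild!-style algorithms can be sketched in a few lines. Everything below (data, learning rate, thread count) is made up for illustration, and pure-Python threads serialize on the GIL, so this shows the algorithm rather than real parallel speedups:

```python
import threading
import numpy as np

# Sketch of the Hogwild! idea: several workers run stochastic gradient
# descent on a SHARED weight vector with no locks. On sparse data,
# concurrent updates rarely touch the same coordinates, so occasional
# overwrites barely hurt convergence.
rng = np.random.default_rng(1)
n, d = 2000, 20
X = rng.normal(size=(n, d)) * (rng.random((n, d)) < 0.2)  # sparse examples
w_true = rng.normal(size=d)
y = X @ w_true                                            # noiseless targets

w = np.zeros(d)   # shared state, updated WITHOUT synchronization
lr = 0.01

def worker(seed, steps=4000):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        i = int(r.integers(n))
        g = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i . w - y_i)^2
        nz = np.nonzero(X[i])[0]
        w[nz] -= lr * g[nz]            # lock-free update of only the nonzeros

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(float(np.mean((X @ w - y) ** 2)))  # training error after the run
```

The key design point is the last line of the worker: each update writes only the coordinates where the example is nonzero, which is why unsynchronized writers mostly stay out of each other's way.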
 Joins! New papers about one of my favorite topics, joins. This is joint work with Hung Q. Ngo and Atri Rudra.
 We have written a short survey for SIGMOD Record about recent advances in worst-case optimal join algorithms. Our goal is to give a high-level view of the worst-case optimality results for practitioners and applied researchers. We also managed to simplify the arguments.
 The Minesweeper paper is the first beyond-worst-case analysis for any join algorithm (PODS14).
 Our Tetris paper describes (what we think is) a beautiful framework for beyond-worst-case and worst-case optimal algorithms for joins via a new connection between geometry and DPLL resolution.
 A full version of our join algorithm with worstcase optimal running time is here (original PODS 2012 paper).
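To give a flavor of the worst-case optimal approach, here is a toy, attribute-at-a-time evaluation of the triangle query Q(a,b,c) :- R(a,b), S(b,c), T(a,c). This is a simplified sketch in the spirit of these algorithms, not the implementations from the papers: rather than joining two relations at a time (which can materialize a quadratic intermediate result), it binds one variable at a time by intersecting candidate sets.

```python
from collections import defaultdict

# Toy attribute-at-a-time join for the triangle query
# Q(a,b,c) :- R(a,b), S(b,c), T(a,c).
def triangles(R, S, T):
    R_by_a = defaultdict(set)   # a -> {b : R(a,b)}
    S_by_b = defaultdict(set)   # b -> {c : S(b,c)}
    T_by_a = defaultdict(set)   # a -> {c : T(a,c)}
    for a, b in R: R_by_a[a].add(b)
    for b, c in S: S_by_b[b].add(c)
    for a, c in T: T_by_a[a].add(c)

    out = []
    for a in set(R_by_a) & set(T_by_a):          # bind a: must appear in R and T
        for b in R_by_a[a] & set(S_by_b):        # bind b: consistent with R and S
            for c in S_by_b[b] & T_by_a[a]:      # bind c: consistent with S and T
                out.append((a, b, c))
    return out

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(triangles(edges, edges, edges))  # -> [(1, 2, 3)]
```

The intersections are what keep the work bounded: no partial binding survives unless every relation that mentions the variable agrees on it.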
 Thank you to Master's in Data Science for naming me a thought leader. It's humbling to be listed with such great people.
 Google We want to thank Google for funding our research project, Trust, but (Probabilistically) Verify: Toward Tail Extraction. Should be a lot of fun!
 Toshiba We want to thank Toshiba for funding our work on knowledge base construction. We are very excited that Toshiba engineers are using DeepDive!
 ONR Thank you for funding my proposal Foundations for Data-driven Systems: Join Algorithms and Random Network Theory. The ONR continues to be one of the best supporters of pure theoretical work.
 DARPA. Thank you to DARPA's XData for supporting my collaborative work on scalable analytics and join processing.

Upcoming Meetings and Talks
 Modern Data Management Summit in Beijing. Aug 28-30.
 VLDB. We'll present DimmWitted. Sep 1-5.
 UCB. Talk about DimmWitted. Sep 10.
 Facebook. Sep 15-16.
 MEMEX. Sep 15-26.
 DC (CCC and PI). Oct 14-16.
 DC (NSF). Oct 16-17.
 IBM Watson. Nov. 7.
 Allen Distinguished Lecture. Nov 14.
 NIPS Workshop. Automatic Knowledge Base Construction. Dec. 12-13.
 Dagstuhl. PlanBig. Dec. 14-19.

Code
 DeepDive is available, and its components have their own pages. Elementary: Gibbs sampling on factor graphs over TBs of data in files, Accumulo, or HBase, now with BUGS support! Tuffy, which uses an RDBMS to process Markov Logic, has been updated.
 Hogwild! SVMs, logistic regression, matrix factorization, and other convex goodness without locking. Specialized versions: Jellyfish for trace-norm regularization and HottTopix for nonnegative matrix factorization.
 Code for more projects is here, in MADlib, and in Cloudera's Impala.
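As a flavor of the Gibbs-sampling inference that engines like Elementary run at scale, here is a toy sampler over a two-variable factor graph. The variables, factors, and weights below are made up for illustration; real engines run the same conditional-resampling step over billions of variables:

```python
import math
import random

# Toy Gibbs sampler on a factor graph with two binary variables x1, x2.
# Each factor is (scope, weight, feature); the model says x1 is likely
# true (weight 1.0) and x1, x2 tend to agree (weight 2.0).
factors = [
    (("x1",), 1.0, lambda a: 1.0 if a["x1"] else 0.0),
    (("x1", "x2"), 2.0, lambda a: 1.0 if a["x1"] == a["x2"] else 0.0),
]

def p_true_given_rest(assign, var):
    """P(var = True | all other variables), using only the factors touching var."""
    def score(val):
        assign[var] = val
        return sum(w * f(assign) for scope, w, f in factors if var in scope)
    s_true, s_false = score(True), score(False)
    return math.exp(s_true) / (math.exp(s_true) + math.exp(s_false))

def estimate_marginal(steps=20000, seed=0):
    """Estimate P(x2 = True) by Gibbs sampling; exact answer here is ~0.676."""
    rng = random.Random(seed)
    assign = {"x1": False, "x2": False}
    hits = 0
    for _ in range(steps):
        for var in ("x1", "x2"):          # one Gibbs sweep
            assign[var] = rng.random() < p_true_given_rest(assign, var)
        hits += assign["x2"]
    return hits / steps

print(round(estimate_marginal(), 2))
```

Each sweep resamples every variable from its conditional given the current values of its neighbors, which is the whole inference loop: everything model-specific lives in the factor table.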

Application Overview Videos (See our YouTube channel, HazyResearch)
 GeoDeepDive With Shanan Peters (UW Geoscience) and Miron Livny (Condor), we are combining Macrostrat with DeepDive to (hopefully!) deliver value for geoscientists. One key challenge is extracting all the measurement information reported in the literature, which is buried in the dark data of text, graphs, and figures. There is a demo video and a new video about achieving quality higher than that of the volunteers who have been at this for the last decade. This is all powered by DeepDive. Thank you to the National Science Foundation and Google for supporting this work.
 IceCube Mark Wellons, Ben Recht, and I have done some work with the IceCube Neutrino Detector. Mark's code now runs in the detector at the South Pole and is used on over 250 million events per day. More details are in this video, this video, this paper from the International Cosmic Ray Conference 2013, or this paper. A more recent writeup was accepted to NIM A and is described here. Thank you to the IceCube Collaboration and the UW Graduate School for supporting our work! IceCube (and Mark) got the cover of Science! Awesome!
 There are also videos about some of the technical portions of these projects: Matrix Factorization, Seismic Data Interpolation, and a nowcasting framework (now called Ringtail).
 A messy, incomplete log of old updates is here.
 Slides for the EDBT/ICDT keynote on Joins and Convex Geometry.