
DeepDive has been released, and more updates are coming soon! DeepDive is a generic probabilistic inference engine that uses a declarative language (SQL) to define factor graphs. DeepDive's most popular use case is knowledge base construction (KBC), and recently some DeepDive-based KBC applications have exceeded human quality in both precision and recall.
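As a rough intuition for what the inference engine does, here is a toy Gibbs sampler over binary variables on a small factor graph. This is an illustrative sketch only; the function name `gibbs_factor_graph` and the conjunctive form of the factors are assumptions made for this example, not DeepDive's actual API or sampler.

```python
import math
import random

def gibbs_factor_graph(n_vars, factors, n_samples=2000, burn_in=500, seed=0):
    """Toy Gibbs sampling over binary variables on a factor graph.
    `factors` is a list of (vars, weight) pairs; each factor adds
    `weight` to the log-potential when all of its variables are 1."""
    rnd = random.Random(seed)
    x = [rnd.randint(0, 1) for _ in range(n_vars)]
    counts = [0] * n_vars

    def log_potential(i, val):
        # log-potential of the full assignment with x[i] set to val
        x[i] = val
        return sum(w for vs, w in factors if all(x[v] for v in vs))

    for t in range(burn_in + n_samples):
        for i in range(n_vars):
            # resample x[i] from its conditional given all other variables
            p1 = math.exp(log_potential(i, 1))
            p0 = math.exp(log_potential(i, 0))
            x[i] = 1 if rnd.random() < p1 / (p0 + p1) else 0
        if t >= burn_in:
            for i in range(n_vars):
                counts[i] += x[i]
    return [c / n_samples for c in counts]  # estimated marginals P(x_i = 1)
```

For a single variable with one weight-2.0 factor, the estimated marginal should land near sigmoid(2.0) ≈ 0.88; DeepDive runs this kind of sampling at vastly larger scale.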

News

MEMEX. DeepDive helps power the MEMEX project in the fight against human trafficking. The project was recently featured on 60 Minutes and in Scientific American, The Wall Street Journal, and Wired. 
Data. We're giving away data! Big, marked-up datasets.

Manuscripts
 Weighted SGD for lp Regression with Randomized Preconditioning by Jiyan, Yin Lam Chow, Michael Mahoney, and me looks at some preconditioning methods to speed up SGD in theory and practice.
 Incremental Knowledge Base Construction Using DeepDive is our latest description of what we're building in DeepDive. The code is making its way through our arcane GitHub branching process now :)
 (Nearly) Global Convergence of SGD for Non-Convex Matrix Problems (preliminary version). Chris De Sa shows that a widely used SGD heuristic converges at a provable rate using a novel argument based on martingales. It requires a slight (but embarrassingly fancy-sounding) twist: the step size must correct for the curvature of the space using the Riemannian metric associated with an appropriately defined quotient manifold. Given that ridiculous word salad, it may be surprising that we're aware of implementations at a few web companies... so, hey, it turns out it works! Phew!
 Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
 A resolution framework for joins with beyond-worst-case guarantees is here. This is described in more detail below.
 SIGMOD 2015. Exploiting Correlations for Expensive Predicate Evaluation is here. Manas Joglekar and Aditya P. show that you can give tight bounds on evaluating predicates, using random sampling to speed up query evaluation with nasty predicates.
 PaleoDB. Our assessment of PaleoDeepDive is here and here. PaleoDeepDive exceeds human volunteer quality in both precision and recall on some extraction tasks. Thank you to PaleoDB and Shanan Peters for all their painstaking work! A description of our approach to building KBC systems is here; it's all about feature engineering! DeepDive was recently mentioned in Forbes and named a top tool for data science. It has received some media coverage (Vice, Fusion, The India Times, El Mundo, Kurzweil's newsletter, CACM, and perhaps here?).
 I'm honored to be selected as a Moore Data-Driven Discovery Investigator. What an exciting list of people! Thank you to Context Relevant, the world's best analytics company for Wall St. and beyond, for putting out such a nice press release (CNBC, Yahoo! Finance). My dad loved it.
 Our paper with Michigan Econ (Shapiro and Levinson) and Michigan CS (Antenucci and Cafarella) about using Twitter to predict economic indicators is out. It has been picked up by The Economist's blog, The Wall Street Journal, The Boston Globe, The Washington Post, Patria (Czech), and Slate. A summary of this work has been selected to appear in the August issue of the NBER Digest. There are interesting follow-ups in Science; excited to see where it goes!
 Recent papers are about analytics, joins, and feature selection. It all ends up in DeepDive...
 High-Performance Analytics. Thank you to Microsoft for giving a nod to Hogwild!; they mentioned that their Adam system (for machine learning) is based on our Hogwild! approach. We love that they got the exclamation point into print, although it's dubious that anyone is more Hogwild! than we are! The successors for both the theory (ICML 2014) and systems work (VLDB 2014) are in print. The recent DimmWitted engine is a deeper systems exploration of the tradeoff space; check it out!
 Feature Engineering. Materialization Optimizations for Feature Selection shows that, using a DSL and some novel optimizations, we can get order-of-magnitude performance wins for feature engineering in R. Thank you to SIGMOD for selecting this as the best paper at SIGMOD 2014! In NIPS 2014, Yingbo Zhou and (many) others have some very nice work on how to do feature selection in parallel using group-testing ideas. Most recently, a draft about the DeepDive approach to feature engineering for KBC systems is here.
 Joins! New papers about one of my favorite topics, joins. This is joint work with Hung Q. Ngo and Atri Rudra.
 We have written a short overview for SIGMOD Record about recent advances in worst-case optimal join algorithms. This is the one to read first. Our goal is to give a high-level view of the worst-case optimality results for practitioners and applied researchers. We also managed to simplify the arguments.
 The Minesweeper paper goes beyond worst-case analysis for join algorithms, which is a much stronger guarantee than traditional worst-case theory: the algorithm must be within a small logarithmic factor, on every instance, of any comparison-based algorithm, a class that includes all standard join algorithms.
 Our Tetris paper describes (what we think is) a beautiful framework for beyond-worst-case and worst-case optimal join algorithms via a new connection between geometry and DPLL-style resolution. It also allows us to consider a wide variety of indexing schemes in a single framework.
 A full version of our join algorithm with worst-case optimal running time is here (original PODS 2012 paper).
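To make the worst-case optimality idea concrete, here is a tiny sketch of the intersection-based style of these algorithms on the triangle query Q(a,b,c) :- R(a,b), S(b,c), T(a,c). This is a simplified illustration (the name `triangle_join` and the hash-set indexes are choices made for the example), not the papers' implementation:

```python
from collections import defaultdict

def triangle_join(R, S, T):
    """Attribute-at-a-time join for the triangle query
    Q(a,b,c) :- R(a,b), S(b,c), T(a,c): bind one attribute at a
    time by intersecting the candidate sets from each relation."""
    # index each relation by its first attribute
    R_a = defaultdict(set)   # a -> {b : R(a,b)}
    for a, b in R:
        R_a[a].add(b)
    S_b = defaultdict(set)   # b -> {c : S(b,c)}
    for b, c in S:
        S_b[b].add(c)
    T_a = defaultdict(set)   # a -> {c : T(a,c)}
    for a, c in T:
        T_a[a].add(c)

    out = []
    for a in R_a.keys() & T_a.keys():      # bind a: must appear in R and T
        for b in R_a[a] & S_b.keys():      # bind b: intersect candidates
            for c in S_b[b] & T_a[a]:      # bind c: intersect candidates
                out.append((a, b, c))
    return out
```

Binding one attribute at a time and intersecting candidate sets is, roughly, what keeps the running time of this family of algorithms within the worst-case output bound for the triangle query, O(N^{3/2}) on relations of size N, whereas any plan built from pairwise joins can spend O(N^2) on a bad instance.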


Upcoming Meetings and Talks
 BeyondMR with EDBT/ICDT. Mar. 27.
 ACM SIGAI Bay Area at Baidu. Apr. 16.
 BIRS. Apr. 20–22.
 SIGMOD/PODS. May 31–June 4.
 CVPR. BigVision. June 12.
 ISCA. Learning and Architecture. June 14.
 ISMP. Sparsity or Something. July 12–17.

Code
 Our stuff is on GitHub.
 DeepDive is available, and its components have their own pages. Elementary does Gibbs sampling on factor graphs with terabytes of data in files, Accumulo, or HBase (now with BUGS support!). Tuffy, which uses an RDBMS to process Markov Logic, has been updated.
 Hogwild! SVMs, logistic regression, matrix factorization, and other convex goodness without locking. There are specialized versions for trace-norm regularization (called Jellyfish) and nonnegative matrix factorization (called HottTopix).
 Code for more projects is here and in MADlib, a product from Oracle, and in Cloudera's Impala.
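The lock-free idea behind Hogwild! is easy to sketch: run SGD from many threads against one shared weight vector and simply skip the locks, accepting occasional stale reads and clobbered writes. Here is a minimal illustration for logistic regression; the function `hogwild_logreg` and its parameters are made up for this sketch and are not the released code:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def hogwild_logreg(X, y, n_threads=4, epochs=5, lr=0.1, seed=0):
    """Lock-free parallel SGD for logistic regression: every thread
    updates the shared weight vector w with no synchronization."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])  # shared state, updated without locks

    def worker(indices):
        for i in indices:
            xi, yi = X[i], y[i]
            # stochastic gradient of the logistic loss on example i
            g = (1.0 / (1.0 + np.exp(-(xi @ w))) - yi) * xi
            np.subtract(w, lr * g, out=w)  # racy in-place update

    idx = np.arange(len(y))
    for _ in range(epochs):
        rng.shuffle(idx)
        with ThreadPoolExecutor(n_threads) as ex:
            # each thread races over its own slice of the data
            list(ex.map(worker, np.array_split(idx, n_threads)))
    return w
```

When the updates are sparse, the races rarely touch the same coordinates, which is why the unsynchronized version can converge at essentially the serial rate.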

Application Overview Videos (See our YouTube channel, HazyResearch)
 GeoDeepDive. With Shanan Peters (UW Geoscience) and Miron Livny (Condor), we are combining Macrostrat with DeepDive to (hopefully!) deliver value for geoscientists. One key challenge is extracting all the measurement information reported in the literature, which is buried in the dark data of text, graphs, and figures. There is a demo video and a new video about quality that is higher than that of the volunteers who have been at this for the last decade. This is all powered by DeepDive. Thank you to the National Science Foundation and Google for supporting this work.
 IceCube. Mark Wellons, Ben Recht, and I have done some work with the IceCube Neutrino Detector. Mark's code now runs in the detector at the South Pole and is used on over 250 million events per day. More details are in this video, this video, this paper at the International Cosmic Ray Conference 2013, or this paper. Thank you to the IceCube Collaboration and the UW Graduate School for their support of our work! A most recent write-up was accepted to NIM A and is described here. IceCube (and Mark) got the cover of Science! Awesome!
 There are also videos about some of the technical portions of these projects: Matrix Factorization, Seismic Data Interpolation, and a nowcasting framework (now called Ringtail).
 A messy, incomplete log of old updates is here.
 Slides for the EDBT/ICDT keynote on Joins and Convex Geometry.