
DeepDive is a new type of system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing data systems. DeepDive's most popular use case is to transform the dark data of web pages, PDFs, and other databases into rich SQL-style databases. In turn, these databases can be used to support both SQL-style and predictive analytics. Recently, some DeepDive-based applications have exceeded the quality of human volunteer annotators in both precision and recall for complex scientific articles. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. The technical core of DeepDive is an engine that combines extraction, integration, and prediction, with probabilistic inference as its core operation. A one-pager with key design highlights is here. PaleoDeepDive is featured in the July 2015 issue of Nature.

News

MEMEX. DeepDive helps power the MEMEX project in the fight against human trafficking. The project was recently featured on 60 Minutes, Forbes, Scientific American, Wall St. Journal, BBC, and Wired. It's supporting investigations.
Data. We're giving away data! Big, marked-up datasets.
Manuscripts. In the end, it all goes into DeepDive...
 Joins and Graph Processing. Frank explains our joins work very nicely.
 EmptyHeaded: Boolean Algebra based Graph Processing by Chris R. Aberger, Andres Notzli, Kunle Olukotun, and me discusses how to use SIMD hardware to support our worst-case optimal join algorithms to find graph patterns. It's fast!
 Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
 Analytics
 Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms by Chris De Sa et al. derives results for low-precision and nonconvex Hogwild!-style (asynchronous) algorithms.
 Weighted SGD for lp Regression with Randomized Preconditioning by Jiyan, Yin Lam Chow, Michael Mahoney, and me looks at some preconditioning methods to speed up SGD in theory and practice.
 VLDB15.
 Incremental Knowledge Base Construction Using DeepDive is our latest description of what we're building in DeepDive.
 A Demonstration of Data Labeling in Knowledge Base Construction. This describes Jaeho's Mindtagger tool, which has really been our secret sauce to build DeepDive applications with high quality.
 ICML15. (Nearly) Global Convergence of SGD for Non-Convex Matrix Problems (preliminary version). Chris De Sa shows that a widely used SGD heuristic converges at a provable rate using a novel argument based on martingales. It requires a slight (but embarrassingly fancy-sounding) twist: the step size must correct for the curvature of the space using the Riemannian metric associated with an appropriately defined quotient manifold. Given that ridiculous word salad, it may be surprising, but we're aware of implementations at web companies... so, hey, it turns out it works... Phew!


Upcoming Meetings and Talks
 ICML. July 6-11.
 Invited Talk @ ISMP. Sparse Optimization and Applications. July 12-17.
 Invited Talk @ StarAI with UAI. July 16.
 Dato's Data Science Summit. July 20.
 VLDB. Aug. 31-Sep. 4.

Code
 Our stuff is on GitHub.
 DeepDive is available. Components have their own pages. Elementary. Gibbs Sampling on Factor Graphs on TBs in files, Accumulo, or HBase! Now with BUGS support! Tuffy is updated, which uses an RDBMS to process Markov Logic.
 Hogwild! SVMs, logistic regression, matrix factorization, and other convex goodness without locking. Specialized versions of trace-norm regularization called Jellyfish and nonnegative matrix factorization called HottTopix.
 Code for more projects is here and in MADlib, a product from Oracle, and in Cloudera's Impala.

Application Overview Videos (See our YouTube channel, HazyResearch)
 GeoDeepDive. With Shanan Peters (UW Geoscience) and Miron Livny (Condor), we are combining Macrostrat with DeepDive to (hopefully!) deliver value for geoscientists. One key challenge is extracting all the measurement information reported in the literature, which is buried in the dark data of text, graphs, and figures. There is a demo video and a new video about achieving quality higher than that of the volunteers who have been at this for the last decade. This is all powered by DeepDive. Thank you to the National Science Foundation and Google for supporting this work.
 IceCube. Mark Wellons, Ben Recht, and I have done some work with the IceCube Neutrino Detector. Mark's code now runs in the detector at the South Pole and is used on over 250 million events per day. More details are in this video, this video, this paper at the International Cosmic Ray Conference 2013, this paper, and a more recent writeup accepted to NIM A and described here. Thank you to the IceCube Collaboration and UW Graduate School for their support of our work! IceCube (and Mark) got the cover of Science! Awesome!
 There are also videos about some of the technical portions of these projects: Matrix Factorization, Seismic Data Interpolation, and a nowcasting framework (now called Ringtail).
 A messy, incomplete log of old updates is here.
Slides for EDBT/ICDT keynote on Joins and Convex Geometry
Our code is on GitHub. Twitter @HazyResearch.