
DeepDive is a new type of system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing data systems. DeepDive's most popular use case is to transform the dark data of web pages, PDFs, and other databases into rich SQL-style databases. In turn, these databases can be used to support both SQL-style and predictive analytics. Recently, some DeepDive-based applications have exceeded the quality of human volunteer annotators in both precision and recall for complex scientific articles. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. The technical core of DeepDive is a single engine that combines extraction, integration, and prediction, with probabilistic inference as its core operation. A one-pager with key design highlights is here. PaleoDeepDive is featured in the July 2015 issue of Nature.

News

MEMEX. DeepDive helps power the MEMEX project in the fight against human trafficking. The project was recently featured on 60 Minutes, Forbes, Scientific American, Wall St. Journal, BBC, and Wired. It's supporting investigations.
Data. We're giving away data! Big, marked-up datasets.
Manuscripts. In the end, it all goes into DeepDive...
 Joins and Graph Processing. Frank explains our work on joins very nicely.
 EmptyHeaded: Boolean Algebra based Graph Processing by Chris Aberger discusses how to use SIMD hardware to support our worst-case optimal join algorithms to find graph patterns. It's fast!
 Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
 It’s all a matter of degree: Using degree information to optimize multiway joins by Manas Joglekar discusses one technique to use degree information to go faster (asymptotically!). New version soon!
 Analytics
 Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms by Chris De Sa et al. derives results for low-precision and nonconvex Hogwild!-style (asynchronous) algorithms.
 Asynchronous stochastic convex optimization by John C. Duchi and Tum Chaturapruek explores the limits of asynchrony for convex optimization. As John puts it, "Nothing Really Matters".
 Weighted SGD for lp Regression with Randomized Preconditioning by Jiyan, Yin Lam Chow, Michael Mahoney, and me looks at some preconditioning methods to speed up SGD in theory and practice.
 VLDB15.
 Incremental Knowledge Base Construction Using DeepDive is our latest description of DeepDive.
 A Demonstration of Data Labeling in Knowledge Base Construction describes Jaeho's Mindtagger tool, which has really been our secret sauce to build DeepDive applications with high quality.


Upcoming Meetings and Talks
 VLDB. Aug. 31-Sep. 4.
 UC Merced Colloquium. Sep 11.
 A16z Academic Roundtable. Sep 21.
 HPTS. Sep 27-30.
 SIMPLEX in NYC. Sep 29-30.
 Moore DDD Event. Oct 7-9.
 ONR. Oct. 27-29.
 GaTech Colloquium. Nov. 20.
 NIPS. Dec 12. Nonconvex Optimization Workshop.
 NIPS. Dec 12. Machine learning systems.
 Chile. Jan 15.

Code
 Our code is on GitHub.
 DeepDive is available. Components have their own pages. Elementary: Gibbs sampling on factor graphs on TBs in files, Accumulo, or HBase! Now with BUGS support! Tuffy, which uses an RDBMS to process Markov Logic, is updated.
 Hogwild!: SVMs, logistic regression, matrix factorization, and other convex goodness without locking. Specialized versions for trace-norm regularization, called Jellyfish, and for nonnegative matrix factorization, called HottTopix.
 Code for more projects is here and in MADlib, a product from Oracle, and in Cloudera's Impala.

Application Overview Videos (See our YouTube channel, HazyResearch)
 GeoDeepDive. With Shanan Peters (UW Geoscience) and Miron Livny (Condor), we are combining Macrostrat with DeepDive to (hopefully!) deliver value for geoscientists. One key challenge is extracting all the measurement information that is reported in the literature and buried in the dark data of text, graphs, and figures. See a demo video and a new video about achieving quality higher than that of the volunteers who have been at this for the last decade. This is all powered by DeepDive. Thank you to the National Science Foundation and Google for supporting this work.
 IceCube. Mark Wellons, Ben Recht, and I have done some work with the IceCube Neutrino Detector. Mark's code now runs in the detector on the South Pole and is used on over 250 million events per day. More details are in this video, this video, this paper at the International Cosmic Ray Conference 2013, this paper, and a most recent writeup accepted to NIM A and described here. Thank you to the IceCube Collaboration and UW Graduate School for their support of our work! IceCube (and Mark) got the cover of Science! Awesome!
 There are also videos about some of the technical portions of these projects: Matrix Factorization, Seismic Data Interpolation, and a nowcasting framework (now called Ringtail).
 A messy, incomplete log of old updates is here.
Slides for EDBT/ICDT keynote on Joins and Convex Geometry
Our code is on GitHub. Twitter @HazyResearch.