
DeepDive is a new type of system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing data systems. DeepDive's most popular use case is to transform the dark data of web pages, PDFs, and other databases into rich SQL-style databases. In turn, these databases can be used to support both SQL-style and predictive analytics. Recently, some DeepDive-based applications have exceeded the quality of human volunteer annotators in both precision and recall on complex scientific articles. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. The technical core of DeepDive is a single engine that combines extraction, integration, and prediction, with probabilistic inference as its core operation.
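To make the "SQL-style database with probabilities" idea concrete, here is a minimal sketch; the `has_spouse` relation, its schema, and its values are hypothetical, not DeepDive's actual output format. Each candidate fact carries a marginal probability ("expectation"), and ordinary SQL plus a threshold turns probabilistic extractions into crisp answers.

```python
import sqlite3

# Hypothetical output of a DeepDive-style extraction run: each candidate
# fact carries the marginal probability ("expectation") that it is true.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE has_spouse (person1 TEXT, person2 TEXT, expectation REAL)"
)
conn.executemany(
    "INSERT INTO has_spouse VALUES (?, ?, ?)",
    [
        ("Barack Obama", "Michelle Obama", 0.97),
        ("Barack Obama", "Joe Biden", 0.08),
    ],
)

# Ordinary SQL over the extracted relation; thresholding on the
# expectation column turns probabilistic output into crisp answers.
rows = conn.execute(
    "SELECT person1, person2 FROM has_spouse WHERE expectation > 0.9"
).fetchall()
print(rows)  # [('Barack Obama', 'Michelle Obama')]
```

The same table also supports predictive-style queries, e.g. expected counts via `SUM(expectation)`.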

News

MEMEX. DeepDive helps power the MEMEX project in the fight against human trafficking. The project was recently featured on 60 Minutes and in Forbes, Scientific American, the Wall Street Journal, the BBC, and Wired. It's supporting investigations. 
Data. We're giving away data! Big, marked-up datasets. Manuscripts. In the end, it all goes into DeepDive...
 Joins and Graph Processing. Who doesn't love joins? Frank does!
 EmptyHeaded: Boolean Algebra Based Graph Processing by Chris R. Aberger, Andres Nötzli, Kunle Olukotun, and me discusses how to use SIMD hardware to support our worst-case optimal join algorithms to find graph patterns. It's fast!
 Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
 Analytics
 Weighted SGD for ℓp Regression with Randomized Preconditioning by Jiyan, Yin Lam Chow, Michael Mahoney, and me looks at preconditioning methods that speed up SGD in theory and practice.
 VLDB15. Incremental Knowledge Base Construction Using DeepDive is our latest description of what we're building in DeepDive. The code is making its way through our arcane GitHub branching process.
 ICML15. (Nearly) Global Convergence of SGD for Non-Convex Matrix Problems, preliminary version. Chris De Sa shows that a widely used SGD heuristic converges at a provable rate using a novel argument based on martingales. It requires a slight (but embarrassingly fancy-sounding) twist: the step size must correct for the curvature of the space using the Riemannian metric associated with an appropriately defined quotient manifold. Given that ridiculous word salad, it may be surprising, but we're aware of implementations at a few web companies... so, hey, it turns out it works. Phew!
 SIGMOD15 and PODS15
 UGRAD. Susan Tu and Adam Perelman are headed to the SIGMOD undergraduate research contest to talk about Dunce Cap, their query optimizer that sits on top of EmptyHeaded. Awesome!
 PODS. Tetris, a geometric-resolution framework for joins with beyond-worst-case guarantees, is here.
 SIGMOD. Exploiting Correlations for Expensive Predicate Evaluation here. Manas Joglekar and Aditya P. show how to use correlations to (approximately) evaluate expensive predicates to speed up query evaluation.
 SIGMOD. Panel Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?
 DanaC Keynote. DeepDive and its applications in science and the fight against human trafficking!
 DanaC15. Caffe con Troll: Shallow Ideas to Speed Up Deep Learning. This is our first exploration of how to schedule deep learning computations on CPUs and GPUs. We show that simple ideas yield a 4.5x speedup for Caffe (the über-popular deep learning framework) on CPUs. In particular, the throughput of the task is proportional to the FLOPS delivered by the hardware. We can also use both CPUs and GPUs together to get more FLOPS and go even faster.
 GRADES. Led by Dung Nguyen and Hung Q. Ngo, and the awesome LogicBlox team, we benchmarked a wide range of join and graph algorithms, including Minesweeper, our beyond-worst-case algorithm. Check out our new paper, Join Processing for Graph Patterns: An Old Dog with New Tricks.
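Several items above (EmptyHeaded, Tetris, Minesweeper) build on worst-case optimal join processing. As a rough illustration of the core idea only, assuming nothing about those actual implementations: bind one attribute at a time and intersect the candidate sets contributed by each relation, instead of joining relations pairwise.

```python
# Sketch of a generic worst-case optimal join for the triangle query
# Q(a,b,c) = R(a,b), S(b,c), T(a,c): bind attributes one at a time by
# intersecting candidate sets. (A toy illustration of NPRR/Leapfrog-style
# algorithms, not the EmptyHeaded code.)

def triangles(R, S, T):
    """R, S, T are sets of (x, y) pairs; returns the (a, b, c) triangles."""
    R_by_a, S_by_b, T_by_a = {}, {}, {}
    for a, b in R:
        R_by_a.setdefault(a, set()).add(b)
    for b, c in S:
        S_by_b.setdefault(b, set()).add(c)
    for a, c in T:
        T_by_a.setdefault(a, set()).add(c)
    out = []
    for a in R_by_a.keys() & T_by_a.keys():   # bind a
        for b in R_by_a[a] & S_by_b.keys():   # bind b via set intersection
            for c in S_by_b[b] & T_by_a[a]:   # bind c via set intersection
                out.append((a, b, c))
    return out

edges = {(0, 1), (1, 2), (0, 2), (2, 3)}
print(triangles(edges, edges, edges))  # [(0, 1, 2)]
```

Intersecting per attribute is what avoids the blow-up of intermediate pairwise-join results on skewed graph data.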
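On the Caffe con Troll observation that throughput tracks the FLOPS of the hardware: one standard way to get there on CPUs is to lower convolution to a dense matrix product, so a layer becomes one GEMM. Below is a minimal single-channel sketch of that lowering idea under simplifying assumptions (no batching, channels, or strides); it is an illustration, not CcT's code.

```python
import numpy as np

# "im2col" lowering: unroll image patches into rows, then convolution is a
# single dense matrix product, whose speed is governed by hardware FLOPS.

def im2col(x, k):
    """Unroll the k x k patches of 2-D array x into the rows of a matrix."""
    h, w = x.shape
    cols = [x[i:i + k, j:j + k].ravel()
            for i in range(h - k + 1) for j in range(w - k + 1)]
    return np.stack(cols)

def conv2d_gemm(x, kernel):
    k = kernel.shape[0]
    h, w = x.shape
    out = im2col(x, k) @ kernel.ravel()   # one dense matrix-vector product
    return out.reshape(h - k + 1, w - k + 1)

x = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
print(conv2d_gemm(x, kernel))
```

The trade-off is extra memory for the unrolled patch matrix in exchange for running at the speed of a tuned BLAS GEMM.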



Upcoming Meetings and Talks
 UCI Data Science. May 29.
 SIGMOD/PODS. May 31–June 4.
 DanaC Keynote. May 31.
 SIoT Retreat. June 3–4.
 Keynote @ BigVision with CVPR. June 12.
 Kavli Frontiers of Science Symposium. June 16–18.
 ICML. July 6–11.
 Invited Talk @ ISMP. Sparse Optimization and Applications. July 12–17.
 Invited Talk @ StarAI with UAI. July 16.
 Dato's Data Science Summit. July 20.

Code
 Our stuff is on GitHub.
 DeepDive is available, and its components have their own pages. Elementary: Gibbs sampling on factor graphs over terabytes of data in files, Accumulo, or HBase (now with BUGS support!). Tuffy, which uses an RDBMS to process Markov Logic, has been updated.
 Hogwild! SVMs, logistic regression, matrix factorization, and other convex goodness without locking. There are also specialized versions: Jellyfish for trace-norm regularization and HottTopix for nonnegative matrix factorization.
 Code for more projects is here, in MADlib (an open-source library for in-database analytics), and in Cloudera's Impala.
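For the curious, the Hogwild! idea can be sketched in a few lines: threads share one weight vector and apply SGD updates with no locks at all. The toy below (dense, noiseless least squares; all names and constants are made up for illustration, and the real implementations are in C++) still converges.

```python
import threading
import numpy as np

# Toy Hogwild!-style sketch: four threads run SGD on a shared weight vector
# with no synchronization. The published analysis is for sparse gradients,
# where concurrent updates rarely touch the same coordinates; this dense toy
# also leans on CPython's GIL, so it only illustrates the flavor of the idea.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true
w = np.zeros(10)  # shared state, updated without locking

def worker(rows, lr=0.01, epochs=20):
    for _ in range(epochs):
        for i in rows:
            err = X[i] @ w - y[i]      # read possibly stale weights
            w[:] -= lr * err * X[i]    # unsynchronized in-place write

threads = [threading.Thread(target=worker, args=(range(t, 1000, 4),))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.allclose(w, w_true, atol=1e-2))
```

Despite the races, each update still moves the shared iterate toward the least-squares solution, which is why the lock-free scheme works in practice.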

Application Overview Videos (See our YouTube channel, HazyResearch)
 GeoDeepDive. With Shanan Peters (UW Geoscience) and Miron Livny (Condor), we are combining Macrostrat with DeepDive to (hopefully!) deliver value for geoscientists. One key challenge is extracting all the measurement information reported in the literature, which is buried in the dark data of text, graphs, and figures. There is a demo video and a new video about reaching quality higher than that of the volunteers who have been at this for the last decade. This is all powered by DeepDive. Thank you to the National Science Foundation and Google for supporting this work.
 IceCube. Mark Wellons, Ben Recht, and I have done some work with the IceCube Neutrino Detector. Mark's code now runs in the detector at the South Pole and is used on over 250 million events per day. More details are in this video, this video, this paper at the International Cosmic Ray Conference 2013, or this paper, and in a recent write-up accepted to NIM A and described here. Thank you to the IceCube Collaboration and the UW Graduate School for their support of our work! IceCube (and Mark) got the cover of Science! Awesome!
 There are also videos about some of the technical portions of these projects: Matrix Factorization, Seismic Data Interpolation, and a nowcasting framework (now called Ringtail).
 A messy, incomplete log of old updates is here.
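As a flavor of the GeoDeepDive measurement-extraction challenge mentioned above, here is a deliberately naive sketch. The regex, the unit list, and the function name are all hypothetical; a real DeepDive application would treat matches only as candidate mentions to be resolved by probabilistic inference, not as final answers.

```python
import re

# Hypothetical first pass of measurement extraction: pull (value, unit)
# candidates out of running text with a regular expression. Everything
# matched here is only a *candidate* mention.
UNIT = r"(?:Ma|ka|km|cm|mm|m)"
PATTERN = re.compile(rf"(\d+(?:\.\d+)?)\s*({UNIT})\b")

def extract_measurements(text):
    return [(float(v), u) for v, u in PATTERN.findall(text)]

sentence = "The formation is 252.2 Ma old and roughly 30 m thick."
print(extract_measurements(sentence))  # [(252.2, 'Ma'), (30.0, 'm')]
```

The hard part, of course, is everything a regex misses: measurements split across tables, figures, and hedged prose, which is exactly where the statistical machinery earns its keep.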
Slides for the EDBT/ICDT keynote on Joins and Convex Geometry.
Our code is on GitHub.