
DeepDive is a new type of data system that has probabilistic inference as its core operation. DeepDive's most popular use case is to create rich SQL-style databases to support analytics from web pages, PDFs, and other databases. Recently, some DeepDive-based applications have exceeded the quality of human volunteer annotators in both precision and recall. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. DeepDive is also used to read scientific research across a broad range of domains and by a handful of companies for "BI on steroids."

News

MEMEX. DeepDive helps power the MEMEX project in the fight against human trafficking. The project was recently featured on 60 Minutes and in Scientific American, the Wall St. Journal, the BBC, and Wired. It's supporting investigations.
Data. We're giving away data! Big, marked-up datasets.
Manuscripts. In the end, it all goes into DeepDive...
 Joins and Graph Processing. Who doesn't love joins? Frank does!
 EmptyHeaded: Boolean Algebra based Graph Processing by Chris R. Aberger, Andres Notzli, Kunle Olukotun, and me discusses how to use SIMD hardware to support our worst-case optimal join algorithms to find graph patterns. It's fast!
 Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
 Analytics
 Caffe con Troll: Shallow Ideas to Speed up Deep Learning. This is our first exploration of how to schedule deep learning computations on CPUs and GPUs. We show that simple ideas can result in a 56x speedup for Caffe (the uber-popular deep learning framework) on CPUs, which means we can use both CPUs and GPUs to go even faster.
 Weighted SGD for lp Regression with Randomized Preconditioning by Jiyan, Yin Lam Chow, Michael Mahoney, and me looks at some preconditioning methods to speed up SGD in theory and practice.
 Incremental Knowledge Base Construction Using DeepDive is our latest description of what we're building in DeepDive. The code is making its way through our arcane GitHub branching process.
 (Nearly) Global Convergence of SGD for Non-Convex Matrix Problems, preliminary version. Chris De Sa shows that a widely used SGD heuristic converges at a provable rate using a novel argument based on martingales. It requires a slight (but embarrassingly fancy-sounding) twist: the step size must correct for the curvature of the space using the Riemannian metric associated with an appropriately defined quotient manifold. Given that ridiculous word salad, it may be surprising, but we're aware of implementations at a few web companies... so, hey, it turns out it works... Phew!
 Joins and Graph Processing. Who doesn't love joins? Frank does!
 SIGMOD15 and PODS15
 Susan Tu and Adam Perelman are headed to the SIGMOD undergrad research contest to talk about their research on a query optimizer, Dunce Cap, that sits on top of EmptyHeaded. Awesome!
 Resolution framework for joins with beyond-worst-case guarantees here. This is described in more detail below.
 Exploiting Correlations for Expensive Predicate Evaluation here. Manas Joglekar and Aditya P. show that you can give tight bounds on evaluating predicates using random sampling to speed up query evaluation with nasty predicates.
 Panel questions on database research and machine learning titled Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?
 Led by Dung Nguyen and Hung Q. Ngo, and the awesome LogicBlox team, we benchmarked a wide range of join and graph algorithms including Minesweeper, our beyond-worst-case algorithm. Check out our new paper, Join Processing for Graph Patterns: An Old Dog with New Tricks. GRADES15


Upcoming Meetings and Talks
 NorCal DB. April 24.
 CS50th and SDSI Kickoff. April 28-29.
 PPL Retreat. May 7-9.
 Huawei. May 19-20.
 SIGMOD/PODS. May 31-June 4.
 DanaC Keynote. May 31.
 SIoT Retreat. June 3-4.
 Keynote @ BigVision with CVPR. June 12.
 Invited Talk @ ISCA. Learning and Architecture. June 14.
 Invited Talk @ ISMP. Sparsity or Something. July 12-17.

Code
 Our stuff is on GitHub.
 DeepDive is available. Components have their own pages. Elementary. Gibbs Sampling on Factor Graphs on TBs in files, Accumulo, or HBase! Now with BUGS support! Tuffy is updated, which uses an RDBMS to process Markov Logic.
 Hogwild! SVMs, logistic regression, matrix factorization, and other convex goodness without locking. Specialized versions for trace-norm regularization called Jellyfish and non-negative matrix factorization called HottTopix.
 Code for more projects is here and in MADlib, a product from Oracle, and in Cloudera's Impala.

Application Overview Videos (See our YouTube channel, HazyResearch)
 GeoDeepDive. With Shanan Peters (UW Geoscience) and Miron Livny (Condor), we are combining Macrostrat with DeepDive to (hopefully!) deliver value for geoscientists. One key challenge is extracting all the measurement information that is reported in the literature but buried in the dark data of text, graphs, and figures. See a demo video and a new video about quality that is higher than that of the volunteers who have been at this for the last decade. This is all powered by DeepDive. Thank you to the National Science Foundation and Google for supporting this work.
 IceCube. Mark Wellons, Ben Recht, and I have done some work with the IceCube Neutrino Detector. Mark's code now runs in the detector at the South Pole and is used on over 250 million events per day. More details are in this video, this video, this paper at the International Cosmic Ray Conference 2013, this paper, and a more recent writeup accepted to NIM A and described here. Thank you to the IceCube Collaboration and UW Graduate School for their support of our work! IceCube (and Mark) got the cover of Science! Awesome!
 There are also videos about some of the technical portions of these projects: Matrix Factorization, Seismic Data Interpolation, and a nowcasting framework (now called Ringtail).
 A messy, incomplete log of old updates is here.
Slides for EDBT/ICDT keynote on Joins and Convex Geometry