In ICML18, Hyperbolic embddings embed structured knowledge in continuous space, a blog post about our wo
In ACL18, train your classfiers with natural language! Braden, Percy, and Stephanie show you how in ACL18.
In SODA18, we characterize the largest class of matrices for which matrix-vector multiply is in (nearly) linear time. Anna is exploring whether these matrices can be useful in deep learning ICLR18. Thank you to the NSF for supporting this work!
Snorkel system paper in VLDB18 with its own blog. Software 2.0 madness is coming...
In SIGMOD18, Founduer constructs knowledge bases from richly formatted data using visual and textual reasoning.
For PCA and other mildly nonconvex matrix problems, in AIStas18, we show that a simple stochastic algorithm gets the optimal accelerated rate--and that the standard Polyak momentum scheme can't give acceleration in the stochastic case.
News
Latest crop of professors have started! Chris (Cornell), Ioannis (MILA), and Theo (Wisconsin). We miss them already!
Our goal for the last few years has been to dramatically reduce the time analyst spend specifying models, maintaining them, and collaboratively building models. Our new effort for lightweight dark data extraction is Snorkel which is built on the idea of weak supervision and data programming, see our blog or video. These systems do not use traditional hand-labeled training data, which removes a fundamental obstacle in using machine learning tools.
New Tradeoffs for Machine Learning Systems. The next generation of data systems need to make fundamentally new tradeoffs. For example, we proved that many statistical algorithms can be run in parallel without locks (Hogwild! or SCD) or with lower precision. This leads to a fascinating systems tradeoff between statistical and hardware efficiency. These ideas have been picked up by web and enterprise companies for everything from recommendation to deep learning. There are limits to the robustness of these algorithms, see our ICML 2016 best paper.
New Database Engines. We're thinking about how these new workloads change how one would build a database. We're building a new database, EmptyHeaded, that extends our theoretical work on optimal join processing. Multiway join algorithms are asymptotically and empirically faster than traditional database engines—by orders of magnitude. We're using it to unify database querying, graph patterns, linear algebra and inference, RDF processing, and more soon.
Our course material from CS145 intro databases is available (send a note), and we'll continue to update it. We're aware of a handful of courses that are using these materials. Drop us a note, if you do!
Recent/Upcoming keynotes and talks: EDBT17, UAI17, ABS East 2017, Cornell, Alibaba, CMU (SiValley), SystemX, KBCOM (WSDM), ITBB18.
Recurrence Width for Structured Dense Matrix Vector Multiplication with Albert Gu, Rohan Puttagunta, and Atri Rudra studies the problem of structured matrix-vector multiply, e.g., fourier transforms, orthogonal polynomials, low displacement rank. This work unifies and we think simplifies many known algorithms and extends to a host of new cases. We are hopeful that we can develop one technique to unify all known cases of structured-matrix vector multiply.
Upcoming talks, keynotes, and meetings: March: MEMEX, SIMPLEX, EDBT, AI and the future of business, AAAI Spring, April: Computer Forum, May: SIGMOD, SystemX, SIOPT 2017 June: MongoDB World, STOC Theory Fest, NAS Kavli, August: UAI, DIMACS large-scale learning.
Nature Communications. Kun's paper about automated cancer prognosis is out! He shows that automated approaches can out perform human pathologists at lung cancer prognosis. Update: Kun wins the data parasite award for this work!
Database Theory
SIGMOD 2017. In SlimFast, Theo et al describe a formal statistical model for data quality that has quality guarantees and it runs efficiently on real datasets! blog post.
ICDT 2017. Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
PODS 2017. Benny Kimelfeld has a great framework for feature engineering based on the relational model.
Machine Learning and Optimization
NIPS 2016. Data Programming: Creating Large Training Sets, Quickly by Alex Ratner, Chris De Sa, Sen Wu, and Daniel Selsam. The idea is to have the user create a generative model as a biproduct of their actions. We use this model to create a large (but noisy) training set to train a discriminative model. The hope is it makes it easier to program and it is now in use inside Snorkel. User studies reported here when it was called DDlite. NIPS video
NIPS 2016. Chris De Sa, Bryan He, and Ioannis continue to discover fundamental properties of Gibbs Sampling including scan order. This is used in our generative models in Snorkel and factor graph inference. video.
NIPS 2016. Peng, Jiyan, Fred, and MICHAEL MAHONEY did some exciting work about subsampled newton methods that work on more poorly conditioned problems than previous approaches, while retaining the speed benefits of subsampled newton approaches.
Allerton 2016. In a short note Ioannis, Ce, and Stefan show that asynchrony for SGD can be viewed as adding a momentum term. The analysis does not depend on whether the function is convex, which means it applies to Deep learning. This makes me feel a little better about people running Hogwild! Deep Learning systems. But it does mean you need to tune your momentum. It is a key ingredient in Omnivore, our deep learning optimizer, which is described here. A tounge-in-cheek blog post about it here.
ICML 2016. Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling Chris De Sa observes that asynchrony can introduce bias in Gibbs sampling and some sufficient conditions for the bias to vanish. Also, some chains that take exponential time to mix without asynchrony can mix in polynomial time with asynchrony, and vice versa. Thank you to the committee for selecting this as best paper.
PODS16. Rohan and Manas extend our new join algorithms to message passing and so fast matrix multiplication. The point is: Standard worst-case algorithms are enough to get the best asympotic runtimes for these problems. A step toward the vision of unifying relational and linear algebra systems using GHDs
HILDA. We describe our group's work on data programming and DDlite. We're really excited about the ability to quickly create high quality extractors!
VLDB15. Honored to receive the VLDB Early Career Award for scalable analytics. talk video.
ICML16
ICML. Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling Chris De Sa observes that asynchrony can introduce bias in Gibbs sampling and some sufficient conditions for the bias to vanish. Also, some chains that take exponential time to mix without asynchrony can mix in polynomial time with asynchrony, and vice versa. Thank you to the committee for selecting this as best paper.
OPTML. Parallel SGD: When does Averaging Help? with Jian Zhang, Christopher De Sa, Ioannis Mitiliagkas.
New. Increasing the parallelism in multi-round MapReduce join plans. Semih Salihoglu, Manas Joglekar, and crew show that you can recover classical results about parallelizing acyclic queries using only Yannakakis's algorithm and our recent algorithms for generalized fractional hypertree decompositions for joins.
DeepDive is available. Components have their own pages. Elementary. Gibbs Sampling on Factor Graphs on TBs in files, Accumulo, or HBase! Now with BUGS support! Tuffy is updated, which uses an RDBMS to process Markov Logic.
Hogwild! SVMs, logistic regression, matrix factorization, and other convex goodness without locking. Specialized versions of trace-norm regularization called Jellyfish and non-negative matrix factorization called HottTopix.
Application Overview Videos (See our YouTube channel, HazyResearch)
GeoDeepDive
With Shanan
Peters (UW Geoscience) and Miron Livny (Condor), we are combining
Macrostrat with DeepDive to (hopefully!) deliver value for
Geoscientists. One key challenge is extracting all the measurement
information that is reported in the literature, that is buried in the
dark data of text, graphs, and figures. A demo
video and a new video about quality that is
higher than the volunteers who have been at this for the last
decade. This is all powered by DeepDive. Thank you to the National
Science Foundation and Google for supporting this work.
IceCube Mark Wellons, Ben Recht,
and I have done some work with the IceCube Neutrino Detector. Mark's
code now runs in the detector on the South Pole and is used on over
250 Million events per day. More details are in this video, this video, this
paper at the The International Cosmic Ray Conference 2013, or this paper. Thank
you to the IceCube Collaboration and UW Graduate School for their
support of our work! and a most recent writeup accepted to NIM A and described here. IceCube (and Mark) got the cover of Science! Awesome!
Invited Talk @ ISMP. Sparse Optimization and Applications. July 12-17.
Invited Talk @ StarAI with UAI. July 16.
Dato's Data Science Summit. July 20.
SDSI. July 21.
CCC Roundtable. July 24
UCI Data Science. May 29.
SIGMOD/PODS. May 31-June 4.
DanaC Keynote. May 31.
SIoT Retreat. June 3-4.
Keynote @ BigVision with CVPR. June 12.
Kavli Frontiers of Science Symposium. June 16-18.
ICML15. (Nearly) Global Convergence of SGD for Non-Convex Matrix Problems preliminary version. Chris De Sa shows that a widely used SGD heuristic converges at a provable rate using a novel argument based on martingales. It requires a slight—but embarrassingly fancy sounding—twist (the stepsize must correct for the curvature of the space using the Reimannian metric associated with an appropriately defined quotient manifold) Given that ridiculous word salad, it may be surprising but we're aware of implementations at web companies... so, hey, it turns out it works... Phew!
SIGMOD15 and PODS15
UGRAD. Susan Tu and Adam Perelman are headed to the SIGMOD undergrad research contest to talk about their research on their query optimizer, Dunce Cap, that sits on top of EmptyHeaded. CoWinners of SIGMOD Undergraduate Research Competition. Awesome!
PODS. Tetris, a geometric-resolution framework for joins with beyond-worst-case guarantees here.
SIGMOD. Exploiting Correlations for Expensive Predicate Evaluation here. Manas Jogeklar and Aditya P. show how to use correlations to (approximately) evaluate expensive predicates to speed up query evaluation.
DANAC Keynote. DeepDive and it's applications in science and the fight against human trafficking!
DANAC15.Caffe con Troll: Shallow Ideas to Speed up Deep Learning. This is our first exploration of how to schedule deep learning computations on CPUs and GPUs. We show that simple ideas yield a 4.5x speedup for Caffe (the uber popular deep learning framework) on CPUs. In particular, the throughput of the task is proportional to the FLOPS delivered by the hardware. We can also use both CPUs and GPUs to get higher FLOPS and go even faster.
GRADES. Led by Dung Nguyen and Hung Q. Ngo, and the awesome LogicBlox team, we benchmarked a wide range of join and graph algorithms including Minesweeper, our beyond worst-case algorithm. Check out our new paper, Join Processing for Graph Patterns: An Old Dog with New Tricks.
Adobe Data Science. May 28.
Huawei. May 19-20.
CS50th and SDSI Kickoff. April 28-29.
PPL Retreat. May 7-9.
Adobe ML Talk. May 14.
NorCal DB. April 24.
MEMEX. April 13-17.
Invited Talk @ ACM SIGAI Bay Area at Baidu. April 16.
Recent papers are about analytics, joins and feature selection. It all ends up in DeepDive...
High-Performance Analytics.Thank you to Microsoft for giving a nod to Hogwild!; they mentioned that their Adam system (for machine learning) is based on our Hogwild! approach. We love that they got the exclamation point into print, although it's dubious that anyone is more Hogwild! than we are! The succesors for both the theory (ICML14) and systems work (VLDB14) are in print. The recent DimmWitted engine is a deeper systems exploration of the tradeoff space, check it out!
Feature Engineering.Materialization Optimizations for Feature Selection shows that using a DSL and some novel optimizations, we can get order of magnitude performance wins for feature engineering in R. Thank you to SIGMOD for selecting this as the best paper at SIGMOD14!
In NIPS14, Yingbo Zhou and (many) others have some very nice work about how to do feature selection in parallel using some group testing ideas. Most recently, a draft about the DeepDive approach to feature engineering for KBC systems is here.
Joins! New papers about one of my favorite topics, Joins. This is joint work with Hung Q Ngo and Atri Rudra.
We have written a short overview for SIGMOD record about recent advances in worst-case optimal join algorithms. This is the one to read first. Our goal is to give a high-level view of the worst-case optimality results for practitioners and applied researchers. We also managed to simplify the arguments.
The Minesweeper paper goes beyond worst-case analysis for join algorithms, which is a much stronger guarantee than traditional CS theory: an algorithm must be within a small logarithmic facton every instance compared to any comparison-based algorithms, which includes all standard join algorithms. Our empirical results with the LogicBlox guys are in GRADES15.
Our Tetris paper describes (what we think is) a beautiful framework for beyond worst case and worst-case optimal algorithms for joins via a new connection between geometry and DPLL resolution. It also allows us to consider a wide variety of indexing schemes in a single framework.
PaleoDB. Our assessment of PaleoDeepDive is
here and
here. PaleoDeepDive exceeds human volunteer quality in both precision and recall on some extraction tasks. Thank you to PaleoDB and Shanan Peters for all their painstaking work! A description of our approach to building KBC systems is here. It's all about feature engineering! DeepDive was recently mentioned in Forbes and as a top tool for data science. It has received some media coverage (Vice, Fusion, The India Times, El Mundo, Kurzweil's newsletter, CACM, and perhaps here?)
Seattle and Allen AI2 Distinguished Lecture. Nov 13-14.
Panel on Data Science Education. Dec 4.
NIPS and NIPS Workshop. Automatic Knowledge Base Construction. Dec. 12-13.
Deep-Time Data with AGU. Dec. 14. (Matteo will be there!)
Dagstuhl. PlanBig. Dec. 14-19.
The Minesweeper paper is our attempt to go beyond worst-case analysis for join algorithms, which is a much stronger pointwise or per-instance guarantee than is typical. We (Hung Ngo, Dung Ngyuen, Atri Rudra, and I) develop a new algorithm that we call Minesweeper. The main idea is to formalize the amount of work any algorithm spends certifying (using a set of propositional statements) that the output set is complete (and not, say, a proper subset). We call this set of propositions the certificate. We manage to establish a dichotomy theorem for this stronger notion of complexity: if a query's hypergraph is what Ron Fagin calls beta-acyclic, then Minesweeper runs in time linear in the certificate; if a query is beta-cyclic than on some instance any algorithm takes time that is super linear in the certificate. The results get sharper and more fun. Also, Dung is a superhero and has implemented a variant of this algorithm.
Facebook. Sep 15-16.
MEMEX. Sep 15-26.
DC (CCC and PI). Oct 14-16.
DC (NSF). Oct 16-17.
NYC and IBM Watson. Nov. 6-7.
Master's in Data Science named me a thought leader. It's humbling to be listed with such great people.
Google We want to thank Google for funding our research project, Trust, but (Probabilistically) Verify: Toward Tail Extraction. Should be a lot of fun!
Toshiba We want to thank Toshiba for funding our work on knowledge base construction. We are very excited that Toshiba engineers are using DeepDive!
ONR Thank you for funding my proposal Foundations for Data-driven Systems; Join Algorithms and Random Network Theory . The ONR continues to be one of the best supporters of pure theoretical work.
DARPA. Thank you to DARPA's XData for supporting my collaborative work on scalable analytics and join processing.
DARPA. Thank you to DARPA Memex for supporting our work. We are really excited to be part of this program!
NIH. Thank you to the NIH Big Data to Knowledge (BD2K) program for supporting our work on mobility data led by Scott Delp.
Modern Data Management Summit in Beijing. Aug 28-30
VLDB. We'll present DimmWitted. Sep 1-5.
UCB. Talk about DimmWitted. Sep 10.
VLDB14. See you in Hangzhou!
ICML14. Ji Liu and Stephen J. Wright, Victor Bittorf, Srikrishna Sridhar and I have some new theory about An Asynchronous Parallel Stochastic Coordinate Descent Algorithm in ICML14. This is a Hogwild!-style algorithm but has more rapid convergence rates than Hogwild! for different types of sparse data.
Panel. The Role of Database Systems in the Era of Big Data.
IMDM. Victor Bittorf will give an invited talk describing his work on Impala, on porting MADlib to Impala, and some thoughts about main-memory analytics as described here.
Baidu Lunch. Aug 22.
Stanford IoT Workshop. Aug 11.
Berkeley Workshop on Awesome. Aug. 15?
MSR. Faculty Summit. July 14-20.
Moore. Talk. July 28-29.
SAP. June 16.
SIGMOD/PODS. Ce is talking about feature selection, and I'll talk about beyond worst-case analysis for joins. June 22-27.
XDATA. July 7-9.
EPFL. Research Day. A broad audience talk about DeepDive. June 12.
ICDT. I'm giving a keynote about
one of my favorite topics, joins! A brief description of my talk is here. I will also present a paper that I
very much enjoyed writing about random graphs. Mar 24-28.
ICDT. I will present a paper in beautiful Athens called The Theory of Zeta Graphs with an Application to Random Networks. The main idea is to understand the basic theory of a declarative, Erdos-Renyi-like model for preferential attachment graphs. It makes some connections to the combinatorics of multiple-valued zeta functions. A version is here.
DIMACS. A meeting about Big Data. Street art and features! Mar. 17.
SGD Workshop at IPAM in Feb. 24-28.
IBM. I'm giving a distinguished lecture at the IBM accelerated discovery lab. Mar. 6.
Talk at Oracle. All about Joins! Feb. 18.
Talk at STRATA Silicon Valley. I'm going to talk about some of our thoughts about how to make data
analytics easier to deploy. Street art will again feature prominently;
this talk builds on Brainwash
and a spate of recent demos including a nowcasting
framework, Ringtail in VLDB
and WebDB;
and an enterprise feature-selection system with Oracle. It will also describe DeepDive.
IceCube. IceCube has a press release here and here about Mark Wellons's new manuscript, which is accepted to NIM A. Mark's algorithm obtains a 66 percent improvement for detecting when more
than one particle is active in the detector. Thank you Francis Halzen,
Gary Hill, and all of IceCube for supporting this work! See all the
amazing stuff IceCube is doing here. IceCube gets the cover
of Science! Way to go IceCube (and Mark)!
Talks at Stanford or near
Stanford in January. Vision group meeting about knowledge base
construction (KBC) on 1/13, ISL Big Data about
joins on 1/16, at the Infolunch about why our group is working on this
set of problems on 1/17, at SRI about KBC on 1/17, and about
hardware-aware analytics at the PPL retreat on 1/24.
NIPS13. Krishna Sridhar, Ji Liu, Ce Zhang, Victor Bittorf, Stephen J. Wright and I have a paper about an approximate LP solver geared to LP relaxations. It's an order of magnitude faster than CPLEX for many combinatorial problems.
SIGMOD. The most recent version of DeepDive's engine is described in a recent SIGMOD paper.
NSF. Thank you to the NSF for supporting our work on GeoDeepDive
with Shanan Peters and Miron Livny. The hope is to build a data
platform that can be used to advance Geoscience. It's going to be exciting!
NSF. Thank you to the NSF for supporting our work New
Frontiers in Join Algorithms: Optimality, Noise, and Richer
Languages. This is joint with Atri Rudra and Hung Ngo, and lets us continue to go deeper into the theory of joins. How great is that!
Alfred P. Sloan Research Fellowship. Thank you for your generous support of our research!
VLDB. We have two demonstrations of feature selection: one on unstructured text called Ringtail (Formely Automan in this video) and Columbus (Video soon!)
I spent the last four years in beautiful Wisconsin. Bucky, you gave me a great home with great students, colleagues, and cheese curds. Thank you!
VLDB 2013. We have two demos accepted to VLDB. Feature
Selection in Enterprise Analytics: A Demonstration using an R-based
Data Analytics System with Oracle people and Ringtail:
Nowcasting Made Simple with Michigan people. Both of these are
early incarnations of Brainwash,
our CIRD13 vision about how feature selection is a data management
problem. Thank you, Jordan Novet, for using a
picture of my rant from the Graphlab workshop in GigaOm.
MPC. Ben Recht and my paper about Matrix Factorization (Jellyfish) is
accepted to Mathematical Programming Computation, the latest version
of the paper is here
which supersedes the 2011 version.
BNCOD13 and DEOS. I'm giving an invited tutorial at BNCOD and a keynote talk at DEOS. Dan
Olteanu put together a very interesting BNCOD program and
Wolfgang Gatterbauer put together a very fun DEOS program.
GraphLab. Carlos has a company, and they are awesome. I'll
be talking at their workshop.
SIGMOD13: Towards High-Throughput Gibbs Sampling at Scale: A
Study across Storage Managers. The paper is here.
This is the latest version of our inference engine that runs
GeoDeepDive and DeepDive. Code is here. A new video
and code release is coming soon!
SIGMOD13 (demo). GeoDeepDive: Statistical Inference using
Familiar Data-Processing Languages has been accepted as a
demonstration! Roughly, it will be a live version of how to build the
system in this video and this paper.
WebDB. Dolan Antenucci, Michael Cafarella, Margaret
C. Levenstein, Matthew Shapiro and I have a paper Ringtail:
Nowcasting Made Easy in WebDB 2013 with SIGMOD. (Formerly called Automan with an awesome movie).
Oracle. Thank you for continuing support to support the Hazy
Research group! This gift will be used to continue our work on feature
engineering for structured analytics.
AirForce. Thank you to the Air Force for supporting Mathematical Foundations of Secure Computing Clouds; this is the hard work of Jordan Ellenberg (Math), Ben Recht (CS), Tom Ristenpart (CS), Rob Nowak (EE), and Steve Wright (CS).
StarAI. Sriraam Natarajan, José Picado, Tushar Khot, Kristian
Kersting, Jude Shavlik, and my paper Using Commonsense Knowledge to
Automatically Create (Noisy) Training Examples from Text has been
accepted to StarAI with AAAI 2013.
SLG2013 at ICML. There is a great
program at the workshop on structured learning this year. The
program looks very interesting, and I'll be talking about some of Ce and Vidhya's work on GeoDeepDive.
I am giving an invited talk at Microsoft's Machine Learning summit (Paris in April!) and a keynote at ECML-PKDD 2013!
CIDR13. Brainwash is a vision paper that argues that feature engineering is a database problem.
NIPS12 (spotlight). An updated version of HottTopix, Factoring
Nonnegative Matrices with Linear Programs, is now on Arxiv with a
cleaner proof, but Nicolas Gillis did a much better
analysis of the case with duplicates and noise!
Automan Automan is a system whose goal is to predict
real-life indicators using social media. Working with economists and
Mike Cafarella (CS) at Michigan, we were able to show that Twitter
data can predict the number of weekly unemployment insurance claims
better than traditional methods. Another goal is to understand to what
extent social media provides a truly novel signal for predictive
tasks. In this task, it looks it may! The video is here.
NIPS 2012. Provable Non-negative Matrix
Factorization. Recently, in a beautiful paper, Arora, Ge, Kannan,
and Moitra described a class of matrices and an error model for which
non-negative matrix factorization returns provably the right
answer. In this manuscript,
Victor Bittorf, Ben Recht, Joel Tropp, and I
describe an alternate approach to establish this result that has three
characteristics: (1) better noise tolerance, (2) more general noise
model, and most importantly, (3) admits efficient, scalable
algorithms. We demonstrate that our algorithm that achieves these
properties is able to factor multigigabyte matrices in minutes on a
commodity workstation. Manuscript
and an
implementation in the Hogwild! framework and the paper version. A
revised full version, CRC, and a YouTube video for our spotlight will
be out soon. Also, Nicolas Gillis has a nice paper that follows up and improves our analysis. Fresh on the ArXiv!
ICDM 2012 Feng Niu, Ce Zhang, Jude Shavlik and I have a new paper about
how to scale Markov Logic inference to the web (how we built systems
like DeepDive).
Our idea is to combine the best-of-breed tools for each subtask; the
challenge is how to combine these tools at scale, and for that we use
Lagrange relaxation or dual decomposition, an old idea from math
programming. One cute part of our approach is that we can encode much
of the Lagrange relaxation algorithm using standard SQL queries.
XLDI. I'll be at XLDI where I plan
to beg smart PL people for help about a problem that my research group
struggles with on a daily basis, Feature Engineering. My
contention is that the biggest bottleneck in building trained systems
may not be more sophisticated machine learning, but may be allowing
one to get signal more rapidly in to the system. The CS angle is to
accelerate the cycle that the engineer goes through of exploring data
sources, extracting relevant information, and then evaluating their
changes; this cycle, we contend, has simple machine learning at its
core, but the challenges we see are mostly data engineering and
processing challenges.
PhD!Feng
Niu graduated! (co-advised by AnHai Doan). The entire Hazy
group is indebted to Feng for his intelligence and
creativity. Together with Ce Zhang, Feng was the driving force behind
DeepDive, GeoDeepDive, and many of our other projects. Google is lucky
to have him!
VLDB 2012. Accepted for publication in VLDB 2012 in Istanbul.
Arun Kumar and I have a
paper, Probabilistic Management of OCR Data using an RDBMS. The paper is our
first attempt to understand the implications of combining rich content
models like OCR with the sophisticated querying capabilities of an
RDBMS. The state-of-the-art models underlying these content models are
probabilistic, e.g., OCRopus for Google Books. These models have very
high quality, but the sheer size of these models can destroy query
processing performance. Arun's idea is to use some ideas from
statistics to compress the model; in turn, this allows him to trade
run time for quality. Thank you to the Microsoft Jim Gray Systems Lab
for supporting Arun's work. (Code and
data here)
The
MADlib Analytics Library or MAD Skills, the SQL. Joseph
M. Hellerstein, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin,
Aleks Gorajek, Kee Siong Ng, Caleb Welton, Xixuan
Feng, Kun Li, Arun Kumar, and I have a
paper about the status of the status of MADlib, which is an awesome
open-source effort led by Greenplum. Thank you, EMC!
ACL 2012 Feng
Niu, Ce Zhang, Jude Shavlik, and I have a
paper that builds on our experiences with DeepDive to study the impact
of Big Data with Distant Supervision versus Crowd-Sourced data on the
quality of relationship extraction from the Web. The paper title
is Big Data versus the Crowd: Looking for Relationships in All the Right Places.
COLT 2012 Ben Recht and I have a
new manuscript. The
starting point of the manuscript is that practioners who run
sequential random sampling algorithms (like stochastic gradient)
sample from their data without replacement. In many cases, people have
observed that sampling without replacement empircally converges to an
optimal value faster than with-replacement sampling approaches. There
is, however, a gap in our current theory as current theory provides
faster convergence rates for with-replacement sampling than
without replacement sampling. This paper takes a step toward closing
this gap. We establish that without replacement methods do converge
faster in several sampling models and propose a general inequality, a
symmetrized (noncommutative) arithmetic-geometric-mean inequality,
that would close this gap in many cases.
The program of the NRS symposium and slides from some presenters at the symposium are here.
PODS 2012 Hung Ngo, Ely Porat, Atri Rudra, and I have a new
PODS paper called "Worst-Case Optimal Join Algorithms." For all
join queries, we are able to provide the first optimal algorithm in
terms of data complexity. The key to the result is an algorithmic
proof for a pair of inequalities from discrete geometry called the
Loomis-Whitney inequality from the 1940s and the more recent
Bollobás-Thomason inequality. We show that these classical
inequalties are equivalent to the beautiful
recent results
by Atserias, Grohe, and Marx. Our work also provides an algorithmic
proof of these results. In contrast, all previous proofs are
non-constructive. Thank you to the program committee for selecting this
paper for the Best Paper Award.
SIGMOD 2012 Aaron
Feng, Arun Kumar, Ben Recht, and I have
a paper Towards a Unified Architecture for In-RDBMS Analytics accepted
to SIGMOD 2012. The code is available here Bismark (so is the data) and the paper (full version). Thank you to Greenplum and Oracle for valuable discussions that helped us understand this problem!
ACL 2012 Feng
Niu, Ce Zhang, Jude Shavlik, and I have a
paper that builds on our experiences with DeepDive to study the impact
of Big Data with Distant Supervision versus Crowd-Sourced data on the
quality of relationship extraction from the Web. The paper title
is Big Data versus the Crowd: Looking for Relationships in All the Right Places.
COLT 2012 Ben Recht and I have a
new manuscript. The
starting point of the manuscript is that practioners who run
sequential random sampling algorithms (like stochastic gradient)
sample from their data without replacement. In many cases, people have
observed that sampling without replacement empircally converges to an
optimal value faster than with-replacement sampling approaches. There
is, however, a gap in our current theory as current theory provides
faster convergence rates for with-replacement sampling than
without replacement sampling. This paper takes a step toward closing
this gap. We establish that without replacement methods do converge
faster in several sampling models and propose a general inequality, a
symmetrized (noncommutative) arithmetic-geometric-mean inequality,
that would close this gap in many cases. CRC Coming Soon.
VLDB 2012 Arun Kumar and I have a
paper, Probabilistic Management of OCR Data using an RDBMS,
accepted for publication in VLDB 2012 in Istanbul. The paper is our
first attempt to understand the implications of combining rich content
models like OCR with the sophisticated querying capabilities of an
RDBMS. The state-of-the-art models underlying these content models are
probabilistic, e.g., OCRopus for Google Books. These models have very
high quality, but the sheer size of these models can destroy query
processing performance. Arun's idea is to use some ideas from
statistics to compress the model; in turn, this allows him to trade
run time for quality. Thank you to the Microsoft Jim Gray Systems Lab
for supporting Arun's work. (Code and
data here)
Probabilistic DB Book is out. My coauthors (Dan Suciu, Dan Olteanu, and
Christoph Koch) have written a wonderful
book
about probabilistic databases in Morgan and Claypool's Synthesis
Lectures for Data Management.
ICDE 2012 Xixuan (Aaron) Feng, Fei
Chen and Min Wang from HP Labs-China, and I have authored a paper
titled Optimizing Statistical Information Extraction Programs Over
Evolving Text about efficiently maintaining conditional random
fields on evolving coropra, i.e., new documents are added or old
documents are modified. The paper is accepted to ICDE 2012 in
Washington,
DC. (CRC
and Full
Version)
TODS in 2012 Dan Suciu and I have a paper, Understanding
Cardinality Estimation using Entropy Maximization, that has been
accepted to TODS. I would like to thank the referees who produced
thorough reviews that helped us to improve the paper. It is the
journal version of our PODS 2010 paper. A preliminary version is here.
NIPS 2011
Big Learning I'm giving a keynote
at Big Learning. The organizers
have put together some great keynotes and one given by some database
guy.
HOGWILD! Ben Recht, Steve Wright, Feng
Niu, and I have a new way of parallelizing incremental gradient
algorithms. The biggest obstacle to achieving linear speedup is
minimizing lock contention. Hogwild's approach is simple: get rid
of locking entirely! We prove that as long as the data are sparse,
Hogwild achieves linear speedups. We demonstrate our theory on a
diverse set of problems including text classification with support
vector machines, cut problems from vision applications, and
recommendation via matrix factorization. (Accepted NIPS 2011)
Victor: Mathematical Optimization + Large Data Projects
Jellyfish Ben Recht and I have released some software
for large-scale matrix completion. If your algorithm is faster on
billion-entry matrices, send it to us so we can learn how to go
faster. Currently, we are two orders of magnitude faster to the same
error (RMSE) versus any algorithm that we know about. The algorithm is
essentially buzzword complete: a large-scale parallel stochastic
gradient algorithm for nonconvex relaxations
Code is available (with
data generators). Paper on Optimization
Online.
Thank you, Office of Naval Research and Physical
Layer Systems for supporting this work!
Victor-SQL integrates incremental schemes with an RDBMS via
a (hopefully) easy-to-use python interface. Available
at the Victor website.
VLDB 2011. Look for three papers in the upcoming VLDB 2011:
Tuffy and Manimal get some press. Thank you to those who have
posted references to Tuffy and Manimal on Twitter, Y-combinator,
O'Reilly online, and Radar Online. If you use our stuff and have any
feedback (positive or negative), please send us a note.
"Incrementally Maintaining Classification using an RDBMS" by M. Levent Koc and me. The source code for this project will be released as part of Hazy very soon.
"Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS" by Feng Niu, AnHai Doan, Jude Shavlik, and me. A prototype version of the system is available off the Hazy Website.
"Automatic Optimization for MapReduce Programs" by Mike Cafarella, Eaman Jahani, and me.
Best of PODS 2010: our paper, Transducing Markov Sequences, has been
invited to a special issue of JACM for the best papers of PODS
2010.
Best of PODS 2010: our paper, Understanding
Cardinality Estimation using Entropy Maximization, has been
invited to a special issue of TODS for the best papers of PODS 2010.
ACM SIGMOD was kind enough to give me the ACM SIGMOD Jim Gray
Thesis Award for my
dissertation, Managing
Large-Scale Probabilistic Databases. Hearing Jim Gray talk
about his work on the World Wide Telescope project was what inspired
me to study databases. It is a real honor to me that his name is on
this award.
I have two new PODS 2010 papers:
Transducing
Markov Sequences
with Benny Kimelfeld that
studies how to evaluate transducers (think: automaton with output)
over the output of a Hidden Markov Model. This problem is motivated
by challenges that we faced while building the querying
infrastructure of Lahar (see below).