- DanaC15. Caffe con Troll: Shallow Ideas to Speed up Deep Learning. This is our first exploration of how to schedule deep learning computations on CPUs and GPUs. We show that simple ideas yield a 4.5x speedup for Caffe (the wildly popular deep learning framework) on CPUs. In particular, throughput is essentially proportional to the FLOPS the hardware delivers, so we can also use both CPUs and GPUs to get more FLOPS and go even faster.
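If throughput really does track delivered FLOPS, a quick back-of-the-envelope (my illustration, not a measurement from the paper) gives the win from running a kernel on CPU and GPU together:
\[ \frac{T_{\mathrm{CPU+GPU}}}{T_{\mathrm{GPU}}} \approx \frac{F_{\mathrm{CPU}} + F_{\mathrm{GPU}}}{F_{\mathrm{GPU}}}, \]
where \(F\) is the FLOPS each device delivers on the kernel at hand.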
- GRADES. Led by Dung Nguyen and Hung Q. Ngo, together with the awesome LogicBlox team, we benchmarked a wide range of join and graph algorithms, including Minesweeper, our beyond-worst-case algorithm. Check out our new paper, Join Processing for Graph Patterns: An Old Dog with New Tricks.
- Adobe Data Science. May 28.
- Huawei. May 19-20.
- CS50th and SDSI Kickoff. April 28-29.
- PPL Retreat. May 7-9.
- Adobe ML Talk. May 14.
- NorCal DB. April 24.
- MEMEX. April 13-17.
- Invited Talk @ ACM SIGAI Bay Area at Baidu. April 16.
- SIMPLEX. April 19-23.
- BIRS. April 20-22.
- I'm honored to be selected as a Moore Data-Driven Discovery Investigator. What an exciting list of people! Thank you to Context Relevant, the world's best analytics company for Wall St. and beyond, for putting out such a nice press release (CNBC, Yahoo! Finance). My dad loved it.
- Recent papers are about analytics, joins, and feature selection. It all ends up in DeepDive...
- High-Performance Analytics. Thank you to Microsoft for giving a nod to Hogwild!; they mentioned that their Adam system (for machine learning) is based on our Hogwild! approach. We love that they got the exclamation point into print, although it's dubious that anyone is more Hogwild! than we are! The successors to both the theory (ICML14) and the systems work (VLDB14) are in print. The recent DimmWitted engine is a deeper systems exploration of the tradeoff space; check it out!
- Feature Engineering. Materialization Optimizations for Feature Selection shows that, using a DSL and some novel optimizations, we can get order-of-magnitude performance wins for feature engineering in R. Thank you to SIGMOD for selecting this as the best paper of SIGMOD14! In NIPS14, Yingbo Zhou and (many) others have some very nice work about how to do feature selection in parallel using some group-testing ideas. Most recently, a draft about the DeepDive approach to feature engineering for KBC systems is here.
- Joins! New papers about one of my favorite topics. This is joint work with Hung Q. Ngo and Atri Rudra.
- We have written a short overview for SIGMOD Record about recent advances in worst-case optimal join algorithms. This is the one to read first. Our goal is to give a high-level view of the worst-case optimality results for practitioners and applied researchers. We also managed to simplify the arguments.
- The Minesweeper paper goes beyond worst-case analysis for join algorithms, which is a much stronger guarantee than traditional CS theory provides: the algorithm must be within a small logarithmic factor, on every instance, of any comparison-based algorithm, a class that includes all standard join algorithms. Our empirical results with the LogicBlox guys are in GRADES15.
- Our Tetris paper describes (what we think is) a beautiful framework for beyond-worst-case and worst-case optimal join algorithms via a new connection between geometry and DPLL resolution. It also allows us to consider a wide variety of indexing schemes in a single framework.
- A full version of our join algorithm with worst-case optimal running time is here (original PODS 2012 paper).
- PaleoDB. Our assessment of PaleoDeepDive is here and here. PaleoDeepDive exceeds human volunteer quality in both precision and recall on some extraction tasks. Thank you to PaleoDB and Shanan Peters for all their painstaking work! A description of our approach to building KBC systems is here. It's all about feature engineering! DeepDive was recently mentioned in Forbes and as a top tool for data science. It has received some media coverage (Vice, Fusion, The India Times, El Mundo, Kurzweil's newsletter, CACM, and perhaps here?).
- Our paper with Michigan Econ (Shapiro and Levenstein) and Michigan CS people (Antenucci and Cafarella) about using Twitter to predict economic indicators is out. It has been picked up by The Economist's blog, the Wall Street Journal, the Boston Globe, The Washington Post, Patria (Czech), and Slate. A summary of this work has been selected to appear in the NBER Digest in the August issue. Interesting follow-ups in Science; excited to see where it goes!
- BeyondMR15. Mar. 27.
- DARPA workshop. Mar 2.
- Strata. Feb. 17-20.
- UW Data Science. Feb 4.
- MEMEX and XData. DC. Jan. 11-16.
- CIDR. Jan 4-7.
- Seattle and Allen AI2 Distinguished Lecture. Nov 13-14.
- Panel on Data Science Education. Dec 4.
- NIPS and NIPS Workshop. Automatic Knowledge Base Construction. Dec. 12-13.
- Deep-Time Data with AGU. Dec. 14. (Matteo will be there!)
- Dagstuhl. PlanBig. Dec. 14-19.
- The Minesweeper paper is our attempt to go beyond worst-case analysis for join algorithms, which is a much stronger pointwise or per-instance guarantee than is typical. We (Hung Ngo, Dung Nguyen, Atri Rudra, and I) develop a new algorithm that we call Minesweeper. The main idea is to formalize the amount of work any algorithm spends certifying (using a set of propositional statements) that the output set is complete (and not, say, a proper subset). We call this set of propositions the certificate. We manage to establish a dichotomy theorem for this stronger notion of complexity: if a query's hypergraph is what Ron Fagin calls beta-acyclic, then Minesweeper runs in time linear in the certificate; if a query is beta-cyclic, then on some instance any algorithm takes time superlinear in the certificate. The results get sharper and more fun. Also, Dung is a superhero and has implemented a variant of this algorithm.
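Roughly, the main guarantee has the following shape (my paraphrase; see the paper for the precise statement and the logarithmic factors it hides): for a beta-acyclic query with certificate \(\mathcal{C}\) and output size \(Z\),
\[ T_{\text{Minesweeper}} = \tilde{O}\big(|\mathcal{C}| + Z\big), \]
while for beta-cyclic queries no comparison-based algorithm can run in time linear in \(|\mathcal{C}|\) on every instance.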
- Facebook. Sep 15-16.
- MEMEX. Sep 15-26.
- DC (CCC and PI). Oct 14-16.
- DC (NSF). Oct 16-17.
- NYC and IBM Watson. Nov. 6-7.
- Master's in Data Science named me a thought leader. It's humbling to be listed with such great people.
- Google. We want to thank Google for funding our research project, Trust, but (Probabilistically) Verify: Toward Tail Extraction. Should be a lot of fun!
- Toshiba. We want to thank Toshiba for funding our work on knowledge base construction. We are very excited that Toshiba engineers are using DeepDive!
- ONR. Thank you for funding my proposal Foundations for Data-driven Systems: Join Algorithms and Random Network Theory. The ONR continues to be one of the best supporters of pure theoretical work.
- DARPA. Thank you to DARPA's XData for supporting my collaborative work on scalable analytics and join processing.
- DARPA. Thank you to DARPA Memex for supporting our work. We are really excited to be part of this program!
- NIH. Thank you to the NIH Big Data to Knowledge (BD2K) program for supporting our work on mobility data led by Scott Delp.
- Modern Data Management Summit in Beijing. Aug 28-30.
- VLDB. We'll present DimmWitted. Sep 1-5.
- UCB. Talk about DimmWitted. Sep 10.
- VLDB14. See you in Hangzhou!
- ICML14. Ji Liu, Stephen J. Wright, Victor Bittorf, Srikrishna Sridhar, and I have some new theory about An Asynchronous Parallel Stochastic Coordinate Descent Algorithm in ICML14. This is a Hogwild!-style algorithm, but it has faster convergence rates than Hogwild! for certain types of sparse data.
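Roughly, and in my paraphrase rather than the paper's exact statement, each core repeatedly picks a coordinate \(j\) uniformly at random and applies
\[ x_j \leftarrow x_j - \frac{\gamma}{L_{\max}} \nabla_{\!j} f(\hat{x}), \]
where \(\hat{x}\) is a possibly stale snapshot of the shared iterate and \(L_{\max}\) bounds the coordinate-wise Lipschitz constants; the analysis quantifies how much staleness the convergence rate can absorb.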
- VLDB14. We've just released a description of our sampling/inference engine DimmWitted: A Study of Main-Memory Statistical Analytics. DimmWitted is the successor to the Hogwild! and Elementary engines.
- Panel. The Role of Database Systems in the Era of Big Data.
- IMDM. Victor Bittorf will give an invited talk describing his work on Impala, on porting MADlib to Impala, and some thoughts about main-memory analytics as described here.
- Baidu Lunch. Aug 22.
- Stanford IoT Workshop. Aug 11.
- Berkeley Workshop on Awesome. Aug. 15?
- MSR. Faculty Summit. July 14-20.
- Moore. Talk. July 28-29.
- SAP. June 16.
- SIGMOD/PODS. Ce is talking about feature selection, and I'll talk about beyond worst-case analysis for joins. June 22-27.
- XDATA. July 7-9.
- EPFL. Research Day. A broad audience talk about DeepDive. June 12.
- Adobe Data Science. June 5.
- LinkedIn. May 22.
- HP Labs. May 15.
- Norcal Database Day. Apr. 23.
- Computer Forum. Apr. 15.
- SWANK Knowledge Workshop. Apr. 17-18.
- Big Data (NAS). Apr. 11-12.
- Stonebraker @ 70. Apr. 12.
- ICDT. I'm giving a keynote about one of my favorite topics, joins! A brief description of my talk is here. I will also present a paper that I very much enjoyed writing about random graphs. Mar 24-28.
- ICDT. I will present a paper in beautiful Athens called The Theory of Zeta Graphs with an Application to Random Networks. The main idea is to understand the basic theory of a declarative, Erdős-Rényi-like model for preferential attachment graphs. It makes some connections to the combinatorics of multiple-valued zeta functions. A version is here.
- DIMACS. A meeting about Big Data. Street art and features! Mar. 17.
- SGD Workshop at IPAM. Feb. 24-28.
- IBM. I'm giving a distinguished lecture at the IBM accelerated discovery lab. Mar. 6.
- Talk at Oracle. All about Joins! Feb. 18.
- Talk at STRATA Silicon Valley. I'm going to talk about some of our thoughts about how to make data analytics easier to deploy. Street art will again feature prominently; this talk builds on Brainwash and a spate of recent demos, including a nowcasting framework, Ringtail, in VLDB and WebDB, and an enterprise feature-selection system with Oracle. It will also describe DeepDive.
- IceCube. IceCube has a press release here and here about Mark Wellons's new manuscript, which has been accepted to NIM A. Mark's algorithm obtains a 66 percent improvement in detecting when more than one particle is active in the detector. Thank you Francis Halzen, Gary Hill, and all of IceCube for supporting this work! See all the amazing stuff IceCube is doing here. IceCube gets the cover of Science! Way to go, IceCube (and Mark)!
- Talks at Stanford or near Stanford in January. Vision group meeting about knowledge base construction (KBC) on 1/13, ISL Big Data about joins on 1/16, at the Infolunch about why our group is working on this set of problems on 1/17, at SRI about KBC on 1/17, and about hardware-aware analytics at the PPL retreat on 1/24.
- NIPS13. Krishna Sridhar, Ji Liu, Ce Zhang, Victor Bittorf, Stephen J. Wright, and I have a paper about an approximate LP solver geared toward LP relaxations. It's an order of magnitude faster than CPLEX for many combinatorial problems.
- SIGMOD. The most recent version of DeepDive's engine is described in a recent SIGMOD paper.
- NSF. Thank you to the NSF for supporting our work on GeoDeepDive with Shanan Peters and Miron Livny. The hope is to build a data platform that can be used to advance Geoscience. It's going to be exciting!
- NSF. Thank you to the NSF for supporting our work New Frontiers in Join Algorithms: Optimality, Noise, and Richer Languages. This is joint with Atri Rudra and Hung Ngo, and lets us continue to go deeper into the theory of joins. How great is that!
- Alfred P. Sloan Research Fellowship. Thank you for your generous support of our research!
- ACM Queue and CACM. The students of the Hazy group put together a manuscript describing their vision for Big Data analytics, Hazy: Making it Easier to Build and Maintain Big-data Analytics, in ACM Queue (invited to CACM).
- NIPS.
- Krishna will talk about solving linear programming relaxations of combinatorial problems approximately in parallel.
- I'm giving a keynote at Big Learning. Street art will feature prominently.
- I'm talking about PaleoDB and DeepDive at KET.
- EngX. I'm giving a talk at EngX about some of our work on PaleoDeepDive. These are short, fun talks.
- CIS. I'm giving a talk at the CIS roundtable about simple hardware-aware analytics.
- AAAI13. I'm giving a talk at the AAAI Fall Symposium on Discovery Informatics about our work on GeoDeepDive.
- SIMONS. Simons Big Data events at Cal. They were off-the-charts good! (The parking tickets are expensive.)
- SIMONS. Parallel and Distributed Algorithms for Inference and Optimization. Check it out!
- ECML-PKDD. I'm giving a keynote at ECML-PKDD 2013.
- SIMONS. Succinct Data Representations and Applications at Simons (Sep. 16-20). Going to an ONR meeting instead.
- RICE. I'm giving a talk about our GeoDeepDive project at Rice. Should be fun!
- SIMONS. Big Data Bootcamp.
- VLDB. We have two demonstrations of feature selection: Ringtail, on unstructured text (formerly Automan; see this video), and Columbus (video soon!).
- I spent the last four years in beautiful Wisconsin. Bucky, you gave me a great home with great students, colleagues, and cheese curds. Thank you!
- VLDB 2013. We have two demos accepted to VLDB: Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System with Oracle people and Ringtail: Nowcasting Made Simple with Michigan people. Both of these are early incarnations of Brainwash, our CIDR13 vision about how feature selection is a data management problem. Thank you, Jordan Novet, for using a picture of my rant from the GraphLab workshop in GigaOm.
- ACL 2013. Vidhya and Ce have an awesome new paper, Understanding Tables in Context using Standard NLP Toolkits. We use this work in GeoDeepDive.
- MPC. My paper with Ben Recht about matrix factorization (Jellyfish) has been accepted to Mathematical Programming Computation; the latest version of the paper is here, and it supersedes the 2011 version.
- BNCOD13 and DEOS. I'm giving an invited tutorial at BNCOD and a keynote talk at DEOS. Dan Olteanu put together a very interesting BNCOD program, and Wolfgang Gatterbauer put together a very fun DEOS program.
- GraphLab. Carlos has a company, and they are awesome. I'll be talking at their workshop.
- SIGMOD13. Towards High-Throughput Gibbs Sampling at Scale: A Study across Storage Managers. The paper is here. This is the latest version of our inference engine that runs GeoDeepDive and DeepDive. Code is here. A new video and code release is coming soon!
- SIGMOD13 (demo). GeoDeepDive: Statistical Inference using Familiar Data-Processing Languages has been accepted as a demonstration! Roughly, it will be a live version of how to build the system in this video and this paper.
- WebDB. Dolan Antenucci, Michael Cafarella, Margaret C. Levenstein, Matthew Shapiro, and I have a paper, Ringtail: Nowcasting Made Easy, in WebDB 2013 with SIGMOD. (Formerly called Automan, with an awesome movie.)
- PODS13. Dan Suciu and I are organizing a Colloquium on Theory Challenges in Big Data. Dan got a great set of speakers, and I'm excited to hear what they say!
- Oracle. Thank you for continuing to support the Hazy Research group! This gift will be used to continue our work on feature engineering for structured analytics.
- Air Force. Thank you to the Air Force for supporting Mathematical Foundations of Secure Computing Clouds; this is the hard work of Jordan Ellenberg (Math), Ben Recht (CS), Tom Ristenpart (CS), Rob Nowak (EE), and Steve Wright (CS).
- StarAI. Our paper Using Commonsense Knowledge to Automatically Create (Noisy) Training Examples from Text, with Sriraam Natarajan, José Picado, Tushar Khot, Kristian Kersting, and Jude Shavlik, has been accepted to StarAI with AAAI 2013.
- SLG2013 at ICML. There is a great program at the workshop on structured learning this year; I'll be talking about some of Ce and Vidhya's work on GeoDeepDive.
- I am giving an invited talk at Microsoft's Machine Learning summit (Paris in April!) and a keynote at ECML-PKDD 2013!
- CIDR13. Brainwash is a vision paper that argues that feature engineering is a database problem.
- NIPS12 (spotlight). An updated version of HottTopix, Factoring Nonnegative Matrices with Linear Programs, is now on arXiv with a cleaner proof, but Nicolas Gillis did a much better analysis of the case with duplicates and noise!
- Automan. Automan is a system whose goal is to predict real-life indicators using social media. Working with economists and Mike Cafarella (CS) at Michigan, we were able to show that Twitter data can predict the number of weekly unemployment insurance claims better than traditional methods. Another goal is to understand to what extent social media provides a truly novel signal for predictive tasks. In this task, it looks like it may! The video is here.
- DeepDive. A first published description of our DeepDive system has appeared: Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference. This is the hard work of Feng Niu, Ce Zhang, and Jude Shavlik. The demonstration is up on the Web. Watch the video and play with the live content.
- NIPS 2012. Provable Non-negative Matrix Factorization. Recently, in a beautiful paper, Arora, Ge, Kannan, and Moitra described a class of matrices and an error model for which non-negative matrix factorization provably returns the right answer. In this manuscript, Victor Bittorf, Ben Recht, Joel Tropp, and I describe an alternate approach to establishing this result with three characteristics: (1) better noise tolerance, (2) a more general noise model, and, most importantly, (3) efficient, scalable algorithms. We demonstrate that our algorithm achieving these properties can factor multigigabyte matrices in minutes on a commodity workstation. The manuscript and an implementation in the Hogwild! framework are available, along with the paper version. A revised full version, CRC, and a YouTube video for our spotlight will be out soon. Also, Nicolas Gillis has a nice paper that follows up and improves our analysis; fresh on the arXiv!
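From memory, and so a rough sketch rather than the paper's exact formulation: with the rows of \(X\) normalized to sum to one and \(p\) a generic cost vector, the factorization-localizing program looks roughly like
\[ \min_{C \ge 0} \; p^{T}\operatorname{diag}(C) \quad \text{s.t.} \quad CX = X,\;\; \operatorname{tr}(C) = r,\;\; C_{jj} \le 1,\;\; C_{ij} \le C_{jj}, \]
after which the \(r\) rows of \(X\) with the largest diagonal entries of \(C\) are taken as the factor; in the noisy case the constraint \(CX = X\) relaxes to a norm bound on \(CX - X\).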
- ICDM 2012. Feng Niu, Ce Zhang, Jude Shavlik, and I have a new paper about how to scale Markov Logic inference to the web (how we built systems like DeepDive). Our idea is to combine the best-of-breed tools for each subtask; the challenge is how to combine these tools at scale, and for that we use Lagrangian relaxation, or dual decomposition, an old idea from mathematical programming. One cute part of our approach is that we can encode much of the Lagrangian relaxation algorithm using standard SQL queries. A toy sketch of the idea is below.
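To give the flavor of dual decomposition, here is my own toy illustration in Python (not the paper's SQL encoding): split a shared variable into per-subproblem copies, solve each copy independently, and use a Lagrange multiplier to push the copies back into agreement.

```python
# Toy dual decomposition: minimize (x - 3)^2 + (x - 7)^2 by giving each
# term its own copy of x and penalizing disagreement with a multiplier lam.
def dual_decomposition(steps=200, alpha=0.5):
    lam = 0.0
    for _ in range(steps):
        # Subproblem 1: argmin over x1 of (x1 - 3)^2 + lam * x1 (closed form)
        x1 = 3.0 - lam / 2.0
        # Subproblem 2: argmin over x2 of (x2 - 7)^2 - lam * x2 (closed form)
        x2 = 7.0 + lam / 2.0
        # Dual subgradient step: move lam to push the copies into agreement
        lam += alpha * (x1 - x2)
    return x1, x2, lam

print(dual_decomposition())  # both copies converge to the consensus x = 5
```

In the paper's setting, as I understand it, each subproblem is a full inference task handled by a specialized algorithm, and it is the multiplier bookkeeping around those subproblems that gets pushed into SQL.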
- XLDI. I'll be at XLDI, where I plan to beg smart PL people for help with a problem that my research group struggles with on a daily basis: Feature Engineering. My contention is that the biggest bottleneck in building trained systems may not be more sophisticated machine learning, but allowing one to get signal into the system more rapidly. The CS angle is to accelerate the cycle the engineer goes through of exploring data sources, extracting relevant information, and then evaluating their changes; this cycle, we contend, has simple machine learning at its core, but the challenges we see are mostly data engineering and processing challenges.
- PhD! Feng Niu graduated! (Co-advised by AnHai Doan.) The entire Hazy group is indebted to Feng for his intelligence and creativity. Together with Ce Zhang, Feng was the driving force behind DeepDive, GeoDeepDive, and many of our other projects. Google is lucky to have him!
- VLDB 2012. Arun Kumar and I have a paper, Probabilistic Management of OCR Data using an RDBMS, accepted for publication in VLDB 2012 in Istanbul. The paper is our first attempt to understand the implications of combining rich content models like OCR with the sophisticated querying capabilities of an RDBMS. The state-of-the-art models underlying these content models are probabilistic, e.g., OCRopus for Google Books. These models have very high quality, but the sheer size of these models can destroy query processing performance. Arun's idea is to use some ideas from statistics to compress the model; in turn, this allows him to trade run time for quality. Thank you to the Microsoft Jim Gray Systems Lab for supporting Arun's work. (Code and data here.)
- The MADlib Analytics Library or MAD Skills, the SQL. Joseph M. Hellerstein, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleks Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, Arun Kumar, and I have a paper about the status of MADlib, which is an awesome open-source effort led by Greenplum. Thank you, EMC!
- VLDS. We will also present an overview of our DeepDive work at VLDS2012! See you in Istanbul!
- MMDS. Modern Massive Data Sets is this week -- what an awesome workshop!
- ACL 2012. Feng Niu, Ce Zhang, Jude Shavlik, and I have a paper that builds on our experiences with DeepDive to study the impact of Big Data with distant supervision versus crowd-sourced data on the quality of relationship extraction from the Web. The paper title is Big Data versus the Crowd: Looking for Relationships in All the Right Places.
- COLT 2012. Ben Recht and I have a new manuscript. The starting point is that practitioners who run sequential random-sampling algorithms (like stochastic gradient) sample from their data without replacement. In many cases, people have observed that sampling without replacement empirically converges to an optimal value faster than with-replacement approaches. There is, however, a gap in our current theory, which provides faster convergence rates for with-replacement sampling than for without-replacement sampling. This paper takes a step toward closing this gap. We establish that without-replacement methods do converge faster in several sampling models and propose a general inequality, a symmetrized (noncommutative) arithmetic-geometric-mean inequality, that would close this gap in many cases. CRC coming soon!
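For flavor: the scalar arithmetic-geometric-mean inequality is \(\sqrt{ab} \le (a+b)/2\), and the conjectured matrix analogue (stated roughly from memory, so check the manuscript for the exact form) says that for positive semidefinite \(A_1,\dots,A_n\), averaging length-\(k\) products over distinct indices is dominated in norm by the with-replacement average:
\[ \left\| \frac{(n-k)!}{n!} \sum_{\substack{j_1,\dots,j_k\\ \text{distinct}}} A_{j_1}\cdots A_{j_k} \right\| \;\le\; \left\| \Big(\frac{1}{n}\sum_{j=1}^{n} A_j\Big)^{\!k} \right\|. \]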
- SIGMOD EVENTS. New Researcher Symposium and Undergraduate Research Poster Competition. Please come to these two events at this week's SIGMOD to encourage the next generation of database nerds!
- The program of the NRS symposium and slides from some presenters at the symposium are here.
- PODS 2012. Hung Ngo, Ely Porat, Atri Rudra, and I have a new PODS paper called "Worst-Case Optimal Join Algorithms." For all join queries, we are able to provide the first optimal algorithm in terms of data complexity. The key to the result is an algorithmic proof of a pair of inequalities from discrete geometry: the Loomis-Whitney inequality from the 1940s and the more recent Bollobás-Thomason inequality. We show that these classical inequalities are equivalent to the beautiful recent results of Atserias, Grohe, and Marx. Our work also provides an algorithmic proof of these results; in contrast, all previous proofs are non-constructive. Thank you to the program committee for selecting this paper for the Best Paper Award.
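For context, the Atserias-Grohe-Marx bound says: if \((x_1,\dots,x_m)\) is a fractional edge cover of the query's hypergraph (every attribute is covered with total weight at least one), then the output of joining relations \(R_1,\dots,R_m\) satisfies
\[ |R_1 \bowtie \cdots \bowtie R_m| \;\le\; \prod_{i=1}^{m} |R_i|^{x_i}, \]
and worst-case optimality means matching the best such bound in running time, up to lower-order factors.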
- SIGMOD 2012. Aaron Feng, Arun Kumar, Ben Recht, and I have a paper, Towards a Unified Architecture for In-RDBMS Analytics, accepted to SIGMOD 2012. The code for Bismarck is available here (so is the data), along with the full version of the paper. Thank you to Greenplum and Oracle for valuable discussions that helped us understand this problem!
- Probabilistic DB Book is out. My coauthors (Dan Suciu, Dan Olteanu, and Christoph Koch) have written a wonderful book about probabilistic databases in Morgan and Claypool's Synthesis Lectures on Data Management.
- ICDE 2012. Xixuan (Aaron) Feng, Fei Chen and Min Wang from HP Labs-China, and I have authored a paper titled Optimizing Statistical Information Extraction Programs Over Evolving Text about efficiently maintaining conditional random fields on evolving corpora, i.e., as new documents are added or old documents are modified. The paper is accepted to ICDE 2012 in Washington, DC. (CRC and Full Version.)
- TODS in 2012. Dan Suciu and I have a paper, Understanding Cardinality Estimation using Entropy Maximization, that has been accepted to TODS. I would like to thank the referees, who produced thorough reviews that helped us to improve the paper. It is the journal version of our PODS 2010 paper. A preliminary version is here.
- NIPS 2011
- Big Learning. I'm giving a keynote at Big Learning. The organizers have put together some great keynotes, and one given by some database guy.
- HOGWILD! Ben Recht, Steve Wright, Feng Niu, and I have a new way of parallelizing incremental gradient algorithms. The biggest obstacle to achieving linear speedups is lock contention, and the Hogwild! approach is simple: get rid of locking entirely! We prove that as long as the data are sparse, Hogwild! achieves linear speedups. We demonstrate our theory on a diverse set of problems, including text classification with support vector machines, cut problems from vision applications, and recommendation via matrix factorization. (Accepted to NIPS 2011.)
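A minimal sketch of the pattern in Python (my own toy under the sparse-data assumption, not our released code; CPython's GIL means it illustrates the racy update pattern rather than real speedups):

```python
import threading
import numpy as np

def hogwild_sgd(examples, dim, n_threads=4, lr=0.1, epochs=5):
    """examples: list of (indices, values, label) sparse rows; squared loss."""
    w = np.zeros(dim)  # shared weights, updated by every thread without locking

    def worker(chunk):
        for _ in range(epochs):
            for idx, vals, y in chunk:
                pred = sum(w[i] * v for i, v in zip(idx, vals))
                g = pred - y  # gradient scale for squared loss
                for i, v in zip(idx, vals):
                    w[i] -= lr * g * v  # racy write; sparsity keeps collisions rare

    threads = [threading.Thread(target=worker, args=(examples[t::n_threads],))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

The point of the analysis is that when examples touch few, mostly disjoint coordinates, these unsynchronized writes cost almost nothing in convergence.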
- Victor: Mathematical Optimization + Large Data Projects
- Jellyfish. Ben Recht and I have released some software for large-scale matrix completion. If your algorithm is faster on billion-entry matrices, send it to us so we can learn how to go faster. Currently, we are two orders of magnitude faster to the same error (RMSE) than any algorithm we know about. The algorithm is essentially buzzword-complete: a large-scale parallel stochastic gradient algorithm for nonconvex relaxations. A generic sketch of the update appears after this list.
- Code is available (with data generators). Paper on Optimization Online.
- Thank you, Office of Naval Research and Physical Layer Systems for supporting this work!
- Victor-SQL integrates incremental schemes with an RDBMS via a (hopefully) easy-to-use Python interface. Available at the Victor website.
- VLDB 2011. Look for three papers in the upcoming VLDB 2011:
- "Incrementally Maintaining Classification using an RDBMS" by M. Levent Koc and me. The source code for this project will be released as part of Hazy very soon.
- "Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS" by Feng Niu, AnHai Doan, Jude Shavlik, and me. A prototype version of the system is available off the Hazy website.
- "Automatic Optimization for MapReduce Programs" by Mike Cafarella, Eaman Jahani, and me.
- Tuffy and Manimal get some press. Thank you to those who have posted references to Tuffy and Manimal on Twitter, Y Combinator, O'Reilly online, and Radar Online. If you use our stuff and have any feedback (positive or negative), please send us a note.
- Best of PODS 2010: our paper, Transducing Markov Sequences, has been invited to a special issue of JACM for the best papers of PODS 2010.
- Best of PODS 2010: our paper, Understanding Cardinality Estimation using Entropy Maximization, has been invited to a special issue of TODS for the best papers of PODS 2010.
- A new WebDB 2010 paper with Michael Cafarella, Manimal: Relational Optimization for Data-Intensive Programs. Manimal is a hybrid system that does relational-style optimization for MapReduce programs by performing a static analysis of Java code.
- ACM SIGMOD was kind enough to give me the ACM SIGMOD Jim Gray Thesis Award for my dissertation, Managing Large-Scale Probabilistic Databases. Hearing Jim Gray talk about his work on the World Wide Telescope project was what inspired me to study databases. It is a real honor to me that his name is on this award.
- I have two new PODS 2010 papers:
- Queries and Materialized Views on Probabilistic Databases with Nilesh Dalvi and Dan Suciu appears in the Journal of Computer and System Sciences. This paper is the best explanation of our work on using materialized views to answer queries in a probabilistic database.