- DanaC15. Caffe con Troll: Shallow Ideas to Speed up Deep Learning. This is our first exploration of how to schedule deep learning computations on CPUs and GPUs. We show that simple ideas yield a 4.5x speedup for Caffe (the wildly popular deep learning framework) on CPUs. In particular, throughput is essentially proportional to the FLOPS the hardware delivers, so we can also use both CPUs and GPUs to get more FLOPS and go even faster.
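If throughput really does track delivered FLOPS, a quick back-of-the-envelope (my illustration, not a measurement from the paper) gives the win from running a kernel on CPU and GPU together:
\[ \frac{T_{\mathrm{CPU+GPU}}}{T_{\mathrm{GPU}}} \approx \frac{F_{\mathrm{CPU}} + F_{\mathrm{GPU}}}{F_{\mathrm{GPU}}}, \]
where \(F\) is the FLOPS each device delivers on the kernel at hand.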
- GRADES. Led by Dung Nguyen and Hung Q. Ngo, together with the awesome LogicBlox team, we benchmarked a wide range of join and graph algorithms, including Minesweeper, our beyond-worst-case algorithm. Check out our new paper, Join Processing for Graph Patterns: An Old Dog with New Tricks.
- Adobe Data Science. May 28.
- Huawei. May 19-20.
- CS50th and SDSI Kickoff. April 28-29.
- PPL Retreat. May 7-9.
- Adobe ML Talk. May 14.
- NorCal DB. April 24.
- MEMEX. April 13-17.
- Invited Talk @ ACM SIGAI Bay Area at Baidu. April 16.
- SIMPLEX. April 19-23.
- BIRS. April 20-22.
- I'm honored to be selected as a Moore Data-Driven Discovery Investigator. What an exciting list of people! Thank you to Context Relevant, the world's best analytics company for Wall St. and beyond, for putting out such a nice press release (CNBC, Yahoo! Finance). My dad loved it.
- Recent papers are about analytics, joins, and feature selection. It all ends up in DeepDive...
- High-Performance Analytics. Thank you to Microsoft for giving a nod to Hogwild!; they mentioned that their Adam system (for machine learning) is based on our Hogwild! approach. We love that they got the exclamation point into print, although it's dubious that anyone is more Hogwild! than we are! The successors to both the theory (ICML14) and the systems work (VLDB14) are in print. The recent DimmWitted engine is a deeper systems exploration of the tradeoff space; check it out!
- Feature Engineering. Materialization Optimizations for Feature Selection shows that, using a DSL and some novel optimizations, we can get order-of-magnitude performance wins for feature engineering in R. Thank you to SIGMOD for selecting this as the best paper of SIGMOD14! In NIPS14, Yingbo Zhou and (many) others have some very nice work about how to do feature selection in parallel using some group-testing ideas. Most recently, a draft about the DeepDive approach to feature engineering for KBC systems is here.
- Joins! New papers about one of my favorite topics. This is joint work with Hung Q. Ngo and Atri Rudra.
- We have written a short overview for SIGMOD Record about recent advances in worst-case optimal join algorithms. This is the one to read first. Our goal is to give a high-level view of the worst-case optimality results for practitioners and applied researchers. We also managed to simplify the arguments.
- The Minesweeper paper goes beyond worst-case analysis for join algorithms, which is a much stronger guarantee than traditional CS theory provides: the algorithm must be within a small logarithmic factor, on every instance, of any comparison-based algorithm, a class that includes all standard join algorithms. Our empirical results with the LogicBlox guys are in GRADES15.
- Our Tetris paper describes (what we think is) a beautiful framework for beyond-worst-case and worst-case optimal join algorithms via a new connection between geometry and DPLL resolution. It also allows us to consider a wide variety of indexing schemes in a single framework.
- A full version of our join algorithm with worst-case optimal running time is here (original PODS 2012 paper).
- PaleoDB. Our assessment of PaleoDeepDive is here and here. PaleoDeepDive exceeds human volunteer quality in both precision and recall on some extraction tasks. Thank you to PaleoDB and Shanan Peters for all their painstaking work! A description of our approach to building KBC systems is here. It's all about feature engineering! DeepDive was recently mentioned in Forbes and as a top tool for data science. It has received some media coverage (Vice, Fusion, The India Times, El Mundo, Kurzweil's newsletter, CACM, and perhaps here?).
- Our paper with Michigan Econ (Shapiro and Levenstein) and Michigan CS people (Antenucci and Cafarella) about using Twitter to predict economic indicators is out. It has been picked up by The Economist's blog, the Wall Street Journal, the Boston Globe, The Washington Post, Patria (Czech), and Slate. A summary of this work has been selected to appear in the NBER Digest in the August issue. Interesting follow-ups in Science; excited to see where it goes!
- BeyondMR15. Mar. 27.
- DARPA workshop. Mar 2.
- Strata. Feb. 17-20.
- UW Data Science. Feb 4.
- MEMEX and XData. DC. Jan. 11-16.
- CIDR. Jan 4-7.
- Seattle and Allen AI2 Distinguished Lecture. Nov 13-14.
- Panel on Data Science Education. Dec 4.
- NIPS and NIPS Workshop. Automatic Knowledge Base Construction. Dec. 12-13.
- Deep-Time Data with AGU. Dec. 14. (Matteo will be there!)
- Dagstuhl. PlanBig. Dec. 14-19.
- The Minesweeper paper is our attempt to go beyond worst-case analysis for join algorithms, which is a much stronger pointwise or per-instance guarantee than is typical. We (Hung Ngo, Dung Nguyen, Atri Rudra, and I) develop a new algorithm that we call Minesweeper. The main idea is to formalize the amount of work any algorithm spends certifying (using a set of propositional statements) that the output set is complete (and not, say, a proper subset). We call this set of propositions the certificate. We manage to establish a dichotomy theorem for this stronger notion of complexity: if a query's hypergraph is what Ron Fagin calls beta-acyclic, then Minesweeper runs in time linear in the certificate; if a query is beta-cyclic, then on some instance any algorithm takes time superlinear in the certificate. The results get sharper and more fun. Also, Dung is a superhero and has implemented a variant of this algorithm.
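Roughly, the main guarantee has the following shape (my paraphrase; see the paper for the precise statement and the logarithmic factors it hides): for a beta-acyclic query with certificate \(\mathcal{C}\) and output size \(Z\),
\[ T_{\text{Minesweeper}} = \tilde{O}\big(|\mathcal{C}| + Z\big), \]
while for beta-cyclic queries no comparison-based algorithm can run in time linear in \(|\mathcal{C}|\) on every instance.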
- Facebook. Sep 15-16.
- MEMEX. Sep 15-26.
- DC (CCC and PI). Oct 14-16.
- DC (NSF). Oct 16-17.
- NYC and IBM Watson. Nov. 6-7.
- Master's in Data Science named me a thought leader. It's humbling to be listed with such great people.
- Google. We want to thank Google for funding our research project, Trust, but (Probabilistically) Verify: Toward Tail Extraction. Should be a lot of fun!
- Toshiba. We want to thank Toshiba for funding our work on knowledge base construction. We are very excited that Toshiba engineers are using DeepDive!
- ONR. Thank you for funding my proposal Foundations for Data-driven Systems: Join Algorithms and Random Network Theory. The ONR continues to be one of the best supporters of pure theoretical work.
- DARPA. Thank you to DARPA's XData for supporting my collaborative work on scalable analytics and join processing.
- DARPA. Thank you to DARPA Memex for supporting our work. We are really excited to be part of this program!
- NIH. Thank you to the NIH Big Data to Knowledge (BD2K) program for supporting our work on mobility data led by Scott Delp.
- Modern Data Management Summit in Beijing. Aug 28-30.
- VLDB. We'll present DimmWitted. Sep 1-5.
- UCB. Talk about DimmWitted. Sep 10.
- VLDB14. See you in Hangzhou!
- ICML14. Ji Liu, Stephen J. Wright, Victor Bittorf, Srikrishna Sridhar, and I have some new theory about An Asynchronous Parallel Stochastic Coordinate Descent Algorithm in ICML14. This is a Hogwild!-style algorithm, but it has faster convergence rates than Hogwild! for certain types of sparse data.
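Roughly, and in my paraphrase rather than the paper's exact statement, each core repeatedly picks a coordinate \(j\) uniformly at random and applies
\[ x_j \leftarrow x_j - \frac{\gamma}{L_{\max}} \nabla_{\!j} f(\hat{x}), \]
where \(\hat{x}\) is a possibly stale snapshot of the shared iterate and \(L_{\max}\) bounds the coordinate-wise Lipschitz constants; the analysis quantifies how much staleness the convergence rate can absorb.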
- VLDB14. We've just released a description of our sampling/inference engine DimmWitted: A Study of Main-Memory Statistical Analytics. DimmWitted is the successor to the Hogwild! and Elementary engines.
- Panel. The Role of Database Systems in the Era of Big Data.
- IMDM. Victor Bittorf will give an invited talk describing his work on Impala, on porting MADlib to Impala, and some thoughts about main-memory analytics as described here.
- Baidu Lunch. Aug 22.
- Stanford IoT Workshop. Aug 11.
- Berkeley Workshop on Awesome. Aug. 15?
- MSR. Faculty Summit. July 14-20.
- Moore. Talk. July 28-29.
- SAP. June 16.
- SIGMOD/PODS. Ce is talking about feature selection, and I'll talk about beyond worst-case analysis for joins. June 22-27.
- XDATA. July 7-9.
- EPFL. Research Day. A broad audience talk about DeepDive. June 12.
- Adobe Data Science. June 5.
- LinkedIn. May 22.
- HP Labs. May 15.
- Norcal Database Day. Apr. 23.
- Computer Forum. Apr. 15.
- SWANK Knowledge Workshop. Apr. 17-18.
- Big Data (NAS). Apr. 11-12.
- Stonebraker @ 70. Apr. 12.
- ICDT. I'm giving a keynote about one of my favorite topics, joins! A brief description of my talk is here. I will also present a paper that I very much enjoyed writing about random graphs. Mar 24-28.
- ICDT. I will present a paper in beautiful Athens called The Theory of Zeta Graphs with an Application to Random Networks. The main idea is to understand the basic theory of a declarative, Erdős-Rényi-like model for preferential attachment graphs. It makes some connections to the combinatorics of multiple-valued zeta functions. A version is here.
- DIMACS. A meeting about Big Data. Street art and features! Mar. 17.
- SGD Workshop at IPAM. Feb. 24-28.
- IBM. I'm giving a distinguished lecture at the IBM accelerated discovery lab. Mar. 6.
- Talk at Oracle. All about Joins! Feb. 18.
- Talk at STRATA Silicon Valley. I'm going to talk about some of our thoughts about how to make data analytics easier to deploy. Street art will again feature prominently; this talk builds on Brainwash and a spate of recent demos, including a nowcasting framework, Ringtail, in VLDB and WebDB, and an enterprise feature-selection system with Oracle. It will also describe DeepDive.
- IceCube. IceCube has a press release here and here about Mark Wellons's new manuscript, which has been accepted to NIM A. Mark's algorithm obtains a 66 percent improvement in detecting when more than one particle is active in the detector. Thank you Francis Halzen, Gary Hill, and all of IceCube for supporting this work! See all the amazing stuff IceCube is doing here. IceCube gets the cover of Science! Way to go, IceCube (and Mark)!
- Talks at Stanford or near Stanford in January. Vision group meeting about knowledge base construction (KBC) on 1/13, ISL Big Data about joins on 1/16, at the Infolunch about why our group is working on this set of problems on 1/17, at SRI about KBC on 1/17, and about hardware-aware analytics at the PPL retreat on 1/24.
- NIPS13. Krishna Sridhar, Ji Liu, Ce Zhang, Victor Bittorf, Stephen J. Wright, and I have a paper about an approximate LP solver geared toward LP relaxations. It's an order of magnitude faster than CPLEX for many combinatorial problems.
- SIGMOD. The most recent version of DeepDive's engine is described in a recent SIGMOD paper.
- NSF. Thank you to the NSF for supporting our work on GeoDeepDive with Shanan Peters and Miron Livny. The hope is to build a data platform that can be used to advance Geoscience. It's going to be exciting!
- NSF. Thank you to the NSF for supporting our work New Frontiers in Join Algorithms: Optimality, Noise, and Richer Languages. This is joint with Atri Rudra and Hung Ngo, and lets us continue to go deeper into the theory of joins. How great is that!
- Alfred P. Sloan Research Fellowship. Thank you for your generous support of our research!
- ACM Queue and CACM. The students of the Hazy group put together a manuscript describing their vision for Big Data analytics, Hazy: Making it Easier to Build and Maintain Big-data Analytics, in ACM Queue (invited to CACM).
- NIPS.
- Krishna will talk about solving linear programming relaxations of combinatorial problems approximately in parallel.
- I'm giving a keynote at Big Learning. Street art will feature prominently.
- I'm talking about PaleoDB and DeepDive at KET.
- EngX. I'm giving a talk at EngX about some of our work on PaleoDeepDive. These are short, fun talks.
- CIS. I'm giving a talk at the CIS roundtable about simple hardware-aware analytics.
- AAAI13. I'm giving a talk at the AAAI Fall Symposium on Discovery Informatics about our work on GeoDeepDive.
- SIMONS. Simons Big Data events at Cal. They were off-the-charts good! (The parking tickets are expensive.)
- SIMONS. Parallel and Distributed Algorithms for Inference and Optimization. Check it out!
- ECML-PKDD. I'm giving a keynote at ECML-PKDD 2013.
- SIMONS. Succinct Data Representations and Applications at Simons (Sep. 16-20). Going to an ONR meeting instead.
- RICE. I'm giving a talk about our GeoDeepDive project at Rice. Should be fun!
- SIMONS. Big Data Bootcamp.
- VLDB. We have two demonstrations of feature selection: Ringtail, on unstructured text (formerly Automan; see this video), and Columbus (video soon!).
- I spent the last four years in beautiful Wisconsin. Bucky, you gave me a great home with great students, colleagues, and cheese curds. Thank you!
- VLDB 2013. We have two demos accepted to VLDB: Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System with Oracle people and Ringtail: Nowcasting Made Simple with Michigan people. Both of these are early incarnations of Brainwash, our CIDR13 vision about how feature selection is a data management problem. Thank you, Jordan Novet, for using a picture of my rant from the GraphLab workshop in GigaOm.
- ACL 2013. Vidhya and Ce have an awesome new paper, Understanding Tables in Context using Standard NLP Toolkits. We use this work in GeoDeepDive.
- MPC. My paper with Ben Recht about matrix factorization (Jellyfish) has been accepted to Mathematical Programming Computation; the latest version of the paper is here, and it supersedes the 2011 version.
- BNCOD13 and DEOS. I'm giving an invited tutorial at BNCOD and a keynote talk at DEOS. Dan Olteanu put together a very interesting BNCOD program, and Wolfgang Gatterbauer put together a very fun DEOS program.
- GraphLab. Carlos has a company, and they are awesome. I'll be talking at their workshop.
- SIGMOD13. Towards High-Throughput Gibbs Sampling at Scale: A Study across Storage Managers. The paper is here. This is the latest version of our inference engine that runs GeoDeepDive and DeepDive. Code is here. A new video and code release is coming soon!
- SIGMOD13 (demo). GeoDeepDive: Statistical Inference using Familiar Data-Processing Languages has been accepted as a demonstration! Roughly, it will be a live version of how to build the system in this video and this paper.
- WebDB. Dolan Antenucci, Michael Cafarella, Margaret C. Levenstein, Matthew Shapiro, and I have a paper, Ringtail: Nowcasting Made Easy, in WebDB 2013 with SIGMOD. (Formerly called Automan, with an awesome movie.)
- PODS13. Dan Suciu and I are organizing a Colloquium on Theory Challenges in Big Data. Dan got a great set of speakers, and I'm excited to hear what they say!
- Oracle. Thank you for continuing to support the Hazy Research group! This gift will be used to continue our work on feature engineering for structured analytics.
- Air Force. Thank you to the Air Force for supporting Mathematical Foundations of Secure Computing Clouds; this is the hard work of Jordan Ellenberg (Math), Ben Recht (CS), Tom Ristenpart (CS), Rob Nowak (EE), and Steve Wright (CS).
- StarAI. Our paper Using Commonsense Knowledge to Automatically Create (Noisy) Training Examples from Text, with Sriraam Natarajan, José Picado, Tushar Khot, Kristian Kersting, and Jude Shavlik, has been accepted to StarAI with AAAI 2013.
- SLG2013 at ICML. There is a great program at the workshop on structured learning this year; I'll be talking about some of Ce and Vidhya's work on GeoDeepDive.
- I am giving an invited talk at Microsoft's Machine Learning summit (Paris in April!) and a keynote at ECML-PKDD 2013!
- CIDR13. Brainwash is a vision paper that argues that feature engineering is a database problem.
- NIPS12 (spotlight). An updated version of HottTopix, Factoring Nonnegative Matrices with Linear Programs, is now on arXiv with a cleaner proof, but Nicolas Gillis did a much better analysis of the case with duplicates and noise!
- Automan. Automan is a system whose goal is to predict real-life indicators using social media. Working with economists and Mike Cafarella (CS) at Michigan, we were able to show that Twitter data can predict the number of weekly unemployment insurance claims better than traditional methods. Another goal is to understand to what extent social media provides a truly novel signal for predictive tasks. In this task, it looks like it may! The video is here.
- DeepDive. A first published description of our DeepDive system has appeared: Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference. This is the hard work of Feng Niu, Ce Zhang, and Jude Shavlik. The demonstration is up on the Web. Watch the video and play with the live content.
- NIPS 2012. Provable Non-negative Matrix Factorization. Recently, in a beautiful paper, Arora, Ge, Kannan, and Moitra described a class of matrices and an error model for which non-negative matrix factorization provably returns the right answer. In this manuscript, Victor Bittorf, Ben Recht, Joel Tropp, and I describe an alternate approach to establishing this result with three characteristics: (1) better noise tolerance, (2) a more general noise model, and, most importantly, (3) efficient, scalable algorithms. We demonstrate that our algorithm achieving these properties can factor multigigabyte matrices in minutes on a commodity workstation. The manuscript and an implementation in the Hogwild! framework are available, along with the paper version. A revised full version, CRC, and a YouTube video for our spotlight will be out soon. Also, Nicolas Gillis has a nice paper that follows up and improves our analysis; fresh on the arXiv!
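From memory, and so a rough sketch rather than the paper's exact formulation: with the rows of \(X\) normalized to sum to one and \(p\) a generic cost vector, the factorization-localizing program looks roughly like
\[ \min_{C \ge 0} \; p^{T}\operatorname{diag}(C) \quad \text{s.t.} \quad CX = X,\;\; \operatorname{tr}(C) = r,\;\; C_{jj} \le 1,\;\; C_{ij} \le C_{jj}, \]
after which the \(r\) rows of \(X\) with the largest diagonal entries of \(C\) are taken as the factor; in the noisy case the constraint \(CX = X\) relaxes to a norm bound on \(CX - X\).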
- ICDM 2012. Feng Niu, Ce Zhang, Jude Shavlik, and I have a new paper about how to scale Markov Logic inference to the web (how we built systems like DeepDive). Our idea is to combine the best-of-breed tools for each subtask; the challenge is how to combine these tools at scale, and for that we use Lagrangian relaxation, or dual decomposition, an old idea from mathematical programming. One cute part of our approach is that we can encode much of the Lagrangian relaxation algorithm using standard SQL queries. A toy sketch of the idea is below.
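To give the flavor of dual decomposition, here is my own toy illustration in Python (not the paper's SQL encoding): split a shared variable into per-subproblem copies, solve each copy independently, and use a Lagrange multiplier to push the copies back into agreement.

```python
# Toy dual decomposition: minimize (x - 3)^2 + (x - 7)^2 by giving each
# term its own copy of x and penalizing disagreement with a multiplier lam.
def dual_decomposition(steps=200, alpha=0.5):
    lam = 0.0
    for _ in range(steps):
        # Subproblem 1: argmin over x1 of (x1 - 3)^2 + lam * x1 (closed form)
        x1 = 3.0 - lam / 2.0
        # Subproblem 2: argmin over x2 of (x2 - 7)^2 - lam * x2 (closed form)
        x2 = 7.0 + lam / 2.0
        # Dual subgradient step: move lam to push the copies into agreement
        lam += alpha * (x1 - x2)
    return x1, x2, lam

print(dual_decomposition())  # both copies converge to the consensus x = 5
```

In the paper's setting, as I understand it, each subproblem is a full inference task handled by a specialized algorithm, and it is the multiplier bookkeeping around those subproblems that gets pushed into SQL.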
- XLDI. I'll be at XLDI, where I plan to beg smart PL people for help with a problem that my research group struggles with on a daily basis: Feature Engineering. My contention is that the biggest bottleneck in building trained systems may not be more sophisticated machine learning, but allowing one to get signal into the system more rapidly. The CS angle is to accelerate the cycle the engineer goes through of exploring data sources, extracting relevant information, and then evaluating their changes; this cycle, we contend, has simple machine learning at its core, but the challenges we see are mostly data engineering and processing challenges.
- PhD! Feng Niu graduated! (Co-advised by AnHai Doan.) The entire Hazy group is indebted to Feng for his intelligence and creativity. Together with Ce Zhang, Feng was the driving force behind DeepDive, GeoDeepDive, and many of our other projects. Google is lucky to have him!
- VLDB 2012. Arun Kumar and I have a paper, Probabilistic Management of OCR Data using an RDBMS, accepted for publication in VLDB 2012 in Istanbul. The paper is our first attempt to understand the implications of combining rich content models like OCR with the sophisticated querying capabilities of an RDBMS. The state-of-the-art models underlying these content models are probabilistic, e.g., OCRopus for Google Books. These models have very high quality, but the sheer size of these models can destroy query processing performance. Arun's idea is to use some ideas from statistics to compress the model; in turn, this allows him to trade run time for quality. Thank you to the Microsoft Jim Gray Systems Lab for supporting Arun's work. (Code and data here.)
- The MADlib Analytics Library or MAD Skills, the SQL. Joseph M. Hellerstein, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleks Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, Arun Kumar, and I have a paper about the status of MADlib, which is an awesome open-source effort led by Greenplum. Thank you, EMC!
- VLDS. We will also present an overview of our DeepDive work at VLDS2012! See you in Istanbul!
- MMDS. Modern Massive Data Sets is this week -- what an awesome workshop!
- ACL 2012. Feng Niu, Ce Zhang, Jude Shavlik, and I have a paper that builds on our experiences with DeepDive to study the impact of Big Data with distant supervision versus crowd-sourced data on the quality of relationship extraction from the Web. The paper title is Big Data versus the Crowd: Looking for Relationships in All the Right Places.
- COLT 2012. Ben Recht and I have a new manuscript. The starting point is that practitioners who run sequential random-sampling algorithms (like stochastic gradient) sample from their data without replacement. In many cases, people have observed that sampling without replacement empirically converges to an optimal value faster than with-replacement approaches. There is, however, a gap in our current theory, which provides faster convergence rates for with-replacement sampling than for without-replacement sampling. This paper takes a step toward closing this gap. We establish that without-replacement methods do converge faster in several sampling models and propose a general inequality, a symmetrized (noncommutative) arithmetic-geometric-mean inequality, that would close this gap in many cases. CRC coming soon!
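For flavor: the scalar arithmetic-geometric-mean inequality is \(\sqrt{ab} \le (a+b)/2\), and the conjectured matrix analogue (stated roughly from memory, so check the manuscript for the exact form) says that for positive semidefinite \(A_1,\dots,A_n\), averaging length-\(k\) products over distinct indices is dominated in norm by the with-replacement average:
\[ \left\| \frac{(n-k)!}{n!} \sum_{\substack{j_1,\dots,j_k\\ \text{distinct}}} A_{j_1}\cdots A_{j_k} \right\| \;\le\; \left\| \Big(\frac{1}{n}\sum_{j=1}^{n} A_j\Big)^{\!k} \right\|. \]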
- SIGMOD EVENTS. New Researcher Symposium and Undergraduate Research Poster Competition. Please come to these two events at this week's SIGMOD to encourage the next generation of database nerds!
- The program of the NRS symposium and slides from some presenters at the symposium are here.
- PODS 2012. Hung Ngo, Ely Porat, Atri Rudra, and I have a new PODS paper called "Worst-Case Optimal Join Algorithms." For all join queries, we are able to provide the first optimal algorithm in terms of data complexity. The key to the result is an algorithmic proof of a pair of inequalities from discrete geometry: the Loomis-Whitney inequality from the 1940s and the more recent Bollobás-Thomason inequality. We show that these classical inequalities are equivalent to the beautiful recent results of Atserias, Grohe, and Marx. Our work also provides an algorithmic proof of these results; in contrast, all previous proofs are non-constructive. Thank you to the program committee for selecting this paper for the Best Paper Award.
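For context, the Atserias-Grohe-Marx bound says: if \((x_1,\dots,x_m)\) is a fractional edge cover of the query's hypergraph (every attribute is covered with total weight at least one), then the output of joining relations \(R_1,\dots,R_m\) satisfies
\[ |R_1 \bowtie \cdots \bowtie R_m| \;\le\; \prod_{i=1}^{m} |R_i|^{x_i}, \]
and worst-case optimality means matching the best such bound in running time, up to lower-order factors.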
- SIGMOD 2012. Aaron Feng, Arun Kumar, Ben Recht, and I have a paper, Towards a Unified Architecture for In-RDBMS Analytics, accepted to SIGMOD 2012. The code for Bismarck is available here (so is the data), along with the full version of the paper. Thank you to Greenplum and Oracle for valuable discussions that helped us understand this problem!
- Probabilistic DB Book is out. My coauthors (Dan Suciu, Dan Olteanu, and Christoph Koch) have written a wonderful book about probabilistic databases in Morgan and Claypool's Synthesis Lectures on Data Management.
- ICDE 2012. Xixuan (Aaron) Feng, Fei Chen and Min Wang from HP Labs-China, and I have authored a paper titled Optimizing Statistical Information Extraction Programs Over Evolving Text about efficiently maintaining conditional random fields on evolving corpora, i.e., as new documents are added or old documents are modified. The paper is accepted to ICDE 2012 in Washington, DC. (CRC and Full Version.)
- TODS in 2012. Dan Suciu and I have a paper, Understanding Cardinality Estimation using Entropy Maximization, that has been accepted to TODS. I would like to thank the referees, who produced thorough reviews that helped us to improve the paper. It is the journal version of our PODS 2010 paper. A preliminary version is here.
- NIPS 2011
- Big Learning. I'm giving a keynote at Big Learning. The organizers have put together some great keynotes, and one given by some database guy.
- HOGWILD! Ben Recht, Steve Wright, Feng Niu, and I have a new way of parallelizing incremental gradient algorithms. The biggest obstacle to achieving linear speedups is lock contention, and the Hogwild! approach is simple: get rid of locking entirely! We prove that as long as the data are sparse, Hogwild! achieves linear speedups. We demonstrate our theory on a diverse set of problems, including text classification with support vector machines, cut problems from vision applications, and recommendation via matrix factorization. (Accepted to NIPS 2011.)
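A minimal sketch of the pattern in Python (my own toy under the sparse-data assumption, not our released code; CPython's GIL means it illustrates the racy update pattern rather than real speedups):

```python
import threading
import numpy as np

def hogwild_sgd(examples, dim, n_threads=4, lr=0.1, epochs=5):
    """examples: list of (indices, values, label) sparse rows; squared loss."""
    w = np.zeros(dim)  # shared weights, updated by every thread without locking

    def worker(chunk):
        for _ in range(epochs):
            for idx, vals, y in chunk:
                pred = sum(w[i] * v for i, v in zip(idx, vals))
                g = pred - y  # gradient scale for squared loss
                for i, v in zip(idx, vals):
                    w[i] -= lr * g * v  # racy write; sparsity keeps collisions rare

    threads = [threading.Thread(target=worker, args=(examples[t::n_threads],))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

The point of the analysis is that when examples touch few, mostly disjoint coordinates, these unsynchronized writes cost almost nothing in convergence.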
- Victor: Mathematical Optimization + Large Data Projects
- Jellyfish. Ben Recht and I have released some software for large-scale matrix completion. If your algorithm is faster on billion-entry matrices, send it to us so we can learn how to go faster. Currently, we are two orders of magnitude faster to the same error (RMSE) than any algorithm we know about. The algorithm is essentially buzzword-complete: a large-scale parallel stochastic gradient algorithm for nonconvex relaxations. A generic sketch of the update appears after this list.
- Code is available (with data generators). Paper on Optimization Online.
- Thank you, Office of Naval Research and Physical Layer Systems for supporting this work!
- Victor-SQL integrates incremental schemes with an RDBMS via a (hopefully) easy-to-use Python interface. Available at the Victor website.
- VLDB 2011. Look for three papers in the upcoming VLDB 2011:
- "Incrementally Maintaining Classification using an RDBMS" by M. Levent Koc and me. The source code for this project will be released as part of Hazy very soon.
- "Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS" by Feng Niu, AnHai Doan, Jude Shavlik, and me. A prototype version of the system is available off the Hazy website.
- "Automatic Optimization for MapReduce Programs" by Mike Cafarella, Eaman Jahani, and me.
- Tuffy and Manimal get some press. Thank you to those who have posted references to Tuffy and Manimal on Twitter, Y Combinator, O'Reilly online, and Radar Online. If you use our stuff and have any feedback (positive or negative), please send us a note.
- Best of PODS 2010: our paper, Transducing Markov Sequences, has been invited to a special issue of JACM for the best papers of PODS 2010.
- Best of PODS 2010: our paper, Understanding Cardinality Estimation using Entropy Maximization, has been invited to a special issue of TODS for the best papers of PODS 2010.
- A new WebDB 2010 paper with Michael Cafarella, Manimal: Relational Optimization for Data-Intensive Programs. Manimal is a hybrid system that does relational-style optimization for MapReduce programs by performing a static analysis of Java code.
- ACM SIGMOD was kind enough to give me the ACM SIGMOD Jim Gray Thesis Award for my dissertation, Managing Large-Scale Probabilistic Databases. Hearing Jim Gray talk about his work on the World Wide Telescope project was what inspired me to study databases. It is a real honor to me that his name is on this award.
- I have two new PODS 2010 papers:
- Queries and Materialized Views on Probabilistic Databases with Nilesh Dalvi and Dan Suciu appears in the Journal of Computer and System Sciences. This paper is the best explanation of our work on using materialized views to answer queries in a probabilistic database.