Matei Zaharia
Assistant Professor, Computer Science
matei@cs.stanford.edu | Google Scholar | Twitter
Office: Gates 412
Curriculum Vitæ
I’m an assistant professor at Stanford CS, where I work on computer systems and machine learning as part of Stanford DAWN. I’m also co-founder and Chief Technologist of Databricks, a data and AI platform startup. Before joining Stanford, I was an assistant professor at MIT.
Interests: I’m interested in computer systems for emerging large-scale workloads such as machine learning, big data analytics and cloud computing. In DAWN, we’re working on infrastructure for usable machine learning to make it dramatically easier to bring ML applications to production: in practice, usability issues are often much larger obstacles than ML algorithms. My work includes software runtimes, quality assurance tools and systems optimizations for ML. Beyond usability, I am interested in data privacy as the flip side of big data, and have worked on systems that can provide scalable privacy for communication, Internet queries and SaaS applications.
Impact: Our group works closely with the open source community to test and publish our ideas. During my PhD, I started the Apache Spark project, which is now one of the most widely used frameworks for distributed data processing, and co-started other widely used datacenter software such as Apache Mesos, Alluxio, and Spark Streaming. At Stanford, we developed DAWNBench, a machine learning performance competition that drew submissions from the top industry groups and influenced the industry-standard MLPerf, and we are continuing to develop open source software such as Weld, Sparser, NoScope, and MacroBase DIFF.
Some of our work has been featured in Wired (1/2/3), Fortune, TechCrunch, The Wall Street Journal, The Register, Ars Technica, Motherboard, ZDNet, The Economist, and Forbes.
Teaching
- CS 320 (Value of Data and AI): winter 2021.
- CS 245 (Principles of Data-Intensive Systems): winter 2021.
- CS 320 (Value of Data and AI): winter 2020.
- CS 245 (Principles of Data-Intensive Systems): winter 2020.
- CS 245 (Principles of Data-Intensive Systems): spring 2019.
- CS 349D (Cloud Computing Technology): fall 2018.
- CS 149 (Parallel Computing): winter 2018.
PhD Students and Postdocs
- Cody Coleman (with Peter Bailis)
- Daniel Kang (with Peter Bailis)
- Deepak Narayanan
- Deepti Raghavan (with Phil Levis)
- Fiodar Kazhamiaka (with Peter Bailis)
- Firas Abuzaid (with Peter Bailis)
- Gina Yuan (with David Mazieres)
- James Thomas (with Pat Hanrahan)
- Keshav Santhanam
- Lingjiao Chen (with James Zou)
- Omar Khattab (with Chris Potts)
- Peter Kraft (with Peter Bailis)
- Pratiksha Thaker
- Trevor Gale
Past PhD Students and Postdocs
- Shoumik Palkar
- Zhihao Jia (coadvised with Alex Aiken)
Publications
2021
- Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics. D. Kang, A. Mathur, T. Veeramacheneni, P. Bailis and M. Zaharia. To appear at VLDB 2021. (preprint)
- Express: Lowering the Cost of Metadata-hiding Communication with Cryptographic Privacy. S. Eskandarian, H. Corrigan-Gibbs, M. Zaharia and D. Boneh. To appear at USENIX Security 2021. (preprint)
- Contracting Wide-area Network Topologies to Solve Flow Problems Quickly. F. Abuzaid, S. Kandula, B. Arzani, I. Menache, P. Bailis and M. Zaharia. To appear at NSDI 2021.
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. M. Armbrust, A. Ghodsi, R. Xin and M. Zaharia. To appear at CIDR 2021.
- Challenges and Opportunities for Autonomous Vehicle Query Systems. F. Kazhamiaka, M. Zaharia and P. Bailis. To appear at CIDR 2021.
2020
- FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply. L. Chen, M. Zaharia and J. Zou. NeurIPS 2020. Oral. (preprint)
- Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads. D. Narayanan, K. Santhanam, F. Kazhamiaka, A. Phanishayee and M. Zaharia. OSDI 2020. (preprint)
- Sparse GPU Kernels for Deep Learning. T. Gale, M. Zaharia, C. Young and E. Elsen. Supercomputing 2020. (preprint)
- DIFF: A Relational Interface for Large-Scale Data Explanation (extended version). F. Abuzaid, P. Kraft, S. Suri, E. Gan, E. Xu, A. Shenoy, A. Ananthanarayan, J. Sheu, E. Meijer, X. Wu, J. Naughton, P. Bailis, and M. Zaharia. VLDB Journal Special Issue.
- Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores. M. Armbrust, T. Das, L. Sun, B. Yavuz, S. Zhu, M. Murthy, J. Torres, H. van Hovell, A. Ionescu, A. Luszczak, M. Switakowski, M. Szafranski, X. Li, T. Ueshin, M. Mokhtar, P. Boncz, A. Ghodsi, S. Paranjpye, P. Senster, R. Xin, M. Zaharia. VLDB 2020.
- Approximate Selection with Guarantees using Proxies. D. Kang, E. Gan, P. Bailis, T. Hashimoto and M. Zaharia. VLDB 2020. (preprint)
- BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics. D. Kang, P. Bailis and M. Zaharia. VLDB 2020. (preprint)
- ObliDB: Oblivious Query Processing for Secure Databases. S. Eskandarian and M. Zaharia. VLDB 2020. (preprint)
- Analysis and Exploitation of Dynamic Pricing in the Public Cloud for ML Training. D. Narayanan, K. Santhanam, F. Kazhamiaka, A. Phanishayee and M. Zaharia. VLDB DISPA Workshop 2020.
- To Call or not to Call? Using ML Prediction APIs more Accurately and Economically. L. Chen, M. Zaharia and J. Zou. ICML EcoPaDL Workshop 2020. (video)
- Machine Learning to Classify Intracardiac Electrical Patterns During Atrial Fibrillation. M. Alhusseini, F. Abuzaid, A. Rogers, J. Zaman, T. Baykaner, P. Clopton, P. Bailis, M. Zaharia, P. Wang, W-J. Rappel, and S. Narayan. Circulation: Arrhythmia and Electrophysiology, 2020.
- Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle. A. Chen, A. Chow, A. Davidson, A. DCuncha, A. Ghodsi, S.A. Hong, A. Konwinski, C. Mewald, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, A. Singh, F. Xie, M. Zaharia, R. Zang, J. Zheng and C. Zumar. SIGMOD DEEM Workshop 2020. (video)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. O. Khattab and M. Zaharia. SIGIR 2020. (preprint)
- POSH: A Data-Aware Shell. D. Raghavan, S. Fouladi, P. Levis and M. Zaharia. USENIX ATC 2020.
- Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads. G. Yuan, S. Palkar, D. Narayanan and M. Zaharia. USENIX ATC 2020.
- Spectral Lower Bounds on the I/O Complexity of Computation Graphs. S. Jain and M. Zaharia. SPAA 2020. (preprint)
- Selection via Proxy: Efficient Data Selection for Deep Learning. C. Coleman, C. Yeh, S. Mussmann, B. Mirzasoleiman, P. Bailis, P. Liang, J. Leskovec and M. Zaharia. ICLR 2020. (preprint) (blog)
- Fleet: A Framework for Massively Parallel Streaming on FPGAs. J. Thomas, P. Hanrahan and M. Zaharia. ASPLOS 2020.
- Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference. P. Kraft, D. Kang, D. Narayanan, S. Palkar, P. Bailis and M. Zaharia. MLSys 2020. (preprint)
- Model Assertions for Monitoring and Improving ML Models. D. Kang, D. Raghavan, P. Bailis and M. Zaharia. MLSys 2020.
- Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. Z. Jia, S. Lin, M. Gao, M. Zaharia and A. Aiken. MLSys 2020.
- MLPerf Training Benchmark. P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G-Y. Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V. J. Reddi, T. Robie, T. St. John, C-J. Wu, L. Xu, C. Young, and M. Zaharia. MLSys 2020. (preprint)
2019
- Optimizing Data-Intensive Computations in Existing Libraries with Split Annotations. S. Palkar and M. Zaharia. SOSP 2019. (blog)
- TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions. Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken. SOSP 2019.
- PipeDream: Generalized Pipeline Parallelism for DNN Training. D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, P. Gibbons, and M. Zaharia. SOSP 2019.
- Outsourcing Everyday Jobs to Thousands of Cloud Functions with gg. S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, and K. Winstein. USENIX ;login:, 44(3), September 2019.
- DIFF: A Relational Interface for Large-Scale Data Explanation. F. Abuzaid, P. Kraft, S. Suri, E. Gan, E. Xu, A. Shenoy, A. Ananthanarayan, J. Sheu, E. Meijer, X. Wu, J. Naughton, P. Bailis, and M. Zaharia. VLDB 2019.
- Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark. C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re and M. Zaharia. SIGOPS Operating Systems Review, 53(1):14-25, July 2019.
- From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers. S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, and K. Winstein. USENIX ATC 2019.
- LIT: Learned Intermediate Representation Training for Model Compression. A. Koratana, D. Kang, P. Bailis and M. Zaharia. ICML 2019. (blog)
- Debugging Machine Learning via Model Assertions. D. Kang, D. Raghavan, P. Bailis and M. Zaharia. ICLR DebugML Workshop 2019. Best Student Paper. (blog)
- To Index or Not to Index: Optimizing Exact Maximum Inner Product Search. F. Abuzaid, G. Sethi, P. Bailis and M. Zaharia. ICDE 2019.
- Beyond Data and Model Parallelism for Deep Neural Networks. Z. Jia, M. Zaharia and A. Aiken. SysML 2019.
- Optimizing DNN Computation with Relaxed Graph Substitutions. Z. Jia, J. Thomas, T. Warszawski, M. Gao, M. Zaharia and A. Aiken. SysML 2019.
- Challenges and Opportunities in DNN-Based Video Analytics: A Demonstration of the BlazeIt Video Query Engine (demo). D. Kang, P. Bailis and M. Zaharia. CIDR 2019.
2018
- Accelerating the Machine Learning Lifecycle with MLflow. M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S.A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, and C. Zumar. IEEE Data Engineering Bulletin, 41(4), December 2018.
- Model Assertions for Debugging Machine Learning. D. Kang, D. Raghavan, P. Bailis and M. Zaharia. NeurIPS Systems for ML Workshop 2018.
- Analysis of the Time-To-Accuracy Metric and Entries in the DAWNBench Deep Learning Benchmark. C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re and M. Zaharia. NeurIPS Systems for ML Workshop 2018. (blog)
- Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. NeurIPS Systems for ML Workshop 2018.
- Exploring the Use of Learning Algorithms for Efficient Performance Profiling. S. Palkar, S. Suri, P. Bailis and M. Zaharia. NeurIPS ML for Systems Workshop 2018.
- Block-wise Intermediate Representation Training for Model Compression. A. Koratana, D. Kang, P. Bailis and M. Zaharia. NeurIPS CDNNRIA Workshop 2018.
- Filter Before You Parse: Faster Analytics on Raw Data with Sparser. S. Palkar, F. Abuzaid, P. Bailis and M. Zaharia. VLDB 2018. (blog)
- Evaluating End-to-End Optimization for Data Analytics Applications in Weld. S. Palkar, J. Thomas, D. Narayanan, P. Thaker, R. Palamuttam, P. Negi, A. Shanbhag, M. Schwarzkopf, H. Pirk, S. Amarasinghe, S. Madden and M. Zaharia. VLDB 2018. (blog)
- MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis. M. Vartak, J. da Trindade, S. Madden and M. Zaharia. SIGMOD 2018.
- Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica and M. Zaharia. SIGMOD 2018.
- Accelerating Model Search with Model Batching (poster). D. Narayanan, K. Santhanam and M. Zaharia. SysML 2018.
- BlazeIt: An Optimizing Query Engine for Video at Scale (poster). D. Kang, P. Bailis and M. Zaharia. SysML 2018.
- DAWNBench: An End-to-End Deep Learning Benchmark and Competition (poster). C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Re and M. Zaharia. SysML 2018.
2017
- Making Caches Work for Graph Analytics. Y. Zhang, V. Kiriansky, C. Mendis, M. Zaharia and S. Amarasinghe. IEEE BigData 2017. Best Student Paper.
- DAWNBench: An End-to-End Deep Learning Benchmark and Competition. C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Re and M. Zaharia. NIPS SysML 2017. (blog)
- DIY Hosting for Online Privacy. S. Palkar and M. Zaharia. HotNets 2017.
- Stadium: A Distributed Metadata-Private Messaging System. N. Tyagi, Y. Gilad, D. Leung, M. Zaharia and N. Zeldovich. SOSP 2017.
- NoScope: Optimizing Neural Network Queries over Video at Scale. D. Kang, J. Emmons, F. Abuzaid, P. Bailis and M. Zaharia. VLDB 2017. (blog)
- Splinter: Practical Private Queries on Public Data. F. Wang, C. Yun, S. Goldwasser, V. Vaikuntanathan and M. Zaharia. NSDI 2017.
- Weld: A Common Runtime for High Performance Data Analytics. S. Palkar, J. Thomas, A. Shanbhag, D. Narayanan, H. Pirk, M. Schwarzkopf, S. Amarasinghe and M. Zaharia. CIDR 2017.
2016
- Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale. F. Abuzaid, J. Bradley, F. Liang, A. Feng, L. Yang, M. Zaharia and A. Talwalkar. NIPS 2016.
- Apache Spark: A Unified Engine for Big Data Processing. M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica. Communications of the ACM, 59(11):56-65, November 2016.
- Voodoo – A Vector Algebra for Portable Database Performance on Modern Hardware. H. Pirk, O. Moll, M. Zaharia and S. Madden. VLDB 2016.
- Matrix Computations and Optimizations in Apache Spark. R.B. Zadeh, X. Meng, A. Staple, B. Yavuz, L. Pu, S. Venkataraman, E. Sparks, A. Ulanov and M. Zaharia. KDD 2016. Best Paper Award Runner-Up.
- GraphFrames: An Integrated API for Mixing Graph and Relational Queries. A. Dave, A. Jindal, L.E. Li, R. Xin, J. Gonzalez and M. Zaharia. GRADES 2016.
- ModelDB: A System for Machine Learning Model Management. M. Vartak, H. Subramanyam, W.E. Lee, S. Viswanathan, S. Husnoo, S. Madden and M. Zaharia. HILDA 2016.
- SparkR: Scaling R Programs with Spark. S. Venkataraman, Z. Yang, D. Liu, E. Liang, X. Meng, R. Xin, A. Ghodsi, M. Franklin, I. Stoica and M. Zaharia. SIGMOD 2016.
- MLlib: Machine Learning in Apache Spark. X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. JMLR, 17(34):1–7, 2016.
- FairRide: Near-Optimal, Fair Cache Sharing. Q. Pu, H. Li, M. Zaharia, A. Ghodsi, and I. Stoica. NSDI 2016.
2015
- Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis. J. van den Hooff, D. Lazar, M. Zaharia and N. Zeldovich. SOSP 2015.
- Scaling Spark in the Real World: Performance and Usability. M. Armbrust, T. Das, A. Davidson, A. Ghodsi, A. Or, J. Rosen, I. Stoica, P. Wendell, R. Xin and M. Zaharia. VLDB 2015.
- Spark SQL: Relational Data Processing in Spark. M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, X. Meng, T. Kaftan, M. Franklin, A. Ghodsi and M. Zaharia. SIGMOD 2015.
2014
- Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. H. Li, A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. SoCC 2014, November 2014.
- A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from Next-Generation Sequencing of Clinical Samples. S.N. Naccache, S. Federman, N. Veeraraghavan, M. Zaharia, D. Lee, E. Samayoa, J. Bouquet, A.L. Greninger, K. Luk, B. Enge, D.A. Wadford, S.L. Messenger, G.L. Genrich, K. Pellegrino, G. Grard, E. Leroy, B.S. Schneider, J.N. Fair, M.A. Martinez, P. Isa, J.A. Crump, J.L. DeRisi, T. Sittler, J. Hackett Jr., S. Miller and C.Y. Chiu. Genome Research, 24(7):1180-92, June 2014.
2013
- An Architecture for Fast and General Data Processing on Large Clusters. M. Zaharia. PhD Dissertation, 2014 ACM Doctoral Dissertation Award.
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. SOSP 2013.
- Sparrow: Distributed, Low-Latency Scheduling. K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. SOSP 2013.
- Shark: SQL and Rich Analytics at Scale. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. SIGMOD 2013.
- Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints. A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. EuroSys 2013.
2012
- Multi-Resource Fair Queueing for Packet Processing. A. Ghodsi, V. Sekar, M. Zaharia and I. Stoica. SIGCOMM 2012. Best Paper Award.
- Fast and Interactive Analytics over Hadoop Data with Spark. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. USENIX ;login:, August 2012.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica. HotCloud 2012.
- Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems. L. Martignoni, P. Poosankam, M. Zaharia, J. Han, S. McCamant, D. Song, V. Paxson, A. Perrig, S. Shenker, I. Stoica. USENIX ATC 2012.
- Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo). C. Engle, A. Lupher, R. Xin, M. Zaharia, M. Franklin, S. Shenker, I. Stoica. SIGMOD 2012. Best Demo Award.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. NSDI 2012. Best Paper Award and Honorable Mention for Community Award.
Awards
- EuroSys Test of Time Paper Award (for Delay Scheduling), 2020
- Presidential Early Career Award for Scientists and Engineers (PECASE), 2019
- NSF CAREER Award, 2017
- VMware Systems Research Award, 2016
- Google Faculty Research Award, 2015
- ACM Doctoral Dissertation Award, 2014
- U. Waterloo Faculty of Mathematics Young Alumni Achievement Medal, 2014
- Daytona GraySort World Record, 2014
- David J. Sakrison Prize for Research, UC Berkeley, 2013
- Best Paper Awards at SIGCOMM 2012 and NSDI 2012
Service
- Board Member: MLSys Conference.
- Program Co-Chair: DISPA Workshop at VLDB 2020, MLOps Workshop at MLSys 2020, SysML 2019.
- Program Committee Member: NSDI 2021, VLDB 2021, ICML 2020, HotCloud 2020, NeurIPS 2019, SIGMOD 2019, OSDI 2018, SIGMOD 2018, NSDI 2018, SoCC 2017, SIGMOD 2016, SIGCOMM 2016, NSDI 2015.
- Invited Reviewer: CACM, TPDS, VLDB.
Open Source
Almost all of my work is open source:
- The Spark engine became an Apache project at spark.apache.org. We have also open sourced subsequent projects including Shark, Spark SQL, MLlib, GraphFrames and Spark Streaming.
- MLflow is a new open source project for managing the machine learning development process.
- The Mesos cluster manager is a top-level Apache project.
- LATE straggler mitigation and the Hadoop Fair Scheduler are included in Apache Hadoop.
- The SNAP sequence aligner is available on GitHub.
More recent projects are available on the Weld and FutureData websites.