I’m an assistant professor at Stanford CS, where I work on computer systems and machine learning as part of Stanford DAWN. I’m also co-founder and Chief Technologist of Databricks, a data and AI platform startup. Before joining Stanford, I was an assistant professor at MIT.
Interests: I’m interested in computer systems for emerging large-scale workloads such as machine learning, big data analytics and cloud computing. In DAWN, we’re working on infrastructure for usable machine learning to make it dramatically easier to bring ML applications to production: these issues are often much larger obstacles than ML algorithms in practice. My work includes software runtimes, quality assurance tools and systems optimizations for ML. Beyond usability, I am interested in data privacy as the flip side to big data, and have worked on systems that can provide scalable privacy for communication, Internet queries and SaaS applications.
Impact: Our group works closely with the open source community to test and publish our ideas. During my PhD, I started the Apache Spark project, which is now one of the most widely used frameworks for distributed data processing, and co-started other widely used datacenter software such as Apache Mesos, Alluxio, and Spark Streaming. At Stanford, we developed DAWNBench, a machine learning performance competition that drew submissions from the top industry groups and influenced the industry-standard MLPerf, and we are continuing to develop open source software such as Weld, Sparser, NoScope, and MacroBase DIFF.
- CS 320 (Value of Data and AI): winter 2020.
- CS 245 (Principles of Data-Intensive Systems): winter 2020.
- CS 245 (Principles of Data-Intensive Systems): spring 2019.
- CS 349D (Cloud Computing Technology): fall 2018.
- CS 149 (Parallel Computing): winter 2018.
- CS 349D (Cloud Computing Technology): fall 2017.
- CS 341 (Projects in Mining Massive Datasets): spring 2017.
- Cody Coleman (with Peter Bailis)
- Daniel Kang (with Peter Bailis)
- Deepak Narayanan
- Deepti Raghavan (with Phil Levis)
- Fiodar Kazhamiaka (with Peter Bailis)
- Firas Abuzaid (with Peter Bailis)
- James Thomas (with Pat Hanrahan)
- Keshav Santhanam
- Peter Kraft (with Peter Bailis)
- Pratiksha Thaker
- Shoumik Palkar
- Trevor Gale
- Zhihao Jia (with Alex Aiken)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. O. Khattab and M. Zaharia. To appear at SIGIR 2020. (preprint)
- DASH: Data-Aware Shell. D. Raghavan, S. Fouladi, P. Levis and M. Zaharia. To appear at USENIX ATC 2020.
- Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads. G. Yuan, S. Palkar, D. Narayanan and M. Zaharia. To appear at USENIX ATC 2020.
- Spectral Lower Bounds on the I/O Complexity of Computation Graphs. S. Jain and M. Zaharia. To appear at SPAA 2020. (preprint)
- BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics. D. Kang, P. Bailis and M. Zaharia. VLDB 2020. (preprint)
- ObliDB: Oblivious Query Processing for Secure Databases. S. Eskandarian and M. Zaharia. VLDB 2020. (preprint)
- Selection via Proxy: Efficient Data Selection for Deep Learning. C. Coleman, C. Yeh, S. Mussmann, B. Mirzasoleiman, P. Bailis, P. Liang, J. Leskovec and M. Zaharia. ICLR 2020. (preprint) (blog)
- Fleet: A Framework for Massively Parallel Streaming on FPGAs. J. Thomas, P. Hanrahan and M. Zaharia. ASPLOS 2020.
- Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference. P. Kraft, D. Kang, D. Narayanan, S. Palkar, P. Bailis and M. Zaharia. MLSys 2020. (preprint)
- Model Assertions for Monitoring and Improving ML Models. D. Kang, D. Raghavan, P. Bailis and M. Zaharia. MLSys 2020.
- Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. Z. Jia, S. Lin, M. Gao, M. Zaharia and A. Aiken. MLSys 2020.
- MLPerf Training Benchmark. P. Mattson, C. Cheng, C. Coleman, G. Diamos, P. Micikevicius, D. Patterson, H. Tang, G-Y. Wei, P. Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, A. Hock, X. Huang, B. Jia, D. Kang, D. Kanter, N. Kumar, J. Liao, D. Narayanan, T. Oguntebi, G. Pekhimenko, L. Pentecost, V. J. Reddi, T. Robie, T. St. John, C-J. Wu, L. Xu, C. Young, and M. Zaharia. MLSys 2020. (preprint)
- Optimizing Data-Intensive Computations in Existing Libraries with Split Annotations. S. Palkar and M. Zaharia. SOSP 2019. (blog)
- TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions. Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken. SOSP 2019.
- PipeDream: Generalized Pipeline Parallelism for DNN Training. D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, P. Gibbons, and M. Zaharia. SOSP 2019.
- Outsourcing Everyday Jobs to Thousands of Cloud Functions with gg. S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, and K. Winstein. USENIX ;login:, 44(3), September 2019.
- DIFF: A Relational Interface for Large-Scale Data Explanation. F. Abuzaid, P. Kraft, S. Suri, E. Gan, E. Xu, A. Shenoy, A. Ananthanarayan, J. Sheu, E. Meijer, X. Wu, J. Naughton, P. Bailis, and M. Zaharia. VLDB 2019.
- Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark. C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re and M. Zaharia. SIGOPS Operating Systems Review, 53(1):14-25, July 2019.
- From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers. S. Fouladi, F. Romero, D. Iter, Q. Li, S. Chatterjee, C. Kozyrakis, M. Zaharia, and K. Winstein. USENIX ATC 2019.
- LIT: Learned Intermediate Representation Training for Model Compression. A. Koratana, D. Kang, P. Bailis and M. Zaharia. ICML 2019. (blog)
- Debugging Machine Learning via Model Assertions. D. Kang, D. Raghavan, P. Bailis and M. Zaharia. ICLR DebugML Workshop 2019. Best Student Paper. (blog)
- To Index or Not to Index: Optimizing Exact Maximum Inner Product Search. F. Abuzaid, G. Sethi, P. Bailis and M. Zaharia. ICDE 2019.
- Beyond Data and Model Parallelism for Deep Neural Networks. Z. Jia, M. Zaharia and A. Aiken. SysML 2019.
- Optimizing DNN Computation with Relaxed Graph Substitutions. Z. Jia, J. Thomas, T. Warszawski, M. Gao, M. Zaharia and A. Aiken. SysML 2019.
- Challenges and Opportunities in DNN-Based Video Analytics: A Demonstration of the BlazeIt Video Query Engine (demo). D. Kang, P. Bailis and M. Zaharia. CIDR 2019.
- Accelerating the Machine Learning Lifecycle with MLflow. M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S.A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, and C. Zumar. IEEE Data Engineering Bulletin, 41(4), December 2018.
- Model Assertions for Debugging Machine Learning. D. Kang, D. Raghavan, P. Bailis and M. Zaharia. NeurIPS Systems for ML Workshop 2018.
- Analysis of the Time-To-Accuracy Metric and Entries in the DAWNBench Deep Learning Benchmark. C. Coleman, D. Kang, D. Narayanan, L. Nardi, T. Zhao, J. Zhang, P. Bailis, K. Olukotun, C. Re and M. Zaharia. NeurIPS Systems for ML Workshop 2018. (blog)
- Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. NeurIPS Systems for ML Workshop 2018.
- Exploring the Use of Learning Algorithms for Efficient Performance Profiling. S. Palkar, S. Suri, P. Bailis and M. Zaharia. NeurIPS ML for Systems Workshop 2018.
- Block-wise Intermediate Representation Training for Model Compression. A. Koratana, D. Kang, P. Bailis and M. Zaharia. NeurIPS CDNNRIA Workshop 2018.
- Filter Before You Parse: Faster Analytics on Raw Data with Sparser. S. Palkar, F. Abuzaid, P. Bailis and M. Zaharia. VLDB 2018. (blog)
- Evaluating End-to-End Optimization for Data Analytics Applications in Weld. S. Palkar, J. Thomas, D. Narayanan, P. Thaker, R. Palamuttam, P. Negi, A. Shanbhag, M. Schwarzkopf, H. Pirk, S. Amarasinghe, S. Madden and M. Zaharia. VLDB 2018. (blog)
- MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis. M. Vartak, J. da Trindade, S. Madden and M. Zaharia. SIGMOD 2018.
- Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica and M. Zaharia. SIGMOD 2018.
- Accelerating Model Search with Model Batching (poster). D. Narayanan, K. Santhanam and M. Zaharia. SysML 2018.
- BlazeIt: An Optimizing Query Engine for Video at Scale (poster). D. Kang, P. Bailis and M. Zaharia. SysML 2018.
- DAWNBench: An End-to-End Deep Learning Benchmark and Competition (poster). C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Re and M. Zaharia. SysML 2018.
- Making Caches Work for Graph Analytics. Y. Zhang, V. Kiriansky, C. Mendis, M. Zaharia and S. Amarasinghe. IEEE BigData 2017. Best Student Paper.
- DAWNBench: An End-to-End Deep Learning Benchmark and Competition. C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Re and M. Zaharia. NIPS SysML 2017. (blog)
- DIY Hosting for Online Privacy. S. Palkar and M. Zaharia. HotNets 2017.
- Stadium: A Distributed Metadata-Private Messaging System. N. Tyagi, Y. Gilad, D. Leung, M. Zaharia and N. Zeldovich. SOSP 2017.
- NoScope: Optimizing Neural Network Queries over Video at Scale. D. Kang, J. Emmons, F. Abuzaid, P. Bailis and M. Zaharia. VLDB 2017. (blog)
- Splinter: Practical Private Queries on Public Data. F. Wang, C. Yun, S. Goldwasser, V. Vaikuntanathan and M. Zaharia. NSDI 2017.
- Weld: A Common Runtime for High Performance Data Analytics. S. Palkar, J. Thomas, A. Shanbhag, D. Narayanan, H. Pirk, M. Schwarzkopf, S. Amarasinghe and M. Zaharia. CIDR 2017.
- Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale. F. Abuzaid, J. Bradley, F. Liang, A. Feng, L. Yang, M. Zaharia and A. Talwalkar. NIPS 2016.
- Apache Spark: A Unified Engine for Big Data Processing. M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica. Communications of the ACM, 59(11):56-65, November 2016.
- Voodoo – A Vector Algebra for Portable Database Performance on Modern Hardware. H. Pirk, O. Moll, M. Zaharia and S. Madden. VLDB 2016.
- Matrix Computations and Optimizations in Apache Spark. R.B. Zadeh, X. Meng, A. Staple, B. Yavuz, L. Pu, S. Venkataraman, E. Sparks, A. Ulanov and M. Zaharia. KDD 2016. Best Paper Award Runner-Up.
- GraphFrames: An Integrated API for Mixing Graph and Relational Queries. A. Dave, A. Jindal, L.E. Li, R. Xin, J. Gonzalez and M. Zaharia. GRADES 2016.
- ModelDB: A System for Machine Learning Model Management. M. Vartak, H. Subramanyam, W.E. Lee, S. Viswanathan, S. Husnoo, S. Madden and M. Zaharia. HILDA 2016.
- SparkR: Scaling R Programs with Spark. S. Venkataraman, Z. Yang, D. Liu, E. Liang, X. Meng, R. Xin, A. Ghodsi, M. Franklin, I. Stoica and M. Zaharia. SIGMOD 2016.
- MLlib: Machine Learning in Apache Spark. X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. JMLR, 17(34):1–7, 2016.
- FairRide: Near-Optimal, Fair Cache Sharing. Q. Pu, H. Li, M. Zaharia, A. Ghodsi, and I. Stoica. NSDI 2016.
- Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis. J. van den Hooff, D. Lazar, M. Zaharia and N. Zeldovich. SOSP 2015.
- Scaling Spark in the Real World: Performance and Usability. M. Armbrust, T. Das, A. Davidson, A. Ghodsi, A. Or, J. Rosen, I. Stoica, P. Wendell, R. Xin and M. Zaharia. VLDB 2015.
- Spark SQL: Relational Data Processing in Spark. M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, X. Meng, T. Kaftan, M. Franklin, A. Ghodsi and M. Zaharia. SIGMOD 2015.
- Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. H. Li, A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. SoCC 2014, November 2014.
- A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from Next-Generation Sequencing of Clinical Samples. S.N. Naccache, S. Federman, N. Veeraraghavan, M. Zaharia, D. Lee, E. Samayoa, J. Bouquet, A.L. Greninger, K. Luk, B. Enge, D.A. Wadford, S.L. Messenger, G.L. Genrich, K. Pellegrino, G. Grard, E. Leroy, B.S. Schneider, J.N. Fair, M.A. Martinez, P. Isa, J.A. Crump, J.L. DeRisi, T. Sittler, J. Hackett Jr., S. Miller and C.Y. Chiu. Genome Research, 24(7):1180-92, June 2014.
- An Architecture for Fast and General Data Processing on Large Clusters. M. Zaharia. PhD Dissertation, 2014 ACM Doctoral Dissertation Award.
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. SOSP 2013.
- Sparrow: Distributed, Low-Latency Scheduling. K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. SOSP 2013.
- Shark: SQL and Rich Analytics at Scale. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. SIGMOD 2013.
- Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints. A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. EuroSys 2013.
- Multi-Resource Fair Queueing for Packet Processing. A. Ghodsi, V. Sekar, M. Zaharia and I. Stoica. SIGCOMM 2012. Best Paper Award.
- Fast and Interactive Analytics over Hadoop Data with Spark. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. USENIX ;login:, August 2012.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica. HotCloud 2012.
- Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems. L. Martignoni, P. Poosankam, M. Zaharia, J. Han, S. McCamant, D. Song, V. Paxson, A. Perrig, S. Shenker, I. Stoica. USENIX ATC 2012.
- Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo). C. Engle, A. Lupher, R. Xin, M. Zaharia, M. Franklin, S. Shenker, I. Stoica. SIGMOD 2012. Best Demo Award.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. NSDI 2012. Best Paper Award and Honorable Mention for Community Award.
- EuroSys Test of Time Paper Award (for Delay Scheduling), 2020
- Presidential Early Career Award for Scientists and Engineers (PECASE), 2019
- NSF CAREER Award, 2017
- VMware Systems Research Award, 2016
- Google Faculty Research Award, 2015
- ACM Doctoral Dissertation Award, 2014
- U. Waterloo Faculty of Mathematics Young Alumni Achievement Medal, 2014
- Daytona GraySort World Record, 2014
- David J. Sakrison Prize for Research, UC Berkeley, 2013
- Best Paper Awards at SIGCOMM 2012 and NSDI 2012
- Board Member: MLSys Conference.
- Program Co-Chair: DISPA Workshop at VLDB 2020, MLOps Workshop at MLSys 2020, SysML 2019.
- Program Committee Member: NSDI 2021, VLDB 2021, ICML 2020, HotCloud 2020, NeurIPS 2019, SIGMOD 2019, OSDI 2018, SIGMOD 2018, NSDI 2018, SoCC 2017, SIGMOD 2016, SIGCOMM 2016, NSDI 2015.
- Invited Reviewer: CACM, TPDS, VLDB.
Almost all of my work is open source:
- The Spark engine became an Apache project at spark.apache.org. We have also open sourced subsequent projects including Shark, Spark SQL, MLlib, GraphFrames and Spark Streaming.
- MLflow is a new open source project for managing the machine learning development process.
- The Mesos cluster manager is a top-level Apache project.
- LATE straggler mitigation and the Hadoop Fair Scheduler are included in Apache Hadoop.
- The SNAP sequence aligner is available on GitHub.