I’m an assistant professor at Stanford CS, where I work on computer systems and big data as part of Stanford DAWN. I’m also co-founder and Chief Technologist of Databricks, the big data company based around Apache Spark. Prior to joining Stanford, I was an assistant professor of CS at MIT.
I work on computer systems and large-scale data processing. My past projects included open source datacenter software such as Spark, Mesos, Spark Streaming and Spark SQL, and cluster scheduling algorithms such as DRF, delay scheduling and LATE. My current projects include:
- DAWN, a new Stanford lab to create infrastructure for usable machine learning. We observe that ML algorithms are now “good enough” for many applications, but the bottlenecks to real-world use are tasks around the algorithm, such as data labeling, data augmentation, and robust serving. We are developing runtimes, algorithms and serving systems to tackle these problems.
- Weld, a runtime for data analytics applications that changes the interface between software libraries to enable powerful cross-library optimizations. Weld can increase the performance of widely used libraries such as Pandas, NumPy, TensorFlow and Spark by up to 30x, as well as make it much simpler to port parallel software across hardware platforms (position paper, optimizer paper, code).
- Scalable Strong Privacy systems that protect user data in common Internet applications at the scale of millions of users. For example, Vuvuzela is the first linearly-scaling messaging system that hides metadata about which pairs of users are communicating, while Splinter hides user queries on non-sensitive data (e.g., map routing on OpenStreetMap) from the service providers.
- CS 349D (Cloud Computing Technology): fall 2018.
- CS 149 (Parallel Computing): winter 2018.
- CS 349D (Cloud Computing Technology): fall 2017.
- CS 341 (Projects in Mining Massive Datasets): spring 2017.
- Firas Abuzaid (with Peter Bailis)
- Cody Coleman (with Peter Bailis)
- Daniel Kang (with Peter Bailis)
- Peter Kraft (with Peter Bailis)
- Deepak Narayanan
- Shoumik Palkar
- Pratiksha Thaker
- James Thomas (with Pat Hanrahan)
- S. Palkar, F. Abuzaid, P. Bailis and M. Zaharia. Filter Before You Parse: Faster Analytics on Raw Data with Sparser, VLDB 2018.
- S. Palkar, J. Thomas, D. Narayanan, P. Thaker, R. Palamuttam, P. Negi, A. Shanbhag, M. Schwarzkopf, H. Pirk, S. Amarasinghe, S. Madden and M. Zaharia. Evaluating End-to-End Optimization for Data Analytics Applications in Weld, VLDB 2018.
- M. Vartak, J. da Trindade, S. Madden and M. Zaharia. MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis, SIGMOD 2018.
- M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica and M. Zaharia. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark, SIGMOD 2018.
- D. Narayanan, K. Santhanam, M. Zaharia. Accelerating Model Search with Model Batching, SysML 2018 (poster).
- D. Kang, P. Bailis, M. Zaharia. BlazeIt: An Optimizing Query Engine for Video at Scale, SysML 2018 (poster).
- C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Re, M. Zaharia. DAWNBench: An End-to-End Deep Learning Benchmark and Competition, SysML 2018 (poster).
- Y. Zhang, V. Kiriansky, C. Mendis, M. Zaharia and S. Amarasinghe. Making Caches Work for Graph Analytics, IEEE BigData 2017. Best Student Paper.
- C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Re, M. Zaharia. DAWNBench: An End-to-End Deep Learning Benchmark and Competition, NIPS SysML 2017 (blog).
- S. Palkar and M. Zaharia. DIY Hosting for Online Privacy, HotNets 2017.
- N. Tyagi, Y. Gilad, D. Leung, M. Zaharia and N. Zeldovich. Stadium: A Distributed Metadata-Private Messaging System, SOSP 2017.
- D. Kang, J. Emmons, F. Abuzaid, P. Bailis and M. Zaharia. NoScope: Optimizing Neural Network Queries over Video at Scale, VLDB 2017 (blog).
- F. Wang, C. Yun, S. Goldwasser, V. Vaikuntanathan and M. Zaharia. Splinter: Practical Private Queries on Public Data, NSDI 2017.
- S. Palkar, J. Thomas, A. Shanbhag, D. Narayanan, H. Pirk, M. Schwarzkopf, S. Amarasinghe and M. Zaharia. Weld: A Common Runtime for High Performance Data Analytics, CIDR 2017.
- F. Abuzaid, J. Bradley, F. Liang, A. Feng, L. Yang, M. Zaharia and A. Talwalkar. Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale, NIPS 2016.
- M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, I. Stoica. Apache Spark: A Unified Engine for Big Data Processing, Communications of the ACM, 59(11):56-65, November 2016.
- H. Pirk, O. Moll, M. Zaharia and S. Madden. Voodoo – A Vector Algebra for Portable Database Performance on Modern Hardware, VLDB 2016.
- R.B. Zadeh, X. Meng, A. Staple, B. Yavuz, L. Pu, S. Venkataraman, E. Sparks, A. Ulanov and M. Zaharia. Matrix Computations and Optimizations in Apache Spark, KDD 2016. Best Paper Award Runner-Up.
- A. Dave, A. Jindal, L.E. Li, R. Xin, J. Gonzalez and M. Zaharia. GraphFrames: An Integrated API for Mixing Graph and Relational Queries, GRADES 2016.
- M. Vartak, H. Subramanyam, W.E. Lee, S. Viswanathan, S. Husnoo, S. Madden and M. Zaharia. ModelDB: A System for Machine Learning Model Management, HILDA 2016.
- S. Venkataraman, Z. Yang, D. Liu, E. Liang, X. Meng, R. Xin, A. Ghodsi, M. Franklin, I. Stoica and M. Zaharia. SparkR: Scaling R Programs with Spark, SIGMOD 2016.
- X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine Learning in Apache Spark, JMLR, 17(34):1–7, 2016.
- Q. Pu, H. Li, M. Zaharia, A. Ghodsi, and I. Stoica. FairRide: Near-Optimal, Fair Cache Sharing, NSDI 2016.
- J. van den Hooff, D. Lazar, M. Zaharia and N. Zeldovich. Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis, SOSP 2015.
- M. Armbrust, T. Das, A. Davidson, A. Ghodsi, A. Or, J. Rosen, I. Stoica, P. Wendell, R. Xin and M. Zaharia. Scaling Spark in the Real World: Performance and Usability, VLDB 2015.
- M. Armbrust, R. Xin, C. Lian, Y. Huai, D. Liu, J. Bradley, X. Meng, T. Kaftan, M. Franklin, A. Ghodsi and M. Zaharia. Spark SQL: Relational Data Processing in Spark, SIGMOD 2015.
- H. Li, A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks, SOCC 2014, November 2014.
- S.N. Naccache, S. Federman, N. Veeraraghavan, M. Zaharia, D. Lee, E. Samayoa, J. Bouquet, A.L. Greninger, K. Luk, B. Enge, D.A. Wadford, S.L. Messenger, G.L. Genrich, K. Pellegrino, G. Grard, E. Leroy, B.S. Schneider, J.N. Fair, M.A. Martinez, P. Isa, J.A. Crump, J.L. DeRisi, T. Sittler, J. Hackett Jr., S. Miller and C.Y. Chiu. A Cloud-Compatible Bioinformatics Pipeline for Ultrarapid Pathogen Identification from Next-Generation Sequencing of Clinical Samples, Genome Research, 24(7):1180-92, June 2014.
- M. Zaharia. An Architecture for Fast and General Data Processing on Large Clusters. PhD Dissertation, 2014 ACM Doctoral Dissertation Award.
- M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013.
- K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013.
- R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013.
- A. Ghodsi, M. Zaharia, S. Shenker and I. Stoica. Choosy: Max-Min Fair Sharing for Datacenter Jobs with Constraints, EuroSys 2013.
- A. Ghodsi, V. Sekar, M. Zaharia and I. Stoica. Multi-Resource Fair Queueing for Packet Processing, SIGCOMM 2012. Best Paper Award.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Fast and Interactive Analytics over Hadoop Data with Spark, USENIX ;login:, August 2012.
- M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica. Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, HotCloud 2012.
- L. Martignoni, P. Poosankam, M. Zaharia, J. Han, S. McCamant, D. Song, V. Paxson, A. Perrig, S. Shenker, I. Stoica. Cloud Terminal: Secure Access to Sensitive Applications from Untrusted Systems, USENIX ATC 2012.
- C. Engle, A. Lupher, R. Xin, M. Zaharia, M. Franklin, S. Shenker, I. Stoica. Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo), SIGMOD 2012. Best Demo Award.
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012. Best Paper Award and Honorable Mention for Community Award.
- T. Hunter, T. Moldovan, M. Zaharia, S. Merzgui, J. Ma, M.J. Franklin, P. Abbeel, and A.M. Bayen. Scaling the Mobile Millennium System in the Cloud, SOCC 2011.
- M. Chowdhury, M. Zaharia, J. Ma, M.I. Jordan and I. Stoica. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM 2011.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica. Mesos: Flexible Resource Sharing for the Cloud, USENIX ;login:, August 2011.
- M. Zaharia, B. Hindman, A. Konwinski, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica. The Datacenter Needs an Operating System, HotCloud 2011.
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011.
- A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011.
- NSF CAREER Award, 2017
- VMware Systems Research Award, 2016
- ACM Doctoral Dissertation Award, 2014
- U. Waterloo Faculty of Mathematics Young Alumni Achievement Medal, 2014
- Daytona GraySort World Record, 2014
- David J. Sakrison Prize for Research, UC Berkeley, 2013
- Best Paper Awards at SIGCOMM 2012 and NSDI 2012
Almost all of my work is open source:
- The Spark engine is now an Apache project at spark.apache.org. We have also open sourced subsequent projects including Shark, Spark SQL, MLlib, GraphFrames and Spark Streaming.
- The Mesos cluster manager is a top-level Apache project.
- The LATE algorithm for straggler mitigation and the Hadoop Fair Scheduler are included in Apache Hadoop.
- The SNAP sequence aligner is available on GitHub.
I’m also a committer on the Apache Hadoop, Spark and Mesos projects.