Homepage of Christopher Re (Chris Re)

Christopher Ré
Email: chrismre at cs.stanford.edu
Our Lab's GitHub | Blog | Twitter

Department of Computer Science
Stanford University
353 Jane Stanford Way
Stanford, CA 94305-9025

I'm a professor in the Stanford AI Lab (SAIL), the Center for Research on Foundation Models (CRFM), and the Machine Learning Group (bio). Our lab works on the foundations of the next generation of AI systems.
- On the AI side, I am fascinated by how we can learn from increasingly weak forms of supervision, the basis of new architectures, the role of data, and by the mathematical foundations of such techniques.
- On the systems side, I am broadly interested in how machine learning is changing how we build software and hardware. I'm particularly excited when we can blend AI and systems, e.g., Snorkel, Overton (YouTube), or Together.
Our work is inspired by the observation that data is central to these systems, and so data management principles (re-imagined) play a starring role in our work. This sounds like Silicon Valley nonsense, but oddly enough, these ideas get used due to amazing students and collaborations with Google Ads, YouTube, Apple, and more.

While we're very proud of our research ideas and their impact, the lab's real goal is to help students become professors, entrepreneurs, and researchers. To that end, over a dozen members of our group have started their own professorships. With students and collaborators, I've been fortunate enough to cofound a number of companies and a venture firm. For transparency, I try to list companies I advise or invest in here and our research sponsors here. My students run the ML Sys Podcast.
- We're interested in improving the foundations of foundation models.
  - We released ThunderKittens (quick blog|paper|repo) for our opinionated take on building AI kernels. Now with ParallelKittens (paper) for multi-GPU and HipKittens (paper) for AMD.
  - Intelligence per Watt (paper|repo) measuring efficiency of AI and foundation models!
  - Blog post on sequence length and more. See the blog for more details.
  - Flash Attention is an IO-aware algorithm for attention. It is now widely used, including in MLPerf; see MLPerf Story on Tri! and Tri's Version 2.
  - We continue to work on long sequences. An explainer of a simplified version of S4 (S4 Explainer Blog). It's a convolution and an RNN based on simple ideas from signal processing. SOTA on Long Range Arena and first to solve Path-X. Update on this line of work.
  - We've been working on Hyena using ideas from signal processing, and its application to HyenaDNA and now Evo, led by Arc Institute and Brian Hie. Evo was selected for the cover of Science. Now Evo 2 extends to all domains of life.
- Some Talks and resources
  - NeurIPS23 Keynote (pptx|pdf|video) about building blocks for foundation models. GitHub for SysAI building blocks.
  - Some resources for a budding community in Data-Centric AI and a blog post about it.
  - SIGMOD keynote on Data-centric AI, Declarative ML, and Foundation Models in data slides (YouTube)
  - SIGMOD panel on Service, Science and Startups changing research
  - Software 2.0 Overview at HAI
  - Thanks, NeurIPS! Our Test-of-time Award talk for Hogwild! is on YouTube
  - A quick video overview of our work on Hidden Stratification.
  - MLSys 20 keynote talk (pdf|pptx) or WWW BIG. More articles on the new group website; also see GitHub.
  A messy, incomplete log of old updates is here.

This is AI generated, so there are probably fewer errors than when I did it... and something to blame.

Manuscripts

Index by year

2025

Minions: Cost-efficient collaboration between on-device and cloud language models Narayan, Biderman, et al. ICML 2025.
Genome modeling and design across all domains of life with Evo 2 Brixi, Durrant, et al. BioRxiv 2025.
KernelBench: Can LLMs write efficient GPU kernels? Ouyang, Guo, et al. 2025.
CodeMonkeys: Scaling test-time compute for software engineering Ehrlich, Brown, et al. 2025.
Restructuring Vector Quantization with the Rotation Trick Fifty, Junkins, et al. ICLR 2025. Oral
Scaling laws for precision Kumar, Ankner, et al. ICLR 2025. Oral
LoLCATs: On low-rank linearizing of large language models Zhang, Arora, et al. ICLR 2025.
Aioli: A unified optimization framework for language model data mixing Chen, Hu, et al. ICLR 2025.
Archon: An architecture search framework for inference-time techniques Saad-Falcon, Lafuente, et al. ICML 2025.
ThunderKittens: Simple, fast, and adorable AI kernels Spector, Arora, et al. ICLR 2025. Spotlight
HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation Kumbong, Liu, et al. CVPR 2025.
BWLer: Barycentric Weight Layer Elucidates a Precision-Conditioning Tradeoff for PINNs Liu, Baig, et al. 2025.
Shrinking the Generation-Verification Gap with Weak Verifiers Saad-Falcon, Buchanan, et al. 2025.
Cartridges: Lightweight and general-purpose long context representations via self-study Eyuboglu, Ehrlich, et al. 2025.
Towards learning high-precision least squares algorithms with sequence models Liu, Grogan, et al. 2025.
Systems and algorithms for convolutional multi-hybrid language models at scale Ku, Nguyen, et al. 2025.
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels Sul, Arora, et al. 2025.
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI Saad-Falcon, Narayan, et al. 2025.
HipKittens: Fast and Furious AMD Kernels Hu, Wadsworth, et al. 2025.
A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems Gonzalez, Buchanan, et al. 2025.
Building GenAI Benchmarks: A Case Study in Legal Applications Guha, Nyarko, et al. 2025.
Precise high-dimensional asymptotics for quantifying heterogeneous transfers Yang, Zhang, et al. Journal of Machine Learning Research 2025.

2024

Sequence modeling and design from molecular to genome scale with Evo Nguyen, Poli, et al. Science 2024. Selected for Cover
WONDERBREAD: A Benchmark for Evaluating Multimodal Foundation Models on Business Process Management Tasks Wornow, Narayan, et al. NeurIPS24 (Benchmark)
Smoothie: Label Free Language Model Routing Guha, Chen, et al. NeurIPS24.
Red Pajama Data NeurIPS24 (data)
Large language monkeys: Scaling inference compute with repeated sampling Brown, Juravsky, et al. 2024.
Just read twice: closing the recall gap for recurrent language models Arora, Timalsina, et al. 2024.
Automated Rewards via LLM-Generated Progress Functions Sarukkai, Shacklett, et al. 2024.
Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates Narayan, Chen, et al. NeurIPS 2024 (Datasets and Benchmarks).
Context clues: Evaluating long context models for clinical prediction tasks on EHRs Wornow, Bedi, et al. 2024.
Model changelists: Characterizing updates to ml models Eyuboglu, Goel, et al. FAccT 2024.
Hydragen: High-throughput LLM inference with shared prefixes Juravsky, Brown, et al. 2024.
Towards trustworthy seizure onset detection using workflow notes Saab, Tang, et al. npj Digital Medicine 2024.
Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT Jon Saad-Falcon et al. ICML 2024.
Simple linear attention language models balance the recall-throughput tradeoff Evan Sabri Eyuboglu and Simran Arora et al. ICML 2024.
Mechanistic Design and Scaling of Hybrid Architectures Michael Poli et al. ICML 2024.
Prospector Heads: Generalized Feature Attribution for Large Models and Data Gautam Machiraju et al. ICML 2024.
State-Free Inference of State-Space Models: The *Transfer Function* Approach Parnichkun et al. ICML 2024.
Automating the Enterprise with Foundation Models Wornow, Narayan, et al. VLDB 2024.
Zoology: Measuring and Improving Recall in Efficient Language Models Simran Arora and Sabri Eyuboglu et al. ICLR 2024.
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores Daniel Y. Fu and Hermann Kumbong et al. ICLR 2024
Context-Aware Meta-Learning Christopher Fifty et al. ICLR 2024
The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry Michael Zhang et al. ICLR 2024.

2023

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution NeurIPS23. Spotlight
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture Fu, Arora, Grogan et al. NeurIPS23. Oral
Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification Guha et al. NeurIPS23.
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models Guha et al. NeurIPS data 23.
Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions Massaroli and Poli et al. NeurIPS23
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models Chen et al. NeurIPS23. Spotlight
TART: A plug-and-play Transformer module for task-agnostic reasoning Bhatia and Narayan et al. NeurIPS23.
A case for reframing automated medical image classification as segmentation Hooper et al. NeurIPS23.
Reasoning over Public and Private Data in Retrieval-Based Systems Arora et al. TACL23.
Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Arora et al. VLDB23.
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time. Chen et al. ICML 23 Oral
Hyena Hierarchy: Towards Larger Convolutional Language Models. Poli et al. ICML23 Oral
CocktailSGD: Fine-tuning Foundation Models over 500Mbps Networks Wang et al. ICML23
Simple Hardware-Efficient Long Convolutions for Sequence Modeling. Fu et al. ICML23
High-throughput Generative Inference of Large Language Models with a Single GPU Sheng et al. ICML23 Oral
Hungry Hungry Hippos: Towards Language Modeling with State Space Models. Fu, Dao et al. ICLR23 Spotlight
Ask Me Anything: A simple strategy for prompting language models. Arora, Narayan et al. ICLR23 Spotlight
Effectively Modeling Time Series with Simple Discrete State Spaces. Zhang et al. ICLR23.
How to Train your HiPPO: State Space Models with Generalized Orthogonal Basis Projection. Gu et al. ICLR23
Can Foundation Models Wrangle Your Data? Narayan et al. VLDB23

2022

HAPI: A Large-scale Longitudinal Dataset of Commercial ML API Predictions Chen et al. NeurIPS22 (Data)
Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data Thomas and Poldrack. NeurIPS22
Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees Wang et al. NeurIPS22.
Decentralized Training of Foundation Models in Heterogeneous Environments Yuan et al. NeurIPS22 Oral
Transform Once: Efficient Operator Learning in Frequency Domain Poli et al. NeurIPS22
On the Parameterization and Initialization of Diagonal State Space Models Gu, Goel, Gupta. NeurIPS22
S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces Nguyen et al. NeurIPS22
Contrastive Adapters for Foundation Model Group Robustness Zhang. NeurIPS22
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness Dao et al. HAET22 Best Paper and then NeurIPS22
VORTEX: Physics-Driven Data Augmentations Using Consistency Training for Robust Accelerated MRI Reconstruction Desai et al. MIDL22 Best Paper
Shoring Up the Foundations: Fusing Model Embeddings and Weak Supervision Chen et al. UAI22. Oral, Best Student Paper Runner Up
Correct-N-Contrast: A Contrastive Approach for Improving Robustness to Spurious Correlations. Zhang et al. ICML22. Long Talk
It's Raw! Audio Generation with State-Space Models Goel et al. ICML22. Long Talk
Monarch: Expressive Structured Matrices for Efficient and Accurate Training Dao et al. ICML22. Long Talk and Honorable Paper Runner up
Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning Chen et al. ICML22.
Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins. Suri et al. VLDB22.
Efficiently Modeling Long Sequences with Structured State Spaces Gu et al. ICLR22. Oral, Honorable Mention Outstanding Paper.
Domino: Discovering Systematic Errors with Cross-Modal Embeddings. Eyuboglu et al. ICLR22. Oral
Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models Chen et al. ICLR22. Spotlight
Metadata Shaping: A Simple Approach for Knowledge-Enhanced Language Models Arora et al. Findings of ACL22.
TABi: Type-Aware Bi-Encoders for Open-Domain Entity Retrieval Leszczynski et al. Findings of ACL22.
Cross-Domain Data Integration for Named Entity Disambiguation in Biomedical Text Varma et al. Findings of ACL22.
The Details Matter: Preventing Class Collapse in Supervised Contrastive Learning Fu and Chen et al. AIBDSD@AAAI Best Paper

2021

SKM-TEA: A Dataset for Accelerated MRI Reconstruction with Dense Image Labels for Quantitative Clinical Evaluation. Desai et al. NeurIPS21.
Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. Gu et al. NeurIPS21.
Personalized Benchmarking with the Ludwig Benchmarking Toolkit Narayan et al. NeurIPS21
Scatterbrain: Unifying Sparse and Low-rank Attention Chen et al. NeurIPS21.
Rethinking Neural Operations for Diverse Tasks Roberts et al. NeurIPS21.
Catformer: Designing Stable Transformers via Sensitivity Analysis Davis et al. ICML 21.
Mandoline: Model Evaluation under Distribution Shift Chen et al. ICML 21
HoroPCA: Hyperbolic Dimensionality Reduction via Horospherical Projections Chami et al. ICML 21.
Goodwill Hunting: Analyzing and Repurposing Off-the-Shelf Named Entity Linking Systems. Karan Goel et al. NAACL Industry 2021
Robustness Gym: Unifying the NLP Evaluation Landscape Karan Goel et al. NAACL demo, 2021.
Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation. Mayee F. Chen et al. AIStats21.
PipeMare: Asynchronous Pipeline Parallel DNN Training Bowen Yang et al. MLSys21.
Mongoose: A Learnable LSH Framework for Efficient Neural Network Training. Beidi Chen et al. ICLR21 Oral
Model Patching: Closing the Subgroup Performance Gap with Data Augmentation. Karan Goel et al. ICLR21
Cut out the annotator, keep the cutout: better segmentation with weak supervision. Sarah Hooper et al. ICLR21
Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation. Laurel Orr et al. CIDR21
Observational Supervision for Medical Image Classification Using Gaze Data Saab et al. MICCAI21

2020

Comparison of segmentation-free and segmentation-dependent computer-aided diagnosis of breast masses on a public mammography dataset Lee et al. Journal of Biomedical Informatics, 2020.
HiPPO: Recurrent Memory with Optimal Polynomial Projections Albert Gu et al. NeurIPS 2020. Spotlight
No Subclass left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems Nimit Sohoni. NeurIPS 2020.
Hyperbolic Hierarchical Clustering. Ines Chami et al. NeurIPS 2020.
Leveraging Organizational Resources to Adapt Models to New Data Modalities Sahaana Suri et al. VLDB 2020
On the Generalization Effects of Linear Transformations in Data Augmentation Sen Wu, Hongyang R. Zhang, Greg Valiant, C. Ré. ICML 2020
Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods Dan Fu et al. ICML 2020
AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature J. Birgmeier et al. Science Trans. Med 2020
Extract Chemical Reactions from Text using Snorkel Emily K. Mallory et al. BMC Bioinformatics 2020.
Cross-Modal Data Programming Enables Rapid Medical Machine Learning. J. Dunnmon, AJ Ratner et al. Cell Patterns 2020.
Low-Dimensional Hyperbolic Knowledge Graph Embeddings. Ines Chami et al. ACL20
Contextual Embeddings: When are they worth it? S. Arora, A. May, J. Zhang, C. Ré. ACL20 (Short Paper)
Sparse Recovery for Orthogonal Polynomial Transforms Anna Gilbert, Albert Gu, C. Ré, Atri Rudra, Mary Wootters. ICALP 2020.
Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging Luke Oakden-Rayner*, Jared Dunnmon*, Gustavo Carneiro, and C. Ré ACM CHIL, 2020 (Oral Spotlight).
Ivy: Instrumental Variable Synthesis for Causal Inference Z. Kuang et al. AISTATS 2020.
Weak supervision as an efficient approach for automated seizure detection in electroencephalography. Khaled Saab et al. NPJ Digital Medicine.
Overton: A Data System for Monitoring and Improving Machine-Learned Products. C. Ré, F. Niu, P. Gudipati, and C. Srisuwananukorn. CIDR2020
Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps Tri Dao et al. ICLR 2020. Spotlight
Understanding and Improving Information Transfer in Multi-Task Learning Sen Wu, H. Zhang, C. Ré. ICLR 2020.

2019

Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging Luke Oakden-Rayner et al. ML4H19
Medical device surveillance with electronic health records Alison Callahan and Jason A Fries et al. NPJ Digital Medicine.
On the Downstream Performance of Compressed Word Embeddings Avner May, Jian Zhang, Tri Dao, C. Ré. NeurIPS2019. Spotlight
Multi-Resolution Weak Supervision for Sequential Data. Paroma Varma, Fred Sala et al. NeurIPS 2019.
Slice-based Learning: A programming model for residual learning on critical slices. Vincent Chen, Sen Wu, Alex Ratner, C. Ré. NeurIPS2019.
Hyperbolic Graph Convolutional Neural Networks. Ines Chami, Rex Ying, C. Ré, Jure Leskovec. NeurIPS 2019.
Scene Graph Prediction with Limited Labels. Vincent S Chen, Paroma Varma, Ranjay Krishna, Michael Bernstein, C. Ré, Li Fei-Fei. ICCV 2019.
Automating the generation of hardware component knowledge bases. Luke Hsiao, Sen Wu, Nicholas Chiang, C. Ré, Philip Levis. LCTES 2019.
Snorkel: Rapid Training Data Creation with Weak Supervision Alex Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, C. Ré. VLDBJ 2019 (Best Of)
Doubly Weak Supervision of Deep Learning Models for Head CT. Khaled Saab et al. MICCAI 2019.
Improving Sample Complexity with Observational Supervision. Khaled Saab et al. ICLR Learning with Limited Labeled Data (LLD) Workshop 2019.
A Machine-Curated Database of Genome-Wide Association Studies Volodymyr Kuleshov et al. Nature Comms 2019.
Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences Jason Fries et al. Nature Comms 2019.
Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, C. Ré. ICML 2019.
Learning Dependency Structures for Weak Supervision Models Paroma Varma, Frederic Sala, Ann He, Alexander Ratner, C. Ré. ICML 2019.
A Kernel Theory of Modern Data Augmentation Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, C. Ré. ICML 2019.
Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale Stephen Bach et al. SIGMOD 2019.
Learning Mixed-Curvature Representations in Product Spaces. Beliz Gunel, Albert Gu, C. Ré, Fred Sala. ICLR 2019.
Low-precision Random Fourier Features for Memory Constrained Kernel Approximation. Tri Dao, Avner May, C. Ré, Jian Zhang. AIStats 2019.
Training Complex Models with Multi-Task Weak Supervision. A. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, C. Ré. AAAI 2019.
The Role of Massively Multi-task and Weak Supervision in Software 2.0 A. Ratner, B. Hancock, C. Ré. CIDR 2019.
Assessment of Deep Learning Models for Automated Triage of Chest Radiographs J. Dunnmon, D. Yi, C. Langlotz, C. Ré, D. Rubin, M. Lungren Radiology, 2019
Snuba: Automating Weak Supervision to Label Training Data, Paroma Varma and C. Ré. VLDB 2019.
A Formal Framework for Probabilistic Unclean Databases. C. De Sa, Ihab Ilyas, Benny Kimelfeld, C. Ré, and Theodoros Rekatsinas, ICDT 2019.

2018

Learning Compressed Transforms with Low Displacement Rank. Anna T. Thomas, Albert Gu, Tri Dao, Atri Rudra, C. Ré. NeurIPS, 2018.

Software 2.0 and Snorkel: Beyond Hand-Labeled Data. C. Ré. KDD18 (Invited).

Representation Tradeoffs for Hyperbolic Embeddings Christopher De Sa, Albert Gu, C. Ré, Frederic Sala. ICML18.

Training Classifiers with Natural Language Explanations. Braden Hancock, Paroma Varma, Stephanie Wang, Percy Liang, C. Ré. ACL18.

LevelHeaded: A Unified Engine for Business Intelligence and Linear Algebra Querying. Chris Aberger, Andy Lamb, Kunle Olukotun, and C. Ré. ICDE 18.

Snorkel: Rapid Training Data Creation with Weak Supervision A.J. Ratner, S. Bach, H. Ehrenberg, J.A. Fries, S. Wu, C. Ré. VLDB 18 Best Of.

Fonduer: Knowledge Base Construction from Richly Formatted Data Sen Wu et al. SIGMOD 18.

A Two Pronged Progress in Structured Dense Matrix Multiplication C. De Sa, Albert Gu, Rohan Puttagunta, C. Ré, Atri Rudra. SODA 18.

Accelerated Power Iteration. AIStats 18.

Learning Invariance with Compact Transforms. Anna T. Thomas, Albert Gu, Tri Dao, Atri Rudra, C. Ré. ICLR Workshop 18.

Exploring the Utility of Developer Exhaust. Jian Zhang, Max Lam, Stephanie Wang, Paroma Varma, Luigi Nardi, Kunle Olukotun and C. Ré. DEEM 2018

Snorkel MeTaL: Weak Supervision for Multi-Task Learning. Alex Ratner, Braden Hancock, Jared Dunnmon, Roger Goldman, C. Ré. DEEM 2018.

High-Accuracy Low-Precision Training C. De Sa, M. Leszczynski, J Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, C. Ré

2017

Association of Omics Features with Histopathology Patterns in Lung Adenocarcinoma. K. Yu, G. Berry, D. Rubin, et al. Cell Systems 2017.
Learning to Compose Domain-Specific Transformations for Data Augmentation A. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, C. Ré, NeurIPS2017.
Inferring Generative Model Structure with Static Analysis Paroma Varma, Bryan He, Payal Bajaj, C. Ré, NeurIPS2017.
Gaussian Quadrature for Kernel Features Tri Dao, Chris De Sa, C. Ré, NeurIPS2017. spotlight
HoloClean: Holistic Data Repairs with Probabilistic Inference Theo Rekatsinas, Xu Chu, Ihab F. Ilyas, C. Ré. VLDB 17.
Weighted SGD for lp regression with Randomized Preconditioning. Jiyan Yang, Yin-Lam Chow, C. Ré, and Michael Mahoney. JMLR 17.
Learning the Structure of Generative Models without Labeled Data Stephen H. Bach, Bryan He, Alex Ratner, C. Ré. ICML 2017.
Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent. C. De Sa, Matt Feldman, C. Ré, Kunle Olukotun. ISCA 2017.
GYM: A Multiround Join Algorithm In MapReduce and Its Analysis Foto Afrati, Manas Joglekar, C. Ré, Semih Salihoglu, Jeffrey D. Ullman. ICDT 2017.
SLiMFast: Guaranteed Results for Data Fusion and Source Reliability. Manas Joglekar, Theodoros Rekatsinas, H. Garcia-Molina, et al. SIGMOD 17.
A Relational Format for Feature Engineering Benny Kimelfeld, C. Ré. PODS 2017. Best of PODS.
Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded. Chris Aberger, Andy Lamb, Kunle Olukotun, C. Ré. VLDB17 (demo)
Snorkel: Fast Training Set Generation for Information Extraction. Alex Ratner, Stephen Bach, Henry Ehrenberg, C. Ré. SIGMOD 17 (demo).
Snorkel: A System for Lightweight Extraction Alex Ratner, Stephen Bach et al. CIDR 2017 (one pager)
Flipper: A Systematic Approach to Debugging Training Sets. Paroma Varma, Dan Iter, C. De Sa and C. Ré. HILDA 2017

2016

Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features Kun-Hsing Yu et al. Nature Comms. 2016. Parasite Award
Data Programming: Creating Large Training Sets, Quickly Alex Ratner, Chris De Sa, Sen Wu, Daniel Selsam, and C. Ré. NeurIPS 2016. video.
Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much. Bryan He, C. De Sa, I. Mitliagkas, and C. Ré. NeurIPS 2016. video.
Sub-sampled Newton Methods with Non-uniform Sampling Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, C. Ré, and Michael Mahoney. NeurIPS 2016.
Cyclades: Conflict-free Asynchronous Machine Learning. Xinghao Pan, Maximilian Lam, Stephen Tu, Dimitris Papailiopoulos, et al. NeurIPS 2016.
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling Chris De Sa, Kunle Olukotun, C. Ré. ICML 2016. Best Paper Award.
EmptyHeaded: A Relational Engine for Graph Processing Christopher R. Aberger, Susan Tu, Kunle Olukotun, and C. Ré. SIGMOD 2016. Best Of.
Aggregations over Generalized Hypertree Decompositions. Manas Joglekar, Rohan Puttagunta, and C. Ré. PODS 2016.
Extracting Databases from Dark Data with DeepDive. Ce Zhang, Michael Cafarella, Feng Niu, C. Ré, Jaeho Shin. SIGMOD 2016 (Industrial Track).
High Performance Parallel Stochastic Gradient Descent in Shared Memory. S. Sallinen, N. Satish, M. Smelyanskiy, S. Sury, C. Ré. IPDPS16.
It’s all a matter of degree: Using degree information to optimize multiway joins. Manas Joglekar and C. Ré. ICDT2016. Best Of.
Weighted SGD for lp Regression with Randomized Preconditioning Jiyan Yang, Yin-Lam Chow, C. Ré, and Michael Mahoney. SODA16.
Asynchrony begets Momentum, with an Application to Deep Learning. Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, and C. Ré. Allerton 16.
Incremental Knowledge Base Construction Using DeepDive Jaeho Shin, Sen Wu, Feiran Wang, Ce Zhang, C. De Sa, C. Ré. VLDBJ.
Materialization Optimizations for Feature Selection. Ce Zhang, Arun Kumar, and C. Ré. TODS 2016.
A Resolution-based Framework for Joins: Worst-case and Beyond. Mahmoud Abo Khamis, Hung Q. Ngo, C. Ré, and Atri Rudra. TODS 2016.
DeepDive: Declarative Knowledge Base Construction. Chris De Sa, Alex Ratner, C. Ré, J. Shin, F.Wang, Sen Wu, Ce Zhang. SIGMOD Record 2016.
Socratic Learning: Empowering the Generative Model Paroma Varma, Rose Yu, Dan Iter, C. De Sa, C. Ré. FiLM-NeurIPS 2016.
Data Programming with DDLite: Putting Humans in a Different Part of the Loop. Henry Ehrenberg, J. Shin, A. Ratner, J. Fries, C. Ré. HILDA16
Parallel SGD: When does Averaging Help? Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and C. Ré. OptML16
Old Techniques for New Join Algorithms: A Case Study in RDF Processing. Chris Aberger, Susan Tu, Kunle Olukotun, and C. Ré. DESWEB. ICDE16.
Wikipedia Knowledge Graph with DeepDive. Thomas Palomares, Youssef Ahres, Juhana Kangaspunta and C. Ré. In Wiki Workshop at ICSMW 2016.
Dark Data: Are We Solving the Right Problems? M. Cafarella, I. Ilyas, M. Kornacker, Tim Kraska, C. Ré. ICDE 2016 (Panel).
Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs. Stefan Hadjis, Ce Zhang, Ioannis Mitliagkas, and C. Ré.

2015

Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width Chris De Sa, Ce Zhang, Kunle Olukotun, C. Ré. NeurIPS15. Spotlight
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms Chris De Sa, Ce Zhang, Kunle Olukotun, and C. Ré. NeurIPS15
Asynchronous stochastic convex optimization. John C. Duchi, Sorathan Chaturapruek, and C. Ré. NeurIPS15.
Incremental Knowledge Base Construction Using DeepDive Jaeho Shin, Sen Wu, Feiran Wang, Ce Zhang, C. De Sa, C. Ré. VLDB15. Best of Issue
Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems Christopher De Sa, Kunle Olukotun, and C. Ré. ICML15.
A Resolution-based Framework for Joins: Worst-case and Beyond. Mahmoud Abo Khamis, Hung Q. Ngo, C. Ré, and Atri Rudra. PODS15. Best of Issue.
Exploiting Correlations for Expensive Predicate Evaluation Manas Joglekar, Hector Garcia-Molina, Aditya Parameswaran, and C. Ré. SIGMOD15
A Demonstration of Data Labeling in Knowledge Base Construction. Jaeho Shin, Mike Cafarella, and C. Ré. VLDB15 (demo).
Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?. SIGMOD15 Panel.
Large-scale extraction of gene interactions from full text literature using DeepDive Emily Mallory, Ce Zhang, C. Ré, Russ Altman. Bioinformatics
Caffe con Troll: Shallow Ideas to Speed up Deep Learning. Firas Abuzaid, Stefan Hadjis, Ce Zhang, and C. Ré. DANAC15.
Join Processing for Graph Patterns: An Old Dog with New Tricks Dung Nguyen, LogicBlox, et al. GRADES15.
DunceCap: Compiling Worst-case Optimal Query Plans. Adam Perelman and C. Ré. Winner of SIGMOD Undergrad Research Competition.
DunceCap: Query Plans Using Generalized Hypertree Decompositions. Susan Tu and C. Ré. Winner of SIGMOD Undergrad Research Competition.
A Database Framework for Classifier Engineering. Benny Kimelfeld and C. Ré. AMW 2015.
The Mobilize Center: an NIH big data to knowledge center to advance human movement research and improve mobility Ku et al. AMIA.

2014

Materialization Optimizations for Feature Selection. Ce Zhang, Arun Kumar, and C. Ré. SIGMOD 2014. Best Paper Award.
DimmWitted: A Study of Main-Memory Statistical Analytics. Ce Zhang and C. Ré. VLDB 2014.
An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. J. Liu, S. Wright, C. Ré, V. Bittorf, S. Sridhar. ICML 2014. (JMLR version)
Beyond Worst-case Analysis for Joins using Minesweeper. Hung Q. Ngo, Dung Nguyen, C. Ré, and Atri Rudra. PODS 2014. [Full]
Parallel Feature Selection Inspired by Group Testing. Y. Zhou et al. NeurIPS2014.
The Theory of Zeta Graphs with an Application to Random Networks. C. Ré. ICDT 2014. Invited to "Best of" Special Issue.
Transducing Markov Sequences Benny Kimelfeld and C. Ré. JACM 2014.
A Machine-compiled Macroevolutionary History of Phanerozoic Life. Shanan E. Peters, Ce Zhang, Miron Livny, and C. Ré. PloS ONE.
Using Social Media to Measure Labor Market Flow D. Antenucci, M. Cafarella, M. Levenstein, C. Ré, and M. Shapiro. NBER. Selected for NBER Digest
Global Convergence of Stochastic Gradient Descent for Some Nonconvex Matrix Problems Christopher De Sa, Kunle Olukotun, and C. Ré.
Preliminary version in Distributed Matrix Computation with NeurIPS14.
Feature Engineering for Knowledge Base Construction DeepDive Group. Data Engineering Bulletin.
Tradeoffs in Main-Memory Statistical Analytics: Impala to DimmWitted (Invited) V. Bittorf, M. Kornacker, C. Ré, C. Zhang. IMDM with VLDB14.
The Beckman Report on Database Research Mike Carey, AnHai Doan, et al. 2014.
Links between Join Processing and Convex Geometry, C. Ré. ICDT 2014 (Invited Abstract for Keynote) [slides].
Skew Strikes Back: New Developments in the Theory of Join Algorithms. Hung Ngo, C. Ré, and Atri Rudra. SIGMOD Rec. 2013.

2013

Towards High-Throughput Gibbs Sampling at Scale: A Study across Storage Managers. Ce Zhang and C. Ré. SIGMOD 2013.
An Approximate, Efficient LP Solver for LP Rounding. Srikrishna Sridhar, Victor Bittorf, Ji Liu, Ce Zhang, C. Ré, and Stephen J. Wright. NeurIPS 2013
Brainwash: A Data System for Feature Engineering. M. Anderson et al. CIDR Conference 2013 (Vision Track)
Understanding Tables in Context Using Standard NLP Toolkits Vidhya Govindaraju, Ce Zhang, and C. Ré. ACL 2013 (Short Paper)
Hazy: Making it Easier to Build and Maintain Big-data Analytics. Arun Kumar, Feng Niu, and C. Ré. ACM Queue, 2013. Invited to CACM March 2013.
Ringtail: Nowcasting Made Easy D. Antenucci, M.J. Cafarella, M.C. Levenstein, C. Ré, and M. Shapiro. WebDB 2013 with SIGMOD 2013
Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion B. Recht and C. Ré. Mathematical Programming Computation, 2013.
Ringtail: Nowcasting Made Easy. Dolan Antenucci, Erdong Li, Shaobo Liu, Michael J. Cafarella, and C. Ré. VLDB Demo 2013.
Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System P. Konda, A. Kumar, C. Ré, and V. Sashikanth. VLDB Demo 2013
GeoDeepDive: Statistical Inference using Familiar Data-Processing Languages. Ce Zhang, V. Govindaraju, J. Borchardt, T. Foltz, C. Ré, and S. Peters. SIGMOD 13 (demo).
Building an Entity-Centric Stream Filtering Test Collection for TREC 2012. J.R. Frank, M. Kleiman-Weiner, D. A. Roberts, F. Niu, Ce Zhang, C. Ré, and I. Soboroff. TREC 2013
Improvement in Fast Particle Track Reconstruction with Robust Statistics M. Wellons, IceCube Collaboration, B. Recht, and C. Ré. Nuclear Inst. and Methods in Physics Research, A.
Robust Statistics in IceCube Initial Muon Reconstruction. M. Wellons, IceCube Collaboration, B. Recht, and C. Ré. International Cosmic Ray Conference 2013.

2012

Factoring nonnegative matrices with linear programs. Victor Bittorf, Benjamin Recht, C. Ré, and Joel A. Tropp. NeurIPS 2012. Revised Version.

The MADlib Analytics Library or MAD Skills, the SQL. Joseph M. Hellerstein et al. PVLDB 2012

Probabilistic Management of OCR using an RDBMS Arun Kumar and C. Ré. PVLDB 2012. [Full Version]
Optimizing Statistical Information Extraction Programs Over Evolving Text Fei Chen, Xixuan Feng, C. Ré, and Min Wang. ICDE. [Full Version]
Understanding cardinality estimation using entropy maximization C. Ré and Dan Suciu. ACM Trans. Database Syst. Volume 37.
Towards a Unified Architecture for In-Database Analytics Aaron Feng, Arun Kumar, Benjamin Recht, and C. Ré. SIGMOD 2012. [Full Version]
Worst-case Optimal Join Algorithms Hung Q. Ngo, Ely Porat, C. Ré, and Atri Rudra. PODS, 2012. Best Paper Award
Big Data versus the Crowd: Looking for Relationships in All the Right Places Ce Zhang, Feng Niu, C. Ré, and Jude Shavlik. ACL, 2012.
Toward a noncommutative arithmetic-geometric mean inequality B. Recht and C. Ré. COLT, 2012 [Full Version]
Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference F. Niu, Ce Zhang, C. Ré, and J. Shavlik. IJSWIS, Special Issue on Knowledge Extraction from the Web, 2012.
DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference F. Niu, C. Zhang, C. Ré, and J. Shavlik. VLDS, 2012.
Scaling Inference for Markov Logic via Dual Decomposition (Short). F. Niu, C. Zhang, C. Ré, and J. Shavlik. ICDM, 2012.

2011

Probabilistic Databases. Dan Suciu, Dan Olteanu, C. Ré, and Christoph Koch. Morgan Claypool's Synthesis Lectures, 2011
Incrementally maintaining classification using an RDBMS Mehmet Levent Koc and C. Ré. PVLDB Volume 4, 2011, p. 302-313
Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS F. Niu, C. Ré, A. Doan, and J.W. Shavlik. PVLDB 11. [Full Version]
Automatic Optimization for MapReduce Programs Eaman Jahani, Michael J. Cafarella, and C. Ré. PVLDB 2011.
Queries and materialized views on probabilistic databases. Nilesh N. Dalvi, C. Ré, and Dan Suciu. JCSS 2011.
Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion B. Recht and C. Ré. 2011.
Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent F. Niu, B. Recht, C. Ré, and S. J. Wright. NeurIPS, 2011. [Full Version]
Felix: Scaling Inference for Markov Logic with an Operator-based Approach. Feng Niu, Ce Zhang, C. Ré, and Jude Shavlik.

2010

Manimal: Relational Optimization for Data-Intensive Programs Michael J. Cafarella and C. Ré. WebDB, 2010.
Transducing Markov Sequences Benny Kimelfeld and C. Ré. PODS, 2010. Invited to Special Issue
Understanding Cardinality Estimation using Entropy Maximization C. Ré and Dan Suciu. PODS, 2010, Invited to Special Issue
Approximation Trade-Offs in a Markovian Stream Warehouse: An Empirical Study (Short) J. Letchner, C. Ré, M. Balazinska, and M. Philipose. ICDE.

2009 and older

Managing Large-Scale Probabilistic Databases. C. Ré. Winner of SIGMOD Jim Gray Thesis Award.
General Database Statistics Using Entropy Maximization. Raghav Kaushik, C. Ré, and Dan Suciu. [Talk]
Query Containment of Tier-2 Queries over a Probabilistic Database. Katherine F. Moore, Vibhor Rastogi, C. Ré, and Dan Suciu.
Access Methods for Markovian Streams. Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose.
Large-Scale Deduplication with Constraints Using Dedupalog. Arvind Arasu, C. Ré, and Dan Suciu. [Talk] Selected as one of the best papers in ICDE 2009.
Probabilistic databases: Diamonds in the dirt. Nilesh N. Dalvi, C. Ré, and Dan Suciu. [Full Version]
Repeatability & Workability Evaluation of SIGMOD 2009. S. Manegold, I. Manolescu, L. Afanasiev, J. Feng, G. Gou, M. Hadjieleftheriou, S. Harizopoulos, P. Kalnis, K. Karanasos, D. Laurent, M. Lupu, N. Onose, C. Ré, V. Sans, P. Senellart, T. Wu, and D. Shasha.
Lahar Demonstration: Warehousing Markovian Streams. Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose.
The Trichotomy of HAVING Queries on a Probabilistic Database. C. Ré and Dan Suciu.
Managing Probabilistic Data with Mystiq (Plenary Talk). C. Ré.
Advances in Processing SQL Queries on Probabilistic Data. C. Ré and Dan Suciu.
Implementing NOT EXISTS Predicates over a Probabilistic Database. Ting-You Wang, C. Ré, and Dan Suciu.
A demonstration of Cascadia through a digital diary application. Nodira Khoussainova, Evan Welbourne, Magdalena Balazinska, Gaetano Borriello, Garrett Cole, Julie Letchner, Yang Li, C. Ré, Dan Suciu, and Jordan Walke.
Event queries on correlated probabilistic streams. C. Ré, Julie Letchner, Magdalena Balazinska, and Dan Suciu.
Managing Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can't-Do. C. Ré and Dan Suciu.
Challenges for Event Queries over Markovian Streams. Julie Letchner, C. Ré, Magdalena Balazinska, and Matthai Philipose.
Approximate lineage for probabilistic databases. C. Ré and Dan Suciu. [Full Version] [Talk]
Systems aspects of probabilistic data management (Part I). Magdalena Balazinska, C. Ré, and Dan Suciu. [Talk]
Systems aspects of probabilistic data management (Part II). Magdalena Balazinska, C. Ré, and Dan Suciu. [Talk]
Structured Querying of Web Text Data: A Technical Challenge. Michael J. Cafarella, C. Ré, Dan Suciu, and Oren Etzioni.
Management of data with uncertainties. C. Ré and Dan Suciu.
Orderings on Annotated Collections. C. Ré, Dan Suciu, and Val Tannen.
Efficient Evaluation of HAVING Queries. C. Ré and Dan Suciu. [Full Version] [Talk]
Efficient Top-k Query Evaluation on Probabilistic Data. C. Ré, Nilesh N. Dalvi, and Dan Suciu. [Full Version] [Talk]
Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization. C. Ré and Dan Suciu. [Full Version] [Talk]
Applications of Probabilistic Constraints (General Exam Paper). C. Ré.
Managing Uncertainty in Social Networks. Eytan Adar and C. Ré.
XQuery!: An XML Query Language with Side Effects. Giorgio Ghelli, C. Ré, and Jérôme Siméon.
A Complete and Efficient Algebraic Compiler for XQuery. C. Ré, Jérôme Siméon, and Mary F. Fernández.
Query Evaluation on Probabilistic Databases. C. Ré, Nilesh N. Dalvi, and Dan Suciu.
Supporting workflow in a course management system. Chavdar Botev, Hubert Chao, Theodore Chao, Yim Cheng, Raymond Doyle, Sergey Grankin, Jon Guarino, Saikat Guha, Pei-Chen Lee, Dan Perry, C. Ré, Ilya Rifkin, Tingyan Yuan, Dora Abdullah, Kathy Carpenter, David Gries, Dexter Kozen, Andrew C. Myers, David I. Schwartz, and Jayavel Shanmugasundaram.
MYSTIQ: a system for finding more answers by using probabilities. Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, C. Ré, and Dan Suciu.
A Framework for XML-Based Integration of Data, Visualization and Analysis in a Biomedical Domain. Nathan Bales, James Brinkley, E. Sally Lee, Shobhit Mathur, C. Ré, and Dan Suciu.
Distributed XQuery. C. Ré, Jim Brinkley, Kevin Hinshaw, and Dan Suciu.
WS-Membership - Failure Management in a Web-Services World. Werner Vogels and C. Ré.
A Collaborative Infrastructure for Scalable and Robust News Delivery. Werner Vogels, C. Ré, Robbert Renesse, and Kenneth P. Birman.

Current PhD Students

Yasa Baig Coadvisor: Stephen Quake
Francois Chaubard Coadvisor: Mykel Kochenderfer
Catherine Deng
Neel Guha
Junmiao Hu Coadvisor: Gianluca Iaccarino
Hermann Kumbong Coadvisor: Azalia Mirhoseini
Jerry Liu
Avanika Narayan Coadvisor: John Hennessy
Jon Saad-Falcon Coadvisor: Azalia Mirhoseini
Divya Nori Coadvisor: Brian Hie
Stuart Sul
Roberto Torres
Michael Zhang

PhD and Postdoc Alumni (Degree year, Employment)

Mayee Chen (PhD 26, Engram)
Owen Dugan (On Leave, Founding Engineer New Co)
Sabri Eyuboglu (PhD 25, Co-Founder Stealth Co) Coadvisor: James Zou
Eric Nguyen (PhD 25, Co-Founder Radical Numerics) Coadvisor: Stephen Baccus
Dan Biderman (Postdoc 25, Co-Founder Stealth company)
Ben Spector (On Leave, Founder Flapping Airplanes)
Simran Arora (PhD 25, Asst. Professor Caltech, Together)
Mike Wornow (PhD 25, New Co) Coadvisor: Nigam Shah
Chris Fifty (PhD 24, Apple) Coadvisor: Sebastian Thrun
Gautam Machiraju (PhD 24, SmarterDx) Coadvisor: Parag Mallick
Dan Fu (PhD 24, Asst. Professor UCSD) Coadvisor: Kayvon Fatahalian
Silas Alberti Coadvisor: Stefano Ermon (Cognition)
Kush Bhatia (Postdoc 23, Google DeepMind)
Armin W. Thomas (Postdoc 24, Radical Numerics)
Brandon Yang (On Leave 2023, Cofounder Cartesia)
Karan Goel (PhD 2023, Cofounder Cartesia)
Arjun Desai (PhD 2023, Cofounder Cartesia) Coadvisor: Akshay Chaudhari
Tri Dao (PhD 2023, Asst. Professor Princeton) Coadvisor: Stefano Ermon
Khaled Saab (PhD 2023, Google) Coadvisor: Daniel Rubin
Sarah Hooper (PhD 2023, NIH) Coadvisor: Sam Gambhir
Megan Leszczynski (PhD 2023, Samaya)
Nancy Xu (On Leave 2022, Founder Moonhub)
Laurel Orr (Postdoc 2022, Numbers Station)
Albert Gu (PhD 2022, Asst. Professor CMU, Cofounder Cartesia)
Beidi Chen (Postdoc 2022, Asst. Professor CMU)
Nimit Sohoni (PhD 2022, Citadel)
Ines Chami (PhD 2021, Cofounder Numbers Station)
Charles Kuang (Postdoc 2020, Google)
Sharon Li (Postdoc 2020, Asst. Professor University of Wisconsin)
Avner May (Postdoc 2020, Google)
Fred Sala (Postdoc 2020, Asst. Professor University of Wisconsin)
Sen Wu (PhD 2020, Cofounder Numbers Station)
Jared Dunnmon (Postdoc 2019, DIUX)
Jian Zhang (PhD 2019, SambaNova Systems)
Alex Ratner (PhD 2019, Asst. Professor at University of Washington, Cofounder and CEO Snorkel)
Paroma Varma (PhD 2019, Snorkel AI)
Braden Hancock (PhD 2019, Snorkel AI)
Chris Aberger (PhD 2018, SambaNova Systems, Cofounder Numbers Station) Coadvisor: Kunle Olukotun
Stephen Bach (Postdoc 2018, Asst. Professor at Brown)
Emily Mallory (PhD Biomedical Informatics, 2018) Principal advisor: Russ Altman
Madalina Fiterau (Postdoc 2018, Asst. Professor at UMass Amherst) Coadvisor: Scott Delp
Jason Fries (Postdoc 2018, Asst. Professor at Stanford) Coadvisor: Scott Delp
Virginia Smith (Postdoc 2018, Asst. Professor at CMU)
Peng Xu (PhD 2018, Amazon A9) Coadvisor: Michael Mahoney
Chris De Sa (PhD 2017, Asst. Professor at Cornell) Coadvisor: Kunle Olukotun
Ioannis Mitliagkas (Postdoc 2017, Asst. Professor at Montréal) Coadvisor: Lester Mackey
Theodoros Rekatsinas (Postdoc 2017, Asst. Professor at Wisconsin)
Jaeho Shin (PhD 2016, Lattice)
Jiyan Yang (PhD 2016, Facebook) Advisor: Michael Saunders (ICME) and Michael Mahoney (Berkeley)
Kun-Hsing Yu (PhD 2016, Asst. Professor Harvard) Advisor: Michael Snyder (BioE)
Manas Joglekar (PhD 2016, Google) Advisor: Hector Garcia-Molina
Ce Zhang (PhD 2015, Postdoc 2016, Asst. Professor at ETH)
Srikrishna Sridhar (PhD 2014, Apple) Main Advisor: Stephen J. Wright
Feng Niu (PhD 2012, Google, Alation, Lattice, Evidently Cofounder)

MS Alumni (Degree year, First Employment)

Vincent Chen (MS 2019, Snorkel AI)
Max Lam (MS 2019, PhD Harvard)
Anna Thomas (Churchill Scholar)
Xiao Cheng (MS 2017, Facebook)
Henry Ehrenberg (MS 2017, Facebook)
Andy Lamb (CoTerm MS 2017, Google)
Rohan Puttagunta (MS 2016, Facebook)
Thomas Palomares (MS 2016, Founder Farmwise)
Susan Tu (CoTerm MS 2016, Stripe)
Feiran Wang (MS 2016, LinkedIn)
Michael Fitzpatrick (MS 2015, Google)
Firas Abuzaid (MS 2015, MIT for PhD)
Zifei Shan (MS 2015, Lattice)
Adam Goldberg (BS 2015, Rubrik)
Adam Perelman (BS 2015, Good Eggs)
Victor Bittorf (MS 2014, Cloudera)
Vidhya Govindaraju (MS 2014, Oracle)
Mark Wellons (MS 2013, Amazon)
Arun Kumar (MS 2013, Wisconsin for PhD, Asst. Professor UCSD)
Xixi Luo (MS in Industrial Engineering 2012, Oracle)
Vinod Ramachandran (MS 2011, Oracle)
M. Levent Koc (MS 2011, Google)
Balaji Gopalan (MS 2010, Google)

We are working on two broad topics:

(1) DeepDive is a new type of system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing data systems. DeepDive's most popular use case is to transform the dark data of web pages, PDFs, and other databases into rich SQL-style databases. In turn, these databases can be used to support both SQL-style and predictive analytics. Recently, some DeepDive-based applications have exceeded the quality of human volunteer annotators in both precision and recall for complex scientific articles. Data produced by DeepDive is used by several law enforcement agencies and NGOs to fight human trafficking. The technical core of DeepDive is a single engine that combines extraction, integration, and prediction, with probabilistic inference as its core operation. A one-pager with key design highlights is here. PaleoDeepDive is featured in the July 2015 issue of Nature.

the data, the output of various tools, the input from users — including the program the developer writes — are observations from which the system statistically infers the answer

(2) Fundamentals of Data Processing. Almost all data processing systems have their intellectual roots in first-order logic. The most computationally expensive (and most interesting) operation in such systems is the relational join. Recently, I helped discover the first join algorithm with optimal worst-case running time. This result uses a novel connection between logic, combinatorics, and geometry. We are using this connection to develop new attacks on classical problems in listing patterns in graphs and in statistical inference. Two threads have emerged:
- Demos, Examples, and Papers.
  - Worst-case Optimal Joins. We have posted a survey for SIGMOD Record about recent advances in join algorithms. Our goal is to give a high-level view of the results for practitioners and applied researchers. We also managed to simplify the arguments. A full version of our join algorithm with worst-case optimal running time is here. The LogicBlox guys have their own commercial worst-case optimal algorithm. Our new system, EmptyHeaded, is based on this theory.
  - Beyond Worst-case Joins. This work is our attempt to go beyond worst-case analysis for join algorithms. We (with Dung Nguyen) develop a new algorithm that we call Minesweeper based on these ideas. The main theoretical idea is to formalize the amount of work any algorithm spends certifying (using a set of propositional statements) that the output set is complete (and not, say, a proper subset). We call this set of propositions the certificate. We manage to establish a dichotomy theorem for this stronger notion of complexity: if a query is what Ron Fagin calls beta-acyclic, then Minesweeper runs in time linear in the certificate; if a query is beta-cyclic, then on some instance any algorithm takes time that is superlinear in the certificate. The results get sharper and more fun.
  - Almost one algorithm to rule them all? We have a much better description of beyond worst-case optimality with a resolution framework and a host of new results for different indexing strategies. This paper supersedes many of the results in Minesweeper, and in a much nicer way! We also hope to connect more of geometry and resolution... but we'll see!
  - A first part of our attack on conditioning for combinatorial problems is in NIPS and on arXiv.
  - It is not difficult to get me interested in a theory problem. Ask around the Infolab if you don't believe me.
Our goal is to understand the fundamentals of data processing systems.
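
To make the join discussion above concrete, here is a minimal sketch of the attribute-at-a-time idea on the triangle query Q(a, b, c) = R(a, b), S(b, c), T(a, c). It is an illustration under our own assumptions, not the algorithm from the papers and not how EmptyHeaded or LogicBlox implement it; the names `index_by_first` and `triangle_join` are hypothetical. The point is that the query binds one attribute at a time and intersects candidate sets, rather than materializing a pairwise join such as R joined with S first.

```python
from collections import defaultdict


def index_by_first(pairs):
    """Map each first component to the set of its second components."""
    index = defaultdict(set)
    for x, y in pairs:
        index[x].add(y)
    return index


def triangle_join(R, S, T):
    """List all (a, b, c) with (a, b) in R, (b, c) in S, and (a, c) in T.

    Hypothetical helper: binds a, then b, then c, intersecting the
    candidate values at each step instead of joining two relations first.
    """
    r_by_a = index_by_first(R)  # a -> {b : (a, b) in R}
    s_by_b = index_by_first(S)  # b -> {c : (b, c) in S}
    t_by_a = index_by_first(T)  # a -> {c : (a, c) in T}

    results = []
    # Candidate values of a must appear in both R and T.
    for a in r_by_a.keys() & t_by_a.keys():
        # Candidate values of b: partners of a in R that also occur in S.
        for b in r_by_a[a]:
            if b not in s_by_b:
                continue
            # Candidate values of c must agree with both S and T.
            for c in s_by_b[b] & t_by_a[a]:
                results.append((a, b, c))
    return sorted(results)


if __name__ == "__main__":
    R = {(1, 2), (1, 3), (2, 3)}
    S = {(2, 3), (3, 4)}
    T = {(1, 3), (1, 4), (2, 4)}
    print(triangle_join(R, S, T))
```

A pairwise plan can build an intermediate result far larger than the final output; intersecting per attribute avoids that intermediate, which is the intuition behind the worst-case optimal bounds described above.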

CS229, Machine Learning. Spring 19. Fall 19. Spring 20. Fall 20. Spring 21. Spring 22. Spring 23.
CS528, MLSys at Stanford. Podcast! Fall 21. Winter 22. Spring 22. Winter 23. Spring 23.
CS324, Large Language Models. Winter 22. Winter 23.
CS329, ML System Design. Winter 21. Faculty Sponsor
CS145, Introduction to Databases. Fall 14. Fall 15. Fall 16.
CS346, Database System Implementation. Spring 14. Spring 15.
CS341, Project in Mining Massive Data Sets. Spring 15. Spring 16.
CS345, Advanced Database Systems. Winter 14.

Our course material from CS145 intro databases is here, and we'll continue to update it. We're aware of a handful of courses that are using these materials; drop us a note if you do!

Honors and Awards
- Inaugural Stanford Open Source Software Prize for Flash Attention, 2024.
- Best Paper from IEEE TC, 2023.
- MIDL Best Paper, 2022
- UAI Outstanding Student Paper Runner Up, 2022
- ICML Outstanding Paper Runner Up, 2022
- ICLR Outstanding Paper Honorable Mention, 2022
- PODS Test-of-Time Award, 2022
- NeurIPS Test-of-Time Award, 2020
- Okawa Research Grant, 2016
- ICML Best Paper Award, 2016
- Distinguished Lectures ONR and FDA, 2016
- CACM Research Highlight for DeepDive, 2016.
- MacArthur Foundation Fellowship, 2015
- VLDB Early Career Award, 2015 (talk video)
- Kavli Fellow, NAS, Frontiers of Science, 2015 (unable to attend)
- Gordon & Betty Moore Data-Driven Discovery Award, 2014
- SIGMOD Best Paper Award, 2014
- National Bureau of Economic Research, NBER Digest Highlight, 2014
- Alfred P. Sloan Research Fellowship, 2013
- Robert N. Noyce Faculty Fellowship, 2013
- PODS Best Paper Award, 2012
- NSF CAREER Award, 2011
- ACM SIGMOD Jim Gray Dissertation Award, 2010
- "Best of" Special Issue Paper Awards: VLDB 2018 (VLDBJ); PODS 2017 (TODS); Nature Comms 2016 (Research Parasite for Kun); SIGMOD 2016 (TODS); ICML 2016 (IJCAI Best of AI); ICDT 2016 (TOCS); VLDB 2015 (VLDBJ & CACM Research Highlight); PODS 2015 (TODS); SIGMOD 2014 (JACM); ICDT 2014 (TOCS), declined; PODS 2012; PODS 2010, two papers, JACM and TODS; ICDE 2009 (TKDE), declined.
