B.S. with Honors and M.S. in Computer Science, 2017, Stanford University, specializing in Artificial Intelligence and Theory. My interests include making AI accessible across application fields and improving the interpretability and robustness of machine learning models, with a focus on how theory from statistics, optimization, algorithms, and other areas of mathematics can be applied to machine learning. As an undergraduate, I conducted research under Prof. Stefano Ermon in the Stanford Artificial Intelligence Laboratory, working on learning and inference in spatiotemporal machine learning models for computational sustainability, especially poverty reduction. I also helped start the Sustainability and AI (SUSTAIN) lab at Stanford. At Lecida Inc., I work on time-series prediction problems for sensor data from large industrial machines such as wind turbines, aiming to predict failures before they happen and reduce downtime.
Publicly available satellite imagery is used to accurately estimate local-level economic outcomes in five African countries.
We propose a novel machine learning approach to extract large-scale socioeconomic indicators from high-resolution satellite imagery. We utilize a fully convolutional neural network with transfer learning from a data-rich proxy task to learn features from satellite images of Africa to predict village-level poverty, approaching the performance of features from survey data.
Honors Thesis for the Bachelor of Science with Honors program at Stanford, detailing the transfer learning method for poverty mapping from satellite imagery. The thesis extends the method to multiple resolutions of satellite imagery, leverages Gaussian processes and deep kernel learning for semi-supervised learning to exploit abundant unlabeled data, and describes a deployment pipeline that produces spatial maps from general data sources and spatial models. This scales the approaches in the previous papers, allowing us to provide global-scale poverty maps as satellite images are updated.
Deep learning techniques have led to massive improvements in recent years, but large amounts of labeled data are typically required to learn these complex models. We present a semi-supervised approach for training deep models that combines the feature learning capabilities of neural networks with the probabilistic modeling of Gaussian processes and demonstrate that unlabeled data can significantly improve performance on real-world datasets.
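The combination described above (neural network feature learning feeding a Gaussian process) can be illustrated with a minimal sketch. This is not the paper's implementation: the tiny network below uses fixed random weights rather than weights learned jointly with the GP, and the data is a toy sine curve.

```python
import numpy as np

rng = np.random.default_rng(0)

def nn_features(x, W1, W2):
    # Tiny feedforward net used as a feature extractor (in deep kernel
    # learning these weights are trained jointly with the GP; here they
    # are fixed random weights purely for illustration).
    h = np.tanh(x @ W1)
    return np.tanh(h @ W2)

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel evaluated in the learned feature space.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

# Toy labeled data: 1-D inputs with noisy sine targets.
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

W1 = rng.standard_normal((1, 16)) / 4
W2 = rng.standard_normal((16, 4)) / 4

Phi = nn_features(X, W1, W2)
K = rbf_kernel(Phi, Phi) + 1e-2 * np.eye(30)   # observation-noise term

# GP posterior mean at test inputs: k_* K^{-1} y
X_test = np.linspace(-3, 3, 5)[:, None]
Phi_test = nn_features(X_test, W1, W2)
mean = rbf_kernel(Phi_test, Phi) @ np.linalg.solve(K, y)
print(mean.shape)  # (5,)
```

In the semi-supervised setting of the paper, abundant unlabeled inputs additionally inform the learned feature space; this sketch only shows the supervised GP machinery on top of network features.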
Presentation accompanying the AAAI-16 paper "Transfer Learning from Deep Features for Remote Sensing and Poverty Mapping".
By leveraging multiple resolutions of satellite imagery, we can cheaply take advantage of both the detail of high-resolution images and the broader context captured by low-resolution imagery, which covers more land area. Both are important for predicting poverty measures, and our multiple-resolution method improves significantly on the state of the art across 5 African countries.
Fall 2015 - Use of a supervised pre-training step leveraging expert trajectories to augment the learning of a reinforcement learning agent through self-play. This idea was shown to be very successful in the Nature 2016 paper describing AlphaGo.
Generalization of the convolutional neural network encoder for the Encoder-Decoder paradigm in Neural Machine Translation proposed by Cho et al. Cited by Facebook AI papers.
Deep learning applied to classifying images of plankton as part of the National Data Science Bowl competition. As part of the project, I developed three methods to sparsify the probability distributions output by neural networks, using sampling techniques and convex optimization formulations.
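As a flavor of what sparsifying a network's output distribution means, here is a minimal sketch of one simple approach (top-k truncation with renormalization). The project's actual three methods used sampling and convex optimization and are not reproduced here.

```python
import numpy as np

def sparsify_topk(p, k):
    # Keep the k largest probabilities, zero out the rest, and
    # renormalize so the result is still a valid distribution.
    # Illustrative only; not one of the project's three methods.
    p = np.asarray(p, dtype=float)
    smallest = np.argsort(p)[:-k]   # indices of all but the k largest
    q = p.copy()
    q[smallest] = 0.0
    return q / q.sum()

p = np.array([0.05, 0.4, 0.1, 0.3, 0.15])
q = sparsify_topk(p, k=2)
print(q)  # mass concentrated on the two largest classes
```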
Motion gesture recognition using Machine Learning on smart devices and smart watches.
We developed an algorithm for dynamically truncating an accelerometer data stream to the important parts of a gesture.
Our classifier achieved 96% accuracy with only one training example per gesture, better than the HMM and Dynamic Time Warping approaches seen in our literature review!* This has exciting applications for new, intuitive human input methods and automatic bio-tracking of life behaviors and patterns.
*uWave from Rice University, which used a DTW-based approach, achieved 98% accuracy with one training example, using simple gestures such as moving left, moving right, circles, etc. Our gestures, which involve drawing the letters O, W, X, Z, and V in the air, are arguably more complicated.
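The DTW baseline referenced above can be sketched as a one-nearest-neighbor classifier over stored templates. This is a generic illustration, not our system or uWave's: the sequences are hypothetical 1-D stand-ins for real 3-axis accelerometer traces.

```python
import numpy as np

def dtw_distance(a, b):
    # Classic O(len(a) * len(b)) dynamic-time-warping distance
    # between two 1-D sequences of possibly different lengths.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify(query, templates):
    # 1-nearest-neighbor under DTW: one stored template per gesture.
    return min(templates, key=lambda g: dtw_distance(query, templates[g]))

# Hypothetical templates standing in for recorded accelerometer traces.
templates = {
    "O": np.sin(np.linspace(0, 2 * np.pi, 50)),
    "V": np.abs(np.linspace(-1, 1, 50)),
}
query = np.sin(np.linspace(0, 2 * np.pi, 60)) + 0.05  # noisy, different length
print(classify(query, templates))  # prints O
```

DTW's alignment step is what lets a single template match a query of a different length and speed, which is why it performs well with one training example per gesture.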
We built an extraction-based machine text summarizer that takes a piece of text and identifies its important sentences. This is a challenging NLP problem; a machine learning approach was used to learn "importance" from a corpus of ~2000 documents.
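To show what extraction-based summarization looks like, here is a minimal sketch that scores sentences by word frequency and keeps the top ones in original order. The actual project learned importance from labeled data; this frequency heuristic is just an illustration of the extractive selection step.

```python
import re
from collections import Counter

def summarize(text, n=2):
    # Minimal extractive summarizer: score each sentence by the summed
    # document-level frequency of its words, keep the top-n sentences
    # in their original order. (Illustration only, not a learned model.)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in
                           re.findall(r"[a-z']+", sentences[i].lower())),
    )
    keep = sorted(ranked[:n])
    return " ".join(sentences[i] for i in keep)

doc = ("Satellite imagery helps estimate poverty. "
       "The weather was pleasant. "
       "Poverty estimates from satellite imagery guide aid programs.")
print(summarize(doc, n=2))  # drops the off-topic weather sentence
```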
My first Android app: an application for background music listening on YouTube, with over 10,000 downloads over its lifetime. YouTube has since taken the app down for providing a service that it now offers through YouTube Red.
Rachmaninoff Piano Concerto No.1, 3rd Mvt
Chopin Sonata No.2, Mvt. 1
Chopin Sonata No.2, Mvt. 2
Rachmaninoff Piano Concerto No.1, 1st Mvt