Shirley Wu

Shirley is a third-year Ph.D. student in Stanford CS, advised by Prof. Jure Leskovec and Prof. James Zou. Previously, she obtained her B.S. degree from the School of Data Science at the University of Science and Technology of China (USTC), advised by Prof. Xiangnan He.

Her work centers on self-improving LLM agents and alignment. Her recent research focuses on advancing the capabilities of LLM agents (knowledge-grounded retrieval, tool usage) and aligning these agents with user objectives in real-world deployment (optimizing multi-turn interactions and compound agent frameworks).

GitHub   Scholar   Twitter   LinkedIn   Email: shir{last_name}@cs.stanford.edu

What's New

[05/2025] CollabLLM is accepted to ICML 2025 as a Spotlight!
[05/2025] Our workshop proposal for ICCV 2025 is accepted! Details to be released soon!
[05/2025] Presented CollabLLM at the Citadel GQS PhD Colloquium in NYC!
[04/2025] Presented STaRK at the Citadel Securities PhD Summit in Florida!
[02/2025] Presented CollabLLM and other recent works at the Adobe GenAI Seminar [slides].
[09/2024] AvaTaR is integrated into DSPy!
[09/2024] STaRK, AvaTaR, and GraphMETRO are accepted to NeurIPS 2024!
[05/2024] My talks about STaRK at the Stanford Annual Affiliates Meeting [YouTube] and the Stanford Data Science Conference [YouTube] are out!
[04/2024] STaRK is released! [website] We propose a large-scale retrieval benchmark on Semi-structured Knowledge Bases!
[06/2023] It was a pleasure to give a talk about our Discover and Cure paper for the UP lab.
[02/2023] It was a pleasure to give a talk [YouTube] about DIR (ICLR 2022) for the DEFirst - MILA x Vector group.

Research Topics

  • CollabLLM: From Passive Responders to Active Collaborators

    ICML 2025 (Spotlight)
    Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao

    Problem: LLMs act as passive responders, especially when faced with ambiguity. They don't naturally help users clarify needs or explore options.
    Contribution: CollabLLM empowers LLMs to actively seek information from users and collaborate more effectively with humans.
    Key Insight: Reward responses based on their long-term impact on the conversation.

  • STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

    NeurIPS 2024
    Shirley Wu*, Shiyu Zhao*, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou*, Jure Leskovec*

    Problem: How do we systematically evaluate retrieval systems on complex, semi-structured QA?
    Contribution: STaRK is a large-scale retrieval benchmark on Textual and Relational Knowledge Bases.
    Why STaRK: Realistic complex queries across diverse domains and accurate ground truth.

  • AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning

    NeurIPS 2024
    Shirley Wu, Shiyu Zhao, Qian Huang, Kexin Huang, Michihiro Yasunaga, Kaidi Cao, Vassilis N. Ioannidis, Karthik Subbian, Jure Leskovec*, James Zou*

    Problem: LLMs cannot effectively use external tools.
    Contribution: AvaTaR helps LLMs better tackle complex Q&A tasks by improving their ability to leverage tools.
    Key Insight: Use contrastive reasoning to construct instructions for tool usage on multi-stage problems.

  • GraphMETRO: Mitigating Complex Graph Distribution Shifts via Mixture of Aligned Experts

    NeurIPS 2024
    Shirley Wu, Kaidi Cao, Bruno Ribeiro, James Zou*, Jure Leskovec*

    Task: Node and graph classification.
    What: A framework that enhances GNN generalization under complex distribution shifts.
    Benefits: Generalization and interpretability on distribution shift types.
    How: Through a mixture-of-experts architecture and training objective.

  • Discovering Invariant Rationales for Graph Neural Networks

    ICLR 2022
    Ying-Xin Wu, Xiang Wang, An Zhang, Xiangnan He, Tat-Seng Chua.

    Task: Graph classification.
    What: An invariant learning algorithm for GNNs.
    Motivation: GNNs often fail to generalize to out-of-distribution (OOD) datasets or to provide interpretations.
    Insight: We construct interventional distributions as "multiple eyes" to discover the features that keep the label invariant (i.e., causal features).
    Benefits: Intrinsically interpretable GNNs that are robust and generalizable to OOD datasets.

  • Let Invariant Rationale Discovery Inspire Graph Contrastive Learning

    ICML 2022
    Sihang Li, Xiang Wang, An Zhang, Ying-Xin Wu, Xiangnan He, Tat-Seng Chua

    Task: Graph classification.
    What: A graph contrastive learning (GCL) method with model interpretations.
    How: We generate rationale-aware graphs for contrastive learning to achieve better transferability.

  • Knowledge-Aware Meta-learning for Low-Resource Text Classification

    EMNLP 2021 (Oral, Short Paper)
    Huaxiu Yao, Ying-Xin Wu, Maruan Al-Shedivat, Eric P. Xing.

    Task: Text classification.
    What: A meta-learning algorithm for low-resource text classification.
    How: We extract sentence-specific subgraphs from a knowledge graph for training.
    Benefits: Better generalization between meta-training and meta-testing tasks.

  • Discover and Cure: Concept-aware Mitigation of Spurious Correlation (DISC)

    ICML 2023
    Shirley Wu, Mert Yuksekgonul, Linjun Zhang, James Zou.

    What: DISC adaptively mitigates spurious correlations during model training.
    Benefits: Less spurious bias, better generalization, and unambiguous interpretations.
    How: Using concept images generated by Stable Diffusion, in each iteration DISC computes a metric called concept sensitivity to quantify each concept's spuriousness. Guided by this metric, DISC creates a balanced dataset (where spurious correlations are removed) to update the model.

  • Med-Flamingo: a Multimodal Medical Few-shot Learner

    ML4H, NeurIPS 2023
    Michael Moor*, Qian Huang*, Shirley Wu, Michihiro Yasunaga, Cyril Zakka,
    Yash Dalmia, Eduardo Pontes Reis, Pranav Rajpurkar, Jure Leskovec

    Task: Visual question answering, rationale generation, etc.
    What: A new multimodal few-shot learner specialized for the medical domain.
    How: Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks.
    Benefits: Few-shot generative VQA abilities in the medical domain.

  • Holistic analysis of hallucination in GPT-4V(ision): Bias and interference challenges (BINGO)

    Preprint
    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang,
    James Zou, Huaxiu Yao

    What: A benchmark designed to evaluate two common types of hallucinations in visual language models: bias and interference.

  • Deconfounding to Explanation Evaluation in Graph Neural Networks

    Preprint
    Ying-Xin Wu, Xiang Wang, An Zhang, Xia Hu, Fuli Feng, Xiangnan He, Tat-Seng Chua

    Task: Explanation evaluation.
    What: A new paradigm to evaluate GNN explanations.
    Motivation: Explanation evaluation fundamentally guides the direction of GNN explainability research.
    Insight: Removal-based evaluation hardly reflects the true importance of explanations.
    Benefits: More faithful ranking of different explanations and explanatory methods.

  • D4Explainer: In-distribution Explanations of Graph Neural Network via Discrete Denoising Diffusion

    NeurIPS 2023
    Jialin Chen, Shirley Wu, Abhijit Gupta, Rex Ying

    Task: Explanation generation for GNNs.
    What: A unified framework that generates both counterfactual and model-level explanations.
    Benefits: The explanations are in-distribution and thus more reliable.

  • Towards Multi-Grained Explainability for Graph Neural Networks

    NeurIPS 2021
    Xiang Wang, Ying-Xin Wu, An Zhang, Xiangnan He, Tat-Seng Chua.

    Task: Explanation generation for GNNs.
    What: ReFine, a two-step explainer.
    How: It generates multi-grained explanations via pre-training and fine-tuning.
    Benefits: Obtain both global (for a group) and local explanations (for an instance).

  • Reinforced Causal Explainer for Graph Neural Networks

    TPAMI, May 2022
    Xiang Wang, Ying-Xin Wu, An Zhang, Fuli Feng, Xiangnan He, Tat-Seng Chua.

    Task: Explanation generation for GNNs.
    What: Reinforced Causal Explainer (RC-Explainer).
    How: It constructs an explanatory subgraph by successively adding edges using a policy network.
    Benefits: Faithful and concise explanations.

Services

Reviewer: ICLR '24, ICML '22-'25, NeurIPS '22-'24.

Miscellaneous

I play drums, mostly rock music.

I enjoy extreme sports like bungee jumping. I went skydiving for the first time in Hawaii in July 2023!! Pic 1, Pic 2.