Yuhui Zhang

Department of Computer Science
Stanford University
Email: yuhuiz@stanford.edu


Hi, nice to e-meet you! I am a PhD student in Computer Science at Stanford University, advised by Serena Yeung. I am also part of the Stanford Artificial Intelligence Laboratory (SAIL) and the Stanford Center for Research on Foundation Models (CRFM), where I have worked with James Zou, Chris Manning, Percy Liang, and many other amazing researchers.

My research focuses on multi-modal machine learning (e.g., vision and language) and its applications to health and science more broadly. My recent work explains the geometry of the representation space learned by multi-modal contrastive learning (ICLR'24, ICLR'23, NeurIPS'22), analyzes the behaviors of multi-modal foundation models (ICLR'23, ACL'23, NeurIPS'23, TMLR'23, EMNLP'24, NeurIPS'24), solves novel tasks by compounding vision-language foundation models (CVPR'24, ECCV'24, NEJM AI'24), and adapts these foundation models to health (ML4H'22, JAMIA'21, npj Digital Medicine'19, npj Digital Medicine'18). I have also contributed to several open-source systems, including Stanza, HELM, Jiuge, and OpenLM (ACL'20, TMLR'22, ACL'19).

I enjoy collaborating with and learning from people of different backgrounds. Feel free to email me if you would like to discuss anything about research!


News:
  • 09/2024: VLMClassifier is accepted to NeurIPS 2024; our work analyzing pre-trained language models for image generation is accepted to the EMNLP 2024 main conference.
  • 07/2024: VideoAgent is accepted to ECCV 2024; our work on AI-generated scientific feedback is published in NEJM AI.
  • 06/2024: Our new work investigates why visually-grounded language models are bad at basic image classification.
  • 05/2024: Selected as a Citadel GQS Fellowship finalist and gave a talk in Chicago.
  • 04/2024: VisDiff is selected as an oral presentation (90/11532) at CVPR 2024!
  • 03/2024: We introduce VideoAgent, which leverages a large language model as an agent for long-form video understanding.
  • 02/2024: VisDiff is accepted to CVPR 2024.
  • 01/2024: C3 is accepted to ICLR 2024: it explains the geometry of the multi-modal contrastive representation space and introduces a three-step method to bridge the modality gap.
  • 12/2023: We introduce VisDiff, an algorithm that automatically describes the differences between two image sets, joint work with Berkeley AI Research!
  • 11/2023: Honored to be selected as one of NeurIPS 2023 Top Reviewers.
  • 10/2023: Large language models can generate scientific feedback, answer moral and causal questions, and show inverse scaling on 11 tasks.
  • 05/2023: Larger language models are not necessarily better at all tasks. Check out our work in ACL 2023 Findings!
  • 01/2023: Can you diagnose and rectify a vision model using language inputs? Check out our work in ICLR 2023!
  • 11/2022: We won the third prize in the first round of the Inverse Scaling Prize! Also check out HELM, which holistically evaluates language models.
  • 10/2022: Honored to receive a NeurIPS 2022 Scholar Award. Thank you, NeurIPS organizers!
  • 10/2022: Two more works will be presented at ML4H and NeurIPS 2022!
  • 09/2022: Our work studying the modality gap is accepted to NeurIPS 2022!
  • 07/2020: Stanza now supports biomedical and clinical text processing!
  • 03/2020: Announcing Stanza: A Python NLP Library for Many Human Languages!
  • 05/2019: Selected as the best oral presentation at the 36th Tsinghua CS Forum for Graduate Students!
  • 04/2019: How can we infer thousands of diagnoses from EHRs? Check out our paper in npj Digital Medicine!
  • 12/2018: Awarded the SenseTime Scholarship (USD 3,000). Thank you, SenseTime!
  • 10/2018: Awarded the highly selective National Scholarship!
  • 06/2018: Received the Tsinghua Research Fellowship with funding of USD 7,500!

Publications

Why are Visually-Grounded Language Models Bad at Image Classification? [PDF][CODE][PROJECT PAGE]

Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy.

NeurIPS (2024).

We investigate why visually-grounded language models are bad at image classification and find that the primary cause is data-related.

VideoAgent: Long-form Video Understanding with Large Language Model as Agent. [PDF][CODE][PROJECT PAGE]

Xiaohan Wang*, Yuhui Zhang*, Orr Zohar, Serena Yeung-Levy.

ECCV (2024).

We treat a long-form video as an environment with which an LLM agent interacts, iteratively deciding where to look next.
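For a flavor of the approach, here is a minimal sketch of such an agent loop (my illustration, not the paper's code; llm, caption, and retrieve are hypothetical stand-ins the caller supplies):

    # Hypothetical sketch of the iterative frame-selection loop described above.
    # llm: text -> text, caption: frame -> text, retrieve: (frames, query) -> frames.
    def video_agent(question, frames, llm, caption, retrieve, max_rounds=5):
        seen = frames[:: max(1, len(frames) // 8)]  # coarse uniform first look
        for _ in range(max_rounds):
            observations = "\n".join(caption(f) for f in seen)
            reply = llm(
                f"Question: {question}\nObservations:\n{observations}\n"
                "Reply 'ANSWER: ...' if confident, otherwise 'LOOK: <what to find>'."
            )
            if reply.startswith("ANSWER:"):
                return reply
            seen = seen + retrieve(frames, reply)  # gather more relevant frames
        return reply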

Describing Differences in Image Sets with Natural Language. [PDF][CODE][PROJECT PAGE]

Lisa Dunlap*, Yuhui Zhang*, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy.

CVPR (2024).

Oral Presentation (90/11532).

We introduce VisDiff, an algorithm that automatically finds differences between two image sets.

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data. [PDF][CODE][PROJECT PAGE]

Yuhui Zhang*, Elaine Sui*, Serena Yeung-Levy.

ICLR (2024).

Top 5% Review Score (8,8,6,6).

We explain the geometry of the multi-modal contrastive representation space and introduce a three-step method to bridge the modality gap.

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. [PDF]

Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev.

EMNLP (2024).

Auto-regressive multi-modal language models do not benefit from initializing with pre-trained language model weights.

Can Large Language Models Provide Useful Feedback on Research Papers? A Large-scale Empirical Analysis. [PDF][CODE]

Weixin Liang*, Yuhui Zhang*, Hancheng Cao*, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel McFarland, James Zou.

NEJM AI (2024).

We conduct a large-scale empirical analysis of using large language models to generate scientific feedback on research papers.

MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks. [PDF][CODE][PROJECT PAGE]

Allen Nie, Yuhui Zhang, Atharva Amdekar, Christopher J Piech, Tatsunori Hashimoto, Tobias Gerstenberg.

NeurIPS (2023).

We test whether large language models make causal and moral judgments about text-based scenarios that align with those of human participants.

Inverse Scaling: When Bigger Isn't Better. [PDF][CODE]

Ian R McKenzie, ... (21 authors) ..., Yuhui Zhang, Zhengping Zhou, Najoung Kim, Samuel R Bowman, Ethan Perez.

TMLR (2023).

TMLR Featured Certification.

We present empirical evidence of inverse scaling on 11 datasets collected by the Inverse Scaling Prize.

Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models. [PDF][CODE][PROJECT PAGE]

Yuhui Zhang*†, Michihiro Yasunaga*, Zhengping Zhou*, Jeff Z. HaoChen*, James Zou, Percy Liang, Serena Yeung.

ACL Findings (2023).

Inverse Scaling Prize Winner.

We introduce NeQA, a dataset of questions containing negation, on which language models do not exhibit straightforward positive scaling.

Diagnosing and Rectifying Vision Models using Language. [PDF][CODE][PROJECT PAGE]

Yuhui Zhang, Jeff Z. HaoChen, Shih-Cheng Huang, Kuan-Chieh Wang, James Zou, Serena Yeung.

ICLR (2023).

AI Audit Challenge Finalist.

We propose to diagnose vision classifiers through natural language by leveraging the multi-modal contrastive embedding space.

Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2022 Symposium. [PDF]

Stefan Hegselmann, ... (18 authors) ..., Yuhui Zhang, ... (23 authors) ..., Girmaw Abebe Tadesse.

Technical Report (2023).

A summary of lessons learned across all roundtables at the second Machine Learning for Health (ML4H) symposium.

Machine Learning-guided Lipid Nanoparticle Design for mRNA Delivery. [PDF][CODE]

Daisy Yi Ding, Yuhui Zhang, Yuan Jia, Jiuzhi Sun.

ICML: CompBio Workshop (2023).

We curate a dataset of 622 lipid nanoparticles and propose to optimize their design in silico with machine learning models.
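As a toy illustration of the in silico screening loop (entirely synthetic data; a sketch under my assumptions, not our actual pipeline):

    # Toy sketch: fit a model on formulation descriptors, then rank a virtual
    # library. Data, descriptor count, and model choice are made up here.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.random((622, 8))                                # 622 formulations, 8 descriptors
    y = X @ rng.random(8) + 0.1 * rng.standard_normal(622)  # synthetic delivery efficiency

    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    library = rng.random((10_000, 8))                       # virtual candidate designs
    best = library[np.argmax(model.predict(library))]       # top in-silico candidate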

Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. [PDF][CODE][PROJECT PAGE]

Weixin Liang*, Yuhui Zhang*, Yongchan Kwon*, Serena Yeung, James Zou.

NeurIPS (2022).

We present the modality gap, an intriguing geometric phenomenon of the multi-modal contrastive representation space.
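To make the geometry concrete, here is a minimal sketch (synthetic arrays standing in for real CLIP features; my illustration): the gap can be summarized as the offset between the centroids of L2-normalized image and text embeddings.

    # Sketch: measure the modality gap as the distance between the centroids
    # of the two modalities. Random arrays stand in for real embeddings.
    import numpy as np

    rng = np.random.default_rng(0)
    image_embs = rng.normal(size=(100, 512))          # placeholder image features
    text_embs = rng.normal(loc=0.5, size=(100, 512))  # placeholder text features

    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    gap = normalize(image_embs).mean(axis=0) - normalize(text_embs).mean(axis=0)
    print("modality gap magnitude:", np.linalg.norm(gap))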

Adapting Pre-trained Vision Transformers from 2D to 3D through Weight Inflation Improves Medical Image Segmentation. [PDF][CODE]

Yuhui Zhang, Shih-Cheng Huang, Zhengping Zhou, Matthew P. Lungren, Serena Yeung.

ML4H (2022).

Selected for PMLR.

We adapt pre-trained vision Transformers from 2D to 3D through weight inflation to maximize model performance on 3D medical images.
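The core trick can be sketched in a few lines (an I3D-style illustration under my assumptions, not our exact code):

    # Sketch: inflate a pre-trained 2D kernel to 3D by repeating it along a
    # new depth axis, rescaling so activations keep the same magnitude.
    import torch

    def inflate_2d_to_3d(weight_2d: torch.Tensor, depth: int) -> torch.Tensor:
        """(out_ch, in_ch, k, k) -> (out_ch, in_ch, depth, k, k)."""
        weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, depth, 1, 1)
        return weight_3d / depth  # preserve the 2D model's output scale

    w2d = torch.randn(64, 3, 16, 16)      # e.g., a ViT patch-embedding kernel
    w3d = inflate_2d_to_3d(w2d, depth=4)  # shape: (64, 3, 4, 16, 16)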

Holistic Evaluation of Language Models. [PDF][DOC][CODE]

Percy Liang*, Rishi Bommasani*, Tony Lee*, ... (45 authors) ..., Yuhui Zhang, Yuta Koreeda.

TMLR (2022).

TMLR Featured Certification.

HELM provides a holistic evaluation of existing language models, covering their capabilities, limitations, and risks.

Biomedical and Clinical English Model Packages in the Stanza Python NLP Library. [PDF][DOC][DEMO][CODE]

Yuhao Zhang, Yuhui Zhang, Peng Qi, Christopher D. Manning, Curtis P. Langlotz.

JAMIA (2021).

Biomedical and clinical English model packages for the Stanza Python NLP library.

On the Opportunities and Risks of Foundation Models. [PDF]

Rishi Bommasani, ... (109 authors) ..., Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, Percy Liang.

Technical Report (2021).

A foundation model is a large machine learning model trained on a vast quantity of data such that it can be adapted to a wide range of downstream tasks.

Language Models as Recommender Systems: Evaluations and Limitations. [PDF][CODE]

Yuhui Zhang, Hao Ding, Zeren Shui, Yifei Ma, James Zou, Anoop Deoras, Hao Wang.

NeurIPS: ICBINB Workshop (2021).

We propose to use powerful pre-trained language models as recommender systems through prompting.
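A minimal sketch of the prompting idea (my illustration with GPT-2 via Hugging Face Transformers; the movie titles and prompt wording are made up):

    # Sketch: phrase a user's history as a prompt and let a pre-trained LM
    # continue it with the next item. Not the paper's exact setup.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    history = ["The Matrix", "Inception", "Interstellar"]
    prompt = f"A user watched {', '.join(history)}. Next, the user will watch"
    print(generator(prompt, max_new_tokens=10)[0]["generated_text"])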

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. [PDF][DOC][DEMO][CODE]

Peng Qi*, Yuhao Zhang*, Yuhui Zhang, Jason Bolton, Christopher D. Manning.

ACL (2020).

The Stanford NLP Group's official Python NLP library, supporting 60+ languages.
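Getting started takes only a few lines (based on the Stanza documentation; models download on first use):

    # Minimal Stanza usage: download the English models, build a pipeline,
    # and read off part-of-speech tags.
    import stanza

    stanza.download("en")        # one-time model download
    nlp = stanza.Pipeline("en")  # tokenization, POS, lemmas, depparse, NER
    doc = nlp("Stanza supports many human languages.")
    for word in doc.sentences[0].words:
        print(word.text, word.upos)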

Enhancing Transformer with Sememe Knowledge. [PDF][CODE]

Yuhui Zhang*, Chenghao Yang*, Zhengping Zhou*, Zhiyuan Liu.

ACL: RepL4NLP Workshop (2020).

We enhance the Transformer's performance and robustness by injecting sememe linguistic knowledge.

Inducing Grammar from Long Short-Term Memory Networks by Shapley Decomposition. [PDF][CODE]

Yuhui Zhang, Allen Nie.

ACL: SR Workshop (2020).

We induce linguistic knowledge from a long short-term memory network by tracing its computational process.

VetTag: Improving Automated Veterinary Diagnosis Coding via Large-scale Language Modeling. [PDF][BLOG][DEMO][CODE]

Yuhui Zhang*, Allen Nie*, Ashley Zehnder, Rodney Page, James Zou.

npj Digital Medicine (2019).

We improve DeepTag in four directions: predicting fine-grained diagnosis codes, leveraging unlabeled clinical notes, harnessing label hierarchies, and interpreting predictions.

Jiuge: A Human-Machine Collaborative Chinese Classical Poetry Generation System. [PDF][DEMO]

Zhipeng Guo*, Xiaoyuan Yi*, Maosong Sun, Wenhao Li, Cheng Yang, Jiannan Liang, Huimin Chen, Yuhui Zhang, Ruoyu Li.

ACL (2019).

We develop Jiuge, a human-machine collaborative Chinese poetry generation system.

DeepTag: Inferring Diagnoses from Veterinary Clinical Notes. [PDF][PRESS][CODE]

Allen Nie*, Ashley Zehnder*, Rodney Page, Yuhui Zhang, Arturo Lopez Pineda, Manuel Rivas, Carlos Bustamante, James Zou.

npj Digital Medicine (2018).

We develop algorithms to automatically predict standard diagnosis codes from EHRs and evaluate them in challenging cross-hospital settings.

Large-scale Generative Modeling to Improve Automated Veterinary Disease Coding. [PDF][CODE]

Yuhui Zhang*, Allen Nie*, James Zou.

ML4H (2018).

We improve diagnosis coding via pre-training language models on unlabeled clinical notes.

THUOCL: Tsinghua Open Chinese Lexicon. [DOC]

Shiyi Han*, Yuhui Zhang*, Yunshan Ma, Cunchao Tu, Zhipeng Guo, Zhiyuan Liu, Maosong Sun.

Technical Report (2016).

THUOCL is a set of high-quality Chinese lexicons that can be used to improve many Chinese NLP tasks.

Awards

Academic Services

Miscellaneous

I enjoy reading books. Some of my favorites: To Live (Yu Hua), Walden (Henry David Thoreau), and Principles of Economics (N. Gregory Mankiw). I also enjoy hiking, jogging, and swimming. I am a fan of classical music, and I was fortunate to learn the basics of playing the guitar, piano, and pipa at Tsinghua University.

Last Update: Sep 2024. Template adapted from HTML5 UP.