HumanScore: Benchmarking Human Motions in Generated Videos

Yusu Fang1,2* Tiange Xiang1* Tian Tan1 Narayan Schuetz1 Scott Delp1 Li Fei-Fei1† Ehsan Adeli1†

1Stanford University    2Peking University

*Equal first authorship    †Equal last authorship

Real or AI?

Watch each pair and choose which video was generated by AI. HumanScore targets the subtle biomechanical violations that often reveal the answer.

Badminton

Ballet

Parkour

Abstract

Recent advances in model architectures, compute, and data scale have driven rapid progress in video generation, producing increasingly realistic content. Yet, no prior method systematically measures how faithfully these systems render human bodies and motion dynamics.

We present HumanScore, a systematic framework to evaluate the quality of human motions in AI-generated videos. HumanScore defines six interpretable metrics spanning kinematic plausibility, temporal stability, and biomechanical consistency, enabling fine-grained diagnosis beyond visual realism alone.

Through carefully designed prompts, we elicit a diverse set of movements at varying intensities and evaluate videos generated by thirteen state-of-the-art models. Our analysis reveals consistent gaps between perceptual plausibility and the biomechanical fidelity of motion, identifies recurrent failure modes, and produces robust model rankings based on quantitative, physically meaningful criteria.

HumanScore Benchmark

HumanScore follows a biomechanics hierarchy: anatomical correctness, kinematic correctness, and kinetic correctness. Each tier is evaluated with two independent metrics.

The HumanScore biomechanical hierarchy and its six metrics.

I. Extra Limbs

Detects anatomically impossible duplicate arms, legs, hands, feet, or ghost-like body segments.

II. Bone Length

Measures whether body segment lengths remain stable through time under a fitted skeleton.
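As a rough illustration of the idea behind this metric, the sketch below measures how much one segment's length varies over time, using the coefficient of variation of the distance between two joints. The joint trajectories, the function name `bone_length_cv`, and the choice of coefficient of variation are assumptions made for illustration; the actual HumanScore formulation may differ.

```python
import math
from statistics import mean, pstdev

def bone_length_cv(joint_a, joint_b):
    """Coefficient of variation of the distance between two joints over time.

    joint_a, joint_b: lists of (x, y, z) positions, one entry per frame.
    A value near 0 means the segment length is stable.
    """
    lengths = [math.dist(a, b) for a, b in zip(joint_a, joint_b)]
    return pstdev(lengths) / mean(lengths)

# Hypothetical knee and ankle trajectories over four frames.
knee = [(0.0, 0.0, 0.0)] * 4

# The shank rotates but keeps a constant length of 0.4.
ankle_stable = [(0.0, -0.4, 0.0), (0.4, 0.0, 0.0), (0.0, 0.0, 0.4), (0.0, 0.4, 0.0)]

# The shank stretches and shrinks between frames.
ankle_drift = [(0.0, -0.4, 0.0), (0.0, -0.44, 0.0), (0.0, -0.36, 0.0), (0.0, -0.4, 0.0)]

cv_stable = bone_length_cv(knee, ankle_stable)
cv_drift = bone_length_cv(knee, ankle_drift)
print(f"stable: {cv_stable:.4f}, drifting: {cv_drift:.4f}")
```

A rigid segment gives a value near zero regardless of how it rotates, while a segment that changes length between frames gives a clearly positive value.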

III. Joint Range

Penalizes poses that exceed physiologically plausible joint limits and hyperextension bounds.
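A minimal sketch of a joint-limit check, assuming 3D joint positions are available: it computes the included angle at a joint and reports by how many degrees it falls outside an allowed interval. The limits used here (30 to 180 degrees) are placeholder values for illustration, not physiological limits taken from HumanScore. An unsigned included angle also cannot distinguish hyperextension from flexion, which a fitted skeleton with signed joint coordinates can.

```python
import math

def joint_angle_deg(a, b, c):
    """Included angle at joint b, in degrees, between segments b->a and b->c."""
    u = [p - q for p, q in zip(a, b)]
    v = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))

def range_violation_deg(angle, lower, upper):
    """Degrees by which an angle lies outside [lower, upper]; 0 if inside."""
    return max(lower - angle, 0.0, angle - upper)

# Placeholder limits on the included angle (illustrative only).
LOWER, UPPER = 30.0, 180.0

# Straight leg: hip, knee, and ankle are collinear.
straight = joint_angle_deg((0, 1, 0), (0, 0, 0), (0, -1, 0))

# Implausibly folded leg: the shank almost overlaps the thigh.
folded = joint_angle_deg((0, 1, 0), (0, 0, 0), (0.1, 1, 0))

violation_straight = range_violation_deg(straight, LOWER, UPPER)
violation_folded = range_violation_deg(folded, LOWER, UPPER)
print(f"straight: {straight:.1f} deg, folded: {folded:.1f} deg")
```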

IV. Self-Collision

Finds impossible interpenetration between non-adjacent body parts in reconstructed meshes.

V. Kinematic Extremes

Flags unnatural velocity spikes in joint angular motion and body segment movement.

VI. Motion Smoothness

Measures excessive acceleration and jerk that appear as jitter, stutter, or temporal discontinuity.
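Metrics V and VI both rest on time derivatives of the motion. The sketch below estimates velocity, acceleration, and jerk of a single joint-angle trajectory with forward finite differences. The trajectories, the frame rate, and the use of peak absolute values as a summary are assumptions for illustration; the source does not specify the differencing scheme or thresholds.

```python
def finite_diff(series, dt):
    """Forward finite difference of a uniformly sampled series."""
    return [(b - a) / dt for a, b in zip(series, series[1:])]

def motion_derivatives(angles, fps):
    """Velocity, acceleration, and jerk of a joint-angle trajectory."""
    dt = 1.0 / fps
    velocity = finite_diff(angles, dt)
    acceleration = finite_diff(velocity, dt)
    jerk = finite_diff(acceleration, dt)
    return velocity, acceleration, jerk

def peak_abs(series):
    """Largest absolute value in a series."""
    return max(abs(x) for x in series)

FPS = 10  # hypothetical frame rate

# Hypothetical joint angles in degrees: a steady ramp and the same ramp
# with a single-frame glitch.
smooth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
glitchy = [0.0, 1.0, 2.0, 8.0, 4.0, 5.0]

smooth_vel, smooth_acc, smooth_jerk = motion_derivatives(smooth, FPS)
glitch_vel, glitch_acc, glitch_jerk = motion_derivatives(glitchy, FPS)

print("peak velocity:", peak_abs(smooth_vel), "vs", peak_abs(glitch_vel))
print("peak jerk:", peak_abs(smooth_jerk), "vs", peak_abs(glitch_jerk))
```

A single bad frame barely changes the average velocity but produces a large spike in the higher derivatives, which is why acceleration and jerk are sensitive indicators of jitter.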

Prompt Set

The final benchmark contains 51 motion types, balanced across three difficulty levels, with gentle and intense variants for each motion type.

51 motion types
3 difficulty levels
102 unique prompts
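The prompt count follows from pairing each motion type with both intensity variants (51 × 2 = 102). The sketch below reproduces that arithmetic with placeholder data. The motion names and difficulty labels are hypothetical, and reading "balanced" as exactly 17 motion types per difficulty level is an assumption.

```python
from collections import Counter
from itertools import product

NUM_MOTION_TYPES = 51
DIFFICULTY_LEVELS = ("easy", "medium", "hard")  # hypothetical labels
INTENSITIES = ("gentle", "intense")

# Placeholder motion types, assigned round-robin to difficulty levels.
motion_types = [
    {"name": f"motion_{i:02d}", "difficulty": DIFFICULTY_LEVELS[i % 3]}
    for i in range(NUM_MOTION_TYPES)
]

# One prompt per (motion type, intensity) pair.
prompts = [
    (m["name"], m["difficulty"], intensity)
    for m, intensity in product(motion_types, INTENSITIES)
]

types_per_level = Counter(m["difficulty"] for m in motion_types)
print(len(prompts), dict(types_per_level))
```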

HumanScore Leaderboard

* Higher scores indicate better biomechanical plausibility.

Rank  Model                   Availability  Anatomy  Kinematic  Kinetic  Overall
1     Seedance 1.0 Pro fast   Proprietary   93.9     84.7       94.3     91.1
1     HunyuanVideo 1.5        Open source   95.3     83.0       94.9     91.1
3     KlingAI 2.5 Turbo Pro   Proprietary   91.0     86.4       95.1     90.8
4     Ray 3.0                 Proprietary   86.7     82.5       93.8     87.7
5     Kandinsky 5.0 pro       Open source   86.7     80.7       92.1     86.5
6     Sora-2                  Proprietary   90.8     78.2       89.4     86.1
6     Wan 2.2                 Open source   94.0     78.8       85.6     86.1
8     Veo 3.1 fast            Proprietary   84.6     79.8       93.3     85.9
9     Hailuo 02               Proprietary   89.1     77.1       91.2     85.8
10    PixVerse 5.5            Proprietary   87.0     78.6       91.0     85.5
11    Wan 2.6                 Proprietary   89.6     78.0       86.8     84.8
12    Pika v2.2               Proprietary   88.2     74.8       81.8     81.6
13    CogVideoX-5B            Open source   73.8     64.3       86.3     74.8
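The Rank column appears to follow standard competition ranking: models with equal Overall scores share a rank, and the next rank is skipped (1, 1, 3 and 6, 6, 8). The sketch below reproduces the ranks from the Overall scores listed above. The function name `competition_rank` is illustrative, and the ranking rule is inferred from the leaderboard rather than stated in the source.

```python
# Overall scores copied from the leaderboard.
overall = {
    "Seedance 1.0 Pro fast": 91.1,
    "HunyuanVideo 1.5": 91.1,
    "KlingAI 2.5 Turbo Pro": 90.8,
    "Ray 3.0": 87.7,
    "Kandinsky 5.0 pro": 86.5,
    "Sora-2": 86.1,
    "Wan 2.2": 86.1,
    "Veo 3.1 fast": 85.9,
    "Hailuo 02": 85.8,
    "PixVerse 5.5": 85.5,
    "Wan 2.6": 84.8,
    "Pika v2.2": 81.6,
    "CogVideoX-5B": 74.8,
}

def competition_rank(scores):
    """Rank by descending score; ties share a rank and the next rank is skipped."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        rank = prev_rank if score == prev_score else position
        ranks[name] = rank
        prev_score, prev_rank = score, rank
    return ranks

ranks = competition_rank(overall)
for name, rank in ranks.items():
    print(rank, name, overall[name])
```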

Fitting Process and Results

Metrics II, III, V, and VI depend on a biomechanics-aware skeleton fitting pipeline. We first infer 87 anatomical 3D keypoints from monocular observations, then optimize a physics-based human skeletal model to match those keypoints and export an OpenSim-compatible trajectory for downstream analysis.

The HumanScore two-stage fitting pipeline: 3D keypoint inference from monocular video, followed by iterative optimization of a biomechanics-informed skeleton model.
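To give a feel for the second stage, here is a toy version of fitting a skeleton to keypoints: a planar two-segment chain with fixed segment lengths whose joint angles are optimized by gradient descent to match observed keypoint positions. Everything here is an assumption made for illustration (the 2D chain, the squared-distance loss, the numerical gradient, the step size); the actual pipeline optimizes a physics-based skeletal model against 87 3D keypoints and is not reproduced by this sketch.

```python
import math

L1, L2 = 1.0, 1.0  # fixed segment lengths (hypothetical units)

def forward(t1, t2):
    """Keypoints (middle joint, end point) of a planar chain rooted at the origin."""
    mid = (L1 * math.cos(t1), L1 * math.sin(t1))
    end = (mid[0] + L2 * math.cos(t1 + t2), mid[1] + L2 * math.sin(t1 + t2))
    return mid, end

def loss(angles, observed):
    """Sum of squared distances between model keypoints and observed keypoints."""
    return sum(math.dist(p, q) ** 2 for p, q in zip(forward(*angles), observed))

def fit(observed, steps=2000, lr=0.05, h=1e-6):
    """Gradient descent on joint angles using central-difference gradients."""
    angles = [0.0, 0.0]
    for _ in range(steps):
        grad = []
        for i in range(len(angles)):
            plus = list(angles)
            minus = list(angles)
            plus[i] += h
            minus[i] -= h
            grad.append((loss(plus, observed) - loss(minus, observed)) / (2 * h))
        angles = [a - lr * g for a, g in zip(angles, grad)]
    return angles

# Synthetic "observed" keypoints generated from known angles (radians).
observed = forward(0.6, 0.8)
t1, t2 = fit(observed)
final_loss = loss([t1, t2], observed)
print(f"t1={t1:.4f}, t2={t2:.4f}, loss={final_loss:.2e}")
```

Because the segment lengths are fixed in the model, the fitted skeleton cannot stretch to follow inconsistent keypoints, which is what makes the fitted trajectory a useful reference for downstream analysis.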

Representative Fitting Results

Each row shows one good case and one bad case for the same model. This makes it easier to see where the fitting pipeline remains stable and where unrealistic motion causes the fit to break down.

BibTeX

@misc{fang2026humanscorebenchmarkinghumanmotions,
  title={HumanScore: Benchmarking Human Motions in Generated Videos},
  author={Yusu Fang and Tiange Xiang and Tian Tan and Narayan Schuetz and Scott Delp and Li Fei-Fei and Ehsan Adeli},
  year={2026},
  eprint={2604.20157},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.20157},
}