Visual Reasoning in the Real World

A New Dataset for Visual Question Answering

Drew Hudson & Christopher Manning

The GQA Dataset

Question Answering on Image Scene Graphs

Semantic Representations

Each image comes with a scene graph of objects and relations. Each question comes with a structured representation of its semantics.
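To make the two representations concrete, here is a minimal Python sketch of a scene graph and a structured question executed against it. The class, field, and operation names (`SceneObject`, `select`, `relate`, `query`) and the toy data are illustrative assumptions, not the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SceneObject:
    """One node of a (hypothetical) scene graph."""
    name: str
    attributes: List[str] = field(default_factory=list)
    # Maps a relation name to the ids of the objects it points at.
    relations: Dict[str, List[str]] = field(default_factory=dict)


# Toy scene graph for a single image: object id -> object.
scene_graph = {
    "o1": SceneObject("apple", ["red"], {"on": ["o2"]}),
    "o2": SceneObject("table", ["wooden"]),
}

# Assumed structured form of "What color is the apple on the table?"
program = [
    ("select", "table"),
    ("relate", "apple", "on"),  # apples that are "on" the selected objects
    ("query", "color"),
]

# Small illustrative vocabulary used to recognise color attributes.
COLORS = {"red", "green", "yellow", "brown"}


def execute(program, graph):
    """Run the reasoning steps one by one over the scene graph."""
    current = []
    for step in program:
        op = step[0]
        if op == "select":
            current = [oid for oid, obj in graph.items() if obj.name == step[1]]
        elif op == "relate":
            _, name, rel = step
            current = [
                oid for oid, obj in graph.items()
                if obj.name == name
                and any(t in current for t in obj.relations.get(rel, []))
            ]
        elif op == "query":
            if step[1] != "color":
                raise ValueError("this sketch only supports color queries")
            for oid in current:
                for attr in graph[oid].attributes:
                    if attr in COLORS:
                        return attr
            return None
        else:
            raise ValueError(f"unknown operation: {op}")
    return current


answer = execute(program, scene_graph)
```

Each step narrows or transforms the set of candidate objects, which is what makes a question "multi-step".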


22M multi-step questions that require a diverse set of reasoning skills, with both binary and open questions.


The answer distribution biases are reduced for each question type to mitigate language priors and prevent educated guesses.
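As a rough illustration of the idea, the sketch below caps how much more frequent any answer may be than the next less frequent one within the same question type. This is a simplified stand-in chosen for clarity, not the balancing procedure used to build GQA; the function name `smooth_answers` and the `max_ratio` parameter are assumptions.

```python
from collections import Counter, defaultdict


def smooth_answers(samples, max_ratio=2.0):
    """Downsample (question_type, answer) pairs so that, within each type,
    no answer is kept more than `max_ratio` times as often as the next
    less frequent answer. Assumes max_ratio >= 1."""
    counts = defaultdict(Counter)
    for qtype, answer in samples:
        counts[qtype][answer] += 1

    # Work upward from the rarest answer and derive a quota for each one.
    quota = {}
    for qtype, counter in counts.items():
        ranked = sorted(counter.items(), key=lambda kv: (kv[1], kv[0]))
        prev = None
        for answer, n in ranked:
            keep = n if prev is None else min(n, int(max_ratio * prev))
            quota[(qtype, answer)] = keep
            prev = keep

    # Keep the first occurrences of each answer until its quota is used up.
    kept = []
    for qtype, answer in samples:
        if quota[(qtype, answer)] > 0:
            quota[(qtype, answer)] -= 1
            kept.append((qtype, answer))
    return kept
```

After smoothing, always guessing the most common answer for a question type pays off less, while the relative ordering of answers is preserved.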

Strong Supervision

The structured representations allow for a stronger and more informative error signal during training.

New Metrics

A suite of new metrics to evaluate not only accuracy, but also the consistency, validity and plausibility of responses.
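Two of these metrics can be sketched in a few lines of Python. The definitions below are a simplified reading: validity checks that an answer falls within the question's answer scope, and consistency checks whether a model that answers a question correctly also answers the related (entailed) questions correctly. The function names and input layouts are assumptions; the paper gives the exact definitions.

```python
def validity(predictions, valid_answers):
    """Share of predictions that lie in their question's answer scope
    (e.g. some color for a color question)."""
    hits = sum(1 for qid, ans in predictions.items() if ans in valid_answers[qid])
    return hits / len(predictions)


def consistency(predictions, gold, entailed):
    """For every correctly answered question that has entailed questions,
    compute accuracy over those entailed questions, then average."""
    scores = []
    for qid, ans in predictions.items():
        related = entailed.get(qid)
        if ans != gold[qid] or not related:
            continue
        correct = sum(1 for r in related if predictions.get(r) == gold[r])
        scores.append(correct / len(related))
    return sum(scores) / len(scores) if scores else 0.0
```

Here `predictions` and `gold` map question ids to answers, `valid_answers` maps a question id to its set of in-scope answers, and `entailed` maps a question id to the ids of the questions it entails.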

Thorough Diagnosis

Supports careful analysis based on question and answer type, length, number of reasoning steps and difficulty.
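A breakdown of this kind amounts to grouping results by a question property. The sketch below assumes each result is a dict with `prediction` and `answer` fields plus whatever property is being analysed (here a hypothetical `steps` field); none of these names come from the dataset itself.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return accuracy per distinct value of `key` across the records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        correct[group] += int(rec["prediction"] == rec["answer"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy results: two 2-step questions and two 3-step questions.
records = [
    {"steps": 2, "prediction": "yes", "answer": "yes"},
    {"steps": 2, "prediction": "red", "answer": "red"},
    {"steps": 3, "prediction": "left", "answer": "right"},
    {"steps": 3, "prediction": "dog", "answer": "dog"},
]
per_step_accuracy = accuracy_by(records, "steps")
```

The same helper works for any other field attached to the records, such as a question type or answer type.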

Join the 2020 GQA Challenge for Real-World Visual Reasoning

GQA images are from COCO and Flickr. The image scene graphs are based on a
new, cleaner version of Visual Genome. We thank the COCO, Flickr, and Visual
Genome teams for their great work!

Read the Paper!
    title={GQA: A New Dataset for Real-World Visual Reasoning 
    and Compositional Question Answering},
    author={Hudson, Drew A and Manning, Christopher D},
    journal={Conference on Computer Vision and Pattern Recognition (CVPR)},