Each image comes with a scene graph of objects and relations. Each question comes with a structured representation of its semantics.
22M multi-step questions, both binary and open, that require a diverse set of reasoning skills.
Answer distribution biases are reduced within each question type to mitigate language priors and discourage educated guesses.
The structured representations allow for a stronger and more informative error signal during training.
A suite of new metrics to evaluate not only accuracy, but also the consistency, validity and plausibility of responses.
Supports careful analysis based on question and answer type, length, number of reasoning steps and difficulty.
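To make the idea of scene graphs and structured question representations concrete, here is a minimal Python sketch. It is not the dataset's actual file format or API: the `SceneObject` class, the toy scene, the operation names (`select`, `relate`, `query_color`), and the `COLORS` vocabulary are all hypothetical, chosen only to illustrate how a multi-step question such as "What color is the apple on the table?" could be answered by executing reasoning steps over objects and relations.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """One node of a toy scene graph (hypothetical structure, not the GQA schema)."""
    name: str
    attributes: set = field(default_factory=set)
    # Maps a relation name (e.g. "on") to the ids of the target objects.
    relations: dict = field(default_factory=dict)


# Hypothetical scene graph: object id -> object.
scene = {
    "o1": SceneObject("table", {"wooden"}),
    "o2": SceneObject("apple", {"red"}, {"on": ["o1"]}),
    "o3": SceneObject("apple", {"green"}),
}

# Assumed closed vocabulary of color attributes for this sketch.
COLORS = {"red", "green", "yellow"}


def run(program, scene):
    """Execute a list of reasoning steps over the scene graph.

    Returns an answer string for a query step, or the list of
    currently selected object ids if the program has no query step.
    """
    current = []
    for op, *args in program:
        if op == "select":
            # Select every object with the given name.
            current = [oid for oid, obj in scene.items() if obj.name == args[0]]
        elif op == "relate":
            # Keep objects called `name` that stand in relation `rel`
            # to at least one currently selected object.
            name, rel = args
            current = [
                oid for oid, obj in scene.items()
                if obj.name == name
                and any(t in current for t in obj.relations.get(rel, []))
            ]
        elif op == "query_color":
            # Report a color attribute of the selected objects, if any.
            colors = sorted(a for oid in current for a in scene[oid].attributes & COLORS)
            return colors[0] if colors else None
        else:
            raise ValueError(f"unknown operation: {op}")
    return current


# "What color is the apple on the table?" as three reasoning steps.
program = [("select", "table"), ("relate", "apple", "on"), ("query_color",)]
answer = run(program, scene)
print(answer)
```

Because each intermediate step yields an inspectable set of objects, a representation along these lines lets an error be traced to a specific reasoning step instead of only to the final answer.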
Join the 2020 GQA Challenge for Real-World Visual Reasoning
GQA images are from COCO and Flickr. The image scene graphs are based on a
new, cleaner version of Visual Genome. We thank the COCO, Flickr, and Visual
Genome teams for their great work!
@article{hudson2018gqa,
  title={GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering},
  author={Hudson, Drew A and Manning, Christopher D},
  journal={Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2019}
}