Images are not simply sets of objects: each image represents a web of interconnected relationships. These relationships between entities carry semantic meaning and help a viewer differentiate between instances of an entity. For example, in an image of a soccer match, there may be multiple persons present, but each participates in different relationships: one is kicking the ball, and another is guarding the goal. In this paper, we formulate the task of utilizing these "referring relationships" to disambiguate between entities of the same category. We introduce an iterative model that localizes the two entities in the referring relationship, conditioned on one another. We formulate the cyclic condition between the entities in a relationship by modelling predicates that connect the entities as shifts in attention from one entity to another. We demonstrate that our model not only outperforms existing approaches on three datasets --- CLEVR, VRD and Visual Genome --- but also produces visually meaningful predicate shifts, as an instance of interpretable neural networks. Finally, we show that by modelling predicates as attention shifts, we can even localize entities in the absence of their category, allowing our model to find completely unseen categories.

CVPR 2018


Download the paper here.


Download the Visual Genome dataset here.

Download the VRD dataset here.

Download the CLEVR dataset here.


Download the code here.


@inproceedings{krishna2018referring,
  title={Referring Relationships},
  author={Krishna, Ranjay and Chami, Ines and Bernstein, Michael and Fei-Fei, Li},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
  year={2018}
}

Referring Relationships Task

Given a visual relationship, we learn to use it to disambiguate between entities of the same category in an image. For example, we can localize which person is kicking the ball versus which person is guarding the goal.

Model Design

We design an iterative model that learns to use predicates in visual relationships as attention shifts, inspired by the moving spotlight theory in psychology. Given an initial estimate of the ball, the model learns where the person kicking it must be. Similarly, given an estimate of the person, it learns where the ball must be. By iterating between these two estimates, the model eliminates other instances and converges on the correct ones.
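The alternating refinement described above can be sketched in NumPy. Here the predicate shift is a small cross-correlation kernel applied to an attention map; the hard-coded 3x3 kernels, the function names (`shift`, `iterate`), and the fixed number of iterations are illustrative assumptions standing in for the learned convolutional shift modules in the actual model.

```python
import numpy as np

def shift(att, kernel):
    """Apply a predicate shift to an attention map via cross-correlation.
    In the real model the kernel behaviour is learned; here it is a
    fixed toy array used only to illustrate the mechanism."""
    H, W = att.shape
    kh, kw = kernel.shape
    pad = np.pad(att, ((kh // 2,), (kw // 2,)), mode="constant")
    out = np.zeros_like(att)
    for i in range(H):
        for j in range(W):
            out[i, j] = (pad[i:i + kh, j:j + kw] * kernel).sum()
    return out

def iterate(subj_heat, obj_heat, pred_kernel, inv_kernel, steps=3):
    """Alternately refine the subject and object attention maps:
    shift one entity's attention through the predicate, then intersect
    it with the other entity's category heatmap."""
    subj, obj = subj_heat.copy(), obj_heat.copy()
    for _ in range(steps):
        obj = obj_heat * shift(subj, pred_kernel)   # object, given subject
        subj = subj_heat * shift(obj, inv_kernel)   # subject, given object
    return subj, obj
```

Intersecting the shifted attention with the category heatmap is what suppresses the other instances of the same category: only the instance consistent with the relationship survives repeated refinement.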

Example outputs from our model.

Interpretable shifts

Without any supervision to guide the shifting process, our model generates shifts that are interpretable. For example, when moving attention from a subject that is to the "left of" the object, it learns to shift the attention to the right.
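A minimal sketch of what such an interpretable shift looks like, assuming the "left of" predicate has been learned: the subject lies to the left of the object, so the object prior is the subject's attention translated one column to the right. The one-column magnitude and the `left_of_shift` helper are illustrative assumptions; the direction is hard-coded here only to mirror what the model discovers without supervision.

```python
import numpy as np

def left_of_shift(subject_att):
    """If the subject is 'left of' the object, the object should lie
    to the subject's right: translate the attention map one column
    rightward, zero-filling the edge so no mass wraps around."""
    shifted = np.roll(subject_att, 1, axis=1)
    shifted[:, 0] = 0.0
    return shifted

subj = np.zeros((4, 4))
subj[1, 0] = 1.0                 # subject attention mass at the left edge
obj_prior = left_of_shift(subj)  # mass now sits one column to the right
```

The inverse predicate ("right of") would simply shift attention the opposite way, which is why the same mechanism supports conditioning in both directions of the relationship.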