Image Retrieval using Scene Graphs

Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Li Fei-Fei

Ground truth scene graphs

Randomly selected human-generated ground-truth scene graphs from our dataset. The full dataset contains 5000 such scene graphs.

Scene graph groundings

Examples of scene graph groundings computed by our models, similar to Figure 8(c) from the paper. We present several success cases, where our full model SG-obj-attr-rel performs better than the baseline models. In addition, we show failure cases, where our full model performs worse than the baseline models.


Failure cases

Dataset classes

As mentioned in Section 4.2 from the paper, we use a subset of our full dataset for all experiments. The subset has 266 object classes, 145 attribute types, and 68 relationship types. The full lists of these object classes, attribute types, and relationship types are shown below.

Data collection

Section 4.2 of the paper gives only a brief description of our data collection pipeline; this section describes it in more detail. Our pipeline consists of four types of HITs (Human Intelligence Tasks) on Amazon's Mechanical Turk (AMT):
  • Write phrases
  • Verify objects
  • Redraw bounding boxes
  • Verify phrases

Each image passes through these four phases in order. In the first phase, a single worker writes phrases describing the image and draws bounding boxes around objects mentioned by the phrases. In the subsequent three phases, other workers verify the correctness of these phrases. After the image has made a complete pass through all four phases, the process iterates: the image returns to the beginning, and another worker writes more phrases about the image and possibly adds new objects to the image, which are corrected and verified in the other phases. In practice each image makes three complete passes through the four phases.

Having multiple workers write phrases about each image means that our annotations have greater variety and recall. Having these workers annotate the image in serial rather than in parallel eliminates the need to consolidate duplicate annotations between workers, as each worker builds upon the annotations of the previous worker rather than starting from scratch.
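As a rough illustration of the serial schedule described above, the following sketch lists the order in which one image visits the phases. The phase names are taken from the section headings; the function name and the tuple layout are hypothetical, and the sketch ignores the loop in which redrawn boxes return to object verification.

```python
from typing import List, Tuple

# Phase names follow the section headings of this page.
PHASES = ("write phrases", "verify objects", "redraw bounding boxes", "verify phrases")


def annotation_schedule(num_passes: int = 3) -> List[Tuple[int, str]]:
    """Return (pass_number, phase) pairs in the order a single image visits them.

    Each pass starts from the annotations left by the previous pass, so the
    passes run one after another rather than in parallel.
    """
    return [(n, phase) for n in range(1, num_passes + 1) for phase in PHASES]
```

With the default of three passes, an image visits twelve phase slots in total, and the second worker's writing task is the fifth slot.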

Write phrases

In this task we show an image to workers and ask them to write phrases to describe the objects in the image. Each phrase is composed of three chunks. If the middle chunk is one of "is", "is a", "is an", "are", "are an", "looks", or "looks like" then the phrase is classified as "descriptive" and gives an attribute of an object; otherwise the phrase gives a relationship between two objects. Workers must write at least 15 phrases total, with at least 5 "descriptive" phrases and 5 "relationship" phrases.
After writing a phrase to describe objects in the image, workers are asked to draw bounding boxes to ground the objects mentioned in the phrase to the image. For descriptive phrases, workers need to ground only the first chunk; for relationship phrases workers must ground the first and third chunk.
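The classification and grounding rules above can be sketched in a few lines. This is a minimal illustration, not the code used for the HIT: the function names are hypothetical, and normalizing the middle chunk by trimming whitespace and lowercasing is an assumption.

```python
# Middle chunks that mark a phrase as "descriptive", as listed in the text.
DESCRIPTIVE_MIDDLES = {"is", "is a", "is an", "are", "are an", "looks", "looks like"}


def classify_phrase(chunks):
    """Classify a (first, middle, third) phrase as 'descriptive' or 'relationship'."""
    _, middle, _ = chunks
    # Assumption: matching ignores case and surrounding whitespace.
    if middle.strip().lower() in DESCRIPTIVE_MIDDLES:
        return "descriptive"
    return "relationship"


def chunks_to_ground(chunks):
    """Return the chunks that need a bounding box for this phrase."""
    first, _, third = chunks
    if classify_phrase(chunks) == "descriptive":
        return [first]          # only the described object is grounded
    return [first, third]       # both objects of the relationship are grounded


def meets_minimums(phrases, min_total=15, min_each=5):
    """Check the per-worker quota: 15 phrases, at least 5 of each kind."""
    kinds = [classify_phrase(p) for p in phrases]
    return (
        len(kinds) >= min_total
        and kinds.count("descriptive") >= min_each
        and kinds.count("relationship") >= min_each
    )
```

For instance, ("dog", "is", "black") is descriptive and needs one box, while ("dog", "holding", "stick") is a relationship and needs two.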
In the pictured example the worker has grounded both the dog and the stick for the phrase "dog holding stick" and has written a new phrase: "dog is black." If one of the chunks of the phrase matches an object that has already been annotated for this image, the worker is prompted to see if the new phrase refers to the same object. This allows multiple phrases, possibly written by different workers, to refer to the same annotated object.
In this case, the phrase "dog is black" does not refer to the dog that was already annotated, so the worker can draw a new bounding box for this new dog.
Sometimes a worker may write a phrase that refers to an object that has already been annotated for this image, but will use a different word to refer to the object. In this case the phrase "dog holding stick" refers to the white dog, and the worker has written "doggy is white"; this phrase ought to refer to the same bounding box in the image, but this cannot be detected automatically.
To circumvent this problem, whenever a worker draws a bounding box that has high intersection over union with an object that has already been labeled for this image, the worker is prompted to decide whether the two boxes actually refer to the same object.
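The overlap check can be sketched as follows. Intersection over union is the standard ratio of the overlap area to the combined area of two boxes; the (x_min, y_min, x_max, y_max) box layout, the function names, and the 0.5 threshold are assumptions made for illustration, since the text does not state what counts as "high".

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def objects_to_confirm(new_box, existing_boxes, threshold=0.5):
    """Names of already-labeled objects the worker should be asked about.

    existing_boxes maps an object name to its box. The threshold value is
    a placeholder, not a value reported in the source.
    """
    return [
        name for name, box in existing_boxes.items()
        if iou(new_box, box) >= threshold
    ]
```

A new "doggy" box that nearly coincides with an existing "dog" box would trigger the prompt, while a box elsewhere in the image would not.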

Verify objects

In this phase we ask three workers to vote on each annotated object. Each object may be classified as either "bad", "good with bad box", or "good with good box".
An example of a bad object. Objects must be physical objects, not descriptive terms or abstract concepts, and must be located in the image near the bounding box.
An example of a good object with a bad bounding box. Bounding boxes must tightly enclose the entire visible portion of the object.
An example of a good object with a good bounding box.
As a result of this phase, there are four possible outcomes for each object:
  • If at least two workers decide that the object is bad, we discard the object and any phrases that refer to the bad object.
  • If at least two workers decide that the object is good and has a good bounding box, we keep the object and proceed to verify the phrases that refer to the object.
  • If at least two workers decide that the object is good but fewer than two workers think that the bounding box is correct, and the object has not yet had three different bounding boxes marked as incorrect, this object has its bounding box redrawn in the next phase.
  • If at least two workers decide that the object is good but fewer than two workers think that the bounding box is correct, and this object has already had three different bounding boxes marked as incorrect, then we discard the object. This prevents objects from getting caught in an infinite loop of bad bounding boxes.
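The four outcomes above amount to a small decision rule over the three votes. The sketch below is one possible reading, not the pipeline's actual code: the function name and outcome labels are hypothetical, and it assumes that `rejected_boxes` counts every box rejected for this object so far, including the one just voted on.

```python
LABELS = {"bad", "good with bad box", "good with good box"}


def object_outcome(votes, rejected_boxes=1, max_rejected_boxes=3):
    """Map three worker votes to 'discard', 'keep', or 'redraw'.

    rejected_boxes is only consulted when the current box is rejected.
    Assumption: it includes the box that was just voted on.
    """
    if any(v not in LABELS for v in votes):
        raise ValueError("unknown vote label")
    bad = votes.count("bad")
    good_box = votes.count("good with good box")
    if bad >= 2:
        return "discard"   # majority says the object itself is bad
    if good_box >= 2:
        return "keep"      # majority accepts both the object and its box
    # Remaining case: the object is good by majority, but the box is not.
    if rejected_boxes >= max_rejected_boxes:
        return "discard"   # stop after three failed boxes
    return "redraw"
```

With three votes, fewer than two "bad" votes implies at least two "good" votes of either kind, so the final two branches cover exactly the last two bullets.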

Redraw bounding boxes

We ask workers to redraw bounding boxes that were marked as bad in the previous step. To prevent workers from drawing the same bad bounding box again, we show them the existing bad bounding box for the object.
After the bounding box has been redrawn, the object is sent back to the object verification phase.

Verify phrases

In this phase we ask three workers to vote on each phrase from the first phase. We only verify phrases once their objects have been verified and have good bounding boxes. If at least two workers mark the phrase as correct then it becomes part of our dataset; otherwise the phrase is discarded.
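A minimal sketch of this rule, assuming each vote is recorded as a boolean (True when the worker marks the phrase as correct). The function name, the outcome labels, and the `objects_verified` flag are hypothetical.

```python
def phrase_outcome(votes, objects_verified):
    """Decide the fate of a phrase from three boolean worker votes.

    objects_verified should be True only when every object the phrase
    mentions has been verified and has a good bounding box.
    """
    if not objects_verified:
        return "pending"   # the phrase is not shown to voters yet
    return "accept" if sum(votes) >= 2 else "discard"
```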
An example of a good relationship phrase.
We verify attributes and relationships in the same phase.
An example of a bad relationship phrase. Both objects are correct and have good bounding boxes, but the relationship does not actually hold between these objects.

Aggregate scene graph

An aggregate scene graph constructed using the 400 most common (object, relationship, object) and (object, attribute) tuples from our full dataset. This is similar to Figure 5 from the paper, but larger.
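Selecting the most common tuples can be sketched with a frequency count. The per-graph dictionary layout and the function name below are hypothetical, and the sketch assumes that relationship and attribute tuples are ranked in a single pooled count, which the text does not specify.

```python
from collections import Counter


def most_common_tuples(scene_graphs, k=400):
    """Return the k most frequent tuples across all scene graphs.

    Each graph is assumed to be a dict with two lists:
      "relationships": (object, relationship, object) tuples
      "attributes":    (object, attribute) tuples
    """
    counts = Counter()
    for graph in scene_graphs:
        counts.update(graph["relationships"])
        counts.update(graph["attributes"])
    # most_common sorts by count, highest first.
    return [tup for tup, _ in counts.most_common(k)]
```

The returned tuples can then serve as the edges of the aggregate graph: three-element tuples connect two object nodes, and two-element tuples attach an attribute to an object node.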