Each image passes through these four phases in order. In the first phase, a single worker writes phrases describing the image and draws bounding boxes around the objects mentioned in the phrases. In the subsequent three phases, other workers verify the correctness of these phrases. After the image has made a complete pass through all four phases, the process iterates: the image returns to the first phase, where another worker writes more phrases and possibly adds new objects, which are in turn corrected and verified in the remaining phases. In practice, each image makes three complete passes through the four phases.
Having multiple workers write phrases about each image means that our annotations have greater variety and recall. Having these workers annotate the image in serial rather than in parallel eliminates the need to consolidate duplicate annotations between workers, as each worker builds upon the annotations of the previous worker rather than starting from scratch.
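The serial structure can be illustrated with a minimal sketch. The function names, the annotation records, and the stub phases below are hypothetical stand-ins, not the actual task code; the three later phases are collapsed into one generic `verify` stub. The point is only that every pass receives the annotations accumulated so far, so there is nothing to merge afterwards.

```python
def run_serial_annotation(image_id, phases, num_passes=3):
    """Run an image through every phase in order, num_passes times.

    Each phase receives the annotations produced so far and returns an
    updated list, so later workers build on earlier work.
    """
    annotations = []
    for pass_index in range(num_passes):
        for phase in phases:
            annotations = phase(image_id, annotations, pass_index)
    return annotations


# Hypothetical stub phases, used only to make the sketch runnable.
def write_phrases(image_id, annotations, pass_index):
    new_phrase = {"phrase": f"phrase from worker {pass_index}", "verified": False}
    return annotations + [new_phrase]


def verify(image_id, annotations, pass_index):
    return [dict(a, verified=True) for a in annotations]


# One writing phase followed by three checking phases (assumed layout).
PHASES = [write_phrases, verify, verify, verify]
result = run_serial_annotation("image-0", PHASES)
```

With three passes, `result` holds one phrase per pass, each of which has been through the checking phases.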
In this task we show an image to workers and ask them to write phrases to describe the
objects in the image. Each phrase is composed of three chunks. If the middle chunk is
one of "is", "is a", "is an", "are", "are an", "looks", or "looks like" then the phrase
is classified as "descriptive" and gives an attribute of an object; otherwise the phrase
gives a relationship between two objects. Workers must write at least 15 phrases total,
with at least 5 "descriptive" phrases and 5 "relationship" phrases.
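The classification rule and the phrase quota can be sketched as follows. The representation of a phrase as a 3-tuple of strings, the function names, and the lowercase normalization are assumptions made for illustration; the list of middle chunks and the thresholds are taken from the text above.

```python
# Middle chunks that mark a phrase as "descriptive" (from the text above).
DESCRIPTIVE_MIDDLES = {"is", "is a", "is an", "are", "are an", "looks", "looks like"}


def classify_phrase(chunks):
    """Return "descriptive" or "relationship" for a (first, middle, third) phrase."""
    _, middle, _ = chunks
    # Normalizing case and whitespace is an assumption of this sketch.
    if middle.strip().lower() in DESCRIPTIVE_MIDDLES:
        return "descriptive"
    return "relationship"


def meets_quota(phrases, min_total=15, min_per_kind=5):
    """Check the minimum number of phrases overall and of each kind."""
    kinds = [classify_phrase(p) for p in phrases]
    return (
        len(kinds) >= min_total
        and kinds.count("descriptive") >= min_per_kind
        and kinds.count("relationship") >= min_per_kind
    )
```

For example, `classify_phrase(("dog", "is", "black"))` is classified as descriptive, while `classify_phrase(("dog", "holding", "stick"))` is classified as a relationship.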
After writing a phrase to describe objects in the image, workers are asked to draw
bounding boxes to ground the objects mentioned in the phrase to the image.
For descriptive phrases, workers need to ground only the first chunk; for
relationship phrases, workers must ground both the first and third chunks.
In the example above the worker has grounded both the dog and the stick for the phrase
"dog holding stick" and has written a new phrase: "dog is black." If one of the chunks
of the phrase matches an object that has already been annotated for this image, the
worker is asked whether the new phrase refers to the same object. This allows
multiple phrases, possibly written by different workers, to refer to the same annotated
object.
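A minimal sketch of the name-matching check follows. The object records (dictionaries with `id` and `name` keys), the function name, and the use of exact, case-insensitive string comparison are assumptions; the text only states that a chunk "matches" an existing object.

```python
def find_name_matches(phrase_chunks, annotated_objects):
    """Return already-annotated objects whose name equals the first or third chunk.

    The worker would then be asked, for each match, whether the new phrase
    refers to that existing object or to a different one.
    """
    first, _, third = phrase_chunks
    candidates = {first.strip().lower(), third.strip().lower()}
    return [obj for obj in annotated_objects if obj["name"].lower() in candidates]


# Hypothetical state after "dog holding stick" has been grounded.
existing = [{"id": 1, "name": "dog"}, {"id": 2, "name": "stick"}]
matches = find_name_matches(("dog", "is", "black"), existing)
```

Here `matches` contains only the existing "dog" object, so the worker would be asked whether "dog is black" describes that dog or a new one. A check of this kind only catches identical names, so a synonym would pass through without a prompt.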
In this case, the phrase "dog is black" does not refer to the dog that was already
annotated, so the worker can draw a new bounding box for this new dog.
Sometimes a worker may write a phrase that refers to an object that has already been
annotated for this image, but will use a different word to refer to the object.
In this case the phrase "dog holding stick" refers to the white dog, and the worker
has written "doggy is white"; this phrase ought to refer to the same bounding box
in the image, but this cannot be detected automatically.
To circumvent this problem, whenever a worker draws a bounding box that has high
intersection over union with an object that has already been labeled for this image,
the worker is prompted to decide whether the two boxes actually refer to the same
object.
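The overlap check can be sketched with the standard intersection-over-union computation. The box format `(x_min, y_min, x_max, y_max)`, the function names, and the threshold of 0.5 are assumptions; the text says only that the overlap must be "high".

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def boxes_to_confirm(new_box, existing_boxes, threshold=0.5):
    """Return names of existing objects that overlap new_box enough to ask the worker.

    existing_boxes maps an object name to its box; threshold is an assumed value.
    """
    return [name for name, box in existing_boxes.items() if iou(new_box, box) >= threshold]
```

For instance, a new box drawn for "doggy" that nearly coincides with the existing "dog" box would be returned by `boxes_to_confirm`, and the worker would then decide whether both boxes refer to the same object.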
An example of a bad object. Objects must be physical objects,
not descriptive terms or abstract concepts, and must be located
in the image near the bounding box.
An example of a good object with a bad bounding box.
Bounding boxes must tightly enclose the entire visible portion of the
object.
An example of a good object with a good bounding box.
We ask workers to redraw bounding boxes that were marked as bad
in the previous step. To prevent workers from drawing the same
bad bounding box again, we show them the existing bad bounding
box for the object.
After the bounding box has been redrawn, the object is sent back to
the object verification phase.
An example of a good relationship phrase.
We verify attributes and relationships in the same phase.
An example of a bad relationship phrase. Both objects are correct and
have good bounding boxes, but the relationship does not actually hold
between these objects.