- Total images:
- Average image size (px):
- Maximum image size (px):
- Minimum image size (px):
- Total region descriptions: 4,297,502
- Total image object instances: 1,366,673
- Unique image objects: 75,729
- Total object-object relationship instances: 1,531,448
- Unique relationships: 40,480
- Total attribute-object instances: 1,670,182
- Unique attributes: 40,513
- Total Scene Graphs: 108,249
- Total Region Graphs: 3,788,715
- Total Question Answers: 1,773,258
Per region annotation:
- Average number of objects:
- Average number of relationships:
- Average number of attributes:
- Most common objects:
- Most common predicates:
- Most common attributes:
- Average word length in description:
- Average region width:
- Average region height:
Region Description Word Length
Region per Image Distribution
Top Region Phrases
Top Region Words
Sentence Objects per Image
Sentence Objects per Region
Sentence Objects per Bounding Box
Top Object Names
Objects and Categories
- ImageNet Detection (training/validation):
- PASCAL Detection (training/validation):
- Zitnick Abstract Scenes:
- 350,000 (2,300 unique pedestrians)
- Objects per category:
Objects and Actions
* This dataset uses binary attributes.
** This dataset has 6 attribute "categories"; humans chose how much of each attribute an object had.
Attributes per Image
Attributes per Region
Attributes per Sentence Object
Attributes per Category:
Top Attributes on People*
* Attributes describing all instances of "people", e.g. "man," "woman," "person."
Relationships per Image
Relationships per Region
Relationships per Sentence Object
Top Person-Like Relationships*
* Relationships where both the subject and object are instances of "people", e.g. "man," "woman," "person."
We ran an experiment to test how diverse our region descriptions are: we clustered all of the descriptions into semantic clusters (see the paper for details).
Some example clusters are shown in the infographic at the bottom. We found that, on average, each image had descriptions from 17 different clusters.
We ran the same experiment on Microsoft COCO's sentences and found that our images had descriptions from more clusters than COCO's.
Clusters per Image*
* Clusters found through MiniBatch K-means on vector representations of each region annotation phrase. These vectors were formed by averaging the Word2Vec vectors of the words in the phrase.
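The pipeline in this footnote can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the phrases are toy examples, random vectors replace the pretrained Word2Vec embeddings, and a plain Lloyd's k-means replaces scikit-learn's MiniBatchKMeans.

```python
import random
from collections import defaultdict

random.seed(0)
DIM, K = 8, 2  # toy embedding size and cluster count (placeholders)

# Hypothetical word vectors; in the paper these come from pretrained Word2Vec.
vocab = {w: [random.gauss(0, 1) for _ in range(DIM)]
         for w in "a man rides horse dog sleeps on the couch".split()}

def phrase_vector(phrase):
    """Average the vectors of every in-vocabulary word in the phrase."""
    vecs = [vocab[w] for w in phrase.lower().split() if w in vocab]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means standing in for scikit-learn's MiniBatchKMeans."""
    centers = random.sample(points, k)
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            groups[i].append(p)
        centers = [[sum(c) / len(g) for c in zip(*g)] if (g := groups[i]) else centers[i]
                   for i in range(k)]
    return [min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points]

# Toy "region descriptions" from a single image.
phrases = ["a man rides a horse", "dog sleeps on the couch",
           "man on a horse", "the dog on the couch"]
labels = kmeans([phrase_vector(p) for p in phrases], K)
# Diversity measure from the experiment: distinct clusters among this
# image's region descriptions.
clusters_for_image = len(set(labels))
```

The per-image diversity number reported above (17 clusters on average) is exactly `clusters_for_image` computed over every region description of each image.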
Clusters per Image – COCO Comparison*
* Since COCO has 5 captions per image, we randomly sample 5 region annotations per image for a fairer comparison.
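The sampling step in this footnote amounts to drawing 5 region descriptions per image before counting clusters; a minimal sketch with hypothetical toy data:

```python
import random

random.seed(0)

# Hypothetical per-image region description phrases (toy data).
regions_by_image = {
    "image_1": [f"region phrase {i}" for i in range(12)],
    "image_2": [f"region phrase {i}" for i in range(7)],
}

# COCO provides 5 captions per image, so sample 5 region descriptions
# per image for a like-for-like cluster comparison.
sampled = {img: random.sample(phrases, min(5, len(phrases)))
           for img, phrases in regions_by_image.items()}
```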
Top Image Synsets
Top Object Synsets
Top Attribute Synsets
Top Relationship Synsets
Top Region Synsets
Top Question Synsets
Top Answer Synsets
Question Answering Statistics
- Total QA pairs: 1,773,258
- Total QA images: 101,174
- Average question length (words): 6.0 ± 1.9
- Average answer length (words): 1.9 ± 1.3
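The length statistics above are a mean ± standard deviation over whitespace-tokenized word counts; a small sketch on hypothetical QA pairs (the real corpus has 1,773,258, and reports 6.0 ± 1.9 for questions):

```python
from statistics import mean, stdev

# Toy QA pairs; stand-ins for the dataset's question/answer strings.
questions = ["What color is the horse?", "Where is the dog sleeping?",
             "How many people are there?", "Is it raining?"]
answers = ["Brown", "On the couch", "Two"]

def length_stats(texts):
    """Mean and (sample) standard deviation of whitespace-token counts."""
    lengths = [len(t.split()) for t in texts]
    return mean(lengths), stdev(lengths)

q_mean, q_std = length_stats(questions)
a_mean, a_std = length_stats(answers)
```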