Object Detection

The model analyzed in this card detects one or more physical objects within an image, from apparel and animals to tools and vehicles, and returns a box around each object, as well as a label and description for each object.

On this page, you can learn more about how the model performs on different classes of objects, and what kinds of images you should expect the model to perform well or poorly on.

Model Description

Input: Photo(s) or video(s)

Output: The model can detect 550+ different object classes. For each object detected in a photo or video, the model outputs:

  • a bounding box around the object
  • a label and description of the object
  • a confidence score
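
To make this output concrete, here is a minimal client-side sketch of these fields; the `Detection` type and its field names are hypothetical, not the actual response schema (see the public API documentation for that).

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object (hypothetical field names; see the API docs)."""
    label: str    # e.g. "Shirt"
    description: str
    score: float  # confidence in [0, 1]
    box: tuple    # normalized (x_min, y_min, x_max, y_max)

def confident(detections, threshold=0.5):
    """Keep only detections at or above a confidence threshold."""
    return [d for d in detections if d.score >= threshold]
```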

Model architecture: Single-shot detector (SSD) with a ResNet-101 backbone and a feature pyramid network (FPN) for multi-scale feature maps.
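
For a concrete architectural reference, the sketch below instantiates a related, publicly available single-shot detector from torchvision. This is RetinaNet with a ResNet-50 + FPN backbone, a stand-in assumption for illustration; the card's exact ResNet-101 model is not publicly released.

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# An analogous, publicly available single-shot detector (ResNet-50 + FPN);
# NOT the card's ResNet-101 model, which is not public.
model = retinanet_resnet50_fpn(weights="DEFAULT")
model.eval()

# Dummy input: one 3-channel image with values in [0, 1].
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])

# Each prediction holds pixel-coordinate boxes, class labels, and scores.
boxes = predictions[0]["boxes"]    # (N, 4) tensor, xyxy
labels = predictions[0]["labels"]  # (N,) tensor of class ids
scores = predictions[0]["scores"]  # (N,) tensor of confidences
```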

View public API documentation

Performance

Performance is evaluated for specific object classes recognized by the model (e.g. shirt, muffin) and for categories of objects (e.g. apparel, food).

Two performance metrics are reported:

  • Average Precision (“AP”)
  • Recall at 60% Precision (“Recall@60%”)

Performance is evaluated on two datasets distinct from the training set:

  • Open Images Validation set, which contains ~41k images and 600 object classes, of which the model can recognize 518.
  • An internal Google dataset of ~5,000 images of consumer products, containing 210 object classes, all of which the model can recognize.

Limitations

The following factors may degrade the model’s performance.

Object size:

An object must occupy at least 1% of the image area to be detected.
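
As an illustrative pre-check against this floor (the 1% figure comes from the card; the helper itself is a sketch):

```python
def meets_size_floor(box, image_width, image_height, min_fraction=0.01):
    """True if a pixel-coordinate (x_min, y_min, x_max, y_max) box covers
    at least min_fraction (1% per the card) of the image area."""
    x_min, y_min, x_max, y_max = box
    box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
    return box_area / (image_width * image_height) >= min_fraction

# Example: a 50 x 40 box in a 640 x 480 image covers ~0.65% of the area.
print(meets_size_floor((100, 100, 150, 140), 640, 480))  # False
```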

“Things” vs “stuff”:

The model was designed to detect discrete objects with clearly discernible shapes (“things”), not groups of overlapping objects or background clutter (“stuff”).

Lighting:

Poor lighting or harsh, high-contrast illumination (e.g. nighttime, back-lit, or side-lit scenes) may degrade model performance.

Occlusion or clutter:

Partially obstructed or truncated objects may not be detected: for example, a shirt underneath a jacket, or an object with less than 25% of its area visible in the image.

Camera positioning and lens type:

Camera angle and positioning (e.g. oblique angles, long distances) and lens type (e.g. fisheye) may impact model performance.

Blur or noise:

Blurry objects, rapid movement between frames, or encoding/decoding noise may degrade model performance.

Image resolution:

A minimum image resolution of 300K pixels (roughly 640 × 480) is recommended.
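
A simple illustrative guard for this floor, using Pillow (the 300K figure is from the card; the helper is a sketch):

```python
from PIL import Image

MIN_PIXELS = 300_000  # roughly 640 x 480, per the card's recommendation

def meets_resolution_floor(path: str) -> bool:
    """True if the image at `path` has at least the recommended pixel count."""
    with Image.open(path) as img:
        width, height = img.size
    return width * height >= MIN_PIXELS
```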

Object type:

Model accuracy varies across different object types (see Performance section).

Performance

Here you can learn more about the model's performance on two evaluation datasets drawn from different sources than the training data. You can assess model performance across 500+ different object classes and two different performance metrics: Average Precision (“AP”) and Recall at 60% Precision (“Recall@60%”).

Summary

  • Aggregate performance varies across the two evaluation datasets (mAP of 0.43 and Recall@60% of 0.42 on the Open Images Validation set, vs. mAP of 0.34 and Recall@60% of 0.36 on the internal Google test set).
  • Performance varies across object classes. For example, on the Open Images Validation set, the model exhibited higher performance (>70% AP and Recall@60%) for object categories like cars and animals, and lower performance (<30% AP and Recall@60%) for toys, food, and certain appliances.
  • The level of granularity within a category significantly affects performance: a model asked to differentiate several different types of food tends to perform worse on standard metrics than a model asked only to identify all food as a single category.
  • P-R curves were generated by recording Precision and Recall values for all decision thresholds between 0.2 and 1.0. The curves appear truncated because no P-R values were generated for thresholds below 0.2.
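
To make the metrics and the threshold sweep concrete, the NumPy sketch below traces a P-R curve over thresholds from 0.2 to 1.0, then reads off Recall at 60% Precision and an area-under-curve approximation of AP. Matching detections to ground truth (e.g. by IoU) is assumed to happen upstream, and the sample data is purely illustrative.

```python
import numpy as np

def pr_curve(scores, is_tp, n_positives, thresholds):
    """Precision and recall at each decision threshold.

    scores:      per-detection confidence scores
    is_tp:       per-detection bool, True if matched to ground truth upstream
    n_positives: total number of ground-truth objects
    """
    precisions, recalls = [], []
    for t in thresholds:
        keep = scores >= t
        tp = np.sum(is_tp & keep)
        fp = np.sum(~is_tp & keep)
        precisions.append(tp / max(tp + fp, 1))
        recalls.append(tp / max(n_positives, 1))
    return np.array(precisions), np.array(recalls)

def recall_at_precision(precisions, recalls, target=0.60):
    """Highest recall among operating points whose precision meets the target."""
    ok = precisions >= target
    return float(recalls[ok].max()) if ok.any() else 0.0

def average_precision(precisions, recalls):
    """Approximate AP as the area under the P-R curve."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

# Illustrative data: six detections, five ground-truth objects.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3])
is_tp = np.array([True, True, False, True, False, True])
thresholds = np.arange(0.2, 1.01, 0.05)  # the card's 0.2 to 1.0 sweep

p, r = pr_curve(scores, is_tp, n_positives=5, thresholds=thresholds)
print("Recall@60%:", recall_at_precision(p, r))
print("AP (approx):", average_precision(p, r))
```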

P-R Curves

  1. Choose an evaluation dataset.

    The Open Images Validation dataset (V4) comprises ~41k images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. It contains a total of 204k bounding boxes covering 600 object classes, of which the model evaluated in this card can detect 547. Learn more about the Open Images V4 Validation Dataset.

  2. Select performance metric.

    Precision

    Precision measures how often the model's predicted detections are correct. Average Precision (AP) summarizes the precision-recall curve by averaging the precision values achieved across recall levels. A low-precision model would frequently detect objects that are not in the image (or apply incorrect labels to objects that are present).

  3. View performance results for your selections.

    • Performance results are presented as part of a label hierarchy, where objects are nested within a tree structure. Specific objects (e.g. “cat”) are the “leaves” organized under higher-level object categories or “parents” (e.g. “animal”); a sketch of this hierarchy format follows the results below.
    • You can select between two different View Modes: image tiles (default) and a tabular view.
    • Note: Not all objects and object categories will show results due to a lack of coverage in the evaluation dataset selected.
    • Miscellaneous object
    • Person (AP: 0.6 to 0.7)
    • Drink (AP: 0.4 to 0.5)
    • Functional image
    • Indoor
    • Animal (AP: 0.5 to 0.6)
    • Apparel and fashion accessory
    • Luggage & bags (AP: 0.2 to 0.3)
    • Health and beauty
    • Outdoor
    • Vehicle
    • Art
    • Food (AP: 0.2 to 0.3)
    • Media
    • Plant (AP: 0.4 to 0.5)
    • Equipment
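
As a sketch of how such a label hierarchy might be represented and rolled up (the classes and AP values below are illustrative, not the card's results):

```python
# Hypothetical slice of the label hierarchy: parent categories map to leaves.
HIERARCHY = {
    "Animal": ["Cat", "Dog", "Bird"],
    "Apparel and fashion accessory": ["Shirt", "Hat"],
}

# Illustrative per-leaf AP values (not the card's actual numbers).
LEAF_AP = {"Cat": 0.62, "Dog": 0.55, "Bird": 0.48, "Shirt": 0.35, "Hat": 0.30}

def category_ap(parent: str) -> float:
    """Unweighted mean AP across a parent category's leaf classes."""
    leaves = HIERARCHY[parent]
    return sum(LEAF_AP[leaf] for leaf in leaves) / len(leaves)

print(f"Animal AP: {category_ap('Animal'):.2f}")  # Animal AP: 0.55
```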

Test your own images

See how the model works on your own image here (we will not keep a copy).

Feedback

We’d love your feedback on the information presented in this card and/or the framework we’re exploring here for model reporting. Please also share any unexpected results. Your feedback will be forwarded to model and service owners. Click here to provide feedback.