Multimodal Blending for High-Accuracy Instance Recognition

Ziang Xie, Arjun Singh, Justin Uang, Karthik S. Narayan, Pieter Abbeel

[PDF] [BibTeX]

Talk Video

Abstract

Despite the rich information provided by sensors such as the Microsoft Kinect, the problem of detecting object instances in robotic perception remains unsolved, even in the tabletop setting, where segmentation is greatly simplified. Existing object detection systems often focus on textured objects, for which local feature descriptors can be used to reliably obtain correspondences between different views of the same object.

We examine the benefits of dense feature extraction and multimodal features for improving the accuracy and robustness of an instance recognition system. By combining multiple modalities and blending their scores through an ensemble-based method to generate our final object hypotheses, we obtain significant improvements over previously published results on two RGB-D datasets. On the Challenge dataset, our method yields only one missed detection (achieving 100% precision and 99.77% recall). On the Willow dataset, we also make significant gains over the prior state of the art (achieving 98.28% precision and 87.78% recall), raising the F-score from 0.8092 to 0.9273.
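For reference, the F-score is the harmonic mean of precision and recall; plugging the Willow numbers into the standard formula recovers the reported value:

    F = 2PR / (P + R) = 2(0.9828)(0.8778) / (0.9828 + 0.8778) ≈ 0.9273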

Datasets

The Challenge and Willow datasets described below were downloaded from here.

The Challenge dataset corresponds to {Willow_Final_Training_Set, Willow_Final_Test_Set}, and the Willow dataset to {training_data, test_data}.

Dataset Annotations

Challenge Dataset

We found several errors in the original ground truth data provided:
  • In test 34, frames 0-4 do not contain object 15, contrary to what the ground truth CSV files claim.
  • In test 21, frames 0-5, the ground truth poses for object 5 are wrong; we have replaced them with our own, which we verified to be close to the actual ground truth by projecting the object mesh in the predicted pose onto the image (a sketch of this projection check appears below).
  • Likewise, in test 34, frames 0-4, the ground truth poses for object 13 are wrong, and we have replaced these as well.
Download the fixed CSV files: [.tar.gz] [.zip]
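The projection check mentioned above can be sketched as follows. This is a minimal illustration assuming a pinhole camera model; the intrinsic values are typical Kinect defaults rather than calibrated values from our pipeline, and project_mesh is a hypothetical helper, not part of our released code.

    import numpy as np

    # Illustrative Kinect-style intrinsics (assumed, not calibrated).
    K = np.array([[525.0,   0.0, 319.5],
                  [  0.0, 525.0, 239.5],
                  [  0.0,   0.0,   1.0]])

    def project_mesh(vertices, R, t, K):
        """Project Nx3 mesh vertices (object frame) into pixel coordinates,
        given a predicted pose (R, t) mapping object frame to camera frame."""
        cam = vertices @ R.T + t       # rotate and translate into the camera frame
        uv = cam @ K.T                 # apply the pinhole intrinsics
        return uv[:, :2] / uv[:, 2:3]  # perspective divide

Overlaying the returned pixel coordinates on the test image makes gross pose errors easy to spot by eye.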

Willow Dataset

Unfortunately, to the best of our knowledge, there is no ground truth pose information for the Willow dataset. The results reported in our paper were obtained assuming that the objects given in each test case are visible in all frames for that test (which is often not the case).

Download the Willow ground truth used for the paper here: [.tar.gz] [.zip]

By manual inspection, we noted many instances where objects in a given test frame were fully occluded, and modified the ground truth files accordingly.

Download the fixed Willow ground truth files (Willow-Vis): [.tar.gz] [.zip]

Willow-20 (described in paper): [.tar.gz] [.zip]

Detections & Mistakes

Challenge Dataset

For each test frame, we show our color models projected onto the test image in the predicted pose. We also show the segmentations and the IDs of the detected or missing objects. Finally, we provide the scores computed for each individual segmentation cluster.

Key for scores in the JSON files:

  • scores_nn_pose: best matching scores for each object obtained during the RANSAC phase.
  • scores_color_match: pose verification scores using CIE Lab models.
  • scores_shape_match: pose verification scores using shape context feature models.
  • scores_desc_match: pose verification scores using dense SIFT models.
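As a rough illustration of how these per-cluster scores might be combined, here is a minimal sketch that loads one of the JSON files and blends the four modalities with a weighted sum. The file layout, field access, and uniform weights are assumptions for illustration only; the paper's ensemble-based blending is learned, not hand-weighted.

    import json

    # Uniform weights purely for illustration; not the learned blend.
    WEIGHTS = {
        "scores_nn_pose": 0.25,
        "scores_color_match": 0.25,
        "scores_shape_match": 0.25,
        "scores_desc_match": 0.25,
    }

    def blend_scores(path):
        """Blend per-object scores from one detections JSON file.
        Assumes each key maps object IDs to scores for a cluster;
        the released files may be organized differently."""
        with open(path) as f:
            data = json.load(f)
        blended = {}
        for key, weight in WEIGHTS.items():
            for obj_id, score in data.get(key, {}).items():
                blended[obj_id] = blended.get(obj_id, 0.0) + weight * score
        return blended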

Challenge Detections

Willow Dataset

As described in our paper, the Willow dataset contains many highly occluded and non-textured views of objects, as well as imposter objects. Although we attain high precision, achieving high recall is much more difficult.

Willow Detections

Code

Our SIFT feature extraction code can be found here.
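If you just want to see the idea behind dense extraction before downloading, here is a minimal dense SIFT sketch using OpenCV's stock SIFT on a regular keypoint grid. It is not our released code; the grid step, descriptor size, and image path are placeholders.

    import cv2

    def dense_sift(gray, step=8, size=8):
        """Compute SIFT descriptors on a regular grid rather than at
        detected keypoints, so untextured regions still yield features."""
        keypoints = [
            cv2.KeyPoint(float(x), float(y), float(size))
            for y in range(step // 2, gray.shape[0], step)
            for x in range(step // 2, gray.shape[1], step)
        ]
        sift = cv2.SIFT_create()
        keypoints, descriptors = sift.compute(gray, keypoints)
        return keypoints, descriptors

    # Usage (the image path is a placeholder):
    gray = cv2.imread("frame_0000.png", cv2.IMREAD_GRAYSCALE)
    kp, desc = dense_sift(gray)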
