Learning Visual Servoing with Deep Features and Fitted Q-Iteration
Abstract — Visual servoing involves choosing actions that move a robot in response to observations from a camera, in order to reach a goal configuration in the world. Standard visual servoing approaches typically rely on manually designed features and analytical dynamics models, which limits their generalization capability and often requires extensive application-specific feature and model engineering. In this work, we study how learned visual features, learned predictive dynamics models, and reinforcement learning can be combined to learn visual servoing mechanisms. We focus on target following, with the goal of designing algorithms that can learn a visual servo using low amounts of data of the target in question, to enable quick adaptation to new targets. Our approach is based on servoing the camera in the space of learned visual features, rather than image pixels or manually-designed keypoints. We demonstrate that standard deep features, in our case taken from a model trained for object classification, can be used together with a bilinear predictive model to learn an effective visual servo that is robust to visual variation, changes in viewing angle and appearance, and occlusions. A key component of our approach is to use a sample-efficient fitted Q-iteration algorithm to learn which features are best suited for the task at hand. We show that we can learn an effective visual servo on a complex synthetic car following benchmark using just 20 training trajectory samples for reinforcement learning. We demonstrate substantial improvement over a conventional approach based on image pixels or hand-designed keypoints, and we show an improvement in sample-efficiency of more than two orders of magnitude over standard model-free deep reinforcement learning algorithms.
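The bilinear predictive model mentioned in the abstract predicts how the visual features change as a function of the current features and the action. Below is a minimal NumPy sketch of a simplified, fully bilinear variant (not the paper's exact locally connected formulation); the class name and the feature/action dimensions are illustrative placeholders.

    # Simplified fully bilinear feature-dynamics sketch:
    # y_{t+1} ≈ y_t + W (y_t ⊗ u_t) + B u_t
    import numpy as np

    class BilinearFeatureDynamics:
        def __init__(self, feat_dim, action_dim, rng=np.random):
            # W couples every feature with every action dimension; B is a linear action term.
            self.W = 0.01 * rng.standard_normal((feat_dim, feat_dim, action_dim))
            self.B = 0.01 * rng.standard_normal((feat_dim, action_dim))

        def predict(self, y, u):
            # sum_{j,k} W[i, j, k] * y[j] * u[k] for each output feature i
            return y + np.einsum('ijk,j,k->i', self.W, y, u) + self.B @ u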
[Paper] [Code] [Servoing Benchmark Code] [GTC 2017 slides] [with notes] [GTC 2017 talk]
Supplementary Videos
Click on any of the costs in the table entries below to see the corresponding trajectories.
Dynamics-Based Servoing Policies Learned with Reinforcement Learning
Table 2: Costs on test executions of the dynamics-based servoing policies for different feature dynamics and weightings of the features. The reported numbers are the mean and standard error across 100 test trajectories of up to 100 time steps each. We test on executions with the training cars and the novel cars; for consistency, the novel cars follow the same route as the training cars. We compare the performance of policies with unweighted features against policies with weights learned by the other methods. For the case of unweighted feature dynamics, we use the cross-entropy method (CEM) to learn the relative weights λ of the control and the single feature weight w. For the other cases, we learn the weights with CEM, with Trust Region Policy Optimization (TRPO) for either 2 or 50 iterations, or with our proposed FQI algorithm. CEM searches over the full space of policy parameters w and λ, but it was only run for pixel features since it does not scale to high-dimensional problems. We report the number of training trajectories in parentheses. For TRPO, we use a fixed number of training samples per iteration, whereas for CEM and FQI, we use a fixed number of training trajectories per iteration. We use a batch size of 4000 samples for TRPO, which means that at least 40 trajectories were used per iteration, since trajectories can terminate early, i.e., in fewer than 100 time steps.
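For reference, here is a minimal sketch of the cross-entropy method used to search over the policy parameters; rollout_cost is a hypothetical helper that executes the servoing policy with a candidate parameter vector (the concatenated weights w and λ) and returns its average trajectory cost, and the population size and iteration count are illustrative defaults, not the values used in the experiments.

    import numpy as np

    def cem(rollout_cost, dim, iters=10, pop=20, elite_frac=0.25, rng=np.random):
        # Iteratively fit a diagonal Gaussian over parameters to the lowest-cost samples.
        mean, std = np.zeros(dim), np.ones(dim)
        n_elite = max(1, int(pop * elite_frac))
        for _ in range(iters):
            samples = mean + std * rng.standard_normal((pop, dim))
            costs = np.array([rollout_cost(s) for s in samples])
            elites = samples[np.argsort(costs)[:n_elite]]
            mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
        return mean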
End-to-End Policies Learned with TRPO
Table 4: Costs on test executions of servoing policies that were trained end-to-end with TRPO. These policies take in different observation modalities: the ground truth car position or image-based observations. This table follows the same format as Table 2. The mean of the first policy is parametrized as a 3-layer MLP, with tanh non-linearities except for the output layer; the first 2 fully-connected layers use 32 hidden units each. For the other policies, each mean is parametrized as a 5-layer CNN, consisting of 2 convolutional and 3 fully-connected layers, with ReLU non-linearities except for the output layer; the convolutional layers use 16 filters (4 × 4, stride 2) each and the first 2 fully-connected layers use 32 hidden units each. All the policies are trained with TRPO, a batch size of 4000 samples, 500 iterations, and a step size of 0.01. The car position observations are not affected by the appearance of the cars, so the test performance for that modality is the same regardless of which set of cars is used.
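For concreteness, the image-based policy mean described above can be sketched as follows. PyTorch is used purely for illustration (it is not the original implementation), and the input channel count and action dimension are assumptions.

    import torch.nn as nn

    def cnn_policy_mean(action_dim=2, in_channels=3):
        # 2 convolutional layers (16 filters, 4x4, stride 2) + 3 fully-connected layers,
        # with ReLU non-linearities everywhere except the output layer.
        return nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(32), nn.ReLU(),   # flattened size depends on the input resolution
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, action_dim),      # linear output (no non-linearity)
        )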
Classical Image-Based Visual Servoing
Table 5: Costs on test executions when using classical image-based visual servoing (IBVS) with respect to feature points derived from bounding boxes and keypoints derived from hand-engineered features. The blue and green circles denote the target and current feature points, respectively. Since there is no learning involved in this method, we only test with one set of cars: the cars that were used for training in the other methods. This table follows the same format as Table 2. This method has one hyperparameter, the gain of the control law. For each feature type, we select the best gain (shown in parentheses) by validating the policy on 10 validation trajectories with gains between 0.05 and 2, in increments of 0.05. The servoing policies based on bounding box features achieve low cost, and even lower cost if the ground truth car dynamics is used. However, servoing with respect to hand-crafted feature points performs significantly worse than the other methods.
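For reference, the classical IBVS control law computes the camera velocity from the feature-point error through the pseudo-inverse of the interaction matrix, scaled by the gain validated above. A minimal NumPy sketch (the variable names are illustrative):

    import numpy as np

    def ibvs_control(s, s_star, L, gain):
        # u = -gain * pinv(L) @ (s - s_star), with s the current feature points,
        # s_star the target feature points, and L the interaction (image Jacobian) matrix.
        return -gain * np.linalg.pinv(L) @ (s - s_star)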
Classical Position-Based Visual Servoing
Table 6: Costs on test executions when using classical position-based visual servoing (PBVS). The poses with corresponding blue and green circles at the origin denote the target and current car poses, respectively. When the policy is nearly perfect, the points and poses may overlap each other almost exactly. Since there is no learning involved in this method, we only test with one set of cars: the cars that were used for training in the other methods. This table follows the same format as Table 2. This method has one hyperparameter, the gain of the control law. For each condition, we select the best gain (shown in parentheses) by validating the policy on 10 validation trajectories with gains between 0.05 and 2, in increments of 0.05. These servoing policies, which use ground truth car poses, outperform all the other policies based on images. In addition, the performance is more than two orders of magnitude better if the ground truth car dynamics is used.
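For reference, a standard PBVS law simply commands a velocity proportional to the pose error. A minimal sketch, assuming the error is given as a translation error and an axis-angle rotation error expressed in the appropriate frame (the variable names are illustrative):

    import numpy as np

    def pbvs_control(t_err, theta_u, gain):
        v = -gain * t_err        # translational velocity
        omega = -gain * theta_u  # angular velocity
        return np.concatenate([v, omega])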