Learning Visual Servoing with Deep Features and Fitted Q-Iteration
Abstract — Visual servoing involves choosing actions that move a robot in response to observations from a camera, in order to reach a goal configuration in the world. Standard visual servoing approaches typically rely on manually designed features and analytical dynamics models, which limits their generalization capability and often requires extensive application-specific feature and model engineering. In this work, we study how learned visual features, learned predictive dynamics models, and reinforcement learning can be combined to learn visual servoing mechanisms. We focus on target following, with the goal of designing algorithms that can learn a visual servo using low amounts of data of the target in question, to enable quick adaptation to new targets. Our approach is based on servoing the camera in the space of learned visual features, rather than image pixels or manually-designed keypoints. We demonstrate that standard deep features, in our case taken from a model trained for object classification, can be used together with a bilinear predictive model to learn an effective visual servo that is robust to visual variation, changes in viewing angle and appearance, and occlusions. A key component of our approach is to use a sample-efficient fitted Q-iteration algorithm to learn which features are best suited for the task at hand. We show that we can learn an effective visual servo on a complex synthetic car following benchmark using just 20 training trajectory samples for reinforcement learning. We demonstrate substantial improvement over a conventional approach based on image pixels or hand-designed keypoints, and we show an improvement in sample-efficiency of more than two orders of magnitude over standard model-free deep reinforcement learning algorithms.
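The bilinear predictive model mentioned in the abstract predicts how the visual features change as a function of the current features and the action. Below is a minimal NumPy sketch of a simplified, fully bilinear variant (not the paper's exact locally connected formulation); the class name and the feature/action dimensions are illustrative placeholders.

    # Simplified fully bilinear feature-dynamics sketch:
    # y_{t+1} ≈ y_t + W (y_t ⊗ u_t) + B u_t
    import numpy as np

    class BilinearFeatureDynamics:
        def __init__(self, feat_dim, action_dim, rng=np.random):
            # W couples every feature with every action dimension; B is a linear action term.
            self.W = 0.01 * rng.standard_normal((feat_dim, feat_dim, action_dim))
            self.B = 0.01 * rng.standard_normal((feat_dim, action_dim))

        def predict(self, y, u):
            # sum_{j,k} W[i, j, k] * y[j] * u[k] for each output feature i
            return y + np.einsum('ijk,j,k->i', self.W, y, u) + self.B @ u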
[Paper] [Code] [Servoing Benchmark Code] [GTC 2017 slides] [with notes] [GTC 2017 talk]
Supplementary Videos
Click on any of the costs in the table entries below to see the corresponding trajectories.
Dynamics-Based Servoing Policies Learned with Reinforcement Learning
Table 2: Costs on test executions of the dynamics-based servoing policies for different feature dynamics and weightings of the features. The reported numbers are the mean and standard error across 100 test trajectories of up to 100 time steps each. We test on executions with the training cars and the novel cars; for consistency, the novel cars follow the same route as the training cars. We compare the performance of policies with unweighted features against policies with weights learned by the other methods. For the case of unweighted feature dynamics, we use the cross-entropy method (CEM) to learn the relative weights λ of the control and the single feature weight w. For the other cases, we learn the weights with CEM, with Trust Region Policy Optimization (TRPO) for either 2 or 50 iterations, or with our proposed FQI algorithm. CEM searches over the full space of policy parameters w and λ, but it was only run for pixel features since it does not scale to high-dimensional problems. We report the number of training trajectories in parentheses. For TRPO, we use a fixed number of training samples per iteration, whereas for CEM and FQI, we use a fixed number of training trajectories per iteration. We use a batch size of 4000 samples for TRPO, which means that at least 40 trajectories were used per iteration, since trajectories can terminate early, i.e., in fewer than 100 time steps.
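For reference, here is a minimal sketch of the cross-entropy method used to search over the policy parameters; rollout_cost is a hypothetical helper that executes the servoing policy with a candidate parameter vector (the concatenated weights w and λ) and returns its average trajectory cost, and the population size and iteration count are illustrative defaults, not the values used in the experiments.

    import numpy as np

    def cem(rollout_cost, dim, iters=10, pop=20, elite_frac=0.25, rng=np.random):
        # Iteratively fit a diagonal Gaussian over parameters to the lowest-cost samples.
        mean, std = np.zeros(dim), np.ones(dim)
        n_elite = max(1, int(pop * elite_frac))
        for _ in range(iters):
            samples = mean + std * rng.standard_normal((pop, dim))
            costs = np.array([rollout_cost(s) for s in samples])
            elites = samples[np.argsort(costs)[:n_elite]]
            mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
        return mean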
End-to-End Policies Learned with TRPO
Table 4: Costs on test executions of servoing policies that were trained end-to-end with TRPO. These policies take in different observation modalities: the ground truth car position or image-based observations. This table follows the same format as Table 2. The mean of the first policy is parametrized as a 3-layer MLP, with tanh non-linearities except for the output layer; the first 2 fully-connected layers use 32 hidden units each. For the other policies, each mean is parametrized as a 5-layer CNN, consisting of 2 convolutional and 3 fully-connected layers, with ReLU non-linearities except for the output layer; the convolutional layers use 16 filters (4 × 4, stride 2) each and the first 2 fully-connected layers use 32 hidden units each. All the policies are trained with TRPO, a batch size of 4000 samples, 500 iterations, and a step size of 0.01. The car position observations are not affected by the appearance of the cars, so the test performance for that modality is the same regardless of which set of cars is used.
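For concreteness, the image-based policy mean described above can be sketched as follows. PyTorch is used purely for illustration (it is not the original implementation), and the input channel count and action dimension are assumptions.

    import torch.nn as nn

    def cnn_policy_mean(action_dim=2, in_channels=3):
        # 2 convolutional layers (16 filters, 4x4, stride 2) + 3 fully-connected layers,
        # with ReLU non-linearities everywhere except the output layer.
        return nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(32), nn.ReLU(),   # flattened size depends on the input resolution
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, action_dim),      # linear output (no non-linearity)
        )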
Classical Image-Based Visual Servoing
Table 5: Costs on test executions when using classical image-based visual servoing (IBVS) with respect to feature points derived from bounding boxes and keypoints derived from hand-engineered features. The blue and green circles denote the target and current feature points, respectively. Since there is no learning involved in this method, we only test with one set of cars: the cars that were used for training in the other methods. This table follows the same format as Table 2. This method has one hyperparameter, the gain of the control law. For each feature type, we select the best gain (shown in parentheses) by validating the policy on 10 validation trajectories with gains between 0.05 and 2, in increments of 0.05. The servoing policies based on bounding box features achieve low cost, and even lower cost if the ground truth car dynamics is used. However, servoing with respect to hand-crafted feature points performs significantly worse than the other methods.
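For reference, the classical IBVS control law computes the camera velocity from the feature-point error through the pseudo-inverse of the interaction matrix, scaled by the gain validated above. A minimal NumPy sketch (the variable names are illustrative):

    import numpy as np

    def ibvs_control(s, s_star, L, gain):
        # u = -gain * pinv(L) @ (s - s_star), with s the current feature points,
        # s_star the target feature points, and L the interaction (image Jacobian) matrix.
        return -gain * np.linalg.pinv(L) @ (s - s_star)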
Classical Position-Based Visual Servoing
Table 6: Costs on test executions when using classical position-based visual servoing (PBVS). The poses with corresponding blue and green circles at the origin denote the target and current car poses, respectively. When the policy is nearly perfect, the points and poses may overlap each other almost exactly. Since there is no learning involved in this method, we only test with one set of cars: the cars that were used for training in the other methods. This table follows the same format as Table 2. This method has one hyperparameter, the gain of the control law. For each condition, we select the best gain (shown in parentheses) by validating the policy on 10 validation trajectories with gains between 0.05 and 2, in increments of 0.05. These servoing policies, which use ground truth car poses, outperform all the other policies based on images. In addition, the performance is more than two orders of magnitude better if the ground truth car dynamics is used.
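For reference, a standard PBVS law simply commands a velocity proportional to the pose error. A minimal sketch, assuming the error is given as a translation error and an axis-angle rotation error expressed in the appropriate frame (the variable names are illustrative):

    import numpy as np

    def pbvs_control(t_err, theta_u, gain):
        v = -gain * t_err        # translational velocity
        omega = -gain * theta_u  # angular velocity
        return np.concatenate([v, omega])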