Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics



Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics Supplementary Material



Sergey Levine and Pieter Abbeel
Department of Electrical Engineering and Computer Science, University of California, Berkeley
Abstract
We present a policy search method that uses iteratively refitted local linear models to optimize trajectory distributions for large, continuous problems. These trajectory distributions can be used within the framework of guided policy search to learn policies with an arbitrary parameterization. Our method fits time-varying linear dynamics models to speed up learning, but does not rely on learning a global model, which can be difficult when the dynamics are complex and discontinuous. We show that this hybrid approach requires many fewer samples than model-free methods, and can handle complex, nonsmooth dynamics that can pose a challenge for model-based techniques. We present experiments showing that our method can be used to learn complex neural network policies that successfully execute simulated robotic manipulation tasks in partially observed environments with numerous contact discontinuities and underactuation.
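To make the idea of time-varying linear dynamics models concrete, the sketch below fits a separate model x_{t+1} ≈ A_t x_t + B_t u_t + c_t at every time step from a batch of sampled trajectories. It is a minimal illustration using ridge-regularized least squares, not the implementation used in the paper; the function name, the array layout, and the `ridge` parameter are assumptions made for this example.

```python
import numpy as np


def fit_time_varying_linear_dynamics(states, actions, ridge=1e-6):
    """Fit x_{t+1} ~ A_t x_t + B_t u_t + c_t independently at each time step.

    Hypothetical helper for illustration only.
    states:  array of shape (N, T + 1, dx), N sampled trajectories
    actions: array of shape (N, T, du)
    Returns a list of (A_t, B_t, c_t, Sigma_t) tuples, one per time step.
    """
    num_samples, horizon, du = actions.shape
    dx = states.shape[2]
    models = []
    for t in range(horizon):
        # Regressors [x_t, u_t, 1]; the constant column yields the offset c_t.
        Z = np.hstack([states[:, t, :], actions[:, t, :], np.ones((num_samples, 1))])
        Y = states[:, t + 1, :]
        # Ridge-regularized normal equations (the small ridge term is an assumption).
        W = np.linalg.solve(Z.T @ Z + ridge * np.eye(Z.shape[1]), Z.T @ Y)
        A = W[:dx].T
        B = W[dx:dx + du].T
        c = W[-1]
        # Residual covariance gives the Gaussian noise of the local model.
        residual = Y - Z @ W
        Sigma = residual.T @ residual / max(num_samples - 1, 1)
        models.append((A, B, c, Sigma))
    return models
```

Because each time step gets its own model, the fit only has to be accurate near the sampled trajectories, which is what lets this kind of approach avoid learning a single global model.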

BibTeX Citation
@inproceedings{2014-mfcgps,
  author = {Sergey Levine and Pieter Abbeel},
  title = {Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics},
  booktitle = {Neural Information Processing Systems (NIPS)},
  year = {2014},
}

Paper (with appendices): [PDF]
Supplementary Video


Walking Gait Generation
The walking result in the paper is initialized from a demonstration, which is assumed to have a known stabilizing controller. Here, we show the results of training the linear-Gaussian controller in two other settings: when only the demonstrated trajectory is provided (without a stabilizing controller), and when the walking gait must be generated from scratch, with no demonstration at all.


Grasp Simulation
A grasping motion generated with a linear-Gaussian controller. The cost function penalizes the distance between the palm and the object, as well as the distance between the object and the target position.
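The two distance penalties described above can be sketched as a simple per-time-step cost. This is only an illustration: the function name, the weights `w_reach` and `w_place`, and the use of unsquared Euclidean distance are assumptions, since the exact form of the cost is not given here.

```python
import math


def grasp_cost(palm_pos, object_pos, target_pos, w_reach=1.0, w_place=1.0):
    """Hypothetical grasp cost: weighted sum of two Euclidean distances.

    palm_pos, object_pos, target_pos: 3D positions as sequences of floats.
    w_reach, w_place: illustrative weights (assumed, not from the paper).
    """
    # Penalize the palm being far from the object.
    reach = math.dist(palm_pos, object_pos)
    # Penalize the object being far from its target position.
    place = math.dist(object_pos, target_pos)
    return w_reach * reach + w_place * place
```

With both terms active, the cost can only reach zero when the palm is at the object and the object is at the target, which encourages the controller to reach for the object first and then move it.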