Optimism-Driven Exploration for Nonlinear Systems
Teodor Mihai Moldovan, Sergey Levine, Michael I. Jordan, Pieter Abbeel

  Abstract — Tasks with unknown dynamics and costly system interaction time present a serious challenge for reinforcement learning. If a model of the dynamics can be learned quickly, interaction time can be reduced substantially. We show that combining an optimistic exploration strategy with model-predictive control achieves very low sample complexity for a range of nonlinear systems. Our method learns a Dirichlet process mixture of linear models using an exploration strategy based on optimism in the face of uncertainty. Trajectory optimization is used to plan paths in the learned model that both minimize the cost and perform exploration. Experimental results show that our approach achieves some of the most sample-efficient learning rates on several benchmark problems, and is able to successfully learn to control a simulated helicopter during hover and autorotation with only seconds of interaction time. The main limitation of the approach is its substantial computational requirements.
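
To make the loop in the abstract concrete, below is a minimal illustrative sketch (not the authors' code) of model learning with optimism-driven exploration. It substitutes a single Bayesian linear model for the paper's Dirichlet process mixture and random-shooting MPC for trajectory optimization; the toy one-dimensional system, the bonus weight `beta`, and all other constants are assumptions chosen purely for illustration.

```python
"""Sketch: optimistic exploration + model-predictive control on a toy system.
A Bayesian linear dynamics model stands in for the DP mixture of linear
models; random-shooting MPC stands in for trajectory optimization."""
import numpy as np

rng = np.random.default_rng(0)

def true_dynamics(x, u):
    # Unknown system the agent interacts with (assumed for this sketch).
    return 0.9 * x + 0.5 * u + 0.01 * rng.standard_normal()

class BayesianLinearModel:
    """Bayesian linear regression x' = w . [x, u, 1] with a Gaussian prior."""
    def __init__(self, dim=3, prior_var=10.0, noise_var=0.01):
        self.P = np.eye(dim) / prior_var   # posterior precision
        self.b = np.zeros(dim)             # precision-weighted mean
        self.noise_var = noise_var

    def update(self, x, u, x_next):
        phi = np.array([x, u, 1.0])
        self.P += np.outer(phi, phi) / self.noise_var
        self.b += phi * x_next / self.noise_var

    def predict(self, x, u):
        phi = np.array([x, u, 1.0])
        cov = np.linalg.inv(self.P)
        w = cov @ self.b
        # Predictive mean and variance of the next state.
        return phi @ w, phi @ cov @ phi + self.noise_var

def optimistic_mpc(model, x0, horizon=10, n_samples=200, beta=1.0):
    """Pick the first action of the best random action sequence under an
    optimistic (lower-confidence-bound) cost."""
    best_cost, best_u0 = np.inf, 0.0
    for _ in range(n_samples):
        us = rng.uniform(-1.0, 1.0, horizon)
        x, cost = x0, 0.0
        for u in us:
            mean, var = model.predict(x, u)
            # Optimism in the face of uncertainty: uncertain transitions
            # receive a cost discount, steering the planner to explore them.
            cost += x**2 + 0.1 * u**2 - beta * np.sqrt(var)
            x = mean
        if cost < best_cost:
            best_cost, best_u0 = cost, us[0]
    return best_u0

model = BayesianLinearModel()
x = 2.0
for t in range(30):
    u = optimistic_mpc(model, x)        # plan optimistically in the model
    x_next = true_dynamics(x, u)        # one step of real interaction
    model.update(x, u, x_next)          # refit the model after every step
    x = x_next
print(f"final state after 30 interactions: {x:.3f}")
```

In the paper, the Dirichlet process mixture lets the learned dynamics switch between local linear regimes, and exploration comes from the optimism formulation itself rather than the heuristic variance discount used in this sketch.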

Paper: [PDF]
Code: [ZIP]
Readme: [TXT]