Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U(θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U(θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines, a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.
@inproceedings{tangnips10, title = {On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient}, author = {J. Tang and P. Abbeel}, booktitle = {Neural Information Processing Systems (NIPS)}, year = {2010}, }
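The importance-sampling derivation sketched in the abstract is short enough to state (standard notation: τ denotes a trajectory, p_θ(τ) its density under policy parameters θ, and R(τ) its return):

```latex
% Importance sampling from a sampling policy \theta':
U(\theta) = \mathbb{E}_{\tau \sim p_{\theta}}\big[R(\tau)\big]
          = \mathbb{E}_{\tau \sim p_{\theta'}}\!\left[ \frac{p_{\theta}(\tau)}{p_{\theta'}(\tau)} \, R(\tau) \right]

% Differentiating and evaluating at \theta = \theta' (where the ratio equals 1):
\nabla_{\theta} U(\theta) \Big|_{\theta=\theta'}
  = \mathbb{E}_{\tau \sim p_{\theta'}}\!\left[ \nabla_{\theta} \log p_{\theta}(\tau) \Big|_{\theta=\theta'} \, R(\tau) \right]
```

The second line is exactly the likelihood ratio (REINFORCE) gradient; since it is evaluated only at θ = θ', the estimate uses samples from the current policy alone, which is the under-use of past experience the abstract points out.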
Many robotic control tasks involve complex dynamics that are hard to model. Hand-specifying trajectories that satisfy a system’s dynamics can be very time-consuming and often exceedingly difficult. We present an algorithm for automatically generating large classes of trajectories for difficult control tasks by learning parameterized versions of desired maneuvers from multiple expert demonstrations. Our algorithm has enabled our autonomous helicopter to fly a wide range of challenging aerobatic maneuvers of different sizes from the same set of expert demonstrations.
@inproceedings{tangicra10, title = {Parameterized Maneuver Learning for Autonomous Helicopter Flight}, author = {J. Tang and A. Singh and N. Goehausen and P. Abbeel}, booktitle = {International Conference on Robotics and Automation (ICRA)}, year = {2010}, }
Establishing trust amongst agents is of central importance to the development of well-functioning multi-agent systems. Trust (or reputation) mechanisms can help by aggregating and sharing trust information between agents. Unfortunately these mechanisms can often be manipulated by strategic agents. Existing mechanisms are either very robust to manipulation (i.e., manipulations are not beneficial for strategic agents), or they are very informative (i.e., good at aggregating trust data), but never both. This paper explores the trade-off between these competing desiderata. First, we introduce a metric to evaluate the informativeness of existing trust mechanisms. We then show analytically that trust mechanisms can be combined to generate new hybrid mechanisms with intermediate robustness properties. We establish through simulation that hybrid mechanisms can achieve higher overall efficiency in environments with risky transactions and mixtures of agent types (some cooperative, some malicious, and some strategic) than any previously known mechanism.
@inproceedings{tang10trust, title = {Hybrid Transitive Trust Mechanisms}, author = {J. Tang and S. Seuken and D. Parkes}, booktitle = {International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS)}, year = {2010}, }
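The combination idea can be illustrated with a toy sketch (our own construction for illustration; the paper's hybrids are defined over specific transitive trust mechanisms, not arbitrary score functions): interpolate between an informative mechanism's scores and a robust mechanism's scores, with the mixing weight controlling where the hybrid sits on the robustness/informativeness trade-off.

```python
def hybrid_trust(score_informative, score_robust, weight):
    """Convex combination of two trust-score functions (toy illustration).

    weight=1.0 recovers the informative mechanism, weight=0.0 the robust
    one; intermediate weights yield intermediate robustness properties.
    Each score function maps (agent, reports) to a numeric trust score.
    """
    def score(agent, reports):
        return (weight * score_informative(agent, reports)
                + (1.0 - weight) * score_robust(agent, reports))
    return score
```

A caller would pass in the two underlying mechanisms' scoring functions and pick `weight` based on how adversarial the agent population is expected to be.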
In distributed work systems, individual users perform work for other users. A significant challenge in these systems is to provide proper incentives for users to contribute as much work as they consume, even when monitoring is not possible. We formalize the problem of designing incentive-compatible accounting mechanisms that measure the net contributions of users, despite relying on voluntary reports. We introduce the Drop-Edge Mechanism that removes any incentive for a user to manipulate via misreports about work contributed or consumed. We prove that Drop-Edge provides a good approximation to a user’s net contribution, and is accurate in the limit as the number of users grows. We demonstrate very good welfare properties in simulation compared to an existing, manipulable mechanism. In closing, we discuss our ongoing work, including a real-world implementation and evaluation of the Drop-Edge Mechanism in a BitTorrent client.
@inproceedings{seuken10aaai, title = {Accounting Mechanisms for Distributed Work Systems}, author = {S. Seuken and J. Tang and D. Parkes}, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2010}, }
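A toy sketch of the accounting computation follows. Everything here is our simplified reading of the abstract, not the paper's exact definitions: we assume reports form a weighted directed work graph, and we drop the edges between the evaluating user and the user being scored so that the evaluated pair's mutual reports cannot sway the score.

```python
def net_contribution(agent, evaluator, edges):
    """Net contribution of `agent` as seen by `evaluator` (sketch only).

    `edges` is a list of (worker, consumer, amount) reports meaning
    `worker` performed `amount` units of work for `consumer`.  In the
    spirit of Drop-Edge, reports on edges between `agent` and
    `evaluator` are ignored; the paper's precise edge-dropping rule
    may differ from this simplified reading.
    """
    total = 0.0
    for worker, consumer, amount in edges:
        if {worker, consumer} == {agent, evaluator}:
            continue  # drop edges between the evaluated pair
        if worker == agent:
            total += amount   # work contributed by `agent`
        elif consumer == agent:
            total -= amount   # work consumed by `agent`
    return total
```

The evaluator would serve the candidate with the highest such score, so inflating one's reported work for the evaluator itself buys nothing.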
Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. There exists a rich literature on policy gradient techniques, ranging from simple gradient estimators like REINFORCE [3] and Baxter and Bartlett’s GPOMDP [2], to natural gradient techniques [1] which leverage the curvature of the space, to more sophisticated value function approximators like Peters’ episodic Natural Actor-Critic [5]. Policy gradient methods have been widely applied to a variety of complex real-world reinforcement learning problems, e.g. hitting a baseball with an articulated arm robot [5] and quickly learning gaits for a legged robot [6]. In these settings the most time-consuming factor in the learning process is the number of real-world trials. We describe a novel connection between likelihood ratio based policy gradient methods and importance sampling: the likelihood ratio policy gradient estimate is equivalent to the derivative of a particular importance sampled estimate of the value function. This importance sampled estimate leverages only data from the current policy in the search, which indicates that likelihood ratio policy gradients are quite naive in their use of data.
@inproceedings{tang_snowbird10, title = {A Connection Between Importance Sampling and Likelihood Ratio Policy Gradients}, author = {J. Tang and P. Abbeel}, booktitle = {The Snowbird Workshop}, year = {2010}, }
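The estimator in question can be sketched in a few lines (a toy one-dimensional Gaussian-policy bandit of our own choosing, not one of the paper's testbeds): samples are drawn from the current policy only, and the gradient is assembled from score-function terms, here with the standard mean-reward baseline to reduce variance.

```python
import random

def reinforce_gradient(theta, n=5000, sigma=1.0, seed=0):
    """Likelihood-ratio (REINFORCE) estimate of dU/dtheta for a toy
    1-D bandit: action a ~ N(theta, sigma^2), reward r = -(a - 3)^2.

    The score function is grad_theta log pi(a) = (a - theta) / sigma^2,
    and the mean reward over the batch serves as a baseline.  For this
    problem the true gradient is -2 * (theta - 3).
    """
    rng = random.Random(seed)
    actions = [rng.gauss(theta, sigma) for _ in range(n)]
    rewards = [-(a - 3.0) ** 2 for a in actions]
    baseline = sum(rewards) / n  # simple constant baseline
    grads = [((a - theta) / sigma ** 2) * (r - baseline)
             for a, r in zip(actions, rewards)]
    return sum(grads) / n
```

Note that every sample comes from the current `theta`; reusing trajectories from earlier policy iterates via importance weights is exactly the extra mileage the abstract argues these methods leave on the table.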
Hand-specifying trajectories for complicated robotic platforms is often non-trivial because of the dimensionality of the state space and the complexity of the robot dynamics. This is especially true for autonomous helicopter flight, where no accurate dynamics model is available and where we may want highly aggressive maneuvers. Yet expert human pilots are able to successfully fly a large range of dynamic maneuvers. In this setting it is natural to ask for demonstrations of desired maneuvers: this is the apprenticeship learning problem. Past applications of this technique [1] have included state-of-the-art autonomous helicopter flight performance for replicating a complex airshow. We would next like to build a flexible controller for extended autonomous flight in dynamic environments. One approach to this task is to build a maneuver library from demonstrations, which allows us to fly learned maneuvers in novel ways; we could then plan and sequence different flight maneuvers to adapt to changing conditions or objectives. Unfortunately, building a maneuver library directly based upon [1] would require collecting a set of demonstrations for each discrete maneuver. For example, one might require loops of size 10m, 15m, 20m, 25m, etc. We demonstrate a generative, probabilistic-model-based approach for learning entire classes of challenging maneuvers for an autonomous helicopter (stall turns, loops, tic-tocs) from expert demonstrations. Our algorithm automatically aligns these demonstrations, accounts for noise, and learns a feasible flight regime for each maneuver as well as a high-fidelity dynamics model suitable for control in those flight regimes. The maneuver model can then be queried to generate novel trajectories which pass through specific key points (e.g., the height at the top of a loop). Though learned from demonstrations, our algorithm can generate and fly trajectories which lie outside the limits of our demonstration trajectories.
We have used this method to learn and fly a variety of challenging aerobatic helicopter maneuvers autonomously from a small number of demonstrations. In addition, we expect this technique to generalize to other robotic platforms.
@inproceedings{tang_nipsws10, title = {Learning Parameterized Maneuvers for Autonomous Helicopter Aerobatics from Expert Demonstration}, author = {J. Tang and A. Singh and N. Goehausen and P. Abbeel}, booktitle = {NIPS Workshop on Probabilistic Approaches for Robotics and Control}, year = {2010}, }
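As a toy illustration of querying a parameterized maneuver for a key point (illustration only; the paper learns a probabilistic alignment and dynamics model, not the simple averaging and scaling shown here): average already-aligned demonstrations into a nominal trajectory, then rescale it so it passes through a requested key point such as the height at the top of a loop.

```python
def parameterized_maneuver(demos, target_peak):
    """Generate a new maneuver trajectory from demonstrations (sketch).

    Each demo is a list of altitudes sampled at the same N time steps
    (i.e. already time-aligned).  We average the demos into a nominal
    trajectory, then rescale it so its peak hits `target_peak` -- e.g.
    requesting a loop whose top is at a height never demonstrated.
    """
    n = len(demos[0])
    mean_traj = [sum(d[t] for d in demos) / len(demos) for t in range(n)]
    peak = max(mean_traj)
    scale = target_peak / peak
    return [scale * x for x in mean_traj]
```

Even this caricature shows the payoff the abstract describes: a single set of demonstrations yields a whole family of maneuvers indexed by the key-point parameter, rather than one fixed trajectory per demonstration set.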