Title | : | On Risk-Sensitive Reinforcement Learning: Algorithms, Analysis and Applications
Speaker | : | Prashanth L A (Institute for Systems Research, University of Maryland, USA) |
Details | : | Tue, 22 Dec, 2015 11:00 AM @ BSB 361 |
Abstract | : | In many sequential decision-making problems, one may want to manage risk by minimizing some measure of variability in rewards, in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. While the theory of risk-sensitive Markov decision processes (MDPs) is relatively well understood, and many such problems are known to be computationally intractable, little work has been done on solving risk-sensitive MDPs in a typical reinforcement learning (RL) setting. In this talk I will describe a few important steps that I took to approximately solve risk-sensitive MDPs, in both the discounted and the average reward settings. For each formulation, I will first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, I will derive a formula for computing its gradient and then devise actor-critic algorithms that operate on three timescales: a temporal difference (TD) critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and dual ascent on the Lagrange multipliers on the slowest timescale. In the discounted setting, I will point out the difficulty in estimating the gradient of the variance of the return and then present a simultaneous perturbation approach to alleviate this problem. The average reward setting, on the other hand, allows for an actor update that uses compatible features to estimate the gradient of the variance. The analysis of the aforementioned risk-sensitive RL algorithms involves statistical aspects of the popular TD algorithm with function approximation, and I will present concentration bounds that I derived in recent work for the latter algorithm. These bounds help in establishing, via the ordinary differential equation (ODE) method, the convergence of the risk-sensitive RL algorithms to locally risk-sensitive optimal policies. Finally, I will demonstrate the usefulness of the risk-sensitive RL algorithms in a traffic signal control application. In particular, the empirical results show that the risk-sensitive RL algorithms exhibit lower variance in the delay experienced by road users, as compared to the corresponding risk-neutral RL variants.
Speaker Bio | : | Prashanth L A is a post-doctoral researcher at the Institute for Systems Research, University of Maryland, College Park. His research interests are in reinforcement learning, stochastic optimization and multi-armed bandits. He earned his MS and PhD degrees from IISc Bangalore.
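For readers unfamiliar with the variance-constrained setup the abstract refers to, a plausible form of the discounted criterion is sketched below; the constraint level \alpha and initial state s^0 are illustrative symbols introduced here, not taken from the talk:

    \max_{\theta} \; V^{\theta}(s^0) \quad \text{subject to} \quad \Lambda^{\theta}(s^0) \le \alpha,

where V^{\theta}(s^0) is the mean and \Lambda^{\theta}(s^0) = U^{\theta}(s^0) - \bigl(V^{\theta}(s^0)\bigr)^2 the variance of the cumulative discounted reward under the policy parameterized by \theta, with U^{\theta} denoting the second moment (square value). The three-timescale scheme described in the abstract can then be read as seeking a saddle point of the Lagrangian

    L(\theta, \lambda) \;=\; -V^{\theta}(s^0) + \lambda \bigl( \Lambda^{\theta}(s^0) - \alpha \bigr),

with the actor updating \theta on the intermediate timescale and \lambda updated by dual ascent on the slowest timescale.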
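The following is a minimal sketch, not the speaker's algorithm, of the three-timescale update structure the abstract describes. The toy 2-state MDP, the softmax policy, the step-size exponents, the constraint level alpha_var and the surrogate variance-gradient term are all illustrative assumptions; only the separation into a fast TD critic, an intermediate actor and a slow dual ascent follows the abstract.

# Illustrative sketch of a three-timescale actor-critic for a
# variance-constrained discounted MDP (assumptions noted above).
import numpy as np

rng = np.random.default_rng(0)

# --- toy MDP (assumed for illustration): 2 states, 2 actions ---
n_states, n_actions = 2, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # rewards
gamma = 0.95
alpha_var = 2.0          # variance constraint level (assumed)

# --- parameters ---
theta = np.zeros((n_states, n_actions))   # tabular softmax policy parameters
v = np.zeros(n_states)                    # critic for the value function
u = np.zeros(n_states)                    # critic for the square value (second moment)
lam = 0.0                                 # Lagrange multiplier

def policy(s, th):
    p = np.exp(th[s] - th[s].max())
    return p / p.sum()

s = 0
for t in range(1, 100_000):
    # step sizes on three separated timescales (assumed schedules)
    a_fast = 1.0 / t ** 0.6       # critic
    a_mid  = 1.0 / t ** 0.8       # actor
    a_slow = 1.0 / t              # Lagrange multiplier

    pi = policy(s, theta)
    a = rng.choice(n_actions, p=pi)
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])

    # fastest timescale: TD critics for the value and the square value
    delta = r + gamma * v[s_next] - v[s]
    eps = r ** 2 + 2.0 * gamma * r * v[s_next] + gamma ** 2 * u[s_next] - u[s]
    v[s] += a_fast * delta
    u[s] += a_fast * eps

    # intermediate timescale: actor step on the Lagrangian
    # (score-function gradient; the variance term here is a crude surrogate,
    #  whereas the talk uses simultaneous perturbation / compatible features)
    grad_log = -pi
    grad_log[a] += 1.0
    theta[s] += a_mid * (delta - lam * (eps - 2.0 * v[s] * delta)) * grad_log

    # slowest timescale: dual ascent on the multiplier, comparing the current
    # variance estimate u - v^2 at the initial state against alpha_var
    lam = max(0.0, lam + a_slow * (u[0] - v[0] ** 2 - alpha_var))

    s = s_next

The exact actor and multiplier updates in the talk differ (notably the simultaneous perturbation estimate of the variance gradient in the discounted case); this sketch is only meant to make the timescale separation concrete.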