Title: Policy Gradient Algorithms for Sequential Decision Making under Uncertainty and Risk
Speaker: Nithia V (IITM)
Details: Thu, 20 Jun 2024, 10:30 AM, via Google Meet
Abstract: We consider the problem of sequential decision making under uncertainty and risk. Our objective is to propose and analyze novel algorithms that model the problem within the Markov decision process (MDP) framework and learn policies suited to both risk-neutral and risk-sensitive reinforcement learning (RL) settings. More specifically, we propose novel algorithms that leverage the policy gradient framework to learn parameterized stochastic policies by optimizing an objective function that governs the type of policy to be learned, viz., risk-neutral or risk-sensitive.

Learning risk-neutral policies involves using the expected value of the cumulative discounted reward as the objective function. We propose two policy gradient algorithms that incorporate a smoothed-functional (SF) based gradient estimation scheme to optimize this expected value in an off-policy RL context. The first algorithm combines importance sampling-based off-policy evaluation with SF-based gradient estimation. The second algorithm, inspired by the stochastic variance reduced gradient (SVRG) algorithm, incorporates variance reduction into the update iteration. Our non-asymptotic analysis establishes the convergence of both algorithms to an approximate stationary point. The first algorithm achieves a convergence rate comparable to an off-policy adaptation of the well-known REINFORCE algorithm, which employs a gradient estimation scheme based on the likelihood ratio (LR) method. In contrast, the second algorithm, owing to its variance reduction, exhibits an improved rate of convergence compared to REINFORCE.

Learning risk-sensitive policies involves using a risk measure of the cumulative discounted reward as the objective function. Initially, we consider optimizing a family of risk measures called distortion risk measures (DRM). A DRM applies a distortion function to the original distribution and computes the mean of the cumulative reward with respect to the distorted distribution. We propose and analyze novel policy gradient algorithms for the on-policy and off-policy RL settings, which utilize LR-based and SF-based gradient estimation schemes, respectively. Subsequently, we generalize the policy gradient algorithms that employ an SF-based gradient estimation scheme to the broad class of smooth risk measures (SRM). Optimizing SRMs is highly valuable since a wide variety of risk measures, including the DRM and the mean-variance risk measure (MVRM), can be categorized as SRMs under generally applicable conditions. In contrast to the DRM, which distorts the distribution, the MVRM uses the variance to model the risk associated with the cumulative reward and enables a trade-off between the mean and the variance of the cumulative reward. To optimize SRMs, we develop template policy gradient algorithms that employ an SF-based gradient estimation scheme, and we demonstrate that these algorithms apply to the optimization of both the MVRM and the DRM. Despite relying on biased estimates of the risk measure, in contrast to the unbiased estimates available for the expected value, our risk-sensitive algorithms maintain a convergence rate comparable to that of the risk-neutral algorithms.
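To make the first risk-neutral scheme concrete, here is a minimal sketch of an SF-based gradient estimate combined with importance sampling-based off-policy evaluation. The trajectory format, the `make_logprob` factory, and the perturbation and averaging details are illustrative assumptions, not the speaker's exact construction.

```python
import numpy as np

def is_return(trajectory, target_logprob, behavior_logprob, gamma=0.99):
    """Importance-sampling estimate of the target policy's discounted return
    from a trajectory collected under the behavior policy.
    `trajectory` is assumed to be a list of (state, action, reward) tuples."""
    log_w, ret = 0.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        log_w += target_logprob(s, a) - behavior_logprob(s, a)
        ret += (gamma ** t) * r
    return np.exp(log_w) * ret

def sf_gradient(theta, trajectories, make_logprob, behavior_logprob, delta=0.1, rng=None):
    """Two-sided smoothed-functional (SF) gradient estimate of the off-policy
    objective J(theta): perturb theta along a random Gaussian direction u and
    take a finite difference of importance-sampling return estimates."""
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)           # random perturbation direction
    j_plus = np.mean([is_return(tau, make_logprob(theta + delta * u), behavior_logprob)
                      for tau in trajectories])
    j_minus = np.mean([is_return(tau, make_logprob(theta - delta * u), behavior_logprob)
                       for tau in trajectories])
    return u * (j_plus - j_minus) / (2.0 * delta)  # ascent direction estimate
```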
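The second risk-neutral algorithm adds SVRG-style variance reduction. The following is a generic sketch of such an update loop, assuming a hypothetical `sample_batch` routine that returns off-policy trajectories and reusing any gradient estimator such as `sf_gradient` above; the step size, batch sizes, and epoch lengths are placeholders.

```python
import numpy as np

def svrg_policy_gradient(theta0, sample_batch, grad_est, n_epochs=10, m=20, step=0.01):
    """SVRG-style variance-reduced gradient ascent on J(theta).
    `grad_est(theta, batch)` can be any gradient estimator (e.g. the SF sketch
    above); `sample_batch(n)` is a hypothetical routine returning n trajectories."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_epochs):
        theta_ref = theta.copy()
        mu = grad_est(theta_ref, sample_batch(200))   # anchor gradient on a large batch
        for _ in range(m):
            mini = sample_batch(10)                   # small batch per inner step
            # Variance-reduced direction: mini-batch gradient at theta, corrected by
            # its value at the reference point plus the anchor gradient.
            g = grad_est(theta, mini) - grad_est(theta_ref, mini) + mu
            theta = theta + step * g                  # ascent step on the objective
    return theta
```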
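On the risk-sensitive side, a DRM of the cumulative reward can be estimated from sampled returns as a weighted sum of order statistics, with weights given by increments of the distortion function at empirical tail probabilities. The sketch below shows this standard L-statistic estimator with a CVaR-type distortion as an example; the exact estimator and distortion functions considered in the talk may differ.

```python
import numpy as np

def drm_estimate(returns, distortion):
    """Estimate a distortion risk measure (DRM) of the cumulative reward from
    sampled returns: a weighted sum of order statistics, the weights being
    increments of the distortion function at empirical tail probabilities."""
    x = np.sort(np.asarray(returns, dtype=float))     # x_(1) <= ... <= x_(n)
    n = len(x)
    tail = np.arange(n, 0, -1) / n                    # empirical P(X >= x_(i))
    weights = distortion(tail) - distortion(tail - 1.0 / n)
    return float(np.dot(weights, x))

# Example: a CVaR-type distortion g(u) = min(u / alpha, 1) with alpha = 0.1,
# which concentrates the weights on the worst 10% of sampled returns.
cvar_distortion = lambda u, alpha=0.1: np.minimum(u / alpha, 1.0)
# drm_estimate(sampled_returns, cvar_distortion)      # usage, given sampled returns
```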