In this short note we derive a relationship between the Bregman divergence from the current policy to the optimal policy and the suboptimality of the current value function in a regularized Markov decision process. This result has implications for multi-task reinforcement learning, offline reinforcement learning, and regret analysis under function approximation, among others.
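As a hedged illustration of the kind of relationship described above, the sketch below numerically checks one well-known special case, which may differ from the general statement derived in the note. With entropy regularization at temperature `tau`, the suboptimality of a policy `pi` equals `tau / (1 - gamma)` times the expected KL divergence from `pi` to the regularized optimal policy, taken under the discounted state-visitation distribution of `pi`. The random MDP, the helper functions, and all variable names are assumptions introduced for illustration.

```python
import numpy as np

def soft_optimal(P, r, gamma, tau, iters=2000):
    """Soft value iteration for an entropy-regularized tabular MDP.

    P has shape (S, A, S), r has shape (S, A). Returns the regularized
    optimal value V* and the corresponding softmax policy pi*.
    """
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V                      # shape (S, A)
        m = Q.max(axis=1)                          # for a stable log-sum-exp
        V = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
    Q = r + gamma * P @ V
    pi_star = np.exp((Q - V[:, None]) / tau)
    pi_star /= pi_star.sum(axis=1, keepdims=True)
    return V, pi_star

def soft_value(P, r, gamma, tau, pi):
    """Exact regularized value of pi, with per-step bonus tau * entropy."""
    S = r.shape[0]
    r_pi = (pi * (r - tau * np.log(pi))).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def visitation(P, gamma, pi, rho):
    """Normalized discounted state-visitation distribution of pi from rho."""
    S = rho.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    return (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, rho)

# Hypothetical random MDP, used only for this check.
rng = np.random.default_rng(0)
S, A, gamma, tau = 4, 3, 0.9, 0.5
P = rng.random((S, A, S)) + 0.1
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
rho = np.full(S, 1.0 / S)
pi = rng.random((S, A)) + 0.1                      # arbitrary full-support policy
pi /= pi.sum(axis=1, keepdims=True)

V_star, pi_star = soft_optimal(P, r, gamma, tau)
V_pi = soft_value(P, r, gamma, tau, pi)
d = visitation(P, gamma, pi, rho)

kl = (pi * (np.log(pi) - np.log(pi_star))).sum(axis=1)   # KL(pi || pi*) per state
gap = rho @ (V_star - V_pi)                               # value suboptimality
bregman = tau / (1 - gamma) * (d @ kl)                    # weighted divergence
print(gap, bregman)
```

The KL divergence is the Bregman divergence generated by negative entropy, which is why this entropy-regularized case is a natural one to test; other regularizers would need their own divergence in place of `kl`.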
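We address the issue of safety in reinforcement learning. We pose the problem in a discounted infinite-horizon constrained Markov decision process framework. Existing results have shown that gradient-based methods are able to achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate for both the optimality gap and the constraint violation. We exhibit a natural policy gradient-based algorithm that has a faster convergence rate of $\mathcal{O}(\log(T)/T)$ for both the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can be further guaranteed for a sufficiently large $T$ while maintaining the same convergence rate.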
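Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization techniques, lies at the heart of modern reinforcement learning (RL). In addition to value maximization, other practical considerations commonly arise as well, including the need to encourage exploration and the need to ensure certain structural properties of the learned policy due to safety, resource, and operational constraints. These considerations can often be accounted for by resorting to regularized RL, which augments the target value function with a structure-promoting regularization term. Focusing on an infinite-horizon discounted Markov decision process, this paper proposes a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (Lan, 2021), the proposed algorithm accommodates a general class of convex regularizers as well as a broad family of Bregman divergences chosen in cognizance of the regularizer in use. We demonstrate that our algorithm converges linearly to the global solution over an entire range of learning rates, in a dimension-free fashion, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the applicability and appealing performance of GPMD.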
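We revisit the finite-time analysis of policy gradient methods in one of the simplest settings: finite state and action MDPs with a policy class consisting of all stochastic policies and with exact gradient evaluations. There has been some recent work viewing this setting as an instance of smooth nonlinear optimization problems and showing sublinear convergence rates with small step sizes. Here, we take a different perspective based on connections with policy iteration and show that many variants of policy gradient methods succeed with large step sizes and attain a linear rate of convergence.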
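The connection to policy iteration mentioned above can be sketched on a small tabular MDP. The example below uses the softmax natural policy gradient update as one representative variant; that choice, the step-size values, the random MDP, and every function name are assumptions for illustration, not the specific methods analyzed in the paper. With a very large step size the update puts almost all probability on the actions with the highest current Q-value, which is the greedy step of policy iteration.

```python
import numpy as np

def q_values(P, r, gamma, pi):
    """Exact Q^pi and V^pi for a tabular MDP with P of shape (S, A, S)."""
    S = r.shape[0]
    r_pi = (pi * r).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ V, V

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def npg(P, r, gamma, eta, iters):
    """Softmax NPG with exact evaluation: pi_{t+1} ∝ pi_t * exp(eta * Q_t / (1 - gamma)).

    Returns the state-averaged value of the policy at each iteration.
    """
    logits = np.zeros(r.shape)                     # start from the uniform policy
    values = []
    for _ in range(iters):
        Q, V = q_values(P, r, gamma, softmax(logits))
        values.append(V.mean())
        logits = logits + eta * Q / (1 - gamma)    # multiplicative-weights step
    return np.array(values)

def optimal_value(P, r, gamma, iters=2000):
    """Plain value iteration, used only as a reference for the optimum."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return V

# Hypothetical random MDP for the comparison.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

v_star = optimal_value(P, r, gamma).mean()
gap_small = v_star - npg(P, r, gamma, eta=0.01, iters=200)   # small step size
gap_large = v_star - npg(P, r, gamma, eta=50.0, iters=200)   # large step size
print(gap_small[-1], gap_large[-1])
```

Printing the two final gaps gives a quick, informal comparison of the small-step and large-step regimes on this one instance; it is not a substitute for the convergence rates proved in the paper.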
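Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms with strong performance across multiple domains. In this family of methods, agents are trained to maximize cumulative reward while penalizing deviation in behavior from some reference, or default, policy. In addition to empirical success, there is a strong theoretical foundation for understanding RPO methods applied to single tasks, with connections to natural gradient, trust region, and variational approaches. However, there is limited formal understanding of the desirable properties of default policies in the multi-task setting, an increasingly important domain as the field shifts towards training more generally capable agents. Here, we take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization. Using these results, we then derive a principled RPO algorithm for multi-task learning with strong performance guarantees.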
Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.
We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates of these methods for step size equal to the penalization strength.
We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. We first establish the global linear convergence of HPMD instantiated with Kullback-Leibler divergence, for both the optimality gap and a weighted distance to the set of optimal policies. Then local superlinear convergence is obtained for both quantities without any assumption. With local acceleration and diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with the maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a byproduct, we discover the finite-time exact convergence for some commonly used Bregman divergences, implying the continuing convergence of HPMD to the limiting policy even if the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that for small optimality gap, a better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right| \left|\mathcal{A}\right| / \epsilon^2)$ sample complexity holds with high probability, when assuming a generative model for policy evaluation.
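We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-off. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the algorithms designed on the basis of the ARNPG framework achieve $\tilde{O}(1/T)$ global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both exact-gradient and sample-based scenarios.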
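Policy regularization methods such as maximum entropy regularization are widely used in reinforcement learning to improve the robustness of a learned policy. In this paper, we show how this robustness arises from hedging against worst-case perturbations of the reward function, which are chosen from a limited set by an imagined adversary. Using convex duality, we characterize this robust set of adversarial reward perturbations under KL and alpha-divergence regularization, which includes Shannon and Tsallis entropy regularization as special cases. Importantly, generalization guarantees can be given within this robust set. We provide a detailed discussion of the worst-case reward perturbations, and present intuitive empirical examples to illustrate this robustness and its relationship with generalization. Finally, we discuss how our analysis complements and extends previous results on adversarial reward robustness and path consistency optimality conditions.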
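We consider the problem of solving robust Markov decision processes (MDPs), which involve a set of discounted, finite-state, finite-action MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we develop a policy-based first-order method, namely robust policy mirror descent (RPMD), and establish $\mathcal{O}(\log(1/\epsilon))$ and $\mathcal{O}(1/\epsilon)$ iteration complexities for finding an $\epsilon$-optimal policy, with two increasing-stepsize schemes. The aforementioned convergence of RPMD applies to any Bregman divergence, provided the policy space has a bounded radius, measured by the divergence, when centered at the initial policy. Moreover, when the Bregman divergence corresponds to the squared Euclidean distance, we establish an $\mathcal{O}(\max\{1/\epsilon, 1/(\eta \epsilon^2)\})$ complexity of RPMD with any constant stepsize $\eta$. For a general class of Bregman divergences, a similar complexity is also established for RPMD, provided the uncertainty set satisfies relative strong convexity. We further develop a stochastic variant, named SRPMD, for when first-order information is only available through online interactions with the nominal environment. For general Bregman divergences, we establish $\mathcal{O}(1/\epsilon^2)$ and $\mathcal{O}(1/\epsilon^3)$ sample complexities with two increasing-stepsize schemes. For the Euclidean Bregman divergence, we establish an $\mathcal{O}(1/\epsilon^3)$ sample complexity with constant stepsizes. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.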
In reinforcement learning (RL), the ability to utilize prior knowledge from previously solved tasks can allow agents to quickly solve new problems. In some cases, these new problems may be approximately solved by composing the solutions of previously solved primitive tasks (task composition). Otherwise, prior knowledge can be used to adjust the reward function for a new problem, in a way that leaves the optimal policy unchanged but enables quicker learning (reward shaping). In this work, we develop a general framework for reward shaping and task composition in entropy-regularized RL. To do so, we derive an exact relation connecting the optimal soft value functions for two entropy-regularized RL problems with different reward functions and dynamics. We show how the derived relation leads to a general result for reward shaping in entropy-regularized RL. We then generalize this approach to derive an exact relation connecting optimal value functions for the composition of multiple tasks in entropy-regularized RL. We validate these theoretical contributions with experiments showing that reward shaping and task composition lead to faster learning in various settings.
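Policy gradient methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by standard dynamic programming techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to a stationary point. This work identifies structural properties (shared by several classic control problems) that ensure the policy gradient objective function has no suboptimal stationary points despite being non-convex. When these conditions are strengthened, this objective satisfies a Polyak-Lojasiewicz (gradient dominance) condition that yields convergence rates. We also provide bounds on the optimality gap of any stationary point when some of these conditions are relaxed.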
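Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite recent efforts to relax these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. An important open problem is whether we can achieve sample-efficient offline RL with weak assumptions on both factors. In this paper, we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against the offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.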
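We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from suboptimal convergence rates, as well as failure to handle insufficiently random policies (e.g., deterministic policies) for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies, along with optimal convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish a linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with the finite-sample analysis of policy gradient methods for discounted MDPs, existing studies of policy gradient methods for AMDPs have mostly been carried out under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexity. To this end, we develop an average-reward variant of stochastic policy mirror descent (SPMD) (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with a policy gradient method under both the generative model (with a unichain assumption) and the Markovian noise model (with an ergodicity assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.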
Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and then iteratively update the policies for each state. This prevents the application of many well-studied RL methods, especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to deal with general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that we do not need to use explicit policy parameterization at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish a linear convergence rate to global optimality or sublinear convergence to stationarity for these methods applied to solve different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation errors for policy evaluation and investigate their impact on the convergence of these methods applied to general-state RL problems with either finite-action or continuous-action spaces. To the best of our knowledge, the development of these algorithmic frameworks as well as their convergence analysis appear to be new in the literature.
In robust Markov decision processes (MDPs), the uncertainty in the transition kernel is addressed by finding a policy that optimizes the worst-case performance over an uncertainty set of MDPs. While much of the literature has focused on discounted MDPs, robust average-reward MDPs remain largely unexplored. In this paper, we focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set. We first take an approach that approximates average-reward MDPs using discounted MDPs. We prove that the robust discounted value function converges to the robust average-reward as the discount factor $\gamma$ goes to $1$, and moreover, when $\gamma$ is large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust average-reward. We further design a robust dynamic programming approach, and theoretically characterize its convergence to the optimum. Then, we investigate robust average-reward MDPs directly without using discounted MDPs as an intermediate step. We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value iteration algorithm that provably finds its solution, or equivalently, the optimal robust policy.
Offline reinforcement learning (RL) concerns pursuing an optimal policy for sequential decision-making from a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators, especially to handle the case with excessively large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation, under only partial data coverage and realizability assumptions on the function classes, with favorable computational tractability. In this work, we revisit the LP framework for offline RL, and advance the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of using any regularization as in the literature, sometimes also with careful choices of the function classes and initial state distributions. We hope our insights further advocate the study of the LP framework, as well as the induced primal-dual minimax optimization, in offline RL.
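Entropy regularization is a popular method in reinforcement learning (RL). Although it has many advantages, it alters the RL objective of the original Markov decision process (MDP). Though divergence regularization has been proposed to settle this problem, it cannot be trivially applied to cooperative multi-agent reinforcement learning (MARL). In this paper, we investigate divergence regularization in cooperative MARL and propose a novel off-policy cooperative MARL framework, divergence-regularized multi-agent actor-critic (DMAC). Theoretically, we derive the update rule of DMAC, which is naturally off-policy and guarantees monotonic policy improvement and convergence in both the original MDP and the divergence-regularized MDP. We also give a bound on the discrepancy between the converged policy and the optimal policy in the original MDP. DMAC is a flexible framework and can be combined with many existing MARL algorithms. Empirically, we evaluate DMAC in a didactic stochastic game and the StarCraft Multi-Agent Challenge, and show that DMAC substantially improves the performance of existing MARL algorithms.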
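We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces, where the goal is to maximize the expected cumulative reward subject to some constraints. We propose a novel Conservative Natural Policy Gradient Primal-Dual algorithm (C-NPG-PD) to achieve zero constraint violation while achieving state-of-the-art convergence results for the objective value function. For general policy parametrization, we prove convergence of the value function to the global optimum up to an approximation error due to the restricted policy class. We even improve the sample complexity of the existing constrained NPG-PD algorithm \cite{ding2020} from $\mathcal{O}(1/\epsilon^6)$ to $\mathcal{O}(1/\epsilon^4)$. To the best of our knowledge, this is the first work to establish zero constraint violation with natural policy gradient style algorithms for infinite-horizon discounted CMDPs. We demonstrate the merits of the proposed algorithm via experimental evaluation.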