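Safe exploration is a common problem in reinforcement learning (RL) that aims to prevent agents from making disastrous decisions while exploring their environment. A family of approaches to this problem assumes domain knowledge in the form of a (partial) model of this environment to decide upon the safety of an action. A so-called shield forces the RL agent to select only safe actions. However, for adoption in various applications, one must look beyond enforcing safety and also ensure that RL remains applicable with good performance. We extend the applicability of shields via tight integration with state-of-the-art deep RL, and provide an extensive empirical study in challenging, sparse-reward environments under partial observability. We show that a carefully integrated shield ensures safety and can improve the convergence rate and final performance of RL agents. We furthermore show that a shield can be used to bootstrap state-of-the-art RL agents: they remain safe after initial learning in a shielded setting, allowing us to eventually disable a potentially too conservative shield.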
translated by Google Translate
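One simple way to picture how a shield restricts an agent to safe actions is action masking inside an epsilon-greedy rule. The sketch below is only an illustration under stated assumptions: the function name `shielded_epsilon_greedy`, the dictionary of Q-values, and the set `safe_actions` (assumed to be supplied by some shield) are hypothetical, and the deep RL integration studied in the paper is not shown.

```python
import random


def shielded_epsilon_greedy(q_values, safe_actions, epsilon, rng):
    """Pick an action among those the shield allows.

    q_values:     dict mapping action -> estimated value (hypothetical format)
    safe_actions: set of actions the shield currently allows (assumed given)
    epsilon:      exploration probability
    rng:          a random.Random instance, passed in for reproducibility
    """
    if not safe_actions:
        raise ValueError("the shield must allow at least one action")
    candidates = sorted(safe_actions)  # fixed order keeps runs reproducible
    if rng.random() < epsilon:
        # Explore, but only within the shielded action set.
        return rng.choice(candidates)
    # Exploit: best-valued action among the safe ones.
    return max(candidates, key=lambda a: q_values[a])


example_q = {"left": 1.0, "right": 5.0, "stay": 2.0}
example_safe = {"left", "stay"}  # "right" is blocked by the assumed shield
greedy_choice = shielded_epsilon_greedy(example_q, example_safe, 0.0, random.Random(0))
```

Even though `right` has the highest value in this toy example, it is never selected, because both the exploratory and the greedy branch draw from the shielded set only.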
Safety is still one of the major research challenges in reinforcement learning (RL). In this paper, we address the problem of how to avoid safety violations of RL agents during exploration in probabilistic and partially unknown environments. Our approach combines automata learning for Markov Decision Processes (MDPs) and shield synthesis in an iterative approach. Initially, the MDP representing the environment is unknown. The agent starts exploring the environment and collects traces. From the collected traces, we passively learn MDPs that abstractly represent the safety-relevant aspects of the environment. Given a learned MDP and a safety specification, we construct a shield. For each state-action pair within a learned MDP, the shield computes the exact probability that executing the action from the current state results in violating the specification within the next $k$ steps. Once constructed, the shield is used at runtime and blocks any action of the agent that induces too large a risk. The shielded agent continues to explore the environment and collects new data on the environment. Iteratively, we use the collected data to learn new MDPs with higher accuracy, which in turn yield shields that are able to prevent more safety violations. We implemented our approach and present a detailed case study of a Q-learning agent exploring slippery Gridworlds. In our experiments, we show that as the agent explores more and more of the environment during training, the improved learned models lead to shields that are able to prevent many safety violations.
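The shield computation described above can be illustrated with a bounded-horizon recursion over a learned MDP. The following is a minimal sketch, not the authors' implementation: the dictionary-based MDP encoding, the function names `violation_probabilities` and `allowed_actions`, the toy states, and the assumption that the agent follows the least risky continuation after the first action are all illustrative choices that the abstract does not specify.

```python
from typing import Dict, List, Set, Tuple

# mdp[state][action] = list of (probability, next_state); every state needs
# at least one action, and unsafe states are modelled as absorbing.
MDP = Dict[str, Dict[str, List[Tuple[float, str]]]]


def violation_probabilities(mdp: MDP, unsafe: Set[str], k: int) -> Dict[Tuple[str, str], float]:
    """Probability of reaching an unsafe state within k steps per (state, action).

    Assumption: after the first action, the least risky continuation is taken.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # risk[s]: minimal violation probability from s within the steps already covered
    risk = {s: (1.0 if s in unsafe else 0.0) for s in mdp}
    q: Dict[Tuple[str, str], float] = {}
    for _ in range(k):
        q = {
            (s, a): 1.0 if s in unsafe else sum(p * risk[t] for p, t in succ)
            for s, actions in mdp.items()
            for a, succ in actions.items()
        }
        risk = {
            s: 1.0 if s in unsafe else min(q[(s, a)] for a in mdp[s])
            for s in mdp
        }
    return q


def allowed_actions(mdp: MDP, q: Dict[Tuple[str, str], float], state: str, threshold: float) -> List[str]:
    """Actions whose k-step violation probability does not exceed the threshold."""
    return [a for a in mdp[state] if q[(state, a)] <= threshold]


# Toy slippery world (hypothetical): the shortcut slips into the pit 20% of the time.
toy_mdp: MDP = {
    "start": {"safe_path": [(1.0, "mid")], "shortcut": [(0.8, "goal"), (0.2, "pit")]},
    "mid": {"forward": [(0.9, "goal"), (0.1, "pit")], "wait": [(1.0, "mid")]},
    "goal": {"stay": [(1.0, "goal")]},
    "pit": {"stay": [(1.0, "pit")]},
}
toy_q = violation_probabilities(toy_mdp, unsafe={"pit"}, k=2)
```

With a threshold of 0.1, `allowed_actions(toy_mdp, toy_q, "start", 0.1)` keeps only `safe_path`, because the shortcut carries a 0.2 risk of ending in the pit. Relearning the MDP from new traces and recomputing `toy_q` would correspond to one iteration of the loop described in the abstract.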
Despite the recent impressive results on reinforcement learning (RL), safety is still one of the major research challenges in RL. RL is a machine-learning approach to determine near-optimal policies in Markov decision processes (MDPs). In this paper, we consider the setting where the safety-relevant fragment of the MDP together with a temporal logic safety specification is given and many safety violations can be avoided by planning ahead a short time into the future. We propose an approach for online safety shielding of RL agents. During runtime, the shield analyses the safety of each available action. For any action, the shield computes the maximal probability to not violate the safety specification within the next $k$ steps when executing this action. Based on this probability and a given threshold, the shield decides whether to block an action from the agent. Existing offline shielding approaches exhaustively compute the safety of all state-action combinations ahead of time, resulting in huge computation times and large memory consumption. The intuition behind online shielding is to compute at runtime the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed and used for shielding as soon as one of the considered states is reached. Our approach is well suited for high-level planning problems where the time between decisions can be used for safety computations and it is acceptable for the agent to wait until these computations are finished. For our evaluation, we selected a 2-player version of the classical computer game SNAKE. The game represents a high-level planning problem that requires fast decisions and the multiplayer setting induces a large state space, which is computationally expensive to analyse exhaustively.
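In practical applications, we can rarely assume full observability of a system's environment, even though such knowledge is important for determining a reactive control system's precise interaction with its environment. Therefore, we propose an approach for reinforcement learning (RL) in partially observable environments. While we assume that the environment behaves like a partially observable Markov decision process, we assume no knowledge about its structure or transition probabilities. Our approach combines Q-learning with IoAlergia, a method for learning Markov decision processes (MDPs). By learning MDP models of the environment from episodes of the RL agent, we enable RL in partially observable domains without explicit, additional memory to track previous interactions for dealing with ambiguities stemming from partial observability. Instead, we provide RL with additional observations in the form of abstract environment states, obtained by simulating new experiences on learned environment models to track the explored states. In our evaluation, we report on the validity of our approach and its promising performance in comparison to six state-of-the-art deep RL techniques with recurrent neural networks and fixed memory.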
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
A long-standing challenge in artificial intelligence is lifelong learning. In lifelong learning, many tasks are presented in sequence and learners must efficiently transfer knowledge between tasks while avoiding catastrophic forgetting over long lifetimes. On these problems, policy reuse and other multi-policy reinforcement learning techniques can learn many tasks. However, they can generate many temporary or permanent policies, resulting in memory issues. Consequently, there is a need for lifetime-scalable methods that continually refine a policy library of a pre-defined size. This paper presents a first approach to lifetime-scalable policy reuse. To pre-select the number of policies, a notion of task capacity, the maximal number of tasks that a policy can accurately solve, is proposed. To evaluate lifetime policy reuse using this method, two state-of-the-art single-actor base-learners are compared: 1) a value-based reinforcement learner, Deep Q-Network (DQN) or Deep Recurrent Q-Network (DRQN); and 2) an actor-critic reinforcement learner, Proximal Policy Optimisation (PPO) with or without Long Short-Term Memory layer. By selecting the number of policies based on task capacity, D(R)QN achieves near-optimal performance with 6 policies in a 27-task MDP domain and 9 policies in an 18-task POMDP domain; with fewer policies, catastrophic forgetting and negative transfer are observed. Due to slow, monotonic improvement, PPO requires fewer policies, 1 policy for the 27-task domain and 4 policies for the 18-task domain, but it learns the tasks with lower accuracy than D(R)QN. These findings validate lifetime-scalable policy reuse and suggest using D(R)QN for larger and PPO for smaller library sizes.
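Markov decision processes are commonly used for sequential decision making under uncertainty. For many aspects, however, ranging from constrained or safe specifications to various kinds of temporal (non-Markovian) dependencies in task and reward structures, extensions are needed. To that end, interest has grown in recent years in combinations of reinforcement learning and temporal logic, that is, combinations of flexible behavior learning methods with robust verification and guarantees. In this paper, we describe an experimental investigation of the recently introduced regular decision processes, which support non-Markovian reward functions as well as non-Markovian transition functions. In particular, we provide, for regular decision processes, algorithmic extensions relating to online, incremental learning, an empirical evaluation of model-free and model-based solution algorithms, and applications in regular, but non-Markovian, grid worlds.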
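In the last decade, there have been significant advances in multi-agent reinforcement learning (MARL), but numerous challenges remain, such as high sample complexity and slow convergence to stable policies, that need to be overcome before wide-spread deployment is possible. However, many real-world environments already deploy, in practice, sub-optimal or heuristic approaches for generating policies. An interesting question is how best to use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM) and to evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, have performances that compare favourably to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.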
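Reinforcement learning (RL) is a central problem in artificial intelligence. This problem consists of defining artificial agents that can learn optimal behaviour by interacting with an environment, where the optimal behaviour is defined with respect to a reward signal that the agent seeks to maximize. Reward machines (RMs) provide a structured, automata-based representation of a reward function that enables an RL agent to decompose an RL problem into structured subproblems that can be efficiently learned via off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.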
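To make the notion of a reward machine concrete, the sketch below encodes one as a small automaton whose transitions are triggered by high-level events and emit rewards. It is a hand-written illustration under stated assumptions: the class name `RewardMachine`, the event labels `key` and `door`, and the state names `u0` to `u2` are hypothetical, and the discrete optimization procedure that learns the machine from experience is not shown.

```python
class RewardMachine:
    """Finite automaton over events; each transition emits a reward."""

    def __init__(self, initial, transitions):
        # transitions: dict mapping (rm_state, event) -> (next_rm_state, reward)
        self.initial = initial
        self.transitions = transitions
        self.state = initial

    def reset(self):
        self.state = self.initial
        return self.state

    def step(self, event):
        """Advance on an observed event; unknown events leave the state unchanged."""
        next_state, reward = self.transitions.get((self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward


# Hypothetical task: pick up a key first, then open the door.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "key"): ("u1", 0.0),   # key collected, no reward yet
        ("u1", "door"): ("u2", 1.0),  # door opened with the key: task solved
    },
)
rm.reset()
rewards = [rm.step(event) for event in ["door", "key", "door"]]
```

Here the first `door` event earns nothing because the key has not been collected. A policy conditioned on the pair of current observation and `rm.state` can be memoryless, since the machine state summarises the relevant part of the history.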
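State-of-the-art multi-agent reinforcement learning (MARL) methods have provided promising solutions to a variety of complex problems. Yet, these methods all assume that agents perform synchronized primitive-action executions, so they are not genuinely scalable to long-horizon real-world multi-agent/robot tasks that inherently require agents/robots to reason asynchronously about high-level action selection at varying time durations. The Macro-Action Decentralized Partially Observable Markov Decision Process (MacDec-POMDP) is a general formalization for asynchronous decision-making under uncertainty in fully cooperative multi-agent tasks. In this thesis, we first propose a group of value-based RL approaches for MacDec-POMDPs, where agents are allowed to perform asynchronous learning and decision-making with macro-action-value functions in three paradigms: decentralized learning and control, centralized learning and control, and centralized training for decentralized execution (CTDE). Building on the above work, we formulate a set of macro-action-based policy gradient algorithms under the three training paradigms, where agents are allowed to directly optimize their parameterized policies in an asynchronous manner. We evaluate our methods both in simulation and on real robots. Empirical results demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality and asynchronous solutions with macro-actions.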
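The standard formulation of reinforcement learning lacks a practical way of specifying which behaviors are admissible and which are forbidden. Most often, practitioners go about the task of behavior specification by manually engineering the reward function, a counter-intuitive process that requires several iterations and is prone to reward hacking by the agent. In this work, we argue that constrained RL, which has been used almost exclusively for safe RL, also has the potential to significantly reduce the amount of work spent on reward specification in applied reinforcement learning projects. To this end, we propose to specify behavioral preferences in the CMDP framework and to use Lagrangian methods, which seek to solve a min-max problem between the agent's policy and the Lagrangian multipliers, to automatically weigh each of the behavioral constraints. Specifically, we investigate how CMDPs can be adapted to solve goal-based tasks while adhering to a set of behavioral constraints, and propose modifications to the SAC-Lagrangian algorithm to handle the challenging case of several constraints. We evaluate this framework on a set of continuous control tasks relevant to the application of reinforcement learning for NPC design in video games.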
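The automatic weighing of constraints by Lagrangian multipliers can be illustrated with the standard dual ascent update, in which each multiplier grows while its constraint is violated and shrinks, down to zero, once the constraint is satisfied. This is a generic sketch rather than the modified SAC-Lagrangian algorithm proposed in the paper: the function names, the learning rate `lr`, and the use of plain Python lists for costs and limits are assumptions made for illustration.

```python
def update_multipliers(multipliers, avg_costs, limits, lr=0.1):
    """One dual ascent step: lambda_i <- max(0, lambda_i + lr * (J_Ci - d_i)).

    multipliers: current Lagrangian multipliers, one per constraint
    avg_costs:   estimated expected cost J_Ci of the current policy per constraint
    limits:      allowed cost threshold d_i per constraint
    """
    return [
        max(0.0, lam + lr * (cost - limit))
        for lam, cost, limit in zip(multipliers, avg_costs, limits)
    ]


def penalized_reward(reward, step_costs, multipliers):
    """Reward seen by the policy: task reward minus multiplier-weighted costs."""
    return reward - sum(lam * c for lam, c in zip(multipliers, step_costs))


# Hypothetical numbers: constraint 0 is violated (0.3 > 0.1), constraint 1 is satisfied.
new_multipliers = update_multipliers([0.0, 1.0], [0.3, 0.05], [0.1, 0.1], lr=0.5)
shaped = penalized_reward(1.0, [1.0, 0.0], [0.5, 2.0])
```

In this toy update the first multiplier rises from 0.0 to about 0.1 and the second falls from 1.0 to about 0.975, so the policy is penalised more strongly for the behavior it currently violates. The policy side of the min-max problem would then maximise `penalized_reward` with any RL algorithm.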
Reinforcement learning in partially observable domains is challenging due to the lack of observable state information. Thankfully, learning offline in a simulator with such state information is often possible. In particular, we propose a method for partially observable reinforcement learning that uses a fully observable policy (which we call a state expert) during offline training to improve online performance. Based on Soft Actor-Critic (SAC), our agent balances performing actions similar to the state expert and getting high returns under partial observability. Our approach can leverage the fully-observable policy for exploration and parts of the domain that are fully observable while still being able to learn under partial observability. On six robotics domains, our method outperforms pure imitation, pure reinforcement learning, the sequential or parallel combination of both types, and a recent state-of-the-art method in the same setting. A successful policy transfer to a physical robot in a manipulation task from pixels shows our approach's practicality in learning interesting policies under partial observability.
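Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained and increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns, this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents built from simple, modular components that can be used at various scales of execution. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions and as an important contribution to reproducibility in RL research. In this work, we describe the major design decisions made within Acme and give further details on how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms and show how these algorithms can be scaled up for much larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that run at massive scales while still maintaining the inherent readability of the implementation. This work presents a second version of the paper, which coincides with an increase in modularity, additional emphasis on offline, imitation, and learning-from-demonstrations algorithms, as well as various new agents implemented as part of Acme.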
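Reinforcement learning (RL) involves performing exploratory actions in an unknown system. This can place a learning agent in dangerous and potentially catastrophic system states. Current approaches for tackling safe learning in RL simultaneously trade off safe exploration and task fulfillment. In this paper, we introduce a new generation of RL solvers that learn to minimise safety violations while maximising the task reward to the extent that can be tolerated by the safe policy. Our approach introduces a novel two-player framework for safe RL called Distributive Exploration Safety Training Algorithm (DESTA). The core of DESTA is a game between two adaptive agents: a Safety Agent, which is delegated the task of minimising safety violations, and a Task Agent, whose goal is to maximise the environment reward. Specifically, the Safety Agent can selectively take control of the system at any given point to prevent safety violations, while the Task Agent is free to execute its policy at all other states. This framework enables the Safety Agent to learn to take actions at certain states that minimise future safety violations, both during training and at test time, while the Task Agent performs actions that maximise task performance everywhere else. Theoretically, we prove that DESTA converges to stable points, enabling safety violations of pretrained policies to be minimised. Empirically, we show DESTA's ability to augment the safety of existing policies and, secondly, to construct safe RL policies when the Task Agent and Safety Agent are trained concurrently. We demonstrate DESTA's superior performance against leading RL methods in Lunar Lander and Frozen Lake from OpenAI Gym.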
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
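Safety has become one of the main challenges of applying deep reinforcement learning to real-world systems. Currently, the incorporation of external knowledge such as human oversight is the only means to prevent the agent from visiting catastrophic states. In this paper, we propose MBHI, a novel framework for safe model-based reinforcement learning, which ensures safety at the state level and can effectively avoid both "local" and "non-local" catastrophes. An ensemble of supervised learners is trained in MBHI to imitate human blocking decisions. Similar to the human decision-making process, MBHI rolls out an imagined trajectory in the dynamics model before executing an action in the environment and estimates its safety. When the imagination encounters a catastrophe, MBHI blocks the current action and uses an efficient MPC method to output a safe policy. We evaluate our method on several safety tasks, and the results show that MBHI achieves better performance than the baselines in terms of sample efficiency and number of catastrophes.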
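The increasing adoption of reinforcement learning in safety-critical system domains such as autonomous vehicles, health, and aviation raises the need to ensure its safety. Existing safety mechanisms such as adversarial training, adversarial detection, and robust learning are not always adapted to all the disturbances under which the agent is deployed. These disturbances include moving adversaries whose behavior can be unpredictable for the agent and, as a matter of fact, harmful to its learning. Ensuring the safety of critical systems also requires formal guarantees on the behavior of the agent in a perturbed environment. It is therefore necessary to propose new solutions adapted to the learning challenges faced by the agent. In this paper, we first generate adversarial agents that expose flaws in the agent's policy by presenting moving adversaries. Secondly, we use reward shaping and a modified Q-learning algorithm as defense mechanisms to improve the agent's policy when facing adversarial perturbations. Finally, probabilistic model checking is employed to evaluate the effectiveness of both mechanisms. We conducted experiments on a discrete grid world with a single agent facing non-learning and learning adversaries. Our results show a reduction in the number of collisions between the agent and the adversaries. Probabilistic model checking provides lower and upper probabilistic bounds on the agent's safety in the adversarial environment.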
Drug dosing is an important application of AI, which can be formulated as a Reinforcement Learning (RL) problem. In this paper, we identify two major challenges of using RL for drug dosing: delayed and prolonged effects of administering medications, which break the Markov assumption of the RL framework. We focus on prolongedness and define PAE-POMDP (Prolonged Action Effect-Partially Observable Markov Decision Process), a subclass of POMDPs in which the Markov assumption does not hold specifically due to prolonged effects of actions. Motivated by the pharmacology literature, we propose a simple and effective approach to converting drug dosing PAE-POMDPs into MDPs, enabling the use of the existing RL algorithms to solve such problems. We validate the proposed approach on a toy task, and a challenging glucose control task, for which we devise a clinically-inspired reward function. Our results demonstrate that: (1) the proposed method to restore the Markov assumption leads to significant improvements over a vanilla baseline; (2) the approach is competitive with recurrent policies which may inherently capture the prolonged effect of actions; (3) it is remarkably more time and memory efficient than the recurrent baseline and hence more suitable for real-time dosing control systems; and (4) it exhibits favorable qualitative behavior in our policy analysis.
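We introduce a novel approach to dynamic obstacle avoidance based on deep reinforcement learning by defining a traffic-type-independent environment with variable complexity. Filling a gap in the current literature, we thoroughly investigate the effect of missing velocity information on an agent's performance in obstacle avoidance tasks. This is a crucial issue in practice, since several sensors yield only positional information of objects or vehicles. We evaluate frequently applied approaches to partial observability, namely the incorporation of recurrency in the deep neural networks and simple frame-stacking. For our analysis, we rely on state-of-the-art model-free deep RL algorithms. The lack of velocity information is found to impair the agent's performance. Both approaches, recurrency and frame-stacking, cannot consistently replace missing velocity information in the observation space. However, in simplified scenarios, they can significantly boost performance and stabilize the overall training procedure.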
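Frame-stacking, one of the two approaches evaluated above, concatenates the most recent observations so that a feed-forward policy can in principle infer velocities from consecutive positions. The sketch below is a minimal, library-free illustration: the class name `FrameStack`, the tuple-based observation format, and the choice to pad with copies of the first observation at reset are assumptions, not details taken from the paper.

```python
from collections import deque


class FrameStack:
    """Keeps the last `num_frames` observations and returns them concatenated."""

    def __init__(self, num_frames):
        if num_frames < 1:
            raise ValueError("num_frames must be at least 1")
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_obs):
        """Start a new episode; pad the history with copies of the first observation."""
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(tuple(first_obs))
        return self.observation()

    def step(self, obs):
        """Add the newest observation; the oldest one is dropped automatically."""
        self.frames.append(tuple(obs))
        return self.observation()

    def observation(self):
        """Flatten the stored frames, oldest first, into one tuple."""
        return tuple(x for frame in self.frames for x in frame)


# Position-only observations (x, y) of a hypothetical obstacle.
stack = FrameStack(num_frames=2)
obs0 = stack.reset((0.0, 0.0))
obs1 = stack.step((1.0, 0.5))
obs2 = stack.step((2.0, 1.0))
```

From `obs2`, which holds the positions (1.0, 0.5) and (2.0, 1.0), a network could recover the displacement per step, although the abstract reports that this does not consistently replace explicit velocity information.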
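As the autonomous driving industry grows, so does the potential interaction of groups of autonomous cars. Combined with advances in artificial intelligence and simulation, such groups can be simulated, and safe models for controlling the cars within them can be learned. This study applies reinforcement learning to the problem of multi-agent car parking, where cars aim to park efficiently while remaining safe and rational. Utilising robust tools and machine learning frameworks, we design and implement a flexible car parking environment in the form of a Markov decision process with independent learners, exploiting multi-agent communication. We implement a suite of tools to perform experiments at scale, obtaining models that park up to 7 cars with a success rate of over 98.1%, thereby surpassing existing single-agent models. We also obtain several results relating to the competitive and collaborative behaviours exhibited by the cars in our environment, with varying densities and levels of communication. Notably, we discover a form of collaboration without competition, and a "leaky" form of collaboration in which agents collaborate without sufficient state. Such work has numerous potential applications in the autonomous driving and fleet management industries, and provides several useful techniques and benchmarks for applying reinforcement learning to multi-agent car parking.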