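Although deep reinforcement learning agents have achieved unprecedented success in recent years, their learned policies can be brittle, failing to generalize to even slight modifications of their environments or to unfamiliar situations. The black-box nature of neural network learning dynamics makes it impossible to audit trained deep agents and recover from such failures. In this paper, we propose a novel representation and learning approach to capture environment dynamics without using neural networks. It originates from the observation that, in games designed for people, the effect of an action can often be perceived in the form of local changes in consecutive visual observations. Our algorithm is designed to extract such vision-based changes and condense them into a set of action-dependent descriptive rules, which we call "visual rewrite rules" (VRRs). We also present preliminary results from a VRR agent that can explore, expand its rule set, and solve a game by planning with its learned VRR world model. In several classical games, our non-deep agent demonstrates superior performance, extreme sample efficiency, and robust generalization ability compared with several mainstream deep agents.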
translated by Google Translate
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
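The model-based reinforcement learning paradigm, which uses planning algorithms and neural network models, has recently achieved unprecedented results in diverse applications, leading to what is now known as deep reinforcement learning. These agents are quite complex and involve multiple components, factors that can create challenges for research. In this work, we propose a new modular software architecture suited to these types of agents, together with a set of building blocks that can be easily reused and assembled to construct new model-based reinforcement learning agents. These building blocks include planning algorithms, policies, and loss functions. We illustrate the use of this architecture by combining several of these building blocks to implement and test agents tuned to three different test environments: Cartpole, Minigrid, and Tictactoe. One particular planning algorithm, made available in our implementation and not previously used in reinforcement learning, which we call Averaged Minimax, achieved good results in the three test environments. Experiments performed with this architecture show that the best combination of planning algorithm, policy, and loss function is heavily problem dependent. This result provides evidence that the proposed architecture, which is modular and reusable, is useful for reinforcement learning researchers who want to study new environments and techniques.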
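Progress in reinforcement learning (RL) research is often driven by the design of new, challenging environments, a costly undertaking requiring skills orthogonal to those of a typical machine learning researcher. The complexity of environment development has only increased with the rise of procedural content generation (PCG) as the prevailing paradigm for producing varied environments capable of testing the robustness and generalization of RL agents. Moreover, existing environments often require complex build processes, making results difficult to reproduce. To address these issues, we introduce GriddlyJS, a web-based Integrated Development Environment (IDE) built on the Griddly engine. GriddlyJS allows researchers to visually design and debug arbitrary, complex PCG grid-world environments using a convenient graphical interface, and to visualize, evaluate, and record the performance of trained agent models. By connecting the RL workflow to the advanced functionality enabled by modern web standards, GriddlyJS allows publishing interactive agent-environment demos that reproduce experimental results directly on the web. To demonstrate the versatility of GriddlyJS, we use it to quickly develop a complex compositional puzzle-solving environment, alongside arbitrary human-designed environment configurations and their solutions for use in automatic curriculum learning and offline RL. The GriddlyJS IDE is open source and freely available at https://griddly.ai.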
Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.
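To make the Experience Replay (ER) baseline mentioned above concrete, here is a minimal sketch of a uniform replay buffer. It is an illustration under stated assumptions, not the implementation used in the paper: the class name `ReplayBuffer`, the transition layout, and uniform sampling are all hypothetical choices.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples.

    Hypothetical illustration; the abstract does not specify a buffer design.
    """

    def __init__(self, capacity, seed=None):
        # A bounded deque evicts the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

A learned parametric model, as discussed in the abstract, would differ from this buffer by generating transitions that were never observed, whereas the buffer can only replay stored ones.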
Adequately assigning credit to actions for future outcomes based on their contributions is a long-standing open challenge in Reinforcement Learning. The assumptions of the most commonly used credit assignment method are disadvantageous in tasks where the effects of decisions are not immediately evident. Furthermore, this method can only evaluate actions that have been selected by the agent, making it highly inefficient. Still, no alternative methods have been widely adopted in the field. Hindsight Credit Assignment is a promising, but still unexplored candidate, which aims to solve the problems of both long-term and counterfactual credit assignment. In this thesis, we empirically investigate Hindsight Credit Assignment to identify its main benefits, and key points to improve. Then, we apply it to factored state representations, and in particular to state representations based on the causal structure of the environment. In this setting, we propose a variant of Hindsight Credit Assignment that effectively exploits a given causal structure. We show that our modification greatly decreases the workload of Hindsight Credit Assignment, making it more efficient and enabling it to outperform the baseline credit assignment method on various tasks. This opens the way to other methods based on given or learned causal structures.
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
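Humans commonly solve complex problems by decomposing them into easier subproblems and then combining the subproblem solutions. This type of compositional reasoning permits reuse of the subproblem solutions when tackling future tasks that share part of the underlying compositional structure. In a continual or lifelong reinforcement learning (RL) setting, this ability to decompose knowledge into reusable components would enable agents to quickly learn new RL tasks by leveraging accumulated compositional structures. We explore a particular form of composition based on neural modules and present a set of RL problems that intuitively admit compositional solutions. Empirically, we demonstrate that neural composition indeed captures the underlying structure of this space of problems. We further propose a compositional lifelong RL method that leverages accumulated neural components to accelerate the learning of future tasks while retaining performance on previous tasks via offline RL over replayed experiences.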
Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
In recent years, Monte Carlo tree search (MCTS) has achieved widespread adoption within the game community. Its use in conjunction with deep reinforcement learning has produced success stories in many applications. While these approaches have been implemented in various games, from simple board games to more complicated video games such as StarCraft, the use of deep neural networks requires a substantial training period. In this work, we explore on-line adaptivity in MCTS without requiring pre-training. We present MCTS-TD, an adaptive MCTS algorithm improved with temporal difference learning. We demonstrate our new approach on the game miniXCOM, a simplified version of XCOM, a popular commercial franchise consisting of several turn-based tactical games, and show how adaptivity in MCTS-TD allows for improved performances against opponents.
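As background for the temporal difference learning mentioned above, the following is a minimal sketch of a generic tabular TD(0) value update. It is not the MCTS-TD algorithm itself, whose details the abstract does not give; the function name `td0_update`, the dictionary value table, and the default `alpha` and `gamma` are assumptions made for illustration.

```python
def td0_update(values, state, reward, next_state,
               alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular TD(0) update and return the TD error.

    Generic textbook update, shown only as background; not the paper's method.
    `values` maps states to estimates; unseen states default to 0.0.
    """
    # Bootstrap from the next state's estimate unless the episode has ended.
    bootstrap = 0.0 if terminal else values.get(next_state, 0.0)
    target = reward + gamma * bootstrap
    td_error = target - values.get(state, 0.0)
    # Move the current estimate a fraction alpha toward the target.
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error
```

Because the update uses only the latest transition, estimates can be adjusted online during play, which is the property that makes TD learning a natural fit for adaptivity without pre-training.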
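Despite the many recent successes of deep reinforcement learning (RL), its methods remain data inefficient, which makes many problems prohibitively expensive to solve in terms of data. We aim to remedy this by taking advantage of the rich supervisory signal in unlabeled data to learn state representations. This work introduces three different representation learning algorithms that have access to different subsets of the data sources used by traditional RL algorithms: (i) GrICA is inspired by independent component analysis (ICA) and trains a deep neural network to output statistically independent features of the input. GrICA does so by minimizing the mutual information between each feature and the other features. Additionally, GrICA requires only an unsorted collection of environment states. (ii) Latent Representation Prediction (LARP) requires more context: in addition to a state as input, it also needs the previous state and the action that connects them. This method learns state representations by predicting the representation of the environment's next state given the current state and action. The predictor is used with a graph search algorithm. (iii) RewPred learns a state representation by training a deep neural network to learn a smoothed version of the reward function. The representation is used to preprocess inputs to deep RL, while the reward predictor is used for reward shaping. This method needs only state-reward pairs from the environment to learn the representation. We find that each method has its strengths and weaknesses, and conclude from our experiments that including unsupervised representation learning in RL problem-solving pipelines can speed up learning.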
In this article we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. All of the software, including the benchmark agents, is publicly available.
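Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification, especially when the environment's dynamics are only partially known. This paper proposes a novel pipeline for learning non-Markovian task specifications as succinct finite-state "task automata" from episodes of agent experience within unknown environments. We leverage two key algorithmic insights. First, we learn a product MDP, a model composed of the specification's automaton and the environment's MDP, by treating it as a partially observable MDP and using off-the-shelf algorithms for hidden Markov models. Second, we propose a novel method for distilling the task automaton (assumed to be a deterministic finite automaton) from the learned product MDP. Our learned task automaton enables the decomposition of a task into its constituent sub-tasks, which improves the rate at which an RL agent can later synthesize an optimal policy. It also provides an interpretable encoding of high-level environmental and task features, so a human can readily verify that the agent has learned coherent tasks with no misspecifications. In addition, we take steps towards ensuring that the learned automaton is environment-agnostic, making it well suited for use in transfer learning. Finally, we provide experimental results to illustrate our algorithm's performance in different environments and tasks, and its ability to incorporate prior domain knowledge to facilitate more efficient learning.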
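Incorporating prior knowledge in reinforcement learning algorithms is largely an open question. Even when insights about the environment dynamics are available, reinforcement learning is traditionally used in a tabula rasa setting and must explore and learn everything from scratch. In this paper, we consider the problem of exploiting priors about action sequence equivalence: that is, when different sequences of actions produce the same effect. We propose a new local exploration strategy calibrated to minimize collisions and maximize new state visitations. We show that this strategy can be computed at little cost by solving a convex optimization problem. By replacing the usual epsilon-greedy strategy in a DQN, we demonstrate its potential in several environments with various dynamic structures.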
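For reference, the usual epsilon-greedy strategy that this work replaces can be sketched as follows. This shows the baseline only, not the proposed exploration strategy; the function name `epsilon_greedy` and the list-of-Q-values interface are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one.

    Baseline sketch only; `q_values` is a sequence with one estimate per action.
    """
    if rng.random() < epsilon:
        # Explore: every action is equally likely, regardless of its effect.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

The exploratory branch ignores any structure in the action space, so two action sequences with the same effect are sampled as if they were distinct, which is the inefficiency that priors about action sequence equivalence are meant to reduce.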
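Monte Carlo Tree Search (MCTS) is a powerful approach to designing game-playing bots or solving sequential decision problems. The method relies on intelligent tree search that balances exploration and exploitation. MCTS performs random sampling in the form of simulations and stores statistics of actions to make more educated choices in each subsequent iteration. The method has become a state-of-the-art technique for combinatorial games. However, in more complex games (e.g. those with a high branching factor or real-time ones), as well as in various practical domains (e.g. transportation, scheduling, or security), an efficient MCTS application often requires problem-dependent modification or integration with other techniques. Such domain-specific modifications and hybrid approaches are the main focus of this survey. The last major MCTS survey was published in 2012. Contributions that have appeared since its release are of particular interest for this review.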
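One common way to balance exploration and exploitation during tree search is the UCB1 score used in the UCT variant of MCTS. The sketch below assumes UCB1, which the abstract does not name; the function names `ucb1` and `select_child`, the `(total_value, visits)` child layout, and the default constant `c` are illustrative assumptions.

```python
import math


def ucb1(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: mean value plus an exploration bonus (assumed selection rule)."""
    if visits == 0:
        # Unvisited children are tried first.
        return math.inf
    mean = total_value / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Return the index of the child with the highest UCB1 score.

    `children` is a list of (total_value, visits) pairs (hypothetical layout).
    """
    return max(range(len(children)),
               key=lambda i: ucb1(*children[i], parent_visits, c))
```

The bonus term shrinks as a child accumulates visits, so stored statistics gradually shift the search from trying rarely visited actions toward the ones with the best average outcome.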
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
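The study of generalisation in deep reinforcement learning (RL) aims to produce RL algorithms whose policies generalise well to novel, unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real-world scenarios, where the environment will be diverse, dynamic, and unpredictable. This survey is an overview of this nascent field. We provide a unifying formalism and terminology for discussing different generalisation problems, building upon previous works. We go on to categorise existing benchmarks for generalisation, as well as current methods for tackling the generalisation problem. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in generalisation; we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for generalisation; and we recommend building benchmarks in underexplored problem settings such as offline RL generalisation and reward-function variation.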
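Progress in deep reinforcement learning (RL) is heavily driven by the availability of challenging benchmarks used for training agents. However, benchmarks that are widely adopted by the community are not explicitly designed for evaluating specific capabilities of RL methods. While there exist environments for assessing particular open problems in RL (such as exploration, transfer learning, unsupervised environment design, or even language-assisted RL), it is generally difficult to extend these to richer, more complex environments once research goes beyond proof-of-concept results. We present MiniHack, a powerful sandbox framework for easily designing novel RL environments. MiniHack is a one-stop shop for RL experiments, with environments ranging from small rooms to complex, procedurally generated worlds. By leveraging the full set of entities and environment dynamics from NetHack, one of the richest grid-based video games, MiniHack allows designing custom RL testbeds that are fast and convenient to use. With this sandbox framework, novel environments can be designed easily, using either a human-readable description language or a simple Python interface. In addition to a variety of RL tasks and baselines, MiniHack can wrap existing RL benchmarks and provide ways to seamlessly add additional complexity.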
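Reinforcement learning (RL) solves sequential decision-making problems through a trial-and-error process of interacting with the environment. While RL has achieved outstanding success in playing complex video games, making errors is always undesirable in the real world. To improve sample efficiency and thus reduce errors, model-based reinforcement learning (MBRL) is believed to be a promising direction: it builds environment models in which trial and error can take place without real costs. In this survey, we review MBRL with a focus on recent progress in deep RL. For non-tabular environments, there is always a generalization error between the learned environment model and the real environment. It is therefore of great importance to analyze the discrepancy between policy training in the environment model and in the real environment, which in turn guides algorithm design for better model learning, model usage, and policy training. We also discuss recent advances of model-based techniques in other forms of RL, including offline RL, goal-conditioned RL, multi-agent RL, and meta-RL. Moreover, we discuss the applicability and benefits of MBRL in real-world tasks. Finally, we end this survey by discussing the prospects for the future development of MBRL. We think that MBRL has great potential and benefits in real-world applications that have been overlooked, and we hope this survey can attract more research on MBRL.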