In this article we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. All of the software, including the benchmark agents, is publicly available.
translated by 谷歌翻译
Monte Carlo Tree Search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarise the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
translated by 谷歌翻译
蒙特卡洛树搜索(MCT)是设计游戏机器人或解决顺序决策问题的强大方法。该方法依赖于平衡探索和开发的智能树搜索。MCT以模拟的形式进行随机抽样,并存储动作的统计数据,以在每个随后的迭代中做出更有教育的选择。然而,该方法已成为组合游戏的最新技术,但是,在更复杂的游戏(例如那些具有较高的分支因素或实时系列的游戏)以及各种实用领域(例如,运输,日程安排或安全性)有效的MCT应用程序通常需要其与问题有关的修改或与其他技术集成。这种特定领域的修改和混合方法是本调查的主要重点。最后一项主要的MCT调查已于2012年发布。自发布以来出现的贡献特别感兴趣。
translated by 谷歌翻译
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
translated by 谷歌翻译
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games -the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled -our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
translated by 谷歌翻译
街机学习环境(ALE)被提出作为凭经验评估跨越atari 2600游戏的代理的一般性的评估平台。 ALE提供各种具有挑战性的问题,并从深加固学习(RL)社区中汲取了重要的关注。从Deep Q-Networks(DQN)到Agent57,RL代理似乎在ALE中实现了超人的性能。但是,这是这种情况吗?在本文中,为了探讨这个问题,我们首先审查了Atari基准中的当前评估指标,然后揭示了实现超人表现的当前评估标准是不合适的,这使得人类的性能相对于可能的是。为了处理这些问题并促进RL研究的发展,我们提出了一种基于人类世界纪录(HWR)的新型Atari基准,这对最终表现和学习效率提出了RL代理的更高要求。此外,我们总结了ATARI基准中的最先进的(SOTA)方法,并根据人类世界记录提供基准测试新的评估指标。我们得出结论,至少四个开放的挑战阻碍了RL代理商从那些新的基准结果实现超人绩效。最后,我们还讨论了一些有希望的方法来处理这些问题。
translated by 谷歌翻译
2048 is a single-player stochastic puzzle game. This intriguing and addictive game has been popular worldwide and has attracted researchers to develop game-playing programs. Due to its simplicity and complexity, 2048 has become an interesting and challenging platform for evaluating the effectiveness of machine learning methods. This dissertation conducts comprehensive research on reinforcement learning and computer game algorithms for 2048. First, this dissertation proposes optimistic temporal difference learning, which significantly improves the quality of learning by employing optimistic initialization to encourage exploration for 2048. Furthermore, based on this approach, a state-of-the-art program for 2048 is developed, which achieves the highest performance among all learning-based programs, namely an average score of 625377 points and a rate of 72% for reaching 32768-tiles. Second, this dissertation investigates several techniques related to 2048, including the n-tuple network ensemble learning, Monte Carlo tree search, and deep reinforcement learning. These techniques are promising for further improving the performance of the current state-of-the-art program. Finally, this dissertation discusses pedagogical applications related to 2048 by proposing course designs and summarizing the teaching experience. The proposed course designs use 2048-like games as materials for beginners to learn reinforcement learning and computer game algorithms. The courses have been successfully applied to graduate-level students and received well by student feedback.
translated by 谷歌翻译
Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
translated by 谷歌翻译
Deep reinforcement learning is poised to revolutionise the field of AI and represents a step towards building autonomous systems with a higher level understanding of the visual world. Currently, deep learning is enabling reinforcement learning to scale to problems that were previously intractable, such as learning to play video games directly from pixels. Deep reinforcement learning algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of reinforcement learning, then progress to the main streams of value-based and policybased methods. Our survey will cover central algorithms in deep reinforcement learning, including the deep Q-network, trust region policy optimisation, and asynchronous advantage actor-critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via reinforcement learning. To conclude, we describe several current areas of research within the field.
translated by 谷歌翻译
The reinforcement learning paradigm is a popular way to address problems that have only limited environmental feedback, rather than correctly labeled examples, as is common in other machine learning contexts. While significant progress has been made to improve learning in a single task, the idea of transfer learning has only recently been applied to reinforcement learning tasks. The core idea of transfer is that experience gained in learning to perform one task can help improve learning performance in a related, but different, task. In this article we present a framework that classifies transfer learning methods in terms of their capabilities and goals, and then use it to survey the existing literature, as well as to suggest future directions for transfer learning work.
translated by 谷歌翻译
Imitation learning techniques aim to mimic human behavior in a given task. An agent (a learning machine) is trained to perform a task from demonstrations by learning a mapping between observations and actions. The idea of teaching by imitation has been around for many years, however, the field is gaining attention recently due to advances in computing and sensing as well as rising demand for intelligent applications. The paradigm of learning by imitation is gaining popularity because it facilitates teaching complex tasks with minimal expert knowledge of the tasks. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations; without the need for explicit programming or designing reward functions specific to the task. Modern sensors are able to collect and transmit high volumes of data rapidly, and processors with high computational power allow fast processing that maps the sensory data to actions in a timely manner. This opens the door for many potential AI applications that require real-time perception and reaction such as humanoid robots, self-driving vehicles, human computer interaction and computer games to name a few. However, specialized algorithms are needed to effectively and robustly learn models as learning by imitation poses its own set of challenges. In this paper, we survey imitation learning methods and present design options in different steps of the learning process. We introduce a background and motivation for the field as well as highlight challenges specific to the imitation problem. Methods for designing and evaluating imitation learning tasks are categorized and reviewed. Special attention is given to learning methods in robotics and games as these domains are the most popular in the literature and provide a wide array of problems and methodologies. We extensively discuss combining imitation learning approaches using different sources and methods, as well as incorporating other motion learning methods to enhance imitation. We also discuss the potential impact on industry, present major applications and highlight current and future research directions.
translated by 谷歌翻译
Alphazero,Leela Chess Zero和Stockfish Nnue革新了计算机国际象棋。本书对此类引擎的技术内部工作进行了完整的介绍。该书分为四个主要章节 - 不包括第1章(简介)和第6章(结论):第2章引入神经网络,涵盖了所有用于构建深层网络的基本构建块,例如Alphazero使用的网络。内容包括感知器,后传播和梯度下降,分类,回归,多层感知器,矢量化技术,卷积网络,挤压网络,挤压和激发网络,完全连接的网络,批处理归一化和横向归一化和跨性线性单位,残留层,剩余层,过度效果和底漆。第3章介绍了用于国际象棋发动机以及Alphazero使用的经典搜索技术。内容包括minimax,alpha-beta搜索和蒙特卡洛树搜索。第4章展示了现代国际象棋发动机的设计。除了开创性的Alphago,Alphago Zero和Alphazero我们涵盖Leela Chess Zero,Fat Fritz,Fat Fritz 2以及有效更新的神经网络(NNUE)以及MAIA。第5章是关于实施微型α。 Shexapawn是国际象棋的简约版本,被用作为此的示例。 Minimax搜索可以解决六ap峰,并产生了监督学习的培训位置。然后,作为比较,实施了类似Alphazero的训练回路,其中通过自我游戏进行训练与强化学习结合在一起。最后,比较了类似α的培训和监督培训。
translated by 谷歌翻译
深度强化学习(RL)导致了许多最近和开创性的进步。但是,这些进步通常以培训的基础体系结构的规模增加以及用于训练它们的RL算法的复杂性提高,而均以增加规模的成本。这些增长反过来又使研究人员更难迅速原型新想法或复制已发表的RL算法。为了解决这些问题,这项工作描述了ACME,这是一个用于构建新型RL算法的框架,这些框架是专门设计的,用于启用使用简单的模块化组件构建的代理,这些组件可以在各种执行范围内使用。尽管ACME的主要目标是为算法开发提供一个框架,但第二个目标是提供重要或最先进算法的简单参考实现。这些实现既是对我们的设计决策的验证,也是对RL研究中可重复性的重要贡献。在这项工作中,我们描述了ACME内部做出的主要设计决策,并提供了有关如何使用其组件来实施各种算法的进一步详细信息。我们的实验为许多常见和最先进的算法提供了基准,并显示了如何为更大且更复杂的环境扩展这些算法。这突出了ACME的主要优点之一,即它可用于实现大型,分布式的RL算法,这些算法可以以较大的尺度运行,同时仍保持该实现的固有可读性。这项工作提出了第二篇文章的版本,恰好与模块化的增加相吻合,对离线,模仿和从演示算法学习以及作为ACME的一部分实现的各种新代理。
translated by 谷歌翻译
使用规划算法和神经网络模型的基于模型的强化学习范例最近在不同的应用中实现了前所未有的结果,导致现在被称为深度增强学习的内容。这些代理非常复杂,涉及多个组件,可能会为研究产生挑战的因素。在这项工作中,我们提出了一个适用于这些类型代理的新模块化软件架构,以及一组建筑块,可以轻松重复使用和组装,以构建基于模型的增强学习代理。这些构建块包括规划算法,策略和丢失功能。我们通过将多个这些构建块组合实现和测试经过针对三种不同的测试环境的代理来说明这种架构的使用:Cartpole,Minigrid和Tictactoe。在我们的实施中提供的一个特定的规划算法,并且以前没有用于加强学习,我们称之为Imperage Minimax,在三个测试环境中取得了良好的效果。用这种架构进行的实验表明,规划算法,政策和损失函数的最佳组合依赖性严重问题。该结果提供了证据表明,拟议的架构是模块化和可重复使用的,对想要研究新环境和技术的强化学习研究人员有用。
translated by 谷歌翻译
Progress in continual reinforcement learning has been limited due to several barriers to entry: missing code, high compute requirements, and a lack of suitable benchmarks. In this work, we present CORA, a platform for Continual Reinforcement Learning Agents that provides benchmarks, baselines, and metrics in a single code package. The benchmarks we provide are designed to evaluate different aspects of the continual RL challenge, such as catastrophic forgetting, plasticity, ability to generalize, and sample-efficient learning. Three of the benchmarks utilize video game environments (Atari, Procgen, NetHack). The fourth benchmark, CHORES, consists of four different task sequences in a visually realistic home simulator, drawn from a diverse set of task and scene parameters. To compare continual RL methods on these benchmarks, we prepare three metrics in CORA: Continual Evaluation, Isolated Forgetting, and Zero-Shot Forward Transfer. Finally, CORA includes a set of performant, open-source baselines of existing algorithms for researchers to use and expand on. We release CORA and hope that the continual RL community can benefit from our contributions, to accelerate the development of new continual RL algorithms.
translated by 谷歌翻译
With the development of deep representation learning, the domain of reinforcement learning (RL) has become a powerful learning framework now capable of learning complex policies in high dimensional environments. This review summarises deep reinforcement learning (DRL) algorithms and provides a taxonomy of automated driving tasks where (D)RL methods have been employed, while addressing key computational challenges in real world deployment of autonomous driving agents. It also delineates adjacent domains such as behavior cloning, imitation learning, inverse reinforcement learning that are related but are not classical RL algorithms. The role of simulators in training agents, methods to validate, test and robustify existing solutions in RL are discussed.
translated by 谷歌翻译
本文介绍了一种扮演流行的第一人称射击(FPS)视频游戏的AI代理商的AI代理商;来自像素输入的全球攻势(CSGO)。代理人,一个深度神经网络,符合Deathmatch游戏模式内置AI内置AI的媒体难度的性能,同时采用人类的戏剧风格。与在游戏中的许多事先工作不同,CSGO没有API,因此算法必须培训并实时运行。这限制了可以生成的策略数据的数量,妨碍许多增强学习算法。我们的解决方案使用行为克隆 - 在从在线服务器上的人类播放(400万帧,大小与Imagenet相当的400万帧)上刮出的大型嘈杂数据集的行为克隆训练,以及一个较小的高质量专家演示数据集。这种比例是比FPS游戏中的模仿学习的先前工作的数量级。
translated by 谷歌翻译
深度强化学习(RL)的进展是通过用于培训代理商的具有挑战性的基准的可用性来驱动。但是,社区广泛采用的基准未明确设计用于评估RL方法的特定功能。虽然存在用于评估RL的特定打开问题的环境(例如探索,转移学习,无监督环境设计,甚至语言辅助RL),但一旦研究超出证明,通常难以将这些更富有,更复杂的环境 - 概念结果。我们展示了一个强大的沙箱框架,用于易于设计新颖的RL环境。 Minihack是一个停止商店,用于RL实验,环境包括从小房间到复杂的,程序生成的世界。通过利用来自Nethack的全套实体和环境动态,MiniHack是最富有的基网上的视频游戏之一,允许设计快速方便的定制RL测试台。使用这种沙箱框架,可以轻松设计新颖的环境,可以使用人类可读的描述语言或简单的Python接口来设计。除了各种RL任务和基线外,Minihack还可以包装现有的RL基准,并提供无缝添加额外复杂性的方法。
translated by 谷歌翻译
深入学习的强化学习(RL)的结合导致了一系列令人印象深刻的壮举,许多相信(深)RL提供了一般能力的代理。然而,RL代理商的成功往往对培训过程中的设计选择非常敏感,这可能需要繁琐和易于易于的手动调整。这使得利用RL对新问题充满挑战,同时也限制了其全部潜力。在许多其他机器学习领域,AutomL已经示出了可以自动化这样的设计选择,并且在应用于RL时也会产生有希望的初始结果。然而,自动化强化学习(AutorL)不仅涉及Automl的标准应用,而且还包括RL独特的额外挑战,其自然地产生了不同的方法。因此,Autorl已成为RL中的一个重要研究领域,提供来自RNA设计的各种应用中的承诺,以便玩游戏等游戏。鉴于RL中考虑的方法和环境的多样性,在不同的子领域进行了大部分研究,从Meta学习到进化。在这项调查中,我们寻求统一自动的领域,我们提供常见的分类法,详细讨论每个区域并对研究人员来说是一个兴趣的开放问题。
translated by 谷歌翻译