Reinforcement learning in partially observable domains is challenging due to the lack of observable state information. Thankfully, learning offline in a simulator with such state information is often possible. In particular, we propose a method for partially observable reinforcement learning that uses a fully observable policy (which we call a state expert) during offline training to improve online performance. Based on Soft Actor-Critic (SAC), our agent balances performing actions similar to the state expert and getting high returns under partial observability. Our approach can leverage the fully-observable policy for exploration and parts of the domain that are fully observable while still being able to learn under partial observability. On six robotics domains, our method outperforms pure imitation, pure reinforcement learning, the sequential or parallel combination of both types, and a recent state-of-the-art method in the same setting. A successful policy transfer to a physical robot in a manipulation task from pixels shows our approach's practicality in learning interesting policies under partial observability.
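A minimal sketch of how such a balance could look in code, assuming a PyTorch SAC-style agent; the names (`agent_policy`, `expert_policy`, `q_network`) and the KL-based imitation term are illustrative assumptions rather than the authors' exact formulation:

```python
# Hypothetical sketch (not the authors' code): a SAC-style actor update that
# trades off the entropy-regularized return against staying close to a fully
# observable "state expert" policy during offline training.
import torch

def actor_loss(agent_policy, expert_policy, q_network, history, state,
               alpha=0.2, beta=1.0):
    """agent_policy conditions on the observation history (partial observability);
    expert_policy conditions on the true simulator state (offline only)."""
    dist = agent_policy(history)              # e.g. a torch.distributions.Normal
    action = dist.rsample()                   # reparameterized sample for SAC
    log_prob = dist.log_prob(action).sum(-1)

    # Standard SAC term: maximize Q minus the entropy penalty.
    sac_term = alpha * log_prob - q_network(history, action)

    # Imitation term: stay close to the state expert's action distribution.
    with torch.no_grad():
        expert_dist = expert_policy(state)
    imitation_term = torch.distributions.kl_divergence(dist, expert_dist).sum(-1)

    # beta balances reward maximization against imitating the state expert.
    return (sac_term + beta * imitation_term).mean()
```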
Synchronizing decisions across multiple agents in realistic settings is problematic because it requires agents to wait for other agents to terminate and to communicate about that termination. Ideally, agents should instead learn and execute asynchronously. Such asynchronous methods also allow temporally extended actions, which can take different amounts of time depending on the situation and the action being executed. Unfortunately, current policy gradient methods are not applicable in asynchronous settings, as they assume that agents synchronously reason about action selection at every time step. To allow asynchronous learning and decision-making, we formulate a set of asynchronous multi-agent actor-critic methods that let agents directly optimize asynchronous policies in three standard training paradigms: decentralized learning, centralized learning, and centralized training for decentralized execution. Empirical results in a variety of realistic domains (in simulation and on hardware) demonstrate the advantages of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality, asynchronous solutions.
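The asynchronous execution idea can be illustrated with a short, hypothetical sketch: each agent picks a new macro-action only when its current one terminates, so agents never wait for one another. The environment convention (`info["terminated"]`) and the per-macro-action reward accumulation are assumptions for illustration:

```python
# Minimal sketch of asynchronous (macro-action-based) execution: agents reason
# about a new action only when their current macro-action has terminated.
def run_episode(env, policies, max_steps=200):
    obs = env.reset()                       # one observation per agent
    n = len(policies)
    current = [policies[i].select(obs[i]) for i in range(n)]   # macro-actions
    cum_reward = [0.0] * n                  # reward accumulated per macro-action
    trajectories = [[] for _ in range(n)]   # per-agent asynchronous experience

    for _ in range(max_steps):
        obs, rewards, done, info = env.step(current)
        for i in range(n):
            cum_reward[i] += rewards[i]
            # info["terminated"][i] flags whether agent i's macro-action ended
            # at this primitive step (an assumed environment convention).
            if info["terminated"][i] or done:
                trajectories[i].append((obs[i], current[i], cum_reward[i]))
                current[i] = policies[i].select(obs[i])
                cum_reward[i] = 0.0
        if done:
            break
    return trajectories
```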
Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, susceptible to catastrophic forgetting and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.
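A minimal PyTorch sketch of a DTQN-style model, assuming a learned positional embedding and a causal self-attention mask; the exact architecture (embedding sizes, mask handling, Q-values at every history position) is an assumption based on the description above, not the paper's code:

```python
# Sketch: embed each observation in the history, apply a causally masked
# transformer encoder, and output Q-values for every position in the history.
import torch
import torch.nn as nn

class DTQNSketch(nn.Module):
    def __init__(self, obs_dim, num_actions, d_model=64, n_heads=4, n_layers=2,
                 context_len=50):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(context_len, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_head = nn.Linear(d_model, num_actions)

    def forward(self, obs_history):
        # obs_history: (batch, history_len, obs_dim)
        t = obs_history.size(1)
        x = self.embed(obs_history) + self.pos[:t]
        # Causal mask so each position attends only to earlier observations.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=x.device),
                          diagonal=1)
        x = self.encoder(x, mask=mask)
        return self.q_head(x)      # Q-values at every history position
```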
Centralized training for decentralized execution, in which training is done in a centralized, offline manner, has become a popular solution paradigm in multi-agent reinforcement learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one with limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias into the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even when state-based critics introduce no bias, they can still result in larger gradient variance, contrary to common intuition. Finally, we examine the effects of these critics in practice by comparing different forms of centralized critics on a wide range of common benchmarks and detailing how various environmental properties relate to the effectiveness of different critic types.
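The structural difference between the two critic choices is small, which is why the bias and variance analysis matters; the following hypothetical PyTorch sketch shows that both critics plug into the same actor update and differ only in their input:

```python
# Illustrative only: a state-based critic V(s) versus a history-based critic
# V(h) used as a baseline in an otherwise identical policy gradient update.
import torch

def actor_update(policy, critic, optimizer, histories, states, actions, returns,
                 use_state_critic=True):
    critic_input = states if use_state_critic else histories
    with torch.no_grad():
        baseline = critic(critic_input).squeeze(-1)   # V(s) or V(h)
    advantage = returns - baseline
    log_probs = policy(histories).log_prob(actions)   # policy returns a distribution
    loss = -(log_probs * advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```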
Off-policy learning of multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is corrected in a per-decision manner: the estimate is re-weighted by the instantaneous importance sampling (IS) ratio after each action, via eligibility traces. Many important off-policy algorithms, such as Tree Backup and Retrace, rely on this mechanism along with differing protocols for truncating ("cutting") the ratios ("traces") to counteract the excessive variance of the IS estimator. Unfortunately, cutting traces on a per-decision basis is not necessarily efficient; once a trace has been cut based on local information, the effect cannot be reversed later, potentially leading to premature truncation of the estimated returns and slower learning. To motivate efficient off-policy algorithms, we propose a multistep operator that permits arbitrary past-dependent traces. We prove that our operator is convergent for policy evaluation and for optimal control when the target policies become greedy in the limit. Our theorems establish the first convergence guarantees for many existing algorithms, including truncated IS, non-Markov Retrace, and history-dependent TD($\lambda$). Our theoretical results also provide guidance for developing new algorithms that jointly consider multiple past decisions for better credit assignment and faster learning.
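A simplified NumPy sketch of the kind of estimator being discussed: return targets built from TD errors weighted by traces, where a per-decision rule (as in Retrace) cuts each ratio independently, while a past-dependent rule truncates the running product of ratios. The specific trajectory-aware rule shown here is an illustrative assumption, not the paper's operator:

```python
import numpy as np

def multistep_targets(q_sa, exp_q_next, rewards, rhos, gamma=0.99, lam=0.9,
                      trajectory_aware=True):
    """q_sa[t] = Q(s_t, a_t), exp_q_next[t] = E_pi[Q(s_{t+1}, .)], rhos[t] = IS ratio."""
    T = len(rewards)
    td = rewards + gamma * exp_q_next - q_sa          # TD errors delta_t
    c = np.empty(T)                                   # per-step trace factors
    prod = 1.0
    for t in range(T):
        if trajectory_aware:
            # Past-dependent trace: truncate the running *product* of ratios,
            # so one small rho does not irreversibly cut the rest of the trace.
            new_prod = prod * rhos[t]
            c[t] = lam * min(1.0, new_prod) / max(min(1.0, prod), 1e-8)
            prod = new_prod
        else:
            c[t] = lam * min(1.0, rhos[t])            # per-decision (Retrace-style)
    # Backward recursion: G_t = Q(s_t, a_t) + delta_t + gamma * c_{t+1} * (G_{t+1} - Q_{t+1}).
    targets = np.copy(q_sa)
    acc = 0.0
    for t in reversed(range(T)):
        acc = td[t] + gamma * (c[t + 1] * acc if t + 1 < T else 0.0)
        targets[t] = q_sa[t] + acc
    return targets
```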
Return caching is a recent strategy that enables efficient minibatch training with multistep estimators (such as the $\lambda$-return) for deep reinforcement learning. By precomputing return estimates in sequential batches and storing the results in an auxiliary data structure for later sampling, the average computation spent per estimate can be greatly reduced. Still, the efficiency of return caching could be improved, particularly with regard to its large memory usage and repeated data copies. We propose a new data structure, the Virtual Replay Cache (VRC), to address these shortcomings. When learning to play Atari 2600 games, the VRC nearly eliminates the cache memory footprint of DQN($\lambda$) and slightly reduces total training time on our hardware.
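A simplified sketch of return caching itself (not the VRC data structure): $\lambda$-returns for a sequential block of transitions are computed in one backward pass and stored for later minibatch sampling. Variable names and the episode-boundary handling are assumptions:

```python
import numpy as np

def cache_lambda_returns(rewards, values_next, dones, gamma=0.99, lam=0.95):
    """values_next[t] is the bootstrap value estimate for state s_{t+1}."""
    T = len(rewards)
    returns = np.empty(T)
    g = values_next[-1]
    for t in reversed(range(T)):
        if dones[t]:
            g = rewards[t]
        else:
            # Recursive lambda-return: blend the one-step bootstrap with the
            # already-computed return of the next step.
            g = rewards[t] + gamma * ((1 - lam) * values_next[t] + lam * g)
        returns[t] = g
    return returns   # cached once, then sampled many times by the learner
```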
Deep Q-Network (DQN) marked a major milestone for reinforcement learning, demonstrating for the first time that human-level control policies could be learned directly from raw visual inputs via reward maximization. Even years after its introduction, DQN remains highly relevant to the research community, since many of its innovations have been adopted by successor methods. Nevertheless, despite significant hardware advances in the interim, DQN's original Atari 2600 experiments remain costly to replicate in full. This poses an immense barrier to researchers who cannot afford state-of-the-art hardware or who lack access to large-scale cloud computing resources. To facilitate improved access to deep reinforcement learning research, we introduce a DQN implementation that leverages a novel concurrent and synchronized execution framework designed to maximally utilize a heterogeneous CPU-GPU desktop system. With just one NVIDIA GeForce GTX 1080 GPU, our implementation reduces the training time of a 200-million-frame Atari experiment from 25 hours to just 9 hours. The ideas introduced in this paper should be generally applicable to a large number of off-policy deep reinforcement learning methods.
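The concurrency idea can be sketched with standard Python threading, under the assumption that environment stepping (CPU-bound) and minibatch updates (GPU-bound) share a replay buffer; the real implementation's synchronization scheme is more involved than this illustration:

```python
import threading, random, time, collections

replay = collections.deque(maxlen=100_000)
lock = threading.Lock()
stop = threading.Event()

def actor_loop(env, select_action):
    obs = env.reset()
    while not stop.is_set():
        action = select_action(obs)
        next_obs, reward, done, _ = env.step(action)
        with lock:
            replay.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

def learner_loop(train_step, batch_size=32):
    while not stop.is_set():
        with lock:
            ready = len(replay) >= batch_size
            batch = random.sample(replay, batch_size) if ready else None
        if batch is None:
            time.sleep(0.01)     # wait for the actor to fill the buffer
            continue
        train_step(batch)        # GPU forward/backward pass on the minibatch

# threading.Thread(target=actor_loop, args=(env, select_action)).start()
# threading.Thread(target=learner_loop, args=(train_step,)).start()
```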
Policy gradient methods have become popular in multi-agent reinforcement learning, but they suffer from high variance due to environmental stochasticity and exploring agents (i.e., non-stationarity), which can be exacerbated by the difficulty of credit assignment. As a result, a method is needed that not only handles these two issues efficiently but is also robust enough to solve a variety of tasks. To this end, we propose a new multi-agent policy gradient method called Robust Local Advantage (ROLA) Actor-Critic. ROLA allows each agent to learn an individual action-value function as a local critic, while mitigating environment non-stationarity via a novel centralized training approach based on a centralized critic. Using this local critic, each agent computes a baseline to reduce the variance of its policy gradient estimate, which yields an advantage taken in expectation over other agents' action choices and implicitly improves credit assignment. We evaluate ROLA across diverse benchmarks and show its robustness and effectiveness against a number of state-of-the-art multi-agent policy gradient algorithms.
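A hypothetical sketch of the local-advantage computation for discrete actions: the advantage of the taken action under the local critic, with the policy's expectation of that critic as the baseline. How ROLA actually trains the local critic with centralized information is omitted here:

```python
import torch

def local_advantage(local_critic, policy, history, action):
    q_values = local_critic(history)                 # (batch, num_actions)
    probs = policy(history)                          # (batch, num_actions)
    baseline = (probs * q_values).sum(-1)            # E_{a ~ pi_i}[Q_i(h_i, a)]
    q_taken = q_values.gather(-1, action.unsqueeze(-1)).squeeze(-1)
    return q_taken - baseline

def actor_loss(local_critic, policy, history, action):
    adv = local_advantage(local_critic, policy, history, action).detach()
    pi_taken = policy(history).gather(-1, action.unsqueeze(-1)).squeeze(-1)
    return -(torch.log(pi_taken + 1e-8) * adv).mean()
```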
Centralized training for decentralized execution, where agents are trained offline with centralized information but execute online in a decentralized manner, has gained popularity in the multi-agent reinforcement learning community. In particular, actor-critic methods with a centralized critic and decentralized actors are a common instance of this idea. However, the implications of using a centralized critic have not been fully discussed and understood, even though it is the standard choice in many algorithms. We therefore formally analyze centralized and decentralized critic approaches to better understand the implications of the critic choice. Because our theory relies on unrealistic assumptions, we also empirically compare centralized and decentralized critic methods across a wide range of environments to validate our theory and to provide practical advice. We show that there are misconceptions about centralized critics in the current literature, and we show that the centralized critic design is not strictly beneficial: centralized and decentralized critics have different pros and cons that algorithm designers should take into account.
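For concreteness, a minimal sketch of the two critic forms being contrasted, with illustrative module shapes: a decentralized critic conditions only on an agent's local history, while a centralized critic conditions on the joint information; the actors are decentralized in both cases:

```python
import torch.nn as nn

def make_critics(local_dim, joint_dim, hidden=64):
    decentralized_critic = nn.Sequential(       # V_i(h_i): local history only
        nn.Linear(local_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    centralized_critic = nn.Sequential(         # V(h_1, ..., h_n) or V(s): joint info
        nn.Linear(joint_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    return decentralized_critic, centralized_critic
```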
Computational units in artificial neural networks follow a simplified model of biological neurons. In the biological model, the output signal of a neuron runs down the axon, splits following the many branches at its end, and passes identically to all the downstream neurons of the network. Each of the downstream neurons will use its copy of this signal as one of many dendritic inputs, integrate them all, and fire an output if the result is above some threshold. In the artificial neural network, this translates to the nonlinear filtering of the signal being performed in the upstream neuron, meaning that in practice the same activation is shared by all the downstream neurons that use that signal as their input. Dendrites thus play a passive role. We propose a slightly more complex model of the biological neuron, in which dendrites play an active role: the activation at the output of the upstream neuron becomes optional, and instead the signals passing through each dendrite undergo independent nonlinear filterings before the linear combination. We implement this new model as a ReLU computational unit and discuss its biological plausibility. We compare this new computational unit with the standard one and describe it from a geometrical point of view. We provide a Keras implementation of this unit for fully connected and convolutional layers and estimate the resulting change in FLOPs and weights. We then use these layers in ResNet architectures on CIFAR-10, CIFAR-100, Imagenette, and Imagewoof, obtaining performance improvements of up to 1.73% over standard ResNets. Finally, we prove a universal representation theorem for continuous functions on compact sets and show that this new unit has more representational power than its standard counterpart.
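A hypothetical PyTorch sketch of the per-dendrite nonlinearity described above (the paper provides a Keras implementation; the exact parameterization here, including a per-dendrite bias, is an assumption):

```python
import torch
import torch.nn as nn

class DendriticLinear(nn.Module):
    """Applies a ReLU to each weighted input ("dendrite") before summation,
    instead of one shared ReLU after the sum as in a standard unit."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.dendrite_bias = nn.Parameter(torch.zeros(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # x: (batch, in_features). Each dendrite's signal w_ij * x_j + b_ij is
        # filtered by its own ReLU before being summed into the output neuron.
        per_dendrite = x.unsqueeze(1) * self.weight + self.dendrite_bias
        return torch.relu(per_dendrite).sum(-1) + self.bias

# For comparison, the standard unit y = relu(W x + b) applies one shared
# nonlinearity after the sum, so all downstream neurons see the same activation.
```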