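In complex tasks where the reward function is not straightforward and consists of a set of objectives, multiple reinforcement learning (RL) policies that perform the task adequately but employ different strategies can be trained by adjusting the impact of individual objectives on the reward function. Understanding the differences in strategy between policies is necessary to enable users to choose among the offered policies, and it can help developers understand the different behaviors that emerge from various reward functions and training hyperparameters in RL systems. In this work, we compare the behavior of two policies trained on the same task but with different preferences over objectives. We propose a method for distinguishing differences in behavior that stem from different abilities from those that are a consequence of the opposing preferences of two RL agents. Furthermore, we use only the data on preference-based differences to generate contrastive explanations about the agents' preferences. Finally, we test and evaluate our approach on an autonomous driving task and compare the behavior of a safety-oriented policy with that of one that prefers speed.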
Neural approaches have become very popular in question answering (QA), but they require a large amount of annotated data. Furthermore, they often yield very good performance only in the domain they were trained on. In this work, we propose a novel approach that combines data augmentation via question-answer generation with Active Learning to improve performance in low-resource settings, where the target domains are diverse in terms of difficulty and similarity to the source domain. We also investigate Active Learning for question answering at different stages, reducing the overall human annotation effort. For this purpose, we consider target domains in realistic settings, with an extremely small number of annotated samples but many unlabeled documents, which we assume can be obtained with little effort. Additionally, we assume that a sufficient amount of labeled data from the source domain is available. We perform extensive experiments to find the best setup for incorporating domain experts. Our findings show that our approach, in which humans are incorporated as early as possible in the process, boosts performance in the low-resource, domain-specific setting, allowing for question answering systems with low labeling effort in new, specialized domains. The findings further show how the effect of human annotation on QA performance depends on the stage at which it is performed.