Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CheckList testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CheckList approach for exploring and assessing the agent's inferences during tree search. The approach provides users with an interface and a general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game. The results show that the approach is effective in allowing users to identify previously unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instances of it.
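As a rough illustration of the kind of query-rule mechanism described above (a minimal sketch under assumed data structures, not the authors' implementation), the snippet below applies a user-defined rule to the nodes explored during a tree search; the `SearchNode` fields and the example invariance are hypothetical.

```python
# Sketch of a checklist-style query rule over explored tree-search nodes.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SearchNode:
    features: dict            # hypothetical named state features
    value_estimate: float     # value predicted by the learned value function
    children: List["SearchNode"] = field(default_factory=list)

def iter_nodes(root: SearchNode):
    """Yield every node explored by the tree search (depth-first)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)

def query(root: SearchNode,
          selector: Callable[[SearchNode], bool],
          invariant: Callable[[SearchNode], bool]) -> List[SearchNode]:
    """Return explored nodes that match `selector` but violate `invariant`."""
    return [n for n in iter_nodes(root)
            if selector(n) and not invariant(n)]

# Example query rule (hypothetical): among nodes where the agent has lost its
# base, the value estimate should be low.
violations = query(
    root=SearchNode({"own_base_destroyed": True}, value_estimate=0.9),
    selector=lambda n: n.features.get("own_base_destroyed", False),
    invariant=lambda n: n.value_estimate < 0.2,
)
print(f"nodes violating the expected invariance: {len(violations)}")
```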
In this paper, we present DendroMap, a novel approach to interactively exploring large-scale image datasets for machine learning (ML). ML practitioners often explore image datasets by generating a grid of images or by projecting high-dimensional representations of images into 2-D using dimensionality reduction techniques (e.g., t-SNE). However, neither approach effectively scales to large datasets, because images are ineffectively organized and interactions are insufficiently supported. To address these challenges, we develop DendroMap by adapting Treemaps, a well-known visualization technique. DendroMap effectively organizes images by extracting hierarchical cluster structures from high-dimensional representations of images. It enables users to make sense of the overall distribution of a dataset and to interactively zoom into specific areas of interest at multiple levels of abstraction. Our case studies with image datasets widely used in deep learning demonstrate that users can discover insights about datasets and trained models by examining the diversity of images, identifying underperforming subgroups, and analyzing classification errors. We conducted a user study that evaluates the effectiveness of DendroMap in grouping and searching tasks by comparing it with a gridified version of t-SNE, and found that participants preferred DendroMap. DendroMap is available at https://div-lab.github.io/dendromap/.
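A minimal sketch of the organizing step behind such a treemap view, assuming generic image embeddings and standard SciPy hierarchical clustering (not the DendroMap code itself): the cluster tree is extracted and converted into a nested structure that a treemap layout could consume.

```python
# Hierarchical clustering of image embeddings, exposed as a nested hierarchy.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))   # random placeholder embeddings
                                          # (e.g., penultimate-layer features)

# Build the hierarchical clustering tree (Ward linkage on Euclidean distance).
tree = to_tree(linkage(embeddings, method="ward"))

def to_nested(node, max_depth=3, depth=0):
    """Convert a SciPy cluster-tree node into a nested dict of leaf counts."""
    if node.is_leaf() or depth >= max_depth:
        return {"size": node.get_count()}      # images under this cluster
    return {
        "size": node.get_count(),
        "children": [to_nested(node.get_left(), max_depth, depth + 1),
                     to_nested(node.get_right(), max_depth, depth + 1)],
    }

nested = to_nested(tree)
print(nested["size"], len(nested["children"]))  # 200 images, 2 top-level clusters
```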
The pandemic of recent years has led to a dramatic increase in people wearing protective masks in public venues. This poses obvious challenges to the pervasive use of face recognition technology, which is now suffering a decline in performance. One way to address the problem is to revert to face recovery methods as a preprocessing step. Current approaches to face reconstruction and manipulation leverage the ability to model the face manifold, but tend to be generic. We introduce a method that is specific to recovering the face image from an image of the same individual wearing a mask. We do so by designing a specialized GAN inversion method, based on an appropriate set of losses for learning an unmasking encoder. With extensive experiments, we show that the approach is effective at unmasking face images. In addition, we show that the identity information is preserved sufficiently well to improve face verification performance on several face recognition benchmark datasets.
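As a hedged sketch of what learning an unmasking encoder with a set of losses might look like (the `generator`, `identity_net`, and loss weights are placeholders, and this is not the paper's exact loss design):

```python
# Conceptual sketch: the encoder inverts a masked face into the latent space of
# a fixed, pre-trained generator, and the reconstruction is compared against
# the unmasked ground truth. All modules and weights are placeholders.
import torch
import torch.nn.functional as F

def unmasking_loss(encoder, generator, identity_net,
                   masked_img, unmasked_img,
                   w_pix=1.0, w_id=0.1):
    """Pixel reconstruction + identity-preservation loss for one batch."""
    latent = encoder(masked_img)                 # invert the masked face
    recon = generator(latent)                    # reconstruct an unmasked face
    pix_loss = F.l1_loss(recon, unmasked_img)    # match the true unmasked face
    id_loss = 1.0 - F.cosine_similarity(         # keep the same identity
        identity_net(recon), identity_net(unmasked_img), dim=-1).mean()
    return w_pix * pix_loss + w_id * id_loss
```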
Differentiable Search Indices (DSIs) encode a corpus of documents in the parameters of a model and use the same model to map queries directly to relevant document identifiers. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12\%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting by a significant margin. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
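A simplified sketch of the generative-memory idea, under assumed interfaces (`index_model`, `query_generator`, and the batch format are placeholders rather than the DSI++ implementation): pseudo-queries for previously indexed documents are mixed into each incremental indexing batch.

```python
# Mix new-document indexing examples with replayed pseudo-query examples.
import random

def continual_index_step(index_model, query_generator,
                         new_docs, old_doc_ids, replay_ratio=0.5):
    """Build one mixed training batch for incremental indexing."""
    batch = []
    # (1) Indexing examples for the newly arrived documents.
    for doc in new_docs:
        batch.append((doc["text"], doc["doc_id"]))
    # (2) Replay examples: pseudo-queries for a sample of old documents,
    #     drawn from the generative memory (placeholder callable).
    n_replay = int(len(new_docs) * replay_ratio)
    for doc_id in random.sample(old_doc_ids, k=min(n_replay, len(old_doc_ids))):
        pseudo_query = query_generator(doc_id)
        batch.append((pseudo_query, doc_id))
    random.shuffle(batch)
    return index_model.train_on(batch)           # placeholder training call
```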
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
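One development-time check implied by this setup, sketched here with hypothetical scores (not the paper's framework), is correlating an automatic attribution metric with human judgments:

```python
# Compare an automatic attribution metric against human judgments.
from scipy.stats import spearmanr

# Hypothetical per-example scores: 1/0 human judgments of whether the cited
# passage supports the answer, and a continuous automatic metric.
human = [1, 0, 1, 1, 0, 1, 0, 0]
automatic = [0.92, 0.15, 0.71, 0.88, 0.40, 0.66, 0.05, 0.31]

rho, p = spearmanr(human, automatic)
print(f"Spearman correlation between metric and humans: {rho:.2f} (p={p:.3f})")
```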
A clustering termination procedure that is locally adaptive (with respect to the hierarchical tree of sets produced by the agglomerative merging) is proposed for agglomerative hierarchical clustering on a set equipped with a distance function. It represents a multi-scale alternative to conventional scale-dependent, threshold-based termination criteria.
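An illustrative simplification of a locally adaptive termination rule (a sketch for intuition, not the proposed procedure): a merge in the agglomerative tree is accepted only if its distance is comparable to the merge distances directly below it, rather than being compared against a single global threshold. The factor `alpha` is a placeholder parameter.

```python
# Locally adaptive acceptance of merges in an agglomerative tree (illustrative
# rule, not the paper's criterion).
import numpy as np
from scipy.cluster.hierarchy import linkage

def locally_adaptive_labels(X, alpha=1.5):
    Z = linkage(X, method="single")                # agglomerative merge tree
    n = X.shape[0]
    child_dist = np.zeros(n + Z.shape[0])          # merge distance under each node
    parent = {}                                    # accepted-merge parent links
    for i, (a, b, dist, _) in enumerate(Z):
        a, b, node = int(a), int(b), n + i
        local_scale = max(child_dist[a], child_dist[b])
        # Accept the merge only if it is not much larger than the local scale.
        if local_scale == 0 or dist <= alpha * local_scale:
            parent[a] = parent[b] = node
        child_dist[node] = dist
    # Each point's cluster is the top of its chain of accepted merges.
    def root(x):
        while x in parent:
            x = parent[x]
        return x
    return np.array([root(i) for i in range(n)])

labels = locally_adaptive_labels(np.random.default_rng(1).normal(size=(50, 2)))
print(len(set(labels)), "clusters")
```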
Instance-level image retrieval (IIR), or simply instance retrieval, deals with the problem of finding all images within a dataset that contain a query instance (e.g., an object). This paper makes the first attempt to address this problem using instance-discrimination-based contrastive learning (CL). While CL has shown impressive performance on many computer vision tasks, similar success has never been found in the field of IIR. In this work, we approach the problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models. First, we investigate the efficacy of transfer learning for IIR by comparing features learned by pre-trained deep neural network (DNN) classifiers with those learned by CL models. These findings inspire us to propose a new training strategy that optimizes CL towards learning IIR-oriented features, by using an Average Precision (AP) loss together with a fine-tuning method to learn contrastive feature representations tailored to IIR. Our empirical evaluation demonstrates significant performance improvement over off-the-shelf features learned from pre-trained DNN classifiers on the challenging Oxford and Paris datasets.
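For concreteness, a differentiable AP surrogate in the spirit of smooth-AP style losses (an assumed stand-in, not necessarily the exact AP loss used here) can be written as follows; `tau` controls how sharply a sigmoid approximates the 0/1 ranking indicator.

```python
# Assumed smooth-AP style surrogate: approximate ranks with sigmoids, then
# minimize 1 - AP.
import torch

def smooth_ap_loss(scores: torch.Tensor, labels: torch.Tensor, tau: float = 0.01):
    """scores: (N,) similarities to the query; labels: (N,) 1 = relevant."""
    pos = labels.bool()
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)      # diff[i, j] = s_j - s_i
    sig = torch.sigmoid(diff / tau)
    sig = sig - torch.diag(torch.diag(sig))               # ignore self-comparisons
    rank_all = 1.0 + sig.sum(dim=1)                       # approx. rank of each item
    rank_pos = 1.0 + sig[:, pos].sum(dim=1)               # rank among positives only
    ap = (rank_pos[pos] / rank_all[pos]).mean()           # approximate AP
    return 1.0 - ap

# Toy usage: a relevant item scored below an irrelevant one yields a non-zero loss.
scores = torch.tensor([0.9, 0.2, 0.8], requires_grad=True)
labels = torch.tensor([1, 1, 0])
print(smooth_ap_loss(scores, labels).item())
```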
Robots operating at night using conventional vision cameras face significant challenges in reconstruction due to noise-limited images. Previous work has demonstrated that burst-imaging techniques can be used to partially overcome this issue. In this paper, we develop a novel feature detector that operates directly on image bursts, enhancing vision-based reconstruction under extremely low-light conditions. Our approach finds keypoints with well-defined scale and apparent motion within each burst by jointly searching in a multi-scale and multi-motion space. Because we describe these features at a stage where the images have a higher signal-to-noise ratio, the detected features are more accurate than those computed from conventional noisy or merged-burst images and exhibit state-of-the-art matching performance. We show improved feature and camera pose estimation performance, and demonstrate improved structure-from-motion results using our feature detector in challenging light-constrained scenes. Our feature finder provides a significant step towards robots operating in low-light scenarios and applications, including night-time operations.
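A simplified NumPy/SciPy sketch of the joint multi-scale, multi-motion search (a conceptual illustration, not the paper's detector): for each candidate inter-frame motion, the burst is aligned and averaged, raising the signal-to-noise ratio, and blob responses are computed over several scales; the strongest response over motions and scales is kept per pixel.

```python
# Illustration of joint multi-motion alignment and multi-scale blob response.
import numpy as np
from scipy.ndimage import shift as nd_shift, gaussian_laplace

def burst_response(burst, motions, scales):
    """burst: (T, H, W) noisy frames. Returns the best |LoG| response map."""
    T = burst.shape[0]
    best = np.zeros_like(burst[0])
    for dy, dx in motions:                         # candidate per-frame motion
        aligned = [nd_shift(burst[t], (dy * t, dx * t)) for t in range(T)]
        merged = np.mean(aligned, axis=0)          # averaging raises SNR
        for sigma in scales:                       # candidate keypoint scales
            resp = np.abs(gaussian_laplace(merged, sigma))
            best = np.maximum(best, resp)
    return best

rng = np.random.default_rng(0)
burst = rng.normal(0.0, 1.0, size=(8, 64, 64))     # stand-in for a dark burst
resp = burst_response(burst, motions=[(0, 0), (0, 1), (1, 0)], scales=[1, 2, 4])
print(resp.shape)
```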
We present StreamNet, an autoencoder architecture for the analysis of the highly heterogeneous geometry of large collections of white matter streamlines. The proposed framework leverages the geometry-preserving properties of the Wasserstein-1 metric in order to achieve direct encoding and reconstruction of entire bundles of streamlines. We show that the model not only accurately captures the distributional structure of streamlines across the population, but also achieves excellent reconstruction performance between real and synthetic streamlines. Experimental model performance is evaluated on white matter streamlines resulting from T1-weighted diffusion imaging of 40 healthy controls, using state-of-the-art bundle comparison metrics.
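As a small illustration of a Wasserstein-1-style reconstruction term between a streamline and its reconstruction, approximated below with a sliced formulation over random 1-D projections (a stand-in sketch, not the paper's exact loss):

```python
# Stand-in sliced approximation of W1 between two 3-D point sets.
import numpy as np

def sliced_w1(points_a, points_b, n_projections=64, seed=0):
    """Approximate W1 between two equally sized 3-D point sets."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_projections, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    proj_a = points_a @ dirs.T                             # (N, n_projections)
    proj_b = points_b @ dirs.T
    # In 1-D, W1 is the mean absolute difference of sorted samples.
    return np.mean(np.abs(np.sort(proj_a, axis=0) - np.sort(proj_b, axis=0)))

streamline = np.cumsum(np.random.default_rng(1).normal(size=(100, 3)), axis=0)
reconstruction = streamline + 0.05 * np.random.default_rng(2).normal(size=(100, 3))
print(f"sliced W1 reconstruction error: {sliced_w1(streamline, reconstruction):.4f}")
```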
Large language models have demonstrated the ability to condition on and generate both natural language and programming language text. Such models open up the possibility of multi-lingual code generation: can code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We facilitate the exploration of this topic by proposing MultiPL-E, the first multi-language parallel benchmark for natural-language-to-code generation. MultiPL-E extends the HumanEval benchmark (Chen et al., 2021) to support 18 additional programming languages, encompassing a range of programming paradigms and popularity. We evaluate two state-of-the-art code generation models on MultiPL-E: Codex and InCoder. We find that, on several languages, Codex matches or even exceeds its performance on Python. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible. We describe a general approach for easily adding support for new benchmarks and languages.
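A toy sketch of the execution-based evaluation style that such benchmarks rely on (not the actual MultiPL-E harness): a generated completion and its translated unit tests are written to a file and executed, and the exit code decides pass/fail. The hard-coded completion and tests below are placeholders, and the target language is Python only for brevity.

```python
# Execute a candidate program against its tests and report pass/fail.
import subprocess, sys, tempfile, textwrap

def run_candidate(program: str, tests: str, timeout: float = 10.0) -> bool:
    """Return True if the candidate program passes its test suite."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Placeholder "generated" completion and translated tests.
program = textwrap.dedent("""
    def add(a, b):
        return a + b
""")
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
passed = run_candidate(program, tests)
print(f"pass@1 over this single sample: {1.0 if passed else 0.0}")
```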