Recent work in open-domain question answering (ODQA) has shown that adversarial poisoning of the input contexts can cause large drops in accuracy for production systems. However, little to no work has proposed methods to defend against these attacks. To do so, we introduce a new method that uses query augmentation to search for a diverse set of retrieved passages that could answer the original question. We integrate these new passages into the model through the design of a novel confidence method, comparing the predicted answer to its appearance in the retrieved contexts (what we call Confidence from Answer Redundancy, i.e., CAR). Together these methods allow for a simple but effective way to defend against poisoning attacks and provide gains of 5-20% exact match across varying levels of data poisoning.
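The abstract does not spell out how the redundancy signal is computed; as a rough illustration of the idea behind a Confidence from Answer Redundancy score, one could measure how many retrieved passages actually contain the predicted answer. The function name and acceptance threshold below are hypothetical, not the paper's implementation.

```python
from typing import List

def car_confidence(predicted_answer: str, passages: List[str]) -> float:
    """Sketch of a Confidence-from-Answer-Redundancy style score: the fraction
    of retrieved passages whose text contains the predicted answer
    (case-insensitive substring match). Higher values mean the answer is
    corroborated by more of the retrieved evidence."""
    if not passages:
        return 0.0
    answer = predicted_answer.lower().strip()
    hits = sum(1 for p in passages if answer in p.lower())
    return hits / len(passages)

# Illustrative usage: accept the answer only when enough passages agree.
# The 0.3 threshold is a placeholder, not a value from the paper.
passages = [
    "The Eiffel Tower is located in Paris.",
    "Paris hosts the Eiffel Tower.",
    "The tower was completed in 1889.",
]
if car_confidence("Paris", passages) >= 0.3:
    print("answer accepted")
```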
Recent work on open-domain question answering has shown a large gap between model performance on novel test questions and on questions that largely overlap with training questions. However, it remains unclear which aspects of novel questions make them challenging. Drawing on studies of systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training-set overlap, compositional generalization (comp-gen), and novel-entity generalization (novel-entity). When evaluating six popular parametric and non-parametric models, we find that for the established Natural Questions and TriviaQA datasets, even the strongest model performance on comp-gen/novel-entity questions is 13.1/5.4% and 9.6/1.5% lower than on the full test set, indicating the challenge posed by these types of questions. Furthermore, we show that while non-parametric models can handle questions containing novel entities relatively well, they struggle with those requiring compositional generalization. Finally, we find that the key sources of difficulty are cascading errors from the retrieval component, the frequency of question patterns, and the frequency of entities.
Knowledge-intensive tasks, such as open-domain question answering (QA), require access to a large amount of world or domain knowledge. A common approach to knowledge-intensive tasks is the retrieve-then-read pipeline, which first retrieves a handful of relevant contextual documents from an external corpus such as Wikipedia and then predicts an answer conditioned on the retrieved documents. In this paper, we present a new perspective for solving knowledge-intensive tasks by replacing the document retriever with a large language model generator. We call our method generate-then-read (GenRead): it first prompts a large language model to generate contextual documents for a given question, and then reads the generated documents to produce the final answer. In addition, we propose a clustering-based prompting method that selects distinct prompts, yielding generated documents that cover different perspectives and thus better recall of acceptable answers. We conduct extensive experiments on three different knowledge-intensive tasks, including open-domain QA, fact checking, and dialogue systems. Notably, GenRead achieves exact match scores of 71.6 and 54.4 on TriviaQA and WebQ, significantly outperforming the state-of-the-art retrieve-then-read pipeline DPR-FiD by +4.0 and +3.9, without retrieving any documents from any external knowledge source. Finally, we show that model performance can be further improved by combining retrieval and generation.
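The clustering-based prompting step can be pictured as grouping previously generated documents by embedding similarity and building one prompt per cluster, so that the generator is nudged toward different perspectives. The sketch below follows that reading of the abstract; the prompt format and helper arguments are placeholders, not GenRead's actual code.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_diverse_prompts(question: str, seed_docs: list[str],
                          seed_vecs: np.ndarray, k: int = 3) -> list[str]:
    """Cluster embeddings of previously generated documents (seed_vecs has one
    row per document) and build one few-shot prompt per cluster; prompting the
    LLM with each one should yield contexts covering different perspectives."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(seed_vecs)
    prompts = []
    for cluster in range(k):
        examples = [d for d, lbl in zip(seed_docs, labels) if lbl == cluster][:2]
        prompts.append("\n\n".join(examples)
                       + f"\n\nQuestion: {question}\nDocument:")
    return prompts

# Each prompt is then sent to the language model; the union of the generated
# documents is read by the answer model (not shown here).
```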
To address the increasing demands of real-world applications, research on knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality, and noise. To this end, we propose a new setup for evaluating existing KI-NLP tasks in which we generalize the background corpus to a universal web snapshot. We repurpose KILT, a standard KI-NLP benchmark originally developed for Wikipedia, and ask systems to use a subset of CCNet, the Sphere corpus, as their knowledge source. Compared to Wikipedia, Sphere is orders of magnitude larger and better reflects the full breadth of knowledge on the internet. We find that, despite potential coverage gaps, challenges of scale, lack of structure, and lower quality, retrieval from Sphere enables a state-of-the-art retrieve-and-read system to match and even outperform Wikipedia-based models on several KILT tasks, even when we aggressively filter out content that looks like Wikipedia. We also observe that while a single dense passage index can outperform a sparse BM25 baseline on Wikipedia, this is not yet achievable on Sphere. To facilitate further research in this area and minimize the community's reliance on proprietary, black-box search engines, we will share our indices, evaluation metrics, and infrastructure.
Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks. The code and trained models have been released at https://github.com/facebookresearch/DPR.
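At inference time the dual-encoder reduces retrieval to a nearest-neighbour search under inner-product similarity. The snippet below is a minimal sketch of that step with random vectors standing in for the learned question and passage embeddings; the real system builds a FAISS index over the encoder outputs instead of scoring a dense matrix directly.

```python
import numpy as np

def dense_retrieve(query_vec: np.ndarray, passage_matrix: np.ndarray, k: int = 20) -> np.ndarray:
    """Score every passage by the inner product between its embedding and the
    question embedding, then return the indices of the top-k passages.
    passage_matrix has shape (num_passages, dim)."""
    scores = passage_matrix @ query_vec      # inner-product similarity
    return np.argsort(-scores)[:k]           # highest-scoring passages first

# Toy usage with random vectors in place of learned embeddings.
rng = np.random.default_rng(0)
passages = rng.normal(size=(1000, 768))
question = rng.normal(size=768)
print(dense_retrieve(question, passages, k=5))
```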
Knowledge-dependent tasks typically use two sources of knowledge: parametric knowledge, learned at training time, and contextual knowledge, given as a passage at inference time. To understand how models use these sources, we formalize the problem of knowledge conflicts, in which contextual information contradicts learned information. Analyzing the behavior of popular models, we measure their over-reliance on memorized information (a cause of hallucination) and uncover important factors that exacerbate this behavior. Finally, we propose a simple method to mitigate over-reliance on parametric knowledge, which minimizes hallucination and improves out-of-distribution generalization by 4%-7%. Our findings demonstrate the importance of practitioners evaluating a model's tendency to hallucinate rather than read, and show that our mitigation strategy encourages generalization to evolving information (i.e., time-dependent queries). To encourage these practices, we release our framework for generating knowledge conflicts.
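One simple way to realize such a conflict, in the spirit of the released framework, is to substitute the answer entity in the gold passage with another entity of the same type, so the context contradicts what the model memorized during training. The helper below is a deliberately naive sketch of that idea, not the authors' substitution code.

```python
def make_knowledge_conflict(passage: str, original_answer: str, substitute: str) -> str:
    """Replace every mention of the original answer entity in the passage with a
    different entity, creating a context that conflicts with parametric memory."""
    return passage.replace(original_answer, substitute)

# Example: the context now says "Chicago" while the model likely memorized "Honolulu".
print(make_knowledge_conflict(
    "Barack Obama was born in Honolulu, Hawaii.", "Honolulu", "Chicago"))
```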
We introduce RealTime QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). RealTime QA asks about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges the static, conventional assumptions of QA datasets and pursues instantaneous applications. We build strong baseline models on top of large language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this preliminary report presents real-time evaluation results over the past month. Our experimental results show that GPT-3 can often properly update its generated answers based on newly retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when the retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open-domain QA system identify such unanswerable cases and communicate with the user, or even with the retrieval module, to modify the retrieval results? We hope that RealTime QA will spur instantaneous applications of question answering and beyond.
We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high-quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross-sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study.
This paper proposes to tackle open-domain question answering using Wikipedia as the unique knowledge source: the answer to any factoid question is a text span in a Wikipedia article. This task of machine reading at scale combines the challenges of document retrieval (finding the relevant articles) with that of machine comprehension of text (identifying the answer spans from those articles). Our approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs. Our experiments on multiple existing QA datasets indicate that (1) both modules are highly competitive with respect to existing counterparts and (2) multitask learning using distant supervision on their combination is an effective complete system on this challenging task.
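As a rough illustration of the sparse retrieval stage, the sketch below builds a TF-IDF index over word unigrams and bigrams and scores documents by cosine similarity with the question. The original system additionally hashes bigrams into a fixed-size feature space, which is omitted here for brevity, and the toy corpus is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Mount Everest is Earth's highest mountain above sea level.",
    "The Great Wall of China is a series of fortifications.",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
doc_matrix = vectorizer.fit_transform(docs)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents with the highest TF-IDF cosine score."""
    q_vec = vectorizer.transform([question])
    scores = linear_kernel(q_vec, doc_matrix).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:k]]

print(retrieve("Where is the Eiffel Tower located?"))
```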
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
Often, questions provided to open-domain question answering systems are ambiguous. Traditional QA systems that provide a single answer are incapable of answering ambiguous questions, since the question may be interpreted in several ways and may have multiple distinct answers. In this paper, we address multi-answer retrieval, which entails retrieving passages that can capture the majority of the diverse answers to the question. We propose a re-ranking based approach using determinantal point processes with BERT-based kernels. Our method jointly considers query-passage relevance and passage-passage correlation to retrieve passages that are both query-relevant and diverse. Results demonstrate that our re-ranking technique outperforms the state-of-the-art method on the AmbigQA dataset.
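The core of a DPP re-ranker can be sketched as a greedy maximization of the determinant of a kernel that couples relevance and similarity: relevant passages increase the diagonal, while similar passages suppress each other through the off-diagonal terms. The code below is a schematic greedy MAP routine with toy scores; in the paper the relevance and similarity values come from BERT, which is not reproduced here.

```python
import numpy as np

def dpp_greedy_rerank(relevance: np.ndarray, similarity: np.ndarray, k: int) -> list[int]:
    """Greedily pick k passages that (approximately) maximize det(L_S), where
    L = diag(q) @ S @ diag(q) trades off query-passage relevance q against
    passage-passage similarity S; the result favours relevant yet dissimilar passages."""
    L = np.outer(relevance, relevance) * similarity
    selected: list[int] = []
    for _ in range(k):
        best, best_gain = -1, -np.inf
        for i in range(len(relevance)):
            if i in selected:
                continue
            idx = selected + [i]
            gain = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# Toy example: passages 2 and 3 are near-duplicates, so only one of them is kept.
q = np.array([0.9, 0.7, 0.8, 0.79])
S = np.array([[1.0, 0.1, 0.2, 0.2],
              [0.1, 1.0, 0.1, 0.1],
              [0.2, 0.1, 1.0, 0.95],
              [0.2, 0.1, 0.95, 1.0]])
print(dpp_greedy_rerank(q, S, k=3))
```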
Current state-of-the-art generative models for open-domain question answering (ODQA) focus on generating direct answers from unstructured textual information. However, a large amount of world knowledge is stored in structured databases and must be accessed using query languages such as SQL. Moreover, query languages can answer questions that require complex reasoning, while also providing full explainability. In this paper, we propose a hybrid framework that takes both textual and tabular evidence as input and generates either a direct answer or a SQL query, depending on which form can better answer the question. The generated SQL query can then be executed on the associated database to obtain the final answer. To the best of our knowledge, this is the first paper to apply Text2SQL to ODQA tasks. Empirically, we show that on several ODQA datasets the hybrid approach consistently outperforms baseline models that take only homogeneous input, by a large margin. Specifically, we achieve state-of-the-art performance on the OpenSQuAD dataset using a T5-base model. In a detailed analysis, we show that being able to generate structured SQL queries consistently brings gains, especially for questions that require complex reasoning.
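The execution step of such a hybrid system is straightforward to picture: when the generator emits a SQL query, run it against the linked database; otherwise return the generated text directly. The sketch below assumes a SQLite database and an ad hoc 'SQL:' prefix marking query outputs; both are illustrative conventions, not the paper's actual output format.

```python
import sqlite3

def resolve_prediction(generated: str, db_path: str) -> str:
    """If the model output is marked as a SQL query, execute it on the associated
    database and return the first cell of the first row; otherwise treat the
    output as a direct textual answer."""
    text = generated.strip()
    if text.upper().startswith("SQL:"):
        query = text[4:].strip()
        with sqlite3.connect(db_path) as conn:
            row = conn.execute(query).fetchone()
        return str(row[0]) if row else ""
    return text
```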
Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key to such results, as in tasks like question answering and fact checking, massive parameter counts seem to be needed to store that knowledge. Retrieval-augmented models are known to excel at knowledge-intensive tasks without requiring as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge-intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT, and Natural Questions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model despite having 50x fewer parameters.
Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the limitations of relying solely on their parameters to encode a wealth of world knowledge. This paper aims to understand LMs' strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments of 10 models and 4 augmentation methods on PopQA, our new open-domain QA dataset with 14k questions. We find that LMs struggle with less popular factual knowledge, and that scaling fails to appreciably improve memorization of factual knowledge in the tail. We then show that retrieval-augmented LMs largely outperform orders of magnitude larger LMs, while unassisted LMs remain competitive in questions about high-popularity entities. Based on those findings, we devise a simple, yet effective, method for powerful and efficient retrieval-augmented LMs, which retrieves non-parametric memories only when necessary. Experimental results show that this significantly improves models' performance while reducing the inference costs.
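The adaptive-retrieval idea can be sketched as a simple popularity gate: consult non-parametric memory only when the question is about a long-tail entity, and trust the parametric LM otherwise. The callables and the threshold below are placeholders, not the paper's implementation.

```python
from typing import Callable, List, Optional

def adaptive_answer(
    question: str,
    popularity: float,
    retrieve: Callable[[str], List[str]],
    lm_answer: Callable[[str, Optional[List[str]]], str],
    threshold: float = 1e4,
) -> str:
    """Call the retriever only for questions about less popular entities (e.g.,
    popularity measured by page views of the question's subject entity); for
    popular entities, let the language model answer from its parameters alone."""
    if popularity < threshold:
        return lm_answer(question, retrieve(question))
    return lm_answer(question, None)
```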
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG): models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
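The difference between the two formulations is most visible in how the retrieved passage z is marginalized out; the equations below paraphrase them, with $p_\eta$ the retriever and $p_\theta$ the generator.

```latex
% One retrieved passage conditions the entire output sequence.
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \mathrm{top}\text{-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% A different passage may be marginalized over at every generated token.
p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \mathrm{top}\text{-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\,
  p_\theta(y_i \mid x, z, y_{1:i-1})
```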
Retrieval-augmented generation models have shown state-of-the-art performance on many knowledge-intensive NLP tasks, such as open-domain question answering and fact verification. These models are trained to generate the final output given retrieved passages, which can be irrelevant to the original query, leading them to learn spurious cues or to memorize answers. This work introduces a method to incorporate the evidentiality of passages, i.e., whether a passage contains the correct evidence to support the output, into training the generator. We introduce a multi-task learning framework that jointly generates the final output and predicts the evidentiality of each passage, leveraging a new task-agnostic method to obtain silver evidentiality labels for supervision. Our experiments on five datasets across three knowledge-intensive tasks show that our new evidentiality-guided generator significantly outperforms its direct counterpart of the same model size and advances the state of the art on FaVIQ-Ambig. We attribute these improvements to the auxiliary multi-task learning and the silver evidentiality mining technique.
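Schematically, the training objective is a weighted combination of the usual generation loss and a per-passage evidentiality classification loss over the K retrieved passages; the weighting and notation below are a paraphrase of that multi-task setup, not the paper's exact formulation.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{gen}}
  \;+\; \lambda \sum_{j=1}^{K} \mathcal{L}_{\text{evid}}\bigl(\hat{e}_j,\, e_j\bigr),
\qquad e_j \in \{0, 1\} \text{ a silver label for whether passage } j \text{ supports the output.}
```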
We propose an end-to-end training method for retrieval-augmented open-domain question answering systems that combine information from multiple retrieved documents when generating answers. We model the retrieval decision as a latent variable over sets of relevant documents. Since marginalizing over sets of retrieved documents is intractable, we approximate this using an expectation-maximization algorithm. We iteratively estimate the value of our latent variable (the set of relevant documents for a given question) and then use this estimate to update the retriever and reader parameters. We hypothesize that such end-to-end training allows training signals to flow to the reader and then to the retriever better than stage-wise training. This results in a retriever that can select more relevant documents for a question and a reader that is trained on more accurate documents to generate the answer. Experiments on three benchmark datasets show that our proposed method outperforms all existing approaches of comparable size by 2-3% absolute exact match points, achieving new state-of-the-art results. Our results also demonstrate the feasibility of learning to retrieve to improve answer generation without explicit supervision of retrieval decisions.
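In symbols, training marginalizes over the latent set of relevant documents $\mathcal{Z}$ for a question $q$, and the EM procedure alternates between estimating this latent set and updating the retriever and reader. The equation below is a schematic paraphrase of that objective, restricted to subsets of the top-k retrieved documents, not the paper's exact estimator.

```latex
\log p(y \mid q) \;=\; \log \sum_{\mathcal{Z}} p(\mathcal{Z} \mid q)\, p(y \mid q, \mathcal{Z})
\;\approx\; \log \sum_{\mathcal{Z} \subseteq \mathrm{top}\text{-}k(q)} p(\mathcal{Z} \mid q)\, p(y \mid q, \mathcal{Z})
```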
This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple (potentially long) documents. The system, called Visconde, uses a three-step pipeline to perform the task: decompose, retrieve, and aggregate. The first step decomposes the question into simpler questions using a few-shot large language model (LLM). Then, a state-of-the-art search engine is used to retrieve candidate passages from a large collection for each decomposed question. In the final step, we use the LLM in a few-shot setting to aggregate the contents of the passages into the final answer. The system is evaluated on three datasets: IIRC, Qasper, and StrategyQA. Results suggest that current retrievers are the main bottleneck and that readers are already performing at the human level as long as relevant passages are provided. The system is also shown to be more effective when the model is induced to give explanations before answering a question. Code is available at \url{https://github.com/neuralmind-ai/visconde}.
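Structurally, the pipeline is easy to express as three pluggable stages; the sketch below wires them together with callables standing in for the few-shot LLM prompts and the search engine used in the paper (the callables and their signatures are assumptions for illustration).

```python
from typing import Callable, List

def decompose_retrieve_aggregate(
    question: str,
    decompose: Callable[[str], List[str]],        # few-shot LLM: question -> sub-questions
    search: Callable[[str], List[str]],           # search engine: sub-question -> passages
    aggregate: Callable[[str, List[str]], str],   # few-shot LLM: question + evidence -> answer
) -> str:
    """Run the three-step pipeline: decompose the question, retrieve candidate
    passages for each sub-question, then aggregate the evidence into an answer."""
    evidence: List[str] = []
    for sub_question in decompose(question):
        evidence.extend(search(sub_question))
    return aggregate(question, evidence)
```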
Answering natural language questions using information from tables (TableQA) has attracted considerable recent interest. In many applications, tables do not occur in isolation but are embedded in unstructured text. A question is often best answered by matching parts of it to table cell contents or to unstructured text spans, and extracting the answer from either source. This leads to a new space of TextTableQA problems introduced by the HybridQA dataset. Existing adaptations of table representations to transformer-based reading comprehension (RC) architectures fail to tackle the distinct modalities of the two representations within a single system. Training such systems is further challenged by the need for distant supervision. To reduce cognitive burden, training instances typically include only the question and the answer, where the latter may match multiple table rows and text spans. This leads to a noisy multi-instance training regime that involves not only table rows but also spans of linked text. We respond to these challenges by proposing MITQA, a new TextTableQA system that explicitly models the distinct but closely related probability spaces of table row selection and text span selection. Our experiments demonstrate the superiority of our approach over recent baselines. The method currently tops the HybridQA leaderboard with a held-out test set, achieving a 21% absolute improvement in both EM and F1 over previously published results.
Retrieval-augmented in-context learning has emerged as a powerful approach for addressing knowledge-intensive tasks using frozen language models (LM) and retrieval models (RM). Existing work has combined these in simple "retrieve-then-read" pipelines in which the RM retrieves passages that are inserted into the LM prompt. To begin to fully realize the potential of frozen LMs and RMs, we propose Demonstrate-Search-Predict (DSP), a framework that relies on passing natural language texts in sophisticated pipelines between an LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware demonstrations, search for relevant passages, and generate grounded predictions, systematically breaking down problems into small transformations that the LM and RM can handle more reliably. We have written novel DSP programs for answering questions in open-domain, multi-hop, and conversational settings, establishing in early evaluations new state-of-the-art in-context learning results and delivering 37-200%, 8-40%, and 80-290% relative gains against vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous self-ask pipeline, respectively.
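The search stage of such a program can be sketched as an iterative loop in which the LM writes the next query from the question plus the passages gathered so far, and the RM retrieves more passages. The callables below are placeholders, not the DSP framework's actual API.

```python
from typing import Callable, List

def multihop_search(
    question: str,
    generate_query: Callable[[str, List[str]], str],  # LM: question + context -> next query
    retrieve: Callable[[str], List[str]],             # RM: query -> passages
    hops: int = 2,
) -> List[str]:
    """Alternate between LM-written queries and RM retrieval for a fixed number
    of hops, accumulating passages that ground the final prediction."""
    context: List[str] = []
    for _ in range(hops):
        query = generate_query(question, context)
        context.extend(retrieve(query))
    return context
```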