Pre-trained language models have significantly improved the performance of downstream language understanding tasks, including extractive question answering, by providing high-quality contextualized word embeddings. However, training question-answering models still requires large amounts of domain-specific annotated data. In this work, we propose RGX, a cooperative self-training framework for automatically generating more non-trivial question-answer pairs to improve model performance. RGX is built upon a masked answer extraction task with an interactive learning environment containing an answer entity Recognizer, a question Generator, and an answer eXtractor. Given a passage with a masked entity, the generator produces a question around the entity, and the extractor is trained to extract the masked entity given the generated question and the raw text. The framework allows question generation and answering models to be trained on any text corpus without annotation. Experimental results show that RGX outperforms state-of-the-art (SOTA) pre-trained language models and transfer learning approaches on standard question-answering benchmarks, and yields new SOTA performance under the given model size and transfer learning settings.
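The generate-then-extract loop at the heart of RGX lends itself to a compact sketch. The snippet below illustrates one round under stated assumptions: the checkpoint names and the `<hl>` highlighting convention are stand-ins for illustration, not the authors' exact setup.

```python
# One RGX-style round: take a recognized answer span, generate a question about
# it, then check that the extractor recovers the span (a self-training filter).
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")     # Generator (assumed checkpoint)
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")  # eXtractor (assumed checkpoint)

passage = "Marie Curie won the Nobel Prize in Physics in 1903."
answer = "1903"  # in RGX, this span would come from the answer entity Recognizer

# Highlight the chosen entity so the generator asks a question about it.
highlighted = passage.replace(answer, f"<hl> {answer} <hl>")
question = qg(f"generate question: {highlighted}")[0]["generated_text"]

# Keep the synthetic pair only if the extractor recovers the masked entity.
prediction = qa(question=question, context=passage)
if prediction["answer"] == answer:
    print("keep:", question, "->", answer)
```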
Question answering (QA) is one of the most challenging and most widely investigated problems in natural language processing (NLP). QA systems attempt to produce answers to given questions; these answers can be generated from unstructured or structured text. QA is therefore considered an important research area for evaluating text-understanding systems. A large volume of QA research has been devoted to the English language, investigating the most advanced techniques and achieving state-of-the-art results. However, research on Arabic question answering has progressed at a much slower pace, due to the scarcity of research efforts in Arabic QA and the lack of large benchmark datasets. Recently, many pre-trained language models have delivered high performance on many Arabic NLP problems. In this work, we evaluate state-of-the-art pre-trained transformer models for Arabic QA using four reading comprehension datasets: Arabic-SQuAD, ARCD, AQAD, and TyDiQA-GoldP. We fine-tune and compare the performance of the AraBERTv2-base, AraBERTv0.2-large, and AraELECTRA models. Finally, we provide an analysis to understand and interpret the low performance obtained by some of the models.
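As a rough illustration of this evaluation setup, the sketch below loads one of the compared checkpoints for extractive QA. The hub id for AraBERTv2-base is our assumption, and the freshly added QA head would first need fine-tuning on one of the SQuAD-format Arabic datasets before its predictions mean anything.

```python
# Hedged sketch: extractive QA with an Arabic transformer. The QA head is
# randomly initialized here and must be fine-tuned (e.g., on Arabic-SQuAD) first.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "aubmindlab/bert-base-arabertv2"  # AraBERTv2-base (assumed hub id)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "ما هي عاصمة فرنسا؟"     # "What is the capital of France?"
context = "باريس هي عاصمة فرنسا."   # "Paris is the capital of France."

inputs = tokenizer(question, context, truncation="only_second", return_tensors="pt")
outputs = model(**inputs)
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```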
Existing metrics for evaluating the quality of automatically generated questions such as BLEU, ROUGE, BERTScore, and BLEURT compare the reference and predicted questions, providing a high score when there is a considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering and a span scorer module, in which we use pre-trained models from the existing literature, and therefore, our metric can be used without further training. We show that RQUGE has a higher correlation with human judgment without relying on the reference question. RQUGE is shown to be significantly more robust to several adversarial corruptions. Additionally, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on the synthetic data generated by a question generation model and re-ranked by RQUGE.
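Conceptually, the metric answers the candidate question with a QA model and scores the predicted span against the gold answer, with no reference question involved. A minimal reimplementation of that idea follows; token-level F1 stands in for RQUGE's learned span scorer, and the QA checkpoint is an assumption.

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")  # assumed checkpoint

def span_f1(pred: str, gold: str) -> float:
    """Token-level F1, a crude stand-in for RQUGE's learned span scorer."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum(min(p.count(t), g.count(t)) for t in set(p))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def rquge_like(candidate_question: str, context: str, gold_answer: str) -> float:
    predicted = qa(question=candidate_question, context=context)["answer"]
    return span_f1(predicted, gold_answer)  # no reference question needed

print(rquge_like("When was the Eiffel Tower completed?",
                 "The Eiffel Tower was completed in 1889.", "1889"))
```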
Most previous unsupervised domain adaptation (UDA) methods for question answering (QA) require access to source domain data while fine-tuning the model for the target domain. Source domain data may, however, contain sensitive information and may be restricted. In this study, we investigate a more challenging setting, source-free UDA, in which we have only the pretrained source model and target domain data, without access to source domain data. We propose a novel self-training approach to QA models that integrates a unique mask module for domain adaptation. The mask is auto-adjusted to extract key domain knowledge while trained on the source domain. To maintain previously learned domain knowledge, certain mask weights are frozen during adaptation, while other weights are adjusted to mitigate domain shifts with pseudo-labeled samples generated in the target domain. Our empirical results on four benchmark datasets suggest that our approach significantly enhances the performance of pretrained QA models on the target domain, and even outperforms models that have access to the source data during adaptation.
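The freeze-some/adapt-some mechanism can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption on our part (the mask parameterization and the magnitude heuristic for choosing which entries to freeze), not the paper's implementation.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose weights are modulated by a learnable mask."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.02)
        self.mask = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight * torch.sigmoid(self.mask)).T

layer = MaskedLinear(8)

# Freeze the half of the mask deemed most important (magnitude heuristic is an
# assumption); the rest keeps adapting on pseudo-labeled target samples.
with torch.no_grad():
    frozen = layer.mask.abs() >= layer.mask.abs().median()

layer.mask.register_hook(lambda grad: grad.masked_fill(frozen, 0.0))

out = layer(torch.randn(4, 8))
out.sum().backward()                          # gradients flow only to unfrozen entries
print((layer.mask.grad == 0).float().mean())  # roughly half the entries stay fixed
```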
Entities, as important carriers of real-world knowledge, play a key role in many NLP tasks. We focus on incorporating entity knowledge into an encoder-decoder framework for informative text generation. Existing approaches tried to index, retrieve, and read external documents as evidence, but they suffered from a large computational overhead. In this work, we propose an encoder-decoder framework with an entity memory, namely EDMem. The entity knowledge is stored in the memory as latent representations, and the memory is pre-trained on Wikipedia along with encoder-decoder parameters. To precisely generate entity names, we design three decoding methods to constrain entity generation by linking entities in the memory. EDMem is a unified framework that can be used on various entity-intensive question answering and generation tasks. Extensive experimental results show that EDMem outperforms both memory-based auto-encoder models and non-memory encoder-decoder models.
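The core data structure is a table of latent entity embeddings that the encoder-decoder attends over instead of retrieving and reading external documents. A toy version of such a lookup, with assumed sizes and a single attention head (the real EDMem memory access is more elaborate), might look like this:

```python
import torch
import torch.nn.functional as F

num_entities, d_model = 1000, 256
entity_memory = torch.randn(num_entities, d_model)  # pre-trained latent entity slots

def attend_entities(hidden: torch.Tensor) -> torch.Tensor:
    scores = hidden @ entity_memory.T / d_model ** 0.5  # (batch, num_entities)
    weights = F.softmax(scores, dim=-1)                 # soft lookup over entities
    return weights @ entity_memory                      # knowledge-enriched vector

state = torch.randn(2, d_model)            # e.g., decoder hidden states
enriched = state + attend_entities(state)  # residual fusion with entity knowledge
print(enriched.shape)
```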
Supervised Question Answering systems (QA systems) rely on domain-specific human-labeled data for training. Unsupervised QA systems generate their own question-answer training pairs, typically using secondary knowledge sources to achieve this outcome. Our approach (called PIE-QG) uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages and uses the question-answer pairs as training data for a language model for a state-of-the-art QA system based on BERT. Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed with subjects (or objects) and predicates while objects (or subjects) are considered as answers. Experimenting on five extractive QA datasets demonstrates that our technique achieves on-par performance with existing state-of-the-art QA systems with the benefit of being trained on an order of magnitude fewer documents and without any recourse to external reference data sources.
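The triple-to-question step is easy to illustrate. The templates below are hypothetical stand-ins (PIE-QG's actual question-formation rules differ in detail) and assume a past-participle predicate:

```python
from typing import Iterator, NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str   # assumed past-participle phrase, e.g. "discovered"
    object: str

def make_qa_pairs(t: Triple) -> Iterator[tuple[str, str]]:
    yield f"Who {t.predicate} {t.object}?", t.subject          # subject is the answer
    yield f"What was {t.predicate} by {t.subject}?", t.object  # object is the answer

for question, answer in make_qa_pairs(Triple("Marie Curie", "discovered", "radium")):
    print(question, "->", answer)
```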
Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key to such results, as in tasks like question answering and fact checking, massive parameter counts seem to be needed to store that knowledge. Retrieval-augmented models are known to excel at knowledge-intensive tasks without requiring as many parameters, but it is unclear whether they work in few-shot settings. In this work, we present Atlas, a carefully designed and pre-trained retrieval-augmented language model able to learn knowledge-intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT, and Natural Questions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B-parameter model despite having 50x fewer parameters.
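The retrieve-then-read pattern behind Atlas can be mimicked with commodity parts. In the sketch below, TF-IDF retrieval and a generic instruction-tuned T5 checkpoint stand in for Atlas's dense retriever and fusion-in-decoder reader (both substitutions are ours); it also shows why the index can be updated independently of the model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
]
vectorizer = TfidfVectorizer().fit(documents)
doc_vectors = vectorizer.transform(documents)
reader = pipeline("text2text-generation", model="google/flan-t5-small")  # stand-in reader

def answer(question: str) -> str:
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    passage = documents[scores.argmax()]  # editing `documents` updates the knowledge
    return reader(f"question: {question} context: {passage}")[0]["generated_text"]

print(answer("When was the Eiffel Tower completed?"))
```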
Biomedical machine reading comprehension (biomedical-MRC) aims to comprehend complex biomedical narratives and assist healthcare professionals in retrieving information from them. The high performance of modern neural-network-based MRC systems depends on high-quality, large-scale, human-annotated training datasets. In the biomedical domain, a crucial challenge in creating such datasets is the requirement for domain knowledge, which induces scarcity of labeled data and the need for transfer learning from a labeled general-purpose (source) domain to the biomedical (target) domain. However, there is a discrepancy in the marginal distributions between the general-purpose and biomedical domains due to topic variance. Therefore, directly transferring representations learned by a model trained on a general-purpose domain to the biomedical domain can hurt the model's performance. We present an adversarial-learning-based domain adaptation framework for the biomedical machine reading comprehension task (BioADAPT-MRC), a neural-network-based method that addresses the discrepancies in the marginal distributions between the general and biomedical domain datasets. BioADAPT-MRC relaxes the need for generating pseudo-labels to train a well-performing biomedical-MRC model. We extensively evaluate the performance of BioADAPT-MRC by comparing it with the best existing methods on three widely used benchmark biomedical-MRC datasets: BioASQ-7b, BioASQ-8b, and BioASQ-9b. Our results suggest that, without using any synthetic or human-annotated data from the biomedical domain, BioADAPT-MRC achieves state-of-the-art performance on these datasets. Availability: BioADAPT-MRC is freely available as an open-source project at https://github.com/mmahbub/bioadapt-mrc.
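The adversarial ingredient is the classic gradient reversal trick: a domain discriminator tries to tell source from biomedical features, while the reversed gradient pushes the encoder toward domain-invariant representations. The sketch below shows the generic DANN-style mechanism, not the exact BioADAPT-MRC architecture.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # flip the gradient sign

features = torch.randn(4, 16, requires_grad=True)  # encoder outputs (toy)
discriminator = torch.nn.Linear(16, 2)             # predicts source vs. biomedical
domain_labels = torch.tensor([0, 0, 1, 1])

logits = discriminator(GradReverse.apply(features, 1.0))
loss = torch.nn.functional.cross_entropy(logits, domain_labels)
loss.backward()  # encoder gradients now *oppose* domain separability
```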
Unsupervised question answering is an attractive task due to its independence from labeled data. Previous works usually use heuristic rules together with pre-trained models to construct data and train QA models. However, most of these works regard named entities (NEs) as the only answer type, which ignores the high diversity of answers in the real world. To address this problem, we propose a novel unsupervised method that diversifies answers, named DiverseQA. Specifically, the proposed method is composed of three modules: data construction, data augmentation, and a denoising filter. First, the data construction module extends an extracted named entity to a longer sentence constituent, which serves as the new answer span, to construct a QA dataset with diverse answers. Second, the data augmentation module adopts an answer-type-dependent data augmentation process via adversarial training at the embedding level. Third, the denoising filter module is designed to alleviate the noise in the constructed data. Extensive experiments show that the proposed method outperforms previous unsupervised models on five benchmark datasets, including SQuADv1.1, NewsQA, TriviaQA, BioASQ, and DuoRC. Moreover, the proposed method shows strong performance in the few-shot learning setting.
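The answer-extension step of the data construction module can be approximated with an off-the-shelf parser. Below, the enclosing noun chunk from spaCy stands in for the longer sentence constituent; this is our simplification, as the paper's constituent selection is more general.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The theory of relativity was developed by Albert Einstein in 1915.")

for ent in doc.ents:
    # Take the enclosing noun chunk if one exists, else keep the entity itself.
    extended = next((chunk for chunk in doc.noun_chunks
                     if chunk.start <= ent.start and ent.end <= chunk.end), ent)
    print(f"{ent.text!r} -> {extended.text!r}")
```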
Open-Domain Question Answering (ODQA) requires models to answer factoid questions with no context given. The common way to approach this task is to train models on a large-scale annotated dataset to retrieve related documents and generate answers based on these documents. In this paper, we show that the ODQA architecture can be dramatically simplified by treating Large Language Models (LLMs) as a knowledge corpus, and propose a Self-Prompting framework for LLMs to perform ODQA that eliminates the need for training data and an external knowledge corpus. Concretely, we first generate multiple pseudo QA pairs with background passages and one-sentence explanations for these QAs by prompting LLMs step by step, and then leverage the generated QA pairs for in-context learning. Experimental results show our method surpasses previous state-of-the-art methods by +8.8 EM on average on three widely used ODQA datasets, and even achieves comparable performance with several retrieval-augmented fine-tuned models.
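A scaled-down version of the two-step recipe can be written against any text-generation model. Here a small instruction-tuned model and hand-written prompts stand in for the large LLM and the paper's actual prompting chain; both are assumptions for illustration.

```python
from transformers import pipeline

llm = pipeline("text2text-generation", model="google/flan-t5-base")  # stand-in LLM

def generate(prompt: str) -> str:
    return llm(prompt, max_new_tokens=64)[0]["generated_text"]

# Step 1: self-generate a pseudo QA pair (done for many topics in the paper).
passage = generate("Write a short passage about the Eiffel Tower.")
question = generate(f"Write a factoid question answered by this passage:\n{passage}")
answer = generate(f"Passage: {passage}\nQuestion: {question}\nShort answer:")

# Step 2: use the pseudo pair as an in-context demonstration for a test question.
prompt = f"Q: {question}\nA: {answer}\n\nQ: Who wrote Hamlet?\nA:"
print(generate(prompt))
```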
Question answering (QA) has demonstrated impressive progress in answering questions from customized domains. Nevertheless, domain adaptation remains one of the most elusive challenges for QA systems, especially when QA systems are trained on a source domain but deployed in a different target domain. In this work, we investigate the potential benefits of question classification for QA domain adaptation. We propose a novel framework: Question Classification for Question Answering (QC4QA). Specifically, a question classifier is adopted to assign question classes to both the source and target data. Then, we perform joint training in a self-supervised fashion via pseudo-labeling. For optimization, the inter-domain discrepancy between the source and target domains is reduced via the maximum mean discrepancy (MMD) distance. We additionally minimize the intra-class discrepancy among QA samples of the same question class for fine-grained adaptation performance. To the best of our knowledge, this is the first work in QA domain adaptation to leverage question classification with self-supervised adaptation. We demonstrate the effectiveness of the proposed QC4QA with consistent improvements against state-of-the-art baselines on multiple datasets.
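The MMD term that aligns the two domains is short enough to state directly. A common RBF-kernel estimator follows (the bandwidth and the biased form are our assumptions); during joint training it would be added to the QA loss.

```python
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased RBF-kernel MMD estimate between two feature batches."""
    def kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        dists = torch.cdist(a, b).pow(2)
        return torch.exp(-dists / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

source = torch.randn(32, 128)        # source-domain QA features
target = torch.randn(32, 128) + 0.5  # shifted target-domain features
print(mmd_rbf(source, target))       # added to the QA loss during training
```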
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper (https://github.com/asahi417/lm-question-generation), which are also available as a demo (https://autoqg.net/).
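Unifying QA datasets into a QG setting amounts to inverting each record: the paragraph with the answer highlighted becomes the input and the question becomes the target. A sketch of that conversion, assuming the common `<hl>` highlighting convention from the QG literature:

```python
def qa_to_qg(record: dict) -> dict:
    """Invert a SQuAD-style QA record into a question-generation example."""
    context, answer = record["context"], record["answer"]
    highlighted = context.replace(answer, f"<hl> {answer} <hl>", 1)
    return {"input": f"generate question: {highlighted}",
            "target": record["question"]}

print(qa_to_qg({"context": "Paris is the capital of France.",
                "answer": "Paris",
                "question": "What is the capital of France?"}))
```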
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
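The bidirectional conditioning is easiest to see through the masked-LM objective itself: the model predicts a hidden token from both its left and right context. A short demonstration with the published checkpoint:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
# Both "The capital of France is" (left) and "." (right) condition the prediction.
for prediction in fill("The capital of France is [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```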
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. Our code and pre-trained models are available at https://github.com/facebookresearch/SpanBERT.
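Ingredient (1) is simple to reproduce: span lengths are drawn from a geometric distribution (p = 0.2, clipped at 10 tokens, following the paper) and a contiguous run is masked; the boundary tokens on either side of the run are then what the span boundary objective conditions on. A sketch of the sampling step, with whitespace tokenization as our simplification:

```python
import numpy as np

def mask_span(tokens: list[str], p: float = 0.2, max_len: int = 10,
              mask: str = "[MASK]") -> tuple[list[str], tuple[int, int]]:
    length = min(np.random.geometric(p), max_len, len(tokens))
    start = np.random.randint(0, len(tokens) - length + 1)
    masked = list(tokens)
    masked[start:start + length] = [mask] * length
    # The span boundary objective (SBO) predicts each masked token from the
    # boundary representations at start-1 and start+length plus a position embedding.
    return masked, (start, start + length)

tokens = "an American football game to determine the champion of the league".split()
print(mask_span(tokens))
```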
Recent works on open-domain question answering reference an external knowledge base with a retriever model, optionally rerank passages with a separate reranker model, and generate an answer with yet another reader model. Despite performing related tasks, these models have separate parameters and are only weakly coupled during training. In this work, we propose casting the retriever and the reranker as hard-attention mechanisms applied sequentially within the transformer architecture, feeding the resulting computed representations to the reader. In this singular model architecture, the hidden representations are progressively refined from the retriever to the reranker to the reader, which uses model capacity more effectively and also leads to better gradient flow when trained in an end-to-end manner. We also propose a pre-training methodology to effectively train this architecture. We evaluate our model on the Natural Questions and TriviaQA open datasets; for a fixed parameter budget, our model outperforms the previous state-of-the-art model by 1.0 and 0.7 exact-match points, respectively.
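The hard-attention idea can be caricatured in a few tensor operations: score passage representations inside the model, keep only the top-k (retrieval), rescore the survivors (reranking), and pass those hidden states onward to the reader. Shapes and scoring heads below are illustrative assumptions, not the paper's architecture.

```python
import torch

hidden = torch.randn(100, 256)           # one representation per candidate passage
retriever_head = torch.nn.Linear(256, 1)
reranker_head = torch.nn.Linear(256, 1)

top = retriever_head(hidden).squeeze(-1).topk(16).indices  # hard attention (retrieve)
survivors = hidden[top]
best = reranker_head(survivors).squeeze(-1).topk(4).indices  # hard attention (rerank)
reader_input = survivors[best]            # refined states handed to the reader
print(reader_input.shape)                 # torch.Size([4, 256])
```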
Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks. The code and trained models have been released at https://github.com/facebookresearch/DPR.
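The dual-encoder scoring is directly runnable with the released checkpoints: one BERT encodes the question, another encodes passages, and relevance is their inner product. (At scale, the passage vectors are pre-computed once and indexed, e.g. with FAISS; the toy corpus here is ours.)

```python
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_name = "facebook/dpr-question_encoder-single-nq-base"
c_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_tok, q_enc = DPRQuestionEncoderTokenizer.from_pretrained(q_name), DPRQuestionEncoder.from_pretrained(q_name)
c_tok, c_enc = DPRContextEncoderTokenizer.from_pretrained(c_name), DPRContextEncoder.from_pretrained(c_name)

q = q_enc(**q_tok("Where is the Eiffel Tower?", return_tensors="pt")).pooler_output
passages = ["The Eiffel Tower is in Paris.", "Everest is in the Himalayas."]
c = c_enc(**c_tok(passages, padding=True, return_tensors="pt")).pooler_output
print(q @ c.T)  # dot-product relevance scores; argmax is the retrieved passage
```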
Neural approaches have become very popular in the domain of Question Answering; however, they require a large amount of annotated data. Furthermore, they often yield very good performance but only in the domain they were trained on. In this work we propose a novel approach that combines data augmentation via question-answer generation with Active Learning to improve performance in low-resource settings, where the target domains are diverse in terms of difficulty and similarity to the source domain. We also investigate Active Learning for question answering at different stages, overall reducing the annotation effort of humans. For this purpose, we consider target domains in realistic settings, with an extremely low amount of annotated samples but with many unlabeled documents, which we assume can be obtained with little effort. Additionally, we assume a sufficient amount of labeled data from the source domain is available. We perform extensive experiments to find the best setup for incorporating domain experts. Our findings show that our novel approach, where humans are incorporated as early as possible in the process, boosts performance in the low-resource, domain-specific setting, allowing for low-labeling-effort question answering systems in new, specialized domains. They further demonstrate how human annotation affects the performance of QA depending on the stage at which it is performed.
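One concrete way to decide which unlabeled documents go to the annotators first is margin-based uncertainty sampling. The scoring heuristic below (the gap between the best and second-best start-logit probabilities) is our illustrative assumption, not the paper's exact acquisition function.

```python
import torch

def margin(start_logits: torch.Tensor) -> float:
    """Best-vs-second-best probability gap; a small margin means high uncertainty."""
    top2 = start_logits.softmax(-1).topk(2).values
    return (top2[0] - top2[1]).item()

# Toy unlabeled pool: per-document start logits from the current QA model.
pool = {f"doc{i}": torch.randn(50) for i in range(100)}
to_annotate = sorted(pool, key=lambda k: margin(pool[k]))[:10]
print(to_annotate)  # the ten least certain examples go to the domain expert
```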
Clinical question answering (QA) aims to automatically answer questions from medical professionals based on clinical texts. Studies show that neural QA models trained on one corpus may not generalize well to new clinical texts from a different institute or a different patient group, where large-scale QA pairs are not readily available for model retraining. To address this challenge, we propose a simple yet effective framework, CliniQG4QA, which leverages question generation (QG) to synthesize QA pairs on new clinical contexts and boosts QA models without requiring manual annotations. In order to generate the diverse types of questions that are essential for training QA models, we further introduce a seq2seq-based question phrase prediction (QPP) module that can be used together with most existing QG models to diversify their generation. Our comprehensive experimental results show that the QA corpus generated by our framework can improve QA models on the new contexts (up to an 8% absolute gain in terms of Exact Match), and that the QPP module plays a crucial role in achieving the gain.
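The effect of QPP is to fix the leading question phrase before generation, so that one clinical context yields several question types. A toy version with a fixed phrase inventory follows; in CliniQG4QA the phrase is predicted by a seq2seq module, and the prompt format here is our assumption.

```python
QUESTION_PHRASES = ["what", "when", "how long", "why", "which"]  # toy inventory

def diversified_qg_inputs(context: str, answer: str):
    # In CliniQG4QA, a seq2seq QPP module predicts the phrase; here we enumerate.
    for phrase in QUESTION_PHRASES:
        yield f"answer: {answer} phrase: {phrase} context: {context}"

for prompt in diversified_qg_inputs("The patient was started on warfarin.", "warfarin"):
    print(prompt)  # each prompt steers the QG model toward a different question type
```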
Dense retrievers for open-domain question answering have been shown to achieve impressive performance by training on large datasets of question-passage pairs. We investigate whether dense retrieval can be learned in a self-supervised fashion and applied effectively without any annotations. We observe that existing pretrained models for retrieval struggle in this scenario, and propose a new pretraining scheme designed for retrieval: recurring span retrieval. We use spans that recur across passages in a document to create pseudo-examples for contrastive learning. The resulting model -- Spider -- performs surprisingly well without any examples on a wide range of ODQA datasets, and is competitive with BM25, a strong sparse baseline. Moreover, Spider often outperforms strong baselines such as DPR trained on questions from other datasets. A hybrid retriever combining Spider with BM25 improves over both components on all datasets, and is often competitive with in-domain DPR models, which are trained on tens of thousands of examples.
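The pseudo-example construction can be illustrated with a crude n-gram intersection standing in for the paper's recurring-span detection (our simplification): a span shared by two passages of the same document turns one passage into a query and the other into its positive for the contrastive loss.

```python
def recurring_ngrams(p1: str, p2: str, n: int = 3) -> set[str]:
    def grams(s: str) -> set[str]:
        words = s.split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return grams(p1) & grams(p2)

passage_a = "the theory of general relativity was published by Einstein"
passage_b = "Einstein completed the theory of general relativity in 1915"

span = max(recurring_ngrams(passage_a, passage_b), key=len, default=None)
if span:
    query = passage_a.replace(span, "").strip()  # pseudo-query with the span ablated
    positive = passage_b                         # positive passage sharing the span
    print(f"span={span!r}\nquery={query!r}\npositive={positive!r}")
```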
Recent advances in open-domain question answering (ODQA) have demonstrated impressive accuracy on standard Wikipedia-style benchmarks. However, it is less clear how robust these models are and how well they perform when applied to real-world applications in drastically different domains. While there has been some work investigating how well ODQA models perform when tested for out-of-domain (OOD) generalization, these studies have been conducted only under conservative shifts in data distribution and typically focus on a single component (i.e., retrieval) rather than an end-to-end system. In response, we propose a more realistic and challenging domain shift evaluation setting and, through extensive experiments, study end-to-end model performance. We find that not only do models fail to generalize, but high retrieval scores often still yield poor answer prediction accuracy. We then categorize different types of shifts and propose techniques that, when presented with a new dataset, predict if intervention methods are likely to be successful. Finally, using insights from this analysis, we propose and evaluate several intervention methods which improve end-to-end answer F1 score by up to 24 points.