We aim to learn language models for Creole languages, for which large amounts of data are not readily available, and therefore explore the potential of transfer from ancestor languages (the "Ancestry Transfer Hypothesis"). We find that standard transfer methods do not facilitate ancestry transfer. Surprisingly, and unlike other non-Creole languages, a very distinct two-phase pattern emerges for Creoles: as our training loss plateaus and the language models begin to overfit on their source languages, perplexity on the Creoles drops. We explore whether this compression phase can lead to practically useful language models (the "Ancestry Bottleneck Hypothesis"), but falsify this as well. Moreover, we show that Creoles exhibit this two-phase pattern even when training on random, unrelated languages. Creoles therefore appear to be typological outliers, and we speculate whether there is a connection between the two observations.
Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings, and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
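For readers unfamiliar with SimCSE's unsupervised objective, the sketch below shows the dropout-based contrastive loss that mSimCSE scales to a multilingual encoder. The XLM-R checkpoint, the [CLS] pooling, and the temperature are illustrative assumptions, not the authors' exact setup (see their repository for that).

```python
# Minimal sketch of a SimCSE-style unsupervised contrastive objective on English
# sentences, applied to a multilingual encoder. Names and hyperparameters are
# illustrative; consult the mSimCSE repository for the real implementation.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
encoder.train()  # dropout must stay active: it produces the two "views" of each sentence

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling

def unsup_simcse_loss(sentences, temperature=0.05):
    z1 = embed(sentences)  # first forward pass
    z2 = embed(sentences)  # second pass with a different dropout mask
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0))  # positives sit on the diagonal
    return F.cross_entropy(sim, labels)

loss = unsup_simcse_loss(["A man is playing guitar.", "The weather is nice today."])
loss.backward()
```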
Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
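A minimal sketch of the retrieve-then-prompt idea behind PARC follows, assuming a sentence-transformers retriever and a toy sentiment label set; the retriever, prompt template, and example pool are illustrative stand-ins rather than the paper's configuration.

```python
# Sketch of a retrieve-then-prompt pipeline in the spirit of PARC: retrieve the
# most similar labeled high-resource-language (HRL) sentences for a low-resource-
# language (LRL) input and prepend them as demonstrations. All model and data
# choices here are placeholders.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

hrl_pool = [
    ("The movie was wonderful.", "positive"),
    ("I regret buying this product.", "negative"),
]
hrl_embeddings = retriever.encode([s for s, _ in hrl_pool], convert_to_tensor=True)

def build_prompt(lrl_sentence, k=1):
    query = retriever.encode(lrl_sentence, convert_to_tensor=True)
    scores = util.cos_sim(query, hrl_embeddings)[0]      # similarity to each HRL example
    top = scores.topk(k).indices.tolist()
    demos = "\n".join(f"Sentence: {hrl_pool[i][0]} Sentiment: {hrl_pool[i][1]}" for i in top)
    return f"{demos}\nSentence: {lrl_sentence} Sentiment:"

# The resulting prompt would then be fed to the MPLM for zero-shot prediction.
print(build_prompt("Tsy tiako ilay sarimihetsika."))
```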
JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the effectiveness of transfer from large monolingual or multilingual pretrained models. While our work, along with previous work, shows that transfer from these models to low-resource languages that are unrelated to languages in their training set is not very effective, we would expect stronger results from transfer to creoles. Indeed, our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages, and help us begin to understand how the unique relationship between creoles and their high-resource base languages affects cross-lingual transfer. JamPatoisNLI, which consists of naturally-occurring premises and expert-written hypotheses, is a step towards steering research into a traditionally underserved language and a useful benchmark for understanding cross-lingual NLP.
In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2019) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.
Misinformation spread over social media has become an undeniable infodemic. However, not all spreading claims are made equal. If propagated, some claims can be destructive, not only on the individual level, but to organizations and even countries. Detecting claims that should be prioritized for fact-checking is considered the first step in fighting the spread of fake news. With training data limited to a handful of languages, developing supervised models to tackle the problem over lower-resource languages is currently infeasible. Therefore, our work aims to investigate whether we can use existing datasets to train models for predicting worthiness of verification of claims in tweets in other languages. We present a systematic comparative study of six approaches for cross-lingual check-worthiness estimation across pairs of five diverse languages with the help of the Multilingual BERT (mBERT) model. We run our experiments using a state-of-the-art multilingual Twitter dataset. Our results show that for some language pairs, zero-shot cross-lingual transfer is possible and can perform as well as monolingual models that are trained on the target language. We also show that in some languages, this approach outperforms (or at least is comparable to) state-of-the-art models.
Multi-lingual language models (LM), such as mBERT, XLM-R, mT5, mBART, have been remarkably successful in enabling natural language tasks in low-resource languages through cross-lingual transfer from high-resource ones. In this work, we try to better understand how such models, specifically mT5, transfer *any* linguistic and semantic knowledge across languages, even though no explicit cross-lingual signals are provided during pre-training. Rather, only unannotated texts from each language are presented to the model separately and independently of one another, and the model appears to implicitly learn cross-lingual connections. This raises several questions that motivate our study, such as: Are the cross-lingual connections between every language pair equally strong? What properties of source and target language impact the strength of cross-lingual transfer? Can we quantify the impact of those properties on the cross-lingual transfer? In our investigation, we analyze a pre-trained mT5 to discover the attributes of cross-lingual connections learned by the model. Through a statistical interpretation framework over 90 language pairs across three tasks, we show that transfer performance can be modeled by a few linguistic and data-derived features. These observations enable us to interpret cross-lingual understanding of the mT5 model. Through these observations, one can favorably choose the best source language for a task, and can anticipate its training data demands. A key finding of this work is that similarity of syntax, morphology and phonology are good predictors of cross-lingual transfer, significantly more than just the lexical similarity of languages. For a given language, we are able to predict zero-shot performance, that increases on a logarithmic scale with the number of few-shot target language data points.
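As a worked toy example of such a statistical interpretation, the sketch below regresses synthetic transfer scores on similarity features plus a log-scaled few-shot count. The data is randomly generated and only demonstrates the form of the analysis, not the paper's features or results.

```python
# Toy sketch of the statistical-interpretation idea: regress transfer performance
# on a handful of linguistic and data-derived features, with few-shot data size
# entering on a log scale. The synthetic data below is random and illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_pairs = 90                                    # the paper analyzes 90 language pairs
syntax_sim = rng.uniform(0, 1, n_pairs)         # e.g. syntactic similarity of the pair
phon_sim = rng.uniform(0, 1, n_pairs)           # e.g. phonological similarity
log_shots = np.log10(rng.integers(1, 10_000, n_pairs))

X = np.column_stack([syntax_sim, phon_sim, log_shots])
y = 30 + 25 * syntax_sim + 10 * phon_sim + 8 * log_shots + rng.normal(0, 2, n_pairs)

reg = LinearRegression().fit(X, y)
print(dict(zip(["syntax", "phonology", "log_shots"], reg.coef_)))  # feature contributions
```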
Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc., have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, there has been extensive work on (i) building larger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analyzing the performance of MLLMs on monolingual, zero-shot cross-lingual, and bilingual tasks, (iv) understanding the universal language patterns (if any) learned by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering these broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.
Related works used indexes like CKA and variants of CCA to measure the similarity of cross-lingual representations in multilingual language models. In this paper, we argue that assumptions of CKA/CCA align poorly with one of the motivating goals of cross-lingual learning analysis, i.e., explaining zero-shot cross-lingual transfer. We highlight what valuable aspects of cross-lingual similarity these indexes fail to capture and provide a motivating case study \textit{demonstrating the problem empirically}. Then, we introduce \textit{Average Neuron-Wise Correlation (ANC)} as a straightforward alternative that is exempt from the difficulties of CKA/CCA and is good specifically in a cross-lingual context. Finally, we use ANC to construct evidence that the previously introduced ``first align, then predict'' pattern takes place not only in masked language models (MLMs) but also in multilingual models with \textit{causal language modeling} objectives (CLMs). Moreover, we show that the pattern extends to the \textit{scaled versions} of the MLMs and CLMs (up to 85x original mBERT).\footnote{Our code is publicly available at \url{https://github.com/TartuNLP/xsim}}
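A minimal sketch of what an average neuron-wise correlation index looks like, based on the high-level description above: correlate each hidden dimension ("neuron") across a set of parallel sentences in two languages, then average over neurons. The exact formulation used in the paper is in the linked repository; the code below is an illustrative approximation.

```python
# ANC-style index: per-neuron Pearson correlation between representations of
# parallel sentences in two languages, averaged across neurons.
import numpy as np

def average_neuron_wise_correlation(reps_lang_a, reps_lang_b):
    """reps_*: arrays of shape (n_parallel_sentences, hidden_size)."""
    a = reps_lang_a - reps_lang_a.mean(axis=0)
    b = reps_lang_b - reps_lang_b.mean(axis=0)
    denom = a.std(axis=0) * b.std(axis=0) + 1e-8   # avoid division by zero
    per_neuron_corr = (a * b).mean(axis=0) / denom  # Pearson r for each neuron
    return per_neuron_corr.mean()

x = np.random.randn(100, 768)                       # e.g. layer outputs for language A
y = 0.7 * x + 0.3 * np.random.randn(100, 768)       # parallel sentences, language B
print(average_neuron_wise_correlation(x, y))
```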
The BLOOM model is a large open-source multilingual language model capable of zero-shot learning, but its pretraining was limited to 46 languages. To improve its zero-shot performance on unseen languages, it is desirable to adapt BLOOM, but previous works have only explored adapting small language models. In this work, we apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages. We find language adaptation to be effective at improving zero-shot performance in new languages. Surprisingly, adapter-based finetuning is more effective than continued pretraining for large models. In addition, we discover that prompting performance is not significantly affected by language specifics, such as the writing system. It is primarily determined by the size of the language adaptation data. We also add new languages to BLOOMZ, which is a multitask finetuned version of BLOOM capable of following task instructions zero-shot. We find including a new language in the multitask fine-tuning mixture to be the most effective method to teach BLOOMZ a new language. We conclude that with sufficient training data language adaptation can generalize well to diverse languages. Our code is available at \url{https://github.com/bigscience-workshop/multilingual-modeling/}.
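As an illustration of adapter-based language adaptation (the strategy found most effective for large models), the sketch below uses LoRA via the peft library as one concrete parameter-efficient stand-in; the BLOOM variant, hyperparameters, and training text are placeholders, not the paper's recipe.

```python
# Illustrative adapter-style adaptation of a causal LM to a new language:
# freeze the base model and train only small adapter matrices (here LoRA).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "bigscience/bloom-560m"   # small variant, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, config)   # only the adapter parameters are trainable
model.print_trainable_parameters()

# Monolingual text in the new language is then fed through the usual causal-LM
# training loop, updating only the adapter weights.
batch = tokenizer("Ejemplo de texto en la nueva lengua.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```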
Recent work has shown that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific one and a language-neutral one. This paper analyzes the relationship between the two in the context of fine-tuning on two tasks, POS tagging and natural language inference, which require the model to bring different degrees of language-specific knowledge to bear. Visualizations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result supported by evidence from language-identification experiments. However, further experiments on "unlearning" language-specific representations using gradient reversal and iterative adversarial learning show no additional improvement to the language-independent component beyond the effect of fine-tuning itself. The results presented here suggest that fine-tuning reorganizes the model's limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.
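A generic PyTorch sketch of the gradient-reversal setup referenced above: a language discriminator is trained on top of the encoder, but its gradient is flipped before flowing back into the encoder, pushing representations toward language-neutrality. The layer sizes and the linear stand-in for mBERT are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # flip (and scale) the gradient

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

encoder = nn.Linear(768, 768)             # stands in for the mBERT encoder
lang_discriminator = nn.Linear(768, 104)  # predicts which of mBERT's 104 languages

h = encoder(torch.randn(8, 768))
lang_logits = lang_discriminator(grad_reverse(h))
loss = nn.functional.cross_entropy(lang_logits, torch.randint(0, 104, (8,)))
loss.backward()   # the encoder receives the *reversed* discriminator gradient
```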
While recent work on multilingual language models has demonstrated their capacity for cross-lingual zero-shot transfer on downstream tasks, the community lacks consensus on which shared properties between languages enable such transfer. Analyses involving pairs of natural languages are often inconclusive and contradictory, since languages differ in many linguistic aspects simultaneously. In this paper, we conduct a large-scale empirical study to isolate the effects of various linguistic properties by measuring zero-shot transfer between four diverse natural languages and counterparts constructed by modifying aspects such as script, word order, and syntax. Among other things, our experiments show that the absence of subword overlap significantly affects zero-shot transfer when languages differ in word order, and that there is a strong correlation between transfer performance and word embedding alignment between languages (e.g., r = 0.94 on the task of NLI). Our results call for a focus on explicitly improving embedding alignment between languages rather than relying on its implicit emergence.
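One common way to quantify the word-embedding alignment that this correlation refers to is an orthogonal-Procrustes fit over known translation pairs, sketched below with random placeholder vectors; the paper's exact alignment metric may differ.

```python
# Alignment score between two languages' embedding spaces: rotate the source
# embeddings onto the target embeddings (orthogonal Procrustes) and average the
# cosine similarity over known translation pairs.
import numpy as np

def alignment_score(src_vecs, tgt_vecs):
    """src_vecs, tgt_vecs: embeddings of translation pairs, shape (n_pairs, dim)."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)   # Procrustes solution U V^T
    mapped = src_vecs @ (u @ vt)                      # rotate source into target space
    cos = (mapped * tgt_vecs).sum(1) / (
        np.linalg.norm(mapped, axis=1) * np.linalg.norm(tgt_vecs, axis=1)
    )
    return cos.mean()

print(alignment_score(np.random.randn(500, 300), np.random.randn(500, 300)))
# With one alignment score and one zero-shot accuracy per language pair,
# scipy.stats.pearsonr(alignment_scores, transfer_accuracies) yields the kind of
# correlation reported in the abstract.
```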
Transformer language models (TLMs) are critical for most NLP tasks, but they are difficult to create for low-resource languages because of how much pretraining data they require. In this work, we investigate two techniques for training monolingual TLMs in a low-resource setting: greatly reducing TLM size, and complementing the masked language modeling objective with two linguistically rich supervised tasks (part-of-speech tagging and dependency parsing). Results from 7 diverse languages indicate that our model, MicroBERT, is able to produce marked improvements in downstream task evaluations relative to a typical monolingual TLM pretraining approach. Specifically, we find that monolingual MicroBERT models achieve gains of up to 18% for parser LAS and 11% for NER F1 compared to a multilingual baseline, mBERT, while having less than 1% of its parameter count. We conclude that reducing TLM parameter count and using labeled data for pretraining low-resource TLMs can yield large quality benefits and, in some cases, produce models that outperform multilingual approaches.
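A toy sketch of the second technique, combining the masked-LM objective with a supervised POS-tagging loss, follows; the layer sizes and loss weighting are assumptions, and the dependency-parsing head as well as input masking are omitted for brevity.

```python
# Toy MicroBERT-style multitask pretraining: a small transformer with an MLM head
# and a POS-tagging head, trained on a joint loss.
import torch
import torch.nn as nn

class TinyMultitaskLM(nn.Module):
    def __init__(self, vocab_size=8000, hidden=256, n_pos_tags=17):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=3,
        )
        self.mlm_head = nn.Linear(hidden, vocab_size)   # predicts masked tokens
        self.pos_head = nn.Linear(hidden, n_pos_tags)   # predicts UPOS tags

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.mlm_head(h), self.pos_head(h)

model = TinyMultitaskLM()
ce = nn.CrossEntropyLoss(ignore_index=-100)             # -100 marks unsupervised positions

token_ids = torch.randint(0, 8000, (2, 16))
mlm_labels = torch.full((2, 16), -100, dtype=torch.long)
mlm_labels[:, 3] = token_ids[:, 3]                      # pretend position 3 was masked
pos_labels = torch.randint(0, 17, (2, 16))              # gold tags from a small treebank

mlm_logits, pos_logits = model(token_ids)
loss = ce(mlm_logits.transpose(1, 2), mlm_labels) + ce(pos_logits.transpose(1, 2), pos_labels)
loss.backward()
```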
Recently, large pretrained language models (LMs) have become increasingly popular. Training these models requires ever more computational resources, and most existing models are trained only on English text. Training such models in other languages is extremely expensive. To alleviate this problem, we introduce a method called WECHSEL for transferring English models to new languages. We exchange the tokenizer of the English model for a tokenizer in the target language and initialize the token embeddings by leveraging multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer GPT-2 and RoBERTa models to four other languages (French, German, Chinese, and Swahili). WECHSEL improves over previously proposed cross-lingual parameter transfer methods and outperforms models of comparable size trained from scratch in the target language, while requiring far less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.
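A simplified sketch of the embedding-initialization step: each target-language token embedding is set to a similarity-weighted average of source-token embeddings, with similarities taken from aligned static word embeddings. The authors' wechsel package implements the full method; everything below (shapes, neighborhood size, temperature) is a placeholder illustration.

```python
# Simplified WECHSEL-style initialization of target-language token embeddings.
import numpy as np

def init_target_embeddings(src_model_emb, src_static, tgt_static, k=10, temp=0.1):
    """
    src_model_emb: (src_vocab, d_model)  embeddings of the pretrained English model
    src_static:    (src_vocab, d_static) aligned static embeddings of source tokens
    tgt_static:    (tgt_vocab, d_static) aligned static embeddings of target tokens
    """
    src_n = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    tgt_n = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sims = tgt_n @ src_n.T                           # (tgt_vocab, src_vocab) cosine sims
    out = np.zeros((tgt_static.shape[0], src_model_emb.shape[1]))
    for i, row in enumerate(sims):
        nn = np.argpartition(-row, k)[:k]            # k most similar source tokens
        w = np.exp(row[nn] / temp)
        out[i] = (w / w.sum()) @ src_model_emb[nn]   # weighted average of their embeddings
    return out

new_emb = init_target_embeddings(
    np.random.randn(5_000, 768), np.random.randn(5_000, 300), np.random.randn(3_000, 300)
)
print(new_emb.shape)   # (3000, 768): drop-in replacement for the model's input embeddings
```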
Large autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages and study their few- and zero-shot learning capabilities across a wide range of tasks. Our largest model, with 7.5 billion parameters, outperforms GPT-3 of comparable size in more than 20 representative languages on multilingual commonsense reasoning (+7.4% absolute accuracy in the 0-shot setting and +9.4% in the 4-shot setting) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there remains room for improvement in surface-form robustness and in adapting to tasks that lack a natural cloze form. Finally, we evaluate our models on hate speech detection in five languages and find that they have limitations similar to comparable-sized GPT-3 models.
Deep language models have achieved remarkable success in the NLP domain. The standard way to train a deep language model is to employ unsupervised learning on a large unlabeled corpus. However, such large corpora are only available for widely adopted, high-resource languages and domains. This study presents DPRK-BERT, the first deep language model for the North Korean (DPRK) variety of Korean. We achieve this by compiling the first unlabeled corpus for the DPRK language and fine-tuning a pre-existing ROK language model. We compare the proposed model with existing approaches and show significant improvements on two DPRK datasets. We also provide a cross-lingual version of this model that yields better generalization across the two Korean varieties. Finally, we provide various NLP tools related to the DPRK language that will foster future research.
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
Providing better language tools for low-resource and endangered languages is imperative for equitable growth. Recent progress with massively multilingual pretrained models has proven surprisingly effective at performing zero-shot transfer to a wide variety of languages. However, this transfer is not universal, with many languages not currently understood by multilingual approaches. It is estimated that only 72 languages possess a "small set of labeled datasets" on which we could test a model's performance, the vast majority of languages not having the resources available to simply evaluate performances on. In this work, we attempt to clarify which languages do and do not currently benefit from such transfer. To that end, we develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model. Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language. We construct a cross-lingual sentence similarity task to evaluate our approach empirically on 350, primarily low-resource, languages.
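A minimal sketch of the perturbation-sensitivity idea follows, assuming a multilingual sentence encoder and a word-order-shuffle perturbation; both choices are illustrative, not the paper's exact protocol.

```python
# If an encoder's representation barely changes when a sentence is scrambled, the
# model probably has a limited understanding of that language.
import random
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def sensitivity(sentence, n_perturbations=5, seed=0):
    rng = random.Random(seed)
    perturbed = []
    for _ in range(n_perturbations):
        words = sentence.split()
        rng.shuffle(words)                       # destroy the word order
        perturbed.append(" ".join(words))
    embs = encoder.encode([sentence] + perturbed, convert_to_tensor=True)
    sims = util.cos_sim(embs[0], embs[1:])[0]
    return float(1 - sims.mean())                # higher = more sensitive = better understood

print(sensitivity("The quick brown fox jumps over the lazy dog."))
```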
The zero-shot cross-lingual ability of models pretrained on multilingual or even monolingual corpora has spurred many hypotheses to explain this intriguing empirical result. However, due to the cost of pretraining, most research uses publicly available models whose pretraining methodology (such as the choice of tokenization, corpus size, and compute budget) may differ drastically. When researchers pretrain their own models, they often do so under a limited budget, and the resulting models may underperform significantly compared to SOTA models. These experimental differences have led to various inconsistent conclusions about the nature of the cross-lingual ability of these models. To support further research on the topic, we release 10 monolingual byte-level models rigorously pretrained under the same configuration, with a large compute budget (equivalent to 420 days on a V100) and corpora four times larger than the original BERT's. Because they are tokenizer-free, the problem of unseen token embeddings is eliminated, allowing researchers to run a wider range of cross-lingual experiments on languages with different scripts. Additionally, we release two models pretrained on non-natural-language text that can be used for sanity-check experiments. Experiments on QA and NLI tasks show that our monolingual models achieve performance competitive with multilingual ones, and they can therefore serve to strengthen our understanding of cross-lingual transferability in language models.
Pretrained multilingual encoders enable zero-shot cross-lingual transfer, but often produce unreliable models that exhibit high performance variance on the target language. We hypothesize that this high variance results from zero-shot cross-lingual transfer solving an under-specified optimization problem. We show that any linearly interpolated model between the source-language monolingual model and the source+target bilingual model has equally low source-language generalization error, yet the target-language generalization error decreases smoothly and linearly as we move from the monolingual to the bilingual model, suggesting that the model struggles to identify good solutions for both source and target languages using the source language alone. Additionally, we show that the zero-shot solution lies in a non-flat region of the target-language generalization error surface, causing the high variance.
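The linear-interpolation analysis can be reproduced in miniature as below: blend the parameters of a source-only model with a source+target bilingual model and track source- and target-language loss along the path. The tiny linear models and random data are stand-ins for the fine-tuned encoders and dev sets.

```python
# Evaluate models along the linear path between a monolingual-finetuned and a
# bilingual-finetuned checkpoint.
import copy
import torch
import torch.nn as nn

def interpolate_state_dicts(mono_sd, bi_sd, alpha):
    """(1 - alpha) * monolingual weights + alpha * bilingual weights."""
    return {k: (1 - alpha) * mono_sd[k] + alpha * bi_sd[k] for k in mono_sd}

mono_model = nn.Linear(16, 3)    # stand-in for the source-only fine-tuned encoder
bi_model = nn.Linear(16, 3)      # stand-in for the source+target bilingual encoder
probe = copy.deepcopy(mono_model)

src_x, src_y = torch.randn(64, 16), torch.randint(0, 3, (64,))   # "source dev set"
tgt_x, tgt_y = torch.randn(64, 16), torch.randint(0, 3, (64,))   # "target dev set"

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    probe.load_state_dict(interpolate_state_dicts(mono_model.state_dict(),
                                                  bi_model.state_dict(), alpha))
    with torch.no_grad():
        src_loss = nn.functional.cross_entropy(probe(src_x), src_y).item()
        tgt_loss = nn.functional.cross_entropy(probe(tgt_x), tgt_y).item()
    print(f"alpha={alpha:.2f}  source loss={src_loss:.3f}  target loss={tgt_loss:.3f}")
# In the paper, the source-language error stays flat along this path while the
# target-language error decreases smoothly as alpha moves toward the bilingual model.
```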