There is considerable interest in predicting the pathogenicity of protein variants in human genes. Due to the sparsity of high-quality labels, recent approaches turn to \textit{unsupervised} learning, using Multiple Sequence Alignments (MSAs) to train generative models of natural sequence variation within each gene. These generative models then predict variant likelihood as a proxy for evolutionary fitness. In this work we instead combine this evolutionary principle with pretrained protein language models (LMs), which have already shown promising results in predicting protein structure and function. Instead of training a separate model per gene, we find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot, without MSAs or finetuning. We call this unsupervised approach \textbf{VELM} (Variant Effect via Language Models), and show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
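One common way to turn a language model's likelihoods into a zero-shot variant score is a log-odds ratio between the mutant and wild-type residue at the variant position. The abstract does not state VELM's exact scoring rule, so the sketch below is only an illustration under that assumption; `variant_log_odds` and the toy probability table are hypothetical and stand in for the output of a real pretrained protein LM.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def variant_log_odds(position_probs, wild_type, mutant):
    """Log-odds of the mutant residue versus the wild-type residue.

    position_probs maps each amino acid to the probability a model assigns
    to it at the variant position. Here the values are supplied by hand;
    in practice they would come from a protein LM (assumption).
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])

# Hypothetical per-position probabilities (they sum to 1.0 by construction).
toy_probs = {aa: 0.01 for aa in AMINO_ACIDS}
toy_probs["L"] = 0.60  # wild-type residue, strongly favored
toy_probs["I"] = 0.20  # conservative substitution
toy_probs["P"] = 0.03  # disruptive substitution

conservative = variant_log_odds(toy_probs, wild_type="L", mutant="I")
disruptive = variant_log_odds(toy_probs, wild_type="L", mutant="P")

# A more negative score means the model finds the variant less likely
# than the wild type, which is read as a sign of lower fitness.
print(f"L->I: {conservative:.3f}")
print(f"L->P: {disruptive:.3f}")
```

Under this convention, a threshold on the score (chosen on labeled data, if any is available) would separate likely benign from likely pathogenic variants.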
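A variety of machine learning models, including deep neural network models, have been developed to predict the deleteriousness of missense (non-synonymous) mutations. Nevertheless, potential improvements to the current state of the art may still benefit from a fresh look at the biological problem using more sophisticated self-adaptive machine learning approaches. Recent advances in natural language processing show that transformer models, a type of deep neural network, are particularly powerful at modeling sequence information with context dependence. In this study, we introduce MutFormer, a transformer-based model for the prediction of deleterious missense mutations. MutFormer uses reference and mutated protein sequences from the human genome as its primary features. It combines self-attention layers and convolutional layers to learn both long-range and short-range dependencies between amino acid mutations in a protein sequence. We pre-trained MutFormer on reference protein sequences and on mutated protein sequences resulting from common genetic variants observed in human populations. Next, we examined different fine-tuning methods to successfully apply the model to deleteriousness prediction of missense mutations. Finally, we evaluated MutFormer's performance on multiple test datasets. We found that MutFormer showed similar or improved performance compared with a variety of existing tools, including those that use conventional machine learning approaches (e.g., support vector machines, convolutional neural networks, recurrent neural networks). We conclude that MutFormer successfully considers sequence features that were not explored in previous studies and may complement existing computational predictions or empirically generated functional scores to improve our understanding of disease variants.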
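Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant to artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional finetuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. We release the ProGen2 models and code at https://github.com/salesforce/progen.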
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations that improve the function of a known protein. We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models, such as protein language models, and supervised models that predict protein function from sequence. By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins. Our framework achieves this without any model fine-tuning or re-training by constructing a product of experts distribution directly in discrete protein space. Instead of resorting to brute force search or random sampling, which is typical of classic directed evolution, we introduce a fast MCMC sampler that uses gradients to propose promising mutations. We conduct in silico directed evolution experiments on wide fitness landscapes and across a range of different pre-trained unsupervised models, including a 650M parameter protein language model. Our results demonstrate an ability to efficiently discover variants with high evolutionary likelihood as well as estimated activity multiple mutations away from a wild type protein, suggesting our sampler provides a practical and effective new paradigm for machine-learning-based protein engineering.
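The idea of composing models as a product of experts in discrete sequence space can be sketched as follows: each expert contributes an unnormalized log-score, the scores are summed, and a Metropolis-Hastings chain samples single-site mutations from the resulting distribution. This is a toy illustration only. The two experts, the four-letter alphabet, and all function names are hypothetical, and the sketch uses plain uniform single-site proposals rather than the gradient-based proposals described in the abstract.

```python
import math
import random

ALPHABET = "ACDE"  # tiny toy alphabet, not the 20 amino acids

def toy_lm_expert(seq):
    """Hypothetical unsupervised expert: log-score that favors 'A'."""
    return 0.5 * seq.count("A")

def toy_function_expert(seq):
    """Hypothetical supervised expert: rewards 'D' at the first position."""
    return 1.0 if seq[0] == "D" else 0.0

EXPERTS = [toy_lm_expert, toy_function_expert]
WEIGHTS = [1.0, 1.0]

def poe_log_score(seq, experts, weights):
    """Unnormalized log-probability of seq under the product of experts."""
    return sum(w * expert(seq) for expert, w in zip(experts, weights))

def mh_step(seq, experts, weights, rng):
    """One Metropolis-Hastings step with a symmetric single-site proposal."""
    pos = rng.randrange(len(seq))
    proposal = seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]
    log_ratio = (poe_log_score(proposal, experts, weights)
                 - poe_log_score(seq, experts, weights))
    # Accept uphill moves always, downhill moves with probability exp(log_ratio).
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return seq

def run_chain(start, steps, seed=0):
    """Run the chain and return the best-scoring sequence visited."""
    rng = random.Random(seed)
    seq, best = start, start
    for _ in range(steps):
        seq = mh_step(seq, EXPERTS, WEIGHTS, rng)
        if poe_log_score(seq, EXPERTS, WEIGHTS) > poe_log_score(best, EXPERTS, WEIGHTS):
            best = seq
    return best

best = run_chain("CCCC", steps=200)
print(best, poe_log_score(best, EXPERTS, WEIGHTS))
```

Because the experts are only combined through their log-scores, none of them needs to be retrained or fine-tuned; swapping an expert in or out changes only the `EXPERTS` and `WEIGHTS` lists.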
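In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models on next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.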
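Large-scale protein language models (PLMs) have achieved improved performance in protein prediction tasks, ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a ground-breaking AI system, could potentially reshape structural biology. However, the utility of Evoformer, the PLM module in AlphaFold, has not been explored beyond structure prediction. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence alignment), and Evoformer (structural), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does the Evoformer trained as part of AlphaFold produce representations amenable to predicting protein function? (ii) If yes, can Evoformer replace ESM-1b and MSA-Transformer? (iii) How much do these PLMs rely on evolution-related protein data, and are they complementary to each other in this regard? We compare these models through empirical study and present new insights and conclusions. Finally, we release code and datasets for reproducibility.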
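Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, which hinders rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in extracting analytical rules from natural language data, can aid in building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs than in natural language LMs. Here, we provide a linguistics-based roadmap for protein LM choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Combining linguistics with protein LMs enables the development of next-generation interpretable machine learning models with the potential to uncover the biological mechanisms underlying sequence-function relationships.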
The prediction of protein structures from sequences is an important task for function prediction, drug design, and the understanding of related biological processes. Recent advances have proved the power of language models (LMs) in processing protein sequence databases, which inherit the advantages of attention networks and capture useful information in learning representations for proteins. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including evolution-based and single-sequence-based PSP. It seems that instead of using energy-based models and sampling procedures, protein language model (pLM)-based pipelines have emerged as mainstream paradigms in PSP. Despite the fruitful progress, the PSP community needs a systematic and up-to-date survey to help bridge the gap between LMs in the natural language processing (NLP) and PSP domains and introduce their methodologies, advancements, and practical applications. To this end, in this paper, we first introduce the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases. Then, we systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly used protein databases. Next, different types of methods for PSP are discussed, particularly how the pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and foresee promising research directions along with the advances of pLMs. This survey aims to be a hands-on guide for researchers to understand PSP methods, develop pLMs, and tackle challenging problems in this field for practical purposes.
Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data to train an accurate model for predicting the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model that predicts the fitness of protein mutants by leveraging both sequence and structure information and exploiting the attention mechanism. Our model integrates the local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantics from the universal protein sequence space, and the structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy that leverages data from unsupervised models to pre-train our model. After that, our model can achieve strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants (> 4 mutation sites), when finetuned using only a small number of experimental mutation data points (<50). The proposed strategy is of great practical value, as the required experimental effort, i.e., producing a few tens of experimental mutation data points on a given protein, is generally affordable for an ordinary biochemical group and can be applied to almost any protein.
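In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs (Arabic, English, French, German, Hindi, Italian, Japanese, and Telugu) on the Flores-101 dataset. We also show that, in the zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on the SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, PAWS-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.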
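Proteins are linked to almost every life process. Therefore, analyzing the biological structure and properties of protein sequences is critical to the exploration of life, as well as to disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes it possible to model data patterns in large quantities of data. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g., using long short-term memory and convolutional neural networks for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in the representation. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture predictive information from evolutionary-scale individual sequences. Our code and model are available at https://github.com/thudm/proteinlm.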
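Detecting social bias in text is challenging due to nuance, subjectivity, and the difficulty of obtaining good-quality labeled datasets at scale, especially given the evolving nature of social biases and society. To address these challenges, we propose a few-shot instruction-based method for prompting pre-trained language models (LMs). We select a few label-balanced exemplars from a small support repository that are closest, in the embedding space, to the query to be labeled. We then provide the LM with an instruction that consists of this subset of labeled exemplars, the query text to be classified, and a definition of bias, and prompt it to make a decision. We demonstrate that large LMs used in a few-shot context can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models. We observe that the largest 530B parameter model is significantly more effective at detecting social bias than smaller models (achieving at least a 20% improvement in the AUC metric compared to other models). It also maintains a high AUC (dropping less than 5%) in a few-shot setting with a labeled repository reduced to as few as 100 samples. Large pretrained language models thus make it easier and quicker to build new bias detectors.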
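The exemplar selection step described above (picking label-balanced examples nearest to the query in embedding space) might look roughly like the following. This is a sketch under assumptions: the abstract does not specify the similarity measure, so cosine similarity is used here, and the function names, labels, and two-dimensional embeddings are all hypothetical. Embeddings are assumed to be non-zero vectors.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_balanced_exemplars(query_vec, repository, per_label):
    """Return the per_label nearest exemplars for every label.

    repository is a list of (text, label, embedding) tuples.
    """
    by_label = defaultdict(list)
    for text, label, embedding in repository:
        by_label[label].append((cosine(query_vec, embedding), text))
    chosen = []
    for label in sorted(by_label):  # fixed label order for reproducibility
        ranked = sorted(by_label[label], key=lambda pair: pair[0], reverse=True)
        chosen.extend((text, label) for _, text in ranked[:per_label])
    return chosen

# Hypothetical support repository with hand-made 2-D embeddings.
toy_repository = [
    ("example a", "biased", [1.0, 0.0]),
    ("example b", "biased", [0.0, 1.0]),
    ("example c", "unbiased", [0.9, 0.1]),
    ("example d", "unbiased", [-1.0, 0.0]),
]

exemplars = select_balanced_exemplars([1.0, 0.0], toy_repository, per_label=1)
print(exemplars)
```

The selected exemplars would then be placed in the prompt together with the bias definition and the query text before asking the LM for a decision.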
Meta-training, which fine-tunes the language model (LM) on various downstream tasks by maximizing the likelihood of the target label given the task instruction and input instance, has improved the zero-shot task generalization performance. However, meta-trained LMs still struggle to generalize to challenging tasks containing novel labels unseen during meta-training. In this paper, we propose Flipped Learning, an alternative method of meta-training which trains the LM to generate the task instruction given the input instance and label. During inference, the LM trained with Flipped Learning, referred to as Flipped, selects the label option that is most likely to generate the task instruction. On 14 tasks of the BIG-bench benchmark, the 11B-sized Flipped outperforms zero-shot T0-11B and even a 16 times larger 3-shot GPT-3 (175B) on average by 8.4% and 9.7% points, respectively. Flipped gives particularly large improvements on tasks with unseen labels, outperforming T0-11B by up to +20% average F1 score. This indicates that the strong task generalization of Flipped comes from improved generalization to novel labels. We release our code at https://github.com/seonghyeonye/Flipped-Learning.
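The inference rule described above, choosing the label option under which the task instruction is most likely, can be sketched in a few lines. The scoring function is passed in as a parameter because the real one would be an LM computing log P(instruction | input, label); the lookup-table scorer, the example instruction, and the function names below are hypothetical stand-ins.

```python
def flipped_predict(instruction, input_text, label_options, instruction_log_likelihood):
    """Pick the label that makes the instruction most likely.

    instruction_log_likelihood(instruction, input_text, label) should return
    log P(instruction | input_text, label) under the trained LM.
    """
    scores = {
        label: instruction_log_likelihood(instruction, input_text, label)
        for label in label_options
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

def toy_instruction_log_likelihood(instruction, input_text, label):
    """Hypothetical scorer: a fixed lookup table instead of a real LM."""
    table = {"positive": -4.2, "negative": -9.7}
    return table[label]

label, scores = flipped_predict(
    instruction="Is the sentiment of this review positive or negative?",
    input_text="A warm, funny film.",
    label_options=["positive", "negative"],
    instruction_log_likelihood=toy_instruction_log_likelihood,
)
print(label, scores)
```

Because the label only appears on the conditioning side, a label string never seen during meta-training can still be scored, which matches the abstract's claim about unseen labels.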
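Free-text rationales aim to explain neural language model (LM) behavior more flexibly and intuitively via natural language. To ensure rationale quality, it is important to have metrics for measuring rationales' faithfulness (reflecting the LM's actual behavior) and plausibility (being convincing to humans). All existing free-text rationale metrics are based on simulatability (the association between the rationale and the LM's predicted label), but there is no protocol for assessing the reliability of such metrics. To investigate this, we propose FRAME, a framework for evaluating simulatability metrics for free-text rationales. FRAME is based on three axioms: (1) good metrics should yield the highest scores for reference rationales, which maximize rationale-label association by construction; (2) good metrics should be appropriately sensitive to semantic perturbation of rationales; and (3) good metrics should be robust to variation in the LM's task performance. Across three text classification datasets, we show that existing simulatability metrics cannot satisfy all three FRAME axioms, since they are implemented via model pretraining, which muddles the metric's signal. We introduce a non-pretraining simulatability variant that improves performance on (1) and (3) by an average of 41.7% and 42.9%, respectively, while performing competitively on (2).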
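Data-driven predictive methods that can efficiently and accurately transform protein sequences into biologically active structures are highly valuable for scientific research and therapeutic development. Determining an accurate folding landscape using co-evolutionary information is fundamental to the success of modern protein structure prediction methods. As the state of the art, AlphaFold2 has dramatically raised accuracy without performing explicit co-evolutionary analysis. Nevertheless, its performance still shows a strong dependence on available sequence homologs. We investigate the cause of this dependence and present EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 on targets with poor MSAs. EvoGen allows us to manipulate the folding landscape either by denoising the searched MSA or by generating a virtual MSA, and helps AlphaFold2 fold accurately in the low-data regime or even achieve encouraging performance with single-sequence predictions. Being able to make accurate predictions with few-shot MSAs not only lets AlphaFold2 generalize better to orphan sequences, but also democratizes its use in high-throughput applications. In addition, EvoGen combined with AlphaFold2 yields a probabilistic structure generation method that can explore alternative conformations of protein sequences, and the task-aware differentiable algorithm for sequence generation will benefit other related tasks, including protein design.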
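In Track-1 of BioCreative VII, participants are asked to identify interactions between drugs/chemicals and proteins. In-context named entity annotations for each drug/chemical and protein are provided, and one of fourteen different interactions must be automatically predicted. For this relation extraction task, we attempt both a BERT-based sentence classification approach and a more novel text-to-text approach using a T5 model. We find that BERT-based models perform better in general, with our BioMegatron-based model achieving the highest scores across all metrics, including an F1 score of 0.74. Although our novel T5 text-to-text method did not perform as well as most of our BERT-based models, it outperformed those trained on similar data, showing promising results with an F1 score of 0.65. We believe a text-to-text approach to relation extraction has some competitive advantages and that there is a lot of room for research advancement.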
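Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from pairs of natural-language problem descriptions and ground-truth programs. Such a paradigm largely ignores some important but potentially useful signals in the problem specification, such as unit tests, which often results in poor performance when solving complex unseen coding tasks. To address these limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.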
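While Transformer language models (LMs) are state of the art for information extraction, long text introduces computational challenges that require suboptimal preprocessing steps or alternative model architectures. Sparse-attention LMs can represent longer sequences, overcoming performance hurdles. However, it remains unclear how to explain predictions from these models, as not all tokens attend to each other in the self-attention layers, and long sequences pose computational challenges for explainability algorithms whose runtime depends on document length. These challenges are severe in the medical context, where documents can be very long and machine learning (ML) models must be auditable and trustworthy. We introduce a novel Masked Sampling Procedure (MSP) to identify the text blocks that contribute to a prediction, apply MSP in the context of predicting diagnoses from medical text, and validate our approach with a blind review by two clinicians. Our method identifies about 1.7x more clinically informative text blocks than the previous state of the art, runs up to 100x faster, and is tractable for generating important phrase pairs. MSP is particularly well suited to long LMs but can be applied to any text classifier. We provide a general implementation of MSP.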
Free-text rationales (FTRs) follow how humans communicate by explaining reasoning processes via natural language. A number of recent works have studied how to improve language model (LM) generalization by using FTRs to teach LMs the correct reasoning processes behind correct task outputs. These prior works aim to learn from FTRs by appending them to the LM input or target output, but this may introduce an input distribution shift or conflict with the task objective, respectively. We propose KNIFE, which distills FTR knowledge from an FTR-augmented teacher LM (takes both task input and FTR) to a student LM (takes only task input), which is used for inference. Crucially, the teacher LM's forward computation has a bottleneck stage in which all of its FTR states are masked out, which pushes knowledge from the FTR states into the task input/output states. Then, FTR knowledge is distilled to the student LM by training its task input/output states to align with the teacher LM's. On two question answering datasets, we show that KNIFE significantly outperforms existing FTR learning methods, in both fully-supervised and low-resource settings.
Proteins are fundamental biological entities that play a key role in life activities. The amino acid sequences of proteins can be folded into stable 3D structures in the real physicochemical world, forming a special kind of sequence-structure data. With the development of Artificial Intelligence (AI) techniques, Protein Representation Learning (PRL) has recently emerged as a promising research topic for extracting informative knowledge from massive protein sequences or structures. To pave the way for AI researchers with little bioinformatics background, we present a timely and comprehensive review of PRL formulations and existing PRL methods from the perspective of model architectures, pretext tasks, and downstream applications. We first briefly introduce the motivations for protein representation learning and formulate it in a general and unified framework. Next, we divide existing PRL methods into three main categories: sequence-based, structure-based, and sequence-structure co-modeling. Finally, we discuss some technical challenges and potential directions for improving protein representation learning. The latest advances in PRL methods are summarized in a GitHub repository https://github.com/LirongWu/awesome-protein-representation-learning.