Smart Paper Notes

Pipeline-Invariant Representation Learning for Neuroimaging

Xinhui Li , Alex Fedorov , Mrinal Mathur , Anees Abrol , Gregory Kiar , Sergey Plis , Vince Calhoun

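Category: Machine Learning

2022-08-27

Deep learning has been widely applied in neuroimaging, including to predict brain-phenotype relationships from magnetic resonance imaging (MRI) volumes. MRI data usually requires extensive preprocessing before it is ready for modeling, even with deep learning, in part because of its high dimensionality and heterogeneity. The various MRI preprocessing pipelines each have their own strengths and limitations. Recent studies have shown that pipeline-related variation may lead to different scientific findings, even when the same data is used. Meanwhile, the machine learning community has emphasized the importance of shifting from model-centric to data-centric approaches, since data quality plays an essential role in deep learning applications. Motivated by this idea, we first evaluate how the choice of preprocessing pipeline affects the downstream performance of a supervised learning model. We next propose two pipeline-invariant representation learning methods, MPSL and PXL, to improve the consistency of classification performance and to capture similar neural network representations between pipeline pairs. Using 2000 human subjects from the UK Biobank dataset, we demonstrate that both models have unique advantages: MPSL can be used to improve out-of-sample generalization to new pipelines, while PXL can be used to improve predictive performance consistency and representational similarity within a closed pipeline set. These results suggest that our proposed models can be applied to overcome pipeline-related biases and to improve reproducibility in neuroimaging prediction tasks.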

translated by Google Translate

Self-supervised multimodal neuroimaging yields predictive representations for a spectrum of Alzheimer's phenotypes

Alex Fedorov , Eloy Geenjaar , Lei Wu , Tristan Sylvain , Thomas P. DeRamus , Margaux Luck , Maria Misiura , R Devon Hjelm , Sergey M. Plis , Vince D. Calhoun

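Category: Machine Learning

2022-09-07

Recent neuroimaging studies that focus on predicting brain disorders with modern machine learning approaches commonly include a single modality and rely on supervised, over-parameterized models. However, a single modality provides only a limited view of the highly complex brain. Critically, supervised models in clinical settings lack accurate diagnostic labels for training. Coarse labels do not capture the long-tailed spectrum of brain disorder phenotypes, which leads to a loss of model generalizability and makes the models less useful in diagnostic settings. This work presents a novel multi-scale coordinated framework for learning multiple representations from multimodal neuroimaging data. We propose a general taxonomy of inductive biases to capture unique and joint information in multimodal self-supervised fusion. The taxonomy forms a family of decoder-free models with reduced computational complexity that capture multi-scale relationships between local and global representations of the multimodal inputs. We conduct a comprehensive evaluation of the taxonomy using functional and structural magnetic resonance imaging (MRI) data across a spectrum of Alzheimer's disease phenotypes, and show that self-supervised models reveal disorder-relevant brain regions and multimodal links without access to labels during pre-training. The proposed multimodal self-supervised learning yields representations with improved classification performance for both modalities. The accompanying rich and flexible unsupervised deep learning framework captures complex multimodal relationships and provides predictive performance that meets or exceeds that of a narrower supervised classification analysis. We present thorough quantitative evidence of how this framework can significantly advance our search for missing links in complex brain disorders.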

A Simple Framework for Contrastive Learning of Visual Representations

Ting Chen , Simon Kornblith , Mohammad Norouzi , Geoffrey Hinton

Category:

2020-02-13

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100× fewer labels.
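
The contrastive objective behind this kind of framework can be illustrated with a small NumPy sketch of the normalized temperature-scaled cross-entropy (NT-Xent) loss. The abstract does not spell out the loss, so treat this as an illustration under assumptions rather than the authors' implementation: the function name `nt_xent_loss`, the default `temperature=0.5`, and the convention that row `i` of `z_a` and row `i` of `z_b` are two augmented views of the same example are all choices made here.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for paired embeddings (illustrative sketch).

    z_a[i] and z_b[i] are assumed to be embeddings of two augmented
    views of example i; every other row in the batch is a negative.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize rows
    sim = z @ z.T / temperature                        # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity
    # Row i pairs with row i + n, and row i + n pairs with row i.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 16))
views_b = views_a + 0.01 * rng.normal(size=(8, 16))    # nearly identical views
aligned = nt_xent_loss(views_a, views_b)
mismatched = nt_xent_loss(views_a, np.roll(views_b, 1, axis=0))
print(aligned < mismatched)
```

When the two views of each example agree, the positive pair dominates its row of the similarity matrix and the loss is low; pairing each example with the wrong partner raises it.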

Contrastive Multiview Coding

Yonglong Tian , Dilip Krishnan , Phillip Isola

Category:

2019-06-13

Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a "dog" can be seen, heard, and felt). We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors. We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics. Our approach achieves state-of-the-art results on image and video unsupervised learning benchmarks.

Contrastive Neural Processes for Self-Supervised Learning

Konstantinos Kallidromitis , Denis Gudovskiy , Kazuki Kozuka , Ohama Iku , Luca Rigazio

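Category: Machine Learning

2021-10-24

Recent contrastive methods have significantly improved self-supervised learning in several domains. In particular, contrastive methods are most effective where data augmentations can be easily constructed, e.g., in computer vision. However, they are less successful in domains without established data transformations, such as time series data. In this paper, we propose a novel self-supervised learning framework that combines contrastive learning with neural processes. It relies on recent advances in neural processes to perform time series forecasting. This makes it possible to generate augmented versions of the data by employing a set of various sampling functions, and hence to avoid manually designed augmentations. We extend conventional neural processes and propose a new contrastive loss to learn time series representations in a self-supervised setup. Therefore, unlike previous self-supervised methods, our augmentation pipeline is task-agnostic, enabling our method to perform well across various applications. In particular, a ResNet with a linear classifier trained using our approach is able to outperform state-of-the-art techniques across industrial, medical, and audio datasets, improving accuracy by over 10% on periodic ECG data. We further demonstrate that our self-supervised representations are more efficient in the latent space, improving multiple clustering metrics, and that fine-tuning our method on 10% of the labels achieves results competitive with fully supervised learning.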

Hyper-Representations as Generative Models: Sampling Unseen Neural Network Weights

Konstantin Schürholt , Boris Knyazev , Xavier Giró-i-Nieto , Damian Borth

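Category: Machine Learning | Computer Vision

2022-09-29

Learning representations of neural network weights given a model zoo is an emerging and challenging area with many potential applications, from model inspection to neural architecture search and knowledge distillation. Recently, an autoencoder trained on a model zoo was able to learn a hyper-representation, which captures intrinsic and extrinsic properties of the models in the zoo. In this work, we extend hyper-representations for generative use to sample new model weights. We propose layer-wise loss normalization, which we demonstrate is key to generating high-performing models, as well as several sampling methods based on the topology of hyper-representations. The models generated using our methods are diverse, performant, and able to outperform strong baselines, as evaluated on several downstream tasks: initialization, ensemble sampling, and transfer learning. Our results indicate the potential of aggregating knowledge from model zoos into new models via hyper-representations, thereby paving the way for new research directions.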

FastSurferVINN: Building Resolution-Independence into Deep Learning Segmentation Methods -- A Solution for HighRes Brain MRI

Leonie Henschel , David Kügler , Martin Reuter

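Category: Computer Vision

2021-12-17

Leading neuroimaging studies have pushed 3T MRI acquisition resolutions below 1.0 mm for improved structure definition and morphometry. Yet only a few time-intensive automated image analysis pipelines have been validated for high-resolution (HiRes) settings. Efficient deep learning approaches, on the other hand, rarely support more than one fixed resolution (usually 1.0 mm). Furthermore, the lack of a standard submillimeter resolution, as well as the limited availability of diverse HiRes data with sufficient coverage of scanner, age, disease, or genetic variance, poses additional, unsolved challenges for training HiRes networks. Incorporating resolution-independence into deep learning-based segmentation, i.e., the ability to segment images at their native resolution across a range of different voxel sizes, promises to overcome these challenges, yet no such approach currently exists. We fill this gap by introducing a Voxelsize Independent Neural Network (VINN) for resolution-independent segmentation tasks and present FastSurferVINN, which (i) establishes and implements resolution-independence for deep learning as the first method to simultaneously support 0.7-1.0 mm segmentation, (ii) significantly outperforms state-of-the-art methods across resolutions, and (iii) mitigates the data imbalance problem present in HiRes datasets. Overall, internal resolution-independence mutually benefits both HiRes and 1.0 mm MRI segmentation. With our rigorously validated FastSurferVINN, we distribute a fast tool for morphometric neuroimage analysis. Moreover, the VINN architecture represents an efficient resolution-independent segmentation method for wider application.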

Fighting Fire with Fire: Contrastive Debiasing without Bias-free Data via Generative Bias-transformation

Yeonsung Jung , Hajin Shim , June Yong Yang , Eunho Yang

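Category: Machine Learning

2021-12-02

Despite their ability to generalize with over-capacity networks, deep neural networks often learn to abuse spurious biases in the data instead of using the actual task-related information. Since such shortcuts are only effective within the collected dataset, the resulting biased model underperforms on real-world inputs or causes unintended social repercussions such as gender discrimination. To counteract the influence of bias, existing methods either exploit auxiliary information, which is rarely obtainable in practice, or sift for bias-free samples in the training data, hoping that enough clean samples exist. However, such presumptions about the data are not always guaranteed. In this paper, we propose Contrastive Debiasing via Generative Bias-transformation (CDvG), which is capable of operating in more general settings where existing methods break down because their presumptions, such as a sufficient number of bias-free samples, are not met. Motivated by our observation that not only discriminative models, as previously known, but also generative models tend to focus on the bias, CDvG uses a translation model to transform the bias in a sample into another mode of bias while preserving task-relevant information. Through contrastive learning, we set the transformed biased views against one another to learn bias-invariant representations. Experimental results on synthetic and real-world datasets demonstrate that our framework outperforms the current state of the art and effectively prevents models from becoming biased even when bias-free samples are extremely scarce.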

Artifact-Tolerant Clustering-Guided Contrastive Embedding Learning for Ophthalmic Images

Min Shi , Anagha Lokhande , Mojtaba S. Fazli , Vishal Sharma , Yu Tian , Yan Luo , Louis R. Pasquale , Tobias Elze , Michael V. Boland , Nazlee Zebardast

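Category: Computer Vision | Artificial Intelligence

2022-09-02

Ophthalmic images and their derivatives, such as the retinal nerve fiber layer (RNFL) thickness map, are crucial for detecting and monitoring ophthalmic diseases (e.g., glaucoma). For computer-aided diagnosis of eye diseases, the key technique is to automatically extract meaningful features from ophthalmic images that can reveal the biomarkers (e.g., RNFL thinning patterns) linked to functional vision loss. However, learning representations from ophthalmic images that link structural retinal damage to human vision loss is non-trivial, mostly because of large anatomical variations between patients. The task becomes even more challenging in the presence of image artifacts, which are common due to issues with image acquisition and automated segmentation. In this paper, we propose an artifact-tolerant unsupervised learning framework, termed EyeLearn, for learning representations of ophthalmic images. EyeLearn has an artifact correction module that learns representations which can best predict artifact-free ophthalmic images. In addition, EyeLearn adopts a clustering-guided contrastive learning strategy to explicitly capture intra- and inter-image affinities. During training, images are dynamically organized into clusters to form contrastive samples, in which images in the same or different clusters are encouraged to learn similar or dissimilar representations, respectively. To evaluate EyeLearn, we use the learned representations for visual field prediction and glaucoma detection on a real-world ophthalmic image dataset of glaucoma patients. Extensive experiments and comparisons with state-of-the-art methods verify the effectiveness of EyeLearn in learning optimal feature representations from ophthalmic images.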

MouseGAN++: Unsupervised Disentanglement and Contrastive Representation for Multiple MRI Modalities Synthesis and Structural Segmentation of Mouse Brain

Ziqi Yu , Xiaoyang Han , Shengjie Zhang , Jianfeng Feng , Tingying Peng , Xiao-Yong Zhang

Category: Computer Vision

2022-12-04

Segmenting the fine structure of the mouse brain on magnetic resonance (MR) images is critical for delineating morphological regions, analyzing brain function, and understanding their relationships. Compared to a single MRI modality, multimodal MRI data provide complementary tissue features that can be exploited by deep learning models, resulting in better segmentation results. However, multimodal mouse brain MRI data is often lacking, making automatic segmentation of mouse brain fine structure a very challenging task. To address this issue, it is necessary to fuse multimodal MRI data to produce distinguished contrasts in different brain structures. Hence, we propose a novel disentangled and contrastive GAN-based framework, named MouseGAN++, to synthesize multiple MR modalities from single ones in a structure-preserving manner, thus improving the segmentation performance by imputing missing modalities and multi-modality fusion. Our results demonstrate that the translation performance of our method outperforms the state-of-the-art methods. Using the subsequently learned modality-invariant information as well as the modality-translated images, MouseGAN++ can segment fine brain structures with averaged dice coefficients of 90.0% (T2w) and 87.9% (T1w), respectively, achieving around +10% performance improvement compared to the state-of-the-art algorithms. Our results demonstrate that MouseGAN++, as a simultaneous image synthesis and segmentation method, can be used to fuse cross-modality information in an unpaired manner and yield more robust performance in the absence of multimodal data. We release our method as a mouse brain structural segmentation tool for free academic usage at https://github.com/yu02019.
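
For readers unfamiliar with the metric quoted above, the Dice coefficient between a predicted and a reference binary mask can be computed as in the following sketch. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made here for illustration; the abstract does not describe how the authors compute or average the score.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Both masks empty: treated as perfect agreement (a convention chosen here).
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Toy example: 2 predicted voxels, 1 reference voxel, 1 voxel in common.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))  # 2 * 1 / (2 + 1)
```

A score of 1 means the masks coincide exactly and 0 means they do not overlap; an averaged score is typically the mean of this value over structures or subjects.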

Addressing Feature Suppression in Unsupervised Visual Representations

Tianhong Li , Lijie Fan , Yuan Yuan , Hao He , Yonglong Tian , Rogerio Feris , Piotr Indyk , Dina Katabi

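Category: Machine Learning | Computer Vision

2020-12-17

Contrastive learning is one of the fastest growing research areas in machine learning because it can learn useful representations without labeled data. However, contrastive learning is susceptible to feature suppression, i.e., it may discard important information relevant to the task of interest and learn irrelevant features. Past work has addressed this limitation with handcrafted data augmentations that eliminate irrelevant information. This approach, however, does not work across all datasets and tasks. Furthermore, data augmentations fail to address feature suppression in multi-attribute classification, where one attribute can suppress features relevant to other attributes. In this paper, we analyze the objective function of contrastive learning and formally prove that it is vulnerable to feature suppression. We then present predictive contrastive learning (PCL), a framework for learning unsupervised representations that are robust to feature suppression. The key idea is to force the learned representation to predict the input, and hence prevent it from discarding important information. Extensive experiments verify that PCL is robust to feature suppression and outperforms state-of-the-art contrastive learning methods on a variety of datasets and tasks.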

GATE: Graph CCA for Temporal SElf-supervised Learning for Label-efficient fMRI Analysis

Liang Peng , Nan Wang , Jie Xu , Xiaofeng Zhu , Xiaoxiao Li

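Category: Machine Learning

2022-03-17

In this work, we focus on the challenging task of neuro-disease classification using functional magnetic resonance imaging (fMRI). In population graph-based disease analysis, graph convolutional neural networks (GCNs) have achieved remarkable success. However, these achievements depend on abundant labeled data and are sensitive to spurious signals. To improve fMRI representation learning and classification in a label-efficient setting, we propose a novel, theory-driven self-supervised learning (SSL) framework on GCNs, namely Graph CCA for Temporal self-supervised learning on fMRI analysis (GATE). Specifically, a suitable and effective SSL strategy is needed to extract informative and robust features from fMRI. To this end, we investigate several new graph augmentation strategies based on fMRI dynamic functional connectivity (FC) for SSL training. Furthermore, we apply canonical correlation analysis (CCA) to the different temporal embeddings and present its theoretical implications. This yields a novel two-step GCN learning procedure consisting of (i) SSL on an unlabeled fMRI population graph and (ii) fine-tuning on a small labeled fMRI dataset for the classification task. Our method is tested on two independent fMRI datasets, demonstrating superior performance on autism and dementia diagnosis.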

Self-supervised deep convolutional neural network for chest X-ray classification

Matej Gazda , Jakub Gazda , Jan Plavka , Peter Drotar

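Category: Computer Vision | Neural and Evolutionary Computing

2021-03-04

Chest radiography is a relatively cheap, widely available medical procedure that conveys key information for making diagnostic decisions. Chest X-rays are almost always used in the diagnosis of respiratory diseases such as pneumonia or, more recently, COVID-19. In this paper, we propose a self-supervised deep neural network that is pretrained on an unlabeled chest X-ray dataset. The learned representations are transferred to a downstream task: the classification of respiratory diseases. The results obtained on four public datasets show that our approach yields competitive results without requiring large amounts of labeled training data.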

Overcoming the Domain Gap in Neural Action Representations

Semih Günel , Florian Aymanns , Sina Honari , Pavan Ramdya , Pascal Fua

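Category: Computer Vision

2021-12-02

Relating animal behavior to brain activity is a fundamental goal in neuroscience, with practical applications in building robust brain-machine interfaces. However, the domain gap between individuals is a major issue that prevents the training of general models that work on unlabeled subjects. Since 3D pose data can now be reliably extracted from multi-view video sequences without manual intervention, we propose to use it to guide the encoding of neural action representations, together with a set of neural and behavioral augmentations that exploit the properties of microscopy imaging. To reduce the domain gap, during training we swap neural and behavioral data across animals that appear to be performing similar actions. To demonstrate this, we test our method on three very different multimodal datasets: one that features flies and their neural activity, one that contains human neural electrocorticography (ECoG) data, and one with RGB video data of human activities from different viewpoints.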

A Clustering-guided Contrastive Fusion for Multi-view Representation Learning

Guanzhou Ke , Guoqing Chao , Xiaoli Wang , Chenyang Xu , Chang Xu , Yongqi Zhu , Yang Yu

Category: Computer Vision

2022-12-28

The past two decades have seen increasingly rapid advances in multi-view representation learning, because it extracts useful information from diverse domains to facilitate the development of multi-view applications. However, the community faces two challenges: i) how to learn representations from a large amount of unlabeled data that are robust to noise and incomplete views, and ii) how to balance view consistency and complementarity for various downstream tasks. To this end, we utilize a deep fusion network to fuse view-specific representations into the view-common representation, extracting high-level semantics to obtain a robust representation. In addition, we employ a clustering task to guide the fusion network and prevent it from converging to trivial solutions. To balance consistency and complementarity, we design an asymmetrical contrastive strategy that aligns the view-common representation and each view-specific representation. These modules are incorporated into a unified method known as CLustering-guided cOntrastiVE fusioN (CLOVEN). We quantitatively and qualitatively evaluate the proposed method on five datasets, demonstrating that CLOVEN outperforms 11 competitive multi-view learning methods in clustering and classification. In the incomplete-view scenario, our proposed method resists noise interference better than its competitors. Furthermore, the visualization analysis shows that CLOVEN can preserve the intrinsic structure of the view-specific representations while also improving the compactness of the view-common representation. Our source code will be available soon at https://github.com/guanzhou-ke/cloven.

Local Spatiotemporal Representation Learning for Longitudinally-consistent Neuroimage Analysis

Mengwei Ren , Neel Dey , Martin A. Styner , Kelly Botteron , Guido Gerig

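Category: Computer Vision | Machine Learning

2022-06-09

Recent self-supervised advances in medical computer vision exploit global and local anatomical self-similarity for pretraining prior to downstream tasks such as segmentation. However, current methods assume i.i.d. image acquisition, which is invalid in clinical study designs where follow-up longitudinal scans track subject-specific temporal changes. Furthermore, existing self-supervised methods for medically relevant image-to-image architectures exploit only spatial or temporal self-similarity, and only do so via a loss applied at a single image scale, while naive multi-scale spatiotemporal extensions collapse to degenerate solutions. To these ends, this paper makes two contributions: (1) It presents a local and multi-scale spatiotemporal representation learning method for image-to-image architectures trained on longitudinal images. It exploits the spatiotemporal self-similarity of learned multi-scale intra-subject features for pretraining and develops several feature-wise regularizations that avoid collapsed identity representations. (2) During fine-tuning, it proposes a surprisingly simple self-supervised segmentation consistency regularization to exploit intra-subject correlation. Benchmarked in the one-shot segmentation setting, the proposed framework outperforms both well-tuned randomly initialized baselines and current self-supervised techniques designed for i.i.d. and longitudinal datasets. These improvements are demonstrated on both longitudinal neurodegenerative adult MRI and developing infant brain MRI, and yield both higher performance and better longitudinal consistency.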

Max-Margin Contrastive Learning

Anshul Shah , Suvrit Sra , Rama Chellappa , Anoop Cherian

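Category: Machine Learning | Artificial Intelligence | Computer Vision

2021-12-21

Standard contrastive learning approaches usually require a large number of negatives for effective unsupervised learning and often exhibit slow convergence. We suspect this behavior is due to the suboptimal selection of the negatives used to provide contrast to the positives. We counter this difficulty by taking inspiration from support vector machines (SVMs) to present max-margin contrastive learning (MMCL). Our approach selects negatives as the sparse support vectors obtained from a quadratic optimization problem, and enforces contrastiveness by maximizing the decision margin. Since SVM optimization can be computationally demanding, especially in an end-to-end setting, we present simplifications that alleviate the computational burden. We validate our approach on standard vision benchmark datasets, demonstrating state-of-the-art performance in unsupervised representation learning along with better empirical convergence properties.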

COCOA: Cross Modality Contrastive Learning for Sensor Data

Shohreh Deldari , Hao Xue , Aaqib Saeed , Daniel V. Smith , Flora D. Salim

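Category: Computer Vision | Machine Learning

2022-07-31

Self-supervised learning (SSL) is a new paradigm for learning discriminative representations without labeled data, and has reached comparable or even state-of-the-art results relative to its supervised counterparts. Contrastive learning (CL) is one of the best-known approaches in SSL and attempts to learn general, informative representations of data. CL methods have mostly been developed for computer vision and natural language processing applications, where only a single sensor modality is used. However, most pervasive computing applications exploit data from a range of different sensor modalities. While existing CL methods are limited to learning from one or two data sources, we propose COCOA (Cross mOdality COntrastive leArning), a self-supervised model that employs a novel objective function to learn quality representations from multi-sensor data by computing the cross-correlation between different data modalities and minimizing the similarity between irrelevant instances. We evaluate the effectiveness of COCOA against eight recently introduced state-of-the-art self-supervised models and two supervised baselines across five public datasets. We show that COCOA achieves better classification performance than all other approaches. COCOA is also far more label-efficient than the other baselines, including the fully supervised model, when using only one-tenth of the available labeled data.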

Self-Supervised Representation Learning on Neural Network Weights for Model Characteristic Prediction

Konstantin Schürholt , Dimche Kostadinov , Damian Borth

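Category: Machine Learning

2021-10-28

Self-supervised learning (SSL) has been shown to learn useful and information-preserving representations. Neural networks (NNs) are widely applied, yet their weight space is still not fully understood. Therefore, we propose to use SSL to learn neural representations of the weights of populations of NNs. To that end, we introduce domain-specific data augmentations and an adapted attention architecture. Our empirical evaluation demonstrates that self-supervised representation learning in this domain is able to recover diverse NN model characteristics. Furthermore, we show that the proposed learned representations outperform prior work in predicting hyperparameters, test accuracy, and generalization gap, and that they transfer to out-of-distribution settings.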

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford , Jong Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark

Category:

2021-02-26

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
