Large deep learning models have achieved remarkable success in many scenarios. However, training large models is usually challenging, e.g., due to the high computational cost, the unstable and painfully slow optimization procedure, and the vulnerability to overfitting. To alleviate these problems, this work studies a divide-and-conquer strategy, i.e., dividing a large model into smaller modules, training them independently, and reassembling the trained modules to obtain the target model. This approach is promising since it avoids directly training large models from scratch. Nevertheless, implementing this idea is non-trivial, as it is difficult to ensure the compatibility of the independently trained modules. In this paper, we present an elegant solution to address this issue, i.e., we introduce a global, shared meta model to implicitly link all the modules together. This enables us to train highly compatible modules that collaborate effectively when they are assembled together. We further propose a module incubation mechanism that enables the meta model to be designed as an extremely shallow network. As a result, the additional overhead introduced by the meta model is minimized. Though conceptually simple, our method significantly outperforms end-to-end (E2E) training in terms of both final accuracy and training efficiency. For example, on top of ViT-Huge, it improves the accuracy by 2.7% compared to the E2E baseline on ImageNet-1K, while reducing the training cost by 43%. Code is available at https://github.com/LeapLabTHU/Model-Assembling.
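The sketch below illustrates the divide-and-conquer idea under simplifying assumptions (hypothetical MLP "modules", a toy data loader, and a meta model placed after each module); it is not the released implementation, only a minimal picture of training slices independently against a shared, extremely shallow meta model and then reassembling them.

```python
# Minimal sketch: independently train slices of a large model while a shared,
# shallow meta model supplies the global context, then reassemble the slices.
import torch
import torch.nn as nn

def make_meta_model(dim=256, num_classes=10):
    # An extremely shallow stand-in network shared across all module trainings.
    return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_classes))

def make_module(depth=4, dim=256):
    # One slice of the large target model (plain MLP blocks for brevity).
    blocks = []
    for _ in range(depth):
        blocks += [nn.Linear(dim, dim), nn.GELU()]
    return nn.Sequential(*blocks)

def incubate_module(module, meta_model, loader, lr=1e-3):
    """Train one module; the frozen, shared meta model judges its output."""
    for p in meta_model.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(module.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        logits = meta_model(module(x))   # module output flows through the shared meta model
        loss = loss_fn(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
    return module

if __name__ == "__main__":
    dim, num_classes = 256, 10
    meta = make_meta_model(dim, num_classes)
    modules = [make_module(dim=dim) for _ in range(4)]   # divide the target depth into 4 slices
    loader = [(torch.randn(8, dim), torch.randint(0, num_classes, (8,))) for _ in range(5)]
    trained = [incubate_module(m, meta, loader) for m in modules]
    assembled = nn.Sequential(*trained)                  # reassemble the trained modules
    print(assembled(torch.randn(2, dim)).shape)          # torch.Size([2, 256])
```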
In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. In computer vision, however, the difficulty for in-context learning lies in the fact that tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that a vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution: we redefine the outputs of core vision tasks as images, and specify task prompts as images as well. With this idea, our training process is extremely simple: it performs standard masked image modeling on stitched pairs of input and output images. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter achieves competitive performance compared to well-established task-specific models on seven representative vision tasks ranging from high-level visual understanding to low-level image processing, and significantly outperforms recent generalist models on several challenging tasks. Surprisingly, our model shows the capability of completing out-of-domain tasks that do not exist in the training data, such as open-category keypoint detection and object segmentation, validating the powerful task transferability of in-context learning.
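A minimal sketch of the "image"-centric conditioning described above, under an assumed 2x2 layout (the released Painter code may arrange and mask the canvas differently): a prompt input/output pair is stitched together with a query image, and the query's output quadrant is masked so that a masked-image-modeling network paints it in.

```python
# Minimal sketch: build the stitched canvas and the mask over the region the
# model must predict as an image.
import torch

def stitch_canvas(prompt_in, prompt_out, query_in):
    """All tensors are (C, H, W); returns the 2x2 canvas and a boolean mask over
    the bottom-right quadrant that the model has to fill in."""
    c, h, w = query_in.shape
    canvas = torch.zeros(c, 2 * h, 2 * w)
    canvas[:, :h, :w] = prompt_in        # top-left: prompt input
    canvas[:, :h, w:] = prompt_out       # top-right: prompt output (defines the task)
    canvas[:, h:, :w] = query_in         # bottom-left: query input
    mask = torch.zeros(1, 2 * h, 2 * w, dtype=torch.bool)
    mask[:, h:, w:] = True               # bottom-right: query output, predicted as an image
    return canvas, mask

if __name__ == "__main__":
    c, h, w = 3, 224, 224
    canvas, mask = stitch_canvas(torch.rand(c, h, w), torch.rand(c, h, w), torch.rand(c, h, w))
    print(canvas.shape, mask.float().mean().item())  # (3, 448, 448), 25% of the canvas is masked
```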
We launch EVA, a vision-centric foundation model that explores the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text-aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale EVA up to one billion parameters, and it sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap on the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the trained-from-scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
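A minimal sketch of the pretext objective described above, with simplifying assumptions: random tensors stand in for the frozen image-text-aligned (CLIP) teacher features, the student here processes all tokens rather than only the visible ones, and negative cosine similarity is used as one common choice of feature-regression loss; the actual EVA recipe may differ in these details.

```python
# Minimal sketch: regress the teacher's vision features at masked patch positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_feature_loss(student_pred, teacher_feat, mask):
    """student_pred, teacher_feat: (B, N, D) patch features; mask: (B, N) bool,
    True where a patch was masked out. Loss is applied only at masked positions."""
    s = F.normalize(student_pred[mask], dim=-1)
    t = F.normalize(teacher_feat[mask], dim=-1)
    return -(s * t).sum(dim=-1).mean()          # negative cosine similarity

if __name__ == "__main__":
    B, N, D = 2, 196, 768
    student = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))  # stand-in for the ViT encoder
    tokens = torch.randn(B, N, D)               # patch embeddings of an image
    mask = torch.rand(B, N) < 0.4               # ~40% of patches treated as masked out
    teacher_feat = torch.randn(B, N, D)         # placeholder for the frozen CLIP vision features
    loss = masked_feature_loss(student(tokens), teacher_feat, mask)
    loss.backward()
    print(float(loss))
```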
Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of which pretraining task fits best with the frozen setting, how to make the frozen setting more flexible for various downstream tasks, and what effect larger model sizes have. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
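A minimal sketch of the frozen setting discussed above, with assumed stand-ins (a torchvision ResNet-50 instead of the Swin backbones studied in the paper, a plain linear classification head, and no pretrained weights loaded so the snippet runs offline): the backbone is kept fixed and only the lightweight task-specific head receives gradient updates.

```python
# Minimal sketch: freeze a pretrained image backbone and train only a task head.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights=None)      # in practice, pass pretrained weights here
backbone.fc = nn.Identity()                                # expose the 2048-d pooled features
for p in backbone.parameters():
    p.requires_grad_(False)                                # freeze every backbone parameter
backbone.eval()                                            # also freeze BatchNorm statistics

head = nn.Linear(2048, 10)                                 # lightweight downstream head (10-way classification)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)  # only the head is updated

images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 10, (4,))
with torch.no_grad():
    feats = backbone(images)                               # frozen feature extraction
loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()
optimizer.step()
print(float(loss))
```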
Vision Transformers (ViT) have shown promise in vision tasks, including low-level ones, while U-Net remains dominant in score-based diffusion models. In this paper, we conduct a systematic empirical study of ViT-based architectures in diffusion models. Our results suggest that adding extra long skip connections (as in U-Net) to ViT is crucial for diffusion models. The new ViT architecture, together with other improvements, is referred to as U-ViT. On several popular vision datasets, U-ViT achieves generation results competitive with state-of-the-art U-Nets, while requiring a comparable amount of parameters and computation, if not less.
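A minimal sketch of the architectural idea above (assumed structure, not the official U-ViT code): shallow and deep transformer blocks are paired by long skip connections, merged via concatenation followed by a linear projection, U-Net style.

```python
# Minimal sketch: a tiny transformer backbone with long skip connections.
import torch
import torch.nn as nn

class TinyUViT(nn.Module):
    def __init__(self, dim=128, depth=6, heads=4):
        super().__init__()
        assert depth % 2 == 0
        block = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.in_blocks = nn.ModuleList(block() for _ in range(depth // 2))
        self.mid_block = block()
        self.out_blocks = nn.ModuleList(block() for _ in range(depth // 2))
        # one projection per long skip: concat(shallow, deep) -> dim
        self.skip_proj = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(depth // 2))

    def forward(self, x):                       # x: (B, N, dim) noisy-image tokens (plus time/condition tokens in practice)
        skips = []
        for blk in self.in_blocks:
            x = blk(x)
            skips.append(x)                     # stash shallow features
        x = self.mid_block(x)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            x = proj(torch.cat([x, skips.pop()], dim=-1))   # long skip connection
            x = blk(x)
        return x

if __name__ == "__main__":
    model = TinyUViT()
    print(model(torch.randn(2, 64, 128)).shape)   # torch.Size([2, 64, 128])
```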
Video anomaly detection aims to find events in videos that do not conform to expected behavior. Prevailing methods mainly detect anomalies through snippet reconstruction errors or future frame prediction errors. However, such errors are highly dependent on the local context of the current snippet and lack an understanding of normality. To address this issue, we propose to detect anomalous events not only by the local context, but also according to the consistency between the test event and the knowledge about normality gathered from the training data. Specifically, we propose a novel two-stream framework based on context recovery and knowledge retrieval, where the two streams complement each other. For the context-recovery stream, we propose a spatiotemporal U-Net that fully exploits motion information to predict future frames. In addition, we propose a maximum local error mechanism to alleviate the problem of large recovery errors caused by complex foreground objects. For the knowledge-retrieval stream, we propose an improved learnable locality-sensitive hashing, which optimizes the hash functions via a Siamese network and a mutual difference loss. The knowledge about normality is encoded and stored in hash tables, and the distance between a test event and the knowledge representation is used to reveal the probability of an anomaly. Finally, we fuse the anomaly scores from the two streams to detect anomalies. Extensive experiments demonstrate the effectiveness and complementarity of the two streams, and the proposed two-stream framework achieves state-of-the-art performance on four datasets.
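A minimal sketch of the final fusion step described above, under assumed choices (min-max normalization per stream and a fixed weighted sum; the paper's exact normalization and fusion rule may differ): per-frame errors from the context-recovery stream and per-frame retrieval distances from the knowledge stream are combined into one anomaly score.

```python
# Minimal sketch: fuse the two streams' per-frame scores into one anomaly score.
import numpy as np

def min_max_normalize(scores):
    scores = np.asarray(scores, dtype=np.float64)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-8)

def fuse_anomaly_scores(context_errors, retrieval_distances, weight=0.5):
    """context_errors: future-frame prediction errors from the context-recovery
    stream; retrieval_distances: distances to the stored normality knowledge
    from the retrieval stream. Higher fused score means more anomalous."""
    c = min_max_normalize(context_errors)
    r = min_max_normalize(retrieval_distances)
    return weight * c + (1.0 - weight) * r

if __name__ == "__main__":
    ctx = [0.10, 0.12, 0.80, 0.11]        # a spike where the future frame is hard to predict
    ret = [0.20, 0.25, 0.90, 0.22]        # the same frame is far from any stored normal pattern
    print(fuse_anomaly_scores(ctx, ret))  # the third frame stands out as anomalous
```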
Knowledge distillation (KD) transfers knowledge from a high-capacity teacher network to strengthen a smaller student. Existing methods focus on excavating knowledge hints and transferring the whole of the knowledge to the student. However, since the knowledge presents different values to the student at different learning stages, knowledge redundancy arises. In this paper, we propose Knowledge Condensation Distillation (KCD). Specifically, the knowledge value on each sample is dynamically estimated, and based on an Expectation-Maximization (EM) framework, a compact knowledge set is iteratively condensed from the teacher to guide the student's learning. Our approach can easily be built on top of off-the-shelf KD methods, with no extra training parameters and negligible computational overhead. Thus, it presents a new perspective on KD, in which a student that actively identifies the teacher's knowledge can learn more effectively and efficiently. Experiments on standard benchmarks show that the proposed KCD can well boost the performance of student models, with even higher distillation efficiency. Code is available at https://github.com/dzy3/kcd.
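The sketch below is a heavily simplified illustration of the condensation idea, not the paper's EM formulation: the "knowledge value" of each sample is approximated here by the current teacher-student disagreement (a KL divergence), and distillation is restricted to the highest-value subset, which shrinks over training.

```python
# Simplified sketch: score each sample's knowledge value and condense the
# distillation set to the most valuable samples.
import torch
import torch.nn.functional as F

@torch.no_grad()
def knowledge_value(student, teacher, x, temperature=4.0):
    """Per-sample KL(teacher || student) as a proxy for how much the teacher's
    hint is still worth to the student (an illustrative choice)."""
    t = F.log_softmax(teacher(x) / temperature, dim=-1)
    s = F.log_softmax(student(x) / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="none").sum(dim=-1)

def condense_indices(student, teacher, data, keep_ratio=0.5):
    """Return indices of the `keep_ratio` fraction of samples with the highest
    knowledge value; distillation is then applied only to this compact set."""
    values = knowledge_value(student, teacher, data)
    k = max(1, int(keep_ratio * len(values)))
    return values.topk(k).indices

if __name__ == "__main__":
    torch.manual_seed(0)
    teacher = torch.nn.Linear(32, 10)      # stand-ins for the real teacher/student networks
    student = torch.nn.Linear(32, 10)
    data = torch.randn(128, 32)
    for epoch, ratio in enumerate([1.0, 0.7, 0.5]):   # condense more aggressively over time
        idx = condense_indices(student, teacher, data, keep_ratio=ratio)
        print(f"epoch {epoch}: distilling on {len(idx)} / {len(data)} samples")
        # ...run the usual KD loss on data[idx] here...
```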
We apply a variational autoencoder (VAE) to LAMOST-K2 low-resolution spectra to detect the magnetic activity of stars in the K2 field. After training on the spectra of selected inactive stars, the VAE model can efficiently generate the synthetic reference templates required by the spectral subtraction procedure, without knowing any stellar parameters. We then detect peculiar spectral features in the sample, such as chromospheric emission, strong emission lines, and lithium absorption. We measure the emission in the chromospheric activity indicators, the H$\alpha$ and Ca II infrared triplet (IRT) lines, to quantify the stellar magnetic activity. The excess emission in the H$\alpha$ and Ca II IRT lines of active stars correlates well with the rotation periods and the amplitudes of the light curves derived from K2 photometry. We degrade the LAMOST spectra to mimic the slitless spectra of the Chinese Space Station Telescope (CSST) and apply the VAE to the simulated data. For cool active stars, we find good agreement between the equivalent widths (EWs) of the H$\alpha$ line derived from the spectra at the two resolutions. The results demonstrate the ability to identify magnetically active stars in future CSST surveys, which will provide an unprecedentedly large database of low-resolution spectra together with simultaneous multi-band stellar photometry.
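A minimal sketch of the spectral-subtraction step described above, with assumed inputs (a synthetic wavelength grid, Gaussian line profiles, and an illustrative integration window around H$\alpha$): the VAE-generated inactive template is subtracted from the observed spectrum and the residual emission is integrated into an equivalent width.

```python
# Minimal sketch: excess-emission equivalent width from spectral subtraction.
import numpy as np

def excess_equivalent_width(wave, observed, template, line_center=6562.8, half_width=2.5):
    """wave in Angstrom; observed/template are continuum-normalized fluxes on the
    same grid. Returns the EW (Angstrom) of the residual emission in a window
    around the line center (the window half-width is an illustrative choice)."""
    residual = observed - template                 # subtraction removes the photospheric contribution
    in_band = np.abs(wave - line_center) <= half_width
    r, w = residual[in_band], wave[in_band]
    return float(np.sum(0.5 * (r[1:] + r[:-1]) * np.diff(w)))   # trapezoidal integration

if __name__ == "__main__":
    wave = np.linspace(6540.0, 6585.0, 901)
    template = 1.0 - 0.6 * np.exp(-0.5 * ((wave - 6562.8) / 1.2) ** 2)      # inactive star: H-alpha in absorption
    observed = template + 0.3 * np.exp(-0.5 * ((wave - 6562.8) / 1.0) ** 2)  # chromospheric fill-in/emission
    print(f"excess H-alpha EW ~ {excess_equivalent_width(wave, observed, template):.3f} A")
```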
An important goal of self-supervised learning is to enable model pre-training to benefit from almost unlimited data. However, one recently popular approach, masked image modeling (MIM), is suspected to be unable to benefit from larger data. In this work, we break this misconception through extensive experiments, with data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes ranging from 49 million to 1 billion parameters, and training lengths ranging from 125K to 500K iterations. Our study reveals that: (i) masked image modeling is also demanding on larger data; we observe that very large models overfit on relatively small data; (ii) the length of training matters: large models trained with masked image modeling can benefit from more data with longer training; (iii) the validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks. This observation allows us to evaluate pre-trained models in advance without costly trial-and-error assessments on downstream tasks. We hope our findings advance the understanding of masked image modeling in terms of its scaling ability.
Masked image modeling (MIM) learns representations with remarkably good fine-tuning performance, overshadowing previously prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing step in the form of feature distillation (FD). Feature distillation converts the old representations into new representations that have a few desirable properties, just like the representations produced by MIM. These properties, which we collectively refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnostic tools. With these properties, the new representations show strong fine-tuning performance. Specifically, contrastive self-supervised learning methods become as competitive in fine-tuning as state-of-the-art masked image modeling (MIM) algorithms. The fine-tuning performance of CLIP models is also significantly improved, with a CLIP ViT-L model reaching 89.0% top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy on ADE20K semantic segmentation is improved by +1.5 mIoU to 61.4 mIoU, setting a new record. More importantly, our work provides a way for future research to focus more effort on the generality and scalability of the learned representations without being pre-occupied with optimization friendliness, since it can be enhanced rather easily. The code will be available at https://github.com/swintransformer/feature-distillation.
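A minimal sketch of a feature-distillation objective in the spirit described above, under simplifying assumptions (the paper's exact normalization, loss, and architecture details may differ): a new encoder regresses the per-token normalized features of the frozen old encoder.

```python
# Minimal sketch: distill a frozen "old" representation into a new encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(new_feat, old_feat):
    """new_feat, old_feat: (B, N, D) token features. The frozen teacher features
    are normalized per token (LayerNorm without affine parameters) before the
    regression, an illustrative way to equalize feature scales."""
    target = F.layer_norm(old_feat, old_feat.shape[-1:])
    return F.smooth_l1_loss(new_feat, target)

if __name__ == "__main__":
    B, N, D = 2, 196, 768
    old_encoder = nn.Linear(D, D)                 # stand-in for the frozen pretrained model (e.g. a contrastive/CLIP encoder)
    new_encoder = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))  # stand-in for the distilled student
    for p in old_encoder.parameters():
        p.requires_grad_(False)
    tokens = torch.randn(B, N, D)
    with torch.no_grad():
        old_feat = old_encoder(tokens)
    loss = feature_distillation_loss(new_encoder(tokens), old_feat)
    loss.backward()
    print(float(loss))
```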