视觉模型最近在许多计算机视觉任务上显示出巨大的潜力。同时,与线性探针相比,先前的工作表明,与线性探针相比,这是较少的图像识别的迅速调整,可以在很少的图像识别上获得卓越的性能。在实际应用程序中,相关的几个射击任务是相关的,尤其是在专业领域。但是,以前的工作忽略了此类信息。受到以下事实的启发,即通过多任务学习通常可以提高性能,我们提出了一种新颖的方法softcpt(迅速调整的软上下文共享),以微调多个目标几个目标任务的预训练的视觉模型, 同时。具体来说,我们设计了一个任务共享的元网络,以使用预定义的任务名称以及可学习的元提示为输入为每个任务生成提示向量。因此,所有任务的迅速向量将以软的方式共享。该共享的元网络的参数以及元提示向量都在所有目标任务的联合培训集中调整。在三个多任务少量数据集上进行的广泛实验表明,SoftCpt的表现优于代表性的单任务提示方法Coop [78],这意味着多任务学习在视觉及时及时调整中的有效性。源代码和数据将公开可用。
translated by 谷歌翻译
姿势注册在视觉和机器人技术中至关重要。本文重点介绍了无初始化姿势注册的挑战性任务,最高为7DOF,用于均质和异质测量。虽然最近基于学习的方法显示了使用可区分求解器的希望,但它们要么依赖于启发式定义的对应关系,要么易于局部最小值。我们提出了一个可区分的相关(DPC)求解器,该求解器是全球收敛性且无对应的。当与简单的特征提取网络结合使用时,我们的一般框架DPCN ++允许使用任意初始化的多功能姿势注册。具体而言,特征提取网络首先从一对均质/异质测量值中学习致密特征网格。然后将这些特征网格转换为基于傅立叶变换和球形径向聚集的翻译和比例不变频谱表示形式,将翻译转换和从旋转中脱钩。接下来,使用DPC求解器在频谱中独立有效地估计旋转,比例和翻译。整个管道都是可区分和训练的端到端。我们评估了DCPN ++在多种注册任务上,以不同的输入方式,包括2D Bird的视图图像,3D对象和场景测量以及医疗图像。实验结果表明,DCPN ++的表现优于经典和基于学习的基础线,尤其是在部分观察到的异质测量方面。
translated by 谷歌翻译
班级增量学习(CIL)吸引了很多关注,但是大多数现有相关的作品都集中在微调整个表示模型上,这不可避免地导致了许多灾难性的遗忘。相比之下,使用语义丰富的预训练的表示模型,参数 - 辅助调整(PAT)仅更改很少的参数来学习新的视觉概念。最近的研究证明,基于PAT的CIL自然可以避免像大多数现有方法一样通过重播或蒸馏而战斗。但是,我们发现基于PAT的CIL仍然面临严重的语义漂移,这是由分类器学习偏见在不同学习阶段引起的问题,这大大降低了基于PAT的CIL的性能。为了解决这个问题,我们提出了增量原型调整(IPT),这是一种简单但有效的方法,它调整了分类和学习示例原型的类别原型,以补偿语义漂移。广泛的实验表明,我们的方法可以有效地补偿语义漂移。与经过良好训练的VIT骨架和其他PAT方法相结合,IPT超过了主流学习基准的最新基准。
translated by 谷歌翻译
随着计算能力的兴起,使用数据驱动的方法来共同设计机器人的形态和控制器已成为一种可行的方法。然而,评估每个形态下控制器的适应性是耗时的。作为开创性数据驱动的方法,共同适应利用了双NETWORK机制,目的是学习以形态学参数为条件的Q功能,以取代对各种候选者的传统评估,从而加快优化的速度。在本文中,我们发现共同适应在参数传输期间训练和状态行动分布变化期间的勘探误差的存在,这损害了性能。我们提出了在线和离线RL方法的并发网络的框架。通过灵活地利用行为克隆术语,我们可以减轻上述问题对结果的影响。进行仿真和物理实验以证明我们所提出的方法优于基线算法,这说明了所提出的方法是发现形态和控制器的最佳组合的有效方法。
translated by 谷歌翻译
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
translated by 谷歌翻译
Blind image quality assessment (BIQA) remains challenging due to the diversity of distortion and image content variation, which complicate the distortion patterns crossing different scales and aggravate the difficulty of the regression problem for BIQA. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has been done on learning strategies to make the regression model produce better performance. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and better optimize the regression issue to align with the law of human learning process from easy to hard. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets, and the experimental results indicate that the performance of PMT-IQA is superior to the comparison approaches, and both MS and PMT modules improve the model's performance.
translated by 谷歌翻译
Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) shows no such ability beyond naive repetition. Evaluating generated music is a challenging task, more so is evaluating drum grooves with little precedence in literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译