Existing studies on gait recognition have mainly focused on in-the-lab scenarios. Since people live in real-world environments, gait recognition in the wild is a more practical problem that has recently attracted attention from the multimedia and computer vision communities. Current methods that achieve state-of-the-art performance on existing benchmarks perform far less accurately on recently proposed in-the-wild datasets, because they can hardly model the varied temporal dynamics of gait sequences in unconstrained scenes. Therefore, this paper presents a novel multi-hop temporal switch method for effective temporal modeling of gait patterns in real-world scenarios. Specifically, we design a new gait recognition network, named Multi-hop Temporal Switch Network (MTSGait), that learns spatial features and multi-scale temporal features simultaneously. Unlike existing methods that use 3D convolutions for temporal modeling, our MTSGait models the temporal dynamics of gait sequences with 2D convolutions. In this way, it achieves high efficiency with fewer model parameters and reduces the difficulty of optimization compared with 3D-convolution-based models. Based on the specific design of the 2D convolution kernels, our method can eliminate feature misalignment between adjacent frames. In addition, a new sampling strategy, i.e., non-cyclic continuous sampling, is proposed to let the model learn more robust temporal features. Finally, the proposed method achieves superior performance on two public gait-in-the-wild datasets, i.e., GREW and Gait3D, compared with state-of-the-art methods.
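As an illustration of how 2D convolutions can carry temporal information once features are exchanged across frames, here is a minimal PyTorch sketch of a multi-hop temporal switch, modeled on public channel-shift designs; the hop sizes, channel fractions, and module layout are assumptions, not the authors' exact MTSGait design.

```python
import torch
import torch.nn as nn

class MultiHopTemporalSwitch(nn.Module):
    """Sketch of a multi-hop temporal switch: a slice of channels is exchanged
    with frames +/- `hop` steps away, then a per-frame 2D conv mixes the
    switched features. Hop sizes and channel fractions are guesses."""
    def __init__(self, channels, hops=(1, 2), fold_div=8):
        super().__init__()
        self.hops = hops
        self.fold = channels // fold_div
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):  # x: (batch, frames, channels, H, W)
        b, t, c, h, w = x.shape
        out = x.clone()
        for i, hop in enumerate(self.hops):
            s = slice(i * self.fold, (i + 1) * self.fold)
            out[:, :-hop, s] = x[:, hop:, s]            # pull features from `hop` frames ahead
            s2 = slice(c - (i + 1) * self.fold, c - i * self.fold)
            out[:, hop:, s2] = x[:, :-hop, s2]          # push features from `hop` frames back
        # the conv is purely 2D; temporal mixing came from the switch above
        return self.conv(out.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)

# toy usage: 4 frames of 32-channel 16x16 feature maps
feats = torch.randn(2, 4, 32, 16, 16)
print(MultiHopTemporalSwitch(32)(feats).shape)  # torch.Size([2, 4, 32, 16, 16])
```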
The COVID-19 pandemic threatens global health. Many studies have applied deep convolutional neural networks (CNNs) to recognize COVID-19 from chest 3D computed tomography (CT) scans. Recent works show that no single model generalizes well across CT datasets from different countries, and that designing a model for a specific dataset requires expertise. Neural architecture search (NAS), which aims to search for models automatically, has therefore become an attractive solution. To reduce the search cost on large 3D CT datasets, most NAS-based works adopt the weight-sharing (WS) strategy, making all candidate models share weights within a supernet. However, WS inevitably causes search instability, which leads to inaccurate model estimation. In this work, we propose an efficient evolutionary multi-objective architecture search (EMARS) framework. We introduce a new objective, namely potential, which helps exploit promising models and thereby indirectly reduces the number of models involved in weight training, alleviating the search instability. We demonstrate that under the objectives of accuracy and potential, EMARS can balance exploitation and exploration, i.e., it reduces search time and finds better models. Our searched models are small and perform better than prior works on three public COVID-19 3D CT datasets.
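A toy sketch of the two-objective evolutionary loop described above; the paper's exact definition of potential is not given here, so the sketch treats it as an assumed proxy (recent accuracy improvement) purely for illustration.

```python
import random

def dominates(a, b):
    """a dominates b if it is no worse on both objectives and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def evolve(population, evaluate, mutate, generations=20):
    """Tiny two-objective evolutionary loop in the spirit of EMARS.
    `evaluate(arch)` must return (accuracy, potential); both are maximized."""
    scored = [(arch, evaluate(arch)) for arch in population]
    for _ in range(generations):
        parent = random.choice(scored)[0]
        child = mutate(parent)
        scored.append((child, evaluate(child)))
        # keep only the non-dominated front, so promising (high-potential)
        # but not-yet-accurate models survive alongside accurate ones
        scored = [s for s in scored
                  if not any(dominates(o[1], s[1]) for o in scored if o is not s)]
    return scored

# toy demo: an "architecture" is just a width; potential = recent accuracy gain
history = {}
def evaluate(w):
    acc = 1 - 1.0 / w + random.uniform(-0.01, 0.01)   # fake accuracy curve
    pot = acc - history.get(w, acc)                   # improvement since last eval
    history[w] = acc
    return acc, pot

front = evolve([2, 4, 8], evaluate, mutate=lambda w: max(1, w + random.choice([-1, 1])))
print([(arch, round(acc, 3)) for arch, (acc, pot) in front])
```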
Deep learning has achieved great success in computer vision, while medical image segmentation (MIS) remains a challenge due to the scarcity of data annotations. Meta-learning techniques for few-shot segmentation (Meta-FSS) have been widely used to tackle this challenge, yet they neglect possible distribution shifts between the query images and the support set. In contrast, an experienced clinician can perceive and address such shifts by borrowing information from the query image, and then fine-tune or calibrate his (or her) prior cognitive model accordingly. Inspired by this, we propose Q-Net, a query-informed Meta-FSS approach, which mimics in spirit the learning mechanism of an expert clinician. We build Q-Net on ADNet, a recently proposed anomaly-detection-inspired method. Specifically, we add two query-informed computation modules to ADNet, namely a query-informed threshold adaptation module and a query-informed prototype refinement module. Combined with a dual-path extension of the feature extraction module, Q-Net achieves state-of-the-art performance on two widely used datasets, consisting of abdominal MR images and cardiac MR images, respectively. Our work sheds new light on improving Meta-FSS techniques by leveraging query information.
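A minimal sketch of what query-informed prototype refinement and threshold adaptation could look like; the blending rule, the threshold heuristic, and all names are assumptions for illustration, not the authors' ADNet-based implementation.

```python
import torch

def refine_prototype(proto, query_feats, thresh=0.8):
    """Sketch of query-informed prototype refinement.
    proto: (C,) support foreground prototype; query_feats: (C, H, W)."""
    c, h, w = query_feats.shape
    feats = query_feats.reshape(c, -1)                            # (C, HW)
    sim = torch.cosine_similarity(feats, proto[:, None], dim=0)   # (HW,)
    confident = sim > thresh            # query pixels that already look like the class
    if confident.any():
        query_proto = feats[:, confident].mean(dim=1)
        proto = 0.5 * proto + 0.5 * query_proto   # blend support and query evidence
    return proto

def adapt_threshold(proto, query_feats):
    """Query-informed threshold adaptation could, analogously, derive the
    threshold from the query's own similarity statistics instead of a fixed
    value; the mean+std rule here is an assumed heuristic."""
    sim = torch.cosine_similarity(
        query_feats.reshape(query_feats.shape[0], -1), proto[:, None], dim=0)
    return sim.mean() + sim.std()

proto = torch.randn(64)
q = torch.randn(64, 32, 32)
print(refine_prototype(proto, q, thresh=adapt_threshold(proto, q)).shape)  # torch.Size([64])
```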
Stochastic weight averaging (SWA) is recognized as a simple yet effective approach to improving the generalization of stochastic gradient descent (SGD) for training deep neural networks (DNNs). A common insight to explain its success is that averaging the weights along an SGD trajectory equipped with cyclical or high constant learning rates can discover wider optima, which then lead to better generalization. We give a new insight that does not concur with the above. We characterize that the performance of SWA depends highly on how far the SGD process that runs before SWA has converged, and that the operation of weight averaging only contributes to variance reduction. This new insight suggests practical guidelines for better algorithm design. As an instantiation, we show that, following an insufficiently converged SGD process, running SWA repeatedly in sequence leads to continual incremental benefits in generalization. Our findings are corroborated by extensive experiments across different network architectures, including a baseline CNN, PreResNet-164, WideResNet-28-10, VGG16, ResNet-50, ResNet-152, and DenseNet-161, and different datasets, including CIFAR-{10,100} and ImageNet.
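The procedure under discussion is easy to make concrete with PyTorch's built-in SWA utilities; the toy sketch below starts averaging only late in training, in line with the abstract's point that SWA's benefit hinges on how far SGD has converged before averaging begins.

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR

# minimal SWA loop on a toy regression task
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)      # keeps the running average of the weights
swa_sched = SWALR(opt, swa_lr=0.05)   # high constant learning rate during averaging

x, y = torch.randn(256, 10), torch.randn(256, 1)
for epoch in range(100):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if epoch >= 75:                         # average only after SGD has (mostly) converged,
        swa_model.update_parameters(model)  # which is what the paper argues SWA hinges on
        swa_sched.step()

print(nn.functional.mse_loss(swa_model(x), y).item())
```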
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS-token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification, using the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 mIoU higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, namely by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
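To make the first finding concrete, here is a hedged sketch of a token-relation distillation loss; TinyMIM distills several kinds of relations, while this sketch uses a plain token-to-token similarity relation for illustration. A convenient property is that relation matrices are (tokens x tokens), so teacher and student feature dimensions need not match.

```python
import torch
import torch.nn.functional as F

def token_relation_loss(student_tokens, teacher_tokens, tau=1.0):
    """Distill token-to-token relations instead of raw features.
    tokens: (batch, num_tokens, dim); dims may differ between teacher
    and student, which is exactly why relations are a convenient target."""
    def relation(t):
        t = F.normalize(t, dim=-1)
        return F.log_softmax(t @ t.transpose(1, 2) / tau, dim=-1)
    s_rel = relation(student_tokens)
    with torch.no_grad():
        t_rel = relation(teacher_tokens)
    # KL divergence between the two relation distributions
    return F.kl_div(s_rel, t_rel, log_target=True, reduction="batchmean")

# toy usage: teacher is wider (768-d) than the student (192-d), but the
# relation matrices are both (196 x 196), so the loss still applies
loss = token_relation_loss(torch.randn(2, 196, 192), torch.randn(2, 196, 768))
print(loss.item())
```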
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
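A rough sketch of the core idea, with the sizes, the coordinate encoder, and the source of the image tokens' 3D coordinates all assumed for illustration: 3D coordinates are encoded into positional embeddings added to both modalities' tokens, so a plain transformer decoder can align them without any explicit view transformation.

```python
import torch
from torch import nn

class TinyCMT(nn.Module):
    """Illustrative sketch, not the released CMT: shared coordinate encoding
    is added to image and point-cloud tokens, then DETR-style object queries
    decode boxes directly from the concatenated token sequence."""
    def __init__(self, dim=128, num_queries=50):
        super().__init__()
        self.coord_enc = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.box_head = nn.Linear(dim, 10)  # e.g., center, size, yaw, velocity

    def forward(self, img_tokens, img_coords, pts_tokens, pts_coords):
        # (B, N, dim) tokens with (B, N, 3) 3D coordinates per token; for image
        # tokens the coordinates would come from camera geometry (assumed here)
        mem = torch.cat([img_tokens + self.coord_enc(img_coords),
                         pts_tokens + self.coord_enc(pts_coords)], dim=1)
        hs = self.decoder(self.queries.expand(img_tokens.size(0), -1, -1), mem)
        return self.box_head(hs)  # (B, num_queries, 10) raw box parameters

m = TinyCMT()
boxes = m(torch.randn(2, 100, 128), torch.randn(2, 100, 3),
          torch.randn(2, 200, 128), torch.randn(2, 200, 3))
print(boxes.shape)  # torch.Size([2, 50, 10])
```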
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
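A minimal sketch of a NAIVEATTACK-style poisoning step; the trigger shape, position, and poison rate are illustrative assumptions. DOORPING would instead re-optimize the trigger at each distillation iteration rather than stamping it once up front.

```python
import torch

def add_trigger(images, labels, target_class=0, rate=0.1, patch=3, value=1.0):
    """Stamp a small patch onto a fraction of the raw images *before*
    distillation and relabel them to the target class.
    images: (N, C, H, W) in [0, 1]; labels: (N,)."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(rate * len(images))
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -patch:, -patch:] = value   # bottom-right corner trigger
    labels[idx] = target_class                 # poisoned samples point to the target
    return images, labels, idx

x, y = torch.rand(100, 3, 32, 32), torch.randint(0, 10, (100,))
px, py, poisoned = add_trigger(x, y)
# the poisoned set then goes through dataset distillation as usual; the
# distilled synthetic set inherits the trigger-to-target association
print(len(poisoned), py[poisoned[:5]])
```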
Blind image quality assessment (BIQA) remains challenging due to the diversity of distortions and the variation of image content, which complicate the distortion patterns across different scales and aggravate the difficulty of the regression problem in BIQA. However, existing BIQA methods often fail to consider multi-scale distortion patterns and image content, and little research has been done on learning strategies that make the regression model perform better. In this paper, we propose a simple yet effective Progressive Multi-Task Image Quality Assessment (PMT-IQA) model, which contains a multi-scale feature extraction module (MS) and a progressive multi-task learning module (PMT), to help the model learn complex distortion patterns and better optimize the regression task, mirroring the human learning process from easy to hard. To verify the effectiveness of the proposed PMT-IQA model, we conduct experiments on four widely used public datasets; the experimental results indicate that the performance of PMT-IQA is superior to the comparison approaches, and that both the MS and PMT modules improve the model's performance.
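Since the abstract does not spell out the two tasks, the sketch below assumes an easy coarse quality-level classification and a hard continuous score regression, with the training weight shifting progressively from the former to the latter; all module shapes are illustrative.

```python
import torch
from torch import nn

class PMTIQASketch(nn.Module):
    """Illustrative sketch of the PMT-IQA idea: multi-scale pooled features
    (MS) feed two heads, and training progresses from easy to hard (PMT)."""
    def __init__(self, dim=64, levels=5):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, 3, stride=2, padding=1)
        self.scales = nn.ModuleList(
            [nn.AdaptiveAvgPool2d(s) for s in (1, 2, 4)])  # multi-scale pooling
        feat = dim * (1 + 4 + 16)
        self.cls_head = nn.Linear(feat, levels)   # easy: coarse quality level
        self.reg_head = nn.Linear(feat, 1)        # hard: continuous quality score

    def forward(self, x):
        f = self.stem(x)
        ms = torch.cat([p(f).flatten(1) for p in self.scales], dim=1)
        return self.cls_head(ms), self.reg_head(ms).squeeze(1)

def progressive_loss(logits, score_pred, level_y, score_y, progress):
    """progress goes 0 -> 1 over training: start on the easy task, end on the hard one."""
    easy = nn.functional.cross_entropy(logits, level_y)
    hard = nn.functional.mse_loss(score_pred, score_y)
    return (1 - progress) * easy + progress * hard

m = PMTIQASketch()
logits, score = m(torch.randn(4, 3, 64, 64))
print(progressive_loss(logits, score, torch.randint(0, 5, (4,)), torch.rand(4), 0.3))
```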
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) show no such ability beyond naive repetition. Evaluating generated music is a challenging task; evaluating drum grooves, for which there is little precedent in the literature, is even more so. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
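One plausible way to put drum MIDI in front of a text LM is to serialize note-on events into a token string; the scheme below is an assumption for illustration, not the paper's actual format (a real pipeline might parse the events with a library such as mido).

```python
# assumed token scheme: 'wait:<steps> hit:<pitch>' per drum hit, so a
# text LM can be finetuned on lines like the one printed at the bottom
def drums_to_text(events, ticks_per_step=120):
    """events: list of (tick, midi_pitch) note-on events."""
    tokens, prev = [], 0
    for tick, pitch in sorted(events):
        steps = (tick - prev) // ticks_per_step
        if steps:
            tokens.append(f"wait:{steps}")
        tokens.append(f"hit:{pitch}")
        prev = tick
    return " ".join(tokens)

def text_to_drums(text, ticks_per_step=120):
    """Inverse mapping, so generated text can be rendered back to MIDI."""
    events, tick = [], 0
    for tok in text.split():
        kind, val = tok.split(":")
        if kind == "wait":
            tick += int(val) * ticks_per_step
        else:
            events.append((tick, int(val)))
    return events

groove = [(0, 36), (0, 42), (240, 42), (480, 38), (480, 42)]  # kick/hat/snare
line = drums_to_text(groove)
print(line)  # hit:36 hit:42 wait:2 hit:42 wait:2 hit:38 hit:42
assert text_to_drums(line) == groove
```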
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold: First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature level and instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modifications. When benchmarking results on the COCO dataset for the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and model will be available.
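A hedged sketch of the first, feature-level reference (a mask-pooled class center re-weighting query features); the shapes and the sigmoid re-weighting are assumptions, not the released RefT code.

```python
import torch
import torch.nn.functional as F

def reference_once(query_feats, support_feats, support_mask):
    """Mask-pool support features into a dynamic class center, then re-weight
    query features by their affinity to it.
    query_feats/support_feats: (B, C, H, W); support_mask: (B, 1, H, W) in {0,1}."""
    # mask-based average pooling -> one class center per image
    center = (support_feats * support_mask).sum((2, 3)) / \
             support_mask.sum((2, 3)).clamp(min=1)
    affinity = torch.einsum("bchw,bc->bhw",
                            F.normalize(query_feats, dim=1),
                            F.normalize(center, dim=1))
    return query_feats * torch.sigmoid(affinity).unsqueeze(1)  # feature-level enhancement

q = torch.randn(2, 64, 32, 32)
s = torch.randn(2, 64, 32, 32)
m = (torch.rand(2, 1, 32, 32) > 0.7).float()
print(reference_once(q, s, m).shape)  # torch.Size([2, 64, 32, 32])
# the second, instance-level reference would analogously cross-attend the
# query's object queries to the support's base-trained object queries
```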