Transformers have quickly shone in the computer vision world since the emergence of Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) appears to be challenged by increasingly effective Transformer-based models. Very recently, several advanced convolutional models have struck back with large kernels motivated by the local-but-large attention mechanism, showing appealing performance and efficiency. While one of them (i.e., RepLKNet) impressively manages to scale the kernel size to 31x31 with improved performance, the performance starts to saturate as the kernel size continues to grow, compared with the scaling trend of advanced ViTs such as Swin Transformer. In this paper, we explore the possibility of training extreme convolutions larger than 31x31 and test whether the performance gap can be eliminated by strategically enlarging convolutions. This study ends with a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale up kernels to 61x61 with better performance. Built on this recipe, we propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with 51x51 kernels that performs on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures such as ConvNeXt and RepLKNet on ImageNet classification as well as on typical downstream tasks. Our code is available at https://github.com/vita-group/slak.
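As a rough illustration of the decomposition side of this recipe (a sketch only; the module name, kernel sizes, and omission of the dynamic-sparsity component are our assumptions, not the released SLaK code), a single large depthwise kernel can be replaced by two parallel rectangular kernels plus a small square one:

```python
import torch
import torch.nn as nn

class DecomposedLargeKernel(nn.Module):
    """Sketch of the kernel-decomposition idea: one large M x M depthwise
    conv is replaced by two parallel rectangular depthwise convs (M x N and
    N x M) plus a small N x N kernel, with the three outputs summed.
    The dynamic-sparsity part of the recipe is omitted here."""
    def __init__(self, dim, m=51, n=5):
        super().__init__()
        self.tall = nn.Conv2d(dim, dim, (m, n), padding=(m // 2, n // 2), groups=dim)
        self.wide = nn.Conv2d(dim, dim, (n, m), padding=(n // 2, m // 2), groups=dim)
        self.small = nn.Conv2d(dim, dim, n, padding=n // 2, groups=dim)

    def forward(self, x):
        return self.tall(x) + self.wide(x) + self.small(x)

x = torch.randn(1, 64, 56, 56)
y = DecomposedLargeKernel(64)(x)  # spatial shape preserved: (1, 64, 56, 56)
```

The two rectangular branches cover the 51x51 receptive field at a fraction of the parameter cost of a dense square kernel.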
Compared with convolutional layers, fully-connected (FC) layers are better at modeling long-range dependencies but worse at capturing local patterns, and are therefore usually less favored for image recognition. In this paper, we propose a methodology, Locality Injection, which incorporates local priors into an FC layer by merging the trained parameters of a parallel convolutional kernel into the FC kernel. Locality Injection can be viewed as a novel structural re-parameterization method, since it equivalently converts the structures by transforming the parameters. Based on this, we propose a multi-layer perceptron (MLP) block named RepMLP Block, which uses three FC layers to extract features, and a novel architecture named RepMLPNet. Its hierarchical design distinguishes RepMLPNet from other concurrently proposed vision MLPs. As it produces feature maps of different levels, it qualifies as a backbone model for downstream tasks such as semantic segmentation. Our results show that 1) Locality Injection is a general methodology for MLP models; 2) RepMLPNet has a favorable accuracy-efficiency trade-off compared to other MLPs; and 3) RepMLPNet is the first MLP that transfers seamlessly to Cityscapes semantic segmentation. Code and models are available at https://github.com/dingxiaoh/repmlp.
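To make the merging step concrete, here is a minimal sketch (our own illustration, not the released RepMLP code) of absorbing a trained, bias-free, same-padded conv into a parallel FC kernel. Since both operations are linear in the input, the conv's equivalent FC matrix can be constructed by pushing each basis vector of the flattened input through the conv:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_as_fc_weight(conv_weight, in_ch, h, w):
    """Build the FC weight equivalent to a bias-free, same-padded conv.
    Column i of the FC matrix is the flattened conv response to the i-th
    basis vector of the flattened input."""
    n = in_ch * h * w
    eye = torch.eye(n).reshape(n, in_ch, h, w)           # one basis image per row
    out = F.conv2d(eye, conv_weight, padding=conv_weight.shape[-1] // 2)
    return out.reshape(n, -1).t()                        # (out_ch*h*w, in_ch*h*w)

# Merge a trained 3x3 conv branch into a parallel FC kernel.
in_ch, out_ch, h, w = 4, 4, 8, 8
fc = nn.Linear(in_ch * h * w, out_ch * h * w, bias=False)
conv_w = torch.randn(out_ch, in_ch, 3, 3)                # stand-in for trained weights
with torch.no_grad():
    fc.weight += conv_as_fc_weight(conv_w, in_ch, h, w)  # two branches become one FC
```

After the merge, the FC layer alone reproduces the sum of the two branches, so the conv can be discarded at inference time.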
Federated learning (FL) enables machine learning workloads to be distributed from the cloud to resource-limited edge devices. Unfortunately, current deep networks are not only too compute-heavy for inference and training on edge devices, but also too large to communicate updates over bandwidth-constrained networks. In this paper, we develop, implement, and experimentally validate a novel FL framework called Federated Dynamic Sparse Training (FedDST), with which complex neural networks can be deployed and trained with substantially improved efficiency in both on-device computation and in-network communication. At the core of FedDST is a dynamic process that extracts and trains sparse sub-networks from the target full network. With this scheme, "two birds are killed with one stone": instead of full models, each client performs efficient training of its own sparse network, and only sparse networks are transmitted between devices and the cloud. Furthermore, our results show that the dynamic sparsity during FL training accommodates local heterogeneity more flexibly than fixed, shared sparse masks. Dynamic sparsity also naturally introduces an "in-time self-ensembling effect" into the training dynamics, improving FL performance even over dense training. In a realistic and challenging non-i.i.d. FL setting, FedDST consistently outperforms competing algorithms in our experiments: for instance, at any fixed upload data cap on non-iid CIFAR-10, it gains an impressive accuracy advantage over FedAvgM given the same cap; the accuracy gap remains 3% even when FedAvgM is given a 2x upload data cap, further demonstrating the efficacy of FedDST. Code is available at https://github.com/bibikar/feddst.
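A minimal sketch of one such prune-and-regrow mask adjustment (illustrative only; FedDST's actual schedule, layer-wise budgeting, and server-side aggregation differ) might look like:

```python
import torch

def adjust_mask(weight, grad, mask, adjust_frac=0.1):
    """One prune-and-regrow step in the spirit of dynamic sparse training:
    drop the smallest-magnitude active weights, then grow the same number
    of inactive connections where the gradient magnitude is largest, so the
    sparsity level (and hence the upload size) stays fixed. Tensors are
    assumed to be plain (detached) float tensors of identical shape."""
    k = int(adjust_frac * mask.sum().item())
    # Prune: smallest |w| among currently active connections.
    drop_score = weight.abs().masked_fill(mask == 0, float("inf"))
    drop = torch.topk(drop_score.view(-1), k, largest=False).indices
    # Grow: largest |grad| among currently inactive connections.
    grow_score = grad.abs().masked_fill(mask == 1, float("-inf"))
    grow = torch.topk(grow_score.view(-1), k).indices
    mask.view(-1)[drop] = 0.0
    mask.view(-1)[grow] = 1.0
    weight.view(-1)[grow] = 0.0  # regrown connections restart from zero
    return mask
```

Because the mask keeps a fixed number of non-zeros, each client's upload stays within the same sparse budget regardless of how its mask evolves.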
Recent work on sparse neural networks has demonstrated that a sparse subnetwork can be trained independently from scratch to match the performance of its corresponding dense network. However, identifying such sparse subnetworks (winning tickets) involves either a costly iterative train-prune-retrain process (e.g., the Lottery Ticket Hypothesis) or an over-extended training time (e.g., dynamic sparse training). In this work, we draw a unique connection between sparse neural network training and deep ensembling, yielding a novel ensemble learning framework called FreeTickets. Instead of starting from a dense network, FreeTickets randomly initializes a sparse subnetwork and then trains it while dynamically adjusting its sparse mask, producing many diverse sparse subnetworks throughout the training process. FreeTickets is defined as the ensemble of these sparse subnetworks, obtained for free during this one-pass, sparse-to-sparse training, which uses only a fraction of the computational resources required by vanilla dense training. Moreover, despite being an ensemble of models, FreeTickets has even fewer parameters and training FLOPs than a single dense model: this seemingly counter-intuitive outcome is due to the high sparsity of each subnetwork. Compared with the standard dense baseline, FreeTickets exhibits significant all-round improvements in prediction accuracy, uncertainty estimation, robustness, and efficiency. FreeTickets easily outperforms the naive deep ensemble on ImageNet while using only a quarter of the training FLOPs required by the latter. Our results provide insights into the strength of sparse neural networks and suggest that the benefits of sparsity go beyond the usually expected inference efficiency.
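The collection-and-averaging step could look roughly like the following sketch (the class name and snapshot criterion are our assumptions, not the paper's exact procedure):

```python
import copy
import torch

class FreeTicketsEnsemble:
    """Sketch: during one pass of sparse-to-sparse training, snapshot the
    subnetwork after each mask adjustment has settled, then average the
    ensemble's softmax predictions at test time."""
    def __init__(self):
        self.tickets = []

    def collect(self, sparse_model):
        # Each snapshot is one "free ticket" produced by the same training run.
        self.tickets.append(copy.deepcopy(sparse_model).eval())

    @torch.no_grad()
    def predict(self, x):
        probs = [ticket(x).softmax(dim=-1) for ticket in self.tickets]
        return torch.stack(probs).mean(dim=0)
```

Since every snapshot is highly sparse, the whole ensemble can still be cheaper, in parameters and FLOPs, than one dense model.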
The record-breaking performance of deep neural networks (DNNs) comes with heavy parameterization, requiring external dynamic random-access memory (DRAM) for storage. The prohibitive energy of DRAM accesses makes it non-trivial to deploy DNNs on resource-constrained devices, calling for minimizing weight and data movement to improve energy efficiency. We present SmartDeal (SD), an algorithm framework that trades higher-cost memory storage/access for lower-cost computation, in order to aggressively boost storage and energy efficiency in both inference and training. The core of SD is a novel weight decomposition with structural constraints, carefully crafted to unleash the hardware-efficiency potential. Specifically, we decompose each weight tensor as the product of a small basis matrix and a large structurally sparse coefficient matrix whose non-zeros are quantized to powers of 2. The resulting sparse and quantized DNNs enjoy greatly reduced energy for data movement and weight storage, while incurring minimal overhead to recover the original weights thanks to sparse bit-operations and cost-favorable computations. Beyond inference, we take a further leap to embrace energy-efficient training, introducing innovative techniques to address the unique roadblocks that arise in training while preserving the SD structure. We also design a dedicated hardware accelerator that fully utilizes the SD structure to improve real energy efficiency and latency. We conduct experiments on multiple tasks, models, and datasets in different settings. The results show that: 1) applied to inference, SD achieves up to 2.44x energy efficiency, as evaluated via real hardware implementations; 2) applied to training, SD leads to 10.56x and 4.48x reductions in storage and training energy, respectively, with negligible accuracy loss compared to state-of-the-art training baselines. Our source codes are available online.
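A toy sketch of such a decomposition (our own simplified, unstructured variant; the paper enforces structural sparsity and uses a dedicated solver) alternates between quantizing the coefficients and refitting the basis:

```python
import torch

def pow2_quantize(x, min_exp=-7, max_exp=0, prune_thresh=2 ** -8):
    """Round each value to a signed power of two; values below the threshold
    are pruned to exact zero (unstructured here, structured in the paper)."""
    q = x.sign() * torch.pow(2.0, x.abs().clamp(min=1e-12).log2().round().clamp(min_exp, max_exp))
    return torch.where(x.abs() < prune_thresh, torch.zeros_like(q), q)

def smartdeal_decompose(w, basis_dim=8, iters=20):
    """Fit W (out x in) ~= Ce @ B by alternating minimization: Ce is the
    large sparse, power-of-2 coefficient matrix; B is the small dense basis."""
    b = torch.randn(basis_dim, w.shape[1])
    for _ in range(iters):
        ce = pow2_quantize(w @ torch.linalg.pinv(b))  # refit + quantize coefficients
        b = torch.linalg.lstsq(ce, w).solution        # refit basis given Ce
    return ce, b

w = torch.randn(64, 64)
ce, b = smartdeal_decompose(w)
print((ce @ b - w).norm() / w.norm())  # relative reconstruction error
```

Power-of-2 non-zeros let the weight recovery be done with shifts and adds instead of multiplications, which is where the energy savings come from.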
Applications abound in which optimization problems must be solved repeatedly, each time with new (but similar) data. Analytic optimization algorithms can be hand-designed to solve these problems iteratively. On the one hand, data-driven algorithms can "learn to optimize" (L2O) with far fewer iterations, at a per-iteration cost similar to general-purpose optimization algorithms. On the other hand, unfortunately, many L2O algorithms lack convergence guarantees. To fuse the advantages of these approaches, we present a Safe-L2O framework. Safe-L2O updates incorporate a safeguard that guarantees convergence for convex problems with proximal and/or gradient oracles. The safeguard is simple and computationally cheap to implement, and it is activated only when the data-driven L2O update performs poorly or appears to diverge. This yields the numerical benefits of using machine learning to create rapid L2O algorithms while still guaranteeing convergence. Our numerical examples show convergence of Safe-L2O algorithms even when the provided data is not from the distribution of the training data.
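A minimal sketch of the safeguard logic (the function names and the specific sufficient-decrease test are our assumptions, not the paper's exact rule):

```python
def safe_l2o_step(x, l2o_update, fallback_update, residual, alpha=0.99):
    """One safeguarded iteration: accept the learned (L2O) update only if it
    sufficiently decreases a convergence measure, e.g., a fixed-point
    residual; otherwise fall back to the provably convergent classical
    update (e.g., a proximal-gradient step). alpha < 1 forces strict
    progress whenever the learned step is accepted."""
    x_learned = l2o_update(x)
    if residual(x_learned) <= alpha * residual(x):
        return x_learned         # learned step makes enough progress
    return fallback_update(x)    # safeguard kicks in
```

The fallback is only invoked when the learned update misbehaves, so well-trained L2O models keep their speed advantage almost everywhere.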
Vision-language models (VLMs) that are pre-trained on large-scale image-text pairs have demonstrated impressive transferability on a wide range of visual tasks. Transferring knowledge from such powerful pre-trained VLMs is emerging as a promising direction for building effective video recognition models. However, the current exploration is still limited. In our opinion, the greatest charm of pre-trained vision-language models is to build a bridge between visual and textual domains. In this paper, we present a novel framework called BIKE which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We propose a Video Attribute Association mechanism which leverages the Video-to-Text knowledge to generate textual auxiliary attributes to complement video recognition. ii) We also present a Temporal Concept Spotting mechanism which uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner to yield enhanced video representation. The extensive studies on popular video datasets (i.e., Kinetics-400 & 600, UCF-101, HMDB-51 and ActivityNet) show that our method achieves state-of-the-art performance in most recognition scenarios, e.g., general, zero-shot, and few-shot video recognition. To the best of our knowledge, our best model achieves a state-of-the-art accuracy of 88.4% on challenging Kinetics-400 with the released CLIP pre-trained model.
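A rough sketch of such parameter-free temporal weighting (our illustration of the idea; the temperature and pooling details are assumptions, not BIKE's exact formulation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def temporal_concept_spotting(frame_feats, text_feat, tau=0.07):
    """Parameter-free temporal saliency: compare L2-normalized frame
    embeddings (T, D) with an L2-normalized category text embedding (D,),
    softmax the similarities over time, and use them to pool the frames
    into a single video representation."""
    saliency = (frame_feats @ text_feat) / tau          # (T,) frame-category similarity
    weights = saliency.softmax(dim=0)                   # temporal saliency weights
    return (weights[:, None] * frame_feats).sum(dim=0)  # (D,) video feature

frames = F.normalize(torch.randn(16, 512), dim=-1)  # e.g., CLIP frame embeddings
text = F.normalize(torch.randn(512), dim=-1)        # e.g., CLIP text embedding
video = temporal_concept_spotting(frames, text)
```

Because the weighting uses only the pre-trained embeddings, it adds no trainable parameters to the recognition pipeline.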
Cross-view geo-localization aims to estimate the location of a query ground image by matching it against a database of reference geo-tagged aerial images. As an extremely challenging task, its difficulties are rooted in the drastic view changes and different capturing times between the two views. Despite these difficulties, recent works achieve outstanding progress on cross-view geo-localization benchmarks. However, existing methods still suffer from poor performance on cross-area benchmarks, in which the training and testing data are captured from two different regions. We attribute this deficiency to the inability to extract the spatial configuration of visual feature layouts and to models' overfitting on low-level details from the training set. In this paper, we propose GeoDTR, which explicitly disentangles geometric information from raw features and learns the spatial correlations among visual features from aerial and ground pairs with a novel geometric layout extractor module. This module generates a set of geometric layout descriptors, modulating the raw features and producing high-quality latent representations. In addition, we elaborate two categories of data augmentation: (i) layout simulation, which varies the spatial configuration while keeping the low-level details intact; and (ii) semantic augmentation, which alters the low-level details and encourages the model to capture spatial configurations. These augmentations help to improve the performance of cross-view geo-localization models, especially on cross-area benchmarks. Moreover, we propose a counterfactual-based learning process to benefit the geometric layout extractor in exploring spatial information. Extensive experiments show that GeoDTR not only achieves state-of-the-art results but also significantly boosts performance on same-area and cross-area benchmarks.
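A loose sketch of the layout-descriptor idea (our own simplification; the actual GeoDTR module and its counterfactual training are more elaborate):

```python
import torch
import torch.nn as nn

class GeometricLayoutExtractor(nn.Module):
    """Sketch: predict K spatial layout descriptors from raw features,
    normalize each one over space, and use them to pool the features into
    K geometry-aware latent vectors that depend on where things are rather
    than on low-level appearance alone."""
    def __init__(self, channels, k=8):
        super().__init__()
        self.to_layouts = nn.Conv2d(channels, k, kernel_size=1)

    def forward(self, feats):                              # feats: (B, C, H, W)
        layouts = self.to_layouts(feats).flatten(2)        # (B, K, H*W)
        layouts = layouts.softmax(dim=-1)                  # spatial distribution per descriptor
        return layouts @ feats.flatten(2).transpose(1, 2)  # (B, K, C) latent representation
```

Pooling through spatially normalized descriptors ties each latent vector to a spatial configuration, which is exactly what the cross-area setting rewards.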
Though transfer learning is promising for increasing learning efficiency, existing methods are still challenged by long-horizon tasks, especially when expert policies are sub-optimal and only partially useful. Hence, a novel algorithm named EASpace (Enhanced Action Space) is proposed in this paper to transfer the knowledge of multiple sub-optimal expert policies. EASpace formulates each expert policy as multiple macro actions with different execution time periods, then integrates all macro actions directly into the primitive action space. Through this formulation, EASpace can learn when to execute which expert policy and for how long. An intra-macro-action learning rule is proposed that adjusts the temporal-difference target of macro actions to improve data efficiency and alleviate the non-stationarity issue in multi-agent settings. Furthermore, an additional reward proportional to the execution time of macro actions is introduced to encourage environment exploration via macro actions, which is significant for learning long-horizon tasks. Theoretical analysis is presented to show the convergence of the proposed algorithm. Its efficiency is illustrated by a grid-based game and a multi-agent pursuit problem, and the algorithm is also implemented on real physical systems to justify its effectiveness.
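A minimal sketch of how such an enhanced action space could be indexed (illustrative only; EASpace's intra-macro-action TD-target rule and duration-proportional reward are omitted, and the class and method names are our own):

```python
class EnhancedActionSpace:
    """Sketch of the action-space construction: every (expert, duration)
    pair becomes one macro action appended after the primitive actions, so
    a standard Q-learner over this flat space implicitly decides when to
    call which expert and for how long."""
    def __init__(self, n_primitive, n_experts, durations=(2, 4, 8)):
        self.n_primitive = n_primitive
        self.macros = [(e, d) for e in range(n_experts) for d in durations]

    def size(self):
        return self.n_primitive + len(self.macros)

    def decode(self, action):
        """Map a flat action index to what the agent should execute."""
        if action < self.n_primitive:
            return ("primitive", action, 1)    # execute the primitive once
        expert, duration = self.macros[action - self.n_primitive]
        return ("macro", expert, duration)     # follow the expert for `duration` steps
```

Flattening experts and durations into ordinary actions is what lets an off-the-shelf value-based learner arbitrate between its own primitives and the sub-optimal experts.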
Understanding objects is a central building block of artificial intelligence, especially for embodied AI. Even though object recognition excels with deep learning, current machines still struggle to learn higher-level knowledge, e.g., what attributes an object has and what we can do with it. In this work, we propose a challenging Object Concept Learning (OCL) task to push the envelope of object understanding. It requires machines to reason out object affordances and simultaneously give the reason: what attributes make an object possess these affordances. To support OCL, we build a densely annotated knowledge base including extensive labels for three levels of object concepts (category, attribute, affordance), as well as the causal relations among the three levels. By analyzing the causal structure of OCL, we present a baseline, the Object Concept Reasoning Network (OCRN). It leverages causal intervention and concept instantiation to infer the three levels following their causal relations. In experiments, OCRN effectively infers object knowledge while following the causalities well. Our data and code are available at https://mvig-rhos.com/ocl.
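A loose sketch of the three-level structure (our own simplified head, not the actual OCRN, which additionally uses causal intervention and concept instantiation):

```python
import torch
import torch.nn as nn

class ThreeLevelHead(nn.Module):
    """Toy head mirroring the category/attribute -> affordance causal chain:
    predict category and attributes from an image embedding, then condition
    the affordance prediction on the attribute estimates."""
    def __init__(self, feat_dim, n_cat, n_attr, n_aff):
        super().__init__()
        self.cat_head = nn.Linear(feat_dim, n_cat)
        self.attr_head = nn.Linear(feat_dim, n_attr)
        self.aff_head = nn.Linear(feat_dim + n_attr, n_aff)

    def forward(self, feat):                       # feat: (B, feat_dim)
        cat_logits = self.cat_head(feat)
        attr_probs = self.attr_head(feat).sigmoid()
        aff_logits = self.aff_head(torch.cat([feat, attr_probs], dim=-1))
        return cat_logits, attr_probs, aff_logits
```

Feeding the attribute estimates into the affordance head is the minimal way to make the model "give the reason": the predicted affordances are explicitly a function of the predicted attributes.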