Model quantization methods are frequently used to deploy deep models in a computationally efficient way. Moreover, as emerging hardware begins to support mixed-bitwidth arithmetic operations, recent research on mixed-precision quantization (MPQ) has started to fully exploit the representational capacity of networks by searching for optimized bitwidths for the different layers and modules. However, prior studies either search for the MPQ strategy via costly schemes such as reinforcement learning and neural architecture search, or simply exploit partial prior knowledge for bitwidth assignment, which can be biased and sub-optimal. In this work, we present a novel Stochastic Differentiable Quantization (SDQ) method that can automatically learn the MPQ strategy in a more flexible and globally optimized space with a smoother gradient approximation. In particular, differentiable bitwidth parameters (DBPs) are employed as the probability factors in stochastic quantization between adjacent bitwidth choices. After acquiring the optimal MPQ strategy, we further train the network with entropy-aware bin regularization and knowledge distillation. We extensively evaluate our method on multiple networks, different hardware (GPUs and FPGAs), and datasets. SDQ outperforms all state-of-the-art mixed- or single-precision quantization methods even with lower bitwidths, and surpasses the full-precision counterparts of various ResNet and MobileNet families, demonstrating the effectiveness and superiority of our method.
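To make the mechanism concrete, below is a minimal PyTorch sketch of stochastic quantization between two adjacent bitwidths governed by a differentiable probability parameter. The sigmoid parameterization, the symmetric uniform quantizer, and the straight-through-style estimator are our assumptions for illustration, not details taken from the abstract:

```python
import torch
import torch.nn as nn

def uniform_quantize(w, bits):
    # Symmetric uniform quantization with a straight-through estimator.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (q - w).detach()  # forward: q; backward: identity

class StochasticBitwidthQuant(nn.Module):
    """Stochastically quantizes weights to one of two adjacent bitwidths,
    with a differentiable parameter controlling the sampling probability."""

    def __init__(self, low_bits=4, high_bits=8):
        super().__init__()
        self.low_bits, self.high_bits = low_bits, high_bits
        self.theta = nn.Parameter(torch.zeros(()))  # differentiable bitwidth parameter

    def forward(self, w):
        p_high = torch.sigmoid(self.theta)  # probability of the higher bitwidth
        q_low = uniform_quantize(w, self.low_bits)
        q_high = uniform_quantize(w, self.high_bits)
        if not self.training:
            # At inference, commit to the more probable bitwidth.
            return q_high if p_high >= 0.5 else q_low
        expected = p_high * q_high + (1 - p_high) * q_low
        sampled = q_high if torch.rand(()) < p_high else q_low
        # Forward uses the sampled bitwidth; backward flows through the
        # expectation, giving the bitwidth parameter a smooth gradient.
        return expected + (sampled - expected).detach()
```

During training the forward pass uses a sampled bitwidth while the backward pass differentiates through the expectation, which is one way to obtain the smoother gradient approximation the method aims for.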
Precise segmentation is a vital first step for analyzing the semantic information of the cardiac cycle and capturing anomalies in cardiovascular signals. However, in the field of deep semantic segmentation, inference is often unilaterally confounded by individual attributes of the data. For cardiovascular signals, quasi-periodicity is the essential characteristic to be learned, regarded as the synthesis of the attributes of morphology (AM) and rhythm (AR). Our key insight is to suppress over-dependence on either AM or AR while generating deep representations. To address this issue, we establish a structural causal model as the foundation for customizing intervention approaches for AM and AR, respectively. In this paper, we propose contrastive causal intervention (CCI) to form a novel training paradigm under a frame-level contrastive framework. The intervention eliminates the implicit statistical bias brought by a single attribute and leads to more objective representations. We conduct comprehensive experiments under controlled conditions for QRS location and heart sound segmentation. The final results indicate that our approach can evidently improve performance by up to 0.41% for QRS location and 2.73% for heart sound segmentation, and that its efficiency generalizes to multiple databases and noisy signals.
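As a rough illustration of a frame-level contrastive objective of the kind described above (the InfoNCE form and the per-frame positive pairing are our assumptions, not the paper's exact loss), consider:

```python
import torch
import torch.nn.functional as F

def frame_contrastive_loss(z_orig, z_interv, temperature=0.1):
    """Frame-level contrastive loss between a signal's per-frame embeddings
    and those of an attribute-intervened view.

    z_orig, z_interv: (T, D) per-frame embeddings. Frames at the same time
    step form positive pairs; all other frames act as negatives, pushing
    the representation to ignore the intervened attribute."""
    z1 = F.normalize(z_orig, dim=-1)
    z2 = F.normalize(z_interv, dim=-1)
    logits = z1 @ z2.t() / temperature                       # (T, T) similarities
    targets = torch.arange(z1.size(0), device=z1.device)     # positives on diagonal
    return F.cross_entropy(logits, targets)
```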
Few-shot semantic segmentation aims to recognize the object regions of unseen categories with only a few annotated examples as supervision. The key to few-shot segmentation is establishing robust semantic relationships between the support and query images while preventing overfitting. In this paper, we propose an effective Multi-Similarity Hyper-relation Network (MSHNet) to tackle the few-shot semantic segmentation problem. In MSHNet, we propose a new Generative Prototype Similarity (GPS) which, together with cosine similarity, can establish robust semantic relationships between support and query images. The locally generated prototype similarity based on global features is logically complementary to the global cosine similarity based on local features, and the relationship between query and support images can be expressed more comprehensively by using the two similarities simultaneously. In addition, we propose a Symmetric Merging Block (SMB) in MSHNet to efficiently merge multi-layer, multi-shot, and multi-similarity hyper-relation features. MSHNet is built on similarities rather than category-specific features, which leads to better generality and effectively reduces overfitting. On the two benchmark semantic segmentation datasets Pascal-5i and COCO-20i, MSHNet achieves new state-of-the-art performance on 1-shot and 5-shot semantic segmentation tasks.
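For intuition, a simplified sketch of combining a prototype-based similarity with a pixel-wise cosine similarity is given below. It uses plain masked average pooling in place of the paper's generative prototype, so it illustrates the complementary-similarity idea rather than GPS itself:

```python
import torch
import torch.nn.functional as F

def masked_prototype(support_feat, support_mask):
    """Masked average pooling: a class prototype from support features.
    support_feat: (C, H, W); support_mask: (H, W) binary foreground mask."""
    mask = support_mask.unsqueeze(0)                        # (1, H, W)
    return (support_feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1)

def combined_similarity(query_feat, support_feat, support_mask):
    """Returns a prototype-similarity map and a pixel-wise cosine-similarity
    map; using both expresses the query-support relation more fully than
    either alone."""
    proto = masked_prototype(support_feat, support_mask)     # (C,)
    # Prototype similarity: each query location against the global prototype.
    proto_sim = F.cosine_similarity(query_feat, proto[:, None, None], dim=0)
    # Pixel-wise cosine similarity: best match over all support locations.
    q = F.normalize(query_feat.flatten(1), dim=0)            # (C, HqWq)
    s = F.normalize(support_feat.flatten(1), dim=0)          # (C, HsWs)
    cos_sim = (q.t() @ s).max(dim=1).values.view(query_feat.shape[1:])
    return proto_sim, cos_sim
```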
Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower communication via system relaxations: quantization, decentralization, and communication delay. However, most, if not all, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization, and therefore cannot take advantage of all the possible optimizations that the machine learning community has developed recently. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, an MPI-style communication library providing a collection of primitives that is both flexible and modular, to support state-of-the-art system relaxation techniques for distributed training. Powered by this design, BAGUA has a great ability to implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), BAGUA can outperform PyTorch-DDP, Horovod, and BytePS in end-to-end training time by a significant margin (up to 2x) across a diverse range of tasks. Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance under different network conditions.
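To illustrate what a communication-relaxation primitive looks like, here is a toy quantized all-reduce written against torch.distributed. It is a generic sketch of the quantize-communicate-dequantize pattern, not BAGUA's actual API; real systems add error feedback and fused kernels:

```python
import torch
import torch.distributed as dist

def quantized_allreduce(grad):
    """Compress a gradient to 8-bit levels before the collective, then
    dequantize. Assumes torch.distributed is already initialized."""
    # Agree on one shared scale (max over ranks) so dequantization is consistent.
    scale = grad.abs().max().clamp(min=1e-8)
    dist.all_reduce(scale, op=dist.ReduceOp.MAX)
    scale = scale / 127.0
    q = torch.clamp(torch.round(grad / scale), -127, 127)
    dist.all_reduce(q)  # sum of everyone's quantized gradient (kept in fp32)
    return q * scale / dist.get_world_size()  # averaged, dequantized gradient
```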
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot benefit, or benefit only marginally, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, and sequential distillation, revealing that: 1) distilling token relations is more effective than CLS-token- and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student mismatches that of the teacher; 3) weak regularization is preferred. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification for the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, setting a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models: exploring better training methods rather than introducing inductive biases into architectures, as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
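As a hedged illustration of token-relation distillation, the sketch below matches softmax-normalized token affinity matrices between student and teacher; the affinity definition is a simplification of the paper's exact relation targets (e.g., its Q/K/V relations):

```python
import torch
import torch.nn.functional as F

def relation_distill_loss(student_tokens, teacher_tokens, tau=1.0):
    """Distill token-to-token relations instead of raw features.
    Each input is a (B, N, D) tensor of token embeddings from a chosen
    layer (an intermediate teacher layer when depths mismatch)."""
    def relation_logprobs(x):
        # Scaled token affinity matrix, softmax-normalized per row.
        sim = x @ x.transpose(-2, -1) / (x.size(-1) ** 0.5)  # (B, N, N)
        return F.log_softmax(sim / tau, dim=-1)
    s = relation_logprobs(student_tokens)
    t = relation_logprobs(teacher_tokens).exp()              # teacher probabilities
    return F.kl_div(s, t, reduction="batchmean")
```

Because both relation matrices are N x N regardless of embedding width, this style of target also sidesteps the dimension mismatch between a small student and a large teacher.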
The recent increase in public and academic interest in preserving biodiversity has led to the growth of the field of conservation technology. This field involves designing and constructing tools that utilize technology to aid in the conservation of wildlife. In this article, we will use case studies to demonstrate the importance of designing conservation tools with human-wildlife interaction in mind and provide a framework for creating successful tools. These case studies include a range of complexities, from simple cat collars to machine learning and game theory methodologies. Our goal is to introduce and inform current and future researchers in the field of conservation technology and provide references for educating the next generation of conservation technologists. Conservation technology not only has the potential to benefit biodiversity but also has broader impacts on fields such as sustainability and environmental protection. By using innovative technologies to address conservation challenges, we can find more effective and efficient solutions to protect and preserve our planet's resources.
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
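One simple way a style code could "adjust the weights of the feed-forward layers" is per-channel weight modulation, sketched below; the scaling scheme is our assumption, not the paper's exact style-aware adaptive transformer:

```python
import torch
import torch.nn as nn

class StyleAdaptiveFFN(nn.Module):
    """A feed-forward block modulated by a style code: the code predicts
    per-channel scales for the first linear layer's activations."""
    def __init__(self, dim, hidden, style_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.to_scale = nn.Linear(style_dim, hidden)

    def forward(self, x, style_code):
        # x: (B, N, dim) token features; style_code: (B, style_dim)
        scale = 1 + self.to_scale(style_code).unsqueeze(1)  # (B, 1, hidden)
        h = torch.relu(self.fc1(x) * scale)                 # style-modulated hidden
        return self.fc2(h)
```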
Decompilation aims to transform a low-level programming language (LPL) (e.g., a binary file) into its functionally equivalent high-level programming language (HPL) (e.g., C/C++). It is a core technology in software security, especially in vulnerability discovery and malware analysis. In recent years, with the successful application of neural machine translation (NMT) models in natural language processing (NLP), researchers have tried to build neural decompilers by borrowing the idea of NMT. They formulate the decompilation process as a translation problem between LPL and HPL, aiming to reduce the human cost required to develop decompilation tools and improve their generalizability. However, state-of-the-art learning-based decompilers do not cope well with compiler-optimized binaries. Since real-world binaries are mostly compiler-optimized, decompilers that do not consider optimized binaries have limited practical significance. In this paper, we propose a novel learning-based approach named NeurDP that targets compiler-optimized binaries. NeurDP uses a graph neural network (GNN) model to convert LPL to an intermediate representation (IR), which bridges the gap between source code and optimized binary. We also design an Optimized Translation Unit (OTU) to split functions into smaller code fragments for better translation performance. Evaluation results on datasets containing various types of statements show that NeurDP can decompile optimized binaries with 45.21% higher accuracy than state-of-the-art neural decompilation frameworks.
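As a heavily hedged sketch of the unit-splitting idea (the abstract does not describe OTU's criteria, so the control-flow boundary heuristic below is entirely ours):

```python
# Split a function's instruction list into smaller translation units at
# control-flow boundaries, so each unit is a shorter translation problem.
BRANCHES = {"jmp", "je", "jne", "call", "ret"}

def split_into_units(instructions, max_len=20):
    """instructions: list of (mnemonic, operands) tuples."""
    units, current = [], []
    for mnemonic, operands in instructions:
        current.append((mnemonic, operands))
        # Close a unit at a control-flow instruction or a length cap.
        if mnemonic in BRANCHES or len(current) >= max_len:
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units
```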
Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
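The GRN layer itself is compact. Following the formulation released with ConvNeXt V2, it aggregates a global L2 response per channel, normalizes that response across channels, and uses the result to recalibrate the features:

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization for channels-last input (N, H, W, C):
    global aggregation, divisive normalization, learned calibration, and a
    residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))

    def forward(self, x):
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)  # global L2 per channel
        nx = gx / (gx.mean(dim=-1, keepdim=True) + 1e-6)   # normalize across channels
        return self.gamma * (x * nx) + self.beta + x       # calibration + residual
```

The cross-channel division is what creates the inter-channel feature competition the abstract describes: a channel's response is amplified or suppressed relative to the average channel.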
In this paper, we propose a novel framework dubbed peer learning to deal with the problem of biased scene graph generation (SGG). This framework uses predicate sampling and consensus voting (PSCV) to encourage different peers to learn from each other, improving model diversity and mitigating bias in SGG. To address the heavily long-tailed distribution of predicate classes, we propose to use predicate sampling to divide and conquer this issue. As a result, the model is less biased and makes more balanced predicate predictions. Specifically, one peer may not be sufficiently diverse to discriminate between different levels of predicate distributions. Therefore, we sample the data distribution based on the frequency of predicates into sub-distributions, selecting head, body, and tail classes to combine and feed to different peers as complementary predicate knowledge during the training process. The complementary predicate knowledge of these peers is then ensembled using a consensus voting strategy, which simulates a civilized voting process in our society that emphasizes the majority opinion and diminishes the minority opinion. This approach ensures that the learned representations of each peer are optimally adapted to the various data distributions. Extensive experiments on the Visual Genome dataset demonstrate that PSCV outperforms previous methods, establishing a new state-of-the-art (SOTA) on the SGCls task with a mean of 31.6.
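A toy stand-in for the consensus voting step might look like the following, where each peer casts a hard vote for its top predicate and ties are broken by mean confidence; the tie-breaking rule is our assumption:

```python
import torch

def consensus_vote(peer_logits):
    """peer_logits: list of (B, C) logit tensors, one per peer.
    Returns the majority-vote predicate index per example."""
    stacked = torch.stack(peer_logits)               # (P, B, C): peers x batch x classes
    votes = stacked.argmax(dim=-1)                   # (P, B) hard votes
    P, B, C = stacked.shape
    counts = torch.zeros(B, C)
    counts.scatter_add_(1, votes.t(), torch.ones(B, P))  # tally votes per class
    mean_prob = stacked.softmax(dim=-1).mean(dim=0)  # (B, C) soft tie-breaker
    return (counts + 1e-3 * mean_prob).argmax(dim=-1)
```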