我们研究了不同修剪技术对具有对比损失功能的深神经网络所学的表示的影响。我们的工作发现,相对于经过传统的跨透明损失训练的模型,在高稀疏度水平上,对比度学习的示例数量更高。为了理解这种明显的差异,我们使用派(Hooker等,2019),Q-Score(Kalibhat等,2022)和PD-Score(Baldock等,2021)等指标(Hooker等,2019),测量修剪对学习的表示质量的影响。我们的分析表明,修剪方法实施的时间表很重要。我们发现,当在训练阶段早期引入修剪时,稀疏性对学习表示的质量的负面影响最高。
translated by 谷歌翻译
预先接受的语言模型(PLMS)在预训练和微调范式下,在各种自然语言处理(NLP)任务中取得了巨大成功。具有大量参数,PLMS是计算密集型和资源饥饿的。因此,已经引入了模型修剪来压缩大规模的PLM。然而,大多数先前的方法只考虑对下游任务的任务特定知识,但忽略了修剪期间的基本任务无关知识,这可能导致灾难性的遗忘问题并导致普遍性较差。为了在我们的修剪模型中维护任务不可行的特定知识,我们提出了在预训练和微调范式下的对比修剪(盖子)。它设计为一​​般框架,与结构化和非结构化修剪兼容。统一的对比学习,CAP使修剪模型能够从预训练的模型中学到任务无关的知识,以及特定于任务知识的微调模型。此外,为了更好地保留修剪模型的性能,快照(即,每个修剪迭代的中间模型)也是修剪的有效监督。我们广泛的实验表明,采用盖子一致地产生显着的改善,特别是在极高的稀疏性方案中。只有3%的型号参数保留(即97%的稀疏性),CAP成功达到了QQP和MNLI任务的原始BERT性能的99.2%和96.3%。此外,我们的探测实验表明,CAP修剪的模型趋于达到更好的泛化能力。
translated by 谷歌翻译
我们研究了基于SGD的深神经网络(DNN)的优化是否可以适应高度准确且易于压缩的模型。我们提出了一种新的压缩意识的最小化器,称为CRAM,它以原则性的方式修改了SGD训练迭代,以产生在压缩操作(例如减肥或量化)下局部损失行为稳定的模型。标准图像分类任务的实验结果表明,CRAM产生的密集模型比标准SGD型基准线更准确,但在重量修剪下令人惊讶的是稳定的:例如,对于Imagenet上的Resnet50,CRAM训练的模型可能会损失到。他们的重量的70%一次性只有微小的精度损失。
translated by 谷歌翻译
Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent for multiple network architectures, datasets, and tasks, which imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency in the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin, 2019), and find that with optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization. * Equal contribution. † Work done while visiting UC Berkeley.
translated by 谷歌翻译
网络修剪是一种广泛使用的技术,用于有效地压缩深神经网络,几乎没有在推理期间在性能下降低。迭代幅度修剪(IMP)是由几种迭代训练和修剪步骤组成的网络修剪的最熟悉的方法之一,其中在修剪后丢失了大量网络的性能,然后在随后的再培训阶段中恢复。虽然常用为基准参考,但经常认为a)通过不将稀疏纳入训练阶段来达到次优状态,b)其全球选择标准未能正确地确定最佳层面修剪速率和c)其迭代性质使它变得缓慢和不竞争。根据最近提出的再培训技术,我们通过严格和一致的实验来调查这些索赔,我们将Impr到培训期间的训练算法进行比较,评估其选择标准的建议修改,并研究实际需要的迭代次数和总培训时间。我们发现IMP与SLR进行再培训,可以优于最先进的修剪期间,没有或仅具有很少的计算开销,即全局幅度选择标准在很大程度上具有更复杂的方法,并且只有几个刷新时期在实践中需要达到大部分稀疏性与IMP的诽谤 - 与性能权衡。我们的目标既可以证明基本的进攻已经可以提供最先进的修剪结果,甚至优于更加复杂或大量参数化方法,也可以为未来的研究建立更加现实但易于可实现的基线。
translated by 谷歌翻译
Neural network pruning-the task of reducing the size of a network by removing parameters-has been the subject of a great deal of work in recent years. We provide a meta-analysis of the literature, including an overview of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to compare pruning techniques to one another or determine how much progress the field has made over the past three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods. We use ShrinkBench to compare various pruning techniques and show that its comprehensive evaluation can prevent common pitfalls when comparing pruning methods.
translated by 谷歌翻译
The rectified linear unit (ReLU) is a highly successful activation function in neural networks as it allows networks to easily obtain sparse representations, which reduces overfitting in overparameterized networks. However, in network pruning, we find that the sparsity introduced by ReLU, which we quantify by a term called dynamic dead neuron rate (DNR), is not beneficial for the pruned network. Interestingly, the more the network is pruned, the smaller the dynamic DNR becomes during optimization. This motivates us to propose a method to explicitly reduce the dynamic DNR for the pruned network, i.e., de-sparsify the network. We refer to our method as Activating-while-Pruning (AP). We note that AP does not function as a stand-alone method, as it does not evaluate the importance of weights. Instead, it works in tandem with existing pruning methods and aims to improve their performance by selective activation of nodes to reduce the dynamic DNR. We conduct extensive experiments using popular networks (e.g., ResNet, VGG) via two classical and three state-of-the-art pruning methods. The experimental results on public datasets (e.g., CIFAR-10/100) suggest that AP works well with existing pruning methods and improves the performance by 3% - 4%. For larger scale datasets (e.g., ImageNet) and state-of-the-art networks (e.g., vision transformer), we observe an improvement of 2% - 3% with AP as opposed to without. Lastly, we conduct an ablation study to examine the effectiveness of the components comprising AP.
translated by 谷歌翻译
转移学习是一种经典范式,通过该范式,在大型“上游”数据集上佩戴的模型适于在“下游”专业数据集中产生良好的结果。通常,据了解,“上游”数据集上的更准确的模型将提供更好的转移精度“下游”。在这项工作中,我们在想象的神经网络(CNNS)的背景下对这种现象进行了深入的调查,这些现象已经在想象的数据集上训练的情况下被修剪 - 这是通过缩小它们的连接来压缩。具体地,我们考虑使用通过应用几种最先进的修剪方法而获得的非结构化修剪模型的转移,包括基于幅度的,二阶,重新增长和正规化方法,在12个标准转移任务的上下文中。简而言之,我们的研究表明,即使在高稀稀物质,稀疏的型号也可以匹配或甚至优于致密模型的转移性能,并且在此操作时,可以导致显着的推论甚至培训加速度。与此同时,我们观察和分析不同修剪方法行为的显着差异。
translated by 谷歌翻译
人们通常认为,修剪网络不仅会降低深网的计算成本,而且还可以通过降低模型容量来防止过度拟合。但是,我们的工作令人惊讶地发现,网络修剪有时甚至会加剧过度拟合。我们报告了出乎意料的稀疏双后裔现象,随着我们通过网络修剪增加模型稀疏性,首先测试性能变得更糟(由于过度拟合),然后变得更好(由于过度舒适),并且终于变得更糟(由于忘记了有用的有用信息)。尽管最近的研究集中在模型过度参数化方面,但他们未能意识到稀疏性也可能导致双重下降。在本文中,我们有三个主要贡献。首先,我们通过广泛的实验报告了新型的稀疏双重下降现象。其次,对于这种现象,我们提出了一种新颖的学习距离解释,即$ \ ell_ {2} $稀疏模型的学习距离(从初始化参数到最终参数)可能与稀疏的双重下降曲线良好相关,并更好地反映概括比最小平坦。第三,在稀疏的双重下降的背景下,彩票票假设中的获胜票令人惊讶地并不总是赢。
translated by 谷歌翻译
Most existing pruning works are resource-intensive, requiring retraining or fine-tuning of the pruned models for accuracy. We propose a retraining-free pruning method based on hyperspherical learning and loss penalty terms. The proposed loss penalty term pushes some of the model weights far from zero, while the rest weight values are pushed near zero and can be safely pruned with no need for retraining and a negligible accuracy drop. In addition, our proposed method can instantly recover the accuracy of a pruned model by replacing the pruned values with their mean value. Our method obtains state-of-the-art results in retraining-free pruning and is evaluated on ResNet-18/50 and MobileNetV2 with ImageNet dataset. One can easily get a 50\% pruned ResNet18 model with a 0.47\% accuracy drop. With fine-tuning, the experiment results show that our method can significantly boost the accuracy of the pruned models compared with existing works. For example, the accuracy of a 70\% pruned (except the first convolutional layer) MobileNetV2 model only drops 3.5\%, much less than the 7\% $\sim$ 10\% accuracy drop with conventional methods.
translated by 谷歌翻译
深度神经网络(DNN)的计算要求增加导致获得稀疏,且准确的DNN模型的兴趣。最近的工作已经调查了稀疏训练的更加困难的情况,其中DNN重量尽可能稀少,以减少训练期间的计算成本。现有的稀疏训练方法通常是经验的,并且可以具有相对于致密基线的准确性较低。在本文中,我们介绍了一种称为交替压缩/解压缩(AC / DC)训练DNN的一般方法,证明了算法变体的收敛,并表明AC / DC在类似的计算预算中准确地表现出现有的稀疏训练方法;在高稀疏水平下,AC / DC甚至优于现有的现有方法,依赖于准确的预训练密集模型。 AC / DC的一个重要属性是它允许联合培训密集和稀疏的型号,在训练过程结束时产生精确的稀疏密集模型对。这在实践中是有用的,其中压缩变体可能是为了在资源受限的设置中进行部署而不重新执行整个训练流,并且还为我们提供了深入和压缩模型之间的精度差距的见解。代码可在:https://github.com/ist-daslab/acdc。
translated by 谷歌翻译
基于变压器的大型模型在各种自然语言处理和计算机视觉任务中表现出卓越的性能。但是,这些模型包含大量参数,这些参数将其部署限制为现实世界应用程序。为了减少模型大小,研究人员根据权重的重要性得分修剪这些模型。但是,这种分数通常在训练过程中估计在小批次上,这会由于迷你批次采样和复杂的训练动力学而产生巨大的可变性/不确定性。结果,由于这种不确定性,可以通过常用的修剪方法来修剪一些关键权重,从而使训练不稳定并受到概括。为了解决这个问题,我们提出了Platon,该问题通过对重要性估计的上限(UCB)捕获了重要性得分的不确定性。特别是,对于较低的分数但不确定性较高的权重,柏拉图倾向于保留它们并探索其能力。我们对基于自然语言的理解,问答和图像分类的几种基于变压器的模型进行了广泛的实验,以验证柏拉图的有效性。结果表明,柏拉图在不同的稀疏度水平下显着改善。我们的代码可在https://github.com/qingruzhang/platon上公开获取。
translated by 谷歌翻译
深度学习方法通​​过依靠极大的大量参数化神经网络来提供许多应用程序的最先进性能。但是,此类网络已被证明非常脆弱,并不能很好地概括为新用途案例,并且通常很难在资源有限的平台上部署。模型修剪,即减少网络的大小,是一种广泛采用的策略,可以导致更健壮和可推广的网络 - 通常较小的数量级,具有相同甚至改善的性能。尽管有许多用于修剪模型的启发式方法,但我们对修剪过程的理解仍然有限。实证研究表明,某些启发式方法可以改善性能,而另一些可以使模型更脆或具有其他副作用。这项工作旨在阐明不同的修剪方法如何改变网络的内部功能表示以及对模型性能的相应影响。为了提供模型特征空间的有意义的比较和表征,我们使用三个几何指标,这些指标是从共同采用的分类损失中分解的。使用这些指标,我们设计了一个可视化系统,以突出修剪对模型预测以及潜在功能嵌入的影响。所提出的工具为探索和研究修剪方法以及修剪和原始模型之间的差异提供了一个环境。通过利用我们的可视化,ML研究人员不仅可以识别模型修剪和数据损坏的样本,而且还可以获得有关某些修剪模型如何实现出色鲁棒性能的见解和解释。
translated by 谷歌翻译
Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. In existing methods, pruning is done within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization prior to training. To achieve this, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. This eliminates the need for both pretraining and the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on the MNIST, CIFAR-10, and Tiny-ImageNet classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.
translated by 谷歌翻译
深度神经网络近似高度复杂功能的能力是其成功的关键。但是,这种好处是以巨大的模型大小为代价的,这挑战了其在资源受限环境中的部署。修剪是一种用于限制此问题的有效技术,但通常以降低准确性和对抗性鲁棒性为代价。本文解决了这些缺点,并引入了Deadwooding,这是一种新型的全球修剪技术,它利用了Lagrangian双重方法来鼓励模型稀疏性,同时保持准确性并确保鲁棒性。所得模型显示出在鲁棒性和准确性度量方面的最先进研究大大优于最先进的模型。
translated by 谷歌翻译
Yes. In this paper, we investigate strong lottery tickets in generative models, the subnetworks that achieve good generative performance without any weight update. Neural network pruning is considered the main cornerstone of model compression for reducing the costs of computation and memory. Unfortunately, pruning a generative model has not been extensively explored, and all existing pruning algorithms suffer from excessive weight-training costs, performance degradation, limited generalizability, or complicated training. To address these problems, we propose to find a strong lottery ticket via moment-matching scores. Our experimental results show that the discovered subnetwork can perform similarly or better than the trained dense model even when only 10% of the weights remain. To the best of our knowledge, we are the first to show the existence of strong lottery tickets in generative models and provide an algorithm to find it stably. Our code and supplementary materials are publicly available.
translated by 谷歌翻译
We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy. We find that these subnetworks only reach full accuracy when they are stable to SGD noise, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (ResNet-50 and Inception-v3 on ImageNet).
translated by 谷歌翻译
Many applications require sparse neural networks due to space or inference time restrictions. There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model. In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-tosparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques. We demonstrate state-of-the-art sparse training results on a variety of networks and datasets, including ResNet-50, MobileNets on Imagenet-2012, and RNNs on WikiText-103. Finally, we provide some insights into why allowing the topology to change during the optimization can overcome local minima encountered when the topology remains static * .
translated by 谷歌翻译
最近对深神经网络(DNN)效率的重点已导致了模型压缩方法的重要工作,其中重量修剪是最受欢迎的方法之一。同时,有快速增长的计算支持,以有效地执行通过修剪获得的非结构化模型。但是,大多数现有的修剪方法最小化仅剩余权重的数量,即模型的大小,而不是针对推理时间进行优化。我们通过引入SPDY来解决这一差距,SPDY是一种新的压缩方法,该方法会自动确定层次的稀疏性目标,可以在给定系统上实现所需的推理速度,同时最大程度地减少准确性损失。 SPDY由两种新技术组成:第一个是一种有效的动态编程算法,用于求解一组给定的层敏感性得分,以解决加速约束的层压缩问题;第二个是一个局部搜索程序,用于确定准确的层敏感性得分。跨流行视觉和语言模型的实验表明,SPDY可以保证相对于现有策略的恢复较高的准确性,无论是一次性和逐步修剪方案,并且与大多数现有的修剪方法兼容。我们还将方法扩展到了最近实施的修剪任务,几乎没有数据,在该数据中,我们在修剪GPU支持的2:4稀疏模式时实现了最著名的准确性恢复。
translated by 谷歌翻译
由于深度学习模型通常包含数百万可培训的权重,因此对更有效的网络结构具有越来越高的存储空间和提高的运行时效率。修剪是最受欢迎的网络压缩技术之一。在本文中,我们提出了一种新颖的非结构化修剪管线,基于关注的同时稀疏结构和体重学习(ASWL)。与传统的频道和体重注意机制不同,ASWL提出了一种有效的算法来计算每层的层次引起的修剪比率,并且跟踪密度网络和稀疏网络的两种权重,以便修剪结构是同时从随机初始化的权重学习。我们在Mnist,CiFar10和Imagenet上的实验表明,与最先进的网络修剪方法相比,ASWL在准确性,修剪比率和操作效率方面取得了卓越的修剪。
translated by 谷歌翻译