Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).
translated by 谷歌翻译
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections-one between each layer and its subsequent layer-our network has L(L+1) 2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less computation to achieve high performance. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet.
translated by 谷歌翻译
Deep residual networks were shown to be able to scale up to thousands of layers and still have improving performance. However, each fraction of a percent of improved accuracy costs nearly doubling the number of layers, and so training very deep residual networks has a problem of diminishing feature reuse, which makes these networks very slow to train. To tackle these problems, in this paper we conduct a detailed experimental study on the architecture of ResNet blocks, based on which we propose a novel architecture where we decrease depth and increase width of residual networks. We call the resulting network structures wide residual networks (WRNs) and show that these are far superior over their commonly used thin and very deep counterparts. For example, we demonstrate that even a simple 16-layer-deep wide residual network outperforms in accuracy and efficiency all previous deep residual networks, including thousand-layerdeep networks, achieving new state-of-the-art results on CIFAR, SVHN, COCO, and significant improvements on ImageNet. Our code and models are available at https: //github.com/szagoruyko/wide-residual-networks.
translated by 谷歌翻译
Deep residual networks [1] have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/ resnet-1k-layers.
translated by 谷歌翻译
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers-8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions 1 , where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
translated by 谷歌翻译
胶囊网络是一类神经网络,可在许多计算机视觉任务上取得有希望的结果。但是,由于高计算和内存要求,基线胶囊网络未能在更复杂的数据集上达到最新结果。我们通过提出一种称为动量胶囊网络(Mocapsnet)的新网络体系结构来解决这个问题。Mocapsnets的灵感来自动量Resnets,这是一种应用可逆残留构建块的网络。可逆的网络允许重新计算后反向传播算法中正向通行的激活,因此可以大大减少这些内存要求。在本文中,我们提供了一个框架,介绍如何将可逆的残留构建块应用于胶囊网络。我们将证明Mocapsnet在MNIST,SVHN,CIFAR-10和CIFAR-100上击败基线胶囊网络的准确性,同时使用的内存较少。源代码可在https://github.com/moejoe95/mocapsnet上找到。
translated by 谷歌翻译
神经架构的创新促进了语言建模和计算机视觉中的重大突破。不幸的是,如果网络参数未正确初始化,新颖的架构通常会导致挑战超参数选择和培训不稳定。已经提出了许多架构特定的初始化方案,但这些方案并不总是可移植到新体系结构。本文介绍了毕业,一种用于初始化神经网络的自动化和架构不可知论由方法。毕业基础是一个简单的启发式;调整每个网络层的规范,使得具有规定的超参数的SGD或ADAM的单个步骤导致可能的损耗值最小。通过在每个参数块前面引入标量乘数变量,然后使用简单的数字方案优化这些变量来完成此调整。 GradInit加速了许多卷积架构的收敛性和测试性能,无论是否有跳过连接,甚至没有归一化层。它还提高了机器翻译的原始变压器架构的稳定性,使得在广泛的学习速率和动量系数下使用ADAM或SGD来训练它而无需学习速率预热。代码可在https://github.com/zhuchen03/gradinit上获得。
translated by 谷歌翻译
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
translated by 谷歌翻译
The deployment of deep convolutional neural networks (CNNs) in many real world applications is largely hindered by their high computational cost. In this paper, we propose a novel learning scheme for CNNs to simultaneously 1) reduce the model size; 2) decrease the run-time memory footprint; and 3) lower the number of computing operations, without compromising accuracy. This is achieved by enforcing channel-level sparsity in the network in a simple but effective way. Different from many existing approaches, the proposed method directly applies to modern CNN architectures, introduces minimum overhead to the training process, and requires no special software/hardware accelerators for the resulting models. We call our approach network slimming, which takes wide and large networks as input models, but during training insignificant channels are automatically identified and pruned afterwards, yielding thin and compact models with comparable accuracy. We empirically demonstrate the effectiveness of our approach with several state-of-the-art CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. For VGGNet, a multi-pass version of network slimming gives a 20× reduction in model size and a 5× reduction in computing operations.
translated by 谷歌翻译
Learning deeper convolutional neural networks becomes a tendency in recent years. However, many empirical evidences suggest that performance improvement cannot be gained by simply stacking more layers. In this paper, we consider the issue from an information theoretical perspective, and propose a novel method Relay Backpropagation, that encourages the propagation of effective information through the network in training stage. By virtue of the method, we achieved the first place in ILSVRC 2015 Scene Classification Challenge. Extensive experiments on two challenging large scale datasets demonstrate the effectiveness of our method is not restricted to a specific dataset or network architecture. Our models will be available to the research community later.
translated by 谷歌翻译
差异隐私(DP)提供了正式的隐私保证,以防止对手可以访问机器学习模型,从而从提取有关单个培训点的信息。最受欢迎的DP训练方法是差异私有随机梯度下降(DP-SGD),它通过在训练过程中注入噪声来实现这种保护。然而,以前的工作发现,DP-SGD通常会导致标准图像分类基准的性能显着降解。此外,一些作者假设DP-SGD在大型模型上固有地表现不佳,因为保留隐私所需的噪声规范与模型维度成正比。相反,我们证明了过度参数化模型上的DP-SGD可以比以前想象的要好得多。将仔细的超参数调整与简单技术结合起来,以确保信号传播并提高收敛速率,我们获得了新的SOTA,而没有额外数据的CIFAR-10,在81.4%的81.4%下(8,10^{ - 5}) - 使用40 -layer wide-Resnet,比以前的SOTA提高了71.7%。当对预训练的NFNET-F3进行微调时,我们在ImageNet(0.5,8*10^{ - 7})下达到了83.8%的TOP-1精度。此外,我们还在(8,8 \ cdot 10^{ - 7})下达到了86.7%的TOP-1精度,DP仅比当前的非私人SOTA仅4.3%。我们认为,我们的结果是缩小私人图像分类和非私有图像分类之间准确性差距的重要一步。
translated by 谷歌翻译
培训深度神经网络是一项非常苛刻的任务,尤其是具有挑战性的是如何适应体系结构以提高训练有素的模型的性能。我们可以发现,有时,浅网络比深网概括得更好,并且增加更多层会导致更高的培训和测试错误。深层残留学习框架通过将跳过连接添加到几个神经网络层来解决此降解问题。最初,需要这种跳过连接才能成功地训练深层网络,因为网络的表达性会随着深度的指数增长而成功。在本文中,我们首先通过神经网络分析信息流。我们介绍和评估批处理循环,该批处理通过神经网络的每一层量化信息流。我们从经验和理论上证明,基于梯度下降的训练方法需要正面批处理融合,以成功地优化给定的损失功能。基于这些见解,我们引入了批处理凝聚正则化,以使基于梯度下降的训练算法能够单独通过每个隐藏层来优化信息流。借助批处理正则化,梯度下降优化器可以将不可吸引的网络转换为可训练的网络。我们从经验上表明,因此我们可以训练“香草”完全连接的网络和卷积神经网络 - 没有跳过连接,批处理标准化,辍学或任何其他建筑调整 - 只需将批处理 - 凝集正则术语添加到500层中损失功能。批处理 - 注入正则化的效果不仅在香草神经网络上评估,还评估了在各种计算机视觉以及自然语言处理任务上的剩余网络,自动编码器以及变压器模型上。
translated by 谷歌翻译
We propose deeply-supervised nets (DSN), a method that simultaneously minimizes classification error and improves the directness and transparency of the hidden layer learning process. We focus our attention on three aspects of traditional convolutional-neuralnetwork-type (CNN-type) architectures: (1) transparency in the effect intermediate layers have on overall classification; (2) discriminativeness and robustness of learned features, especially in early layers; (3) training effectiveness in the face of "vanishing" gradients. To combat these issues, we introduce "companion" objective functions at each hidden layer, in addition to the overall objective function at the output layer (an integrated strategy distinct from layer-wise pretraining). We also analyze our algorithm using techniques extended from stochastic gradient methods. The advantages provided by our method are evident in our experimental results, showing state-of-the-art performance on MNIST, CIFAR-10, CIFAR-100, and SVHN.
translated by 谷歌翻译
We perform an empirical study of the behaviour of deep networks when pushing its activation functions to become fully linear in some of its feature channels through a sparsity prior on the overall number of nonlinear units in the network. To measure the depth of the resulting partially linearized network, we compute the average number of active nonlinearities encountered along a path in the network graph. In experiments on CNNs with sparsified PReLUs on typical image classification tasks, we make several observations: Under sparsity pressure, the remaining nonlinear units organize into distinct structures, forming core-networks of near constant effective depth and width, which in turn depend on task difficulty. We consistently observe a slow decay of performance with depth until the onset of a rapid collapse in accuracy, allowing for surprisingly shallow networks at moderate losses in accuracy that outperform base-line networks of similar depth, even after increasing width to a comparable number of parameters. In terms of training, we observe a nonlinear advantage: Reducing nonlinearity after training leads to a better performance than before, in line with previous findings in linearized training, but with a gap depending on task difficulty that vanishes for easy problems.
translated by 谷歌翻译
We introduce a method to train Quantized Neural Networks (QNNs) -neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
translated by 谷歌翻译
We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, named ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart. The code and models are publicly available online 1 .
translated by 谷歌翻译
使用卷积神经网络(CNN)已经显着改善了几种图像处理任务,例如图像分类和对象检测。与Reset和Abseralnet一样,许多架构在创建时至少在一个数据集中实现了出色的结果。培训的一个关键因素涉及网络的正规化,这可以防止结构过度装备。这项工作分析了在过去几年中开发的几种正规化方法,显示了不同CNN模型的显着改进。该作品分为三个主要区域:第一个称为“数据增强”,其中所有技术都侧重于执行输入数据的更改。第二个,命名为“内部更改”,旨在描述修改神经网络或内核生成的特征映射的过程。最后一个称为“标签”,涉及转换给定输入的标签。这项工作提出了与关于正则化的其他可用调查相比的两个主要差异:(i)第一个涉及在稿件中收集的论文并非超过五年,并第二个区别是关于可重复性,即所有作品此处推荐在公共存储库中可用的代码,或者它们已直接在某些框架中实现,例如Tensorflow或Torch。
translated by 谷歌翻译
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batchnormalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
translated by 谷歌翻译
Structural pruning of neural network parameters reduces computation, energy, and memory transfer costs during inference. We propose a novel method that estimates the contribution of a neuron (filter) to the final loss and iteratively removes those with smaller scores. We describe two variations of our method using the first and secondorder Taylor expansions to approximate a filter's contribution. Both methods scale consistently across any network layer without requiring per-layer sensitivity analysis and can be applied to any kind of layer, including skip connections. For modern networks trained on ImageNet, we measured experimentally a high (>93%) correlation between the contribution computed by our methods and a reliable estimate of the true importance. Pruning with the proposed methods leads to an improvement over state-ofthe-art in terms of accuracy, FLOPs, and parameter reduction. On ResNet-101, we achieve a 40% FLOPS reduction by removing 30% of the parameters, with a loss of 0.02% in the top-1 accuracy on ImageNet. Code is available at https://github.com/NVlabs/Taylor_pruning.
translated by 谷歌翻译
Neural networks require careful weight initialization to prevent signals from exploding or vanishing. Existing initialization schemes solve this problem in specific cases by assuming that the network has a certain activation function or topology. It is difficult to derive such weight initialization strategies, and modern architectures therefore often use these same initialization schemes even though their assumptions do not hold. This paper introduces AutoInit, a weight initialization algorithm that automatically adapts to different neural network architectures. By analytically tracking the mean and variance of signals as they propagate through the network, AutoInit appropriately scales the weights at each layer to avoid exploding or vanishing signals. Experiments demonstrate that AutoInit improves performance of convolutional, residual, and transformer networks across a range of activation function, dropout, weight decay, learning rate, and normalizer settings, and does so more reliably than data-dependent initialization methods. This flexibility allows AutoInit to initialize models for everything from small tabular tasks to large datasets such as ImageNet. Such generality turns out particularly useful in neural architecture search and in activation function discovery. In these settings, AutoInit initializes each candidate appropriately, making performance evaluations more accurate. AutoInit thus serves as an automatic configuration tool that makes design of new neural network architectures more robust. The AutoInit package provides a wrapper around TensorFlow models and is available at https://github.com/cognizant-ai-labs/autoinit.
translated by 谷歌翻译