In a series of recent theoretical works, it was shown that strongly overparameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to overparameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high dimensional tasks.
translated by 谷歌翻译
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In presence of hidden low-dimensional structures, the resulting margin is independent of the ambiant dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activations and confirm the statistical benefits of this implicit bias.
translated by 谷歌翻译
古典统计学习理论表示,拟合太多参数导致过度舒服和性能差。尽管大量参数矛盾,但是现代深度神经网络概括了这一发现,并构成了解释深度学习成功的主要未解决的问题。随机梯度下降(SGD)引起的隐式正规被认为是重要的,但其特定原则仍然是未知的。在这项工作中,我们研究了当地最小值周围的能量景观的局部几何学如何影响SGD的统计特性,具有高斯梯度噪声。我们争辩说,在合理的假设下,局部几何形状力强制SGD保持接近低维子空间,这会引起隐式正则化并导致深神经网络的泛化误差界定更严格的界限。为了获得神经网络的泛化误差界限,我们首先引入局部最小值周围的停滞迹象,并施加人口风险的局部基本凸性财产。在这些条件下,推导出SGD的下界,以保留在这些停滞套件中。如果发生停滞,我们会导出涉及权重矩阵的光谱规范的深神经网络的泛化误差的界限,但不是网络参数的数量。从技术上讲,我们的证据基于控制SGD中的参数值的变化以及基于局部最小值周围的合适邻域的熵迭代的参数值和局部均匀收敛。我们的工作试图通过统一收敛更好地连接非凸优化和泛化分析。
translated by 谷歌翻译
过分分度化是没有凸起的关键因素,以解释神经网络的全局渐变(GD)的全局融合。除了研究良好的懒惰政权旁边,已经为浅网络开发了无限宽度(平均场)分析,使用凸优化技术。为了弥合懒惰和平均场制度之间的差距,我们研究残留的网络(RESNET),其中残留块具有线性参数化,同时仍然是非线性的。这种Resnets承认无限深度和宽度限制,在再现内核Hilbert空间(RKHS)中编码残差块。在这个限制中,我们证明了当地的Polyak-Lojasiewicz不等式。因此,每个关键点都是全球最小化器和GD的局部收敛结果,并检索懒惰的制度。与其他平均场研究相比,它在残留物的表达条件下适用于参数和非参数案。我们的分析导致实用和量化的配方:从通用RKHS开始,应用随机傅里叶特征来获得满足我们的表征条件的高概率的有限维参数化。
translated by 谷歌翻译
我们证明了由例如He等人提出的广泛使用的方法。(2015年)并使用梯度下降对最小二乘损失进行训练并不普遍。具体而言,我们描述了一大批一维数据生成分布,较高的概率下降只会发现优化景观的局部最小值不好,因为它无法将其偏离偏差远离其初始化,以零移动。。事实证明,在这些情况下,即使目标函数是非线性的,发现的网络也基本执行线性回归。我们进一步提供了数值证据,表明在实际情况下,对于某些多维分布而发生这种情况,并且随机梯度下降表现出相似的行为。我们还提供了有关初始化和优化器的选择如何影响这种行为的经验结果。
translated by 谷歌翻译
A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach [2018], we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a simple two-layer model that already exhibits an interesting and meaningful transition between the kernel and rich regimes, and we demonstrate the transition for more complex matrix factorization models and multilayer non-linear networks.
translated by 谷歌翻译
We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions.Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit.
translated by 谷歌翻译
深度神经网络和其他现代机器学习模型的培训通常包括解决高维且受大规模数据约束的非凸优化问题。在这里,基于动量的随机优化算法在近年来变得尤其流行。随机性来自数据亚采样,从而降低了计算成本。此外,动量和随机性都应该有助于算法克服当地的最小化器,并希望在全球范围内融合。从理论上讲,这种随机性和动量的结合被糟糕地理解。在这项工作中,我们建议并分析具有动量的随机梯度下降的连续时间模型。该模型是一个分段确定的马尔可夫过程,它通过阻尼不足的动态系统和通过动力学系统的随机切换来代表粒子运动。在我们的分析中,我们研究了长期限制,子采样到无填充采样极限以及动量到非摩托车的限制。我们对随着时间的推移降低动量的情况特别感兴趣:直觉上,动量有助于在算法的初始阶段克服局部最小值,但禁止后来快速收敛到全球最小化器。在凸度的假设下,当降低随时间的动量时,我们显示了动力学系统与全局最小化器的收敛性,并让子采样率转移到无穷大。然后,我们提出了一个稳定的,合成的离散方案,以从我们的连续时间动力学系统中构造算法。在数值实验中,我们研究了我们在凸面和非凸测试问题中的离散方案。此外,我们训练卷积神经网络解决CIFAR-10图像分类问题。在这里,与动量相比,我们的算法与随机梯度下降相比达到了竞争性结果。
translated by 谷歌翻译
Many tasks in machine learning and signal processing can be solved by minimizing a convex function of a measure. This includes sparse spikes deconvolution or training a neural network with a single hidden layer. For these problems, we study a simple minimization method: the unknown measure is discretized into a mixture of particles and a continuous-time gradient descent is performed on their weights and positions. This is an idealization of the usual way to train neural networks with a large hidden layer. We show that, when initialized correctly and in the many-particle limit, this gradient flow, although non-convex, converges to global minimizers. The proof involves Wasserstein gradient flows, a by-product of optimal transport theory. Numerical experiments show that this asymptotic behavior is already at play for a reasonable number of particles, even in high dimension.
translated by 谷歌翻译
了解通过随机梯度下降(SGD)训练的神经网络的特性是深度学习理论的核心。在这项工作中,我们采取了平均场景,并考虑通过SGD培训的双层Relu网络,以实现一个非变量正则化回归问题。我们的主要结果是SGD偏向于简单的解决方案:在收敛时,Relu网络实现输入的分段线性图,以及“结”点的数量 - 即,Relu网络估计器的切线变化的点数 - 在两个连续的训练输入之间最多三个。特别地,随着网络的神经元的数量,通过梯度流的解决方案捕获SGD动力学,并且在收敛时,重量的分布方法接近相关的自由能量的独特最小化器,其具有GIBBS形式。我们的主要技术贡献在于分析了这一最小化器产生的估计器:我们表明其第二阶段在各地消失,除了代表“结”要点的一些特定地点。我们还提供了经验证据,即我们的理论预测的不同可能发生与数据点不同的位置的结。
translated by 谷歌翻译
低维歧管假设认为,在许多应用中发现的数据,例如涉及自然图像的数据(大约)位于嵌入高维欧几里得空间中的低维歧管上。在这种情况下,典型的神经网络定义了一个函数,该函数在嵌入空间中以有限数量的向量作为输入。但是,通常需要考虑在训练分布以外的点上评估优化网络。本文考虑了培训数据以$ \ mathbb r^d $的线性子空间分配的情况。我们得出对由神经网络定义的学习函数变化的估计值,沿横向子空间的方向。我们研究了数据歧管的编纂中与网络的深度和噪声相关的潜在正则化效应。由于存在噪声,我们还提出了训练中的其他副作用。
translated by 谷歌翻译
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized?In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network.On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
translated by 谷歌翻译
最近的研究表明,通过梯度下降训练的无限宽神经网络(NN)的动态可以是神经切线核(NTK)\ CITEP {Jacot2018neural}的特征。在平方损失下,通过梯度下降训练的无限宽度NN,具有无限小的学习速率等同于与NTK \ CITEP {arora2019Exact}的内核回归。但是,当前ridge回归{arora2019Harnessing}只知道等价物,而NN和其他内核机(KMS)之间的等价,例如,支持向量机(SVM),仍然未知。因此,在这项工作中,我们建议在NN和SVM之间建立等效,具体而言,通过柔软的边缘损失和具有由子润发性培训的NTK培训的标准柔软裕度SVM培训的无限宽NN。我们的主要理论结果包括建立NN和广泛的$ \ ELL_2 $正规化KMS之间的等价,其中有限宽度界限,不能通过事先工作来处理,并显示出通过这种正规化损耗函数训练的每个有限宽度NN大约一公里。此外,我们展示了我们的理论可以实现三种实际应用,包括(i)\ yressit {非空心}通过相应的km界限Nn; (ii)无限宽度NN的\ yryit {非琐碎}鲁棒性证书(而现有的鲁棒性验证方法提供空中界定); (iii)本质上更强大的无限宽度NN,来自以前的内核回归。我们的实验代码可用于\ URL {https://github.com/leslie-ch/equiv-nn-svm}。
translated by 谷歌翻译
已知神经网络对对抗性例子高度敏感。这些可能是由于不同的因素,例如随机初始化或学习问题中的虚假相关性。为了更好地理解这些因素,我们提供了对不同场景中对抗性鲁棒性的精确研究,从初始化到不同制度的培训结束以及中间场景,由于“懒惰”培训,初始化仍然起着作用。我们考虑具有二次靶标和无限样品的高维度中的过度参数化网络。我们的分析使我们能够确定近似(通过测试错误测量)和鲁棒性之间的新权衡,从而在测试误差改善时只能变得更糟,反之亦然。我们还展示了由于不当缩放的随机初始化,线性化的懒惰训练机制如何使鲁棒性恶化。通过数值实验说明了我们的理论结果。
translated by 谷歌翻译
At initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit (16; 4; 7; 13; 6), thus connecting them to kernel methods. We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function f θ (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK). This kernel is central to describe the generalization features of ANNs. While the NTK is random at initialization and varies during training, in the infinite-width limit it converges to an explicit limiting kernel and it stays constant during training. This makes it possible to study the training of ANNs in function space instead of parameter space. Convergence of the training can then be related to the positive-definiteness of the limiting NTK. We prove the positive-definiteness of the limiting NTK when the data is supported on the sphere and the non-linearity is non-polynomial. We then focus on the setting of least-squares regression and show that in the infinitewidth limit, the network function f θ follows a linear differential equation during training. The convergence is fastest along the largest kernel principal components of the input data with respect to the NTK, hence suggesting a theoretical motivation for early stopping. Finally we study the NTK numerically, observe its behavior for wide networks, and compare it to the infinite-width limit.
translated by 谷歌翻译
引入了归一化层(例如,批处理归一化,层归一化),以帮助在非常深的网中获得优化困难,但它们显然也有助于概括,即使在不太深入的网中也是如此。由于长期以来的信念,即最小的最小值导致更好的概括,本文提供了数学分析和支持实验,这表明归一化(与伴随的重量赛一起)鼓励GD降低损失表面的清晰度。鉴于损失是标准不变的,这是标准化的已知结果,因此仔细地定义了“清晰度”。具体而言,对于具有归一化的相当广泛的神经网类,我们的理论解释了有限学习率的GD如何进入所谓的稳定边缘(EOS)制度,并通过连续的清晰度来表征GD的轨迹 - 还原流。
translated by 谷歌翻译
在深度学习中的优化分析是连续的,专注于(变体)梯度流动,或离散,直接处理(变体)梯度下降。梯度流程可符合理论分析,但是风格化并忽略计算效率。它代表梯度下降的程度是深度学习理论的一个开放问题。目前的论文研究了这个问题。将梯度下降视为梯度流量初始值问题的近似数值问题,发现近似程度取决于梯度流动轨迹周围的曲率。然后,我们表明,在具有均匀激活的深度神经网络中,梯度流动轨迹享有有利的曲率,表明它们通过梯度下降近似地近似。该发现允许我们将深度线性神经网络的梯度流分析转换为保证梯度下降,其几乎肯定会在随机初始化下有效地收敛到全局最小值。实验表明,在简单的深度神经网络中,具有传统步长的梯度下降确实接近梯度流。我们假设梯度流动理论将解开深入学习背后的奥秘。
translated by 谷歌翻译
We consider neural networks with a single hidden layer and non-decreasing positively homogeneous activation functions like the rectified linear units. By letting the number of hidden units grow unbounded and using classical non-Euclidean regularization tools on the output weights, they lead to a convex optimization problem and we provide a detailed theoretical analysis of their generalization performance, with a study of both the approximation and the estimation errors. We show in particular that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace. Moreover, when using sparsity-inducing norms on the input weights, we show that high-dimensional non-linear variable selection may be achieved, without any strong assumption regarding the data and with a total number of variables potentially exponential in the number of observations. However, solving this convex optimization problem in infinite dimensions is only possible if the non-convex subproblem of addition of a new unit can be solved efficiently. We provide a simple geometric interpretation for our choice of activation functions and describe simple conditions for convex relaxations of the finite-dimensional non-convex subproblem to achieve the same generalization error bounds, even when constant-factor approximations cannot be found. We were not able to find strong enough convex relaxations to obtain provably polynomial-time algorithms and leave open the existence or non-existence of such tractable algorithms with non-exponential sample complexities.
translated by 谷歌翻译
深度学习的概括分析通常假定训练会收敛到固定点。但是,最近的结果表明,实际上,用随机梯度下降优化的深神经网络的权重通常无限期振荡。为了减少理论和实践之间的这种差异,本文着重于神经网络的概括,其训练动力不一定会融合到固定点。我们的主要贡献是提出一个统计算法稳定性(SAS)的概念,该算法将经典算法稳定性扩展到非convergergent算法并研究其与泛化的联系。与传统的优化和学习理论观点相比,这种崇高的理论方法可导致新的见解。我们证明,学习算法的时间复杂行为的稳定性与其泛化有关,并在经验上证明了损失动力学如何为概括性能提供线索。我们的发现提供了证据表明,即使训练无限期继续并且权重也不会融合,即使训练持续进行训练,训练更好地概括”的网络也是如此。
translated by 谷歌翻译
We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $f^*$ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $f^*$ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $f^*$ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.
translated by 谷歌翻译