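Gaussian process hyperparameter optimization requires linear solves with, and log-determinants of, large kernel matrices. Iterative numerical techniques that rely on the conjugate gradient method (CG) for the linear solves and on stochastic trace estimation for the log-determinant are becoming increasingly popular. This work introduces new algorithms and theoretical insights for preconditioning these computations. While preconditioning is well understood in the context of CG, we demonstrate that it can also accelerate convergence and reduce the variance of the estimates of the log-determinant and its derivative. We prove general probabilistic error bounds for the preconditioned computation of the log-determinant, the log-marginal likelihood, and its derivatives. In addition, we derive specific rates for a range of kernel-preconditioner combinations, showing that up to exponential convergence can be achieved. Our theoretical results enable provably efficient optimization of kernel hyperparameters, which we validate empirically on large-scale benchmark problems. Our approach accelerates training by up to an order of magnitude.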
Translated by Google Translate
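The stochastic trace estimation mentioned in the first abstract can be illustrated with the Hutchinson estimator, which approximates $\mathrm{tr}(A)$ using only matrix-vector products. The sketch below is a minimal illustration and not the authors' method: it estimates a plain trace rather than $\log\det K = \mathrm{tr}(\log K)$, uses no preconditioner, and the names `hutchinson_trace` and `matvec_A` and the toy matrix are assumptions made for the example.

```python
import random


def hutchinson_trace(matvec, n, num_probes=200, seed=0):
    """Estimate tr(A) from matrix-vector products with Rademacher probes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: entries are +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Az = matvec(z)
        # z^T A z is an unbiased estimate of tr(A).
        total += sum(zi * yi for zi, yi in zip(z, Az))
    return total / num_probes


# Hypothetical toy symmetric matrix, accessed only through matvec_A.
A = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]


def matvec_A(v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


estimate = hutchinson_trace(matvec_A, n=3)
exact = sum(A[i][i] for i in range(3))
print(estimate, exact)
```

The variance of each probe comes entirely from the off-diagonal entries of the matrix, which gives some intuition for why a transformation that shrinks them, such as a good preconditioner, can reduce the variance of the estimate.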
Despite advances in scalable models, the inference tools used for Gaussian processes (GPs) have yet to fully capitalize on developments in computing hardware. We present an efficient and general approach to GP inference based on Blackbox Matrix-Matrix multiplication (BBMM). BBMM inference uses a modified batched version of the conjugate gradients algorithm to derive all terms for training and inference in a single call. BBMM reduces the asymptotic complexity of exact GP inference from $O(n^3)$ to $O(n^2)$. Adapting this algorithm to scalable approximations and complex GP models simply requires a routine for efficient matrix-matrix multiplication with the kernel and its derivative. In addition, BBMM uses a specialized preconditioner to substantially speed up convergence. In experiments we show that BBMM effectively uses GPU hardware to dramatically accelerate both exact GP inference and scalable approximations. Additionally, we provide GPyTorch, a software platform for scalable GP inference via BBMM, built on PyTorch.
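The key structural point of the abstract above is that the solver touches the kernel matrix only through multiplication. The sketch below shows this with textbook conjugate gradients driven by a `matvec` callback. It is not the modified batched algorithm the abstract describes and it is not GPyTorch code. The RBF kernel, the noise level, and all names are assumptions made for illustration.

```python
import math


def conjugate_gradients(matvec, b, tol=1e-10, max_iter=100):
    """Solve K x = b for symmetric positive definite K, given only v -> K v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - K x, with x = 0 initially
    p = list(r)          # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if math.sqrt(rs_old) < tol:
            break
        Kp = matvec(p)
        step = rs_old / sum(pi * kpi for pi, kpi in zip(p, Kp))
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * kpi for ri, kpi in zip(r, Kp)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical toy problem: RBF kernel on four inputs plus observation noise.
xs = [0.0, 0.5, 1.0, 1.5]
noise = 0.1
K = [[math.exp(-0.5 * (a - c) ** 2) + (noise if i == j else 0.0)
      for j, c in enumerate(xs)] for i, a in enumerate(xs)]


def kernel_matvec(v):
    return [sum(k * x for k, x in zip(row, v)) for row in K]


y = [1.0, 0.5, -0.3, 0.8]
alpha = conjugate_gradients(kernel_matvec, y)   # alpha approximates K^{-1} y
residual = max(abs(r - t) for r, t in zip(kernel_matvec(alpha), y))
print(alpha, residual)
```

Because `kernel_matvec` is the only place the kernel appears, swapping in a structured or approximate kernel only requires replacing that one function.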
Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.
Influence diagnostics such as influence functions and approximate maximum influence perturbations are popular in machine learning and in AI domain applications. Influence diagnostics are powerful statistical tools to identify influential datapoints or subsets of datapoints. We establish finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. We illustrate our results with generalized linear models and large attention based models on synthetic and real data.
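We characterize the power-law asymptotics of learning curves for Gaussian process regression (GPR) under power-law assumptions on the prior and on the eigenexpansion coefficients of the target function. Under similar assumptions, we leverage the equivalence between GPR and kernel ridge regression (KRR) to characterize the generalization error of KRR. Infinitely wide neural networks can be related to GPR with respect to the neural network GP kernel and the neural tangent kernel, which are known to have power-law spectra in several cases. Our methods can therefore be applied to study the generalization error of infinitely wide neural networks. We present toy experiments that demonstrate the theory.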
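Low-precision arithmetic has had a transformative effect on the training of neural networks, reducing computation, memory, and energy requirements. Despite its promise, however, low-precision arithmetic has received little attention for Gaussian processes (GPs), largely because GPs require sophisticated linear algebra routines that are unstable in low precision. We study the different failure modes that can occur when training GPs in half precision. To avoid these failure modes, we propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, and preconditioning. Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low precision across a wide range of settings, enabling GPs to be trained on very large datasets on a single GPU without any sparse approximations.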
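Randomized singular value decomposition (RSVD) is a class of computationally efficient algorithms for computing the truncated SVD of large data matrices. Given an $n \times n$ symmetric matrix $\mathbf{M}$, the prototypical RSVD algorithm outputs an approximation of the $k$ leading singular vectors of $\mathbf{M}$ by computing the SVD of $\mathbf{M}^{g} \mathbf{G}$; here $g \geq 1$ is an integer and $\mathbf{G} \in \mathbb{R}^{n \times k}$ is a random Gaussian sketching matrix. In this paper we study the statistical properties of RSVD under a general "signal-plus-noise" framework, i.e., the observed matrix $\hat{\mathbf{M}}$ is assumed to be an additive perturbation of some true but unknown signal matrix $\mathbf{M}$. We first derive upper bounds for the $\ell_2$ (spectral norm) and $\ell_{2 \to \infty}$ (maximum row-wise $\ell_2$ norm) distances between the approximate singular vectors of $\hat{\mathbf{M}}$ and the true singular vectors of the signal matrix $\mathbf{M}$. These upper bounds depend on the signal-to-noise ratio (SNR) and the number of power iterations $g$. We observe a phase transition phenomenon in which a smaller SNR requires larger values of $g$ to guarantee convergence of the $\ell_2$ and $\ell_{2 \to \infty}$ distances. We also show that the thresholds for $g$ at which these phase transitions occur are sharp whenever the noise matrices satisfy a certain trace growth condition. Finally, we derive normal approximations for the row-wise fluctuations of the approximate singular vectors and the entrywise fluctuations of the approximate matrix. We illustrate our theoretical results by deriving nearly optimal performance guarantees for RSVD when applied to three statistical inference problems, namely community detection, matrix completion, and principal component analysis with missing data.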
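For $k = 1$ the prototypical algorithm above reduces to something very small: the SVD of the $n \times 1$ matrix $\mathbf{M}^{g}\mathbf{G}$ has the normalized vector $\mathbf{M}^{g}\mathbf{G}$ as its only left singular vector. The sketch below implements that special case on a noiseless toy matrix. The matrix, the function names, and the choice $k = 1$ are assumptions made for illustration, and the noise model from the abstract is not simulated.

```python
import math
import random


def rsvd_leading_vector(M, g, seed=0):
    """Approximate the leading singular vector of symmetric M via M^g G, k = 1."""
    rng = random.Random(seed)
    n = len(M)
    # Gaussian sketching vector G (a single column, since k = 1).
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(g):
        # One power iteration: v <- M v.
        v = [sum(m * x for m, x in zip(row, v)) for row in M]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def alignment(u, v):
    """Absolute cosine between two unit vectors (1.0 means same direction)."""
    return abs(sum(a * b for a, b in zip(u, v)))


# Hypothetical symmetric matrix with eigenvalues 3, 1, 0.5.
M = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 0.5]]
true_u = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0), 0.0]

align_g1 = alignment(rsvd_leading_vector(M, g=1), true_u)
align_g30 = alignment(rsvd_leading_vector(M, g=30), true_u)
print(align_g1, align_g30)
```

Each extra power iteration shrinks the components outside the leading direction by the eigenvalue ratio, so the alignment can only improve as $g$ grows. This is the mechanism behind the abstract's statement that a smaller SNR calls for larger $g$.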
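Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples, owing to their generalization performance and intrinsic computational advantage. However, with correlated samples the stochastic gradient is a biased estimator of the full gradient. This has led to a lack of theoretical understanding of how SGD behaves in correlated settings and has hindered its use in such cases. In this paper, we focus on approximate parameter estimation for Gaussian processes (GPs) and take a step toward breaking this barrier by proving that minibatch SGD converges to a critical point of the full log-likelihood loss function and recovers the model hyperparameters at rate $O(\frac{1}{K})$ after $K$ iterations, up to a statistical error term that depends on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay, which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD generalizes better than state-of-the-art GP methods while reducing the computational burden and opening up a new, previously unexplored data-size regime for GPs.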
We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given an oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch. Despite using second-order information, these existing methods do not exhibit superlinear convergence, unless the stochastic noise is gradually reduced to zero during the iteration, which would lead to a computational blow-up in the per-iteration cost. We propose to address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all the past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme exhibits local $Q$-superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$, where $\Upsilon$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the method, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.
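The uniform Hessian averaging idea from the abstract above can be sketched in one dimension. The example below is an illustrative toy, not the paper's algorithm or analysis: the objective $f(x) = x^2/2 + \log(1 + e^x)$, the multiplicative uniform noise model for the Hessian oracle, and all names are assumptions chosen so the code stays short and self-contained.

```python
import math
import random


def grad(x):
    """Exact gradient of f(x) = x^2 / 2 + log(1 + exp(x))."""
    return x + 1.0 / (1.0 + math.exp(-x))


def hess(x):
    """Exact second derivative of f, always between 1 and 1.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return 1.0 + s * (1.0 - s)


def newton_hessian_averaging(x0, num_iters=50, noise_level=0.3, seed=0):
    """Newton steps that use the running average of noisy Hessian estimates."""
    rng = random.Random(seed)
    x = x0
    hess_avg = 0.0
    for t in range(num_iters):
        # Hypothetical stochastic oracle: true Hessian times (1 + noise).
        noisy_hess = hess(x) * (1.0 + rng.uniform(-noise_level, noise_level))
        # Uniform average of all estimates seen so far.
        hess_avg += (noisy_hess - hess_avg) / (t + 1)
        x -= grad(x) / hess_avg
    return x


x_final = newton_hessian_averaging(x0=3.0)
print(x_final, grad(x_final))
```

Early iterates contribute Hessians evaluated far from the minimizer to the average, which is the distortion the abstract attributes to uniform averaging. A weighted scheme would replace the `1 / (t + 1)` factor with weights that favor recent estimates.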
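We develop a computational procedure to estimate the covariance hyperparameters of semiparametric Gaussian process regression models with additive noise. Specifically, the proposed method can be used to efficiently estimate the variance of the correlated error and the variance of the noise by maximizing a marginal likelihood function. Our method suitably reduces the dimensionality of the hyperparameter space so that the estimation procedure simplifies to a univariate root-finding problem. Moreover, we derive bounds and asymptotes of the marginal likelihood function and its derivatives, which are useful for narrowing the initial range of the hyperparameter search. Using numerical examples, we demonstrate the computational advantages and robustness of the proposed method compared with traditional parameter optimization.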
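It is often desirable to reduce the dimensionality of a large dataset by projecting it onto a low-dimensional subspace. Matrix sketching has emerged as a powerful technique for performing such dimensionality reduction very efficiently. Although there is an extensive literature on the worst-case performance of sketching, existing guarantees are typically very different from what is observed in practice. We exploit recent developments in the spectral analysis of random matrices to develop new techniques that provide accurate expressions for the expected value of random projection matrices obtained via sketching. These expressions can be used to characterize the performance of dimensionality reduction in a variety of common machine learning tasks, ranging from low-rank approximation to iterative stochastic optimization. Our results apply to several popular sketching methods, including Gaussian and Rademacher sketches, and they enable precise analysis of these methods in terms of the spectral properties of the data. Empirical results show that the expressions we derive reflect the practical performance of these sketching methods, down to lower-order effects and even constant factors.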
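We study the learning properties of nonparametric ridgeless least squares. In particular, we consider the common case of estimators defined by scale-dependent kernels and focus on the role of the scale. These estimators interpolate the data, and the scale can be shown to control their stability through the condition number. Our analysis shows that there are different regimes, depending on the interplay between the sample size, its dimension, and the smoothness of the problem. Indeed, when the sample size is less than exponential in the data dimension, the scale can be chosen so that the learning error decreases. As the sample size becomes larger, the overall error stops decreasing but, interestingly, the scale can be chosen so that the variance due to noise remains bounded. Our analysis combines probabilistic results with a number of analytic techniques from interpolation theory.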
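Sketch-and-project is a framework that unifies many known iterative methods for solving linear systems and their variants, and that further extends to nonlinear optimization problems. It includes popular methods such as randomized Kaczmarz, coordinate descent, and variants of the Newton method in convex optimization. In this paper, we obtain sharp guarantees for the convergence rate of sketch-and-project methods via new tight spectral bounds for the expected sketched projection matrix. Our estimates reveal a connection between the convergence rate of sketch-and-project and the approximation error of another well-known but seemingly unrelated family of algorithms, which use sketching to accelerate popular matrix factorizations such as QR and SVD. This connection brings us closer to precisely quantifying how the performance of sketch-and-project solvers depends on their sketch size. Our analysis covers not only Gaussian and sub-Gaussian sketching matrices but also a family of efficient sparse sketching methods known as LESS embeddings. Our experiments back up the theory and demonstrate that even extremely sparse sketches show the same convergence properties in practice.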
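Randomized Kaczmarz, named above as one instance of sketch-and-project, uses a sketch of size one: each step samples a single row and projects the iterate onto the hyperplane defined by that equation. The following is a minimal sketch of the standard method with rows sampled in proportion to their squared norms. The toy system and the function name are assumptions made for illustration.

```python
import random


def randomized_kaczmarz(A, b, num_iters=500, seed=0):
    """Solve a consistent system A x = b by random single-row projections."""
    rng = random.Random(seed)
    n = len(A[0])
    row_sq = [sum(a * a for a in row) for row in A]
    x = [0.0] * n
    # Sample row indices with probability proportional to squared row norm.
    for i in rng.choices(range(len(A)), weights=row_sq, k=num_iters):
        resid = b[i] - sum(a * xi for a, xi in zip(A[i], x))
        # Project x onto the hyperplane {z : A[i] . z = b[i]}.
        x = [xi + resid / row_sq[i] * a for xi, a in zip(x, A[i])]
    return x


# Hypothetical consistent overdetermined system with known solution.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0], [0.0, 2.0]]
x_true = [1.0, -2.0]
b = [sum(a * xi for a, xi in zip(row, x_true)) for row in A]

x_hat = randomized_kaczmarz(A, b)
print(x_hat)
```

Larger sketches would project onto the intersection of several sampled equations at once, which is the regime where the dependence on sketch size studied in the abstract becomes relevant.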
Many problems in causal inference and economics can be formulated in the framework of conditional moment models, which characterize the target function through a collection of conditional moment restrictions. For nonparametric conditional moment models, efficient estimation often relies on preimposed conditions on various measures of ill-posedness of the hypothesis space, which are hard to validate when flexible models are used. In this work, we address this issue by proposing a procedure that automatically learns representations with controlled measures of ill-posedness. Our method approximates a linear representation defined by the spectral decomposition of a conditional expectation operator, which can be used for kernelized estimators and is known to facilitate minimax optimal estimation in certain settings. We show this representation can be efficiently estimated from data, and establish L2 consistency for the resulting estimator. We evaluate the proposed method on proximal causal inference tasks, exhibiting promising performance on high-dimensional, semi-synthetic data.
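This paper develops a general framework for estimating high-dimensional conditional factor models via nuclear norm regularization. We establish large-sample properties of the estimators and provide an efficient algorithm for computing them, as well as a cross-validation procedure for choosing the regularization parameter. The general framework allows us to estimate a variety of conditional factor models in a unified way and to quickly deliver new asymptotic results. We apply the method to analyze the cross section of individual US stock returns and find that imposing homogeneity can improve the model's out-of-sample predictability.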
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast with O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
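Learning the parameters of a linear time-invariant dynamical system (LTID) is a problem of current interest. In many applications, one is interested in jointly learning the parameters of multiple related LTIDs, which remains unexplored to date. To that end, we develop a joint estimator for learning the transition matrices of LTIDs that share common basis matrices. Further, we establish finite-time error bounds that depend on the underlying sample size, dimension, number of tasks, and spectral properties of the transition matrices. The results are obtained under mild regularity assumptions and showcase the gains from pooling information across LTIDs, in comparison to learning each system separately. We also study the impact of misspecifying the joint structure of the transition matrices and show that the established results are robust in the presence of moderate misspecification.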
When comparing approximate Gaussian process (GP) models, it can be helpful to be able to generate data from any GP. If we are interested in how approximate methods perform at scale, we may wish to generate very large synthetic datasets to evaluate them. Naively doing so would cost \(\mathcal{O}(n^3)\) flops and \(\mathcal{O}(n^2)\) memory to generate a size \(n\) sample. We demonstrate how to scale such data generation to large \(n\) whilst still providing guarantees that, with high probability, the sample is indistinguishable from a sample from the desired GP.
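The naive baseline referred to above is the standard Cholesky recipe: form the \(n \times n\) kernel matrix (\(\mathcal{O}(n^2)\) memory), factor it as \(K = LL^\top\) (\(\mathcal{O}(n^3)\) flops), and return \(Lz\) for standard normal \(z\). The sketch below shows that baseline only, not the scalable method the abstract proposes. The RBF kernel, the jitter value, and all names are assumptions made for illustration.

```python
import math
import random


def rbf_kernel_matrix(xs, lengthscale=1.0, jitter=1e-8):
    """Dense n x n RBF kernel matrix; this is the O(n^2) memory cost."""
    return [[math.exp(-0.5 * ((a - c) / lengthscale) ** 2)
             + (jitter if i == j else 0.0)
             for j, c in enumerate(xs)] for i, a in enumerate(xs)]


def cholesky(K):
    """Lower-triangular L with L L^T = K; this is the O(n^3) flop cost."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(K[i][i] - s)
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L


def sample_gp(xs, seed=0):
    """Draw one zero-mean GP sample at inputs xs as L z, with z ~ N(0, I)."""
    rng = random.Random(seed)
    L = cholesky(rbf_kernel_matrix(xs))
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
sample = sample_gp(xs)
print(sample)
```

The triple loop inside `cholesky` is what makes this approach impractical for the very large \(n\) the abstract targets.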
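We present an algorithmic framework for quantum-inspired classical algorithms on close-to-low-rank matrices, generalizing the series of results started by Tang's breakthrough quantum-inspired algorithm for recommendation systems [STOC'19]. Motivated by quantum linear algebra algorithms and the quantum singular value transformation (SVT) framework of Gilyén, Su, Low, and Wiebe [STOC'19], we develop classical algorithms for SVT under suitable quantum-inspired sampling assumptions. Our results give compelling evidence that, in the corresponding QRAM data structure input model, quantum SVT does not yield exponential quantum speedups. Since the quantum SVT framework generalizes essentially all known techniques for quantum linear algebra, our results, combined with sampling lemmas from previous work, suffice to generalize all recent results on dequantizing quantum machine learning algorithms. In particular, our classical SVT framework recovers and often improves the dequantization results on recommendation systems, principal component analysis, supervised clustering, support vector machines, low-rank regression, and semidefinite program solving. We also give additional dequantization results on low-rank Hamiltonian simulation and discriminant analysis. Our improvements come from identifying the key feature of the quantum-inspired input model that is at the core of all prior quantum-inspired results: $\ell^2$-norm sampling can approximate matrix products in time independent of their dimension. We reduce all our main results to this fact, making our exposition concise, self-contained, and intuitive.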
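The fact that the abstract above singles out, that $\ell^2$-norm sampling approximates matrix products, corresponds to a classical randomized estimator. It samples index $k$ with probability $p_k \propto \|A_{:,k}\|^2$ and averages the rescaled outer products $A_{:,k} B_{k,:} / p_k$. The sketch below is a plain version of that estimator, not the paper's framework. The random test matrices and all names are assumptions, and the sampling step here scans all columns instead of using the sampling data structure the input model assumes.

```python
import math
import random


def sampled_matmul(A, B, num_samples, seed=0):
    """Approximate A @ B by averaging l2-norm-sampled column-row outer products."""
    rng = random.Random(seed)
    m, n, p = len(A), len(B), len(B[0])
    col_sq = [sum(A[i][k] ** 2 for i in range(m)) for k in range(n)]
    total = sum(col_sq)
    probs = [c / total for c in col_sq]
    C = [[0.0] * p for _ in range(m)]
    for k in rng.choices(range(n), weights=probs, k=num_samples):
        # Rescaling by 1 / (s * p_k) keeps the estimator unbiased.
        scale = 1.0 / (num_samples * probs[k])
        for i in range(m):
            for j in range(p):
                C[i][j] += scale * A[i][k] * B[k][j]
    return C


def frobenius(X):
    return math.sqrt(sum(v * v for row in X for v in row))


# Hypothetical inputs: A is 3 x 50, B is 50 x 2; the shared dimension is 50.
data_rng = random.Random(1)
A = [[data_rng.uniform(-1.0, 1.0) for _ in range(50)] for _ in range(3)]
B = [[data_rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(50)]

exact = [[sum(A[i][k] * B[k][j] for k in range(50)) for j in range(2)]
         for i in range(3)]
approx = sampled_matmul(A, B, num_samples=5000)
diff = [[approx[i][j] - exact[i][j] for j in range(2)] for i in range(3)]
rel_err = frobenius(diff) / (frobenius(A) * frobenius(B))
print(rel_err)
```

The expected squared Frobenius error of this estimator is at most $\|A\|_F^2 \|B\|_F^2 / s$ for $s$ samples, so the number of samples needed for a given accuracy does not depend on the shared dimension.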
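We prove a non-asymptotic, distribution-independent lower bound for the expected mean squared generalization error caused by label noise in ridgeless linear regression. Our lower bound generalizes a similar known result to the overparameterized (interpolating) regime. In contrast to most previous works, our analysis applies to a broad class of input distributions with almost surely full-rank feature matrices, which allows us to cover various types of deterministic or random feature maps. Our lower bound is asymptotically sharp and implies that, in the presence of label noise, ridgeless linear regression does not perform well around the interpolation threshold for any of these feature maps. We analyze the imposed assumptions in detail and provide a theory for analytic (random) feature maps. Using this theory, we show that our assumptions are satisfied for input distributions with a (Lebesgue) density and for feature maps given by random deep neural networks with analytic activation functions such as sigmoid, tanh, softplus, or GELU. As further examples, we show that feature maps from random Fourier features and polynomial kernels also satisfy our assumptions. We complement our theory with further experimental and analytic results.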