智能论文笔记

Statistical Efficiency of Score Matching: The View from Isoperimetry

Frederic Koehler , Alexander Heckett , Andrej Risteski

分类：机器学习 | (统计)机器学习

2022-10-03

Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method, whereby instead of fitting the likelihood $\log p(x)$ for the training data, we instead fit the score function $\nabla_x \log p(x)$ -- obviating the need to evaluate the partition function. Though this estimator is known to be consistent, its unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood -- which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated -- i.e. the Poincar\'e, log-Sobolev and isoperimetric constant -- quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant -- even for simple families of distributions like exponential families with rich enough sufficient statistics -- score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.

translated by 谷歌翻译

On lower bounds for the bias-variance trade-off

Alexis Derumigny , Johannes Schmidt-Hieber

分类： (统计)机器学习

2020-05-30

对于高维和非参数统计模型，速率最优估计器平衡平方偏差和方差是一种常见的现象。虽然这种平衡被广泛观察到，但很少知道是否存在可以避免偏差和方差之间的权衡的方法。我们提出了一般的策略，以获得对任何估计方差的下限，偏差小于预先限定的界限。这表明偏差差异折衷的程度是不可避免的，并且允许量化不服从其的方法的性能损失。该方法基于许多抽象的下限，用于涉及关于不同概率措施的预期变化以及诸如Kullback-Leibler或Chi-Sque-diversence的信息措施的变化。其中一些不平等依赖于信息矩阵的新概念。在该物品的第二部分中，将抽象的下限应用于几种统计模型，包括高斯白噪声模型，边界估计问题，高斯序列模型和高维线性回归模型。对于这些特定的统计应用，发生不同类型的偏差差异发生，其实力变化很大。对于高斯白噪声模型中集成平方偏置和集成方差之间的权衡，我们将较低界限的一般策略与减少技术相结合。这允许我们将原始问题与估计的估计器中的偏差折衷联动，以更简单的统计模型中具有额外的对称性属性。在高斯序列模型中，发生偏差差异的不同相位转换。虽然偏差和方差之间存在非平凡的相互作用，但是平方偏差的速率和方差不必平衡以实现最小估计速率。

translated by 谷歌翻译

Minimax Optimal Regression over Sobolev Spaces via Laplacian Eigenmaps on Neighborhood Graphs

Alden Green , Sivaraman Balakrishnan , Ryan J. Tibshirani

分类： (统计)机器学习

2021-11-14

本文研究了基于Laplacian Eigenmaps（Le）的基于Laplacian EIGENMAPS（PCR-LE）的主要成分回归的统计性质，这是基于Laplacian Eigenmaps（Le）的非参数回归的方法。 PCR-LE通过投影观察到的响应的向量$ {\ bf y} =（y_1，\ ldots，y_n）$ to to changbood图表拉普拉斯的某些特征向量跨越的子空间。我们表明PCR-Le通过SoboLev空格实现了随机设计回归的最小收敛速率。在设计密度$ P $的足够平滑条件下，PCR-le达到估计的最佳速率（其中已知平方$ l ^ 2 $ norm的最佳速率为$ n ^ { - 2s /（2s + d））} $）和健美的测试（$ n ^ { - 4s /（4s + d）$）。我们还表明PCR-LE是\ EMPH {歧管Adaptive}：即，我们考虑在小型内在维度$ M $的歧管上支持设计的情况，并为PCR-LE提供更快的界限Minimax估计（$ n ^ { - 2s /（2s + m）$）和测试（$ n ^ { - 4s /（4s + m）$）收敛率。有趣的是，这些利率几乎总是比图形拉普拉斯特征向量的已知收敛率更快;换句话说，对于这个问题的回归估计的特征似乎更容易，统计上讲，而不是估计特征本身。我们通过经验证据支持这些理论结果。

translated by 谷歌翻译

Triangular Flows for Generative Modeling: Statistical Consistency, Smoothness Classes, and Fast Rates

Nicholas J. Irons , Meyer Scetbon , Soumik Pal , Zaid Harchaoui

分类： (统计)机器学习 | 机器学习

2021-12-31

三角形流量，也称为kn \“{o}的Rosenblatt测量耦合，包括用于生成建模和密度估计的归一化流模型的重要构建块，包括诸如实值的非体积保存变换模型的流行自回归流模型（真实的NVP）。我们提出了三角形流量统计模型的统计保证和样本复杂性界限。特别是，我们建立了KN的统计一致性和kullback-leibler估算器的rospblatt的kullback-leibler估计的有限样本会聚率使用实证过程理论的工具测量耦合。我们的结果突出了三角形流动下播放功能类的各向异性几何形状，优化坐标排序，并导致雅各比比流动的统计保证。我们对合成数据进行数值实验，以说明我们理论发现的实际意义。

translated by 谷歌翻译

Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression

Lijia Zhou , Frederic Koehler , Danica J. Sutherland , Nathan Srebro

分类： (统计)机器学习 | 机器学习

2021-12-08

我们研究了称为“乐观速率”（Panchenko 2002; Srebro等，2010）的统一收敛概念，用于与高斯数据的线性回归。我们的精致分析避免了现有结果中的隐藏常量和对数因子，这已知在高维设置中至关重要，特别是用于了解插值学习。作为一个特殊情况，我们的分析恢复了Koehler等人的保证。（2021年），在良性过度的过度条件下，严格地表征了低规范内插器的人口风险。但是，我们的乐观速度绑定还分析了具有任意训练错误的预测因子。这使我们能够在随机设计下恢复脊和套索回归的一些经典统计保障，并有助于我们在过度参数化制度中获得精确了解近端器的过度风险。

translated by 谷歌翻译

Off-policy estimation of linear functionals: Non-asymptotic theory for semi-parametric efficiency

Wenlong Mou , Martin J. Wainwright , Peter L. Bartlett

分类： (统计)机器学习

2022-09-26

在因果推理和强盗文献中，基于观察数据的线性功能估算线性功能的问题是规范的。我们分析了首先估计治疗效果函数的广泛的两阶段程序，然后使用该数量来估计线性功能。我们证明了此类过程的均方误差上的非反应性上限：这些边界表明，为了获得非反应性最佳程序，应在特定加权$ l^2 $中最大程度地估算治疗效果的误差。 -规范。我们根据该加权规范的约束回归分析了两阶段的程序，并通过匹配非轴突局部局部最小值下限，在有限样品中建立了实例依赖性最优性。这些结果表明，除了取决于渐近效率方差之外，最佳的非质子风险除了取决于样本量支持的最富有函数类别的真实结果函数与其近似类别之间的加权规范距离。

translated by 谷歌翻译

Near optimal sample complexity for matrix and tensor normal models via geodesic convexity

Cole Franks , Rafael Oliveira , Akshay Ramachandran , Michael Walter

分类：机器学习

2021-10-14

矩阵正常模型，高斯矩阵变化分布的系列，其协方差矩阵是两个较低尺寸因子的Kronecker乘积，经常用于模拟矩阵变化数据。张量正常模型将该家庭推广到三个或更多因素的Kronecker产品。我们研究了矩阵和张量模型中协方差矩阵的Kronecker因子的估计。我们向几个自然度量中的最大似然估计器（MLE）实现的误差显示了非因素界限。与现有范围相比，我们的结果不依赖于条件良好或稀疏的因素。对于矩阵正常模型，我们所有的所有界限都是最佳的对数因子最佳，对于张量正常模型，我们对最大因数和整体协方差矩阵的绑定是最佳的，所以提供足够的样品以获得足够的样品以获得足够的样品常量Frobenius错误。在与我们的样本复杂性范围相同的制度中，我们表明迭代程序计算称为触发器算法称为触发器算法的MLE的线性地收敛，具有高概率。我们的主要工具是Fisher信息度量诱导的正面矩阵的几何中的测地强凸性。这种强大的凸起由某些随机量子通道的扩展来决定。我们还提供了数值证据，使得将触发器算法与简单的收缩估计器组合可以提高缺乏采样制度的性能。

translated by 谷歌翻译

The Voronoigram: Minimax Estimation of Bounded Variation Functions From Scattered Data

Addison J. Hu , Alden Green , Ryan J. Tibshirani

分类： (统计)机器学习 | 机器学习

2022-12-30

We consider the problem of estimating a multivariate function $f_0$ of bounded variation (BV), from noisy observations $y_i = f_0(x_i) + z_i$ made at random design points $x_i \in \mathbb{R}^d$, $i=1,\ldots,n$. We study an estimator that forms the Voronoi diagram of the design points, and then solves an optimization problem that regularizes according to a certain discrete notion of total variation (TV): the sum of weighted absolute differences of parameters $\theta_i,\theta_j$ (which estimate the function values $f_0(x_i),f_0(x_j)$) at all neighboring cells $i,j$ in the Voronoi diagram. This is seen to be equivalent to a variational optimization problem that regularizes according to the usual continuum (measure-theoretic) notion of TV, once we restrict the domain to functions that are piecewise constant over the Voronoi diagram. The regression estimator under consideration hence performs (shrunken) local averaging over adaptively formed unions of Voronoi cells, and we refer to it as the Voronoigram, following the ideas in Koenker (2005), and drawing inspiration from Tukey's regressogram (Tukey, 1961). Our contributions in this paper span both the conceptual and theoretical frontiers: we discuss some of the unique properties of the Voronoigram in comparison to TV-regularized estimators that use other graph-based discretizations; we derive the asymptotic limit of the Voronoi TV functional; and we prove that the Voronoigram is minimax rate optimal (up to log factors) for estimating BV functions that are essentially bounded.

translated by 谷歌翻译

Dimension-agnostic inference using cross U-statistics

Ilmun Kim , Aaditya Ramdas

分类： (统计)机器学习

2020-11-10

Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference; developing methods whose validity does not depend on any assumption on $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution, regardless of how $d$ scales with $n$. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for some classical problems including one-sample mean and covariance testing, and show that our tests have minimax rate-optimal power against appropriate local alternatives. In most settings, our cross U-statistic matches the high-dimensional power of the corresponding (degenerate) U-statistic up to a $\sqrt{2}$ factor.

translated by 谷歌翻译

On the Statistical Complexity of Sample Amplification

Brian Axelrod , Shivam Garg , Yanjun Han , Vatsal Sharan , Gregory Valiant

分类：机器学习

2022-01-12

鉴于$ n $ i.i.d.从未知的分发$ P $绘制的样本，何时可以生成更大的$ n + m $ samples，这些标题不能与$ n + m $ i.i.d区别区别。从$ p $绘制的样品？（AXELROD等人2019）将该问题正式化为样本放大问题，并为离散分布和高斯位置模型提供了最佳放大程序。然而，这些程序和相关的下限定制到特定分布类，对样本扩增的一般统计理解仍然很大程度上。在这项工作中，我们通过推出通常适用的放大程序，下限技术和与现有统计概念的联系来放置对公司统计基础的样本放大问题。我们的技术适用于一大类分布，包括指数家庭，并在样本放大和分配学习之间建立严格的联系。

translated by 谷歌翻译

Asymptotic Statistical Analysis of $f$-divergence GAN

Xinwei Shen , Kani Chen , Tong Zhang

分类：机器学习 | (统计)机器学习

2022-09-14

生成对抗网络（GAN）在数据生成方面取得了巨大成功。但是，其统计特性尚未完全理解。在本文中，我们考虑了GAN的一般$ f $ divergence公式的统计行为，其中包括Kullback- Leibler Divergence与最大似然原理密切相关。我们表明，对于正确指定的参数生成模型，在适当的规律性条件下，所有具有相同歧视类别类别的$ f $ divergence gans均在渐近上等效。 Moreover, with an appropriately chosen local discriminator, they become equivalent to the maximum likelihood estimate asymptotically.对于被误解的生成模型，具有不同$ f $ -Divergences {收敛到不同估计器}的gan，因此无法直接比较。但是，结果表明，对于某些常用的$ f $ -Diverences，原始的$ f $ gan并不是最佳的，因为当更换原始$ f $ gan配方中的判别器培训时，可以实现较小的渐近方差通过逻辑回归。结果估计方法称为对抗梯度估计（年龄）。提供了实证研究来支持该理论，并证明了年龄的优势，而不是模型错误的原始$ f $ gans。

translated by 谷歌翻译

Smooth $p$-Wasserstein Distance: Structure, Empirical Approximation, and Statistical Applications

Sloan Nietert , Ziv Goldfeld , Kengo Kato

分类： (统计)机器学习

2021-01-11

概率分布之间的差异措施，通常被称为统计距离，在概率理论，统计和机器学习中普遍存在。为了在估计这些距离的距离时，对维度的诅咒，最近的工作已经提出了通过带有高斯内核的卷积在测量的分布中平滑局部不规则性。通过该框架的可扩展性至高维度，我们研究了高斯平滑$ P $ -wassersein距离$ \ mathsf {w} _p ^ {（\ sigma）} $的结构和统计行为，用于任意$ p \ GEQ 1 $。在建立$ \ mathsf {w} _p ^ {（\ sigma）} $的基本度量和拓扑属性之后，我们探索$ \ mathsf {w} _p ^ {（\ sigma）}（\ hat {\ mu} _n，\ mu）$，其中$ \ hat {\ mu} _n $是$ n $独立观察的实证分布$ \ mu $。我们证明$ \ mathsf {w} _p ^ {（\ sigma）} $享受$ n ^ { - 1/2} $的参数经验融合速率，这对比$ n ^ { - 1 / d} $率对于未平滑的$ \ mathsf {w} _p $ why $ d \ geq 3 $。我们的证明依赖于控制$ \ mathsf {w} _p ^ {（\ sigma）} $ by $ p $ th-sting spoollow sobolev restion $ \ mathsf {d} _p ^ {（\ sigma）} $并导出限制$ \ sqrt {n} \，\ mathsf {d} _p ^ {（\ sigma）}（\ hat {\ mu} _n，\ mu）$，适用于所有尺寸$ d $。作为应用程序，我们提供了使用$ \ mathsf {w} _p ^ {（\ sigma）} $的两个样本测试和最小距离估计的渐近保证，使用$ p = 2 $的实验使用$ \ mathsf {d} _2 ^ {（\ sigma）} $。

translated by 谷歌翻译

Mean field Variational Inference via Wasserstein Gradient Flow

Rentian Yao , Yun Yang

分类： (统计)机器学习

2022-07-17

变性推理（VI）为基于传统的采样方法提供了一种吸引人的替代方法，用于实施贝叶斯推断，因为其概念性的简单性，统计准确性和计算可扩展性。然而，常见的变分近似方案（例如平均场（MF）近似）需要某些共轭结构以促进有效的计算，这可能会增加不必要的限制对可行的先验分布家族，并对变异近似族对差异进行进一步的限制。在这项工作中，我们开发了一个通用计算框架，用于实施MF-VI VIA WASSERSTEIN梯度流（WGF），这是概率度量空间上的梯度流。当专门针对贝叶斯潜在变量模型时，我们将分析基于时间消化的WGF交替最小化方案的算法收敛，用于实现MF近似。特别是，所提出的算法类似于EM算法的分布版本，包括更新潜在变量变异分布的E step以及在参数的变异分布上进行最陡峭下降的m step。我们的理论分析依赖于概率度量空间中的最佳运输理论和细分微积分。我们证明了时间限制的WGF的指数收敛性，以最大程度地减少普通大地测量学严格的凸度的通用物镜功能。我们还提供了通过使用时间限制的WGF的固定点方程从MF近似获得的变异分布的指数收缩的新证明。我们将方法和理论应用于两个经典的贝叶斯潜在变量模型，即高斯混合模型和回归模型的混合物。还进行了数值实验，以补充这两个模型下的理论发现。

translated by 谷歌翻译

Neural Estimation of Statistical Divergences

Sreejith Sreekumar , Ziv Goldfeld

分类： (统计)机器学习

2021-10-07

量化概率分布之间的异化的统计分歧（SDS）是统计推理和机器学习的基本组成部分。用于估计这些分歧的现代方法依赖于通过神经网络（NN）进行参数化经验变化形式并优化参数空间。这种神经估算器在实践中大量使用，但相应的性能保证是部分的，并呼吁进一步探索。特别是，涉及的两个错误源之间存在基本的权衡：近似和经验估计。虽然前者需要NN课程富有富有表现力，但后者依赖于控制复杂性。我们通过非渐近误差界限基于浅NN的基于浅NN的估计的估算权，重点关注四个流行的$ \ mathsf {f} $ - 分离 - kullback-leibler，chi squared，squared hellinger，以及总变异。我们分析依赖于实证过程理论的非渐近功能近似定理和工具。界限揭示了NN尺寸和样品数量之间的张力，并使能够表征其缩放速率，以确保一致性。对于紧凑型支持的分布，我们进一步表明，上述上三次分歧的神经估算器以适当的NN生长速率接近Minimax率 - 最佳，实现了对数因子的参数速率。

translated by 谷歌翻译

Convergence Rates for the MAP of an Exponential Family and Stochastic Mirror Descent -- an Open Problem

Rémi Le Priol , Frederik Kunstner , Damien Scieur , Simon Lacoste-Julien

分类： (统计)机器学习 | 机器学习

2021-11-12

我们以非渐近方式考虑最大似然估计（MLE）的预期对数估计（MLE）的预期似然估计（MLE）的最佳次数（MAL）的缀合物最大（MAP）的问题。令人惊讶的是，我们在文献中没有找到对这个问题的一般解决方案。特别是，当前的理论不适用于高斯或有趣的少数样本制度。在表现出问题的各个方面之后，我们显示我们可以将地图解释为在日志可能性上运行随机镜像下降（SMD）。然而，现代收敛结果不适用于指数家庭的标准例子，突出趋同文献中的孔。我们认为解决这一非常根本的问题可能会对统计和优化社区带来进展。

translated by 谷歌翻译

Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting

Frederic Koehler , Lijia Zhou , Danica J. Sutherland , Nathan Srebro

分类： (统计)机器学习 | 机器学习

2021-06-17

我们考虑与高斯数据的高维线性回归中的插值学习，并在类高斯宽度方面证明了任意假设类别中的内插器的泛化误差。将通用绑定到欧几里德常规球恢复了Bartlett等人的一致性结果。（2020）对于最小规范内插器，并确认周等人的预测。（2020）在高斯数据的特殊情况下，对于近乎最小常态的内插器。我们通过将其应用于单位来证明所界限的一般性，从而获得最小L1-NORM Interpoolator（基础追踪）的新型一致性结果。我们的结果表明，基于规范的泛化界限如何解释并用于分析良性过度装备，至少在某些设置中。

translated by 谷歌翻译

Statistical Inference with Local Optima

Yen-Chi Chen

分类： (统计)机器学习

2018-07-12

我们研究通过应用具有多个初始化的梯度上升方法来源的估计器的统计特性。我们派生了该估算器的目标的人口数量，并研究了从渐近正常性和自举方法构成的置信区间（CIS）的性质。特别是，我们通过有限数量的随机初始化来分析覆盖范围。我们还通过反转可能性比率测试，得分测试和WALD测试来调查CI，我们表明所得到的CIS可能非常不同。即使MLE是棘手的，我们也提出了一种两个样本测试程序。此外，我们在随机初始化下分析了EM算法的性能，并通过有限数量的初始化导出了CI的覆盖范围。

translated by 谷歌翻译

The Projected Covariance Measure for assumption-lean variable significance testing

Anton Rask Lundborg , Ilmun Kim , Rajen D. Shah , Richard J. Samworth

分类： (统计)机器学习

2022-11-03

Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections. In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches.

translated by 谷歌翻译

Riemannian Langevin Algorithm for Solving Semidefinite Programs

Mufan Bill Li , Murat A. Erdogdu

分类： (统计)机器学习 | 机器学习

2020-10-21

我们提出了一种基于langevin扩散的算法，以在球体的产物歧管上进行非凸优化和采样。在对数Sobolev不平等的情况下，我们根据Kullback-Leibler Divergence建立了有限的迭代迭代收敛到Gibbs分布的保证。我们表明，有了适当的温度选择，可以保证，次级最小值的次数差距很小，概率很高。作为一种应用，我们考虑了使用对角线约束解决半决赛程序（SDP）的burer- monteiro方法，并分析提出的langevin算法以优化非凸目标。特别是，我们为Burer建立了对数Sobolev的不平等现象 - 当没有虚假的局部最小值时，但在鞍点下，蒙蒂罗问题。结合结果，我们为SDP和最大切割问题提供了全局最佳保证。更确切地说，我们证明了Langevin算法在$ \ widetilde {\ omega}（\ epsilon^{ - 5}）$ tererations $ tererations $ \ widetilde {\ omega}（\ omega}中，具有很高的概率。

translated by 谷歌翻译

Quantifying the Effects of Data Augmentation

Kevin H. Huang , Peter Orbanz , Morgane Austern

分类：机器学习 | (统计)机器学习

2022-02-18

We provide results that exactly quantify how data augmentation affects the convergence rate and variance of estimates. They lead to some unexpected findings: Contrary to common intuition, data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables. The pathological behavior we identify is not a consequence of complex models, but can occur even in the simplest settings -- one of our examples is a ridge regressor with two parameters. On the other hand, our results also show that data augmentation can have real, quantifiable benefits.

translated by 谷歌翻译