Smart Paper Notes

Mining the Factor Zoo: Estimation of Latent Factor Models with Sufficient Proxies

Runzhe Wan , Yingying Li , Wenbin Lu , Rui Song

Category: Machine Learning

2022-12-25

Latent factor model estimation typically relies on either using domain knowledge to manually pick several observed covariates as factor proxies, or purely conducting multivariate analysis such as principal component analysis. However, the former approach may suffer from bias while the latter cannot incorporate additional information. We propose to bridge these two approaches while allowing the number of factor proxies to diverge, and hence make the latent factor model estimation robust, flexible, and statistically more accurate. As a bonus, the number of factors is also allowed to grow. At the heart of our method is a penalized reduced rank regression to combine information. To further deal with heavy-tailed data, a computationally attractive penalized robust reduced rank regression method is proposed. We establish faster rates of convergence compared with the benchmark. Extensive simulations and real examples are used to illustrate the advantages.

Translated by Google Translate

Preprocessing noisy functional data: a multivariate perspective

Siegfried Hörmann , Fatima Jammoul

Category: Machine Learning (Statistics)

2020-12-10

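We consider functional data that are measured on a discrete set of observation points. Such data are often measured with additional noise. In this paper we explore the factor structure underlying this type of data. We show that the latent signal can be attributed to the common components of a corresponding factor model and can be estimated by borrowing methods from the factor model literature. We also show that principal components, which play a key role in functional data analysis, can be accurately estimated after taking this multivariate rather than "functional" perspective. In addition to the estimation problem, we also address testing the null hypothesis of i.i.d. noise. Although this assumption largely prevails in the literature, we believe that it is often unrealistic and not supported by a residual analysis.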

Understanding Implicit Regularization in Over-Parameterized Single Index Model

Jianqing Fan , Zhuoran Yang , Mengxin Yu

Category: Machine Learning (Statistics) | Machine Learning

2020-07-16

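In this paper, we leverage over-parameterization to design regularization-free algorithms for the high-dimensional single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. Specifically, we study both vector and matrix single index models in which the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To better understand the role of implicit regularization without excessive technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the step size is sufficiently small, we prove that the obtained solution achieves minimax-optimal statistical rates of convergence in both the vector and matrix cases. In addition, our experimental results support our theoretical findings and show that our methods empirically outperform classical methods with explicit regularization in terms of both $\ell_2$-statistical rate and variable selection consistency.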

A Unified Framework for Estimation of High-dimensional Conditional Factor Models

Qihui Chen

Category: Machine Learning (Statistics)

2022-09-01

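This paper develops a general framework for estimating high-dimensional conditional factor models via nuclear norm regularization. We establish large-sample properties of the estimators, and provide an efficient computing algorithm for finding the estimators as well as a cross-validation procedure for choosing the regularization parameter. The general framework allows us to estimate a variety of conditional factor models in a unified way and to quickly deliver new asymptotic results. We apply the method to analyze the cross section of individual US stock returns, and find that imposing homogeneity may improve the model's out-of-sample predictability.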

On the instrumental variable estimation with many weak and invalid instruments

Yiqi Lin , Frank Windmeijer , Xinyuan Song , Qingliang Fan

Category: Machine Learning (Statistics)

2022-07-07

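We discuss the fundamental issue of identification in linear instrumental variable (IV) models with unknown IV validity. We revisit the popular majority and plurality rules and show that, in general, no identification condition can be "if and only if". Under the assumption of the "sparsest rule", which is equivalent to the plurality rule but becomes operational in computation algorithms, we investigate and prove the advantages of non-convex penalized approaches over other IV estimators based on two-step selection, in terms of selection consistency and accommodation of individually weak IVs. Furthermore, we propose a surrogate sparsest penalty that aligns with the identification condition and provides the oracle sparse structure simultaneously. Desirable theoretical properties are derived for the proposed estimator under weaker IV strength conditions than in the previous literature. Finite-sample properties are demonstrated using simulations, and the selection and estimation method is applied to an empirical study concerning the effect of trade on economic growth.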

Multivariate Analysis for Multiple Network Data via Semi-Symmetric Tensor PCA

Michael Weylandt , George Michailidis

Category: Machine Learning (Statistics) | Machine Learning

2022-02-09

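Network data are commonly collected in a variety of applications, representing either directly measured or statistically inferred connections between features of interest. In an increasing number of domains, these networks are collected over time, such as interactions between users of a social media platform on different days, or across multiple subjects, such as in multi-subject studies of brain connectivity. When analyzing multiple large networks, dimensionality reduction techniques are often used to embed the networks in a more tractable low-dimensional space. To this end, we develop a framework for principal component analysis (PCA) on collections of networks via a specialized tensor decomposition that we term Semi-Symmetric Tensor PCA, or SS-TPCA. We derive computationally efficient algorithms for computing our proposed SS-TPCA decomposition and establish the statistical efficiency of our approach under a standard low-rank signal-plus-noise model. Remarkably, we show that SS-TPCA achieves the same estimation accuracy as classical matrix PCA, with error proportional to the square root of the number of vertices in the network rather than the number of edges, as might be expected. Our framework inherits many of the strengths of classical PCA and is suitable for a wide range of unsupervised learning tasks, including identifying principal networks, isolating meaningful changepoints or outlying observations, and characterizing the "variability network" of the most varying edges. Finally, we demonstrate the effectiveness of our proposal on simulated data and on an example from empirical legal studies. The techniques used to establish our main consistency results are surprisingly straightforward and may find use in a variety of other network analysis problems.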

D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data

Hai Shu , Zhe Qu , Hongtu Zhu

Category: Machine Learning (Statistics) | Machine Learning

2020-01-09

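Modern biomedical studies often collect multi-view data, that is, multiple types of data measured on the same set of objects. A popular model in high-dimensional multi-view data analysis decomposes each view's data matrix into a low-rank common-source matrix generated by latent factors common across all data views, a low-rank distinctive-source matrix corresponding to each view, and an additive noise matrix. We propose a novel decomposition method for this model, called decomposition-based generalized canonical correlation analysis (D-GCCA). In contrast to the Euclidean dot product space used by most existing methods, D-GCCA rigorously defines the decomposition on the L2 space of random variables, and is thereby able to provide estimation consistency for the low-rank matrix recovery. Moreover, to calibrate the common latent factors well, we impose a desirable orthogonality constraint on the distinctive latent factors. Existing methods, however, do not adequately consider such orthogonality and may therefore suffer a substantial loss of undetected common-source variation. Our D-GCCA goes one step further by separating the common and distinctive components among the canonical variables, while enjoying an appealing interpretation from the perspective of principal component analysis. Furthermore, we propose to use the variable-level proportion of signal variance explained by common or distinctive latent factors to select the most influenced variables. Consistent estimators for our D-GCCA method are established with good finite-sample numerical performance, and they have closed-form expressions that lead to efficient computation, especially for large-scale data. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.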

Data-Driven Sample Average Approximation with Covariate Information

Rohit Kannan , Güzin Bayraksan , James R. Luedtke

Category: Machine Learning (Statistics)

2022-07-27

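We study optimization for data-driven decision-making when we have observations of the uncertain parameters within the optimization model together with concurrent observations of covariates. Given a new covariate observation, the goal is to choose a decision that minimizes the expected cost conditioned on this observation. We investigate three data-driven frameworks that integrate a machine learning prediction model within a stochastic programming sample average approximation (SAA) to approximate the solution to this problem. Two of the SAA frameworks are new and use out-of-sample residuals of leave-one-out prediction models for scenario generation. The frameworks we investigate are flexible and accommodate parametric, nonparametric, and semiparametric regression techniques. We derive conditions on the data generation process, the prediction model, and the stochastic program under which solutions of these data-driven SAAs are consistent and asymptotically optimal, and we also derive convergence rates and finite-sample guarantees. Computational experiments validate our theoretical results, demonstrate the potential advantages of our data-driven formulations over existing approaches (even when the prediction model is misspecified), and illustrate the benefits of our new data-driven formulations in the limited-data regime.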

Deep Learning with Non-Linear Factor Models: Adaptability and Avoidance of Curse of Dimensionality

Mehmet Caner , Maurizio Daniele

Category: Machine Learning (Statistics) | Machine Learning

2022-09-09

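In this paper, we connect the deep learning literature with non-linear factor models and show that deep learning estimation substantially improves on the non-linear additive factor model literature. We provide bounds on the expected risk by extending the Schmidt-Hieber (2020) theorems, and show that these upper bounds are uniform over a set of multiple response variables. We show that our risk bound does not depend on the number of factors. To construct a covariance matrix estimator for asset returns, we develop a novel data-dependent estimator of the error covariance matrix in deep neural networks. The estimator relies on a flexible adaptive thresholding technique that is robust to outliers in the innovations. We prove that the estimator is consistent in spectral norm. Using that result, we then show consistency and rates of convergence for the covariance matrix and precision matrix estimators of asset returns. The rate of convergence in both results does not depend on the number of factors; hence ours is a new result in the factor model literature, where the number of factors is an impediment to better estimation and prediction. Except for the precision matrix result, all our results hold even when the number of assets is larger than the time span and both quantities are growing. Various Monte Carlo simulations confirm our large-sample findings and reveal the superior accuracy of the DNN-FM in estimating the true underlying functional form that connects the factors and the observable variables, as well as the covariance and precision matrices, compared to competing approaches. Moreover, in an out-of-sample portfolio forecasting application, it outperforms alternative portfolio strategies in most cases in terms of out-of-sample portfolio standard deviation and Sharpe ratio.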

Retire: Robust Expectile Regression in High Dimensions

Rebeka Man , Kean Ming Tan , Zian Wang , Wen-Xin Zhou

Category: Machine Learning (Statistics)

2022-12-11

High-dimensional data can often display heterogeneity due to heteroscedastic variance or inhomogeneous covariate effects. Penalized quantile and expectile regression methods offer useful tools to detect heteroscedasticity in high-dimensional data. The former is computationally challenging due to the non-smooth nature of the check loss, and the latter is sensitive to heavy-tailed error distributions. In this paper, we propose and study (penalized) robust expectile regression (retire), with a focus on iteratively reweighted $\ell_1$-penalization which reduces the estimation bias from $\ell_1$-penalization and leads to oracle properties. Theoretically, we establish the statistical properties of the retire estimator under two regimes: (i) low-dimensional regime in which $d \ll n$; (ii) high-dimensional regime in which $s\ll n\ll d$ with $s$ denoting the number of significant predictors. In the high-dimensional setting, we carefully characterize the solution path of the iteratively reweighted $\ell_1$-penalized retire estimation, adapted from the local linear approximation algorithm for folded-concave regularization. Under a mild minimum signal strength condition, we show that after as many as $\log(\log d)$ iterations the final iterate enjoys the oracle convergence rate. At each iteration, the weighted $\ell_1$-penalized convex program can be efficiently solved by a semismooth Newton coordinate descent algorithm. Numerical studies demonstrate the competitive performance of the proposed procedure compared with either non-robust or quantile regression based alternatives.
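
The expectile that the abstract above builds on can be illustrated with a minimal Python sketch. It computes a plain sample expectile by iteratively reweighted averaging of the asymmetric squared loss; it does not include the robustification, the $\ell_1$ penalty, or the semismooth Newton solver that the abstract describes. The function name `expectile` and its parameters are illustrative assumptions, not the paper's code.

```python
def expectile(values, tau, n_iter=100, tol=1e-12):
    """Sample expectile at level tau (hypothetical helper, not from the paper).

    Minimizes sum_i |tau - 1(v_i <= m)| * (v_i - m)^2 over m by repeatedly
    solving the first-order condition with the weights held fixed.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    m = sum(values) / len(values)  # start from the mean (the 0.5-expectile)
    for _ in range(n_iter):
        # Observations above the current estimate get weight tau, others 1 - tau.
        weights = [tau if v > m else 1.0 - tau for v in values]
        new_m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_m - m) < tol:
            return new_m
        m = new_m
    return m


data = [1.0, 2.0, 3.0, 4.0, 10.0]
low, mid, high = expectile(data, 0.1), expectile(data, 0.5), expectile(data, 0.9)
print(low, mid, high)
```

At `tau = 0.5` the weights are all equal, so the expectile reduces to the sample mean; moving `tau` away from 0.5 shifts the estimate toward one tail, which is what makes expectiles useful for detecting heteroscedasticity.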

Treatment Effect Estimation with Unobserved and Heterogeneous Confounding Variables

Kevin Jiang , Yang Ning

Category: Machine Learning (Statistics)

2022-07-29

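The estimation of treatment effects is often biased in the presence of unobserved confounding variables, which are commonly referred to as hidden variables. Although a few methods have recently been proposed to handle the effect of hidden variables, these methods often overlook the possibility of any interaction between the observed treatment variable and the unobserved covariates. In this work, we address this shortcoming by studying a multivariate response regression problem of the form $\dots\, X_j Z + E$, where $Y \in \mathbb{R}^m$ is the $m$-dimensional response variable, $X \in \mathbb{R}^p$ is the vector of observed covariates (including the treatment variable), $Z \in \mathbb{R}^k$ is the vector of $k$-dimensional unobserved confounders, and $E \in \mathbb{R}^m$ is random noise. Allowing for the interaction between $X_j$ and $Z$ induces the heterogeneous confounding effect. Our goal is to estimate the unknown matrix $A$, the direct effect of the observed covariates or the treatment on the responses. To this end, we propose a new debiased estimation approach via SVD to remove the effect of the unobserved confounding variables. The rate of convergence of the estimator is established under both homoscedastic and heteroscedastic noise. We also present several simulation experiments and a real-world data application to substantiate our findings.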

Forecast Evaluation in Large Cross-Sections of Realized Volatility

Christis Katsouris

Category: Machine Learning (Statistics) | Machine Learning

2021-12-09

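In this paper, we consider the forecast evaluation of realized volatility measures under cross-sectional dependence using equal predictive accuracy testing procedures. When forecasting realized volatility, we evaluate the predictive accuracy of the model based on the augmented cross-section. Under the null hypothesis of equal predictive accuracy, the benchmark model employed is a standard HAR model, while under the alternative of non-equal predictive accuracy, the forecast model is an augmented HAR model estimated via LASSO shrinkage. We study the sensitivity of the forecasts to the model specification by incorporating a measurement error correction as well as cross-sectional jump component measures. The out-of-sample forecast evaluation of the models is assessed with numerical implementations.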
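
For readers unfamiliar with the benchmark mentioned above, the following Python sketch builds the regressors of a standard HAR model: the latest daily realized volatility plus its trailing weekly and monthly averages. The 5-day and 22-day windows are the conventional choices and are an assumption here, since the abstract does not state them; the function name `har_features` is hypothetical, and the augmented, LASSO-estimated model is not shown.

```python
def har_features(rv, weekly=5, monthly=22):
    """Build (features, target) pairs for a standard HAR regression.

    rv is a list of daily realized volatilities in time order. For each day t
    with a full monthly window, the features are (RV_t, mean of the last
    `weekly` days, mean of the last `monthly` days) and the target is RV_{t+1}.
    """
    rows = []
    for t in range(monthly - 1, len(rv) - 1):
        daily = rv[t]
        week_avg = sum(rv[t - weekly + 1 : t + 1]) / weekly
        month_avg = sum(rv[t - monthly + 1 : t + 1]) / monthly
        rows.append(((daily, week_avg, month_avg), rv[t + 1]))
    return rows


# Toy series 1.0, 2.0, ..., 30.0 used only to show the windowing.
toy_rv = [float(i) for i in range(1, 31)]
rows = har_features(toy_rv)
print(len(rows), rows[0])
```

The resulting pairs can be passed to any linear regression routine; fitting by ordinary least squares gives the standard HAR benchmark.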

Best Subset Selection in Reduced Rank Regression

Canhong Wen , Ruipeng Dong , Xueqin Wang , Weiyu Li , Heping Zhang

Category: Machine Learning

2022-11-29

Sparse reduced rank regression is an essential statistical learning method. In the contemporary literature, estimation is typically formulated as a nonconvex optimization that often yields to a local optimum in numerical computation. Yet, their theoretical analysis is always centered on the global optimum, resulting in a discrepancy between the statistical guarantee and the numerical computation. In this research, we offer a new algorithm to address the problem and establish an almost optimal rate for the algorithmic solution. We also demonstrate that the algorithm achieves the estimation with a polynomial number of iterations. In addition, we present a generalized information criterion to simultaneously ensure the consistency of support set recovery and rank estimation. Under the proposed criterion, we show that our algorithm can achieve the oracle reduced rank estimation with a significant probability. The numerical studies and an application in the ovarian cancer genetic data demonstrate the effectiveness and scalability of our approach.

Robustifying Markowitz

Wolfgang Karl Härdle , Yegor Klochkov , Alla Petukhina , Nikita Zhivotovskiy

Category: Machine Learning

2022-12-28

Markowitz mean-variance portfolios with sample mean and covariance as input parameters feature numerous issues in practice. They perform poorly out of sample due to estimation error, they experience extreme weights together with high sensitivity to change in input parameters. The heavy-tail characteristics of financial time series are in fact the cause for these erratic fluctuations of weights that consequently create substantial transaction costs. In robustifying the weights we present a toolbox for stabilizing costs and weights for global minimum Markowitz portfolios. Utilizing a projected gradient descent (PGD) technique, we avoid the estimation and inversion of the covariance operator as a whole and concentrate on robust estimation of the gradient descent increment. Using modern tools of robust statistics we construct a computationally efficient estimator with almost Gaussian properties based on median-of-means uniformly over weights. This robustified Markowitz approach is confirmed by empirical studies on equity markets. We demonstrate that robustified portfolios reach the lowest turnover compared to shrinkage-based and constrained portfolios while preserving or slightly improving out-of-sample performance.
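
The median-of-means idea named in the abstract above can be sketched in a few lines of Python. This is the scalar textbook version only: the paper applies it to gradient descent increments uniformly over weights, which is not reproduced here. The function name `median_of_means`, the block count, and the toy return series are illustrative assumptions.

```python
import statistics


def median_of_means(values, n_blocks):
    """Median of block averages (hypothetical helper, scalar case only).

    Splits `values` into n_blocks consecutive blocks of equal size, averages
    each block, and returns the median of those averages. Trailing values
    that do not fill a complete block are ignored. The data are assumed to be
    in exchangeable order; otherwise shuffle them first.
    """
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    size = len(values) // n_blocks
    block_means = [
        statistics.fmean(values[i * size : (i + 1) * size]) for i in range(n_blocks)
    ]
    return statistics.median(block_means)


# Toy daily returns with one extreme observation at the end.
returns = [0.01, -0.02, 0.015, 0.0, 0.005, -0.01, 0.02, -0.005, 5.0]
robust_mean = median_of_means(returns, n_blocks=3)
plain_mean = statistics.fmean(returns)
print(robust_mean, plain_mean)
```

A single extreme return can contaminate at most one block, so the median of the block means stays close to the bulk of the data while the plain sample mean is pulled toward the outlier.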

High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization

Junren Chen , Cheng-Long Wang , Michael K. Ng , Di Wang

Category: Machine Learning (Statistics) | Machine Learning

2022-02-26

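In this paper, we propose a uniformly dithered one-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to three estimation problems: sparse covariance matrix estimation, sparse linear regression, and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, where the underlying distribution of the heavy-tailed data is assumed to have bounded second or fourth moments. For each model, we propose new estimators based on one-bit quantized data. In the sub-Gaussian regime, our estimators achieve minimax-optimal rates up to logarithmic factors, which indicates that our quantization scheme incurs almost no additional cost. In the heavy-tailed regime, although the rates of our estimators become essentially slower, these results are either the first ones in such a one-bit quantized and heavy-tailed setting, or exhibit significant improvements over existing comparable results. Furthermore, we contribute considerably to the problems of one-bit compressed sensing and one-bit matrix completion. Specifically, we extend one-bit compressed sensing to sub-Gaussian or even heavy-tailed sensing vectors via convex programming. For one-bit matrix completion, our method is essentially different from the standard likelihood approach and can handle pre-quantization random noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.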
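
The three steps named above (truncation, dithering, quantization) can be sketched for a single scalar in Python. The sketch assumes that the truncation threshold and the dither range share one parameter `delta`, which the abstract does not specify, and all function names are hypothetical. It only illustrates why uniform dithering keeps the one-bit output informative: for $|x| \le \Delta$ and a dither drawn uniformly from $[-\Delta, \Delta]$, the expectation of $\Delta \cdot \mathrm{sign}(x + \text{dither})$ equals $x$.

```python
import random


def truncate(x, threshold):
    """Clip x to the interval [-threshold, threshold]."""
    return max(-threshold, min(threshold, x))


def one_bit_quantize(x, delta, rng):
    """Truncate, add a uniform dither on [-delta, delta], keep only the sign."""
    dithered = truncate(x, delta) + rng.uniform(-delta, delta)
    return 1 if dithered >= 0 else -1


def estimate_from_bits(bits, delta):
    """Rescaled average of the signs; unbiased for the truncated value."""
    return delta * sum(bits) / len(bits)


rng = random.Random(0)  # fixed seed so the run is reproducible
delta = 2.0
x = 0.7
# Quantizing the same value many times is only a demonstration of unbiasedness;
# in the estimation problems each data point is quantized once.
bits = [one_bit_quantize(x, delta, rng) for _ in range(20000)]
estimate = estimate_from_bits(bits, delta)
print(estimate)
```

Without the dither, every bit would equal the sign of `x` and its magnitude would be lost; the random dither is what lets averages of signs recover magnitudes.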

Deep Partial Least Squares for Empirical Asset Pricing

Matthew F. Dixon , Nicholas G. Polson , Kemen Goicoechea

Category: Machine Learning | Machine Learning (Statistics)

2022-06-20

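We use deep partial least squares (DPLS) to estimate an asset pricing model for individual stock returns that exploits conditioning information in a flexible and dynamic way while attributing excess returns to a small set of statistical risk factors. The novel contribution is to resolve the non-linear factor structure, thus advancing the current paradigm of deep learning in empirical asset pricing, which uses linear stochastic discount factors under an assumption of Gaussian asset returns and factors. This non-linear factor structure is extracted by using projected least squares to jointly project firm characteristics and asset returns onto a subspace of latent factors, and by using deep learning to learn the non-linear map from the factor loadings to the asset returns. The result of capturing this non-linear risk factor structure is to characterize anomalies in asset returns by both linear risk factor exposure and interaction effects. Thus, the well-known ability of deep learning to capture outliers sheds light on the role of higher-order terms in the latent factor structure on the factor risk premia. On the empirical side, we implement our DPLS factor models and show superior performance to LASSO and plain vanilla deep learning models. Furthermore, our network training times are significantly reduced owing to the more parsimonious architecture of DPLS. Specifically, using 3290 assets in the Russell 1000 index over the period from December 1989 to January 2018, we assess our DPLS factor model and generate information ratios that are approximately 1.2 times greater than those of deep learning. DPLS explains variation and pricing errors and identifies the most prominent latent factors and firm characteristics.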

The Projected Covariance Measure for assumption-lean variable significance testing

Anton Rask Lundborg , Ilmun Kim , Rajen D. Shah , Richard J. Samworth

Category: Machine Learning (Statistics)

2022-11-03

Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections. In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches.

Orthogonalized Kernel Debiased Machine Learning for Multimodal Data Analysis

Xiaowu Dai , Lexin Li

Category: Machine Learning (Statistics)

2021-03-12

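Multimodal imaging has transformed neuroscience research. While it presents unprecedented opportunities, it also poses serious challenges. In particular, it is difficult to combine the interpretability of a simple association model with the flexibility achieved by a highly adaptive nonlinear model. In this paper, we propose an orthogonalized kernel debiased machine learning approach for multimodal data analysis, which is built upon Neyman orthogonality and a form of decomposition orthogonality. We target the setting that naturally arises in almost all multimodal studies, where there is a primary modality of interest plus additional auxiliary modalities. We establish the root-$n$ consistency and asymptotic normality of the estimated primary parameter, the semiparametric estimation efficiency, and the asymptotic validity of the confidence band of the predicted primary modality effect. Our proposal enjoys, to a good extent, both model interpretability and model flexibility. It is also considerably different from existing statistical methods for multimodal data integration, as well as from orthogonality-based methods for high-dimensional inference. We demonstrate the efficacy of our method through simulations and an application to a multimodal neuroimaging study of Alzheimer's disease.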

On Model Identification and Out-of-Sample Prediction of Principal Component Regression: Applications to Synthetic Controls

Anish Agarwal , Devavrat Shah , Dennis Shen

Category: Machine Learning | Machine Learning (Statistics)

2020-10-27

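We analyze principal component regression (PCR) in a high-dimensional error-in-variables setting with fixed design. Under suitable conditions, we show that PCR consistently identifies the unique model with minimum $\ell_2$-norm and is minimax optimal. These results enable us to establish non-asymptotic out-of-sample prediction guarantees that improve upon the best known rates. In our analysis, we introduce a natural linear algebraic condition between the in-sample and out-of-sample covariates, which allows us to avoid distributional assumptions. Our simulations illustrate the importance of this condition for generalization, even under covariate shift. As a byproduct, our results also lead to novel results for the synthetic controls literature, a leading approach for policy evaluation. In particular, our minimax results suggest the attractiveness of PCR-based methods among the numerous variants. To the best of our knowledge, prediction guarantees for the fixed design setting have been elusive in both the high-dimensional error-in-variables and synthetic controls literatures.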

Distribution-Free Predictive Inference For Regression

Jing Lei , Max G'Sell , Alessandro Rinaldo , Ryan J. Tibshirani , Larry Wasserman

Category:

2016-04-14

We develop a general framework for distribution-free predictive inference in regression, using conformal inference. The proposed methodology allows for the construction of a prediction band for the response variable using any estimator of the regression function. The resulting prediction band preserves the consistency properties of the original estimator under standard assumptions, while guaranteeing finite-sample marginal coverage even when these assumptions do not hold. We analyze and compare, both empirically and theoretically, the two major variants of our conformal framework: full conformal inference and split conformal inference, along with a related jackknife method. These methods offer different tradeoffs between statistical accuracy (length of resulting prediction intervals) and computational efficiency. As extensions, we develop a method for constructing valid in-sample prediction intervals called rank-one-out conformal inference, which has essentially the same computational efficiency as split conformal inference. We also describe an extension of our procedures for producing prediction bands with locally varying length, in order to adapt to heteroskedasticity in the data. Finally, we propose a model-free notion of variable importance, called leave-one-covariate-out or LOCO inference. Accompanying this paper is an R package conformalInference that implements all of the proposals we have introduced. In the spirit of reproducibility, all of our empirical results can also be easily (re)generated using this package.
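
The split conformal variant mentioned above can be sketched in plain Python. This is an illustration under stated assumptions and not the `conformalInference` R package: the regression estimator is a one-covariate least-squares line (any estimator could be substituted), the split uses the first and second halves of the data rather than a random split (reasonable here only because the toy data are generated i.i.d.), and the names `fit_line` and `split_conformal_interval` are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares with one covariate; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def split_conformal_interval(xs, ys, x_new, alpha, fit=fit_line):
    """Prediction interval at x_new with marginal coverage at least 1 - alpha."""
    half = len(xs) // 2
    predict = fit(xs[:half], ys[:half])  # fit on the first half only
    # Absolute residuals on the held-out second half, sorted ascending.
    residuals = sorted(abs(y - predict(x)) for x, y in zip(xs[half:], ys[half:]))
    # Use the k-th smallest residual, k = ceil((m + 1) * (1 - alpha)).
    k = math.ceil((len(residuals) + 1) * (1 - alpha))
    width = residuals[k - 1] if k <= len(residuals) else math.inf
    center = predict(x_new)
    return center - width, center + width


rng = random.Random(1)  # fixed seed for reproducibility
xs = [rng.uniform(0.0, 10.0) for _ in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
lo_90, hi_90 = split_conformal_interval(xs, ys, x_new=5.0, alpha=0.1)
lo_50, hi_50 = split_conformal_interval(xs, ys, x_new=5.0, alpha=0.5)
print((lo_90, hi_90), (lo_50, hi_50))
```

Because the model is fitted once and only the held-out residuals are ranked, the cost is essentially that of a single fit, which is the computational advantage over full conformal inference that the abstract refers to.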
