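A generic gradient boosting machine called Trust-region Boosting (TRBoost) is presented for performing supervised machine learning tasks. Existing gradient boosting machines (GBMs) have achieved state-of-the-art results on many problems. However, it is difficult to maintain a balance between performance and generality. First-order algorithms are applicable to more general loss functions than second-order algorithms, but their performance is often not as good as that of the latter. TRBoost generalizes GBMs based on the trust-region algorithm to suit arbitrary loss functions while keeping the good performance of second-order algorithms. Several numerical experiments are conducted to confirm that TRBoost can obtain competitive results while offering additional benefits in convergence.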
Gradient boosting is a prediction method that iteratively combines weak learners to produce a complex and accurate model. From an optimization point of view, the learning procedure of gradient boosting mimics a gradient descent on a functional variable. This paper proposes to build upon the proximal point algorithm, when the empirical risk to minimize is not differentiable, in order to introduce a novel boosting approach, called proximal boosting. It comes with a companion algorithm inspired by [1] and called residual proximal boosting, which is aimed at better controlling the approximation error. Theoretical convergence is proved for these two procedures under different hypotheses on the empirical risk and advantages of leveraging proximal methods for boosting are illustrated by numerical experiments on simulated and real-world data. In particular, we exhibit a favorable comparison over gradient boosting regarding convergence rate and prediction accuracy.
Bootstrap aggregating (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners to generate a composite model for more accurate and more reliable performance. They have been widely used in biology, engineering, healthcare, etc. This paper proposes BoostForest, which is an ensemble learning approach using BoostTree as base learners and can be used for both classification and regression. BoostTree constructs a tree model by gradient boosting. It increases the randomness (diversity) by drawing the cut-points randomly at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest generally outperformed four classical ensemble learning approaches (Random Forest, Extra-Trees, XGBoost and LightGBM) on 35 classification and regression datasets. Remarkably, BoostForest tunes its parameters by simply sampling them randomly from a parameter pool, which can be easily specified, and its ensemble learning framework can also be used to combine many other base learners.
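We introduce a Dimension-Reduced Second-Order Method (DRSOM) for convex and nonconvex unconstrained optimization. Under a trust-region-like framework, our method preserves the convergence of second-order methods while using Hessian-vector products in only two directions. Moreover, the computational overhead remains comparable to that of first-order methods such as gradient descent. We show that the method has a complexity of $O(\epsilon^{-3/2})$ to satisfy the first-order and second-order conditions in the subspace. The applicability and performance of DRSOM are demonstrated by various computational experiments in logistic regression, $L_2$-$L_p$ minimization, sensor network localization, and neural network training. For neural networks, our preliminary implementation seems to gain computational advantages in terms of training accuracy and iteration complexity over state-of-the-art first-order methods, including SGD and ADAM.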
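We study the problem of estimating precision matrices in multivariate Gaussian distributions where all partial correlations are nonnegative, also known as multivariate totally positive of order two ($\mathrm{MTP}_2$). Such models have received significant attention in recent years, primarily due to interesting properties; for example, the maximum likelihood estimator exists with as few as two observations regardless of the underlying dimension. We formulate this problem as a weighted $\ell_1$-norm regularized Gaussian maximum likelihood estimation under $\mathrm{MTP}_2$ constraints. Toward this end, we propose a novel projected Newton-like algorithm that incorporates a carefully designed approximate Newton direction, which gives our algorithm the same order of computation and memory costs as first-order methods. We prove that the proposed projected Newton-like algorithm converges to the minimizer of the problem. We further show, both theoretically and experimentally, that the minimizer of our formulation using the weighted $\ell_1$-norm is able to recover the support of the underlying precision matrix correctly without requiring the incoherence condition present in $\ell_1$-norm approaches. Experiments involving synthetic and real-world data demonstrate that our proposed algorithm is significantly more efficient, in terms of computational time, than state-of-the-art methods. Finally, we apply our method to financial time-series data, which are known for displaying positive dependencies, where we observe significant performance in terms of the modularity value of the learned financial networks.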
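In this paper, we propose SC-Reg (self-concordant regularization) for learning overparameterized feedforward neural networks by incorporating second-order information in the Newton decrement framework for convex problems. We propose the generalized Gauss-Newton with Self-Concordant Regularization (SCoRe-GGN) algorithm, which updates the network parameters each time it receives a new input batch. The proposed algorithm exploits the structure of the second-order information in the Hessian matrix, thereby reducing the training computational overhead. Although our current analysis considers only the convex case, numerical experiments show the efficiency and fast convergence of our method in both convex and non-convex settings, where it compares favorably with baseline first-order methods and a quasi-Newton method.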
We investigate ensemble methods for prediction in an online setting. Unlike the existing ensembling literature, we introduce, for the first time, an approach in which a meta learner combines the base model predictions using a superset of the features, namely the union of the base models' feature vectors, instead of the predictions themselves. Our model does not use the predictions of the base models as inputs to a machine learning algorithm, but chooses the best possible combination at each time step based on the state of the problem. We explore three constraint spaces for linearly combining the base predictions: convex combinations, where the components of the ensembling vector are all nonnegative and sum to 1; affine combinations, where the weight vector components are required to sum to 1; and unconstrained combinations, where the components are free to take any real value. The constraints are both theoretically analyzed under known statistics and integrated into the learning procedure of the meta learner as part of the optimization in an automated manner. To show the practical efficiency of the proposed method, we employ a gradient-boosted decision tree and a multi-layer perceptron separately as the meta learners. Our framework is generic, so other machine learning architectures can serve as the ensembler as long as they allow a custom differentiable loss for minimization. We demonstrate the learning behavior of our algorithm on synthetic data and its significant performance improvements over conventional methods on various real-life datasets that are extensively used in well-known data competitions. Furthermore, we openly share the source code of the proposed method to facilitate further research and comparison.
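The three constraint spaces named in this abstract can be illustrated with a small sketch. The parametrization below (softmax for the convex case, a shift onto the sum-to-one hyperplane for the affine case) and the function names `combination_weights` and `ensemble_prediction` are assumptions made for illustration; the abstract does not state how the paper enforces the constraints.

```python
import numpy as np

def combination_weights(z, mode):
    """Map a free parameter vector z to ensembling weights (illustrative only)."""
    z = np.asarray(z, dtype=float)
    if mode == "convex":
        # Softmax: all weights nonnegative and summing to 1.
        e = np.exp(z - z.max())
        return e / e.sum()
    if mode == "affine":
        # Shift every component equally so the weights sum to 1;
        # individual weights may still be negative.
        return z + (1.0 - z.sum()) / z.size
    if mode == "unconstrained":
        # Weights are free to take any real value.
        return z
    raise ValueError(f"unknown mode: {mode}")

def ensemble_prediction(base_preds, z, mode):
    """Linearly combine base predictions (rows: samples, columns: base models)."""
    return np.asarray(base_preds, dtype=float) @ combination_weights(z, mode)
```

With convex weights the combined prediction always lies between the smallest and largest base prediction, which is not guaranteed in the affine or unconstrained cases.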
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and Friedman, Hastie and Tibshirani are discussed.
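As a rough illustration of the steepest-descent view for the least-squares case, the following sketch fits one-split regression stumps to the current residuals, which are the negative gradient of the squared loss with respect to the model output. The helper names `fit_stump` and `gradient_boost`, the single 1-D feature, and the fixed shrinkage `lr` are simplifying assumptions, not details taken from the paper.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to targets r on a 1-D feature x."""
    best = None
    # Exclude the largest value so both sides of the split are non-empty.
    for t in np.unique(x)[:-1]:
        left = x <= t
        pred_l, pred_r = r[left].mean(), r[~left].mean()
        sse = ((r[left] - pred_l) ** 2).sum() + ((r[~left] - pred_r) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, pred_l, pred_r)
    _, t, pred_l, pred_r = best
    return lambda z: np.where(z <= t, pred_l, pred_r)

def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Least-squares gradient boosting with stumps as base learners."""
    f0 = y.mean()                      # constant initial model
    pred = np.full_like(y, f0, dtype=float)
    stumps, history = [], []
    for _ in range(n_rounds):
        residual = y - pred            # negative gradient of 0.5 * (y - F)^2
        h = fit_stump(x, residual)
        pred = pred + lr * h(x)        # stagewise additive update
        stumps.append(h)
        history.append(np.mean((y - pred) ** 2))

    def predict(z):
        return f0 + lr * sum(h(z) for h in stumps)

    return predict, history
```

For other losses named in the abstract, such as least absolute deviation, only the line computing `residual` would change to the corresponding negative gradient.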
Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
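The GOSS idea can be sketched as a sampling routine: keep the instances with the largest absolute gradients and draw a random subset of the remainder. The function name `goss_sample`, the ratios `a` and `b`, and the re-weighting of the sampled small-gradient instances by `(1 - a) / b` are stated here as assumptions about the method; the abstract itself only says that small-gradient instances are largely excluded.

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, rng=None):
    """Return indices and weights of a gradient-based one-side sample."""
    rng = np.random.default_rng() if rng is None else rng
    n = grad.shape[0]
    n_top = int(round(a * n))          # instances kept for their large gradients
    n_rand = int(round(b * n))         # instances sampled from the remainder
    order = np.argsort(-np.abs(grad))  # descending by absolute gradient
    top_idx = order[:n_top]
    rand_idx = rng.choice(order[n_top:], size=n_rand, replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(idx.shape[0])
    # Assumed compensation factor so the sampled small-gradient instances
    # stand in for the whole small-gradient population.
    weights[n_top:] = (1.0 - a) / b
    return idx, weights
```

A split-gain estimate would then sum `weights * grad[idx]` over each candidate child instead of summing over all instances.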
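Class imbalance poses a major challenge for machine learning, as most supervised learning models might exhibit bias toward the majority class and underperform on the minority class. Cost-sensitive learning tackles this problem by treating the classes differently, typically via a user-defined fixed misclassification cost matrix provided as input to the learner. Such parameter tuning is a challenging task that requires domain knowledge; moreover, wrong adjustments might lead to a deterioration of overall predictive performance. In this work, we propose a novel cost-sensitive boosting approach for imbalanced data that dynamically adjusts the misclassification costs in response to the model's performance, instead of using a fixed misclassification cost matrix. Our method, called AdaCC, is parameter-free, as it relies on the cumulative behavior of the boosting model to adjust the misclassification costs for the next boosting round, and it comes with theoretical guarantees regarding the training error. Experiments on 27 real-world datasets from different domains demonstrate the superiority of our method over 12 state-of-the-art cost-sensitive approaches, with consistent improvements in different measures, for instance in the range of [0.3%-28.56%] for AUC, [3.4%-21.4%] for balanced accuracy, [4.8%-45%] for gmean, and [7.4%-85.5%] for recall.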
Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as continual learning, neural architecture search, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or there are constraints. In this paper, we introduce a bilevel stochastic gradient method for bilevel problems with lower-level constraints. We also present a comprehensive convergence theory that covers all inexact calculations of the adjoint gradient (also called hypergradient) and addresses both the lower-level unconstrained and constrained cases. To promote the use of bilevel optimization in large-scale learning, we introduce a practical bilevel stochastic gradient method (BSG-1) that does not require second-order derivatives and, in the lower-level unconstrained case, dismisses any system solves and matrix-vector products.
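Recent years have witnessed significant success of Gradient Boosting Decision Trees (GBDT) in a wide range of machine learning applications. Generally, the consensus about GBDT training algorithms is that gradients and statistics are computed with high-precision floating-point numbers. In this paper, we investigate an essentially important question that has been largely ignored by the previous literature: how many bits are needed to represent gradients in training GBDT? To solve this mystery, we propose to quantize all the high-precision gradients in a very simple yet effective way in the GBDT training algorithm. Surprisingly, both our theoretical analysis and empirical studies show that the precision of gradients needed without hurting any performance can be quite low, e.g., 2 or 3 bits. With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits. Promisingly, these findings may pave the way for much more efficient training of GBDT from several aspects: (1) speeding up the computation of gradient statistics in histograms; (2) compressing the communication cost of high-precision statistical information during distributed training; (3) inspiring the use and development of hardware architectures that support low-precision computation well for GBDT training. Benchmarked on CPU, GPU, and distributed clusters, our simple quantization strategy yields a speedup compared with SOTA GBDT systems on extensive datasets, demonstrating the effectiveness and potential of low-precision training of GBDT. The code will be released to the official repository of LightGBM.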
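One simple way to quantize gradients to a few bits is sketched below. It scales the gradients to a small signed integer range and applies stochastic rounding so that the quantized values remain unbiased in expectation. The function name `quantize_gradients`, the symmetric scaling by the largest absolute gradient, and the use of stochastic rounding are assumptions for illustration; the abstract does not specify the quantization scheme.

```python
import numpy as np

def quantize_gradients(grad, n_bits=3, rng=None):
    """Quantize float gradients to signed integers in [-levels, levels]."""
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** (n_bits - 1) - 1           # e.g. 3 for n_bits=3
    max_abs = np.max(np.abs(grad))
    scale = max_abs / levels if max_abs > 0 else 1.0
    scaled = grad / scale                    # now roughly in [-levels, levels]
    low = np.floor(scaled)
    prob_up = scaled - low                   # fractional part in [0, 1)
    # Stochastic rounding: round up with probability equal to the fraction.
    q = low + (rng.random(grad.shape) < prob_up)
    q = np.clip(q, -levels, levels)          # guard against float edge cases
    return q.astype(np.int8), scale
```

Histogram accumulation can then sum the `int8` values with integer arithmetic and multiply by `scale` once per bin to recover an approximate gradient sum.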
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
We explore the usage of the Levenberg-Marquardt (LM) algorithm for regression (non-linear least squares) and classification (generalized Gauss-Newton methods) tasks in neural networks. We compare the performance of the LM method with other popular first-order algorithms such as SGD and Adam, as well as other second-order algorithms such as L-BFGS, Hessian-Free and KFAC. We further speed up the LM method by using adaptive momentum, learning rate line search, and uphill step acceptance.
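For reference, the classical Levenberg-Marquardt iteration for non-linear least squares can be sketched as follows. Each step solves the damped normal equations $(J^\top J + \lambda I)\,\delta = -J^\top r$ and adapts the damping $\lambda$ depending on whether the cost decreased. This is the textbook scheme, not the accelerated variant of the abstract: adaptive momentum, line search, and uphill step acceptance are omitted, and the function name and the damping factors of 10 are assumptions.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, theta0, n_iter=100, lam=1e-2):
    """Minimize 0.5 * ||residual(theta)||^2 with a basic LM loop."""
    theta = np.asarray(theta0, dtype=float)
    cost = 0.5 * np.sum(residual(theta) ** 2)
    for _ in range(n_iter):
        r = residual(theta)
        J = jacobian(theta)
        # Damped Gauss-Newton system: (J^T J + lam * I) step = -J^T r
        A = J.T @ J + lam * np.eye(theta.size)
        step = np.linalg.solve(A, -J.T @ r)
        new_cost = 0.5 * np.sum(residual(theta + step) ** 2)
        if new_cost < cost:
            # Accept the step and trust the quadratic model more.
            theta, cost = theta + step, new_cost
            lam /= 10.0
        else:
            # Reject the step and move toward gradient descent.
            lam *= 10.0
    return theta, cost
```

A small damping makes the step close to Gauss-Newton, while a large damping shrinks it toward a short gradient-descent step.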
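Motivated by the recent increase of interest in optimization algorithms for non-convex optimization, as applied to training deep neural networks and other optimization problems in data analysis, we give an overview of recent theoretical results on global performance guarantees of optimization algorithms for non-convex optimization. We start with classical arguments showing that general non-convex problems cannot be solved efficiently in a reasonable time. Then we give a list of problems for which a global minimizer can be found efficiently by exploiting the structure of the problem as much as possible. Another way to deal with non-convexity is to relax the goal from finding the global minimum to finding a stationary point or a local minimum. For this setting, we first present known results on the convergence rates of deterministic first-order methods, followed by a general theoretical analysis of optimal stochastic and randomized gradient schemes and an overview of stochastic first-order methods. After that, we discuss quite general classes of non-convex problems, such as minimization of $\alpha$-weakly-quasi-convex functions and functions that satisfy the Polyak-Lojasiewicz condition, which still allow theoretical convergence guarantees for first-order methods. Then we consider higher-order and zeroth-order/derivative-free methods and their convergence rates for non-convex optimization problems.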
