Bilevel optimization plays an essential role in many machine learning tasks, ranging from hyperparameter optimization to meta-learning. Existing studies on bilevel optimization, however, focus on either the centralized or the synchronous distributed setting. Centralized bilevel optimization approaches require collecting massive amounts of data on a single server, which inevitably incurs significant communication expenses and may give rise to data privacy risks. Synchronous distributed bilevel optimization algorithms, on the other hand, often face the straggler problem and will immediately stop working if a few workers fail to respond. As a remedy, we propose the Asynchronous Distributed Bilevel Optimization (ADBO) algorithm. The proposed ADBO can tackle bilevel optimization problems with both nonconvex upper-level and lower-level objective functions, and its convergence is theoretically guaranteed. Furthermore, theoretical analysis reveals that the iteration complexity of ADBO to obtain an $\epsilon$-stationary point is upper bounded by $\mathcal{O}(\frac{1}{\epsilon^2})$. Thorough empirical studies on public datasets have been conducted to elucidate the effectiveness and efficiency of the proposed ADBO.
translated by Google Translate
Distributionally Robust Optimization (DRO), which aims to find an optimal decision that minimizes the worst-case cost over an ambiguity set of probability distributions, has been widely applied in diverse applications, e.g., network behavior analysis and risk management. However, existing DRO techniques face three key challenges: 1) how to deal with asynchronous updating in a distributed environment; 2) how to leverage the prior distribution effectively; 3) how to properly adjust the degree of robustness according to different scenarios. To this end, we propose an asynchronous distributed algorithm, named the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) algorithm with the itErative Active SEt method (EASE), to tackle the distributed distributionally robust optimization (DDRO) problem. Furthermore, a new uncertainty set, i.e., the constrained D-norm uncertainty set, is developed to effectively leverage the prior distribution and flexibly control the degree of robustness. Finally, our theoretical analysis shows that the proposed algorithm is guaranteed to converge, and its iteration complexity is also analyzed. Extensive empirical studies on real-world datasets demonstrate that the proposed method can not only achieve fast convergence and remain robust against data heterogeneity as well as malicious attacks, but also trade off robustness against performance.
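In recent years, a variety of gradient-based methods have been developed to solve Bi-Level Optimization (BLO) problems in machine learning and computer vision. However, the theoretical correctness and practical effectiveness of these existing approaches always rely on some restrictive conditions (e.g., Lower-Level Singleton, LLS), which can hardly be satisfied in real-world applications. Moreover, previous literature only proves theoretical results for specific iteration strategies, and thus lacks a general recipe for uniformly analyzing the convergence behavior of different gradient-based BLO methods. In this work, we formulate BLO from an optimistic bi-level viewpoint and establish a new gradient-based algorithmic framework, named Bi-level Descent Aggregation (BDA), to partially address the above issues. Specifically, BDA provides a modularized structure that hierarchically aggregates the upper- and lower-level subproblems to generate our bi-level iterative dynamics. Theoretically, we establish a general convergence analysis template and derive a new proof recipe to investigate the essential theoretical properties of gradient-based BLO methods. Furthermore, this work systematically explores the convergence behavior of BDA in different optimization scenarios, i.e., considering the various solution qualities (global, local, or stationary solutions) returned from solving the approximation subproblems. Extensive experiments justify our theoretical results and demonstrate the superiority of the proposed algorithm on hyperparameter optimization and meta-learning tasks. Source code is available at https://github.com/vis-opt-group/bda.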
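In recent years, decentralized bilevel optimization problems have received increasing attention in the networking and machine learning communities thanks to their versatility in modeling decentralized learning problems over peer-to-peer networks (e.g., multi-agent meta-learning, multi-agent reinforcement learning, personalized training, and Byzantine-resilient learning). However, for decentralized bilevel optimization over peer-to-peer networks with limited computation and communication capabilities, how to achieve low sample and communication complexities are two fundamental challenges that remain under-explored so far. In this paper, we make the first attempt to investigate the class of decentralized bilevel optimization problems with nonconvex and strongly convex structure corresponding to the outer and inner subproblems, respectively. Our main contributions in this paper are two-fold: i) we first propose a deterministic algorithm called INTERACT (inner-gradient-descent-outer-tracked-gradient) that requires a sample complexity of $\mathcal{O}(n \epsilon^{-1})$ and a communication complexity of $\mathcal{O}(\epsilon^{-1})$ to solve the bilevel optimization problem, where $n$ and $\epsilon > 0$ are the number of samples at each agent and the desired stationarity gap, respectively; ii) to relax the need for full gradient evaluations in each iteration, we propose a stochastic variance-reduced version of INTERACT (SVR-INTERACT), which improves the sample complexity to $\mathcal{O}(\sqrt{n} \epsilon^{-1})$ while achieving the same communication complexity as the deterministic algorithm. To the best of our knowledge, this work is the first to achieve both low sample and communication complexities for solving decentralized bilevel optimization problems over networks. Our numerical experiments also corroborate our theoretical findings.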
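Bilevel optimization has found extensive applications in modern machine learning problems such as hyperparameter optimization, neural architecture search, and meta-learning. While bilevel problems with a unique inner minimal point (e.g., where the inner function is strongly convex) are well understood, bilevel problems with multiple inner minimal points remain challenging and open. Existing algorithms designed for such problems apply only to restricted situations and do not come with full convergence guarantees. In this paper, we adopt a reformulation of bilevel optimization as constrained optimization, and solve the problem via a primal-dual bilevel optimization (PDBO) algorithm. PDBO not only addresses the multiple-inner-minima challenge, but is also fully first-order, without involving second-order Hessian and Jacobian computations, as opposed to most existing gradient-based bilevel algorithms. We further characterize the convergence rate of PDBO, which serves as the first known non-asymptotic convergence guarantee for bilevel optimization with multiple inner minima. Our experiments demonstrate the desired performance of the proposed approach.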
Bilevel programming has recently received attention in the literature, due to a wide range of applications, including reinforcement learning and hyper-parameter optimization. However, it is widely assumed that the underlying bilevel optimization problem is solved either by a single machine or by multiple machines connected in a star-shaped network, i.e., the federated learning setting. The latter approach suffers from a high communication cost on the central node (e.g., the parameter server) and exhibits privacy vulnerabilities. Hence, it is of interest to develop methods that solve bilevel optimization problems in a communication-efficient decentralized manner. To that end, this paper introduces a penalty-function-based decentralized algorithm with theoretical guarantees for this class of optimization problems. Specifically, a distributed alternating gradient-type algorithm for solving consensus bilevel programming over a decentralized network is developed. A key feature of the proposed algorithm is that it estimates the hyper-gradient of the penalty function via decentralized computation of matrix-vector products and a few vector communications; this estimate is then integrated within our alternating algorithm to give a finite-time convergence analysis under different convexity assumptions. Owing to the generality of this complexity analysis, our result yields convergence rates for a wide variety of consensus problems including minimax and compositional optimization. Empirical results on both synthetic and real datasets demonstrate that the proposed method works well in practice.
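Federated learning (FL) has received a surge of interest in recent years thanks to its benefits in data privacy protection, efficient communication, and parallel data processing. Also, with appropriate algorithmic designs, one can achieve the desirable linear speedup for convergence in FL. However, most existing works on FL are limited to systems with i.i.d. data and centralized parameter servers, and results on decentralized FL with heterogeneous datasets remain limited. Moreover, whether linear speedup for convergence is achievable under fully decentralized FL with data heterogeneity remains an open question. In this paper, we address these challenges by proposing a new algorithm, called NET-FLEET, for fully decentralized FL systems with data heterogeneity. The key idea of our algorithm is to enhance the local update scheme in FL (originally intended for communication efficiency) by incorporating a recursive gradient correction technique to handle heterogeneous datasets. We show that, under appropriate parameter settings, the proposed NET-FLEET algorithm achieves a linear speedup for convergence. We further conduct extensive numerical experiments to evaluate the performance of the proposed NET-FLEET algorithm and verify our theoretical findings.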
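Much recent research effort has been directed to the development of efficient algorithms for solving minimax problems with theoretical convergence guarantees, due to the relevance of these problems to a few emerging applications. In this paper, we propose a unified single-loop alternating gradient projection (AGP) algorithm for solving smooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. AGP employs simple gradient projection steps for updating the primal and dual variables alternately at each iteration. We show that it can find an $\varepsilon$-stationary point of the objective function in $\mathcal{O}\left(\varepsilon^{-2}\right)$ (resp. $\mathcal{O}\left(\varepsilon^{-4}\right)$) iterations under the nonconvex-strongly concave (resp. nonconvex-concave) setting. Moreover, its gradient complexity to obtain an $\varepsilon$-stationary point of the objective function is bounded by $\mathcal{O}\left(\varepsilon^{-2}\right)$ (resp. $\mathcal{O}\left(\varepsilon^{-4}\right)$) under the strongly convex-nonconcave (resp. convex-nonconcave) setting. To the best of our knowledge, this is the first time that a simple and unified single-loop algorithm has been developed for solving both nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. Moreover, complexity results for solving the latter (strongly) convex-nonconcave minimax problems have never been obtained before in the literature. Numerical results show the efficiency of the proposed AGP algorithm. Furthermore, we extend the AGP algorithm by presenting a block alternating proximal gradient (BAPG) algorithm for solving more general multi-block nonsmooth nonconvex-(strongly) concave and (strongly) convex-nonconcave minimax problems. We can similarly establish the gradient complexity of the proposed algorithm under these four different settings.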
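The alternating update described in this abstract can be illustrated with a minimal sketch. It assumes box constraints, fixed step sizes, and a toy strongly-convex-strongly-concave objective; the function names (`project_box`, `agp`), the step sizes, and the test problem are illustrative assumptions, and the sketch omits any additional terms the paper's full method may use.

```python
def project_box(v, lo, hi):
    """Euclidean projection of a vector onto the box [lo, hi]^d."""
    return [min(max(a, lo), hi) for a in v]


def agp(grad_x, grad_y, x, y, step_x, step_y, iters, lo=-1.0, hi=1.0):
    """Single-loop alternating projected gradient descent-ascent.

    Each iteration takes one projected descent step in x, then one
    projected ascent step in y that already uses the updated x.
    Step sizes and the box constraint are assumptions of this sketch.
    """
    for _ in range(iters):
        gx = grad_x(x, y)
        x = project_box([a - step_x * g for a, g in zip(x, gx)], lo, hi)
        gy = grad_y(x, y)  # evaluated at the freshly updated x
        y = project_box([b + step_y * g for b, g in zip(y, gy)], lo, hi)
    return x, y


# Toy problem (assumed): f(x, y) = 0.5*x^2 + x*y - 0.5*y^2, saddle point at (0, 0).
def toy_grad_x(x, y):
    return [x[0] + y[0]]


def toy_grad_y(x, y):
    return [x[0] - y[0]]


x_star, y_star = agp(toy_grad_x, toy_grad_y, [0.8], [-0.5], 0.1, 0.1, 500)
```

Updating `y` with the new `x` rather than the old one is what makes the scheme alternating instead of simultaneous; both variables are still updated once per iteration, so there is no inner loop.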
Federated learning is a distributed framework in which a model is trained over a set of devices while keeping data localized. This framework faces several systems-oriented challenges, which include (i) a communication bottleneck, since a large number of devices upload their local updates to a parameter server, and (ii) scalability, as the federated network consists of millions of devices. Due to these systems challenges, as well as issues related to statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging, where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation, where only a fraction of devices participate in each round of training; and (3) quantized message-passing, where the edge nodes quantize their updates before uploading to the parameter server. These features address the communication and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions, and we empirically demonstrate the communication-computation tradeoff provided by our method.
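The three features listed in this abstract can be sketched in a few lines. The sketch below is a hedged illustration, not the authors' implementation: the stochastic uniform quantizer, the function names (`quantize`, `fedpaq_round`), and the use of exact gradient oracles in place of stochastic gradients are all assumptions made for the example.

```python
import math
import random


def quantize(x, s, rng):
    """Unbiased stochastic quantizer with s levels (assumed, QSGD-style).

    Each coordinate is rounded at random to one of the two nearest
    multiples of ||x|| / s, so the result equals x in expectation.
    """
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [0.0] * len(x)
    out = []
    for v in x:
        scaled = abs(v) * s / norm          # lies in [0, s]
        low = math.floor(scaled)
        level = low + (1 if rng.random() < scaled - low else 0)
        out.append(math.copysign(norm * level / s, v))
    return out


def fedpaq_round(x, grad_fns, r, tau, lr, s, rng):
    """One communication round: sample r devices, run tau local steps
    on each, upload quantized model differences, average at the server."""
    chosen = rng.sample(range(len(grad_fns)), r)   # partial participation
    total = [0.0] * len(x)
    for i in chosen:
        local = list(x)
        for _ in range(tau):                        # local updates, no communication
            g = grad_fns[i](local)
            local = [a - lr * b for a, b in zip(local, g)]
        q = quantize([a - b for a, b in zip(local, x)], s, rng)
        total = [t + v for t, v in zip(total, q)]
    return [a + t / r for a, t in zip(x, total)]   # periodic averaging
```

A larger `tau` or a smaller `s` reduces the number of bits uploaded per unit of local computation, at the cost of more local drift or more quantization noise, which is the tradeoff the abstract refers to.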
Federated learning (FL) has emerged as an instance of the distributed machine learning paradigm that avoids the transmission of data generated on the users' side. Although data are not transmitted, edge devices have to deal with limited communication bandwidths, data heterogeneity, and straggler effects due to the limited computational resources of users' devices. A prominent approach to overcome such difficulties is FedADMM, which is based on the classical two-operator consensus alternating direction method of multipliers (ADMM). The common assumption of FL algorithms, including FedADMM, is that they learn a global model using data only on the users' side and not on the edge server. However, in edge learning, the server is expected to be near the base station and to have direct access to rich datasets. In this paper, we argue that leveraging the rich data on the edge server is much more beneficial than utilizing only user datasets. Specifically, we show that merely applying FL with an additional virtual user node representing the data on the edge server is inefficient. We propose FedTOP-ADMM, which generalizes FedADMM and is based on a three-operator ADMM-type technique that exploits a smooth cost function on the edge server to learn a global model in parallel with the edge devices. Our numerical experiments indicate that FedTOP-ADMM achieves a gain of up to 33\% in communication efficiency to reach a desired test accuracy with respect to FedADMM, including a virtual user on the edge server.
This paper studies the communication complexity of risk-averse optimization over a network. The problem generalizes the well-studied risk-neutral finite-sum distributed optimization problem, and its importance stems from the need to handle risk in an uncertain environment. For algorithms in the literature, there exists a gap in communication complexities for solving risk-averse and risk-neutral problems. We propose two distributed algorithms, namely the distributed risk-averse optimization (DRAO) method and the distributed risk-averse optimization with sliding (DRAO-S) method, to close the gap. Specifically, the DRAO method achieves the optimal communication complexity by assuming that a certain saddle point subproblem can be easily solved in the server node. The DRAO-S method removes this strong assumption by introducing a novel saddle point sliding subroutine which only requires the projection over the ambiguity set $P$. We observe that the number of $P$-projections performed by DRAO-S is optimal. Moreover, we develop matching lower complexity bounds to show that the communication complexities of both DRAO and DRAO-S are not improvable. Numerical experiments are conducted to demonstrate the encouraging empirical performance of the DRAO-S method.
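Distributed machine learning enables scalability and computational offloading, but requires significant levels of communication. Consequently, communication efficiency in distributed learning settings is an important consideration, especially when the communications are wireless and battery-driven devices are employed. In this paper, we develop a censoring-based heavy ball (CHB) method for distributed learning in a server-worker architecture. Each worker self-censors unless its local gradient is sufficiently different from the previously transmitted one. The significant practical advantages of the HB method for learning problems are well known, but the question of reducing communications has not been addressed. CHB takes advantage of HB smoothing to eliminate the reporting of small changes, and provably achieves a linear convergence rate equivalent to that of the classical HB method for smooth and strongly convex objective functions. The convergence guarantee of CHB is theoretically justified for both convex and nonconvex cases. In addition, we prove that, under some conditions, at least half of all communications can be eliminated without any impact on convergence rates. Extensive numerical results validate the communication efficiency of CHB on both synthetic and real datasets, for convex, nonconvex, and nondifferentiable cases. Given a target accuracy, CHB can significantly reduce the number of communications compared to existing algorithms, achieving the same accuracy without slowing down the optimization process.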
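The self-censoring idea can be sketched as follows. The abstract does not state the exact censoring threshold, so the rule used here (transmit only when the squared gradient change exceeds `tau` times the squared distance between the last two iterates) is an assumption, as are the function names and parameters.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def should_transmit(new_grad, last_sent, w, w_prev, tau):
    """Assumed censoring rule: report a gradient only if it differs enough
    from the last transmitted one, relative to how far the model moved."""
    return sq_dist(new_grad, last_sent) > tau * sq_dist(w, w_prev)


def heavy_ball_step(w, w_prev, agg_grad, alpha, beta):
    """Classical heavy-ball update: gradient step plus momentum term."""
    return [wi - alpha * gi + beta * (wi - pi)
            for wi, pi, gi in zip(w, w_prev, agg_grad)]


def chb(grad_fns, w0, alpha, beta, tau, iters):
    """Server-worker loop; the server reuses each worker's last
    transmitted gradient whenever that worker censors itself."""
    w_prev, w = list(w0), list(w0)
    last_sent = [g(w) for g in grad_fns]   # first round: every worker transmits
    transmissions = len(grad_fns)
    for _ in range(iters):
        for i, g in enumerate(grad_fns):
            gi = g(w)
            if should_transmit(gi, last_sent[i], w, w_prev, tau):
                last_sent[i] = gi
                transmissions += 1
        agg = [sum(col) for col in zip(*last_sent)]
        w_prev, w = w, heavy_ball_step(w, w_prev, agg, alpha, beta)
    return w, transmissions
```

With `tau = 0` every changed gradient is transmitted and the loop reduces to the classical heavy-ball method; larger values of `tau` trade staleness in the aggregated gradient for fewer uploads.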
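In decentralized learning, a network of nodes cooperates to minimize an overall objective function that is usually the finite sum of their local objectives and incorporates a non-smooth regularization term for better generalization ability. The decentralized stochastic proximal gradient (DSPG) method is commonly used to train this type of learning model, but its convergence rate is slowed by the variance of the stochastic gradients. In this paper, we propose a novel algorithm, namely DPSVRG, to accelerate decentralized training by leveraging the variance reduction technique. The basic idea is to introduce an estimator in each node, which periodically tracks the local full gradient, to correct the stochastic gradient at each iteration. By transforming our decentralized algorithm into a centralized inexact proximal gradient algorithm with variance reduction, and controlling the bounds of the error sequences, we prove that DPSVRG converges at the rate of $O(1/T)$ for general convex objectives plus a non-smooth term, with $T$ as the number of iterations, while DSPG converges at the rate of $O(\frac{1}{\sqrt{T}})$. Our experiments on different applications, network topologies, and learning models demonstrate that DPSVRG converges much faster than DSPG, and that the loss function of DPSVRG decreases smoothly along with the training epochs.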
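The gradient-correction idea in this abstract can be illustrated on a single node. The sketch below is a plain proximal SVRG loop with an $\ell_1$ regularizer; it deliberately leaves out the consensus (neighbor-averaging) step of the decentralized method, and all names, step sizes, and the toy objective are assumptions for illustration.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied coordinate-wise."""
    return [max(abs(a) - t, 0.0) * (1.0 if a > 0 else -1.0) for a in v]


def full_grad(grad_i, n, x):
    """Average of the n component gradients at x."""
    total = [0.0] * len(x)
    for i in range(n):
        total = [t + g for t, g in zip(total, grad_i(i, x))]
    return [t / n for t in total]


def vr_gradient(grad_i, i, x, snapshot, snapshot_full_grad):
    """Stochastic gradient corrected by the periodically tracked full gradient."""
    g_now, g_old = grad_i(i, x), grad_i(i, snapshot)
    return [a - b + c for a, b, c in zip(g_now, g_old, snapshot_full_grad)]


def prox_svrg(grad_i, n, x0, step, lam, epochs, inner, seed=0):
    """Single-node sketch: refresh the full gradient once per epoch,
    then take `inner` corrected proximal-gradient steps."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(epochs):
        snapshot = list(x)
        mu = full_grad(grad_i, n, snapshot)
        for _ in range(inner):
            i = rng.randrange(n)
            v = vr_gradient(grad_i, i, x, snapshot, mu)
            x = soft_threshold([a - step * b for a, b in zip(x, v)], step * lam)
    return x


# Toy objective (assumed): average of 0.5*(x - c_i)^2 plus lam*|x|.
centers = [1.0, 2.0, 3.0]


def toy_grad(i, x):
    return [x[0] - centers[i]]


x_hat = prox_svrg(toy_grad, 3, [0.0], 0.5, 0.5, 5, 20)
```

The correction term vanishes at the snapshot and shrinks as the iterate approaches it, which is why the variance of the corrected gradient decays instead of staying bounded away from zero as in plain stochastic proximal gradient.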
Decentralized bilevel optimization has received increasing attention recently due to its foundational role in many emerging multi-agent learning paradigms (e.g., multi-agent meta-learning and multi-agent reinforcement learning) over peer-to-peer edge networks. However, to work with the limited computation and communication capabilities of edge networks, a major challenge in developing decentralized bilevel optimization techniques is to lower sample and communication complexities. This motivates us to develop a new decentralized bilevel optimization algorithm called DIAMOND (decentralized single-timescale stochastic approximation with momentum and gradient-tracking). The contributions of this paper are as follows: i) our DIAMOND algorithm adopts a single-loop structure rather than following the natural double-loop structure of bilevel optimization, which offers low computation and implementation complexity; ii) compared to existing approaches, the DIAMOND algorithm does not require any full gradient evaluations, which further reduces both sample and computational complexities; iii) through a careful integration of momentum information and gradient tracking techniques, we show that the DIAMOND algorithm enjoys $\mathcal{O}(\epsilon^{-3/2})$ sample and communication complexities for achieving an $\epsilon$-stationary solution, both of which are independent of the dataset sizes and significantly improve on existing works. Extensive experiments also verify our theoretical findings.
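The optimal design of federated learning (FL) algorithms for solving general machine learning (ML) problems in practical edge computing systems with quantized message passing remains an open problem. This paper considers an edge computing system where the server and workers have possibly different computing and communication capabilities and employ quantization before transmitting messages. To explore the full potential of FL in such an edge computing system, we first present a general FL algorithm, namely GenQSGD, parameterized by the numbers of global and local iterations, the mini-batch size, and the step size sequence. Then, we analyze its convergence for an arbitrary step size sequence and specify the convergence results under three commonly adopted step size rules, namely the constant, exponential, and diminishing step size rules. Next, we optimize the algorithm parameters to minimize the energy cost under a time constraint and a convergence error constraint, with a focus on the overall implementation process of FL. Specifically, for any given step size sequence under each considered step size rule, we optimize the numbers of global and local iterations and the mini-batch size to optimally implement FL for applications with preset step size sequences. We also optimize the step size sequence along with these algorithm parameters to explore the full potential of FL. The resulting optimization problems are challenging non-convex problems with non-differentiable constraint functions. We propose iterative algorithms to obtain KKT points using general inner approximation (GIA) and tricks for solving complementary geometric programming (CGP). Finally, we numerically demonstrate the remarkable gains of GenQSGD with optimized algorithm parameters over existing FL algorithms and reveal the significance of optimally designing general FL algorithms.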
In large-scale distributed learning, security issues have become increasingly important. Particularly in a decentralized environment, some computing units may behave abnormally, or even exhibit Byzantine failures, that is, arbitrary and potentially adversarial behavior. In this paper, we develop distributed learning algorithms that are provably robust against such failures, with a focus on achieving optimal statistical performance. A main result of this work is a sharp analysis of two robust distributed gradient descent algorithms based on median and trimmed mean operations, respectively. We prove statistical error rates for three kinds of population loss functions: strongly convex, non-strongly convex, and smooth non-convex. In particular, these algorithms are shown to achieve order-optimal statistical error rates for strongly convex losses. To achieve better communication efficiency, we further propose a median-based distributed algorithm that is provably robust and uses only one communication round. For strongly convex quadratic loss, we show that this algorithm achieves the same optimal error rate as the robust distributed gradient descent algorithms.
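The two aggregation rules named in this abstract are simple to state in code. The following is a minimal sketch, assuming gradients arrive as equal-length lists of floats; the function names, the trimming fraction `beta`, and the toy workers are illustrative assumptions.

```python
from statistics import median


def coordinate_median(grads):
    """Coordinate-wise median of a list of equal-length gradient vectors."""
    return [median(col) for col in zip(*grads)]


def trimmed_mean(grads, beta):
    """Coordinate-wise beta-trimmed mean: in each coordinate, drop the
    floor(beta * m) largest and smallest values, then average the rest."""
    m = len(grads)
    k = int(beta * m)
    if 2 * k >= m:
        raise ValueError("beta too large: nothing left after trimming")
    out = []
    for col in zip(*grads):
        kept = sorted(col)[k:m - k]
        out.append(sum(kept) / len(kept))
    return out


def robust_gd(worker_grad_fns, w, aggregate, step=0.1, iters=100):
    """Distributed gradient descent where the server replaces the usual
    average of worker gradients with a robust aggregate."""
    for _ in range(iters):
        grads = [g(w) for g in worker_grad_fns]
        agg = aggregate(grads)
        w = [wi - step * gi for wi, gi in zip(w, agg)]
    return w


# Toy setup (assumed): four honest workers share the loss 0.5*(w - 1)^2,
# and one Byzantine worker always reports a huge gradient.
honest = [lambda w: [w[0] - 1.0] for _ in range(4)]
byzantine = [lambda w: [1e6]]
w_med = robust_gd(honest + byzantine, [0.0], coordinate_median)
w_trim = robust_gd(honest + byzantine, [0.0], lambda g: trimmed_mean(g, 0.2))
```

A plain average would be dominated by the single corrupted report in this example, whereas both robust aggregates ignore it as long as the corrupted workers are a minority (for the median) or fewer than the trimmed fraction (for the trimmed mean).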
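In federated learning (FL), the objective of collaboratively learning a global model through aggregation of model updates across devices tends to oppose the goal of personalization via local information. In this work, we calibrate this tradeoff in a quantitative manner through a multi-criterion optimization-based framework, which we cast as a constrained program: the objective for a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considering the Lagrangian relaxation of this problem, we develop an algorithm that allows each node to minimize its local component of the Lagrangian through queries to a first-order gradient oracle. Then, the server executes Lagrange multiplier ascent steps followed by a Lagrange multiplier-weighted averaging step. We call this instantiation of the primal-dual method Federated Learning Beyond Consensus ($\texttt{FedBC}$). Theoretically, we establish that $\texttt{FedBC}$ converges to a first-order stationary point at rates that match the state of the art, up to an additional error term that depends on a tolerance parameter introduced by the proximity constraints. Overall, the analysis is a novel characterization of primal-dual methods applied to non-convex saddle point problems. Finally, we demonstrate that $\texttt{FedBC}$ balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving competitive performance with the state of the art.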
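To increase the training speed of distributed learning, recent years have witnessed a significant amount of interest in developing both synchronous and asynchronous distributed stochastic variance-reduced optimization methods. However, all existing synchronous and asynchronous distributed training algorithms suffer from various limitations in either convergence speed or implementation complexity. This motivates us to propose an algorithm called \algname (\ul{s}emi-as\ul{yn}[...]ent \ul{s}earch), which leverages the special structure of the variance-reduction framework to overcome the limitations of both synchronous and asynchronous distributed learning algorithms, while retaining their salient features. We consider two implementations of \algname under distributed and shared memory architectures. We show that our \algname algorithms have \(O(\sqrt{N}\epsilon^{-2}(\Delta+1)+N)\) and \(O(\sqrt{N}\epsilon^{-2}(\Delta+1)d+N)\) computational complexities for achieving an \(\epsilon\)-stationary point in non-convex learning under distributed and shared memory architectures, respectively, where \(N\) denotes the total number of training samples and \(\Delta\) represents the maximum delay of the workers. Moreover, we investigate the generalization performance of \algname by establishing algorithmic stability bounds for quadratic strongly convex and non-convex optimization. We further conduct extensive numerical experiments to verify our theoretical findings.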
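We study solution methods for (strongly-)convex-(strongly-)concave saddle-point problems (SPPs) over two types of networks: master/workers (thus centralized) architectures and meshed (thus decentralized) networks. The local functions at each node are assumed to be similar, due to statistical data similarity or otherwise. We establish lower complexity bounds for a fairly general class of algorithms solving the SPP. We show that a given suboptimality $\epsilon > 0$ is achieved over master/workers networks in $\Omega\big(\Delta \cdot \delta/\mu \cdot \log(1/\varepsilon)\big)$ rounds of communication, where $\delta > 0$ measures the degree of similarity of the local functions, $\mu$ is their strong convexity constant, and $\Delta$ is the diameter of the network. The lower communication complexity bound over meshed networks reads $\Omega\big(1/{\sqrt{\rho}} \cdot {\delta}/{\mu} \cdot \log(1/\varepsilon)\big)$, where $\rho$ is the (normalized) eigengap of the gossip matrix used for communication between neighbouring nodes. We then propose algorithms matching the lower bounds over either type of network (up to log factors). We assess the effectiveness of the proposed algorithms on a robust logistic regression problem.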
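In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are not satisfactory, as they are either asymptotic or their convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a generalization of the Frank-Wolfe (FW) method to solve the considered problem. The main idea of our method is to locally approximate the solution set of the lower-level problem via a cutting plane, and then run a FW-type update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f, 1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2, 1/(\epsilon_f \epsilon_g)\})$ iterations to find an $(\epsilon_f, \epsilon_g)$-optimal solution. We further prove stronger convergence guarantees under the Hölderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered bilevel problem. We also present numerical experiments to showcase the superior performance of our method compared with state-of-the-art methods.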