The mixture of Expert (MoE) parallelism is a recent advancement that scales up the model size with constant computational cost. MoE selects different sets of parameters (i.e., experts) for each incoming token, resulting in a sparsely-activated model. Despite several successful applications of MoE, its training efficiency degrades significantly as the number of experts increases. The routing stage in MoE relies on the efficiency of the All2All communication collective, which suffers from network congestion and has poor scalability. To mitigate these issues, we introduce SMILE, which exploits heterogeneous network bandwidth and splits a single-step routing into bi-level routing. Our experimental results show that the proposed method obtains a 2.5x speedup over Switch Transformer in terms of pretraining throughput on the Colossal Clean Crawled Corpus without losing any convergence speed.
translated by 谷歌翻译
由于监督模型无法学习可以在具有有限标签的域中概括的域名,因此自我监督学习(SSL)已成为计算机视觉中的理想范式。 SSL的最新流行导致了几种模型的开发,这些模型利用了不同的培训策略,架构和数据扩展政策,而没有现有的统一框架来研究或评估其在转移学习中的有效性。我们提出了一个数据驱动的几何策略,可以使用每个局部诱导的特征空间中的局部邻域分析不同的SSL模型。与考虑参数,单个组件或优化领域的数学近似的现有方法不同,我们的工作旨在探索SSL模型所学的表示歧管的几何特性。我们提出的歧管图指标(MGM)提供了有关可用SSL模型之间的几何相似性和差异的见解,它们在特定的增强方面的不变以及它们在转移学习任务方面的表现。我们的关键发现是两个方面:(i)与普遍的看法相反,SSL模型的几何形状与其训练范式(对比度,无对比性和基于群集)无关; (ii)我们可以根据其语义和增强歧管的几何特性来预测特定模型的传输学习能力。
translated by 谷歌翻译
重要的理论工作已经确定,在特定的制度中,通过梯度下降训练的神经网络像内核方法一样行为。但是,在实践中,众所周知,神经网络非常优于其相关内核。在这项工作中,我们通过证明有一大批功能可以通过内核方法有效地学习,但是可以通过学习表示与相关的学习表示,可以轻松地学习这一差距。到目标任务。我们还证明了这些表示允许有效的转移学习,这在内核制度中是不可能的。具体而言,我们考虑学习多项式的问题,该问题仅取决于少数相关的方向,即$ f^\ star(x)= g(ux)$ withy $ u:\ r^d \ to \ r^r $ d \ gg r $。当$ f^\ star $的度数为$ p $时,众所周知,在内核制度中学习$ f^\ star $是必要的。我们的主要结果是,梯度下降学会了数据的表示,这仅取决于与$ f^\ star $相关的指示。这导致改进的样本复杂性为$ n \ asymp d^2 r + dr^p $。此外,在转移学习设置中,源和目标域中的数据分布共享相同的表示$ u $,但具有不同的多项式头部,我们表明,转移学习的流行启发式启发式启发式具有目标样本复杂性,独立于$ d $。
translated by 谷歌翻译
联合学习(FL)是分布式学习范例,可以从边缘设备上的分散数据集中学习全局或个性化模型。然而,在计算机视觉域中,由于统一的流行框架缺乏探索,FL的模型性能远远落后于集中培训。在诸如物体检测和图像分割之类的高级计算机视觉任务中,FL很少有效地说明。为了弥合差距并促进电脑视觉任务的流动,在这项工作中,我们提出了一个联邦学习库和基准框架,命名为FEDCV,评估了三个最具代表性的计算机视觉任务:图像分类,图像分割,和物体检测。我们提供非I.I.D。基准测试数据集,模型和各种参考FL算法。我们的基准研究表明,存在多种挑战值得未来的探索:集中式培训技巧可能不会直接申请fl;非i.i.d。 DataSet实际上将模型精度降级到不同的任务中的某种程度;给出了联合培训的系统效率,具有挑战性,鉴于大量参数和每个客户端记忆成本。我们认为,这种图书馆和基准以及可比的评估设置是必要的,以便在计算机视觉任务中进行有意义的进展。 Fedcv公开可用:https://github.com/fedml-ai/fedcv。
translated by 谷歌翻译
最近以来,在理解与overparameterized模型非凸损失基于梯度的方法收敛性和泛化显著的理论进展。尽管如此,优化和推广,尤其是小的随机初始化的关键作用的许多方面都没有完全理解。在本文中,我们迈出玄机通过证明小的随机初始化这个角色的步骤,然后通过梯度下降的行为类似于流行谱方法的几个迭代。我们还表明,从小型随机初始化,这可证明是用于overparameterized车型更加突出这种隐含的光谱偏差,也使梯度下降迭代在一个特定的轨迹走向,不仅是全局最优的,但也很好期广义的解决方案。具体而言,我们专注于通过天然非凸制剂重构从几个测量值的低秩矩阵的问题。在该设置中,我们表明,从小的随机初始化的梯度下降迭代的轨迹可以近似分解为三个阶段:(Ⅰ)的光谱或对准阶段,其中,我们表明,该迭代具有一个隐含的光谱偏置类似于频谱初始化允许我们表明,在该阶段中进行迭代,并且下面的低秩矩阵的列空间被充分对准的端部,(II)一鞍回避/细化阶段,我们表明,该梯度的轨迹从迭代移动离开某些简并鞍点,和(III)的本地细化阶段,其中,我们表明,避免了鞍座后的迭代快速收敛到底层低秩矩阵。底层我们的分析是,可能有超出低等级的重建计算问题影响overparameterized非凸优化方案的分析见解。
translated by 谷歌翻译
在本文中,我们研究了学习最适合培训数据集的浅层人工神经网络的问题。我们在过度参数化的制度中研究了这个问题,在该制度中,观测值的数量少于模型中的参数数量。我们表明,通过二次激活,训练的优化景观这种浅神经网络具有某些有利的特征,可以使用各种局部搜索启发式方法有效地找到全球最佳模型。该结果适用于输入/输出对的任意培训数据。对于可区分的激活函数,我们还表明,适当初始化的梯度下降以线性速率收敛到全球最佳模型。该结果着重于选择输入的可实现模型。根据高斯分布和标签是根据种植的重量系数生成的。
translated by 谷歌翻译
Differentiable Architecture Search (DARTS) has attracted considerable attention as a gradient-based Neural Architecture Search (NAS) method. Since the introduction of DARTS, there has been little work done on adapting the action space based on state-of-art architecture design principles for CNNs. In this work, we aim to address this gap by incrementally augmenting the DARTS search space with micro-design changes inspired by ConvNeXt and studying the trade-off between accuracy, evaluation layer count, and computational cost. To this end, we introduce the Pseudo-Inverted Bottleneck conv block intending to reduce the computational footprint of the inverted bottleneck block proposed in ConvNeXt. Our proposed architecture is much less sensitive to evaluation layer count and outperforms a DARTS network with similar size significantly, at layer counts as small as 2. Furthermore, with less layers, not only does it achieve higher accuracy with lower GMACs and parameter count, GradCAM comparisons show that our network is able to better detect distinctive features of target objects compared to DARTS.
translated by 谷歌翻译
Solving portfolio management problems using deep reinforcement learning has been getting much attention in finance for a few years. We have proposed a new method using experts signals and historical price data to feed into our reinforcement learning framework. Although experts signals have been used in previous works in the field of finance, as far as we know, it is the first time this method, in tandem with deep RL, is used to solve the financial portfolio management problem. Our proposed framework consists of a convolutional network for aggregating signals, another convolutional network for historical price data, and a vanilla network. We used the Proximal Policy Optimization algorithm as the agent to process the reward and take action in the environment. The results suggested that, on average, our framework could gain 90 percent of the profit earned by the best expert.
translated by 谷歌翻译
To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
translated by 谷歌翻译
Prostate cancer is the most common cancer in men worldwide and the second leading cause of cancer death in the United States. One of the prognostic features in prostate cancer is the Gleason grading of histopathology images. The Gleason grade is assigned based on tumor architecture on Hematoxylin and Eosin (H&E) stained whole slide images (WSI) by the pathologists. This process is time-consuming and has known interobserver variability. In the past few years, deep learning algorithms have been used to analyze histopathology images, delivering promising results for grading prostate cancer. However, most of the algorithms rely on the fully annotated datasets which are expensive to generate. In this work, we proposed a novel weakly-supervised algorithm to classify prostate cancer grades. The proposed algorithm consists of three steps: (1) extracting discriminative areas in a histopathology image by employing the Multiple Instance Learning (MIL) algorithm based on Transformers, (2) representing the image by constructing a graph using the discriminative patches, and (3) classifying the image into its Gleason grades by developing a Graph Convolutional Neural Network (GCN) based on the gated attention mechanism. We evaluated our algorithm using publicly available datasets, including TCGAPRAD, PANDA, and Gleason 2019 challenge datasets. We also cross validated the algorithm on an independent dataset. Results show that the proposed model achieved state-of-the-art performance in the Gleason grading task in terms of accuracy, F1 score, and cohen-kappa. The code is available at https://github.com/NabaviLab/Prostate-Cancer.
translated by 谷歌翻译