Graph neural networks (GNNs) have been demonstrated to be a powerful algorithmic model in broad application fields for their effectiveness in learning over graphs. To scale GNN training up for large-scale and ever-growing graphs, the most promising solution is distributed training, which distributes the workload of training across multiple computing nodes. However, the workflows, computational patterns, communication patterns, and optimization techniques of distributed GNN training remain only preliminarily understood. In this paper, we provide a comprehensive survey of distributed GNN training by investigating various optimization techniques used in distributed GNN training. First, distributed GNN training is classified into several categories according to their workflows. In addition, their computational patterns and communication patterns, as well as the optimization techniques proposed by recent work, are introduced. Second, the software frameworks and hardware platforms of distributed GNN training are also introduced for a deeper understanding. Third, distributed GNN training is compared with distributed training of deep neural networks, emphasizing the uniqueness of distributed GNN training. Finally, interesting issues and opportunities in this field are discussed.
Graph neural networks (GNNs) extend the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient when training large graphs with billions of nodes and edges on GPUs. The main bottleneck is the process of preparing data for GPUs: subgraph sampling and feature retrieval. This paper proposes BGL, a distributed GNN training system designed to address this bottleneck with a few key ideas. First, we propose a dynamic cache engine to minimize feature retrieval traffic. By co-designing the caching policy and the order of sampling, we find a sweet spot of low overhead and a high cache hit ratio. Second, we improve the graph partitioning algorithm to reduce cross-partition communication during subgraph sampling. Finally, careful resource isolation reduces contention between different data preprocessing stages. Extensive experiments on various GNN models and large graph datasets show that BGL significantly outperforms existing GNN training systems by 20.68x on average.
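As a rough illustration of the cache-policy/sampling-order co-design described above, the sketch below pairs a plain LRU feature cache with a BFS-ordered seed schedule. Both the class and the BFS heuristic are our own stand-ins, not BGL's actual engine.

```python
from collections import OrderedDict

class FeatureCache:
    """Toy LRU feature cache standing in for BGL's dynamic cache engine."""
    def __init__(self, features, capacity):
        self.features = features          # full feature store kept in host memory
        self.capacity = capacity          # number of feature rows kept "near" the GPU
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, node_id):
        if node_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(node_id)
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict the least recently used row
            self.cache[node_id] = self.features[node_id]
        return self.cache[node_id]

def bfs_seed_order(adj, start=0):
    """Order training seeds by BFS so consecutive mini-batches share neighborhoods."""
    visited, order, frontier = {start}, [start], [start]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    order.append(v)
                    nxt.append(v)
        frontier = nxt
    return order
```

Because BFS-ordered seeds reuse overlapping neighborhoods across consecutive mini-batches, even this naive cache sees far more hits than it would under randomly shuffled seeds; BGL's real engine exploits the same effect far more aggressively.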
Graph neural networks (GNNs) have been demonstrated to be a powerful tool for analyzing non-Euclidean graph data. However, the lack of efficient distributed graph learning (GL) systems severely hinders applications of GNNs, especially when graphs are big and GNNs are relatively deep. In this paper, we present GraphTheta, a novel distributed and scalable GL system implemented in a vertex-centric graph programming model. GraphTheta is the first GL system built on distributed graph processing in which neural network operators are implemented as user-defined functions. The system supports multiple training strategies and enables highly scalable big-graph learning on distributed (virtual) machines. To facilitate graph convolution implementations, GraphTheta proposes a new GL abstraction named NN-TGAR to bridge the gap between graph processing and graph deep learning. A distributed graph engine is proposed to conduct stochastic gradient descent optimization with hybrid-parallel execution. Moreover, in addition to global-batch and mini-batch, we provide support for a new cluster-batch training strategy. We evaluate GraphTheta on datasets of various network sizes, ranging from small and moderate to large scale. Experimental results show that GraphTheta scales well to 1,024 workers when training an in-house developed GNN on an industrial-scale Alipay dataset with 1.4 billion nodes and 4.1 billion attributed edges, using a cluster of CPU virtual machines (dockers) with small memory (5$\sim$12 GB). Moreover, GraphTheta obtains comparable or better prediction results than state-of-the-art GNN implementations, demonstrating its capability of learning GNNs as well as existing frameworks do, and can outperform them by up to 2.02$\times$ with better scalability. To the best of our knowledge, this work presents the largest edge-attributed GNN learning task reported in the literature.
The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for using multi-GPU platforms. However, existing multi-GPU GNN solutions suffer from inferior performance due to imbalanced computation and inefficient communication. To this end, we propose MGG, a novel system design that can accelerate GNNs on multi-GPU platforms via a GPU-centric software pipeline. MGG explores the potential of hiding remote memory access latency in GNN workloads through fine-grained computation-communication pipelining. Specifically, MGG introduces a pipeline-aware workload management strategy and a hybrid data layout design to facilitate communication-computation overlapping. MGG implements an optimized pipeline-centric kernel, including workload interleaving and warp-based mapping for efficient GPU kernel operation pipelining, as well as specialized memory designs and optimizations for better data access performance. In addition, MGG incorporates lightweight analytical modeling and optimization heuristics to dynamically improve GNN execution performance for different settings at runtime. Comprehensive experiments demonstrate that MGG outperforms state-of-the-art multi-GPU systems across various GNN settings: on average 3.65x faster than multi-GPU systems with a unified virtual memory design and 7.38x faster than the DGCL framework.
Recently, graph convolutional networks (GCNs) have become a state-of-the-art algorithm for analyzing non-Euclidean graph data. However, realizing efficient GCN training, especially on large graphs, remains challenging. The reasons are manifold: 1) GCN training incurs a substantial memory footprint; full-batch training on large graphs even requires hundreds to thousands of gigabytes of memory to buffer intermediate data for backward propagation. 2) GCN training involves both memory-intensive data reduction and compute-intensive feature/gradient update operations; this heterogeneous nature challenges current CPU/GPU platforms. 3) The irregularity of graphs and the complex training dataflow jointly increase the difficulty of improving the efficiency of GCN training systems. This paper presents GCNear, a hybrid architecture to tackle these challenges. Specifically, GCNear adopts a DIMM-based memory system to provide easy-to-scale memory capacity. To match the heterogeneous nature, we categorize GCN training operations as memory-intensive Reduce and compute-intensive Update operations. We then offload Reduce operations to on-DIMM NMEs, making full use of the high aggregated local bandwidth, and adopt a CAE with sufficient computation capacity to process Update operations. We further propose several optimization strategies to deal with the irregularity of GCN tasks and improve GCNear's performance. We also propose a Multi-GCNear system to evaluate the scalability of GCNear.
Dynamic Graph Neural Networks (DGNNs) have been broadly applied in various real-life applications, such as link prediction and pandemic forecast, to capture both static structural information and temporal characteristics from dynamic graphs. Combining both time-dependent and -independent components, DGNNs manifest substantial parallel computation and data reuse potentials, but suffer from severe memory access inefficiency and data transfer overhead under the canonical one-graph-at-a-time training pattern. To tackle the challenges, we propose PiPAD, a $\underline{\textbf{Pi}}pelined$ and $\underline{\textbf{PA}}rallel$ $\underline{\textbf{D}}GNN$ training framework for the end-to-end performance optimization on GPUs. From both the algorithm and runtime level, PiPAD holistically reconstructs the overall training paradigm from the data organization to computation manner. Capable of processing multiple graph snapshots in parallel, PiPAD eliminates the unnecessary data transmission and alleviates memory access inefficiency to improve the overall performance. Our evaluation across various datasets shows PiPAD achieves $1.22\times$-$9.57\times$ speedup over the state-of-the-art DGNN frameworks on three representative models.
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs. The large data sizes of graphs and their vertex features make scalable training algorithms and distributed memory systems necessary. Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges. We propose a highly parallel training algorithm that scales to large processor counts. In our solution, the large adjacency and vertex-feature matrices are partitioned among processors. We exploit the vertex-partitioning of the graph to use non-blocking point-to-point communication operations between processors for better scalability. To further minimize the parallelization overheads, we introduce a sparse matrix partitioning scheme based on a hypergraph partitioning model for full-batch training. We also propose a novel stochastic hypergraph model to encode the expected communication volume in mini-batch training. We show the merits of the hypergraph model, previously unexplored for GCN training, over the standard graph partitioning model which does not accurately encode the communication costs. Experiments performed on real-world graph datasets demonstrate that the proposed algorithms achieve considerable speedups over alternative solutions. The optimizations achieved on communication costs become even more pronounced at high scalability with many processors. The performance benefits are preserved in deeper GCNs having more layers as well as on billion-scale graphs.
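The toy comparison below is our own construction, not the paper's code; it shows why the hypergraph connectivity metric captures communication volume more faithfully than the edge-cut objective: when several cut edges point into the same remote part, a vertex's feature row still crosses the boundary only once.

```python
def edge_cut(adj, part):
    """Standard graph-partitioning objective: number of edges whose endpoints differ."""
    return sum(1 for u, nbrs in adj.items() for v in nbrs if part[u] != part[v]) // 2

def connectivity_volume(adj, part):
    """Hypergraph (connectivity - 1) metric: words moved when each vertex's feature
    row is sent once to every remote part holding at least one of its neighbors."""
    vol = 0
    for u, nbrs in adj.items():
        remote_parts = {part[v] for v in nbrs} - {part[u]}
        vol += len(remote_parts)
    return vol

# Bipartite clique split across two parts: 9 cut edges, but only 6 feature rows
# actually cross the boundary (each vertex is sent to the other part exactly once).
adj = {u: [v for v in range(3, 6)] for u in range(3)}
adj.update({v: [u for u in range(3)] for v in range(3, 6)})
part = {u: 0 for u in range(3)}
part.update({v: 1 for v in range(3, 6)})
print(edge_cut(adj, part), connectivity_volume(adj, part))   # -> 9 6
```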
Graph convolutional networks (GCNs) have emerged as the state-of-the-art graph learning model. However, it can be notoriously challenging to run GCN inference over large graph datasets, which limits their application to large real-world graphs and hinders the exploration of deeper and more sophisticated GCNs. This is because real-world graphs can be extremely large and sparse. Furthermore, node degrees in GCN graphs tend to follow a power-law distribution, yielding highly irregular adjacency matrices and thus prohibitive inefficiencies in data processing and movement, which significantly limits the achievable GCN acceleration efficiency. To this end, this paper proposes a GCN algorithm and accelerator co-design framework dubbed GCoD, which largely alleviates the aforementioned GCN irregularity and boosts GCN inference efficiency. Specifically, on the algorithm level, GCoD integrates a split-and-conquer GCN training strategy that polarizes the graph to be either denser or sparser in local neighborhoods without compromising model accuracy, resulting in adjacency matrices that (mostly) consist of merely two levels of workload and enjoy largely enhanced regularity, and thus ease of acceleration. On the hardware level, we further develop a dedicated two-pronged accelerator with separate engines to process each of the aforementioned denser and sparser workloads, further boosting overall utilization and acceleration efficiency. Extensive experiments and ablation studies validate that GCoD consistently reduces the number of off-chip accesses, leading to speedups of 15286x, 294x, 7.8x, and 2.5x over CPUs, GPUs, and the prior-art GCN accelerators HyGCN and AWB-GCN, respectively, while maintaining or even improving task accuracy.
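A toy, assumption-laden sketch of the "polarization" idea: sorting nodes by degree concentrates hub-hub connections into a small dense block and leaves a much sparser remainder, the two-level workload that GCoD's dual engines are built around. The planted-hub graph and the 10% hub fraction below are illustrative choices of ours, not GCoD's algorithm.

```python
import numpy as np

def polarize_by_degree(adj_matrix, hub_fraction=0.1):
    """Reorder nodes by degree and split the adjacency into a dense hub block
    and a sparse remainder, each of which could be handled by its own engine."""
    deg = adj_matrix.sum(axis=1)
    order = np.argsort(-deg)                          # hubs first
    reordered = adj_matrix[np.ix_(order, order)]
    n_hub = max(1, int(hub_fraction * len(deg)))
    dense_block = reordered[:n_hub, :n_hub]           # hub-hub interactions (dense)
    sparse_rest = reordered[n_hub:, :]                # everything else (sparse)
    return dense_block, sparse_rest, order

rng = np.random.default_rng(0)
n = 200
A = rng.random((n, n)) < 0.01                         # sparse background graph
A[:20, :] |= rng.random((20, n)) < 0.3                # plant 20 high-degree hubs
A = (A | A.T).astype(np.int8)
np.fill_diagonal(A, 0)
dense, sparse, order = polarize_by_degree(A)
print(dense.mean(), sparse.mean())                    # the hub block is far denser
```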
Developing scalable solutions for training graph neural networks (GNNs) for link prediction tasks is challenging because of the high data dependencies, which entail high computational cost and a huge memory footprint. We propose a new method for scaling the training of knowledge graph embedding models that addresses these challenges. To this end, we propose the following algorithmic strategies: self-sufficient partitions, constraint-based negative sampling, and edge mini-batch training. Both the partitioning strategy and the constraint-based negative sampling avoid cross-partition data transfer during training. In our experimental evaluation, we show that our scaling solution for GNN-based knowledge graph embedding models achieves a 16x speedup on benchmark datasets while maintaining comparable model performance, on standard metrics, to the non-distributed approach.
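A minimal sketch of constraint-based negative sampling under the partitioning described above; the function and variable names are hypothetical, but the constraint is the key point: corrupted entities are drawn only from the partition that already holds the positive triple, so scoring a negative never requires fetching embeddings from another partition.

```python
import random

def sample_negatives(head, rel, tail, entity_partition, partition_entities, k=5):
    """Corrupt the tail of a positive triple using only co-located entities."""
    part = entity_partition[tail]
    local_pool = partition_entities[part]        # entities stored on the same worker
    negatives = []
    while len(negatives) < k:                    # duplicates are tolerated in this sketch
        cand = random.choice(local_pool)
        if cand != tail:                         # never sample the true tail
            negatives.append((head, rel, cand))
    return negatives

# toy setup: entities 0-3 on partition 0, entities 4-7 on partition 1
entity_partition = {e: (0 if e < 4 else 1) for e in range(8)}
partition_entities = {0: list(range(4)), 1: list(range(4, 8))}
print(sample_negatives(head=0, rel=2, tail=1,
                       entity_partition=entity_partition,
                       partition_entities=partition_entities))
```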
In recent years, significant progress has been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the past decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, this survey extends the previous one by also covering hypergraph partitioning and streaming algorithms, and it has an additional focus on parallel algorithms.
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
Recently, graph neural networks (GNNs) have been in the spotlight as powerful tools that can effectively perform various inference tasks on graph-structured data. As the size of real-world graphs continues to scale, GNN training systems face scalability challenges. Distributed training is a popular approach to address this challenge by scaling out CPU nodes. However, not much attention has been paid to disk-based GNN training, which can scale up a single-node system in a more cost-effective manner by leveraging high-performance storage devices such as NVMe SSDs. We observe that data movement between main memory and disk is the primary bottleneck in SSD-based training systems, and that the conventional GNN training pipeline is suboptimal without taking this overhead into account. Thus, we propose Ginex, the first SSD-based GNN training system that can process billion-scale graph datasets on a single machine. Inspired by the inspector-executor execution model of compiler optimization, Ginex restructures the GNN training pipeline by separating the sample and gather stages. This separation allows Ginex to realize a provably optimal replacement algorithm, known as Belady's algorithm, for caching feature vectors in memory, which account for the dominant portion of I/O accesses. According to our evaluation with four billion-scale graph datasets, Ginex achieves 2.11x higher training throughput on average (up to 2.67x at maximum) than an SSD-extended PyTorch Geometric.
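The sketch below re-implements the property the abstract leans on: once the sample stage runs ahead of the gather stage, the full future trace of feature accesses is known, so Belady's "evict whatever is needed furthest in the future" rule becomes directly implementable. This is an illustration, not Ginex's code.

```python
def belady_hits(trace, capacity):
    """Count cache hits under Belady's (MIN) policy on a fully known access trace."""
    # next_pos[i]: index of the next access to trace[i] after position i (inf if none)
    next_pos = [float("inf")] * len(trace)
    last = {}
    for i in range(len(trace) - 1, -1, -1):
        next_pos[i] = last.get(trace[i], float("inf"))
        last[trace[i]] = i

    cache, hits = {}, 0                    # cache: node id -> position of its next access
    for i, node in enumerate(trace):
        if node in cache:
            hits += 1
            cache[node] = next_pos[i]
        elif len(cache) < capacity:
            cache[node] = next_pos[i]
        else:
            victim = max(cache, key=cache.get)   # cached node needed furthest ahead
            if cache[victim] > next_pos[i]:      # incoming node is needed sooner
                del cache[victim]
                cache[node] = next_pos[i]
            # otherwise bypass the cache for this access
    return hits

trace = [1, 2, 3, 1, 2, 4, 1, 2]
print(belady_hits(trace, capacity=2))   # 4 hits; plain LRU gets 0 hits on this trace
```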
Graph neural networks (GNNs) have shown success in learning from graph-structured data, with applications to fraud detection, recommendation, and knowledge graph reasoning. However, training GNNs efficiently is challenging because: 1) GPU memory capacity is limited and can be insufficient for large datasets, and 2) the graph-based data structure leads to irregular data access patterns. In this work, we provide a method of statistical analysis and identify the data that are more frequently accessed before GNN training. Our data tiering method not only utilizes the structure of the input graph but also gains insight from the actual GNN training process to achieve better prediction results. Together with our data tiering method, we further provide a new data placement and access strategy to minimize the CPU-GPU communication overhead. We also consider multi-GPU GNN training and demonstrate the effectiveness of our strategy in a multi-GPU system. The evaluation results show that our work reduces CPU-GPU traffic by 87-95% and improves the GNN training speed over existing solutions on graphs with hundreds of millions of nodes and billions of edges.
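A hedged sketch of the basic data-tiering recipe, with naming, thresholds, and the Zipf-like toy sampler all being our own assumptions rather than the paper's implementation: dry-run the sampler to estimate per-node access frequency, pin the hottest feature rows on the GPU, and fall back to host memory for the rest.

```python
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def estimate_hotness(sample_batch, num_nodes, num_probe_batches=100):
    """Count feature accesses by dry-running the sampler before training starts."""
    counts = np.zeros(num_nodes, dtype=np.int64)
    for _ in range(num_probe_batches):
        counts[sample_batch()] += 1              # node ids touched by one mini-batch
    return counts

def tier_features(features_cpu, counts, gpu_budget):
    """Pin the gpu_budget most frequently accessed feature rows on the device."""
    hot = np.argsort(-counts)[:gpu_budget]
    gpu_table = torch.from_numpy(features_cpu[hot]).to(device)
    slot = np.full(len(counts), -1, dtype=np.int64)
    slot[hot] = np.arange(len(hot))              # node id -> row in the device table
    return gpu_table, slot

def gather(batch_nodes, features_cpu, gpu_table, slot):
    """Serve hot rows from the device table, fall back to host memory otherwise."""
    s = slot[batch_nodes]
    hit = s >= 0
    out = torch.empty(len(batch_nodes), gpu_table.shape[1], device=device)
    out[torch.from_numpy(hit).to(device)] = gpu_table[torch.from_numpy(s[hit]).to(device)]
    out[torch.from_numpy(~hit).to(device)] = torch.from_numpy(
        features_cpu[batch_nodes[~hit]]).to(device)
    return out

# toy usage: random features and a sampler skewed toward a few hot node ids
feats = np.random.rand(1000, 16).astype(np.float32)
sampler = lambda: np.random.zipf(1.5, size=64) % 1000
counts = estimate_hotness(sampler, num_nodes=1000)
gpu_table, slot = tier_features(feats, counts, gpu_budget=100)
print(gather(sampler(), feats, gpu_table, slot).shape)   # (64, 16)
```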
Training Graph Neural Networks, on graphs containing billions of vertices and edges, at scale using minibatch sampling poses a key challenge: strong-scaling graphs and training examples results in lower compute and higher communication volume and potential performance loss. DistGNN-MB employs a novel Historical Embedding Cache combined with compute-communication overlap to address this challenge. On a 32-node (64-socket) cluster of $3^{rd}$ generation Intel Xeon Scalable Processors with 36 cores per socket, DistGNN-MB trains 3-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of 2 seconds and 4.9 seconds, respectively, on 32 compute nodes. At this scale, DistGNN-MB trains GraphSAGE 5.2x faster than the widely-used DistDGL. DistGNN-MB trains GraphSAGE and GAT 10x and 17.2x faster, respectively, as compute nodes scale from 2 to 32.
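A simplified sketch of a historical embedding cache in the spirit of the one described above; the staleness rule and naming are ours, not DistGNN-MB's implementation. Remote neighbors' activations are served from a local, slightly stale copy for a bounded number of iterations, and only the expired entries trigger communication, which can then be overlapped with local compute.

```python
import torch

class HistoricalEmbeddingCache:
    """Reuse stale remote-neighbor embeddings for up to max_staleness iterations."""
    def __init__(self, num_nodes, dim, max_staleness=3):
        self.embs = torch.zeros(num_nodes, dim)
        self.last_update = torch.full((num_nodes,), -10**9, dtype=torch.long)
        self.max_staleness = max_staleness

    def lookup(self, remote_ids, step):
        fresh = (step - self.last_update[remote_ids]) <= self.max_staleness
        return remote_ids[fresh], remote_ids[~fresh]   # (served from cache, must fetch)

    def update(self, ids, new_embs, step):
        self.embs[ids] = new_embs.detach()             # store without autograd history
        self.last_update[ids] = step

cache = HistoricalEmbeddingCache(num_nodes=10, dim=4)
ids = torch.tensor([2, 5, 7])
cached, to_fetch = cache.lookup(ids, step=0)           # everything is stale initially
cache.update(to_fetch, torch.randn(len(to_fetch), 4), step=0)
```

In a distributed run, the fetch for the expired ids would be issued asynchronously and overlapped with the local forward pass, which is the compute-communication overlap the abstract refers to.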
Graph neural networks (GNNs) have become an effective method for handling machine learning tasks and have brought a new approach to building recommender systems, in which the recommendation task can be formulated as a link prediction problem between users and items. Training GNN-based recommender systems (GNNRecSys) on large graphs incurs a large memory footprint that easily exceeds the DRAM capacity of a typical server. Existing solutions resort to distributed subgraph training, which is inefficient due to the high cost of dynamically constructing subgraphs and the significant redundancy across subgraphs. The emerging Intel Optane persistent memory allows a single machine to have up to 6 TB of memory at an affordable cost, making single-machine GNNRecSys training feasible and eliminating the inefficiencies of distributed training. One major concern with using Optane for GNNRecSys, compared with DRAM, is Optane's relatively low bandwidth. This limitation can be particularly detrimental to achieving high performance for GNNRecSys workloads, since their dominant compute kernels are sparse and memory-access intensive. To understand whether Optane is a good fit for GNNRecSys training, we perform an in-depth characterization of GNNRecSys workloads and a comprehensive benchmarking study. Our benchmarking results show that, when properly configured, Optane-based single-machine GNNRecSys training outperforms distributed training by a large margin, especially when handling deep GNN models. We analyze where the speedups come from, provide guidance on how to configure Optane for GNNRecSys workloads, and discuss opportunities for further optimization.
Sampling is a critical operation in graph neural network (GNN) training that helps reduce the cost. Previous literature has explored improving sampling algorithms via mathematical and statistical methods. However, there is a gap between sampling algorithms and hardware. Without considering the hardware, algorithm designers merely optimize sampling at the algorithm level, missing the great potential of promoting the efficiency of existing sampling algorithms by leveraging hardware features. In this paper, we first propose a unified programming model for mainstream sampling algorithms, termed GNNSampler, covering the critical processes of sampling algorithms in various categories. Second, to leverage hardware features, we choose data locality as a case study and explore the locality among nodes and their neighbors in a graph to alleviate irregular memory access in sampling. Third, we implement locality-aware optimizations in GNNSampler for various sampling algorithms to optimize the general sampling process. Finally, we emphasize experiments on large graph datasets to analyze the relevance among training time, accuracy, and hardware-level metrics. Extensive experiments show that our method is universal to mainstream sampling algorithms and helps significantly reduce training time, especially on large-scale graphs.
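As one concrete, simplified instance of a locality-aware optimization inside a generic sample/gather loop (our own example, not GNNSampler's code), the sketch below deduplicates and sorts sampled neighbor ids before the feature gather, so the reads sweep the feature matrix in increasing order rather than jumping around randomly.

```python
import numpy as np

def sample_neighbors(adj, seeds, fanout, rng):
    """Uniformly sample up to `fanout` neighbors per seed node."""
    sampled = []
    for s in seeds:
        nbrs = adj[s]
        k = min(fanout, len(nbrs))
        sampled.extend(rng.choice(nbrs, size=k, replace=False))
    return np.asarray(sampled)

def gather_locality_aware(features, node_ids):
    """Gather feature rows in sorted, deduplicated order, then restore batch order."""
    uniq, inverse = np.unique(node_ids, return_inverse=True)  # unique() also sorts
    gathered = features[uniq]            # one near-sequential pass over memory
    return gathered[inverse]             # scatter rows back to the sampled order

rng = np.random.default_rng(0)
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2], 4: [0]}
features = np.random.rand(5, 8).astype(np.float32)
nbrs = sample_neighbors(adj, seeds=[0, 2, 4], fanout=2, rng=rng)
print(gather_locality_aware(features, nbrs).shape)
```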
Video, as a key driver in the global explosion of digital information, can create tremendous benefits for human society. Governments and enterprises are deploying innumerable cameras for a variety of applications, e.g., law enforcement, emergency management, traffic control, and security surveillance, all facilitated by video analytics (VA). This trend is spurred by the rapid advancement of deep learning (DL), which enables more precise models for object classification, detection, and tracking. Meanwhile, with the proliferation of Internet-connected devices, massive amounts of data are generated daily, overwhelming the cloud. Edge computing, an emerging paradigm that moves workloads and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new intersection, edge video analytics (EVA), begins to attract widespread attention. Nevertheless, only a few loosely-related surveys exist on this topic. A dedicated venue for collecting and summarizing the latest advances of EVA is highly desired by the community. Besides, the basic concepts of EVA (e.g., definition, architectures, etc.) are ambiguous and neglected by these surveys due to the rapid development of this domain. A thorough clarification is needed to facilitate a consensus on these concepts. To fill in these gaps, we conduct a comprehensive survey of the recent efforts on EVA. In this paper, we first review the fundamentals of edge computing, followed by an overview of VA. The EVA system and its enabling techniques are discussed next. In addition, we introduce prevalent frameworks and datasets to aid future researchers in the development of EVA systems. Finally, we discuss existing challenges and foresee future research directions. We believe this survey will help readers comprehend the relationship between VA and edge computing, and spark new ideas on EVA.
Graph neural networks (GNNs) have received great attention due to their success in various graph-related learning tasks. Several GNN frameworks have then been developed for fast and easy implementation of GNN models. Despite their popularity, they are not well documented, and their implementations and system performance have not been well understood. In particular, unlike the traditional GNNs that are trained based on the entire graph in a full-batch manner, recent GNNs have been developed with different graph sampling techniques for mini-batch training of GNNs on large graphs. While they improve the scalability, their training times still depend on the implementations in the frameworks as sampling and its associated operations can introduce non-negligible overhead and computational cost. In addition, it is unknown how much the frameworks are 'eco-friendly' from a green computing perspective. In this paper, we provide an in-depth study of two mainstream GNN frameworks along with three state-of-the-art GNNs to analyze their performance in terms of runtime and power/energy consumption. We conduct extensive benchmark experiments at several different levels and present detailed analysis results and observations, which could be helpful for further improvement and optimization.
Subgraph-based graph representation learning (SGRL) has recently been proposed to deal with some fundamental challenges encountered by canonical graph neural networks (GNNs), and has demonstrated advantages in many important data science applications such as link, relation, and motif prediction. However, current SGRL approaches suffer from scalability issues since they require extracting subgraphs for each training or testing query. Recent solutions that scale up canonical GNNs may not apply to SGRL. Here, we propose a novel framework, SUREL, for scalable SGRL by co-designing the learning algorithm and its system support. SUREL adopts walk-based decomposition of subgraphs and reuses the walks to form subgraphs, which substantially reduces the redundancy of subgraph extraction and supports parallel computation. Experiments over six homogeneous, heterogeneous, and higher-order graphs with millions of nodes and edges demonstrate the effectiveness and scalability of SUREL. In particular, compared with SGRL baselines, SUREL achieves a 10$\times$ speed-up with comparable or even better prediction performance; compared with canonical GNNs, SUREL achieves a 50% improvement in prediction accuracy.
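A hedged sketch of the walk-based decomposition idea, in our own simplified variant rather than SUREL's implementation: each node pre-samples a small set of short walks once, and a query such as a candidate link then reuses the stored walks of its endpoints instead of extracting a fresh subgraph.

```python
import random

def presample_walks(adj, num_walks=4, walk_len=3, seed=0):
    """Pre-sample a fixed set of short random walks per node, done once up front."""
    rng = random.Random(seed)
    walks = {}
    for node in adj:
        node_walks = []
        for _ in range(num_walks):
            walk, cur = [node], node
            for _ in range(walk_len):
                if not adj[cur]:
                    break
                cur = rng.choice(adj[cur])
                walk.append(cur)
            node_walks.append(walk)
        walks[node] = node_walks
    return walks

def query_subgraph(walks, u, v):
    """Join the pre-sampled walks of the two query nodes; no new sampling needed."""
    return {n for w in walks[u] + walks[v] for n in w}

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
walks = presample_walks(adj)               # done once, reused by every query
print(query_subgraph(walks, 0, 4))         # support set for the candidate link (0, 4)
```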
Deep learning based recommendation models (DLRM) are widely used in several business critical applications. Training such recommendation models efficiently is challenging primarily because they consist of billions of embedding-based parameters which are often stored remotely, leading to significant overheads from embedding access. By profiling existing DLRM training, we observe that only 8.5% of the iteration time is spent in the forward/backward pass while the remaining time is spent on embedding and model synchronization. Our key insight in this paper is that access to embeddings has a specific structure and pattern which can be used to accelerate training. We observe that embedding accesses are heavily skewed, with almost 1% of embeddings representing more than 92% of total accesses. Further, we observe that during training we can look ahead at future batches to determine exactly which embeddings will be needed at what iteration in the future. Based on these insights, we propose Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation. We design an Oracle Cacher, a new system component which uses our lookahead algorithm to generate optimal cache update decisions and provide strong consistency guarantees. Our experiments using three datasets and two models show that our approach provides a speedup of up to 6.2x compared to state-of-the-art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.
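The sketch below illustrates the lookahead idea with hypothetical names; it is not Bagpipe's actual Oracle Cacher. Because the loader can peek a few batches ahead, every embedding id's next use is known, so misses can be prefetched early and ids with no use inside the window can be evicted.

```python
def plan_cache_ops(batches, window=4):
    """Yield (batch, ids_to_prefetch, ids_to_evict) for each training step."""
    cache = set()
    for step, batch in enumerate(batches):
        future_ids = set().union(*batches[step:step + window])   # ids needed soon
        prefetch = set(batch) - cache                            # misses for this step
        evict = {i for i in cache if i not in future_ids}        # dead within the window
        cache = (cache | prefetch) - evict
        yield batch, prefetch, evict

# toy embedding-id batches; in real traces roughly 1% of ids dominate the accesses
batches = [{1, 2, 7}, {1, 3}, {1, 2, 9}, {1, 4}, {1, 2}]
for batch, prefetch, evict in plan_cache_ops(batches):
    print(sorted(batch), "prefetch:", sorted(prefetch), "evict:", sorted(evict))
```

In a real system the prefetch set would be fetched asynchronously several iterations before it is needed, overlapping remote embedding access with the forward/backward computation of earlier batches.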