In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. On the first glance spectral clustering appears slightly mysterious, and it is not obvious to see why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
translated by 谷歌翻译
马尔可夫链是一类概率模型,在定量科学中已广泛应用。这部分是由于它们的多功能性,但是可以通过分析探测的便利性使其更加复杂。本教程为马尔可夫连锁店提供了深入的介绍,并探索了它们与图形和随机步行的联系。我们利用从线性代数和图形论的工具来描述不同类型的马尔可夫链的过渡矩阵,特别着眼于探索与这些矩阵相对应的特征值和特征向量的属性。提出的结果与机器学习和数据挖掘中的许多方法有关,我们在各个阶段描述了这些方法。本文并没有本身就成为一项新颖的学术研究,而是提出了一些已知结果的集合以及一些新概念。此外,该教程的重点是向读者提供直觉,而不是正式的理解,并且仅假定对线性代数和概率理论的概念的基本曝光。因此,来自各种学科的学生和研究人员可以访问它。
translated by 谷歌翻译
Graph clustering is a fundamental problem in unsupervised learning, with numerous applications in computer science and in analysing real-world data. In many real-world applications, we find that the clusters have a significant high-level structure. This is often overlooked in the design and analysis of graph clustering algorithms which make strong simplifying assumptions about the structure of the graph. This thesis addresses the natural question of whether the structure of clusters can be learned efficiently and describes four new algorithmic results for learning such structure in graphs and hypergraphs. All of the presented theoretical results are extensively evaluated on both synthetic and real-word datasets of different domains, including image classification and segmentation, migration networks, co-authorship networks, and natural language processing. These experimental results demonstrate that the newly developed algorithms are practical, effective, and immediately applicable for learning the structure of clusters in real-world data.
translated by 谷歌翻译
最近有一项激烈的活动在嵌入非常高维和非线性数据结构的嵌入中,其中大部分在数据科学和机器学习文献中。我们分四部分调查这项活动。在第一部分中,我们涵盖了非线性方法,例如主曲线,多维缩放,局部线性方法,ISOMAP,基于图形的方法和扩散映射,基于内核的方法和随机投影。第二部分与拓扑嵌入方法有关,特别是将拓扑特性映射到持久图和映射器算法中。具有巨大增长的另一种类型的数据集是非常高维网络数据。第三部分中考虑的任务是如何将此类数据嵌入中等维度的向量空间中,以使数据适合传统技术,例如群集和分类技术。可以说,这是算法机器学习方法与统计建模(所谓的随机块建模)之间的对比度。在论文中,我们讨论了两种方法的利弊。调查的最后一部分涉及嵌入$ \ mathbb {r}^ 2 $,即可视化中。提出了三种方法:基于第一部分,第二和第三部分中的方法,$ t $ -sne,UMAP和大节。在两个模拟数据集上进行了说明和比较。一个由嘈杂的ranunculoid曲线组成的三胞胎,另一个由随机块模型和两种类型的节点产生的复杂性的网络组成。
translated by 谷歌翻译
随机块模型(SBM)是一个随机图模型,其连接不同的顶点组不同。它被广泛用作研究聚类和社区检测的规范模型,并提供了肥沃的基础来研究组合统计和更普遍的数据科学中出现的信息理论和计算权衡。该专着调查了最近在SBM中建立社区检测的基本限制的最新发展,无论是在信息理论和计算方案方面,以及各种恢复要求,例如精确,部分和弱恢复。讨论的主要结果是在Chernoff-Hellinger阈值中进行精确恢复的相转换,Kesten-Stigum阈值弱恢复的相变,最佳的SNR - 单位信息折衷的部分恢复以及信息理论和信息理论之间的差距计算阈值。该专着给出了在寻求限制时开发的主要算法的原则推导,特别是通过绘制绘制,半定义编程,(线性化)信念传播,经典/非背带频谱和图形供电。还讨论了其他块模型的扩展,例如几何模型和一些开放问题。
translated by 谷歌翻译
The stochastic block model (SBM) is a random graph model with planted clusters. It is widely employed as a canonical model to study clustering and community detection, and provides generally a fertile ground to study the statistical and computational tradeoffs that arise in network and data sciences.This note surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial and weak recovery (a.k.a., detection). The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial recovery, the learning of the SBM parameters and the gap between information-theoretic and computational thresholds.The note also covers some of the algorithms developed in the quest of achieving the limits, in particular two-round algorithms via graph-splitting, semi-definite programming, linearized belief propagation, classical and nonbacktracking spectral methods. A few open problems are also discussed.
translated by 谷歌翻译
这是针对非线性维度和特征提取方法的教程和调查论文,该方法基于数据图的拉普拉斯语。我们首先介绍邻接矩阵,拉普拉斯矩阵的定义和拉普拉斯主义的解释。然后,我们涵盖图形和光谱聚类的切割,该谱图应用于数据子空间。解释了Laplacian征收及其样本外扩展的不同优化变体。此后,我们将保留投影的局部性及其内核变体作为拉普拉斯征本征的线性特殊案例。然后解释了图嵌入的版本,这些版本是Laplacian eigenmap和局部保留投影的广义版本。最后,引入了扩散图,这是基于数据图和随机步行的方法。
translated by 谷歌翻译
In this work we study statistical properties of graph-based algorithms for multi-manifold clustering (MMC). In MMC the goal is to retrieve the multi-manifold structure underlying a given Euclidean data set when this one is assumed to be obtained by sampling a distribution on a union of manifolds $\mathcal{M} = \mathcal{M}_1 \cup\dots \cup \mathcal{M}_N$ that may intersect with each other and that may have different dimensions. We investigate sufficient conditions that similarity graphs on data sets must satisfy in order for their corresponding graph Laplacians to capture the right geometric information to solve the MMC problem. Precisely, we provide high probability error bounds for the spectral approximation of a tensorized Laplacian on $\mathcal{M}$ with a suitable graph Laplacian built from the observations; the recovered tensorized Laplacian contains all geometric information of all the individual underlying manifolds. We provide an example of a family of similarity graphs, which we call annular proximity graphs with angle constraints, satisfying these sufficient conditions. We contrast our family of graphs with other constructions in the literature based on the alignment of tangent planes. Extensive numerical experiments expand the insights that our theory provides on the MMC problem.
translated by 谷歌翻译
Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore, been a subject of extensive research efforts. We break the quadratic barrier and obtain $\textit{subquadratic}$ time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from $\textit{weighted vertex}$ and $\textit{weighted edge sampling}$ on kernel graphs, $\textit{simulating random walks}$ on kernel graphs, and $\textit{importance sampling}$ on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in $\textit{sublinear}$ (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.
translated by 谷歌翻译
由于其数值益处增加及其坚实的数学背景,光谱聚类方法的非线性重构近来的关注。我们在$ p $ -norm中提出了一种新的直接多道谱聚类算法,以$ p \ in(1,2] $。计算图表的多个特征向量的问题$ p $ -laplacian,标准的非线性概括Graph Laplacian,被重用作为Grassmann歧管的无约束最小化问题。$ P $的价值以伪连续的方式减少,促进对应于最佳图形的稀疏解决方案载体作为$ P $接近。监测单调减少平衡图削减了我们从$ P $ -Levels获得的最佳可用解决方案的保证。我们展示了我们算法在各种人工测试案件中的算法的有效性和准确性。我们的数值和比较结果具有各种状态-Art聚类方法表明,所提出的方法在均衡的图形剪切度量和标签分配的准确性方面取得高质量的集群。此外,我们进行S面部图像和手写字符分类的束缚,以展示现实数据集中的适用性。
translated by 谷歌翻译
We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task? And how can clustering results be validated? Connectivity-based versus prototype-based approaches are reflected in the context of several popular methods: single-linkage, spectral embedding, k-means, and Gaussian mixtures are discussed as well as the density-based protocols (H)DBSCAN, Jarvis-Patrick, CommonNN, and density-peaks.
translated by 谷歌翻译
在机器学习中调用多种假设需要了解歧管的几何形状和维度,理论决定了需要多少样本。但是,在应用程序数据中,采样可能不均匀,歧管属性是未知的,并且(可能)非纯化;这意味着社区必须适应本地结构。我们介绍了一种用于推断相似性内核提供数据的自适应邻域的算法。从本地保守的邻域(Gabriel)图开始,我们根据加权对应物进行迭代率稀疏。在每个步骤中,线性程序在全球范围内产生最小的社区,并且体积统计数据揭示了邻居离群值可能违反了歧管几何形状。我们将自适应邻域应用于非线性维度降低,地球计算和维度估计。与标准算法的比较,例如使用K-Nearest邻居,证明了它们的实用性。
translated by 谷歌翻译
Pre-publication draft of a book to be published byMorgan & Claypool publishers. Unedited version released with permission. All relevant copyrights held by the author and publisher extend to this pre-publication draft.
translated by 谷歌翻译
我们提出了一种谱聚类算法,用于分析多元极端的依赖性结构。更具体地,我们专注于多元极端的渐近依赖性,其特征在于极值理论中的角度或光谱测量。我们的工作研究了基于从极端样本构成的随机价值的随机价值的频谱聚类的理论性能,即,半径超过大阈值的随机载体的角部分。特别是,我们得出了线性因子模型产生的极端的渐近分布,并证明,在某些条件下,光谱聚类可以一致地识别该模型中产生的极端簇。利用这一结果,我们提出了一种简单的一致估计策略,用于学习角度测量。我们的理论发现与说明我们方法的有限样本性能的数值实验相辅相成。
translated by 谷歌翻译
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provided rates of convergence, in terms of the latent parameters, which includes the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
translated by 谷歌翻译
In applications such as social, energy, transportation, sensor, and neuronal networks, high-dimensional data naturally reside on the vertices of weighted graphs. The emerging field of signal processing on graphs merges algebraic and spectral graph theoretic concepts with computational harmonic analysis to process such signals on graphs. In this tutorial overview, we outline the main challenges of the area, discuss different ways to define graph spectral domains, which are the analogues to the classical frequency domain, and highlight the importance of incorporating the irregular structures of graph data domains when processing signals on graphs. We then review methods to generalize fundamental operations such as filtering, translation, modulation, dilation, and downsampling to the graph setting, and survey the localized, multiscale transforms that have been proposed to efficiently extract information from high-dimensional data on graphs. We conclude with a brief discussion of open issues and possible extensions.
translated by 谷歌翻译
Despite many empirical successes of spectral clustering methodsalgorithms that cluster points using eigenvectors of matrices derived from the data-there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems.
translated by 谷歌翻译
我们提出了一个多路相似性的理论框架,与将实价数据建模为通过光谱嵌入聚类的超图。对于基于图形的光谱群集,通常,通过使用内核函数对成对相似性进行建模,将实值数据模拟为图。这是因为内核函数与图形切割具有理论连接。对于使用多路相似性比成对相似性更合适的问题,自然地将模型作为超图,即图形的概括。然而,尽管剪切幅度进行了充分研究,但尚未建立基于HyperGraph Cut的框架来模拟多路相似性。在本文中,我们通过利用内核函数的理论基础来制定多路相似性。我们展示了我们的配方和超图之间的理论联系,以两种方式削减了加权内核$ k $ -MEANS和热核,我们证明了我们的配方合理性。我们还为光谱聚类提供了快速算法。我们的算法在经验上比现有图和其他启发式建模方法显示出更好的性能。
translated by 谷歌翻译
光谱聚类在从业者和理论家中都很受欢迎。尽管对光谱聚类的性能保证有充分的了解,但最近的研究集中于在群集中执行``公平'',要求它们在分类敏感的节点属性方面必须``平衡''人口中的种族分布)。在本文中,我们考虑了一个设置,其中敏感属性间接表现在辅助\ textit {表示图}中,而不是直接观察到。该图指定了可以相对于敏感属性互相表示的节点对,除了通常的\ textit {相似性图}外,还可以观察到。我们的目标是在相似性图中找到簇,同时尊重由表示图编码的新个人公平性约束。我们为此任务开发了不均衡和归一化光谱聚类的变体,并在代表图诱导的种植分区模型下分析其性能。该模型同时使用节点的群集成员身份和表示图的结构来生成随机相似性图。据我们所知,这些是在个人级别的公平限制下受约束光谱聚类的第一个一致性结果。数值结果证实了我们的理论发现。
translated by 谷歌翻译
Research in Graph Signal Processing (GSP) aims to develop tools for processing data defined on irregular graph domains. In this paper we first provide an overview of core ideas in GSP and their connection to conventional digital signal processing, along with a brief historical perspective to highlight how concepts recently developed in GSP build on top of prior research in other areas. We then summarize recent advances in developing basic GSP tools, including methods for sampling, filtering or graph learning. Next, we review progress in several application areas using GSP, including processing and analysis of sensor network data, biological data, and applications to image processing and machine learning.
translated by 谷歌翻译