Given a large graph with few node labels, how can we (a) identify the mixed network-effect of the graph and (b) predict the unknown labels accurately and efficiently? This work proposes Network Effect Analysis (NEA) and UltraProp, which are based on two insights: (a) the network-effect (NE) insight: a graph can exhibit homophily, heterophily, both, or neither, in a label-wise manner, and (b) the neighbor-differentiation (ND) insight: neighbors exert different degrees of influence on the target node depending on the strength of their connections. NEA provides a statistical test to check whether a graph exhibits network-effect or not, and surprisingly discovers the absence of NE in many real-world graphs known to have heterophily. UltraProp solves the node classification problem with notable advantages: (a) Accurate, thanks to the NE and ND insights; (b) Explainable, precisely estimating the compatibility matrix; (c) Scalable, being linear in the input size and handling graphs with millions of nodes; and (d) Principled, with a closed-form formula and theoretical guarantees. Applied to eight real-world graph datasets, UltraProp outperforms top competitors in terms of accuracy and run time, requiring only stock CPU servers. On a large real-world graph with 1.6M nodes and 22.3M edges, UltraProp achieves a more than 9x speedup (12 minutes vs. 2 hours) over most competitors.
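To illustrate the kind of propagation a compatibility matrix enables, here is a minimal, hypothetical NumPy sketch of heterophily-aware label propagation; it is not UltraProp itself (the function name, damping scheme, and update rule are assumptions), but it shows how a compatibility matrix H lets beliefs flow correctly even on a heterophilous graph:

```python
import numpy as np

def propagate(A, seed_beliefs, H, num_iters=10, damping=0.5):
    """Heterophily-aware label propagation with a compatibility matrix H.

    A: (n, n) weighted adjacency; edge weights model connection strength,
       echoing the neighbor-differentiation idea.
    seed_beliefs: (n, k) prior class beliefs; zero rows for unlabeled nodes.
    H: (k, k) compatibility matrix; H[i, j] ~ affinity of class i for
       neighboring class j (identity = homophily, off-diagonal = heterophily).
    """
    B = seed_beliefs.copy()
    for _ in range(num_iters):
        # aggregate neighbor evidence, routed through the compatibility matrix
        B = seed_beliefs + damping * (A @ B @ H)
    return B.argmax(axis=1)
```

With H = [[0, 1], [1, 0]] (pure heterophily) and a single seed on a 4-node path, the propagation recovers the alternating labels that plain homophilous propagation would get wrong.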
Given a graph learning task, such as link prediction, on a new graph dataset, how can we automatically select the best method and its hyperparameters (collectively called a model)? Model selection for graph learning has been largely ad hoc. A typical approach is to apply popular methods to new datasets, but this is often suboptimal. On the other hand, systematically comparing models on the new graph quickly becomes too costly, or even impractical. In this work, we develop the first meta-learning approach for automatic graph machine learning, called AutoGML, which utilizes the prior performance of a large body of existing methods on benchmark graph datasets and carries this prior experience over to automatically select an effective model for the new graph, without any model training or evaluation. To capture the similarity across graphs from different domains, we introduce specialized meta-graph features that quantify the structural characteristics of a graph. We then design a meta-graph that represents the relations between models and graphs, and develop a graph meta-learner operating on the meta-graph, which estimates the relevance of each model to different graphs. Through extensive experiments, we show that using AutoGML to select a method for a new graph significantly outperforms consistently applying popular methods, as well as several existing meta-learners, while being extremely fast at test time.
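As a toy illustration of selecting a model from prior performance, the following nearest-neighbor stand-in matches a new graph to its most similar benchmark graphs in meta-feature space (this is far simpler than AutoGML's meta-graph learner; all names and the distance choice are assumptions):

```python
import numpy as np

def select_model(new_graph_features, bench_features, perf_matrix, model_names, k=3):
    """Pick a model for a new graph from prior benchmark performance.

    bench_features: (d, f) meta-features of d benchmark graphs.
    perf_matrix:    (d, m) prior performance of m models on those graphs.
    """
    # distance from the new graph to each benchmark graph in meta-feature space
    dists = np.linalg.norm(bench_features - new_graph_features, axis=1)
    nearest = np.argsort(dists)[:k]
    # average each model's prior performance over the k most similar graphs
    est = perf_matrix[nearest].mean(axis=0)
    return model_names[int(est.argmax())]
```

The key property shared with the meta-learning approach is that no model is trained or evaluated on the new graph: only meta-features are computed.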
Which target labels are most effective for graph neural network (GNN) training? In some applications where GNNs excel, such as drug design or fraud detection, labeling new instances is expensive. We develop a data-efficient active sampling framework, ScatterSample, to train GNNs in an active learning setting. ScatterSample employs a sampling module called DiverseUncertainty to collect instances with large uncertainty from diverse regions of the sample space for labeling. To ensure diversity among the selected nodes, DiverseUncertainty clusters the high-uncertainty nodes and selects representative nodes from each cluster. Rigorous theoretical analysis further supports the advantage of our ScatterSample algorithm over standard active sampling methods that aim to simply maximize uncertainty rather than diversify the samples. In particular, we show that ScatterSample is able to efficiently reduce the model uncertainty over the whole sample space. Our experiments on five datasets show that ScatterSample significantly outperforms other GNN active learning baselines; in particular, it reduces the sampling cost by up to 50% while achieving the same test accuracy.
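The cluster-then-pick-representatives idea can be sketched as follows, with greedy farthest-point selection standing in for the paper's clustering step (function name and parameters are assumptions):

```python
import numpy as np

def diverse_uncertainty_sample(embeddings, uncertainty, budget, pool_factor=2):
    """Diversified uncertainty sampling in the spirit of DiverseUncertainty:
    restrict to high-uncertainty candidates, then spread selections across
    the embedding space instead of taking the top-uncertainty nodes alone."""
    # 1) keep only the most uncertain candidates as the selection pool
    pool = np.argsort(-uncertainty)[: budget * pool_factor]
    # 2) greedily pick candidates that are far apart in embedding space
    chosen = [int(pool[0])]
    while len(chosen) < budget:
        d = np.min(
            [np.linalg.norm(embeddings[pool] - embeddings[c], axis=1) for c in chosen],
            axis=0,
        )
        chosen.append(int(pool[int(np.argmax(d))]))
    return chosen
```

On two well-separated clusters this picks one node from each, whereas pure top-uncertainty selection could spend the whole budget in a single region.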
Given entities and their interactions in web data, which may have occurred at different points in time, how can we find communities of entities and track their evolution? In this paper, we approach this important task from the perspective of graph clustering. Recently, state-of-the-art clustering performance in various domains has been achieved by deep clustering methods. In particular, deep graph clustering (DGC) methods have successfully extended deep clustering to graph-structured data by learning node representations and cluster assignments in a joint optimization framework. Despite differences in modeling choices (e.g., encoder architectures), existing DGC methods are mainly based on autoencoders and use the same clustering objective with relatively minor adaptations. Also, while many real-world graphs are dynamic, previous DGC methods considered only static graphs. In this work, we develop CGC, a novel end-to-end graph clustering framework that differs fundamentally from existing methods. CGC learns node embeddings and cluster assignments in a contrastive graph learning framework, where positive and negative samples are carefully selected in a multi-level scheme so that they reflect the hierarchical community structure and network homophily. We further extend CGC to time-evolving data, where temporal graph clustering is performed in an incremental learning fashion, with the ability to detect change points. Extensive evaluation on real-world graphs demonstrates that the proposed CGC consistently outperforms existing methods.
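A minimal InfoNCE-style loss illustrates the kind of contrastive objective such a framework builds on (this is a generic sketch with a single positive per anchor, not CGC's multi-level positive/negative selection):

```python
import numpy as np

def contrastive_loss(z, pos_idx, temperature=0.5):
    """InfoNCE-style loss: pull each anchor toward its designated positive
    and push it away from all other samples.

    z:       (n, d) embeddings.
    pos_idx: (n,) index of the positive sample for each anchor.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine-similarity space
    sim = np.exp(z @ z.T / temperature)
    np.fill_diagonal(sim, 0.0)                        # exclude self-pairs
    pos = sim[np.arange(len(z)), pos_idx]
    return float(-np.log(pos / sim.sum(axis=1)).mean())
```

The loss is small when designated positives are already nearby in embedding space, which is exactly what careful positive selection (e.g., by community structure) is meant to exploit.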
Given a cardiac-arrest patient being monitored in the ICU (intensive care unit) for brain activity, how can we predict their health outcome as early as possible? Early decision-making is critical in many applications; for instance, monitoring patients may assist in early intervention and improved care. On the other hand, early prediction on EEG data poses several challenges: (i) the earliness-accuracy trade-off: observing more data often improves accuracy but sacrifices earliness; (ii) large-scale (for training) and streaming (for online decision-making) data processing; and (iii) multi-variate (due to multiple electrodes) and multi-length (due to patients' varying lengths of stay) time series. Motivated by this real-world application, we fuse the savings gained from an early prediction and the cost incurred by misclassification into a unified, domain-specific target called benefit. Unifying these two quantities allows us to directly estimate a single target (i.e., the benefit) and, importantly, dictates precisely when to output a prediction: when the benefit estimate becomes positive. BeneFitter (a) is efficient and fast, with training time linear in the number of input sequences, and can operate in real time for decision-making; (b) can handle multi-variate and variable-length time series, suitable for patient data; and (c) is effective, providing up to 2x time savings with equal or better accuracy compared to competitors.
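The benefit-driven decision rule can be sketched as follows. This is a hypothetical instantiation of the benefit (earliness savings minus expected misclassification cost, mapped to the same unit); the actual method estimates the benefit directly from data:

```python
def benefit(t, horizon, savings_per_step, p_error, miscls_cost):
    """Unified benefit at time t (hypothetical form): savings from deciding
    (horizon - t) steps early, minus the expected misclassification cost."""
    return savings_per_step * (horizon - t) - p_error * miscls_cost

def decide(benefit_estimates):
    """Scan a stream of per-time-step benefit estimates and output a
    prediction as soon as the estimated benefit turns positive."""
    for t, b in enumerate(benefit_estimates):
        if b > 0:
            return t      # predict now: expected savings outweigh the risk
    return None           # never confident enough: observe the full sequence
```

Early on, error probability is high and the benefit is negative; as more of the EEG is observed, the error term shrinks until the benefit crosses zero, and that crossing is the prediction time.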
In a cloud of M-dimensional data points, how can we spot, as well as rank, both single-point and group anomalies? We are the first to generalize anomaly detection along two dimensions: the first is that we handle point anomalies as well as group anomalies under a unified view; we refer to them as generalized anomalies. The second is that we not only detect, but also rank, anomalies in suspiciousness order. Anomaly detection and ranking have many applications: for example, in the EEG recordings of an epileptic patient, an anomaly may indicate a seizure; in computer network traffic data, it may signify a power failure or a DoS/DDoS attack. We first set out some reasonable axioms; surprisingly, none of the earlier methods pass all of them. Our main contribution is the gen2Out algorithm, which has the following desirable properties: (a) Principled, with sound anomaly scoring that obeys the axioms for detectors; (b) Doubly general, in that it detects as well as ranks both point and group anomalies; (c) Scalable, being fast and linear in the input size; and (d) Effective, with experiments on real-world epilepsy recordings (200GB) demonstrating the effectiveness of gen2Out, as confirmed by clinicians. Experiments on 27 real-world benchmark datasets show that gen2Out detects ground-truth groups, matches or outperforms baseline algorithms on point-anomaly detection with no competition on group anomalies, and takes only about 2 minutes on a stock machine.
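The unified point/group view can be caricatured with a kNN-distance score plus proximity-based grouping of flagged points. This is a simple stand-in for illustration only; gen2Out itself builds on isolation-forest-style scores and an axiomatic scoring function, and the threshold and grouping rules below are assumptions:

```python
import numpy as np

def rank_anomalies(X, k=4, threshold=None, group_radius=1.0):
    """Score every point (distance to its k-th nearest neighbor), flag the
    high-scoring ones, and merge nearby flagged points into group anomalies.
    Returns (scores, groups); singleton groups are point anomalies."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    D.sort(axis=1)
    scores = D[:, k]                       # D[:, 0] is the self-distance 0
    if threshold is None:
        threshold = 3 * np.median(scores)  # median is robust to the outliers
    flagged = np.where(scores > threshold)[0]
    groups = []
    for i in flagged:                      # greedy merge within group_radius
        for g in groups:
            if any(np.linalg.norm(X[i] - X[j]) <= group_radius for j in g):
                g.append(int(i))
                break
        else:
            groups.append([int(i)])
    return scores, groups
```

On data with a dense inlier grid, one isolated point, and one far-away micro-cluster, this returns one singleton group (the point anomaly) and one three-member group (the group anomaly), and the scores themselves induce a suspiciousness ranking.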
Given sensor readings over time from a power grid, how can we accurately detect when an anomaly occurs? A key part of achieving this goal is to use the grid's sensor network to detect, in real time, any unusual events on the grid, whether natural faults or malicious. Existing bad-data detectors in industry lack the sophistication to robustly detect broad types of anomalies, especially those due to emerging cyberattacks, since they operate on a single measurement snapshot of the grid at a time. Newer ML methods are more widely applicable, but typically do not account for the effect of topology changes on sensor measurements and thus cannot accommodate the regular topology adjustments present in historical data. Hence, we propose DynWatch, a domain-knowledge-based and topology-aware algorithm for anomaly detection using sensors on a dynamic grid. Our approach is accurate, outperforming existing methods by over 20% (F-measure) in experiments; and fast, running in under 1.7 ms on average per sensor per time tick on a 60K+-branch case using a laptop, and scaling linearly with the size of the graph.
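A sketch of the topology-aware idea: weight historical snapshots by how similar their grid topology was to the current one, then score the current reading against the weighted history. The kernel weighting and robust z-score below are illustrative assumptions, not DynWatch's actual distance or detector:

```python
import numpy as np

def anomaly_score(current, history, topo_dists, bandwidth=1.0):
    """Topology-aware anomaly score per sensor.

    current:    (s,) current sensor readings.
    history:    (t, s) historical sensor snapshots.
    topo_dists: (t,) distance between each snapshot's topology and today's.
    """
    w = np.exp(-np.asarray(topo_dists) / bandwidth)  # similar topologies count more
    w = w / w.sum()
    mu = (w[:, None] * history).sum(axis=0)          # weighted historical mean
    var = (w[:, None] * (history - mu) ** 2).sum(axis=0)
    return np.abs(current - mu) / np.sqrt(var + 1e-9)  # z-score per sensor
```

Snapshots taken under a very different topology get an exponentially small weight, so routine topology adjustments in the history do not pollute the baseline the detector compares against.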
High-content imaging assays can capture rich phenotypic response data for large sets of compound treatments, aiding in the characterization and discovery of novel drugs. However, extracting representative features from high-content images that can capture subtle nuances in phenotypes remains challenging. The lack of high-quality labels makes it difficult to achieve satisfactory results with supervised deep learning. Self-supervised learning methods, which learn from automatically generated labels, have shown great success on natural images and offer an attractive alternative for microscopy images as well. However, we find that self-supervised learning techniques underperform on high-content imaging assays. One challenge is the undesirable domain shifts present in the data, known as batch effects, which may be caused by biological noise or uncontrolled experimental conditions. To this end, we introduce Cross-Domain Consistency Learning (CDCL), a novel approach that is able to learn in the presence of batch effects. CDCL enforces the learning of biological similarities while disregarding undesirable batch-specific signals, which leads to more useful and versatile representations. These features are organised according to their morphological changes and are more useful for downstream tasks, such as distinguishing treatments and modes of action.
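For contrast, a classic per-batch centering baseline shows what "removing batch-specific signal" means in its simplest form (this is a standard correction baseline, not CDCL, which instead learns batch-invariant features end-to-end):

```python
import numpy as np

def batch_center(features, batch_ids):
    """Remove the per-batch mean from each feature: a crude way to suppress
    an additive batch effect while preserving within-batch geometry."""
    out = features.astype(float).copy()
    for b in np.unique(batch_ids):
        mask = batch_ids == b
        out[mask] -= out[mask].mean(axis=0)  # subtract the batch-specific offset
    return out
```

After centering, two batches that differ only by an offset become directly comparable; the limitation, which motivates learned approaches, is that real batch effects are rarely purely additive.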
While risk-neutral reinforcement learning has shown experimental success in a number of applications, it is well known to be non-robust with respect to noise and perturbations in the parameters of the system. For this reason, risk-sensitive reinforcement learning algorithms have been studied to introduce robustness and sample efficiency, and lead to better real-life performance. In this work, we introduce new model-free risk-sensitive reinforcement learning algorithms as variations of widely used Policy Gradient algorithms with similar implementation properties. In particular, we study the effect of exponential criteria on the risk-sensitivity of the policy of a reinforcement learning agent, and develop variants of the Monte Carlo Policy Gradient algorithm and the online (temporal-difference) Actor-Critic algorithm. Analytical results showcase that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
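The effect of an exponential criterion on a Monte Carlo policy-gradient estimator can be sketched via trajectory weights: under J = (1/beta) log E[exp(beta G)], each trajectory's score-function term is weighted by exp(beta G). The self-normalization below is an implementation convenience for the sketch, not the paper's exact estimator:

```python
import numpy as np

def risk_sensitive_pg_weights(returns, beta):
    """Normalized trajectory weights for a risk-sensitive policy gradient.
    beta < 0 is risk-averse (low returns weigh more), beta > 0 is
    risk-seeking, and beta -> 0 recovers the risk-neutral average."""
    g = np.asarray(returns, dtype=float)
    # shift before exponentiating; normalization cancels the constant
    w = np.exp(beta * (g - g.max()))
    return w / w.sum()
```

Intuitively, a risk-averse agent (beta < 0) over-weights its worst trajectories when updating the policy, which is the mechanism by which exponential criteria induce robustness.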
Hierarchical learning algorithms that gradually approximate a solution to a data-driven optimization problem are essential to decision-making systems, especially under limitations on time and computational resources. In this study, we introduce a general-purpose hierarchical learning architecture that is based on the progressive partitioning of a possibly multi-resolution data space. The optimal partition is gradually approximated by solving a sequence of optimization sub-problems that yield a sequence of partitions with an increasing number of subsets. We show that the solution of each optimization problem can be estimated online using gradient-free stochastic approximation updates. As a consequence, a function approximation problem can be defined within each subset of the partition and solved using the theory of two-timescale stochastic approximation algorithms. This simulates an annealing process and defines a robust and interpretable heuristic method to gradually increase the complexity of the learning architecture in a task-agnostic manner, giving emphasis to regions of the data space that are considered more important according to a predefined criterion. Finally, by imposing a tree structure on the progression of the partitions, we provide a means to incorporate potential multi-resolution structure of the data space into this approach, significantly reducing its complexity, while introducing hierarchical feature extraction properties similar to certain classes of deep learning architectures. Asymptotic convergence analysis and experimental results are provided for clustering, classification, and regression problems.
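The online, gradient-free partition update can be sketched as competitive learning: each sample nudges its nearest codevector toward it, a stochastic-approximation update requiring no gradients of a model. The sketch fixes the number of codevectors for brevity, whereas the architecture described above grows the partition progressively in an annealing-like fashion:

```python
import numpy as np

def online_partition(data, num_codes, lr=0.05, epochs=10, seed=0):
    """Learn a partition of the data space online: each codevector defines
    one subset (its nearest-neighbor cell), and samples move the winning
    codevector slightly toward them (a gradient-free SA update)."""
    rng = np.random.default_rng(seed)
    codes = data[rng.choice(len(data), size=num_codes, replace=False)].astype(float)
    for _ in range(epochs):
        for x in data:
            j = np.argmin(np.linalg.norm(codes - x, axis=1))  # winning subset
            codes[j] += lr * (x - codes[j])                    # move toward sample
    return codes
```

In the full architecture, a function approximator would then be fit within each cell, with the two-timescale scheme updating the partition slowly and the per-cell approximators quickly.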