ClueWeb22, the newest iteration of the ClueWeb line of datasets, provides 10 billion web pages accompanied by rich information. Its design was influenced by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, the ClueWeb22 corpus is larger, more varied, of higher quality, and aligned with the document distributions in commercial web search. Besides raw HTML, ClueWeb22 includes rich information about the web pages provided by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, parsed HTML structure information from a neural network parser, and pre-processed, cleaned document text to lower the barrier to entry. Many of these signals have been widely used in industry but are available to the research community for the first time at this scale.
A "heart attack", or myocardial infarction (MI), occurs when an artery supplying blood to the heart is abruptly occluded. The "gold standard" method for imaging MI is Cardiovascular Magnetic Resonance Imaging (MRI) with intravenously administered gadolinium-based contrast (late gadolinium enhancement). However, no "gold standard" fully automated method for the quantification of MI exists. In this work, we propose an end-to-end fully automatic system (MyI-Net) for the detection and quantification of MI in MRI images. This has the potential to reduce the uncertainty due to technical variability across labs and inherent problems of the data and labels. Our system consists of four processing stages designed to maintain the flow of information across scales. First, features from raw MRI images are generated using feature extractors built on ResNet and MobileNet architectures. This is followed by Atrous Spatial Pyramid Pooling (ASPP), which produces spatial information at different scales to preserve more image context. High-level features from ASPP and initial low-level features are concatenated at the third stage and then passed to the fourth stage, where spatial information is recovered via up-sampling to produce the final image segmentation into four classes: i) background, ii) heart muscle, iii) blood and iv) scar areas. The new models were compared with state-of-the-art models and manual quantification. Our models showed favorable performance in global segmentation and scar tissue detection relative to state-of-the-art work, including a four-fold better performance in matching scar pixels to contours produced by clinicians.
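The multi-scale idea behind the ASPP stage can be illustrated with a minimal one-dimensional sketch. This is not the MyI-Net implementation: the function names `atrous_conv1d` and `aspp_1d`, the dilation rates and the all-ones kernel are assumptions chosen for illustration, and a real ASPP module works on 2-D feature maps with learned weights.

```python
def atrous_conv1d(signal, kernel, rate):
    """Zero-padded 'same' 1-D cross-correlation with dilation `rate`."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * rate  # taps are spaced `rate` samples apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def aspp_1d(signal, kernel, rates=(1, 2, 4)):
    """Run the same kernel at several dilation rates and stack the branches."""
    return [atrous_conv1d(signal, kernel, r) for r in rates]


# Hypothetical input: a unit impulse in the middle of a 9-sample signal.
signal = [0, 0, 0, 0, 1, 0, 0, 0, 0]
branches = aspp_1d(signal, kernel=[1.0, 1.0, 1.0])
```

Each branch uses the same three-tap kernel, but a larger rate spreads the taps further apart, so the stacked output sees the input at several receptive-field sizes without any down-sampling.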
Neuromorphic systems require user-friendly software to support the design and optimization of experiments. In this work, we address this need by presenting our development of a machine learning-based modeling framework for the BrainScaleS-2 neuromorphic system. This work represents an improvement over previous efforts, which either focused on the matrix-multiplication mode of BrainScaleS-2 or lacked full automation. Our framework, called hxtorch.snn, enables the hardware-in-the-loop training of spiking neural networks within PyTorch, including support for automatic differentiation in a fully automated hardware experiment workflow. In addition, hxtorch.snn facilitates seamless transitions between emulating on hardware and simulating in software. We demonstrate the capabilities of hxtorch.snn on a classification task using the Yin-Yang dataset, employing a gradient-based approach with surrogate gradients and densely sampled membrane observations from the BrainScaleS-2 hardware system.
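The surrogate-gradient idea mentioned above can be sketched in a few lines. This is not the hxtorch.snn API: the function names, the threshold and the steepness `beta` are assumptions, and the fast-sigmoid-style derivative is only one common choice of surrogate.

```python
def spike(v, threshold=1.0):
    """Forward pass: a hard threshold on the membrane potential."""
    return 1.0 if v >= threshold else 0.0


def surrogate_grad(v, threshold=1.0, beta=10.0):
    """Backward pass: a smooth stand-in for the derivative of the step.

    The true derivative is zero almost everywhere, so training replaces it
    with 1 / (1 + beta * |v - threshold|)^2, which peaks at the threshold.
    """
    return 1.0 / (1.0 + beta * abs(v - threshold)) ** 2


# Hypothetical membrane trace, e.g. densely sampled observations of one neuron.
membrane = [0.2, 0.9, 1.0, 1.1, 0.4]
spikes = [spike(v) for v in membrane]
grads = [surrogate_grad(v) for v in membrane]
```

In a hardware-in-the-loop setting the forward values would come from the measured system, while a function like `surrogate_grad` would only be used when back-propagating through the spike non-linearity.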
Generalization is an important attribute of machine learning models, particularly for those that are to be deployed in a medical context, where unreliable predictions can have real-world consequences. While the failure of models to generalize across datasets is typically attributed to a mismatch in the data distributions, performance gaps are often a consequence of biases in the 'ground-truth' label annotations. This is particularly important in the context of medical image segmentation of pathological structures (e.g. lesions), where the annotation process is much more subjective and is affected by a number of underlying factors, including the annotation protocol, rater education/experience, and clinical aims, among others. In this paper, we show that modeling annotation biases, rather than ignoring them, offers a promising way of accounting for differences in annotation style across datasets. To this end, we propose a generalized conditioning framework to (1) learn and account for different annotation styles across multiple datasets using a single model, (2) identify similar annotation styles across different datasets in order to permit their effective aggregation, and (3) fine-tune a fully trained model to a new annotation style with just a few samples. Next, we present an image-conditioning approach to model annotation styles that correlate with specific image features, potentially enabling detection biases to be more easily identified.
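Every year, millions of brain MRI scans are acquired in hospitals, a figure considerably larger than the size of any research dataset. The ability to analyze such scans could therefore transform neuroimaging research. Yet their potential remains untapped, since no automated algorithm can cope with the high variability of clinical acquisitions (MR contrast, resolution, orientation, etc.). Here we present SynthSeg+, an AI segmentation suite that enables, for the first time, robust analysis of heterogeneous clinical datasets. Specifically, in addition to whole-brain segmentation, SynthSeg+ also performs cortical parcellation, intracranial volume estimation, and automated detection of faulty segmentations (mainly caused by scans of very low quality). We demonstrate SynthSeg+ in seven experiments, including an aging study on 14,000 scans, where it accurately replicates atrophy patterns observed on data of much higher quality. SynthSeg+ is publicly released as a ready-to-use tool to unlock the potential of quantitative morphometry in wide-ranging settings.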
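Text rewriting with differential privacy (DP) provides concrete theoretical guarantees for protecting the privacy of individuals in textual documents. In practice, existing systems may lack the means to validate their privacy claims, leading to problems of transparency and reproducibility. We introduce DP-Rewrite, an open-source framework for differentially private text rewriting, which aims to solve these problems by being modular, extensible, and highly customizable. Our system incorporates a variety of downstream datasets, models, pre-training procedures, and evaluation metrics to provide a flexible way to conduct and validate private text rewriting research. To illustrate our software in practice, we provide a set of experiments as a case study on the ADePT DP text rewriting system, detecting a privacy leak in its pre-training approach. Our system is publicly available, and we hope that it will help the community make DP text rewriting research more accessible and transparent.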
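The effective representation, processing, analysis, and visualization of large-scale structured data over graphs have attracted a lot of attention. So far, most of the literature has focused on real-valued signals. However, signals are often sparse in the Fourier domain, and more informative and compact representations can be obtained from the complex envelope of their spectral components than from the original real-valued signals. Motivated by this fact, in this work we generalize graph convolutional neural networks (GCNs) to the complex domain, deriving the theory that allows complex-valued graph shift operators (GSOs) to be incorporated into the definition of graph filters (GFs) and complex-valued graph signals (GSs) to be processed. The developed theory can handle spatio-temporal complex network processes. We prove that complex-valued GCNs are stable with respect to perturbations of the underlying graph support, and we bound both the transfer error and the error propagated through multiple layers. We then apply complex GCNs to power grid state forecasting and to power grid cyber-attack detection and localization.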
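As a rough illustration of a graph filter built on a complex-valued shift operator, the sketch below applies a polynomial filter y = sum_k h_k S^k x using Python's built-in complex numbers. The three-node graph, the filter taps and the helper names `matvec` and `graph_filter` are assumptions made for this example and are not taken from the paper.

```python
def matvec(S, x):
    """Multiply a square matrix S (list of rows) by a vector x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]


def graph_filter(S, taps, x):
    """Apply the polynomial graph filter y = sum_k taps[k] * S^k x."""
    y = [0j] * len(x)
    shifted = list(x)  # S^0 x
    for h_k in taps:
        y = [y_i + h_k * z_i for y_i, z_i in zip(y, shifted)]
        shifted = matvec(S, shifted)  # advance to S^(k+1) x
    return y


# Hypothetical 3-node directed cycle; each edge applies a 90-degree phase shift.
S = [
    [0, 0, 1j],
    [1j, 0, 0],
    [0, 1j, 0],
]
taps = [1 + 0j, 0.5 - 0.5j]  # assumed complex filter coefficients h_0, h_1
x = [1 + 0j, 0j, 0j]         # complex graph signal: unit impulse at node 0

y = graph_filter(S, taps, x)
```

A graph convolutional layer in this spirit would apply such a filter per feature channel and follow it with a non-linearity suited to complex values; that part is omitted here.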
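In many clinical contexts, detecting all lesions is imperative for evaluating disease activity. Standard approaches pose lesion detection as a segmentation problem, despite the time-consuming nature of acquiring segmentation labels. In this paper, we present a lesion detection method that relies only on point labels. Our model, which is trained via heatmap regression, can detect a variable number of lesions in a probabilistic manner. In fact, our proposed post-processing method offers a reliable way of directly estimating the uncertainty of lesion existence. Experimental results on Gad lesion detection show that our point-based method performs competitively compared to training on expensive segmentation labels. Finally, our detection model provides a suitable pre-training for segmentation. When fine-tuning on only 17 segmentation samples, we achieve performance comparable to training with the full dataset.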
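Heatmap regression from point labels can be sketched as follows: each annotated point is rendered as a Gaussian blob to form the regression target, and detections are read back as local maxima. This is a simplified illustration rather than the authors' pipeline; the function names, `sigma`, the threshold and the peak-picking rule are all assumptions, and the peak value is only a crude stand-in for the existence probability the paper estimates.

```python
import math


def make_heatmap(points, height, width, sigma=1.5):
    """Render (row, col) point labels as the pixel-wise max of unit-height Gaussians."""
    heatmap = [[0.0] * width for _ in range(height)]
    for r0, c0 in points:
        for r in range(height):
            for c in range(width):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                value = math.exp(-d2 / (2 * sigma ** 2))
                heatmap[r][c] = max(heatmap[r][c], value)
    return heatmap


def detect_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for strict 8-neighbourhood maxima above threshold."""
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(height):
        for c in range(width):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v > n for n in neighbours):
                peaks.append((r, c, v))
    return peaks


# Hypothetical point annotations for two lesions in a 16 x 16 slice.
points = [(3, 4), (10, 12)]
target = make_heatmap(points, height=16, width=16)
peaks = detect_peaks(target)
```

In training, a network would regress `target` from the image; at inference, `detect_peaks` would run on the predicted heatmap, so the number of detections is not fixed in advance.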
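Discovering patient-specific imaging markers that are predictive of future disease outcomes can help us better understand individual-level heterogeneity in disease evolution. In fact, deep learning models that can provide data-driven personalized markers are much more likely to be adopted in medical practice. In this work, we demonstrate that data-driven biomarker discovery can be achieved through a counterfactual synthesis process. We show how a deep conditional generative model can be used to perturb local imaging features in baseline images that are pertinent to subject-specific future disease evolution, resulting in a counterfactual image that is expected to have a different future outcome. Candidate biomarkers therefore result from examining the set of features that are perturbed in this process. Through several experiments on a large-scale, multi-scanner, multi-center multiple sclerosis (MS) clinical trial magnetic resonance imaging (MRI) dataset of relapsing-remitting MS (RRMS) patients, we demonstrate that our model produces counterfactuals whose changes in imaging features reflect established clinical markers predictive of future MRI lesional activity at the population level. Additional qualitative results illustrate that our model has the potential to discover novel and subject-specific predictive markers of future activity.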
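We propose a novel action sequence planner based on the combination of affordance recognition and a neural forward model that predicts the effects of affordance execution. By performing affordance recognition on predicted futures, we avoid reliance on explicit affordance effect definitions for multi-step planning. Because the system learns affordance effects from experience data, it can foresee not just the canonical effects of an affordance but also situation-specific side effects. This allows the system to avoid planning failures caused by such non-canonical effects, and makes it possible to exploit non-canonical effects to realize a given goal. We evaluate the system in simulation on a set of test tasks that require consideration of both canonical and non-canonical affordance effects.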