The generation of drug-like molecules with high binding affinity to target proteins remains a difficult and resource-intensive task in drug discovery. Existing approaches primarily employ reinforcement learning, Markov sampling, or deep generative models guided by Gaussian processes, which can be prohibitively slow when generating molecules with high binding affinity calculated by computationally expensive physics-based methods. We present Latent Inceptionism on Molecules (LIMO), which significantly accelerates molecule generation with an Inceptionism-like technique. LIMO employs a variational-autoencoder-generated latent space and property prediction by two neural networks in sequence, enabling faster gradient-based reverse-optimization of molecular properties. Comprehensive experiments show that LIMO is competitive on benchmark tasks and achieves state-of-the-art performance on the novel task of generating drug-like compounds with high binding affinity, reaching nanomolar range against two protein targets. We corroborate these docking-based results with more accurate molecular-dynamics-based calculations of absolute binding free energy, and show that one of our generated drug-like compounds has a predicted $K_D$ (a measure of binding affinity) of $6 \cdot 10^{-14}$ M against the human estrogen receptor, well beyond the affinities of typical early-stage drug candidates and most FDA-approved drugs. Code is available at https://github.com/rose-stl-lab/limo.
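To make the mechanism concrete: because both the decoder and the property predictor are differentiable, optimizing a molecular property reduces to gradient ascent over the latent code, with no physics-based scoring inside the loop. A minimal PyTorch sketch of such latent-space reverse-optimization, assuming hypothetical `decoder` and `property_net` modules (not the paper's exact networks):

```python
import torch

def optimize_latent(decoder, property_net, z_dim=256, steps=1000, lr=0.1):
    """Gradient ascent on a predicted molecular property over a VAE latent space.

    decoder:      frozen VAE decoder, maps latent z -> molecule representation
    property_net: frozen property predictor on decoded molecules
    """
    z = torch.randn(1, z_dim, requires_grad=True)   # start from the prior
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mol = decoder(z)                 # differentiable decoding
        score = property_net(mol)        # e.g., predicted binding affinity
        loss = -score.mean()             # ascend the property
        loss.backward()
        opt.step()
    return z.detach()                    # decode z to obtain the final molecule
```

Each step is a single forward/backward pass through two small networks, which is where the claimed speed-up over docking-in-the-loop approaches comes from.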
This paper studies the quantization of heavy-tailed data in some fundamental statistical estimation problems, where the underlying distributions have bounded moments of some order. We propose to truncate and properly dither the data prior to a uniform quantization. Our main finding is that (near) minimax rates of estimation error are achievable merely from the quantized data produced by the proposed scheme. In particular, concrete results are worked out for covariance estimation, compressed sensing, and matrix completion, all showing that the quantization only slightly worsens the multiplicative factor. In addition, we study compressed sensing where both the covariates (i.e., sensing vectors) and responses are quantized. Under covariate quantization, although our recovery program is non-convex because the covariance matrix estimator lacks positive semi-definiteness, we prove that all local minimizers enjoy near-optimal error bounds. Moreover, via a concentration inequality for product processes and a covering argument, we establish a near-minimax uniform recovery guarantee for quantized compressed sensing with heavy-tailed noise.
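The proposed scheme itself is only a few lines: clip each sample at a threshold to tame the heavy tails, add uniform dither, then apply a uniform quantizer. A numpy sketch under that reading, with the truncation level `tau` and step size `delta` left as free parameters (the paper derives how they should scale with the moment bound):

```python
import numpy as np

def truncate_dither_quantize(x, tau, delta, seed=None):
    """Quantize heavy-tailed samples: truncate -> dither -> uniform quantize.

    tau:   truncation threshold (clips the heavy tails)
    delta: uniform quantization step size
    """
    rng = np.random.default_rng(seed)
    x_t = np.clip(x, -tau, tau)                          # truncation
    w = rng.uniform(-delta / 2, delta / 2, np.shape(x))  # uniform dither
    return delta * np.round((x_t + w) / delta)           # uniform quantizer
```

The uniform dither keeps the quantization error unbiased and well controlled, which is what lets the downstream estimators retain (near) minimax rates.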
In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The Q-factor performance is presented for a bidirectional long short-term memory equalizer coupled with a convolutional NN (biLSTM + CNN), a CNN equalizer, and standard 1-step-per-span (1-StpS) digital back-propagation (DBP), for the simulated and experimental propagation of a single-channel dual-polarization (SC-DP) 16QAM signal at 34 GBd over 17x70 km of LEAF fiber. The biLSTM+CNN equalizer performs similarly to DBP and provides a 1.7 dB Q-factor gain over the chromatic dispersion compensation baseline on the experimental dataset. After that, we assess the Q-factor and the impact on hardware utilization when approximating the NN activation functions with Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of a hardware implementation achieving 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.
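As an illustration of the LUT alternative discussed above: the activation input range is split into uniform bins and the function value is read from a precomputed table, so the FPGA needs only an index computation and a memory fetch. A small numpy sketch for tanh with an illustrative table size (the paper studies how table size trades off against Q-factor):

```python
import numpy as np

def make_tanh_lut(n_entries=64, lo=-4.0, hi=4.0):
    """Precompute a uniform-grid look-up table for tanh over [lo, hi]."""
    centers = np.linspace(lo, hi, n_entries)
    return centers, np.tanh(centers)

def lut_tanh(x, centers, table):
    """Approximate tanh(x) by nearest-entry look-up (inputs are clipped)."""
    x = np.clip(x, centers[0], centers[-1])
    step = centers[1] - centers[0]
    idx = np.round((x - centers[0]) / step).astype(int)
    return table[idx]
```

The piecewise-constant output is also what can create the gradient problems mentioned above if such an approximation is used during extra training, since its derivative is zero almost everywhere.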
Since higher-order tensors are naturally suitable for representing real-world multi-dimensional data such as color images and videos, low-rank tensor representation has become one of the emerging areas in machine learning and computer vision. However, classical low-rank tensor representations can only represent data on a finite meshgrid due to their intrinsically discrete nature, which hinders their applicability in many scenarios beyond the meshgrid. To break this barrier, we propose a low-rank tensor function representation (LRTFR) that can continuously represent data beyond the meshgrid with infinite resolution. Specifically, the suggested tensor function, which maps an arbitrary coordinate to the corresponding value, can continuously represent data in an infinite real space. In parallel to discrete tensors, we develop two fundamental concepts for tensor functions, i.e., the tensor function rank and low-rank tensor function factorization. We theoretically justify that the low-rank and smooth regularizations are harmoniously unified in the LRTFR, which leads to high effectiveness and efficiency for continuous data representation. Extensive multi-dimensional data recovery applications arising from image processing (image inpainting and denoising), machine learning (hyperparameter optimization), and computer graphics (point cloud upsampling) substantiate the superiority and versatility of our method compared with state-of-the-art methods. In particular, the experiments beyond the original meshgrid resolution (hyperparameter optimization) or even beyond the meshgrid (point cloud upsampling) validate the favorable performance of our method for continuous representation.
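A rough sketch of what a low-rank tensor function factorization can look like for third-order data: each mode gets its own small network mapping a real coordinate to an r-dimensional feature, and the value at (x, y, z) is the rank-r multilinear contraction of the three features. Layer sizes and factor networks below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LowRankTensorFunction(nn.Module):
    """f(x, y, z) = sum_k g1(x)[k] * g2(y)[k] * g3(z)[k]  (rank r)."""

    def __init__(self, rank=8, hidden=64):
        super().__init__()
        def factor():  # one small MLP per mode: R -> R^rank
            return nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, rank))
        self.g1, self.g2, self.g3 = factor(), factor(), factor()

    def forward(self, coords):
        # coords: (N, 3) arbitrary real coordinates, not restricted to a grid
        x, y, z = coords[:, :1], coords[:, 1:2], coords[:, 2:3]
        return (self.g1(x) * self.g2(y) * self.g3(z)).sum(dim=-1)
```

Because the factors accept any real coordinate, the fitted function can be queried off the original meshgrid, which is what enables the beyond-resolution experiments.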
Despite the huge advances in knowledge discovery and data mining techniques, the X-ray diffraction (XRD) analysis process has largely remained untouched and still involves manual investigation, comparison, and verification. Due to the large volume of XRD samples from high-throughput XRD experiments, it has become impossible for domain scientists to process them manually. Recently, they have started leveraging standard clustering techniques to reduce the number of XRD pattern representations requiring manual labeling and verification. Nevertheless, these standard clustering techniques do not handle problem-specific aspects such as peak shifting, adjacent peaks, background noise, and mixed phases, and hence result in incorrect composition-phase diagrams that complicate further steps. Here, we leverage data mining techniques along with domain expertise to handle these issues. In this paper, we introduce an incremental phase mapping approach based on binary peak representations and a new threshold-based fuzzy dissimilarity measure. The proposed approach first applies an incremental phase computation algorithm to the discrete binary peak representations of XRD samples, followed by hierarchical clustering or manual merging of similar pure phases to obtain the final composition-phase diagram. We evaluate our method on the composition spaces of two ternary alloy systems, Co-Ni-Ta and Co-Ti-Ta. Our results are verified by domain scientists and closely resemble the manually computed ground-truth composition-phase diagrams. The proposed approach takes us closer to the goal of a complete, end-to-end automated XRD analysis.
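To illustrate the kind of shift-tolerant comparison involved: a peak in one binary pattern is counted as matched if the other pattern has a peak within a small window around it, which absorbs the peak shifting that would break an exact bitwise distance. The window rule below is an illustrative guess at such a threshold-based fuzzy dissimilarity, not the paper's precise definition:

```python
import numpy as np

def fuzzy_dissimilarity(a, b, shift_tol=2):
    """Dissimilarity of two binary peak vectors, tolerant to small peak shifts.

    A peak in one pattern is 'matched' if the other pattern has a peak
    within +/- shift_tol bins of it.
    """
    def match_fraction(src, ref):
        peaks = np.flatnonzero(src)
        if peaks.size == 0:
            return 1.0
        hits = sum(ref[max(0, p - shift_tol): p + shift_tol + 1].any()
                   for p in peaks)
        return hits / peaks.size

    return 1.0 - 0.5 * (match_fraction(a, b) + match_fraction(b, a))
```

Averaging the two directions keeps the measure symmetric, which is needed when it feeds the hierarchical clustering of pure phases.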
Under climate change, the increasing frequency, intensity, and spatial extent of drought events lead to higher socio-economic costs. However, the relationships between hydro-meteorological indicators and drought impacts are not yet well identified because of the complexity of the problem and data scarcity. In this paper, we propose a framework based on extreme gradient boosting (XGBoost) for Texas to predict multi-category drought impacts, connecting a typical drought indicator, the Standardized Precipitation Index (SPI), to the text-based impacts from the Drought Impact Reporter (DIR). The preliminary results of this study show the outstanding performance of the well-trained models in assessing drought impacts on agriculture, fire, society & public health, plants & wildlife, as well as relief, response & restrictions in Texas. The framework also makes it possible to appraise drought impacts from hydro-meteorological indicators elsewhere in the United States, which could aid drought risk management by providing additional information and improving the update frequency of drought impacts. Our interpretation results using the Shapley additive explanation (SHAP) interpretability technique reveal that the rules guiding the XGBoost predictions agree with domain expertise on the role that SPI indicators play in drought impacts.
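The modeling loop is standard supervised learning per impact category: SPI features at several timescales go into a gradient-boosted classifier, and SHAP values from the trained trees are inspected against domain knowledge. A minimal sketch with placeholder data and illustrative feature choices (the real inputs come from SPI products and DIR text reports):

```python
import numpy as np
import xgboost as xgb
import shap

# Placeholder stand-ins for SPI features at a few timescales and a binary
# label for one impact category (e.g., an agriculture impact was reported).
X = np.random.randn(500, 3)              # e.g., SPI-1, SPI-3, SPI-9
y = (X[:, 1] < -1.0).astype(int)         # placeholder label

model = xgb.XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)

explainer = shap.TreeExplainer(model)    # attribute predictions to features
shap_values = explainer.shap_values(X)   # inspect against domain expertise
```

One such model per category yields the multi-category predictions described above.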
The cubic regularization method (CR) is a popular algorithm for unconstrained non-convex optimization. At each iteration, CR solves a cubically regularized quadratic problem, called the cubic regularization subproblem (CRS). One approach to solving the CRS relies on solving the secular equation, whose computational bottleneck lies in computing all eigenvalues of the Hessian matrix. In this paper, we propose and analyze a novel CRS solver based on an approximate secular equation, which requires only some of the Hessian eigenvalues and is therefore much more efficient. Two approximate secular equations (ASEs) are developed. For both ASEs, we first study the existence and uniqueness of their roots and then establish upper bounds on the gap between the root and that of the standard secular equation. Such upper bounds can in turn be used to bound the distance from the ASE-based approximate CRS solution to the true CRS solution, thus offering a theoretical guarantee for our CRS solver. A desirable feature of our CRS solver is that it requires only matrix-vector multiplication and no matrix inversion, which makes it particularly suitable for high-dimensional applications of unconstrained non-convex optimization, such as low-rank recovery and deep learning. Numerical experiments on synthetic and real data sets are conducted to investigate the practical performance of the proposed CRS solver. Experimental results show that the proposed solver outperforms two state-of-the-art methods.
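To state the object being solved: writing the gradient $\tilde{g}$ in the Hessian eigenbasis, the CRS multiplier $\lambda$ satisfies the secular equation $\|s(\lambda)\| = \lambda/\sigma$ with $\|s(\lambda)\|^2 = \sum_i \tilde{g}_i^2/(\lambda_i + \lambda)^2$, and an approximate version keeps only a few extreme eigenvalues exactly while lumping the rest. The lumping rule below (mean of the remaining spectrum) is one illustrative reading, not the paper's exact ASE construction, and the hard case is ignored:

```python
import numpy as np

def crs_multiplier(eigvals, g_tilde, sigma, m=10, iters=60):
    """Bisection on an approximate secular equation for the CRS multiplier.

    eigvals: Hessian eigenvalues in ascending order (assumes m < eigvals.size);
             only the m smallest are kept exactly, the rest are lumped into
             their mean as a single surrogate eigenvalue.
    g_tilde: gradient expressed in the Hessian eigenbasis.
    sigma:   cubic regularization parameter.
    """
    lam_exact, g_exact = eigvals[:m], g_tilde[:m]
    lam_rest = eigvals[m:].mean()                  # lumped surrogate eigenvalue
    g_rest_sq = np.sum(g_tilde[m:] ** 2)

    def norm_s(lam):                               # ||s(lam)|| under the ASE
        return np.sqrt(np.sum(g_exact ** 2 / (lam_exact + lam) ** 2)
                       + g_rest_sq / (lam_rest + lam) ** 2)

    lo = max(0.0, -eigvals[0]) + 1e-12             # keep H + lam*I pos. def.
    hi = lo + 1.0
    while norm_s(hi) > hi / sigma:                 # bracket the root
        hi *= 2.0
    for _ in range(iters):                         # phi(lam) = ||s|| - lam/sigma
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_s(mid) > mid / sigma else (lo, mid)
    return 0.5 * (lo + hi)
```

Once $\lambda$ is found, the step follows from $(H + \lambda I)s = -g$, which needs only matrix-vector products when solved iteratively, matching the inversion-free property emphasized above.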
When evaluating the performance of clinical machine learning models, the deployment population must be considered. When the population of patients with observed labels is only a subset of the deployment population (label selection), standard estimates of model performance on the observed population can be misleading. In this study, we describe three classes of label selection and simulate five causally distinct scenarios to assess how specific selection mechanisms bias a suite of commonly reported binary machine learning model performance metrics. Simulations reveal that when selection is affected by observed features, naive estimates of model discrimination may be misleading. When selection is affected by labels, naive estimates of calibration fail to reflect reality. We borrow traditional weighting estimators from the causal inference literature and find that they recover full-population estimates when the selection probabilities are correctly specified. We then address the real-world task of monitoring the performance of a deployed machine learning model that interacts with clinicians and thereby influences the selection mechanism of the labels. We train three machine learning models to flag low-yield laboratory diagnostics and simulate their intended consequence of reducing wasteful laboratory utilization. We find that naive performance estimates on the observed population understate actual performance by 20%, a disparity that could be large enough to lead to the wrongful termination of a successful clinical decision support tool. We propose an altered deployment procedure that combines injected randomization with traditional weighted estimation and find that it recovers true model performance.
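The weighting fix referenced above is the standard inverse-probability-of-selection estimator: each labeled patient is up-weighted by 1/P(label observed | features), so the labeled subsample stands in for the deployment population. A minimal sketch for one metric (the metric and selection model are illustrative; injected randomization makes the selection probabilities known by design):

```python
import numpy as np

def ipw_accuracy(y_true, y_pred, p_select):
    """Selection-weighted accuracy computed over labeled patients only.

    p_select: P(label observed | features) for each labeled patient,
              known by design under injected randomization.
    """
    w = 1.0 / p_select                        # inverse-probability weights
    correct = (y_true == y_pred).astype(float)
    return np.sum(w * correct) / np.sum(w)    # weighted (Hajek) mean
```

The same reweighting applies to discrimination and calibration metrics, provided the selection probabilities are correctly specified.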
Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice-conversion and text-to-speech samples with subjective labels. This study looks at how much of the variance in subjective speech-quality ratings can be explained by metadata and by the distribution imbalance of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features, including rater groups and system identifiers, and achieved competitive metrics, including a Spearman rank correlation coefficient (SRCC) of 0.934 and an MSE of 0.088 at the system level, and an SRCC of 0.877 and an MSE of 0.198 at the utterance level. Using data and metadata that were restricted or blinded in the challenge further improved the metrics. A metadata analysis showed that the system-level metrics are not representative of the model's system-level prediction because of the wide variation in the number of utterances used per system in the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems; otherwise the utterance-level metrics may be more reliable and interpretable.
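Architecturally, the description above amounts to concatenating a pooled wav2vec 2.0 utterance embedding with learned embeddings of the metadata before a small regression head. A hypothetical PyTorch sketch of that fusion (dimensions and head are assumptions; the wav2vec 2.0 extractor is used off the shelf):

```python
import torch
import torch.nn as nn

class MOSWithMetadata(nn.Module):
    """MOS regressor: wav2vec 2.0 features fused with metadata embeddings."""

    def __init__(self, ssl_dim=768, n_raters=100, n_systems=200, meta_dim=16):
        super().__init__()
        self.rater_emb = nn.Embedding(n_raters, meta_dim)
        self.system_emb = nn.Embedding(n_systems, meta_dim)
        self.head = nn.Sequential(
            nn.Linear(ssl_dim + 2 * meta_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, ssl_feat, rater_id, system_id):
        # ssl_feat: (B, ssl_dim) pooled wav2vec 2.0 features per utterance
        x = torch.cat([ssl_feat,
                       self.rater_emb(rater_id),
                       self.system_emb(system_id)], dim=-1)
        return self.head(x).squeeze(-1)        # predicted MOS
```

At inference time on unseen raters or systems, the metadata embeddings would have to fall back to a default entry, which is one reason such features help most in-domain.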
Cost-effective depth and infrared sensors have now become a reality as alternatives to conventional RGB sensors, and they have advantages over RGB in domains such as autonomous navigation and remote sensing. It is therefore crucial to build computer vision and deep learning systems for depth and infrared data. However, large labeled datasets for these modalities are still lacking. In such cases, transferring knowledge from a neural network trained on a well-labeled large dataset in the source modality (RGB) to a neural network working on a target modality (depth, infrared, etc.) is of great value. For reasons such as memory and privacy, it may not be possible to access the source data, and knowledge transfer then needs to work with only the source models. We describe an effective solution, SOCKET: SOurce-free Cross-modal KnowledgE Transfer, for the challenging task of transferring knowledge from one source modality to a different target modality without access to task-relevant source data. The framework reduces the modality gap using paired task-irrelevant data and by matching the mean and variance of the target features with the batch-norm statistics present in the source models. We show through extensive experiments that our method significantly outperforms existing source-free methods for classification tasks, which do not account for the modality gap.
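The statistics-matching component is a simple distribution-alignment loss: the running means and variances stored in the frozen source model's batch-norm layers act as a fingerprint of source features, and target features are pushed toward them. A minimal PyTorch sketch (layer selection and weighting are simplified assumptions):

```python
import torch

def bn_stats_matching_loss(target_feats, bn_layers):
    """Align target feature statistics with frozen source batch-norm stats.

    target_feats: per-layer target activations, each of shape (B, C, H, W)
    bn_layers:    the matching nn.BatchNorm2d layers of the frozen source model
    """
    loss = 0.0
    for feat, bn in zip(target_feats, bn_layers):
        mu = feat.mean(dim=(0, 2, 3))         # target per-channel means
        var = feat.var(dim=(0, 2, 3))         # target per-channel variances
        loss = loss + torch.sum((mu - bn.running_mean) ** 2) \
                    + torch.sum((var - bn.running_var) ** 2)
    return loss
```

Because only stored statistics are read from the source model, no source data ever needs to be accessed, which is exactly the source-free constraint.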