Document summarization aims to create a precise and coherent summary of a text document. Many deep learning summarization models are developed mainly for English and often require a large training corpus as well as efficient pre-trained language models and tools. However, when applied to low-resource Indian languages, English summarization models are often limited by rich morphological variation, syntax, and semantic differences. In this paper, we propose GAE-ISumm, an unsupervised Indic summarization model that extracts summaries from text documents. In particular, GAE-ISumm uses a Graph Autoencoder (GAE) to jointly learn text representations and a document summary. We also provide TELSUM, a manually annotated Telugu summarization dataset, to experiment with GAE-ISumm. Further, we experiment with most of the publicly available Indian language summarization datasets to investigate the effectiveness of GAE-ISumm on other Indian languages. Our experiments with GAE-ISumm on seven languages yield the following observations: (i) it is competitive with or better than state-of-the-art results on all datasets, (ii) it reports benchmark results on TELSUM, and (iii) including positional and cluster information in the proposed model improves the quality of the summaries.
The need for data-driven decisions in healthcare strategy formulation is rapidly increasing. A reliable framework that helps identify the factors affecting the market share of a healthcare provider facility or hospital (hereafter, Facility) is of key importance. This pilot study aims to develop a data-driven machine learning regression framework that helps strategists make key decisions to improve a Facility's market share, which in turn helps improve the quality of healthcare services. The US (United States) healthcare business is chosen for the study, which considers data from 60 key Facilities in Washington State covering about 3 years of history. In this analysis, market share is defined as the ratio of a Facility's encounters to the total encounters among the group of potential competitor Facilities. The study proposes a novel two-pronged approach: competitor identification, followed by regression to evaluate and predict market share. To identify the pool of competitors, the proposed method builds Directed Acyclic Graphs (DAGs) and feature-level word vectors, and evaluates the key connected components at the Facility level. This technique is robust because it is data driven, which minimizes the bias introduced by empirical techniques. Once the set of competitors among Facilities is identified, a regression model is developed to predict market share. SHAP, a model-agnostic explainer, is used to quantify the relative importance of features at the Facility level, which helps identify and rank the attributes that affect market share at each Facility.
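The market share definition in the abstract above can be illustrated with a minimal sketch. The function name, the argument layout, and the choice to count the Facility itself in the denominator are assumptions made for illustration; the abstract does not specify them.

```python
def market_share(facility, competitor_group, encounters):
    """Share of encounters captured by `facility` within its competitor group.

    Assumption: the denominator covers the facility plus its competitors.
    `encounters` maps a facility name to its encounter count.
    """
    # Use a set so the facility is counted exactly once in the denominator.
    group = set(competitor_group) | {facility}
    total = sum(encounters[name] for name in group)
    if total == 0:
        raise ValueError("no encounters recorded for this competitor group")
    return encounters[facility] / total


if __name__ == "__main__":
    # Hypothetical encounter counts, not data from the study.
    counts = {"A": 50, "B": 100, "C": 50}
    print(market_share("A", ["B", "C"], counts))
```

In the study, the competitor group would come from the graph-based identification step rather than being supplied by hand as it is here.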
Temporal difference (TD) learning is a simple algorithm for policy evaluation in reinforcement learning. The performance of TD learning is affected by high variance, and it can be naturally enhanced with variance reduction techniques such as the Stochastic Variance Reduced Gradient (SVRG) method. Recently, multiple works have sought to fuse TD learning with SVRG to obtain a policy evaluation method with a geometric rate of convergence. However, the resulting convergence rate is significantly weaker than what SVRG achieves in the setting of convex optimization. In this work we utilize a recent interpretation of TD learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the algorithm and fusing TD with SVRG. We prove a geometric convergence bound, with a predetermined learning rate of 1/8, that is identical to the convergence bound available for SVRG in the convex setting.
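For readers unfamiliar with the baseline, the following is a minimal sketch of plain TD(0) policy evaluation with linear function approximation over a fixed set of sampled transitions. It is not the variance-reduced algorithm analyzed in the abstract above. The function name, the data layout, and the step size of 0.1 are illustrative assumptions.

```python
import random


def td0_linear(transitions, features, gamma=0.9, alpha=0.1, num_steps=2000, seed=0):
    """Plain TD(0) with linear value estimates V(s) = theta . phi(s).

    transitions: list of (state, reward, next_state) samples.
    features:    dict mapping each state to its feature vector (list of floats).
    """
    rng = random.Random(seed)
    dim = len(next(iter(features.values())))
    theta = [0.0] * dim
    for _ in range(num_steps):
        s, r, s_next = rng.choice(transitions)  # sample one stored transition
        phi, phi_next = features[s], features[s_next]
        v = sum(t * p for t, p in zip(theta, phi))
        v_next = sum(t * p for t, p in zip(theta, phi_next))
        delta = r + gamma * v_next - v  # temporal-difference error
        theta = [t + alpha * delta * p for t, p in zip(theta, phi)]
    return theta


if __name__ == "__main__":
    # Toy chain: state 0 moves to state 1 with reward 1; state 1 loops with reward 0.
    samples = [(0, 1.0, 1), (1, 0.0, 1)]
    one_hot = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    print(td0_linear(samples, one_hot))
```

Each update uses a single sampled transition, which is the source of the variance that SVRG-style corrections are designed to reduce.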
Named Entity Recognition and Intent Classification are among the most important subfields of Natural Language Processing. Recent research has led to the development of faster, more sophisticated, and more efficient models to tackle the problems posed by these two tasks. In this work we explore the effectiveness of two separate families of deep learning networks for these tasks: Bidirectional Long Short-Term Memory networks and Transformer-based networks. The models were trained and tested on the ATIS benchmark dataset in both English and Greek. The purpose of this paper is to present a comparative study of the two groups of networks for both languages and to showcase the results of our experiments. The models, being the current state of the art, achieved high performance.
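Previous work has shown that words are hyperarticulated on the phonetic dimensions that distinguish them from minimal-pair competitors. This phenomenon has been termed contrastive hyperarticulation (CH). We present a dynamic neural field (DNF) model of voice onset time (VOT) planning that derives CH from an inhibitory influence of the minimal-pair competitor. We test some predictions of the model with a novel experiment investigating CH of voiceless stop consonants in pseudowords. The results demonstrate a CH effect in pseudowords, consistent with a basis for the effect in the real-time planning and production of speech. The scope and magnitude of CH in pseudowords was reduced compared with CH in real words, consistent with a role for interactive activation between lexical and phonological planning. We discuss the potential of the model to unify an apparently disparate set of phenomena, from CH to phonological neighborhood effects to phonetic trace effects in speech errors.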
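Outlier detection is a challenging activity. Several machine learning techniques have been proposed in the literature for outlier detection. In this paper, we propose a new training approach for the bidirectional GAN (BiGAN) to detect outliers. To validate the proposed approach, we use it to train a BiGAN that detects taxpayers who are manipulating their tax returns. For each taxpayer, we derive six correlation parameters and three ratio parameters from the tax returns they submitted. We train a BiGAN on this nine-dimensional derived ground-truth dataset using the proposed training approach. Next, we generate the latent representation of this dataset by encoding it with the encoder, and regenerate the dataset by decoding that latent representation with the generator. For each taxpayer, we compute the cosine similarity between their ground-truth data and their regenerated data. Taxpayers with lower cosine similarity are potential return manipulators. We applied our method to analyze the iron and steel taxpayers dataset provided by the Commercial Taxes Department, Government of Telangana, India.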
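The scoring step described above, comparing each taxpayer's original vector with its regenerated vector by cosine similarity, can be sketched as follows. This is an illustration under stated assumptions: the trained BiGAN encoder and generator are not modeled, the regenerated vectors are simply passed in, and the function names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_reconstruction(original, regenerated):
    """Return taxpayer ids ordered from least to most similar, plus the raw scores.

    original, regenerated: dicts mapping a taxpayer id to a feature vector.
    `regenerated` stands in for the generator's output on the encoder's latent code.
    """
    scores = {tid: cosine_similarity(original[tid], regenerated[tid]) for tid in original}
    return sorted(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical 2-dimensional vectors; the abstract uses nine derived parameters.
    original = {"T1": [1.0, 0.0], "T2": [1.0, 1.0]}
    regenerated = {"T1": [1.0, 0.0], "T2": [1.0, -1.0]}
    print(rank_by_reconstruction(original, regenerated))
```

Taxpayers at the front of the returned list are the ones whose data the model reconstructs worst, which is the signal the abstract treats as a sign of potential manipulation.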
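Circular trading is a form of tax evasion in Goods and Services Tax in which a group of fraudulent taxpayers (traders) aims to mask illegal transactions by superimposing several fictitious transactions, which add little value to the goods or services, among themselves within a short period. Because the taxpayer database is vast, it is infeasible for authorities to manually identify groups of circular traders and the illegitimate transactions they are involved in. This work uses big data analytics and graph representation techniques to propose a framework that identifies communities of circular traders and isolates the illegitimate transactions within each community. Our approach is tested on real-life data provided by the Department of Commercial Taxes, Government of Telangana, India, where we uncovered several communities of circular traders.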
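The emerging class of instance-optimized systems has shown potential to achieve high performance by specializing to a specific data and query workload. In particular, machine learning (ML) techniques have been applied successfully to build various instance-optimized components (e.g., learned indexes). This paper investigates leveraging ML techniques to enhance the performance of spatial indexes, particularly the R-tree, for a given data and query workload. Because the areas covered by R-tree index nodes overlap in space, searching for a specific point may require exploring multiple paths from root to leaf. In the worst case, the entire R-tree may be searched. In this paper, we define and use the overlap ratio to quantify the degree of extraneous leaf node accesses required by a range query. The goal is to enhance the query performance of a traditional R-tree for high-overlap range queries, as they tend to incur long running times. We introduce a new AI-tree that transforms the search operation of an R-tree into a multi-label classification task in order to exclude the extraneous leaf node accesses. We then augment a traditional R-tree with the AI-tree to form a hybrid "AI+R"-tree. The "AI+R"-tree can automatically differentiate between high-overlap and low-overlap queries using a learned model. Thus, the "AI+R"-tree processes high-overlap queries using the AI-tree and low-overlap queries using the R-tree. Experiments on real datasets demonstrate that the "AI+R"-tree can improve query performance over a traditional R-tree by up to 500%.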
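Real-world sequential decision making requires data-driven algorithms that provide practical guarantees on performance throughout training while also making efficient use of data. Model-free deep reinforcement learning represents a framework for such data-driven decision making, but existing algorithms typically focus on only one of these goals while sacrificing performance with respect to the other. On-policy algorithms guarantee policy improvement throughout training but suffer from high sample complexity, while off-policy algorithms make efficient use of data through sample reuse but lack theoretical guarantees. To balance these competing goals, we develop a class of Generalized Policy Improvement algorithms that combines the policy improvement guarantees of on-policy methods with the efficiency of theoretically supported sample reuse. We demonstrate the benefits of this new class of algorithms through extensive experimental analysis on a variety of continuous control tasks from the DeepMind Control Suite.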
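Open information extraction (OIE) methods extract large numbers of OIE triples <noun phrase, relation phrase, noun phrase> from unstructured text, which together compose large open knowledge bases (OKBs). The noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. Two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) have been found to provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. However, existing works have so far leveraged these two views of knowledge in isolation. In this paper, we propose CMVC, a novel unsupervised framework that leverages the two views of knowledge jointly to canonicalize OKBs without the need for manually annotated labels. To achieve this goal, we propose a multi-view CH K-Means clustering algorithm that mutually reinforces the clustering of the view-specific embeddings learned from each view by considering their different clustering qualities. To further enhance canonicalization performance, we propose a training data optimization strategy within each particular view to iteratively refine the learned view-specific embeddings. In addition, we propose a Log-Jump algorithm to predict the optimal number of clusters in a data-driven way without requiring any labels. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB datasets against state-of-the-art methods.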