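A plethora of modern machine learning tasks require large-scale distributed clusters as a critical component of the training pipeline. However, abnormal Byzantine behavior of the worker nodes can derail training and compromise the quality of inference. Such behavior can be attributed to unintentional system malfunctions or orchestrated attacks; as a result, some nodes may return arbitrary results to the parameter server (PS) that coordinates the training. Recent work considers a wide range of attack models and has explored robust aggregation and/or computational redundancy to correct the distorted gradients. In this work, we consider attack models ranging from strong ones, in which $q$ omniscient adversaries with full knowledge of the defense protocol can change from iteration to iteration, to weak ones, in which $q$ randomly chosen adversaries with limited collusion abilities change only every few iterations. Our algorithms rely on redundant task assignments coupled with detection of adversarial behavior. For strong attacks, we demonstrate a reduction in the fraction of distorted gradients ranging from 16% to 99% compared to the prior state of the art. Our top-1 classification accuracy results on the CIFAR-10 dataset show an accuracy advantage (averaged over strong and weak scenarios) under the most sophisticated attacks compared to state-of-the-art methods.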
Translated by Google Translate
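State-of-the-art machine learning models are routinely trained on large-scale distributed clusters. Such systems can be compromised when some of the computing devices exhibit abnormal (Byzantine) behavior and return arbitrary results to the parameter server (PS). This behavior may be attributed to many causes, including system failures and orchestrated attacks. Existing work suggests robust aggregation and/or computational redundancy to alleviate the effect of distorted gradients. However, most of these schemes are ineffective when an adversary knows the task assignment and can judiciously choose which workers to attack so as to induce maximal damage. Our proposed method, ASPIS, assigns gradient computations to worker nodes using a subset-based assignment that allows multiple consistency checks on the behavior of each worker node. Examination of the computed gradients and post-processing (clique finding in an appropriately constructed graph) by the central node allow efficient detection of adversaries and their subsequent exclusion from the training process. We prove Byzantine resilience and detection guarantees under both weak and strong attacks, and we extensively evaluate the system on a variety of large-scale training scenarios. The principal metric in our experiments is test accuracy, for which we demonstrate a significant improvement of about 30% compared to many state-of-the-art approaches on the CIFAR-10 dataset. The corresponding reduction in the fraction of corrupted gradients ranges from 16% to 99%.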
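The detection idea described in this abstract (redundant subset-based assignment, consistency checks, then clique finding) can be illustrated with a toy sketch. This is not the ASPIS implementation: the function names, the exact-equality comparison of returned values, the assumption that every subset of `r` workers shares one file, and the brute-force clique search are all simplifying assumptions made here for illustration.

```python
from itertools import combinations


def subset_assignment(num_workers, r):
    """Hypothetical scheme: one file per r-subset of workers."""
    return dict(enumerate(combinations(range(num_workers), r)))


def simulate_returns(assignment, adversaries):
    """Honest workers return the true value; adversaries return a distorted one."""
    return {
        (w, f): ("distorted" if w in adversaries else "true", f)
        for f, workers in assignment.items()
        for w in workers
    }


def agreement_graph(num_workers, assignment, returned):
    """Edge (i, j) iff workers i and j never disagree on a file they share."""
    disagree = set()
    for f, workers in assignment.items():
        for i, j in combinations(workers, 2):
            if returned[(i, f)] != returned[(j, f)]:
                disagree.add((i, j))
    all_pairs = combinations(range(num_workers), 2)
    return {pair for pair in all_pairs if pair not in disagree}


def largest_clique(num_workers, edges):
    """Brute-force maximum clique; only suitable for tiny examples."""
    for size in range(num_workers, 0, -1):
        for group in combinations(range(num_workers), size):
            if all(pair in edges for pair in combinations(group, 2)):
                return set(group)
    return set()


# Toy run: 5 workers, each file replicated on 3 workers, worker 4 is Byzantine.
num_workers, r = 5, 3
assignment = subset_assignment(num_workers, r)
returned = simulate_returns(assignment, adversaries={4})
edges = agreement_graph(num_workers, assignment, returned)
trusted = largest_clique(num_workers, edges)
flagged = set(range(num_workers)) - trusted
print(flagged)  # {4}
```

In this toy setting the honest workers form the largest group of mutually consistent nodes, so any worker outside that group is flagged. The sketch relies on honest workers being the majority and on every pair of workers sharing at least one file.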
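In this work, we study the problem of subspace tracking with missing data (ST-MISS) and with outliers (Robust ST-MISS). We propose a novel algorithm and provide guarantees for both problems. Unlike past work on this topic, the current work does not impose the piecewise-constant subspace change assumption. In addition, the proposed algorithm is much simpler (it uses fewer parameters) than our previous work. Second, we extend our approach and its analysis to provably solve these problems when the data is federated and information is exchanged between the $k$ peer nodes and a center. We validate our theoretical claims with extensive numerical experiments.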
The paper presents a cross-domain review analysis on four popular review datasets: Amazon, Yelp, Steam, IMDb. The analysis is performed using Hadoop and Spark, which allows for efficient and scalable processing of large datasets. By examining close to 12 million reviews from these four online forums, we hope to uncover interesting trends in sales and customer sentiment over the years. Our analysis will include a study of the number of reviews and their distribution over time, as well as an examination of the relationship between various review attributes such as upvotes, creation time, rating, and sentiment. By comparing the reviews across different domains, we hope to gain insight into the factors that drive customer satisfaction and engagement in different product categories.
Automated offensive language detection is essential in combating the spread of hate speech, particularly on social media. This paper describes our work on offensive language identification in Marathi, a low-resource Indic language. The problem is formulated as a text classification task that labels a tweet as offensive or non-offensive. We evaluate different monolingual and multilingual BERT models on this classification task, focusing on BERT models pre-trained on social media datasets. We compare the performance of MuRIL, MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT on the HASOC 2022 test set. We also explore external data augmentation using other existing Marathi hate speech corpora, HASOC 2021 and L3Cube-MahaHate. MahaTweetBERT, a BERT model pre-trained on Marathi tweets, outperforms all other models when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), with an F1 score of 98.43 on the HASOC 2022 test set. With this, we also provide a new state-of-the-art result on the HASOC 2022 / MOLD v2 test set.
We consider the problem of continually releasing an estimate of the population mean of a stream of samples that is user-level differentially private (DP). At each time instant, a user contributes a sample, and the users can arrive in arbitrary order. Until now, the requirements of continual release and user-level privacy have been considered in isolation. In practice, however, both requirements arise together, since users often contribute data repeatedly and multiple queries are made. We provide an algorithm that outputs a mean estimate at every time instant $t$ such that the overall release is user-level $\varepsilon$-DP and has the following error guarantee: denoting by $M_t$ the maximum number of samples contributed by a user, as long as $\tilde{\Omega}(1/\varepsilon)$ users have $M_t/2$ samples each, the error at time $t$ is $\tilde{O}(1/\sqrt{t}+\sqrt{M_t}/(t\varepsilon))$. This is a universal error guarantee that is valid for all arrival patterns of the users. Furthermore, it (almost) matches the existing lower bounds for the single-release setting at all time instants when users have contributed an equal number of samples.
Speech-centric machine learning systems have revolutionized many leading domains, ranging from transportation and healthcare to education and defense, profoundly changing how people live, work, and interact with each other. However, recent studies have demonstrated that many speech-centric ML systems may need to become more trustworthy before broader deployment. Specifically, concerns over privacy breaches, discriminatory performance, and vulnerability to adversarial attacks have all been raised in ML research. To address these challenges and risks, significant efforts have been made to ensure that these ML systems are trustworthy, in particular private, safe, and fair. In this paper, we conduct the first comprehensive survey on speech-centric trustworthy ML topics related to privacy, safety, and fairness. In addition to serving as a summary report for the research community, we point out several promising future research directions to inspire researchers who wish to explore this area further.
The automated synthesis of correct-by-construction Boolean functions from logical specifications is known as the Boolean Functional Synthesis (BFS) problem. BFS has many application areas, ranging from software engineering to circuit design. In this paper, we introduce a tool, BNSynth, that is the first to solve the BFS problem under a given bound on the solution space. Bounding the solution space induces the synthesis of smaller functions, which benefits resource-constrained areas such as circuit design. BNSynth uses a counterexample-guided, neural approach to solve the bounded BFS problem. Initial results show promise in synthesizing smaller solutions; we observe at least 3.2X (and up to 24X) improvement in the reduction of solution size on average, compared to state-of-the-art tools on our benchmarks. BNSynth is available on GitHub under an open source license.
Research on text summarization for low-resource Indian languages has been limited by the scarcity of relevant datasets. This paper presents a summary of various deep-learning approaches applied to the ILSUM 2022 Indic language summarization datasets. The ILSUM 2022 dataset consists of news articles written in Indian English, Hindi, and Gujarati, together with their ground-truth summaries. In our work, we explore different pre-trained seq2seq models and fine-tune them on the ILSUM 2022 datasets. In our case, the fine-tuned SoTA PEGASUS model worked best for English, the fine-tuned IndicBART model with augmented data worked best for Hindi, and the fine-tuned PEGASUS model combined with a translation mapping-based approach worked best for Gujarati. The generated summaries were evaluated using ROUGE-1, ROUGE-2, and ROUGE-4 as the evaluation metrics.
The need for data-driven decisions in healthcare strategy formulation is rapidly increasing. A reliable framework that helps identify the factors affecting the market share of a healthcare provider facility or hospital (hereafter, Facility) is therefore of key importance. This pilot study aims to develop a data-driven machine learning regression framework that helps strategists make key decisions to improve a Facility's market share, which in turn improves the quality of healthcare services. The US (United States) healthcare business is chosen for the study, using about 3 years of historical data from 60 key Facilities in Washington State. In the current analysis, market share is defined as the ratio of a Facility's encounters to the total encounters among the group of potential competitor Facilities. The study proposes a novel two-pronged approach: competitor identification, followed by regression to evaluate and predict market share. To identify the pool of competitors, the proposed method builds Directed Acyclic Graphs (DAGs) and feature-level word vectors, and evaluates the key connected components at the Facility level. This technique is robust because it is data-driven, which minimizes the bias of empirical techniques. After identifying the set of competitors for each Facility, we develop a regression model to predict market share. To quantify the relative importance of features at the Facility level, we use SHAP, a model-agnostic explainer, which helps identify and rank the attributes that affect market share at each Facility.
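The market share definition in this abstract can be written out as a short calculation. The sketch below is illustrative only: the function name, the facility labels, and the encounter counts are hypothetical, and it assumes that the competitor group's total includes the Facility's own encounters, which the abstract does not state explicitly.

```python
def market_share(encounters, facility, competitors):
    """Share of `facility` within its competitor group.

    encounters:  dict mapping facility name -> number of encounters
    competitors: iterable of competing facility names
    Assumption: the group total includes the facility itself.
    """
    group = set(competitors) | {facility}
    total = sum(encounters[name] for name in group)
    if total == 0:
        raise ValueError("no encounters recorded for this competitor group")
    return encounters[facility] / total


# Hypothetical encounter counts for three facilities.
encounters = {"A": 300, "B": 500, "C": 200}
share_a = market_share(encounters, "A", competitors=["B", "C"])
print(share_a)  # 0.3
```

Because the ratio depends on which Facilities are counted as competitors, the competitor identification step determines the denominator and therefore the predicted quantity.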