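We present a novel hybrid algorithm for training deep neural networks (DNNs) that combines the state-of-the-art gradient descent (GD) method with a mixed integer linear programming (MILP) solver, outperforming GD and its variants in accuracy as well as in resource and data efficiency on both regression and classification tasks. Our GD+Solver hybrid algorithm, called GDSolver, works as follows: given a DNN $D$ as input, GDSolver invokes GD to partially train $D$ until it gets stuck in a local minimum, at which point GDSolver invokes an MILP solver to exhaustively search a region of the loss landscape around the weight assignments of the final-layer parameters of $D$, with the goal of tunnelling through and escaping the local minimum. The process is repeated until the desired accuracy is reached. In our experiments, we find that GDSolver not only scales well to additional data and very large model sizes, but also outperforms all other competing methods in rate of convergence and data efficiency. For regression tasks, GDSolver produced models that, on average, had 31.5% lower MSE in 48% less time; for classification tasks on MNIST and CIFAR10, GDSolver achieved the highest accuracy among all competing methods while using only 50% of the training data required by the GD baselines.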
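The alternation described in the GDSolver abstract (run GD until it stalls, search a bounded region around the current weights, resume GD from the best point found) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a scalar non-convex function stands in for the network loss, and a plain grid search stands in for the MILP solver over final-layer weights. It shows the control flow only, not the authors' implementation.

```python
def loss(w):
    # Toy non-convex loss (assumption): local minimum near w = 1.13,
    # global minimum near w = -1.30.
    return w**4 - 3 * w**2 + w


def grad(w):
    # Derivative of the toy loss above.
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w, lr=0.01, tol=1e-8, max_steps=10_000):
    # Plain gradient descent; stops once the update becomes negligible.
    for _ in range(max_steps):
        step = lr * grad(w)
        w -= step
        if abs(step) < tol:
            break
    return w


def local_search(w, radius=3.0, points=601):
    # Stand-in for the MILP solver: exhaustively evaluate a grid of
    # candidates in [w - radius, w + radius] and return the best one.
    candidates = [w - radius + 2 * radius * i / (points - 1) for i in range(points)]
    return min(candidates, key=loss)


def gd_with_escape(w0, rounds=5):
    # Alternate GD with the bounded search until the search stops helping.
    w = gradient_descent(w0)
    for _ in range(rounds):
        candidate = local_search(w)
        if loss(candidate) >= loss(w) - 1e-9:
            break  # no better point in the searched region
        w = gradient_descent(candidate)
    return w


if __name__ == "__main__":
    stuck = gradient_descent(1.5)
    escaped = gd_with_escape(1.5)
    print(f"GD alone: w={stuck:.4f}, loss={loss(stuck):.4f}")
    print(f"GD + search: w={escaped:.4f}, loss={loss(escaped):.4f}")
```

Starting from `w0 = 1.5`, GD alone settles in the shallower basin, while the alternating loop reaches the deeper one. The search radius and grid resolution are arbitrary choices for this toy problem.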
We consider a long-term average profit maximizing admission control problem in an M/M/1 queuing system with a known arrival rate but an unknown service rate. A fixed reward is collected upon each service completion, and a cost per unit of time is charged for customers waiting in the queue; at each arrival, a dispatcher decides whether to admit the arriving customer based on the full history of observations of the queue length of the system. \cite[Econometrica]{Naor} showed that if all the parameters of the model are known, then it is optimal to use a static threshold policy: admit if the queue length is less than a predetermined threshold, and reject otherwise. We propose a learning-based dispatching algorithm and characterize its regret with respect to optimal dispatch policies for the full information model of \cite{Naor}. We show that the algorithm achieves an $O(1)$ regret when all optimal thresholds with full information are non-zero, and achieves an $O(\ln^{3+\epsilon}(N))$ regret in the case that an optimal threshold with full information is $0$ (i.e., an optimal policy is to reject all arrivals), where $N$ is the number of arrivals and $\epsilon>0$.
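To make the full-information benchmark concrete, the sketch below evaluates a static threshold policy using the stationary distribution of the resulting finite M/M/1 queue and picks the best threshold by brute force. It is not the learning algorithm from the abstract. The function names and the search cap `max_threshold` are hypothetical, and the sketch assumes the holding cost is charged on every customer in the system, including the one in service.

```python
def average_profit(threshold, arrival_rate, service_rate, reward, holding_cost):
    """Long-run average profit per unit time when an arrival is admitted
    only if the number in the system is below `threshold`."""
    if threshold == 0:
        return 0.0  # every arrival is rejected
    rho = arrival_rate / service_rate
    # Stationary distribution of the truncated birth-death chain:
    # pi_n is proportional to rho**n for n = 0..threshold.
    weights = [rho**n for n in range(threshold + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Arrivals are blocked only when the system is at the threshold.
    throughput = arrival_rate * (1.0 - probs[threshold])
    # Assumption: holding cost applies to everyone in the system.
    mean_in_system = sum(n * p for n, p in enumerate(probs))
    return reward * throughput - holding_cost * mean_in_system


def best_threshold(arrival_rate, service_rate, reward, holding_cost, max_threshold=50):
    # Brute-force search; ties resolve to the smallest threshold.
    return max(
        range(max_threshold + 1),
        key=lambda k: average_profit(k, arrival_rate, service_rate, reward, holding_cost),
    )


if __name__ == "__main__":
    print(best_threshold(1.0, 2.0, reward=5.0, holding_cost=1.0))
    print(best_threshold(1.0, 2.0, reward=0.1, holding_cost=1.0))
```

The second call illustrates the case singled out in the abstract: when the reward is small relative to the holding cost, the best threshold is $0$ and every arrival is rejected.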
Active target sensing is the task of discovering and classifying an unknown number of targets in an environment and is critical in search-and-rescue missions. This paper develops a deep reinforcement learning approach to plan informative trajectories that increase the likelihood for an uncrewed aerial vehicle (UAV) to discover missing targets. Our approach efficiently (1) explores the environment to discover new targets, (2) exploits its current belief of the target states and incorporates inaccurate sensor models for high-fidelity classification, and (3) generates dynamically feasible trajectories for an agile UAV by employing a motion primitive library. Extensive simulations on randomly generated environments show that our approach discovers and classifies targets more efficiently than several baselines. A unique characteristic of our approach, in contrast to heuristic informative path planning approaches, is that it is robust to varying degrees of deviation of the prior belief from the true target distribution, thereby alleviating the challenge of designing heuristics specific to the application conditions.
Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high performance and energy efficiency. As systems grow in complexity, fine-tuning architectural parameters across multiple sub-systems (e.g., datapath, memory blocks in different hierarchies, interconnects, compiler optimization, etc.) quickly results in a combinatorial explosion of the design space. This makes domain-specific customization an extremely challenging task. Prior work explores using reinforcement learning (RL) and other optimization methods to automatically explore the large design space. However, these methods have traditionally relied on single-agent RL/ML formulations. It is unclear how scalable single-agent formulations are as we increase the complexity of the design space (e.g., full stack System-on-Chip design). Therefore, we propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem. The key idea behind using MARL is the observation that parameters across different sub-systems are more or less independent, which allows a decentralized role to be assigned to each agent. We test this hypothesis by designing domain-specific DRAM memory controllers for several workload traces. Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines such as Proximal Policy Optimization and Soft Actor-Critic over different target objectives such as low power and latency. This work thus opens the pathway for new and promising research in MARL solutions for hardware architecture search.
Reinforcement Learning (RL) algorithms are known to scale poorly to environments with many available actions, requiring numerous samples to learn an optimal policy. The traditional approach of considering the same fixed action space in every possible state implies that the agent must learn, while also learning to maximize its reward, to ignore irrelevant actions such as $\textit{inapplicable actions}$ (i.e., actions that have no effect on the environment when performed in a given state). Knowing this information can help reduce the sample complexity of RL algorithms by masking the inapplicable actions from the policy distribution, so that the agent only explores actions relevant to finding an optimal policy. This is typically done in an ad-hoc manner with hand-crafted domain logic added to the RL algorithm. In this paper, we propose a more systematic approach to introduce this knowledge into the algorithm. We (i) standardize the way knowledge can be manually specified to the agent; and (ii) present a new framework to autonomously learn these state-dependent action constraints jointly with the policy. We show experimentally that learning inapplicable actions greatly improves the sample efficiency of the algorithm by providing a reliable signal to mask out irrelevant actions. Moreover, we demonstrate that, thanks to the transferability of the acquired knowledge, it can be reused in other tasks to make the learning process more efficient.
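Masking inapplicable actions from the policy distribution, as mentioned in the abstract above, is commonly done by setting the logits of those actions to negative infinity before the softmax. The sketch below shows that generic technique only. The function name and the boolean `applicable` mask are hypothetical, and in the paper's framework such a mask would be specified manually or learned rather than hard-coded.

```python
import math


def masked_policy(logits, applicable):
    """Softmax over `logits` that assigns zero probability to every action
    whose entry in `applicable` is False."""
    if not any(applicable):
        raise ValueError("at least one action must be applicable")
    # Inapplicable actions get a logit of -inf, so exp() maps them to 0.
    masked = [l if ok else -math.inf for l, ok in zip(logits, applicable)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # The second action has no effect in this state, so it is masked out.
    print(masked_policy([1.0, 2.0, 3.0], [True, False, True]))
```

The remaining actions keep their relative odds, so the mask removes exploration of irrelevant actions without otherwise changing the policy's preferences.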
Generalizability of time series forecasting models depends on the quality of model selection. Temporal cross validation (TCV) is a standard technique to perform model selection in forecasting tasks. TCV sequentially partitions the training time series into train and validation windows, and performs hyperparameter optimization (HPO) of the forecast model to select the model with the best validation performance. Model selection with TCV often leads to poor test performance when the test data distribution differs from that of the validation data. We propose a novel model selection method, H-Pro, that exploits the data hierarchy often associated with a time series dataset. Generally, the aggregated data at the higher levels of the hierarchy show better predictability and more consistency than the bottom-level data, which is sparser and sometimes intermittent. H-Pro performs the HPO of the lowest-level student model based on the test proxy forecasts obtained from a set of teacher models at higher levels in the hierarchy. The consistency of the teachers' proxy forecasts helps select better student models at the lowest level. We perform extensive empirical studies on multiple datasets to validate the efficacy of the proposed method. H-Pro, combined with off-the-shelf forecasting models, outperforms existing state-of-the-art forecasting methods, including the winning models of the M5 point-forecasting competition.
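The TCV baseline described in the abstract above can be sketched as follows. This is the standard technique that H-Pro is compared against, not H-Pro itself. The expanding-window layout, the function names, and the user-supplied `fit_and_score` callable are assumptions made for illustration.

```python
def temporal_cv_splits(n_obs, n_folds, val_len):
    """Expanding-window splits: each fold trains on a prefix of the series
    and validates on the `val_len` observations that follow it."""
    first_val_start = n_obs - n_folds * val_len
    if first_val_start <= 0:
        raise ValueError("series too short for the requested folds")
    splits = []
    for k in range(n_folds):
        val_start = first_val_start + k * val_len
        splits.append((range(0, val_start), range(val_start, val_start + val_len)))
    return splits


def select_by_tcv(series, candidates, fit_and_score, n_folds=3, val_len=2):
    """Return the candidate with the lowest mean validation error.
    `fit_and_score(params, train, val)` is a hypothetical callable that fits
    a model with `params` on `train` and returns its error on `val`."""
    splits = temporal_cv_splits(len(series), n_folds, val_len)

    def mean_error(params):
        errors = [
            fit_and_score(params, [series[i] for i in tr], [series[i] for i in va])
            for tr, va in splits
        ]
        return sum(errors) / len(errors)

    return min(candidates, key=mean_error)


if __name__ == "__main__":
    for train_idx, val_idx in temporal_cv_splits(10, 3, 2):
        print(list(train_idx), list(val_idx))
```

Because every validation window is drawn from the training period, the selected model can still perform poorly when the test distribution shifts, which is the weakness the abstract attributes to TCV.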
Explainability has been widely stated as a cornerstone of the responsible and trustworthy use of machine learning models. With the ubiquitous use of Deep Neural Network (DNN) models expanding to risk-sensitive and safety-critical domains, many methods have been proposed to explain the decisions of these models. Recent years have also seen concerted efforts that have shown how such explanations can be distorted (attacked) by minor input perturbations. While there have been many surveys that review explainability methods themselves, there has been no effort hitherto to assimilate the different methods and metrics proposed to study the robustness of explanations of DNN models. In this work, we present a comprehensive survey of methods that study, understand, attack, and defend explanations of DNN models. We also present a detailed review of different metrics used to evaluate explanation methods, as well as describe attributional attack and defense methods. We conclude with lessons and take-aways for the community towards ensuring robust explanations of DNN model predictions.
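We present a novel formulation for the multi-robot task planning and allocation problem that incorporates (a) precedence relationships between tasks; (b) coordination on tasks, allowing multiple robots to achieve increased efficiency; and (c) cooperation through the formation of robot coalitions for tasks that individual robots cannot perform alone. In our formulation, the tasks and the relationships between them are specified by a task graph. We define a set of reward functions over the nodes and edges of the task graph. These functions model the effect of robot coalition size on task performance and incorporate the influence of one task's performance on a dependent task. Solving this problem optimally is NP-hard. However, the task graph formulation allows us to leverage min-cost network flow approaches to obtain approximate solutions efficiently. Additionally, we explore a mixed integer programming approach, which gives optimal solutions for small instances of the problem but is computationally expensive. We also develop a greedy heuristic algorithm as a baseline. Our modeling and solution approaches yield task plans that leverage task precedence relationships and robot coordination and cooperation to achieve high mission performance, even in large missions with many agents.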
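Previous studies of the perimeter defense game have largely focused on the fully observable setting, in which the true player states are known to all players. However, this is unrealistic for practical implementation, since defenders may have to perceive the intruders and estimate their states. In this work, we study the perimeter defense game in a photo-realistic simulator and in the real world, requiring defenders to estimate intruder states from vision. We train a machine-learning-based system for intruder pose detection with domain randomization; the system aggregates multiple views to reduce state estimation errors, and we adapt the defensive strategy to account for these errors. We introduce new performance metrics to evaluate vision-based perimeter defense. Through extensive experiments, we show that our approach improves state estimation and, ultimately, perimeter defense performance in both 1-defender-vs-1-intruder games and 2-defenders-vs-1-intruder games.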
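This study proposes a novel framework to estimate the economic, environmental, and social value of electrifying public transit buses in cities around the world, based on open-source data. Electric buses are a compelling candidate to replace diesel buses because of their environmental and social benefits. However, state-of-the-art models for evaluating the value of bus electrification are limited in applicability because they require granular, bespoke data on bus operations that can be difficult to procure. Our valuation tool uses the General Transit Feed Specification, a standard data format used by transit agencies worldwide, to provide high-level guidance on developing a prioritization strategy for electrifying a bus fleet. We develop physics-informed machine learning models to evaluate the energy consumption, carbon emissions, health impacts, and total cost of ownership for each transit route. We demonstrate the scalability of our tool with a case study of the bus lines in the Greater Boston and Milan metropolitan areas.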