We propose a novel framework for constructing linear time-invariant (LTI) models for a data-driven representation of the Koopman operator of a class of stable nonlinear dynamics. The Koopman operator (generator) lifts a finite-dimensional nonlinear system to a possibly infinite-dimensional linear feature space. To utilize it for modeling, one needs to discover finite-dimensional representations of the Koopman operator. Learning suitable features is challenging, as one needs to learn Koopman-invariant features (which evolve linearly under the dynamics) as well as relevant features (which span the original state) - a generally unsupervised learning task. For a theoretically well-founded solution to this problem, we propose learning Koopman-invariant coordinates by composing a diffeomorphic learner with a lifted aggregate system of a latent linear model. Using an unconstrained parameterization of stable matrices along with the aforementioned feature construction, we learn the Koopman operator features without assuming a predefined library of functions or knowing the spectrum, while ensuring stability regardless of the operator approximation accuracy. We demonstrate the superior efficacy of the proposed method over state-of-the-art methods on the well-known LASA handwriting dataset.
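The abstract does not spell out the unconstrained stable-matrix parameterization. As a hedged illustration only, the PyTorch sketch below uses one standard construction (forcing the symmetric part of A to be negative definite, which guarantees a Hurwitz matrix), not necessarily the parameterization used in the paper; the class name and Euler step are hypothetical.

```python
import torch
import torch.nn as nn

class StableLinearLatent(nn.Module):
    """Latent LTI model dz/dt = A z with A Hurwitz by construction."""
    def __init__(self, dim: int, eps: float = 1e-3):
        super().__init__()
        self.M = nn.Parameter(torch.randn(dim, dim) * 0.1)  # source of the skew part
        self.D = nn.Parameter(torch.randn(dim, dim) * 0.1)  # source of the damping part
        self.eps = eps

    def A(self) -> torch.Tensor:
        # A = (M - M^T) - D D^T - eps*I gives A + A^T = -2 D D^T - 2*eps*I < 0,
        # so every eigenvalue of A has negative real part for ANY values of M, D.
        skew = self.M - self.M.T
        damp = self.D @ self.D.T + self.eps * torch.eye(self.M.shape[0])
        return skew - damp

    def step(self, z: torch.Tensor, dt: float = 0.01) -> torch.Tensor:
        # One explicit-Euler step of the stable latent dynamics.
        return z + dt * z @ self.A().T
```

Because stability holds for any parameter values, gradient-based training can optimize M and D freely while the latent model remains stable throughout.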
Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, in which the agent dynamically harmonizes modality-invariant and modality-specific representations during the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noises.
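The reward is described only as "directly related to word error rate". A minimal sketch of such a reward, and of a per-step convex fusion of the two representation types, is given below under assumptions: the `jiwer` package stands in for whatever WER implementation the authors used, and the agent-emitted weight `alpha` is a hypothetical simplification of the paper's integration strategy.

```python
import torch
import jiwer

def wer_reward(hypothesis: str, reference: str) -> float:
    """Reward inversely related to word error rate, clipped to [0, 1]."""
    return max(0.0, 1.0 - jiwer.wer(reference, hypothesis))

def fuse(invariant: torch.Tensor, specific: torch.Tensor,
         alpha: torch.Tensor) -> torch.Tensor:
    """Convex combination of modality-invariant and visual modality-specific
    features; alpha in (0, 1) is emitted per decoding step by the RL agent."""
    return alpha * invariant + (1.0 - alpha) * specific
```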
Estimating forest aboveground biomass (AGB) at large scales and fine spatial resolution has become increasingly important for greenhouse gas accounting, monitoring, and verification efforts to mitigate climate change. Airborne LiDAR is highly valuable for modeling attributes of forest structure, including AGB, yet most LiDAR collections take place at local or regional scales, covering irregular, discontinuous footprints and resulting in a patchwork of different landscape segments captured at various points in time. Here, as part of a statewide forest carbon assessment for New York State (USA), we address the challenges of leveraging such a LiDAR patchwork for AGB mapping at landscape scales, including selection of training data, investigation of regional or coverage-specific patterns of prediction error, and producing maps consistent with field inventories at multiple scales. Three machine learning algorithms and an ensemble model were trained on FIA field measurements, airborne LiDAR, and topographic, climatic, and other geospatial predictors. Using a set of strict plot-selection criteria, 801 FIA plots were selected along with co-located point clouds drawn from a patchwork of 17 leaf-on LiDAR coverages (2014-2019). Our ensemble model was used to generate a 30 m AGB prediction surface within a predicted domain of applicability (98% of the LiDAR coverage area), and the resulting AGB map was compared with FIA plot-level and areal estimates. Our model was accurate overall (%RMSE 22-45%; MAE 11.6-29.4 Mg ha$^{-1}$; ME 2.4-6.3 Mg ha$^{-1}$), explained 73-80% of field-observed variation, and yielded estimates consistent with FIA's design-based estimates (89% of estimates within the FIA 95% CI). We share practical solutions to the challenges of using spatio-temporal patchworks of LiDAR to meet growing AGB mapping needs in support of forest carbon accounting and ecosystem applications.
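The abstract names three machine learning algorithms plus an ensemble but not the specific learners or the ensembling scheme. As a hedged sketch, one plausible stacked setup in scikit-learn is shown below; the choice of base learners, the Ridge meta-learner, and the feature/target variables are all assumptions, not the paper's configuration.

```python
from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor

ensemble = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=500)),
        ("gbm", GradientBoostingRegressor()),
        ("mlp", MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000)),
    ],
    final_estimator=Ridge(),  # meta-learner combining base predictions
    cv=5,                     # out-of-fold predictions used for stacking
)
# Hypothetical usage: X holds plot-level LiDAR/terrain/climate metrics,
# y holds FIA-measured AGB in Mg/ha.
# ensemble.fit(X_train, y_train)
# agb_predictions = ensemble.predict(X_grid)
```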
Novel plant communities reshape landscapes and pose challenges for land cover classification and mapping that can constrain research and stewardship efforts. In the US Northeast, the emergence of low-statured woody vegetation, or shrublands, instead of secondary forests in post-agricultural landscapes is well documented by field studies but poorly understood from a landscape perspective, which limits the ability to systematically study and manage these lands. To address gaps in the classification and mapping of low-statured cover types where they have historically been rare, we developed models to predict shrubland distributions at 30 m resolution across New York State (NYS), using a stacked ensemble combining a random forest, gradient boosting machine, and artificial neural network to integrate remote sensing of structural (airborne LiDAR) and optical (satellite imagery) properties of vegetation cover. We first classified a 1 m canopy height model (CHM), derived from a patchwork of available LiDAR coverages, to define shrubland presence/absence. Next, these non-contiguous maps were used to train a model ensemble based on temporally segmented imagery to predict shrubland probability for the entire study landscape (NYS). Approximately 2.5% of the CHM coverage area was classified as shrubland. Models using Landsat predictors trained on the classified CHM were effective at identifying shrubland (test-set AUC=0.893, real-world AUC=0.904) and at discriminating between shrub/young forest and other cover classes, and they produced qualitatively sensible maps, even when extending beyond the original training data. Our results suggest that incorporating airborne LiDAR, even from a discontinuous patchwork of coverages, can improve land cover classification of historically rare but increasingly prevalent shrubland habitats across broader areas.
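The height thresholds used to turn the 1 m CHM into shrubland presence/absence are not given in the abstract. The sketch below is illustrative only: the 1-4 m band for "low-statured woody vegetation" and the file name are hypothetical assumptions.

```python
import numpy as np
import rasterio

with rasterio.open("chm_1m.tif") as src:   # hypothetical CHM raster
    chm = src.read(1)                       # canopy height in metres

LOW, HIGH = 1.0, 4.0                        # assumed shrub height band (m)
shrub = (chm >= LOW) & (chm < HIGH)         # presence/absence mask
print(f"shrubland fraction of CHM coverage: {shrub.mean():.3%}")
```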
We address the overlooked unbiasedness in existing long-tailed classification methods: we find that their overall improvement is mostly attributed to a biased preference for the tail over the head, since the test distribution is assumed to be balanced; however, when the test is as imbalanced as the long-tailed training data - letting the test respect Zipf's natural law - the tail bias is no longer beneficial, because it hurts the head majority. In this paper, we propose Cross-Domain Empirical Risk Minimization (xERM) for training an unbiased model that achieves strong performance on both test distributions, which empirically demonstrates that xERM fundamentally improves classification by learning better feature representations rather than playing a head-vs.-tail game. Based on causality, we further theoretically explain why xERM achieves unbiasedness: the bias caused by domain selection is removed by adjusting the empirical risks on the imbalanced domain and the balanced but unseen domain. Code is available at https://github.com/beierzhu/xerm.
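As a hedged sketch of the general idea only - combining empirical risks from an imbalanced domain and a balanced domain - the loss below uses a fixed convex combination, with inverse-frequency class reweighting standing in for the balanced-domain risk. The paper's actual cross-domain adjustment is more involved; `alpha` and the weighting are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def xerm_style_loss(logits: torch.Tensor, targets: torch.Tensor,
                    class_counts: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """alpha * imbalanced-domain risk + (1 - alpha) * balanced-domain risk.

    class_counts: float tensor of per-class training frequencies.
    """
    risk_imb = F.cross_entropy(logits, targets)  # plain ERM on long-tailed data
    # Inverse-frequency weights approximate ERM under a balanced distribution.
    weights = class_counts.sum() / (len(class_counts) * class_counts)
    risk_bal = F.cross_entropy(logits, targets, weight=weights)
    return alpha * risk_imb + (1.0 - alpha) * risk_bal
```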
With the advent of Neural Style Transfer (NST), stylizing an image has become quite popular. A convenient way to extend stylization techniques to videos is to apply them on a per-frame basis. However, such per-frame application usually lacks temporal consistency, which manifests as undesirable flickering artifacts. Most of the existing approaches for enforcing temporal consistency suffer from one or more of the following drawbacks: they (1) are only suitable for a limited range of stylization techniques, (2) can only be applied in an offline fashion requiring the complete video as input, (3) cannot provide consistency for the task of stylization, or (4) do not provide interactive consistency control. Note that existing consistent video-filtering approaches aim to completely remove flickering artifacts and thus do not respect any specific consistency-control aspect. For stylization tasks, however, consistency control is an essential requirement, as a certain amount of flickering can add to the artistic look and feel. Moreover, making this control interactive is paramount from a usability perspective. To achieve the above requirements, we propose an approach that can stylize video streams while providing interactive consistency control. Apart from stylization, our approach also supports various other image processing filters. To achieve interactive performance, we develop a lite optical-flow network that operates at 80 frames per second (FPS) on desktop systems with sufficient accuracy. We show that the final consistent video output using our flow network is comparable to that obtained using a state-of-the-art optical-flow network. Further, we employ an adaptive combination of local and global consistent features and enable interactive selection between the two. Through objective and subjective evaluation, we show that our method is superior to state-of-the-art approaches.
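A minimal sketch of one common flow-based consistency scheme matching the description above, with OpenCV's Farneback flow standing in for the paper's lite flow network; the blending rule and the `consistency` knob are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
import cv2

def consistent_frame(prev_stylized, curr_stylized, prev_gray, curr_gray,
                     consistency=0.5):
    """Blend the flow-warped previous stylized frame with the current one.

    consistency in [0, 1] is the interactive control knob:
    0 = pure per-frame look (maximal flicker), 1 = maximally stable.
    """
    # Backward flow (current -> previous) lets us warp the previous stylized
    # frame into alignment with the current frame via cv2.remap.
    flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = curr_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped_prev = cv2.remap(prev_stylized, map_x, map_y, cv2.INTER_LINEAR)
    return cv2.addWeighted(warped_prev, consistency,
                           curr_stylized, 1.0 - consistency, 0)
```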
In this work, we address the problem of unsupervised moving object segmentation (MOS) in 4D LiDAR data recorded from a stationary sensor, where no ground truth annotations are involved. Deep learning-based state-of-the-art methods for LiDAR MOS strongly depend on annotated ground truth data, which is expensive to obtain and scarce. To close this gap in the stationary setting, we propose a novel 4D LiDAR representation based on multivariate time series that relaxes the problem of unsupervised MOS to a time series clustering problem. More specifically, we propose modeling the change in occupancy of a voxel by a multivariate occupancy time series (MOTS), which captures spatio-temporal occupancy changes at the voxel level and in its surrounding neighborhood. To perform unsupervised MOS, we train a neural network in a self-supervised manner to encode MOTS into voxel-level feature representations, which can be partitioned by a clustering algorithm into moving and stationary. Experiments on stationary scenes from the Raw KITTI dataset show that our fully unsupervised approach achieves performance comparable to that of supervised state-of-the-art approaches.
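A hedged sketch of how a multivariate occupancy time series might be assembled for one voxel, assuming a binarized occupancy grid over T scans and a 6-neighbourhood; the grid shape, neighbourhood choice, and lack of boundary handling are illustrative assumptions, and the paper's actual MOTS construction may differ.

```python
import numpy as np

def mots(occupancy: np.ndarray, x: int, y: int, z: int) -> np.ndarray:
    """Build a (T, 7) occupancy time series for an interior voxel:
    its own occupancy over T scans plus that of its 6 face-neighbours.

    occupancy: bool array of shape (T, X, Y, Z); no boundary handling here.
    """
    offsets = [(0, 0, 0), (1, 0, 0), (-1, 0, 0),
               (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    series = [occupancy[:, x + dx, y + dy, z + dz] for dx, dy, dz in offsets]
    return np.stack(series, axis=1).astype(np.float32)
```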
Implicit Neural Representations (INRs) have recently been shown to be a powerful tool for high-quality video compression. However, existing works are limited in that they do not explicitly exploit the temporal redundancy in videos, leading to long encoding times. Additionally, these methods have fixed architectures that do not scale to longer videos or higher resolutions. To address these issues, we propose NIRVANA, which treats videos as groups of frames and fits separate networks to each group, performing patch-wise prediction. This design shares computation within each group, in both the spatial and temporal dimensions, resulting in reduced video encoding time. The video representation is modeled autoregressively, with the network fit on the current group initialized using weights from the previous group's model. To further enhance efficiency, we quantize the network parameters during training, requiring no post-hoc pruning or quantization. Compared with previous works on the benchmark UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 (in terms of PSNR) and encoding speed by 12X, while maintaining the same compression rate. In contrast to prior video INR works, which struggle with larger resolutions and longer videos, we show that our algorithm is highly flexible and scales naturally thanks to its patch-wise and autoregressive designs. Moreover, our method achieves variable-bitrate compression by adapting to videos with varying inter-frame motion. NIRVANA achieves 6X faster decoding and scales well with more GPUs, making it practical for various deployment scenarios.
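A hedged sketch of the autoregressive group-wise fitting loop described above; `frame_groups`, `make_model`, and `fit_group` are hypothetical placeholders for the paper's actual components.

```python
import copy

def encode_video(frame_groups, make_model, fit_group):
    """Fit one small network per group of frames, warm-starting each
    from the previous group's weights (the autoregressive design)."""
    models, prev = [], None
    for group in frame_groups:
        model = make_model() if prev is None else copy.deepcopy(prev)
        fit_group(model, group)  # patch-wise regression on this group,
                                 # with quantization applied during training
        models.append(model)
        prev = model             # warm-start the next group
    return models
```

Warm-starting exploits temporal redundancy: consecutive groups look alike, so each network starts near a good solution and converges in far fewer steps than training from scratch.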
This work proposes a view of probability as a relative measure rather than an absolute one. To demonstrate this concept, we focus on finite outcome spaces and develop three fundamental axioms that establish requirements for relative probability functions. We then provide a library of examples of these functions and a system for composing them. Additionally, we discuss a relative version of Bayesian inference and its digital implementation. Finally, we prove the topological closure of the relative probability space, highlighting its ability to preserve information under limits.
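The three axioms are not stated in the abstract. As a hedged illustration of the basic object only - not the paper's axioms - a relative probability on a finite outcome space can be pictured as an odds function that any ordinary distribution (when one exists) induces:

```latex
% Illustrative only: for outcomes a, b, c in a finite space \Omega,
% an absolute distribution P induces a relative measure
R(a,b) = \frac{P(a)}{P(b)}, \qquad R(a,a) = 1, \qquad R(a,b)\,R(b,c) = R(a,c),
% with the last identity holding whenever the ratios are defined. Relative
% probability keeps R while dropping the normalization that defines P.
```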
Deep spiking neural networks (SNNs) offer the promise of low-power artificial intelligence. However, training deep SNNs from scratch or converting deep artificial neural networks to SNNs without loss of performance has been a challenge. Here we propose an exact mapping from a network with Rectified Linear Units (ReLUs) to an SNN that fires exactly one spike per neuron. For our constructive proof, we assume that an arbitrary multi-layer ReLU network, with or without convolutional layers, batch normalization, and max pooling layers, was trained to high performance on some training set. Furthermore, we assume that we have access to a representative example of the input data used during training and to the exact parameters (weights and biases) of the trained ReLU network. The mapping from deep ReLU networks to SNNs causes a zero percent drop in accuracy on CIFAR10, CIFAR100, and the ImageNet-like data sets Places365 and PASS. More generally, our work shows that an arbitrary deep ReLU network can be replaced by an energy-efficient single-spike neural network without any loss of performance.
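A hedged sketch of the core time-to-first-spike idea behind single-spike coding (a larger ReLU activation maps to an earlier spike, so each neuron fires exactly once); the linear time code below is an illustration, not the paper's exact neuron model or mapping.

```python
import numpy as np

T_MAX = 1.0  # assumed coding window, in arbitrary time units

def relu_to_spike_time(a: np.ndarray, a_max: float) -> np.ndarray:
    """Map nonnegative activations in [0, a_max] to spike times in [0, T_MAX];
    a zero activation fires at the end of the window."""
    a = np.clip(a, 0.0, a_max)
    return T_MAX * (1.0 - a / a_max)

def spike_time_to_relu(t: np.ndarray, a_max: float) -> np.ndarray:
    """Invert the code: an earlier spike decodes to a larger activation."""
    return a_max * (1.0 - t / T_MAX)
```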