Owing to their single-photon sensitivity, single-photon avalanche diode (SPAD) arrays have been widely applied in fields such as fluorescence lifetime imaging and quantum computing. However, large-scale high-fidelity single-photon imaging remains a major challenge, because SPAD arrays are difficult to fabricate and suffer from heavy noise. In this work, we introduce deep learning into SPAD imaging, enabling super-resolution single-photon imaging by over an order of magnitude, with significant enhancement of bit depth and imaging quality. We first studied the complex photon flow model of SPAD electronics to accurately characterize multiple physical noise sources, and collected a real SPAD image dataset (64 $\times$ 32 pixels, 90 scenes, 10 different bit depths, 3 different illumination flux levels, 2790 images in total) to calibrate the noise model parameters. With this real-world physical noise model, we synthesized, for the first time, a large-scale realistic single-photon image dataset (image pairs at 5 different resolutions, up to megapixel scale, 17250 scenes, 10 different bit depths, 3 different illumination flux levels, 2.6 million images in total) for subsequent network training. To tackle the severe super-resolution challenge posed by SPAD inputs with low bit depth, low resolution, and heavy noise, we further built a deep transformer network with a content-adaptive self-attention mechanism and gated fusion modules, which can mine global contextual features to remove multi-source noise and extract full-frequency details. We applied the technique in a series of experiments including macroscopic and microscopic imaging, microfluidic inspection, and Fourier ptychography. The experiments validate the technique's state-of-the-art super-resolution SPAD imaging performance, with a PSNR gain of more than 5 dB over existing methods.
translated by 谷歌翻译
We propose the first joint audio-video generation framework that delivers engaging watching and listening experiences simultaneously, aiming at high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block that bridges the two subnets, which enables efficient cross-modal alignment, so that the audio and video reinforce each other's fidelity. Extensive experiments show superior results in unconditional audio-video generation and in zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate a dominant preference for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion.
Accurate spatial-temporal traffic flow forecasting is essential for helping traffic managers take control measures and drivers choose optimal travel routes. Recently, graph convolutional networks (GCNs) have been widely used in traffic flow prediction owing to their powerful ability to capture spatial-temporal dependencies. The design of the spatial-temporal graph adjacency matrix is key to the success of GCNs, and it is still an open question. This paper proposes a traffic flow forecasting method that reconstructs the binary adjacency matrix via tensor decomposition. First, we reformulate the spatial-temporal fusion graph adjacency matrix into a three-way adjacency tensor. Then, we reconstruct the adjacency tensor via Tucker decomposition, so that more informative and global spatial-temporal dependencies are encoded. Finally, a Spatial-temporal Synchronous Graph Convolutional module for learning localized spatial-temporal correlations and a Dilated Convolution module for learning global correlations are assembled to aggregate and learn the comprehensive spatial-temporal dependencies of the road network. Experimental results on four open-access datasets demonstrate that the proposed model outperforms state-of-the-art approaches in terms of both prediction performance and computational cost.
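The reconstruction step can be illustrated with a minimal sketch. It assumes a truncated higher-order SVD (HOSVD), which is one common way to compute a Tucker decomposition; the abstract does not say which algorithm the authors use. The tensor shape, the ranks, and the function names `unfold`, `mode_dot`, and `tucker_hosvd` are all hypothetical choices for illustration.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_dot(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape J x I_mode) along axis `mode`."""
    moved = np.moveaxis(tensor, mode, -1)   # (..., I_mode)
    result = moved @ matrix.T               # (..., J)
    return np.moveaxis(result, -1, mode)


def tucker_hosvd(tensor, ranks):
    """Truncated HOSVD: returns (core, factors, low-rank reconstruction)."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding form the factor.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    # Project the tensor onto the factor subspaces to get the core.
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_dot(core, u.T, mode)
    # Expand the core back to the original shape.
    approx = core
    for mode, u in enumerate(factors):
        approx = mode_dot(approx, u, mode)
    return core, factors, approx


# Toy binary adjacency tensor (assumed layout: 3 slices of a 4-node graph).
rng = np.random.default_rng(0)
adjacency = (rng.random((3, 4, 4)) > 0.5).astype(float)
core, factors, approx = tucker_hosvd(adjacency, ranks=(2, 3, 3))
```

The reconstruction `approx` is real-valued rather than binary, which matches the idea that the reconstructed tensor can carry graded, more global dependency information than the original 0/1 entries.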
The effective application of contrastive learning in natural language processing demonstrates its strength in text analysis tasks. Constructing positive and negative samples correctly and reasonably is the core challenge of contrastive learning. Because it is difficult to construct contrastive objects in multi-label multi-classification tasks, few contrastive losses exist for multi-label multi-classification text classification. In this paper, we propose five contrastive losses for multi-label multi-classification tasks: Strict Contrastive Loss (SCL), Intra-label Contrastive Loss (ICL), Jaccard Similarity Contrastive Loss (JSCL), Jaccard Similarity Probability Contrastive Loss (JSPCL), and Stepwise Label Contrastive Loss (SLCL). We explore the effectiveness of contrastive learning for multi-label multi-classification tasks under different strategies, and provide a set of baseline methods for contrastive learning techniques on multi-label classification tasks. We also perform an interpretability analysis of our approach to show how the different contrastive learning methods play their roles. The experimental results demonstrate that the proposed contrastive losses bring some improvement on multi-label multi-classification tasks. Our work reveals that "appropriately" changing how samples are contrasted is the key to improving the adaptability of contrastive learning in multi-label multi-classification tasks.
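As a small illustration of the quantity behind the Jaccard-based losses, the sketch below computes the Jaccard similarity between the label sets of two samples. The abstract does not specify how JSCL or JSPCL use this value, so this shows only the standard similarity itself; the function name, the empty-set convention, and the example labels are hypothetical.

```python
def jaccard_similarity(labels_a, labels_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two label collections."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        # Assumed convention: two samples with no labels count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two multi-label samples sharing one of three distinct labels.
similarity = jaccard_similarity({"joy", "surprise"}, {"joy", "anger"})
```

Unlike a strict match on identical label sets, this value varies smoothly between 0 and 1, so partially overlapping samples can be treated as partially positive pairs.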
Recently, webly supervised learning (WSL) has been studied as a way to leverage abundant and accessible data from the Internet. Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the differences between the web domain and the real-world domain. However, only by closing this performance gap can we fully exploit the practical value of web datasets. To this end, we propose a Few-shot guided Prototypical (FoPro) representation learning method, which needs only a few labeled real-world examples and can significantly improve performance in the real-world domain. Specifically, we initialize each class center with few-shot real-world data as the "realistic" prototype. Then, the intra-class distance between web instances and "realistic" prototypes is narrowed by contrastive learning. Finally, we measure the image-prototype distance with a learnable metric. Prototypes are polished by adjacent high-quality web images and are used to remove distant out-of-distribution samples. In experiments, FoPro is trained on web datasets under the guidance of a few real-world examples and evaluated on real-world datasets. Our method achieves state-of-the-art performance on three fine-grained datasets and two large-scale datasets. Compared with existing WSL methods under the same few-shot settings, FoPro still excels in real-world generalization. Code is available at https://github.com/yuleiqin/fopro.
Pre-trained language models (PLMs) are known to improve the generalization performance of natural language understanding models by leveraging large amounts of data during the pre-training phase. However, the out-of-distribution (OOD) generalization problem remains a challenge in many NLP tasks, limiting the real-world deployment of these methods. This paper presents the first attempt at creating a unified benchmark named GLUE-X for evaluating OOD robustness in NLP models, highlighting the importance of OOD robustness and providing insights on how to measure the robustness of a model and how to improve it. The benchmark includes 13 publicly available datasets for OOD testing, and evaluations are conducted on 8 classic NLP tasks over 19 popularly used PLMs. Our findings confirm the need for improved OOD accuracy in NLP tasks, as significant performance degradation was observed in all settings compared to in-distribution (ID) accuracy.
Most semantic communication systems leverage deep learning models to provide end-to-end transmission performance surpassing established source and channel coding approaches. So far, research has mainly focused on architecture and model improvements, but a model trained over a full dataset and ergodic channel responses is unlikely to be optimal for every test instance. Due to limited model capacity and imperfect optimization and generalization, such learned models will be suboptimal, especially when the testing data distribution or channel response differs from that in the training phase, as is likely to be the case in practice. To tackle this, we propose a novel semantic communication paradigm that leverages the overfitting property of deep learning models. For instance, our model can be updated after deployment, which can lead to substantial gains in transmission rate-distortion (RD) performance. This new system is named adaptive semantic communication (ASC). In our ASC system, the wirelessly transmitted stream includes both the semantic representations of the source data and the adapted decoder model parameters. Specifically, we take the overfitting concept to the extreme, proposing a series of methods to adapt the semantic codec or representations to an individual data or channel state instance. The whole ASC system design is formulated as an optimization problem whose goal is to minimize a loss function that is a tripartite tradeoff among the data rate, model rate, and distortion terms. Experiments (including a user study) verify the effectiveness and efficiency of our ASC system. Notably, the substantial gain of our overfitted coding paradigm can help advance semantic communication to a new era.
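One plausible way to write the tripartite tradeoff is sketched below. The additive form, the single weight $\lambda$, and the symbols $R_{\mathrm{data}}$ (rate spent on semantic representations), $R_{\mathrm{model}}$ (rate spent on adapted decoder parameters), and $D$ (distortion) are assumptions for illustration; the abstract states only that the loss trades off these three terms.

```latex
\mathcal{L} = R_{\mathrm{data}} + R_{\mathrm{model}} + \lambda D
```

Under this reading, adapting the decoder to a single instance pays off only when the reduction in $R_{\mathrm{data}} + \lambda D$ outweighs the extra $R_{\mathrm{model}}$ needed to transmit the updated parameters.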
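Solar flares, especially M- and X-class flares, are often associated with coronal mass ejections (CMEs). They are the most important sources of space weather effects and can severely impact the near-Earth environment. It is therefore essential to forecast flares (especially X-class flares) in order to mitigate their destructive and hazardous consequences. Here, we introduce several statistical and machine learning approaches for predicting the flare index (FI) of an active region (AR), which quantifies the flare productivity of an AR by taking into account the numbers of flares of different classes within a certain time interval. Specifically, our sample includes 563 ARs that appeared on the solar disk from May 2010 to December 2017. The 25 magnetic parameters provided by the Space-weather HMI Active Region Patches (SHARP) from the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO) characterize, by proxy, the coronal magnetic energy stored in ARs and are used as predictors. We investigate the relationship between these SHARP parameters and the FI of ARs with a machine learning algorithm (spline regression) and a resampling method (Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise, SMOGN for short). Based on the established relationship, we are able to predict the FI value of a given AR for the next 1-day period. Compared with 4 other popular machine learning algorithms, our method improves the accuracy of FI prediction, especially for large FI. In addition, we rank the importance of the SHARP parameters using the Borda count method, computed from the rankings produced by 9 different machine learning methods.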
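The Borda count aggregation of several rankings can be sketched as follows. This is the standard Borda scheme, in which an item ranked at position p among n items earns n - 1 - p points; the exact variant and tie-breaking rule used by the authors are not given in the abstract. The function name, the alphabetical tie-break, and the placeholder names `param_a`, `param_b`, and `param_c` are hypothetical, and the toy rankings are not results from the paper.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate rankings (best item first) into one consensus ordering."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # Top position earns n - 1 points, last position earns 0.
            scores[item] += n - 1 - position
    # Highest total score first; ties broken alphabetically (an assumption).
    return sorted(scores, key=lambda item: (-scores[item], item))


# Toy importance rankings from three hypothetical models.
rankings = [
    ["param_a", "param_b", "param_c"],
    ["param_b", "param_a", "param_c"],
    ["param_a", "param_c", "param_b"],
]
consensus = borda_count(rankings)
```

Because each model contributes only an ordering, the aggregate is insensitive to the different importance scales that different learning methods produce.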
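The relationship between perceptual loudness and the physical attributes of sound is an important subject in both computer music and psychoacoustics. Early studies of the "equal-loudness contour" date back to the 1920s, and the measured loudness with respect to intensity and frequency has been revised many times since then. However, most studies focus only on synthesized sound, and the resulting theories have rarely been validated on natural tones. To this end, we investigate both the theory and applications of natural-tone loudness perception by modeling piano tones. The theory part contains: 1) an accurate measurement of the piano-tone equal-loudness contour across pitches, and 2) a machine learning model, trained on human subject measurements, that can infer loudness purely from spectral features. As for the application, we apply our theory to piano control transfer, in which we adjust the MIDI velocities on two different player pianos (in different acoustic environments) to achieve the same perceptual effect. Experiments show that both our theoretical loudness modeling and the corresponding performance control transfer algorithm significantly outperform their baselines.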
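Recent work on multi-modal emotion recognition has moved towards end-to-end models, which, unlike the two-phase pipeline, can extract task-specific features supervised by the target task. However, previous methods model only the feature interactions between the textual modality and either the acoustic or the visual modality, and fail to capture the feature interactions between the acoustic and visual modalities. In this paper, we propose the multi-modal end-to-end transformer (ME2ET), which can effectively model the tri-modal feature interactions among the textual, acoustic, and visual modalities at both the low level and the high level. At the low level, we propose progressive tri-modal attention, which models the tri-modal feature interactions by adopting a two-pass strategy and further leverages these interactions to significantly reduce computation and memory complexity by reducing the input token length. At the high level, we introduce a tri-modal feature fusion layer to explicitly aggregate the semantic representations of the three modalities. Experimental results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET achieves state-of-the-art performance. Further in-depth analysis demonstrates the effectiveness, efficiency, and interpretability of the proposed progressive tri-modal attention, which helps our model achieve better performance while significantly reducing computation and memory cost. Our code will be publicly available.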