视频识别的标准方法通常在完整的输入视频上运行,由于视频中的时空冗余率广泛,因此效率低下。蒙版视频建模(即视频)的最新进展表明,香草视觉变压器(VIT)仅具有有限的视觉内容来补充时空上下文的能力。受到这一点的启发,我们提出了建议的蒙版动作识别(MAR),该识别(MAR)通过丢弃一定比例的补丁并仅在视频的一部分上操作来减少冗余计算。 MAR包含以下两个必不可少的组件:单元运行掩盖和桥接分类器。具体而言,为了使VIT轻松地感知细节以外的细节,并且会呈现单元格的掩蔽,以保留视频中的时空相关性,从而确保可以在同一空间位置观察到在同一空间位置的贴片,以便轻松地重建。此外,我们注意到,尽管部分观察到的特征可以重建语义上明确的隐形贴片,但它们无法实现准确的分类。为了解决这个问题,提出了一个桥接分类器,以弥合重建的VIT编码功能与专门用于分类的功能之间的语义差距。我们提出的MAR将VIT的计算成本降低了53%,并且广泛的实验表明,MAR始终以明显的边距优于现有的VIT模型。尤其是,我们发现由MAR训练的Vit-Lage胜过由标准培训方案训练的Vit-Bugue,这是通过说服Kinetics-400和某些v2数据集中的利润率,而VIT-LARGE的计算开销仅为14.5%。维特(Vit-Huge)。
translated by 谷歌翻译
空间卷积广泛用于许多深度视频模型。它基本上假设了时空不变性,即,使用不同帧中的每个位置的共享权重。这项工作提出了用于视频理解的时间 - 自适应卷积(Tadaconv),这表明沿着时间维度的自适应权重校准是促进在视频中建模复杂的时间动态的有效方法。具体而言,Tadaconv根据其本地和全局时间上下文校准每个帧的卷积权重,使空间卷积具有时间建模能力。与先前的时间建模操作相比,Tadaconv在通过卷积内核上运行而不是特征,其维度是比空间分辨率小的数量级更有效。此外,内核校准还具有增加的模型容量。通过用Tadaconv替换Reset中的空间互联网来构建坦达2D网络,这与多个视频动作识别和定位基准测试的最先进方法相比,导致PAR或更好的性能。我们还表明,作为可忽略的计算开销的容易插入操作,Tadaconv可以有效地改善许多具有令人信服的边距的现有视频模型。 HTTPS://github.com/alibaba-mmai-research/pytorch-video -Undersing提供代码和模型。
translated by 谷歌翻译
对比学习的核心思想是区分不同的实例,并从相同实例中强制不同的视图以共享相同的表示。为了避免琐碎的解决方案,增强在生成不同视图中起重要作用,其中显示了随机裁剪来对模型来学习广义和鲁棒的表示。常用的随机作物操作保持沿着训练过程不变的两个视图之间的分布。在这项工作中,我们表明,自适应地控制沿着训练过程的两个增强视图之间的视差增强了学习的表示的质量。具体而言,我们提出了一种参数立方裁剪操作,用于视频对比度学习,其通过可分辨率的3D仿射变换自动批量3D立方。参数使用对抗目标与视频骨干同时培训,并从数据中学习最佳裁剪策略。可视化表明,参数自适应地控制了两个增强视图之间的中心距离和IOU,并且沿着训练过程的差异中的学习变化是有利于学习强烈的表示。广泛的消融研究证明了所提出的参数对多个对比学习框架和视频骨干的有效性。可以使用代码和模型。
translated by 谷歌翻译
Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.
translated by 谷歌翻译
Supervised Deep-Learning (DL)-based reconstruction algorithms have shown state-of-the-art results for highly-undersampled dynamic Magnetic Resonance Imaging (MRI) reconstruction. However, the requirement of excessive high-quality ground-truth data hinders their applications due to the generalization problem. Recently, Implicit Neural Representation (INR) has appeared as a powerful DL-based tool for solving the inverse problem by characterizing the attributes of a signal as a continuous function of corresponding coordinates in an unsupervised manner. In this work, we proposed an INR-based method to improve dynamic MRI reconstruction from highly undersampled k-space data, which only takes spatiotemporal coordinates as inputs. Specifically, the proposed INR represents the dynamic MRI images as an implicit function and encodes them into neural networks. The weights of the network are learned from sparsely-acquired (k, t)-space data itself only, without external training datasets or prior images. Benefiting from the strong implicit continuity regularization of INR together with explicit regularization for low-rankness and sparsity, our proposed method outperforms the compared scan-specific methods at various acceleration factors. E.g., experiments on retrospective cardiac cine datasets show an improvement of 5.5 ~ 7.1 dB in PSNR for extremely high accelerations (up to 41.6-fold). The high-quality and inner continuity of the images provided by INR has great potential to further improve the spatiotemporal resolution of dynamic MRI, without the need of any training data.
translated by 谷歌翻译
The counting task, which plays a fundamental rule in numerous applications (e.g., crowd counting, traffic statistics), aims to predict the number of objects with various densities. Existing object counting tasks are designed for a single object class. However, it is inevitable to encounter newly coming data with new classes in our real world. We name this scenario as \textit{evolving object counting}. In this paper, we build the first evolving object counting dataset and propose a unified object counting network as the first attempt to address this task. The proposed model consists of two key components: a class-agnostic mask module and a class-increment module. The class-agnostic mask module learns generic object occupation prior via predicting a class-agnostic binary mask (e.g., 1 denotes there exists an object at the considering position in an image and 0 otherwise). The class-increment module is used to handle new coming classes and provides discriminative class guidance for density map prediction. The combined outputs of class-agnostic mask module and image feature extractor are used to predict the final density map. When new classes come, we first add new neural nodes into the last regression and classification layers of this module. Then, instead of retraining the model from scratch, we utilize knowledge distilling to help the model remember what have already learned about previous object classes. We also employ a support sample bank to store a small number of typical training samples of each class, which are used to prevent the model from forgetting key information of old data. With this design, our model can efficiently and effectively adapt to new coming classes while keeping good performance on already seen data without large-scale retraining. Extensive experiments on the collected dataset demonstrate the favorable performance.
translated by 谷歌翻译
In this paper, we consider an intelligent reflecting surface (IRS)-aided cell-free massive multiple-input multiple-output system, where the beamforming at access points and the phase shifts at IRSs are jointly optimized to maximize energy efficiency (EE). To solve EE maximization problem, we propose an iterative optimization algorithm by using quadratic transform and Lagrangian dual transform to find the optimum beamforming and phase shifts. However, the proposed algorithm suffers from high computational complexity, which hinders its application in some practical scenarios. Responding to this, we further propose a deep learning based approach for joint beamforming and phase shifts design. Specifically, a two-stage deep neural network is trained offline using the unsupervised learning manner, which is then deployed online for the predictions of beamforming and phase shifts. Simulation results show that compared with the iterative optimization algorithm and the genetic algorithm, the unsupervised learning based approach has higher EE performance and lower running time.
translated by 谷歌翻译
With the ever-growing model size and the limited availability of labeled training data, transfer learning has become an increasingly popular approach in many science and engineering domains. For classification problems, this work delves into the mystery of transfer learning through an intriguing phenomenon termed neural collapse (NC), where the last-layer features and classifiers of learned deep networks satisfy: (i) the within-class variability of the features collapses to zero, and (ii) the between-class feature means are maximally and equally separated. Through the lens of NC, our findings for transfer learning are the following: (i) when pre-training models, preventing intra-class variability collapse (to a certain extent) better preserves the intrinsic structures of the input data, so that it leads to better model transferability; (ii) when fine-tuning models on downstream tasks, obtaining features with more NC on downstream data results in better test accuracy on the given task. The above results not only demystify many widely used heuristics in model pre-training (e.g., data augmentation, projection head, self-supervised learning), but also leads to more efficient and principled fine-tuning method on downstream tasks that we demonstrate through extensive experimental results.
translated by 谷歌翻译
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
translated by 谷歌翻译
Time series anomaly detection strives to uncover potential abnormal behaviors and patterns from temporal data, and has fundamental significance in diverse application scenarios. Constructing an effective detection model usually requires adequate training data stored in a centralized manner, however, this requirement sometimes could not be satisfied in realistic scenarios. As a prevailing approach to address the above problem, federated learning has demonstrated its power to cooperate with the distributed data available while protecting the privacy of data providers. However, it is still unclear that how existing time series anomaly detection algorithms perform with decentralized data storage and privacy protection through federated learning. To study this, we conduct a federated time series anomaly detection benchmark, named FedTADBench, which involves five representative time series anomaly detection algorithms and four popular federated learning methods. We would like to answer the following questions: (1)How is the performance of time series anomaly detection algorithms when meeting federated learning? (2) Which federated learning method is the most appropriate one for time series anomaly detection? (3) How do federated time series anomaly detection approaches perform on different partitions of data in clients? Numbers of results as well as corresponding analysis are provided from extensive experiments with various settings. The source code of our benchmark is publicly available at https://github.com/fanxingliu2020/FedTADBench.
translated by 谷歌翻译