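Standard diffusion models involve an image transform (adding Gaussian noise) and an image restoration operator that inverts this degradation. We observe that the generative behavior of diffusion models does not depend strongly on the choice of image degradation, and that an entire family of generative models can in fact be constructed by varying this choice. Even with completely deterministic degradations (e.g., blur, masking, and more), the training and test-time update rules that underlie diffusion models can be easily generalized to create generative models. The success of these fully deterministic models calls into question the community's understanding of diffusion models, which relies on noise in either gradient Langevin dynamics or variational inference, and paves the way for generalized diffusion models that invert arbitrary processes. Our code is available at https://github.com/arpitbansal297/cold-diffusion-models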
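As a rough illustration of a test-time update rule with a deterministic degradation, the sketch below fades a signal toward its mean and then inverts that process step by step. Everything here is an assumption made for illustration: the `degrade` function, the update `x <- x - D(x0_hat, t) + D(x0_hat, t-1)`, and the oracle `restore` callable (a stand-in for a trained restoration network) are not taken from the abstract or the linked repository.

```python
import numpy as np

def degrade(x0, t, T):
    """Hypothetical deterministic degradation: fade x0 toward its mean.

    t = 0 returns x0 unchanged; t = T returns a constant signal.
    """
    alpha = t / T
    return (1.0 - alpha) * x0 + alpha * x0.mean()

def sample(x_T, restore, T):
    """Invert the degradation one step at a time.

    `restore(x, t)` is assumed to return an estimate of the clean signal.
    Each step removes the degradation at level t and re-applies it at t - 1.
    """
    x = x_T
    for t in range(T, 0, -1):
        x0_hat = restore(x, t)
        x = x - degrade(x0_hat, t, T) + degrade(x0_hat, t - 1, T)
    return x

# Toy usage: an oracle restoration operator that already knows the answer.
x0 = np.array([0.0, 1.0, 4.0, 2.0])
T = 10
x_T = degrade(x0, T, T)
recovered = sample(x_T, lambda x, t: x0, T)
```

With a perfect restoration operator the loop walks back exactly along the degradation path, so no noise is involved at any step. A learned operator would only approximate this behavior.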
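Watermarking is a commonly used strategy for protecting creators' rights to digital images, videos and audio. Recently, watermarking methods have been extended to deep learning models: in principle, the watermark should be preserved when an adversary tries to copy the model. In practice, however, a smart adversary can often remove the watermark. Several papers have proposed watermarking methods that claim to be resistant to different types of removal attacks, but these new techniques often fail against new or better adversaries. In this paper, we propose a certifiable watermarking method. Using the randomized smoothing technique proposed by Chiang et al., we show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain L2 threshold. In addition to being certifiable, our watermark is also empirically more robust than previous watermarking methods. Our experiments can be reproduced at https://github.com/arpitbansal297/certified_watermarks.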
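Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees (GBDT) and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem that is widespread in real-world applications. Our code is available at https://github.com/levinroman/tabular-transfer-learning.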
Recently, online social media has become a primary source of new information as well as misinformation and rumours. In the absence of an automatic rumour detection system, the propagation of rumours has increased manifold, leading to serious societal damage. In this work, we propose a novel method for building an automatic rumour detection system that focuses on oversampling to alleviate the fundamental challenge of class imbalance in the rumour detection task. Our oversampling method relies on contextualised data augmentation to generate synthetic samples for underrepresented classes in the dataset. The key idea is to select tweets in a thread for augmentation, which is achieved by introducing a non-random selection criterion that focuses the augmentation process on relevant tweets. Furthermore, we propose two graph neural networks (GNNs) to model non-linear conversations on a thread. To enhance the tweet representations in our method, we employ a custom feature selection technique based on the state-of-the-art BERTweet model. Experiments on three publicly available datasets confirm that 1) our GNN models outperform the current state-of-the-art classifiers by more than 20% (F1-score); 2) our oversampling technique increases model performance by more than 9% (F1-score); 3) focusing on relevant tweets for data augmentation via a non-random selection criterion can further improve the results; and 4) our method has superior capabilities to detect rumours at a very early stage.
Language models have been shown to perform better with an increase in scale on a wide variety of tasks via the in-context learning paradigm. In this paper, we investigate the hypothesis that the ability of a large language model to in-context learn-perform a task is not uniformly spread across all of its underlying components. Using a 66 billion parameter language model (OPT-66B) across a diverse set of 14 downstream tasks, we find this is indeed the case: $\sim$70% of attention heads and $\sim$20% of feed forward networks can be removed with minimal decline in task performance. We find substantial overlap in the set of attention heads (un)important for in-context learning across tasks and number of in-context examples. We also address our hypothesis through a task-agnostic lens, finding that a small set of attention heads in OPT-66B score highly on their ability to perform primitive induction operations associated with in-context learning, namely, prefix matching and copying. These induction heads overlap with task-specific important heads, suggesting that induction heads are among the heads capable of more sophisticated behaviors associated with in-context learning. Overall, our study provides several insights that indicate large language models may be under-trained to perform in-context learning and opens up questions on how to pre-train language models to more effectively perform in-context learning.
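To make the notion of prefix matching more concrete, the sketch below scores a single attention head from its attention matrix. The scoring rule is an assumption chosen for illustration (average attention mass placed on the token that followed an earlier occurrence of the current token); it is not necessarily the exact metric used in the study, and `prefix_matching_score` is a hypothetical helper name.

```python
import numpy as np

def prefix_matching_score(attn, tokens):
    """Average attention from each position to 'the token after a previous match'.

    attn:   (n, n) array, row i holds the attention weights of query position i.
    tokens: sequence of n token ids.
    Positions whose token has not appeared earlier are skipped.
    """
    per_position = []
    for i, tok in enumerate(tokens):
        # Positions right after earlier occurrences of the current token.
        targets = [j + 1 for j in range(i) if tokens[j] == tok]
        if targets:
            per_position.append(attn[i, targets].sum())
    return float(np.mean(per_position)) if per_position else 0.0

# Toy usage on a sequence repeated twice.
tokens = [3, 1, 4, 2, 3, 1, 4, 2]
n = len(tokens)

# An idealized induction-style head: in the second half, attend to the
# token that followed the previous occurrence of the current token.
ideal = np.zeros((n, n))
for i in range(n):
    ideal[i, i if i < 4 else i - 3] = 1.0

# A head that spreads attention uniformly over all earlier positions.
uniform = np.tril(np.ones((n, n)))
uniform /= uniform.sum(axis=1, keepdims=True)

ideal_score = prefix_matching_score(ideal, tokens)
uniform_score = prefix_matching_score(uniform, tokens)
```

Under this toy definition, the idealized head scores far higher than the uniform head, which is the kind of separation a per-head ranking would rely on.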
With recent advances in CNNs, exceptional improvements have been made in semantic segmentation of high resolution images in terms of accuracy and latency. However, challenges still remain in detecting objects in crowded scenes, large scale variations, partial occlusion, and distortions, while still maintaining mobility and latency. We introduce a fast and efficient convolutional neural network, ASBU-Net, for semantic segmentation of high resolution images that addresses these problems and uses no novelty layers for ease of quantization and embedded hardware support. ASBU-Net is based on a new feature extraction module, atrous space bender layer (ASBL), which is efficient in terms of computation and memory. The ASB layers form a building block that is used to make ASBNet. Since this network does not use any special layers it can be easily implemented, quantized and deployed on FPGAs and other hardware with limited memory. We present experiments on resource and accuracy trade-offs and show strong performance compared to other popular models.
Prompting large language models has enabled significant recent progress in multi-step reasoning over text. However, when applied to text generation from semi-structured data (e.g., graphs or tables), these methods typically suffer from low semantic coverage, hallucination, and logical inconsistency. We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning. MURMUR is a best-first search method that generates reasoning paths using: (1) neural and symbolic modules with specific linguistic and logical skills, (2) a grammar whose production rules define valid compositions of modules, and (3) value functions that assess the quality of each reasoning step. We conduct experiments on two diverse data-to-text generation tasks like WebNLG and LogicNLG. These tasks differ in their data representations (graphs and tables) and span multiple linguistic and logical skills. MURMUR obtains significant improvements over recent few-shot baselines like direct prompting and chain-of-thought prompting, while also achieving comparable performance to fine-tuned GPT-2 on out-of-domain data. Moreover, human evaluation shows that MURMUR generates highly faithful and correct reasoning paths that lead to 26% more logically consistent summaries on LogicNLG, compared to direct prompting.
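The three ingredients listed above (modules, a grammar over module compositions, and a value function) can be pictured with a generic best-first search. The sketch below is a minimal illustration under stated assumptions, not the MURMUR implementation: the module names, the `GRAMMAR` table, and the `SCORES` used by the toy value function are all hypothetical.

```python
import heapq

# Hypothetical grammar: which module may follow which.
GRAMMAR = {
    "START": ["describe", "filter"],
    "filter": ["count", "describe"],
    "count": ["describe"],
    "describe": [],
}

# Hypothetical per-module scores standing in for a learned value function.
SCORES = {"START": 0.0, "filter": 1.0, "count": 1.0, "describe": 0.5}

def expand(path):
    """Return all one-step extensions of a path that the grammar allows."""
    return [path + (module,) for module in GRAMMAR[path[-1]]]

def value(path):
    """Toy value function: sum of the scores of the modules used so far."""
    return sum(SCORES[module] for module in path)

def is_goal(path):
    """A reasoning path is complete once it ends with the surface-text module."""
    return path[-1] == "describe"

def best_first_search(start, max_steps=1000):
    """Always expand the highest-valued partial path first."""
    frontier = [(-value(start), start)]  # heapq is a min-heap, so negate values
    for _ in range(max_steps):
        if not frontier:
            return None
        _, path = heapq.heappop(frontier)
        if is_goal(path):
            return path
        for nxt in expand(path):
            heapq.heappush(frontier, (-value(nxt), nxt))
    return None

best_path = best_first_search(("START",))
```

In this toy setup the search prefers the longer path through `filter` and `count` because the value function rewards those steps; swapping in a different value function changes which reasoning path is returned first.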
Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of their original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus eliminating the quadratic cost of standard cross-attention. Compared to the existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/
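The attention bottleneck can be pictured as two small attention steps in place of one large one: a handful of latent tokens first summarize the audio tokens, and the visual tokens then attend only to that summary. The NumPy sketch below is an assumption-laden illustration of this idea, not the LAVISH adapter itself: it omits learned projections, multiple heads and normalization, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def bottleneck_fuse(visual, audio, latents):
    """Fuse audio into visual tokens through a small set of latent tokens.

    visual:  (n_v, d) visual tokens
    audio:   (n_a, d) audio tokens
    latents: (m, d) latent tokens, with m much smaller than n_v and n_a
    """
    summary = attend(latents, audio)          # (m, d): m * n_a score entries
    return visual + attend(visual, summary)   # (n_v, d): n_v * m score entries

# Toy usage with random features.
rng = np.random.default_rng(0)
n_v, n_a, m, d = 196, 128, 4, 16
visual = rng.normal(size=(n_v, d))
audio = rng.normal(size=(n_a, d))
latents = rng.normal(size=(m, d))

fused = bottleneck_fuse(visual, audio, latents)
bottleneck_cost = m * n_a + n_v * m   # attention scores computed via latents
full_cost = n_v * n_a                 # scores for direct visual-to-audio attention
```

For these sizes the bottleneck computes 1,296 attention scores where direct cross-attention would compute 25,088, which is the scaling benefit the latent tokens are meant to provide.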
The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis and comparisons of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in the VidL model design. Among the factors that we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) pretraining and finetuning protocols, and (vi) dataset and model scaling. Our empirical study reveals that the most important design factors include: temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model trained using our recipe achieves comparable or better than state-of-the-art results on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo, and 55.0% on ActivityNet, outperforming current SOTA by 7.8% and 6.1% respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at: https://github.com/klauscc/VindLU.
We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark (DUE).