Universal Mini-Batch Consistency for Set Encoding Functions
Abstract
Previous works have established solid foundations for neural set functions, as well as effective architectures which preserve the necessary properties for operating on sets, such as being invariant to permutations of the set elements. Subsequently, Mini-Batch Consistency (MBC), the ability to sequentially process any permutation of any random set partition scheme while maintaining consistency guarantees on the output, has been established but with limited options for network architectures. We further study the MBC property in neural set encoding functions, establishing a method for converting arbitrary non-MBC models to satisfy MBC. In doing so, we provide a framework for a universally-MBC (UMBC) class of set functions. Additionally, we explore an interesting dropout strategy made possible by our framework, and investigate its effects on probabilistic calibration under test-time distributional shifts. We validate UMBC with proofs backed by unit tests, also providing qualitative/quantitative experiments on toy data, clean and corrupted point cloud classification, and amortized clustering on ImageNet. The results demonstrate the utility of UMBC, and we further discover that our dropout strategy improves uncertainty calibration.
1 Introduction
Set encoding functions (Zaheer et al., 2017; Bruno et al., 2021; Lee et al., 2019; Kim, 2021) are a widely researched and cited topic in recent literature. This popularity can be partly attributed to natural set structures in data such as point clouds or even datasets themselves. Given a set of arbitrary cardinality, one may wish to group the elements (clustering), identify them (classification), or find likely elements to complete the set (completion/extension). Unlike vanilla neural networks, which work on fixed input sizes, neural set functions must be able to handle a different set cardinality for each input set. Additionally, sets are inherently unordered, so the function must make consistent predictions for any permutation of set elements.
Deep Sets (Zaheer et al., 2017) is a canonical work providing an in-depth investigation of the requirements and valid structures of neural set functions. Deep Sets utilizes traditional, permutation equivariant (Property 3.2) linear and convolutional neural network layers in conjunction with featurewise permutation invariant (Property 3.1) set-pooling functions (e.g. {min, max, sum, mean}) in order to satisfy the permutation consistency requirements and perform inference on sets. The Set Transformer (Lee et al., 2019) utilizes the power of Multi-Headed Attention (Vaswani et al., 2017) to construct multiple set-capable attention blocks, as well as an attentive pooling function. Neither work explicitly considers the case where a set must be processed in multiple partitions, which can happen for a variety of reasons including resource constraints, prohibitively large or even infinite set sizes, and streaming set data.
Model | MBC | Cross-Attn. | Self-Attn. |
---|---|---|---|
Deep Sets (Zaheer et al., 2017) | ✓ | ✗ | ✗ |
SSE (Bruno et al., 2021) | ✓ | ✓ | ✗ |
Set Transformer (Lee et al., 2019) | ✗ | ✓ | ✓ |
UMBC+Set Transformer | ✓ | ✓ | ✓ |
Powerful models such as the Set Transformer cannot make consistency guarantees when updating pooled set representations, as self-attention blocks require all elements in one pass, and therefore do not satisfy MBC (i.e. batch processing of set partitions yields a different output than processing the whole set at once). Naively using such non-MBC set encoders in an MBC setting causes under-performance, as depicted in the Set Transformer panels of Figure 3, where the Set Transformer exhibits poor likelihood and inconsistent predictions. With an MBC guarantee (the UMBC+Set Transformer panels of Figure 3), UMBC+Set Transformer gives consistent results and vastly better likelihood (see Appendices B and 5 for details of the experiment). The quantitative effect of MBC vs. non-MBC encoding on pooled set representations can be seen in Figure 2, which shows the variance between final representations of 100 random partitions of the same set (see Appendix C for details).
The MBC property of set functions was identified by Bruno et al. (2021), who also proposed the Slot Set Encoder (SSE), a specific, constrained version of an attentive pooling mechanism which eliminates the need to store all set elements during the computation. The introduction of the MBC property naturally adds a new dimension to the taxonomy of set functions, namely, those which satisfy MBC and those which do not. The main limitation of the SSE is that it restricts the number of valid MBC architectures, eliminating powerful models such as the Set Transformer, which can be the best choice for tasks that require leveraging pairwise set element relationships and self-attention. This is shown in Tables 2 and 4, where the Set Transformer outperforms SSE. In this work, we identify, prove, and verify that there is a universal way to convert arbitrary non-MBC set functions to MBC functions which provably produce the same result for random partitioning schemes, allowing any set encoder to be used in an MBC setting. This result has large implications for all current and future set functions which do not natively satisfy MBC, as it unifies all set functions into the MBC class of set functions. This unification allows models to scale to sets larger than they could otherwise handle, and also allows them to be used in a wider range of settings (e.g. streaming data). Animations, code, and unit tests can be found in the supplementary file and also at: https://github.com/anonymous-subm1t/umbc
Our contributions in this work are as follows:
- In Theorem 4.1 we show that with a change in architecture, any arbitrary non-MBC set encoder can become MBC, guaranteeing that mini-batch processing of sets gives the same result as processing the full set at once.
- We loosen the constraints on the SSE attention activations by showing that many functions can be used. By factorizing the activation, we can maintain the MBC property and still normalize over the set elements (i.e. like traditional attention (Vaswani et al., 2017)).
- We uncover a connection between the pooling mechanism of the Set Transformer and SSE layers, which we show differ only in the attention matrix activation. We explore the effect of 5 different activation approaches.
- We explore an interesting dropout approach which arises as a consequence of UMBC’s structure, delivering improvements in calibration for both in-distribution and corrupted test sets.
2 Related Work
Processing, pooling, and making predictions for set-structured data has been an active topic since the introduction of Deep Sets (Zaheer et al., 2017). Attention has been shown to be powerful in these tasks (Lee et al., 2019), as simple independent row-wise operations may fail to capture pairwise interactions between the set elements. Subsequent works and variations of set attention draw connections to optimal transport (Mialon et al., 2020) and expectation maximization (Kim, 2021). Likewise, an efficient version of set attention has been proposed which incorporates cross-attention with lower dimensional self-attention in an iterative process (Jaegle et al., 2021). Outside of attention, other approaches to set pooling functions include featurewise sorting (Zhang et al., 2019) and canonical orderings applied to permutation sensitive functions (Murphy et al., 2018).
Bruno et al. (2021) provide an especially important lens through which to view our work. Prior to the proposal of the MBC property, set function research never explicitly considered the mini-batched setting, which will likely become important with the ever increasing scales of models and data (Brown et al., 2020). Indeed, most set functions do not satisfy Property 3.3 (e.g. Lee et al., 2019; Kim, 2021; Mialon et al., 2020; Jaegle et al., 2021; Zhang et al., 2019; Murphy et al., 2018). Our work builds on the concepts established by Bruno et al. (2021), and ensures that any set function proposed in the future can be considered in terms of its MBC performance by incorporating UMBC.
Numerous prior works (Ovadia et al., 2019; Guo et al., 2017) focus on uncertainty quantification and improving probabilistic calibration, which can be crucial for tasks such as autonomous driving (Chen et al., 2017) and medical diagnosis (Zhou et al., 2021) where decisions can impact human well-being. Guo et al. (2017) proposed quantifying uncertainty with the expected calibration error (ECE) metric, which measures the mismatch between accuracy and confidence. Ovadia et al. (2019) used corrupted datasets made by Hendrycks and Dietterich (2019) (similar in form to ModelNet40-C (Ren et al., 2022) used in our experiments) to survey the landscape of neural network calibration. Guo et al. (2017) and Ovadia et al. (2019) analyze variants of deep convolutional models, while Minderer et al. (2021) evaluate large Vision Transformers. To our knowledge, our work is the first to analyze set function calibration specifically, as most other works focus on general purpose classifiers.
3 Preliminaries on Set Functions
For our setting, we define a neural set function which operates on a set of elements. A dataset of set-structured data is itself a set composed of sets, forming input sets and output sets whose mapping can be learned by an appropriate function via mini-batch stochastic gradient descent. A set function has a set-structured input space and an output space which may be discrete or continuous. Because the input is a set and the function must process any valid set, any element of the powerset of the input space is also a valid input.
Deep Sets (Zaheer et al., 2017) provided crucial groundwork for neural set functions, formalizing the requirements of permutation equivariant architectures and invariant pooling mechanisms necessary for feature extraction and pooling of sets. Following these requirements, a set function can assign a single output to each valid subset, and this output is invariant to permutations of the subset's elements.
Property 3.1 (Permutation Invariance).
A function $f$ acting on sets is permutation invariant to the order of objects in the set iff for any permutation function $\pi$: $f(\{x_1, \ldots, x_N\}) = f(\{x_{\pi(1)}, \ldots, x_{\pi(N)}\})$.
Permutation invariant layers are commonly referred to as set pooling functions, and produce a fixed-size output regardless of the permutation or the cardinality of the input set. This fixed-size output can be seen as the result of a Set to Vector function.
Definition 3.1 (Set to Vector Function).
A Set to Vector Function (S2V) is a pooling function which satisfies Property 3.1 and projects a set of arbitrary cardinality to one or more fixed-size vectors.
Additionally, Zaheer et al. (2017) prescribe that prior to any permutation invariant pooling, any composition of permutation equivariant layers may be used for feature extraction. Common linear and convolutional neural network layers are permutation equivariant when a batch of inputs is considered as a set. For the remainder, we assume the set function contains both equivariant and invariant layers.
Property 3.2 (Permutation Equivariance).
A function $f$ acting on sets is permutation equivariant to the order of objects in the set iff for any permutation function $\pi$: $f([x_{\pi(1)}, \ldots, x_{\pi(N)}]) = [f_{\pi(1)}(x), \ldots, f_{\pi(N)}(x)]$.
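Properties 3.1 and 3.2 can be checked numerically on a toy example. The sketch below is only illustrative: the helper names (`elementwise`, `sum_pool`, `permute`) are hypothetical and the "layers" are scalar stand-ins for real network layers. It shows that an elementwise map commutes with a permutation (equivariance), and that sum pooling on top of it ignores the permutation (invariance).

```python
import random

def elementwise(xs):
    """Apply the same map to every element: permutation equivariant."""
    return [2.0 * x + 1.0 for x in xs]

def sum_pool(xs):
    """Sum pooling: permutation invariant."""
    return sum(xs)

def permute(xs, perm):
    """Reorder xs according to the index list perm."""
    return [xs[i] for i in perm]

random.seed(0)
xs = [float(i) for i in range(8)]  # integer-valued floats keep the sums exact
perm = list(range(len(xs)))
random.shuffle(perm)

# Equivariance: permuting the input permutes the output in the same way.
equivariant = elementwise(permute(xs, perm)) == permute(elementwise(xs), perm)

# Invariance: pooling the permuted features gives the same pooled value.
invariant = sum_pool(elementwise(permute(xs, perm))) == sum_pool(elementwise(xs))
```

Integer-valued inputs are used so that the comparison can be exact; with arbitrary floats, a tolerance-based comparison would be the safer choice.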
Lee et al. (2019) identified that Self-Attention (Vaswani et al., 2017) blocks satisfy Property 3.2 and thus can be used as equivariant feature extractors for set functions, proposing the Set Transformer. Its self-attention blocks (SABs) attend from the set to itself, while its permutation invariant pooling layer (PMA) performs the attention operation between a learnable seed parameter and the input set.
Bruno et al. (2021) identified and formalized the MBC property, proposing the MBC Slot Set Encoder (SSE) and adding a new dimension to the original view of Property 3.1 from Zaheer et al. (2017). Instead of merely requiring that the set function be permutation invariant for any permutation of the indices of a specific subset, the MBC property also requires that sequential, mini-batched extraction/pooling and subsequent aggregation of any partition of that subset is also permutation invariant.
Property 3.3 (Mini-Batch Consistency).
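Let $X$ be partitioned such that $X = X_1 \cup X_2 \cup \ldots \cup X_p$, and let $f$ be an S2V set encoding function such that $f(X) = Z$. Given an aggregation function $g$, $g$ and $f$ are Mini-Batch Consistent if and only if $g(f(X_1), \ldots, f(X_p)) = f(X)$.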
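To make Property 3.3 concrete, the following sketch contrasts an aggregation that is mini-batch consistent with one that is not. It is a toy illustration under our own assumptions: the set elements are scalars, the encoder is mean pooling, and all function names are hypothetical. Carrying the additive statistics (sum and count) across partitions reproduces the full-set mean, whereas averaging per-chunk means depends on the partition sizes.

```python
import math

def sum_encode(chunk):
    """Encode a chunk as (sum, count); both statistics are additive."""
    return (sum(chunk), len(chunk))

def aggregate(parts):
    """MBC aggregation: add the statistics, then normalise once."""
    total = sum(s for s, _ in parts)
    count = sum(n for _, n in parts)
    return total / count

def mean_of_means(chunks):
    """Naive aggregation: average the per-chunk means (not MBC in general)."""
    return sum(sum(c) / len(c) for c in chunks) / len(chunks)

def partition(xs, sizes):
    """Split xs into consecutive chunks with the given sizes."""
    out, start = [], 0
    for n in sizes:
        out.append(xs[start:start + n])
        start += n
    return out

xs = [float(i) for i in range(1, 11)]   # the set {1, ..., 10}
full = sum(xs) / len(xs)                # full-set mean pooling
chunks = partition(xs, [1, 3, 6])       # an uneven partition

mbc = aggregate([sum_encode(c) for c in chunks])
naive = mean_of_means(chunks)
```

Here `mbc` matches `full`, while `naive` does not, because the three chunk means are weighted equally despite covering 1, 3, and 6 elements.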
An SSE (Bruno et al., 2021) layer works in a similar fashion to the PMA layers of the Set Transformer, using parameterized slots as queries and each partition as keys and values, with an attention activation for a single element which does not depend on the other elements within the set. SSE uses a sigmoid activation with normalization over the slot dimension (Bruno et al., 2021; Locatello et al., 2020) in the attention matrix. Because no attention weight depends on the other set elements, the per-partition outputs can be combined by an aggregation function, thereby allowing any partition scheme and satisfying Property 3.3. With the prior description in mind, an SSE can be phrased as a PMA which uses the slot-normalized sigmoid activation in order to satisfy Property 3.3.
4 Building a Universally MBC Set Function
Originally, SSE acts as an S2V function, creating an encoded set representation for downstream handling by a task-specific decoder. Decoders make different predictions given different representations, therefore Property 3.3 need only be satisfied up to the invariant S2V pooling function.
Lemma 4.1.
Let $f$ and $h$ be arbitrary neural set functions, and let $g$ be an MBC aggregation function in the functional composition $h \circ g \circ f$. For the composition to satisfy Property 3.3, it is sufficient to require that the representation given as input to $h$ satisfies Property 3.3.
Proof.
Assume that the representation given as input to $h$ satisfies Property 3.3 and the composition does not satisfy Property 3.3. The aggregation $g$ updates as new partitions arrive, yielding the same input to $h$, and therefore the same output of $h$, for any permutation of a random partition of the input set, contradicting the statement that the composition does not satisfy Property 3.3. ∎
Put simply, Lemma 4.1 states that every module coming after a module which satisfies Property 3.3 will continue to satisfy Property 3.3, even though the later module itself may not satisfy Property 3.3. With this established, we can use Lemma 4.1 in order to build a universally MBC set function.
Theorem 4.1 (Universal MBC Set Function (UMBC)).
Let $f$ be a neural S2V function satisfying Property 3.3, $h$ be an arbitrary unconstrained neural S2V function, and $g$ be an MBC aggregation function. By Lemma 4.1, the composition of functions $h \circ g \circ f$ satisfies Property 3.3.
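The composition in Theorem 4.1 can be sketched in a few lines. This is a toy illustration under our own assumptions rather than the released implementation: the slots are fixed scalars (`SLOTS` is hypothetical), the base uses an unnormalized sigmoid activation, the aggregation is a slot-wise sum, and the head is an arbitrary function of pairwise slot interactions which makes no MBC guarantee by itself.

```python
import math

SLOTS = [0.5, -1.0, 2.0]  # hypothetical fixed slot parameters (scalars for brevity)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def base_encode(chunk):
    """MBC base: each slot accumulates sigmoid-weighted elements of the chunk."""
    return [sum(sigmoid(s * x) * x for x in chunk) for s in SLOTS]

def aggregate(encodings):
    """MBC aggregation: slot-wise sum over the partition encodings."""
    return [sum(vals) for vals in zip(*encodings)]

def head(slots):
    """Arbitrary head: maximum over pairwise interactions between slots."""
    return max(a * b
               for i, a in enumerate(slots)
               for j, b in enumerate(slots) if i < j)

def umbc(chunks):
    """Head applied to the aggregated base encodings of all partitions."""
    return head(aggregate([base_encode(c) for c in chunks]))

xs = [0.3, -1.2, 0.7, 2.1, -0.4, 1.5]
full = umbc([xs])                                          # whole set at once
streamed = umbc([[0.3], [-1.2, 0.7, 2.1], [-0.4, 1.5]])    # three partitions
shuffled = umbc([[1.5, -0.4], [2.1], [0.7, 0.3, -1.2]])    # permuted partitions
```

All three outputs agree up to floating-point summation order, since the head only ever sees the aggregated slot representation.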
In the simplest setting, where the pooled representation is a single vector, the head $h$ receives a singleton set as input, which is valid but may provide limited utility over a Deep Sets style encoder, as $h$ sees only a single element. SSEs and PMAs, however, output a set of slot vectors. Therefore, using an SSE/PMA layer with more than one slot as the base module $f$, we can view the SSE/PMA layer as a type of invariant feature extractor which takes a set of arbitrary cardinality and maps it to a set of fixed cardinality. Features in UMBC thus flow from the input space through the MBC base and its aggregation, and then through the arbitrary head to the output space.
Maintaining attention normalization over the set elements
We now turn to the question of whether or not the constrained attention operation described for the SSE (i.e. avoiding normalization over the set elements in the attention activation) is necessary in order to satisfy Property 3.3.
Proposition 4.1.
By factorizing the normalization constant from the attention matrix softmax, normalization over the set elements can be performed across mini-batched partitions while still satisfying Property 3.3.
Proof.
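With $\sigma$ as a softmax over the set elements, $\exp$ as the elementwise exponential, and $Q$, $K$, $V$ as the slot queries and the keys and values of the set,

$$\sigma(QK^\top)V = D^{-1}\exp(QK^\top)V \qquad (1)$$

where $D$ is a diagonal matrix containing the normalization constants of the softmax function, $D_{ii} = \sum_{n=1}^{N}\exp(QK^\top)_{in}$, with $N$ the set cardinality and $i$ indexing a single slot. Since matrix multiplication is associative, the final multiplication can occur in any order, so we may simply evaluate $\exp(QK^\top)V$ while keeping a vector with the sums of the rows of $\exp(QK^\top)$. Factoring the attention in this way, we can update $D$ and $\exp(QK^\top)V$ at the arrival of every partition $X_j$ with keys $K_j$ and values $V_j$, normalize over the set elements, and still satisfy Property 3.3:

$$\sigma(QK^\top)V = \Big(\sum_{j=1}^{p} D_j\Big)^{-1}\sum_{j=1}^{p}\exp(QK_j^\top)V_j \qquad (2)$$

where $D_j$ holds the row sums of $\exp(QK_j^\top)$ on its diagonal.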
∎
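The factorization used in the proof can be illustrated with a small streaming accumulator. This is a minimal sketch under our own assumptions (a single slot, plain dot-product pre-activations, no scaling or learned projections, hypothetical class and function names): the unnormalized numerator and the normalizer are both additive over partitions, so dividing once at the end recovers softmax attention over the full set.

```python
import math

def scores(slot, keys):
    """Dot-product pre-activations between one slot query and each key."""
    return [sum(q * k for q, k in zip(slot, key)) for key in keys]

def full_softmax_pool(slot, keys, values):
    """Reference: softmax over all elements, then a weighted sum of values."""
    w = [math.exp(s) for s in scores(slot, keys)]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]

class StreamingSoftmaxPool:
    """Keeps the unnormalised numerator and the normaliser across partitions."""

    def __init__(self, slot, dim):
        self.slot = slot
        self.num = [0.0] * dim  # running sum of exp(score) * value
        self.den = 0.0          # running sum of exp(score)

    def update(self, keys, values):
        for s, v in zip(scores(self.slot, keys), values):
            w = math.exp(s)
            self.den += w
            for d in range(len(self.num)):
                self.num[d] += w * v[d]

    def result(self):
        return [n / self.den for n in self.num]

slot = [0.2, -0.5]
keys = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.3, -0.7], [2.0, 1.0]]
values = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]

pool = StreamingSoftmaxPool(slot, dim=2)
pool.update(keys[:1], values[:1])    # partition of size 1
pool.update(keys[1:4], values[1:4])  # partition of size 3
pool.update(keys[4:], values[4:])    # partition of size 1
streamed = pool.result()
full = full_softmax_pool(slot, keys, values)
```

The memory held between partitions is only the numerator and the normalizer for each slot, independent of how many set elements have been seen.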
Interestingly, in our ablation study (Figure 5), we find the softmax most effective, which requires the normalization over the set elements as described above. For a note about the numerical stability of the softmax calculated this way, see Appendix F.
function () | norm. | norm. | name | used in |
---|---|---|---|---|
 | | | slot-sigmoid | Slot Set Encoder (Bruno et al., 2021) |
 | | | slot-softmax | Slot Attention (Locatello et al., 2020) |
 | | | softmax | Set Transformer (Lee et al., 2019) |
 | | | slot-exp | - |
 | | | sigmoid | - |
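One common way to keep a streamed softmax numerically stable is to track a running maximum of the pre-activations and rescale the accumulators whenever it changes. The sketch below shows that general technique for scalar values; it is an assumption on our part and not necessarily the approach described in Appendix F, and the class name is hypothetical.

```python
import math

class StableStreamingSoftmax:
    """Streaming softmax-weighted mean with a running maximum for stability."""

    def __init__(self):
        self.max = -math.inf  # largest score seen so far
        self.num = 0.0        # sum of exp(score - max) * value
        self.den = 0.0        # sum of exp(score - max)

    def update(self, scores, values):
        new_max = max(self.max, max(scores))
        # Rescale the previous accumulators to the new reference maximum.
        scale = math.exp(self.max - new_max) if self.den > 0.0 else 0.0
        self.num *= scale
        self.den *= scale
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            self.num += w * v
            self.den += w
        self.max = new_max

    def result(self):
        return self.num / self.den

# Scores this large would overflow a direct math.exp(score) in double precision.
scores_a, values_a = [1000.0, 1001.0], [1.0, 2.0]
scores_b, values_b = [999.0, 1002.0], [3.0, 4.0]

acc = StableStreamingSoftmax()
acc.update(scores_a, values_a)
acc.update(scores_b, values_b)
streamed = acc.result()

# Reference computed on the full set with the global maximum subtracted.
all_scores = scores_a + scores_b
all_values = values_a + values_b
m = max(all_scores)
w = [math.exp(s - m) for s in all_scores]
reference = sum(wi * vi for wi, vi in zip(w, all_values)) / sum(w)
```

Because the rescaling factor multiplies the numerator and the normalizer equally, the final ratio is unchanged and the result remains independent of the partitioning.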
SSE’s Connection to PMA’s
With the introduction of Proposition 4.1, it is easy to see that the only difference between an SSE and a PMA is the choice of the attention activation function. Indeed, any deterministic elementwise function which 1) maps the pre-activation attention matrix to strictly positive values, and 2) has an optional normalization constant over the set elements which can be factored as in Proposition 4.1 is valid and will satisfy Property 3.3. With this in mind, we identify five functions, and explore their performance effects in Figure 5. For the remainder of this work, we refer to this base module as a UMBC layer, and models with a base UMBC module are prefixed with UMBC+. A diagram of a UMBC attention can be seen in Figure 9.
Slot Dropout
As outlined in Section 4, our UMBC framework projects a set of arbitrary cardinality to a set of fixed cardinality. In doing so, there is a unique opportunity to treat each slot as a Bernoulli random variable, dropping it with a fixed probability (i.e. dropout (Hinton et al., 2012; Gal and Ghahramani, 2016)). This strategy could be useful for faster training due to a reduced set size as input to the head (Figure 18), for combatting overfitting (Appendices J and 7), or for achieving test-time ensembling of set representations by sampling multiple dropout masks and averaging the predictions via MC integration (Figure 7), as done by Gal and Ghahramani (2016).
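Slot dropout can be sketched as sampling one keep/drop decision per slot before a stream begins and reusing that mask for every partition. The code below is a toy illustration under our own assumptions: the slots are scalars, the base is a simple slot-weighted sum, the helper names are hypothetical, and forcing at least one slot to be kept is our own safeguard rather than something stated in the paper.

```python
import random

def sample_slot_mask(num_slots, drop_prob, rng):
    """One Bernoulli keep/drop decision per slot (assumption: keep at least one)."""
    mask = [rng.random() >= drop_prob for _ in range(num_slots)]
    if not any(mask):
        mask[rng.randrange(num_slots)] = True
    return mask

def encode_chunk(chunk, slots, mask):
    """Toy MBC base: kept slots accumulate slot-weighted sums of the chunk."""
    return [sum(s * x for x in chunk) for s, keep in zip(slots, mask) if keep]

def encode_stream(chunks, slots, mask):
    """Apply the same pre-sampled mask to every partition, then sum slot-wise."""
    parts = [encode_chunk(c, slots, mask) for c in chunks]
    return [sum(vals) for vals in zip(*parts)]

rng = random.Random(0)
slots = [0.5, -1.0, 2.0, 1.5]
mask = sample_slot_mask(len(slots), drop_prob=0.5, rng=rng)

xs = [1.0, 2.0, 3.0, 4.0]
full = encode_stream([xs], slots, mask)
streamed = encode_stream([[1.0], [2.0, 3.0], [4.0]], slots, mask)
```

Since the mask is fixed for the whole stream, the surviving slots see every partition and the streamed encoding matches the full-set encoding; resampling the mask per partition would break this.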
Multiheaded and Parallel Universal Blocks
In addition to multiheaded attention in UMBC, we can also consider multiple parallel UMBC layers, each with independent multiheaded projections, allowing for independent representations of the same input set.
5 Experiments
Metrics & Model Setup
In the following experiments, our aim is to compare the overall effect of the UMBC composition. In these experiments, we could place arbitrarily hard MBC settings on baselines (e.g. the streaming settings in Figures 8 and 3). Instead, we compare performance in the full batch setting, where a standalone non-MBC model performs well, in order to analyze any possible downsides and highlight the benefits of choosing a UMBC model over an MBC model like Deep Sets or SSE. In addition to accuracy, we report negative log likelihood (NLL), expected calibration error (ECE) (Guo et al., 2017), and Adjusted Rand Index (ARI) (Hubert and Arabie, 1985; Vinh et al., 2010). Standard settings of all UMBC models follow those shown in Table 5 unless otherwise specified. All models are trained for 5 runs with random initializations, with error bars corresponding to one standard deviation. We use open source code provided by Zaheer et al. (2017), Lee et al. (2019), and Kim (2021) where applicable.
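For readers unfamiliar with ECE, the metric can be sketched as a weighted average of the gap between accuracy and mean confidence within confidence bins. The snippet below is a minimal illustration with equal-width bins; the function name, the number of bins, and the left-closed bin edges are our assumptions and may differ from the evaluation code used in the experiments.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1.0 for _, h in items if h) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece

# Four predictions made with 90% confidence, three of which are correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, True, False])
# Fully confident and always correct: no calibration gap.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
```

In the first example the model claims 90% confidence but is right 75% of the time, giving a gap of about 0.15; in the second the gap is zero.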
Amortized clustering
We perform amortized clustering on a Mixture of Gaussians dataset similar to that of Lee et al. (2019) (see Appendix B for dataset details). The goal is to maximize the likelihood (Equation 3) of a set $X = \{x_i\}_{i=1}^{N}$ under a mixture of $k$ Gaussian components by predicting the component priors, means, and variances $\theta = \{\pi_j, \mu_j, \sigma_j\}_{j=1}^{k}$:

$$\log p(X;\theta) = \sum_{i=1}^{N}\log\sum_{j=1}^{k}\pi_j\,\mathcal{N}\big(x_i;\mu_j,\mathrm{diag}(\sigma_j^2)\big) \qquad (3)$$

Figures 8 and 3 contain a qualitative example of the task as well as a demonstration of how non-MBC models can fail when used in an MBC setting, considering 4 different streaming settings for the inputs:
- single point stream: streams each point in the set one by one. This causes the most severe underperformance by the Set Transformer.
- class stream: streams an entire class at once. The attention modules within the Set Transformer cannot compare the input class with any other clusters, thereby degrading its performance.
- chunk stream: streams 8 random points at a time from the dataset, providing limited information to the Set Transformer’s attention.
- one each stream: streams a set consisting of a single instance from each class. The Set Transformer can see examples of each class, but with a limited sample size, the encoding fails to make accurate predictions.
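The likelihood objective of the clustering task can be sketched as the log-likelihood of a point set under a mixture of diagonal-covariance Gaussians. The code below is an illustrative assumption about the form of the objective (summed over points, with no normalization by set size); the function names are hypothetical and the exact scaling used in the experiments may differ.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def logsumexp(vals):
    """Numerically stable log of a sum of exponentials."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def mog_log_likelihood(points, priors, means, variances):
    """Sum over points of the log mixture density."""
    total = 0.0
    for x in points:
        total += logsumexp([math.log(p) + log_gaussian_diag(x, mu, var)
                            for p, mu, var in zip(priors, means, variances)])
    return total

# A single standard normal component evaluated at the origin.
single = mog_log_likelihood([[0.0]], [1.0], [[0.0]], [[1.0]])
# Two identical components with equal priors give the same density.
doubled = mog_log_likelihood([[0.0]], [0.5, 0.5],
                             [[0.0], [0.0]], [[1.0], [1.0]])
```

A set encoder for this task would output the priors, means, and variances, and training would maximize this quantity (or minimize its negative).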
We show the effect of different train/test set sizes in Figure 4. Interestingly, the best performing MBC models in Figure 4 are UMBC+(Diff. EM, Set Transformer), giving a concrete example of a task where UMBC can leverage the added power of the Set Transformer to outperform existing MBC models. Note that this is the same task as depicted in Figures 8 and 3, showing that the marginally better bottom line performance of the Set Transformer in Figure 4 disappears in the MBC setting.
Model | MBC | NLL | ARI |
---|---|---|---|
Oracle | - | 1028.22±1.24 | 44.09±0.11 |
Deep Sets (Zaheer et al., 2017) | ✓ | 531.44±0.15 | 6.18±0.08 |
SSE (Bruno et al., 2021) | ✓ | 520.29±0.63 | 22.91±1.85 |
Diff. EM* | ✗ | 524.74±0.38 | 13.22±0.16 |
Set Transformer (Lee et al., 2019) | ✗ | 512.59±0.33 | 17.13±3.67 |
UMBC+Diff. EM | ✓ | 518.56±0.92 | 13.04±0.45 |
UMBC+Set Transformer | ✓ | 503.89±0.87 | 23.68±1.85 |

* Diff. EM showed some instability on the ImageNet clustering task and failed to converge for one run. Therefore variance is reported on 4/5 runs.

We extended the amortized clustering to ImageNet (Deng et al., 2009). We used features extracted from a pretrained, frozen ResNet50 (He et al., 2016) model, which were then projected to a lower dimension via a random matrix (see details in Appendix E). Results for ImageNet clustering can be seen in Table 2. The Oracle entry in Table 2 is the NLL and ARI obtained using the actual prior, empirical mean, and diagonal covariance of each class cluster. As in the toy clustering task, UMBC+Set Transformer performs well, and even outperforms all other models. To account for UMBC’s added parameters, we also added UMBC to the baseline MBC models in Table 6, and UMBC+Set Transformer still shows the best performance.
Ablation Study
Using the mixture of Gaussians dataset, we evaluate various aspects of UMBC layers in Figure 5. Of the activation functions identified as valid in Section 4, we found that the traditional softmax used in attention performs best, and we use this activation in all other experiments. In agreement with Bruno et al. (2021), we find that treating the slots as Gaussian random variables leads to a better overall result. We learn the slots with reparameterization, as outlined in Appendix H. We find layernorm on the post-attention linear layer and residual connections on the slots before the FF layer (like the PMA layers of Lee et al. (2019)) to be beneficial. Increasing the number of slots (the cardinality of the input to the head) helps up to a point and then shows an overfitting effect from overparameterization. We used these findings to inform our base settings given in Table 5.
The effect of test-time MC sampling of these Bernoulli slots can be seen in Figure 5 (bottom right). Empirically, on the MoG task, we found that using no slot dropout at test time ultimately led to the best performance, which we think is likely due to the fact that the MoG dataset has an infinite number of instances and is therefore extremely resistant to overfitting. Using dropout on the ModelNet40 dataset (Figure 7), which is prone to overfitting, led to better results on all metrics. Note that Monte Carlo sampling the slots at test time does not violate Property 3.3 as long as the dropout noise is pre-sampled at the beginning of a mini-batch sequence and applied in the same way to each partition.
Model | MBC | Acc. (100) | Acc. (1000) | Acc. (2048) | NLL (100) | NLL (1000) | NLL (2048) | ECE (100) | ECE (1000) | ECE (2048) |
---|---|---|---|---|---|---|---|---|---|---|
Deep Sets (Zaheer et al., 2017) | ✓ | 65.37±1.07 | 88.35±0.32 | 88.72±0.21 | 1.57±0.03 | 0.40±0.01 | 0.40±0.01 | 17.38±0.95 | 4.21±0.27 | 4.02±0.16 |
SSE (Bruno et al., 2021) | ✓ | 71.09±0.51 | 87.85±0.39 | 87.92±0.42 | 1.42±0.10 | 0.52±0.05 | 0.51±0.06 | 16.69±1.11 | 5.93±1.06 | 5.88±1.17 |
Diff-EM (Kim, 2021) | ✗ | 62.67±1.21 | 86.08±0.12 | 86.86±0.36 | 2.40±0.11 | 0.71±0.02 | 0.69±0.03 | 22.16±0.93 | 5.15±0.11 | 4.96±0.28 |
Set Transformer (Lee et al., 2019) | ✗ | 74.21±1.67 | 87.81±0.44 | 88.17±0.32 | 1.76±0.08 | 0.79±0.08 | 0.78±0.08 | 17.12±0.46 | 7.48±0.62 | 7.37±0.54 |
UMBC+Diff-EM | ✓ | 67.07±1.67 | 86.22±1.23 | 86.37±1.03 | 1.61±0.12 | 0.58±0.06 | 0.57±0.05 | 13.97±1.51 | 4.32±1.37 | 4.38±1.27 |
UMBC+Set Transformer | ✓ | 71.18±1.52 | 86.56±0.49 | 86.77±0.29 | 1.23±0.15 | 0.53±0.03 | 0.51±0.03 | 10.37±2.24 | 2.60±0.19 | 2.35±0.24 |
Point cloud classification
We perform set classification experiments on ModelNet40 (Wu et al., 2015) and analyze the robustness of different set encoders to dataset shifts and varying test-time set sizes using ModelNet40-C (Ren et al., 2022), which contains 15 corruptions at 5 levels of intensity. Our experiments use the version of ModelNet40 and ModelNet40-C used by Ren et al. (2022), which contains 2048 points sampled from the original ModelNet40 (Wu et al., 2015) CAD models. Results are presented in Table 3. Overall, compared with MBC baseline models, we observed a marginal decrease in accuracy for the 1000 and 2048 test set sizes, and mixed increases/decreases in accuracy for the 100 set size experiments. In terms of ECE, ‘UMBC+’ models outperform all baseline models. This improvement in ECE can be partly attributed to MC sampling slots at test time (see Figures 12, 11 and 7) and partly to Slot Dropout at train time (see Table 7).
ModelNet40-C results can be seen in Figure 6. UMBC+ models give strong ECE performance at all test set sizes, improving over baselines, especially for UMBC+Set Transformer at test set size 100, where baseline models show the largest miscalibration.
6 Conclusion
In this work, we have shown that by composing a mini-batch consistent base with an arbitrary set function head, we can make the composition universally mini-batch consistent. We have provided proofs (Theorem 4.1), experiments (Figure 2), and unit tests (included in the supplementary file) which support our assertions. Likewise, we have loosened the known constraints on the structure of the SSE (Proposition 4.1), establishing an equivalency to the PMA layers of the Set Transformer. We have demonstrated that there are cases where a UMBC model outperforms previous simpler MBC models, and explored an interesting dropout strategy which is made possible by our architecture and which improved the calibration and NLL of UMBC. As the field of set functions continues to widen, we look forward to seeing future research in the area of MBC set functions.
References
- Blundell et al. (2015). Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622.
- Brown et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- Bruno et al. (2021). Mini-batch consistent slot set encoder for scalable set encoding. Advances in Neural Information Processing Systems 34.
- Chen et al. (2017). Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1907–1915.
- Deng et al. (2009). Imagenet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- Gal and Ghahramani (2016). Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
- Gal et al. (2017). Concrete dropout. Advances in Neural Information Processing Systems 30.
- Guo et al. (2017). On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Hendrycks and Dietterich (2019). Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.
- Hinton et al. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
- Hubert and Arabie (1985). Comparing partitions. Journal of Classification 2, pp. 193–218.
- Jaegle et al. (2021). Perceiver: general perception with iterative attention. In International Conference on Machine Learning, pp. 4651–4664.
- Kim (2021). Differentiable expectation-maximization for set representation learning. In International Conference on Learning Representations.
- Lee et al. (2019). Set transformer: a framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753.
- Locatello et al. (2020). Object-centric learning with slot attention. Advances in Neural Information Processing Systems 33, pp. 11525–11538.
- Mialon et al. (2020). A trainable optimal transport embedding for feature aggregation and its relationship to attention. arXiv preprint arXiv:2006.12065.
- Minderer et al. (2021). Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems 34.
- Murphy et al. (2018). Janossy pooling: learning deep permutation-invariant functions for variable-size inputs. arXiv preprint arXiv:1811.01900.
- Ovadia et al. (2019). Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems 32.
- Rahimi and Recht (2007). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems 20.
- Ren et al. (2022). Benchmarking and analyzing point cloud classification under corruptions. arXiv preprint arXiv:2202.03377.
- Vaswani et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30.
- Vinh et al. (2010). Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. The Journal of Machine Learning Research 11, pp. 2837–2854.
- Wu et al. (2015). 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920.
- Zaheer et al. (2017). Deep sets. Advances in Neural Information Processing Systems 30.
- Zhang et al. (2019). Fspool: learning set representations with featurewise sort pooling. arXiv preprint arXiv:1906.02795.
- Zhou et al. (2021). A review of deep learning in medical imaging: imaging traits, technology trends, case studies with progress highlights, and future promises. Proceedings of the IEEE.
Appendix A Appendix
We will briefly describe the contents of each section of this appendix below:
- Appendix B: Extra information and results related to MoG Amortized Clustering.
- Appendix C: Details of the experiment depicted in Figure 2.
- Appendix D: A note on MBC testing of the Set Transformer.
- Appendix E: Details on ImageNet amortized clustering.
- Appendix F: A note on the UMBC attention softmax stability.
- Appendix G: Training parameters/setup.
- Appendix H: Model hyperparameters/setup.
- Appendix I: Additional ablation study results/discussion.
- Appendix J: Additional results/discussion for ModelNet40 experiments.
- Appendix K: Extra results augmenting MBC models with UMBC.
- Appendix L: Societal Impacts.
- Appendix M: Limitations and Future Work.
- Appendix N: Attention Activation Effects on Calibration.
Appendix B Details on the Mixture of Gaussians Amortized Clustering Experiment
We used a modified version of the MoG amortized clustering dataset which was used by Lee et al. [2019]. We modified the experiment, adding random variance into the procedure in order to make a more difficult dataset. Specifically, to sample a single task for a problem with classes,
- Sample set size for the batch .
- Sample class priors .
- Sample class labels for .
- Generate cluster centers for and .
- Generate cluster covariances for and . Then make a covariance matrix for each class with as the diagonal.
- Sample data
In our MoG experiments, we set .
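The sampling steps above can be sketched as follows. This is a minimal illustration only: the specific distributions and ranges (Dirichlet priors, uniform centers and variances, the set-size bounds) and the function name `sample_mog_task` are placeholder assumptions, not the settings used in the experiments.

```python
import numpy as np

def sample_mog_task(rng, num_classes=4, dim=2, min_size=100, max_size=500):
    """Sample one Mixture-of-Gaussians clustering task.

    Every distribution and range here is a hypothetical placeholder.
    """
    # 1. Sample the set size.
    n = int(rng.integers(min_size, max_size + 1))
    # 2. Sample class priors (assumed: flat Dirichlet).
    priors = rng.dirichlet(np.ones(num_classes))
    # 3. Sample a class label for every set element.
    labels = rng.choice(num_classes, size=n, p=priors)
    # 4. Generate one cluster center per class (assumed: uniform box).
    centers = rng.uniform(-4.0, 4.0, size=(num_classes, dim))
    # 5. Generate per-class diagonal covariances (assumed: uniform range).
    variances = rng.uniform(0.1, 1.0, size=(num_classes, dim))
    # 6. Sample data from the Gaussian of each element's class.
    noise = rng.standard_normal((n, dim))
    x = centers[labels] + np.sqrt(variances[labels]) * noise
    return x, labels, centers, variances
```

Randomizing the per-class variances in step 5 is what makes each task differ in cluster shape as well as location.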
The motivational example in Figure 3 also used the MoG dataset, and utilized mini-batch testing of the Set Transformer following the procedure outlined in Appendix D.
Appendix C Measuring the Variance of Pooled Features
In Figure 2, we show the direct quantitative effect on the pooled representation when using the original Set Transformer and the Set Transformer with our UMBC module added (UMBC+Set Transformer). The UMBC model variance is always effectively 0, while the Set Transformer gives different results for different set partition chunk sizes. The downward slope of the Set Transformer line can be explained by the fact that as the chunk size gets larger, the pooled representation becomes closer to that of the full set. The procedure for mini-batch testing of the Set Transformer is outlined in Appendix D.
To perform this experiment, we used a randomly initialized model with 128 hidden units, and sampled a random normal input with a set size of 1024, . We then created 100 random permutations of the set elements of the input and split each permutation into partitions with various chunk sizes where the cardinality . We then encoded the whole set for each chunk size and report the observed variance between the 100 different random partitions at the various chunk sizes in Figure 2. Note that the encoded set representation is a vector, while Figure 2 shows a scalar value. To obtain this scalar, we take the feature-wise variance over the 100 encodings and report the mean over the features. Specifically, with representing all 100 encodings, , with . We then obtain the y-values in Figure 2 by a simple mean over the feature dimension,
(4) |
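The measurement described above can be sketched in NumPy. The two toy encoders below are hypothetical stand-ins (a sum-pooling encoder that is MBC and a chunk-wise max encoder that is not); they are not the models used for Figure 2 and only illustrate how the scalar variance is computed.

```python
import numpy as np

def partition_variance(encode_in_chunks, x, chunk_size, num_perms=100, seed=0):
    """Mean feature-wise variance of the set encoding over random partitions."""
    rng = np.random.default_rng(seed)
    encodings = []
    for _ in range(num_perms):
        perm = rng.permutation(x.shape[0])           # random permutation of the set
        encodings.append(encode_in_chunks(x[perm], chunk_size))
    z = np.stack(encodings)                          # (num_perms, feature_dim)
    return z.var(axis=0).mean()                      # variance per feature, then mean

def sum_pool_chunks(x, chunk_size):
    """Toy MBC encoder: a running sum does not depend on the partition."""
    total = np.zeros(x.shape[1])
    for start in range(0, x.shape[0], chunk_size):
        total += np.tanh(x[start:start + chunk_size]).sum(axis=0)
    return total

def mean_of_chunk_max(x, chunk_size):
    """Toy non-MBC encoder: max within each chunk, then mean over chunks."""
    chunks = [x[s:s + chunk_size].max(axis=0)
              for s in range(0, x.shape[0], chunk_size)]
    return np.mean(chunks, axis=0)
```

With these stand-ins, the sum-pooling encoder gives a variance at the level of floating-point error for any chunk size, while the chunk-wise max encoder only reaches zero variance when the chunk size equals the full set size.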
Appendix D A Note on MBC Testing of the Set Transformer
In some illustrative experiments (Figures 2 and 3), we apply mini-batch testing to the Set Transformer to study the effects of using a non-MBC model in an MBC setting. The original work does not prescribe a way to do this, so we took the approach of processing each chunk up until the pooled representation that results from the PMA layer. We then performed a mean pooling operation over the chunks in the following way, with representing the final mini-batch pooled features,
(5) |
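As a rough illustration of this procedure, the sketch below pools each chunk independently and then averages the chunk representations. The single-seed `attention_pool` function is a simplified, hypothetical stand-in for a PMA layer, not the Set Transformer implementation.

```python
import numpy as np

def attention_pool(x, seed_vec):
    """Single-seed softmax attention pooling (simplified stand-in for PMA)."""
    logits = x @ seed_vec / np.sqrt(x.shape[1])
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    return weights @ x                                # (feature_dim,)

def minibatch_mean_pool(x, seed_vec, chunk_size):
    """Pool every chunk separately, then take the mean over chunk outputs."""
    pooled = [attention_pool(x[s:s + chunk_size], seed_vec)
              for s in range(0, x.shape[0], chunk_size)]
    return np.mean(pooled, axis=0)
```

Because the softmax is normalized within each chunk, the averaged result generally differs from pooling the full set at once, which is the inconsistency the experiments in Figures 2 and 3 expose.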
Appendix E Details on the ImageNet Amortized Clustering Experiment
Version | ARI |
---|---|
Original features | 45.93±0.12 |
Projected features | 44.09±0.11 |
For the ImageNet amortized clustering experiment outlined in Section 5, we first extracted the features up until the last hidden representation and before the final linear classifier layer of the pretrained and frozen ResNet50. These features are of a large dimension which would create excessively large linear layers for this experiment. Therefore, we projected the features down to a lower dimension using a random orthogonal Gaussian matrix. As this random Gaussian projection is suitable for random feature kernels [Rahimi and Recht, 2007], it should preserve the distances between points required for effective clustering with a marginal effect on overall clustering performance. To validate this assumption, we ran the Oracle model (which computes the empirical cluster mean and diagonal covariance) on both the original features and the projected features and present the results in the table above.
To construct the ImageNet dataset, we first initialized and saved the random Gaussian projection matrix, and proceeded to process the entire ImageNet1k training set with the saved matrix. From these extracted and projected features, we chose a fixed 80/20 split for our train/test sets. Class indices for the train/test sets can be found in the supplementary file.
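One common way to build a random orthogonal Gaussian projection is to orthonormalize a Gaussian matrix with a QR decomposition, as in the sketch below. The output dimension of 512, the seed handling, and the function name are assumptions for illustration; 2048 is the size of the last hidden representation of a standard ResNet50.

```python
import numpy as np

def random_orthogonal_projection(in_dim, out_dim, seed=0):
    """Return an (in_dim, out_dim) matrix with orthonormal columns."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((in_dim, out_dim))
    # Reduced QR: q has the same shape as `gaussian`, with orthonormal columns.
    q, _ = np.linalg.qr(gaussian)
    return q

# Create the projection once, save it, and reuse it for every feature batch.
projection = random_orthogonal_projection(2048, 512)
features = np.random.default_rng(1).standard_normal((16, 2048))  # stand-in features
projected = features @ projection                                 # (16, 512)
```

Fixing the seed (or saving the matrix to disk) ensures the train and test features are projected with the same matrix.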
Appendix F Numerical stability of MBC softmax attention activation
Numerical stability of the softmax requires that the values are not allowed to overflow. Generally this is done by subtracting the maximum value from all softmax logits which allows a stable and equivalent computation.
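$$\mathrm{softmax}(\mathbf{x})_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}} = \frac{e^{x_i - \max_k x_k}}{\sum_{j} e^{x_j - \max_k x_k}} \qquad (6)$$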
This poses a problem when using the plain softmax attention activation, as the maximum in Equation 6 must be taken over the whole set of items, which is unknowable given only the current mini-batch.
Originally, we had devised a special conditional update rule which would maintain the same form as in Equation 6, by tracking the overall max of each row of the attention matrix and then conditionally updating either the current and or the previously stored values from the last processed partition. Those updates needed to be calculated in the exponential space, which caused a propagation of numerical errors through the network that became large enough to interfere with inference. In our experiments, we found it sufficient to calculate the softmax as a simple exponential activation with a subsequent sum over with no consideration for numerical stability. If numerical stability is a concern, one could also set a hyperparameter for the model such that the softmax is calculated with an exponential function such as , which should provide a reasonable solution.
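The idea of replacing the stabilized softmax with a plain exponential and a running sum can be sketched as follows. This is an illustrative NumPy version under assumed shapes and names (`q` for slots, `k` and `v` for projected set elements); it is not the code used in the experiments.

```python
import numpy as np

def full_softmax_attention(q, k, v):
    """Reference: stabilized softmax attention over the whole set."""
    logits = q @ k.T / np.sqrt(q.shape[1])
    logits = logits - logits.max(axis=1, keepdims=True)  # needs the full-set max
    weights = np.exp(logits)
    return (weights / weights.sum(axis=1, keepdims=True)) @ v

def streaming_exp_attention(q, k, v, chunk_size):
    """Accumulate unnormalized exp-weights chunk by chunk, normalize at the end."""
    numerator = np.zeros((q.shape[0], v.shape[1]))
    denominator = np.zeros((q.shape[0], 1))
    for s in range(0, k.shape[0], chunk_size):
        logits = q @ k[s:s + chunk_size].T / np.sqrt(q.shape[1])
        weights = np.exp(logits)                         # no max subtraction
        numerator += weights @ v[s:s + chunk_size]
        denominator += weights.sum(axis=1, keepdims=True)
    return numerator / denominator
```

Since both the numerator and the denominator are plain sums over set elements, the result matches full-set softmax attention for any chunking, as long as the logits stay small enough that the exponential does not overflow.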
Appendix G Training Specification
We use no L2 regularization, except for the ModelNet40 experiments, which use a small weight decay of . This setting was taken from previous experiments by Lee et al. [2019] and Zaheer et al. [2017], who used dropout before and after the pooling layers and other regularization strategies such as gradient clipping to avoid overfitting.
The only experiments which utilized any kind of data augmentation were the ModelNet40 experiments, which used random rotations of the point cloud as is common in preceding experiments [Zaheer et al., 2017, Lee et al., 2019, Bruno et al., 2021].
All single runs of all of our experiments were able to fit on a single GPU with 12GB of memory.
Setting | MoG | ImageNet | ModelNet40 |
---|---|---|---|
Optimizer | Adam | Adam | Adam |
Learning Rate | 1e-3 | 1e-3 | 1e-3 |
Data Augmentation | ✗ | ✗ | ✓ |
Epochs | 50 | 50 | 1000 |
Iters/Epoch | 1000 | 1000 | 9840 |
Appendix H Universal Model Specification
Unless otherwise specified, all universal modules were run with the following model hyperparameter settings in Table 5. The settings for the MoG dataset apply to those in Figure 4, and Figure 5 studies the effects of changing individual settings.
Setting | MoG | ImageNet | ModelNet40 |
---|---|---|---|
Embedder | ✓ | ✗ | ✓ |
Hidden dim | 128 | 256 | 256 |
Num. Slots Per Parallel UMBC | 128 | 32 | 64 |
Slot-type | random | random | random |
Slot LayerNorm | ✓ | ✓ | ✓ |
FF LayerNorm | ✓ | ✓ | ✓ |
Heads | 4 | 4 | 4 |
Slot Dropout Prob. | 0% | 50% | 50% |
Attention Activation | softmax | softmax | softmax |
Slot Residual | ✓ | ✓ | ✓ |
UMBC Num. Parallel | 1 | 4 | 4 |
Test MC Samples | 10 | 100 | 10 |
Slots
Different from both Locatello et al. [2020] and Bruno et al. [2021], we use unique initial slot parameters for each slot such that the set of slots has a separate parameter for each . We do this because the original Slot Attention of Locatello et al. [2020] used a GRU in an inner loop to adapt the single general slot into specific slots for a given task, forcing them to ‘compete’ to capture different parts of the input. We cannot use a GRU, as it violates the MBC property (3.3), so we instead let each slot learn to adapt to the overall data distribution. We always used the same dimension of inputs and slots .
Random Slots
To initialize the random Gaussian slots, we use an initialization strategy similar to that of Blundell et al. [2015] and initialize and . During training, we sample the distribution with reparameterization with .
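A reparameterized slot sample can be sketched as below. The softplus parameterization of the standard deviation follows Blundell et al. [2015]; whether the same parameterization is used here, as well as the names `mu`, `rho`, and `sample_slots`, are assumptions made for illustration.

```python
import numpy as np

def softplus(x):
    """Map an unconstrained parameter to a positive standard deviation."""
    return np.log1p(np.exp(x))

def sample_slots(mu, rho, rng):
    """Draw slots as mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    sigma = softplus(rho)                     # assumed: sigma = log(1 + exp(rho))
    eps = rng.standard_normal(mu.shape)       # noise is independent of parameters
    return mu + sigma * eps
```

Because the noise is sampled separately from `mu` and `rho`, gradients can flow to both parameters when this is written in an automatic differentiation framework.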
Embedder
We found it useful to place a single layer embedding function at the base of UMBC modules which consists of a single linear layer and a ReLU activation function. We used this embedder in all experiments except the ImageNet amortized clustering, as the ResNet feature extractor acted as the embedding function in this case.
Appendix I Additional Ablation Results
In addition to the results in Figure 5, we also ran an experiment looking at the effect of the number of attention heads in the UMBC layer (Figure 10). This result was uninformative, so we chose a stock setting of 4 attention heads in our experiments, as was common in the experiments performed by Lee et al. [2019].
Appendix J Additional ModelNet/ModelNet-C Results
Table 7 shows extra results from the ModelNet point cloud classification task. In this table, we include results for ‘UMBC+SSE’ and ‘UMBC+Deep Sets’ for completeness. While there is a slight decrease in accuracy for both ‘UMBC+SSE’ and ‘UMBC+Deep Sets,’ UMBC improves SSE in terms of NLL and ECE while lowering the performance of Deep Sets. This seems to generally agree with the results in Figure 4, indicating that it is likely unhelpful to add a UMBC module to a model that is already MBC. Instead, the model should first be chosen according to the given task, and UMBC should then be considered if MBC treatment will be necessary.
ModelNet40 is prone to overfitting, and previous experiments in Deep Sets [Zaheer et al., 2017] and Set Transformer [Lee et al., 2019] have used Dropout layers both before and after the pooling function in their encoders. To evaluate the regularization effect of our dropout strategy, the last block of Table 7 includes UMBC models trained without dropout. Training without dropout generally lowers test set performance in all metric categories.
For examples of the corrupted point clouds, we refer the reader to the original work which proposed ModelNet40-C [Ren et al., 2022]. In Figures 14 and 13 we provide additional boxplots for accuracy and NLL metrics which correspond to the ECE metric reported in Figure 6. In Figures 15, 17 and 16 we provide individual boxplots for each individual corruption on accuracy, ECE, and NLL respectively. The aggregate of all of these datapoints forms the boxplots seen in Figures 6, 14 and 13. Size is reduced to avoid excessive page length. Best viewed on screen with a high zoom.
Appendix K Adding the UMBC Module to Existing MBC Functions
Model | NLL | ARI |
---|---|---|
Oracle | 1028.22±1.24 | 44.09±0.11 |
Deep Sets | 531.44±0.15 | 6.18±0.08 |
SSE | 520.29±0.63 | 22.91±1.85 |
Set Transformer | 512.59±0.33 | 17.13±3.67 |
UMBC+Deep Sets | 532.87±0.69 | 6.22±0.18 |
UMBC+SSE | 544.67±3.64 | 16.59±1.26 |
UMBC+Set Transformer | 503.89±0.87 | 23.68±1.85 |
Model | Accuracy (100) | Accuracy (1000) | Accuracy (2048) | NLL (100) | NLL (1000) | NLL (2048) | ECE (100) | ECE (1000) | ECE (2048) |
---|---|---|---|---|---|---|---|---|---|
Deep Sets [Zaheer et al., 2017] | 65.37±1.07 | 88.35±0.32 | 88.72±0.21 | 1.57±0.03 | 0.40±0.01 | 0.40±0.01 | 17.38±0.95 | 4.21±0.27 | 4.02±0.16 |
SSE [Bruno et al., 2021] | 71.09±0.51 | 87.85±0.39 | 87.92±0.42 | 1.42±0.10 | 0.52±0.05 | 0.51±0.06 | 16.69±1.11 | 5.93±1.06 | 5.88±1.17 |
Set Transformer [Lee et al., 2019] | 74.21±1.67 | 87.81±0.44 | 88.17±0.32 | 1.76±0.08 | 0.79±0.08 | 0.78±0.08 | 17.12±0.46 | 7.48±0.62 | 7.37±0.54 |
UMBC+Deep Sets | 71.53±1.03 | 87.52±0.25 | 87.74±0.45 | 1.48±0.09 | 0.61±0.03 | 0.62±0.03 | 16.39±1.52 | 7.53±0.38 | 7.49±0.50 |
UMBC+SSE | 71.03±0.73 | 86.19±0.62 | 86.36±0.46 | 1.11±0.09 | 0.50±0.01 | 0.49±0.01 | 9.67±2.03 | 2.42±0.77 | 2.37±1.10 |
UMBC+Set Transformer | 71.18±1.52 | 86.56±0.49 | 86.77±0.29 | 1.23±0.15 | 0.53±0.03 | 0.51±0.03 | 10.37±2.24 | 2.60±0.19 | 2.35±0.24 |
UMBC+Deep Sets (No Dropout train) | 69.96±0.64 | 87.50±0.21 | 87.58±0.16 | 1.82±0.06 | 0.66±0.02 | 0.64±0.02 | 21.25±0.54 | 8.59±0.32 | 8.51±0.26 |
UMBC+SSE (No Dropout train) | 68.80±1.00 | 84.81±1.17 | 84.89±1.39 | 1.19±0.06 | 0.55±0.04 | 0.54±0.04 | 11.80±2.07 | 3.05±0.64 | 3.02±0.82 |
UMBC+Set Transformer (No Dropout train) | 71.52±0.75 | 86.56±0.47 | 86.61±0.45 | 1.50±0.43 | 0.63±0.14 | 0.62±0.15 | 13.36±4.66 | 4.22±2.04 | 4.28±2.09 |
Appendix L Potential Societal Impacts
We are not aware of any potential negative societal impacts of MBC processing of sets. However, generally speaking, sets are a natural choice for estimating quantities such as population statistics, as in our amortized clustering experiments. In this setting, fairness to all involved groups is an important factor to consider, especially if human well-being is at stake.
Appendix M Limitations & Future Work
UMBC is a bottleneck
UMBC projects the input set to a fixed size, and can therefore be a bottleneck, causing possible loss of information from the input set. An interesting line of research could be an exploration of methods to maximize mutual information between the input set of cardinality and the projected set of cardinality , or an exploration of other forms which a UMBC may take. We look forward to seeing future research in this area.
Train/Test Set Size Variability
In Figure 4, Deep Sets shows the tightest grouping between training set sizes while giving the lowest overall performance. This indicates that more complicated set functions which make pairwise comparisons may be less robust to varying training set sizes, which may be an interesting topic for future research.
Bayesian Slots
In our experiments, we used a similar random slot parameter initialization as Blundell et al. [2015]. Following Bruno et al. [2021], we use no Bayesian prior on these random slots, so the increased performance of random slots is likely due to randomness aiding in exploration of the parameter space rather than learning a proper Bayesian posterior. Future work could explore the effects of incorporating a prior distribution over slots or slot dropout rates (e.g. Concrete Dropout [Gal et al., 2017]). This could lead to further increases in robustness to corruptions and varying set sizes.
Appendix N Attention Activations & Calibration
To test the effect of training with different attention activation functions on calibration, we train and evaluate the UMBC+Set Transformer model on all corruptions of ModelNet40-C in Figures 21, 20 and 19, and on individual corruptions in Figures 24, 23 and 22. Besides the change in attention activation, each model was trained with the same settings as the UMBC+Set Transformer from the corresponding experiments in Figures 14, 6 and 13. Surprisingly, we find that the slot-softmax, originally used by Locatello et al. [2020], delivers strong performance in terms of NLL and ECE, although it gives slightly lower accuracy on the natural, uncorrupted test set.