Semi-Supervised Disentanglement of
Tactile Contact Geometry from Sliding-Induced Shear

Anupam K. Gupta^{1}, Alex Church^{1} and Nathan F. Lepora^{1}

*This work was supported by an award from the Leverhulme Trust on “A biomimetic forebrain for robot touch” (RL-2016-39).

^{1}The authors are with the Department of Engineering Mathematics and the Bristol Robotics Laboratory, University of Bristol, UK {anupam.gupta,ac14293,n.lepora}@bristol.ac.uk

Abstract

The sense of touch is fundamental to human dexterity. When mimicked in robotic touch, particularly with soft optical tactile sensors, it suffers from distortion due to motion-dependent shear. This complicates tactile tasks like shape reconstruction and exploration that require information about contact geometry. In this work, we pursue a semi-supervised approach to remove shear while preserving contact-only information. We validate our approach by showing a match between the model-generated unsheared images and their counterparts obtained by vertically tapping onto the object. The model-generated unsheared images give a faithful reconstruction of the contact geometry otherwise masked by shear, along with a robust estimate of object pose that is then used for sliding exploration and full reconstruction of several planar shapes. We show that our semi-supervised approach achieves comparable performance to its fully supervised counterpart across all validation tasks with an order of magnitude less supervision. The semi-supervised method is thus more efficient in both computation and labelled samples. We expect it will have broad applicability to a wide range of complex tactile exploration and manipulation tasks performed via a shear-sensitive sense of touch.

Fig. 1: Sensor, stimuli, model schematic and object shape reconstructions. (a) TacTip sensor. (b) TacTip internal sensing surface. (c) 3D-printed tactile stimuli. Red circles highlight the approximate location of data collection for training. (d) & (e) Schematic of the model architecture for the self-supervised and supervised training phases respectively. $S_{inp}$ and $U_{inp}$ represent sheared (distorted by motion-induced shear) and unsheared (vertical tap) input samples. $S_{rec}$ and $U_{rec}$ represent sheared and unsheared reconstructed samples. $U_{GT}$ represents unsheared ground-truth samples used for fine-tuning the sheared-unsheared mapping in the supervised phase. P & S are latent representations of the pose (contact-only) and shear components of the sensor response respectively. The $L1_{patch}$ loss, computed between 100 random patches ($20 \times 20$ pix.) of model-generated unsheared images $U_{rec}$ and their ground-truth counterparts $U_{GT}$, is used to enforce local image compliance. Detailed architectures in Fig. 2. (f) Representative examples of full object shape reconstructions. For the full set of results, see Fig. 3.

I Introduction

The sense of touch is ubiquitous in everyday life. It allows us to interact, explore and manipulate the environment around us using physical contact. The physical contact allows us to measure the environment directly, which is not possible with other sensory modalities like vision. However, this physical contact also results in entangling of relevant information like contact geometry with the manner in which the sensor-stimuli contact is established. For example, when sliding or rubbing fingers across tactile stimuli, soft tactile sensors are inevitably distorted by motion-dependent shear, making their response history-dependent and generalization hard across tactile tasks. This necessitates the removal of motion-induced global shear while preserving the sensor distortions due to the contact-geometry or any other attribute of interest. This issue plagues all shear-sensitive tactile sensors, in particular camera-based soft tactile sensors that use the displacement of markers to encode stimuli attributes [1, 2, 3, 4], complicating tactile-dependent tasks like shape modelling and exploration.

Recently, this shear problem was tackled in [5, 6] for sliding exploration of complex 2D or 3D objects, using a supervised deep learning model to learn mappings between shear-distorted data and the target pose. However, learning insensitivity to shear in this way requires training a separate model for each tactile task, such as contact reconstruction or shape exploration. A study by the authors of this paper instead used a supervised deep learning model to remove shear distortion from the original sheared tactile image, so that quantities of interest (e.g. pose) can be predicted from unsheared tactile data in a task-agnostic manner, which is more efficient in computation and data [7]. However, because these approaches are supervised, they carry the costly burden of requiring annotated data for learning. This limitation can be overcome by self- and semi-supervised approaches that learn rich representations from uncurated or unlabelled datasets, and are thus more efficient in their use of labelled samples. This mode of learning is closer to human learning, as we likewise learn with only small amounts of labelled data and a vast resource of self-supervised experiences.

In this work, we propose a semi-supervised learning approach to remove shear from shear-distorted tactile images that uses only a small subset of annotated data (10%) to achieve comparable performance to previous fully-supervised approaches [7]. Self supervision with tactile data differs from computer vision because it lacks attributes like colour, texture and the large-scale structure of natural image data that have been exploited in computer vision to generate supervisory signals without external supervision [8, 9, 10]. Moreover, the present work relies only on tactile data without recourse to techniques that correlate data across multiple sensory modalities to generate training signals without external supervision [11, 12, 13] (see Background). In the presence of these limitations, we achieved self-supervision by first disentangling sliding data into its contact-only and shear components, followed by recombining the shear component with paired and novel canonical (non-sheared or vertical tap) data to generate training signals without external supervision.

Our experiments demonstrated that the proposed model could successfully recover contact geometry from sheared images, allow continuous shape exploration and also allow object reconstruction across multiple planar objects. We also found that our proposed semi-supervised model achieved near identical performance to its fully supervised counterpart [7] while using only 10% of labelled samples for training.

II Background

Soft optical tactile sensors like the TacTip [1] and other marker-based sensors [2, 4] are vulnerable to sliding-induced global shear that distorts the geometry of contact. Local shear caused by the stimulus imprint on the sensor, represented in the lateral displacement of the markers, is superimposed upon a global shear caused by contact motion such as sliding. This distortion, by masking contact geometry, complicates challenging tactile tasks like shape reconstruction and tactile exploration, which require spatial information about the local geometry of an object free from motion-induced distortion.

Recent works that use the TacTip sensor (also used here) addressed this shear problem by using a shear-insensitive pose-prediction deep network to slide around complex planar shapes [5] and complex 3D objects [6, 14]. Shear-insensitive training was achieved by mapping the shear-distorted samples directly to target pose outputs. While effective, this approach required separate training for each new tactile task, for example surface- and contour-following [14], which greatly complicated the hyper-parameter tuning needed for accurate network performance [6], and it could not be applied to shape reconstruction. To overcome these drawbacks, the authors of the present paper instead focused on learning a mapping between shear-distorted tactile images and their unsheared (vertical discrete tap) counterparts with a fully supervised convolutional neural network [7]. Once learned, this mapping can be reused to remove shear for any number of downstream tasks, making the approach efficient in computation and data. However, the supervised nature of that approach necessitated complex labelled-data collection. Here, we improve upon that work by proposing a semi-supervised model that achieves comparable performance to the previous supervised baseline, by training first in a self-supervised phase followed by a brief supervised phase (with a tenth of the labelled data from [7]).

Tactile data differs from natural images in being low-dimensional, owing to the absence of attributes like colour and texture, and in lacking the large-scale global structure of natural image data. This is primarily due to the localised nature of tactile sensing, unlike vision or audition which are global sensing modalities over a scene. This makes it harder to generate a supervisory signal in the absence of external supervision. For example, in computer vision, supervisory signals in prior studies have been generated by: 1) learning to predict withheld data dimensions at the output, as in [8], where two sub-networks are each tasked with predicting one subset of the data channels (colour or lightness) from the other (lightness or colour); 2) image completion conditioned on the immediate surroundings, as in [10], where a missing part of the input image must be predicted at the output from its surroundings; and 3) predicting the correct arrangement of image subsections, as in [9], where the input image is first split into $N$ rectangular tiles whose locations are shuffled to break spatial structure, and the network is then tasked with predicting the original image from the shuffled input. These methods cannot be readily extended to tactile data because of its low dimensionality and lack of global structure, as discussed above.

Another approach to generating supervisory signals in the absence of external supervision is to correlate data from multiple sensory modalities, like audio and vision [11] or touch and vision [12, 13]. In this work, however, we rely on the information encoded by the tactile sensor alone.

To generate a supervisory signal, like in [7], the proposed model learns to disentangle the sensor response components due to contact geometry and motion-induced shear. Our approach, like [7], is inspired by similar works in computer vision, where disentanglement of object attributes and factors of variation is considered important for learning robust representations that improve generalizability to out-of-distribution (OOD) samples or novel scenarios [16]. Several works in computer vision learn disentangled representations via variational autoencoders (VAE) [17, 18] and generative adversarial networks (GAN) [19, 20, 21] for downstream applications such as style transfer between images [22, 19]. Our approach is akin to style transfer, where the input image is decomposed into its content (geometry) and style (colour, intensity, texture) components. Once the content and style components are disentangled, they can be used to transfer style to novel content. In our work, content is akin to the sensor response due to contact geometry and style to the response due to motion-induced shear. We are aware of no prior work exploring semi-supervised learning of disentangled representations for robust representation learning in tactile sensing.

Fig. 2: Network architecture. (a) Self-supervised training: The sheared ($S_{inp}$) to unsheared ($U_{rec}$), unsheared ($U_{inp}$) to sheared ($S_{rec}$) and sheared ($S_{inp}$) to sheared ($S_{rec}$) mappings $M$, $K$ and $N$ are learned via the cycle consistency loss ($L_{cyc}$) [15] and the reconstruction loss ($L_{rec}$). $L_{cycF}$ and $L_{cycB}$ are the components of $L_{cyc}$ corresponding to the forward and backward cycles respectively. (b) Supervised training: The mappings $M$ and $N$ previously learned in (a) are further fine-tuned via the $L_{sup}$ loss ($M$), the $L1_{patch}$ loss ($M$) and the reconstruction loss $L_{rec}$ ($N$). The $L1_{patch}$ loss, computed between 100 random patches ($20 \times 20$ pix.) of model-generated unsheared images $U_{rec}$ and their canonical (non-sheared) counterparts $U_{GT}$, is used to enforce local image compliance.

III Methods

III-A Experimental Setup

The setup used in this work is similar to that of [7], where a soft biomimetic optical tactile sensor, the TacTip [1], is mounted on an industrial robot arm (ABB IRB120) as an end effector. The tactile sensor encodes tactile information in the shear displacement of marker pins on the inner side of the sensing surface, caused by deformation of the soft sensor skin. This TacTip consisted of a 3D-printed soft rubber-like hemispherical dome (40 mm dia., TangoBlack+) with its inner surface covered by 331 hard white-tipped pins arranged in a concentric circular grid (Figs. 1a, b). The soft dome was filled with an optically clear silicone gel to mimic the compliance of a human fingertip. An RGB camera (ELP 1080p module) captures the inner surface of the sensor dome. For more details, we refer to Ref. [1].

For test stimuli, we used seven distinct planar shapes with various morphologies, including two acrylic spiral shapes that differ in frictional properties to the other five 3D-printed ABS plastic shapes (Fig. 3, second row). For model training we used three shapes: the circular disk, clover and teardrop shown in Fig. 1 (c) (red circles show approximate location of data collection). All shapes were securely fastened to the workspace to prevent accidental motion during experiments.

III-B Data Collection

This work proposes a semi-supervised deep learning model, as an alternative to the fully-supervised model proposed in [7], to remove motion-induced shear distortion from tactile images. This motion-induced shear is governed by the manner of contact between the sensor and object, distorting the sensor response induced by the stimulus geometry during contact.

Our semi-supervised model is trained in two phases: self-supervised and supervised. To ease data collection for these training phases, we reused the paired data (tap, or canonical, and sheared data with the same relative pose) previously collected in [7] to train the fully supervised model. We generate the unpaired data used for the self-supervised training phase by randomizing, at every epoch, the pose pairings between the canonical (non-sheared) tactile images $U_{inp}$ (collected by vertically tapping on the stimulus) and the sheared tactile images $S_{inp}$ with random global shear (collected by sliding across the stimulus). This is done for all but a small subset (10%) of the paired data, which is held back for the supervised training phase.

To collect non-sheared canonical tap data, the sensor is brought into contact with the stimulus vertically to minimise global shear. For sliding data, the sensor is slid to the target pose while in contact with the object, starting from a random offset location along the $x$-axis, the $y$-axis or both lateral directions simultaneously. To obtain paired data, the tap and sliding data are collected for the same relative target poses between the sensor and stimulus (Fig. 1c). In total, a set of 200 random poses was collected, with the sensor location relative to the initial contact location sampled from a uniform distribution spanning $[-5, 5]$ mm in the two lateral directions ($x$- and $y$-axes), $[-45, 45]$ deg in yaw ($θ n_{z}$) and $[-6, -1]$ mm in depth ($z$-axis). Finally, the captured tactile images are cropped and subsampled to a $256 \times 256$-pixel region and binarized to minimise any effect of lighting changes inside the sensor.
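The pose sampling and binarization described above can be sketched as follows. This is a minimal illustration only: the dictionary keys, function names and the fixed binarization threshold are assumptions made here, since the text gives the sampling ranges but not the thresholding details.

```python
import random

# Sampling ranges quoted in the text; the key names are hypothetical.
POSE_RANGES = {
    "x_mm": (-5.0, 5.0),
    "y_mm": (-5.0, 5.0),
    "yaw_deg": (-45.0, 45.0),
    "z_mm": (-6.0, -1.0),
}


def sample_poses(n_poses=200, seed=0):
    """Draw n_poses target poses uniformly from POSE_RANGES."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in POSE_RANGES.items()}
        for _ in range(n_poses)
    ]


def binarize(image, threshold=0.5):
    """Threshold a greyscale image (nested lists scaled to [0, 1]) to 0/1.

    The fixed threshold is an assumption; the text does not state one.
    """
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


poses = sample_poses()
```

Each sampled pose would then be visited twice, once by a vertical tap and once by a slide from a lateral offset, to give a paired canonical and sheared sample.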

III-B1 Canonical Data

The canonical tapping dataset had 30,000 samples in total: 50 instances of each of the 200 random poses for the three stimuli shapes used for training (Fig. 1 (c)). The instances were generated by varying the indentation depth ( $z$ -axis) randomly between $[- 1, 1]$ mm from a set indentation depth. This dataset was used as the target (reference) to which the sliding data should be restored.

III-B2 Sheared Data

The sheared data had 90,000 samples in total: 150 instances for each of the 200 random poses for each of the three stimulus shapes (Fig. 1 (c)). To generate instances, the sensor was first brought into contact with the stimulus at a location offset laterally from the target by an amount sampled randomly from $[-2.5, 2.5]$ mm (along the $x$-axis, the $y$-axis or both), and was then slid to the target pose.

III-B3 Training and Test Data

The canonical and sheared data were first partitioned randomly into training, validation and test sets in the ratio 60:20:20. Thus, for each stimulus shape (Fig. 1), 120 poses were assigned to the training set and 40 poses each to the validation and test sets.

The paired training set (canonical-sheared) was further randomly split in the ratio 90:10. In the larger subset (90%), the pairings were shuffled to generate randomly paired (pose-mismatched) data for the first, self-supervised phase of model training. The shuffling was repeated before every training epoch to increase variability in the training data. The other subset (10%) kept the pairing between sheared data and a canonical instance of the same pose, for the supervised phase of training.
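The partitioning above can be illustrated with a short sketch. The function names, the string identifiers standing in for images, and the use of one pair per pose are simplifying assumptions for illustration; the actual datasets hold many instances per pose.

```python
import random


def split_poses(n_poses=200, percents=(60, 20, 20), seed=0):
    """Randomly partition pose indices into training, validation and test sets."""
    ids = list(range(n_poses))
    random.Random(seed).shuffle(ids)
    n_train = n_poses * percents[0] // 100
    n_val = n_poses * percents[1] // 100
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def split_supervision(pairs, labelled_percent=10, seed=0):
    """Hold back a small labelled subset; the rest is used without its pairing."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_labelled = len(pairs) * labelled_percent // 100
    return pairs[n_labelled:], pairs[:n_labelled]


def reshuffle_pairing(pairs, rng):
    """Re-pair every sheared sample with a randomly chosen canonical sample."""
    sheared = [s for s, _ in pairs]
    canonical = [c for _, c in pairs]
    rng.shuffle(canonical)
    return list(zip(sheared, canonical))


train_ids, val_ids, test_ids = split_poses()
# Toy paired data: one (sheared, canonical) identifier pair per training pose.
paired = [(f"sheared_{i}", f"tap_{i}") for i in train_ids]
unlabelled, labelled = split_supervision(paired)
epoch_pairs = reshuffle_pairing(unlabelled, random.Random(1))
```

Calling `reshuffle_pairing` again at the start of each epoch reproduces the per-epoch re-pairing, while `labelled` keeps its pose-matched pairs for the supervised phase.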

III-C Formulation

Our goal is to learn mapping functions between two domains, tap ($T$) and sheared ($S$), given canonical (tap) training samples $\{x_{t}\}_{i=1}^{N}$ where $x_{t} \in X_{t}$ and shear-transformed training samples $\{x_{s}\}_{i=1}^{N}$ where $x_{s} \in X_{s}$. Our model includes three mappings, $M : X_{s} \to X_{t}$, $N : X_{s} \to X_{s}$ and $K : X_{t} \to X_{s}$. Our objective has four loss terms: 1) a reconstruction loss that matches the input $S_{inp}$ and reconstructed $S_{rec}$ sheared samples to learn the mapping $N$; 2) an L2 loss that matches the reconstructed $U_{rec}$ and canonical $U_{GT}$ samples in the supervised training phase to fine-tune the learned mapping $M$; 3) an L1 patch loss that enforces local image similarity between $U_{rec}$ and $U_{GT}$ in the supervised training phase, also to fine-tune $M$; and 4) a cycle consistency loss to learn the mappings $M$ and $K$ and prevent them from contradicting each other.

The encoder $E_{S}$ (Fig. 2) disentangles the latent representation ( $z \in Z$ ) of input sheared samples $S_{i n p}$ into a pose code ( $z_{t} \in Z_{t}$ ) and shear code ( $z_{s} \in Z_{s}$ ). This disentanglement of sensor response not only aids in learning of robust mapping $M$ as shown in [7], but also allows recombination of unpaired shear and pose codes to synthesize novel sheared samples necessary for generating training signals via cycle consistency loss [15] without external supervision.

Cycle Consistency Loss ( $L_{cyc}$ )

We apply the cycle consistency loss $L_{cyc}$ to train the model (Fig. 2) in the self-supervised phase of training. This loss aids learning of the mappings $M$ and $K$, and prevents them from contradicting each other. Enforcing cycle consistency reduces the space of possible mapping functions that can be learned. For each image $x_{s}$ from the domain $S$, the forward cycle should bring the input image $x_{s}$ back to the original image, i.e. $x_{s} \to M(x_{s}) \to K(M(x_{s})) \approx x_{s}$. Similarly, the backward cycle should satisfy $x_{t} \to K(x_{t}) \to M(K(x_{t})) \approx x_{t}$ for each $x_{t} \in T$. This behaviour is enforced via the cycle consistency loss:

	$L_{cyc}(M, K) = E_{x_{s} \sim X_{s}} [\| x_{s} - K(M(x_{s})) \|_{2}] + E_{x_{t} \sim X_{t}} [\| x_{t} - M(K(x_{t})) \|_{2}]$		(1)
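As a rough numerical illustration of Eq. (1), the sketch below evaluates the forward and backward cycle terms with NumPy. The mappings used here are toy stand-ins chosen for this example (a constant intensity offset and its inverse), not the encoder-decoder networks of Fig. 2, and they ignore the shear code that the real mapping $K$ recombines with the pose code.

```python
import numpy as np


def l2_norm(a, b):
    """Euclidean norm of the difference between two images."""
    return float(np.linalg.norm((a - b).ravel()))


def cycle_consistency_loss(M, K, sheared_batch, tap_batch):
    """Batch estimate of Eq. (1): forward (S -> T -> S) plus backward (T -> S -> T)."""
    forward = np.mean([l2_norm(x_s, K(M(x_s))) for x_s in sheared_batch])
    backward = np.mean([l2_norm(x_t, M(K(x_t))) for x_t in tap_batch])
    return float(forward + backward)


# Toy stand-ins for the learned mappings (assumptions for illustration only).
OFFSET = 0.1
M_toy = lambda x: x - OFFSET  # "remove shear"
K_toy = lambda x: x + OFFSET  # "add shear"; exact inverse of M_toy

rng = np.random.default_rng(0)
sheared = rng.random((4, 8, 8))
taps = rng.random((4, 8, 8))

loss_consistent = cycle_consistency_loss(M_toy, K_toy, sheared, taps)
loss_inconsistent = cycle_consistency_loss(M_toy, lambda x: x, sheared, taps)
```

When the two mappings invert each other the loss is essentially zero, whereas replacing `K_toy` with the identity leaves a residual in both cycles, which is the contradiction that the loss penalizes.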

Reconstruction Loss ( $L_{r e c}$ )

We apply the reconstruction loss $L_{r e c}$ to learn the mapping $N : X_{s} \to X_{s}$ between input sheared samples $S_{i n p}$ and reconstructed output sheared samples $S_{r e c}$ through a decoder $D_{S}$ , in both self-supervised and supervised phases of training. This loss aids the disentanglement of sheared-input samples into pose and shear codes:

	$L_{rec}(N) = E_{x_{s} \sim X_{s}} [\| x_{s} - N(x_{s}) \|_{2}]$		(2)

$L_{s u p}$ Loss

We apply an L2 loss ( $L_{s u p}$ ) to fine tune the mapping $M : X_{s} \to X_{t}$ between input sheared samples $S_{i n p}$ and model-generated unsheared images $U_{r e c} (M (x_{s}))$ .

	$L_{sup}(M) = E_{x_{s} \sim X_{s}, x_{t} \sim X_{t}} [\| x_{t} - M(x_{s}) \|_{2}]$		(3)

$L 1_{p a t c h}$ Loss

We apply the $L1_{patch}$ loss ($L_{patch}$) to fine-tune the mapping $M : X_{s} \to X_{t}$. Unlike $L_{sup}$, which compares whole images, this loss enforces image similarity locally, because the change in tactile sensor response with contact conditions is predominantly local in nature. The loss was computed between 100 random crops of size $20 \times 20$ pixels of the model-generated unsheared images $U_{rec}$ (that is, $M(x_{s})$) and their canonical (tap) counterparts $U_{GT}$ (that is, $x_{t}$).

	$L_{patch}(M) = E_{x_{s} \sim X_{s}, x_{t} \sim X_{t}} [\| \mathrm{crops}(x_{t}) - \mathrm{crops}(M(x_{s})) \|_{1}]$		(4)
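A minimal NumPy sketch of the patch loss in Eq. (4) is given below. It assumes that each crop is taken at the same location in both images and that the result is averaged over crops; the text specifies the number and size of crops but not these details, so they should be read as assumptions.

```python
import numpy as np


def l1_patch_loss(u_rec, u_gt, n_patches=100, patch=20, rng=None):
    """Mean L1 distance over random co-located crops of two 2D images."""
    if rng is None:
        rng = np.random.default_rng()
    height, width = u_gt.shape
    total = 0.0
    for _ in range(n_patches):
        # Top-left corner chosen so that the crop stays inside the image.
        r = rng.integers(0, height - patch + 1)
        c = rng.integers(0, width - patch + 1)
        diff = u_gt[r:r + patch, c:c + patch] - u_rec[r:r + patch, c:c + patch]
        total += float(np.abs(diff).sum())
    return total / n_patches


u_gt_example = np.zeros((256, 256))
u_rec_example = np.full((256, 256), 0.5)
patch_loss_example = l1_patch_loss(
    u_rec_example, u_gt_example, rng=np.random.default_rng(0)
)
```

Because every crop covers only $20 \times 20$ pixels, a localized mismatch near the contact region contributes to the loss without being averaged out over the whole $256 \times 256$ image.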

Full Objective

Our full objectives for Encoders $E_{S}$ , $E_{T}$ and Decoders $D_{S}$ , $D_{T}$ are:

	$L_{E_{S}} = L_{c y c} (M, K) + L_{r e c} (N) + L_{s u p} (M) + λ \cdot L_{p a t c h} (M)$		(5)
	$L_{E_{T}} = L_{c y c} (M, K)$		(6)
	$L_{D_{T}} = L_{c y c} (M, K) + L_{s u p} (M) + λ \cdot L_{p a t c h} (M)$		(7)
	$L_{D_{S}} = L_{r e c} (N)$		(8)

where $λ = 0.1$ is the relative loss scale factor.
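Eqs. (5)-(8) can be written out as a small helper that combines scalar loss values into one objective per sub-network. The function and key names are hypothetical; passing zero for terms that are inactive in a given phase (for example $L_{sup}$ and $L_{patch}$ during self-supervised training) is an assumption about how the phases would be handled, not something stated in the text.

```python
LAMBDA = 0.1  # relative loss scale factor from the text


def full_objectives(l_cyc, l_rec, l_sup, l_patch, lam=LAMBDA):
    """Combine scalar loss values into per-network objectives, Eqs. (5)-(8)."""
    return {
        "E_S": l_cyc + l_rec + l_sup + lam * l_patch,  # Eq. (5)
        "E_T": l_cyc,                                  # Eq. (6)
        "D_T": l_cyc + l_sup + lam * l_patch,          # Eq. (7)
        "D_S": l_rec,                                  # Eq. (8)
    }


# Arbitrary example values, chosen only to show the bookkeeping.
objectives = full_objectives(l_cyc=1.0, l_rec=0.5, l_sup=0.25, l_patch=2.0)
```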

The fully supervised counterpart [7] of the above model is akin in architecture and training methodology to the supervised phase of the proposed model, except that it uses ten times as much annotated data (10$\times$) for training.

III-D Implementation

III-D1 Network Architecture

Shear Encoder

The shear encoder $E_{S}$ compressed the sheared input $S_{i n p}$ ( $256 \times 256$ -pixel tactile image) using four convolutional (Conv) layers, each followed by batch normalization (BN) and rectified linear unit (ReLU) activation layers respectively (architecture in Fig. 2). The convolution layers had 32, 64, 64 and 128 filters respectively. The output of the last convolutional layer was passed to two additional Conv layers whose outputs represent the two latent codes: Pose ( $8 \times 8 \times 64$ ) and Shear ( $8 \times 8 \times 64$ ).

Pose (Tap) Decoder

The pose decoder $D_{T}$ upsampled the pose code to reconstruct the unsheared output $U_{rec}$ ($256 \times 256$-pixel tactile image). It had five transposed convolutional (Tconv) layers, each followed by BN and ReLU activation layers, except the output layer, which used a sigmoid activation function. The five transposed convolution layers had 64, 64, 32, 32 and 1 filters respectively.

Shear Decoder

The shear decoder $D_{S}$ upsampled the pose and shear codes to reconstruct the sheared input as $S_{rec}$ ($256 \times 256$-pixel tactile image). It had the same overall architecture as the pose decoder $D_{T}$ except for the number of filters used in the Tconv layers. The five transposed convolution layers had 128, 64, 64, 32 and 1 filters respectively.

Pose (Tap) Encoder

The pose encoder $E_{T}$ compressed the unsheared output of $D_{T}$ to $8 \times 8 \times 64$ using five Conv layers each followed by a BN and ReLU activation layer. The five convolution layers had 32, 32, 64, 64 and 64 filters.

All encoders and decoders used a kernel of $4 \times 4$ with stride of 2. For detailed architecture, see Fig. 2.
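The layer counts above are consistent with the stated image and code sizes, which can be checked with a little arithmetic. The sketch assumes "same"-style padding, so that each stride-2 convolution halves and each stride-2 transposed convolution doubles the spatial size, and it assumes the two latent-code Conv layers of the shear encoder also use stride 2; neither padding nor this detail is stated explicitly in the text.

```python
import math


def conv_out(size, stride=2):
    """Spatial size after a strided convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)


def tconv_out(size, stride=2):
    """Spatial size after a strided transposed convolution with 'same' padding."""
    return size * stride


def trace(size, n_layers, step):
    """List the spatial size before and after each of n_layers layers."""
    sizes = [size]
    for _ in range(n_layers):
        sizes.append(step(sizes[-1]))
    return sizes


shear_encoder_sizes = trace(256, 5, conv_out)  # 4 shared Conv layers + 1 code layer
pose_encoder_sizes = trace(256, 5, conv_out)   # 5 Conv layers
decoder_sizes = trace(8, 5, tconv_out)         # 5 Tconv layers
```

Under these assumptions both encoders reduce a $256 \times 256$ image to an $8 \times 8$ code, and five transposed convolutions restore the original resolution.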

III-D2 Training Details

The input and output tactile data (binary $256 \times 256$ -pixel images) were scaled to the range $[0, 1]$ . The network weights were initialized from a zero-centered normal distribution with standard deviation 0.02.

For training, a batch size of 32 was used. All network layers used L1/L2 regularization, along with random image shifts of 1-2%, to prevent overfitting. We used the Adam optimizer [24] with $β_{1}$ = 0.5, $β_{2}$ = 0.999 and a learning rate of $5 \times 10^{-5}$ for the self-supervised (SS) and $2.5 \times 10^{-5}$ for the supervised (S) training phases respectively. The model was trained for 50 epochs in the SS phase and for another 50 epochs in the S phase. We used learning rate scheduling for the S training phase, with the learning rate reduced to one-tenth after 25 epochs. The training and optimization of the networks were implemented in the TensorFlow 2.0 library on an NVIDIA GTX 1660 (6 GB memory) hosted on an Ubuntu machine.
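The step schedule for the supervised phase can be expressed as a one-line function. The function name is hypothetical and zero-based epoch indexing is an assumption; the text only says that the rate drops to one-tenth after 25 epochs.

```python
def supervised_lr(epoch, base_lr=2.5e-5, drop_epoch=25, factor=0.1):
    """Learning rate for a zero-based epoch index in the supervised phase."""
    return base_lr * factor if epoch >= drop_epoch else base_lr


schedule = [supervised_lr(epoch) for epoch in range(50)]
```

A function of this form could, for instance, be wrapped in a scheduler callback such as `tf.keras.callbacks.LearningRateScheduler`, although the text does not say how the schedule was actually implemented.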

IV Experimental Results

IV-A Disentanglement of Latent Representations

We performed an ablation study to verify the successful disentanglement of the sensor response components due to the geometry of contact (pose code) and due to motion-induced shear (shear code). To do this, we reconstructed unsheared samples ($U_{rec}$) from the shear code instead of the pose code. This resulted in a $\sim$5-fold increase in mean-squared error (MSE) between test-set model-generated unsheared images $U_{rec}$ and their canonical counterparts $U_{GT}$ (0.11 from 0.024), as well as a $> 70$% reduction in image similarity (SSIM) (18% from 92.5%), indicating that the two images are structurally very dissimilar [27]. This shows that the information needed to reconstruct unsheared samples ($U_{rec}$) resides in the pose code. Similarly, we tested the reconstruction of sheared samples ($S_{rec}$) from only the pose code, which resulted in a 60-fold increase in MSE between the input $S_{inp}$ and the reconstructed $S_{rec}$ sheared samples, and a $> 70$% reduction in SSIM between $S_{inp}$ and $S_{rec}$ (18% from 93%), showing that both the shear and pose codes are required for reconstruction of $S_{rec}$. These two experiments jointly demonstrate the successful disentanglement of the sensor response by our model (Fig. 2).
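The fold change and relative reductions quoted above follow directly from the reported values, as this short check shows. Only the numbers stated in the text are used; the helper names are introduced here for illustration.

```python
def fold_change(new, old):
    """Ratio of the new value to the old one."""
    return new / old


def percent_reduction(new, old):
    """Relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


mse_fold = fold_change(0.11, 0.024)                # about 4.6, i.e. roughly 5-fold
ssim_drop_unsheared = percent_reduction(18.0, 92.5)  # about 80.5%
ssim_drop_sheared = percent_reduction(18.0, 93.0)    # about 80.6%
```

Both SSIM reductions are therefore about 80% in relative terms, comfortably above the 70% bound stated in the text.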

IV-B Motion-induced Shear Removal from Tactile Images

First, we used multi-scale SSIM [27] to test the effectiveness of our approach in removing motion-induced global shear. To do so, we computed the average image similarity between the model-generated unsheared images $U_{rec}$ and their test-set sheared $S_{inp}$ and canonical (non-sheared) $U_{GT}$ counterparts. We found a closer match between $U_{rec}$ and $U_{GT}$ (92.5% with 10% supervision, dropping to 74% in the absence of any supervision) than between $U_{rec}$ and $S_{inp}$ (32%). The corresponding values for the fully supervised baseline [7] were 93% and 32% respectively. Thus, our model not only successfully removes global shear from tactile images but is also economical with labelled data (using just 10% of that needed by the previous supervised method [7]). Performance degraded in both the supervised baseline [7] and the semi-supervised approach proposed in this work when the labelled data used for training was reduced. For example, a 20% reduction in labelled training data led to a drop of approximately 2% (from 93%) and 0.5% (from 92.5%) in image similarity between the model-generated unsheared images and their test-set canonical (non-sheared) counterparts for the supervised baseline [7] and the proposed semi-supervised approach respectively.

We also found that omitting the $L1_{patch}$ loss leads to a 1.5% drop in image similarity between test-set canonical images $U_{GT}$ and their model-generated unsheared counterparts $U_{rec}$ (91.5% from 93%) at 10% supervision.

| Pose Prediction Error | Tap Data | Sheared Data | Unsheared Data (Supervised) | Unsheared Data (Semi-Supervised) |
|---|---|---|---|---|
| horizontal, $τ_{x}$ (mm) | 0.43 | 2.72 | 0.64 | 0.71 |
| yaw, $θ n_{Z}$ (degrees) | 2.13 | 22.20 | 4.38 | 4.81 |

TABLE I: Mean Absolute Error (MAE) of Pose Prediction

IV-C Local Contact Geometry Reconstruction

Motion-induced shear masks the contact geometry, so global shear must be removed for its faithful reconstruction (Fig. 3, top row). To reconstruct the approximate indentation field, we used a Voronoi-based method [23] that transforms the inner sensor surface with tactile markers into hexagonal cells, thus tessellating the grid of the markers [23, Fig. 4]. The change in cell areas from an undeformed reference represents the magnitude of local skin deformation, which correlates with the indentation due to contact geometry.
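The area-change measure at the heart of this method can be illustrated for a single cell. The sketch below uses the shoelace formula on an idealized regular hexagon; it is not the Voronoi tessellation of [23] itself, and the hexagon helper and the 10% expansion are assumptions chosen only to make the arithmetic concrete.

```python
import math


def polygon_area(vertices):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def hexagon(cx, cy, radius):
    """Vertices of a regular hexagon, standing in for one Voronoi cell."""
    return [
        (cx + radius * math.cos(math.pi / 3 * k),
         cy + radius * math.sin(math.pi / 3 * k))
        for k in range(6)
    ]


def relative_area_change(cell, reference_cell):
    """Fractional change in cell area relative to the undeformed reference."""
    reference_area = polygon_area(reference_cell)
    return (polygon_area(cell) - reference_area) / reference_area


reference = hexagon(0.0, 0.0, 1.0)
deformed = hexagon(0.0, 0.0, 1.1)  # markers pushed apart by 10%
change = relative_area_change(deformed, reference)
```

A 10% linear expansion of the cell gives a 21% increase in area; computed for every cell, such values serve as the height values of the fitted surface.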

A 3D surface was fitted to the $(x, y)$ centroid coordinates of the Voronoi cells with cell areas as the corresponding height values (Fig. 3, top row). The fitted surface from the test sheared data $S_{inp}$ is visibly distorted by shear (Fig. 3, top row). Once global shear is removed in the model-generated unsheared images $U_{rec}$, the fitted surface closely resembles the true representation from the original canonical (non-sheared) data $U_{GT}$ and the unsheared images obtained via the supervised baseline [7] across multiple stimuli (Fig. 3, top row). This again demonstrates the effectiveness of our approach in removing global shear vis-à-vis the supervised baseline [7].

Fig. 3: Contact geometry reconstructions, contour following and full object shape reconstructions. Top row: Local contact geometry reconstruction. The approximate contact geometry can be recovered from sheared data using the model-generated unsheared images but not from the original sheared data. Middle row: Trajectories, overlaid on the objects, under robust sliding using pose estimation from model-generated unsheared images. Red and green trajectories correspond to the semi-supervised and supervised [7] approaches respectively. The red dot is the starting point of the trajectory. For both contact reconstruction and contour following, the supervised [7] and semi-supervised approaches achieve comparable performance. Bottom row: Full object reconstructions from combining the unsheared local contact geometry reconstructions along the sliding trajectories.

IV-D 2D Continuous Shape Exploration

We further confirmed the model performance using a pose prediction network, PoseNet, trained on canonical (non-sheared) images and previously used in [7]. We computed the mean absolute error (MAE) between the target pose and the pose (lateral position $τ_{x}$ and in-plane rotation $θ n_{z}$) predicted from test-set model-generated unsheared images $U_{rec}$, sheared images $S_{inp}$ and canonical (non-sheared) images $U_{GT}$. The prediction error with $S_{inp}$ was roughly 4-fold higher than with $U_{rec}$ (3.8-fold in $τ_{x}$ and 4.6-fold in yaw, Table I), showing the successful removal of global shear. Our results also show that the model-generated unsheared images from the present model and from the fully supervised baseline [7] have similar prediction errors (Table I).
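For reference, the MAE metric and the error ratios implied by Table I can be reproduced in a few lines. The dictionary layout and key names are assumptions made for this sketch; the numerical values are those reported in Table I.

```python
def mean_absolute_error(predicted, target):
    """Mean absolute difference between predicted and target pose values."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# MAE values reported in Table I (mm for horizontal, degrees for yaw).
TABLE_I = {
    "horizontal_mm": {"tap": 0.43, "sheared": 2.72,
                      "supervised": 0.64, "semi_supervised": 0.71},
    "yaw_deg": {"tap": 2.13, "sheared": 22.20,
                "supervised": 4.38, "semi_supervised": 4.81},
}

# How many times larger the error is on sheared images than on the
# semi-supervised model's unsheared images.
error_ratios = {
    name: row["sheared"] / row["semi_supervised"]
    for name, row in TABLE_I.items()
}
```

The ratios come out at about 3.8 for the horizontal component and about 4.6 for yaw.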

To test the real-time performance and model generalizability to novel situations, we used the PoseNet [7] for sliding exploration of various planar shapes (Fig. 3, second row). This testing included four novel shapes absent from training, including two acrylic shapes (spirals 1 and 2, Fig. 3, second row) whose friction differs from that of the 3D-printed stimuli. In addition, for all test shapes, we tested the performance by varying the location of initial contact, depth of indentation, exploration speed, and also the relative location between the object and the sensor tip. The robot successfully traced contours around all test shapes (Fig. 3, second row, red) by a combination of model-generated unsheared images, PoseNet for prediction of pose parameters and a simple servo control policy [7], demonstrating successful removal of global shear. Thus, our model generalizes well to novel experimental conditions. Moreover, comparison of the contours obtained with the current approach (red) and the supervised baseline (green) [7] shows near-identical performance (Fig. 3, second row).

IV-E 2D Object Reconstruction

As a final test, we demonstrate our model’s utility in removing global shear from sheared images by reconstructing full object shapes. This required: 1) faithful contact-geometry reconstruction and 2) successful contour tracing, with both tasks adversely impacted by motion-induced global shear. To obtain full object reconstructions, we first fused together the contact information extracted from unsheared images of sliding contacts recorded during shape exploration, and then interpolated it on a rectangular grid (Fig. 3, third row) to obtain smooth object reconstructions from discrete sliding contacts. These results demonstrate not only the successful removal of global shear but also the effectiveness of learning the sheared-unsheared mapping once, for later reuse on multiple downstream tasks. The faithful object reconstructions with successful shape exploration across multiple shapes and experimental conditions discussed in Section IV-D also show that our model generalizes to novel situations (Fig. 3, second and third rows).

V Discussion

This work proposed a semi-supervised approach that preserves the sensor deformations due to stimulus contact geometry while removing the distortion caused by motion-dependent shear. We showed that the proposed approach achieves comparable performance to its fully supervised counterpart [7] with an order of magnitude less annotated data, simplifying data collection considerably.

We validated our approach, as in [7], by: 1) demonstrating a good match (92.5%) between the model-generated unsheared images (from sliding contacts) and their non-sheared counterparts taken from vertical tapping, using the structural similarity index measure; 2) reconstructing an approximate stimulus contact geometry from model-generated unsheared images that was previously masked by sliding-induced shear (Fig. 3, top row); 3) predicting local object pose from a pose prediction network [7] trained on non-sheared (tapping) data (Table I), with this pose prediction then used for sliding exploration of multiple planar objects (Fig. 3, second row); and 4) combining 2) and 3) to reconstruct full object shapes for several planar objects (Fig. 3, third row).
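For readers unfamiliar with the structural similarity index, the sketch below computes it in its simplest form: a single window covering the whole image, at a single scale, with the commonly used constants K1 = 0.01 and K2 = 0.03. This is an assumption made for illustration. It is not the windowed or multiscale variant of [27] and is not expected to reproduce the 92.5% figure reported above.

```python
def global_ssim(img_a, img_b, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    img_a, img_b : 2D lists (rows of pixel intensities).
    Returns a value in [-1, 1]; 1 means the images are identical.
    """
    a = [float(v) for row in img_a for v in row]
    b = [float(v) for row in img_b for v in row]
    if not a or len(a) != len(b):
        raise ValueError("images must be non-empty and of equal size")
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilises the contrast/structure term
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

In the validation setting, `img_a` would be a model-generated unsheared image and `img_b` the corresponding vertical-tap image. Practical implementations usually average the index over local windows rather than using one global window.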

One limitation of our study is that it considered only planar objects. We expect our approach will extend to 3D objects, as in recent work that has successfully implemented tactile servoing to slide over complex 3D surfaces (3 pose components) and edges (5 pose components) [5, 6]. A further extension would be to sliding shape exploration and full 3D object reconstruction, which are open problems under active investigation [25, 26]. This would open interesting avenues for future research, such as: 1) the fusion of visual and tactile sensory modalities for robust object representations and 2) tactile object recognition by exploiting prior visual experience of the object. An additional limitation is the requirement for annotated data, although much less is needed than for the fully supervised counterpart [7]. An unsolved problem for future work is to eliminate external supervision completely. Finally, our training was limited to translational shear even though servo control introduces rotational shear from the angular motion while sliding over objects. That the methods worked well for both sliding exploration and contact geometry reconstruction without any attempt to compensate for rotational shear suggests that some shear removal carries over from translational to rotational shear. Our expectation is that bringing rotational shear into the training will become important when extending the methods to 3D objects.

Overall, we expect that our methods should apply to various other exploration and manipulation tasks using soft optical tactile sensors adversely affected by motion-induced shear.

References

[1] N. F. Lepora, ”Soft Biomimetic Optical Tactile Sensing with the TacTip: A Review,” in IEEE Sensors Journal, vol. 21, no. 19, pp. 21131-21143, 2021.
[2] X. Lin, L. Willemet, A. Bailleul and M. Wiertlewski, ”Curvature sensing with a spherical tactile sensor using the color-interference of a marker array,” 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 603-609.
[3] K. Kamiyama, K. Vlack, T. Mizota, H. Kajimoto, K. Kawakami and S. Tachi, ”Vision-based sensor for real-time measuring of surface traction fields,” in IEEE Computer Graphics and Applications, vol. 25, no. 1, pp. 68-75, Jan. 2005.
[4] C. Sferrazza and R. D’Andrea. ”Design, Motivation and Evaluation of a Full-Resolution Optical Tactile Sensor” Sensors, vol. 19, no. 4, pp. 928, 2019.
[5] N. F. Lepora, A. Church, C. de Kerckhove, R. Hadsell and J. Lloyd, ”From Pixels to Percepts: Highly Robust Edge Perception and Contour Following Using Deep Learning and an Optical Biomimetic Tactile Sensor,” in IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 2101-2107, April 2019.
[6] N. F. Lepora and J. Lloyd, ”Optimal Deep Learning for Robot Touch: Training Accurate Pose Models of 3D Surfaces and Edges,” in IEEE Robotics & Automation Magazine, vol. 27, no. 2, pp. 66-77, June 2020.
[7] A. K. Gupta, L. Aitchison and N. F. Lepora, ”Tactile Image-to-Image Disentanglement of Contact Geometry from Motion-Induced Shear,” 5th Conference on Robot Learning (CoRL), 2021.
[8] R. Zhang, P. Isola and A. A. Efros, ”Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 645-654, 2017.
[9] M. Noroozi and P. Favaro, ”Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles,” In: Leibe B., Matas J., Sebe N., Welling M. (eds) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol. 9910, Springer, Cham.
[10] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell and A. A. Efros, ”Context Encoders: Feature Learning by Inpainting,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2536-2544, 2016.
[11] A. Owens and A. A. Efros, ”Audio-Visual Scene Analysis with Self-Supervised Multisensory Features,” In: Ferrari V., Hebert M., Sminchisescu C., Weiss Y. (eds) Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science, vol. 11210. Springer, Cham.
[12] M. Zambelli, Y. Aytar, F. Visin, Y. Zhou and R. Hadsell, ”Learning rich touch representations through cross-modal self-supervision,” 4th Conference on Robot Learning (CoRL), 2020.
[13] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg and J. Bohg, ”Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks,” 2019 International Conference on Robotics and Automation (ICRA), pp. 8943-8950, 2019.
[14] N. Lepora and J. Lloyd, ”Pose-Based Tactile Servoing: Controlled Soft Touch Using Deep Learning,” in IEEE Robotics & Automation Magazine, vol. 28, no. 4, pp. 43-55, 2021.
[15] J. Zhu, T. Park, P. Isola and A. Efros, ”Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks,” IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2242-2251.
[16] Y. Bengio, ”Deep Learning of Representations: Looking Forward,” In: Dediu, AH., Martín-Vide, C., Mitkov, R., Truthe, B. (eds) Statistical Language and Speech Processing. SLSP 2013. Lecture Notes in Computer Science, vol. 7978. Springer, Berlin, Heidelberg.
[17] I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed and A. Lerchner, ”beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework”, International Conference on Learning Representations (ICLR), 2016.
[18] H. Kim and A. Mnih, ”Disentangling by Factorising”, Proceedings of the 35th International Conference on Machine Learning (PMLR) vol. 80, pp. 2649-2658, 2018.
[19] N. Siddharth, B. Paige, J.-W. van de Meent, A. Desmaison, N. D. Goodman, P. Kohli, F. Wood and P. H. S. Torr, ”Learning Disentangled Representations with Semi-Supervised Deep Generative Models”, 31st Conference on Neural Information Processing Systems (NIPS), 2017.
[20] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever and P. Abbeel, ”InfoGAN: interpretable representation learning by information maximizing generative adversarial nets”, 30th Conference on Neural Information Processing Systems (NIPS), 2016.
[21] I. Jeon, W. Lee, M. Pyeon and G. Kim, ”IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks”, Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-21), vol. 35, no. 9, pp. 7926-7934, 2021.
[22] T. Park, J.-Y. Zhu, O. Wang, J. Lu, E. Shechtman, A. A. Efros and R. Zhang, ”Swapping Autoencoder for Deep Image Manipulation”, 34th Conference on Neural Information Processing Systems (NeurIPS), 2020.
[23] L. Cramphorn, J. Lloyd and N. F. Lepora, ”Voronoi Features for Tactile Sensing: Direct Inference of Pressure, Shear, and Contact Locations,” 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2752-2757, 2018.
[24] D. P. Kingma and J. L. Ba, ”Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2015.
[25] S. Suresh, M. Bauza, K. T. Yu, J. G. Mangelson, A. Rodriguez and M. Kaess, ”Tactile SLAM: Real-time inference of shape and pose from planar pushing,” 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.
[26] M. Bauza, O. Canal and A. Rodriguez, ”Tactile Mapping and Localization from High-Resolution Tactile Imprints,” 2019 International Conference on Robotics and Automation (ICRA), pp. 3811-3817, 2019.
[27] Z. Wang, E. Simoncelli and A. Bovik, ”Multiscale structural similarity for image quality assessment,” The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 1398-1402, 2003.
