Learning to SLAM on the Fly in Unknown Environments: A Continual Learning Approach for Drones in Visually Ambiguous Scenes

Ali Safa$^{1,4}$, Tim Verbelen$^{2,4}$, Ilja Ocket$^{4}$, André Bourdoux$^{4}$, Hichem Sahli$^{3,4}$, Francky Catthoor$^{1,4}$, Georges G.E. Gielen$^{1,4}$

This research has received funding from the Flemish Government (AI Research Program) and the European Union’s ECSEL Joint Undertaking under grant agreement n° 826655 - project TEMPO.

$^{1}$ Faculty of Electrical Engineering (ESAT), KU Leuven, 3001, Belgium
$^{2}$ IDLab, Ghent University, B-9052 Gent, Belgium
$^{3}$ ETRO, VUB, 1050 Brussels, Belgium
$^{4}$ imec, Kapeldreef 75, 3001, Leuven, Belgium
{Ali.Safa, Tim.Verbelen, Ilja.Ocket, Andre.Bourdoux, Hichem.Sahli, Francky.Catthoor}@imec.be, Georges.Gielen@kuleuven.be

Abstract

Learning to safely navigate in unknown environments is an important task for autonomous drones used in surveillance and rescue operations. In recent years, a number of learning-based Simultaneous Localisation and Mapping (SLAM) systems relying on deep neural networks (DNNs) have been proposed for applications where conventional feature descriptors do not perform well. However, such learning-based SLAM systems rely on DNN feature encoders trained offline in typical deep learning settings. This makes them less suited for drones deployed in environments unseen during training, where continual adaptation is paramount. In this paper, we present a new method for learning to SLAM on the fly in unknown environments, by modulating a low-complexity Dictionary Learning and Sparse Coding (DLSC) pipeline with a newly proposed Quadratic Bayesian Surprise (QBS) factor. We experimentally validate our approach with data collected by a drone in a challenging warehouse scenario, where the high number of ambiguous scenes makes visual disambiguation hard.

Multimedia material

A video showing our continual learning SLAM pipeline is provided at http://tinyurl.com/ycyc5upc.

I Introduction

Simultaneous Localisation and Mapping (SLAM) is a fundamental task for autonomous agents such as surveillance and rescue drones [1]. State-of-the-art SLAM systems rely either on handcrafted feature descriptors [2, 3] or on learning-based representations [4, 5] for template matching and loop closure detection. Compared to handcrafted descriptors, learning-based systems perform better in environments with fewer features or with many visual ambiguities (i.e., different scenes that look similar) [5, 6].

Still, most learning-based SLAM systems use traditional deep neural networks (DNNs), where a dataset of the target environment must be available a priori for an offline DNN training phase [5]. Hence, those learning-based systems do not always generalize well to environments unseen during training [4]. This has recently motivated the study of continual learning (CL) for SLAM [4].

In the past years, CL has gained much attention and has been mostly investigated in the context of deep learning, with the aim of training deep neural networks on streams of non-shuffled data which cannot be assumed independent and identically distributed (i.i.d.). This poses significant challenges during training, severely jeopardizing performance [7]. Most notably, research in CL has been mainly devoted to classification tasks, using non-i.i.d. versions of popular datasets such as MNIST, CIFAR10 and ImageNet [7].

Fig. 1: Proposed Continual Dictionary Learning approach for performing SLAM in unknown environments (without any offline pre-training). a) The consecutive images captured by a flying drone are fed in their natural order to a surprise-driven Dictionary Learning and Sparse Coding (DLSC) pipeline. b) The DLSC continuously infers a latent code $\bar{c}_{k}$ corresponding to the current observation $\bar{s}_{k}$. At the same time, DLSC continuously learns a dictionary $\Phi$ when the proposed Quadratic Bayesian Surprise (QBS) factor $S_{2}$ is positive. c) The latent codes $\bar{c}_{k}$ are fed to a RatSLAM back-end [2] to perform loop closure detection through template matching [5].

Fig. 2: Our multi-sensor drone used for data acquisition (RGB camera data and odometry solely used in this work). A view of the warehouse environment (described in [8]) where data was acquired is also shown. The warehouse is equipped with an Ultra Wide Band (UWB) localisation system for ground truth labelling of the drone’s trajectory [8].

More recently, SLAM has also been proposed as an interesting application for CL [4]. Indeed, when adopting a learning-based SLAM system, a robot trained offline on a dataset acquired beforehand could perform extremely poorly in environments that were not captured by the dataset [4], leading to unreliable and unsafe navigation. This issue is especially critical in the context of search and rescue, which requires robust navigation in extreme and harsh environments [1].

Therefore, to enable the deployment of learning-based SLAM systems in unseen environments, this paper proposes continual Dictionary Learning and Sparse Coding (DLSC), a fast and robust CL system for SLAM, and a low-complexity yet high-performance alternative to DNN approaches. The contributions of this paper are:

We propose a new Quadratic Bayesian Surprise (QBS) for modulating DLSC, enabling continual learning.
We experimentally demonstrate our continual learning DLSC-QBS method for performing SLAM in the important context of environments not captured by any dataset beforehand, without any offline pre-training.

This paper is organized as follows. First, related work is reviewed in Section II. Next, necessary background is given in Section III. Our methods are provided in Section IV and results are presented in Section V. Finally, conclusions are provided in Section VI.

II Related Work

A number of SLAM architectures have been proposed in the past decades, exploring the use of handcrafted features and learned representations [2, 3, 4, 5].

Related to this work, the popular RatSLAM architecture has been proposed as a bio-inspired system modelled following the navigational processes taking place in the rat hippocampus [2]. In contrast to our proposed system, RatSLAM does not use learned representations, but opts for a simple template matching based on raw RGB data in order to detect loop closures. This simple template matching was later shown to be inefficient in environments with lots of ambiguous views (as in the case of our work) [5, 6].

More recently, the LatentSLAM architecture has been proposed in [5], outperforming RatSLAM through the use of a DNN encoder outputting latent codes for template matching, trained offline as a variational autoencoder. In contrast to LatentSLAM, our CL system does not require any offline training phase, enabling its deployment in unseen environments (not captured by a dataset available a priori).

In this work, we also deviate from most learning-based SLAM systems by using a lower-complexity, unsupervised DLSC learning pipeline [10] instead of conventional DNNs [5, 4]. Similar to LatentSLAM [5], we feed the feature descriptors produced by DLSC to the loop closure detection and map correction back-end of the RatSLAM system [2].

In Section V, we benchmark our system against both RatSLAM and LatentSLAM, and against the use of hand-crafted ORB feature matching [9], extensively used in state-of-the-art SLAM systems [3].

III Background Theory

III-A Joint Dictionary Learning and Sparse Coding (DLSC)

DLSC is concerned with the problem of jointly learning a dictionary $\Phi$ of size $N \times M$ and inferring output codes $\bar{c}_{k}$ of size $M$ from a stream of input data $\bar{s}_{k}$ of size $N$ [10]:

$$C, \Phi = \arg\min_{C, \Phi} \sum_{k=1}^{K} \frac{1}{2} \|\Phi \bar{c}_{k} - \bar{s}_{k}\|_{2}^{2} + \lambda_{1} \|\bar{c}_{k}\|_{1} \tag{1}$$

where $C = [\bar{c}_{1}, \bar{c}_{2}, \dots, \bar{c}_{K}]$ contains all output vectors $\bar{c}_{k}$. The first term in (1) seeks to minimize the re-projection error between the output sparse code $\bar{c}_{k}$ and the input $\bar{s}_{k}$ through the dictionary $\Phi$. The second term provides $l_{1}$ regularization, controlling the sparsity of the codes (as in LASSO [10]).

Since our goal is to use the codes $\bar{c}_{k}$ as global descriptors for performing template matching in SLAM systems, we consider an under-complete dictionary $\Phi$ with $M \ll N$ to infer a lower-dimensional latent code for each input $\bar{s}_{k}$ [5].

Conventionally, it is assumed that the input sequence $\bar{s}_{k}$ originates from a shuffled dataset (i.e., the realizations $\bar{s}_{k}$ are assumed i.i.d.) and the DLSC problem in (1) is classically solved by alternating between a stochastic gradient descent (SGD) step for the learning of $\Phi$ and a proximal descent step for the inference of $\bar{c}_{k}$ [10, 11]. The proximal operator of the $l_{1}$ norm is defined as [11]:

$$\mathrm{Prox}_{\eta_{c} \lambda_{1} \|\cdot\|_{1}}(\bar{x}) = \max(0, \bar{x} - \eta_{c} \lambda_{1}) + \min(0, \bar{x} + \eta_{c} \lambda_{1}) \tag{2}$$
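As a minimal sketch, the operator in (2) can be applied element-wise as a soft-thresholding function. The function name `soft_threshold` and the sample vector are illustrative assumptions; the values of $\eta_{c}$ and $\lambda_{1}$ are those quoted in the caption of Fig. 4 a).

```python
from typing import List


def soft_threshold(x: List[float], tau: float) -> List[float]:
    """Element-wise proximal operator of tau * ||.||_1, written as in (2)."""
    return [max(0.0, xi - tau) + min(0.0, xi + tau) for xi in x]


eta_c, lambda_1 = 5e-3, 0.5  # coding rate and regularization from Fig. 4 a)
# Entries with magnitude below eta_c * lambda_1 are set to zero,
# the others are shrunk towards zero by that amount.
codes = soft_threshold([0.01, -0.004, 0.001, -0.0005], eta_c * lambda_1)
print(codes)
```

Entries that fall inside the threshold band vanish, which is what produces sparse codes during the proximal descent.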

The DLSC problem (1) is therefore a good starting point for setting up an unsupervised encoder that can jointly infer latent codes while learning features from the environment. Still, conventional DLSC solving via alternating descent [10] (summarized above) does not cover the continual learning case since it requires a shuffled dataset. This issue will be addressed in Section IV with our proposed Algorithm 1.

III-B Continual Learning Challenges

As already stated in Section III-A, the input sequence $\bar{s}_{k}$ is generally assumed to be i.i.d., following the conventional training procedure where a dataset is assumed to be available a priori and shuffled before being fed to the optimisation procedure of Algorithm 1. Violating this i.i.d. assumption as in this work, by feeding video sequences in their natural order, causes significant problems during learning since the stochastic gradients are not representative of the full loss [7]:

$$E\left[\frac{\partial l(\bar{s}_{k}, \Phi)}{\partial \Phi}\right] \nrightarrow \frac{\partial \frac{1}{K} \sum_{k=1}^{K} l(\bar{s}_{k}, \Phi)}{\partial \Phi} \tag{3}$$

where $E$ denotes the expected value, $K$ is the total number of data points and $l(\bar{s}_{k}, \Phi) = \frac{1}{2} \|\Phi \bar{c}_{k} - \bar{s}_{k}\|_{2}^{2} + \lambda_{1} \|\bar{c}_{k}\|_{1}$ is the loss associated with the data point $\bar{s}_{k}$ (used by SGD).

In this work, our goal is to enable continual DLSC in non-i.i.d. data streams such as videos where the consecutive image frames are highly correlated [7] ($k$ is the time index):

$$\bar{s}_{k+1} \approx \bar{s}_{k} \tag{4}$$

Therefore, our goal is to continuously learn $\Phi$ and infer the latent codes $\bar{c}_{k}$ from a video stream without any pre-training and without re-initializing $\bar{c}$ between frames (see Fig. 1). As a challenging, real-world use case, we use the drone setup in Fig. 2 to illustrate the applicability of our method for learning to SLAM on the fly in a highly ambiguous warehouse environment (see Fig. 3).

The high redundancy expressed by (4) poses a significant challenge both for the continual learning aspect [7], due to the SGD bias (3), and for the loop closure detection, due to the high levels of aliasing in the environment, as shown in [6] (see Fig. 3).

Fig. 3: Flight data used in this work. Drone location is hard to disambiguate with RGB (e.g., view 1 vs. 4, 2 vs. 3). Data has been acquired during three different flight sequences: the blue path, the red path and a combination of both.

IV Proposed Method

Fig. 4: Effect of the QBS on DLSC performance and stability. a) Reprojection error obtained after continual dictionary learning on an RGB video sequence captured with our drone setup (images of size $N = 346 \times 260$ ). The sequence contains long episodes of ambiguous views (see Fig. 3) and is therefore challenging due to high over-fitting risks. The use of the QBS factor leads to a lower loss during sequence replay, indicating a better learning performance (parameters: $η_{c} = 5 \times 10^{- 3}$ , $λ_{1} = 0.5$ , $η_{d} = 2 \times 10^{- 3}$ , $N_{c} = 10$ , $N_{d} = 1$ , $M = 64$ in Algorithm 1) b) Without the use of the surprise factor, severe run-away behaviors can happen during the continual learning process (parameters: $η_{c} = 10^{- 2}$ , $λ_{1} = 0.25$ , $η_{d} = 4 \times 10^{- 3}$ , $N_{c} = 10$ , $N_{d} = 1$ , $M = 64$ in Algorithm 1). c) Subset of the dictionaries learned during the CL process.

In order to cast (1) into the continual learning case, we use the DLSC reprojection error $\|\Phi \bar{c}_{k} - \bar{s}_{k}\|_{2}^{2}$ in (1) to reduce the coding problem (inference of $\bar{c}_{k}$ by LASSO proximal descent) to a 2-class likelihood test. Indeed, the DLSC reprojection error is linked to the Gaussian likelihood $p(\bar{s}_{k} | \Phi)$:

$$-\log p(\bar{s}_{k} | \Phi) \sim \|\Phi \bar{c}_{k} - \bar{s}_{k}\|_{2}^{2} \tag{5}$$

This likelihood can be used to perform an inlier-outlier test:

$$\begin{cases} \text{if } p(\bar{s}_{k} | \Phi) \geq \theta & \text{then } H_{0} \\ \text{if } p(\bar{s}_{k} | \Phi) < \theta & \text{then } H_{1} \end{cases} \tag{6}$$

The test (6) can be seen as a binary classification problem inferring whether the current input $\bar{s}_{k}$ corresponds to the model learned in the dictionary $\Phi$ ($H_{0}$) or not ($H_{1}$). This classification is done by testing $p(\bar{s}_{k} | \Phi)$ against a threshold $\theta$. Therefore, we identify two dual problems: A) the original problem that we are seeking to solve (i.e., learning $\Phi$ in a continual manner in order to infer $\bar{c}$) and B) its associated dual problem of learning $\Phi$ in a continual manner in order to classify $\bar{s}$. Hence, by casting the dual problem B) to the continual learning case, we implicitly cast the original problem A) to the continual learning case as well.

In order to transpose B) to the continual case, we must provide robustness to the severe data imbalance caused by the non-stationary stream $\bar{s}_{k}$ (due to the fact that $\bar{s}_{k}$ does not originate from a shuffled dataset where realisations are assumed to be i.i.d., see Section III-B). Therefore, we propose to solve the binary classification problem B) using the well-known Support Vector Classification (SVC) framework [12]:

(7)

where $k^{*}$ is the index of the current time step in the data stream, $θ$ is the threshold used in (6), and $ξ_{j}$ are the slack variables defining the margin of the separation hyperplane.
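For reference, a slack-variable program that uses only the symbols described here, and whose hinge-loss form is exactly (8), can be written as follows. This is an assumed form given as a sketch; the original statement of (7) may contain additional terms (e.g., a regulariser).

```latex
\begin{aligned}
\Phi_{k^{*}} = \arg\min_{\Phi,\, \xi} \quad & \sum_{j=1}^{k^{*}} \xi_{j} \\
\text{s.t.} \quad & \|\Phi \bar{c}_{j} - \bar{s}_{j}\|_{2}^{2} - \log\theta \leq \xi_{j}, \quad j \in H_{0}, \\
& \log\theta - \|\Phi \bar{c}_{j} - \bar{s}_{j}\|_{2}^{2} \leq \xi_{j}, \quad j \in H_{1}, \\
& \xi_{j} \geq 0, \quad j = 1, \dots, k^{*}.
\end{aligned}
```

At the optimum, each $\xi_{j}$ equals the positive part of the left-hand side of its constraint, which yields the $\max(0, \cdot)$ terms of the hinge loss.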

The SVC formulation in (7) is indeed a natural choice when dealing with class imbalance since, in an SVC, not all data points contribute to the synthesis of the decision hyperplane [12] (only a few support vectors do). This prevents the SVC from over-fitting on the redundant data found in our heavily imbalanced video streams.

We can then cast (7) into its hinge loss formulation [12]:

$$\Phi_{k^{*}} = \arg\min_{\Phi} \sum_{j \in H_{0}} \max(0, \|\Phi \bar{c}_{j} - \bar{s}_{j}\|_{2}^{2} - \log\theta) + \sum_{j \in H_{1}} \max(0, \log\theta - \|\Phi \bar{c}_{j} - \bar{s}_{j}\|_{2}^{2}) \tag{8}$$

where the hinge loss $\max(0, x)$ leads to the sparsity property of SVCs, which in turn leads to a large number of input vectors $\bar{s}_{k}$ not playing any role in the synthesis of the SVC hyperplane. This provides the robustness to data redundancy.

In our particular case, we seek to continuously maintain a dictionary $\Phi$ such that all incoming data points are well modelled by this dictionary. Therefore, the generic 2-class formulation in (8) reduces to (9) in our case since all data points must be associated with the $H_{0}$ class to be inliers [13]:

$$\Phi_{k^{*}} = \arg\min_{\Phi} \sum_{j=1}^{k^{*}} \max(0, \|\Phi \bar{c}_{j} - \bar{s}_{j}\|_{2}^{2} - \log\theta) \tag{9}$$

Using (9) in a probabilistic context, we can define the Bayesian surprise factor $S_{1}$ associated with the incoming data point $\bar{s}_{k^{*}+1}$ as the posterior to prior ratio [14]:

$$S_{1}(k^{*}+1) = -\log \frac{p(\Phi_{k^{*}+1} | \bar{s}_{k^{*}+1})}{p(\Phi_{k^{*}})} \sim -\log p(\bar{s}_{k^{*}+1} | \Phi_{k^{*}}) \propto \max(0, \|\Phi_{k^{*}} \bar{c}_{k^{*}+1} - \bar{s}_{k^{*}+1}\|_{2}^{2} - \log\theta) \tag{10}$$

where $S_{1}$ is proportional to the negative log-likelihood via Bayes' theorem. In addition, the formulation in (10) assumes that the negative log-likelihood $-\log p(\bar{s}_{k^{*}+1} | \Phi_{k^{*}})$ is proportional to the hinge loss energy term in (8) under the SVC formulation (9). The surprise factor $S_{1}$ is zero when the hinge loss associated with the incoming data point $\bar{s}_{k^{*}+1}$ is zero, denoting that the data point is already well modelled by the dictionary $\Phi$ and hence does not lead to any surprise.

Still, using $S_{1}$ in (10) is inconvenient since it requires knowledge of the likelihood threshold $\theta$. This is all the more problematic since the optimal choice of $\theta$ could slowly vary over time due to changes in the environment. Under the generic assumption of a high-enough camera frame rate (4), we now show that the need for explicit knowledge of $\theta$ can be eliminated by introducing a second-order surprise factor $S_{2}$:

$$S_{2}(k^{*}+1) \equiv S_{1}(k^{*}+1) - S_{1}(k^{*}) = -\log \frac{p(\Phi_{k^{*}+1} | \bar{s}_{k^{*}+1})}{p(\Phi_{k^{*}})} + \log \frac{p(\Phi_{k^{*}} | \bar{s}_{k^{*}})}{p(\Phi_{k^{*}-1})} \sim \left. \frac{\delta^{2} \left(-\log p(\Phi_{k} | \bar{s}_{k})\right)}{\delta k^{2}} \right|_{k^{*}+1} \tag{11}$$

which is related to the discrete-time second derivative ( $\frac{δ^{2}}{δ k^{2}}$ ) of the posterior density evolution through time. We thus refer to $S_{2}$ as a Quadratic Bayesian Surprise (QBS) factor.

Intuitively, $S_{2}$ can be understood as the curvature of the posterior density evolution through time, indicating how much a newly received input vector $\bar{s}_{k^{*}+1}$ will accelerate or decelerate the change in posterior $p(\Phi_{k} | \bar{s}_{k})$.

Dropping proportionality constants and using (10) in (11):

$$S_{2} \equiv \max(0, \underbrace{\|\Phi_{k^{*}} \bar{c}_{k^{*}+1} - \bar{s}_{k^{*}+1}\|_{2}^{2}}_{e_{k^{*}+1}} - \log\theta) - \max(0, \underbrace{\|\Phi_{k^{*}-1} \bar{c}_{k^{*}} - \bar{s}_{k^{*}}\|_{2}^{2}}_{e_{k^{*}}} - \log\theta) = \left\{ \begin{array}{lll} 0 & \text{if } e_{k^{*}+1} - \log\theta \leq 0,\ e_{k^{*}} - \log\theta \leq 0 & \text{a)} \\ e_{k^{*}+1} - \log\theta & \text{if only } e_{k^{*}} - \log\theta \leq 0 & \text{b)} \\ \log\theta - e_{k^{*}} & \text{if only } e_{k^{*}+1} - \log\theta \leq 0 & \text{c)} \\ e_{k^{*}+1} - e_{k^{*}} & \text{else} & \text{d)} \end{array} \right. \tag{12}$$

where $e_{k^{*}+1}$ and $e_{k^{*}}$ are the current and past reprojection errors. Four possible outcomes a, b, c and d are identified in (12). Under the high frame rate assumption (4), the probability of outcomes b and c in (12) will be very low since (4) implies that if $e_{k^{*}} - \log\theta \leq 0$ then $e_{k^{*}+1} - \log\theta \leq 0$ most likely holds as well, and vice versa. In addition, outcomes a and d in (12) can be merged under the assumption (4):

$$\text{a) if } e_{k^{*}+1} - \log\theta \leq 0,\ e_{k^{*}} - \log\theta \leq 0 \text{ then } S_{2} = 0 \approx e_{k^{*}+1} - e_{k^{*}} \text{ using } \bar{s}_{k^{*}+1} \approx \bar{s}_{k^{*}} \tag{13}$$

while still observing the conditions $e_{k^{*}+1} - \log\theta \leq 0,\ e_{k^{*}} - \log\theta \leq 0$ of a) in (12). Therefore, the QBS in (11) can simply be computed as follows:

$$S_{2}(k^{*}+1) \approx \|\Phi_{k^{*}} \bar{c}_{k^{*}+1} - \bar{s}_{k^{*}+1}\|_{2}^{2} - \|\Phi_{k^{*}-1} \bar{c}_{k^{*}} - \bar{s}_{k^{*}}\|_{2}^{2} \tag{14}$$

In practice, we estimate the QBS (14) by low-pass filtering $S_{2}$ using a moving average of width $5$ to attenuate noise.
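A minimal sketch of this estimate is given below: the raw QBS is the first difference of consecutive reprojection errors, smoothed by a moving average of width 5. The class name `QBSFilter`, the choice of a raw value of zero for the very first frame, and averaging over fewer samples while the window fills are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Optional


class QBSFilter:
    """Smoothed estimate of S_2 from a stream of reprojection errors."""

    def __init__(self, width: int = 5) -> None:
        self.prev_error: Optional[float] = None       # e_{k*}, unknown at start
        self.window: Deque[float] = deque(maxlen=width)

    def update(self, error: float) -> float:
        # Raw S_2: difference between current and previous reprojection error.
        # Assumption: the first frame contributes a raw value of 0.
        raw = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        self.window.append(raw)
        # Moving average over the last `width` raw values (fewer at start-up).
        return sum(self.window) / len(self.window)


qbs = QBSFilter(width=5)
for e in [4.0, 3.5, 3.4, 3.9, 5.0]:
    s2 = qbs.update(e)
    print(f"error={e:.2f}  smoothed S2={s2:+.3f}")
```

A positive smoothed value indicates that the reprojection error is rising, i.e., that the incoming frames are no longer well explained by the current dictionary.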

We can now use the QBS $S_{2}$ to modulate SGD learning in an online manner using the following dictionary update procedure (with learning rate $\eta_{d}$) in Algorithm 1 (lines 6 to 10):

$$\begin{cases} \text{if } S_{2}(k^{*}+1) > 0, & \Phi_{k^{*}+1} \leftarrow \Phi_{k^{*}} - \eta_{d} \frac{\partial e_{k^{*}+1}}{\partial \Phi} \\ \text{else if } S_{2}(k^{*}+1) \leq 0, & \Phi_{k^{*}+1} \leftarrow \Phi_{k^{*}} \end{cases} \tag{15}$$

Algorithm 1 DLSC-QBS for SLAM (see Fig. 1)

Input: $\bar{s}_{k}\ \forall k$: input vector stream, $\eta_{c}$: coding rate, $\eta_{d}$: learning rate, $\lambda_{1}$: regularization, $N_{c,d}$: number of iterations

Initialization: $\Phi_{ij} \leftarrow \mathcal{N}(0, \sigma_{w})$ (zero-mean normal distribution with std. deviation $\sigma_{w} \sim 0.01$), $\bar{c}_{1} \leftarrow 0$, $S_{2} \leftarrow 1$

1: for $k \in \{1, \dots, T_{end}\}$ (feed data sequence) do
2:     for $i = 1, \dots, N_{c}$ (local coding iterations) do
3:         $\bar{c}_{k} \leftarrow \mathrm{Prox}_{\eta_{c} \lambda_{1} \|\cdot\|_{1}} \{\bar{c}_{k} - \eta_{c} \Phi^{T} (\Phi \bar{c}_{k} - \bar{s}_{k})\}$ (see (2))
4:     end for
5:     for $i = 1, \dots, N_{d}$ (local learning iterations) do
6:         if $S_{2} > 0$ then
7:             $\Phi_{k} \leftarrow \Phi_{k-1} - \eta_{d} (\Phi \bar{c}_{k} - \bar{s}_{k}) \bar{c}_{k}^{T}$
8:         else
9:             $\Phi_{k} \leftarrow \Phi_{k-1}$
10:         end if
11:     end for
12:     Compute $S_{2}$ following (14)
13:     RatSLAM$\{\bar{c}_{k}, \mathrm{odometry}_{k}\}$ // Feed to SLAM
14: end for
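The loop of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function `dlsc_qbs`, the callback `feed_slam` (a hypothetical stand-in for the RatSLAM back-end, here without odometry), the toy input stream and the toy step sizes are all assumptions. The width-5 moving average of $S_{2}$ is omitted for brevity, and $S_{2}$ is kept at 1 after the first frame since (14) needs two errors.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]  # N rows (input size) by M columns (code size)


def soft_threshold(x: Vector, tau: float) -> Vector:
    """Proximal operator of tau * ||.||_1, see (2)."""
    return [max(0.0, xi - tau) + min(0.0, xi + tau) for xi in x]


def residual(phi: Matrix, c: Vector, s: Vector) -> Vector:
    """Reprojection residual Phi c - s."""
    return [sum(p * cj for p, cj in zip(row, c)) - si for row, si in zip(phi, s)]


def dlsc_qbs(
    stream: Iterable[Vector], m: int, eta_c: float, eta_d: float,
    lambda_1: float, n_c: int = 10, n_d: int = 1, sigma_w: float = 0.01,
    seed: int = 0, feed_slam: Optional[Callable[[int, Vector], None]] = None,
) -> Tuple[Optional[Matrix], List[float]]:
    rng = random.Random(seed)
    phi: Optional[Matrix] = None
    c = [0.0] * m            # the code is never re-initialised between frames
    s2 = 1.0                 # S_2 <- 1: learning is enabled for the first frame
    prev_error: Optional[float] = None
    errors: List[float] = []
    for k, s in enumerate(stream, start=1):
        if phi is None:      # Phi_ij ~ N(0, sigma_w), sized from the first frame
            phi = [[rng.gauss(0.0, sigma_w) for _ in range(m)] for _ in s]
        for _ in range(n_c):                     # lines 2-4: proximal coding
            r = residual(phi, c, s)
            grad = [sum(row[j] * ri for row, ri in zip(phi, r)) for j in range(m)]
            c = soft_threshold([cj - eta_c * g for cj, g in zip(c, grad)],
                               eta_c * lambda_1)
        for _ in range(n_d):                     # lines 5-11: gated SGD step
            if s2 > 0:
                r = residual(phi, c, s)
                for row, ri in zip(phi, r):
                    for j in range(m):
                        row[j] -= eta_d * ri * c[j]
        error = sum(ri * ri for ri in residual(phi, c, s))
        # Line 12: unfiltered S_2 as an error difference, cf. (14).
        s2 = 1.0 if prev_error is None else error - prev_error
        prev_error = error
        errors.append(error)
        if feed_slam is not None:                # line 13: hand code to SLAM
            feed_slam(k, list(c))
    return phi, errors


codes: List[Vector] = []
frames = [[1.0, 0.5, -0.5, 0.2]] * 6             # toy, fully redundant stream
phi, errors = dlsc_qbs(frames, m=2, eta_c=0.1, eta_d=0.1, lambda_1=0.01,
                       feed_slam=lambda k, c: codes.append(c))
print(errors)
```

Following the line order of Algorithm 1, the gate at line 6 uses the value of $S_{2}$ computed at the end of the previous frame. Pure-Python loops are only practical for toy sizes; the image sizes used in this work would call for an array library.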

Therefore, by ignoring the effect of redundant data associated to a negative QBS factor, we prevent the over-fitting of the dictionary learning to the local contexts in the video streams. In turn, this leads to the learning of a more diverse set of features in $Φ$ and consequently, to lower reprojection errors during the replay of the input sequence (see Fig. 4).

In addition, we observed during our experiments a useful stability property induced by the use of the QBS in our continual dictionary learning scenario. Indeed, it is known that DLSC problems can suffer from instability [15] and we have observed run-away behaviors as well for cases where the QBS was not being used. In contrast, we did not observe any instability issue when using our QBS-modulated learning rule (15) in Algorithm 1, suggesting a regularizing effect (see Fig. 4 b, where using the QBS induces stability).

V Experimental Results

We evaluate the Mean Absolute Error (MAE) of our DLSC-QBS SLAM approach (see Fig. 1) against the ground truth localisation and mapping, acquired by UWB indoor positioning beacons [8]. As DLSC parameters, we use $η_{c} = 5 \times 10^{- 3}$ , $η_{d} = 1.4 \times 10^{- 3}$ , $λ_{1} = 0.2$ , $N_{c} = 10$ , $N_{d} = 1$ , $M = 64$ in Algorithm 1. These parameters were tuned empirically in order to optimise SLAM performance.

Three different flight sequences are used for performance assessment, with increasing difficulty:

Flight 1: Flying between wall and shelves, which makes the continual learning and loop closure detection easier since the visited environment is more diverse and hence easier to disambiguate (blue path in Fig. 3).
Flight 2: Flying between shelves only, which leads to a larger number of similar and ambiguous views compared to flight 1 (red path in Fig. 3).
Flight 3: Combining flights 1 and 2 (both red and blue paths in Fig. 3 are visited by the drone).

Since the SLAM result and the ground truth do not share the same coordinate system [2], the ground truth coordinates are projected to the SLAM domain via translation and rotation. The translation vector and rotation angle are found via grid search, by minimizing the localisation MAE.
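The alignment step can be sketched as an exhaustive search over a rotation angle and a translation vector applied to the ground truth, keeping the candidate with the lowest localisation MAE. The function names, the grid values and the toy trajectory below are assumptions for illustration; the grid resolution used in the experiments is not specified here.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def transform(points: Sequence[Point], angle: float, tx: float, ty: float) -> List[Point]:
    """Rotate points by `angle` (radians) about the origin, then translate."""
    ca, sa = math.cos(angle), math.sin(angle)
    return [(ca * x - sa * y + tx, sa * x + ca * y + ty) for x, y in points]


def mae_l(slam: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean absolute (L1) error between time-aligned trajectories."""
    return sum(abs(x - gx) + abs(y - gy)
               for (x, y), (gx, gy) in zip(slam, gt)) / len(slam)


def grid_search_alignment(slam: Sequence[Point], gt: Sequence[Point],
                          angles: Sequence[float],
                          shifts: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (best_mae, angle, tx, ty) over the candidate grid."""
    best = None
    for angle, tx, ty in product(angles, shifts, shifts):
        err = mae_l(slam, transform(gt, angle, tx, ty))
        if best is None or err < best[0]:
            best = (err, angle, tx, ty)
    return best


gt = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]             # toy ground truth
slam = transform(gt, math.pi / 2, 1.0, -1.0)          # toy SLAM frame
best = grid_search_alignment(slam, gt,
                             angles=[0.0, math.pi / 2, math.pi],
                             shifts=[-1.0, 0.0, 1.0])
print(best)
```

The cost of the search grows with the product of the three grid sizes, so a coarse grid followed by a finer one around the best candidate is a common refinement.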

The localisation MAE is computed as the average error between the SLAM localisation $(x_{k}, y_{k})$ and the ground truth $(x_{k}^{gt}, y_{k}^{gt})$ on the flight sequence of length $T_{end}$:

$$\mathrm{MAE}_{L} = \frac{1}{T_{end}} \sum_{k=1}^{T_{end}} |x_{k} - x_{k}^{gt}| + |y_{k} - y_{k}^{gt}| \tag{16}$$

In addition, the mapping MAE is computed as the a posteriori deviation between the map obtained at the end of the SLAM process and all the ground truth points:

$$\mathrm{MAE}_{M} = \frac{1}{T_{end}} \sum_{k=1}^{T_{end}} \min_{j} \{ |x_{k} - x_{j}^{gt}| + |y_{k} - y_{j}^{gt}| \} \tag{17}$$
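The two metrics (16) and (17) can be sketched as follows; the function names and the toy trajectories are illustrative assumptions. The example is chosen so that the SLAM trajectory visits the right places in the wrong order, which penalises the localisation MAE but not the mapping MAE.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def localisation_mae(slam: Sequence[Point], gt: Sequence[Point]) -> float:
    """MAE_L of (16): L1 error between time-aligned points, averaged over time."""
    return sum(abs(x - gx) + abs(y - gy)
               for (x, y), (gx, gy) in zip(slam, gt)) / len(slam)


def mapping_mae(slam: Sequence[Point], gt: Sequence[Point]) -> float:
    """MAE_M of (17): L1 distance from each SLAM point to its closest
    ground-truth point, averaged over the SLAM trajectory."""
    return sum(min(abs(x - gx) + abs(y - gy) for gx, gy in gt)
               for x, y in slam) / len(slam)


slam_path = [(0.0, 0.0), (1.0, 1.0)]
gt_path = [(1.0, 1.0), (0.0, 0.0)]       # same places, reversed in time
print(localisation_mae(slam_path, gt_path))  # 2.0
print(mapping_mae(slam_path, gt_path))       # 0.0
```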

Table I quantitatively compares the performance of our approach against i) the original RatSLAM [2]; ii) a VGG-11 pre-trained on ImageNet [16] as feature encoder to RatSLAM; iii) the LatentSLAM, trained specifically for our environment following [5], on a set of 3998 frames acquired independently from flights 1-3; and iv) ORB features for template matching [9] following the Lowe’s Ratio Test [17].

For the ORB features, the similarity between two frames $i, j$ is computed as $s_{ij} = 1 - N_{m} / \mathrm{max}_{m}$ where $N_{m}$ is the number of matches and $\mathrm{max}_{m}$ the maximum number of possible matches. For VGG-11, LatentSLAM and DLSC-QBS, the similarity is computed as $s_{ij} = \frac{\bar{c}_{i}^{T} \bar{c}_{j}}{\|\bar{c}_{i}\|_{2} \|\bar{c}_{j}\|_{2}}$ with $\bar{c}_{i,j}$ denoting the feature descriptor obtained by each method.
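Both similarity measures can be sketched in a few lines. The function names are illustrative, and returning 0 for an all-zero descriptor is an assumption added here to avoid a division by zero (sparse codes can in principle be entirely zero).

```python
import math
from typing import Sequence


def cosine_similarity(ci: Sequence[float], cj: Sequence[float]) -> float:
    """s_ij for learned descriptors: normalised inner product of two codes."""
    dot = sum(a * b for a, b in zip(ci, cj))
    norm = math.sqrt(sum(a * a for a in ci)) * math.sqrt(sum(b * b for b in cj))
    return dot / norm if norm > 0.0 else 0.0   # assumption: 0 for empty codes


def orb_similarity(n_matches: int, max_matches: int) -> float:
    """s_ij for ORB features, as defined in the text: 1 - N_m / max_m."""
    return 1.0 - n_matches / max_matches


print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
print(orb_similarity(25, 100))                    # 0.75
```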

The RatSLAM back-end integrates the raw odometry and detects loop closures by sampling latent codes from the data stream every $100$ ms and by comparing the similarities $s_{ij}$ against a threshold $\mu$ tuned for minimum MAE (see Fig. 1).

TABLE I: $\mathrm{MAE}_{L}$ | $\mathrm{MAE}_{M}$ (the lower the better).

Architecture	Flight 1	Flight 2	Flight 3
RatSLAM [2]	0.72 | 0.22	1.499 | 0.248	2.714 | 1.15
VGG-11 [16]	3.95 | 0.244	2.58 | 1.15	1.775 | 0.266
LatentSLAM [5]	0.54 | 0.21	0.438 | 0.239	1.266 | 0.274
ORB [9]	0.456 | 0.165	0.407 | 0.215	4.258 | 1.595
DLSC-QBS (Ours)	0.588 | 0.1572	0.742 | 0.329	1.596 | 0.244

Fig. 7: Flight 3, both red and blue paths in Fig. 3.

Figs. 5, 6 and 7 show the SLAM trajectories obtained by each method, compared to our proposed DLSC-QBS system. As expected from prior work [5, 6], the original RatSLAM system does not perform well in our ambiguous warehouse environment since its loop closure detection approach, based on the matching of raw image data, is too sensitive to the ambiguities between different views (see Fig. 3).

The pre-trained VGG-11 (deep architecture with 11 layers) used as feature descriptor performs better than the original RatSLAM for Flight 3 (see Fig. 7), but does not perform well overall. This confirms the observations in [5] where finding a good template matching threshold $μ$ was shown to be hard with off-the-shelf DNNs.

LatentSLAM is among the top performers in Table I, but requires the offline training of an 11-layer DNN on a dataset from the specific environment where SLAM must be performed. This makes LatentSLAM unsuited for navigation in unknown environments that are not captured in datasets available beforehand, which was our goal in this work.

The use of handcrafted ORB features, extensively used for SLAM [3], performs well in between the storage aisles (e.g., view 2 in Fig. 3) but can also perform poorly at the end of the aisles, where lighting conditions suddenly change, leading to unreliable feature matching. Still, the SLAM back-end is able to recover the track in most cases (see Fig. 5, 6) but can also fail when, at the same time, the drift in raw odometry becomes too large (see Fig. 7). This leads to large errors during flight 3 in Table I.

On the other hand, our continual learning DLSC-QBS method either outperforms or comes close to the MAE of both the pre-trained LatentSLAM and the ORB features, while a) having a complexity similar to a 1-hidden-layer network; b) not requiring any pre-training; and c) not suffering from the track loss issues encountered with ORB features when lighting conditions suddenly change as the drone exits the aisles (see Fig. 7). Therefore, our approach might be a promising avenue for safety-critical navigation in unknown places where a dataset is not available beforehand.

VI Conclusion

This paper has presented what is, to the best of our knowledge, one of the first continual learning SLAM systems. Our method has been experimentally validated by performing SLAM with a drone in a challenging and visually ambiguous warehouse environment, without any model pre-training, while reporting competitive performance compared to prior systems. We hope that this work will contribute towards safer indoor drones that can adapt to new environments on the fly.

Acknowledgment

We thank Prof. J. Suykens for discussing the QBS formulation of Section IV, and Dr. L. Keuninckx for his support.

References

[1] H. Surmann, D. Slomma, S. Grobelny and R. Grafe, "Deployment of Aerial Robots after a major fire of an industrial hall with hazardous substances, a report," 2021 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), 2021, pp. 40-47.
[2] M. J. Milford, G. F. Wyeth and D. Prasser, "RatSLAM: a hippocampal model for simultaneous localization and mapping," IEEE International Conference on Robotics and Automation, 2004.
[3] R. Mur-Artal, J. M. M. Montiel and J. D. Tardós, "ORB-SLAM: A Versatile and Accurate Monocular SLAM System," in IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, Oct. 2015.
[4] Vödisch, N., Cattaneo, D., Burgard, W., Valada, A. (2022). "Continual SLAM: Beyond Lifelong Simultaneous Localization and Mapping through Continual Learning."
[5] O. Çatal, W. Jansen, T. Verbelen, B. Dhoedt and J. Steckel, "LatentSLAM: unsupervised multi-sensor representation learning for localization and mapping," 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 6739-6745.
[6] Yu, S., Wu, J., Xu, H., Sun, R., Sun, L. (2020). "Robustness Improvement of Visual Templates Matching Based on Frequency-Tuned Model in RatSLAM." Frontiers in Neurorobotics, 14.
[7] M. De Lange and T. Tuytelaars, "Continual Prototype Evolution: Learning Online from Non-Stationary Data Streams," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[8] M. Ridolfi, N. Macoir, J. V. Gerwen, J. Rossey, J. Hoebeke and E. de Poorter, "Testbed for warehouse automation experiments using mobile AGVs and drones," IEEE INFOCOM 2019.
[9] E. Rublee, V. Rabaud, K. Konolige, G. R. Bradski, "ORB: An efficient alternative to SIFT or SURF." ICCV 2011: 2564-2571.
[10] Mairal, J., Bach, F., Ponce, J., Sapiro, G. (2009). "Online Dictionary Learning for Sparse Coding." In Proceedings of the 26th ACM Annual International Conference on Machine Learning (pp. 689–696).
[11] Daubechies, I., Defrise, M. and De Mol, C. (2004). "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint." Comm. Pure Appl. Math., 57: 1413-1457.
[12] Schölkopf, B., Smola, A. (2002). "Learning with kernels: support vector machines, regularization, optimization, and beyond." MIT Press.
[13] Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., Platt, J. (1999). "Support Vector Method for Novelty Detection." In Advances in Neural Information Processing Systems. MIT Press.
[14] Baldi, P., Itti, L. (2010). "Of bits and wows: A Bayesian theory of surprise with applications to attention." Neural Networks: the official journal of the International Neural Network Society.
[15] H. Xu, C. Caramanis and S. Mannor, "Sparse Algorithms Are Not Stable: A No-Free-Lunch Theorem," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 187-193, 2012.
[16] Simonyan, K., Zisserman, A. (2014). "Very deep convolutional networks for large-scale image recognition."
[17] Lowe, D.G. (2004). "Distinctive Image Features from Scale-Invariant Keypoints." International Journal of Computer Vision 60, 91–110.
