Deep Anomaly Detection and Search via Reinforcement Learning

Chao Chen,\equalcontrib¹ Dawei Wang,\equalcontrib² Feng Mao,²
Zongzhang Zhang,¹ Yang Yu ¹

Abstract

Semi-supervised Anomaly Detection (AD) is a kind of data mining task which aims at learning features from partially-labeled datasets to help detect outliers. In this paper, we classify existing semi-supervised AD methods into two categories: unsupervised-based and supervised-based, and point out that most of them suffer from insufficient exploitation of labeled data and under-exploration of unlabeled data. To tackle these problems, we propose Deep Anomaly Detection and Search (DADS), which applies Reinforcement Learning (RL) to balance exploitation and exploration. During the training process, the agent searches for possible anomalies with hierarchically-structured datasets and uses the searched anomalies to enhance performance, which in essence draws lessons from the idea of ensemble learning. Experimentally, we compare DADS with several state-of-the-art methods in the settings of leveraging labeled known anomalies to detect both other known anomalies and unknown anomalies. Results show that DADS can efficiently and precisely search anomalies from unlabeled data and learn from them, thus achieving good performance.

\affiliations

¹ National Key Lab for Novel Software Technology, Nanjing University, China
² Alibaba Inc., China
181840013@smail.nju.edu.cn, david.wdw@alibaba-inc.com, maofeng.mf@alibaba-inc.com,
zzzhang@nju.edu.cn, yuy@nju.edu.cn

Introduction

Anomaly detection (AD) Chandola et al. (2009) is a classical data mining task which aims at detecting data instances that significantly deviate from the majority. Nowadays, with the rapid development of information technology, AD is playing a more and more important role in domains like cybersecurity, healthcare, and finance Nguyen and Reddi (2019); Yu et al. (2021a); Charpentier et al. (2021). In this work, we focus on a specific type of AD, namely tabular semi-supervised AD, where the training dataset is composed of a small labeled dataset and a large unlabeled dataset. We think this kind of task is of great importance in real application scenarios since labeled data is more difficult to obtain than unlabeled data due to privacy, security, and other problemsZhu et al. (2019).

In classical semi-supervised AD tasks, both training and testing sets share the same distribution, which means labeled anomalies in training set contains all classes of anomalies in testing set. However, in real world scenarios, it is very common for the incoming anomalies come from previously unseen distributions, which requires an AD method to have stronger ability in mining unlabeled dataset, e.g., it needs to find unknown anomaly classes in unlabeled dataset to make correct judgments.

We highlight several challenges that need to be addressed in the tabular semi-supervised AD domain:

Robust to contaminated unlabeled data: Real-world datasets often contain contaminated unlabeled data, however, state-of-the-art unsupervised and semi-supervised AD methods suffer robustness issues with incremental contamination ratio Ruff et al. (2019); Yoon et al. (2021).
Utilize known anomalies to detect unknown anomalies: Anomalies from real-word datasets are sometimes diversified Pang et al. (2021b) and incoming new data may include unknown anomalies Villa-Pérez et al. (2021). Semi-supervised AD methods need to leverage known anomalies to explore contaminated unlabeled data efficiently and effectively to find potential unlabeled unknown anomalies.
Efficiently utilize limited labeled anomalies to solve exploration-exploitation dilemma: In real-world datasets, anomalies are very rare. On top of exploring unlabeled data, semi-supervised AD methods need to exploit labeled known anomalies effectively.

Reinforcement learning (RL) Kaelbling et al. (1996) is a machine learning paradigm in which an agent learns an optimal policy through interacting with the environment. Since its birth, RL has attracted lots of attention and shown its effectiveness in many tasks. In recent years, deep RL (DRL) Arulkumaran et al. (2017) takes a step further, which combines RL with deep learning LeCun et al. (2015) and achieves prominent performance in many fields such as games and automatic driving Mnih et al. (2015, 2013); Sallab et al. (2017); Silver et al. (2017).

However, there is still little research on the application of RL to AD. A recently purposed RL-based AD method called DPLAN Pang et al. (2021b) uses an RL setting to explore unlabeled samples based on labeled anomalies. Experiments in the original paper of DPLAN show that it is a successful trial of applying RL to the AD domain. However, we argue that the distance-based search of DPLAN is easy to fall into local optima. In addition, through reproduction we find that it faces problems like overfitting and weak robustness to the contamination of unlabeled dataset.

Regardless of the disadvantages of DPLAN, we still believe that RL will become a powerful tool for AD with a carefully designed environment, since it has a strong advantage in balancing short-term reward (e.g., the reward of the correct judgment of the current data) and long-term reward (e.g., the reward of finding anomalies from unlabeled data), which is also known as balancing between exploitation and exploration. In this paper, we purpose a method called DADS which applies RL to AD effectively. DADS draws the idea of ensemble and has an environment with hierarchical anomaly search mechanism. We conducted comprehensive experiments which prove the capability of DADS in tackling the above mentioned challenges of semi-supervised AD.

The main contributions of this work can be summarized as improvements in scenarios where testing set contains unknown anomaly classes. With the help of RL, DADS integrates the search and refinement of unlabeled dataset into one model. Thus, the issues of contamination and unknown anomaly search existing in most AD methods are well solved.

Background and Related Work

In this section, we first introduce the basic of reinforcement learning (RL) and anomaly detection (AD), then present some classical or latest semi-supervised tabular AD methods.

Reinforcement Learning

In RL, the agent is asked to interact with the environment to deal with a sequential decision-making problem. At each time step $t$ , the agent is capable of sensing the state $s_{t}$ from the environment and then selects an action $a_{t}$ according to policy $π (a | s)$ , given the state $s = s_{t}$ . The chosen action is passed to the environment and executed, which causes a transition to a new state $s_{t + 1}$ and a scalar reward $r_{t}$ . The interaction continues until the task terminates. The objective of RL is to train such an agent to maximize the expected discounted cumulative rewards (a.k.a. return) $E_{π} [\sum_{t} γ^{t} r_{t}]$ , where $γ \in (0, 1]$ is the discount factor.

The standard RL can be formalized as a Markov decision process (MDP). It can be defined as a 4-tuple $⟨ S, A, R, P ⟩$ , where $S$ denotes state space; $A$ denotes action space; $R$ denotes reward function $S \times A \to R, R (s, a) = E [r_{t} | s_{t} = s, a_{t} = a]$ ; and $P$ denotes transition function $S \times A \times S \to [0, 1]$ , $P (s, a, s^{'}) = p (s_{t + 1} = s^{'} | s_{t} = s, a_{t} = a)$ . The whole process can be expressed as a trajectory: $τ = (s_{0}, a_{0}, s_{1}, a_{1} . . . s_{T})$ , which obeys an implicit probability distribution determined by transition dynamics and policy $π$ : $p_{π} (τ) = p (s_{0}) \prod_{t = 1}^{T} π (a_{t - 1} | s_{t - 1}) p (s_{t} | s_{t - 1}, a_{t - 1})$ . The optimization decision problem can be formalized as:

max π J (π) = E_{τ \sim p_{π} (τ)} [r (τ)],

(1)

where $r (τ) = \sum_{t} γ^{t} r_{t}$ is the sum of discounted rewards.

Deep Reinforcement Learning

Deep reinforcement learning (DRL) refers to algorithms that combine RL with deep learning. According to the difference in target network, it can be divided into three categories: value-based, policy-based, and actor-critic algorithms.

Value-based methods use the estimated Q-value to select action. DQN Mnih et al. (2013, 2015) is a classical value-based method. It combines deep learning with Q-learning, and uses a neural network to estimate Q-value. Experiments on Atari show that it can achieve human-level performance, which is a great success. After that, there are many works trying to make improvements, including double DQN Mnih et al. (2016), dueling DQN Wang et al. (2016), etc.

Policy-based methods directly model the relationship between state and optimal action. REINFORCE Williams (1992) is a representative of this category, which uses the Monte-Carlo method to sample trajectories and update parameters. Improved methods include adding a baseline to reduce variance.

Actor-critic methods combine the above two categories. They use actor network and critic network to learn policy function and value function respectively, which can effectively reduce variance and speedup training. Actor-critic Konda and Tsitsiklis (1999) is the most basic method, and by adding baseline we can get the A2C method. A3C Mnih et al. (2016) applies parallel training on the basis of A2C. TRPO Schulman et al. (2015) introduces importance sampling and trust domain constraint in policy update. PPO Schulman et al. (2017) substitutes KL-divergence constraint in TRPO with clipping, which effectively improves its performance. The soft actor-critic (SAC) Haarnoja et al. (2018) method effectively improves the robustness of the algorithm by introducing entropy regularization and dueling Q-network. In this paper, we use SAC as the RL algorithm.

Anomaly Detection

Given a training dataset $X = {x_{1}, x_{2}, \dots, x_{N}}$ , AD aims at learning a feature representation mapping function $τ (\cdot) : X \mapsto Z$ or an anomaly scoring function $ϕ (\cdot) : X \mapsto R$ in a way that anomalies can be easily differentiated in the subspace of $τ$ or $ϕ$ , where $Z \in R^{K} (K ≪ N)$ denotes representation space. Since AD focuses on rare and irregular events, it faces many challenges including unknownness of anomalies, class imbalance, various types of anomalies, etc Pang et al. (2021a). According to the amount of known data, AD can be divided into three categories: supervised, semi-supervised, and unsupervised Ruff et al. (2021). Unsupervised AD is the most studied category, many works have been done in this area, including Isolation Forest Liu et al. (2008), Histogram-based Outlier Score Goldstein and Dengel (2012), One-Class SVM Schölkopf et al. (2001), NCAE Yu et al. (2021b), etc. Most supervised AD methods are classification methods Ruff et al. (2021) including XGBoost Chen and Guestrin (2016), VIME Yoon et al. (2020), etc.

Semi-supervised Tabular Anomaly Detection

In this paper, we focus on semi-supervised tabular AD problems, where both unlabeled samples and a small portion of the labeled anomalies and labeled normal samples are available during training Ruff et al. (2021). Generally, semi-supervised AD methods can be classified into supervised-based and unsupervised-based categories Görnitz et al. (2013).

On top of the supervised learning paradigm, supervised-based semi-supervised AD methods add a bias term to describe the data characterization based on some assumptions, e.g., the cluster assumption Chapelle et al. (2009), in the loss function Pang et al. (2019). Their performance relies heavily on the consistency between the distribution of datasets and the assumptions. The mentioned DPLAN is this method.

On the other hand, unsupervised-based semi-supervised AD methods extend the unsupervised AD methods by exploiting labeled data as a regularization term in the loss function Görnitz et al. (2013); Ruff et al. (2019). A recent work Chang et al. (2022) extends the explainable $G A^{2} M$ model to the AD method with the Partial IDentification (PID) objective Gopalan et al. (2019), and a differentiable AUC loss is added to incorporate labeled data. These methods also exist problems like weak robustness to data contamination, insufficient exploration of unlabeled data, etc.

Our Method

In this section, we introduce our method called DADS in details. Before that, we would like to give a formal statement of the task we aim at. Consider a semi-supervised AD scenario, where two datasets are available:

Anomaly dataset $D^{a}$ : a small dataset containing only anomalies.
Unlabeled dataset $D^{u}$ : a relatively large dataset containing both abnormal and normal data.

Given the above two datasets for training, our aim is to find an anomaly scoring function $ϕ (\cdot) : D \mapsto R$ , such that $ϕ (s_{i}) > ϕ (s_{j})$ , where $s_{i}$ is abnormal and $s_{j}$ is normal. In this paper, we simply focus on tabular data.

Algorithm Framework and Key Components

Figure 1 is an illustration of DADS, which is mainly composed of two parts: a SAC-based agent and an anomaly search environment. To apply RL, we must first transform the training dataset into a data flow environment where key components of RL are clearly defined.

State space is the whole training dataset $D = {D^{a}, D^{u}}$ , with each $s_{t} \in D$ sampled at timestep $t$ be a state.
Action space is a continuous space within range $[0, 1]$ , with value corresponding to anomaly score of input data.
Environment has its own inner dataset, and a single piece of data will be sampled as the state in each training step. When receiving the action of the agent, its inner dataset will make adjustments to its inner dataset according to a series of rules and calculate the reward.
Agent takes the current data as input, uses its inner network to calculate the corresponding action (the anomaly score of the input data), and returns it to the environment.

Figure 1: An illustration of our method DADS. See text for details.

SAC-Based Agent

In the design of the agent, we use the RL algorithm SAC, which adds an extra entropy term to the original target of RL, i.e., maximizing the expected cumulative rewards, and thus its objective function becomes: $J (π) = \sum_{t = 0}^{T} E_{(s_{t}, a_{t}) \sim τ_{π}} γ^{t} [r (s_{t}, a_{t}) + α H (π (\cdot ∣ s_{t}))]$ , where $π$ is represented as a policy network, $(s_{t}, a_{t})$ is a state-action pair within trajectory $τ_{π}$ generated by policy $π$ , $H (π (. ∣ s_{t}))$ is the entropy term which evaluates the uncertainty of policy $π$ , and $α > 0$ is the trade-off coefficient.

To find the optimal policy, SAC uses $π (θ)$ to approximate policy function, uses $q (s, a; ϕ_{i}), i \in {1, 2}$ to approximate action-value function, and uses $v (s; ψ)$ and $v (s; ¯ ¯¯ ¯ ψ)$ to estimate state-value function. The way of updating each network is listed in Appendix A. After training, we get $π^{*} (θ)$ . For every single data $s$ , $π^{*} (s; θ)$ is used as anomaly score.

We emphasize that the entropy term of the objective function in SAC is of great help to the anomaly search of DADS, since the agent is encouraged to take different actions and thus discover more possible anomalies. This entropy regularization, together with the sampling function of DADS, contributes to the search mechanism of DADS.

Anomaly Search Environment

The environment design of DADS is divided into the following three parts.

Hierarchical datasets tailored for anomaly search: The training dataset is separated into three parts and multi-stage screening is used to search for possible anomalies.
Ensemble-based sampling function: Ensemble learning is adopted to integrate different unsupervised AD methods.
Reward function for AD: The agent receives a reward combining both supervised and unsupervised rewards.

Hierarchical Datasets Tailored for Anomaly Search

As mentioned before, the training dataset is composed of $D^{a}$ and $D^{u}$ . To make the environment more suitable for anomaly search, we divided the whole training dataset into three interconnected datasets: anomaly dataset $A$ , temporary dataset $T$ , and unlabeled dataset $U$ . As is illustrated in Figure 1, three blue boxes correspond to three inner datasets of the environment. There are two threshold hyperparameters in the environment, respectively $T H_{s c o r e}$ and $T H_{c o n f}$ . There is two colors in the legend, in which green indicates anomaly score smaller than threshold $T H_{s c o r e}$ , and red indicates the opposite. Each line represents the flow between datasets, the text beside the line is the reward and other information of this flow, and the color of the text is the action of the agent. For example, the leftmost arrow means if current data $s_{t}$ is sampled from dataset $U$ , and the agent gives an anomaly score smaller than threshold, then $s_{t}$ will be placed back to $U$ .

Anomaly dataset $A$ contains all data in $D^{a}$ and other anomaly data searched from $D^{u}$ . Temporary dataset $T$ contains possible anomalies, which need further judgment. Unlabeled dataset $U$ contains all remaining data. During training agent will continuously search for possible anomalies from $U$ . Before each episode of training, we will initialize $A$ as $D^{a}$ , $T$ as $\emptyset$ , $U$ as $D^{u}$ .

Each time the agent returns an action $a_{t}$ to the environment, the environment will change the inner dataset to which $s_{t}$ belongs. Here we give every data $s_{t}$ an attribute $c o n f (s_{t})$ initialized to 0, which measures confidence of classifying $s_{t}$ as an anomaly. For those $s_{t}$ sampled from $U$ , if $a_{t} \geq T H_{s c o r e}$ , then we move $s_{t}$ to $T$ for further judgement, and set $c o n f (s_{t}) = 1$ ; if $a_{t} < T H_{s c o r e}$ , then we place it back to $U$ and set $c o n f (s_{t})$ to $0$ . For $s_{t}$ sampled from $T$ , if $a_{t} \geq T H_{s c o r e}$ , then we put it back to $T$ , meanwhile, $c o n f (s_{t}) = c o n f (s_{t}) + 1$ ; if $a_{t} < T H_{s c o r e}$ , $c o n f (s_{t})$ is set to $0$ and $s_{t}$ is put into $U$ . Data in $A$ are only in but not out, when the confidence $c o n f (s_{t})$ of a data in $T$ is greater than threshold $T H_{c o n f}$ , we think that it is highly possible of $s_{t}$ be an anomaly, so we put it into $A$ and ask the agent to judge it as an anomaly. For specific value selection of $T H_{c o n f}$ , please refer to Appendix B.

Altogether, each time the environment will pick out a data $s_{t}$ from an inner dataset (here we stipulate that $s$ is taken away from the chosen dataset) after the agent returns action $a_{t}$ , the environment will place $s_{t}$ to a new inner dataset according to the following function $T r a n s$ :

T r a n s (s_{t}, a_{t}) = ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ \begin{matrix} A & s_{t} \in A A & a_{t} \geq T H_{s c o r e}, s_{t} \in T, c o n f (s_{t}) \geq T H_{c o n f} T & a_{t} \geq T H_{s c o r e}, s_{t} \in U U & a_{t} < T H_{s c o r e}, s_{t} \in U \cup T \end{matrix} .

(2)

Ensemble-based Sampling Function

The sampling strategy of the environment from its inner dataset ${A, T, U}$ can be summarized as two stages. In the first stage, the environment chooses a certain dataset from three inner datasets according to a predefined probability distribution (e.g. the probabilities of sampling from $A$ , $T$ , and $U$ are $20 %$ , $20 %$ , and $60 %$ , respectively), we denote $C$ as the chosen inner dataset. In the second stage, the environment samples data from the dataset selected in the first stage. If the dataset selected in the first stage is in ${A, T}$ , since the size of these datasets is relatively small, we just use random sampling. If we selected $U$ in the first stage, to better explore anomalies in $U$ , we learn from the idea of ensemble learning and designed an anomaly-biased sampling strategy.

Specifically, we choose three unsupervised AD algorithms, namely Isolation Forest, Histogram-based Outlier Score, and One-Class SVM. The implementation of Pyod Zhao et al. (2019) is used to calculate unsupervised metrics normalized to range $[0, 1]$ . We choose these three methods for the purpose that they are based on different assumptions, so by combining them together, the sampling strategy is more likely to cover different types of anomalies Ruff et al. (2021). Let the anomaly scoring function of these three algorithms be $I F o r e s t (x)$ , $H B O S (x)$ and $O C S V M (x)$ . We also define a random score generator $R A N D O M (x)$ , which gives out scores within the range $[0, 1]$ randomly. Each time the environment will first sample a batch of data $S$ from $U$ with a certain size, then select one of the above three unsupervised methods (including $R A N D O M$ ) based on a predefined probability distribution (e.g., uniform distribution), we denote the chosen unsupervised method as $U N S U P (x)$ . We use it to calculate the anomaly score of each data in $S$ and at last, choose the one with the highest anomaly score.

To conclude, the sampling function of the environment can be written as:

(3)

where $U (C)$ is random sampling over the chosen dataset $C$ .

Reward Function for Anomaly Detection

The reward function of the environment is designed based on both supervised and unsupervised rewards.

For each data $s_{t}$ , if it comes from $A$ , the agent is asked to make a correct judgment. If the agent correctly classifies $s_{t}$ , it will get a positive reward, otherwise, it will be punished. This is what we call supervised reward.

If current data $s_{t}$ comes from $T$ or $U$ , since there are no supervisory signals, we introduce unsupervised rewards. For $s_{t}$ sampled from $T$ , if $c o n f (s_{t}) \geq T H_{c o n f}$ and the agent classify $s_{t}$ as anomaly, then the environment will move $s_{t}$ to $A$ and give a positive reward to encourage the search of possible anomalies; else, reward is 0. For $s_{t}$ sampled from $U$ , to enable the agent to learn data distribution with the help of unsupervised methods, the environment will give an unsupervised reward using isolation forest, which is written as $I F o r e s t (s_{t})$ above.

To conclude, the reward $r$ of data $s_{t}$ with regard to action $a_{t}$ can be written as follows:

r (s_{t} | a_{t}) = ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ \begin{matrix} 1 & s_{t} \in A, a_{t} \geq T H_{s c o r e} - 1 & s_{t} \in A, a_{t} < T H_{s c o r e} I F o r e s t (s_{t}) & s_{t} \in U 1 & s_{t} \in T, a_{t} \geq T H_{s c o r e}, c o n f (s_{t}) \geq T H_{c o n f} 0 & else \end{matrix} .

(4)

The specific value in reward function is designed in the simplest way with no factor multiplied. We think that these parameters can be further optimized for different datasets, however, current setting of reward have already performed well in the experiments.

Algorithm Analysis

As stated above, DADS uses a hierarchical search to find anomalies in the unlabeled dataset, and we believe that such a search mechanism is the key to solving the challenges proposed at the beginning of the paper. To be specific, DADS features in the following three aspects.

Strong robustness to contamination: When the agent searches for possible anomalies in unlabeled dataset, it is also cleaning it. In addition, the reward function of DADS is designed in a simple way that it does not include any prior assumption of the unlabeled dataset. Considering the above two points, it can be predicted that DADS will be robust to the contamination of unlabeled dataset.
High search efficiency and accuracy: On one hand, DADS borrows the idea of ensemble learning to improve the accuracy of search. Before a possible anomaly in $A$ is finally added into A, it needs to be classified as an anomaly by the agent for $T H_{c o n f}$ times. Notice that the agent is continuously updated, the $T H_{c o n f}$ times judgment is actually an integration of the results of different agents. On the other hand, DADS combines different unsupervised methods in its sampling function to improve search efficiency, while the entropy term of SAC also plays a role in improving it.
Nice balance of exploration and exploitation: With the help of RL, the whole training process of DADS integrates exploring unlabeled dataset and exploiting labeled anomalies in a natural way, which solves the exploration-exploitation dilemma automatically.

Experiments

This experiments section first gives a brief introduction to the datasets, evaluation metrics, and baseline methods, then gives empirical results on two experiment scenarios.

	Known AD Scenario
	DADS	DPLAN	DeepSAD	DevNet	SSAD	STOC+GOAD	STOC+IForest	Supervised	Unsupervised
annthyroid	0.881 $\pm$ 0.011	0.824 $\pm$ 0.042	0.894 $\pm$ 0.025	0.832 $\pm$ 0.025	0.739 $\pm$ 0.059	0.610 $\pm$ 0.028	0.833 $\pm$ 0.023	0.984 $\pm$ 0.010	0.804 $\pm$ 0.037
cardio	0.973 $\pm$ 0.017	0.769 $\pm$ 0.098	0.879 $\pm$ 0.054	0.967 $\pm$ 0.021	0.898 $\pm$ 0.047	0.945 $\pm$ 0.015	0.919 $\pm$ 0.018	0.912 $\pm$ 0.076	0.920 $\pm$ 0.014
satellite	0.822 $\pm$ 0.041	0.852 $\pm$ 0.032	0.903 $\pm$ 0.033	0.845 $\pm$ 0.015	0.859 $\pm$ 0.015	0.775 $\pm$ 0.014	0.786 $\pm$ 0.011	0.838 $\pm$ 0.044	0.769 $\pm$ 0.014
satimage2	0.990 $\pm$ 0.006	0.813 $\pm$ 0.102	0.980 $\pm$ 0.018	0.974 $\pm$ 0.047	0.973 $\pm$ 0.015	0.953 $\pm$ 0.023	0.994 $\pm$ 0.003	0.969 $\pm$ 0.033	0.994 $\pm$ 0.003
thyroid	0.993 $\pm$ 0.007	0.940 $\pm$ 0.068	0.953 $\pm$ 0.031	0.995 $\pm$ 0.006	0.960 $\pm$ 0.028	0.901 $\pm$ 0.031	0.978 $\pm$ 0.007	0.966 $\pm$ 0.043	0.974 $\pm$ 0.005
Average	0.932 $\pm$ 0.016	0.839 $\pm$ 0.068	0.922 $\pm$ 0.032	0.922 $\pm$ 0.023	0.886 $\pm$ 0.033	0.837 $\pm$ 0.022	0.902 $\pm$ 0.012	0.934 $\pm$ 0.041	0.892 $\pm$ 0.015
AverageRank	3.0	7.0	4.4	3.4	5.8	7.4	4.2	4.8	5.0
	Unknown AD Scenario
multi $_$ annthyroid	0.759 $\pm$ 0.047	0.716 $\pm$ 0.055	0.706 $\pm$ 0.064	0.712 $\pm$ 0.047	0.655 $\pm$ 0.047	0.554 $\pm$ 0.037	0.651 $\pm$ 0.039	0.949 $\pm$ 0.043	0.645 $\pm$ 0.040
multi $_$ cardio	0.943 $\pm$ 0.021	0.670 $\pm$ 0.058	0.890 $\pm$ 0.03	0.862 $\pm$ 0.044	0.873 $\pm$ 0.022	0.831 $\pm$ 0.023	0.799 $\pm$ 0.047	0.826 $\pm$ 0.036	0.844 $\pm$ 0.018
multi $_$ har	0.965 $\pm$ 0.017	0.628 $\pm$ 0.135	0.843 $\pm$ 0.139	0.943 $\pm$ 0.016	0.889 $\pm$ 0.019	0.681 $\pm$ 0.148	0.903 $\pm$ 0.011	0.944 $\pm$ 0.020	0.906 $\pm$ 0.012
Average	0.889 $\pm$ 0.028	0.671 $\pm$ 0.082	0.813 $\pm$ 0.078	0.839 $\pm$ 0.036	0.806 $\pm$ 0.029	0.689 $\pm$ 0.070	0.784 $\pm$ 0.032	0.906 $\pm$ 0.033	0.798 $\pm$ 0.023
AverageRank	1.3	7.0	4.7	3.7	5.0	7.7	6.7	3.3	5.7

Table 1: AUC-ROC performance (mean

\pm

std) of settings 1.1 and 2.1.

Datasets and Evaluation Metrics

To make a comprehensive comparison between DADS and other methods, we choose $5$ datasets with single anomaly class and $3$ datasets with multiple anomaly classes. The former $5$ of datasets are used in experiments of scenario $1$ , with dimension ranging from $6$ to $36$ and anomaly percentage ranging from $1.2 %$ to $31.6 %$ . Detailed information are in Table S3 of Appendix B. To further compare DADS with baselines, scenario $2$ is designed using $3$ datasets with multiple anomaly classes, of which size ranging from thousands to tens of thousands and dimension ranging from $9$ to $561$ . Detailed information are in Table S4 of Appendix B. The last column indicates whether the corresponding class is known or not. If the anomaly class is known, it will appear in both anomaly dataset and unlabeled dataset, else it only exists in unlabeled dataset.

The mentioned datasets serve as base pool of our experiments. We set two adjustable parameters, namely anomalies ratio and contamination ratio. Anomalies ratio corresponds to the ratio of known anomalies to total anomalies (in scenario $2$ total anomalies means anomalies of the known anomaly class); contamination ratio corresponds to the percentage of anomalies in unlabeled data.

We stratified split each dataset into training set, validation set and testing set, each accounting for $60 %$ , $20 %$ , and $20 %$ of the total dataset with $10$ different random seeds. validation set and testing set are fixed after splitting, so the percentage of each class is consistent with original dataset. Training dataset is manually generated according to the above two adjustable parameters: anomalies ratio and contamination ratio, and, as stated before, it is composed of $D^{a}$ and $D^{u}$ . In particular, DADS will search for optimal training episodes on validation set, SSAD and XGBoost will also tune on validation set. Detailed information are in Appendix B.

We select Area Under Receiver Operating Characteristic Curve (AUC-ROC) as the evaluation metric, which measures the area under ROC curve. The reported AUC-ROC are average results over 10 different random seeds.

Competing Methods

We select SSAD Görnitz et al. (2013), Deep SAD Ruff et al. (2019), DevNet Pang et al. (2019), XGBoost Chen and Guestrin (2016), STOC Yoon et al. (2021), DPLAN Pang et al. (2021b), and Isolation Forest Liu et al. (2008) as baselines. SSAD and Deep SAD are all based on SVDD Ruff et al. (2018), which use hypersphere to envelop the anomalies in the embedding space. According to experiments in Deep SAD, SSAD and Deep SAD are both state-of-the-art semi-supervised AD methods. Notice that here we use the improved version of Deep SAD called Hypersphere Classification (HSC) Ruff et al. (2020), which is about $1.5 %$ better than the original Deep SAD. DevNet introduces deviation loss and combines it with supervised loss to train an end-to-end anomaly scoring network. We combine the data-driven STOC method with both Isolation Forest and self-supervised representation learning method GOAD Bergman and Hoshen (2020). We reproduce DPLAN and include it in our baselines, considering that the distance calculation step in DPLAN is time-consuming, we decrease the sampling size from 1000 to 100. Finally, to make a comprehensive comparison, we also consider XGBoost and Isolation Forest, while XGBoost and Isolation Forest are classical supervised and unsupervised methods, respectively.

Scenario 1: Experiments on Datasets with Single Anomaly Class

In this scenario, we use $5$ datasets with single anomaly class to compare DADS with baseline methods. As stated above, there are two adjustable parameters: anomalies ratio and contamination ratio, we will fix one at a time and adjust another. The results of these two groups of settings are as follows. Detailed graphic results are shown in Appendix C, below we only list some processed results.

Setting 1.1: AUC versus incremental contamination ratio with fixed anomalies ratio.

First we set anomalies ratio to $0.1$ , which means the known anomalies account for $10 %$ of total anomalies. Next, we gradually increase contamination ratio from $0$ to $0.1$ , with interval set to $0.02$ , to test whether ADAS is more robust to pollution in unlabeled data compared with baselines. Here we report the result of DADS and baseline methods when $contamination_ratio = 0.1$ in the upper half of Table 1, and the average rank of each method averages its ranking of results on different datasets. DADS ranks the first in both average AUC-ROC and average rank, which proves the advantage of DADS in resisting high contamination in unlabeled data.

Setting 1.2: AUC versus incremental anomalies ratio with fixed contamination ratio.

In this part, we set contamination ratio to $0.04$ and adjust anomalies ratio within range $[0.01, 0.5]$ to see whether the search of possible anomalies makes contribution to the performance of DADS. Table 2 shows the results when $anomalies_ratio = 0.05$ , and the upper half corresponds to this setting. From the results we can see that DADS performs well when only a little labeled anomalies is available. Another interesting point is that the performance of some methods like IForest and STOC+IF declines as anomalies ratio increases, while DADS monotonically increases with the growth of anomalies ratio.

	Known AD Scenario
	DADS	DPLAN	DeepSAD	DevNet	SSAD	STOC+GOAD	STOC+IForest	Supervised	Unsupervised
annthyroid	0.857 $\pm$ 0.013	0.671 $\pm$ 0.069	0.868 $\pm$ 0.041	0.809 $\pm$ 0.035	0.722 $\pm$ 0.051	0.613 $\pm$ 0.024	0.866 $\pm$ 0.015	0.932 $\pm$ 0.075	0.843 $\pm$ 0.026
cardio	0.956 $\pm$ 0.037	0.574 $\pm$ 0.185	0.846 $\pm$ 0.064	0.929 $\pm$ 0.067	0.897 $\pm$ 0.044	0.949 $\pm$ 0.012	0.935 $\pm$ 0.017	0.887 $\pm$ 0.088	0.937 $\pm$ 0.010
satellite	0.845 $\pm$ 0.032	0.630 $\pm$ 0.079	0.903 $\pm$ 0.018	0.849 $\pm$ 0.018	0.858 $\pm$ 0.011	0.769 $\pm$ 0.012	0.795 $\pm$ 0.019	0.803 $\pm$ 0.014	0.786 $\pm$ 0.016
satimage2	0.990 $\pm$ 0.006	0.886 $\pm$ 0.105	0.970 $\pm$ 0.034	0.989 $\pm$ 0.022	0.973 $\pm$ 0.012	0.957 $\pm$ 0.023	0.994 $\pm$ 0.003	0.960 $\pm$ 0.029	0.994 $\pm$ 0.003
thyroid	0.976 $\pm$ 0.053	0.639 $\pm$ 0.191	0.911 $\pm$ 0.038	0.986 $\pm$ 0.019	0.946 $\pm$ 0.024	0.900 $\pm$ 0.030	0.976 $\pm$ 0.006	0.896 $\pm$ 0.091	0.974 $\pm$ 0.005
Average	0.925 $\pm$ 0.028	0.680 $\pm$ 0.126	0.900 $\pm$ 0.039	0.912 $\pm$ 0.032	0.879 $\pm$ 0.029	0.838 $\pm$ 0.020	0.913 $\pm$ 0.012	0.895 $\pm$ 0.060	0.907 $\pm$ 0.012
AverageRank	3.0	8.8	4.6	3.8	5.0	6.8	3.2	5.6	4.2
	Unknown AD Scenario
multi $_$ annthyroid	0.703 $\pm$ 0.075	0.617 $\pm$ 0.128	0.693 $\pm$ 0.059	0.693 $\pm$ 0.069	0.664 $\pm$ 0.05	0.562 $\pm$ 0.035	0.690 $\pm$ 0.028	0.927 $\pm$ 0.043	0.685 $\pm$ 0.046
multi $_$ cardio	0.978 $\pm$ 0.013	0.561 $\pm$ 0.091	0.921 $\pm$ 0.030	0.927 $\pm$ 0.049	0.897 $\pm$ 0.045	0.860 $\pm$ 0.029	0.802 $\pm$ 0.029	0.823 $\pm$ 0.028	0.881 $\pm$ 0.016
multi $_$ har	0.977 $\pm$ 0.011	0.712 $\pm$ 0.13	0.898 $\pm$ 0.127	0.966 $\pm$ 0.015	0.987 $\pm$ 0.002	0.766 $\pm$ 0.058	0.916 $\pm$ 0.014	0.882 $\pm$ 0.081	0.920 $\pm$ 0.013
Average	0.886 $\pm$ 0.033	0.630 $\pm$ 0.116	0.837 $\pm$ 0.072	0.862 $\pm$ 0.044	0.849 $\pm$ 0.032	0.730 $\pm$ 0.041	0.803 $\pm$ 0.024	0.877 $\pm$ 0.051	0.829 $\pm$ 0.025
AverageRank	1.7	8.7	4.3	2.7	4.0	7.7	6.0	5.0	5.0

Table 2: AUC-ROC performance (mean

\pm

std) of settings 1.2 and 2.2.

Scenario 2: Experiments on Datasets with Multiple Anomaly Classes

The above experiments on datasets with single anomaly class shows the advantage of DADS in both the exploration of unlabeled dataset and robustness to contaminated data. For further verification, we conducted similar experiments on datasets with multiple anomaly classes. For each dataset, we set one anomaly as the known anomaly class while leave others as unknown. We believe that the tasks in this scenario will be more difficult and can better show the advantages of DADS. Same as above, detailed results are in Appendix C.

Setting 2.1: AUC versus incremental contamination ratio with fixed anomalies ratio.

Like setting 1.1, we set anomalies ratio to $0.1$ and adjust contamination ratio from $0$ to $0.1$ . An obvious phenomenon in Figure 2 is that as contamination ratio goes up, all methods except DADS experiences different degrees of decline, which again proves the robustness of DADS to contamination. The lower half of Table 1 summarizes the results when $contamination_ratio = 0.1$ . Although slightly inferior to XGBoost in average AUC, DADS still ranks the first in average rank.

AUC-ROC of DADS and baselines in Setting 2.1. — (a) multi_annthyroid

Setting 2.2: AUC versus incremental anomalies ratio with fixed contamination ratio.

Here we fix contamination ratio to $0.04$ and increase anomalies ratio from $0.01$ to $0.5$ . Average AUC results and rank are shown in Table 2 and DADS performs the best. We believe this should be attributed to the efficient search mechanism, which enables DADS to search for possible anomalies when there are multiple anomaly classes.

Analysis of Experiment Results

From the results of experiments conducted on two scenarios, we can summarize several empirical results.

DADS vs. DPLAN: We think it is necessary to make a comprehensive comparison between DADS and DPLAN, since they are all AD methods based on RL. According to the experiment results, DADS outperforms DPLAN in all two scenarios. To be specific, settings 1.1 and 1.2, especially datasets like cardio, satimage2, and multi-har prove the superiority of robustness of DADS to contamination. In settings 1.2 and 2.2, DADS outperforms DPLAN when only small amount of labeled anomalies is known, which should give credit to the SAC algorithm and the search mechanism that avoid the problem of DPLAN that only searches locally. Another advantage of DADS is that it removes the distance-based sampling strategy of DPLAN, thus accelerating training to a large extent.
DADS vs. unsupervised/supervised/semi-supervised AD methods: According to the experiment results, we can also make a comparison between DADS and different kinds of AD methods. First of all, compared with unsupervised methods like Isolation Forest, STOC+IForest, and STOC+GOAD, DADS has advantages in settings 1.2 and 2.2 since it can search and exploit unknown anomalies.

Secondly, though supervised methods like XGBoost achieve quite good results in some datasets like annthyroid, it cannot keep the advantage on all datasets, while the performance of DADS is relatively stable across different datasets. When the aomunt of known anomalies is extremely small(e.g. 1%), DADS also performs better than supervised methods, which can be verified in Figures S5 and S7 of Appendix C.

Lastly, compared with semi-supervised methods like DPLAN, DeepSAD, SSAD and DevNet, DADS outperforms them in all four settings. Figures S4 and 2 of settings 1.1 and 2.1 indicates that DADS is more robust to data contamination, especially when compared with supervised-based methods like DPLAN. In settings 1.2 and 2.2, different from unsupervised-based methods like SSAD and Deep SAD that reach performance plateau prematurely, DADS can maintain synchronous performance growth with the increases of known anomalies, which shows stronger ability in utilizing supervision signals.

Ablation Study

To identify to what extent the search mechanism helps, we remove the search mechanism of DADS and leave other components unchanged, that is to say, no anomalies will be found and added into the anomaly dataset $A$ . Experiments of settings 2.1 and 2.2 are conducted on original DADS and DADS without search. We plot the result on multi_cardio dataset with line chart, which can demonstrate the advantage of DADS well. For detailed results of experiment, please refer to Tables S7 and S8 of Appendix D.

Ablation study on setting 2.1.

The left subfigure of Figure 3 shows that DADS with search is almost always better than DADS without search. Another point worth noting is that as contamination ratio goes up, the fluctuation of DADS without search is increasing, and the AUC of DADS with search is also increasing. We think that it is a convincing evidence of the importance of the search mechanism in DADS. Without search, DADS can neither clean the unlabeled data, nor acquiring new anomalies, thus suffering from the contamination issue.

Ablation study on setting 2.2.

From Figure 3 we can see that DADS with search is better than DADS without search across different amount of labeled anomalies. This result is consistent with our former analysis, since DADS can search for possible anomalies to improve its performance. While in contrast, DADS without search can only fit limited known anomalies. In Table S8 of Appendix D, the average results also prove the necessity of the search of DADS.

Conclusion

This paper presents an RL-based semi-supervised tabular AD method DADS. We use the idea of ensemble learning and design environment with hierarchically organized datasets to achieve better robustness to contaminated unlabeled data. By combining unsupervised AD methods into the sampling function, we further improve the effectiveness of the search mechanism. DADS performs well in experiments with various settings, which demonstrate its effectiveness.

References

K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath (2017) Deep reinforcement learning: a brief survey. IEEE Signal Processing Magazine 34 (6), pp. 26–38. Cited by: Introduction.
L. Bergman and Y. Hoshen (2020) Classification-based anomaly detection for general data. In International Conference on Learning Representations, Cited by: Competing Methods.
V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM Computing Surveys 41 (3), pp. 1–58. Cited by: Introduction.
C. Chang, J. Yoon, S. Arik, M. Udell, and T. Pfister (2022) Data-efficient and interpretable tabular anomaly detection. arXiv:2203.02034. Cited by: Semi-supervised Tabular Anomaly Detection.
O. Chapelle, B. Scholkopf, and A. Zien (2009) Semi-supervised learning. IEEE Transactions on Neural Networks 20 (3), pp. 542–542. Cited by: Semi-supervised Tabular Anomaly Detection.
A. Charpentier, R. Elie, and C. Remlinger (2021) Reinforcement learning in economics and finance. Computational Economics, pp. 1–38. Cited by: Introduction.
T. Chen and C. Guestrin (2016) Xgboost: a scalable tree boosting system. In ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 785–794. Cited by: Anomaly Detection, Competing Methods.
M. Goldstein and A. Dengel (2012) Histogram-based outlier score (hbos): a fast unsupervised anomaly detection algorithm. In German Conference on Artificial Intelligence, pp. 59–63. Cited by: Anomaly Detection.
P. Gopalan, V. Sharan, and U. Wieder (2019) Pidforest: anomaly detection via partial identification. Advances in Neural Information Processing Systems 32. Cited by: Semi-supervised Tabular Anomaly Detection.
N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld (2013) Toward supervised anomaly detection. Journal of Artificial Intelligence Research 46, pp. 235–262. Cited by: Semi-supervised Tabular Anomaly Detection, Semi-supervised Tabular Anomaly Detection, Competing Methods.
T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. (2018) Soft actor-critic algorithms and applications. arXiv:1812.05905. Cited by: Deep Reinforcement Learning.
L. P. Kaelbling, M. L. Littman, and A. W. Moore (1996) Reinforcement learning: a survey. Journal of Artificial Intelligence Research 4, pp. 237–285. Cited by: Introduction.
V. Konda and J. Tsitsiklis (1999) Actor-critic algorithms. In Advances in Neural Information Processing Systems 12, Cited by: Deep Reinforcement Learning.
Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: Introduction.
F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In IEEE International Conference on Data Mining, pp. 413–422. Cited by: Anomaly Detection, Competing Methods.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: Deep Reinforcement Learning, Deep Reinforcement Learning.
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv:1312.5602. Cited by: Introduction, Deep Reinforcement Learning.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: Introduction, Deep Reinforcement Learning.
T. T. Nguyen and V. J. Reddi (2019) Deep reinforcement learning for cyber security. IEEE Transactions on Neural Networks and Learning Systems. Cited by: Introduction.
G. Pang, C. Shen, L. Cao, and A. V. D. Hengel (2021a) Deep learning for anomaly detection: a review. ACM Computing Surveys 54 (2), pp. 1–38. Cited by: Anomaly Detection.
G. Pang, C. Shen, and A. van den Hengel (2019) Deep anomaly detection with deviation networks. In ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 353–362. Cited by: Semi-supervised Tabular Anomaly Detection, Competing Methods.
G. Pang, A. van den Hengel, C. Shen, and L. Cao (2021b) Toward deep supervised anomaly detection: reinforcement learning from partially labeled anomaly data. In ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 1298–1308. Cited by: 2nd item, Introduction, Competing Methods.
L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K. Müller (2021) A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, pp. 756–795. Cited by: Anomaly Detection, Semi-supervised Tabular Anomaly Detection, Ensemble-based Sampling Function.
L. Ruff, R. A. Vandermeulen, B. J. Franks, K. Müller, and M. Kloft (2020) Rethinking assumptions in deep anomaly detection. arXiv:2006.00339. Cited by: Competing Methods.
L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K. Müller, and M. Kloft (2019) Deep semi-supervised anomaly detection. arXiv:1906.02694. Cited by: 1st item, Semi-supervised Tabular Anomaly Detection, Competing Methods.
L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In International Conference on Machine Learning, pp. 4393–4402. Cited by: Competing Methods.
A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani (2017) Deep reinforcement learning framework for autonomous driving. Electronic Imaging 2017 (19), pp. 70–76. Cited by: Introduction.
B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471. Cited by: Anomaly Detection.
J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: Deep Reinforcement Learning.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347. Cited by: Deep Reinforcement Learning.
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354–359. Cited by: Introduction.
M. E. Villa-Pérez, M. Á. Álvarez-Carmona, O. Loyola-González, M. A. Medina-Pérez, J. C. Velazco-Rossell, and K. R. Choo (2021) Semi-supervised anomaly detection algorithms: a comparative summary and future research directions. Knowledge-Based Systems 218, pp. 106878. Cited by: 2nd item.
Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas (2016) Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003. Cited by: Deep Reinforcement Learning.
R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256. Cited by: Deep Reinforcement Learning.
J. Yoon, K. Sohn, C. Li, S. O. Arik, C. Lee, and T. Pfister (2021) Self-trained one-class classification for unsupervised anomaly detection. arXiv:2106.06115. Cited by: 1st item, Competing Methods.
J. Yoon, Y. Zhang, J. Jordon, and M. van der Schaar (2020) Vime: extending the success of self-and semi-supervised learning to tabular domain. In Advances in Neural Information Processing Systems 33, pp. 11033–11043. Cited by: Anomaly Detection.
C. Yu, J. Liu, S. Nemati, and G. Yin (2021a) Reinforcement learning in healthcare: a survey. ACM Computing Surveys (CSUR) 55 (1), pp. 1–36. Cited by: Introduction.
J. Yu, H. Oh, M. Kim, and J. Kim (2021b) Normality-calibrated autoencoder for unsupervised anomaly detection on data contamination. arXiv:2110.14825. Cited by: Anomaly Detection.
Y. Zhao, Z. Nasrullah, and Z. Li (2019) PyOD: a Python toolbox for scalable outlier detection. Journal of Machine Learning Research 20 (96), pp. 1–7. External Links: Link Cited by: Ensemble-based Sampling Function.
S. Zhu, H. S. Yuchi, M. Zhang, and Y. Xie (2019) Sequential adversarial anomaly detection for one-class event data. arXiv preprint arXiv:1910.09161. Cited by: Introduction.

Appendix

A: Algorithm of DADS

The training and inference procedures of ADAS are illustrated in Algorithms S1 and S2.

1:Input:

D = {D^{a}, D^{u}}

//training data

Π^{*} (s; θ)

// policy network of agent

3:Initialize:

q (s, a; ϕ_{1}), q (s, a; ϕ_{2}), v (s; ψ), v (s; ¯ ¯¯ ¯ ψ)

with the Xavier initialization method

4:Experience replay buffer

D

5:Adam optimizer of each network with

l e a r n i n g_r a t e = l r

t o t a l_s t e p s = 0

7:for

i = 1

n_e p i s o d e s

: do

8: for

t

n_s t e p s

: do

t o t a l_s t e p s = t o t a l_s t e p s + 1

10: Environment samples data

s_{t}

according to Equation 3

11: Agent takes action

a_{t}

according to the distribution of

Π^{*} (s; θ)

12: Environment adjusts the dataset according to Equation 2

13: Environment returns

r_{t}

to agent according to Equation 4

14: Store

(s_{t}, a_{t}, r_{t}, s_{t + 1})

in the replay buffer

D

15: if

t o t a l_s t e p s % u p d a t e_i n t e r v a l == 0

and

t o t a l_s t e p s > w a r m u p_s t e p s

then

16: for

j

u p d a t e_t i m e s

: do

17: Randomly sample a batch of transitions

B

from

D

with

s i z e = b a t c h_s i z e

18: Update

ϕ_{i}, i \in {1, 2}

using the following objective function:

19:

20: Update

ψ

using the following objective function:

21:

J (ψ) = E_{(s_{t}, a_{t}, r_{t}, s_{t + 1}) \sim B} [{(v (s_{t}; ψ) - U_{t}^{(v)})}_{t}^{2}]

22: Update

θ

using the following objective function:

23:

J (θ) = E_{(s_{t}, a_{t}, r_{t}, s_{t + 1}) \sim B, a^{'} \sim π (\cdot | s_{t}; θ)} [q (s_{t}, a_{t}; ϕ_{1}) - α ln (a^{'} | s_{t}; θ)]

24: where

25:

U_{t}^{(q)} = r_{t} + γ v (s_{t}; ¯ ¯¯ ¯ ψ)

26:

U_{t}^{(v)} = E_{a^{'} \sim π (\cdot | s_{t}; θ)} [min i \in {1, 2} q (s_{t}, a_{t}; ϕ_{i}) - α ln (a^{'} | s_{t}; θ)]

27:

¯ ¯¯ ¯ ψ \leftarrow (1 - β) ¯ ¯¯ ¯ ψ + β ψ

28: end for

29: end if

30: end for

31:end for

Algorithm S1 Training of DADS

1:Input:

D^{T} = (s_{1}, s_{2} \dots s_{n})

// test data

Π^{*} (s; θ)

// policy network of agent

3:for

s_{i}

D^{T}

y_{i} = Π^{*} (s_{i}; θ)

5:end for

6:return anomaly scores

y = {y_{1}, y_{2}, \dots, y_{n}}

Algorithm S2 Inference of DADS

B: Details in Experiments

Preprocessing and acquisition of datasets

For all datasets used in experiments, we apply a train-validate-test split of $(0.6, 0.2, 0.2)$ . After that, minimax scaler will be applied to three datasets respectively. Validation set and testing set are fixed from then on, and we just need to generate the training dataset according to anomalies ratio, contamination ratio, and normalies ratio. Notice that for datasets with single anomaly class, anomalies ratio is calculated with regard to all anomalies, while for datasets with multiple anomaly classes, anomalies ratio is calculated with regard to the known anomaly class.

Detailed information of datasets used in experiments are listed in Tables S3 and S4.

Dataset			Normal Class	Anomaly Class
Name	N	D	Normal Class	Anomaly Class
annthyroid	7200	6	6666	$534 (7.4 %)$
cardio	1831	21	1655	$176 (9.6 %)$
satellite	6435	36	4399	$2036 (31.6 %)$
satimage2	5803	36	5732	$71 (1.2 %)$
thyroid	3772	6	3679	$93 (2.5 %)$

Table S3: Datasets with single anomaly class.

Dataset			Normal Class		Anomaly Class
Name	Size	Dim	Name	Size	Name	Size	Known
annthyroid	3772	21	normal	3488	hypothyroid	$93 (2.5 %)$	N
annthyroid	3772	21	normal	3488	subnormal	$191 (5.1 %)$	Y
cardio	2126	21	normal	1655	suspect	$295 (13.9 %)$	Y
cardio	2126	21	normal	1655	pathologic	$176 (8.28 %)$	N
har	10299	561	walking, sitting	7349	upstairs	$1544 (15.0 %)$	Y
har	10299	561	standing, laying	7349	downstairs	$1406 (13.7 %)$	N

Table S4: Datasets with multiple anomaly classes.

All datasets used in experiments can be accessed via links in Table S5.

Datsaset	link
annthyroid	http://odds.cs.stonybrook.edu/annthyroid-dataset/
cardio	http://odds.cs.stonybrook.edu/cardiotocogrpahy-dataset/
satellite	http://odds.cs.stonybrook.edu/satellite-dataset/
satimage2	http://odds.cs.stonybrook.edu/satimage-2-dataset/
thyroid	http://odds.cs.stonybrook.edu/thyroid-disease-dataset/
multi_annthyroid	https://www.openml.org/d/40497
multi_cardio	https://archive.ics.uci.edu/ml/datasets/Cardiotocography
multi_har	https://www.openml.org/d/1478

Table S5: Links of Datasets.

Training of DADS

The training process of DADS is illustrated in Appendix A, and here are some details. First, in order to avoid possible overfitting, we use validation set to search for best training episodes. For each set of parameters, we will train the agent for 10 episodes and record the sum of AUC-ROC and AUC-PR after each episode of training and choose the one with the best result as the training episode used in the following training of 10 random seeds. The DADS experiments are run on an AWS C5.9xlarge instance with 36 Intel Xeon 3.6 GHz cores and 72 GB memory.

For unsupervised methods used in DADS (IForest, HBOS, and OCSVM), we fit them with unlabeled dataset $D^{u}$ before each episode.

AUC-ROC of DADS and baselines in Setting 1.1. — (a) annthyroid

The choice of parameter $T H_{c o n f}$ is also worth mentioning. As stated in Section 3.5, the search of anomalies should be accurate and efficient, which requires that the threshold should be set carefully. If threshold is too small, it is likely that normal data will be added to $A$ and cause contamination. If threshold is too large, no anomalies will be searched, and the whole DADS method is reduced to a method that simply fits the known anomalies. Here we design an adaptive threshold setting method, which can automatically set threshold $T H_{c o n f}$ after each episode of training. We initialize $T H_{c o n f}$ to a relatively large number $8$ , which can give the agent enough time to be trained to a reasonable action. If the number of searched anomalies in the current episode exceeds half of the number of known anomalies, $T H_{c o n f}$ will be added by $1$ to reduce the risk of $A$ being polluted. If no anomaly is searched for in the current episode, $T H_{c o n f}$ will be decreased by $1$ to increase the probability of the agent finding possible anomalies. For other conditions, $T H_{c o n f}$ will remain the same in the next training episode.

Other hyperparamters of DADS are listed in Table S6.

Hyperparameter	Value
n $_$ steps	$5000$
lr	$0.0005$
hidden $_$ layers	$[32, 16]$
warmup $_$ steps	$10000$
batch $_$ size	$64$
update $_$ interval	$16$
update $_$ times	$16$
$T H_{s c o r e}$	$0.1$
$α$	$1$
$β$	$0.2$
$γ$	$0.99$

Table S6: Hyperparameters of DADS.

C: Experiment Results

The experiment results are shown in Figures S4 $\sim$ S7.

AUC-ROC of DADS and baselines in Setting 1.2. — (a) annthyroid

AUC-ROC of DADS and baselines in Setting 2.2. — (a) annthyroid

D: Results of Ablation Study

Tables S7 and S8 respectively summarize the results of ablation study on settings 2.1 and 2.2.

Dataset	contamination ratio	w. search	w/o search
multi_annthyroid	0.00	0.778 $\pm$ 0.049	0.750 $\pm$ 0.072
	0.02	0.747 $\pm$ 0.084	0.764 $\pm$ 0.039
	0.04	0.773 $\pm$ 0.046	0.760 $\pm$ 0.046
	0.06	0.760 $\pm$ 0.047	0.729 $\pm$ 0.078
	0.08	0.738 $\pm$ 0.069	0.741 $\pm$ 0.060
	0.10	0.759 $\pm$ 0.047	0.740 $\pm$ 0.060
multi_cardio	0.00	0.981 $\pm$ 0.010	0.973 $\pm$ 0.011
	0.02	0.965 $\pm$ 0.024	0.955 $\pm$ 0.027
	0.04	0.956 $\pm$ 0.036	0.912 $\pm$ 0.038
	0.06	0.938 $\pm$ 0.035	0.956 $\pm$ 0.034
	0.08	0.954 $\pm$ 0.034	0.894 $\pm$ 0.062
	0.10	0.943 $\pm$ 0.021	0.895 $\pm$ 0.083
multi_har	0.00	0.989 $\pm$ 0.006	0.979 $\pm$ 0.015
	0.02	0.983 $\pm$ 0.007	0.964 $\pm$ 0.044
	0.04	0.967 $\pm$ 0.017	0.966 $\pm$ 0.024
	0.06	0.975 $\pm$ 0.008	0.951 $\pm$ 0.069
	0.08	0.962 $\pm$ 0.022	0.966 $\pm$ 0.015
	0.10	0.965 $\pm$ 0.017	0.965 $\pm$ 0.014
average		0.896 $\pm$ 0.032	0.880 $\pm$ 0.044

Table S7: Results of ablation study on Setting 2.1.

Dataset	contamination ratio	w. search	w/o search
multi_annthyroid	0.01	0.663 $\pm$ 0.089	0.758 $\pm$ 0.078
	0.05	0.703 $\pm$ 0.075	0.728 $\pm$ 0.080
	0.10	0.751 $\pm$ 0.056	0.775 $\pm$ 0.082
	0.15	0.780 $\pm$ 0.055	0.762 $\pm$ 0.043
	0.50	0.821 $\pm$ 0.046	0.723 $\pm$ 0.085
multi_cardio	0.01	0.840 $\pm$ 0.099	0.741 $\pm$ 0.067
	0.05	0.978 $\pm$ 0.013	0.942 $\pm$ 0.044
	0.10	0.962 $\pm$ 0.028	0.900 $\pm$ 0.050
	0.15	0.975 $\pm$ 0.017	0.957 $\pm$ 0.022
	0.50	0.962 $\pm$ 0.026	0.965 $\pm$ 0.022
multi_har	0.01	0.969 $\pm$ 0.018	0.954 $\pm$ 0.025
	0.05	0.977 $\pm$ 0.011	0.925 $\pm$ 0.076
	0.10	0.966 $\pm$ 0.025	0.970 $\pm$ 0.028
	0.15	0.965 $\pm$ 0.034	0.967 $\pm$ 0.029
	0.50	0.979 $\pm$ 0.010	0.958 $\pm$ 0.052
average		0.886 $\pm$ 0.040	0.868 $\pm$ 0.052

Table S8: Results of ablation study on Setting 2.2.