title: 'Learning Under Concept Drift: A Review'
date: 2020-11-30 13:45:22
tags: concept-drift
Abstract—Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding, and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary that a high quality, instructive review of current research developments and trends in the concept drift field is conducted. In addition, due to the rapid development of concept drift in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in the literature. This paper reviews over 130 high quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms that aim to handle concept drift. Also, concept drift related research directions are covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.
INTRODUCTION
Governments and companies are generating huge amounts of streaming data and urgently need efficient data analytics and machine learning techniques to support them in making predictions and decisions. However, the rapidly changing environment of new products, new markets and new customer behaviors inevitably results in the appearance of the concept drift problem. Concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways [1]. If concept drift occurs, the patterns induced from past data may no longer be relevant to the new data, leading to poor predictions and decision outcomes. The phenomenon of concept drift has been recognized as the root cause of decreased effectiveness in many data-driven information systems, such as data-driven early warning systems and data-driven decision support systems. In an ever-changing and big data environment, how to provide more reliable data-driven predictions and decision facilities has become a crucial issue.
The concept drift problem exists in many real-world situations. An example can be seen in the changing behavior of mobile phone usage, as shown in Fig. 1. As the bars in this figure show, the time-percentage distribution of mobile phone usage patterns has shifted from “Audio Call” to “Camera” and then to “Mobile Internet” over the past two decades.
Recent attractive research in the field of concept drift targets more challenging problems, i.e., how to accurately detect concept drift in unstructured and noisy datasets [2], [3], how to quantitatively understand concept drift in an explainable way [4], [5], and how to effectively react to drift by adapting related knowledge [6], [7].
Solving these challenges endows prediction and decision-making with adaptability in an uncertain environment. Conventional research related to machine learning has been significantly improved by introducing concept drift techniques in data science and artificial intelligence in general, and in pattern recognition and data stream mining in particular. These new studies enhance the effectiveness of analogical and knowledge reasoning in an ever-changing environment. A new topic has formed during this development: adaptive data-driven prediction/decision systems. In particular, concept drift is a highly prominent and significant issue in the context of the big data era because the uncertainty of data types and data distributions is an inherent characteristic of big data.
Conventional machine learning has two main components: training/learning and prediction. Research on learning under concept drift presents three new components: drift detection (whether or not drift occurs), drift understanding (when, how, where it occurs) and drift adaptation (reaction to the existence of drift) as shown in Fig. 2. These will be discussed in Sections 3, 4, and 5.
In the literature, a detailed concept drift survey paper [8] was published in 2014 but intentionally left certain sub-problems of concept drift to other publications, such as the details of the data distribution change ($P(X)$) mentioned in their Section 2.1. In 2015, another comprehensive survey paper [9] was published, which surveys and gives a tutorial on both the established and the state-of-the-art approaches. It provides a hybrid view of concept drift from two primary perspectives, active and passive. Both survey papers are comprehensive and can serve as a good introduction to concept drift research. However, many new publications have become available in the last three years, and a new category of drift detection methods, named multiple hypothesis test drift detection, has even arisen. It is necessary to review the past research focuses and present the most recent research trends in concept drift, which is one of the main contributions of this survey paper.
Besides these two publications, four related survey papers [6], [7], [10], [11] have also provided valuable insights into how to address concept drift, but their specific research focus is only on data stream learning, rather than on analyzing concept drift adaptation algorithms and understanding concept drift. Specifically, paper [7] focuses on data reduction for stream learning incorporating concept drift, while [6] only focuses on investigating developments in learning ensembles for data stream learning in a dynamic environment. [11] concerns the evolution of data stream clustering, and [10] focuses on investigating the current and future trends of data stream learning. There is therefore a gap in the current literature that requires a fuller picture of established and newly emerged research on concept drift; a comprehensive review of the three major aspects of concept drift: concept drift detection, understanding and adaptation, as shown in Fig. 2; and a discussion of the new trends in concept drift research.
The selection of references in this survey paper was performed according to the following steps:
Step 1. Publication database: Science Direct, ACM Digital Library, IEEE Xplore and SpringerLink.
Step 2. Preliminary screening of articles: The first search was based on keywords. The articles were then selected as references if they 1) present a new theory, algorithm or methodology in the area of concept drift, or 2) report a concept drift application.
Step 3. Result filtering for articles: The articles selected in Step 2 were divided into three groups: concept drift detection, understanding, and adaptation. The references in each group were filtered again, based on 1) Time: published mainly within the last 10 years, or 2) Impact: published in high quality journals/conferences or having high citations.
Step 4. Dataset selection: To help readers test their research results, this paper lists popular datasets and their characteristics, the dataset providers, and how each dataset can be used.
On completion of this process, 137 research articles, 10 widely used synthetic datasets for evaluating the performance of learning algorithms dealing with concept drift, and 14 publicly available and widely used real-world datasets were listed for discussion. The main contributions of this paper are:
1) It perceptively summarizes concept drift research achievements and clusters the research into three categories: concept drift detection, understanding and adaptation, providing a clear framework for concept drift research development (Fig. 2);
2) It proposes a new component, concept drift understanding, for retrieving information about the status of concept drift in aspects of when, how, and where. This also creates a connection between drift detection and drift adaptation;
3) It uncovers several very new concept drift techniques, such as active learning under concept drift and fuzzy competence model-based drift detection, and identifies related research involving concept drift;
4) It systematically examines two sets of concept drift datasets, synthetic datasets and real-world datasets, through multiple dimensions: dataset description, availability, suitability for type of drift, and existing applications;
5) It suggests several emerging research topics and potential research directions in this area.
The remainder of this paper is structured as follows. In Section 2, the definitions of concept drift are given and discussed. Section 3 presents research methods and algorithms in concept drift detection. Section 4 discusses research developments in concept drift understanding. Research results on drift adaptation (concept drift reaction) are reported in Section 5. Section 6 presents evaluation systems and related datasets used to test concept drift algorithms. Section 7 summarizes related research concerning the concept drift problem. Section 8 presents a comprehensive analysis of the main findings and future research directions.
PROBLEM DESCRIPTION
This section first gives the formal definition and the sources of concept drift in Section 2.1. Then, in Section 2.2, the commonly defined types of concept drift are introduced.
Concept drift definition and the sources
Concept drift is a phenomenon in which the statistical properties of a target domain change over time in an arbitrary way [3]. It was first proposed by [12], who aimed to point out that noise data may turn into non-noise information at different times. These changes might be caused by changes in hidden variables which cannot be measured directly [4]. Formally, concept drift is defined as follows:
Given a time period $[0, t]$, a set of samples, denoted as $S_{0, t}=\{d_{0}, \ldots, d_{t}\}$, where $d_{i}=\left(X_{i}, y_{i}\right)$ is one observation (or a data instance), $X_{i}$ is the feature vector, $y_{i}$ is the label, and $S_{0, t}$ follows a certain distribution $F_{0, t}(X, y)$. Concept drift occurs at timestamp $t+1$, if $F_{0, t}(X, y) \neq F_{t+1, \infty}(X, y)$, denoted as
$\exists t: P_{t}(X, y) \neq P_{t+1}(X, y)$ [2], [8], [13], [14].
Concept drift has also been defined by various authors using alternative names, such as dataset shift [15] or concept shift [1]. Other related terminologies were introduced in [16], where the authors proposed that concept drift or shift is only one subcategory of dataset shift, and that dataset shift consists of covariate shift, prior probability shift and concept shift. These definitions clearly state the research scope of each research topic. However, concept drift is usually associated with covariate shift and prior probability shift, and an increasing number of publications [2], [8], [13], [14] refer to the term “concept drift” as the problem in which $\exists t: P_{t}(X, y) \neq P_{t+1}(X, y)$. We therefore apply the same definition of concept drift in this survey. Accordingly, concept drift at time $t$ is defined as the change of the joint probability of $X$ and $y$ at time $t$. Since the joint probability $P_t(X, y)$ can be decomposed into two parts as $P_{t}(X, y)=P_{t}(X) \times P_{t}(y \mid X)$, concept drift can be triggered by three sources:
- Source I: $P_{t}(X) \neq P_{t+1}(X)$ while $P_{t}(y \mid X)=P_{t+1}(y \mid X)$, that is, the research focus is the drift in $P_{t}(X)$ while $P_{t}(y \mid X)$ remains unchanged. Since drift in $P_{t}(X)$ does not affect the decision boundary, it has also been considered as virtual drift [7], Fig. 3(a).
- Source II: $P_{t}(y \mid X) \neq P_{t+1}(y \mid X)$ while $P_{t}(X)=P_{t+1}(X)$, that is, the drift occurs in $P_{t}(y \mid X)$ while $P_{t}(X)$ remains unchanged. This drift will cause the decision boundary to change and lead to a decrease in learning accuracy, which is also called actual drift, Fig. 3(b).
- Source III: a mixture of Source I and Source II, namely $P_{t}(X) \neq P_{t+1}(X)$ and $P_{t}(y \mid X) \neq P_{t+1}(y \mid X)$. Concept drift research focuses on the drift of both $P_{t}(y \mid X)$ and $P_{t}(X)$, since both changes convey important information about the learning environment, Fig. 3(c).
Fig. 3 demonstrates how these sources differ from each other in a two-dimensional feature space. Source I is feature space drift, and Source II is decision boundary drift. In many real-world applications, Source I and Source II occur together, which creates Source III.
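To make the three sources concrete, the following minimal Python sketch generates a two-dimensional classification stream and produces virtual, actual, and mixed drift by shifting $P_t(X)$, the labeling rule $P_t(y \mid X)$, or both. The Gaussian feature distribution, the linear labeling rule and all parameter values are our own illustrative assumptions, not taken from the cited papers.

```python
# Illustrative sketch: the three sources of concept drift on a 2-D stream
# with labeling rule y = 1[x1 + x2 > boundary]. All parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, mean, boundary):
    """Draw n instances from N(mean, I) and label them by x1 + x2 > boundary."""
    X = rng.normal(loc=mean, scale=1.0, size=(n, 2))
    y = (X.sum(axis=1) > boundary).astype(int)
    return X, y

# Concept at time t: P_t(X) centered at (0, 0), decision boundary x1 + x2 = 1.
X_t, y_t = sample(1000, mean=[0.0, 0.0], boundary=1.0)

# Source I (virtual drift): P(X) shifts, labeling rule P(y|X) unchanged.
X_v, y_v = sample(1000, mean=[1.5, 1.5], boundary=1.0)

# Source II (actual drift): labeling rule changes, P(X) unchanged.
X_a, y_a = sample(1000, mean=[0.0, 0.0], boundary=0.0)

# Source III (mixed drift): both P(X) and P(y|X) change.
X_m, y_m = sample(1000, mean=[1.5, 1.5], boundary=0.0)

print("class-1 ratio:", y_t.mean(), y_v.mean(), y_a.mean(), y_m.mean())
```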
The types of concept drift
Commonly, concept drift can be distinguished into four types [8], as shown in Fig. 4:
Research into concept drift adaptation in Types 1-3 focuses on how to minimize the drop in accuracy and achieve the fastest recovery rate during the concept transformation process. In contrast, the study of Type 4 drift emphasizes the use of historical concepts, that is, how to find the best-matched historical concepts in the shortest time. The new concept may suddenly, incrementally, or gradually reoccur.
To better demonstrate the differences between these types, the term “intermediate concept” was introduced by [8] to describe the transformation between concepts. As mentioned by [4], a concept drift may not only take place at an exact timestamp, but may also last for a long period. As a result, intermediate concepts may appear during the transformation as one concept (the starting concept) changes to another (the ending concept). An intermediate concept can be a mixture of the starting concept and the ending concept, as in incremental drift, or it can be either the starting or the ending concept, as in gradual drift.
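The following small Python sketch illustrates, under assumed parameters (two univariate Gaussian concepts, a drift start point $t_0$ and a transition width), how sudden, gradual and incremental drift differ in the way the stream moves from the starting concept to the ending concept: the gradual case samples from the new concept with growing probability, while the incremental case passes through intermediate concepts.

```python
# Illustrative generator for sudden, gradual, and incremental drift between
# two concepts A and B over a stream of length n, with drift starting at t0.
import numpy as np

rng = np.random.default_rng(1)

def concept_A(): return rng.normal(0.0, 1.0)   # starting concept
def concept_B(): return rng.normal(3.0, 1.0)   # ending concept

def stream(n, t0, width, kind):
    out = []
    for t in range(n):
        if kind == "sudden":                      # switch instantly at t0
            use_new = t >= t0
        elif kind == "gradual":                   # sample from B with growing probability
            p = np.clip((t - t0) / width, 0.0, 1.0)
            use_new = rng.random() < p
        elif kind == "incremental":               # intermediate concepts: interpolate means
            p = np.clip((t - t0) / width, 0.0, 1.0)
            out.append((1 - p) * concept_A() + p * concept_B())
            continue
        out.append(concept_B() if use_new else concept_A())
    return np.array(out)

for kind in ("sudden", "gradual", "incremental"):
    s = stream(2000, t0=1000, width=500, kind=kind)
    print(kind, round(s[:1000].mean(), 2), "->", round(s[-500:].mean(), 2))
```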
CONCEPT DRIFT DETECTION
This section focuses on summarizing concept drift detection algorithms. Section 3.1 introduces a typical drift detection framework. Then, Section 3.2 systematically reviews and categorizes drift detection algorithms according to their implementation details for each component in the framework. Finally, Section 3.3 lists the state-of-the-art drift detection algorithms with comparisons of their implementation details.
A general framework for drift detection
Drift detection refers to the techniques and mechanisms that characterize and quantify concept drift via identifying change points or change time intervals [17]. A general framework for drift detection contains four stages, as shown in Fig. 5.
Stage 1 (Data Retrieval) aims to retrieve data chunks from data streams. Since a single data instance cannot carry enough information to infer the overall distribution [2], knowing how to organize data chunks to form a meaningful pattern or knowledge is important in data stream analysis tasks [7].
Stage 2 (Data Modeling) aims to abstract the retrieved data and extract the key features containing sensitive information, that is, the features of the data that most impact a system if they drift. This stage is optional, because it mainly concerns dimensionality reduction, or sample size reduction, to meet storage and online speed requirements [4].
Stage 3 (Test Statistics Calculation) is the measurement of dissimilarity, or distance estimation. It quantifies the severity of the drift and forms test statistics for the hypothesis test. It is considered to be the most challenging aspect of concept drift detection. The problem of how to define an accurate and robust dissimilarity measurement is still an open question. A dissimilarity measurement can also be used in clustering evaluation [11], and to determine the dissimilarity between sample sets [18].
Stage 4 (Hypothesis Test) uses a specific hypothesis test to evaluate the statistical significance of the change observed in Stage 3, or its p-value. These tests are used to determine drift detection accuracy by proving the statistical bounds of the test statistics proposed in Stage 3. Without Stage 4, the test statistics acquired in Stage 3 are meaningless for drift detection, because they cannot determine the drift confidence interval, that is, how likely it is that the change is caused by concept drift and not by noise or random sample selection bias [3]. The most commonly used hypothesis tests are: estimating the distribution of the test statistics [19], [20], bootstrapping [21], [22], the permutation test [3], and Hoeffding’s inequality-based bound identification [23].
It is also worth mentioning that, without Stage 1, the concept drift detection problem can be considered as a two-sample test problem which examines whether the populations of two given sample sets are drawn from the same distribution [18]. In other words, any multivariate two-sample test is an option that can be adopted in Stages 2-4 to detect concept drift [18]. However, in some cases, the distribution drift may not be included in the target features, therefore the selection of the target feature will affect the overall performance of a learning system and is a critical problem in concept drift detection [24].
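As a simple illustration of how the four stages fit together, the sketch below retrieves two sliding windows from a univariate stream (Stage 1), uses the raw observations as the data model (Stage 2), computes the Kolmogorov-Smirnov statistic as the test statistic (Stage 3), and treats its p-value as the hypothesis test (Stage 4). The window sizes and the significance level are arbitrary choices for demonstration, not values recommended by any cited work.

```python
# Minimal sketch of the four-stage drift detection framework with a KS two-sample test.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(stream, win=200, alpha=0.01):
    drifts = []
    for t in range(2 * win, len(stream)):
        hist = stream[t - 2 * win: t - win]   # Stage 1: historical window
        new  = stream[t - win: t]             # Stage 1: most recent window
        stat, p_value = ks_2samp(hist, new)   # Stage 3: test statistic; Stage 4: p-value
        if p_value < alpha:                   # change is statistically significant
            drifts.append(t)
    return drifts

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0, 1, 1000), rng.normal(1.5, 1, 1000)])
print(detect_drift(data)[:5])   # first alarms shortly after the change at t = 1000
```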
Concept drift detection algorithms
This section surveys drift detection methods and algorithms, which are classified into three categories in terms of the test statistics they apply.
Error rate-based drift detection
Learner error rate-based drift detection algorithms form the largest category of algorithms. These algorithms focus on tracking changes in the online error rate of base classifiers. If an increase or decrease of the error rate is proven to be statistically significant, an upgrade process (drift alarm) will be triggered.
One of the most-referenced concept drift detection algorithms is the Drift Detection Method (DDM) [20]. It was the first algorithm to define the warning level and drift level for concept drift detection. In this algorithm, Stage 1 is implemented by a landmark time window, as shown in Fig. 6. When a new data instance becomes available for evaluation, DDM detects whether the overall online error rate within the time window has increased significantly. If the confidence level of the observed error rate change reaches the warning level, DDM starts to build a new learner while using the old learner for predictions. If the change reaches the drift level, the old learner will be replaced by the new learner for further prediction tasks. To acquire the online error rate, DDM needs a classifier to make the predictions. This process converts training data to a learning model, which is considered as Stage 2 (Data Modeling). The test statistic in Stage 3 is the online error rate. The hypothesis test, Stage 4, is conducted by estimating the distribution of the online error rate and calculating the warning and drift thresholds.
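A minimal sketch of the DDM-style test statistic and thresholds described above is given below. The incremental error-rate update, the 30-instance minimum and the class interface are our own illustrative choices; the warning and drift conditions follow the $p_{\min} + 2 s_{\min}$ and $p_{\min} + 3 s_{\min}$ rule quoted above, and a full implementation would reset the statistics after a drift is reported.

```python
# Illustrative DDM-style error-rate monitor (simplified; no reset logic).
import math

class SimpleDDM:
    def __init__(self, min_samples=30):
        self.i = 0; self.p = 1.0; self.s = 0.0
        self.p_min = float("inf"); self.s_min = float("inf")
        self.min_samples = min_samples

    def update(self, error):
        """error: 1 if the base learner misclassified the instance, else 0."""
        self.i += 1
        self.p += (error - self.p) / self.i                 # incremental online error rate
        self.s = math.sqrt(self.p * (1 - self.p) / self.i)  # binomial standard deviation
        if self.i < self.min_samples:
            return "in-control"
        if self.p + self.s < self.p_min + self.s_min:       # track the minimum of p + s
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s >= self.p_min + 3 * self.s_min:  # drift level (three-sigma rule)
            return "drift"
        if self.p + self.s >= self.p_min + 2 * self.s_min:  # warning level
            return "warning"
        return "in-control"
```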
Similar implementations have been adopted and applied in Learning with Local Drift Detection (LLDD) [25], the Early Drift Detection Method (EDDM) [26], the Hoeffding’s inequality-based Drift Detection Method (HDDM) [23], the Fuzzy Windowing Drift Detection Method (FW-DDM) [5], and the Dynamic Extreme Learning Machine (DELM) [27]. LLDD modifies Stages 3 and 4, dividing the overall drift detection problem into a set of decision tree node-based drift detection problems; EDDM improves Stage 3 of DDM, using the distance between two correct classifications to improve the sensitivity of drift detection; HDDM modifies Stage 4, using Hoeffding’s inequality to identify the critical region of a drift; FW-DDM improves Stage 1 of DDM, using a fuzzy time window instead of a conventional time window to address the gradual drift problem; DELM does not change the DDM detection algorithm but uses a novel base learner, a single hidden layer feedforward neural network called the Extreme Learning Machine (ELM) [28], to improve the adaptation process after a drift has been confirmed. EWMA for Concept Drift Detection (ECDD) [29] also takes advantage of the error rate to detect concept drift. ECDD employs the EWMA chart to track changes in the error rate. The implementation of Stages 1-3 of ECDD is the same as for DDM, while Stage 4 is different. ECDD modifies the conventional EWMA chart by using a dynamic mean $\hat{p}_{0,t}$ instead of the conventional static mean $p_{0}$, where $\hat{p}_{0,t}$ is the estimated online error rate within time $[0, t]$, and $p_{0}$ is the theoretical error rate when the learner was initially built. Accordingly, the dynamic variance can be calculated by $\sigma_{Z_{t}}^{2}=\hat{p}_{0,t}\left(1-\hat{p}_{0,t}\right) \sqrt{\frac{\lambda}{2-\lambda}\left(1-(1-\lambda)^{2 t}\right)}$, where $\lambda$ controls how much weight is given to more recent data as opposed to older data, and $\lambda=0.2$ is recommended by the authors. When the test statistic of the conventional EWMA chart satisfies $Z_{t}>\hat{p}_{0,t}+0.5 L \sigma_{Z_{t}}$, ECDD reports a concept drift warning; when $Z_{t}>\hat{p}_{0,t}+L \sigma_{Z_{t}}$, ECDD reports a concept drift. The control limit $L$ is given by the authors through experimental evaluation.
In contrast to DDM and other similar algorithms, the Statistical Test of Equal Proportions Detection (STEPD) [30] detects error rate change by comparing the most recent time window with the overall time window; for each timestamp, there are two time windows in the system, as shown in Fig. 7. The size of the new window must be defined by the user. According to [30], the test statistic $\theta_{\text{STEPD}}$ conforms to the standard normal distribution, denoted as $\theta_{\mathrm{STEPD}} \sim \mathcal{N}(0,1)$. The significance levels of the warning level and the drift level were suggested as $\alpha_{w}=0.05$ and $\alpha_{d}=0.003$, respectively. As a result, the warning threshold and drift threshold can be easily calculated.
Another popular two-time-window-based drift detection algorithm is ADaptive WINdowing (ADWIN) [31]. Unlike STEPD, ADWIN does not require users to define the size of the compared windows in advance; it only needs to specify the total size $n$ of a "sufficiently large" window $W$. It then examines all possible cuts of $W$ and computes optimal sub-window sizes $n_{\text{hist}}$ and $n_{\text{new}}$ according to the rate of change between the two sub-windows $w_{\text{hist}}$ and $w_{\text{new}}$. The test statistic is the difference of the two sample means, $\theta_{\mathrm{ADWIN}}=\left|\hat{\mu}_{\mathrm{hist}}-\hat{\mu}_{\mathrm{new}}\right|$. An optimal cut is found when the difference exceeds a threshold with a predefined confidence interval $\delta$. The authors proved that both the false positive rate and false negative rate are bounded by $\delta$. It is worth noting that many concept drift adaptation methods/algorithms in the literature are derived from or combined with ADWIN, such as [32]-[35]. Since their drift detection methods are implemented with almost the same strategy, we will not discuss them in detail.
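The sketch below illustrates the ADWIN idea in a simplified form. This quadratic-time version with a capped window and a Hoeffding-style cut threshold is only for exposition; the published algorithm uses an exponential histogram of buckets to examine cuts efficiently, and its exact bound and parameter handling differ.

```python
# Simplified, illustrative ADWIN-style window shrinking (not the published algorithm).
import math

class SimpleADWIN:
    def __init__(self, delta=0.002, max_len=2000):
        self.delta = delta
        self.max_len = max_len
        self.window = []

    def update(self, value):
        self.window.append(value)
        self.window = self.window[-self.max_len:]        # cap memory for this sketch
        n = len(self.window)
        for cut in range(10, n - 10):                    # require a minimum sub-window size
            old, new = self.window[:cut], self.window[cut:]
            n0, n1 = len(old), len(new)
            m = 1.0 / (1.0 / n0 + 1.0 / n1)              # harmonic mean of sub-window sizes
            eps = math.sqrt(1.0 / (2 * m) * math.log(4 * n / self.delta))
            if abs(sum(old) / n0 - sum(new) / n1) > eps: # means differ beyond the bound
                self.window = new                        # shrink: keep only the new concept
                return True                              # drift reported
        return False
```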
Data Distribution-based Drift Detection
The second largest category of drift detection algorithms is data distribution-based drift detection. Algorithms in this category use a distance function/metric to quantify the dissimilarity between the distribution of historical data and the new data. If the dissimilarity is proven to be statistically significant, the system will trigger a learning model upgrade process. These algorithms address concept drift at its root source, which is the distribution drift. Not only can they accurately identify the time of drift, they can also provide location information about the drift. However, these algorithms are usually reported as incurring higher computational cost than the algorithms mentioned in Section 3.2.1 [2]. In addition, these algorithms usually require users to predefine the historical time window and the new data window. The commonly used strategy is two sliding windows, with the historical time window fixed while sliding the new data window [3], [22], [36], as shown in Fig. 8.
According to the literature, the first formal treatment of change detection in data streams was proposed by [37]. In their study, the authors point out that the most natural notion of distance between distributions is the total variation, defined as $TV\left(P_{1}, P_{2}\right)=2\sup_{E\in \mathcal{E}}\left|P_{1}(E)-P_{2}(E)\right|$ or, equivalently, when the distributions have density functions $f_{1}$ and $f_{2}$, $\mathrm{dist}_{L^{1}}=\int\left|f_{1}(x)-f_{2}(x)\right| \mathrm{d} x$. This provides practical guidance on the design of a distance function for distribution discrepancy analysis. Accordingly, [37] proposed a family of distances, called Relativized Discrepancy (RD). The authors also present the significance level of the distance according to the number of data instances. The bounds on the probabilities of missed detections and false alarms are theoretically proven, using Chernoff bounds and the Vapnik-Chervonenkis dimension. The authors of [37] do not propose novel high-dimensional friendly data models for Stage 2 (data modeling); instead, they stress that a suitable model choice is an open question.
Another typical density-based drift detection algorithm is the Information-Theoretic Approach (ITA) [22]. The intuitive idea underlying this algorithm is to use a kdqTree to partition the historical and new data (multi-dimensional) into a set of bins, denoted as $\mathcal{A}$, and then use the Kullback-Leibler divergence to quantify the difference of the densities $\theta_{\text{ITA}}$ in each bin. The hypothesis test applied by ITA is bootstrapping: $W_{\text{hist}}$ and $W_{\text{new}}$ are merged as $W_{\text{all}}$ and resampled as $W_{\text{hist}}^{\prime}$ and $W_{\text{new}}^{\prime}$ to recompute $\theta_{\text{ITA}}^{\prime}$. Once the estimated probability $P\left(\theta_{\mathrm{ITA}}^{\prime} \geq \theta_{\mathrm{ITA}}\right)<1-\alpha$, concept drift is confirmed, where $\alpha$ is the significance level controlling the sensitivity of drift detection.
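For intuition, the following sketch mimics this kind of procedure on one-dimensional data. A fixed histogram replaces the kdqTree partition, and the bin count, resampling scheme and significance level are illustrative assumptions rather than the published configuration: the Kullback-Leibler divergence between the binned windows is the test statistic, and bootstrapping on the merged window decides whether the observed value is significant.

```python
# Illustrative distribution-based detection: histogram KL divergence + bootstrap test.
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    p = p + eps; q = q + eps
    p /= p.sum(); q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def distribution_drift_test(w_hist, w_new, bins=20, n_boot=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(np.concatenate([w_hist, w_new]), bins=bins)
    theta = kl_divergence(np.histogram(w_hist, edges)[0].astype(float),
                          np.histogram(w_new, edges)[0].astype(float))
    merged = np.concatenate([w_hist, w_new])
    count = 0
    for _ in range(n_boot):                        # resample two windows from merged data
        resampled = rng.permutation(merged)
        h, n = resampled[:len(w_hist)], resampled[len(w_hist):]
        theta_b = kl_divergence(np.histogram(h, edges)[0].astype(float),
                                np.histogram(n, edges)[0].astype(float))
        if theta_b >= theta:
            count += 1
    return (count / n_boot) < alpha                # True means drift is confirmed
```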
Similar distribution-based drift detection methods/algorithms are: Statistical Change Detection for multidimensional data (SCD) [38], Competence Model-based drift detection (CM) [2], a prototype-based classification model for evolving data streams called SyncStream [36], PCA-based change detection framework (PCA-CD) [39], Equal Density Estimation (EDE) [40], Least Squares Density Difference-based Change Detection Test (LSDD-CDT) [21], Incremental version of LSDD-CDT (LSDD-INC) [41] and Local Drift Degree-based Density Synchronized Drift Adaptation (LDD-DSDA) [4].
Multiple Hypothesis Test Drift Detection
Multiple hypothesis test drift detection algorithms apply similar techniques to those mentioned in the previous two categories. The novelty of these algorithms is that they use multiple hypothesis tests to detect concept drift in different ways. These algorithms can be divided into two groups: 1) parallel multiple hypothesis tests; and 2) hierarchical multiple hypothesis tests.
The idea of the parallel multiple hypothesis drift detection algorithm is demonstrated in Fig. 9. According to the literature, the Just-In-Time adaptive classifier (JIT) [19] is the first algorithm that sets multiple drift detection hypotheses in this way. The core idea of JIT is to extend the CUSUM chart, known as the Computational Intelligence-based CUSUM test (CI-CUSUM), to detect mean changes in the features of interest to learning systems. The authors of [19] gave the following four configurations for the drift detection target. Config1: the features extracted by Principal Component Analysis (PCA), which removes eigenvalues whose sum is below a threshold, e.g., 0.001. Config2: the PCA-extracted features plus one generic component of the original features $x_{i}$. Config3: detects the drift in each $x_{i}$ individually. Config4: detects drift in all possible combinations of the feature space $x_{i}$. The authors stated that Config2 is the preferred setting for most situations, according to their experimentation, and also mentioned that Config1 may have a high missing rate, Config3 suffers from a high false alarm rate, and Config4 has exponential computational complexity. The same drift detection strategy has also been applied in [42]-[45] for concept drift adaptation.
Similar implementations have been applied in Linear Four Rate drift detection (LFR) [46], which maintains and tracks the changes in True Positive rate (TP), True Negative rate (TN), False Positive rate (FP) and False Negative rate (FN) in an online manner. The drift detection process also includes warning and drift levels.
Another parallel multiple hypothesis drift detection algorithm is the three-layer drift detection algorithm based on Information Value and Jaccard similarity (IV-Jac) [47]. IV-Jac aims to individually address label drift $P_{t}(y)$ (Layer I), feature space drift $P_{t}(X)$ (Layer II), and decision boundary drift $P_{t}(y \mid X)$ (Layer III). It extracts the Weight of Evidence (WoE) and Information Value (IV) from the available data and then detects whether a significant change exists between the WoE and IV extracted from $W_{\text{hist}}$ and $W_{\text{new}}$, by measuring the contribution of a feature value to the label. The hypothesis test thresholds are predefined parameters, $\theta_{P_{t}(y)}=\theta_{P_{t}(X)}=\theta_{P_{t}(X \mid y)}=0.5$ by default, which are chosen empirically.
The Ensemble of Detectors (e-Detector) [48] was proposed to detect concept drift via an ensemble of heterogeneous drift detectors. The authors consider two drift detectors to be homogeneous if they are equivalent in finding concept drifts, and heterogeneous otherwise. e-Detector groups homogeneous drift detectors via a diversity measurement, named the diversity vector. For each group, it selects the one with the smallest coefficient of failure as the base detector to form the ensemble. e-Detector reports concept drift following the early-find-early-report rule, which means that no matter which base detector detects a drift, e-Detector reports a drift. A similar strategy has been applied in the drift detection ensemble (DDE) [49].
Hierarchical drift detection is an emerging drift detection category that has a multiple verification schema. The algorithms in this category usually detect drift using an existing method, called the detection layer, and then apply an extra hypothesis test, called the validation layer, to obtain a second validation of the detected drift in a hierarchical way. The overall workflow is shown in Fig. 10.
According to the claim made by [50], Hierarchical Change-Detection Tests (HCDTs) is the first attempt to address concept drift using a hierarchical architecture. The detection layer can be any existing drift detection method that has a low drift delay rate and low computational burden. The validation layer will be activated and deactivated based on the results returned by the detection layer. The authors recommend two strategies for designing the validation layer: 1) estimating the distribution of the test statistics by maximizing the likelihood; 2) adapting an existing hypothesis test, such as the Kolmogorov-Smirnov test or the Cramer-Von Mises test.
Hierarchical Linear Four Rate (HLFR) [51] is another recently proposed hierarchical drift detection algorithm. It applies the drift detection algorithm LFR as the detection layer. Once a drift is confirmed by the detection layer, the validation layer will be triggered. The validation layer of HLFR is simply a zero-one loss, denoted as E, over the ordered train-test split. If the estimated zero-one loss exceeds a predefined threshold, η = 0.01, the validation layer will confirm the drift and report to the learning system to trigger a model upgradation process.
Two-Stage Multivariate Shift-Detection based on EWMA (TSMSD-EWMA) [52] has a very similar implementation, however, the authors do not claim that their method is a hierarchy-based algorithm.
Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with Attribute-wise “Goodness-of-fit” (HHT-AG) are two drift detection algorithms based on a request-and-reverify strategy [53]. For HHT-CU, the detection layer is a hypothesis test based on Hoeffding’s inequality that monitors the change of a classification uncertainty measurement. The validation layer is a permutation test that evaluates the change of the zero-one loss of the learner. For HHT-AG, the detection layer is conducted based on the Kolmogorov-Smirnov (KS) test for each feature distribution. HHT-AG then validates the potential drift points by requiring the true labels of data that come from $W_{\text{new}}$, and performing $d$ independent two-dimensional (2D) KS tests with each feature-label bivariate distribution. Compared to other drift detection algorithms, HHT-AG can handle concept drift with fewer true labels, which makes it more powerful when dealing with high verification latency.
Summary of concept drift detection methods/algorithms
TABLE 1 lists the most popular concept drift detection methods/algorithms against the general framework summarized in Section 3.1 (Fig. 5). A comparative study on eight popular drift detection methods can be found in [54].
CONCEPT DRIFT UNDERSTANDING
Drift understanding refers to retrieving concept drift information about “When” (the time at which the concept drift occurs and how long the drift lasts), “How” (the severity/degree of concept drift), and “Where” (the drift regions of concept drift). This status information is the output of the drift detection algorithms, and is used as input for drift adaptation.
The time at which concept drift occurs (When)
The most basic function of drift detection is to identify the timestamp at which a drift occurs. Recalling the definition of concept drift, $\exists t: P_{t}(X, y) \neq P_{t+1}(X, y)$, the variable $t$ represents the time at which a concept drift occurs. In drift detection methods/algorithms, an alarm signal is used to indicate whether concept drift has occurred at the current timestamp. It is also a signal for a learning system to adapt to a new concept. Accurately identifying the time a drift occurs is critical to the adaptation process of a learning system; a delay or a false alarm will lead to failure of the learning system to track new concepts.
A drift alarm usually has a statistical guarantee with a predefined false alarm rate. Error rate-based drift detection algorithms monitor the performance of the learning system based on statistical process control. For example, DDM [20] sends a drift signal when the learning accuracy of the learner drops below a predefined threshold, which is chosen by the three-sigma rule [55]. ECDD [29] reports a drift when the online error rate exceeds the control limit of the EWMA chart. Most data distribution-based drift detection algorithms report a drift alarm when two data samples have a statistically significant difference. PCA-based drift detection [36] outputs a drift signal when the p-value of the generalized Wilcoxon test statistic $W_{B F}^{l}$ is significantly large. The method in [3] confirms that a drift has occurred by verifying whether the empirical competence-based distance is significantly large through a permutation test.
Taking into account the various drift types, concept drift understanding needs to explore the start time point, the change period, and the end time point of concept drift. This time information could be a useful input for the adaptation process of the learning system. However, the drift timestamp alerted by existing drift detection algorithms is delayed compared to the actual drift timestamp, since most drift detectors require a minimum number of new data instances to evaluate the status of the drift, as shown in Fig. 11. The emergence time of the new concept is therefore still vague. Some concept drift detection algorithms, such as DDM [20], EDDM [26], STEPD [30], and HDDM [23], trigger a warning level to indicate that a drift may have occurred. The threshold used to trigger the warning level is a relaxed condition of the threshold used for the drift level; for example, the warning level is set to a confidence of 95% or 2σ, and the drift level to 99% or 3σ. The data accumulated between the warning level and the drift level are used as the training set for updating the learning model.
The severity of concept drift (How)
The severity of concept drift refers to using a quantified value to measure the similarity between the new concept and the previous concept, as shown in Fig. 11. Formally, the severity of concept drift can be represented as $\Delta = \delta\left(P_{t}(X, y), P_{t+1}(X, y)\right)$, where $\delta$ is a function to measure the discrepancy of two data distributions, and $t$ is the timestamp when the concept drift occurred. $\Delta$ is usually a non-negative value indicating the severity of concept drift. The greater the value of $\Delta$, the larger the severity of the concept drift.
In general, error rate-based drift detection cannot directly measure the severity of concept drift, because it mainly focuses on monitoring the performance of the learning system, not the changes in the concept itself. However, the degree of decrease in learning accuracy can be used as an indirect measurement to indicate the severity of concept drift. If learning accuracy has dropped significantly when drift is observed, this indicates that the new concept is different from the previous one. For example, the severity of concept drift could be reflected by the difference between $p_i$ and $p_{\min}$ in [20], [27], denoted as $\Delta \sim p_i - p_{\min}$; the difference between the overall accuracy $\hat{p}_{\text{hist}}$ and the recent accuracy $\hat{p}_{\text{new}}$ in [30], expressed as $\Delta \sim \hat{p}_{\text{new}} - \hat{p}_{\text{hist}}$; and the difference between the test statistic in the former window $E[\hat{X}_{\text{cut}}]$ and the test statistic in the later window $E[\hat{Y}_{i-\text{cut}}]$ [23], denoted as $\Delta \sim E[\hat{Y}_{i-\text{cut}}] - E[\hat{X}_{\text{cut}}]$. However, the meaning of these differences is not discussed in existing publications. The ability of error rate-based drift detection to output the severity of concept drift is still vague.
Data distribution-based drift detection methods can directly quantify the severity of concept drift, since the measurement used to compare two data samples already reflects the difference. For example, [37] employed a relaxation of the total variation distance, $d_A(S_1, S_2)$, to measure the difference between two data distributions. [3] proposed a competence-based empirical distance, $d^{CB}(S_1, S_2)$, to show the difference between two data samples. Other drift detection methods have used information-theoretic distances; for example, the Kullback-Leibler divergence $D(W_1\|W_2)$, also called relative entropy, was used in the study reported in [22]. The range of these distances is $[0,1]$. The greater the distance, the larger the severity of the concept drift. A distance of "1" means that the new concept is completely different from the previous one, while a distance of "0" means that the two data concepts are identical. The test statistic $\delta$ used in [38] gives an extremely small negative value if the new concept is quite different from the previous concept. The degree of concept drift in [36] is measured by the resulting p-value of the test statistic $W_{BF}^{l}$ and the defined $Angle\left(D_{t}, D_{t+1}\right)$ of two datasets $D_{t}$ and $D_{t+1}$.
The severity of concept drift can be used as a guideline for choosing drift adaptation strategies. For example, if the severity of drift in a classification task is low, the decision boundary may not move much in the new concept. Thus, adjusting the current learner by incremental learning will be adequate. In contrast, if the severity of the concept drift is high, the decision boundary could change significantly, so discarding the old learner and retraining a new one could be better than incrementally updating the old learner. We would like to mention that, even though some research has promoted the ability to describe and quantify the severity of the detected drift, this information is not yet widely utilized in drift adaptation.
The drift regions of concept drift (Where)
The drift regions of concept drift are the regions of conflict between a new concept and the previous concept. Drift regions are located by finding regions in data feature space $X$ where $P_{t}(X, y)$ and $P_{t+1}(X, y)$ have significant difference. To illustrate this, we give an example of a classification task in Fig. 12. The data used in this task are uniformly sampled in the range of $[0,10]^{2}$. The left sub-figure is the data sample accumulated at time $t,$ where the decision boundary is $x_{1}+x_{2}=8 .$ The middle sub-figure is the data accumulated at time $t+1,$ where the decision boundary is $x_{1}+x_{2}=9$. Intuitively, data located in regions $8 \leq x_{1}+x_{2}<9$ have different classes in time $t$ and time $t+1,$ since the decision boundary has changed. The right sub-figure shows the data located in the drift regions.
The techniques to identify drift regions are highly dependent on the data model used in the drift detection methods/algorithms. Paper [25] detected drift data in local regions of the instance space by using the online error rate inside the inner nodes of a decision tree. The whole data feature space is partitioned by the decision tree. Each leaf of this decision tree corresponds to a hyper-rectangle in the data feature space. All leaf nodes contain a drift detector. When the leaf nodes are alerted to a drift, the corresponding hyper-rectangles indicate the regions of drift in the data feature space. Similar to [25], [22] utilized the nodes of the kdq-tree with Kulldorff's spatial scan statistic to identify the regions in which data had changed the most. Once a drift has been reported, Kulldorff's statistic measures how the two datasets differ only with respect to the region associated with each leaf node of the kdq-tree. The leaf nodes with the greatest statistical values indicate the drift regions. [2] highlighted the most severe regions of concept drift through top-p-competence areas. Utilizing the RelatedSets of the competence model, the data feature space is partitioned by a set of overlapping hyperspheres. The RelatedSet-based empirical distance defines the distance between two datasets on a particular hypersphere. The drift regions are identified by the corresponding hyperspheres with large empirical distance at the top p% level. [4] identified drift regions by monitoring the discrepancy in the regional density of data, which is called the local drift degree. A local region with a corresponding increase or decrease in density will be highlighted as a critical drift region.
Locating concept drift regions benefits drift adaptation. Paper [56] pointed out that even if the concept of the entire dataset drifts, there are regions of the feature space that will remain stable significantly longer than other regions. In an ensemble scenario, the old learning models of stable regions could still be useful for predicting data instances located within stable regions, or data instances located in drift regions could be used to build a more updated current model. The authors of [3] also pointed out that identifying drift regions can help in recognizing obsolete data that conflict with current concepts and distinguish noise data from novel data. In their later research [2], they utilized top-p-competence areas to edit cases in a case-based reasoning system for fast new concept switching. One step in their drift adaptation is to remove conflict instances. To preserve as many instances of a new concept as possible, they only remove obsolete conflict instances which are outside the drift regions. LDD-DSDA [4] utilized drift regions as an instance selection strategy to construct a training set that continually tracked a new concept.
Summary of drift understanding
We summarize concept drift detection algorithms according to their ability to identify when, how, and where concept drift occurs, as shown in TABLE 2. All drift detection algorithms can identify the occurrence time of concept drift (when), and most data distribution-based drift detection algorithms can also measure the severity of concept drift (how) through the test statistics, but only a few algorithms have the ability to locate drift regions (where).
DRIFT ADAPTATION
This section focuses on strategies for updating existing learning models according to the drift, which is known as drift adaptation or reaction. There are three main groups of drift adaptation methods, namely simple retraining, ensemble retraining and model adjusting, that aim to handle different types of drift.
Training new models for global drift
Perhaps the most straightforward way of reacting to concept drift is to retrain a new model with the latest data to replace the obsolete model, as shown in Fig. 13. An explicit concept drift detector is required to decide when to retrain the model (see Section 3 on drift detection). A window strategy is often adopted in this method to preserve the most recent data for retraining and/or old data for distribution change test. Paired Learners [57] follows this strategy and uses two learners: the stable learner and the reactive learner. If the stable learner frequently misclassifies instances that the reactive learner correctly classifies, a new concept is detected and the stable learner will be replaced with the reactive learner. This method is simple to understand and easy to implement, and can be applied at any point in the data stream.
When adopting a window-based strategy, a trade-off must be made in order to decide an appropriate window size. A small window can better reflect the latest data distribution, but a large window provides more data for training a new model. A popular window scheme algorithm that aims to mitigate this problem is ADWIN [31]. Unlike most earlier works, it does not require the user to guess a fixed size of the windows being compared in advance; instead, it examines all possible cuts of the window and computes optimal sub-window sizes according to the rate of change between the two sub-windows. After the optimal window cut has been found, the window containing old data is dropped and a new model can be trained with the latest window data.
Instead of directly retraining the model, researchers have also attempted to integrate the drift detection process with the retraining process for specific machine learning algorithms. DELM [27] extends the traditional ELM algorithm with the ability to handle concept drift by adaptively adjusting the number of hidden layer nodes. When the classification error rate increases, which could indicate the emergence of a concept drift, more nodes are added to the network layers to improve its approximation capability. Similarly, FP-ELM [58] is an ELM-extended method that adapts to drift by introducing a forgetting parameter to the ELM model. A parallel version of an ELM-based method [59] has also been developed for high-speed classification tasks under concept drift. OS-ELM [60] is another online learning ensemble of regressor models that integrates ELM using an ordered aggregation (OA) technique, which overcomes the problem of defining the optimal ensemble size.
Instance-based lazy learners for handling concept drift have also been studied intensively. The Just-in-Time adaptive classifier [19], [42] is such a method, which follows the “detect and update model” strategy. For drift detection, it extends the traditional CUSUM test [61] to a pdf-free form. This detection method is then integrated with a kNN classifier [42]. When a concept drift is detected, old instances (more than the last $T$ samples) are removed from the case base. In later work, the authors of [11], [45] extended this algorithm to handle recurrent concepts by computing and comparing the current concept to previously stored concepts. NEFCS [2] is another kNN-based adaptive model. A competence model-based drift detection algorithm [3] was used to locate drift instances in the case base and distinguish them from noise instances, and a redundancy removal algorithm, Stepwise Redundancy Removal (SRR), was developed to remove redundant instances in a uniform way, guaranteeing that the reduced case base would still preserve enough information for future drift detection.
Model ensemble for recurring drift
In the case of recurring concept drift, preserving and reusing old models can save significant effort in retraining a new model for recurring concepts. This is the core idea of using ensemble methods to handle concept drift. Ensemble methods have received much attention in the stream data mining research community in recent years. Ensemble methods comprise a set of base classifiers that may have different types or different parameters. The output of each base classifier is combined using certain voting rules to predict the newly arrived data. Many adaptive ensemble methods have been developed that aim to handle concept drift by extending classical ensemble methods or by creating specific adaptive voting rules. An example is shown in Fig. 14, where a new base classifier is added to the ensemble when concept drift occurs.
Bagging, Boosting and Random Forests are classical ensemble methods used to improve the performance of single classifiers. They have all been extended for handling streaming data with concept drift. An online version of the bagging method was first proposed in [62], which uses each instance only once to simulate batch-mode bagging. In a later study [63], this method was combined with the ADWIN drift detection algorithm [31] to handle concept drift. When a concept drift is reported, the newly proposed method, called Leveraging Bagging, trains a new classifier on the latest data to replace the existing classifier with the worst performance. Similarly, an adaptive boosting method was developed in [64] which handles concept drift by monitoring prediction accuracy using a hypothesis test, assuming that classification errors on non-drifting data should follow a Gaussian distribution. In a recent work [35], the Adaptive Random Forest (ARF) algorithm was proposed, which extends the random forest algorithm with a concept drift detection method, such as ADWIN [31], to decide when to replace an obsolete tree with a new one. A similar work can be found in [65], which uses the Hoeffding bound to distinguish concept drift from noise within decision trees.
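A compact sketch of the online bagging idea underlying these methods is shown below. The `partial_fit`/`predict` learner interface and the ensemble size are assumptions for illustration: each base learner sees every incoming instance $k \sim \mathrm{Poisson}(1)$ times, which approximates the bootstrap resampling of batch bagging; a drift-aware variant such as Leveraging Bagging would additionally replace the worst learner when a drift detector fires.

```python
# Illustrative online bagging: Poisson(1)-weighted incremental training of base learners.
import numpy as np

rng = np.random.default_rng(3)

class OnlineBagging:
    def __init__(self, make_learner, n_learners=10):
        self.learners = [make_learner() for _ in range(n_learners)]

    def partial_fit(self, x, y):
        for learner in self.learners:
            k = rng.poisson(1.0)              # how many times this learner sees the instance
            for _ in range(k):
                learner.partial_fit(x, y)     # assumes learners expose partial_fit/predict

    def predict(self, x):
        votes = [learner.predict(x) for learner in self.learners]
        return max(set(votes), key=votes.count)   # simple majority vote
```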
Besides extending classical methods, many new ensemble methods have been developed to handle concept drift using novel voting techniques. Dynamic Weighted Majority (DWM) [66] is such an ensemble method, capable of adapting to drift with a simple set of weighted voting rules. It manages base classifiers according to the performance of both the individual classifiers and the global ensemble. If the ensemble misclassifies an instance, DWM will train a new base classifier and add it to the ensemble. If a base classifier misclassifies an instance, DWM reduces its weight by a factor. When the weight of a base classifier drops below a user-defined threshold, DWM removes it from the ensemble. The drawback of this method is that the classifier-adding process may be triggered too frequently, introducing performance issues on some occasions, such as when gradual drift occurs. A well-known ensemble method, Learn++NSE [67], mitigates this issue by weighting base classifiers according to their prediction error rate on the latest batch of data. If the error rate of the youngest classifier exceeds 50%, a new classifier will be trained based on the latest data. This method has several other benefits: it can easily adopt almost any base classifier algorithm; it does not store history data, only the latest batch of data, which it only uses once to train a new classifier; and it can handle sudden drift, gradual drift, and recurrent drift because underperforming classifiers can be reactivated or deactivated as needed by adjusting their weights. Voting strategies other than standard weighted voting have also been applied to handle concept drift. Examples include a hierarchical ensemble structure [68], [69], short-term and long-term memory [13], [70] and dynamic ensemble sizes [71], [72].
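The weighted-voting bookkeeping of DWM can be sketched as follows. This is a simplified version: the penalty factor $\beta = 0.5$, the removal threshold, the per-instance update period, and the `partial_fit`/`predict` learner interface are illustrative assumptions rather than the exact published pseudocode.

```python
# Illustrative DWM-style ensemble: weighted majority voting with expert add/remove rules.
class SimpleDWM:
    def __init__(self, make_learner, beta=0.5, theta=0.01):
        self.make_learner = make_learner
        self.beta, self.theta = beta, theta
        self.experts = [[make_learner(), 1.0]]       # [learner, weight] pairs

    def predict(self, x):
        votes = {}
        for learner, w in self.experts:
            label = learner.predict(x)
            votes[label] = votes.get(label, 0.0) + w
        return max(votes, key=votes.get)             # weighted-majority vote

    def update(self, x, y):
        global_pred = self.predict(x)
        for pair in self.experts:
            if pair[0].predict(x) != y:
                pair[1] *= self.beta                 # penalize experts that erred
        self.experts = [p for p in self.experts if p[1] >= self.theta]
        if global_pred != y or not self.experts:
            self.experts.append([self.make_learner(), 1.0])   # global mistake: add an expert
        total = sum(w for _, w in self.experts)
        for pair in self.experts:
            pair[1] /= total                         # normalize weights
        for learner, _ in self.experts:
            learner.partial_fit(x, y)                # every expert keeps learning online
```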
A number of research efforts have been made that focus on developing ensemble methods for handling concept drift of certain types. The Accuracy Updated Ensemble (AUE2) [73] was proposed with an emphasis on handling both sudden drift and gradual drift equally well. It is a batch-mode weighted voting ensemble method based on incremental base classifiers. Through re-weighting, the ensemble is able to react quickly to sudden drift. All classifiers are also incrementally trained with the latest data, which ensures that the ensemble evolves with gradual drift. The Optimal Weights Adjustment (OWA) method [74] achieves the same goal by building ensembles using both weighted instances and weighted classifiers for different concept drift types. The authors of [75] considered a special case of concept drift, class evolution, which is the phenomenon of class emergence and disappearance. Recurring concepts are handled in [76], [77], which monitor concept information to decide when to reactivate previously stored obsolete models. [78] is another method that handles recurring concepts by refining the concept pool to avoid redundancy.
Adjusting existing models for regional drift
An alternative to retraining an entire model is to develop a model that adaptively learns from the changing data. Such models have the ability to partially update themselves when the underlying data distribution changes, as shown in Fig. 15. This approach is arguably more efficient than retraining when the drift only occurs in local regions. Many methods in this category are based on the decision tree algorithm because trees have the ability to examine and adapt to each sub-region separately.
In a foundational work [79], an online decision tree algorithm, called the Very Fast Decision Tree classifier (VFDT), was proposed, which is especially tailored for high-speed data streams. It uses the Hoeffding bound to limit the number of instances required for node splitting. This method has become very popular because of its several distinct advantages: 1) it only needs to process each instance once and does not store instances in memory or on disk; 2) the tree itself only consumes a small amount of space and does not grow with the number of instances it processes unless there is new information in the data; 3) the cost of tree maintenance is very low. An extended version, called CVFDT [80], was later proposed to handle concept drift. In CVFDT, a sliding window is maintained to hold the latest data. An alternative sub-tree is trained based on the window and its performance is monitored. If the alternative sub-tree outperforms its original counterpart, it will be used for future prediction and the original obsolete sub-tree will be pruned. VFDTc [81] is another attempt to improve VFDT with several enhancements: the ability to handle numerical attributes; the application of naive Bayes classifiers in tree leaves; and the ability to detect and adapt to concept drift. Two node-level drift detection methods were proposed based on monitoring differences between a node and its sub-nodes. The first method uses the classification error rate and the second directly checks the distribution difference. When a drift is detected on a node, the node becomes a leaf and its descending sub-tree is removed. Later work [82], [83] further extended VFDTc with an adaptive leaf strategy that chooses the best classifier from three options: majority voting, Naive Bayes and Weighted Naive Bayes.
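The role of the Hoeffding bound in VFDT's split decision can be illustrated with a few lines; the numbers below are made up for demonstration, and $R$ denotes the range of the split measure (e.g., $\log_2 c$ for information gain with $c$ classes).

```python
# Illustrative Hoeffding-bound split check as used in VFDT-style trees.
import math

def hoeffding_bound(value_range, delta, n):
    """With confidence 1 - delta, the true mean lies within this epsilon of the observed mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Hypothetical gains of the two best split attributes after n = 500 instances at a leaf.
best_gain, second_gain = 0.32, 0.25
eps = hoeffding_bound(value_range=math.log2(2), delta=1e-6, n=500)  # R = log2(#classes)
print(eps, best_gain - second_gain > eps)   # split only when the observed gap beats the bound
```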
Despite the success of VFDT, recent studies [84], [85] have shown that its foundation, the Hoeffding bound, may not be appropriate for the node-splitting calculation, because the quantity it computes, the information gain, cannot be expressed as an average of independent random variables. A new online decision tree model [86] was developed based on an alternative impurity measure. The paper shows that this measure also reflects concept drift and can be used as a replacement measure in CVFDT. In the same spirit, another decision tree algorithm (IADEM-3) [87] aims to rectify the use of the Hoeffding bound by computing the sum of independent random variables, called relative frequencies. The error rates of sub-trees are monitored to detect drift and are used for tree pruning.
EVALUATION, DATASETS AND BENCHMARKS
Section 6.1 discusses the evaluation systems used for learning algorithms that handle concept drift. Section 6.2 introduces synthetic datasets, which are used to simulate specific and controllable types of concept drift. Section 6.3 describes real-world datasets, which are used to test overall performance in real-life scenarios.
Evaluation Systems
The evaluation system is an important part of assessing learning algorithms. Some evaluation methodologies used in learning under concept drift have been mentioned in [8]. We enrich this previous research by reviewing evaluation systems from three aspects: 1) validation methodology, 2) evaluation metrics, and 3) statistical significance; each evaluation is followed by its computation equation and a description of its usage.
Validation methodology refers to the procedure by which a learning algorithm determines which data instances are used as the training set and which are used as the testing set. There are three procedures specific to evaluating learning algorithms capable of handling concept drift: 1) holdout, 2) prequential, and 3) controlled permutation. In the scenario of a dataset involving concept drift, holdout should follow the rule: when testing a learning algorithm at time t, the holdout set must represent exactly the concept at that time t. Unfortunately, this is only applicable to synthetic datasets with predefined concept drift times. Prequential is a popular evaluation scheme used on streaming data. Each data instance is first used to test the learning algorithm, and then to train it. This scheme has the advantage that there is no need to know the drift times of concepts, and it makes maximum use of the available data. The prequential error is computed as an accumulated sum of a loss function between the prediction and the observed label: $S=\sum_{t=1}^{n} f\left(\hat{y}_{t}, y_{t}\right)$. There are three prequential error rate estimates: a landmark window (interleaved test-then-train), a sliding window, and a forgetting mechanism [88]. Controlled permutation [89] runs multiple test datasets in which the data order has been permuted in a controlled way to preserve the local distribution, which means that data instances that were originally close to one another in time need to remain close after a permutation. Controlled permutation reduces the risk that prequential evaluation produces biased results due to the fixed order of data in a sequence.
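The following sketch shows the interleaved test-then-train procedure: every instance is first used to test the current model and then to train it, and the prequential error is tracked both as a landmark (accumulated) estimate and with a fading factor that down-weights old losses. The model is assumed to expose `predict` and `partial_fit` and to have been warmed up so that it can predict the first instance; the fading factor `alpha` is an illustrative choice.

```python
import numpy as np

def prequential_error(stream, model, loss=lambda y_hat, y: float(y_hat != y),
                      alpha=0.999):
    """Interleaved test-then-train evaluation (sketch): predict first, record the
    loss, then train on the same instance. Returns the landmark (accumulated)
    error and a fading-factor estimate that reacts faster after a drift."""
    s_landmark, n = 0.0, 0
    s_faded, n_faded = 0.0, 0.0
    for x, y in stream:                        # stream yields (feature vector, label)
        y_hat = model.predict([x])[0]          # test first ...
        l = loss(y_hat, y)
        s_landmark += l
        n += 1
        s_faded = alpha * s_faded + l          # fading-factor accumulators [88]
        n_faded = alpha * n_faded + 1.0
        model.partial_fit([x], [y])            # ... then train
    return s_landmark / n, s_faded / n_faded
```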
Evaluation metrics for datasets involving concept drift can be selected from traditional accuracy measures, such as precision/recall in retrieval tasks, mean absolute scaled error in regression, or root mean square error in recommender systems. In addition, the following measures should be examined: 1) RAM-hours [90] for the computational cost of the mining process; 2) the Kappa statistic $\kappa=\frac{p-p_{ran}}{1-p_{ran}}$ [91] for classification taking into account class imbalance, where $p$ is the accuracy of the classifier under consideration (reference classifier) and $p_{ran}$ is the accuracy of the random classifier; 3) the Kappa-Temporal statistic $\kappa_{per}=\frac{p-p_{per}}{1-p_{per}}$ [92] for the classification of streaming data with temporal dependence, where $p_{per}$ is the accuracy of the persistent classifier (a classifier that predicts the same label as previously observed); 4) the Combined Kappa statistic $\kappa^{+}=\sqrt{\max(0, \kappa) \max\left(0, \kappa_{per}\right)}$ [92], which combines $\kappa$ and $\kappa_{per}$ by taking their geometric average; 5) the Prequential AUC [93]; and 6) the averaged Normalized Area Under the Curve (NAUC) values for the Precision-Range and Recall-Range curves [53], for the classification of streaming data involving concept drift. Apart from evaluating the performance of learning algorithms, the accuracy of a concept drift detection method/algorithm can be assessed according to the following criteria: 1) true detection rate, 2) false detection rate, 3) missed detection rate, and 4) delay of detection [22].
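A small sketch of the Kappa-based metrics follows, computed from a sequence of true labels and predictions: the chance accuracy $p_{ran}$ is estimated from the class marginals, $p_{per}$ from the persistent (no-change) classifier, and the three statistics are combined as defined above. The helper name `kappa_statistics` is illustrative.

```python
import numpy as np

def kappa_statistics(y_true, y_pred):
    """Kappa, Kappa-Temporal and Combined Kappa for a prediction sequence (sketch)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p = np.mean(y_pred == y_true)                      # accuracy of the classifier

    # p_ran: accuracy of a chance classifier, estimated from the class marginals.
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_ran = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    kappa = (p - p_ran) / (1.0 - p_ran)

    # p_per: accuracy of the persistent classifier (predicts the previous label).
    p_per = np.mean(y_true[1:] == y_true[:-1])
    kappa_per = (p - p_per) / (1.0 - p_per)

    # Combined Kappa: geometric average of the two statistics, floored at zero.
    kappa_plus = np.sqrt(max(0.0, kappa) * max(0.0, kappa_per))
    return kappa, kappa_per, kappa_plus
```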
Statistical significance is used to compare learning algorithms on their achieved error rates. The three most frequently used statistical tests for comparing two learning algorithms [94], [95] are: 1) the McNemar test [96]: denote by $a$ the number of data instances misclassified by the first classifier and correctly classified by the second classifier, and denote by $b$ the opposite. The McNemar statistic is computed as $M=\operatorname{sign}(a-b) \times (a-b)^{2}/(a+b)$ to test whether the two classifiers perform equally well; the test follows the $\chi^{2}$ distribution. 2) The sign test: for $N$ data instances, denote by $B$ the number of data instances misclassified by the first classifier and correctly classified by the second classifier, and by $T$ the number of ties. Conduct a one-sided sign test by computing $p=\sum_{k=B}^{N-T}\binom{N-T}{k} 0.5^{k} \times 0.5^{N-T-k}$. If $p$ is less than a significance level, the second classifier is better than the first. 3) Wilcoxon's signed-rank test: for testing two classifiers on $N$ datasets, let $x_{i,1}$ and $x_{i,2}$ $(i=1, \ldots, N)$ denote the measurements. The number of ties is $T$ and $N_{r}=N-T$. The test statistic is $W=\sum_{i=1}^{N_{r}} \operatorname{sign}\left(x_{i,1}-x_{i,2}\right) \times R_{i}$, where $R_{i}$ is the rank of $\left|x_{i,1}-x_{i,2}\right|$ in increasing order. The hypothesis that the two classifiers perform equally well is rejected if $|W|>W_{\text{critical}, N_{r}}$ (two-sided), where $W_{\text{critical}, N_{r}}$ can be obtained from the statistical table. All three tests are non-parametric. The Nemenyi test [97] is used to compare more than two learning algorithms. It is an appropriate test for comparing all learning algorithms over multiple datasets, based on the average ranks of the algorithms over all datasets. The Nemenyi test states that two classifiers perform differently if their average ranks differ by at least the critical difference $\mathrm{CD}=q_{\alpha} \sqrt{k(k+1)/(6N)}$, where $k$ is the number of learners, $N$ is the number of datasets, and the critical values $q_{\alpha}$ are based on the Studentized range statistic divided by $\sqrt{2}$. Other tests can be used to compare learning algorithms against a control algorithm [97].
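To illustrate two of these tests, the sketch below computes the McNemar statistic exactly as defined above (referred to a $\chi^{2}$ distribution with one degree of freedom) and the Nemenyi critical difference; the example value of $q_{0.05}$ for five learners is taken from standard critical-value tables and should be looked up for other settings.

```python
import numpy as np
from scipy import stats

def mcnemar_test(y_true, pred_1, pred_2):
    """McNemar statistic as defined above: a = wrong for classifier 1 but right for
    classifier 2, b = the opposite; |M| is referred to a chi-squared distribution
    with one degree of freedom."""
    y_true, pred_1, pred_2 = map(np.asarray, (y_true, pred_1, pred_2))
    a = np.sum((pred_1 != y_true) & (pred_2 == y_true))
    b = np.sum((pred_1 == y_true) & (pred_2 != y_true))
    m = np.sign(a - b) * (a - b) ** 2 / (a + b)
    p_value = stats.chi2.sf(abs(m), df=1)
    return m, p_value

def nemenyi_critical_difference(q_alpha, k, n_datasets):
    """Nemenyi critical difference: two learners differ significantly if their
    average ranks over all datasets differ by at least CD."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Example: 5 learners compared on 20 datasets at alpha = 0.05 (q_0.05 is about 2.728).
print(nemenyi_critical_difference(2.728, k=5, n_datasets=20))   # CD is about 1.364
```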
Synthetic datasets
We list several widely used synthetic datasets for evaluating the performance of learning algorithms dealing with concept drift. Since data instances are generated by predefined rules and specific parameters, a synthetic dataset is a good option for evaluating the performance of learning algorithms under different concept drift scenarios. The dataset provider, the number of instances (#Insts.), the number of attributes (#Attrs.), the number of classes (#Cls.), the types of drift (Types), the sources of drift (Sources), and the references that use each dataset are listed in TABLE 3.
Real-world datasets
In this section, we collect several publicly available real-world datasets, including real-world datasets with synthetic drifts. The dataset provider, the number of instances (#Insts.), the number of attributes (#Attrs.), the number of classes (#Cls.), and the references that use each dataset are shown in TABLE 4.
Most of these datasets contain temporal concept drift spanning different time ranges, e.g., daily (Sensor [108]), seasonal (Electricity [109]) or yearly (Airlines [104], NOAA weather [67]). Others include geographical (Covertype [106]) or categorical (Poker-Hand [106]) concept drift. Certain datasets, mainly text-based, target specific drift types, such as sudden drift (Email data [110]), gradual drift (Spam assassin corpus [111]), recurring drift (Usenet [112]) or novel classes (KDDCup'99 [106], ECUE drift dataset 2 [113]).
These datasets provide realistic benchmarks for evaluating different concept drift handling methods. There are, however, two limitations of real-world datasets: 1) the ground truth of the precise start and end times of drifts is unknown; 2) some real datasets may include mixed drift types. These limitations make it difficult to evaluate methods for understanding drift, and could introduce bias when comparing different machine learning models.
THE CONCEPT DRIFT PROBLEM IN OTHER RESEARCH AREAS
We have observed that handling the concept drift problem is not a standalone research subject; rather, it is indirectly used in a large number of other scenarios. In this section, we adopt this perspective to review recent developments in other research areas that benefit from handling the concept drift problem.
Class imbalance
Class imbalance is a common problem in stream data mining in addition to concept drift. Research efforts have been made to develop effective learning algorithms that tackle both problems at the same time. [117] presented two ensemble methods for learning under concept drift with imbalanced classes. The first method, Learn++.CDS, is extended from Learn++.NSE and combined with the Synthetic Minority class Oversampling Technique (SMOTE). The second algorithm, Learn++.NIE, improves on the previous method by employing a different penalty constraint to prevent prediction accuracy bias and by replacing SMOTE with bagging to avoid oversampling. ESOS-ELM [118] is another ensemble method, which uses an Online Sequential Extreme Learning Machine (OS-ELM) as the base classifier to improve performance on class-imbalanced data. A concept drift detector is integrated to retrain the classifier when drift occurs. The authors then developed another algorithm [119], which is able to tackle multi-class imbalanced data with concept drift. [120] proposed two learning algorithms, OOB and UOB, which build an ensemble model to overcome class imbalance in real time through resampling and time-decayed metrics. [121] developed an ensemble method that handles concept drift and class imbalance under an additional limitation on the availability of true labels.
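As a rough illustration of the resampling idea, the sketch below follows the spirit of OOB: per-class sizes are tracked with a time decay, and each incoming instance updates every ensemble member a Poisson-distributed number of times, with a larger rate for the currently under-represented class so that minority examples are effectively oversampled online. This is a hedged sketch, not the exact published OOB/UOB procedure; the decay constant and the rate formula are illustrative assumptions.

```python
import numpy as np

def oob_style_update(ensemble, x, y, class_size, decay=0.9, rng=None):
    """Oversampling-style online bagging update for one instance (hedged sketch,
    in the spirit of OOB but not the exact published algorithm)."""
    rng = rng or np.random.default_rng()

    # Track time-decayed class sizes so that recent data counts more.
    for c in list(class_size):
        class_size[c] *= decay
    class_size[y] = class_size.get(y, 0.0) + 1.0

    # Poisson rate: larger for the currently under-represented class, so that
    # minority examples are effectively oversampled in real time.
    lam = max(class_size.values()) / class_size[y]

    # Online bagging: each member sees the instance k ~ Poisson(lam) times.
    for clf in ensemble:
        for _ in range(rng.poisson(lam)):
            clf.partial_fit([x], [y])
    return class_size
```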
Big data mining
Data mining in big data environments faces similar challenges to stream data mining [122]: data is generated at a fast rate (Velocity) and distribution uncertainty always exists in the data, which means that handling concept drift is also crucial in big data applications. Additionally, scalability is an important consideration because, in big data environments, a data stream may arrive in very large and potentially unpredictable quantities (Volume) and cannot be processed by a single computer server. An attempt to handle concept drift in a distributed computing environment was made by [123], in which an Online Map-Reduce Drift Detection Method (OMR-DDM) was proposed, using the combined online error rate of parallel classification algorithms to identify changes in a big data stream. A recent study [124] proposed another scalable stream data mining algorithm, called Micro-Cluster Nearest Neighbor (MC-NN), based on the nearest neighbor classifier. This method extends the original micro-cluster algorithm [125] to adapt to concept drift by monitoring classification error. The micro-cluster algorithm was further extended to a parallel version using the map-reduce technique in [126] and applied to solve the label-drift classification problem, where class labels are not known in advance [127].
Active learning and semi-supervised learning
Active learning is based on the assumption that a large amount of unlabeled data is available but only a fraction of it can be labeled by human effort. This is a common situation in stream data applications, which are often also subject to the concept drift problem. [115] presented a general framework that combines active learning and concept drift adaptation. It first compares different instance-sampling strategies for labeling to guarantee that the labeling cost stays within budget and that distribution bias is prevented. A drift adaptation mechanism is then adopted, based on the DDM detection method [20]. In [128], the authors proposed a new active learning algorithm that primarily aims to avoid bias in the process of sampling instances for labeling. They also introduced a memory loss factor into the model, enabling it to adapt to concept drift.
Semi-supervised learning concerns how to use limited labeled data more efficiently by leveraging unsupervised techniques. In this scenario, additional design effort is required to handle concept drift. For example, in [129], the authors applied a Gaussian mixture model to both labeled and unlabeled data to assign labels; the model has the ability to adapt to gradual drift. Similarly, [99], [130], [131] are all cluster-based semi-supervised ensemble methods that aim to adapt to drift with limited true label data; the latter are also able to recognize recurring concepts. In [132], the authors adopted a new perspective on the label scarcity problem by considering the labeled data and the unlabeled data as two independent non-stationary data-generating processes; concept drift is handled asynchronously on the two streams. The SAND algorithm [133], [134] is another semi-supervised adaptive method, which detects concept drift on cluster boundaries. There are also studies [90], [91] that focus on adapting to concept drift in circumstances where true label data is completely unavailable.
Decision Rules
Data-driven decision support systems need to be able to adapt to concept drift in order to make accurate decisions, and decision rules are the main technique for this purpose. [102] proposed a decision rule induction algorithm, Very Fast Decision Rules (VFDR), to effectively process stream data. An extended version, Adaptive VFDR, was developed to handle concept drift by dynamically adding and removing decision rules according to their error rates, which are monitored by a drift detector. Instead of inducing rules from decision trees, [135] proposed another decision rule algorithm, based on PRISM [136], that induces rules directly from data. This algorithm is also able to adapt to drift by monitoring the performance of each rule on a sliding window of the latest data. [137] also developed an adaptive decision-making algorithm based on fuzzy rules. The algorithm includes a rule pruning procedure, which removes obsolete rules to adapt to changes, and a rule recall procedure to adapt to recurring concepts.
This section by no means attempts to cover every research field in which concept drift handling is used. There are many other studies that also consider concept drift as a dual problem. For example, [138] is a dimension reduction algorithm that separates classes based on least squares linear discriminant analysis (LSLDA), which was then extended to adapt to drift; [139] considered the concept drift problem in time series and developed an online explicit drift detection method that monitors time series features; and [140] developed an incremental scaffolding classification algorithm for complex tasks that also involve concept drift.
CONCLUSIONS: FINDINGS AND FUTURE DIRECTIONS
We summarize recent developments in concept drift research, from which the following important findings can be extracted:
- Error rate-based and data distribution-based drift detection methods still play a dominant role in concept drift detection research, while multiple hypothesis test methods have emerged in recent years;
- Regarding concept drift understanding, all drift detection methods can answer “When”, but very few methods have the ability to answer “How” and “Where”;
- Adaptive models and ensemble techniques have played an increasingly important role in recent concept drift adaptation developments. In contrast, research on retraining models with explicit drift detection has slowed;
- Most existing drift detection and adaptation algorithms assume either that ground truth labels are available right after classification/prediction or extreme verification latency. Very little research has been conducted to address unsupervised or semi-supervised drift detection and adaptation;
- Some computational intelligence techniques, such as fuzzy logic and the competence model, have been applied to concept drift;
- There is no comprehensive analysis of real-world data streams from the concept drift perspective, covering aspects such as drift occurrence time, drift severity, and drift regions;
- An increasing number of other research areas have recognized the importance of handling concept drift, especially in the big data community.
Based on these findings, we suggest four new directions in future concept drift research:
- Drift detection research should not only focus on accurately identifying drift occurrence time, but also provide information about drift severity and regions. This information could be utilized for better concept drift adaptation.
- In real-world scenarios, the cost of acquiring true labels can be high; therefore, unsupervised or semi-supervised drift detection and adaptation remain promising directions for future research.
- A framework for selecting real-world data streams should be established for evaluating learning algorithms that handle concept drift.
- Research on effectively integrating concept drift handling techniques with machine learning methodologies for data-driven applications is highly desired.
We hope this paper can provide researchers with state-of-the-art knowledge on concept drift research developments and provide guidelines about how to apply concept drift techniques in different domains to support users in various prediction and decision activities.
ACKNOWLEDGMENTS
The work presented in this paper was supported by the Australian Research Council (ARC) under discovery grant DP150101645. We sincerely thank Yiliao Song for her help in preparing the datasets and applications shown in Section 6.