Sunday, October 23, 2016

Anomaly Detection in Time Series Data

A brief survey of R packages

Abstract

The definition of anomaly embraces everything is remarkably different from what expected.

The anomaly detection is becoming more and more important as applications based on real time analytics aim to early detect anomalies in data collected as time series.

The anomalies root causes may comprise device malfunctioning, misuse of resources, unexpected overload or malicious attacks, to mention some.

Loading Data

I am going to run anomaly detection tests on the time series of the Nottingham castle temperature collected between January 1920 and December 1939, time series already introduced in some of my past posts.

That time series can be found as part of the Time Series Data Library (ref. [1]). where data can be downloaded by selecting the Export option on the left pane and then CSV file (comma separated, for example) format. The dataset file is named as mean-monthly-air-temperature-deg.csv. Last record stores a NA value.

set.seed(1023)
not_temp <- read.csv("mean-monthly-air-temperature-deg.csv", header=TRUE, sep=",", stringsAsFactors = FALSE)
# forcing column names
colnames(not_temp) <- c("Date", "Temperature")
# last row contains NA
not_temp <- not_temp[complete.cases(not_temp),]
# formatting dates
not_temp$Date <- as.Date(paste(not_temp$Date,1,sep="-"),"%Y-%m-%d")
# creating an additional anomaly
(not_temp$Temperature[19] = not_temp$Temperature[19]*1.1)
## [1] 72.93
# computing main frequency value
suppressPackageStartupMessages(library(TSA))
temp.pdgram <- TSA::periodogram(not_temp$Temperature, plot=FALSE)
max.freq <- which.max(temp.pdgram$spec)
(temp.frequency = 1/temp.pdgram$freq[max.freq])
## [1] 12
not_temp_ts <- ts(not_temp[,2], frequency=temp.frequency)
plot(not_temp_ts)

The plot above shows the time series under analysis.

Survey

I am going to take advantage of packages basic functions to verify presence of anomalies in our time series. Plots highlighting the detected anomalies will be shown. In case the package in use does not find anomalies for our time series, I will show an additional example based on synthetic data where anomalies are found.

The R packages I am going to introduce are:

  • AnomalyDetection

  • Breakout

  • ecp

  • RAD

  • bfast

  • anomalous

  • bcp

  • qcc

  • extremevalues

  • strucchange

  • changepoint

  • cpm

Let us start with the first one.

\[ \\ \]

The AnomalyDetection package

This package has been released by Twitter. The underlying algorithm is called S-H-ESD, or Seasonal Hybrid Extreme Studentized Deviates. This is an extension of a generalized ESD method by adding steps which break down data series into piecewise approximations (a hybrid method) and can account for seasonality in each submodel.

Generalized Extreme Studentized Deviates (ESD) is a well established statistical procedure for the detection of outliers where the assumption is that the inliers are normal distributed.

S-H-ESD can be used to detect both global as well as local anomalies. Take a look here for definition and examples of local and global anomalies.

In the following, I run AnomalyDetectionVec() function to find out global anomalies as the parameter longterm_period is equal to NULL.

suppressPackageStartupMessages(library(AnomalyDetection))
AnomalyDetectionVec(not_temp$Temperature, period=temp.frequency, direction='both',
                    longterm_period=NULL, plot=TRUE)
## $anoms
##   index anoms
## 1    19 72.93
## 2   110 31.30
## 
## $plot

To detect the local anomalies, I specify the longterm_period parameter equal to four times the season window length (4x12 months = 4 years).

AnomalyDetectionVec(not_temp$Temperature, period=temp.frequency, direction='both',
                    longterm_period=4*temp.frequency, plot=TRUE)
## $anoms
##    index anoms
## 1     19 72.93
## 2     43 64.20
## 3     60 43.60
## 4    110 31.30
## 5    156 41.80
## 6    180 45.80
## 7    194 35.00
## 8    203 41.60
## 9    207 38.40
## 10   215 41.40
## 
## $plot

As we can see, the number of anomalies detected have been increased by including also the ones occurring within the seasonality pattern.

More info at:

\[ \\ \]

The BreakoutDetection package

This is another package made available by Twitter.

While anomalies are point-in-time anomalous data points, breakouts are characterized by a ramp up from one steady state to another.

suppressPackageStartupMessages(library(BreakoutDetection))
breakout(not_temp$Temperature, min.size=temp.frequency/2, method='multi', degree=1, plot=TRUE, xlab = "index", ylab = "value", title = "Breakout Anomalies")
## $loc
##  [1]   9  15  21  27  33  40  46  52  58  64  70  76  82  88  94 100 106
## [18] 112 118 124 130 136 142 148 154 160 166 172 178 184 190 196 202 208
## [35] 214 221 227 233
## 
## $time
## [1] 0.02
## 
## $pval
## [1] NA
## 
## $plot

The breakout package detects breakouts and highlights them as vertical dotted lines in the resulting plot.

More info at:

\[ \\ \]

The ecp package

The ecp package provides methods for change point analysis that are able to detect any type of distributional change within a time series.

Determination of the number of change points is also addressed as estimation involves both the number and locations of change points.

The e.divisive() function is capable of detecting change in spatial distribution by taking advantage of a divisive hierarchical estimation algorithm for multiple change point analysis. As said, that is not the only function exported by the ecp package.

suppressPackageStartupMessages(library(ecp))
mat_temp <- matrix(not_temp$Temperature, nrow=length(not_temp$Temperature), ncol=1)
ecp_fit <- e.divisive(X=mat_temp, min.size=5, sig.lvl=.1, alpha=2)
plot(not_temp$Temperature, type='l')
abline(v=ecp_fit$estimates, col='red')

No anomalies in terms of distribution change have been detected. As default, the plot highlights beginning and end of the time series. Hence, I am going to show an example based on synthetic data where for sure change of distributions occur.

ts_example <- c(rnorm(100),rnorm(100,3),rnorm(100,0,2))
x1_fit <- e.divisive(X=matrix(ts_example), sig.lvl=0.05, R=199, k=NULL, min.size=30, alpha=1)
plot(ts_example, type='l')
abline(v=x1_fit$estimates, col='red')

More info at:

\[ \\ \]

The RAD package

Netflix has recently made available the RAD package as part of its Surus project.

Their algorithm uses Robust Principal Component Analysis (RPCA) to detect anomalies. PCA uses the Singular Value Decomposition (SVD) to find low rank representations of the data.

The RPCA identifies a low rank representation, random noise, and a set of outliers by repeatedly calculating the SVD and applying thresholds to the singular values and error for each iteration.

suppressPackageStartupMessages(library(RAD))
rpca_res <- AnomalyDetection.rpca(not_temp$Temperature, frequency=temp.frequency)
plot(rpca_res$time, rpca_res$X_original, type ='l', main = "Robust PCA Anomalies", xlab = "index", ylab = "value")
lines(rpca_res$time, rpca_res$X_transform)
rpca_res_an <- subset(rpca_res, abs(S_transform) > 0)
points(rpca_res_an$time, rpca_res_an$X_transform, col='red')

More info at:

\[ \\ \]

The bfast package

The bfast package is capable to identify breaks in seasonal and trend component of a time series. Seasonal breaks is a function that combines the iterative decomposition of time series into trend, seasonal and remainder components with significant break detection in the decomposed components of the time series.

suppressPackageStartupMessages(library(bfast))
bfast(not_temp_ts, h=temp.frequency/length(not_temp_ts), season="harmonic", 
      max.iter=5, type="OLS-CUSUM")
## 
##   TREND BREAKPOINTS:  None
## 
##   SEASONAL BREAKPOINTS:  None

As we can see above, no breakpoints in trend or seasonal components have been identified.

To show bfast capabilities, I am going to show the following example based on harvest time series.

(rdist <- 10/length(harvest))
## [1] 0.05025126
fit <- bfast(harvest, h=rdist, season="harmonic", max.iter=1)
plot(fit)

The plots above shows the breaks in the time series trend as identified by the bfast() function.

More info at:

\[ \\ \]

The anomalous package

Methods in this package compute a vector of features on each time series, measuring characteristics of the series. Features may include lag correlation, strength of seasonality, spectral entropy, etc. Then a robust principal component decomposition is used on the features, and various bivariate outlier detection methods are applied to the first two principal components.

suppressPackageStartupMessages(library(anomalous))
not_temp_mat <- matrix(not_temp_ts, nrow=temp.frequency, ncol=20) # a matrix is required
y <- tsmeasures(t(not_temp_mat), window = temp.frequency/2)
anomaly(y, n=2, plot=TRUE)

## $index
## [1] 7 1
## 
## $scores
##             [,1]        [,2]
##  [1,]  2.0342543 -0.20641720
##  [2,] -1.1118371 -0.31848792
##  [3,]  1.5963868  0.16726848
##  [4,] -0.6122305 -0.51308259
##  [5,] -0.4667898 -0.55901796
##  [6,] -0.4844314 -0.55211452
##  [7,] -1.4732926  3.24251228
##  [8,] -1.0136910 -0.36198522
##  [9,]  1.7226190 -0.07575058
## [10,] -1.0926325 -0.32448911
## [11,] -0.7977742 -0.44744740
## [12,]  1.6994190 -0.05098826

The resulting plot shows the PCA decomposition and highlights with the triangle symbol where anomalies have been identified. Some more work is needed in order to report such finding within the original time series plot.

More info at:

\[ \\ \]

The bcp package

The bcp() function implements an approximation to the Barry and Hartigan product partition model for the normal errors change point problem using Markov Chain Monte Carlo. This algorithm is used when there exists an unknown partition of a sequence, or sequences, into contiguous blocks such that the mean is constant within each block.

suppressPackageStartupMessages(library(bcp))
bcp_res <- bcp(not_temp$Temperature)
post_prob <- na.omit(bcp_res$posterior.prob)
max_idx <- which(post_prob > 0.8)
plot(not_temp$Temperature, type ='l')
abline(v=max_idx, col='red')

As you can see, the red vertical lines highlights the changepoints identified by the bcp() function having a posterior probability greater than a pre-determined threshold, equal to 0.8 in this case.

More info at:

\[ \\ \]

The qcc package

The qcc package makes available different types of control charts, including CUSUM and EWMA. In the following I plot an EWMA chart where anomalies are identified as observations at least 4 standard deviation far from the mean.

suppressPackageStartupMessages(library(qcc))
ewma(not_temp_ts, lambda=0.2, nsigmas=4, plot=TRUE)

## List of 15
##  $ call      : language ewma(data = not_temp_ts, lambda = 0.2, nsigmas = 4, plot = TRUE)
##  $ type      : chr "ewma"
##  $ data.name : chr "not_temp_ts"
##  $ data      : num [1:240, 1] 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ statistics: Named num [1:240] 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 ...
##   ..- attr(*, "names")= chr [1:240] "1" "2" "3" "4" ...
##  $ sizes     : int [1:240] 1 1 1 1 1 1 1 1 1 1 ...
##  $ center    : num 49.1
##  $ std.dev   : num 3.87
##  $ x         : int [1:240] 1 2 3 4 5 6 7 8 9 10 ...
##  $ y         : Named num [1:240] 47.4 46.1 45.7 45.9 47.6 ...
##   ..- attr(*, "names")= chr [1:240] "1" "2" "3" "4" ...
##  $ sigma     : num [1:240] 0.775 0.992 1.11 1.178 1.22 ...
##  $ lambda    : num 0.2
##  $ nsigmas   : num 4
##  $ limits    : num [1:240, 1:2] 46 45.1 44.6 44.4 44.2 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ violations: Named int [1:20] 19 20 21 22 50 51 52 81 110 111 ...
##   ..- attr(*, "names")= chr [1:20] "19" "20" "21" "22" ...
##  - attr(*, "class")= chr "ewma.qcc"

More info at:

\[ \\ \]

The extremevalues package

The extremevalues package can be used to identify outliers in a observations set by specifying left and right threshold quantiles (0.05 and 0.95 in this case) and the reference distribution type, (normal).

suppressPackageStartupMessages(library(extremevalues))
left_thresh <- 0.05
right_thresh <- 0.95
not_temp_outlier <- getOutliersI(not_temp$Temperature, FLim=c(left_thresh, right_thresh), distribution="normal")
(outliers_x <- sort(c(not_temp_outlier$iLeft, not_temp_outlier$iRight)))
## integer(0)
if (length(outliers_x)) {
  plot(not_temp, type='l')
  points(outliers_x, not_temp$Temperature, col='red')
  outlierPlot(not_temp$Temperature, not_temp_outlier, mode="qq")
}

No outliers have been detected by the getOutliersI() function. For an example where outliers are detected by the extremevalues package, take a look at this post.

More info at:

\[ \\ \]

The strucchange package

The strucchange package is used to detect structural changes by several techniques including EFP (Empirical Fluctuation Process) CUSUM Tests, Chow Test and Fstats test.

In the following I am running a Fstats() test to identify structural changes. The Fstats() test extends the basic idea of the Chow Test when the breakpoint is not known in advance.

suppressPackageStartupMessages(library(strucchange))
not_temp.efp.Fstats <- Fstats(not_temp_ts ~ 1,  from=0.05, to=0.95)
sctest(not_temp.efp.Fstats, type="supF")
## 
##  supF test
## 
## data:  not_temp.efp.Fstats
## sup.F = 1.5644, p-value = 0.984

No structural changes have been detected. For an example where structural changes are detected by this package, take a look at this post.

More info at:

\[ \\ \]

The changepoint package

The changepoint package provides search algorithms for multiple changepoint detection in addition to a variety of test statistics. Specifically methods are implemented for the change in mean and/or variance settings.

suppressPackageStartupMessages(library(changepoint))
plot(cpt.mean(not_temp$Temperature, penalty="SIC", method="AMOC", class=TRUE))

A spurius change in mean at the beginning ot the time series has been detected. Actually, it looks to me a false positive. A more explanatory example follows.

y_ts <- c(rnorm(100,0,1), rnorm(100,2,1))
plot(cpt.mean(y_ts, penalty="SIC", method="AMOC", class=TRUE))

Red horizontal lines highlight two different mean levels and the shift in between them.

More info at:

\[ \\ \]

The cpm package

The CPM framework is an approach to sequential change detection (also known as Phase II process monitoring) which allows standard statistical hypothesis tests to be deployed sequentially.

The main two general purpose functions in the package are detectChangePoint() and processStream() for detecting single and multiple change points respectively.

There are available several Change Point Model (CPM) types. In the following example, I am taking advantage of the Student CPM type. Further, the ARL0 parameter corresponds to the average number of observations before a false positive occurs, assuming the sequence does not undergo a change. The startup parameter is the number of observations after which monitoring begins.

suppressPackageStartupMessages(library(cpm))
change_detected <- processStream(not_temp$Temperature, cpmType = "Student", ARL0=100, startup=temp.frequency)
plot(not_temp$Temperature, type ='l')
abline(v=change_detected$changePoints, col='red')

Remarkably, the cpm package exports functionalities for online streaming anomaly detection, see processObservation() function description.

More info at:

\[ \\ \]

Conclusions

There are many R packages providing with functionalities capable of detecting anomalies in time series data.

In this post I introduced some of the packages exporting anomaly detection functionalities, together with examples to outline basic results.

Actually, all those packages deserve in-depth study to fully understand their own capabilities in exhaustive way.

\[ \\ \]

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.