Sunday, November 6, 2016

Financial products with capital protection barrier - part 4

Outliers Detection

Abstract

In this post, I investigate the presence of outliers in our set of observations. If outliers are present, I will try to characterize how they occur over time.

An outlier is an observation whose value is abnormally distant from the typical observations. It is important to identify outliers and determine how often they recur, since they may carry important information about the underlying process. In our scenario, the outliers are neither deleted nor recomputed, as they are not the result of measurement errors.

Analysis

I take advantage of the getOutliersI() function provided by the extremevalues package, assuming a normal model distribution and passing the [0.1%, 99.9%] quantile interval through its FLim argument. The intent is to flag the values lying in the extreme tails. Note that, according to the package documentation, FLim selects the quantile range of the data used to fit the model distribution, while the outlier limits themselves are governed by the rho argument, which is left at its default here.

To this end, I define the left and right thresholds and call getOutliersI(), whose return value stores the indexes of the left and right outliers in the time series. I then show a scatter plot of the original time series with the outliers highlighted in red, together with the Q-Q outlier plot produced by outlierPlot() from the same package.

load(file="structured-product-2.RData")
invisible(lapply(ts.package, function(x) {
  suppressPackageStartupMessages(library(x, character.only=TRUE)) }))

set.seed(1023)

GSPC_log_returns <- as.vector(coredata(GSPC_log_returns))
left_thresh <- 0.001 
right_thresh <- 0.999
GSPC_outlier <- getOutliersI(as.vector(GSPC_log_returns), FLim=c(left_thresh, right_thresh), distribution="normal")
(outliers_x <- sort(c(GSPC_outlier$iLeft, GSPC_outlier$iRight)))
## [1]   1  93 116 212 425 467 481 507 543
par(mfrow=c(1,2))
plot(GSPC_log_returns)
points(outliers_x, GSPC_log_returns[outliers_x], col='red')
outlierPlot(GSPC_log_returns, GSPC_outlier, mode="qq")

The bar plot of the outlier inter-arrival times, measured as the number of observations between consecutive outliers, can help in guessing a candidate distribution for the empirical one.

outliers_ia <- sort(na.omit(diff(outliers_x)), decreasing = TRUE)
barplot(outliers_ia, col = rgb(1, 0, 0, alpha = 0.5))

A first qualitative evaluation suggests that the outlier inter-arrival times might follow an exponential distribution. In the following, I fit an exponential distribution and report its estimated rate \(\lambda\) together with an approximate confidence interval of plus or minus two standard errors.

fitexp <- fitdist(outliers_ia, "exp")
fit.boundaries <- c(fitexp$estimate - 2*fitexp$sd, 
                    fitexp$estimate, 
                    fitexp$estimate + 2*fitexp$sd)
lambda.est <- data.frame("fit.boundaries" = fit.boundaries)
rownames(lambda.est) <- c("lower-bound", "expected-value", "upper-bound")
kable(lambda.est, align='l')
                 fit.boundaries
lower-bound      0.0043713
expected-value   0.0147601
upper-bound      0.0251490
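
As a cross-check, the rate can be recomputed by hand. For an exponential distribution, the maximum likelihood estimate of \(\lambda\) is the reciprocal of the sample mean, and its asymptotic standard error is the estimate divided by the square root of the sample size. The sketch below uses only base R and the outlier indexes printed earlier in this post; the variable names are illustrative, and the standard error may differ slightly from the numerically derived one reported by fitdist().

```r
# Outlier indexes as printed earlier in this post
outliers_x <- c(1, 93, 116, 212, 425, 467, 481, 507, 543)

# Inter-arrival times, measured in number of observations
ia <- diff(outliers_x)

# Exponential MLE: lambda_hat = 1 / mean(ia)
lambda_hat <- 1 / mean(ia)

# Asymptotic standard error of the MLE: lambda_hat / sqrt(n)
se_hat <- lambda_hat / sqrt(length(ia))

# Approximate interval of plus or minus two standard errors
round(c(lower    = lambda_hat - 2 * se_hat,
        estimate = lambda_hat,
        upper    = lambda_hat + 2 * se_hat), 7)
```

The point estimate should agree with the expected-value row above, since the eight inter-arrival times sum to 542 and 8/542 is approximately 0.01476.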

The confidence interval does not include zero, so the estimated rate is significantly different from zero. Further, I run the Kolmogorov-Smirnov test against the null hypothesis that the inter-arrival times are drawn from an exponential distribution with the fitted rate.

ks.test(outliers_ia, "pexp", fitexp$estimate, alternative="two.sided")
## 
## Results of Hypothesis Test
## --------------------------
## 
## Alternative Hypothesis:          two-sided
## 
## Test Name:                       One-sample Kolmogorov-Smirnov test
## 
## Data:                            outliers_ia
## 
## Test Statistic:                  D = 0.1866893
## 
## P-value:                         0.898147
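
To make the test statistic less of a black box, here is a minimal base R sketch that recomputes D by hand as the largest vertical distance between the empirical distribution function and the fitted exponential one. It rebuilds the inter-arrival times from the outlier indexes printed earlier; the names F_fit, D_manual and D_builtin are illustrative, and the rate is taken as the reciprocal of the sample mean rather than read from the fitdist() object.

```r
# Outlier indexes as printed earlier in this post
outliers_x <- c(1, 93, 116, 212, 425, 467, 481, 507, 543)

# Sorted inter-arrival times and exponential rate estimate
ia <- sort(diff(outliers_x))
n <- length(ia)
lambda_hat <- 1 / mean(ia)

# Fitted exponential CDF evaluated at the sorted observations
F_fit <- pexp(ia, rate = lambda_hat)

# Largest gap between the empirical step function and the fitted CDF,
# checked just after and just before each jump of the step function
i <- seq_len(n)
D_manual <- max(pmax(i / n - F_fit, F_fit - (i - 1) / n))

# Same statistic from the built-in one-sample test
D_builtin <- unname(ks.test(ia, "pexp", lambda_hat)$statistic)

c(manual = D_manual, builtin = D_builtin)
```

Both values should be close to the D reported above, with the largest gap occurring at the smallest inter-arrival time.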

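The p-value is large, so we cannot reject the null hypothesis. Accordingly, we may consider the outlier inter-arrival times as drawn from an exponential distribution with \(\lambda\) equal to 0.0147601. With only eight inter-arrival times, and a rate estimated from the same data, this should be read as a lack of evidence against the exponential model rather than as proof of it.

An overlapping bar plot of the observed inter-arrival times and of a random sample drawn from the fitted exponential distribution gives further qualitative confirmation of this finding.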

expdist <- sort(rexp(50, fitexp$estimate), decreasing = TRUE)
barplot(outliers_ia, col = rgb(1, 0, 0, alpha=0.2))
par(new = TRUE)
barplot(expdist, axes = FALSE, col = rgb(0, 0, 1, alpha=0.2))

To further characterize the set of outliers, I look separately at the left-sided and right-sided outliers.

(outliers_l <- GSPC_outlier$iLeft)
## [1]   1 116 212 425 507 543
(outliers_r <- GSPC_outlier$iRight)
## [1]  93 467 481
# outliers left/right sided sequence
(outliers_sequence <- sapply(outliers_x, function(x) { ifelse(x %in% outliers_l, "L", "R")}))
## [1] "L" "R" "L" "L" "L" "R" "R" "L" "L"
# fraction of left-sided outliers
(perc_l <- length(outliers_l)/length(outliers_x))
## [1] 0.6666667
# fraction of right-sided outliers
(perc_r <- length(outliers_r)/length(outliers_x))
## [1] 0.3333333
# ratio of the mean left-sided outlier value to the mean of the non-outlier observations
(magn_avg_perc_l <- mean(GSPC_log_returns[outliers_l])/mean(GSPC_log_returns[-outliers_x]))
## [1] -48.79481
# ratio of the mean right-sided outlier value to the mean of the non-outlier observations
(magn_avg_perc_r <- mean(GSPC_log_returns[outliers_r])/mean(GSPC_log_returns[-outliers_x]))
## [1] 43.06774
# ratio of the maximum left-sided outlier value to the maximum non-outlier observation
# (left-sided outliers are negative, so max() picks the least extreme of them)
(magn_max_perc_l <- max(GSPC_log_returns[outliers_l])/max(GSPC_log_returns[-outliers_x]))
## [1] -1.117827
# ratio of the maximum right-sided outlier value to the maximum non-outlier observation
(magn_max_perc_r <- max(GSPC_log_returns[outliers_r])/max(GSPC_log_returns[-outliers_x]))
## [1] 1.9691

There are 9 outliers in total: 6 below the left threshold (daily losses) and 3 above the right threshold (daily gains).

The computed sequence {L, R, L, L, L, R, R, L, L} shows how often each side occurs and how long the runs of consecutive outliers on the same side are.

In absolute value, the average magnitude ratios computed above are similar for left and right outliers (about 48.8 and 43.1), and the maximum-based ratios (about 1.12 and 1.97) are of the same order.

Splitting the analysis by outlier side shows some noticeable differences. Note that the summaries below describe the positions (indexes) of the outliers on each side, while the box plot compares their inter-arrival times.
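
The run structure of this sequence can be made explicit with base R's rle() function. The following is a small self-contained sketch that re-enters the sequence by hand; the name runs is illustrative.

```r
# Left/right sequence as printed earlier in this post
outliers_sequence <- c("L", "R", "L", "L", "L", "R", "R", "L", "L")

# Run-length encoding: consecutive outliers on the same side are grouped
runs <- rle(outliers_sequence)
data.frame(side = runs$values, run_length = runs$lengths)

# Longest run observed for each side
tapply(runs$lengths, runs$values, max)
```

The longest run is three consecutive left-sided outliers, against two consecutive right-sided ones.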

outliers_ia_l <- na.omit(diff(outliers_l))
outliers_ia_r <- na.omit(diff(outliers_r))
summary(outliers_l)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   140.0   318.5   300.7   486.5   543.0
summary(outliers_r)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      93     280     467     347     474     481
boxplot(outliers_ia_l, outliers_ia_r, col="lightgreen")
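
Since summary() is applied above to the outlier positions, the following sketch summarizes the per-side inter-arrival times themselves. It re-enters the left and right indexes printed earlier so that it runs on its own; the names ia_l and ia_r are illustrative. With only five left-sided and two right-sided gaps, these figures are indicative at best.

```r
# Left and right outlier indexes as printed earlier in this post
outliers_l <- c(1, 116, 212, 425, 507, 543)
outliers_r <- c(93, 467, 481)

# Inter-arrival times within each side, in number of observations
ia_l <- diff(outliers_l)
ia_r <- diff(outliers_r)

summary(ia_l)
summary(ia_r)
```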

Finally, I update the saved environment with the outlier results.

save(ts.package, GSPC_log_returns, GSPC_AdjClose, outliers_l, outliers_r, file="structured-product-3.RData")

Conclusions

We determined the presence of outliers in our time series and were able to fit their inter-arrival times with an exponential distribution. Further, we analyzed the time sequence of left-sided versus right-sided outliers.

This is an important result, as it allows us to simulate outlier observations.
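
As a rough illustration of what such a simulation could look like, the sketch below draws outlier positions from the fitted exponential inter-arrival distribution and assigns each one a side. It is not the method used in the following posts. The horizon n_obs is a hypothetical value (set here to the last outlier index in this post), the rounding of gaps to whole observations is a simplification, and drawing the sides independently with the observed 6/9 share is an assumption that ignores the runs seen above.

```r
set.seed(1023)  # arbitrary seed, only to make this sketch reproducible

lambda_hat <- 0.0147601  # fitted rate from this post
p_left     <- 6 / 9      # observed share of left-sided outliers
n_obs      <- 543        # hypothetical horizon, in observations

# Draw more inter-arrival times than needed, rounded up to whole observations
gaps <- ceiling(rexp(100, rate = lambda_hat))

# Cumulate the gaps into positions and keep those within the horizon
sim_idx <- cumsum(gaps)
sim_idx <- sim_idx[sim_idx <= n_obs]

# Assign a side to each simulated outlier (independence is an assumption)
sim_side <- sample(c("L", "R"), length(sim_idx), replace = TRUE,
                   prob = c(p_left, 1 - p_left))

data.frame(index = sim_idx, side = sim_side)
```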
