Sunday, January 25, 2015

Temperature monitoring and time series analysis (part 3)

Harmonic Seasonal Model

Abstract

In this post, I am going to analyse the underlying structure of the Nottingham Castle temperature time series, collected between January 1920 and December 1939 and already introduced in my previous posts. Specifically, my intention is to build a harmonic seasonal model of the original time series, as obtained by a linear regression fit based on the significant harmonics. The time series remainder is then analyzed to spot any further structure.

Analysis

I am going to analyze the dataset reporting the mean monthly air temperature at Nottingham Castle from January 1920 up to December 1939. The time series can be found at ref. [1] as part of the Time Series Data Library (ref. [2]).

The data has been downloaded to the local file system by selecting the Export option on the left pane and then the CSV file (comma separated) format. The dataset file is named mean-monthly-air-temperature-deg.csv.

suppressPackageStartupMessages(library(FitARMA))
suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(timeSeries))
suppressPackageStartupMessages(library(TSA))
suppressPackageStartupMessages(library(zoo))
invisible(Sys.setlocale("LC_TIME", "english"))
set.seed(1023)

mean.monthly.temp <- read.csv("mean-monthly-air-temperature-deg.csv", header=TRUE, stringsAsFactors = FALSE)
colnames(mean.monthly.temp) <- c("Date", "Temperature")
mmt <- mean.monthly.temp[complete.cases(mean.monthly.temp),]

As you can see above, I also cut out the last record, as it reports just a description string without any associated temperature value.

Our dataset reports monthly data, so the frequency to be used in building the time series should be 12 (see ref. [3]). Nevertheless, I want to verify that by means of the periodogram() function made available by the TSA package. This can be useful whenever you are uncertain about the frequency value to be specified.

temp.pdgram <- TSA::periodogram(mmt$Temperature, plot=TRUE)

(max.freq <- which.max(temp.pdgram$spec))
## [1] 20
(temp.frequency = 1/temp.pdgram$freq[max.freq])
## [1] 12
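
If you want to double-check more than the single dominant peak, the implied period of each of the top spectral frequencies can be listed as well. A minimal sketch (the ord helper name is mine, not part of the original code):

# list the three largest periodogram peaks together with their implied periods
ord <- order(temp.pdgram$spec, decreasing=TRUE)[1:3]
data.frame(freq = temp.pdgram$freq[ord],
           period = 1/temp.pdgram$freq[ord],
           spec = temp.pdgram$spec[ord])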

As we can see, the frequency is confirmed to be equal to 12. I then create the ts time series object and plot it.

mean.monthly.ts <- ts(mmt$Temperature, start=c(1920,1), end=c(1939,12), frequency=temp.frequency)
tt <- time(mean.monthly.ts)
plot(mean.monthly.ts)
grid(col='blue',lty="dotted")

In my previous posts, I outlined the exploratory analysis of this dataset and also verified that there are no outliers or structural changes.

The following function is useful to collect some basic statistics together with stationarity and normality tests.

compute_tests <- function(timeserie) {
   summary <- summary(timeserie)
   sd <- sd(timeserie)
   skew <- TSA::skewness(timeserie)
   kurt <- TSA::kurtosis(timeserie)
   adf.test <- tseries::adf.test(timeserie) # test for stationarity
   jb.test <- tseries::jarque.bera.test(timeserie) # test for normality
   lillie.test <- nortest::lillie.test(timeserie) # test for normality
   
   list("summary" = summary,
        "sd" = sd,
        "skew" = skew, 
        "kurtosis" = kurt, 
        "adf" = adf.test,
        "jarque" = jb.test, 
        "lillie" = lillie.test)
}

compute_tests(mean.monthly.ts)
## $summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   31.30   41.55   47.35   49.04   57.00   66.50 
## 
## $sd
## [1] 8.569705
## 
## $skew
## [1] 0.1819617
## 
## $kurtosis
## [1] -1.228793
## 
## $adf
## 
##  Augmented Dickey-Fuller Test
## 
## data:  timeserie
## Dickey-Fuller = -13, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
## 
## 
## $jarque
## 
##  Jarque Bera Test
## 
## data:  timeserie
## X-squared = 16.424, df = 2, p-value = 0.0002714
## 
## 
## $lillie
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  timeserie
## D = 0.10541, p-value = 8.49e-07

The time series shows:

  • a positive skew (long tail to the right) equal to 0.1819617

  • negative kurtosis (flat distribution) equal to -1.2287926

  • the unit root (non-stationarity) null hypothesis is rejected, based on the p-value equal to 0.01 resulting from the Augmented Dickey-Fuller test

  • the normality hypothesis is rejected, based on the p-value equal to \(2.714161 \times 10^{-4}\) resulting from the Jarque-Bera test

  • the normality hypothesis is rejected, based on the p-value equal to \(8.4896452 \times 10^{-7}\) resulting from the Lilliefors test

Further verification of the non-normality of the data is possible via the Q-Q plot shown below.

qqnorm(mean.monthly.ts)
qqline(mean.monthly.ts)

I try out a linear regression of the time series observations against their timeline to spot whether any linear trend component is present.

trend.lm <- lm(coredata(mean.monthly.ts) ~ index(mean.monthly.ts))
summary(trend.lm)
## 
## Call:
## lm(formula = coredata(mean.monthly.ts) ~ index(mean.monthly.ts))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.677  -7.576  -1.623   8.025  17.879 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            -92.50857  185.07678  -0.500    0.618
## index(mean.monthly.ts)   0.07334    0.09590   0.765    0.445
## 
## Residual standard error: 8.577 on 238 degrees of freedom
## Multiple R-squared:  0.002452,   Adjusted R-squared:  -0.00174 
## F-statistic: 0.5849 on 1 and 238 DF,  p-value: 0.4451

As we can see above, the linear regression coefficients are not significant, hence I do not apply any detrending to the original time series.

I then take advantage of the TSA package harmonic() function to reveal the internal structure of the seasonal component in terms of harmonics (ref. [4]).

h <- harmonic(mean.monthly.ts, m=6)
Temperature <- as.vector(mean.monthly.ts)
mean.monthly.lm <- lm(Temperature ~ h) 
(sum <- summary(mean.monthly.lm))
## 
## Call:
## lm(formula = Temperature ~ h)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8900 -1.3688  0.2825  1.4050  6.2800 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    49.04125    0.14931 328.449  < 2e-16 ***
## hcos(2*pi*t)  -11.46977    0.21116 -54.318  < 2e-16 ***
## hcos(4*pi*t)    1.25875    0.21116   5.961 9.44e-09 ***
## hcos(6*pi*t)    0.34333    0.21116   1.626 0.105342    
## hcos(8*pi*t)    0.30875    0.21116   1.462 0.145071    
## hcos(10*pi*t)   0.03394    0.21116   0.161 0.872464    
## hcos(12*pi*t)   0.19875    0.14931   1.331 0.184483    
## hsin(2*pi*t)   -1.39115    0.21116  -6.588 3.04e-10 ***
## hsin(4*pi*t)    0.81767    0.21116   3.872 0.000141 ***
## hsin(6*pi*t)    0.06583    0.21116   0.312 0.755499    
## hsin(8*pi*t)   -0.19702    0.21116  -0.933 0.351783    
## hsin(10*pi*t)  -0.14552    0.21116  -0.689 0.491439    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.313 on 228 degrees of freedom
## Multiple R-squared:  0.9305, Adjusted R-squared:  0.9271 
## F-statistic: 277.5 on 11 and 228 DF,  p-value: < 2.2e-16
(coef.sig <- subset(sum$coefficients, sum$coefficients[,4]< 0.05))
##                 Estimate Std. Error    t value      Pr(>|t|)
## (Intercept)   49.0412500  0.1493117 328.448713 4.660057e-307
## hcos(2*pi*t) -11.4697687  0.2111587 -54.318241 1.950748e-132
## hcos(4*pi*t)   1.2587500  0.2111587   5.961156  9.443756e-09
## hsin(2*pi*t)  -1.3911499  0.2111587  -6.588173  3.044000e-10
## hsin(4*pi*t)   0.8176723  0.2111587   3.872312  1.408545e-04

As you can see above, I filtered the coefficients in order to retain only the significant harmonic components.
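
As a sanity check, the same coefficients can be recovered by regressing directly on the significant harmonics built by hand; since the cos/sin regressors are orthogonal over whole years, the estimates should closely match coef.sig. A minimal sketch (tv and lm.sig are my own names, not part of the original code):

# rebuild the four significant harmonics explicitly and refit
tv <- as.numeric(tt)
lm.sig <- lm(Temperature ~ cos(2*pi*tv) + cos(4*pi*tv) + sin(2*pi*tv) + sin(4*pi*tv))
coef(lm.sig)  # expected to closely match coef.sig[,1]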

Based on such resulting coefficients, I write down a function capable of building an estimate of the original time series as a function of the associated harmonics.

temp.estimate <- function(coef, time) {
  omega <- pi*time   # time is in fractional years, so cos(2*omega) has a one-year period
  y <- coef[1]+coef[2]*cos(2*omega)+coef[3]*cos(4*omega)+
               coef[4]*sin(2*omega)+coef[5]*sin(4*omega)
  y
}

coef <- coef.sig[,1]
mean.monthly.harm.ts <- temp.estimate(coef, tt)

Model #1 can be so outlined:

Model #1:

\[ y_{t} = 49.04125 - 11.47 \cos(2 \omega t) + 1.259 \cos(4 \omega t) - 1.391 \sin(2 \omega t) + 0.818 \sin(4 \omega t) + z_{t}, \qquad \omega = \pi \]

where \(z_{t}\) is the remainder that will be analysed later.
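
As a quick consistency check, one can compare the reduced reconstruction (significant harmonics only) with the full 11-term regression fit; the two should differ only by the small non-significant terms. A minimal sketch:

# compare the reduced harmonic reconstruction with the full regression fit
summary(as.numeric(fitted(mean.monthly.lm)) - as.numeric(mean.monthly.harm.ts))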

Finally, I plot the original time series against its harmonics-based fit.

fit.ts.1 <- mean.monthly.harm.ts
merge.ts.1 <- merge(as.zoo(fit.ts.1), as.zoo(mean.monthly.ts))
plot(merge.ts.1, screen=1, lty=c("dotted", "solid"),col= c("blue", "red"), ylim=c(30,75))
legend("topleft", c("fit", "original"), lty=c("dotted", "solid"), col= c("blue", "red"), bty='n')
grid(col='blue',lty="dotted")

The visual analysis shows a good fit of the original time series. Further analysis of the remainder time series follows.

remainder.ts.1 <- mean.monthly.ts - mean.monthly.harm.ts
plot(remainder.ts.1)
grid(col='blue',lty="dotted")

par(mfrow=c(1,2))
acf(remainder.ts.1, lag.max=24)
pacf(remainder.ts.1, lag.max=24)

LjungBoxTest(remainder.ts.1, lag.max=24, k=5)
##   m    Qm     pvalue
##   1  4.78 0.02886050
##   2  5.33 0.02098029
##   3  5.48 0.01920029
##   4  5.49 0.01916905
##   5  5.49 0.01908138
##   6  6.32 0.01194263
##   7  7.12 0.02848136
##   8  7.66 0.05365280
##   9  9.24 0.05528377
##  10 10.11 0.07227983
##  11 12.53 0.05110883
##  12 13.54 0.05991187
##  13 15.15 0.05637693
##  14 15.50 0.07810880
##  15 15.51 0.11449027
##  16 16.44 0.12568219
##  17 16.44 0.17178260
##  18 17.63 0.17226731
##  19 17.84 0.21394919
##  20 18.33 0.24562381
##  21 20.08 0.21664471
##  22 20.08 0.27016217
##  23 28.48 0.05505828
##  24 33.77 0.01956594
compute_tests(remainder.ts.1)
## $summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -8.4500 -1.3850  0.2702  0.0000  1.4020  6.0750 
## 
## $sd
## [1] 2.298671
## 
## $skew
## [1] -0.3398661
## 
## $kurtosis
## [1] 0.6192687
## 
## $adf
## 
##  Augmented Dickey-Fuller Test
## 
## data:  timeserie
## Dickey-Fuller = -5.3022, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
## 
## 
## $jarque
## 
##  Jarque Bera Test
## 
## data:  timeserie
## X-squared = 8.4553, df = 2, p-value = 0.01459
## 
## 
## $lillie
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  timeserie
## D = 0.062906, p-value = 0.02237
qqnorm(remainder.ts.1)
qqline(remainder.ts.1)

The remainder time series, obtained by subtracting the harmonic regression fit from the original one, shows significant correlations and fails the normality tests. I then try to capture further structure in it by fitting an ARIMA(1,0,0)(1,0,1)[12] model.

remainder.ar <- Arima(remainder.ts.1, order=c(1,0,0), include.mean = FALSE, seasonal=list(order=c(1,0,1), period=12))
remainder.ar
## Series: remainder.ts.1 
## ARIMA(1,0,0)(1,0,1)[12] with zero mean     
## 
## Coefficients:
##          ar1     sar1    sma1
##       0.2306  -0.6401  0.5067
## s.e.  0.0632   0.1779  0.1922
## 
## sigma^2 estimated as 4.93:  log likelihood=-530.73
## AIC=1069.47   AICc=1069.64   BIC=1083.39
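
The seasonal order above is a manual choice driven by the ACF/PACF plots. As an optional cross-check, the automatic order search provided by the forecast package can be run on the same remainder; it may or may not select exactly the same orders, so treat this as a sketch rather than a confirmation:

# automatic order search restricted to stationary seasonal models, for comparison
forecast::auto.arima(remainder.ts.1, stationary=TRUE, seasonal=TRUE)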

I plot the residuals and run statistical tests to check whether they represent white noise or still retain some relevant structure.

resid <- remainder.ar$resid
plot(resid)
grid(col='blue',lty="dotted")

par(mfrow=(c(1,2)))
acf(resid, lag.max=24) 
pacf(resid, lag.max=24) 

LjungBoxTest(resid, lag.max=24, k=3)
##   m    Qm    pvalue
##   1  1.67 0.1959527
##   2  2.07 0.1501621
##   3  2.46 0.1164644
##   4  2.68 0.1018047
##   5  2.72 0.2571382
##   6  3.33 0.3430408
##   7  3.73 0.4443558
##   8  4.73 0.4496285
##   9  5.79 0.4467038
##  10  6.26 0.5093993
##  11  6.80 0.5580042
##  12  7.75 0.5596922
##  13  9.69 0.4680232
##  14  9.71 0.5569683
##  15 10.13 0.6043345
##  16 10.69 0.6366845
##  17 10.93 0.6912904
##  18 12.80 0.6176318
##  19 12.81 0.6867983
##  20 12.95 0.7392751
##  21 13.66 0.7508649
##  22 14.68 0.7428701
##  23 19.15 0.5119692
##  24 21.00 0.4588031
compute_tests(resid)
## $summary
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -7.21300 -1.42300  0.08996 -0.00107  1.35300  5.57700 
## 
## $sd
## [1] 2.210971
## 
## $skew
## [1] -0.1460784
## 
## $kurtosis
## [1] 0.2297553
## 
## $adf
## 
##  Augmented Dickey-Fuller Test
## 
## data:  timeserie
## Dickey-Fuller = -5.583, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
## 
## 
## $jarque
## 
##  Jarque Bera Test
## 
## data:  timeserie
## X-squared = 1.3814, df = 2, p-value = 0.5012
## 
## 
## $lillie
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  timeserie
## D = 0.039518, p-value = 0.4793
qqnorm(resid)
qqline(resid)
McLeod.Li.test(y=resid, plot=TRUE, main="McLeod-Li test")

The partial autocorrelation plot and the Ljung-Box test p-values confirm that the seasonal ARIMA(1,0,0)(1,0,1)[12] residuals are not correlated. The correlation at lag 24 shown in the PACF plot is not significant, based on its corresponding Ljung-Box test p-value.

Further:

  • we cannot reject the null hypothesis that the residuals are normally distributed, as a result of the Jarque-Bera and Lilliefors tests

  • we cannot reject the null hypothesis of homoscedasticity of the residuals, as a result of the McLeod-Li test for heteroscedasticity (see the plot above, on the right)

As a conclusion, we determined that such residuals are white noise, and the second model can be outlined as follows:

Model #2:

\[ \begin{cases} y_{t} = 49.04125 - 11.47 \cos(2 \omega t) + 1.259 \cos(4 \omega t) - 1.391 \sin(2 \omega t) + 0.818 \sin(4 \omega t) + z_{t} \\ \\ (1 - 0.231 B)(1 + 0.640 B^{12})\, z_{t} = (1 + 0.507 B^{12})\, \epsilon_{t} \\ \\ \epsilon_{t} \sim WN(0,\ 4.93) \end{cases} \]

where \(B\) is the backshift operator and \(\omega = \pi\); note that the fitted seasonal AR coefficient is -0.640, hence the plus sign in the seasonal AR polynomial.
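
To make the backshift notation concrete, expanding the polynomials gives \(z_{t} = 0.231 z_{t-1} - 0.640 z_{t-12} + 0.148 z_{t-13} + \epsilon_{t} + 0.507 \epsilon_{t-12}\). A minimal simulation sketch of this recursion (variable names are mine, not from the original code):

# simulate z_t from the expanded ARIMA(1,0,0)(1,0,1)[12] recursion
phi <- 0.2306; Phi <- -0.6401; Theta <- 0.5067
n <- 240
eps <- rnorm(n, sd=sqrt(4.93))
z <- numeric(n)
for (i in 14:n) {
  z[i] <- phi*z[i-1] + Phi*z[i-12] - phi*Phi*z[i-13] + eps[i] + Theta*eps[i-12]
}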

I then simulate the ARIMA(1,0,0)(1,0,1)[12] process \(z_{t}\) and include its result in a new time series fit, together with the harmonic components.

In doing that, I take advantage of the simulate() method made available by the forecast package.

l <- length(remainder.ts.1)
res.sim.ts <- simulate(remainder.ar, future=FALSE, nsim=l, seed=1023)  # dispatches to forecast's simulate method for Arima objects

Hence, the second model fit for our time series is:

fit.ts.2 <- mean.monthly.harm.ts + res.sim.ts
merge.ts.2 <- merge(as.zoo(fit.ts.2), as.zoo(mean.monthly.ts))

plot(merge.ts.2, screen=1, lty=c("dotted", "solid"), col= c("blue", "red"), ylim=c(30,75))
legend("topleft", c("fit", "original"), lty=c("dotted", "solid"), col= c("blue", "red"), bty='n')
grid(col='blue',lty="dotted")

Let us compare the remainders associated with the two time series fits.

remainder.ts.2 <- mean.monthly.ts - fit.ts.2

boxplot(remainder.ts.1, remainder.ts.2)

compute_tests(remainder.ts.1)
## $summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -8.4500 -1.3850  0.2702  0.0000  1.4020  6.0750 
## 
## $sd
## [1] 2.298671
## 
## $skew
## [1] -0.3398661
## 
## $kurtosis
## [1] 0.6192687
## 
## $adf
## 
##  Augmented Dickey-Fuller Test
## 
## data:  timeserie
## Dickey-Fuller = -5.3022, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
## 
## 
## $jarque
## 
##  Jarque Bera Test
## 
## data:  timeserie
## X-squared = 8.4553, df = 2, p-value = 0.01459
## 
## 
## $lillie
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  timeserie
## D = 0.062906, p-value = 0.02237
compute_tests(remainder.ts.2)
## $summary
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -9.26500 -2.16300  0.06787 -0.03921  2.44600  7.80100 
## 
## $sd
## [1] 3.315876
## 
## $skew
## [1] -0.2801247
## 
## $kurtosis
## [1] -0.3141876
## 
## $adf
## 
##  Augmented Dickey-Fuller Test
## 
## data:  timeserie
## Dickey-Fuller = -5.6277, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
## 
## 
## $jarque
## 
##  Jarque Bera Test
## 
## data:  timeserie
## X-squared = 4.1259, df = 2, p-value = 0.1271
## 
## 
## $lillie
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  timeserie
## D = 0.055116, p-value = 0.07423
par(mfrow=(c(1,2)))

acf(remainder.ts.2, lag.max=24)
pacf(remainder.ts.2, lag.max=24)

LjungBoxTest(remainder.ts.2, lag.max=24, k=5+3)
##   m    Qm       pvalue
##   1  0.15 0.6950186131
##   2  0.93 0.3354828418
##   3  0.94 0.3311026890
##   4  1.20 0.2741437705
##   5  1.38 0.2396999431
##   6  1.46 0.2265607188
##   7  4.22 0.0398327703
##   8  4.65 0.0311154273
##   9  5.10 0.0238695061
##  10 12.70 0.0017502656
##  11 19.17 0.0002523176
##  12 20.08 0.0004807345
##  13 23.21 0.0003083374
##  14 23.67 0.0006005851
##  15 23.81 0.0012324721
##  16 24.08 0.0022199052
##  17 24.54 0.0035234148
##  18 25.21 0.0049625757
##  19 25.25 0.0083835267
##  20 26.48 0.0091874676
##  21 28.22 0.0084241904
##  22 30.23 0.0071012493
##  23 40.57 0.0003718016
##  24 40.74 0.0006065220
McLeod.Li.test(y=remainder.ts.2, plot=TRUE, main="McLeod-Li test")

Such a comparison reveals that the second model's remainder time series (which includes the simulated ARIMA-based component) passes the normality tests, while having a higher standard deviation than the first model's remainder.

The first model, on the other hand, is not capable of capturing the additional structure still present in its remainder component.

Note that the second model's final remainder still shows some significant autocorrelation at lags {10, 11, 24} and heteroscedasticity. However, since model #2 residuals \(\epsilon_{t}\) are white noise, that remaining autocorrelation and heteroscedasticity appear more likely to be a consequence of the specific simulation run rather than a modeling issue, as the check sketched below suggests.
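
One way to support that claim is to regenerate the remainder under several simulation seeds and look at how the suspicious autocorrelations vary; if they change substantially from seed to seed, they are artifacts of the particular draw. A minimal sketch (the sims helper and index positions are my own, not from the original code):

# remainder autocorrelations at lags 10, 11 and 24 across 20 simulation seeds
sims <- sapply(1:20, function(s) {
  z <- simulate(remainder.ar, future=FALSE, nsim=l, seed=s)
  r <- mean.monthly.ts - (mean.monthly.harm.ts + z)
  acf(r, lag.max=24, plot=FALSE)$acf[c(11, 12, 25)]  # positions of lags 10, 11, 24
})
apply(sims, 1, range)  # spread of each autocorrelation across seeds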

Although intended for forecast accuracy evaluation, the accuracy() function provided by the forecast package is also useful to obtain further metrics for fit accuracy evaluation, as shown herein.

accuracy(fit.ts.1, mean.monthly.ts)
##                    ME     RMSE      MAE       MPE     MAPE      ACF1
## Test set 2.122005e-13 2.293877 1.793989 -0.258706 3.863274 0.2144241
##          Theil's U
## Test set 0.4639782
accuracy(fit.ts.2, mean.monthly.ts)
##                   ME     RMSE      MAE       MPE     MAPE      ACF1
## Test set -0.03921384 3.309193 2.702384 -0.385475 5.775314 0.2336116
##          Theil's U
## Test set 0.6598435

Conclusions

When a time series shows evident seasonality, harmonic analysis easily allows us to define a regression model with significant coefficients.

The analysis of residuals spotted further structure in the remainder, whose ARIMA structure was then determined so that the final residuals could be verified to be white noise.

The remainders of the two models have been compared against several metrics.

Note

The original post published on January 25th, 2015 has been updated by substantially reworking and enhancing the analysis.