Exploratory analysis (cont.)
I herein go on with the exploratory analysis started in the previous post. First, I load saved environment data and the required packages.
load(file="nottingham.RData")
suppressPackageStartupMessages(library(xts))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(Rmisc))
suppressPackageStartupMessages(library(extremevalues))
suppressPackageStartupMessages(library(strucchange))
It is time to take advantage of the mmt_df table to compute some summary statistics as based on same year or same month observations. The apply() function second parameter indicates if third paramter function is applied to rows (1) or columns (2). If applied to rows, we compute summary statistics for same year observations across months. If applied to columns, we compute summary statistics for same months observations across years.
apply(mmt_df[,-1], 1, summary)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## Min. 39.80 39.70 37.50 36.60 37.50 36.30 39.20 35.20 37.30 31.30 37.10
## 1st Qu. 42.38 43.85 41.15 41.38 42.53 40.38 42.95 41.58 42.37 41.68 41.50
## Median 48.60 50.55 44.60 47.50 47.65 47.55 47.80 48.70 48.75 46.55 48.90
## Mean 48.89 50.73 47.28 47.84 48.72 48.46 49.37 48.37 48.99 48.13 49.15
## 3rd Qu. 54.82 57.42 54.65 53.13 56.72 55.20 56.97 54.78 55.65 57.62 57.78
## Max. 58.50 66.30 57.80 64.20 60.80 63.50 62.50 60.50 62.20 62.50 61.60
## [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## Min. 37.10 38.40 35.60 38.20 36.40 35.00 37.10 39.20 37.80
## 1st Qu. 40.05 42.25 41.40 42.20 43.28 41.53 40.95 45.48 42.02
## Median 46.55 45.95 49.45 49.05 47.85 46.80 49.15 49.25 47.25
## Mean 48.13 49.01 50.18 50.32 49.89 48.60 49.10 50.28 49.39
## 3rd Qu. 54.90 56.40 60.28 59.30 57.72 58.22 56.87 57.50 58.05
## Max. 60.60 63.50 65.50 66.50 64.60 61.10 61.80 60.40 61.80
apply(mmt_df[,-1], 1, sd)
## [1] 7.154205 8.764633 7.911916 8.699264 8.394276 9.604588 8.467836
## [8] 8.474060 8.263442 10.398368 8.973547 8.543933 8.718680 10.891458
## [15] 9.495725 9.456355 9.305228 9.234717 7.482358 8.609451
apply(mmt_df[,-1], 2, summary)
## January February March April May June July August September
## Min. 34.80 31.30 38.30 42.10 49.20 52.70 56.80 54.30 53.00
## 1st Qu. 38.78 38.35 40.38 45.40 51.13 56.98 60.32 59.82 54.63
## Median 39.70 39.55 42.60 46.80 52.90 58.45 61.75 60.50 56.60
## Mean 39.71 39.19 42.20 46.29 52.56 58.04 61.90 60.52 56.48
## 3rd Qu. 41.00 40.92 44.10 47.15 53.87 59.10 63.68 61.80 57.65
## Max. 44.20 43.40 47.30 48.90 55.70 60.80 66.50 64.90 60.10
## October November December
## Min. 46.60 36.60 35.20
## 1st Qu. 48.28 41.60 37.25
## Median 49.90 42.85 39.50
## Mean 49.50 42.60 39.52
## 3rd Qu. 50.55 43.75 41.72
## Max. 54.20 47.80 45.80
apply(mmt_df[,-1], 2, sd)
## January February March April May June July
## 2.287558 2.702221 2.555175 1.688007 1.675646 1.924742 2.636784
## August September October November December
## 2.461407 2.008371 1.905525 2.594219 2.894023
Boxplots and density plots are capable to graphycally gather temperature observations statistical dispersion.
ggplot(data=mmt_df_long_2, aes(x = Month, y=Temperature, fill=Month)) +
geom_boxplot() +
theme(legend.position="none") +
scale_x_discrete(limits=mmt_df_long_2$Month[1:12]) +
theme(axis.text.x = element_text(face="bold", angle=90)) +
ggtitle("Temperature across same month of the year")
ggplot(data=mmt_df_long_2, aes(x = Year, y=Temperature, fill=Year)) +
geom_boxplot() +
theme(legend.position="none") +
theme(axis.text.x = element_text(face="bold", angle=90)) +
ggtitle("Temperature across years")
ggplot(data=mmt_df_long_2, aes(x = Temperature, fill = 1)) +
geom_density() +
theme(legend.position="none") +
ggtitle("Temperature distribution density")
Temperature distribution densities per each month based on yearly collected observations are herein shown.
ggplot(data=mmt_df_long_2, aes(x=Temperature, fill=Month)) +
facet_grid(. ~ Month) +
geom_density() +
coord_flip() +
theme(legend.position="none") +
theme(axis.text.x = element_text(face="bold", angle=90)) +
ggtitle("Temperature distribution density for each month")
Aggregating the result per year allows to have a look at how the average temperature per year evolved.
yearly.mean <- data.frame(Years = years, Temperature = coredata(apply.yearly(mmt, mean)))
kable(yearly.mean)
Years | Temperature |
---|---|
1920 | 48.89167 |
1921 | 50.73333 |
1922 | 47.27500 |
1923 | 47.84167 |
1924 | 48.72500 |
1925 | 48.45833 |
1926 | 49.36667 |
1927 | 48.36667 |
1928 | 48.99167 |
1929 | 48.13333 |
1930 | 49.15000 |
1931 | 48.13333 |
1932 | 49.00833 |
1933 | 50.17500 |
1934 | 50.31667 |
1935 | 49.89167 |
1936 | 48.60000 |
1937 | 49.10000 |
1938 | 50.27500 |
1939 | 49.39167 |
summary(yearly.mean)
## Years Temperature
## Min. :1920 Min. :47.27
## 1st Qu.:1925 1st Qu.:48.44
## Median :1930 Median :49.00
## Mean :1930 Mean :49.04
## 3rd Qu.:1934 3rd Qu.:49.52
## Max. :1939 Max. :50.73
p1 <- ggplot(data=yearly.mean, aes(x = factor(0), y=Temperature)) +
geom_boxplot() +
theme(legend.position="none") +
xlab("") +
ggtitle("Yearly Mean Temperature")
p2 <- ggplot(data=yearly.mean, aes(x = Temperature, fill = 1)) +
geom_density() +
theme(legend.position="none") +
ggtitle("Yearly Mean Temperature Density")
multiplot(p1, p2, cols=2)
I am going to investigate for outliers by taking advantage of the extremevalue package and its getOutliersI() and outlierPlot() functions.
# left outliers quantile
left_thresh <- 0.01
# right outliers quantile
right_thresh <- 0.99
# determining the outliers
mmt_outlier <- getOutliersI(mmt_df_long_2$Temperature,
FLim = c(left_thresh, right_thresh),
distribution = "normal")
mmt_outlier$iLeft
## integer(0)
mmt_outlier$iRight
## integer(0)
# outliers are plotted in red color
outlierPlot(mmt_df_long_2$Temperature, mmt_outlier, mode="qq")
As aboveshown, no outliers are present.
Further I investigate any structural change present by taking advantage of the strucchange package and its efp and Fstats functions.
Specifically, the efp function allows to investigate the empirical fluctuation process according to a specified method such as OLS residuals.
par(mfrow=(c(1,2)))
temperature.efp <- efp(mmt ~ 1, type = "OLS-CUSUM")
plot(temperature.efp, alpha = 0.05, alt.boundary = FALSE)
plot(temperature.efp, alpha = 0.05, alt.boundary = TRUE)
From above shown plot there are not any thresholds (red shown) crossings, hence no shift in level associated to our time series.
Further the Fstat() funcion allows to investigate any structural changes whereby the breakpoint is not known in advance.
mmt_data <- coredata(mmt)
temperature.efp.efp.Fstats <- Fstats(mmt_data ~ 1, from=0.05, to=0.95)
sctest(temperature.efp.efp.Fstats, type="supF")
##
## supF test
##
## data: temperature.efp.efp.Fstats
## sup.F = 1.6824, p-value = 0.9702
Based on resulting p-value, the null hypothesis of no structural changes cannot be rejected.
Conclusions
There is a plenty of information that can be highlighted even in a simple dataset as the one I have herein shown. All that in terms of summarisations and plots.
In the following posts I will analyze the time series structure of the dataset herein introduced.
References
Note
Original post published on January 1st 2015 has been updated by substantial reworking and enhancing the analysis. Due to blog limits in post size, I had also to split it in two separate posts.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.