Thursday, January 1, 2015

Temperature monitoring and time series analysis (part 1)

Exploratory analysis

Abstract

Time series analysis applies whenever you can observe how some physical quantity evolves over time. Its aim is to understand the structure of the time series in order to describe it, simulate it, or even make predictions.

In this post and the next one I run some basic exploratory analysis, showing plots and summaries of a given time series.

Analysis

I analyze a dataset reporting the mean monthly air temperature (in degrees Fahrenheit) at Nottingham Castle from January 1920 to December 1939. The time series can be found at ref. [1] as part of the Time Series Data Library (ref. [2]).

I downloaded the data to the local file system by selecting the Export option on the left pane and then the CSV (comma-separated) file format. The dataset file is named mean-monthly-air-temperature-deg.csv.

After loading it, I drop its last record, which holds only a description string with no temperature value.

suppressPackageStartupMessages(library(xts))
suppressPackageStartupMessages(library(TSA))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(Rmisc))
suppressPackageStartupMessages(library(lubridate))

# "english" is a Windows locale name; on Linux/macOS use e.g. "en_US.UTF-8" or "C"
Sys.setlocale("LC_TIME", "english")
## [1] "English_United States.1252"
mean.monthly.temp <- read.csv("mean-monthly-air-temperature-deg.csv", header=TRUE, stringsAsFactors = FALSE)
colnames(mean.monthly.temp) <- c("Date", "Temperature")
kable(head(mean.monthly.temp))
Date Temperature
1920-01 40.6
1920-02 40.8
1920-03 44.4
1920-04 46.7
1920-05 54.1
1920-06 58.5
kable(tail(mean.monthly.temp))
Date Temperature
236 1939-08 61.8
237 1939-09 58.2
238 1939-10 46.7
239 1939-11 46.6
240 1939-12 37.8
241 Mean monthly air temperature (Deg. F) Nottingham Castle 1920-1939 NA
mmt <- mean.monthly.temp[complete.cases(mean.monthly.temp),]
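As an aside, the effect of complete.cases() can be checked on a toy data frame. The sketch below uses a hypothetical three-row frame (not the real file) whose last row mimics the trailing description record.

```r
# Toy data frame (hypothetical, for illustration only): the last row mimics
# the trailing description record, which has no temperature value.
toy <- data.frame(
  Date = c("1920-01", "1920-02", "Mean monthly air temperature"),
  Temperature = c(40.6, 40.8, NA),
  stringsAsFactors = FALSE
)

# complete.cases() returns TRUE for rows that contain no NA
keep <- complete.cases(toy)

# keep only the complete rows, dropping the description record
toy_clean <- toy[keep, ]
print(toy_clean)
```

Filtering on NA values rather than on a fixed row number keeps the step valid even if the exported file changes length.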

Our dataset reports monthly data, so the frequency to use when building the time series should be 12 (see ref. [3]). Still, I want to verify that with the periodogram function provided by the TSA package. Such a check is useful whenever you are unsure which frequency value to specify.

temp.pdgram <- periodogram(mmt$Temperature, plot=TRUE)

(max.freq <- which.max(temp.pdgram$spec))
## [1] 20
(temp.frequency = 1/temp.pdgram$freq[max.freq])
## [1] 12
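To see why the periodogram peak reveals the period, here is a minimal base R sketch on a synthetic series. The series, the variable names and the direct use of fft() are assumptions made for illustration; this is not how the TSA package is implemented internally.

```r
# Synthetic monthly series (assumed for illustration, not the real data):
# 20 years of a pure 12-month cycle around 50 degrees.
n <- 240
month_no <- seq_len(n)
x <- 50 + 10 * cos(2 * pi * month_no / 12)

# Raw periodogram: squared modulus of the FFT of the demeaned series
spec_all <- Mod(fft(x - mean(x)))^2 / n

# Keep the Fourier frequencies k/n for k = 1, ..., n/2 (cycles per observation);
# element 1 of the fft() output corresponds to frequency zero, so shift by one.
k <- 1:(n / 2)
freq <- k / n
spec <- spec_all[k + 1]

# The strongest frequency and the corresponding period in observations
peak <- which.max(spec)
period <- 1 / freq[peak]
print(round(period))
```

The peak falls at the 20th Fourier frequency, 20/240 cycles per observation, whose reciprocal is a period of 12 observations.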

The periodogram peaks at a period of 12 observations, which confirms a frequency of 12. I then create an xts time series object. The xts package provides useful functions to summarise data over specific time periods, such as week, month and year.

mmt$Date <- as.Date(paste(mmt$Date,1,sep="-"),"%Y-%m-%d")
mmt <- xts(mmt$Temperature, order.by=mmt$Date, frequency=temp.frequency)
plot(mmt, ylab="Temperature", main="Average Monthly Temperature")
points(mmt, pch=20)
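As an illustration of the per-period summaries that xts offers, the following sketch computes yearly means with apply.yearly(). It builds a small stand-alone series from the first two years of the dataset; the names toy and yearly_mean are hypothetical.

```r
library(xts)

# First two years of the Nottingham series (1920 and 1921)
dates <- seq(as.Date("1920-01-01"), by = "month", length.out = 24)
temps <- c(40.6, 40.8, 44.4, 46.7, 54.1, 58.5, 57.7, 56.4, 54.3, 50.5, 42.9, 39.8,
           44.2, 39.8, 45.1, 47.0, 54.1, 58.7, 66.3, 59.9, 57.0, 54.2, 39.7, 42.8)
toy <- xts(temps, order.by = dates)

# One value per calendar year: the mean of that year's monthly temperatures
yearly_mean <- apply.yearly(toy, mean)
print(yearly_mean)
```

The result is itself an xts object, indexed by the last observation of each year.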

Since the available data can be thought of as indexed by a compound key, the {year, month} pair, it is interesting to compare:

  • the same month across different years

  • all months within the same year

For this purpose I arrange the original time series in a data frame with one column per month of the year and one row per year. This layout makes the data easier to consult and simplifies the computation of summary statistics.

# .indexmon() is zero-based (0 = January), hence the "- 1"
months_idx <- sort(unique(month(index(mmt)))) - 1
years <- sort(unique(year(index(mmt))))

# one xts series per calendar month
mmt.by_month <- lapply(months_idx, function(x) { mmt[.indexmon(mmt) == x]})
# month names via "%B" depend on the LC_TIME locale set above
names(mmt.by_month) <- format(ISOdate(2016, 1:12, 1),"%B")

# one column per month, one row per year
mmt_df <- sapply(mmt.by_month, function(x) {coredata(x)})
mmt_df <- data.frame(Year = years, mmt_df)

kable(mmt_df)
Year January February March April May June July August September October November December
1920 40.6 40.8 44.4 46.7 54.1 58.5 57.7 56.4 54.3 50.5 42.9 39.8
1921 44.2 39.8 45.1 47.0 54.1 58.7 66.3 59.9 57.0 54.2 39.7 42.8
1922 37.5 38.7 39.5 42.1 55.7 57.8 56.8 54.3 54.3 47.1 41.8 41.7
1923 41.8 40.1 42.9 45.8 49.2 52.7 64.2 59.6 54.4 49.2 36.6 37.6
1924 39.3 37.5 38.3 45.5 53.2 57.7 60.8 58.2 56.4 49.8 44.4 43.6
1925 40.0 40.5 40.8 45.1 53.8 59.4 63.5 61.0 53.0 50.0 38.1 36.3
1926 39.2 43.4 43.4 48.9 50.6 56.8 62.5 62.0 57.5 46.7 41.6 39.8
1927 39.4 38.5 45.3 47.1 51.7 55.0 60.4 60.5 54.7 50.3 42.3 35.2
1928 40.8 41.1 42.8 47.3 50.9 56.4 62.2 60.5 55.4 50.2 43.0 37.3
1929 34.8 31.3 41.0 43.9 53.1 56.9 62.5 60.3 59.8 49.2 42.9 41.9
1930 41.6 37.1 41.2 46.9 51.2 60.4 60.1 61.6 57.0 50.9 43.0 38.8
1931 37.1 38.4 38.4 46.5 53.5 58.4 60.6 58.2 53.8 46.6 45.5 40.6
1932 42.4 38.4 40.3 44.6 50.9 57.0 62.1 63.5 56.2 47.3 43.6 41.8
1933 36.2 39.3 44.5 48.7 54.2 60.8 65.5 64.9 60.1 50.2 42.1 35.6
1934 39.4 38.2 40.4 46.9 53.4 59.6 66.5 60.4 59.2 51.2 42.8 45.8
1935 40.4 42.6 43.5 47.1 50.0 60.5 64.6 64.0 56.8 48.6 44.2 36.4
1936 37.3 35.0 44.0 43.9 52.7 58.6 60.0 61.1 58.1 49.6 41.6 41.3
1937 40.8 41.0 38.4 47.4 54.1 58.6 61.4 61.8 56.3 50.9 41.4 37.1
1938 42.1 41.2 47.3 46.6 52.4 59.0 59.6 60.4 57.0 50.7 47.8 39.2
1939 39.4 40.9 42.4 47.8 52.4 58.0 60.7 61.8 58.2 46.7 46.6 37.8
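The wide layout makes per-month and per-year summaries one-liners. A minimal sketch, using only the first three years and three months of the table above; the names wide, month_means, year_means and month_range are hypothetical.

```r
# First three years and first three months of the table above
wide <- data.frame(
  Year     = c(1920, 1921, 1922),
  January  = c(40.6, 44.2, 37.5),
  February = c(40.8, 39.8, 38.7),
  March    = c(44.4, 45.1, 39.5)
)

# Same month across years: one mean per column (Year column excluded)
month_means <- colMeans(wide[, -1])

# Same year across months: one mean per row, labelled by year
year_means <- setNames(rowMeans(wide[, -1]), wide$Year)

# Spread of each month across the years (max minus min)
month_range <- sapply(wide[, -1], function(v) max(v) - min(v))

print(month_means)
print(year_means)
print(month_range)
```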

I then convert the data frame above to long format to ease the plots based on the two types of data slicing mentioned above.

mmt_df_long <- gather(data=mmt_df, key=Month, value=Temperature, January:December)
mmt_df_long$Temperature <- unlist(mmt_df_long$Temperature)
month_idx <- as.vector(sapply(1:12, function(x) { rep(x,nrow(mmt_df)) }))
mmt_df_long <- data.frame(month_idx = month_idx, mmt_df_long)
kable(head(mmt_df_long))
month_idx Year Month Temperature
1 1920 January 40.6
1 1921 January 44.2
1 1922 January 37.5
1 1923 January 41.8
1 1924 January 39.3
1 1925 January 40.0
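In current tidyr releases gather() is superseded by pivot_longer(). The sketch below shows an equivalent reshape on a two-year, two-month excerpt; the names wide and long are hypothetical, and deriving the month index with match() against the built-in month.name constant assumes English month names.

```r
library(tidyr)

# Two years and two months taken from the table above
wide <- data.frame(
  Year     = c(1920, 1921),
  January  = c(40.6, 44.2),
  February = c(40.8, 39.8)
)

# One row per {year, month} pair
long <- pivot_longer(wide, cols = January:February,
                     names_to = "Month", values_to = "Temperature")

# Month index from the month name (January = 1)
long$month_idx <- match(long$Month, month.name)

# pivot_longer() returns rows year by year; reorder month by month
# to mirror the row order produced by gather()
long <- long[order(long$month_idx, long$Year), ]
print(long)
```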

Finally the plots.

ggplot(data=mmt_df_long, aes(x = Year, y=Temperature, colour = Month)) + 
  geom_line() + 
  geom_point() + 
  ggtitle("Temperature along years for each month")

mmt_df_long_2 <- mmt_df_long %>% arrange(Year) %>% mutate(Year = factor(Year)) %>% mutate(Month = factor(Month))

ggplot(data=mmt_df_long_2, aes(x = month_idx, y=Temperature, colour = Year)) +
  geom_line() + 
  geom_point() + 
  scale_x_discrete(limits=mmt_df_long_2$Month[1:12]) +
  theme(axis.text.x = element_text(face="bold", angle=90)) +
  ggtitle("Temperature along months for each year")
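A note on month ordering: factor(Month) sorts its levels alphabetically, which is why the axis limits are passed explicitly through scale_x_discrete() above. An alternative, sketched here under the assumption that month names are in English, is to fix the level order with the built-in month.name constant. The vector months_seen is a hypothetical example.

```r
# A few month names in arbitrary order (hypothetical example)
months_seen <- c("March", "January", "February", "January")

# Default factor: levels sorted alphabetically
alphabetical <- factor(months_seen)

# Explicit levels: calendar order taken from the built-in month.name constant
calendar <- factor(months_seen, levels = month.name)

print(levels(alphabetical))
print(levels(calendar)[1:3])

# Integer codes now match the month number
print(as.integer(calendar))
```

With calendar-ordered levels, the integer codes of the factor coincide with the month numbers.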

I save some of the variables so that the exploratory analysis can continue in the next post.

save(months_idx, years, mmt, mmt_df, mmt_df_long, mmt_df_long_2, file="nottingham.RData")
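For reference, objects written with save() are restored with load(). A minimal round-trip sketch using a temporary file and a hypothetical variable years_demo, so that it does not touch nottingham.RData:

```r
# Temporary file so the sketch leaves no trace in the working directory
tmp_file <- tempfile(fileext = ".RData")

# Save a variable, then remove it from the workspace
years_demo <- 1920:1939
save(years_demo, file = tmp_file)
rm(years_demo)

# load() restores the objects under their original names and
# returns (invisibly) the names of the restored objects
loaded <- load(tmp_file)
print(loaded)
print(length(years_demo))
```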

Note

The original post, published on January 1st 2015, has been updated with a substantial reworking and enhancement of the analysis. Due to blog limits on post size, I also had to split it into two separate posts.
