Part 1 - exploring your dataset
Sunday, September 14, 2014
In the next four blog posts, all tagged as Weather Forecast, I am going to build a model to forecast whether it will rain the day after. I am going to split my analysis into the following four steps, one per post:
- introductory considerations and exploration of the quantitative variables
- exploration of the categorical variables
- building the model and evaluating its misclassification rate
- further considerations on top of the work done
In this first post, then, I am going to carry out a preliminary investigation of a dataset reporting measurements from the weather station located in Canberra, concentrating first on the quantitative variables.
The dataset has many measurement variables describing weather conditions and two response variables recording whether rain occurred on the day of measurement (RainToday) and on the day after (RainTomorrow). It is publicly available at:
http://www.biz.uiowa.edu/faculty/jledolter/datamining/dataexercises.html
Let us have a look.
data <- read.csv("weather.csv", header=TRUE)
dim(data)
## [1] 366 24
head(data, n=7)
## Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
## 1 11/1/2007 Canberra 8.0 24.3 0.0 3.4 6.3
## 2 11/2/2007 Canberra 14.0 26.9 3.6 4.4 9.7
## 3 11/3/2007 Canberra 13.7 23.4 3.6 5.8 3.3
## 4 11/4/2007 Canberra 13.3 15.5 39.8 7.2 9.1
## 5 11/5/2007 Canberra 7.6 16.1 2.8 5.6 10.6
## 6 11/6/2007 Canberra 6.2 16.9 0.0 5.8 8.2
## 7 11/7/2007 Canberra 6.1 18.2 0.2 4.2 8.4
## WindGustDir WindGustSpeed WindDir9am WindDir3pm WindSpeed9am
## 1 NW 30 SW NW 6
## 2 ENE 39 E W 4
## 3 NW 85 N NNE 6
## 4 NW 54 WNW W 30
## 5 SSE 50 SSE ESE 20
## 6 SE 44 SE E 20
## 7 SE 43 SE ESE 19
## WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am
## 1 20 68 29 1020 1015 7
## 2 17 80 36 1012 1008 5
## 3 6 82 69 1010 1007 8
## 4 24 62 56 1006 1007 2
## 5 28 68 49 1018 1018 7
## 6 24 70 57 1024 1022 7
## 7 26 63 47 1025 1022 4
## Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
## 1 7 14.4 23.6 No 3.6 Yes
## 2 3 17.5 25.7 Yes 3.6 Yes
## 3 7 15.4 20.2 Yes 39.8 Yes
## 4 7 13.5 14.1 Yes 2.8 Yes
## 5 7 11.1 15.4 Yes 0.0 No
## 6 5 10.9 14.8 No 0.2 No
## 7 6 12.4 17.3 No 0.0 No
My purpose is to build a model able to predict the RainTomorrow response variable. First, I filter out rows having NA values.
data.clean <- na.omit(data)
dim(data.clean)
## [1] 328 24
Further, I am intentionally not taking the RISK_MM variable into account, as it would make the analysis far too easy (I will show why at the end of the discussion). Moreover, let us look at how often RainToday matches RainTomorrow:
attach(data.clean)
length(which(RainToday==RainTomorrow))/nrow(data.clean)
## [1] 0.753
We notice that in only about 25% of cases did the rain condition change from one day to the next. Further, I would like to figure out how much rainfall has to occur for the current day's RainToday variable to be set to Yes.
min(subset(Rainfall, RainToday=="Yes"))
## [1] 1.2
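As an aside, here is why RISK_MM would make the analysis trivial. Looking at the rows printed above, each day's RISK_MM equals the next day's Rainfall, so RISK_MM records the rain amount for the following day; assuming the same roughly 1 mm threshold just observed for RainToday, it essentially encodes RainTomorrow directly. A quick cross-tabulation (just a sanity check, not part of the model) makes this visible:

# RISK_MM is the rain amount for the following day; comparing it against
# the ~1 mm threshold should reproduce RainTomorrow almost exactly
table(RISK_MM >= 1, RainTomorrow)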
Let us cut off RISK_MM and RainToday to concentrate our attention on the RainTomorrow response variable.
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday))
colnames(data.clean)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainTomorrow"
At this point, I start with some visual analysis of the dataset, with the purpose of identifying the observed variables that best distinguish the RainTomorrow = Yes/No scenarios. I start by boxplotting each quantitative variable against the RainTomorrow response.
table(RainTomorrow)
## RainTomorrow
## No Yes
## 268 60
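Before modelling, the class counts above set a useful baseline: a trivial model that always predicts No would be wrong only on the 60 Yes days out of 328, i.e. roughly 18% of the time. Any model worth building should beat this misclassification rate:

# misclassification rate of the trivial "always No" predictor (60/328)
sum(RainTomorrow == "Yes") / nrow(data.clean)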
def.par <- par(no.readonly = TRUE)
par(mfrow=c(1,4))
boxplot(MinTemp~RainTomorrow, ylab = "MinTemp")
boxplot(MaxTemp~RainTomorrow, ylab = "MaxTemp")
boxplot(Evaporation~RainTomorrow, ylab = "Evaporation")
boxplot(Rainfall~RainTomorrow, ylab = "Rainfall")
boxplot(Sunshine~RainTomorrow, ylab = "Sunshine")
boxplot(WindGustSpeed~RainTomorrow, ylab = "WindGustSpeed")
boxplot(WindSpeed9am~RainTomorrow, ylab = "WindSpeed9am")
boxplot(WindSpeed3pm~RainTomorrow, ylab = "WindSpeed3pm")
boxplot(Humidity9am~RainTomorrow, ylab = "Humidity9am")
boxplot(Humidity3pm~RainTomorrow, ylab = "Humidity3pm")
boxplot(Pressure9am~RainTomorrow, ylab ="Pressure9am")
boxplot(Pressure3pm~RainTomorrow, ylab ="Pressure3pm")
boxplot(Cloud9am~RainTomorrow, ylab="Cloud9am")
boxplot(Cloud3pm~RainTomorrow, ylab ="Cloud3pm")
boxplot(Temp9am~RainTomorrow, ylab ="Temp9am")
boxplot(Temp3pm~RainTomorrow, ylab ="Temp3pm")
par(def.par)
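As a side note, the sixteen boxplot calls above can be written more compactly by looping over the quantitative column names; this is just an equivalent sketch, producing the same plots four at a time:

# quantitative columns = everything except the date, location,
# wind-direction and response columns
num.vars <- setdiff(colnames(data.clean),
                    c("Date", "Location", "WindGustDir",
                      "WindDir9am", "WindDir3pm", "RainTomorrow"))
par(mfrow=c(1,4))
for (v in num.vars) boxplot(data.clean[[v]] ~ data.clean$RainTomorrow, ylab = v)
par(def.par)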
Based on the boxplots, I choose the following explanatory variables:
Rainfall, Sunshine, WindGustSpeed, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm
as they show the best separation between the two RainTomorrow scenarios. However, nine variables may be too many for a good prediction model, hence I am going to investigate their correlations. Keeping two highly correlated variables in the model does not improve prediction quality much; it is enough to keep only one of them.
expl = data.frame(Rainfall, Sunshine, WindGustSpeed, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm)
expl.cor = cor(expl)
expl.cor
## Rainfall Sunshine WindGustSpeed Humidity9am Humidity3pm
## Rainfall 1.00000 -0.15806 0.09944 0.1463 0.28724
## Sunshine -0.15806 1.00000 0.08477 -0.5016 -0.76027
## WindGustSpeed 0.09944 0.08477 1.00000 -0.3383 -0.04325
## Humidity9am 0.14632 -0.50160 -0.33828 1.0000 0.52670
## Humidity3pm 0.28724 -0.76027 -0.04325 0.5267 1.00000
## Pressure9am -0.34873 0.02563 -0.52474 0.1023 -0.13629
## Pressure3pm -0.26371 -0.02412 -0.51083 0.1095 -0.04761
## Cloud9am 0.17261 -0.69760 -0.01822 0.4175 0.56517
## Cloud3pm 0.13489 -0.65720 0.04285 0.2896 0.53072
## Pressure9am Pressure3pm Cloud9am Cloud3pm
## Rainfall -0.34873 -0.26371 0.17261 0.13489
## Sunshine 0.02563 -0.02412 -0.69760 -0.65720
## WindGustSpeed -0.52474 -0.51083 -0.01822 0.04285
## Humidity9am 0.10225 0.10955 0.41750 0.28962
## Humidity3pm -0.13629 -0.04761 0.56517 0.53072
## Pressure9am 1.00000 0.96674 -0.16832 -0.14620
## Pressure3pm 0.96674 1.00000 -0.13225 -0.14624
## Cloud9am -0.16832 -0.13225 1.00000 0.52830
## Cloud3pm -0.14620 -0.14624 0.52830 1.00000
Pressure9am and Pressure3pm are highly correlated (0.97).
There is a remarkable correlation between Humidity9am and Humidity3pm (0.53).
Sunshine is highly negatively correlated with Humidity3pm (-0.76).
Cloud3pm is highly negatively correlated with Sunshine (-0.66).
There is a remarkable correlation between Cloud9am and Humidity9am (0.42).
There is a remarkable correlation between Cloud3pm and Humidity3pm (0.53).
There is a notable negative correlation between WindGustSpeed and Pressure3pm (-0.51).
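The pairs listed above can also be pulled out of the correlation matrix programmatically; here is a sketch flagging every pair whose absolute correlation exceeds 0.5:

# upper-triangle entries of the correlation matrix with |correlation| > 0.5
high <- which(abs(expl.cor) > 0.5 & upper.tri(expl.cor), arr.ind = TRUE)
data.frame(var1 = rownames(expl.cor)[high[, 1]],
           var2 = colnames(expl.cor)[high[, 2]],
           cor  = round(expl.cor[high], 2))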
My choice then falls on the following explanatory variables:
Pressure3pm, Humidity3pm, Rainfall
choosing the 3pm data as more indicative of the next day's scenario than the 9am data. Further, thinking about the times of day at which weather forecasts are actually delivered, it would be better to build a model able to handle three sets of explanatory variables:
Pressure9am and Humidity9am, to generate the weather prediction at 9am.
Pressure3pm and Humidity3pm, to generate another weather prediction at 3pm.
Pressure3pm, Humidity3pm and Sunshine, to generate a prediction at sunset time (let us say by 8pm, after the evening news).
That makes perfect sense in a real-world scenario. At this point, we may wonder whether we are following a good approach. Let us quickly evaluate it by building a decision tree on each set of variables.
library(tree)
mytree <- tree(RainTomorrow~(Humidity9am+Pressure9am), data=data.clean, mincut=1)
summary(mytree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity9am + Pressure9am), data = data.clean,
## mincut = 1)
## Number of terminal nodes: 13
## Residual mean deviance: 0.635 = 200 / 315
## Misclassification error rate: 0.134 = 44 / 328
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm), data=data.clean, mincut=1)
summary(mytree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm), data = data.clean,
## mincut = 1)
## Number of terminal nodes: 9
## Residual mean deviance: 0.598 = 191 / 319
## Misclassification error rate: 0.125 = 41 / 328
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine), data=data.clean, mincut=1)
summary(mytree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + Sunshine),
## data = data.clean, mincut = 1)
## Number of terminal nodes: 20
## Residual mean deviance: 0.355 = 109 / 308
## Misclassification error rate: 0.0732 = 24 / 328
It makes sense that using 3pm data results in a lower misclassification rate than using 9am data. This is just a first look at the model I am going to build, a classification tree to predict the next day's rain condition. I have not yet defined the training and test sets; I will do that once I have also analysed the categorical variables, which is the subject of the next post.
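For reference, a training/test split for the upcoming posts could look like the sketch below; the seed and the 70/30 proportion are only illustrative choices, not decisions made yet:

set.seed(123)  # illustrative seed, for reproducibility
train.idx <- sample(nrow(data.clean), size = round(0.7 * nrow(data.clean)))
train.set <- data.clean[train.idx, ]   # ~70% of rows, to fit the tree
test.set  <- data.clean[-train.idx, ]  # ~30% held out, to estimate the misclassification rate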