Saturday, September 20, 2014

Weather forecast (part 1)

Part 1 - exploring your dataset

In the next four blog posts, all tagged as Weather Forecast, I am going to build a model to forecast the day after rain condition. I am going to split my analysis in the following four steps, one per each post:

  1. introductive consideration and exploring the quantitative variables
  2. exploring the categorical variable analysis
  3. building the model and evaluating its prediction misclassification rate
  4. further considerations on top of the work done

Then, in this first post, I am going to do a preliminary investigation into a dataset reporting measurements from the weather station located in Camberra, concentrating my attention first on quantitative variables.

The dataset has many measurement variables about weather conditions and two response variables about Rain occurring the same day of measurement (RainToday) and the day after (RainTomorrow). It is publicly available at:

http://www.biz.uiowa.edu/faculty/jledolter/datamining/dataexercises.html

Let us have a look.

data <- read.csv("weather.csv", header=TRUE)
dim(data)
## [1] 366  24
head(data, n=7)
##        Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
## 1 11/1/2007 Canberra     8.0    24.3      0.0         3.4      6.3
## 2 11/2/2007 Canberra    14.0    26.9      3.6         4.4      9.7
## 3 11/3/2007 Canberra    13.7    23.4      3.6         5.8      3.3
## 4 11/4/2007 Canberra    13.3    15.5     39.8         7.2      9.1
## 5 11/5/2007 Canberra     7.6    16.1      2.8         5.6     10.6
## 6 11/6/2007 Canberra     6.2    16.9      0.0         5.8      8.2
## 7 11/7/2007 Canberra     6.1    18.2      0.2         4.2      8.4
##   WindGustDir WindGustSpeed WindDir9am WindDir3pm WindSpeed9am
## 1          NW            30         SW         NW            6
## 2         ENE            39          E          W            4
## 3          NW            85          N        NNE            6
## 4          NW            54        WNW          W           30
## 5         SSE            50        SSE        ESE           20
## 6          SE            44         SE          E           20
## 7          SE            43         SE        ESE           19
##   WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am
## 1           20          68          29        1020        1015        7
## 2           17          80          36        1012        1008        5
## 3            6          82          69        1010        1007        8
## 4           24          62          56        1006        1007        2
## 5           28          68          49        1018        1018        7
## 6           24          70          57        1024        1022        7
## 7           26          63          47        1025        1022        4
##   Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
## 1        7    14.4    23.6        No     3.6          Yes
## 2        3    17.5    25.7       Yes     3.6          Yes
## 3        7    15.4    20.2       Yes    39.8          Yes
## 4        7    13.5    14.1       Yes     2.8          Yes
## 5        7    11.1    15.4       Yes     0.0           No
## 6        5    10.9    14.8        No     0.2           No
## 7        6    12.4    17.3        No     0.0           No

My purpose is to build a model able to predict the RainTomorrow response variable. First I filter out rows havin NA values.

data.clean <- na.omit(data)
dim(data.clean)
## [1] 328  24

Further, intentionaly I am not taking into account of the RISK_MM variable, as it would facilitate too much the analysis (I will show it at the end of discussion). Moreover, by looking at how the RainToday matches RainTomorrow values:

attach(data.clean)
length(which(RainToday==RainTomorrow))/nrow(data.clean)
## [1] 0.753

We notice only in 25% of cases raining conditions changed one day after. Further, I like to figure out how much rainfall has to occur to determine for the current day the Yes value assigned to RainToday observed variable.

min(subset(Rainfall, RainToday=="Yes"))
## [1] 1.2

Let us cut off RISK_MM and RainToday to concentrate out attention to RainTomorrow response variable.

data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday))
colnames(data.clean)
##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"      
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"  
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am" 
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"  
## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
## [21] "Temp3pm"       "RainTomorrow"

At this point, I start with some visual analysis of our dataset, with the purpose to indentify the most determining observed variables to distinguish across RainTomorrow=YES/NO scenario. I start by boxplotting the quantitative variable against the observed RainTomorrow observation.

table(RainTomorrow)
## RainTomorrow
##  No Yes 
## 268  60
def.par <- par(no.readonly = TRUE)
par(mfrow=c(1,4))
boxplot(MinTemp~RainTomorrow, ylab = "MinTemp")
boxplot(MaxTemp~RainTomorrow, ylab = "MaxTemp")
boxplot(Evaporation~RainTomorrow, ylab = "Evaporation")
boxplot(Rainfall~RainTomorrow, ylab = "Rainfall")

plot of chunk unnamed-chunk-6

boxplot(Sunshine~RainTomorrow, ylab = "Sunshine")
boxplot(WindGustSpeed~RainTomorrow, ylab = "WindGustSpeed")
boxplot(WindSpeed9am~RainTomorrow, ylab = "WindSpeed9am")
boxplot(WindSpeed3pm~RainTomorrow, ylab = "WindSpeed3pm")

plot of chunk unnamed-chunk-6

boxplot(Humidity9am~RainTomorrow, ylab = "Humidity9am")
boxplot(Humidity3pm~RainTomorrow, ylab = "Humidity3pm")
boxplot(Pressure9am~RainTomorrow, ylab ="Pressure9am")
boxplot(Pressure3pm~RainTomorrow, ylab ="Pressure3pm")

plot of chunk unnamed-chunk-6

boxplot(Cloud9am~RainTomorrow, ylab="Cloud9am")
boxplot(Cloud3pm~RainTomorrow, ylab ="Cloud3pm")
boxplot(Temp9am~RainTomorrow, ylab ="Temp9am")
boxplot(Temp3pm~RainTomorrow, ylab ="Temp3pm")

plot of chunk unnamed-chunk-6

par(def.par)

Based on boxplots, I choose the following explanatory variables:

Rainfall, Sunshine, WindGustSpeed, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud3pm

as they show up better separation between the two RainTomorrow scenario. However, seven variables may be too many for a good prediction, hence I am going to investigate their correlation. Keeping two highly correlated variables in the prediction model do not improve very much our prediction quality, enough to keep only one of.

expl = data.frame(Rainfall, Sunshine, WindGustSpeed, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm)
expl.cor = cor(expl)
expl.cor
##               Rainfall Sunshine WindGustSpeed Humidity9am Humidity3pm
## Rainfall       1.00000 -0.15806       0.09944      0.1463     0.28724
## Sunshine      -0.15806  1.00000       0.08477     -0.5016    -0.76027
## WindGustSpeed  0.09944  0.08477       1.00000     -0.3383    -0.04325
## Humidity9am    0.14632 -0.50160      -0.33828      1.0000     0.52670
## Humidity3pm    0.28724 -0.76027      -0.04325      0.5267     1.00000
## Pressure9am   -0.34873  0.02563      -0.52474      0.1023    -0.13629
## Pressure3pm   -0.26371 -0.02412      -0.51083      0.1095    -0.04761
## Cloud9am       0.17261 -0.69760      -0.01822      0.4175     0.56517
## Cloud3pm       0.13489 -0.65720       0.04285      0.2896     0.53072
##               Pressure9am Pressure3pm Cloud9am Cloud3pm
## Rainfall         -0.34873    -0.26371  0.17261  0.13489
## Sunshine          0.02563    -0.02412 -0.69760 -0.65720
## WindGustSpeed    -0.52474    -0.51083 -0.01822  0.04285
## Humidity9am       0.10225     0.10955  0.41750  0.28962
## Humidity3pm      -0.13629    -0.04761  0.56517  0.53072
## Pressure9am       1.00000     0.96674 -0.16832 -0.14620
## Pressure3pm       0.96674     1.00000 -0.13225 -0.14624
## Cloud9am         -0.16832    -0.13225  1.00000  0.52830
## Cloud3pm         -0.14620    -0.14624  0.52830  1.00000

Pressure9am and Pressure3pm are highly correlated.

There is remarkable correlation between Humidity9am and Humidity3pm.

Sunshine is high negatively correlated with Humidity3pm.

Cloud3pm is highly negative correlated with Sunshine.

There is remarkable correlation between Cloud9am and Humidity9am.

There is remarkable correlation between Cloud3pm and Humidity3pm.

There is remarkable correlation between WindGustSpeed and Pressure3pm.

My choice is then for the following explanatory variables:

Pressure3pm, Humidity3pm, Rainfall

while choosing 3pm data as more determining of the day after possible scenario than 9am data. Further, thinking back to the possible timelines weather forecasts are delivered, it would be better to create a model able to handle three sets of explanatory variables:

Pressure9am and Humidity9am, to generate the weather prediction at 9am.

Pressure3pm and Humidity3pm, to generate another weather prediction at 3pm.

Pressure3pm, Humidity3pm and Sunshine hours to generate a prediction at sunset time (let us say by 8pm, after evening news time)

That makes perfectly sense in a real world case scenario. At this point, we may wonder if we are following a good approach. Let us quickly evaluate it by building a decision tree based on that

library(tree)
mytree <- tree(RainTomorrow~(Humidity9am+Pressure9am), data=data.clean, mincut=1)
summary(mytree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity9am + Pressure9am), data = data.clean, 
##     mincut = 1)
## Number of terminal nodes:  13 
## Residual mean deviance:  0.635 = 200 / 315 
## Misclassification error rate: 0.134 = 44 / 328
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm), data=data.clean, mincut=1)
summary(mytree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm), data = data.clean, 
##     mincut = 1)
## Number of terminal nodes:  9 
## Residual mean deviance:  0.598 = 191 / 319 
## Misclassification error rate: 0.125 = 41 / 328
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine), data=data.clean, mincut=1)
summary(mytree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + Sunshine), 
##     data = data.clean, mincut = 1)
## Number of terminal nodes:  20 
## Residual mean deviance:  0.355 = 109 / 308 
## Misclassification error rate: 0.0732 = 24 / 328

It makes sense that using 3pm data results with a lower misclassification rate than using 9am one. That is just a first look into the model I am going to build, a classification tree to predict next day rain condition. I have not yet defined the training set nor the test set, as I will do once analysed also the categorical variables. In the next post, I am going to do that.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.