Sunday, September 21, 2014

Weather forecast (part 2)

Categorical variables

In the previous post, I outlined my preliminary analysis of the quantitative variables in the weather dataset available at:

http://www.biz.uiowa.edu/faculty/jledolter/datamining/dataexercises.html

In this post I am going to continue my analysis focusing on categorical variables. Let us load the data.

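# Read the data; stringsAsFactors=TRUE keeps text columns as factors,
# which was the default before R 4.0 and is what tree() needs for RainTomorrow
data <- read.csv("weather.csv", header=TRUE, stringsAsFactors=TRUE)
# Drop the rows that contain missing values
data.clean <- na.omit(data)
# Drop RISK_MM and RainToday, which are not used in this analysis
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday))
attach(data.clean)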
colnames(data.clean)
##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"      
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"  
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am" 
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"  
## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
## [21] "Temp3pm"       "RainTomorrow"

The categorical variables of interest are WindGustDir, WindDir9am and WindDir3pm. Let us see how relevant they are by building their cross tables against RainTomorrow and plotting the row proportions.

library(gmodels)
library(lattice)
wgd <- CrossTable(WindGustDir, RainTomorrow, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  328 
## 
##  
##              | RainTomorrow 
##  WindGustDir |        No |       Yes | Row Total | 
## -------------|-----------|-----------|-----------|
##            E |        31 |         3 |        34 | 
##              |     0.912 |     0.088 |     0.104 | 
##              |     0.116 |     0.050 |           | 
##              |     0.095 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##          ENE |        27 |         2 |        29 | 
##              |     0.931 |     0.069 |     0.088 | 
##              |     0.101 |     0.033 |           | 
##              |     0.082 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##          ESE |        17 |         6 |        23 | 
##              |     0.739 |     0.261 |     0.070 | 
##              |     0.063 |     0.100 |           | 
##              |     0.052 |     0.018 |           | 
## -------------|-----------|-----------|-----------|
##            N |        19 |         2 |        21 | 
##              |     0.905 |     0.095 |     0.064 | 
##              |     0.071 |     0.033 |           | 
##              |     0.058 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##           NE |        12 |         3 |        15 | 
##              |     0.800 |     0.200 |     0.046 | 
##              |     0.045 |     0.050 |           | 
##              |     0.037 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##          NNE |         7 |         0 |         7 | 
##              |     1.000 |     0.000 |     0.021 | 
##              |     0.026 |     0.000 |           | 
##              |     0.021 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##          NNW |        23 |        12 |        35 | 
##              |     0.657 |     0.343 |     0.107 | 
##              |     0.086 |     0.200 |           | 
##              |     0.070 |     0.037 |           | 
## -------------|-----------|-----------|-----------|
##           NW |        49 |        15 |        64 | 
##              |     0.766 |     0.234 |     0.195 | 
##              |     0.183 |     0.250 |           | 
##              |     0.149 |     0.046 |           | 
## -------------|-----------|-----------|-----------|
##            S |        18 |         3 |        21 | 
##              |     0.857 |     0.143 |     0.064 | 
##              |     0.067 |     0.050 |           | 
##              |     0.055 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##           SE |        11 |         0 |        11 | 
##              |     1.000 |     0.000 |     0.034 | 
##              |     0.041 |     0.000 |           | 
##              |     0.034 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##          SSE |         9 |         3 |        12 | 
##              |     0.750 |     0.250 |     0.037 | 
##              |     0.034 |     0.050 |           | 
##              |     0.027 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##          SSW |         2 |         2 |         4 | 
##              |     0.500 |     0.500 |     0.012 | 
##              |     0.007 |     0.033 |           | 
##              |     0.006 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##           SW |         1 |         2 |         3 | 
##              |     0.333 |     0.667 |     0.009 | 
##              |     0.004 |     0.033 |           | 
##              |     0.003 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##            W |         9 |         6 |        15 | 
##              |     0.600 |     0.400 |     0.046 | 
##              |     0.034 |     0.100 |           | 
##              |     0.027 |     0.018 |           | 
## -------------|-----------|-----------|-----------|
##          WNW |        31 |         1 |        32 | 
##              |     0.969 |     0.031 |     0.098 | 
##              |     0.116 |     0.017 |           | 
##              |     0.095 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
##          WSW |         2 |         0 |         2 | 
##              |     1.000 |     0.000 |     0.006 | 
##              |     0.007 |     0.000 |           | 
##              |     0.006 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       268 |        60 |       328 | 
##              |     0.817 |     0.183 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
barchart(wgd$prop.row, stack=FALSE, auto.key=list(rectangles = TRUE, space = "top"))

[Figure: bar chart of the RainTomorrow row proportions by WindGustDir]

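The bar chart above shows that when the wind gust comes from the West (W), South-West (SW) or South-South-West (SSW), rain tomorrow is more likely: the row proportions for Yes are 0.400, 0.667 and 0.500 respectively, against 0.183 over the whole table. SW and SSW count only 3 and 4 observations, so their proportions are rough. In other words, some WindGustDir directions are more relevant than others for predicting rain tomorrow.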
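As a quick cross-check, the row proportions behind the bar chart can be recomputed in base R with prop.table(). The sketch below re-enters by hand four rows of the WindGustDir table printed above instead of reading weather.csv, so it runs on its own; the names gust_counts, row_prop and overall_yes are only illustrative.

```r
# Counts copied by hand from the WindGustDir cross table printed above
# (only four of the sixteen directions, to keep the example short).
gust_counts <- matrix(
  c(9, 6,    # W
    1, 2,    # SW
    2, 2,    # SSW
    31, 1),  # WNW
  ncol = 2, byrow = TRUE,
  dimnames = list(WindGustDir = c("W", "SW", "SSW", "WNW"),
                  RainTomorrow = c("No", "Yes"))
)

# margin = 1 divides each count by its row total, which is what
# CrossTable() reports as "N / Row Total".
row_prop <- prop.table(gust_counts, margin = 1)
round(row_prop, 3)

# Overall share of rainy days in the cleaned data (60 out of 328),
# used as a reference level for the rows above.
overall_yes <- 60 / 328
row_prop[, "Yes"] > overall_yes
```

Comparing each row with the overall share makes it easier to see which directions stand out, although rows with very few observations should be read with caution.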

wd9am <- CrossTable(WindDir9am, RainTomorrow, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  328 
## 
##  
##              | RainTomorrow 
##   WindDir9am |        No |       Yes | Row Total | 
## -------------|-----------|-----------|-----------|
##            E |        16 |         4 |        20 | 
##              |     0.800 |     0.200 |     0.061 | 
##              |     0.060 |     0.067 |           | 
##              |     0.049 |     0.012 |           | 
## -------------|-----------|-----------|-----------|
##          ENE |         7 |         1 |         8 | 
##              |     0.875 |     0.125 |     0.024 | 
##              |     0.026 |     0.017 |           | 
##              |     0.021 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
##          ESE |        28 |         1 |        29 | 
##              |     0.966 |     0.034 |     0.088 | 
##              |     0.104 |     0.017 |           | 
##              |     0.085 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
##            N |        18 |        12 |        30 | 
##              |     0.600 |     0.400 |     0.091 | 
##              |     0.067 |     0.200 |           | 
##              |     0.055 |     0.037 |           | 
## -------------|-----------|-----------|-----------|
##           NE |         2 |         2 |         4 | 
##              |     0.500 |     0.500 |     0.012 | 
##              |     0.007 |     0.033 |           | 
##              |     0.006 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##          NNE |         6 |         1 |         7 | 
##              |     0.857 |     0.143 |     0.021 | 
##              |     0.022 |     0.017 |           | 
##              |     0.018 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
##          NNW |        29 |         7 |        36 | 
##              |     0.806 |     0.194 |     0.110 | 
##              |     0.108 |     0.117 |           | 
##              |     0.088 |     0.021 |           | 
## -------------|-----------|-----------|-----------|
##           NW |        22 |         8 |        30 | 
##              |     0.733 |     0.267 |     0.091 | 
##              |     0.082 |     0.133 |           | 
##              |     0.067 |     0.024 |           | 
## -------------|-----------|-----------|-----------|
##            S |        24 |         2 |        26 | 
##              |     0.923 |     0.077 |     0.079 | 
##              |     0.090 |     0.033 |           | 
##              |     0.073 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##           SE |        41 |         6 |        47 | 
##              |     0.872 |     0.128 |     0.143 | 
##              |     0.153 |     0.100 |           | 
##              |     0.125 |     0.018 |           | 
## -------------|-----------|-----------|-----------|
##          SSE |        30 |         8 |        38 | 
##              |     0.789 |     0.211 |     0.116 | 
##              |     0.112 |     0.133 |           | 
##              |     0.091 |     0.024 |           | 
## -------------|-----------|-----------|-----------|
##          SSW |        13 |         4 |        17 | 
##              |     0.765 |     0.235 |     0.052 | 
##              |     0.049 |     0.067 |           | 
##              |     0.040 |     0.012 |           | 
## -------------|-----------|-----------|-----------|
##           SW |         6 |         1 |         7 | 
##              |     0.857 |     0.143 |     0.021 | 
##              |     0.022 |     0.017 |           | 
##              |     0.018 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
##            W |         8 |         0 |         8 | 
##              |     1.000 |     0.000 |     0.024 | 
##              |     0.030 |     0.000 |           | 
##              |     0.024 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##          WNW |        14 |         2 |        16 | 
##              |     0.875 |     0.125 |     0.049 | 
##              |     0.052 |     0.033 |           | 
##              |     0.043 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##          WSW |         4 |         1 |         5 | 
##              |     0.800 |     0.200 |     0.015 | 
##              |     0.015 |     0.017 |           | 
##              |     0.012 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       268 |        60 |       328 | 
##              |     0.817 |     0.183 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
barchart(wd9am$prop.row, stack=FALSE, auto.key=list(rectangles = TRUE, space = "top"))

[Figure: bar chart of the RainTomorrow row proportions by WindDir9am]

Similarly, when WindDir9am is North-East (NE) or North (N), rain tomorrow is more likely: the row proportions for Yes are 0.500 and 0.400, although NE counts only 4 observations.

wd3pm <- CrossTable(WindDir3pm, RainTomorrow, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  328 
## 
##  
##              | RainTomorrow 
##   WindDir3pm |        No |       Yes | Row Total | 
## -------------|-----------|-----------|-----------|
##            E |        16 |         1 |        17 | 
##              |     0.941 |     0.059 |     0.052 | 
##              |     0.060 |     0.017 |           | 
##              |     0.049 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
##          ENE |         7 |         3 |        10 | 
##              |     0.700 |     0.300 |     0.030 | 
##              |     0.026 |     0.050 |           | 
##              |     0.021 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##          ESE |        23 |         3 |        26 | 
##              |     0.885 |     0.115 |     0.079 | 
##              |     0.086 |     0.050 |           | 
##              |     0.070 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##            N |        23 |         4 |        27 | 
##              |     0.852 |     0.148 |     0.082 | 
##              |     0.086 |     0.067 |           | 
##              |     0.070 |     0.012 |           | 
## -------------|-----------|-----------|-----------|
##           NE |        11 |         2 |        13 | 
##              |     0.846 |     0.154 |     0.040 | 
##              |     0.041 |     0.033 |           | 
##              |     0.034 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##          NNE |        11 |         3 |        14 | 
##              |     0.786 |     0.214 |     0.043 | 
##              |     0.041 |     0.050 |           | 
##              |     0.034 |     0.009 |           | 
## -------------|-----------|-----------|-----------|
##          NNW |        31 |         8 |        39 | 
##              |     0.795 |     0.205 |     0.119 | 
##              |     0.116 |     0.133 |           | 
##              |     0.095 |     0.024 |           | 
## -------------|-----------|-----------|-----------|
##           NW |        41 |        14 |        55 | 
##              |     0.745 |     0.255 |     0.168 | 
##              |     0.153 |     0.233 |           | 
##              |     0.125 |     0.043 |           | 
## -------------|-----------|-----------|-----------|
##            S |        11 |         2 |        13 | 
##              |     0.846 |     0.154 |     0.040 | 
##              |     0.041 |     0.033 |           | 
##              |     0.034 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
##           SE |        10 |         1 |        11 | 
##              |     0.909 |     0.091 |     0.034 | 
##              |     0.037 |     0.017 |           | 
##              |     0.030 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
##          SSE |         6 |         1 |         7 | 
##              |     0.857 |     0.143 |     0.021 | 
##              |     0.022 |     0.017 |           | 
##              |     0.018 |     0.003 |           | 
## -------------|-----------|-----------|-----------|
##          SSW |         5 |         0 |         5 | 
##              |     1.000 |     0.000 |     0.015 | 
##              |     0.019 |     0.000 |           | 
##              |     0.015 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##           SW |         4 |         0 |         4 | 
##              |     1.000 |     0.000 |     0.012 | 
##              |     0.015 |     0.000 |           | 
##              |     0.012 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##            W |        18 |         6 |        24 | 
##              |     0.750 |     0.250 |     0.073 | 
##              |     0.067 |     0.100 |           | 
##              |     0.055 |     0.018 |           | 
## -------------|-----------|-----------|-----------|
##          WNW |        43 |        10 |        53 | 
##              |     0.811 |     0.189 |     0.162 | 
##              |     0.160 |     0.167 |           | 
##              |     0.131 |     0.030 |           | 
## -------------|-----------|-----------|-----------|
##          WSW |         8 |         2 |        10 | 
##              |     0.800 |     0.200 |     0.030 | 
##              |     0.030 |     0.033 |           | 
##              |     0.024 |     0.006 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       268 |        60 |       328 | 
##              |     0.817 |     0.183 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
barchart(wd3pm$prop.row, stack=FALSE, auto.key=list(rectangles = TRUE, space = "top"))

[Figure: bar chart of the RainTomorrow row proportions by WindDir3pm]

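Unlike in the previous bar charts, WindDir3pm does not seem to be that relevant for predicting rain tomorrow. As a further check, I run Pearson's Chi-squared test on each of the three categorical variables. The p-values are consistent with my reading of the bar charts: WindGustDir is clearly significant (p = 0.0019), WindDir9am is borderline (p = 0.068, significant at the 10% level but not at 5%), and WindDir3pm shows no evidence of association with RainTomorrow (p = 0.86).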

suppressWarnings(chisq.test(WindGustDir, RainTomorrow))
## 
##  Pearson's Chi-squared test
## 
## data:  WindGustDir and RainTomorrow
## X-squared = 35.83, df = 15, p-value = 0.001868
suppressWarnings(chisq.test(WindDir9am, RainTomorrow))
## 
##  Pearson's Chi-squared test
## 
## data:  WindDir9am and RainTomorrow
## X-squared = 23.81, df = 15, p-value = 0.06834
suppressWarnings(chisq.test(WindDir3pm, RainTomorrow))
## 
##  Pearson's Chi-squared test
## 
## data:  WindDir3pm and RainTomorrow
## X-squared = 9.403, df = 15, p-value = 0.8555
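
The calls above are wrapped in suppressWarnings(). With several sparse directions, chisq.test() typically warns that the Chi-squared approximation may be incorrect, so it can be worth looking at the expected counts. The sketch below rebuilds the WindGustDir table by hand from the counts printed earlier (gust_table, res and res_sim are illustrative names) and, as one possible cross-check that is not part of the analysis above, also asks chisq.test() for a Monte Carlo p-value.

```r
# Counts copied by hand from the WindGustDir cross table printed earlier.
dirs <- c("E", "ENE", "ESE", "N", "NE", "NNE", "NNW", "NW",
          "S", "SE", "SSE", "SSW", "SW", "W", "WNW", "WSW")
no  <- c(31, 27, 17, 19, 12, 7, 23, 49, 18, 11, 9, 2, 1, 9, 31, 2)
yes <- c(3, 2, 6, 2, 3, 0, 12, 15, 3, 0, 3, 2, 2, 6, 1, 0)
gust_table <- cbind(No = no, Yes = yes)
rownames(gust_table) <- dirs

# Same Pearson test as above, run on the table instead of the raw columns.
res <- suppressWarnings(chisq.test(gust_table))
res$statistic
res$parameter

# Expected counts under independence; cells below 5 are the usual
# reason for the warning that was suppressed.
sum(res$expected < 5)

# Monte Carlo p-value, which does not rely on the Chi-squared approximation.
set.seed(1)  # arbitrary seed, only for reproducibility
res_sim <- chisq.test(gust_table, simulate.p.value = TRUE, B = 5000)
res_sim$p.value
```

If the simulated p-value stays close to the asymptotic one, the sparse cells are probably not driving the conclusion for WindGustDir.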

Now, some considerations about when WindGustDir becomes available. Can we expect to have it at 9am? Possibly, but gusts may also occur in the afternoon, even after 3pm. So I think I can safely include WindGustDir in my 8pm forecast. For the 9am forecast, I will include WindDir9am, and I will evaluate whether adding it to the 3pm explanatory variables brings any benefit as well. Below is a quick preview of the misclassification rates of the corresponding classification trees.

library(tree)
mytree <- tree(RainTomorrow~(Humidity9am+Pressure9am+WindDir9am), data=data.clean, mincut=1)
summary(mytree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity9am + Pressure9am + WindDir9am), 
##     data = data.clean, mincut = 1)
## Number of terminal nodes:  22 
## Residual mean deviance:  0.377 = 115 / 306 
## Misclassification error rate: 0.0793 = 26 / 328
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+WindDir9am), data=data.clean, mincut=1)
summary(mytree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + WindDir9am), 
##     data = data.clean, mincut = 1)
## Number of terminal nodes:  23 
## Residual mean deviance:  0.356 = 108 / 305 
## Misclassification error rate: 0.0976 = 32 / 328
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine+WindGustDir), data=data.clean, mincut=1)
summary(mytree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + Sunshine + 
##     WindGustDir), data = data.clean, mincut = 1)
## Number of terminal nodes:  22 
## Residual mean deviance:  0.256 = 78.4 / 306 
## Misclassification error rate: 0.064 = 21 / 328

Comparing the misclassification rates of those trees (0.0793, 0.0976 and 0.064 respectively) gives me more confidence in the choice of explanatory variables.
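
To put those rates in context, here is a small arithmetic sketch. It recomputes the three misclassification rates from the error counts reported by summary() and compares them with a naive rule that always predicts No, which would be wrong on the 60 rainy days out of 328. The labels forecast_9am, forecast_3pm and forecast_8pm are my own shorthand for the three trees, and all rates are measured on the same data the trees were fitted to, so they are likely to be optimistic.

```r
n_obs <- 328  # rows in data.clean

# Misclassified observations as reported by summary(mytree) for each tree.
# The names are illustrative labels, not objects from the analysis.
errors <- c(forecast_9am = 26, forecast_3pm = 32, forecast_8pm = 21)
rates <- errors / n_obs
round(rates, 4)

# Naive reference: always predict "No", which misses every rainy day.
baseline <- 60 / n_obs
round(baseline, 4)

# Error of each tree relative to the naive rule (below 1 means better).
round(rates / baseline, 2)
```

All three trees make fewer errors than the naive rule, and the tree that uses Sunshine and WindGustDir makes the fewest.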