Categorical variables
In the previous post, I outlined my preliminary analysis of the quantitative variables in the weather dataset available at:
http://www.biz.uiowa.edu/faculty/jledolter/datamining/dataexercises.html
In this post I continue the analysis, focusing on the categorical variables. Let us load the data.
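data <- read.csv("weather.csv", header=TRUE, stringsAsFactors=TRUE) # keep text columns as factors, as tree() needs below; R >= 4.0 no longer does this by default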
data.clean <- na.omit(data)
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday))
attach(data.clean)
colnames(data.clean)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainTomorrow"
The categorical variables of interest are WindGustDir, WindDir9am, and WindDir3pm. Let us see how relevant they are by tabulating each one against RainTomorrow and plotting the row proportions of the resulting cross table.
library(gmodels)
library(lattice)
wgd <- CrossTable(WindGustDir, RainTomorrow, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 328
##
##
## | RainTomorrow
## WindGustDir | No | Yes | Row Total |
## -------------|-----------|-----------|-----------|
## E | 31 | 3 | 34 |
## | 0.912 | 0.088 | 0.104 |
## | 0.116 | 0.050 | |
## | 0.095 | 0.009 | |
## -------------|-----------|-----------|-----------|
## ENE | 27 | 2 | 29 |
## | 0.931 | 0.069 | 0.088 |
## | 0.101 | 0.033 | |
## | 0.082 | 0.006 | |
## -------------|-----------|-----------|-----------|
## ESE | 17 | 6 | 23 |
## | 0.739 | 0.261 | 0.070 |
## | 0.063 | 0.100 | |
## | 0.052 | 0.018 | |
## -------------|-----------|-----------|-----------|
## N | 19 | 2 | 21 |
## | 0.905 | 0.095 | 0.064 |
## | 0.071 | 0.033 | |
## | 0.058 | 0.006 | |
## -------------|-----------|-----------|-----------|
## NE | 12 | 3 | 15 |
## | 0.800 | 0.200 | 0.046 |
## | 0.045 | 0.050 | |
## | 0.037 | 0.009 | |
## -------------|-----------|-----------|-----------|
## NNE | 7 | 0 | 7 |
## | 1.000 | 0.000 | 0.021 |
## | 0.026 | 0.000 | |
## | 0.021 | 0.000 | |
## -------------|-----------|-----------|-----------|
## NNW | 23 | 12 | 35 |
## | 0.657 | 0.343 | 0.107 |
## | 0.086 | 0.200 | |
## | 0.070 | 0.037 | |
## -------------|-----------|-----------|-----------|
## NW | 49 | 15 | 64 |
## | 0.766 | 0.234 | 0.195 |
## | 0.183 | 0.250 | |
## | 0.149 | 0.046 | |
## -------------|-----------|-----------|-----------|
## S | 18 | 3 | 21 |
## | 0.857 | 0.143 | 0.064 |
## | 0.067 | 0.050 | |
## | 0.055 | 0.009 | |
## -------------|-----------|-----------|-----------|
## SE | 11 | 0 | 11 |
## | 1.000 | 0.000 | 0.034 |
## | 0.041 | 0.000 | |
## | 0.034 | 0.000 | |
## -------------|-----------|-----------|-----------|
## SSE | 9 | 3 | 12 |
## | 0.750 | 0.250 | 0.037 |
## | 0.034 | 0.050 | |
## | 0.027 | 0.009 | |
## -------------|-----------|-----------|-----------|
## SSW | 2 | 2 | 4 |
## | 0.500 | 0.500 | 0.012 |
## | 0.007 | 0.033 | |
## | 0.006 | 0.006 | |
## -------------|-----------|-----------|-----------|
## SW | 1 | 2 | 3 |
## | 0.333 | 0.667 | 0.009 |
## | 0.004 | 0.033 | |
## | 0.003 | 0.006 | |
## -------------|-----------|-----------|-----------|
## W | 9 | 6 | 15 |
## | 0.600 | 0.400 | 0.046 |
## | 0.034 | 0.100 | |
## | 0.027 | 0.018 | |
## -------------|-----------|-----------|-----------|
## WNW | 31 | 1 | 32 |
## | 0.969 | 0.031 | 0.098 |
## | 0.116 | 0.017 | |
## | 0.095 | 0.003 | |
## -------------|-----------|-----------|-----------|
## WSW | 2 | 0 | 2 |
## | 1.000 | 0.000 | 0.006 |
## | 0.007 | 0.000 | |
## | 0.006 | 0.000 | |
## -------------|-----------|-----------|-----------|
## Column Total | 268 | 60 | 328 |
## | 0.817 | 0.183 | |
## -------------|-----------|-----------|-----------|
##
##
barchart(wgd$prop.row, stack=FALSE, auto.key=list(rectangles = TRUE, space = "top"))
The barchart above shows that when the wind gust comes from the West, South-West, or South-South-West, a larger share of days is followed by rain (row proportions 0.400, 0.667, and 0.500, against 0.183 overall). In other words, some WindGustDir directions are more informative than others for predicting rain tomorrow. Keep in mind that SW and SSW have only 3 and 4 observations, so their proportions are rough estimates.
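As a rough cross-check, the row proportions can be recomputed in base R and ranked, with the row size shown next to each proportion. This is only a sketch: the counts are typed in by hand from the CrossTable output above, and the names counts, row_prop, and ranked are illustrative, not part of the original analysis.

```r
# Counts typed in from the WindGustDir cross table (assumed transcribed correctly).
dirs <- c("E", "ENE", "ESE", "N", "NE", "NNE", "NNW", "NW",
          "S", "SE", "SSE", "SSW", "SW", "W", "WNW", "WSW")
no  <- c(31, 27, 17, 19, 12, 7, 23, 49, 18, 11, 9, 2, 1, 9, 31, 2)
yes <- c( 3,  2,  6,  2,  3, 0, 12, 15,  3,  0, 3, 2, 2, 6,  1, 0)
counts <- cbind(No = no, Yes = yes)
rownames(counts) <- dirs

# Row proportions: share of No/Yes within each gust direction.
row_prop <- prop.table(counts, margin = 1)

# Rank directions by the share of days followed by rain, keeping the row size n.
ranked <- data.frame(direction = dirs,
                     n = as.vector(rowSums(counts)),
                     p_rain = round(as.vector(row_prop[, "Yes"]), 3))
ranked <- ranked[order(-ranked$p_rain), ]
print(head(ranked, 5), row.names = FALSE)
```

Showing n beside p_rain makes it easier to see which high proportions rest on only a handful of days.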
wd9am <- CrossTable(WindDir9am, RainTomorrow, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 328
##
##
## | RainTomorrow
## WindDir9am | No | Yes | Row Total |
## -------------|-----------|-----------|-----------|
## E | 16 | 4 | 20 |
## | 0.800 | 0.200 | 0.061 |
## | 0.060 | 0.067 | |
## | 0.049 | 0.012 | |
## -------------|-----------|-----------|-----------|
## ENE | 7 | 1 | 8 |
## | 0.875 | 0.125 | 0.024 |
## | 0.026 | 0.017 | |
## | 0.021 | 0.003 | |
## -------------|-----------|-----------|-----------|
## ESE | 28 | 1 | 29 |
## | 0.966 | 0.034 | 0.088 |
## | 0.104 | 0.017 | |
## | 0.085 | 0.003 | |
## -------------|-----------|-----------|-----------|
## N | 18 | 12 | 30 |
## | 0.600 | 0.400 | 0.091 |
## | 0.067 | 0.200 | |
## | 0.055 | 0.037 | |
## -------------|-----------|-----------|-----------|
## NE | 2 | 2 | 4 |
## | 0.500 | 0.500 | 0.012 |
## | 0.007 | 0.033 | |
## | 0.006 | 0.006 | |
## -------------|-----------|-----------|-----------|
## NNE | 6 | 1 | 7 |
## | 0.857 | 0.143 | 0.021 |
## | 0.022 | 0.017 | |
## | 0.018 | 0.003 | |
## -------------|-----------|-----------|-----------|
## NNW | 29 | 7 | 36 |
## | 0.806 | 0.194 | 0.110 |
## | 0.108 | 0.117 | |
## | 0.088 | 0.021 | |
## -------------|-----------|-----------|-----------|
## NW | 22 | 8 | 30 |
## | 0.733 | 0.267 | 0.091 |
## | 0.082 | 0.133 | |
## | 0.067 | 0.024 | |
## -------------|-----------|-----------|-----------|
## S | 24 | 2 | 26 |
## | 0.923 | 0.077 | 0.079 |
## | 0.090 | 0.033 | |
## | 0.073 | 0.006 | |
## -------------|-----------|-----------|-----------|
## SE | 41 | 6 | 47 |
## | 0.872 | 0.128 | 0.143 |
## | 0.153 | 0.100 | |
## | 0.125 | 0.018 | |
## -------------|-----------|-----------|-----------|
## SSE | 30 | 8 | 38 |
## | 0.789 | 0.211 | 0.116 |
## | 0.112 | 0.133 | |
## | 0.091 | 0.024 | |
## -------------|-----------|-----------|-----------|
## SSW | 13 | 4 | 17 |
## | 0.765 | 0.235 | 0.052 |
## | 0.049 | 0.067 | |
## | 0.040 | 0.012 | |
## -------------|-----------|-----------|-----------|
## SW | 6 | 1 | 7 |
## | 0.857 | 0.143 | 0.021 |
## | 0.022 | 0.017 | |
## | 0.018 | 0.003 | |
## -------------|-----------|-----------|-----------|
## W | 8 | 0 | 8 |
## | 1.000 | 0.000 | 0.024 |
## | 0.030 | 0.000 | |
## | 0.024 | 0.000 | |
## -------------|-----------|-----------|-----------|
## WNW | 14 | 2 | 16 |
## | 0.875 | 0.125 | 0.049 |
## | 0.052 | 0.033 | |
## | 0.043 | 0.006 | |
## -------------|-----------|-----------|-----------|
## WSW | 4 | 1 | 5 |
## | 0.800 | 0.200 | 0.015 |
## | 0.015 | 0.017 | |
## | 0.012 | 0.003 | |
## -------------|-----------|-----------|-----------|
## Column Total | 268 | 60 | 328 |
## | 0.817 | 0.183 | |
## -------------|-----------|-----------|-----------|
##
##
barchart(wd9am$prop.row, stack=FALSE, auto.key=list(rectangles = TRUE, space = "top"))
Similarly, when WindDir9am is North-East or North, rain tomorrow is more likely (row proportions 0.500 and 0.400), although NE has only 4 observations.
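To get a feeling for how much weight these proportions can bear, one option is an exact binomial confidence interval for each direction, compared with the overall rain rate of 60/328. The sketch below uses binom.test from base R with the N and NE counts read off the table above; the variable names are illustrative, and this check is an addition, not part of the original analysis.

```r
# Overall share of days followed by rain (column total of the cross table).
overall <- 60 / 328

# Exact (Clopper-Pearson) 95% confidence intervals for the rain proportion.
ci_N  <- binom.test(12, 30)$conf.int  # WindDir9am = N:  12 rainy out of 30
ci_NE <- binom.test(2, 4)$conf.int    # WindDir9am = NE:  2 rainy out of 4

print(round(c(overall = overall,
              N_lower = ci_N[1],   N_upper = ci_N[2],
              NE_lower = ci_NE[1], NE_upper = ci_NE[2]), 3))

# Does the interval exclude the overall rate?
print(c(N = overall < ci_N[1], NE = overall < ci_NE[1]))
```

With 30 observations the interval for N stays above the overall rate, while the interval for NE, built on 4 observations, is wide enough to contain it.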
wd3pm <- CrossTable(WindDir3pm, RainTomorrow, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 328
##
##
## | RainTomorrow
## WindDir3pm | No | Yes | Row Total |
## -------------|-----------|-----------|-----------|
## E | 16 | 1 | 17 |
## | 0.941 | 0.059 | 0.052 |
## | 0.060 | 0.017 | |
## | 0.049 | 0.003 | |
## -------------|-----------|-----------|-----------|
## ENE | 7 | 3 | 10 |
## | 0.700 | 0.300 | 0.030 |
## | 0.026 | 0.050 | |
## | 0.021 | 0.009 | |
## -------------|-----------|-----------|-----------|
## ESE | 23 | 3 | 26 |
## | 0.885 | 0.115 | 0.079 |
## | 0.086 | 0.050 | |
## | 0.070 | 0.009 | |
## -------------|-----------|-----------|-----------|
## N | 23 | 4 | 27 |
## | 0.852 | 0.148 | 0.082 |
## | 0.086 | 0.067 | |
## | 0.070 | 0.012 | |
## -------------|-----------|-----------|-----------|
## NE | 11 | 2 | 13 |
## | 0.846 | 0.154 | 0.040 |
## | 0.041 | 0.033 | |
## | 0.034 | 0.006 | |
## -------------|-----------|-----------|-----------|
## NNE | 11 | 3 | 14 |
## | 0.786 | 0.214 | 0.043 |
## | 0.041 | 0.050 | |
## | 0.034 | 0.009 | |
## -------------|-----------|-----------|-----------|
## NNW | 31 | 8 | 39 |
## | 0.795 | 0.205 | 0.119 |
## | 0.116 | 0.133 | |
## | 0.095 | 0.024 | |
## -------------|-----------|-----------|-----------|
## NW | 41 | 14 | 55 |
## | 0.745 | 0.255 | 0.168 |
## | 0.153 | 0.233 | |
## | 0.125 | 0.043 | |
## -------------|-----------|-----------|-----------|
## S | 11 | 2 | 13 |
## | 0.846 | 0.154 | 0.040 |
## | 0.041 | 0.033 | |
## | 0.034 | 0.006 | |
## -------------|-----------|-----------|-----------|
## SE | 10 | 1 | 11 |
## | 0.909 | 0.091 | 0.034 |
## | 0.037 | 0.017 | |
## | 0.030 | 0.003 | |
## -------------|-----------|-----------|-----------|
## SSE | 6 | 1 | 7 |
## | 0.857 | 0.143 | 0.021 |
## | 0.022 | 0.017 | |
## | 0.018 | 0.003 | |
## -------------|-----------|-----------|-----------|
## SSW | 5 | 0 | 5 |
## | 1.000 | 0.000 | 0.015 |
## | 0.019 | 0.000 | |
## | 0.015 | 0.000 | |
## -------------|-----------|-----------|-----------|
## SW | 4 | 0 | 4 |
## | 1.000 | 0.000 | 0.012 |
## | 0.015 | 0.000 | |
## | 0.012 | 0.000 | |
## -------------|-----------|-----------|-----------|
## W | 18 | 6 | 24 |
## | 0.750 | 0.250 | 0.073 |
## | 0.067 | 0.100 | |
## | 0.055 | 0.018 | |
## -------------|-----------|-----------|-----------|
## WNW | 43 | 10 | 53 |
## | 0.811 | 0.189 | 0.162 |
## | 0.160 | 0.167 | |
## | 0.131 | 0.030 | |
## -------------|-----------|-----------|-----------|
## WSW | 8 | 2 | 10 |
## | 0.800 | 0.200 | 0.030 |
## | 0.030 | 0.033 | |
## | 0.024 | 0.006 | |
## -------------|-----------|-----------|-----------|
## Column Total | 268 | 60 | 328 |
## | 0.817 | 0.183 | |
## -------------|-----------|-----------|-----------|
##
##
barchart(wd3pm$prop.row, stack=FALSE, auto.key=list(rectangles = TRUE, space = "top"))
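Unlike the previous barcharts, this one suggests that WindDir3pm is not that relevant for predicting rain tomorrow. As a further check, I run Pearson's Chi-squared test on each categorical variable. The p-values broadly confirm the reading of the barcharts: WindGustDir is clearly associated with RainTomorrow (p = 0.0019), WindDir9am is borderline (p = 0.068, not significant at the 5% level), and WindDir3pm shows no evidence of association (p = 0.86). The calls are wrapped in suppressWarnings() because several cells have small expected counts, for which R warns that the Chi-squared approximation may be incorrect.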
suppressWarnings(chisq.test(WindGustDir, RainTomorrow))
##
## Pearson's Chi-squared test
##
## data: WindGustDir and RainTomorrow
## X-squared = 35.83, df = 15, p-value = 0.001868
suppressWarnings(chisq.test(WindDir9am, RainTomorrow))
##
## Pearson's Chi-squared test
##
## data: WindDir9am and RainTomorrow
## X-squared = 23.81, df = 15, p-value = 0.06834
suppressWarnings(chisq.test(WindDir3pm, RainTomorrow))
##
## Pearson's Chi-squared test
##
## data: WindDir3pm and RainTomorrow
## X-squared = 9.403, df = 15, p-value = 0.8555
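Since many directions have few observations, the asymptotic p-values above are approximate. One possible check, sketched below for the borderline WindDir9am case, is to count the cells with an expected count below 5 and to ask chisq.test for a Monte Carlo p-value via its simulate.p.value argument. The counts are typed in by hand from the WindDir9am cross table; the seed and the number of replicates B are arbitrary choices of this sketch.

```r
# Counts typed in from the WindDir9am cross table (assumed transcribed correctly).
dirs <- c("E", "ENE", "ESE", "N", "NE", "NNE", "NNW", "NW",
          "S", "SE", "SSE", "SSW", "SW", "W", "WNW", "WSW")
no  <- c(16, 7, 28, 18, 2, 6, 29, 22, 24, 41, 30, 13, 6, 8, 14, 4)
yes <- c( 4, 1,  1, 12, 2, 1,  7,  8,  2,  6,  8,  4, 1, 0,  2, 1)
counts <- cbind(No = no, Yes = yes)
rownames(counts) <- dirs

# Asymptotic test on the table of counts; the warning about small
# expected counts is silenced here, as in the calls above.
asym <- suppressWarnings(chisq.test(counts))

# How many of the 32 cells have an expected count below 5?
n_small <- sum(asym$expected < 5)

# Monte Carlo p-value: tables are simulated with the same row and column
# totals, so the result does not rely on the Chi-squared approximation.
set.seed(1)  # arbitrary seed, for reproducibility only
mc <- chisq.test(counts, simulate.p.value = TRUE, B = 10000)

print(c(X_squared = unname(asym$statistic),
        p_asymptotic = asym$p.value,
        p_simulated = mc$p.value,
        cells_expected_below_5 = n_small))
```

If the simulated p-value lands close to the asymptotic one, the borderline reading of WindDir9am stands; a large gap would suggest relying on the simulated value instead.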
Now, some considerations about the availability of WindGustDir. Can we expect to have it available at 9am? Possibly, but gusts may also occur in the afternoon and after 3pm. So I think I can safely include WindGustDir in my 8pm forecast. For the 9am forecast, I will include WindDir9am and evaluate whether there is any benefit in also including it among the 3pm explanatory variables. Below is a quick preview of the misclassification rates of the classification trees.
library(tree)
mytree <- tree(RainTomorrow~(Humidity9am+Pressure9am+WindDir9am), data=data.clean, mincut=1)
summary(mytree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity9am + Pressure9am + WindDir9am),
## data = data.clean, mincut = 1)
## Number of terminal nodes: 22
## Residual mean deviance: 0.377 = 115 / 306
## Misclassification error rate: 0.0793 = 26 / 328
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+WindDir9am), data=data.clean, mincut=1)
summary(mytree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + WindDir9am),
## data = data.clean, mincut = 1)
## Number of terminal nodes: 23
## Residual mean deviance: 0.356 = 108 / 305
## Misclassification error rate: 0.0976 = 32 / 328
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine+WindGustDir), data=data.clean, mincut=1)
summary(mytree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + Sunshine +
## WindGustDir), data = data.clean, mincut = 1)
## Number of terminal nodes: 22
## Residual mean deviance: 0.256 = 78.4 / 306
## Misclassification error rate: 0.064 = 21 / 328
By comparing the misclassification rates of those trees, I can gain confidence in the choice of explanatory variables.
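Two caveats help when reading these numbers. The rates printed by summary() are computed on the same 328 rows used to grow the trees, and a useful yardstick is the error of always predicting "No", which is 60/328. The sketch below puts the three rates next to that baseline and outlines a simple holdout evaluation. The helpers holdout_split and misclass_rate, the 30% test fraction, and the seed are hypothetical choices of this sketch, not part of the original analysis; the last part runs only if weather.csv and the tree package are available.

```r
# Training-set error rates reported by summary(mytree) above.
train_err <- c(tree_9am = 26, tree_3pm_dir9am = 32, tree_gust = 21) / 328

# Error of the trivial rule "always predict No": all 60 rainy days are missed.
baseline_err <- 60 / 328
print(round(c(baseline = baseline_err, train_err), 4))

# Hypothetical helper: split row indices into training and test sets.
holdout_split <- function(n, test_frac = 0.3, seed = 1) {
  set.seed(seed)
  test <- sort(sample.int(n, size = round(n * test_frac)))
  list(train = setdiff(seq_len(n), test), test = test)
}

# Hypothetical helper: share of predictions that differ from the truth.
misclass_rate <- function(actual, predicted) {
  mean(as.character(actual) != as.character(predicted))
}

# Holdout evaluation of the WindGustDir tree; skipped when the data file
# or the tree package is missing.
if (file.exists("weather.csv") && requireNamespace("tree", quietly = TRUE)) {
  data.clean <- na.omit(read.csv("weather.csv", header = TRUE,
                                 stringsAsFactors = TRUE))
  idx <- holdout_split(nrow(data.clean))
  fit <- tree::tree(RainTomorrow ~ Humidity3pm + Pressure3pm + Sunshine + WindGustDir,
                    data = data.clean[idx$train, ], mincut = 1)
  pred <- predict(fit, newdata = data.clean[idx$test, ], type = "class")
  print(misclass_rate(data.clean$RainTomorrow[idx$test], pred))
}
```

A holdout rate noticeably above the training rate would indicate that trees grown with mincut=1 are fitting noise in the training rows.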