Building the model
Giorgio
Saturday, September 27, 2014
In the previous two posts, I identified the explanatory variables, both quantitative and categorical, that I am going to use to build a model for predicting weather. More precisely, I identified three sets of explanatory variables, the following ones:
set#1: {Humidity9am, Pressure9am, WindDir9am} set#2: {Humidity3pm, Pressure3pm, WindDir9am} set#3: {Humidity3pm, Pressure3pm, Sunshine, WindGustDir}
The reason for identifying three sets of explanatory variables is determined by the need of being able to provide weather forecasts on three deadlines during the day: 10am, 4pm, 10pm. That need must be checked against the availability of specific explanatory variables. For example, at 10am, the Humidity3pm is obviously not available; at 3pm, some Sunshine or WindGustDir measurement are still going on.
So, let us reload the original dataset available at:
http://www.biz.uiowa.edu/faculty/jledolter/datamining/dataexercises.html
and run the already introduced preliminary preprocessing.
data <- read.csv("weather.csv", header=TRUE)
data.clean <- na.omit(data)
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday))
attach(data.clean)
colnames(data.clean)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainTomorrow"
Specifically, I am going to build the model by taking advantage of classification trees. I do that for their simplicity in representing graphically the final result. I will explore further models on next posts. So, first let us generate the training and test sets.
set.seed(1)
n = nrow(data.clean)
ntrain = n*0.7
ntest = n - ntrain
trainset = sample(1:n, ntrain)
wtrain <- data.clean[trainset,]
wtest <- data.clean[-trainset,]
Considering the first variables set, I am going to build the classification tree and further pruning in order to get the tree with the optimal deviance. In addition to that, snipping is performed to remove nodes which do not drive any discrimination among yes/no scenarios, hence without increasing misclassification rate.
- First explanatory variables set
library(tree)
my.tree <- tree(RainTomorrow~(Humidity9am+Pressure9am+WindDir9am), data=wtrain, mincut=1)
summary(my.tree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity9am + Pressure9am + WindDir9am),
## data = wtrain, mincut = 1)
## Number of terminal nodes: 18
## Residual mean deviance: 0.341 = 71.9 / 211
## Misclassification error rate: 0.083 = 19 / 229
my.pruned.tree <- prune.tree(my.tree, method="misclass")
data.frame(my.pruned.tree$size, my.pruned.tree$dev)
## my.pruned.tree.size my.pruned.tree.dev
## 1 18 19
## 2 14 19
## 3 11 21
## 4 7 25
## 5 3 31
## 6 1 37
From the table above, it appears the best size being equal to 14. Also cross-validation confirms that.
cv.tree(my.tree, method="misclass")
## $size
## [1] 18 14 11 7 3 1
##
## $dev
## [1] 51 51 44 43 42 42
##
## $k
## [1] -Inf 0.0000 0.6667 1.0000 1.5000 3.0000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
Hence, I am going to prune my classification tree based on that.
my.best.tree.1 <- prune.tree(my.tree, best=14)
my.best.tree.1
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )
## 4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )
## 8) WindDir9am: NNE,SW,W,WNW,WSW 14 0 No ( 1.00 0.00 ) *
## 9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74 90 No ( 0.68 0.32 )
## 18) Humidity9am < 81.5 65 80 No ( 0.72 0.28 )
## 36) WindDir9am: E,ENE,ESE,NNW,NW,S,SSE,SSW 49 50 No ( 0.80 0.20 )
## 72) Pressure9am < 1013.2 24 30 No ( 0.67 0.33 ) *
## 73) Pressure9am > 1013.2 25 10 No ( 0.92 0.08 )
## 146) Pressure9am < 1017.15 19 0 No ( 1.00 0.00 ) *
## 147) Pressure9am > 1017.15 6 8 No ( 0.67 0.33 ) *
## 37) WindDir9am: N,SE 16 20 No ( 0.50 0.50 )
## 74) Humidity9am < 77 13 20 Yes ( 0.38 0.62 ) *
## 75) Humidity9am > 77 3 0 No ( 1.00 0.00 ) *
## 19) Humidity9am > 81.5 9 10 Yes ( 0.33 0.67 ) *
## 5) Humidity9am > 91.5 6 0 Yes ( 0.00 1.00 ) *
## 3) Pressure9am > 1018.05 135 60 No ( 0.95 0.05 )
## 6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: NE,NNE,NNW,SE,SSW 56 40 No ( 0.88 0.12 )
## 14) Humidity9am < 93.5 54 30 No ( 0.91 0.09 )
## 28) Humidity9am < 74.5 32 9 No ( 0.97 0.03 )
## 56) WindDir9am: NNE,NNW,SE,SSW 30 0 No ( 1.00 0.00 ) *
## 57) WindDir9am: NE 2 3 No ( 0.50 0.50 ) *
## 29) Humidity9am > 74.5 22 20 No ( 0.82 0.18 )
## 58) Pressure9am < 1019.5 3 4 Yes ( 0.33 0.67 ) *
## 59) Pressure9am > 1019.5 19 10 No ( 0.89 0.11 )
## 118) WindDir9am: NE,NNW,SE,SSW 18 8 No ( 0.94 0.06 ) *
## 119) WindDir9am: NNE 1 0 Yes ( 0.00 1.00 ) *
## 15) Humidity9am > 93.5 2 0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.1)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(74L, 72L))
## Number of terminal nodes: 15
## Residual mean deviance: 0.38 = 81.3 / 214
## Misclassification error rate: 0.0917 = 21 / 229
From the classification tree printout, it appears that I can snip nodes 28 and 36 as they do not provide any significant discrimination in terms of prediction.
my.snip.tree.1 <- snip.tree(my.best.tree.1, nodes=c(28,36))
summary.1 <- summary(my.snip.tree.1)
summary.1
##
## Classification tree:
## snip.tree(tree = my.best.tree.1, nodes = c(28L, 36L))
## Number of terminal nodes: 12
## Residual mean deviance: 0.455 = 98.8 / 217
## Misclassification error rate: 0.0917 = 21 / 229
plot(my.snip.tree.1)
text(my.snip.tree.1, digits=2)
my.prediction <- predict(my.snip.tree.1, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.1 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.1
## [1] 0.0917
It is remarkable that the training set error rate has not changed after pruning and snipping.
my.prediction <- predict(my.snip.tree.1, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.1 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.1
## [1] 0.1919
What if I build a classification tree with 11 nodes. Does the final misclassification rate change that much ? Let us try it out.
my.best.tree.2 <- prune.tree(my.tree, best=11)
my.best.tree.2
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )
## 4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )
## 8) WindDir9am: NNE,SW,W,WNW,WSW 14 0 No ( 1.00 0.00 ) *
## 9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74 90 No ( 0.68 0.32 )
## 18) Humidity9am < 81.5 65 80 No ( 0.72 0.28 )
## 36) WindDir9am: E,ENE,ESE,NNW,NW,S,SSE,SSW 49 50 No ( 0.80 0.20 )
## 72) Pressure9am < 1013.2 24 30 No ( 0.67 0.33 ) *
## 73) Pressure9am > 1013.2 25 10 No ( 0.92 0.08 )
## 146) Pressure9am < 1017.15 19 0 No ( 1.00 0.00 ) *
## 147) Pressure9am > 1017.15 6 8 No ( 0.67 0.33 ) *
## 37) WindDir9am: N,SE 16 20 No ( 0.50 0.50 )
## 74) Humidity9am < 77 13 20 Yes ( 0.38 0.62 ) *
## 75) Humidity9am > 77 3 0 No ( 1.00 0.00 ) *
## 19) Humidity9am > 81.5 9 10 Yes ( 0.33 0.67 ) *
## 5) Humidity9am > 91.5 6 0 Yes ( 0.00 1.00 ) *
## 3) Pressure9am > 1018.05 135 60 No ( 0.95 0.05 )
## 6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: NE,NNE,NNW,SE,SSW 56 40 No ( 0.88 0.12 )
## 14) Humidity9am < 93.5 54 30 No ( 0.91 0.09 ) *
## 15) Humidity9am > 93.5 2 0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.2)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(74L, 72L, 14L))
## Number of terminal nodes: 11
## Residual mean deviance: 0.46 = 100 / 218
## Misclassification error rate: 0.1 = 23 / 229
plot(my.best.tree.2)
text(my.best.tree.2, digits=2)
my.snip.tree.2 <- snip.tree(my.best.tree.2, nodes=c(36))
my.snip.tree.2
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )
## 4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )
## 8) WindDir9am: NNE,SW,W,WNW,WSW 14 0 No ( 1.00 0.00 ) *
## 9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74 90 No ( 0.68 0.32 )
## 18) Humidity9am < 81.5 65 80 No ( 0.72 0.28 )
## 36) WindDir9am: E,ENE,ESE,NNW,NW,S,SSE,SSW 49 50 No ( 0.80 0.20 ) *
## 37) WindDir9am: N,SE 16 20 No ( 0.50 0.50 )
## 74) Humidity9am < 77 13 20 Yes ( 0.38 0.62 ) *
## 75) Humidity9am > 77 3 0 No ( 1.00 0.00 ) *
## 19) Humidity9am > 81.5 9 10 Yes ( 0.33 0.67 ) *
## 5) Humidity9am > 91.5 6 0 Yes ( 0.00 1.00 ) *
## 3) Pressure9am > 1018.05 135 60 No ( 0.95 0.05 )
## 6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: NE,NNE,NNW,SE,SSW 56 40 No ( 0.88 0.12 )
## 14) Humidity9am < 93.5 54 30 No ( 0.91 0.09 ) *
## 15) Humidity9am > 93.5 2 0 Yes ( 0.00 1.00 ) *
summary.2 <- summary(my.snip.tree.2)
summary.2
##
## Classification tree:
## snip.tree(tree = my.best.tree.2, nodes = 36L)
## Number of terminal nodes: 9
## Residual mean deviance: 0.508 = 112 / 220
## Misclassification error rate: 0.1 = 23 / 229
plot(my.snip.tree.2)
text(my.snip.tree.2, digits=2)
my.prediction <- predict(my.snip.tree.2, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.2 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.2
## [1] 0.1004
Again, the training error rate has not changed after pruning and snipping. The test error rate is as following:
my.prediction <- predict(my.snip.tree.2, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.2 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.2
## [1] 0.1919
Surprise! The test error rate for the second classification tree is the same as the first one. However, the residual mean deviance is slight higher for this second tree. Since the first classification tree uses 12 nodes, my choice is for the second one that achieve the same misclassification rate with only 9 nodes.
It is time to try with just 7 nodes to observe an increase in the misclassification rate, in the case.
my.best.tree.3 <- prune.tree(my.tree, best=7)
my.best.tree.3
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )
## 4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )
## 8) WindDir9am: NNE,SW,W,WNW,WSW 14 0 No ( 1.00 0.00 ) *
## 9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74 90 No ( 0.68 0.32 )
## 18) Humidity9am < 81.5 65 80 No ( 0.72 0.28 )
## 36) WindDir9am: E,ENE,ESE,NNW,NW,S,SSE,SSW 49 50 No ( 0.80 0.20 )
## 72) Pressure9am < 1013.2 24 30 No ( 0.67 0.33 ) *
## 73) Pressure9am > 1013.2 25 10 No ( 0.92 0.08 )
## 146) Pressure9am < 1017.15 19 0 No ( 1.00 0.00 ) *
## 147) Pressure9am > 1017.15 6 8 No ( 0.67 0.33 ) *
## 37) WindDir9am: N,SE 16 20 No ( 0.50 0.50 ) *
## 19) Humidity9am > 81.5 9 10 Yes ( 0.33 0.67 ) *
## 5) Humidity9am > 91.5 6 0 Yes ( 0.00 1.00 ) *
## 3) Pressure9am > 1018.05 135 60 No ( 0.95 0.05 )
## 6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: NE,NNE,NNW,SE,SSW 56 40 No ( 0.88 0.12 )
## 14) Humidity9am < 93.5 54 30 No ( 0.91 0.09 ) *
## 15) Humidity9am > 93.5 2 0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.3)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(72L, 14L, 37L))
## Number of terminal nodes: 10
## Residual mean deviance: 0.48 = 105 / 219
## Misclassification error rate: 0.114 = 26 / 229
Some snipping is possible.
my.snip.tree.3 <- snip.tree(my.best.tree.3, nodes=c(18,36,73))
my.snip.tree.3
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )
## 4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )
## 8) WindDir9am: NNE,SW,W,WNW,WSW 14 0 No ( 1.00 0.00 ) *
## 9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74 90 No ( 0.68 0.32 )
## 18) Humidity9am < 81.5 65 80 No ( 0.72 0.28 ) *
## 19) Humidity9am > 81.5 9 10 Yes ( 0.33 0.67 ) *
## 5) Humidity9am > 91.5 6 0 Yes ( 0.00 1.00 ) *
## 3) Pressure9am > 1018.05 135 60 No ( 0.95 0.05 )
## 6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: NE,NNE,NNW,SE,SSW 56 40 No ( 0.88 0.12 )
## 14) Humidity9am < 93.5 54 30 No ( 0.91 0.09 ) *
## 15) Humidity9am > 93.5 2 0 Yes ( 0.00 1.00 ) *
summary.3 <- summary(my.snip.tree.3)
summary.3
##
## Classification tree:
## snip.tree(tree = my.best.tree.3, nodes = 18L)
## Number of terminal nodes: 7
## Residual mean deviance: 0.547 = 121 / 222
## Misclassification error rate: 0.114 = 26 / 229
plot(my.snip.tree.3)
text(my.snip.tree.3, digits=2)
my.prediction <- predict(my.snip.tree.3, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.3 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.3
## [1] 0.1135
my.prediction <- predict(my.snip.tree.3, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.3 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.3
## [1] 0.2121
Resuming up.
model.1 <- c(summary.1$size, summary.1$dev/summary.1$df, summary.1$misclass[1]/summary.1$misclass[2], train.error.rate.1, test.error.rate.1)
model.2 <- c(summary.2$size, summary.2$dev/summary.2$df, summary.2$misclass[1]/summary.2$misclass[2], train.error.rate.2, test.error.rate.2)
model.3 <- c(summary.3$size, summary.3$dev/summary.3$df, summary.3$misclass[1]/summary.3$misclass[2], train.error.rate.3, test.error.rate.3)
model.result <- data.frame(rbind(model.1, model.2, model.3))
colnames(model.result) <- c("size", "res. mean deviance", "mis. rate", "train error rate", "test error rate")
model.result
## size res. mean deviance mis. rate train error rate test error rate
## model.1 12 0.4554 0.0917 0.0917 0.1919
## model.2 9 0.5077 0.1004 0.1004 0.1919
## model.3 7 0.5472 0.1135 0.1135 0.2121
write.csv(model.result, file = "Set1.model.csv", row.names=F)
I am going to do the same kind of analysis for the remaining two explanatory variables sets.
- Second explanatory variables set.
my.tree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+WindDir9am), data=wtrain, mincut=1)
summary(my.tree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + WindDir9am),
## data = wtrain, mincut = 1)
## Number of terminal nodes: 21
## Residual mean deviance: 0.257 = 53.5 / 208
## Misclassification error rate: 0.0568 = 13 / 229
my.pruned.tree.1 <- prune.tree(my.tree, method="misclass")
data.frame(my.pruned.tree.1$size, my.pruned.tree.1$dev)
## my.pruned.tree.1.size my.pruned.tree.1.dev
## 1 21 13
## 2 18 13
## 3 12 15
## 4 4 23
## 5 1 37
cv.tree(my.tree, method="misclass")
## $size
## [1] 21 18 12 4 1
##
## $dev
## [1] 46 46 46 44 38
##
## $k
## [1] -Inf 0.0000 0.3333 1.0000 4.6667
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
This time, the cross-validation gives a different result with respect to pruning tree. It results to be 12 the size where minimum residual deviance occurs instead of 18 as the pruning tree based on misclassification does.
my.best.tree.1 <- prune.tree(my.tree, best=18)
my.best.tree.1
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 )
## 10) Humidity3pm < 80 12 20 Yes ( 0.33 0.67 )
## 20) WindDir9am: NNW,NW,S,SSE 8 10 No ( 0.50 0.50 ) *
## 21) WindDir9am: ENE,N 4 0 Yes ( 0.00 1.00 ) *
## 11) Humidity3pm > 80 5 0 Yes ( 0.00 1.00 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 )
## 116) WindDir9am: E,ESE,NE,SE 14 10 No ( 0.79 0.21 )
## 232) Pressure3pm < 1015.55 11 7 No ( 0.91 0.09 ) *
## 233) Pressure3pm > 1015.55 3 4 Yes ( 0.33 0.67 ) *
## 117) WindDir9am: N,SSW 6 8 Yes ( 0.33 0.67 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 )
## 30) WindDir9am: E,ESE,N,SSW 35 0 No ( 1.00 0.00 ) *
## 31) WindDir9am: NNE,NNW,SE 40 30 No ( 0.90 0.10 )
## 62) Humidity3pm < 38 16 0 No ( 1.00 0.00 ) *
## 63) Humidity3pm > 38 24 20 No ( 0.83 0.17 )
## 126) WindDir9am: NNW,SE 23 20 No ( 0.87 0.13 )
## 252) Humidity3pm < 47.5 9 0 No ( 1.00 0.00 ) *
## 253) Humidity3pm > 47.5 14 10 No ( 0.79 0.21 ) *
## 127) WindDir9am: NNE 1 0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.1)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 253L))
## Number of terminal nodes: 18
## Residual mean deviance: 0.289 = 61 / 211
## Misclassification error rate: 0.0611 = 14 / 229
my.snip.tree.1 <- snip.tree(my.best.tree.1, nodes=c(126))
my.snip.tree.1
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 )
## 10) Humidity3pm < 80 12 20 Yes ( 0.33 0.67 )
## 20) WindDir9am: NNW,NW,S,SSE 8 10 No ( 0.50 0.50 ) *
## 21) WindDir9am: ENE,N 4 0 Yes ( 0.00 1.00 ) *
## 11) Humidity3pm > 80 5 0 Yes ( 0.00 1.00 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 )
## 116) WindDir9am: E,ESE,NE,SE 14 10 No ( 0.79 0.21 )
## 232) Pressure3pm < 1015.55 11 7 No ( 0.91 0.09 ) *
## 233) Pressure3pm > 1015.55 3 4 Yes ( 0.33 0.67 ) *
## 117) WindDir9am: N,SSW 6 8 Yes ( 0.33 0.67 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 )
## 30) WindDir9am: E,ESE,N,SSW 35 0 No ( 1.00 0.00 ) *
## 31) WindDir9am: NNE,NNW,SE 40 30 No ( 0.90 0.10 )
## 62) Humidity3pm < 38 16 0 No ( 1.00 0.00 ) *
## 63) Humidity3pm > 38 24 20 No ( 0.83 0.17 )
## 126) WindDir9am: NNW,SE 23 20 No ( 0.87 0.13 ) *
## 127) WindDir9am: NNE 1 0 Yes ( 0.00 1.00 ) *
summary.1 <- summary(my.snip.tree.1)
summary.1
##
## Classification tree:
## snip.tree(tree = my.best.tree.1, nodes = 126L)
## Number of terminal nodes: 17
## Residual mean deviance: 0.303 = 64.3 / 212
## Misclassification error rate: 0.0611 = 14 / 229
plot(my.snip.tree.1)
text(my.snip.tree.1, digits=2)
my.prediction <- predict(my.snip.tree.1, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.1 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.1
## [1] 0.06114
my.prediction <- predict(my.snip.tree.1, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.1 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.1
## [1] 0.2121
Let us building the model with tree size equal to 12.
my.best.tree.2 <- prune.tree(my.tree, best=12)
my.best.tree.2
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 )
## 30) WindDir9am: E,ESE,N,SSW 35 0 No ( 1.00 0.00 ) *
## 31) WindDir9am: NNE,NNW,SE 40 30 No ( 0.90 0.10 )
## 62) Humidity3pm < 38 16 0 No ( 1.00 0.00 ) *
## 63) Humidity3pm > 38 24 20 No ( 0.83 0.17 ) *
summary(my.best.tree.2)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 5L, 63L, 58L))
## Number of terminal nodes: 12
## Residual mean deviance: 0.384 = 83.3 / 217
## Misclassification error rate: 0.0786 = 18 / 229
plot(my.best.tree.2)
text(my.best.tree.2, digits=2)
my.snip.tree.2 <- snip.tree(my.best.tree.2, nodes=c(15))
my.snip.tree.2
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 ) *
summary.2 <- summary(my.snip.tree.2)
summary.2
##
## Classification tree:
## snip.tree(tree = my.best.tree.2, nodes = 15L)
## Number of terminal nodes: 10
## Residual mean deviance: 0.424 = 92.9 / 219
## Misclassification error rate: 0.0786 = 18 / 229
plot(my.snip.tree.2)
text(my.snip.tree.2, digits=2)
my.prediction <- predict(my.snip.tree.2, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.2 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.2
## [1] 0.0786
my.prediction <- predict(my.snip.tree.2, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.2 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.2
## [1] 0.2121
Again, the misclassification rate is equal to the first classification tree one, with a slight higher residual mean deviance. Let us see with just four nodes.
my.best.tree.3 <- prune.tree(my.tree, best=4)
my.best.tree.3
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 ) *
summary(my.best.tree.3)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(5L, 8L, 7L))
## Number of terminal nodes: 5
## Residual mean deviance: 0.583 = 131 / 224
## Misclassification error rate: 0.1 = 23 / 229
plot(my.best.tree.3)
text(my.best.tree.3, digits=2)
my.snip.tree.3 <- snip.tree(my.best.tree.3, nodes=c(3))
my.snip.tree.3
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 ) *
summary.3 <- summary(my.snip.tree.3)
summary.3
##
## Classification tree:
## snip.tree(tree = my.best.tree.3, nodes = 3L)
## Number of terminal nodes: 4
## Residual mean deviance: 0.644 = 145 / 225
## Misclassification error rate: 0.1 = 23 / 229
plot(my.snip.tree.3)
text(my.snip.tree.3, digits=2)
my.prediction <- predict(my.snip.tree.3, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.3 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.3
## [1] 0.1004
my.prediction <- predict(my.snip.tree.3, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.3 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.3
## [1] 0.2222
Resuming up.
model.1 <- c(summary.1$size, summary.1$dev/summary.1$df, summary.1$misclass[1]/summary.1$misclass[2], train.error.rate.1, test.error.rate.1)
model.2 <- c(summary.2$size, summary.2$dev/summary.2$df, summary.2$misclass[1]/summary.2$misclass[2], train.error.rate.2, test.error.rate.2)
model.3 <- c(summary.3$size, summary.3$dev/summary.3$df, summary.3$misclass[1]/summary.3$misclass[2], train.error.rate.3, test.error.rate.3)
model.result <- data.frame(rbind(model.1, model.2, model.3))
colnames(model.result) <- c("size", "res. mean deviance", "mis. rate", "train error rate", "test error rate")
model.result
## size res. mean deviance mis. rate train error rate test error rate
## model.1 17 0.3032 0.06114 0.06114 0.2121
## model.2 10 0.4242 0.07860 0.07860 0.2121
## model.3 4 0.6445 0.10044 0.10044 0.2222
write.csv(model.result, file = "Set2.model.csv", row.names=F)
- Third explanatory variables set.
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine+WindGustDir), data=data.clean, mincut=1)
summary(mytree)
##
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + Sunshine +
## WindGustDir), data = data.clean, mincut = 1)
## Number of terminal nodes: 22
## Residual mean deviance: 0.256 = 78.4 / 306
## Misclassification error rate: 0.064 = 21 / 328
my.pruned.tree <- prune.tree(mytree, method="misclass")
data.frame(my.pruned.tree$size, my.pruned.tree$dev)
## my.pruned.tree.size my.pruned.tree.dev
## 1 22 21
## 2 18 21
## 3 16 22
## 4 13 24
## 5 8 29
## 6 5 35
## 7 3 42
## 8 1 60
cv.tree(my.tree, method="misclass")
## $size
## [1] 21 18 12 4 1
##
## $dev
## [1] 46 45 44 41 39
##
## $k
## [1] -Inf 0.0000 0.3333 1.0000 4.6667
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
my.best.tree.1 <- prune.tree(my.tree, best=18)
my.best.tree.1
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 )
## 10) Humidity3pm < 80 12 20 Yes ( 0.33 0.67 )
## 20) WindDir9am: NNW,NW,S,SSE 8 10 No ( 0.50 0.50 ) *
## 21) WindDir9am: ENE,N 4 0 Yes ( 0.00 1.00 ) *
## 11) Humidity3pm > 80 5 0 Yes ( 0.00 1.00 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 )
## 116) WindDir9am: E,ESE,NE,SE 14 10 No ( 0.79 0.21 )
## 232) Pressure3pm < 1015.55 11 7 No ( 0.91 0.09 ) *
## 233) Pressure3pm > 1015.55 3 4 Yes ( 0.33 0.67 ) *
## 117) WindDir9am: N,SSW 6 8 Yes ( 0.33 0.67 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 )
## 30) WindDir9am: E,ESE,N,SSW 35 0 No ( 1.00 0.00 ) *
## 31) WindDir9am: NNE,NNW,SE 40 30 No ( 0.90 0.10 )
## 62) Humidity3pm < 38 16 0 No ( 1.00 0.00 ) *
## 63) Humidity3pm > 38 24 20 No ( 0.83 0.17 )
## 126) WindDir9am: NNW,SE 23 20 No ( 0.87 0.13 )
## 252) Humidity3pm < 47.5 9 0 No ( 1.00 0.00 ) *
## 253) Humidity3pm > 47.5 14 10 No ( 0.79 0.21 ) *
## 127) WindDir9am: NNE 1 0 Yes ( 0.00 1.00 ) *
summary.1 <- summary(my.snip.tree.1)
summary.1
##
## Classification tree:
## snip.tree(tree = my.best.tree.1, nodes = 126L)
## Number of terminal nodes: 17
## Residual mean deviance: 0.303 = 64.3 / 212
## Misclassification error rate: 0.0611 = 14 / 229
plot(my.best.tree.1)
text(my.best.tree.1, digits=2)
my.snip.tree.1 <- snip.tree(my.best.tree.1, nodes=c(126))
my.snip.tree.1
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 )
## 10) Humidity3pm < 80 12 20 Yes ( 0.33 0.67 )
## 20) WindDir9am: NNW,NW,S,SSE 8 10 No ( 0.50 0.50 ) *
## 21) WindDir9am: ENE,N 4 0 Yes ( 0.00 1.00 ) *
## 11) Humidity3pm > 80 5 0 Yes ( 0.00 1.00 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 )
## 116) WindDir9am: E,ESE,NE,SE 14 10 No ( 0.79 0.21 )
## 232) Pressure3pm < 1015.55 11 7 No ( 0.91 0.09 ) *
## 233) Pressure3pm > 1015.55 3 4 Yes ( 0.33 0.67 ) *
## 117) WindDir9am: N,SSW 6 8 Yes ( 0.33 0.67 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 )
## 30) WindDir9am: E,ESE,N,SSW 35 0 No ( 1.00 0.00 ) *
## 31) WindDir9am: NNE,NNW,SE 40 30 No ( 0.90 0.10 )
## 62) Humidity3pm < 38 16 0 No ( 1.00 0.00 ) *
## 63) Humidity3pm > 38 24 20 No ( 0.83 0.17 )
## 126) WindDir9am: NNW,SE 23 20 No ( 0.87 0.13 ) *
## 127) WindDir9am: NNE 1 0 Yes ( 0.00 1.00 ) *
summary.1 <- summary(my.snip.tree.1)
summary.1
##
## Classification tree:
## snip.tree(tree = my.best.tree.1, nodes = 126L)
## Number of terminal nodes: 17
## Residual mean deviance: 0.303 = 64.3 / 212
## Misclassification error rate: 0.0611 = 14 / 229
plot(my.snip.tree.1)
text(my.snip.tree.1, digits=2)
my.prediction <- predict(my.snip.tree.1, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.1 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.1
## [1] 0.0524
my.prediction <- predict(my.snip.tree.1, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.1 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.1
## [1] 0.2121
Now with size 16.
my.best.tree.2 <- prune.tree(my.tree, best=16)
my.best.tree.2
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 )
## 10) Humidity3pm < 80 12 20 Yes ( 0.33 0.67 )
## 20) WindDir9am: NNW,NW,S,SSE 8 10 No ( 0.50 0.50 ) *
## 21) WindDir9am: ENE,N 4 0 Yes ( 0.00 1.00 ) *
## 11) Humidity3pm > 80 5 0 Yes ( 0.00 1.00 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 )
## 116) WindDir9am: E,ESE,NE,SE 14 10 No ( 0.79 0.21 )
## 232) Pressure3pm < 1015.55 11 7 No ( 0.91 0.09 ) *
## 233) Pressure3pm > 1015.55 3 4 Yes ( 0.33 0.67 ) *
## 117) WindDir9am: N,SSW 6 8 Yes ( 0.33 0.67 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 )
## 30) WindDir9am: E,ESE,N,SSW 35 0 No ( 1.00 0.00 ) *
## 31) WindDir9am: NNE,NNW,SE 40 30 No ( 0.90 0.10 )
## 62) Humidity3pm < 38 16 0 No ( 1.00 0.00 ) *
## 63) Humidity3pm > 38 24 20 No ( 0.83 0.17 )
## 126) WindDir9am: NNW,SE 23 20 No ( 0.87 0.13 ) *
## 127) WindDir9am: NNE 1 0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.2)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 126L))
## Number of terminal nodes: 17
## Residual mean deviance: 0.303 = 64.3 / 212
## Misclassification error rate: 0.0611 = 14 / 229
plot(my.best.tree.2)
text(my.best.tree.2, digits=2)
my.snip.tree.2 <- my.best.tree.2
summary.2 <- summary(my.snip.tree.2)
summary.2
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 126L))
## Number of terminal nodes: 17
## Residual mean deviance: 0.303 = 64.3 / 212
## Misclassification error rate: 0.0611 = 14 / 229
my.prediction <- predict(my.snip.tree.2, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.2 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.2
## [1] 0.05677
my.prediction <- predict(my.snip.tree.2, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.2 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.2
## [1] 0.2323
Now with size 13.
my.best.tree.3 <- prune.tree(my.tree, best=13)
my.best.tree.3
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 )
## 116) WindDir9am: E,ESE,NE,SE 14 10 No ( 0.79 0.21 )
## 232) Pressure3pm < 1015.55 11 7 No ( 0.91 0.09 ) *
## 233) Pressure3pm > 1015.55 3 4 Yes ( 0.33 0.67 ) *
## 117) WindDir9am: N,SSW 6 8 Yes ( 0.33 0.67 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 )
## 30) WindDir9am: E,ESE,N,SSW 35 0 No ( 1.00 0.00 ) *
## 31) WindDir9am: NNE,NNW,SE 40 30 No ( 0.90 0.10 )
## 62) Humidity3pm < 38 16 0 No ( 1.00 0.00 ) *
## 63) Humidity3pm > 38 24 20 No ( 0.83 0.17 ) *
summary(my.best.tree.3)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 5L, 63L))
## Number of terminal nodes: 14
## Residual mean deviance: 0.351 = 75.6 / 215
## Misclassification error rate: 0.0655 = 15 / 229
plot(my.best.tree.3)
text(my.best.tree.3, digits=2)
my.snip.tree.3 <- snip.tree(my.best.tree.3, nodes=c(31))
my.snip.tree.3
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 )
## 16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11 0 No ( 1.00 0.00 ) *
## 17) WindDir9am: E,N,SSE 16 20 No ( 0.75 0.25 )
## 34) Pressure3pm < 1006.95 2 0 Yes ( 0.00 1.00 ) *
## 35) Pressure3pm > 1006.95 14 10 No ( 0.86 0.14 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 )
## 116) WindDir9am: E,ESE,NE,SE 14 10 No ( 0.79 0.21 )
## 232) Pressure3pm < 1015.55 11 7 No ( 0.91 0.09 ) *
## 233) Pressure3pm > 1015.55 3 4 Yes ( 0.33 0.67 ) *
## 117) WindDir9am: N,SSW 6 8 Yes ( 0.33 0.67 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 )
## 30) WindDir9am: E,ESE,N,SSW 35 0 No ( 1.00 0.00 ) *
## 31) WindDir9am: NNE,NNW,SE 40 30 No ( 0.90 0.10 ) *
summary.3 <- summary(my.snip.tree.3)
summary.3
##
## Classification tree:
## snip.tree(tree = my.best.tree.3, nodes = 31L)
## Number of terminal nodes: 13
## Residual mean deviance: 0.37 = 79.9 / 216
## Misclassification error rate: 0.0655 = 15 / 229
plot(my.snip.tree.3)
text(my.snip.tree.3, digits=2)
my.prediction <- predict(my.snip.tree.3, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.3 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.3
## [1] 0.0655
my.prediction <- predict(my.snip.tree.3, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.3 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.3
## [1] 0.2121
Now with size 8.
my.best.tree.4 <- prune.tree(my.tree, best=8)
my.best.tree.4
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 )
## 14) Pressure3pm < 1016.05 34 40 No ( 0.71 0.29 )
## 28) WindDir9am: NNE,NNW 11 0 No ( 1.00 0.00 ) *
## 29) WindDir9am: E,ESE,N,NE,SE,SSW 23 30 No ( 0.57 0.43 )
## 58) Humidity3pm < 68.5 20 30 No ( 0.65 0.35 ) *
## 59) Humidity3pm > 68.5 3 0 Yes ( 0.00 1.00 ) *
## 15) Pressure3pm > 1016.05 75 30 No ( 0.95 0.05 ) *
summary(my.best.tree.4)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(5L, 58L, 15L, 8L))
## Number of terminal nodes: 8
## Residual mean deviance: 0.471 = 104 / 221
## Misclassification error rate: 0.0873 = 20 / 229
plot(my.best.tree.4)
text(my.best.tree.4, digits=2)
my.snip.tree.4 <- my.best.tree.4
summary.4 <- summary(my.snip.tree.4)
summary.4
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(5L, 58L, 15L, 8L))
## Number of terminal nodes: 8
## Residual mean deviance: 0.471 = 104 / 221
## Misclassification error rate: 0.0873 = 20 / 229
my.prediction <- predict(my.snip.tree.4, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.4 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.4
## [1] 0.08734
my.prediction <- predict(my.snip.tree.4, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.4 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.4
## [1] 0.2222
Now with size 5.
my.best.tree.5 <- prune.tree(my.tree, best=5)
my.best.tree.5
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 229 200 No ( 0.84 0.16 )
## 2) Pressure3pm < 1011.85 51 70 No ( 0.55 0.45 )
## 4) Humidity3pm < 61 34 40 No ( 0.71 0.29 )
## 8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27 20 No ( 0.85 0.15 ) *
## 9) WindDir9am: NNW,SE 7 6 Yes ( 0.14 0.86 ) *
## 5) Humidity3pm > 61 17 20 Yes ( 0.24 0.76 ) *
## 3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )
## 6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69 0 No ( 1.00 0.00 ) *
## 7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109 80 No ( 0.87 0.13 ) *
summary(my.best.tree.5)
##
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(5L, 8L, 7L))
## Number of terminal nodes: 5
## Residual mean deviance: 0.583 = 131 / 224
## Misclassification error rate: 0.1 = 23 / 229
plot(my.best.tree.5)
text(my.best.tree.5, digits=2)
my.snip.tree.5 <- snip.tree(my.best.tree.5, nodes=c(3))
summary.5 <- summary(my.snip.tree.5)
summary.5
##
## Classification tree:
## snip.tree(tree = my.best.tree.5, nodes = 3L)
## Number of terminal nodes: 4
## Residual mean deviance: 0.644 = 145 / 225
## Misclassification error rate: 0.1 = 23 / 229
plot(my.snip.tree.5)
text(my.snip.tree.5, digits=2)
my.prediction <- predict(my.snip.tree.5, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.5 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.5
## [1] 0.1004
my.prediction <- predict(my.snip.tree.5, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.5 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.5
## [1] 0.2222
Resuming up:
model.1 <- c(summary.1$size, summary.1$dev/summary.1$df, summary.1$misclass[1]/summary.1$misclass[2], train.error.rate.1, test.error.rate.1)
model.2 <- c(summary.2$size, summary.2$dev/summary.2$df, summary.2$misclass[1]/summary.2$misclass[2], train.error.rate.2, test.error.rate.2)
model.3 <- c(summary.3$size, summary.3$dev/summary.3$df, summary.3$misclass[1]/summary.3$misclass[2], train.error.rate.3, test.error.rate.3)
model.result <- data.frame(rbind(model.1, model.2, model.3))
model.4 <- c(summary.4$size, summary.4$dev/summary.4$df, summary.4$misclass[1]/summary.4$misclass[2], train.error.rate.4, test.error.rate.4)
model.5 <- c(summary.5$size, summary.5$dev/summary.5$df, summary.5$misclass[1]/summary.5$misclass[2], train.error.rate.5, test.error.rate.5)
model.result <- data.frame(rbind(model.1, model.2, model.3, model.4, model.5))
colnames(model.result) <- c("size", "res. mean deviance", "mis. rate", "train error rate", "test error rate")
model.result
## size res. mean deviance mis. rate train error rate test error rate
## model.1 17 0.3032 0.06114 0.05240 0.2121
## model.2 17 0.3032 0.06114 0.05677 0.2323
## model.3 13 0.3701 0.06550 0.06550 0.2121
## model.4 8 0.4709 0.08734 0.08734 0.2222
## model.5 4 0.6445 0.10044 0.10044 0.2222
write.csv(model.result, file = "Set3.model.csv", row.names=F)
In my next post, the fourth of this series, I am going to make my final decision among the models so far build for each explanatory variables set.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.