Saturday, November 8, 2014

Weather forecast (part 3)

Building the model

In the previous two posts, I identified the explanatory variables, both quantitative and categorical, that I am going to use to build a model for predicting weather. More precisely, I identified three sets of explanatory variables, the following ones:

set#1: {Humidity9am, Pressure9am, WindDir9am} set#2: {Humidity3pm, Pressure3pm, WindDir9am} set#3: {Humidity3pm, Pressure3pm, Sunshine, WindGustDir}

The reason for identifying three sets of explanatory variables is determined by the need of being able to provide weather forecasts on three deadlines during the day: 10am, 4pm, 10pm. That need must be checked against the availability of specific explanatory variables. For example, at 10am, the Humidity3pm is obviously not available; at 3pm, some Sunshine or WindGustDir measurement are still going on.

So, let us reload the original dataset available at:

http://www.biz.uiowa.edu/faculty/jledolter/datamining/dataexercises.html

and run the already introduced preliminary preprocessing.

data <- read.csv("weather.csv", header=TRUE)
data.clean <- na.omit(data)
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday))
attach(data.clean)
colnames(data.clean)
##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"      
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"  
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am" 
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"  
## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
## [21] "Temp3pm"       "RainTomorrow"

Specifically, I am going to build the model by taking advantage of classification trees. I do that for their simplicity in representing graphically the final result. I will explore further models on next posts. So, first let us generate the training and test sets.

set.seed(1)
n = nrow(data.clean)
ntrain = n*0.7
ntest = n - ntrain
trainset = sample(1:n, ntrain)
wtrain <- data.clean[trainset,]
wtest <- data.clean[-trainset,]

Considering the first variables set, I am going to build the classification tree and further pruning in order to get the tree with the optimal deviance. In addition to that, snipping is performed to remove nodes which do not drive any discrimination among yes/no scenarios, hence without increasing misclassification rate.

  • First explanatory variables set
library(tree)
my.tree <- tree(RainTomorrow~(Humidity9am+Pressure9am+WindDir9am), data=wtrain, mincut=1)
summary(my.tree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity9am + Pressure9am + WindDir9am), 
##     data = wtrain, mincut = 1)
## Number of terminal nodes:  18 
## Residual mean deviance:  0.341 = 71.9 / 211 
## Misclassification error rate: 0.083 = 19 / 229
my.pruned.tree <- prune.tree(my.tree, method="misclass")
data.frame(my.pruned.tree$size, my.pruned.tree$dev)
##   my.pruned.tree.size my.pruned.tree.dev
## 1                  18                 19
## 2                  14                 19
## 3                  11                 21
## 4                   7                 25
## 5                   3                 31
## 6                   1                 37

From the table above, it appears the best size being equal to 14. Also cross-validation confirms that.

cv.tree(my.tree, method="misclass")
## $size
## [1] 18 14 11  7  3  1
## 
## $dev
## [1] 51 51 44 43 42 42
## 
## $k
## [1]   -Inf 0.0000 0.6667 1.0000 1.5000 3.0000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

Hence, I am going to prune my classification tree based on that.

my.best.tree.1 <- prune.tree(my.tree, best=14)
my.best.tree.1
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )  
##       4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )  
##         8) WindDir9am: NNE,SW,W,WNW,WSW 14   0 No ( 1.00 0.00 ) *
##         9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74  90 No ( 0.68 0.32 )  
##          18) Humidity9am < 81.5 65  80 No ( 0.72 0.28 )  
##            36) WindDir9am: E,ENE,ESE,NNW,NW,S,SSE,SSW 49  50 No ( 0.80 0.20 )  
##              72) Pressure9am < 1013.2 24  30 No ( 0.67 0.33 ) *
##              73) Pressure9am > 1013.2 25  10 No ( 0.92 0.08 )  
##               146) Pressure9am < 1017.15 19   0 No ( 1.00 0.00 ) *
##               147) Pressure9am > 1017.15 6   8 No ( 0.67 0.33 ) *
##            37) WindDir9am: N,SE 16  20 No ( 0.50 0.50 )  
##              74) Humidity9am < 77 13  20 Yes ( 0.38 0.62 ) *
##              75) Humidity9am > 77 3   0 No ( 1.00 0.00 ) *
##          19) Humidity9am > 81.5 9  10 Yes ( 0.33 0.67 ) *
##       5) Humidity9am > 91.5 6   0 Yes ( 0.00 1.00 ) *
##     3) Pressure9am > 1018.05 135  60 No ( 0.95 0.05 )  
##       6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: NE,NNE,NNW,SE,SSW 56  40 No ( 0.88 0.12 )  
##        14) Humidity9am < 93.5 54  30 No ( 0.91 0.09 )  
##          28) Humidity9am < 74.5 32   9 No ( 0.97 0.03 )  
##            56) WindDir9am: NNE,NNW,SE,SSW 30   0 No ( 1.00 0.00 ) *
##            57) WindDir9am: NE 2   3 No ( 0.50 0.50 ) *
##          29) Humidity9am > 74.5 22  20 No ( 0.82 0.18 )  
##            58) Pressure9am < 1019.5 3   4 Yes ( 0.33 0.67 ) *
##            59) Pressure9am > 1019.5 19  10 No ( 0.89 0.11 )  
##             118) WindDir9am: NE,NNW,SE,SSW 18   8 No ( 0.94 0.06 ) *
##             119) WindDir9am: NNE 1   0 Yes ( 0.00 1.00 ) *
##        15) Humidity9am > 93.5 2   0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.1)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(74L, 72L))
## Number of terminal nodes:  15 
## Residual mean deviance:  0.38 = 81.3 / 214 
## Misclassification error rate: 0.0917 = 21 / 229

From the classification tree printout, it appears that I can snip nodes 28 and 36 as they do not provide any significant discrimination in terms of prediction.

my.snip.tree.1 <- snip.tree(my.best.tree.1, nodes=c(28,36))
summary.1 <- summary(my.snip.tree.1)
summary.1
## 
## Classification tree:
## snip.tree(tree = my.best.tree.1, nodes = c(28L, 36L))
## Number of terminal nodes:  12 
## Residual mean deviance:  0.455 = 98.8 / 217 
## Misclassification error rate: 0.0917 = 21 / 229
plot(my.snip.tree.1)
text(my.snip.tree.1, digits=2)

plot of chunk unnamed-chunk-6

my.prediction <- predict(my.snip.tree.1, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.1 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.1
## [1] 0.0917

It is remarkable that the training set error rate has not changed after pruning and snipping.

my.prediction <- predict(my.snip.tree.1, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.1 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.1
## [1] 0.1919

What if I build a classification tree with 11 nodes. Does the final misclassification rate change that much ? Let us try it out.

my.best.tree.2 <- prune.tree(my.tree, best=11)
my.best.tree.2
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )  
##       4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )  
##         8) WindDir9am: NNE,SW,W,WNW,WSW 14   0 No ( 1.00 0.00 ) *
##         9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74  90 No ( 0.68 0.32 )  
##          18) Humidity9am < 81.5 65  80 No ( 0.72 0.28 )  
##            36) WindDir9am: E,ENE,ESE,NNW,NW,S,SSE,SSW 49  50 No ( 0.80 0.20 )  
##              72) Pressure9am < 1013.2 24  30 No ( 0.67 0.33 ) *
##              73) Pressure9am > 1013.2 25  10 No ( 0.92 0.08 )  
##               146) Pressure9am < 1017.15 19   0 No ( 1.00 0.00 ) *
##               147) Pressure9am > 1017.15 6   8 No ( 0.67 0.33 ) *
##            37) WindDir9am: N,SE 16  20 No ( 0.50 0.50 )  
##              74) Humidity9am < 77 13  20 Yes ( 0.38 0.62 ) *
##              75) Humidity9am > 77 3   0 No ( 1.00 0.00 ) *
##          19) Humidity9am > 81.5 9  10 Yes ( 0.33 0.67 ) *
##       5) Humidity9am > 91.5 6   0 Yes ( 0.00 1.00 ) *
##     3) Pressure9am > 1018.05 135  60 No ( 0.95 0.05 )  
##       6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: NE,NNE,NNW,SE,SSW 56  40 No ( 0.88 0.12 )  
##        14) Humidity9am < 93.5 54  30 No ( 0.91 0.09 ) *
##        15) Humidity9am > 93.5 2   0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.2)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(74L, 72L, 14L))
## Number of terminal nodes:  11 
## Residual mean deviance:  0.46 = 100 / 218 
## Misclassification error rate: 0.1 = 23 / 229
plot(my.best.tree.2)
text(my.best.tree.2, digits=2)

plot of chunk unnamed-chunk-8

my.snip.tree.2 <- snip.tree(my.best.tree.2, nodes=c(36))
my.snip.tree.2
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 229 200 No ( 0.84 0.16 )  
##    2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )  
##      4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )  
##        8) WindDir9am: NNE,SW,W,WNW,WSW 14   0 No ( 1.00 0.00 ) *
##        9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74  90 No ( 0.68 0.32 )  
##         18) Humidity9am < 81.5 65  80 No ( 0.72 0.28 )  
##           36) WindDir9am: E,ENE,ESE,NNW,NW,S,SSE,SSW 49  50 No ( 0.80 0.20 ) *
##           37) WindDir9am: N,SE 16  20 No ( 0.50 0.50 )  
##             74) Humidity9am < 77 13  20 Yes ( 0.38 0.62 ) *
##             75) Humidity9am > 77 3   0 No ( 1.00 0.00 ) *
##         19) Humidity9am > 81.5 9  10 Yes ( 0.33 0.67 ) *
##      5) Humidity9am > 91.5 6   0 Yes ( 0.00 1.00 ) *
##    3) Pressure9am > 1018.05 135  60 No ( 0.95 0.05 )  
##      6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79   0 No ( 1.00 0.00 ) *
##      7) WindDir9am: NE,NNE,NNW,SE,SSW 56  40 No ( 0.88 0.12 )  
##       14) Humidity9am < 93.5 54  30 No ( 0.91 0.09 ) *
##       15) Humidity9am > 93.5 2   0 Yes ( 0.00 1.00 ) *
summary.2 <- summary(my.snip.tree.2)
summary.2
## 
## Classification tree:
## snip.tree(tree = my.best.tree.2, nodes = 36L)
## Number of terminal nodes:  9 
## Residual mean deviance:  0.508 = 112 / 220 
## Misclassification error rate: 0.1 = 23 / 229
plot(my.snip.tree.2)
text(my.snip.tree.2, digits=2)

plot of chunk unnamed-chunk-8

my.prediction <- predict(my.snip.tree.2, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.2 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.2
## [1] 0.1004

Again, the training error rate has not changed after pruning and snipping. The test error rate is as following:

my.prediction <- predict(my.snip.tree.2, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.2 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.2
## [1] 0.1919

Surprise! The test error rate for the second classification tree is the same as the first one. However, the residual mean deviance is slight higher for this second tree. Since the first classification tree uses 12 nodes, my choice is for the second one that achieve the same misclassification rate with only 9 nodes.

It is time to try with just 7 nodes to observe an increase in the misclassification rate, in the case.

my.best.tree.3 <- prune.tree(my.tree, best=7)
my.best.tree.3
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )  
##       4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )  
##         8) WindDir9am: NNE,SW,W,WNW,WSW 14   0 No ( 1.00 0.00 ) *
##         9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74  90 No ( 0.68 0.32 )  
##          18) Humidity9am < 81.5 65  80 No ( 0.72 0.28 )  
##            36) WindDir9am: E,ENE,ESE,NNW,NW,S,SSE,SSW 49  50 No ( 0.80 0.20 )  
##              72) Pressure9am < 1013.2 24  30 No ( 0.67 0.33 ) *
##              73) Pressure9am > 1013.2 25  10 No ( 0.92 0.08 )  
##               146) Pressure9am < 1017.15 19   0 No ( 1.00 0.00 ) *
##               147) Pressure9am > 1017.15 6   8 No ( 0.67 0.33 ) *
##            37) WindDir9am: N,SE 16  20 No ( 0.50 0.50 ) *
##          19) Humidity9am > 81.5 9  10 Yes ( 0.33 0.67 ) *
##       5) Humidity9am > 91.5 6   0 Yes ( 0.00 1.00 ) *
##     3) Pressure9am > 1018.05 135  60 No ( 0.95 0.05 )  
##       6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: NE,NNE,NNW,SE,SSW 56  40 No ( 0.88 0.12 )  
##        14) Humidity9am < 93.5 54  30 No ( 0.91 0.09 ) *
##        15) Humidity9am > 93.5 2   0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.3)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(72L, 14L, 37L))
## Number of terminal nodes:  10 
## Residual mean deviance:  0.48 = 105 / 219 
## Misclassification error rate: 0.114 = 26 / 229

Some snipping is possible.

my.snip.tree.3 <- snip.tree(my.best.tree.3, nodes=c(18,36,73))
my.snip.tree.3
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 229 200 No ( 0.84 0.16 )  
##    2) Pressure9am < 1018.05 94 100 No ( 0.68 0.32 )  
##      4) Humidity9am < 91.5 88 100 No ( 0.73 0.27 )  
##        8) WindDir9am: NNE,SW,W,WNW,WSW 14   0 No ( 1.00 0.00 ) *
##        9) WindDir9am: E,ENE,ESE,N,NNW,NW,S,SE,SSE,SSW 74  90 No ( 0.68 0.32 )  
##         18) Humidity9am < 81.5 65  80 No ( 0.72 0.28 ) *
##         19) Humidity9am > 81.5 9  10 Yes ( 0.33 0.67 ) *
##      5) Humidity9am > 91.5 6   0 Yes ( 0.00 1.00 ) *
##    3) Pressure9am > 1018.05 135  60 No ( 0.95 0.05 )  
##      6) WindDir9am: E,ENE,ESE,N,NW,S,SSE,SW,W,WNW,WSW 79   0 No ( 1.00 0.00 ) *
##      7) WindDir9am: NE,NNE,NNW,SE,SSW 56  40 No ( 0.88 0.12 )  
##       14) Humidity9am < 93.5 54  30 No ( 0.91 0.09 ) *
##       15) Humidity9am > 93.5 2   0 Yes ( 0.00 1.00 ) *
summary.3 <- summary(my.snip.tree.3)
summary.3
## 
## Classification tree:
## snip.tree(tree = my.best.tree.3, nodes = 18L)
## Number of terminal nodes:  7 
## Residual mean deviance:  0.547 = 121 / 222 
## Misclassification error rate: 0.114 = 26 / 229
plot(my.snip.tree.3)
text(my.snip.tree.3, digits=2)

plot of chunk unnamed-chunk-11

my.prediction <- predict(my.snip.tree.3, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.3 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.3
## [1] 0.1135
my.prediction <- predict(my.snip.tree.3, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.3 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.3
## [1] 0.2121

Resuming up.

model.1 <- c(summary.1$size, summary.1$dev/summary.1$df, summary.1$misclass[1]/summary.1$misclass[2], train.error.rate.1, test.error.rate.1)
model.2 <- c(summary.2$size, summary.2$dev/summary.2$df, summary.2$misclass[1]/summary.2$misclass[2], train.error.rate.2, test.error.rate.2)
model.3 <- c(summary.3$size, summary.3$dev/summary.3$df, summary.3$misclass[1]/summary.3$misclass[2], train.error.rate.3, test.error.rate.3)
model.result <- data.frame(rbind(model.1, model.2, model.3))
colnames(model.result) <- c("size", "res. mean deviance", "mis. rate", "train error rate", "test error rate")
model.result
##         size res. mean deviance mis. rate train error rate test error rate
## model.1   12             0.4554    0.0917           0.0917          0.1919
## model.2    9             0.5077    0.1004           0.1004          0.1919
## model.3    7             0.5472    0.1135           0.1135          0.2121
write.csv(model.result, file = "Set1.model.csv", row.names=F)

I am going to do the same kind of analysis for the remaining two explanatory variables sets.

  • Second explanatory variables set.
my.tree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+WindDir9am), data=wtrain, mincut=1)
summary(my.tree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + WindDir9am), 
##     data = wtrain, mincut = 1)
## Number of terminal nodes:  21 
## Residual mean deviance:  0.257 = 53.5 / 208 
## Misclassification error rate: 0.0568 = 13 / 229
my.pruned.tree.1 <- prune.tree(my.tree, method="misclass")
data.frame(my.pruned.tree.1$size, my.pruned.tree.1$dev)
##   my.pruned.tree.1.size my.pruned.tree.1.dev
## 1                    21                   13
## 2                    18                   13
## 3                    12                   15
## 4                     4                   23
## 5                     1                   37
cv.tree(my.tree, method="misclass")
## $size
## [1] 21 18 12  4  1
## 
## $dev
## [1] 46 46 46 44 38
## 
## $k
## [1]   -Inf 0.0000 0.3333 1.0000 4.6667
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

This time, the cross-validation gives a different result with respect to pruning tree. It results to be 12 the size where minimum residual deviance occurs instead of 18 as the pruning tree based on misclassification does.

my.best.tree.1 <- prune.tree(my.tree, best=18)
my.best.tree.1
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##       4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##         8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##          16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##          17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##            34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##            35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##         9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##       5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 )  
##        10) Humidity3pm < 80 12  20 Yes ( 0.33 0.67 )  
##          20) WindDir9am: NNW,NW,S,SSE 8  10 No ( 0.50 0.50 ) *
##          21) WindDir9am: ENE,N 4   0 Yes ( 0.00 1.00 ) *
##        11) Humidity3pm > 80 5   0 Yes ( 0.00 1.00 ) *
##     3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##       6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##        14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##          28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##          29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##            58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 )  
##             116) WindDir9am: E,ESE,NE,SE 14  10 No ( 0.79 0.21 )  
##               232) Pressure3pm < 1015.55 11   7 No ( 0.91 0.09 ) *
##               233) Pressure3pm > 1015.55 3   4 Yes ( 0.33 0.67 ) *
##             117) WindDir9am: N,SSW 6   8 Yes ( 0.33 0.67 ) *
##            59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##        15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 )  
##          30) WindDir9am: E,ESE,N,SSW 35   0 No ( 1.00 0.00 ) *
##          31) WindDir9am: NNE,NNW,SE 40  30 No ( 0.90 0.10 )  
##            62) Humidity3pm < 38 16   0 No ( 1.00 0.00 ) *
##            63) Humidity3pm > 38 24  20 No ( 0.83 0.17 )  
##             126) WindDir9am: NNW,SE 23  20 No ( 0.87 0.13 )  
##               252) Humidity3pm < 47.5 9   0 No ( 1.00 0.00 ) *
##               253) Humidity3pm > 47.5 14  10 No ( 0.79 0.21 ) *
##             127) WindDir9am: NNE 1   0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.1)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 253L))
## Number of terminal nodes:  18 
## Residual mean deviance:  0.289 = 61 / 211 
## Misclassification error rate: 0.0611 = 14 / 229
my.snip.tree.1 <- snip.tree(my.best.tree.1, nodes=c(126))
my.snip.tree.1
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##       4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##         8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##          16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##          17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##            34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##            35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##         9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##       5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 )  
##        10) Humidity3pm < 80 12  20 Yes ( 0.33 0.67 )  
##          20) WindDir9am: NNW,NW,S,SSE 8  10 No ( 0.50 0.50 ) *
##          21) WindDir9am: ENE,N 4   0 Yes ( 0.00 1.00 ) *
##        11) Humidity3pm > 80 5   0 Yes ( 0.00 1.00 ) *
##     3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##       6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##        14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##          28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##          29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##            58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 )  
##             116) WindDir9am: E,ESE,NE,SE 14  10 No ( 0.79 0.21 )  
##               232) Pressure3pm < 1015.55 11   7 No ( 0.91 0.09 ) *
##               233) Pressure3pm > 1015.55 3   4 Yes ( 0.33 0.67 ) *
##             117) WindDir9am: N,SSW 6   8 Yes ( 0.33 0.67 ) *
##            59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##        15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 )  
##          30) WindDir9am: E,ESE,N,SSW 35   0 No ( 1.00 0.00 ) *
##          31) WindDir9am: NNE,NNW,SE 40  30 No ( 0.90 0.10 )  
##            62) Humidity3pm < 38 16   0 No ( 1.00 0.00 ) *
##            63) Humidity3pm > 38 24  20 No ( 0.83 0.17 )  
##             126) WindDir9am: NNW,SE 23  20 No ( 0.87 0.13 ) *
##             127) WindDir9am: NNE 1   0 Yes ( 0.00 1.00 ) *
summary.1 <- summary(my.snip.tree.1)
summary.1
## 
## Classification tree:
## snip.tree(tree = my.best.tree.1, nodes = 126L)
## Number of terminal nodes:  17 
## Residual mean deviance:  0.303 = 64.3 / 212 
## Misclassification error rate: 0.0611 = 14 / 229
plot(my.snip.tree.1)
text(my.snip.tree.1, digits=2)

plot of chunk unnamed-chunk-14

my.prediction <- predict(my.snip.tree.1, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.1 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.1
## [1] 0.06114
my.prediction <- predict(my.snip.tree.1, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.1 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.1
## [1] 0.2121

Let us building the model with tree size equal to 12.

my.best.tree.2 <- prune.tree(my.tree, best=12)
my.best.tree.2
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 229 200 No ( 0.84 0.16 )  
##    2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##      4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##        8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##         16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##         17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##           34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##           35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##        9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##      5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 ) *
##    3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##      6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##      7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##       14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##         28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##         29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##           58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 ) *
##           59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##       15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 )  
##         30) WindDir9am: E,ESE,N,SSW 35   0 No ( 1.00 0.00 ) *
##         31) WindDir9am: NNE,NNW,SE 40  30 No ( 0.90 0.10 )  
##           62) Humidity3pm < 38 16   0 No ( 1.00 0.00 ) *
##           63) Humidity3pm > 38 24  20 No ( 0.83 0.17 ) *
summary(my.best.tree.2)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 5L, 63L, 58L))
## Number of terminal nodes:  12 
## Residual mean deviance:  0.384 = 83.3 / 217 
## Misclassification error rate: 0.0786 = 18 / 229
plot(my.best.tree.2)
text(my.best.tree.2, digits=2)

plot of chunk unnamed-chunk-15

my.snip.tree.2 <- snip.tree(my.best.tree.2, nodes=c(15))
my.snip.tree.2
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 229 200 No ( 0.84 0.16 )  
##    2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##      4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##        8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##         16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##         17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##           34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##           35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##        9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##      5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 ) *
##    3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##      6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##      7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##       14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##         28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##         29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##           58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 ) *
##           59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##       15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 ) *
summary.2 <- summary(my.snip.tree.2)
summary.2
## 
## Classification tree:
## snip.tree(tree = my.best.tree.2, nodes = 15L)
## Number of terminal nodes:  10 
## Residual mean deviance:  0.424 = 92.9 / 219 
## Misclassification error rate: 0.0786 = 18 / 229
plot(my.snip.tree.2)
text(my.snip.tree.2, digits=2)

plot of chunk unnamed-chunk-15

my.prediction <- predict(my.snip.tree.2, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.2 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.2
## [1] 0.0786
my.prediction <- predict(my.snip.tree.2, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.2 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.2
## [1] 0.2121

Again, the misclassification rate is equal to the first classification tree one, with a slight higher residual mean deviance. Let us see with just four nodes.

my.best.tree.3 <- prune.tree(my.tree, best=4)
my.best.tree.3
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 229 200 No ( 0.84 0.16 )  
##   2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##     4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##       8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 ) *
##       9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##     5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 ) *
##   3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##     6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##     7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 ) *
summary(my.best.tree.3)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(5L, 8L, 7L))
## Number of terminal nodes:  5 
## Residual mean deviance:  0.583 = 131 / 224 
## Misclassification error rate: 0.1 = 23 / 229
plot(my.best.tree.3)
text(my.best.tree.3, digits=2)

plot of chunk unnamed-chunk-16

my.snip.tree.3 <- snip.tree(my.best.tree.3, nodes=c(3))
my.snip.tree.3
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 229 200 No ( 0.84 0.16 )  
##   2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##     4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##       8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 ) *
##       9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##     5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 ) *
##   3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 ) *
summary.3 <- summary(my.snip.tree.3)
summary.3
## 
## Classification tree:
## snip.tree(tree = my.best.tree.3, nodes = 3L)
## Number of terminal nodes:  4 
## Residual mean deviance:  0.644 = 145 / 225 
## Misclassification error rate: 0.1 = 23 / 229
plot(my.snip.tree.3)
text(my.snip.tree.3, digits=2)

plot of chunk unnamed-chunk-16

my.prediction <- predict(my.snip.tree.3, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.3 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.3
## [1] 0.1004
my.prediction <- predict(my.snip.tree.3, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.3 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.3
## [1] 0.2222

Resuming up.

model.1 <- c(summary.1$size, summary.1$dev/summary.1$df, summary.1$misclass[1]/summary.1$misclass[2], train.error.rate.1, test.error.rate.1)
model.2 <- c(summary.2$size, summary.2$dev/summary.2$df, summary.2$misclass[1]/summary.2$misclass[2], train.error.rate.2, test.error.rate.2)
model.3 <- c(summary.3$size, summary.3$dev/summary.3$df, summary.3$misclass[1]/summary.3$misclass[2], train.error.rate.3, test.error.rate.3)
model.result <- data.frame(rbind(model.1, model.2, model.3))
colnames(model.result) <- c("size", "res. mean deviance", "mis. rate", "train error rate", "test error rate")
model.result
##         size res. mean deviance mis. rate train error rate test error rate
## model.1   17             0.3032   0.06114          0.06114          0.2121
## model.2   10             0.4242   0.07860          0.07860          0.2121
## model.3    4             0.6445   0.10044          0.10044          0.2222
write.csv(model.result, file = "Set2.model.csv", row.names=F)
  • Third explanatory variables set.
mytree <- tree(RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine+WindGustDir), data=data.clean, mincut=1)
summary(mytree)
## 
## Classification tree:
## tree(formula = RainTomorrow ~ (Humidity3pm + Pressure3pm + Sunshine + 
##     WindGustDir), data = data.clean, mincut = 1)
## Number of terminal nodes:  22 
## Residual mean deviance:  0.256 = 78.4 / 306 
## Misclassification error rate: 0.064 = 21 / 328
my.pruned.tree <- prune.tree(mytree, method="misclass")
data.frame(my.pruned.tree$size, my.pruned.tree$dev)
##   my.pruned.tree.size my.pruned.tree.dev
## 1                  22                 21
## 2                  18                 21
## 3                  16                 22
## 4                  13                 24
## 5                   8                 29
## 6                   5                 35
## 7                   3                 42
## 8                   1                 60
cv.tree(my.tree, method="misclass")
## $size
## [1] 21 18 12  4  1
## 
## $dev
## [1] 46 45 44 41 39
## 
## $k
## [1]   -Inf 0.0000 0.3333 1.0000 4.6667
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
my.best.tree.1 <- prune.tree(my.tree, best=18)
my.best.tree.1
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##       4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##         8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##          16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##          17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##            34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##            35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##         9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##       5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 )  
##        10) Humidity3pm < 80 12  20 Yes ( 0.33 0.67 )  
##          20) WindDir9am: NNW,NW,S,SSE 8  10 No ( 0.50 0.50 ) *
##          21) WindDir9am: ENE,N 4   0 Yes ( 0.00 1.00 ) *
##        11) Humidity3pm > 80 5   0 Yes ( 0.00 1.00 ) *
##     3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##       6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##        14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##          28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##          29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##            58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 )  
##             116) WindDir9am: E,ESE,NE,SE 14  10 No ( 0.79 0.21 )  
##               232) Pressure3pm < 1015.55 11   7 No ( 0.91 0.09 ) *
##               233) Pressure3pm > 1015.55 3   4 Yes ( 0.33 0.67 ) *
##             117) WindDir9am: N,SSW 6   8 Yes ( 0.33 0.67 ) *
##            59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##        15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 )  
##          30) WindDir9am: E,ESE,N,SSW 35   0 No ( 1.00 0.00 ) *
##          31) WindDir9am: NNE,NNW,SE 40  30 No ( 0.90 0.10 )  
##            62) Humidity3pm < 38 16   0 No ( 1.00 0.00 ) *
##            63) Humidity3pm > 38 24  20 No ( 0.83 0.17 )  
##             126) WindDir9am: NNW,SE 23  20 No ( 0.87 0.13 )  
##               252) Humidity3pm < 47.5 9   0 No ( 1.00 0.00 ) *
##               253) Humidity3pm > 47.5 14  10 No ( 0.79 0.21 ) *
##             127) WindDir9am: NNE 1   0 Yes ( 0.00 1.00 ) *
summary.1 <- summary(my.snip.tree.1)
summary.1
## 
## Classification tree:
## snip.tree(tree = my.best.tree.1, nodes = 126L)
## Number of terminal nodes:  17 
## Residual mean deviance:  0.303 = 64.3 / 212 
## Misclassification error rate: 0.0611 = 14 / 229
plot(my.best.tree.1)
text(my.best.tree.1, digits=2)

plot of chunk unnamed-chunk-18

my.snip.tree.1 <- snip.tree(my.best.tree.1, nodes=c(126))
my.snip.tree.1
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##       4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##         8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##          16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##          17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##            34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##            35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##         9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##       5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 )  
##        10) Humidity3pm < 80 12  20 Yes ( 0.33 0.67 )  
##          20) WindDir9am: NNW,NW,S,SSE 8  10 No ( 0.50 0.50 ) *
##          21) WindDir9am: ENE,N 4   0 Yes ( 0.00 1.00 ) *
##        11) Humidity3pm > 80 5   0 Yes ( 0.00 1.00 ) *
##     3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##       6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##        14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##          28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##          29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##            58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 )  
##             116) WindDir9am: E,ESE,NE,SE 14  10 No ( 0.79 0.21 )  
##               232) Pressure3pm < 1015.55 11   7 No ( 0.91 0.09 ) *
##               233) Pressure3pm > 1015.55 3   4 Yes ( 0.33 0.67 ) *
##             117) WindDir9am: N,SSW 6   8 Yes ( 0.33 0.67 ) *
##            59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##        15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 )  
##          30) WindDir9am: E,ESE,N,SSW 35   0 No ( 1.00 0.00 ) *
##          31) WindDir9am: NNE,NNW,SE 40  30 No ( 0.90 0.10 )  
##            62) Humidity3pm < 38 16   0 No ( 1.00 0.00 ) *
##            63) Humidity3pm > 38 24  20 No ( 0.83 0.17 )  
##             126) WindDir9am: NNW,SE 23  20 No ( 0.87 0.13 ) *
##             127) WindDir9am: NNE 1   0 Yes ( 0.00 1.00 ) *
summary.1 <- summary(my.snip.tree.1)
summary.1
## 
## Classification tree:
## snip.tree(tree = my.best.tree.1, nodes = 126L)
## Number of terminal nodes:  17 
## Residual mean deviance:  0.303 = 64.3 / 212 
## Misclassification error rate: 0.0611 = 14 / 229
plot(my.snip.tree.1)
text(my.snip.tree.1, digits=2)

plot of chunk unnamed-chunk-18

my.prediction <- predict(my.snip.tree.1, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.1 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.1
## [1] 0.0524
my.prediction <- predict(my.snip.tree.1, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.1 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.1
## [1] 0.2121

Now with size 16.

my.best.tree.2 <- prune.tree(my.tree, best=16)
my.best.tree.2
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##       4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##         8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##          16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##          17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##            34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##            35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##         9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##       5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 )  
##        10) Humidity3pm < 80 12  20 Yes ( 0.33 0.67 )  
##          20) WindDir9am: NNW,NW,S,SSE 8  10 No ( 0.50 0.50 ) *
##          21) WindDir9am: ENE,N 4   0 Yes ( 0.00 1.00 ) *
##        11) Humidity3pm > 80 5   0 Yes ( 0.00 1.00 ) *
##     3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##       6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##        14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##          28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##          29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##            58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 )  
##             116) WindDir9am: E,ESE,NE,SE 14  10 No ( 0.79 0.21 )  
##               232) Pressure3pm < 1015.55 11   7 No ( 0.91 0.09 ) *
##               233) Pressure3pm > 1015.55 3   4 Yes ( 0.33 0.67 ) *
##             117) WindDir9am: N,SSW 6   8 Yes ( 0.33 0.67 ) *
##            59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##        15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 )  
##          30) WindDir9am: E,ESE,N,SSW 35   0 No ( 1.00 0.00 ) *
##          31) WindDir9am: NNE,NNW,SE 40  30 No ( 0.90 0.10 )  
##            62) Humidity3pm < 38 16   0 No ( 1.00 0.00 ) *
##            63) Humidity3pm > 38 24  20 No ( 0.83 0.17 )  
##             126) WindDir9am: NNW,SE 23  20 No ( 0.87 0.13 ) *
##             127) WindDir9am: NNE 1   0 Yes ( 0.00 1.00 ) *
summary(my.best.tree.2)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 126L))
## Number of terminal nodes:  17 
## Residual mean deviance:  0.303 = 64.3 / 212 
## Misclassification error rate: 0.0611 = 14 / 229
plot(my.best.tree.2)
text(my.best.tree.2, digits=2)

plot of chunk unnamed-chunk-19

my.snip.tree.2 <- my.best.tree.2
summary.2 <- summary(my.snip.tree.2)
summary.2
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 126L))
## Number of terminal nodes:  17 
## Residual mean deviance:  0.303 = 64.3 / 212 
## Misclassification error rate: 0.0611 = 14 / 229
my.prediction <- predict(my.snip.tree.2, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.2 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.2
## [1] 0.05677
my.prediction <- predict(my.snip.tree.2, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.2 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.2
## [1] 0.2323

Now with size 13.

my.best.tree.3 <- prune.tree(my.tree, best=13)
my.best.tree.3
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##       4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##         8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##          16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##          17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##            34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##            35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##         9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##       5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 ) *
##     3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##       6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##        14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##          28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##          29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##            58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 )  
##             116) WindDir9am: E,ESE,NE,SE 14  10 No ( 0.79 0.21 )  
##               232) Pressure3pm < 1015.55 11   7 No ( 0.91 0.09 ) *
##               233) Pressure3pm > 1015.55 3   4 Yes ( 0.33 0.67 ) *
##             117) WindDir9am: N,SSW 6   8 Yes ( 0.33 0.67 ) *
##            59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##        15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 )  
##          30) WindDir9am: E,ESE,N,SSW 35   0 No ( 1.00 0.00 ) *
##          31) WindDir9am: NNE,NNW,SE 40  30 No ( 0.90 0.10 )  
##            62) Humidity3pm < 38 16   0 No ( 1.00 0.00 ) *
##            63) Humidity3pm > 38 24  20 No ( 0.83 0.17 ) *
summary(my.best.tree.3)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(35L, 5L, 63L))
## Number of terminal nodes:  14 
## Residual mean deviance:  0.351 = 75.6 / 215 
## Misclassification error rate: 0.0655 = 15 / 229
plot(my.best.tree.3)
text(my.best.tree.3, digits=2)

plot of chunk unnamed-chunk-20

my.snip.tree.3 <- snip.tree(my.best.tree.3, nodes=c(31))
my.snip.tree.3
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 229 200 No ( 0.84 0.16 )  
##     2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##       4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##         8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 )  
##          16) WindDir9am: ESE,NNE,NW,S,WNW,WSW 11   0 No ( 1.00 0.00 ) *
##          17) WindDir9am: E,N,SSE 16  20 No ( 0.75 0.25 )  
##            34) Pressure3pm < 1006.95 2   0 Yes ( 0.00 1.00 ) *
##            35) Pressure3pm > 1006.95 14  10 No ( 0.86 0.14 ) *
##         9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##       5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 ) *
##     3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##       6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##       7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##        14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##          28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##          29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##            58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 )  
##             116) WindDir9am: E,ESE,NE,SE 14  10 No ( 0.79 0.21 )  
##               232) Pressure3pm < 1015.55 11   7 No ( 0.91 0.09 ) *
##               233) Pressure3pm > 1015.55 3   4 Yes ( 0.33 0.67 ) *
##             117) WindDir9am: N,SSW 6   8 Yes ( 0.33 0.67 ) *
##            59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##        15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 )  
##          30) WindDir9am: E,ESE,N,SSW 35   0 No ( 1.00 0.00 ) *
##          31) WindDir9am: NNE,NNW,SE 40  30 No ( 0.90 0.10 ) *
summary.3 <- summary(my.snip.tree.3)
summary.3
## 
## Classification tree:
## snip.tree(tree = my.best.tree.3, nodes = 31L)
## Number of terminal nodes:  13 
## Residual mean deviance:  0.37 = 79.9 / 216 
## Misclassification error rate: 0.0655 = 15 / 229
plot(my.snip.tree.3)
text(my.snip.tree.3, digits=2)

plot of chunk unnamed-chunk-20

my.prediction <- predict(my.snip.tree.3, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.3 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.3
## [1] 0.0655
my.prediction <- predict(my.snip.tree.3, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.3 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.3
## [1] 0.2121

Now with size 8.

my.best.tree.4 <- prune.tree(my.tree, best=8)
my.best.tree.4
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 229 200 No ( 0.84 0.16 )  
##    2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##      4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##        8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 ) *
##        9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##      5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 ) *
##    3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##      6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##      7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 )  
##       14) Pressure3pm < 1016.05 34  40 No ( 0.71 0.29 )  
##         28) WindDir9am: NNE,NNW 11   0 No ( 1.00 0.00 ) *
##         29) WindDir9am: E,ESE,N,NE,SE,SSW 23  30 No ( 0.57 0.43 )  
##           58) Humidity3pm < 68.5 20  30 No ( 0.65 0.35 ) *
##           59) Humidity3pm > 68.5 3   0 Yes ( 0.00 1.00 ) *
##       15) Pressure3pm > 1016.05 75  30 No ( 0.95 0.05 ) *
summary(my.best.tree.4)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(5L, 58L, 15L, 8L))
## Number of terminal nodes:  8 
## Residual mean deviance:  0.471 = 104 / 221 
## Misclassification error rate: 0.0873 = 20 / 229
plot(my.best.tree.4)
text(my.best.tree.4, digits=2)

plot of chunk unnamed-chunk-21

my.snip.tree.4 <- my.best.tree.4
summary.4 <- summary(my.snip.tree.4)
summary.4
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(5L, 58L, 15L, 8L))
## Number of terminal nodes:  8 
## Residual mean deviance:  0.471 = 104 / 221 
## Misclassification error rate: 0.0873 = 20 / 229
my.prediction <- predict(my.snip.tree.4, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.4 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.4
## [1] 0.08734
my.prediction <- predict(my.snip.tree.4, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.4 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.4
## [1] 0.2222

Now with size 5.

my.best.tree.5 <- prune.tree(my.tree, best=5)
my.best.tree.5
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 229 200 No ( 0.84 0.16 )  
##   2) Pressure3pm < 1011.85 51  70 No ( 0.55 0.45 )  
##     4) Humidity3pm < 61 34  40 No ( 0.71 0.29 )  
##       8) WindDir9am: E,ESE,N,NNE,NW,S,SSE,WNW,WSW 27  20 No ( 0.85 0.15 ) *
##       9) WindDir9am: NNW,SE 7   6 Yes ( 0.14 0.86 ) *
##     5) Humidity3pm > 61 17  20 Yes ( 0.24 0.76 ) *
##   3) Pressure3pm > 1011.85 178 100 No ( 0.92 0.08 )  
##     6) WindDir9am: ENE,NW,S,SSE,SW,W,WNW,WSW 69   0 No ( 1.00 0.00 ) *
##     7) WindDir9am: E,ESE,N,NE,NNE,NNW,SE,SSW 109  80 No ( 0.87 0.13 ) *
summary(my.best.tree.5)
## 
## Classification tree:
## snip.tree(tree = my.tree, nodes = c(5L, 8L, 7L))
## Number of terminal nodes:  5 
## Residual mean deviance:  0.583 = 131 / 224 
## Misclassification error rate: 0.1 = 23 / 229
plot(my.best.tree.5)
text(my.best.tree.5, digits=2)

plot of chunk unnamed-chunk-22

my.snip.tree.5 <- snip.tree(my.best.tree.5, nodes=c(3))
summary.5 <- summary(my.snip.tree.5)
summary.5
## 
## Classification tree:
## snip.tree(tree = my.best.tree.5, nodes = 3L)
## Number of terminal nodes:  4 
## Residual mean deviance:  0.644 = 145 / 225 
## Misclassification error rate: 0.1 = 23 / 229
plot(my.snip.tree.5)
text(my.snip.tree.5, digits=2)

plot of chunk unnamed-chunk-22

my.prediction <- predict(my.snip.tree.5, wtrain, type="class")
errors <- which(my.prediction != wtrain[,"RainTomorrow"])
train.error.rate.5 <- length(errors)/length(wtrain[,"RainTomorrow"])
train.error.rate.5
## [1] 0.1004
my.prediction <- predict(my.snip.tree.5, wtest, type="class")
errors <- which(my.prediction != wtest[,"RainTomorrow"])
test.error.rate.5 <- length(errors)/length(wtest[,"RainTomorrow"])
test.error.rate.5
## [1] 0.2222

Resuming up:

model.1 <- c(summary.1$size, summary.1$dev/summary.1$df, summary.1$misclass[1]/summary.1$misclass[2], train.error.rate.1, test.error.rate.1)
model.2 <- c(summary.2$size, summary.2$dev/summary.2$df, summary.2$misclass[1]/summary.2$misclass[2], train.error.rate.2, test.error.rate.2)
model.3 <- c(summary.3$size, summary.3$dev/summary.3$df, summary.3$misclass[1]/summary.3$misclass[2], train.error.rate.3, test.error.rate.3)
model.result <- data.frame(rbind(model.1, model.2, model.3))
model.4 <- c(summary.4$size, summary.4$dev/summary.4$df, summary.4$misclass[1]/summary.4$misclass[2], train.error.rate.4, test.error.rate.4)
model.5 <- c(summary.5$size, summary.5$dev/summary.5$df, summary.5$misclass[1]/summary.5$misclass[2], train.error.rate.5, test.error.rate.5)
model.result <- data.frame(rbind(model.1, model.2, model.3, model.4, model.5))
colnames(model.result) <- c("size", "res. mean deviance", "mis. rate", "train error rate", "test error rate")
model.result
##         size res. mean deviance mis. rate train error rate test error rate
## model.1   17             0.3032   0.06114          0.05240          0.2121
## model.2   17             0.3032   0.06114          0.05677          0.2323
## model.3   13             0.3701   0.06550          0.06550          0.2121
## model.4    8             0.4709   0.08734          0.08734          0.2222
## model.5    4             0.6445   0.10044          0.10044          0.2222
write.csv(model.result, file = "Set3.model.csv", row.names=F)

In my next post, the fourth of this series, I am going to make my final decision among the models so far build for each explanatory variables set.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.