Saturday, November 15, 2014

Weather forecast (part 4)

Choosing the tree size

Having saved the results for each set of models, I am going to load them back to draw some further conclusions. In particular, I would like to choose the tree size that best fits each model set.

set1 <- read.csv(file = "Set1.model.csv")
set2 <- read.csv(file = "Set2.model.csv")
set3 <- read.csv(file = "Set3.model.csv")
set1
##   size res..mean.deviance mis..rate train.error.rate test.error.rate
## 1   12             0.4554    0.0917           0.0917          0.1919
## 2    9             0.5077    0.1004           0.1004          0.1919
## 3    7             0.5472    0.1135           0.1135          0.2121
set2
##   size res..mean.deviance mis..rate train.error.rate test.error.rate
## 1   17             0.3032   0.06114          0.06114          0.2121
## 2   10             0.4242   0.07860          0.07860          0.2121
## 3    4             0.6445   0.10044          0.10044          0.2222
set3
##   size res..mean.deviance mis..rate train.error.rate test.error.rate
## 1   17             0.3032   0.06114          0.05240          0.2121
## 2   17             0.3032   0.06114          0.05677          0.2323
## 3   13             0.3701   0.06550          0.06550          0.2121
## 4    8             0.4709   0.08734          0.08734          0.2222
## 5    4             0.6445   0.10044          0.10044          0.2222

Let me recap the explanatory variable sets:

set#1: {Humidity9am, Pressure9am, WindDir9am}
set#2: {Humidity3pm, Pressure3pm, WindDir9am}
set#3: {Humidity3pm, Pressure3pm, Sunshine, WindGustDir}

The criterion for choosing the tree size of each model is based on two elements:

  1. a maximum admissible misclassification rate
  2. the trade-off between tree size increase and misclassification rate decrease

The first element sets an absolute threshold that the misclassification rate must not exceed; I set it at 10%. The second measures the advantage gained in return for the increased complexity of the resulting tree: it compares the percentage increase in tree size against the percentage decrease in misclassification rate.
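The per-set computations that follow can be wrapped in a small helper; `trade.off` below is a hypothetical function of my own, not part of the original code. Given a model table whose first column is the tree size (in decreasing order) and whose third column is the misclassification rate, it returns, for each consecutive pair of rows, the size increase percentage, the misclassification decrease percentage and their ratio.

```r
# Hypothetical helper equivalent to the per-set computations below.
# df: a model table with tree size (decreasing) in column 1 and the
# misclassification rate in column 3.
trade.off <- function(df) {
  n <- nrow(df)
  size.inc.perc <- -100 * diff(df[, 1]) / df[2:n, 1]
  mis.dec.perc  <-  100 * diff(df[, 3]) / df[2:n, 3]
  cbind(size.inc.perc, mis.dec.perc,
        ratio = size.inc.perc / mis.dec.perc)
}
```

Row i of the result describes the step from row i+1 (smaller tree) to row i (larger tree) of the input table.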

n1 <- nrow(set1)
p1 <- cbind((-100*(diff(set1[,1])/set1[2:n1,1])),(100*(diff(set1[,3])/set1[2:n1,3])))
colnames(p1) <- c("size.inc.perc","mis.dec.perc")
p1
##      size.inc.perc mis.dec.perc
## [1,]         33.33        8.696
## [2,]         28.57       11.538
ratio <- p1[,1]/p1[,2]
ratio
## [1] 3.833 2.476

Reading the rows bottom up: a 28% increase in tree size (from 7 to 9) decreases the misclassification rate by 11%, while a further 33% size increase (from 9 to 12) buys only an 8% decrease. The first increase has to be accepted in order to meet the 10% misclassification rate threshold; the second is not worth its cost. Hence tree size 9 is chosen for the first set of explanatory variables.

n2 <- nrow(set2)
p2 <- cbind((-100*(diff(set2[,1])/set2[2:n2,1])),(100*(diff(set2[,3])/set2[2:n2,3])))
colnames(p2) <- c("size.inc.perc","mis.dec.perc")
p2
##      size.inc.perc mis.dec.perc
## [1,]            70        22.22
## [2,]           150        21.74
ratio <- p2[,1]/p2[,2]
ratio
## [1] 3.15 6.90

Reading bottom up again: growing the tree from size 4 to 10 costs a 150% size increase for a 21% decrease in misclassification rate (ratio 6.90), and growing further from 10 to 17 costs another 70% for a 22% decrease (ratio 3.15). However, at tree size 4 the misclassification rate is already at the 10% threshold, the same level reached by the set1 tree, so neither increase is needed. I choose tree size 4.

n3 <- nrow(set3)
p3 <- cbind((-100*(diff(set3[,1])/set3[2:n3,1])),(100*(diff(set3[,3])/set3[2:n3,3])))
colnames(p3) <- c("size.inc.perc","mis.dec.perc")
p3
##      size.inc.perc mis.dec.perc
## [1,]          0.00        0.000
## [2,]         30.77        6.656
## [3,]         62.50       25.006
## [4,]        100.00       13.043
ratio <- p3[2:4,1]/p3[2:4,2]
ratio
## [1] 4.622 2.499 7.667

In this case, I do not see a considerable advantage in increasing the tree size from 13 to 17, as the corresponding misclassification rate decrease is just 6.7% for a 30.8% size increase. I choose 13 as tree size. The general criterion is based on the ratio between the size increase percentage and the misclassification decrease percentage: reading right to left, once a step has been accepted, the next step to the left is acceptable only if its ratio does not exceed the previous one. Here the rightmost step (from size 4 to 8, ratio 7.67) has to be accepted because at tree size 4 the misclassification rate sits right at the 10% threshold; the next step (from 8 to 13, ratio 2.50) is clearly worth its cost; the final step (from 13 to 17, ratio 4.62) is not.
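The right-to-left acceptance rule can be sketched as a small base R function; `accepted.steps` is a hypothetical helper of my own, not part of the original analysis. It assumes the rightmost growth step is accepted and then walks leftwards, accepting further steps only while the ratio does not increase.

```r
# Hypothetical helper: ratio[i] is the cost of growing the tree from row
# i+1 to row i of a model table (largest tree first). Assuming the
# rightmost step is accepted, keep accepting steps leftwards while the
# ratio does not increase; return the indices of the accepted steps.
accepted.steps <- function(ratio) {
  k <- length(ratio)                  # rightmost step: smallest trees
  while (k > 1 && ratio[k - 1] <= ratio[k]) k <- k - 1
  k:length(ratio)
}
accepted.steps(c(3.833, 2.476))       # set1's ratios
```

For set1 this accepts only the rightmost step, the growth from size 7 to 9, matching the choice of tree size 9 above.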

Set#1: tree size: 9; misclassification rate: 10%; test error rate: 19%

Set#2: tree size: 4; misclassification rate: 10%; test error rate: 22%

Set#3: tree size: 13; misclassification rate: 6.5%; test error rate: 21%
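Assuming the models were grown with the tree package, as in the earlier posts of this series, each one can be cut down to its chosen size with prune.misclass. The sketch below uses the built-in iris data purely for illustration; for the weather models the same call would be made on each fitted tree with best = 9, 4 and 13 respectively.

```r
library(tree)

# Illustrative only: fit a classification tree on the built-in iris data
# and prune it back to a requested number of terminal nodes (or the
# nearest achievable subtree size), the same operation used to enforce
# the chosen tree sizes.
fit <- tree(Species ~ ., data = iris)
pruned <- prune.misclass(fit, best = 4)
summary(pruned)
```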

It makes sense for the evening forecast, based on observations taken later in the day, to be more accurate than the preceding ones. However, the test error rate is steadily around 20% for all models, which makes me want to explore other statistical learning models. That is what I am going to show in my next posts.
