Sunday, November 23, 2014

Weather forecast going naive

Weather forecast - Naive Bayes

In the previous posts of this series, I identified the following three sets of explanatory variables:

set#1: {Humidity9am, Pressure9am, WindDir9am}

set#2: {Humidity3pm, Pressure3pm, WindDir9am}

set#3: {Humidity3pm, Pressure3pm, Sunshine, WindGustDir}

Please remember that I work with three sets of variables because I assume that tomorrow's weather must be forecast at three specific moments of the day: 9am, 3pm, and evening (say 10pm). That assumption restricts which variables are available at each time of day.

The weather data (weather.csv) is the same dataset used in the previous posts of this series.

I start again with the same preprocessing used for the classification tree model.

data <- read.csv("weather.csv", header=TRUE)
data.clean <- na.omit(data)                                        # drop rows with missing values
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday))  # drop RISK_MM (tomorrow's rainfall amount, a leakage variable) and RainToday
attach(data.clean)                                                 # make the columns accessible by name
colnames(data.clean)
##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"      
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"  
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am" 
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"  
## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
## [21] "Temp3pm"       "RainTomorrow"

This time, I would like to evaluate a model based on the Naive Bayes classifier. To allow for a comparison with the classification tree approach, I keep the choice of explanatory variables unchanged.

set.seed(1)
n = nrow(data.clean)
ntrain = floor(n*0.7)   # floor keeps the sample size an integer
ntest = n - ntrain
trainset = sample(1:n, ntrain)
wtrain <- data.clean[trainset,]
wtest <- data.clean[-trainset,]
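
A quick sanity check of the split sizes and class balance could look like the sketch below (my own addition). As the confusion matrices further down also show, RainTomorrow is skewed toward "No", which is worth keeping in mind when reading the error rates.

dim(wtrain)
dim(wtest)
# RainTomorrow is imbalanced: most days are "No", so the error rates below
# should be read against the baseline of always predicting "No".
table(wtrain$RainTomorrow)
prop.table(table(wtrain$RainTomorrow))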

Since one of the assumptions of Naive Bayes is independence across explanatory variables, a good choice should comprise weakly correlated variables. That criterion was already taken into account in the previous posts; here I just recap the correlations across variables:

expl = data.frame(Rainfall, Sunshine, WindGustSpeed, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm)
expl.cor = cor(expl)
expl.cor
##                  Rainfall    Sunshine WindGustSpeed Humidity9am
## Rainfall       1.00000000 -0.15806222    0.09944158   0.1463208
## Sunshine      -0.15806222  1.00000000    0.08476765  -0.5015955
## WindGustSpeed  0.09944158  0.08476765    1.00000000  -0.3382764
## Humidity9am    0.14632082 -0.50159550   -0.33827641   1.0000000
## Humidity3pm    0.28724404 -0.76026673   -0.04325351   0.5266952
## Pressure9am   -0.34873128  0.02562989   -0.52473677   0.1022502
## Pressure3pm   -0.26371024 -0.02412019   -0.51082621   0.1095494
## Cloud9am       0.17260983 -0.69760347   -0.01821614   0.4174955
## Cloud3pm       0.13489430 -0.65719792    0.04284946   0.2896181
##               Humidity3pm Pressure9am Pressure3pm    Cloud9am    Cloud3pm
## Rainfall       0.28724404 -0.34873128 -0.26371024  0.17260983  0.13489430
## Sunshine      -0.76026673  0.02562989 -0.02412019 -0.69760347 -0.65719792
## WindGustSpeed -0.04325351 -0.52473677 -0.51082621 -0.01821614  0.04284946
## Humidity9am    0.52669523  0.10225017  0.10954942  0.41749552  0.28961807
## Humidity3pm    1.00000000 -0.13628895 -0.04760723  0.56517442  0.53071540
## Pressure9am   -0.13628895  1.00000000  0.96674408 -0.16831577 -0.14619553
## Pressure3pm   -0.04760723  0.96674408  1.00000000 -0.13224707 -0.14623511
## Cloud9am       0.56517442 -0.16831577 -0.13224707  1.00000000  0.52829611
## Cloud3pm       0.53071540 -0.14619553 -0.14623511  0.52829611  1.00000000
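
Note that Pressure9am and Pressure3pm are almost collinear (correlation 0.97), which is one reason why no variable set includes both. A small sketch can flag such pairs automatically; the 0.7 threshold below is an arbitrary cutoff of mine, chosen for illustration.

# List the predictor pairs whose absolute correlation exceeds a threshold
# (0.7 is an arbitrary illustrative cutoff).
high.cor <- which(abs(expl.cor) > 0.7 & upper.tri(expl.cor), arr.ind = TRUE)
data.frame(var1 = rownames(expl.cor)[high.cor[, 1]],
           var2 = colnames(expl.cor)[high.cor[, 2]],
           cor  = expl.cor[high.cor])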

Let us use the same explanatory variable sets as in the classification tree case, to explore whether any improvement in the test error rate is achieved.

The Naive Bayes classifier is available in the e1071 library. The scales library is useful to format percentages. The ROCR library is used to plot ROC curves and compute AUC values.

library(e1071, verbose=FALSE, quietly=TRUE)
library(scales, verbose=FALSE, quietly=TRUE)
library(ROCR, verbose=FALSE, quietly=TRUE)
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
  • First explanatory variables set
train.col <- c("Humidity9am", "Pressure9am", "WindDir9am")
my.naive.1 <- naiveBayes(RainTomorrow~(Humidity9am+Pressure9am+WindDir9am), data=wtrain)

train.error.table.1 <- table(predict(my.naive.1, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))
train.error.table.1
##          actual
## predicted  No Yes
##       No  183  26
##       Yes   9  11
train.error.rate.1 <- (train.error.table.1[1,2]+train.error.table.1[2,1])/sum(train.error.table.1)

test.error.table.1 <- table(predict(my.naive.1, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.1
##          actual
## predicted No Yes
##       No  74  17
##       Yes  2   6
test.error.rate.1 <- (test.error.table.1[1,2]+test.error.table.1[2,1])/sum(test.error.table.1)

percent(train.error.rate.1)
## [1] "15.3%"
percent(test.error.rate.1)
## [1] "19.2%"
# ROCR analysis
my.naive.1.predict <- predict(my.naive.1, wtrain[,train.col], type='raw')
score <- my.naive.1.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.1.prediction <- prediction(score, actual_class)
my.naive.1.perf <- performance(my.naive.1.prediction, "tpr", "fpr")
plot(my.naive.1.perf, col="red")
abline(0, 1, col = "gray60")
grid()

my.naive.1.auc <- performance(my.naive.1.prediction, "auc")
my.naive.1.auc@y.values[[1]]
## [1] 0.8654279
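
To see what naiveBayes actually does under the hood, the raw posterior for a single observation can be recomputed by hand from the fitted model. The sketch below relies on my reading of the e1071 model object ($apriori holds the class counts, $tables holds per-class mean/sd for numeric predictors and conditional probability tables for factors), so treat it as illustrative rather than the exact internals.

# Recompute P(RainTomorrow | x) for the first test row and compare with
# predict(..., type = "raw").
x <- wtest[1, ]
post <- sapply(c("No", "Yes"), function(cl) {
  p <- my.naive.1$apriori[cl] / sum(my.naive.1$apriori)          # prior P(class)
  p <- p * dnorm(x$Humidity9am,
                 mean = my.naive.1$tables$Humidity9am[cl, 1],
                 sd   = my.naive.1$tables$Humidity9am[cl, 2])    # Gaussian likelihood
  p <- p * dnorm(x$Pressure9am,
                 mean = my.naive.1$tables$Pressure9am[cl, 1],
                 sd   = my.naive.1$tables$Pressure9am[cl, 2])    # Gaussian likelihood
  p * my.naive.1$tables$WindDir9am[cl, as.character(x$WindDir9am)]  # factor likelihood
})
post / sum(post)   # should match predict(my.naive.1, x, type = "raw")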

Do you think that adding explanatory variables may help? Let us try.

train.col <- c("Humidity9am", "Pressure9am", "WindDir9am", "Cloud9am", "Temp9am")
my.naive.1.b <- naiveBayes(RainTomorrow~(Humidity9am+Pressure9am+WindDir9am+Cloud9am+Temp9am), data=wtrain)
my.naive.1.b.predict <- predict(my.naive.1.b, wtrain[,train.col])
train.error.table.1.b <- table(predict(my.naive.1.b, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))

train.error.table.1.b
##          actual
## predicted  No Yes
##       No  180  20
##       Yes  12  17
train.error.rate.1.b <- (train.error.table.1.b[1,2]+train.error.table.1.b[2,1])/sum(train.error.table.1.b)
test.error.table.1.b <- table(predict(my.naive.1.b, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))

test.error.table.1.b
##          actual
## predicted No Yes
##       No  74  15
##       Yes  2   8
test.error.rate.1.b <- (test.error.table.1.b[1,2]+test.error.table.1.b[2,1])/sum(test.error.table.1.b)

percent(train.error.rate.1.b)
## [1] "14%"
percent(test.error.rate.1.b)
## [1] "17.2%"
# ROCR analysis
my.naive.1.b.predict <- predict(my.naive.1.b, wtrain[,train.col], type='raw')
score <- my.naive.1.b.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.1.b.prediction <- prediction(score, actual_class)
my.naive.1.b.perf <- performance(my.naive.1.b.prediction, "tpr", "fpr")
my.naive.1.b.auc <- performance(my.naive.1.b.prediction, "auc")
my.naive.1.b.auc@y.values[[1]]
## [1] 0.8685248
plot(my.naive.1.b.perf, col="red")
abline(0, 1, col = "gray60")
grid()
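
Since the same fit/tabulate/score pattern repeats for every variable set, the bookkeeping could be wrapped in a small helper. This is just a refactoring sketch (the function name is mine, not part of the original code):

# Misclassification rate of a fitted naiveBayes model on a data frame df
# holding the predictor columns cols plus RainTomorrow.
nb.error.rate <- function(fit, df, cols) {
  tab <- table(predict(fit, df[, cols]), df[, "RainTomorrow"],
               dnn = list('predicted', 'actual'))
  (tab[1, 2] + tab[2, 1]) / sum(tab)   # off-diagonal cells are the errors
}
# e.g. percent(nb.error.rate(my.naive.1, wtest, c("Humidity9am", "Pressure9am", "WindDir9am")))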

  • Second explanatory variables set
train.col <- c("Humidity3pm", "Pressure3pm", "WindDir9am")
my.naive.2 <- naiveBayes(RainTomorrow~(Humidity3pm+Pressure3pm+WindDir9am), data=wtrain)
train.error.table.2 <- table(predict(my.naive.2, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))

train.error.table.2
##          actual
## predicted  No Yes
##       No  186  17
##       Yes   6  20
train.error.rate.2 <- (train.error.table.2[1,2]+train.error.table.2[2,1])/sum(train.error.table.2)
test.error.table.2 <- table(predict(my.naive.2, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))

test.error.table.2
##          actual
## predicted No Yes
##       No  74  16
##       Yes  2   7
test.error.rate.2 <- (test.error.table.2[1,2]+test.error.table.2[2,1])/sum(test.error.table.2)

percent(train.error.rate.2)
## [1] "10%"
percent(test.error.rate.2)
## [1] "18.2%"
# ROCR analysis
my.naive.2.predict <- predict(my.naive.2, wtrain[,train.col], type='raw')
score <- my.naive.2.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.2.prediction <- prediction(score, actual_class)
my.naive.2.perf <- performance(my.naive.2.prediction, "tpr", "fpr")
plot(my.naive.2.perf, col="red")
abline(0, 1, col = "gray60")
grid()

my.naive.2.auc <- performance(my.naive.2.prediction, "auc")
my.naive.2.auc@y.values[[1]]
## [1] 0.8925957

Let us try again to add variables and compare the results.

train.col <- c("Humidity3pm", "Pressure3pm", "WindDir9am", "WindDir3pm", "Cloud3pm", "Temp3pm")
my.naive.2.b <- naiveBayes(RainTomorrow~(Humidity3pm+Pressure3pm+WindDir9am+WindDir3pm+Cloud3pm+Temp3pm), data=wtrain)
train.error.table.2.b <- table(predict(my.naive.2.b, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))

train.error.table.2.b
##          actual
## predicted  No Yes
##       No  181  13
##       Yes  11  24
train.error.rate.2.b <- (train.error.table.2.b[1,2]+train.error.table.2.b[2,1])/sum(train.error.table.2.b)

test.error.table.2.b <- table(predict(my.naive.2.b, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))

test.error.table.2.b
##          actual
## predicted No Yes
##       No  71  14
##       Yes  5   9
test.error.rate.2.b <- (test.error.table.2.b[1,2]+test.error.table.2.b[2,1])/sum(test.error.table.2.b)

percent(train.error.rate.2.b)
## [1] "10.5%"
percent(test.error.rate.2.b)
## [1] "19.2%"
# ROCR analysis
my.naive.2.b.predict <- predict(my.naive.2.b, wtrain[,train.col], type='raw')
score <- my.naive.2.b.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.2.b.prediction <- prediction(score, actual_class)
my.naive.2.b.perf <- performance(my.naive.2.b.prediction, "tpr", "fpr")
plot(my.naive.2.b.perf, col="red")
abline(0, 1, col = "gray60")
grid()

my.naive.2.b.auc <- performance(my.naive.2.b.prediction, "auc")
my.naive.2.b.auc@y.values[[1]]
## [1] 0.9083615

No remarkable improvement in the error rates when adding further variables.

  • Third explanatory variables set
train.col <- c("Humidity3pm", "Pressure3pm", "Sunshine", "WindGustDir")
f <- as.formula("RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine+WindGustDir)")
my.naive.3 <- naiveBayes(f, data=wtrain)
train.error.table.3 <- table(predict(my.naive.3, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))

train.error.table.3
##          actual
## predicted  No Yes
##       No  180  14
##       Yes  12  23
train.error.rate.3 <- (train.error.table.3[1,2]+train.error.table.3[2,1])/sum(train.error.table.3)
test.error.table.3 <- table(predict(my.naive.3, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))

test.error.table.3
##          actual
## predicted No Yes
##       No  75  13
##       Yes  1  10
test.error.rate.3 <- (test.error.table.3[1,2]+test.error.table.3[2,1])/sum(test.error.table.3)

percent(train.error.rate.3)
## [1] "11.4%"
percent(test.error.rate.3)
## [1] "14.1%"
# ROCR analysis
my.naive.3.predict <- predict(my.naive.3, wtrain[,train.col], type='raw')
score <- my.naive.3.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.3.prediction <- prediction(score, actual_class)
my.naive.3.perf <- performance(my.naive.3.prediction, "tpr", "fpr")
plot(my.naive.3.perf, col="red")
abline(0, 1, col = "gray60")
grid()

my.naive.3.auc <- performance(my.naive.3.prediction, "auc")
my.naive.3.auc@y.values[[1]]
## [1] 0.895411

What about putting in all the variables available by 3pm? The train error rate gets slightly worse, whilst the test error rate gets somewhat better.

train.col <- c("MinTemp", "MaxTemp", "Rainfall", "Evaporation", "Sunshine", "WindGustDir", "WindGustSpeed", "WindDir3pm", "WindSpeed3pm",  "Humidity3pm",  "Pressure3pm", "Cloud3pm", "Temp3pm")
f <- as.formula ("RainTomorrow~(MinTemp+MaxTemp+Rainfall+Evaporation+Sunshine+WindGustDir+WindGustSpeed+WindDir3pm+WindSpeed3pm+Humidity3pm+Pressure3pm+Cloud3pm+Temp3pm)")
my.naive.all <- naiveBayes(f, data=wtrain)
train.error.table.all <- table(predict(my.naive.all, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))

train.error.table.all
##          actual
## predicted  No Yes
##       No  174  11
##       Yes  18  26
train.error.rate.all <- (train.error.table.all[1,2]+train.error.table.all[2,1])/sum(train.error.table.all)

test.error.table.all <- table(predict(my.naive.all, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.all
##          actual
## predicted No Yes
##       No  75  10
##       Yes  1  13
test.error.rate.all <- (test.error.table.all[1,2]+test.error.table.all[2,1])/sum(test.error.table.all)

percent(train.error.rate.all)
## [1] "12.7%"
percent(test.error.rate.all)
## [1] "11.1%"
# ROCR analysis
my.naive.all.predict <- predict(my.naive.all, wtrain[,train.col], type='raw')
score <- my.naive.all.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.all.prediction <- prediction(score, actual_class)
my.naive.all.perf <- performance(my.naive.all.prediction, "tpr", "fpr")
plot(my.naive.all.perf, col="red")
abline(0, 1, col = "gray60")
grid()

my.naive.all.auc <- performance(my.naive.all.prediction, "auc")
my.naive.all.auc@y.values[[1]]
## [1] 0.8878097

Do you think that adding all the variables helps the evening weather forecast? Actually, just one more explanatory variable, WindGustSpeed, is enough to achieve that test error rate.

train.col <- c("Humidity3pm", "Pressure3pm", "Sunshine", "WindGustDir", "WindGustSpeed")
f <- as.formula("RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine+WindGustDir+WindGustSpeed)")
my.naive.3.b <- naiveBayes(f, data=wtrain)

train.error.table.3.b <- table(predict(my.naive.3.b, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))

train.error.table.3.b
##          actual
## predicted  No Yes
##       No  178  13
##       Yes  14  24
train.error.rate.3.b <- (train.error.table.3.b[1,2]+train.error.table.3.b[2,1])/sum(train.error.table.3.b)
test.error.table.3.b <- table(predict(my.naive.3.b, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))

test.error.table.3.b 
##          actual
## predicted No Yes
##       No  75  10
##       Yes  1  13
test.error.rate.3.b <- (test.error.table.3.b[1,2]+test.error.table.3.b[2,1])/sum(test.error.table.3.b)

percent(train.error.rate.3.b)
## [1] "11.8%"
percent(test.error.rate.3.b)
## [1] "11.1%"
# ROCR analysis
my.naive.3.b.predict <- predict(my.naive.3.b, wtrain[,train.col], type='raw')
score <- my.naive.3.b.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.3.b.prediction <- prediction(score, actual_class)
my.naive.3.b.perf <- performance(my.naive.3.b.prediction, "tpr", "fpr")
plot(my.naive.3.b.perf, col="red")
abline(0, 1, col = "gray60")
grid()

my.naive.3.b.auc <- performance(my.naive.3.b.prediction, "auc")
my.naive.3.b.auc@y.values[[1]]
## [1] 0.8910473

So, in our example, where the explanatory variables were chosen by a preliminary analysis, the Naive Bayes classifier achieves better test error rates than the classification tree. The gap between train and test error rates is also smaller.

df.error.rate.1 <- data.frame(percent(train.error.rate.1), percent(test.error.rate.1), percent(train.error.rate.1.b), percent(test.error.rate.1.b))
colnames(df.error.rate.1) <- c("train.err.1", "test.err.1", "train.err.1.b", "test.err.1.b")
df.error.rate.2 <- data.frame(percent(train.error.rate.2), percent(test.error.rate.2), percent(train.error.rate.2.b), percent(test.error.rate.2.b))
colnames(df.error.rate.2) <- c("train.err.2", "test.err.2", "train.err.2.b", "test.err.2.b")
df.error.rate.3 <- data.frame(percent(train.error.rate.3), percent(test.error.rate.3), percent(train.error.rate.3.b), percent(test.error.rate.3.b))
colnames(df.error.rate.3) <- c("train.err.3", "test.err.3", "train.err.3.b", "test.err.3.b")
df.error.rate.1
##   train.err.1 test.err.1 train.err.1.b test.err.1.b
## 1       15.3%      19.2%           14%        17.2%
df.error.rate.2
##   train.err.2 test.err.2 train.err.2.b test.err.2.b
## 1         10%      18.2%         10.5%        19.2%
df.error.rate.3
##   train.err.3 test.err.3 train.err.3.b test.err.3.b
## 1       11.4%      14.1%         11.8%        11.1%
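
For completeness, the train-set AUC values computed above can be collected in the same way (a convenience sketch mirroring the error rate summary):

# Gather the AUC values already computed above into a single data frame.
df.auc <- data.frame(auc.1   = my.naive.1.auc@y.values[[1]],
                     auc.1.b = my.naive.1.b.auc@y.values[[1]],
                     auc.2   = my.naive.2.auc@y.values[[1]],
                     auc.2.b = my.naive.2.b.auc@y.values[[1]],
                     auc.3   = my.naive.3.auc@y.values[[1]],
                     auc.3.b = my.naive.3.b.auc@y.values[[1]])
round(df.auc, 3)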

For the morning and afternoon weather forecasts, the performance is close to that of the classification tree.

For the evening forecast, the test error rate improves with respect to the classification tree.

There is no remarkable advantage in augmenting the explanatory variable sets, except in the evening forecast case (adding WindGustSpeed cuts the test error rate from 14.1% to 11.1%, a relative improvement of about 22%).