Weather forecast - Naive Bayes
In the previous posts of this series, I identified the following three explanatory variable sets:
set#1: {Humidity9am, Pressure9am, WindDir9am}
set#2: {Humidity3pm, Pressure3pm, WindDir9am}
set#3: {Humidity3pm, Pressure3pm, Sunshine, WindGustDir}
Please remember that I work with three sets of variables because I assume that tomorrow's weather needs to be forecast at three precise moments of the day: 9am, 3pm and evening (say 10pm). That assumption restricts which variables are available at each time of day.
The weather data is available at:
I start over with the same preprocessing used for the classification tree model.
data <- read.csv("weather.csv", header=TRUE)  # load the raw weather dataset
data.clean <- na.omit(data)  # drop observations with missing values
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday))  # drop columns not used as predictors
attach(data.clean)  # attach so columns can be referenced by name below
colnames(data.clean)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainTomorrow"
This time, I would like to evaluate a model based on the Naive Bayes classifier. To allow for some comparison with the classification tree approach, I keep the choice of explanatory variables unchanged.
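As a quick reminder of what the classifier does: under its independence assumption, Naive Bayes predicts the class that maximizes the product of the class prior and the per-variable conditional likelihoods,

$$\hat{y} = \underset{c \,\in\, \{\text{No},\,\text{Yes}\}}{\arg\max}\; P(c) \prod_{i=1}^{p} P(x_i \mid c)$$

where the $x_i$ are the explanatory variables of the chosen set.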
set.seed(1)
n = nrow(data.clean)
ntrain = floor(n*0.7)  # 70% of the observations for training (truncated to an integer)
ntest = n - ntrain
trainset = sample(1:n, ntrain)
wtrain <- data.clean[trainset,]
wtest <- data.clean[-trainset,]
Since one of the hypotheses of Naive Bayes is the (conditional) independence of the explanatory variables given the class, a good choice should comprise weakly correlated variables. That criterion was already taken into account in previous posts; here I just recap the correlations across variables:
expl = data.frame(Rainfall, Sunshine, WindGustSpeed, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm)
expl.cor = cor(expl)
expl.cor
## Rainfall Sunshine WindGustSpeed Humidity9am
## Rainfall 1.00000000 -0.15806222 0.09944158 0.1463208
## Sunshine -0.15806222 1.00000000 0.08476765 -0.5015955
## WindGustSpeed 0.09944158 0.08476765 1.00000000 -0.3382764
## Humidity9am 0.14632082 -0.50159550 -0.33827641 1.0000000
## Humidity3pm 0.28724404 -0.76026673 -0.04325351 0.5266952
## Pressure9am -0.34873128 0.02562989 -0.52473677 0.1022502
## Pressure3pm -0.26371024 -0.02412019 -0.51082621 0.1095494
## Cloud9am 0.17260983 -0.69760347 -0.01821614 0.4174955
## Cloud3pm 0.13489430 -0.65719792 0.04284946 0.2896181
## Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm
## Rainfall 0.28724404 -0.34873128 -0.26371024 0.17260983 0.13489430
## Sunshine -0.76026673 0.02562989 -0.02412019 -0.69760347 -0.65719792
## WindGustSpeed -0.04325351 -0.52473677 -0.51082621 -0.01821614 0.04284946
## Humidity9am 0.52669523 0.10225017 0.10954942 0.41749552 0.28961807
## Humidity3pm 1.00000000 -0.13628895 -0.04760723 0.56517442 0.53071540
## Pressure9am -0.13628895 1.00000000 0.96674408 -0.16831577 -0.14619553
## Pressure3pm -0.04760723 0.96674408 1.00000000 -0.13224707 -0.14623511
## Cloud9am 0.56517442 -0.16831577 -0.13224707 1.00000000 0.52829611
## Cloud3pm 0.53071540 -0.14619553 -0.14623511 0.52829611 1.00000000
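As a follow-up on the independence criterion, here is a minimal sketch (the 0.7 threshold is my arbitrary choice) to extract the strongly correlated pairs from the matrix above; it flags Sunshine/Humidity3pm (-0.760) and Pressure9am/Pressure3pm (0.967).
high.cor <- which(abs(expl.cor) > 0.7 & upper.tri(expl.cor), arr.ind = TRUE)  # upper triangle only
data.frame(var1 = rownames(expl.cor)[high.cor[, 1]],
           var2 = colnames(expl.cor)[high.cor[, 2]],
           cor  = round(expl.cor[high.cor], 3))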
Let us use the same explanatory variable sets as in the classification tree case, to explore whether any improvement in the test error rate is achieved.
The Naive Bayes classifier is available in the e1071 library. The scales library is useful for formatting percentages. The ROCR library is used to plot ROC curves and compute AUC values.
library(e1071, verbose=FALSE, quietly=TRUE)
library(scales, verbose=FALSE, quietly=TRUE)
library(ROCR, verbose=FALSE, quietly=TRUE)
- First explanatory variable set
train.col <- c("Humidity9am", "Pressure9am", "WindDir9am")
my.naive.1 <- naiveBayes(RainTomorrow~(Humidity9am+Pressure9am+WindDir9am), data=wtrain)
train.error.table.1 <- table(predict(my.naive.1, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))
train.error.table.1
## actual
## predicted No Yes
## No 183 26
## Yes 9 11
train.error.rate.1 <- (train.error.table.1[1,2]+train.error.table.1[2,1])/sum(train.error.table.1)
test.error.table.1 <- table(predict(my.naive.1, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.1
## actual
## predicted No Yes
## No 74 17
## Yes 2 6
test.error.rate.1 <- (test.error.table.1[1,2]+test.error.table.1[2,1])/sum(test.error.table.1)
percent(train.error.rate.1)
## [1] "15.3%"
percent(test.error.rate.1)
## [1] "19.2%"
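Before moving to the ROC analysis, it is worth a look at what naiveBayes actually fitted: in e1071, numeric predictors get a per-class Gaussian model (mean and standard deviation), while factors such as WindDir9am get per-class frequency tables. A quick way to inspect this (output omitted):
my.naive.1$apriori             # class counts behind the prior
my.naive.1$tables$Humidity9am  # per-class mean and sd (Gaussian model for a numeric predictor)
my.naive.1$tables$WindDir9am   # per-class conditional frequencies (factor predictor)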
# ROCR analysis
my.naive.1.predict <- predict(my.naive.1, wtrain[,train.col], type='raw')
score <- my.naive.1.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.1.prediction <- prediction(score, actual_class)
my.naive.1.perf <- performance(my.naive.1.prediction, "tpr", "fpr")
plot(my.naive.1.perf, col="red")
abline(0, 1, col = "gray60")
grid()
my.naive.1.auc <- performance(my.naive.1.prediction, "auc")
my.naive.1.auc@y.values[[1]]
## [1] 0.8654279
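Since the same evaluation steps (confusion table, error rate) repeat for every model below, they could be wrapped in a small helper; here is a sketch with my own, hypothetical naming (nb.eval is not used in the rest of the post):
# hypothetical helper bundling the repeated evaluation steps
nb.eval <- function(model, data, cols, target = "RainTomorrow") {
  pred <- predict(model, data[, cols])  # class predictions
  err.table <- table(pred, data[, target], dnn = list('predicted', 'actual'))
  list(table = err.table,
       rate = (err.table[1, 2] + err.table[2, 1]) / sum(err.table))  # off-diagonal share
}
# e.g.: percent(nb.eval(my.naive.1, wtest, train.col)$rate)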
Do you think that adding explanatory variables may help? Let us try.
train.col <- c("Humidity9am", "Pressure9am", "WindDir9am", "Cloud9am", "Temp9am")
my.naive.1.b <- naiveBayes(RainTomorrow~(Humidity9am+Pressure9am+WindDir9am+Cloud9am+Temp9am), data=wtrain)
my.naive.1.b.predict <- predict(my.naive.1.b, wtrain[,train.col])
train.error.table.1.b <- table(predict(my.naive.1.b, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))
train.error.table.1.b
## actual
## predicted No Yes
## No 180 20
## Yes 12 17
train.error.rate.1.b <- (train.error.table.1.b[1,2]+train.error.table.1.b[2,1])/sum(train.error.table.1.b)
test.error.table.1.b <- table(predict(my.naive.1.b, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.1.b
## actual
## predicted No Yes
## No 74 15
## Yes 2 8
test.error.rate.1.b <- (test.error.table.1.b[1,2]+test.error.table.1.b[2,1])/sum(test.error.table.1.b)
percent(train.error.rate.1.b)
## [1] "14%"
percent(test.error.rate.1.b)
## [1] "17.2%"
# ROCR analysis
my.naive.1.b.predict <- predict(my.naive.1.b, wtrain[,train.col], type='raw')
score <- my.naive.1.b.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.1.b.prediction <- prediction(score, actual_class)
my.naive.1.b.perf <- performance(my.naive.1.b.prediction, "tpr", "fpr")
my.naive.1.b.auc <- performance(my.naive.1.b.prediction, "auc")
my.naive.1.b.auc@y.values[[1]]
## [1] 0.8685248
plot(my.naive.1.b.perf, col="red")
abline(0, 1, col = "gray60")
grid()
- Second explanatory variable set
train.col <- c("Humidity3pm", "Pressure3pm", "WindDir9am")
my.naive.2 <- naiveBayes(RainTomorrow~(Humidity3pm+Pressure3pm+WindDir9am), data=wtrain)
train.error.table.2 <- table(predict(my.naive.2, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))
train.error.table.2
## actual
## predicted No Yes
## No 186 17
## Yes 6 20
train.error.rate.2 <- (train.error.table.2[1,2]+train.error.table.2[2,1])/sum(train.error.table.2)
test.error.table.2 <- table(predict(my.naive.2, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.2
## actual
## predicted No Yes
## No 74 16
## Yes 2 7
test.error.rate.2 <- (test.error.table.2[1,2]+test.error.table.2[2,1])/sum(test.error.table.2)
percent(train.error.rate.2)
## [1] "10%"
percent(test.error.rate.2)
## [1] "18.2%"
# ROCR analysis
my.naive.2.predict <- predict(my.naive.2, wtrain[,train.col], type='raw')
score <- my.naive.2.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.2.prediction <- prediction(score, actual_class)
my.naive.2.perf <- performance(my.naive.2.prediction, "tpr", "fpr")
plot(my.naive.2.perf, col="red")
abline(0, 1, col = "gray60")
grid()
my.naive.2.auc <- performance(my.naive.2.prediction, "auc")
my.naive.2.auc@y.values[[1]]
## [1] 0.8925957
Let us again try adding variables and compare the results.
train.col <- c("Humidity3pm", "Pressure3pm", "WindDir9am", "WindDir3pm", "Cloud3pm", "Temp3pm")
my.naive.2.b <- naiveBayes(RainTomorrow~(Humidity3pm+Pressure3pm+WindDir9am+WindDir3pm+Cloud3pm+Temp3pm), data=wtrain)
train.error.table.2.b <- table(predict(my.naive.2.b, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))
train.error.table.2.b
## actual
## predicted No Yes
## No 181 13
## Yes 11 24
train.error.rate.2.b <- (train.error.table.2.b[1,2]+train.error.table.2.b[2,1])/sum(train.error.table.2.b)
test.error.table.2.b <- table(predict(my.naive.2.b, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.2.b
## actual
## predicted No Yes
## No 71 14
## Yes 5 9
test.error.rate.2.b <- (test.error.table.2.b[1,2]+test.error.table.2.b[2,1])/sum(test.error.table.2.b)
percent(train.error.rate.2.b)
## [1] "10.5%"
percent(test.error.rate.2.b)
## [1] "19.2%"
# ROCR analysis
my.naive.2.b.predict <- predict(my.naive.2.b, wtrain[,train.col], type='raw')
score <- my.naive.2.b.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.2.b.prediction <- prediction(score, actual_class)
my.naive.2.b.perf <- performance(my.naive.2.b.prediction, "tpr", "fpr")
plot(my.naive.2.b.perf, col="red")
abline(0, 1, col = "gray60")
grid()
my.naive.2.b.auc <- performance(my.naive.2.b.prediction, "auc")
my.naive.2.b.auc@y.values[[1]]
## [1] 0.9083615
No remarkable improvement in the error rates when adding further variables.
- Third explanatory variable set
train.col <- c("Humidity3pm", "Pressure3pm", "Sunshine", "WindGustDir")
f <- as.formula("RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine+WindGustDir)")
my.naive.3 <- naiveBayes(f, data=wtrain)
train.error.table.3 <- table(predict(my.naive.3, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))
train.error.table.3
## actual
## predicted No Yes
## No 180 14
## Yes 12 23
train.error.rate.3 <- (train.error.table.3[1,2]+train.error.table.3[2,1])/sum(train.error.table.3)
test.error.table.3 <- table(predict(my.naive.3, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.3
## actual
## predicted No Yes
## No 75 13
## Yes 1 10
test.error.rate.3 <- (test.error.table.3[1,2]+test.error.table.3[2,1])/sum(test.error.table.3)
percent(train.error.rate.3)
## [1] "11.4%"
percent(test.error.rate.3)
## [1] "14.1%"
# ROCR analysis
my.naive.3.predict <- predict(my.naive.3, wtrain[,train.col], type='raw')
score <- my.naive.3.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.3.prediction <- prediction(score, actual_class)
my.naive.3.perf <- performance(my.naive.3.prediction, "tpr", "fpr")
plot(my.naive.3.perf, col="red")
abline(0, 1, col = "gray60")
grid()
my.naive.3.auc <- performance(my.naive.3.prediction, "auc")
my.naive.3.auc@y.values[[1]]
## [1] 0.895411
What about putting in all the variables available after 3pm? The training error rate gets slightly worse, whilst the test error rate gets somewhat better.
train.col <- c("MinTemp", "MaxTemp", "Rainfall", "Evaporation", "Sunshine", "WindGustDir", "WindGustSpeed", "WindDir3pm", "WindSpeed3pm", "Humidity3pm", "Pressure3pm", "Cloud3pm", "Temp3pm")
f <- as.formula ("RainTomorrow~(MinTemp+MaxTemp+Rainfall+Evaporation+Sunshine+WindGustDir+WindGustSpeed+WindDir3pm+WindSpeed3pm+Humidity3pm+Pressure3pm+Cloud3pm+Temp3pm)")
my.naive.all <- naiveBayes(f, data=wtrain)
train.error.table.all <- table(predict(my.naive.all, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))
train.error.table.all
## actual
## predicted No Yes
## No 174 11
## Yes 18 26
train.error.rate.all <- (train.error.table.all[1,2]+train.error.table.all[2,1])/sum(train.error.table.all)
test.error.table.all <- table(predict(my.naive.all, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.all
## actual
## predicted No Yes
## No 75 10
## Yes 1 13
test.error.rate.all <- (test.error.table.all[1,2]+test.error.table.all[2,1])/sum(test.error.table.all)
percent(train.error.rate.all)
## [1] "12.7%"
percent(test.error.rate.all)
## [1] "11.1%"
# ROCR analysis
my.naive.all.predict <- predict(my.naive.all, wtrain[,train.col], type='raw')
score <- my.naive.all.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.all.prediction <- prediction(score, actual_class)
my.naive.all.perf <- performance(my.naive.all.prediction, "tpr", "fpr")
plot(my.naive.all.perf, col="red")
abline(0, 1, col = "gray60")
grid()
my.naive.all.auc <- performance(my.naive.all.prediction, "auc")
my.naive.all.auc@y.values[[1]]
## [1] 0.8878097
Do you think that adding all the variables helps the evening weather forecast? Actually, just one more explanatory variable is enough to achieve that test error rate: WindGustSpeed.
train.col <- c("Humidity3pm", "Pressure3pm", "Sunshine", "WindGustDir", "WindGustSpeed")
f <- as.formula("RainTomorrow~(Humidity3pm+Pressure3pm+Sunshine+WindGustDir+WindGustSpeed)")
my.naive.3.b <- naiveBayes(f, data=wtrain)
train.error.table.3.b <- table(predict(my.naive.3.b, wtrain[,train.col]), wtrain[,"RainTomorrow"], dnn=list('predicted','actual'))
train.error.table.3.b
## actual
## predicted No Yes
## No 178 13
## Yes 14 24
train.error.rate.3.b <- (train.error.table.3.b[1,2]+train.error.table.3.b[2,1])/sum(train.error.table.3.b)
test.error.table.3.b <- table(predict(my.naive.3.b, wtest[,train.col]), wtest[,"RainTomorrow"], dnn=list('predicted','actual'))
test.error.table.3.b
## actual
## predicted No Yes
## No 75 10
## Yes 1 13
test.error.rate.3.b <- (test.error.table.3.b[1,2]+test.error.table.3.b[2,1])/sum(test.error.table.3.b)
percent(train.error.rate.3.b)
## [1] "11.8%"
percent(test.error.rate.3.b)
## [1] "11.1%"
# ROCR analysis
my.naive.3.b.predict <- predict(my.naive.3.b, wtrain[,train.col], type='raw')
score <- my.naive.3.b.predict[, "Yes"]
actual_class <- (wtrain[,"RainTomorrow"] == 'Yes')
my.naive.3.b.prediction <- prediction(score, actual_class)
my.naive.3.b.perf <- performance(my.naive.3.b.prediction, "tpr", "fpr")
plot(my.naive.3.b.perf, col="red")
abline(0, 1, col = "gray60")
grid()
my.naive.3.b.auc <- performance(my.naive.3.b.prediction, "auc")
my.naive.3.b.auc@y.values[[1]]
## [1] 0.8910473
So, the Naive Bayes classifier achieves better test error rates than the classification tree in our example, where I determined the explanatory variables by a preliminary analysis. The gap between training and test error rates is also smaller.
df.error.rate.1 <- data.frame(percent(train.error.rate.1), percent(test.error.rate.1), percent(train.error.rate.1.b), percent(test.error.rate.1.b))
colnames(df.error.rate.1) <- c("train.err.1", "test.err.1", "train.err.1.b", "test.err.1.b")
df.error.rate.2 <- data.frame(percent(train.error.rate.2), percent(test.error.rate.2), percent(train.error.rate.2.b), percent(test.error.rate.2.b))
colnames(df.error.rate.2) <- c("train.err.2", "test.err.2", "train.err.2.b", "test.err.2.b")
df.error.rate.3 <- data.frame(percent(train.error.rate.3), percent(test.error.rate.3), percent(train.error.rate.3.b), percent(test.error.rate.3.b))
colnames(df.error.rate.3) <- c("train.err.3", "test.err.3", "train.err.3.b", "test.err.3.b")
df.error.rate.1
## train.err.1 test.err.1 train.err.1.b test.err.1.b
## 1 15.3% 19.2% 14% 17.2%
df.error.rate.2
## train.err.2 test.err.2 train.err.2.b test.err.2.b
## 1 10% 18.2% 10.5% 19.2%
df.error.rate.3
## train.err.3 test.err.3 train.err.3.b test.err.3.b
## 1 11.4% 14.1% 11.8% 11.1%
For the morning and afternoon forecasts, the performance is close to that of the classification tree.
For the evening forecast, the test error rate is improved with respect to the classification tree.
There is no remarkable advantage in augmenting the explanatory variable sets, besides the evening forecast case (adding WindGustSpeed takes the test error rate from 14.1% to 11.1%, roughly a 21% relative improvement).
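As a complement to the error-rate summary, here is a short sketch collecting the training-set AUC values computed above into a single frame for a side-by-side view:
# side-by-side view of the training-set AUC values computed above
df.auc <- data.frame(auc.1   = my.naive.1.auc@y.values[[1]],
                     auc.1.b = my.naive.1.b.auc@y.values[[1]],
                     auc.2   = my.naive.2.auc@y.values[[1]],
                     auc.2.b = my.naive.2.b.auc@y.values[[1]],
                     auc.3   = my.naive.3.auc@y.values[[1]],
                     auc.3.b = my.naive.3.b.auc@y.values[[1]])
round(df.auc, 3)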