Weather forecast - Logistic Regression
In the previous posts series, I identified three explanatory variable sets, the following ones:
set#1: {Humidity9am, Pressure9am, WindDir9am}
set#2: {Humidity3pm, Pressure3pm, WindDir9am}
set#3: {Humidity3pm, Pressure3pm, Sunshine, WindGustDir}
Please remember that I work with three sets of variables as assume that there is the need to forecast tomorrow weather in three precise moment of the day: 9am, 3pm and evening (say 10pm). That assumption restricts the availability of some variables at specific day time.
The weather data is available at:
This time, I would like to evaluate a model based on Logistic Regression. I keep the choice of explanatory variables unchanged to allow comparison with previous models.
I start over again with the same preprocessing used in the past. Then I will evaluate both train and test error rate. Furher, I will try to add some more explanatory variable to evaluate any benefits. I will end up by making some considerations about the logistic regression models herein outlined and previously posted ones.
Hence, in the following I will go through considering the same explanatory variables set already determined in the previous post, I will apply logistic regression model enhanced with cross-validation procedure. Such cross-validation procedure is intended to identify an optimal value for the cut-off to be used in order to translate the raw probability based prediction into the categorical predicted variable, RainTomorrow.
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(ROCR))
data <- read.csv("weather.csv", header=TRUE)
data.clean <- na.omit(data)
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday, Date, Location))
attach(data.clean)
colnames(data.clean)
## [1] "MinTemp" "MaxTemp" "Rainfall" "Evaporation"
## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "WindDir9am"
## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "Humidity9am"
## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "Cloud9am"
## [17] "Cloud3pm" "Temp9am" "Temp3pm" "RainTomorrow"
Building the train and test sets.
set.seed(1)
n = nrow(data.clean)
ntrain = n*0.7
ntest = n - ntrain
trainset = sample(1:n, ntrain)
wtrain <- data.clean[trainset,]
ytrain <- RainTomorrow[trainset]
wtest <- data.clean[-trainset,]
ynew <- RainTomorrow[-trainset]
I will take advantage of the glm() function available in the stats package. Herein, the two procedures:
glm.train(), used to train and run cross-validation to determine the optimal cut-off value; overall it returns the glm() based model, the optimal cut-off value, the train and test confusion matrixes, the train and test sets based predictions
glm.perf.plot() used to plot specific ROC performances
glm.train <- function(columns, trainset) {
f <- reformulate(columns, response='RainTomorrow')
my.model <- model.matrix(f, data=data.clean)[,-1]
xtrain <- my.model[trainset,]
ytrain = data.clean$RainTomorrow[trainset]
s <- length(trainset)/4
cutoff <- list()
error <- list()
for (q in seq(0.2,0.8,by=0.05)) {
train.error.rate <- 0
for (i in 1:4) {
start <- (i-1)*s
end <- start + s
valid.set <- seq(start, end)
xcv <- xtrain[-valid.set,]
xval <- xtrain[valid.set,]
ycv <- ytrain[-valid.set]
yval <- ytrain[valid.set]
my.glm <- glm(RainTomorrow ~ . , family=binomial,
data=data.frame(RainTomorrow=ycv, xcv))
pred <- predict(my.glm, newdata=data.frame(xcv), type="response")
my.result <- floor(pred+q)
train.error.table <- table(my.result, ycv, dnn=list('predicted','actual'))
train.error.rate <- train.error.rate +
(train.error.table[1,2]+train.error.table[2,1])/sum(train.error.table)
}
avg.error <- train.error.rate/5
cutoff <- c(cutoff, q)
error <- c(error, avg.error)
}
res <- data.frame(cbind(cutoff, error))
q <- unlist(res$cutoff[which.min(res$error)])
my.glm <- glm(RainTomorrow ~ . , family=binomial,
data=data.frame(RainTomorrow=ytrain, xtrain))
pred <- predict(my.glm, newdata=data.frame(xtrain), type="response")
my.result <- floor(pred+q)
train.error.table <- table(my.result, ytrain, dnn=list('predicted','actual'))
my.glm.train.prediction <- prediction(pred, data.frame(ytrain))
xnew <- my.model[-trainset,]
ynew <- data.clean$RainTomorrow[-trainset]
pred <- predict(my.glm, newdata=data.frame(xnew), type="response")
my.result <- floor(pred+q)
test.error.table <- table(my.result, ynew, dnn=list('predicted','actual'))
my.glm.test.prediction <- prediction(pred, data.frame(ynew))
list("model" = my.glm,
"cutoff" = q,
"train.cm" = train.error.table,
"train.prediction" = my.glm.train.prediction,
"test.cm" = test.error.table,
"test.prediction" = my.glm.test.prediction)
}
glm.perf.plot <- function (prediction, cutoff) {
perf <- performance(prediction, measure = "tpr", x.measure = "fpr")
par(mfrow=(c(1,2)))
plot(perf, col="red")
abline(0, 1, col = "gray60")
abline(v=g$cutoff, col="green")
grid()
perf <- performance(prediction, measure = "acc", x.measure = "cutoff")
plot(perf, col="red")
abline(0, 1, col = "gray60")
abline(v=g$cutoff, col="green")
grid()
my.glm.1.auc <- performance(prediction, "auc")
my.glm.1.auc@y.values[[1]]
}
- First explanatory variables set
In the following I consider the first explanatory variable set as determined in previous posts and comprising Humidity9am, Pressure9am, WinDir9am. The accuracies are computed both based on train and test data. Plots are shown to outline the overall performance of the build model. A vertical green lines is plot to highlights the cross-validation determined cut-off value.
train.col <- c("Humidity9am", "Pressure9am", "WindDir9am")
g <- glm.train(train.col, trainset)
summary(g$model)
##
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.54864 -0.50576 -0.26868 -0.00018 2.72441
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 205.71085 42.46306 4.844 1.27e-06 ***
## Humidity9am 0.05364 0.01913 2.804 0.00504 **
## Pressure9am -0.20796 0.04203 -4.947 7.53e-07 ***
## WindDir9amENE 0.15178 1.53468 0.099 0.92122
## WindDir9amESE -0.83685 1.35054 -0.620 0.53549
## WindDir9amN 1.14896 0.98296 1.169 0.24245
## WindDir9amNE 2.04220 1.51457 1.348 0.17754
## WindDir9amNNE 1.02819 1.39872 0.735 0.46228
## WindDir9amNNW 0.38647 0.98362 0.393 0.69439
## WindDir9amNW -0.70595 1.11765 -0.632 0.52762
## WindDir9amS -1.34028 1.36240 -0.984 0.32523
## WindDir9amSE 1.14642 0.96408 1.189 0.23439
## WindDir9amSSE 0.41277 1.02345 0.403 0.68672
## WindDir9amSSW 1.36499 1.22768 1.112 0.26621
## WindDir9amSW -15.39435 2864.01921 -0.005 0.99571
## WindDir9amW -16.91384 3007.09329 -0.006 0.99551
## WindDir9amWNW -17.31033 1635.37065 -0.011 0.99155
## WindDir9amWSW -16.60668 3651.20200 -0.005 0.99637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 202.56 on 228 degrees of freedom
## Residual deviance: 138.30 on 211 degrees of freedom
## AIC: 174.3
##
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.3
(train.error.table.1 <- g$train.cm)
## actual
## predicted No Yes
## 0 192 30
## 1 0 7
(train.error.rate.1 <- (train.error.table.1[1,2]+train.error.table.1[2,1])/sum(train.error.table.1))
## [1] 0.1310044
(test.error.table.1 <- g$test.cm)
## actual
## predicted No Yes
## 0 76 21
## 1 0 2
(test.error.rate.1 <- (test.error.table.1[1,2]+test.error.table.1[2,1])/sum(test.error.table.1))
## [1] 0.2121212
glm.perf.plot(g$test.prediction, g$cutoff)
## [1] 0.6887872
We can see cross-validation has helped a lot in determine a quite satisfactory cut-off value.
Do you think that adding explanatory variables may help ? Let us try.
train.col <- c("Humidity9am", "Pressure9am", "WindDir9am", "Cloud9am", "Temp9am")
g <- glm.train(train.col, trainset)
summary(g$model)
##
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.77263 -0.45725 -0.22293 -0.00018 2.83380
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 166.01320 48.01414 3.458 0.000545 ***
## Humidity9am 0.04389 0.02345 1.871 0.061280 .
## Pressure9am -0.16999 0.04689 -3.625 0.000289 ***
## WindDir9amENE 0.17793 1.55962 0.114 0.909171
## WindDir9amESE -0.71374 1.36517 -0.523 0.601099
## WindDir9amN 1.54912 1.04970 1.476 0.140003
## WindDir9amNE 2.58886 1.58331 1.635 0.102029
## WindDir9amNNE 1.70263 1.46650 1.161 0.245637
## WindDir9amNNW 0.90691 1.05376 0.861 0.389433
## WindDir9amNW -0.04908 1.20904 -0.041 0.967617
## WindDir9amS -1.21007 1.39672 -0.866 0.386291
## WindDir9amSE 1.59980 1.02024 1.568 0.116867
## WindDir9amSSE 0.59139 1.06214 0.557 0.577670
## WindDir9amSSW 1.69691 1.28069 1.325 0.185173
## WindDir9amSW -14.84658 2836.56912 -0.005 0.995824
## WindDir9amW -16.38025 3043.24858 -0.005 0.995705
## WindDir9amWNW -16.49546 1658.53611 -0.010 0.992065
## WindDir9amWSW -16.13280 3752.41481 -0.004 0.996570
## Cloud9am 0.17320 0.09058 1.912 0.055854 .
## Temp9am 0.04702 0.05406 0.870 0.384498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 202.56 on 228 degrees of freedom
## Residual deviance: 133.14 on 209 degrees of freedom
## AIC: 173.14
##
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.6
(train.error.table.1.b <- g$train.cm)
## actual
## predicted No Yes
## 0 180 18
## 1 12 19
(train.error.rate.1.b <- (train.error.table.1.b[1,2]+train.error.table.1.b[2,1])/sum(train.error.table.1.b))
## [1] 0.1310044
(test.error.table.1.b <- g$test.cm)
## actual
## predicted No Yes
## 0 72 16
## 1 4 7
(test.error.rate.1.b <- (test.error.table.1.b[1,2]+test.error.table.1.b[2,1])/sum(test.error.table.1.b))
## [1] 0.2020202
glm.perf.plot(g$test.prediction, g$cutoff)
## [1] 0.7093822
- Second explanatory variables set
Similar analysis is carried on for the second explanatory variables set.
train.col <- c("Humidity3pm", "Pressure3pm", "WindDir9am")
g <- glm.train(train.col, trainset)
summary(g$model)
##
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.28242 -0.44004 -0.17979 -0.00022 2.67535
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 233.26076 53.22684 4.382 1.17e-05 ***
## Humidity3pm 0.07239 0.01657 4.368 1.26e-05 ***
## Pressure3pm -0.23555 0.05274 -4.467 7.95e-06 ***
## WindDir9amENE 0.34139 1.82959 0.187 0.8520
## WindDir9amESE -0.80248 1.46983 -0.546 0.5851
## WindDir9amN 1.79945 1.13308 1.588 0.1123
## WindDir9amNE 2.50548 1.58578 1.580 0.1141
## WindDir9amNNE 2.03462 1.49093 1.365 0.1724
## WindDir9amNNW 0.55701 1.11220 0.501 0.6165
## WindDir9amNW -1.39841 1.30275 -1.073 0.2831
## WindDir9amS -1.74255 1.49131 -1.168 0.2426
## WindDir9amSE 1.68679 1.08324 1.557 0.1194
## WindDir9amSSE 0.25055 1.15853 0.216 0.8288
## WindDir9amSSW 2.26944 1.34758 1.684 0.0922 .
## WindDir9amSW -14.49563 2830.20762 -0.005 0.9959
## WindDir9amW -16.82831 2908.63991 -0.006 0.9954
## WindDir9amWNW -16.88056 1631.40830 -0.010 0.9917
## WindDir9amWSW -15.69797 3612.77577 -0.004 0.9965
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 202.56 on 228 degrees of freedom
## Residual deviance: 113.40 on 211 degrees of freedom
## AIC: 149.4
##
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.6
(train.error.table.2 <- g$train.cm)
## actual
## predicted No Yes
## 0 183 15
## 1 9 22
(train.error.rate.2 <- (train.error.table.2[1,2]+train.error.table.2[2,1])/sum(train.error.table.2))
## [1] 0.1048035
(test.error.table.2 <- g$test.cm)
## actual
## predicted No Yes
## 0 71 15
## 1 5 8
(test.error.rate.2 <- (test.error.table.2[1,2]+test.error.table.2[2,1])/sum(test.error.table.2))
## [1] 0.2020202
glm.perf.plot(g$test.prediction, g$cutoff)
## [1] 0.708238
Let us try again to add variables and compare the outcomes with the previous results.
train.col <- c("Humidity3pm", "Pressure3pm", "WindDir9am", "WindDir3pm", "Cloud3pm", "Temp3pm")
g <- glm.train(train.col, trainset)
summary(g$model)
##
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.28997 -0.30944 -0.10985 -0.00006 2.57847
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 172.25968 66.84643 2.577 0.009968 **
## Humidity3pm 0.13439 0.04011 3.350 0.000807 ***
## Pressure3pm -0.18546 0.06556 -2.829 0.004669 **
## WindDir9amENE 1.39854 2.39825 0.583 0.559793
## WindDir9amESE 0.03006 1.77306 0.017 0.986475
## WindDir9amN 1.38148 1.46298 0.944 0.345023
## WindDir9amNE 1.36204 2.31532 0.588 0.556348
## WindDir9amNNE 2.41302 1.71967 1.403 0.160560
## WindDir9amNNW 0.46172 1.34038 0.344 0.730494
## WindDir9amNW -2.52617 1.74108 -1.451 0.146802
## WindDir9amS -0.53669 1.87903 -0.286 0.775171
## WindDir9amSE 2.32217 1.41022 1.647 0.099626 .
## WindDir9amSSE 0.08909 1.46807 0.061 0.951610
## WindDir9amSSW 3.33052 1.58521 2.101 0.035641 *
## WindDir9amSW -15.12757 4232.77034 -0.004 0.997148
## WindDir9amW -18.77704 4496.66938 -0.004 0.996668
## WindDir9amWNW -18.33332 2438.99054 -0.008 0.994003
## WindDir9amWSW -16.47654 6167.34944 -0.003 0.997868
## WindDir3pmENE 0.46093 2.88802 0.160 0.873195
## WindDir3pmESE 0.46952 2.33841 0.201 0.840865
## WindDir3pmN 1.27439 2.38525 0.534 0.593150
## WindDir3pmNE 2.34910 2.52661 0.930 0.352503
## WindDir3pmNNE 0.58721 2.71099 0.217 0.828518
## WindDir3pmNNW 4.04519 2.42298 1.670 0.095016 .
## WindDir3pmNW 2.74193 2.33792 1.173 0.240872
## WindDir3pmS 0.70608 2.57725 0.274 0.784109
## WindDir3pmSE -1.28953 2.79452 -0.461 0.644476
## WindDir3pmSSE -0.39874 2.62116 -0.152 0.879089
## WindDir3pmSSW -12.37522 4236.43693 -0.003 0.997669
## WindDir3pmSW -14.58404 4376.22158 -0.003 0.997341
## WindDir3pmW 2.65011 2.41895 1.096 0.273271
## WindDir3pmWNW 2.83866 2.33564 1.215 0.224226
## WindDir3pmWSW -14.40480 4247.93685 -0.003 0.997294
## Cloud3pm 0.25631 0.17636 1.453 0.146142
## Temp3pm 0.18675 0.07367 2.535 0.011245 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 202.559 on 228 degrees of freedom
## Residual deviance: 92.525 on 194 degrees of freedom
## AIC: 162.52
##
## Number of Fisher Scoring iterations: 18
g$cutoff
## [1] 0.4
(train.error.table.2.b <- g$train.cm)
## actual
## predicted No Yes
## 0 189 16
## 1 3 21
(train.error.rate.2.b <- (train.error.table.2.b[1,2]+train.error.table.2.b[2,1])/sum(train.error.table.2.b))
## [1] 0.08296943
(test.error.table.2.b <- g$test.cm)
## actual
## predicted No Yes
## 0 70 17
## 1 6 6
(test.error.rate.2.b <- (test.error.table.2.b[1,2]+test.error.table.2.b[2,1])/sum(test.error.table.2.b))
## [1] 0.2323232
glm.perf.plot(g$test.prediction, g$cutoff)
## [1] 0.6550343
- Third explanatory variables set
Now with the third explanatory variables set.
train.col <- c("Humidity3pm", "Pressure3pm", "Sunshine", "WindGustDir")
g <- glm.train(train.col, trainset)
summary(g$model)
##
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1252 -0.4214 -0.2439 -0.1081 2.6824
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.975e+02 4.720e+01 4.185 2.85e-05 ***
## Humidity3pm 8.185e-03 2.357e-02 0.347 0.7284
## Pressure3pm -1.949e-01 4.615e-02 -4.222 2.42e-05 ***
## Sunshine -2.799e-01 1.185e-01 -2.362 0.0182 *
## WindGustDirENE -1.912e+00 1.385e+00 -1.380 0.1675
## WindGustDirESE 5.137e-01 1.094e+00 0.469 0.6388
## WindGustDirN 1.514e-01 1.405e+00 0.108 0.9142
## WindGustDirNE 5.883e-01 1.323e+00 0.445 0.6566
## WindGustDirNNE -1.628e+01 2.881e+03 -0.006 0.9955
## WindGustDirNNW 9.888e-01 1.009e+00 0.980 0.3269
## WindGustDirNW 3.361e-01 9.236e-01 0.364 0.7159
## WindGustDirS -2.559e-01 1.236e+00 -0.207 0.8360
## WindGustDirSE -1.686e+01 2.088e+03 -0.008 0.9936
## WindGustDirSSE 1.393e-02 1.335e+00 0.010 0.9917
## WindGustDirSSW 2.261e+00 1.479e+00 1.529 0.1263
## WindGustDirSW 1.362e+00 2.129e+00 0.640 0.5224
## WindGustDirW -2.016e-02 1.136e+00 -0.018 0.9858
## WindGustDirWNW -7.733e-01 1.296e+00 -0.597 0.5508
## WindGustDirWSW -1.635e+01 6.523e+03 -0.003 0.9980
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 202.56 on 228 degrees of freedom
## Residual deviance: 120.75 on 210 degrees of freedom
## AIC: 158.75
##
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.45
(train.error.table.3 <- g$train.cm)
## actual
## predicted No Yes
## 0 186 21
## 1 6 16
(train.error.rate.3 <- (train.error.table.3[1,2]+train.error.table.3[2,1])/sum(train.error.table.3))
## [1] 0.1179039
(test.error.table.3 <- g$test.cm)
## actual
## predicted No Yes
## 0 76 14
## 1 0 9
(test.error.rate.3 <- (test.error.table.3[1,2]+test.error.table.3[2,1])/sum(test.error.table.3))
## [1] 0.1414141
glm.perf.plot(g$test.prediction, g$cutoff)
## [1] 0.9210526
What about putting all available variables afterwards 3pm in the model ?
train.col <- c("MinTemp", "MaxTemp", "Rainfall", "Evaporation", "Sunshine", "WindGustDir", "WindGustSpeed", "WindDir3pm", "WindSpeed3pm", "Humidity3pm", "Pressure3pm", "Cloud3pm", "Temp3pm")
g <- glm.train(train.col, trainset)
summary(g$model)
##
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.08317 -0.26342 -0.07195 -0.00018 2.54064
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.227e+02 9.351e+01 2.382 0.01722 *
## MinTemp -5.385e-01 1.723e-01 -3.125 0.00178 **
## MaxTemp 7.236e-02 3.745e-01 0.193 0.84679
## Rainfall -1.761e-01 1.161e-01 -1.517 0.12925
## Evaporation 1.851e-01 2.003e-01 0.924 0.35548
## Sunshine -4.012e-01 2.095e-01 -1.915 0.05543 .
## WindGustDirENE -1.369e+00 1.878e+00 -0.729 0.46613
## WindGustDirESE 9.446e-01 1.495e+00 0.632 0.52739
## WindGustDirN -4.931e-01 2.972e+00 -0.166 0.86823
## WindGustDirNE 6.246e-01 1.793e+00 0.348 0.72753
## WindGustDirNNE -1.779e+01 4.047e+03 -0.004 0.99649
## WindGustDirNNW 2.576e+00 1.653e+00 1.558 0.11913
## WindGustDirNW 1.338e+00 1.556e+00 0.860 0.38988
## WindGustDirS -7.691e-01 2.207e+00 -0.348 0.72747
## WindGustDirSE -1.881e+01 3.304e+03 -0.006 0.99546
## WindGustDirSSE 1.623e+00 2.509e+00 0.647 0.51771
## WindGustDirSSW 8.861e-01 2.929e+00 0.303 0.76225
## WindGustDirSW 1.154e+00 3.207e+00 0.360 0.71904
## WindGustDirW -2.656e-01 1.770e+00 -0.150 0.88070
## WindGustDirWNW 8.115e-01 1.915e+00 0.424 0.67175
## WindGustDirWSW -1.679e+01 1.075e+04 -0.002 0.99875
## WindGustSpeed 7.646e-02 5.415e-02 1.412 0.15791
## WindDir3pmENE -2.687e-01 2.495e+00 -0.108 0.91422
## WindDir3pmESE 1.292e+00 2.309e+00 0.559 0.57590
## WindDir3pmN -2.459e-01 2.418e+00 -0.102 0.91901
## WindDir3pmNE 1.014e+00 2.414e+00 0.420 0.67444
## WindDir3pmNNE 1.350e+00 3.154e+00 0.428 0.66873
## WindDir3pmNNW 1.111e+00 2.338e+00 0.475 0.63461
## WindDir3pmNW -8.084e-02 2.448e+00 -0.033 0.97366
## WindDir3pmS 1.291e+00 3.222e+00 0.401 0.68862
## WindDir3pmSE -6.023e-01 3.843e+00 -0.157 0.87547
## WindDir3pmSSE -1.495e+00 2.706e+00 -0.553 0.58058
## WindDir3pmSSW -9.975e+00 3.807e+03 -0.003 0.99791
## WindDir3pmSW -1.278e+01 4.596e+03 -0.003 0.99778
## WindDir3pmW 1.581e+00 2.542e+00 0.622 0.53397
## WindDir3pmWNW 6.571e-01 2.379e+00 0.276 0.78240
## WindDir3pmWSW -1.588e+01 4.307e+03 -0.004 0.99706
## WindSpeed3pm -1.386e-01 8.465e-02 -1.637 0.10153
## Humidity3pm 1.846e-01 6.804e-02 2.714 0.00666 **
## Pressure3pm -2.386e-01 9.197e-02 -2.594 0.00948 **
## Cloud3pm 9.712e-02 1.979e-01 0.491 0.62357
## Temp3pm 5.706e-01 3.896e-01 1.465 0.14300
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 202.559 on 228 degrees of freedom
## Residual deviance: 78.611 on 187 degrees of freedom
## AIC: 162.61
##
## Number of Fisher Scoring iterations: 18
g$cutoff
## [1] 0.5
(train.error.table.all <- g$train.cm)
## actual
## predicted No Yes
## 0 188 10
## 1 4 27
(train.error.rate.all <- (train.error.table.all[1,2]+train.error.table.all[2,1])/sum(train.error.table.all))
## [1] 0.06113537
(test.error.table.all <- g$test.cm)
## actual
## predicted No Yes
## 0 73 16
## 1 3 7
(test.error.rate.all <- (test.error.table.all[1,2]+test.error.table.all[2,1])/sum(test.error.table.all))
## [1] 0.1919192
glm.perf.plot(g$test.prediction, g$cutoff)
## [1] 0.7345538
Not that good. On the contrary, just one more explanatory variable is useful to achieve lower test error rate, it is the WindGustSpeed.
train.col <- c("Humidity3pm", "Pressure3pm", "Sunshine", "WindGustDir", "WindGustSpeed")
g <- glm.train(train.col, trainset)
summary(g$model)
##
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain,
## xtrain))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0433 -0.4332 -0.2450 -0.1012 2.6673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.780e+02 5.210e+01 3.417 0.000634 ***
## Humidity3pm 9.180e-03 2.352e-02 0.390 0.696328
## Pressure3pm -1.764e-01 5.065e-02 -3.482 0.000498 ***
## Sunshine -2.821e-01 1.188e-01 -2.374 0.017613 *
## WindGustDirENE -1.798e+00 1.376e+00 -1.307 0.191191
## WindGustDirESE 5.182e-01 1.082e+00 0.479 0.632144
## WindGustDirN 1.092e-01 1.423e+00 0.077 0.938816
## WindGustDirNE 7.493e-01 1.332e+00 0.563 0.573741
## WindGustDirNNE -1.610e+01 2.882e+03 -0.006 0.995541
## WindGustDirNNW 1.017e+00 9.983e-01 1.019 0.308293
## WindGustDirNW 1.924e-01 9.436e-01 0.204 0.838453
## WindGustDirS -3.135e-01 1.209e+00 -0.259 0.795391
## WindGustDirSE -1.688e+01 2.082e+03 -0.008 0.993533
## WindGustDirSSE 8.404e-03 1.334e+00 0.006 0.994974
## WindGustDirSSW 2.002e+00 1.546e+00 1.295 0.195307
## WindGustDirSW 1.522e+00 2.205e+00 0.690 0.490007
## WindGustDirW -9.433e-02 1.139e+00 -0.083 0.933975
## WindGustDirWNW -9.320e-01 1.311e+00 -0.711 0.477132
## WindGustDirWSW -1.634e+01 6.523e+03 -0.003 0.998002
## WindGustSpeed 1.748e-02 2.149e-02 0.813 0.416087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 202.56 on 228 degrees of freedom
## Residual deviance: 120.09 on 209 degrees of freedom
## AIC: 160.09
##
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.5
(train.error.table.3.b <- g$train.cm)
## actual
## predicted No Yes
## 0 184 16
## 1 8 21
(train.error.rate.3.b <- (train.error.table.3.b[1,2]+train.error.table.3.b[2,1])/sum(train.error.table.3.b))
## [1] 0.1048035
(test.error.table.3.b <- g$test.cm)
## actual
## predicted No Yes
## 0 76 13
## 1 0 10
(test.error.rate.3.b <- (test.error.table.3.b[1,2]+test.error.table.3.b[2,1])/sum(test.error.table.3.b))
## [1] 0.1313131
glm.perf.plot(g$test.prediction, g$cutoff)
## [1] 0.9262014
Now I collect all the resulting error rates to compare altogether.
df.error.rate.1 <- data.frame(percent(train.error.rate.1), percent(test.error.rate.1), percent(train.error.rate.1.b), percent(test.error.rate.1.b))
colnames(df.error.rate.1) <- c("train.err.1", "test.err.1", "train.err.1.b", "test.err.1.b")
df.error.rate.2 <- data.frame(percent(train.error.rate.2), percent(test.error.rate.2), percent(train.error.rate.2.b), percent(test.error.rate.2.b))
colnames(df.error.rate.2) <- c("train.err.2", "test.err.2", "train.err.2.b", "test.err.2.b")
df.error.rate.3 <- data.frame(percent(train.error.rate.3), percent(test.error.rate.3), percent(train.error.rate.3.b), percent(test.error.rate.3.b))
colnames(df.error.rate.3) <- c("train.err.3", "test.err.3", "train.err.3.b", "test.err.3.b")
df.error.rate.1
## train.err.1 test.err.1 train.err.1.b test.err.1.b
## 1 13.1% 21.2% 13.1% 20.2%
df.error.rate.2
## train.err.2 test.err.2 train.err.2.b test.err.2.b
## 1 10.5% 20.2% 8.3% 23.2%
df.error.rate.3
## train.err.3 test.err.3 train.err.3.b test.err.3.b
## 1 11.8% 14.1% 10.5% 13.1%
We can see how the third logistic regression model has got better train and test error rate than the other models. Having train and test error rates close values, we can say such model has got bias and variance much more in control than the other models.
Overall, the logistic regression test error rates are slightly better than the classification tree model ones.
Further, the best logistic regression based model has got worse test set based accuracy than the best model shown based on Naive Bayes.