Saturday, December 6, 2014

Weather forecast going logistic

Weather forecast - Logistic Regression

In the previous posts series, I identified three explanatory variable sets, the following ones:

set#1: {Humidity9am, Pressure9am, WindDir9am}

set#2: {Humidity3pm, Pressure3pm, WindDir9am}

set#3: {Humidity3pm, Pressure3pm, Sunshine, WindGustDir}

Please remember that I work with three sets of variables as assume that there is the need to forecast tomorrow weather in three precise moment of the day: 9am, 3pm and evening (say 10pm). That assumption restricts the availability of some variables at specific day time.

The weather data is available at:

This time, I would like to evaluate a model based on Logistic Regression. I keep the choice of explanatory variables unchanged to allow comparison with previous models.

I start over again with the same preprocessing used in the past. Then I will evaluate both train and test error rate. Furher, I will try to add some more explanatory variable to evaluate any benefits. I will end up by making some considerations about the logistic regression models herein outlined and previously posted ones.

Hence, in the following I will go through considering the same explanatory variables set already determined in the previous post, I will apply logistic regression model enhanced with cross-validation procedure. Such cross-validation procedure is intended to identify an optimal value for the cut-off to be used in order to translate the raw probability based prediction into the categorical predicted variable, RainTomorrow.

suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(ROCR))

data <- read.csv("weather.csv", header=TRUE)
data.clean <- na.omit(data)
data.clean <- subset(data.clean, select = -c(RISK_MM, RainToday, Date, Location))
attach(data.clean)
colnames(data.clean)
##  [1] "MinTemp"       "MaxTemp"       "Rainfall"      "Evaporation"  
##  [5] "Sunshine"      "WindGustDir"   "WindGustSpeed" "WindDir9am"   
##  [9] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"  
## [13] "Humidity3pm"   "Pressure9am"   "Pressure3pm"   "Cloud9am"     
## [17] "Cloud3pm"      "Temp9am"       "Temp3pm"       "RainTomorrow"

Building the train and test sets.

set.seed(1)
n = nrow(data.clean)
ntrain = n*0.7
ntest = n - ntrain
trainset = sample(1:n, ntrain)
wtrain <- data.clean[trainset,]
ytrain <- RainTomorrow[trainset]
wtest <- data.clean[-trainset,]
ynew <- RainTomorrow[-trainset]

I will take advantage of the glm() function available in the stats package. Herein, the two procedures:

  • glm.train(), used to train and run cross-validation to determine the optimal cut-off value; overall it returns the glm() based model, the optimal cut-off value, the train and test confusion matrixes, the train and test sets based predictions

  • glm.perf.plot() used to plot specific ROC performances

glm.train <- function(columns, trainset) {
   f <- reformulate(columns, response='RainTomorrow')
   my.model <- model.matrix(f, data=data.clean)[,-1]
   xtrain <- my.model[trainset,]
   ytrain = data.clean$RainTomorrow[trainset]

   s <- length(trainset)/4
   cutoff <- list()
   error <- list()
   for (q in seq(0.2,0.8,by=0.05)) {
     train.error.rate <- 0
     for (i in 1:4) {
       start <- (i-1)*s 
       end <- start + s
       valid.set <- seq(start, end)
       xcv <- xtrain[-valid.set,]
       xval <- xtrain[valid.set,]
       ycv <- ytrain[-valid.set]
       yval <- ytrain[valid.set]
       my.glm <- glm(RainTomorrow ~ . , family=binomial, 
                   data=data.frame(RainTomorrow=ycv, xcv))
       pred <- predict(my.glm, newdata=data.frame(xcv), type="response")
       my.result <- floor(pred+q)
       train.error.table <- table(my.result, ycv, dnn=list('predicted','actual'))
       train.error.rate <-  train.error.rate +   
          (train.error.table[1,2]+train.error.table[2,1])/sum(train.error.table)
     }
     avg.error <- train.error.rate/5
     cutoff <- c(cutoff, q)
     error <- c(error, avg.error)
   }
   res <- data.frame(cbind(cutoff, error))
   q <- unlist(res$cutoff[which.min(res$error)])
   
   my.glm <- glm(RainTomorrow ~ . , family=binomial, 
                 data=data.frame(RainTomorrow=ytrain, xtrain))
   pred <- predict(my.glm, newdata=data.frame(xtrain), type="response")
   my.result <- floor(pred+q)
   train.error.table <- table(my.result, ytrain, dnn=list('predicted','actual'))
   my.glm.train.prediction <- prediction(pred, data.frame(ytrain))
   
   xnew <- my.model[-trainset,]
   ynew <- data.clean$RainTomorrow[-trainset]
   pred <- predict(my.glm, newdata=data.frame(xnew), type="response")
   my.result <- floor(pred+q)
   test.error.table <- table(my.result, ynew, dnn=list('predicted','actual'))
   my.glm.test.prediction <- prediction(pred, data.frame(ynew))
   
   list("model" = my.glm,
        "cutoff" = q, 
        "train.cm" = train.error.table, 
        "train.prediction" = my.glm.train.prediction,
        "test.cm" = test.error.table,
        "test.prediction" = my.glm.test.prediction)
}

glm.perf.plot <- function (prediction, cutoff) {
    perf <- performance(prediction, measure = "tpr", x.measure = "fpr")     
    par(mfrow=(c(1,2)))
    plot(perf, col="red")
    abline(0, 1, col = "gray60")
    abline(v=g$cutoff, col="green")
    grid()
    perf <- performance(prediction, measure = "acc", x.measure = "cutoff")      
    plot(perf, col="red")
    abline(0, 1, col = "gray60")
    abline(v=g$cutoff, col="green")
    grid()
    my.glm.1.auc <- performance(prediction, "auc")
    my.glm.1.auc@y.values[[1]]
}
  • First explanatory variables set

In the following I consider the first explanatory variable set as determined in previous posts and comprising Humidity9am, Pressure9am, WinDir9am. The accuracies are computed both based on train and test data. Plots are shown to outline the overall performance of the build model. A vertical green lines is plot to highlights the cross-validation determined cut-off value.

train.col <- c("Humidity9am", "Pressure9am", "WindDir9am")

g <- glm.train(train.col, trainset)
summary(g$model)
## 
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.54864  -0.50576  -0.26868  -0.00018   2.72441  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    205.71085   42.46306   4.844 1.27e-06 ***
## Humidity9am      0.05364    0.01913   2.804  0.00504 ** 
## Pressure9am     -0.20796    0.04203  -4.947 7.53e-07 ***
## WindDir9amENE    0.15178    1.53468   0.099  0.92122    
## WindDir9amESE   -0.83685    1.35054  -0.620  0.53549    
## WindDir9amN      1.14896    0.98296   1.169  0.24245    
## WindDir9amNE     2.04220    1.51457   1.348  0.17754    
## WindDir9amNNE    1.02819    1.39872   0.735  0.46228    
## WindDir9amNNW    0.38647    0.98362   0.393  0.69439    
## WindDir9amNW    -0.70595    1.11765  -0.632  0.52762    
## WindDir9amS     -1.34028    1.36240  -0.984  0.32523    
## WindDir9amSE     1.14642    0.96408   1.189  0.23439    
## WindDir9amSSE    0.41277    1.02345   0.403  0.68672    
## WindDir9amSSW    1.36499    1.22768   1.112  0.26621    
## WindDir9amSW   -15.39435 2864.01921  -0.005  0.99571    
## WindDir9amW    -16.91384 3007.09329  -0.006  0.99551    
## WindDir9amWNW  -17.31033 1635.37065  -0.011  0.99155    
## WindDir9amWSW  -16.60668 3651.20200  -0.005  0.99637    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 202.56  on 228  degrees of freedom
## Residual deviance: 138.30  on 211  degrees of freedom
## AIC: 174.3
## 
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.3
(train.error.table.1 <- g$train.cm)
##          actual
## predicted  No Yes
##         0 192  30
##         1   0   7
(train.error.rate.1 <- (train.error.table.1[1,2]+train.error.table.1[2,1])/sum(train.error.table.1))
## [1] 0.1310044
(test.error.table.1 <- g$test.cm)
##          actual
## predicted No Yes
##         0 76  21
##         1  0   2
(test.error.rate.1 <- (test.error.table.1[1,2]+test.error.table.1[2,1])/sum(test.error.table.1))
## [1] 0.2121212
glm.perf.plot(g$test.prediction, g$cutoff)

## [1] 0.6887872

We can see cross-validation has helped a lot in determine a quite satisfactory cut-off value.

Do you think that adding explanatory variables may help ? Let us try.

train.col <- c("Humidity9am", "Pressure9am", "WindDir9am", "Cloud9am", "Temp9am")
g <- glm.train(train.col, trainset)
summary(g$model)
## 
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.77263  -0.45725  -0.22293  -0.00018   2.83380  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    166.01320   48.01414   3.458 0.000545 ***
## Humidity9am      0.04389    0.02345   1.871 0.061280 .  
## Pressure9am     -0.16999    0.04689  -3.625 0.000289 ***
## WindDir9amENE    0.17793    1.55962   0.114 0.909171    
## WindDir9amESE   -0.71374    1.36517  -0.523 0.601099    
## WindDir9amN      1.54912    1.04970   1.476 0.140003    
## WindDir9amNE     2.58886    1.58331   1.635 0.102029    
## WindDir9amNNE    1.70263    1.46650   1.161 0.245637    
## WindDir9amNNW    0.90691    1.05376   0.861 0.389433    
## WindDir9amNW    -0.04908    1.20904  -0.041 0.967617    
## WindDir9amS     -1.21007    1.39672  -0.866 0.386291    
## WindDir9amSE     1.59980    1.02024   1.568 0.116867    
## WindDir9amSSE    0.59139    1.06214   0.557 0.577670    
## WindDir9amSSW    1.69691    1.28069   1.325 0.185173    
## WindDir9amSW   -14.84658 2836.56912  -0.005 0.995824    
## WindDir9amW    -16.38025 3043.24858  -0.005 0.995705    
## WindDir9amWNW  -16.49546 1658.53611  -0.010 0.992065    
## WindDir9amWSW  -16.13280 3752.41481  -0.004 0.996570    
## Cloud9am         0.17320    0.09058   1.912 0.055854 .  
## Temp9am          0.04702    0.05406   0.870 0.384498    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 202.56  on 228  degrees of freedom
## Residual deviance: 133.14  on 209  degrees of freedom
## AIC: 173.14
## 
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.6
(train.error.table.1.b <- g$train.cm)
##          actual
## predicted  No Yes
##         0 180  18
##         1  12  19
(train.error.rate.1.b <- (train.error.table.1.b[1,2]+train.error.table.1.b[2,1])/sum(train.error.table.1.b))
## [1] 0.1310044
(test.error.table.1.b <- g$test.cm)
##          actual
## predicted No Yes
##         0 72  16
##         1  4   7
(test.error.rate.1.b <- (test.error.table.1.b[1,2]+test.error.table.1.b[2,1])/sum(test.error.table.1.b))
## [1] 0.2020202
glm.perf.plot(g$test.prediction, g$cutoff)

## [1] 0.7093822
  • Second explanatory variables set

Similar analysis is carried on for the second explanatory variables set.

train.col <- c("Humidity3pm", "Pressure3pm", "WindDir9am")

g <- glm.train(train.col, trainset)
summary(g$model)
## 
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.28242  -0.44004  -0.17979  -0.00022   2.67535  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    233.26076   53.22684   4.382 1.17e-05 ***
## Humidity3pm      0.07239    0.01657   4.368 1.26e-05 ***
## Pressure3pm     -0.23555    0.05274  -4.467 7.95e-06 ***
## WindDir9amENE    0.34139    1.82959   0.187   0.8520    
## WindDir9amESE   -0.80248    1.46983  -0.546   0.5851    
## WindDir9amN      1.79945    1.13308   1.588   0.1123    
## WindDir9amNE     2.50548    1.58578   1.580   0.1141    
## WindDir9amNNE    2.03462    1.49093   1.365   0.1724    
## WindDir9amNNW    0.55701    1.11220   0.501   0.6165    
## WindDir9amNW    -1.39841    1.30275  -1.073   0.2831    
## WindDir9amS     -1.74255    1.49131  -1.168   0.2426    
## WindDir9amSE     1.68679    1.08324   1.557   0.1194    
## WindDir9amSSE    0.25055    1.15853   0.216   0.8288    
## WindDir9amSSW    2.26944    1.34758   1.684   0.0922 .  
## WindDir9amSW   -14.49563 2830.20762  -0.005   0.9959    
## WindDir9amW    -16.82831 2908.63991  -0.006   0.9954    
## WindDir9amWNW  -16.88056 1631.40830  -0.010   0.9917    
## WindDir9amWSW  -15.69797 3612.77577  -0.004   0.9965    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 202.56  on 228  degrees of freedom
## Residual deviance: 113.40  on 211  degrees of freedom
## AIC: 149.4
## 
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.6
(train.error.table.2 <- g$train.cm)
##          actual
## predicted  No Yes
##         0 183  15
##         1   9  22
(train.error.rate.2 <- (train.error.table.2[1,2]+train.error.table.2[2,1])/sum(train.error.table.2))
## [1] 0.1048035
(test.error.table.2 <- g$test.cm)
##          actual
## predicted No Yes
##         0 71  15
##         1  5   8
(test.error.rate.2 <- (test.error.table.2[1,2]+test.error.table.2[2,1])/sum(test.error.table.2))
## [1] 0.2020202
glm.perf.plot(g$test.prediction, g$cutoff)

## [1] 0.708238

Let us try again to add variables and compare the outcomes with the previous results.

train.col <- c("Humidity3pm", "Pressure3pm", "WindDir9am", "WindDir3pm", "Cloud3pm", "Temp3pm")

g <- glm.train(train.col, trainset)
summary(g$model)
## 
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.28997  -0.30944  -0.10985  -0.00006   2.57847  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    172.25968   66.84643   2.577 0.009968 ** 
## Humidity3pm      0.13439    0.04011   3.350 0.000807 ***
## Pressure3pm     -0.18546    0.06556  -2.829 0.004669 ** 
## WindDir9amENE    1.39854    2.39825   0.583 0.559793    
## WindDir9amESE    0.03006    1.77306   0.017 0.986475    
## WindDir9amN      1.38148    1.46298   0.944 0.345023    
## WindDir9amNE     1.36204    2.31532   0.588 0.556348    
## WindDir9amNNE    2.41302    1.71967   1.403 0.160560    
## WindDir9amNNW    0.46172    1.34038   0.344 0.730494    
## WindDir9amNW    -2.52617    1.74108  -1.451 0.146802    
## WindDir9amS     -0.53669    1.87903  -0.286 0.775171    
## WindDir9amSE     2.32217    1.41022   1.647 0.099626 .  
## WindDir9amSSE    0.08909    1.46807   0.061 0.951610    
## WindDir9amSSW    3.33052    1.58521   2.101 0.035641 *  
## WindDir9amSW   -15.12757 4232.77034  -0.004 0.997148    
## WindDir9amW    -18.77704 4496.66938  -0.004 0.996668    
## WindDir9amWNW  -18.33332 2438.99054  -0.008 0.994003    
## WindDir9amWSW  -16.47654 6167.34944  -0.003 0.997868    
## WindDir3pmENE    0.46093    2.88802   0.160 0.873195    
## WindDir3pmESE    0.46952    2.33841   0.201 0.840865    
## WindDir3pmN      1.27439    2.38525   0.534 0.593150    
## WindDir3pmNE     2.34910    2.52661   0.930 0.352503    
## WindDir3pmNNE    0.58721    2.71099   0.217 0.828518    
## WindDir3pmNNW    4.04519    2.42298   1.670 0.095016 .  
## WindDir3pmNW     2.74193    2.33792   1.173 0.240872    
## WindDir3pmS      0.70608    2.57725   0.274 0.784109    
## WindDir3pmSE    -1.28953    2.79452  -0.461 0.644476    
## WindDir3pmSSE   -0.39874    2.62116  -0.152 0.879089    
## WindDir3pmSSW  -12.37522 4236.43693  -0.003 0.997669    
## WindDir3pmSW   -14.58404 4376.22158  -0.003 0.997341    
## WindDir3pmW      2.65011    2.41895   1.096 0.273271    
## WindDir3pmWNW    2.83866    2.33564   1.215 0.224226    
## WindDir3pmWSW  -14.40480 4247.93685  -0.003 0.997294    
## Cloud3pm         0.25631    0.17636   1.453 0.146142    
## Temp3pm          0.18675    0.07367   2.535 0.011245 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 202.559  on 228  degrees of freedom
## Residual deviance:  92.525  on 194  degrees of freedom
## AIC: 162.52
## 
## Number of Fisher Scoring iterations: 18
g$cutoff
## [1] 0.4
(train.error.table.2.b <- g$train.cm)
##          actual
## predicted  No Yes
##         0 189  16
##         1   3  21
(train.error.rate.2.b <- (train.error.table.2.b[1,2]+train.error.table.2.b[2,1])/sum(train.error.table.2.b))
## [1] 0.08296943
(test.error.table.2.b <- g$test.cm)
##          actual
## predicted No Yes
##         0 70  17
##         1  6   6
(test.error.rate.2.b <- (test.error.table.2.b[1,2]+test.error.table.2.b[2,1])/sum(test.error.table.2.b))
## [1] 0.2323232
glm.perf.plot(g$test.prediction, g$cutoff)

## [1] 0.6550343
  • Third explanatory variables set

Now with the third explanatory variables set.

train.col <- c("Humidity3pm", "Pressure3pm", "Sunshine", "WindGustDir")

g <- glm.train(train.col, trainset)
summary(g$model)
## 
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1252  -0.4214  -0.2439  -0.1081   2.6824  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.975e+02  4.720e+01   4.185 2.85e-05 ***
## Humidity3pm     8.185e-03  2.357e-02   0.347   0.7284    
## Pressure3pm    -1.949e-01  4.615e-02  -4.222 2.42e-05 ***
## Sunshine       -2.799e-01  1.185e-01  -2.362   0.0182 *  
## WindGustDirENE -1.912e+00  1.385e+00  -1.380   0.1675    
## WindGustDirESE  5.137e-01  1.094e+00   0.469   0.6388    
## WindGustDirN    1.514e-01  1.405e+00   0.108   0.9142    
## WindGustDirNE   5.883e-01  1.323e+00   0.445   0.6566    
## WindGustDirNNE -1.628e+01  2.881e+03  -0.006   0.9955    
## WindGustDirNNW  9.888e-01  1.009e+00   0.980   0.3269    
## WindGustDirNW   3.361e-01  9.236e-01   0.364   0.7159    
## WindGustDirS   -2.559e-01  1.236e+00  -0.207   0.8360    
## WindGustDirSE  -1.686e+01  2.088e+03  -0.008   0.9936    
## WindGustDirSSE  1.393e-02  1.335e+00   0.010   0.9917    
## WindGustDirSSW  2.261e+00  1.479e+00   1.529   0.1263    
## WindGustDirSW   1.362e+00  2.129e+00   0.640   0.5224    
## WindGustDirW   -2.016e-02  1.136e+00  -0.018   0.9858    
## WindGustDirWNW -7.733e-01  1.296e+00  -0.597   0.5508    
## WindGustDirWSW -1.635e+01  6.523e+03  -0.003   0.9980    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 202.56  on 228  degrees of freedom
## Residual deviance: 120.75  on 210  degrees of freedom
## AIC: 158.75
## 
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.45
(train.error.table.3 <- g$train.cm)
##          actual
## predicted  No Yes
##         0 186  21
##         1   6  16
(train.error.rate.3 <- (train.error.table.3[1,2]+train.error.table.3[2,1])/sum(train.error.table.3))
## [1] 0.1179039
(test.error.table.3 <- g$test.cm)
##          actual
## predicted No Yes
##         0 76  14
##         1  0   9
(test.error.rate.3 <- (test.error.table.3[1,2]+test.error.table.3[2,1])/sum(test.error.table.3))
## [1] 0.1414141
glm.perf.plot(g$test.prediction, g$cutoff)

## [1] 0.9210526

What about putting all available variables afterwards 3pm in the model ?

train.col <- c("MinTemp", "MaxTemp", "Rainfall", "Evaporation", "Sunshine", "WindGustDir", "WindGustSpeed", "WindDir3pm", "WindSpeed3pm",  "Humidity3pm",  "Pressure3pm", "Cloud3pm", "Temp3pm")

g <- glm.train(train.col, trainset)
summary(g$model)
## 
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.08317  -0.26342  -0.07195  -0.00018   2.54064  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     2.227e+02  9.351e+01   2.382  0.01722 * 
## MinTemp        -5.385e-01  1.723e-01  -3.125  0.00178 **
## MaxTemp         7.236e-02  3.745e-01   0.193  0.84679   
## Rainfall       -1.761e-01  1.161e-01  -1.517  0.12925   
## Evaporation     1.851e-01  2.003e-01   0.924  0.35548   
## Sunshine       -4.012e-01  2.095e-01  -1.915  0.05543 . 
## WindGustDirENE -1.369e+00  1.878e+00  -0.729  0.46613   
## WindGustDirESE  9.446e-01  1.495e+00   0.632  0.52739   
## WindGustDirN   -4.931e-01  2.972e+00  -0.166  0.86823   
## WindGustDirNE   6.246e-01  1.793e+00   0.348  0.72753   
## WindGustDirNNE -1.779e+01  4.047e+03  -0.004  0.99649   
## WindGustDirNNW  2.576e+00  1.653e+00   1.558  0.11913   
## WindGustDirNW   1.338e+00  1.556e+00   0.860  0.38988   
## WindGustDirS   -7.691e-01  2.207e+00  -0.348  0.72747   
## WindGustDirSE  -1.881e+01  3.304e+03  -0.006  0.99546   
## WindGustDirSSE  1.623e+00  2.509e+00   0.647  0.51771   
## WindGustDirSSW  8.861e-01  2.929e+00   0.303  0.76225   
## WindGustDirSW   1.154e+00  3.207e+00   0.360  0.71904   
## WindGustDirW   -2.656e-01  1.770e+00  -0.150  0.88070   
## WindGustDirWNW  8.115e-01  1.915e+00   0.424  0.67175   
## WindGustDirWSW -1.679e+01  1.075e+04  -0.002  0.99875   
## WindGustSpeed   7.646e-02  5.415e-02   1.412  0.15791   
## WindDir3pmENE  -2.687e-01  2.495e+00  -0.108  0.91422   
## WindDir3pmESE   1.292e+00  2.309e+00   0.559  0.57590   
## WindDir3pmN    -2.459e-01  2.418e+00  -0.102  0.91901   
## WindDir3pmNE    1.014e+00  2.414e+00   0.420  0.67444   
## WindDir3pmNNE   1.350e+00  3.154e+00   0.428  0.66873   
## WindDir3pmNNW   1.111e+00  2.338e+00   0.475  0.63461   
## WindDir3pmNW   -8.084e-02  2.448e+00  -0.033  0.97366   
## WindDir3pmS     1.291e+00  3.222e+00   0.401  0.68862   
## WindDir3pmSE   -6.023e-01  3.843e+00  -0.157  0.87547   
## WindDir3pmSSE  -1.495e+00  2.706e+00  -0.553  0.58058   
## WindDir3pmSSW  -9.975e+00  3.807e+03  -0.003  0.99791   
## WindDir3pmSW   -1.278e+01  4.596e+03  -0.003  0.99778   
## WindDir3pmW     1.581e+00  2.542e+00   0.622  0.53397   
## WindDir3pmWNW   6.571e-01  2.379e+00   0.276  0.78240   
## WindDir3pmWSW  -1.588e+01  4.307e+03  -0.004  0.99706   
## WindSpeed3pm   -1.386e-01  8.465e-02  -1.637  0.10153   
## Humidity3pm     1.846e-01  6.804e-02   2.714  0.00666 **
## Pressure3pm    -2.386e-01  9.197e-02  -2.594  0.00948 **
## Cloud3pm        9.712e-02  1.979e-01   0.491  0.62357   
## Temp3pm         5.706e-01  3.896e-01   1.465  0.14300   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 202.559  on 228  degrees of freedom
## Residual deviance:  78.611  on 187  degrees of freedom
## AIC: 162.61
## 
## Number of Fisher Scoring iterations: 18
g$cutoff
## [1] 0.5
(train.error.table.all <- g$train.cm)
##          actual
## predicted  No Yes
##         0 188  10
##         1   4  27
(train.error.rate.all <- (train.error.table.all[1,2]+train.error.table.all[2,1])/sum(train.error.table.all))
## [1] 0.06113537
(test.error.table.all <- g$test.cm)
##          actual
## predicted No Yes
##         0 73  16
##         1  3   7
(test.error.rate.all <- (test.error.table.all[1,2]+test.error.table.all[2,1])/sum(test.error.table.all))
## [1] 0.1919192
glm.perf.plot(g$test.prediction, g$cutoff)

## [1] 0.7345538

Not that good. On the contrary, just one more explanatory variable is useful to achieve lower test error rate, it is the WindGustSpeed.

train.col <- c("Humidity3pm", "Pressure3pm", "Sunshine", "WindGustDir", "WindGustSpeed")

g <- glm.train(train.col, trainset)
summary(g$model)
## 
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial, data = data.frame(RainTomorrow = ytrain, 
##     xtrain))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0433  -0.4332  -0.2450  -0.1012   2.6673  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.780e+02  5.210e+01   3.417 0.000634 ***
## Humidity3pm     9.180e-03  2.352e-02   0.390 0.696328    
## Pressure3pm    -1.764e-01  5.065e-02  -3.482 0.000498 ***
## Sunshine       -2.821e-01  1.188e-01  -2.374 0.017613 *  
## WindGustDirENE -1.798e+00  1.376e+00  -1.307 0.191191    
## WindGustDirESE  5.182e-01  1.082e+00   0.479 0.632144    
## WindGustDirN    1.092e-01  1.423e+00   0.077 0.938816    
## WindGustDirNE   7.493e-01  1.332e+00   0.563 0.573741    
## WindGustDirNNE -1.610e+01  2.882e+03  -0.006 0.995541    
## WindGustDirNNW  1.017e+00  9.983e-01   1.019 0.308293    
## WindGustDirNW   1.924e-01  9.436e-01   0.204 0.838453    
## WindGustDirS   -3.135e-01  1.209e+00  -0.259 0.795391    
## WindGustDirSE  -1.688e+01  2.082e+03  -0.008 0.993533    
## WindGustDirSSE  8.404e-03  1.334e+00   0.006 0.994974    
## WindGustDirSSW  2.002e+00  1.546e+00   1.295 0.195307    
## WindGustDirSW   1.522e+00  2.205e+00   0.690 0.490007    
## WindGustDirW   -9.433e-02  1.139e+00  -0.083 0.933975    
## WindGustDirWNW -9.320e-01  1.311e+00  -0.711 0.477132    
## WindGustDirWSW -1.634e+01  6.523e+03  -0.003 0.998002    
## WindGustSpeed   1.748e-02  2.149e-02   0.813 0.416087    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 202.56  on 228  degrees of freedom
## Residual deviance: 120.09  on 209  degrees of freedom
## AIC: 160.09
## 
## Number of Fisher Scoring iterations: 17
g$cutoff
## [1] 0.5
(train.error.table.3.b <- g$train.cm)
##          actual
## predicted  No Yes
##         0 184  16
##         1   8  21
(train.error.rate.3.b <- (train.error.table.3.b[1,2]+train.error.table.3.b[2,1])/sum(train.error.table.3.b))
## [1] 0.1048035
(test.error.table.3.b <- g$test.cm)
##          actual
## predicted No Yes
##         0 76  13
##         1  0  10
(test.error.rate.3.b <- (test.error.table.3.b[1,2]+test.error.table.3.b[2,1])/sum(test.error.table.3.b))
## [1] 0.1313131
glm.perf.plot(g$test.prediction, g$cutoff)

## [1] 0.9262014

Now I collect all the resulting error rates to compare altogether.

df.error.rate.1 <- data.frame(percent(train.error.rate.1), percent(test.error.rate.1), percent(train.error.rate.1.b), percent(test.error.rate.1.b))
colnames(df.error.rate.1) <- c("train.err.1", "test.err.1", "train.err.1.b", "test.err.1.b")
df.error.rate.2 <- data.frame(percent(train.error.rate.2), percent(test.error.rate.2), percent(train.error.rate.2.b), percent(test.error.rate.2.b))
colnames(df.error.rate.2) <- c("train.err.2", "test.err.2", "train.err.2.b", "test.err.2.b")
df.error.rate.3 <- data.frame(percent(train.error.rate.3), percent(test.error.rate.3), percent(train.error.rate.3.b), percent(test.error.rate.3.b))
colnames(df.error.rate.3) <- c("train.err.3", "test.err.3", "train.err.3.b", "test.err.3.b")

df.error.rate.1
##   train.err.1 test.err.1 train.err.1.b test.err.1.b
## 1       13.1%      21.2%         13.1%        20.2%
df.error.rate.2
##   train.err.2 test.err.2 train.err.2.b test.err.2.b
## 1       10.5%      20.2%          8.3%        23.2%
df.error.rate.3
##   train.err.3 test.err.3 train.err.3.b test.err.3.b
## 1       11.8%      14.1%         10.5%        13.1%

We can see how the third logistic regression model has got better train and test error rate than the other models. Having train and test error rates close values, we can say such model has got bias and variance much more in control than the other models.

Overall, the logistic regression test error rates are slightly better than the classification tree model ones.

Further, the best logistic regression based model has got worse test set based accuracy than the best model shown based on Naive Bayes.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.