Saturday, September 30, 2017

The Arbuthnot dataset - part 2

ARIMA Models

Introduction

This is the continuation of the analysis started in my previous post. Let us load packages and dataset used for the present analysis.

suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(astsa))
suppressPackageStartupMessages(library(fUnitRoots))
suppressPackageStartupMessages(library(strucchange))
suppressPackageStartupMessages(library(lmtest))
suppressPackageStartupMessages(library(Rmisc))
url <- "https://www.openintro.org/stat/data/arbuthnot.csv"
abhutondot <- read.csv(url, header=TRUE)

Analysis

As anticipated, I am going to analysis the fractional excess of boys with respect girls births. The excess_ts time series is so determined.

boys.ts <- ts(abhutondot$boys, frequency=1, start = abhutondot$year[1])
girls.ts <- ts(abhutondot$girls, frequency=1, start = abhutondot$year[1])
delta.ts <- boys.ts - girls.ts
excess_ts <- delta.ts/girls.ts
plot.ts(excess_ts)

summary(excess_ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01067 0.04847 0.06470 0.07075 0.08751 0.15610

As the summary above shows, the yearly boys births always exceed girls ones within the observation period of years. The following t.test() verifies if the non-zero mean is statistically significant.

t.test(excess_ts)
## 
##  One Sample t-test
## 
## data:  excess_ts
## t = 20.498, df = 81, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.06388090 0.07761528
## sample estimates:
##  mean of x 
## 0.07074809

Above report confirms it is.

The significance of this test answers question #1 by saying that the boys births excess is statistically significative.

Unit root tests for no-drift and drift scenarios are run.

urdftest_lag = floor(12*(length(excess_ts)/100)^0.25)
urdfTest(excess_ts, type = "nc", lags = urdftest_lag, doplot = FALSE)
## 
## Title:
##  Augmented Dickey-Fuller Unit Root Test
## 
## Test Results:
##   
##   Test regression none 
##   
##   Call:
##   lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##   
##   Residuals:
##         Min        1Q    Median        3Q       Max 
##   -0.052739 -0.018246 -0.002899  0.019396  0.069349 
##   
##   Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
##   z.lag.1      -0.01465    0.05027  -0.291 0.771802    
##   z.diff.lag1  -0.71838    0.13552  -5.301 1.87e-06 ***
##   z.diff.lag2  -0.66917    0.16431  -4.073 0.000143 ***
##   z.diff.lag3  -0.58640    0.18567  -3.158 0.002519 ** 
##   z.diff.lag4  -0.56529    0.19688  -2.871 0.005700 ** 
##   z.diff.lag5  -0.58489    0.20248  -2.889 0.005432 ** 
##   z.diff.lag6  -0.60278    0.20075  -3.003 0.003944 ** 
##   z.diff.lag7  -0.43509    0.20012  -2.174 0.033786 *  
##   z.diff.lag8  -0.28608    0.19283  -1.484 0.143335    
##   z.diff.lag9  -0.14212    0.18150  -0.783 0.436769    
##   z.diff.lag10  0.13232    0.15903   0.832 0.408801    
##   z.diff.lag11 -0.07234    0.12774  -0.566 0.573346    
##   ---
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   
##   Residual standard error: 0.03026 on 58 degrees of freedom
##   Multiple R-squared:  0.4638,   Adjusted R-squared:  0.3529 
##   F-statistic: 4.181 on 12 and 58 DF,  p-value: 0.0001034
##   
##   
##   Value of test-statistic is: -0.2914 
##   
##   Critical values for test statistics: 
##        1pct  5pct 10pct
##   tau1 -2.6 -1.95 -1.61
## 
## Description:
##  Sat Sep 30 21:25:59 2017 by user: egargio
urdfTest(excess_ts, type = "c", lags = urdftest_lag, doplot = FALSE)
## 
## Title:
##  Augmented Dickey-Fuller Unit Root Test
## 
## Test Results:
##   
##   Test regression drift 
##   
##   Call:
##   lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
##   
##   Residuals:
##         Min        1Q    Median        3Q       Max 
##   -0.051868 -0.018870 -0.005227  0.019503  0.067936 
##   
##   Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
##   (Intercept)   0.02239    0.01789   1.251   0.2159  
##   z.lag.1      -0.31889    0.24824  -1.285   0.2041  
##   z.diff.lag1  -0.44287    0.25820  -1.715   0.0917 .
##   z.diff.lag2  -0.40952    0.26418  -1.550   0.1266  
##   z.diff.lag3  -0.34933    0.26464  -1.320   0.1921  
##   z.diff.lag4  -0.35207    0.25966  -1.356   0.1805  
##   z.diff.lag5  -0.39863    0.25053  -1.591   0.1171  
##   z.diff.lag6  -0.44797    0.23498  -1.906   0.0616 .
##   z.diff.lag7  -0.31103    0.22246  -1.398   0.1675  
##   z.diff.lag8  -0.19044    0.20656  -0.922   0.3604  
##   z.diff.lag9  -0.07128    0.18928  -0.377   0.7079  
##   z.diff.lag10  0.18023    0.16283   1.107   0.2730  
##   z.diff.lag11 -0.04154    0.12948  -0.321   0.7495  
##   ---
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   
##   Residual standard error: 0.03012 on 57 degrees of freedom
##   Multiple R-squared:  0.4781,   Adjusted R-squared:  0.3683 
##   F-statistic: 4.352 on 12 and 57 DF,  p-value: 6.962e-05
##   
##   
##   Value of test-statistic is: -1.2846 0.8257 
##   
##   Critical values for test statistics: 
##         1pct  5pct 10pct
##   tau2 -3.51 -2.89 -2.58
##   phi1  6.70  4.71  3.86
## 
## Description:
##  Sat Sep 30 21:25:59 2017 by user: egargio

Based on test statistics values, we cannot reject the unit root presence for both type of tests.

Inspection of the ACF and PACF plots may reveal more on the time series structure.

par(mfrow=c(1,2))
acf(excess_ts)
pacf(excess_ts)

We observe a spike at lag 10 of the ACF plot that may advise about seasonality. I will take into account later. It is also important to verify if any structural break occurs. I do that by taking advantage of the functions provided within strucchange package.

(break_point <- breakpoints(excess_ts ~ 1))
## 
##   Optimal 2-segment partition: 
## 
## Call:
## breakpoints.formula(formula = excess_ts ~ 1)
## 
## Breakpoints at observation number:
## 42 
## 
## Corresponding to breakdates:
## 1670
plot(break_point)

Based on above plot, the minimum BIC is reached for m = 1 breakpoints, occurring on year 1670.

plot(excess_ts)
lines(fitted(break_point, breaks = 1), col = 4)
lines(confint(break_point, breaks = 1))

I then compare the following five models:

  1. (1,1,1)

  2. (1,0,0)(0,0,1)

  3. (1,0,0)(1,0,0)

  4. seasonal (0,0,0)(0,0,1)[10] with shift level regressor as intervention variable

  5. seasonal (1,0,0)(0,0,1)[10] with shift level regressor as intervention variable

  6. non seasonal (1,0,0) with shift level regressor as intervention variable

model_1 <- auto.arima(excess_ts, stepwise = FALSE)
summary(model_1)
## Series: excess_ts 
## ARIMA(1,1,1)                    
## 
## Coefficients:
##          ar1      ma1
##       0.2224  -0.9258
## s.e.  0.1318   0.0683
## 
## sigma^2 estimated as 0.0009316:  log likelihood=167.94
## AIC=-329.88   AICc=-329.57   BIC=-322.7
## 
## Training set error measures:
##                        ME       RMSE        MAE       MPE     MAPE
## Training set -0.002931698 0.02995934 0.02405062 -27.05674 46.53832
##                   MASE        ACF1
## Training set 0.8292635 -0.01005689
coeftest(model_1)
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value Pr(>|z|)    
## ar1  0.222363   0.131778   1.6874  0.09153 .  
## ma1 -0.925836   0.068276 -13.5603  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_2 <- Arima(excess_ts, order = c(1,0,0), 
                           seasonal = list(order = c(0,0,1), period = 10), 
                           include.mean = TRUE)
summary(model_2)
## Series: excess_ts 
## ARIMA(1,0,0)(0,0,1)[10] with non-zero mean 
## 
## Coefficients:
##          ar1    sma1    mean
##       0.2373  0.3441  0.0708
## s.e.  0.1104  0.1111  0.0053
## 
## sigma^2 estimated as 0.0008129:  log likelihood=176.23
## AIC=-344.46   AICc=-343.94   BIC=-334.83
## 
## Training set error measures:
##                         ME       RMSE        MAE       MPE     MAPE
## Training set -0.0002212383 0.02798445 0.02274858 -21.47547 42.96717
##                   MASE         ACF1
## Training set 0.7843692 -0.004420048
coeftest(model_2)
## 
## z test of coefficients:
## 
##            Estimate Std. Error z value  Pr(>|z|)    
## ar1       0.2372975  0.1104323  2.1488  0.031650 *  
## sma1      0.3440811  0.1110791  3.0976  0.001951 ** 
## intercept 0.0707836  0.0052663 13.4409 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_3 <- Arima(excess_ts, order = c(1,0,0), 
                 seasonal = list(order = c(1,0,0), period = 10), 
                 include.mean = TRUE)
summary(model_3)
## Series: excess_ts 
## ARIMA(1,0,0)(1,0,0)[10] with non-zero mean 
## 
## Coefficients:
##          ar1    sar1    mean
##       0.2637  0.3185  0.0705
## s.e.  0.1069  0.1028  0.0058
## 
## sigma^2 estimated as 0.0008173:  log likelihood=176.1
## AIC=-344.19   AICc=-343.67   BIC=-334.56
## 
## Training set error measures:
##                         ME       RMSE        MAE       MPE     MAPE
## Training set -0.0002070952 0.02806013 0.02273145 -21.42509 42.85735
##                   MASE         ACF1
## Training set 0.7837788 -0.005665764
coeftest(model_3)
## 
## z test of coefficients:
## 
##            Estimate Std. Error z value  Pr(>|z|)    
## ar1       0.2636602  0.1069472  2.4653  0.013689 *  
## sar1      0.3185397  0.1027903  3.0989  0.001942 ** 
## intercept 0.0704636  0.0058313 12.0836 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
level <- c(rep(0, break_point$breakpoints), 
           rep(1, length(excess_ts) - break_point$breakpoints))

model_4 <- Arima(excess_ts, order = c(0,0,0), 
                 seasonal = list(order = c(0,0,1), period=10), 
                 xreg = level, 
                 include.mean = TRUE)
summary(model_4)
## Series: excess_ts 
## Regression with ARIMA(0,0,0)(0,0,1)[10] errors 
## 
## Coefficients:
##         sma1  intercept    level
##       0.3437     0.0817  -0.0225
## s.e.  0.1135     0.0053   0.0072
## 
## sigma^2 estimated as 0.0007706:  log likelihood=178.45
## AIC=-348.9   AICc=-348.38   BIC=-339.27
## 
## Training set error measures:
##                        ME       RMSE        MAE       MPE     MAPE
## Training set 0.0001703111 0.02724729 0.02184016 -19.82639 41.28977
##                   MASE      ACF1
## Training set 0.7530469 0.1608774
coeftest(model_4)
## 
## z test of coefficients:
## 
##             Estimate Std. Error z value  Pr(>|z|)    
## sma1       0.3437109  0.1135387  3.0273  0.002468 ** 
## intercept  0.0817445  0.0052825 15.4745 < 2.2e-16 ***
## level     -0.0225294  0.0072468 -3.1089  0.001878 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_5 <- Arima(excess_ts, order = c(1,0,0), 
                 seasonal = list(order = c(0,0,1), period=10), 
                 xreg = level, 
                 include.mean = TRUE)
summary(model_5)
## Series: excess_ts 
## Regression with ARIMA(1,0,0)(0,0,1)[10] errors 
## 
## Coefficients:
##          ar1    sma1  intercept    level
##       0.1649  0.3188     0.0820  -0.0230
## s.e.  0.1099  0.1112     0.0061   0.0084
## 
## sigma^2 estimated as 0.0007612:  log likelihood=179.56
## AIC=-349.11   AICc=-348.32   BIC=-337.08
## 
## Training set error measures:
##                        ME       RMSE        MAE       MPE     MAPE
## Training set 8.225255e-05 0.02690781 0.02174375 -19.42485 40.90147
##                   MASE        ACF1
## Training set 0.7497229 0.007234682
coeftest(model_5)
## 
## z test of coefficients:
## 
##             Estimate Std. Error z value  Pr(>|z|)    
## ar1        0.1648878  0.1099118  1.5002  0.133567    
## sma1       0.3187896  0.1112301  2.8660  0.004156 ** 
## intercept  0.0819981  0.0061227 13.3926 < 2.2e-16 ***
## level     -0.0230176  0.0083874 -2.7443  0.006064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The auto-regressive coefficient is not significative, hence this model is not taken into account further.

model_6 <- Arima(excess_ts, order = c(1,0,0), xreg = level, include.mean = TRUE)
summary(model_6)
## Series: excess_ts 
## Regression with ARIMA(1,0,0) errors 
## 
## Coefficients:
##          ar1  intercept    level
##       0.1820     0.0821  -0.0232
## s.e.  0.1088     0.0053   0.0076
## 
## sigma^2 estimated as 0.0008369:  log likelihood=175.68
## AIC=-343.35   AICc=-342.83   BIC=-333.73
## 
## Training set error measures:
##                         ME       RMSE        MAE       MPE    MAPE
## Training set -7.777648e-05 0.02839508 0.02258574 -21.93086 43.2444
##                   MASE         ACF1
## Training set 0.7787544 0.0003897161
coeftest(model_6)
## 
## z test of coefficients:
## 
##             Estimate Std. Error z value  Pr(>|z|)    
## ar1        0.1820054  0.1088357  1.6723  0.094466 .  
## intercept  0.0821470  0.0053294 15.4139 < 2.2e-16 ***
## level     -0.0232481  0.0076044 -3.0572  0.002234 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I collect AIC, AICc and BIC metrics within a dataframe.

df <- data.frame(col_1_res = c(model_1$aic, model_2$aic, model_3$aic, model_4$aic,  model_6$aic),
                 col_2_res = c(model_1$aicc, model_2$aicc, model_3$aicc, model_4$aicc, model_6$aicc),
                 col_3_res = c(model_1$bic, model_2$bic, model_3$bic, model_4$bic, model_6$bic))

colnames(df) <- c("AIC", "AICc", "BIC")
rownames(df) <- c("ARIMA(1,1,1)", "ARIMA(1,0,0)(0,0,1)[10]", "ARIMA(1,0,0)(1,0,0)[10]", "ARIMA(0,0,0)(0,0,1)[10] with level shift", 
"ARIMA(1,0,0) with level shift")

Resuming up.

df
##                                                AIC      AICc       BIC
## ARIMA(1,1,1)                             -329.8844 -329.5727 -322.7010
## ARIMA(1,0,0)(0,0,1)[10]                  -344.4575 -343.9380 -334.8306
## ARIMA(1,0,0)(1,0,0)[10]                  -344.1906 -343.6711 -334.5637
## ARIMA(0,0,0)(0,0,1)[10] with level shift -348.8963 -348.3768 -339.2694
## ARIMA(1,0,0) with level shift            -343.3529 -342.8334 -333.7260

The models with seasonal components have better AIC, AICc and BIC metrics than the (1,1,1) one. However, the last model based on intervention analysis has the best AIC, AICc and BIC metrics.

A further comment about the results of augmented Dickey Fuller test, whose null hypothesis cannot be rejected. Now we understand that was a type II error. In case of structural changes, the (augmented) Dickey-Fuller test is biased toward the non rejection of the null.

Further interesting details are shown by the following sarima() plots.

sarima(excess_ts, p = 0, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10, xreg = level, details = FALSE)

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xreg, optim.control = list(trace = trc, REPORT = 1, 
##     reltol = tol))
## 
## Coefficients:
##         sma1  intercept     xreg
##       0.3437     0.0817  -0.0225
## s.e.  0.1135     0.0053   0.0072
## 
## sigma^2 estimated as 0.0007424:  log likelihood = 178.45,  aic = -348.9
## 
## $degrees_of_freedom
## [1] 79
## 
## $ttable
##           Estimate     SE t.value p.value
## sma1        0.3437 0.1135  3.0273  0.0033
## intercept   0.0817 0.0053 15.4745  0.0000
## xreg       -0.0225 0.0072 -3.1089  0.0026
## 
## $AIC
## [1] -6.132432
## 
## $AICc
## [1] -6.101706
## 
## $BIC
## [1] -7.044381

Forecast with 20 years horizon is shown below.

h_fut <- 20
plot(forecast(model_4, h = h_fut, xreg = rep(1, h_fut)))
abline(h = 0, lty = 3)

Now the forecast appears to capture more of the time series dynamics. However, it does not specifically help in answering question #3. The answer is going to be obtained by simulation of the candidate ARIMA model. However, since the data generating process has changed its structure in year 1670, we need to model the time series afterwards such date.

break_date <- breakdates(break_point)
excess_ts_before_brk <- window(excess_ts, end = break_date)
excess_ts_after_brk <- window(excess_ts, start = break_date)

par(mfrow=c(1,2))
acf(excess_ts_before_brk)
pacf(excess_ts_before_brk)

par(mfrow=c(1,2))
acf(excess_ts_after_brk)
pacf(excess_ts_after_brk)

We can observe as the seasonality represented by the spike in the ACF plot at lag = 10 occurs before the break dates 1670. Afterwards white noise results. We can further verify such hypothesis fitting a ARIMA(0,0,0) model for after break dates time series.

model_sim <- Arima(excess_ts_after_brk, order = c(0,0,0), include.mean = TRUE)
summary(model_sim)
## Series: excess_ts_after_brk 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##         mean
##       0.0600
## s.e.  0.0036
## 
## sigma^2 estimated as 0.0005297:  log likelihood=96.96
## AIC=-189.93   AICc=-189.61   BIC=-186.5
## 
## Training set error measures:
##                        ME       RMSE        MAE       MPE     MAPE
## Training set 1.641595e-17 0.02273299 0.01787552 -23.22752 43.51151
##                   MASE         ACF1
## Training set 0.7107262 -0.002468808

Further confirmation is given by the sarima() plots.

sarima(excess_ts_after_brk, p=0, d=0, q=0, P=0, D=0, Q=0)
## initial  value -3.783938 
## iter   1 value -3.783938
## final  value -3.783938 
## converged
## initial  value -3.783938 
## iter   1 value -3.783938
## final  value -3.783938 
## converged

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##        xmean
##       0.0600
## s.e.  0.0036
## 
## sigma^2 estimated as 0.0005168:  log likelihood = 96.96,  aic = -189.93
## 
## $degrees_of_freedom
## [1] 40
## 
## $ttable
##       Estimate     SE t.value p.value
## xmean     0.06 0.0036 16.8761       0
## 
## $AIC
## [1] -6.519096
## 
## $AICc
## [1] -6.462613
## 
## $BIC
## [1] -7.477301

Simulations are run and results shown.

n_sim <- 1000
n_future <- 82

sim <- matrix(0, nrow = n_sim, ncol = n_future)

for (i in 1:n_sim) {
  sim[i,] <- simulate(model_sim, n = n_future, seed = i*17+1, 
                      n.start = length(excess_ts_after_brk),
                      start.innov = excess_ts_after_brk) 
}

plot(sim[1,], type='n', main = "Excess time series simulation results",
     xlab = "time index", ylab = "girls", ylim = c(-0.05, 0.15))

# first ten simulation results plot
for (i in 1:30) {
  lines(sim[i,], col = i)
}
abline(h = 0, lty = 3) # yearly girls births == yearly boys births 

Let us gather the summary of the first twenty simulatons.

apply(sim, 1, summary)[1:6, 1:30]
##            [,1]    [,2]    [,3]     [,4]       [,5]    [,6]     [,7]
## Min.    0.01093 0.01287 0.01088 0.006594 -0.0001903 0.01060 0.004852
## 1st Qu. 0.03956 0.04906 0.04424 0.045810  0.0457700 0.04849 0.040310
## Median  0.05716 0.06610 0.05783 0.056310  0.0649200 0.06155 0.061670
## Mean    0.05710 0.06523 0.05839 0.057980  0.0633200 0.06115 0.059770
## 3rd Qu. 0.07178 0.07986 0.07216 0.071030  0.0779500 0.07678 0.078650
## Max.    0.11820 0.13680 0.11350 0.117400  0.1059000 0.11930 0.121800
##             [,8]    [,9]    [,10]    [,11]   [,12]     [,13]   [,14]
## Min.    0.003544 0.01558 0.005585 0.008047 0.01063 -0.003882 0.01108
## 1st Qu. 0.047850 0.04194 0.041710 0.040970 0.04669  0.047500 0.04140
## Median  0.063590 0.05907 0.062250 0.059200 0.05827  0.063310 0.05578
## Mean    0.061470 0.05879 0.060040 0.059780 0.05973  0.060610 0.05950
## 3rd Qu. 0.076340 0.07475 0.075210 0.077120 0.07836  0.075770 0.07416
## Max.    0.103800 0.12430 0.108900 0.113800 0.10760  0.110800 0.13640
##            [,15]     [,16]    [,17]   [,18]   [,19]     [,20]   [,21]
## Min.    -0.01901 -0.009148 0.009421 0.01032 0.01111 -0.007183 0.00549
## 1st Qu.  0.04874  0.044630 0.052990 0.03816 0.04482  0.040880 0.04501
## Median   0.06274  0.060530 0.067340 0.05582 0.05847  0.058770 0.05726
## Mean     0.05992  0.059110 0.066030 0.05480 0.06051  0.059320 0.06131
## 3rd Qu.  0.07307  0.073020 0.078060 0.07008 0.07703  0.078920 0.07842
## Max.     0.11820  0.115000 0.130600 0.11670 0.11380  0.110000 0.12410
##             [,22]    [,23]    [,24]    [,25]    [,26]   [,27]   [,28]
## Min.    -0.008185 -0.01659 0.005331 0.007003 0.009089 0.01707 0.01273
## 1st Qu.  0.037470  0.04184 0.043180 0.049510 0.045770 0.04464 0.04966
## Median   0.058780  0.05878 0.060180 0.060590 0.057260 0.06280 0.06082
## Mean     0.057020  0.05692 0.059020 0.061940 0.057760 0.06104 0.06336
## 3rd Qu.  0.076760  0.07293 0.076630 0.078490 0.068660 0.07370 0.07865
## Max.     0.119600  0.12360 0.115000 0.124200 0.120400 0.12370 0.12600
##             [,29]   [,30]
## Min.    -0.008431 0.01010
## 1st Qu.  0.045090 0.04701
## Median   0.060030 0.05825
## Mean     0.060040 0.05806
## 3rd Qu.  0.077780 0.06973
## Max.     0.112100 0.11690

Finally, the fraction of times the girls exceeds boys in yearly births is computed.

girls_more <- apply(sim, 1, function(x) { which(x < 0) })
(girls_more_fract <- sum(sapply(girls_more, function(x) {length(x) > 0}))/n_sim)
## [1] 0.297

Approximately 29.7% of times girls are forecasted to exceeds boys in yearly births within a time horizon of 82 years. Of course nothing guarantees further structural breaks will not occur, and, in the case that happen, the model used for our simulation would result not applicable any longer.

Conclusions

Structural changes analysis of the Arbuthnot dataset revealed interesting insights and allowed to determine the candidate ARIMA model.

The answers to the original questions proposed in part one:

  1. The boys versus girls births difference greater than zero was found statistically significative.

  2. The (0,0,0)(0,0,1)[10] model with level shift is the candidate model to represent the boys versus girls births fractional excess time series.

  3. There are 29.7% to have girls births greater than boys births some time in the future within 82 years time horizon.