Friday, September 29, 2017

The Arbuthnot dataset - part 1

Arbuthnot Dataset

Introduction

I am going to analyse the Arbuthnot dataset containing information of male and female births in London from 1639 to 1710 year. It will be shown how the boys births are always greater than girls within the given dataset. Once defined the fractional excess of boys births versus girls as:

\[ \begin{equation} \begin{aligned} \dfrac{(Boys - Girls)}{Girls} \end{aligned} \end{equation} \]

and a time series for, I will try to answer the following questions:

  1. Is the fractional excess of boys births versus girls statistically significant ?

  2. What ARIMA time series model fits the fractional excess data ?

  3. What are the chances the boys versus girls births fractional excess shall become less than zero ?

Packages

I am going to take advantage of the following packages.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(astsa))
suppressPackageStartupMessages(library(fUnitRoots))
suppressPackageStartupMessages(library(strucchange))
suppressPackageStartupMessages(library(lmtest))
suppressPackageStartupMessages(library(reshape))
suppressPackageStartupMessages(library(Rmisc))
suppressPackageStartupMessages(library(fBasics))

Analysis

First of all, let us run a basic exploration analysis of our dataset.

url <- "https://www.openintro.org/stat/data/arbuthnot.csv"
abhutondot <- read.csv(url, header=TRUE)
nrow(abhutondot)
## [1] 82
head(abhutondot)
##   year boys girls
## 1 1629 5218  4683
## 2 1630 4858  4457
## 3 1631 4422  4102
## 4 1632 4994  4590
## 5 1633 5158  4839
## 6 1634 5035  4820
abhutondot_rs <- melt(abhutondot, id = c("year"))
head(abhutondot_rs)
##   year variable value
## 1 1629     boys  5218
## 2 1630     boys  4858
## 3 1631     boys  4422
## 4 1632     boys  4994
## 5 1633     boys  5158
## 6 1634     boys  5035
tail(abhutondot_rs)
##     year variable value
## 159 1705    girls  7779
## 160 1706    girls  7417
## 161 1707    girls  7687
## 162 1708    girls  7623
## 163 1709    girls  7380
## 164 1710    girls  7288
ggplot(data = abhutondot_rs, aes(x = year)) + geom_line(aes(y = value, colour = variable)) +
  scale_colour_manual(values = c("blue", "red"))

Some summary statistics and further plots are shown.

basicStats(abhutondot[-1])
##                      boys         girls
## nobs         8.200000e+01  8.200000e+01
## NAs          0.000000e+00  0.000000e+00
## Minimum      2.890000e+03  2.722000e+03
## Maximum      8.426000e+03  7.779000e+03
## 1. Quartile  4.759250e+03  4.457000e+03
## 3. Quartile  7.576500e+03  7.150250e+03
## Mean         5.907098e+03  5.534646e+03
## Median       6.073000e+03  5.718000e+03
## Sum          4.843820e+05  4.538410e+05
## SE Mean      1.825161e+02  1.758222e+02
## LCL Mean     5.543948e+03  5.184815e+03
## UCL Mean     6.270247e+03  5.884477e+03
## Variance     2.731595e+06  2.534902e+06
## Stdev        1.652754e+03  1.592137e+03
## Skewness    -2.139250e-01 -2.204720e-01
## Kurtosis    -1.221721e+00 -1.250893e+00
p1 <- ggplot(data = abhutondot_rs, aes(x = variable, y = value)) + geom_boxplot()
p2 <- ggplot(data = abhutondot, aes(boys)) + geom_density()
p3 <- ggplot(data = abhutondot, aes(girls)) + geom_density()
multiplot(p1, p2, p3, cols=3)

Boys births appear to be consistently higher than girls births. Let us run a t.test to further verify if there is a true difference in the mean of the two groups, boys and girls births.

t.test(value ~ variable, data = abhutondot_rs)
## 
##  Welch Two Sample t-test
## 
## data:  value by variable
## t = 1.4697, df = 161.77, p-value = 0.1436
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -128.0016  872.9040
## sample estimates:
##  mean in group boys mean in group girls 
##            5907.098            5534.646

Based on the p-value, we cannot reject the null hypothesis that the true difference in means is equal to zero.

Resuming up:

  1. there seems to be a consistent higher number of boys with respect girls along the years

  2. time pattern is similar among boys and girls with a few short term deviations

  3. densities are similar and both show negative skewness

To inspect for time series unit roots presence, I run the augmented Dickey-Fuller test on the boys time series first. I do that considering the scenario of both drift and linear time trend. Time series are defined with frequency = 1 as they represent yearly collected data.

boys.ts <- ts(abhutondot$boys, frequency=1, start = abhutondot$year[1])
girls.ts <- ts(abhutondot$girls, frequency=1, start = abhutondot$year[1])

urdftest_lag = floor(12*(length(boys.ts)/100)^0.25)
urdfTest(boys.ts, type = "ct", lags = urdftest_lag, doplot = FALSE)
## 
## Title:
##  Augmented Dickey-Fuller Unit Root Test
## 
## Test Results:
##   
##   Test regression trend 
##   
##   Call:
##   lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##   
##   Residuals:
##        Min       1Q   Median       3Q      Max 
##   -1756.38  -239.69     6.07   235.34  1083.07 
##   
##   Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
##   (Intercept)  716.72668  275.97750   2.597  0.01199 * 
##   z.lag.1       -0.29053    0.08893  -3.267  0.00186 **
##   tt            22.04307    6.74305   3.269  0.00185 **
##   z.diff.lag1   -0.06626    0.12351  -0.536  0.59374   
##   z.diff.lag2    0.04908    0.12409   0.396  0.69397   
##   z.diff.lag3   -0.01400    0.12356  -0.113  0.91021   
##   z.diff.lag4    0.09084    0.12303   0.738  0.46337   
##   z.diff.lag5    0.07620    0.12230   0.623  0.53577   
##   z.diff.lag6    0.11828    0.14487   0.816  0.41769   
##   z.diff.lag7    0.21746    0.16470   1.320  0.19207   
##   z.diff.lag8    0.17692    0.16424   1.077  0.28600   
##   z.diff.lag9    0.24283    0.16559   1.466  0.14813   
##   z.diff.lag10   0.28655    0.16275   1.761  0.08374 . 
##   z.diff.lag11  -0.24705    0.16647  -1.484  0.14341   
##   ---
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   
##   Residual standard error: 467.7 on 56 degrees of freedom
##   Multiple R-squared:  0.3006,   Adjusted R-squared:  0.1382 
##   F-statistic: 1.851 on 13 and 56 DF,  p-value: 0.05716
##   
##   
##   Value of test-statistic is: -3.2671 3.8838 5.6115 
##   
##   Critical values for test statistics: 
##         1pct  5pct 10pct
##   tau3 -4.04 -3.45 -3.15
##   phi2  6.50  4.88  4.16
##   phi3  8.73  6.49  5.47
## 
## Description:
##  Sat Sep 30 21:28:12 2017 by user: egargio

Based on reported test statistics, we cannot reject the null hypothesis of a unit root presence. I then consider the log difference and test against the null hypothesis of unit root presence under the scenario of no drift.

boys_dlog.ts <- diff(log(boys.ts))
urdfTest(boys_dlog.ts, type = "nc", lags = urdftest_lag, doplot = FALSE)
## 
## Title:
##  Augmented Dickey-Fuller Unit Root Test
## 
## Test Results:
##   
##   Test regression none 
##   
##   Call:
##   lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##   
##   Residuals:
##        Min       1Q   Median       3Q      Max 
##   -0.24037 -0.04290 -0.00318  0.04352  0.26396 
##   
##   Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
##   z.lag.1      -0.958561   0.416780  -2.300   0.0251 *
##   z.diff.lag1  -0.032166   0.395993  -0.081   0.9355  
##   z.diff.lag2   0.001279   0.385063   0.003   0.9974  
##   z.diff.lag3  -0.076514   0.375149  -0.204   0.8391  
##   z.diff.lag4  -0.045292   0.360398  -0.126   0.9004  
##   z.diff.lag5  -0.042107   0.343638  -0.123   0.9029  
##   z.diff.lag6  -0.061117   0.328200  -0.186   0.8529  
##   z.diff.lag7   0.128191   0.297591   0.431   0.6683  
##   z.diff.lag8   0.154923   0.265092   0.584   0.5613  
##   z.diff.lag9   0.280346   0.233514   1.201   0.2349  
##   z.diff.lag10  0.321749   0.192659   1.670   0.1004  
##   z.diff.lag11 -0.012512   0.157215  -0.080   0.9368  
##   ---
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   
##   Residual standard error: 0.08808 on 57 degrees of freedom
##   Multiple R-squared:  0.5579,   Adjusted R-squared:  0.4648 
##   F-statistic: 5.993 on 12 and 57 DF,  p-value: 1.265e-06
##   
##   
##   Value of test-statistic is: -2.2999 
##   
##   Critical values for test statistics: 
##        1pct  5pct 10pct
##   tau1 -2.6 -1.95 -1.61
## 
## Description:
##  Sat Sep 30 21:28:12 2017 by user: egargio

Based on reported test statistics, we reject the null hypothesis of a unit root presence.

Same unit root tests are run for the girls time series. Conclusions are the same.

urdftest_lag = floor(12*(length(girls.ts)/100)^0.25)
urdfTest(girls.ts, type = "ct", lags = urdftest_lag, doplot = FALSE)
## 
## Title:
##  Augmented Dickey-Fuller Unit Root Test
## 
## Test Results:
##   
##   Test regression trend 
##   
##   Call:
##   lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##   
##   Residuals:
##        Min       1Q   Median       3Q      Max 
##   -1649.89  -248.89    47.01   192.18   911.80 
##   
##   Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
##   (Intercept)  632.35915  245.03006   2.581  0.01251 *  
##   z.lag.1       -0.29028    0.08473  -3.426  0.00115 ** 
##   tt            21.51165    6.18246   3.479  0.00098 ***
##   z.diff.lag1   -0.09327    0.11968  -0.779  0.43908    
##   z.diff.lag2    0.01960    0.12029   0.163  0.87115    
##   z.diff.lag3    0.00738    0.12040   0.061  0.95134    
##   z.diff.lag4    0.06440    0.11992   0.537  0.59336    
##   z.diff.lag5    0.11348    0.11904   0.953  0.34457    
##   z.diff.lag6    0.24960    0.13419   1.860  0.06815 .  
##   z.diff.lag7    0.08135    0.15928   0.511  0.61153    
##   z.diff.lag8    0.18003    0.15658   1.150  0.25512    
##   z.diff.lag9    0.18899    0.15773   1.198  0.23589    
##   z.diff.lag10   0.41226    0.15473   2.664  0.01006 *  
##   z.diff.lag11  -0.23102    0.16420  -1.407  0.16497    
##   ---
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   
##   Residual standard error: 440 on 56 degrees of freedom
##   Multiple R-squared:  0.3362,   Adjusted R-squared:  0.1821 
##   F-statistic: 2.182 on 13 and 56 DF,  p-value: 0.02253
##   
##   
##   Value of test-statistic is: -3.4259 4.3472 6.2854 
##   
##   Critical values for test statistics: 
##         1pct  5pct 10pct
##   tau3 -4.04 -3.45 -3.15
##   phi2  6.50  4.88  4.16
##   phi3  8.73  6.49  5.47
## 
## Description:
##  Sat Sep 30 21:28:12 2017 by user: egargio
girls_dlog.ts <- diff(log(girls.ts))
urdfTest(girls_dlog.ts, type = "c", lags = urdftest_lag, doplot = FALSE)
## 
## Title:
##  Augmented Dickey-Fuller Unit Root Test
## 
## Test Results:
##   
##   Test regression drift 
##   
##   Call:
##   lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
##   
##   Residuals:
##         Min        1Q    Median        3Q       Max 
##   -0.273868 -0.038467 -0.000195  0.036676  0.261630 
##   
##   Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
##   (Intercept)   0.005185   0.011695   0.443   0.6592  
##   z.lag.1      -1.010707   0.440082  -2.297   0.0254 *
##   z.diff.lag1  -0.014640   0.421229  -0.035   0.9724  
##   z.diff.lag2   0.005882   0.410270   0.014   0.9886  
##   z.diff.lag3  -0.052151   0.395357  -0.132   0.8955  
##   z.diff.lag4  -0.054940   0.378809  -0.145   0.8852  
##   z.diff.lag5  -0.054173   0.360717  -0.150   0.8812  
##   z.diff.lag6   0.010905   0.343085   0.032   0.9748  
##   z.diff.lag7   0.139386   0.311598   0.447   0.6564  
##   z.diff.lag8   0.143548   0.272727   0.526   0.6007  
##   z.diff.lag9   0.156237   0.238624   0.655   0.5153  
##   z.diff.lag10  0.348606   0.197313   1.767   0.0827 .
##   z.diff.lag11  0.065587   0.157531   0.416   0.6787  
##   ---
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   
##   Residual standard error: 0.09407 on 56 degrees of freedom
##   Multiple R-squared:  0.5678,   Adjusted R-squared:  0.4751 
##   F-statistic:  6.13 on 12 and 56 DF,  p-value: 1.012e-06
##   
##   
##   Value of test-statistic is: -2.2966 2.644 
##   
##   Critical values for test statistics: 
##         1pct  5pct 10pct
##   tau2 -3.51 -2.89 -2.58
##   phi1  6.70  4.71  3.86
## 
## Description:
##  Sat Sep 30 21:28:12 2017 by user: egargio

To verify further both log-difference time series are white noise, I take advantage of sarima() function provided within astsa package.

par(mfrow=c(1,2))
sarima(boys_dlog.ts, p = 0, d = 0, q = 0, details = FALSE)

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##        xmean
##       0.0047
## s.e.  0.0093
## 
## sigma^2 estimated as 0.006983:  log likelihood = 86.12,  aic = -168.24
## 
## $degrees_of_freedom
## [1] 80
## 
## $ttable
##       Estimate     SE t.value p.value
## xmean   0.0047 0.0093   0.507  0.6136
## 
## $AIC
## [1] -3.939644
## 
## $AICc
## [1] -3.913053
## 
## $BIC
## [1] -4.910083
par(mfrow=c(1,2))
sarima(girls_dlog.ts, p = 0, d = 0, q = 0, details = FALSE)

## $fit
## 
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, 
##     Q), period = S), xreg = xmean, include.mean = FALSE, optim.control = list(trace = trc, 
##     REPORT = 1, reltol = tol))
## 
## Coefficients:
##        xmean
##       0.0055
## s.e.  0.0097
## 
## sigma^2 estimated as 0.007553:  log likelihood = 82.94,  aic = -161.88
## 
## $degrees_of_freedom
## [1] 80
## 
## $ttable
##       Estimate     SE t.value p.value
## xmean   0.0055 0.0097  0.5654  0.5734
## 
## $AIC
## [1] -3.86113
## 
## $AICc
## [1] -3.834539
## 
## $BIC
## [1] -4.831569

Both sarima() reports confirm the log difference boys and girls time series are white noise. I then take advantage of the auto.arima() function within the forecast package and build models for the original boys and girls time series.

(boys_aa <- auto.arima(boys.ts, stepwise = FALSE))
## Series: boys.ts 
## ARIMA(0,1,0)                    
## 
## sigma^2 estimated as 232156:  log likelihood=-615.32
## AIC=1232.64   AICc=1232.69   BIC=1235.03
(girls_aa <- auto.arima(girls.ts, stepwise = FALSE))
## Series: girls.ts 
## ARIMA(0,1,1)                    
## 
## Coefficients:
##           ma1
##       -0.2343
## s.e.   0.1119
## 
## sigma^2 estimated as 208433:  log likelihood=-610.48
## AIC=1224.96   AICc=1225.11   BIC=1229.74

The girls time series shows a moving average component not found for the boys one. Forecasts with 20 years horizon are so obtained.

par(mfrow=c(1,2))
plot(forecast(boys_aa, h = 20))
plot(forecast(girls_aa, h = 20))

Actually, those plots do not tell us much of the reciprocal dynamics of the time series under study within early future. Therefore, in the next post, I will turn my attention on the fractional excess boys versus girls time series.

Conclusions

Analysis of the Arbuthnot dataset revealed the structure of the births fractional excess of boys versus girls. Its average value greater than zero was found statistically significative.

Further investigations and answers to our three basic questions in my next post.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.