Arbuthnot Dataset
Introduction
I am going to analyse the Arbuthnot dataset containing information of male and female births in London from 1639 to 1710 year. It will be shown how the boys births are always greater than girls within the given dataset. Once defined the fractional excess of boys births versus girls as:
\[ \begin{equation} \begin{aligned} \dfrac{(Boys - Girls)}{Girls} \end{aligned} \end{equation} \]
and a time series for, I will try to answer the following questions:
Is the fractional excess of boys births versus girls statistically significant ?
What ARIMA time series model fits the fractional excess data ?
What are the chances the boys versus girls births fractional excess shall become less than zero ?
Packages
I am going to take advantage of the following packages.
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(forecast))
suppressPackageStartupMessages(library(astsa))
suppressPackageStartupMessages(library(fUnitRoots))
suppressPackageStartupMessages(library(strucchange))
suppressPackageStartupMessages(library(lmtest))
suppressPackageStartupMessages(library(reshape))
suppressPackageStartupMessages(library(Rmisc))
suppressPackageStartupMessages(library(fBasics))
Analysis
First of all, let us run a basic exploration analysis of our dataset.
url <- "https://www.openintro.org/stat/data/arbuthnot.csv"
abhutondot <- read.csv(url, header=TRUE)
nrow(abhutondot)
## [1] 82
head(abhutondot)
## year boys girls
## 1 1629 5218 4683
## 2 1630 4858 4457
## 3 1631 4422 4102
## 4 1632 4994 4590
## 5 1633 5158 4839
## 6 1634 5035 4820
abhutondot_rs <- melt(abhutondot, id = c("year"))
head(abhutondot_rs)
## year variable value
## 1 1629 boys 5218
## 2 1630 boys 4858
## 3 1631 boys 4422
## 4 1632 boys 4994
## 5 1633 boys 5158
## 6 1634 boys 5035
tail(abhutondot_rs)
## year variable value
## 159 1705 girls 7779
## 160 1706 girls 7417
## 161 1707 girls 7687
## 162 1708 girls 7623
## 163 1709 girls 7380
## 164 1710 girls 7288
ggplot(data = abhutondot_rs, aes(x = year)) + geom_line(aes(y = value, colour = variable)) +
scale_colour_manual(values = c("blue", "red"))
Some summary statistics and further plots are shown.
basicStats(abhutondot[-1])
## boys girls
## nobs 8.200000e+01 8.200000e+01
## NAs 0.000000e+00 0.000000e+00
## Minimum 2.890000e+03 2.722000e+03
## Maximum 8.426000e+03 7.779000e+03
## 1. Quartile 4.759250e+03 4.457000e+03
## 3. Quartile 7.576500e+03 7.150250e+03
## Mean 5.907098e+03 5.534646e+03
## Median 6.073000e+03 5.718000e+03
## Sum 4.843820e+05 4.538410e+05
## SE Mean 1.825161e+02 1.758222e+02
## LCL Mean 5.543948e+03 5.184815e+03
## UCL Mean 6.270247e+03 5.884477e+03
## Variance 2.731595e+06 2.534902e+06
## Stdev 1.652754e+03 1.592137e+03
## Skewness -2.139250e-01 -2.204720e-01
## Kurtosis -1.221721e+00 -1.250893e+00
p1 <- ggplot(data = abhutondot_rs, aes(x = variable, y = value)) + geom_boxplot()
p2 <- ggplot(data = abhutondot, aes(boys)) + geom_density()
p3 <- ggplot(data = abhutondot, aes(girls)) + geom_density()
multiplot(p1, p2, p3, cols=3)
Boys births appear to be consistently higher than girls births. Let us run a t.test to further verify if there is a true difference in the mean of the two groups, boys and girls births.
t.test(value ~ variable, data = abhutondot_rs)
##
## Welch Two Sample t-test
##
## data: value by variable
## t = 1.4697, df = 161.77, p-value = 0.1436
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -128.0016 872.9040
## sample estimates:
## mean in group boys mean in group girls
## 5907.098 5534.646
Based on the p-value, we cannot reject the null hypothesis that the true difference in means is equal to zero.
Resuming up:
there seems to be a consistent higher number of boys with respect girls along the years
time pattern is similar among boys and girls with a few short term deviations
densities are similar and both show negative skewness
To inspect for time series unit roots presence, I run the augmented Dickey-Fuller test on the boys time series first. I do that considering the scenario of both drift and linear time trend. Time series are defined with frequency = 1 as they represent yearly collected data.
boys.ts <- ts(abhutondot$boys, frequency=1, start = abhutondot$year[1])
girls.ts <- ts(abhutondot$girls, frequency=1, start = abhutondot$year[1])
urdftest_lag = floor(12*(length(boys.ts)/100)^0.25)
urdfTest(boys.ts, type = "ct", lags = urdftest_lag, doplot = FALSE)
##
## Title:
## Augmented Dickey-Fuller Unit Root Test
##
## Test Results:
##
## Test regression trend
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1756.38 -239.69 6.07 235.34 1083.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 716.72668 275.97750 2.597 0.01199 *
## z.lag.1 -0.29053 0.08893 -3.267 0.00186 **
## tt 22.04307 6.74305 3.269 0.00185 **
## z.diff.lag1 -0.06626 0.12351 -0.536 0.59374
## z.diff.lag2 0.04908 0.12409 0.396 0.69397
## z.diff.lag3 -0.01400 0.12356 -0.113 0.91021
## z.diff.lag4 0.09084 0.12303 0.738 0.46337
## z.diff.lag5 0.07620 0.12230 0.623 0.53577
## z.diff.lag6 0.11828 0.14487 0.816 0.41769
## z.diff.lag7 0.21746 0.16470 1.320 0.19207
## z.diff.lag8 0.17692 0.16424 1.077 0.28600
## z.diff.lag9 0.24283 0.16559 1.466 0.14813
## z.diff.lag10 0.28655 0.16275 1.761 0.08374 .
## z.diff.lag11 -0.24705 0.16647 -1.484 0.14341
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 467.7 on 56 degrees of freedom
## Multiple R-squared: 0.3006, Adjusted R-squared: 0.1382
## F-statistic: 1.851 on 13 and 56 DF, p-value: 0.05716
##
##
## Value of test-statistic is: -3.2671 3.8838 5.6115
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -4.04 -3.45 -3.15
## phi2 6.50 4.88 4.16
## phi3 8.73 6.49 5.47
##
## Description:
## Sat Sep 30 21:28:12 2017 by user: egargio
Based on reported test statistics, we cannot reject the null hypothesis of a unit root presence. I then consider the log difference and test against the null hypothesis of unit root presence under the scenario of no drift.
boys_dlog.ts <- diff(log(boys.ts))
urdfTest(boys_dlog.ts, type = "nc", lags = urdftest_lag, doplot = FALSE)
##
## Title:
## Augmented Dickey-Fuller Unit Root Test
##
## Test Results:
##
## Test regression none
##
## Call:
## lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24037 -0.04290 -0.00318 0.04352 0.26396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## z.lag.1 -0.958561 0.416780 -2.300 0.0251 *
## z.diff.lag1 -0.032166 0.395993 -0.081 0.9355
## z.diff.lag2 0.001279 0.385063 0.003 0.9974
## z.diff.lag3 -0.076514 0.375149 -0.204 0.8391
## z.diff.lag4 -0.045292 0.360398 -0.126 0.9004
## z.diff.lag5 -0.042107 0.343638 -0.123 0.9029
## z.diff.lag6 -0.061117 0.328200 -0.186 0.8529
## z.diff.lag7 0.128191 0.297591 0.431 0.6683
## z.diff.lag8 0.154923 0.265092 0.584 0.5613
## z.diff.lag9 0.280346 0.233514 1.201 0.2349
## z.diff.lag10 0.321749 0.192659 1.670 0.1004
## z.diff.lag11 -0.012512 0.157215 -0.080 0.9368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08808 on 57 degrees of freedom
## Multiple R-squared: 0.5579, Adjusted R-squared: 0.4648
## F-statistic: 5.993 on 12 and 57 DF, p-value: 1.265e-06
##
##
## Value of test-statistic is: -2.2999
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau1 -2.6 -1.95 -1.61
##
## Description:
## Sat Sep 30 21:28:12 2017 by user: egargio
Based on reported test statistics, we reject the null hypothesis of a unit root presence.
Same unit root tests are run for the girls time series. Conclusions are the same.
urdftest_lag = floor(12*(length(girls.ts)/100)^0.25)
urdfTest(girls.ts, type = "ct", lags = urdftest_lag, doplot = FALSE)
##
## Title:
## Augmented Dickey-Fuller Unit Root Test
##
## Test Results:
##
## Test regression trend
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + tt + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1649.89 -248.89 47.01 192.18 911.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 632.35915 245.03006 2.581 0.01251 *
## z.lag.1 -0.29028 0.08473 -3.426 0.00115 **
## tt 21.51165 6.18246 3.479 0.00098 ***
## z.diff.lag1 -0.09327 0.11968 -0.779 0.43908
## z.diff.lag2 0.01960 0.12029 0.163 0.87115
## z.diff.lag3 0.00738 0.12040 0.061 0.95134
## z.diff.lag4 0.06440 0.11992 0.537 0.59336
## z.diff.lag5 0.11348 0.11904 0.953 0.34457
## z.diff.lag6 0.24960 0.13419 1.860 0.06815 .
## z.diff.lag7 0.08135 0.15928 0.511 0.61153
## z.diff.lag8 0.18003 0.15658 1.150 0.25512
## z.diff.lag9 0.18899 0.15773 1.198 0.23589
## z.diff.lag10 0.41226 0.15473 2.664 0.01006 *
## z.diff.lag11 -0.23102 0.16420 -1.407 0.16497
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 440 on 56 degrees of freedom
## Multiple R-squared: 0.3362, Adjusted R-squared: 0.1821
## F-statistic: 2.182 on 13 and 56 DF, p-value: 0.02253
##
##
## Value of test-statistic is: -3.4259 4.3472 6.2854
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau3 -4.04 -3.45 -3.15
## phi2 6.50 4.88 4.16
## phi3 8.73 6.49 5.47
##
## Description:
## Sat Sep 30 21:28:12 2017 by user: egargio
girls_dlog.ts <- diff(log(girls.ts))
urdfTest(girls_dlog.ts, type = "c", lags = urdftest_lag, doplot = FALSE)
##
## Title:
## Augmented Dickey-Fuller Unit Root Test
##
## Test Results:
##
## Test regression drift
##
## Call:
## lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.273868 -0.038467 -0.000195 0.036676 0.261630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.005185 0.011695 0.443 0.6592
## z.lag.1 -1.010707 0.440082 -2.297 0.0254 *
## z.diff.lag1 -0.014640 0.421229 -0.035 0.9724
## z.diff.lag2 0.005882 0.410270 0.014 0.9886
## z.diff.lag3 -0.052151 0.395357 -0.132 0.8955
## z.diff.lag4 -0.054940 0.378809 -0.145 0.8852
## z.diff.lag5 -0.054173 0.360717 -0.150 0.8812
## z.diff.lag6 0.010905 0.343085 0.032 0.9748
## z.diff.lag7 0.139386 0.311598 0.447 0.6564
## z.diff.lag8 0.143548 0.272727 0.526 0.6007
## z.diff.lag9 0.156237 0.238624 0.655 0.5153
## z.diff.lag10 0.348606 0.197313 1.767 0.0827 .
## z.diff.lag11 0.065587 0.157531 0.416 0.6787
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09407 on 56 degrees of freedom
## Multiple R-squared: 0.5678, Adjusted R-squared: 0.4751
## F-statistic: 6.13 on 12 and 56 DF, p-value: 1.012e-06
##
##
## Value of test-statistic is: -2.2966 2.644
##
## Critical values for test statistics:
## 1pct 5pct 10pct
## tau2 -3.51 -2.89 -2.58
## phi1 6.70 4.71 3.86
##
## Description:
## Sat Sep 30 21:28:12 2017 by user: egargio
To verify further both log-difference time series are white noise, I take advantage of sarima() function provided within astsa package.
par(mfrow=c(1,2))
sarima(boys_dlog.ts, p = 0, d = 0, q = 0, details = FALSE)
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = xmean, include.mean = FALSE, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## xmean
## 0.0047
## s.e. 0.0093
##
## sigma^2 estimated as 0.006983: log likelihood = 86.12, aic = -168.24
##
## $degrees_of_freedom
## [1] 80
##
## $ttable
## Estimate SE t.value p.value
## xmean 0.0047 0.0093 0.507 0.6136
##
## $AIC
## [1] -3.939644
##
## $AICc
## [1] -3.913053
##
## $BIC
## [1] -4.910083
par(mfrow=c(1,2))
sarima(girls_dlog.ts, p = 0, d = 0, q = 0, details = FALSE)
## $fit
##
## Call:
## stats::arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D,
## Q), period = S), xreg = xmean, include.mean = FALSE, optim.control = list(trace = trc,
## REPORT = 1, reltol = tol))
##
## Coefficients:
## xmean
## 0.0055
## s.e. 0.0097
##
## sigma^2 estimated as 0.007553: log likelihood = 82.94, aic = -161.88
##
## $degrees_of_freedom
## [1] 80
##
## $ttable
## Estimate SE t.value p.value
## xmean 0.0055 0.0097 0.5654 0.5734
##
## $AIC
## [1] -3.86113
##
## $AICc
## [1] -3.834539
##
## $BIC
## [1] -4.831569
Both sarima() reports confirm the log difference boys and girls time series are white noise. I then take advantage of the auto.arima() function within the forecast package and build models for the original boys and girls time series.
(boys_aa <- auto.arima(boys.ts, stepwise = FALSE))
## Series: boys.ts
## ARIMA(0,1,0)
##
## sigma^2 estimated as 232156: log likelihood=-615.32
## AIC=1232.64 AICc=1232.69 BIC=1235.03
(girls_aa <- auto.arima(girls.ts, stepwise = FALSE))
## Series: girls.ts
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.2343
## s.e. 0.1119
##
## sigma^2 estimated as 208433: log likelihood=-610.48
## AIC=1224.96 AICc=1225.11 BIC=1229.74
The girls time series shows a moving average component not found for the boys one. Forecasts with 20 years horizon are so obtained.
par(mfrow=c(1,2))
plot(forecast(boys_aa, h = 20))
plot(forecast(girls_aa, h = 20))
Actually, those plots do not tell us much of the reciprocal dynamics of the time series under study within early future. Therefore, in the next post, I will turn my attention on the fractional excess boys versus girls time series.
Conclusions
Analysis of the Arbuthnot dataset revealed the structure of the births fractional excess of boys versus girls. Its average value greater than zero was found statistically significative.
Further investigations and answers to our three basic questions in my next post.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.