Leaderboard Analysis
Abstract
In this post I am going to analyze a dataset reporting ATP Grand Slam tennis tournaments winners historical data. Specifically, I am going to evaluate the potential presence of outliers.
To clarify that an outlier:
is a value remarkably distant from other ones within the same dataset
lies at the far boundary of the overall dataset distribution
is rare to be observed
may not be unique within the same dataset
Analysis
I start my analysis by loading ATP Grand Slam tennis tournaments winners historical data (ref. [1]). I then take a quick view to the top and tail records.
library_toload <- c("dplyr", "knitr", "fitdistrplus", "extremevalues")
invisible(lapply(library_toload, function(x) {suppressPackageStartupMessages(library(x, character.only=TRUE))}))
slam_win <- read.delim("tennis-grand-slam-winners.txt", sep="\t")
dim(slam_win)
## [1] 485 4
nr <- nrow(slam_win)
kable(head(slam_win), row.names=FALSE)
YEAR | TOURNAMENT | WINNER | RUNNER.UP |
---|---|---|---|
2016 | U.S. Open | Stan Wawrinka | Novak Djokovic |
2016 | Wimbledon | Andy Murray | Milos Raonic |
2016 | French Open | Novak Djokovic | Andy Murray |
2016 | Australian Open | Novak Djokovic | Andy Murray |
2015 | U.S. Open | Novak Djokovic | Roger Federer |
2015 | Wimbledon | Novak Djokovic | Roger Federer |
kable(tail(slam_win), row.names=FALSE)
YEAR | TOURNAMENT | WINNER | RUNNER.UP |
---|---|---|---|
1881 | U.S. Open | Richard D. Sears | William E. Glyn |
1881 | Wimbledon | William Renshaw | John Hartley |
1880 | Wimbledon | John Hartley | Herbert Lawford |
1879 | Wimbledon | John Hartley | V. St. Leger Gould |
1878 | Wimbledon | Frank Hadow | Spencer Gore |
1877 | Wimbledon | Spencer Gore | William Marshall |
As we can see, the records are ordered by year, starting on 1877 and ending on 2016. In 140 years, 485 ATP Grand Slam tournaments took place and 166 champions won at least one event.
I then order champions' names by decreasing number of wins and associate a leaderboard position number.
# ordering in decreasing wins order
slam_win_df <- as.data.frame(table(slam_win[,"WINNER"]))
slam_win_df <- slam_win_df %>% arrange(desc(Freq))
# computing champions' leaderboard position
pos <- rep(0, nrow(slam_win_df))
pos[1] <- 1
for(i in 2:nrow(slam_win_df)) {
pos[i] <- ifelse(slam_win_df$Freq[i] != slam_win_df$Freq[i-1], i, pos[i-1])
}
# creating and showing leaderboard dataframe
slam_winners <- data.frame(RANK = pos,
PLAYER = slam_win_df$Var1,
WINS = slam_win_df$Freq)
kable(slam_winners, caption="ATP Gran Slam Tournaments Leaderboard")
RANK | PLAYER | WINS |
---|---|---|
1 | Roger Federer | 17 |
2 | Pete Sampras | 14 |
2 | Rafael Nadal | 14 |
4 | Novak Djokovic | 12 |
4 | Roy Emerson | 12 |
6 | Bjorn Borg | 11 |
6 | Rod Laver | 11 |
8 | William T. Tilden | 10 |
9 | Andre Agassi | 8 |
9 | Fred Perry | 8 |
9 | Henri Cochet | 8 |
9 | Ivan Lendl | 8 |
9 | Jimmy Connors | 8 |
9 | Ken Rosewall | 8 |
9 | Max Decugis | 8 |
9 | William A. Larned | 8 |
17 | John McEnroe | 7 |
17 | John Newcombe | 7 |
17 | Mats Wilander | 7 |
17 | Rene Lacoste | 7 |
17 | Richard D. Sears | 7 |
17 | William Renshaw | 7 |
23 | Boris Becker | 6 |
23 | Donald Budge | 6 |
23 | Stefan Edberg | 6 |
26 | Frank Sedgman | 5 |
26 | Jack Crawford | 5 |
26 | Jean Borotra | 5 |
26 | Laurie Doherty | 5 |
26 | Tony Trabert | 5 |
31 | Andre Vacherot | 4 |
31 | Anthony Wilding | 4 |
31 | Ashley J. Cooper | 4 |
31 | Frank Parker | 4 |
31 | Guillermo Vilas | 4 |
31 | Jim Courier | 4 |
31 | Lewis Hoad | 4 |
31 | Manuel Santana | 4 |
31 | Pat O̢۪Hara Wood | 4 |
31 | Paul Ayme | 4 |
31 | Reggie Doherty | 4 |
31 | Robert D. Wrenn | 4 |
43 | Adrian Quist | 3 |
43 | Andy Murray | 3 |
43 | Arthur Ashe | 3 |
43 | Arthur Gore | 3 |
43 | Gerald Patterson | 3 |
43 | Gustavo Kuerten | 3 |
43 | H. Ellsworth Vines | 3 |
43 | Jack Kramer | 3 |
43 | Jan Kodes | 3 |
43 | Jaroslav Drobny | 3 |
43 | Malcolm D. Whitman | 3 |
43 | Maurice Germot | 3 |
43 | Neale Fraser | 3 |
43 | Norman Brookes | 3 |
43 | Oliver S. Campbell | 3 |
43 | Robert Riggs | 3 |
43 | Stan Wawrinka | 3 |
43 | Wilfred Baddeley | 3 |
43 | William M. Johnston | 3 |
62 | Andre Gobert | 2 |
62 | E. Victor Seixas Jr. | 2 |
62 | Fred Stolle | 2 |
62 | Gottfried Von Cramm | 2 |
62 | Henry W. Slocum Jr. | 2 |
62 | Ilie Nastase | 2 |
62 | James Anderson | 2 |
62 | Johan Kriek | 2 |
62 | John Bromwich | 2 |
62 | John Hartley | 2 |
62 | Joshua Pim | 2 |
62 | Lleyton Hewitt | 2 |
62 | Marat Safin | 2 |
62 | Maurice E. McLoughlin | 2 |
62 | Mervyn Rose | 2 |
62 | Nicola Pietrangeli | 2 |
62 | Patrick Rafter | 2 |
62 | R. Lindley Murray | 2 |
62 | Richard A. Gonzales | 2 |
62 | Richard N. Williams | 2 |
62 | Rodney Heath | 2 |
62 | Sergi Bruguera | 2 |
62 | Stan Smith | 2 |
62 | Tony Wilding | 2 |
62 | Yevgeny Kafelnikov | 2 |
87 | A.R.F. Kingscote | 1 |
87 | Adriano Panatta | 1 |
87 | Albert Costa | 1 |
87 | Alejandro Olmedo | 1 |
87 | Alex Olmedo | 1 |
87 | Andres Gimeno | 1 |
87 | Andres Gomez | 1 |
87 | Andy Roddick | 1 |
87 | Arthur Larsen | 1 |
87 | Beals C. Wright | 1 |
87 | Bill Bowrey | 1 |
87 | Brian Teacher | 1 |
87 | Budge Patty | 1 |
87 | C.R. McKinley | 1 |
87 | Carlos Moya | 1 |
87 | Cecil Parke | 1 |
87 | Dick Savitt | 1 |
87 | Dinny Pails | 1 |
87 | Donald McNeill | 1 |
87 | E.F. Parker | 1 |
87 | Ernest Renshaw | 1 |
87 | F.R. Schroeder | 1 |
87 | Francis Lowe | 1 |
87 | Francois Blanchy | 1 |
87 | Frank Hadow | 1 |
87 | Fred Alexander | 1 |
87 | Fred H. Hovey | 1 |
87 | Frederick R. Schroeder, Jr. | 1 |
87 | Gar Moon | 1 |
87 | Gaston Gaudio | 1 |
87 | Goran Ivanisevic | 1 |
87 | H. Briggs | 1 |
87 | Harold Mahoney | 1 |
87 | Henner Henkel | 1 |
87 | Herbert Lawford | 1 |
87 | Holcombe Ward | 1 |
87 | Horace Rice | 1 |
87 | Hugh L. Doherty | 1 |
87 | J. Schopfer | 1 |
87 | J.E. Patty | 1 |
87 | Jean Samazeuilh | 1 |
87 | John Crawford | 1 |
87 | John Gregory | 1 |
87 | John H. Doeg | 1 |
87 | John Hawkes | 1 |
87 | Jozsef Asboth | 1 |
87 | Juan Carlos Ferrero | 1 |
87 | Juan Martin del Potro | 1 |
87 | Ken McGregor | 1 |
87 | L. Riboulet | 1 |
87 | Lt. Joseph R. Hunt | 1 |
87 | M. Vacherot | 1 |
87 | Malcolm J. Anderson | 1 |
87 | Manuel Orantes | 1 |
87 | Marcel Bernard | 1 |
87 | Marin Cilic | 1 |
87 | Mark Edmondson | 1 |
87 | Michael Chang | 1 |
87 | Michael Stich | 1 |
87 | Pat Cash | 1 |
87 | Petr Korda | 1 |
87 | R. Falkenburg | 1 |
87 | R. Savitt | 1 |
87 | Rafael Osuna | 1 |
87 | Rhys Gemmell | 1 |
87 | Richard Krajicek | 1 |
87 | Roscoe Tanner | 1 |
87 | S.B. Wood | 1 |
87 | Spencer Gore | 1 |
87 | Sven Davidson | 1 |
87 | Thomas Johannson | 1 |
87 | Thomas Muster | 1 |
87 | Tony Roche | 1 |
87 | Vitas Gerulaitis | 1 |
87 | Vivian McGrath | 1 |
87 | William Hamilton | 1 |
87 | William McNeill | 1 |
87 | Wilmer L. Allison | 1 |
87 | Yannick Noah | 1 |
87 | Yvon Petra | 1 |
At the top of the leaderboard we recognize well known tennis champions (!). Please note that equal wins champions are sorted in alphabetic order.
It is interesting to take a look at wins frequencies and summary.
wins_frequency <- as.data.frame(table(slam_winners[,"WINS"]))
colnames(wins_frequency) <- c("WINS", "FREQUENCY")
kable(wins_frequency)
WINS | FREQUENCY |
---|---|
1 | 80 |
2 | 25 |
3 | 19 |
4 | 12 |
5 | 5 |
6 | 3 |
7 | 6 |
8 | 8 |
10 | 1 |
11 | 2 |
12 | 2 |
14 | 2 |
17 | 1 |
summary(slam_winners[,"WINS"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.922 3.750 17.000
I then fit the empirical distribution of wins by taking advantage of the log-normal distribution, which appears to be well suited for the purpose.
Curiously, reference [3] reports that in nature species abundance follows a log-normal distribution according to Preston's studies. It is a nice similitude with professional tennis champions 'breeds', I think.
For the purpose, I take advantage of the fitdist() function provided by the fitdistrplus package.
fw <- fitdist(slam_winners$WINS, "lnorm")
summary(fw)
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## meanlog 0.7033182 0.06228297
## sdlog 0.8024599 0.04404040
## Loglikelihood: -315.7624 AIC: 635.5249 BIC: 641.7489
## Correlation matrix:
## meanlog sdlog
## meanlog 1 0
## sdlog 0 1
plot(fw)
I investigate the presence of outliers by imposing very restrictive quantiles equal to 0.1%. I take advantage of the getOutliersI() and outlierPlot() functions provided by the extremevalues package.
# left outliers quantile
left_thresh <- 0.001
# right outliers quantile
right_thresh <- 0.999
# determining the outliers
slam_outlier <- getOutliersI(as.vector(slam_winners$WINS),
FLim = c(left_thresh, right_thresh),
distribution = "lognormal")
# outliers are plotted in red color
outlierPlot(slam_winners$WINS, slam_outlier, mode="qq")
Above outlier plot highlights in red the presence of three outliers related to top wins performance. The corresponding top three champions are:
slam_winners[slam_outlier$iRight,]
RANK | PLAYER | WINS |
---|---|---|
1 | Roger Federer | 17 |
2 | Pete Sampras | 14 |
2 | Rafael Nadal | 14 |
On the contrary, there are not any outliers at the left side of the distribution.
I compute probabilities and standard deviation from the mean related to 17 and 14 wins performance. See reference [2] pag. 6 Table 1 for details.
# mean and standard deviation of the fitted lognormal distribution
(mean_log <- fw$estimate["meanlog"])
## meanlog
## 0.7033182
(sd_log <- fw$estimate["sdlog"])
## sdlog
## 0.8024599
# clearing names
names(mean_log) <- NULL
names(sd_log) <- NULL
# average value based on fitted parameters
(mean_x <- exp(mean_log + 0.5*sd_log^2))
## [1] 2.787902
# probability associated to the 17 wins performance or better
(lnorm_17 <- plnorm(17, mean_log, sd_log, lower.tail=FALSE))
## [1] 0.003974747
# standard deviation times from the mean associated to 17 wins
(deviation_17 <- abs(log(log(17)/mean_log, base=sd_log)))
## [1] 6.331331
# countercheck the deviation value
(mean_log - log(17)*sd_log^deviation_17)
## [1] -1.110223e-16
# probability associated to the 14 wins performance or better
(lnorm_14 <- plnorm(14, mean_log, sd_log, lower.tail=FALSE))
## [1] 0.007927063
# standard deviation times from the mean associated to 14 wins
(deviation_14 <- abs(log(log(14)/mean_log, base=sd_log)))
## [1] 6.008758
# countercheck the deviation value
(mean_log - log(14)*sd_log^deviation_14)
## [1] 0
As a results summary:
17 wins or better performance has probability to happen equal to 0.0039747
17 wins performance is 6.331 standard deviation times from the mean (in the log scale)
14 wins or better performance has probability to happen equal to 0.0079271
14 wins performance is 6.009 standard deviation times from the mean (in the log scale)
Conclusions
Performances in sports can be remarkable. Statistical analysis of sports historical data may be useful to better quantify performances magnitude.
In this post I have identified outliers in the ATP Grand Slam tennis tournaments champions based on historical data reporting from year 1877 up to 2016
The log-normal distribution was used for empirical distribution fitting. Computation of outliers performance probabilities and their deviation from the mean have been outlined.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.