Friday, October 28, 2016

Tennis Analytics - ATP Grand Slam Champions

Leaderboard Analysis

Abstract

In this post I am going to analyze a dataset reporting ATP Grand Slam tennis tournaments winners historical data. Specifically, I am going to evaluate the potential presence of outliers.

To clarify that an outlier:

  • is a value remarkably distant from other ones within the same dataset

  • lies at the far boundary of the overall dataset distribution

  • is rare to be observed

  • may not be unique within the same dataset

Analysis

I start my analysis by loading ATP Grand Slam tennis tournaments winners historical data (ref. [1]). I then take a quick view to the top and tail records.

library_toload <- c("dplyr", "knitr", "fitdistrplus", "extremevalues")
invisible(lapply(library_toload, function(x) {suppressPackageStartupMessages(library(x, character.only=TRUE))}))

slam_win <- read.delim("tennis-grand-slam-winners.txt", sep="\t")
dim(slam_win)
## [1] 485   4
nr <- nrow(slam_win)
kable(head(slam_win), row.names=FALSE)
YEAR TOURNAMENT WINNER RUNNER.UP
2016 U.S. Open Stan Wawrinka Novak Djokovic
2016 Wimbledon Andy Murray Milos Raonic
2016 French Open Novak Djokovic Andy Murray
2016 Australian Open Novak Djokovic Andy Murray
2015 U.S. Open Novak Djokovic Roger Federer
2015 Wimbledon Novak Djokovic Roger Federer
kable(tail(slam_win), row.names=FALSE)
YEAR TOURNAMENT WINNER RUNNER.UP
1881 U.S. Open Richard D. Sears William E. Glyn
1881 Wimbledon William Renshaw John Hartley
1880 Wimbledon John Hartley Herbert Lawford
1879 Wimbledon John Hartley V. St. Leger Gould
1878 Wimbledon Frank Hadow Spencer Gore
1877 Wimbledon Spencer Gore William Marshall

As we can see, the records are ordered by year, starting on 1877 and ending on 2016. In 140 years, 485 ATP Grand Slam tournaments took place and 166 champions won at least one event.

I then order champions' names by decreasing number of wins and associate a leaderboard position number.

# ordering in decreasing wins order
slam_win_df <- as.data.frame(table(slam_win[,"WINNER"]))
slam_win_df <-  slam_win_df %>% arrange(desc(Freq))

# computing champions' leaderboard position
pos <- rep(0, nrow(slam_win_df))
pos[1] <- 1
for(i in 2:nrow(slam_win_df)) {
  pos[i] <- ifelse(slam_win_df$Freq[i] != slam_win_df$Freq[i-1], i, pos[i-1])
}

# creating and showing leaderboard dataframe
slam_winners <- data.frame(RANK = pos, 
                           PLAYER = slam_win_df$Var1,
                           WINS = slam_win_df$Freq)
kable(slam_winners, caption="ATP Gran Slam Tournaments Leaderboard")
ATP Gran Slam Tournaments Leaderboard
RANK PLAYER WINS
1 Roger Federer 17
2 Pete Sampras 14
2 Rafael Nadal 14
4 Novak Djokovic 12
4 Roy Emerson 12
6 Bjorn Borg 11
6 Rod Laver 11
8 William T. Tilden 10
9 Andre Agassi 8
9 Fred Perry 8
9 Henri Cochet 8
9 Ivan Lendl 8
9 Jimmy Connors 8
9 Ken Rosewall 8
9 Max Decugis 8
9 William A. Larned 8
17 John McEnroe 7
17 John Newcombe 7
17 Mats Wilander 7
17 Rene Lacoste 7
17 Richard D. Sears 7
17 William Renshaw 7
23 Boris Becker 6
23 Donald Budge 6
23 Stefan Edberg 6
26 Frank Sedgman 5
26 Jack Crawford 5
26 Jean Borotra 5
26 Laurie Doherty 5
26 Tony Trabert 5
31 Andre Vacherot 4
31 Anthony Wilding 4
31 Ashley J. Cooper 4
31 Frank Parker 4
31 Guillermo Vilas 4
31 Jim Courier 4
31 Lewis Hoad 4
31 Manuel Santana 4
31 Pat O̢۪Hara Wood 4
31 Paul Ayme 4
31 Reggie Doherty 4
31 Robert D. Wrenn 4
43 Adrian Quist 3
43 Andy Murray 3
43 Arthur Ashe 3
43 Arthur Gore 3
43 Gerald Patterson 3
43 Gustavo Kuerten 3
43 H. Ellsworth Vines 3
43 Jack Kramer 3
43 Jan Kodes 3
43 Jaroslav Drobny 3
43 Malcolm D. Whitman 3
43 Maurice Germot 3
43 Neale Fraser 3
43 Norman Brookes 3
43 Oliver S. Campbell 3
43 Robert Riggs 3
43 Stan Wawrinka 3
43 Wilfred Baddeley 3
43 William M. Johnston 3
62 Andre Gobert 2
62 E. Victor Seixas Jr. 2
62 Fred Stolle 2
62 Gottfried Von Cramm 2
62 Henry W. Slocum Jr. 2
62 Ilie Nastase 2
62 James Anderson 2
62 Johan Kriek 2
62 John Bromwich 2
62 John Hartley 2
62 Joshua Pim 2
62 Lleyton Hewitt 2
62 Marat Safin 2
62 Maurice E. McLoughlin 2
62 Mervyn Rose 2
62 Nicola Pietrangeli 2
62 Patrick Rafter 2
62 R. Lindley Murray 2
62 Richard A. Gonzales 2
62 Richard N. Williams 2
62 Rodney Heath 2
62 Sergi Bruguera 2
62 Stan Smith 2
62 Tony Wilding 2
62 Yevgeny Kafelnikov 2
87 A.R.F. Kingscote 1
87 Adriano Panatta 1
87 Albert Costa 1
87 Alejandro Olmedo 1
87 Alex Olmedo 1
87 Andres Gimeno 1
87 Andres Gomez 1
87 Andy Roddick 1
87 Arthur Larsen 1
87 Beals C. Wright 1
87 Bill Bowrey 1
87 Brian Teacher 1
87 Budge Patty 1
87 C.R. McKinley 1
87 Carlos Moya 1
87 Cecil Parke 1
87 Dick Savitt 1
87 Dinny Pails 1
87 Donald McNeill 1
87 E.F. Parker 1
87 Ernest Renshaw 1
87 F.R. Schroeder 1
87 Francis Lowe 1
87 Francois Blanchy 1
87 Frank Hadow 1
87 Fred Alexander 1
87 Fred H. Hovey 1
87 Frederick R. Schroeder, Jr. 1
87 Gar Moon 1
87 Gaston Gaudio 1
87 Goran Ivanisevic 1
87 H. Briggs 1
87 Harold Mahoney 1
87 Henner Henkel 1
87 Herbert Lawford 1
87 Holcombe Ward 1
87 Horace Rice 1
87 Hugh L. Doherty 1
87 J. Schopfer 1
87 J.E. Patty 1
87 Jean Samazeuilh 1
87 John Crawford 1
87 John Gregory 1
87 John H. Doeg 1
87 John Hawkes 1
87 Jozsef Asboth 1
87 Juan Carlos Ferrero 1
87 Juan Martin del Potro 1
87 Ken McGregor 1
87 L. Riboulet 1
87 Lt. Joseph R. Hunt 1
87 M. Vacherot 1
87 Malcolm J. Anderson 1
87 Manuel Orantes 1
87 Marcel Bernard 1
87 Marin Cilic 1
87 Mark Edmondson 1
87 Michael Chang 1
87 Michael Stich 1
87 Pat Cash 1
87 Petr Korda 1
87 R. Falkenburg 1
87 R. Savitt 1
87 Rafael Osuna 1
87 Rhys Gemmell 1
87 Richard Krajicek 1
87 Roscoe Tanner 1
87 S.B. Wood 1
87 Spencer Gore 1
87 Sven Davidson 1
87 Thomas Johannson 1
87 Thomas Muster 1
87 Tony Roche 1
87 Vitas Gerulaitis 1
87 Vivian McGrath 1
87 William Hamilton 1
87 William McNeill 1
87 Wilmer L. Allison 1
87 Yannick Noah 1
87 Yvon Petra 1

At the top of the leaderboard we recognize well known tennis champions (!). Please note that equal wins champions are sorted in alphabetic order.

It is interesting to take a look at wins frequencies and summary.

wins_frequency <- as.data.frame(table(slam_winners[,"WINS"]))
colnames(wins_frequency) <- c("WINS", "FREQUENCY")
kable(wins_frequency)
WINS FREQUENCY
1 80
2 25
3 19
4 12
5 5
6 3
7 6
8 8
10 1
11 2
12 2
14 2
17 1
summary(slam_winners[,"WINS"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.922   3.750  17.000

I then fit the empirical distribution of wins by taking advantage of the log-normal distribution, which appears to be well suited for the purpose.

Curiously, reference [3] reports that in nature species abundance follows a log-normal distribution according to Preston's studies. It is a nice similitude with professional tennis champions 'breeds', I think.

For the purpose, I take advantage of the fitdist() function provided by the fitdistrplus package.

fw <- fitdist(slam_winners$WINS, "lnorm")
summary(fw)
## Fitting of the distribution ' lnorm ' by maximum likelihood 
## Parameters : 
##          estimate Std. Error
## meanlog 0.7033182 0.06228297
## sdlog   0.8024599 0.04404040
## Loglikelihood:  -315.7624   AIC:  635.5249   BIC:  641.7489 
## Correlation matrix:
##         meanlog sdlog
## meanlog       1     0
## sdlog         0     1
plot(fw)

I investigate the presence of outliers by imposing very restrictive quantiles equal to 0.1%. I take advantage of the getOutliersI() and outlierPlot() functions provided by the extremevalues package.

# left outliers quantile
left_thresh <- 0.001 

# right outliers quantile
right_thresh <- 0.999

# determining the outliers
slam_outlier <- getOutliersI(as.vector(slam_winners$WINS), 
                             FLim = c(left_thresh, right_thresh), 
                             distribution = "lognormal")

# outliers are plotted in red color
outlierPlot(slam_winners$WINS, slam_outlier, mode="qq")

Above outlier plot highlights in red the presence of three outliers related to top wins performance. The corresponding top three champions are:

slam_winners[slam_outlier$iRight,]
RANK PLAYER WINS
1 Roger Federer 17
2 Pete Sampras 14
2 Rafael Nadal 14

On the contrary, there are not any outliers at the left side of the distribution.

I compute probabilities and standard deviation from the mean related to 17 and 14 wins performance. See reference [2] pag. 6 Table 1 for details.

# mean and standard deviation of the fitted lognormal distribution
(mean_log <- fw$estimate["meanlog"])
##   meanlog 
## 0.7033182
(sd_log <- fw$estimate["sdlog"])
##     sdlog 
## 0.8024599
# clearing names
names(mean_log) <- NULL
names(sd_log) <- NULL

# average value based on fitted parameters
(mean_x <- exp(mean_log + 0.5*sd_log^2))
## [1] 2.787902
# probability associated to the 17 wins performance or better
(lnorm_17 <- plnorm(17, mean_log, sd_log, lower.tail=FALSE))
## [1] 0.003974747
# standard deviation times from the mean associated to 17 wins
(deviation_17 <- abs(log(log(17)/mean_log, base=sd_log)))
## [1] 6.331331
# countercheck the deviation value
(mean_log - log(17)*sd_log^deviation_17)
## [1] -1.110223e-16
# probability associated to the 14 wins performance or better
(lnorm_14 <- plnorm(14, mean_log, sd_log, lower.tail=FALSE))
## [1] 0.007927063
# standard deviation times from the mean associated to 14 wins
(deviation_14 <- abs(log(log(14)/mean_log, base=sd_log)))
## [1] 6.008758
# countercheck the deviation value
(mean_log - log(14)*sd_log^deviation_14)
## [1] 0

As a results summary:

  • 17 wins or better performance has probability to happen equal to 0.0039747

  • 17 wins performance is 6.331 standard deviation times from the mean (in the log scale)

  • 14 wins or better performance has probability to happen equal to 0.0079271

  • 14 wins performance is 6.009 standard deviation times from the mean (in the log scale)

Conclusions

Performances in sports can be remarkable. Statistical analysis of sports historical data may be useful to better quantify performances magnitude.

In this post I have identified outliers in the ATP Grand Slam tennis tournaments champions based on historical data reporting from year 1877 up to 2016

The log-normal distribution was used for empirical distribution fitting. Computation of outliers performance probabilities and their deviation from the mean have been outlined.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.