around-R: Tennis Analytics - ATP Grand Slam Champions

Leaderboard Analysis

Abstract

In this post I am going to analyze a dataset reporting ATP Grand Slam tennis tournaments winners historical data. Specifically, I am going to evaluate the potential presence of outliers.

To clarify that an outlier:

is a value remarkably distant from other ones within the same dataset
lies at the far boundary of the overall dataset distribution
is rare to be observed
may not be unique within the same dataset

Analysis

I start my analysis by loading ATP Grand Slam tennis tournaments winners historical data (ref. [1]). I then take a quick view to the top and tail records.

library_toload <- c("dplyr", "knitr", "fitdistrplus", "extremevalues")
invisible(lapply(library_toload, function(x) {suppressPackageStartupMessages(library(x, character.only=TRUE))}))

slam_win <- read.delim("tennis-grand-slam-winners.txt", sep="\t")
dim(slam_win)

## [1] 485   4

nr <- nrow(slam_win)
kable(head(slam_win), row.names=FALSE)

YEAR	TOURNAMENT	WINNER	RUNNER.UP
2016	U.S. Open	Stan Wawrinka	Novak Djokovic
2016	Wimbledon	Andy Murray	Milos Raonic
2016	French Open	Novak Djokovic	Andy Murray
2016	Australian Open	Novak Djokovic	Andy Murray
2015	U.S. Open	Novak Djokovic	Roger Federer
2015	Wimbledon	Novak Djokovic	Roger Federer

kable(tail(slam_win), row.names=FALSE)

YEAR	TOURNAMENT	WINNER	RUNNER.UP
1881	U.S. Open	Richard D. Sears	William E. Glyn
1881	Wimbledon	William Renshaw	John Hartley
1880	Wimbledon	John Hartley	Herbert Lawford
1879	Wimbledon	John Hartley	V. St.Â Leger Gould
1878	Wimbledon	Frank Hadow	Spencer Gore
1877	Wimbledon	Spencer Gore	William Marshall

As we can see, the records are ordered by year, starting on 1877 and ending on 2016. In 140 years, 485 ATP Grand Slam tournaments took place and 166 champions won at least one event.

I then order champions' names by decreasing number of wins and associate a leaderboard position number.

# ordering in decreasing wins order
slam_win_df <- as.data.frame(table(slam_win[,"WINNER"]))
slam_win_df <-  slam_win_df %>% arrange(desc(Freq))

# computing champions' leaderboard position
pos <- rep(0, nrow(slam_win_df))
pos[1] <- 1
for(i in 2:nrow(slam_win_df)) {
  pos[i] <- ifelse(slam_win_df$Freq[i] != slam_win_df$Freq[i-1], i, pos[i-1])
}

# creating and showing leaderboard dataframe
slam_winners <- data.frame(RANK = pos, 
                           PLAYER = slam_win_df$Var1,
                           WINS = slam_win_df$Freq)
kable(slam_winners, caption="ATP Gran Slam Tournaments Leaderboard")

ATP Gran Slam Tournaments Leaderboard
RANK	PLAYER	WINS
1	Roger Federer	17
2	Pete Sampras	14
2	Rafael Nadal	14
4	Novak Djokovic	12
4	Roy Emerson	12
6	Bjorn Borg	11
6	Rod Laver	11
8	William T. Tilden	10
9	Andre Agassi	8
9	Fred Perry	8
9	Henri Cochet	8
9	Ivan Lendl	8
9	Jimmy Connors	8
9	Ken Rosewall	8
9	Max Decugis	8
9	William A. Larned	8
17	John McEnroe	7
17	John Newcombe	7
17	Mats Wilander	7
17	Rene Lacoste	7
17	Richard D. Sears	7
17	William Renshaw	7
23	Boris Becker	6
23	Donald Budge	6
23	Stefan Edberg	6
26	Frank Sedgman	5
26	Jack Crawford	5
26	Jean Borotra	5
26	Laurie Doherty	5
26	Tony Trabert	5
31	Andre Vacherot	4
31	Anthony Wilding	4
31	Ashley J. Cooper	4
31	Frank Parker	4
31	Guillermo Vilas	4
31	Jim Courier	4
31	Lewis Hoad	4
31	Manuel Santana	4
31	Pat Oâ€™Hara Wood	4
31	Paul Ayme	4
31	Reggie Doherty	4
31	Robert D. Wrenn	4
43	Adrian Quist	3
43	Andy Murray	3
43	Arthur Ashe	3
43	Arthur Gore	3
43	Gerald Patterson	3
43	Gustavo Kuerten	3
43	H. Ellsworth Vines	3
43	Jack Kramer	3
43	Jan Kodes	3
43	Jaroslav Drobny	3
43	Malcolm D. Whitman	3
43	Maurice Germot	3
43	Neale Fraser	3
43	Norman Brookes	3
43	Oliver S. Campbell	3
43	Robert Riggs	3
43	Stan Wawrinka	3
43	Wilfred Baddeley	3
43	William M. Johnston	3
62	Andre Gobert	2
62	E. Victor Seixas Jr.	2
62	Fred Stolle	2
62	Gottfried Von Cramm	2
62	Henry W. Slocum Jr.	2
62	Ilie Nastase	2
62	James Anderson	2
62	Johan Kriek	2
62	John Bromwich	2
62	John Hartley	2
62	Joshua Pim	2
62	Lleyton Hewitt	2
62	Marat Safin	2
62	Maurice E. McLoughlin	2
62	Mervyn Rose	2
62	Nicola Pietrangeli	2
62	Patrick Rafter	2
62	R. Lindley Murray	2
62	Richard A. Gonzales	2
62	Richard N. Williams	2
62	Rodney Heath	2
62	Sergi Bruguera	2
62	Stan Smith	2
62	Tony Wilding	2
62	Yevgeny Kafelnikov	2
87	A.R.F. Kingscote	1
87	Adriano Panatta	1
87	Albert Costa	1
87	Alejandro Olmedo	1
87	Alex Olmedo	1
87	Andres Gimeno	1
87	Andres Gomez	1
87	Andy Roddick	1
87	Arthur Larsen	1
87	Beals C. Wright	1
87	Bill Bowrey	1
87	Brian Teacher	1
87	Budge Patty	1
87	C.R. McKinley	1
87	Carlos Moya	1
87	Cecil Parke	1
87	Dick Savitt	1
87	Dinny Pails	1
87	Donald McNeill	1
87	E.F. Parker	1
87	Ernest Renshaw	1
87	F.R. Schroeder	1
87	Francis Lowe	1
87	Francois Blanchy	1
87	Frank Hadow	1
87	Fred Alexander	1
87	Fred H. Hovey	1
87	Frederick R. Schroeder, Jr.	1
87	Gar Moon	1
87	Gaston Gaudio	1
87	Goran Ivanisevic	1
87	H. Briggs	1
87	Harold Mahoney	1
87	Henner Henkel	1
87	Herbert Lawford	1
87	Holcombe Ward	1
87	Horace Rice	1
87	Hugh L. Doherty	1
87	J. Schopfer	1
87	J.E. Patty	1
87	Jean Samazeuilh	1
87	John Crawford	1
87	John Gregory	1
87	John H. Doeg	1
87	John Hawkes	1
87	Jozsef Asboth	1
87	Juan Carlos Ferrero	1
87	Juan Martin del Potro	1
87	Ken McGregor	1
87	L. Riboulet	1
87	Lt. Joseph R. Hunt	1
87	M. Vacherot	1
87	Malcolm J. Anderson	1
87	Manuel Orantes	1
87	Marcel Bernard	1
87	Marin Cilic	1
87	Mark Edmondson	1
87	Michael Chang	1
87	Michael Stich	1
87	Pat Cash	1
87	Petr Korda	1
87	R. Falkenburg	1
87	R. Savitt	1
87	Rafael Osuna	1
87	Rhys Gemmell	1
87	Richard Krajicek	1
87	Roscoe Tanner	1
87	S.B. Wood	1
87	Spencer Gore	1
87	Sven Davidson	1
87	Thomas Johannson	1
87	Thomas Muster	1
87	Tony Roche	1
87	Vitas Gerulaitis	1
87	Vivian McGrath	1
87	William Hamilton	1
87	William McNeill	1
87	Wilmer L. Allison	1
87	Yannick Noah	1
87	Yvon Petra	1

At the top of the leaderboard we recognize well known tennis champions (!). Please note that equal wins champions are sorted in alphabetic order.

It is interesting to take a look at wins frequencies and summary.

wins_frequency <- as.data.frame(table(slam_winners[,"WINS"]))
colnames(wins_frequency) <- c("WINS", "FREQUENCY")
kable(wins_frequency)

WINS	FREQUENCY
1	80
2	25
3	19
4	12
5	5
6	3
7	6
8	8
10	1
11	2
12	2
14	2
17	1

summary(slam_winners[,"WINS"])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.922   3.750  17.000

I then fit the empirical distribution of wins by taking advantage of the log-normal distribution, which appears to be well suited for the purpose.

Curiously, reference [3] reports that in nature species abundance follows a log-normal distribution according to Preston's studies. It is a nice similitude with professional tennis champions 'breeds', I think.

For the purpose, I take advantage of the fitdist() function provided by the fitdistrplus package.

fw <- fitdist(slam_winners$WINS, "lnorm")
summary(fw)

## Fitting of the distribution ' lnorm ' by maximum likelihood 
## Parameters : 
##          estimate Std. Error
## meanlog 0.7033182 0.06228297
## sdlog   0.8024599 0.04404040
## Loglikelihood:  -315.7624   AIC:  635.5249   BIC:  641.7489 
## Correlation matrix:
##         meanlog sdlog
## meanlog       1     0
## sdlog         0     1

plot(fw)

I investigate the presence of outliers by imposing very restrictive quantiles equal to 0.1%. I take advantage of the getOutliersI() and outlierPlot() functions provided by the extremevalues package.

# left outliers quantile
left_thresh <- 0.001 

# right outliers quantile
right_thresh <- 0.999

# determining the outliers
slam_outlier <- getOutliersI(as.vector(slam_winners$WINS), 
                             FLim = c(left_thresh, right_thresh), 
                             distribution = "lognormal")

# outliers are plotted in red color
outlierPlot(slam_winners$WINS, slam_outlier, mode="qq")

Above outlier plot highlights in red the presence of three outliers related to top wins performance. The corresponding top three champions are:

slam_winners[slam_outlier$iRight,]

RANK	PLAYER	WINS
1	Roger Federer	17
2	Pete Sampras	14
2	Rafael Nadal	14

On the contrary, there are not any outliers at the left side of the distribution.

I compute probabilities and standard deviation from the mean related to 17 and 14 wins performance. See reference [2] pag. 6 Table 1 for details.

# mean and standard deviation of the fitted lognormal distribution
(mean_log <- fw$estimate["meanlog"])

##   meanlog 
## 0.7033182

(sd_log <- fw$estimate["sdlog"])

##     sdlog 
## 0.8024599

# clearing names
names(mean_log) <- NULL
names(sd_log) <- NULL

# average value based on fitted parameters
(mean_x <- exp(mean_log + 0.5*sd_log^2))

## [1] 2.787902

# probability associated to the 17 wins performance or better
(lnorm_17 <- plnorm(17, mean_log, sd_log, lower.tail=FALSE))

## [1] 0.003974747

# standard deviation times from the mean associated to 17 wins
(deviation_17 <- abs(log(log(17)/mean_log, base=sd_log)))

## [1] 6.331331

# countercheck the deviation value
(mean_log - log(17)*sd_log^deviation_17)

## [1] -1.110223e-16

# probability associated to the 14 wins performance or better
(lnorm_14 <- plnorm(14, mean_log, sd_log, lower.tail=FALSE))

## [1] 0.007927063

# standard deviation times from the mean associated to 14 wins
(deviation_14 <- abs(log(log(14)/mean_log, base=sd_log)))

## [1] 6.008758

# countercheck the deviation value
(mean_log - log(14)*sd_log^deviation_14)

## [1] 0

As a results summary:

17 wins or better performance has probability to happen equal to 0.0039747
17 wins performance is 6.331 standard deviation times from the mean (in the log scale)
14 wins or better performance has probability to happen equal to 0.0079271
14 wins performance is 6.009 standard deviation times from the mean (in the log scale)

Conclusions

Performances in sports can be remarkable. Statistical analysis of sports historical data may be useful to better quantify performances magnitude.

In this post I have identified outliers in the ATP Grand Slam tennis tournaments champions based on historical data reporting from year 1877 up to 2016

The log-normal distribution was used for empirical distribution fitting. Computation of outliers performance probabilities and their deviation from the mean have been outlined.

around-R

Friday, October 28, 2016

Tennis Analytics - ATP Grand Slam Champions

Leaderboard Analysis

Abstract

Analysis

Conclusions

References

No comments:

Post a Comment

Featured Post

Plant Leaf Classification - Part 3

Total Pageviews