
Predicting transfer fees in professional European football before and during COVID-19 using machine learning

Pages 603-623 | Received 23 Jul 2021, Accepted 25 Nov 2022, Published online: 15 Dec 2022
 

ABSTRACT

Research question

Our study aims to extend findings from previous efforts exploring the factors associated with transfer fees to and from all big five league clubs in European football (men) by building upon advances in machine learning, which allow us to depart from linear functional forms. Furthermore, we provide a simple test of whether the transfer market has changed since the beginning of the COVID-19 pandemic.

Research methods

A fully flexible random forest estimator as well as generalized and quantile additive models are used to analyze smooth (non-linear) effects across different quantiles of scraped data (including remaining contract duration) from transfermarkt.de (n = 3,512). While we train our models with a randomly drawn subsample of before-COVID-19 transfers, we compare the prediction accuracy for two subsets of test data, that is, before and during COVID-19.

Results and findings

Since our findings suggest several non-linear predictors of transfer fees, moving beyond linearity is insightful and relevant. Moreover, our models trained with before-COVID-19 data significantly underestimate the actual transfer fees paid during COVID-19, particularly for high- and medium-priced players, thus calling into question any cooling-off effect in the transfer market.

Implications

In the discussion of our findings, we showcase how moving beyond linearity and modeling quantiles can be revealing for both research and practice. We discuss limitations such as sample selection issues and provide directions for future research.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Notes

1 While the website is often used to scrape transfer market value figures, the present study considers transfer fee figures (see Quansah et al., 2021, for conceptual differences between the two). Expert and crowdsource-based knowledge is needed to get these data, because there is no official register. Most transfer fees for players from outside Europe and for less popular players are not publicly accessible otherwise. Coates and Parshakov (2021), for example, relate market value to actual transfer fees and find that market value is a positive predictor (with a particular underestimation of the value of players with national team experience; see also Depken & Globan, 2021, for an alternative approach of considering deviations between the two as the dependent variable, a variable that is called transfer premium).

2 Even though this could eventually bias some predictions, we prefer this approach since we end up with a comparatively homogeneous sample of players.

3 The eligible transfers are scraped using Python packages (e.g. requests, lxml, openpyxl), following a web scraping approach (Landers et al., 2016). The web scraping code is available from the authors on request via a private GitHub repository.

4 Previous findings indicate that remaining contract duration is both theoretically and statistically important and should therefore be included in the models (Coates & Parshakov, 2021; Feess et al., 2004; Garcia-del-Barrio & Pujol, 2020; McHale & Holmes, 2022). In our own analyses, we find that missing information on the remaining contract duration is not random, but connected to time trends (e.g. there are only 22 out of 577 transfers in 2008/09 with information on remaining contract duration, while there are 463 out of 621 transfers with information on remaining contract duration in 2015/16). Thus, in our prediction analyses, we exclusively consider transfers from the season 2015/16 onwards, that is, the time period for which information about remaining contract duration is mostly available.

5 Because football is a complex contact sport, professional football players have a high injury rate (Hawkins et al., 2001; Pfirrmann et al., 2016). Moreover, studies show that injury history is an important risk factor for another football-related injury (Hägglund et al., 2006).

6 The random sample split was done automatically via an R function (sample, among other similar functions) as part of the cross-validation process. To do so, we set the sizes of the training and the test data for the before-COVID-19 transfers (70% and 30%, respectively, a widely used split percentage, which fits our models). Then, the function randomly draws 70% of the data as the training set and assigns the remaining 30% to the test set. To ensure reproducible results, we use the set.seed function to generate the same random sequence each time.
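The seeded 70/30 split described in this note can be sketched as follows. This is only an illustrative Python analogue of the idea; the note itself refers to R's sample and set.seed. The function name split_train_test, the seed value 42, and the placeholder transfer ids are assumptions introduced for the example and do not come from the study.

```python
import random


def split_train_test(records, train_share=0.7, seed=42):
    """Return (train, test) after a seeded shuffle of a copy of `records`.

    The seed value is an arbitrary assumption; any fixed seed makes the
    split repeatable, which is the role set.seed plays in R.
    """
    rng = random.Random(seed)          # private generator, fixed seed
    shuffled = list(records)           # copy so the input is not mutated
    rng.shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for the before-COVID-19 transfers (ids only).
transfers = list(range(1000))
train, test = split_train_test(transfers)
print(len(train), len(test))
```

Using a private random.Random instance rather than the module-level generator keeps the split repeatable even if other code draws random numbers in between.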

7 See Coates and Parshakov (2021), Fort et al. (2019), and Leeds (2014) for applications of standard quantile regression models.

8 Appendix Figure 1 (see supplementary material) shows the trends of season-specific median values of real transfer fees in the corresponding leagues of buying clubs. As could be expected, the highest median of transfer fees can be observed for Premier League clubs.

9 The functional form of other non-linear effects is also in line with expectations. However, since the estimates lack precision, we refrain from discussing these results in the main text. For instance, we observe a significant (inverted U-shaped) relation between injury proneness and transfer fees only for the 50th quantile. Our time trend variable is positive but lacks precision, particularly for the higher quantiles (i.e. the 75th and 90th quantiles).

10 For reasons of completeness, we present our models excluding remaining contract duration in Appendix Table 3 and Appendix Figure 3 (see supplementary material). Deviations (if any) between the results of our main specification and these models can simply be explained by other variables serving as rough proxies for contract duration, thus picking up some of the variance when remaining contract duration is excluded.

11 Additional robustness checks are performed by trimming 1% of the lowest and highest transfer fees in the test and the during-COVID-19 data set, respectively. Our main findings remain (Appendix Figure 2, see supplementary material).
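As a rough illustration of the trimming step in this note, the sketch below drops the lowest 1% and the highest 1% of fees from a list. The note does not state how ties or fractional counts are handled, so rounding down and trimming each tail separately are assumptions, as are the function name trim_extremes and the made-up fee values.

```python
def trim_extremes(fees, percent=1):
    """Drop the lowest and highest `percent` percent of values.

    Assumption: the count removed per tail is rounded down, and the
    result is returned in ascending order.
    """
    ordered = sorted(fees)
    k = len(ordered) * percent // 100   # observations cut from each tail
    return ordered[k:len(ordered) - k]


# Hypothetical fees (arbitrary units), not data from the study.
fees = list(range(1, 201))
trimmed = trim_extremes(fees)
print(len(trimmed), trimmed[0], trimmed[-1])
```

With 200 observations, two are removed from each tail, so 196 remain.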

12 We thank one of the anonymous reviewers for their insightful comments on this discussion.
