LAST UPDATED
Nov 22nd, 2018
Data Golf predictive model: methodology
Introduction
In this document we describe the current methodology behind our predictive model and discuss some interesting ideas and problems with prediction in golf more generally. We have previously written about our first attempt at modelling golf here, which I would recommend reading, although it is not necessary for following this article. This document is a little more technical than the previous one, so if you are struggling to follow along, it is probably worth reading the first methodology blog.

The goal of this prediction exercise is to estimate probabilities of certain finish positions in golf tournaments (e.g. winning, finishing in the top 10). We are going to obtain these estimates by specifying a probability distribution for each golfer's scores. With those distributions in hand, the probability of any tournament result can be estimated through simulation. Let's dig in to the details.

We model each golfer's performance as normally distributed with some unknown mean and variance. These means can be thought of as the current "ability" of each golfer. Performance in golf is only meaningful in relation to other golfers: a 72 on one golf course could indicate a very different performance than a 72 on a different course. Therefore, throughout this analysis we focus on an adjusted strokes-gained measure (i.e. how many strokes better you were than some benchmark) that allows for direct comparisons of performance on any course. To return to our simple probability model of a golfer's performance, we can now more specifically say that we are modelling each golfer's adjusted strokes-gained in a given round as normally distributed with some mean and some variance.
An obvious, but critical, point is that our measure of performance is in units of strokes per round. Strokes relative to the field are the currency of the game of golf: they decide who wins golf tournaments. If we can accurately specify each golfer's probability distribution of strokes-gained relative to some benchmark, then we can accurately estimate probabilities of certain events occurring in golf tournaments [1].
The overall approach we take can be broken down as follows: first, we adjust raw scores from all professional golf tournaments to obtain a measure of performance that is not confounded by the difficulty of the course it was played on. Second, we use various statistical methods to estimate the player-specific means and variances (mentioned above) using all available data before a round is played. Third and finally, we use these estimates to simulate golf tournaments and obtain the probabilities of interest.
Let's talk first about how to convert a set of raw scores into the more interpretable adjusted strokes-gained measure. The approach we take roughly follows Connolly and Rendleman (2008). We estimate the following regression:
$$\normalsize (1) \>\>\>\>\>\>\>\> S_{ij} = \mu_{i}(t) + \delta_{j} + \epsilon_{ij}$$
where i indexes the golfer and j indexes a tournament-round (or a round played on a specific course for multi-course tournaments), $$\normalsize S_{ij}$$ is the raw score in a given tournament-round, $$\normalsize \mu_{i}(t)$$ is some player-specific function of "golf time" (i.e. the sequence of rounds for a golfer), and $$\normalsize \delta_{j}$$ is the coefficient from a dummy variable for tournament-round j. This regression produces estimates of each golfer's ability at each point in time ($$\normalsize \mu_{i}(t)$$) and of the difficulty of each course in each tournament-round ($$\normalsize \delta_{j}$$). $$\normalsize \mu_{i}(t)$$ could be any function: one simple functional form could just be a constant, which would mean we force each player's ability to be constant throughout our sample time period (a strong assumption). In practice, we fit second or third order polynomials, depending on how many data points the player has in our sample, which allows each golfer's ability to vary flexibly over time. All that we actually care about from this regression are the estimates of course difficulty, as we define our adjusted strokes-gained measure as $$\normalsize S_{ij} - \delta_{j}$$. The interpretation of a single $$\normalsize \delta_{j}$$ is the expected score for some reference player at the course tournament-round j was played on. (More intuitively, $$\normalsize \delta_{j}$$ can be thought of as the field average score in round j after accounting for the skill of each golfer in that field.) Therefore, our adjusted strokes-gained measure is interpreted as the performance relative to that reference point. The choice of a reference player is arbitrary and not of great importance, so we typically make everything relative to the average PGA Tour player in a given year. A final point about this specification: there are no course-player effects (i.e. players are not allowed to "match" better with certain courses than others).
With respect to obtaining consistent estimates of the $$\normalsize \delta_{j}$$ (which is our only goal here), this is likely not too important [2].
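To make the mechanics of (1) concrete, here is a minimal sketch in Python of recovering the $$\normalsize \delta_{j}$$ by least squares. It assumes the simplest functional form mentioned above (a constant ability for each player, rather than the polynomials fit in practice). The function names, the argument layout, and the normalisation to the sample-average player are illustrative choices for this example, not the production implementation.

```python
import numpy as np

def estimate_round_difficulty(player_idx, round_idx, scores, n_players, n_rounds):
    """Fit S_ij = mu_i + delta_j by least squares, with constant mu_i (an assumption)."""
    player_idx = np.asarray(player_idx)
    round_idx = np.asarray(round_idx)
    scores = np.asarray(scores, dtype=float)
    n_obs = len(scores)

    # One dummy column per player and one per tournament-round.
    X = np.zeros((n_obs, n_players + n_rounds))
    X[np.arange(n_obs), player_idx] = 1.0
    X[np.arange(n_obs), n_players + round_idx] = 1.0

    # The two sets of dummies are collinear, so lstsq returns one of many
    # equivalent fits; the level is pinned down in the next step.
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    mu, delta = coef[:n_players], coef[n_players:]

    # Hypothetical normalisation: the reference player is the sample-average player.
    shift = mu.mean()
    return mu - shift, delta + shift

def adjusted_scores(scores, round_idx, delta):
    """Adjusted measure S_ij - delta_j, as defined in the text."""
    return np.asarray(scores, dtype=float) - delta[np.asarray(round_idx)]
```

Because the dependent variable here is the raw score, a negative adjusted value under this definition means the golfer beat the reference player's expected score in that round.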
With our adjusted strokes-gained measure in hand, the next step is to estimate the golfer-specific parameters: the mean and the variance of their scoring distributions (at each point in time). It would seem that the function $$\normalsize \mu_{i}(t)$$ would be a good candidate for an estimate of the mean of player i's scoring distribution [3]. It may be useful for some settings, but when your goal is predicting out-of-sample, I don't think it is. Rather, we are going to estimate our player-specific means using regression and backtesting. It's worth noting that this method is not quite internally consistent. We require estimates of player ability at each point in time to estimate the course difficulty parameters ($$\normalsize \delta_{j}$$), but we do not actually use the player ability estimates from (1) to make predictions [4]. You can think of the purpose of estimating (1) as only to recover the course difficulty parameters ($$\normalsize \delta_{j}$$), from which we can calculate an adjusted strokes-gained measure for each round played in our sample. The remainder of this document is concerned with how best to predict these adjusted strokes-gained values with the available data at the time each round is played.
Predicting scores using historical total strokes-gained
In this section we give the overview of our predictive model and in the following two sections we discuss the (potential) addition of a couple other features to the model.
The estimating sample includes data from 2010 onwards on the PGA Tour, Web.com Tour, and European Tour. We use a regression framework to predict a golfer's adjusted score in a tournament-round using only information available up to that date. This seems to be a good fit for our goals with this model (i.e. predicting out-of-sample), while you could maybe argue the model in (1) would be better at describing data in-sample. In this first iteration of the model, the main input to predict strokes-gained is a golfer's historical strokes-gained (seems logical enough, right?). We expect that recent strokes-gained performances are more relevant than performances further into the past, but we will let the data decide whether and to what degree that is the case. For now, suppose we have a weighting scheme: that is, each round a golfer has played moving back in time has been assigned a weight. From this we construct a weighted average and use that to predict a golfer's adjusted strokes-gained in their next tournament-round. Also used to form these predictions are the number of rounds that the weighted average is calculated from, and the number of days since a golfer's last tournament-round. More specifically, predictions are the fitted values from a regression of adjusted strokes-gained in a given round on the set of predictors (weighted average SG up to that point in time, rounds played up to that point in time, days since last tournament-round) and various interactions of these predictors. The figure below summarizes the predictions the model makes: we plot fitted values as a function of how many rounds a golfer has played for a few different values of the weighted strokes-gained average:

Notes: Plotted are predicted values from a regression model. Predictions (in units of SG per round) are a function of how many rounds a golfer has played and their historical weighted average strokes-gained. The weighting scheme used here to construct the weighted average would be considered middle of the road in terms of how heavily it weights recent versus older rounds. All predictions are for a golfer who played in the previous week.
There are a couple of main takeaways here. First, even for golfers who have played a lot (i.e. 150 rounds or more), there is some regression to the mean. That is, if a golfer has a weighted average of +2 then our prediction for their next tournament-round might be just +1.8. Importantly, how much regression to the mean is present depends on the weighting scheme. Longer-term weighting schemes (i.e. those that don't weight recent rounds that much more than less recent ones) exhibit less regression to the mean, while shorter-term weighting schemes exhibit more. This makes sense intuitively, as we would expect short-term form to be less predictive than long-term form. However, what might be a little less intuitive is the fact that these shorter-term weighting schemes can outperform the longer-term ones. The reason is that although short-term form is not as predictive as long-term form (in the sense that a 1 stroke increase in scoring average over a shorter time horizon does not translate to an average increase of 1 stroke moving forward, while a 1 stroke increase in long-term form more or less does), there is more variance in short-term form across players [5].
The second takeaway is the pattern of discounting as a function of the number of rounds played. As you would expect, the smaller the sample of rounds we have at our disposal, the more a golfer's past performance is regressed to the mean. As the number of rounds goes to zero, our predictions converge towards about -2 adjusted strokes-gained. It should also be pointed out that another input to the model is which tour (PGA, Euro, or Web.com) the tournament is a part of: this has an impact on very low-data predictions, as rookies / new players are generally of different quality on different tours.
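As an illustration of what a weighting scheme looks like, the sketch below computes an exponentially decayed weighted average of a golfer's past adjusted strokes-gained. The exponential form matches the single decay parameter described in the model evaluation section; indexing the decay by round count rather than by calendar time, and the names `decayed_average` and `decay`, are assumptions made for this example.

```python
import numpy as np

def decayed_average(sg_history, decay):
    """Weighted average of past adjusted SG, ordered oldest to newest.

    The round played k rounds before the most recent one gets weight decay**k
    (a hypothetical parameterisation of the weighting scheme).
    """
    sg = np.asarray(sg_history, dtype=float)
    if sg.size == 0:
        raise ValueError("need at least one past round")
    ages = np.arange(sg.size)[::-1]      # 0 for the most recent round
    weights = decay ** ages
    return float(np.sum(weights * sg) / np.sum(weights))
```

A `decay` of 1 gives the plain career average (the longest-term scheme), while smaller values put more weight on recent rounds. In the regression described above, this average would be one predictor alongside the number of rounds played and the days since the last round.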

The predicted values from this regression are our estimates for the player-specific means. What about player-specific variances? These are estimated by analyzing the residuals from the regression model above. The residuals are used because we want to estimate the part of the variance in a golfer's adjusted scores that is not due to their ability changing over time. We won't cover the details of estimating player-specific variances, but will make two general points. First, golfers for whom we have a lot of data have their variance parameter estimated just using their data, while golfers with less data available have their variance parameters estimated by looking at similar golfers. Second, estimates of variance are not that predictive (i.e. high-variance players in 2017 will tend to have lower variances in 2018). Therefore, we regress our variance estimates towards the tour average (e.g. a golfer who had a standard deviation of 3.0 in 2018 might be given an estimate of 2.88 moving forward).

With our assumption of normality, along with estimates (or, predictions) of each golfer's mean adjusted strokes-gained and the variance in their adjusted strokes-gained, we can now easily simulate a golf tournament. Each iteration draws a score from each golfer's probability distribution, and through many iterations we can define the probability of some event (e.g. golfer A winning) as the number of times it occurred divided by the number of iterations.
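The simulation step can be sketched in a few lines of Python. This is a simplified illustration, not the full simulator: it assumes independent rounds, ignores the cut and ties, and treats higher strokes-gained as better. The function name `simulate_tournament` and its arguments are hypothetical.

```python
import numpy as np

def simulate_tournament(means, sds, n_rounds=4, top_k=10, n_sims=20000, seed=0):
    """Estimate win and top-k probabilities from per-round normal SG distributions."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    n_players = len(means)

    # Draw every golfer's SG for every round of every simulated tournament.
    draws = rng.normal(means[None, :, None], sds[None, :, None],
                       size=(n_sims, n_players, n_rounds))
    totals = draws.sum(axis=2)                      # shape (n_sims, n_players)

    # Rank golfers within each simulation; rank 0 is the highest total SG.
    ranks = (-totals).argsort(axis=1).argsort(axis=1)

    # Probability of an event = share of simulations in which it happened.
    p_win = (ranks == 0).mean(axis=0)
    p_top_k = (ranks < top_k).mean(axis=0)
    return p_win, p_top_k
```

Any other finish probability can be read off the same `ranks` array, which is the appeal of the simulation approach.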
Incorporating detailed strokes-gained categories
The model above is fairly simple (which is a good thing). But, given that total strokes-gained can be broken down into 4 categories, each of which is (very conveniently) expressed in units of strokes per round, a logical next step is to make use of this breakdown when attempting to predict total strokes-gained. This will improve predictions if certain categories are more predictive than others. For example, if strokes-gained off-the-tee (SG:OTT) is very predictive of future SG:OTT, while strokes-gained putting (SG:PUTT) is not that predictive of future SG:PUTT, then we should have different predictions for two players who have both been averaging +2 total strokes-gained, but have achieved this differently. More specifically, we would tend to predict that the golfer who has gained the majority of those 2 strokes from his off-the-tee play to stay near +2, while a golfer who gained the majority of their strokes through putting would be predicted to move away from +2 (i.e. regress towards the mean).
Because total strokes-gained equals the sum of its parts (off-the-tee (OTT), approach (APP), around-the-green (ARG), and putting (PUTT)), we can do some nice regression exercises. Consider the following regression:
$$\normalsize (2) \>\>\>\>\>\>\>\>TOTAL_{ij} = \beta_{1}\cdot OTT_{i,-j} + \beta_{2} \cdot APP_{i,-j} + \beta_{3} \cdot ARG_{i,-j} + \beta_{4} \cdot PUTT_{i,-j} + u_{ij}$$
where $$\normalsize TOTAL_{ij}$$ is total adjusted strokes-gained for player i in tournament-round j, and the 4 regressors are all defined similarly as some weighted average for each category using all rounds up to but not including round j [6]. Therefore, in this regression we are predicting total strokes-gained using a golfer's historical averages in each category (all of which are adjusted [7]). We can also run 4 other regressions where we replace the dependent variable here ($$\normalsize TOTAL_{ij}$$) with the golfer's performance in round j in each strokes-gained category (OTT, APP, etc.). This will have the nice property that the 4 coefficients on $$\normalsize OTT_{i, -j}$$ (for example) from the latter 4 regressions will add up to $$\normalsize \beta_{1}$$ in the regression above.
So what do we find? The coefficients are, roughly, $$\normalsize \beta_{1} = 1.2$$, $$\normalsize \beta_{2} = 1$$, $$\normalsize \beta_{3} = 0.9$$, and $$\normalsize \beta_{4} = 0.6$$. Recall their interpretation: $$\normalsize \beta_{1}$$ can be thought of as the predicted increase in total strokes-gained from having a historical average SG:OTT that is 1 stroke higher, holding constant the golfer's historical performance in all other SG categories. Therefore, the fact that $$\normalsize \beta_{1}$$ is greater than 1 is very interesting (or, worrisome?!). Why would a 1 stroke increase in historical SG:OTT be associated with a greater than 1 stroke increase in future total strokes-gained? We can get an answer by looking at our subregressions: using $$\normalsize OTT_{ij}$$ as the dependent variable, the coefficient is close to 1 (as we would perhaps expect), using $$\normalsize APP_{ij}$$ the coefficient is 0.2, and for the other two categories the coefficients are both roughly 0. So, if you take these estimates seriously (which we do; this is a robust result), this means that historical SG:OTT performance has predictive power not only for future SG:OTT performance, but also for future SG:APP performance. That is interesting. This means that for a golfer who is currently averaging +1 SG:OTT and 0 SG:APP, we should predict their future SG:APP to be something like +0.2. A possible story here is that a golfer's off-the-tee play provides some signal about a golfer's general ball-striking ability (which we would define as being useful for both OTT and APP performance). The other coefficients fall in line with our intuition: putting is the least predictive of future performance.
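The adding-up property of the subregressions follows from least squares being linear in the dependent variable: since the total is the sum of the four categories, the coefficients from the four category regressions must sum to the coefficients from the total regression. The sketch below checks this on synthetic data. The matrix `B`, the noise, and the sample size are made up for illustration and are not estimates from our data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic historical averages in each category: OTT, APP, ARG, PUTT.
X = rng.normal(size=(n, 4))

# Synthetic round-level SG in each category (B is arbitrary, not estimated).
B = rng.uniform(-0.2, 1.0, size=(4, 4))
parts = X @ B + rng.normal(scale=2.0, size=(n, 4))
total = parts.sum(axis=1)            # total SG is the sum of its parts

def ols(X, y):
    """Least-squares coefficients with no intercept, as in regression (2)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_total = ols(X, total)
# Column k holds the coefficients from the regression with category k as outcome.
beta_parts = np.column_stack([ols(X, parts[:, k]) for k in range(4)])

# Summing across the four subregressions reproduces the total regression.
adds_up = np.allclose(beta_parts.sum(axis=1), beta_total)
```

Reading across a row of `beta_parts` is the decomposition used above: it shows how much of a regressor's coefficient in the total regression comes through each category.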

How can we incorporate this knowledge into our predictive model to improve its performance? The main takeaway from the work above is that the strokes-gained categories differ in their predictive power for future strokes-gained performance (with OTT > APP > ARG > PUTT). However, a difficult practical issue is that we only have data on detailed strokes-gained performance for a subset of our data: namely PGA Tour events that have ShotLink set up on-site. We incorporate our findings above by using a reweighting method for each round that has detailed strokes-gained data available; if the SG categories aren't available, we simply use total strokes-gained. In this reweighting method, if there were two rounds that both were measured as +2 total strokes-gained, with one mainly due to off-the-tee play while the other was mainly due to putting, the former would be increased while the latter would be decreased. To determine which weighting works best, we just evaluate out-of-sample fit (discussed below). That's why prediction is relatively easy, while causal inference is hard.
Incorporating course fit (or not?)
We have argued in the past that there is no statistically responsible way to incorporate course history into a predictive model. But, after watching the Americans get slaughtered at the 2018 Ryder Cup at Le Golf National in France, we came away thinking that course fit was something we had to try to incorporate into our models. Spoiler alert: we tried two approaches, and failed. In the first approach, we tried to correlate a golfer's historical strokes-gained performance in the different categories (OTT, APP, etc.) with that golfer's performance at a specific course. The logic here is that perhaps certain courses favour players who are good drivers of the ball (i.e. good SG:OTT [8]), while other courses favour players who are better around the greens (i.e. good SG:ARG). For some courses we have a reasonable amount of data (e.g. 9 years' worth for events that have been hosted on the same course since 2010). The problem is that, even for these relatively high data courses, the results are still very noisy. It is true that if you run the regression as in (2) separately for each course in the data, you will find results that are different (to a *statistically significant* degree) from the baseline result in (2). For example, instead of SG:APP having a coefficient of 1, for some courses we will find it has a coefficient of 0.5. Should we take these estimates at face value? No, I don't think so. Statistical significance is not very meaningful at the best of times, and especially not when you are running many regressions: of course you will find some statistically significant differences if you have 30 courses in your data and 4 variables per regression. The ultimate proof is in whether this additional information improves your out-of-sample predictive performance, and in our case it did not.
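To put rough numbers on the multiple-comparisons point, assume (purely for illustration) a 5% significance level and independent tests. With 30 courses and 4 coefficients per regression there are 120 tests, so even if no course truly differed from the baseline, the expected number of spurious "significant" differences would be
$$\normalsize 120 \times 0.05 = 6$$
and the probability of finding at least one would be
$$\normalsize 1 - 0.95^{120} \approx 0.998$$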
Our second attempt at incorporating course fit involved trying to group courses together that have similar characteristics, and then essentially doing a course history exercise except using a golfer's historical performance on the group of similar courses instead of just a single course. We have done this exercise before using course groupings based off course length. This time we tried grouping courses using clustering algorithms, where the main characteristics again involved the detailed strokes-gained categories (e.g. the % of variance in total scores that was explained by each category). Ultimately I do think this is the way to go if you want to incorporate course fit: if you had detailed course data (perhaps about average fairway width, length, etc.) you could potentially make more natural groupings than we did. Unfortunately in our case, with the course variables we used, it was again mostly a noise mine. This has left us thinking that there is not an effective way to systematically incorporate course fit into our statistical models. The sample sizes are too small, and the measures of course similarity too crude, to make much headway on this problem. That's not to say that course history doesn't exist; it probably does. But to separate the signal from the noise is very hard.
Model evaluation and selection
Given the analysis and discussion so far, we can now think of having a set of models to choose from where differences between models are defined by a few parameters. These parameters are the choice of weighting scheme on the historical strokes-gained averages (this involves just a single parameter that determines the rate of exponential decay moving backwards in time), and also the weights that are used to incorporate the detailed strokes-gained categories through a reweighting method.
The optimal set of parameters is selected through brute force: we loop through all possible combinations of parameters, and for each set of parameters we evaluate the model's performance through a cross validation exercise. This is done to avoid overfitting: that is, choosing a model that fits the estimating data very well but does not generalize well to new data. The basic idea is to divide your data into a "training" set and a "testing" set. The training set is used to estimate the parameters of your model (for our model, this is basically just a set of regression coefficients [9]), and then the testing set is used to evaluate the predictions of the model. We evaluate the models using mean-squared prediction error, which in this context is defined as the difference between our predicted strokes-gained and the observed strokes-gained, squared and then averaged. Cross validation involves repeating this process several times (i.e. dividing your sample into training and testing sets) and averaging the model's performance on the testing sets. This repetitive process is again done to avoid overfitting. The model that performs the best in the cross validation exercise should (hopefully) be the one that generalizes the best to new data. That is, after all, the goal of our predictive model: to make predictions for tournament outcomes that have not occurred yet.
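The selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the folds are random (the text does not specify how the sample is split), only a single decay parameter is searched, and `build_features` is a hypothetical user-supplied function that returns the regression design matrix for a given decay value.

```python
import numpy as np

def kfold_mse(X, y, n_folds=5, seed=0):
    """Average mean-squared prediction error over the held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Inner step: estimate the regression coefficients on the training set.
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Evaluate on data the coefficients have not seen.
        errors.append(np.mean((y[test] - X[test] @ coef) ** 2))
    return float(np.mean(errors))

def select_decay(build_features, y, decay_grid):
    """Outer loop: brute-force search over the weighting-scheme parameter."""
    scores = {decay: kfold_mse(build_features(decay), y) for decay in decay_grid}
    best = min(scores, key=scores.get)
    return best, scores
```

The outer loop holds the weighting parameter fixed while the inner step re-estimates the regression coefficients on each training set, which is the distinction between the two kinds of "parameters" drawn in footnote 9.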
One thing that becomes clear when testing different parameterizations is how similarly they perform overall despite disagreeing in their predictions quite often. This is troubling if you plan to use your model to bet on golf. For example, suppose you and I both have models that perform pretty similarly overall (i.e. have similar mean-squared prediction error), but also disagree a fair bit on specific predictions. This means that both of our models would find what we perceive to be "value" in betting on some outcome against the other's model. However, in reality, there is not as much value as you think: roughly half of those discrepancies will be cases where your model is "incorrect" (because we know, overall, that the two models fit the data similarly). This is not exactly a deep insight: it simply means that to treat your model's odds as *truth* is an unrealistic best-case scenario for calculating expected profits.

The model that we select through the cross validation exercise has a weighting scheme that I would classify as "medium-term": rounds played 2-3 years ago do receive non-zero weight, but the rate of decay is fairly quick. Compared to our previous models this version responds more to a golfer's recent form. In terms of incorporating the detailed strokes-gained categories, past performance that has been driven more by ball-striking, rather than by short-game and putting, will tend to have less regression to the mean in the predictions of future performance.
To use the output of this model — our pre-tournament estimates of the mean and variance parameters that define each golfer's scoring distribution — to make live predictions as a golf tournament progresses, there are a few challenges to be addressed.

First, we need to convert our round-level scoring estimates to hole-level scoring estimates. This is accomplished using an approximation which takes as input our estimates of a golfer's round-level mean and variance and gives as output the probability of making each score type on a given hole (i.e. birdie, par, bogey, etc.).
Second, we need to take into account the course conditions for each golfer's remaining holes. For this we track the field scoring averages on each hole during the tournament, weighting recent scores more heavily so that the model can adjust quickly to changing course difficulty during the round. (Of course, there is a tradeoff here between sample size and the model's speed of adjustment.) Another important detail in a live model is allowing for uncertainty in future course conditions. This matters mostly for estimating cutline probabilities accurately, but does also matter for estimating finish probabilities. If a golfer has 10 holes remaining, we allow for the possibility that these remaining 10 holes play harder or easier than they have played so far (due to wind picking up or settling down, for example). We incorporate this uncertainty by specifying a normal distribution for each hole's future scoring average, with a mean equal to its scoring average so far, and a variance that is calibrated from historical data [10].
The third challenge is updating our estimates of player ability as the tournament progresses. This can be important for the golfers that we had very little data on pre-tournament. For example, if for a specific golfer we only have 3 rounds to make the pre-tournament prediction, then by the fourth round of the tournament we will have doubled our data on this golfer! Updating the estimate of this golfer's ability seems necessary. To do this, we have a rough model that takes 4 inputs: a player's pre-tournament prediction, the number of rounds that this prediction was based off of, their performance so far in the tournament (relative to the appropriate benchmark), and the number of holes played so far in the tournament. The predictions for golfers with a large sample size of rounds pre-tournament will not be adjusted very much: a 1 stroke per round increase in performance during the tournament translates to a 0.02-0.03 stroke increase in their ability estimate (in units of strokes per round). However, for a very low data player, the ability update could be much more substantial (1 stroke per round improvement could translate to 0.2-0.3 stroke updated ability increase).

With these adjustments made, all of the live probabilities of interest can be estimated through simulation. For this simulation, in each iteration we first draw from the course difficulty distribution to obtain the difficulty of each remaining hole, and then we draw scores from each golfer's scoring distribution taking into account the hole difficulty.
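The two-step draw can be sketched as below for the final round of a tournament. This is a heavily simplified illustration, not the live model: it uses continuous normal hole scores instead of discrete score types, a single round-level difficulty shock spread evenly across holes (with the 0.85 standard deviation from footnote 10 as the default), independent holes, and it assumes every golfer plays holes 1 to 18 in order. All names are hypothetical.

```python
import numpy as np

def simulate_live_win_probs(score_so_far, holes_left, sg_mean, sg_sd,
                            hole_avg, shock_sd=0.85, n_sims=5000, seed=0):
    """Win probabilities from the current state; the lowest total score wins."""
    rng = np.random.default_rng(seed)
    score_so_far = np.asarray(score_so_far, dtype=float)
    holes_left = np.asarray(holes_left)
    sg_mean = np.asarray(sg_mean, dtype=float)   # SG per round (higher is better)
    sg_sd = np.asarray(sg_sd, dtype=float)       # SD of SG per round
    hole_avg = np.asarray(hole_avg, dtype=float) # field scoring average by hole
    n_players, n_holes = len(score_so_far), len(hole_avg)

    # remaining[p, h] is True if golfer p has not yet played hole h.
    remaining = np.arange(n_holes)[None, :] >= (n_holes - holes_left)[:, None]

    wins = np.zeros(n_players)
    for _ in range(n_sims):
        # Step 1: draw course difficulty, shared by all golfers still playing.
        difficulty = hole_avg + rng.normal(0.0, shock_sd) / n_holes
        # Step 2: draw each golfer's hole scores given that difficulty.
        hole_scores = rng.normal(difficulty[None, :] - sg_mean[:, None] / n_holes,
                                 sg_sd[:, None] / np.sqrt(n_holes))
        totals = score_so_far + (hole_scores * remaining).sum(axis=1)
        wins[totals.argmin()] += 1
    return wins / n_sims
```

Drawing the difficulty once per iteration, before the golfers' scores, is what makes the remaining scores of different golfers move together, which matters for cutline probabilities in particular.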

1. The assumption of normality could be questioned. Round-level scores for a given golfer are approximately normal - although it is true that the left-tail (i.e. bad scores) is longer than the right tail. Normality allows things to be much easier computationally, and seems to be a good approximation.
2. Remember the goal of estimating (1) is only to recover the course difficulty parameters ($$\normalsize \delta_{j}$$). By not including course-player effects, we will be in trouble only if the course indicator variables are correlated with the course-player indicator variables (i.e. this would be like omitted variable bias). The intuition for why this would be a problem can be illustrated with an example: suppose most players who play Harbour Town have very good *fit* for that course; then, we will underestimate Harbour Town's difficulty in (1) because we are not accounting for the fact that most players who played there had good course-player matches. When could this correlation arise? If course-player effects exist, and players select which courses to play on the basis of their relative advantages, then we would have this correlation. While this could be happening, the downside to including course-player effects is you introduce a lot of noise. My intuition is that we are still getting fairly clean estimates of the true course difficulty parameters.
3. For example, to make out-of-sample predictions we could use the player's estimated ability ($$\normalsize \mu_{i}(t)$$) at the most recent date in the sample.
4. Our predictive model uses adjusted scores to generate estimates of golfers' ability level at each point in time. But, to obtain those adjusted scores to begin with, you needed estimates of player ability. The issue is that our final estimates of player ability will not exactly match up with the abilities used to adjust scores to begin with. This could be resolved by performing our entire estimation exercise in a loop, and repeating until we achieved convergence in our ability estimates. In practice, we think this is of little consequence.
5. In the models using shorter-term weighted averages the predictions will be regressed more towards the mean. However, we will also have larger differences between the players in their weighted averages (because short-term form has more variance than long-term form). This latter point is why the short-term model can (potentially) fit the data better.
6. In practice things are a little trickier. What weighting should we use to construct our historical weighted averages in each SG category? We try various weighting schemes. The regression results are qualitatively similar no matter which weighting you use (provided it is somewhat reasonable, i.e. not 100% weight on the last round played). An interesting thing to note is that it seems like the most predictive weighting scheme for OTT is a pretty short-term weighting, while for putting it's a longer-term scheme. Intuitively, this could be because OTT is a much lower variance performance measure.
7. The strokes-gained categories can be adjusted in exactly the same manner we adjusted raw scores in (1). The strokes-gained categories need to be adjusted because they are calculated by subtracting off the field average, and not all fields are of equal quality.
8. Simply looking at a golfer's historical SG:OTT may be too crude of a statistic because it combines gains due to accuracy and distance off the tee, and it seems that players often have one but not the other. Harbour Town could be a course that favours accurate drivers of the ball, but not bombers. Using historical SG:OTT would not pick this up.
9. I have used the term "parameters" twice now in this paragraph. To be clear, we are looping through a set of parameters that are held constant during the entire cross validation exercise (these are the parameters that define the weighting schemes), and then within each iteration of this loop we have to estimate that model's parameters (regression coefficients).
10. There are actually 3 types of uncertainty in course conditions we allow for. First, there is general uncertainty in course conditions that is always present, which is modelled as a normally distributed mean-zero shock with a standard deviation of 0.85 (strokes per round); second, there is uncertainty in the difference between morning and afternoon conditions on a given day, which is modelled as a $$\normalsize N(0,1)$$ shock at the start of the day, but has a variance that decays to zero as the morning wave completes their round; third, there is a shock for course conditions on any day of golf remaining to be played (i.e. on Thursday, there is a Friday, Saturday, and Sunday shock), each of which is modelled as $$\normalsize N(0,1)$$. These last shocks are usually only important for projecting Friday's cutline during Thursday's play.