Analytics Blog
May 7, 2020
Using the true strokes-gained metric in amateur golf
- An intuitive explanation and a comparison to the WAGR system
We recently expanded our database of round scores to include a comprehensive set of amateur events dating back to the fall of 2009. This includes any event that is eligible for points in the World Amateur Golf Rankings (WAGR), as well as the vast majority of college events played in the United States (most of which are also included in the WAGR). By linking the golfers in the amateur data to our existing data for the professional tours, we are able to estimate a single strokes-gained measure that can be directly compared across any tournament and tour. That is, a value of +2 for this measure for a round played in the Canadian Men's Amateur and a value of +2 for a round played in the U.S. Open will indicate performances of equal quality. Pretty cool; but why should you believe our numbers? The purpose of this blog is to provide some transparent arguments for the validity and usefulness of this adjusted, or "true", strokes-gained measure.

We'll tackle validity first. The motivating idea behind converting raw scores to adjusted scores is that we only want to compare golfers who played at the same course on the same day, in order to control for the large differences in course difficulty that exist across the many global tours and tournaments. However, the obvious issue is that most golfers never compete directly against each other. To compare scores shot in different tournaments, we make use of the fact that there is overlap in the set of golfers that compete in different tournaments. For example, if Matthew Wolff beats Justin Suh by 2 strokes per round at the NCAA Championship, and then Rory McIlroy beats Matthew Wolff by 2 strokes per round at a PGA Tour event a few weeks later, we could (maybe?) conclude that McIlroy is 4 strokes per round better than Suh, despite the fact that they never played against each other! Of course, this seems like a bad idea because we know that golfers have "good" and "bad" days; if Wolff had a good day when he played against Suh, and a bad day when he played against McIlroy, this would lead us to overestimate how much better McIlroy is than Suh. This problem is mitigated if we have not just one golfer connecting McIlroy to Suh but many: some golfers will have bad days and some will have good days, but on average it should even out. The basic logic described here allows us to fairly compare scores shot in major championships to those shot in tournaments as obscure as the Slovenian National Junior Championship. All that is required is that these tournaments can be "connected" in some way; that is, a player in the Slovenian Championship needs to be able to say, "I played against a golfer who played against a golfer... who played against a golfer in a major championship". The fewer connections a tournament has to the rest of the golfing world, the less certain we can be about the quality of the performances from that tournament; in the extreme case of having no connections, we can say nothing about how those performances compare to other tournaments. Therefore, to be included in our database, all tournaments must be connected in this sense. Luckily, competitive golf, like the real world, is a lot more connected than you might think.
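To make the connection logic concrete, here is a toy sketch (emphatically not our production model) that recovers the McIlroy-to-Suh gap from the two overlapping rounds above. Each score is modeled as player ability plus course-day difficulty, and the resulting system is solved by least squares; the scores are invented to match the 2-stroke margins in the example.

```python
import numpy as np

# Toy data: (player, course_day, score). Wolff links the two fields.
rounds = [
    ("Suh",     "NCAA_r1", 70),
    ("Wolff",   "NCAA_r1", 68),
    ("Wolff",   "PGA_r1",  70),
    ("McIlroy", "PGA_r1",  68),
]

players = sorted({p for p, _, _ in rounds})
days    = sorted({d for _, d, _ in rounds})

# Design matrix for: score = player_ability + day_difficulty.
X = np.zeros((len(rounds), len(players) + len(days)))
y = np.zeros(len(rounds))
for i, (p, d, s) in enumerate(rounds):
    X[i, players.index(p)] = 1.0
    X[i, len(players) + days.index(d)] = 1.0
    y[i] = s

# lstsq handles the rank deficiency; only ability DIFFERENCES within a
# connected network of fields are identified, which is all we need.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ability = dict(zip(players, coef[: len(players)]))

# Higher ability value = higher scores = worse golfer.
print(ability["Suh"] - ability["McIlroy"])  # ~4.0: McIlroy 4 strokes better
```

With only Wolff connecting the two fields, his good or bad day contaminates the estimate; with many connecting golfers, the day-to-day noise averages out, which is exactly the argument above.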

Let's next look at some simple data summaries that provide reassuring evidence that the true strokes-gained metric is accomplishing what we want it to. This first table compares golfers' performance in Korn Ferry Tour events in 2019 to their performance in PGA Tour events in 2019. We show 10 golfers who played a reasonable number of rounds on both tours last year:

The far right column is the one of interest: it tells us how many more strokes per round each golfer gained against KFT fields than they did against PGA Tour fields. A positive value indicates that they beat KFT fields by more on average; a negative value indicates the opposite. Why would the same golfer beat KFT fields by more than PGA Tour fields? Presumably, it's because the average player on the latter tour is better than the average player on the former. The large differences between golfers in their relative performance on the two tours simply reflect the randomness of golf performance; some golfers may happen to play really well in the PGA Tour events they enter (e.g. Lanto Griffin). The next table shows the overall numbers from this exercise for 2019 for 3 different tours:

Looking at the first row, we see that in total there were 2095 "shared" rounds — defined for each golfer as the minimum of the number of rounds they played on each tour, e.g. for Sucher in the first table it would be 38 — and that on average the same golfer gained 0.69 more strokes per round against KFT fields than they did against PGA Tour fields. The final column is our estimate, from the true strokes-gained method, of the difference in average player quality between the KFT and PGA Tour fields used in this calculation. Ideally we want these last two columns to be equal: if the average golfer in these PGA Tour events is in fact 0.72 strokes better, we should see a given set of golfers gaining approximately 0.72 fewer strokes against those fields, which we do.
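If you want to reproduce this kind of summary, a minimal pandas sketch is below. The numbers and column names are made up, and we are assuming the tour-level average is weighted by each golfer's shared rounds; the tables above don't spell that detail out.

```python
import pandas as pd

# Hypothetical per-golfer 2019 summaries: rounds played and average
# strokes-gained vs. the field on each tour (all values invented).
df = pd.DataFrame({
    "golfer":     ["A", "B", "C"],
    "kft_rounds": [40, 22, 30],
    "kft_sg":     [1.10, 0.40, 0.85],   # avg SG vs KFT fields
    "pga_rounds": [38, 50, 12],
    "pga_sg":     [0.30, -0.20, 0.40],  # avg SG vs PGA Tour fields
})

# "Shared" rounds: for each golfer, the smaller of their two round counts.
df["shared"] = df[["kft_rounds", "pga_rounds"]].min(axis=1)

# Per-golfer gap, then a shared-rounds-weighted average across golfers.
df["gap"] = df["kft_sg"] - df["pga_sg"]
weighted_gap = (df["gap"] * df["shared"]).sum() / df["shared"].sum()
print(f"extra SG per round vs KFT fields: {weighted_gap:+.2f}")
```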

The next two tables repeat the exercise using golfers who played in amateur events in 2019. In this second set of tables, we see that when amateurs play in PGA Tour events, the difference between their strokes-gained over PGA Tour fields and their strokes-gained over amateur fields is on average equal to 3.7. From this we should infer that the average golfer in these PGA Tour events is 3.7 strokes better than the average golfer in these amateur events; that is almost exactly (3.61) what our true strokes-gained method estimated the difference to be.

Adjusted or "true" strokes-gained is equal to a golfer’s strokes-gained over the field plus how many strokes better (or worse) that field is than some benchmark. Because it is trivial to calculate strokes-gained over the field, it should be clear that the validity of true strokes-gained hinges on properly estimating field strengths. In the exercise above we showed that our estimates of average player quality for different tours closely match the estimates you would obtain by simply comparing the performance of a set of golfers against one tour to the performance of that same set of golfers against another tour. This provides an intuitive way to understand differences in field strength across tours rather than having to simply place blind trust in our model being correct.

What is this benchmark I mentioned above? Strokes-gained is a relative measure, meaning that it is only useful insofar as we have something to compare it to. Just as it’s not very useful to tell me the raw score of a golfer (e.g. 66) without providing the field’s average score that day, it’s also not useful to tell me that a golfer gained 4 strokes without telling me who they "gained strokes over". We typically set the benchmark to be the average field on the PGA Tour in a given season. Therefore, if, in the year 2019, a golfer has an adjusted strokes-gained of +2 in a given round, this means that their performance was estimated to be two strokes better than what the average PGA Tour field in 2019 would be expected to shoot on that course. An adjusted strokes-gained of +2 could be achieved by beating a field that is on average 1 shot better than the average PGA Tour field (e.g. the TOUR Championship) by 1 stroke, or by beating a field that is on average 1 stroke worse than the average PGA Tour field (e.g. a typical Korn Ferry Tour event) by 3 strokes.
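In code, this definition is a one-liner. Here are the two hypothetical routes to +2 described above (the scores are invented; field strength is expressed relative to the average PGA Tour field):

```python
def true_sg(score, field_avg_score, field_strength):
    """Strokes-gained vs. the field, plus that field's strength
    relative to the benchmark (the average PGA Tour field)."""
    sg_vs_field = field_avg_score - score   # lower score = strokes gained
    return sg_vs_field + field_strength

# Beating a field 1 stroke BETTER than the benchmark by 1 stroke:
print(true_sg(score=70, field_avg_score=71, field_strength=+1.0))  # 2.0

# Beating a field 1 stroke WORSE than the benchmark by 3 strokes:
print(true_sg(score=68, field_avg_score=71, field_strength=-1.0))  # 2.0
```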

Hopefully this has helped clarify how the true strokes-gained measure is derived and also why it is valid to directly compare the true strokes-gained numbers of golfers who competed in different tournaments. Next, we are going to show why the true strokes-gained metric is useful.

The World Amateur Golf Rankings are administered by the USGA and R&A with the stated aim of providing a comprehensive and accurate ranking of amateur golfers. We are going to compare our amateur ranking system, which is based solely on the true strokes-gained metric, to the WAGR system. More precisely, our amateur rankings are determined by taking a weighted average of each golfer's historical true strokes-gained, with more recent rounds receiving more weight. For golfers with many rounds played, approximately their most recent 150 rounds contribute to their ranking. The WAGR operates in a similar fashion to the Official World Golf Ranking (OWGR), with rounds from the last 104 weeks contributing to the rankings. The main differences between our rankings and the WAGR are that we only include stroke play rounds while the WAGR includes both stroke and match play, and that the WAGR rewards top finishers disproportionately — as all official ranking systems do, and should — while our rankings only depend on golfers' scores and not their finish position.
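As a rough sketch of the weighting idea, consider the snippet below. The geometric decay and its rate are our stand-ins: the description above only says that recent rounds get more weight and that roughly the most recent 150 rounds contribute.

```python
import numpy as np

def recency_weighted_sg(true_sg_history, decay=0.97):
    """Recency-weighted average of per-round true strokes-gained.

    true_sg_history is ordered oldest to newest. The geometric decay
    is an assumed stand-in; with decay=0.97 a round 150 back carries
    about 1% of the weight of the most recent round.
    """
    sg = np.asarray(true_sg_history, dtype=float)
    weights = decay ** np.arange(len(sg) - 1, -1, -1)  # newest gets 1.0
    return np.average(sg, weights=weights)

# An improving golfer rates above their raw career mean (0.5 here).
history = np.linspace(-0.5, 1.5, 120)   # 120 rounds, trending upward
print(round(recency_weighted_sg(history), 2))
```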

The exercise we will do is the following: first, we form our rankings for every week since September 2011 using the same time frame as WAGR; that is, for a given week all events that have been played on or before the most recent Sunday are included in the ranking calculation. (WAGR releases rankings on Wednesday, so events that are played on the Monday and Tuesday of that week only count in the following week's rankings.) Second, we use each of these ranking systems to predict the performance of golfers who are playing in the following week. To be clear, the events we are predicting always occurred after the date on which the rankings were formed.
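For concreteness, the cutoff logic can be written as a small date helper. The function name is ours; the rule, include everything played through the most recent Sunday, is the one described above.

```python
from datetime import date, timedelta

def ranking_cutoff(ranking_date: date) -> date:
    """Most recent Sunday on or before the given ranking date."""
    # Monday=0 ... Sunday=6, so this many days back to the last Sunday:
    return ranking_date - timedelta(days=(ranking_date.weekday() + 1) % 7)

# A Wednesday release only sees events through the previous Sunday:
print(ranking_cutoff(date(2020, 3, 11)))  # 2020-03-08
```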

Our measure of performance is strokes-gained relative-to-the-field — this is what we are trying to predict. An example will help clarify exactly how this prediction exercise works. In the first round of the 2020 Junior Invitational at Sage Valley, which took place on March 12-14, there were 37 golfers that could be matched to both the WAGR and our rankings on March 11, 2020. The field comprised 52 golfers, meaning that 15 golfers weren't matched, either because they hadn't played a round in the last year (a requirement to be in our rankings) or because they weren't in the top 1000 of the WAGR. For each of these 37 golfers, we subtract the field average from each of our 3 variables of interest: their round score, their Data Golf (DG) ranking, and their WAGR ranking. Then, we correlate their relative-to-field ranking with their score relative to the field. For example, heading into this tournament Conor Gough was ranked 27th in the WAGR and 279th in the DG rankings; Gough ultimately shot 72 in the first round. The average WAGR ranking of the 37 golfers playing in this first round was 372, and their average DG ranking was 271; their average score that day was 73.3. Therefore, Gough was ranked well above average in this field by WAGR (+345), was ranked slightly below average by DG (-8), and he ended up beating the field by 1.3 strokes (+1.3 strokes-gained). In this single instance, then, the WAGR rankings could be said to have been more predictive of Gough's performance that day.
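Here is the same arithmetic in a few lines, using the numbers from the Gough example (rank deltas are signed so that positive means better than the field average):

```python
# Field averages for the 37 matched golfers, and Gough's own numbers.
field_avg_wagr, field_avg_dg, field_avg_score = 372, 271, 73.3
gough_wagr, gough_dg, gough_score = 27, 279, 72

rel_wagr = field_avg_wagr - gough_wagr    # +345: far better than average
rel_dg   = field_avg_dg - gough_dg        # -8: slightly worse than average
sg       = field_avg_score - gough_score  # +1.3 strokes-gained

print(rel_wagr, rel_dg, round(sg, 1))     # 345 -8 1.3
```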

We do this exercise using every tournament-round in our database where at least 2 golfers can be matched to both the WAGR and our rankings. This results in 306,431 data points to predict. Finally, we run 3 regressions (simply think of a regression as a way of predicting some variable Y with some other variable X): 1) a regression of a golfer's strokes-gained on their DG ranking; 2) a regression of their strokes-gained on their WAGR ranking; and 3) a regression of strokes-gained on both their DG and WAGR rankings. The summary measure we will use from these regressions is the percentage of strokes-gained that can be explained by each ranking system (e.g. if the DG rankings indicate golfer A is ranked higher than golfer B, and golfer A goes on to beat golfer B, we would say the DG rankings "explained" this result to some degree). In statistical parlance this is known as the "R-squared" of a regression.
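To make the mechanics concrete, here is a sketch of the three regressions on simulated data. Everything below is invented purely to produce two noisy rankings of one underlying skill; none of it is drawn from our actual dataset.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Latent skill, two noisy rank orderings of it, and round-level
# strokes-gained in which (as in real golf) noise dominates skill.
n = 5000
skill = rng.normal(0, 1, n)
dg_rank   = (-skill + rng.normal(0, 0.5, n)).argsort().argsort() + 1
wagr_rank = (-skill + rng.normal(0, 1.0, n)).argsort().argsort() + 1
sg = skill + rng.normal(0, 2.8, n)

# One regression per ranking system, plus one using both together.
for name, X in [("DG", dg_rank), ("WAGR", wagr_rank),
                ("both", np.column_stack([dg_rank, wagr_rank]))]:
    r2 = sm.OLS(sg, sm.add_constant(X)).fit().rsquared
    print(f"{name:>4}: R^2 = {r2:.3f}")
```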

Finally, let's see some results. When using the previous week's ranking to predict performance in the current week, the WAGR can explain 4.93% of golfers' strokes-gained; the DG rankings can explain 6.67%; and using the rankings together they can explain 6.72%. (The low values of these percentages reflect the fact that golf performance on any given day is mostly not explainable by past performance.) If we perform the same exercise described so far, except using the rankings from 2 weeks ago, or 3 weeks ago, etc., we get the following results:

As you would expect, the predictive power of both ranking systems declines as we predict further into the future. At every time horizon, the DG rankings vastly outperform the WAGR in terms of predicting performance. The "both" line represents the quality of predictions that can be achieved by leveraging information from both ranking systems; for short-term predictions, WAGR does add a bit of predictive value to the DG rankings, but for longer-term predictions this help disappears. The small added benefit of WAGR likely comes from the value of incorporating match play performances into amateur rankings, and from the fact that there were some events where we could not apply our scoring adjustment method and which therefore do not contribute to the DG rankings. The fact that using both DG and WAGR together adds very little value on top of DG alone indicates that if you know a golfer's DG ranking, additionally knowing their WAGR rank provides only a small amount of new information about the quality of the golfer.

These results should not be surprising; we constructed our rankings to predict performance as well as possible. The WAGR, by contrast, also has to consider notions of fairness and deservedness: if, in the U.S. Amateur, a golfer plays a mediocre first two rounds but qualifies for the match play portion and ultimately wins the tournament, they will receive a huge boost to their WAGR position. However, their standing in our rankings may hardly rise, or may even decline depending on their previous skill level, because we only take into account the stroke play portion of the U.S. Amateur. Even though match play rounds have little predictive power (not zero, but not a lot) for a golfer's future performance, it is still likely the case that a ranking system should reward top performances in match play events. More generally, the WAGR rewards wins and top finishes disproportionately — the difference in WAGR points between finishing 1st and 2nd is larger than the difference between 10th and 11th. Again, this is not ideal for maximizing the predictive power of your rankings, but in our opinion it is how an official ranking system should function. All of this is just to say that the main reason the DG rankings outperform the WAGR in predicting future performance is that our rankings were constructed with that as the sole objective, while the WAGR has other factors to consider. Finally, for inquiring minds, the correlation between the DG and WAGR rankings is 0.76 — meaning that the two ranking systems generally agree.

To conclude, let's look at one dimension of the WAGR where an undesirable bias could exist: whether tournaments played in specific geographic regions of the world receive too many points on average. This is a logical dimension to look at because there is less overlap between the sets of golfers competing at tournaments in distinct parts of the globe, making it more difficult for the WAGR point allocation to accurately reflect field strength differences. Unfortunately, we don't have exact data on where tournaments were played, so as a proxy we use golfers' nationalities; below we plot golfers' WAGR rank against their DG rank for a few different nationalities:

These plots use end-of-year WAGR and DG rankings for the years 2011, 2013, 2015, 2017, and 2019. (Even years are excluded so we don't double count, as the WAGR operates on a 2-year cycle.) Each data point represents a golfer in one of those years. Plots with a majority of data points above the 45 degree line (i.e. DG = WAGR) indicate that golfers from that country are, on average, ranked worse in the WAGR than in the DG rankings; plots with a greater number of points below the 45 degree line indicate the opposite — that the WAGR ranks them more favourably than our rankings do. The United States, Japan, and Korea all fall into the former category, while Continental Europe, Australia, New Zealand, and South Africa fall into the latter. Golfers from the UK and Ireland are ranked similarly on average by the two ranking systems. Given that golfers don't just compete in their home countries — the most obvious example of this being the US college golf scene — our guess is that these plots underestimate the bias present in the tournaments that actually take place in the listed countries. That is, even though most European amateurs play some of their events in the United States, the fact that they play a higher-than-typical fraction in Europe is enough to introduce the bias we see in their plot. Our finding that golfers from Japan and Korea are treated unfavourably by the WAGR is surprising, as we have found the opposite to be true in the Official World Golf Rankings. The finding that players in the United States are ranked worse in the WAGR than in our rankings is consistent with what we have found when analyzing the OWGR, and also with work that Mark Broadie has done on bias in the OWGR.
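For anyone reconstructing these scatter plots, here is a minimal matplotlib sketch of the convention with invented points. DG rank is on the x-axis and WAGR rank on the y-axis, so points above the dashed 45 degree line are golfers the WAGR ranks worse than we do.

```python
import matplotlib.pyplot as plt

# Invented (dg_rank, wagr_rank) pairs for one nationality.
dg   = [120, 340, 380, 160, 610]
wagr = [50, 200, 400, 75, 600]

fig, ax = plt.subplots()
ax.scatter(dg, wagr)
lim = 1.05 * max(max(dg), max(wagr))
ax.plot([0, lim], [0, lim], linestyle="--")  # 45 degree line: DG = WAGR
ax.set_xlabel("DG rank")
ax.set_ylabel("WAGR rank")
ax.set_title("Hypothetical nationality")
plt.show()
```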

There are many other dimensions that could be explored to better understand any inconsistencies and biases in the WAGR system. With our strokes-gained model, it is possible to drill down to the event level and examine whether too many or too few points are being allocated to a given event. The basic idea is that we are able to estimate the expected WAGR points for a golfer of a given skill level — e.g. the average NCAA D1 golfer — at each event; the higher the expected number of WAGR points for this golfer, the more favourable is the WAGR point allocation at that event. That will have to be explored another day, however, as I am tired and this blog is already too long.