Wednesday, August 22, 2012

Opening Pandora's Box: Do we need a new scoring system?

Until now, I haven't touched on what seems to be the most controversial topic when the Games roll around each year: the scoring system. I have been mainly focused on evaluating what happened this year and predicting results based on the system we have. But there is no doubt that the scoring system in place, which is based entirely on rank, has its flaws. The question is this: can we devise a system that is truly better?

Update: Before I get any further, I'd like to mention that the 2012 Open data I am using is from Jeff King, downloaded from http://media.jsza.com/CFOpen2012-wk5.zip. Thanks a ton to Jeff for gathering all the data. Much of this analysis would not be possible without it.

First, let me lay out four key flaws I see in the points-per-place system:
1) The results are heavily dependent on who you include in the field. Take the Games competitors, for example, and rank them based on their Open performance. You get very different results if you score each event based on the athletes' rank among the entire Open field than if you score each event based on the athletes' rank among only the Games competitors. Neal Maddox would move from 5th using the entire field to 2nd using only Games competitors, Rob Forte would move from 15th to 25th and Marcus Hendren would move from 34th to 16th. This is a problem.
2) There is no reward for truly outstanding performances. In the Open, Scott Panchik did 161 burpees in 7 minutes. The next closest men's Games competitor was Rich Froning at 141. In a field with only Games competitors, Panchik would only gain ONE point on Rich. He was not rewarded at all for any burpees beyond 142 (even with the entire Open field included, he only beat Froning by 33 spots, a relatively slim margin among 30,000+ competitors).
3) Along the same lines, tiny increments can be worth massive points if the field is bunched in one spot or another. If I had performed one more burpee (I did 104), for instance, I would have gained 857 spots worldwide. The difference between 70 and 71 burpees (a larger proportional increase in work output) was worth only 327 spots. And the gap between 141 and 161 was only 33 spots.
4) Other athletes can have a huge impact on the outcome between two competitors. Why should the differential between Rich Froning and Graham Holmberg come down to how many other competitors finished between their scores on a certain event? If those other competitors hadn't even been competing, it wouldn't change how Rich and Graham compared to each other.

Now, while I don't agree with everything Tony Budding says, I think he brought up a good point when he defended the scoring system on the CrossFit Games Update show earlier this year. Regardless of whether it has some mathematical imperfections, the fact of the matter is that the points-per-place scoring system is very easy to understand and very easy to implement. Watching the Olympic Decathlon, which has been refining its scoring system for years, reminded me why the points-per-place system isn't so bad. Unless you have a scoring table and a calculator handy, the Decathlon scores seem awfully mysterious. So if we're going to come up with a scoring system to replace the points-per-place system, I believe it has to be easy for viewers and athletes to understand.

That being said, we can learn from what the Decathlon has done. The idea behind the Decathlon scoring system is to weight each of the events equally, so that performances of equal skill level in each event yield similar point totals. Beyond that, the same scoring system should be applicable to athletes ranging from beginners to elite competitors. Additionally, the scoring system for all events is at least slightly "progressive" - this means that as performances get closer and closer to world-record levels, each increment of performance becomes more and more valuable. For instance, the difference in score between an 11-second and a 12-second 100 meters is wider than the difference between a 12-second and a 13-second 100 meters.

Each event is scored based on a formula, taking one of two forms:

Points for running events = a * (b - Time)^c
Points for throws/jumps = a * (Distance - b)^c

For each, the value of b represents a beginner-level result (for instance, 18.00 seconds in the men's 100 meters), and c is greater than 1, which is the reason the scores are progressive. Certain events are more progressive than others; generally, the running events are more progressive than the throws. Here is a chart showing the point value of times in the 100 meters.
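For the curious, the calculation itself is small. The sketch below uses the commonly cited scoring-table coefficients for the men's 100 meters (a = 25.4347, b = 18.00, c = 1.81); note how each one-second improvement is worth more points than the last:

```python
import math

def track_points(time_s, a=25.4347, b=18.00, c=1.81):
    """Decathlon points for a running event: a * (b - time)^c.
    Defaults are the commonly cited men's 100m coefficients."""
    return math.floor(a * (b - time_s) ** c)

# Progressive scoring: the 12-to-11 second jump is worth more
# than the 13-to-12 second jump.
print(track_points(13.0))  # 468
print(track_points(12.0))  # 651
print(track_points(11.0))  # 861
```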


The Decathlon scoring system, for all its complexity, generally does a good job distributing points among the 10 events. It also rewards exceptional performances much more than the points-per-place system we use in CrossFit. However, there is simply no way to create such a system for CrossFit, even if we were fine with the complexity. Why? Because the events are unknown, and they usually have never been performed before in competition, which means calibrating the formulas would have to be done on the fly. There was no objective measure of what a "good" performance was on the Track Triplet before it occurred at this year's Games, and there certainly was no way to say what an equivalent performance was on the Medball Clean-HSPU workout, for example.

Of course, it's easy to pick apart other scoring methods, but the key question here is whether we can come up with anything better. In thinking about this post, I initially considered three types of systems: 1) a logarithm system, in which all performances are converted to logarithms, which gives us an indication of scores relative to one another; 2) a percentage of work system, where the top finisher is awarded a score of 100% and all others are scored based on their performance relative to that performance; and 3) a standard deviation system, where each finisher's score is based on how far from the average score they fell.

As we move away from a points-per-place system, there is one key point that needs to be addressed. Since we are now considering differences in performance rather than just rank, we must think about how much a repetition of one movement is worth compared to a repetition of another movement. Think of Open WOD 4: one muscle-up is far more difficult than one double-under. If we count each movement equally, an athlete who completes the wall-balls and double-unders plus ten muscle-ups scores 250, which is only 4.2% higher than an athlete who completes everything but the muscle-ups (240). Clearly, this does not accurately reflect the difference in performance, and the movements need to be weighted accordingly. I think it would not be too difficult for those designing the workout to make the points-per-rep system clear when the workout is announced. For example, HQ could simply say that each segment of the workout is weighted equally: completing 150 wall-balls is worth one point, completing 90 double-unders is worth one point and completing 30 muscle-ups is worth one point (10 muscle-ups is then worth 0.33 points). That's still a little light on the muscle-ups, in my opinion, but it is a simple solution for now, and it works well for most workouts (I'll use it for most events in my comparisons throughout this post). HQ could come up with whatever weightings they feel are appropriate. Sure, they would be somewhat arbitrary, but the workouts themselves are also arbitrary; if HQ lays out the rules, people will play by them.
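As a sketch, the equal-segment weighting for Open WOD 4 could be implemented like this (segment totals as prescribed above):

```python
# Equal-segment weighting for Open WOD 4 (150 wall-balls, 90 double-unders,
# 30 muscle-ups): each full segment is worth 1.00 point.
SEGMENTS = [("wall-ball", 150), ("double-under", 90), ("muscle-up", 30)]

def weighted_score(reps_by_segment):
    """reps_by_segment: reps completed in each segment, in workout order."""
    return sum(reps / total for reps, (_, total) in zip(reps_by_segment, SEGMENTS))

full_du = weighted_score([150, 90, 0])   # all wall-balls and double-unders
ten_mu  = weighted_score([150, 90, 10])  # plus ten muscle-ups
print(round(full_du, 2), round(ten_mu, 2))  # 2.0 2.33
print(f"{ten_mu / full_du - 1:.1%}")        # 16.7% better, vs. 4.2% on raw reps
```

Under this weighting, the ten muscle-ups are worth 16.7% rather than 4.2%, a much more sensible gap.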

Now, let me first discuss the logarithm system, which is definitely the most unusual of the three. The key point about logarithms, in this context, is that the difference in two athletes' scores is based only on the ratio of their performances. For example, let's say we had three athletes: athlete A completed 40 burpees, athlete B completed 80 and athlete C completed 160. The logarithm scoring system (we'll use a natural logarithm, although the base is irrelevant) would give athlete A a score of 3.689, athlete B a score of 4.382 and athlete C a score of 5.075. The difference between athletes A and B is 0.693, which is exactly the same as the difference between athletes B and C. By using this system, we can compare a 20-minute event exactly as we'd compare a 5-minute event: it's only the ratio between athletes that is important. The scores are also completely independent of who is in the field.
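A quick sketch of the calculation (numbers match the three-athlete example above):

```python
import math

def log_score(reps):
    # Natural log; the base only shifts/scales all scores uniformly.
    return math.log(reps)

a, b, c = log_score(40), log_score(80), log_score(160)
print(round(a, 3), round(b, 3), round(c, 3))  # 3.689 4.382 5.075
# Equal ratios give equal gaps: each doubling is worth exactly ln(2).
print(round(b - a, 3), round(c - b, 3))       # 0.693 0.693
```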

However, the logarithm system has a couple of significant drawbacks. First, it is certainly not easy to interpret, and most non-math majors might have a tough time recalling what a logarithm even is. But more importantly, this system does not reward the outstanding scores whatsoever. It actually does the reverse of the Decathlon's progressive system: as scores get better, you need a wider and wider gap in performance to gain the same point value. Scott Panchik's 161 burpees would give him the same point advantage over Rich Froning's 141 as an athlete doing 40 burpees would gain on an athlete doing 35.

So as mathematically pleasing as it is, let's drop the logarithm from the discussion. Let's move on to the percentage-of-work method. This method is straightforward: to score an athlete, we take the ratio of their result to the top score in competition (personally, I'd keep the genders separate). My score of 104 burpees would be translated to a score of 64.6% (104/161). Using this system, here are the top 5 men's Open results among competitors who reached the Games:


Note: In calculating the scores for each workout, I assumed each portion of workouts 3 and 4 were weighted equally (for WOD 3, 15 box jumps = 12 push press = 9 toes-to-bar = 1.00 points each). For workout 2, I weighted each rep by the weight used. The first 30 reps were worth 75 points each, then 135 each for the next 30, and so on. Workouts 1 and 5 were scored with all reps counting equally.

Keep in mind that for workouts with a set workload performed for time, we need to convert the times to work-per-unit of time. For instance, doing Fran in 4:00 could be converted to 90 reps/240 seconds = 0.375. A 5:00 Fran would be 0.300, which would be 80% of the work (well, technically power, not work) of the 4:00 Fran. If we have an event where all athletes might not finish within a time cap, we need to be careful to weight the reps appropriately (as described above). For instance, if Open WOD 4 had been prescribed as 150 wall-balls, 90 double-unders, 30 muscle-ups for time (12:00 time cap), we use our weights to accurately score all those athletes who did not finish in 12:00.
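To make the conversion concrete, here is a minimal sketch (the Fran rep count and times are the examples from above):

```python
def pct_of_work(result, top_result):
    """Score as a fraction of the event winner's result (higher = better)."""
    return result / top_result

def timed_to_power(total_reps, time_seconds):
    """Convert a for-time result to reps per second so it can be compared."""
    return total_reps / time_seconds

# My 104 burpees against Panchik's 161:
print(f"{pct_of_work(104, 161):.1%}")  # 64.6%

# A 5:00 Fran vs. a 4:00 Fran (90 weighted reps each):
fast, slow = timed_to_power(90, 240), timed_to_power(90, 300)
print(fast, slow)                        # 0.375 0.3
print(f"{pct_of_work(slow, fast):.0%}")  # 80%
```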

This method solves many of the issues we had with the points-per-place system. The only part of the scoring system that is dependent on the rest of the field is the winner, and most fields of competitors will have a winning score in the same ballpark for a given workout. If you were to restrict the field to only Games competitors, the results would be identical (so long as the event winner is among them). Outstanding performances are indeed rewarded, like Scott Panchik's 161 burpees (a 12% spread over the next-highest Games athlete). Bunching in one spot is not an issue, because athletes are scored based on performance only, not rank. Similarly, other competitors finishing between two athletes has no bearing on the relative scores of those two athletes.

However, there is one major concern about the percentage-of-work system. This method assumes that the athletes' scores will be distributed between 0 and the top score in a similar fashion for each workout, when in reality, some events are naturally going to have a tighter pack. Consider the sprint workout at the Games: the last-place competitor on the men's side would have received a score of 77%. On the medball clean-HSPU workout, the last-place competitor would have scored just 30%. Essentially, the sprint workout becomes much less meaningful than the medball clean-HSPU workout because there is much less opportunity for the winners to gain ground. This is easy to see when we compare the distributions of the two workouts graphically.




There are a couple of options to remedy this. One option is to modify the percentage-of-work system so that we see where an athlete's percentage of work falls between the lowest and highest scores in the field. Using this method, the 30% on the medball clean-HSPU workout and the 77% on the sprint both receive a score of 0%. A 65% on the medball clean-HSPU would score 50%, as would an 89% on the sprint workout. The problem with this solution is that one outlier performance can skew the low end. In the Open, using the entire field, the lowest score was exactly 1 rep on every workout. Even among Games competitors, there may be one athlete who either is injured or simply struggles mightily with a particular movement, and that can drag the low end down unfairly.
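This modified system is just a min-max rescaling. A quick sketch (the 5% low score in the last line is a hypothetical outlier to show the skew problem):

```python
def rescaled(pct, low, high):
    """Place a percentage-of-work score within the field's observed range."""
    return (pct - low) / (high - low)

# Medball clean-HSPU: field spanned 30%..100% of the winner's work rate.
print(round(rescaled(0.65, 0.30, 1.00), 2))   # 0.5
# Sprint: field spanned 77%..100%, so 88.5% also lands mid-pack.
print(round(rescaled(0.885, 0.77, 1.00), 2))  # 0.5
# One hypothetical outlier at 5% stretches the low end and inflates
# everyone else's rescaled score:
print(round(rescaled(0.65, 0.05, 1.00), 2))   # 0.63
```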

The second option is to use the standard deviation system. This system looks at how far an athlete was from the average score in a given workout, taking into account how spread out the scores are. To calculate an athlete's score, we use the following formula:

Score = (Athlete's Result - Average Result) / Standard Deviation

For those unfamiliar with a standard deviation, it basically gives an indication of how far in either direction most athletes were from the average. If a distribution is normal (which most of these workouts tend to be), then in general, about 2/3 of the scores will fall within 1 standard deviation of the average. About 95% will fall within 2 standard deviations of the average. A related concept, called the coefficient of variation, tells us how large the standard deviation is compared to the mean (which basically indicates whether we had a tight pack or a more spread out field). The coefficient of variation for the sprint was 4.9%, but on the medball clean-HSPU event it was 28.8%.

On the sprint event, the average result was 6.59 meters/second (45.53 seconds). The winning speed was 7.14 meters/second (42.00 seconds). The standard deviation was 0.32, so the winning time would receive a score of (7.14 - 6.59) / 0.32 = 1.73. The worst time (5.51 meters/second, or 54.40 seconds) would receive a score of -3.36, giving us a total spread of 5.09. On the medball clean-HSPU event, the winning speed was 0.96 stations/minute (finished in 6:15.8). The standard deviation was 0.18, so the winning time would receive a score of 1.98. The worst time (0.29 stations/minute, or 10:00 plus 25 reps remaining) would receive a score of -1.83, giving us a total spread of 3.81, which is actually considerably less than the spread on the sprint workout. The reason is that the 54.40-second sprint time was well outside the normal range, and it was punished accordingly.
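The formula is a one-liner in code. Since the means and standard deviations quoted above are rounded, this sketch recovers the article's scores only approximately:

```python
def z_score(result, mean, std):
    """Standard-deviation ("spread") score: distance from average in SDs."""
    return (result - mean) / std

# Sprint (meters/second), using the rounded figures quoted above:
print(round(z_score(7.14, 6.59, 0.32), 2))  # ~1.72 (winner)
print(round(z_score(5.51, 6.59, 0.32), 2))  # ~-3.38 (last place)
```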

Update: Using the standard deviation system (with all Games competitors included in calculating mean and standard deviation), here are the top 5 men's Open results among competitors who reached the Games:



Mathematically, the biggest drawback to this system is that it is somewhat dependent on the field. On Open WOD 1, the overall average score was 95.4 (among men under 55 years old) with a standard deviation of 17.2. If we limit that to only Games competitors, the average is 123.8 and the standard deviation is only 8.8. This makes an outlier performance like Panchik's 161 burpees more valuable when we only look at Games competitors than if we look at the whole field. Still, each competitor moved an average of just 1.3 spots in either direction when we switched the field from all Open competitors to Games competitors only. Using the points-per-place system, each competitor moved an average of 3.5 spots.
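To see the field-dependence concretely, here is Panchik's 161 scored against both fields, using the means and standard deviations quoted above:

```python
def z_score(result, mean, std):
    return (result - mean) / std

# Panchik's 161 burpees on Open WOD 1, against two different fields:
whole_field = z_score(161, 95.4, 17.2)   # all men under 55
games_only  = z_score(161, 123.8, 8.8)   # Games competitors only
print(round(whole_field, 2), round(games_only, 2))  # 3.81 4.23
```

The same performance is worth more standard deviations in the tighter, smaller field, which is exactly the dependence described above.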

My feeling is that, despite this drawback, the standard deviation system is the optimal solution. I understand that the term "standard deviation" may sound foreign to many athletes and fans, but it is a relatively simple and intuitive mathematical concept. And we can easily change the name to something less intimidating, perhaps the "spread factor" or simply the "spread." If the weighting of each movement is clearly defined beforehand, the calculations for the scores of each workout should not be overly difficult, and the results should be fairly easy to understand. Certainly it would be far more transparent than the Decathlon system, while providing a similar level of fairness. There is also the convenient property that a total score of 0.00 is exactly average.

Imagine competing in the Open with this system. Once you have completed your workout, assuming there have been at least a few thousand entries so far, you already have a reasonably good idea of your score. The average and the standard deviation will not change much over the course of the next couple days. You won't need to worry about a logjam at one particular score unduly influencing your own result. The effects of attrition (fewer people completing the workout each week) should be basically negated, since we are not scoring based on points.

In my view, this is a much more equitable Open. It also makes for a more equitable Regional and Games competition. Does that mean HQ will veer from their hard-line stance on the points-per-place system? I have my doubts. But hopefully this provides some insight into why this is a discussion worth having.

Sunday, August 5, 2012

Were the Games Well-Programmed? (Part 2)

In this post, I'd like to look at the 2012 CrossFit Games season as a whole. In response to the question "Were the Games Well-Programmed?", it's going to be difficult for anyone to give an absolute "yes" or "no." Still, I think we can certainly look back and see aspects that were done well and other areas where I believe HQ could improve.

In my last post, I gave a generally positive review of the programming in the CrossFit Games finals. But the Games cannot simply be viewed alone, because the athletes competing were only there because of their performances in the Open and the Regional. To be sure, athletes could not have any glaring weaknesses, or else they would not have made the Games at all. But let's look at the programming across all three levels of competition and see where HQ put the most emphasis.

The following table shows every movement that was used in competition this season. As you can see, more than 30 distinct movements were tested, and very few, if any, CrossFit staples were left out. However, the extent to which the movements were tested varied widely. In adding up the total value assigned to each movement, I assumed that each workout was worth a total of 1.00 (Games workouts scored on a 50-point scale were worth only .50). Within each workout, I assumed that each "station" in the workout was worth equal value, so the box jumps in Open WOD 3 were each worth 0.33 points, whereas the burpees in Open WOD 1 were worth 1.00 points*.
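The tallying itself is mechanical. A minimal sketch, with just two workouts filled in as examples (the full table would list every event across the Open, Regionals and Games):

```python
from collections import defaultdict

# Each workout is worth 1.00 (0.50 for Games events scored out of 50),
# split equally among its stations.
workouts = {
    # workout: (total value, movement at each station)
    "Open WOD 1": (1.00, ["burpee"]),
    "Open WOD 3": (1.00, ["box jump", "push press", "toes-to-bar"]),
}

value = defaultdict(float)
for total, stations in workouts.values():
    for movement in stations:
        value[movement] += total / len(stations)

print(round(value["burpee"], 2))    # 1.0
print(round(value["box jump"], 2))  # 0.33
```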



What is clear from this is that HQ puts a large value on the Olympic lifts. The clean and snatch were worth a total of 5.35 events on their own! Add in shoulder-to-overhead (0.67) and that's more than 6 events worth of points based on the Olympic lifts. Although I am a big fan of the Olympic lifts myself, I do think the snatch in particular was over-valued. It was worth nearly 14% of all the available points, including 20% of the Open and 17% of the Regional. The pull-up, a CrossFit staple for years, accounted for 40% of the value of the snatch (maybe slightly more if you considered the pull-up-like elements of the obstacle course). 

However, in total, the lifting bias was not as great as some people believe. In total, purely bodyweight movements (excluding running, but including the obstacle course and double-unders) accounted for 45% of all available points; barbell or dumbbell-based movements accounted for about 38%; running or rowing accounted for 6%; all others (including medball lifts) accounted for 14%. I think there was good balance here, with the exception of the running and rowing. 

I think the lack of running in the Open and Regionals showed in the Games. For both men and women, neither of the run-focused events (shuttle sprint and Pendleton 2) were highly correlated with success across all other events in the season. In fact, the sprint had basically 0 correlation with success in all other events for the men. For comparison, two charts are below: one shows the weak correlation between men's shuttle sprint and all other events, and one shows the strong correlation between women's Open WOD 3 and all other events (the concept of correlation with other events is detailed in my post "Are certain events 'better' than others?").



In other words, the shuttle sprint was sort of a crapshoot, because its top finishers didn't necessarily do well in the other events, whereas Open WOD 3 was dominated by athletes who did well across the board. My feeling is that because running was not tested earlier, we may have omitted some athletes who would have done better on the running events at the Games.

Let's look a bit more into the qualification structure on the road to the Games. The Open, Regionals and Games should all be testing similar things, and in my mind, there are two over-arching goals when programming and carrying out the Open and Regional rounds: 1) In the Open, find the athletes with the best shot of reaching the Games, and 2) at the Regionals, find the athletes with the best shot of winning the Games. Put another way: 1) The Open should not eliminate any athletes who would have had a legitimate shot at reaching the Games if they had competed at Regionals, and 2) The Regionals should not eliminate any athletes who would have had a legitimate shot at winning the Games if they had qualified. It is certainly possible to disagree with that sentiment, but my feeling is that we want to pick the best athletes for the Games. We do not want to send athletes to the Games who will not do well there.

So, let's take a look to see if those goals were accomplished. It is impossible to say for sure how the eliminated athletes would have done, but there are ways to get a good sense. First, let's look at the lowest Open finishers to make the Games. On the men's side, Patrick Burke took 35th in his region (Southwest) and Brian Quinlan took 27th (Mid-Atlantic). For the women, Caroline Fryklund took 25th (Europe) and Shana Alverson took 22nd (South East). Given that no one below 35th (and hardly anyone below 20th) wound up reaching the Games, I highly doubt any athletes placing below 60th in the Open would have reached the Games. In this respect, I think the Open did its job. That being said, I think that with the size of the competition pool increasing so rapidly, expanding the Regionals beyond 60 athletes (possibly to 100?) might make sense, although logistically this might be challenging.

At the Regional level, it was well-documented on the Games site just how challenging it was for even the elite athletes to qualify for the Games. Notable former Games athletes like Blair Morrison (5th in 2011) and Zach Forrest (12th in 2011) were unable to qualify this season. Could these athletes, or others who narrowly missed out, have contended for the title? Again, it is impossible to know for sure, but we can use the cross-regional comparison to look at the odds.

Because of the points-per-place scoring system, the cross-regional comparison can vary slightly based on how large of a field we use, but I have used a scoring system that includes all athletes who completed all 6 events. I also adjusted for the week of competition (as detailed in my first two posts, a couple months back). Using this system, let's look at the highest finishers not to make the Games. On the men's side, we had Gerald Sasser (21st - Central East), Joseph Weigel (22nd - Central East), David Charbonneau (26th - North East), Nick Urankar (29th - Central East) and Ryan Fischer (30th - Southern California). On the women's side, we had Andrea Ager (19th - Southern California), Sarah Hopping (32nd - Northern California), Chyna Cho (33rd - Northern California) and Amanda Schwarz (38th - South Central).

Now, in the Games, let's see how well athletes with similar ranks in the regionals did. For men, the highest finisher to finish worse than 21st in regionals (i.e., worse than Sasser) was Chad Mackay, who took 9th at the Games despite ranking 32nd in this regional comparison. The next-highest was Patrick Burke, who was 16th at the Games and 24th in the regional comparison. So it is probably fair to assume that none of the non-qualifying athletes would have been able to challenge Froning for the title, but certainly they could have made a run at finishing in the top 10. For women, however, several top women finished lower than Ager in the regional comparison, including Jenny Davis (8th at Games, 28th at Regionals), Christy Phillips (11th at Games, 20th at Regionals), Deborah Cordner-Carson (13th at Games, 34th at Regionals) and Cheryl Brost (15th at Games, 21st at Regionals). Could Ager have challenged Annie Thorisdottir for the title? I doubt it, but given her Regional performance and her Open result (6th in the World), I think it is not out of the question that she could have challenged for a spot in the top 5.

I think the women's results do indicate that some top athletes might have missed the Games. Now, was this a result of poor programming at Regionals, or perhaps do we simply need more qualifying spots? In Ager's case, if we look at the athletes from her region who did make the Games, we see that all four (Kristan Clever, Rebecca Voight, Valerie Voboril and Lindsey Valenzuela) finished in the top 10, so this leads me to believe that the programming was not the issue. The bigger issue is that certain regions are simply too competitive. Consider the men's Central East: all five qualifying men finished in the top 10 (including the champion), and five other men were in the top 35 in this cross-regional comparison (the three mentioned above, plus Elijah Muhammad and Nick Fory). Other regions, such as the North West, had no athletes in the top 20 at the Games. I don't think it's unfair to suggest that HQ consider re-allocating the Games spots or adding more spots across the board.

Overall, I think we have to consider the 2012 Games season a successful one - the increased participation and interest in the Games speaks for itself. With that in mind, I believe there are clearly some adjustments that need to be made moving forward. Hopefully we see HQ continue to refine the system in 2013.



*Notes on valuation of movements: I broke down burpee-box jumps and burpee-muscle-ups into two movements, each worth half of that station's total value. For instance, in the Games Chipper, there were 11 total stations, one of which was burpee-muscle-ups. So burpees and muscle-ups were each given 0.5/11 (~0.04) points. Also, I ignored the run portion of Regional WOD 3 (DB snatch/run) because it was virtually inconsequential to the results.