We’ve officially hit information overload on this year’s draft class. RGIII is now apparently a “miscreant” (an ironic label applied by Ryan Burns) and Kendall Wright is apparently fat. This all reminded me of the section from Game Plan that discusses whether human experts can beat simple algorithms in prediction contests. It turns out that they usually can’t, even when they have a significant information advantage (like knowing a wide receiver’s body-fat percentage).
For your reading pleasure, the below is excerpted from Game Plan: A Radical Approach to Decision Making in the NFL.
Human Experts vs. Formulas
The problem of bias affecting the work done by doctors or football scouts has actually been studied broadly, and a simple solution has been available for some time. It is (unfortunately) worth noting that experts are almost always reluctant to adopt it. The solution is to involve a simple formula, or algorithm, in the decision-making process.
As applied to football, an algorithm could be as simple as this:
Wide Receiver Production = (Player Weight / 40-Yard Dash Time) + (College Touchdowns per Game) + (College Yards per Game)
In fact, an algorithm not much different from the one above would have predicted wide receiver performance better than NFL scouts have.
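To make that concrete, here is a minimal sketch of the toy formula in Python. The function name and the prospect’s numbers are hypothetical, invented purely for illustration:

```python
# A minimal sketch of the toy wide receiver formula above.
# The prospect's numbers here are hypothetical, not real combine data.

def wr_production_score(weight_lbs, forty_time_s, td_per_game, yds_per_game):
    """Weight divided by 40-yard dash time, plus college TDs and yards per game."""
    return weight_lbs / forty_time_s + td_per_game + yds_per_game

# Hypothetical prospect: 210 lbs, 4.45s forty, 0.8 TD/game, 85 yds/game
print(wr_production_score(210, 4.45, 0.8, 85))  # ~133.0
```

The units are apples and oranges, of course; the point of the excerpt is that even something this crude is hard for experts to beat.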
This revelation wouldn’t be a shock to the people who study the issue of experts vs. algorithms. A body of research suggests that experts make no better assessments than algorithms do. For instance, in his book on bias, the psychologist Daniel Kahneman relates the story of how a simple formula has helped doctors make assessments that have saved hundreds of thousands of infants’ lives. From Kahneman’s book “Thinking, Fast and Slow”:
A classic application of this approach is a simple algorithm that has saved the lives of hundreds of thousands of infants. Obstetricians had always known that an infant who is not breathing normally within a few minutes of birth is at high risk of brain damage or death. Until the anesthesiologist Virginia Apgar intervened in 1953, physicians and midwives used their clinical judgment to determine whether a baby was in distress. Different practitioners focused on different cues. Some watched for breathing problems while others monitored how soon the baby cried. Without a standardized procedure, danger signs were often missed, and many newborn infants died. One day over breakfast, a medical resident asked how Dr. Apgar would make a systematic assessment of a newborn. “That’s easy,” she replied. “You would do it like this.” Apgar jotted down five variables (heart rate, respiration, reflex, muscle tone, and color) and three scores (0, 1, or 2, depending on the robustness of each sign). Realizing that she might have made a breakthrough that any delivery room could implement, Apgar began rating infants by this rule one minute after they were born. A baby with a total score of 8 or above was likely to be pink, squirming, crying, grimacing, with a pulse of 100 or more—in good shape. A baby with a score of 4 or below was probably bluish, flaccid, passive, with a slow or weak pulse—in need of immediate intervention. Applying Apgar’s score, the staff in delivery rooms finally had consistent standards for determining which babies were in trouble, and the formula is credited for an important contribution to reducing infant mortality. The Apgar test is still used every day in every delivery room.
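In code, the rule Kahneman describes amounts to something like the sketch below. The excerpt specifies the five signs, the 0–2 scoring, and the 8-and-above / 4-and-below thresholds; the label for the in-between range is my own assumption:

```python
# The Apgar rule as described in the excerpt: five signs, each scored
# 0, 1, or 2; a total of 8+ means good shape, 4 or below means trouble.

def apgar_total(heart_rate, respiration, reflex, muscle_tone, color):
    """Each argument is that sign's score: 0, 1, or 2."""
    scores = (heart_rate, respiration, reflex, muscle_tone, color)
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each sign is scored 0, 1, or 2")
    return sum(scores)

def assess(total):
    if total >= 8:
        return "likely in good shape"
    if total <= 4:
        return "in need of immediate intervention"
    return "borderline"  # the excerpt doesn't name the 5-7 range; my label

print(assess(apgar_total(2, 2, 1, 2, 2)))  # likely in good shape
```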
Simple algorithms like the Apgar score are useful because humans are subject to bias. Our brains are not very good at making complex assessments on their own. It’s difficult to know whether the assessments we’ve made in the past were any good, and in fact we’re more likely to remember only the good ones. Left without help, our assessments don’t get any better.
Again, there is a mountain of research showing that algorithms do at least as well as experts, and that algorithms often exceed the effectiveness of expert predictions. A 1996 paper from University of Minnesota researchers Paul Meehl and William Grove surveyed the body of research on this topic, tallying the studies that pit human experts against algorithms at making judgments. Meehl and Grove discuss one study that looked at whether a group of academic counselors could outperform a simple algorithm in predicting student grades. This is a problem analogous to whether a football scout could outperform an algorithm in predicting player performance.
From the paper:
Sarbin compared the accuracy of a group of counselors predicting college freshmen academic grades with the accuracy of a two-variable cross-validated linear equation in which the variables were college aptitude test score and high school grade record. The counselors had what was thought to be a great advantage. As well as the two variables in the mathematical equation (both known from previous research to be predictors of college academic grades), they had a good deal of additional information that one would usually consider relevant in this predictive task. This supplementary information included notes from a preliminary interviewer, scores on the Strong Vocational Interest Blank, scores on a four-variable personality inventory, an eight-page individual record form the student had filled out (dealing with such matters as number of siblings, hobbies, magazines, books in the home, and availability of a quiet study area), and scores on several additional aptitude and achievement tests. After seeing all this information, the counselor had an interview with the student prior to the beginning of classes. The accuracy of the counselors’ predictions was approximately equal to the two-variable equation for female students, but there was a significant difference in favor of the regression equation for male students, amounting to an improvement of 8% in predicted variance over that of the counselors.
Note that the mountain of additional evidence the counselors were given did not ultimately help them beat the very simple algorithm. The counselors and the algorithm tied when predicting the performance of female students, and the algorithm was a significant improvement over the counselors when predicting the performance of male students.
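For readers who want to see the mechanics, the sketch below fits the kind of two-variable linear equation Sarbin used. All of the data is synthetic, invented purely to demonstrate the fitting step; none of it comes from the original study:

```python
# Fit: predicted GPA = a + b1*aptitude + b2*hs_grades, via least squares.
# All data here is synthetic; it only demonstrates the mechanics.
import numpy as np

rng = np.random.default_rng(0)
n = 200
aptitude = rng.normal(50, 10, n)      # college aptitude test score
hs_grades = rng.normal(3.0, 0.5, n)   # high school grade average
# A made-up "true" relationship plus noise, so the fit has something to find
gpa = 0.2 + 0.02 * aptitude + 0.5 * hs_grades + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), aptitude, hs_grades])
(a, b1, b2), *_ = np.linalg.lstsq(X, gpa, rcond=None)
print(f"predicted GPA = {a:.2f} + {b1:.3f}*aptitude + {b2:.2f}*hs_grades")
```

Once the coefficients are fit, predicting a new student is just plugging two numbers into the equation; that is the whole "algorithm" the counselors failed to beat.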
The evidence the academic counselors were provided with is similar to the evidence NFL teams have about prospects before the draft. Teams have a proxy for a player’s I.Q. (the Wonderlic score), they run background checks, they have hours of video on the player, they have the player’s college stats, and they have the results from the NFL combine. And yet teams aren’t any better at selecting wide receivers than a simple algorithm that contains just a few variables. It’s also worth noting that the comparison should actually be biased in favor of the team evaluations: teams get to assign playing time, so their draft picks can partly fulfill their own prophecy, while the algorithm gets no such help.
Meehl and Grove then go on to summarize the 136 studies that they looked at (note that they refer to the algorithmic approach as the “actuarial” approach and to the expert as the “clinician”).
Of the 136 studies, 64 favored the actuary by this criterion, 64 showed approximately equivalent accuracy, and 8 favored the clinician. The 8 studies favoring the clinician are not concentrated in any one predictive area, do not over-represent any one type of clinician (e.g., medical doctors), and do not in fact have any obvious characteristics in common. This is disappointing, as one of the chief goals of the meta-analysis was to identify particular areas in which the clinician might outperform the mechanical prediction method. According to the logicians’ “total evidence rule,” the most plausible explanation of these deviant studies is that they arose by a combination of random sampling errors (8 deviant out of 136) and the clinicians’ informational advantage in being provided with more data than the actuarial formula.
Only 8 of the 136 studies came out in favor of the experts, and those studies didn’t seem to have anything in common. It also didn’t matter how much education or experience the human experts had. Even though the experts were always given more information than the algorithm, the score was still 64-8 in favor of the algorithm, with the other 64 studies resulting in ties.
Human experts will always have excuses for why their judgment should be preferred to an algorithm, even if the experts can’t beat the algorithm when the score is being kept. The most common excuse is probably that the mountain of evidence that experts can’t out-judge algorithms somehow does not apply to a certain field… like football, for instance.
This might be a good time to return to the subject at hand: the NFL’s front offices. I’ve compared the NFL’s scouts to doctors to illustrate what I think are valuable points. First, human bias affects everyone, even the most educated among us. Doctors receive significantly more formal education in their field than NFL scouts do in theirs. Doctors attend schools with a formal curriculum; NFL scouts learn the job by watching others do it. But even doctors’ education doesn’t prevent them from making diagnostic errors, which studies have shown they make even when they are fully confident.
The remedies for the problems that affect scouts might be the same as the solutions that exist for the medical industry. The NFL’s front offices may want to seek regular feedback on the effectiveness of their decision making. They may also want to seek support from computers (or algorithms), which aren’t affected by human bias. The goal of pursuing these two strategies is to reduce the impact of human bias on NFL decision making.
Let’s look at one example of the way human bias might make its way into the player evaluation process. Leading up to the NFL draft in the spring, it is common to hear scouts compare NFL prospects to active players in the NFL. Without much looking, though, you can often find player comparisons that rest solely on the fact that two players look alike. They might have played at the same college, or they might be the same race. Players from Georgia Tech somehow turn out to be similar to other players who previously played at Georgia Tech. White wide receivers somehow turn out to be similar to other white wide receivers. Linemen from the University of Iowa are compared to other linemen who also played at Iowa. Black quarterbacks are compared to other black quarterbacks.
For example, prior to the 2011 draft, North Carolina wide receiver Greg Little was compared to former North Carolina wide receiver Hakeem Nicks. Nicks was coming off a very successful second year in the league, and this may have made its way into the minds of NFL talent evaluators. From an article that appeared on a Cleveland news channel’s website:
Little’s strengths include solid speed, terrific hands and is a solid blocker. Scouts compare Little to former Tarheel and current New York Giants wide receiver Hakeem Nicks.
The problem is that Nicks and Little only look similar if they’re wearing the same college uniform. Nicks is about medium size for a receiver at 212 pounds; Little is huge at 230 pounds, and two inches taller. Nicks was an extremely accomplished college receiver, having compiled 1,200 yards and 12 touchdowns in a year when the North Carolina passing offense was lackluster at best. Little’s best year produced fewer than 800 yards and 5 touchdowns. Nicks averaged an amazing 18 yards per reception in his last year in college, while Little was used more like a wide receiver/running back combo player and averaged just 11 yards per reception. When Nicks was drafted, he had some of the largest hands ever measured on a wide receiver prospect; when Little was measured, he had some of the smallest. Only when they are wearing the same college uniform do they resemble each other!
But if it sounds like I am criticizing anyone who might have thought the two wide receivers were similar, I am not. It’s difficult to look at Greg Little in his UNC uniform and not immediately pattern-match him to another successful UNC receiver. When NFL front offices sit down to evaluate Greg Little, it takes effort to separate Little’s on-field play from his physical appearance. This same kind of association takes place with any number of players.
This is where human bias could be balanced out by involving computers and algorithms in the evaluation process. The easiest step would be to run a simple regression to determine which player attributes have historically correlated with pro success, and to use the results of that regression to project player performance before NFL scouts do anything. After that, the scouts could use a numbers-based similarity process to generate a group of names the subject player might resemble. Teams have a number of valuable data points about players before they have to draft them, including height, weight, speed, and college performance. These data points can be used to find similar players. But this should be done by a computer, not by a human; as we have seen, humans tend to pattern-match, and the patterns are often irrelevant (like skin color or college team).
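The regression step would look much like the Sarbin equation sketched earlier, just with player attributes as the inputs. The similarity step might look like the sketch below, which standardizes each attribute so no single unit dominates the distance; every player row here is a hypothetical placeholder, not real measurement data:

```python
# Numbers-based similarity: standardize each attribute, then rank past
# players by Euclidean distance to the prospect. All rows are hypothetical.
import numpy as np

# columns: height (in), weight (lb), 40-yard time (s), rec yds/game, TD/game
names = ["Veteran A", "Veteran B", "Veteran C"]
past = np.array([
    [74, 198, 4.45, 70.0, 0.9],
    [70, 185, 4.38, 55.0, 0.5],
    [76, 230, 4.60, 60.0, 0.4],
])
prospect = np.array([74, 207, 4.45, 80.0, 1.0])

# Standardize columns so weight (pounds) doesn't swamp 40 time (seconds)
mu, sigma = past.mean(axis=0), past.std(axis=0)
dists = np.linalg.norm((past - mu) / sigma - (prospect - mu) / sigma, axis=1)

for name, d in sorted(zip(names, dists), key=lambda pair: pair[1]):
    print(f"{name}: distance {d:.2f}")
```

With a real database of historical prospects, the top of this ranked list is the set of comparables a scout should be handed before any film is watched.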
For instance, when Buffalo Bills wide receiver Steve Johnson left college, it was probably difficult for scouts to imagine a good NFL wide receiver coming from the University of Kentucky, a school known far more for its basketball program than its football program. This was perhaps one of the reasons Johnson wasn’t picked until the 7th round of the 2008 draft. But had scouts conducted a brief similarity exercise using a database of previous player information, they would have found that Johnson was very similar in a number of respects to Indianapolis Colts wide receiver Reggie Wayne. They both ran the 40-yard dash in about 4.45 seconds, and Johnson was about 10 pounds heavier than Wayne, so his time is more impressive by comparison. They were both about 6’2” tall. They both caught about one touchdown per game in their final year of college. Wayne averaged about 70 receiving yards per game at Miami, while Johnson averaged about 80 at Kentucky. And even if Kentucky is not a football factory, it plays in the Southeastern Conference, which means Johnson’s stats were accumulated against tough opponents.
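Plugging in the approximate figures from the paragraph above makes the point; note that the exact weights are my assumption, since the text only says Johnson was about 10 pounds heavier than Wayne:

```python
# Feature-by-feature comparison using the rough numbers cited above.
# Weights of 198 lb (Wayne) and 208 lb (Johnson) are assumptions chosen
# to match the "about 10 pounds heavier" note in the text.
wayne   = {"height_in": 74, "weight_lb": 198, "forty_s": 4.45, "yds_per_game": 70, "td_per_game": 1.0}
johnson = {"height_in": 74, "weight_lb": 208, "forty_s": 4.45, "yds_per_game": 80, "td_per_game": 1.0}

for key in wayne:
    w, j = wayne[key], johnson[key]
    print(f"{key}: Wayne {w} vs. Johnson {j} ({abs(w - j) / w:.0%} apart)")
```

By this crude measure the two profiles are within a few percent of each other on almost every attribute, which is exactly the kind of comparable no human scout surfaced.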
Had scouts considered how similar Johnson was to Wayne, they might have rated him more highly, and he might have gone better than a 7th round pick. Instead Buffalo got a relative bargain: Johnson had a breakout season in 2010 and followed it up with another solid year in 2011.