
It’s All Ball Bearings Nowadays, or, It’s All Sample Sizes Nowadays

 

I’m starting to feel about sample sizes the way that Fletch felt about ball bearings.  If you haven’t seen the movie I don’t think I’m spoiling the plot at all by saying that he thought they were pretty important.

The cacophony (yeah, that just happened) of opinions surrounding the NFL’s spring draft is a target-rich environment for the application of small sample sizes.  Let’s think about all of the ways small samples end up with outsized importance being assigned to them.

The Combine.

The combine is a few days of drills that aren’t even football related except that they also involve speed and strength.  But the draftniks assign all manner of importance to the combine.

This guy needs to have a good combine, they might say.  This guy needs to run at least a X.XX.

But the combine is a small sample.  It’s a player’s performance on a single day.  Yet outsized importance is attached to that day, and it can counteract a collegiate record that might be four years long.  Isn’t that screwed up?  Every Saturday in the fall the player goes out after a week of coaching and plays in a football game, and after four years you might have 50 or so games as a record, and that record can be outweighed by a single day of workouts?
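To put a rough number on that, here is a minimal sketch of the arithmetic.  It assumes (and this is only an assumption for illustration) that every observation of a player, whether a game or a workout, is an independent reading with the same amount of noise.  Real games and drills are not interchangeable like that, so treat it as a picture of the sample size effect and nothing more.  The names `standard_error` and `NOISE_SD` are made up for this example.

```python
import math


def standard_error(noise_sd: float, n_observations: int) -> float:
    """Standard error of the average of n independent, equally noisy observations."""
    return noise_sd / math.sqrt(n_observations)


# Assumption: every observation carries the same amount of noise (1.0 here,
# in arbitrary units). The value is illustrative, not measured.
NOISE_SD = 1.0

one_workout = standard_error(NOISE_SD, 1)       # a single combine day
college_career = standard_error(NOISE_SD, 50)   # roughly 50 college games

# How many times noisier the single day is than the 50-game average.
noise_ratio = one_workout / college_career

print(f"single day: {one_workout:.2f}")
print(f"50 games:   {college_career:.2f}")
print(f"ratio:      {noise_ratio:.1f}")
```

Under those assumptions the single day is about seven times noisier than the 50-game record (the square root of 50), which is the whole complaint in one number.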

Speed is pretty important to football, but the best wide receivers are rarely the fastest, and as far as running backs go, if you show me a running back who averaged over six yards per carry for his college career, I’ll show you a running back who will have a good Speed Score (and I’ll even bet you can pick one up in the fourth round).

Moving on.

Player comparisons.

Player X is like Larry Fitzgerald.  Wait, no he’s not.  Maybe he’s like Anquan Boldin.  Wait, no, he’s really like a bigger Santonio Holmes.

Saying someone could be like Larry Fitzgerald is applying a comparison that is supposed to have meaning, but is essentially just a small sample size.  Larry Fitzgerald is just one guy with a particular skillset who happened to be a good pro.  There’s no guarantee that another guy who came along with a similar skillset would have Larry Fitzgerald’s success.

But the other problem with player comparisons is that when you try to shoehorn them into a type, you ignore the possibility that they might not fit any type, yet might be good in spite of that.  They might be a new kind of good.  To ignore this possibility is to say basically that only a few types of successful player exist and every player coming out of college needs to fit into one of the pre-made molds.  This is preposterous.

Here’s another problem with player comparisons.  Even the EXACT SAME PLAYER can have different results.  Forget about trying to predict the future by coming up with a reasonable approximation (Player A will be successful because he has the same skillset as Larry Fitzgerald).  It isn’t just similar players who get different results.  The very same player does too.  Randy Moss had about as bad a season as you can have in Oakland in 2006, then he had a record setter the next year.  If the exact same player can have a range of results based on situation, single player comparisons between pros and college players become ridiculous.  The college player is almost assured of walking into a different situation than the one that allowed the pro to acquire whatever reputation we assign to him.  Similarity is not destiny.

(Side note: This might seem like I’m making an argument that would counter my use of similarity scores.  However, my similarity scores compare 20 player seasons typically for the very reason that similarity does not equal destiny and because single player comparisons create a ridiculously small sample.)

If he can do it, he can do it.

This is a phrase I’ve been using more lately.  It basically means that a player’s accomplishments on the field can generally speak for themselves.  If a guy puts up 1,500 receiving yards in a season, he figured out a way to get open.  Don’t rob him of his skills because he doesn’t seem good in a way we’ve ever seen before.  He figured out a way to do it.  He went out every Saturday and figured out a way to beat double teams, but we’re going to downgrade him because he didn’t run a sub 4.4 forty and we don’t understand how he did it?


Competence in Meetings and an Explanation for the Loud Mouth Ryan Brothers


This is from an article that appeared on the Freakonomics blog yesterday.  It’s a guest post by basketball economist David Berri and he discusses an academic study that basically says that our perception of competence is pretty much related to how much people talk.  From the post:

A couple of years ago Cameron Anderson and Gavin J. Kilduff published a study examining how people in meetings evaluate each other.  Obviously we would like people in meetings to think we are competent.  And one might think, the best way to get people to think you are competent is to just be competent.  But that is not what Anderson and Kilduff found.  In a study of how people in a meeting – a meeting designed to answer math questions — were evaluated by their peers, these authors found (as Time reported) that actual competence wasn’t driving evaluations:

Repeatedly, the ones who emerged as leaders and were rated the highest in competence were not the ones who offered the greatest number of correct answers. Nor were they the ones whose SAT scores suggested they’d even be able to. What they did do was offer the most answers — period. 

“Dominant individuals behaved in ways that made them appear competent,” the researchers write, “above and beyond their actual competence.” Troublingly, group members seemed only too willing to follow these underqualified bosses. An overwhelming 94% of the time, the teams used the first answer anyone shouted out — often giving only perfunctory consideration to others that were offered.

Think about what this study says about meetings. If I want you to think I am competent, I need to talk.  But if all of us have this same incentive… well, maybe we better be standing.  A sit-down meeting can be endless (or at least seem that way).

I think we finally have an explanation for the Ryan Brothers.  They have incentives to be loud mouths!  They’ve probably been rewarded for it their entire careers.

Jeremy Lin and the Limits of Scouting


The first thing we can probably get out of the way in this post is to say that we still have a relatively small sample size on Jeremy Lin’s pro career.  Maybe his first few games in the league have been indicative of his talent, or maybe when we have more observations, those games will look more like outliers. 

But what we do know is that Jeremy Lin is probably better than you would expect based on the fact that he has been previously waived by not one – but two – NBA teams.  This is something of a black mark on the reputation of scouting.  After a four year college career, time spent on two NBA rosters, and a number of games in the D-League, Lin’s breakout was a shock to all involved. 

Lin played in the Ivy League, which has weaker competition no doubt, but looking at a player, and the way he plays, is supposed to be the domain of scouting.  Scouting is about eyeballs, and training those eyeballs on players to assess their skills.  It’s about measuring players against everything that the scout has seen to date, and then making an assessment.

In football there are a number of examples of similar things happening.  Tom Brady, Arian Foster, and Victor Cruz are just a few names that come to mind.  They aren’t just good.  They’re basically at, or close to, the top of their position.  They all went undrafted (or in the 6th round in Brady’s case, which is pretty much the same), which is to say that scouts didn’t consider them to be in the top 250 of available players in the year they came out.  Brady, Foster and Cruz are also black marks on the scouting profession.

It’s something of a cheap shot for me to pick three names that are examples of the failures of scouting, even though my intent here is not to take cheap shots, but rather to illustrate why scouting has limits.

Scouting has limits because it is broken and it can’t be fixed on its own terms.

Scouting reports are a series of anecdotes, strung together.  Player X shows ability to read the defense and move to his second and third receiver.  Player Y shows good burst in the open field.  Player Z has the vision to find cut back lanes.  Player A shows good top end speed.  Player K does not have a plus arm.  Player S finishes runs, while Player B does not show the ability to win collisions.  If Player C can shake off a reputation of being lazy, his upside might be in the range of Justin Tuck.

The problem with a list of anecdotes and comparisons strung together is this: there is no way to measure the effectiveness of the evaluation.  Measuring players through scouting is one thing.  Measuring the measuring is another.  The reason that scouting can’t be improved on its own terms is because it has no way to measure the measuring.

How do you know how good a quarterback who “shows ability to move through his progressions” will be?  How do you know how good a running back who “shows good burst for his size” will be?  But before you get there, how do you even know what any of that means?  How much better is great burst than good burst?

Scouts will often hang their hat on a player that everybody else passed on, but that the scout in question graded accurately.  But that’s easy.  If you grade enough players, and as long as you sometimes disagree with the consensus, you’ll get some players right that others got wrong.  That’s no more of an accomplishment than my cherry picking of three names (Brady, Foster, Cruz) that scouting got wrong.  And not just one scout got those guys wrong.  Every scout got them wrong.  Before you rush to pat the Patriots on the back for actually picking Brady, just keep in mind that they passed five times on probably the greatest quarterback ever, unwilling to use a 1st-5th round pick to insure themselves against missing out on him.
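The "grade enough players and you'll get some right" point is just probability, and a small sketch shows how quickly it adds up.  Everything here is hypothetical: the function name, the number of players graded, how often the scout breaks from consensus, and how often a contrarian grade pans out by pure luck.  It also assumes each grade is independent of the others.

```python
def chance_of_a_lucky_hit(players_graded: int,
                          disagree_rate: float,
                          lucky_hit_rate: float) -> float:
    """Probability that a scout with no real edge ends up with at least one
    contrarian grade that looks right in hindsight."""
    # Chance that any single grade is both contrarian and correct by luck.
    per_player = disagree_rate * lucky_hit_rate
    # Complement of "never once got lucky" across independent grades.
    return 1.0 - (1.0 - per_player) ** players_graded


# Made-up inputs: 200 players graded, the scout disagrees with consensus on
# 10% of them, and 10% of those contrarian calls happen to work out.
p = chance_of_a_lucky_hit(players_graded=200,
                          disagree_rate=0.10,
                          lucky_hit_rate=0.10)
print(f"chance of at least one 'I told you so': {p:.0%}")
```

With those made-up inputs the chance of having at least one hat to hang comes out near 87%, without the scout being any better than a coin that occasionally disagrees.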

So how can scouts measure whether their measuring is actually working?  That’s where we get into the domain of stats.  The logical progression of any evaluation system ends up in the domain of stats.  Why?  Because statistics offers ways to measure the measuring.  Statistics will tell you whether a player’s burst, or probably more accurately, their measured speed, correlates at all with success in the NFL.  Statistics will tell you if a college quarterback’s completion percentage has any bearing on their pro success.  More importantly, statistics will tell you how much you don’t know.

This might be a good place to relate a failure of mine that still demonstrates why statistics have a natural advantage over scouting methods as we know them today.  Before the fantasy football season started last year, I did a lot of work on college receivers.  One of the models that I came up with while doing this work stressed the importance of wide receivers catching a disproportionate amount of their college team’s touchdowns and yards.  I was basically looking at each receiver’s market share of their college team’s passing game.  The model that I was using graded Leonard Hankerson and AJ Green higher than Julio Jones and I wrote an article saying to stay away from Jones.  It wasn’t just that Jones was relatively light on touchdowns at Alabama, it’s that the Tide threw a lot of touchdowns that Jones wasn’t involved in.  For a supposedly elite talent, that is odd. 

However, I am now expecting to be wrong on this prediction.  I expect that Jones will be a very good pro and saying to stay away from him was premature.  But here’s the thing.  My error was within what you could expect from the model.  The model only explains about 20% of the variance in a receiver’s pro production (if you don’t think that’s a lot, go ahead and test draft position as a variable and see how much of a receiver’s success it explains).  That leaves a lot of production to be explained by other factors like the system that a receiver ends up in, usage rates, health, and randomness.  And here’s another thing.  I can work to improve my model with the ultimate goal that it explains more of the variance in production.  I already realize that my model ignored expected usage which can be inferred from draft position.  I can add that variable to the model and retest it to see if it explains more than my first version.  I can keep doing this until the model can’t be improved any more.
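Here is a minimal sketch of what "add the variable and retest" can look like.  To be clear, this is not my actual model and the numbers are invented for illustration.  The lists `market_share`, `draft_pick` and `pro_yards` are toy data for eight imaginary receivers, and the function names are made up.  It uses the standard result that a one-variable linear model's R-squared is the squared correlation, and the textbook formula for R-squared with two predictors.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def r2_one_predictor(x, y):
    """Share of variance in y explained by a linear model on x alone."""
    return pearson(x, y) ** 2


def r2_two_predictors(x1, x2, y):
    """Share of variance in y explained by a linear model on x1 and x2.

    Assumes x1 and x2 are not perfectly correlated with each other.
    """
    r_y1 = pearson(x1, y)
    r_y2 = pearson(x2, y)
    r_12 = pearson(x1, x2)
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Invented toy data for eight imaginary receivers (NOT real players).
market_share = [0.18, 0.35, 0.22, 0.41, 0.27, 0.15, 0.30, 0.38]
draft_pick = [12, 45, 88, 20, 150, 200, 60, 110]
pro_yards = [650, 900, 400, 1100, 500, 200, 800, 700]

r2_market_only = r2_one_predictor(market_share, pro_yards)
r2_with_draft = r2_two_predictors(market_share, draft_pick, pro_yards)

print(f"market share only:        R^2 = {r2_market_only:.2f}")
print(f"market share + draft pick: R^2 = {r2_with_draft:.2f}")
```

One caution on the design: measured on the same data it was fit to, R-squared can never go down when you add a variable, so a fair retest checks the bigger model on players it hasn't seen.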

Statistics account for what is unknown, and also contemplate improvement by adjusting the model until it either offers a full explanation, or just gets close, in which case the likely variance can be accounted for.

Scouting methods have no way of accounting for the unknown and also have no room for improvement on their own terms.  They start with an expert opinion based on years of experience, and have no real room to go upward.  Any room to improve likely ends up moving more into the domain of statistics by quantifying rather than broadly describing.  In fact that is already happening.  A scouting report might contain a number of descriptions of the player’s play, followed by “I currently have him graded as a 1st rounder”.  Some reports might broadly describe a number of “intangibles” and then give the player a number-based rating (presumably also rating the intangibles).  The scouts can keep moving slowly in this direction, but if they actually want to really improve their evaluations they’ll eventually have to engage in testing, or measuring the measuring.

If it sounds like I’m just taking cheap shots, I’ll offer this in defense of what I am saying.  The NFL draft is the culmination of the scouting profession’s year.  Each team presumably employs the best football scouts available to them, along with team management who are also generally ex-scouts.  If what I am saying about the limitations of scouting were bullshit, then the NFL draft would be an efficient market.  The first round picks would outperform the second round picks, who would outperform the third round picks, and so on.  If current scouting methods were working, it wouldn’t be possible to come up with a statistical model that could better explain the success of wide receivers.  Yet my model for college wide receivers does explain more of their production than does simply using draft spot.  And it doesn’t just mildly improve on a model based only on draft spot.  Including the variables that I use like Market Share of Team Yards and Market Share of Team Touchdowns basically doubles the explanatory value of a model that includes only draft spot.  That’s basically saying that while teams could easily review college wide receiver statistics, they opt for the much more difficult and costly task of using scouting methods.

Think maybe what I’m saying only applies to wide receivers?  Well, it probably applies to running backs as well.  A model that tries to explain running back rushing efficiency (yards per carry) using draft position explains none of the variance in yards per carry.  We know that the scouts don’t know the limits of what they’re doing because despite the NFL’s move to passing offenses, and despite the fact that a running back’s draft spot doesn’t mean that they’ll be any good, scouts continue to give 1st round grades to running backs each year.

I say that scouting has limits not because it can’t be improved.  It’s just that it can’t be improved on its own terms.  Any improvement is likely to be a move in the direction of quantification, which will require testing of the quantification, and then guess what?  You’re in the domain of the spreadsheet jocks.