Tuesday, May 6, 2014

The hubris of Big Data (or: Why Google Flu is so very wrong)

Big Data analytics isn’t as magical as it seems. Blind belief in your data can make you look really foolish. Google seems to be learning those lessons the hard way, at least in its effort to use search queries to map influenza outbreaks.
Mapping the illness in near real time is a noble goal, but the world’s smartest company didn’t just fall short of it; a new study says Google’s predictions are not even close.
The moral of this story: Big Data can create big wins and embarrassing losses, though few are as public as this one. Having said that, Google deserves credit for good intentions, though it probably should have admitted to the problem before it ended up in an academic journal.
Here’s the story:
Google Flu Trends, since 2008 the poster child for the supposed wonders of big data, has a teensy problem. According to a paper in Science, Google’s flu tracker is almost always wrong. And not just a little wrong.
According to the paper (abstract here, full text behind paywall), “GFT overestimated the prevalence of flu in the 2012–2013 season and overshot the actual level in 2011–2012 by more than 50%. From 21 August 2011 to 1 September 2013, GFT reported overly high flu prevalence 100 out of 108 weeks.”
“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. — David Lazer

So what happened?

According to Johns Hopkins professor Steven Salzberg, the problem is that the metric Google uses — searches for flu-related information — is simply not accurate. The idea is that people who are coming down with the flu will search Google to learn more about it. From that information, Google thinks it can figure out the location and severity of a flu outbreak. (Google explains its methodology here.)
Salzberg says that plan runs into one small problem: “When 80-90 percent of people visiting the doctor for ‘flu’ don’t really have it, you can hardly expect their internet searches to be a reliable source of information.”
Speaking on a Science podcast, the lead author of the paper, David Lazer of Northeastern University in Boston, said Google found historic search patterns that seemed to correlate with flu outbreaks but then discovered the same patterns did not predict future outbreaks.
“They overfit the data. They had fifty million search terms, and they found some that happened to fit the frequency of the ‘flu’ over the preceding decade or so, but really they were getting idiosyncratic terms that were peaking in the winter at the time the ‘flu’ peaks … but wasn’t driven by the fact that people were actually sick with the ‘flu’,” Lazer says.
As a result, in 2009, Google missed an off-season flu pandemic. “When there was a flu outbreak that was off-season they missed it entirely,” Lazer adds.
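Lazer’s overfitting point can be sketched with a toy simulation (my own illustration, not Google’s actual pipeline): screen tens of thousands of purely random “search term” series against a short flu history, and the single best match will correlate almost perfectly in-sample — yet that chance fit tells you nothing about future seasons.

```python
import random

random.seed(42)

def corr(xs, ys):
    """Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A stand-in "flu" target: 8 historical seasons plus 8 future seasons,
# all pure noise -- there is genuinely nothing to predict.
flu_train = [random.gauss(0, 1) for _ in range(8)]
flu_test = [random.gauss(0, 1) for _ in range(8)]

# Screen 50,000 random candidate "term" series by in-sample correlation,
# echoing the scale Lazer describes (Google started from ~50M terms).
best, best_r = None, -1.0
for _ in range(50_000):
    term = [random.gauss(0, 1) for _ in range(16)]
    r = corr(term[:8], flu_train)
    if r > best_r:
        best, best_r = term, r

# The winner fits history almost perfectly by chance alone, but its
# correlation with the *future* seasons is just another random draw.
r_out = corr(best[8:], flu_test)
print(f"in-sample correlation of best term: {best_r:.2f}")
print(f"out-of-sample correlation:          {r_out:.2f}")
```

The simulation is deliberately crude — real search terms aren’t independent noise — but it shows why selecting predictors from a huge pool on historical fit alone invites exactly the failure the paper documents.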

Big Data hubris

In 2011, Google tweaked the Flu Trends algorithm, and ever since, Google’s numbers have been reliably too high.
Lazer calls this “big data hubris” in which “certain assumptions baked into the analysis doomed it in the long run.” Google Flu Trends assumed a stable relationship between search terms and cases of influenza, which has not proven to be reliable.
Another problem is the feedback loop created when someone who searches for “flu” is taken to a particular page, behavior that causes Google Flu Trends to see an outbreak where no outbreak exists, Lazer adds.
If this topic interests you, I strongly recommend listening to the podcast, which goes into greater detail than I am able to include in this post.
If you work with Big Data, this is a story you might want to follow. My colleague Mike Wheatley has also written about this issue and takes a somewhat different angle.
feature image: Philippe Put via photopin cc

About David Coursey

Editor-at-Large David Coursey is a veteran technology journalist with more than 25 years’ experience writing about business and consumer computing. Contact him at
