* Edit: 3/10/2014 - 2:45 PM: Added a sentence to the third paragraph of
the section "In Practice: Political Polling in 2012 and Beyond" and
changed the second section heading under "Some Background" from
"Hand-wringing about surveys" to "Much ado about response rates".
Some Background
What is sampling statistics?
Sampling statistics
concerns the planning, collection, and analysis of survey data. When
most people take a statistics course, they are learning "model-based"
statistics. (Model-based statistics is not the same as statistical
modeling; stick with me here.) Model-based statistics uses a
mathematical function to model the distribution of an infinitely-sized
population to quantify uncertainty. Sampling statistics, however, uses a
priori knowledge of the size of the target population to inform
how uncertainty is quantified. The big lesson I learned after taking survey sampling
is that if you assume the correct model, then the two statistical
philosophies agree. But if your assumed model is wrong, the two
approaches give different results. (And one approach has fewer
assumptions, bee tee dubs.)
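To make the role of population size concrete, here is a minimal Python sketch (my illustration, not something from a textbook or the talk discussed later) comparing the usual "infinite population" standard error of a sample mean with the one that applies the finite population correction for a simple random sample drawn without replacement. The function names and the numbers are made up for illustration.

```python
import math


def se_infinite(sample_sd, n):
    """Standard error of the mean, ignoring the size of the population."""
    return sample_sd / math.sqrt(n)


def se_finite(sample_sd, n, population_size):
    """Standard error of the mean for a simple random sample drawn
    without replacement from a population of known, finite size."""
    if n > population_size:
        raise ValueError("sample cannot be larger than the population")
    # Finite population correction: shrinks toward 0 as n approaches N.
    fpc = math.sqrt(1 - n / population_size)
    return se_infinite(sample_sd, n) * fpc


# Hypothetical numbers: sample SD of 10 from a sample of 100 people.
print(se_infinite(10.0, 100))        # population size ignored
print(se_finite(10.0, 100, 1000))    # sample is 10% of the population
print(se_finite(10.0, 100, 10**9))   # sample is a negligible fraction
```

When the sample is a tiny fraction of the population, the two numbers are practically identical. When the sample is a sizable share of the population, knowing the population size buys you a smaller standard error.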
Sampling statistics also has a big bag of other tricks, too many to do
justice here. But it provides frameworks for handling missing or biased
data, combining data on subpopulations whose sample proportions differ
from their proportions of the population, sampling when
subpopulations have very different statistical characteristics, etc.
As I write this, it is entirely possible to earn a PhD in statistics and
not take a single course in sampling or survey statistics. Many federal
agencies hire statisticians and then send them immediately back to
school at places like UMD's Joint Program in Survey Methodology. (The federal government conducts a LOT of surveys.)
I can't claim to be certain, but I think that sampling statistics became
esoteric for two reasons. First, surveys (and data collection in
general) have traditionally been expensive. Until recently, there
weren't many organizations except for the government that had the budget
to conduct surveys properly and regularly. (Obviously, there are exceptions.)
Second, model-based statistics tend to work well and have broad
applicability. You can do a lot with a laptop, a .csv file, and the
right education. My guess is that these two factors have meant that the
vast majority of statisticians and statistician-like researchers have
become consumers of data sets, rather than producers. In an age of "big
data" this seems to be changing, however.
Much ado about response rates
Response rates for surveys have been dropping for years,
causing frustration among statisticians and skepticism from the public.
Having a lower response rate doesn't just mean your confidence
intervals get wider. Given the nature of many surveys, it's possible (if
not likely) that the probability a person responds to the survey may be
related to one or a combination of relevant variables. If unaddressed,
such non-response can damage an analysis. Addressing the problem drives
up the cost of a survey, however.
Consider measuring unemployment. A person is considered unemployed if they don't have a job and they
are looking for one. Somebody who loses their job may be less likely to
respond to the unemployment survey for a variety of reasons. They may
be embarrassed, they may move back home, they may have lost their house!
But if the government sends a survey or interviewer and doesn't hear
back, how will it know if the respondent is employed, unemployed (and
looking), or off the job market completely? So, it has to find out.
Time spent tracking a respondent down is expensive!
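A back-of-the-envelope Python sketch shows why this matters. All of the counts and response probabilities below are invented for illustration; they are not real unemployment or survey figures. The point is only that when unemployed people respond less often, the rate among respondents is off, and collecting more responses of the same kind would not fix it.

```python
def observed_unemployment_rate(n_employed, n_unemployed,
                               p_respond_employed, p_respond_unemployed):
    """Expected unemployment rate among respondents when the chance of
    responding depends on employment status."""
    responding_employed = n_employed * p_respond_employed
    responding_unemployed = n_unemployed * p_respond_unemployed
    return responding_unemployed / (responding_employed + responding_unemployed)


# Hypothetical labor force: 920 employed, 80 unemployed (true rate 8%).
true_rate = 80 / (920 + 80)

# Hypothetical response behavior: the unemployed respond half as often.
biased = observed_unemployment_rate(920, 80, 0.6, 0.3)

# If everyone responded at the same rate, the estimate would be unbiased.
unbiased = observed_unemployment_rate(920, 80, 0.6, 0.6)

print(true_rate, biased, unbiased)
```

With these made-up inputs, the rate among respondents comes out at roughly half the true rate, which is the kind of damage that follow-up with nonrespondents is meant to prevent.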
So, if you are collecting data that requires a response, you must
consider who isn't responding and why. Many people anecdotally chalk
this effect up to survey fatigue. Aren't we all tired of being bombarded
by websites and emails asking us for "just a couple minutes" of our
time? (Businesses that send a satisfaction survey every time a customer contacts customer service take note; you may be your own worst data-collection enemy.)
In Practice: Political Polling in 2012 and Beyond
In the context of the above, Aaron Strauss's February 25th talk at DSDC was enlightening. Aaron's presentation
was billed as covering "two things that people in [Washington D.C.]
absolutely love. One of those things is political campaigns. The other
thing is using data to estimate causal effects in subgroups of
controlled experiments!" Woooooo! Controlled experiments! Causal
effects! Subgroup analysis! Be still, my beating heart.
Aaron earned a PhD in political science from Princeton and has been
involved in three of the last four presidential campaigns designing
surveys, analyzing collected data, and providing actionable insights for
the Democratic party. His blog is here.
(For the record, I am strictly non-partisan and do not endorse anyone's
politics, though I will get in knife fights over statistical practices.)
In an hour-long presentation, Aaron laid a foundation for sampling and
polling in the 21st century, revealing how political campaigns and
businesses track and analyze our data, and what the future of surveying
may be. The most profound insight I got was to see how the traditional
practices of sampling statistics were being blended with 21st century
data collection methods, through apps and social media. Whether these
changes will address the decline in response rates or only temporarily
offset them remains to be seen.
Some highlights:
- The number of households that have only wireless telephone service is reaching parity with the number having landline phone service. When considering only households with children (excluding older people with grown children and young adults without children) the number sits at 45 percent.
- Offering small savings on wireless bills may incentivize the taking of flash polls through smartphones.
- Political campaigns have been using social media to gather information and contact nonrespondents, supplementing or replacing traditional voter records for that purpose.
- By contacting 10 times as many people every day, the Obama 2012 campaign schooled Gallup.
- Reducing the marginal cost of surveys allows political pollsters to design randomized controlled trials to evaluate the efficacy of different campaign messages on voting outcomes. (As with all things statistics, there are tradeoffs and confounding variables with such approaches.)
- Pollsters would love to get access to all of your Facebook data.
Sampling Statistics and "Big Data"
Today, businesses and other organizations are tracking people at
unprecedented levels. One rationale for big data being a
"revolution" is that for the first time organizations have access to the
full population of interest. For example, Amazon can track the
purchasing history of 100% of its customers.
I would challenge the above argument, but won't outright disagree with it. Your current customer base may or may not be your full population of interest. You may, for example, be interested in people who don't purchase your product. You may wish to analyze a sample of your market to figure out who isn't purchasing from you and why. You may have access to some data on the whole population, but you may not have all the variables you want.
More importantly, sampling statistics has tools that may allow organizations to design tracking schemes to gather the most relevant data to their questions of interest. To quote R.A. Fisher "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: He may be able to say what the experiment died of." The world (especially the social-science world) is not static; priorities and people's behavior are sure to change.
Data fusion, the process of pulling together data from heterogeneous sources into one analysis, is not a survey. But these sources may represent observations and variables in proportions or frequencies differing from the target population. Combining data from these sources with a simple merge may result in biased analyses. Sampling statistics has methods of using sample weights to combine strata of a stratified sample where some strata may be over- or undersampled (and there are reasons to do this intentionally).
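As a rough sketch of the weighting idea, the Python below compares a naive pooled mean with a mean that weights each stratum by its share of the population. The stratum names, the population shares, and the data are all hypothetical, and this is only the simplest textbook form of a stratified estimate, not a full survey-weighting workflow.

```python
def pooled_mean(samples):
    """Naive estimate: throw every observation into one pile."""
    values = [v for stratum in samples.values() for v in stratum]
    return sum(values) / len(values)


def stratified_mean(samples, population_shares):
    """Weight each stratum's sample mean by that stratum's share of the
    target population (shares are assumed to sum to 1)."""
    return sum(
        population_shares[name] * (sum(values) / len(values))
        for name, values in samples.items()
    )


# Hypothetical data: source "B" covers 10% of the population but supplies
# two thirds of the observations (1 = has the trait, 0 = does not).
samples = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 1, 1, 1, 1, 1, 0],
}
population_shares = {"A": 0.9, "B": 0.1}

print(pooled_mean(samples))                          # dominated by stratum B
print(stratified_mean(samples, population_shares))   # reweighted to the population
```

The simple merge lets the oversampled source dominate the answer; the weighted version pulls the estimate back toward what the larger stratum looks like.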
I am not proposing that sampling statistics will become the new hottest thing. But I would not be surprised if sampling courses move from the esoteric fringes to being a core course in many or most statistics graduate programs in the coming decades. (And we know it may take over a hundred years for something to become the new hotness anyway.)
Links for further reading
A statistician's role in big data (my source for the R.A. Fisher quote, above)