## Inside Angle

#### From 3M Health Information Systems

# Populations, samples and statistical plane crashes

*Ryan Butterfield’s co-author on this blog is Melissa Gottschalk, a Research Analyst 2 on a team within the Clinical and Economic Research group for 3M Health Information Systems.*

Carl Friedrich Gauss (1777-1855) is perhaps the greatest of those polymaths who arise at various intervals throughout the history of science and mathematics. He is often quoted as saying *mathematics is the queen of the sciences…* If so, then we would posit that statistics is the royal lawyer. For it is upon statistical process that science argues and struggles its way through the ever-driving quest for nature’s truths. This evolving process of repeatability, reliability and validity is what statistics measures. It is these fundamental layers of the statistical argument that we will touch on here today. Now what does this have to do with health care and business, you ask? Possibly nothing, but probably everything, especially if this business relates to information gathering and analysis as it does at 3M HIS.

We start our discussion with a seemingly simple question: *What is the difference between a sample and a population, and does it matter?*

First, let’s begin with some definitions. A *population* is a collection of things (people, animals, gummy bears, etc.), and a *sample* is a subset of that population (Dorfman and Valliant, 2005). A sample is usually used because the population is too large for data to be accurately captured and/or analyzed efficiently, and the sample observations can be studied to make inferences about the population. Inferences are mathematically based statements about statistical processes. Think: Why is a mean value the most important one to report from a normal or Gaussian distribution? The statistical theory behind that idea is an *inference* and is based on using calculus to show that the maximum point of a normal distribution contains sufficient information to summarize the center of the whole distribution. Similarly, the dispersion of the distribution can be found through calculus-based methods. This maximum point and dispersion value are the mean and variance of the distribution, respectively, which is why we often report distributions by saying they are a normal distribution with a mean of *##* and a variance of *##*.
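As a small illustration of the idea above, the sketch below uses Python's standard `statistics.NormalDist`. The population parameters (mean 100, standard deviation 15), the sample size and the seed are assumptions chosen only for the example. It checks numerically that the normal density peaks at the mean, then estimates the mean and variance from a random sample.

```python
from statistics import NormalDist

# Hypothetical population model (values chosen only for illustration).
population_model = NormalDist(mu=100, sigma=15)

# Evaluate the density on a grid from 40.0 to 160.0 and find where it peaks.
grid = [x / 10 for x in range(400, 1601)]
peak_x = max(grid, key=population_model.pdf)

# Draw a random sample from the model and estimate the two summary values.
sample = population_model.samples(500, seed=42)
estimate = NormalDist.from_samples(sample)

print(f"density peaks at x = {peak_x}")
print(f"estimated mean = {estimate.mean:.1f}")
print(f"estimated variance = {estimate.variance:.1f}")
```

The estimates will be close to, but not exactly, the values used to build the model. That gap is the sampling variability that statistical inference is designed to quantify.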

Now, you may be asking yourself, why did I just read all of that? It is so you know that in the “black box” that relates a sample to a population, there is mathematical theory upon which these processes of statistical inference are based. These processes are not random or made-up, but are theoretically derived rules that govern what robust methods are and are not and when they are most appropriate to use.

*How do we make a sample represent a population and why is that important?*

It is important to note that a random sample should be used as often as possible. While many popularly think, albeit inaccurately, that using random samples is an “effort to reduce bias” or that “a random sample will ensure representation of the population,” the true requirement is that the mathematical theorems, which are the fundamental argument used by most frequentist statistical processes for relating a sample to a population (e.g., hypothesis testing, confidence intervals, etc.), only apply when there is a random sample. Subjects should be selected independently, so that selecting one does not influence the selection of another, and the selection process should be random (Swinscow, 1997). There are also non-random types of sampling, such as when subjects volunteer or groups are easily available, as in convenience sampling, but it is not always appropriate to draw conclusions from statistical inferences based on non-random samples (Banerjee and Chaudhury, 2010).
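To make the contrast concrete, here is a minimal sketch using Python's standard `random` module. Everything in it is hypothetical: the population is simulated, and the values are built to drift with record position so that the "easily available" first records differ from the rest. It is one illustration of how a convenience sample can mislead, not a general rule about how far off one will be.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

N = 10_000
# Hypothetical population: a measurement that drifts upward with record
# position (for example, records stored site by site).
population = [random.gauss(5 + 3 * (i / N), 1) for i in range(N)]

# Simple random sample: every record has the same chance of selection.
random_sample = random.sample(population, 200)

# Convenience sample: the first 200 records, because they are easy to get.
convenience_sample = population[:200]

pop_mean = statistics.fmean(population)
random_mean = statistics.fmean(random_sample)
convenience_mean = statistics.fmean(convenience_sample)

print(f"population mean:         {pop_mean:.2f}")
print(f"random sample mean:      {random_mean:.2f}")
print(f"convenience sample mean: {convenience_mean:.2f}")
```

In this constructed case the convenience sample sits well below the population mean. More importantly, the usual standard error and confidence interval formulas have no theoretical footing for it.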

When one begins a healthcare research project, it is important to understand the data being used for the project. One important early step is understanding whether the data is from a random sample, a non-random sample, or a population. This early step will not only help the researcher understand their data better but also guide them in choosing the appropriate statistical technique.

*Analysis of Samples compared to Populations*

There are different ways in which one would analyze a population versus a sample. A researcher would use the population’s data when they want to describe exactly what occurred, such as in a particular year there were 15 deaths due to plane crashes. For a *population* analysis, the statistical techniques are quite easy; one simply describes the population with means, rates, and standard deviations. On the other hand, one analyzes a *sample* with the specific purpose of drawing conclusions and making inferences about the population or to make inferences about future events (Deming and Stephan, 1941).
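The difference between the two kinds of analysis can be sketched in a few lines of Python. The population here is simulated (hypothetical lengths of stay drawn from an exponential distribution), and the interval uses a normal approximation, which is an assumption that may be poor for small or heavily skewed samples.

```python
import math
import random
import statistics

random.seed(11)  # fixed seed so the sketch is reproducible

# Hypothetical population: 5,000 lengths of stay in days.
population = [random.expovariate(1 / 4) for _ in range(5000)]

# Population analysis: simply describe what occurred.
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)  # population standard deviation

# Sample analysis: estimate the unknown population mean with uncertainty.
sample = random.sample(population, 100)
sample_mean = statistics.fmean(sample)
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Approximate 95% confidence interval (normal approximation).
z = statistics.NormalDist().inv_cdf(0.975)
ci = (sample_mean - z * std_error, sample_mean + z * std_error)

print(f"population: mean {pop_mean:.2f}, sd {pop_sd:.2f}")
print(f"sample: mean {sample_mean:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```

Note that the population summary needs no interval at all, while the sample summary is incomplete without one.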

Remember, the idea is that I have a sample and it informs me of what the unknown population may look like. Sampling introduces variability and error into the analysis, so more complex statistical techniques and hypothesis testing are needed rather than simply describing the data.

*What are the consequences of not fully understanding my sample?*

*Misassigned Generalizability*

Earlier, we mentioned how it is often thought that using a random sample guarantees bias reduction or complete representation of the population. One must understand their sample and should be careful about generalizing to their population, especially when events are rare. If you haven’t read *Freakonomics* by Steven Levitt and Stephen Dubner, it’s worth the read. In *Freakonomics*, the authors discuss the risk of flying versus driving and how we misalign our fear and our risk. In 2009 (when the book was written), approximately 40,000 people died in the United States in car crashes, whereas fewer than 1,000 people died in plane crashes. Obviously, more people died by car than plane. However, if one were to consider that more time is spent driving than flying, “the per-*hour* death rate…is about equal”. This is a perfect example of how misassigning generalizability can be dangerous in statistics. It is easy to be misled if one doesn’t consider what the sample represents.

Another problem with probabilistic prediction arises when we have “fat tails.” Fat tails occur when there are “low-probability, high-consequence events” (Nordhaus, 2011). These events fall far from the mean and deviate from what is expected. In the paper *How to Predict an Election*, the author discusses how presidential election forecasting groups violate the fat tails principles (Taleb, 2017). This would explain why many predicted a Clinton victory, only to be surprised by a Trump win. Not all events and samples fall into a perfect bell curve.
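To give a feel for what a fat tail means numerically, the sketch below compares the probability of landing more than *k* units from the center under a standard normal distribution and under a standard Cauchy distribution. The Cauchy is our own choice of a textbook fat-tailed example, not a distribution used in the cited papers, and its units are scale units rather than standard deviations because its variance is undefined.

```python
import math


def normal_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal variable."""
    return math.erfc(k / math.sqrt(2))


def cauchy_tail(k: float) -> float:
    """P(|X| > k) for a standard Cauchy variable (a fat-tailed example)."""
    return 1 - (2 / math.pi) * math.atan(k)


for k in (2, 4, 6):
    print(f"k={k}: normal {normal_tail(k):.2e}, Cauchy {cauchy_tail(k):.2e}")
```

Under the normal model, events six units out are vanishingly rare. Under the fat-tailed model they remain quite plausible, so a forecast that assumes a bell curve can badly understate the chance of a surprise.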

*Do we ever really have the population?*

Some argue that one never has the entire population and we are always analyzing a sample. In an effort to make all of our processes as statistically and, consequently, as scientifically robust as possible, this is an ongoing point of research at 3M HIS. When researching this topic, we found that there is much debate on whether one can ever really study the entire population. A report by the Healthcare Cost and Utilization Project provides an interesting discussion on this topic. They argue that while they have an entire state database and their population of interest, “the state database is unquestionably a sample of the population when inferences go beyond the database” (Houchens, 2010). Perhaps all we ever have is a sample of what the healthcare population truly is, but by studying this as thoroughly as possible, we can make accurate and sustainable inferences which will lead to improved outcomes and healthier lives.

**Melissa Gottschalk** is a Research Analyst 2 on a team within the Clinical and Economic Research group for 3M Health Information Systems.

**Ryan Butterfield**, DrPH, MBA, is a senior researcher and statistician at 3M Health Information Systems.

References

Banerjee, A. and Chaudhury, S. (2010). Statistics without Tears: Populations and Samples. *Industrial Psychiatry Journal*; 19(1): 60-65.

Deming, W. and Stephan, F. (1941). On the Interpretation of Censuses as Samples. *Journal of the American Statistical Association*; 36(213): 45-49.

Dorfman, A. and Valliant, R. (2005). Superpopulation Models in Survey Sampling. *Encyclopedia of Biostatistics*.

Houchens, R. (2010). Inferences with HCUP State Databases Final Report. HCUP Methods Series Report # 2010-05. U.S. Agency for Healthcare Research and Quality. Retrieved from URL: http://www.hcup-us.ahrq.gov/reports/methods.jsp.

Levitt, S. and Dubner, S. (2009). *Freakonomics: A Rogue Economist Explores the Hidden Side of Everything*. New York: William Morrow.

Nordhaus, W. (2011). Elementary Statistics of Tails. Excerpt from article in *Review of Environmental Economics and Policy*.

Swinscow, T. (1997). Statistics at Square One. *British Medical Journal*, 9th Edition. Retrieved from URL: http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/3-populations-and-samples

Taleb, N. (2017). How to Predict an Election. Tandon School of Engineering, New York University.