Statistics

Statistics is a set of methods used to collect and analyze data. Statistical methods help people identify, study, and solve many problems. They enable people to make informed decisions in uncertain situations.

Statistical methods play an important role in a wide variety of occupations. Doctors and public health officials rely on statistics when determining whether certain drugs and therapies help in the treatment of medical problems. Engineers use statistics to set standards for product safety and quality. Weather forecasters and climate researchers work with statistical models to understand and predict weather patterns and climate trends. Scientists consider statistical ideas when designing and evaluating experiments. Psychologists study statistics to learn about human behavior. Statistical techniques enable economists to discover the impact of government policies and to predict future economic conditions.

People often use the word statistics as a plural noun to mean numerical data or numerical summaries of data. Used as a singular noun, statistics means the set of methods used to collect and analyze data and report conclusions. This article discusses statistical methods.

Using statistics to study problems

People called statisticians specialize in using statistical methods to study problems. They typically study a problem in at least four basic steps: (1) defining the problem, (2) collecting the data, (3) analyzing the data, and (4) reporting the results.

Defining the problem.

To obtain accurate data, a statistician must establish an exact definition of the problem. For example, suppose a statistician were asked to count the inhabitants of Lincoln, Nebraska, on a specific date. The statistician would have to define inhabitant clearly to know who should be included in the count. The statistician would have to decide whether to include newborn babies in the hospital, students temporarily away from Lincoln at college or attending college in Lincoln, and people visiting Lincoln from other places. If the statistician did not clearly define inhabitant, gathering useful data would prove quite difficult.

Collecting the data.

Different problems require different kinds of information. The careful study of a single case, such as an airplane crash, can often be useful. But collections of cases, such as the rates of crashes involving various types of airplanes, usually provide more reliable information for reaching general conclusions.

Designing ways to collect data ranks as one of the statistician’s most important tasks. Statisticians can collect data from a population or from a sample. A population is the entire group of objects or people being studied, whereas a sample is only a portion of that group. Statisticians refer to a study in which data are collected on every member of a population as a census.

Statisticians often compare populations that differ in some important way, such as where they live or how healthy they are. To make accurate comparisons, statisticians must control the effect of other differences among the members of their samples. For example, imagine a food company asked a statistician to determine if people in two regions react differently to sweetness in food. To ensure that differences in age did not influence the results, the statistician might compare children in one region with children from the other, and adults from one region with adults from the other.

Statisticians gather data using observational studies and controlled experiments. Observational studies involve collecting data on people or objects in their natural surroundings. A simple type of observational study is the sample survey, in which statisticians ask a sample of people about their opinions or situations.

In controlled experiments, statisticians create special conditions and observe how they affect people or objects. The randomized controlled experiment ranks as the most precise and informative method of collecting data for comparisons. In this method, statisticians divide the units to be studied into groups at random to help control the effects of unmeasured differences.
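As a simple illustration of random assignment, the following Python sketch splits a list of subjects into a treatment group and a control group by shuffling the list at random. The subject labels, the helper function, and the even split are illustrative assumptions, not details of any particular study.

```python
import random

def random_assignment(subjects, seed=None):
    """Split a list of subjects into treatment and control groups at random."""
    rng = random.Random(seed)
    shuffled = list(subjects)   # copy so the original order is untouched
    rng.shuffle(shuffled)       # random order helps balance unmeasured differences
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]   # (treatment, control)

# Example with 10 hypothetical subject labels
treatment, control = random_assignment(range(1, 11), seed=42)
print("Treatment group:", treatment)
print("Control group:  ", control)
```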

Researchers conducted one of the most famous randomized controlled experiments in the 1950’s. They tested a newly developed polio vaccine on 400,000 children. Half the children received the vaccine. The other half received a harmless solution, called a placebo, that was known to have no effect on polio. The researchers selected the two groups at random. Because polio affected only a small percentage of children, the groups had to be quite large to reliably reveal whether the vaccine worked. The results showed that the rate of paralysis due to polio was almost three times as great among the children who got the placebo as among those who got the vaccine. Thus, the researchers concluded that the vaccine was effective in helping to prevent polio.

Analyzing the data.

Methods for analyzing statistical data fall into two categories: (1) exploratory methods and (2) confirmatory methods. Statisticians employ exploratory methods to figure out what the collected data reveal about a problem. These methods often involve computing averages or percentages, displaying data on a graph, and estimating levels of association between variables (varying quantities) that were measured. Statisticians often use exploratory methods to compare measurements of two or more samples.
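As a minimal illustration of such exploratory summaries, the following Python sketch computes an average, a percentage, and a measure of association called the correlation coefficient for a small set of invented measurements. The variables and values are hypothetical.

```python
from statistics import mean, correlation  # correlation requires Python 3.10 or later

# Hypothetical measurements for a small sample of people
heights_cm = [162, 175, 168, 181, 159, 172]
weights_kg = [58, 77, 64, 85, 55, 70]

avg_height = mean(heights_cm)
pct_tall = 100 * sum(h > 170 for h in heights_cm) / len(heights_cm)
assoc = correlation(heights_cm, weights_kg)   # between -1 and 1

print(f"Average height: {avg_height:.1f} cm")
print(f"Percent taller than 170 cm: {pct_tall:.0f}%")
print(f"Height-weight correlation: {assoc:.2f}")
```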

Statisticians use confirmatory methods to distinguish between important differences in data and meaningless random variations. Confirmatory methods typically involve using ideas from a branch of mathematics called probability theory. In the polio vaccine experiment, confirmatory methods enabled researchers to determine that the difference in polio rates between the two groups was much higher than would be expected due to chance variation.

Statistical analysis often requires extensive calculations. Statisticians rely on computer programs to carry out much of this work. A number of statistical methods use computers to simulate random events for comparison to observed data. Many business people and researchers who do not consider themselves statisticians use statistical computer programs in their work.
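The following Python sketch suggests how such a simulation can work. It uses invented counts, not the polio data, and a simple reshuffling approach: the computer repeatedly mixes up the group labels at random and records how often chance alone produces a difference in rates as large as the one observed.

```python
import random

def chance_of_difference(cases_a, n_a, cases_b, n_b, trials=2_000, seed=0):
    """Estimate how often random relabeling alone produces a difference in
    rates at least as large as the observed one (a permutation-style test)."""
    observed_diff = cases_a / n_a - cases_b / n_b
    # Pool all outcomes: 1 = case, 0 = no case
    pooled = [1] * (cases_a + cases_b) + [0] * (n_a + n_b - cases_a - cases_b)
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if diff >= observed_diff:
            as_extreme += 1
    return as_extreme / trials

# Hypothetical counts: 12 cases among 2,000 in group A, 4 among 2,000 in group B
print(chance_of_difference(12, 2_000, 4, 2_000))
```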

Reporting the results.

Statisticians analyze data to make inferences (logical conclusions) about the populations being studied. They may report their findings in the form of a table, a graph, or a set of percentages.

If a statistician has examined only samples, the reported results must reflect the uncertainty involved in making inferences about the larger population. Statisticians express uncertainty by making statements about the probability of their conclusions and giving ranges of possible values.
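One common way to give such a range is to report an estimate plus or minus about two standard deviations. The following Python sketch is a minimal illustration for an estimated proportion; the survey numbers and the helper function in it are invented.

```python
import math

def proportion_with_range(successes, sample_size):
    """Estimate a population proportion and give a range of about
    two standard deviations on either side of the estimate."""
    p_hat = successes / sample_size
    std_dev = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, (p_hat - 2 * std_dev, p_hat + 2 * std_dev)

# Hypothetical survey: 540 of 1,200 people sampled favor a proposal
estimate, (low, high) = proportion_with_range(540, 1200)
print(f"Estimated proportion: {estimate:.3f}")
print(f"Likely range: {low:.3f} to {high:.3f}")
```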

Probability

To make accurate inferences, a statistician must understand probability theory. Probability theory enables people to calculate the chance that each possible outcome of a random event will occur.

Suppose you were to toss a penny five times. Each toss would result in one of two possible outcomes: heads or tails. The two outcomes can be thought of as equally likely—that is, the probability that the coin will turn up heads equals 1/2, as does the probability it will turn up tails. For the collection of five tosses, there are 32 possible sequences of heads and tails, such as heads, heads, heads, tails, tails; or tails, tails, heads, heads, tails. In one possible sequence, the penny never comes up heads, and in another, it comes up heads every time. Each of the 32 sequences is equally likely, so if you counted up how many sequences correspond to zero heads, one head, and so on, you would see that the probabilities for the number of heads are:

0 heads = 1/32 probability

1 head = 5/32 probability

2 heads = 10/32 probability

3 heads = 10/32 probability

4 heads = 5/32 probability

5 heads = 1/32 probability

Statisticians call such lists of possible outcomes and their probabilities probability distributions. They often display the same information in the form of a graph known as a probability histogram. As you toss the penny more times, the probability distribution of the number of heads draws closer and closer to a bell-shaped curve called the normal distribution.
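The probabilities listed above can be checked by listing all 32 equally likely sequences of five tosses and counting the heads in each, as in the following Python sketch.

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All 2**5 = 32 equally likely sequences of heads (H) and tails (T)
sequences = list(product("HT", repeat=5))
counts = Counter(seq.count("H") for seq in sequences)

for heads in range(6):
    print(f"{heads} heads: {Fraction(counts[heads], len(sequences))} probability")
```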

The result of the repeated tosses demonstrates an important mathematical principle called the central limit theorem. The theorem holds that for the sum of a large number of independent repeated events, such as coin tosses, the probability distribution will approximate the normal distribution. The theorem enables statisticians to use the normal distribution in making inferences from a sample of observations to the entire population.
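A short simulation can suggest this behavior. In the Python sketch below, the number of tosses and repetitions are arbitrary choices; the program counts the heads in repeated sets of 30 tosses and prints a crude text histogram. The bars pile up in a roughly bell-shaped pattern around the mean.

```python
import random
from collections import Counter

rng = random.Random(1)
n_tosses, repetitions = 30, 5_000

# Count the heads in each of 5,000 repetitions of 30 tosses
totals = [sum(rng.random() < 0.5 for _ in range(n_tosses)) for _ in range(repetitions)]

# Crude text histogram: one '#' per 25 repetitions with that number of heads
counts = Counter(totals)
for heads in range(min(counts), max(counts) + 1):
    print(f"{heads:2d} heads | {'#' * (counts[heads] // 25)}")
```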

All probability distributions have certain properties. For example, each distribution has a mean or average value. Statisticians calculate the mean of a distribution by multiplying each value by its probability and then adding these products. For five coin tosses, the mean number of heads is found by the following calculation:

mean = (0 × 1/32) + (1 × 5/32) + (2 × 10/32) + (3 × 10/32) + (4 × 5/32) + (5 × 1/32) = 2 1/2

If you were to toss a coin n times, and the probability of heads on each toss is p, the mean number of heads would be n × p. In the example above, n = 5 and p = 1/2. The mean is therefore 5 × 1/2, or 2 1/2. For the normal distribution, the mean occurs at the value directly under the peak of the curve.

Other important properties of probability distributions include the variance and the standard deviation. Both terms measure how values vary around the mean. The following equation gives the variance:

variance = sum of [(value – mean)² × (probability of value)]

The standard deviation is the square root of the variance. For the normal distribution, the probability that a value lies within one standard deviation of the mean is about 2/3, and within two standard deviations of the mean is approximately 95/100.
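These formulas can be applied directly to the five-toss distribution given earlier. The following Python sketch computes the mean, the variance, and the standard deviation from the listed probabilities and checks that the mean equals n × p.

```python
from fractions import Fraction
from math import sqrt

# The probability distribution for the number of heads in five tosses
distribution = {0: Fraction(1, 32), 1: Fraction(5, 32), 2: Fraction(10, 32),
                3: Fraction(10, 32), 4: Fraction(5, 32), 5: Fraction(1, 32)}

mean = sum(value * prob for value, prob in distribution.items())
variance = sum((value - mean) ** 2 * prob for value, prob in distribution.items())
std_dev = sqrt(variance)

print("mean =", mean)            # 5/2, the same as n x p = 5 x 1/2
print("variance =", variance)    # 5/4
print("standard deviation =", round(std_dev, 3))
```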

Sampling

Statisticians must plan carefully to choose samples that will give them useful information. For example, suppose a statistician must estimate the level of unemployment nationwide. The statistician would have to determine how to obtain a sample that would represent the whole nation as accurately as possible. Should many households in each of a few cities be sampled, or should fewer households in each of many cities? How should households in selected cities be chosen?

Statisticians try to avoid choosing samples that do not represent the entire population. Imagine a statistician had to conduct a sample survey to measure the opinions of all the people in a city. The statistician could stand in a public place and survey people walking by, but the location chosen might influence the results. For example, a survey taken in a wealthy neighborhood might strongly reflect the opinions of wealthy people and exclude the opinions of the city’s poor. Statisticians use the term selection bias to refer to the way that poorly selected samples can lead to inaccurate conclusions.

To reduce selection bias, statisticians often choose the units that make up a sample at random. A simple random sample is selected in such a way that all possible samples of the same size have an equal probability of being selected. The larger a random sample is, the more reliably a statistician can infer such quantities as means or proportions for the population. Larger samples also usually justify drawing more precise conclusions. Statisticians can measure the reliability of a sample using the standard deviation of the sample average. The standard deviation decreases in inverse proportion to the square root of the sample size. Thus, to double the reliability, the statistician must take a sample four times as large.
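A simple random sample can be drawn with standard software. The following Python sketch draws such a sample from an invented population of household incomes and uses the sample average to estimate the population average; the population values are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical population: yearly incomes (in dollars) of 10,000 households
population = [round(rng.lognormvariate(10.5, 0.5)) for _ in range(10_000)]

# A simple random sample of 400 households; every group of 400 is equally likely
sample = rng.sample(population, k=400)

print("Population average:", round(mean(population)))
print("Sample estimate:   ", round(mean(sample)))
```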

Sample sizes vary greatly depending on the purposes of the statistical study. Most well-known public opinion polls survey samples of 500 to 2,000 people. The sample survey used to measure the official national unemployment rate in the United States involves interviews with over 50,000 individuals. Such a survey produces averages and proportions over five times as reliable as those from a survey of 1,500 people. Although statisticians use fairly complex methods to choose the samples in these surveys, they still rely in part on the idea of simple random samples.
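The way reliability grows with sample size can be seen in a short calculation. Assuming a simple random sample and a population proportion of 1/2, the case with the greatest variation, the Python sketch below shows the standard deviation of a sample proportion shrinking in inverse proportion to the square root of the sample size.

```python
import math

def std_dev_of_sample_proportion(sample_size, p=0.5):
    """Standard deviation of a sample proportion from a simple random sample."""
    return math.sqrt(p * (1 - p) / sample_size)

for n in (500, 1_500, 6_000, 50_000):
    print(f"sample size {n:>6}: standard deviation about {std_dev_of_sample_proportion(n):.4f}")

# Quadrupling the sample size (1,500 to 6,000) halves the standard deviation,
# which is the sense in which the larger sample is twice as reliable.
```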

History

People gathered numerical data as far back as ancient times, and the Bible describes the details of several censuses. Political and religious leaders collected information about people and property throughout the Middle Ages and the Renaissance. In the 1700’s, the word statistik was commonly used in German universities to describe a systematic comparison of data about nations.

In the late 1800’s, British scientists and mathematicians, including Francis Ysidro Edgeworth, Francis Galton, Karl Pearson, and George Udny Yule, developed many of the statistical ideas and methods of analysis used today. However, many of these ideas remained unrefined until the 1920’s. At that time, statistics emerged as a branch of science through the work of a small group of statisticians, also working in England. Statistical inference grew out of the work of Ronald A. Fisher, Jerzy Neyman, and Egon Pearson. Fisher also developed a theory of experimental design based on random assignment of treatments. Neyman proposed a theory of sample surveys with ideas similar to those in the theory of experimental design.

During World War II (1939-1945), statisticians developed many ideas and methods as part of the war effort in the United Kingdom and the United States. After the war, the field of statistics grew, and statistical ideas came into use in a wide variety of areas.

Careers in statistics

Statisticians find career opportunities in a wide variety of fields, including actuarial science (the estimation of risk), agriculture, biology, business, education, engineering, environmental science, health and medicine, quality control, and the social sciences. In all of these fields—and in many other fields—statisticians work closely with other scientists and researchers to develop new statistical techniques, adapt existing methods to new problems, design experiments, and direct the analysis of surveys and observational studies. As the ability to collect and process data improves, statisticians often take up key roles in new and expanding fields. In the 2000’s, areas of rapid development included astronomy and astrophysics, environmental and climate studies, statistical genetics, and data mining (extracting useful information from large stores of data) in business.

Many national governments employ professional statisticians at various levels of responsibility and policymaking. Statistical experts at local levels help solve problems concerning the environment, the economy, transportation, public health, and other matters of public concern. Lawyers and judges have increasingly turned to statisticians to help weigh evidence and determine reasonable doubt. Universities employ statisticians for teaching and research. Many statisticians engage in private consulting practice.