Teach yourself statistics

Sampling Distribution: Difference Between Proportions

Statistics problems often involve comparisons between sample proportions from two independent populations. This lesson describes the sampling distribution for the difference between sample proportions.

In this lesson, you'll learn how to find the mean of the sampling distribution, how to compute its standard deviation and standard error, and how to find the cumulative probability that the difference between sample proportions will be less than or equal to some constant, which we call C.

Sampling Distribution: Difference Between Proportions

Suppose we have two populations with proportions equal to P₁ and P₂. Suppose further that we take all possible simple random samples of size n₁ and n₂. And finally, suppose that the following assumptions are valid.

For each sample, population size is at least 20 times sample size. (Some sources accept a population that is only 10 times sample size.)
The samples from each population are big enough to justify using a normal distribution to model differences between proportions. The sample sizes will be big enough when the following conditions are met:
- n₁P₁ ≥ 10
- n₁(1 - P₁) ≥ 10
- n₂P₂ ≥ 10
- n₂(1 - P₂) ≥ 10
When P₁ and P₂ each equal 0.5, this criterion requires that at least 20 observations be sampled from each population. When P₁ or P₂ is more extreme than 0.5, even more observations are required.
The samples are independent; that is, observations selected from population 1 are not affected by observations selected from population 2, and vice versa.
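
The sample-size conditions above can be checked mechanically. The following is a minimal sketch, not part of the original lesson; the function name normal_approximation_ok is hypothetical, and it treats the threshold of 10 as inclusive, which matches the remark that 20 observations per population are enough when both proportions equal 0.5.

```python
def normal_approximation_ok(n1, P1, n2, P2, threshold=10):
    """Return True when all four expected counts reach the threshold.

    Assumption: the threshold is inclusive (count >= 10), so that
    n = 20 with P = 0.5 passes.
    """
    expected_counts = [
        n1 * P1,        # expected successes in sample 1
        n1 * (1 - P1),  # expected failures in sample 1
        n2 * P2,        # expected successes in sample 2
        n2 * (1 - P2),  # expected failures in sample 2
    ]
    return all(count >= threshold for count in expected_counts)


print(normal_approximation_ok(20, 0.5, 20, 0.5))  # True
print(normal_approximation_ok(20, 0.9, 20, 0.5))  # False
```

The second call fails because a proportion of 0.9 leaves only about 2 expected failures in a sample of 20, which illustrates why more extreme proportions need larger samples.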

Given these assumptions, we know the following about the sampling distribution for the difference between sample proportions.

The sampling distribution for the difference between independent sample proportions will be approximately normally distributed.
The expected value of the difference between all possible sample proportions is equal to the difference between population proportions. Thus, the mean of the sampling distribution for the difference between sample proportions is:
μ_d = E(p₁ - p₂) = P₁ - P₂
where μ_d is the mean of the sampling distribution, p₁ and p₂ are sample proportions, and P₁ and P₂ are population proportions.
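
One way to build intuition for this result is to simulate it. The sketch below is an illustration under stated assumptions rather than a proof: it draws each observation independently with a fixed success probability (which mimics a population that is very large relative to the sample), and the function name, the number of repetitions, and the seed are all arbitrary choices made for this example.

```python
import random
from statistics import mean


def simulate_differences(P1, n1, P2, n2, repetitions=2000, seed=1):
    """Simulate p1 - p2 for repeated pairs of independent samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    differences = []
    for _ in range(repetitions):
        # Each observation is a success with probability P1 (or P2).
        p1 = sum(rng.random() < P1 for _ in range(n1)) / n1
        p2 = sum(rng.random() < P2 for _ in range(n2)) / n2
        differences.append(p1 - p2)
    return differences


diffs = simulate_differences(0.52, 100, 0.47, 100)
# The average simulated difference should land close to P1 - P2 = 0.05.
print(round(mean(diffs), 3))
```

With more repetitions, the average of the simulated differences should settle closer to P₁ - P₂.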

Standard Deviation of Sampling Distribution

When population sizes are large relative to sample sizes, the standard deviation of the difference between sample proportions (SD) is approximately equal to:

SD = sqrt{ [P₁(1 - P₁) / n₁] + [P₂(1 - P₂) / n₂] }

It is straightforward to derive this equation, based on material covered in previous lessons. The derivation starts with a recognition that the variance of the difference between independent random variables is equal to the sum of the individual variances. Thus,

σ²_d = σ²_(p₁ - p₂) = σ²₁ + σ²₂

If the populations N₁ and N₂ are both large relative to n₁ and n₂, respectively, then

σ²₁ = P₁(1 - P₁) / n₁ and σ²₂ = P₂(1 - P₂) / n₂

In this context, a population is considered to be "large" relative to a sample if it is at least 20 times bigger than the sample.

Therefore,

σ²_d = [ P₁(1 - P₁) / n₁ ] + [ P₂(1 - P₂) / n₂ ]
And
SD = sqrt{ [ P₁(1 - P₁) / n₁ ] + [ P₂(1 - P₂) / n₂ ] }

Bottom line: We can use the formula above to compute the standard deviation of the sampling distribution for the difference between sample proportions if:

N₁ is large relative to n₁, and N₂ is large relative to n₂.
We know each sample size (n₁ and n₂).
We know the population proportions (P₁ and P₂).
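
The SD formula translates directly into code. This is a minimal sketch; the function name sd_difference is hypothetical, and the inputs in the example (proportions of 0.5 and samples of 50) are made-up values chosen so the arithmetic is easy to follow by hand.

```python
from math import sqrt


def sd_difference(P1, n1, P2, n2):
    """Standard deviation of p1 - p2, given known population proportions.

    Assumes each population is large relative to its sample.
    """
    variance_1 = P1 * (1 - P1) / n1  # variance of sample proportion 1
    variance_2 = P2 * (1 - P2) / n2  # variance of sample proportion 2
    return sqrt(variance_1 + variance_2)  # variances add for independent samples


# 0.25/50 + 0.25/50 = 0.01, and sqrt(0.01) = 0.1
print(round(sd_difference(0.5, 50, 0.5, 50), 4))  # 0.1
```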

Standard Error of Sampling Distribution

Typically, we don't know the values for population parameters P₁ or P₂. And, if we don't know P₁ and P₂, we cannot compute the standard deviation of the difference between sample proportions (SD).

However, we can compute sample estimates of P₁ and P₂ from sample data. Substituting those estimates into the equation for SD, we get:

SE = sqrt{ [ p₁(1 - p₁) / n₁ ] + [ p₂(1 - p₂) / n₂ ] }

In this equation, p₁ is the sample estimate of P₁, p₂ is the sample estimate of P₂, and SE is a sample estimate of SD, the standard deviation of the difference between sample proportions. SE is the standard error of the difference between sample proportions.

Reminder: This formula for standard error assumes that N₁ is large relative to n₁, and N₂ is large relative to n₂.
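
In practice, the sample proportions come from observed counts. The sketch below shows one way to compute SE from raw counts; the function name se_difference and the counts (55 of 100 and 45 of 100) are hypothetical values used only for illustration.

```python
from math import sqrt


def se_difference(successes1, n1, successes2, n2):
    """Standard error of p1 - p2, estimated from sample counts.

    Assumes each population is large relative to its sample.
    """
    p1 = successes1 / n1  # sample estimate of P1
    p2 = successes2 / n2  # sample estimate of P2
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical data: 55 successes out of 100, and 45 out of 100.
se = se_difference(55, 100, 45, 100)
print(round(se, 4))
```

The only difference from the SD computation is that the sample proportions p₁ and p₂ stand in for the unknown population proportions.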

In future lessons, you will see that being able to compute the standard error from sample data is essential for inferential statistics. It will allow us to compute confidence intervals for the difference between proportions and to test hypotheses about the difference between proportions.

How to Find Probability

When the sampling distribution for the difference between sample proportions is approximately normal in shape, you can use the normal distribution to find a cumulative probability for any difference in independent sample proportions. Specifically, you can find:

P(p₁ - p₂ ≤ C)

where p₁ is a sample proportion from population 1, p₂ is a sample proportion from population 2, and C is a constant. Finding the probability that the difference between sample proportions will be no greater than the constant C is a four-step process:

Step 1: Find Mean of Sampling Distribution

When the sampling distribution is approximately normal in shape, it will be symmetric and centered over the difference between population proportions. Therefore, the mean of the sampling distribution of a difference between two independent sample proportions will equal:

μ_d = P₁ - P₂

where μ_d is the mean of the sampling distribution, P₁ is population proportion for population 1, and P₂ is population proportion for population 2.

Step 2: Find Standard Deviation

Earlier in this lesson (see above), we explained how to compute standard deviation of the sampling distribution when you know the population proportion. And we showed how to estimate the standard deviation with the standard error when you don't know the population proportion. When population size is big relative to sample size, you can use these formulas to find standard deviation and standard error:

SD = sqrt{ [ P₁(1 - P₁) / n₁ ] + [ P₂(1 - P₂) / n₂ ] }

SE = sqrt{ [ p₁(1 - p₁) / n₁ ] + [ p₂(1 - p₂) / n₂ ] }

where SD is the standard deviation of the sampling distribution, SE is the standard error, P₁ and P₂ are independent population proportions, p₁ and p₂ are sample estimates of the population proportions, n₁ is sample size from population 1, and n₂ is sample size from population 2.

Step 3: Transform C Into z-Score

If you know the standard deviation of the sampling distribution, compute a z-score using this formula:

z = (C – μ_d) / SD

If you know the standard error, use this formula:

z = (C – μ_d) / SE

where C is the value of a constant for which we want to find a probability, μ_d is the mean of the sampling distribution, SD is the standard deviation of the sampling distribution, and SE is the standard error of the sampling distribution.
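
Both z-score formulas have the same shape, so a single helper can serve either case. This is a small sketch; the function name z_score and the parameter name spread (which stands for SD or SE, whichever is available) are hypothetical, as are the example inputs.

```python
def z_score(C, mu_d, spread):
    """Convert the constant C into a z-score.

    spread is SD when population proportions are known,
    or SE when they are estimated from sample data.
    """
    if spread <= 0:
        raise ValueError("spread must be positive")
    return (C - mu_d) / spread


# Hypothetical inputs: C = 0.1, mean 0.05, spread 0.1.
print(round(z_score(0.1, 0.05, 0.1), 4))  # 0.5
```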

Step 4: Find Probability

Find the cumulative probability for the z-score you calculated in Step 3. This is the probability that the difference between two independent sample proportions will be no greater than the constant C.

You can find the probability for the z-score from a handheld graphing calculator, from a written probability table commonly found in the appendix of introductory statistics texts, or from an online probability calculator, like Stat Trek's normal distribution calculator.
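
A programming language can stand in for the calculator or table as well. The sketch below uses NormalDist from Python's standard statistics module to look up the cumulative probability for a z-score; the wrapper name cumulative_probability is hypothetical.

```python
from statistics import NormalDist


def cumulative_probability(z):
    """Probability that a standard normal variable is <= z."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(z)


# Half of the standard normal distribution lies at or below z = 0.
print(round(cumulative_probability(0.0), 4))  # 0.5
```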

Test Your Understanding

In this section, we work through a sample problem to show how to apply the guidelines presented above. For this problem, we will use Stat Trek's Normal Distribution Calculator to compute probability.

Problem 1

In one state, 52% of the voters are Republicans, and 48% are Democrats. In a second state, 47% of the voters are Republicans, and 53% are Democrats. Suppose 100 voters are surveyed from each state. Assume the survey uses simple random sampling.

What is the probability that the survey will show a greater percentage of Republican voters in the second state than in the first state?

Solution:

This problem satisfies the conditions that allow us to assume the sampling distribution is approximately normal.

For each sample, population size is at least 20 times sample size.
The sampling method is simple random sampling.
For each sample, n * P ≥ 10, where P is the population proportion (100 * 0.52 = 52 and 100 * 0.47 = 47).
For each sample, n * (1 - P) ≥ 10 (100 * 0.48 = 48 and 100 * 0.53 = 53).

Therefore, we can use the four-step solution to find probability.

Step 1. Find the mean of the sampling distribution. In the first state, 52% of voters are Republican; and in the second state, 47% of voters are Republican. Therefore, the mean of the sampling distribution (μ_d) is:
μ_d = P₁ - P₂

μ_d = 0.52 - 0.47 = 0.05
Step 2. Find the standard deviation of the sampling distribution. Since we know population proportions, we can compute the standard deviation, rather than estimate it with standard error. The standard deviation of the sampling distribution is:
SD = sqrt{ [ P₁(1 - P₁) / n₁ ] + [ P₂(1 - P₂) / n₂ ] }

SD = sqrt{ [ (0.52)(1 - 0.52) / 100 ] + [ (0.47)(1 - 0.47) / 100 ] }

SD = sqrt{ [ 0.002496 ] + [ 0.002491] } = 0.0706
Step 3. Transform C into a z-score. This problem requires us to find the probability that p₁ is less than p₂. This is equivalent to finding the probability that p₁ - p₂ is less than zero. Therefore, for this problem, the constant C for which we want to find a cumulative probability is zero; and the z-score formula is:
z = (C - μ_d)/SD = (0 - 0.05)/0.0706 = -0.7082
Step 4. Find the probability. To find this probability, we use Stat Trek's Normal Distribution Calculator. Specifically, we enter the following inputs: -0.7082, for the z-score; 0, for the mean; and 1, for the standard deviation. (It is not necessary to compute the mean or standard deviation of the z-score, because every z-score has a mean of 0 and a standard deviation of 1.)

The calculator tells us that the probability of finding a z-score less than -0.7082 is 0.23941. Therefore, the probability that the survey will show a greater percentage of Republican voters in the second state than in the first state is about 0.24.
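
As a cross-check, the four steps for this problem can be reproduced in a few lines. This is a sketch using Python's standard library rather than the calculator; the variable names are our own. Because the code carries SD at full precision instead of rounding it to 0.0706 before dividing, the z-score differs slightly from -0.7082 in the fourth decimal place, but the probability still rounds to about 0.24.

```python
from math import sqrt
from statistics import NormalDist

P1, P2 = 0.52, 0.47  # population proportions of Republican voters
n1 = n2 = 100        # voters surveyed in each state
C = 0                # we want P(p1 - p2 < 0)

# Step 1: mean of the sampling distribution
mu_d = P1 - P2

# Step 2: standard deviation of the sampling distribution
sd = sqrt(P1 * (1 - P1) / n1 + P2 * (1 - P2) / n2)

# Step 3: z-score for C
z = (C - mu_d) / sd

# Step 4: cumulative probability for that z-score
prob_via_z = NormalDist().cdf(z)

# Equivalent shortcut: evaluate the sampling distribution itself at C
prob_direct = NormalDist(mu=mu_d, sigma=sd).cdf(C)

print(round(mu_d, 2), round(sd, 4), round(prob_via_z, 2))
```

The shortcut on the last computation line skips the z-score entirely, since a normal distribution with mean μ_d and standard deviation SD evaluated at C gives the same cumulative probability.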
