Sampling Distribution: Difference Between Proportions

Statistics problems often involve comparisons between sample proportions from two independent populations. This lesson describes the sampling distribution for the difference between sample proportions and explains how to compute the standard error for the difference between sample proportions.

Sampling Distribution: Difference Between Proportions

Suppose we have two populations with proportions equal to P₁ and P₂. Suppose further that we take all possible samples of size n₁ and n₂. And finally, suppose that the following assumptions are valid.

The samples from each population are big enough to justify using a normal distribution to model differences between proportions. The sample sizes will be big enough when the following conditions are met: n₁P₁ > 10, n₁(1 - P₁) > 10, n₂P₂ > 10, and n₂(1 - P₂) > 10. (This criterion requires that at least 20 observations be sampled from each population. The farther P₁ or P₂ is from 0.5, the more observations are required.)
The samples are independent; that is, observations selected from population 1 are not affected by observations selected from population 2, and vice versa.
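
The sample-size conditions above are easy to check mechanically. Below is a minimal Python sketch; the function name `normal_approximation_ok` and its `threshold` parameter are illustrative choices, not part of the lesson, and the second call uses made-up values chosen to fail the check.

```python
def normal_approximation_ok(n1, P1, n2, P2, threshold=10):
    """Return True when all four expected counts exceed the threshold."""
    # n*P and n*(1 - P) are the expected numbers of "successes" and
    # "failures" in each sample.
    expected_counts = [n1 * P1, n1 * (1 - P1), n2 * P2, n2 * (1 - P2)]
    return all(count > threshold for count in expected_counts)


# Two samples of 100 with proportions near 0.5 pass the check.
print(normal_approximation_ok(100, 0.52, 100, 0.47))

# A sample of 30 with P = 0.10 has only 3 expected successes, so it fails.
print(normal_approximation_ok(30, 0.10, 30, 0.50))
```

The check uses a strict inequality, matching the conditions as stated in this lesson.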

Given these assumptions, we know the following about the sampling distribution for the difference between sample proportions.

The sampling distribution for the difference between independent sample proportions will be approximately normally distributed. We know this from the central limit theorem.
The expected value of the difference between all possible sample proportions is equal to the difference between population proportions. Thus, E(p₁ - p₂) = P₁ - P₂.

Standard Deviation of Sampling Distribution

When population sizes are large relative to sample sizes, the standard deviation of the difference between sample proportions (σ_d) is approximately equal to:

σ_d = sqrt{ [P₁(1 - P₁) / n₁] + [P₂(1 - P₂) / n₂] }

It is straightforward to derive this equation, based on material covered in previous lessons. The derivation starts with a recognition that the variance of the difference between independent random variables is equal to the sum of the individual variances. Thus,

σ²_d = σ²_{p₁ - p₂} = σ²₁ + σ²₂

If the populations N₁ and N₂ are both large relative to n₁ and n₂, respectively, then

σ²₁ = P₁(1 - P₁) / n₁ and σ²₂ = P₂(1 - P₂) / n₂

In this context, a population is considered to be "large" relative to a sample if it is at least 20 times bigger than the sample.

Therefore,

σ²_d = [ P₁(1 - P₁) / n₁ ] + [ P₂(1 - P₂) / n₂ ]
and
σ_d = sqrt{ [ P₁(1 - P₁) / n₁ ] + [ P₂(1 - P₂) / n₂ ] }

Bottom line: We can use the formula above to compute the standard deviation of the sampling distribution for the difference between sample proportions if:

N₁ is large relative to n₁, and N₂ is large relative to n₂.
We know each sample size (n₁ and n₂).
We know the population proportions (P₁ and P₂).
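
One way to build intuition for the formula is to simulate many pairs of samples and compare the observed mean and standard deviation of p₁ - p₂ with P₁ - P₂ and σ_d. The sketch below is an illustration under stated assumptions, not part of the lesson: the helper names, the number of trials, and the seed are arbitrary, and each observation is drawn independently, which corresponds to populations that are very large relative to the samples.

```python
import math
import random
import statistics


def sigma_d(n1, P1, n2, P2):
    """Standard deviation of p1 - p2 from the formula in this lesson."""
    return math.sqrt(P1 * (1 - P1) / n1 + P2 * (1 - P2) / n2)


def simulate_differences(n1, P1, n2, P2, trials=5000, seed=0):
    """Draw `trials` pairs of independent samples and return each p1 - p2."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(trials):
        # Each observation is a "success" with probability P1 (or P2).
        p1 = sum(rng.random() < P1 for _ in range(n1)) / n1
        p2 = sum(rng.random() < P2 for _ in range(n2)) / n2
        diffs.append(p1 - p2)
    return diffs


diffs = simulate_differences(100, 0.52, 100, 0.47)
print("simulated mean:", statistics.mean(diffs), "theory:", 0.52 - 0.47)
print("simulated sd:  ", statistics.stdev(diffs), "theory:", sigma_d(100, 0.52, 100, 0.47))
```

With a few thousand trials, the simulated values should land close to the theoretical ones, though they will not match exactly.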

Standard Error of Sampling Distribution

Typically, we don't know the values for population parameters P₁ or P₂. And, if we don't know P₁ and P₂, we cannot compute the standard deviation of the difference between sample proportions (σ_d).

However, we can compute sample estimates of P₁ and P₂ from sample data. Substituting those estimates into the equation for σ_d, we get:

SE_d = sqrt{ [ p₁(1 - p₁) / n₁ ] + [ p₂(1 - p₂) / n₂ ] }

In this equation, p₁ is the sample estimate of P₁, p₂ is the sample estimate of P₂, and SE_d is a sample estimate of σ_d, the standard deviation of the difference between sample proportions. SE_d is the standard error of the difference between sample proportions.

Reminder: This formula for standard error assumes that N₁ is large relative to n₁, and N₂ is large relative to n₂.
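
As a rough illustration of the standard error formula, the Python sketch below computes SE_d from raw counts. The function name and the survey counts (55 of 100 and 45 of 100) are hypothetical values made up for this example; it also assumes each population is large relative to its sample.

```python
import math


def standard_error_of_difference(x1, n1, x2, n2):
    """Estimate SE_d from success counts x1, x2 and sample sizes n1, n2."""
    p1 = x1 / n1  # sample estimate of P1
    p2 = x2 / n2  # sample estimate of P2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


# Hypothetical survey results: 55 of 100 in sample 1, 45 of 100 in sample 2.
se = standard_error_of_difference(55, 100, 45, 100)
print(round(se, 4))
```

The only difference from the σ_d formula is that the sample proportions p₁ and p₂ stand in for the unknown population proportions.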

In future lessons, you will see that being able to compute the standard error from sample data is essential for inferential statistics. It will allow us to compute confidence intervals for the difference between proportions and to test hypotheses about the difference between proportions.

Summary of Key Points

The key takeaways from this lesson are summarized below.

The sampling distribution for the difference between sample proportions will be approximately normally distributed when:
- The samples from each population are large (at least 20 when population proportions are close to 0.5; and larger when proportions are more extreme).
- The samples are independent.
If each population is large relative to its sample, the standard error of the sampling distribution can be computed from the following formula:
SE_d = sqrt{ [ p₁(1 - p₁) / n₁ ] + [ p₂(1 - p₂) / n₂ ] }
A population is considered "large" if it is at least 20 times bigger than its sample.

Difference Between Proportions: Sample Problem

In this section, we work through a sample problem to show how to apply the theory presented above. In this example, we will use Stat Trek's Normal Distribution Calculator to compute probabilities.

Normal Distribution Calculator

The normal calculator solves common statistical problems, based on the normal distribution. The calculator computes cumulative probabilities, based on three simple inputs. Simple instructions guide you to an accurate solution, quickly and easily. If anything is unclear, frequently-asked questions and sample problems provide straightforward explanations. The calculator is free. It can be found in the Stat Trek main menu under the Stat Tools tab.

Sample Problem

In one state, 52% of the voters are Republicans, and 48% are Democrats. In a second state, 47% of the voters are Republicans, and 53% are Democrats. Suppose 100 voters are surveyed from each state. Assume the survey uses simple random sampling.

What is the probability that the survey will show a greater percentage of Republican voters in the second state than in the first state?

(A) 0.04
(B) 0.05
(C) 0.24
(D) 0.71
(E) 0.76

Solution

The correct answer is C. For this analysis, let P₁ = the proportion of Republican voters in the first state, P₂ = the proportion of Republican voters in the second state, p₁ = the proportion of Republican voters in the sample from the first state, and p₂ = the proportion of Republican voters in the sample from the second state. The number of voters sampled from the first state (n₁) = 100, and the number of voters sampled from the second state (n₂) = 100.

The solution involves four steps.

Make sure the samples from each population are big enough to model differences with a normal distribution. Because n₁P₁ = 100 * 0.52 = 52, n₁(1 - P₁) = 100 * 0.48 = 48, n₂P₂ = 100 * 0.47 = 47, and n₂(1 - P₂) = 100 * 0.53 = 53 are each greater than 10, the sample size is large enough.
Find the mean of the difference in sample proportions: E(p₁ - p₂) = P₁ - P₂ = 0.52 - 0.47 = 0.05.
Find the standard deviation of the difference.
σ_d = sqrt{ [ P₁(1 - P₁) / n₁ ] + [ P₂(1 - P₂) / n₂ ] }

σ_d = sqrt{[(0.52)(0.48) / 100] + [(0.47)(0.53) / 100]}

σ_d = sqrt(0.002496 + 0.002491)

σ_d = sqrt(0.004987) = 0.0706
Find the probability. This problem requires us to find the probability that p₁ is less than p₂. This is equivalent to finding the probability that p₁ - p₂ is less than zero. To find this probability, we need to transform the random variable (p₁ - p₂) into a z-score. That transformation appears below.
z_{p₁ - p₂} = (x - μ_{p₁ - p₂}) / σ_d
z_{p₁ - p₂} = (0 - 0.05) / 0.0706 = -0.7082

We can use Stat Trek's Normal Distribution Calculator to find the probability of a z-score being -0.7082 or less. We know that the z-score is -0.7082, the mean of a z-score is 0, and the standard deviation of a z-score is 1. We plug those numbers into the calculator.

The calculator tells us that the probability of finding a z-score less than -0.7082 is 0.23941. Therefore, the probability that the survey will show a greater percentage of Republican voters in the second state than in the first state is about 0.24.
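
The same calculation can be reproduced without the calculator. The sketch below is one possible approach, using `NormalDist` from Python's standard `statistics` module for the cumulative probability; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Population proportions of Republican voters and sample sizes from the problem.
P1, P2 = 0.52, 0.47
n1, n2 = 100, 100

# Mean and standard deviation of the sampling distribution of p1 - p2.
mean_d = P1 - P2
sigma_d = math.sqrt(P1 * (1 - P1) / n1 + P2 * (1 - P2) / n2)

# z-score for a difference of zero, then P(p1 - p2 < 0).
z = (0 - mean_d) / sigma_d
probability = NormalDist().cdf(z)

print(round(sigma_d, 4), round(z, 4), round(probability, 4))
```

Because this version does not round σ_d to 0.0706 before dividing, its z-score differs from -0.7082 in the fourth decimal place; the probability still rounds to about 0.24.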
