Teach yourself statistics

How to Compare Datasets

Common graphical displays (e.g., dotplots, boxplots, stemplots, bar charts, histograms) can be effective tools for comparing data from two or more datasets.

Four Ways to Describe Data Sets

When you compare two or more datasets, focus on four features:

Center. Graphically, the center of a distribution is the point where about half of the observations are on either side.
Spread. The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.
Shape. The shape of a distribution is described by symmetry, skewness, number of peaks, etc.
Unusual features. Unusual features refer to gaps (areas of the distribution where there are no observations) and outliers.

The remainder of this lesson shows how to use various graphs to compare datasets in terms of center, spread, shape, and unusual features. (This is a skill that students are expected to master for the Advanced Placement Statistics Exam.)

Dotplots

When dotplots are used to compare datasets, they are positioned one above the other, using the same scale of measurement, as shown below.

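[Figure: dotplots of pets per household for Block A and Block B, drawn one above the other on a shared axis]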

The dotplots show pet ownership in homes on two city blocks. Here's how to interpret the dotplots. Each dot represents a household. As shown in the plots, Block A and Block B both have 15 dots. That means each block has 15 households. The numbers along the axis represent the number of pets owned by a household.

Pet ownership is a little lower in Block A. In Block A, most households have zero or one pet; in Block B, most households have two or more pets. In Block A, pet ownership is skewed right; in Block B, it is roughly bell-shaped. In Block B, pet ownership ranges from 0 to 6 pets per household versus 0 to 4 pets in Block A; so there is more variability in the Block B distribution. There are no outliers or gaps in either dataset.

Note: You can count the number of pets in each block. Block A has 5 households with 0 pets, 4 households with 1 pet, 3 households with 2 pets, 2 households with 3 pets, and 1 household with 4 pets - 20 pets in all. Block B has 2 households with 0 pets, 3 households with 1 pet, 4 households with 2 pets, 3 households with 3 pets, 1 household with 4 pets, 1 household with 5 pets, and 1 household with 6 pets - 35 pets in all. So, Block B has more pets than Block A - even though both blocks have the same number of households.
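The counts in the note above can be checked with a few lines of Python. This is a minimal sketch that uses only the frequencies listed in the note; the names `block_a_counts`, `block_b_counts`, `expand`, and `summarize` are illustrative choices, not part of the lesson.

```python
from statistics import mean, median

# Frequency tables from the note: {pets per household: number of households}
block_a_counts = {0: 5, 1: 4, 2: 3, 3: 2, 4: 1}
block_b_counts = {0: 2, 1: 3, 2: 4, 3: 3, 4: 1, 5: 1, 6: 1}


def expand(counts):
    """Turn a frequency table into one value per household."""
    return [pets for pets, households in sorted(counts.items())
            for _ in range(households)]


def summarize(counts):
    """Return basic measures of center and spread for one block."""
    values = expand(counts)
    return {
        "households": len(values),
        "total_pets": sum(values),
        "median": median(values),              # center
        "mean": round(mean(values), 2),        # center (sensitive to skew)
        "range": max(values) - min(values),    # spread
    }


summary_a = summarize(block_a_counts)
summary_b = summarize(block_b_counts)
print("Block A:", summary_a)
print("Block B:", summary_b)
```

Both the median and the range come out larger for Block B, which agrees with the comparison read off the dotplots.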

Back-to-Back Stemplots

Back-to-back stemplots are another graphic option for comparing data from two groups. The center of a back-to-back stemplot consists of a column of stems, with a vertical line on each side. Leaves representing one dataset extend to the right of the stems, and leaves representing the other dataset extend to the left.

         Boys | Stem | Girls
            7 |  0   |
            1 |  1   | 1
        1 4 6 |  2   | 2 6 8
        4 5 8 |  3   | 3 4 4 6 6 8 9
  1 2 2 2 8 9 |  4   | 4 3 6
      3 4 7 9 |  5   | 4
        2 5 8 |  6   |
          1 3 |  7   |

The back-to-back stemplot above shows the amount of cash (in dollars) carried by a random sample of teenage boys and girls. The boys carried more cash than the girls - a median of $42 for the boys versus $36 for the girls. Both distributions were roughly bell-shaped, although there was more variation among the boys. And finally, there were neither gaps nor outliers in either group.
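The medians quoted above can be recovered by turning each stem-and-leaf pair back into a dollar amount. The sketch below assumes that each stem is the tens digit and each leaf the ones digit, and that the girls' leaves begin at stem 1; that placement is an assumption, chosen because it reproduces the stated median of $36. The helper name `stemplot_values` is illustrative.

```python
from statistics import median

# Leaves per stem, copied from the stemplot (stem = tens digit, leaf = ones digit).
boys = {
    0: [7], 1: [1], 2: [1, 4, 6], 3: [4, 5, 8],
    4: [1, 2, 2, 2, 8, 9], 5: [3, 4, 7, 9], 6: [2, 5, 8], 7: [1, 3],
}
# Assumption: the girls' first row of leaves belongs to stem 1.
girls = {1: [1], 2: [2, 6, 8], 3: [3, 4, 4, 6, 6, 8, 9], 4: [4, 3, 6], 5: [4]}


def stemplot_values(stems):
    """Rebuild the sorted list of observations from a stem -> leaves mapping."""
    return sorted(10 * stem + leaf for stem, leaves in stems.items() for leaf in leaves)


boys_cash = stemplot_values(boys)
girls_cash = stemplot_values(girls)

boys_median = median(boys_cash)
girls_median = median(girls_cash)
boys_range = max(boys_cash) - min(boys_cash)
girls_range = max(girls_cash) - min(girls_cash)

print("Boys: median", boys_median, "range", boys_range)
print("Girls: median", girls_median, "range", girls_range)
```

Under these assumptions the boys' range is wider than the girls', which is one way to see the greater variation among the boys.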

Parallel Boxplots

With parallel boxplots (aka, side-by-side boxplots), data from two groups are displayed on the same chart, using the same measurement scale.

[Figure: parallel boxplots for the control group and the treatment group, drawn on a shared axis with tick marks from 2 to 16]

The boxplots above summarize results from a medical study. The treatment group received an experimental drug to relieve cold symptoms, and the control group received a placebo. Each boxplot shows the number of days that group continued to report symptoms.

Neither boxplot reveals unusual features, such as gaps or outliers. Both distributions are skewed to the right, although the skew is more prominent in the treatment group. The range of patient response was the same in both groups: cold symptoms lasted 1 to 15 days (range = 14) in the treatment group versus 3 to 17 days (range = 14) in the control group. The median recovery time is more telling: about 5 days for the treatment group versus about 9 days for the control group. It appears that the drug may have had a positive effect on patient recovery.
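The range and skew comparisons above can be reproduced from the minimum, median, and maximum quoted in the text. The sketch below is a rough heuristic, not a full boxplot analysis: it ignores the quartiles (which the text does not give), treats the approximate medians of 5 and 9 days as exact, and calls a distribution right-skewed when the maximum lies farther from the median than the minimum does. The function names are illustrative.

```python
def tail_lengths(minimum, med, maximum):
    """Distance from the median down to the minimum and up to the maximum."""
    return med - minimum, maximum - med


def skew_direction(minimum, med, maximum):
    """Rough skew call based only on how far each extreme lies from the median."""
    lower, upper = tail_lengths(minimum, med, maximum)
    if upper > lower:
        return "right"
    if upper < lower:
        return "left"
    return "symmetric"


# (minimum, approximate median, maximum) in days, as quoted in the text
groups = {"treatment": (1, 5, 15), "control": (3, 9, 17)}

results = {}
for name, (lo, med, hi) in groups.items():
    lower, upper = tail_lengths(lo, med, hi)
    results[name] = {
        "range": hi - lo,
        "skew": skew_direction(lo, med, hi),
        "upper_to_lower": upper / lower,  # larger ratio = more pronounced right skew
    }
    print(name, results[name])
```

With these inputs, both groups have a range of 14 days, both lean right, and the ratio is larger for the treatment group, matching the description above.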

Double Bar Charts

A double bar chart is similar to a regular bar chart, except that it provides two pieces of information for each category rather than just one. Often, the charts are color-coded with a different colored bar representing each piece of information.

The double bar chart above shows customer satisfaction ratings for different cars, broken out by gender. The blue bars represent males; the red bars, females.

Both groups prefer the Japanese cars to the American cars, with Honda receiving the highest ratings and Ford receiving the lowest ratings. Moreover, both genders agree on the rank order in which the cars are rated. As a group, the men seem to be tougher raters; they gave lower ratings to each car than the women gave.

Back-to-Back Histograms

A back-to-back histogram (also known as a side-by-side histogram or mirrored histogram) is a special type of histogram used to compare the distributions of two related datasets. Here are the key features of a back-to-back histogram.

The histogram consists of two mirrored bars extending left and right from a central axis (usually zero).
One group's data is plotted to the left (negative side), and the other group's data is plotted to the right (positive side).
The x-axis shows the frequency or relative frequency (i.e., percentage or proportion), while the y-axis shows the variable's bins.

The back-to-back histogram shown below compares weights for college and professional football players:

[Figure: back-to-back histogram comparing the weights of college and professional football players]

Like any back-to-back histogram, this chart easily highlights similarities and differences in distributions. For example, this chart tells us the following:

Each distribution is reasonably symmetric around its midpoint (although the distribution for college players is slightly skewed).
The midpoint for professionals is between 240 and 259 pounds; the midpoint for college players is a little lower, between 220 and 239 pounds.
The range is the same for both distributions.
There are no gaps or outliers in either distribution.
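The mirrored layout described in this section can be sketched as plain text. The football weights are not listed in the lesson, so this example reuses the pet counts for Block A and Block B from the dotplot note, treating each number of pets as a bin. The function name `back_to_back_rows` and the `#` marker are illustrative choices.

```python
def back_to_back_rows(left_counts, right_counts, bin_labels, mark="#"):
    """Build text rows: left bars grow leftward from the axis, right bars rightward."""
    width = max(left_counts)  # pad left bars so they all end at the central axis
    label_width = max(len(str(label)) for label in bin_labels)
    rows = []
    for left, right, label in zip(left_counts, right_counts, bin_labels):
        left_bar = (mark * left).rjust(width)
        center = str(label).rjust(label_width)
        rows.append(f"{left_bar} | {center} | {mark * right}")
    return rows


pets_per_household = [0, 1, 2, 3, 4, 5, 6]   # bins (vertical axis)
block_a = [5, 4, 3, 2, 1, 0, 0]              # frequencies plotted to the left
block_b = [2, 3, 4, 3, 1, 1, 1]              # frequencies plotted to the right

rows = back_to_back_rows(block_a, block_b, pets_per_household)
print("\n".join(rows))
```

Each row is one bin; the length of the bar on either side of the central column is that group's frequency, so differences in center and spread show up as bars that sit higher or lower, or that extend over more rows.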

Test Your Understanding

Problem

The back-to-back stemplot below shows the number of books read in a year by a random sample of college and high school students.

      College | Stem | High school
            7 |  0   |
        3 6 6 |  1   | 0 0 3 5
      1 2 3 4 |  2   | 1 2 4 4 6
      6 8 8 9 |  3   | 1 8 9
          2 8 |  4   | 0 1
              |  5   |
              |  6   |
            3 |  7   |

Which of the following statements are true?

I. Seven college students did not read any books.
II. The college median is equal to the high school median.
III. The mean is greater than the median in both groups.

(A) I only
(B) II only
(C) III only
(D) I and II
(E) II and III

Solution

The correct answer is (E). Statement I is false: the 7 beside stem 0 means that one college student read seven books, which is the fewest read by any college student, so none of the college students failed to read a book during the year. Statement II is true: in both groups, the median is equal to 24. Statement III is also true: the mean number of books read per year is 25.3 for high school students versus 30.4 for college students, so the mean is greater than the median in both groups.
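The three statements can be checked numerically. The sketch below assumes each stem is the tens digit and each leaf the ones digit, with the college leaves on stems 0 to 4 and 7 and the high school leaves on stems 1 to 4; that placement of the empty stems is an assumption, chosen because it reproduces the medians and means quoted in the solution. The helper name `stemplot_values` is illustrative.

```python
from statistics import mean, median

# Leaves per stem (stem = tens digit, leaf = ones digit).
# Assumption: empty stems are placed so the results match the quoted solution.
college = {0: [7], 1: [3, 6, 6], 2: [1, 2, 3, 4], 3: [6, 8, 8, 9], 4: [2, 8], 7: [3]}
high_school = {1: [0, 0, 3, 5], 2: [1, 2, 4, 4, 6], 3: [1, 8, 9], 4: [0, 1]}


def stemplot_values(stems):
    """Rebuild the sorted list of observations from a stem -> leaves mapping."""
    return sorted(10 * stem + leaf for stem, leaves in stems.items() for leaf in leaves)


college_books = stemplot_values(college)
hs_books = stemplot_values(high_school)

# Statement I: did seven college students read zero books?
statement_1 = college_books.count(0) == 7
# Statement II: are the two medians equal?
statement_2 = median(college_books) == median(hs_books)
# Statement III: is the mean greater than the median in both groups?
statement_3 = (mean(college_books) > median(college_books)
               and mean(hs_books) > median(hs_books))

print("I:", statement_1, "II:", statement_2, "III:", statement_3)
print("College mean:", round(mean(college_books), 1))
print("High school mean:", round(mean(hs_books), 1))
```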
