Comparing Groups

Introduction

Goals

This tutorial will explore:

A Review of visualizing proportions using a 100% stacked barplot
Comparing proportions using a z-test for proportions
OPTIONAL: Comparing means using a t-test for means
OPTIONAL: Comparing 3 or more means using ANOVA.
OPTIONAL: Paired mean comparisons

One-sample Inference

In this tutorial, we will proceed straight to multi-group comparison inference (comparing two means, comparing two proportions, and comparing 3+ means). These methods are more widely used in application.

For information on completing a one-sample test or interval in R, feel free to check the resources linked below, but you won’t need them for this class!

Visualizing Proportions

Pima Health Data

The Pima.tr dataset contains data on 200 women of Pima Indian heritage living near Phoenix, Arizona. As part of a Health study on this population, medical researchers gathered data for several health-related variables.

The Data

This data is one of many inside the MASS package. Take a look at the data here!

library(MASS)
Pima.tr

Posing a question

One of the variables collected was whether the participant was considered “diabetic” according to World Health Organization criteria (represented as the variable type). Researchers are curious if the proportion of PIMA women who are diabetic is different depending on whether that woman had been pregnant in the past before

Guiding Question: Are women with a past pregnancy more likely to be diabetic than women who have not been pregnant?

Creating our predictor variable

The pregnancy variable indicates women who have been pregnant at least once (“Yes”) and women who have not been pregnant before (“No”).

Let’s first look at an unstacked barplot to compare these groups in terms of diabetic status.

ggplot(data = Pima.tr, aes(x = pregnancy, fill = type)) +
  geom_bar(color = "black", width = 0.5) +
  labs(x = "At least 1 pregnancy", fill = "Diabetic")

Visualizing Proportionally

Let’s now make a visual that makes the proportional comparison easier to see. Create the barplot again, but this time, as a stacked barplot with position = "fill"

ggplot(data = Pima.tr, aes(x = pregnancy, fill = type)) +
  geom_bar(position = "fill", color = "black", width = 0.5) +
  labs(x = "At least 1 pregnancy", y = "Proportion", fill = "Diabetic")

Theme Note

We will talk about this in a coming tutorial, but you can add general plot themes in ggplot to change the background paneling and layout.

I mention this here because I thinktheme_hc() looks especially nice with stacked barplots. Just note that you may need to install and library the ggthemes package to use this particular theme option.

ggplot(data = Pima.tr, aes(x = pregnancy, fill = type)) +
  geom_bar(position = "fill", color = "black", width = 0.4) +
  labs(x = "At least 1 pregnancy", y = "Proportion", fill = "Diabetic") +
  theme_hc()

Testing Proportions

How to test for a difference

To see if the difference can be explained by random chance or not, we’ll need a two-sample test for proportions. We can use the prop.test function for that.

Using prop.test

The prop.test function in R is a straightforward choice for this comparison. It takes two arguments:

x: the number of cases with the criteria of interest
n the total number of cases in each group.

Comparing Professor Arden and and Professor Bernard

Before exploring the Pima data question, let’s start with a simpler example.

Let’s say that I want to know if the proportion of “A’s” between Professor Arden’s class and Professor Bernard’s class might be different or not.

100 people took the course with professor Arden this semester, and 50 of them got an “A.”
90 people took the course with professor Bernard the same semester, and 35 of them got an “A.”

Testing that Difference

READ THIS

To apply this information to our prop.test function, we first identify the number of A’s in each group we are comparing–those go in the x vector first.

Then we identify the size of each group–those go in the n vector and should go in the same order as you did the first one.

prop.test(x = c(50, 35),
          n = c(100,90))

Making sense of that code

Our first vector is telling us that Professor Arden had 50 A’s, while Professor Bernard had 35 A’s.

The second vector is telling us that there are a total of 100 students in Professor Arden’s class, and 90 students in Professor Bernard’s class.

Notice that order matters there. If Professor Arden has 50 A’s out of a group size of 100, make sure 50 and 100 both come first or both come second in their respective vectors!

Frequency Table Needed

Notice that in the Pima.tr dataset, each row is either a Yes or No, but I don’t have the frequency totals provided for me to plug into prop.test. I need to create a frequency table first.

We can do that with the table function!

Frequency Table Example

First, we write table, and inside the parentehses, we identify the two vectors that we want to cross-tabulate. Since each variable has two categories, we’ll end up with a 2x2 contingency table.

Note that this is a base R function, so we do need to specify the dataframe that each variable comes from here as well!

table(Pima.tr$pregnancy, Pima.tr$type)

Optional: Adding totals

It may be helpful to see row and column totals. If wanting to do that, just add the function, addmargins around your table function. Or if you assign your table a name like in that last example, just run addmargins(table_name)

addmargins(table(Pima.tr$pregnancy, Pima.tr$type))

Testing for a Difference in Proportions - Table

Now, how do we take this information to plug into prop.test?

First, remember what our response variable is. In this situation, it’s our type (is patient diabetic or not) and the column labeled yes represents those who are diabetic.
Second, recognize that the two values in the type “Yes” column represent the two values we’re placing in our x vector–number of cases where criteria of interest (diabetic status) is true.
Third, figure out the group totals for each of the pregnant categories. What total number of cases are “Yes” for pregnant, and what total number of cases are “No” for pregnant?

Plugging them in

Please be careful to read this and think about the process of extracting numbers from the frequency table to here. Many students make a mistake here.

There were 12 diabetics in the no pregnancy group (12+16 = 28 total in that no pregnancy group)
There were 56 diabetics in the past pregnancy group (56 + 116 = 172 total in that past pregnancy group)

prop.test(x = c(12, 56), 
          n = c(28, 172))

Interpreting Results

Notice at the bottom, it provides the sample proportions for each group.

The proportion of never pregnant women with diabetes is 0.429
The proportion of previously pregnant women with diabetes was 0.326.

At the top, we find a p-value.

Based on a p-value of 0.3943, we have little to no evidence of a difference in diabetic rates between these two groups. This difference could be explained as random chance here.

We also get a 95% confidence interval for the true difference in proportions. That may or may not be interesting depending on the context. In some cases, we might instead wish to calculate the relative risk (or odds ratio) value and confidence interval.

Directional Alternative?

If this happened to be a situation where we only found a result in one direction to be noteworthy, we could add a directional indication on the test.

If we wished to indicate that, we would need to add a 3rd argument:

alternative: can be set to “greater” “less” or “equal” where equal would be the default choice when we don’t define it.

Professor Example

Let’s consider this alternative with the Professor Arden vs. Bernard question

Professor Arden and Bernard are using the same Final Exam, but Professor Arden had a tutoring center option for their students, while Professor Bernard did not.
As a result, we would only find it noteworthy if Professor Arden’s students were more likely to get A’s.

Adding it in the code

To decide whether to write “less” or “greater”, we need to think about which group is listed first. Since we have professor Arden’s numbers listed first, then our alternative will be that Professor Arden’s proportion is “greater” than Professor Berard’s proportion.

Professor Arden prop > Professor Bernard prop

prop.test(x = c(50, 35), 
          n = c(100, 90), 
          alternative = "greater")

Interpreting

We see a p-value of 0.08, suggesting mild evidence (but not particularly strong evidence) that Professor Arden’s class results in a higher proportion of A’s compared to Professor Bernard.

NOTE: When choosing a directional option, ignore the confidence interval output–it represents something different!

Visualizing/Testing Means (Optional)

Proportions vs. Means

Remember that means are measures we might take from numeric variables. Proportions are measures we use with categorical variables. Let’s next consider testing involving a comparison of two means!

Should we do a t-test?

Remember that when our sample size is not particularly large, we need a “t-correction” to account for our estimate of the standard deviation using only our sample data. When our sample size is fairly large though, the difference between a t-test and z-test is negligible.

When using software, it’s fine to just always do a t-test for that reason, since the t-test results will correctly converge toward the results from a z-test!

Two-sample t-test

Consider the birthwt dataset from the MASS package.

birthwt

One question we might investigate is whether mothers who reported smoking during the pregnancy (smoke_answer) were more likely to give birth to babies with a lower birth weight (bwt) on average.

ggplot(data = birthwt, aes(x = smoke_answer, y = bwt, fill = smoke_answer)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  labs(y = "Birth weight (g)", x = "Smoking Status")

Is a Pooled Method Appropriate?

Visually, it appears that the variance of each group is similar. But when using software, an unpooled method is the “safer” choice. We will always choose an unpooled method when using software in our class!

t.test

We will use the t.test function for the comparison. This test takes the following arguments

data: the data frame where our varaibles are located.
Simple entry of your response variable ~ predictor variable, with the ~ connecting the two!
var.equal: whether to assume equal variances. We will typically choose FALSE as the safer testing choice when there is no reason to assume variances of each population is equal.
alternative: again, an option to identify if we are doing a directional test.

Trying with the births example

For now, let’s start with a non-directional test.

The data frame we are working from is birthwt
The response variable we are comparing means from is bwt, and we are using smoke_answer as the grouping variable to compare means from.

Note: Putting the response variable first and grouping variable second is important!

t.test(data = birthwt, 
       bwt ~ smoke_answer, 
       var.equal = FALSE)

A non-directional output

Notice our sample means

Non-smoking mothers give birth to babies with an average weight of 3,055 grams
Smoking mothers give birth to babies with an average weight of 2,771 grams

Notice our p-value

Our p-value is 0.007, suggesting that we have very strong evidence for a difference in mean birth weight between these two populations. It’s unlikely we would see this much difference in our sample means by random chance.

Also of interest, we get a confidence interval for the true mean difference as well!

We are 95% confident that the true difference in mean birth weight between non-smoking birth mothers and smoking mothers to be between 78 grams and 488 grams.

Doing Directional Tests

Directional tests may be completed if we would only find a difference in one direction possible or noteworthy.

Perhaps we had a theory that smoking affected the pregnancy and caused decreased birth weight. If that’s the case, then alphabetically, our groups are No and Yes in that order (check the data frame again if you’d like to verify the category names here!).

Alternative: No (smoking) ___ Yes (smoking)

Putting in into code

If our theory says that smoking should increase the likelihood of premature/low weight birth, then our alternative is that weight for non-smoking > weight for smoking. So we would plug in greater!

t.test(data = birthwt, 
       bwt ~ smoke_answer, 
       var.equal = FALSE,
       alternative = "greater")

Making sense of a Directional result

Notice that the p-value is now half the size it was before. That reflects that if left up to random chance, the probability of finding a difference this high or higher in only one direction is half as likely as finding that much difference in either direction.

ANOVA (Optional)

What is ANOVA?

ANOVA stands for Analysis of Variance and is a statistical technique for modeling how one or more categorical predictors explain variability in a numeric response. In this tutorial, we will use its most simple application: comparing differences in means between 3 or more populations. This particular use is called “One-Way ANOVA.”

ANOVA Comparison

Let’s examine the penguins data from palmerpenguins. There are 3 species in this data.

Now, let’s see if the three species might plausibly have the same average bill length or not.

ggplot(data = penguins, aes(x = species, 
                            y = bill_length_mm, 
                            fill = species, 
                            color = species)) +
  geom_jitter(width = 0.05)

Running an ANOVA

We can do that by using the aov function, saving this result as a model, and then taking a summary of that model.

Notice again–the numeric variable comes first, followed by the grouping variable.

penguin_model = aov(data = penguins, 
                    flipper_length_mm ~ species)
summary(penguin_model)

Interpreting the results

We find a p-value of 2x10^-16, which is extremely low. This is actually the lowest value R will output unless we manually request for more precision.

Basically, we have VERY strong evidence that the difference in means we observed in our sample results would not show up by random chance if these three species indeed had equals means.

Tukey Test for pairwise comparisons

If we want to check further for pairwise comparisons, we can run a tukey test using the TukeyHSD function. The only input for this function is the already created ANOVA model.

penguin_model = aov(data = penguins, 
                    flipper_length_mm ~ species)

TukeyHSD(penguin_model)

From this results we see strong evidences that all species have different means from one another. The p-values are so low that they round to 0 in the output (granted they aren’t actually 0, but are all very small numbers).

Paired Comparisons (Optional)

This section is OPTIONAL

I used to teach paired t-tests explicitly, but it is no longer in my notes to make space for other things. But if end up doing a paired comparison in the future (perhaps a simple pre-post study), you might find the visualization (and testing component) helpful!

When would your data be paired?

You may encounter paired data when trying to compare pre and post measurements of the same individual/unit, or if it is a matched pairs design of some sort.

As long as the data is organized in such a way that the pairs are adjacent (or are in the same respective ordering), then the test adaption is easy by just adding the paired = TRUE argument.

Paired Plot

The gehan dataset in R includes remission times of leukaemia patients, where half were treated with the drug 6-mercaptopurine, and the rest were controls. The trial was designed with matched pairs, so we can actually do a direct comparison across pairs, rather than treat these as independent samples!

gehan

To plot this data, we can do a point plot with pairing lines to indicate which values are paired together. To do this, we need to add two things.

In your aes function, add this argument:

pair: should be matched with your grouping ID variable that identifies which rows are matched together.

You need two geometries:

geom_point to plot your points
geom_line will connect your paired observations with lines.

I also suggest making the point sizes bigger in geom_point and you may also tinker with the geom_line line as well as desired.

ggplot(data = gehan, aes(x = treat, y = time, group = pair, color = treat)) +
  geom_point(size = 3) +
  geom_line(size = 0.5, color = "black")

Running a paired test

We run the t-test the same way, just with the paired = TRUE argument added.

t.test(data = gehan, 
       time ~ treat,
       paired = TRUE,
       alternative = "two.sided")

Return Home

This tutorial was created by Kelly Findley, with assistance from Brandon Pazmino (UIUC ’21). We hope this experience was helpful for you!

If you’d like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/