This tutorial will explore:
In this tutorial, we will proceed straight to multi-group comparison inference (comparing two means, comparing two proportions, and comparing 3+ means). These methods are more widely used in application.
For information on completing a one-sample test or interval in R, feel free to check the resources linked below, but you won't need them for this class!
The Pima.tr
dataset contains data on 200 women of Pima Indian heritage living near Phoenix, Arizona. As part of a Health study on this population, medical researchers gathered data for several health-related variables.
This data is one of many inside the MASS
package. Take a look at the data here!
library(MASS)
Pima.tr
One of the variables collected was whether the participant was considered "diabetic" according to World Health Organization criteria (represented as the variable type
). Researchers are curious if the proportion of PIMA women who are diabetic is different depending on whether that woman had been pregnant in the past before
Guiding Question: Are women with a past pregnancy more likely to be diabetic than women who have not been pregnant?
The pregnancy
variable indicates women who have been pregnant at least once ("Yes") and women who have not been pregnant before ("No").
Let's first look at an unstacked barplot to compare these groups in terms of diabetic status.
ggplot(data = Pima.tr, aes(x = pregnancy, fill = type)) +
geom_bar(color = "black", width = 0.5) +
labs(x = "At least 1 pregnancy", fill = "Diabetic")
Let's now make a visual that makes the proportional comparison easier to see. Create the barplot again, but this time, as a stacked barplot with position = "fill"
ggplot(data = Pima.tr, aes(x = pregnancy, fill = type)) +
geom_bar(position = "fill", color = "black", width = 0.5) +
labs(x = "At least 1 pregnancy", y = "Proportion", fill = "Diabetic")
We will talk about this in a coming tutorial, but you can add general plot themes in ggplot
to change the background paneling and layout.
I mention this here because I thinktheme_hc()
looks especially nice with stacked barplots.
ggplot(data = Pima.tr, aes(x = pregnancy, fill = type)) +
geom_bar(position = "fill", color = "black", width = 0.4) +
labs(x = "At least 1 pregnancy", y = "Proportion", fill = "Diabetic") +
theme_hc()
To see if the difference can be explained by random chance or not, we'll need a two-sample test for proportions. We can use the prop.test
function for that.
The prop.test
function in R is a straightforward choice for this comparison. It takes two arguments:
x
: the number of cases with the criteria of interestn
the total number of cases in each group.Before exploring the Pima
data question, let's start with a simpler example.
Let's say that I want to know if the proportion of "A's" between Professor X's class and Professor Y's class might be different or not.
To test whether the difference in sample proportions could be explained as random chance sampling variability, we can use prop.test
prop.test(x = c(50, 35),
n = c(100,90))
The first vector represents the number of cases meeting our criteria of interest ("getting an A") for each group. It's inputted as a vector of two values.
The second vector represents the total number of cases in each group. Notice that order matters there. If Professor X has 50 out of 100 A's, make sure 50 and 100 both come first or both come second in their respective vectors!
Notice that this code results in the same p-value.
prop.test(x = c(35, 50),
n = c(90, 100))
Notice that in the Pima.tr
dataset, each row is either a Yes or No, but I don't have the frequency totals provided for me to plug into prop.test
. I need to create a frequency table first.
We can do that with a pipe by creating a summary table of counts.
The only summary function we need is n
, which will simply count how many observations there are in each group.
Use group_by
for our two categorical variables of interest to get frequencies of each of these four possible subgroups. The order for these two variables is arbitrary--feel free to switch them around!
Pima.tr |>
group_by(pregnancy, type) |>
summarise(count = n())
Now, how do we take this information to plug into prop.test
?
type
(is patient diabetic or not).type
is "Yes" represent the two values we're placing in our x
vector--number of cases where criteria of interest (diabetic status) is true.pregnant
categories. What total number of cases are "Yes" for pregnant, and what total number of cases are "No" for pregnant?prop.test(x = c(12, 56),
n = c(28, 172))
Notice at the bottom, it provides the sample proportions for each group.
At the top, we find a p-value.
We also get a 95% confidence interval for the true difference in proportions. That may or may not be interesting depending on the context. In some cases, we might instead wish to calculate the relative risk (or odds ratio) value and confidence interval.
If this happened to be a situation where we only found a result in one direction to be noteworthy, we could add a directional indication on the test.
If we wished to indicate that, we would need to add a 3rd argument:
alternative
: can be set to "greater" "less" or "equal" where equal would be the default choice when we don't define it.Let's consider this alternative with the professor X vs. Y question
To decide whether to write "less" or "greater", we need to think about which group is listed first. Since we have professor X's numbers listed first, then our alternative will be that Professor X's proportion is "greater" than Professor Y's proportion.
Professor X prop > Professor Y prop
prop.test(x = c(50, 35),
n = c(100, 90),
alternative = "greater")
We see a p-value of 0.08, suggesting moderate evidence (but not strong evidence) that Professor X's class results in a higher proportion of A's compared to Professor Y.
NOTE: When choosing a directional option, ignore the confidence interval output--it represents something different!
Remember that means are measures we might take from numeric variables. Proportions are measures we use with categorical variables. Let's next consider testing involving a comparison of two means!
Remember that when our sample size is not particularly large, we need a "t-correction" to account for our estimate of the standard deviation using only our sample data. When our sample size is fairly large though, the difference between a t-test and z-test is negligible.
When using software, it's fine to just always do a t-test for that reason, since the t-test results will correctly converge toward the results from a z-test!
Consider the birthwt
dataset from the MASS
package.
birthwt
One question we might investigate is whether mothers who reported smoking during the pregnancy (smoke_answer
) were more likely to give birth to babies with a lower birth weight (bwt
) on average.
ggplot(data = birthwt, aes(x = smoke_answer, y = bwt, fill = smoke_answer)) +
geom_boxplot() +
stat_boxplot(geom = "errorbar") +
labs(y = "Birth weight (g)", x = "Smoking Status")
Visually, it appears that the variance of each group is similar. But when using software, an unpooled method is the "safer" choice. We will always choose an unpooled method when using software in our class!
We will use the t.test
function for the comparison. This test takes the following arguments
data
: the data frame where our varaibles are located.~
connecting the two!var.equal
: whether to assume equal variances. We will typically choose FALSE
as the safer testing choice when there is no reason to assume variances of each population is equal.alternative
: again, an option to identify if we are doing a directional test.For now, let's start with a non-directional test.
birthwt
bwt
, and we are using smoke_answer
as the grouping variable to compare means from.Note: Putting the response variable first and grouping variable second is important!
t.test(data = birthwt,
bwt ~ smoke_answer,
var.equal = FALSE)
Notice our sample means
Notice our p-value
Also of interest, we get a confidence interval for the true mean difference as well!
Directional tests may be completed if we would only find a difference in one direction possible or noteworthy.
Perhaps we had a theory that smoking affected the pregnancy and caused decreased birth weight. If that's the case, then alphabetically, our groups are No and Yes in that order (check the data frame again if you'd like to verify the category names here!).
Alternative: No (smoking) ___ Yes (smoking)
If our theory says that smoking should increase the likelihood of premature/low weight birth, then our alternative is that weight for non-smoking > weight for smoking. So we would plug in greater!
t.test(data = birthwt,
bwt ~ smoke_answer,
var.equal = FALSE,
alternative = "greater")
Notice that the p-value is now half the size it was before. That reflects that if left up to random chance, the probability of finding a difference this high or higher in only one direction is half as likely as finding that much difference in either direction.
ANOVA stands for Analysis of Variance and is a statistical technique for modeling how one or more categorical predictors explain variability in a numeric response. In this tutorial, we will use its most simple application: comparing differences in means between 3 or more populations. This particular use is called "One-Way ANOVA."
Let's examine the penguins
data from palmerpenguins
. There are 3 species in this data.
Now, let's see if the three species might plausibly have the same average bill length or not.
ggplot(data = penguins, aes(x = species,
y = bill_length_mm,
fill = species,
color = species)) +
geom_jitter(width = 0.05)
We can do that by using the aov
function, saving this result as a model, and then taking a summary of that model.
Notice again--the numeric variable comes first, followed by the grouping variable.
penguin_model = aov(data = penguins,
flipper_length_mm ~ species)
summary(penguin_model)
We find a p-value of 2x10^-16, which is extremely low. This is actually the lowest value R will output unless we manually request for more precision.
Basically, we have VERY strong evidence that the difference in means we observed in our sample results would not show up by random chance if these three species indeed had equals means.
If we want to check further for pairwise comparisons, we can run a tukey test using the TukeyHSD
function. The only input for this function is the already created ANOVA model.
penguin_model = aov(data = penguins,
flipper_length_mm ~ species)
TukeyHSD(penguin_model)
From this results we see strong evidences that all species have different means from one another. The p-values are so low that they round to 0 in the output (granted they aren't actually 0, but are all very small numbers).
I used to teach paired t-tests explicitly, but it is no longer in my notes to make space for other things. But if end up doing a paired comparison in the future (perhaps a simple pre-post study), you might find the visualization (and testing component) helpful!
You may encounter paired data when trying to compare pre and post measurements of the same individual/unit, or if it is a matched pairs design of some sort.
As long as the data is organized in such a way that the pairs are adjacent (or are in the same respective ordering), then the test adaption is easy by just adding the paired = TRUE
argument.
The gehan
dataset in R includes remission times of leukaemia patients, where half were treated with the drug 6-mercaptopurine, and the rest were controls. The trial was designed with matched pairs, so we can actually do a direct comparison across pairs, rather than treat these as independent samples!
gehan
To plot this data, we can do a point plot with pairing lines to indicate which values are paired together. To do this, we need to add two things.
In your aes
function, add this argument:
pair
: should be matched with your grouping ID variable that identifies which rows are matched together.You need two geometries:
geom_point
to plot your pointsgeom_line
will connect your paired observations with lines.I also suggest making the point sizes bigger in geom_point
and you may also tinker with the geom_line
line as well as desired.
ggplot(data = gehan, aes(x = treat, y = time, group = pair, color = treat)) +
geom_point(size = 3) +
geom_line(size = 0.5, color = "black")
We run the t-test the same way, just with the paired = TRUE
argument added.
t.test(data = gehan,
time ~ treat,
paired = TRUE,
alternative = "two.sided")
This tutorial was created by Kelly Findley, with assistance from Brandon Pazmino (UIUC '21). We hope this experience was helpful for you!
If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/