In this tutorial, we will now begin looking at how to construct confidence intervals and complete basic hypothesis tests in
R. Using the base functions in
R, we can save time from making a lot of tedious calculations and go straight to our results.
One sample inference would commmonly be done using the base functions
qnorm for a z-method, or
qt for a t method. In this tutorial, we will skip this scenario and move straight ahead to comparing groups.
Consider an experiment to determine if people who are assigned a zinc-fortified breakfast shake are less likely to get a cold in the Fall and Winter as compared to people taking the same breakfast shake without zinc. We have 300 people assigned to each group, and we find that 81 in the zinc group and 113 in the non-zinc group get sick. Would this be enough evidence to determine a difference, or could this difference be explained as random chance?
First, let's make a graph to visualize this difference. Note we need some extra setup to create this information to plot.
data = data.frame( Group = c("Zinc", "No Zinc"), Proportion = c(81/300, 113/300) ) library(ggplot2) library(ggthemes) ggplot(data = data, aes(x = Group, y = Proportion, fill = Group)) + geom_col(position = "stack", color = "black", width = 0.3) + theme_hc() + scale_y_continuous(breaks = seq(0,1,0.05), limits = c(0,0.5)) + labs(title = "Proportion who got Cold")
To see if the difference can be explained by random chance or not, we'll need a two-sample test for proportions.
prop.test function, followed by two necessary arguments: 1) A vector of the counts for our two groups who had the condition of interest, and 2) a vector of the total sample size for each group.
We may optionally add a direction if we wish to do a directional test, but we hold that off for now.
prop.test(x = c(81, 113), n = c(300, 300))
Notice that we get a test statistic (in this case, it is the Chi-square statistic, but don't worry about that), followed by a p-value. We also get a 95% confidence interval for the true difference in proportions.
From this result, we have pretty strong evidence that the difference in sample proportions is unlikely to be explained by random chance.
If we entered this test knowing that zinc is unlikely to increase the risk for a cold, then we could make a reasonable argument for a one-sided test. If we wished to get a p-value from this format, we could just cut the previous p-value in half, or actually set up the test in this format.
Since it is the zinc group listed first in the argument, we'll say that zinc < no zinc in proportion of sickness.
prop.test(x = c(81, 113), n = c(300, 300), alternative = "less")
A z-test is only appropriate if the true variance of each group is known (or can be estimated well with our sample). Typically, if each sample is > 120 in size, a z-test should be very reasonable, as our sample variance is a very good estimate.
If using a z-test, one can use
pnorm once the z-score and standard error are calculated. For t-tests, there is a
t.test function we can use directly.
If we return to the
penguins data from the
palmerpenguins package. As noted before, the sex of each penguin was recorded (for most). Let's see if there are systematic differences between male and female penguins in terms of bill length.
library(palmerpenguins) ggplot(data = penguins, aes(x = sex, y = bill_length_mm, fill = sex)) + geom_violin(alpha = 0.2) + geom_boxplot(alpha = 0.6, width = 0.1) + theme_bw()
A few things for this data--we have quite a few unlabeled penguins in terms of sex, so we won't be able to use those for this analysis. Secondly, the distribution is at least bimodal already (likely due to the different species represented). It will make most sense if we pick just one species to compare. Let's go with the Adelie species.
library(palmerpenguins) library(dplyr) penguins %>% filter((sex == "female" | sex == "male") & species == "Adelie") %>% ggplot(aes(x = sex, y = bill_length_mm, fill = sex, color = sex)) + geom_violin(alpha = 0.2) + geom_boxplot(alpha = 1, width = 0.1, fill = "white") + geom_sina(size = 0.5, alpha = 0.5) + theme_bw()
Visually, it appears that the variance of each group is quite similar. We could also calcualte the variance of each for numeric comparison (as done below). The values are far in relatively close in magnitude, so pooled is probably ok.
penguins %>% filter((sex == "female" | sex == "male") & species == "Adelie") %>% group_by(sex) %>% summarise(mean = mean(bill_length_mm), var = var(bill_length_mm))
We will use the
t.test function for the comparison. A few things:
~, and then the grouping variable
var.equalshould be FALSE. Otherwise list it as TRUE or just don't bother writing it in!
t.test can't be run within a pipe, so you'll need to subset separately, save to a name, and then call in as the data argument for
binary_data = penguins %>% filter((sex == "female" | sex == "male") & species == "Adelie") t.test(data = binary_data, bill_length_mm ~ sex, alternative = "two.sided", var.equal = TRUE)
Our results show a very small p-value, meaning that it is extremely unlikely that the difference in bill length we found in our sample would just be random chance differences. There is likely a biological difference here that explains this difference in mean bill length.
Also of interest, we get a confidence interval for the true mean difference as well! This will output for a non-directional test. It will NOT output as we would expect if you input a directional test though, so keep that in mind!
Directional tests may be completed if there is reason to believe that the effect wouldn't make sense to go in the other direction. Let's pretend that we wanted to do this with this example.
Perhaps we knew there were only two possible results: Females and Males have equal bill length, or males have longer bill length. To decide if our alternative should be
greater, think of filling in this blank:
Alternative: female ___ male
Since female comes alphabetically before male, this would be the ordering. My alternative is that female is less than male, so the alternative I need here is "less"!
binary_data = penguins %>% filter((sex == "female" | sex == "male") & species == "Adelie") t.test(data = binary_data, bill_length_mm ~ sex, alternative = "less", var.equal = TRUE)
We get a result! It may not be obvious, but the p-value is half the size...this is what happens in a one-sided test when you look the direction of the "effect." Rather than ask...
"How often would we get a difference this large or larger if there truly was no difference?"
we're not asking,
"How often would we get the male group this much higher than the female group if there truly was no difference?"
Also take note that the confidence interval for the mean difference includes infinity as a lower bound. While this has an odd meaning, it is not particularly meaningful to us given what we have learned. So just ignore this part of the output!
Now go ahead and look at the
mtcars data. We're interested to know if there might be a difference in the average
mpg of cars with manual drive or automatic drive (housed in the
am variable is technically numeric, I'll need to convert it to be a factor variable in order for ggplot to accept it as "categorical."
ggplot(data = mtcars, aes(x = as.factor(am), y = mpg, color = as.factor(am))) + geom_jitter(width = 0.1) + geom_boxplot(position = position_nudge(x = 0.2), width = 0.1) + labs(x = "0 = Manual. 1 = Automatic", color = "Auto. or Man.") + theme_bw() + scale_color_manual(values = c("orchid3", "goldenrod1"))
Note, however, that the category names are not in words, but simply 1 and 0. 0 denotes the car as being manual, and 1 denotes that it is automatic.
As you might guess, 0 comes alphabetically before 1. And since this is how it is coded in
R, this is all that matters for determining alphabetical ordering!
Conduct a t-test to determine if automatic vehicles have a higher
mpg on average than manual cars.
Think carefully about the 0-1 ordering and how you phrase the alternative!
t.test(data = ______, ______ ~ ______, alternative = _______, var.equal = __________)
t.test(data = mtcars, mpg ~ am, alternative = "less", var.equal = FALSE)
When working with paired data, we can add a
paired = TRUE argument to our
You may likely encounter paired data when trying to compare pre and post measurements of the same individual/unit, or if it is a matched pairs design of some sort.
As long as the data is organized in such a way that the pairs are adjacent (or are in the same respective ordering), then the test adaption is easy by just adding the
paired = TRUE argument.
I'm not aware of any paired data in base R data or R packages, so I'll make a small one up!
Let's plot this data in a way that maintains the pairing. To do this, be sure you have an additional variable that identifies which observations should be paired. This will go in the
Create a strip chart (no jittering).
Second, add a
geom_line geometry to connect the data. It will connect the dots that share the same value from the pair variable. Adjust sizes and coloring as needed!
#Make some fake data paired_data = data.frame( Cholesterol = c(212,205,243,234,206,209,198,196,256,219,243,212,245,260,234,225), Pre_Post = c("Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post"), Pair = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8) ) #Make Pre/Post list in chronological order, not alphabetical paired_data$Pre_Post = factor(paired_data$Pre_Post, levels = c("Pre", "Post")) #Paired Strip Chart ggplot(data = paired_data, aes(x = Pre_Post, y = Cholesterol, color = Pre_Post, group = Pair)) + geom_point(size = 3) + geom_line(size = 0.5, color = "black") + theme_bw()
We run the t-test the same way, just with the
paired = TRUE argument added. The setup is provided again to allow the code to work, but only the
t.test part at the end is new here!
#Make some fake data paired_data = data.frame( Cholesterol = c(212,205,243,234,206,209,198,196,256,219,243,212,245,260,234,225), Pre_Post = c("Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post", "Pre", "Post"), Pair = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8) ) #Make Pre/Post list in chronological order, not alphabetical paired_data$Pre_Post = factor(paired_data$Pre_Post, levels = c("Pre", "Post")) t.test(data = paired_data, Cholesterol ~ Pre_Post, paired = TRUE, alternative = "two.sided")
ANOVA stands for Analysis of Variance. In our class, we will use its most simple situation: comparing difference in means between 3 or more groups. This particular use is called "One-Way ANOVA."
Let's return to the penguins data again. There are 3 species in this data. In a previous question, we looked at whether male and female penguins have the same average bill length or not. Now, let's see if the three species have the same average bill length or not.
ggplot(data = penguins, aes(x = species, y = bill_length_mm, fill = species, color = species)) + geom_violin(alpha = 0.2) + geom_boxplot(alpha = 0.8, width = 0.1, fill = "white") + geom_sina(size = 0.5, alpha = 0.5) + theme_bw()
We can do that by using the
aov function, saving this result as a model, and then taking a summary of that model.
Notice again--the numeric variable comes first, followed by the grouping variable.
penguin_model = aov(data = penguins, flipper_length_mm ~ species) summary(penguin_model)
We find that the variance between species with regard to flipper length is extremely high. The p-value is actually at the lower table reporting limit for
2.2x10^-16--these sample mean differences are highly unlikely to be explained by random chance!
If we want to check further for pairwise comparisons, we can run a tukey test using the
TukeyHSD function. The only input for this function is the already created ANOVA model.
penguin_model = aov(data = penguins, flipper_length_mm ~ species) TukeyHSD(penguin_model)
From this results we see strong evidences that all species have different means from one another. The p-values are so low that they round to 0 in the output (granted they aren't actually 0, but are all very small numbers).
Compare the body mass index of penguins across the three different years of data collection. Your solution code should produce an anova model named
year_model and run a summary of the model.
year_model = ___(_____________) summary(_________)
year_model = aov(data = ______, ______ ~ ______) summary(year_model)
year_model = aov(data = penguins, body_mass_g ~ year) summary(year_model)
Feel free to visualize the results below as well!
ggplot(data = penguins, aes(x = as.factor(year), y = body_mass_g, fill = as.factor(year))) + _______________
This tutorial was created by Kelly Findley, with assistance from Brandon Pazmino (UIUC '21). We hope this experience was helpful for you!