This tutorial picks up from the previous one. We will talk about adding a second predictor to a linear model: additive models and interaction models.
So far, we've only considered how we can predict the value of a response variable using one numeric predictor variable.
ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
geom_point() +
labs(y = "Bill Length (mm)", x = "Bill Depth (mm)") +
geom_smooth(method = "lm", se = FALSE)
But we might gain more predictive power by including a second predictor! In the example above, what if we used both the bill depth of a penguin and the penguin's species to predict its bill length?
ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
geom_point() +
labs(y = "Bill Length (mm)", x = "Bill Depth (mm)") +
geom_smooth(method = "lm", se = FALSE)
That categorical predictor completely changes the story. At first it looked like higher bill depth suggested lower bill length on average, but that's just because species is a confounder here. If we stratify the data by species, we actually see a positive relationship within each species.
Let's say we have one numeric predictor (e.g., bill depth) and one categorical predictor (e.g., species). Here are our modeling options!
A Simple Model: We choose only one of these predictors to include in the model.
An Additive Model: We plot a simple relationship with the numeric predictor, but give each group in our categorical variable a unique intercept.
An Interaction Model: We allow each group of our categorical variable to have a unique intercept and slope! The relationship (slope) between the numeric predictor and our response depends on which group we are looking at.
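In R's formula syntax, these three options differ only on the right side of the `~`. Here's a sketch using the penguin variables from the plots above (assuming the `palmerpenguins` package is available):

```r
library(palmerpenguins)

# Simple model: one predictor only
mod_simple <- lm(data = penguins, bill_length_mm ~ bill_depth_mm)

# Additive model: one shared slope for bill depth,
# plus a separate intercept for each species
mod_add <- lm(data = penguins, bill_length_mm ~ bill_depth_mm + species)

# Interaction model: a separate intercept AND slope for each species
mod_int <- lm(data = penguins, bill_length_mm ~ bill_depth_mm * species)
```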
Let's consider a (made-up) experiment: Participants were asked to take a test on a topic that was unfamiliar to them. The response variable is their score on that exam. We have two variables we're going to use to predict their test score:

- `study_time`: the study time they spent (in minutes)
- `materials`: the study materials they were given (Clear or Unclear)

Let's say the "Clear" study materials were carefully structured to benefit students more, whereas the "Unclear" instructions were full of jargon and not very accessible for learning.
Let's take a look at this data, with study time on the x-axis, and coloring by which materials they were using.
# Create the data
Task_Data
# Plot the data
ggplot(data = Task_Data, aes(x = study_time, y = score, color = materials)) +
geom_point() +
labs(x = "Study Time (min)", y = "Exam Score")
Let's explore some modeling options for these groups!
We can start by using study time as the sole predictor of exam score.
Notice: If I still want to retain `materials` in color, I will need to select a specific color in `geom_smooth` so that it knows I want only one line.
ggplot(data = Task_Data, aes(x = study_time, y = score, color = materials)) +
geom_point() +
labs(x = "Study Time (min)", y = "Exam Score") +
geom_smooth(method = "lm", se = FALSE, color = "black")
As we learned in a previous tutorial, we can make a simple linear model to find a best fitting line and judge how much predictive power we find with this predictor.
mod_simple_ST = lm(data = Task_Data, score ~ study_time)
summary(mod_simple_ST)
We find a positive estimated slope for `study_time`: on average, more study time predicts a higher exam score.

What about the simple relationship between the materials a person used and their exam score? In other words, we are now going to ignore study time and just look at average exam score differences between each group.
ggplot(data = Task_Data, aes(x = materials, y = score, color = materials)) +
geom_jitter(width = 0.05) +
labs(x = "Clear vs. Unclear Instructions", y = "Exam Score")
It may be weird to think of fitting a "linear model" across two groups. All that means here is that we're connecting the mean of one group to the mean of the other with a line.
To run this as a linear model, R will treat the category options as 0's and 1's. By default, it will assign them alphabetically, so "Clear" becomes 0 and "Unclear" becomes 1.

Just for the sake of this tutorial, this is how R is thinking about this "linear" relationship.
Task_Data$binary = ifelse(Task_Data$materials == "Clear", 0, 1)
ggplot(data = Task_Data, aes(x = binary, y = score)) +
geom_point() +
labs(x = "Clear vs. Unclear Instructions", y = "Exam Score") +
geom_smooth(method = "lm", se = FALSE) +
scale_x_continuous(limits = c(0,1), breaks = seq(0,1,1)) +
scale_y_continuous(breaks = seq(10,100,10))
mod_simple_group = lm(data = Task_Data, score ~ materials)
summary(mod_simple_group)
From the output, we find:
This means that the mean score for those with Clear instructions was 60.827, while the mean score for those with Unclear instructions was 60.827 - 28.593 = 32.234.
The \(r^2\) and adj. \(r^2\) also show that this one grouping difference alone explains almost half of the variability in exam score!
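We can double-check that arithmetic directly: the group means should match the intercept and the intercept-plus-adjustment from the model. (This assumes `Task_Data` from the earlier chunk.)

```r
# Mean exam score within each materials group; with the default
# alphabetical coding, Clear's mean is the model's intercept and
# Unclear's mean is intercept + the materialsUnclear estimate
tapply(Task_Data$score, Task_Data$materials, mean)
```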
An additive model will use both predictors: a single shared slope for `study_time`, plus a separate intercept for each `materials` group.
FYI: plotting an additive model is a little complicated. You do not need to know how to write this visualization code on your own; we're just plotting it for the sake of learning.
add_mod = lm(data = Task_Data, score ~ study_time + materials)
Task_Data$prediction = predict(add_mod)
ggplot(data = Task_Data, aes(x = study_time, y = score, color = materials)) +
geom_point() +
geom_line(aes(y = prediction), linewidth = 1)
To make an additive model, we just add both predictors on the right side of our `~` symbol, combined with a `+` sign.
add_mod = lm(data = Task_Data, score ~ study_time + materials)
summary(add_mod)
What do we find from the Estimate column? You can think of this model as giving us two lines of best fit, one per `materials` group: both share the same slope for `study_time`, and the `materialsUnclear` estimate shifts the Unclear group's intercept relative to the Clear group's.

Notice that adj. \(r^2\) for this model is 0.6563. That's an improvement over either simple model option!
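To see those two parallel lines in the numbers, we can pull the coefficients out of `add_mod` (fit as above):

```r
# Extract the additive model's coefficients (assumes add_mod from above)
coefs <- coef(add_mod)

# Clear group line:   intercept = coefs["(Intercept)"]
# Unclear group line: intercept = coefs["(Intercept)"] + coefs["materialsUnclear"]
# Both lines share the slope coefs["study_time"]
coefs
```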
We can also look at the p-values for our two predictors to see that each is uniquely contributing:

- `study_time` is testing whether we have evidence that study time has some linear relationship with score after already accounting for group.
- `materialsUnclear` is testing whether we have evidence that the intercept adjustment is different from 0 after already accounting for study time.

When we talk about two predictors "interacting," we mean that the relationship that one has with the response depends on the value of the other!
What does that mean in our context?
An interaction means that the relationship (slope) between study time and exam score depends on which set of instructions they had.
It makes sense that the more time people studied, the better (on average) they scored. It also makes sense that those with clearer instructions probably saw more benefit from study time (steeper relationship) than those with unclear instructions (flatter relationship).
This is pretty straightforward to visualize! Just map your grouping variable to color and add `geom_smooth`. If you leave the color option out of `geom_smooth`, it will create a separate line of best fit for each group.
ggplot(data = Task_Data, aes(x = study_time, y = score, color = materials)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
To make an interaction term, use the `*` sign to connect the predictor variables you want to allow to interact.
int_mod = lm(data = Task_Data, score ~ study_time * materials)
summary(int_mod)
What do we find from the Estimate column? We again have two lines of best fit, but now with a unique slope for each group too: the `study_time` estimate is the Clear group's slope, and the `study_time:materialsUnclear` estimate adjusts that slope for the Unclear group.

To determine if we have evidence for the interaction model in comparison to the additive model, we just need to look at the p-value for the interaction term!
In this case, it is 0.00119, suggesting we have very strong evidence of at least some interaction between these predictors. People with the Clear instructions saw more benefit from study time in improving their score (a steeper slope) than those with Unclear instructions.
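We can compute each group's slope from the interaction model's coefficients (assuming `int_mod` fit as above):

```r
# The Clear group's slope is the base study_time estimate;
# the Unclear group's slope adds the interaction adjustment
coefs <- coef(int_mod)
c(Clear = unname(coefs["study_time"]),
  Unclear = unname(coefs["study_time"] + coefs["study_time:materialsUnclear"]))
```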
In case you're wondering about that high p-value for the "additive term" (`materials` by itself): in the interaction model, that coefficient tests the group difference specifically at `study_time = 0`, which isn't a very meaningful comparison on its own here.
Finally, we can see that adj. \(r^2\) is best with the interaction model (0.7105). That's about a 5.4-percentage-point increase over the additive model (going from about 34.4% unexplained variability to about 29% unexplained). Just another indicator that the interaction model fits the best!
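Another standard way to compare the additive and interaction models is a nested-model F-test with `anova()`. With a single interaction term, its p-value matches the one we read off the interaction model's summary:

```r
# Compare the nested models directly (assumes add_mod and int_mod from above)
anova(add_mod, int_mod)
```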
Let's take a look at the `esdcomp` data, which records the number of complaints that various ER doctors received, alongside other factors about each doctor.
esdcomp
Let's focus on these variables:

- `complaints`: how many complaints that doctor has received
- `visits`: the number of patient visits that doctor has logged
- `gender`: the gender of the doctor

Let's first create a scatterplot that predicts `complaints` based on `visits` and `gender`. Let's plot an interaction model. In other words, we're going to see if the relationship between visits and complaints depends on the gender of the doctor.
ggplot(data = esdcomp, aes(x = visits, y = complaints, color = gender)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Go ahead and build an interaction model first to see if we have evidence that there might be an interaction between these two predictors.
int_mod = _________________
_______________
int_mod = lm(data = esdcomp,________ ~ ______________)
summary(________)
int_mod = lm(data = esdcomp, complaints ~ visits * gender)
summary(int_mod)
If done correctly, we get a p-value of 0.324 from the interaction model. This suggests we don't have much evidence here that the relationship between visits and complaints depends on gender.
It's still possible there is an additive relationship--that one gender receives more complaints on average, even after controlling for number of visits!
Try an additive model.
add_mod = _________________
_______________
add_mod = lm(data = esdcomp,________ ~ ______________)
summary(________)
add_mod = lm(data = esdcomp, complaints ~ visits + gender)
summary(add_mod)
If done correctly, the p-value for the additive term (with no interaction term) is rather high. This means, among doctors with the same number of visits, we don't see an additive difference in complaints between male and female doctors from this sample of data.
That doesn't necessarily mean no difference based on gender here--we do have a relatively small sample size, so there might be a small effect here that we can't detect with this sample size.
For the sake of completeness, let's also look at the simple models and their plots.
mod_visits = lm(data = esdcomp, complaints ~ visits)
summary(mod_visits)
ggplot(data = esdcomp, aes(x = visits, y = complaints)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
mod_gender = lm(data = esdcomp, complaints ~ gender)
summary(mod_gender)
ggplot(data = esdcomp, aes(x = gender, y = complaints, color = gender)) +
geom_jitter(width = 0.05)
We do see evidence that more visits seems to correlate with more complaints. That should make sense!
We don't see any simple relationship between gender and complaints.
With just these two predictors, it seems like a simple relationship between visits and complaints is the most sensible! It would be interesting, though, to recheck these relationships alongside the other factors recorded about each doctor.
That's all folks!
If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/