Modeling Binary Predictors

Tutorial Introduction

Goals

This tutorial will pick up from the previous one. We will talk about:

  • Visualizing one numeric predictor (on the x axis) with one categorical predictor (in color)
  • Reviewing the application of a "simple" model (using just one predictor)
  • Creating and interpreting an "additive" model
  • Creating and interpreting an "interaction" model

Visualizing Two Predictors

A simple model

So far, we've only considered how we can predict the value of a response variable using one numeric predictor variable.

ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  labs(y = "Bill Length (mm)", x = "Bill Depth (g)") +
  geom_smooth(method = "lm", se = FALSE)

Add a categorical variable

But we might gain more predictive power by including a second predictor! In the example above, what if we used both the bill depth of a penguin and the penguin's species to predict its bill length?

ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  labs(y = "Bill Length (mm)", x = "Bill Depth (g)") +
  geom_smooth(method = "lm", se = FALSE)

What did we find?

That categorical predictor completely changes the story. At first it looked like higher bill depth suggested lower bill length on average, but that's just because species is a confounder here.

We actually see a positive correlation within each species once we stratify the data by species (a classic example of Simpson's paradox).
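If you'd like to check that numerically rather than visually, here is a minimal sketch, assuming the dplyr package is loaded alongside the penguins data:

library(dplyr)

# Correlation between bill depth and bill length, computed within each species
penguins %>%
  group_by(species) %>%
  summarize(r = cor(bill_depth_mm, bill_length_mm, use = "complete.obs"))

Each within-species correlation comes out positive, even though the correlation over the pooled data is negative.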

Modeling Options

Let's say we have one numeric predictor (e.g., bill depth) and one categorical predictor (e.g., species). Here are our modeling options! (The matching R formulas are sketched just after the list.)

  1. A Simple Model: We choose only one of these predictors to include in the model.

  2. An Additive Model: We fit a single slope for the numeric predictor, but give each group in our categorical variable a unique intercept.

  3. An Interaction Model: We allow each group of our categorical variable to have a unique intercept and slope! The relationship (slope) between the numeric predictor and our response depends on which group we are looking at.
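In R's formula notation, these three options only differ on the right side of the ~. A quick sketch, using a hypothetical data frame dat with response y, numeric predictor x, and grouping variable g:

# Hypothetical names: dat, y, x, g
mod_simple   = lm(data = dat, y ~ x)      # 1. simple: one predictor only
mod_additive = lm(data = dat, y ~ x + g)  # 2. additive: unique intercepts, shared slope
mod_interact = lm(data = dat, y ~ x * g)  # 3. interaction: unique intercepts AND slopes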

A Guiding Example

Prompt

Let's consider a (made-up) experiment: Participants were asked to take a test on a topic that was unfamiliar to them. The response variable is their score on that exam, and we have two variables we're going to use to predict it:

  • How much study_time they spent (in minutes)
  • Which study materials they were given (Clear or Unclear)

Let's say the "Clear" study materials were carefully structured to benefit students more, whereas the "Unclear" instructions were full of jargon and not very accessible for learning.

Visualize the data

Let's take a look at this data, with study time on the x-axis, and coloring by which materials they were using.

# View the data
Task_Data

#Plot the Data
ggplot(data = Task_Data, aes(x = study_time, y = score, color = materials)) +
  geom_point() +
  labs(x = "Study Time (min)", y = "Exam Score")

What's next

Let's explore some modeling options for these groups!

  • First we will look at the simple modeling options (each predictor in isolation)
  • Next, we will look at an additive model
  • Finally, we will check an interaction model

Example of Simple Models

Study Time in Isolation

We can start by using study time as the sole predictor of exam score.

Notice: if I still want to keep materials in color, I will need to set a specific color inside geom_smooth so that it knows I want only one line.

ggplot(data = Task_Data, aes(x = study_time, y = score, color = materials)) +
  geom_point() +
  labs(x = "Study Time (min)", y = "Exam Score") +
  geom_smooth(method = "lm", se = FALSE, color = "black")

Linear Model with Study Time only

As we learned in a previous tutorial, we can make a simple linear model to find a best fitting line and judge how much predictive power we find with this predictor.

mod_simple_ST = lm(data = Task_Data, score ~ study_time)
summary(mod_simple_ST)

We find:

  • From our Estimate column, the best fitting line is \(\hat{y} = 14.6439 + 0.6093\)(study_time)
  • We have an \(r^2\) value of 0.2594, which, after adjusting for correlation likely due to random chance, gives an adjusted \(r^2\) of 0.2466
  • From the p-value 0.0000324, we have very strong evidence that there is at least some linear relationship between study time and exam score
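As a quick use of that fitted line: a hypothetical student who studied for 60 minutes would be predicted to score about 14.6439 + 0.6093 × 60 ≈ 51.2. The predict() function does this arithmetic for us:

# Predicted score for a (hypothetical) 60-minute study session
predict(mod_simple_ST, newdata = data.frame(study_time = 60))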

Instruction type only

What about the simple relationship between the materials a person used and their exam score? In other words, we are now going to ignore study time and just look at average exam score differences between each group.

ggplot(data = Task_Data, aes(x = materials, y = score, color = materials)) +
  geom_jitter(width = 0.05) +
  labs(x = "Clear vs. Unclear Instructions", y = "Exam Score")

"Linear Model" with Group only

It may be weird to think of fitting a "linear model" between two groups. All that means here is that we're connecting the mean of one group to the mean of the other with a line.

To run this as a linear model, R will treat the category options as 0's and 1's. By default, it will do it alphabetically:

  • The alphabetically first group name (Clear) will now behave like 0's in the column
  • The other (Unclear) will be 1's in the column.

Just for the sake of this tutorial, here is how R is thinking about this "linear" relationship.

Task_Data$binary = ifelse(Task_Data$materials == "Clear", 0, 1)

ggplot(data = Task_Data, aes(x = binary, y = score)) +
  geom_point() +
  labs(x = "Clear vs. Unclear Instructions", y = "Exam Score") +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(limits = c(0,1), breaks = seq(0,1,1)) +
  scale_y_continuous(breaks = seq(10,100,10))
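You don't actually have to build that binary column yourself; R performs the same recoding internally whenever a factor enters a model. If you're curious, you can peek at it with model.matrix():

# The materialsUnclear column holds R's automatic 0/1 coding
head(model.matrix(score ~ materials, data = Task_Data))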

Linear Model Output with Group

mod_simple_group = lm(data = Task_Data, score ~ materials)
summary(mod_simple_group)

From the output, we find:

  • The Intercept is 60.827. This is the mean score of the base group (Clear).
  • The slope is -28.593. This is the difference in mean score for the other group (Unclear) relative to the baseline group.

This means that the mean score for those with Clear instructions was 60.827, while the mean score for those with Unclear instructions was 60.827 - 28.593 = 32.234.
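You can confirm that those two numbers really are just the group means, for example with aggregate():

# Mean exam score within each materials group
aggregate(score ~ materials, data = Task_Data, FUN = mean)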

The \(r^2\) and adj. \(r^2\) also show that this one grouping difference alone explains almost half of the variability in exam score!

Example of Additive Model

Visualizing an Additive Model

An additive model will:

  • Have only one slope--the relationship between study time and score
  • Create a unique intercept for each materials group--materials now contribute a constant additive difference.

FYI: plotting an additive model is a little complicated. You do not need to know how to write this code on your own; we're just visualizing it for the sake of learning.

add_mod = lm(data = Task_Data, score ~ study_time + materials)
Task_Data$prediction = predict(add_mod)

ggplot(data = Task_Data, aes(x = study_time, y = score, color = materials)) +
  geom_point() +
  geom_line(aes(y = prediction), size = 1)

Modeling an Additive Model

To make an additive model, we just add both predictors on the right side of our ~ symbol, combined with a + sign.

add_mod = lm(data = Task_Data, score ~ study_time + materials)
summary(add_mod)

What do we find from the Estimate column?

  • The intercept term represents the intercept for our baseline category (Clear)
  • We have a common slope for both groups. That is, for every one unit increase in study time, we expect a 0.52 point increase in exam score on average.
  • The "intercept adjustment" for the Unclear group is -26.5322. That means we take our prediction for the clear group, and just adjust it down 26.5 units.

You can think of this model as giving us two lines of best fit.

  • For the Clear group: \(\hat{y} = 32.4556 + 0.52092\)(study_time)
  • For the Unclear group: \(\hat{y} = 5.92338 + 0.52092\)(study_time)
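If you'd like to see where these two intercepts come from, here is a quick sketch of pulling them out of the fitted model:

b = coef(add_mod)
b["(Intercept)"]                          # Clear intercept: 32.4556
b["(Intercept)"] + b["materialsUnclear"]  # Unclear intercept: 32.4556 - 26.5322
b["study_time"]                           # shared slope: 0.52092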

Interpreting the Strength of this Model

Notice that adj. \(r^2\) for this model is 0.6563. That's an improvement over either simple model option!

We can also look at the p-values for our two predictors to see that each is contributing uniquely.

  • The p-value for study_time tests whether study time has some linear relationship with exam score after already accounting for group.
  • The low p-value (0.0000005) provides very strong evidence that it does!
  • The p-value for group tests whether the intercept adjustment is different from 0 after already accounting for study time.
  • The low p-value (0.0000000000165) provides very strong evidence that it is!
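If you'd rather grab those p-values programmatically than read them off the printout, they live in the coefficient table of the summary:

# The Pr(>|t|) column holds each term's p-value
summary(add_mod)$coefficients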

Example of Interaction Model

When we talk about two predictors "interacting," we mean that the relationship that one has with the response depends on the value of the other!

What does that mean in our context?

An interaction means that the relationship (slope) between study time and exam score depends on which set of instructions they had.

It makes sense that the more time people studied, the better (on average) they scored. It also makes sense that those with clearer instructions probably saw more benefit from study time (steeper relationship) than those with unclear instructions (flatter relationship).

Visualizing an Interaction Model

This is pretty straightforward to visualize! Just color your grouping variable and add geom_smooth. If you leave the color option out in geom_smooth, it will create a separate line of best fit for each group.

ggplot(data = Task_Data, aes(x = study_time, y = score, color = materials)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Modeling an Interaction Model

To make an interaction term, use the * sign to connect the predictor variables you want to allow to interact.

int_mod = lm(data = Task_Data, score ~ study_time * materials)
summary(int_mod)

Findings

What do we find from the Estimate column?

  • The intercept term (17.4060) represents the intercept for our baseline category (Clear)
  • The slope on study time (0.8026) now represents the slope for our baseline category (Clear) only.
  • The "intercept adjustment" for the Unclear group is 3.1062. That means we estimate the intercept term for the Unclear group is 3.1062 units higher than the intercept for the Clear group
  • That last term (-0.5766) represents the interaction term. It is a slope adjustment for our second group (Unclear), suggesting we adjust the slope down 0.5766 units from the slope for the baseline group. (0.8026 - 0.5766 = 0.2260)

We again have two lines of best fit, but now with a unique slope for each, too:

  • For the Clear group: \(\hat{y} = 17.4060 + 0.8026\)(study_time)
  • For the Unclear group: \(\hat{y} = 20.5122 + 0.2260\)(study_time)
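These also come straight from the coefficient vector; note that R names the interaction coefficient with a colon:

b = coef(int_mod)
b["study_time"]                                     # Clear slope: 0.8026
b["study_time"] + b["study_time:materialsUnclear"]  # Unclear slope: 0.8026 - 0.5766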

Interpreting the Strength of this Model

To determine if we have evidence for the interaction model in comparison to the additive model, we just need to look at the p-value for the interaction term!

In this case, it is 0.00119, suggesting we have very strong evidence that there is at least some interaction between these predictors. People with the Clear instructions saw more benefit from study time in improving their score (steeper slope) than those with Unclear instructions.

What about the other lines?

In case you're wondering about that high p-value for the "additive term" (materials by itself):

  • That just means that, after allowing the slopes to differ, we don't actually have much evidence that we need different intercept values.
  • That does not mean the interaction model is a bad fit! It just means we could reasonably let these groups share an intercept, but we should keep separate slopes (keep the interaction).
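As a side note (not something you need for this tutorial), R's formula notation can fit exactly that in-between model, with a shared intercept but a separate slope per group:

# Sketch: common intercept, separate study_time slopes per materials group
slopes_only_mod = lm(data = Task_Data, score ~ study_time + study_time:materials)
summary(slopes_only_mod)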

Interpreting r squared

Finally, we can see that adj. \(r^2\) is best with the interaction model (0.7105). That's about a 5.5 percentage point gain over the additive model (going from about 34.5% of the variability unexplained to about 29% unexplained). Just another indicator that the interaction model fits best!
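A quick way to line all four models up is to pull the adjusted \(r^2\) out of each summary:

summary(mod_simple_ST)$adj.r.squared     # simple: study time only (0.2466)
summary(mod_simple_group)$adj.r.squared  # simple: materials only
summary(add_mod)$adj.r.squared           # additive (0.6563)
summary(int_mod)$adj.r.squared           # interaction (0.7105), the best fit here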

New Example: Doctor Complaints

Visualize the data

Let's take a look at the esdcomp data, which records the number of complaints that various ER doctors receive, alongside other factors about each doctor.

esdcomp

Let's focus on these variables:

  • complaints: how many complaints that doctor has received
  • visits: the number of patient visits that doctor has logged
  • gender: the gender of the doctor

Visualize

Let's first create a scatterplot of complaints against visits, colored by gender, and overlay an interaction model. In other words, we're going to see if the relationship between visits and complaints depends on the gender of the doctor.

ggplot(data = esdcomp, aes(x = visits, y = complaints, color = gender)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Modeling Practice

Go ahead and build an interaction model first to see whether we have evidence of an interaction between these two predictors.

# Exercise: fill in the blanks
int_mod = _________________
_______________

# Hint
int_mod = lm(data = esdcomp, ________ ~ ______________)
summary(________)

# Solution
int_mod = lm(data = esdcomp, complaints ~ visits * gender)
summary(int_mod)

Alternative Model Practice

If done correctly, we get a p-value of 0.324 for the interaction term. This suggests we don't have much evidence that the relationship between visits and complaints depends on gender.

It's still possible there is an additive relationship--that one gender receives more complaints on average, even after controlling for the number of visits!

Try an additive model

# Exercise: fill in the blanks
add_mod = _________________
_______________

# Hint
add_mod = lm(data = esdcomp, ________ ~ ______________)
summary(________)

# Solution
add_mod = lm(data = esdcomp, complaints ~ visits + gender)
summary(add_mod)
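As an aside, you can also compare the two fitted models head-to-head with a nested-model F-test; since the models differ by a single interaction term, it reproduces the same p-value of 0.324:

# F-test comparing the additive model against the interaction model
anova(add_mod, int_mod)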

Additive Model Interpretations

If done correctly, the p-value for the additive term (with no interaction term in the model) is rather high. This means that, among doctors with the same number of visits, we don't see a detectable difference in complaints between male and female doctors in this sample.

That doesn't necessarily mean there is no gender difference here--we have a relatively small sample, so there might be a small effect that we simply can't detect.

Simple Model Plots

For the sake of completeness, let's also look at the simple models and their plots.

mod_visits = lm(data = esdcomp, complaints ~ visits)
summary(mod_visits)

ggplot(data = esdcomp, aes(x = visits, y = complaints)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
mod_gender = lm(data = esdcomp, complaints ~ gender)
summary(mod_gender)

ggplot(data = esdcomp, aes(x = gender, y = complaints, color = gender)) +
  geom_jitter(width = 0.05)

Simple model interpretations

We do see evidence that more visits correlate with more complaints. That should make sense!

We don't see any simple relationship between gender and complaints.

With just these two predictors, it seems like a simple model using visits to predict complaints is the most sensible! It would be interesting, though, to recheck these findings with:

  • A larger sample of doctors, especially one with more than 14 female doctors
  • Other potential predictors.

Return Home

That's all folks!

If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/