Modeling Binary Predictors

Visualizing Categorical Predictors

Representing Variables with Color

We may commonly wish to represent other variables in a scatterplot. When modeling with 2 or more variables, we would use "Multiple Regression."

Visualizing relationships with 2 or more predictors can get difficult, as it requires more dimensions. However, if our second variable is categorical, we can easily use color to represent it in a 2D scatterplot.

Penguin Species

We can do that in our example by mapping species to color. As you might have noticed from the bill length example, the relationship appeared to decrease as bill depth increased, but this is strange--there's some confounding to uncover here!

Tip: Make sure to remove any generic coloring option in the line below (otherwise, it will just overwrite the first option).

Tip: Remember that when assigning a variable into your aesthetic line, don't put quotation marks around it. That's because a variable is a "dynamic" entry, whereas a specific color like "purple" is a static entry.

ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  labs(title = "Bill Length by Bill Depth", y = "Bill Length (mm)", x = "Bill Depth (g)") +

Modeling Separate best fit lines

Notice that if we ad the geom_smooth function back into our plot, we now have the option to fit a separate best fit line for each of these species. If we leave the color option out of geom_smooth, then it will default to adding separately colored lines for each group.

ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  labs(title = "Bill Length by Bill Depth", y = "Bill Length (mm)", x = "Bill Depth (g)") +
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE)

The default here is to plot an interaction model--in other words, we allow each group to have a unique slope in addition to a unique intercept.

We could plot an additive model (and you will see an example later), but you don't need to know how to setup code for that purpose.

Coloring by Category, but a single best fit line

If you want to add a single best fit line, rather than one for each category, you can do that by defining color in geom_smooth. It is set up exactly the same, except for an additional color argument in the last line.

ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  labs(title = "Bill Length by Bill Depth", y = "Bill Length (mm)", x = "Bill Depth (g)") +
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, color = "black")

Try on your own!

Let's take a look at the iris data. Create a scatterplot that predicts Petal.Length from Sepal.Length, and colors by Species. Add best fit lines, such that each Species gets its own best fit line. Assume interaction terms here--not an additive model.

Feel free to add whichever customization options you wish!

ggplot(data = _______, aes(x = _____, y = _____, color = ______)) +
  geom_point() +
  geom_smooth(__________, se = ______)
ggplot(data = iris, aes(x = Sepal.Length, y = _____, color = ______)) +
  geom_point() +
  geom_smooth(method = "lm", se = ______)
ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Considering an Additive Model

Our goal

Modeling with categorical predictors is also possible. For now, we work with the very simple case of having one numeric predictor and one binary predictor.

We could build a model with categorical predictor encompassing 3 or more categories, but it's a little trickier to interpret and goes beyond the scope of this class.

An example

Let's first visualize an example that we will model. Predicting bill length based on bill depth and sex

ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = sex)) +
  geom_point() +

Species Confounding and NA on Sex

As we already noticed, species is a big confounder in this relationship, and it might make it difficult to represent and model the realtionship between these variables with species unaccounted.

We also notice that that there are a few NA (sex unrecorded) penguins here.

For that reason, we are going to proceed with just one species: "Adelie", and let's also eliminate penguins with no identified sex. We can create a subset to use for the remaining exercises that condenses down to these specifics!

Adelie = subset(penguins, species == "Adelie" & sex != "NA")

Visualizing with Adelie

Let's now visualize with just the Adelie penguins, but again using bill depth as a numeric predictor and coloring by sex.

We will first build a simple model and ignore sex in building this model.

ggplot(data = Adelie, aes(x = bill_depth_mm, y = bill_length_mm, color = sex)) +
  geom_point() +
  theme_classic() +
  geom_smooth(method = "lm", color = "black", se = FALSE)

So now we ask:

  • Is the intercept different if we allowed both groups to have different best fit lines? (Additive model)
  • Would their slopes be different if we looked at the correlations between bill depth and bill length for each sex separately? (Interaction model)

Visualizing an Additive Model

As discussed in the notes, we can try an additive model, where we tell our model to assume each group (in this case, male and female penguins) have the same slope between bill depth and bill length. What we would allow is for each group to have a different intercept.

Let's visualize what this would look like if we fit an additive model. The code is a little complicated to set up, but don't worry about that part. Just focus on the plot itself and how it's different when we allow for differences in intercept!

fitlm = lm(data = Adelie, bill_length_mm ~ bill_depth_mm + sex)
Adelie$predlm = predict(fitlm)

ggplot(data = Adelie, aes(x = bill_depth_mm, y = bill_length_mm, color = sex)) +
  geom_point() +
  theme_classic() +
  geom_line(aes(y = predlm), size = 1)

When accounting for sex, the relationship looks much weaker!

Building an Additive Model

Let's now look at what happens when we create an additive model and look at the summary. This will look very much the same as with simple linear regression--we just add a + followed by our binary predictor.

add = lm(data = Adelie, bill_length_mm ~ bill_depth_mm + sex)

Interpreting the coefficient

Note that the coefficient for our binary variable represents an intercept adjustment to be applied if the category listed is true. So, if the penguin is male, we should increase the intercept by that value (2.9). If the penguin is not that category (in this case, female), then that value should be dropped. The best fit line is essentially:

bill length estimate = 35.2589 + 0.1134*(bill depth) + 2.9683(sex) where sex is 1 if male and 0 if female.

Interpreting the t-test

The t-test for bill depth is still testing whether the slope could be different from 0, but after accounting for sex. In other words, given a penguin is male, is there a correlation between bill depth and bill length? Same for female.

The p-value is not very low anymore (0.537), meaning that it's very plausible that there is no linear relationship here when looking at each sex specifically. It's plausible that the true slope parameter could be 0.

The t-test for sex is determining whether there truly is a difference in average bill length after accounting for bill depth. In other words, if two penguins had the same bill depth, is there a difference in bill length on average?

The p-value is quite low (5.55 x 10^-10, or 0.000000000555), so we would reject the null hypothesis and conclude there is a difference on average between male and female penguins' bill length, even after controlling for bill depth.

Building an Interaction Model

Visualize it first!

Should we try an interaction model? In other words, should we allow each sex to have a unique slope?

Let's look at what that would look like visually. This one is easy to plot, as I just make my plot and allow each group to have it's own color.

ggplot(data = Adelie, aes(x = bill_depth_mm, y = bill_length_mm, color = sex)) +
  geom_point() +
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE)

We definitely see some differences in slope when we allow it, but not particularly big differences. Let's see what our model says about whether this interaction term is likely making a difference.

Model it!

To model an interaction, we can make one small tweak. Just change the + to a * in the model argument.

int = lm(data = Adelie, bill_length_mm ~ bill_depth_mm * sex)

The model adds one term now that acts as a slope adjustment if penguins are male.

bill length estimate = 31.1671 + 0.3456*(bill depth) + 10.8536(sex) - 0.4311(bill depth * sex) where sex is 1 if male and 0 if female.

Interpret it!

We now have three terms to make sense of. Note that male is the level assigned to 1 (as it is the category showing up in the output, so female must be the base level).

The intercept and bill depth slope would represent the line for female penguins.

The sexmale line represents an intercept adjustment for male penguins. Do they indeed have a difference on average?

The bill_depth_mm:sexmale line represents a slope adjustment for male penguins. Do they indeed have different slopes?

The slope for females has a p-value of 0.202. Not very convincing evidence that bill depth is linearly correlated with bill length when looking at female penguins--maybe it is, but the amount of correlation we see here with this sample size could be random chance correlation.

The t-test for an intercept adjustment for males has a p-value of 0.109. More borderline, and visually/contextually, it makes sense that male penguins might have a different bill length on average than females, even when looking at penguins with the same bill depth. Not strong, but perhaps mild to moderate evidence.

The t-test for the slope adjustment produces a p-value of 0.243. Note that our t-test for a slope in general after controlling for sex is not convincing to begin with. With that in mind, we wouldn't expect a sex-based difference here if there is no evidence for a slope in general.

What did we learn?

The interaction model was probably unnecessary. There is no slope adjustment needed for each sex, and it's not even clear there is a slope here at all after controlling for sex!

There is definitely some evidence for a sex difference on average with our response variable, bill length.

The additive model is ok, but we probably don't even need the numeric predictor. After controlling for sex, bill depth doesn't seem to correlate with bill length. In other words, when looking at the Adelie species, sex is helpful in predicting bill length, but bill depth isn't.

ggplot(data = Adelie, aes(x = sex, y = bill_length_mm, color = sex, fill = sex)) +
  geom_jitter(width = 0.1) +
  geom_boxplot(width = 0.15, 
               color = "black",
               size = 1,
               position = position_nudge(x = 0.25)) +
  theme_classic() +
  theme(legend.position = "none")

Correlation vs. Causation!

When we didn't take sex into account, we saw an association between bill depth and bill length--but where did it go?

The likely explanation for this association is because female penguins tend to have lower bill length and bill depth, and male penguins tend to have higher values for each.

But once accounting for sex, we see that higher bill depth doesn't necessarily indicate higher bill length. The general association was really just picking up on differences in sex.

This is a simple example of doing causal inference with only observational study data. By controlling for strategic predictors, we can determine if a variable could reasonably be causing a difference, or whether it's some other factor that explains differences.


This tutorial was created by Kelly Findley from the UIUC Statistics Department. I hope this experience was helpful for you!