Multiple Linear Regression

Introduction

In the previous tutorial, we focused on modeling just one predictor, but we can add more predictors to our model as well--it just gets harder to visualize!

To create a numeric model with multiple predictors, simply use the "+" symbol to list multiple predictors.

Let's again use the penguins data.

library(palmerpenguins)
penguins

Building a Multiple Regression Model

Let's focus on flipper length as our response variable. We want to buidl a model that can provide some predictive power to estimate a penguin's flipper length.

Let's start by using two predictors: body mass and bill length.

Notice the format here--response variable before the ~ and the predictors after, connected with a +. The data argument can be made anywhere inside (either before or after the variable entries is fine).

Mod_2 = lm(flipper_length_mm ~ bill_length_mm + body_mass_g, data = penguins)
summary(Mod_2)

Interpreting Output

As we learn in the notes, we now have to interpret these coefficients as representing relationship after controlling for other predictors.

For every one mm increase in bill length, we expect flipper length to be 0.5492 mm higher on average, after controlling for body mass.
In other words, if looking at two penguins of the same body mass, this is the relationship we observe.
For every one gram increase in body mass, we expect flipper length to be 0.01305 mm higher on average, after controlling for bill length.

Interpreting t-tests

The t-test for bill length (in this model) is testing the null hypothesis that the slope for bill length is 0 after controlling for body mass.

Since the p-value is 3x10^-11, then we have **very* strong evidence that bill length still has some linear relationship with flipper length, even after controlling for body mass.
The t-test for body mass likewise has a very low p-value, so we have evidence that body mass is also explaining variability (through a linear relationship) in flipper length, even after controlling for bill length.

Predictions in Multiple Regression

We can again use the predict function to predict our response, using the predictor variables fitted.

First: Make sure our model is defined and saved to a model name. The name can be whatever, but I will name it Mod_2 since this model has 2 predictors.

Second: Use the predict function to identify the model I want to use for predictions, and then enter the inputs. Those inputs will be entered as a data frame and inside the newdata argument.

#Create the model and name it
Mod_2 = lm(flipper_length_mm ~ bill_length_mm + body_mass_g, data = penguins)

#Use predict and enter the inputs
predict(object = Mod_2, newdata = data.frame(bill_length_mm = 50, body_mass_g = 4000))

Build a model!

Now you can try building a model! Let's try with the diamonds dataset.

diamonds

Build a model with carat and depth as predictors of price

Then use predict to predict the price of a diamond with 1 carat and a depth of 60.5

Model = ...

predict(...)

Model = lm(data = ________, price ~ ____________)

predict(object = _________, newdata = ______________)

Model = lm(data = diamonds, price ~ carat + _______)

predict(object = Model, newdata = data.frame(______________))

Model = lm(data = diamonds, price ~ carat + depth)

predict(object = Model, newdata = data.frame(carat = 1, depth = 60.5))

Building Complexity

Modeling in categorical predictors

As we've seen, we can also model in categorical predictors. If we wish to use categorical predictors as additive terms, it's quite straightforward!

Let's look at a case where we compare species. Since species actually has 3 categories, let's first subset it down to 2 to make it easier for us to work with. We will use this subsetted version for many of the remaining examples.

#Create subsetted data first for this specific example
penguins_2 = subset(penguins, species == "Adelie" | species == "Gentoo")

#Model
Mod_3 = lm(flipper_length_mm ~ bill_length_mm + body_mass_g + species, data = penguins_2)
summary(Mod_3)

Interpreting coefficients

predicted flipper length = 148.6 + 0.4861(bill length) + 0.00608(body mass) + 14.63(species)

where species = 1 if Gentoo and 0 if Adelie

We now have a term representing the average difference in flipper length between Adelie and Gentoo penguins (after controlling for bill length and body mass).

Apparently, the Gentoo penguins tend to have a longer flipper length on average compared to Adelie penguins, assuming this comparison is for Adelie and Gentoo penguins of the same body mass and bill length.

Note: We can add categorical variables with 3+ categories in as well! We just don't cover that situation in our class, as it is more complex to interpret.

Additive or Interaction?

In some cases, our categorical variable might interact with another predictor in the model. In other words, the predictor might have a different effect on the response variable depending on what's going on with another predictor.

How do we know whether we should include a categorical predictor in interaction with another numeric predictor?

Comparing Choices

Here is the model we just ran. It includes species as an additive predictor only

Mod_3 = lm(flipper_length_mm ~ bill_length_mm + body_mass_g + species, data = penguins_2)
summary(Mod_3)

This model allows the variable species to interact with bill_length_mm

Mod_3a = lm(flipper_length_mm ~ bill_length_mm*species + body_mass_g, data = penguins_2)
summary(Mod_3a)

This model allows the variable species to interact with body_mass_g

Mod_3b = lm(flipper_length_mm ~ bill_length_mm + body_mass_g*species, data = penguins_2)
summary(Mod_3b)

This model allows species to interact with both

Mod_3c = lm(flipper_length_mm ~ bill_length_mm *species + body_mass_g*species, data = penguins_2)
summary(Mod_3c)

What to make of these results?

There's a lot of output here. How do we make sense of it?

An easy thing to pick up on is the adjusted R squared. If adjusted R squared went down when we added complexity (or maybe stayed the same, or up by a very tiny bit), then the complexity is probably not worth it. We should stick with the simpler model.

Seems like allowing species to interact with bill length might give us some additional predictive power (from about 85% with additive to 88% with that interaction included). But the interaction with body mass is not gaining us anything.

We could also note the p-value for the interaction term with body mass is not low either (0.143 in the partial interaction model, and 0.926 in the full interaction model). Again, signaling we have only weak evidence of an interaction there.

Predictions still possible!

Making model predictions would follow a similar procedure. Just make sure to include your categorical variable in the entries!

Let's make a model prediction for flipper length when looking a penguin with a bill length of 50mm, body mass of 4000g, and is of the species Adelie.

Note that it takes only a single = sign in this argument. In contrast to subsetting, where you need the == to select a category.

#Create the model and name it
Mod_3a = lm(flipper_length_mm ~ bill_length_mm*species + body_mass_g, data = penguins_2)

#Use predict and enter the inputs
predict(object = Mod_3a, newdata = data.frame(bill_length_mm = 50, body_mass_g = 4000, species = "Adelie"))

Assumption Violations in Regression

Checking assumptions in MLR

When doing multiple regression, it gets harder to notice assumption violations since we can't visualize the full model. Let's review what we want to check, and when we should be concerned about model reliability.

New Data Context:

Let's take a look at the BostonHousing data in the mlbench package. Each row represents one housing community in Boston in the 1970s. Among some notable variables are:

medv median home value (in thousands of $)
crim per capita crime rate
rm average number of rooms per dwelling
indus proportion of non-retail business acres per town

data(BostonHousing)
BostonHousing

Linearity

If we fit a variable using a linear term, it's still worth checking that variable as a solo predictor to see if a linear term makes sense.

Let's explore each of these variables as pedictors of medv to see what this individual relationship looks like.

ggplot(data = BostonHousing, aes(x = crim, y = medv)) +
  geom_point()

ggplot(data = BostonHousing, aes(x = rm, y = medv)) +
  geom_point()

ggplot(data = BostonHousing, aes(x = indus, y = medv)) +
  geom_point()

crim is a fairly skewed variable (most are around 0, with some skewing positively). That said, there isn't some obvious non-linear relationship here. Though a predictor variable transformation could improve this fit.
rm is very well modeled with a linear term.
The indus variable might be better fit with a polynomial term rather than a linear term (this goes beyond our course, but we can still make the observation!)

As an example...

This is beyond the scope of this course, but notice that the natural log of crime rate fits a linear relationship more clearly

ggplot(data = BostonHousing, aes(x = log(crim), y = medv)) +
  geom_point()

Making a Residual Plot

To check for equal variance and normality of residuals, we need to make a residual plot. Rather than using ggplot, we will make a simple base R plot. This isn't for presentation--it's just for us to check model diagnostics!

First, we need to create our model and save it to a name.

Then, use the plot function, where the first entry will be our x variable, and the second will be our y variable.

The inputs we will use will be the fitted values of our model, and the residuals of our model.
These can be inputted by simply using the functions fitted.values(mod) and resid(mod) where we input the model name we created.

Optionally, you can add a horizontal line at residuals = 0 to represent where our model is predicted these values to be.

We use the functionabline() with the argument h = 0

Mod_3a = lm(medv ~ crim + rm + indus, data = BostonHousing)

plot(fitted.values(Mod_3a), resid(Mod_3a))
abline(h = 0)

Why these diagnostics matter

One reason we might check a residual plot is to see whether variance in our residuals is constant across the range of fitted values, or to see if the residuals are normally distributed about the best fit line.

It's helpful when these things are true because:

If we have violations of one or both, that means our predictions could have much more error in certain ranges, and we might want to know that!
It also means our slope coefficients might have more error than our standard error value suggests.

Constant Variance (Homoscedasticity) in residuals

Looking at the previous model, the variance is much higher on the lower end of the fitted values range. It's not wildly different, but enough that it might matter some.

Normality of residuals

This is the larger violation in this particular model. Most of the residuals are clustered near the line, but there is some positive skewness here.

The line is also not centered across the range--instead, there is clearly a curved relationship in the residuals. That is likely because indus was fit with a linear term when that particular relationship is probably quadratic (curved).

Handling Assumption violations

As a reminder, assumption violations don't mean the model is worthless. It simply means that:

Our coefficient estimates may have some bias, and
The SE calculations may underestimate the true error in our estimates.

If having unbiased predictions and accurate t-test results are important, we could consider variable transformations

Example with Log transformation

We could run this same model again, but take a log transformation of the response variable (and also of crim)

Mod_3a = lm(log(medv) ~ log(crim) + rm + indus, data = BostonHousing)

plot(fitted.values(Mod_3a), resid(Mod_3a))
abline(h = 0)

Log transformations can often fix non-normality in the residuals. Notice how much more centered the line of best fit is in this model!

They may also help with non-constant variance in the residuals, though in this case, that seems to still be present here. So the error in our predictions will definitely depend on what range of values we are making predictions for.

Model fitting can be quite complex! In the end, assumption violations are not critical problems. But it helps to have an awareness of how they affect the reliability of your coefficients and predictor t-tests.

If you take a statistical modeling course, you can learn more sophisticated methods to fit models and resolve these types of violations.

One more practice

Create a model with the iris data.

iris

Build a model to predict Petal.Length using Sepal.Length, Petal.Width and Species as predictors. Leave Species only as an additive predictor--no interaction term.

Call this model Model, and then create a residual plot of your model.

Model = ...

plot(...)
abline

Model = lm(data = ______, Petal.Length ~ ____________)

plot(_______, _______)
aline(_______)

Model = lm(data = iris, Petal.Length ~ Sepal.Length + _______ + ______)

plot(fitted.values(Model), _______)
abline(_______)

Model = lm(data = iris, Petal.Length ~ Sepal.Length + Petal.Width + Species)

plot(fitted.values(Model), resid(Model))
abline(h = 0)

This model has very good diagnostics. Note that we have a gap in fited values where we have no plants modeled in that range, but that is no violation! Where we do have data, we have constant variance and normally distributed residuals.

Acknowledgment

This tutorial was created by Dr. Kelly Findley of the UIUC Statistics department. I hope this experience was helpful for you!