In the previous tutorial, we focused on modeling just one predictor, but we can add more predictors to our model as well--it just gets harder to visualize!
To create a numeric model with multiple predictors, simply use the "+" symbol to list multiple predictors.
Let's again use the penguins
data.
library(palmerpenguins)
penguins
Let's focus on flipper length as our response variable. We want to buidl a model that can provide some predictive power to estimate a penguin's flipper length.
Let's start by using two predictors: body mass and bill length.
Notice the format here--response variable before the ~
and the predictors after, connected with a +
. The data argument can be made anywhere inside (either before or after the variable entries is fine).
Mod_2 = lm(flipper_length_mm ~ bill_length_mm + body_mass_g, data = penguins)
summary(Mod_2)
As we learn in the notes, we now have to interpret these coefficients as representing relationship after controlling for other predictors.
For every one mm increase in bill length, we expect flipper length to be 0.5492 mm higher on average, after controlling for body mass.
In other words, if looking at two penguins of the same body mass, this is the relationship we observe.
For every one gram increase in body mass, we expect flipper length to be 0.01305 mm higher on average, after controlling for bill length.
The t-test for bill length (in this model) is testing the null hypothesis that the slope for bill length is 0 after controlling for body mass.
We can again use the predict
function to predict our response, using the predictor variables fitted.
First: Make sure our model is defined and saved to a model name. The name can be whatever, but I will name it Mod_2
since this model has 2 predictors.
Second: Use the predict
function to identify the model I want to use for predictions, and then enter the inputs. Those inputs will be entered as a data frame and inside the newdata
argument.
#Create the model and name it
Mod_2 = lm(flipper_length_mm ~ bill_length_mm + body_mass_g, data = penguins)
#Use predict and enter the inputs
predict(object = Mod_2, newdata = data.frame(bill_length_mm = 50, body_mass_g = 4000))
Now you can try building a model! Let's try with the diamonds
dataset.
diamonds
Build a model with carat
and depth
as predictors of price
Then use predict to predict the price of a diamond with 1 carat and a depth of 60.5
Model = ...
predict(...)
Model = lm(data = ________, price ~ ____________)
predict(object = _________, newdata = ______________)
Model = lm(data = diamonds, price ~ carat + _______)
predict(object = Model, newdata = data.frame(______________))
Model = lm(data = diamonds, price ~ carat + depth)
predict(object = Model, newdata = data.frame(carat = 1, depth = 60.5))
As we've seen, we can also model in categorical predictors. If we wish to use categorical predictors as additive terms, it's quite straightforward!
Let's look at a case where we compare species
. Since species actually has 3 categories, let's first subset it down to 2 to make it easier for us to work with. We will use this subsetted version for many of the remaining examples.
#Create subsetted data first for this specific example
penguins_2 = subset(penguins, species == "Adelie" | species == "Gentoo")
#Model
Mod_3 = lm(flipper_length_mm ~ bill_length_mm + body_mass_g + species, data = penguins_2)
summary(Mod_3)
predicted flipper length = 148.6 + 0.4861(bill length) + 0.00608(body mass) + 14.63(species)
where species = 1 if Gentoo and 0 if Adelie
We now have a term representing the average difference in flipper length between Adelie and Gentoo penguins (after controlling for bill length and body mass).
Apparently, the Gentoo penguins tend to have a longer flipper length on average compared to Adelie penguins, assuming this comparison is for Adelie and Gentoo penguins of the same body mass and bill length.
Note: We can add categorical variables with 3+ categories in as well! We just don't cover that situation in our class, as it is more complex to interpret.
In some cases, our categorical variable might interact with another predictor in the model. In other words, the predictor might have a different effect on the response variable depending on what's going on with another predictor.
How do we know whether we should include a categorical predictor in interaction with another numeric predictor?
Here is the model we just ran. It includes species
as an additive predictor only
Mod_3 = lm(flipper_length_mm ~ bill_length_mm + body_mass_g + species, data = penguins_2)
summary(Mod_3)
This model allows the variable species
to interact with bill_length_mm
Mod_3a = lm(flipper_length_mm ~ bill_length_mm*species + body_mass_g, data = penguins_2)
summary(Mod_3a)
This model allows the variable species
to interact with body_mass_g
Mod_3b = lm(flipper_length_mm ~ bill_length_mm + body_mass_g*species, data = penguins_2)
summary(Mod_3b)
This model allows species
to interact with both
Mod_3c = lm(flipper_length_mm ~ bill_length_mm *species + body_mass_g*species, data = penguins_2)
summary(Mod_3c)
There's a lot of output here. How do we make sense of it?
An easy thing to pick up on is the adjusted R squared. If adjusted R squared went down when we added complexity (or maybe stayed the same, or up by a very tiny bit), then the complexity is probably not worth it. We should stick with the simpler model.
Seems like allowing species to interact with bill length might give us some additional predictive power (from about 85% with additive to 88% with that interaction included). But the interaction with body mass is not gaining us anything.
We could also note the p-value for the interaction term with body mass is not low either (0.143 in the partial interaction model, and 0.926 in the full interaction model). Again, signaling we have only weak evidence of an interaction there.
Making model predictions would follow a similar procedure. Just make sure to include your categorical variable in the entries!
Let's make a model prediction for flipper length when looking a penguin with a bill length of 50mm, body mass of 4000g, and is of the species Adelie.
Note that it takes only a single =
sign in this argument. In contrast to subsetting, where you need the ==
to select a category.
#Create the model and name it
Mod_3a = lm(flipper_length_mm ~ bill_length_mm*species + body_mass_g, data = penguins_2)
#Use predict and enter the inputs
predict(object = Mod_3a, newdata = data.frame(bill_length_mm = 50, body_mass_g = 4000, species = "Adelie"))
When doing multiple regression, it gets harder to notice assumption violations since we can't visualize the full model. Let's review what we want to check, and when we should be concerned about model reliability.
Let's take a look at the BostonHousing
data in the mlbench
package. Each row represents one housing community in Boston in the 1970s. Among some notable variables are:
medv
median home value (in thousands of $)crim
per capita crime raterm
average number of rooms per dwellingindus
proportion of non-retail business acres per towndata(BostonHousing)
BostonHousing
If we fit a variable using a linear term, it's still worth checking that variable as a solo predictor to see if a linear term makes sense.
Let's explore each of these variables as pedictors of medv
to see what this individual relationship looks like.
ggplot(data = BostonHousing, aes(x = crim, y = medv)) +
geom_point()
ggplot(data = BostonHousing, aes(x = rm, y = medv)) +
geom_point()
ggplot(data = BostonHousing, aes(x = indus, y = medv)) +
geom_point()
crim
is a fairly skewed variable (most are around 0, with some skewing positively). That said, there isn't some obvious non-linear relationship here. Though a predictor variable transformation could improve this fit.rm
is very well modeled with a linear term.indus
variable might be better fit with a polynomial term rather than a linear term (this goes beyond our course, but we can still make the observation!)This is beyond the scope of this course, but notice that the natural log of crime rate fits a linear relationship more clearly
ggplot(data = BostonHousing, aes(x = log(crim), y = medv)) +
geom_point()
To check for equal variance and normality of residuals, we need to make a residual plot. Rather than using ggplot, we will make a simple base R plot. This isn't for presentation--it's just for us to check model diagnostics!
First, we need to create our model and save it to a name.
Then, use the plot
function, where the first entry will be our x variable, and the second will be our y variable.
fitted.values(mod)
and resid(mod)
where we input the model name we created.Optionally, you can add a horizontal line at residuals = 0 to represent where our model is predicted these values to be.
abline()
with the argument h = 0
Mod_3a = lm(medv ~ crim + rm + indus, data = BostonHousing)
plot(fitted.values(Mod_3a), resid(Mod_3a))
abline(h = 0)
One reason we might check a residual plot is to see whether variance in our residuals is constant across the range of fitted values, or to see if the residuals are normally distributed about the best fit line.
It's helpful when these things are true because:
Looking at the previous model, the variance is much higher on the lower end of the fitted values range. It's not wildly different, but enough that it might matter some.
This is the larger violation in this particular model. Most of the residuals are clustered near the line, but there is some positive skewness here.
The line is also not centered across the range--instead, there is clearly a curved relationship in the residuals. That is likely because indus
was fit with a linear term when that particular relationship is probably quadratic (curved).
As a reminder, assumption violations don't mean the model is worthless. It simply means that:
If having unbiased predictions and accurate t-test results are important, we could consider variable transformations
We could run this same model again, but take a log transformation of the response variable (and also of crim
)
Mod_3a = lm(log(medv) ~ log(crim) + rm + indus, data = BostonHousing)
plot(fitted.values(Mod_3a), resid(Mod_3a))
abline(h = 0)
Log transformations can often fix non-normality in the residuals. Notice how much more centered the line of best fit is in this model!
They may also help with non-constant variance in the residuals, though in this case, that seems to still be present here. So the error in our predictions will definitely depend on what range of values we are making predictions for.
Model fitting can be quite complex! In the end, assumption violations are not critical problems. But it helps to have an awareness of how they affect the reliability of your coefficients and predictor t-tests.
If you take a statistical modeling course, you can learn more sophisticated methods to fit models and resolve these types of violations.
Create a model with the iris
data.
iris
Build a model to predict Petal.Length
using Sepal.Length
, Petal.Width
and Species
as predictors. Leave Species
only as an additive predictor--no interaction term.
Call this model Model
, and then create a residual plot of your model.
Model = ...
plot(...)
abline
Model = lm(data = ______, Petal.Length ~ ____________)
plot(_______, _______)
aline(_______)
Model = lm(data = iris, Petal.Length ~ Sepal.Length + _______ + ______)
plot(fitted.values(Model), _______)
abline(_______)
Model = lm(data = iris, Petal.Length ~ Sepal.Length + Petal.Width + Species)
plot(fitted.values(Model), resid(Model))
abline(h = 0)
This model has very good diagnostics. Note that we have a gap in fited values where we have no plants modeled in that range, but that is no violation! Where we do have data, we have constant variance and normally distributed residuals.