In this tutorial, we will:
Create scatterplots with the geom_point function.
Fit simple linear models with the lm function.
Linear regression (you may also call it "linear modeling") is the statistical process of fitting and predicting a numeric response variable using one or more predictor variables in the form of a linear equation.
If we are only using one predictor variable, then we can call it a "Simple" model.
If the response variable is numeric, and we are using the equation of a line to model its relationship with one predictor, then it is called a "Simple Linear" Model.
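In equation form, a simple linear model predicts the response from an intercept plus a slope times the single predictor:

\(\hat{y} = b_0 + b_1 x\)

where \(\hat{y}\) is the predicted response, \(b_0\) is the intercept, and \(b_1\) is the slope.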
The palmerpenguins package contains a built-in dataset named penguins with body measurements for 344 penguins.
library(palmerpenguins)
penguins
One question we might have: how might we model (predict) a penguin's flipper length from other penguin features?
Let's focus on using body mass as a feature that may help us predict flipper length.
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point()
As a reminder, we can add some formatting to make this look cleaner, or just make some stylistic choices!
The alpha argument is useful if our plot has a lot of overlapping dots and we want to see the density of points more clearly.
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(color = "purple", alpha = 0.3) +
labs(title = "Flipper Length by Body Mass", y = "Flipper Length (mm)", x = "Body Mass (g)") +
theme_bw()
The transparency isn't really needed for this one; it's just here to demonstrate it. :)
The geom_smooth function is one way to add a best-fit line to our plot. By default, the line it produces is fit using the least-squares criterion. We just need two arguments:
method = "lm" (lm just stands for linear model, as opposed to some other type of model).
formula = y ~ x (this just means we aren't transforming the predictor or the response).
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(color = "seagreen") +
labs(title = "Flipper Length by Body Mass", y = "Flipper Length (mm)", x = "Body Mass (g)") +
theme_bw() +
geom_smooth(method = "lm", formula = y~x)
I typically leave out the formula = y ~ x argument, since that is the default relationship for geom_smooth. So feel free to just leave that out! It will print a message letting you know it's defaulting to a simple linear fit.
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(color = "seagreen") +
labs(title = "Flipper Length by Body Mass", y = "Flipper Length (mm)", x = "Body Mass (g)") +
theme_bw() +
geom_smooth(method = "lm")
Here are two other things you can customize:
se = FALSE removes the shaded confidence band around the line (usually what I do).
color = "black", or whatever color you want, sets the line color.
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
geom_point(color = "seagreen") +
labs(title = "Flipper Length by Body Mass", y = "Flipper Length (mm)", x = "Body Mass (g)") +
theme_bw() +
geom_smooth(method = "lm", se = FALSE, color = "gold3")
Create a scatterplot using the diabetes dataset, setting your x axis as weight and your y axis as ratio.
Hint: Click "Hints" for a scaffold, and click through until you get to the full solution as needed!
ggplot(data = _______, aes(_______________)) +
geom_point(_______) +
_____________
ggplot(data = diabetes, aes(x = weight, y = ratio)) +
geom_point(_______) +
_____________
ggplot(data = diabetes, aes(x = weight, y = ratio)) +
geom_point(alpha = 0.3) +
geom_smooth()
ggplot(data = diabetes, aes(x = weight, y = ratio)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = FALSE)
To get the coefficients of a best-fit line, along with a few other things, we can create a linear model with the lm function. This function takes:
data: the name of the data frame containing the variables you are modeling.
formula: the model formula, which will be in the form response ~ predictor.
Let's create a simple linear model that predicts flipper length based on a penguin's body mass.
Running the code by itself will just print the intercept and slope coefficients.
lm(data = penguins, formula = flipper_length_mm ~ body_mass_g)
It means our best-fit line predicting flipper length from body mass would be:
\(\hat{y} = 136.73 + 0.01528 \cdot \text{body\_mass}\)
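As a quick sanity check, we can plug a body mass of 4000 g into that fitted line by hand, using the rounded coefficients shown above:

```r
# Evaluate the fitted line at body_mass = 4000 g, using the rounded
# coefficients from the text (predict() will give a slightly different
# answer since it uses the unrounded coefficients)
intercept = 136.73
slope = 0.01528
intercept + slope * 4000
#> [1] 197.85
```

So a 4000 g penguin would be predicted to have a flipper length of about 197.85 mm.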
Typically, though, we'd like much more information. Go ahead and save your model to a descriptive name, and then run summary on that model.
Notice that I'm also going to drop the formula argument name from here on, since R will recognize your input for it positionally.
flip_model = lm(data = penguins, flipper_length_mm ~ body_mass_g)
summary(flip_model)
Here are three lines to focus on. Notice that the intercept and slope coefficients are in the "Estimate" column.
We won't be using anything else in this output for now.
The t-test on the slope line is testing the null hypothesis that the slope is 0. The p-value at the end of that line is extremely low; in fact, < 2e-16 is the smallest p-value R will display in summary output. This means we're VERY confident that there is at least some linear relationship here.
The adjusted R-squared is 75.83%: knowing body mass helps us explain 75.83% of the variance in flipper length, after a small adjustment that accounts for the apparent improvement we'd expect by chance from adding predictors. That's quite a bit!
The F-test on the bottom line is testing whether the model as a whole is performing better than the "null model" (just using the mean of the response variable as our best prediction).
For our class, you can ignore this F-test result, as we'll focus on individual predictors, or just R squared to measure the full model's accuracy.
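If you want the coefficients programmatically instead of reading them off the summary output, base R's coef function returns them as a named vector. A quick sketch (assuming the palmerpenguins package is loaded):

```r
library(palmerpenguins)

flip_model = lm(data = penguins, flipper_length_mm ~ body_mass_g)

# coef() returns a named vector: "(Intercept)" and "body_mass_g"
coef(flip_model)

# Pull out just the slope by name
coef(flip_model)["body_mass_g"]
```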
How might we set up a linear model to predict the HDL/LDL cholesterol ratio based on weight in the diabetes dataset?
mod_ratio = __________________
______________
mod_ratio = lm(data = __________, _______ ~ _______)
summary(_________)
mod_ratio = lm(data = diabetes, ratio ~ _______)
summary(_________)
mod_ratio = lm(data = diabetes, ratio ~ weight)
summary(mod_ratio)
We can also use our model to make a model-based prediction for the response from a plugged-in predictor value.
The predict function takes an inputted predictor value and finds the point on the best-fit line as a best estimate for the response variable. It takes two arguments:
object: the model we created that we wish to use as the basis for our prediction.
newdata: a data frame of the predictor value(s) we are plugging in.
For a simple model, newdata will be a data frame with only one column: our one predictor!
Let's do that with our flip_model, where we will predict the flipper length of a penguin with a body mass of 4000 g.
flip_model = lm(data = penguins, flipper_length_mm ~ body_mass_g)
predict(object = flip_model,
newdata = data.frame(body_mass_g = 4000))
The newdata argument is also flexible enough to accept multiple predictor values, allowing you to make multiple predictions at once.
flip_model = lm(data = penguins, flipper_length_mm ~ body_mass_g)
predict(object = flip_model,
newdata = data.frame(body_mass_g = c(4000, 4100, 4300)))
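If you'd like to keep those predictions next to the inputs, one option (a sketch using only base R) is to save the new data frame first and attach the predictions as a column:

```r
library(palmerpenguins)

flip_model = lm(data = penguins, flipper_length_mm ~ body_mass_g)

# Save the new predictor values, then store the predictions alongside them
new_penguins = data.frame(body_mass_g = c(4000, 4100, 4300))
new_penguins$predicted_flipper_mm = predict(flip_model, newdata = new_penguins)
new_penguins
```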
This tutorial was created by Kelly Findley from the UIUC Statistics Department. I hope this experience was helpful for you!
If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/