Introduction to Linear Modeling

A Review of Scatterplots

Introduce the Goals

Linear Regression (you may also hear it called "Linear Modeling") is the statistical process of fitting a linear equation that uses one (or more) predictor variables to model and predict a response variable.

We focus first on "Simple" Linear regression, where we only have one predictor variable to model and predict one response variable. This case is also easy to visualize!
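In symbols, a fitted simple linear regression is just the equation of a line, where x is the predictor, y-hat is the predicted response, b0 is the intercept, and b1 is the slope:

```latex
\hat{y} = b_0 + b_1 x
```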

Note that if we are using one variable to model and predict the other, we typically put the "response" variable on the y axis and the "predictor" variable on the x axis.

Introduce some Data

Let's take a look at two numeric variables and use one to understand and predict the other. The palmerpenguins package contains a built-in dataset named penguins that has body measurements for 344 penguins.
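Before plotting, it can help to peek at the data to see the exact variable names (a quick sketch, assuming the palmerpenguins package is installed):

```r
library(palmerpenguins)  # provides the penguins dataset

# 344 rows; numeric columns include body_mass_g and flipper_length_mm
str(penguins)
head(penguins)
```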


One question we might have is how we might model (predict) a penguin's flipper length from other penguin features. The code below creates a scatterplot that does that, using body_mass_g as the predictor.

ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point()

Formatting Reminders

As a reminder, we can format scatterplots in lots of ways. We can add titles and labels, color, and even plot themes.

You can also add some transparency to the data if some of your points are overlapping.

ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(color = "purple", alpha = 0.3) +
  labs(title = "Flipper Length by Body Mass", y = "Flipper Length (mm)", x = "Body Mass (g)") +
  theme_bw()

Time to Practice!

Fill in the blanks so you successfully plot a scatterplot with the diabetes dataset, with weight on the x axis and ratio on the y axis. Add some transparency with alpha, make the points blue, and title the plot "Weight and Ratio".

Hint: Click "Hints" for a scaffold, and click through until you get to the full solution as needed!

Also remember to include quotation marks around the color name--color values are strings, so without the quotes the code will not run properly!

ggplot(data = _______, aes(x = _____, y = _____)) +
  geom_point(_______ = 0.5, color = _____) +
  labs(title = _________)
ggplot(data = diabetes, aes(x = _____, y = _____)) +
  geom_point(alpha = 0.5, color = "____") +
  labs(title = _________)
ggplot(data = diabetes, aes(x = weight, y = _____)) +
  geom_point(alpha = 0.5, color = "____") +
  labs(title = "_________")
ggplot(data = diabetes, aes(x = weight, y = ratio)) +
  geom_point(alpha = 0.5, color = "blue") +
  labs(title = "Weight and Ratio")

On your own!

The prostate dataset contains various health data for 97 men.

Create a scatterplot that has age as its x variable and lcavol as its y variable. Make the points purple. Lastly, title your plot "Prostate age and Cancer Volume".

ggplot(______________, aes(_________, __________)) +
  geom_point(_________) +
  labs(_________)
ggplot(data = prostate, aes(x = ____, y = _____)) +
  geom_point(color = _____) +
  labs(title = ________)
ggplot(data = prostate, aes(x = age, y = lcavol)) +
  geom_point(color = "purple") +
  labs(title = "Prostate age and Cancer Volume")

Creating a Simple Model

Creating a Numeric Model

Let's now create a linear model with the two variables we plotted from the penguins data. To do that, we will use the lm function in R. Typically, we start with the formula argument, followed by the data argument to specify the data frame that contains the variables we are calling on.

The formula argument lists the response variable ~ predictor variable (note that the order is important there!).

If we run this, we'll get the model coefficients.

lm(formula = flipper_length_mm ~ body_mass_g, data = penguins)

Typically though, we'd like much more information. Go ahead and save your model under a descriptive name, and then run a summary of that model. We will get not only the coefficients, but also t-tests for each (with the null hypothesis that the true intercept and slope parameters are 0), r squared, and a few other things.

Notice that you can also drop the argument names as long as you write them in this order.

flip_mod = lm(flipper_length_mm ~ body_mass_g, penguins)
summary(flip_mod)

We can see from this output that we are highly confident body mass has a non-zero slope when used as a predictor for flipper length. The p-value for that test is below the minimum level R reports (2.2x10^-16, or about 0.00000000000000022). We can also see the R squared is around 75.9% (or 75.83% if reporting the adjusted value, which penalizes for correlation likely due to random chance).
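If you want to pull the fitted intercept and slope out programmatically (rather than reading them off the printed summary), coef() returns them as a named vector; the values noted in the comments are approximate:

```r
library(palmerpenguins)

flip_mod = lm(flipper_length_mm ~ body_mass_g, penguins)

coef(flip_mod)
# The intercept is roughly 136.7 mm, and the body_mass_g slope is
# roughly 0.0153 mm per gram: each extra gram of body mass is
# associated with about 0.015 mm more flipper length.
```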

Adding a line of best fit

The linear model we created in the previous section reported the intercept and slope of the best fit line (using the least squares method) for fitting body mass as a predictor of flipper length. We can see what this line looks like on our scatterplot by using the geom_smooth function in ggplot2 and choosing the lm method for fitting this line.

ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(color = "seagreen") +
  labs(title = "Flipper Length by Body Mass", y = "Flipper Length (mm)", x = "Body Mass (g)") +
  theme_bw() +
  geom_smooth(method = "lm")

You can ignore the message here--geom_smooth is just letting us know that it is defaulting to the formula y ~ x since we didn't fill in the formula argument. (If we had also left out the method argument, geom_smooth would default to a non-linear loess smoother instead of a line.)
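If you'd rather not see that message at all, you can spell out the (default, linear) formula yourself inside geom_smooth:

```r
library(ggplot2)
library(palmerpenguins)

# Supplying formula = y ~ x explicitly suppresses the default-formula message
ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(color = "seagreen") +
  theme_bw() +
  geom_smooth(method = "lm", formula = y ~ x)
```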

More geom_smooth options

Note that the default is to include standard-error shading around the best fit line. Yes--even our slope is a sample statistic estimating the true relationship (the one we would see if we had every penguin in this population), and we can calculate the standard error of our slope value and visualize it on the plot!
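If you want the slope's standard error as a number rather than as shading, it lives in the coefficient table of the model summary (a sketch, assuming flip_mod was fit as in the earlier section):

```r
library(palmerpenguins)

flip_mod = lm(flipper_length_mm ~ body_mass_g, penguins)

# The coefficient table has columns Estimate, Std. Error, t value, Pr(>|t|)
coef(summary(flip_mod))["body_mass_g", "Std. Error"]

# A 95% confidence interval for the slope comes from confint()
confint(flip_mod, "body_mass_g", level = 0.95)
```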

You can toggle the se argument to FALSE if you'd like to remove the standard error shading for our best fit line.

You can also change the color of the line if you wish. The default is a shade of blue.

ggplot(data = penguins, aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point(color = "seagreen") +
  labs(title = "Flipper Length by Body Mass", y = "Flipper Length (mm)", x = "Body Mass (g)") +
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE, color = "gold3")

Predicting values

We can also use our numeric model (the one we saved under flip_mod) to find predictions for flipper length given a body mass. The predict function will take an inputted predictor value and then find the point on the best fit line as a best estimate for the response variable.

The predict function takes two arguments--the name of the model, and newdata, in which we input what predictor value (or values) we wish to make a response variable estimate from.

Let's do that with our model, where we will predict the flipper length of a penguin with a body mass of 4000g.

flip_mod = lm(flipper_length_mm ~ body_mass_g, penguins)

predict(flip_mod, newdata = data.frame(body_mass_g = 4000))
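As a sanity check, the prediction is just the point on the fitted line at body_mass_g = 4000, so you can reproduce it by hand from the coefficients (a sketch, assuming the flip_mod model above):

```r
library(palmerpenguins)

flip_mod = lm(flipper_length_mm ~ body_mass_g, penguins)

# predict() evaluates the fitted line at body_mass_g = 4000
predict(flip_mod, newdata = data.frame(body_mass_g = 4000))

# The same number by hand: intercept + slope * 4000 (around 198 mm)
coef(flip_mod)[1] + coef(flip_mod)[2] * 4000
```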

Try a numeric model

Now, without the code for reference, create a numeric linear model to predict the bill length of a penguin from its bill depth. The data is provided here for reference so you can get the appropriate variable names!

Try to create the model here, name it bill_mod, and run a summary of that model

bill_mod = lm(_________)
bill_mod = lm(bill_length_mm ~ _________, data = ________)
bill_mod = lm(bill_length_mm ~ bill_depth_mm, data = penguins)
summary(bill_mod)

Try Visualizing the model

Now, try creating a scatterplot that includes the best fit line that models the same relationship we explored before. Set the se shading to be TRUE for this one, and color the line orange.

ggplot(data = penguins, aes(x = _________, y = __________)) +
  geom_point() +
  geom_smooth(____________________)
ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = ____, se = ____, color = ______)
ggplot(data = penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "orange")


This tutorial was created by Kelly Findley from the UIUC Statistics Department. I hope this experience was helpful for you!