In this tutorial, we will cover the following:
The ggplot2 package as a great resource for visualization functions in R
Building plots with the ggplot2 package!
Visualization is important both in the exploration and analysis phase (so we can better see what might be going on in data) and in the presentation phase (to share important insights with our audience).
Visualization will reveal patterns in the data that might not be as evident when only looking at numeric summaries!
R as a language has many built-in functions for graphing (e.g., hist, plot). These are known as "base R" functions.
hist(prostate$age)
Base R plots are great for quick and simple visualization, but they do have limitations.
As visualization tools have evolved tremendously in recent decades, many users find base R functions clunky and difficult to use for more complex visualizations. They instead turn to packages that are regularly updated for new features!
We will primarily use visualization functions from a package named ggplot2. The "gg" of ggplot2 stands for "Grammar of Graphics" because this package outlines a framework for concisely describing and coding the components of a graphic.
While you can install ggplot2 as an individual package, I would actually recommend installing tidyverse, which is a collection of packages bundled together. If you completed Lab 1, you have already done this! But if you didn't do Lab 1, make sure you have installed tidyverse, as shown in this video.
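If the video isn't handy, the setup can be sketched in a few lines. This is a minimal sketch, assuming you have an internet connection; the CRAN mirror URL is an assumption (any mirror should work), and the install step only runs when ggplot2 is missing.

```r
# Install the tidyverse bundle only if ggplot2 is not already available.
# The mirror URL below is an assumption; any CRAN mirror should work.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Load ggplot2 for this session (library(tidyverse) would load it as well).
library(ggplot2)
```

Installing only needs to happen once, but the library line needs to run in every new R session.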
On this first panel, we'll start with creating histograms--a great plot option for a single numeric variable. Let's construct our plot piece by piece.
The first line will use the ggplot function. This function requires a data argument, followed by a mapping argument.
The data argument specifies which data structure we are calling on.
The mapping argument specifies which variables we are mapping to the plot and how each variable will be represented in the plot.
ggplot(data = ..., mapping = ...)
Note that for mapping, we identify the variables we are using inside an embedded function called aes (which stands for "aesthetic").
ggplot(data = ..., mapping = aes(...))
Next, we can name a geometry (geom) to identify what form our plot will take. Since we are making a histogram, we will choose geom_histogram.
ggplot(data = ..., mapping = aes(...)) +
geom_histogram()
Let's use the prostate data that we have already seen, plot the age variable on the x axis, and choose histogram as our geometry. Run the following code to see what it looks like!
ggplot(data = prostate, mapping = aes(x = age)) +
geom_histogram()
We linked the ggplot line with the geometry line using a + sign. In ggplot commands, note that we will often link multiple commands together with +. Think of it as an "and" statement.
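To see the "and" idea in action, here is a small sketch that stores a plot in an object and then adds a geometry to it. It assumes the built-in faithful dataset as a stand-in for prostate, and the object names base_plot and hist_plot are made up for illustration.

```r
library(ggplot2)

# faithful ships with R; it stands in here for the prostate data.
base_plot <- ggplot(data = faithful, mapping = aes(x = waiting))

# Adding a geometry with + returns a new plot object that includes that layer.
hist_plot <- base_plot + geom_histogram(bins = 30)

length(base_plot$layers)  # layers before adding a geometry
length(hist_plot$layers)  # layers after adding a geometry

hist_plot  # printing the object draws the plot
```

Each + attaches one more piece to the plot, which is why the + goes at the end of a line rather than at the start of the next one.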
We aren't restricted to this dark, uninspiring color palette. We can change both the border color (set by the color argument) and the fill color (set by the fill argument).
Run the following ggplot to change the border color and fill color. Feel free to adjust the color options to other generic colors! (and yes, there is a very extensive list of color options we'll see in a later tutorial!)
ggplot(data = prostate, mapping = aes(x = age)) +
geom_histogram(color = "black", fill = "green")
If naming specific colors, be sure to write them in quotation marks. Quotation marks are often for identifying static entries.
Variable names should not be listed in quotation marks when mapped to a representation because they are dynamic entries!
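A quick way to feel the static versus dynamic difference is to put a quoted color in the wrong place. The sketch below assumes the built-in faithful dataset as a stand-in for prostate; the object names are made up, and layer_data is the ggplot2 helper that returns the values a layer is actually drawn with.

```r
library(ggplot2)

# Static entry: a quoted color name given directly to the geometry.
static_fill <- ggplot(data = faithful, mapping = aes(x = waiting)) +
  geom_histogram(fill = "green", bins = 30)

# Common slip: the same quoted color placed inside aes().
# ggplot2 now treats "green" as a data value, not as a color name.
mapped_fill <- ggplot(data = faithful, mapping = aes(x = waiting, fill = "green")) +
  geom_histogram(bins = 30)

# Compare the fill values that each plot ends up using.
unique(layer_data(static_fill)$fill)
unique(layer_data(mapped_fill)$fill)
```

In the second plot, the bars typically come out in one of ggplot2's default colors with a legend entry labeled "green", which is a good sign that a static entry ended up inside aes by accident.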
While histograms in ggplot2 will default to 30 bins if no selection is made, you might try playing around with this number until you are happy with the appearance.
Change the number of bins and notice what happens!
ggplot(data = prostate, mapping = aes(x = age)) +
geom_histogram(color = "black", fill = "green", bins = __)
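If you would rather think in terms of how wide each bar is, geom_histogram also accepts a binwidth argument in place of bins. Here is a hedged sketch using the built-in faithful dataset (an assumption, standing in for prostate); the values 10 and 5 are arbitrary choices.

```r
library(ggplot2)

# bins controls how many bars are drawn.
few_bins <- ggplot(data = faithful, mapping = aes(x = waiting)) +
  geom_histogram(color = "black", fill = "green", bins = 10)

# binwidth controls how wide each bar is (here, 5 minutes of waiting time).
by_width <- ggplot(data = faithful, mapping = aes(x = waiting)) +
  geom_histogram(color = "black", fill = "green", binwidth = 5)

# Each row returned by layer_data() describes one bar.
nrow(layer_data(few_bins))
nrow(layer_data(by_width))
```

Use one or the other: bins is handy for quick exploring, while binwidth gives bars with interpretable edges.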
The labs function allows us to add a title, axis titles, and other labels as we wish. You don't need to use all of these arguments though! Feel free to delete one or more of those arguments.
I also changed the colors just for fun. :)
ggplot(data = prostate, mapping = aes(x = age)) +
geom_histogram(color = "white", fill = "red") +
labs(title = "Age of Respondents", x = "Age", y = "Count totals")
Again, notice the use of quotation marks for labels. These are static applications to your plot, as opposed to the dynamic application that comes from a variable mapping.
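Beyond title, x, and y, the labs function also takes subtitle and caption arguments. The sketch below assumes the built-in faithful dataset rather than prostate, and the label text is made up for illustration.

```r
library(ggplot2)

labeled <- ggplot(data = faithful, mapping = aes(x = waiting)) +
  geom_histogram(color = "white", fill = "red", bins = 30) +
  labs(
    title = "Waiting Time Between Eruptions",
    subtitle = "Old Faithful geyser",          # smaller text under the title
    x = "Waiting time (minutes)",
    y = "Count",
    caption = "Data: faithful (built into R)"  # small text below the plot
  )

labeled
```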
Fill in the blanks below to create a histogram! Use the diabetes dataset and use chol (short for "cholesterol") as the variable you will be observing within your dataset.
If you are struggling, click the "Hints" button to get some help. It will eventually take you to the solution if you need it, but try to figure it out on your own first!
ggplot(data = _________, mapping = aes(___________)) +
geom_histogram(_________) +
labs(____________)
ggplot(data = _________, mapping = aes(x = _________)) +
geom_histogram(fill = "pink", _________________) +
labs(title = "________")
ggplot(data = diabetes, mapping = aes(x = _________)) +
geom_histogram(fill = "pink", color = "_______", bins = ___) +
labs(title = "________")
ggplot(data = diabetes, mapping = aes(x = chol)) +
geom_histogram(fill = "pink", color = "black", bins = 20) +
labs(title = "Cholesterol Levels")
Density curves with sample data will be like a smoothed version of a histogram.
If we don't have much data, density curves might be misleading, as they smooth out the data to suggest a distribution shape. However, if we have a lot of data, density curves can be more helpful than histograms in revealing the general trend of that variable.
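To see the "smoothed histogram" idea directly, one option is to draw both on the same axes. This is a sketch under a few assumptions: it uses the built-in faithful dataset, and it uses after_stat(density) (available in recent versions of ggplot2) to put the histogram on the density scale so the two layers are comparable.

```r
library(ggplot2)

overlay <- ggplot(data = faithful, mapping = aes(x = waiting)) +
  # Rescale the bars so their total area is 1, matching the density curve.
  geom_histogram(mapping = aes(y = after_stat(density)),
                 bins = 20, fill = "grey80", color = "white") +
  # The density curve is drawn on top of the bars.
  geom_density(color = "purple")

overlay
```

The curve follows the overall shape of the bars while smoothing over the bar-to-bar jumps.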
The anaesthetic dataset compares 4 different anaesthetics with 20 patients each and measures time in minutes until the patient can begin breathing unassisted after use.
library(faraway)
anaesthetic
The following density curve has its numeric x equal to time to breathe (the variable breath). We can again use fill to define a fill color and color to define a border color.
ggplot(data = anaesthetic, mapping = aes(x= breath)) +
geom_density(fill = "purple", color = "black")
We talked about default order of arguments in the past, so let's put it to use!
R will assume that your second argument with aes is the mapping = argument. So typically when writing ggplot code, you can just leave the mapping = part out!
ggplot(data = anaesthetic, aes(x= breath)) +
geom_density(fill = "purple", color = "black")
You can also drop the data = part (like below), but personally, I tend to type it out for clarity!
ggplot(anaesthetic, aes(x= breath)) +
geom_density(fill = "purple", color = "black")
While not especially important on a univariate plot, it's sometimes helpful to add transparency to your graph.
Alpha spans from 0 (fully transparent) to 1 (fully opaque).
Feel free to adjust alpha to different values to see what happens!
ggplot(data = anaesthetic, aes(x= breath)) +
geom_density(fill = "purple", color = "black", alpha = 0.5)
The point of the data is to compare four different anaesthetics, but we just looked at the distribution of all four combined. We'll return to this data at the end of this tutorial to make that comparison!
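Transparency earns its keep once shapes overlap. As a hedged illustration on a different dataset (the built-in iris data, an assumption chosen so the anaesthetic comparison is saved for later), here are three density curves sharing one set of axes:

```r
library(ggplot2)

# One density curve per species; alpha lets the overlapping regions show through.
overlap <- ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(color = "black", alpha = 0.5)

overlap
```

Try setting alpha to 1 here and notice how the curves in front hide the ones behind them.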
Barplots can help us visualize the distribution of categorical or discrete variables much the same way histograms do.
We will start with barplots for just one variable.
We will be using the diamonds dataset (it's inside the ggplot2 package). Notice that each row is a unique diamond, and each column represents a variable we've observed or measured about these diamonds.
library(ggplot2)
diamonds
One variable of interest is the cut of the diamond (it's a measure of quality). A barplot can help us quickly see how many diamonds of each cut we have in this dataset.
ggplot(data = diamonds, aes(x = cut)) +
geom_bar(fill = "dodgerblue")
While we can color our barplot using a single block color, we can also color (or in this case, fill) by a variable. Notice in this next plot that we are assigning the diamond's cut to be represented as a fill color too!
ggplot(data = diamonds, aes(x = cut, fill = cut)) +
geom_bar()
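One thing worth checking with this plot: a static fill given inside geom_bar takes precedence over a fill mapped inside aes. The sketch below compares the two; the object names are made up, and layer_data is the ggplot2 helper that returns the values each bar is drawn with.

```r
library(ggplot2)

# Fill mapped to a variable: each cut gets its own color.
mapped <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
  geom_bar()

# Same mapping, but a static fill in geom_bar overrides it.
overridden <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
  geom_bar(fill = "dodgerblue")

length(unique(layer_data(mapped)$fill))  # number of distinct fill colors used
unique(layer_data(overridden)$fill)      # the fill color(s) actually used
```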
Several reminders and observations here:
Don't also set a fill color inside the geom_bar line, as that would overwrite the previous fill assignment. Feel free to try it in the example above to see what happens!
In the previous example, each row represented one object (a diamond). But what if our unit of observation is already a summary?
We can see this with the mtcars dataset. This data (collected from the 1970s) has each vehicle model represented as one row.
mtcars
Let's say I wanted to make a barplot that compared each model's mpg. In that case, I'm not counting how many vehicles there are in a category. Rather, each row will now have its own bar extending to the value shown in the mpg column.
Let's assign the car model to the x axis and the mpg variable to the y axis. For this situation, where we aren't counting rows in each category but instead assigning a variable to y, we will use the geom_col geometry.
FYI: mtcars is an unusual data frame where the vehicle names are stored as rownames, rather than as their own variable. So in the plot below, I inputted rownames(mtcars) so that R can find the vehicle names and treat them like a variable. You do not need to know this extra bit!
ggplot(data = mtcars, aes(x = rownames(mtcars), y = mpg)) +
geom_col(fill = "goldenrod")
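To connect the two geometries, here is a sketch showing that geom_bar on raw rows and geom_col on a pre-counted table give the same bar heights. The helper table cut_counts and the plot names are made up for illustration; the counting is done with base R's table function.

```r
library(ggplot2)

# geom_bar counts the rows in each category for us.
counted_by_ggplot <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar(fill = "dodgerblue")

# Pre-summarized version: one row per cut, with the count stored in a column.
cut_counts <- as.data.frame(table(cut = diamonds$cut))  # columns: cut, Freq

# geom_col draws each bar up to the value we hand it in y.
counted_by_hand <- ggplot(data = cut_counts, aes(x = cut, y = Freq)) +
  geom_col(fill = "dodgerblue")

layer_data(counted_by_ggplot)$y  # bar heights computed by geom_bar
layer_data(counted_by_hand)$y    # bar heights read from the Freq column
```

So the choice comes down to the shape of your data: raw rows call for geom_bar, and already-summarized values call for geom_col.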
All of the vehicle names are overlapping! An easy fix is to change the orientation of the barplot itself.
Let's switch which variable appears on the x axis and which is on the y axis.
ggplot(data = mtcars, aes(y = rownames(mtcars), x = mpg)) +
geom_col(fill = "goldenrod")
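A further optional step is sorting the bars by value so the ranking is easy to read. This sketch first copies the row names into a regular column (the names car_mpg and model are made up here) and then uses base R's reorder function to order the models by mpg.

```r
library(ggplot2)

# Copy the row names into a regular column so they behave like any variable.
car_mpg <- data.frame(model = rownames(mtcars), mpg = mtcars$mpg)

# reorder(model, mpg) sorts the model names by their mpg value.
sorted_bars <- ggplot(data = car_mpg, aes(y = reorder(model, mpg), x = mpg)) +
  geom_col(fill = "goldenrod") +
  labs(y = "Model", x = "Miles per gallon")

sorted_bars
```

Without reorder, the models are simply listed alphabetically along the axis.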
For this practice, we will use the penguins dataset from the palmerpenguins package. This dataset records information from 3 different species of penguins.
Create a barplot to count up and compare how many penguins we have from each species.
Make a choice! Do you want to have your bars extend vertically or horizontally? Which axis should you assign your categorical variable to in each case?
Color each bar differently by also assigning species as a fill color.
In addition to the barplot itself, create a title called "Penguins by Species" using the labs function at the end.
This practice has no checker, but you can use the hints to see a sample solution.
ggplot(data = ____________, aes(_______________)) +
geom_bar() +
labs(___________)
ggplot(data = penguins, aes(y = ___________, fill = _________)) +
geom_bar() +
labs(title = "_________________")
ggplot(data = penguins, aes(y = species, fill = species)) +
geom_bar() +
labs(title = "Penguins by Species")
Boxplots are very commonly used to summarize a distribution. One drawback of boxplots is that they can't show you...
An advantage of boxplots is that they are often cleaner than many other choices and very helpful for comparing groups!
Let's first create a boxplot of a numeric variable all by itself using the anaesthetic data.
ggplot(data = anaesthetic, aes(y = breath)) +
geom_boxplot(fill = "red")
R defaults to a certain graph size, which makes solo boxplots look very unappealing (this can be fixed by defining a width argument, like below).
ggplot(data = anaesthetic, aes(y = breath)) +
geom_boxplot(fill = "red", width = 0.5)
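If you're curious what numbers sit behind the box, the sketch below pulls them out and compares them with base R's quantile function. It assumes the built-in faithful dataset as a stand-in for anaesthetic, and solo_box is a made-up object name.

```r
library(ggplot2)

solo_box <- ggplot(data = faithful, aes(y = waiting)) +
  geom_boxplot(fill = "red", width = 0.5)

# The values ggplot2 computed for the whisker ends, box edges, and middle line.
layer_data(solo_box)[, c("ymin", "lower", "middle", "upper", "ymax")]

# The quartiles from base R for comparison.
quantile(faithful$waiting, probs = c(0.25, 0.5, 0.75))
```

The box edges line up with the first and third quartiles, and the line inside the box is the median.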
But in practice, boxplots are almost exclusively used to compare groups!
Let's do this again, but let's now compare the different anaesthetics with regard to how long it takes patients to begin breathing unassisted. Take a look again at the data:
anaesthetic
Let's keep breath on the y axis, but now split the x axis up by the anaesthetic used (that variable is coded as tgrp). This will now create a different vertical boxplot for each anaesthetic used.
ggplot(data = anaesthetic, aes(x = tgrp, y = breath)) +
geom_boxplot(fill = "red")
If preferred, we can change the orientation like we did for barplots!
We can also change each box to a different color if we wish by adding a fill argument.
ggplot(data = anaesthetic, aes(y = tgrp, x = breath, fill = tgrp)) +
geom_boxplot()
We can optionally add errorbars to the whiskers by adding another layer: stat_boxplot() and setting geom = "errorbar"
ggplot(data = anaesthetic, aes(y = tgrp, x = breath, fill = tgrp)) +
geom_boxplot() +
stat_boxplot(geom = "errorbar")
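Two optional tweaks to the errorbar layer, sketched here on the built-in iris data (an assumption, so the example runs without extra packages): layers are drawn in the order they are added, so listing stat_boxplot first puts the boxes on top of the errorbar lines, and a width argument narrows the caps. The value 0.3 is an arbitrary choice.

```r
library(ggplot2)

capped <- ggplot(data = iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  stat_boxplot(geom = "errorbar", width = 0.3) +  # caps and whisker lines drawn first
  geom_boxplot()                                   # boxes drawn on top

capped
```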
So, is there a difference between the anaesthetics?
Later in the course, we'll learn how to determine probabilistically if the anaesthetics could be the same, or if these distributions are different enough to suggest underlying differences!
Let's return to the penguins dataset once again.
library(palmerpenguins)
penguins
Create boxplots to compare the body_mass_g of each species.
ggplot(data = _____, aes(___________)) +
geom_boxplot() +
stat_boxplot(______________) +
labs(______________)
ggplot(data = penguins, aes(x = _____ , y = _____, fill = ____)) +
geom_boxplot() +
stat_boxplot(geom = "__________") +
labs(_______________)
ggplot(data = penguins, aes(x = species , y = body_mass_g, fill = species)) +
geom_boxplot() +
stat_boxplot(geom = "errorbar") +
labs(title = "Body Mass (g) by Species")
Before venturing into your next coding assignment, watch this video! I cover a few common coding mistakes when building a ggplot visualization and also bring back a few timely reminders from the "Navigating RStudio" video.
We'll learn many more features and geometries in this course, but you might also find the R gallery interesting for some previews of what's possible: https://www.r-graph-gallery.com/
This tutorial was created by Kelly Findley and Brandon Pazmino (UIUC '21). We hope this experience was helpful for you!
If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/