Introduction to ggplot2

Why Visualize?

The Need for Visualization

In this course, we'll be doing a lot of data visualization. Visualization is important both in the exploration and analysis phase (so we can better see what might be going on in data) and in the presentation phase (to share important insights with our audience).

Now we will once again take a look at the prostate dataset. Let's first quickly view the prostate data to get a glimpse of what variables are included and what raw data we see.

library(faraway)
prostate

Summarizing Variables

Let's take the two following variables, age and lweight, and run a summary on both variables:

summary(prostate$lweight)
summary(prostate$age)

But raw data and summary statistics can only go so far in telling us what's going on in our data. Visualizing the variable can communicate much more information at a glance.
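
To see why summary statistics alone can mislead, here is a small sketch using Anscombe's quartet, the anscombe data frame that ships with base R. It is a stand-in and not part of this tutorial's data. The four y variables have nearly identical means and standard deviations, yet their plots look nothing alike.

```r
# anscombe ships with base R: four x/y pairs built to share summary statistics
y_cols <- anscombe[, c("y1", "y2", "y3", "y4")]

# Means and standard deviations are nearly identical across the four columns
y_means <- sapply(y_cols, mean)
y_sds   <- sapply(y_cols, sd)
round(y_means, 2)
round(y_sds, 2)

# Plotting the four pairs side by side shows four very different patterns
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```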

Introducing ggplot

While R as a language has many base plot functions for graphing, we will primarily use graphing functions from a package named ggplot2. The "gg" in ggplot2 stands for "Grammar of Graphics," and it is widely considered one of the best platforms for visualization in R...and arguably one of the best in any programming language!

Before ever being able to use the ggplot function, you'll need to install the ggplot2 package. Packages are like apps that other people create that you can then install into your version of R on your computer.

While you can install ggplot2 as an individual package, I would actually recommend installing tidyverse, which is a collection of packages bundled together that all work well together! Once you have RStudio open, go ahead and run the following function. Note, it may take 1-2 minutes to fully unpack and install.

install.packages("tidyverse")

Also, for every new session of R, you will need to load each package with library() before you use any functions from it...this step is kind of like opening an app on your phone to interact with it. So if you want to create a graph using the ggplot function, open the ggplot2 package.

library(ggplot2)

You do not have to keep running library(ggplot2) before every plot code though. Just once per session is enough! I typically put all packages I library at the top of my script as a reminder to run them each time I open that script.
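
If you are not sure whether a package is already installed, a small check like the one below can help. This is only a sketch: is_installed is a hypothetical helper name, not something from ggplot2 or tidyverse, and the install line is left commented out so nothing downloads by accident.

```r
# Hypothetical helper: TRUE if the named package is already installed
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

is_installed("stats")    # ships with R, so this is always TRUE
is_installed("ggplot2")  # TRUE only once ggplot2 (or tidyverse) has been installed

# Install only when missing (uncomment to run):
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```

Installing happens once per computer, while library() is needed once per session.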

Histograms

Our first plot!

On this first panel, we'll start with creating histograms. Histograms are a very simple representation for visualizing a numeric variable.

Specifying the data and variables

The first line will use the ggplot function. This function requires a data argument, followed by a mapping argument. The data argument specifies which data structure we are calling on, and the mapping argument specifies which variables we are mapping to the plot and in which representation.

ggplot(data = ..., mapping = ...)

Note that for mapping, we identify the variables we are using inside an embedded function called aes (which stands for "aesthetic"). This is always necessary if mapping representations to dynamic entries, like a variable.

ggplot(data = ..., mapping = aes(...))

Adding a geometry

Next, we can name a geometry (geom) to identify what form our plot will take. This is where we can choose a histogram by specifying geom_histogram, though there are many other geometries we'll use as well!

We will call on the prostate data, take the age variable plotted on the x axis, and choose histogram as our geometry. See what it looks like!

ggplot(data = prostate, mapping = aes(x = age)) +
  geom_histogram()

Note: We linked the ggplot line with the geom_histogram line using a + sign. In ggplot commands, we will often link multiple commands together with +. Think of it as an "and" statement to connect multiple components that add to the same plot.
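
One way to see what the + is doing is to build a plot in pieces. The sketch below uses the built-in mtcars data as a stand-in (an assumption, so it runs without the faraway package), and the object names canvas and hist_plot are only illustrative.

```r
library(ggplot2)

# Step 1: data and mapping only. Printing this gives an empty canvas with an mpg axis.
canvas <- ggplot(data = mtcars, mapping = aes(x = mpg))

# Step 2: use + to add a histogram layer on top of the canvas
hist_plot <- canvas + geom_histogram(bins = 10)

length(canvas$layers)     # the canvas has no layers yet
length(hist_plot$layers)  # the histogram is the first layer

hist_plot
```

Each additional + adds another component to the same plot object, which is why the order reads like "canvas and histogram and labels."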

Add some color

We aren't restricted to this dark, uninspiring color palette. We can change both the border color (activated by the color argument) and the fill color (activated by the fill argument).

Run the following ggplot to change the border color and fill color. Feel free to adjust the color options to other generic colors! (and yes, there is a very extensive list of color options we'll see in a later tutorial!)

ggplot(data = prostate, mapping = aes(x = age)) +
  geom_histogram(color = "black", fill = "green")

If naming specific colors, be sure to write them in quotation marks. However, variable names should not be listed in quotation marks when mapped to a representation.

Setting Number of Bins

The number of bins within your histogram refers to the number of bars within your histogram and it will largely depend on the number of data points that your dataset contains.

While geom_histogram will default to 30 bins if no selection is made, it is good practice to set this in your graphs and to play around with this number until you are happy with the appearance. The less data you have, the fewer bins you probably will want.

Feel free to change number of bins and notice what happens!

ggplot(data = prostate, mapping = aes(x = age)) + 
  geom_histogram(color = "black", fill = "green", bins = 25)
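
To check what the bins argument changes, the sketch below compares two bin counts on the same variable. It uses the built-in faithful data as a stand-in (an assumption, so it runs without the faraway package) and ggplot2's layer_data() function, which returns the values computed for a layer.

```r
library(ggplot2)

# faithful is a built-in dataset; waiting = minutes between eruptions
few_bins <- ggplot(data = faithful, mapping = aes(x = waiting)) +
  geom_histogram(color = "black", fill = "green", bins = 10)
many_bins <- ggplot(data = faithful, mapping = aes(x = waiting)) +
  geom_histogram(color = "black", fill = "green", bins = 40)

# layer_data() gives one row per bar, so more bins means more rows
nrow(layer_data(few_bins))
nrow(layer_data(many_bins))

# Every observation lands in exactly one bar, whatever the bin count
sum(layer_data(few_bins)$count)
sum(layer_data(many_bins)$count)
```

The data never change; only how finely they are grouped does. Too few bins can hide features, and too many can make noise look like structure.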

Plotting Weight

Before trying one for yourself, we are going to create one more plot. But this time, with the lweight variable.

Notice that for this plot, we also have some labels. The labs function allows us to add a title, axes titles, and other labels as we wish. You don't need to use all of these arguments though! Feel free to delete one or more options and note that the plot will still run.

ggplot(data = prostate, mapping = aes(x = lweight)) +
  geom_histogram(color = "black", fill = "green") +
  labs(title = "Log Prostate Weight", x = "Log of prostate weight", y = "Count")

Time to Practice!

Fill in the blanks below to create a histogram! Use the diabetes dataset and use chol (short for "cholesterol") as the variable you will be observing within your dataset.

  • Set bin number to 20.
  • Create a black border and a pink fill color.
  • Title your plot "Cholesterol Levels".

If you are struggling, click the "Hints" button to get some help. It will eventually take you to the solution if you need it, but try to figure it out on your own first!

ggplot(data = _________, mapping = aes(___________)) + 
  geom_histogram(_________) +
  labs(____________)

Hint 1:

ggplot(data = _________, aes(x = _________)) + 
  geom_histogram(fill = "pink", _________________) +
  labs(title = "________")

Hint 2:

ggplot(data = diabetes, mapping = aes(x = _________)) + 
  geom_histogram(fill = "pink", color = "_______", bins = ___) +
  labs(title = "________")

Solution:

ggplot(data = diabetes, mapping = aes(x = chol)) + 
  geom_histogram(fill = "pink", color = "black", bins = 20) +
  labs(title = "Cholesterol Levels")

Try on your own

Now, without the earlier code for reference, plot a histogram with the prostate dataset. Plot the variable lcavol with a red fill, a white border color, and the bin number of your choice.

ggplot(data = _________, mapping = aes(x = lcavol)) + 
  geom_histogram(_____________________)

Hint:

ggplot(data = prostate, mapping = aes(x = lcavol)) + 
  geom_histogram(color = "_______", fill = "______", bins = ___)

Solution:

ggplot(data = prostate, mapping = aes(x = lcavol)) + 
  geom_histogram(color = "white", fill = "red", bins = 20)

Density Curves

Introducing Density Curves

Density curves represent another helpful way to explore the distribution of a numeric variable. A simple way of describing a density curve is as a smoothed version of a histogram.

If we don't have much data, density curves might be misleading, as they smooth out the data to suggest a distribution shape. However, if we have a lot of data, density curves can be more helpful than histograms in revealing the general trend of that variable.
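
As a rough illustration of both points, the sketch below uses base R's density() function on the built-in faithful data (a stand-in, not one of this tutorial's datasets). It shows that the area under a density curve is about 1, which is why the y axis shows density rather than counts, and then draws a curve from only 8 observations.

```r
# faithful is a built-in dataset, used here only as a stand-in
d <- density(faithful$eruptions)

# Approximate the area under the curve with a simple Riemann sum
step <- diff(d$x)[1]
area <- sum(d$y) * step
area  # close to 1

# A small sample still produces a smooth curve, even though little data supports it
set.seed(1)
small <- sample(faithful$eruptions, 8)
plot(density(small), main = "Density from only 8 observations")
rug(small)  # tick marks show where the 8 actual values are
```

The smooth curve from 8 points looks just as confident as one built from hundreds, which is the reason for caution with small datasets.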

Anaesthetic Data

Now let us first take a look at a basic density curve. The data we will be using is the anaesthetic dataset in the faraway package.

There is also a command to open the documentation for this data with ?anaesthetic. This will likely open up in a browser for you.

We can see from the documentation provided that this data represents how much time it took for participants to begin breathing again unassisted after being taken off one of four different anaesthetics.

library(faraway)
anaesthetic
?anaesthetic

Making a Density curve

The following density curve has its numerical x equal to time to breathe (the variable "breath"). We can again use fill to define a fill color and color to define a border color.

Feel free to change the fill and color options if you wish!

ggplot(data = anaesthetic, mapping = aes(x= breath)) + 
  geom_density(fill = "purple", color = "black")

We can see that the breathing times recorded in our dataset are most commonly between 0 and 5 minutes, as that is where the highest peak of our density curve lies.

A Shorter Type

Now that we've done this a few times, I'm going to drop the mapping = component in the ggplot line. For this function, R will assume that aes() is your mapping without explicitly writing the argument out. So let's save some typing!

ggplot(data = anaesthetic, aes(x= breath)) + 
  geom_density(fill = "purple", color = "black")

Technically, you can also drop data = (like below), but personally, I tend to type it out for clarity!

ggplot(anaesthetic, aes(x= breath)) + 
  geom_density(fill = "purple", color = "black")

Alpha (transparency)

While not especially important on a univariate plot, it's sometimes helpful to add transparency to your graph.

Alpha spans from 0 (fully transparent) to 1 (fully opaque).

Feel free to adjust alpha to different values to see what happens!

ggplot(data = anaesthetic, aes(x= breath)) + 
  geom_density(fill = "purple", color = "black", alpha = 0.5)

Note: you can do the same for histograms, and most any plot we use in ggplot!
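
Transparency becomes more useful once shapes overlap. The sketch below uses the built-in iris data as a stand-in (an assumption, so it runs without the faraway package) and fills by a grouping variable, so that three density curves share one plot. Without alpha, the curves in front would hide the ones behind.

```r
library(ggplot2)

# iris is a built-in dataset: three species measured on the same scale
overlap_plot <- ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(color = "black", alpha = 0.5)

overlap_plot
```

Try setting alpha = 1 to see how much of the picture disappears.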

But don't we want to compare the anaesthetics?

The point of the data is to compare four different anaesthetics, but we just looked at the distribution of all four combined! We'll return to this data at the end of this tutorial to make that comparison.

Barplots

Barplots using geom_bar

Barplots can help us visualize the distribution of categorical or discrete variables much the same way histograms do.

Typically the categorical variable is on the x axis of the plot, with each bar representing one category of that variable and the y axis representing a count. We can also switch the axes if we wish to have the bars extend horizontally rather than vertically.

We will start with barplots that count how many rows of data meet each category of interest.

Exploring the Barplot Code

We will be using the diamonds dataset (it's inside the ggplot2 package).

library(ggplot2)
diamonds

Notice that each row is a unique diamond, and each diamond can be classified by various cuts (categorized by quality). A barplot can help us quickly see how many of each cut we have in this dataset.

ggplot(data = diamonds, aes(x = cut)) + 
  geom_bar(fill = "dodgerblue")

Coloring by a Variable

While we can continue to color our barplot using a block color, like we did above, we can also color (or in this case, fill) by a variable.

For example, we can represent "cut" both as a variable on the x axis and as the fill color. This will then color each bar a different color. It doesn't necessarily improve our ability to explore the data, but does make the graph look nice!

ggplot(data = diamonds, aes(x = cut, fill = cut)) + 
  geom_bar()

Notice a few things:

  • Variable names are not put in quotation marks, since they represent dynamic entries rather than static entries (a generic color, like "green" is a static entry.)
  • We need to represent this in the aesthetic (aes) function since it is a variable representation.
  • We do not also add a block color name in the geom_bar line, as that would overwrite our color entry from before. You can experiment though and also put fill = "blue" or some other block color entry, and you'll notice that this will overwrite fill = cut above.
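
The last point can be checked directly. This sketch builds the plot both ways and uses ggplot2's layer_data() function to show the fill each bar actually receives; the object names mapped_fill and static_fill are only illustrative.

```r
library(ggplot2)

# Fill mapped to a variable inside aes()
mapped_fill <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
  geom_bar()

# Same mapping, but a static fill in geom_bar takes over
static_fill <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
  geom_bar(fill = "blue")

unique(layer_data(mapped_fill)$fill)  # one color per cut
unique(layer_data(static_fill)$fill)  # only "blue"
```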

Barplots using geom_col

Data can be represented in different ways. Rather than each row being a unique observation, we might also think of each row as summarizing information about one category.

We can see this with the mtcars dataset stored in base R. This data is from the 1970s and has some interesting comparisons between car models.

mtcars

Mapping to both x and y

Each row in this dataset is a summary of one car model's specifications. So if I wanted to make a barplot to compare the specifications of each car model, I couldn't make a barplot in exactly the same construction as before--I can't count rows!

What I can do though is assign a variable to the y axis that represents the characteristic I want to compare. If needing to assign a variable to both axes, we can use geom_col.
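
To see how the two barplot geoms relate, the sketch below counts the diamonds by cut ahead of time and then plots those counts with geom_col. The result matches what geom_bar computes on its own. The data frame name counts is only illustrative; Freq is the column name that as.data.frame() gives a table.

```r
library(ggplot2)

# Summarize first: one row per cut, with the number of diamonds in Freq
counts <- as.data.frame(table(cut = diamonds$cut))
counts

# geom_bar counts the rows for us
p_bar <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar()

# geom_col plots a y value we supply
p_col <- ggplot(data = counts, aes(x = cut, y = Freq)) +
  geom_col()

# Both plots end up with the same bar heights
layer_data(p_bar)$count
layer_data(p_col)$y
```

So the choice between them depends on the shape of your data: one row per observation suits geom_bar, one row per category suits geom_col.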

Note that mtcars is a special case where the vehicle names are stored as rownames, rather than as their own variable. So in the plot below, I inputted rownames(mtcars) as a way to call on the vector of vehicle names. For purposes of plotting, this acts as a variable now!

ggplot(data = mtcars, aes(x = rownames(mtcars), y = mpg)) +
  geom_col(fill = "goldenrod")

Change the orientation!

So...that's not very helpful. All of the vehicle names are overlapping! An easy fix is to change the orientation of the barplot itself.

Let's switch which variable appears on the x axis and which is on the y axis.

ggplot(data = mtcars, aes(y = rownames(mtcars), x = mpg)) +
  geom_col(fill = "goldenrod")

Color might help

With so many bars, it might be difficult to tell which is which. Assigning each vehicle a unique fill color can again help!

...and since the ggplot line is getting long, I put each argument on a new line to make this code more readable.

ggplot(data = mtcars, aes(y = rownames(mtcars), 
                          x = mpg, 
                          fill = rownames(mtcars))) +
  geom_col()
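
Two optional refinements to the mtcars barplots, offered as a sketch rather than the required approach: copy the rownames into a real column so the names travel with the data, and use base R's reorder() function to sort the bars by value. The names cars and model are only illustrative.

```r
library(ggplot2)

# Copy the rownames into a real column
cars <- data.frame(model = rownames(mtcars), mtcars)

# reorder() sorts the model names by mpg so the bars appear in order
sorted_plot <- ggplot(data = cars, aes(y = reorder(model, mpg), x = mpg)) +
  geom_col(fill = "goldenrod") +
  labs(y = "model")

sorted_plot
```

Sorted bars make it much easier to find the most and least fuel-efficient models at a glance.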

Try your own Bar Plot

For this practice, we will use the penguins dataset from the palmerpenguins package (load it with library(palmerpenguins)). This dataset records information from 3 different species of penguins.

One thing we want to compare is how many penguins we have from each island.

Make a choice! Do you want to have your bars extend vertically or horizontally? Which axis should you assign your categorical variable to in each case?

Color each bar differently by also assigning island as a fill color.

In addition to the barplot itself, create a title called "Penguins by Island" using the labs function at the end.

ggplot(data = ____________, aes(_______________)) +
  geom_bar() +
  labs(___________)

Hint:

ggplot(data = penguins, aes(y = ___________, fill = _________)) +
  geom_bar() +
  labs(title = "_________________")

Solution:

ggplot(data = penguins, aes(y = island, fill = island)) +
  geom_bar() +
  labs(title = "Penguins by Island")

Boxplots

Introducing Boxplots

Boxplots are very commonly used to summarize a distribution. One drawback of boxplots is that they can't show you 1) how many data points there are, or 2) the shape of the distribution outside the quartiles.

An advantage of boxplots is that they are often cleaner. They are also very helpful for comparing several groups!
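
To make concrete what a boxplot summarizes, the sketch below pulls out the handful of numbers ggplot2 computes for one. It uses the built-in mtcars data as a stand-in (an assumption, so it runs without the faraway package) and ggplot2's layer_data() function.

```r
library(ggplot2)

# mtcars is a built-in stand-in dataset
box_plot <- ggplot(data = mtcars, aes(y = mpg)) +
  geom_boxplot(fill = "red")

# Whisker ends (ymin, ymax), quartiles (lower, upper), and the median (middle)
box_stats <- layer_data(box_plot)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats

# Compare with the quartiles computed directly
quantile(mtcars$mpg, c(0.25, 0.5, 0.75))
```

All 32 cars are reduced to these five values (plus any outlier points), which is why the plot is clean but cannot show sample size or finer shape.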

Solo Boxplots

Let's first create a boxplot of a numeric variable all by itself using the anaesthetic data.

ggplot(data = anaesthetic, aes(y = breath)) + 
  geom_boxplot(fill = "red")

Is this even helpful?

Note that R defaults to a certain graph size which makes solo boxplots look very unappealing. If making this in your RStudio, you can always resize the window before saving!

But to be honest, nobody makes a solo boxplot. Boxplots are instead much better for comparing multiple groups!

Side by side Boxplots

Let's do this again, but let's now compare the different anaesthetics with regard to how long it takes someone to begin breathing on their own. Take a look again at the data:

anaesthetic

Let's keep breath on the y axis, but now split the x axis up by the anaesthetic used (tgrp). This will now create a different vertical boxplot for each anaesthetic used.

ggplot(data = anaesthetic, aes(x = tgrp, y = breath)) + 
  geom_boxplot(fill = "red")

Orientation and Color options

If preferred, we can change the orientation like we did for barplots! We can also change each box to a different color if we wish.

Below, we switch the axes, and then also change the fill color to be different for each anaesthetic (tgrp)

ggplot(data = anaesthetic, aes(y = tgrp, x = breath, fill = tgrp)) +
  geom_boxplot()

Whiskers

We can optionally add errorbars to the whiskers by adding another line: stat_boxplot() and setting geom = "errorbar"

Technically, this is a second geom, even if it doesn't seem like a new visual. But it's easy to add!

ggplot(data = anaesthetic, aes(y = tgrp, x = breath, fill = tgrp)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar")

Observations

So, is there a difference between the anaesthetics? If there is, it's not by much. Later in the course, we'll learn how to determine probabilistically if the anaesthetics could be the same, or if these distributions are different enough to suggest underlying differences!

Try your own!

Let's return to the diabetes dataset once again.

library(faraway)
diabetes

Use the following reference code to create a boxplot with the diabetes dataset, your categorical variable as gender, and your numeric variable as chol.

Make each of your boxes a different color (as we did above) by setting your categorical variable as the fill option.

Also add errorbars to the whiskers.

Give it the following title: "Cholesterol Levels by Gender"

Note that you will get a warning message that cholesterol has to be converted to a continuous variable...that is ok!

ggplot(data = _____, aes(___________)) + 
  geom_boxplot() +
  stat_boxplot(______________) +
  labs(______________)

Hint:

ggplot(data = diabetes, aes(x = _____ , y = _____, fill = ____)) + 
  geom_boxplot() +
  stat_boxplot(geom = "__________") +
  labs(_______________)

Solution:

ggplot(data = diabetes, aes(x = gender , y = chol, fill = gender)) + 
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  labs(title = "Cholesterol Levels by Gender")

What else can we do with ggplot2?

Customize Anything!

By this point, you may be wondering how to customize other things. The graphs we have made so far are "ok," but could definitely be improved. The ggplot2 structure allows you to customize almost anything you can think of! As a preview of what will come, ggplot2 can let you...

  • Customize the colors
  • Change the background and gridline paneling
  • Customize (or remove) the legend
  • Change fonts, sizes, and styles of titles and labels
  • Change the axis scaling to be at a custom frequency or range
  • Try a wide variety of geometries (not just histograms and density curves)
  • Plot multiple geometries on the same plot (this is where ggplot gets really fun!)
  • Create interactive/dynamic plots

Something Fun

Below is one of my favorite new ggplots. It's called a "Raincloud" plot, and combines density curves (or in this case, density ridges from the ggridges package) with jittered point plots below. I also added a tiny boxplot to each.

In addition to some careful arguments for those geoms, there are also customized colors, background themes, and scale commands.

The plot below also compares the time until breathing unassisted for each anaesthetic group. But now we can see much more at once, rather than having to choose just one representation.

library(ggridges)
ggplot(data = anaesthetic, aes(x = breath, 
                            y = tgrp, 
                            fill = tgrp)) +
  geom_density_ridges(position = position_nudge(y = .15, x = 0),
                   alpha = 0.4,
                   scale = 0.7,
                   bandwidth = 0.6) +
  geom_boxplot(position = position_nudge(y = .15, x = 0),
               width = .1, 
               outlier.shape = NA,
               alpha = 0) +
  geom_jitter(height = 0.05,
             size = 0.8) +
  scale_fill_manual(values = c("purple", "seagreen","darkorange1","cyan3")) +
  theme_bw() +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = seq(0,32,2),
                     limits = c(0, 24)) +
  scale_y_discrete(expand = expansion(add = c(0.2, 0.7))) +
  labs(x = "Minutes until Breathing Unassisted", y = "Anaesthetic")

WHOA--Do I need to know how to make that??

Nope! It's just a preview of what you can do in ggplot! Just focus on the basic plots we saw earlier.

But by the end of the course, you will have learned many of these features and may even be able to make a plot like this if you are motivated to try!

Acknowledgment

This tutorial was initially created by Brandon Pazmino (UIUC '21) with editing and upkeep by Kelly Findley. We hope this experience was helpful for you!