dyplr package has a number of advanced data wrangling capabilities made "easy" in comparison to using base
R code. We will be taking a look at some basic features of `dplyr in this tutorial!
ggplot2, is one of many packages included in
tidyverse. If you have already installed
tidyverse at this point (which I'm guessing most everyone has), then
dplyr functions should be available to you! If you did not install
tidyverse and instead only installed
ggplot2 by itself, you may want to also install
dplyr as well.
Just as with making ggplots, be sure you have libraried either
tidyverse, or specifically
dplyr before proceeding to use
dplyr functions in
dplyr package uses "pipes," which acts as a "then" statement for each operation you complete. The idea is that it saves you from needing to nest your functions within each other.
While nesting some functions within others is normal and not too confusing, it can quickly become very confusing if we have lots of nested functions! That is why
dplyr is structured with pipes to allow for each nested function to be its own line (as we will see in this tutorial!)
Let's take a look at the
starwars data (an available dataset in
For documentation on this data, you can search online, or simply type
?starwars in a code chunk.
Also, notice that the last three columns won't show up in the preview, as they are each lists. But you can take a look at them directly by calling on the variable:
We've already learned to subset our data using base
R functions (e.g.,
subset), but in
dplyr, the function name for subsetting is
filter. For example, the following code filters our data to only include Star Wars characters with a mass greater than 70 and an eye color of yellow.
Notice, we start with the name of the data, a pipe symbol, and then any
dplyr function names we wish to use.
starwars %>% filter(mass > 70 & eye_color == "yellow")
Notice in the previous example, we had multiple filtering requests. You might often just have one, but in the case of multiple, be thoughtful of whether you want both to be true, or at least one.
I used the
& symbol in the previous example. Alternatively, a comma could be used here, which is always assumed to be
| symbol could be used if we just wanted one of those things to be true.
Run this function again, but try changing between
, to see how these entries changes the selection.
starwars %>% filter(mass > 70 & eye_color == "yellow")
Let's say that I want to allow for more than just yellow eye color, but a couple specific options. I want to briefly introduce you to %in%.
With %in%, I can now provide a concatenated vector of categories and just ask that we filter to options that are contained in the vector. For example:
starwars %>% filter(mass > 70 & eye_color %in% c("yellow", "green", "blue"))
This option comes in handy when selecting multiple categories!
We can also save our selections by setting this equal to some other name (preferably a new variable name). This allows us to call on this data later, perhaps within a graph or linear model or some other case where we want this selection of rows (or as we'll see later on, newly created calculated variables).
Tall_yellow = starwars %>% filter(mass > 70 & eye_color == "yellow") summary(Tall_yellow$height)
Use the starwars data to filter down to Star Wars characters who are Droids.
starwars %>% ...(species...)
starwars %>% filter(species == "Droid")
Perhaps instead of filtering to just certain rows, I'd actually like to filter to only include certain columns (variables). I can use
select to choose which columns to keep and output.
For example, let's say I only want to include the
gender variables. I can do that as follows:
starwars %>% select(name, species, gender)
Supoose I wanted to create a new variable based on information in my data already. For example, a height to mass ratio variable (basically, a bmi type measure, but in different units since we're working with height in cm and mass in kg!)
I can use
mutate to create this new variable as follows. I'm also going to select this column and name only so that we don't have to go scrolling!
starwars %>% mutate(bmi = mass/height) %>% select(name, bmi)
Notice we have two pipes in here. I recommend reading them as "then." - Load
starwars, then... - Mutate a new column called
height, then... - Select only the columns
We can continue adding on this if we like. For example, let's say that I want to filter to only bmi values above 30. Then we can do that as follows:
starwars %>% mutate(bmi = mass/height) %>% select(name, bmi) %>% filter(bmi > 0.5)
Now lets do a problem with the mutate function! Create a calculated variable called
height_in that converts the current height values (in cm) to inches. We can do that by dividing height by 2.54.
starwars %>% ...(height_in = ...) %>% select(...)
starwars %>% ...(height_in = height/...) %>% select(name, ...)
starwars %>% mutate(height_in = height/2.54) %>% select(name, height_in)
Last, we're going to discuss how to use
dplyr to create summary statistic tables for particular variables.
As a first step, we want to pick one (or more) variables to group by. These are typically categorical variables whose categories we want to compare separately across rows. For example:
starwars %>% group_by(sex)
However...this by itself doesn't actually do anything. We need to combine it with another request, like
Now let's try creating a summary measure for a variable of interest, where each measure we request will appear as a column, and each category of the variable we grouped by will be a row.
Notice that in this case, we need to add
na.rm = TRUE to remove the
NA values in the height column for this dataset. However, when there are no
NA values, this added argument is not needed.
starwars %>% group_by(sex) %>% summarise(avg_height = mean(height, na.rm = TRUE))
Notice that the variable we group by (sex) is broken up across rows, and the summary statistic we calculated will show up and automatically calculated for each category.
Doing this with base
R would otherwise be quite tedious. We'd have to make 3 separate subsets, and then report the means of each.
We can create as many summary measures as we want!
starwars %>% group_by(sex) %>% summarise( avg_height = mean(height, na.rm = TRUE), avg_mass = mean(mass, na.rm = TRUE), sd_mass = sd(mass, na.rm = TRUE))
Since some of these categories only have 1 entry, that means that we'll get an
NA result when calcuating standard deviation.
We could also do this for multiple groups if we wish. Let's say we want to look at even more specific subsets, like eye color and gender.
starwars %>% group_by(eye_color, gender) %>% summarise( avg_height = mean(height, na.rm = TRUE), avg_mass = mean(mass, na.rm = TRUE))
In this case, that might be a lot of combinations, so let's add a filter in front to choose some specific category options. I'll again use %in% to choose multiple categories:
starwars %>% filter(eye_color %in% c("brown", "blue", "green", "yellow"), gender %in% c("feminine", "masculine")) %>% group_by(eye_color, gender) %>% summarise( avg_height = mean(height, na.rm = TRUE), avg_mass = mean(mass, na.rm = TRUE))
Note, if working in an RMarkdown file, this will print as a nice table directly in your knitted report!
Let's practice with summarise. But let's work with diamonds again for this practice.
Create a summary table that calculates the average carat value for diamonds of the following qualities: "Fair", "Good" and "Premium". Name your summary variable "Average".
Your table will have 2 columns and 3 rows if done correctly:
diamonds %>% filter(...) %>% group_by(...) %>% ...
diamonds %>% filter(cut %in% ...) %>% group_by(cut) %>% summarise(Average = ...)
diamonds %>% filter(cut %in% c("Fair", "Good", "Premium")) %>% group_by(cut) %>% summarise(Average = mean(carat))
This tutorial was built by Brandon Pazmino (UIUC '21) and Kelly Findley. We hope this experience was helpful for you!If you'd like to learn more about
dplyr, you may find the RStudio cheatsheet helpful (try a web search, or use the help tab from the menu to access cheatsheets). Here is another very accessible discussion of the features we learned, plus a few more: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html