dplyr: An Introduction

Using filter

Introduction

The dyplr package has a number of advanced data wrangling capabilities made "easy" in comparison to using base R code. We will be taking a look at some basic features of `dplyr in this tutorial!

dplyr, like ggplot2, is one of many packages included in tidyverse. If you have already installed tidyverse at this point (which I'm guessing most everyone has), then dplyr functions should be available to you! If you did not install tidyverse and instead only installed ggplot2 by itself, you may want to also install dplyr as well.

Just as with making ggplots, be sure you have libraried either tidyverse, or specifically dplyr before proceeding to use dplyr functions in R.

What is a pipe?

The dplyr package uses "pipes," which acts as a "then" statement for each operation you complete. The idea is that it saves you from needing to nest your functions within each other.

While nesting some functions within others is normal and not too confusing, it can quickly become very confusing if we have lots of nested functions! That is why dplyr is structured with pipes to allow for each nested function to be its own line (as we will see in this tutorial!)

Examining the starwars data

Let's take a look at the starwars data (an available dataset in R).

starwars

For documentation on this data, you can search online, or simply type ?starwars in a code chunk.

Also, notice that the last three columns won't show up in the preview, as they are each lists. But you can take a look at them directly by calling on the variable:

starwars$films

Filter

We've already learned to subset our data using base R functions (e.g., subset), but in dplyr, the function name for subsetting is filter. For example, the following code filters our data to only include Star Wars characters with a mass greater than 70 and an eye color of yellow.

Notice, we start with the name of the data, a pipe symbol, and then any dplyr function names we wish to use.

starwars %>% 
  filter(mass > 70 & eye_color == "yellow")

And vs. Or with Filter

Notice in the previous example, we had multiple filtering requests. You might often just have one, but in the case of multiple, be thoughtful of whether you want both to be true, or at least one.

I used the & symbol in the previous example. Alternatively, a comma could be used here, which is always assumed to be &.

The "or" | symbol could be used if we just wanted one of those things to be true.

Run this function again, but try changing between &, | and , to see how these entries changes the selection.

starwars %>% 
  filter(mass > 70 & eye_color == "yellow")

A Quick word about %in%

Let's say that I want to allow for more than just yellow eye color, but a couple specific options. I want to briefly introduce you to %in%.

With %in%, I can now provide a concatenated vector of categories and just ask that we filter to options that are contained in the vector. For example:

starwars %>% 
  filter(mass > 70 & eye_color %in% c("yellow", "green", "blue"))

This option comes in handy when selecting multiple categories!

Saving your selection

We can also save our selections by setting this equal to some other name (preferably a new variable name). This allows us to call on this data later, perhaps within a graph or linear model or some other case where we want this selection of rows (or as we'll see later on, newly created calculated variables).

Tall_yellow = starwars %>% 
  filter(mass > 70 & eye_color == "yellow")

summary(Tall_yellow$height)

Practice

Use the starwars data to filter down to Star Wars characters who are Droids.

starwars %>%
  ...(species...)
starwars %>%
  filter(species == "Droid")

Using select and mutate

Selecting columns

Perhaps instead of filtering to just certain rows, I'd actually like to filter to only include certain columns (variables). I can use select to choose which columns to keep and output.

For example, let's say I only want to include the name, species, and gender variables. I can do that as follows:

starwars %>%
  select(name, species, gender)

Creating new calcuated variables

Supoose I wanted to create a new variable based on information in my data already. For example, a height to mass ratio variable (basically, a bmi type measure, but in different units since we're working with height in cm and mass in kg!)

I can use mutate to create this new variable as follows. I'm also going to select this column and name only so that we don't have to go scrolling!

starwars %>%
  mutate(bmi = mass/height) %>%
  select(name, bmi)

Notice we have two pipes in here. I recommend reading them as "then." - Load starwars, then... - Mutate a new column called bmi as mass over height, then... - Select only the columns name and bmi

Continue adding on!

We can continue adding on this if we like. For example, let's say that I want to filter to only bmi values above 30. Then we can do that as follows:

starwars %>%
  mutate(bmi = mass/height) %>%
  select(name, bmi) %>%
  filter(bmi > 0.5)

Practice with mutate

Now lets do a problem with the mutate function! Create a calculated variable called height_in that converts the current height values (in cm) to inches. We can do that by dividing height by 2.54.

starwars %>%
  ...(height_in = ...) %>%
  select(...)
starwars %>%
  ...(height_in = height/...) %>%
  select(name, ...)
starwars %>%
  mutate(height_in = height/2.54) %>%
  select(name, height_in)

Using group_by and summarise

Group by a variable

Last, we're going to discuss how to use dplyr to create summary statistic tables for particular variables.

As a first step, we want to pick one (or more) variables to group by. These are typically categorical variables whose categories we want to compare separately across rows. For example:

starwars %>%
  group_by(sex)

However...this by itself doesn't actually do anything. We need to combine it with another request, like summarise!

Summarise

Now let's try creating a summary measure for a variable of interest, where each measure we request will appear as a column, and each category of the variable we grouped by will be a row.

Notice that in this case, we need to add na.rm = TRUE to remove the NA values in the height column for this dataset. However, when there are no NA values, this added argument is not needed.

starwars %>%
  group_by(sex) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))

Notice that the variable we group by (sex) is broken up across rows, and the summary statistic we calculated will show up and automatically calculated for each category.

Doing this with base R would otherwise be quite tedious. We'd have to make 3 separate subsets, and then report the means of each.

Summarise multiple measures

We can create as many summary measures as we want!

starwars %>%
  group_by(sex) %>%
  summarise(
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE),
    sd_mass = sd(mass, na.rm = TRUE))

Since some of these categories only have 1 entry, that means that we'll get an NA result when calcuating standard deviation.

Multiple group subsets

We could also do this for multiple groups if we wish. Let's say we want to look at even more specific subsets, like eye color and gender.

starwars %>%
  group_by(eye_color, gender) %>%
  summarise(
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE))

But let's be more choosy

In this case, that might be a lot of combinations, so let's add a filter in front to choose some specific category options. I'll again use %in% to choose multiple categories:

starwars %>%
  filter(eye_color %in% c("brown", "blue", "green", "yellow"),
         gender %in% c("feminine", "masculine")) %>%
  group_by(eye_color, gender) %>%
  summarise(
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE))

Note, if working in an RMarkdown file, this will print as a nice table directly in your knitted report!

Practice!

Let's practice with summarise. But let's work with diamonds again for this practice.

diamonds

Create a summary table that calculates the average carat value for diamonds of the following qualities: "Fair", "Good" and "Premium". Name your summary variable "Average".

Your table will have 2 columns and 3 rows if done correctly:

diamonds %>%
  filter(...) %>%
  group_by(...) %>%
  ...
diamonds %>%
  filter(cut %in% ...) %>%
  group_by(cut) %>%
  summarise(Average = ...)
diamonds %>%
  filter(cut %in% c("Fair", "Good", "Premium")) %>%
  group_by(cut) %>%
  summarise(Average = mean(carat))

Acknowledgment

This tutorial was built by Brandon Pazmino (UIUC '21) and Kelly Findley. We hope this experience was helpful for you!

If you'd like to learn more about dplyr, you may find the RStudio cheatsheet helpful (try a web search, or use the help tab from the menu to access cheatsheets). Here is another very accessible discussion of the features we learned, plus a few more: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html