dplyr: An Introduction

## Using filter

### Introduction

The dyplr package has a number of advanced data wrangling capabilities made "easy" in comparison to using base R code. We will be taking a look at some basic features of dplyr in this tutorial!

dplyr, like ggplot2, is one of many packages included in tidyverse. If you have already installed tidyverse at this point (which I'm guessing most everyone has), then dplyr functions should be available to you! If you did not install tidyverse and instead only installed ggplot2 by itself, you may want to also install dplyr as well.

Just as with making ggplots, be sure you have libraried either tidyverse, or specifically dplyr before proceeding to use dplyr functions in R.

### What is a pipe?

The dplyr package uses "pipes," which acts as a "then" statement for each operation you complete. The idea is that it saves you from needing to nest your functions within each other.

While nesting some functions within others is normal and not too confusing, it can quickly become very confusing if we have lots of nested functions! That is why dplyr is structured with pipes to allow for each nested function to be its own line (as we will see in this tutorial!)

### Examining the starwars data

Let's take a look at the starwars data (an available dataset in R).

starwars

For documentation on this data, you can search online, or simply type ?starwars in a code chunk.

Also, notice that the last three columns won't show up in the preview, as they are each lists. But you can take a look at them directly by calling on the variable:

### Practice

Use the starwars data to filter down to Star Wars characters who are Droids.

starwars %>%
...(species...)
starwars %>%
filter(species == "Droid")

## Using select and mutate

### Selecting columns

Perhaps instead of filtering to just certain rows, I'd actually like to filter to only include certain columns (variables). I can use select to choose which columns to keep and output.

For example, let's say I only want to include the name, species, and gender variables. I can do that as follows:

starwars %>%
select(name, species, gender)

### Creating new calcuated variables

Supoose I wanted to create a new variable based on information in my data already. For example, a height to mass ratio variable (basically, a bmi type measure, but in different units since we're working with height in cm and mass in kg!)

I can use mutate to create this new variable as follows. I'm also going to select this column and name only so that we don't have to go scrolling!

starwars %>%
mutate(bmi = mass/height) %>%
select(name, bmi)

Notice we have two pipes in here. I recommend reading them as "then." - Load starwars, then... - Mutate a new column called bmi as mass over height, then... - Select only the columns name and bmi

We can continue adding on this if we like. For example, let's say that I want to filter to only bmi values above 30. Then we can do that as follows:

starwars %>%
mutate(bmi = mass/height) %>%
select(name, bmi) %>%
filter(bmi > 0.5)

### Practice with mutate

Now lets do a problem with the mutate function! Create a calculated variable called height_in that converts the current height values (in cm) to inches. We can do that by dividing height by 2.54.

starwars %>%
...(height_in = ...) %>%
select(...)
starwars %>%
...(height_in = height/...) %>%
select(name, ...)
starwars %>%
mutate(height_in = height/2.54) %>%
select(name, height_in)

## Using group_by and summarise

### Group by a variable

Last, we're going to discuss how to use dplyr to create summary statistic tables for particular variables.

As a first step, we want to pick one (or more) variables to group by. These are typically categorical variables whose categories we want to compare separately across rows. For example:

starwars %>%
group_by(sex)

However...this by itself doesn't actually do anything. We need to combine it with another request, like summarise!

### Summarise

Now let's try creating a summary measure for a variable of interest, where each measure we request will appear as a column, and each category of the variable we grouped by will be a row.

Notice that in this case, we need to add na.rm = TRUE to remove the NA values in the height column for this dataset. However, when there are no NA values, this added argument is not needed.

starwars %>%
group_by(sex) %>%
summarise(avg_height = mean(height, na.rm = TRUE))

Notice that the variable we group by (sex) is broken up across rows, and the summary statistic we calculated will show up and automatically calculated for each category.

Doing this with base R would otherwise be quite tedious. We'd have to make 3 separate subsets, and then report the means of each.

### Summarise multiple measures

We can create as many summary measures as we want!

starwars %>%
group_by(sex) %>%
summarise(
avg_height = mean(height, na.rm = TRUE),
avg_mass = mean(mass, na.rm = TRUE),
sd_mass = sd(mass, na.rm = TRUE))

Since some of these categories only have 1 entry, that means that we'll get an NA result when calcuating standard deviation.

### Multiple group subsets

We could also do this for multiple groups if we wish. Let's say we want to look at even more specific subsets, like eye color and gender.

starwars %>%
group_by(eye_color, gender) %>%
summarise(
avg_height = mean(height, na.rm = TRUE),
avg_mass = mean(mass, na.rm = TRUE))

### But let's be more choosy

In this case, that might be a lot of combinations, so let's add a filter in front to choose some specific category options. I'll again use %in% to choose multiple categories:

starwars %>%
filter(eye_color %in% c("brown", "blue", "green", "yellow"),
gender %in% c("feminine", "masculine")) %>%
group_by(eye_color, gender) %>%
summarise(
avg_height = mean(height, na.rm = TRUE),
avg_mass = mean(mass, na.rm = TRUE))

Note, if working in an RMarkdown file, this will print as a nice table directly in your knitted report!

### Practice!

Let's practice with summarise. But let's work with diamonds again for this practice.

diamonds

Create a summary table that calculates the average carat value for diamonds of the following qualities: "Fair", "Good" and "Premium". Name your summary variable "Average".

Your table will have 2 columns and 3 rows if done correctly:

diamonds %>%
filter(...) %>%
group_by(...) %>%
...
diamonds %>%
filter(cut %in% ...) %>%
group_by(cut) %>%
summarise(Average = ...)
diamonds %>%
filter(cut %in% c("Fair", "Good", "Premium")) %>%
group_by(cut) %>%
summarise(Average = mean(carat))

### Acknowledgment

This tutorial was built by Brandon Pazmino (UIUC '21) and Kelly Findley. We hope this experience was helpful for you!

If you'd like to learn more about dplyr`, you may find the RStudio cheatsheet helpful (try a web search, or use the help tab from the menu to access cheatsheets). Here is another very accessible discussion of the features we learned, plus a few more: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html