Data Wrangling with Pipes

Pipes and Data Wrangling

Tutorial Goals

In this tutorial, we will:

Introduce the idea of a “pipe” for doing data wrangling using the dplyr package
Use a filter argument as a way to include or exclude certain rows in our data
Create plots inside our pipe based on the filtering we have done
Make a summary table using the summarise function and group_by in a pipe

Another tidyverse package

ggplot2 is one of many packages in the tidyverse, but let’s talk about another one…

dplyr (usually pronounced as “d plyer”) offers some additional functions that help with “data wrangling.” The name of the package comes from the idea of “plying” data to extract information!

Installed through tidyverse

If you have already installed tidyverse on your computer (or in your RStudio Cloud Project), then you have dplyr installed as well! Just be sure to run library(tidyverse) before using dplyr functions to activate the package.

What is Data Wrangling?

Data wrangling involves basic manipulation with data to prepare for analysis. Some examples include:

Cleaning data
Filtering select variables to mask outliers or focus on specific categories.
Converting a variable type to another data type (e.g., from numeric to binary)

We’ll largely focus on the second of these things in this tutorial!

Filtering Numerically

Mini Class Data

Let’s take a look at this small data frame, representing 10 students in a class

Class

Introducing Pipes

A pipe is the name of pairing two symbols: |>. In many ways, it’s like using + to connect lines of your ggplot. Think of it as an and then statement, where you are proceeding through commands in logical order.

The vertical line | can typically be found over the enter key on your keyboard and may also require you to hold “shift” while pressing it.

Our goal in words

Perhaps we would like to see students who have at least 1 sibling. We’d like to tell R…

call on the Class dataframe (and then)
filter siblings to be greater than 0

Our goal in code

Class |>
  filter(siblings > 0)

less/greater than or equal

We can also make a less than or equal to (<=) or greater than or equal to (>=) argument.

Class |>
  filter(siblings >= 1)

Try it!

Use a filter argument in a pipe to output only students who are less than 67 inches tall.

Class |>
  ______________

Class |>
  filter(height _____)

Filtering Categorically

Filter by academic level

The above examples involve filtering numeric variables, but we can also filter to include/exclude various levels of a categorical variable.

For example, what if we only wanted to select students who were Freshman from the class?

Class |>
  filter(acad_level == "Freshman")

How did we set that up?

In R, think of the double equals (==) as like saying “matches.”

In the previous example, we might read that as acad_level matches Freshman.

Also notice we need quotation marks around Freshman. To select specific names, use quotation marks. To filter down to value ranges, no quotation marks needed.

Does NOT match this category

We might also choose to simply exclude one category. We can do that with != which you should read as “Does not match.”

How would we select all students that are not a Sophomore?

Class |>
  filter(acad_level __ ____________)

Class |>
  filter(acad_level != ____________)

Class |>
  filter(acad_level != "Sophomore")

Watch for cAsE and exact letter matches

I hope you were successful to notice we needed to select Sophomore rather than Sophomores (plural) or sophomore (lowercase).

R is case sensitive, as well as sensitive to any small change to the exact way it is recorded in the data!

Multiple Filter Commands

“And” statements

With filter, I can also implement multiple commands at once using the & symbol. Perhaps we want to include students that are Juniors AND that have at least 1 sibling.

Class |>
  filter(acad_level == "Junior" & siblings > 0)

“Or” statements

Likewise, we can also use the | symbol to make an or statement. It is both one component of our pipe symbol and a stand-alone logic symbol!

How about students that are Juniors OR are at least 68 inches tall.

Class |>
  filter(acad_level == "Junior" | height >= 68)

“In” statements

What if I want Freshmen and Sophomores? We can’t select just one, or all but one. Instead, we can use %in% and then list our categories of interest in a vector.

Class |> 
  filter(acad_level %in% c("Freshman", "Sophomore"))

Practice

I would like a list of all students who are more than 67 inches tall, but not Dan. How might I do that?

Class |>
  filter(____________________)

Class |>
  filter(height > 67 & ___________)

Class |>
  filter(height > 67 & name != "Dan")

Plotting within a pipe

New example: Midwest data

Let’s transition to a larger dataframe now!

In the midwest dataframe, each row represents a county in one of five states: Illinois, Indiana, Michigan, Ohio, or Wisconsin.

midwest

Looking at poptotals

Perhaps we want to compare county geographic areas for Illinois and Wisconsin. But if we try to plot first, we’re going to get all 5 states included

ggplot(data = midwest, aes(x = state, y = area, color = state)) +
  geom_jitter(width = 0.05)

Plotting inside a pipe

What we want to do first is filter to only include IL and WI counties.

In the following code, we do the following - Call on midwest (and then) - Filter to only include counties in IL or WI (and then) - plot the data

And notice that we exclude the data argument in the plot since it is already listed to start the pipe.

midwest |>
  filter(state %in% c("IL", "WI")) |>
  ggplot(aes(x = state, y = area, color = state)) +
  geom_jitter(width = 0.05)

Wait, why is there no second data argument?

Because the data argument is already out front with a pipe. If you make the data argument again, R may ignore the filtering done first and assume you are starting over. Or in some cases, R will just output in error.

Masking outliers

Another useful application of plotting in a pipe is masking outliers. For example, maybe we want to compare county population sizes after excluding the highly urban counties like Cook County (Chicago area) to better see how most counties compare.

midwest |>
  filter(state %in% c("IL", "WI")) |>
  ggplot(aes(x = state, y = poptotal, color = state)) +
  geom_jitter(width = 0.05)

A second filter argument

Let’s complete another pipe that keeps our two state comparison, but also filters to only include county populations below 500,000. Be careful to report the value in the filter without commas!

midwest |>
  filter(state %in% c("IL", "WI") & poptotal < 500000) |>
  ggplot(aes(x = state, y = poptotal, color = state)) +
  geom_jitter(width = 0.05)

Summary Tables

Using summarise

Let’s change gears–perhaps we’d like to compare county populations across states. One measure of interest might be comparing mean county populations.

Before comparing states, notice how we might calculate a mean all by itself inside a pipe.

midwest |>
  summarise(mean(poptotal))

More summarise arguments

We could similarly calculate other summary statistics. We just need to separate them by commas.

It’s also more readable to put each on a different line, so we will do that from here on!

A few examples:

mean finds the mean
median finds the median
sd finds the standard deviation
max finds the maximum
min finds the minimum
length finds the number of observations

midwest |>
  summarise(mean(poptotal),
            median(poptotal),
            sd(poptotal),
            max(poptotal),
            min(poptotal),
            length(poptotal))

Naming the columns

Those are probably not very attractive column headers. Let’s give them some nicer names. No spaces though, as these are like dataframe column headers!

midwest |>
  summarise(Mean = mean(poptotal),
            Median = median(poptotal),
            St_Dev = sd(poptotal),
            Max = max(poptotal),
            Min = min(poptotal),
            Count = length(poptotal))

Creating a summary table

But the real power of piping comes in creating and comparing summary measures across groups. We can add a group_by statement first to communicate that we wish to generate and compare these measures across states.

midwest |>
  group_by(state) |>
  summarise(Mean = mean(poptotal),
            Median = median(poptotal),
            St_Dev = sd(poptotal),
            Max = max(poptotal),
            Min = min(poptotal),
            Count = length(poptotal))

Interpret the results!

Some observations we might make from the previous table include:

Illinois has the highest standard deviation in county populations–likely because it has by far the highest single county population of 5,105,067.
Illinois also has the highest number of counties at 102, and Wisconsin the lowest at 72.
Ohio has the highest mean county population and the highest median county population.
Michigan has the smallest county by population–there’s a Michigan county with a population of only 1,701.

Rounding your statistics

If you wanted to adjust how much those values round to, you can use a round function around your statistics.

Note that round takes two arguments

The value (or in our case, a summarizing function),
The digits argument.

Use digits to communicate how much to round

digits = 0 rounds to a whole number
digits = 2 would round to 2 decimal places
digits = -2 would round to the 100s place

a = 175.428

round(a, digits = 1)

round(a, digits = 0)

round(a, digits = -1)

Try it with Midwest

Round the values in our summary table to the 100s place.

midwest |>
  group_by(state) |>
  summarise(Average = round(mean(poptotal), digits = __),
            Median = round(median(poptotal), digits = __),
            St_Dev = _____(sd(poptotal), digits = __))

midwest |>
  group_by(state) |>
  summarise(Average = round(mean(poptotal), digits = -2),
            Median = round(median(poptotal), digits = -2),
            St_Dev = round(sd(poptotal), digits = -2))

Removing NAs

Comparing penguin species

Let’s say we wanted to compare penguin species by flipper length. But notice what happens in this scenario:

penguins |>
  group_by(species) |>
  summarise(Mean = mean(flipper_length_mm),
            SD = sd(flipper_length_mm))

What happened?

The penguins data had a few cells with missing data. As a result, if we tried calculating summary measures from variables with empty cells, we might see some “NA” results indicating that.

penguins

na.rm = TRUE

To get around this, we can add an argument inside our statistical measures. na.rm = TRUE simply indicates to R to remove NAs when making these calculations. Thus, we’ll find the summary measures of the data that is reported.

penguins |>
  group_by(species) |>
  summarise(Mean = mean(flipper_length_mm, na.rm = TRUE),
            SD = sd(flipper_length_mm, na.rm = TRUE))

Want to learn more?

We only scratched the surface of data wrangling with dplyr. If you’re interested in learning more, check out the RStudio cheatsheet (you can find all tidyverse cheatsheets here: https://www.rstudio.com/resources/cheatsheets/)

Here is another very accessible discussion of the features we learned, plus a few more: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Return Home

This tutorial was built by Kelly Findley. I hope this experience was helpful for you!

If you’d like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/