In this tutorial, we will cover:
- the dplyr package
- the filter argument as a way to include or exclude certain rows in our data
- the summarise function and group_by in a pipe
ggplot2 is one of many packages in the tidyverse, but let's talk about another one...
dplyr (usually pronounced as "d plyer") offers some additional functions that help with "data wrangling." The name of the package comes from the idea of "plying" data to extract information!
If you have already installed tidyverse on your computer (or in your RStudio Cloud Project), then you have dplyr installed as well! Just be sure to run library(tidyverse) before using dplyr functions to activate the package.
Data wrangling involves basic manipulation of data to prepare it for analysis. Some examples include:
We'll largely focus on the second of these things in this tutorial!
Let's take a look at this small data frame, representing 10 students in a class:
Class
A pipe is the name for a pairing of two symbols: |>. In many ways, it's like using + to connect lines of your ggplot. Think of it as an "and then" statement, where you are proceeding through commands in logical order.
The vertical line | can typically be found over the enter key on your keyboard and may also require you to hold "shift" while pressing it.
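As a minimal sketch of the "and then" idea, the base R example below uses made-up numbers (not the Class data) to show that a pipe passes whatever is on its left into the function on its right. It assumes R 4.1 or later, where the |> pipe is available.

```r
# Made-up values for illustration only
values <- c(1, 4, 9)

# Nested version: read from the inside out
nested <- sum(sqrt(values))

# Piped version: take values, and then take square roots, and then add them up
piped <- values |>
  sqrt() |>
  sum()

identical(nested, piped)  # TRUE: both give 6
```

Both versions do the same work. The piped one just lists the steps in the order they happen.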
Perhaps we would like to see students who have at least 1 sibling. We'd like to tell R to...
- Call on the Class dataframe (and then)
- Filter siblings to be greater than 0
Class |>
filter(siblings > 0)
We can also make a less than or equal to (<=) or greater than or equal to (>=) argument.
Class |>
filter(siblings >= 1)
Use a filter argument in a pipe to output only students who are less than 67 inches tall.
Class |>
______________
Class |>
filter(height _____)
The above examples involve filtering numeric variables, but we can also filter to include/exclude various levels of a categorical variable.
For example, what if we only wanted to select the students in the class who are Freshmen?
Class |>
filter(acad_level == "Freshman")
In R, think of the double equals (==) as like saying "matches." In the previous example, we might read that as acad_level matches Freshman.
Also notice we need quotation marks around Freshman. To select specific names, use quotation marks. To filter down to value ranges, no quotation marks are needed.
We might also choose to simply exclude one category. We can do that with !=, which you should read as "does not match."
How would we select all students who are not Sophomores?
Class |>
filter(acad_level __ ____________)
Class |>
filter(acad_level != ____________)
Class |>
filter(acad_level != "Sophomore")
I hope you noticed that we needed to type Sophomore rather than Sophomores (plural) or sophomore (lowercase).
R is case sensitive, and it is also sensitive to any small departure from the exact way a value is recorded in the data!
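Here is a small sketch of that exact-match behavior in base R. The acad_level vector is a hypothetical stand-in for the column in Class, with invented values.

```r
# Hypothetical values standing in for an acad_level column
acad_level <- c("Freshman", "Sophomore", "Junior", "Sophomore")

acad_level == "Sophomore"   # FALSE  TRUE FALSE  TRUE
acad_level == "sophomore"   # all FALSE: lowercase "s" does not match
acad_level == "Sophomores"  # all FALSE: the extra "s" does not match

# Listing the distinct values first shows the exact spelling to match
unique(acad_level)
```

If a filter unexpectedly returns zero rows, checking the spelling with unique() is a reasonable first step.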
With filter, I can also combine multiple conditions at once using the & symbol. Perhaps we want to include students who are Juniors AND who have at least 1 sibling.
Class |>
filter(acad_level == "Junior" & siblings > 0)
Likewise, we can also use the | symbol to make an "or" statement. It is both one component of our pipe symbol and a stand-alone logic symbol!
How about students who are Juniors OR are at least 68 inches tall?
Class |>
filter(acad_level == "Junior" | height >= 68)
What if I want Freshmen and Sophomores? We can't select just one, or all but one. Instead, we can use %in% and then list our categories of interest in a vector.
Class |>
filter(acad_level %in% c("Freshman", "Sophomore"))
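One way to see what %in% is doing: it gives the same TRUE/FALSE answers as chaining == comparisons together with |. The sketch below uses an invented acad_level vector rather than the real Class data.

```r
# Invented values for illustration
acad_level <- c("Freshman", "Sophomore", "Junior", "Senior", "Freshman")

with_in <- acad_level %in% c("Freshman", "Sophomore")
with_or <- acad_level == "Freshman" | acad_level == "Sophomore"

with_in                      # TRUE  TRUE FALSE FALSE  TRUE
identical(with_in, with_or)  # TRUE
```

Note that writing acad_level == c("Freshman", "Sophomore") is not a substitute: == compares the two vectors position by position, so it can silently miss matches.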
I would like a list of all students who are more than 67 inches tall, but not Dan. How might I do that?
Class |>
filter(____________________)
Class |>
filter(height > 67 & ___________)
Class |>
filter(height > 67 & name != "Dan")
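To try this kind of filter outside the tutorial, here is a self-contained sketch. Class_demo is a hypothetical stand-in for Class: the column names come from the examples above, but every value is invented. It also shows that separating conditions with a comma inside filter behaves like &.

```r
library(dplyr)

# Hypothetical stand-in for Class; all values are invented
Class_demo <- data.frame(
  name       = c("Ana", "Ben", "Dan", "Eve"),
  height     = c(70, 66, 71, 68),
  siblings   = c(0, 2, 1, 3),
  acad_level = c("Junior", "Freshman", "Sophomore", "Junior")
)

# Taller than 67 inches AND not named Dan
tall_not_dan <- Class_demo |>
  filter(height > 67 & name != "Dan")

tall_not_dan$name  # "Ana" "Eve"

# A comma between conditions also means "and"
same_result <- Class_demo |>
  filter(height > 67, name != "Dan")

same_result$name   # "Ana" "Eve"
```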
Let's transition to a larger dataframe now!
In the midwest dataframe, each row represents a county in one of five states: Illinois, Indiana, Michigan, Ohio, or Wisconsin.
midwest
Perhaps we want to compare county geographic areas for Illinois and Wisconsin. But if we try to plot first, we're going to get all 5 states included.
ggplot(data = midwest, aes(x = state, y = area, color = state)) +
geom_jitter(width = 0.05)
What we want to do first is filter to only include IL and WI counties.
In the following code, we do the following:
- Call on midwest (and then)
- Filter to only include counties in IL or WI (and then)
- Plot the data
And notice that we exclude the data argument in the plot since it is already listed to start the pipe.
midwest |>
filter(state %in% c("IL", "WI")) |>
ggplot(aes(x = state, y = area, color = state)) +
geom_jitter(width = 0.05)
We leave out the data argument because the data is already out front at the start of the pipe. If you supply the data argument again, R may ignore the filtering done first and assume you are starting over. Or in some cases, R will just output an error.
Another useful application of plotting in a pipe is masking outliers. For example, maybe we want to compare county population sizes after excluding the highly urban counties like Cook County (Chicago area) to better see how most counties compare.
midwest |>
filter(state %in% c("IL", "WI")) |>
ggplot(aes(x = state, y = poptotal, color = state)) +
geom_jitter(width = 0.05)
Let's complete another pipe that keeps our two-state comparison, but also filters to only include county populations below 500,000. Be careful to write the value in the filter without commas!
midwest |>
filter(state %in% c("IL", "WI") & poptotal < 500000) |>
ggplot(aes(x = state, y = poptotal, color = state)) +
geom_jitter(width = 0.05)
Let's change gears. Perhaps we'd like to compare county populations across states. One measure of interest might be the mean county population.
Before comparing states, notice how we might calculate a mean all by itself inside a pipe.
midwest |>
summarise(mean(poptotal))
We could similarly calculate other summary statistics. We just need to separate them by commas.
It's also more readable to put each on a different line, so we will do that from here on!
A few examples:
- mean finds the mean
- median finds the median
- sd finds the standard deviation
- max finds the maximum
- min finds the minimum
- length finds the number of observations
midwest |>
summarise(mean(poptotal),
median(poptotal),
sd(poptotal),
max(poptotal),
min(poptotal),
length(poptotal))
Those are probably not very attractive column headers. Let's give them some nicer names. No spaces though, as these are like dataframe column headers!
midwest |>
summarise(Mean = mean(poptotal),
Median = median(poptotal),
St_Dev = sd(poptotal),
Max = max(poptotal),
Min = min(poptotal),
Count = length(poptotal))
But the real power of piping comes in creating and comparing summary measures across groups. We can add a group_by statement first to communicate that we wish to generate and compare these measures across states.
midwest |>
group_by(state) |>
summarise(Mean = mean(poptotal),
Median = median(poptotal),
St_Dev = sd(poptotal),
Max = max(poptotal),
Min = min(poptotal),
Count = length(poptotal))
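To check by hand what group_by and summarise are doing, it can help to run the same pattern on a tiny table. The sketch below uses an invented counties data frame with two made-up states (it is not the midwest data), and it swaps in dplyr's n() as an alternative way to count rows.

```r
library(dplyr)

# Invented county populations for two made-up states
counties <- data.frame(
  state    = c("A", "A", "A", "B", "B"),
  poptotal = c(100, 200, 600, 50, 150)
)

by_state <- counties |>
  group_by(state) |>
  summarise(Mean = mean(poptotal),
            Median = median(poptotal),
            Count = n())  # n() counts rows per group; same as length(poptotal) here

by_state
# state A: Mean 300, Median 200, Count 3
# state B: Mean 100, Median 100, Count 2
```

Each state gets one row in the output, with each statistic calculated from only that state's rows.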
Some observations we might make from the previous table include:
If you want to adjust how those values are rounded, you can wrap your statistics in the round function.
Note that round takes two arguments: the value to round, and digits, which communicates how much to round.
a <- 175.428
round(a, digits = 1)   # 175.4 (nearest tenth)
round(a, digits = 0)   # 175 (nearest whole number)
round(a, digits = -1)  # 180 (nearest ten)
Round the values in our summary table to the 100s place.
midwest |>
group_by(state) |>
summarise(Average = round(mean(poptotal), digits = __),
Median = round(median(poptotal), digits = __),
St_Dev = _____(sd(poptotal), digits = __))
midwest |>
group_by(state) |>
summarise(Average = round(mean(poptotal), digits = -2),
Median = round(median(poptotal), digits = -2),
St_Dev = round(sd(poptotal), digits = -2))
Let's say we wanted to compare penguin species by flipper length. But notice what happens in this scenario:
penguins |>
group_by(species) |>
summarise(Mean = mean(flipper_length_mm),
SD = sd(flipper_length_mm))
The penguins data has a few cells with missing data. As a result, if we try calculating summary measures from variables with empty cells, we might see some "NA" results indicating that.
penguins
To get around this, we can add an argument inside our statistical measures. na.rm = TRUE simply tells R to remove NAs when making these calculations. Thus, we'll find the summary measures of the data that is reported.
penguins |>
group_by(species) |>
summarise(Mean = mean(flipper_length_mm, na.rm = TRUE),
SD = sd(flipper_length_mm, na.rm = TRUE))
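The effect of na.rm = TRUE is easy to see on a short vector. This base R sketch uses invented flipper lengths, not values from the penguins data.

```r
flipper <- c(180, NA, 200)   # invented values; NA marks a missing measurement

mean(flipper)                # NA
mean(flipper, na.rm = TRUE)  # 190

# Count how many values are missing before deciding to drop them
sum(is.na(flipper))          # 1
```

It is worth checking how many values are missing, since a summary based on only a few reported values may not represent the group well.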
We only scratched the surface of data wrangling with dplyr. If you're interested in learning more, check out the RStudio cheatsheet (you can find all tidyverse cheatsheets here: https://www.rstudio.com/resources/cheatsheets/).
Here is another very accessible discussion of the features we learned, plus a few more: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
This tutorial was built by Kelly Findley. I hope this experience was helpful for you!
If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/