Introduction
Tutorial Goals
In this tutorial, we will use the subset and ifelse functions to complete some basic “data wrangling.”
What is Data Wrangling?
Data wrangling involves basic manipulation of data to prepare it for analysis. Some examples include:
- Cleaning data
- Identifying and removing/masking extreme outliers
- Converting a variable type to another data type (e.g., from numeric to binary)
We’ve already done some basic data cleaning within a spreadsheet, and in this tutorial, we’ll focus on the second and third items here.
Subsetting Numerically
Mini Class Data
Let’s take a look at this small data frame, representing 10 students in a class:
Class
Filtering Numerically
Let’s say that we’d like to look only at students who are below 67 inches tall. We can do that with the following use of subset:
subset(x = Class, subset = height < 67)
How does that work?
The subset function has two important arguments to fill in:
- x: the object (typically a data frame) that we want to subset
- subset: the criteria we are using to select rows
TIP: Whenever learning about a new function in R, you can always use the ? tool to search for documentation. For example, run ?subset in your R console to read about different arguments in subset and how they work!
Still returns a data frame!
Read the subset we just ran as follows:
Take a subset of…
- The data frame Class
And only display rows such that…
- height is less than 67
Notice that we still get all of the original variables (height, acad_level, and siblings) in the output. subset just tells R a decision criterion for which rows to output.
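As a minimal sketch of this point, the block below builds a small made-up data frame (demo_class is hypothetical and is not the tutorial's Class data) and checks that subset hands back a data frame with every original column.

```r
# Hypothetical stand-in for the Class data; the values are invented.
demo_class = data.frame(
  height = c(70, 64, 66, 67, 72, 65),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 2, 3, 6, 1)
)

# Keep only the rows where height is below 67.
short = subset(demo_class, height < 67)

is.data.frame(short)   # TRUE: the result is still a data frame
names(short)           # the same three column names as demo_class
nrow(short)            # 3 rows remain (heights 64, 66, and 65)
```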
We can drop the argument names
From now on, I will drop the argument names and just write the entries. R will know what they mean because it matches them by position!
subset(Class, height < 67)
Notice: No data = … argument
If you do include an argument name, be careful that you are typing x rather than data here. Note that data = ... will not work with subset!
subset(data = Class, height < 67) will not work!
subset(x = Class, height < 67) or subset(Class, height < 67) both work though.
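To see why the argument name matters, here is a small sketch using a made-up data frame (demo_class is hypothetical). It wraps the data = version in tryCatch so the error is captured as text instead of stopping the script. The exact wording of the message may differ across R versions and language settings.

```r
# Hypothetical stand-in for the Class data; the values are invented.
demo_class = data.frame(
  height = c(70, 64, 66, 67, 72, 65),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 2, 3, 6, 1)
)

# data is not an argument name of subset, so height < 67 gets treated as x.
# R then looks for a standalone object named height and fails.
attempt = tryCatch(
  subset(data = demo_class, height < 67),
  error = function(e) conditionMessage(e)
)
attempt   # an error message stored as text

# Naming the first argument x (or not naming it at all) works.
works = subset(x = demo_class, height < 67)
nrow(works)   # 3
```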
Less/greater than or equal
What if we wanted to include 67 in our subset, that is, heights of 67 inches or lower? Just add an = sign after the <:
subset(Class, height <= 67)
Practice!
Go ahead and use the diabetes dataset from faraway to try subsetting!
library(faraway)
diabetes
Subset this data frame of 403 rows to only include individuals with a cholesterol level above 200. If done correctly, you’ll have 217 rows left!
subset(diabetes, ____________)
subset(diabetes, chol > 200)
Filtering Categories
Filtering to specific categories
We can also use subset to choose a particular category. For example, what if we only wanted to select students who were Freshmen from the class?
subset(Class, acad_level == "Freshman")
How did we set that up?
In R, think of the double equals (==) as saying “matches.”
In the previous example, we might read that as acad_level matches Freshman.
Also notice we need quotation marks around Freshman. To select specific names, use quotation marks. To filter down to value ranges, no quotation marks are needed.
Not this category
We might also choose to simply exclude one category. We can do that with !=, which you should read as “does not match.”
How would we select all students that are not Sophomores?
subset(Class, acad_level __ ___________)
subset(Class, acad_level != "___________")
subset(Class, acad_level != "Sophomore")
A note from the last one
Hopefully you noticed that we needed to select Sophomore rather than Sophomores (plural) or sophomore (lowercase). Remember that R is case sensitive, and is also sensitive to any small change from the exact way a value is recorded in the data!
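The sketch below illustrates that sensitivity on a made-up data frame (demo_class is hypothetical, not the tutorial's Class data). A mismatched spelling does not raise an error; it simply matches nothing, so checking the recorded values first with unique can save some confusion.

```r
# Hypothetical stand-in for the Class data; the values are invented.
demo_class = data.frame(
  height = c(70, 64, 66, 67, 72, 65),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 2, 3, 6, 1)
)

# See exactly how the categories are recorded before filtering.
unique(demo_class$acad_level)

exact = subset(demo_class, acad_level == "Sophomore")
lower = subset(demo_class, acad_level == "sophomore")

nrow(exact)   # 1: the spelling matches the data
nrow(lower)   # 0: no error, but nothing matches the lowercase spelling
```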
Practice
Return to the diabetes data
diabetes
Filter your data to only include individuals for whom gender was recorded as female.
subset(_______, gender __________)
subset(diabetes, gender == "female")
Multiple Commands
Using &
With subset, I can also apply multiple conditions at once using the & symbol. Perhaps we want to include students that are not sophomores AND that have at least 2 siblings.
subset(Class, acad_level != "Sophomore" & siblings > 1)
Using | the “Or” symbol
Likewise, we can also use the | symbol to make an or statement.
- Note, the | symbol is typically above the enter key and activated by holding shift while you press it.
How about students that are Juniors OR are at least 68 inches tall?
subset(Class, acad_level == "Junior" | height >= 68)
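To make the difference between & and | concrete, here is a rough sketch on a made-up data frame (demo_class is hypothetical). With &, a row must pass both conditions, so the result can never have more rows than either condition alone. With |, passing one condition is enough.

```r
# Hypothetical stand-in for the Class data; the values are invented.
demo_class = data.frame(
  height = c(70, 64, 66, 67, 72, 65),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 2, 3, 6, 1)
)

# Each condition on its own.
nrow(subset(demo_class, acad_level != "Sophomore"))   # 5
nrow(subset(demo_class, siblings > 1))                # 4

# AND: not a Sophomore and more than 1 sibling.
both = subset(demo_class, acad_level != "Sophomore" & siblings > 1)
nrow(both)     # 3

# OR: a Junior, or at least 68 inches tall (or both).
either = subset(demo_class, acad_level == "Junior" | height >= 68)
nrow(either)   # 3
```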
Selecting multiple categories
What if I want Freshmen and Sophomores? We can’t select just one, or all but one. Instead, we can use %in% and then list our categories of interest in a vector!
subset(Class, acad_level %in% c("Freshman", "Sophomore"))
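As a sketch of what %in% is doing, the block below (using a hypothetical demo_class data frame with invented values) shows that it gives the same rows as chaining == with |. It also shows a common slip: using == with a vector of categories. That version runs without an error but compares values position by position, so it usually returns the wrong rows.

```r
# Hypothetical stand-in for the Class data; the values are invented.
demo_class = data.frame(
  height = c(70, 64, 66, 67, 72, 65),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 2, 3, 6, 1)
)

with_in = subset(demo_class, acad_level %in% c("Freshman", "Sophomore"))
with_or = subset(demo_class, acad_level == "Freshman" | acad_level == "Sophomore")

identical(with_in, with_or)   # TRUE: both approaches keep the same 3 rows

# == recycles the two-item vector down the column: row 1 is checked against
# "Freshman", row 2 against "Sophomore", row 3 against "Freshman", and so on.
with_eq = subset(demo_class, acad_level == c("Freshman", "Sophomore"))
nrow(with_eq)                 # 0 for this data, even though 3 rows qualify
```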
Practice
Using the diabetes data frame, create a subset that includes individuals that are female with a cholesterol level above 180.
subset(_______, gender == "________" _ _________)
subset(diabetes, gender == "female" & __________)
subset(diabetes, gender == "female" & chol > 180)
Saving your Subset
Assigning to a name
Often when subsetting, the goal isn’t just to see the results at a single moment, but to save the results and use them inside another function (like a plot!).
We can assign a subset to a name as follows:
upper_class = subset(Class, acad_level %in% c("Junior", "Senior"))
Nothing will output. It’s just saving the object in the global environment!
Quick View
You can view it by then running the name by itself.
upper_class
Using a Subset inside a plot
Now, if I wanted to make a plot with just the upperclassmen, I would input that as my data argument.
ggplot(data = upper_class, aes(x = acad_level)) +
  geom_bar()
Your turn!
Go ahead and subset the Class data to exclude the student with 6 siblings.
- Call this subset Class_sibs
- You could do this one of several ways!
Then create a histogram of the height variable from Class_sibs
- Change to 5 bins since this is a very small dataset
- Add black border color and a fill color of your choice!
This exercise won’t check for correctness, but you can view a sample solution!
Class_sibs = subset(__________, ________________)
ggplot(___________) +
  geom_histogram()
Class_sibs = subset(Class, siblings _____________)
ggplot(data = _________, aes(x = _____________)) +
  geom_histogram(bins = _, color = "black", fill = "__________")
Class_sibs = subset(Class, siblings < 6)
ggplot(data = Class_sibs, aes(x = height)) +
  geom_histogram(bins = 5, color = "black", fill = "pink")
ifelse Statement
How it helps
In some cases, we might wish to change the form of a variable in our dataset to something simpler.
Instead of reporting height as a numeric variable, we could dichotomize it to a yes or no representing whether the student is above a certain height value or not.
Or, we could simplify acad_level to just designate whether the student is an upperclassman or underclassman.
How it works
The ifelse statement has three inputs:
- test: a logical criterion that a row will either meet or not meet
- yes: the value to output if the criterion is met
- no: the value to output if the criterion is not met
Dichotomize Height
Recall our data frame is Class, and height is a variable in this data frame (one of several features we have collected from each student in the class).
Let’s run an ifelse where our logical criterion is whether a student is at least 69 inches tall.
ifelse(Class$height >= 69, "yes", "no")
How did it work?
It took the vector Class$height and went through its values in order. It then output yes when that criterion was met and no when it wasn’t.
Look at the Class data for reference… It appears to be met for the 1st, 5th, 8th, and 9th students!
Class
Don’t forget $
In contrast to subset, where our object was the whole data frame, our logical statement in this case involves only one particular variable embedded in a data frame. Be sure you call on it correctly!
- ifelse(Class$height >= 69, ...). Correct!
- not ifelse(Class >= 69, ...). R won’t know which specific variable in Class we’re asking about.
- not ifelse(height >= 69, ...). R won’t know where to find height without identifying the data frame it exists within.
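The following sketch, run on a hypothetical demo_class data frame with invented values, shows what happens when the $ is left off. The failing call is wrapped in tryCatch so the error is captured as text. The last lines use base R's with function as one possible alternative to typing the data frame name before each variable; the tutorial itself uses the $ form.

```r
# Hypothetical stand-in for the Class data; the values are invented.
demo_class = data.frame(
  height = c(70, 64, 66, 67, 72, 65),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 2, 3, 6, 1)
)

# Without the data frame name, R searches for a standalone object called height.
bare = tryCatch(
  ifelse(height >= 69, "yes", "no"),
  error = function(e) conditionMessage(e)
)
bare   # an error message stored as text

# With $, R knows to look inside demo_class.
dollar = ifelse(demo_class$height >= 69, "yes", "no")
dollar   # "yes" "no" "no" "no" "yes" "no"

# with() evaluates the expression inside the data frame, giving the same result.
inside = with(demo_class, ifelse(height >= 69, "yes", "no"))
identical(dollar, inside)   # TRUE
```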
Adding onto our data
Let’s save our output as a new variable named ht_binary.
For this situation, let’s also add this variable into our data frame. We can do that by linking ht_binary with a $ to our data frame Class:
Class$ht_binary = ifelse(Class$height >= 69, "yes", "no")
Class
Look carefully!
Notice that this new variable ht_binary now appears on the far right end. We could now use that variable within a plot where we call on Class as our source data.
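As a quick sketch of this step on made-up data (demo_class is hypothetical), the block below adds the new column, confirms that it lands in the last position, and tallies the two categories with table as a simple sanity check.

```r
# Hypothetical stand-in for the Class data; the values are invented.
demo_class = data.frame(
  height = c(70, 64, 66, 67, 72, 65),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 2, 3, 6, 1)
)

# Assigning to a new name after $ creates the column.
demo_class$ht_binary = ifelse(demo_class$height >= 69, "yes", "no")

names(demo_class)                 # ht_binary is now the fourth and last column
counts = table(demo_class$ht_binary)
counts                            # 4 "no" and 2 "yes" for this made-up data
```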
A challenge!
Now, create a new variable called acad_binary that will output upper if the student is a Junior or Senior and lower if the student is not.
Click through to see one sample solution.
Check Class after you’re done to see that the column was created as you expected!
Class$__________ = ifelse(_________________)
Class$acad_binary = ifelse(Class$acad_level %in% c(___________), ________, _________)
Class$acad_binary = ifelse(Class$acad_level %in% c("Junior", "Senior"), "upper", "lower")
Return Home
This tutorial was created by Kelly Findley and Brandom Pazmino (UIUC ’13). We hope this experience was helpful for you!
If you’d like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/