
Introduction

Tutorial Goals

In this tutorial, we will use the subset and ifelse functions to complete some basic “data wrangling.”

What is Data Wrangling?

Data wrangling involves basic manipulation of data to prepare it for analysis. Some examples include:

  • Cleaning data
  • Identifying and removing/masking extreme outliers
  • Converting a variable type to another data type (e.g., from numeric to binary)

We’ve already done some basic data cleaning within a spreadsheet, and in this tutorial, we’ll focus on the second and third items here.

Subsetting Numerically

Mini Class Data

Let’s take a look at this small data frame, representing 10 students in a class:

Class

Filtering Numerically

Let’s say that we’d like to only look at students who are below 67 inches tall. We can do that with the following use of subset:

subset(x = Class, subset = height < 67)

How does that work?

The subset function has two important arguments to fill in:

  • x: The object (typically a data frame) that we want to subset
  • subset: the criteria we are using to select rows

TIP: Whenever learning about a new function in R, you can always use the ? tool to search for documentation. For example, run ?subset in your R console to read about different arguments in subset and how they work!

Still returns a data frame!

Read the subset we just ran as follows:

Take a subset of…

  • The dataframe Class

And only display rows such that…

  • height is less than 67

Notice that we still get all of the original variables (height, acad_level, and siblings) in the output. subset just gives R a criterion for deciding which rows to output.
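To see this on something small, here is a minimal sketch using a made-up data frame called demo. Its values are invented for illustration and are not the tutorial's Class data. It checks that the result of subset is still a data frame with all three columns.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(
  height = c(70, 64, 66, 67, 71, 62),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 0, 3, 6, 1)
)

short = subset(x = demo, subset = height < 67)

class(short)   # "data.frame"
names(short)   # "height" "acad_level" "siblings"
nrow(short)    # 3 (the rows with heights 64, 66, and 62)
```

Only the number of rows changed. The columns came through untouched.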

We can drop the argument names

From now on, I will drop the argument names and just write the entries. R matches unnamed arguments by position, so it will know what they mean!

subset(Class, height < 67)

Notice: No data = … argument

If you do include an argument name, be careful to type x rather than data here. Note that data = ... will not work with subset!

subset(data = Class, height < 67) will not work!

subset(x = Class, height < 67) or subset(Class, height < 67) both work though.
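If you are curious what the failure looks like, the sketch below captures the error instead of letting it stop the code. It uses a made-up data frame called demo (hypothetical values) and assumes there is no separate object named height in your workspace.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(height = c(70, 64, 66), siblings = c(1, 2, 0))

# tryCatch() captures the error message instead of halting
msg = tryCatch(
  subset(data = demo, height < 67),
  error = function(e) conditionMessage(e)
)

msg   # an error message complaining that height cannot be found
```

Because nothing is labeled x, R treats the first unnamed entry (height < 67) as x and looks for height on its own, outside any data frame.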

less/greater than or equal

What if we wanted to include 67 in our subset, that is, heights of 67 inches or lower? Just add an = sign after the <.

subset(Class, height <= 67)
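Here is a quick sketch of the difference, using a made-up data frame called demo (hypothetical values) in which exactly one student is 67 inches tall.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(height = c(70, 64, 66, 67, 71, 62))

n_less = nrow(subset(demo, height < 67))
n_less_equal = nrow(subset(demo, height <= 67))

n_less         # 3: the student at exactly 67 is excluded
n_less_equal   # 4: the student at exactly 67 is included
```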

Practice!

Go ahead and use the diabetes dataset from faraway to try subsetting!

library(faraway)
diabetes

Subset this data frame of 403 rows to only include individuals with a cholesterol level above 200. If done correctly, you’ll have 217 rows left!

subset(diabetes, ____________)
subset(diabetes, chol > 200)

Filtering Categories

Filtering to specific categories

We can also use subset to choose a particular category. For example, what if we only wanted to select students who were Freshmen from the class?

subset(Class, acad_level == "Freshman")

How did we set that up?

In R, think of the double equals (==) as like saying “matches.”

In the previous example, we might read that as acad_level matches Freshman.

Also notice we need quotation marks around Freshman. To match specific category names (text values), use quotation marks. To filter on numeric values, no quotation marks are needed.
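It can help to look at what == produces by itself. In this sketch (with a made-up data frame called demo, hypothetical values), the comparison returns one TRUE or FALSE per row, and subset keeps the rows marked TRUE.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman")
)

# The comparison alone: one TRUE/FALSE per row
is_freshman = demo$acad_level == "Freshman"
is_freshman   # FALSE TRUE FALSE FALSE FALSE TRUE

# subset keeps the rows where the comparison is TRUE
freshmen = subset(demo, acad_level == "Freshman")
nrow(freshmen)   # 2
```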

Not this category

We might also choose to simply exclude one category. We can do that with != which you should read as “Does not match.”

How would we select all students that are not Sophomores?

subset(Class, acad_level __ ___________)
subset(Class, acad_level != "___________")
subset(Class, acad_level != "Sophomore")

A note from the last one

Hopefully you noticed that we needed to type Sophomore rather than Sophomores (plural) or sophomore (lowercase). Remember that R is case sensitive, and it is also sensitive to any small departure from the exact way a value is recorded in the data!
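A misspelled category does not cause an error, which makes it easy to miss. The sketch below uses a made-up data frame called demo (hypothetical values) to show what happens, along with unique() as one way to check the exact spellings first.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 0, 3, 6, 1)
)

# Check exactly how the categories are recorded
unique(demo$acad_level)

n_correct = nrow(subset(demo, acad_level != "Sophomore"))
n_lowercase = nrow(subset(demo, acad_level != "sophomore"))

n_correct     # 5: the one Sophomore is dropped
n_lowercase   # 6: nothing matches "sophomore", so no rows are dropped
```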

Practice

Return to the diabetes data

diabetes

Filter your data to only include individuals for whom gender was recorded as female.

subset(_______, gender __________)
subset(diabetes, gender == "female")

Multiple Commands

Using &

With subset, I can also apply multiple conditions at once using the & symbol. Perhaps we want to include students who are not Sophomores AND who have at least 2 siblings.

subset(Class, acad_level != "Sophomore" & siblings > 1)

Using | the “Or” symbol

Likewise, we can also use the | symbol to make an or statement.

  • Note, the | symbol is typically above the enter key and activated by holding shift while you press it.

How about students who are Juniors OR are at least 68 inches tall?

subset(Class, acad_level == "Junior" | height >= 68)
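To compare & and | side by side, here is a small sketch with a made-up data frame called demo (hypothetical values, not the Class data). With &, a row must meet both conditions. With |, meeting either one is enough.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(
  height = c(70, 64, 66, 67, 71, 62),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman"),
  siblings = c(1, 2, 0, 3, 6, 1)
)

# AND: Juniors who are also at least 68 inches tall
n_and = nrow(subset(demo, acad_level == "Junior" & height >= 68))

# OR: Juniors, plus anyone at least 68 inches tall
n_or = nrow(subset(demo, acad_level == "Junior" | height >= 68))

n_and   # 1
n_or    # 3
```

An OR subset is never smaller than the matching AND subset, since every row that meets both conditions also meets at least one.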

Selecting multiple categories

What if I want Freshmen and Sophomores? We can’t select just one, or all but one. Instead, we can use %in% and then list our categories of interest in a vector!

subset(Class, acad_level %in% c("Freshman", "Sophomore"))
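One way to think about %in% is as shorthand for several == comparisons joined with |. The sketch below, using a made-up data frame called demo (hypothetical values), checks that the two approaches select the same rows.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(
  height = c(70, 64, 66, 67, 71, 62),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman")
)

with_in = subset(demo, acad_level %in% c("Freshman", "Sophomore"))
with_or = subset(demo, acad_level == "Freshman" | acad_level == "Sophomore")

nrow(with_in)                  # 3
all.equal(with_in, with_or)    # TRUE
```

With only two categories either version is fine, but %in% stays readable as the list of categories grows.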

Practice

Using the diabetes dataframe, create a subset that includes individuals that are female with a cholesterol level above 180.

subset(_______, gender == "________" _ _________)
subset(diabetes, gender == "female" & __________)
subset(diabetes, gender == "female" & chol > 180)

Saving your Subset

Assigning to a name

Often when subsetting, the goal isn’t just to see the results at a single moment, but to save the results and use them inside another function (like a plot!).

We can assign a subset to a name as follows:

upper_class = subset(Class, acad_level %in% c("Junior", "Senior"))

Nothing will output. It’s just saving the object in the global environment!

Quick View

You can view it by then running the name by itself.

upper_class
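Saving a subset creates a new object and leaves the original data frame alone. Here is a minimal sketch with a made-up data frame called demo (hypothetical values). The name upper_demo is also just for illustration.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(
  height = c(70, 64, 66, 67, 71, 62),
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman")
)

upper_demo = subset(demo, acad_level %in% c("Junior", "Senior"))

nrow(upper_demo)   # 3: only the Juniors and the Senior
nrow(demo)         # 6: the original data frame is unchanged
```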

Using a Subset inside a plot

Now, if I wanted to make a plot with just the upperclassmen, I would input upper_class as my data argument.

library(ggplot2)  # ggplot() comes from the ggplot2 package

ggplot(data = upper_class, aes(x = acad_level)) +
  geom_bar()

Your turn!

Go ahead and subset the Class data to exclude the student with 6 siblings.

  • Call this subset Class_sibs
  • You could do this one of several ways!

Then create a histogram of the height variable from Class_sibs

  • Change to 5 bins since this is a very small dataset
  • Add black border color and a fill color of your choice!

This exercise won’t check for correctness, but you can check for a sample solution!

Class_sibs = subset(__________, ________________)

ggplot(___________) +
  geom_histogram()
Class_sibs = subset(Class, siblings _____________)

ggplot(data = _________, aes(x = _____________)) +
  geom_histogram(bins = _, color = "black", fill = "__________")
Class_sibs = subset(Class, siblings < 6)

ggplot(data = Class_sibs, aes(x = height)) +
  geom_histogram(bins = 5, color = "black", fill = "pink")

ifelse Statement

How it helps

In some cases, we might wish to change the form of a variable in our dataset to something simpler.

Instead of reporting height as a numeric variable, we could dichotomize that to a yes or no representing whether the student is above a certain height value or not.

Or, we could simplify acad_level to just designate whether the student is an upperclassman or underclassman.

How it works

The ifelse statement has three inputs:

  • test: a logical condition that each value will either meet or not meet
  • yes: the value to output if the condition is met
  • no: the value to output if the condition is not met
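Before applying this to a data frame, here is a tiny sketch on a standalone vector. The vector heights and its values are made up for illustration. The three arguments are written out by name so you can see which is which.

```r
# Hypothetical values, invented for illustration
heights = c(70, 64, 69, 67)

result = ifelse(test = heights >= 69, yes = "yes", no = "no")

result   # "yes" "no" "yes" "no"
```

The output has one entry per input value, in the same order.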

Dichotomize Height

Recall our data frame is Class, and height is a variable in this data frame (one of several features we have collected from each student in the class).

Let’s run an ifelse where our logical criteria is whether a student is at least 69 inches tall.

ifelse(Class$height >= 69, "yes", "no")

How did it work?

It took the vector Class$height and went through its values in order. It then output yes when the criterion was met and no when it wasn’t.

Look at the Class data for reference… It appears to be met for the 1st, 5th, 8th, and 9th students!

Class

Don’t forget $

In contrast to subset, where our object was the whole data frame, our logical statement in this case involves only one particular variable inside a data frame. Be sure you call on it correctly!

  • ifelse(Class$height >= 69, ...). Correct!
  • not ifelse(Class >= 69, ...). R won’t know which specific variable in Class we’re asking about.
  • not ifelse(height >= 69, ...). R won’t know where to find height without identifying the data frame it exists within.
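The sketch below shows the correct call next to the version without $, using a made-up data frame called demo (hypothetical values). It assumes there is no separate object named height in your workspace, and it captures the error rather than letting it stop the code.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(height = c(70, 64, 69, 67))

# Correct: point R at the height column inside demo
good = ifelse(demo$height >= 69, "yes", "no")
good   # "yes" "no" "yes" "no"

# Without demo$, R looks for a standalone object named height and fails
msg = tryCatch(
  ifelse(height >= 69, "yes", "no"),
  error = function(e) conditionMessage(e)
)
msg   # an error message complaining that height cannot be found
```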

Adding onto our data

Let’s save our output as a new variable named ht_binary.

For this situation, let’s also add this variable into our data frame. We can do that by linking ht_binary with a $ to our data frame Class.

Class$ht_binary = ifelse(Class$height >= 69, "yes", "no")

Class

Look carefully!

Notice that this new variable ht_binary now appears on the far right end. We could now use that variable within a plot where we call on Class as our source data.
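Once the new column exists, it behaves like any other variable. As a rough sketch with a made-up data frame called demo (hypothetical values), you could count the categories with table() or use the new column inside subset.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(
  height = c(70, 64, 66, 67, 71, 62),
  siblings = c(1, 2, 0, 3, 6, 1)
)

demo$ht_binary = ifelse(demo$height >= 69, "yes", "no")

names(demo)             # the new column is listed last
table(demo$ht_binary)   # 4 "no" and 2 "yes"

tall = subset(demo, ht_binary == "yes")
nrow(tall)              # 2
```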

A challenge!

Now, create a new variable called acad_binary that will output upper if the student is a Junior or Senior and lower if the student is not.

Click through to see one sample solution.

Check Class after you’re done to see that the column was created as you expected!

Class$__________ = ifelse(_________________)
Class$acad_binary = ifelse(Class$acad_level %in% c(___________), ________, _________)
Class$acad_binary = ifelse(Class$acad_level %in% c("Junior", "Senior"), "upper", "lower")
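One way to check a recoded column is to cross-tabulate it against the original. The sketch below does this on a made-up data frame called demo (hypothetical values, not the Class data). Every Junior and Senior should land under upper and everyone else under lower.

```r
# Hypothetical stand-in data; these values are invented for illustration
demo = data.frame(
  acad_level = c("Senior", "Freshman", "Sophomore", "Junior", "Junior", "Freshman")
)

demo$acad_binary = ifelse(demo$acad_level %in% c("Junior", "Senior"), "upper", "lower")

# Rows are the original categories, columns are the new ones
check = table(demo$acad_level, demo$acad_binary)
check

sum(demo$acad_binary == "upper")   # 3
```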

Return Home

This tutorial was created by Kelly Findley and Brandom Pazmino (UIUC ’13). We hope this experience was helpful for you!

If you’d like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/

Creating Subsets