Using the subset command

The Subset Command

Introduction to Data Wrangling

Data wrangling involves basic manipulation of data, either to prepare it for analysis or as part of exploratory data analysis. This may involve subsetting rows, removing variables, or creating new variables as functions of the variables we already have.

In this tutorial, we will use the subset function in R. It's relatively easy to use for simpler tasks.

Exploring Diamonds

We will be looking at the built-in data set diamonds, which comes with the ggplot2 package.

library(ggplot2)
diamonds

Subsetting by Numeric Range

Now let's take a look at the subset command in R.

Let's say that we only wanted to look at diamonds with a depth greater than 60. Then we can use subset, name the data frame we are working with, and then state a condition on the variable of interest as follows.

subset(diamonds, depth > 60)

You'll notice that the output is not just that one variable, but every column of the data frame, restricted to the rows in which depth meets that criterion.

That means that the subset function is simply filtering out rows where the criterion we defined is not met.
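One way to convince yourself of this is to compare the dimensions before and after subsetting. The following is a minimal sketch that uses a small made-up data frame called gems (a hypothetical stand-in for diamonds, so it runs without any packages):

```r
# Hypothetical toy data standing in for diamonds
gems = data.frame(
  depth = c(58.5, 60.0, 61.2, 62.7),
  price = c(400, 520, 610, 880)
)

# Keep only the rows where depth is greater than 60
deep_gems = subset(gems, depth > 60)

nrow(gems)       # number of rows before filtering
nrow(deep_gems)  # fewer rows: only those with depth > 60 remain
ncol(deep_gems)  # same number of columns as gems
```

The row count shrinks while the column count stays the same, which is what "filtering rows" means here.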

Notice: No data = ... argument

One small (but important) thing to notice: We don't say data = diamonds, but instead just put diamonds as the first argument. This is different from most other functions we use in R, so remember that small difference!
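If you do want to name the arguments, the first argument of subset is called x and the second is called subset. A short sketch, again using a hypothetical toy data frame named gems, suggests the positional and named forms give the same result:

```r
# Hypothetical toy data standing in for diamonds
gems = data.frame(depth = c(58.5, 60.0, 61.2, 62.7))

# Positional form, as used throughout this tutorial
by_position = subset(gems, depth > 60)

# Named form: the data frame goes in x, the condition goes in subset
by_name = subset(x = gems, subset = depth > 60)

# Check whether the two results are exactly the same
identical(by_position, by_name)
```

In practice the positional form is shorter, which is likely why it is the one you will see most often.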

Less/Greater Than or Equal

Also keep in mind that if we want to include the particular number, we can just add an = sign after the < or > sign.

Feel free to play around with different options if you'd like!

subset(diamonds, depth <= 60)
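To see what the added = sign changes, it helps to include a row that sits exactly on the boundary. This sketch uses a hypothetical three-row data frame named gems:

```r
# Hypothetical toy data with one row exactly at depth 60
gems = data.frame(depth = c(59.0, 60.0, 61.0))

strict    = subset(gems, depth < 60)   # leaves out the row where depth is exactly 60
inclusive = subset(gems, depth <= 60)  # keeps the row where depth is exactly 60

nrow(strict)
nrow(inclusive)
```

The inclusive version returns one more row than the strict version, namely the boundary row.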

Practice!

Go ahead and use the diabetes dataset from faraway to try subsetting!

library(faraway)
diabetes

Subset your data to only include individuals with a cholesterol level above 200.

subset(diabetes, ____________)
subset(diabetes, chol > 200)

Filtering Categories

Filtering to specific categories

We can also use subset to choose a particular category. For example, what if we only wanted to see diamonds that were of "Fair" cut?

subset(diamonds, cut == "Fair")

Notice that the category name should go in quotes. Also notice the double equals sign here. A single equals sign won't work!
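The match is also case-sensitive, so the text in quotes has to match the category exactly. Here is a small sketch with a hypothetical data frame named gems (the cut column is plain text here for simplicity):

```r
# Hypothetical toy data standing in for diamonds
gems = data.frame(
  cut   = c("Fair", "Good", "Fair", "Premium"),
  price = c(400, 520, 610, 880)
)

fair_gems  = subset(gems, cut == "Fair")  # matches the two "Fair" rows
wrong_case = subset(gems, cut == "fair")  # lowercase "fair" matches nothing

nrow(fair_gems)
nrow(wrong_case)
```

Getting zero rows back is often a sign that the spelling or capitalization of the category is off.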

Not this category

We might also choose to simply exclude one category. We can do that by using != ("not equal to") in the condition.

subset(diamonds, cut != "Fair")
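Because == and != are opposites, the two subsets should split the data into two non-overlapping pieces. The sketch below checks this on a hypothetical toy data frame named gems, assuming the cut column has no missing values:

```r
# Hypothetical toy data standing in for diamonds (no missing values)
gems = data.frame(
  cut   = c("Fair", "Good", "Fair", "Premium"),
  price = c(400, 520, 610, 880)
)

fair     = subset(gems, cut == "Fair")  # only the "Fair" rows
not_fair = subset(gems, cut != "Fair")  # every row except the "Fair" ones

# Together the two pieces account for every row in gems
nrow(fair) + nrow(not_fair) == nrow(gems)
```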

Practice

Subset the diabetes data to only include females. Watch that your upper and lower case matches the variable and category names exactly!

diabetes
subset(_______, gender __________)
subset(diabetes, gender == "female")

Multiple Commands

Using &

With the subset command, I can also combine multiple conditions at once using the & symbol. For example, I can request diamonds that have a depth greater than 61 and also have a "Fair" cut.

subset(diamonds, depth > 61 & cut == "Fair")
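With &, a row is kept only when both conditions are true for it. A minimal sketch on a hypothetical toy data frame named gems:

```r
# Hypothetical toy data standing in for diamonds
gems = data.frame(
  depth = c(60.5, 62.0, 63.1, 61.5),
  cut   = c("Fair", "Fair", "Good", "Fair")
)

# A row must satisfy BOTH conditions to be kept
both = subset(gems, depth > 61 & cut == "Fair")

both
```

The first row is dropped because its depth is too small, and the third row is dropped because its cut is not "Fair".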

Using | (Or)

Likewise, we can also use the | symbol (typically the key above your enter/return key and often needing "shift" to be held with it) to make an or statement.

For example, I might want diamonds that are either of "Premium" cut or of "Very Good" cut.

subset(diamonds, cut == "Premium" | cut == "Very Good")
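When the "or" is between several categories of the same variable, base R's %in% operator is a common shorthand. It is not covered in this tutorial, so treat the following as an optional sketch on a hypothetical toy data frame named gems:

```r
# Hypothetical toy data standing in for diamonds
gems = data.frame(
  cut = c("Fair", "Premium", "Very Good", "Ideal", "Premium")
)

# Two conditions joined with |
with_or = subset(gems, cut == "Premium" | cut == "Very Good")

# The same selection written with %in% and a vector of categories
with_in = subset(gems, cut %in% c("Premium", "Very Good"))

# Check whether both approaches select the same rows
identical(with_or, with_in)
```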

Multiple Inputs with Parentheses

These commands can also get more complicated if needed by using parentheses. Run the following code and note what it seems to be doing!

subset(diamonds, depth > 61 & (cut == "Premium" | cut == "Very Good"))
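The parentheses matter because R evaluates & before |, much like multiplication comes before addition. This sketch, on a hypothetical toy data frame named gems, compares the same conditions with and without parentheses:

```r
# Hypothetical toy data standing in for diamonds
gems = data.frame(
  depth = c(60.0, 62.0, 60.0, 62.0),
  cut   = c("Premium", "Premium", "Very Good", "Very Good")
)

# Deep diamonds that are either Premium or Very Good
grouped = subset(gems, depth > 61 & (cut == "Premium" | cut == "Very Good"))

# Without parentheses this reads as:
# (deep AND Premium) OR (any Very Good, regardless of depth)
ungrouped = subset(gems, depth > 61 & cut == "Premium" | cut == "Very Good")

nrow(grouped)
nrow(ungrouped)
```

The ungrouped version lets in the shallow "Very Good" diamond, so it returns more rows than the grouped version.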

Practice

Can you run a subset that includes individuals who are male with cholesterol over 220 OR female with cholesterol over 200?

subset(_______, (_________) | (_________))
subset(_______, (gender == "male" & chol > 220) | (_________))
subset(diabetes, (gender == "male" & chol > 220) | (gender == "female" & chol > 200))

Assigning to a name

Often when subsetting, the goal isn't just to see the results once, but to save them and use them inside another function, like a plot, test, or regression model.

We can assign a subset to a name as follows:

diamonds_deep = subset(diamonds, depth > 61 & cut == "Fair")

Quick View

You can then view the saved subset by running its name by itself.

diamonds_deep = subset(diamonds, depth > 61 & cut == "Fair")
diamonds_deep
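Once a subset has a name, it can be passed to other functions like any other data frame. As a small sketch, here a hypothetical toy data frame named gems is subset, saved as gems_deep, and then summarized:

```r
# Hypothetical toy data standing in for diamonds
gems = data.frame(
  depth = c(60.5, 62.0, 63.1, 61.5),
  cut   = c("Fair", "Fair", "Good", "Fair"),
  price = c(400, 520, 610, 880)
)

# Save the subset under a name
gems_deep = subset(gems, depth > 61 & cut == "Fair")

# Reuse the saved subset in other functions
nrow(gems_deep)        # how many rows were kept
mean(gems_deep$price)  # average price within the subset only
```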

Using a Subset inside a plot

As an example, let's make a plot of diamonds_deep. Notice that to plot only the subset, we need to put diamonds_deep in the data argument (instead of diamonds).

Feel free to also change the data argument back to just diamonds for comparison of plots!

ggplot(data = diamonds_deep, aes(x = depth, y = price)) +
  geom_point(color = "blue", alpha = 0.2)

Go ahead and subset the diabetes data to only include women with a cholesterol level above 200.

Save that subset under the name highfem.

Then produce a scatterplot that includes only this data. The x axis should be the cholesterol levels, and the y axis should be age.

This exercise won't check, so all plot customization is optional!

highfem = subset(_______, __________ & ________)

ggplot(___________) +
  geom_point()

highfem = subset(_______, gender == "female" & ________)

ggplot(data = _______, aes(x = _____, y = _____)) +
  geom_point()

highfem = subset(diabetes, gender == "female" & chol > 200)

ggplot(data = highfem, aes(x = chol, y = age)) +
  geom_point()

Acknowledgment

This tutorial was created by Brandom Pazmino (UIUC '13) and Kelly Findley. We hope this experience was helpful for you!