Data wrangling involves basic manipulation of data, either to prepare it for analysis or as part of exploratory data analysis. This may involve subsetting, removing, or even adding/creating new variables based on some function of the variables we have.
In this tutorial, we will use the subset function in R for now. It's relatively easy to use for simpler tasks.
We will be looking at the diamonds data set, which comes with the ggplot2 package in R.
Now let's take a look at the subset command in R.
Let's say that we only want to look at diamonds with a depth greater than 60. We can call subset, name the data frame we are working with, and then state a condition on the variable of interest, as follows.
subset(diamonds, depth > 60)
You'll notice that the output is not just that variable, but the whole data frame, restricted to the rows where depth meets that criterion.
That means that the subset function is simply filtering out rows where the condition we defined is not met.
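To see this row filtering on something small enough to read in full, here is a minimal sketch. The data frame toy is made up for illustration (a hypothetical stand-in for diamonds, not the real data).

```r
# 'toy' is a hypothetical, hand-made stand-in for diamonds
toy <- data.frame(
  depth = c(59.8, 60.0, 61.2, 62.5),
  cut   = c("Fair", "Ideal", "Fair", "Premium"),
  price = c(326, 340, 420, 515)
)

# Keep only the rows where depth is greater than 60
deep <- subset(toy, depth > 60)

nrow(toy)    # 4 rows before filtering
nrow(deep)   # 2 rows pass the condition
names(deep)  # all three columns are still present
```

Only the number of rows changes; every column of the original data frame comes along.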
One small (but important) thing to notice: we don't write data = diamonds, but instead just put diamonds in the first argument. The first argument of subset is named x rather than data, which is different from many of the plotting and modeling functions we use in R, so remember that small difference!
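As a small illustration of that difference, the sketch below uses a made-up data frame (toy is hypothetical). It passes the data frame by position and by the argument name x, and then shows that data = leads to an error.

```r
# 'toy' is a hypothetical example data frame
toy <- data.frame(depth = c(59.8, 61.2, 62.5))

# By position (the usual way) and by the actual argument name, x
by_position <- subset(toy, depth > 60)
by_name     <- subset(x = toy, depth > 60)

# Using data = fails, because the data frame is no longer matched to the
# first argument (x); tryCatch captures the error so the script keeps running
with_data <- tryCatch(
  subset(data = toy, depth > 60),
  error = function(e) "error"
)
with_data
```

The first two calls return the same rows, while the third one never gets as far as filtering.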
Also keep in mind that if we want to include the boundary value itself, we can just add an = sign after the inequality sign (giving >= or <=).
Feel free to play around with different options if you'd like!
subset(diamonds, depth <= 60)
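The effect of adding the = sign is easiest to see when one row sits exactly on the boundary. A minimal sketch, assuming a made-up data frame toy with one diamond at a depth of exactly 60:

```r
# 'toy' is a hypothetical example; the middle row sits exactly at 60
toy <- data.frame(depth = c(59.8, 60.0, 61.2))

strict    <- subset(toy, depth > 60)   # boundary value excluded
inclusive <- subset(toy, depth >= 60)  # boundary value included

nrow(strict)     # 1
nrow(inclusive)  # 2
```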
Go ahead and use the diabetes dataset from the faraway package to try subsetting!
Subset your data to only include individuals with a cholesterol level above 200.
subset(diabetes, chol > 200)
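One behavior worth knowing when working with real data: if the variable in the condition has a missing value for some row, subset leaves that row out. A minimal sketch with a made-up data frame (toy_patients is hypothetical, not the real diabetes data):

```r
# 'toy_patients' is a hypothetical example with one missing cholesterol value
toy_patients <- data.frame(
  id   = 1:3,
  chol = c(180, NA, 230)
)

high_chol <- subset(toy_patients, chol > 200)
high_chol  # only id 3; the row with a missing chol is dropped
```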
We can also use subset to choose a particular category. For example, what if we only wanted to see diamonds that were of "Fair" cut?
subset(diamonds, cut == "Fair")
Notice that the category name should go in quotes. Also notice the double equals sign here. A single equals sign won't work!
We might also choose to simply exclude one category. We can do that with != ("not equal to") in the condition.
subset(diamonds, cut != "Fair")
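To make the relationship between == and != concrete, here is a minimal sketch on a made-up data frame (toy is hypothetical). The two conditions split the rows into two groups that do not overlap.

```r
# 'toy' is a hypothetical, hand-made stand-in for diamonds
toy <- data.frame(
  cut   = c("Fair", "Ideal", "Fair", "Premium"),
  price = c(326, 340, 420, 515)
)

# The condition itself is just a vector of TRUE/FALSE values, one per row
toy$cut == "Fair"

fair     <- subset(toy, cut == "Fair")
not_fair <- subset(toy, cut != "Fair")

# Every row lands in exactly one of the two subsets
nrow(fair) + nrow(not_fair) == nrow(toy)
```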
Return to the diabetes data and filter it to only include female individuals.
Watch that your upper and lower case matches the variable and category names exactly!
subset(_______, gender __________)
subset(diabetes, gender == "female")
With the subset command, we can also combine multiple conditions at once using the & symbol. For example, we can request diamonds that have a depth greater than 61 and a "Fair" cut.
subset(diamonds, depth > 61 & cut == "Fair")
Likewise, we can also use the | symbol (typically the key above your enter/return key, usually typed while holding "shift") to make an "or" statement.
For example, I might want diamonds that are either of "Premium" cut or of "Very Good" cut.
subset(diamonds, cut == "Premium" | cut == "Very Good")
These commands can also get more complicated if needed by using parentheses. Run the following code and note what it seems to be doing!
subset(diamonds, depth > 61 & (cut == "Premium" | cut == "Very Good"))
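The parentheses matter because R evaluates & before |. A minimal sketch on a made-up data frame (toy is hypothetical) compares the same condition with and without them:

```r
# 'toy' is a hypothetical, hand-made stand-in for diamonds
toy <- data.frame(
  depth = c(60.5, 62.0, 60.2, 62.3, 63.0),
  cut   = c("Premium", "Premium", "Very Good", "Very Good", "Fair")
)

# With parentheses: depth > 61 AND (Premium OR Very Good)
with_parens <- subset(toy, depth > 61 & (cut == "Premium" | cut == "Very Good"))

# Without parentheses, & is evaluated before |, so this reads as
# (depth > 61 AND Premium) OR Very Good
without_parens <- subset(toy, depth > 61 & cut == "Premium" | cut == "Very Good")

nrow(with_parens)     # 2
nrow(without_parens)  # 3: the "Very Good" diamond with depth 60.2 is kept too
```

Writing the parentheses explicitly is a cheap way to make sure the condition means what you intended.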
Can you run a subset that includes individuals who are either male with cholesterol over 220 OR female with cholesterol over 200?
subset(_______, (_________) | (_________))
subset(_______, (gender == "male" & chol > 220) | (_________))
subset(diabetes, (gender == "male" & chol > 220) | (gender == "female" & chol > 200))
Often when subsetting, the goal isn't just to see the results once, but to save them and use them inside another function, like a plot, test, or regression model.
We can assign a subset to a name as follows:
diamonds_deep = subset(diamonds, depth > 61 & cut == "Fair")
You can then view it by running the name by itself.
diamonds_deep = subset(diamonds, depth > 61 & cut == "Fair")
diamonds_deep
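A saved subset behaves like any other data frame, so it can be handed to other functions. A minimal sketch, assuming a made-up data frame toy (hypothetical, not the real diamonds data):

```r
# 'toy' is a hypothetical, hand-made stand-in for diamonds
toy <- data.frame(
  depth = c(60.5, 62.0, 64.1, 62.3),
  cut   = c("Fair", "Fair", "Fair", "Ideal"),
  price = c(300, 400, 500, 600)
)

# Save the subset under its own name
toy_deep <- subset(toy, depth > 61 & cut == "Fair")

# The saved subset is an ordinary data frame, so other functions accept it
nrow(toy_deep)        # 2
mean(toy_deep$price)  # 450

# The original data frame is unchanged
nrow(toy)             # 4
```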
As an example, let's make a plot of diamonds_deep. Notice that to use this in a plot, we need to actually type this name into the data argument (instead of just diamonds).
Feel free to also change the data argument back to just diamonds for comparison of plots!
ggplot(data = diamonds_deep, aes(x = depth, y = price)) + geom_point(color = "blue", alpha = 0.2)
Go ahead and subset the diabetes data to only include women with a cholesterol above 200.
Save that subset under the name highfem.
Then produce a scatterplot that includes only this data. The x axis should be the cholesterol levels, and the y axis should be age.
This exercise won't check, so all plot customization is optional!
highfem = subset(_______, __________ & ________)
ggplot(___________) + geom_point()
highfem = subset(_______, gender == "female" & ________)
ggplot(data = _______, aes(x = _____, y = _____)) + geom_point()
highfem = subset(diabetes, gender == "female" & chol > 200)
ggplot(data = highfem, aes(x = chol, y = age)) + geom_point()
This tutorial was created by Brandom Pazmino (UIUC '13) and Kelly Findley. We hope this experience was helpful for you!