In this tutorial, we'll focus more on representing multiple variables together in one plot. In particular, we'll talk about:
We're going to investigate the penguins data stored in the palmerpenguins package for several of these plots.
library(ggplot2)
library(palmerpenguins)
penguins
This data represents penguins from the Palmer Archipelago. Penguins were all identified from one of three species (and they are all adorable):
One variable we can compare these penguins against is their flipper length. Let's do that with side by side boxplots.
ggplot(data = penguins, aes(x = species, y = flipper_length_mm, fill = species)) +
geom_boxplot() +
stat_boxplot(geom = "errorbar") +
labs(x = "Species", y = "Flipper Length (mm)", title = "Flipper Length of Different Species")
As a reminder, boxplots are good for quick comparisons of groups using summary values (5-number summary). But there are other options if we wish to see more of the distribution.
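To make the 5-number summary concrete, here is a minimal sketch that computes it for each species directly. It assumes the palmerpenguins package is installed, and the name five_number is just my illustrative choice. Keep in mind that the whiskers ggplot2 draws can stop short of the minimum or maximum when there are outliers, so the plot may not match these values exactly.

```r
library(palmerpenguins)

# Split flipper lengths by species, then compute the five-number summary
# (min, Q1, median, Q3, max) for each group. na.rm = TRUE skips missing values.
five_number <- sapply(
  split(penguins$flipper_length_mm, penguins$species),
  quantile,
  probs = c(0, 0.25, 0.5, 0.75, 1),
  na.rm = TRUE
)

# One column per species, one row per summary value.
five_number
```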
Overlapping density curves are a fun alternative to represent distributions all in the same plane.
You might remember the geom_density option. We will again assign a numeric variable to the x axis. As with a histogram, the y axis shows how concentrated the observations are in each region of the x axis.
ggplot(data = penguins, aes(x = flipper_length_mm)) +
geom_density(fill = "orchid")
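If you're curious what the y axis of a density curve means, here is a rough sketch using base R's density() function. It is similar in spirit to what geom_density draws, though I'm not claiming it matches ggplot2's settings exactly. The idea is that the heights are scaled so the total area under the curve is about 1. The names flipper, dens, grid_step, and area are just illustrative.

```r
library(palmerpenguins)

# Drop missing values, since density() does not accept NAs by default.
flipper <- penguins$flipper_length_mm[!is.na(penguins$flipper_length_mm)]

# Estimate the density on an evenly spaced grid of x values.
dens <- density(flipper)

# Approximate the area under the curve: curve height times grid spacing, summed.
grid_step <- dens$x[2] - dens$x[1]
area <- sum(dens$y) * grid_step

# This should be very close to 1.
area
```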
But now, let's take advantage of the fill argument as a representation for another variable. Let's assign the species as a fill color so we can compare the flipper length distributions of each species.
ggplot(data = penguins, aes(x = flipper_length_mm, fill = species)) +
geom_density()
You'll notice that when we add overlap, it's difficult to see the whole story. We should add some transparency to this graph using alpha. Remember that alpha set to 0 is fully transparent, and alpha set to 1 is fully opaque. I plugged in 0.4, but experiment with different values!
ggplot(data = penguins, aes(x = flipper_length_mm, fill = species)) +
geom_density(alpha = 0.4)
The flipper length distributions are fairly symmetric, but density curves might reveal interesting distributional patterns that are difficult to see with boxplots. For example, consider how the time until a patient breathes unassisted might vary based on which anaesthetic (A, B, C, or D) they were given.
Here, the skewness and nuances of these distributions might not be as clear when using a boxplot.
ggplot(data = anaesthetic, aes(x = breath, fill = tgrp)) +
geom_density(alpha = 0.3)
Jitter plots are also appropriate for comparing a grouping variable with a numeric variable, but they will actually plot the individual data, rather than just summary values or distributional shape.
Jitter plots are often a good choice when plotting all of the data is reasonable. Boxplots and overlapping densities are often better representations for larger datasets.
There's an optional tutorial at the bottom of the R Tutorials Page that covers how to add multiple geometries in the same plot.
Let's visualize the same example, but now we will use the geom_point() representation. You'll notice this plot is not as visually clear as it could be.
ggplot(data = penguins, aes(x = species, y = flipper_length_mm)) +
geom_point() +
labs(x = "Species", y = "Flipper Length (mm)", title = "Flipper Length of Different Species")
You might have noticed the warning message that two rows were removed for missing values. That just means we were missing data for two penguins for at least one of the variables we asked for.
That's ok and expected with many datasets we use that have some missing entries!
Keep in mind that Warning messages are just informational. Error messages are what signal that something is broken and likely needs to be fixed.
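If you'd like to confirm where that warning comes from, here is a small sketch that counts the missing flipper lengths. It assumes the palmerpenguins package is loaded, and penguins_complete is a hypothetical name I made up for the filtered copy.

```r
library(palmerpenguins)

# Count the penguins with no recorded flipper length.
n_missing <- sum(is.na(penguins$flipper_length_mm))
n_missing

# Optional: keep only the rows with a recorded flipper length.
# Plotting penguins_complete instead of penguins avoids the warning.
penguins_complete <- penguins[!is.na(penguins$flipper_length_mm), ]
nrow(penguins_complete)
```

Filtering is optional here, since ggplot2 already drops those rows for you when it draws the plot.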
The problem is that many points are overlapping, making it harder to see density.
Let's try using geom_jitter in place of geom_point.
ggplot(data = penguins, aes(x = species, y = flipper_length_mm)) +
geom_jitter() +
labs(x = "Species", y = "Flipper Length (mm)", title = "Flipper Length of Different Species")
This is probably too much jittering! Let's try limiting how wide the values jitter by adding a width argument. I would choose something around 0.05, but you can adjust to whatever looks good!
ggplot(data = penguins, aes(x = species, y = flipper_length_mm)) +
geom_jitter(width = 0.05) +
labs(x = "Species", y = "Flipper Length (mm)", title = "Flipper Length of Different Species")
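One thing worth knowing: jittering is random, so the points land in slightly different spots each time you run the code, and by default geom_jitter also nudges points a little bit vertically. Here is a sketch of one way to control both, using position_jitter() inside geom_point(). The seed value 212 is arbitrary, and p_jitter is just an illustrative name.

```r
library(ggplot2)
library(palmerpenguins)

# position_jitter() controls the random shift applied to each point:
#   width  = maximum horizontal shift (in x-axis units)
#   height = maximum vertical shift; 0 keeps every flipper length exact
#   seed   = makes the random shifts repeatable from one run to the next
p_jitter <- ggplot(data = penguins, aes(x = species, y = flipper_length_mm)) +
  geom_point(position = position_jitter(width = 0.05, height = 0, seed = 212)) +
  labs(x = "Species", y = "Flipper Length (mm)", title = "Flipper Length of Different Species")

p_jitter
```

Setting height = 0 is a choice, not a requirement. It just guarantees that the vertical position of each point is the measured value.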
Just as with the other representations, we could differentiate by color as well! But unlike boxplots and violin plots, the argument won't be fill, but instead it will be color!
ggplot(data = penguins, aes(x = species, y = flipper_length_mm, color = species)) +
geom_jitter(width = 0.05) +
labs(x = "Species", y = "Flipper Length (mm)", title = "Flipper Length of Different Species")
The stat_summary function can help you plot a summary measure, like a mean bar, for more information!
This setup is a little complicated, and you won't need it in your lab, but I wanted to show you in case you'd like to try it on your own.
Add a stat_summary line to your plot, and then identify mean as your function. The color, width, and linewidth can all easily be adjusted, but the first few arguments would need to stay as is.
ggplot(data = penguins, aes(x = species, y = flipper_length_mm, color = species)) +
geom_jitter(width = 0.05) +
stat_summary(fun = mean, fun.min = mean, fun.max = mean, geom = "errorbar", color = "black", width = 0.2, linewidth = 1.5) +
labs(x = "Species", y = "Flipper Length (mm)", title = "Flipper Length of Different Species")
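If you want to double check where those mean bars should sit, here is a quick sketch that computes the mean flipper length for each species. The name species_means is just illustrative.

```r
library(palmerpenguins)

# Mean flipper length for each species.
# The formula interface drops rows with missing values by default.
species_means <- aggregate(flipper_length_mm ~ species, data = penguins, FUN = mean)
species_means
```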
A barplot for just one variable can count up how many observations we have for each group in a categorical variable.
How many penguins do we have of each species?
ggplot(data = penguins, aes(x = species)) +
geom_bar() +
labs(x = "Species", title = "Counts for each Species")
And if we want to add some fill color to each bar, we can also fill by species! And then I'll also add a black border color to make it look cleaner.
ggplot(data = penguins, aes(x = species, fill = species)) +
geom_bar(color = "black") +
labs(x = "Species", title = "Counts for each Species")
Another variable collected was the island on which the penguin was observed. Were the penguins of each species evenly distributed across islands, or did each species show up more often on a certain island?
Let's try creating a different bar for each island and then seeing that island's composition by species. We can do that by assigning species as a fill color.
ggplot(data = penguins, aes(x = island, fill = species)) +
geom_bar(color = "black") +
labs(x = "Island", fill = "Species", title = "Species Representation by Island")
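The stacked bars are built from counts of penguins in each island and species combination. As a sketch, here is how you could see those same counts as a table in base R. The name island_by_species is just illustrative.

```r
library(palmerpenguins)

# Rows are islands, columns are species; each cell is a count of penguins.
island_by_species <- table(penguins$island, penguins$species)

# addmargins() appends row and column totals (labeled "Sum").
addmargins(island_by_species)
```

The row totals correspond to the full height of each island's bar, and the cells in a row correspond to the colored segments within it.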
Let's look at another example: is there a connection between children who have x-rays and the development of acute myeloid leukemia?
amlxray records medical information for 238 children
The disease column reports presence of leukemia
The Cray column reports whether this child had previously had an x-ray
amlxray
Let's first make a simple plot just seeing how many children in this sample have had an x-ray.
ggplot(data = ____________, aes(_______________)) +
geom_bar(___________) +
labs(x = "Had X-ray", title = "How many children had X-rays")
ggplot(data = amlxray, aes(x = ___________, fill = _________)) +
geom_bar(color = ___________) +
labs(x = "Had X-ray", title = "How many children had X-rays")
ggplot(data = amlxray, aes(x = Cray, fill = Cray)) +
geom_bar(color = "black") +
labs(x = "Had X-ray", title = "How many children had X-rays")
Let's try creating a stacked bar plot to see if the proportion of children with leukemia in each group might be different. Let's keep Cray on the x axis, but now try adding disease as a fill color.
Since the labs function is quite long, I have created new lines for each argument to make it easier to read.
ggplot(data = amlxray, aes(_______________)) +
geom_bar(color = "black") +
labs(x = "Had X-ray",
fill = "Leukemia Diagnosis",
title = "Relationship between X-rays and Leukemia Diagnosis")
ggplot(data = amlxray, aes(x = ___________, fill = _________)) +
geom_bar(color = "black") +
labs(x = "Had X-ray",
fill = "Leukemia Diagnosis",
title = "Relationship between X-rays and Leukemia Diagnosis")
ggplot(data = amlxray, aes(x = Cray, fill = disease)) +
geom_bar(color = "black") +
labs(x = "Had X-ray",
fill = "Leukemia Diagnosis",
title = "Relationship between X-rays and Leukemia Diagnosis")
It would be much easier to compare these proportions if we instead scaled each bar to be the same height, rather than preserving the counts.
Add position = "fill" into the geom line to make this 100% stacked!
Also note that since the y-axis is no longer a count, we might choose to label this now as a proportion.
ggplot(data = amlxray, aes(x = Cray, fill = disease)) +
geom_bar(color = "black", position = "fill") +
labs(x = "Had X-ray",
y = "Proportion with Leukemia Diagnosis",
fill = "Leukemia Diagnosis",
title = "Relationship between X-rays and Leukemia Diagnosis")
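To see what the 100% stacked version is doing behind the scenes, here is a small sketch with a made-up data frame. Everything in it (xray_toy, had_xray, diagnosis, and the values) is hypothetical and is not the real amlxray data. Within each x-axis group, the counts are divided by that group's total so each bar adds up to 1.

```r
# A tiny made-up data set: 4 children without an x-ray, 4 with one.
xray_toy <- data.frame(
  had_xray  = c("no", "no", "no", "no", "yes", "yes", "yes", "yes"),
  diagnosis = c("no", "no", "no", "yes", "no", "yes", "yes", "yes")
)

# Counts for each combination (rows = had_xray, columns = diagnosis).
counts <- table(xray_toy$had_xray, xray_toy$diagnosis)
counts

# margin = 1 divides each row by its own total, so every row sums to 1.
props <- prop.table(counts, margin = 1)
props
```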
Sometimes with only 2 bars, it looks nicer to narrow the width. We can do that with a width argument inside the geom_bar line.
You can change the width of boxplots using the same argument by the way!
ggplot(data = amlxray, aes(x = Cray, fill = disease)) +
geom_bar(color = "black", position = "fill", width = 0.5) +
labs(x = "Had X-ray",
y = "Proportion with Leukemia Diagnosis",
fill = "Leukemia Diagnosis",
title = "Relationship between X-rays and Leukemia Diagnosis")
It does seem like the children with x-rays have a higher incidence of leukemia! Whether the x-ray is causing the leukemia is a deeper question we can't answer with this alone! We might want to stratify this relationship by some other possible confounders.
If we want to more directly compare within a bar, we can break up the bars into a cluster. Let's try that by now doing position = "dodge" rather than position = "fill".
ggplot(data = amlxray, aes(x = Cray, fill = disease)) +
geom_bar(position = "dodge", color = "black") +
labs(x = "Had X-ray",
fill = "Leukemia Diagnosis",
title = "Relationship between X-rays and Leukemia Diagnosis")
This provides another way to see the same data! Comparing the height of the bars in each cluster relative to one another shows how the proportions vary a bit between children who had an x-ray and those who did not.
Scatterplots are helpful for looking at the potential relationship between two numeric variables.
Let's take a look at the prostate data from the package faraway. This data records information about 97 men who have been diagnosed with prostate cancer.
library(faraway)
prostate
Is there a relationship between the prostate weight (lweight) and the volume of cancer (lcavol) in this group?
We'll use the geom_point option again...but this time with two numeric variables, we'll get points scattered across the plane instead of column groupings!
ggplot(data = prostate, aes(x = lweight, y = lcavol)) +
geom_point() +
labs(x = "Prostate Weight",
y = "Cancer Volume")
The plot makes sense. Men with heavier prostates tend to have more cancer volume on average, but the relationship is not very strong (I can't predict lcavol with much accuracy knowing only lweight).
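One way to put a number on "not very strong" is the correlation between the two variables. This is just a sketch, and it assumes the faraway package is installed. As I understand the faraway documentation, both variables are recorded on a log scale, so this measures the linear relationship between the logged values.

```r
library(faraway)

# Pearson correlation between the two plotted variables.
# Values near 0 suggest a weak linear relationship;
# values near 1 or -1 suggest a strong one.
r <- cor(prostate$lweight, prostate$lcavol)
r
```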
If we don't want to go with the generic black, we can also add a singular color option in geom_point(). Notice again that with points, it needs to be color = rather than fill =. It's easy to mix that up, so try to remember that with points!
ggplot(data = prostate, aes(x = lweight, y = lcavol)) +
geom_point(color = "dodgerblue") +
labs(x = "Prostate Weight",
y = "Cancer Volume")
You can also change the level of transparency within your scatterplot using the alpha argument.
ggplot(data = prostate, aes(x = lweight, y = lcavol)) +
geom_point(alpha = 0.5, color = "dodgerblue") +
labs(x = "Prostate Weight",
y = "Cancer Volume")
As with other plots, we can also use color as a way to represent another variable.
gleason represents the "grade" of cancer, where the higher the gleason score, the more serious the cancer. Typically 8-10 is considered serious.
ggplot(data = prostate, aes(x = lweight, y = lcavol, color = gleason)) +
geom_point() +
labs(x = "Prostate Weight",
y = "Cancer Volume")
Since gleason is coded as a numeric variable, it is colored on a continuous scale, which is not the easiest to compare. The following code will take these values and now view them as ordered categories.
Note that you won't see any output from this--it's simply completing some data structuring behind the scenes!
prostate$gleason = as.factor(prostate$gleason)
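If you'd like to see what as.factor actually changes, here is a small sketch with a made-up vector of scores. The names gleason_toy and gleason_cat and the values are hypothetical, not the real prostate data.

```r
# A made-up set of scores stored as numbers.
gleason_toy <- c(6, 7, 7, 9, 6, 8)
is.numeric(gleason_toy)

# as.factor() converts the numbers into categories.
gleason_cat <- as.factor(gleason_toy)
is.factor(gleason_cat)

# The categories (levels) are the distinct values, sorted from lowest to highest.
levels(gleason_cat)
```

If you would rather not change the data frame at all, writing color = factor(gleason) inside aes() is another option that treats the variable as categorical for that one plot.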
Now, let's try that plot again, but with gleason working as a categorical variable!
prostate$gleason = as.factor(prostate$gleason)
ggplot(data = prostate, aes(x = lweight, y = lcavol, color = gleason)) +
geom_point() +
labs(x = "Prostate Weight",
y = "Cancer Volume")
While the prostate weight does not seem very clearly associated with gleason score, it does seem as if higher cancer volume might be more linked to higher gleason scores.
We can make this relationship more obvious by just focusing on gleason and lcavol.
prostate$gleason = as.factor(prostate$gleason)
ggplot(data = prostate, aes(x = gleason, y = lcavol, color = gleason)) +
geom_jitter(width = 0.05) +
labs(x = "Gleason",
y = "Cancer Volume")
By default, R will order non-numeric variables in alphabetical order.
For example, the species variable in the penguins dataframe from earlier lists the penguin species in alphabetical order.
ggplot(data = penguins, aes(x = island, fill = species)) +
geom_bar(color = "black") +
labs(x = "Island", fill = "Species", title = "Species Representation by Island")
But perhaps we want to have these islands listed in a different order--perhaps by geography or perhaps by counts from fewest to most penguins.
We can use the factor function to redefine its structure. This function takes two arguments:
the variable you want to restructure
levels, which will be your custom ordering of values
factor(penguins$island, levels = c("Torgersen", "Dream", "Biscoe"))
To apply it, we just need to assign this factor ordering back to the original variable in the data frame. Notice below how this factor structuring is set equal to the variable island through the data frame penguins.
Notice that this code doesn't output anything. It is an internal restructuring.
penguins$island = factor(penguins$island, levels = c("Torgersen", "Dream", "Biscoe"))
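Since the reordering doesn't print anything, here is a sketch of how you could check that it worked. It uses separate copies (island_original, island_reordered, and count_order are names I made up) so nothing in the data frame is changed.

```r
library(palmerpenguins)

# Work on a copy so the original column is left untouched.
island_original <- penguins$island
levels(island_original)

# Apply the custom ordering and look at the levels again.
island_reordered <- factor(island_original, levels = c("Torgersen", "Dream", "Biscoe"))
levels(island_reordered)

# The label attached to each penguin is unchanged; only the ordering differs.
all(as.character(island_original) == as.character(island_reordered))

# One way to build a fewest-to-most ordering from the data itself.
count_order <- names(sort(table(island_original)))
count_order
```

The last step sorts the island counts from smallest to largest and pulls out the names, which you could then pass to the levels argument instead of typing the order by hand.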
If we now try the plot, you'll see how the levels have been restructured!
penguins$island = factor(penguins$island, levels = c("Torgersen", "Dream", "Biscoe"))
ggplot(data = penguins, aes(x = island, fill = species)) +
geom_bar(color = "black") +
labs(x = "Island", fill = "Species", title = "Species Representation by Island")
Consider this data frame with 28 high school students.
Class
I want to make a plot to compare them based on their class level.
ggplot(data = Class, aes(x = acad_level, y = height, color = acad_level)) +
geom_jitter(width = 0.08) +
labs(title = "Class Heights by Year",
x = "Academic Level",
y = "Height (in)",
color = "Academic Level")
But notice that alphabetically, my order of academic level is Freshman, Junior, Senior, Sophomore.
I'd like to get the order to be chronological: Freshman, Sophomore, Junior, Senior!
Think of my coding template as follows...what would I fill in at each blank?
_______$_______ = factor(______$______, levels = c(_______, _______, _______, _______))
Class$acad_level = factor(Class$acad_level, levels = c("Freshman", "Sophomore", "Junior", "Senior"))
ggplot(data = Class, aes(x = acad_level, y = height, color = acad_level)) +
geom_jitter(width = 0.08) +
labs(title = "Class Heights by Year",
x = "Academic Level",
y = "Height (in)",
color = "Academic Level")
Be super careful with this code, as you'll need every category name to be spelled exactly as it appears in the data, and R is CaSe SenSiTiVe. Also make sure your data frame name matches what you have in your global environment!
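Here is a small sketch of what goes wrong when a level name doesn't match. The vector acad_toy and its values are made up for illustration. Any value that isn't found in levels quietly becomes NA, with no error or warning.

```r
# A made-up vector of class levels.
acad_toy <- c("Freshman", "Sophomore", "Junior", "Senior", "Junior")

# "freshman" is lowercase here, so it does not match "Freshman" in the data.
bad_levels <- factor(acad_toy, levels = c("freshman", "Sophomore", "Junior", "Senior"))
sum(is.na(bad_levels))

# With exact matches, nothing is lost.
good_levels <- factor(acad_toy, levels = c("Freshman", "Sophomore", "Junior", "Senior"))
sum(is.na(good_levels))

# A quick check before reordering: which values in the data are missing from my levels?
setdiff(unique(acad_toy), c("freshman", "Sophomore", "Junior", "Senior"))
```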
This tutorial was created by Kelly Findley, with assistance by Brandon Pazmino (UIUC '21). We hope this experience was helpful for you!
If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/