Sampling in R
Goals
In this tutorial, we will talk about:
- Sampling (with and without replacement) from a vector or data frame.
- Creating a “for” loop to simulate a process many times
- Iteratively saving results that we generated inside a “for” loop
- Summarizing results that we iteratively saved from a “for” loop
The sample function
The sample function lets us take a random sample of observations from a vector. There are 3 arguments you should know about when using it:
- x: the vector that you wish to sample from
- size: how many observations you wish to sample from x
- replace: whether you wish to sample with or without replacement
An example
Consider the following vector of length 13, representing people’s weights in lbs
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)
Sample weights
The sample function below is set to:
- Sample from the vector weights
- Take a sample of size 5
- Sample without replacement (no repeated values)
Try running it several times and observe the results!
sample(x = weights, size = 5, replace = FALSE)
Changing it up
Now take a minute and change some of the constraints. A few things to try:
- How large a sample size can you take when sampling without replacement?
- What changes when you set replace = TRUE instead?
sample(x = weights, size = 5, replace = FALSE)
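If you'd like to check your answers to the two questions above, here is a minimal sketch. It wraps the oversized request in tryCatch so that R reports the error message instead of stopping. The object names too_big and with_repl are made up for this illustration.

```r
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

# Without replacement, a sample can't be larger than the vector itself (13 here).
# tryCatch() captures the error message as text instead of halting the code.
too_big = tryCatch(
  sample(x = weights, size = 14, replace = FALSE),
  error = function(e) conditionMessage(e)
)
too_big

# With replacement, values can repeat, so larger sample sizes are allowed.
with_repl = sample(x = weights, size = 20, replace = TRUE)
length(with_repl)
```

Without replacement, the largest sample you can take is the length of the vector, which simply returns all 13 weights in a shuffled order.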
Sampling from a Data frame
If you wish to sample observations from one particular variable embedded inside a data frame, you’ll need to use the dollar sign ($) notation to call the variable through the data frame.
prostate$age
An example
Here, we have used the sample function again, but to tell R where to find the variable age, we need to call it through the data frame it’s embedded within.
library(faraway)
sample(x = prostate$age, size = 10, replace = FALSE)
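The same $ pattern works for any data frame, not just prostate. As a rough sketch, the code below builds a small stand-in data frame called patients (the name and all of its values are invented for illustration) so you can see that $ simply hands sample a plain vector.

```r
# Hypothetical data frame standing in for prostate (all values are made up)
patients = data.frame(
  age    = c(50, 58, 74, 58, 62, 50, 64, 58),
  weight = c(142, 157, 165, 180, 185, 187, 225, 131)
)

# patients$age pulls the age column out as an ordinary vector
patients$age

# ...which sample() can then draw from like any other vector
age_sample = sample(x = patients$age, size = 3, replace = FALSE)
age_sample
```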
How to save a new object
Just a reminder: If you wanted to save a particular sample as an object to your global environment, you can assign it to a name! If we wanted to save it under the object name sample_x, we could do that with the = sign (or if you want to code like an R guru, you can use the assignment operator <- which would do the exact same thing in this case).
sample_x = sample(x = prostate$age, size = 10, replace = FALSE)
…and call it back up
And notice that if we now call on sample_x after defining it, R will remember it.
sample_x
Taking Multiple Samples
Sampling Variability
In class, we’ve talked about how every time we take a sample, we should expect sampling variability. Each of the three calls below will likely return a different set of ages.
sample(x = prostate$age, size = 5)
sample(x = prostate$age, size = 5)
sample(x = prostate$age, size = 5)
Variability in Sample Means
For the same reason, every time we take a statistic from our sample data, we should recognize that a statistic will vary from sample to sample too!
Run this code a few times and observe how your sample means continue to vary!
sample_1 = sample(x = prostate$age, size = 5)
sample_2 = sample(x = prostate$age, size = 5)
sample_3 = sample(x = prostate$age, size = 5)
mean(sample_1)
mean(sample_2)
mean(sample_3)
We can try a Loop!
What we’d like to do is create a distribution to see how much our sample means (from samples of size 5) might reasonably vary.
We want R to run the same code many times and to store our results into one vector. So let’s talk about setting up a loop!
FYI: A loop is not the only way to accomplish this task in R, but it’s an intuitive way that will help you see the process.
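For the curious, one of those other ways is base R's replicate function, which repeats an expression a set number of times and collects the results. This is only a sketch for comparison: it samples from the weights vector defined earlier, and the name sample_means and the choice of 1000 repetitions are arbitrary.

```r
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

# replicate() evaluates the expression 1000 times and gathers the results
# into one vector; each run draws a fresh sample of size 5.
sample_means = replicate(1000, mean(sample(x = weights, size = 5)))

length(sample_means)
head(sample_means)
```

The rest of this tutorial sticks with for loops, since they make each step of the process visible.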
For Loop: Returning a value
How a for loop works
A for loop will repeat a sequence of code for as many times as you ask it to. To make one, we need to do 3 things:
- Define an object that we will use to store results in our loop
- Write the for line, where we define how many times this loop will repeat and assign a variable to be the iteration counter
- Write a sequence of code inside curly brackets {} to repeat until the number of loops has been completed.
Example
- We defined an object called x. To start, let’s just define x = 0.
- This loop will have 5 iterations, and we’re using the variable i to be the iteration counter.
- Our code inside is telling R to add 1 to the value of x for each loop.
x = 0
for (i in 1:5) {
x = x + 1
}
x
Predict the loop
I made a few tweaks. Predict what you think the result of this loop will be, then run it and see if you’re right!
x = 1
for (i in 1:3) {
x = x*2
}
x
What happened?
1*2 = 2
2*2 = 4
4*2 = 8
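If you want to watch those three steps happen, one option is to print the value of x inside the loop. The sketch below is the same loop with one added line that uses cat to report the iteration counter and the current value of x.

```r
x = 1
for (i in 1:3) {
  x = x * 2
  # cat() prints the counter and the current value of x on each pass
  cat("After iteration", i, "x is", x, "\n")
}
```

Printing inside a loop like this is a handy way to check your prediction whenever a loop's result surprises you.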
Check your understanding
growth = 4
for (z in 1:12) {
growth = growth + 2
}
Saving results iteratively
The examples above had our loops return one singular numeric answer.
Now, let’s create a loop that returns a vector. We will fill each entry of the vector iteratively as the loop repeats.
For loops: Returning a vector
Bracket notation
Before doing that, let’s talk about some notation we’ll need: brackets!
Consider my vector of weights with 13 elements. I can call up any element of that vector using brackets: [].
For example, weights[1] would be the first element: 96
What elements would the other two lines of code below output?
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)
weights[1]
weights[3]
weights[13]
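Brackets also work on the left side of an assignment, which lets us place a value into a specific spot of a vector. Here is a small sketch of that idea; the object name filled is made up for this illustration.

```r
filled = NULL      # an empty placeholder object
filled[1] = 10     # put 10 in the first spot
filled[2] = 20     # put 20 in the second spot
filled[3] = 30     # put 30 in the third spot
filled             # the whole vector
filled[2]          # just the second element
```

This is the mechanism we will use to save a result on each pass through a loop.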
An example
Let’s start by creating a vector called dice.
If I sample from this vector, I have an equal chance of getting any of the numbers from 1 to 6. I’m simulating the process of rolling a fair, 6-sided die!
dice = c(1,2,3,4,5,6)
sample(dice, size = 1)
Dice loop
Now let’s create results as an object to fill iteratively with dice roll results. I wrote results = NULL to simply create the object so that R recognizes it as an object that can be filled.
Lastly, let’s set up our loop to fill results by sampling from dice 10 times and storing the results. Notice that since i is the iteration counter, each loop iteration fills the ith spot of results!
Since we’re sampling dice rolls, we should sample WITH replacement from this vector since we should be able to get the same value on separate rolls!
dice = c(1,2,3,4,5,6)
results = NULL
for (i in 1:10) {
results[i] = sample(x = dice, size = 1, replace = TRUE)
}
results
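Starting from NULL works because R grows the vector by one spot on each pass. Another common pattern, shown here only as an optional variation, is to create a vector of the final length up front with numeric and then overwrite each spot. The name n_rolls is a hypothetical helper introduced for this sketch.

```r
dice = c(1, 2, 3, 4, 5, 6)
n_rolls = 10                    # hypothetical name for the number of iterations

results = numeric(n_rolls)      # a vector of 10 zeros, ready to be overwritten
for (i in 1:n_rolls) {
  results[i] = sample(x = dice, size = 1, replace = TRUE)
}
results
```

Both versions produce a vector of 10 rolls. For the small loops in this tutorial, either approach is fine.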
Storing Sample Means
That previous loop might seem unnecessary, because it is! We could accomplish the same thing with sample(dice, size = 10, replace = TRUE).
BUT, let’s go one step further and save the means of each sample of 5 dice rolls.
dice = c(1,2,3,4,5,6)
means = NULL
for (i in 1:10) {
means[i] = mean(sample(x = dice, size = 5, replace = TRUE))
}
means
Calculating Means
Now let’s scale it up! Instead of only repeating our process 10 times and storing the sample means, let’s repeat it 500 times and store the sample means. Then we’ll plot the results on a histogram.
dice = c(1,2,3,4,5,6)
means = NULL
for (i in 1:500) {
means[i] = mean(sample(x = dice, size = 5, replace = TRUE))
}
hist(means, breaks = 20)
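Besides plotting, we can summarize the stored means numerically. The sketch below reruns the same loop and then computes a few summaries. The set.seed line is an addition that is not part of the original code; it only makes the random draws repeat from run to run, and the seed value 212 is arbitrary.

```r
set.seed(212)  # arbitrary seed, only so the numbers repeat from run to run

dice = c(1, 2, 3, 4, 5, 6)
means = NULL
for (i in 1:500) {
  means[i] = mean(sample(x = dice, size = 5, replace = TRUE))
}

mean(means)    # center of the sample means; mean(dice) is 3.5 for comparison
sd(means)      # typical distance of a sample mean from that center
range(means)   # smallest and largest sample mean observed
```

The average of the sample means should land close to 3.5, the mean of a single fair die roll, even though individual sample means vary quite a bit.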
Check for Understanding
Sampling Age
Now let’s see if you can apply what you learned! Create a for loop that does the following:
- Samples from the diabetes dataset, and specifically from the age variable
- Samples 25 ages and calculates the mean of that sample of 25
- Repeats this process 200 times and saves the means in a vector called means_25
- Plots the sample means on a histogram (with breaks set to 20)
Hint: Note that we’re sampling from an existing data frame, so we don’t need to pre-define it first like we did with dice. But we do need to reference age through diabetes using $ notation!
Click continue to see a suggested answer!
_____ = NULL
for (i in ______) {
_______________________________________
}
____(________, breaks = 20)
Sample Solution
Hopefully you found some success on that one! If not though, the solution is below.
If you struggled with this task, consider watching this video to see Kelly walk through the code.
means_25 = NULL
for (i in 1:200) {
means_25[i] = mean(sample(x = diabetes$age, size = 25))
}
hist(means_25, breaks = 20)
Return Home
Good work! If you’re in STAT 212, be sure to watch the videos embedded in the tutorials (also linked on the Lab 1 assignment page) before completing Lab 1. But if you already watched them, then you’re ready to go!
If you’d like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/