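In this tutorial, we will talk about taking random samples with the sample function and using for loops to build a distribution of sample means.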
The sample function lets us take a random sample of observations from a vector. There are 3 arguments you should know about when using it:
x: the vector that you wish to sample from
size: how many observations you wish to sample from x
replace: whether you wish to sample with or without replacement
Consider the following vector of length 13, representing people's weights in lbs:
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)
The sample function below is set to draw 5 observations from weights without replacement.
Try running it several times and observe the results!
sample(x = weights, size = 5, replace = FALSE)
Now take a minute and change some of the arguments. A few things to try:
What happens if you use replace = TRUE instead?
sample(x = weights, size = 5, replace = FALSE)
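To make the effect of replace concrete, here is a minimal sketch. The call to set.seed is an addition that does not appear in the tutorial's own code; it only makes the random draws repeatable. The object names no_repeat, with_repeat, and too_many are made up for illustration.

```r
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

# set.seed() is an addition here: it makes the random draws repeatable
set.seed(212)

# Without replacement, each of the 13 positions can be picked at most once
no_repeat = sample(x = weights, size = 13, replace = FALSE)
anyDuplicated(no_repeat)  # 0 means no value appears twice

# With replacement, we can draw more observations than the vector holds
with_repeat = sample(x = weights, size = 20, replace = TRUE)
length(with_repeat)

# Asking for 20 without replacement is an error; try() lets the script continue
too_many = try(sample(x = weights, size = 20, replace = FALSE), silent = TRUE)
class(too_many)
```

So size can only exceed the length of x when replace = TRUE.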
If you wish to sample observations from one particular variable embedded inside a data frame, you'll need to use the dollar sign ($) notation to call the variable through the data frame.
prostate$age
Here, we have used the sample function again, but to tell R where to find the variable age, we need to call it through the data frame it's embedded within.
library(faraway)
sample(x = prostate$age, size = 10, replace = FALSE)
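If you don't have the faraway package handy, the same idea can be tried on any data frame. The sketch below uses a small made-up data frame called patients (the name and the values are hypothetical, not from the tutorial's data) to show that the $ notation hands back an ordinary vector that sample can use.

```r
# A small made-up data frame (hypothetical values, for illustration only)
patients = data.frame(
  age = c(50, 58, 74, 58, 62),
  weight = c(150, 162, 171, 180, 155)
)

# The $ pulls one column out of the data frame as an ordinary vector
patients$age
is.vector(patients$age)

# That vector can be passed straight to sample()
sample(x = patients$age, size = 3, replace = FALSE)
```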
Just a reminder: If you wanted to save a particular sample as an object to your global environment, you can assign it to a name! If we wanted to save it under the object name sample_x, we could do that with the = sign (or if you want to code like an R guru, you can use the assignment operator <-, which would do the exact same thing in this case).
sample_x = sample(x = prostate$age, size = 10, replace = FALSE)
And notice that if we now call on sample_x after defining it, R will remember it.
sample_x
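Here is a small sketch of the claim that = and <- do the same thing for this kind of assignment. It uses a made-up vector called ages (hypothetical values) and set.seed, which is an addition to the tutorial's code, so that both calls to sample make the same random draw.

```r
# Hypothetical ages, for illustration only
ages = c(50, 58, 74, 58, 62, 50, 64, 58)

# Resetting the seed before each call makes both draws identical
set.seed(1)
sample_a = sample(x = ages, size = 4, replace = FALSE)

set.seed(1)
sample_b <- sample(x = ages, size = 4, replace = FALSE)

# Both assignment styles stored the same kind of object
identical(sample_a, sample_b)
```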
In class, we've talked about how every time we take a sample, we should expect sampling variability.
sample(x = prostate$age, size = 5)
sample(x = prostate$age, size = 5)
sample(x = prostate$age, size = 5)
For the same reason, every time we take a statistic from our sample data, we should recognize that a statistic will vary from sample to sample too!
Run this code a few times and observe how your sample means continue to vary!
sample_1 = sample(x = prostate$age, size = 5)
sample_2 = sample(x = prostate$age, size = 5)
sample_3 = sample(x = prostate$age, size = 5)
mean(sample_1)
mean(sample_2)
mean(sample_3)
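One way to see this variability is to compare each sample mean with the mean of the full vector. The sketch below uses the weights vector from earlier instead of prostate$age so that it runs without any package; the name sample_means is made up for illustration.

```r
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

# The mean of all 13 observations never changes
mean(weights)

# Three separate samples of size 5, each reduced to its mean
sample_means = c(
  mean(sample(x = weights, size = 5)),
  mean(sample(x = weights, size = 5)),
  mean(sample(x = weights, size = 5))
)
sample_means

# How far each sample mean landed from the mean of the full vector
sample_means - mean(weights)
```

Each run will give different differences, some positive and some negative.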
What we'd like to do is create a distribution to see how much our sample means (from samples of size 5) might reasonably vary.
We want R to run the same code many times and to store our results into one vector. So let's talk about setting up a loop!
FYI: A loop is not the only way to accomplish this task in R, but it's an intuitive way that will help you see the process.
A for loop will repeat a sequence of code as many times as you ask it to. To make one, we need to do 3 things:
Create the object that the loop will update.
Write the for line, where we define how many times this loop will repeat and assign a variable to be the iteration counter.
Put the code inside braces {} to repeat until the number of loops has been completed.
In the example below, we create an object called x. To start, let's just define x = 0. We set i to be the iteration counter, and we tell R to add 1 to the value of x for each loop.
x = 0
for (i in 1:5) {
x = x + 1
}
x
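If it helps to watch the loop work, here is the same loop with one extra line that prints the counter and the running value on every pass. The cat call is an addition for illustration; it does not change what the loop computes.

```r
x = 0
for (i in 1:5) {
  x = x + 1
  # cat() prints the iteration counter and the current value of x
  cat("i =", i, " x =", x, "\n")
}

# After 5 passes, x has been increased by 1 five times
x
```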
I made a few tweaks. Predict what you think the result of this loop will be, then run it and see if you're right!
x = 1
for (i in 1:3) {
x = x*2
}
x
1*2 = 2
2*2 = 4
4*2 = 8
growth = 4
for (z in 1:12) {
growth = growth + 2
}
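The last loop above never prints its result, so here is one way to check a prediction. Since growth starts at 4 and gains 2 on each of the 12 passes, it should end at 4 + 2 * 12. The sketch below runs the loop and compares it with that arithmetic.

```r
growth = 4
for (z in 1:12) {
  growth = growth + 2
}

# Print the final value
growth

# Compare with the arithmetic shortcut: start + step * number of loops
growth == 4 + 2 * 12
```

Notice that the iteration counter can have any name; here it is z instead of i.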
The examples above had our loops return one singular numeric answer.
Now, let's create a loop that returns a vector. We will fill each entry of the vector iteratively as the loop repeats.
Before doing that, let's talk about some notation we'll need: brackets!
Consider my vector of weights with 13 elements. I can call up any element of that vector using brackets: [].
weights[1] would be the first element: 96
What elements would the other two lines of code below output?
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)
weights[1]
weights[3]
weights[13]
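Brackets can do a little more than look up one element. The sketch below shows two extras that will matter for loops: asking for several positions at once, and using brackets on the left side of = to change one element. The vector scores is made up for illustration.

```r
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

# Several positions at once: pass a vector of positions inside the brackets
weights[c(2, 5)]

# Brackets also work on the left of = to overwrite a single element
scores = c(0, 0, 0)   # hypothetical vector
scores[2] = 10
scores
```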
Let's start by creating a vector called dice.
If I sample from this vector, I have an equal chance of getting any of the numbers from 1 to 6. I'm simulating the process of rolling a fair, 6-sided die!
dice = c(1,2,3,4,5,6)
sample(dice, size = 1)
Now let's create results as an object to fill iteratively with dice roll results. I wrote results = NULL to simply create the object so that R recognizes it as an object that can be filled.
Lastly, let's set up our loop to fill results by sampling from dice 10 times and storing the results. Notice that since i is the iteration counter, it will fill the ith spot of results with each loop iteration!
Since we're sampling dice rolls, we should sample WITH replacement from this vector since we should be able to get the same value on separate rolls!
dice = c(1,2,3,4,5,6)
results = NULL
for (i in 1:10) {
results[i] = sample(x = dice, size = 1, replace = TRUE)
}
results
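Starting from NULL works because R lengthens the vector each time a new spot is filled. An alternative, sketched below, is to create the vector at its final length up front with numeric(10), which makes a vector of ten zeros. This is just another option, not a change to the tutorial's approach.

```r
dice = c(1, 2, 3, 4, 5, 6)

# numeric(10) creates a vector of ten zeros to be overwritten by the loop
results = numeric(10)

for (i in 1:10) {
  # Fill the ith spot with one simulated roll
  results[i] = sample(x = dice, size = 1, replace = TRUE)
}

results
length(results)
```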
That previous loop might seem unnecessary--because it is! We could accomplish the same thing with sample(dice, size = 10, replace = TRUE).
BUT, let's go one step further and save the means of each sample of 5 dice rolls.
dice = c(1,2,3,4,5,6)
means = NULL
for (i in 1:10) {
means[i] = mean(sample(x = dice, size = 5, replace = TRUE))
}
means
Now let's scale it up! Instead of taking only 10 samples and storing the sample means, let's take 500 samples and store the sample means. Then we'll plot the results on a histogram.
dice = c(1,2,3,4,5,6)
means = NULL
for (i in 1:500) {
means[i] = mean(sample(x = dice, size = 5, replace = TRUE))
}
hist(means, breaks = 20)
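As mentioned earlier, a loop is not the only way to do this in R. One alternative, sketched below, is the base R function replicate, which repeats an expression a given number of times and collects the results. Treat this as an optional shortcut; the for loop above does the same job.

```r
dice = c(1, 2, 3, 4, 5, 6)

# Repeat "take 5 rolls and average them" 500 times, collecting the means
means = replicate(500, mean(sample(x = dice, size = 5, replace = TRUE)))

length(means)

# The average of the sample means should land near mean(dice), which is 3.5
mean(means)

hist(means, breaks = 20)
```

The histogram should look much like the one from the loop, though not identical, since the samples are random.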
Now let's see if you can apply what you learned! Create a for loop that does the following:
Samples from the diabetes dataset, and specifically from the age variable
Takes 200 samples of size 25 and stores each sample mean in means_25
Then plot means_25 on a histogram with breaks = 20.
Hint: Note that we're sampling from an existing data frame, so we don't need to pre-define it first like we did with dice. But we do need to reference age through diabetes using $ notation!
Click continue to see a suggested answer!
_____ = NULL
for (i in ______) {
_______________________________________
}
____(________, breaks = 20)
Hopefully you found some success on that one! If not though, the solution is below.
If you struggled with this task, consider watching this video to see Kelly walk through the code.
means_25 = NULL
for (i in 1:200) {
means_25[i] = mean(sample(x = diabetes$age, size = 25))
}
hist(means_25, breaks = 20)
Good work! If you're in STAT 212, be sure to watch the videos embedded in the tutorials (also linked on the Lab 1 assignment page) before completing Lab 1. But if you already watched them, then you're ready to go!
If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/