Sampling and Simulation

Sampling in R

Goals

In this tutorial, we will talk about:

  • Sampling (with and without replacement) from a vector or data frame
  • Creating a "for" loop to simulate a process many times
  • Iteratively saving results that we generated inside a "for" loop
  • Summarizing results that we iteratively saved from a "for" loop

The sample function

The sample function lets us take a random sample of observations from a vector. There are 3 arguments you should know about when using it:

  • x: the vector that you wish to sample from
  • size: how many observations you wish to sample from x
  • replace: whether you wish to sample with or without replacement

An example

Consider the following vector of length 13, representing people's weights in lbs:

weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

Sample weights

The sample function below is set to:

  • Sample from the vector weights
  • Take a sample of size 5
  • Sample without replacement (no repeated values)

Try running it several times and observe the results!

sample(x = weights, size = 5, replace = FALSE)

Changing it up

Now take a minute and change some of the arguments. A few things to try:

  • How large a sample size can you take when sampling without replacement?
  • What changes when you set replace = TRUE instead?

sample(x = weights, size = 5, replace = FALSE)
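
If you want to check your answers afterwards, here is a minimal sketch of one way to explore both questions. The names too_big and with_repeats are made up for this illustration, and tryCatch is used only so the error shows up as a message instead of stopping the code.

```r
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

# Without replacement, the largest possible sample is the whole vector (13 values).
# Asking for 14 raises an error, which tryCatch() turns into a plain message here.
too_big = tryCatch(
  sample(x = weights, size = 14, replace = FALSE),
  error = function(e) conditionMessage(e)
)
too_big

# With replacement, the same request works, and at least one value must repeat
# because we are drawing 14 times from only 13 distinct values.
with_repeats = sample(x = weights, size = 14, replace = TRUE)
length(with_repeats)
anyDuplicated(with_repeats) > 0
```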

Sampling from a Data frame

If you wish to sample observations from one particular variable embedded inside a data frame, you'll need to use the dollar sign ($) notation to call the variable through the data frame.

prostate$age

An example

Here, we have used the sample function again, but to tell R where to find the variable age, we need to call it through the data frame it's embedded within.

library(faraway)
sample(x = prostate$age, size = 10, replace = FALSE)
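
If you ever want whole rows of a data frame instead of values from one variable, one possible approach is to sample row numbers and then use them to pull the rows. This is only a sketch: it uses mtcars (a data frame that ships with R) as a stand-in, and the names n_rows, picked_rows, and sampled_cars are made up for the illustration.

```r
# mtcars comes with R (datasets package); it stands in for any data frame here.
n_rows = nrow(mtcars)

# Sample 5 row numbers without replacement...
picked_rows = sample(x = 1:n_rows, size = 5, replace = FALSE)

# ...then use bracket notation [rows, columns] to keep those rows and all columns.
sampled_cars = mtcars[picked_rows, ]

sampled_cars
```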

How to save a new object

Just a reminder: If you want to save a particular sample as an object in your global environment, you can assign it to a name! To save it under the object name sample_x, we can use the = sign (or, if you want to code like an R guru, the assignment operator <-, which does the exact same thing in this case).

sample_x = sample(x = prostate$age, size = 10, replace = FALSE)
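
As a quick check that the two assignment styles behave the same way here, consider this small sketch. The vector values and the names ages_a and ages_b are hypothetical (they are not the prostate data).

```r
# Hypothetical values, used only to compare the two assignment styles.
ages_a = c(41, 50, 58, 61, 63)
ages_b <- c(41, 50, 58, 61, 63)

# Both lines created an ordinary object holding the same values.
identical(ages_a, ages_b)

# Inside a function call, = names an argument instead of creating an object:
# here it tells mean() which input is x.
mean(x = ages_a)
```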

...and call it back up

And notice that if we now call on sample_x after defining it, R will remember it.

sample_x

Taking Multiple Samples

Sampling Variability

In class, we've talked about how every time we take a sample, we should expect sampling variability.

sample(x = prostate$age, size = 5)
sample(x = prostate$age, size = 5)
sample(x = prostate$age, size = 5)
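
Sometimes you want a random sample that you can reproduce later (for example, so a classmate gets the same numbers). A minimal sketch using set.seed is below; the ages vector is hypothetical (not the prostate data), and the seed value 212 is an arbitrary choice.

```r
# Hypothetical ages, used only for illustration.
ages = c(41, 50, 58, 61, 63, 64, 65, 68, 70, 72)

set.seed(212)                      # fix the starting point of the random number generator
first_draw = sample(x = ages, size = 5)

set.seed(212)                      # reset to the same starting point
second_draw = sample(x = ages, size = 5)

identical(first_draw, second_draw) # same seed, same call, same sample

# Without resetting the seed, the next draw will generally differ again.
third_draw = sample(x = ages, size = 5)
third_draw
```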

Variability in Sample Means

For the same reason, any statistic we calculate from sample data will vary from sample to sample too!

Run this code a few times and observe how your sample means continue to vary!

sample_1 = sample(x = prostate$age, size = 5)
sample_2 = sample(x = prostate$age, size = 5)
sample_3 = sample(x = prostate$age, size = 5)

mean(sample_1)
mean(sample_2)
mean(sample_3)

We can try a Loop!

What we'd like to do is create a distribution to see how much our sample means (from samples of size 5) might reasonably vary.

We want R to run the same code many times and to store our results into one vector. So let's talk about setting up a loop!

FYI: A Loop is not the only way to accomplish this task in R, but it's an intuitive way that will help you see the process.
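
For the curious, here is a sketch of one such alternative using replicate, which evaluates an expression a set number of times and collects the results. The ages vector is hypothetical (not the prostate data), and 1000 repetitions is an arbitrary choice.

```r
# Hypothetical ages, used only for illustration.
ages = c(41, 50, 58, 61, 63, 64, 65, 68, 70, 72)

# Take a sample of size 5 and compute its mean, 1000 times over.
sample_means = replicate(n = 1000, expr = mean(sample(x = ages, size = 5)))

length(sample_means)   # one mean per repetition
range(sample_means)    # smallest and largest sample mean observed
```

The rest of this tutorial sticks with for loops, since they make each step of the process visible.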

For Loop: Returning a value

How a for loop works

A for loop will repeat a sequence of code for as many times as you ask it to. To make one, we need to do 3 things:

  • Define an object that we will use to store results in our loop
  • Write the for line, where we define how many times this loop will repeat and assign a variable to be the iteration counter
  • Write a sequence of code inside curly brackets {} to repeat until the number of loops has been completed.

Example

  • We define an object called x. To start, let's just set x = 0.
  • This loop will have 5 iterations, and we're using the variable i to be the iteration counter.
  • Our code inside is telling R to add 1 to the value of x for each loop.

x = 0

for (i in 1:5) {
  x = x + 1
}

x
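
In the loop above, the counter i only controls how many times the code repeats. As a small sketch of how i can also be used inside the loop, the example below adds the current value of i on each pass (the name total is made up for this illustration).

```r
total = 0

for (i in 1:5) {
  # i is 1 on the first pass, 2 on the second, and so on up to 5.
  total = total + i
}

total   # 1 + 2 + 3 + 4 + 5 = 15
```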

Predict the loop

I made a few tweaks. Predict what you think the result of this loop will be, then run it and see if you're right!

x = 1

for (i in 1:3) {
  x = x*2
}

x

What happened?

1*2 = 2
2*2 = 4
4*2 = 8
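
If you'd like R to show these intermediate steps itself, one option is to print from inside the loop. This is just a sketch using cat to write a line on each pass.

```r
x = 1

for (i in 1:3) {
  x = x * 2
  # Report the iteration number and the updated value of x.
  cat("After iteration", i, "x is", x, "\n")
}

x   # final value: 8
```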

Check your understanding

growth = 4

for (z in 1:12) {
  growth = growth + 2
}


Saving results iteratively

The examples above had our loops return a single number.

Now, let's create a loop that returns a vector. We will fill each entry of the vector iteratively as the loop repeats.

For loops: Returning a vector

Bracket notation

Before doing that, let's talk about some notation we'll need: brackets!

Consider my vector of weights with 13 elements. I can call up any element of that vector using brackets: [].

weights[1] would be the first element: 96

What elements would the other two lines of code below output?

weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

weights[1]

weights[3]

weights[13]
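
Brackets can do a bit more than pick out one element. The sketch below shows a few variations that may come in handy; weights_copy is a made-up name used so the original vector stays unchanged.

```r
weights = c(96, 102, 116, 119, 131, 135, 142, 157, 165, 180, 185, 187, 225)

length(weights)      # 13 elements in total
weights[c(1, 13)]    # first and last elements: 96 225
weights[2:4]         # elements 2 through 4: 102 116 119
weights[14]          # there is no 14th element, so R returns NA

# Brackets also work on the left side of an assignment to change one element.
weights_copy = weights
weights_copy[1] = 100
weights_copy[1]      # 100
weights[1]           # the original is still 96
```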

An example

Let's start by creating a vector called dice.

If I sample from this vector, I have an equal chance of getting any of the numbers from 1 to 6. I'm simulating the process of rolling a fair, 6-sided die!

dice = c(1,2,3,4,5,6)

sample(dice, size = 1)
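
To convince yourself that each face really is equally likely, one possible check is to roll many times in a single call and count the outcomes. In this sketch, 6000 rolls is an arbitrary choice and the names rolls and roll_counts are made up for the illustration.

```r
dice = c(1, 2, 3, 4, 5, 6)

# Roll 6000 times in one call; replace = TRUE lets each face appear repeatedly.
rolls = sample(x = dice, size = 6000, replace = TRUE)

# Count how often each face came up, then convert the counts to proportions.
roll_counts = table(rolls)
roll_counts
roll_counts / length(rolls)   # each proportion should land near 1/6
```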

Dice loop

Now let's create results as an object to fill iteratively with dice roll results. I wrote results = NULL to simply create the object so that R recognizes that as an object that can be filled.

Lastly, let's set up our loop to fill results by sampling from dice 10 times and storing each roll. Notice that since i is the iteration counter, each loop iteration fills the ith spot of results!

Since we're simulating dice rolls, the same value must be able to come up on separate rolls. Each pass through the loop is its own call to sample, so repeats across rolls can happen either way; replace = TRUE only starts to matter once we draw more than one roll in a single call, which we'll do shortly.

dice = c(1,2,3,4,5,6)

results = NULL

for (i in 1:10) {
  results[i] = sample(x = dice, size = 1, replace = TRUE)
}

results
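
Starting from results = NULL works because R grows the vector each time a new position is filled. A common alternative, sketched below, is to create a vector of the right length up front and overwrite its entries. The name n_rolls is made up for this illustration.

```r
dice = c(1, 2, 3, 4, 5, 6)
n_rolls = 10

# numeric(10) creates a vector of ten zeros that the loop will overwrite.
results = numeric(n_rolls)

for (i in 1:n_rolls) {
  results[i] = sample(x = dice, size = 1)
}

results
length(results)   # 10, one entry per roll
```

For loops this small, either approach is fine; creating the vector in advance tends to matter more when the number of iterations gets very large.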

Storing Sample Means

That previous loop might seem unnecessary--because it is! We could accomplish the same thing with sample(dice, size = 10, replace = TRUE).

BUT, let's go one step further and save the means of each sample of 5 dice rolls.

dice = c(1,2,3,4,5,6)

means = NULL

for (i in 1:10) {
  means[i] = mean(sample(x = dice, size = 5, replace = TRUE))
}

means

Calculating Means

Now let's scale it up! Instead of taking only 10 samples and storing the sample means, let's take 500 samples and store the sample means. Then we'll plot the results on a histogram.

dice = c(1,2,3,4,5,6)

means = NULL

for (i in 1:500) {
  means[i] = mean(sample(x = dice, size = 5, replace = TRUE))
}

hist(means, breaks = 20)
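
Besides plotting, we can summarize the stored means with a few numbers. The sketch below repeats the simulation and then reports the center and spread of the 500 sample means. The comparison values rest on the assumption of a fair die: the average roll is 3.5, and the standard deviation of a mean of 5 rolls is sqrt(35/12/5), about 0.76.

```r
dice = c(1, 2, 3, 4, 5, 6)

means = NULL

for (i in 1:500) {
  means[i] = mean(sample(x = dice, size = 5, replace = TRUE))
}

mean(means)   # center of the sample means; should land near 3.5
sd(means)     # spread of the sample means; should land near 0.76

# The middle 95% of the simulated sample means.
quantile(means, probs = c(0.025, 0.975))
```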

Sampling Age

Now let's see if you can apply what you learned! Create a for loop that does the following:

  • Samples from the diabetes dataset, and specifically from the age variable
  • Samples 25 ages and calculates the mean of that sample of 25
  • Repeats this process 200 times and saves the means in a vector called means_25
  • Plots the sample means on a histogram (with breaks set to 20)

Hint: Note that we're sampling from an existing data frame, so we don't need to pre-define it first like we did with dice. But we do need to reference age through diabetes using $ notation!

Click continue to see a suggested answer!

_____ = NULL

for (i in ______) {
  _______________________________________
}

____(________, breaks = 20)

Sample Solution

Hopefully you found some success on that one! If not though, the solution is below.

If you struggled with this task, consider watching this video to see Kelly walk through the code.

means_25 = NULL

for (i in 1:200) {
  means_25[i] = mean(sample(x = diabetes$age, size = 25))
}

hist(means_25, breaks = 20)
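
As an optional follow-up, here is a sketch comparing how much sample means vary for samples of size 5 versus size 25. To keep it self-contained, it uses a made-up vector called population_ages (this is NOT the diabetes data), and the name means_5 is also made up for the illustration.

```r
# Hypothetical population of 400 ages between 20 and 89.
population_ages = rep(20:89, length.out = 400)

means_5 = NULL
means_25 = NULL

for (i in 1:200) {
  means_5[i] = mean(sample(x = population_ages, size = 5))
  means_25[i] = mean(sample(x = population_ages, size = 25))
}

# Means of larger samples should vary less from sample to sample.
sd(means_5)
sd(means_25)
```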

Return Home

Good work! If you're in STAT 212, be sure to watch the videos embedded in the tutorials (also linked on the Lab 1 assignment page) before completing Lab 1. But if you already watched them, then you're ready to go!

If you'd like to go back to the tutorial home page, click here: https://stat212-learnr.stat.illinois.edu/