Introduction to R

Basic Computation

Welcome!

Welcome to your first tutorial for coding in R! In this tutorial set, we’ll discuss how to set up calculations, create and use basic data structures, and run several basic descriptor commands.

Keep in mind…

Throughout this tutorial, you will see code chunks like this:

2+2

Often, these code chunks will be completed and ready to go for demonstration. You should click “run code” on the top right to see what happens.

You should also feel free to play around with them too and run them with other entries! Don’t worry, you won’t break the tutorial by changing the contents. :)

Some code chunks will be challenges for you to fill. In these, use the provided hints…the last hint will be the suggested solution. You can also use submit to check that your output is correct!

Arithmetic

First, let’s practice basic computations with R. Addition, subtraction, multiplication, division, and exponents use symbols that are likely already familiar to you. Square roots use the sqrt function.

Look at the examples provided for simple computations and then produce some of your own:

56+1

66-60

45*2

81/9

sqrt(144)

5^2

Adding Parentheses

We can also use parentheses to complete multiple calculations at once.

When entering computations in R, keep the order of operations (PEMDAS) in mind. Adding () around a portion of your math problem is essential when you want that portion calculated first. Notice the difference between these two entries:

(25-5)/4

25-5/4

The second follows the default order of operations, so 5/4 is computed first and the result is subtracted from 25. The first computes 25-5 first and uses that result as the numerator.
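Here is a small sketch of the same idea with a few more operators (the numbers are made up for illustration). Exponents are evaluated before multiplication, and multiplication before addition, unless parentheses say otherwise:

```r
# Exponent first, then multiplication, then addition
2 + 3 * 4^2      # 2 + 3 * 16 = 50

# Parentheses force the addition to happen first
(2 + 3) * 4^2    # 5 * 16 = 80

# The two entries from above
(25 - 5) / 4     # 5
25 - 5 / 4       # 23.75
```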

Your Turn!

Write code that outputs 8 plus 6, all divided by 2. The solution is available for reference:

(8 + 6)/2

#Did you use parentheses around 8+6?
(8+6)

Vectors in R

Introducing Vectors

A vector is a collection of items (for example, a list of numbers) that are tied together into one structure. To create a vector, we will use our first R function, c (which is short for “combine”).

Functions in R are usually a letter or name, followed by parentheses that include inputs for that function.

The following vector could represent the heights (in inches) of 13 adults. The entries are placed inside the function like an input, and then when I run this function, it outputs the same list of numbers, but tied together as a vector.

c(65,71,63,68,67,72,64,61,67,71,72,68,64)

Characters

An example of a character vector might be storing responses to a question that produces categorical responses. Notice that character entries should be in quotation marks (whereas numbers should typically be listed without quotation marks).

c("yes","yes","no","yes","no","yes","yes","yes","no","yes")
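If you are ever unsure what kind of vector you have, one way to check is the base R function class (not covered elsewhere in this tutorial, so treat this as an optional aside). This sketch also shows why the quotation marks matter: numbers in quotes are stored as characters, and mixing the two turns everything into characters.

```r
class(c(65, 71, 63))         # "numeric"
class(c("yes", "no"))        # "character"

# Numbers inside quotation marks are treated as text
class(c("65", "71", "63"))   # "character"

# A vector holds one type, so mixing converts the number to text
c(65, "yes")                 # "65" "yes"
```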

Sequences

In some cases (like plots), we might wish to create a sequence of equally spaced numbers. There is a special function named “seq” that allows us to make a sequence from a starting value to a final value, by intervals of our choice.

Notice that this function now has multiple arguments to fill. We will define the three listed here.

seq(from = 2, to = 20, by = 2)

Leaving out the argument names

Keep in mind that in R, we don’t have to fill in all of the argument names. If we leave the names out, R will assume our inputs are listed in the order from, to, by.

seq(2, 20, 2)
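The flip side, sketched below: when we do include the argument names, the order we list them in no longer matters. Without names, R relies on position alone, so the order is important.

```r
# With names, any order gives the same sequence
seq(from = 2, to = 20, by = 2)
seq(by = 2, to = 20, from = 2)

# Both calls produce exactly the same vector
identical(seq(from = 2, to = 20, by = 2),
          seq(by = 2, to = 20, from = 2))   # TRUE

# Without names, R reads the inputs by position.
# seq(20, 2, 2) would be read as from = 20, to = 2, by = 2,
# which should stop with an error (you can't count up from 20 to 2).
```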

Default entries

Something else to keep in mind: we don’t have to fill in every possible argument to a function, only the necessary ones. For example, if we leave out the “by” argument, R will assume a default value of 1. Try running this to see!

seq(2, 20)

In case you’re curious, you can always check out the documentation for a function by running ? in front of the name. This will give you info about what argument options are available, and what default entries are used if left undefined. It’s a bit technical and confusing at first, but as you gain more coding experience, it can be very helpful to reference for new (to you) functions.

?seq
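As one example of what the documentation can turn up, the help page for seq lists an argument named length.out. Here is a small sketch of using it in place of by (the values are just for illustration): we tell R how many numbers we want, and it works out the spacing.

```r
# Five equally spaced values from 0 to 1
seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```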

Assignment

I have Variables

Let’s say that I want to calculate something, but that calculation needs to vary based on certain info (perhaps depending on the data I have). So for that code, maybe I want to write it in terms of variables, and then define those variables depending on my data!

Notice this code will not produce a result yet (R will report an error) because these variables are undefined.

x_bar - 1.96*(s/n)

Creating Variables

What I will do now is define each of these three variables, and then the code below will run!

x_bar = 23.4
s = 5
n = 100

x_bar - 1.96*(s/n)
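One thing worth noticing, as a small sketch (the name lower and the second value of n are made up for illustration): a saved result does not update itself when you later change a variable. You have to rerun the calculation.

```r
x_bar = 23.4
s = 5
n = 100

lower = x_bar - 1.96*(s/n)   # hypothetical name for the saved result
lower                        # 23.302

# Change n, and the saved result stays where it was
n = 25
lower                        # still 23.302

# Rerunning the calculation picks up the new value of n
x_bar - 1.96*(s/n)           # 23.008
```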

Your Global Environment

You won’t see it in this tutorial, but when coding in RStudio, there will be a window on the top right called “Global Environment”. That will be helpful for checking that variables you define are, in fact, defined!

It’s also helpful as a reminder that those variable names are still active in anything else you use them for. It’s important to occasionally click the broom icon above that window and clear your environment when moving on to new tasks or assignments. :)

Assigning Vectors to a Variable Name

We can also save vectors to a variable name–this is helpful when we might want to summarize or use this vector in a later command.

heights = c(65,71,63,68,67,72,64,61,67,71,72,68,64)

heights

breaks = seq(0,100,5)

breaks

Equals (=) vs. Assignment (<-)

Just a heads up…you can also use something called the assignment operator instead of an equals sign when defining variables. The assignment operator is <- and is meant to resemble an arrow pointing at the variable name.

heights <- c(65,71,63,68,67,72,64,61,67,71,72,68,64)

heights

s <- 25

s

This is the traditional way to assign variables, as some early keyboards had a dedicated key for this arrow. Today, that notation is more cumbersome than a simple =, but you will commonly see <- in help sheets and traditional coding manuals.

There are some rare situations where they will behave differently, but you won’t see any situations like that in this class. I just recommend you be consistent with your notation!

Note: within function arguments, always use =. Save <- for variable assignment only.
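If you’re curious what one of those rare differences looks like, here is a minimal sketch (the name avg_input is made up for illustration). Inside a function call, = labels an argument, while <- creates a variable as a side effect.

```r
# Inside a function call, = names an argument (here, the x argument of mean)
mean(x = c(2, 4, 6))            # 4

# Inside a function call, <- also creates a variable along the way
mean(avg_input <- c(2, 4, 6))   # 4
avg_input                       # 2 4 6
```

That second style is easy to misread, which is one more reason to stick with = inside function calls.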

Operations on a variable

We can complete arithmetic operations on vectors, as well as calculate various summary statistics if working with data.

Take a look at the following example, where we take our height vector and multiply it by 2.54 to convert these values from inches to centimeters.

height = c(65,71,63,68,67,72,64,61,67,71,72,68,64)

height_cm = height*2.54

height_cm

Hey, did you see how cool that was?

If you have used other programming languages before, I just want to highlight how useful that previous operation is when using R. You don’t have to create a loop to multiply each value by 2.54. You can literally just perform an operation on a vector!

R allows for vectorized operations. You will find that incredibly helpful if you continue to use R in the future.
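To make the comparison concrete, here is a sketch of both approaches on a shortened height vector (the loop is shown only for contrast, and height_cm_loop is a made-up name):

```r
height = c(65, 71, 63, 68)

# The loop approach: convert one entry at a time
height_cm_loop = numeric(length(height))
for (i in seq_along(height)) {
  height_cm_loop[i] = height[i] * 2.54
}

# The vectorized approach: one line, same result
height_cm = height * 2.54

all.equal(height_cm_loop, height_cm)   # TRUE

# Summary functions take the whole vector too
mean(height)            # 66.75
height - mean(height)   # -1.75  4.25 -3.75  1.25
```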

Practice!

Give it a try! Create a sequence from 3 to 24 by 3’s. Name this as Vector, and then divide Vector by 3. It should produce a vector from 1 to 8 by 1’s after this division.

______ = ___(from = __, to = __, by = __)

Vector/__

Vector = seq(from = 3, to = __, by = __)

Vector/__

Vector = seq(from = 3, to = 24, by = 3)

Vector/3

More Practice!

Now, try creating a vector with the following data representing inches of precipitation for 12 months in Champaign.

Save this data as a vector named Temp_2019

3.85, 1.90, 5.09, 4.89, 6.08, 2.82, 3.38, 2.19, 3.36, 5.00, 1.91, 1.82

FYI: Weather data for the Champaign-Urbana area can be found here: https://stateclimatologist.web.illinois.edu/data/champaign-urbana/

3.85, 1.90, 5.09, 4.89, 6.08, 2.82, 3.38, 2.19, 3.36, 5.00, 1.91, 1.82

Temp_2019 = c(...)

Temp_2019 = c(3.85, 1.90, 5.09, 4.89, 6.08, 2.82, 3.38, 2.19, 3.36, 5.00, 1.91, 1.82)
Temp_2019

Matrix operations

Vector vs. Matrix

A vector is a one-dimensional R structure where all entries are of the same type (e.g., all numbers, all characters).

A matrix is a two-dimensional R structure, meaning that every entry of a matrix belongs to both a row and a column. Matrices also require all entries to be of the same type.

Building a Matrix

We can use the matrix function much like we use the c function. The main difference is that we have to tell R how many rows to create.

Use the nrow argument to specify number of rows.

firstmatrix = matrix(c(3,1,3,5,5,2,1,3,5,3,4,2), nrow = 3)
firstmatrix

secondmatrix = matrix(c(3,1,3,5,5,2,1,3,5,3,4,2), nrow = 2)
secondmatrix

Notice that it sequentially assigns entries down the first column, then the second column, etc. To fill across the rows first instead, you can add the byrow = TRUE argument at the end! (The default is byrow = FALSE.)
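Here is a small sketch of filling by row, using the same twelve values as above (the name rowmatrix is made up for illustration). Setting byrow = TRUE fills across the first row, then the second row, and so on.

```r
rowmatrix = matrix(c(3,1,3,5,5,2,1,3,5,3,4,2), nrow = 3, byrow = TRUE)
rowmatrix
#      [,1] [,2] [,3] [,4]
# [1,]    3    1    3    5
# [2,]    5    2    1    3
# [3,]    5    3    4    2
```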

Identifying entries

With both vectors and matrices, you can identify specific entries using bracket notation.

Let’s start with a vector. Since these are one-dimensional, we need only to enter the sequential position.

Let’s reference the 7 in the 4th position.

v = c(1,3,5,7,9,11,13,15)
v[4]

Now let’s do the same with a matrix. But now we need two values to identify position: the row number and column number.

Let’s identify the 3 in the second row, third column

firstmatrix = matrix(c(3,1,3,5,5,2,1,3,5,3,4,2), nrow = 3)
firstmatrix

firstmatrix[2,3]

Identifying rows or columns

Leave one entry blank!

firstmatrix = matrix(c(3,1,3,5,5,2,1,3,5,3,4,2), nrow = 3)
firstmatrix

#2nd row
firstmatrix[2,]

#3rd column
firstmatrix[,3]
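Bracket notation can also pull out several entries at once. A short sketch, assuming the same v and firstmatrix as above: put a vector of positions (or a range like 2:4) inside the brackets.

```r
v = c(1,3,5,7,9,11,13,15)
v[c(2, 4)]   # 2nd and 4th entries: 3 7
v[2:4]       # 2nd through 4th entries: 3 5 7

firstmatrix = matrix(c(3,1,3,5,5,2,1,3,5,3,4,2), nrow = 3)

# Rows 1 and 2 of the 3rd column
firstmatrix[1:2, 3]   # 1 3

# dim reports the number of rows, then the number of columns
dim(firstmatrix)      # 3 4
```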

Try your own

Create a matrix with 2 rows with the following values, filling by column first (the default ordering). Name this matrix: Coef.

Note: This one won’t evaluate you as correct or incorrect. You can check the hints to get to the solution if you wish!

2.2, 4.7, 5.5, 6.2, 5.8, 4.7, 7.2, 3.1

Coef = matrix(c(2.2, 4.7, 5.5, 6.2, 5.8, 4.7, 7.2, 3.1), ____)

Coef = matrix(c(2.2, 4.7, 5.5, 6.2, 5.8, 4.7, 7.2, 3.1), nrow = 2)

Find the entry

Take the following matrix and use bracket notation to output the value 4 from this matrix

m = matrix(c(5,6,4,2,3,7,8,1,0), nrow = 3)

m = matrix(c(5,6,4,2,3,7,8,1,0), nrow = 3)
m[_,_]

m = matrix(c(5,6,4,2,3,7,8,1,0), nrow = 3)
m[3,1]

Data Frames (and Tibbles) in R

Introducing Data Frames

A data frame in R is a collection of vectors, where each vector represents one variable of data. Typically, each column of a data frame is a variable, and each row represents one observation (set of measurements from one individual at one point in time).

In more realistic data analysis situations, you would often work with data that comes from outside of R. For now, we’ll focus on data we create directly in R, or some named datasets that exist online in the R universe for learning purposes.

Load the Prostate Data Frame from a Package

In the following code, we will load a data frame named “prostate.” This data is saved in a package named “faraway.” Packages are ways that R users can create code structures or data frames and share them with others! We’ll use packages many times throughout the course.

Note that if using a package on your personal computer, you’ll need to install it before librarying it. So if you want to replicate this next bit on your own computer, be sure to run the following: install.packages("faraway")

Once installed, you can activate any package for use in your current session of R by running library(package_name). In this case, the package name is faraway, so we will run that here!

library(faraway)
prostate

Note that library(faraway) calls on the location of this data, and then prostate is one (of many!) data frames in this package that we can access. By running just the name, we get a snapshot of this data frame in our output.

A Little Exploration

We can use different functions on a data frame to learn more about it. Here are a couple basic ones.

"Number of rows (observations)"
nrow(prostate)

"Number of columns (variables)"
ncol(prostate)

Create a Data Frame Manually

We can also create a data frame manually by entering named vectors that we want to tie together. We will use the command “data.frame”, which concatenates vectors that we list separated by commas.

Class = data.frame(
  heights = c(65,71,63,68,67,72,64,61,67,71,72,68,64),
  responses = c("yes","yes","no","yes","no","yes","yes","yes","no","yes","no","no","yes")
)

Class
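The exploration functions from earlier work on a data frame we build ourselves, too. Here is a sketch using the Class data frame, plus three base R functions not covered elsewhere in this tutorial (names, head, and str), so treat those as optional extras:

```r
Class = data.frame(
  heights = c(65,71,63,68,67,72,64,61,67,71,72,68,64),
  responses = c("yes","yes","no","yes","no","yes","yes","yes","no","yes","no","no","yes")
)

nrow(Class)      # 13 observations
ncol(Class)      # 2 variables
names(Class)     # "heights" "responses"

head(Class, 3)   # just the first 3 rows
str(Class)       # the type of each variable, with a preview of its values
```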

New Lines to Improve Readability

Notice in the code chunk above, we hit “enter” after each comma to list each variable on a new line. With most functions in R, you can insert line breaks to improve readability without changing the operation! We could list all of that in one long line, and it would run exactly the same, but it would be very difficult to read!

As you are learning to code, please please please make line breaks where appropriate! It will make it much easier for you and for those of us who might be helping you. :)
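To see that line breaks really don’t change anything, here is a small sketch with a shortened version of the data (Class_multi and Class_single are made-up names):

```r
# Spread across several lines
Class_multi = data.frame(
  heights = c(65, 71, 63),
  responses = c("yes", "yes", "no")
)

# The same thing crammed into one line
Class_single = data.frame(heights = c(65, 71, 63), responses = c("yes", "yes", "no"))

# R sees no difference between the two
identical(Class_multi, Class_single)   # TRUE
```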

Data Frames with Multiple Variables

Now, can you try creating a data frame with two variables? Let’s report the test scores of 5 fictional students, as well as their Names.

Scores: 90, 81, 87, 98, 78

Names: “Jose”, “Maddie”, “Peter”, “Amy”, and “Kara”

Let’s call this data frame “Results.”

Then be sure to call up this data frame at the end.

Don’t forget to put a comma at the end of the Scores line!

______ = data.frame(
  Scores = ...
  Names = ...
)

Results

Results = data.frame(
  Scores = c(90, 81, ...),
  Names = c("Jose", ...)
)

Results

Results = data.frame(
  Scores = c(90, 81, 87, 98, 78),
  Names = c("Jose", "Maddie", "Peter", "Amy", "Kara")
)

Results

And Tibbles Too

You should also be aware that “tibbles” are another data structure that you may encounter. Tibbles behave like data frames in nearly every way. For our purposes, the only real difference is how they display data when called on.

In this R tutorial, you won’t see a difference. In fact, this tutorial purposely displays data frames like a tibble! But if using R on your personal computer, you’ll notice that data frames display clunkier. They might display as many as 1,000 rows of data, while tibbles display a truncated version, plus some additional variable info. Tibbles just give you an efficient run down!

The more data you work with in R, the more you’ll notice the difference, and probably realize why tibbles are easier to work with than data frames.

We can actually take the same data from earlier and save it as a tibble.

library(tibble)  # the tibble function comes from the tibble package
Class = tibble(
  heights = c(65,71,63,68,67,72,64,61,67,71,72,68,64),
  responses = c("yes","yes","no","yes","no","yes","yes","yes","no","yes","no","no","yes")
)

Class

Summarizing Data

When analyzing data, we are often interested in summarizing certain variables in our data.

The summary command is a quick way to produce several helpful summary statistics for all of our variables at once. For each numeric variable, summary produces the 5-number summary and the mean.

library(faraway)

summary(prostate)

We can also produce specific summaries for specific variables using commands like mean, sd, and median. Just make sure you call on specific variables by using the $ operator. This allows you to access a specific column (variable) of the data frame.

sd(prostate$lweight)
mean(prostate$lweight)
median(prostate$age)
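The $ operator works the same way on any data frame. Here is a sketch using the Class data frame we built by hand, so you can check the results against the raw numbers. The table function at the end is an extra not covered elsewhere in this tutorial. It counts how often each response appears.

```r
Class = data.frame(
  heights = c(65,71,63,68,67,72,64,61,67,71,72,68,64),
  responses = c("yes","yes","no","yes","no","yes","yes","yes","no","yes","no","no","yes")
)

# $ pulls one variable out of the data frame as a vector
Class$heights

mean(Class$heights)     # about 67.15
median(Class$heights)   # 67
sd(Class$heights)

# For a categorical variable, count the responses instead
table(Class$responses)  # no: 5, yes: 8
```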

Exploring the diabetes data frame

Now let’s take a look at a new dataset.

Library the faraway package again, and then call up the data frame named diabetes to display.

library(_______)
________

library(faraway)
diabetes

Calculate the number of observations in the dataset:

nrow(___)

nrow(diabetes)

Summary

Now, run a summary of the diabetes data frame.

summary(____)

summary(diabetes)

More Statistics

And lastly, calculate the standard deviation of the age variable (within diabetes).

sd(diabetes$____)

sd(diabetes$age)

That’s it!

That wraps this up! If you found a glitch anywhere, let me know by dropping a message in Campuswire.

If you’d like to see any of this content (and a little extra) in text format, I recommend my colleague David Dalpiaz’s book, specifically chapters 2-6.

https://daviddalpiaz.github.io/appliedstats/