
R Basics

Vectors

Commonly used functions

Data in R is stored in vectors. A vector is essentially an ordered container for objects of the same type (e.g. numeric, character, logical). We can create a vector with "c(...)". Here are some examples:

numeric.vector <- c(1, 2, 3, 4, 5)
character.vector <- c("a", "b", "cd", "e f", "123", "TRUE")
logical.vector <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
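
Because a vector holds a single type, R converts mixed inputs to one common type instead of raising an error. The short sketch below illustrates this; the name mixed.vector is ours and is used only for illustration.

```r
# Mixing a number, a string, and a logical in c() coerces everything
# to the most general type present, here character
mixed.vector <- c(1, "a", TRUE)
class(mixed.vector)
## [1] "character"
mixed.vector
## [1] "1"    "a"    "TRUE"
```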

A few basic functions give us important information about the vectors we're working with.

The "class" function tells us what type of data is stored in the vector.

class(numeric.vector)
## [1] "numeric"

The "length" function tells us how many elements (observations) are in the vector.

length(numeric.vector)
## [1] 5

The "str" function reports both the type of data stored in the vector and the number of observations.

str(numeric.vector)
##  num [1:5] 1 2 3 4 5

When performing exploratory data analysis, it is often sufficient to use "str" in place of the other functions. However, a function like "length" can be useful if we need to set a variable that refers to the number of observations in our dataset.
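
As a minimal sketch of that last point, we can store the result of "length" in a variable (called n here, a name chosen only for this example) and reuse it, for instance to compute a mean by hand:

```r
numeric.vector <- c(1, 2, 3, 4, 5)

# Store the number of observations for later use
n <- length(numeric.vector)

# Compute the mean by hand and compare it with the built-in mean()
sum(numeric.vector) / n
## [1] 3
mean(numeric.vector)
## [1] 3
```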

Vector operations

Since vectors are the basic units in R, performing operations with them works quite smoothly. However, we need to be careful about how R interprets certain cases. Any of the basic arithmetic operations on two numeric vectors of the same length is performed element-wise. Here are some examples:

numeric.vector.b <- c(5, 6, 7, 8, 9)
numeric.vector + numeric.vector.b
## [1]  6  8 10 12 14
numeric.vector - numeric.vector.b
## [1] -4 -4 -4 -4 -4
numeric.vector * numeric.vector.b
## [1]  5 12 21 32 45
numeric.vector/numeric.vector.b
## [1] 0.2000 0.3333 0.4286 0.5000 0.5556

We can also multiply by a scalar or add a constant by applying "*" or "+" to a vector of any length together with a vector of length 1.

2 * numeric.vector
## [1]  2  4  6  8 10
5 + numeric.vector
## [1]  6  7  8  9 10

Beyond these operations, things start to get sticky. When we perform any of these operations on vectors of different lengths, R recycles the shorter vector (repeats its elements) until it matches the length of the longer one. It does so without a warning provided the length of the longer vector is a multiple of the length of the shorter. This should be avoided since it is confusing for other programmers and possibly even future you.

vector.a <- c(1, 2)
vector.b <- c(1, 2, 3, 4, 5, 6)
vector.c <- c(1, 2, 3, 4, 5)

# R will happily perform this computation, but it should be avoided
vector.b - vector.a
## [1] 0 0 2 2 4 4
# R will happily perform this computation, but it should be avoided
vector.b * vector.a
## [1]  1  4  3  8  5 12
# R will warn you, but still perform the calculation
vector.c + vector.a
## Warning: longer object length is not a multiple of shorter object length
## [1] 2 4 4 6 6
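
To see what R is doing behind the scenes, and to make the intent explicit when repetition really is what we want, we can spell out the recycling ourselves with "rep". This is a sketch of one way to do it; the name vector.a.recycled is ours.

```r
vector.a <- c(1, 2)
vector.b <- c(1, 2, 3, 4, 5, 6)

# Repeat vector.a until it is as long as vector.b
vector.a.recycled <- rep(vector.a, length.out = length(vector.b))
vector.a.recycled
## [1] 1 2 1 2 1 2

# Same result as vector.b - vector.a, but the repetition is now visible
vector.b - vector.a.recycled
## [1] 0 0 2 2 4 4
```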

Finally, we note that many of R's built-in functions operate element-wise on vectors. For example, calling the function "sin" with the argument numeric.vector returns a vector giving sin(x) for each entry x in numeric.vector.

sin(numeric.vector)
## [1]  0.8415  0.9093  0.1411 -0.7568 -0.9589
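
Not every built-in function works this way: some collapse the whole vector into a single value. The sketch below contrasts the two behaviours using "sqrt" and "sum"; the vector squares is a made-up example.

```r
squares <- c(1, 4, 9, 16)  # hypothetical example vector

# sqrt() is applied to each element and returns a vector of the same length
sqrt(squares)
## [1] 1 2 3 4

# sum() combines all elements into a single value
sum(squares)
## [1] 30
```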

Subsetting

We can access different elements of vectors using bracket notation ("vector[...]"). R indexes vectors starting with 1, i.e. vector[1] returns the first element of vector. For example, the fourth element of numeric.vector and the second element of logical.vector are given by:

numeric.vector[4]
## [1] 4
logical.vector[2]
## [1] TRUE

We can access multiple elements of a vector by using a numeric vector or a logical vector inside the brackets. With a logical vector, the elements at positions where the value is TRUE are returned. For example, the second, fourth, and fifth elements of character.vector:

character.vector[c(2, 4, 5)]
## [1] "b"   "e f" "123"
character.vector[c(FALSE, TRUE, FALSE, TRUE, TRUE)]
## [1] "b"   "e f" "123"

We need to be careful when subsetting using a logical vector. If the vector being used to subset the data is shorter than the data vector, R will try to be smart and repeat (recycle) the logical vector until it is the same length as the data vector. This probably makes more sense through an example.

subset.idcs <- c(TRUE, FALSE)
numeric.vector[subset.idcs]
## [1] 1 3 5

Instead of throwing an error because subset.idcs and numeric.vector have different lengths, R recycles subset.idcs to (TRUE, FALSE, TRUE, FALSE, TRUE). This is often confusing for other programmers as well as yourself and should be avoided. Instead, use a logical vector that is equal in length to the vector you are subsetting. Subsetting with a full-length logical vector is very handy for many data analysis tasks. For instance, if we want all the elements of numeric.vector that are less than 3, we write

numeric.vector[numeric.vector < 3]
## [1] 1 2
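
It can help to look at the logical vector that the comparison produces, since that is what actually goes inside the brackets. The sketch below also shows one way to combine two conditions with "&"; the particular thresholds are chosen only for illustration.

```r
numeric.vector <- c(1, 2, 3, 4, 5)

# The comparison returns a logical vector of the same length
numeric.vector < 3
## [1]  TRUE  TRUE FALSE FALSE FALSE

# Conditions can be combined element-wise with & (and) or | (or)
numeric.vector[numeric.vector > 1 & numeric.vector < 5]
## [1] 2 3 4
```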

Data Frames

A powerful object for analyzing data in R is the data frame. Data frames are lists of vectors of equal length. While each vector contains elements of a single type, the vectors that make up a data frame may be of different types. We'll be working with the "iris" dataset, which is built into R and stored as a data frame.
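
To make the "list of equal-length vectors" idea concrete, here is a small sketch that builds a data frame by hand. The data frame flowers and its values are invented for this example and are not part of iris.

```r
# A small, made-up data frame: three vectors of length 3, each a different type
flowers <- data.frame(
  name = c("rose", "tulip", "daisy"),
  petals = c(5, 6, 34),
  fragrant = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE  # keep name as character in all R versions
)

class(flowers)
## [1] "data.frame"

# A data frame is a list under the hood
is.list(flowers)
## [1] TRUE

# The type of each column
sapply(flowers, class)
```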

Commonly used functions

Data frames have names for each of the vectors they contain (although the default names may be generic, e.g. X1, X2, ...). We can look at the names using the "names" function.

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

The "head" and "tail" functions display the first and last observations of the data frame. We can also specify how many observations to display (the default is 6 observations).

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
head(iris, n = 10)  #displays 10 observations
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

We can use the "summary" function to get a better understanding of the data. This function returns several summary statistics for each column of the data frame.

summary(iris)
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Finally, we might be interested in the number of observations or variables in the data frame. Typically, observations are stored as rows of a data frame while variables correspond to columns. The "nrow" and "ncol" functions give the number of rows and columns, respectively.

nrow(iris)
## [1] 150
ncol(iris)
## [1] 5
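
If we want both numbers at once, the "dim" function returns them together as a vector of length 2, rows first. A brief sketch:

```r
# Rows and columns in a single call
dim(iris)
## [1] 150   5

# The same information as nrow() and ncol() combined
identical(dim(iris), c(nrow(iris), ncol(iris)))
## [1] TRUE
```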

Subsetting

Sometimes, we want to analyze a particular variable of our dataset. To do this, we need to access a specific vector from our data frame. This is done with data.frame$variable.name. As long as variable.name matches one of the names in names(data.frame), R will give us back the appropriate vector. For instance, the following is a vector of sepal widths for all of the observations in the iris data frame.

sepal_widths <- iris$Sepal.Width
head(sepal_widths)
## [1] 3.5 3.0 3.2 3.1 3.6 3.9

Just as with vectors, we can access specific elements of sepal_widths using "[...]". For instance, the following examples return the sepal width of the 47th observation and the sepal widths for irises of species setosa.

sepal_widths[47]
## [1] 3.8
sepal_widths[iris$Species == "setosa"]
##  [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9
## [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2
## [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3
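
Since the result of such a subset is an ordinary vector, we can pass it straight to other functions. As a small sketch, here is the number of setosa observations and their average sepal width (the name setosa_widths is ours):

```r
sepal_widths <- iris$Sepal.Width

# Keep only the sepal widths of setosa flowers
setosa_widths <- sepal_widths[iris$Species == "setosa"]

length(setosa_widths)
## [1] 50
mean(setosa_widths)
## [1] 3.428
```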

If we want to subset by observations and variables, we use data.frame[..., ...] where the values inside the brackets are either single values or vectors of values. The first entry corresponds to the rows of data.frame and the second to the columns. Leaving either entry blank returns all rows (or all columns). This is best seen through examples.

# The first row and column of iris

iris[1, 1]
## [1] 5.1
# The first row and all columns of iris

iris[1, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
# Rows 2, 7, and 47; columns 1 and 4

iris[c(2, 7, 47), c(1, 4)]
##    Sepal.Length Petal.Width
## 2           4.9         0.2
## 7           4.6         0.3
## 47          5.1         0.2
# All iris observations of species setosa

iris_setosa <- iris[iris$Species == "setosa", ]
head(iris_setosa)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# The first four columns (everything except Species) of iris for
# observations of species setosa

setosa_variables <- iris[iris$Species == "setosa", c(1, 2, 3, 4)]
head(setosa_variables)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
# The same subset, using 1:4 as shorthand for c(1, 2, 3, 4)
setosa_variables <- iris[iris$Species == "setosa", 1:4]
head(setosa_variables)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
# All data for flowers with Sepal.Length less than 5
iris[iris$Sepal.Length < 5, ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 58           4.9         2.4          3.3         1.0 versicolor
## 107          4.9         2.5          4.5         1.7  virginica
# The Species of flowers with Sepal.Length less than 5
iris[iris$Sepal.Length < 5, 5]
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     setosa     setosa     setosa    
## [19] setosa     setosa     versicolor virginica 
## Levels: setosa versicolor virginica
# Equivalently, subset the rows first and then extract Species with $
iris[iris$Sepal.Length < 5, ]$Species
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] setosa     setosa     setosa     setosa     setosa     setosa    
## [19] setosa     setosa     versicolor virginica 
## Levels: setosa versicolor virginica
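
Columns can also be selected by name rather than by position, which tends to be easier to read and does not break if the column order changes. The sketch below shows this alongside a row condition that combines two tests; the variable names are ours and chosen for illustration.

```r
# Select the Species column by name instead of by position
species_by_name <- iris[iris$Sepal.Length < 5, "Species"]
species_by_position <- iris[iris$Sepal.Length < 5, 5]
identical(species_by_name, species_by_position)
## [1] TRUE

# Combine two row conditions with &: short sepals AND species setosa
small_setosa <- iris[iris$Sepal.Length < 5 & iris$Species == "setosa", ]
nrow(small_setosa)
## [1] 20
```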