# R Basics

## Vectors

### Commonly used functions

Data in R is stored in vectors. These are essentially an ordered container for
objects of the **same type** (i.e. numeric, character, logical). We can create a
vector with "c(...)". Here are some examples:

```
numeric.vector <- c(1, 2, 3, 4, 5)
character.vector <- c("a", "b", "cd", "e f", "123", "TRUE")
logical.vector <- c(T, T, F, T, T)
```

There are a few basic functions that can give us important information about the vectors you're working with.

The "class" function tells us what type of data is stored in the vector

```
class(numeric.vector)
```

```
## [1] "numeric"
```

The "length" function tells us how many observations are in the vector

```
length(numeric.vector)
```

```
## [1] 5
```

The "str" function gives us information about both the type of object stored by the vector as well as the number of observations.

```
str(numeric.vector)
```

```
## num [1:5] 1 2 3 4 5
```

When performing exploratory data analysis, it is often sufficient to use "str" in place of the other functions. However, a function like "length" can be useful if we need to set a variable that refers to the number of observations in our dataset.

### Vector operations

Since vectors are the basic units in R, performing operation with them works quite smoothly. However, we need to be careful about how R tries to interpret certain cases. Performing any of the basic operations with two numeric vectors of the same length is done element wise. Here are some examples:

```
numeric.vector.b <- c(5, 6, 7, 8, 9)
numeric.vector + numeric.vector.b
```

```
## [1] 6 8 10 12 14
```

```
numeric.vector - numeric.vector.b
```

```
## [1] -4 -4 -4 -4 -4
```

```
numeric.vector * numeric.vector.b
```

```
## [1] 5 12 21 32 45
```

```
numeric.vector/numeric.vector.b
```

```
## [1] 0.2000 0.3333 0.4286 0.5000 0.5556
```

We can also multiply by a scalar or add a constant by performing the "*" or "+" operations using a vector of any length along with a vector of length 1.

```
2 * numeric.vector
```

```
## [1] 2 4 6 8 10
```

```
5 + numeric.vector
```

```
## [1] 6 7 8 9 10
```

Beyond these operations, things start to get sticky. When we try to perform any
of the operations using vectors of different lengths, R will do so without
throwing a warning provided the larger vector is a multiple of the smaller. This
**should be avoided** since it is confusing for other programers and possibly
even future you.

```
vector.a <- c(1, 2)
vector.b <- c(1, 2, 3, 4, 5, 6)
vector.c <- c(1, 2, 3, 4, 5)
# R will happily perform this computations, but is should be avoided
vector.b - vector.a
```

```
## [1] 0 0 2 2 4 4
```

```
# R will happily perform this computations, but is should be avoided
vector.b * vector.a
```

```
## [1] 1 4 3 8 5 12
```

```
# R will warn you, but still perform the calculation
vector.c + vector.a
```

```
## Warning: longer object length is not a multiple of shorter object length
```

```
## [1] 2 4 4 6 6
```

FInally, we note that many of the built in functions for R are performed element wise on vectors. For example, calling the function "sin" with the argument numeric.vector will return a vector giving sin(x) for each entry in numeric.vector.

```
sin(numeric.vector)
```

```
## [1] 0.8415 0.9093 0.1411 -0.7568 -0.9589
```

### Subsetting

We can access different elements of vectors using bracket notation
("**vector**[...]"). R indexes vectors starting with 1, i.e. **vector**[1]
returns the first element of **vector**. For example, the fourth element of
numeric.vector and the second element of logical.vector are given by:

```
numeric.vector[4]
```

```
## [1] 4
```

```
logical.vector[2]
```

```
## [1] TRUE
```

We can access multiple elements of a vector by using a numeric vector or logical vector (indices that match the TRUE values get returned) inside the brackets. For example, the second, fourth, and fifth elements of character.vector:

```
character.vector[c(2, 4, 5)]
```

```
## [1] "b" "e f" "123"
```

```
character.vector[c(F, T, F, T, T)]
```

```
## [1] "b" "e f" "123"
```

We need to be careful when subsetting using a logical vector. If the vector you being used to subset the data is shorter than the data vector, R will try to be smart and multiply the logical vector until it is the same length as the data vector. This probably makes more sense through an example.

```
subset.idcs <- c(T, F)
numeric.vector[subset.idcs]
```

```
## [1] 1 3 5
```

Instead of throwing an error about the fact that subset.idcs and numeric.idcs
are different lengths, R duplicates the vector subset.idcs to be (T,F,T,F,T).
This is often confusing for other programers as well as yourself and **should be
avoided**. Instead, you should use a logical vector that is equal in length to
the vector you are trying to subset. This practice is very handy for many data
analysis tasks. For instance, if we wanted all the elements from numeric.vector
that are less than 3, we write

```
numeric.vector[numeric.vector < 3]
```

```
## [1] 1 2
```

## Data Frames

A powerful object for analyzing data in R is the data frame. Data frames are
lists of vectors of equal length. While each vector contains elements of the
**same type**, the vectors that make up a data frame may be **different types**.
We'll be working with the "iris" dataset, which is built into R and stored as a
data frame.

### Commonly used functions

Data frames have names for each of the vectors they contain (although the default names may be generic eg. X1, X2, ...). We can look at the names using the "names" function.

```
names(iris)
```

```
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
```

The "head" and "tail" functions display the first and last observations of the data frame. We can also specify how many observations to display (the default is 6 observations).

```
head(iris)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

```
tail(iris)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
```

```
head(iris, n = 10) #displays 10 observations
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
```

We can use the "summary" function to get a better understanding of the data. This function returns several summary statistics for each column of the data frame.

```
summary(iris)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1
## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3
## Median :5.80 Median :3.00 Median :4.35 Median :1.3
## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2
## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8
## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
```

Finally, we might be interested in the number of observations or variables in the data frame. Typically, observations are stored as rows of a data frame while variables correspond to columns. the "nrow" and "ncol" functions give the number of rows and columns respectively.

```
nrow(iris)
```

```
## [1] 150
```

```
ncol(iris)
```

```
## [1] 5
```

### Subsetting

Sometimes, we want to analyze a particular variable of our dataset. To do this,
we need to access a specific vector from our data frame. This is done with
**data.frame**$**variable.name**. For instance, the following is a vector of
sepal widths for all of the observations in the iris data frame. As long as
**variable.name** matches with one of the variables in names(**data.frame**), R
will give us back the appropriate vector.

```
sepal_widths <- iris$Sepal.Width
head(sepal_widths)
```

```
## [1] 3.5 3.0 3.2 3.1 3.6 3.9
```

Just as with vectors, we can access specific elements of sepal_widths using "[...]". For instance, the following examples return the sepal width of the 47th observation and the sepal widths for irises of species setosa.

```
sepal_widths[47]
```

```
## [1] 3.8
```

```
sepal_widths[iris$Species == "setosa"]
```

```
## [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9
## [18] 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2
## [35] 3.1 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3
```

If we want to subset by observations and variables, we use **data.frame**[...,
...] where the values inseide the brackets are either single values or a vector
of values. The first entry corresponds to the rows of **data.frame** and the
second to the columns. Leaving either of these values blank will return all rows
(or columns). This is best seen through examples.

```
# The first row and column of iris
iris[1, 1]
```

```
## [1] 5.1
```

```
# The first row and all columns of iris
iris[1, ]
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
```

```
# Rows 2, 7, and 47; columns 1 and 4
iris[c(2, 7, 47), c(1, 4)]
```

```
## Sepal.Length Petal.Width
## 2 4.9 0.2
## 7 4.6 0.3
## 47 5.1 0.2
```

```
# All iris observations of species setosa
iris_setosa <- iris[iris$Species == "setosa", ]
head(iris_setosa)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
```

```
# The first four columns (everything except Species) of iris for
# observations of species setosa
setosa_variables <- iris[iris$Species == "setosa", c(1, 2, 3, 4)]
head(setosa_variables)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
```

```
setosa_variables <- iris[iris$Species == "setosa", 1:4]
head(setosa_variables)
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
```

```
# All data for flowers with Sepal.Length less than 5
iris[iris$Sepal.Length < 5, ]
```

```
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 58 4.9 2.4 3.3 1.0 versicolor
## 107 4.9 2.5 4.5 1.7 virginica
```

```
# The Species of flowers with Sepal.Length less than 5
iris[iris$Sepal.Length < 5, 5]
```

```
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa versicolor virginica
## Levels: setosa versicolor virginica
```

```
iris[iris$Sepal.Length < 5, ]$Species
```

```
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa versicolor virginica
## Levels: setosa versicolor virginica
```