
Apply functions in R

R's apply functions are an efficient way to perform repeated operations on the vectors that make up your data. There are many variants of the "apply" statement, as well as wrappers that make special cases of these functions easier to use. These notes cover the apply statements and wrappers that are most commonly used in data analysis. Much of what we discuss here carries over to the other apply statements as well.

apply

The first function we'll look at is apply. This function applies a given function to each element of a data structure with defined dimension (i.e. matrices, arrays, data frames). The elements that apply operates on are the rows/columns of matrices and data frames (or the sub-matrices of arrays); a data frame is first coerced to a matrix. In other words, if you want to apply some function to the rows or columns of your data, you should use the apply function. When you call apply, you must supply the object you want to operate on (X), the dimension(s) you would like to apply the function over (MARGIN), and the function you would like to perform (FUN). There are a few things to note about these arguments before proceeding to our examples.

  • X must have dimensions, that is, dim(X) must not be NULL. You can easily check this by running dim(X). Note that plain vectors and lists do not satisfy this requirement.

  • For the MARGIN argument, 1 indicates rows and 2 indicates columns. If you are using higher dimensional arrays, integers that correspond to any of the other dimensions are acceptable as well.

  • You may specify additional arguments to your function following the FUN argument.
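These requirements can be checked directly. The following is a minimal sketch using made-up objects (v, m and arr are names chosen for this illustration, not part of the examples in these notes). It shows that a plain vector has no dimensions, that apply refuses it, and that MARGIN may refer to the third dimension of an array or to several dimensions at once.

```r
# hypothetical objects, for illustration only
v <- 1:6
m <- matrix(1:6, nrow = 2)
arr <- array(1:24, dim = c(2, 3, 4))

dim(v)  # NULL: a plain vector has no dimensions
dim(m)  # 2 3

# apply stops with an error when X has no dimensions; tryCatch turns
# that error into a marker string so the script keeps running
vector.result <- tryCatch(apply(v, 1, sum), error = function(e) "error")

# MARGIN = 3 applies sum to each of the four 2 x 3 slices of the array
slice.sums <- apply(arr, 3, sum)

# MARGIN can also be a vector: c(1, 2) keeps rows and columns and sums
# over the third dimension
cell.sums <- apply(arr, c(1, 2), sum)
```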

apply returns the vector, matrix, or list that results from applying FUN to each row or column, as selected by MARGIN. Whenever possible, apply simplifies its result into a vector or matrix. If FUN returns objects with different lengths when applied to the various rows/columns, the returned object will be a list. Let's look at a few examples.

sample.matrix <- matrix(sample(1:15, 15), ncol = 5)
sample.matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   15    8   12   10    2
## [2,]    7   11    5    6   14
## [3,]    3    4   13    1    9
# basic apply statements that calculate the mean of rows and columns
row.means <- apply(sample.matrix, 1, mean)
col.means <- apply(sample.matrix, 2, mean)
row.means
## [1] 9.4 8.6 6.0
col.means
## [1]  8.333  7.667 10.000  5.667  8.333
# faster way to accomplish the same thing
row.means <- rowMeans(sample.matrix)
col.means <- colMeans(sample.matrix)
row.means
## [1] 9.4 8.6 6.0
col.means
## [1]  8.333  7.667 10.000  5.667  8.333

A few of the operations that statisticians use frequently (sums and means of rows and columns) have dedicated functions, such as rowMeans and colMeans above, that are faster than the equivalent apply statement. However, apply works with any function and lets us pass additional arguments through to it.

# you can pass in additional arguments after defining the function. Any
# arguments not passed in will use the function's default value

row.means.trimmed <- apply(sample.matrix, 1, mean, trim = 0.35)
col.means.trimmed <- apply(sample.matrix, 2, mean, trim = 0.35)
row.means.trimmed
## [1] 10.000  8.000  5.333
col.means.trimmed
## [1]  7  8 12  6  9
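The same mechanism works for any argument that FUN accepts. As a small sketch with a made-up matrix (na.matrix is a name chosen for this illustration), passing na.rm = TRUE through apply tells mean to skip missing values:

```r
# hypothetical 2 x 3 matrix with one missing value;
# its rows are (1, NA, 5) and (2, 4, 6)
na.matrix <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2)

# without na.rm, the mean of any row that contains an NA is NA
row.means.default <- apply(na.matrix, 1, mean)

# na.rm = TRUE is handed on to mean() for every row
row.means.narm <- apply(na.matrix, 1, mean, na.rm = TRUE)
```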

You can also define your own functions for the FUN argument. User-defined functions take a row or a column as their argument, depending on the value you specified for MARGIN.

overall.mean <- mean(sample.matrix)

row.value <- apply(sample.matrix, 1, function(row) {
    value <- (mean(row) - overall.mean)/sd(row)
    return(value)
})

col.value <- apply(sample.matrix, 2, function(col) {
    value <- (mean(col) - overall.mean)/sd(col)
    return(value)
})

row.value
## [1]  0.2870  0.1587 -0.4082
col.value
## [1]  0.05455 -0.09492  0.45883 -0.51745  0.05530
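Since the two anonymous functions above are identical, one option is to define the function once and pass the overall mean as an additional argument. The sketch below rebuilds the matrix printed earlier so that it runs on its own; scaled.offset and its center argument are names chosen for this illustration.

```r
# the same values as the sample.matrix printed above
sample.matrix <- matrix(c(15, 7, 3, 8, 11, 4, 12, 5, 13, 10, 6, 1, 2, 14, 9),
    ncol = 5)

# illustrative helper: distance of a vector's mean from `center`,
# measured in units of the vector's standard deviation
scaled.offset <- function(x, center) {
    (mean(x) - center)/sd(x)
}

overall.mean <- mean(sample.matrix)

# the same function serves rows and columns; only MARGIN changes
row.value <- apply(sample.matrix, 1, scaled.offset, center = overall.mean)
col.value <- apply(sample.matrix, 2, scaled.offset, center = overall.mean)
```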

As we mentioned, when FUN returns objects of varying lengths, apply will return a list.

param.matrix <- matrix(c(5, 10, 0, 5), ncol = 2)
param.matrix
##      [,1] [,2]
## [1,]    5    0
## [2,]   10    5
# generate a list of length 2: 5 draws from a normal with mean 0 and 10
# draws from a normal with mean 5
apply(param.matrix, 1, function(row) rnorm(row[1], row[2]))
## [[1]]
## [1] -0.2734  0.2565  1.0476  2.1080  0.1782
## 
## [[2]]
##  [1] 5.059 3.139 6.402 6.015 4.699 5.704 4.677 5.341 5.737 4.789
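The examples so far return either a single value or a varying number of values per row. In the remaining case, where FUN returns a vector of the same length (greater than one) for every row or column, apply simplifies the result to a matrix in which each column holds the result for one row or column of X. A brief sketch, using a small made-up matrix m:

```r
# hypothetical 3 x 2 matrix; its rows are (1, 4), (2, 5) and (3, 6)
m <- matrix(1:6, nrow = 3)

# range() returns two values (minimum and maximum) for each row
row.ranges <- apply(m, 1, range)

# the result is 2 x 3: one column per row of m
dim(row.ranges)

# t() puts the rows of m back into rows if that layout is preferred
row.ranges.by.row <- t(row.ranges)
```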

sapply (lapply)

The next functions that we will look at are similar to apply in that they are used to loop over a data structure and apply some function to the elements of that data structure. Unlike apply, they operate on data structures without defined dimension (i.e. lists and vectors). These functions are sapply and lapply. In fact, sapply is just a wrapper for lapply that tries to simplify the returned value into a vector or matrix where possible. We use sapply in our examples with the knowledge that if we wanted to return a list we could use lapply instead. When you call sapply (or lapply), you need to supply it with the data structure that will get looped over (X) and a function to apply to each element of that data structure (FUN). There are a few things to note:

  • Running sapply on a data frame will perform FUN on each of the vectors that comprise it (remember data frames are just lists of vectors of equal length).
  • Running sapply on a matrix or array will coerce the object into a list and then apply FUN to the individual elements.

sample.list <- list(samp1 = rnorm(10, 0), samp2 = rnorm(10, 5), samp3 = rnorm(10, 
    10))

# The objects we call sapply on don't have defined dimension so we don't
# need a MARGIN argument. Other than that the function calls are similar

sapply(sample.list, mean)
##   samp1   samp2   samp3 
## -0.4709  4.4661 10.5416
sapply(sample.list, mean, trim = 0.35)
##   samp1   samp2   samp3 
## -0.3927  4.4029 10.6850
sapply(sample.list, function(list.element) {
    mean(list.element)/sd(list.element)
})
##   samp1   samp2   samp3 
## -0.3283  9.3778  9.5501
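To make the relationship between lapply and sapply concrete, here is a short sketch. It builds its own two-element list with an arbitrary seed so that it runs independently of the output above, and it also shows what happens when sapply is handed a matrix.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
sample.list <- list(samp1 = rnorm(10, 0), samp2 = rnorm(10, 5))

list.result <- lapply(sample.list, mean)    # a named list of length 2
vector.result <- sapply(sample.list, mean)  # a named numeric vector

# when FUN returns two values per element, sapply builds a matrix with
# one column per list element
range.result <- sapply(sample.list, range)

# on a matrix, FUN is applied to each element, not to rows or columns
m <- matrix(1:4, nrow = 2)
squares <- sapply(m, function(x) x^2)  # a vector of length 4, no dim
```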

When we call sapply (or lapply) on a data frame it acts on the vectors that make up the data frame.

# built in R dataset on regional levels of different socioeconomic
# indicators in Switzerland
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
swiss.subset <- swiss[, c("Fertility", "Agriculture", "Education", "Infant.Mortality")]

# let's look at the mean and standard deviation of each of the variables
sapply(swiss.subset, mean)
##        Fertility      Agriculture        Education Infant.Mortality 
##            70.14            50.66            10.98            19.94
sapply(swiss.subset, sd)
##        Fertility      Agriculture        Education Infant.Mortality 
##           12.492           22.711            9.615            2.913
# we can even plot histograms of the data by variable
par(mfrow = c(2, 2))  # this is a nice function to put multiple plots on one page
sapply(swiss.subset, hist)

[Figure: 2 x 2 grid of histograms, one for each variable in swiss.subset]

Unfortunately, the titles of the histograms aren't particularly helpful. You could try to fix this by supplying additional arguments to the hist function:

var.names <- names(swiss.subset)
var.names
## [1] "Fertility"        "Agriculture"      "Education"       
## [4] "Infant.Mortality"
par(mfrow = c(2, 2))
sapply(swiss.subset, hist, xlab = "observed value", main = var.names)

[Figure: the same 2 x 2 grid of histograms, each titled with all four variable names]

Instead of labeling each plot with the correct variable name, R prints every string in var.names on each plot. Not exactly what we were looking for. This happens because any additional arguments we supply are passed unchanged to every call of FUN. In other words, the only inputs to FUN that change are the elements of our data structure. Thankfully, R provides the function mapply to help us get around this.
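The same behavior can be seen without plotting. In the sketch below, vals and labels are made-up objects for illustration: every call of FUN receives the whole labels vector. Looping over positions is one workaround that stays within sapply, although mapply, covered next, is more direct.

```r
vals <- list(a = 1:3, b = 4:6)   # hypothetical data
labels <- c("first", "second")   # one label per element of vals

# each call receives both labels, so length(label) is 2 every time
label.lengths <- sapply(vals, function(x, label) length(label), label = labels)

# looping over positions lets each call pick out its own label
labelled <- sapply(seq_along(vals), function(i) {
    paste(labels[i], sum(vals[[i]]))
})
```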

mapply

The mapply function works much like sapply or lapply. With mapply though, our functions are not limited to a single varying argument. The call for mapply is a little different from the other apply functions. We first specify our function, then we provide the data structures that the function operates on. Our histogram example would look something like this:

var.names <- names(swiss.subset)
par(mfrow = c(2, 2))
mapply(function(variable, var.name) hist(variable, main = var.name, xlab = "observed value"), 
    swiss.subset, var.names)

[Figure: 2 x 2 grid of histograms, each titled with its own variable name]

Notice that the order in which we provide the different variables at the end of our mapply statement needs to match the order of the arguments in the function. Further, mapply is not restricted to only two variables. As long as we make sure the orders correspond, we may use as many variables as we like.
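A small numeric sketch may make the argument matching easier to see than the plotting example does. The inputs here are made up for illustration. It also shows two options of mapply that are not used above: naming the varying arguments, which removes the dependence on order, and MoreArgs, which holds arguments that stay the same for every call.

```r
# two varying arguments, matched to x and y by position
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# naming the arguments removes the dependence on order
tags <- mapply(function(value, label) paste(label, value),
    label = c("a", "b", "c"), value = 1:3)

# MoreArgs is a list of arguments shared by every call
powers <- mapply(function(x, y, power) (x + y)^power, 1:3, 4:6,
    MoreArgs = list(power = 2))
```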

Apply wrappers: by and replicate

Some apply function tasks are so common amongst R users that the language includes wrappers for them. Chief among these are by and replicate. by statements are useful when trying to run functions on various subsets of your data. Consider the "mtcars" dataset that is built into R and suppose we wanted to calculate the average mpg of a car by number of cylinders in the engine.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# the arguments for a by statement are data, subset factor, function in that
# order
by(mtcars$mpg, mtcars$cyl, mean)
## mtcars$cyl: 4
## [1] 26.66
## -------------------------------------------------------- 
## mtcars$cyl: 6
## [1] 19.74
## -------------------------------------------------------- 
## mtcars$cyl: 8
## [1] 15.1
# we can use an sapply statement to look at the average of several variables by
# number of cylinders
sapply(mtcars[, c("mpg", "hp", "wt")], function(var) by(var, mtcars$cyl, mean))
##     mpg     hp    wt
## 4 26.66  82.64 2.286
## 6 19.74 122.29 3.117
## 8 15.10 209.21 3.999
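Two related calls are worth sketching here. R's documentation describes by as a wrapper for tapply applied to data frames, so for a single vector tapply gives the same group means as a plain named vector. by can also take a data frame directly, in which case the function receives one sub data frame per group. The variable names below are chosen for this illustration.

```r
# tapply returns a named vector rather than a "by" object
mpg.by.cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# with a data frame as the first argument, FUN sees the rows of each
# group as a data frame, so colMeans summarizes several columns at once
means.by.cyl <- by(mtcars[, c("mpg", "hp", "wt")], mtcars$cyl, colMeans)
means.by.cyl
```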

replicate statements are useful for simulations and fairly simple to implement. These statements evaluate a given expression a specified number of times and collect the results. For instance, to produce 10 simulations, each sampling 25 random uniform variables, we could run the following. Each column of the result holds one simulation.

replicate(10, runif(25))
##            [,1]    [,2]    [,3]    [,4]    [,5]   [,6]    [,7]    [,8]
##  [1,] 0.5748392 0.28749 0.00349 0.44652 0.13699 0.3699 0.51512 0.69841
##  [2,] 0.0010590 0.93278 0.47812 0.87892 0.35814 0.3688 0.77206 0.40563
##  [3,] 0.0001907 0.89475 0.87006 0.78308 0.70703 0.3261 0.42233 0.14024
##  [4,] 0.3188565 0.47157 0.86951 0.13302 0.91671 0.2523 0.86594 0.86434
##  [5,] 0.9853210 0.81585 0.45488 0.40766 0.90607 0.6450 0.47121 0.12542
##  [6,] 0.7024356 0.90947 0.46666 0.53455 0.69911 0.2381 0.62024 0.94814
##  [7,] 0.2249691 0.66061 0.70574 0.47489 0.73907 0.2238 0.55454 0.61841
##  [8,] 0.1353921 0.20309 0.27294 0.05341 0.42293 0.9872 0.47972 0.49680
##  [9,] 0.4470449 0.44555 0.24052 0.05491 0.08967 0.6046 0.17333 0.27853
## [10,] 0.9385163 0.67798 0.95618 0.99885 0.45088 0.1313 0.97006 0.07001
## [11,] 0.0982774 0.50816 0.56374 0.94642 0.49484 0.7884 0.77314 0.63851
## [12,] 0.2711856 0.86542 0.54552 0.20941 0.12294 0.7534 0.63229 0.27112
## [13,] 0.4084631 0.37047 0.48257 0.78127 0.80352 0.3549 0.08938 0.35646
## [14,] 0.3299767 0.99052 0.89554 0.68312 0.74825 0.3454 0.71481 0.37773
## [15,] 0.9147518 0.37893 0.17761 0.18310 0.35805 0.6820 0.26136 0.86034
## [16,] 0.4389611 0.11486 0.20609 0.37435 0.74195 0.0718 0.76355 0.30815
## [17,] 0.5641979 0.12247 0.86493 0.22845 0.53164 0.7486 0.23534 0.70235
## [18,] 0.9340997 0.11293 0.68794 0.76306 0.68938 0.4741 0.77426 0.29306
## [19,] 0.7925815 0.65003 0.09922 0.08144 0.46391 0.3626 0.91598 0.03533
## [20,] 0.5656450 0.46962 0.49972 0.04007 0.04784 0.8274 0.50144 0.48898
## [21,] 0.1160394 0.30752 0.92019 0.57434 0.68835 0.1342 0.18251 0.36334
## [22,] 0.0960271 0.66276 0.14329 0.08054 0.98979 0.4877 0.93420 0.96727
## [23,] 0.0618901 0.21379 0.27903 0.32398 0.71879 0.8374 0.35423 0.78674
## [24,] 0.3905563 0.66687 0.95751 0.20393 0.94135 0.7348 0.14536 0.76835
## [25,] 0.8456175 0.08976 0.15553 0.13075 0.74546 0.6280 0.23254 0.62025
##          [,9]   [,10]
##  [1,] 0.30124 0.68865
##  [2,] 0.67112 0.72139
##  [3,] 0.56038 0.99757
##  [4,] 0.23748 0.59152
##  [5,] 0.07987 0.50159
##  [6,] 0.63697 0.91020
##  [7,] 0.71425 0.58123
##  [8,] 0.07934 0.18153
##  [9,] 0.91799 0.85539
## [10,] 0.77144 0.37545
## [11,] 0.43682 0.64109
## [12,] 0.55022 0.43939
## [13,] 0.74331 0.58395
## [14,] 0.01913 0.72981
## [15,] 0.20714 0.95380
## [16,] 0.24238 0.55779
## [17,] 0.16606 0.59220
## [18,] 0.13465 0.03029
## [19,] 0.74797 0.40250
## [20,] 0.11388 0.11504
## [21,] 0.04791 0.58220
## [22,] 0.69066 0.83256
## [23,] 0.00699 0.73928
## [24,] 0.49927 0.18991
## [25,] 0.91362 0.01659
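In practice, each replicate is often reduced to a summary statistic. The sketch below fixes an arbitrary seed so that it is reproducible, and it also shows the simplify argument of replicate, which controls whether the results are combined into a vector or matrix or kept as a list.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# each replicate returns one number, so the result is a vector of length 10
sim.means <- replicate(10, mean(runif(25)))

# simplify = FALSE keeps the replicates as a list instead of a matrix
sim.list <- replicate(10, runif(25), simplify = FALSE)
```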