Apply functions in R
R's apply functions are an efficient way to perform repeated operations on vectors of your data. There are many variants of the "apply" statement, as well as wrappers that make special cases of these functions easier to use. These notes cover the apply statements and wrappers that are most commonly used in data analysis. Much of what we discuss here carries over to the other apply statements as well.
apply
The first function we'll look at is apply. This function applies a given function to each element of a data structure with defined dimension (i.e. matrices, arrays, data frames). The elements that apply operates on are the rows/columns of matrices and data frames (or the sub-matrices of arrays). In other words, if you want to apply some function to the rows or columns of your data, you should use the apply function. When you call apply, you must supply the function with the object you want to operate on (X), the dimension(s) you would like to apply the function over (MARGIN), and the function you would like to perform (FUN). There are a few things to note about these arguments before proceeding to our examples.
- X must have positive dimension. You can easily check this by running dim(X). Note that vectors and lists do not satisfy this requirement.
- For the MARGIN argument, 1 indicates rows and 2 indicates columns. If you are using higher-dimensional arrays, integers that correspond to any of the other dimensions are acceptable as well.
- You may specify additional arguments to your function following the FUN argument.
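The three points above can be checked directly in a small session. The following is a minimal sketch; the object names v, m, arr and m.na are made up for illustration and are not used elsewhere in these notes.

```r
# A plain vector has no dim attribute, so apply() cannot work on it
v <- c(1, 2, 3)
dim(v)  # NULL

# A matrix has one entry in dim() per margin
m <- matrix(1:6, nrow = 2)
dim(m)  # 2 3

# Calling apply() on the vector raises an error; capture it instead of stopping
vector.result <- tryCatch(apply(v, 1, sum), error = function(e) "error")

# For a three-dimensional array, MARGIN = 3 applies FUN to each 2 x 3 slice
arr <- array(1:24, dim = c(2, 3, 4))
slice.sums <- apply(arr, 3, sum)
slice.sums  # 21 57 93 129

# Arguments given after FUN are passed on to it, here na.rm = TRUE for sum()
m.na <- matrix(c(1, NA, 3, 4), nrow = 2)
row.sums <- apply(m.na, 1, sum, na.rm = TRUE)
row.sums  # 4 4
```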
apply will return the vector, matrix or list that results from applying FUN to each row or column as defined by MARGIN. Whenever possible, apply will try to simplify its result into a matrix or vector. If FUN returns objects with different lengths when applied to the various rows/columns, the returned object will be a list. Let's look at a few examples.
sample.matrix <- matrix(sample(1:15, 15), ncol = 5)
sample.matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   15    8   12   10    2
## [2,]    7   11    5    6   14
## [3,]    3    4   13    1    9
# basic apply statements that calculate the mean of rows and columns
row.means <- apply(sample.matrix, 1, mean)
col.means <- apply(sample.matrix, 2, mean)
row.means
## [1] 9.4 8.6 6.0
col.means
## [1] 8.333 7.667 10.000 5.667 8.333
# faster way to accomplish the same thing
row.means <- rowMeans(sample.matrix)
col.means <- colMeans(sample.matrix)
row.means
## [1] 9.4 8.6 6.0
col.means
## [1] 8.333 7.667 10.000 5.667 8.333
A few functions that statisticians use frequently (sum and mean) have dedicated, slightly faster implementations (rowSums, colSums, rowMeans and colMeans) that do the same thing as the corresponding apply statement. However, using an apply statement allows us to pass in additional arguments to FUN.
# you can pass in additional arguments after defining the function. Any
# arguments not passed in will use the function's default value
row.means.trimmed <- apply(sample.matrix, 1, mean, trim = 0.35)
col.means.trimmed <- apply(sample.matrix, 2, mean, trim = 0.35)
row.means.trimmed
## [1] 10.000 8.000 5.333
col.means.trimmed
## [1] 7 8 12 6 9
You can also define your own functions for the FUN argument. User-defined functions take a row or a column as their argument, depending on the value you specified for MARGIN.
overall.mean <- mean(sample.matrix)
row.value <- apply(sample.matrix, 1, function(row) {
    value <- (mean(row) - overall.mean)/sd(row)
    return(value)
})
col.value <- apply(sample.matrix, 2, function(col) {
    value <- (mean(col) - overall.mean)/sd(col)
    return(value)
})
row.value
## [1] 0.2870 0.1587 -0.4082
col.value
## [1] 0.05455 -0.09492 0.45883 -0.51745 0.05530
As we mentioned, when our functions return objects of varying lengths, apply will return a list.
param.matrix <- matrix(c(5, 10, 0, 5), ncol = 2)
param.matrix
##      [,1] [,2]
## [1,]    5    0
## [2,]   10    5
# generate a list with 5 samples of mean 0 normal variables and 10 samples
# of mean 5 normal variables
apply(param.matrix, 1, function(row) rnorm(row[1], row[2]))
## [[1]]
## [1] -0.2734 0.2565 1.0476 2.1080 0.1782
##
## [[2]]
## [1] 5.059 3.139 6.402 6.015 4.699 5.704 4.677 5.341 5.737 4.789
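For contrast with the list result above, here is a small sketch of the case where FUN returns a vector of the same length for every row. The matrix m is a made-up example. In that case apply simplifies to a matrix, with one column per row of the input, so the result may look transposed relative to what you expect.

```r
m <- matrix(1:6, nrow = 3)

# range() returns two values (minimum and maximum) for every row of m
row.ranges <- apply(m, 1, range)
dim(row.ranges)  # 2 3: each row of m became a column of the result

# Transpose to get one row of output per row of input
row.ranges.by.row <- t(row.ranges)
dim(row.ranges.by.row)  # 3 2
```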
sapply (lapply)
The next functions that we will look at are similar to apply in that they are used to loop over a data structure and apply some function to the elements of that data structure. Unlike apply, they operate on data structures with undefined dimension (i.e. lists and vectors). These functions are sapply and lapply. Actually, sapply is just a wrapper for lapply that tries to simplify the returned value into a vector or matrix where possible. We use sapply in our examples with the knowledge that if we wanted to return a list we could use lapply instead. When you call sapply (or lapply), you need to supply it with the data structure that will get looped over (X) and a function to apply to each element of that data structure (FUN). There are a few things to note:
- Running sapply on a data frame will perform FUN on each of the vectors that comprise it (remember, data frames are just lists of vectors of equal length).
- Running sapply on a matrix or array will coerce the object into a list and then apply FUN to the individual elements.
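The difference between lapply and sapply, and the two notes above, can be illustrated with a short sketch. The objects values, df and m are hypothetical and exist only for this illustration.

```r
values <- list(a = 1:3, b = 4:6)

# lapply always returns a list, one element per input element
list.result <- lapply(values, sum)

# sapply simplifies the same result into a named vector
simplified <- sapply(values, sum)
simplified  # a = 6, b = 15

# On a data frame, FUN is applied to each column
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
col.means <- sapply(df, mean)
col.means  # x = 2.0, y = 3.5

# On a matrix, FUN sees one element at a time and the matrix shape is dropped
m <- matrix(1:4, nrow = 2)
squares <- sapply(m, function(x) x^2)
squares  # 1 4 9 16
```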
sample.list <- list(samp1 = rnorm(10, 0), samp2 = rnorm(10, 5), samp3 = rnorm(10,
10))
# The objects we call sapply on don't have defined dimension so we don't
# need a MARGIN argument. Other than that the function calls are similar
sapply(sample.list, mean)
## samp1 samp2 samp3
## -0.4709 4.4661 10.5416
sapply(sample.list, mean, trim = 0.35)
## samp1 samp2 samp3
## -0.3927 4.4029 10.6850
sapply(sample.list, function(list.element) {
    mean(list.element)/sd(list.element)
})
## samp1 samp2 samp3
## -0.3283 9.3778 9.5501
When we call sapply (or lapply) on a data frame, it acts on the vectors that make up the data frame.
# built in R dataset on regional levels of different socioeconomic
# indicators in Switzerland
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
swiss.subset <- swiss[, c("Fertility", "Agriculture", "Education", "Infant.Mortality")]
# let's look at the mean and standard deviation of each of the variables
sapply(swiss.subset, mean)
## Fertility Agriculture Education Infant.Mortality
## 70.14 50.66 10.98 19.94
sapply(swiss.subset, sd)
## Fertility Agriculture Education Infant.Mortality
## 12.492 22.711 9.615 2.913
# we can even plot histograms of the data by variable
par(mfrow = c(2, 2)) # this is a nice function to put multiple plots on one page
sapply(swiss.subset, hist)
Unfortunately, the titles of the histograms aren't particularly helpful. You could try to adjust this by supplying additional arguments to the hist function:
var.names <- names(swiss.subset)
var.names
## [1] "Fertility" "Agriculture" "Education"
## [4] "Infant.Mortality"
par(mfrow = c(2, 2))
sapply(swiss.subset, hist, xlab = "observed value", main = var.names)
Instead of labeling each plot with the correct variable name, R prints every string in var.names for each plot. Not exactly what we were looking for. This is because the additional arguments we pass to FUN are constant: the whole of var.names is handed to every call. In other words, the only input to FUN that changes from call to call is the element of our data structure. Thankfully, R provides the function mapply to help us get around this.
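The same behaviour can be seen without any plotting. In the sketch below, values, labels and the argument name label are hypothetical. The function simply reports how many labels it received on each call.

```r
values <- list(a = 1:3, b = 4:6)
labels <- c("first", "second")

# Every call to FUN receives the complete labels vector, not one label per element
labels.seen <- sapply(values, function(x, label) length(label), label = labels)
labels.seen  # a = 2, b = 2
```

Each call sees both labels, which is why every histogram above was given every name.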
mapply
The mapply function works much like sapply or lapply. With mapply, though, our functions are not limited to a single varying argument. The call for mapply is a little different from the other apply functions. We first specify our function, then we provide the data structures that the function operates on. Our histogram example would look something like this:
var.names <- names(swiss.subset)
par(mfrow = c(2, 2))
mapply(function(variable, var.name) hist(variable, main = var.name, xlab = "observed value"),
swiss.subset, var.names)
Notice that the order in which we provide the different variables at the end of our mapply statement needs to be the same as the order in which they appear in the function's argument list. Further, mapply is not restricted to only two variables. As long as we make sure the orders correspond, we may use as many variables as we like.
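A small numeric sketch may make the pairing of arguments easier to see. All object names here (base, exponent, sample.sizes, sample.means) are made up for illustration. The second call shows that arguments can also be matched by name, and the third uses mapply's MoreArgs argument for values that should stay constant across calls.

```r
# Element-wise pairing: the i-th call uses base[i] and exponent[i]
base <- c(2, 3, 4)
exponent <- c(3, 2, 1)
powers <- mapply(function(b, e) b^e, base, exponent)
powers  # 8 9 4

# Naming the arguments removes the dependence on their order
powers.named <- mapply(function(b, e) b^e, e = exponent, b = base)

# MoreArgs holds arguments that are the same for every call (here sd = 1)
sample.sizes <- c(5, 10)
sample.means <- c(0, 5)
samples <- mapply(rnorm, n = sample.sizes, mean = sample.means,
                  MoreArgs = list(sd = 1))
sapply(samples, length)  # 5 10
```

Because the two samples have different lengths, the last call returns a list rather than a matrix.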
Apply wrappers: by and replicate
Some apply function tasks are so common among R users that the language includes wrappers for them. Chief among these are by and replicate. by statements are useful when trying to run functions on various subsets of your data. Consider the "mtcars" dataset that is built into R, and suppose we wanted to calculate the average mpg of a car by number of cylinders in the engine.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# the arguments for a by statement are data, subset factor, function, in
# that order
by(mtcars$mpg, mtcars$cyl, mean)
## mtcars$cyl: 4
## [1] 26.66
## --------------------------------------------------------
## mtcars$cyl: 6
## [1] 19.74
## --------------------------------------------------------
## mtcars$cyl: 8
## [1] 15.1
# we can use an sapply statement to look at the average of several variables
# by number of cylinders
sapply(mtcars[, c("mpg", "hp", "wt")], function(var) by(var, mtcars$cyl, mean))
##     mpg     hp    wt
## 4 26.66  82.64 2.286
## 6 19.74 122.29 3.117
## 8 15.10 209.21 3.999
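R's documentation describes by as a wrapper for tapply applied to data frames, so the same grouped calculation can be sketched with tapply directly. The example below is only an illustration of that relationship, using hypothetical object names; it also shows by operating on several columns at once, in which case each subset is passed to the function as a data frame.

```r
# tapply takes the same three pieces: data, grouping factor, function
mpg.by.cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
names(mpg.by.cyl)  # "4" "6" "8"

# With a data frame as the first argument, by passes each subset of rows to
# the function as a data frame, so colMeans can summarise several columns
cyl.summary <- by(mtcars[, c("mpg", "hp", "wt")], mtcars$cyl, colMeans)
four.cyl <- cyl.summary[[1]]  # first group: the 4-cylinder cars
four.cyl
```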
replicate statements are useful for simulations and fairly simple to implement. These statements produce a specified number of replicates of a given statement. For instance, to produce 10 simulations of sampling 25 random uniform variables, we could run:
replicate(10, runif(25))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.5748392 0.28749 0.00349 0.44652 0.13699 0.3699 0.51512 0.69841
## [2,] 0.0010590 0.93278 0.47812 0.87892 0.35814 0.3688 0.77206 0.40563
## [3,] 0.0001907 0.89475 0.87006 0.78308 0.70703 0.3261 0.42233 0.14024
## [4,] 0.3188565 0.47157 0.86951 0.13302 0.91671 0.2523 0.86594 0.86434
## [5,] 0.9853210 0.81585 0.45488 0.40766 0.90607 0.6450 0.47121 0.12542
## [6,] 0.7024356 0.90947 0.46666 0.53455 0.69911 0.2381 0.62024 0.94814
## [7,] 0.2249691 0.66061 0.70574 0.47489 0.73907 0.2238 0.55454 0.61841
## [8,] 0.1353921 0.20309 0.27294 0.05341 0.42293 0.9872 0.47972 0.49680
## [9,] 0.4470449 0.44555 0.24052 0.05491 0.08967 0.6046 0.17333 0.27853
## [10,] 0.9385163 0.67798 0.95618 0.99885 0.45088 0.1313 0.97006 0.07001
## [11,] 0.0982774 0.50816 0.56374 0.94642 0.49484 0.7884 0.77314 0.63851
## [12,] 0.2711856 0.86542 0.54552 0.20941 0.12294 0.7534 0.63229 0.27112
## [13,] 0.4084631 0.37047 0.48257 0.78127 0.80352 0.3549 0.08938 0.35646
## [14,] 0.3299767 0.99052 0.89554 0.68312 0.74825 0.3454 0.71481 0.37773
## [15,] 0.9147518 0.37893 0.17761 0.18310 0.35805 0.6820 0.26136 0.86034
## [16,] 0.4389611 0.11486 0.20609 0.37435 0.74195 0.0718 0.76355 0.30815
## [17,] 0.5641979 0.12247 0.86493 0.22845 0.53164 0.7486 0.23534 0.70235
## [18,] 0.9340997 0.11293 0.68794 0.76306 0.68938 0.4741 0.77426 0.29306
## [19,] 0.7925815 0.65003 0.09922 0.08144 0.46391 0.3626 0.91598 0.03533
## [20,] 0.5656450 0.46962 0.49972 0.04007 0.04784 0.8274 0.50144 0.48898
## [21,] 0.1160394 0.30752 0.92019 0.57434 0.68835 0.1342 0.18251 0.36334
## [22,] 0.0960271 0.66276 0.14329 0.08054 0.98979 0.4877 0.93420 0.96727
## [23,] 0.0618901 0.21379 0.27903 0.32398 0.71879 0.8374 0.35423 0.78674
## [24,] 0.3905563 0.66687 0.95751 0.20393 0.94135 0.7348 0.14536 0.76835
## [25,] 0.8456175 0.08976 0.15553 0.13075 0.74546 0.6280 0.23254 0.62025
## [,9] [,10]
## [1,] 0.30124 0.68865
## [2,] 0.67112 0.72139
## [3,] 0.56038 0.99757
## [4,] 0.23748 0.59152
## [5,] 0.07987 0.50159
## [6,] 0.63697 0.91020
## [7,] 0.71425 0.58123
## [8,] 0.07934 0.18153
## [9,] 0.91799 0.85539
## [10,] 0.77144 0.37545
## [11,] 0.43682 0.64109
## [12,] 0.55022 0.43939
## [13,] 0.74331 0.58395
## [14,] 0.01913 0.72981
## [15,] 0.20714 0.95380
## [16,] 0.24238 0.55779
## [17,] 0.16606 0.59220
## [18,] 0.13465 0.03029
## [19,] 0.74797 0.40250
## [20,] 0.11388 0.11504
## [21,] 0.04791 0.58220
## [22,] 0.69066 0.83256
## [23,] 0.00699 0.73928
## [24,] 0.49927 0.18991
## [25,] 0.91362 0.01659
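In practice the full matrix is rarely printed; it is usually summarised. The sketch below shows a few ways of doing that. The seed value and the object names are arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed so that the run is repeatable

# Each replicate becomes one column of the result
sims <- replicate(10, runif(25))
dim(sims)  # 25 10

# Summarise each simulated sample with a column-wise function
sim.means <- colMeans(sims)

# If the expression returns a single value, replicate returns a plain vector
direct.means <- replicate(10, mean(runif(25)))
length(direct.means)  # 10

# simplify = FALSE keeps the replicates as a list instead of a matrix
sim.list <- replicate(10, runif(25), simplify = FALSE)
length(sim.list)  # 10
```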