Expected time (1 - 1.5 hours)

Topics Expected Time
1. Lesson Overview and Helpful hint of the week (10min)
2. Using Loops and Unlisting (35min)
3. The Apply Family of Functions (25min)
4. Purrr and Map (10min)

Overview

Concepts covered this week

This lesson will focus on Loops, the Apply family of functions and Map (the Tidyverse version of apply), which are very useful foundational techniques in data manipulation.

All three of these methods are often interchangeable, but each one can be more appropriate for a specific problem than the others, so it is worthwhile understanding all of them.

Loops

Loops are the simplest method of iterating a pre-defined set of values through a variable. Some common applications you might encounter are:

  • inputting a vector into a function

  • iterating the same function over a list, or a series of columns or rows from a dataframe

  • simulating the outcomes of a process using different random inputs
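
As a rough illustration of the third application, the sketch below uses a loop to simulate the average of two dice rolls many times. The names n_sims and sim_means, the choice of dice and the seed are all assumptions made for this example only.

```r
# Hypothetical example: simulate the mean of two dice rolls 1,000 times
set.seed(42)                      # arbitrary seed so the run is repeatable
n_sims <- 1000                    # number of simulated outcomes (assumed)
sim_means <- numeric(n_sims)      # pre-allocate a vector for the results

for (i in seq_len(n_sims)) {
  rolls <- sample(1:6, size = 2, replace = TRUE)  # new random input on each pass
  sim_means[i] <- mean(rolls)                     # store the outcome of this pass
}

mean(sim_means)   # should land near the theoretical value of 3.5
```

Each pass through the loop draws fresh random inputs, so the loop index is only used to decide where the result is stored.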

Apply family of functions

Apply type functions are a series of base R functions that allow you to “apply” a function over the rows or columns of a dataframe or matrix (apply), the elements of a list (lapply/sapply) or element-wise over multiple lists (mapply/Map). There are also other, less commonly used members of this family of functions (tapply/vapply etc…), but we won’t go through these explicitly.

map and purrr

map functions (always lowercase map, not to be confused with the Map mentioned above) are from the Tidyverse package purrr. These are variations of the apply family of functions that have been made to conform to the data philosophy of the Tidyverse. Many of them perform identical tasks to apply type functions but in a more consistent manner (map = lapply, map_chr/map_dbl = sapply, map2/pmap = mapply etc…).

Helpful Hint of The Week: Vignettes! Long form documentation to help understand tricky packages

Vignettes! The long form documentation that provides examples and step by step guides to packages and their functions!

Often we look up a function using “help()” or “?” but the documentation provided is insufficient, explained in a manner that’s too complex, or requires some prior knowledge that we just don’t have. Vignettes are the solution to this problem!

To access vignettes for a specific package, type in browseVignettes("package name").

Let’s look at three sets of vignettes relevant to last week’s lesson. First we will look at the Tidyverse manifesto (which explains the point of the Tidyverse), followed by the two Tidyverse packages that we went through briefly last week, dplyr and tidyr. Finally, we will leave the package name blank to see all available vignettes.

browseVignettes("tidyverse")
browseVignettes("dplyr")
browseVignettes("tidyr")
browseVignettes()

In most vignettes, you can select HTML to view a document with embedded code or the R code that accompanies that documentation.
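
If you would rather stay in the console than open a browser page, base R also has the vignette() function. The sketch below lists the vignettes of the grid package, which is used here only because it ships with R; the object name grid_vignettes is an assumption for this example.

```r
# List the vignettes of one package without opening a browser
grid_vignettes <- vignette(package = "grid")

# The listing is stored in a matrix; show the vignette names and titles
grid_vignettes$results[, c("Item", "Title")]

# Uncomment to open one of the listed vignettes in a viewer
# vignette("grid", package = "grid")
```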


=============== Loops ===============


Loops can substitute a variable with a series of values. For instance we could specify the following function \(Y = 3 + 10\beta^2\) and then print out the values of Y when \(\beta\) varies between 1 and 5.

To do this we would need to set what values \(\beta\) should take:

for(B in 1:5){
  Y = 3 + 10*(B^2)
  print(Y)
}
[1] 13
[1] 43
[1] 93
[1] 163
[1] 253

All that is happening in the code above is that we are replacing the variable “B” with 1, then 2, then 3, then 4 and then 5. We need to “print(Y)” inside the loop because each time we repeat the task of evaluating Y with a new value of B we are overwriting the previous value of Y. So Y ends up only storing the last value.

Y
[1] 253

Often you would want to save all of the values in a vector and then view them all afterwards. To do this we would first create our vector Y, then specify exactly where in that vector we want to store each value. We will simply store the first value in the first element and the second value in the second element etc… But we may find it useful to find more elaborate ways of indexing the vector (store each value in every third element etc…)

Y = vector(mode = "numeric", length = 5)
for(B in 1:5){
  Y[B] = 3 + 10*(B^2)
}

Y
[1]  13  43  93 163 253
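
As a small sketch of the more elaborate indexing mentioned above, the loop below stores each value in every third element of a longer vector. The name Y_spaced and the length of 15 are assumptions chosen for this illustration.

```r
# Hypothetical variation: store each result in every third element
Y_spaced <- rep(NA_real_, 15)     # 15 slots, all empty (NA) to start with

for (B in 1:5) {
  Y_spaced[3 * B] <- 3 + 10 * (B^2)   # positions 3, 6, 9, 12 and 15
}

Y_spaced   # the untouched positions remain NA
```

The index inside the square brackets can be any expression of the loop variable, as long as it evaluates to a valid position in the vector.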

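Below is an 8min video explaining unlisting, followed by an 11min video that goes through some examples of using loops. The task solved in both solutions below is to calculate the average across the five students for each of the 12 rows of cohort_1. Here is the data that you can copy into R Studio to follow the steps I go through in the video: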

student_1 <- c(1, 6, 11, 16, 25, 38, 42, 68, 67, 80, 92, 98)
student_2 <- c(4, 8, 13, 18, 24, 29, 36, 41, 57, 74, 85, 92)
student_3 <- c(9, 17, 22, 35, 42, 56, 59, 62, 73, 83, 88, 90)
student_4 <- c(6, 12, 25, 32, 38, 45, 58, 67, 72, 81, 87, 95)
student_5 <- c(8, 18, 34, 45, 52, 58, 71, 76, 83, 89, 97, 100)

cohort_1 <- data.frame(student_1, student_2, student_3, student_4, student_5)
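
For readers without access to the video, here is a minimal sketch of why the solutions below wrap each row in unlist(). A single row taken from a dataframe is still a dataframe (a list of columns), and unlist() flattens it into a plain vector that mean() can work with. The small dataframe scores is a hypothetical stand-in for cohort_1.

```r
# Hypothetical two-row dataframe standing in for cohort_1
scores <- data.frame(a = c(1, 6), b = c(4, 8), c = c(9, 17))

row_1 <- scores[1, ]        # one row, but still a dataframe
is.data.frame(row_1)        # TRUE

row_1_vec <- unlist(row_1)  # flatten the row into a named numeric vector
is.numeric(row_1_vec)       # TRUE

mean(row_1_vec)             # (1 + 4 + 9) / 3
```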



Solution without using a loop:

average_skill <- vector(mode = "numeric", length = 12)

average_skill[1] <- mean(unlist(cohort_1[1, ]))
average_skill[2] <- mean(unlist(cohort_1[2, ]))
average_skill[3] <- mean(unlist(cohort_1[3, ]))
average_skill[4] <- mean(unlist(cohort_1[4, ]))
average_skill[5] <- mean(unlist(cohort_1[5, ]))
average_skill[6] <- mean(unlist(cohort_1[6, ]))
average_skill[7] <- mean(unlist(cohort_1[7, ]))
average_skill[8] <- mean(unlist(cohort_1[8, ]))
average_skill[9] <- mean(unlist(cohort_1[9, ]))
average_skill[10] <- mean(unlist(cohort_1[10, ]))
average_skill[11] <- mean(unlist(cohort_1[11, ]))
average_skill[12] <- mean(unlist(cohort_1[12, ]))

average_skill
 [1]  5.6 12.2 21.0 29.2 36.2 45.2 53.2 62.8 70.4 81.4 89.8 95.0



Solution using a loop:

loop_solution <- vector(mode = "double", length = 12)

for(x in 1:12){
  loop_solution[x] <- mean(unlist(cohort_1[x, ]))
}

loop_solution
 [1]  5.6 12.2 21.0 29.2 36.2 45.2 53.2 62.8 70.4 81.4 89.8 95.0
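
The loop above hard-codes the number of rows (12). One possible variation, sketched below, reads the number of rows from the dataframe so the same code keeps working if rows are added or removed. The names n_lessons and loop_general are assumptions for this example; the data are copied from above so the block runs on its own.

```r
# Same data as above, repeated so this block is self-contained
student_1 <- c(1, 6, 11, 16, 25, 38, 42, 68, 67, 80, 92, 98)
student_2 <- c(4, 8, 13, 18, 24, 29, 36, 41, 57, 74, 85, 92)
student_3 <- c(9, 17, 22, 35, 42, 56, 59, 62, 73, 83, 88, 90)
student_4 <- c(6, 12, 25, 32, 38, 45, 58, 67, 72, 81, 87, 95)
student_5 <- c(8, 18, 34, 45, 52, 58, 71, 76, 83, 89, 97, 100)
cohort_1 <- data.frame(student_1, student_2, student_3, student_4, student_5)

n_lessons <- nrow(cohort_1)            # number of rows, not typed in by hand
loop_general <- numeric(n_lessons)     # pre-allocate the output vector

for (x in seq_len(n_lessons)) {        # seq_len(12) is the same as 1:12
  loop_general[x] <- mean(unlist(cohort_1[x, ]))
}

loop_general
```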


========== Apply Family of Functions ==========


The apply functions are base R functions that allow you to iterate a single function over an index of an object. So if you have a list you can apply a function to each list element (lapply), if you have a dataframe or matrix you can apply a function over each row or column (apply).

Additionally, you can choose what type of object you would like your output to be. lapply iterates over a list and produces an output that is also a list, whereas sapply does the same thing but returns a vector (if possible) instead of a list.
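
To make the difference in output type concrete, here is a small sketch using a hypothetical list called scores (not part of the lesson data):

```r
# Hypothetical list with two numeric vectors
scores <- list(a = c(1, 6, 11), b = c(4, 8, 13))

as_list   <- lapply(scores, mean)   # a list with one mean per element
as_vector <- sapply(scores, mean)   # the same means simplified to a named vector

is.list(as_list)       # TRUE
is.numeric(as_vector)  # TRUE
as_vector
```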

To continue with the same example we used for loops, let’s try out the apply function.



Instead of keeping track of which element of the object we are working on with an index such as “x” or “y”, we can simply ask three easy questions:

  1. Which object are we interested in?
  2. Which dimension are we working across: rows or columns? (Remember, rows are the 1st dimension and columns are the 2nd.)
  3. What operation would we like to apply to each row or column?

The answer to these questions in our example are:

  1. We are interested in the dataframe called cohort_1
  2. We are working on each row (dimension 1)
  3. We would like to apply the function “mean”

apply_solution <- apply(cohort_1, 1, mean)

apply_solution
 [1]  5.6 12.2 21.0 29.2 36.2 45.2 53.2 62.8 70.4 81.4 89.8 95.0

If we wanted to calculate the column averages instead, we could have simply specified the 2nd dimension instead of the 1st.

apply(cohort_1, 2, mean)
student_1 student_2 student_3 student_4 student_5 
 45.33333  40.08333  53.00000  51.50000  60.91667 

It may be easier to follow if you read the arguments of the apply function backwards (clearly this isn’t the Tidyverse :)). It would read as: calculate the “mean” of each “column” in “cohort_1”.

Below is a summary of the main apply type functions with a brief description of their purpose:

apply
- perform an operation over the rows or columns of a matrix/df.
- returns a vector/matrix

lapply
- perform an operation over each element in a list.
- returns a list

sapply
- perform an operation over each element in a list.
- returns a vector or matrix if simplification is possible, otherwise a list

mapply
- when you have multiple data structures and you want to perform an operation on the first element of each of the objects followed by the second element of each etc…
- returns a vector/matrix by simplification/coercion

Map
- same as mapply
- returns a list
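
As a brief sketch of the last two entries, the example below adds two hypothetical vectors, x and y, element by element with both mapply and Map:

```r
# Hypothetical inputs of equal length
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Note the argument order: the function comes first, then the objects
sum_vector <- mapply(function(a, b) a + b, x, y)   # simplified to a vector
sum_list   <- Map(function(a, b) a + b, x, y)      # kept as a list

sum_vector         # 11 22 33
is.list(sum_list)  # TRUE
```

Unlike apply, lapply and sapply, both of these take the function as their first argument and the data structures afterwards.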

========== purrr and the map functions ==========

purrr is a package of functional programming tools from the Tidyverse. The primary tool in the purrr toolkit is map which is the equivalent of lapply.

The major difference between the base R apply functions and map (besides performance) is consistency. map functions are clear and transparent in what they do.

If you are using map and you want a specific output, you simply add a suffix to map that specifies that:

if you want your output to be a dataframe: map_df

if you want your output to be of type character or double or logical: map_chr/map_dbl/map_lgl

if you want to map over two list objects element-wise (like mapply): map2

Or more than two objects: pmap
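
The suffixes above can be sketched on a small hypothetical list called scores. This assumes the purrr package is installed; all object names are made up for the illustration.

```r
library(purrr)

# Hypothetical list with two numeric vectors
scores <- list(student_1 = c(1, 6, 11), student_2 = c(4, 8, 13))

means_list <- map(scores, mean)       # no suffix: always returns a list
means_dbl  <- map_dbl(scores, mean)   # _dbl: returns a double vector
types_chr  <- map_chr(scores, class)  # _chr: returns a character vector
high_lgl   <- map_lgl(scores, function(s) max(s) > 12)  # _lgl: logical vector

# map2 walks over two objects in parallel, pmap over a list of several
sum_two   <- map2_dbl(c(1, 2, 3), c(10, 20, 30), function(a, b) a + b)
sum_three <- pmap_dbl(
  list(c(1, 2, 3), c(10, 20, 30), c(100, 200, 300)),
  function(a, b, c) a + b + c
)
```

If a function returns something that does not match the suffix, the suffixed variants raise an error rather than silently returning a different type, which is the consistency referred to above.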

To replicate our example of calculating the average of each column above using purrr, we would simply run:

library(tidyverse)

map_dbl(cohort_1, mean)
student_1 student_2 student_3 student_4 student_5 
 45.33333  40.08333  53.00000  51.50000  60.91667 

Unfortunately purrr wants to always work with lists (remember dataframes are lists of vectors) so it does not easily calculate the row by row averages. This is part of trying to get users to conform to the Tidy philosophy of data.
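
That said, one possible workaround is pmap, which treats a dataframe as a list of columns and walks over them in parallel, so that each step receives one row. This is only a sketch of an alternative, not the approach this lesson recommends; the small dataframe scores and the name row_means are assumptions for the example.

```r
library(purrr)

# Hypothetical two-row dataframe standing in for cohort_1
scores <- data.frame(student_1 = c(1, 6),
                     student_2 = c(4, 8),
                     student_3 = c(9, 17))

# Each call receives one value per column; c(...) gathers them into a vector
row_means <- pmap_dbl(scores, function(...) mean(c(...)))

row_means   # one average per row
```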

For the interested reader, this data is not tidy because the “Lesson” variable is contained in the rownames and the “Student” variable is contained in the column names. Lastly, the “Score” variable is not in any single column but rather positioned as coordinates between the Lesson rowname and the student column name.

To fix this we would (1) add the “Lesson” variable using mutate and position the variable as the first column. Then (2) we would make the data long by gathering all of the “Student” variable contained in the column names (except for column one which is our new “Lesson” variable) and assign the corresponding scores to a new variable called “Score”. Finally (3), we would group the data using “group_by” by the “Lesson” variable and then calculate the average of each group using “summarise”.

cohort_1 %>% 
  mutate(Lesson = as.numeric(rownames(.)), .before = student_1) %>% 
  gather(key = "Student", value = "Score", -1) %>% 
  group_by(Lesson) %>% 
  summarise(Average = mean(Score))
# A tibble: 12 x 2
   Lesson Average
    <dbl>   <dbl>
 1      1     5.6
 2      2    12.2
 3      3    21  
 4      4    29.2
 5      5    36.2
 6      6    45.2
 7      7    53.2
 8      8    62.8
 9      9    70.4
10     10    81.4
11     11    89.8
12     12    95
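
The tidyr documentation marks gather as superseded by pivot_longer. As a sketch, the same three steps could be written with pivot_longer as shown below. It uses a small hypothetical dataframe called scores in place of cohort_1, and row_number() in place of the rownames, so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row dataframe standing in for cohort_1
scores <- data.frame(student_1 = c(1, 6),
                     student_2 = c(4, 8),
                     student_3 = c(9, 17))

lesson_means <- scores %>%
  mutate(Lesson = row_number(), .before = 1) %>%      # (1) add Lesson as column one
  pivot_longer(cols = -Lesson,                        # (2) make the data long
               names_to = "Student",
               values_to = "Score") %>%
  group_by(Lesson) %>%                                # (3) average within each Lesson
  summarise(Average = mean(Score))

lesson_means
```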