6. From loops to functions

Multiple ways to iterate in R

Learning objectives


Many paths to the same summit

One of the wonderful things about R is that there are often many ways to achieve the same result. This makes R expressive, and as you practice and expand your R vocabulary, you’ll inevitably find multiple ways to obtain the same results. Some ways may be long and meandering, and others may be more direct and easy to remember. In this module we will practice two approaches towards iteration—the art of not repeating yourself—specifically, for loops1 and functions (functional programming).

For most parallel problems (i.e., the current operation does not depend on the one that came before), functional programming offers significant advantages to for loops. However, both loops and functions are powerful ways to automate processes and workflows. We will review how and when to use each, and their pros and cons.

The core objective of this lesson is to practice harnessing the power of iteration in your workflows, and to demonstrate ways to iterate with functional programming. If you can define a function to work with one object in R, functional programming allows you to scale that function to any number of objects (assuming you have enough computing resources). Together, functional programming and iteration allow automation at scale, and put R a cut above GUI-based data workflows that require “clicking though” an analysis.

In this module, let’s imagine you’re a data manager, and need to provide data to a group of users. We’ll practice iteration on reading and writing data to illustrate a transition from for loops to functional programming.


for loops

“Don’t repeat yourself. It’s not only repetitive, it’s redundant, and people have heard it before.” -Lemony Snicket

The for loop is ubiquitous across programming languages and a fundamental concept that allows us to obey a core concept in programming: don’t repeat yourself. The for loop is a good place to begin a discussion of iteration because it makes iteration very explicit—you can see exactly what is taking place in each iteration, or loop.

A common problem we might face is reading multiple data frames into R. In the /data/gwl folder, we have station data for El Dorado, Placer, and Sacramento counties. We can read these in one by one by copy and pasting code, but we’re repeating ourselves. This may not matter for only 3 counties, but if we were to use all 58 counties with groundwater level data, this would be a non-scaleable approach prone to human error.

library(tidyverse)
library(here)

eldorado <- read_csv(here("data", "gwl", "county", "El Dorado.csv"))
placer   <- read_csv(here("data", "gwl", "county", "Placer.csv"))
sac      <- read_csv(here("data", "gwl", "county", "Sacramento.csv"))

We can replace this code with a for loop and read all of these items into a list2.

# list all files we want to read in
files_in <- fs::dir_ls(here("data", "gwl", "county"))

# initialize a list of defined length
l <- vector("list", length = length(files_in))

# loop over all files and read them into each element of the list
for(i in seq_along(l)){
  l[[i]] <- read_csv(files_in[i])
}

What happened above is that we evaluated the loop first starting with 1 in the place of i, and went to length(l), which is 3, each time placing the integer wherever there is an i in the above loop. Although our index is i, we can use any other unquoted character string, like j, k, or even index — its only purpose is to hold the index that we iterate through.

After the loop evaluates, we can access each element of the list with double bracket [[ notation, subsetting either by an integer index, or a name if one exists. This list doesn’t have names, but we could set them by assigning a vector of names to names(l).

# access first list element - El Dorado county dataframe
l[[1]] 

for loops happen sequentially, and require us to think of code in terms of objects that we iterate through, index by index. This can result in both slower and duplicated (verbose) code. In addition to writing efficient code, a core rule of good programming is to not repeat oneself.

Let’s imagine now that you needed to write each of these data into a separate file, separated by the unique ID of each station (e.g., the SITE_CODE). Your entire loop would look like this:

# initialize a list of defined length
l <- vector("list", length = length(files_in))

# loop over all files and read them into each element of the list
for(i in seq_along(l)){
  l[[i]] <- read_csv(files_in[i])
}

# combine all list elements into a single dataframe
# then split into another list by SITE_CODE.
ldf <- bind_rows(l)
ldf <- split(ldf, ldf$SITE_CODE)

# loop over each list element and write a csv file
fs::dir_create(here("data", "gwl", "site_code"))
# here we make a list of files names ( names(ldf) is a vector )
files_out <- glue::glue("{here('data', 'gwl', 'site_code', names(ldf))}.csv")

for(i in seq_along(ldf)){
  write_csv(ldf[[i]], files_out[i])
}

This is a lot of code to accomplish a relatively standard iterative workflow of reading in a directory of files, combining them, and writing them out. Functional programming can simplify loop-based workflows, run faster, and encourage you to think conceptually about the transformations at play, without worrying about tracking a changing index. Moreover, loops in R are usually unnecessary unless the result of the ith index depends on the i-1 (previous) index.

lapply()

Base R gives us a toolkit for functional programming via the apply family of functions, specifically lapply() and mapply(). The “l” in lapply() stands for “list”, and can be read as “list apply”. The apply functions are designed to iterate over lists, matrices, rows, or columns, and are very flexible. For example, we can simplify the for loops above as:

# read
l <- lapply(files_in, read_csv)

# bind and split by SITE_CODE
ldf <- bind_rows(l)
ldf <- split(ldf, ldf$SITE_CODE)

# write out: requires function (write_csv) 
# with 2 args (x = ldf & y = files_out)
mapply(function(x, y) write_csv(x, y), ldf, files_out) 

We can make this clearer by extracting the anonymous function3 above, assigning it to an identifier, and calling that in mapply():

my_function <- function(x, y){ 
  write_csv(x, y)
}

mapply(my_function, ldf, files_out) 

Notice that we don’t need to initialize a list to store output, or keep track of indices. The emphasis is on the transformation taking place, not index bookkeeping, and setting up looping patterns. The result is the same, and we have a much clearer way to approach the problem.


map()

The map family of functions in the {purrr} package improves on base R’s apply functions with simplified syntax, type-specific output that makes it harder to accidentally create errors, and convenience functions for common operations. When we combine map() with pipes (%>%) we can greatly simplify our code.

First, to mimic what we did with lapply() above:

# read
l <- map(files_in, ~read_csv(.x))

# bind and split by SITE_CODE
ldf <- bind_rows(l)
ldf <- group_split(ldf, SITE_CODE)

# write
walk2(ldf, files_out, ~write_csv(.x, .y)) 

The .x signifies each of the individual elements of passed into the function, and can be thought of as a placeholder for the input. We begin read_csv() with a ~ to indicate the beginning of a function.

We can further simplify our code with map_df() to automatically row bind the list elements into one dataframe. Moreover, we can pipe these statements together to avoid creating the intermediate ldf object.

# read and bind, split, and write
map_df(files_in, ~read_csv(.x)) %>% 
  group_split(SITE_CODE) %>% 
  walk2(files_out, ~write_csv(.x, .y))

Let’s break down what we did above.

  1. We mapped the function read_csv() over the vector of file paths, files_in. Although this function is simple, in practice we can map a large and complex function in the same way. The _df in map_df() means we pass the list otherwise returned by map() into bind_rows(), and thus return one combined dataframe of all the csv files we read instead of a list of dataframes.
  2. We used group_split() to split the combined dataframe by SITE_CODE which returns a list of dataframes ordered by the unique values of the grouping variable. This is identical to base R’s split() except it doesn’t return a named list.
  3. We walk2()ed over the list of dataframes (one for each SITE_CODE), and the files_out vector from above, and wrote a csv for each pair of objects (.x = dataframe and .y = output file path).

Extra information

What would happen if we used map2() instead of walk2() in the code above?

Click for the Answer!

walk2() is a special case of walk() which takes 2 vector inputs instead of 1. The first and second vector inputs are called in the function with .x and .y like so: walk2(input_1, input_2, ~function(.x, .y)). We use walk() instead of map() whenever we want the side effect of the function, like writing a file. We could also use map() here, but it would unnecessarily print each of the 723 dataframes.

You may now we wondering if there is a map3(), map4() and so on. To map over more than 2 inputs at once, check out the “parallel” map purrr::pmap() function, which is similar to base R’s mapply().


You may notice that we used an intermediate object (files_out) from above. We could re-write our chain without this object by creating it within the function call:

# read, bind, and write
map_df(files_in, ~read_csv(.x)) %>% 
  group_split(SITE_CODE) %>% 
  walk(
    ~write_csv(
      .x, 
      paste0(here("data","gwl","site_code"), "/", .x$SITE_CODE[1], ".csv")
    )
  )

The ~, .x, and .y syntax may seem confusing at first, but with some practice it will become easier, and it provides a consistent syntax to express complex ideas about your code. More more importantly, the emphasis is on keeping track of functions rather than creating and managing the scaffolding of for loops.

Taking things one step further, imagine you were:

We could capture these transform steps in a function, store it in our /functions folder as described in the project management module, and in this way, clean up our workspace so we can keep track of the functions applied on our data and keep our scripts short and readable.

Store this in a file /scripts/functions/f_import_clean.R:

# import dataframe, filter to observation wells, convert well 
# depth feet to meters, project to epgs 3310, & export the data
f_import_clean_export <- function(file_in, file_out){
  read_csv(file_in) %>% 
    filter(WELL_USE == "Observation") %>% 
    mutate(well_depth_m = WELL_DEPTH * 0.3048) %>% 
    sf::st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4269) %>% 
    sf::st_transform(3310) %>% 
    sf::st_write(file_out, delete_layer = TRUE)
}

Now we can walk() inputs over this function with ease in our main script without keeping track of loops and indices, or even the function internals, which are neatly stored in their own file.

source(here("scripts/functions/f_import_clean_export.R"))

# create a directory to store results
fs::dir_create(here("results"))

# vectors with function args: input (.x) & output (.y) files
files_in  <- fs::dir_ls(here("data", "gwl", "county"))
files_out <- here("results", str_replace_all(basename(files_in), ".csv", ".shp"))

walk2(files_in, files_out, ~f_import_clean_export(.x, .y))

Now we check the /results directory to verify it is populated with the shapefiles we just wrote.


Additional Resources

Iteration is a core skill that will allow you to scale your workflows from small to large while maintaining reproducibility. Combined with a proficiency in functional programming, you will more easily develop and store functions, declutter your workspace to focus on important transformations, streamline tracking down and fixing bugs, and keep track of data pipelines. Your ability to perform and re-perform arbitrarily complex workflows will exponentially increase.

We recommend the following places to start to learn more about functional programming in R:


Lesson adapted from R for Data Science.


Previous module:
5. Simple Shiny
Next module:
7. Paramaterized Reports


  1. For loops are an example of imperative programming, which differs from functional programming in that it emphasizes the steps to take to change the state of a computer rather than composing and applying functions.↩︎

  2. For a review of R’s list data structure, see the data structures module and Hadley Wickham’s excellent “R for Data Science” chapter on vectors.↩︎

  3. An anonymous function is a function that is not assigned to an identifier. In other words, it is created and used but doesn’t exist in the Gobal Environment as a function you can call by name. The benefit of using anonymous functions is that they allow you to quickly write single-use functions without storing and managing them. For example, in the expression lapply(1:5, function(x) print(x)), the anonymous function is function(x) print(x). To make it a regular, named function, we assign it to a value, like print_value <- function(x) print(x). Then it can be called by name like so: lapply(1:5, print_value).↩︎

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".