An introduction to working with lists using purrr

By Matt Brousil

This week I'm going to walk through some examples of applying functions to every item in one (or more) lists. Basically this gives an overview answering the question: "If I have a list of data, how do I loop through that list manipulating and running analyses on its data without using for() loops?" The goal will be to apply our manipulations and analyses the same way to every item in the list.

We can use the purrr package to replace loops. In my experience there is sometimes a speed tradeoff, but we can gain some very important confidence in the reproducibility and accuracy of our code with purrr. It can help us write more concise code that is structured in very consistent ways. This should be familiar to some degree if you saw Vera Pfeiffer's talk on functions and using lapply().

Today we'll analyze a dataset using several common tools from purrr, starting from reading in data using purrr. We'll do the following:

  1. Read in many CSVs as a list of data frames
  2. Apply a model to the data for each separate data frame
  3. Cover how to handle errors while iterating over lists
  4. Filter the data for each data frame separately, then combine into single data frame
  5. Split a data frame into a new list
  6. Plot and export .png files for many figures all at once

You can access a compressed file here with the data and folder structure for this exercise. Note that the data are taken and modified from here.

1. Reading in data

Load the tidverse, which includes purrr and some other packages:

1library(tidyverse)

Our first step is to read in our data. We have several .csv files of tornado-related data, all structured in a similar way.

Let's look at the names. There's 11 csv files in our data directory.

1dir("data/", pattern = ".csv", full.names = TRUE)
2
3##  [1] "data/2008_torn.csv" "data/2009_torn.csv" "data/2010_torn.csv"
4##  [4] "data/2011_torn.csv" "data/2012_torn.csv" "data/2013_torn.csv"
5##  [7] "data/2014_torn.csv" "data/2015_torn.csv" "data/2016_torn.csv"
6## [10] "data/2017_torn.csv" "data/2018_torn.csv"

Using purrr::map we're able to read them all in at once. Here's how a map statement works:

1map(.x = , # Here we supply the vector or list of values (in this case filenames)
2    .f = ~ ) # Here we supply the function we want to feed those values into

Compare this with lapply(), which is very similar:

1lapply(X = , # Vector or list of values
2       FUN = ) # Function to apply to them

Jenny Bryan has a blog post here that compares purrr functions to other R iteration options if you'd like to read more.

Ok, let's get started mapping:

1# Create vector of all csv filenames
2filenames <- dir("data/", pattern = ".csv", full.names = TRUE)
3
4# For each name in the vector, read it and make a dataframe out of it
5tornado_data_list <- map(.x = filenames,
6                         # .x signifies one item in our list of filenames
7                         .f = ~ read.csv(file = .x,
8                                         stringsAsFactors = FALSE))

Great, now what's the output of our map() statement look like?

 1# str() is messy:
 2str(tornado_data_list)
 3
 4# length() might be helpful. 11 dataframes
 5length(tornado_data_list)
 6
 7## [1] 11
 8
 9# We can look at individual ones
10head(tornado_data_list[[1]])
11
12##   om   yr mo dy       date     time tz st stf stn mag inj fat  loss closs  slat
13## 1  1 2008  1  7 2008-01-07 14:22:00  3 MO  29   0   0   0   0  0.00     0 38.12
14## 2  2 2008  1  7 2008-01-07 14:54:00  3 MO  29   0   0   0   0  0.02     0 38.36
15## 3  3 2008  1  7 2008-01-07 15:30:00  3 IL  17   0   3   5   0  4.00     0 42.39
16## 4  4 2008  1  7 2008-01-07 15:55:00  3 MO  29   0   0   0   0  0.00     0 39.07
17## 5  5 2008  1  7 2008-01-07 16:02:00  3 WI  55   0   3   5   0 13.81     0 42.55
18## 6  6 2008  1  7 2008-01-07 16:39:00  3 WI  55   0   1   0   0  7.93     0 42.62
19##     slon  elat   elon   len wid ns sn sg  f1  f2 f3 f4 fc
20## 1 -93.77 38.12 -93.77  0.07  20  1  1  1 185   0  0  0  0
21## 2 -93.30 38.36 -93.30  0.17  25  1  1  1  15   0  0  0  0
22## 3 -88.83 42.46 -88.60 13.20 100  1  1  1   7 111  0  0  0
23## 4 -91.95 39.07 -91.95  0.30  40  1  1  1   7   0  0  0  0
24## 5 -88.33 42.60 -88.14 10.78 200  1  1  1 127  59  0  0  0
25## 6 -87.87 42.63 -87.82  2.48  75  1  1  1  59   0  0  0  0
26
27# Or all of them
28tornado_data_list

This is great: Instead of writing the same function (read.csv()) 11 times, we just did it once and got the same output essentially, albeit in list form.

2. Iteratively performing analyses

If you had just one of these data frames you might want to run a linear model using its data. We have 11, but we can still run models for each of them using map(). We'll model tornado rating (mag) as a function of estimated property loss (loss):

1map(.x = tornado_data_list,
2    .f = ~  lm(formula = mag ~ loss, data = .x))
3
4## Warning in storage.mode(v) <- "double": NAs introduced by coercion
5
6## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'y'

Whoops! You should have gotten an error. This happens, but when you have large lists and you're iterating over them it can be frustrating to run into an error that shuts down your whole analysis. What if you were using map() to run a function 100 times?

Luckily, purrr has some functions that allow us to continue pushing through past errors and to handle them elegantly. One of these is safely(), which for each list item you iterate over returns both a result and error. If the function runs without an error on a list item then it leaves you a NULL value for error, otherwise the error will be captured there. result will contain the output of the function, or NULL or another value you specify if an error occurs.

Let's use it:

 1# We use the same function syntax as with map()
 2safe_model <- safely(.f = ~ lm(formula = mag ~ loss, data = .x),
 3                     # This is the default, just illustrated here to help
 4                     otherwise = NULL)
 5
 6# Then we feed this newly wrapped function into the map statement we used that
 7# caused our earlier error.
 8safe_model_output <- map(.x = tornado_data_list,
 9                         .f = ~ safe_model(.x))
10
11## Warning in storage.mode(v) <- "double": NAs introduced by coercion

The result is a list of lists...Each list item contains a list of two other items:

  1. result, 2. error
 1safe_model_output[[1]]
 2
 3## $result
 4##
 5## Call:
 6## lm(formula = mag ~ loss, data = .x)
 7##
 8## Coefficients:
 9## (Intercept)         loss
10##     0.55972      0.03637
11##
12##
13## $error
14## NULL

Great...This list is complicated now, though. We can use transpose() to invert it. This function will turn our list of paired result/errors and return to us one list of results and one list of errors. Dig it:

  1transpose(safe_model_output)
  2
  3## $result
  4## $result[[1]]
  5##
  6## Call:
  7## lm(formula = mag ~ loss, data = .x)
  8##
  9## Coefficients:
 10## (Intercept)         loss
 11##     0.55972      0.03637
 12##
 13##
 14## $result[[2]]
 15##
 16## Call:
 17## lm(formula = mag ~ loss, data = .x)
 18##
 19## Coefficients:
 20## (Intercept)         loss
 21##     0.50383      0.03762
 22##
 23##
 24## $result[[3]]
 25##
 26## Call:
 27## lm(formula = mag ~ loss, data = .x)
 28##
 29## Coefficients:
 30## (Intercept)         loss
 31##    0.602615     0.004762
 32##
 33##
 34## $result[[4]]
 35##
 36## Call:
 37## lm(formula = mag ~ loss, data = .x)
 38##
 39## Coefficients:
 40## (Intercept)         loss
 41##    0.786864     0.002123
 42##
 43##
 44## $result[[5]]
 45##
 46## Call:
 47## lm(formula = mag ~ loss, data = .x)
 48##
 49## Coefficients:
 50## (Intercept)         loss
 51##    0.568803     0.007743
 52##
 53##
 54## $result[[6]]
 55##
 56## Call:
 57## lm(formula = mag ~ loss, data = .x)
 58##
 59## Coefficients:
 60## (Intercept)         loss
 61##     0.61657      0.00275
 62##
 63##
 64## $result[[7]]
 65## NULL
 66##
 67## $result[[8]]
 68##
 69## Call:
 70## lm(formula = mag ~ loss, data = .x)
 71##
 72## Coefficients:
 73## (Intercept)         loss
 74##     0.49408      0.09331
 75##
 76##
 77## $result[[9]]
 78##
 79## Call:
 80## lm(formula = mag ~ loss, data = .x)
 81##
 82## Coefficients:
 83## (Intercept)         loss
 84##   2.515e-01    2.554e-07
 85##
 86##
 87## $result[[10]]
 88##
 89## Call:
 90## lm(formula = mag ~ loss, data = .x)
 91##
 92## Coefficients:
 93## (Intercept)         loss
 94##   2.534e-01    1.449e-08
 95##
 96##
 97## $result[[11]]
 98##
 99## Call:
100## lm(formula = mag ~ loss, data = .x)
101##
102## Coefficients:
103## (Intercept)         loss
104##   3.846e-01    2.009e-08
105##
106##
107##
108## $error
109## $error[[1]]
110## NULL
111##
112## $error[[2]]
113## NULL
114##
115## $error[[3]]
116## NULL
117##
118## $error[[4]]
119## NULL
120##
121## $error[[5]]
122## NULL
123##
124## $error[[6]]
125## NULL
126##
127## $error[[7]]
128##
129##
130## $error[[8]]
131## NULL
132##
133## $error[[9]]
134## NULL
135##
136## $error[[10]]
137## NULL
138##
139## $error[[11]]
140## NULL

This is helpful because we can now just pull out the results portion and drop the null values (or, fix the errors and re-run if you choose).

1successful_models %
2  discard(is.null)
3
4length(successful_models)
5
6## [1] 10

Based on error output, it looks like we have an issue with the mag column in data frame 7.

1tornado_data_list[[7]]$mag %>% unique()
2
3## [1] "0"            "2"            "1"            "3"            "4"
4## [6] "cheeseburger"

My bad, I had burgers on the brain.

3. Summarizing and condensing datasets

Maybe we're sick of having 11 data frames now. Let's summarize the data a bit, and then have purrr spit the output back to us as a single, condensed data frame. The _df tacked onto map signifies that it's returning a single data frame as its output instead of a list. There are other output options as well.

 1tornado_summary %
 2  mutate(mag = as.numeric(mag)) %>%
 3  filter(
 4    # Remove unknown values
 5    mag != -9,
 6    # Tornadoes not crossing state lines
 7    ns == 1,
 8    # Entire track in current state
 9    sn == 1,
10    # Entire track, not a portion of the track
11    sg == 1) %>%
12  select(om, date, yr, st, mag, loss, closs))
13
14## Warning: NAs introduced by coercion
15
16head(tornado_summary)
17
18##   om       date   yr st mag  loss closs
19## 1  1 2008-01-07 2008 MO   0  0.00     0
20## 2  2 2008-01-07 2008 MO   0  0.02     0
21## 3  3 2008-01-07 2008 IL   3  4.00     0
22## 4  4 2008-01-07 2008 MO   0  0.00     0
23## 5  5 2008-01-07 2008 WI   3 13.81     0
24## 6  6 2008-01-07 2008 WI   1  7.93     0
25
26tail(tornado_summary)
27
28##           om       date   yr st mag  loss closs
29## 12966 617019 2018-12-27 2018 LA   1  7000     0
30## 12967 617020 2018-12-27 2018 LA   1  7000     0
31## 12968 617021 2018-12-27 2018 MS   0 15000     0
32## 12969 617022 2018-12-31 2018 KY   1 55000     0
33## 12970 617023 2018-12-31 2018 IN   1 50000     0
34## 12971 617024 2018-12-31 2018 IN   1 20000     0

4. Splitting into lists and extracting list items

Previously the data had been separated into a list by year, but maybe what we really want is to get them separated at the state level. We can do this!

 1tornado_by_state %
 2  # x is our data frame to split
 3  split(x = .,
 4  # f is the factor to use in splitting (state in this case)
 5        f = list(.$st))
 6
 7# How many list items do we have now?
 8length(tornado_by_state)
 9
10## [1] 51

Let's plot yearly data, but we're not sure it will work so we'll use safely() again.

1safe_plot % ggplot() + geom_boxplot(aes(group = yr, y = loss)),
2                                    otherwise = NULL)
3plot_test <- map(.x = tornado_by_state,
4                 .f = ~ safe_plot(.x))

There are a couple of ways to extract list items, but one method we can use is pluck()

1plot_results %
2  transpose() %>%
3  pluck("result")

5. Iterating over multiple vectors or lists

Now we can save each of these plots we've created. We'll want to have names ready for their filenames, though. To combine our state names with our state plots we'll use map2(), which iterates over two vectors or lists (.x & .y) simultaneously. Note that this iteration is in parallel! This means that .x[1] and .y[1] are both used simultaneously, then .x[2] and .y[2], etc. The two vectors are not crossed, but this can be done using the purrr::cross() function if you'd like. Also note that because they are done in parallel, .x and .y have to have the same length!

We'll make a plot for each state, named _tornado_plot.png.

 1# Are the data frame names and the plot list compatible for map2?
 2length(names(tornado_by_state)) == length(plot_results)
 3
 4## [1] TRUE
 5
 6map2(.x = names(tornado_by_state),
 7     .y = plot_results,
 8     .f = ~ ggsave(filename = paste0("figures/", .x, "_tornado_plot.png"),
 9                   plot = .y, device = "png", width = 8, height = 6,
10                   units = "in"))

References: