Tidyverse Data Wrangling

By Ben Leonard

What is Data Wrangling?

Most of the time, datasets will not come "out of the box" ready for analysis. Summarizing data and presenting audience-friendly tables and figures requires quite a bit of data manipulation. A data wrangling task is performed any time a dataset is cleaned, tweaked, or adjusted to fit a subsequent analytical step.

Why Use R?

Programs that provide an easy-to-use graphical user interface (GUI), such as Excel, can be used to perform data wrangling tasks. Since it can be faster to manipulate data in Excel, why should anyone take the extra time to write code in R?

  1. Repeatability. A cornerstone of the scientific process is producing a replicable result. If a dataset has to be heavily massaged to produce the desired result, these steps must be documented; otherwise it is easy for errors to creep into the final output. Without transparency it is difficult for reviewers to differentiate between honest mistakes and purposeful data fabrication, and it is difficult to perform any type of quality assurance without proper documentation. By coding all analytical steps in R it is actually easier to leave a trail of evidence showing the data wrangling steps than it is to do so in Excel by heavily annotating workbooks.

  2. Data Integrity. Excel often does unexpected things to data. For example, Excel will attempt to auto-recognize data types when entering data: if I try to enter the value "1-15", it automatically becomes the date "January 15, 2020". Excel also does not always respect the preservation and display of significant digits. Some formulas don't follow sound statistical philosophy, and many advanced data maneuvers are absent from Excel or difficult to perform.

  3. Workflow. If R is eventually required for statistics, modeling, rmarkdown, shiny, etc., then keeping all steps "in house" just makes sense. Since R and RStudio provide most of the tools a data scientist needs, why muddy the waters by switching back and forth with Excel? However, it should be noted that there are many packages that make it easy to integrate the two approaches. For example, readxl is a tidyverse package for reading Excel workbooks (a short sketch follows). It is also possible to code Excel procedures in Visual Basic for Applications (VBA) and then execute that code programmatically from R. Many choose a hybrid approach in which Excel is used as a data exploration tool to figure out which steps need to be taken to accomplish a task in R.
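
readxl is installed with the tidyverse but is not attached by library("tidyverse"), so it must be loaded separately. The lines below are a minimal sketch of reading one sheet from a workbook; the file name "budget.xlsx" and sheet name "2020" are hypothetical placeholders.

# Read a single sheet of an Excel workbook into a tibble
library(readxl)

budget <- read_excel("budget.xlsx", sheet = "2020")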

Tidyverse Packages

# Installing tidyverse packages
# install.packages("tidyverse")

# Loading tidyverse packages
# library("tidyverse")

# Tidyverse information
# ?tidyverse

https://www.tidyverse.org/

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Install the complete tidyverse with:

install.packages("tidyverse")

Loading the tidyverse library is a great way to load many useful packages in one simple step. Several packages included in this distribution are great tools for data wrangling; notably, dplyr, readr, and tidyr are critical assets. The general philosophy is that the steps needed for a data wrangling task can be modularized into a streamlined set of human-readable commands.

https://dplyr.tidyverse.org/

https://readr.tidyverse.org/

https://tidyr.tidyverse.org/

Criticism

Some argue that tidyverse syntax is not intuitive and that it should not be taught to students first learning R. Much of the basis for this comes from the avoidance of certain base R staples such as the "$" operator, "[[]]", loops, and the plot command.

https://github.com/matloff/TidyverseSkeptic
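
To make the contrast concrete, the lines below sketch the same simple operation, extracting a single column as a vector from the built-in mtcars dataset, in both styles. The base R forms use "$" and "[[ ]]", while the tidyverse form uses dplyr's pull().

# Base R: extract the mpg column as a vector
mtcars$mpg
mtcars[["mpg"]]

# Tidyverse: the same extraction with pull()
pull(mtcars, mpg)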

Pipe Operator

"%>%"

The pipe operator is the crux of all tidyverse data operations. It strings together multiple functions without nesting calls or creating intermediate objects. If a number of operations need to be performed on a single dataframe, this approach simplifies the process. The principle is that the pipe bridges functions by passing the result of one step as the first argument of the next: x %>% f() %>% g() is the tidyverse "pipe syntax", while the equivalent base R approach is the nested call g(f(x)).

The examples below show how to perform a few simple tasks in both base R and tidyverse "pipe syntax". Note that for both of these cases the base R syntax may actually be preferred in terms of brevity, but the tidyverse approaches are slightly more readable.

## Task #1: Determine the mean of a vector of three doubles.

# Create a vector of values
a <- c(12.25, 42.50, 75.00)

# Base R
round(mean(a), digits = 2)

## [1] 43.25

# Tidyverse
a %>%
  mean() %>%
  round(digits = 2)

## [1] 43.25

## Task #2: Extract the unique names of all storms in the "storms" dataset and view the first few results.

# Base R
head(unique(storms$name))

## [1] "Amy"      "Caroline" "Doris"    "Belle"    "Gloria"   "Anita"

# Tidyverse
storms %>%
  pull(name) %>%
  unique() %>%
  head()

## [1] "Amy"      "Caroline" "Doris"    "Belle"    "Gloria"   "Anita"

Reading Data

read_csv() and write_csv()

While base R does provide analogous functions (read.csv and write.csv), the versions included in readr are roughly 5 to 10 times faster. This speed, in addition to a progress bar, makes these functions preferred for large files. The data frames produced are also always tibbles, and columns are not converted to factors nor are rownames used by default. Additionally, read_csv is able to make better determinations about data types and has more sensible defaults.

The example below shows how read.csv is unable to identify time series data automatically and assigns a column class of "character". However, read_csv correctly reads in the data without having to specify any column data types.

## Task #1: Read in a dataset of time series data and determine the default class assignments for each column data type.

# Base R
sapply(read.csv("weather.csv"), class)

##   timestamp       value
## "character"   "numeric"

# Tidyverse
read_csv("weather.csv") %>%
  sapply(class)

## Parsed with column specification:
## cols(
##   timestamp = col_datetime(format = ""),
##   value = col_double()
## )

## $timestamp
## [1] "POSIXct" "POSIXt"
##
## $value
## [1] "numeric"
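
The column specification can also be supplied explicitly, and write_csv() is the readr counterpart for output. The lines below are a minimal sketch; the cleaned data frame df and the output file "weather_clean.csv" are hypothetical.

# Declare column types up front instead of relying on guessing
df <- read_csv("weather.csv",
               col_types = cols(timestamp = col_datetime(), value = col_double()))

# Write the result back out; write_csv never writes row names
write_csv(df, "weather_clean.csv")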

tribble()

Sometimes it is necessary to define a small dataframe in your code rather than reading data from an external source such as a csv file. The tribble() function makes this very easy by allowing row-wise definition of dataframes in tibble format.

Note. A tibble is a type of dataframe specific to the tidyverse. See the following description:

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don't change variable names or types, and don't do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large datasets containing complex objects.

https://tibble.tidyverse.org/
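
As a quick illustration of that "surly" behavior, the sketch below compares partial matching of a column name between a base data.frame and a tibble; the single-column data is made up for the example.

# A base data.frame silently partial-matches the column name and returns the column
data.frame(value = 1:3)$val

# A tibble refuses to partial-match: it warns about the unknown column and returns NULL
tibble(value = 1:3)$val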

The example below shows how much more intuitive it is to create data frames using tribble. Row-wise definitions make it simple to enter tabular data and to manually update datasets when needed. The use of tribble also guarantees a tibble output, which is fully compatible with all tidyverse functions. A tibble can also be created without row-wise syntax using tibble() (formerly data_frame()).

## Task #2: Manually create a data frame.

# Base R
class(data.frame("col1" = c(0.251, 2.253),
                 "col2" = c(1.231, 5.230)))

## [1] "data.frame"

# Tidyverse #1
class(tibble("col1" = c(0.251, 2.253),
             "col2" = c(1.231, 5.230)))

## [1] "tbl_df"     "tbl"        "data.frame"

# Tidyverse #2
tribble(
  ~col1, ~col2,
  0.251, 1.231,
  2.253, 5.230,
  1.295, 3.192, # New Data
  ) %>%
  class()

## [1] "tbl_df"     "tbl"        "data.frame"

Manipulating Data

Much of data wrangling comes down to data manipulations performed over and over again. A compact set of functions makes this easy in the tidyverse without having to nest functions or define a bunch of intermediate variables. While many base R equivalents exist, the example below performs a set of commands using only tidyverse syntax.

select() extracts columns and returns a tibble.

arrange() changes the ordering of the rows.

filter() picks cases based on their values.

mutate() adds new variables that are functions of existing variables.

rename() changes the name of one or more columns.

pull() extracts a single column as a vector.

# Task #1: Perform numerous data manipulation operations on a dataset.

ToothGrowth %>%
  mutate(dose2x = dose*2) %>% # Double the value of dose
  select(-dose) %>% # Remove old dose column
  arrange(-len) %>% # Sort length from high to low
  filter(supp == "OJ") %>% # Filter for only orange juice supplement
  rename(length = len) %>% # Rename len to be more explicit
  pull(length) %>% # Extract length as a vector
  head()

## [1] 30.9 29.4 27.3 27.3 26.4 26.4

Summarizing Data

group_by() Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group". ungroup() removes grouping.

summarize() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.

summarise() and summarize() are synonyms.

pivot_wider() "widens" data, increasing the number of columns and decreasing the number of rows.

pivot_longer() "lengthens" data, increasing the number of rows and decreasing the number of columns.
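
The tasks below demonstrate group_by(), summarize(), and both pivots, but they never use ungroup(). As a minimal sketch of where it matters, the grouped mutate below (using the built-in mtcars data) keeps every row and carries the grouping forward until it is explicitly removed:

mtcars %>%
  group_by(cyl) %>% # Group rows by number of cylinders
  mutate(mean_mpg = mean(mpg)) %>% # Per-group mean, repeated on every row
  ungroup() %>% # Drop the grouping so later steps act on the whole table
  select(cyl, mpg, mean_mpg) %>%
  head()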

Data summarization often requires some difficult maneuvering. Since data summaries may be used for subsequent analyses, or even as a deliverable, it is important to have a well-documented and clear set of procedures. The following procedures are used to create tabular data summaries much like pivot tables in Excel. However, unlike pivot tables in Excel, the steps here are well documented. Note that the dollar() formatting function used below comes from the scales package, which is not attached by library(tidyverse), so load scales (or call scales::dollar()) first.

## Task #1: Summarize the prices of various diamond cuts using a variety of statistical measures and then properly format the results.

diamonds %>%
  group_by(cut) %>%
  summarize(mean = mean(price)) %>%
  mutate(mean = dollar(mean))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 5 x 2
##   cut       mean
##   <ord>     <chr>
## 1 Fair      $4,358.76
## 2 Good      $3,928.86
## 3 Very Good $3,981.76
## 4 Premium   $4,584.26
## 5 Ideal     $3,457.54

# Tip: Using _at will apply one or more functions to one or more variables.
diamonds %>%
  group_by(cut) %>%
  summarize_at(vars(price), funs(mean, sd, min, median, max)) %>%
  mutate_at(vars(-cut), dollar)

## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
##   # Simple named list:
##   list(mean = mean, median = median)
##
##   # Auto named with `tibble::lst()`:
##   tibble::lst(mean, median)
##
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## # A tibble: 5 x 6
##   cut       mean      sd        min   median    max
##   <ord>     <chr>     <chr>     <chr> <chr>     <chr>
## 1 Fair      $4,358.76 $3,560.39 $337  $3,282.00 $18,574
## 2 Good      $3,928.86 $3,681.59 $327  $3,050.50 $18,788
## 3 Very Good $3,981.76 $3,935.86 $336  $2,648.00 $18,818
## 4 Premium   $4,584.26 $4,349.20 $326  $3,185.00 $18,823
## 5 Ideal     $3,457.54 $3,808.40 $326  $1,810.00 $18,806

## Task #2: a) Create a presence absence table for fish encounters where each station is a different column. b) Then condense the table to get total fish counts per station.

# a)
df <- fish_encounters %>%
  mutate(seen = "X") %>% # Set value to X for seen
  pivot_wider(names_from = station, values_from = seen) # Move stations to columns

df %>% head()

## # A tibble: 6 x 12
##   fish  Release I80_1 Lisbon Rstr  Base_TD BCE   BCW   BCE2  BCW2  MAE   MAW
##   <fct> <chr>   <chr> <chr>  <chr> <chr>   <chr> <chr> <chr> <chr> <chr> <chr>
## 1 4842  X       X     X      X     X       X     X     X     X     X     X
## 2 4843  X       X     X      X     X       X     X     X     X     X     X
## 3 4844  X       X     X      X     X       X     X     X     X     X     X
## 4 4845  X       X     X      X     X       <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
## 5 4847  X       X     X      <NA>  <NA>    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
## 6 4848  X       X     X      X     <NA>    <NA>  <NA>  <NA>  <NA>  <NA>  <NA>

# b)
df %>%
  pivot_longer(-fish, names_to = "station") %>% # Move back to long format
  group_by(station) %>%
  filter(!is.na(value)) %>% # Remove NAs
  summarize(count = n()) # Count number of records

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 11 x 2
##    station count
##    <chr>   <int>
##  1 Base_TD    11
##  2 BCE         8
##  3 BCE2        7
##  4 BCW         8
##  5 BCW2        7
##  6 I80_1      19
##  7 Lisbon     13
##  8 MAE         5
##  9 MAW         5
## 10 Release    19
## 11 Rstr       12

Complex Example

While this document follows some pretty simple examples, using real-life data can get complicated quickly. In the last example, I use real data from a study looking at tree transpiration. Data received from Campbell data loggers must be heavily manipulated to calculate "v", the velocity of sap movement, for each timestamp. A function is defined which performs all of the data wrangling steps and allows the user to change the data sources and censoring protocols. While there is no need to fully understand what is going on in this example, it is important to note that each step is documented with in-line comments. This tidyverse recipe, saved as a function, can then be added to a custom package and loaded as needed. Note that in addition to the tidyverse packages, the function relies on lubridate (hour()) and zoo (na.approx()).

# Define a function that accomplishes numerous data wrangling steps given several input variables.
tdp <- function(
  path, # Path to data files
  meta, # Name of metadata file
  min, # Minimum accepted value
  max, # Maximum accepted value
  mslope # Minimum accepted slope
) {

  list.files(path = path, # Define subfolder
                  pattern = "*.dat", # Pattern for file suffix
                  full.names = TRUE, # Use entire file name
                  recursive = TRUE) %>% # Create list of dat files
  map_df(~ read_csv(
    .x,
    skip = 4, # Skip first four lines
    col_names = F, # Don't auto assign column names
    col_types = cols(), # Suppress the column specification message
    locale = locale(tz = "Etc/GMT+8") # Set timezone
  ) %>% # Read in all dat files
    select(-c(37:ncol(.))) %>% # Remove extra columns
    mutate(temp = as.integer(str_extract(.x, "(?<=_)\\d{1}(?=_TableTC.dat)")))) %>% # Extract site number from file name
  `colnames<-`(c("timestamp", "record", "jday", "jhm", 1:32, "plot_num")) %>% # Manually define column names
  select(plot_num, everything()) %>% # Move plot_num to be the first column
  pivot_longer(
    names_to = "channel",
    values_to = "dTCa",
    -(1:5),
    names_transform = list(channel = as.double)) %>% # Flatten data
  left_join(read_csv(meta,
                     col_types = cols())) %>% # Merge tree metadata
  filter(!is.na(tree_num)) %>% # Remove NAs associated with channels lacking a tree number
  unite(tree_id,
        c(plot_num, tree_num, tree_type), # Combine three variables
        sep = "-", # Use - as delimiter
        remove = FALSE) %>% # Create tree IDs
  arrange(plot_num, channel, timestamp) %>%
  mutate(date = as.Date(timestamp, tz = "Etc/GMT+8"), # Extract date
         hour = hour(timestamp), # Extract hour of day
         am = if_else(hour < 12, T, F), # True vs False for morning hours
         slope = dTCa - lag(dTCa), # Calculate dTCa slope for each timestep
         flag = ifelse(dTCa < min | dTCa > max | slope > mslope, TRUE, FALSE)) %>% # Flag values that are out of bounds
  filter(!flag) %>% # Filter and remove values that are out of bounds
  group_by(plot_num, channel) %>% # Group data by site/channel
  filter(date != min(date)) %>% # Remove first day of data collection
  group_by(plot_num, channel, date, am) %>% # include date and morning in grouping
  mutate(dTM = max(dTCa)) %>% # Calculate maximum temperature difference per day/site/channel
  ungroup() %>%
  mutate(baseline = if_else(dTCa == dTM & am, dTM, NA_real_), # Force dTM to occur in the morning
         diff = difftime(timestamp, lag(timestamp), units = "mins"), # Calculate time difference
         id = if_else(diff != 15 | is.na(diff), 1, 0), # Identify non-contiguous periods of time
         id = cumsum(id)) %>% # Unique IDs will be produced for contiguous periods of data collection
  group_by(plot_num, channel, id) %>%
  mutate(baseline = na.approx(baseline, na.rm = F, rule = 2)) %>% # Interpolate between baselines
  ungroup() %>%
  mutate(k = (baseline - dTCa) / dTCa, # Calculate the sap flux index
         v = 0.0119 * (k ^ 1.231) * 60 * 60, # Calculate sap velocity in cm/hr using Granier formula
         v = replace_na(v, 0)) %>% # Replace NAs with 0s
  group_by(tree_id, date, probe_depth) %>%
  summarize(sum_v = sum(v)) # Sum sap velocity per tree, date, and probe depth
}

# Perform data wrangling steps
tdp(path = "data", meta = "channels.csv", min = 2, max = 20, mslope = 1) %>%
  head()

## Joining, by = c("plot_num", "channel")

## `summarise()` regrouping output by 'tree_id', 'date' (override with `.groups` argument)

## # A tibble: 6 x 4
## # Groups:   tree_id, date [2]
##   tree_id date       probe_depth sum_v
##   <chr>   <date>           <dbl> <dbl>
## 1 1-1-DF  2020-07-03          15 155.
## 2 1-1-DF  2020-07-03          25  79.2
## 3 1-1-DF  2020-07-03          50 185.
## 4 1-1-DF  2020-07-03          90  29.6
## 5 1-1-DF  2020-07-04          15 156.
## 6 1-1-DF  2020-07-04          25  83.4