Formatting data for use with R
By Matt Brousil
This post walks through some best practices for formatting and preparing your data for use with R. Some of the suggestions here are drawn from Data Carpentry, Broman and Woo (2018), and the BES Guide to Reproducible Code. The required dataset is available here. A previous version of this post that deals with cleaning both data and scripts is available here.
```r
library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)
```
1. Use .csv instead of .xlsx or other proprietary file formats when possible.
The `.csv` filetype can be read easily by many software packages, including base R. This isn't necessarily the case with `.xlsx`. Instead, we use the `readxl` package here. In my experience, other packages for reading Excel files can have trouble installing on some computers because of their system requirements.
```r
raw_data <- read_excel(path = "cool project 2020.xlsx")
```
Additional notes:

- If your `.xlsx` has multiple sheets, then make multiple files. This is a more reliable storage method, and `.csv` files don't allow for multiple sheets.
- Don't use formatting to store data. R can't easily decipher things like highlighting, and the ways of documenting what formatting means often leave a lot to be desired. But if you need to detect Excel formatting from R, check out this blog post.
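Once your data lives in a `.csv`, reading it back in is straightforward with either base R or `readr` (loaded with the `tidyverse`). A minimal sketch, assuming a hypothetical file named `cool_project_2020.csv`:

```r
library(readr)

# Base R: returns a data.frame
raw_data <- read.csv("cool_project_2020.csv")

# readr: returns a tibble and reports the column types it guessed
raw_data <- read_csv("cool_project_2020.csv")
```

`read_csv()` is often the friendlier option because it never converts strings to factors and it prints the column types it inferred, which makes import surprises easier to catch.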
2. File naming
`cool project 2020.xlsx` isn't a good name! There are a few points we want to keep in mind with file naming:
- Machine readability
  - Don't use spaces in filenames
  - Don't use characters other than letters, numbers, and `_` or `-`
- Human readability
  - Make the name of the file descriptive, and try to follow a consistent naming convention
Something like `brousil_thesis_data_2020.csv` is a better name, although you might want to make yours more specific.
3. Column naming
Let's take a look at our column names:
```r
names(raw_data)

## [1] "start date"   "end date"
## [3] "which group?" "value"
## [5] "\"Notes\""    "lat"
## [7] "long"
```
Here's what the dataset itself looks like:
```r
head(raw_data)

## # A tibble: 6 x 7
##   `start date` `end date`
##          <dbl> <dttm>
## 1        40142 2010-05-05 00:00:00
## 2        40143 2010-05-06 00:00:00
## 3        40143 2010-05-06 00:00:00
## 4           NA 2010-05-08 00:00:00
## 5           NA 2010-05-09 00:00:00
## 6        40142 2010-05-05 00:00:00
## # ... with 5 more variables: `which
## #   group?` <chr>, value <dbl>,
## #   `"Notes"` <chr>, lat <chr>, long <chr>
```
The column naming in this dataset is a mess! Note that the formatting of the column names changes between the vector output of `names()` and the tibble version from `head()`. We could have avoided this instability. When naming columns in a spreadsheet for use in R, we should avoid:
- Spaces
- Special characters
- Capitalization
Spaces in particular make it very difficult to refer to column names, because you have to use backticks or quotes when referencing them. Capitalization isn't a huge deal, but it adds extra keystrokes to your names and also makes it easier to misspell them.
```r
raw_data$`which group?`

# This won't work
raw_data %>% select(which group?)

# This will work
raw_data %>% select("which group?")
```
We can take care of the names quickly using `clean_names()` from `janitor`.
```r
raw_data <- clean_names(raw_data)

names(raw_data)

## [1] "start_date"  "end_date"    "which_group"
## [4] "value"       "notes"       "lat"
## [7] "long"
```
4. Layout of the data
The next step is to consider the layout of the data. For example, is the dataset in tidy format? Let's take a look:
```r
raw_data

## # A tibble: 10 x 7
##    start_date end_date            which_group
##         <dbl> <dttm>              <chr>
##  1      40142 2010-05-05 00:00:00 A
##  2      40143 2010-05-06 00:00:00 A
##  3      40143 2010-05-06 00:00:00 B
##  4         NA 2010-05-08 00:00:00 B
##  5         NA 2010-05-09 00:00:00 A
##  6      40142 2010-05-05 00:00:00 A
##  7      40148 2010-05-11 00:00:00 A
##  8      40148 2010-05-11 00:00:00 A
##  9      40149 2010-05-12 00:00:00 B
## 10      40142 2010-05-05 00:00:00 A
## # ... with 4 more variables: value <dbl>,
## #   notes <chr>, lat <chr>, long <chr>
```
This dataset is close, but not quite in tidy format.
Good qualities of the layout:
- A single table
- Variables are in columns
- Observations are in rows
Bad qualities of the layout:
- Multiple pieces of data in the `notes` column
- Duplicated rows
To start, let's separate the notes column into multiple columns.
```r
clean_data <- separate(data = raw_data, col = notes,
                       into = c("type", "location"),
                       sep = ", ", remove = FALSE)

head(clean_data)

## # A tibble: 6 x 9
##   start_date end_date            which_group
##        <dbl> <dttm>              <chr>
## 1      40142 2010-05-05 00:00:00 A
## 2      40143 2010-05-06 00:00:00 A
## 3      40143 2010-05-06 00:00:00 B
## 4         NA 2010-05-08 00:00:00 B
## 5         NA 2010-05-09 00:00:00 A
## 6      40142 2010-05-05 00:00:00 A
## # ... with 6 more variables: value <dbl>,
## #   notes <chr>, type <chr>, location <chr>,
## #   lat <chr>, long <chr>
```
Next, let's investigate the duplicate rows. We'll print any rows that aren't unique using `get_dupes()` from `janitor` and then remove the repeated row with help from `duplicated()`.
```r
# From janitor
get_dupes(clean_data)

## No variable names specified - using all columns.

## # A tibble: 2 x 10
##   start_date end_date            which_group
##        <dbl> <dttm>              <chr>
## 1      40142 2010-05-05 00:00:00 A
## 2      40142 2010-05-05 00:00:00 A
## # ... with 7 more variables: value <dbl>,
## #   notes <chr>, type <chr>, location <chr>,
## #   lat <chr>, long <chr>, dupe_count <int>

# From base R. Note that it returns a logical vector.
duplicated(clean_data)

##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [8] FALSE FALSE FALSE
```
Drop the extra row using `duplicated()`:
```r
clean_data <- clean_data[!duplicated(clean_data), ]
```
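If you prefer to stay within the tidyverse, `dplyr::distinct()` accomplishes both steps at once, keeping only the unique rows:

```r
library(dplyr)

# Equivalent to subsetting with !duplicated(): one copy of each row is kept
clean_data <- distinct(clean_data)
```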
5. Dates
Dates are tricky in Excel. It has many automatic formatting options intended to be helpful, but they often lead to problems. Moreover, Excel sometimes interprets things that aren't supposed to be dates as dates.
This dataset shows multiple potential pitfalls with dates.
- The `start_date` column contains Excel serial number dates that haven't translated over to R.
- The `end_date` column has imported correctly! But when it was entered into Excel, it was formatted like this: `5/11/10`. This isn't trustworthy, because 5, 11, and 10 are all realistic values for the year, month, or day.
Some of the safest ways to enter dates are as follows:
- As separate columns in your spreadsheet: `year`, `month`, `day`
- As ISO 8601: `YYYY-MM-DD` (e.g., 2020-10-21)
  - ISO = International Organization for Standardization
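Both of these safe formats are easy to turn into proper R dates with `lubridate`. A quick sketch with made-up values:

```r
library(tidyverse)
library(lubridate)

# ISO 8601 strings parse directly
ymd("2020-10-21")

# Separate year/month/day columns can be combined with make_date()
sampling <- tibble(year = 2020, month = 10, day = 21)
sampling %>% mutate(date = make_date(year, month, day))
```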
Luckily we can clean up the Excel serial formatted dates:
```r
clean_data$start_date <- excel_numeric_to_date(as.numeric(clean_data$start_date))
```
If we want to save the data as a `.csv` for later use in Excel, we might still want to split the dates into components and drop the original date columns.
```r
clean_data <- clean_data %>%
  mutate(start_year = year(start_date),
         start_mo = month(start_date),
         start_day = day(start_date),
         end_year = year(end_date),
         end_mo = month(end_date),
         end_day = day(end_date)) %>%
  select(-start_date, -end_date)

head(clean_data)

## # A tibble: 6 x 13
##   which_group value notes type  location lat
##   <chr>       <dbl> <chr> <chr> <chr>    <chr>
## 1 A               1 expe~ expe~ greenho~ -27.~
## 2 A               5 expe~ expe~ greenho~ -27.~
## 3 B              75 expe~ expe~ lab      -27.~
## 4 B               4 expe~ expe~ greenho~ -27.~
## 5 A               2 expe~ expe~ lab      -27.~
## 6 A               1 expe~ expe~ lab      -27.~
## # ... with 7 more variables: long <chr>,
## #   start_year <dbl>, start_mo <dbl>,
## #   start_day <int>, end_year <dbl>,
## #   end_mo <dbl>, end_day <int>
```
6. Missing values
Many datasets you work with will likely contain missing values. It's important to know why data can be missing, and how to treat it differently based on the reason for its absence.
- Make sure that you enter 0 when your data is supposed to contain zero! People often leave zeroes blank during data entry, and it's not clear to the user whether those blanks are intended to be missing data or actual 0 values.
- There's no uniform agreement on how to code truly missing data, but there are multiple options. Either a blank cell or a manually entered `NA` value is best for data that truly are gaps in the dataset (e.g., an experimental replicate was lost, a data logger was corrupted, etc.). My thought is that `NA` is probably best: it makes clear that the data are missing rather than simply forgotten during entry (Broman & Woo, 2018).
- Always fill in values, even when they are just repeats of dates in the rows above!
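If you inherit a spreadsheet where repeated values like dates were left blank, `tidyr::fill()` can carry the last non-missing value downward. A sketch with hypothetical data:

```r
library(tidyverse)

sampling <- tibble(date  = c("2020-10-21", NA, NA, "2020-10-22"),
                   value = c(1, 5, 2, 4))

# Fill blank dates down from the most recent entered date
sampling %>% fill(date, .direction = "down")
```

This is a rescue step for data you receive, not a license to leave cells blank yourself: filling in every value at entry time is still the safer practice.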
7. Miscellaneous
Lastly, you'll want to avoid including special characters or units inside of cells with numeric data (e.g., `lat`/`long`). There are a few reasons for this:
- These characters or units won't typically add anything to your analytical abilities in R
- Special characters may be difficult to wrangle in R
- The presence of a special character in a numeric column will likely force that column to be character rather than numeric data
My suggestion is to either leave them out entirely or reference them in the column name, e.g., `distance_m`.
We can drop them from our dataset using `gsub()`:
```r
clean_data$lat <- gsub(pattern = "°", replacement = "", x = clean_data$lat)
clean_data$long <- gsub(pattern = "°", replacement = "", x = clean_data$long)
```
8. Keep your raw data untouched and separate from cleaned data!
In case a transcription or cleaning error happens, you need to be able to compare your original dataset (untouched) to the cleaned version. Save the cleaned data as a new `.csv`, separate from the raw data.
```r
write.csv(x = clean_data,
          file = "brousil_cleaned_thesis_data.csv",
          row.names = FALSE)
```
Sources:
- Cooper, N., & Hsing, P. Y. (2017). A guide to reproducible code in ecology and evolution. Technical report, British Ecological Society. Available at https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf.
- Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989
- Peter R. Hoyt, Christie Bahlai, Tracy K. Teal (Eds.), Erin Alison Becker, Aleksandra Pawlik, Peter Hoyt, Francois Michonneau, Christie Bahlai, Toby Reiter, et al. (2019, July 5). datacarpentry/spreadsheet-ecology-lesson: Data Carpentry: Data Organization in Spreadsheets for Ecologists, June 2019 (Version v2019.06.2). Zenodo. http://doi.org/10.5281/zenodo.3269869