Formatting data for use with R
By Matt Brousil
This post walks through some best practices for formatting and preparing your data for use with R. Some of the suggestions here are drawn from Data Carpentry, Broman and Woo (2018), and the BES Guide to Reproducible Code. The required dataset is available here. A previous version of this post that deals with cleaning both data and scripts is available here.
```r
library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)
```
1. Use .csv instead of .xlsx or other proprietary file formats when possible.
The `.csv` filetype can be read easily by many software packages, including base R. This isn't necessarily the case with `.xlsx`. Instead, we use the `readxl` package here. In my experience, other packages for reading Excel files can have trouble installing on some computers because of their system requirements.
```r
raw_data <- read_excel(path = "cool project 2020.xlsx")
```
Additional notes:

- If your `.xlsx` has multiple sheets, then make multiple files. This is a more reliable storage method, and `.csv` files don't allow for multiple sheets.
- Don't use formatting to store data. R can't easily decipher things like highlighting, and the ways of documenting what formatting means often leave a lot to be desired. But if you need to detect Excel formatting from R, check out this blog post.
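Once your data lives in a `.csv`, reading it back in is straightforward with either base R or `readr` (loaded with the `tidyverse`). A minimal sketch, assuming a hypothetical file named `cool_project_2020.csv`:

```r
library(readr)

# Base R: returns a data.frame
raw_data <- read.csv("cool_project_2020.csv")

# readr: returns a tibble and reports the column types it guessed
raw_data <- read_csv("cool_project_2020.csv")
```

`read_csv()` is often the friendlier option because it never converts strings to factors and it prints the column types it inferred, which makes import surprises easier to catch.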
2. File naming
`cool project 2020.xlsx` isn't a good name! There are a few points we want to keep in mind with file naming:
- Machine readability
  - Don't use spaces in filenames
  - Don't use characters other than letters, numbers, and `_` or `-`
- Human readability
  - Make the name of the file descriptive, and try to follow a consistent naming convention
Something like `brousil_thesis_data_2020.csv` is a better name, although you might want to make yours more specific.
3. Column naming
Let's take a look at our column names:
```r
names(raw_data)

## [1] "start date"   "end date"
## [3] "which group?" "value"
## [5] "\"Notes\""    "lat"
## [7] "long"
```
Here's what the dataset itself looks like:
```r
head(raw_data)

## # A tibble: 6 x 7
##   `start date` `end date`
##          <dbl> <dttm>
## 1        40142 2010-05-05 00:00:00
## 2        40143 2010-05-06 00:00:00
## 3        40143 2010-05-06 00:00:00
## 4           NA 2010-05-08 00:00:00
## 5           NA 2010-05-09 00:00:00
## 6        40142 2010-05-05 00:00:00
## # ... with 5 more variables: `which
## #   group?` <chr>, value <dbl>,
## #   `"Notes"` <chr>, lat <chr>, long <chr>
```
The column naming in this dataset is a mess! Note that the formatting of the column names changes between the vector output of `names()` and the tibble version from `head()`. We could have avoided this instability. When naming columns in a spreadsheet for use in R, we should avoid:
- Spaces
- Special characters
- Capitalization
Spaces in particular make it very difficult to refer to column names, because you have to use backticks or quotes when referencing them. Capitalization isn't a huge deal, but it adds extra keystrokes to your names and also makes it easier to misspell them.
```r
raw_data$`which group?`

# This won't work
raw_data %>% select(which group?)

# This will work
raw_data %>% select("which group?")
```
We can take care of the names quickly using `clean_names()` from `janitor`.
```r
raw_data <- clean_names(raw_data)

names(raw_data)

## [1] "start_date"  "end_date"    "which_group"
## [4] "value"       "notes"       "lat"
## [7] "long"
```
4. Layout of the data
The next step is to consider the layout of the data. For example, is the dataset in tidy format? Let's take a look:
```r
raw_data

## # A tibble: 10 x 7
##    start_date end_date            which_group
##         <dbl> <dttm>              <chr>
##  1      40142 2010-05-05 00:00:00 A
##  2      40143 2010-05-06 00:00:00 A
##  3      40143 2010-05-06 00:00:00 B
##  4         NA 2010-05-08 00:00:00 B
##  5         NA 2010-05-09 00:00:00 A
##  6      40142 2010-05-05 00:00:00 A
##  7      40148 2010-05-11 00:00:00 A
##  8      40148 2010-05-11 00:00:00 A
##  9      40149 2010-05-12 00:00:00 B
## 10      40142 2010-05-05 00:00:00 A
## # ... with 4 more variables: value <dbl>,
## #   notes <chr>, lat <chr>, long <chr>
```
This dataset is close, but not quite in tidy format.
Good qualities of the layout:
- A single table
- Variables are in columns
- Observations are in rows
Bad qualities of the layout:
- Multiple pieces of data in the `notes` column
- Duplicated rows
To start, let's separate the notes column into multiple columns.
```r
clean_data <- separate(data = raw_data, col = notes,
                       into = c("type", "location"),
                       sep = ", ", remove = FALSE)

head(clean_data)

## # A tibble: 6 x 9
##   start_date end_date            which_group
##        <dbl> <dttm>              <chr>
## 1      40142 2010-05-05 00:00:00 A
## 2      40143 2010-05-06 00:00:00 A
## 3      40143 2010-05-06 00:00:00 B
## 4         NA 2010-05-08 00:00:00 B
## 5         NA 2010-05-09 00:00:00 A
## 6      40142 2010-05-05 00:00:00 A
## # ... with 6 more variables: value <dbl>,
## #   notes <chr>, type <chr>, location <chr>,
## #   lat <chr>, long <chr>
```
Next, let's investigate the duplicate rows. We'll print any rows that aren't unique using `get_dupes()` from `janitor` and then remove the repeated row with help from `duplicated()`.
```r
# From janitor
get_dupes(clean_data)

## No variable names specified - using all columns.

## # A tibble: 2 x 10
##   start_date end_date            which_group
##        <dbl> <dttm>              <chr>
## 1      40142 2010-05-05 00:00:00 A
## 2      40142 2010-05-05 00:00:00 A
## # ... with 7 more variables: value <dbl>,
## #   notes <chr>, type <chr>, location <chr>,
## #   lat <chr>, long <chr>, dupe_count <int>

# From base R. Note that it returns a logical vector.
duplicated(clean_data)

##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
##  [8] FALSE FALSE FALSE
```
Drop the extra row using `duplicated()`:
```r
clean_data <- clean_data[!duplicated(clean_data), ]
```
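If you prefer to stay within the tidyverse, `dplyr::distinct()` accomplishes both steps at once, keeping only the unique rows:

```r
library(dplyr)

# Equivalent to subsetting with !duplicated(): one copy of each row is kept
clean_data <- distinct(clean_data)
```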
5. Dates
Dates are tricky in Excel. It has many automatic formatting options intended to be helpful, but they often lead to problems. Moreover, Excel sometimes interprets things that aren't supposed to be dates as dates.
This dataset shows multiple potential pitfalls with dates.
- The `start_date` column contains Excel serial number dates that haven't translated over to R.
- The `end_date` column has imported correctly! But when it was entered into Excel, it was formatted like this: `5/11/10`. This isn't trustworthy, because 5, 11, and 10 are all realistic values for the year, month, or day.
Some of the safest ways to enter dates are as follows:
- As separate columns in your spreadsheet: `year`, `month`, `day`
- As ISO 8601: `YYYY-MM-DD` (e.g., 2020-10-21)
  - ISO = International Organization for Standardization
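Both of these safe formats are easy to turn into proper R dates with `lubridate`. A quick sketch with made-up values:

```r
library(tidyverse)
library(lubridate)

# ISO 8601 strings parse directly
ymd("2020-10-21")

# Separate year/month/day columns can be combined with make_date()
sampling <- tibble(year = 2020, month = 10, day = 21)
sampling %>% mutate(date = make_date(year, month, day))
```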
Luckily we can clean up the Excel serial formatted dates:
```r
clean_data$start_date <- excel_numeric_to_date(as.numeric(clean_data$start_date))
```
If we want to save the data as a `.csv` for later use in Excel, we might still want to split the dates into components and drop the original date columns.
```r
clean_data <- clean_data %>%
  mutate(start_year = year(start_date),
         start_mo = month(start_date),
         start_day = day(start_date),
         end_year = year(end_date),
         end_mo = month(end_date),
         end_day = day(end_date)) %>%
  select(-start_date, -end_date)

head(clean_data)

## # A tibble: 6 x 13
##   which_group value notes type  location lat
##   <chr>       <dbl> <chr> <chr> <chr>    <chr>
## 1 A               1 expe~ expe~ greenho~ -27.~
## 2 A               5 expe~ expe~ greenho~ -27.~
## 3 B              75 expe~ expe~ lab      -27.~
## 4 B               4 expe~ expe~ greenho~ -27.~
## 5 A               2 expe~ expe~ lab      -27.~
## 6 A               1 expe~ expe~ lab      -27.~
## # ... with 7 more variables: long <chr>,
## #   start_year <dbl>, start_mo <dbl>,
## #   start_day <int>, end_year <dbl>,
## #   end_mo <dbl>, end_day <int>
```
6. Missing values
Many datasets you work with will likely contain missing values. It's important to know why data can be missing, and how to treat it differently based on the reason for its absence.
- Make sure that you enter 0 when your data is supposed to contain zero! People often leave zeroes blank during data entry, and it's not clear to the user whether those blanks are intended to be missing data or actual 0 values.
- There's no uniform agreement on how to code truly missing data, but there are multiple options. Either a blank cell or a manually entered `NA` value is best for data that truly are gaps in the dataset (e.g., an experimental replicate was lost, a data logger was corrupted, etc.). My thought is that `NA` is probably best: it makes clear that the data are missing rather than simply forgotten during entry (Broman & Woo, 2018).
- Always fill in values, even when they are just repeats of dates in the rows above!
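If you inherit a spreadsheet where repeated values like dates were left blank, `tidyr::fill()` can carry the last non-missing value downward. A sketch with hypothetical data:

```r
library(tidyverse)

sampling <- tibble(date  = c("2020-10-21", NA, NA, "2020-10-22"),
                   value = c(1, 5, 2, 4))

# Fill blank dates down from the most recent entered date
sampling %>% fill(date, .direction = "down")
```

This is a rescue step for data you receive, not a license to leave cells blank yourself: filling in every value at entry time is still the safer practice.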
7. Miscellaneous
Lastly, you'll want to avoid including special characters or units inside of cells with numeric data (e.g., `lat`/`long`). There are a few reasons for this:
- These characters or units won't typically add anything to your analytical abilities in R
- Special characters may be difficult to wrangle in R
- The presence of a special character in a numeric column will likely force that column to be character rather than numeric data
My suggestion is to either leave them out entirely or reference them in the column name, e.g., `distance_m`.
We can drop them from our dataset using `gsub()`:
```r
clean_data$lat <- gsub(pattern = "°", replacement = "", x = clean_data$lat)
clean_data$long <- gsub(pattern = "°", replacement = "", x = clean_data$long)
```
8. Keep your raw data untouched and separate from cleaned data!
In case a transcription or cleaning error happens, you need to be able to compare your original dataset (untouched) to the cleaned version. Save the cleaned data as a new `.csv`, separate from the raw data.
```r
write.csv(x = clean_data,
          file = "brousil_cleaned_thesis_data.csv",
          row.names = FALSE)
```
Sources:
- Cooper, N., & Hsing, P. Y. (2017). A guide to reproducible code in ecology and evolution. Technical report, British Ecological Society. Available at https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf.
- Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets, The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989
- Peter R. Hoyt, Christie Bahlai, Tracy K. Teal (Eds.), Erin Alison Becker, Aleksandra Pawlik, Peter Hoyt, Francois Michonneau, Christie Bahlai, Toby Reiter, et al. (2019, July 5). datacarpentry/spreadsheet-ecology-lesson: Data Carpentry: Data Organization in Spreadsheets for Ecologists, June 2019 (Version v2019.06.2). Zenodo. http://doi.org/10.5281/zenodo.3269869