Sharing and cleaning data
By Abigail Hudak
Overview
The goal of this document is to share some tips and ideas about structuring and cleaning data for sharing and collaborating. None of the concepts are comprehensive, but I hope you find some useful tips.
Cleaning data
Favorites of janitor
The package janitor is awesome for data cleaning. Consider learning this package if using a lot of Excel sheets from other users. Excel sheets may have bad columns names (i.e. with "?" or upper and lowercase letters) or empty data, etc. You want your R objects to be clean 1) for your sanity, 2) for readability of your code, and 3) for ease of coding.
A couple useful links for this package:
Making sure column names are clean
1library(janitor)
2library(tidyverse)
3data("iris")
4
5colnames(iris) #view current column names
6
7## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
8
9colnames(clean_names(iris, case = "screaming_snake")) #change column names to all caps and separted by underscore
10
11## [1] "SEPAL_LENGTH" "SEPAL_WIDTH" "PETAL_LENGTH" "PETAL_WIDTH" "SPECIES"
12
13colnames(clean_names(iris)) #change column names to lower case and separted by underscore
14
15## [1] "sepal_length" "sepal_width" "petal_length" "petal_width" "species"
16
17iris<-clean_names(iris) #keep this one
Evaluate your data: make a frequency table. tabyl() similar to table() in base R, but better (returns a data.frame and has other features).
1tabyl(iris, species)
2
3## species n percent
4## setosa 50 0.3333333
5## versicolor 50 0.3333333
6## virginica 50 0.3333333
7
8iris %>% tabyl(species) #can be piped-in, if you are into that
9
10## species n percent
11## setosa 50 0.3333333
12## versicolor 50 0.3333333
13## virginica 50 0.3333333
14
15iris %>% tabyl(species) %>% adorn_totals("row") #add total count row
16
17## species n percent
18## setosa 50 0.3333333
19## versicolor 50 0.3333333
20## virginica 50 0.3333333
21## Total 150 1.0000000
22
23#adorn_() functions ahve lots of basic reporting features!
Other great functions
1get_dupes() #find duplicate rows
2excel_numeric_to_date() #converts Excel serial number dates (42223) to a class date ("2020-04-1")
3remove_empty() #remove columns and/or rows that are entirely empty
Favorites of Tidyverse
Qucikly manipulate data into new structures, change column names, etc. See "cheat sheat" below for data warngling technqiues using dplyr and tidyr.
https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Reshaping data
1unite() #unites several columns into one
2separate() #separate one column into several
3spread() #spread rows into columns. Long data->wide data
4gather() #gather columns into rows. Wide data->long data
5arrange() #order rows by values ina column (ascending)
6arrange(x, desc(y)) #descending order
7rename() #rename individual columns (not good for changing a lot)
Script writing
General tips
Your code should be readable for your own sanity and for reproducibility.
- Use
::
to show which package does whatdplyr::lef_join(a, b, by = "x1")
- Spacing before and after operator
function = print()
- Indentation doesn't have any meaning in R like it does in other languages, but it is critical in making your code readable.
- Annotate your code!! Good to annotate above or next to a line.
1#gross
2new<-data.frame %>% select(this_column)%>% group_by(this_variable) %>%mutate(new_column = mean(that_column))
3
4#better
5new <- data.frame %>% #select data.frame to manipulate
6 select(this_column, that_column) %>% #select columns to keep
7 group_by(that_column) %>% #group together by a column
8 mutate(new_column = mean(that_column)) #make a new column of the means
9
10#alternative styles
11
12new <- data.frame %>% #select data.frame to manipulate
13 select(this_column, that_column) %>% #select columns to keep
14 group_by(that_column) %>% #group together by a column
15 mutate(new_column = mean(that_column)) #make a new column of the means
16
17 #select data.frame to manipulate
18new <- data.frame %>%
19 #select columns to keep
20 select(this_column, that_column) %>%
21 #group together by a column
22 group_by(that_column) %>%
23 #make a new column of the means
24 mutate(new_column = mean(that_column))
Script writing mediums
R markdown
RMD (what this document is) is really awesome.
Some key features allow for flexibility in visualization of code and data sharing depedning on your audience. For a reader who may not be interested in the code, you can hide it and only show the output. However, if your reader would want to see what packagaes you used, the code, etc. you can have that shown as well.
common chunk options
hide message: message = FALSE
hide warning: warning = FALSE
hide all : include = FALSE
hide results : results = "hide" or results = FALSE
only print output: echo = FALSE
prevent code from running: eval = FALSE
https://cougrstats.wordpress.com/2019/09/12/a-tour-of-r-markdown/
R script
I am personally not a fan because I: 1) don't like autofill and 2) like to View() my data a lot so I don't like having my script in that same spot.
Notepad++
Big fan. Perks: no autofill, automatic indentation, can have a seprate window for code and RStudio (especially nice when you have a monitor)
Notepad
Before you judge me...there are perks! I like to use this as my scratch space to play around with code. What's great about saving code in a text file is that you can search in your file explorer for key terms to help you find a code. I find this extremely helpful so I can go back and find a function/code/etc. I want to use again. Downside: can only Ctrl + Z once (why this is good to serve as scratch space) and no automatic indentation.
Posting for help on Stack Overflow
To get good feedback, you need to post good questions. For others to help you: 1) your data structure should be clearly laid out, 2) your goal should be clearly stated, and 3) your error/problem/question should be clearly addressed. Usually also good to post code that you have already tried and the outcome or error message. If possible, post an example of what you want the outcome to look like.
Usually good to post dummy data that have intuitive variables and easy to use numbers (if possible). Remember if you are posting biologically-related concepts, you are limiting who can help you if you aren't being clear. To post dummy data use rnorm(n, mean = , sd = )
to make a normal distribution of numbers or runif(n, min = , max = )
to generate numbers from uniform distribution.
Publicly sharing code
RPubs easy to use and free.
Github is widely used and allows for version control. http://swcarpentry.github.io/git-novice/