An Introduction to data.table using synthetic lat/lon data
By Nicholas A. Potter
This is a short introduction to using data.table based on my work with climate data.
Initial setup
Let's begin by generating some data to work with. We'll use the following packages:
# If you don't have the packages below, install with:
# install.packages(c("data.table", "readr", "dplyr", "microbenchmark"))

library(data.table)
library(dplyr)          # tidyverse data manipulation
library(readr)          # tidyverse reading in data
library(microbenchmark) # for timing code to see which is faster
We will work with a grid of latitude and longitude points, and we'll generate some fake growing degree day (gdd) and precipitation (prec) data:
# grid resolution
res <- 0.0416666666666667

# Grid extents
lats <- seq( 24.358333333333333, 52.909999999999997, by = res)
lons <- seq(-124.933333300000000, -66.016666633333330, by = res)

# Define a data.table
dt <- data.table(expand.grid(lat = lats, lon = lons))

# Default printing of data.tables is screen friendly:
dt

##              lat        lon
##      1: 24.35833 -124.93333
##      2: 24.40000 -124.93333
##      3: 24.44167 -124.93333
##      4: 24.48333 -124.93333
##      5: 24.52500 -124.93333
##     ---
## 970686: 52.73333  -66.01667
## 970687: 52.77500  -66.01667
## 970688: 52.81667  -66.01667
## 970689: 52.85833  -66.01667
## 970690: 52.90000  -66.01667

# Assume:
# GDD is a function of latitude (areas closer to the poles are less warm)
# Precip is random

# Equivalent to dplyr::mutate(gdd = ..., prec = ...)
dt[, `:=`(
  gdd = 3000 - 70*(lat - 24) + rnorm(n = nrow(dt), sd = 100),
  prec = pmax(0, rnorm(n = nrow(dt), mean = 12, sd = 3)))]
dt

##              lat        lon       gdd     prec
##      1: 24.35833 -124.93333 3056.2775 10.66249
##      2: 24.40000 -124.93333 3234.5665 12.14676
##      3: 24.44167 -124.93333 2920.4358 10.33538
##      4: 24.48333 -124.93333 2850.3544 10.30929
##      5: 24.52500 -124.93333 2980.4107 14.82836
##     ---
## 970686: 52.73333  -66.01667 1069.3607 12.03940
## 970687: 52.77500  -66.01667  812.7283 13.76590
## 970688: 52.81667  -66.01667  920.2619 11.02060
## 970689: 52.85833  -66.01667 1137.4059 11.92815
## 970690: 52.90000  -66.01667  946.3984 17.25918

# For comparison, a data.frame
df <- as.data.frame(dt)
A good reference comparing data.table and dplyr: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/
If you're wedded to dplyr grammar and don't want to learn data.table, try the dtplyr package.
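As a rough sketch of what that looks like (assuming dtplyr is installed; the grouping columns here are just illustrative), dtplyr wraps a data.table in a lazy translation layer and converts dplyr verbs into data.table calls:

# A minimal dtplyr sketch (assumes the dtplyr package is installed)
library(dtplyr)

dt %>%
  lazy_dt() %>%   # wrap dt in a lazy translation layer
  mutate(lat0 = round(lat), lon0 = round(lon)) %>%
  group_by(lat0, lon0) %>%
  summarize(gdd = mean(gdd)) %>%
  as_tibble()     # collect the result as a tibble

Calling show_query() in place of as_tibble() prints the data.table code the pipeline translates to.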
Why data.table?
When is data.table perhaps better than the tidyverse? I like and use both, so I don't hold with the idea that there is one that is better. Instead, there are specific reasons to use each. I use data.table when:
1. I need complex transformations on large data sets
As the number of rows increases, data.table becomes increasingly faster than a data.frame or tibble. This can turn a several-day process into a day or less, which is huge when you inevitably end up rerunning things. By converting a script to use data.table and sp, and by refactoring functions, I decreased the processing time for one year of climate data from four days to five hours. I had 90 years of data to process for a minimum of six different models...
For example, let's time a summary of gdd at a coarser lat/lon grid:
microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat = round(lat, 2), lon = round(lon, 2))],
  times = 10, unit = "ms")

## Unit: milliseconds
##                                                                        expr
## dt[, .(gdd = mean(gdd)), by = .(lat = round(lat, 2), lon = round(lon, 2))]
##      min       lq     mean   median       uq      max neval
## 263.4171 265.3146 302.2987 268.0365 355.6186 364.7961    10

microbenchmark(
  df %>%
    group_by(lat = round(lat, 2), lon = round(lon, 2)) %>%
    summarize(gdd = mean(gdd)),
  times = 10, unit = "ms")

## Unit: milliseconds
##                                                                                      expr
## df %>% group_by(lat = round(lat, 2), lon = round(lon, 2)) %>% summarize(gdd = mean(gdd))
##      min       lq     mean   median       uq      max neval
## 2156.705 2544.885 2627.235 2661.827 2745.505 2816.072    10
2. tidyverse grammar gets too long / complex
People will also say this is a negative, because other people can't read your code as easily. I'm not sure I agree. In my experience as a researcher, we don't often write code collaboratively. Your most common reader is going to be yourself in three to six months, when you have to revisit something. So documenting and writing clear code is important, but data.table code is clear; it's just a different language from the tidyverse. And it's wonderfully succinct at times. For example, in dplyr you might write:
# dplyr approach
df %>%
  mutate(lat = round(lat), lon = round(lon)) %>%
  group_by(lat, lon) %>%
  summarize(gdd = mean(gdd))

# data.table approach
dt[, .(gdd = mean(gdd)), by = .(lat = round(lat), lon = round(lon))]
That doesn't seem like much for one transformation, but if the number of transformations is high, either because you have multiple data files or multiple variables that each need a different transformation, the difference in code length is substantial. It's much easier to look at 20 lines of data.table transformations than 50 lines of dplyr transformations that accomplish the same thing.
Using data.table
The grammar of data.table is DT[i, j, other], where i is a logical selector on rows, j is operations on columns, and other holds additional arguments such as the groups to operate by, the columns to operate on, and so on.
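For instance, a minimal sketch combining all three parts on the dt defined above (the 30-degree cutoff and the rounding are arbitrary choices for illustration):

# i filters rows, j computes on columns, and `by` (one of the "other" arguments) groups:
dt[lat < 30,                               # i: keep rows south of 30 degrees
   .(gdd = mean(gdd), prec = mean(prec)),  # j: summarize two columns
   by = .(lon0 = round(lon))]              # other: one group per rounded longitude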
I Operations (select rows)
# Select using the "i" argument
dt[lat < 25] # Equivalent to df %>% dplyr::filter(lat < 25)
##             lat        lon      gdd      prec
##     1: 24.35833 -124.93333 3056.277 10.662486
##     2: 24.40000 -124.93333 3234.566 12.146765
##     3: 24.44167 -124.93333 2920.436 10.335384
##     4: 24.48333 -124.93333 2850.354 10.309294
##     5: 24.52500 -124.93333 2980.411 14.828355
##    ---
## 22636: 24.81667  -66.01667 2902.917 11.866986
## 22637: 24.85833  -66.01667 2901.789  9.461943
## 22638: 24.90000  -66.01667 2822.173 12.930729
## 22639: 24.94167  -66.01667 2830.030 14.225363
## 22640: 24.98333  -66.01667 3019.428 12.048109
dt[lat < 25 & lon > -67]
## lat lon gdd prec
## 1: 24.35833 -66.97500 2847.152 10.236523
## 2: 24.40000 -66.97500 2997.729 10.588268
## 3: 24.44167 -66.97500 2912.901 15.688600
## 4: 24.48333 -66.97500 2782.295 13.240460
## 5: 24.52500 -66.97500 3024.306 13.872522
## ---
## 380: 24.81667 -66.01667 2902.917 11.866986
## 381: 24.85833 -66.01667 2901.789 9.461943
## 382: 24.90000 -66.01667 2822.173 12.930729
## 383: 24.94167 -66.01667 2830.030 14.225363
## 384: 24.98333 -66.01667 3019.428 12.048109
J Operations (operate on columns)
# Perform an operation on specific columns
dt[, .(
  lat = mean(lat),
  lon = mean(lon),
  gdd = mean(gdd))]

##         lat     lon     gdd
## 1: 38.62917 -95.475 1975.92

# Alternatively, for all columns or just specific ones:
dt[, lapply(.SD, mean, na.rm = TRUE)] # equivalent to dplyr::transmute_all()

##         lat     lon     gdd     prec
## 1: 38.62917 -95.475 1975.92 11.99819

dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("gdd", "prec")]

##        gdd     prec
## 1: 1975.92 11.99819

# A more complicated function:
# center GDD by removing the mean
dt[, .(lat, lon, d_gdd = gdd - mean(gdd))]

##              lat        lon      d_gdd
##      1: 24.35833 -124.93333  1080.3575
##      2: 24.40000 -124.93333  1258.6465
##      3: 24.44167 -124.93333   944.5158
##      4: 24.48333 -124.93333   874.4344
##      5: 24.52500 -124.93333  1004.4908
##     ---
## 970686: 52.73333  -66.01667  -906.5593
## 970687: 52.77500  -66.01667 -1163.1916
## 970688: 52.81667  -66.01667 -1055.6581
## 970689: 52.85833  -66.01667  -838.5141
## 970690: 52.90000  -66.01667 -1029.5216

# Add a new column to the same data set by reference
dt[, gd_gdd := gdd - mean(gdd)] # equivalent to dplyr::mutate()

# For multiple variables at once:
dt[, `:=`(gd_gdd = gdd - mean(gdd),
          gd_prec = prec - mean(prec))]
dt

##              lat        lon       gdd     prec     gd_gdd     gd_prec
##      1: 24.35833 -124.93333 3056.2775 10.66249  1080.3575 -1.33570660
##      2: 24.40000 -124.93333 3234.5665 12.14676  1258.6465  0.14857212
##      3: 24.44167 -124.93333 2920.4358 10.33538   944.5158 -1.66280901
##      4: 24.48333 -124.93333 2850.3544 10.30929   874.4344 -1.68889873
##      5: 24.52500 -124.93333 2980.4107 14.82836  1004.4908  2.83016281
##     ---
## 970686: 52.73333  -66.01667 1069.3607 12.03940  -906.5593  0.04120797
## 970687: 52.77500  -66.01667  812.7283 13.76590 -1163.1916  1.76771001
## 970688: 52.81667  -66.01667  920.2619 11.02060 -1055.6581 -0.97759204
## 970689: 52.85833  -66.01667 1137.4059 11.92815  -838.5141 -0.07004544
## 970690: 52.90000  -66.01667  946.3984 17.25918 -1029.5216  5.26098430

# Or equivalently
dt[, c("gd_gdd", "gd_prec") := .(gdd - mean(gdd), prec - mean(prec))]

# Group transformations
dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
dt[, `:=`(gd_gdd = gdd - mean(gdd),
          gd_prec = prec - mean(prec)),
   by = .(lat0, lon0)]
dt

##              lat        lon       gdd     prec     gd_gdd    gd_prec lat0 lon0
##      1: 24.35833 -124.93333 3056.2775 10.66249   95.14628 -1.2601424   24 -125
##      2: 24.40000 -124.93333 3234.5665 12.14676  273.43529  0.2241363   24 -125
##      3: 24.44167 -124.93333 2920.4358 10.33538  -40.69542 -1.5872448   24 -125
##      4: 24.48333 -124.93333 2850.3544 10.30929 -110.77682 -1.6133345   24 -125
##      5: 24.52500 -124.93333 2980.4107 14.82836   45.97717  3.0862385   25 -125
##     ---
## 970686: 52.73333  -66.01667 1069.3607 12.03940   62.21880  0.2354413   53  -66
## 970687: 52.77500  -66.01667  812.7283 13.76590 -194.41356  1.9619434   53  -66
## 970688: 52.81667  -66.01667  920.2619 11.02060  -86.87996 -0.7833587   53  -66
## 970689: 52.85833  -66.01667 1137.4059 11.92815  130.26402  0.1241879   53  -66
## 970690: 52.90000  -66.01667  946.3984 17.25918  -60.74347  5.4552177   53  -66

# Removing variables
dt[, `:=`(gd_gdd = NULL, gd_prec = NULL)] # dplyr::select(-gd_gdd, -gd_prec)
Other Operations (keys, indices, merges, and renaming)
# Create some 2-digit lat/lon groups
dt[, `:=`(lat2 = round(lat, 2), lon2 = round(lon, 2))]

# Having no key is okay, but operations are slower
key(dt) # No key set

## NULL

microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
  times = 10, unit = "ms")

## Unit: milliseconds
##                                         expr
## dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)]
##      min       lq     mean   median      uq     max neval
## 155.8982 289.4522 323.1314 303.3228 310.706 530.188    10

# Setting the keys you group by makes operations faster
setkey(dt, lat2, lon2)
setkeyv(dt, c("lat2", "lon2")) # Equivalent - useful for quoted vars in functions
key(dt) # Now with lat2, lon2

## [1] "lat2" "lon2"

microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
  times = 10, unit = "ms")

## Unit: milliseconds
##                                         expr
## dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)]
##     min      lq     mean  median      uq    max neval
## 44.2884 46.3645 66.44499 69.0569 79.6183 96.594    10
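Since this section's heading also promises indices, merges, and renaming, here is a minimal sketch of each (the stations lookup table is made up for illustration):

# Rename columns in place (setnames modifies dt by reference, no copy)
setnames(dt, old = c("lat2", "lon2"), new = c("lat_r2", "lon_r2"))
setnames(dt, old = c("lat_r2", "lon_r2"), new = c("lat2", "lon2")) # undo

# A secondary index allows fast subsetting on non-key columns
setindex(dt, lat)
indices(dt)

# A join: X[Y, on = ...] looks up each row of Y in X
stations <- data.table(lat2 = c(24.36, 52.90),       # hypothetical lookup table
                       lon2 = c(-124.93, -66.02),
                       name = c("south", "north"))
dt[stations, on = .(lat2, lon2)] # roughly dplyr::left_join(stations, dt)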
In Functions (or for loops)
#' Center variables of a data.table.
#' @param d a data.table.
#' @param vars a character vector of column names to center.
#' @param byvars a character vector of column names to group by when centering.
#' @param na.rm exclude NA values when computing the mean?
#' @return a data.table of centered variables.
center_dt <- function(d, vars, byvars, na.rm = TRUE) {
  setkeyv(d, byvars)
  d[, lapply(.SD, function(x) { x - mean(x, na.rm = na.rm) }),
    by = byvars, .SDcols = vars]
}
dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
center_dt(dt, vars = c("gdd", "prec"), byvars = c("lat0", "lon0"))

##         lat0 lon0         gdd       prec
##      1:   24 -125   95.146283 -1.2601424
##      2:   24 -125  -45.366808  0.7401815
##      3:   24 -125  -59.763709 -1.5278002
##      4:   24 -125    4.492488 -6.7632807
##      5:   24 -125   36.774394  1.3920487
##     ---
## 970686:   53  -66  -72.051125  2.6577200
## 970687:   53  -66   75.705411  5.4217230
## 970688:   53  -66  -57.443669  1.3007194
## 970689:   53  -66 -145.215167  1.6826111
## 970690:   53  -66  -60.743466  5.4552177

# Alternatively, specify just the transformation as a function
center <- function(x) { x - mean(x) }
dt[, lapply(.SD, function(x) { center(x) }),
   by = c("lat0", "lon0"), .SDcols = c("gdd", "prec")]

##         lat0 lon0         gdd       prec
##      1:   24 -125   95.146283 -1.2601424
##      2:   24 -125  -45.366808  0.7401815
##      3:   24 -125  -59.763709 -1.5278002
##      4:   24 -125    4.492488 -6.7632807
##      5:   24 -125   36.774394  1.3920487
##     ---
## 970686:   53  -66  -72.051125  2.6577200
## 970687:   53  -66   75.705411  5.4217230
## 970688:   53  -66  -57.443669  1.3007194
## 970689:   53  -66 -145.215167  1.6826111
## 970690:   53  -66  -60.743466  5.4552177
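Since the heading mentions for loops: the same grouped centering can also be written as a loop that adds one centered column per variable by reference (the ctr_ prefix is just illustrative):

# Loop over variables, adding a centered column for each by reference
for (v in c("gdd", "prec")) {
  newcol <- paste0("ctr_", v) # e.g. "ctr_gdd"; the name is illustrative
  dt[, (newcol) := get(v) - mean(get(v)), by = .(lat0, lon0)]
}
dt[, c("ctr_gdd", "ctr_prec") := NULL] # remove them again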
Reading and writing data
data.table provides fread and fwrite to quickly read and write data. fread() also takes a URL if you want to directly read data over http.
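For example, a quick sketch (the URL is a placeholder, and select is fread's option for reading only some columns):

# Read directly from a URL (placeholder address):
# remote <- fread("https://example.com/path/to/data.csv")

# Read only some columns, which is faster and uses less memory:
subset_dt <- fread("dt.csv", select = c("lat", "lon", "gdd"))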
microbenchmark(fwrite(dt, file = "dt.csv"), times = 1, unit = "ms")

## Unit: milliseconds
##                        expr      min       lq     mean   median       uq      max neval
## fwrite(dt, file = "dt.csv") 682.1413 682.1413 682.1413 682.1413 682.1413 682.1413     1

microbenchmark(write_csv(dt, path = "dt.csv"), times = 1, unit = "ms")

## Unit: milliseconds
##                           expr      min       lq     mean   median       uq      max neval
## write_csv(dt, path = "dt.csv") 3856.494 3856.494 3856.494 3856.494 3856.494 3856.494     1

microbenchmark(saveRDS(dt, file = "dt.rds"), times = 1, unit = "ms")

## Unit: milliseconds
##                         expr      min       lq     mean   median       uq      max neval
## saveRDS(dt, file = "dt.rds") 4002.413 4002.413 4002.413 4002.413 4002.413 4002.413     1

microbenchmark(fread("dt.csv"), times = 1, unit = "ms")

## Unit: milliseconds
##            expr      min       lq     mean   median       uq      max neval
## fread("dt.csv") 298.4178 298.4178 298.4178 298.4178 298.4178 298.4178     1

microbenchmark(read_csv("dt.csv"), times = 1, unit = "ms")

## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max neval
## read_csv("dt.csv") 2899.474 2899.474 2899.474 2899.474 2899.474 2899.474     1

microbenchmark(readRDS("dt.rds"), times = 1, unit = "ms")

## Unit: milliseconds
##              expr      min       lq     mean   median       uq      max neval
## readRDS("dt.rds") 529.9389 529.9389 529.9389 529.9389 529.9389 529.9389     1