Research Profile: Downloading, Rolling Joins, and MRT

library(tidyverse)
library(dplyr)
library(downloader) #provides download(), used to fetch the buoy data below
library(data.table)
library(mvpart)
library(vegan)

#setwd("C:/Users/Stefanie/Documents/WSU/rGroup") #set working directory

Author: Stefanie Watson

Getting Data

You can download the species subset for this research profile here. Other data is downloaded from Canada using the downloader package.

The environmental data that we are using is freely available through the national data buoy center. First, download the zipped file to the correct location (use mode 'wb' so the binary file downloads correctly). The URL you need is one that immediately starts a download when you paste it into your address bar; in this case it was not the same as the URL displayed in the address bar, and on Windows 10 I had to hover my mouse over the download link to find the correct URL. You will know you have the right URL when pasting it into the address bar immediately starts a download instead of taking you to a webpage. Then unzip the file into a folder and read the csv into R. I have used locations on my computer, but you can create a temp file if you are unsure where to save (see the sketch after the code below).

I have also included a 50-observation subset of species data to work with later.

#download the buoy data directly from Canada using the downloader package
download("http://www.meds-sdmm.dfo-mpo.gc.ca/alphapro/wave/waveshare/csvData/C46132_CSV.ZIP", dest="dataset.zip", mode="wb")

#unzip the file to be able to read in the csv
unzip("dataset.zip")

buoy <- read_csv("C46132.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   STN_ID = col_character(),
##   DATE = col_character(),
##   Q_FLAG = col_integer(),
##   X24 = col_character()
## )

## See spec(...) for full column specifications.

head(buoy)

## # A tibble: 6 x 24
##   STN_ID DATE     Q_FLAG LATITUDE LONGITUDE DEPTH  VCAR  VTPK `VWH$`  VCMX
##   <chr>  <chr>     <int>    <dbl>     <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 C46132 05/05/1~      7     49.7      128.  2040  0    12.2     1.5   2.4
## 2 C46132 05/06/1~      7     49.7      128.  2040  0    28.4     1.8   3.6
## 3 C46132 05/07/1~      7     49.7      128.  2040  0.12 11.1     2.6   0
## 4 C46132 05/07/1~      7     49.7      128.  2040  0    12.2     2.6   4.7
## 5 C46132 05/08/1~      7     49.7      128.  2040  0    12.2     1.7   3
## 6 C46132 06/01/1~      7     49.7      128.  2040  0     4.27    3.8   5.9
## # ... with 14 more variables: `VTP$` <dbl>, WDIR <dbl>, WSPD <dbl>,
## #   `WSS$` <dbl>, GSPD <dbl>, WDIR_1 <dbl>, WSPD_1 <dbl>, `WSS$_1` <dbl>,
## #   GSPD_1 <dbl>, ATMS <dbl>, ATMS_1 <dbl>, DRYT <dbl>, SSTP <dbl>,
## #   X24 <chr>

species <- read_csv("speciesSub.csv")

## Parsed with column specification:
## cols(
##   Time = col_datetime(format = ""),
##   species1 = col_integer(),
##   species2 = col_integer(),
##   species3 = col_integer(),
##   species4 = col_integer(),
##   species5 = col_integer()
## )

head(species)

## # A tibble: 6 x 6
##   Time                species1 species2 species3 species4 species5
##   <dttm>                 <int>    <int>    <int>    <int>    <int>
## 1 2014-07-04 06:30:00        1        0        2       11        0
## 2 2015-02-23 04:30:00        0        1        1        0        2
## 3 2014-06-13 00:30:00        0        0        2        8        1
## 4 2015-02-15 08:30:00        1        0        0        1        0
## 5 2014-08-04 07:30:00        1        0        0       24        0
## 6 2014-07-16 09:30:00        0        0        1       24        0
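If you are unsure where to save, here is a minimal sketch of the temp-file approach mentioned above. It assumes the same URL and that the csv inside the archive is named C46132.csv, as before; the files live under tempdir() and disappear when the R session ends.

#download into a temporary file instead of a fixed location
tmp <- tempfile(fileext = ".zip")
download("http://www.meds-sdmm.dfo-mpo.gc.ca/alphapro/wave/waveshare/csvData/C46132_CSV.ZIP",
         dest = tmp, mode = "wb")

#unzip into the session's temporary directory and read the csv from there
unzip(tmp, exdir = tempdir())
buoy <- read_csv(file.path(tempdir(), "C46132.csv"))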

Date/Times and Rolling Joins

The environmental data is a mess right now. We only need the dates that span our study and only want to use a few of the variables. We need to clean up the date/time formatting and then we can focus on subsetting this data to our needs.

Note: You may need to convert the species timestamp back to POSIXct if you opened the file in Excel.
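For example, something like the following (a sketch only; the format string is an assumption and depends on how Excel rewrote your timestamps):

#convert character timestamps back to POSIXct; adjust the format string to match your file
species$Time <- as.POSIXct(species$Time, format = "%m/%d/%Y %H:%M")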

#convert to a standard date/time format: this is important if the dates are in the wrong format
buoy$DATE <- strptime(buoy$DATE, '%m/%d/%Y %H:%M')
head(buoy$DATE)

## [1] "1994-05-05 19:41:00 PDT" "1994-05-06 10:41:00 PDT"
## [3] "1994-05-07 09:41:00 PDT" "1994-05-07 16:41:00 PDT"
## [5] "1994-05-08 19:41:00 PDT" "1994-06-01 08:42:00 PDT"

#convert to POSIXct
buoy$DATE <- as.POSIXct(buoy$DATE)

#subset the buoy data
#Below is my attempt to fix a variable name ending in $.
#It didn't work and messed up the dataset more than it helped; any ideas are welcome.
#buoy$VWHB <- buoy[,9]

buoy <- buoy %>%
  filter(DATE >= as.POSIXct('2014-06-04 00:00:00 PDT') &
           DATE <= as.POSIXct('2015-04-08 00:00:00 PDT')) %>%
  select(DATE, WSPD, GSPD, ATMS, SSTP)

We can now use a rolling join to match the species information with the correct times in the buoy data. You will want to use a join whenever you are matching data. The functions rbind and cbind require that your data is already ordered and will append regardless of whether the observations match. Usually this is fine, but for disjoint datasets you should use joins. A join matches two datasets by a common key, which you specify as a unique identifier for each observation (typically an ID, but here I use timestamps). You will probably mostly use left/right joins to fill a dataset; however, in those joins the keys must match exactly. Here I have timestamps taken at different minutes that I want to match, so I use a rolling join, which matches to the nearest time. I have also specified that it has to match to past measurements and that the measurement must be within an hour of the image.

#We need each dataset to have a key
buoy <- data.table(buoy, key = "DATE")
species <- data.table(species, key = "Time")

#match to the nearest time
#create a 1-hour matching window
hour <- 60*60*1 # number of seconds in 1 hour (60 sec * 60 min * 1 hr)
mergeData <- buoy[species, roll = -hour]
mergeData <- na.omit(mergeData)
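As an optional sanity check on the join (counts are data-dependent; with the 50-observation species subset you should end up with at most 50 matched rows once na.omit drops the species observations that had no buoy reading within the hour):

#compare row counts before and after the rolling join
nrow(species)
nrow(mergeData)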

Multivariate Regression Trees

This is a summary of how to use the 'mvpart' package in R, as first described by De'Ath (2002). The package is only available directly from GitHub, so make sure that you have installed it correctly; there are tutorials online on how to do so, and one possible approach is sketched below.
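One possible way to install it, assuming you use the archived cran/mvpart mirror on GitHub and already have devtools (plus a compiler toolchain) available:

#install mvpart from a GitHub mirror; the repository name is an assumption, adjust to the fork you use
#install.packages("devtools")
devtools::install_github("cran/mvpart")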

The first thing that I need to do is make sure that I have my data in order from above: the two datasets need to be merged and any transformations need to be done beforehand.

#transform species data: sqrt for Poisson counts and Wisconsin double standardization for natural species variation
mergeData[,6:10] <- mergeData[,6:10] %>%
  sqrt() %>%
  wisconsin()

head(mergeData)

##                   DATE WSPD GSPD   ATMS SSTP species1  species2  species3
## 1: 2014-06-05 04:30:00 10.6 12.6 1023.9 10.7        0 0.0000000 0.3865505
## 2: 2014-06-05 08:30:00 11.5 13.4 1023.2 10.4        0 0.0000000 0.6475296
## 3: 2014-06-07 05:30:00  5.4  6.6 1022.7 11.9        0 0.3095295 0.4377408
## 4: 2014-06-08 06:30:00  3.4  3.9 1024.0 12.3        0 0.4046282 0.4046282
## 5: 2014-06-10 13:30:00  6.9  8.3 1020.6 12.5        0 0.0000000 0.6216661
## 6: 2014-06-13 00:30:00  8.9 10.8 1010.5 13.4        0 0.0000000 0.3610369
##     species4  species5
## 1: 0.6134495 0.0000000
## 2: 0.3524704 0.0000000
## 3: 0.2527298 0.0000000
## 4: 0.1907436 0.0000000
## 5: 0.3783339 0.0000000
## 6: 0.2779263 0.3610369

Now we can make our first MRT. The first one that we are making is based on Euclidean distance and is pruned automatically by R. By default, R chooses the smallest tree within one standard error of the tree with the minimum cross-validated relative error. Using the summary function gives us more information about each of the nodes and leaves.

#tree pruned by R
comp <- mvpart(data.matrix(mergeData[,6:10])~WSPD+GSPD+ATMS+SSTP, mergeData)

summary(comp)

## Call:
## mvpart(form = data.matrix(mergeData[, 6:10]) ~ WSPD + GSPD +
##     ATMS + SSTP, data = mergeData)
##   n= 41
##
##          CP nsplit rel error   xerror      xstd
## 1 0.1308952      0 1.0000000 1.040219 0.1040792
## 2 0.1036941      1 0.8691048 1.141133 0.1384589
##
## Node number 1: 41 observations,    complexity param=0.1308952
##   Means=0.2009,0.1068,0.2581,0.3345,0.02654, Summed MSE=0.1969636
##   left son=2 (11 obs) right son=3 (30 obs)
##   Primary splits:
##       SSTP < 10.95   to the left,  improve=0.13089520, (0 missing)
##       WSPD < 5.45    to the right, improve=0.07554495, (0 missing)
##       GSPD < 6.25    to the left,  improve=0.06469340, (0 missing)
##       ATMS < 1002.65 to the right, improve=0.06151302, (0 missing)
##
## Node number 2: 11 observations
##   Means=0.1091,0.09661,0.4648,0.1973,0.04119, Summed MSE=0.2353352
##
## Node number 3: 30 observations
##   Means=0.2346,0.1106,0.1822,0.3848,0.02117, Summed MSE=0.1476592

This MRT has 2 leaves and 3 nodes: node 1 is a split and nodes 2-3 are leaves. The bar charts by each leaf show the average species count under that condition and give the number of observations in that condition (n=##). The relative error is reported at the bottom. The amount of variance the tree explains is 1 - relative error; in this case that is 1 - 0.869 = 0.131, or 13.1% of the variance. The complexity parameter (CP) in the summary gives the amount of variance explained by each split in the tree. In this case there is only one split, so all of the explained variance comes from that split.
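The same figure can be computed from the fitted object instead of reading it off the plot (a small sketch; cptable is the standard rpart component that mvpart objects also carry):

#variance explained = 1 - relative error after the single split (values from the CP table above)
1 - comp$cptable[2, "rel error"] # 1 - 0.8691048 = 0.1308952, about 13.1%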

If we want to, we can prune the tree ourselves instead. R will give us an interactive relative error chart from which to select a tree size. The chart is static in this R Markdown document, but when the plot is live you can select a blue node to view the different tree sizes. R Markdown has seen fit to select all of the trees for us and plot them overlapping below; this will not happen live.

mvpart(data.matrix(mergeData[,6:10])~WSPD+GSPD+ATMS+SSTP, mergeData, xv="p")

Finally, we can produce a summary of each of the leaves and conditions within our tree.

#To obtain the path to the leaf nodes

#search the frame for info on leaves and identify leaves with a logical vector
leafnodeRows <- grepl("leaf", comp$frame$var)

#determine the leaf positions
nodevals <- as.numeric(rownames(comp$frame)[leafnodeRows])

#create a list of rules taken from each node identified as a leaf
rules <- path.rpart(comp, nodevals)

##
##  node number: 2
##    root
##    SSTP< 10.95
##
##  node number: 3
##    root
##    SSTP>=10.95

#create a nice dataframe of the rules by formatting and casting
rulesdf <- do.call("rbind", lapply(rules, function(x) paste(x, collapse = " -AND- ")))
rulesdf <- data.frame(nodeNumber = rownames(rulesdf), rule = rulesdf[,1], stringsAsFactors = FALSE)
rulesdf

##   nodeNumber                   rule
## 2          2 root -AND- SSTP< 10.95
## 3          3 root -AND- SSTP>=10.95