9. Advanced spatial R and mapmaking: II

From 1,000 point-clicks to 1 script…

Learning objectives


We can use some of the same tools and data we used previously, but we can now also add a few options to download and actively use USGS/NHD data. These data include river stage and flow, water quality, station locations, watershed and streamline attributes, etc. Particularly for water scientists, it can be immensely useful to tie in additional data, or use standard datasets that are continuously updated (like the USGS gage network).

The Packages

First we’ll need a few packages we’ve not used yet. Please install these with install.packages() if you don’t have them.

library(tidyverse) # data wrangling & viz
library(purrr) # iteration
library(janitor) # name cleaning
library(glue) # pasting stuff together

library(sf) # analysis tools
library(mapview)  # interactive maps!
mapviewOptions(fgb = FALSE) # to save interactive maps

library(tigris) # county & state boundaries
library(units) # for convert spatial and other units
library(dataRetrieval) # download USGS river data
library(tmap) # mapping
library(tmaptools) # mapping tools

River Data: Find Nearest USGS Station

We’ve demonstrated how to join data and crop data in R, but let’s use some alternate options to download new data. We’ll focus on surface water here, and look at how we can download and map flowlines in R. Much of the USGS data network can be queried and downloaded in R. This may include data on water quality, river discharge, water temperature, spatial basins, and NHD flowlines. The {dataRetrieval} package is an excellent option for these operations.

Let’s find a few groundwater stations to use. Here we’ll grab one close to the American River and one close to the Cosumnes just for demonstration purposes, but this could be any X/Y point, or set of points you are interested in.

Groundwater Data

Iteration…remember {purrr}? Let’s use it here!

# get counties for CA as {sf}
ca_co <- counties("CA", cb = TRUE, progress_bar = FALSE)

# read the stations
gw_files <- list.files(path = "data/gwl/county",
                       full.names = TRUE, pattern = "*.csv")

# read all files into dataframes and combine with purrr
gw_df <- map_df(gw_files, ~read.csv(.x))

# the readr package will also do this same thing by default
# when passed a list of files with the same data types
gw_df <- read_csv(gw_files)

# now make "spatial" as sf objects
gw_df <- st_as_sf(gw_df, coords=c("LONGITUDE","LATITUDE"), 
                  crs=4326, remove=FALSE) %>% 
  # and transform!
  st_transform(., 4269)
Get the Sacramento County shapefile.
# get just sacramento: here we read in a shapefile:
sac_co <- st_read("data/shp/sac/sac_county.shp")
Reading layer `sac_county' from data source 
  using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 9 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -13565710 ymin: 4582007 xmax: -13472670 ymax: 4683976
Projected CRS: WGS 84 / Pseudo-Mercator
# check CRS
[1] 3857
# match with other data
sac_co <- st_transform(sac_co, st_crs(ca_co))
st_crs(sac_co) == st_crs(ca_co) # should be TRUE!
[1] TRUE
# make a box around sacramento county
# (a grid with an n=1) for inset
sac_box <- st_make_grid(sac_co, n = 1)

Now we can filter to our location of interest.

# filter to only Sacramento
gw_sac <- gw_df %>% 
  filter(COUNTY_NAME == "Sacramento")

# use this layer to filter to only locations (stations of interest)
sac_loi <- gw_sac %>% filter(STN_ID %in% c("52418", "5605"))

mapview(sac_co, alpha.regions=0, 
          color="black", lwd=3, legend=FALSE) +
  mapview(sac_loi, col.regions="purple",layer.name="Locations of Interest") 

Use findNLDI

The findNLDI function allows us to pass a single spatial point as well as a few different parameters like search upstream or downstream, and what we want to find, and then return a list of items (see the help page for using the function here), leveraging the hydro-network linked data index (NLDI)1. Note, we’ll need internet connectivity here for these functions to run successfully.

Let’s look only at mainstem flowlines from our locations of interest, and return the nearest NWIS sites as well as the NHD flowlines (streamlines). We’ll use the map() function to pass a list of stations along (here only 2, but this is flexible, and in practice we can map over a much larger number of stations).

First, we want to look up the COMID or location identifier for the centroids.


# Need to convert our locations of interest to WGS84
sac_loi <- sac_loi %>% 

# now we can go get flowline data!
us_nwis <- map(sac_loi$geometry,
                ~findNLDI(location = .x,
                          nav  = c("UM"), 
                          find = c("nwis", "flowlines"),
                          distance_km = 120)

Great, now we have a list of three or more things for each sf object we passed to findNLDI. In this case, we should have (for each location of interest we used):

Let’s play with these data and make some maps.

Extract the NLDI Info

There are a few options, and it depends on what your goal is. Here we show a few simple ways to pull these data out or collapse them. Remember, we can access things in lists with our [[]] too!

# we can split these data into separate data frame
# and add them as objects to the .Global environment.
# first add names based on our station IDs:
us_nwis <- set_names(us_nwis, nm=glue("id_{sac_loi$STN_ID}"))

# then split into separate dataframes
# us_nwis %>% list2env(.GlobalEnv)

# Or we can combine with map_df
us_flowlines <- map_df(us_nwis, ~rbind(.x$UM_flowlines))
us_nwissite <- map_df(us_nwis, ~rbind(.x$UM_nwissite))

mapview(sac_loi, col.region="purple", legend = TRUE, 
        cex=3, layer.name="GW LOI") +
  mapview(us_nwissite, col.regions="orange", 
          legend = TRUE, layer.name="UM NWIS Sites") +
  mapview(us_flowlines, color="steelblue", lwd=2, 
          layer.name="UM Flowlines", legend=FALSE)

Next, let’s filter to NWIS USGS stations that have flow data (generally these have 8-digit identifiers instead of a longer code which can be more water quality parameters), and pull streamflow data for the nearest station.

# use the stringr package, part of tidyverse to trim characters
usgs_stations <- us_nwissite %>% 
  filter(stringr::str_count(identifier) < 14)

# double check?
mapview(sac_loi, col.region="purple", legend = TRUE, 
        cex=3, layer.name="GW LOI") +
  mapview(us_nwissite, col.regions="orange", cex=2, 
          legend = TRUE, layer.name="UM NWIS Sites") +
  mapview(usgs_stations, col.regions="cyan4", 
          legend = TRUE, layer.name="USGS Gages") +
  mapview(us_flowlines, color="steelblue", lwd=2, 
          layer.name="UM Flowlines", legend=FALSE)

Snap to the Nearest Point

The final filter involves snapping our LOI points (n = 2) to the nearest USGS stations from the stations we filtered to above. We can then use these data to generate some analysis and exploratory plots.

Snapping spatial data can be tricky, mainly because decimal precision can cause problems. One solution is to add a slight buffer around points or lines to improve successful pairing.

For this example, we’ll use st_nearest_feature(), which gives us an index of the nearest feature (row) between two sets of spatial data. In this case, we have two sets of points.

# get row index of nearest feature between points:
usgs_nearest_index <- st_nearest_feature(sac_loi, usgs_stations)

# now filter using the row index
usgs_stations_final <- usgs_stations[usgs_nearest_index, ]

# get vector of distances from each ISD station to nearest USGS station
dist_to_loi <- st_distance(sac_loi, 
                           by_element = TRUE)

# use units package to convert units to miles or km
(dist_to_loi_mi <- units::set_units(dist_to_loi, miles))
Units: [miles]
[1] 13.696934  1.202355
(dist_to_loi_km <- units::set_units(dist_to_loi, km))
Units: [km]
[1] 22.043079  1.935003
# bind back to final dataset:
usgs_stations_final <- usgs_stations_final %>% 
  cbind(dist_to_loi_mi, dist_to_loi_km)

# now plot!
mapview(usgs_stations, cex = 2.75, col.regions = "gray",
        layer.name = "USGS Stations") +
  mapview(us_flowlines, legend = FALSE, color = "steelblue") + 
  mapview(usgs_stations_final, col.regions = "yellow",
          layer.name = "Nearest USGS Station to LOI") +
  mapview(sac_loi, col.regions="purple",
          layer.name = "GW LOI")

Notice anything? How could we approach this differently so we pulled at least one gage per river instead of two in one river and none in the other?

Select Nearest by Distance

If we want to select more than a single point based on a threshold distance we can use a non-overlapping join and specify a distance. For many spatial operations, using a projected CRS is important because it generally provides a more accurate calculation since it is based on a “flat” surface and uses a linear grid. It has the additional advantage that we tend to process and understand information that is grid based more easily than curvilinear (degree-based), so a distance of 100 yards or 100 meters makes sense when compared with 0.001 degrees.

Therefore, first we transform our data into a projected CRS, then we do our join and distance calculations, then we can transform back to our lat/lon CRS.

usgs_stations <- st_transform(usgs_stations, 3310)
sac_loi <- st_transform(sac_loi, 3310)

# use a search within 15km to select stations
usgs_stations_15km <- st_join(sac_loi,
                              dist = 15000) %>% 
  st_drop_geometry() %>%
  filter(!is.na(X)) %>% # can't have NA's 
  st_as_sf(coords = c("X","Y"), crs = 4326)

mapview(usgs_stations_15km,  col.regions = "yellow") +
  mapview(sac_loi, col.regions = "purple") +
  mapview(us_flowlines, legend = FALSE, color = "steelblue")

Download USGS Data with NLDI

Now we have our stations of interest, and our climate data, let’s download river flow and water temperature data with the {dataRetrieval} package.

# strip out the "USGS" from our identifier with "separate"
usgs_stations_15km <- usgs_stations_15km %>% 
  tidyr::separate(col = identifier, # the column we want to separate
                  into = c("usgs", "usgs_id"), # the 2 cols to create
                  remove = FALSE) %>% # keep the original column
  select(-usgs) # drop this column

# see if there's daily discharge/wtemperature data available ("dv"):
dataRetrieval::whatNWISdata(siteNumber = usgs_stations_15km$usgs_id, 
                            service = "dv", 
                            parameterCd = c("00060", "00010"),
                            statCd = "00003")
    agency_cd  site_no
163      USGS 11446500
166      USGS 11446500
243      USGS 11446700
260      USGS 11446980
                                          station_nm site_tp_cd
163                        AMERICAN R A FAIR OAKS CA         ST
166                        AMERICAN R A FAIR OAKS CA         ST
    dec_lat_va dec_long_va coord_acy_cd dec_coord_datum_cd alt_va
163   38.63546   -121.2277            F              NAD83  71.53
166   38.63546   -121.2277            F              NAD83  71.53
243   38.59129   -121.3327            S              NAD83  45.00
260   38.56713   -121.3883            S              NAD83  25.00
    alt_acy_va alt_datum_cd   huc_cd data_type_cd parm_cd stat_cd
163       0.01       NGVD29 18020111           dv   00010   00003
166       0.01       NGVD29 18020111           dv   00060   00003
243       2.00       NGVD29 18020111           dv   00010   00003
260       2.50       NGVD29 18020111           dv   00010   00003
     ts_id loc_web_ds medium_grp_cd parm_grp_cd  srs_id access_cd
163 234322         NA           wat        <NA> 1645597         0
166  10977         NA           wat        <NA> 1645423         0
243 234323         NA           wat        <NA> 1645597         0
260 234324         NA           wat        <NA> 1645597         0
    begin_date   end_date count_nu
163 1971-07-20 2023-05-18     3146
166 1904-10-01 2023-05-17    43328
243 2016-10-01 2023-05-17     2418
260 2016-10-01 2023-05-18     2393
# Extract Streamflow for identified sites
usgs_Q <- readNWISdv(usgs_stations_15km$usgs_id, 
                parameterCd = "00060", 
                startDate = "2016-10-01") %>% 

# extract water temp
usgs_wTemp <- readNWISdv(usgs_stations_15km$usgs_id, 
                parameterCd = "00010", 
                startDate = "2016-10-01") %>% 

Plot our USGS Data

Now we have the data, let’s plot!

# Plot! 
(hydro <- ggplot() + 
   geom_line(data = usgs_Q, aes(x = Date, y = Flow, col = site_no),
             size = .5) + 
   scale_color_brewer(palette = "Set1") +
   facet_wrap(~site_no, scales = "free_x") + 
   theme_classic() + 
   labs(title="USGS Discharge Data",
        x="", y="Discharge (cfs)") +
   theme(legend.position = "none"))
# Plot temp
(gg_temp <- ggplot() + 
    geom_path(data = usgs_wTemp, 
              aes(x = Date, y = Wtemp, col = site_no),
              size = .5) + 
    facet_wrap(~site_no) + 
    theme_bw() + 
    labs(title="USGS Water Temperature Data",
         x="", y="Water Temperature (C)") +
    scale_color_viridis_d() +
    theme(legend.position = "none"))


In the plots above, we see the gaps in data are connected when using a line plot. Ideally, we would prefer to visualize these data with gaps (no line) where there is no data. To do this, we can leverage handy functions from the {tidyr} package: complete() and fill().

Click for Answers!
# load the package

# fill all unique combinations of Date in our data
usgs_wTemp2 <- usgs_wTemp %>%
  group_by(site_no) %>% # group by our gages first
  complete(Date = seq.Date(min(Date), max(Date), by="day")) %>% 
  # then list the cols we want to fill same value through whole dataset
  fill(site_no, agency_cd)

# now regenerate plot!
# Plot temp
(gg_temp2 <- ggplot() + 
    geom_path(data = usgs_wTemp2, 
              aes(x = Date, y = Wtemp, col = site_no),
              size = .5) + 
    facet_wrap(~site_no) + 
    theme_bw() + 
    labs(title="USGS Water Temperature Data",
         x="", y="Water Temperature (C)") +
    scale_color_viridis_d() +
    theme(legend.position = "none"))

Make a Map with {tmap}

One final component that we haven’t covered much is how to create a publication ready-map. We can do this using the {ggplot2} package in conjunction with geom_sf(), or we can use some alternate packages which are built specifically to work with spatial data and use a similar code structure to {ggplot2}.

The {tmap} and {tmaptools} are excellent options to create a nice map that can be used in any report or publication. First, let’s load the packages we’ll use.

Now we build our layers using a similar structure as {ggplot2}.

final_tmap <-

  # counties
  tm_shape(sac_co) +
  tm_polygons(border.col = "gray50", col = "gray50", 
              alpha = 0.1, border.alpha = 0.9, lwd = 0.5, lty = 1) +

  # rivers
  tm_shape(us_flowlines) +
  tm_lines(col = "steelblue", lwd = 2) +
  # points: LOI stations
  tm_shape(sac_loi) +
    tm_symbols(col = "orange3", border.col = "gray20", 
               shape = 21, size = 1.5, alpha = 1) +
    tm_add_legend('symbol',shape = 21, col='orange3', border.col='black', size=1,
                 labels=c(' LOI')) +
  # points usgs
  tm_shape(usgs_stations_15km) +
    tm_symbols(col = "cyan4", border.col = "gray20", 
               shape = 23, size = 0.5) +
  tm_add_legend('symbol',shape = 23, col='cyan4', border.col='black', size=1,
                 labels=c(' USGS Stations')) +
  # layout
              frame = FALSE,
              legend.outside = FALSE, attr.outside = FALSE,
              inner.margins = 0.01, outer.margins = (0.01),
              legend.position = c(0.2,0.8)) +
    tm_compass(type = "4star", position = c("right","bottom")) +
    tm_scale_bar(position = c("right","bottom"))


To save this map, we use a similar function as the ggsave(), but in this case, it’s tmap::tmap_save().

                filename = "images/map_of_sites.jpg",
                height = 11, width = 8, units = "in", dpi = 300)

Previous module:
7. Paramaterized reports

  1. For more info on the NLDI: https://labs.waterdata.usgs.gov/about-nldi/index.html↩︎


If you see mistakes or want to suggest changes, please create an issue on the source repository.


Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".