7. Parameterized reports

How to automate routine reporting

Learning objectives


“Your success in life will be determined largely by your ability to speak, your ability to write, and the quality of your ideas, in that order.” — Patrick Winston (1943-2019)

Data science is the art of combining domain knowledge, statistics, math, programming, and visualization to find order and meaning in disorganized information. Communicating the results of your analyses, or your ability to make data “speak”, is of utmost importance. The modern open source package ecosystem is full of powerful ways to give voice to our analyses.

Analyses typically lead to figures, tables, and interpretation of this information. The {rmarkdown} package provides R users with a standardized approach for turning R analyses into reports, documents, presentations, dashboards, and websites. In this module, we assume familiarity with {rmarkdown}, and extend the previous modules on iteration, functional programming, and reproducible workflows to demonstrate how to iterate over reports.

R Markdown: The Definitive Guide lists several example use cases for parameterized reports. The first is showing results for a specific geographic location.

In this module, we will focus on the first case, and build parameterized reports for a set of geographic locations. Throughout this course, we’ve been working with groundwater elevation data across California counties. Let’s imagine that we want to generate a report on groundwater level trends for a set of counties.

Although the RStudio Integrated Development Environment (IDE) encourages knitting R Markdown documents by clicking a button, we can also knit documents with rmarkdown::render(). Iterating over render() is the key to scaling parameterized reports. To iterate over R Markdown reports, we must first understand how to use params.


params

A parameterized .Rmd file takes a set of params (short for “parameters”) in the YAML header. These are bound into a named list called params and accessed from code within the .Rmd file with params$<parameter-name>. For example, consider the following YAML:

title: "My awesome parameterized report"
output: html_document
params:
  start_date: 2021-01-01
  watershed: Yuba
  data: gwl_yuba.csv

In the code, we could then access the value "2021-01-01" with params$start_date. Similarly, params$watershed will equal "Yuba" and params$data will equal "gwl_yuba.csv".
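
To make the access pattern concrete, the sketch below mimics the params list in a plain R session. The hand-built list is only a stand-in for what R Markdown creates at knit time. It also assumes, as described above, that the date arrives as a character string, in which case as.Date() is one way to turn it into a Date before filtering on it.

```r
# Stand-in for the list R Markdown builds from the YAML header above.
# (Hand-built here for illustration; inside a knitted .Rmd it already exists.)
params <- list(
  start_date = "2021-01-01",
  watershed  = "Yuba",
  data       = "gwl_yuba.csv"
)

# access values by name, as you would inside the report
params$watershed
params$data

# the date is assumed to arrive as text, so convert it before doing date math
start_date <- as.Date(params$start_date)
class(start_date)
```

Because params is an ordinary named list, anything that works on a list (names(params), params[["watershed"]]) works here too.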


Set up a report with params

Let’s apply params to our task and generate an html_document for a set of counties. To illustrate, we will use a simplified, pre-processed dataset of 3 counties (Sacramento, Yolo, and San Joaquin counties). If you’re motivated to do so, you can use the entire groundwater level dataset of > 2 million records to scale the process to all counties. Read in the data and take a look at the fields.

library(tidyverse)
library(sf)

# groundwater level data for Sacramento, Yolo, San Joaquin counties
gwl <- read_rds("data/sac_yolo_sj.rds")

gwl %>% 
  group_by(SITE_CODE) %>% 
  slice(1) %>% 
  plot()

We will iterate over COUNTY_NAME to create three reports, one for each county. Copy and paste the following code into a new file, reports/gwl_report.Rmd:

---
title: "`r paste(params$county, 'Groundwater Levels')`"
output: html_document
params:
  county: "placeholder"
---

<br>

```{r, echo = FALSE, message = FALSE, error = FALSE, warning = FALSE}
library(tidyverse)
library(here)
library(sf)
library(mapview)
library(maps)
library(DT)
mapviewOptions(fgb = FALSE)

knitr::opts_chunk$set(warning = FALSE, message = FALSE, out.width = "100%")

# filter all groundwater level data (already loaded into memory) by 
# the supplied county
d <- filter(gwl, COUNTY_NAME == params$county)

# extract the county spatial file; maps::map() is namespaced so it
# cannot be confused with purrr::map()
county_sf <- st_as_sf(maps::map("county", plot = FALSE, fill = TRUE)) %>% 
  filter(ID == paste0("california,", tolower(params$county)))

```


This report shows groundwater levels in `r params$county` county.

Dates range from `r min(d$MSMT_DATE, na.rm = TRUE)` to `r max(d$MSMT_DATE, na.rm = TRUE)`.

Data source: [DWR Periodic Groundwater Level Database](https://data.cnra.ca.gov/dataset/periodic-groundwater-level-measurements).

<br>

## Distribution of measurements over time

50% of measured values occur on or after `r median(d$MSMT_DATE, na.rm = TRUE)`.

```{r hist, echo = FALSE}
d %>% 
  ggplot() +
  geom_histogram(aes(MSMT_DATE)) +
  theme_minimal() +
  labs(title = "", x = "", y = "Count")
```


<br>

## Monitoring sites

```{r map, echo = FALSE}
# mapview of county outline
county_mv <- mapview(
  county_sf, layer.name = paste(params$county, "county"), 
  lwd = 2, color = "red", alpha.regions = 0
)

# mapview of monitoring points
points_mv <- d %>% 
  group_by(SITE_CODE) %>% 
  slice(1) %>% 
  select(-MSMT_DATE) %>% # remove measurement date because it's irrelevant here
  mapview(layer.name = "Monitoring stations")

county_mv + points_mv
```

<br>

## All groundwater levels

```{r plot, echo = FALSE}
# interactive hydrograph
p <- ggplot(d, aes(MSMT_DATE, WSE, color = SITE_CODE)) + 
  geom_line(alpha = 0.5) +
  guides(color = "none") # hide the legend; FALSE is deprecated in recent ggplot2
plotly::ggplotly(p)
```

<br> 

```{r dt, echo = FALSE}
# data table of median groundwater level per site, per year
d %>% 
  select(-c("COUNTY_NAME", "WELL_DEPTH")) %>%
  st_drop_geometry() %>% 
  mutate(YEAR = lubridate::year(MSMT_DATE)) %>% 
  group_by(SITE_CODE, YEAR) %>% 
  summarise(WSE_MEDIAN = median(WSE, na.rm = TRUE)) %>%
  ungroup() %>% 
  DT::datatable(
    extensions = 'Buttons', options = list(
      dom = 'Bfrtip',
      buttons = 
        list('copy', 'print', list(
          extend  = 'collection',
          buttons = c('csv', 'excel', 'pdf'),
          text    = 'Download'
        ))
    )
  )
```


***

Report generated on `r Sys.Date()`.

Pause and think

Take a moment to read the .Rmd file above and see what it does. Notice where params$county appears in the document. In particular, the first code chunk uses it to filter the groundwater level data down to the county supplied as a parameter. The data are assumed to be in memory already, so we load them once rather than every time a report is rendered.

d <- filter(gwl, COUNTY_NAME == params$county)

Next, how might you write an .Rmd file like the one above and test that everything looks the way you want it to before calling it done? In other words, would you start by writing params$county in all places it needs to be or start with one county, make sure everything works, and then substitute in params$county?
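
One possible workflow, sketched below, is to define a hard-coded params list in the console, develop the report code against it, and confirm the filter returns rows before handing control to render(). The gwl_toy data frame and its values are made up for illustration and stand in for the real gwl object. The filter is written in base R so the example runs without loading any packages.

```r
# Hypothetical toy data standing in for the real `gwl` object
gwl_toy <- data.frame(
  SITE_CODE   = c("A1", "A2", "B1", "C1"),
  COUNTY_NAME = c("Sacramento", "Sacramento", "Yolo", "San Joaquin"),
  WSE         = c(10.2, 11.5, 35.1, -4.8)
)

# mock the params list while developing interactively
params <- list(county = "Yolo")

# same filtering logic as the report, written with base R
d <- gwl_toy[gwl_toy$COUNTY_NAME == params$county, ]

# fail early if the county name matches no rows (e.g., because of a typo)
stopifnot(nrow(d) > 0)
nrow(d)
```

Once the code works for one county, the same chunks should run unchanged when render() supplies the real params list.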


Iterate over a report

Finally, we create a vector of counties we want to write reports for and iterate over them. We also need to specify the output location of each file. Since we are writing html_documents, the file extension is .html. Using walk2() from our functional programming toolkit, we can pass the counties vector and the output file names to rmarkdown::render() and silently write the files.

# unique counties to write reports for
counties  <- unique(gwl$COUNTY_NAME)

# output file names
files_out <- tolower(counties) %>% 
  str_replace_all(" ", "_") %>% 
  paste0(., ".html")

# silently (walk) over the county names and file names, 
# creating a report for each combination
walk2(
  counties, 
  files_out,
  ~rmarkdown::render(
    input       = "reports/gwl_report.Rmd", 
    output_file = here::here("reports", .y), # namespaced: {here} is not attached in this script
    params      = list(county = .x)
  )
)
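
If one county fails to render, the loop above stops and the remaining reports are never written. Below is a minimal sketch of one way to keep going, using base R's tryCatch(). Here render_one() is a hypothetical stand-in for the rmarkdown::render() call, so the example runs without the report file, and it fails on purpose for one county.

```r
# Hypothetical stand-in for rmarkdown::render(); it fails for one county
# on purpose so we can see the error handling at work.
render_one <- function(county, file_out) {
  if (county == "Yolo") stop("simulated render failure")
  file_out
}

counties  <- c("Sacramento", "Yolo", "San Joaquin")
files_out <- paste0(gsub(" ", "_", tolower(counties)), ".html")

# wrap each call so an error is recorded instead of halting the loop
results <- Map(
  function(county, file_out) {
    tryCatch(
      render_one(county, file_out),
      error = function(e) paste("FAILED:", conditionMessage(e))
    )
  },
  counties, files_out
)

# one entry per county: the output file name or a failure message
unlist(results)
```

In a {purrr} workflow, wrapping the render function with possibly() or safely() serves the same purpose.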

Open and explore each of the files that were written.


Pause and think

If we wanted to automate reports like this and have them published online or emailed to our team every morning at 7 AM, what tools would we need?

Hint: see the automation module section on task schedulers.


In-line R

Within an .Rmd we can insert R code in-line using the following syntax:

`r <expression>`

So for instance we can write a string like:

The mean of 1 and 3 is `r mean(c(1,3))`.

And when the document knits, we get: The mean of 1 and 3 is 2.

Consider this as an approach to add specific output about each site in the text narrative.
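
Inline values often need light formatting before they read well in a sentence. The sketch below uses base R's format() and round() on made-up numbers. Inside an .Rmd, you would place the resulting objects in the inline syntax shown above rather than pasting a string together.

```r
# made-up values for illustration
n_measurements <- 1234567
wse_values     <- c(12.345, 15.678, 9.012)

# add thousands separators so large counts are readable in text
n_pretty <- format(n_measurements, big.mark = ",")

# round summary statistics to a sensible number of digits
wse_mean <- round(mean(wse_values), 1)

# the sentence you would otherwise compose with inline R
paste0(
  "The report covers ", n_pretty,
  " measurements with a mean level of ", wse_mean, "."
)
```

Computing these values in a code chunk first and referring to them inline keeps the narrative text short and easy to proofread.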


Additional Resources

Although we only demonstrated one type of output report in this module, the html_document, there are many other output formats that you can parameterize and iterate over, including Word documents, PDFs, flexdashboards, and presentations.
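
As a sketch of what changes when targeting another format, note that the output file extension has to match the format being rendered. The lookup below pairs three {rmarkdown} output formats with their extensions. out_file() is a hypothetical helper written for this example, not part of any package.

```r
# rmarkdown output formats and the file extension each one produces
ext_by_format <- c(
  html_document = ".html",
  word_document = ".docx",
  pdf_document  = ".pdf"
)

# hypothetical helper: build an output file name for a county and format
out_file <- function(county, format = "html_document") {
  stopifnot(format %in% names(ext_by_format))
  paste0(gsub(" ", "_", tolower(county)), ext_by_format[[format]])
}

out_file("San Joaquin", "word_document")
```

The format itself can be passed to rmarkdown::render() through its output_format argument. Keep in mind that the interactive map, plot, and table in the report above are HTML widgets, so a Word or PDF version would likely need static replacements.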

To dig deeper, see the official R Markdown guide for parameterized reports.



Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".