3. Project Management & Data Organization

Approaches towards organization and efficiency.

Learning objectives


First principles


Figure 1: What to avoid building (Source: https://xkcd.com/2054/).

The first step in any data science project is to set up and maintain a clean, predictable development environment. As you accumulate raw data, write code, and generate results, things can get messy if you don’t stick to good naming and organization habits. In this module we’ll cover how to keep your projects organized and consistent, which makes your work more reproducible, keeps your workflows efficient, helps your code stand the test of time, and gives it a structure that is easy to maintain when things break or you need to revisit it down the line.

REVIEW

Although this is an intermediate-level course, we will revisit introductory material on “Project Management” because no matter your skill level in R, strategic project management remains fundamental. Subsequent modules in this course assume familiarity with {here}, .Rproj files, naming conventions, and general best practices.


After reviewing core introductory topics, we will discuss .Rprofile and .Renviron files and when to use them. To review core best practices we recommend particular familiarity with the following concepts:


.Rprofile

If you have user- or project-level code that needs to be run every time you start up R, customizing your .Rprofile can streamline this operation.

The .Rprofile file is an actual (hidden) file that is automatically sourced (run) as R code when you open an R session. A .Rprofile file can live in the project root directory or in the user’s home directory, but only one .Rprofile is loaded per R session or RStudio project. If a project-level .Rprofile exists, it supersedes the user-level .Rprofile.
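The lookup order described above can be sketched in base R. This is a minimal sketch, not R’s actual startup code: find_active_rprofile() is a hypothetical helper name, and it assumes the documented order of the R_PROFILE_USER environment variable first, then the current working directory, then the home directory.

```r
# Hypothetical helper: report which .Rprofile R would pick up at startup.
# Assumed order: R_PROFILE_USER (if set), then the working directory,
# then the user's home directory.
find_active_rprofile <- function(wd = getwd(), home = path.expand("~")) {
  override <- Sys.getenv("R_PROFILE_USER", unset = "")
  if (nzchar(override)) return(override)

  candidates <- c(file.path(wd, ".Rprofile"), file.path(home, ".Rprofile"))
  found <- candidates[file.exists(candidates)]

  # only the first match is used; NA means no .Rprofile was found
  if (length(found) > 0) found[1] else NA_character_
}

find_active_rprofile()
```

Running this from a project root shows whether the project-level or the user-level file would be used for that session.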

Global (user-level) .Rprofile

The easiest and most consistent way to edit your .Rprofile file across operating systems is with the {usethis} package. Run usethis::edit_r_profile() to open your user-level (or global) .Rprofile. This is the default .Rprofile that will be used for any and all projects unless you have a local (project-level) .Rprofile.

usethis::edit_r_profile()

Local (project-level) .Rprofile

We can create or edit a local (project-level) .Rprofile with the scope argument of edit_r_profile(). Remember, a project-level .Rprofile will supersede the global .Rprofile.

usethis::edit_r_profile(scope = "project")
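Because a project-level .Rprofile replaces the user-level file rather than adding to it, settings in your global file are skipped inside such projects. One common workaround, sketched here as an assumption rather than something this module prescribes, is to have the project file source the user file first. source_if_exists() is a hypothetical helper, and the digits option is an arbitrary example setting.

```r
# Hypothetical contents of a project-level .Rprofile

# source a file only if it exists; returns TRUE when it was sourced
source_if_exists <- function(path) {
  if (!file.exists(path)) return(FALSE)
  source(path)
  TRUE
}

# 1. pull in the user-level settings that would otherwise be skipped
source_if_exists(path.expand("~/.Rprofile"))

# 2. then add project-specific settings (arbitrary example option)
options(digits = 4)
```

Whether this is a good idea depends on the project: sourcing personal settings makes a project slightly less reproducible for collaborators who do not share them.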


Using .Rprofile

To illustrate how .Rprofile works, let’s do something cool and useless. We’ll write a short program that greets us with a random inspirational quote, and then we’ll put it in .Rprofile so it runs whenever we start up R.

The {cowsay} package is a fun way to print text alongside ASCII animal art.

cowsay::say(what = "hello world!", by = "cow")

 ----- 
hello world! 
 ------ 
    \   ^__^ 
     \  (oo)\ ________ 
        (__)\         )\ /\ 
             ||------w|
             ||      ||

Let’s randomize the animal displayed and make its message one of the motivational quotes found at this GitHub gist. We then copy and paste the code below into our .Rprofile and restart R.

library(cowsay) # animals!
library(glue)   # pasting things together

# get vector of all animals
animals <- names(cowsay::animals)

# get pieces to make link
repo <- "JakubPetriska/060958fd744ca34f099e947cd080b540"
csv <- "raw/963b5a9355f04741239407320ac973a6096cd7b6/quotes.csv"

# get dataframe of inspirational quotes
quotes  <- readr::read_csv(glue("https://gist.githubusercontent.com/{repo}/{csv}"))  

# make full quote
quotes$full_quote  <- glue("{quotes$Quote} - {quotes$Author}")

# now use it!
cowsay::say(sample(quotes$full_quote, 1), by = sample(animals, 1))

 -------------- 
All achievements, all earned riches, have their beginning in an idea. - Napoleon Hill 
 --------------
    \
      \
        \
                   _.-````'-,_
         _,.,_ ,-'`           `'-.,_
       /)     (                   '``-.
      ((      ) )                      `\
        \)    (_/                        )\
        |       /)           '    ,'    / \
        `\    ^'            '     (    /  ))
          |      _/\ ,     /    ,,`\   (  "`
          \Y,   |   \  \  | ````| / \_ \
            `)_/      \  \  )    ( >  ( >
                       \( \(     |/   |/
          mic & dwb  /_(/_(    /_(  /_(
    
rm(animals, quotes, repo, csv) # remove the objects we just created
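Code in .Rprofile runs on every startup, so an error there (for example, no internet connection or a missing package) affects every session. A defensive variant might look like the sketch below. run_quietly() and greet() are hypothetical names introduced for illustration, and the greeting is simplified to a fixed message.

```r
# Run a startup task without letting a failure interrupt R's startup.
# run_quietly() is a hypothetical helper, not part of any package.
run_quietly <- function(task) {
  tryCatch(
    {
      task()
      TRUE
    },
    error = function(e) {
      message("Skipping startup greeting: ", conditionMessage(e))
      FALSE
    }
  )
}

greet <- function() {
  # requireNamespace() returns FALSE instead of erroring when a package is missing
  if (!requireNamespace("cowsay", quietly = TRUE)) {
    stop("{cowsay} is not installed")
  }
  cowsay::say("hello world!", by = "cow")
}

# only greet in interactive sessions, so Rscript and batch jobs stay quiet
if (interactive()) run_quietly(greet)
```

The interactive() check keeps scripted runs free of decorative output, and tryCatch() turns a failed greeting into a one-line message.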


.Renviron

Sometimes you need to store sensitive information, like API keys, database passwords, data storage paths, or general variables used across all scripts. We don’t want to accidentally share this information, push it to GitHub, or copy and paste it over and over again from script to script. We also might want to build a codebase that relies on a few variables that another user can set on their own system in a way that works for them. Environmental variables address all of these concerns.

Environmental variables are objects that store character strings. They are accessible from within R upon startup. To view all environmental variables, use Sys.getenv(). You can also pull out one environmental variable at a time by passing in its name, for instance:

Sys.getenv("USER")
[1] "richpauloo"

You can set your own environmental variables, which are stored in another hidden file called .Renviron (the R analog of the .env files common in Python projects). Keep in mind that .Renviron files contain lists of environmental variables that look similar to R code, but the file is not run as R code, so don’t put R code in your .Renviron file! If we need to run R code when starting up R, we use .Rprofile.

To illustrate the use of .Renviron, we run usethis::edit_r_environ(), add the environmental variable ANIMAL = "cat", save, and restart R.

usethis::edit_r_environ()

We can access our environmental variable as follows (remember that you need to restart R for changes to take effect; try Session > Restart R):

Sys.getenv("ANIMAL")
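The same mechanics can be tried without touching your real .Renviron. The sketch below writes a throwaway file in the name=value format and loads it with base R’s readRenviron(), which reads such a file into the current session without a restart. The R4WRDS_DATA_DIR variable and its path are hypothetical placeholders.

```r
# write a temporary file in .Renviron format: one name=value pair per line
env_file <- tempfile(fileext = ".Renviron")
writeLines(
  c(
    "# lines starting with # are comments",
    "ANIMAL=cat",
    "R4WRDS_DATA_DIR=/path/to/data"  # hypothetical variable and path
  ),
  env_file
)

# load the variables into the current session
readRenviron(env_file)

Sys.getenv("ANIMAL")
# [1] "cat"
```

Values always come back as character strings, so convert them explicitly (for example with as.numeric()) if you store numbers.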

We can use our environmental variable, for instance, in a function.

inspire_me <- function(animal){

  # get pieces to make link
  repo <- "JakubPetriska/060958fd744ca34f099e947cd080b540"
  csv  <- "raw/963b5a9355f04741239407320ac973a6096cd7b6/quotes.csv"
  
  # silently read dataframe
  suppressMessages(
    quotes  <- readr::read_csv(
      glue::glue("https://gist.githubusercontent.com/{repo}/{csv}")
    )  
  )
  
  # paste together the full quote
  quotes$full_quote  <- paste0(quotes$Quote, " -", quotes$Author)
  
  # make a user-specified animal say the quote
  cowsay::say(sample(quotes$full_quote, 1), by = animal)

}

# have the environmental variable say a quote
inspire_me(Sys.getenv("ANIMAL"))

Although it may not appear powerful in this trivial example, environmental variables are a standard, widely used approach when a project grows large and complex, or when you need to manage multiple sensitive passwords and access tokens.
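When code depends on a variable that each user sets themselves, it helps to fail early with a clear message if the variable is missing. A minimal sketch, where get_required_env() and the variable name R4WRDS_DEMO_TOKEN are hypothetical:

```r
# Hypothetical helper: return an environmental variable or stop with a
# helpful message if it is unset or empty.
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) {
    stop(
      "Environmental variable '", name, "' is not set. ",
      "Add it with usethis::edit_r_environ() and restart R.",
      call. = FALSE
    )
  }
  value
}

# set a variable for this session only, then read it back
Sys.setenv(R4WRDS_DEMO_TOKEN = "abc123")
get_required_env("R4WRDS_DEMO_TOKEN")
# [1] "abc123"
```

Without the check, Sys.getenv() silently returns an empty string for an unset variable, which tends to surface later as a confusing error.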

Pause and think

In the example function above, we might notice that reading in a csv from a url every time we run inspire_me() is a lot of unnecessary overhead. Where else might we read that csv automatically when R starts up, so that it’s available to our inspire_me() function and only needs to be read once?


Click for Answers!

We can move the step that reads the csv into a project-level .Rprofile, so it’s available to the project where we need this csv, but not to any general R session we may open outside of the project.

.Rprofile

# get pieces to make link
repo <- "JakubPetriska/060958fd744ca34f099e947cd080b540"
csv  <- "raw/963b5a9355f04741239407320ac973a6096cd7b6/quotes.csv"

# silently read dataframe
suppressMessages(
  quotes  <- readr::read_csv(
    glue::glue("https://gist.githubusercontent.com/{repo}/{csv}")
  )  
)

# paste together the full quote
quotes$full_quote  <- paste0(quotes$Quote, " -", quotes$Author)

Modified function

inspire_me <- function(animal){

  # make a user-specified animal say the quote
  cowsay::say(sample(quotes$full_quote, 1), by = animal)

}
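The modified function relies on a global object, quotes, that the .Rprofile creates. An alternative that avoids the global object (named plainly here as a different technique, not the answer above) is to cache the result of the first read inside a closure. In this sketch, make_cached_reader() is a hypothetical helper and the data frame is a placeholder standing in for readr::read_csv(url).

```r
# Hypothetical helper: wrap a slow read function so it runs only once.
make_cached_reader <- function(read_fun) {
  cache <- NULL
  function() {
    # read on the first call, reuse the stored result afterwards
    if (is.null(cache)) cache <<- read_fun()
    cache
  }
}

n_reads <- 0  # counts how often the underlying read happens

get_quotes <- make_cached_reader(function() {
  n_reads <<- n_reads + 1
  # placeholder data standing in for readr::read_csv(url)
  data.frame(Quote = "placeholder quote", Author = "placeholder author")
})

first  <- get_quotes()
second <- get_quotes()
n_reads
# [1] 1
```

The trade-off is that the first call in each session still pays the cost of the read, while the .Rprofile approach pays it at startup.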


Strategies to organize projects/code

Best practices for writing code across languages typically recommend package imports and function definitions at the top of a script, followed by code. For example, a script may look like this:

# import packages
library(package_1)
library(package_2)

# define functions
my_first_function <- function(){
  print("hello")
}

my_second_function <- function(){
  print("world")
}

# run scripts/functions
my_first_function()
my_second_function()

These approaches work well when scripts are relatively simple, but as a project grows large and complex, it’s best practice to move functions into another script or set of scripts, and break up your workflow into discrete steps.

For instance, although the inspire_me() function above is relatively simple, we can pretend that the read, transform, and print steps carried out in the function were themselves long functions that are part of a much more complex, real-world workflow. Imagine we created a script called functions.R that contained the following code. Don’t worry if you haven’t seen purrr::walk() before. We’ll cover this in a later module on iteration, and all you need to know about it now is that it “walks” over each input and applies a function. In this case, we apply the require() function to a vector of package names to load them.

# list packages in a vector and load them all
pkgs <- c("readr", "cowsay")
purrr::walk(pkgs, require, character.only = TRUE)

# read quotes from a url
f_read_data <- function(url){
  suppressMessages(
    quotes  <- read_csv(url)  
  )
  return(quotes)
}

# paste the quote to the author
f_preprocess_data <- function(d){
  d$full_quote  <- paste0(d$Quote, " -", d$Author)
  return(d)
}

# print a random animal and a random quote
f_inspire_me <- function(d){
  animals <- names(animals)
  say(sample(d$full_quote, 1), by = sample(animals, 1))
}
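One caveat with the require() pattern above: require() returns FALSE with a warning when a package is missing, so the script keeps running and fails later at the first call to a missing function. A stricter loader, sketched here in base R with the hypothetical name load_packages(), checks up front and stops with a single clear message.

```r
# Hypothetical helper: attach packages, stopping early if any are missing.
load_packages <- function(pkgs) {
  # requireNamespace() reports availability without attaching anything
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

  if (!all(available)) {
    stop(
      "Missing packages: ", paste(pkgs[!available], collapse = ", "),
      call. = FALSE
    )
  }

  # all packages are installed, so attach them
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# base packages are used here so the example runs anywhere
load_packages(c("stats", "utils"))
```

In the functions.R script, the call would be load_packages(c("readr", "cowsay")) instead.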

We can call this script using source() to load or import these functions into our environment where they are available for use, just as we load a library.

source(here::here("scripts/functions.R"))

Abstracting Functions from Code

However, this is hardly a satisfying solution because in a real project, our pretend functions above may grow quite large, and we will likely add more and more functions. If a single script holds them all, something like functions.R may become many hundreds of lines long, making it difficult to sift through, debug, or extend. A better organizational approach, which makes things easier to maintain over time, is to move all our functions to a directory, /scripts/functions, and store them as separate files, each named after the function it contains:

Save as /scripts/functions/f_read_data.R

# read quotes from a url
f_read_data <- function(url){
  suppressMessages(
    quotes  <- read_csv(url)
  )
  return(quotes)
}

Save as /scripts/functions/f_preprocess_data.R

# paste the quote to the author
f_preprocess_data <- function(d){
  d$full_quote  <- paste0(d$Quote, " -", d$Author)
  return(d)
}

Save as /scripts/functions/f_inspire_me.R

# print a random animal and a random quote
f_inspire_me <- function(d){
  animals <- names(animals)
  say(sample(d$full_quote, 1), by = sample(animals, 1))
}

The /scripts/functions folder should now look like this:

scripts/functions
├── f_inspire_me.R
├── f_preprocess_data.R
└── f_read_data.R

Now in our /scripts directory, we create a script, 01_control.R to source our functions and use them. Be sure to restart R to clear your environment before sourcing this control script so we know we are working from a clean slate.

Save as /scripts/01_control.R and run.

# packages needed for this script
pkgs <- c("readr", "cowsay", "here", "tidyverse", "glue")
# purrr is not attached yet after a restart, so call walk() with its namespace
purrr::walk(pkgs, require, character.only = TRUE)

# silently source all functions using the purrr::walk function
fns <- fs::dir_ls(here("scripts/functions"))
walk(fns, ~source(.x))

# define the url where quotes are located
# get pieces to make link
repo <- "JakubPetriska/060958fd744ca34f099e947cd080b540"
csv <- "raw/963b5a9355f04741239407320ac973a6096cd7b6/quotes.csv"
url <- glue("https://gist.githubusercontent.com/{repo}/{csv}")  

# use all of our functions
f_read_data(url) %>% 
  f_preprocess_data() %>% 
  f_inspire_me()

 ----- 
You can't stop the waves, but you can 
learn to surf. -Jon Kabat-Zinn 
 ------ 
    \   
     \
                   ____
                _.' :  `._
            .-.'`.  ;   .'`.-.
   __      / : ___\ ;  /___ ; \      __
  ,'_ ""--.:__;".-.";: :".-.":__;.--"" _`,
  :' `.t""--.. '<@.`;_  ',@>` ..--""j.' `;
       `:-.._J '-.-'L__ `-- ' L_..-;'
          "-.__ ;  .-"  "-.  : __.-"
             L ' /.------.\ ' J
             "-.   "--"   .-"
             __.l"-:_JL_;-";.__
         .-j/'.;  ;""""  / .'\"-.
         .' /:`. "-.:     .-" .';  `.
      .-"  / ;  "-. "-..-" .-"  :    "-.
  .+"-.  : :      "-.__.-"      ;-._   \
  ; \  `.; ;                    : : "+. ;
  :  ;   ; ;                    : ;  : \:
  ;  :   ; :                    ;:   ;  :
  : \  ;  :  ;                  : ;  /  ::
  ;  ; :   ; :                  ;   :   ;:
  :  :  ;  :  ;                : :  ;  : ;
  ;\    :   ; :                ; ;     ; ;
  : `."-;   :  ;              :  ;    /  ;
 ;    -:   ; :              ;  : .-"   :
  :\     \  :  ;            : \.-"      :
  ;`.    \  ; :            ;.'_..--  / ;
  :  "-.  "-:  ;          :/."      .'  :
   \         \ :          ;/  __        :
    \       .-`.\        /t-""  ":-+.   :
     `.  .-"    `l    __/ /`. :  ; ; \  ;
       \   .-" .-"-.-"  .' .'j \  /   ;/
        \ / .-"   /.     .'.' ;_:'    ;
  :-""-.`./-.'     /    `.___.'
               \ `t  ._  /  bug
                "-.t-._:'
  

source() is the key to chaining together many scripts. In the example above, we were able to abstract functions into a separate folder which makes keeping track of them much easier than if they cluttered our control script.
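The source-a-whole-folder pattern can be tried in isolation with base R only. The sketch below builds a throwaway functions folder in a temporary directory. The f_hello() and f_world() functions are hypothetical stand-ins, and list.files() plus a for loop replace fs::dir_ls() and purrr::walk().

```r
# create a throwaway functions folder with one function per file
fn_dir <- file.path(tempdir(), "functions")
dir.create(fn_dir, showWarnings = FALSE)
writeLines("f_hello <- function() 'hello'", file.path(fn_dir, "f_hello.R"))
writeLines("f_world <- function() 'world'", file.path(fn_dir, "f_world.R"))

# find every .R file in the folder and source each one
fns <- list.files(fn_dir, pattern = "\\.[Rr]$", full.names = TRUE)
for (f in fns) source(f)

# the functions are now available, as if defined in this script
paste(f_hello(), f_world())
# [1] "hello world"
```

Restricting the pattern to .R files keeps stray files in the folder, such as notes or data, from being sourced by accident.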

Learn more

Separating all functions into standalone scripts is not a revolutionary idea; in fact, this is precisely how R packages are written! For example, see the {dplyr} GitHub repo’s /R folder, which contains all dplyr functions in one directory. When you call library(dplyr) you’re essentially sourcing all of these functions into your environment.

If project management and reproducible data pipelines are interesting to you, check out the {targets} R package. A similar framework for Shiny Apps exists called {golem}, which also includes {usethis}-like commands that streamline common chores in Shiny App development.


{renv}

We use {here} because we expect that whoever else opens your code on their machine is likely to have a different project root path, and {here} ensures your code is portable between different computers with different root project paths (e.g., ~/Documents/Github/myproject versus C:\Users\louis\Documents\myproject).

Development environments are similar. When we work in R – or any programming language for that matter – we use a snapshot of package versions based on when we downloaded and installed them [e.g. with install.packages()]. You can check the version of the installed packages loaded into your current environment with sessionInfo().

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] glue_1.4.2   cowsay_0.8.0 purrr_0.3.4  knitr_1.33  

loaded via a namespace (and not attached):
 [1] bslib_0.2.5.1     compiler_4.1.0    pillar_1.6.2     
 [4] jquerylib_0.1.4   highr_0.9         tools_4.1.0      
 [7] bit_4.0.4         rmsfact_0.0.3     digest_0.6.27    
[10] downlit_0.2.1     jsonlite_1.7.2    evaluate_0.14    
[13] lifecycle_1.0.0   tibble_3.1.3      pkgconfig_2.0.3  
[16] png_0.1-7         rlang_0.4.11      rstudioapi_0.13  
[19] cli_3.0.1         curl_4.3.2        parallel_4.1.0   
[22] distill_1.2       yaml_2.2.1        xfun_0.25        
[25] stringr_1.4.0     fs_1.5.0          vctrs_0.3.8      
[28] sass_0.4.0        hms_1.1.0         tidyselect_1.1.1 
[31] bit64_4.0.5       rprojroot_2.0.2   here_1.0.1       
[34] R6_2.5.0          fansi_0.5.0       vroom_1.5.4      
[37] rmarkdown_2.10    tzdb_0.1.2        readr_2.0.1      
[40] magrittr_2.0.1    usethis_2.0.1     fortunes_1.5-4   
[43] htmltools_0.5.1.1 ellipsis_0.3.2    utf8_1.2.2       
[46] stringi_1.7.3     crayon_1.4.1     

The version number is the string of numbers listed after a package name and underscore.

Similarly, you can use installed.packages() to view information on all of your installed packages.
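Versions can also be queried and compared in code, which is handy when a script depends on behavior introduced in a particular release. A small sketch using base R; the {stats} package and the threshold "3.5.0" are arbitrary choices for illustration.

```r
# version of a single installed package
v <- utils::packageVersion("stats")
v

# package_version objects compare correctly against version strings
v >= "3.5.0"

# name and version of every installed package, as a character matrix
pkgs <- utils::installed.packages()[, c("Package", "Version"), drop = FALSE]
head(pkgs, 3)
```

Comparing package_version objects is safer than comparing raw strings, because "1.10.0" sorts before "1.9.0" as text but not as a version.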

When packages change between versions, the changes are typically designed to fix bugs or improve performance, but sometimes they break existing code. Collaborative work on a project can therefore run into trouble when people run the same code with different package versions.

The solution to this problem is for everyone to use the same versions of packages (and R), which is to say that collaborators should use the same development environment. This is a common concept across programming languages.

{renv} manages your package environment and makes it easy to share it with others by creating and curating a “lock” file (renv.lock) in the root project directory. When starting a project, create the file with renv::init(), install packages as you go along, and update the lockfile with renv::snapshot(). When a collaborator opens your project (for example, after cloning it from GitHub), they open the .Rproj file, which lets {renv} activate itself for the project, and then run renv::restore() to install the package versions recorded in the lock file.

If you need to share important analyses, or run them on a production server, you should look into {renv}. For most day-to-day data science that you don’t plan on sharing or working on collaboratively, it may be unnecessary.



Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".