Approaches towards organization and efficiency.
Learning objectives
.Rprofile and .Renviron files
{renv}
Figure 1: What to avoid building (Source: https://xkcd.com/2054/).
The first step in any data science project is to set up and maintain a clean, predictable development environment. As you accumulate raw data, write code, and generate results, things can get messy if you don’t stick to good naming and organization habits. In this module we’ll cover how to keep your projects organized and consistent. Doing so makes your work more reproducible, keeps your workflows efficient, helps your code stand the test of time, and gives it a structure that is easy to maintain when things break or when you need to revisit it down the line.
REVIEW
Although this is an intermediate-level course, we will revisit introductory material on “Project Management” because, no matter your skill level in R, strategic project management remains fundamental. Subsequent modules in this course assume familiarity with RStudio projects (.Rproj files), naming conventions, and general best practices.
After reviewing core introductory topics, we will discuss .Rprofile and .Renviron files and when to use them. To review core best practices we recommend particular familiarity with the following concepts:
If you have user- or project-level code that needs to be run every time you start up R, customizing your .Rprofile can streamline this operation.
The .Rprofile file is an actual (hidden) file that is automatically sourced (run) as R code when you open an R session. A .Rprofile file can live in the project root directory or in the user’s home directory, but only one .Rprofile is loaded per R session. If a project-level .Rprofile exists, it supersedes the user-level .Rprofile.
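The precedence rule above can be sketched as a small helper. This is an illustration only: find_rprofile() is a hypothetical name, and the sketch ignores other mechanisms that R's startup process supports (for example, the R_PROFILE_USER environment variable). It uses throwaway directories so that no real files are touched.

```r
# Hypothetical helper: report which .Rprofile a session started in
# `project_dir` would pick up, following the rule described above.
find_rprofile <- function(project_dir = getwd(), home_dir = path.expand("~")) {
  project_profile <- file.path(project_dir, ".Rprofile")
  user_profile <- file.path(home_dir, ".Rprofile")
  if (file.exists(project_profile)) {
    return(project_profile)  # the project-level file wins
  }
  if (file.exists(user_profile)) {
    return(user_profile)     # otherwise fall back to the user-level file
  }
  NA_character_              # no .Rprofile found
}

# Build two temporary directories that stand in for a project and a home folder
demo_project <- file.path(tempdir(), "rprofile_demo_project")
demo_home <- file.path(tempdir(), "rprofile_demo_home")
unlink(c(demo_project, demo_home), recursive = TRUE)
dir.create(demo_project)
dir.create(demo_home)

# Only a user-level file exists at first
file.create(file.path(demo_home, ".Rprofile"))
user_only <- find_rprofile(demo_project, demo_home)

# Once a project-level file exists, it takes precedence
file.create(file.path(demo_project, ".Rprofile"))
with_project <- find_rprofile(demo_project, demo_home)
```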
.Rprofile
The easiest and most consistent way to edit your .Rprofile file across operating systems is with the {usethis} package. Run usethis::edit_r_profile() to open your user-level (or global) .Rprofile. This is the default .Rprofile that will be used for any and all projects unless you have a local (project-level) .Rprofile.
usethis::edit_r_profile()
.Rprofile
We can create or edit a local or project-level .Rprofile with the scope argument of the edit_r_profile() function. Remember, a project-level .Rprofile will supersede the global .Rprofile.
usethis::edit_r_profile(scope = "project")
.Rprofile
To illustrate how .Rprofile works, let’s do something cool and useless. We’ll write a short program that greets us with a random inspirational quote, and then we’ll put it in .Rprofile so it runs whenever we start up R.
The {cowsay} package is a fun way to print text animal art.
cowsay::say(what = "hello world!", by = "cow")
-----
hello world!
------
\ ^__^
\ (oo)\ ________
(__)\ )\ /\
||------w|
|| ||
Let’s randomize the animal displayed and make its message one of the motivational quotes found at this GitHub gist. Copy and paste the code below into your .Rprofile and restart R.
library(cowsay) # animals!
library(glue) # pasting things together
# get vector of all animals
animals <- names(cowsay::animals)
# get pieces to make link
repo <- "JakubPetriska/060958fd744ca34f099e947cd080b540"
csv <- "raw/963b5a9355f04741239407320ac973a6096cd7b6/quotes.csv"
# get dataframe of inspirational quotes
quotes <- readr::read_csv(glue("https://gist.githubusercontent.com/{repo}/{csv}"))
# make full quote
quotes$full_quote <- glue("{quotes$Quote} - {quotes$Author}")
# now use it!
cowsay::say(sample(quotes$full_quote, 1), by = sample(animals, 1))
--------------
The possibilities are numerous once we decide to act and not react. - George Bernard Shaw
--------------
\
\
\
/""-._
. '-,
: '',
; * '.
' * () '.
\ \
\ _.---.._ '.
: .' _.--''-'' \ ,'
.._ '/.' . ;
; `-. , \'
; `, ; ._\
; \ _,-' ''--._
: \_,-' '-._
\ ,-' . '-._
.' __.-''; \...,__ '.
.' _,-' \ \ ''--.,__ '\
/ _,--' ; \ ; \^.}
;_,-' ) \ )\ ) ;
/ \/ \_.,-' ;
/ ;
,-' _,-'''-. ,-., ; PFA
,-' _.-' \ / |/'-._...--'
:--`` )/
'
rm(animals, quotes) # remove the objects we just created
Sometimes you need to store sensitive information, like API keys, database passwords, data storage paths, or general variables used across all scripts. We don’t want to accidentally share this information, push it to GitHub, or copy and paste it from script to script. We also might want to build a codebase that relies on a few variables that another user can set on their own system in a way that works for them. Environmental variables address all of these concerns.
Environmental variables are objects that store character strings. They are accessible from within R upon startup. To view all environmental variables, use Sys.getenv(). You can also pull out one environmental variable at a time by passing in its name, for instance:
Sys.getenv("USER")
[1] "richpauloo"
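Environmental variables can also be set and queried for the current session only, which is handy for experimenting. A minimal sketch follows; the variable names R4WRDS_DEMO_VAR, R4WRDS_DEMO_NUMBER, and R4WRDS_NOT_SET are made up for illustration.

```r
# Set a variable for the current R session only (nothing is saved to disk)
Sys.setenv(R4WRDS_DEMO_VAR = "hello")
demo_value <- Sys.getenv("R4WRDS_DEMO_VAR")

# Make sure the next variable really is unset before querying it
Sys.unsetenv("R4WRDS_NOT_SET")

# An unset variable comes back as an empty string by default...
missing_default <- Sys.getenv("R4WRDS_NOT_SET")
# ...or as NA if you ask for that with the `unset` argument
missing_na <- Sys.getenv("R4WRDS_NOT_SET", unset = NA)

# Values are always stored as character strings, even if they look numeric
Sys.setenv(R4WRDS_DEMO_NUMBER = 42)
demo_number <- Sys.getenv("R4WRDS_DEMO_NUMBER")
```

Because every value is a string, convert explicitly (for example with as.numeric()) when a script needs a number.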
You can set your own environmental variables, which are stored in another hidden file called .Renviron (R’s analog of the .env file commonly used in Python projects). Keep in mind that a .Renviron file contains a list of environmental variables that looks similar to R code but is not run as R code, so don’t put R code in your .Renviron file! If we need to run R code when starting up R, we use .Rprofile.
To illustrate the use of .Renviron, we run usethis::edit_r_environ(), add the environmental variable ANIMAL = "cat", save, and restart R.
usethis::edit_r_environ()
We can access our environmental variable as follows (remember that you need to restart R for changes to take effect; try Session > Restart R):
Sys.getenv("ANIMAL")
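Restarting R is the simplest way to pick up edits to .Renviron. As an alternative, base R’s readRenviron() reads a file of NAME=value lines into the current session. The sketch below uses a temporary file and a made-up variable, R4WRDS_DEMO_ANIMAL, so that your real .Renviron is left alone.

```r
# Write a small stand-in for an .Renviron file
demo_environ <- tempfile(fileext = ".Renviron")
writeLines(
  c("# lines starting with # are comments",
    "R4WRDS_DEMO_ANIMAL=cat"),
  demo_environ
)

# Load the file into the running session; the result is TRUE on success
loaded_ok <- readRenviron(demo_environ)

# The variable is now available without restarting R
demo_animal <- Sys.getenv("R4WRDS_DEMO_ANIMAL")
```

For a real user-level file, the equivalent call would be readRenviron("~/.Renviron").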
We can use our environmental variable, for instance, in a function.
inspire_me <- function(animal){
  # get pieces to make link
  repo <- "JakubPetriska/060958fd744ca34f099e947cd080b540"
  csv <- "raw/963b5a9355f04741239407320ac973a6096cd7b6/quotes.csv"
  # silently read dataframe
  suppressMessages(
    quotes <- readr::read_csv(
      glue::glue("https://gist.githubusercontent.com/{repo}/{csv}")
    )
  )
  # paste together the full quote
  quotes$full_quote <- paste0(quotes$Quote, " -", quotes$Author)
  # make a user-specified animal say the quote
  cowsay::say(sample(quotes$full_quote, 1), by = animal)
}
# have the environmental variable say a quote
inspire_me(Sys.getenv("ANIMAL"))
Although it may not appear powerful in this trivial example, environmental variables are a standard, widely used approach when a project grows large and complex, or when it must manage multiple sensitive passwords and access tokens.
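When a script depends on a secret such as an access token, it can help to fail early with a clear message if the variable is missing. A minimal sketch follows; get_required_env() is a hypothetical helper, and the variable names and token value are placeholders.

```r
# Hypothetical helper: return the value of an environmental variable,
# or stop with an informative error if it is missing or empty.
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop(
      "Environmental variable '", name, "' is not set. ",
      "Add it to your .Renviron file and restart R.",
      call. = FALSE
    )
  }
  value
}

# A variable that exists (set here only for the demonstration)
Sys.setenv(R4WRDS_DEMO_TOKEN = "not-a-real-token")
token <- get_required_env("R4WRDS_DEMO_TOKEN")

# A variable that does not exist: capture the error message instead of halting
Sys.unsetenv("R4WRDS_DEFINITELY_NOT_SET")
missing_result <- tryCatch(
  get_required_env("R4WRDS_DEFINITELY_NOT_SET"),
  error = function(e) conditionMessage(e)
)
```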
Pause and think
In the example function above, we might notice that reading a csv from a URL every time we run inspire_me() is a lot of unnecessary overhead. Where else might we read that csv automatically when R starts up, so that it’s available to our inspire_me() function and we only need to read it once?
We can move the step that reads the csv into a project-level .Rprofile, so that it’s available in the project where we need this csv, but not in any general R session we may open outside of the project.
.Rprofile
# get pieces to make link
repo <- "JakubPetriska/060958fd744ca34f099e947cd080b540"
csv <- "raw/963b5a9355f04741239407320ac973a6096cd7b6/quotes.csv"
# silently read dataframe
suppressMessages(
  quotes <- readr::read_csv(
    glue::glue("https://gist.githubusercontent.com/{repo}/{csv}")
  )
)
# paste together the full quote
quotes$full_quote <- paste0(quotes$Quote, " -", quotes$Author)
Modified function
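The modified function itself is not reproduced here. A minimal sketch of what it could look like follows, assuming the project-level .Rprofile above has already created a quotes data frame with a full_quote column. The quotes_df argument and the pick_quote() helper are illustrative additions, and a two-row placeholder data frame stands in for the real quotes so the sketch runs without a network connection.

```r
# Placeholder for the data frame that .Rprofile would create at startup
quotes <- data.frame(
  Quote = c("A placeholder quote.", "Another placeholder quote."),
  Author = c("Author A", "Author B")
)
quotes$full_quote <- paste0(quotes$Quote, " -", quotes$Author)

# Pick one quote at random from an already-loaded data frame
pick_quote <- function(quotes_df) {
  sample(quotes_df$full_quote, 1)
}

# No download happens here: the function relies on the pre-loaded quotes
inspire_me <- function(animal, quotes_df = quotes) {
  cowsay::say(pick_quote(quotes_df), by = animal)
}

# Only call it if {cowsay} is installed; fall back to "cat" if ANIMAL is unset
if (requireNamespace("cowsay", quietly = TRUE)) {
  inspire_me(Sys.getenv("ANIMAL", unset = "cat"))
}
```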
Best practices for writing code across languages typically recommend package imports and function definitions at the top of a script, followed by code. For example, a script may look like this:
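The example script is not reproduced here. A minimal sketch of that layout, using only a package and a dataset that ship with R (the summarize_mpg() function is made up for illustration), might be:

```r
# ---- 1. Packages ----
library(stats)  # ships with R; stands in for your project's packages

# ---- 2. Functions ----
# Mean miles per gallon for each number of cylinders
summarize_mpg <- function(d) {
  aggregate(mpg ~ cyl, data = d, FUN = mean)
}

# ---- 3. Code ----
mpg_by_cyl <- summarize_mpg(mtcars)
print(mpg_by_cyl)
```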
These approaches work well when scripts are relatively simple, but as a project grows large and complex, it’s best practice to move functions into another script or set of scripts, and break up your workflow into discrete steps.
For instance, although the inspire_me() function above is relatively simple, we can pretend that the read, transform, and print steps carried out in the function were themselves long functions in part of a much more complex, real-world workflow. Imagine we created a script called functions.R that contained the following code. Don’t worry if you haven’t seen purrr::walk() before. We’ll cover this in a later module on iteration, and all you need to know about it now is that it “walks” over each input and applies a function. In this case, we apply the require() function to a vector of package names to load them.
# list packages in a vector and load them all
pkgs <- c("readr", "cowsay")
purrr::walk(pkgs, require, character.only = TRUE)
# read quotes from a url
f_read_data <- function(url){
  suppressMessages(
    quotes <- read_csv(url)
  )
  return(quotes)
}
# paste the quote to the author
f_preprocess_data <- function(d){
  d$full_quote <- paste0(d$Quote, " -", d$Author)
  return(d)
}
# print a random animal and a random quote
f_inspire_me <- function(d){
  # use a distinct name so we don't shadow the cowsay::animals dataset
  animal_names <- names(cowsay::animals)
  say(sample(d$full_quote, 1), by = sample(animal_names, 1))
}
We can call this script using source() to load these functions into our environment, where they are available for use, just as we load a library.
source("scripts/functions.R")
However, this is hardly a satisfying solution because in a real project, our pretend functions above may grow quite large, and we will likely add more and more functions. Eventually, a single script may hold them all, and something like functions.R may become many hundreds of lines long, making it difficult to sift through, debug, or add new lines of code. A better organizational approach, which makes things easier to maintain over time, is to move all our functions to a directory, /scripts/functions, and store them as separate files, each named after the function it contains:
Save as /scripts/functions/f_read_data.R
# read quotes from a url
f_read_data <- function(url){
  suppressMessages(
    quotes <- read_csv(url)
  )
  return(quotes)
}
Save as /scripts/functions/f_preprocess_data.R
Save as /scripts/functions/f_inspire_me.R
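The contents of these two files are not shown here. Presumably each holds the matching definition from functions.R. A sketch on that assumption follows, with cowsay:: prefixes added so the functions do not depend on which packages are attached, and a placeholder row for a quick check.

```r
# --- scripts/functions/f_preprocess_data.R ---
# paste the quote to the author
f_preprocess_data <- function(d) {
  d$full_quote <- paste0(d$Quote, " -", d$Author)
  return(d)
}

# --- scripts/functions/f_inspire_me.R ---
# print a random animal and a random quote
f_inspire_me <- function(d) {
  animal_names <- names(cowsay::animals)
  cowsay::say(sample(d$full_quote, 1), by = sample(animal_names, 1))
}

# Quick check of the preprocessing step with a placeholder row
demo <- f_preprocess_data(
  data.frame(Quote = "A placeholder quote.", Author = "Nobody")
)
```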
The scripts/functions folder should now contain three files: f_read_data.R, f_preprocess_data.R, and f_inspire_me.R.
Now in our /scripts directory, we create a script, 01_control.R, to source our functions and use them. Be sure to restart R to clear your environment before sourcing this control script so we know we are working from a clean slate.
Save as /scripts/01_control.R and run.
# packages needed for this script
pkgs <- c("readr", "cowsay", "tidyverse", "glue")
# purrr is not attached yet after a restart, so call walk() with its prefix
purrr::walk(pkgs, require, character.only = TRUE)
# silently source all functions using the purrr::walk function
fns <- fs::dir_ls("scripts/functions")
purrr::walk(fns, ~source(.x))
# define the url where quotes are located
# get pieces to make link
repo <- "JakubPetriska/060958fd744ca34f099e947cd080b540"
csv <- "raw/963b5a9355f04741239407320ac973a6096cd7b6/quotes.csv"
url <- glue("https://gist.githubusercontent.com/{repo}/{csv}")
# use all of our functions
f_read_data(url) %>%
  f_preprocess_data() %>%
  f_inspire_me()
-----
You can't stop the waves, but you can
learn to surf. -Jon Kabat-Zinn
------
\
\
____
_.' : `._
.-.'`. ; .'`.-.
__ / : ___\ ; /___ ; \ __
,'_ ""--.:__;".-.";: :".-.":__;.--"" _`,
:' `.t""--.. '<@.`;_ ',@>` ..--""j.' `;
`:-.._J '-.-'L__ `-- ' L_..-;'
"-.__ ; .-" "-. : __.-"
L ' /.------.\ ' J
"-. "--" .-"
__.l"-:_JL_;-";.__
.-j/'.; ;"""" / .'\"-.
.' /:`. "-.: .-" .'; `.
.-" / ; "-. "-..-" .-" : "-.
.+"-. : : "-.__.-" ;-._ \
; \ `.; ; : : "+. ;
: ; ; ; : ; : \:
; : ; : ;: ; :
: \ ; : ; : ; / ::
; ; : ; : ; : ;:
: : ; : ; : : ; : ;
;\ : ; : ; ; ; ;
: `."-; : ; : ; / ;
; -: ; : ; : .-" :
:\ \ : ; : \.-" :
;`. \ ; : ;.'_..-- / ;
: "-. "-: ; :/." .' :
\ \ : ;/ __ :
\ .-`.\ /t-"" ":-+. :
`. .-" `l __/ /`. : ; ; \ ;
\ .-" .-"-.-" .' .'j \ / ;/
\ / .-" /. .'.' ;_:' ;
:-""-.`./-.' / `.___.'
\ `t ._ / bug
"-.t-._:'
source() is the key to chaining together many scripts. In the example above, we were able to abstract functions into a separate folder, which makes keeping track of them much easier than if they cluttered our control script.
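The same pattern can be written with base R alone, which avoids needing any package before the functions are loaded. A self-contained sketch follows; the temporary folder and the two toy functions, f_double() and f_add_one(), are made up for illustration.

```r
# Create a throwaway folder holding one function per file
fn_dir <- file.path(tempdir(), "functions_demo")
unlink(fn_dir, recursive = TRUE)
dir.create(fn_dir)
writeLines("f_double <- function(x) x * 2", file.path(fn_dir, "f_double.R"))
writeLines("f_add_one <- function(x) x + 1", file.path(fn_dir, "f_add_one.R"))

# Find every .R file in the folder and source each one in turn
fn_files <- list.files(fn_dir, pattern = "\\.R$", full.names = TRUE)
for (f in fn_files) {
  source(f)
}

# Both functions are now available, just as if defined in this script
result <- f_add_one(f_double(5))
```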
Learn more
Separating all functions into standalone scripts is not a revolutionary idea – in fact, this is precisely how R packages are written! For example, see the {dplyr} GitHub repo’s /R folder, which contains all dplyr functions in one directory. When you call library(dplyr) you’re essentially sourcing all of these functions into your environment.
If project management and reproducible data pipelines are interesting to you, check out the {targets} R package. A similar framework for Shiny Apps exists called {golem}, which also includes {usethis}-like commands that streamline common chores in Shiny App development.
{renv}
We use RProjects because we expect that whoever else opens your code on their machine is likely to have a different project root path, and using an RProject ensures your code is portable between different computers with different root project paths (e.g., ~/Documents/Github/myproject versus C:\Users\louis\Documents\myproject).
Development environments are similar. When we work in R, or any programming language for that matter, we use a snapshot of package versions based on when we downloaded and installed them (e.g., with install.packages()). You can check the versions of the packages loaded into your current environment with sessionInfo().
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 13.3
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] glue_1.6.2 cowsay_0.8.2 purrr_0.3.4 knitr_1.37
loaded via a namespace (and not attached):
[1] pillar_1.7.0 bslib_0.3.1 compiler_4.1.2
[4] jquerylib_0.1.4 highr_0.9 tools_4.1.2
[7] bit_4.0.4 rmsfact_0.0.3 digest_0.6.29
[10] downlit_0.4.0 jsonlite_1.8.0 evaluate_0.14
[13] memoise_2.0.1 lifecycle_1.0.1 tibble_3.1.7
[16] pkgconfig_2.0.3 png_0.1-7 rlang_1.0.6
[19] cli_3.6.0 rstudioapi_0.13 curl_4.3.2
[22] parallel_4.1.2 distill_1.3 yaml_2.2.2
[25] xfun_0.37 fastmap_1.1.0 withr_2.5.0
[28] stringr_1.4.0 fs_1.5.2 vctrs_0.4.1
[31] sass_0.4.5 hms_1.1.1 tidyselect_1.1.1
[34] bit64_4.0.5 R6_2.5.1 fansi_1.0.3
[37] vroom_1.5.7 rmarkdown_2.13 tzdb_0.2.0
[40] readr_2.1.2 magrittr_2.0.3 usethis_2.1.5
[43] fortunes_1.5-4 htmltools_0.5.4 ellipsis_0.3.2
[46] utf8_1.2.2 stringi_1.7.6 cachem_1.0.6
[49] crayon_1.5.1
The version number is the string of numbers listed after a package name and underscore.
Similarly, you can use installed.packages() to view information on all of your installed packages.
When packages change between versions, changes are typically designed to fix bugs or improve performance, but sometimes, they can break code. Thus, collaborative work on a project may be challenged by people working on the same code but with different versions of packages.
The solution to this problem is for everyone to use the same versions of packages (and R), which is to say that collaborators should use the same development environment. This is a common concept across programming languages.
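To make the idea concrete, here is a hand-rolled sketch that compares the package versions installed on a machine against a set of recorded versions. It only illustrates the problem being solved and is not how any particular tool implements it; check_versions() is a hypothetical helper, and one recorded version is deliberately wrong to show a mismatch.

```r
# Hypothetical helper: compare installed package versions to expected ones.
# `expected` is a named character vector: names are packages, values versions.
check_versions <- function(expected) {
  installed <- vapply(
    names(expected),
    function(pkg) as.character(utils::packageVersion(pkg)),
    character(1)
  )
  data.frame(
    package = names(expected),
    expected = unname(expected),
    installed = unname(installed),
    matches = unname(installed == expected),
    row.names = NULL
  )
}

# Pretend these versions were recorded on a collaborator's machine
recorded <- c(
  stats = as.character(utils::packageVersion("stats")),  # will match
  utils = "0.0.1"                                         # deliberately wrong
)

version_report <- check_versions(recorded)
print(version_report)
```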
{renv} manages your package environment and makes it easy to share it with others by creating and curating a “lock” file (renv.lock) in the root project directory. When starting a project, create the file with renv::init(), install packages as you go along, and update the lockfile with renv::snapshot(). When a collaborator opens your project (for example, after cloning it from GitHub), they open the .Rproj file, {renv} bootstraps itself, and running renv::restore() installs the package versions recorded in the lockfile.
If you find yourself needing to share important analyses, perhaps ones that run on a production server, you should look into {renv}. For most day-to-day data science that you don’t plan on sharing or working on collaboratively, it may be unnecessary.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".