2. Getting Started with R for Water Resources Data Science

Unleash your potential to work with any kind of data…

Welcome!

The goals of this training are to expose you to fundamentals and to develop an appreciation of what is possible with R. Importantly, we hope to instill the idea that data science should be transferable and reproducible, no matter the specific tools (e.g., R) or the dataset used. Reproducibility is a framework and mindset. An introductory course will not make you an expert, but hopefully it will draw you in, build excitement about using reproducible approaches in your own work, and show you how to be more efficient and less stressed when faced with messy real-world data!

We hope you will find this workshop both fun and helpful, and we appreciate your patience as we continue to develop this course, both virtually and elsewhere! We would also like to thank Allison Horst for allowing us to use her incredible monsteRs and R illustrations that you will see included throughout this course.

Why R? Why Data Science?

Computer literacy is essential in all aspects of the working world. The ability to reproduce a set of common or repetitive tasks efficiently and reliably is the root goal of reproducibility. Whether in science, management, or other fields, the skills required to increase productivity, maintain transparency, and feel competent and able to tackle multiple tasks all boil down to data science and project management tools.

Data management skills are needed for entering data without errors, storing it in a usable way, and extracting key aspects of the data for analysis. Basic programming is required for everything from accessing and managing data to visualizing it, running statistical analyses, and building models. Data science combines the skills and tools that allow for acquiring, processing, analyzing, communicating, and maintaining data of any type.

Course Objectives

This course’s main objective is to provide students with an introduction to the power and possibility that an open source (free) and reproducible programming language such as R provides by demonstrating how to import, explore, visualize, and communicate different types of data. Using water resources based examples, this course will guide students through basic data science skills and strategies for continued learning and use of R. R is a language for statistical computing and a general purpose programming language. It is one of the primary languages used in data science and for data analysis across many of the natural sciences. This course will provide lessons in:

Why should I invest time in learning R?

There are many programming languages available and each has specific benefits. R was originally created as a statistical programming language but now is largely viewed as a ‘data science’ language. R is also an open-source programming language - not only is it free, but this means anybody can contribute to its development. Furthermore, R has powerful and nearly limitless plotting/visualization functionality, and you can adjust nearly any aspect of a graph to communicate your data effectively.

As Hadley Wickham, a prominent R developer, states:

R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers… it helps you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer. (R4DS)

R/RStudio Fundamentals

In the old days, the only way to use R was directly from the Console, a bare-bones way of running R by typing in commands one at a time. Now, RStudio is the go-to Integrated Development Environment (IDE) for R. Think of it like a car that is built around an engine. RStudio is built around the R Console (engine) and includes many other features to improve the user’s experience.

Let’s get familiar with RStudio. If you haven’t done so already, download and install RStudio from the link above (version 1.4.1103). After it’s installed, find the RStudio shortcut and fire it up. You should see something like this:

Figure 2: The RStudio window.

There are four panes in RStudio (starting from the top right and moving clockwise):

Executing code in RStudio (Console)

The first part of RStudio that we will work in is called the Console, which tells you what code R is running. We can use many of the same commands found in a calculator or Microsoft Excel.

Type the following text into the console at the line that ends with >, press Enter, and you should see the following results:

# enter in console (after the ">" mark)
8 / 4
[1] 2
# enter in console (after the ">" mark)
4 + 8
[1] 12
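
The Console handles more than division and addition. As a small sketch of other calculator-style operations in base R (the numbers are arbitrary and chosen only for illustration):

```r
# exponent: 2 raised to the power of 3
2^3
# [1] 8

# square root
sqrt(16)
# [1] 4

# remainder after division (modulo)
7 %% 2
# [1] 1

# parentheses control the order of operations
(4 + 8) / 2
# [1] 6
```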

Storing Variables in the Environment

You can also create variables with custom values (we’ll talk much more about this later). Here are the basics. The first part of the code is a name of your choosing. Meaningful variable names are better, but the main rules are: 1) that it can’t start with a number and 2) it must not have any spaces. The second bit, <-, is the assignment operator. This tells R to take the result of whatever is on the right side of the <- and save it as a new object in your R Environment.
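
One way to check a name against the rules above is the base R function make.names(), which rewrites any text into a valid R name. This is only a sketch: the candidate names below are made up for illustration, and comparing the input to the output is our own shortcut rather than an official validity test.

```r
# make.names() converts any text into a syntactically valid R name;
# if the result equals the input, the name was already valid
candidate_names <- c("stream", "2stream", "stream flow", "stream_flow")
fixed_names <- make.names(candidate_names)
fixed_names
# [1] "stream"      "X2stream"    "stream.flow" "stream_flow"

# TRUE where the original name was already valid
candidate_names == fixed_names
# [1]  TRUE FALSE FALSE  TRUE
```

Names that start with a number or contain a space get changed, which signals that they would not work as typed.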

Type the following into your Console and press Enter after each one:

# assign the value 4 to the variable name "stream"
stream <- 4
# assign the value 8 to the variable name "pebble"
pebble <- 8

You might notice there is no output in the Console for the lines of code you have just run. Instead, the values have been stored in your R Environment in the top right pane of your RStudio window. Click on that tab and take a look! Because they have been stored, you can print these variables by typing their name in the Console and pressing Enter.

Generally there are two possible outcomes when you run code. First, the code will simply print output directly in the console, as it did with the calculations you entered above. Second, there is no output because you have stored it as a variable in the Environment. The Environment is the collection of named objects that are stored in memory for your current R session. Anything stored in memory will be accessible by the variable name without running the original script that was used to create it.
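
The two outcomes described above can be seen side by side in a short sketch. It reuses the stream and pebble variables from this lesson; ls() and exists() are base R functions that let you inspect the Environment from the Console instead of the Environment tab.

```r
# create two variables; assignment prints nothing
stream <- 4
pebble <- 8

# typing a name prints the stored value
stream
# [1] 4

# ls() lists the names of objects in the current environment
ls()

# exists() checks whether a name is defined
exists("pebble")
# [1] TRUE
```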

Add the variables you just created together, and examine the output:

stream + pebble
[1] 12

You can also create new variables using existing variables like so:

habitat <- stream + pebble

Keep in mind R is case-sensitive! Details like spacing and spelling matter in coding (the computer will do only exactly what you say, nothing more, nothing less), so if we use Habitat or habitatt or ha_bitat hoping to get the value of habitat, we’re out of luck.
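
A quick sketch of case sensitivity in practice, using the habitat variable from this lesson. The tryCatch() wrapper is our addition so that the example keeps running and stores the error message instead of stopping; typing Habitat directly in the Console would simply show the error.

```r
stream <- 4
pebble <- 8
habitat <- stream + pebble

# the lowercase name exists, the capitalized one does not
exists("habitat")
# [1] TRUE
exists("Habitat")
# [1] FALSE

# tryCatch() captures the error message instead of stopping
msg <- tryCatch(
  Habitat,
  error = function(e) conditionMessage(e)
)
msg
```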

Please note: Clicking on the broom button in the Environment will permanently clear out your existing variables. Only do this if you are certain you want to remove/reset all saved variables and datasets.
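
Variables can also be removed from the Console with the base R function rm(). The sketch below removes a single variable; the last line, left commented out on purpose, is roughly what we understand the broom button to do for visible objects.

```r
stream <- 4
pebble <- 8

# remove a single variable by name
rm(pebble)
exists("pebble")
# [1] FALSE
exists("stream")
# [1] TRUE

# remove everything in the environment (roughly what the broom does);
# left commented out so it is not run by accident
# rm(list = ls())
```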

This same pane also holds the History tab, which records all the code you’ve run, and the Connections tab, which shows connections to databases and other data sources.

Installing Packages (Files etc tab)

Immediately below the Environment is the third section of RStudio, where all of your Packages are stored. The base or core installation of R is quite powerful, but because R is open-source, there are thousands of packages (pieces of code we can download and use) that dramatically extend the capability of R, from statistical analysis, to modeling, to geospatial mapping, to website and document creation. A package is a standardized collection of code and functions that extends R with new methods, techniques, and programming functionality.

CRAN

One of the reasons for R’s popularity is CRAN, The Comprehensive R Archive Network. CRAN is where you download R and also where you can gain access to additional packages. All of the packages we will use during this tutorial will be downloaded from CRAN. As of 2021-07-02, there are 17,807 packages on CRAN!

Installing {tidyverse}

When a package is installed, that means the source code is downloaded and put into your library. Let’s give it a shot using the {tidyverse}, a set of packages assembled for data tidying and visualization purposes.

We’re going to use our very first function, install.packages(), to install this package. Type the following into your Console and press Enter:

install.packages("tidyverse")

You should see it appear in the Packages tab. To find it, you can either scroll through the list, or type the package name into the search bar at the top of the pane.
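
Installing a package only needs to happen once per R installation, so scripts often check first. The following is a minimal sketch of that pattern, not something this course requires: install_if_missing() is a hypothetical helper name of our own, built on the base R function requireNamespace(). It is demonstrated on {stats}, which ships with R, so nothing is downloaded.

```r
# install_if_missing() is a hypothetical helper name, not part of R
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be loaded
  already_installed <- requireNamespace(pkg, quietly = TRUE)
  if (!already_installed) {
    install.packages(pkg)
  }
  # return whether the package was already installed, without printing
  invisible(already_installed)
}

# "stats" ships with R, so nothing is downloaded here
was_installed <- install_if_missing("stats")
was_installed
# [1] TRUE
```

Calling install_if_missing("tidyverse") would download the package only when it is not already in your library.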

In order to use a package, you must load it into your current workspace. This is sometimes called attaching a package. You can attach a package by checking the box next to its name in the Packages tab, but we prefer to avoid clicking whenever possible, so we’ll load this package using the library() function. Type the following into your Console and press Enter:

library(tidyverse)

Now your package is loaded, and ready to use. You can be certain your package is attached if there is a check mark next to the package name in the Packages tab.
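
You can also check from the Console whether a package is attached. The sketch below uses {tools}, a package that ships with R, so that it runs even before {tidyverse} is installed; the file name streamflow.csv is made up and does not need to exist.

```r
# attach the {tools} package, which ships with R
library(tools)

# search() lists what is attached; packages appear as "package:<name>"
"package:tools" %in% search()
# [1] TRUE

# once attached, functions can be called by name
ext <- file_ext("streamflow.csv")
ext
# [1] "csv"

# pkg::fun() calls a function without attaching the package
ext2 <- tools::file_ext("streamflow.csv")
```

The pkg::fun() form is handy when you need a single function and do not want to attach the whole package.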

Remaining Tabs in the Files etc Pane

The remaining tabs in this pane allow you to see:

Great job! You’ve opened up RStudio, learned some of the basic functionality, and now we’re ready to get going on your first R project!


Getting Help

Being able to find help and interpret that help is probably one of the most important skills for learning a new language. We have a whole lesson devoted to help and troubleshooting, but here’s a quick overview for “local” help that is already built into R and RStudio. Help on functions and packages can be accessed directly from R.

Help from the Console

Getting help from the Console is straightforward and can be done in several ways. If you know the name of the function or package, you can type it as follows:

# Using the help command/shortcut
# When you know the name of a function
help("print") # Help on the print command
?print # Help on the print command using the `?` shortcut

# When you know the name of the package
help(package = "dplyr") # Help on the package `dplyr`

If we don’t know the name, or maybe just want to look for a part of something, we can use the following in the Console:

# Don't know the exact name or just part of it?
apropos("print") # Returns all available functions with "print" in the name
??print # shortcut for help.search("print"); also searches demos and vignettes in a formatted page
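
A few more Console helpers can narrow things down without opening a help page. This is a sketch using functions from base R and {utils}; the choice of round and read.csv is arbitrary.

```r
# args() shows a function's arguments without opening the help page
args(round)

# apropos() accepts a regular expression;
# "^read\\." matches names that start with "read."
read_functions <- apropos("^read\\.")
read_functions

# find() reports which attached package provides a function
find("read.csv")
# [1] "package:utils"
```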


Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".