If you find any typos, errors, or places where the text could be improved, please let us know by providing feedback either in the feedback survey (given during class), by using GitLab, or directly in this document with hypothes.is annotations.
- Open an issue or submit a merge request on GitLab with the feedback or suggestions.
- Add an annotation using hypothes.is. To add an annotation, select some text and then click the annotate icon on the pop-up menu. To see the annotations of others, click the icon in the upper right-hand corner of the page.
B Resources
B.1 For this course
- For a quick reference to the functions and tools taught in this course, check out our cheatsheet!
B.1.1 Setting up RStudio Cloud
Are technical issues with your computer preventing you from completing the pre-course tasks or doing the course sessions?
A last-ditch solution may be to use the RStudio Cloud platform to do the course. What is it?
RStudio Cloud is a cloud computing service hosted by RStudio. It runs RStudio in your web browser with full functionality, including built-in integration of R Projects and Git. It looks slightly different from a local RStudio session, and it treats R Projects and Git a little differently, but the overall R user experience is virtually identical. For what we need in this course, we are fine with their free service, which only requires a little work to register and set up.
Setting up RStudio Cloud
- Register a free account: Go to the RStudio Cloud website and complete the sign-up process for a free account.
- Log in to the account. After logging in, we start on the “Personal Workspace” page. Here we can create new R Projects (which we’ll do next), set up additional workspaces (which we don’t need for this course) or read the R guides and cheatsheets available in the “Learn” section of the menu panel (we can explore those after the course; there’s a lot of good content):
Figure B.1: This is RStudio Cloud.
- Start a new R Project: Click the “New Project” button. To practice using it, click the right half of the button and you get the option to create a project by cloning a GitHub repository (more on that later):
Figure B.2: Create a new RStudio Cloud Project.
Figure B.3: Our first RStudio Cloud project. The interface is identical to a local instance of RStudio.
- Throttle CPU and memory usage down: For this course, we don’t need a lot of computer power, so we can get by with just the minimum memory and CPU settings. With these settings, we get 30 hours of run-time per month, which should be plenty for this course. To do so, click the gear-icon in the top right of the screen, then the “Resources” tab, toggle both sliders to the far left, and click “Apply Changes”:
Figure B.4: Tuning cloud resources. Slow and steady wins the race!
- Set up access to GitHub: Finally, we need to allow access to private GitHub repositories, which we’ll be using in this course. Click on your user name in the top-right corner, and then the “Authentication” tab. This will take us to the authentication options, where we tick both boxes at the bottom for enabling GitHub, including the one labeled “Private repo access also enabled”:
Figure B.5: Enable access to private repositories on GitHub.
- Setting up Git on RStudio Cloud projects: Git cannot be set up inside the project using packages, so prodigenr and usethis::use_git() will not work. However, it can easily be done through the RStudio menu interface: Tools -> Version Control -> Project Setup -> Version Control System: Git
- Now you’re good to go! You should have no issues completing the pre-course tasks and have a course experience similar to those using a local installation.
- There are just a few more differences between local and RStudio Cloud sessions to be aware of:
- Uploading files to RStudio Cloud projects: To make files available to RStudio Cloud projects, we have to upload the files to it. Thankfully, this is very easy! In RStudio Cloud, the “Files” tab of Panel D (Figure 3.1) contains a new icon “Upload File,” which allows us to easily upload a file from our local computer (or several files within a zip file) to the RStudio Cloud folder we’re currently in:
Figure B.6: Uploading files (e.g. datasets) to the cloud is easy.
- R projects and folder structure in RStudio Cloud: Since we start every R session inside of a project on RStudio Cloud, RStudio in the cloud session doesn’t let us set up projects from inside of the R session. The prodigenr package can still be used through the console to set up the basic folder structure of our projects. However, the root folder of any project is fixed to /cloud/project/, so when setting up our folder structure with prodigenr, we should keep this in mind (e.g. prodigenr::setup_project("LearningR") creates the prodigenr folders inside /cloud/project/LearningR and avoids unnecessarily long file paths).
- Cloning a GitHub remote repository in RStudio Cloud: This can be done when setting up a new project in the RStudio Cloud workspace (“New Project from Git Repo”), or from the terminal, but it cannot be done through the RStudio main menu interface.
B.1.2 Potential exercise solutions
These are potential exercise solutions of R code only. They are mostly intended as a resource for after the class, not during it.
Exercise: Reading the READMEs
# This exercise has no solution.
Exercise: Better file naming
# Bad: Has a space.
fit models.R
# Good: Descriptive with no space.
fit-models.R
# Bad: Not descriptive.
foo.r
stuff.r
# Good: Descriptive with no space.
get_data.R
# Bad: Has space
Manuscript version 10.docx
# Good: Descriptive.
manuscript.docx
# Bad: Not descriptive and has spaces.
new version of analysis.R
# Bad: Not descriptive and has dots.
trying.something.here.R
# Good: Descriptive with - or _
plotting-regression.R
utility_functions.R
# Bad: Not descriptive.
code.R
Exercise: Make code more readable
# Object names
# Should be snake case (looks like `snake_case`)
# DayOne
day_one
# Should not overwrite existing function names
# T = TRUE, so don't name anything T
# T <- FALSE
false <- FALSE
# c is a function name already. Plus c is not descriptive
# c <- 9
number_value <- 9
# Spacing
# Commas should be in the correct place
# x[,1]
# x[ ,1]
x[, 1]
# Spaces should be in the correct place
# mean (x, na.rm = TRUE)
# mean( x, na.rm = TRUE )
mean(x, na.rm = TRUE)
# Add spaces between separate words and symbols
# height<-feet*12+inches
height <- feet * 12 + inches
# But don't add spaces interrupting strings of symbols or code
# df $ z
df$z
# x <- 1 : 10
x <- 1:10
# Indenting and brackets
# Indenting should be done after if, for, else functions
# if (y < 0 && debug) {
# message("Y is negative")}
if (y < 0 && debug) {
  message("Y is negative")
}
Exercise: Committing to history
# There is no R code solution for this section.
Exercise: Clone GitHub repository from RStudio
# There is no R code solution for this section.
Exercise: Push and pull
# There is no R code solution for this section.
Exercise: Dealing with merge conflicts
# There is no R code solution for this section.
Exercise: Getting familiar with the dataset
# Load the packages
source(here::here("R/package-loading.R"))
# Check column names
colnames(NHANES)
# Look at contents
str(NHANES)
glimpse(NHANES)
# See summary
summary(NHANES)
# Look over the dataset documentation
?NHANES
Exercise: Practice what we’ve learned
# 1. Select specific columns
nhanes_small %>%
  select(tot_chol, bp_sys_ave, poverty)
# 2. Rename columns
nhanes_small %>%
  rename(diabetes_diagnosis_age = diabetes_age)
# 3. Re-write with pipe
nhanes_small %>%
  select(bmi, contains("age"))
# 4. Re-write with pipe
nhanes_small %>%
  select(phys_active_days, phys_active) %>%
  rename(days_phys_active = phys_active_days)
Exercise: Piping, filtering, and mutating
# 1. BMI between 20 and 40, with diabetes
nhanes_small %>%
  filter(bmi >= 20 & bmi <= 40 & diabetes == "Yes")
# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # dataset
  mutate(
    mean_arterial_pressure = ((2 * bp_dia_ave) + bp_sys_ave) / 3,
    young_child = if_else(age < 6, "Yes", "No")
  )
nhanes_modified
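To see what the mutate() step computes, here is a minimal sketch in base R. The blood pressure and age values are made up for illustration (they are assumptions, not taken from NHANES), and base ifelse() is used as a stand-in for dplyr's if_else() so the sketch runs without any packages.

```r
# Illustrative values only (assumed, not from the NHANES dataset).
example_bp <- data.frame(
  bp_sys_ave = c(120, 140),
  bp_dia_ave = c(80, 90),
  age = c(4, 35)
)

# Same formula as in the solution: (2 * diastolic + systolic) / 3
example_bp$mean_arterial_pressure <-
  ((2 * example_bp$bp_dia_ave) + example_bp$bp_sys_ave) / 3

# Base R stand-in for dplyr::if_else(): flag children younger than 6
example_bp$young_child <- ifelse(example_bp$age < 6, "Yes", "No")

example_bp
```

The first row gives (2 * 80 + 120) / 3, which is about 93.3, and is flagged as a young child; the second row is not.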
Exercise: Calculate some basic statistics
# 1.
nhanes_small %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE),
            mean_age = mean(age, na.rm = TRUE))
# 2.
nhanes_small %>%
  summarise(max_height = max(height, na.rm = TRUE),
            min_height = min(height, na.rm = TRUE))
# 3.
nhanes_small %>%
  summarise(median_age = median(age, na.rm = TRUE),
            median_phys_active_days = median(phys_active_days, na.rm = TRUE))
Exercise: Answer some statistical questions with group by and summarise
# 1.
nhanes_small %>%
  filter(!is.na(diabetes)) %>%
  group_by(diabetes, sex) %>%
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    max_age = max(age, na.rm = TRUE),
    min_age = min(age, na.rm = TRUE)
  )
# 2.
nhanes_small %>%
  filter(!is.na(diabetes)) %>%
  group_by(diabetes, sex) %>%
  summarise(
    mean_height = mean(height, na.rm = TRUE),
    max_height = max(height, na.rm = TRUE),
    min_height = min(height, na.rm = TRUE),
    mean_weight = mean(weight, na.rm = TRUE),
    max_weight = max(weight, na.rm = TRUE),
    min_weight = min(weight, na.rm = TRUE)
  )
Exercise: Practicing the dplyr functions
NHANES %>%
  rename_with(snakecase::to_snake_case) %>%
  select(gender, age, bmi) %>%
  filter(!is.na(gender) & !is.na(age) & !is.na(bmi)) %>%
  rename(sex = gender) %>%
  mutate(age_class = if_else(age < 50, "under 50", "over 50")) %>%
  group_by(age_class, sex) %>%
  summarize(bmi_mean = mean(bmi, na.rm = TRUE),
            bmi_median = median(bmi, na.rm = TRUE))
Exercise: Create another R Markdown document.
# This exercise has no R code solution.
Exercise: Creating a table using R code
# 1. Loading libraries
source(here::here("R/package-loading.R"))
load(here::here("data/nhanes_small.rda"))
# 2. Calculating mean BMI and Age
nhanes_small %>%
  filter(!is.na(diabetes)) %>%
  group_by(diabetes, sex) %>%
  summarise(mean_age = mean(age, na.rm = TRUE),
            mean_bmi = mean(bmi, na.rm = TRUE)) %>%
  ungroup() %>%
  # 3. Round the means to 1 digit and
  # modify the `sex` column so that male and female get capitalized.
  mutate(mean_age = round(mean_age, 1),
         mean_bmi = round(mean_bmi, 1),
         sex = str_to_sentence(sex)) %>%
  # 4. Rename `diabetes` to `"Diabetes Status"` and `sex` to `Sex`
  rename("Diabetes Status" = diabetes, Sex = sex,
         "Mean Age" = mean_age, "Mean BMI" = mean_bmi) %>%
  # 5. Include the `knitr::kable()` function at the end of the pipe.
  knitr::kable(caption = "A prettier Table. Mean values of Age and BMI for each sex and diabetes status.")
Exercise: Practice using Markdown for writing text
# This exercise has no R code solution.
Exercise: Adding figures and changing the theme
# This exercise has no R code solution.
Exercise: Creating plots with one or two variables
# 1a. Distribution of age
ggplot(nhanes_tidied, aes(x = age)) +
geom_histogram()
# 1b. Distribution of age at diabetes diagnosis
ggplot(nhanes_tidied, aes(x = diabetes_age)) +
geom_histogram()
# 2a. Number of people who smoke now
ggplot(nhanes_tidied, aes(x = smoke_now)) +
geom_bar()
# 2b. Number of people who are physically active
ggplot(nhanes_tidied, aes(x = phys_active)) +
geom_bar()
# 3a. BMI in relation to systolic blood pressure
ggplot(nhanes_tidied, aes(x = bmi, y = bp_sys_ave)) +
geom_hex()
# 3b. BMI relation to diastolic blood pressure
ggplot(nhanes_tidied, aes(x = bmi, y = bp_dia_ave)) +
geom_hex()
# 4. Physically active people with or without diabetes
ggplot(nhanes_tidied, aes(x = diabetes, fill = phys_active)) +
geom_bar(position = position_dodge())
# 5. Poverty levels between those with or without diabetes
ggplot(nhanes_tidied, aes(x = diabetes, y = poverty)) +
geom_violin()
Exercise: Changing the colour schemes
# 1. change colors to a viridis color scheme
base_barplot +
  scale_fill_viridis_d()
# 2. change colors to another viridis color scheme
base_barplot +
  scale_fill_viridis_d(option = "A")
base_barplot +
  scale_fill_viridis_d(option = "E")
# 3. change colours to a viridis color scheme
base_scatterplot +
  scale_color_viridis_c()
# 4. change colors to another viridis color scheme
base_scatterplot +
  scale_color_viridis_c(option = "B")
B.2 Resources for general use:
B.3 More formatting syntax in Markdown
B.3.1 Block quotes
Block quotes are used when you want to emphasize a block of text,
usually for quoting someone.
You create a block quote by putting a > at the beginning of the line,
and as with the lists and headers,
it needs empty lines before and after the text.
So it looks like this:
> Block quote
which gives…
Block quote
B.3.2 Adding footnotes
Footnotes are added by enclosing a number or word in square brackets ([])
and beginning it with a caret (^). It looks like this:
Footnote[^1] or this[^note].
[^1]: Footnote content
[^note]: Another footnote
which gives…
Now, if you scroll down to the bottom of the page, you will see these footnotes.
B.3.3 Adding links to websites
Including a link to a website in your document is done by surrounding the link text
with square brackets ([]) followed by the link URL in parentheses (()).
There must not be any space between the square brackets
and the parentheses (it should look like []()).
[Link](https://google.com)
which gives…
B.3.4 Inserting (simple) tables
While you can insert tables using Markdown too,
it isn’t recommended to do that for complicated or large tables.
Tables are created by separating columns with |,
with the table header being separated by a line that looks like |:--|.
A simple example is:
| | Fun | Serious |
|:--|----:|--------:|
| **Happy** | 1234 | 5678 |
| **Sad** | 123 | 456 |
which gives…
| | Fun | Serious |
|---|---|---|
| Happy | 1234 | 5678 |
| Sad | 123 | 456 |
The |:---| or |---:| tell the table to left-align or right-align the values
in the column, respectively. Center-align is |:----:|.
As you can probably imagine,
doing this for larger or even slightly more complicated tables is not practical.
A good alternative approach is to create the table in a spreadsheet,
import that table into R within a code chunk,
and then use knitr::kable() to create the table.
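As a minimal sketch of that workflow, the chunk below reads a small table and passes it to knitr::kable(). The inline CSV text is only a stand-in for a spreadsheet exported to a file (in practice we would point read.csv() at that file instead), and the object names and caption are made up for illustration. The call to kable() is guarded so the sketch still runs if knitr is not installed.

```r
# Stand-in for a spreadsheet saved as CSV (assumed example data).
csv_text <- "Mood,Fun,Serious
Happy,1234,5678
Sad,123,456"

# Import the table; with a real file this would be read.csv("<path to file>").
mood_table <- read.csv(text = csv_text)

# Create the table, left-aligning the first column and right-aligning the rest.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(
    mood_table,
    align = "lrr",
    caption = "Fun and serious counts by mood."
  ))
}
```

Because the values live in a data frame rather than in hand-typed Markdown, changing the data only requires updating the spreadsheet and re-running the chunk.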
B.4 Other colour schemes in ggplot2
You can modify ggplot2 colour schemes using many other pre-defined palettes by installing new R packages, including scientific journal colour palettes (ggsci) and even a Wes Anderson (wesanderson) or a Studio Ghibli (ghibli) colour palette! Also check out the Data Visualization book in the Resources chapter for more information and learning on visualizing data.
B.5 For learning
Free online books:
- R for Data Science: Excellent open and online resource for using R for data analysis and data science.
- Fundamentals of Data Visualization: Excellent online resource for using ggplot2 and R graphics. The book mostly focuses on concepts and theory of how to visualize, rather than the practicalities (i.e. no coding involved).
- ModernDive: Statistical Inference via Data Science: Great book on using statistics and data science methods in R.
- Happy Git and GitHub for the useR (highly recommended): Specifically useful is the chapter on Daily Workflows using Git.
- Data Visualization: A practical introduction: A book that goes into practical as well as conceptual detail on how and why to make certain graphs, given your data.
- Online book for R Markdown: The go-to reference for learning and using R Markdown.
- Course material for a statistics class: Excellent course material for teaching statistics and R.
Quick references:
- RStudio cheatsheets: Multiple, high-quality cheatsheets you can print off to use as a handy reference.
- Tidyverse style guide: To learn about how to write well-styled code in R.
Articles:
- Good enough practices in scientific computing: An article listing and describing some practices to use when writing code.
- Best practices in scientific computing.
- Case study of reproducible methods in Bioinformatics: (Kim, Poline, and Dumas 2018).
General sites
- Organizing R Source Code.
- Hands-on tutorial for learning Git, in a web-based terminal.
- Simpler, first-steps guide to using Git.
- RStudio tutorial on using R Markdown.
- Markdown syntax guide.
- Pandoc Markdown Manual (R Markdown uses pandoc).
- Adding citations in R Markdown.
Interactive sites or resources for hands-on learning:
Videos:
- Video on using Git in RStudio.
Getting help:
- StackOverflow for tidyr.
- StackOverflow for dplyr.
- StackOverflow for ggplot2.
- Tip: Combine auto-completion with :: to find new functions and documentation on the functions (e.g. try typing base:: and then hitting Tab to show a list of all functions found in base R).
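The same idea can be explored from the console. As a rough sketch using only base R, getNamespaceExports() lists what a package makes available through ::, which is approximately what the Tab completion shows. The choice of the utils package and the filtering pattern are just examples.

```r
# List everything the utils package exports
# (roughly what typing utils:: and hitting Tab would offer).
utils_exports <- sort(getNamespaceExports("utils"))

# Narrow the list down to functions whose names start with "read."
read_functions <- grep("^read\\.", utils_exports, value = TRUE)
read_functions

# Documentation for any of them can then be opened with, for example:
# ?utils::read.csv
```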
Package | Title | Description |
---|---|---|
bookdown | Authoring Books and Technical Documents with R Markdown | Output formats and utilities for authoring books and technical documents with R Markdown. |
broom | Convert Statistical Objects into Tidy Tibbles | Summarizes key information about statistical objects in tidy tibbles. This makes it easy to report results, create plots and consistently work with large numbers of models at once. Broom provides three verbs that each provide different types of information about a model. tidy() summarizes information about model components such as coefficients of a regression. glance() reports information about an entire model, such as goodness of fit measures like AIC and BIC. augment() adds information about individual observations to a dataset, such as fitted values or influence measures. |
data.table | Extension of data.frame | Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development. |
datapasta | R Tools for Data Copy-Pasta | RStudio addins and R functions that make copy-pasting vectors and tables to text painless. |
dplyr | A Grammar of Data Manipulation | A fast, consistent tool for working with data frame like objects, both in memory and out of memory. |
forcats | Tools for Working with Categorical Variables (Factors) | Helpers for reordering factor levels (including moving specified levels to front, ordering by first appearance, reversing, and randomly shuffling), and tools for modifying factor levels (including collapsing rare levels into other, ‘anonymising,’ and manually ‘recoding’). |
fs | Cross-Platform File System Operations Based on ‘libuv’ | A cross-platform interface to file system operations, built on top of the ‘libuv’ C library. |
ggplot2 | Create Elegant Data Visualisations Using the Grammar of Graphics | A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics.” You provide the data, tell ‘ggplot2’ how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. |
glue | Interpreted String Literals | An implementation of interpreted string literals, inspired by Python’s Literal String Interpolation <https://www.python.org/dev/peps/pep-0498/> and Docstrings <https://www.python.org/dev/peps/pep-0257/> and Julia’s Triple-Quoted String Literals <https://docs.julialang.org/en/v1.3/manual/strings/#Triple-Quoted-String-Literals-1>. |
googledrive | An Interface to Google Drive | Manage Google Drive files from R. |
haven | Import and Export ‘SPSS,’ ‘Stata’ and ‘SAS’ Files | Import foreign statistical formats into R via the embedded ‘ReadStat’ C library, <https://github.com/WizardMac/ReadStat>. |
here | A Simpler Way to Find Your Files | Constructs paths to your project’s files. The ‘here()’ function uses a reasonable heuristics to find your project’s files, based on the current working directory at the time when the package is loaded. Use it as a drop-in replacement for ‘file.path(),’ it will always locate the files relative to your project root. |
janitor | Simple Tools for Examining and Cleaning Dirty Data | The main janitor functions can: perfectly format data.frame column names; provide quick counts of variable combinations (i.e., frequency tables and crosstabs); and isolate duplicate records. Other janitor functions nicely format the tabulation results. These tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel. This package follows the principles of the “tidyverse” and works well with the pipe function %>%. janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff. |
knitr | A General-Purpose Package for Dynamic Report Generation in R | Provides a general-purpose tool for dynamic report generation in R using Literate Programming techniques. |
lubridate | Make Dealing with Dates a Little Easier | Functions to work with date-times and time-spans: fast and user friendly parsing of date-time data, extraction and updating of components of a date-time (years, months, days, hours, minutes, and seconds), algebraic manipulation on date-time and time-span objects. The ‘lubridate’ package has a consistent and memorable syntax that makes working with dates easy and fun. Parts of the ‘CCTZ’ source code, released under the Apache 2.0 License, are included in this package. See <https://github.com/google/cctz> for more details. |
patchwork | The Composer of Plots | The ‘ggplot2’ package provides a strong API for sequentially building up a plot, but does not concern itself with composition of multiple plots. ‘patchwork’ is a package that expands the API to allow for arbitrarily complex composition of plots by, among others, providing mathematical operators for combining multiple plots. Other packages that try to address this need (but with a different approach) are ‘gridExtra’ and ‘cowplot.’ |
purrr | Functional Programming Tools | A complete and consistent functional programming toolkit for R. |
readr | Read Rectangular Text Data | The goal of ‘readr’ is to provide a fast and friendly way to read rectangular data (like ‘csv,’ ‘tsv,’ and ‘fwf’). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes. |
readxl | Read Excel Files | Import excel files into R. Supports ‘.xls’ via the embedded ‘libxls’ C library <https://github.com/libxls/libxls> and ‘.xlsx’ via the embedded ‘RapidXML’ C++ library <http://rapidxml.sourceforge.net>. Works on Windows, Mac and Linux without external dependencies. |
rio | A Swiss-Army Knife for Data I/O | Streamlined data import and export by making assumptions that the user is probably willing to make: ‘import()’ and ‘export()’ determine the data structure from the file extension, reasonable defaults are used for data import and export (e.g., ‘stringsAsFactors=FALSE’), web-based import is natively supported (including from SSL/HTTPS), compressed files can be read directly without explicit decompression, and fast import packages are used where appropriate. An additional convenience function, ‘convert(),’ provides a simple method for converting between file types. |
rmarkdown | Dynamic Documents for R | Convert R Markdown documents into a variety of formats. |
stringr | Simple, Consistent Wrappers for Common String Operations | A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package. All function and argument names (and positions) are consistent, all functions deal with “NA”’s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another. |
tibble | Simple Data Frames | Provides a ‘tbl_df’ class (the ‘tibble’) that provides stricter checking and better formatting than the traditional data frame. |
tidyr | Tidy Messy Data | Tools to help to create tidy data, where each column is a variable, each row is an observation, and each cell contains a single value. ‘tidyr’ contains tools for changing the shape (pivoting) and hierarchy (nesting and ‘unnesting’) of a dataset, turning deeply nested lists into rectangular data frames (‘rectangling’), and extracting values out of string columns. It also includes tools for working with missing values (both implicit and explicit). |
tidyverse | Easily Install and Load the ‘Tidyverse’ | The ‘tidyverse’ is a set of packages that work in harmony because they share common data representations and ‘API’ design. This package is designed to make it easy to install and load multiple ‘tidyverse’ packages in a single step. Learn more about the ‘tidyverse’ at <https://tidyverse.org>. |
usethis | Automate Package and Project Setup | Automate package and project setup tasks that are otherwise performed manually. This includes setting up unit testing, test coverage, continuous integration, Git, ‘GitHub,’ licenses, ‘Rcpp,’ ‘RStudio’ projects, and more. |
vroom | Read and Write Rectangular Text Data Quickly | The goal of ‘vroom’ is to read and write data (like ‘csv,’ ‘tsv’ and ‘fwf’) quickly. When reading it uses a quick initial indexing step, then reads the values lazily, so only the data you actually use needs to be read. The writer formats the data in parallel and writes to disk asynchronously from formatting. |
B.6 Frequently Asked Questions (FAQ)
B.6.1 Installation Problems
Q: What do I do if I don’t have admin rights and can’t install the newest versions of programs?
- A: Depending on your institutional policies and infrastructure, you may not be able to download newer versions of R and RStudio. Firstly, you should contact your IT support person to see whether you are able to access an updated version. While you could attempt to use the older version of these programs available to you, you may notice that some things either look quite different or do not work at all. One solution is to use the RStudio Cloud platform, as per the instructions in the section Setting up RStudio Cloud.
Q: Why did my output have so much red text when I tried to install packages?
- A: Rest assured that this is simply the way that R informs you that packages are being installed, even though it may look scary. If there really was an error, the output would likely begin with a message saying “Error: …” or “Warning: …,” to distinguish error messages from informational ones.
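To make the distinction concrete, here is a small base R sketch. It assumes that installation progress is reported through R's message mechanism, which RStudio typically shows in red just like warnings and errors. The helper function classify() and the example texts are hypothetical and only for illustration.

```r
# Hypothetical helper: report which kind of condition an expression signals.
classify <- function(expr) {
  tryCatch(
    {
      expr
      "no condition"
    },
    message = function(m) "message (informational)",
    warning = function(w) "warning",
    error = function(e) "error"
  )
}

classify(message("Installing package..."))  # informational only
classify(warning("Something looks odd"))    # a warning
classify(stop("Something failed"))          # a real error
classify(1 + 1)                             # nothing signalled
```

All of the first three would show up as coloured text in the console, but only the last two of them indicate a problem.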
Q: Why did I receive a bunch of error messages when I tried to install packages?
- A: If you are using Windows, you may need to install Rtools.
Q: Help! My code isn’t working, even though I’ve followed all the instructions.
- A: This is the most common problem you will face, so we’ve dedicated Section 7.13 to troubleshooting solutions in the Management of R Projects section.