# 11 Data visualization

When in RStudio, quickly jump to this page by using r3::open_data_visualization().

Session objectives:

1. Learn and apply the basics of creating publication-quality graphs.
2. Learn about the importance of considering the colours you use in your graphs and apply tools that are colour-blind friendly.
3. Learn about and avoid using certain commonly used, but inappropriate graphs for presenting results.
4. Create useful graphs such as boxplots, scatterplots, line graphs, jitter plots, and (appropriate) barplots.

## 11.1 Basic principles for creating graphs

Please take ~10 min to read through this section, as well as the next one.

Making graphs in R is surprisingly easy and can be done with very little code. Because graphs are so quick to make, you have more time to think about why you are making them, whether the graph is the most appropriate one for the data or results, and how you can design your graphs to be as accessible and understandable as possible.

To start, here are some tips for making a graph:

• Whenever possible or reasonable, show raw data values rather than summaries (e.g. means).
• Though commonly used in scientific papers, avoid barplots with means and error bars as they greatly misrepresent the data (we’ll cover why later).
• Use colour to 1) highlight and enhance your message and 2) to make the plot visually appealing.
• Use a colour-blind friendly palette so the plot is more accessible to others (more on this later).

There are also excellent online books on this that are included in the resources chapter.

## 11.2 Basic structure of using ggplot2

ggplot2 is an implementation of the “Grammar of Graphics” (gg). This is a powerful approach to creating plots because it provides a set of structured rules (a “grammar”) that allow you to expressively describe components (or “layers”) of a graph. Because you are able to describe the components, it makes it easier to then implement those “descriptions” into creating a graph. There are at least four aspects to using ggplot2 that relate to its “grammar”:

• Aesthetics, aes(): How data are mapped to the plot, for instance, what data to put on the x axis, on the y axis, and/or whether to use a colour for a variable.
• Geometries, geom_ functions: The visual representation of the data, as a layer. This tells ggplot2 how the aesthetics should be visualized. For instance, should they be shown as points, lines, boxes, bins, or bars?
• Scales, scale_ functions: These control the visual properties of the geom_ layers. For instance, to modify the appearance of the axes, to change the colour of dots from red to blue, or to use a different colour palette entirely.
• Themes, theme_ functions or theme(): Directly controls all other aspects of the plot. For instance, control the size, font, and angle of axis text. Or maybe change the thickness or colour of the axis lines.
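
As a rough sketch of how these four parts fit together in one plot, the example below uses mpg, a small dataset bundled with ggplot2, rather than the NHANES data used in this session. The dataset and the choice of variables are assumptions made only for illustration.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for your own tidy data here.
four_parts_plot <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + # aesthetics
  geom_point() +             # geometry: show each row as a point
  scale_colour_viridis_d() + # scale: discrete viridis palette for drv
  theme_minimal()            # theme: one of the built-in themes

four_parts_plot
```

Each part is added with a +, so any one of them can be swapped or removed without touching the others.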

ggplot2 has a massive number of features. Thankfully, it was specifically designed to make it easy to find and use its functions and settings with tab auto-completion. As an example, if you type out geom_ and then hit tab, you will get a list of all the geoms available. The same works with scale_ or with all the options inside theme() (e.g. type out theme(axis. and then hit tab, and a list of theme settings related to the axis will pop up). ggplot2 also works best with tidy data.

So, why do we teach ggplot2 and not base R plotting? Base R plotting functionality is quite good and you can make really nice, publication-quality graphs. However, there are several major limitations to base R plots from a beginner and a user-interface perspective:

• Function and argument naming and documentation are inconsistent and opaque (e.g. the cex argument, in certain but not all functions, magnifies text and symbols, but you can’t tell from the name that it does that).
• User-friendly documentation that is accessible to a broad range of people is not much of a priority, so often the help documentation isn’t written with beginners in mind.
• Graphs are built similar to painting on a canvas: Make a mistake and you need to start all over (e.g. restart R).

These limitations are due to the fact that base R plotting was developed:

• By different people over different periods of time.
• By people who were/are mostly in statistics and maths.
• By people who (generally) don’t have training in principles of software user-design, user-interface, or engineering.
• Without a strong “design philosophy” to guide development.
• During a time when auto-completion didn’t really exist or was sub-optimal, so short function and object names were more important than they are today.

On the other hand, ggplot2:

• Has excellent documentation for help and learning.
• Has a strong design philosophy that makes it easier to use.
• Works in “layers”, so you don’t have to start over if you make a mistake.
• Works very well with auto-completion.
• Function and argument naming is consistent and descriptive (in plain English).

These are the reasons we teach and use ggplot2. Let’s make our first graphs together now.

## 11.3 Graph individual variables

Very often you want to get a sense of your data, one variable (i.e. one column in a data frame) at a time. You create these plots to see the distribution of a variable and visually inspect the data for any problems. There are several ways of plotting continuous variables (e.g. weight, height) in ggplot2. For discrete variables (e.g. “male” and “female”), there is really only one way.

You may notice that since the data wrangling chapter, we used the word “column” to describe the columns in the data frame, but now we’re saying “variable”. There’s a reason for this: ggplot2 really only works with tidy data. And if we recall from the definition of tidy data, it is made up of “variables” (columns) and “observations” (rows) of a data frame. To us, a “variable” is something that we are interested in analyzing or visualizing, and that contains only values relevant to that measurement (e.g. a Weight variable must only contain values for weight). The NHANES dataset is already pretty tidy: Rows are participants at the survey year, columns are the variables that were measured. So, from now on, we call them “variables”.

Ok, let’s visually explore our data. Open the LearningR R Project if it isn’t open already, and create a new R Markdown file called visualization-session.Rmd. To do this, go to File -> New File -> R Markdown, and a dialog box will then pop up. Type in “Data visualization” in the title section and your name in the author section. Choose HTML as the output format. When the file is created, delete all the text and code chunks, keeping the YAML header, and then save this file as visualization-session.Rmd in the doc/ folder. We will use this file for the code-along and exercises in this session.

First, we want to load the packages and dataset, so write at the top of the file, right under the YAML header:

# Load packages and dataset

Then add a new code chunk with the shortcut Ctrl-Alt-I or by using the menu item Code -> Insert Chunk. Name the new chunk label as setup and then add this to the first code chunk:

# Load packages
# (assumed here: the tidyverse, which provides ggplot2 and dplyr used below)
library(tidyverse)

# Load the small, tidied dataset from the wrangling session
load(here::here("data/nhanes_small.rda"))

Now we are ready to start creating the first plot! Since BMI is a strong risk factor for diabetes, let’s check out its distribution. To show distributions, there are two good geoms: geom_density() and geom_histogram().

Write out a new header called # One variable plots and then add a new code chunk below it. Next let’s create a density distribution plot:

# Create density plot for BMI
ggplot(nhanes_small, aes(x = bmi)) +
  geom_density()

In this session, we’ll create a new code chunk for each plot we make to maintain a nice, readable code, and to practice writing headers and inserting code chunks. So, create a new code chunk and type this code:

# Create histogram for BMI
ggplot(nhanes_small, aes(x = bmi)) +
  geom_histogram()

It’s good practice to always start a new line after the +. We can see that, for the most part, the distribution of BMI looks reasonable, though there are several values that are quite large… some at 80 BMI units!
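
When geom_histogram() is run without arguments, ggplot2 prints a message saying it used 30 bins by default and suggests picking a better value. Below is a minimal sketch of setting the bin width yourself. It uses the faithful dataset that ships with R as a stand-in (an assumption for illustration; with nhanes_small you would map bmi instead), and the bin width of 5 is an arbitrary choice.

```r
library(ggplot2)

# faithful ships with R: waiting time (in minutes) between geyser eruptions
waiting_histogram <- ggplot(faithful, aes(x = waiting)) +
  # Each bar covers 5 minutes; na.rm = TRUE drops missing values silently
  geom_histogram(binwidth = 5, na.rm = TRUE)

waiting_histogram
```

Trying a few different bin widths is worthwhile, since the apparent shape of a distribution can change with the bin size.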

The plots above are for continuous variables, but what about discrete variables? Sadly, there’s really only one geom: geom_bar(). Note that this isn’t a geom for the “barplot of means” often seen in papers! Instead, it shows the counts of each value of a discrete variable. There are many discrete variables in NHANES, including sex and diabetes, so let’s visualize those. Again, create a new code chunk, then type:

# Create count barplot for sex
ggplot(nhanes_small, aes(x = sex)) +
  geom_bar()

We can see that there are almost equal numbers of females and males. Now we’ll do the same for diabetes status, so in a new code chunk type:

# Create count barplot for diabetes status
ggplot(nhanes_small, aes(x = diabetes)) +
  geom_bar()

For diabetes, it seems there is some missingness in the dataset. Since diabetes status is an important variable for us, let’s remove all missing values right now, save the tidied dataset in the data/ folder, and plot it again (in a new code chunk).

# Remove individuals with missing diabetes status
nhanes_tidied <- nhanes_small %>%
  filter(!is.na(diabetes))

# Save the tidied dataset as an rda file in the data folder
usethis::use_data(nhanes_tidied, overwrite = TRUE)

# Create a new count barplot for diabetes status
ggplot(nhanes_tidied, aes(x = diabetes)) +
  geom_bar()
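
Before dropping rows, it can help to check how many values are actually missing. Here is a small sketch of that check, using the airquality dataset that ships with R as a stand-in (its Ozone column contains missing values). The dataset and the object names are assumptions for illustration only.

```r
library(dplyr)

# Count how many rows have a missing Ozone value
n_missing <- airquality %>%
  filter(is.na(Ozone)) %>%
  nrow()

# Keep only the rows where Ozone was recorded
airquality_complete <- airquality %>%
  filter(!is.na(Ozone))

# The number of rows removed should match the number of missing values
nrow(airquality) - nrow(airquality_complete) == n_missing
```

The same pattern would apply to nhanes_small by swapping in diabetes for Ozone.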

Let’s take a minute to talk about the commonly used barplots with mean and error bars. In all cases, bar plots should only be used for discrete (categorical) data where you want to show counts or proportions. They should as a general rule not be used for continuous data. This is because the commonly used “bar plot of means with error bars” actually hides the underlying distribution of the data. To have a better explanation of this, you can read the article on why to avoid barplots after the course. The image below, taken from that paper, shows briefly why this plot type is not useful.

If you do want to create a barplot, you’ll quickly find out that it actually is quite hard to do in ggplot2. The reason it is difficult to create in ggplot2 is by design: it’s a bad plot to use, so use something else.
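
To make the problem with “means only” plots concrete, here is a small sketch using simulated values (made up for illustration, not taken from NHANES or the article). Both groups have nearly the same mean, so bars of means would look almost identical, yet the raw values are distributed very differently.

```r
library(ggplot2)

set.seed(123)
# Simulated values: both groups are centred on about 10, but one has a
# single peak and the other has two separate peaks.
simulated <- data.frame(
  group = rep(c("one peak", "two peaks"), each = 100),
  value = c(
    rnorm(100, mean = 10, sd = 1),
    rnorm(50, mean = 7, sd = 0.5),
    rnorm(50, mean = 13, sd = 0.5)
  )
)

# The group means are close to each other
group_means <- tapply(simulated$value, simulated$group, mean)
group_means

# Plotting the raw values shows how different the groups really are
raw_values_plot <- ggplot(simulated, aes(x = group, y = value)) +
  geom_jitter(width = 0.2)

raw_values_plot
```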

Before we move on, let’s add and commit the new files we created into the Git history and push up to GitHub.

## 11.4 Plotting two variables

There are many more types of “geoms” to use when plotting two variables. Which one to choose depends on what you are trying to show or to communicate, and what the data are. Usually the variable that you “control or influence” (the independent variable) in an experimental setting goes on the x-axis, and the variable that “responds” (the dependent variable) goes on the y-axis.

### 11.4.1 Two continuous variables

When you have two continuous variables, some geoms are:

• geom_point(), which is used to create a standard scatterplot.
• geom_hex(), which is used to replace geom_point() when your data are massive, since creating points for each value in a large dataset can take a long time to plot (it requires the hexbin package to be installed).
• geom_smooth(), which applies a “regression-type” line to the data (by default, it uses LOESS regression for fewer than 1,000 observations and a GAM for larger datasets).

Let’s check out how BMI may influence cholesterol using a basic scatterplot, hex plot, and a smoothing line plot in a new code chunk. But first, create a new Markdown header called # Plotting two variables and create the code chunk below that.

# Using 2 continuous variables
bmi_chol <- ggplot(nhanes_tidied, aes(x = bmi, y = tot_chol))
# Standard scatter plot
bmi_chol +
  geom_point()

With 10,000 data points, the scatter plot is a little crowded.

# Standard scatter plot, but with hexagons
bmi_chol +
  geom_hex()

Notice how the hex plot changes the colour of the data based on how many values are in the area of the plot.

# Runs a smoothing line with confidence interval
bmi_chol +
  geom_smooth()

This makes a nice smoothing line through the data and gives us an idea of general trends or relationships between the two variables. You can also combine geoms by adding another one with a +.

# Or combine two geoms, hex plot with smoothing line
bmi_chol +
  geom_hex() +
  geom_smooth()
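
geom_smooth() picks its smoothing method automatically unless you tell it otherwise. As a hedged sketch, the method argument can request a straight-line fit instead, layered over semi-transparent points. The mpg dataset from ggplot2 is used here only as a stand-in for nhanes_tidied, and the alpha value is an arbitrary choice.

```r
library(ggplot2)

displ_hwy <- ggplot(mpg, aes(x = displ, y = hwy))

# Points plus a linear regression line with its confidence band
points_with_line <- displ_hwy +
  geom_point(alpha = 0.5) +  # semi-transparent points reduce overplotting
  geom_smooth(method = "lm") # "lm" fits a straight line

points_with_line
```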

### 11.4.2 Two discrete variables

Sadly, for two discrete variables, there are not many options available without major data wrangling. The most useful geom for this is geom_bar() like before, but with an added variable. Because geom_bar() has a “fill” (coloured inside), we can change that fill based on a variable. So let’s see what the difference in diabetes status is between sexes.

# 2 categorical/discrete
# (We can pipe data into ggplot)
two_discrete <- nhanes_tidied %>%
  ggplot(aes(x = diabetes, fill = sex))

# Stacked
two_discrete +
  geom_bar()

By default, geom_bar() stacks the fill groups on top of each other. In this case, that isn’t really useful, so let’s instead place them side by side. For that, we need to use the position argument with a function called position_dodge(). This function takes the groups defined by the fill variable and “dodges” them (moves them) so they sit side by side.

# "dodged" (side-by-side) bar plot
two_discrete +
  geom_bar(position = position_dodge())

Now you can see that slightly more men than women have diabetes.
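
Besides stacking and dodging, geom_bar() can also show proportions instead of counts through position_fill(), which stretches every bar to the same height. A minimal sketch, again assuming the mpg dataset from ggplot2 as a stand-in for nhanes_tidied:

```r
library(ggplot2)

class_by_drive <- ggplot(mpg, aes(x = class, fill = drv))

# Each bar is scaled to a height of 1, so the fill colours show the share
# of each drive type within a vehicle class
proportion_barplot <- class_by_drive +
  geom_bar(position = position_fill()) +
  labs(y = "Proportion")

proportion_barplot
```

This version makes it easier to compare groups of very different sizes, at the cost of hiding how many observations each bar contains.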

### 11.4.3 Discrete and continuous variables

When the variable types are mixed (continuous and discrete), there are many more geoms available to use. A couple of good ones are:

• geom_boxplot(), which makes boxplots that show the median and a measure of range in the data. Boxplots are generally pretty good at showing the spread of data.
• geom_jitter(), which makes a type of “scatter” plot, but for discrete and continuous variables. A useful argument to geom_jitter() is called width, which controls how wide the jittered points go out from the center line. This plot is much better than the boxplot since it shows the actual data, and not summaries like a boxplot does. When you have lots of data points however, it isn’t very good.
• geom_violin(), which shows a density distribution like geom_density(). This geom is great when there is a lot of data and geom_jitter() is just a mass of dots.

Let’s see how BMI differs between those with or without diabetes.

# Using mixed data
two_mixed <- nhanes_tidied %>%
  ggplot(aes(x = diabetes, y = bmi))

# Standard boxplot with outliers
two_mixed +
  geom_boxplot()

However, the box plot is still hiding your actual data points. These can be shown with a jitter plot:

# Show the actual data using a jitter plot
two_mixed +
  geom_jitter()

Or a violin plot:

# Show the distribution with a violin plot
two_mixed +
  geom_violin()
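
Because geoms are layers, the violin and jitter plots can also be combined so that both the distribution and the raw values are visible. This is a sketch under the assumption that the iris dataset (shipped with R) stands in for nhanes_tidied; the width and alpha values are arbitrary choices.

```r
library(ggplot2)

violin_with_points <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_violin() +
  # width limits how far the points spread sideways;
  # alpha makes them partly transparent
  geom_jitter(width = 0.1, alpha = 0.4)

violin_with_points
```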

The violin plot kind of looks like two stingrays, eh? Before proceeding with the exercise, take a moment to save your changes, add and commit them to the Git history, and then push to GitHub.

## 11.5 Exercise: Create plots with one or two variables

Time: 15 min

Create a new header in the R Markdown file called # Exercise to make plots with one or two variables. Then create a code chunk below that. Copy and paste the code template below into that code chunk, and complete as many of the following tasks as you can.

1. Using geom_histogram(), find out what the distribution is for the two variables below.
    1. age (participant’s age at collection).
    2. diabetes_age (age of diabetes diagnosis).
2. Using geom_bar(), find out how many people have data recorded for each of these discrete variables. What can you say about most people for these variables?
    1. smoke_now (current smoking status).
    2. phys_active (does moderate to vigorous physical activity).
3. Using geom_hex(), find out how BMI relates to the two blood pressure variables. Do you notice anything about the data from the plots?
    1. bp_sys_ave (average systolic blood pressure).
    2. bp_dia_ave (average diastolic blood pressure).
4. Using geom_bar(), find out how physically active (phys_active) those with or without diabetes are. Put diabetes on the x-axis. What can you say based on the data? Note the differences in missingness between groups. Don’t forget to use position_dodge() in the position argument, in order to arrange the bars side by side.
5. Using geom_violin(), find out how poverty levels differ between those with or without diabetes. Put diabetes on the x-axis. Looking at the distributions, what can you conclude about how poverty may be associated with diabetes status?
    • The poverty variable is calculated as a ratio between income and a poverty threshold. Smaller numbers mean higher poverty.
6. Once you are done, save, add, and commit the changes to the files into the Git history.

# 1a. Distribution of age
ggplot(___, aes(x = ___)) +
  ___()

# 1b. Distribution of age of diabetes diagnosis
ggplot(___, aes(x = ___)) +
  ___()

# 2a. Number of people who smoke now
ggplot(___, aes(x = ___)) +
  ___()

# 2b. Number of people who are physically active
ggplot(___, aes(x = ___)) +
  ___()

# 3a. BMI in relation to systolic blood pressure
ggplot(___, aes(x = ___, y = ___)) +
  ___()

# 3b. BMI in relation to diastolic blood pressure
ggplot(___, aes(x = ___, y = ___)) +
  ___()

# 4. Physically active people with or without diabetes
ggplot(___, aes(x = ___, fill = ___)) +
  ___(___ = ___())

# 5. Poverty levels between those with or without diabetes
ggplot(___, aes(x = ___, y = ___)) +
  ___()

A possible solution:

# 1a. Distribution of age
ggplot(nhanes_tidied, aes(x = age)) +
  geom_histogram()

# 1b. Distribution of age of diabetes diagnosis
ggplot(nhanes_tidied, aes(x = diabetes_age)) +
  geom_histogram()

# 2a. Number of people who smoke now
ggplot(nhanes_tidied, aes(x = smoke_now)) +
  geom_bar()

# 2b. Number of people who are physically active
ggplot(nhanes_tidied, aes(x = phys_active)) +
  geom_bar()

# 3a. BMI in relation to systolic blood pressure
ggplot(nhanes_tidied, aes(x = bmi, y = bp_sys_ave)) +
  geom_hex()

# 3b. BMI in relation to diastolic blood pressure
ggplot(nhanes_tidied, aes(x = bmi, y = bp_dia_ave)) +
  geom_hex()

# 4. Physically active people with or without diabetes
ggplot(nhanes_tidied, aes(x = diabetes, fill = phys_active)) +
  geom_bar(position = position_dodge())

# 5. Poverty levels between those with or without diabetes
ggplot(nhanes_tidied, aes(x = diabetes, y = poverty)) +
  geom_violin()

## 11.6 Visualizing three or more variables

There are many ways to visualize additional variables in a plot and further explore your data. We can use ggplot2’s colour, shape, size, transparency (“alpha”), and fill aesthetics, as well as “facets”. Faceting in ggplot2 is a way of splitting the plot up into multiple plots when the underlying aesthetics are the same or similar. In this section, we’ll be covering many of these capabilities in ggplot2.

The most common and “prettiest” way of adding a third variable is by using colour. Let’s try to answer a few questions, to visualize some examples. First, create a new header called # Plotting three or more variables and create a new code chunk below it.

Question: Is systolic blood pressure different in those with or without diabetes in females and males? In this case, we have one continuous variable (bp_sys_ave) and two discrete (sex and diabetes). For this plot, we could use geom_boxplot().

# Plot systolic blood pressure in relation to sex and diabetes status
nhanes_tidied %>%
  ggplot(aes(x = sex, y = bp_sys_ave, colour = diabetes)) +
  geom_boxplot()

Do you see differences in systolic blood pressure between the sexes? Between diabetics and non-diabetics?

Question: How does BMI relate to systolic blood pressure and age? Here we have three continuous variables (bmi, bp_sys_ave, and age), so we could use geom_point().

# Plot BMI in relation to systolic blood pressure and age
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave, colour = age)) +
  geom_point()

Can you see any associations between systolic blood pressure and BMI or age?

Question: How does BMI relate to systolic blood pressure and what is different between those with and without diabetes? In this case, we have two continuous (bmi and bp_sys_ave) and one discrete variable (diabetes). We could use geom_point():

# Plot BMI in relation to systolic blood pressure and diabetes status
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave, colour = diabetes)) +
  geom_point()

For this plot it’s really hard to see what’s different. But there is another way of visualizing a third (or fourth, and fifth!) variable: “faceting”. Faceting splits the plot up into multiple subplots using the function facet_grid(). For it to work, at least one of the first two arguments to facet_grid() is needed. These two arguments are:

• cols: The discrete variable to use to facet the plot column-wise (i.e. side-by-side)
• rows: The discrete variable to use to facet the plot row-wise (i.e. stacked on top of each other)

For both cols and rows, the variable given must be wrapped by vars() (e.g. vars(diabetes)). Let’s try it with the previous example (instead of using colour).

# Plot BMI in relation to systolic blood pressure and diabetes status using
# faceting by column
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave)) +
  geom_point() +
  facet_grid(cols = vars(diabetes))

Try faceting with plots stacked by diabetes status, using the argument rows = vars(diabetes) instead. Which do you find more informative?

# Plot BMI in relation to systolic blood pressure and diabetes status using
# faceting by row
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave)) +
  geom_point() +
  facet_grid(rows = vars(diabetes))

We can also facet by sex and use age as a colour:

# Plot BMI in relation to systolic blood pressure, age, sex and diabetes status
# using faceting
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave, colour = age)) +
  geom_point() +
  facet_grid(rows = vars(diabetes),
             cols = vars(sex))

Before moving on, let’s save the file, add and commit the new changes to the Git history, and push to GitHub.

## 11.7 Colours: Make your graphs more accessible

Please take ~5 min to read through this section and then do the exercise.

Colour blindness is common in the population, with red-green colour blindness in particular affecting about 8% of men and 0.5% of women. So to make your graph more accessible to people with colour blindness, you need to consider the colours you use. For more detail on how colours look to those with colour-blindness, check out this documentation from the viridis package. The viridis colour scheme (also developed as an R package) was specifically designed to represent data to all colour visions (including as a grayscale, e.g. from black to white). There is a really good, informative talk on YouTube on this topic.

When using colours, think about what you are trying to convey in your figure and how your choice of colours will be interpreted. You can use built-in colour schemes or create your own. For now, let’s stick to using built-in ones. There are two: the viridis and the ColorBrewer colour schemes. Both are well designed and are colour-blind friendly. For this course, we will only cover the viridis package.

## 11.8 Exercise: Changing the colour schemes

Time: 10 min

Practice changing colour schemes on a bar plot. Start with a base plot object to work from that has two discrete variables. Create a new Markdown header called # Exercise for changing colours and create a new code chunk below it. Copy and paste the code below into the new code chunk.

# Barplot to work from, with two discrete variables
base_barplot <- nhanes_tidied %>%
  ggplot(aes(x = diabetes, fill = sex)) +
  geom_bar(position = position_dodge())

Use the scale_fill_ set of functions to add the colour scheme. If you need help, use the help() or ? functions in RStudio to look over the documentation for more information or to see the other scale_ functions. Use tab auto-completion to find the correct function.

1. Change the colour to the viridis scheme with the scale_fill_viridis_d() function and use it on the base_barplot graph so that the plot is colour-blind friendly. Because the variables are discrete, you need to add _d to the end of the viridis scheme function.

2. Viridis has several palettes. Add the argument option = "A" to the scale_fill_viridis_d() function. Run the code again and see how the colours change. Next, change "A" to "E".

3. Now, let’s practice using the colour schemes on a plot with continuous variables. Copy and paste the code below into a new code chunk. Since we are using colour instead of fill, the scale_ will be scale_colour_viridis_c(). The _c at the end indicates the variable will be continuous.

# scatterplot to work from, with three continuous variables
base_scatterplot <- nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave, colour = age)) +
  geom_point()

4. Like in point 2 above, use the option argument and set it to "B" to see how the colours change.

5. Lastly, add and commit the changes to the R Markdown file into the Git history.

A possible solution:

# 1. Change colours to a viridis colour scheme
base_barplot +
  scale_fill_viridis_d()

# 2. Change colours to another viridis colour scheme
base_barplot +
  scale_fill_viridis_d(option = "A")

base_barplot +
  scale_fill_viridis_d(option = "E")

# 3. Change colours to a viridis colour scheme
base_scatterplot +
  scale_color_viridis_c()

# 4. Change colours to another viridis colour scheme
base_scatterplot +
  scale_color_viridis_c(option = "B")

## 11.9 Titles, axis labels, and themes

There are a great many options for modifying a ggplot2 figure. Almost all of them are found in the theme() function. We won’t cover individual theme items, since the help with ?theme and the ggplot2 theme webpage already document theme() really well. Instead, we’ll cover a few of the built-in themes, as well as setting the axis labels and plot title. We’ll build on the previously created base_scatterplot. All built-in themes start with theme_.

# create scatterplot to play with themes
base_scatterplot2 <- base_scatterplot +
  facet_grid(rows = vars(diabetes),
             cols = vars(sex)) +
  scale_color_viridis_c()

# View the plot with the default theme
base_scatterplot2

# Some pre-defined themes
base_scatterplot2 + theme_bw()

base_scatterplot2 + theme_minimal()

base_scatterplot2 + theme_classic()

You can also set the theme for all subsequent plots by using the theme_set() function, with the theme you want inside the parentheses.

# set the theme for all subsequent plots
theme_set(theme_bw())

For adding labels to things, such as axis titles, the function is labs(). To change the y-axis title, use the y argument in labs(). For the x-axis, it is x. For the whole plot, it is title:

# add plot title, and change x and y axis titles
base_scatterplot2 +
  labs(title = "BMI, systolic blood pressure, and age by diabetes and sex.",
       y = "Systolic Blood Pressure (mmHg)",
       x = "BMI (kg/m2)")
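
labs() can also rename a legend when you use the name of the aesthetic as the argument, and individual theme() settings can be layered on top of a built-in theme. Here is a small sketch, assuming the mpg dataset from ggplot2 as a stand-in for the session’s data; the label text is made up for illustration.

```r
library(ggplot2)

labelled_plot <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  scale_colour_viridis_c() +
  labs(
    title = "Engine size and fuel economy",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    colour = "City miles\nper gallon" # legend title for the colour aesthetic
  ) +
  theme_bw() +
  # Rotate the x-axis tick labels; theme() tweaks go after the built-in
  # theme so that they are not overwritten by it
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

labelled_plot
```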

## 11.10 Saving the plot

Finally, to save the plot you created, use the ggsave() function. The first argument says where to save the graph: give the name of the file to create, including the folder location. The next argument is the plot you want to save. To set the dimensions of the figure, use the width and height arguments (in inches by default).

# save the plot
ggsave(here::here("doc/images/scatterplot.pdf"),
       base_scatterplot2, width = 7, height = 5)
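
ggsave() picks the file type from the file extension and generally expects the target folder to exist already. Below is a cautious sketch that creates the folder first and then saves a PNG. It writes to a temporary directory so nothing in your project is changed, and the plot, folder, and file names are placeholders chosen for illustration.

```r
library(ggplot2)

example_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Placeholder output folder inside R's temporary directory
output_dir <- file.path(tempdir(), "images")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# The .png extension tells ggsave() to write a PNG file;
# dpi sets the resolution for raster formats like PNG
output_file <- file.path(output_dir, "scatterplot.png")
ggsave(output_file, example_plot, width = 7, height = 5, dpi = 300)
```

In the course project, the same idea would apply with here::here("doc/images") as the folder.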

Lastly, let’s save the R Markdown file, add and commit the changes, and push to GitHub.

## 11.11 Summary of session

• Use the “Grammar of Graphics” with the ggplot2 package within the tidyverse to plot your data.
• Prioritize plotting raw data instead of summaries whenever possible or where appropriate.
• ggplot2 has 4 levels of grammar: aes() (which data to plot), geom_ (what kind of plot), scale_ (making the plot pretty), and theme() (controls specifics of the plot).
• Only use bar plots for discrete values. When applied to continuous variables, they hide the distribution of the data.
• Use colour together with the x and y axes to plot three dimensions, or use facet_grid() to plot more dimensions.
• Use colourblind-friendly palettes, e.g. by using the colour palettes viridis or ColorBrewer.
• Save plots with ggsave().