11 Data visualization
When in RStudio, quickly jump to this page by using
- Learn and apply the basics of creating publication-quality graphs.
- Learn the importance of considering the colours you use in your graphs and apply tools that are colour-blind friendly.
- Learn why to avoid using commonly used but inappropriate graphs for presenting results.
- Create useful graphs such as boxplots, scatterplots, line graphs, jitter plots, and (appropriate) barplots.
11.1 Basic principles for creating graphs
Please take ~10 min to read through this section, as well as the next one.
Making graphs in R is surprisingly easy and can be done with very little code. Because of the ease with which you can make them, it gives you some time to consider: why you are making them; whether the graph you’ve selected is the most appropriate for your data or results; and how you can design your graphs to be as accessible and understandable as possible.
To start, here are some tips for making a graph:
- Whenever possible or reasonable, show raw data values rather than summaries (e.g. means).
- Though commonly used in scientific papers, avoid barplots with means and error bars as they greatly misrepresent the data (we’ll cover why later).
- Use colour to highlight and enhance your message, and make the plot visually appealing.
- Use a colour-blind friendly palette to make the plot more accessible to others (more on this later too).
There are also excellent online books on this that are included in the resources chapter.
11.2 Basic structure of using ggplot2
ggplot2 is an implementation of the “Grammar of Graphics” (gg). This is a powerful approach to creating plots because it provides a set of structured rules (a “grammar”) that allow you to expressively describe components (or “layers”) of a graph. Since you are able to describe the components, it is easier to then implement those “descriptions” in creating a graph. There are at least four aspects to using ggplot2 that relate to its “grammar”:
- aes(): How data are mapped to the plot, including which data are put on the x and y axes, and/or whether to use a colour for a variable.
- geom_ functions: Visual representation of the data, as a layer. This tells ggplot2 how the aesthetics should be visualized, including whether they should be shown as points, lines, boxes, bins, or bars.
- scale_ functions: Control the visual properties of the geom_ layers. Can be used to modify the appearance of the axes, to change the colour of dots from, e.g., red to blue, or to use a different colour palette entirely.
- theme(): Directly controls all other aspects of the plot, such as the size, font, and angle of axis text, and the thickness or colour of the axis lines.
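To make the four parts concrete, here is a minimal sketch that uses one of each in a single plot. It assumes only that ggplot2 is installed; the built-in mtcars dataset and the object name four_parts are stand-ins chosen for illustration, not part of the course material.

```r
library(ggplot2)

# mtcars ships with R; it is only a stand-in for your own tidy data
four_parts <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) + # aes(): map data
  geom_point() +                               # geom_: draw the data as points
  scale_colour_viridis_d(name = "Cylinders") + # scale_: choose the colour palette
  theme(axis.text = element_text(size = 12))   # theme(): adjust non-data details

four_parts
```

Each part is added with a +, so you can remove or swap one line without rewriting the rest of the plot.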
ggplot2 has a massive number of features. Thankfully, ggplot2 was specifically designed to make it easy to find and use its functions and settings using tab auto-completion. To demonstrate this feature, try typing out geom_ and then hitting tab. You will get a list of all the geoms available. You can use this with scale_ or the options inside theme(). Try typing out theme(axis. and then hitting tab, and a list of theme settings related to the axis will pop up.
ggplot2 also works best with tidy data.
So, why do we teach ggplot2 and not base R plotting? Base R plotting functionality is quite good and you can make really nice publication-quality graphs. However, there are several major limitations to base R plots from a beginner and a user-interface perspective:
- Function and argument names are inconsistent and opaque (e.g. the cex argument can be used to magnify text and symbols, but you can’t immediately tell from the name that it does that).
- User-friendly documentation that is accessible to a broad range of people is not much of a priority, so often the help documentation isn’t written with beginners in mind.
- Graphs are built similar to painting on a canvas; make a mistake and you need to start all over (e.g. restart R).
These limitations are due to the fact that base R plotting was developed:
- By different people over different periods of time.
- By people who were/are mostly from statistics and maths backgrounds.
- By people who (generally) don’t have training in principles of software user-design, user-interface, or engineering.
- Without a strong “design philosophy” to guide development.
- During a time when auto-completion didn’t really exist or was sub-optimal, so short function and object names were more important than they are today.
On the other hand, ggplot2:
- Has excellent documentation for help and learning.
- Has a strong design philosophy that makes it easier to use.
- Works in “layers,” so you don’t have to start over if you make a mistake.
- Works very well with auto-completion.
- Uses function and argument naming that is consistent and descriptive (in plain English).
These are the reasons we teach and use ggplot2.
11.3 Graph individual variables
Let’s make our first graphs together now.
Very often you want to get a sense of your data, one variable (i.e. column in a data frame) at a time. You create plots to see the distribution of a variable and visually inspect the data for any problems. There are several ways of plotting continuous variables (e.g. weight or height) in ggplot2. For discrete variables (e.g. “male” and “female”), there is really only one way.
You may notice that, since the data wrangling chapter,
we have been using the term “column” to describe the columns in the data frame,
but from this point forward, we will instead refer to “variable.”
There’s a reason for this: ggplot2 really only works with tidy data.
If we recall the definition of tidy data,
it consists of “variables” (columns) and “observations” (rows) of a data frame.
To us, a “variable” is something that we are interested in analyzing or visualizing,
and which only contains values relevant to that measurement
(e.g. the Weight variable must only contain values for weight).
The NHANES dataset is already pretty tidy:
rows are participants at the survey year, and
columns are the variables that were measured.
Let’s visually explore our data.
In the LearningR project, create a new R
Markdown file called
visualization-session.Rmd. To do this, go to
File -> New File -> R Markdown. Enter “Data visualization” as
the title and your name as the author. Choose HTML as the output
format. When the file is created, keep
the YAML header but delete all the text and code chunks. Save the file as
visualization-session.Rmd in the
doc/ folder. We will use this file for
the code-along exercises in this session.
We need to load the packages and dataset, so we will add a new code chunk by either using the shortcut
Ctrl-Alt-I or the menu
Code -> Insert Chunk. Name the new chunk label as
setup and then add
this to the first code chunk:
# Load packages
source(here::here("R/package-loading.R"))

# Load the small, tidied dataset from the wrangling session
load(here::here("data/nhanes_small.rda"))
Now, we are ready to start creating the first plot!
Since BMI is a strong risk factor for diabetes, let’s check out the distribution of BMI among the participants.
There are two good geoms for examining distributions: geom_density() and geom_histogram().
In this session, we are going to create a new code chunk for each plot we make to keep
the code readable, and to practice writing headers and inserting code chunks. Write out a new header called
# One variable plots in the free text area after the setup code chunk. Then add a new code
chunk below it. Let’s take a look at a density distribution plot:
# Create density plot for BMI
ggplot(nhanes_small, aes(x = bmi)) +
  geom_density()
Now, let’s take a look at a histogram, by creating a new code chunk again:
# Create histogram for BMI
ggplot(nhanes_small, aes(x = bmi)) +
  geom_histogram()
Note that it is good practice to always create a new line after the +.
Our plot shows that, for the most part, BMI is reasonably well distributed,
although there are several values that are quite large, including some at a BMI of around 80!
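When you run geom_histogram() without arguments, ggplot2 uses 30 bins and prints a message suggesting that you pick a better value. Below is a small sketch of one way to respond to that message. The data frame demo_bmi is simulated purely for illustration (it is not the NHANES data), and a bin width of 1 is an arbitrary choice.

```r
library(ggplot2)

# Simulated stand-in for nhanes_small: one numeric column with a missing value
set.seed(123)
demo_bmi <- data.frame(bmi = c(rnorm(500, mean = 27, sd = 5), NA))

# binwidth = 1 gives one bar per BMI unit; na.rm = TRUE drops missing values
# without the warning ggplot2 would otherwise print
bmi_histogram <- ggplot(demo_bmi, aes(x = bmi)) +
  geom_histogram(binwidth = 1, na.rm = TRUE)

bmi_histogram
```

Try a few different binwidth values: very wide bins can hide features of the distribution, while very narrow bins can make it look noisy.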
The geoms above are appropriate for plotting continuous variables,
but what about plotting discrete variables?
Well, sadly, there’s really only one: geom_bar().
This isn’t a geom for the barplot of means with error bars though; instead, it shows the counts of a discrete variable.
There are many discrete variables in NHANES, including sex and diabetes,
so let’s use this geom to visualize those.
Again, create a new code chunk, then type:
# Create count barplot for sex
ggplot(nhanes_small, aes(x = sex)) +
  geom_bar()
We can see that there are almost equal numbers of females and males. Now, we’ll do the same for the diabetes status variable. In a new code chunk, type:
# Create count barplot for diabetes status
ggplot(nhanes_small, aes(x = diabetes)) +
  geom_bar()
For diabetes, it seems that there is some missingness in the data.
Since diabetes status is an important variable for us, let’s remove all missing
values, save the tidied dataset in the
data/ folder, and plot it again
in a new code chunk.
# Remove individuals with missing diabetes status
nhanes_tidied <- nhanes_small %>%
  filter(!is.na(diabetes))

# Save the tidied dataset as an rda file in the data folder
usethis::use_data(nhanes_tidied, overwrite = TRUE)

# Create a new count barplot for diabetes status
ggplot(nhanes_tidied, aes(x = diabetes)) +
  geom_bar()
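If you want to confirm the missingness with numbers rather than by eye, one option is to count the categories before and after filtering. This is only a sketch: demo_small and demo_tidied are made-up stand-ins for nhanes_small and nhanes_tidied, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Small made-up stand-in for nhanes_small
demo_small <- tibble(
  diabetes = c("Yes", "No", NA, "No", NA, "Yes"),
  bmi = c(31.2, 24.8, 27.5, 22.1, 29.9, 35.0)
)

# How many rows fall in each category, including missing (NA)?
demo_small %>%
  count(diabetes)

# Keep only rows where diabetes status is known
demo_tidied <- demo_small %>%
  filter(!is.na(diabetes))

# The NA row should no longer appear in the counts
demo_tidied %>%
  count(diabetes)
```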
Let’s take a minute to talk about commonly used barplots with mean and error bars. In all cases, barplots should only be used for discrete (categorical) data where you want to show counts or proportions. As a general rule, they should not be used for continuous data. This is because the commonly used “bar plot of means with error bars” actually hides the underlying distribution of the data. To have a better explanation of this, you can read the article on why to avoid barplots after the course. The image below was taken from that paper, and briefly demonstrates why this plot type is not useful.
If you do want to create a barplot of means with error bars, you’ll quickly find out that it is actually quite hard to do in ggplot2. The reason it is difficult to create in ggplot2 is by design: it’s a bad plot to use, so use something else.
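To see how a barplot of means can hide the data, consider this small sketch. The values in demo_groups are invented so that both groups have exactly the same mean; nothing here comes from NHANES or from the article mentioned above.

```r
library(ggplot2)

# Made-up values: both groups have a mean of exactly 10
demo_groups <- data.frame(
  group = rep(c("A", "B"), each = 5),
  value = c(9, 10, 11, 10, 10,
            1, 2, 10, 18, 19)
)

# A barplot of means would draw two bars of identical height
group_means <- tapply(demo_groups$value, demo_groups$group, mean)
group_means

# Plotting the raw values shows how different the groups really are
raw_values_plot <- ggplot(demo_groups, aes(x = group, y = value)) +
  geom_jitter(width = 0.1, height = 0)

raw_values_plot
```

Group A is tightly clustered while group B is spread over the whole range, which two equal bars would never reveal.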
Before we move on, let’s add and commit the new files we created into the Git history and push up to your GitHub repository.
11.4 Plotting two variables
There are many more types of “geoms” to use when plotting two variables. Your choice of which one to use depends on what you are trying to show or communicate, and the nature of the data. Usually, the variable that you “control or influence” (the independent variable) in an experimental setting goes on the x-axis, and the variable that “responds” (the dependent variable) goes on the y-axis.
When you have two continuous variables, some geoms to use are:
- geom_point(), which is used to create a standard scatterplot.
- geom_hex(), which is used to replace geom_point() when your data is massive and creating points for each value takes too long to plot.
- geom_smooth(), which applies a “regression-type” line to the data (by default, LOESS regression for fewer than 1,000 observations and a generalized additive model for larger datasets).
Let’s check out how BMI may influence cholesterol using a basic scatterplot, hex plot, and a smoothing line plot in a new code chunk.
First, enter a new Markdown header called
# Plotting two variables and create the
code chunk below that.
# Using 2 continuous variables
bmi_chol <- ggplot(nhanes_tidied, aes(x = bmi, y = tot_chol))

# Standard scatter plot
bmi_chol +
  geom_point()
You can see that, with 10,000 data points, the scatterplot is a little crowded.
# Standard scatter plot, but with hexagons
bmi_chol +
  geom_hex()
Notice how the hex plot changes the colour of the data based on how many values are in the area of the plot.
# Runs a smoothing line with confidence interval
bmi_chol +
  geom_smooth()
This makes a nice smoothing line through the data and gives us an idea of
general trends or relationships between the two variables. You can also combine
geoms by adding another one with a +:
# Or combine two geoms, hex plot with smoothing line
bmi_chol +
  geom_hex() +
  geom_smooth()
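Another way to cope with a crowded scatterplot, sketched below, is to keep geom_point() but make the points semi-transparent with the alpha argument. The example also sets method = "lm" to show that the smoother can be chosen explicitly. The data frame demo_chol is simulated as a stand-in for nhanes_tidied, so the pattern it shows is not a real result.

```r
library(ggplot2)

# Simulated stand-in for nhanes_tidied with the same column names
set.seed(42)
demo_chol <- data.frame(bmi = rnorm(2000, mean = 27, sd = 5))
demo_chol$tot_chol <- 4 + 0.03 * demo_chol$bmi + rnorm(2000, sd = 1)

# alpha makes points semi-transparent so dense regions appear darker;
# method = "lm" requests a straight-line fit instead of the default smoother
transparent_points <- ggplot(demo_chol, aes(x = bmi, y = tot_chol)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm")

transparent_points
```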
11.4.1 Two discrete variables
Sadly, there are not many options available for plotting two discrete variables, without major data wrangling.
The most useful geom for this type of plot is geom_bar(),
but with an added variable.
We can use the “fill” aesthetic with geom_bar() to give each level of a variable its own colour.
Let’s use this to see the difference in diabetes status between the sexes.
# Two categorical/discrete variables
# Note that we can pipe data into ggplot
two_discrete <- nhanes_tidied %>%
  ggplot(aes(x = diabetes, fill = sex))

# Stacked
two_discrete +
  geom_bar()
By default, geom_bar() will make “fill” groups stacked on top of each other.
In this case, it isn’t really that useful, so let’s change them to be sitting side by side.
For that, we need to use the
position argument with a function called position_dodge().
This new function takes the groups of the “fill” variable and “dodges” them
(moves them) to be side by side.
# "dodged" (side-by-side) bar plot
two_discrete +
  geom_bar(position = position_dodge())
Now you can see that there are slightly more men who have diabetes than women.
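Because the groups with and without diabetes differ a lot in size, counts can be hard to compare. One possible alternative, sketched here, is position_fill(), which shows proportions within each bar instead of counts. The data frame demo_discrete is a tiny invented stand-in for nhanes_tidied.

```r
library(ggplot2)

# Made-up stand-in for nhanes_tidied
demo_discrete <- data.frame(
  diabetes = c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes"),
  sex = c("female", "male", "female", "male", "female",
          "female", "male", "male", "female")
)

# position_fill() rescales each bar to a height of 1, so the fill colours
# show proportions within each diabetes group rather than counts
proportion_barplot <- ggplot(demo_discrete, aes(x = diabetes, fill = sex)) +
  geom_bar(position = position_fill())

proportion_barplot
```

Keep in mind that the y-axis will still be labelled "count" unless you change the axis title, even though it now shows proportions.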
11.4.2 Discrete and continuous variables
When the variable types are mixed (continuous and discrete), there are many more geoms available to use. A couple of good ones are:
- geom_boxplot(), which makes boxplots that show the median and a measure of range in the data. Boxplots are generally pretty good at showing the spread of data.
- geom_jitter(), which makes a type of scatterplot, but for discrete and continuous variables. A useful argument to geom_jitter() is width, which controls how wide the jittered points span from the center line. This plot is much better than the boxplot since it shows the actual data, and not summaries like a boxplot does. However, it is not very good when you have lots of data points.
- geom_violin(), which shows a density distribution like geom_density(). This geom is great when there is a lot of data and geom_jitter() may otherwise appear as a mass of dots.
Let’s take a look at these geoms, by plotting how BMI differs between those with or without diabetes.
# Using mixed data
two_mixed <- nhanes_tidied %>%
  ggplot(aes(x = diabetes, y = bmi))

# Standard boxplot with outliers
two_mixed +
  geom_boxplot()
However, the boxplot is still hiding your actual data points. Instead, data points can be shown with a jitter plot:
# Show the actual data using a jitter plot
two_mixed +
  geom_jitter()
Or a violin plot:
# Show the distribution with a violin plot
two_mixed +
  geom_violin()
The violin plot kind of looks like two stingrays, eh? Before proceeding with the following exercise, take a moment to add and commit changes to the Git history, and then push to GitHub.
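Since geoms can be combined with +, one option worth trying is a violin plot with the jittered raw data drawn on top, using the width argument mentioned earlier. The sketch below uses simulated data (demo_mixed) as a stand-in for nhanes_tidied; the width and alpha values are arbitrary starting points.

```r
library(ggplot2)

# Simulated stand-in for nhanes_tidied
set.seed(7)
demo_mixed <- data.frame(
  diabetes = rep(c("No", "Yes"), times = c(300, 100)),
  bmi = c(rnorm(300, mean = 26, sd = 5), rnorm(100, mean = 31, sd = 6))
)

# Violin for the overall shape, with narrow, semi-transparent points on top
violin_with_points <- ggplot(demo_mixed, aes(x = diabetes, y = bmi)) +
  geom_violin() +
  geom_jitter(width = 0.1, alpha = 0.3)

violin_with_points
```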
11.5 Exercise: Creating plots with one or two variables
Time: 15 min
Create a new header in the R Markdown file called
# Exercise to make plots with one or two variables, followed by a new code chunk.
Copy and paste the below code into that code chunk.
# 1a. Distribution of age
ggplot(___, aes(x = ___)) +
  ___()

# 1b. Distribution of age of diabetes diagnosis
ggplot(___, aes(x = ___)) +
  ___()

# 2a. Number of people who smoke now
ggplot(___, aes(x = ___)) +
  ___()

# 2b. Number of people who are physically active
ggplot(___, aes(x = ___)) +
  ___()

# 3a. BMI in relation to systolic blood pressure
ggplot(___, aes(x = ___, y = ___)) +
  ___()

# 3b. BMI in relation to diastolic blood pressure
ggplot(___, aes(x = ___, y = ___)) +
  ___()

# 4. Physically active people with or without diabetes
ggplot(___, aes(x = ___, fill = ___)) +
  ___(___ = ___())

# 5. Poverty levels between those with or without diabetes
ggplot(___, aes(x = ___, y = ___)) +
  ___()
Complete as many tasks as you can below.
1. With geom_histogram(), find out what the distribution is for the two variables below.
   - age (participant’s age at collection).
   - diabetes_age (age of diabetes diagnosis).
2. With geom_bar(), find out how many people have data recorded for each of the two discrete variables below. What can you say about most people for these variables?
   - smoke_now (current smoking status).
   - phys_active (does moderate to vigorous physical activity).
3. With geom_hex(), find out how BMI relates to the two blood pressure variables below. Do you notice anything about the data from the plots?
   - bp_sys_ave (average systolic blood pressure).
   - bp_dia_ave (average diastolic blood pressure).
4. With geom_bar(), find out how phys_active differs between those with or without diabetes, with diabetes on the x-axis. What can you say based on the data? Note the differences in missingness between groups. Don’t forget to use the position argument, in order to arrange the bars side by side.
5. With geom_violin(), find how poverty levels are different for those with or without diabetes, with diabetes on the x-axis. Looking at the distributions, what can you conclude about how poverty may be associated with diabetes status?
   - The poverty variable is calculated as a ratio between income and a poverty threshold. Smaller numbers mean higher poverty.
6. Save, add, and commit the changes to the Git history.
Click for the (possible) solution.
# 1a. Distribution of age
ggplot(nhanes_tidied, aes(x = age)) +
  geom_histogram()

# 1b. Distribution of age at diabetes diagnosis
ggplot(nhanes_tidied, aes(x = diabetes_age)) +
  geom_histogram()

# 2a. Number of people who smoke now
ggplot(nhanes_tidied, aes(x = smoke_now)) +
  geom_bar()

# 2b. Number of people who are physically active
ggplot(nhanes_tidied, aes(x = phys_active)) +
  geom_bar()

# 3a. BMI in relation to systolic blood pressure
ggplot(nhanes_tidied, aes(x = bmi, y = bp_sys_ave)) +
  geom_hex()

# 3b. BMI in relation to diastolic blood pressure
ggplot(nhanes_tidied, aes(x = bmi, y = bp_dia_ave)) +
  geom_hex()

# 4. Physically active people with or without diabetes
ggplot(nhanes_tidied, aes(x = diabetes, fill = phys_active)) +
  geom_bar(position = position_dodge())

# 5. Poverty levels between those with or without diabetes
ggplot(nhanes_tidied, aes(x = diabetes, y = poverty)) +
  geom_violin()
11.6 Visualizing three or more variables
There are many ways to visualize additional variables in a plot and further explore your data. For that, we can use ggplot2’s colour, shape, size, transparency (“alpha”), and fill aesthetics, as well as “facets.” Faceting in ggplot2 is a way of splitting the plot up into multiple plots when the underlying aesthetics are the same or similar. In this section, we’ll be covering many of these capabilities in ggplot2.
The most common and “prettiest” way of adding a third variable is by using colour.
Let’s try to answer a few of the questions below, to visualize some of these features.
First, create a new header called
# Plotting three or more variables
and a code chunk below it.
Question: Is systolic blood pressure different in those with
or without diabetes in females and males?
In this case, we have one continuous variable (bp_sys_ave)
and two discrete variables (sex and diabetes).
To plot this, we could use geom_boxplot():
# Plot systolic blood pressure in relation to sex and diabetes status
nhanes_tidied %>%
  ggplot(aes(x = sex, y = bp_sys_ave, colour = diabetes)) +
  geom_boxplot()
Do you see differences in systolic blood pressure between the sexes? Between diabetics and non-diabetics?
Question: How does BMI relate to systolic blood pressure and age?
Here, we have three continuous variables (bmi, bp_sys_ave, and age),
so we could use geom_point():
# Plot BMI in relation to systolic blood pressure and age
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave, colour = age)) +
  geom_point()
Can you see any associations between systolic blood pressure and BMI or age?
Question: How does BMI relate to systolic blood pressure,
and what is different between those with and without diabetes?
In this case, we have two continuous variables (bmi and
bp_sys_ave) and one discrete variable (diabetes).
We could use geom_point():
# Plot BMI in relation to systolic blood pressure and diabetes status
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave, colour = diabetes)) +
  geom_point()
For this latter plot, it’s really hard to see what’s different.
But there is another way of visualizing a third (or fourth, and fifth) variable: faceting!
Faceting splits the plot up into multiple subplots
using the function facet_grid().
For faceting to work, at least one of the first two arguments to
facet_grid() is needed.
The first two arguments are:
- cols: The discrete variable to use to facet the plot column-wise (i.e. side-by-side).
- rows: The discrete variable to use to facet the plot row-wise (i.e. stacked on top of each other).
For both cols and rows, the nominated variable must be wrapped by vars() (e.g. vars(diabetes)).
Let’s try it using an example from the previous answer (instead of using colour):
# Plot BMI in relation to systolic blood pressure and diabetes status using
# faceting by column
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave)) +
  geom_point() +
  facet_grid(cols = vars(diabetes))
Try faceting with plots stacked by diabetes status, using the argument
rows = vars(diabetes) instead. Which do you find more informative?
# Plot BMI in relation to systolic blood pressure and diabetes status using
# faceting by row
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave)) +
  geom_point() +
  facet_grid(rows = vars(diabetes))
We can also facet by diabetes and
sex, and use
age as a colour:
# Plot BMI in relation to systolic blood pressure, age, sex and diabetes status
# using faceting
nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave, colour = age)) +
  geom_point() +
  facet_grid(rows = vars(diabetes), cols = vars(sex))
Before moving on, let’s save the file, add and commit the new changes to the Git history, and push to GitHub.
11.7 Colours: Make your graphs more accessible
Please take ~5 min to read through this section and then complete the exercise that follows.
Colour blindness is common in the general population, with red-green colour blindness affecting about 8% of men and 0.5% of women. To make your graph more accessible to people with colour blindness, you need to consider the colours you use. For more detail on how colours look to those with colour blindness, check out this documentation from the viridis package. The viridis colour scheme (also developed as an R package) was specifically designed to represent data to all colour visions (including as a grayscale, e.g. from black to white). There is a really informative talk on YouTube on this topic.
When using colours, think about what you are trying to convey in your figure and how your choice of colours will be interpreted. You can use built-in colour schemes or create your own. For now, let’s stick to using built-in ones. There are two we can start with: the viridis and the ColorBrewer colour schemes. Both are well designed and are colour blind friendly. For this course, we will only cover the viridis package.
11.8 Exercise: Changing the colour schemes
Time: 10 min
Practice changing colour schemes on a bar plot. Start with a base plot object to work from that has two discrete variables.
Create a new Markdown header called
# Exercise for changing colours and create
a new code chunk below it. Copy and paste the code below into the new code chunk.
# Barplot to work from, with two discrete variables
base_barplot <- nhanes_tidied %>%
  ggplot(aes(x = diabetes, fill = sex)) +
  geom_bar(position = position_dodge())
Use the scale_fill_ function set to add the colour scheme. If you need help,
use the help() or ? functions in RStudio to look over the documentation
for more information or to see the other
scale_ functions. Use tab auto-completion
to find the correct function.
1. Change the colour to the viridis scheme with the scale_fill_viridis_d() function and use it on the base_barplot graph so that the plot is colour blind friendly. Because the variables are discrete, you will need to add _d to the end of the viridis scheme function.
2. Viridis has several palettes. Add the argument option = "A" to the scale_fill_viridis_d() function. Run the function again and see how the colour changes. Next, change "A" to "E".
3. Now, let’s practice using the colour schemes on a plot with continuous variables. Copy and paste the code below into a new code chunk. Since we are using colour instead of fill, the function to use is scale_colour_viridis_c(), where the _c at the end indicates the variable will be continuous.
# Scatterplot to work from, with three continuous variables
base_scatterplot <- nhanes_tidied %>%
  ggplot(aes(x = bmi, y = bp_sys_ave, colour = age)) +
  geom_point()
4. Similar to #2 above, use the option argument to set the palette to "B" and see how the colour changes.
5. Lastly, add and commit the changes to the R Markdown file into the Git history.
Click for the (possible) solution.
# 1. Change colours to a viridis colour scheme
base_barplot +
  scale_fill_viridis_d()

# 2. Change colours to another viridis colour scheme
base_barplot +
  scale_fill_viridis_d(option = "A")
base_barplot +
  scale_fill_viridis_d(option = "E")

# 3. Change colours to a viridis colour scheme
base_scatterplot +
  scale_color_viridis_c()

# 4. Change colours to another viridis colour scheme
base_scatterplot +
  scale_color_viridis_c(option = "B")
11.9 Titles, axis labels, and themes
There are so many options in ggplot2 to modify a figure.
Almost all of them are found in the theme() function.
We won’t cover individual theme items, since the
?theme help page
and ggplot2 theme webpage already document
theme() really well.
Instead, we’ll cover a few of the built-in themes,
as well as setting the axes labels and plot title.
We’ll build off of the previously created base_scatterplot.
All built-in themes start with theme_.
# Create scatterplot to play with themes
base_scatterplot2 <- base_scatterplot +
  facet_grid(rows = vars(diabetes), cols = vars(sex)) +
  scale_color_viridis_c()

# View the plot with the default theme
base_scatterplot2
# Some pre-defined themes
base_scatterplot2 +
  theme_bw()
You can also set the theme for all subsequent plots by using the theme_set()
function, and specifying the theme you want in the parenthesis.
# Set the theme for all subsequent plots
theme_set(theme_bw())
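If you later want to go back to the theme you had before, one approach is to keep what theme_set() returns, since it hands back the previously active theme. This is a small sketch; the name previous_theme is just an illustrative choice.

```r
library(ggplot2)

# theme_set() returns the theme that was active before the change
previous_theme <- theme_set(theme_bw())

# ... plots created here use theme_bw() ...

# Restore whatever theme was in place before
theme_set(previous_theme)
```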
To add labels such as axis titles to your plot, you can use the function labs().
To change the y-axis title, use the
y argument in labs().
For the x-axis, it is x.
For the whole plot, it is title.
# Add plot title and change axis titles
base_scatterplot2 +
  labs(title = "BMI, systolic blood pressure, and age by diabetes and sex",
       y = "Systolic Blood Pressure (mmHg)",
       x = "BMI (kg/m2)")
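The labs() function also accepts the names of other aesthetics, which is one way to set a legend title. The sketch below shows this for the colour aesthetic. It uses the built-in mtcars dataset as a stand-in, so the variable names and label text are illustrative only.

```r
library(ggplot2)

# mtcars stands in for the course data; any continuous colour mapping works
labelled_plot <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp)) +
  geom_point() +
  labs(
    title = "Fuel efficiency by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Horsepower" # legend title for the colour aesthetic
  )

labelled_plot
```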
11.10 Saving the plot
To save the plot you created, use the ggsave() function.
The first argument says where to save the graph. Give the name of the newly created file,
as well as the folder location.
The next argument says which plot to save.
At this point, you can also set the dimensions of the figure using the width and height arguments.
# Save the plot
ggsave(here::here("doc/images/scatterplot.pdf"),
       base_scatterplot2, width = 7, height = 5)
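Depending on your version of ggplot2, ggsave() may stop with an error or ask for confirmation when the target folder does not exist yet. A cautious sketch is to create the folder first. To stay self-contained, this example writes to a temporary folder and uses mtcars; in the course project you would point to the doc/images folder instead.

```r
library(ggplot2)

example_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# A temporary folder keeps this sketch from touching your project;
# in the course project you would use here::here("doc/images") instead
output_folder <- file.path(tempdir(), "images")
dir.create(output_folder, recursive = TRUE, showWarnings = FALSE)

# The file extension (here .pdf) decides the output format
output_file <- file.path(output_folder, "scatterplot.pdf")
ggsave(output_file, example_plot, width = 7, height = 5)

# Check that the file was written
file.exists(output_file)
```

The width and height values are in inches by default.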
Lastly, let’s save the R Markdown file, add and commit the changes, and push to GitHub.
11.11 Summary of session
- Use the “Grammar of Graphics” approach in conjunction with the ggplot2 package within the tidyverse to plot your data.
- Prioritize plotting raw data instead of summaries whenever possible.
- ggplot2 has 4 levels of grammar: aes() (which data to plot), geom_ (what kind of plot), scale_ (to make the plot pretty), and theme() (to control the specifics of the plot).
- Only use barplots for discrete values. Applying them to continuous variables hides the distribution of the data.
- To plot more dimensions, use colour, the X axis, the Y axis, or facet_grid().
- Use colour blind-friendly palettes, such as viridis.
- Save plots using ggsave().