Want to help out or contribute?

If you find any typos, errors, or places where the text could be improved, please let us know by providing feedback either in the feedback survey (given during class), by using GitLab, or directly in this document with hypothes.is annotations.

  • Open an issue or submit a merge request on GitLab with the feedback or suggestions.
  • Hypothesis Add an annotation using hypothes.is. To add an annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the in the upper right-hand corner of the page.

9 Data management and wrangling

When in RStudio, quickly jump to this page using r3::open_data_wrangling().

Session objectives:

  1. Learn the difference between “messy” and “tidy” data, including how to create tidy data to simplify your analysis.
  2. Perform simple transformations and subsetting of datasets, such as:
    • Subset specific columns and rows of a dataset, with filter()and select().
    • Sort rows of a dataset by a specific column, with arrange().
    • Create new or transform existing columns in a dataset, with mutate().
    • Calculate simple data summaries, with summarise().
  3. Learn about and apply the “split-apply-combine” method of analyses, with group_by() and summarise().
  4. Write “tidier” and more readable code by using the pipe (%>%) operator.

9.1 “Messy” vs. “tidy” data

Take 10 min to read through this “Messy” vs “tidy” data section and the Managing and working with data in R section.

The concept of “tidy” data was popularized in an article by Hadley Wickham and described in more detail in the Tidy Data chapter of the R for Data Science online book. Before we continue with tidy data, we need to cover something that is related to the concept of “tidy” and that has already come up in this course: the tidyverse. The tidyverse is an ecosystem of R packages that are designed to work well together, that all follow a strong “design philosophy” and common style guide. This makes combining these packages in the tidyverse much easier. We teach the tidyverse because of these abovementioned reasons.

Ok, back to “tidy data”. A tidy dataset is when:

  • Each variable has its own column (e.g. “Body Weight”).
  • Each observation has its own row (e.g. “Person”).
  • Each value has its own cell (e.g. “Body weight for a person at a specific date”).

Take a look at the example “tidy” and “messy” data frames (also called “tibbles” in the tidyverse) below. Think about why each may be considered “tidy” or “messy”. What do you notice between the tidy versions and the messier versions?

# Datasets come from tidyr
# Tidy:
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# Partly tidy:
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

# Messier:
table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

# Messy:
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

# Messy:
table4b
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

The “most” tidy version is table1 as each column describes their values (e.g. population is population size), each row is unique (e.g. first row is for values from Afghanistan from 1999), and each cell is an explicit value representative of its column and row.

table2 is a “long” version of table1 so it is partly “tidy”, but it doesn’t satisfy the rule that each variable has a column, since count represents both cases and population size.

On the other hand, table3 is messy because the rate column values are a composite of two other column values (cases and population), when it should be a single number (a percent). Both table4a and table4b have columns with ambiguous values inside. For example, you can’t tell from the data what the values in the 1999 column contain.

Tidy data has a few notable benefits:

  1. Time spent preparing your data to be tidy from the beginning can save days of added work and frustration in the long run.
  2. “Tidy data” is a conceptual framework that allows you to easily build off and wrangle (i.e. “manipulate”, “clean up”, “manage”) data in simpler and easy-to-interpret ways, especially when used within the framework of the tidyverse.

The concept of tidy data also gives rise to “tidy code” for wrangling. By using “verbs” (R functions) and chaining them together in “sentences” (in a sequential pipeline), you can construct meaningful and readable code that describes in plainer English what you are doing to the data. This is one simple way that you can enhance the reproducibility of your code.

9.2 Managing and working with data in R

Take 5 min and read through this section.

When working with data, there are a few principles to follow:

  • You should always save your original raw dataset in the data-raw/ folder.
    • Note: Whether or not you save data to data-raw/ depends on how you collected the data and how many collaborators are on your team. You may end up storing and processing the data in another folder as a project of its own.
  • Never edit your raw data directly and save it in a separate location.
  • Only work with your raw data using R code. Don’t manually edit it. Manual editing doesn’t leave a history of what you’ve done to it, so you can’t go back and see what you’ve done. Always keep a history of any changes you’ve made to the data, preferably by using R code.
  • Save the edited data as another dataset and store it in the data/ folder.

When wrangling your data with R code, make sure to:

  • Document what you did to your data and why you did it, to help you remember later on (by using hashes to indicate comments and non-code text).
  • Write the code in a way that is as descriptive as possible and is readable enough to understand what is being done to the data.
  • Keep the code simple: Don’t be clever, be clear. Clear code is easier to understand than clever and sometimes overly-complex code.

In data wrangling, most tasks can be expressed by a few simple “verbs” (actions). Wrangling here refers to maneuvering, managing, controlling, and turning your data around to clean it up, better understand it, and prepare it for later analyses. The table below lists some common “verbs” from the dplyr package that come from the tidyverse:

Table 9.1: List of common data wrangling tasks, along with an example and the function used for the wrangling.
Task Example Function
Select columns Remove data entry columns such as the person’s name who entered the data. select()
Rename columns Changing a column name from ‘Q1’ to ‘participant_name’. rename()
Transform or modify columns Multiplying or taking the log of a column’s values. mutate()
Subset/filter rows Keeping rows with glucose values above 4. filter()
Sort rows Show rows with the smallest value at the top. arrange()
Calculate summaries of the data Calculating the maximum, median, and minimum age. summarise()
Run an analysis by a group Calculate means of age by males and females. group_by() with summarise()

Tip: Sometimes you need to do some complicated wrangling to get your data “in shape” for later analyses. To save some time, you could save the wrangled data as an “output” dataset in the data/ folder. That way, you can easily use it again later rather than having to run the wrangling code every time you want to work with the data.

9.3 Load the packages and dataset

We’re going to use the US NHANES dataset to demonstrate some wrangling functions. There is an NHANES package that contains a teaching version of the original dataset, so we’ll use that for this lesson.

First, make sure that you have the LearningR R Project open. Second, open the R/package-loading.R script and load the NHANES package, so it looks like:

library(tidyverse)
library(NHANES)

Third, open the R/wrangling-session.R script to begin entering code. We’ll use this script file to write the code for this session (but not for the exercises).

## # Load up the packages
## source(here::here("R/package-loading.R"))

# Briefly glimpse contents of dataset
glimpse(NHANES)
#> Rows: 10,000
#> Columns: 76
#> $ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 516…
#> $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10…
#> $ Gender           <fct> male, male, male, male, female, male, male, female, …
#> $ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10,…
#> $ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  0-9,  4…
#> $ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795…
#> $ Race1            <fct> White, White, White, Other, White, White, White, Whi…
#> $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Education        <fct> High School, High School, High School, NA, Some Coll…
#> $ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, NA, NA, …
#> $ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, …
#> $ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 875…
#> $ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00…
#> $ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3…
#> $ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, …
#> $ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, …
#> $ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7…
#> $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Height           <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166…
#> $ BMI              <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.…
#> $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plu…
#> $ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, …
#> $ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111,…
#> $ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, …
#> $ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124,…
#> $ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, …
#> $ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108,…
#> $ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, …
#> $ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114,…
#> $ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, …
#> $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, …
#> $ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, …
#> $ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113,…
#> $ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1…
#> $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
#> $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vg…
#> $ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0…
#> $ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA…
#> $ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, None, N…
#> $ Depressed        <fct> Several, Several, Several, NA, Several, NA, NA, None…
#> $ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, …
#> $ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, …
#> $ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, …
#> $ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes,…
#> $ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2,…
#> $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, …
#> $ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, …
#> $ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, …
#> $ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA,…
#> $ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, …
#> $ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA,…
#> $ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No,…
#> $ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smok…
#> $ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, …
#> $ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Y…
#> $ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, …
#> $ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes…
#> $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, …
#> $ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes,…
#> $ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, …
#> $ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, …
#> $ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA,…
#> $ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1…
#> $ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, …
#> $ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA, Hetero…
#> $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Finally, add and commit the changes to the Git history with the RStudio Git interface.

9.4 Exercise: Getting familiar with the dataset

Time: 10 min

Take some time to get to know the NHANES dataset.

  1. Create a new R script by typing in the RStudio Console usethis::use_r("exercises-wrangling").
  2. Copy the code below and paste it into the new script file.
  3. Replace the ___ with NHANES.
  4. Run each line of code by using the Ctrl-Enter shortcut.
  5. Take a look at the output for each command, to see what sorts of data you are able to retrieve.
  6. Once you have completed each, add and commit the changes to the Git history with the RStudio Git interface.
# Load the packages
source(here::here("R/package-loading.R"))

# Check column names
colnames(___)

# Look at contents
str(___)
glimpse(___)

# See summary
summary(___)

# Look over the dataset documentation
?___
Click for the (possible) solution.

# Load the packages
source(here::here("R/package-loading.R"))

# Check column names
colnames(NHANES)

# Look at contents
str(NHANES)
glimpse(NHANES)

# See summary
summary(NHANES)

# Look over the dataset documentation
?NHANES

9.5 Selecting specific columns in a dataset

Selecting columns of a dataset is a very common data wrangling task. The function for this task in RStudio is appropriately called select(). You would use select() to extract one or more variables in a dataset that you want to have a closer look at or to save as a new data frame to work with. It may be that you wish to explore the clinical characteristics of your study sample, so you may select some basic demographic variables (e.g., the Age column) and clinical variables (e.g., Weight and Height columns) to perform these analyses.

For the input arguments, select() takes the dataset as the first argument, which is the first input position right after the opening bracket (, and then takes the names of the columns you want to select. The argument after the data argument is ..., which indicates that you can add as many columns as you want, separated by a ,.

# Select one column by its name, without quotes
select(NHANES, Age)
#> # A tibble: 10,000 x 1
#>      Age
#>    <int>
#>  1    34
#>  2    34
#>  3    34
#>  4     4
#>  5    49
#>  6     9
#>  7     8
#>  8    45
#>  9    45
#> 10    45
#> # … with 9,990 more rows

# Select two or more columns by name, without quotes
select(NHANES, Age, Weight, BMI)
#> # A tibble: 10,000 x 3
#>      Age Weight   BMI
#>    <int>  <dbl> <dbl>
#>  1    34   87.4  32.2
#>  2    34   87.4  32.2
#>  3    34   87.4  32.2
#>  4     4   17    15.3
#>  5    49   86.7  30.6
#>  6     9   29.8  16.8
#>  7     8   35.2  20.6
#>  8    45   75.7  27.2
#>  9    45   75.7  27.2
#> 10    45   75.7  27.2
#> # … with 9,990 more rows

# To *exclude* a column, use minus (-)
select(NHANES, -HeadCirc)
#> # A tibble: 10,000 x 75
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> # … with 9,990 more rows, and 66 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, Height <dbl>,
#> #   BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
#> #   BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>,
#> #   BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

If some of your columns have similar patterns for naming at the beginning, middle, or end of the name, you can use some helper functions to choose these columns. Use ?select_helpers (choose the “Select helpers” option in the menu that pops up) to read more about these functions and to get help on them. Some commonly used helpers are:

  • starts_with(): Select columns that begin with a pattern.
  • ends_with(): Select columns that end with a pattern.
  • contains(): Select columns that contain a pattern.
# All columns starting with letters "BP" (blood pressure)
select(NHANES, starts_with("BP"))
#> # A tibble: 10,000 x 8
#>    BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3
#>       <int>    <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1      113       85    114     88    114     88    112     82
#>  2      113       85    114     88    114     88    112     82
#>  3      113       85    114     88    114     88    112     82
#>  4       NA       NA     NA     NA     NA     NA     NA     NA
#>  5      112       75    118     82    108     74    116     76
#>  6       86       47     84     50     84     50     88     44
#>  7      107       37    114     46    108     36    106     38
#>  8      118       64    106     62    118     68    118     60
#>  9      118       64    106     62    118     68    118     60
#> 10      118       64    106     62    118     68    118     60
#> # … with 9,990 more rows

# All columns ending in letters "Day"
select(NHANES, ends_with("Day"))
#> # A tibble: 10,000 x 3
#>    TVHrsDay CompHrsDay AlcoholDay
#>    <fct>    <fct>           <int>
#>  1 <NA>     <NA>               NA
#>  2 <NA>     <NA>               NA
#>  3 <NA>     <NA>               NA
#>  4 <NA>     <NA>               NA
#>  5 <NA>     <NA>                2
#>  6 <NA>     <NA>               NA
#>  7 <NA>     <NA>               NA
#>  8 <NA>     <NA>                3
#>  9 <NA>     <NA>                3
#> 10 <NA>     <NA>                3
#> # … with 9,990 more rows

# All columns containing letters "Age"
select(NHANES, contains("Age"))
#> # A tibble: 10,000 x 9
#>      Age AgeDecade AgeMonths DiabetesAge Age1stBaby SmokeAge AgeFirstMarij
#>    <int> <fct>         <int>       <int>      <int>    <int>         <int>
#>  1    34 " 30-39"        409          NA         NA       18            17
#>  2    34 " 30-39"        409          NA         NA       18            17
#>  3    34 " 30-39"        409          NA         NA       18            17
#>  4     4 " 0-9"           49          NA         NA       NA            NA
#>  5    49 " 40-49"        596          NA         27       38            18
#>  6     9 " 0-9"          115          NA         NA       NA            NA
#>  7     8 " 0-9"          101          NA         NA       NA            NA
#>  8    45 " 40-49"        541          NA         NA       NA            13
#>  9    45 " 40-49"        541          NA         NA       NA            13
#> 10    45 " 40-49"        541          NA         NA       NA            13
#> # … with 9,990 more rows, and 2 more variables: AgeRegMarij <int>, SexAge <int>

For more information on using the pattern functions such as starts_with(), check out ?select_helpers.

You’ll notice that running these functions doesn’t actually change the data itself. When you run a function without assigning it using <-, the only action the function does is to send the output to your screen, and you won’t have saved that data anywhere for later use. But if you want to create a new dataset with only the columns you selected, you’ll need to assign the selected dataset to a new object.

The full NHANES dataset is 10,000 individuals (rows) with 76 parameters (columns). Since we are only interested in some of these parameters, we will subset the large dataset and save it for later use as a new dataset.

# Save the selected columns as a new data frame
# Recall the style guide for naming objects
nhanes_small <- select(NHANES, Age, Gender, Height, 
                       Weight, BMI, Diabetes, DiabetesAge, 
                       PhysActiveDays, PhysActive, TotChol, 
                       BPSysAve, BPDiaAve, SmokeNow, Poverty)

# View the new data frame
nhanes_small
#> # A tibble: 10,000 x 14
#>      Age Gender Height Weight   BMI Diabetes DiabetesAge PhysActiveDays
#>    <int> <fct>   <dbl>  <dbl> <dbl> <fct>          <int>          <int>
#>  1    34 male     165.   87.4  32.2 No                NA             NA
#>  2    34 male     165.   87.4  32.2 No                NA             NA
#>  3    34 male     165.   87.4  32.2 No                NA             NA
#>  4     4 male     105.   17    15.3 No                NA             NA
#>  5    49 female   168.   86.7  30.6 No                NA             NA
#>  6     9 male     133.   29.8  16.8 No                NA             NA
#>  7     8 male     131.   35.2  20.6 No                NA             NA
#>  8    45 female   167.   75.7  27.2 No                NA              5
#>  9    45 female   167.   75.7  27.2 No                NA              5
#> 10    45 female   167.   75.7  27.2 No                NA              5
#> # … with 9,990 more rows, and 6 more variables: PhysActive <fct>,
#> #   TotChol <dbl>, BPSysAve <int>, BPDiaAve <int>, SmokeNow <fct>,
#> #   Poverty <dbl>

9.6 Renaming all column names based on the style guide

In the interests of keeping data tidy and matching the style guide, we should change the column names to be all lower case with _ for spaces between words. There’s a package that can do that for us called snakecase.

In the Console, type out snakecase:: and hit Tab. You’ll see a list of possible functions to use. We want to use the snakecase function, so scroll down and find the to_snake_case().

To change all the column names to lower case, we’ll use the function rename_with(). This function takes the data as the first argument but the second argument needs to be a function, which in our case is to_snake_case() without using the () at the end. This function will rename all columns.

# Rename all columns to snake case
nhanes_small <- rename_with(nhanes_small, snakecase::to_snake_case)

# Have a look at the data frame
nhanes_small
#> # A tibble: 10,000 x 14
#>      age gender height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>   <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male     165.   87.4  32.2 No                 NA               NA
#>  2    34 male     165.   87.4  32.2 No                 NA               NA
#>  3    34 male     165.   87.4  32.2 No                 NA               NA
#>  4     4 male     105.   17    15.3 No                 NA               NA
#>  5    49 female   168.   86.7  30.6 No                 NA               NA
#>  6     9 male     133.   29.8  16.8 No                 NA               NA
#>  7     8 male     131.   35.2  20.6 No                 NA               NA
#>  8    45 female   167.   75.7  27.2 No                 NA                5
#>  9    45 female   167.   75.7  27.2 No                 NA                5
#> 10    45 female   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

9.7 Renaming specific columns

Depending on how your data was collected, it may have column names that aren’t very descriptive. So you’ll probably want to rename them to something more explanatory, which is particularly important if you’re sharing your work with others or in an environment where multiple people are working on the same data. We will use the function called rename(). Like select(), rename() takes the dataset as the first argument and then takes as many renaming arguments as you want (because the second argument position is ...). When renaming, it takes the form of newname = oldname.

Let’s rename gender to be sex, since gender is a social construct, while sex is biological. This is consistent with what the sex variable in the dataset actually describes too.

rename(nhanes_small, sex = gender)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  6     9 male    133.   29.8  16.8 No                 NA               NA
#>  7     8 male    131.   35.2  20.6 No                 NA               NA
#>  8    45 fema…   167.   75.7  27.2 No                 NA                5
#>  9    45 fema…   167.   75.7  27.2 No                 NA                5
#> 10    45 fema…   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

# View data frame
nhanes_small
#> # A tibble: 10,000 x 14
#>      age gender height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>   <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male     165.   87.4  32.2 No                 NA               NA
#>  2    34 male     165.   87.4  32.2 No                 NA               NA
#>  3    34 male     165.   87.4  32.2 No                 NA               NA
#>  4     4 male     105.   17    15.3 No                 NA               NA
#>  5    49 female   168.   86.7  30.6 No                 NA               NA
#>  6     9 male     133.   29.8  16.8 No                 NA               NA
#>  7     8 male     131.   35.2  20.6 No                 NA               NA
#>  8    45 female   167.   75.7  27.2 No                 NA                5
#>  9    45 female   167.   75.7  27.2 No                 NA                5
#> 10    45 female   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

Notice that the rename() function has not altered the nhanes_small data frame. To do this, we need to assign the change to nhanes_small in a way that saves it:

nhanes_small <- rename(nhanes_small, sex = gender)
# View data frame
nhanes_small
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  6     9 male    133.   29.8  16.8 No                 NA               NA
#>  7     8 male    131.   35.2  20.6 No                 NA               NA
#>  8    45 fema…   167.   75.7  27.2 No                 NA                5
#>  9    45 fema…   167.   75.7  27.2 No                 NA                5
#> 10    45 fema…   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

Now, you can see that the column has been renamed from gender to sex. What if you wanted to select some columns and then rename some of them? Would you have to create a new data object every time? No! We can make use of a very powerful tool called piping with the %>% function.

9.8 Chaining functions with the pipe

Take 5 minutes and read this section before we continue.

A key component of tidy data and tidy code is making use of the “pipe” operator, %>%. You would use the “pipe” operator when you are writing a piece of code with multiple operations or intermediate steps that require you to save and overwrite each step as an object (see below). One advantage of the “pipe” operator is that it will help to ensure that your code is less cluttered with redundant and temporary object names.

This operator allows you to “pipe” the output from one function to the input of another function, just like a plumbing pipe would do for water. This allows you to easily chain functions together into “sentences”. Let’s use an example based on English words for some action. This is the English sentence:

We need some eggs. Drive to the grocery store and buy some eggs before arriving home from work.

There are basically two actions here (“drive” and “buy”) with four inputs (“work”, “grocery store”, “eggs”, “home”), that are all based on the previous action. Since an action in R is a function, the functions would be drive() and buy(). In regular R (without the “pipe” operator), we would have to nest functions (reading from the inside to the outside) together to chain them:

drive(buy(drive(at_work, "grocery store"), "eggs"), "home")

This is difficult to read. Another way to chain functions would be to create temporary objects for each step:

at_grocery_store <- drive(at_work, "grocery store")
got_eggs <- buy(at_grocery_store, "eggs")
at_home <- drive(got_eggs, "home")

This still isn’t very “readable”, as we are having to re-name each intermediate object with reference to the object before it. The pipe %>% operator can really simplify this:

at_work %>% 
    drive("grocery store") %>% 
    buy("eggs") %>% 
    drive("home")

Do you find this more readable and understandable? It reads how it would be done, in order of the steps taken.

The idea of piping is to read the functions from left to right. This can help clarify and break down complex data processing workflows, and is the basis for the tidyverse and many other packages.

The operator %>% takes the output from the object or function from the left of the operator and puts it into the function on the right of the operator. All input goes into the first position argument of the function. Within the tidyverse packages, all functions take a data frame (or vector) as the first argument in order to work with the pipe.

Let’s try this out with an example from NHANES. The keyboard shortcut for the pipe is Ctrl-Shift-M (i.e., M for the magrittr package that created the pipe).

# These two ways are the same
colnames(nhanes_small)
#>  [1] "age"              "sex"              "height"           "weight"          
#>  [5] "bmi"              "diabetes"         "diabetes_age"     "phys_active_days"
#>  [9] "phys_active"      "tot_chol"         "bp_sys_ave"       "bp_dia_ave"      
#> [13] "smoke_now"        "poverty"

nhanes_small %>% 
    colnames()
#>  [1] "age"              "sex"              "height"           "weight"          
#>  [5] "bmi"              "diabetes"         "diabetes_age"     "phys_active_days"
#>  [9] "phys_active"      "tot_chol"         "bp_sys_ave"       "bp_dia_ave"      
#> [13] "smoke_now"        "poverty"

The pipe automatically takes nhanes_small and puts it into the first position, so we don’t need to specify nhanes_small inside colnames() when piping.

Let’s try using the pipe with the select() and rename() functions from the previous section. Remember, both select() and rename() take a dataset as the first argument, which makes them pipe-able.

nhanes_small %>% 
    select(phys_active) %>% 
    rename(physically_active = phys_active)
#> # A tibble: 10,000 x 1
#>    physically_active
#>    <fct>            
#>  1 No               
#>  2 No               
#>  3 No               
#>  4 <NA>             
#>  5 No               
#>  6 <NA>             
#>  7 <NA>             
#>  8 Yes              
#>  9 Yes              
#> 10 Yes              
#> # … with 9,990 more rows

We can now “read” these actions as:

Take the nhanes_small dataset and then select the “phys_active” column and then rename the “phys_active” column to “physically_active”.

Now, let’s add and commit changes to the Git history with the RStudio Git interface before moving on to the exercise.

9.9 Exercise: Practice what we’ve learned

Time: 10 min

In the exercise-wrangling.R script file, complete the following tasks.

Replace the ___ with NHANES

  1. Copy and paste the code below into the script file. Replace the ___ in the select() function, with the columns tot_chol, bp_sys_ave, and poverty.

    nhanes_small %>% 
        select(___)
  2. Copy and paste the code below and fill out the blanks. Rename diabetes_age to be diabetes_diagnosis_age. Tip: Recall that renaming is in the form new = old.

    nhanes_small %>% 
        rename(___ = ___)
  3. Re-write this piece of code using the “pipe” operator:

    select(nhanes_small, bmi, contains("age"))
  4. Read through (in your head) the code below. How intuitive is it to read? Now, re-write this code so that you don’t need to create the temporary physical_activity object by using the pipe, then re-read the revised version. Which do you feel is easier to “read”?

    physical_activity <- select(nhanes_small, phys_active_days, phys_active)
    rename(physical_activity, days_phys_active = phys_active_days)
  5. Lastly, add and commit these changes to the Git history with the RStudio Git interface.

Click for the (possible) solution.

# 1. Select specific columns
nhanes_small %>%
    select(tot_chol, bp_sys_ave, poverty)

# 2. Rename columns
nhanes_small %>%
    rename(diabetes_diagnosis_age = diabetes_age)

# 3. Re-write with pipe
nhanes_small %>% 
    select(bmi, contains("age"))

# 4. Re-write with pipe
nhanes_small %>% 
    select(phys_active_days, phys_active) %>% 
    rename(days_phys_active = phys_active_days)

9.10 Filtering data by row

Filtering data by row is a very common activity in data analysis, especially if you want to get rid of outliers or to subset by a categorical group. The function to subset or filter is called filter(). filter() is distinct from select() in the sense that it operates on rows, whereas select() operates on columns.

The filter() function takes a logic condition (TRUE or FALSE). As with the other functions, the first argument is the dataset and all others are the logical conditions that will apply to the row filtering. When the logical conditions equal TRUE, it means that those rows will be kept and those that are FALSE will be dropped.

Warning: Since filter() uses logical conditions, you need to be really careful when writing the logic. As you probably know, humans are really really bad at logic. If your logical condition starts getting complex, double and triple check that you know for certain that your logic code is doing what you think it is doing. It’s very easy to make mistakes at this stage, even for advanced R users.

The simplest kind of logic condition is to test for “equality”. In R, “equal to” is represented by ==. An example: If we wanted to keep only females in the dataset, we would use the logic condition like this:

nhanes_small %>%
    filter(sex == "female")
#> # A tibble: 5,020 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  2    45 fema…   167.   75.7  27.2 No                 NA                5
#>  3    45 fema…   167.   75.7  27.2 No                 NA                5
#>  4    45 fema…   167.   75.7  27.2 No                 NA                5
#>  5    10 fema…   142.   38.6  19.2 No                 NA               NA
#>  6    58 fema…   148.   57.5  26.2 No                 NA                2
#>  7     9 fema…   139.   53.1  27.4 No                 NA               NA
#>  8    56 fema…   171.   57.5  19.7 No                 NA                7
#>  9    56 fema…   171.   57.5  19.7 No                 NA                7
#> 10    57 fema…   157.   51    20.7 No                 NA                3
#> # … with 5,010 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

We would “read” this code as:

Take the nhanes_small dataset and then filter so that only rows where sex is equal to “female” are kept.

When a row in the sex column has the value "female", that row is kept. Otherwise, it is dropped.

There are other logic comparisons to use. Table 9.2 can be used as a reference for logical conditions in R.

Table 9.2: Logical operators in R.
Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== equal to
!= not equal to
!x Not x (if x is true or false)
x | y x OR y
x & y x AND y

Let’s try out a few of these logical conditions with filter().

# Participants who are not female
nhanes_small %>%
    filter(sex != "female")
#> # A tibble: 4,980 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5     9 male    133.   29.8  16.8 No                 NA               NA
#>  6     8 male    131.   35.2  20.6 No                 NA               NA
#>  7    66 male    170.   68    23.7 No                 NA                7
#>  8    58 male    182.   78.4  23.7 No                 NA                5
#>  9    54 male    169.   74.7  26.0 No                 NA                1
#> 10    50 male    178.   84.1  26.6 No                 NA                7
#> # … with 4,970 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

# Participants who have BMI equal to 25
nhanes_small %>%
    filter(bmi == 25)
#> # A tibble: 35 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    24 male    178.   79.3    25 No                 NA                2
#>  2    24 male    178.   79.3    25 No                 NA                2
#>  3    32 male    179    80.1    25 No                 NA               NA
#>  4    32 male    179    80.1    25 No                 NA               NA
#>  5    32 male    179    80.1    25 No                 NA               NA
#>  6    32 male    179    80.1    25 No                 NA                2
#>  7    55 fema…   164.   67.5    25 No                 NA                3
#>  8     6 fema…   117.   34      25 No                 NA                3
#>  9    53 male    162.   65.6    25 No                 NA                2
#> 10    57 fema…   178.   78.8    25 No                 NA                3
#> # … with 25 more rows, and 6 more variables: phys_active <fct>, tot_chol <dbl>,
#> #   bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>, poverty <dbl>

# Participants who have BMI equal to or more than 25
nhanes_small %>%
    filter(bmi >= 25)
#> # A tibble: 5,422 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  5    45 fema…   167.   75.7  27.2 No                 NA                5
#>  6    45 fema…   167.   75.7  27.2 No                 NA                5
#>  7    45 fema…   167.   75.7  27.2 No                 NA                5
#>  8    54 male    169.   74.7  26.0 No                 NA                1
#>  9    58 fema…   148.   57.5  26.2 No                 NA                2
#> 10    50 male    178.   84.1  26.6 No                 NA                7
#> # … with 5,412 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

We use the | (“or”) and & (“and”) when we want to combine conditions across columns. Be careful with these operators and when combining logic conditions, as they can sometimes work differently than our human brains interpret them (speaking from experience). For &, both sides must be TRUE in order for the combination to be TRUE. For |, only one side needs to be TRUE in order for the combination to be TRUE. To see how they work try these:

TRUE & TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
FALSE & FALSE
#> [1] FALSE
TRUE | TRUE
#> [1] TRUE
TRUE | FALSE
#> [1] TRUE
FALSE | FALSE
#> [1] FALSE

When used in filter(), combinations of logic conditions may look like this:

# When BMI is 25 AND sex is female
nhanes_small %>%
    filter(bmi == 25 & sex == "female")
#> # A tibble: 21 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    55 fema…   164.   67.5    25 No                 NA                3
#>  2     6 fema…   117.   34      25 No                 NA                3
#>  3    57 fema…   178.   78.8    25 No                 NA                3
#>  4    57 fema…   178.   78.8    25 No                 NA               NA
#>  5    57 fema…   178.   78.8    25 No                 NA               NA
#>  6    79 fema…   157.   61.6    25 No                 NA               NA
#>  7    54 fema…   161.   64.8    25 No                 NA                4
#>  8    54 fema…   161.   64.8    25 No                 NA               NA
#>  9    54 fema…   161.   64.8    25 No                 NA               NA
#> 10    54 fema…   161.   64.8    25 No                 NA               NA
#> # … with 11 more rows, and 6 more variables: phys_active <fct>, tot_chol <dbl>,
#> #   bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>, poverty <dbl>

# When BMI is 25 OR sex is female
nhanes_small %>%
    filter(bmi == 25 | sex == "female")
#> # A tibble: 5,034 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  2    45 fema…   167.   75.7  27.2 No                 NA                5
#>  3    45 fema…   167.   75.7  27.2 No                 NA                5
#>  4    45 fema…   167.   75.7  27.2 No                 NA                5
#>  5    10 fema…   142.   38.6  19.2 No                 NA               NA
#>  6    58 fema…   148.   57.5  26.2 No                 NA                2
#>  7     9 fema…   139.   53.1  27.4 No                 NA               NA
#>  8    56 fema…   171.   57.5  19.7 No                 NA                7
#>  9    56 fema…   171.   57.5  19.7 No                 NA                7
#> 10    57 fema…   157.   51    20.7 No                 NA                3
#> # … with 5,024 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

9.11 Arranging the rows of your data by column

You may want to sort your rows by a specific column so that values are arranged in ascending or descending order. This can be done using the function called arrange(). Again, arrange() takes the dataset as the first argument, followed by the columns that you wish to arrange data by. By default, arrange() orders in ascending order.

# Arranging data by age in ascending order
nhanes_small %>%
    arrange(age)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1     0 fema…     NA    4.5    NA <NA>               NA               NA
#>  2     0 fema…     NA    9.5    NA <NA>               NA               NA
#>  3     0 male      NA    6.9    NA <NA>               NA               NA
#>  4     0 male      NA   10.7    NA <NA>               NA               NA
#>  5     0 fema…     NA    9      NA <NA>               NA               NA
#>  6     0 fema…     NA    9      NA <NA>               NA               NA
#>  7     0 fema…     NA    6      NA <NA>               NA               NA
#>  8     0 fema…     NA    6      NA <NA>               NA               NA
#>  9     0 fema…     NA    8.7    NA <NA>               NA               NA
#> 10     0 fema…     NA    4.7    NA <NA>               NA               NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

arrange() also arranges parameters of type character alphabetically:

nhanes_small %>% 
    arrange(sex)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  2    45 fema…   167.   75.7  27.2 No                 NA                5
#>  3    45 fema…   167.   75.7  27.2 No                 NA                5
#>  4    45 fema…   167.   75.7  27.2 No                 NA                5
#>  5    10 fema…   142.   38.6  19.2 No                 NA               NA
#>  6    58 fema…   148.   57.5  26.2 No                 NA                2
#>  7     9 fema…   139.   53.1  27.4 No                 NA               NA
#>  8    56 fema…   171.   57.5  19.7 No                 NA                7
#>  9    56 fema…   171.   57.5  19.7 No                 NA                7
#> 10    57 fema…   157.   51    20.7 No                 NA                3
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

If we want to order the column based on descending order, this can be done with desc().

# Arranging data by age in descending order
nhanes_small %>%
    arrange(desc(age))
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    80 fema…   157.   66.9  27.2 No                 NA               NA
#>  2    80 fema…   157.   66.9  27.2 No                 NA               NA
#>  3    80 male    170.   51.4  17.9 No                 NA               NA
#>  4    80 male    168.   95.5  34.0 Yes                48               NA
#>  5    80 male    168    82.8  29.3 Yes                80                1
#>  6    80 fema…   153.   77.2  33.1 No                 NA               NA
#>  7    80 fema…   163.   65.6  24.8 No                 NA               NA
#>  8    80 fema…   163.   65.6  24.8 No                 NA               NA
#>  9    80 male    172.   78    26.5 No                 NA               NA
#> 10    80 fema…   167.   71.5  25.8 No                 NA               NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

You can also order your data by multiple columns. For instance, we could arrange first by sex and then by age.

# Arranging data by sex then age in ascending order
nhanes_small %>%
    arrange(sex, age)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1     0 fema…     NA    4.5    NA <NA>               NA               NA
#>  2     0 fema…     NA    9.5    NA <NA>               NA               NA
#>  3     0 fema…     NA    9      NA <NA>               NA               NA
#>  4     0 fema…     NA    9      NA <NA>               NA               NA
#>  5     0 fema…     NA    6      NA <NA>               NA               NA
#>  6     0 fema…     NA    6      NA <NA>               NA               NA
#>  7     0 fema…     NA    8.7    NA <NA>               NA               NA
#>  8     0 fema…     NA    4.7    NA <NA>               NA               NA
#>  9     0 fema…     NA    9.1    NA <NA>               NA               NA
#> 10     0 fema…     NA   10.6    NA <NA>               NA               NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

9.12 Transform or add columns

To modify an existing column or to add a new one, we can use the function called mutate(). You can use mutate() to compute a new variable using existing columns in your dataset, such as calculating BMI using height and weight columns. You can multiply all values in a certain column by 2, or combine columns into a new variable. Like the other functions, the first input is the dataset and the other arguments are columns to add or modify.

Take this exmaple: Height values are in centimeters, but we want them in meters. So, we would use mutate() with the following instruction:

height = height / 100

This form is similar to how math works. The action that happens on the right hand side is put into the variable of the left hand side. When using mutate(), it looks like this:

nhanes_small %>%
    mutate(height = height / 100)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    1.65   87.4  32.2 No                 NA               NA
#>  2    34 male    1.65   87.4  32.2 No                 NA               NA
#>  3    34 male    1.65   87.4  32.2 No                 NA               NA
#>  4     4 male    1.05   17    15.3 No                 NA               NA
#>  5    49 fema…   1.68   86.7  30.6 No                 NA               NA
#>  6     9 male    1.33   29.8  16.8 No                 NA               NA
#>  7     8 male    1.31   35.2  20.6 No                 NA               NA
#>  8    45 fema…   1.67   75.7  27.2 No                 NA                5
#>  9    45 fema…   1.67   75.7  27.2 No                 NA                5
#> 10    45 fema…   1.67   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

We can also create a new column (e.g., log transforming height):

nhanes_small %>% 
    mutate(logged_height = log(height))
#> # A tibble: 10,000 x 15
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  6     9 male    133.   29.8  16.8 No                 NA               NA
#>  7     8 male    131.   35.2  20.6 No                 NA               NA
#>  8    45 fema…   167.   75.7  27.2 No                 NA                5
#>  9    45 fema…   167.   75.7  27.2 No                 NA                5
#> 10    45 fema…   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>, logged_height <dbl>

We can add multiple modifications or additions with mutate() by separating them with ,. So, if we first wanted to have height as meters and then take the natural logarithm, it would be:

nhanes_small %>% 
    mutate(height = height / 100,
           logged_height = log(height))
#> # A tibble: 10,000 x 15
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    1.65   87.4  32.2 No                 NA               NA
#>  2    34 male    1.65   87.4  32.2 No                 NA               NA
#>  3    34 male    1.65   87.4  32.2 No                 NA               NA
#>  4     4 male    1.05   17    15.3 No                 NA               NA
#>  5    49 fema…   1.68   86.7  30.6 No                 NA               NA
#>  6     9 male    1.33   29.8  16.8 No                 NA               NA
#>  7     8 male    1.31   35.2  20.6 No                 NA               NA
#>  8    45 fema…   1.67   75.7  27.2 No                 NA                5
#>  9    45 fema…   1.67   75.7  27.2 No                 NA                5
#> 10    45 fema…   1.67   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>, logged_height <dbl>

We can also have different values based on logic conditions using if_else(). Use Table 9.2 to help with creating the logic condition.

nhanes_small %>%
    mutate(highly_active = if_else(phys_active_days >= 5, "yes", "no"))
#> # A tibble: 10,000 x 15
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  6     9 male    133.   29.8  16.8 No                 NA               NA
#>  7     8 male    131.   35.2  20.6 No                 NA               NA
#>  8    45 fema…   167.   75.7  27.2 No                 NA                5
#>  9    45 fema…   167.   75.7  27.2 No                 NA                5
#> 10    45 fema…   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>, highly_active <chr>

Recall that the original dataset doesn’t change. If we want the added variable to be saved, we must assign it to something with <-. Putting it all together, you would enter something like this:

nhanes_update <- nhanes_small %>%
    mutate(height = height / 100,
           logged_height = log(height),
           highly_active = if_else(phys_active_days >= 5, "Yes", "No"))

Before you start the following exercise, add and commit all the files changes to the Git history with the RStudio Git interface.

9.13 Exercise: Piping, filtering, and mutating

Time: 20 min

Copy and paste the code below into the exercises-wrangling.R script file.

# 1. BMI between 20 and 40 with diabetes
nhanes_small %>%
    # Format should follow: variable >= number or character
    filter(___ >= ___ & ___ <= ___ & ___ == ___)

# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # Specifying dataset
    mutate(
        # 2. Calculate mean arterial pressure
        ___ = ___,
        # 3. Create young_child variable using a condition
        ___ = if_else(___, "Yes", "No")
    )

nhanes_modified

Then, start replacing the ___ with the appropriate code to complete the tasks below. (Hint: Create a new “Section” in the R script for this exercise by using Ctrl-Shift-R).

  1. Filter nhanes_small so only those participants with a BMI of more than or equal to 20 and less than or equal to 40, and keep those who have diabetes.

  2. Create a new variable called mean_arterial_pressure by applying the formula: \(((2 \times DBP) + SBP) / 3\) (DBP = bp_dia_ave and SBP = bp_sys_ave) to calculate Mean Arterial Pressure. Hint: In R, use + to add, * to multiply, and / to divide.

  3. Create a new variable called young_child for cases where age is less than 6 years.

  4. Finally, add and commit these changes to the Git history with the RStudio Git Interface. Push to GitHub to synchronize with your GitHub repository.

Click for the (possible) solution.

# 1. BMI between 20 and 40, with diabetes
nhanes_small %>%
    filter(bmi >= 20 & bmi <= 40 & diabetes == "Yes")

# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # dataset
    mutate(
        mean_arterial_pressure = ((2 * bp_dia_ave) + bp_sys_ave) / 3,
        young_child = if_else(age < 6, "Yes", "No")
    )

nhanes_modified

9.14 Split-apply-combine: Summarizing data

Take 5 min to read through this section before we continue.

Summarising or applying simple (or complex) statistics to data is a key component of any analysis. Simple summaries or statistics can be done either on all of the data or on groups of it. There are many data analysis tasks that can be approached using the split-apply-combine method, which involves splitting the data into groups, applying some analysis to each group, and then combining the results together.

In dplyr, you can use the function called summarise() to summarise on all the data. If you want to do a split-apply-combine analysis to, e.g., find the max height of females and males, you would use the functions group_by() and then summarise(). Using group_by() splits the data up and summarise() applies an analysis, then immediately combines it back together.

The first position argument to group_by() is, as usual, the dataset. The next arguments are the columns that contain the values you want to group by. These columns must contain categorical data (e.g. sex). group_by() tells R to compute the next operations on the data within each grouping, rather than on all the data.

As with the other functions, summarize() takes the dataset as the first position argument. The next arguments work similar to the arguments in mutate() with one difference: the output must create a single value (e.g. a mean). Like mutate(), you can add multiple “summaries” by adding new columns separated by ,. You would use summarise() to derive basic descriptive statistics of a certain variable, including min(), max(), mean(), median(), or sd() (standard deviation).

The group_by() function doesn’t do anything by itself so should always be used in combination with a summarise(), mutate(), arrange(), or other function. However, the summarise() function can be used on its own. Let’s take a look.

9.15 Calculating summary statistics

Let’s calculate the maximum value of the BMI variable. See what happens when you enter the following:

nhanes_small %>%
    summarise(max_bmi = max(bmi))
#> # A tibble: 1 x 1
#>   max_bmi
#>     <dbl>
#> 1      NA

We get back a result of NA, which means “missing”. In R, NA values “propagate”, meaning that if there is one value missing, then the max or mean will also be missing. So, we need to tell max() to exclude any NA values from the calculation using the argument na.rm = TRUE.

nhanes_small %>%
    summarise(max_bmi = max(bmi, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   max_bmi
#>     <dbl>
#> 1    81.2

To calculate another summary staistic, you would add another summary column using ,:

nhanes_small %>%
    summarise(max_bmi = max(bmi, na.rm = TRUE),
              min_bmi = min(bmi, na.rm = TRUE))
#> # A tibble: 1 x 2
#>   max_bmi min_bmi
#>     <dbl>   <dbl>
#> 1    81.2    12.9

Before you start the following exercise, add and commit changes to the Git history with the RStudio Git interface.

9.16 Exercise: Calculate some basic statistics

Time: 10 min

Practice using summarise() by calculating various summary statistics. Copy and paste the code below into the exercises-wrangling.R script file.

# 1.
nhanes_small %>%
    summarise(mean_weight = ___,
              mean_age = ___)

# 2.
nhanes_small %>%
    summarise(max_height = ___,
              min_height = ___)

# 3.
nhanes_small %>%
    summarise(___ = ___,
              ___ = ___)

Then, start replacing the ___ with the appropriate code to complete the tasks below. Don’t forget to use na.rm = TRUE in the basic statistic functions.

  1. Calculate the mean of weight and age.
  2. Calculate the max and min of height.
  3. Calculate the median of age and phys_active_days.
  4. Lastly, add and commit any changes made to the Git history with the RStudio Git interface.
Click for the (possible) solution.

# 1.
nhanes_small %>%
    summarise(mean_weight = mean(weight, na.rm = TRUE),
              mean_age = mean(age, na.rm = TRUE))

# 2.
nhanes_small %>%
    summarise(max_height = max(height, na.rm = TRUE),
              min_height = min(height, na.rm = TRUE))

# 3.
nhanes_small %>%
    summarise(median_age = median(height, na.rm = TRUE),
              median_phys_active_days = median(phys_active_days, na.rm = TRUE))             

9.17 Summary statistics by a group

While the summarise() function is useful enough on its own, it really shines when combined with group_by().

Let’s use these functions to find out the mean age and BMI between those with and without diabetes.

nhanes_small %>%
    group_by(diabetes) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 3 x 3
#>   diabetes mean_age mean_bmi
#>   <fct>       <dbl>    <dbl>
#> 1 No          35.4      26.2
#> 2 Yes         59.2      32.6
#> 3 <NA>         1.20     29.6

Quick note: If you are using a version of dplyr >= 1.0.0, you’ll get a message informing you that it is regrouping output. This is a notification and can be ignored. If you don’t want the message displayed, you can add options(dplyr.summarise.inform = FALSE) to the top of your script and run it.

We also get a warning about there being missing values in diabetes, so we need to remove rows that have missing diabetes status values.

nhanes_small %>%
    # Recall ! means "NOT", so !is.na means "is not missing"
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 2 x 3
#>   diabetes mean_age mean_bmi
#>   <fct>       <dbl>    <dbl>
#> 1 No           35.4     26.2
#> 2 Yes          59.2     32.6

Cool! We can add more columns to the grouping, so let’s do that. Let’s compare mean age and BMI by sex and diabetes status.

nhanes_small %>%
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, sex) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 4 x 4
#> # Groups:   diabetes [2]
#>   diabetes sex    mean_age mean_bmi
#>   <fct>    <fct>     <dbl>    <dbl>
#> 1 No       female     36.5     26.2
#> 2 No       male       34.3     26.1
#> 3 Yes      female     59.9     33.7
#> 4 Yes      male       58.6     31.5

Since we don’t need the dataset grouped anymore, it’s good practice to end the grouping with ungroup().

nhanes_small %>%
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, sex) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE)) %>% 
    ungroup()
#> # A tibble: 4 x 4
#>   diabetes sex    mean_age mean_bmi
#>   <fct>    <fct>     <dbl>    <dbl>
#> 1 No       female     36.5     26.2
#> 2 No       male       34.3     26.1
#> 3 Yes      female     59.9     33.7
#> 4 Yes      male       58.6     31.5

Before you start the following exercise, add and commit changes to the Git history.

9.18 Exercise: Answer some statistical questions with group by and summarise

Time: 5-7 min

Copy and paste the code below into the exercises-wrangling.R script file.

# 1. 
nhanes_small %>% 
    filter(!is.na(diabetes)) %>% 
    ___(___, ___) %>% 
    ___(
        ___,
        ___,
        ___
    )

# 2. 
nhanes_small %>% 
    filter(!is.na(diabetes)) %>% 
    ___(___, ___) %>% 
    ___(
        ___,
        ___,
        ___,
        ___,
        ___,
        ___
    )

Then, start replacing the ___ with the appropriate code including group_by() with summarise(), to answer these questions:

  1. What is the mean, max, and min differences in age between females and males with or without diabetes?
  2. What is the mean, max, and min differences in height and weight between females and males with or without diabetes?
  3. Once done, add and commit the changes to the file to the Git history.
Click for the (possible) solution.

# 1. 
nhanes_small %>% 
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, sex) %>% 
    summarise(
        mean_age = mean(age, na.rm = TRUE),
        max_age = max(age, na.rm = TRUE),
        min_age = min(age, na.rm = TRUE)
    )

# 2. 
nhanes_small %>% 
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, sex) %>% 
    summarise(
        mean_height = mean(height, na.rm = TRUE),
        max_height = max(height, na.rm = TRUE),
        min_height = min(height, na.rm = TRUE),
        mean_weight = mean(weight, na.rm = TRUE),
        max_weight = max(weight, na.rm = TRUE),
        min_weight = min(weight, na.rm = TRUE)
    )

9.19 Saving datasets as files

The nhanes_small data frame you created is only available after you’ve created it from NHAHES, but if you want to access it later, you can save it as an .rda file in your data/ folder using the function usethis::use_data(). We would add overwrite = TRUE to this so that we can save the dataset again even if the file already exists.

The usethis::use_data() function outputs some information, the last of which (“Document your data”) we won’t cover in this course. The function takes any number of datasets and saves them individually as a .rda R dataset file in the data/ folder.

# Saving data as an .rda file in the data folder
usethis::use_data(nhanes_small, overwrite = TRUE)

For many projects, it isn’t necessary or advisable to save every single data object you create. It’s better to let the code create the data you’ll use rather than saving each new wrangled dataset you might create. However, sometimes you will want or need to save the dataset you’ve been working on, perhaps because you’ve done a lot of work to prepare it for later analyses, or because you’ve run an analysis and want to save the results. In these cases, you should definitely save the new cleaned dataset.

9.20 Loading in a dataset

Take ~5 min to read through this section.

We’ve been using a teaching dataset that we load from a package, mainly so that we can focus on getting familiar with data wrangling. However, there will come a time when you want to wrangle your own data. There are several ways to load in a dataset, with the most common being:

  1. Using the RStudio menu File -> Import Dataset -> From Text/Excel/SPSS/SAS/Stata (depending on your file type you want to import).

  2. If the file is a .csv file, use readr::read_csv() to import the dataset:

    dataset_name <- readr::read_csv(here::here("data/dataset_name.csv"))
  3. If the dataset is a .rda file, use load():

    load(here::here("data/dataset_name.rda"))    

    This loads the dataset into your R session so that you can use it again. If you want to check that it works, restart the R session with either Ctrl-Shift-F10 or with the menu item Session -> Restart R. Then, type out and run this:

    load(here::here("data/nhanes_small.rda"))

    You should now see this dataset in the Environment tab. This is how you save and load data.

For SAS, SPSS, or Stata files, you can use the package haven to import those types of data files into R.

9.21 Exercise: Practicing the dplyr functions

Time: 30 minutes

Practice using dplyr by using the NHANES dataset and wrangling the data into a summary output. Don’t create any intermediate objects by only using the pipe operator to link each task below with the next one.

  1. Rename all columns to use snakecase.
  2. Select the columns gender, age and BMI.
  3. Exclude "NAs" from all of the selected columns.
  4. Rename gender to sex.
  5. Create a new column called age_class, where anyone under 50 years old is labeled "under 50" and those 50 years and older are labeled "over 50".
  6. Group the data according to sex and age_class.
  7. Calculate the mean and median BMI according to the grouping to determine the difference in BMI between age classes and sex.
  8. Add and commit changes to the Git history with the RStudio Git interface.
Click for the (possible) solution.

NHANES %>% 
    rename_with(snakecase::to_snake_case) %>% 
    select(gender, age,  bmi) %>% 
    filter(!is.na(gender) & !is.na(age) & !is.na(bmi)) %>% 
    rename(sex = gender) %>% 
    mutate(age_class = if_else(age < 50, "under 50", "over 50")) %>%
    group_by(age_class, sex) %>% 
    summarize(bmi_mean = mean(bmi, na.rm = TRUE), 
              bmi_median = median(bmi, na.rm = TRUE))

9.22 Summary of session

  • With tidy data, each variable has its own column, each observation has its own row, and each value has its own cell.
  • Use the tidyverse to load in multiple packages to tidy up data.
  • Never edit raw data. Instead, use R code to make changes and clean up the raw data, rather than manually editing the dataset.
  • Use the functions select(), rename(), filter(), mutate() (“change or modify”), arrange() and summarise() from the dplyr package to wrangle your data.
  • Use the pipe (%>%) to write easy-to-read code, similar to reading a text consisting of multiple sentences.