Want to help out or contribute?

If you find any typos, errors, or places where the text could be improved, please let us know by providing feedback either in the feedback survey (given during class), by using GitLab, or directly in this document with hypothes.is annotations.

  • Open an issue or submit a merge request on GitLab with the feedback or suggestions.
  • Hypothesis Add an annotation using hypothes.is. To add an annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the in the upper right-hand corner of the page.

9 Data management and wrangling

When in RStudio, quickly jump to this page using r3::open_data_wrangling().

Session objectives:

  1. Learn the difference between “messy” and “tidy” data and how to create tidy data to simplify your analysis.
  2. Perform simple transformations and subsetting of datasets, such as:
    • Subset specific columns and rows of a dataset, with filter()and select().
    • Sort rows of a dataset by a specific column, with arrange().
    • Create new or transform existing columns in a dataset, with mutate().
    • Calculate simple data summaries, with summarize().
  3. Learn about and apply the “split-apply-combine” method for doing analyses, with group_by() and summarize().
  4. Write “tidier” and more readable code by using the pipe (%>%) operator.

9.1 “Messy” vs. “tidy” data

Take 10 min to read through this “Messy” vs “tidy” data section and the Managing and working with data in R section.

The concept of “tidy” data was popularized in an article by Hadley Wickham and described in more detail in the Tidy Data chapter of the R for Data Science online book.

But before we continue with tidy data, we need to cover something that is related to the concept of “tidy” and that will come up often in this course: the tidyverse. The tidyverse is an ecosystem of R packages that are designed to work well together, that all follow a strong “design philosophy” and common style guide. This makes combining these packages in the tidyverse much easier. These packages also tend to have excellent, beginner-friendly documentation and tutorials on learning and using the packages. We teach the tidyverse because of these reasons.

Ok, back to “tidy data”. A tidy dataset is when:

  • Each variable has its own column (e.g. “Body Weight”).
  • Each observation has its own row (e.g. “Person”).
  • Each value has its own cell (e.g. “Body weight for a person at a specific date”).

Take a look at the example “tidy” and “messy” data frames (also called “tibbles” in the tidyverse) below. Think about why each is “tidy” or “messy”. What do you notice between the tidy versions and the messier versions?

# Datasets come from tidyr
# Tidy:
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

# Partly tidy:
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

# Messier:
table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

# Messy:
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

# Messy:
table4b
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

The “most” tidy version is table1 which has columns that describe their values (e.g. population is population size), each row is unique (e.g. first row is for values from Afghanistan from 1999), and each cell is an explicit value representative of its column and row. table2 is a “long” version of table1 so it is partly “tidy”, but it doesn’t satisfy the rule that each variable has a column, since count represents both cases and population size. On the other hand, table3 is messy because the rate column values are a composite of two other column values (cases and population), when it should be a single number (a percent). Both table4a and table4b have columns with ambiguous values inside; what does values in the 1999 column contain? You can’t tell from the data.

Tidy data has a few notable benefits:

  1. Time spent preparing your data to be tidy from the beginning can save days of frustration in the long run.
  2. “Tidy data” is a conceptual framework that allows you to easily build off and wrangle (i.e. “manipulate”, “clean up”, “manage”) data in simpler and easy to interpret ways, especially when used within the framework of the tidyverse.

The concept of tidy data also gives rise to “tidy code” for wrangling. By using “verbs” (R functions) and chaining them together in “sentences” (in a sequential pipeline), you can construct meaningful and readable code that describes in plainer English what you are doing to the data

9.2 Managing and working with data in R

Take 5 min and read through this section. When working with data, there are a few principles to follow:

  • Never edit raw data and save it in a separate location (could put in the data-raw/ folder).
    • Note: Saving to data-raw/ depends on how you collected the data and how many collaborators are on your team. You may end up storing and processing the data in another folder as a project of its own.
  • Only work with your raw data using R code, don’t manually edit it. Manual editing doesn’t leave a history of what you’ve done to it, so you can’t go back and see what you’ve done. Always keep a history of any changes you’ve made to the data, preferably by using R code.
  • Save the edited data as another dataset and store it in the data/ folder.

When wrangling your data with R code make sure to:

  • Document and comment as best you can what you did to your data and why you did it to help you remember.
  • Write the code itself to be as descriptive as you can and to be readable enough to understand what is being done to the data. Keep the code simple: Don’t be clever, be clear. Clear code is easier to understand than some clever code.

In data wrangling, most tasks can be expressed by a few simple “verbs” (actions). Wrangling here is used in the sense of maneuvering, managing, controlling, and turning your data around to clean it up, to better understand it, and to prepare it for later analyses. The table below lists some common “verbs” from the dplyr package that come from the tidyverse:

Table 9.1: List of common data wrangling tasks, along with an example and the function used for the wrangling.
Task Example Function
Select columns Remove data entry columns such as person’s name who entered the data. select()
Rename columns Changing a column name from ‘Q1’ to ‘ParticipantName’. rename()
Transform or modify columns Multiplying a column’s values or taking the log. mutate()
Subset/filter rows Keeping rows with glucose values above 4. filter()
Sort rows Show rows with the smallest value at the top. arrange()
Calculate summaries of the data Calculating the maximum, median, and minimum age. summarise()
Run an analysis by a group Calculate means of age by males and females. group_by() with summarise()

Tip: Sometimes you need to do some complicated wrangling to get your data in appropriate “shape” to use for later analyses. To help save some time, you could save the wrangled data as an “output” dataset in the data/ folder. That way, you can easily use it again later rather than having to run the wrangling code every time you want to work with the data.

9.3 Load the packages and dataset

We’re going to use the US NHANES dataset to demonstrate the wrangling functions. There is an NHANES package that contains a teaching version of the original dataset, so we’ll use that for this lesson. First, make sure the LearningR R Project you created previously is open. Then open the R/package-loading.R script and add the dataset package to the file, so it looks like:

library(tidyverse)
library(NHANES)

Then open the R/wrangling-session.R script to start typing out the next code. We’ll use this file to write the code for this session (but not for the exercises).

## # Load up the packages
## source(here::here("R/package-loading.R"))

# Briefly glimpse contents of dataset
glimpse(NHANES)
#> Rows: 10,000
#> Columns: 76
#> $ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 516…
#> $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10…
#> $ Gender           <fct> male, male, male, male, female, male, male, female, …
#> $ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10,…
#> $ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  0-9,  4…
#> $ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795…
#> $ Race1            <fct> White, White, White, Other, White, White, White, Whi…
#> $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Education        <fct> High School, High School, High School, NA, Some Coll…
#> $ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, NA, NA, …
#> $ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, …
#> $ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 875…
#> $ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00…
#> $ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3…
#> $ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, …
#> $ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, …
#> $ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7…
#> $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Height           <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166…
#> $ BMI              <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.…
#> $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plu…
#> $ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, …
#> $ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111,…
#> $ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, …
#> $ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124,…
#> $ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, …
#> $ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108,…
#> $ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, …
#> $ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114,…
#> $ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, …
#> $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, …
#> $ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, …
#> $ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113,…
#> $ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1…
#> $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
#> $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vg…
#> $ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0…
#> $ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA…
#> $ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, None, N…
#> $ Depressed        <fct> Several, Several, Several, NA, Several, NA, NA, None…
#> $ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, …
#> $ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, …
#> $ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, …
#> $ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes,…
#> $ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2,…
#> $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, …
#> $ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, …
#> $ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, …
#> $ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA,…
#> $ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, …
#> $ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA,…
#> $ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No,…
#> $ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smok…
#> $ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, …
#> $ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Y…
#> $ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, …
#> $ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes…
#> $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, …
#> $ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes,…
#> $ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, …
#> $ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, …
#> $ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA,…
#> $ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1…
#> $ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, …
#> $ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA, Hetero…
#> $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

With the added dataset package and code added to the R/wrangling-session.R, add and commit the changes to the Git history with the RStudio Git interface.

9.4 Exercise: Become familiar with the dataset

Time: 10 min

Take the time to get familiar with the NHANES dataset.

  1. Create a new R script by typing in the RStudio Console usethis::use_r("exercises-wrangling").
  2. Copy the code below and paste it into the new exercise file.
  3. Replace the ___ with the NHANES dataset.
  4. Run each line of code by using Ctrl-Enter.
  5. Once done, add and commit the changes to the Git history with the RStudio Git interface.
# Load the packages
source(here::here("R/package-loading.R"))

# Check column names
colnames(___)

# Look at contents
str(___)
glimpse(___)

# See summary
summary(___)

# Look over the dataset documentation
?___
Click for the (possible) solution.

# Load the packages
source(here::here("R/package-loading.R"))

# Check column names
colnames(NHANES)

# Look at contents
str(NHANES)
glimpse(NHANES)

# See summary
summary(NHANES)

# Look over the dataset documentation
?NHANES

9.5 Select specific columns in a dataset

Selecting columns of a dataset is a very common data wrangling task. The function for this task is appropriately called select(). You would use select() to extract one or more variables in a dataset, to have a closer look at or save as a new data frame to work with. It may be that you wish to explore the clinical characteristics of your study sample, so you may select some basic demographic variables (e.g., Age column) and clinical variables (e.g., Weight and Height columns) to perform these analyses.

For the input arguments, it takes the dataset as the first argument, which is the first input position right after the opening bracket (, and then takes the names of the columns you want to select. The argument after the data argument is ..., which indicates that you can add as many columns as you want, separated by a ,.

# Select one column by its name, without quotes
select(NHANES, Age)
#> # A tibble: 10,000 x 1
#>      Age
#>    <int>
#>  1    34
#>  2    34
#>  3    34
#>  4     4
#>  5    49
#>  6     9
#>  7     8
#>  8    45
#>  9    45
#> 10    45
#> # … with 9,990 more rows

# Select two or more columns by name, without quotes
select(NHANES, Age, Weight, BMI)
#> # A tibble: 10,000 x 3
#>      Age Weight   BMI
#>    <int>  <dbl> <dbl>
#>  1    34   87.4  32.2
#>  2    34   87.4  32.2
#>  3    34   87.4  32.2
#>  4     4   17    15.3
#>  5    49   86.7  30.6
#>  6     9   29.8  16.8
#>  7     8   35.2  20.6
#>  8    45   75.7  27.2
#>  9    45   75.7  27.2
#> 10    45   75.7  27.2
#> # … with 9,990 more rows

# To *exclude* a column, use minus (-)
select(NHANES, -HeadCirc)
#> # A tibble: 10,000 x 75
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> # … with 9,990 more rows, and 66 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, Height <dbl>,
#> #   BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
#> #   BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>,
#> #   BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

If some of your columns have similar patterns for naming at the beginning, middle, or end of the name, you can use some helper functions to choose these columns. Use ?select_helpers (choose the “Select helpers” option in the menu that pops up) to read more about these functions and to get help on them. Some commonly used helpers are:

  • starts_with(): Select columns that begin with a pattern.
  • ends_with(): Select columns that end with a pattern.
  • contains(): Select columns that contain a pattern.
# All columns starting with letters "BP" (blood pressure)
select(NHANES, starts_with("BP"))
#> # A tibble: 10,000 x 8
#>    BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3
#>       <int>    <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1      113       85    114     88    114     88    112     82
#>  2      113       85    114     88    114     88    112     82
#>  3      113       85    114     88    114     88    112     82
#>  4       NA       NA     NA     NA     NA     NA     NA     NA
#>  5      112       75    118     82    108     74    116     76
#>  6       86       47     84     50     84     50     88     44
#>  7      107       37    114     46    108     36    106     38
#>  8      118       64    106     62    118     68    118     60
#>  9      118       64    106     62    118     68    118     60
#> 10      118       64    106     62    118     68    118     60
#> # … with 9,990 more rows

# All columns ending in letters "Day"
select(NHANES, ends_with("Day"))
#> # A tibble: 10,000 x 3
#>    TVHrsDay CompHrsDay AlcoholDay
#>    <fct>    <fct>           <int>
#>  1 <NA>     <NA>               NA
#>  2 <NA>     <NA>               NA
#>  3 <NA>     <NA>               NA
#>  4 <NA>     <NA>               NA
#>  5 <NA>     <NA>                2
#>  6 <NA>     <NA>               NA
#>  7 <NA>     <NA>               NA
#>  8 <NA>     <NA>                3
#>  9 <NA>     <NA>                3
#> 10 <NA>     <NA>                3
#> # … with 9,990 more rows

# All columns containing letters "Age"
select(NHANES, contains("Age"))
#> # A tibble: 10,000 x 9
#>      Age AgeDecade AgeMonths DiabetesAge Age1stBaby SmokeAge AgeFirstMarij
#>    <int> <fct>         <int>       <int>      <int>    <int>         <int>
#>  1    34 " 30-39"        409          NA         NA       18            17
#>  2    34 " 30-39"        409          NA         NA       18            17
#>  3    34 " 30-39"        409          NA         NA       18            17
#>  4     4 " 0-9"           49          NA         NA       NA            NA
#>  5    49 " 40-49"        596          NA         27       38            18
#>  6     9 " 0-9"          115          NA         NA       NA            NA
#>  7     8 " 0-9"          101          NA         NA       NA            NA
#>  8    45 " 40-49"        541          NA         NA       NA            13
#>  9    45 " 40-49"        541          NA         NA       NA            13
#> 10    45 " 40-49"        541          NA         NA       NA            13
#> # … with 9,990 more rows, and 2 more variables: AgeRegMarij <int>, SexAge <int>

For more information on using the pattern functions such as starts_with(), check out ?select_helpers.

You’ll notice that running these functions doesn’t actually change the data itself. When you run a function without assigning it using <-, the only action the function does is to send the output to your screen. But if you want to create a new dataset with only the columns you selected, you’ll need to assign it to a new object.

The full NHANES dataset is 10,000 individuals (rows) with 76 parameters (columns). Since we are only interested in some of these parameters, we will subset the large dataset and save it for later use as a new dataset.

# Save the selected columns as a new data frame
# Recall the style guide for naming objects
nhanes_small <- select(NHANES, Age, Gender, Height, 
                       Weight, BMI, Diabetes, DiabetesAge, 
                       PhysActiveDays, PhysActive, TotChol, 
                       BPSysAve, BPDiaAve, SmokeNow, Poverty)

# View the new data frame
nhanes_small
#> # A tibble: 10,000 x 14
#>      Age Gender Height Weight   BMI Diabetes DiabetesAge PhysActiveDays
#>    <int> <fct>   <dbl>  <dbl> <dbl> <fct>          <int>          <int>
#>  1    34 male     165.   87.4  32.2 No                NA             NA
#>  2    34 male     165.   87.4  32.2 No                NA             NA
#>  3    34 male     165.   87.4  32.2 No                NA             NA
#>  4     4 male     105.   17    15.3 No                NA             NA
#>  5    49 female   168.   86.7  30.6 No                NA             NA
#>  6     9 male     133.   29.8  16.8 No                NA             NA
#>  7     8 male     131.   35.2  20.6 No                NA             NA
#>  8    45 female   167.   75.7  27.2 No                NA              5
#>  9    45 female   167.   75.7  27.2 No                NA              5
#> 10    45 female   167.   75.7  27.2 No                NA              5
#> # … with 9,990 more rows, and 6 more variables: PhysActive <fct>,
#> #   TotChol <dbl>, BPSysAve <int>, BPDiaAve <int>, SmokeNow <fct>,
#> #   Poverty <dbl>

9.6 Renaming all column names based on the style guide

To match the style guide, we should change the column names to be all lower case and with _ for spaces between words. There’s a package that can do that for us called snakecase.

In the Console, type out snakecase:: and hit Tab. You’ll see a list of possible functions to use. We want to use the snake case function, so scroll down and find the to_snake_case(). This is the function we want to use.

To change all the column names to lower case, we’ll use the function rename_with(). This function, like the other dplyr functions, takes the data as the first argument while the second argument needs to be a function, which in our case is to_snake_case() but without using the () at the end. Based on this function, it then renames all columns.

# Rename all columns to snake case
nhanes_small <- rename_with(nhanes_small, snakecase::to_snake_case)

# Have a look at the data frame
nhanes_small
#> # A tibble: 10,000 x 14
#>      age gender height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>   <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male     165.   87.4  32.2 No                 NA               NA
#>  2    34 male     165.   87.4  32.2 No                 NA               NA
#>  3    34 male     165.   87.4  32.2 No                 NA               NA
#>  4     4 male     105.   17    15.3 No                 NA               NA
#>  5    49 female   168.   86.7  30.6 No                 NA               NA
#>  6     9 male     133.   29.8  16.8 No                 NA               NA
#>  7     8 male     131.   35.2  20.6 No                 NA               NA
#>  8    45 female   167.   75.7  27.2 No                 NA                5
#>  9    45 female   167.   75.7  27.2 No                 NA                5
#> 10    45 female   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

9.7 Rename specific columns

Depending on how your data was collected, it may have column names that aren’t very descriptive. So you’ll probably want to rename them to something more explanatory, which is particularly important if you’re sharing your work with others or in an environment where multiple people are working on the same data. As with select(), to rename columns you use the function called rename(). Like select(), rename() takes the dataset as the first argument (first position) and then takes as many renaming arguments as you want (because the second argument position is ...). Renaming takes the form of newname = oldname. Let’s rename gender to be sex, since gender is a social construct, while sex is biological, which is what the data actually is.

rename(nhanes_small, sex = gender)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  6     9 male    133.   29.8  16.8 No                 NA               NA
#>  7     8 male    131.   35.2  20.6 No                 NA               NA
#>  8    45 fema…   167.   75.7  27.2 No                 NA                5
#>  9    45 fema…   167.   75.7  27.2 No                 NA                5
#> 10    45 fema…   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

# View data frame
nhanes_small
#> # A tibble: 10,000 x 14
#>      age gender height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>   <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male     165.   87.4  32.2 No                 NA               NA
#>  2    34 male     165.   87.4  32.2 No                 NA               NA
#>  3    34 male     165.   87.4  32.2 No                 NA               NA
#>  4     4 male     105.   17    15.3 No                 NA               NA
#>  5    49 female   168.   86.7  30.6 No                 NA               NA
#>  6     9 male     133.   29.8  16.8 No                 NA               NA
#>  7     8 male     131.   35.2  20.6 No                 NA               NA
#>  8    45 female   167.   75.7  27.2 No                 NA                5
#>  9    45 female   167.   75.7  27.2 No                 NA                5
#> 10    45 female   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

Notice that the rename() function has not altered the nhanes_small data frame. To do this, we need to assign the change to nhanes_small:

nhanes_small <- rename(nhanes_small, sex = gender)
# View data frame
nhanes_small
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  6     9 male    133.   29.8  16.8 No                 NA               NA
#>  7     8 male    131.   35.2  20.6 No                 NA               NA
#>  8    45 fema…   167.   75.7  27.2 No                 NA                5
#>  9    45 fema…   167.   75.7  27.2 No                 NA                5
#> 10    45 fema…   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

Now the column has been renamed from gender to sex. What if you want to select some columns and then rename some of them, do you have to create a new data object every time? No! We can make use of a very powerful tool called piping with the %>% function.

9.8 Chaining functions with the pipe

Take 5 minutes and read this section before we continue.

A key component of the tidy data and tidy code concept is making use of the %>% operator. You would use the “pipe” operator when you are writing a piece of code with multiple operations/intermediate steps that require you to save/overwrite each step as an object (see below). One advantage of the “pipe” operator is that it will help to ensure that your code is less cluttered with unimportant and temporary object names.

This operator allows you to “pipe” the output from one function to the input of another function, like a plumbing pipe would do for water. This allows you to easily chain functions together into “sentences”. Let’s use an example based on English words for some action. This is the English sentence:

We need some eggs. Drive to the grocery store and buy up some eggs before coming home from work.

There are basically two actions here (“drive” and “buy”) and four “inputs” (“work”, “grocery store”, “eggs”, “home”), that are all based on the previous action. Since an action in R is a function, the functions would be drive() and buy(). In regular R, if we wanted to chain these functions together, we would have to nest them like this:

drive(buy(drive(at_work, "grocery store"), "eggs"), "home")

This is difficult to read. We could also create temporary objects:

at_grocery_store <- drive(at_work, "grocery store")
got_eggs <- buy(at_grocery_store, "eggs")
at_home <- drive(got_eggs, "home")

But this still isn’t too “readable”, and we are having to re-name each intermediate object with reference to the object before it. The pipe %>% operator can really simplify this:

at_work %>% 
    drive("grocery store") %>% 
    buy("eggs") %>% 
    drive("home")

Do you find this more readable and understandable? We read it like how it would actually be done, in order of the steps taken.

Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right. This can help clarify and break down complex data processing workflows, and is the basis for all tidyverse and many other packages. This is a basic design philosophy of interacting with data when using the tidyverse.

The pipe %>% takes the output from the object or function on the left hand side and puts it into the function on the right hand side. All input goes into the first position argument of the function. So within the tidyverse packages, all functions take a data frame (or vector) as the first argument in order to work with the pipe.

Ok, let’s return back together and try this out. The keyboard shortcut for the pipe is Ctrl-Shift-M (M for the magrittr package that created the pipe).

# These two ways are the same
colnames(nhanes_small)
#>  [1] "age"              "sex"              "height"           "weight"          
#>  [5] "bmi"              "diabetes"         "diabetes_age"     "phys_active_days"
#>  [9] "phys_active"      "tot_chol"         "bp_sys_ave"       "bp_dia_ave"      
#> [13] "smoke_now"        "poverty"

nhanes_small %>% 
    colnames()
#>  [1] "age"              "sex"              "height"           "weight"          
#>  [5] "bmi"              "diabetes"         "diabetes_age"     "phys_active_days"
#>  [9] "phys_active"      "tot_chol"         "bp_sys_ave"       "bp_dia_ave"      
#> [13] "smoke_now"        "poverty"

Because the pipe automatically takes nhanes_small and puts it into the first position, we don’t need to type out nhanes_small inside colnames() when piping.

Let’s try the pipe on the select() and rename() function from the previous section. Remember, both select() and rename() take a dataset as the first argument, which makes them pipe-able.

nhanes_small %>% 
    select(phys_active) %>% 
    rename(physically_active = phys_active)
#> # A tibble: 10,000 x 1
#>    physically_active
#>    <fct>            
#>  1 No               
#>  2 No               
#>  3 No               
#>  4 <NA>             
#>  5 No               
#>  6 <NA>             
#>  7 <NA>             
#>  8 Yes              
#>  9 Yes              
#> 10 Yes              
#> # … with 9,990 more rows

We can now “read” these actions as:

Take the nhanes small dataset and then select the phys active column and then rename the phys active column to physically active.

Alright, let’s add and commit to the Git history with the RStudio Git interface before moving on to the exercise.

9.9 Exercise: Practice what we’ve learned

Time: 10 min

In the exercise-wrangling.R file, complete these tasks:

  1. Copy and paste the below code into the exercise file. In the select() function, type in the columns tot_chol, bp_sys_ave, and poverty where the blank space is.

    nhanes_small %>% 
        select(___)
  2. Copy and paste the below code and fill out the blanks. Rename diabetes_age to be diabetes_diagnosis_age. Tip: Recall that renaming is in the form new = old.

    nhanes_small %>% 
        rename(___ = ___, ___ = ____)
  3. Re-write this bit of code to use the pipe:

    select(nhanes_small, bmi, contains("age"))
  4. Read aloud (under your breath or in your head) the below code. How intuitive is it to read? Now re-write this code so you don’t need to create the temporary physical_activity object by using the pipe, then re-read the revised version. Which do you feel is easier to “read”?

    physical_activity <- select(nhanes_small, phys_active_days, phys_active)
    rename(physical_activity, days_phys_active = phys_active_days)
  5. Lastly, add and commit these changes to the Git history with the RStudio Git interface.

Click for the (possible) solution.

# 1. Select specific columns
nhanes_small %>%
    select(tot_chol, bp_sys_ave, poverty)

# 2. Rename columns
nhanes_small %>%
    rename(diabetes_diagnosis_age = diabetes_age)

# 3. Re-write with pipe
nhanes_small %>% 
    select(bmi, contains("age"))

# 4. Re-write with pipe
nhanes_small %>% 
    select(phys_active_days, phys_active) %>% 
    rename(days_phys_active = phys_active_days)

9.10 Filter the data by row

Filtering data by row is a very common activity in data analysis, for example, to get rid of outliers or to subset by a categorical group. As with the previous functions, the function to subset/filter is called filter(). filter() is distinct from select() in the sense that it operates on rows whereas select() operates on columns. The filter() function takes a logic condition (TRUE or FALSE). As with the other functions, the first argument is the dataset and all others are the logical conditions to apply to the row filtering. With filter(), when the logical conditions equal TRUE, it means that those rows will be kept and those that are FALSE will be dropped.

A warning: Since filter() uses logical conditions, you need to be really careful when writing the logic. As you probably know, humans are really really bad at logic. So if your logical condition starts getting even a little complex, double and triple check that you know for certain that your logic code is doing what you think it is doing. It’s very easy to make mistakes at this stage, even for advanced R users.

The simplest kind of logic condition is to test for “equality”. In R, “equal to” is represented by ==. For example, if we want to keep only females in the dataset it would be:

nhanes_small %>%
    filter(sex == "female")
#> # A tibble: 5,020 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  2    45 fema…   167.   75.7  27.2 No                 NA                5
#>  3    45 fema…   167.   75.7  27.2 No                 NA                5
#>  4    45 fema…   167.   75.7  27.2 No                 NA                5
#>  5    10 fema…   142.   38.6  19.2 No                 NA               NA
#>  6    58 fema…   148.   57.5  26.2 No                 NA                2
#>  7     9 fema…   139.   53.1  27.4 No                 NA               NA
#>  8    56 fema…   171.   57.5  19.7 No                 NA                7
#>  9    56 fema…   171.   57.5  19.7 No                 NA                7
#> 10    57 fema…   157.   51    20.7 No                 NA                3
#> # … with 5,010 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

We’d “read” this code as:

Take the nhanes small dataset, and then filter so that only rows where sex is equal to “female” are kept.

So, when a row in the sex column has the value "female", that row is kept. Otherwise, it is dropped.

There are other logic comparisons to use. Table 9.2 can be used as a reference for logical conditions in R.

Table 9.2: Logical operators in R.
Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== equal to
!= not equal to
!x Not x (if x is true or false)
x | y x OR y
x & y x AND y

Let’s try out a few of these logical conditions with filter().

# Participants that don't have sex as female
nhanes_small %>%
    filter(sex != "female")
#> # A tibble: 4,980 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5     9 male    133.   29.8  16.8 No                 NA               NA
#>  6     8 male    131.   35.2  20.6 No                 NA               NA
#>  7    66 male    170.   68    23.7 No                 NA                7
#>  8    58 male    182.   78.4  23.7 No                 NA                5
#>  9    54 male    169.   74.7  26.0 No                 NA                1
#> 10    50 male    178.   84.1  26.6 No                 NA                7
#> # … with 4,970 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

# Participants that have BMI equal to 25
nhanes_small %>%
    filter(bmi == 25)
#> # A tibble: 35 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    24 male    178.   79.3    25 No                 NA                2
#>  2    24 male    178.   79.3    25 No                 NA                2
#>  3    32 male    179    80.1    25 No                 NA               NA
#>  4    32 male    179    80.1    25 No                 NA               NA
#>  5    32 male    179    80.1    25 No                 NA               NA
#>  6    32 male    179    80.1    25 No                 NA                2
#>  7    55 fema…   164.   67.5    25 No                 NA                3
#>  8     6 fema…   117.   34      25 No                 NA                3
#>  9    53 male    162.   65.6    25 No                 NA                2
#> 10    57 fema…   178.   78.8    25 No                 NA                3
#> # … with 25 more rows, and 6 more variables: phys_active <fct>, tot_chol <dbl>,
#> #   bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>, poverty <dbl>

# Participants that have BMI equal to or more than 25
nhanes_small %>%
    filter(bmi >= 25)
#> # A tibble: 5,422 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  5    45 fema…   167.   75.7  27.2 No                 NA                5
#>  6    45 fema…   167.   75.7  27.2 No                 NA                5
#>  7    45 fema…   167.   75.7  27.2 No                 NA                5
#>  8    54 male    169.   74.7  26.0 No                 NA                1
#>  9    58 fema…   148.   57.5  26.2 No                 NA                2
#> 10    50 male    178.   84.1  26.6 No                 NA                7
#> # … with 5,412 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

We use the | (“or”) and & (“and”) when we want to combine conditions across columns. Be especially careful with these operators and whenever combining logic conditions, as they can sometimes work differently than our human brains interpret them (speaking from experience). For &, both sides must be TRUE in order for the combination to be TRUE. For |, only one side needs to be TRUE in order for the combination to be TRUE. To see how they work try these:

TRUE & TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
FALSE & FALSE
#> [1] FALSE
TRUE | TRUE
#> [1] TRUE
TRUE | FALSE
#> [1] TRUE
FALSE | FALSE
#> [1] FALSE

When used in filter() it would be used like:

# when BMI is 25 AND sex is female
nhanes_small %>%
    filter(bmi == 25 & sex == "female")
#> # A tibble: 21 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    55 fema…   164.   67.5    25 No                 NA                3
#>  2     6 fema…   117.   34      25 No                 NA                3
#>  3    57 fema…   178.   78.8    25 No                 NA                3
#>  4    57 fema…   178.   78.8    25 No                 NA               NA
#>  5    57 fema…   178.   78.8    25 No                 NA               NA
#>  6    79 fema…   157.   61.6    25 No                 NA               NA
#>  7    54 fema…   161.   64.8    25 No                 NA                4
#>  8    54 fema…   161.   64.8    25 No                 NA               NA
#>  9    54 fema…   161.   64.8    25 No                 NA               NA
#> 10    54 fema…   161.   64.8    25 No                 NA               NA
#> # … with 11 more rows, and 6 more variables: phys_active <fct>, tot_chol <dbl>,
#> #   bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>, poverty <dbl>

# when BMI is 25 OR sex is female
nhanes_small %>%
    filter(bmi == 25 | sex == "female")
#> # A tibble: 5,034 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  2    45 fema…   167.   75.7  27.2 No                 NA                5
#>  3    45 fema…   167.   75.7  27.2 No                 NA                5
#>  4    45 fema…   167.   75.7  27.2 No                 NA                5
#>  5    10 fema…   142.   38.6  19.2 No                 NA               NA
#>  6    58 fema…   148.   57.5  26.2 No                 NA                2
#>  7     9 fema…   139.   53.1  27.4 No                 NA               NA
#>  8    56 fema…   171.   57.5  19.7 No                 NA                7
#>  9    56 fema…   171.   57.5  19.7 No                 NA                7
#> 10    57 fema…   157.   51    20.7 No                 NA                3
#> # … with 5,024 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

9.11 (Re)Arranging the rows of your data by column

You may want to sort your rows by a specific column so that rows are arranged with bigger (or smaller) values on top. Arranging is done by using arrange(). Again, arrange() takes the dataset as the first argument and anything else it uses as the columns to order by. By default, arrange orders in ascending order.

# ascending order by age
nhanes_small %>%
    arrange(age)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1     0 fema…     NA    4.5    NA <NA>               NA               NA
#>  2     0 fema…     NA    9.5    NA <NA>               NA               NA
#>  3     0 male      NA    6.9    NA <NA>               NA               NA
#>  4     0 male      NA   10.7    NA <NA>               NA               NA
#>  5     0 fema…     NA    9      NA <NA>               NA               NA
#>  6     0 fema…     NA    9      NA <NA>               NA               NA
#>  7     0 fema…     NA    6      NA <NA>               NA               NA
#>  8     0 fema…     NA    6      NA <NA>               NA               NA
#>  9     0 fema…     NA    8.7    NA <NA>               NA               NA
#> 10     0 fema…     NA    4.7    NA <NA>               NA               NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

It also arranges parameters of type character alphabetically:

nhanes_small %>% 
    arrange(sex)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  2    45 fema…   167.   75.7  27.2 No                 NA                5
#>  3    45 fema…   167.   75.7  27.2 No                 NA                5
#>  4    45 fema…   167.   75.7  27.2 No                 NA                5
#>  5    10 fema…   142.   38.6  19.2 No                 NA               NA
#>  6    58 fema…   148.   57.5  26.2 No                 NA                2
#>  7     9 fema…   139.   53.1  27.4 No                 NA               NA
#>  8    56 fema…   171.   57.5  19.7 No                 NA                7
#>  9    56 fema…   171.   57.5  19.7 No                 NA                7
#> 10    57 fema…   157.   51    20.7 No                 NA                3
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

We can do this also in descending order with desc().

# descending order
nhanes_small %>%
    arrange(desc(age))
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    80 fema…   157.   66.9  27.2 No                 NA               NA
#>  2    80 fema…   157.   66.9  27.2 No                 NA               NA
#>  3    80 male    170.   51.4  17.9 No                 NA               NA
#>  4    80 male    168.   95.5  34.0 Yes                48               NA
#>  5    80 male    168    82.8  29.3 Yes                80                1
#>  6    80 fema…   153.   77.2  33.1 No                 NA               NA
#>  7    80 fema…   163.   65.6  24.8 No                 NA               NA
#>  8    80 fema…   163.   65.6  24.8 No                 NA               NA
#>  9    80 male    172.   78    26.5 No                 NA               NA
#> 10    80 fema…   167.   71.5  25.8 No                 NA               NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

You can also arrange by multiple columns. For instance, first arrange by sex and then by age.

# ascending order by sex and Age
nhanes_small %>%
    arrange(sex, age)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1     0 fema…     NA    4.5    NA <NA>               NA               NA
#>  2     0 fema…     NA    9.5    NA <NA>               NA               NA
#>  3     0 fema…     NA    9      NA <NA>               NA               NA
#>  4     0 fema…     NA    9      NA <NA>               NA               NA
#>  5     0 fema…     NA    6      NA <NA>               NA               NA
#>  6     0 fema…     NA    6      NA <NA>               NA               NA
#>  7     0 fema…     NA    8.7    NA <NA>               NA               NA
#>  8     0 fema…     NA    4.7    NA <NA>               NA               NA
#>  9     0 fema…     NA    9.1    NA <NA>               NA               NA
#> 10     0 fema…     NA   10.6    NA <NA>               NA               NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

9.12 Transform or add columns

To “transform” (modify) an existing column or to add a new one, the function to use is called mutate(). Unfortunately, unlike the other functions, the name is not as obvious about what it does. The meaning of mutate though is to change or modify, so it kind of makes sense. You would use mutate() if you wanted to compute a new variable using existing columns in your dataset, such as calculating BMI using height and weight columns. It may be that you want to multiply all values in a certain column by 2, or combine columns into a new variable. Like the other functions, the first input is the data and the other arguments are columns to add or modify.

The form that mutate() uses is similar to normal R assignment: For instance, since the height’s values are in centimeters, maybe we’d rather want them in meters. So, in mutate() we’d type out:

height = height / 100

This form is similar to how math works. The action that happens on the right hand side is put into the variable of the left hand side. With using mutate() itself:

nhanes_small %>%
    mutate(height = height / 100)
#> # A tibble: 10,000 x 14
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    1.65   87.4  32.2 No                 NA               NA
#>  2    34 male    1.65   87.4  32.2 No                 NA               NA
#>  3    34 male    1.65   87.4  32.2 No                 NA               NA
#>  4     4 male    1.05   17    15.3 No                 NA               NA
#>  5    49 fema…   1.68   86.7  30.6 No                 NA               NA
#>  6     9 male    1.33   29.8  16.8 No                 NA               NA
#>  7     8 male    1.31   35.2  20.6 No                 NA               NA
#>  8    45 fema…   1.67   75.7  27.2 No                 NA                5
#>  9    45 fema…   1.67   75.7  27.2 No                 NA                5
#> 10    45 fema…   1.67   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>

Or we can create a new column (maybe log transforming height):

nhanes_small %>% 
    mutate(logged_height = log(height))
#> # A tibble: 10,000 x 15
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  6     9 male    133.   29.8  16.8 No                 NA               NA
#>  7     8 male    131.   35.2  20.6 No                 NA               NA
#>  8    45 fema…   167.   75.7  27.2 No                 NA                5
#>  9    45 fema…   167.   75.7  27.2 No                 NA                5
#> 10    45 fema…   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>, logged_height <dbl>

We can also add multiple modifications or additions with mutate by separating with ,. So if we first wanted to have height as meters and then take the natural logarithm, it would be:

nhanes_small %>% 
    mutate(height = height / 100,
           logged_height = log(height))
#> # A tibble: 10,000 x 15
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    1.65   87.4  32.2 No                 NA               NA
#>  2    34 male    1.65   87.4  32.2 No                 NA               NA
#>  3    34 male    1.65   87.4  32.2 No                 NA               NA
#>  4     4 male    1.05   17    15.3 No                 NA               NA
#>  5    49 fema…   1.68   86.7  30.6 No                 NA               NA
#>  6     9 male    1.33   29.8  16.8 No                 NA               NA
#>  7     8 male    1.31   35.2  20.6 No                 NA               NA
#>  8    45 fema…   1.67   75.7  27.2 No                 NA                5
#>  9    45 fema…   1.67   75.7  27.2 No                 NA                5
#> 10    45 fema…   1.67   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>, logged_height <dbl>

We can also have different values based on a logic conditions using if_else(). Use Table 9.2 to help with creating the logic condition.

nhanes_small %>%
    mutate(highly_active = if_else(phys_active_days >= 5, "yes", "no"))
#> # A tibble: 10,000 x 15
#>      age sex   height weight   bmi diabetes diabetes_age phys_active_days
#>    <int> <fct>  <dbl>  <dbl> <dbl> <fct>           <int>            <int>
#>  1    34 male    165.   87.4  32.2 No                 NA               NA
#>  2    34 male    165.   87.4  32.2 No                 NA               NA
#>  3    34 male    165.   87.4  32.2 No                 NA               NA
#>  4     4 male    105.   17    15.3 No                 NA               NA
#>  5    49 fema…   168.   86.7  30.6 No                 NA               NA
#>  6     9 male    133.   29.8  16.8 No                 NA               NA
#>  7     8 male    131.   35.2  20.6 No                 NA               NA
#>  8    45 fema…   167.   75.7  27.2 No                 NA                5
#>  9    45 fema…   167.   75.7  27.2 No                 NA                5
#> 10    45 fema…   167.   75.7  27.2 No                 NA                5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> #   tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> #   poverty <dbl>, highly_active <chr>

Recall that the original dataset doesn’t change. If we want the added variable to be included we must assign it to something with <-. So putting it all together:

nhanes_update <- nhanes_small %>%
    mutate(height = height / 100,
           logged_height = log(height),
           highly_active = if_else(phys_active_days >= 5, "Yes", "No"))

Before you start the exercise, add and commit all the files changes to the Git history with the RStudio Git interface.

9.13 Exercise: Piping, filtering, and mutating

Time: 20 min

Copy and paste the code below into the script exercises-wrangling.R. Then start replacing the ___ with the appropriate code to complete the tasks below. (Suggestion: Create a new “Section” in the R script for this exercise by using Ctrl-Shift-R).

  1. Filter nhanes_small so only those with a BMI of more than or equal to 20 and less than or equal to 40 and keep those who have diabetes.
  2. Create a new variable called mean_arterial_pressure by applying the formula \(((2 \times DBP) + SBP) / 3\) (DBP = bp_dia_ave and SBP = bp_sys_ave) to calculate Mean Arterial Pressure. Hint: In R, use + to add, use * to multiply, use / to divide.
  3. Create a new variable called young_child when age is less than 6 years.
  4. Finally, once done, add and commit these changes to the Git history with the RStudio Git Interface. Push to GitHub to synchronize with your GitHub repository.
# 1. BMI between 20 and 40 and who have diabetes
nhanes_small %>%
    # format: variable >= number or character
    filter(___ >= ___ & ___ <= ___ & ___ == ___)

# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # dataset
    mutate(
        # 2. Calculate mean arterial pressure
        ___ = ___,
        # 3. Create young_child variable using a condition
        ___ = if_else(___, "Yes", "No")
    )

nhanes_modified
Click for the (possible) solution.

# 1. BMI between 20 and 40 and who have diabetes
nhanes_small %>%
    # format: variable >= number
    filter(bmi >= 20 & bmi <= 40 & diabetes == "Yes")

# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # dataset
    mutate(
        # 2. Calculate mean arterial pressure
        mean_arterial_pressure = ((2 * bp_dia_ave) + bp_sys_ave) / 3,
        # 3. Create Young_child variable using a condition
        young_child = if_else(Age < 6, "Yes", "No")
    )

nhanes_modified

9.14 Split-apply-combine: Summarizing data

Take 5 min to read through this section, before we continue.

Summarizing or applying simple (or complex) statistics to data is (obviously) a key component of any analysis. Simple summaries or statistics can be done either on all the data or on groups of it. There are many data analysis tasks that can be approached using the split-apply-combine method: split the data into groups, apply some analysis to each group, and then combine the results together.

In dplyr, to summarize on all the data, you would use the function summarize(). If you want to do a split-apply-combine (e.g. find the max height of females and males) analysis, you would use the functions group_by() and then summarize(). Using group_by() splits the data up and summarise() then applies an analysis and immediately combines it back together.

The first position argument to group_by() is, as usual, the dataset. The next arguments are the columns that contain the values you want to group by. These columns must contain categorical data (e.g. sex). group_by() tells R to compute the next operations the data within each grouping, rather than on all the data. On its own, group_by() does nothing but instead works with other functions. For example, group_by() works in combination with mutate() too.

As with the other functions, summarise() takes the data as the first position argument. The next arguments work similar to mutate() with one difference: the output must create a single value (e.g. a max or a mean). Like mutate(), you can add multiple “summaries” by adding new columns separated by comma ,. You would use summarize() to derive basic descriptive statistics of a certain variable, including min(), max(), mean(), median(), or sd() (standard deviation).

The group_by() function doesn’t do anything by itself so should always be used in combination with a summarize(), mutate(), arrange(), or other function. However, the summarize() function can be used on its own. Let’s take a look.

9.15 Calculating summary statistics

Let’s calculate the maximum value of BMI.

nhanes_small %>%
    summarize(max_bmi = max(bmi))
#> # A tibble: 1 x 1
#>   max_bmi
#>     <dbl>
#> 1      NA

We get back a result of NA, which means “missing”. In R, NA values “propagate” meaning that if there is one missing, then the max or mean will also be missing. So we need to tell max() to exclude NA from the calculation using na.rm = TRUE.

nhanes_small %>%
    summarize(max_bmi = max(bmi, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   max_bmi
#>     <dbl>
#> 1    81.2

To add another summary column, use ,.

nhanes_small %>%
    summarize(max_bmi = max(bmi, na.rm = TRUE),
              min_bmi = min(bmi, na.rm = TRUE))
#> # A tibble: 1 x 2
#>   max_bmi min_bmi
#>     <dbl>   <dbl>
#> 1    81.2    12.9

Before you start the exercise, add and commit what you’ve changed to the Git history with the RStudio Git interface.

9.16 Exercise: Use summarize() to calculate basic statistics

Time: 10 min

Practice using summarize() by calculating various summary statistics. Complete the following tasks:

  1. Calculate the mean() of weight and age.
  2. Calculate the max() and min() of height.
  3. Calculate the median() of age and phys_active_days.
  4. Lastly, add and commit any changes made to the Git history with the RStudio Git interface.

Don’t forget to use na.rm = TRUE in the basic statistic functions.

# 1.
nhanes_small %>%
    summarize(mean_weight = ___,
              mean_age = ___)

# 2.
nhanes_small %>%
    summarize(max_height = ___,
              min_height = ___)

# 3.
nhanes_small %>%
    summarize(___ = ___,
              ___ = ___)
Click for the (possible) solution.

# 1.
nhanes_small %>%
    summarize(mean_weight = mean(weight, na.rm = TRUE),
              mean_age = mean(age, na.rm = TRUE))

# 2.
nhanes_small %>%
    summarize(max_height = max(height, na.rm = TRUE),
              min_height = min(height, na.rm = TRUE))

# 3.
nhanes_small %>%
    summarize(median_age = median(height, na.rm = TRUE),
              median_phys_active_days = median(phys_active_days, na.rm = TRUE))             

9.17 Summary statistics by a group

While the summarize() function is partly useful, it really shines when combined with group_by(). Let’s find out the mean age and BMI between those with and without Diabetes.

nhanes_small %>%
    group_by(diabetes) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 3 x 3
#>   diabetes mean_age mean_bmi
#>   <fct>       <dbl>    <dbl>
#> 1 No          35.4      26.2
#> 2 Yes         59.2      32.6
#> 3 <NA>         1.20     29.6

Quick note: If you are using dplyr >= 1.0.0, you’ll get a message informing you that it is regrouping output. This is simply a message and can be ignored. If you don’t want the message displayed, write options(dplyr.summarise.inform = FALSE) at the top of your script and run it then.

We get a warning about there being missing values in diabetes, so let’s first remove rows that have missing diabetes status.

nhanes_small %>%
    # Recall ! means "NOT", so !is.na means "is not missing"
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 2 x 3
#>   diabetes mean_age mean_bmi
#>   <fct>       <dbl>    <dbl>
#> 1 No           35.4     26.2
#> 2 Yes          59.2     32.6

Cool! But we can add more columns to the grouping. Let’s compare mean age and BMI by sex and diabetes status.

nhanes_small %>%
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, sex) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 4 x 4
#> # Groups:   diabetes [2]
#>   diabetes sex    mean_age mean_bmi
#>   <fct>    <fct>     <dbl>    <dbl>
#> 1 No       female     36.5     26.2
#> 2 No       male       34.3     26.1
#> 3 Yes      female     59.9     33.7
#> 4 Yes      male       58.6     31.5

Since we don’t need the dataset grouped anymore, it’s good practice to end the grouping with ungroup().

nhanes_small %>%
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, sex) %>% 
    summarise(mean_age = mean(age, na.rm = TRUE),
              mean_bmi = mean(bmi, na.rm = TRUE)) %>% 
    ungroup()
#> # A tibble: 4 x 4
#>   diabetes sex    mean_age mean_bmi
#>   <fct>    <fct>     <dbl>    <dbl>
#> 1 No       female     36.5     26.2
#> 2 No       male       34.3     26.1
#> 3 Yes      female     59.9     33.7
#> 4 Yes      male       58.6     31.5

Before you start the exercise, add and commit the changes to the Git history.

9.18 Exercise: Use group_by() and summarize() to answer some questions

Time: 5-7 min

Using the group_by() with summarize(), answer these questions:

  1. What is the mean, max, and min differences in age between females and males with or without diabetes?
  2. What is the mean, max, and min differences in height and weight between females and males with or without diabetes?
  3. Once done, add and commit the changes to the file to the Git history.
# 1. 
nhanes_small %>% 
    filter(!is.na(diabetes)) %>% 
    ___(___, ___) %>% 
    ___(
        ___,
        ___,
        ___
    )

# 2. 
nhanes_small %>% 
    filter(!is.na(diabetes)) %>% 
    ___(___, ___) %>% 
    ___(
        ___,
        ___,
        ___,
        ___,
        ___,
        ___
    )
Click for the (possible) solution.

# 1. 
nhanes_small %>% 
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, sex) %>% 
    summarize(
        mean_age = mean(age, na.rm = TRUE),
        max_age = max(age, na.rm = TRUE),
        min_age = min(age, na.rm = TRUE),
    )

# 2. 
nhanes_small %>% 
    filter(!is.na(diabetes)) %>% 
    group_by(diabetes, sex) %>% 
    summarize(
        mean_height = mean(height, na.rm = TRUE),
        max_height = max(height, na.rm = TRUE),
        min_height = min(height, na.rm = TRUE),
        mean_weight = mean(weight, na.rm = TRUE),
        max_weight = max(weight, na.rm = TRUE),
        min_weight = min(weight, na.rm = TRUE)
    )

9.19 Saving datasets as files

The nhanes_small data frame you created is only available after you’ve created it from NHAHES, but if you want to access it later, you can save this as an .rda file into your data/ folder using the function usethis::use_data(). We use overwrite = TRUE so that we can save the dataset again even if the file already exists. The usethis::use_data() function outputs some information, the last of which (“Document your data”) we won’t cover in this course. The function takes any number of datasets and saves each individually as a .rda R dataset file in the data/ folder.

# Saving data as an .rda file in the data folder
usethis::use_data(nhanes_small, overwrite = TRUE)

For many projects, it isn’t necessary or advisable to save every single data object you create. It’s better to let the code create the data you’ll use rather than save each new wrangled dataset you might create. But sometimes you’ll need or want to save the dataset you’ve been working on, maybe because you’ve done a lot of cleaning to it, preparing it for later analyses, or because you’ve run an analysis and want to save the results. In that case, you should definitely save the new cleaned dataset.

9.20 Loading in a dataset

Take ~5 min to read through this section.

We’ve so far been using a teaching dataset that we load from a package, mainly so we can focus on getting familiar with data wrangling. However, there will come a time when you want to wrangle your own data. There are several ways to load in a dataset, the most common being:

  1. Using the RStudio menu File -> Import Dataset -> From Text/Excel/SPSS/SAS/Stata (depending on your file type you want to import). Copy and paste the R code given for importing into the R script you are working in.

  2. If the file is a .csv file, use readr::read_csv() to import the dataset:

    dataset_name <- readr::read_csv(here::here("data/dataset_name.csv"))
  3. If the dataset is a .rda file, use load():

    load(here::here("data/dataset_name.rda"))    

    This loads in the dataset into your R session so you can use it again. If you want to see that it works, restart the R session with either Ctrl-Shift-F10 or with the menu item Session -> Restart R. Then, type out and run this:

    load(here::here("data/nhanes_small.rda"))

    You should see this dataset in the Environment tab. This is how you save and load data.

For SAS, SPSS, or Stata files, you can use the package haven to import those types of data files into R.

9.21 Exercise: Practicing the dplyr functions

Time: 30 minutes

Practice using dplyr by starting from the original NHANES dataset and wrangling the data into a summary output. Don’t create any intermediate objects by only using the pipe to link each task with the next one.

  1. Rename all the columns so they are snake case.
  2. Select columns gender, age, and BMI.
  3. Exclude NA values from all the selected columns.
  4. Rename gender to sex.
  5. Create a new column named age_class, where anyone under 50 years old is labeled "under 50" and those 50 years old and older are labeled "over 50".
  6. Group the data according to sex and age_class.
  7. Calculate the mean and median BMI according to the grouping to determine the difference in BMI between age classes and sex.
  8. Once done, add and commit this new code to the Git history with the RStudio Git interface.
Click for the (possible) solution.

NHANES %>% 
    rename_with(snakecase::to_snake_case) %>% 
    select(gender, age,  bmi) %>% 
    filter(!is.na(gender) & !is.na(age) & !is.na(bmi)) %>% 
    rename(sex = gender) %>% 
    mutate(age_class = if_else(age < 50, "under 50", "over 50")) %>%
    group_by(age_class, sex) %>% 
    summarise(bmi_mean = mean(bmi, na.rm = TRUE), 
              bmi_median = median(bmi, na.rm = TRUE))

9.22 Summary of session

  • In tidy data, each variable has its now column, each observation has its own row, and each value has its own cell.
  • Use the R package tidyverse to load in multiple packages to tidy up data.
  • Never edit raw data - instead, use R code to make changes and clean up the raw data rather than manually editing the dataset.
  • Use the functions select(), rename(), filter(), mutate() (“change or modify”), arrange() and summarise() from the dplyr package to work with and wrangle your data.
  • Use the pipe (%>%) to write easy-to-read code, similar to reading a text consisting of multiple sentences.