If you find any typos, errors, or places where the text could be improved, please let us know by providing feedback either in the feedback survey (given during class), by using GitLab, or directly in this document with hypothes.is annotations.
- Open an issue or submit a merge request on GitLab with the feedback or suggestions.
- Add an annotation using hypothes.is. To add an annotation, select some text and then click the annotate button on the pop-up menu. To see the annotations of others, open the hypothes.is sidebar in the upper right-hand corner of the page.
9 Data management and wrangling
When in RStudio, quickly jump to this page using r3::open_data_wrangling().
Session objectives:
- Learn the difference between “messy” and “tidy” data, including how to create tidy data to simplify your analysis.
- Perform simple transformations and subsetting of datasets, such as:
  - Subset specific columns and rows of a dataset, with filter() and select().
  - Sort rows of a dataset by a specific column, with arrange().
  - Create new or transform existing columns in a dataset, with mutate().
  - Calculate simple data summaries, with summarise().
- Learn about and apply the “split-apply-combine” method of analyses, with group_by() and summarise().
- Write “tidier” and more readable code by using the pipe (%>%) operator.
9.1 “Messy” vs. “tidy” data
Take 10 min to read through this “Messy” vs “tidy” data section and the Managing and working with data in R section.
The concept of “tidy” data was popularized in an article by Hadley Wickham and is described in more detail in the Tidy Data chapter of the R for Data Science online book. Before we continue with tidy data, we need to cover something related to the concept of “tidy” that has already come up in this course: the tidyverse. The tidyverse is an ecosystem of R packages designed to work well together, all following a strong “design philosophy” and a common style guide. This makes combining the packages in the tidyverse much easier, and it is why we teach it.
Ok, back to “tidy data”. A tidy dataset is when:
- Each variable has its own column (e.g. “Body Weight”).
- Each observation has its own row (e.g. “Person”).
- Each value has its own cell (e.g. “Body weight for a person at a specific date”).
Take a look at the example “tidy” and “messy” data frames (also called “tibbles” in the tidyverse) below. Think about why each may be considered “tidy” or “messy”. What do you notice between the tidy versions and the messier versions?
# Datasets come from tidyr
# Tidy:
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
# Partly tidy:
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
# Messier:
table3
#> # A tibble: 6 x 3
#> country year rate
#> * <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
# Messy:
table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
# Messy:
table4b
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
The “most” tidy version is table1, as each column describes its values (e.g. population is population size), each row is unique (e.g. the first row holds the values for Afghanistan in 1999), and each cell is an explicit value representative of its column and row.
table2 is a “long” version of table1, so it is partly “tidy”, but it doesn’t satisfy the rule that each variable has its own column, since count represents both cases and population size. On the other hand, table3 is messy because the rate column values are a composite of two other column values (cases and population), when each cell should hold a single number (e.g. a percent). Both table4a and table4b have columns with ambiguous values inside. For example, you can’t tell from the data what the values in the 1999 column contain.
Tidy data has a few notable benefits:
- Time spent preparing your data to be tidy from the beginning can save days of added work and frustration in the long run.
- “Tidy data” is a conceptual framework that allows you to easily build off and wrangle (i.e. “manipulate”, “clean up”, “manage”) data in simpler and easy-to-interpret ways, especially when used within the framework of the tidyverse.
The concept of tidy data also gives rise to “tidy code” for wrangling. By using “verbs” (R functions) and chaining them together in “sentences” (in a sequential pipeline), you can construct meaningful and readable code that describes in plainer English what you are doing to the data. This is one simple way that you can enhance the reproducibility of your code.
9.2 Managing and working with data in R
Take 5 min and read through this section.
When working with data, there are a few principles to follow:
- You should always save your original raw dataset in the data-raw/ folder.
  - Note: Whether or not you save data to data-raw/ depends on how you collected the data and how many collaborators are on your team. You may end up storing and processing the data in another folder as a project of its own.
- Never edit your raw data directly; keep any edited version in a separate location.
- Only work with your raw data using R code. Don’t manually edit it. Manual editing doesn’t leave a history of what you’ve done to it, so you can’t go back and see what you’ve done. Always keep a history of any changes you’ve made to the data, preferably by using R code.
- Save the edited data as another dataset and store it in the data/ folder.
When wrangling your data with R code, make sure to:
- Document what you did to your data and why you did it, to help you remember later on (by using hashes to indicate comments and non-code text).
- Write the code in a way that is as descriptive as possible and is readable enough to understand what is being done to the data.
- Keep the code simple: Don’t be clever, be clear. Clear code is easier to understand than clever and sometimes overly-complex code.
In data wrangling, most tasks can be expressed by a few simple “verbs” (actions). Wrangling here refers to maneuvering, managing, controlling, and turning your data around to clean it up, better understand it, and prepare it for later analyses. The table below lists some common “verbs” from the dplyr package that come from the tidyverse:
Task | Example | Function
---|---|---
Select columns | Remove data entry columns such as the person’s name who entered the data. | select()
Rename columns | Changing a column name from ‘Q1’ to ‘participant_name’. | rename()
Transform or modify columns | Multiplying or taking the log of a column’s values. | mutate()
Subset/filter rows | Keeping rows with glucose values above 4. | filter()
Sort rows | Show rows with the smallest value at the top. | arrange()
Calculate summaries of the data | Calculating the maximum, median, and minimum age. | summarise()
Run an analysis by a group | Calculate means of age by males and females. | group_by() with summarise()
Tip: Sometimes you need to do some complicated wrangling to get your data
“in shape” for later analyses.
To save some time,
you could save the wrangled data as an “output” dataset in the data/
folder.
That way, you can easily use it again later rather
than having to run the wrangling code every time you want to work with the data.
9.3 Load the packages and dataset
We’re going to use the US NHANES dataset to demonstrate some wrangling functions. There is an NHANES package that contains a teaching version of the original dataset, so we’ll use that for this lesson.
First, make sure that you have the LearningR
R Project open.
Second, open the R/package-loading.R
script and load the NHANES package,
so it looks like:
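Once loaded, the script would contain something along these lines (a minimal sketch; the exact set of packages depends on what you loaded in earlier sessions):

# R/package-loading.R
library(tidyverse)
library(NHANES)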
Third, open the R/wrangling-session.R
script to begin entering code.
We’ll use this script file to write the code for this session (but not for the exercises).
# Load up the packages
source(here::here("R/package-loading.R"))
# Briefly glimpse contents of dataset
glimpse(NHANES)
#> Rows: 10,000
#> Columns: 76
#> $ ID <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 516…
#> $ SurveyYr <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10…
#> $ Gender <fct> male, male, male, male, female, male, male, female, …
#> $ Age <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10,…
#> $ AgeDecade <fct> 30-39, 30-39, 30-39, 0-9, 40-49, 0-9, 0-9, 4…
#> $ AgeMonths <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795…
#> $ Race1 <fct> White, White, White, Other, White, White, White, Whi…
#> $ Race3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Education <fct> High School, High School, High School, NA, Some Coll…
#> $ MaritalStatus <fct> Married, Married, Married, NA, LivePartner, NA, NA, …
#> $ HHIncome <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, …
#> $ HHIncomeMid <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 875…
#> $ Poverty <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00…
#> $ HomeRooms <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3…
#> $ HomeOwn <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, …
#> $ Work <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, …
#> $ Weight <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7…
#> $ Length <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ HeadCirc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Height <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166…
#> $ BMI <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.…
#> $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BMI_WHO <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plu…
#> $ Pulse <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, …
#> $ BPSysAve <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111,…
#> $ BPDiaAve <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, …
#> $ BPSys1 <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124,…
#> $ BPDia1 <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, …
#> $ BPSys2 <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108,…
#> $ BPDia2 <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, …
#> $ BPSys3 <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114,…
#> $ BPDia3 <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, …
#> $ Testosterone <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DirectChol <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, …
#> $ TotChol <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, …
#> $ UrineVol1 <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113,…
#> $ UrineFlow1 <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1…
#> $ UrineVol2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ UrineFlow2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Diabetes <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
#> $ DiabetesAge <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ HealthGen <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vg…
#> $ DaysPhysHlthBad <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0…
#> $ DaysMentHlthBad <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA…
#> $ LittleInterest <fct> Most, Most, Most, NA, Several, NA, NA, None, None, N…
#> $ Depressed <fct> Several, Several, Several, NA, Several, NA, NA, None…
#> $ nPregnancies <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, …
#> $ nBabies <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ Age1stBaby <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SleepHrsNight <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, …
#> $ SleepTrouble <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, …
#> $ PhysActive <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes,…
#> $ PhysActiveDays <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2,…
#> $ TVHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ CompHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TVHrsDayChild <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, …
#> $ CompHrsDayChild <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, …
#> $ Alcohol12PlusYr <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, …
#> $ AlcoholDay <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA,…
#> $ AlcoholYear <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, …
#> $ SmokeNow <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA,…
#> $ Smoke100 <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No,…
#> $ Smoke100n <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smok…
#> $ SmokeAge <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, …
#> $ Marijuana <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Y…
#> $ AgeFirstMarij <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, …
#> $ RegularMarij <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes…
#> $ AgeRegMarij <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, …
#> $ HardDrugs <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes,…
#> $ SexEver <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, …
#> $ SexAge <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, …
#> $ SexNumPartnLife <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA,…
#> $ SexNumPartYear <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1…
#> $ SameSex <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, …
#> $ SexOrientation <fct> Heterosexual, Heterosexual, Heterosexual, NA, Hetero…
#> $ PregnantNow <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Finally, add and commit the changes to the Git history with the RStudio Git interface.
9.4 Exercise: Getting familiar with the dataset
Time: 10 min
Take some time to get to know the NHANES dataset.
- Create a new R script by typing usethis::use_r("exercises-wrangling") in the RStudio Console.
- Copy the code below and paste it into the new script file.
- Replace the ___ with NHANES.
- Run each line of code by using the Ctrl-Enter shortcut.
- Take a look at the output for each command, to see what sorts of data you are able to retrieve.
- Once you have completed each, add and commit the changes to the Git history with the RStudio Git interface.
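The exercise template isn’t reproduced here, but a sketch of typical exploration commands (the specific functions are assumptions, not necessarily the original template) would be:

# Check the column names
colnames(___)

# Look at the structure and contents
str(___)
glimpse(___)

# Get a quick statistical summary of each column
summary(___)

# Open the dataset's help documentation
?___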
9.5 Selecting specific columns in a dataset
Selecting columns of a dataset is a very common data wrangling task. The dplyr function for this task is appropriately called select(). You would use select() to extract one or more variables in a dataset that you want to have a closer look at or to save as a new data frame to work with. It may be that you wish to explore the clinical characteristics of your study sample, so you may select some basic demographic variables (e.g., the Age column) and clinical variables (e.g., the Weight and Height columns) to perform these analyses.
For the input arguments, select() takes the dataset as the first argument, which is the first input position right after the opening parenthesis (, and then takes the names of the columns you want to select. The argument after the data argument is ..., which indicates that you can add as many columns as you want, separated by a ,.
# Select one column by its name, without quotes
select(NHANES, Age)
#> # A tibble: 10,000 x 1
#> Age
#> <int>
#> 1 34
#> 2 34
#> 3 34
#> 4 4
#> 5 49
#> 6 9
#> 7 8
#> 8 45
#> 9 45
#> 10 45
#> # … with 9,990 more rows
# Select two or more columns by name, without quotes
select(NHANES, Age, Weight, BMI)
#> # A tibble: 10,000 x 3
#> Age Weight BMI
#> <int> <dbl> <dbl>
#> 1 34 87.4 32.2
#> 2 34 87.4 32.2
#> 3 34 87.4 32.2
#> 4 4 17 15.3
#> 5 49 86.7 30.6
#> 6 9 29.8 16.8
#> 7 8 35.2 20.6
#> 8 45 75.7 27.2
#> 9 45 75.7 27.2
#> 10 45 75.7 27.2
#> # … with 9,990 more rows
# To *exclude* a column, use minus (-)
select(NHANES, -HeadCirc)
#> # A tibble: 10,000 x 75
#> ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education
#> <int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct>
#> 1 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
#> 2 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
#> 3 51624 2009_10 male 34 " 30-39" 409 White <NA> High Sch…
#> 4 51625 2009_10 male 4 " 0-9" 49 Other <NA> <NA>
#> 5 51630 2009_10 female 49 " 40-49" 596 White <NA> Some Col…
#> 6 51638 2009_10 male 9 " 0-9" 115 White <NA> <NA>
#> 7 51646 2009_10 male 8 " 0-9" 101 White <NA> <NA>
#> 8 51647 2009_10 female 45 " 40-49" 541 White <NA> College …
#> 9 51647 2009_10 female 45 " 40-49" 541 White <NA> College …
#> 10 51647 2009_10 female 45 " 40-49" 541 White <NA> College …
#> # … with 9,990 more rows, and 66 more variables: MaritalStatus <fct>,
#> # HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> # HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, Height <dbl>,
#> # BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
#> # BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>,
#> # BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> # DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> # UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> # HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> # LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> # Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> # PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> # TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> # AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> # Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> # RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> # SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> # SexOrientation <fct>, PregnantNow <fct>
If some of your columns have similar patterns for naming at the beginning,
middle, or end of the name, you can use some helper functions to choose these columns.
Use ?select_helpers
(choose the “Select helpers” option in the menu that pops up)
to read more about these functions and to get help on them.
Some commonly used helpers are:
- starts_with(): Select columns that begin with a pattern.
- ends_with(): Select columns that end with a pattern.
- contains(): Select columns that contain a pattern.
# All columns starting with letters "BP" (blood pressure)
select(NHANES, starts_with("BP"))
#> # A tibble: 10,000 x 8
#> BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3
#> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 113 85 114 88 114 88 112 82
#> 2 113 85 114 88 114 88 112 82
#> 3 113 85 114 88 114 88 112 82
#> 4 NA NA NA NA NA NA NA NA
#> 5 112 75 118 82 108 74 116 76
#> 6 86 47 84 50 84 50 88 44
#> 7 107 37 114 46 108 36 106 38
#> 8 118 64 106 62 118 68 118 60
#> 9 118 64 106 62 118 68 118 60
#> 10 118 64 106 62 118 68 118 60
#> # … with 9,990 more rows
# All columns ending in letters "Day"
select(NHANES, ends_with("Day"))
#> # A tibble: 10,000 x 3
#> TVHrsDay CompHrsDay AlcoholDay
#> <fct> <fct> <int>
#> 1 <NA> <NA> NA
#> 2 <NA> <NA> NA
#> 3 <NA> <NA> NA
#> 4 <NA> <NA> NA
#> 5 <NA> <NA> 2
#> 6 <NA> <NA> NA
#> 7 <NA> <NA> NA
#> 8 <NA> <NA> 3
#> 9 <NA> <NA> 3
#> 10 <NA> <NA> 3
#> # … with 9,990 more rows
# All columns containing letters "Age"
select(NHANES, contains("Age"))
#> # A tibble: 10,000 x 9
#> Age AgeDecade AgeMonths DiabetesAge Age1stBaby SmokeAge AgeFirstMarij
#> <int> <fct> <int> <int> <int> <int> <int>
#> 1 34 " 30-39" 409 NA NA 18 17
#> 2 34 " 30-39" 409 NA NA 18 17
#> 3 34 " 30-39" 409 NA NA 18 17
#> 4 4 " 0-9" 49 NA NA NA NA
#> 5 49 " 40-49" 596 NA 27 38 18
#> 6 9 " 0-9" 115 NA NA NA NA
#> 7 8 " 0-9" 101 NA NA NA NA
#> 8 45 " 40-49" 541 NA NA NA 13
#> 9 45 " 40-49" 541 NA NA NA 13
#> 10 45 " 40-49" 541 NA NA NA 13
#> # … with 9,990 more rows, and 2 more variables: AgeRegMarij <int>, SexAge <int>
For more information on using the pattern functions such as starts_with(), check out ?select_helpers.
You’ll notice that running these functions doesn’t actually change the data itself. When you run a function without assigning it using <-, the only action the function does is to send the output to your screen, and you won’t have saved that data anywhere for later use. But if you want to create a new dataset with only the columns you selected, you’ll need to assign the selected dataset to a new object.
The full NHANES dataset is 10,000 individuals (rows) with 76 parameters (columns). Since we are only interested in some of these parameters, we will subset the large dataset and save it for later use as a new dataset.
# Save the selected columns as a new data frame
# Recall the style guide for naming objects
nhanes_small <- select(NHANES, Age, Gender, Height,
Weight, BMI, Diabetes, DiabetesAge,
PhysActiveDays, PhysActive, TotChol,
BPSysAve, BPDiaAve, SmokeNow, Poverty)
# View the new data frame
nhanes_small
#> # A tibble: 10,000 x 14
#> Age Gender Height Weight BMI Diabetes DiabetesAge PhysActiveDays
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 4 male 105. 17 15.3 No NA NA
#> 5 49 female 168. 86.7 30.6 No NA NA
#> 6 9 male 133. 29.8 16.8 No NA NA
#> 7 8 male 131. 35.2 20.6 No NA NA
#> 8 45 female 167. 75.7 27.2 No NA 5
#> 9 45 female 167. 75.7 27.2 No NA 5
#> 10 45 female 167. 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 6 more variables: PhysActive <fct>,
#> # TotChol <dbl>, BPSysAve <int>, BPDiaAve <int>, SmokeNow <fct>,
#> # Poverty <dbl>
9.6 Renaming all column names based on the style guide
In the interests of keeping data tidy and matching the style guide, we should change the column names to be all lower case with _ for spaces between words. There’s a package that can do that for us called snakecase.
In the Console, type out snakecase:: and hit Tab. You’ll see a list of possible functions to use. We want the snake case function, so scroll down and find to_snake_case().
To change all the column names to lower case, we’ll use the function rename_with(). This function takes the data as the first argument, but the second argument needs to be a function, which in our case is to_snake_case() without the () at the end. This function will rename all columns.
# Rename all columns to snake case
nhanes_small <- rename_with(nhanes_small, snakecase::to_snake_case)
# Have a look at the data frame
nhanes_small
#> # A tibble: 10,000 x 14
#> age gender height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 4 male 105. 17 15.3 No NA NA
#> 5 49 female 168. 86.7 30.6 No NA NA
#> 6 9 male 133. 29.8 16.8 No NA NA
#> 7 8 male 131. 35.2 20.6 No NA NA
#> 8 45 female 167. 75.7 27.2 No NA 5
#> 9 45 female 167. 75.7 27.2 No NA 5
#> 10 45 female 167. 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
9.7 Renaming specific columns
Depending on how your data was collected,
it may have column names that aren’t very descriptive.
So you’ll probably want to rename them to something more explanatory, which is
particularly important if you’re sharing your work with others or in an
environment where multiple people are working on the same data.
We will use the function called rename(). Like select(), rename() takes the dataset as the first argument and then takes as many renaming arguments as you want (because the second argument position is ...). When renaming, it takes the form of newname = oldname.
Let’s rename gender to be sex, since gender is a social construct, while sex is biological. This is consistent with what the sex variable in the dataset actually describes too.
rename(nhanes_small, sex = gender)
#> # A tibble: 10,000 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 4 male 105. 17 15.3 No NA NA
#> 5 49 fema… 168. 86.7 30.6 No NA NA
#> 6 9 male 133. 29.8 16.8 No NA NA
#> 7 8 male 131. 35.2 20.6 No NA NA
#> 8 45 fema… 167. 75.7 27.2 No NA 5
#> 9 45 fema… 167. 75.7 27.2 No NA 5
#> 10 45 fema… 167. 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
# View data frame
nhanes_small
#> # A tibble: 10,000 x 14
#> age gender height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 4 male 105. 17 15.3 No NA NA
#> 5 49 female 168. 86.7 30.6 No NA NA
#> 6 9 male 133. 29.8 16.8 No NA NA
#> 7 8 male 131. 35.2 20.6 No NA NA
#> 8 45 female 167. 75.7 27.2 No NA 5
#> 9 45 female 167. 75.7 27.2 No NA 5
#> 10 45 female 167. 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
Notice that the rename()
function has not altered the nhanes_small
data frame. To do this, we need to assign the change to nhanes_small
in a way that saves it:
nhanes_small <- rename(nhanes_small, sex = gender)
# View data frame
nhanes_small
#> # A tibble: 10,000 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 4 male 105. 17 15.3 No NA NA
#> 5 49 fema… 168. 86.7 30.6 No NA NA
#> 6 9 male 133. 29.8 16.8 No NA NA
#> 7 8 male 131. 35.2 20.6 No NA NA
#> 8 45 fema… 167. 75.7 27.2 No NA 5
#> 9 45 fema… 167. 75.7 27.2 No NA 5
#> 10 45 fema… 167. 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
Now, you can see that the column has been renamed from gender to sex.
What if you wanted to select some columns and then rename some of them? Would you have to create a new data object every time? No! We can make use of a very powerful tool called piping, with the %>% operator.
9.8 Chaining functions with the pipe
Take 5 minutes and read this section before we continue.
A key component of tidy data and tidy code is making use of the “pipe” operator, %>%. You would use the “pipe” operator when you are writing a piece of code with multiple operations or intermediate steps that require you to save and overwrite each step as an object (see below). One advantage of the “pipe” operator is that it will help to ensure that your code is less cluttered with redundant and temporary object names.
This operator allows you to “pipe” the output from one function to the input of another function, just like a plumbing pipe would do for water. This allows you to easily chain functions together into “sentences”. Let’s use an example based on English words for some action. This is the English sentence:
We need some eggs. Drive to the grocery store and buy some eggs before arriving home from work.
There are basically two actions here (“drive” and “buy”) with four inputs (“work”, “grocery store”, “eggs”, “home”), each step building on the result of the previous one. Since an action in R is a function, the functions would be drive() and buy().
In regular R (without the “pipe” operator), we would have to nest functions (reading from the inside to the outside) together to chain them:
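A sketch of what that nested code would look like, using the hypothetical drive() and buy() functions and the at_work starting object from the sentence above:

# Nested version (pseudocode): must be read from the inside out
drive(buy(drive(at_work, "grocery store"), "eggs"), "home")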
This is difficult to read. Another way to chain functions would be to create temporary objects for each step:
at_grocery_store <- drive(at_work, "grocery store")
got_eggs <- buy(at_grocery_store, "eggs")
at_home <- drive(got_eggs, "home")
This still isn’t very “readable”, as we are having to re-name each
intermediate object with reference to the object before it. The pipe %>%
operator can really simplify this:
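A sketch of the piped version, with the same hypothetical functions:

# Piped version (pseudocode): reads from top to bottom, left to right
at_work %>%
    drive("grocery store") %>%
    buy("eggs") %>%
    drive("home")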
Do you find this more readable and understandable? It reads how it would be done, in order of the steps taken.
The idea of piping is to read the functions from left to right. This can help clarify and break down complex data processing workflows, and is the basis for the tidyverse and many other packages.
The operator %>%
takes the output from the object or function from the left of the operator
and puts it into the function on the right of the operator.
All input goes into the first position argument of the function.
Within the tidyverse packages, all functions take a data frame (or vector) as the first argument in order
to work with the pipe.
Let’s try this out with an example from NHANES. The
keyboard shortcut for the pipe is Ctrl-Shift-M
(i.e., M for the magrittr package
that created the pipe).
# These two ways are the same
colnames(nhanes_small)
#> [1] "age" "sex" "height" "weight"
#> [5] "bmi" "diabetes" "diabetes_age" "phys_active_days"
#> [9] "phys_active" "tot_chol" "bp_sys_ave" "bp_dia_ave"
#> [13] "smoke_now" "poverty"
nhanes_small %>%
colnames()
#> [1] "age" "sex" "height" "weight"
#> [5] "bmi" "diabetes" "diabetes_age" "phys_active_days"
#> [9] "phys_active" "tot_chol" "bp_sys_ave" "bp_dia_ave"
#> [13] "smoke_now" "poverty"
The pipe automatically takes nhanes_small
and puts it into the first position,
so we don’t need to specify nhanes_small
inside colnames()
when piping.
Let’s try using the pipe with the select()
and rename()
functions from the previous section.
Remember, both select()
and rename()
take a dataset as the first argument,
which makes them pipe-able.
nhanes_small %>%
select(phys_active) %>%
rename(physically_active = phys_active)
#> # A tibble: 10,000 x 1
#> physically_active
#> <fct>
#> 1 No
#> 2 No
#> 3 No
#> 4 <NA>
#> 5 No
#> 6 <NA>
#> 7 <NA>
#> 8 Yes
#> 9 Yes
#> 10 Yes
#> # … with 9,990 more rows
We can now “read” these actions as:
Take the nhanes_small dataset and then select the “phys_active” column and then rename the “phys_active” column to “physically_active”.
Now, let’s add and commit changes to the Git history with the RStudio Git interface before moving on to the exercise.
9.9 Exercise: Practice what we’ve learned
Time: 10 min
In the exercises-wrangling.R script file, complete the following tasks. Replace the ___ with NHANES.

1. Copy and paste the code below into the script file. Replace the ___ in the select() function with the columns tot_chol, bp_sys_ave, and poverty.
2. Copy and paste the code below and fill out the blanks. Rename diabetes_age to be diabetes_diagnosis_age. Tip: Recall that renaming is in the form new = old.
3. Re-write this piece of code using the “pipe” operator (a sketch of the starting code is given after this list).
4. Read through (in your head) the code below. How intuitive is it to read? Now, re-write this code so that you don’t need to create the temporary physical_activity object by using the pipe, then re-read the revised version. Which do you feel is easier to “read”?
5. Lastly, add and commit these changes to the Git history with the RStudio Git interface.
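The starting code for tasks 3 and 4 isn’t shown above; based on the solution, it likely looked something like this (a sketch, not the exact original):

# 3. Non-piped version to re-write with the pipe
select(nhanes_small, bmi, contains("age"))

# 4. Version using a temporary object, to re-write with the pipe
physical_activity <- select(nhanes_small, phys_active_days, phys_active)
rename(physical_activity, days_phys_active = phys_active_days)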
Click for the (possible) solution.
# 1. Select specific columns
nhanes_small %>%
select(tot_chol, bp_sys_ave, poverty)
# 2. Rename columns
nhanes_small %>%
rename(diabetes_diagnosis_age = diabetes_age)
# 3. Re-write with pipe
nhanes_small %>%
select(bmi, contains("age"))
# 4. Re-write with pipe
nhanes_small %>%
select(phys_active_days, phys_active) %>%
rename(days_phys_active = phys_active_days)
9.10 Filtering data by row
Filtering data by row is a very common activity in data analysis, especially if you want to get rid of outliers or to subset by a categorical group.
The function to subset or filter is called filter(). filter() is distinct from select() in the sense that it operates on rows, whereas select() operates on columns.
The filter() function takes a logic condition (TRUE or FALSE). As with the other functions, the first argument is the dataset and all others are the logical conditions that will apply to the row filtering. When the logical conditions equal TRUE, those rows will be kept, and those that are FALSE will be dropped.
Warning: Since filter()
uses logical conditions,
you need to be really careful when writing the logic.
As you probably know, humans are really really bad at logic.
If your logical condition starts getting complex,
double and triple check that you know for certain that your logic code is doing
what you think it is doing. It’s very easy to make mistakes at this stage, even
for advanced R users.
The simplest kind of logic condition is to test for “equality”.
In R, “equal to” is represented by ==
.
An example: If we wanted to keep only females in the dataset, we would use the logic condition like this:
nhanes_small %>%
filter(sex == "female")
#> # A tibble: 5,020 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 49 fema… 168. 86.7 30.6 No NA NA
#> 2 45 fema… 167. 75.7 27.2 No NA 5
#> 3 45 fema… 167. 75.7 27.2 No NA 5
#> 4 45 fema… 167. 75.7 27.2 No NA 5
#> 5 10 fema… 142. 38.6 19.2 No NA NA
#> 6 58 fema… 148. 57.5 26.2 No NA 2
#> 7 9 fema… 139. 53.1 27.4 No NA NA
#> 8 56 fema… 171. 57.5 19.7 No NA 7
#> 9 56 fema… 171. 57.5 19.7 No NA 7
#> 10 57 fema… 157. 51 20.7 No NA 3
#> # … with 5,010 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
We would “read” this code as:
Take the nhanes_small dataset and then filter so that only rows where sex is equal to “female” are kept.
When a row in the sex column has the value "female", that row is kept. Otherwise, it is dropped.
There are other logic comparisons to use. Table 9.2 can be used as a reference for logical conditions in R.
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | equal to |
!= | not equal to |
!x | Not x (if x is true or false) |
x | y | x OR y |
x & y | x AND y |
Let’s try out a few of these logical conditions with filter()
.
# Participants who are not female
nhanes_small %>%
filter(sex != "female")
#> # A tibble: 4,980 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 4 male 105. 17 15.3 No NA NA
#> 5 9 male 133. 29.8 16.8 No NA NA
#> 6 8 male 131. 35.2 20.6 No NA NA
#> 7 66 male 170. 68 23.7 No NA 7
#> 8 58 male 182. 78.4 23.7 No NA 5
#> 9 54 male 169. 74.7 26.0 No NA 1
#> 10 50 male 178. 84.1 26.6 No NA 7
#> # … with 4,970 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
# Participants who have BMI equal to 25
nhanes_small %>%
filter(bmi == 25)
#> # A tibble: 35 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 24 male 178. 79.3 25 No NA 2
#> 2 24 male 178. 79.3 25 No NA 2
#> 3 32 male 179 80.1 25 No NA NA
#> 4 32 male 179 80.1 25 No NA NA
#> 5 32 male 179 80.1 25 No NA NA
#> 6 32 male 179 80.1 25 No NA 2
#> 7 55 fema… 164. 67.5 25 No NA 3
#> 8 6 fema… 117. 34 25 No NA 3
#> 9 53 male 162. 65.6 25 No NA 2
#> 10 57 fema… 178. 78.8 25 No NA 3
#> # … with 25 more rows, and 6 more variables: phys_active <fct>, tot_chol <dbl>,
#> # bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>, poverty <dbl>
# Participants who have BMI equal to or more than 25
nhanes_small %>%
filter(bmi >= 25)
#> # A tibble: 5,422 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 49 fema… 168. 86.7 30.6 No NA NA
#> 5 45 fema… 167. 75.7 27.2 No NA 5
#> 6 45 fema… 167. 75.7 27.2 No NA 5
#> 7 45 fema… 167. 75.7 27.2 No NA 5
#> 8 54 male 169. 74.7 26.0 No NA 1
#> 9 58 fema… 148. 57.5 26.2 No NA 2
#> 10 50 male 178. 84.1 26.6 No NA 7
#> # … with 5,412 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
We use | (“or”) and & (“and”) when we want to combine conditions across columns. Be careful with these operators and when combining logic conditions, as they can sometimes work differently than our human brains interpret them (speaking from experience). For &, both sides must be TRUE in order for the combination to be TRUE. For |, only one side needs to be TRUE in order for the combination to be TRUE. To see how they work, try these:
TRUE & TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
FALSE & FALSE
#> [1] FALSE
TRUE | TRUE
#> [1] TRUE
TRUE | FALSE
#> [1] TRUE
FALSE | FALSE
#> [1] FALSE
When used in filter()
, combinations of logic conditions may look like this:
# When BMI is 25 AND sex is female
nhanes_small %>%
filter(bmi == 25 & sex == "female")
#> # A tibble: 21 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 55 fema… 164. 67.5 25 No NA 3
#> 2 6 fema… 117. 34 25 No NA 3
#> 3 57 fema… 178. 78.8 25 No NA 3
#> 4 57 fema… 178. 78.8 25 No NA NA
#> 5 57 fema… 178. 78.8 25 No NA NA
#> 6 79 fema… 157. 61.6 25 No NA NA
#> 7 54 fema… 161. 64.8 25 No NA 4
#> 8 54 fema… 161. 64.8 25 No NA NA
#> 9 54 fema… 161. 64.8 25 No NA NA
#> 10 54 fema… 161. 64.8 25 No NA NA
#> # … with 11 more rows, and 6 more variables: phys_active <fct>, tot_chol <dbl>,
#> # bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>, poverty <dbl>
# When BMI is 25 OR sex is female
nhanes_small %>%
filter(bmi == 25 | sex == "female")
#> # A tibble: 5,034 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 49 fema… 168. 86.7 30.6 No NA NA
#> 2 45 fema… 167. 75.7 27.2 No NA 5
#> 3 45 fema… 167. 75.7 27.2 No NA 5
#> 4 45 fema… 167. 75.7 27.2 No NA 5
#> 5 10 fema… 142. 38.6 19.2 No NA NA
#> 6 58 fema… 148. 57.5 26.2 No NA 2
#> 7 9 fema… 139. 53.1 27.4 No NA NA
#> 8 56 fema… 171. 57.5 19.7 No NA 7
#> 9 56 fema… 171. 57.5 19.7 No NA 7
#> 10 57 fema… 157. 51 20.7 No NA 3
#> # … with 5,024 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
9.11 Arranging the rows of your data by column
You may want to sort your rows by a specific column so that values are arranged in ascending or descending order. This can be done using the function called arrange(). Again, arrange() takes the dataset as the first argument, followed by the columns that you wish to arrange the data by. By default, arrange() orders in ascending order.
# Arranging data by age in ascending order
nhanes_small %>%
arrange(age)
#> # A tibble: 10,000 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 0 fema… NA 4.5 NA <NA> NA NA
#> 2 0 fema… NA 9.5 NA <NA> NA NA
#> 3 0 male NA 6.9 NA <NA> NA NA
#> 4 0 male NA 10.7 NA <NA> NA NA
#> 5 0 fema… NA 9 NA <NA> NA NA
#> 6 0 fema… NA 9 NA <NA> NA NA
#> 7 0 fema… NA 6 NA <NA> NA NA
#> 8 0 fema… NA 6 NA <NA> NA NA
#> 9 0 fema… NA 8.7 NA <NA> NA NA
#> 10 0 fema… NA 4.7 NA <NA> NA NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
arrange()
also arranges parameters of type character
alphabetically:
nhanes_small %>%
arrange(sex)
#> # A tibble: 10,000 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 49 fema… 168. 86.7 30.6 No NA NA
#> 2 45 fema… 167. 75.7 27.2 No NA 5
#> 3 45 fema… 167. 75.7 27.2 No NA 5
#> 4 45 fema… 167. 75.7 27.2 No NA 5
#> 5 10 fema… 142. 38.6 19.2 No NA NA
#> 6 58 fema… 148. 57.5 26.2 No NA 2
#> 7 9 fema… 139. 53.1 27.4 No NA NA
#> 8 56 fema… 171. 57.5 19.7 No NA 7
#> 9 56 fema… 171. 57.5 19.7 No NA 7
#> 10 57 fema… 157. 51 20.7 No NA 3
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
If we want to order the rows by a column in descending order, this can be done with desc().
# Arranging data by age in descending order
nhanes_small %>%
arrange(desc(age))
#> # A tibble: 10,000 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 80 fema… 157. 66.9 27.2 No NA NA
#> 2 80 fema… 157. 66.9 27.2 No NA NA
#> 3 80 male 170. 51.4 17.9 No NA NA
#> 4 80 male 168. 95.5 34.0 Yes 48 NA
#> 5 80 male 168 82.8 29.3 Yes 80 1
#> 6 80 fema… 153. 77.2 33.1 No NA NA
#> 7 80 fema… 163. 65.6 24.8 No NA NA
#> 8 80 fema… 163. 65.6 24.8 No NA NA
#> 9 80 male 172. 78 26.5 No NA NA
#> 10 80 fema… 167. 71.5 25.8 No NA NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
You can also order your data by multiple columns. For instance, we could arrange first by sex
and then by age
.
# Arranging data by sex then age in ascending order
nhanes_small %>%
arrange(sex, age)
#> # A tibble: 10,000 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 0 fema… NA 4.5 NA <NA> NA NA
#> 2 0 fema… NA 9.5 NA <NA> NA NA
#> 3 0 fema… NA 9 NA <NA> NA NA
#> 4 0 fema… NA 9 NA <NA> NA NA
#> 5 0 fema… NA 6 NA <NA> NA NA
#> 6 0 fema… NA 6 NA <NA> NA NA
#> 7 0 fema… NA 8.7 NA <NA> NA NA
#> 8 0 fema… NA 4.7 NA <NA> NA NA
#> 9 0 fema… NA 9.1 NA <NA> NA NA
#> 10 0 fema… NA 10.6 NA <NA> NA NA
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
9.12 Transform or add columns
To modify an existing column or to add a new one, we can use the function called mutate(). You can use mutate() to compute a new variable using existing columns in your dataset, such as calculating BMI from the height and weight columns. You can multiply all values in a certain column by 2, or combine columns into a new variable. Like the other functions, the first input is the dataset and the other arguments are columns to add or modify.
Take this example: Height values are in centimeters, but we want them in meters. So, we would use mutate() with the following instruction:
height = height / 100
This form is similar to how math works. The action that happens on the right hand side is put into the variable on the left hand side. When using mutate(), it looks like this:
nhanes_small %>%
mutate(height = height / 100)
#> # A tibble: 10,000 x 14
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 1.65 87.4 32.2 No NA NA
#> 2 34 male 1.65 87.4 32.2 No NA NA
#> 3 34 male 1.65 87.4 32.2 No NA NA
#> 4 4 male 1.05 17 15.3 No NA NA
#> 5 49 fema… 1.68 86.7 30.6 No NA NA
#> 6 9 male 1.33 29.8 16.8 No NA NA
#> 7 8 male 1.31 35.2 20.6 No NA NA
#> 8 45 fema… 1.67 75.7 27.2 No NA 5
#> 9 45 fema… 1.67 75.7 27.2 No NA 5
#> 10 45 fema… 1.67 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 6 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>
We can also create a new column (e.g., log transforming height):
nhanes_small %>%
mutate(logged_height = log(height))
#> # A tibble: 10,000 x 15
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 4 male 105. 17 15.3 No NA NA
#> 5 49 fema… 168. 86.7 30.6 No NA NA
#> 6 9 male 133. 29.8 16.8 No NA NA
#> 7 8 male 131. 35.2 20.6 No NA NA
#> 8 45 fema… 167. 75.7 27.2 No NA 5
#> 9 45 fema… 167. 75.7 27.2 No NA 5
#> 10 45 fema… 167. 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>, logged_height <dbl>
We can add multiple modifications or additions with mutate() by separating them with a ,. So, if we first wanted to have height in meters and then take the natural logarithm, it would be:
nhanes_small %>%
mutate(height = height / 100,
logged_height = log(height))
#> # A tibble: 10,000 x 15
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 1.65 87.4 32.2 No NA NA
#> 2 34 male 1.65 87.4 32.2 No NA NA
#> 3 34 male 1.65 87.4 32.2 No NA NA
#> 4 4 male 1.05 17 15.3 No NA NA
#> 5 49 fema… 1.68 86.7 30.6 No NA NA
#> 6 9 male 1.33 29.8 16.8 No NA NA
#> 7 8 male 1.31 35.2 20.6 No NA NA
#> 8 45 fema… 1.67 75.7 27.2 No NA 5
#> 9 45 fema… 1.67 75.7 27.2 No NA 5
#> 10 45 fema… 1.67 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>, logged_height <dbl>
We can also have different values based on logic conditions using if_else()
.
Use Table 9.2 to help with creating the logic condition.
nhanes_small %>%
mutate(highly_active = if_else(phys_active_days >= 5, "yes", "no"))
#> # A tibble: 10,000 x 15
#> age sex height weight bmi diabetes diabetes_age phys_active_days
#> <int> <fct> <dbl> <dbl> <dbl> <fct> <int> <int>
#> 1 34 male 165. 87.4 32.2 No NA NA
#> 2 34 male 165. 87.4 32.2 No NA NA
#> 3 34 male 165. 87.4 32.2 No NA NA
#> 4 4 male 105. 17 15.3 No NA NA
#> 5 49 fema… 168. 86.7 30.6 No NA NA
#> 6 9 male 133. 29.8 16.8 No NA NA
#> 7 8 male 131. 35.2 20.6 No NA NA
#> 8 45 fema… 167. 75.7 27.2 No NA 5
#> 9 45 fema… 167. 75.7 27.2 No NA 5
#> 10 45 fema… 167. 75.7 27.2 No NA 5
#> # … with 9,990 more rows, and 7 more variables: phys_active <fct>,
#> # tot_chol <dbl>, bp_sys_ave <int>, bp_dia_ave <int>, smoke_now <fct>,
#> # poverty <dbl>, highly_active <chr>
Recall that the original dataset doesn’t change.
If we want the added variable to be saved, we must assign it to something with <-
.
Putting it all together, you would enter something like this:
nhanes_update <- nhanes_small %>%
mutate(height = height / 100,
logged_height = log(height),
highly_active = if_else(phys_active_days >= 5, "Yes", "No"))
Before you start the following exercise, add and commit all the files changes to the Git history with the RStudio Git interface.
9.13 Exercise: Piping, filtering, and mutating
Time: 20 min
Copy and paste the code below into the exercises-wrangling.R
script file.
# 1. BMI between 20 and 40 with diabetes
nhanes_small %>%
# Format should follow: variable >= number or character
filter(___ >= ___ & ___ <= ___ & ___ == ___)
# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # Specifying dataset
mutate(
# 2. Calculate mean arterial pressure
___ = ___,
# 3. Create young_child variable using a condition
___ = if_else(___, "Yes", "No")
)
nhanes_modified
Then, start replacing the ___ with the appropriate code to complete the tasks below. (Hint: Create a new “Section” in the R script for this exercise by using Ctrl-Shift-R.)
1. Filter nhanes_small so only those participants with a BMI of more than or equal to 20 and less than or equal to 40 are kept, and keep only those who have diabetes.
2. Create a new variable called mean_arterial_pressure by applying the formula \(((2 \times DBP) + SBP) / 3\) (DBP = bp_dia_ave and SBP = bp_sys_ave) to calculate the Mean Arterial Pressure. Hint: In R, use + to add, * to multiply, and / to divide.
3. Create a new variable called young_child for cases where age is less than 6 years.
4. Finally, add and commit these changes to the Git history with the RStudio Git Interface. Push to GitHub to synchronize with your GitHub repository.
Click for the (possible) solution.
# 1. BMI between 20 and 40, with diabetes
nhanes_small %>%
filter(bmi >= 20 & bmi <= 40 & diabetes == "Yes")
# Pipe the data into mutate function and:
nhanes_modified <- nhanes_small %>% # dataset
mutate(
mean_arterial_pressure = ((2 * bp_dia_ave) + bp_sys_ave) / 3,
young_child = if_else(age < 6, "Yes", "No")
)
nhanes_modified
9.14 Split-apply-combine: Summarizing data
Take 5 min to read through this section before we continue.
Summarising or applying simple (or complex) statistics to data is a key component of any analysis. Simple summaries or statistics can be done either on all of the data or on groups of it. There are many data analysis tasks that can be approached using the split-apply-combine method, which involves splitting the data into groups, applying some analysis to each group, and then combining the results together.
In dplyr, you can use the function called summarise() to summarise on all of the data. If you want to do a split-apply-combine analysis to, e.g., find the max height of females and males, you would use the functions group_by() and then summarise(). Using group_by() splits the data up, summarise() applies an analysis to each group, and the results are then immediately combined back together.
The first position argument to group_by() is, as usual, the dataset. The next arguments are the columns that contain the values you want to group by. These columns must contain categorical data (e.g. sex). group_by() tells R to compute the next operations on the data within each grouping, rather than on all the data.
As with the other functions, summarise() takes the dataset as the first position argument. The next arguments work similarly to the arguments in mutate(), with one difference: the output must be a single value (e.g. a mean). Like mutate(), you can add multiple “summaries” by adding new columns separated by ,.
You would use summarise() to derive basic descriptive statistics of a certain variable, including min(), max(), mean(), median(), or sd() (standard deviation).
The group_by() function doesn’t do anything by itself, so it should always be used in combination with summarise(), mutate(), arrange(), or another function. However, the summarise() function can be used on its own. Let’s take a look.
9.15 Calculating summary statistics
Let’s calculate the maximum value of the BMI variable. See what happens when you enter the following:
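A minimal sketch of that code (assuming max() is used without the na.rm argument):

nhanes_small %>%
    summarise(max_bmi = max(bmi))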
We get back a result of NA, which means “missing”. In R, NA values “propagate”, meaning that if there is one value missing, then the max or mean will also be missing. So, we need to tell max() to exclude any NA values from the calculation using the argument na.rm = TRUE.
nhanes_small %>%
summarise(max_bmi = max(bmi, na.rm = TRUE))
#> # A tibble: 1 x 1
#> max_bmi
#> <dbl>
#> 1 81.2
To calculate another summary statistic, you would add another summary column using ,:
nhanes_small %>%
summarise(max_bmi = max(bmi, na.rm = TRUE),
min_bmi = min(bmi, na.rm = TRUE))
#> # A tibble: 1 x 2
#> max_bmi min_bmi
#> <dbl> <dbl>
#> 1 81.2 12.9
Before you start the following exercise, add and commit changes to the Git history with the RStudio Git interface.
9.16 Exercise: Calculate some basic statistics
Time: 10 min
Practice using summarise()
by calculating various summary statistics.
Copy and paste the code below into the exercises-wrangling.R
script file.
# 1.
nhanes_small %>%
summarise(mean_weight = ___,
mean_age = ___)
# 2.
nhanes_small %>%
summarise(max_height = ___,
min_height = ___)
# 3.
nhanes_small %>%
summarise(___ = ___,
___ = ___)
Then, start replacing the ___
with the appropriate
code to complete the tasks below. Don’t forget to use na.rm = TRUE
in the basic statistic functions.
- Calculate the mean of weight and age.
- Calculate the max and min of height.
- Calculate the median of age and phys_active_days.
- Lastly, add and commit any changes made to the Git history with the RStudio Git interface.
Click for the (possible) solution.
# 1.
nhanes_small %>%
summarise(mean_weight = mean(weight, na.rm = TRUE),
mean_age = mean(age, na.rm = TRUE))
# 2.
nhanes_small %>%
summarise(max_height = max(height, na.rm = TRUE),
min_height = min(height, na.rm = TRUE))
# 3.
nhanes_small %>%
summarise(median_age = median(age, na.rm = TRUE),
median_phys_active_days = median(phys_active_days, na.rm = TRUE))
9.17 Summary statistics by a group
While the summarise() function is useful enough on its own, it really shines when combined with group_by(). Let’s use these functions to find out the mean age and BMI between those with and without diabetes.
nhanes_small %>%
group_by(diabetes) %>%
summarise(mean_age = mean(age, na.rm = TRUE),
mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 3 x 3
#> diabetes mean_age mean_bmi
#> <fct> <dbl> <dbl>
#> 1 No 35.4 26.2
#> 2 Yes 59.2 32.6
#> 3 <NA> 1.20 29.6
Quick note: If you are using a version of dplyr >= 1.0.0, you’ll get a message informing you that it is regrouping output. This is a notification and can be ignored. If you don’t want the message displayed, you can add options(dplyr.summarise.inform = FALSE) to the top of your script and run it.
We also get a warning about there being missing values in diabetes, so we need to remove rows that have missing diabetes status values.
nhanes_small %>%
# Recall ! means "NOT", so !is.na means "is not missing"
filter(!is.na(diabetes)) %>%
group_by(diabetes) %>%
summarise(mean_age = mean(age, na.rm = TRUE),
mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 2 x 3
#> diabetes mean_age mean_bmi
#> <fct> <dbl> <dbl>
#> 1 No 35.4 26.2
#> 2 Yes 59.2 32.6
Cool! We can add more columns to the grouping, so let’s do that. Let’s compare mean age and BMI by sex and diabetes status.
nhanes_small %>%
filter(!is.na(diabetes)) %>%
group_by(diabetes, sex) %>%
summarise(mean_age = mean(age, na.rm = TRUE),
mean_bmi = mean(bmi, na.rm = TRUE))
#> # A tibble: 4 x 4
#> # Groups: diabetes [2]
#> diabetes sex mean_age mean_bmi
#> <fct> <fct> <dbl> <dbl>
#> 1 No female 36.5 26.2
#> 2 No male 34.3 26.1
#> 3 Yes female 59.9 33.7
#> 4 Yes male 58.6 31.5
Since we don’t need the dataset grouped anymore, it’s good practice to end the
grouping with ungroup()
.
nhanes_small %>%
filter(!is.na(diabetes)) %>%
group_by(diabetes, sex) %>%
summarise(mean_age = mean(age, na.rm = TRUE),
mean_bmi = mean(bmi, na.rm = TRUE)) %>%
ungroup()
#> # A tibble: 4 x 4
#> diabetes sex mean_age mean_bmi
#> <fct> <fct> <dbl> <dbl>
#> 1 No female 36.5 26.2
#> 2 No male 34.3 26.1
#> 3 Yes female 59.9 33.7
#> 4 Yes male 58.6 31.5
Before you start the following exercise, add and commit changes to the Git history.
9.18 Exercise: Answer some statistical questions with group by and summarise
Time: 5-7 min
Copy and paste the code below into the exercises-wrangling.R
script file.
# 1.
nhanes_small %>%
filter(!is.na(diabetes)) %>%
___(___, ___) %>%
___(
___,
___,
___
)
# 2.
nhanes_small %>%
filter(!is.na(diabetes)) %>%
___(___, ___) %>%
___(
___,
___,
___,
___,
___,
___
)
Then, start replacing the ___ with the appropriate code, including group_by() with summarise(), to answer these questions:
- What is the mean, max, and min differences in age between females and males with or without diabetes?
- What is the mean, max, and min differences in height and weight between females and males with or without diabetes?
- Once done, add and commit the changes to the file to the Git history.
Click for the (possible) solution.
# 1.
nhanes_small %>%
filter(!is.na(diabetes)) %>%
group_by(diabetes, sex) %>%
summarise(
mean_age = mean(age, na.rm = TRUE),
max_age = max(age, na.rm = TRUE),
min_age = min(age, na.rm = TRUE)
)
# 2.
nhanes_small %>%
filter(!is.na(diabetes)) %>%
group_by(diabetes, sex) %>%
summarise(
mean_height = mean(height, na.rm = TRUE),
max_height = max(height, na.rm = TRUE),
min_height = min(height, na.rm = TRUE),
mean_weight = mean(weight, na.rm = TRUE),
max_weight = max(weight, na.rm = TRUE),
min_weight = min(weight, na.rm = TRUE)
)
9.19 Saving datasets as files
The nhanes_small data frame you created is only available after you’ve created it from NHANES, but if you want to access it later, you can save it as an .rda file in your data/ folder using the function usethis::use_data(). We would add overwrite = TRUE to this so that we can save the dataset again even if the file already exists.
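In practice, that call would look something like this (a sketch using the dataset created above):

# Save nhanes_small as data/nhanes_small.rda
usethis::use_data(nhanes_small, overwrite = TRUE)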
The usethis::use_data() function outputs some information, the last of which (“Document your data”) we won’t cover in this course. The function takes any number of datasets and saves each individually as a .rda R dataset file in the data/ folder.
For many projects, it isn’t necessary or advisable to save every single data object you create. It’s better to let the code create the data you’ll use rather than saving each new wrangled dataset you might create. However, sometimes you will want or need to save the dataset you’ve been working on, perhaps because you’ve done a lot of work to prepare it for later analyses, or because you’ve run an analysis and want to save the results. In these cases, you should definitely save the new cleaned dataset.
9.20 Loading in a dataset
Take ~5 min to read through this section.
We’ve been using a teaching dataset that we load from a package, mainly so that we can focus on getting familiar with data wrangling. However, there will come a time when you want to wrangle your own data. There are several ways to load in a dataset, with the most common being:
- Using the RStudio menu File -> Import Dataset -> From Text/Excel/SPSS/SAS/Stata (depending on the file type you want to import).
- If the file is a .csv file, use readr::read_csv() to import the dataset (see the sketch after this list).
- If the dataset is an .rda file, use load() (also sketched below). This loads the dataset into your R session so that you can use it again. If you want to check that it works, restart the R session with either Ctrl-Shift-F10 or with the menu item Session -> Restart R. Then, type out and run the load() line again. You should now see the dataset in the Environment tab. This is how you save and load data.
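As a sketch of what those import commands might look like (the file names and paths here are hypothetical examples, not from the original material):

# Import a .csv file (hypothetical path and file name)
dataset <- readr::read_csv(here::here("data-raw/dataset.csv"))

# Load an .rda file previously saved with usethis::use_data()
load(here::here("data/nhanes_small.rda"))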
For SAS, SPSS, or Stata files, you can use the package haven to import those types of data files into R.
9.21 Exercise: Practicing the dplyr functions
Time: 30 minutes
Practice using dplyr by using the NHANES dataset and wrangling the data into a summary output. Don’t create any intermediate objects; use only the pipe operator to link each task below with the next one.
- Rename all columns to use snake case.
- Select the columns gender, age, and bmi.
- Exclude "NAs" from all of the selected columns.
- Rename gender to sex.
- Create a new column called age_class, where anyone under 50 years old is labeled "under 50" and those 50 years and older are labeled "over 50".
- Group the data according to sex and age_class.
- Calculate the mean and median BMI according to the grouping to determine the difference in BMI between age classes and sex.
- Add and commit changes to the Git history with the RStudio Git interface.
Click for the (possible) solution.
NHANES %>%
rename_with(snakecase::to_snake_case) %>%
select(gender, age, bmi) %>%
filter(!is.na(gender) & !is.na(age) & !is.na(bmi)) %>%
rename(sex = gender) %>%
mutate(age_class = if_else(age < 50, "under 50", "over 50")) %>%
group_by(age_class, sex) %>%
summarize(bmi_mean = mean(bmi, na.rm = TRUE),
bmi_median = median(bmi, na.rm = TRUE))
9.22 Summary of session
- With tidy data, each variable has its own column, each observation has its own row, and each value has its own cell.
- Loading the tidyverse loads multiple packages designed for tidying up and wrangling data.
- Never edit raw data. Instead, use R code to make changes and clean up the raw data, rather than manually editing the dataset.
- Use the functions select(), rename(), filter(), mutate() (“change or modify”), arrange(), and summarise() from the dplyr package to wrangle your data.
- Use the pipe (%>%) to write easy-to-read code, similar to reading a text consisting of multiple sentences.