The examples are part of the Fundamentals of R course. For more, see the R for the Rest of Us website.

Load Packages

Let’s load the packages we need. These include tidyverse (especially the dplyr package), janitor, and skimr.

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.1          ✔ purrr   0.3.2     
## ✔ tibble  2.1.1          ✔ dplyr   0.8.0.1   
## ✔ tidyr   0.8.3.9000     ✔ stringr 1.4.0     
## ✔ readr   1.3.1          ✔ forcats 0.4.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(janitor)
library(skimr)

## 
## Attaching package: 'skimr'

## The following object is masked from 'package:stats':
## 
##     filter

clean_names

bad_names <- read_csv("data/badnames.csv")

## Parsed with column specification:
## cols(
##   ID = col_double(),
##   `Age Decade` = col_character(),
##   gender = col_character()
## )

bad_names

With the bad_names data frame, we have to put backticks (`) before and after variable names that have spaces in them. Also, RStudio doesn’t autocomplete the variable names, which is a pain!

bad_names %>% 
  skim(`Age Decade`)

We can use clean_names as follows:

good_names <- bad_names %>% 
  clean_names()


good_names

Variable names are much easier to type now! And RStudio autocompletes them, which is super handy.

good_names %>% 
  skim(age_decade)
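If you do not have the badnames.csv file handy, here is a minimal sketch of the same idea. The messy data frame below is made up for illustration and only mimics the column names shown above.

```r
library(tibble)
library(janitor)

# Made-up data that mimics the column names in badnames.csv
messy <- tibble(
  ID = 1:3,
  `Age Decade` = c("20-29", "30-39", "40-49"),
  Gender = c("female", "male", "female")
)

# clean_names() converts the names to lowercase snake_case
tidy <- clean_names(messy)

names(messy)
names(tidy)
```

After clean_names, none of the names need backticks.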

Import NHANES Data

Let’s import our data using read_csv. Note that the NHANES data is in the data directory, so we need to include that directory in the file path.

nhanes <- read_csv("data/nhanes.csv") %>%
  clean_names()

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   ID = col_double(),
##   Age = col_double(),
##   Weight = col_double(),
##   Height = col_double(),
##   BMI = col_double(),
##   DaysPhysHlthBad = col_double(),
##   DaysMentHlthBad = col_double(),
##   SleepHrsNight = col_double(),
##   PhysActiveDays = col_double(),
##   TVHrsDay = col_logical()
## )

## See spec(...) for full column specifications.

## Warning: 4859 parsing failures.
##  row      col           expected    actual              file
## 5001 TVHrsDay 1/0/T/F/TRUE/FALSE 2_hr      'data/nhanes.csv'
## 5002 TVHrsDay 1/0/T/F/TRUE/FALSE More_4_hr 'data/nhanes.csv'
## 5003 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr      'data/nhanes.csv'
## 5004 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr      'data/nhanes.csv'
## 5005 TVHrsDay 1/0/T/F/TRUE/FALSE 1_hr      'data/nhanes.csv'
## .... ........ .................. ......... .................
## See problems(...) for more details.
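The warning above shows that read_csv guessed TVHrsDay was a logical column and then could not parse text values such as 2_hr. One possible way around this, assuming we know the column should be text, is to state the column type ourselves with col_types. The sketch below writes a tiny made-up CSV to a temporary file so that it runs on its own; it is not the course's nhanes.csv.

```r
library(readr)

# Write a small made-up CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("ID,TVHrsDay",
    "1,NA",
    "2,2_hr",
    "3,More_4_hr"),
  csv_path
)

# Declare the column types explicitly instead of letting readr guess
tv <- read_csv(
  csv_path,
  col_types = cols(
    ID = col_double(),
    TVHrsDay = col_character()
  )
)

tv
```

With the type declared, the text values are kept as they are and the NA entry stays missing.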

Let’s see what our data looks like.

nhanes

select

With select we can select variables from the larger data frame.

nhanes %>%
  select(age)

We can also use select for multiple variables:

nhanes %>%
  select(height, weight)

Used within select, the contains function selects variables with certain text in the variable name:

nhanes %>%
  select(contains("age"))

nhanes %>%
  select(contains("phys"))
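Here is a small self-contained sketch of select and contains. The people data frame is made up for illustration; its column names only imitate the NHANES ones.

```r
library(dplyr)
library(tibble)

# Made-up data with NHANES-style column names
people <- tibble(
  id = 1:3,
  age = c(34, 51, 67),
  age_decade = c("30-39", "50-59", "60-69"),
  phys_active_days = c(3, NA, 5)
)

# Keep every column whose name contains "age"
age_columns <- people %>%
  select(contains("age"))

# Named columns and contains() can be mixed in one select()
id_and_phys <- people %>%
  select(id, contains("phys"))

names(age_columns)
names(id_and_phys)
```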

mutate

We use mutate to make new variables or change existing ones.

We can use mutate in three ways:

Create a new variable with a specific value

nhanes %>%
  mutate(country = "United States") %>% 
  select(country)

Create a new variable based on other variables

nhanes %>%
  mutate(height_inches = height / 2.54) %>% 
  select(contains("height"))

Change an existing variable

nhanes %>%
  mutate(bmi = round(bmi, digits = 1)) %>% 
  select(bmi)
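The three uses of mutate can also be combined in a single call. This is a sketch on made-up data, with heights assumed to be in centimeters as in the example above.

```r
library(dplyr)
library(tibble)

# Made-up heights (cm) and BMI values
people <- tibble(
  height = c(152.4, 177.8),
  bmi = c(21.456, 27.891)
)

people_updated <- people %>%
  mutate(
    country = "United States",       # new variable with a fixed value
    height_inches = height / 2.54,   # new variable based on another one
    bmi = round(bmi, digits = 1)     # change an existing variable
  )

people_updated
```

New variables are added at the end, while bmi keeps its original position.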

A Brief Interlude

Comparisons

Logical operators

With logical operators, we can create complex filters (e.g. keep only those who say their health is “good”, “very good”, or “excellent”).
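To see what comparisons and logical operators return on their own, we can try them on plain vectors. The values below are made up for illustration. Each expression gives one TRUE or FALSE per element, which is what filter uses to decide which rows to keep.

```r
# Made-up values
age <- c(25, 40, 67)
health <- c("Good", "Poor", "Excellent")

# Comparisons
over_30  <- age > 30            # FALSE  TRUE  TRUE
is_good  <- health == "Good"    #  TRUE FALSE FALSE
not_good <- health != "Good"    # FALSE  TRUE  TRUE

# Logical operators: & (and), | (or), ! (not)
both    <- age > 30 & health != "Poor"   # FALSE FALSE  TRUE
either  <- age > 60 | health == "Good"   #  TRUE FALSE  TRUE
negated <- !(age > 30)                   #  TRUE FALSE FALSE

# %in% checks membership in a set of values
in_set <- health %in% c("Good", "Excellent")   # TRUE FALSE TRUE
```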

filter

We use filter to choose a subset of observations.

We use == to select all observations that meet the criteria.

nhanes %>% 
  filter(gender == "female") %>%
  select(gender)

We use != to select all observations that don’t meet the criteria.

nhanes %>% 
  filter(health_gen != "Good") %>%
  select(health_gen)
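One thing to watch for: filter keeps only rows where the condition is TRUE, so rows where health_gen is NA are dropped by != as well. The sketch below uses made-up data to show this, along with one possible way to keep the missing rows if we want them.

```r
library(dplyr)
library(tibble)

# Made-up data with one missing value
people <- tibble(health_gen = c("Good", "Poor", NA))

# NA != "Good" evaluates to NA, so the NA row is dropped too
not_good <- people %>%
  filter(health_gen != "Good")

# Explicitly keep the rows with missing values
not_good_or_missing <- people %>%
  filter(health_gen != "Good" | is.na(health_gen))

nrow(not_good)
nrow(not_good_or_missing)
```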

We can combine comparisons and logical operators.

nhanes %>% 
  filter(health_gen == "Good" | health_gen == "Vgood" | health_gen == "Excellent") %>%
  select(health_gen)

We can use %in% to collapse multiple comparisons into one.

nhanes %>% 
  filter(health_gen %in% c("Good", "Vgood", "Excellent")) %>%
  select(health_gen)

We can chain together multiple filter functions. Doing it this way, we don’t have to create complex logic in one line. The two examples below return the same rows.

nhanes %>% 
  filter(gender == "male" & (health_gen == "Good" | health_gen == "Vgood" | health_gen == "Excellent")) %>%
  select(gender, health_gen)

nhanes %>% 
  filter(gender == "male") %>%
  filter(health_gen %in% c("Good", "Vgood", "Excellent")) %>%
  select(gender, health_gen)
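We can check on a small made-up data frame that one combined filter and two chained filters give the same result.

```r
library(dplyr)
library(tibble)

# Made-up data
people <- tibble(
  gender = c("male", "male", "female", "male"),
  health_gen = c("Good", "Poor", "Excellent", "Vgood")
)

good_health <- c("Good", "Vgood", "Excellent")

# All of the logic in one filter()
one_step <- people %>%
  filter(gender == "male" & health_gen %in% good_health)

# The same logic split across two filter() calls
two_steps <- people %>%
  filter(gender == "male") %>%
  filter(health_gen %in% good_health)

all.equal(as.data.frame(one_step), as.data.frame(two_steps))
```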

We can use <, >, <=, and >= for numeric data.

nhanes %>% 
  filter(age > 50) %>% 
  select(age)

We can drop rows with NA values using !is.na.

nhanes %>% 
  filter(age > 50) %>% 
  filter(!is.na(marital_status)) %>%
  select(age, marital_status)

We can also drop NAs with drop_na

nhanes %>% 
  filter(age > 50) %>% 
  drop_na(marital_status) %>%
  select(age, marital_status)
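The sketch below compares the two approaches on made-up data. It also shows drop_na with no arguments, which drops every row that has an NA in any column, so it can remove more rows than we might expect.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up data with missing values in two columns
people <- tibble(
  age = c(55, 62, 48, 71),
  marital_status = c("Married", NA, "Single", NA),
  education = c(NA, "College Grad", "High School", "Some College")
)

# These two give the same rows
with_filter <- people %>%
  filter(!is.na(marital_status))

with_drop_na <- people %>%
  drop_na(marital_status)

# With no column named, rows with an NA in any column are dropped
all_complete <- people %>%
  drop_na()

nrow(with_filter)
nrow(with_drop_na)
nrow(all_complete)
```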

summarize

With summarize, we can go from a complete dataset down to a summary.

We use summary functions such as mean(), median(), and n() with summarize.

This doesn’t work! Notice that the result is NA, because phys_active_days contains missing values.

nhanes %>% 
  summarize(mean_active_days = mean(phys_active_days))

We need to add na.rm = TRUE to tell R to drop NA values.

nhanes %>% 
  summarize(mean_active_days = mean(phys_active_days,
                                    na.rm = TRUE))

We can have multiple arguments in each usage of summarize.

nhanes %>% 
  summarize(mean_active_days = mean(phys_active_days, na.rm = TRUE),
            median_active_days = median(phys_active_days, na.rm = TRUE),
            number_of_responses = n())
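Here is a small sketch of how NA values affect these summaries, using made-up data. Note that n() counts rows, including rows where the value is missing. If we want the number of non-missing values, one option is sum(!is.na(...)).

```r
library(dplyr)
library(tibble)

# Made-up data with one missing value
days <- tibble(phys_active_days = c(2, NA, 4, 6))

summary_table <- days %>%
  summarize(
    mean_with_na = mean(phys_active_days),                     # NA
    mean_without_na = mean(phys_active_days, na.rm = TRUE),    # 4
    number_of_rows = n(),                                      # 4
    number_non_missing = sum(!is.na(phys_active_days))         # 3
  )

summary_table
```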

group_by

summarize becomes truly powerful when paired with group_by, which enables us to perform calculations on each group.

nhanes %>% 
  group_by(age_decade) %>%
  summarize(mean_active_days = mean(phys_active_days,
                                    na.rm = TRUE))

We can use group_by with multiple groups.

nhanes %>% 
  group_by(age_decade, gender) %>%
  summarize(mean_active_days = mean(phys_active_days,
                                    na.rm = TRUE))
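The sketch below runs the same pattern on a few made-up rows so the group means are easy to check by hand. By default, the result of summarizing with two grouping variables stays grouped by the first one, so we add ungroup at the end in case we want to keep working with the result as an ordinary data frame.

```r
library(dplyr)
library(tibble)

# Made-up data
people <- tibble(
  age_decade = c("20-29", "20-29", "30-39", "30-39"),
  gender = c("female", "male", "female", "female"),
  phys_active_days = c(2, 4, NA, 6)
)

# One row per age decade
by_decade <- people %>%
  group_by(age_decade) %>%
  summarize(mean_active_days = mean(phys_active_days, na.rm = TRUE))

# One row per combination of age decade and gender
by_decade_gender <- people %>%
  group_by(age_decade, gender) %>%
  summarize(mean_active_days = mean(phys_active_days, na.rm = TRUE)) %>%
  ungroup()

by_decade
by_decade_gender
```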

count

If we just want to count the number of things per group, we can use count.

nhanes %>% 
  count(age_decade)

We can also count by multiple groups.

nhanes %>% 
  count(age_decade, gender)
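count is a shortcut for group_by followed by summarize with n(). The sketch below, on made-up data, shows that both give the same counts.

```r
library(dplyr)
library(tibble)

# Made-up data
people <- tibble(
  age_decade = c("20-29", "20-29", "20-29", "30-39")
)

# The shortcut
counted <- people %>%
  count(age_decade)

# The longer version
by_hand <- people %>%
  group_by(age_decade) %>%
  summarize(n = n())

counted
by_hand
```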

arrange

With arrange, we can reorder rows in a data frame based on the values of one or more variables. R arranges in ascending order by default.

nhanes %>% 
  arrange(age)

We can also arrange in descending order using desc().

nhanes %>% 
  arrange(desc(age))
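When we give arrange more than one variable, the later variables break ties in the earlier ones. Here is a sketch on made-up data that sorts by age (descending) and then by weight (ascending).

```r
library(dplyr)
library(tibble)

# Made-up data; persons "a" and "c" have the same age
people <- tibble(
  name = c("a", "b", "c", "d"),
  age = c(40, 25, 40, 31),
  weight = c(70, 82, 65, 90)
)

# Oldest first; within the same age, lightest first
sorted <- people %>%
  arrange(desc(age), weight)

sorted
```

In the result, "c" comes before "a" because both are 40 and "c" has the lower weight.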

We often use arrange at the end of chains to display things in order.

nhanes %>% 
  group_by(age_decade, gender) %>%
  summarize(mean_active_days = mean(phys_active_days,
                                    na.rm = TRUE)) %>% 
  arrange(mean_active_days)

Create new data frames

Sometimes you want to save the results of your work to a new data frame.

female_height_inches_by_age <- nhanes %>% 
  filter(gender == "female") %>% 
  mutate(height_inches = height / 2.54) %>% 
  group_by(age_decade) %>% 
  summarize(height_inches = mean(height_inches,
                                    na.rm = TRUE)) %>% 
  drop_na()

female_height_inches_by_age

Crosstabs

Sometimes you want your results in a crosstab. We can use the tabyl function in the janitor package to make crosstabs automatically.

nhanes %>% 
  tabyl(gender, age_decade)

janitor has a set of functions that all start with adorn_ that add a number of things to our crosstabs. We call them after tabyl. For example, adorn_totals adds row and column totals.

nhanes %>% 
  tabyl(gender, age_decade) %>% 
  adorn_totals(c("row", "col"))

We can use adorn_percentages to convert the counts to percentages.

nhanes %>% 
  tabyl(gender, age_decade) %>% 
  adorn_totals(c("row", "col")) %>% 
  adorn_percentages()

We can then format these percentages using adorn_pct_formatting.

nhanes %>% 
  tabyl(gender, age_decade) %>% 
  adorn_totals(c("row", "col")) %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting()

If we want to include the n alongside percentages, we can use adorn_ns.

nhanes %>% 
  tabyl(gender, age_decade) %>% 
  adorn_totals(c("row", "col")) %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_ns()

We can add titles to our crosstabs using adorn_title.

nhanes %>% 
  tabyl(gender, age_decade) %>% 
  adorn_totals(c("row", "col")) %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_ns() %>% 
  adorn_title()
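To make the crosstab steps easier to check by hand, here is a sketch on a few made-up rows. It also uses the denominator argument of adorn_percentages to calculate column percentages instead of the default row percentages, in case that is what we need.

```r
library(dplyr)
library(tibble)
library(janitor)

# Made-up data
people <- tibble(
  gender = c("female", "female", "male", "male", "male"),
  age_decade = c("20-29", "30-39", "20-29", "20-29", "30-39")
)

# Counts with row and column totals
counts <- people %>%
  tabyl(gender, age_decade) %>%
  adorn_totals(c("row", "col"))

# Column percentages, with the counts alongside
column_percentages <- people %>%
  tabyl(gender, age_decade) %>%
  adorn_totals(c("row", "col")) %>%
  adorn_percentages(denominator = "col") %>%
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns()

counts
column_percentages
```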

We can also do three (or more) way crosstabs automatically by adding more variables to the tabyl function.

nhanes %>% 
  tabyl(gender, age_decade, education) %>%
  adorn_totals(c("row", "col")) %>% 
  adorn_percentages() %>% 
  adorn_pct_formatting() %>% 
  adorn_ns() %>% 
  adorn_title(placement = "combined")

## Warning: Factor `age_decade` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## $`8th Grade`
##  gender/age_decade      0-9    10-19     20-29      30-39      40-49
##             female 0.0% (0) 0.0% (0) 9.1% (19) 18.7% (39) 16.3% (34)
##               male 0.0% (0) 0.0% (0) 7.4% (18) 14.0% (34) 22.3% (54)
##              Total 0.0% (0) 0.0% (0) 8.2% (37) 16.2% (73) 19.5% (88)
##       50-59      60-69        70+        NA_        Total
##  15.3% (32) 12.4% (26) 17.7% (37) 10.5% (22) 100.0% (209)
##  14.0% (34) 16.9% (41) 10.3% (25) 14.9% (36) 100.0% (242)
##  14.6% (66) 14.9% (67) 13.7% (62) 12.9% (58) 100.0% (451)
## 
## $`9 - 11th Grade`
##  gender/age_decade      0-9    10-19       20-29       30-39       40-49
##             female 0.0% (0) 0.0% (0) 17.9%  (72) 15.2%  (61) 14.9%  (60)
##               male 0.0% (0) 0.0% (0) 20.6% (100) 16.7%  (81) 22.2% (108)
##              Total 0.0% (0) 0.0% (0) 19.4% (172) 16.0% (142) 18.9% (168)
##        50-59      60-69        70+       NA_        Total
##  18.7%  (75) 12.4% (50) 12.9% (52) 8.0% (32) 100.0% (402)
##  18.3%  (89)  9.7% (47)  9.1% (44) 3.5% (17) 100.0% (486)
##  18.5% (164) 10.9% (97) 10.8% (96) 5.5% (49) 100.0% (888)
## 
## $`College Grad`
##  gender/age_decade      0-9    10-19       20-29       30-39       40-49
##             female 0.0% (0) 0.0% (0) 14.8% (163) 21.2% (233) 25.7% (282)
##               male 0.0% (0) 0.0% (0) 13.0% (130) 21.4% (214) 19.9% (199)
##              Total 0.0% (0) 0.0% (0) 14.0% (293) 21.3% (447) 22.9% (481)
##        50-59       60-69        70+       NA_         Total
##  19.7% (217) 10.9% (120) 5.2%  (57) 2.5% (27) 100.0% (1099)
##  20.0% (200) 16.0% (160) 6.4%  (64) 3.2% (32) 100.0%  (999)
##  19.9% (417) 13.3% (280) 5.8% (121) 2.8% (59) 100.0% (2098)
## 
## $`High School`
##  gender/age_decade      0-9    10-19       20-29       30-39       40-49
##             female 0.0% (0) 0.0% (0) 20.3% (156) 13.6% (105) 17.5% (135)
##               male 0.0% (0) 0.0% (0) 21.0% (157) 15.7% (117) 22.5% (168)
##              Total 0.0% (0) 0.0% (0) 20.6% (313) 14.6% (222) 20.0% (303)
##        50-59       60-69         70+       NA_         Total
##  15.1% (116) 13.8% (106) 12.5%  (96) 7.3% (56) 100.0%  (770)
##  20.7% (155) 10.0%  (75)  5.9%  (44) 4.1% (31) 100.0%  (747)
##  17.9% (271) 11.9% (181)  9.2% (140) 5.7% (87) 100.0% (1517)
## 
## $`Some College`
##  gender/age_decade      0-9    10-19       20-29       30-39       40-49
##             female 0.0% (0) 0.0% (0) 22.6% (271) 20.0% (239) 14.0% (167)
##               male 0.0% (0) 0.0% (0) 24.9% (266) 19.9% (213) 17.6% (188)
##              Total 0.0% (0) 0.0% (0) 23.7% (537) 19.9% (452) 15.7% (355)
##        50-59       60-69        70+       NA_         Total
##  15.3% (183) 14.8% (177) 8.7% (104) 4.7% (56) 100.0% (1197)
##  19.0% (203) 10.8% (116) 5.8%  (62) 2.1% (22) 100.0% (1070)
##  17.0% (386) 12.9% (293) 7.3% (166) 3.4% (78) 100.0% (2267)
## 
## $NA_
##  gender/age_decade          0-9        10-19    20-29    30-39    40-49
##             female 48.6%  (653) 50.9%  (684) 0.0% (0) 0.0% (0) 0.2% (3)
##               male 51.4%  (738) 48.1%  (690) 0.3% (4) 0.1% (2) 0.0% (0)
##              Total 50.1% (1391) 49.4% (1374) 0.1% (4) 0.1% (2) 0.1% (3)
##     50-59    60-69      70+      NA_         Total
##  0.0% (0) 0.1% (1) 0.1% (2) 0.0% (0) 100.0% (1343)
##  0.0% (0) 0.0% (0) 0.0% (0) 0.1% (2) 100.0% (1436)
##  0.0% (0) 0.0% (1) 0.1% (2) 0.1% (2) 100.0% (2779)

Fundamentals of Data Wrangling and Analysis Examples
