The examples are part of the Fundamentals of R course. For more, see the R for the Rest of Us website.
Let’s load the packages we need. These include tidyverse (especially the dplyr package), janitor, and skimr.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3.9000 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(janitor)
library(skimr)
##
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
##
## filter
bad_names <- read_csv("data/badnames.csv")
## Parsed with column specification:
## cols(
## ID = col_double(),
## `Age Decade` = col_character(),
## gender = col_character()
## )
bad_names
With the bad_names data frame, we have to put a backtick (`) before and after any variable name that contains a space. Also, RStudio doesn’t autocomplete the variable names, which is a pain!
bad_names %>%
skim(`Age Decade`)
We can use clean_names() as follows:
good_names <- bad_names %>%
clean_names()
good_names
Variable names are much easier to type now! And RStudio autocompletes them, which is super handy.
good_names %>%
skim(age_decade)
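To see exactly what clean_names() changes, here is a minimal sketch on a made-up data frame. The name toy_names is a hypothetical stand-in for bad_names, not the course data, and the sketch assumes janitor's default snake_case naming.

```r
library(tibble)
library(janitor)

# Hypothetical stand-in for bad_names: mixed case and a space in one name
toy_names <- tibble(
  ID = c(1, 2),
  `Age Decade` = c("20-29", "30-39"),
  gender = c("female", "male")
)

# clean_names() rewrites the column names only; the values are untouched
toy_clean <- clean_names(toy_names)
names(toy_clean)
```

The names come back lowercase, with the space replaced by an underscore, so backticks are no longer needed.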
Let’s import our data using read_csv(). Note that the NHANES data is in the data directory, so we need to include that in the file path.
nhanes <- read_csv("data/nhanes.csv") %>%
clean_names()
## Parsed with column specification:
## cols(
## .default = col_character(),
## ID = col_double(),
## Age = col_double(),
## Weight = col_double(),
## Height = col_double(),
## BMI = col_double(),
## DaysPhysHlthBad = col_double(),
## DaysMentHlthBad = col_double(),
## SleepHrsNight = col_double(),
## PhysActiveDays = col_double(),
## TVHrsDay = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 4859 parsing failures.
## row col expected actual file
## 5001 TVHrsDay 1/0/T/F/TRUE/FALSE 2_hr 'data/nhanes.csv'
## 5002 TVHrsDay 1/0/T/F/TRUE/FALSE More_4_hr 'data/nhanes.csv'
## 5003 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr 'data/nhanes.csv'
## 5004 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr 'data/nhanes.csv'
## 5005 TVHrsDay 1/0/T/F/TRUE/FALSE 1_hr 'data/nhanes.csv'
## .... ........ .................. ......... .................
## See problems(...) for more details.
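The parsing failures above arise because read_csv() guesses each column's type from the first rows of the file. TVHrsDay was guessed to be logical, so the text values that show up later (2_hr, More_4_hr, and so on) fail to parse and become NA. One possible fix, sketched here on a small made-up file rather than the course data, is to state the column type yourself with the col_types argument. The file and the toy_ names are assumptions for illustration.

```r
library(readr)

# Hypothetical miniature of the problem: the column starts with missing
# values, so type guessing has nothing but NAs to look at
toy_path <- tempfile(fileext = ".csv")
toy_raw <- data.frame(
  ID = 1:6,
  TVHrsDay = c(NA, NA, NA, "2_hr", "More_4_hr", "1_hr")
)
write_csv(toy_raw, toy_path)

# Declare TVHrsDay as character; the other columns are still guessed
toy_fixed <- read_csv(
  toy_path,
  col_types = cols(TVHrsDay = col_character())
)
toy_fixed
```

Applied to the real file, the same idea would mean passing col_types to the read_csv() call for data/nhanes.csv. Raising the guess_max argument so that more rows are inspected is another option.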
Let’s see what our data looks like.
nhanes
With select(), we can select variables from the larger data frame.
nhanes %>%
select(age)
We can also use select() for multiple variables:
nhanes %>%
select(height, weight)
Used within select(), the contains() function selects variables with certain text in the variable name:
nhanes %>%
select(contains("age"))
nhanes %>%
select(contains("phys"))
See also starts_with() and ends_with().
nhanes %>%
select(starts_with("days"))
nhanes %>%
select(ends_with("days"))
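The three helpers differ only in where they look for the text. A minimal sketch on a made-up data frame (toy and its columns are hypothetical, loosely modeled on the cleaned NHANES names):

```r
library(dplyr)

# Hypothetical data frame with NHANES-like column names
toy <- tibble(
  id = 1:2,
  age = c(34, 51),
  age_decade = c("30-39", "50-59"),
  days_phys_hlth_bad = c(0, 2),
  phys_active_days = c(3, NA)
)

has_age     <- toy %>% select(contains("age"))      # text anywhere in the name
starts_days <- toy %>% select(starts_with("days"))  # text at the start
ends_days   <- toy %>% select(ends_with("days"))    # text at the end

names(has_age)      # "age" "age_decade"
names(starts_days)  # "days_phys_hlth_bad"
names(ends_days)    # "phys_active_days"
```

By default these helpers ignore case, so contains("AGE") would match the same columns.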
We can select a range of columns using the var1:var2 pattern:
nhanes %>%
select(weight:bmi)
We can drop variables using the -var format:
nhanes %>%
select(-id)
We can drop a set of variables using the -(var1:var2) format:
nhanes %>%
select(-(id:education))
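Ranges and drops depend on the order of the columns in the data frame, which is easy to overlook. A small sketch with a hypothetical five-column data frame:

```r
library(dplyr)

# Hypothetical data frame; the column order matters for the var1:var2 pattern
toy <- tibble(
  id = 1:2,
  gender = c("female", "male"),
  weight = c(60.2, 81.5),
  height = c(165.1, 180.3),
  bmi = c(22.1, 25.1)
)

range_cols   <- toy %>% select(weight:bmi)     # weight through bmi, inclusive
no_id        <- toy %>% select(-id)            # everything except id
no_id_gender <- toy %>% select(-(id:gender))   # everything except id through gender

names(range_cols)    # "weight" "height" "bmi"
names(no_id)         # "gender" "weight" "height" "bmi"
names(no_id_gender)  # "weight" "height" "bmi"
```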
We use mutate() to make new variables or change existing ones.
We can use mutate() in three ways:
Create a new variable with a specific value
nhanes %>%
mutate(country = "United States") %>%
select(country)
Create a new variable based on other variables
nhanes %>%
mutate(height_inches = height / 2.54) %>%
select(contains("height"))
Change an existing variable
nhanes %>%
mutate(bmi = round(bmi, digits = 1)) %>%
select(bmi)
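The three uses can also be combined in a single mutate() call. This is a sketch on made-up values (toy is hypothetical), with heights chosen so the conversion to inches is easy to check by hand.

```r
library(dplyr)

# Hypothetical heights in centimeters and unrounded BMI values
toy <- tibble(
  height = c(152.4, 177.8),
  bmi = c(24.37, 31.02)
)

toy_mutated <- toy %>%
  mutate(
    country = "United States",       # new variable with one fixed value
    height_inches = height / 2.54,   # new variable based on another variable
    bmi = round(bmi, digits = 1)     # existing variable, overwritten in place
  )

toy_mutated
```

New variables are added at the end of the data frame, while an overwritten variable such as bmi keeps its original position.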
We use filter() to choose a subset of observations.
With comparison and logical operators, we can create complex filters (e.g. keep only those who say their health is “good”, “very good”, or “excellent”).
We use == to keep only the observations that equal a given value.
nhanes %>%
filter(gender == "female") %>%
select(gender)
We use != to keep only the observations that don’t equal a given value.
nhanes %>%
filter(health_gen != "Good") %>%
select(health_gen)
We can combine comparisons and logical operators.
nhanes %>%
filter(health_gen == "Good" | health_gen == "Vgood" | health_gen == "Excellent") %>%
select(health_gen)
We can use %in% to collapse multiple comparisons into one.
nhanes %>%
filter(health_gen %in% c("Good", "Vgood", "Excellent")) %>%
select(health_gen)
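One detail worth knowing about these comparisons is how they treat missing values: filter() keeps only rows where the condition is TRUE, so rows where the comparison evaluates to NA are dropped. A sketch on a made-up data frame (toy is hypothetical):

```r
library(dplyr)

# Hypothetical responses, including one missing value
toy <- tibble(
  id = 1:4,
  health_gen = c("Good", "Poor", NA, "Excellent")
)

# NA != "Good" evaluates to NA, so the row with id 3 is dropped
not_good <- toy %>%
  filter(health_gen != "Good")

# %in% returns FALSE for NA, so the row with id 3 is dropped here too
good_or_better <- toy %>%
  filter(health_gen %in% c("Good", "Vgood", "Excellent"))

# To keep missing responses, ask for them explicitly
not_good_or_missing <- toy %>%
  filter(health_gen != "Good" | is.na(health_gen))

not_good$id             # 2 4
good_or_better$id       # 1 4
not_good_or_missing$id  # 2 3 4
```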
We can chain together multiple filter() functions. Doing it this way, we don’t have to create complex logic in one line.
nhanes %>%
filter(gender == "male" & (health_gen == "Good" | health_gen == "Vgood" | health_gen == "Excellent")) %>%
select(gender, health_gen)
nhanes %>%
filter(gender == "male") %>%
filter(health_gen %in% c("Good", "Vgood", "Excellent")) %>%
select(gender, health_gen)
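The single-call and chained versions return the same rows. A quick way to convince yourself, sketched on a made-up data frame (toy and good_levels are hypothetical names):

```r
library(dplyr)

# Hypothetical data frame
toy <- tibble(
  gender = c("male", "male", "female", "male"),
  health_gen = c("Good", "Poor", "Excellent", "Vgood")
)
good_levels <- c("Good", "Vgood", "Excellent")

# All the logic in one condition
one_call <- toy %>%
  filter(gender == "male" & health_gen %in% good_levels)

# One condition per filter() call
chained <- toy %>%
  filter(gender == "male") %>%
  filter(health_gen %in% good_levels)

# Conditions separated by commas are combined with "and" as well
with_commas <- toy %>%
  filter(gender == "male", health_gen %in% good_levels)

all.equal(one_call, chained)      # TRUE
all.equal(one_call, with_commas)  # TRUE
```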
We can use <, >, <=, and >= for numeric data.
nhanes %>%
filter(age > 50) %>%
select(age)
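The difference between > and >= only shows up at the boundary value. A tiny sketch with made-up ages (toy is hypothetical):

```r
library(dplyr)

# Hypothetical ages on either side of the cutoff
toy <- tibble(age = c(49, 50, 51))

over_50     <- toy %>% filter(age > 50)    # strictly greater than 50
at_least_50 <- toy %>% filter(age >= 50)   # 50 or greater

over_50$age      # 51
at_least_50$age  # 50 51
```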
We can drop NAs with !is.na():
nhanes %>%
filter(age > 50) %>%
filter(!is.na(marital_status)) %>%
select(age, marital_status)
We can also drop NAs with drop_na():
nhanes %>%
filter(age > 50) %>%
drop_na(marital_status) %>%
select(age, marital_status)
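The two approaches give the same result when drop_na() is given a variable. Called with no variables, drop_na() is stricter and removes every row that has an NA in any column. A sketch on a made-up data frame (toy is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame with missing values in two different columns
toy <- tibble(
  age = c(55, 62, 71),
  marital_status = c("Married", NA, "Widowed"),
  education = c(NA, "High School", "College Grad")
)

via_is_na   <- toy %>% filter(!is.na(marital_status))  # drops the second row
via_drop_na <- toy %>% drop_na(marital_status)         # same rows as above
complete    <- toy %>% drop_na()                       # keeps only the third row

nrow(via_is_na)    # 2
nrow(via_drop_na)  # 2
nrow(complete)     # 1
```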
With summarize(), we can go from a complete dataset down to a summary.
We use summary functions such as mean(), median(), and n() inside summarize().
This doesn’t work! Notice that the result is NA, because phys_active_days contains missing values.
nhanes %>%
summarize(mean_active_days = mean(phys_active_days))
We need to add na.rm = TRUE to tell R to drop NA values.
nhanes %>%
summarize(mean_active_days = mean(phys_active_days,
na.rm = TRUE))
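The effect of na.rm = TRUE is easiest to see on a handful of values. A minimal sketch with a made-up column (toy is hypothetical):

```r
library(dplyr)

# Hypothetical values with one missing response
toy <- tibble(phys_active_days = c(3, NA, 5))

# A single NA makes the mean NA
without_na_rm <- toy %>%
  summarize(mean_active_days = mean(phys_active_days))

# na.rm = TRUE drops the NA first, leaving the mean of 3 and 5
with_na_rm <- toy %>%
  summarize(mean_active_days = mean(phys_active_days, na.rm = TRUE))

without_na_rm$mean_active_days  # NA
with_na_rm$mean_active_days     # 4
```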
We can calculate several summaries in a single call to summarize().
nhanes %>%
summarize(mean_active_days = mean(phys_active_days, na.rm = TRUE),
median_active_days = median(phys_active_days, na.rm = TRUE),
number_of_responses = n())
summarize() becomes truly powerful when paired with group_by(), which enables us to perform calculations on each group.
nhanes %>%
group_by(age_decade) %>%
summarize(mean_active_days = mean(phys_active_days,
na.rm = TRUE))
We can use group_by() with multiple groups.
nhanes %>%
group_by(age_decade, gender) %>%
summarize(mean_active_days = mean(phys_active_days,
na.rm = TRUE))
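A small made-up example makes it possible to check the grouped means by hand (toy is hypothetical). After summarizing by more than one group, the result is still grouped by the first variable, so this sketch adds ungroup() at the end; that step is an optional habit, not something the examples above require.

```r
library(dplyr)

# Hypothetical data frame: two age groups, one missing response
toy <- tibble(
  age_decade = c("20-29", "20-29", "30-39", "30-39"),
  gender = c("female", "male", "male", "male"),
  phys_active_days = c(2, 4, NA, 6)
)

# One row per age group
by_decade <- toy %>%
  group_by(age_decade) %>%
  summarize(mean_active_days = mean(phys_active_days, na.rm = TRUE))

# One row per combination of age group and gender that occurs in the data
by_decade_gender <- toy %>%
  group_by(age_decade, gender) %>%
  summarize(mean_active_days = mean(phys_active_days, na.rm = TRUE)) %>%
  ungroup()

by_decade$mean_active_days         # 3 6
by_decade_gender$mean_active_days  # 2 4 6
```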
If we just want to count the number of things per group, we can use count().
nhanes %>%
count(age_decade)
We can also count by multiple groups.
nhanes %>%
count(age_decade, gender)
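count() is a shortcut for a pattern we have already seen: group_by() followed by summarize() with n(). A sketch on a made-up data frame (toy is hypothetical) showing that both give the same counts:

```r
library(dplyr)

# Hypothetical data frame
toy <- tibble(
  age_decade = c("20-29", "20-29", "30-39"),
  gender = c("female", "male", "male")
)

# The shortcut
counted <- toy %>%
  count(age_decade)

# The longer equivalent
by_hand <- toy %>%
  group_by(age_decade) %>%
  summarize(n = n())

counted$n  # 2 1
by_hand$n  # 2 1
```

Adding sort = TRUE to count() puts the largest groups first.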
With arrange(), we can reorder rows in a data frame based on the values of one or more variables. R arranges in ascending order by default.
nhanes %>%
arrange(age)
We can also arrange in descending order using desc().
nhanes %>%
arrange(desc(age))
We often use arrange() at the end of chains to display things in order.
nhanes %>%
group_by(age_decade, gender) %>%
summarize(mean_active_days = mean(phys_active_days,
na.rm = TRUE)) %>%
arrange(mean_active_days)
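Two behaviors of arrange() are worth seeing on a small example: sorting by more than one variable, and the fact that missing values are placed last whether the order is ascending or descending. This sketch uses a made-up data frame (toy is hypothetical).

```r
library(dplyr)

# Hypothetical data frame with one missing age
toy <- tibble(
  gender = c("male", "female", "male", "female"),
  age = c(40, NA, 25, 60)
)

# Descending by age; the NA still ends up at the bottom
oldest_first <- toy %>%
  arrange(desc(age))

# Sort by gender first, then by descending age within each gender
by_gender_then_age <- toy %>%
  arrange(gender, desc(age))

oldest_first$age        # 60 40 25 NA
by_gender_then_age$age  # 60 NA 40 25
```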
Sometimes you want to save the results of your work to a new data frame.
female_height_inches_by_age <- nhanes %>%
filter(gender == "female") %>%
mutate(height_inches = height / 2.54) %>%
group_by(age_decade) %>%
summarize(height_inches = mean(height_inches,
na.rm = TRUE)) %>%
drop_na()
female_height_inches_by_age
Sometimes you want your results in a crosstab. We can use the tabyl() function in the janitor package to make crosstabs automatically.
nhanes %>%
tabyl(gender, age_decade)
The janitor package has a set of functions that all start with adorn_ and add things to our crosstabs. We call them after tabyl(). For example, adorn_totals() adds totals:
nhanes %>%
tabyl(gender, age_decade) %>%
adorn_totals(c("row", "col"))
We can use adorn_percentages() to turn the counts into percentages.
nhanes %>%
tabyl(gender, age_decade) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages()
We can then format these percentages using adorn_pct_formatting().
nhanes %>%
tabyl(gender, age_decade) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages() %>%
adorn_pct_formatting()
If we want to include the counts (n) alongside the percentages, we can use adorn_ns().
nhanes %>%
tabyl(gender, age_decade) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages() %>%
adorn_pct_formatting() %>%
adorn_ns()
We can add titles to our crosstabs using adorn_title().
nhanes %>%
tabyl(gender, age_decade) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages() %>%
adorn_pct_formatting() %>%
adorn_ns() %>%
adorn_title()
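With only five made-up rows, the numbers behind the adorn_ steps can be checked by hand. This sketch (toy is hypothetical) stops before the formatting steps so the values stay numeric. By default adorn_percentages() divides each cell by its row total; the denominator argument switches that to "col" or "all".

```r
library(dplyr)
library(janitor)

# Hypothetical data frame: 2 females and 3 males across two age groups
toy <- tibble(
  gender = c("female", "female", "male", "male", "male"),
  age_decade = c("20-29", "30-39", "20-29", "20-29", "30-39")
)

# Counts plus a Total row and a Total column
counts <- toy %>%
  tabyl(gender, age_decade) %>%
  adorn_totals(c("row", "col"))

# Row percentages (the default): each row sums to 1 before formatting
row_pct <- counts %>%
  adorn_percentages()

# Column percentages instead, here computed without the totals
col_pct <- toy %>%
  tabyl(gender, age_decade) %>%
  adorn_percentages(denominator = "col")

counts
row_pct
col_pct
```

In row_pct, the female row shows 0.5 for each age group (1 of 2), while in col_pct the male value for 20-29 is about 0.67 (2 of 3).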
We can also make three-way crosstabs automatically by adding a third variable to the tabyl() function.
nhanes %>%
tabyl(gender, age_decade, education) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages() %>%
adorn_pct_formatting() %>%
adorn_ns() %>%
adorn_title(placement = "combined")
## Warning: Factor `age_decade` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `age_decade` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `age_decade` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `age_decade` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `age_decade` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `age_decade` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## $`8th Grade`
## gender/age_decade 0-9 10-19 20-29 30-39 40-49
## female 0.0% (0) 0.0% (0) 9.1% (19) 18.7% (39) 16.3% (34)
## male 0.0% (0) 0.0% (0) 7.4% (18) 14.0% (34) 22.3% (54)
## Total 0.0% (0) 0.0% (0) 8.2% (37) 16.2% (73) 19.5% (88)
## 50-59 60-69 70+ NA_ Total
## 15.3% (32) 12.4% (26) 17.7% (37) 10.5% (22) 100.0% (209)
## 14.0% (34) 16.9% (41) 10.3% (25) 14.9% (36) 100.0% (242)
## 14.6% (66) 14.9% (67) 13.7% (62) 12.9% (58) 100.0% (451)
##
## $`9 - 11th Grade`
## gender/age_decade 0-9 10-19 20-29 30-39 40-49
## female 0.0% (0) 0.0% (0) 17.9% (72) 15.2% (61) 14.9% (60)
## male 0.0% (0) 0.0% (0) 20.6% (100) 16.7% (81) 22.2% (108)
## Total 0.0% (0) 0.0% (0) 19.4% (172) 16.0% (142) 18.9% (168)
## 50-59 60-69 70+ NA_ Total
## 18.7% (75) 12.4% (50) 12.9% (52) 8.0% (32) 100.0% (402)
## 18.3% (89) 9.7% (47) 9.1% (44) 3.5% (17) 100.0% (486)
## 18.5% (164) 10.9% (97) 10.8% (96) 5.5% (49) 100.0% (888)
##
## $`College Grad`
## gender/age_decade 0-9 10-19 20-29 30-39 40-49
## female 0.0% (0) 0.0% (0) 14.8% (163) 21.2% (233) 25.7% (282)
## male 0.0% (0) 0.0% (0) 13.0% (130) 21.4% (214) 19.9% (199)
## Total 0.0% (0) 0.0% (0) 14.0% (293) 21.3% (447) 22.9% (481)
## 50-59 60-69 70+ NA_ Total
## 19.7% (217) 10.9% (120) 5.2% (57) 2.5% (27) 100.0% (1099)
## 20.0% (200) 16.0% (160) 6.4% (64) 3.2% (32) 100.0% (999)
## 19.9% (417) 13.3% (280) 5.8% (121) 2.8% (59) 100.0% (2098)
##
## $`High School`
## gender/age_decade 0-9 10-19 20-29 30-39 40-49
## female 0.0% (0) 0.0% (0) 20.3% (156) 13.6% (105) 17.5% (135)
## male 0.0% (0) 0.0% (0) 21.0% (157) 15.7% (117) 22.5% (168)
## Total 0.0% (0) 0.0% (0) 20.6% (313) 14.6% (222) 20.0% (303)
## 50-59 60-69 70+ NA_ Total
## 15.1% (116) 13.8% (106) 12.5% (96) 7.3% (56) 100.0% (770)
## 20.7% (155) 10.0% (75) 5.9% (44) 4.1% (31) 100.0% (747)
## 17.9% (271) 11.9% (181) 9.2% (140) 5.7% (87) 100.0% (1517)
##
## $`Some College`
## gender/age_decade 0-9 10-19 20-29 30-39 40-49
## female 0.0% (0) 0.0% (0) 22.6% (271) 20.0% (239) 14.0% (167)
## male 0.0% (0) 0.0% (0) 24.9% (266) 19.9% (213) 17.6% (188)
## Total 0.0% (0) 0.0% (0) 23.7% (537) 19.9% (452) 15.7% (355)
## 50-59 60-69 70+ NA_ Total
## 15.3% (183) 14.8% (177) 8.7% (104) 4.7% (56) 100.0% (1197)
## 19.0% (203) 10.8% (116) 5.8% (62) 2.1% (22) 100.0% (1070)
## 17.0% (386) 12.9% (293) 7.3% (166) 3.4% (78) 100.0% (2267)
##
## $NA_
## gender/age_decade 0-9 10-19 20-29 30-39 40-49
## female 48.6% (653) 50.9% (684) 0.0% (0) 0.0% (0) 0.2% (3)
## male 51.4% (738) 48.1% (690) 0.3% (4) 0.1% (2) 0.0% (0)
## Total 50.1% (1391) 49.4% (1374) 0.1% (4) 0.1% (2) 0.1% (3)
## 50-59 60-69 70+ NA_ Total
## 0.0% (0) 0.1% (1) 0.1% (2) 0.0% (0) 100.0% (1343)
## 0.0% (0) 0.0% (0) 0.0% (0) 0.1% (2) 100.0% (1436)
## 0.0% (0) 0.0% (1) 0.1% (2) 0.1% (2) 100.0% (2779)
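As the $ headings in the output above suggest, a three-way tabyl() returns a list with one crosstab per value of the third variable. That means a single table can be pulled out by name. A sketch on a made-up data frame (toy is hypothetical):

```r
library(dplyr)
library(janitor)

# Hypothetical data frame with two education levels
toy <- tibble(
  gender = c("female", "male", "male", "female"),
  age_decade = c("20-29", "20-29", "30-39", "30-39"),
  education = c("College Grad", "College Grad", "High School", "High School")
)

three_way <- toy %>%
  tabyl(gender, age_decade, education)

# One list element per education level
names(three_way)

# Backticks are needed because the element name contains a space
college <- three_way$`College Grad`
college
```

Each element is an ordinary two-way crosstab, so it can be printed or passed along on its own.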