The examples are part of the Fundamentals of R course. For more, see the R for the Rest of Us website.
Let’s load the packages we need. These include tidyverse (especially the dplyr package) and janitor.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1.9000 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3.9000 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Let’s import our data using read_csv. Note that the NHANES data is in the data directory, so we need to include that directory in the file path.
nhanes <- read_csv("data/nhanes.csv") %>%
clean_names()
## Parsed with column specification:
## cols(
## .default = col_character(),
## ID = col_double(),
## Age = col_double(),
## Weight = col_double(),
## Height = col_double(),
## BMI = col_double(),
## DaysPhysHlthBad = col_double(),
## DaysMentHlthBad = col_double(),
## SleepHrsNight = col_double(),
## PhysActiveDays = col_double(),
## TVHrsDay = col_logical()
## )
## See spec(...) for full column specifications.
## Warning: 4859 parsing failures.
## row col expected actual file
## 5001 TVHrsDay 1/0/T/F/TRUE/FALSE 2_hr 'data/nhanes.csv'
## 5002 TVHrsDay 1/0/T/F/TRUE/FALSE More_4_hr 'data/nhanes.csv'
## 5003 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr 'data/nhanes.csv'
## 5004 TVHrsDay 1/0/T/F/TRUE/FALSE 4_hr 'data/nhanes.csv'
## 5005 TVHrsDay 1/0/T/F/TRUE/FALSE 1_hr 'data/nhanes.csv'
## .... ........ .................. ......... .................
## See problems(...) for more details.
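The parsing failures above most likely arise because read_csv guessed that TVHrsDay was a logical column from the rows it inspected, then met text values such as 2_hr further down. One way to avoid this, sketched here under the assumption that TVHrsDay should be read as text, is to declare the column type with col_types. The small temporary file and the tv object are hypothetical stand-ins for data/nhanes.csv.

```r
library(readr)

# Hypothetical miniature file standing in for data/nhanes.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,TVHrsDay",
             "1,NA",
             "2,2_hr",
             "3,More_4_hr"),
           csv_path)

# Declare the column types up front instead of letting readr guess them
tv <- read_csv(csv_path,
               col_types = cols(ID = col_double(),
                                TVHrsDay = col_character()))

# The text values are kept instead of being turned into parsing failures
tv$TVHrsDay
```

Columns not named in cols() are still guessed, so only the problem column needs to be listed.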
We use geom_point to make a scatterplot.
ggplot(data = nhanes,
mapping = aes(x = age,
y = height)) +
geom_point()
## Warning: Removed 353 rows containing missing values (geom_point).
Let’s take a look at what’s going on here.
We use geom_histogram to make a histogram.
ggplot(data = nhanes,
mapping = aes(x = height)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 353 rows containing non-finite values (stat_bin).
How does ggplot know what to plot on the y axis? It’s using the default statistical transformation for geom_histogram, which is stat = "bin".
If we add stat = "bin" we get the same thing. Each geom has a default stat.
ggplot(data = nhanes,
mapping = aes(x = height)) +
geom_histogram(stat = "bin")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 353 rows containing non-finite values (stat_bin).
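To check that the default and the explicit stat really do the same work, we can compare the values ggplot computes for each layer. This is a minimal sketch using a small made-up heights data frame (an assumption, not the NHANES data) and layer_data, which returns the data a layer ends up plotting.

```r
library(ggplot2)

# Hypothetical heights in centimetres, standing in for nhanes$height
heights <- data.frame(height = c(150, 152, 160, 161, 163, 170, 171, 180))

# Histogram relying on the default stat
p_default <- ggplot(data = heights,
                    mapping = aes(x = height)) +
  geom_histogram(bins = 5)

# Histogram with the stat spelled out
p_explicit <- ggplot(data = heights,
                     mapping = aes(x = height)) +
  geom_histogram(stat = "bin", bins = 5)

# Compare the bin counts that each layer computed
same_counts <- identical(layer_data(p_default)$count,
                         layer_data(p_explicit)$count)
same_counts
```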
We can adjust the number of bins using the bins argument.
ggplot(data = nhanes,
mapping = aes(x = height)) +
geom_histogram(bins = 100)
## Warning: Removed 353 rows containing non-finite values (stat_bin).
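The message printed earlier suggests binwidth as an alternative to bins. The sketch below assumes a small made-up heights data frame and a bin width of 10 cm, both chosen only for illustration.

```r
library(ggplot2)

# Hypothetical heights in centimetres, standing in for nhanes$height
heights <- data.frame(height = c(150, 152, 160, 161, 163, 170, 171, 180))

# Each bar spans 10 cm rather than a fixed share of the data range
p_binwidth <- ggplot(data = heights,
                     mapping = aes(x = height)) +
  geom_histogram(binwidth = 10)

# Inspect the edges and counts of the bins ggplot computed
bins <- layer_data(p_binwidth)
bins[, c("xmin", "xmax", "count")]
```

Setting binwidth ties the bars to the units of the variable, which can be easier to explain than a bin count.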
There are two basic approaches to making bar charts, both of which use geom_bar.
Approach #1
Use your full dataset.
Only assign a variable to the x axis.
Let ggplot use the default stat transformation (stat = "count") to generate counts that it then plots on the y axis.
Approach #2
Wrangle your data frame before plotting, possibly creating a new data frame in the process.
Assign variables to the x and y axes.
Use stat = "identity" to tell ggplot to use the data exactly as it is.
ggplot(data = nhanes,
mapping = aes(x = height)) +
geom_bar()
## Warning: Removed 353 rows containing non-finite values (stat_count).
The default statistical transformation for geom_bar is count. This will give us the same result as our previous plot.
ggplot(data = nhanes,
mapping = aes(x = height)) +
geom_bar(stat = "count")
## Warning: Removed 353 rows containing non-finite values (stat_count).
Here’s what’s going on.
It’s often easier to do our analysis work, save a data frame, and then use this to plot.
Let’s recreate our female_height_inches_by_age data frame.
female_height_inches_by_age <- nhanes %>%
filter(gender == "female") %>%
mutate(height_inches = height / 2.54) %>%
group_by(age_decade) %>%
summarize(height_inches = mean(height_inches,
na.rm = TRUE)) %>%
drop_na(age_decade)
female_height_inches_by_age
Then let’s use this data frame to make a bar chart. The stat = "identity" here tells ggplot to use the exact data points without any stat transformations.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity")
We can also flip the x and y axes using coord_flip.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity") +
coord_flip()
We can also use geom_col, which uses stat = "identity" by default.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_col() +
coord_flip()
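The two approaches to bar charts described earlier can be compared side by side. In this sketch, the people data frame is made up for illustration; approach #1 lets geom_bar count the rows, while approach #2 counts them first with dplyr’s count and then plots the result with geom_col.

```r
library(ggplot2)
library(dplyr)

# Hypothetical raw data: one row per person
people <- tibble(gender = c("female", "male", "female", "female", "male"))

# Approach #1: full dataset, x only, counts computed by the default stat
p_raw <- ggplot(data = people,
                mapping = aes(x = gender)) +
  geom_bar()

# Approach #2: count first, then plot the values exactly as they are
gender_counts <- count(people, gender)

p_summary <- ggplot(data = gender_counts,
                    mapping = aes(x = gender,
                                  y = n)) +
  geom_col()

# Both layers end up with the same bar heights
raw_heights <- layer_data(p_raw)$y
summary_heights <- layer_data(p_summary)$y
```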
color and fill
We add the color argument within aes so that the data in that variable is mapped to that aesthetic property.
ggplot(data = nhanes,
mapping = aes(x = age,
y = height,
color = gender)) +
geom_point()
## Warning: Removed 353 rows containing missing values (geom_point).
Note that each option in the gender variable (male and female) is mapped to a color (male = teal, female = red).
Let’s try the same thing with a bar chart.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
color = age_decade)) +
geom_bar(stat = "identity")
That didn’t work! For bars, color only changes the outline, so let’s try fill instead.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = age_decade)) +
geom_bar(stat = "identity")
We can change which colors the data is mapped to by using a scale_ function.
ggplot(data = nhanes,
mapping = aes(x = age,
y = height,
color = gender)) +
geom_point() +
scale_color_manual(values = c("purple", "orange"))
## Warning: Removed 353 rows containing missing values (geom_point).
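When the colors are given as an unnamed vector, they are assigned to the levels in order. A hedged variation is to name the values so that each level is tied to a specific color. The people data frame below is made up for illustration.

```r
library(ggplot2)

# Hypothetical data standing in for nhanes
people <- data.frame(age    = c(10, 20, 30),
                     height = c(140, 175, 162),
                     gender = c("female", "male", "female"))

# Naming the values makes the level-to-color pairing explicit
p_named <- ggplot(data = people,
                  mapping = aes(x = age,
                                y = height,
                                color = gender)) +
  geom_point() +
  scale_color_manual(values = c(female = "purple",
                                male = "orange"))

# Colors that ggplot assigned to each point
point_colors <- layer_data(p_named)$colour
```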
We can also use built-in palettes like scale_color_viridis_d (the d means it’s for discrete data).
ggplot(data = nhanes,
mapping = aes(x = age,
y = height,
color = gender)) +
geom_point() +
scale_color_viridis_d(option = "plasma")
## Warning: Removed 353 rows containing missing values (geom_point).
Adjusting our x and y axes is similar. Remember that the x and y axes are considered aesthetic properties in the same way color is.
We adjust our x and y axes using the scale_ set of functions. Which exact function you use depends on your data. For example, you would use scale_y_continuous if you have continuous data on the y axis.
The limits argument sets the minimum and maximum values that display.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_col() +
scale_y_continuous(limits = c(0, 75))
The breaks argument determines which axis labels show up.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_col() +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75))
If we want to change the x axis labels, we’d need to use scale_x_discrete because that data is categorical. I’m adding a coord_flip here to make it easier to read.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_col() +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
scale_x_discrete(labels = c("Zero to Nine",
"Ten to Nineteen",
"Twenty to Twenty-Nine",
"Thirty to Thirty-Nine",
"Forty to Forty-Nine",
"Fifty to Fifty-Nine",
"Sixty to Sixty-Nine",
"Seventy to Seventy-Nine",
"Seventy and Above")) +
coord_flip()
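Labels supplied as a plain vector are matched to the categories by position, so they have to be listed in the same order as the levels. A hedged alternative is a named vector, where each name is an existing category and each value is the label to show. The data frame and the two age groups below are made up for illustration.

```r
library(ggplot2)

# Hypothetical summary data with two age groups
heights_by_age <- data.frame(age_decade    = c("0-9", "10-19"),
                             height_inches = c(48, 62))

# Names are the existing categories, values are the labels to display
p_labels <- ggplot(data = heights_by_age,
                   mapping = aes(x = age_decade,
                                 y = height_inches)) +
  geom_col() +
  scale_x_discrete(labels = c("0-9"   = "Zero to Nine",
                              "10-19" = "Ten to Nineteen")) +
  coord_flip()

# Building the layer data confirms the plot can be constructed
bars <- layer_data(p_labels)
```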
Text is just another geom. For example, we use geom_text to add labels to our figures.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity") +
geom_text(aes(label = height_inches))
Let’s add a new variable called height_inches_one_digit to use for plotting.
female_height_inches_by_age <- female_height_inches_by_age %>%
mutate(height_inches_one_digit = round(height_inches, 1))
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity") +
geom_text(aes(label = height_inches_one_digit))
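As a variation on creating a new variable, the rounding can also be done inside aes, since aesthetics accept expressions. This is a sketch with a made-up two-row data frame rather than the course data.

```r
library(ggplot2)

# Hypothetical summary data with unrounded heights
female_heights <- data.frame(age_decade    = c("20-29", "30-39"),
                             height_inches = c(63.456, 64.049))

# round() is applied when the label aesthetic is evaluated
p_rounded <- ggplot(data = female_heights,
                    mapping = aes(x = age_decade,
                                  y = height_inches)) +
  geom_col() +
  geom_text(aes(label = round(height_inches, 1)))

# Labels used by the second layer (the text layer)
labels_shown <- layer_data(p_rounded, 2)$label
```

Creating the variable up front, as above, keeps the plotting code shorter when the same labels are reused across several plots.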
We can use the hjust and vjust arguments to horizontally and vertically adjust text.
vjust = 0 puts the labels on the outer edge of the bars.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity") +
geom_text(aes(label = height_inches_one_digit),
vjust = 0)
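vjust = 1 puts the labels at the inner edge of the bars. When the axes are flipped with coord_flip, the bars run horizontally, so we adjust with hjust instead; hjust = 0 starts each label at the end of its bar.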
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity") +
geom_text(aes(label = height_inches_one_digit),
hjust = 0) +
coord_flip()
I often do something like vjust = 1.5 to give a bit more padding.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity") +
geom_text(aes(label = height_inches_one_digit),
vjust = 1.5)
We can adjust the color of the text using the color argument. We’re putting it outside of the aes because we are setting it for the whole layer.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity") +
geom_text(aes(label = height_inches_one_digit),
vjust = 1.5,
color = "white")
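The difference between mapping a color inside aes and setting it outside aes can be seen by looking at the colors each layer ends up with. This is a minimal sketch with a made-up people data frame.

```r
library(ggplot2)

# Hypothetical data standing in for nhanes
people <- data.frame(age    = c(10, 20, 30, 40),
                     height = c(140, 165, 170, 160),
                     gender = c("female", "male", "male", "female"))

# Mapped: color depends on the values of a variable
p_mapped <- ggplot(data = people,
                   mapping = aes(x = age,
                                 y = height,
                                 color = gender)) +
  geom_point()

# Set: one fixed color for the whole layer
p_set <- ggplot(data = people,
                mapping = aes(x = age,
                              y = height)) +
  geom_point(color = "purple")

mapped_colors <- unique(layer_data(p_mapped)$colour)
set_colors <- unique(layer_data(p_set)$colour)
```

The mapped version yields one color per gender and a legend, while the set version uses a single color and needs no legend.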
geom_label is nearly identical but it adds a background. With geom_label the color argument determines the text color while the fill is the background color.
ggplot(data = female_height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches)) +
geom_bar(stat = "identity") +
geom_label(aes(label = height_inches_one_digit),
vjust = 1.5,
fill = "white",
color = "blue")
Let’s make a slightly more complicated bar chart. We’ll start by creating a new data frame.
height_inches_by_age <- nhanes %>%
mutate(height_inches = height / 2.54) %>%
group_by(age_decade, gender) %>%
summarize(height_inches = mean(height_inches,
na.rm = TRUE)) %>%
drop_na(age_decade)
Then let’s take a look at our new data frame.
height_inches_by_age
Now let’s plot this data frame using a bar chart.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col()
The bars are stacked by default. To put them side by side, we use the position = "dodge" argument within geom_col.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge")
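To see what position = "dodge" changes, we can compare where the bars are placed with and without it. The sketch below uses a made-up data frame with two age groups and two genders.

```r
library(ggplot2)

# Hypothetical summary data: two age groups by two genders
heights <- data.frame(age_decade    = c("0-9", "0-9", "10-19", "10-19"),
                      gender        = c("female", "male", "female", "male"),
                      height_inches = c(48, 47, 62, 64))

base_plot <- ggplot(data = heights,
                    mapping = aes(x = age_decade,
                                  y = height_inches,
                                  fill = gender))

p_stack <- base_plot + geom_col()                    # default: stacked
p_dodge <- base_plot + geom_col(position = "dodge")  # side by side

# Stacked bars share a left edge per age group; dodged bars do not
stack_edges <- unique(layer_data(p_stack)$xmin)
dodge_edges <- unique(layer_data(p_dodge)$xmin)
```

Stacking adds the heights together, which is rarely meaningful for averages, so dodging is usually the better fit for data like this.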
To add labels to our plot, we use labs.
We can add a title to the plot with the title argument.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages")
We can add a subtitle as well.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9")
We can change the x and y axis labels using the x and y arguments.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches")
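To change the title above the legend, we use the name of the aesthetic that is being shown as an argument to labs. Our legend shows the fill aesthetic, so setting fill = "" removes its title.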
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches",
fill = "")
To add a theme to a plot, we use the theme_ set of functions. There are several built-in themes, for instance theme_minimal.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches",
fill = "") +
theme_minimal()
There’s also theme_light.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches",
fill = "") +
theme_light()
There are also packages that give you themes you can apply to your plots.
Let’s load the ggthemes package (install it if necessary).
# install.packages("ggthemes")
library(ggthemes)
We can then use a theme from this package (theme_economist) to make our plots look like those in the Economist.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches",
fill = "") +
theme_economist()
Another option is theme_gdocs(), which makes your plots look like those made in Google Sheets.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches",
fill = "") +
theme_gdocs()
One of the most powerful features of ggplot is faceting. You can make small multiples by adding just a line of code using the facet_wrap function.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge") +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches",
fill = "") +
theme_economist() +
facet_wrap(~gender)
Let’s drop the legend since it’s redundant at this point. We do this by adding the show.legend = FALSE argument within geom_col.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge",
show.legend = FALSE) +
scale_y_continuous(limits = c(0, 75),
breaks = c(0, 25, 50, 75)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches",
fill = "") +
theme_economist() +
facet_wrap(~gender)
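facet_wrap also lets us control how the panels are arranged. As a hedged sketch, the ncol argument below stacks the panels in a single column. The data frame is made up for illustration.

```r
library(ggplot2)

# Hypothetical summary data: two age groups by two genders
heights <- data.frame(age_decade    = c("0-9", "0-9", "10-19", "10-19"),
                      gender        = c("female", "male", "female", "male"),
                      height_inches = c(48, 47, 62, 64))

# ncol = 1 places the gender panels on top of each other
p_facets <- ggplot(data = heights,
                   mapping = aes(x = age_decade,
                                 y = height_inches,
                                 fill = gender)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~gender, ncol = 1)

# Each row of the layer data records which panel it belongs to
panels <- unique(layer_data(p_facets)$PANEL)
```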
We can do this for any type of figure. Recall our scatterplot from before (with a nice theme added).
ggplot(data = nhanes,
mapping = aes(x = age,
y = height,
color = gender)) +
geom_point() +
theme_economist() +
facet_wrap(~gender)
## Warning: Removed 353 rows containing missing values (geom_point).
Or our histogram.
ggplot(data = nhanes,
mapping = aes(x = height,
fill = gender)) +
geom_histogram() +
theme_economist() +
facet_wrap(~gender)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 353 rows containing non-finite values (stat_bin).
You can use facet_wrap for as many groups as you have in your data.
ggplot(data = nhanes,
mapping = aes(x = age,
y = bmi)) +
geom_point() +
theme_economist() +
facet_wrap(~education)
## Warning: Removed 366 rows containing missing values (geom_point).
There are two ways to think about saving your plots.
If you’re working in RMarkdown, just knit your file and your plots will show up as part of your HTML, Word, or PDF document. Use this option by default!
If you do need to save an individual plot for some other purpose (e.g. putting it in a report not created in RMarkdown), use the ggsave function. By default, ggsave will save the last plot you made.
First, we plot.
ggplot(data = height_inches_by_age,
mapping = aes(x = age_decade,
y = height_inches,
fill = gender)) +
geom_col(position = "dodge",
show.legend = FALSE) +
scale_y_continuous(limits = c(0, 70),
breaks = c(0, 10, 20, 30, 40, 50, 60, 70)) +
labs(title = "Males are taller than females at almost all ages",
subtitle = "But not at 0-9",
x = "Age",
y = "Height in Inches",
fill = "") +
theme_economist() +
facet_wrap(~gender)
And then we save this plot.
ggsave(filename = "plots/age-height.png",
height = 8,
width = 11,
units = "in",
dpi = 300)
We can save our plot to other formats as well. PDF is a great option because it produces small file sizes and high-quality plots. You don’t need to list dpi here as PDFs are vector based.
ggsave(filename = "plots/age-height.pdf",
height = 8,
width = 11)
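Relying on the last plot can be fragile in longer scripts. A minimal sketch of a more explicit approach is to store the plot in an object and pass it to the plot argument of ggsave. The data, the object names, and the temporary plots folder are assumptions for illustration; depending on the ggplot2 version, saving into a folder that does not exist can fail, so the folder is created first.

```r
library(ggplot2)

# Hypothetical summary data standing in for height_inches_by_age
heights <- data.frame(age_decade    = c("0-9", "10-19"),
                      height_inches = c(48, 62))

# Store the plot in an object instead of only printing it
age_height_plot <- ggplot(data = heights,
                          mapping = aes(x = age_decade,
                                        y = height_inches)) +
  geom_col()

# Create the output folder (inside a temporary directory here)
plot_dir <- file.path(tempdir(), "plots")
dir.create(plot_dir, showWarnings = FALSE)
pdf_path <- file.path(plot_dir, "age-height.pdf")

# Save this specific plot rather than whichever plot was made last
ggsave(filename = pdf_path,
       plot = age_height_plot,
       height = 8,
       width = 11)

file.exists(pdf_path)
```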