A Very Short Review of ggplot2

Packages Used
Package	Title	Maintainer	Version	URL
car	Companion to Applied Regression	John Fox <jfox@mcmaster.ca>	3.0-2	https://r-forge.r-project.org/projects/car/
Hmisc	Harrell Miscellaneous	Frank E Harrell Jr <f.harrell@vanderbilt.edu>	4.2-0	http://biostat.mc.vanderbilt.edu/Hmisc
kableExtra	Construct Complex Table with ‘kable’ and Pipe Syntax	Hao Zhu <haozhu233@gmail.com>	1.0.1	http://haozhu233.github.io/kableExtra/
tidyverse	Easily Install and Load the ‘Tidyverse’	Hadley Wickham <hadley@rstudio.com>	1.2.1	http://tidyverse.tidyverse.org
reshape2	Flexibly Reshape Data: A Reboot of the Reshape Package	Hadley Wickham <h.wickham@gmail.com>	1.4.3	NA
plotly	Create Interactive Web Graphics via ‘plotly.js’	Carson Sievert <cpsievert1@gmail.com>	4.8.0	https://plot.ly/r, https://cpsievert.github.io/plotly_book/
sjPlot	Data Visualization for Statistics in Social Science	Daniel Lüdecke <d.luedecke@uke.de>	2.6.2	NA
Stat2Data	Datasets for Stat2	Robin Lock <rlock@stlawu.edu>	2.0.0	NA

Introduction

ggplot2 is a package that describes itself as “a system for declaratively creating graphics”. It is included in the tidyverse package, which also includes dplyr, tidyr, and tibble.

For reference, here is a ggplot2 cheatsheet and a quick reference guide. Also, if you encounter an issue with ggplot2, it is very likely that someone has run into the same issue before and posted about it on StackOverflow, so don’t be afraid to just Google your error message.

This guide will use the FirstYearGPA dataset from Stat2Data.

Frequently Asked Questions

1. Why should I use ggplot2?

Honestly? Because it looks prettier. It also offers more options than base R plots.

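library(Stat2Data) # Provides the FirstYearGPA dataset
library(tidyverse) # Loads ggplot2 (and dplyr, used later in this guide)
data(FirstYearGPA)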
plot(FirstYearGPA$GPA, FirstYearGPA$HSGPA) # Base R scatterplot

ggplot(data=FirstYearGPA, aes(x=GPA, y=HSGPA)) + geom_point() # ggplot2 scatterplot

2. Why is it called ggplot2? What happened to ggplot1?

There was an earlier package called ggplot, which was discontinued in the mid-2000s, although it’s still kind of around as a historical archive. The “2” in ggplot2 reflects that it was significantly updated.

3. Are there alternatives to ggplot2?

Yes, the most obvious being base R plotting. There is also plotly, which has syntax similar to ggplot2.

plot_ly(data=FirstYearGPA, x = ~GPA, y = ~HSGPA, type = "scatter", mode = "markers") # Naming the trace type avoids plotly guessing it

There is also sjPlot, which is particularly good at making APA-style regression tables, but most of its plot functions are based on ggplot2.

plot_scatter(data=FirstYearGPA, x=GPA, y=HSGPA)

Syntax

In base R, there are different commands for different visualization types.

plot(FirstYearGPA$GPA, FirstYearGPA$HSGPA) # Scatterplot

counts <- table(FirstYearGPA$FirstGen)
barplot(counts) # Barplot

hist(FirstYearGPA$GPA) # Histogram

In ggplot2, there is one main command, ggplot(), which is further specified by adding pieces to it with +. Below are breakdowns for different ggplot visualizations. An important part of the syntax is the geom, which adds the layer that actually draws the plot.

geom	Description
geom_point	Scatterplot
geom_bar	Barplot
geom_density	Density Plot
geom_histogram	Histogram
geom_boxplot	Boxplot
geom_dotplot	Dotplot
geom_smooth	Regression Line

# Scatterplot aka geom_point
ggplot( # ggplot command
  data=FirstYearGPA, # Name of dataset 
  aes(x=GPA, y=HSGPA)) + # Your x-axis and y-axis
  geom_point() # Type of plot

# Barplot aka geom_bar
ggplot( # ggplot command
  data=FirstYearGPA, # Name of dataset 
  aes(x=FirstGen)) + # Your x-axis
  geom_bar() # Type of plot

# Density plot aka geom_density
ggplot( # ggplot command
  data=FirstYearGPA, # Name of dataset 
  aes(x=GPA)) + # Your x-axis
  geom_density() # Type of plot

# Histogram aka geom_histogram
ggplot( # ggplot command
  data=FirstYearGPA, # Name of dataset 
  aes(x=GPA)) + # Your x-axis
  geom_histogram(bins=20) # Type of plot

Since the geom part just adds a layer, you can add multiple geoms to one plot.

# geom_smooth(method="lm") fits the linear model itself,
# so no separate lm() call is needed

# Scatterplot aka geom_point, with a regression line aka geom_smooth
ggplot( # ggplot command
  data=FirstYearGPA, # Name of dataset 
  aes(x=HSGPA, y=GPA)) + # Your x-axis (predictor) and y-axis (outcome)
  geom_point() + # Type of plot 
  geom_smooth(method="lm", formula=y ~ x) # Adds the regression line for GPA ~ HSGPA
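
Because each geom is only a layer, a plot can also be stored in a variable and built up one layer at a time. The sketch below is a minimal illustration, not part of the original analysis: it assumes only the six rows that head(FirstYearGPA) prints in this guide, re-typed by hand, and the names gpa_head, base_plot, with_points, and with_line are made up here.

```r
library(ggplot2)

# First six rows of FirstYearGPA, re-typed by hand (illustrative subset only)
gpa_head <- data.frame(
  GPA   = c(3.06, 4.15, 3.41, 3.21, 3.48, 2.95),
  HSGPA = c(3.83, 4.00, 3.70, 3.51, 3.83, 3.25)
)

# Data and aesthetics only: nothing is drawn until a geom is added
base_plot <- ggplot(data = gpa_head, aes(x = HSGPA, y = GPA))

# Each `+` appends one layer to the stored plot
with_points <- base_plot + geom_point()
with_line   <- with_points +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

length(base_plot$layers)  # number of layers before any geom is added
length(with_line$layers)  # number of layers after adding two geoms
with_line                 # printing the object draws the plot
```

Storing the base plot this way makes it easy to try several geoms on the same data without retyping the ggplot() call.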

Examples

Example 1: Barplot

I want to make a barplot that shows GPA divided by gender.

ggplot(data=FirstYearGPA, aes(x=GPA, fill=Male)) + # Data, x-axis, fill aesthetic
  geom_bar() # Barplot

That looks terrible. There are too many distinct GPA values, so almost every student gets a separate bar. Let’s round GPA to the nearest half-point to simplify things.

FirstYearGPA$GPA_round <- round(FirstYearGPA$GPA*2)/2 # Round GPA to the nearest half-point
head(FirstYearGPA)

##    GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
## 1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
## 2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
## 3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
## 4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
## 5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
## 6 2.95  3.25  600  570    0 18.0  3.0        0     1            1
##   GPA_round
## 1       3.0
## 2       4.0
## 3       3.5
## 4       3.0
## 5       3.5
## 6       3.0
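
The rounding trick works because multiplying by 2 turns half-points into whole numbers, which round() can handle, and dividing by 2 converts back. Here is a small check on the six GPAs printed above; the helper name round_half is made up for this sketch.

```r
# Round to the nearest 0.5: scale up, round to a whole number, scale back down
round_half <- function(x) round(x * 2) / 2

gpa <- c(3.06, 4.15, 3.41, 3.21, 3.48, 2.95)  # GPA column from head() above
round_half(gpa)
# [1] 3.0 4.0 3.5 3.0 3.5 3.0
```

The same idea extends to other step sizes, for example round(x * 4) / 4 for quarter-points.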

ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Male)) + # New x-axis
  geom_bar()

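Now we can see that ggplot isn’t filling the bars by gender. This usually happens when the fill variable is stored as a number: ggplot2 then treats it as a continuous value rather than as a set of groups. Let’s double-check the class of the variable Male.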

class(FirstYearGPA$Male)

## [1] "integer"

FirstYearGPA$Male <- as.factor(FirstYearGPA$Male)
class(FirstYearGPA$Male)

## [1] "factor"

ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Male)) + # Now with altered fill aesthetic
  geom_bar()
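
If you would rather leave the column untouched, the conversion can also happen inside aes(). The following is a sketch of that alternative, assuming only the six rows printed by head() above, re-typed into a small data frame; gpa_head and inline_factor_plot are made-up names.

```r
library(ggplot2)

# Six rows from head() above, re-typed by hand (illustrative subset only)
gpa_head <- data.frame(
  GPA_round = c(3.0, 4.0, 3.5, 3.0, 3.5, 3.0),
  Male      = c(1L, 0L, 0L, 0L, 0L, 0L)  # still an integer column
)

# factor() inside aes() makes fill discrete without changing the data frame
inline_factor_plot <- ggplot(data = gpa_head,
                             aes(x = GPA_round, fill = factor(Male))) +
  geom_bar()

class(gpa_head$Male)  # the column itself is still "integer"
inline_factor_plot
```

One trade-off: the legend title then reads factor(Male), which can be renamed with labs(fill = "Male").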

That looks better. But by default, ggplot2 is stacking by gender. I want the barplots to be next to each other and divided by gender.

ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Male)) + 
  geom_bar(position="dodge") # New barplot position

It’s starting to look nice. But there are a few things we can do to improve it:

Change the legend
Rename the x- and y-axis titles
Add a big title at the top

Be aware that altering legends in ggplot2 can be a little tricky. In some cases, it may be easier to just alter the data in the first place. In the FirstYearGPA dataset, gender is stored in the variable “Male”: 0 if female, 1 if male. So let’s create a new gender variable.

FirstYearGPA$Gender <- if_else(FirstYearGPA$Male == 0, "Female", "Male")
head(FirstYearGPA)

##    GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
## 1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
## 2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
## 3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
## 4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
## 5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
## 6 2.95  3.25  600  570    0 18.0  3.0        0     1            1
##   GPA_round Gender
## 1       3.0   Male
## 2       4.0 Female
## 3       3.5 Female
## 4       3.0 Female
## 5       3.5 Female
## 6       3.0 Female

ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Gender)) + # New fill aesthetic
  geom_bar(stat="count", position="dodge")
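
For comparison, the legend can also be relabeled without creating a new column, by passing name and labels to scale_fill_discrete(). This is only a sketch on the six rows printed by head() above (re-typed as the made-up data frame gpa_head). It assumes the labels are given in the same order as the factor levels, so "Female" goes with 0 and "Male" with 1.

```r
library(ggplot2)

# Six rows from head() above, re-typed by hand (illustrative subset only)
gpa_head <- data.frame(
  GPA_round = c(3.0, 4.0, 3.5, 3.0, 3.5, 3.0),
  Male      = factor(c(1, 0, 0, 0, 0, 0))  # levels are "0" and "1"
)

legend_plot <- ggplot(data = gpa_head, aes(x = GPA_round, fill = Male)) +
  geom_bar(position = "dodge") +
  # Labels follow the order of the factor levels: "0" first, then "1"
  scale_fill_discrete(name = "Gender", labels = c("Female", "Male"))

legend_plot
```

Relabeling in the scale keeps the data unchanged, but the mapping from 0/1 to names then lives in the plot code, which is easy to get backwards. That is one reason recoding the data can be the safer route.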

Now let’s add titles.

ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Gender)) + 
  geom_bar(stat="count", position="dodge") +
  xlab("Firstyear GPA") + # x-axis title
  ylab("Number of Students") + # y-axis title
  ggtitle("Firstyear GPA by Gender") # Main title

Be aware that by default, ggplot titles at the top are aligned to the left. This was done so it would be easier to add subtitles to plots. You’ll have to specify if you want it centered.

ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Gender)) + 
  geom_bar(stat="count", position="dodge") +
  xlab("Firstyear GPA") + 
  ylab("Number of Students") + 
  ggtitle("Firstyear GPA by Gender") +
  theme(plot.title = element_text(hjust = 0.5)) # Change theme to center main title
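
The three label calls can also be collapsed into a single labs() call, which accepts a subtitle as well. A minimal sketch, again on the six rows printed by head() above; the data frame name gpa_head and the subtitle text are made up for illustration.

```r
library(ggplot2)

# Six rows from head() above, re-typed by hand (illustrative subset only)
gpa_head <- data.frame(
  GPA_round = c(3.0, 4.0, 3.5, 3.0, 3.5, 3.0),
  Gender    = c("Male", "Female", "Female", "Female", "Female", "Female")
)

titled_plot <- ggplot(data = gpa_head, aes(x = GPA_round, fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(
    x        = "Firstyear GPA",            # same as xlab()
    y        = "Number of Students",       # same as ylab()
    title    = "Firstyear GPA by Gender",  # same as ggtitle()
    subtitle = "First six rows only"       # hypothetical subtitle text
  ) +
  # Center both the title and the subtitle
  theme(
    plot.title    = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

titled_plot
```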

Now it looks nice. So ultimately, we:

Rounded our data so a barplot could better visualize it
Changed our gender variable from an integer to a factor
Specified a dodged barplot (rather than the default stacked)
Created a new gender variable to improve the plot legend
Added titles
Centered the main title

There are still numerous other ways that you could alter the barplot: increase the size of the text, change the color scheme from red/blue to green/purple, change the background color, add error bars, etc. If I demonstrated all these options, the document would never end!
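
As a taste of two of those options, here is a hedged sketch that swaps the fill colors and enlarges the text. It uses the six rows printed by head() above (re-typed as the made-up data frame gpa_head), and the particular color names and text size are arbitrary choices.

```r
library(ggplot2)

# Six rows from head() above, re-typed by hand (illustrative subset only)
gpa_head <- data.frame(
  GPA_round = c(3.0, 4.0, 3.5, 3.0, 3.5, 3.0),
  Gender    = c("Male", "Female", "Female", "Female", "Female", "Female")
)

styled_plot <- ggplot(data = gpa_head, aes(x = GPA_round, fill = Gender)) +
  geom_bar(position = "dodge") +
  # A named vector maps each Gender value to a specific color
  scale_fill_manual(values = c(Female = "purple", Male = "darkgreen")) +
  # Larger base size for all text in the plot
  theme(text = element_text(size = 16))

styled_plot
```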

Example 2: What looks best?

Let’s say I wanted to know more about the students’ SAT Math scores. Since SATM is a numeric variable with lots of variability, I should make a new variable that splits SATM into quartiles and places everyone in the first, second, third, or fourth quartile. The code below does this with base R’s cut() and quantile(); the cut2() function in the Hmisc package is another way to do it.

data(FirstYearGPA)
FirstYearGPA$math_quartile <- with(FirstYearGPA, cut(SATM, 
                                breaks=quantile(SATM, probs=seq(0,1, by=0.25), na.rm=TRUE),
                                labels=c("Q1","Q2","Q3","Q4"),
                                include.lowest=TRUE))
FirstYearGPA$math_quartile <- as.factor(FirstYearGPA$math_quartile)
head(FirstYearGPA)

##    GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
## 1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
## 2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
## 3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
## 4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
## 5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
## 6 2.95  3.25  600  570    0 18.0  3.0        0     1            1
##   math_quartile
## 1            Q4
## 2            Q4
## 3            Q1
## 4            Q4
## 5            Q2
## 6            Q1

For math_quartile, Q4 has the highest scores and Q1 has the lowest scores.
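
To see what cut() and quantile() are doing, it can help to run them on a tiny vector. The scores below are toy values, not taken from FirstYearGPA, and the variable names are made up for this sketch.

```r
scores <- 1:8  # toy scores for illustration only

# Five break points (0%, 25%, 50%, 75%, 100%) define four intervals
breaks <- quantile(scores, probs = seq(0, 1, by = 0.25))
breaks

quartile <- cut(scores, breaks = breaks,
                labels = c("Q1", "Q2", "Q3", "Q4"),
                include.lowest = TRUE)
table(quartile)  # two scores fall in each quartile

# Without include.lowest = TRUE, the first interval excludes its lower bound
no_lowest <- cut(scores, breaks = breaks,
                 labels = c("Q1", "Q2", "Q3", "Q4"))
sum(is.na(no_lowest))  # the minimum score is left unassigned (NA)
```

Note that cut() already returns a factor, so the result can be used directly as a grouping variable in ggplot2.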

How would a density plot look?

ggplot( 
  data=FirstYearGPA, 
  aes(x=GPA)) + 
  geom_density(aes(group=math_quartile, fill=math_quartile),alpha=0.2) +
  scale_fill_discrete(name="Math Quartiles") + # Changes legend title
  xlab("Firstyear GPA") + ylab("")

Pretty, but not very effective in conveying information. The four curves overlap a lot. What about a dotplot?

ggplot( 
  data=FirstYearGPA, 
  aes(x=math_quartile,y=GPA,fill=math_quartile)) + 
  geom_dotplot(binaxis='y',stackdir='center',dotsize=0.8) +
  theme(legend.position = "none") +
  ylab("Firstyear GPA") + xlab("SATM Quartiles")

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

I don’t think that looks very nice either. What about a boxplot?

ggplot(data=FirstYearGPA, 
  aes(x=math_quartile, y=GPA, fill=math_quartile)) + 
  geom_boxplot(outlier.colour="black", outlier.shape=8, outlier.size=4) +
  theme(legend.position = "none") +
  ylab("Firstyear GPA") + xlab("SATM Quartiles")

Now this one is a lot better. We can see that the one student with the highest GPA was in the fourth quartile (aka had one of the highest SAT Math scores), but the group with the highest median GPA was actually the third quartile. Additionally, the only GPA outlier (who had a very low first-year GPA) was in the third quartile too. This boxplot also allows us to see the range of GPAs among the four SATM groups. It’s important to test out different visualizations when trying to convey information. While density plots look pretty, sometimes the simplest solution is the best.
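
The center line of each box is the group median, so a reading taken from the boxplot can be checked numerically with tapply(). The sketch below uses only the six rows printed by head() above, so its numbers are not the full-data result; on the real data the equivalent call would be tapply(FirstYearGPA$GPA, FirstYearGPA$math_quartile, median).

```r
# GPA and math_quartile for the six rows printed by head() above
gpa <- c(3.06, 4.15, 3.41, 3.21, 3.48, 2.95)
math_quartile <- factor(c("Q4", "Q4", "Q1", "Q4", "Q2", "Q1"),
                        levels = c("Q1", "Q2", "Q3", "Q4"))

# Median GPA within each quartile group
group_medians <- tapply(gpa, math_quartile, median)
group_medians  # Q3 is NA here because none of these six rows is in Q3
```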