Packages Used
Package | Title | Maintainer | Version | URL
car | Companion to Applied Regression | John Fox | 3.0-2 | https://r-forge.r-project.org/projects/car/
Hmisc | Harrell Miscellaneous | Frank E Harrell Jr | 4.2-0 | http://biostat.mc.vanderbilt.edu/Hmisc
kableExtra | Construct Complex Table with ‘kable’ and Pipe Syntax | Hao Zhu | 1.0.1 | http://haozhu233.github.io/kableExtra/
tidyverse | Easily Install and Load the ‘Tidyverse’ | Hadley Wickham | 1.2.1 | http://tidyverse.tidyverse.org
reshape2 | Flexibly Reshape Data: A Reboot of the Reshape Package | Hadley Wickham | 1.4.3 | NA
plotly | Create Interactive Web Graphics via ‘plotly.js’ | Carson Sievert | 4.8.0 | https://plot.ly/r, https://cpsievert.github.io/plotly_book/
sjPlot | Data Visualization for Statistics in Social Science | Daniel Lüdecke | 2.6.2 | NA
Stat2Data | Datasets for Stat2 | Robin Lock | 2.0.0 | NA

Introduction

ggplot2 is a package that describes itself as “a system for declaratively creating graphics”. It is included in the tidyverse package, which also includes dplyr, tidyr, and tibble.

For reference, there is a ggplot2 cheatsheet and a quick reference guide. Also, if you run into an issue with ggplot2, it is very likely that someone has encountered the same issue before and posted about it on StackOverflow, so don’t be afraid to just Google your error message.

This guide will use the FirstYearGPA dataset from Stat2Data.
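A minimal sketch of loading the dataset, assuming the Stat2Data package is already installed. The guide does not show its own setup code, so the exact calls here are an assumption.

```r
# Assumes Stat2Data is installed, e.g. via install.packages("Stat2Data")
library(Stat2Data)

# Load the FirstYearGPA data frame into the workspace
data(FirstYearGPA)

# Look at the first few rows and the variable names
head(FirstYearGPA)
names(FirstYearGPA)
```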

Frequently Asked Questions

1. Why should I use ggplot2?

Honestly? Because it looks prettier. It also offers more options than base R plots.

2. Why is it called ggplot2? What happened to ggplot1?

There was an earlier package called ggplot, which stopped being developed in the mid-2000s, although it’s still kind of around as a historical archive. The “2” in ggplot2 reflects that it was significantly updated.

3. Are there alternatives to ggplot2?

Yes, the most obvious being base R plotting. There is also plotly, which has syntax similar to ggplot2.

There is also sjPlot, which is particularly good at making APA-style regression tables, but most of its plot functions are based on ggplot2.

Syntax

In base R, there are different commands for different visualization types.
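To make that concrete, here is a rough sketch of a few base R plotting functions applied to FirstYearGPA. The choice of variables is only an illustration and assumes Stat2Data is installed.

```r
library(Stat2Data)
data(FirstYearGPA)

# In base R, each visualization type has its own function
hist(FirstYearGPA$GPA)                       # histogram
boxplot(GPA ~ Male, data = FirstYearGPA)     # boxplot split by group
plot(FirstYearGPA$HSGPA, FirstYearGPA$GPA)   # scatterplot
barplot(table(FirstYearGPA$Male))            # barplot of counts
```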

In ggplot2, there is one command, ggplot, which can be further specified with additional syntax. Below are breakdowns for different ggplot visualizations. An important part of the syntax is geom, which adds the layer of the actual plot.

geom Description
geom_point Scatterplot
geom_bar Barplot
geom_density Density Plot
geom_histogram Histogram
geom_boxplot Boxplot
geom_dotplot Dotplot
geom_smooth Smoothed Trend Line (Regression Line with method = "lm")

Since the geom part just adds a layer, you can add multiple geoms to one plot.
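For instance, a scatterplot layer and a smoothing layer can share one plot. This is a sketch only: plotting GPA against HSGPA is an assumed choice of variables. Printing a plot like this produces a message about the default smoothing method.

```r
library(ggplot2)
library(Stat2Data)
data(FirstYearGPA)

# One ggplot() call sets up the data and axes; each geom adds a layer
p <- ggplot(FirstYearGPA, aes(x = HSGPA, y = GPA)) +
  geom_point() +   # layer 1: the individual students
  geom_smooth()    # layer 2: smoothed trend; method = "lm" gives a straight line
p
```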

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Examples

Example 1: Barplot

I want to make a barplot that shows GPA divided by gender.
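A first attempt might look something like the sketch below. Mapping GPA to the x-axis and Male to the fill color is an assumption about how the plot was set up.

```r
library(ggplot2)
library(Stat2Data)
data(FirstYearGPA)

# One bar per distinct GPA value, with Male requested as the fill color
p <- ggplot(FirstYearGPA, aes(x = GPA, fill = Male)) +
  geom_bar()
p
```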

That looks terrible. There are too many distinct GPA values. Let’s round GPA to the nearest half-point to simplify things.
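One way to round to the nearest half-point is to double the value, round to a whole number, and halve it again. This is a sketch of that idea; the column name GPA_round matches the output shown in this guide.

```r
library(Stat2Data)
data(FirstYearGPA)

# Double, round to a whole number, then halve: 3.41 -> 6.82 -> 7 -> 3.5
# Note: R rounds exact halves to the even digit, so 3.25 maps to 3.0
FirstYearGPA$GPA_round <- round(FirstYearGPA$GPA * 2) / 2

head(FirstYearGPA)
```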

##    GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
## 1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
## 2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
## 3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
## 4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
## 5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
## 6 2.95  3.25  600  570    0 18.0  3.0        0     1            1
##   GPA_round
## 1       3.0
## 2       4.0
## 3       3.5
## 4       3.0
## 5       3.5
## 6       3.0
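With the rounded column in place, the barplot can be redrawn. Again, this is only a sketch, and the aesthetic mapping is assumed.

```r
library(ggplot2)
library(Stat2Data)
data(FirstYearGPA)

# Recreate the rounded GPA column so this block runs on its own
FirstYearGPA$GPA_round <- round(FirstYearGPA$GPA * 2) / 2

# Same barplot as before, but with far fewer distinct x values
p <- ggplot(FirstYearGPA, aes(x = GPA_round, fill = Male)) +
  geom_bar()
p
```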

Now we can see that ggplot isn’t filling the bars by gender. Let’s double-check the class of the variable Male.
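A quick check and conversion could look like this. ggplot2 treats a numeric variable as continuous, so it needs to be a factor before it can split the bars into groups.

```r
library(Stat2Data)
data(FirstYearGPA)

# Check how Male is stored before the conversion
class(FirstYearGPA$Male)

# Convert it to a factor so it is treated as a categorical variable
FirstYearGPA$Male <- as.factor(FirstYearGPA$Male)

# Check again after the conversion
class(FirstYearGPA$Male)
```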

## [1] "integer"
## [1] "factor"

That looks better. But by default, ggplot2 is stacking by gender. I want the bars to sit next to each other, one per gender.
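The difference comes down to the position argument of geom_bar. The sketch below builds both versions for comparison; the object names are just for illustration.

```r
library(ggplot2)
library(Stat2Data)
data(FirstYearGPA)

# Recreate the earlier steps so this block runs on its own
FirstYearGPA$GPA_round <- round(FirstYearGPA$GPA * 2) / 2
FirstYearGPA$Male <- as.factor(FirstYearGPA$Male)

base <- ggplot(FirstYearGPA, aes(x = GPA_round, fill = Male))

p_stacked <- base + geom_bar()                    # default: position = "stack"
p_dodged  <- base + geom_bar(position = "dodge")  # bars side by side
p_dodged
```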

It’s starting to look nice. But there are a few things we can do to improve it:

  • Change the legend
  • Rename the x- and y-axis titles
  • Add a big title at the top

Be aware that altering legends in ggplot2 can be a little tricky. In some cases, it may be easier to just alter the data in the first place. In the FirstYearGPA dataset, gender is stored in the variable Male: 0 if female, 1 if male. So let’s create a new gender variable.
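One possible way to build that variable is with ifelse, as sketched here. The column name Gender matches the output in this guide, but the exact approach is an assumption.

```r
library(Stat2Data)
data(FirstYearGPA)

# Recreate the rounded GPA column so the output has the same columns
FirstYearGPA$GPA_round <- round(FirstYearGPA$GPA * 2) / 2

# Readable labels: Male == 1 becomes "Male", everything else "Female"
FirstYearGPA$Gender <- factor(ifelse(FirstYearGPA$Male == 1, "Male", "Female"))

head(FirstYearGPA)
```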

##    GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
## 1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
## 2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
## 3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
## 4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
## 5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
## 6 2.95  3.25  600  570    0 18.0  3.0        0     1            1
##   GPA_round Gender
## 1       3.0   Male
## 2       4.0 Female
## 3       3.5 Female
## 4       3.0 Female
## 5       3.5 Female
## 6       3.0 Female

Now let’s add titles.
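Titles and axis labels can all be set in one call to labs. The wording of the titles below is hypothetical, chosen just for this sketch.

```r
library(ggplot2)
library(Stat2Data)
data(FirstYearGPA)

# Recreate the earlier steps so this block runs on its own
FirstYearGPA$GPA_round <- round(FirstYearGPA$GPA * 2) / 2
FirstYearGPA$Gender <- factor(ifelse(FirstYearGPA$Male == 1, "Male", "Female"))

p <- ggplot(FirstYearGPA, aes(x = GPA_round, fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(title = "First-Year GPA by Gender",   # hypothetical title text
       x = "GPA (rounded to nearest 0.5)",   # hypothetical axis labels
       y = "Number of students")
p
```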

Be aware that by default, ggplot titles at the top are aligned to the left. This was done so it would be easier to add subtitles to plots. You’ll have to specify if you want it centered.
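Centering is done through theme, by setting the horizontal justification of the title to 0.5. A sketch, with the same hypothetical title text as an assumption:

```r
library(ggplot2)
library(Stat2Data)
data(FirstYearGPA)

# Recreate the earlier steps so this block runs on its own
FirstYearGPA$GPA_round <- round(FirstYearGPA$GPA * 2) / 2
FirstYearGPA$Gender <- factor(ifelse(FirstYearGPA$Male == 1, "Male", "Female"))

p <- ggplot(FirstYearGPA, aes(x = GPA_round, fill = Gender)) +
  geom_bar(position = "dodge") +
  labs(title = "First-Year GPA by Gender",   # hypothetical title text
       x = "GPA (rounded to nearest 0.5)",
       y = "Number of students") +
  # hjust = 0 is left-aligned (the default), 0.5 is centered, 1 is right-aligned
  theme(plot.title = element_text(hjust = 0.5))
p
```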

Now it looks nice. So ultimately, we:

  • Rounded our data so a barplot could better visualize it
  • Changed our gender variable from an integer to a factor
  • Specified a dodged barplot (rather than the default stacked)
  • Created a new gender variable to improve the plot legend
  • Added titles
  • Centered the main title

There are still numerous other ways that you could alter the barplot: increase the size of the text, change the color scheme from red/blue to green/purple, change background color, add error bars, etc., although if I demonstrated all these options, the document would never end!

Example 2: What looks best?

Let’s say I wanted to know more about the students’ SAT Math scores. Since SATM is a numeric variable with lots of variability, I should make a new variable that splits SATM into quartiles and places everyone in the first, second, third, or fourth quartile. This is something that can be done with the Hmisc package.
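A sketch of one way to do this uses cut2 from Hmisc, which can split a numeric variable into a given number of quantile groups. Relabeling the groups as Q1 to Q4 matches the output in this guide, but the exact calls are an assumption.

```r
library(Hmisc)
library(Stat2Data)
data(FirstYearGPA)

# g = 4 asks cut2() for four quantile groups of roughly equal size
FirstYearGPA$math_quartile <- cut2(FirstYearGPA$SATM, g = 4)

# Replace the interval labels with Q1 (lowest scores) through Q4 (highest)
levels(FirstYearGPA$math_quartile) <- c("Q1", "Q2", "Q3", "Q4")

head(FirstYearGPA)
table(FirstYearGPA$math_quartile)
```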

##    GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
## 1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
## 2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
## 3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
## 4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
## 5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
## 6 2.95  3.25  600  570    0 18.0  3.0        0     1            1
##   math_quartile
## 1            Q4
## 2            Q4
## 3            Q1
## 4            Q4
## 5            Q2
## 6            Q1

For math_quartile, Q4 has the best scores and Q1 has the worst scores.

How would a density plot look?
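A density plot of GPA, colored by quartile, could be sketched like this. The alpha value is an arbitrary choice to make the overlapping curves visible.

```r
library(ggplot2)
library(Hmisc)
library(Stat2Data)
data(FirstYearGPA)

# Recreate the quartile variable so this block runs on its own
FirstYearGPA$math_quartile <- cut2(FirstYearGPA$SATM, g = 4)
levels(FirstYearGPA$math_quartile) <- c("Q1", "Q2", "Q3", "Q4")

# One semi-transparent density curve per quartile
p <- ggplot(FirstYearGPA, aes(x = GPA, fill = math_quartile)) +
  geom_density(alpha = 0.5)
p
```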

Pretty, but not very effective in conveying information. It overlaps a lot. What about a dotplot?
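Swapping the geom is enough to try a dotplot instead. This sketch assumes the same mapping as the density plot and leaves the bin width at its default.

```r
library(ggplot2)
library(Hmisc)
library(Stat2Data)
data(FirstYearGPA)

# Recreate the quartile variable so this block runs on its own
FirstYearGPA$math_quartile <- cut2(FirstYearGPA$SATM, g = 4)
levels(FirstYearGPA$math_quartile) <- c("Q1", "Q2", "Q3", "Q4")

# One dot per student, binned along the GPA axis
p <- ggplot(FirstYearGPA, aes(x = GPA, fill = math_quartile)) +
  geom_dotplot()   # set binwidth here to pick a better bin size
p
```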

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

I don’t think that looks very nice either. What about a boxplot?
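For a boxplot, the quartile moves to the x-axis and GPA to the y-axis. A minimal sketch under the same assumptions as the previous plots:

```r
library(ggplot2)
library(Hmisc)
library(Stat2Data)
data(FirstYearGPA)

# Recreate the quartile variable so this block runs on its own
FirstYearGPA$math_quartile <- cut2(FirstYearGPA$SATM, g = 4)
levels(FirstYearGPA$math_quartile) <- c("Q1", "Q2", "Q3", "Q4")

# One box per quartile; the line inside each box is the median GPA
p <- ggplot(FirstYearGPA, aes(x = math_quartile, y = GPA)) +
  geom_boxplot()
p
```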

Now this one is a lot better. We can see that the one student with the highest GPA was in the fourth quartile (aka had one of the highest SAT Math scores), but the group with the highest median GPA was actually the third quartile. Additionally, the only GPA outlier (who had a very low first-year GPA) was in the third quartile too. This boxplot also allows us to see the range of GPAs among the four SATM groups. It’s important to test out different visualizations when trying to convey information. While density plots look pretty, sometimes the simplest solution is the best.