Package | Title | Maintainer | Version | URL |
---|---|---|---|---|
car | Companion to Applied Regression | John Fox <jfox@mcmaster.ca> | 3.0-2 | https://r-forge.r-project.org/projects/car/ |
Hmisc | Harrell Miscellaneous | Frank E Harrell Jr <f.harrell@vanderbilt.edu> | 4.2-0 | http://biostat.mc.vanderbilt.edu/Hmisc |
kableExtra | Construct Complex Table with ‘kable’ and Pipe Syntax | Hao Zhu <haozhu233@gmail.com> | 1.0.1 | http://haozhu233.github.io/kableExtra/ |
tidyverse | Easily Install and Load the ‘Tidyverse’ | Hadley Wickham <hadley@rstudio.com> | 1.2.1 | http://tidyverse.tidyverse.org |
reshape2 | Flexibly Reshape Data: A Reboot of the Reshape Package | Hadley Wickham <h.wickham@gmail.com> | 1.4.3 | NA |
plotly | Create Interactive Web Graphics via ‘plotly.js’ | Carson Sievert <cpsievert1@gmail.com> | 4.8.0 | https://plot.ly/r, https://cpsievert.github.io/plotly_book/ |
sjPlot | Data Visualization for Statistics in Social Science | Daniel Lüdecke <d.luedecke@uke.de> | 2.6.2 | NA |
Stat2Data | Datasets for Stat2 | Robin Lock <rlock@stlawu.edu> | 2.0.0 | NA |
ggplot2 is a package that describes itself as “a system for declaratively creating graphics”. It is included in the tidyverse package, which also includes dplyr, tidyr, and tibble.
For reference, here is a ggplot2 cheatsheet and a quick reference guide. Also, if you run into a problem with ggplot2, it is very likely that someone has hit the same problem before and posted about it on StackOverflow, so don’t be afraid to just Google your error message.
This guide will use the FirstYearGPA dataset from Stat2Data.
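The code in this guide assumes that ggplot2 and the dataset are already loaded. A minimal setup sketch, assuming both packages are installed (loading tidyverse instead of ggplot2 would also work):

```r
# Load the plotting package and the package that ships the dataset
library(ggplot2)
library(Stat2Data)

# Load the FirstYearGPA data frame into the workspace
data(FirstYearGPA)

# Quick look at the variables and their types
str(FirstYearGPA)
```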
Why use ggplot2? Honestly, because it looks prettier. It also offers more options than base R plots.
Why the “2”? There was an earlier package called ggplot, which was discontinued in the mid-2000s, although it is still around as a historical archive. The “2” in ggplot2 reflects that it was significantly updated.
In base R, there are different commands for different visualization types.
plot(FirstYearGPA$GPA, FirstYearGPA$HSGPA) # Scatterplot
counts <- table(FirstYearGPA$FirstGen)
barplot(counts) # Barplot
hist(FirstYearGPA$GPA) # Histogram
In ggplot2, there is one command, ggplot, which can be further specified with additional syntax. Below are breakdowns for different ggplot visualizations. An important part of the syntax is geom, which adds the layer of the actual plot.
geom | Description |
---|---|
geom_point | Scatterplot |
geom_bar | Barplot |
geom_density | Density Plot |
geom_histogram | Histogram |
geom_boxplot | Boxplot |
geom_dotplot | Dotplot |
geom_smooth | Regression Line |
# Scatterplot aka geom_point
ggplot( # ggplot command
data=FirstYearGPA, # Name of dataset
aes(x=GPA, y=HSGPA)) + # Your x-axis and y-axis
geom_point() # Type of plot
# Barplot aka geom_bar
ggplot( # ggplot command
data=FirstYearGPA, # Name of dataset
aes(x=FirstGen)) + # Your x-axis
geom_bar() # Type of plot
# Density plot aka geom_density
ggplot( # ggplot command
data=FirstYearGPA, # Name of dataset
aes(x=GPA)) + # Your x-axis
geom_density() # Type of plot
# Histogram aka geom_histogram
ggplot( # ggplot command
data=FirstYearGPA, # Name of dataset
aes(x=GPA)) + # Your x-axis
geom_histogram(bins=20) # Type of plot
Since the geom part just adds a layer, you can add multiple geoms to one plot.
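# Scatterplot with a regression line
mod <- lm(HSGPA ~ GPA, data=FirstYearGPA) # Assumed definition: mod is not defined anywhere in the original
ggplot( # ggplot command
data=FirstYearGPA, # Name of dataset
aes(x=GPA, y=HSGPA)) + # Your x-axis and y-axis
geom_point() + # Type of plot
geom_smooth(aes(y=predict(mod))) # Adds a regression line through mod's fitted values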
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
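If the goal is simply a straight regression line, geom_smooth can fit it directly, so no separately fitted model is needed. A minimal sketch, using the six rows that head(FirstYearGPA) prints in this guide as a stand-in for the full dataset; se = FALSE is an optional choice that hides the confidence band:

```r
library(ggplot2)

# Stand-in data: the six rows shown by head(FirstYearGPA) in this guide
gpa6 <- data.frame(
  GPA   = c(3.06, 4.15, 3.41, 3.21, 3.48, 2.95),
  HSGPA = c(3.83, 4.00, 3.70, 3.51, 3.83, 3.25)
)

p <- ggplot(data = gpa6, aes(x = GPA, y = HSGPA)) +
  geom_point() +                                            # scatterplot layer
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)   # least-squares line

print(p)
```

With the full dataset, data = FirstYearGPA would replace the stand-in data frame.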
I want to make a barplot that shows GPA divided by gender.
ggplot(data=FirstYearGPA, aes(x=GPA, fill=Male)) + # Data, x-axis, fill aesthetic
geom_bar() # Barplot
That looks terrible. There are too many distinct GPA values. Let’s round GPA to the nearest half-point to simplify things.
FirstYearGPA$GPA_round <- round(FirstYearGPA$GPA*2)/2 # Round GPA to the nearest half-point
head(FirstYearGPA)
## GPA HSGPA SATV SATM Male HU SS FirstGen White CollegeBound
## 1 3.06 3.83 680 770 1 3.0 9.0 1 1 1
## 2 4.15 4.00 740 720 0 9.0 3.0 0 1 1
## 3 3.41 3.70 640 570 0 16.0 13.0 0 0 1
## 4 3.21 3.51 740 700 0 22.0 0.0 0 1 1
## 5 3.48 3.83 610 610 0 30.5 1.5 0 1 1
## 6 2.95 3.25 600 570 0 18.0 3.0 0 1 1
## GPA_round
## 1 3.0
## 2 4.0
## 3 3.5
## 4 3.0
## 5 3.5
## 6 3.0
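The rounding trick works by doubling the value, rounding to a whole number, and halving again. The same idea generalizes to any step size; a small sketch, where round_to is a hypothetical helper name and the input is the six GPA values printed above:

```r
# Round to the nearest multiple of `step` (round_to is a hypothetical helper)
round_to <- function(x, step) round(x / step) * step

gpa <- c(3.06, 4.15, 3.41, 3.21, 3.48, 2.95)  # first six GPAs shown above

half_points <- round_to(gpa, 0.5)  # same idea as round(gpa * 2) / 2
print(half_points)
# [1] 3.0 4.0 3.5 3.0 3.5 3.0
```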
Now we can see that ggplot isn’t filling the bars by gender. Let’s double-check the class of the variable Male.
## [1] "integer"
## [1] "factor"
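The code that produced these two results is not shown. They are consistent with checking the class, converting the column to a factor, and checking again. A sketch under that assumption, using the Male values from the six rows printed above as a stand-in for the full dataset:

```r
# Stand-in for FirstYearGPA: Male values of the first six rows
gpa6 <- data.frame(Male = c(1L, 0L, 0L, 0L, 0L, 0L))

class(gpa6$Male)                   # "integer": ggplot2 treats it as continuous

gpa6$Male <- as.factor(gpa6$Male)  # convert to a categorical variable

class(gpa6$Male)                   # "factor": fill can now split bars into groups
levels(gpa6$Male)                  # "0" "1"
```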
ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Male)) + # Now with altered fill aesthetic
geom_bar()
That looks better. But by default, ggplot2 stacks the bars by gender. I want the bars to sit next to each other, grouped by gender.
ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Male)) +
geom_bar(position="dodge") # New barplot position
It’s starting to look nice. But there are a few things we can do to improve it: give the legend clearer labels, and add axis titles and a main title.
Be aware that altering legends in ggplot2 can be a little tricky. In some cases, it may be easier to just alter the data in the first place. In the FirstYearGPA dataset, gender is coded in the variable Male: 0 if female, 1 if male. So let’s create a new gender variable.
## GPA HSGPA SATV SATM Male HU SS FirstGen White CollegeBound
## 1 3.06 3.83 680 770 1 3.0 9.0 1 1 1
## 2 4.15 4.00 740 720 0 9.0 3.0 0 1 1
## 3 3.41 3.70 640 570 0 16.0 13.0 0 0 1
## 4 3.21 3.51 740 700 0 22.0 0.0 0 1 1
## 5 3.48 3.83 610 610 0 30.5 1.5 0 1 1
## 6 2.95 3.25 600 570 0 18.0 3.0 0 1 1
## GPA_round Gender
## 1 3.0 Male
## 2 4.0 Female
## 3 3.5 Female
## 4 3.0 Female
## 5 3.5 Female
## 6 3.0 Female
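The step that created Gender is not shown. One way to do it, sketched here on the Male values from the six rows above, is factor() with explicit labels. This is an assumption about the approach; ifelse() would work too.

```r
# Stand-in for FirstYearGPA: Male values of the first six rows
gpa6 <- data.frame(Male = c(1L, 0L, 0L, 0L, 0L, 0L))

# Map 0 -> "Female" and 1 -> "Male"
gpa6$Gender <- factor(gpa6$Male, levels = c(0, 1), labels = c("Female", "Male"))

print(gpa6)
```

Because the labels live in the data, the legend picks them up automatically and no legend-specific code is needed.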
ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Gender)) + # New fill aesthetic
geom_bar(stat="count", position="dodge")
Now let’s add titles.
ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Gender)) +
geom_bar(stat="count", position="dodge") +
xlab("Firstyear GPA") + # x-axis title
ylab("Number of Students") + # y-axis title
ggtitle("Firstyear GPA by Gender") # Main title
Be aware that by default, ggplot2 left-aligns the main title. This was done to make it easier to add subtitles to plots. If you want the title centered, you have to say so explicitly.
ggplot(data=FirstYearGPA, aes(x=GPA_round, fill=Gender)) +
geom_bar(stat="count", position="dodge") +
xlab("Firstyear GPA") +
ylab("Number of Students") +
ggtitle("Firstyear GPA by Gender") +
theme(plot.title = element_text(hjust = 0.5)) # Change theme to center main title
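Now it looks nice. So ultimately, we rounded GPA to the nearest half-point, converted Male to a factor, placed the bars side by side, created a labeled Gender variable, added axis and main titles, and centered the main title.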
There are still numerous other ways to alter the barplot: increase the size of the text, change the color scheme from red/blue to green/purple, change the background color, add error bars, etc. If I demonstrated all of these options, the document would never end!
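As a taste, two of these tweaks can be sketched in a few lines: scale_fill_manual() sets the fill colors and theme(text = ...) enlarges the text. The color names and the size of 14 are arbitrary choices, and the data frame is a stand-in built from the six rows printed earlier.

```r
library(ggplot2)

# Stand-in data: rounded GPA and gender for the six rows shown earlier
gpa6 <- data.frame(
  GPA_round = c(3.0, 4.0, 3.5, 3.0, 3.5, 3.0),
  Gender    = factor(c("Male", "Female", "Female", "Female", "Female", "Female"))
)

p <- ggplot(data = gpa6, aes(x = GPA_round, fill = Gender)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c(Female = "darkgreen", Male = "purple")) +  # custom colors
  theme(text = element_text(size = 14))                                   # larger text

print(p)
```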
Let’s say I wanted to know more about the students’ SAT Math scores. Since SATM is a numeric variable with lots of variability, I should make a new variable that splits SATM into quartiles and places everyone in the first, second, third, or fourth quartile. This is something that can be done with the Hmisc package (for example, with its cut2 function); the code below does it with base R’s cut and quantile instead.
data(FirstYearGPA)
FirstYearGPA$math_quartile <- with(FirstYearGPA, cut(SATM,
breaks=quantile(SATM, probs=seq(0,1, by=0.25), na.rm=TRUE),
labels=c("Q1","Q2","Q3","Q4"),
include.lowest=TRUE))
FirstYearGPA$math_quartile <- as.factor(FirstYearGPA$math_quartile)
head(FirstYearGPA)
## GPA HSGPA SATV SATM Male HU SS FirstGen White CollegeBound
## 1 3.06 3.83 680 770 1 3.0 9.0 1 1 1
## 2 4.15 4.00 740 720 0 9.0 3.0 0 1 1
## 3 3.41 3.70 640 570 0 16.0 13.0 0 0 1
## 4 3.21 3.51 740 700 0 22.0 0.0 0 1 1
## 5 3.48 3.83 610 610 0 30.5 1.5 0 1 1
## 6 2.95 3.25 600 570 0 18.0 3.0 0 1 1
## math_quartile
## 1 Q4
## 2 Q4
## 3 Q1
## 4 Q4
## 5 Q2
## 6 Q1
For math_quartile, Q4 has the highest scores and Q1 has the lowest scores.
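To see what cut() and quantile() are doing, and why include.lowest=TRUE matters, here is a small sketch on toy values (the numbers 1 to 8, not taken from the dataset):

```r
x <- 1:8  # toy scores, not from the dataset

# Quartile boundaries: 0%, 25%, 50%, 75%, 100%
breaks <- quantile(x, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum inside the first interval
q <- cut(x, breaks = breaks, labels = c("Q1", "Q2", "Q3", "Q4"),
         include.lowest = TRUE)
table(q)  # two values land in each quartile

# Without it, intervals are (low, high], so the minimum falls outside and becomes NA
q_open <- cut(x, breaks = breaks, labels = c("Q1", "Q2", "Q3", "Q4"))
sum(is.na(q_open))  # 1
```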
How would a density plot look?
ggplot(
data=FirstYearGPA,
aes(x=GPA)) +
geom_density(aes(group=math_quartile, fill=math_quartile),alpha=0.2) +
scale_fill_discrete(name="Math Quartiles") + # Changes legend title
xlab("Firstyear GPA") + ylab("")
Pretty, but not very effective in conveying information. It overlaps a lot. What about a dotplot?
ggplot(
data=FirstYearGPA,
aes(x=math_quartile,y=GPA,fill=math_quartile)) +
geom_dotplot(binaxis='y',stackdir='center',dotsize=0.8) +
theme(legend.position = "none") +
ylab("Firstyear GPA") + xlab("SATM Quartiles")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
I don’t think that looks very nice either. What about a boxplot?
ggplot(data=FirstYearGPA,
aes(x=math_quartile, y=GPA, fill=math_quartile)) +
geom_boxplot(outlier.colour="black", outlier.shape=8, outlier.size=4) +
theme(legend.position = "none") +
ylab("Firstyear GPA") + xlab("SATM Quartiles")
Now this one is a lot better. We can see that the student with the highest GPA was in the fourth quartile (that is, had one of the highest SAT Math scores), but the group with the highest median GPA was actually the third quartile. Additionally, the only GPA outlier (a student with a very low first-year GPA) was in the third quartile too. This boxplot also lets us see the range of GPAs within each of the four SATM groups. It’s important to test out different visualizations when trying to convey information. While density plots look pretty, sometimes the simplest solution is the best.
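The center line of a boxplot is the group median, so a reading like this can be checked numerically. A sketch with tapply() on toy values (hypothetical numbers, for illustration only); on the real data the same calls would take FirstYearGPA$GPA and FirstYearGPA$math_quartile:

```r
# Toy data (hypothetical values, not from the dataset)
toy <- data.frame(
  GPA           = c(2.8, 3.0, 3.1, 3.3, 3.4, 3.6, 3.2, 3.9),
  math_quartile = factor(rep(c("Q1", "Q2", "Q3", "Q4"), each = 2))
)

# Median and mean GPA within each quartile group
group_median <- tapply(toy$GPA, toy$math_quartile, median)
group_mean   <- tapply(toy$GPA, toy$math_quartile, mean)

print(rbind(median = group_median, mean = group_mean))
```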