Before you start going through this document, you should first have gone through the RMarkdown document called “Learning_R.rmd”, which can be found in the TaDLab’s Google Drive and in the twoyear GitHub repository. This guide is best read with both versions (RMarkdown and HTML) open side by side, so you can see how the R syntax translates to HTML.
Even though R is a program for statistical analysis, it can also be used to create polished scientific reports. RMarkdown files can be “knitted”, that is converted, into HTML or PDF files. Knitting doesn’t alter the .rmd file; it creates a wholly separate file based on the .rmd file. This guide focuses on knitting HTML with RMarkdown.
For your reference, here is a cheatsheet for RMarkdown for almost everything that occurs outside of chunks.
In RMarkdown, things can either be inside of chunks or outside of chunks. Inside of chunks is where actual statistical analysis occurs, including randomly generating data and plotting variables.
labels <- c("Red", "Blue", "Yellow", "Green", "Pink")
data <- data.frame(replicate(5, sample(0:10, 5, replace = TRUE)))
colnames(data) <- labels
head(data)
## Red Blue Yellow Green Pink
## 1 7 5 5 6 8
## 2 3 7 3 4 10
## 3 3 5 5 6 8
## 4 10 1 8 7 5
## 5 9 10 2 9 6
plot(data)
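Because the data above are drawn at random, the numbers change every time the chunk runs or the document is knitted. A minimal sketch of one way to make the draw reproducible is to fix the seed first. The seed value 42 and the names `seeded_data` and `seeded_again` are arbitrary choices for illustration and are not part of the original document.

```r
# Fix the random seed so sample() returns the same draws on every run.
set.seed(42)  # 42 is an arbitrary choice
labels <- c("Red", "Blue", "Yellow", "Green", "Pink")
seeded_data <- data.frame(replicate(5, sample(0:10, 5, replace = TRUE)))
colnames(seeded_data) <- labels

# Resetting the same seed and repeating the draw gives an identical data frame.
set.seed(42)
seeded_again <- data.frame(replicate(5, sample(0:10, 5, replace = TRUE)))
colnames(seeded_again) <- labels
same_draw <- identical(seeded_data, seeded_again)
print(same_draw)
```

With a fixed seed, the printed output in the knitted HTML should stay the same from one knit to the next, which makes the surrounding prose easier to keep in sync with the numbers.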
Outside of chunks, you can do many Microsoft Word-like things, such as writing plain, italicized, bolded, deleted (struck-through), and quoted text. This is a good opportunity to talk about your code or explain the results of your statistical analysis in plain English. Other features allow you to make tables, create numbered lists, insert images or special characters, include mathematical proofs, and embed links.
For your reference, here is a cheatsheet for RMarkdown.
Headers can be used while writing RMarkdown documents and appear in the knitted HTML. In the editor, headers are shown in blue and are marked with hashtags. You can use between one and six hashtags to set the “size” of the header, with more hashtags producing smaller headers. Put a space between the hashtags and your text: it looks nicer in RMarkdown, and Pandoc (the converter that knitting relies on) expects that space before it treats the line as a header.
Example 7
In RMarkdown, if you look at the line at the bottom of the Source pane, it should say `# Example 6` followed by a little up arrow and a little down arrow. If you click on that, it’ll show you the outline for this whole RMarkdown document. You can see the headers for Examples 1 through 6, but not Example 7, since six hashtags is the limit for header sizes. If your cursor is under the Example 6 header, then the outline will say `Example 6`. Using headers makes it easy to organize things, like analyses on two separate datasets.
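The header levels described above can be sketched in a few lines of base R. This is only an illustration of the syntax: the variable names `header_lines` and `is_header` are made up for this example, and the regular expression is a simplified stand-in for what the Markdown converter actually checks.

```r
# Build header lines with one to seven leading hashtags.
levels <- 1:7
header_lines <- paste(strrep("#", levels), "Example", levels)
writeLines(header_lines)

# One to six hashtags followed by a space count as a header;
# the seventh line has too many hashtags and stays plain text.
is_header <- grepl("^#{1,6} ", header_lines)
print(is_header)
```

Pasting the first six lines outside of a chunk gives six headers of decreasing size, while the seventh line knits as ordinary text, which matches the “Example 7” line shown above.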
Additionally, the outline also lists chunk names. Here is an example.
head(data)
## Red Blue Yellow Green Pink
## 1 7 5 5 6 8
## 2 3 7 3 4 10
## 3 3 5 5 6 8
## 4 10 1 8 7 5
## 5 9 10 2 9 6
If you look at the outline in RMarkdown, you can see `Example dataset` under the Example 6 header. If you don’t name your chunks, the outline automatically labels them “Chunk 1”, “Chunk 2”, etc. There is no particular reason to name chunks other than keeping your document organized. While headers appear in knitted HTML documents, chunk names do not.
It is important that you place chunk names in the right place: after the `r` but inside the curly braces. You also can’t have two chunks with the same name, or else the file will stop knitting. Chunk names go before chunk options, which are discussed below.
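To make the placement concrete, here is a small sketch that writes out two chunk headers as text and checks their names for duplicates. The chunk names `example-dataset` and `example-plot` are hypothetical hyphenated labels chosen for this illustration, and the regular expressions are a simplification rather than the parser knitr itself uses.

```r
fence <- strrep("`", 3)  # three backticks open and close a chunk

# Two named chunks: each name follows the r inside the curly braces.
rmd_lines <- c(
  paste0(fence, "{r example-dataset}"),
  "head(data)",
  fence,
  "",
  paste0(fence, "{r example-plot}"),
  "plot(data)",
  fence
)
writeLines(rmd_lines)

# Pull the name out of each chunk header and check for duplicates.
headers <- grep("\\{r[ ,}]", rmd_lines, value = TRUE)
chunk_labels <- sub("^.*\\{r +([^,}]+).*$", "\\1", headers)
names_are_unique <- anyDuplicated(chunk_labels) == 0
print(chunk_labels)
```

A quick check like this can be handy in long documents, since a repeated chunk name only shows up as a problem once you try to knit.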
When writing an RMarkdown document, you may do many “behind the scenes” things like recoding or renaming variables. While it is important for the reader to know what kind of data wrangling techniques you used, they probably don’t need to see the lines and lines of code where you do this. This is where chunk options come into play. Chunk options are specified after chunk names (if you choose to name your chunks).
`include` is an option that says whether this chunk should appear in the HTML document at all. With `include=FALSE`, the code still runs, but in your HTML document it’ll seem like you never wrote it in the first place. Chunks default to `include=TRUE` unless you specify otherwise.
# See? It disappeared!
`echo` is an option that allows you to hide the code but show the output in an HTML document. Imagine you wrote a lot of code to create a nice visualization, but the code is so messy it would distract the reader from the visualization. You could use `echo=FALSE` to hide the code but show the visualization. Readers would not see how you created the visualization. Chunks default to `echo=TRUE` unless you specify otherwise.
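The effect of `include` and `echo` can be checked without opening RStudio. The sketch below assumes the knitr package is installed and uses `knitr::knit()` with its `text` argument to knit two tiny chunks held in a character vector. The chunk names, the variable `hidden_value`, and the environment `chunk_env` are invented for this illustration.

```r
fence <- strrep("`", 3)

# One chunk hidden entirely, one chunk that shows only its output.
rmd_lines <- c(
  paste0(fence, "{r hidden-setup, include=FALSE}"),
  "hidden_value <- 42",
  fence,
  "",
  paste0(fence, "{r output-only, echo=FALSE}"),
  "hidden_value + 1",
  fence
)

if (requireNamespace("knitr", quietly = TRUE)) {
  chunk_env <- new.env()
  knitted <- paste(
    knitr::knit(text = rmd_lines, envir = chunk_env, quiet = TRUE),
    collapse = "\n"
  )
  cat(knitted, "\n")

  # The include=FALSE chunk still ran, so hidden_value exists.
  code_ran <- exists("hidden_value", envir = chunk_env, inherits = FALSE)
  # Neither chunk's code is printed in the knitted text.
  code_shown <- grepl("hidden_value", knitted, fixed = TRUE)
}
```

The knitted text contains the result of the second chunk but none of the code, which is the behavior the two options are meant to give you.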
Similarly, imagine you are doing statistical analyses and you end up with some warning messages. Warnings are different from errors: errors stop the analysis because something is flawed in the code, whereas warnings tell you something probably went wrong and you might want to fix it, but the code still runs to completion. Messages are different from warnings in that they are usually harmless. Messages often tell you details about your analysis that you might not be aware of, or that the command you’re using is outdated. The `warning` and `message` options default to `TRUE` unless otherwise specified.
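The difference between the three kinds of conditions can be seen in plain R, outside of any chunk options. The following is a small base R sketch; the variable names and the example inputs are made up for illustration.

```r
# A warning: the call still returns a value (NA for the bad entry).
warned <- FALSE
converted <- withCallingHandlers(
  as.numeric(c("1", "two")),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")  # silence it and let the call finish
  }
)

# A message: purely informational, the returned value is unaffected.
total <- suppressMessages({
  message("Adding two numbers")
  1 + 2
})

# An error: the call never produces a value unless the error is caught.
failed <- tryCatch(
  {
    log("two")
    FALSE
  },
  error = function(e) TRUE
)

print(list(converted = converted, total = total, failed = failed))
```

Here the warning and the message both let the code finish, while the error has to be caught for the script to continue, which is why failed code needs special handling when knitting.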
In the example below, I am using the package `sjPlot` to create a correlation matrix based on randomly generated data. The syntax used produces a message and a warning: a message that indicates which correlation method is computed, and a warning that redundant text was removed from the generated correlation matrix.
sjp.corr(data, decimals=1)
## Computing correlation using pearson-method with listwise-deletion...
# Warning is suppressed, but message isn't
sjp.corr(data, decimals=1)
## Warning: Removed 15 rows containing missing values (geom_text).
# Message is suppressed, but warning isn't
Alternatively, you may want to showcase failed code. In that case, you would set `error=TRUE`. This allows the HTML to knit properly despite the error. In the chunk below, I’m attempting to find the mean of a variable that doesn’t exist. Note that in this particular case R only issues a warning and returns NA, as the output shows, rather than stopping with an error.
mean(data$Orange)
## Warning in mean.default(data$Orange): argument is not numeric or logical:
## returning NA
## [1] NA
If you wanted to hide your results but show the syntax, you would use `results = 'hide'`. However, this option only works for text output such as summaries and tables; it does not hide plots.
summary(data)
If you wanted to suppress a plot, you would use `fig.keep = 'none'`.
plot(data$Red, data$Green)
Certain packages or commands require you to put `results = 'asis'` in your chunks. The package `summarytools` can make pretty tables out of descriptive statistics, but its output can look very different depending on your chunk options. The difference between the two chunks below can be seen when knitted to HTML.
summarytools::descr(data, style = "rmarkdown", na.rm=TRUE)
| | Blue | Green | Pink | Red | Yellow |
|---|---|---|---|---|---|
| Mean | 5.60 | 6.40 | 7.40 | 6.40 | 4.60 |
| Std.Dev. | 3.29 | 1.82 | 1.95 | 3.29 | 2.30 |
| Min | 1.00 | 4.00 | 5.00 | 3.00 | 2.00 |
| Q1 | 5.00 | 6.00 | 6.00 | 3.00 | 3.00 |
| Median | 5.00 | 6.00 | 8.00 | 7.00 | 5.00 |
| Q3 | 7.00 | 7.00 | 8.00 | 9.00 | 5.00 |
| Max | 10.00 | 9.00 | 10.00 | 10.00 | 8.00 |
| MAD | 2.97 | 1.48 | 2.97 | 4.45 | 2.97 |
| IQR | 2.00 | 1.00 | 2.00 | 6.00 | 2.00 |
| CV | 0.59 | 0.28 | 0.26 | 0.51 | 0.50 |
| Skewness | -0.06 | 0.13 | 0.04 | -0.08 | 0.29 |
| SE.Skewness | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 |
| Kurtosis | -1.58 | -1.55 | -1.85 | -2.18 | -1.68 |
| N.Valid | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
| % Valid | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
summarytools::descr(data, style = "rmarkdown", na.rm=TRUE)
## ### Descriptive Statistics
##
## | | Blue | Green | Pink | Red | Yellow |
## |----------------:|-------:|-------:|-------:|-------:|-------:|
## | **Mean** | 5.60 | 6.40 | 7.40 | 6.40 | 4.60 |
## | **Std.Dev.** | 3.29 | 1.82 | 1.95 | 3.29 | 2.30 |
## | **Min** | 1.00 | 4.00 | 5.00 | 3.00 | 2.00 |
## | **Q1** | 5.00 | 6.00 | 6.00 | 3.00 | 3.00 |
## | **Median** | 5.00 | 6.00 | 8.00 | 7.00 | 5.00 |
## | **Q3** | 7.00 | 7.00 | 8.00 | 9.00 | 5.00 |
## | **Max** | 10.00 | 9.00 | 10.00 | 10.00 | 8.00 |
## | **MAD** | 2.97 | 1.48 | 2.97 | 4.45 | 2.97 |
## | **IQR** | 2.00 | 1.00 | 2.00 | 6.00 | 2.00 |
## | **CV** | 0.59 | 0.28 | 0.26 | 0.51 | 0.50 |
## | **Skewness** | -0.06 | 0.13 | 0.04 | -0.08 | 0.29 |
## | **SE.Skewness** | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 |
## | **Kurtosis** | -1.58 | -1.55 | -1.85 | -2.18 | -1.68 |
## | **N.Valid** | 5.00 | 5.00 | 5.00 | 5.00 | 5.00 |
## | **% Valid** | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
Please note that chunk options only take effect when you knit files to HTML. If you just clicked “Run All” for the RMarkdown file, it would run everything and ignore the chunk options, since chunk options are only concerned with knitting. For example, if you included `warning=FALSE` in every chunk, you’d still get warnings in the R console. Those warnings just won’t appear when the document is knitted to HTML.
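Rather than repeating the same option in every chunk, knitr lets you set document-wide defaults with `knitr::opts_chunk$set()`. The sketch below assumes the knitr package is installed. Putting the call in a first chunk that itself uses `include=FALSE` is a common convention, not something the original document prescribes.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  # Defaults for every chunk knitted after this call.
  knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

  # Read a default back to confirm it was stored.
  print(knitr::opts_chunk$get("warning"))
}
```

An option written in an individual chunk header still overrides these defaults for that chunk, so you can turn warnings back on in the one place where you want readers to see them.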
There are other chunk options that haven’t been mentioned, but the above are the ones that you’ll likely use the most. More information can be found here and here.
Your YAML header is the section at the top of the RMarkdown document that says your name, the date you made the document, and the output type. The default options are the title, author, date, and output.
Generally, the default output type will be “html_document”.
You can add a table of contents to your document by adding `toc: true` under the output type. The table of contents is built from your headers (not chunk names). If you put `toc_float: true`, your table of contents will float on the left side of the document. `toc_float: false` is the default and puts the table of contents at the top of the document. `toc_depth` sets the deepest header level included in the table of contents. If you put `toc_depth: 2`, then headers with three or more hashtags will not be included in the table of contents.
If you want to add a theme to your HTML document, the YAML header is the place to do it. This document uses the `united` theme with the `tango` highlight. The highlight option specifies a syntax highlighting style for code, while the theme affects the whole document. Both `united` and `tango` come built in with the rmarkdown package. Additional themes can be installed through other packages.
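Putting the table of contents and theme options together, a YAML header along the following lines would match the settings described above. This sketch writes the header to a temporary .Rmd file from R so the nesting and indentation are easy to see. The title and author values are placeholders, and the two lines of body text are only there to make the file complete.

```r
# YAML header using the options discussed above; indentation matters.
yaml_header <- c(
  "---",
  "title: \"My Report\"",   # placeholder title
  "author: \"Your Name\"",  # placeholder author
  "output:",
  "  html_document:",
  "    toc: true",
  "    toc_float: true",
  "    toc_depth: 2",
  "    theme: united",
  "    highlight: tango",
  "---"
)

# Write a minimal document to a temporary file and read the header back.
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(yaml_header, "", "# First header", "", "Some text."), rmd_path)
writeLines(readLines(rmd_path, n = length(yaml_header)))
```

Opening the temporary file in RStudio and knitting it is one way to try the options out. Note that the options sit two levels below `output:`, under `html_document:`.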
Here is a general guide about controlling the appearance of HTML documents, which includes more information about tables of contents.
Here is a gallery that shows the appearances of different themes.
There are a number of R packages that can assist in visualization, but my favorites include `ggplot2`, `sjPlot`, and `kableExtra`.
`ggplot2` is a package that everyone loves and hates. You can use it to create beautiful visualizations, but it can be difficult to learn. `ggplot2` is part of the tidyverse, which includes data wrangling packages like `dplyr` and `tidyr`.
Here is an introduction to `ggplot2`.
# Randomly generating data
Clarity <- sample(0:10, 100, replace = TRUE)
Size <- sample(0:100, 100, replace = TRUE)
Sharpness <- sample(0:3, 100, replace = TRUE)
Color <- c("Red", "Blue", "Yellow", "Green", "Pink")
Color <- sample(Color, 100, replace=TRUE)
Hue <- c("Light", "Light", "Dark", "Light", "Dark")
Hue <- sample(Hue, 100, replace=TRUE)
more_data <- data.frame(Clarity, Size, Sharpness, Color, Hue)
head(more_data)
## Clarity Size Sharpness Color Hue
## 1 1 9 0 Yellow Dark
## 2 0 4 1 Blue Dark
## 3 4 9 0 Blue Dark
## 4 7 93 1 Red Light
## 5 10 16 2 Blue Dark
## 6 6 58 1 Pink Light
# Assigning colors to values
more_data$order <- 0
more_data$order[more_data$Color == "Blue"] <- "#00BFC4"
more_data$order[more_data$Color == "Green"] <- "#00BA38"
more_data$order[more_data$Color == "Pink"] <- "#F564E3"
more_data$order[more_data$Color == "Red"] <- "#F8766D"
more_data$order[more_data$Color == "Yellow"] <- "#ffdd00"
jcolors <- more_data$order
names(jcolors) <- more_data$Color
head(jcolors)
## Yellow Blue Blue Red Blue Pink
## "#ffdd00" "#00BFC4" "#00BFC4" "#F8766D" "#00BFC4" "#F564E3"
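The color vector above has one entry per row of the data. A more compact alternative, sketched here in base R, is a named vector with one entry per color. The seed and the names `gem_colors`, `palette_by_name`, and `row_colors` are assumptions made for this illustration, while the hex codes are the ones used above.

```r
set.seed(1)  # arbitrary seed so the demo is reproducible
gem_colors <- sample(c("Red", "Blue", "Yellow", "Green", "Pink"),
                     100, replace = TRUE)

# One entry per color: names are the category labels, values are hex codes.
palette_by_name <- c(
  Blue = "#00BFC4", Green = "#00BA38", Pink = "#F564E3",
  Red = "#F8766D", Yellow = "#ffdd00"
)

# Indexing by name gives one hex code per row, like the vector built above.
row_colors <- palette_by_name[gem_colors]
print(head(row_colors))
```

A five-element named vector like `palette_by_name` can be passed as `values` to `scale_fill_manual()` or `scale_color_manual()`, which match the names against the values of the mapped variable.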
library(ggplot2)  # load ggplot2 before calling theme_set() and ggplot()
theme_set(theme_gray())
ggplot(data=more_data, aes(x=Color, y=Size, fill=Hue)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Gemstone Barplot by Color")
theme_set(theme_gray())
ggplot(data=more_data, aes(x=Hue, y=Size, fill=Color)) +
  geom_bar(stat = 'identity', position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Gemstone Barplot by Hue") +
  scale_fill_manual(values = jcolors)
theme_set(theme_minimal())
ggplot(more_data, aes(x=Clarity, fill=Color)) +
  geom_density(alpha=0.2, lwd=0.2) +
  ggtitle("Gemstone Density Plot") +
  scale_fill_manual(values = jcolors)
theme_set(theme_gray())
ggplot(more_data, aes(x=Sharpness, y=Size)) +
  geom_point(aes(color=Color, fill=Color), shape=23, size=4) +
  scale_fill_manual(values = jcolors) +
  scale_color_manual(values = jcolors) +
  ggtitle("Gemstone Dot Plot")
`sjPlot` is a package that easily makes HTML tables for regression models with the function `tab_model`. If you know any HTML or CSS, you can use that to further tweak table appearances.
You can learn more about `tab_model` here.
library(sjPlot)  # provides tab_model()
mod1 <- lm(Size ~ Clarity, data=more_data)
mod2 <- lm(Size ~ Clarity + Sharpness, data=more_data)
tab_model(mod1, mod2)
| | Size | | | Size | | |
|---|---|---|---|---|---|---|
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 54.62 | 43.87 – 65.38 | <0.001 | 55.05 | 41.94 – 68.16 | <0.001 |
| Clarity | -0.40 | -2.14 – 1.35 | 0.657 | -0.39 | -2.15 – 1.36 | 0.661 |
| Sharpness | | | | -0.31 | -5.75 – 5.13 | 0.911 |
| Observations | 100 | | | 100 | | |
| R2 / adjusted R2 | 0.002 / -0.008 | | | 0.002 / -0.018 | | |
predlabels <- c("Intercept", "Clarity", "Sharpness")
dvlabels <- c("Clarity", "Clarity + Sharpness")
colorder <- c("est", "se", "stat", "p")
tab_model(mod1, mod2,
          auto.label=FALSE, pred.labels=predlabels, dv.labels=dvlabels,
          string.pred=" ", col.order=colorder,
          title = "Predicting Size",
          show.ci=FALSE, show.df=FALSE, show.obs=FALSE,
          show.est=TRUE, show.se=TRUE, show.std=FALSE, show.stat=TRUE,
          string.se="Standard Error", string.stat="T",
          p.threshold = 0.05, p.style=c("numeric"),
          digits = 2, digits.p = 3, emph.p = TRUE, wrap.labels=25)
| | Clarity | | | | Clarity + Sharpness | | | |
|---|---|---|---|---|---|---|---|---|
| | Estimates | Standard Error | T | p | Estimates | Standard Error | T | p |
| Intercept | 54.62 | 5.49 | 9.96 | <0.001 | 55.05 | 6.69 | 8.23 | <0.001 |
| Clarity | -0.40 | 0.89 | -0.45 | 0.657 | -0.39 | 0.90 | -0.44 | 0.661 |
| Sharpness | | | | | -0.31 | 2.78 | -0.11 | 0.911 |
| R2 / adjusted R2 | 0.002 / -0.008 | | | | 0.002 / -0.018 | | | |
`kableExtra`, in conjunction with the `kable` function from the `knitr` package, can format data frames into nice-looking tables for HTML and LaTeX output.
You can learn more about `kableExtra` here.
rm(list=setdiff(ls(), "more_data"))
more_data$order <- NULL
Clarity <- aggregate(Clarity~Color, data=more_data, mean)
Size <- aggregate(Size~Color, data=more_data, mean)
Sharpness <- aggregate(Sharpness~Color, data=more_data, mean)
even_more_data <- merge(Clarity, Size, by = 'Color')
even_more_data <- merge(even_more_data, Sharpness, by = 'Color')
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling() and row_spec()
even_more_data %>%
  kable() %>%
  kable_styling()
Color | Clarity | Size | Sharpness |
---|---|---|---|
Blue | 5.250000 | 46.62500 | 1.458333 |
Green | 5.045454 | 51.63636 | 1.409091 |
Pink | 5.041667 | 48.45833 | 1.541667 |
Red | 4.500000 | 54.50000 | 1.500000 |
Yellow | 5.722222 | 65.88889 | 1.111111 |
even_more_data %>%
  kable("html", digits = 2, format.args = list(decimal.mark = ".", big.mark = ","),
        caption = "Gemstones by Colors") %>%
  kable_styling(full_width = FALSE, position = "left") %>%
  row_spec(1, background = "#00BFC4") %>% # Blue
  row_spec(2, background = "#00BA38") %>% # Green
  row_spec(3, background = "#F564E3") %>% # Pink
  row_spec(4, background = "#F8766D") %>% # Red
  row_spec(5, background = "#ffdd00") # Yellow
Color | Clarity | Size | Sharpness |
---|---|---|---|
Blue | 5.25 | 46.62 | 1.46 |
Green | 5.05 | 51.64 | 1.41 |
Pink | 5.04 | 48.46 | 1.54 |
Red | 4.50 | 54.50 | 1.50 |
Yellow | 5.72 | 65.89 | 1.11 |