Introduction

Before you start going through this document, you should have first gone through the RMarkdown document called “Learning_R.rmd”, which can be found in the TaDLab’s GoogleDrive and in the twoyear Github repository. This guide is best read when viewing both versions (RMarkdown and HTML) simultaneously, as you can see how the R syntax translates to HTML.

Even though R is a program for statistical analysis, it can be used to create scientific and nice-looking reports. RMarkdown files can be “knitted”, aka converted, into HTML files or PDF files. Knitting won’t alter the .rmd file, but rather make a wholly separate file based on the .rmd file. This guide will focus on knitting HTML with RMarkdown.

For your reference, here is a cheatsheet for RMarkdown for almost everything that occurs outside of chunks.

RMarkdown Basics

In RMarkdown, things can either be inside of chunks or outside of chunks. Inside of chunks is where actual statistical analysis occurs, including randomly generating data and plotting variables.

labels <- c("Red", "Blue", "Yellow", "Green", "Pink")
data <- data.frame(replicate(5,sample(0:10,5,rep=TRUE)))
colnames(data) <- labels
head(data)
##   Red Blue Yellow Green Pink
## 1   7    5      5     6    8
## 2   3    7      3     4   10
## 3   3    5      5     6    8
## 4  10    1      8     7    5
## 5   9   10      2     9    6
plot(data)

Outside of chunks, you can do many MicrosoftWord-like things, such write plain, italicized, bolded, deleted, and quoted text. This is a good opportunity to talk about your code or explain the results of your statistical analysis in plain English. Other features allow you to make tables, create numbered lists, insert images or special characters, include mathematical proofs, and embed links.

For your reference, here is a cheatsheet for RMarkdown.

Headers and chunk names

Headers can be used while making RMarkdown documents and can be seen when knitted to HTML. Headers are always blue and are signified with hashtags. You can use between one and six hashtags to designate the “size” of the header with more hastags equalling smaller header sizes. You aren’t required to put a space between the hashtag and your text, although it looks nicer in RMarkdown.

Example 1

Example 2

Example 3

Example 4

Example 5
Example 6

Example 7

In RMarkdown, if you look at the line at the bottom of the Source pane it should say # Example 6 and then a little up arrow and a little down arrow. If you click on that, it’ll show you the outline for this whole RMarkdown document. You can see the header for Examples 1 through 6, but can’t see Example 7 since six hashtags is the limit for header sizes. If you cursor is under the Example 6 header, then the outline will say Example 6. Using headers can make it easy to organize things, like analyses on two separate datasets.

Additionally, if you look at the outline, it will also say chunk names. Here is an example.

head(data)
##   Red Blue Yellow Green Pink
## 1   7    5      5     6    8
## 2   3    7      3     4   10
## 3   3    5      5     6    8
## 4  10    1      8     7    5
## 5   9   10      2     9    6

If you look in the outline in RMarkdown, you can see Example dataset under the Example 6 header. If you don’t name your chunks, it’ll automatically label them “Chunk 1”, “Chunk 2”, etc. There is no particular reason to name chunks other than keeping track/organizing your document. While headers will appear in knitted HTML documents, chunk names will not.

It is important that you place chunk names in the right place, after the r but between the brackets. You also can’t have two chunks with the same name or else the file will halt knitting. Chunk names also go before chunk options, which is discussed below.

Chunk options

When writing a RMarkdown document, you may do many “behind the scene” things like recoding or renaming variables. While it is important to the reader to know what kind of data wrangling techniques you utilized, they probably don’t need to see the lines and lines of code where you do this. This is where chunk options come into play. Chunk options are specified after chunk names (if you choose you name your chunks).

Include

include is an option that says whether this chunk should be included in the HTML document at all. In your HTML document, it’ll seem like you never wrote this code in the first place. Chunks will default to include=TRUE if nothing contradicting is specified.

# See? It disappeared!

Echo

echo is an option that allows you to hide the code but show the output in a HTML document. Imagine if you wrote a lot of code to create a nice visualization, but the code is so messy it would distract the reader from the visualization. Therefore, you could use echo=FALSE to hide the code but show the visualization. Readers would not know how you created the visualization. Chunks will default to echo=TRUE if nothing contradicting it is specified.

Warnings, messages, and errors

Similiarly, imagine you are doing statistical analyses and you end up with some warning messages. Warnings are different from errors: errors prevent the analysis from taking place since there is something flawed with the code, but warnings tell you something probably went wrong and you might want to fix it, but the code will ultimately follow through. Messages are different from warnings as they are usually harmless. Messages often tell you details about your analysis that you might not be aware of or that the command you’re using is outdated. Warnings and messages will default to TRUE unless otherwise specified.

In the below example, I am using the package sjPlot to create a correlation matrix based on randomly generated data. The syntax used will create a message and a warning: a message that indicates which correlation method is computed and a warning that redunant text was removed from the generated correlation matrix.

sjp.corr(data, decimals=1)
## Computing correlation using pearson-method with listwise-deletion...

# Warning is supressed, but message isn't
sjp.corr(data, decimals=1)
## Warning: Removed 15 rows containing missing values (geom_text).

# Message is suppressed, but warning isn't

Alternatively, you may want to showcase failed code. In that case, you would designate error=TRUE. This will allow the HTML to properly knit despite the error. In the chunk below, I’m attempting to find the mean of a variable that doesn’t exist.

mean(data$Orange)
## Warning in mean.default(data$Orange): argument is not numeric or logical:
## returning NA
## [1] NA

Hiding and showing results

If you wanted to hide your results but show the syntax, you would use the results = 'hide'. However, this phrase only works for things like summary output and tables.

summary(data)

If you wanted to suppress a plot, you would use fig.keep = 'none'.

plot(data$Red, data$Green)

For certain packages or commands, you are required you to put results = 'asis' in your chunks. The package summarytools is a package that can make pretty tables out of descriptive statistics, but its output can look very different depending on your chunk options. The difference between the two chunks below can be seen when knitted to a HTML.

summarytools::descr(data, style = "rmarkdown", na.rm=TRUE)

Descriptive Statistics

  Blue Green Pink Red Yellow
Mean 5.60 6.40 7.40 6.40 4.60
Std.Dev. 3.29 1.82 1.95 3.29 2.30
Min 1.00 4.00 5.00 3.00 2.00
Q1 5.00 6.00 6.00 3.00 3.00
Median 5.00 6.00 8.00 7.00 5.00
Q3 7.00 7.00 8.00 9.00 5.00
Max 10.00 9.00 10.00 10.00 8.00
MAD 2.97 1.48 2.97 4.45 2.97
IQR 2.00 1.00 2.00 6.00 2.00
CV 0.59 0.28 0.26 0.51 0.50
Skewness -0.06 0.13 0.04 -0.08 0.29
SE.Skewness 0.91 0.91 0.91 0.91 0.91
Kurtosis -1.58 -1.55 -1.85 -2.18 -1.68
N.Valid 5.00 5.00 5.00 5.00 5.00
% Valid 100.00 100.00 100.00 100.00 100.00
summarytools::descr(data, style = "rmarkdown", na.rm=TRUE)
## ### Descriptive Statistics  
## 
## |          &nbsp; |   Blue |  Green |   Pink |    Red | Yellow |
## |----------------:|-------:|-------:|-------:|-------:|-------:|
## |        **Mean** |   5.60 |   6.40 |   7.40 |   6.40 |   4.60 |
## |    **Std.Dev.** |   3.29 |   1.82 |   1.95 |   3.29 |   2.30 |
## |         **Min** |   1.00 |   4.00 |   5.00 |   3.00 |   2.00 |
## |          **Q1** |   5.00 |   6.00 |   6.00 |   3.00 |   3.00 |
## |      **Median** |   5.00 |   6.00 |   8.00 |   7.00 |   5.00 |
## |          **Q3** |   7.00 |   7.00 |   8.00 |   9.00 |   5.00 |
## |         **Max** |  10.00 |   9.00 |  10.00 |  10.00 |   8.00 |
## |         **MAD** |   2.97 |   1.48 |   2.97 |   4.45 |   2.97 |
## |         **IQR** |   2.00 |   1.00 |   2.00 |   6.00 |   2.00 |
## |          **CV** |   0.59 |   0.28 |   0.26 |   0.51 |   0.50 |
## |    **Skewness** |  -0.06 |   0.13 |   0.04 |  -0.08 |   0.29 |
## | **SE.Skewness** |   0.91 |   0.91 |   0.91 |   0.91 |   0.91 |
## |    **Kurtosis** |  -1.58 |  -1.55 |  -1.85 |  -2.18 |  -1.68 |
## |     **N.Valid** |   5.00 |   5.00 |   5.00 |   5.00 |   5.00 |
## |     **% Valid** | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |

Please note that chunk options only influence things when you knit files to HTML. If you just clicked “Run All” for the RMarkdown file, it would run everything and ignore the chunk options, since chunk options are only concerned with knitting. For example, if you included a warning=FALSE in every chunk, you’ll still get warnings in the R console. Those warnings just won’t appear when the document is knitted into a HTML.

There are other chunk options that haven’t been mentioned, but the above are the ones that you’ll likely use the most. More information can be found here and here.

YAML headers and themes

Your YAML header is the section at the top of the RMarkdown document that says your name, the date you made the document, and output type. The default options are:

  • Title
  • Author
  • Date (aka the day it was created)
  • Output type

Generally, the default output type will be “html_document”.

You can include a table of contents to your document by adding toc: true under output. Table of contents are made up by your headers (not chunk names). If you put toc_float: true, your table of contents will float on the left side of the document. toc_float: false is the default and puts the table of contents at the top of the document. toc_depth specifies the threshold of whether headers should be included in the table of contents. If you put toc_depth: 2, then headers with three hashtags beside them will not be included in the table of contents.

If you wanted to add a theme to your HTML document, the YAML header would be the place to do this. This document uses the united theme with the tango highlight. Highlights specifies a syntax highlighting style, while themes affect the whole document. Both united and tango come installed with R. Additional themes can be downloaded with packages.

Here is a general guide about controlling the appearance of HTML documents, which includes more information about tables of contents.

Here is a gallery that shows the appearances of different themes.

Visualization packages

There are a number of R packages that can assist in visualization, but my favorites include ggplot2, sjPlot, and kableExtra.

ggplot2

ggplot2 is a package that everyone loves and hates. You can use it to create beautiful visualizations but it can be difficult to learn. ggplot2 is part of the tidyverse, which includes data wrangling packages like dplyr and tidyr.

Here is an introduction to ggplot2.

# Randomly generating data 

Clarity <- replicate(1,sample(0:10,100,rep=TRUE))
Size <- replicate(1,sample(0:100,100,rep=TRUE))
Sharpness <- replicate(1,sample(0:3,100,rep=TRUE))
Color <- c("Red", "Blue", "Yellow", "Green", "Pink")
Color <- sample(Color, 100, replace=TRUE)
Hue <- c("Light", "Light", "Dark", "Light", "Dark")
Hue <- sample(Hue, 100, replace=TRUE)
more_data <- data.frame(Clarity, Size, Sharpness, Color, Hue)
head(more_data)
##   Clarity Size Sharpness  Color   Hue
## 1       1    9         0 Yellow  Dark
## 2       0    4         1   Blue  Dark
## 3       4    9         0   Blue  Dark
## 4       7   93         1    Red Light
## 5      10   16         2   Blue  Dark
## 6       6   58         1   Pink Light
# Assigning colors to values

more_data$order <- 0
more_data$order[more_data$Color == "Blue"] <- "#00BFC4"
more_data$order[more_data$Color == "Green"] <- "#00BA38"
more_data$order[more_data$Color == "Pink"] <- "#F564E3"
more_data$order[more_data$Color == "Red"] <- "#F8766D"
more_data$order[more_data$Color == "Yellow"] <- "#ffdd00"

jcolors <- more_data$order
names(jcolors) <- more_data$Color
head(jcolors)
##    Yellow      Blue      Blue       Red      Blue      Pink 
## "#ffdd00" "#00BFC4" "#00BFC4" "#F8766D" "#00BFC4" "#F564E3"
theme_set(theme_gray())
ggplot(data=more_data, aes(x=Color, y=Size, fill=Hue)) + 
  geom_bar(stat = 'identity', position = 'dodge') + 
  theme(legend.position = "none") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Gemstone Barplot by Color")

theme_set(theme_gray())
ggplot(data=more_data, aes(x=Hue, y=Size, fill=Color)) + 
  geom_bar(stat = 'identity', position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  ggtitle("Gemstone Barplot by Hue") +
  scale_fill_manual(values = jcolors)

theme_set(theme_minimal())
ggplot(more_data,aes(x=Clarity,fill=Color)) + 
  geom_density(alpha=0.2,lwd=0.2) +
  ggtitle("Gemstone Density Plot") +
  scale_fill_manual(values = jcolors)

theme_set(theme_gray())
ggplot(more_data, aes(x=Sharpness, y=Size)) + 
  geom_point(aes(color=Color, fill=Color), shape=23, size=4) + 
  scale_fill_manual(values = jcolors) + 
  scale_color_manual(values = jcolors)+
  ggtitle("Gemstone Dot Plot")

sjPlot

sjPlot is a package that easily makes HTML tables for regression models with the function tab_model. If you know any HTML or CSS, you can use that to further tweak table appearances.

You can learn more about tab_model here.

mod1 <- lm(Size ~ Clarity, data=more_data)
mod2 <- lm(Size ~ Clarity + Sharpness, data=more_data)
tab_model(mod1, mod2)
  Size Size
Predictors Estimates CI p Estimates CI p
(Intercept) 54.62 43.87 – 65.38 <0.001 55.05 41.94 – 68.16 <0.001
Clarity -0.40 -2.14 – 1.35 0.657 -0.39 -2.15 – 1.36 0.661
Sharpness -0.31 -5.75 – 5.13 0.911
Observations 100 100
R2 / adjusted R2 0.002 / -0.008 0.002 / -0.018
predlabels <- c("Intercept", "Clarity", "Sharpness")

dvlabels <- c("Clarity", "Clarity + Sharpness")

colorder <- c("est", "se", "stat", "p")

tab_model(mod1, mod2,
          auto.label=FALSE, pred.labels=predlabels, dv.labels=dvlabels, 
          string.pred=" ", col.order=colorder,
          title = "Predicting Size",
          show.ci=FALSE, show.df=FALSE, show.obs=FALSE,
          show.est=TRUE, show.se=TRUE, show.std=FALSE, show.stat=TRUE, 
          string.se="Standard Error", string.stat="T",
          p.threshold = 0.05, p.style=c("numeric"),
          digits = 2, digits.p = 3, emph.p = TRUE, wrap.labels=25)
Predicting Size
  Clarity Clarity + Sharpness
Estimates Standard Error T p Estimates Standard Error T p
Intercept 54.62 5.49 9.96 <0.001 55.05 6.69 8.23 <0.001
Clarity -0.40 0.89 -0.45 0.657 -0.39 0.90 -0.44 0.661
Sharpness -0.31 2.78 -0.11 0.911
R2 / adjusted R2 0.002 / -0.008 0.002 / -0.018

kableExtra

kableExtra, in conjunction with the knitr and kable packages, can create format dataframes into nice-looking tables for HTML and LaTeX output.

You can learn more about kableExtra here.

rm(list=setdiff(ls(), "more_data"))

more_data$order <- NULL

Clarity <- aggregate(Clarity~Color, data=more_data, mean)
Size <- aggregate(Size~Color, data=more_data, mean)
Sharpness <- aggregate(Sharpness~Color, data=more_data, mean)

even_more_data <- merge(Clarity, Size, by = 'Color')
even_more_data <- merge(even_more_data, Sharpness, by = 'Color')
even_more_data %>%
  kable() %>%
  kable_styling
Color Clarity Size Sharpness
Blue 5.250000 46.62500 1.458333
Green 5.045454 51.63636 1.409091
Pink 5.041667 48.45833 1.541667
Red 4.500000 54.50000 1.500000
Yellow 5.722222 65.88889 1.111111
even_more_data %>%
  kable("html",digits=2, format.args = list(decimal.mark = ".", big.mark = ","),caption = "Gemstones by Colors") %>%
  kable_styling(full_width = F, position = "left") %>%
  row_spec(1:1, background = "#00BFC4") %>% # Blue
  row_spec(2:2, background = "#00BA38") %>% # Green
  row_spec(3:3, background = "#F564E3") %>% # Pink
  row_spec(4:4, background = "#F8766D") %>% # Red
  row_spec(5:5, background = "#ffdd00") # Yellow
Gemstones by Colors
Color Clarity Size Sharpness
Blue 5.25 46.62 1.46
Green 5.05 51.64 1.41
Pink 5.04 48.46 1.54
Red 4.50 54.50 1.50
Yellow 5.72 65.89 1.11