Additional Resources for Learning R

A 500+ page textbook.

Resources for R and Stata.

A really useful website: if you Google an issue in R, this site will probably be among the results.

Swirl is an R package that teaches you R in R.

What is R?

R is a programming language specifically designed for statistical analysis. R is open-source and is developed by a team of statisticians and programmers in both academia and industry. Alternatives to R include SPSS, Stata, SAS, and Python.

RStudio is an Integrated Development Environment (in other words, a user-friendly interface) for R. It requires R to be installed in order to function.

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. You can embed an R code chunk like this:
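The chunk that followed this sentence is not shown here. In RStudio's default R Markdown template, the same sentence introduces a chunk that summarizes the built-in cars dataset, so the following is only a sketch of the kind of code a chunk might hold, assuming nothing beyond base R:

```r
# 'cars' is a small data frame that ships with R
# (speed and stopping distance of cars).
summary(cars)
```

In the .Rmd source, a chunk starts with a line of three backticks followed by {r} and ends with a line of three backticks.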


What do all the buttons do?

The panel in the upper right contains your Environment as well as a History of the commands that you’ve previously entered. Your Environment contains any objects or variables you’ve created or datasets that you have read into RStudio.
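As a small illustration of how objects end up in the Environment, here is a sketch using made-up object names and values (class_size and scores are hypothetical, not part of the workshop data):

```r
# Create two objects; both appear in the Environment tab after this runs.
class_size <- 219
scores <- c(430, 580, 640, 690, 800)

# ls() lists the names of the objects in the current environment.
ls()

# rm() deletes an object, which also removes it from the Environment tab.
rm(class_size)
```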

Any plots that you generate will show up in the panel in the lower right corner under the Plots tab, or will appear in the R Markdown document directly below the code.

The panel in the lower left is where the action happens. It’s called the Console. Every time you launch RStudio, it will have the same text at the top of the Console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades and now provide what many users feel is a fairly natural way to access data and to organize, describe, and invoke statistical computations.

To get you started, you will enter all commands at the R prompt (i.e. right after > in the Console). You can either type them in manually or copy and paste them. To test the code below, copy and paste it into the Console, or place your cursor in the chunk and press Ctrl+Enter.

print("Hello world")
## [1] "Hello world"

R and RStudio tips and tricks

The up arrow on your keyboard will allow you to scroll up through your past commands.

The tab key on your keyboard will help you (particularly in RStudio) by offering ways to finish your code. If you start typing mea and hit tab, it will suggest mean() among other things. If you type mean(~hwy, data=vehicles, and hit tab, it will tell you the other arguments you can use for the mean() function.

When working within a .R or .Rmd file, you can put your cursor on a line and hit Ctrl + Enter to get the code to execute in the Console. (On a Mac, Command + Enter.)

If you get stuck with some syntax (usually, mismatched parentheses or quotes), the R Console will change from the > at the beginning of the line (which means it is waiting for a new command) to the + at the beginning of the line (which means it is waiting for you to finish a command). To get out, hit the Escape key.

What are packages?

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation, and can be found in the Packages tab (in the lower right panel by default). Once installed, they have to be loaded into the session every time RStudio is opened.

Some common packages are mosaic, car, dplyr, ggplot2, stats, and psych. mosaic and car are basic packages used to teach math, statistics, computation, and modeling. dplyr is used for “data wrangling”, e.g. turning raw data into less-messy datasets. ggplot2 is used for making pretty graphs and visualizations. psych and stats include several commands related to concepts in introductory statistics, such as descriptive statistics, correlation matrices, and factor analysis. Stat2Data just includes a bunch of datasets.
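The difference between installing a package (once) and loading it (every session) can be sketched as follows. The package names come from the paragraph above. The variable names pkgs, installed, and missing_pkgs are only illustrative, and the install line is commented out because it needs an internet connection.

```r
# Package names taken from the paragraph above.
pkgs <- c("mosaic", "car", "dplyr", "ggplot2", "psych", "Stat2Data")

# requireNamespace() returns TRUE when a package is already installed.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Installing only has to happen once per computer.
# install.packages(missing_pkgs)

# Loading has to be repeated in every new R session, e.g.:
# library(mosaic)
```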

Run the following chunk by first removing the hashtags. Then place your cursor anywhere in the chunk below and press Ctrl+Shift+Enter.

#Putting a hashtag in front of a line turns the line into a comment. It's useful for explaining code.

Now do the same thing for the chunk below.


Actual Statistics

View the following dataset. Placing “?” before an object or function name opens its help page. Pressing Ctrl+Enter will run a single line of code, while pressing Ctrl+Shift+Enter will run all the code in the current chunk.
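The chunk for this step is not shown here. A minimal sketch of loading and inspecting the data, assuming the Stat2Data package is installed and that FirstYearGPA is the dataset it provides under that name:

```r
library(Stat2Data)    # assumption: the package is already installed
data(FirstYearGPA)    # loads the data frame into the Environment

head(FirstYearGPA)    # first six rows
dim(FirstYearGPA)     # number of rows and columns

# These two open a help page and a spreadsheet-style viewer in RStudio:
# ?FirstYearGPA
# View(FirstYearGPA)
```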

Descriptive Statistics


#Average GPA
## [1] 3.096164
#Average SAT Verbal
## [1] 605.0685
#Average SAT Math
## [1] 634.2922
#How spread out are the values from the mean?
#SD GPA
## [1] 0.4654759
#SD SAT Verbal
## [1] 83.39345
#SD SAT Math
## [1] 75.23557
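The calls behind the numbers above are not shown. One plausible way to reproduce them, assuming base R's mean() and sd() applied to the GPA, SATV, and SATM columns of FirstYearGPA:

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

# Averages
mean(FirstYearGPA$GPA)
mean(FirstYearGPA$SATV)
mean(FirstYearGPA$SATM)

# Standard deviations
sd(FirstYearGPA$GPA)
sd(FirstYearGPA$SATV)
sd(FirstYearGPA$SATM)
```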


hist(FirstYearGPA$SATM, freq=FALSE)

If you run the above lines at the same time, it produces a histogram with a density curve. “col” and “lwd” specify the color and thickness of the line. You can make “col” equal blue or hotpink or almost any color.
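Only the hist() call is shown above. A sketch of the two steps together, assuming the curve is added with lines() and density() (the color and width here are arbitrary choices, and satm_density is an illustrative name):

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

# freq = FALSE puts density on the y-axis, so bars and curve share a scale.
hist(FirstYearGPA$SATM, freq = FALSE)

# density() computes a kernel density estimate;
# lines() draws it on top of the existing histogram.
satm_density <- density(FirstYearGPA$SATM)
lines(satm_density, col = "blue", lwd = 2)
```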

P.S. Putting “base::” or “mosaic::” in front of a function specifies which package it comes from, e.g. that “count()” is the version from the mosaic package. This matters because many packages have functions with identical names.
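A tiny illustration of the package::function syntax, using sd() from the stats package, which ships with R:

```r
# Both calls run the same function; the second names its package explicitly.
sd(c(2, 4, 6))
stats::sd(c(2, 4, 6))
```

The explicit form also works when the package is installed but has not been loaded with library().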


#Gender distribution of the sample
mosaic::count(FirstYearGPA$Male == '0')
## n_TRUE 
##    117
mosaic::count(FirstYearGPA$Male == '1')
## n_TRUE 
##    102
#Number of FirstGen students
mosaic::count(FirstYearGPA$FirstGen == '0')
## n_TRUE 
##    194
mosaic::count(FirstYearGPA$FirstGen == '1')
## n_TRUE 
##     25
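As an alternative sketch to the counts above, base R's table() returns the count for every value of a variable in one call (gender_counts is an illustrative name):

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

# One call per variable instead of one call per value.
gender_counts <- table(FirstYearGPA$Male)
gender_counts
table(FirstYearGPA$FirstGen)

# prop.table() turns counts into proportions.
prop.table(gender_counts)
```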

Many R commands do basically the same thing…

##  min  Q1 median  Q3 max     mean       sd   n missing
##  430 580    640 690 800 634.2922 75.23557 219       0
##    vars   n   mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 219 634.29 75.24    640   637.8 74.13 430 800   370 -0.4    -0.14
##      se
## X1 5.08
## [1] 430 580 640 690 800
##       GPA            HSGPA            SATV            SATM      
##  Min.   :1.930   Min.   :2.340   Min.   :260.0   Min.   :430.0  
##  1st Qu.:2.745   1st Qu.:3.170   1st Qu.:565.0   1st Qu.:580.0  
##  Median :3.150   Median :3.500   Median :610.0   Median :640.0  
##  Mean   :3.096   Mean   :3.453   Mean   :605.1   Mean   :634.3  
##  3rd Qu.:3.480   3rd Qu.:3.760   3rd Qu.:670.0   3rd Qu.:690.0  
##  Max.   :4.150   Max.   :4.000   Max.   :740.0   Max.   :800.0  
##       Male              HU              SS            FirstGen     
##  Min.   :0.0000   Min.   : 0.00   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 8.00   1st Qu.: 3.000   1st Qu.:0.0000  
##  Median :0.0000   Median :13.00   Median : 6.000   Median :0.0000  
##  Mean   :0.4658   Mean   :13.11   Mean   : 7.249   Mean   :0.1142  
##  3rd Qu.:1.0000   3rd Qu.:17.00   3rd Qu.:11.000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :40.00   Max.   :21.000   Max.   :1.0000  
##      White       CollegeBound   
##  Min.   :0.00   Min.   :0.0000  
##  1st Qu.:1.00   1st Qu.:1.0000  
##  Median :1.00   Median :1.0000  
##  Mean   :0.79   Mean   :0.9224  
##  3rd Qu.:1.00   3rd Qu.:1.0000  
##  Max.   :1.00   Max.   :1.0000

To see what the commands do, enter “?favstats” or a question mark followed by the command into the Console.
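The calls that produced the four outputs above are not shown. Judging by the output layouts, they look like favstats() from mosaic, describe() from psych, and base R's fivenum() and summary(). That matching is an inference, and the sketch assumes mosaic and psych are installed:

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

# min, quartiles, max, mean, sd, n, and number missing
mosaic::favstats(~SATM, data = FirstYearGPA)

# adds trimmed mean, skew, kurtosis, and standard error
psych::describe(FirstYearGPA$SATM)

# Tukey's five-number summary (base R)
fivenum(FirstYearGPA$SATM)

# quartiles and mean for every column (base R)
summary(FirstYearGPA)
```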

Basic Statistics


## [1] 0.4468873
## [1] 0.1943439
## [1] 0.3043114
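The three values above appear without the calls that produced them. The first two match the square roots of the R-squared values reported for the HSGPA and SATM models later in this document (0.1997 and 0.03777), which suggests they are Pearson correlations with GPA. That reading is an inference, and the source of the third value is not identified here. A sketch under that assumption (r_hsgpa and r_satm are illustrative names):

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

# Pearson correlation between freshman GPA and each predictor
r_hsgpa <- cor(FirstYearGPA$GPA, FirstYearGPA$HSGPA)
r_satm <- cor(FirstYearGPA$GPA, FirstYearGPA$SATM)
r_hsgpa
r_satm

# With a single predictor, the squared correlation equals R-squared.
r_hsgpa^2
```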
#This makes a dataset that just has female students and another dataset that just has male students

female <- FirstYearGPA[ which(FirstYearGPA$Male=='0'), ]
male <- FirstYearGPA[ which(FirstYearGPA$Male=='1'), ]


#Are female SATM scores significantly different from zero?

##  One Sample t-test
## data:  female$SATM
## t = 89.363, df = 116, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  594.8085 621.7727
## sample estimates:
## mean of x 
##  608.2906
#What about male scores?

##  One Sample t-test
## data:  male$SATM
## t = 102.17, df = 101, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  651.2232 677.0121
## sample estimates:
## mean of x 
##  664.1176
#Are boys' and girls' scores significantly different from each other?

##  Welch Two Sample t-test
## data:  male$SATM and female$SATM
## t = 5.9315, df = 216.88, p-value = 1.174e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  37.27632 74.37778
## sample estimates:
## mean of x mean of y 
##  664.1176  608.2906
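The t-test calls themselves are not shown above. The "data:" lines in the output suggest they were made with t.test(), so a sketch might look like this (the subsets are rebuilt here so the block runs on its own, and gender_test is an illustrative name):

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

female <- FirstYearGPA[FirstYearGPA$Male == 0, ]
male <- FirstYearGPA[FirstYearGPA$Male == 1, ]

# One-sample tests: is the mean different from 0 (the default mu)?
t.test(female$SATM)
t.test(male$SATM)

# Two-sample test; t.test() uses the Welch version unless var.equal = TRUE.
gender_test <- t.test(male$SATM, female$SATM)
gender_test

# Individual pieces of the result can be pulled out by name.
gender_test$p.value
```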

Fun Plots

P.S. “Index” just refers to row number in the dataset

#Plots all high school GPAs
plot(FirstYearGPA$HSGPA,main="High School GPA")

#Plots all freshman GPAs
plot(FirstYearGPA$GPA,main="Freshman GPA")

#Plots male and female SAT Math scores
plot(male$SATM,col="blue",main="Male and Female SAT Math Scores")

col <- c("lightblue","pink") #This makes a variable to color the plot.

boxplot(male$SATM,female$SATM,names=c("Male","Female"),main="Gender and SAT Math scores",col=col)

#Try making a boxplot but this time use SAT Verbal scores

Linear models

Does high school GPA predict freshman GPA?

plot(GPA~HSGPA,data=FirstYearGPA,main="High School GPA predicting Freshman GPA",xlab="High School GPA",ylab="Freshman GPA") + abline(lm(GPA~HSGPA,data=FirstYearGPA), col='red')

## integer(0)
#If you took away everything after the plus sign, it would just be a plot without a line
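In base R graphics the plus sign is not required: plot() draws the figure and abline() adds a line to whatever plot is currently open. Both return NULL, and adding two NULLs gives integer(0), which is where that output line comes from. A sketch of the same figure written as separate calls (hs_model is an illustrative name):

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

# Step 1: the scatterplot. In the formula GPA ~ HSGPA, HSGPA is on the x-axis.
plot(GPA ~ HSGPA, data = FirstYearGPA,
     main = "High School GPA predicting Freshman GPA",
     xlab = "High School GPA", ylab = "Freshman GPA")

# Step 2: fit the line and draw it on top of the existing plot.
hs_model <- lm(GPA ~ HSGPA, data = FirstYearGPA)
abline(hs_model, col = "red")
```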

Linear models help us describe the relationship between the outcome variable and one or more predictor variables. Does high school GPA actually have an impact on freshman GPA?

mod1 <- lm(GPA~HSGPA,data=FirstYearGPA)
## Call:
## lm(formula = GPA ~ HSGPA, data = FirstYearGPA)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10565 -0.31329  0.05871  0.29485  0.82291 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.17985    0.26194   4.504 1.09e-05 ***
## HSGPA        0.55501    0.07542   7.359 3.78e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.4174 on 217 degrees of freedom
## Multiple R-squared:  0.1997, Adjusted R-squared:  0.196 
## F-statistic: 54.15 on 1 and 217 DF,  p-value: 3.783e-12
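The table above has the layout printed by summary() for an lm object, although the call itself is not shown. A sketch of producing it and then using the fitted model, where the high school GPA of 3.5 is an arbitrary example value and predicted_gpa is an illustrative name:

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

mod1 <- lm(GPA ~ HSGPA, data = FirstYearGPA)

# Full table: coefficients, standard errors, p-values, R-squared
summary(mod1)

# Just the intercept and slope
coef(mod1)

# Predicted freshman GPA for a student with a 3.5 high school GPA
predicted_gpa <- predict(mod1, newdata = data.frame(HSGPA = 3.5))
predicted_gpa
```

The slope says that each additional point of high school GPA is associated with about 0.56 more points of freshman GPA in this sample.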

Do SAT Math scores predict freshman GPA?

plot(GPA~SATM,data=FirstYearGPA) + abline(lm(GPA~SATM,data=FirstYearGPA), col='red')

## integer(0)
#Try making it so there's a title and captions along the x and y axes
mod2 <- lm(GPA~SATM,data=FirstYearGPA)
## Call:
## lm(formula = GPA ~ SATM, data = FirstYearGPA)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1850 -0.3080  0.0409  0.3511  0.9752 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.333499   0.263144   8.868 2.78e-16 ***
## SATM        0.001202   0.000412   2.919  0.00389 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.4577 on 217 degrees of freedom
## Multiple R-squared:  0.03777,    Adjusted R-squared:  0.03334 
## F-statistic: 8.518 on 1 and 217 DF,  p-value: 0.003888
#Try making a linear model that predicts GPA using SAT Verbal scores

Does high school GPA predict freshman GPA with gender as a moderator? With race as a moderator?
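In the formulas below, the * between two predictors is shorthand for both main effects plus their interaction, and the interaction term is what tests moderation. The expansion can be checked without any data, as in this small sketch (model_terms is an illustrative name):

```r
# terms() shows how R expands a model formula.
model_terms <- attr(terms(GPA ~ HSGPA * Male), "term.labels")
model_terms
```

The HSGPA:Male entry corresponds to the HSGPA:Male row in the coefficient table.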

mod3 <- lm(GPA~HSGPA*Male,data=FirstYearGPA)
## Call:
## lm(formula = GPA ~ HSGPA * Male, data = FirstYearGPA)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12417 -0.31207  0.05607  0.30210  0.87282 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7513     0.3517   2.137   0.0338 *  
## HSGPA         0.6664     0.1003   6.642 2.49e-10 ***
## Male          0.8823     0.5259   1.678   0.0949 .  
## HSGPA:Male   -0.2306     0.1517  -1.520   0.1299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.4148 on 215 degrees of freedom
## Multiple R-squared:  0.2169, Adjusted R-squared:  0.206 
## F-statistic: 19.85 on 3 and 215 DF,  p-value: 2.143e-11
mod4 <- lm(GPA~HSGPA*White,data=FirstYearGPA)
## Call:
## lm(formula = GPA ~ HSGPA * White, data = FirstYearGPA)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07223 -0.27429  0.03405  0.28068  0.73491 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.2459     0.4963   0.496   0.6208    
## HSGPA         0.7593     0.1441   5.268 3.33e-07 ***
## White         1.3010     0.5745   2.264   0.0246 *  
## HSGPA:White  -0.2923     0.1664  -1.757   0.0804 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.3981 on 215 degrees of freedom
## Multiple R-squared:  0.2784, Adjusted R-squared:  0.2684 
## F-statistic: 27.65 on 3 and 215 DF,  p-value: 3.647e-15
white <- FirstYearGPA[ which(FirstYearGPA$White=='1'), ]
nonwhite <- FirstYearGPA[ which(FirstYearGPA$White=='0'), ]
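The plot that used these two subsets is not shown here. One way to overlay the two scatterplots, offered as a sketch with arbitrary colors and titles rather than a reconstruction of the original figure:

```r
library(Stat2Data)   # assumption: the package is already installed
data(FirstYearGPA)

white <- FirstYearGPA[FirstYearGPA$White == 1, ]
nonwhite <- FirstYearGPA[FirstYearGPA$White == 0, ]

# Draw the first group; the axis limits come from the full dataset
# so that points from the second group cannot fall outside the plot.
plot(GPA ~ HSGPA, data = white, col = "blue",
     xlim = range(FirstYearGPA$HSGPA), ylim = range(FirstYearGPA$GPA),
     main = "High School GPA and Freshman GPA by Race",
     xlab = "High School GPA", ylab = "Freshman GPA")

# points() adds the second group to the existing plot.
points(GPA ~ HSGPA, data = nonwhite, col = "red")

# One fitted line per group
abline(lm(GPA ~ HSGPA, data = white), col = "blue")
abline(lm(GPA ~ HSGPA, data = nonwhite), col = "red")

legend("bottomright", legend = c("White", "Nonwhite"),
       col = c("blue", "red"), pch = 1)
```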


Now try making a linear regression model where FirstGen is the moderator instead of gender or race. Then make a similar plot for FirstGen students. This will involve making two datasets (first-gen and not first-gen, like the male/female and white/nonwhite datasets) and overlaying their scatterplots.