Swirl is an R package that teaches you R from within R.
R is a programming language specifically designed for statistical analysis. R is open-source, and is developed by a team of statisticians and programmers in both academia and industry. Alternatives to R include SPSS, Stata, SAS, and Python.
RStudio is an Integrated Development Environment (aka a user-friendly interface) for R. It requires R to be installed in order to work.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. You can embed an R code chunk like this:
plot(cars)
The up arrow on your keyboard will allow you to scroll up through your past commands.
The tab key on your keyboard will help you (particularly in RStudio) by offering ways to finish your code. If you start typing mea and hit tab, it will suggest mean() among other things. If you type mean(~hwy, data=vehicles, and hit tab, it will tell you the other arguments you can use for the mean() function.
When working within a .R or .Rmd file, you can put your cursor on a line and hit Ctrl + Enter to get the code to execute in the Console. (On a Mac, Command + Enter.)
If you get stuck with some syntax (usually, mismatched parentheses or quotes), the R Console will change from the > at the beginning of the line (which means it is waiting for a new command) to the + at the beginning of the line (which means it is waiting for you to finish a command). To get out, hit the Escape key.
Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation, and can be found in RStudio's Packages tab (in the lower-right pane by default). Once installed, they have to be loaded into the session every time RStudio is opened.
Some common packages are mosaic, car, dplyr, ggplot2, stats, and psych. mosaic and car are basic packages used to teach math, statistics, computation, and modeling. dplyr is used for “data wrangling”, e.g. turning raw data into less-messy datasets. ggplot2 is used for making pretty graphs and visualizations. psych and stats include several commands related to concepts in introductory statistics, such as descriptive statistics, correlation matrices, and factor analysis. Stat2Data just includes a bunch of datasets.
Run the following chunk by first removing the hashtags from the install.packages() lines. Then place your cursor anywhere in the chunk below and press Ctrl+Shift+Enter. (stats already ships with R, so its line can stay commented out.)
#Putting a hashtag in front of a line turns the line into a comment. It's useful for explaining code.
#install.packages("mosaic")
#install.packages("car")
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("stats")
#install.packages("psych")
#install.packages("Stat2Data")
Now do the same thing for the chunk below.
library(mosaic)
library(car)
library(dplyr)
library(ggplot2)
library(stats)
library(psych)
library(Stat2Data)
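If you are not sure which of these packages are already on your computer, a check like the one below can help. This is only a sketch: the vector name `needed` is made up for illustration, and the list simply repeats the packages named above (minus stats, which comes with R).

```r
# Packages used in this document (stats is left out because it ships with R).
needed <- c("mosaic", "car", "dplyr", "ggplot2", "psych", "Stat2Data")

# installed.packages() returns a matrix with one row per installed package;
# the row names are the package names.
is_installed <- needed %in% rownames(installed.packages())

# Names of the packages that still need to be installed (possibly none).
not_installed <- needed[!is_installed]
not_installed

# Uncomment the next line to install only what is missing:
# if (length(not_installed) > 0) install.packages(not_installed)
```

Installing happens once per computer, while library() has to be run in every new session.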
View the following dataset. Placing “?” before an object or function name opens its help page. Pressing Ctrl+Enter will run a single line of code, while pressing Ctrl+Shift+Enter will run all the code in the current chunk.
library(Stat2Data)
data(FirstYearGPA)
?FirstYearGPA
View(FirstYearGPA)
?base::mean
#Average GPA
mean(FirstYearGPA$GPA)
## [1] 3.096164
#Average SAT Verbal
mean(FirstYearGPA$SATV)
## [1] 605.0685
#Average SAT Math
mean(FirstYearGPA$SATM)
## [1] 634.2922
?stats::sd
#How spread out are the values from the mean?
#SD GPA
sd(FirstYearGPA$GPA)
## [1] 0.4654759
#SD SAT Verbal
sd(FirstYearGPA$SATV)
## [1] 83.39345
#SD SAT Math
sd(FirstYearGPA$SATM)
## [1] 75.23557
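To see what sd() is doing, here is a small sketch that computes the sample standard deviation by hand and compares it with sd(). The vector `x` is made-up data used only for illustration, not part of FirstYearGPA.

```r
# Made-up values, used only to illustrate the formula.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Step 1: how far is each value from the mean?
deviations <- x - mean(x)

# Step 2: square the deviations, add them up, and divide by n - 1.
sample_variance <- sum(deviations^2) / (n - 1)

# Step 3: take the square root to get back to the original units.
manual_sd <- sqrt(sample_variance)

# The hand calculation should agree with R's built-in function.
all.equal(manual_sd, sd(x))
```

Note that sd() divides by n - 1 rather than n, because it estimates the standard deviation from a sample.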
hist(FirstYearGPA$GPA)
hist(FirstYearGPA$SATV)
hist(FirstYearGPA$SATM, freq=FALSE)
lines(density(FirstYearGPA$SATM),col="red",lwd=3)
If you run the last two lines above together, they produce a histogram with a density curve drawn over it. “col” and “lwd” specify the color and thickness of the line. You can set “col” to "blue" or "hotpink" or almost any color name.
P.S. Putting a package name and “::” in front of a function (for example “base::mean()” or “mosaic::count()”) just specifies which package the function comes from. This is useful because many packages have commands with identical names.
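Here is a small sketch of why the “::” prefix matters. It deliberately creates a function that shares its name with stats::sd(); the replacement function is made up purely for illustration.

```r
# A user-defined function that happens to share its name with stats::sd().
sd <- function(x) "my own sd()"

# Without a prefix, R finds the version defined above first.
sd(c(1, 2, 3))

# With the prefix, R always uses the version from the stats package.
stats::sd(c(1, 2, 3))

# Remove our version so that sd() refers to stats::sd() again.
rm(sd)
```

The same thing happens when two loaded packages both define a function with the same name: the prefix removes any doubt about which one you get.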
?mosaicCore::count
#Gender distribution of the sample
mosaic::count(FirstYearGPA$Male == '0')
## n_TRUE
## 117
mosaic::count(FirstYearGPA$Male == '1')
## n_TRUE
## 102
#Number of FirstGen students
mosaic::count(FirstYearGPA$FirstGen == '0')
## n_TRUE
## 194
mosaic::count(FirstYearGPA$FirstGen == '1')
## n_TRUE
## 25
mosaic::favstats(FirstYearGPA$SATM)
## min Q1 median Q3 max mean sd n missing
## 430 580 640 690 800 634.2922 75.23557 219 0
psych::describe(FirstYearGPA$SATM)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 219 634.29 75.24 640 637.8 74.13 430 800 370 -0.4 -0.14
## se
## X1 5.08
fivenum(FirstYearGPA$SATM)
## [1] 430 580 640 690 800
summary(FirstYearGPA)
## GPA HSGPA SATV SATM
## Min. :1.930 Min. :2.340 Min. :260.0 Min. :430.0
## 1st Qu.:2.745 1st Qu.:3.170 1st Qu.:565.0 1st Qu.:580.0
## Median :3.150 Median :3.500 Median :610.0 Median :640.0
## Mean :3.096 Mean :3.453 Mean :605.1 Mean :634.3
## 3rd Qu.:3.480 3rd Qu.:3.760 3rd Qu.:670.0 3rd Qu.:690.0
## Max. :4.150 Max. :4.000 Max. :740.0 Max. :800.0
## Male HU SS FirstGen
## Min. :0.0000 Min. : 0.00 Min. : 0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 8.00 1st Qu.: 3.000 1st Qu.:0.0000
## Median :0.0000 Median :13.00 Median : 6.000 Median :0.0000
## Mean :0.4658 Mean :13.11 Mean : 7.249 Mean :0.1142
## 3rd Qu.:1.0000 3rd Qu.:17.00 3rd Qu.:11.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :40.00 Max. :21.000 Max. :1.0000
## White CollegeBound
## Min. :0.00 Min. :0.0000
## 1st Qu.:1.00 1st Qu.:1.0000
## Median :1.00 Median :1.0000
## Mean :0.79 Mean :0.9224
## 3rd Qu.:1.00 3rd Qu.:1.0000
## Max. :1.00 Max. :1.0000
To see what the commands do, enter a question mark followed by the command name (for example “?favstats”) into the Console.
stats::cor(FirstYearGPA$HSGPA,FirstYearGPA$GPA)
## [1] 0.4468873
stats::cor(FirstYearGPA$SATM,FirstYearGPA$GPA)
## [1] 0.1943439
stats::cor(FirstYearGPA$SATV,FirstYearGPA$GPA)
## [1] 0.3043114
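The correlation that cor() reports is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on two short made-up vectors (`x` and `y` are illustrative, not columns of FirstYearGPA).

```r
# Made-up paired values, used only to illustrate the formula.
x <- c(2.8, 3.1, 3.5, 3.9, 3.3)
y <- c(2.5, 3.0, 3.2, 3.6, 3.4)

# Pearson correlation by hand: covariance scaled by both standard deviations.
manual_r <- cov(x, y) / (sd(x) * sd(y))

# Should agree with the built-in function.
all.equal(manual_r, cor(x, y))
```

Because of the scaling, the result always falls between -1 and 1 regardless of the units of the two variables.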
#This makes a dataset that just has female students and another dataset that just has male students
female <- FirstYearGPA[ which(FirstYearGPA$Male=='0'), ]
male <- FirstYearGPA[ which(FirstYearGPA$Male=='1'), ]
#Are female SATM scores significantly different from zero?
t.test(female$SATM)
##
## One Sample t-test
##
## data: female$SATM
## t = 89.363, df = 116, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 594.8085 621.7727
## sample estimates:
## mean of x
## 608.2906
#What about male scores?
t.test(male$SATM)
##
## One Sample t-test
##
## data: male$SATM
## t = 102.17, df = 101, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 651.2232 677.0121
## sample estimates:
## mean of x
## 664.1176
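The t value in the output above is the sample mean divided by its standard error, because t.test() compares the mean against 0 unless told otherwise. Here is a sketch on made-up scores (the vector `x` and the benchmark of 600 are assumptions chosen for illustration).

```r
# Made-up scores, used only to illustrate the calculation.
x <- c(610, 580, 640, 700, 590, 630)
n <- length(x)

# One-sample t statistic by hand: (mean - hypothesized mean) / standard error.
manual_t <- (mean(x) - 0) / (sd(x) / sqrt(n))

# t.test() uses mu = 0 by default, so the two should agree.
result <- t.test(x)
all.equal(manual_t, unname(result$statistic))

# Testing against a more meaningful benchmark only requires setting mu.
benchmark_test <- t.test(x, mu = 600)
benchmark_test$p.value
```

Since SAT scores can never be close to zero, comparing against a benchmark with the mu argument is usually the more informative question.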
#Are boys' and girls' scores significantly different from each other?
t.test(male$SATM,female$SATM)
##
## Welch Two Sample t-test
##
## data: male$SATM and female$SATM
## t = 5.9315, df = 216.88, p-value = 1.174e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 37.27632 74.37778
## sample estimates:
## mean of x mean of y
## 664.1176 608.2906
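You do not have to split the data into two datasets first. t.test() also accepts a formula of the form outcome ~ group. The sketch below uses a small made-up data frame (`scores` and its values are illustrative) to show that both ways of calling it give the same answer.

```r
# Made-up data frame with a score column and a 0/1 group column.
scores <- data.frame(
  SATM = c(600, 650, 700, 620, 580, 610, 560, 640),
  Male = c(1, 1, 1, 1, 0, 0, 0, 0)
)

# Approach 1: pass the two groups as separate vectors.
two_vectors <- t.test(scores$SATM[scores$Male == 1],
                      scores$SATM[scores$Male == 0])

# Approach 2: use the formula interface with the full data frame.
formula_form <- t.test(SATM ~ Male, data = scores)

# Both run Welch's test, so the p-values match.
all.equal(two_vectors$p.value, formula_form$p.value)
```

The formula version orders the groups by the value of Male (0 first), so the sign of the t value can flip compared with the two-vector version even though the p-value is identical.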
P.S. In the plots below, the x-axis label “Index” just refers to the row number in the dataset.
#Plots all high school GPAs
plot(FirstYearGPA$HSGPA,main="High School GPA")
#Plots all freshman GPAs
plot(FirstYearGPA$GPA,main="Freshman GPA")
#Plots male and female SAT Math scores
plot(male$SATM,col="blue",main="Male and Female SAT Math Scores")
points(female$SATM,col='magenta')
col <- c("lightblue","pink") #This makes a variable to color the plot.
boxplot(male$SATM,female$SATM,names=c("Male","Female"),main="Gender and SAT Math scores",col=col)
#Try making a boxplot but this time use SAT Verbal scores
plot(GPA~HSGPA,data=FirstYearGPA,main="High School GPA predicting Freshman GPA",xlab="High School GPA",ylab="Freshman GPA")
abline(lm(GPA~HSGPA,data=FirstYearGPA), col='red')
#If you left out the abline() line, it would just be a plot without a line
Linear models help us describe the relationship between the outcome variable and one or more predictor variables. Does high school GPA actually predict freshman GPA?
mod1 <- lm(GPA~HSGPA,data=FirstYearGPA)
summary(mod1)
##
## Call:
## lm(formula = GPA ~ HSGPA, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.10565 -0.31329 0.05871 0.29485 0.82291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.17985 0.26194 4.504 1.09e-05 ***
## HSGPA 0.55501 0.07542 7.359 3.78e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4174 on 217 degrees of freedom
## Multiple R-squared: 0.1997, Adjusted R-squared: 0.196
## F-statistic: 54.15 on 1 and 217 DF, p-value: 3.783e-12
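The Estimate column above defines a straight line that can be used to make predictions. As a worked example, the sketch below plugs a high school GPA of 3.5 into that line; the value 3.5 is an arbitrary choice for illustration, and the coefficients are copied from the output above.

```r
# Coefficients copied from the summary(mod1) output above.
intercept <- 1.17985
slope <- 0.55501

# An arbitrary high school GPA to predict from.
hsgpa <- 3.5

# Predicted freshman GPA = intercept + slope * high school GPA.
predicted_gpa <- intercept + slope * hsgpa
predicted_gpa

# With mod1 in your session, this call gives the same prediction
# (up to rounding of the coefficients):
# predict(mod1, newdata = data.frame(HSGPA = 3.5))
```

The result is about 3.12. The slope says that each additional point of high school GPA is associated with roughly 0.56 more points of freshman GPA.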
plot(GPA~SATM,data=FirstYearGPA)
abline(lm(GPA~SATM,data=FirstYearGPA), col='red')
#Try making it so there's a title and captions along the x and y axes
mod2 <- lm(GPA~SATM,data=FirstYearGPA)
summary(mod2)
##
## Call:
## lm(formula = GPA ~ SATM, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1850 -0.3080 0.0409 0.3511 0.9752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.333499 0.263144 8.868 2.78e-16 ***
## SATM 0.001202 0.000412 2.919 0.00389 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4577 on 217 degrees of freedom
## Multiple R-squared: 0.03777, Adjusted R-squared: 0.03334
## F-statistic: 8.518 on 1 and 217 DF, p-value: 0.003888
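With a single predictor, the Multiple R-squared in the model summary is just the square of the correlation between the predictor and the outcome. The sketch below checks this using the correlations reported earlier in this document.

```r
# Correlations with GPA copied from the cor() output earlier in this document.
r_hsgpa <- 0.4468873
r_satm <- 0.1943439

# Squaring each correlation reproduces the Multiple R-squared of mod1 and mod2.
round(r_hsgpa^2, 4)  # 0.1997
round(r_satm^2, 5)   # 0.03777
```

In other words, high school GPA accounts for about 20% of the variation in freshman GPA, while SAT Math accounts for less than 4%.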
#Try making a linear model that predicts GPA using SAT Verbal scores
mod3 <- lm(GPA~HSGPA*Male,data=FirstYearGPA)
summary(mod3)
##
## Call:
## lm(formula = GPA ~ HSGPA * Male, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12417 -0.31207 0.05607 0.30210 0.87282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7513 0.3517 2.137 0.0338 *
## HSGPA 0.6664 0.1003 6.642 2.49e-10 ***
## Male 0.8823 0.5259 1.678 0.0949 .
## HSGPA:Male -0.2306 0.1517 -1.520 0.1299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4148 on 215 degrees of freedom
## Multiple R-squared: 0.2169, Adjusted R-squared: 0.206
## F-statistic: 19.85 on 3 and 215 DF, p-value: 2.143e-11
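A model with an interaction (HSGPA*Male) fits a separate line for each group. The sketch below rebuilds the two lines from the coefficients above; the variable names are made up for illustration. Keep in mind that the HSGPA:Male term has a p-value of 0.1299, so the difference in slopes is not statistically significant at the usual 0.05 level.

```r
# Coefficients copied from the summary(mod3) output above.
b_intercept <- 0.7513
b_hsgpa <- 0.6664
b_male <- 0.8823
b_interaction <- -0.2306

# For female students (Male = 0) the Male terms drop out.
female_intercept <- b_intercept
female_slope <- b_hsgpa

# For male students (Male = 1) the Male terms are added in.
male_intercept <- b_intercept + b_male
male_slope <- b_hsgpa + b_interaction

c(female_intercept = female_intercept, female_slope = female_slope,
  male_intercept = male_intercept, male_slope = male_slope)
```

This gives a line of 0.7513 + 0.6664 * HSGPA for female students and 1.6336 + 0.4358 * HSGPA for male students.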
mod4 <- lm(GPA~HSGPA*White,data=FirstYearGPA)
summary(mod4)
##
## Call:
## lm(formula = GPA ~ HSGPA * White, data = FirstYearGPA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07223 -0.27429 0.03405 0.28068 0.73491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2459 0.4963 0.496 0.6208
## HSGPA 0.7593 0.1441 5.268 3.33e-07 ***
## White 1.3010 0.5745 2.264 0.0246 *
## HSGPA:White -0.2923 0.1664 -1.757 0.0804 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3981 on 215 degrees of freedom
## Multiple R-squared: 0.2784, Adjusted R-squared: 0.2684
## F-statistic: 27.65 on 3 and 215 DF, p-value: 3.647e-15
white <- FirstYearGPA[ which(FirstYearGPA$White=='1'), ]
nonwhite <- FirstYearGPA[ which(FirstYearGPA$White=='0'), ]
plot(white$HSGPA,col="yellowgreen")
points(nonwhite$HSGPA,col='purple')
Now try making a linear regression model where FirstGen is the moderator instead of gender or race. Then make a similar plot for FirstGen students. This will involve making two datasets (first-gen and not first-gen, like the male/female and white/nonwhite datasets) and overlaying their scatterplots.