| Package | Title | Maintainer | Version | URL | |
|---|---|---|---|---|---|
| car | Companion to Applied Regression | John Fox <jfox@mcmaster.ca>; | 3.0-2 | https://r-forge.r-project.org/projects/car/, | |
| tidyverse | Easily Install and Load the ‘Tidyverse’ | Hadley Wickham <hadley@rstudio.com>; | 1.2.1 | http://tidyverse.tidyverse.org, | |
| psych | Procedures for Psychological, Psychometric, and Personality Research | William Revelle <revelle@northwestern.edu>; | 1.8.12 | NA | |
| stats | The R Stats Package | R Core Team <R-core@r-project.org>; | 3.6.0 | NA | |
| summarytools | Tools to Quickly and Neatly Summarize Data | Dominic Comtois <dominic.comtois@gmail.com>; | 0.9.3 | NA | |
| kableExtra | Construct Complex Table with ‘kable’ and Pipe Syntax | Hao Zhu <haozhu233@gmail.com>; | 1.1.0 | http://haozhu233.github.io/kableExtra/, | 
Data wrangling is the process of reformatting raw data so it can be used in data analysis. Sometimes we need to zero-mean data, rename variables, remove outliers, deal with NA values, and a whole host of things before we can start investigating. Data wrangling is an important step of the data analysis process since many crucial decisions are made here that affect the results later on.
If you aren’t familiar with a function used below, just type ?function to learn more about it.
This guide will use the dataset airquality from the base R package datasets.
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6There are a number of shortcuts in R that will make your coding more efficient. For your reference here is a cheatsheet for R shortcuts and other cool things. I listed my most commonly used shortcuts below for Windows:
| Shortcut | Function | 
|---|---|
| Ctrl+A | Select All | 
| Ctrl+Enter | Run a single line | 
| Ctrl+Shift+Enter | Run a single chunk | 
| Ctrl+Shift+R | Run the entire file | 
| Ctrl+Shift+I | Insert a R chunk | 
## [1] 4In airquality, Temp is the fourth column.
This prints columns (and the columns’ contents) that contain the string Te, which only applies to Temp in this dataset. The select function is part of the dplyr package.
##   Temp
## 1   67
## 2   72
## 3   74
## 4   62
## 5   56
## 6   66If you look at the airquality dataset (also referred to as a dataframe), you’ll see an increasing column of numbers (starting at 1) to the left of the Ozone column. This is called the Index.
The below syntax prints the contents for row 42.
##    Ozone Solar.R Wind Temp Month Day
## 42    NA     259 10.9   93     6  11The below syntax prints the contents for all rows where Month is equal to 6.
##    Ozone Solar.R Wind Temp Month Day
## 32    NA     286  8.6   78     6   1
## 33    NA     287  9.7   74     6   2
## 34    NA     242 16.1   67     6   3
## 35    NA     186  9.2   84     6   4
## 36    NA     220  8.6   85     6   5
## 37    NA     264 14.3   79     6   6
## 38    29     127  9.7   82     6   7
## 39    NA     273  6.9   87     6   8
## 40    71     291 13.8   90     6   9
## 41    39     323 11.5   87     6  10
## 42    NA     259 10.9   93     6  11
## 43    NA     250  9.2   92     6  12
## 44    23     148  8.0   82     6  13
## 45    NA     332 13.8   80     6  14
## 46    NA     322 11.5   79     6  15
## 47    21     191 14.9   77     6  16
## 48    37     284 20.7   72     6  17
## 49    20      37  9.2   65     6  18
## 50    12     120 11.5   73     6  19
## 51    13     137 10.3   76     6  20
## 52    NA     150  6.3   77     6  21
## 53    NA      59  1.7   76     6  22
## 54    NA      91  4.6   76     6  23
## 55    NA     250  6.3   76     6  24
## 56    NA     135  8.0   75     6  25
## 57    NA     127  8.0   78     6  26
## 58    NA      47 10.3   73     6  27
## 59    NA      98 11.5   80     6  28
## 60    NA      31 14.9   77     6  29
## 61    NA     138  8.0   83     6  30When a value is blank or missing, R will interpret this cell as NA. The below syntax will print TRUE or FALSE for every cell depending on if it’s NA or not.
##      Ozone Solar.R  Wind  Temp Month   Day
## [1,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
## [6,] FALSE    TRUE FALSE FALSE FALSE FALSESince that’s not a good way of absorbing information, we can use colSums() to count the number of NAs for each column.
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0Which rows in Ozone and Solar.R have NA values?
##  [1]   5  10  25  26  27  32  33  34  35  36  37  39  42  43  45  46  52
## [18]  53  54  55  56  57  58  59  60  61  65  72  75  83  84 102 103 107
## [35] 115 119 150## [1]  5  6 11 27 96 97 98When dealing with missing values, remember that the argument na.rm=TRUE is your friend. You may not want to remove every observation with an NA value, since that’s about a fourth of the entire dataset, so you can just specify that for certain analyses that we pretend that the NA aren’t really there at all. na.rm = TRUE specifies that we should remove any NA values for that analysis.
## [1] NASee? It doesn’t work. You can’t find the average for a missing value.
## [1] 42.12931Now it works!
Boxplots and histograms are useful ways to visually find outliers. It looks like there are two outliers in Ozone and one outlier in Wind. We can also use the psych package to plot outliers.
Let’s imagine that airquality has a variable called Sunny where 1 meant Not Sunny, 2 meant Somewhat Sunny, and 3 meant Very Sunny. This is a 3-point Likert scale.
##   Ozone Solar.R Wind Temp Month Day Sunny
## 1    41     190  7.4   67     5   1     2
## 2    36     118  8.0   72     5   2     3
## 3    12     149 12.6   74     5   3     1
## 4    18     313 11.5   62     5   4     3
## 5    NA      NA 14.3   56     5   5     1
## 6    28      NA 14.9   66     5   6     2Now, say we wanted to change this 1-3 scale to a 0-2 scale. This something you might have to do often when managing survey data.
##   Ozone Solar.R Wind Temp Month Day Sunny
## 1    41     190  7.4   67     5   1     1
## 2    36     118  8.0   72     5   2     2
## 3    12     149 12.6   74     5   3     0
## 4    18     313 11.5   62     5   4     2
## 5    NA      NA 14.3   56     5   5     0
## 6    28      NA 14.9   66     5   6     1The below syntax renames Sunny to Sun.
##   Ozone Solar.R Wind Temp Month Day Sun
## 1    41     190  7.4   67     5   1   1
## 2    36     118  8.0   72     5   2   2
## 3    12     149 12.6   74     5   3   0
## 4    18     313 11.5   62     5   4   2
## 5    NA      NA 14.3   56     5   5   0
## 6    28      NA 14.9   66     5   6   1Let’s say we wanted the average of Temp for every month. aggregate is an extremely useful function provided by the stats package.
##   Month     Temp
## 1     5 65.54839
## 2     6 79.10000
## 3     7 83.90323
## 4     8 83.96774
## 5     9 76.90000We could also do so by day of the month.
##   Day Temp
## 1   1 80.2
## 2   2 80.8
## 3   3 79.4
## 4   4 81.8
## 5   5 79.2
## 6   6 79.8Sometimes generating random data is useful to testing out new functions.
I want to generate data based on rainfall that data. According to Wikipedia, the most rainfall in one day was 9.78 inches. So if I want to generate semi-realistic data, it should be between 0 and 10 inches. I can do this using sample().
## [1] 5.424837The average daily rainfall is pretty high, considering that Michigan isn’t that rainy. In June 2019, Ann Arbor hada monthly rainfall of 3.7 inches Let’s set the mean daily rainfall to 0.5 inches. We can do this using rnorm(), which generates data based on a normal distribution.
airquality$Rainfall <- NULL
airquality$Rainfall <- rnorm(n = 153,
                             mean = 0.5)
range(airquality$Rainfall)## [1] -2.488120  2.314857Now it includes negative rainfall. Let’s try that again. We can’t set a min/max with rnorm(), but we can with rtruncnorm.
airquality$Rainfall <- NULL
airquality$Rainfall <- truncnorm::rtruncnorm(n = 153,
                             a = 0, # Minimum value
                             b = 10, # Maximum value
                             mean = 0.5,
                             sd = .05)
range(airquality$Rainfall)## [1] 0.3594843 0.6283992## [1] 0.4980849We can do almost the same thing to generate random character data.
data(airquality)
airquality$Weather <- sample(x = c("Sunny", "Cloudy", "Rainy", "Windy", "Snowy"),
                             size = 153, 
                             replace = TRUE)
head(airquality)##   Ozone Solar.R Wind Temp Month Day Weather
## 1    41     190  7.4   67     5   1   Snowy
## 2    36     118  8.0   72     5   2   Snowy
## 3    12     149 12.6   74     5   3  Cloudy
## 4    18     313 11.5   62     5   4   Sunny
## 5    NA      NA 14.3   56     5   5   Rainy
## 6    28      NA 14.9   66     5   6  CloudyPlease note that airquality in its original form starts out with 153 rows. Let’s say I wanted to remove every observation that occured in the month of May.
## [1] 122Now I want to remove any rows that contained a missing value.
## [1] 111Let’s say that row 53 contains an error and I want to remove it.
## [1] 152Let’s say I wanted to remove the Day column entirely.
##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5##   Wind Temp Month Day
## 1  7.4   67     5   1
## 2  8.0   72     5   2
## 3 12.6   74     5   3
## 4 11.5   62     5   4
## 5 14.3   56     5   5
## 6 14.9   66     5   6Similiarly, I can use a very similiar syntax to remove column 6, which is Day.
##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5Let’s say I wanted for everytime the value “Rainy” appears, it should be replaced with “Stormy” in the column weather. Note how the syntax looks similiar to renaming a variable.
data(airquality)
airquality$Weather <- sample(x = c("Sunny", "Cloudy", "Rainy", "Windy", "Snowy"),
                             size = 153, 
                             replace = TRUE)
head(airquality)##   Ozone Solar.R Wind Temp Month Day Weather
## 1    41     190  7.4   67     5   1   Windy
## 2    36     118  8.0   72     5   2   Rainy
## 3    12     149 12.6   74     5   3   Windy
## 4    18     313 11.5   62     5   4   Rainy
## 5    NA      NA 14.3   56     5   5  Cloudy
## 6    28      NA 14.9   66     5   6   Snowy##   Ozone Solar.R Wind Temp Month Day Weather
## 1    41     190  7.4   67     5   1   Windy
## 2    36     118  8.0   72     5   2  Stormy
## 3    12     149 12.6   74     5   3   Windy
## 4    18     313 11.5   62     5   4  Stormy
## 5    NA      NA 14.3   56     5   5  Cloudy
## 6    28      NA 14.9   66     5   6   SnowyLet’s say that everytime the number “13” appears in the data, regardless of what column or row it’s in, we want to replace it with “77”. The syntax is very similiar to above, just without the variable specifier.
data(airquality)
length(which(airquality==13)) # How many times does the number 13 appear in the entire dataset?## [1] 10## [1] 9## [1] 0## [1] 19This is one way to manage NA/missing values: replace all NAs with a specific number. In the lab, we regularly replace all missing items with the code 88. The syntax is basically the same as replacing a value with another value.
## [1] 3## [1] 44## [1] 47## [1] 0This isn’t a recommended approach for all datasets though. We can tell that some cells in airquality naturally contain values of 88. This is more realistic when using self-report data that has a built-in Likert scale, so all possible values must be contained between 1 and 5.
Subsetting means to select specific rows or columns from a dataset. You can use this to make smaller datasets. For example, you may have a dataset of a survey measure, which includes 20+ items and 5 or so subscale sums. You may want to generate a table of descriptive statistics using summarytools, but you don’t really want to print the mean, median, and standard deviation for the 20+ specific items; you probably just want data for the subscales and total.
Let’s say I want to split up airquality into two datasets: one that contains Ozone, Solar.R., and Wind, and another that contains Temp, Month, and Day. Even if I had no packages installed, there are multiple ways I could do this with just base R.
data(airquality)
first_vars <- c("Ozone", "Solar.R", "Wind")
second_vars <- c("Temp", "Month", "Day")
first_data <- airquality[first_vars]
second_data <- airquality[second_vars]
head(first_data)##   Ozone Solar.R Wind
## 1    41     190  7.4
## 2    36     118  8.0
## 3    12     149 12.6
## 4    18     313 11.5
## 5    NA      NA 14.3
## 6    28      NA 14.9##   Temp Month Day
## 1   67     5   1
## 2   72     5   2
## 3   74     5   3
## 4   62     5   4
## 5   56     5   5
## 6   66     5   6Using a colon allows me to specify that I want columns 1 through 3. I could also specify columns 1, 2, and 3 individually, which is useful if I wanted to select nonsequential columns.
##   Ozone Solar.R Wind
## 1    41     190  7.4
## 2    36     118  8.0
## 3    12     149 12.6
## 4    18     313 11.5
## 5    NA      NA 14.3
## 6    28      NA 14.9##   Temp Month Day
## 1   67     5   1
## 2   72     5   2
## 3   74     5   3
## 4   62     5   4
## 5   56     5   5
## 6   66     5   6The two above methods subset datasets based on what I want to be in my new dataset. What if I wanted a new dataset that just excluded Day? The below syntax creates a new dataset called no_day_data that includes everything except day.
##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5I can do the same thing, except using index numbers instead of column names.
##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5Let’s say I had another dataset called pollution. Let’s also say that both airquality and pollution had ID numbers.
# Create pollution
water_pollution <- replicate(1,sample(1:10,153,rep=TRUE))
light_pollution <- replicate(1,sample(1:10,153,rep=TRUE))
trash_output <- replicate(1,sample(1:50,153,rep=TRUE))
pollution <- data.frame(water_pollution, light_pollution, trash_output)# Give airquality and pollution ID numbers
data(airquality)
airquality <- mutate(airquality, id = rownames(airquality))
pollution <- mutate(pollution, id = rownames(pollution))Now I want to merge airquality and pollution by id. It’s important to have a variable that the two datasets have in common, or else merge won’t know how to merge the datasets.
##    id Ozone Solar.R Wind Temp Month Day water_pollution light_pollution
## 1   1    41     190  7.4   67     5   1              10               7
## 2  10    NA     194  8.6   69     5  10               1               8
## 3 100    89     229 10.3   90     8   8               6               1
## 4 101   110     207  8.0   90     8   9               1               2
## 5 102    NA     222  8.6   92     8  10               9               3
## 6 103    NA     137 11.5   86     8  11               9               1
##   trash_output
## 1            6
## 2           50
## 3           15
## 4           30
## 5            5
## 6           45What if we wanted to create a new dataset that only included rows where Wind was greater than 10? Or if I only wanted to included observations where Wind was greater than 10 AND Ozone was less than 30? Or include observations where Wind was greater than 10 OR wind was less than 30?
If-then logic can get a bit complicated quickly when working with R. Two functions that can do this are the subset function in base R, the nrow function in base R, and the ifelse function in the dplyr package.
Let’s say I want to make a new dataset where I only wanted to include observations where Wind was greater than 10.
##   Ozone Solar.R Wind Temp Month Day
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
## 8    19      99 13.8   59     5   8
## 9     8      19 20.1   61     5   9Let’s say I wanted to count the number of observations where Ozone was greater than 30.
## [1] 96Let’s say I wanted to create a new binary variable and add it to airquality. This new variable will be called “Hot”. If Temp is below 70, Hot will be 0; if Temp is 70 or greater, Hot wil be 1.
##   Ozone Solar.R Wind Temp Month Day Hot
## 1    41     190  7.4   67     5   1   0
## 2    36     118  8.0   72     5   2   1
## 3    12     149 12.6   74     5   3   1
## 4    18     313 11.5   62     5   4   0
## 5    NA      NA 14.3   56     5   5   0
## 6    28      NA 14.9   66     5   6   0It’s important to keep your environment uncluttered while working with data. Try to get rid of unneeded things when you can. I made a bit of a mess when creating the pollution dataset and using if-then logic. You can remove one object at a time with rm().
Or you can remove everything except an object or a list of objects using the below syntax.
Finally, data cleaning usually includes basic descriptive statistics to see if the data is skewed or if there is a large number of outliers.
## [1] NA## [1] NAHowever, instead of typing out everything, we can use the descr function from summarytools to make a neat table of our variables.
N: 153
| Day | Month | Ozone | Solar.R | Temp | Wind | |
|---|---|---|---|---|---|---|
| Mean | 15.80 | 6.99 | 42.13 | 185.93 | 77.88 | 9.96 | 
| Std.Dev | 8.86 | 1.42 | 32.99 | 90.06 | 9.47 | 3.52 | 
| Min | 1.00 | 5.00 | 1.00 | 7.00 | 56.00 | 1.70 | 
| Q1 | 8.00 | 6.00 | 18.00 | 115.00 | 72.00 | 7.40 | 
| Median | 16.00 | 7.00 | 31.50 | 205.00 | 79.00 | 9.70 | 
| Q3 | 23.00 | 8.00 | 63.50 | 259.00 | 85.00 | 11.50 | 
| Max | 31.00 | 9.00 | 168.00 | 334.00 | 97.00 | 20.70 | 
| MAD | 11.86 | 1.48 | 25.95 | 98.59 | 8.90 | 3.41 | 
| IQR | 15.00 | 2.00 | 45.25 | 143.00 | 13.00 | 4.10 | 
| CV | 0.56 | 0.20 | 0.78 | 0.48 | 0.12 | 0.35 | 
| Skewness | 0.00 | 0.00 | 1.21 | -0.42 | -0.37 | 0.34 | 
| SE.Skewness | 0.20 | 0.20 | 0.22 | 0.20 | 0.20 | 0.20 | 
| Kurtosis | -1.22 | -1.32 | 1.11 | -1.00 | -0.46 | 0.03 | 
| N.Valid | 153.00 | 153.00 | 116.00 | 146.00 | 153.00 | 153.00 | 
| Pct.Valid | 100.00 | 100.00 | 75.82 | 95.42 | 100.00 | 100.00 | 
For more information about how to make the above table and plot look prettier, go through the pretty_r_guide.rmd on Github or you can find a link to the HTML here.