The Basics of Data Wrangling

Packages Used
Package	Title	Maintainer	Version	URL
car	Companion to Applied Regression	John Fox <jfox@mcmaster.ca>;	3.0-2	https://r-forge.r-project.org/projects/car/,
tidyverse	Easily Install and Load the ‘Tidyverse’	Hadley Wickham <hadley@rstudio.com>;	1.2.1	http://tidyverse.tidyverse.org,
psych	Procedures for Psychological, Psychometric, and Personality Research	William Revelle <revelle@northwestern.edu>;	1.8.12	NA
stats	The R Stats Package	R Core Team <R-core@r-project.org>;	3.6.0	NA
summarytools	Tools to Quickly and Neatly Summarize Data	Dominic Comtois <dominic.comtois@gmail.com>;	0.9.3	NA
kableExtra	Construct Complex Table with ‘kable’ and Pipe Syntax	Hao Zhu <haozhu233@gmail.com>;	1.1.0	http://haozhu233.github.io/kableExtra/,

Introduction to Data Wrangling

Data wrangling is the process of reformatting raw data so it can be used in data analysis. Sometimes we need to zero-mean data, rename variables, remove outliers, deal with NA values, and a whole host of things before we can start investigating. Data wrangling is an important step of the data analysis process since many crucial decisions are made here that affect the results later on.

If you aren’t familiar with a function used below, just type ?function to learn more about it.

This guide will use the dataset airquality from the base R package datasets.

data(airquality)
head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

R shortcuts

There are a number of shortcuts in R that will make your coding more efficient. For your reference here is a cheatsheet for R shortcuts and other cool things. I listed my most commonly used shortcuts below for Windows:

Shortcut	Function
Ctrl+A	Select All
Ctrl+Enter	Run a single line
Ctrl+Shift+Enter	Run a single chunk
Ctrl+Shift+R	Run the entire file
Ctrl+Shift+I	Insert a R chunk

Find something

Find a column based on name

which( colnames(airquality)=="Temp" )

## [1] 4

In airquality, Temp is the fourth column.

Find a column based on condition

This prints columns (and the columns’ contents) that contain the string Te, which only applies to Temp in this dataset. The select function is part of the dplyr package.

head(select(airquality, contains("Te")))

##   Temp
## 1   67
## 2   72
## 3   74
## 4   62
## 5   56
## 6   66

Find rows based on index number

If you look at the airquality dataset (also referred to as a dataframe), you’ll see an increasing column of numbers (starting at 1) to the left of the Ozone column. This is called the Index.

The below syntax prints the contents for row 42.

airquality[42, ]

##    Ozone Solar.R Wind Temp Month Day
## 42    NA     259 10.9   93     6  11

Find rows based on condition

The below syntax prints the contents for all rows where Month is equal to 6.

airquality[airquality$Month == 6, ]

##    Ozone Solar.R Wind Temp Month Day
## 32    NA     286  8.6   78     6   1
## 33    NA     287  9.7   74     6   2
## 34    NA     242 16.1   67     6   3
## 35    NA     186  9.2   84     6   4
## 36    NA     220  8.6   85     6   5
## 37    NA     264 14.3   79     6   6
## 38    29     127  9.7   82     6   7
## 39    NA     273  6.9   87     6   8
## 40    71     291 13.8   90     6   9
## 41    39     323 11.5   87     6  10
## 42    NA     259 10.9   93     6  11
## 43    NA     250  9.2   92     6  12
## 44    23     148  8.0   82     6  13
## 45    NA     332 13.8   80     6  14
## 46    NA     322 11.5   79     6  15
## 47    21     191 14.9   77     6  16
## 48    37     284 20.7   72     6  17
## 49    20      37  9.2   65     6  18
## 50    12     120 11.5   73     6  19
## 51    13     137 10.3   76     6  20
## 52    NA     150  6.3   77     6  21
## 53    NA      59  1.7   76     6  22
## 54    NA      91  4.6   76     6  23
## 55    NA     250  6.3   76     6  24
## 56    NA     135  8.0   75     6  25
## 57    NA     127  8.0   78     6  26
## 58    NA      47 10.3   73     6  27
## 59    NA      98 11.5   80     6  28
## 60    NA      31 14.9   77     6  29
## 61    NA     138  8.0   83     6  30

Find missing values

When a value is blank or missing, R will interpret this cell as NA. The below syntax will print TRUE or FALSE for every cell depending on if it’s NA or not.

head(is.na(airquality))

##      Ozone Solar.R  Wind  Temp Month   Day
## [1,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE   FALSE FALSE FALSE FALSE FALSE
## [5,]  TRUE    TRUE FALSE FALSE FALSE FALSE
## [6,] FALSE    TRUE FALSE FALSE FALSE FALSE

Since that’s not a good way of absorbing information, we can use colSums() to count the number of NAs for each column.

colSums(is.na(airquality))

##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0

Which rows in Ozone and Solar.R have NA values?

which(is.na(airquality$Ozone))

##  [1]   5  10  25  26  27  32  33  34  35  36  37  39  42  43  45  46  52
## [18]  53  54  55  56  57  58  59  60  61  65  72  75  83  84 102 103 107
## [35] 115 119 150

which(is.na(airquality$Solar.R))

## [1]  5  6 11 27 96 97 98

When dealing with missing values, remember that the argument na.rm=TRUE is your friend. You may not want to remove every observation with an NA value, since that’s about a fourth of the entire dataset, so you can just specify that for certain analyses that we pretend that the NA aren’t really there at all. na.rm = TRUE specifies that we should remove any NA values for that analysis.

mean(airquality$Ozone)

## [1] NA

See? It doesn’t work. You can’t find the average for a missing value.

mean(airquality$Ozone, na.rm=TRUE)

## [1] 42.12931

Now it works!

Find outliers

Boxplots and histograms are useful ways to visually find outliers. It looks like there are two outliers in Ozone and one outlier in Wind. We can also use the psych package to plot outliers.

boxplot(airquality)

hist(airquality$Ozone)

hist(airquality$Wind)

outlier <- psych::outlier(airquality, na.rm=TRUE)

Rescale variables

Let’s imagine that airquality has a variable called Sunny where 1 meant Not Sunny, 2 meant Somewhat Sunny, and 3 meant Very Sunny. This is a 3-point Likert scale.

data(airquality)
airquality$Sunny <- replicate(1,sample(1:3,153,rep=TRUE))
head(airquality)

##   Ozone Solar.R Wind Temp Month Day Sunny
## 1    41     190  7.4   67     5   1     2
## 2    36     118  8.0   72     5   2     3
## 3    12     149 12.6   74     5   3     1
## 4    18     313 11.5   62     5   4     3
## 5    NA      NA 14.3   56     5   5     1
## 6    28      NA 14.9   66     5   6     2

Now, say we wanted to change this 1-3 scale to a 0-2 scale. This something you might have to do often when managing survey data.

airquality$Sunny <- car::recode(airquality$Sunny, "1 = 0; 2 = 1; 3 = 2")
head(airquality)

##   Ozone Solar.R Wind Temp Month Day Sunny
## 1    41     190  7.4   67     5   1     1
## 2    36     118  8.0   72     5   2     2
## 3    12     149 12.6   74     5   3     0
## 4    18     313 11.5   62     5   4     2
## 5    NA      NA 14.3   56     5   5     0
## 6    28      NA 14.9   66     5   6     1

Rename columns

The below syntax renames Sunny to Sun.

names(airquality)[names(airquality) == "Sunny"] <- "Sun"
head(airquality)

##   Ozone Solar.R Wind Temp Month Day Sun
## 1    41     190  7.4   67     5   1   1
## 2    36     118  8.0   72     5   2   2
## 3    12     149 12.6   74     5   3   0
## 4    18     313 11.5   62     5   4   2
## 5    NA      NA 14.3   56     5   5   0
## 6    28      NA 14.9   66     5   6   1

Make new variables

Let’s say we wanted the average of Temp for every month. aggregate is an extremely useful function provided by the stats package.

Temp_month <- aggregate(Temp~Month, data=airquality, mean)
head(Temp_month)

##   Month     Temp
## 1     5 65.54839
## 2     6 79.10000
## 3     7 83.90323
## 4     8 83.96774
## 5     9 76.90000

We could also do so by day of the month.

Temp_day <- aggregate(Temp~Day, data=airquality, mean)
head(Temp_day)

##   Day Temp
## 1   1 80.2
## 2   2 80.8
## 3   3 79.4
## 4   4 81.8
## 5   5 79.2
## 6   6 79.8

Generate random data

Sometimes generating random data is useful to testing out new functions.

Generate random numeric data

I want to generate data based on rainfall that data. According to Wikipedia, the most rainfall in one day was 9.78 inches. So if I want to generate semi-realistic data, it should be between 0 and 10 inches. I can do this using sample().

airquality$Rainfall <- sample(x = 1:10,
                              size = 153,
                              replace=TRUE)

mean(airquality$Rainfall)

## [1] 5.424837

The average daily rainfall is pretty high, considering that Michigan isn’t that rainy. In June 2019, Ann Arbor hada monthly rainfall of 3.7 inches Let’s set the mean daily rainfall to 0.5 inches. We can do this using rnorm(), which generates data based on a normal distribution.

airquality$Rainfall <- NULL
airquality$Rainfall <- rnorm(n = 153,
                             mean = 0.5)
range(airquality$Rainfall)

## [1] -2.488120  2.314857

Now it includes negative rainfall. Let’s try that again. We can’t set a min/max with rnorm(), but we can with rtruncnorm.

airquality$Rainfall <- NULL
airquality$Rainfall <- truncnorm::rtruncnorm(n = 153,
                             a = 0, # Minimum value
                             b = 10, # Maximum value
                             mean = 0.5,
                             sd = .05)
range(airquality$Rainfall)

## [1] 0.3594843 0.6283992

mean(airquality$Rainfall)

## [1] 0.4980849

Generate random character data

We can do almost the same thing to generate random character data.

data(airquality)
airquality$Weather <- sample(x = c("Sunny", "Cloudy", "Rainy", "Windy", "Snowy"),
                             size = 153, 
                             replace = TRUE)
head(airquality)

##   Ozone Solar.R Wind Temp Month Day Weather
## 1    41     190  7.4   67     5   1   Snowy
## 2    36     118  8.0   72     5   2   Snowy
## 3    12     149 12.6   74     5   3  Cloudy
## 4    18     313 11.5   62     5   4   Sunny
## 5    NA      NA 14.3   56     5   5   Rainy
## 6    28      NA 14.9   66     5   6  Cloudy

Remove rows

Remove rows based on condition

Please note that airquality in its original form starts out with 153 rows. Let’s say I wanted to remove every observation that occured in the month of May.

airquality <- airquality[!(airquality$Month==5),]
nrow(airquality)

## [1] 122

Remove rows with any NA values

Now I want to remove any rows that contained a missing value.

data(airquality)
airquality <- airquality[complete.cases(airquality), ]
nrow(airquality)

## [1] 111

Remove rows based on row numbers

Let’s say that row 53 contains an error and I want to remove it.

data(airquality)
airquality <- airquality[-c(53), ]
nrow(airquality)

## [1] 152

Remove columns

Remove column based on name

Let’s say I wanted to remove the Day column entirely.

airquality$Day <- NULL
head(airquality)

##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5

Remove column with any NA values

data(airquality)
airquality <- airquality[ , colSums(is.na(airquality)) == 0]
head(airquality)

##   Wind Temp Month Day
## 1  7.4   67     5   1
## 2  8.0   72     5   2
## 3 12.6   74     5   3
## 4 11.5   62     5   4
## 5 14.3   56     5   5
## 6 14.9   66     5   6

Remove column based on column number

Similiarly, I can use a very similiar syntax to remove column 6, which is Day.

data(airquality)
airquality <- airquality[,-c(6)] 
head(airquality)

##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5

Replace a value

Replace a value for a column

Let’s say I wanted for everytime the value “Rainy” appears, it should be replaced with “Stormy” in the column weather. Note how the syntax looks similiar to renaming a variable.

data(airquality)
airquality$Weather <- sample(x = c("Sunny", "Cloudy", "Rainy", "Windy", "Snowy"),
                             size = 153, 
                             replace = TRUE)
head(airquality)

##   Ozone Solar.R Wind Temp Month Day Weather
## 1    41     190  7.4   67     5   1   Windy
## 2    36     118  8.0   72     5   2   Rainy
## 3    12     149 12.6   74     5   3   Windy
## 4    18     313 11.5   62     5   4   Rainy
## 5    NA      NA 14.3   56     5   5  Cloudy
## 6    28      NA 14.9   66     5   6   Snowy

airquality$Weather[airquality$Weather == "Rainy"] <- "Stormy"
head(airquality)

##   Ozone Solar.R Wind Temp Month Day Weather
## 1    41     190  7.4   67     5   1   Windy
## 2    36     118  8.0   72     5   2  Stormy
## 3    12     149 12.6   74     5   3   Windy
## 4    18     313 11.5   62     5   4  Stormy
## 5    NA      NA 14.3   56     5   5  Cloudy
## 6    28      NA 14.9   66     5   6   Snowy

Replace a value in an entire dataset

Let’s say that everytime the number “13” appears in the data, regardless of what column or row it’s in, we want to replace it with “77”. The syntax is very similiar to above, just without the variable specifier.

data(airquality)
length(which(airquality==13)) # How many times does the number 13 appear in the entire dataset?

## [1] 10

length(which(airquality==77))

## [1] 9

airquality[airquality==13] <- 77
length(which(airquality==13))

## [1] 0

length(which(airquality==77))

## [1] 19

Replace all NA values in an entire dataset

This is one way to manage NA/missing values: replace all NAs with a specific number. In the lab, we regularly replace all missing items with the code 88. The syntax is basically the same as replacing a value with another value.

data(airquality)
length(which(airquality==88))

## [1] 3

sum(is.na(airquality))

## [1] 44

airquality[is.na(airquality)] <- 88
length(which(airquality==88))

## [1] 47

sum(is.na(airquality))

## [1] 0

This isn’t a recommended approach for all datasets though. We can tell that some cells in airquality naturally contain values of 88. This is more realistic when using self-report data that has a built-in Likert scale, so all possible values must be contained between 1 and 5.

Subset a dataset

Subsetting means to select specific rows or columns from a dataset. You can use this to make smaller datasets. For example, you may have a dataset of a survey measure, which includes 20+ items and 5 or so subscale sums. You may want to generate a table of descriptive statistics using summarytools, but you don’t really want to print the mean, median, and standard deviation for the 20+ specific items; you probably just want data for the subscales and total.

Let’s say I want to split up airquality into two datasets: one that contains Ozone, Solar.R., and Wind, and another that contains Temp, Month, and Day. Even if I had no packages installed, there are multiple ways I could do this with just base R.

Subset a dataset based on column name

data(airquality)
first_vars <- c("Ozone", "Solar.R", "Wind")
second_vars <- c("Temp", "Month", "Day")

first_data <- airquality[first_vars]
second_data <- airquality[second_vars]

head(first_data)

##   Ozone Solar.R Wind
## 1    41     190  7.4
## 2    36     118  8.0
## 3    12     149 12.6
## 4    18     313 11.5
## 5    NA      NA 14.3
## 6    28      NA 14.9

head(second_data)

##   Temp Month Day
## 1   67     5   1
## 2   72     5   2
## 3   74     5   3
## 4   62     5   4
## 5   56     5   5
## 6   66     5   6

Subset a dataset based on column number

Using a colon allows me to specify that I want columns 1 through 3. I could also specify columns 1, 2, and 3 individually, which is useful if I wanted to select nonsequential columns.

first_data <- airquality[c(1:3)]
first_data <- airquality[c(1,2,3)]
head(first_data)

##   Ozone Solar.R Wind
## 1    41     190  7.4
## 2    36     118  8.0
## 3    12     149 12.6
## 4    18     313 11.5
## 5    NA      NA 14.3
## 6    28      NA 14.9

second_data <- airquality[c(4:6)]
second_data <- airquality[c(4,5,6)]
head(second_data)

##   Temp Month Day
## 1   67     5   1
## 2   72     5   2
## 3   74     5   3
## 4   62     5   4
## 5   56     5   5
## 6   66     5   6

Subset a dataset based on columns that I DON’T want

The two above methods subset datasets based on what I want to be in my new dataset. What if I wanted a new dataset that just excluded Day? The below syntax creates a new dataset called no_day_data that includes everything except day.

day <- names(airquality) %in% c("Day")
no_day_data <- airquality[!day]
head(no_day_data)

##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5

I can do the same thing, except using index numbers instead of column names.

no_day_data_again <- airquality[c(-6)]
head(no_day_data_again)

##   Ozone Solar.R Wind Temp Month
## 1    41     190  7.4   67     5
## 2    36     118  8.0   72     5
## 3    12     149 12.6   74     5
## 4    18     313 11.5   62     5
## 5    NA      NA 14.3   56     5
## 6    28      NA 14.9   66     5

Merge two datasets

Let’s say I had another dataset called pollution. Let’s also say that both airquality and pollution had ID numbers.

# Create pollution

water_pollution <- replicate(1,sample(1:10,153,rep=TRUE))
light_pollution <- replicate(1,sample(1:10,153,rep=TRUE))
trash_output <- replicate(1,sample(1:50,153,rep=TRUE))

pollution <- data.frame(water_pollution, light_pollution, trash_output)

# Give airquality and pollution ID numbers

data(airquality)
airquality <- mutate(airquality, id = rownames(airquality))
pollution <- mutate(pollution, id = rownames(pollution))

Now I want to merge airquality and pollution by id. It’s important to have a variable that the two datasets have in common, or else merge won’t know how to merge the datasets.

airquality <- merge(airquality, pollution, by = "id")
head(airquality)

##    id Ozone Solar.R Wind Temp Month Day water_pollution light_pollution
## 1   1    41     190  7.4   67     5   1              10               7
## 2  10    NA     194  8.6   69     5  10               1               8
## 3 100    89     229 10.3   90     8   8               6               1
## 4 101   110     207  8.0   90     8   9               1               2
## 5 102    NA     222  8.6   92     8  10               9               3
## 6 103    NA     137 11.5   86     8  11               9               1
##   trash_output
## 1            6
## 2           50
## 3           15
## 4           30
## 5            5
## 6           45

Using if-then logic

What if we wanted to create a new dataset that only included rows where Wind was greater than 10? Or if I only wanted to included observations where Wind was greater than 10 AND Ozone was less than 30? Or include observations where Wind was greater than 10 OR wind was less than 30?

If-then logic can get a bit complicated quickly when working with R. Two functions that can do this are the subset function in base R, the nrow function in base R, and the ifelse function in the dplyr package.

Example 1

Let’s say I want to make a new dataset where I only wanted to include observations where Wind was greater than 10.

data(airquality)
windy_data <- subset(airquality, Wind > 10)
head(windy_data)

##   Ozone Solar.R Wind Temp Month Day
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
## 8    19      99 13.8   59     5   8
## 9     8      19 20.1   61     5   9

Example 2

Let’s say I wanted to count the number of observations where Ozone was greater than 30.

great_ozone <- nrow(airquality[airquality$Ozone > 30, ])
print(great_ozone)

## [1] 96

Example 3

Let’s say I wanted to create a new binary variable and add it to airquality. This new variable will be called “Hot”. If Temp is below 70, Hot will be 0; if Temp is 70 or greater, Hot wil be 1.

airquality$Hot <- NA
airquality$Hot <- if_else(airquality$Temp < 70, 0, 1)
head(airquality)

##   Ozone Solar.R Wind Temp Month Day Hot
## 1    41     190  7.4   67     5   1   0
## 2    36     118  8.0   72     5   2   1
## 3    12     149 12.6   74     5   3   1
## 4    18     313 11.5   62     5   4   0
## 5    NA      NA 14.3   56     5   5   0
## 6    28      NA 14.9   66     5   6   0

Cleaning up your environment

It’s important to keep your environment uncluttered while working with data. Try to get rid of unneeded things when you can. I made a bit of a mess when creating the pollution dataset and using if-then logic. You can remove one object at a time with rm().

rm(water_pollution)

Or you can remove everything except an object or a list of objects using the below syntax.

data(airquality)
rm(list=setdiff(ls(), "airquality"))

Descriptive statistics

Finally, data cleaning usually includes basic descriptive statistics to see if the data is skewed or if there is a large number of outliers.

mean(airquality$Ozone)

## [1] NA

sd(airquality$Ozone)

## [1] NA

However, instead of typing out everything, we can use the descr function from summarytools to make a neat table of our variables.

data(airquality)
summarytools::descr(airquality, style = "rmarkdown")

Descriptive Statistics

airquality

N: 153

	Day	Month	Ozone	Solar.R	Temp	Wind
Mean	15.80	6.99	42.13	185.93	77.88	9.96
Std.Dev	8.86	1.42	32.99	90.06	9.47	3.52
Min	1.00	5.00	1.00	7.00	56.00	1.70
Q1	8.00	6.00	18.00	115.00	72.00	7.40
Median	16.00	7.00	31.50	205.00	79.00	9.70
Q3	23.00	8.00	63.50	259.00	85.00	11.50
Max	31.00	9.00	168.00	334.00	97.00	20.70
MAD	11.86	1.48	25.95	98.59	8.90	3.41
IQR	15.00	2.00	45.25	143.00	13.00	4.10
CV	0.56	0.20	0.78	0.48	0.12	0.35
Skewness	0.00	0.00	1.21	-0.42	-0.37	0.34
SE.Skewness	0.20	0.20	0.22	0.20	0.20	0.20
Kurtosis	-1.22	-1.32	1.11	-1.00	-0.46	0.03
N.Valid	153.00	153.00	116.00	146.00	153.00	153.00
Pct.Valid	100.00	100.00	75.82	95.42	100.00	100.00

# Skewness and Kurtosis

qqnorm(airquality$Ozone)
qqline(airquality$Ozone)

For more information about how to make the above table and plot look prettier, go through the pretty_r_guide.rmd on Github or you can find a link to the HTML here.