Ch3 Data Wrangling

The following R packages are needed for running the examples in this chapter.

library(tidyverse)
library(readxl)

1 What is Data Wrangling?

Here is a flowchart of data analysis.

Data wrangling includes preprocessing and transforming the data. It is essentially to convert what you have to what you need for visualization and modeling purposes. Data wrangling is best learned through examples. Here we go over a few most frequently used features. Since R is one of the best tools for data wrangling, descriptive analytics, data visualization, and modeling, we will use R throughout the course for demonstration. We start with some basics.

2 Set Up Working Directory

When working with R, the first thing is to set up a working directory to store all the code files, data files and others. All subsequent analysis will using files from this working directory.

Here I first create a directory called Rworkingdir in the C: drive in my computer and use it as my working directory. In RStudio, in the menu, click “Session” - “Set Working Directory” - “Choose Directory”. Navigate to the folder Rworkingdir and click “Open”. Then your working directory is set. Note that R console displays setwd("C:/Rworkingdir"). Alternatively, you can directly type it in R console.

setwd("C:/Rworkingdir")

Note that R use / to separe folder names which is different from Windows OS which uses \. You can also directly type in R console to set up the working directory. For example, the following code will set the working directory to D drive, folder “project1”,

setwd("D:/project1")

At any time, you can check the current working directory by getwd(). Lastly, R script are usually saved in the working directory.

3 Import and Export Data Files

We will demonstrate how to import comma-separated values (csv) files using read.csv(). Other file types will follow similarly. Suppose you have file containing data stored as my_data.csv. Suppose the first row of my_data.csv is the column/variable names. Then place this file in the working directory and type

d <- read.csv("my_data.csv", header = TRUE, stringsAsFactors = FALSE)

Note that when you read a file into R, it is by default a data frame.

If you create a subfolder in the working directory, say /data, and place the data file there, then you can use

d <- read.csv("data/my_data.csv", header = TRUE, stringsAsFactors = FALSE)

For other file types, for example, Excel files file_name.xls and file_name.xlsx, you can use R packages, such as readxl, readr, data.table, GDATA, XLConnect, ROBC, and RExcel. Here is an example

install.packages("readxl") # only need to run once
library(readxl)
excel_sheets("data/my_data.xls")
## [1] "Sheet1" "Sheet2"
d=read_excel("data/my_data.xls", sheet = "Sheet2")

To take a glance at the data, use head(d) and tail(d) to print the first and last a few rows of d.

head(d)
## # A tibble: 4 × 3
##   index  var1  var2
##   <dbl> <dbl> <dbl>
## 1     1  5342  4352
## 2     2  2352  2345
## 3     3  2435  1234
## 4     4  2345  1234
tail(d)
## # A tibble: 4 × 3
##   index  var1  var2
##   <dbl> <dbl> <dbl>
## 1     1  5342  4352
## 2     2  2352  2345
## 3     3  2435  1234
## 4     4  2345  1234

Exporting Data: Use the write.csv() function to save .csv files to your hard drive

write.csv(d,"my_exported_data.csv")

4 Missing Values

R shows missing values as NA. Missing values prevent R from calculating sums, means, equality checks, etc. Use the is.na() function to check for missing values. Use sum(is.na()) to calculate the total number of missing values.

my_vec=c(1,3,NA,8,NA,4,2)
mean(my_vec)
## [1] NA
mean(my_vec, na.rm = TRUE)
## [1] 3.6
is.na(my_vec)
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
which(is.na(my_vec))
## [1] 3 5
mean(my_vec[!is.na(my_vec)])
## [1] 3.6
mean(my_vec[-which(is.na(my_vec))])
## [1] 3.6
sum(is.na(my_vec))
## [1] 2

For data frames, use colSums(is.na()) to calculate the number of missing values per column.

df <- data.frame(V1 = c(1,3, NA,7, 4, 3.4),
                 V2 = c("this", NA, "is", "text", "OM", "BANA"), 
                 V3 = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE), 
                 V4 = c(2.5, NA, NA, NA, NA, 3.3))
is.na(df)
##         V1    V2    V3    V4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE  TRUE
## [3,]  TRUE FALSE FALSE  TRUE
## [4,] FALSE FALSE FALSE  TRUE
## [5,] FALSE FALSE FALSE  TRUE
## [6,] FALSE FALSE FALSE FALSE
sum(is.na(df))
## [1] 6
colSums(is.na(df))
## V1 V2 V3 V4 
##  1  1  0  4

Use complete.cases() to keep only observations with no missing values. na.omit() also removes incomplete observations.

complete.cases(df)
## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE
df[complete.cases(df), ] 
##    V1   V2    V3  V4
## 1 1.0 this  TRUE 2.5
## 6 3.4 BANA FALSE 3.3
na.omit(df)
##    V1   V2    V3  V4
## 1 1.0 this  TRUE 2.5
## 6 3.4 BANA FALSE 3.3

Discuss with stakeholders/clients before simply removing observations! Maybe missing data exists for a reason!

5 Data Wrangling via dplyr Package

One of the most frequently used functions in R are in dplyr package. Here we go through a few key features. In this section, we will use a running example data set.

Example of mpg data set This is a data containing the fuel efficiency information of various cars. The data is in ggplot2 package.
It is a data frame with 234 rows and 11 variables. Each row represents a particular car. Each column represents a variable describing the cars. Here is a list of variable explanations.

Variable Type Description Details
manufacturer char manufacturer name 15 manufacturers
model char model name 38 models
displ numeric engine displacement in liters, i.e., cylinder volume swept by pistons range from 1.6 to 7.0 liter
year integer year of manufacture two levels: 1999 and 2008
cyl integer number of cylinders four levels: 4, 5, 6, 8 cylinders
trans char type of transmission two levels: automatic, manual(int)
drv char type of drive train three levels: f,r,4. f = front-wheel drive, r = rear wheel drive, 4 = 4wd
cty integer city miles per gallon range from 9 to 35
hwy integer highway miles per gallon range from 12 to 44
fl char fuel type five levels: p,r,e,d,c. d=diesel, p=petrol, e=electric, etc.
class char vehicle class seven levels: compact, midsize, suv, 2seater, minivan, pickup, subcompact

We can take a look at the data set.

library(tidyverse)
data(mpg)
#str(mpg)
head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

5.1 Filter Observations via filter()

Using filter(), we can obtain a subset rows from the original data frame.

head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
mpg_y1999 = filter(mpg,year==1999)
head(mpg_y1999)
## # A tibble: 6 × 11
##   manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 audi         a4           1.8  1999     4 auto(… f        18    29 p     comp…
## 2 audi         a4           1.8  1999     4 manua… f        21    29 p     comp…
## 3 audi         a4           2.8  1999     6 auto(… f        16    26 p     comp…
## 4 audi         a4           2.8  1999     6 manua… f        18    26 p     comp…
## 5 audi         a4 quattro   1.8  1999     4 manua… 4        18    26 p     comp…
## 6 audi         a4 quattro   1.8  1999     4 auto(… 4        16    25 p     comp…
mpg_y1999_cyl4 = filter(mpg,year==1999,cyl==4)
head(mpg_y1999_cyl4)
## # A tibble: 6 × 11
##   manufacturer model       displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr>       <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi         a4            1.8  1999     4 auto… f        18    29 p     comp…
## 2 audi         a4            1.8  1999     4 manu… f        21    29 p     comp…
## 3 audi         a4 quattro    1.8  1999     4 manu… 4        18    26 p     comp…
## 4 audi         a4 quattro    1.8  1999     4 auto… 4        16    25 p     comp…
## 5 chevrolet    malibu        2.4  1999     4 auto… f        19    27 r     mids…
## 6 dodge        caravan 2wd   2.4  1999     4 auto… f        18    24 r     mini…

Similarly, you can use other criterion such as cyl>=1999, drv==“f”, and etc.

Note that mpg_y1999 = filter(mpg,year==1999) is equivalent to mpg_y1999 = mpg[mpg$year==1999,], the former runs faster and usually involves less synatx. For example, to achieve the effect of mpg_y1999_cyl4 = filter(mpg,year==1999,cyl==4) we need mpg_y1999_cyl4 = mpg[(mpg$year==1999)&(mpg&cyl==4),].

5.2 Select Variables via select()

Using select(), we can obtain a subset columns/variables from the original data frame.

mpg_engine = select(mpg,c("manufacturer","model","displ","cyl"))
head(mpg_engine)
## # A tibble: 6 × 4
##   manufacturer model displ   cyl
##   <chr>        <chr> <dbl> <int>
## 1 audi         a4      1.8     4
## 2 audi         a4      1.8     4
## 3 audi         a4      2       4
## 4 audi         a4      2       4
## 5 audi         a4      2.8     6
## 6 audi         a4      2.8     6

5.3 Create Variables via mutate()

Using mutate(), we can create new variables/columns from the original data frame.

mpg_ave = mutate(mpg,ave_mpg=(cty+hwy)/2,displ_per_cyl=displ/cyl)
head(mpg_ave)
## # A tibble: 6 × 13
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## # ℹ 2 more variables: ave_mpg <dbl>, displ_per_cyl <dbl>

Note that these features can be achieved by

mpg$ave_mpg=(mpg$cty+mpg$hwy)/2
mpg$displ_per_cyl=mpg$displ/mpg$cyl)

5.4 Summary Statistics via group_by() and summarize()

Sometimes, we would like to first divide the original data frame into groups or strata by one or two variables. Within each strata, we would like to calculate some summary statistics. This can be achieved by combining group_by() and summarize().

unique(mpg$drv)
## [1] "f" "4" "r"
mpg_by_drv = group_by(mpg, drv)
ave_mpg_by_drv = summarize(mpg_by_drv,
                           ave_hwy=mean(hwy,na.rm = TRUE),
                           ave_cty=mean(cty,na.rm = TRUE))
head(ave_mpg_by_drv)
## # A tibble: 3 × 3
##   drv   ave_hwy ave_cty
##   <chr>   <dbl>   <dbl>
## 1 4        19.2    14.3
## 2 f        28.2    20.0
## 3 r        21      14.1

5.5 Pipeline Operator %>%

The functions we introduce above have couple things in common, their first arguments are all data frame, they also return a new data frame. Sometimes, a data frame will go through a lot of transformations shown above. The pipeline operator will make this process easy.

mpg %>% 
  mutate(overall_mpg=(cty+hwy)/2) %>% 
  group_by(drv) %>% 
  summarize(ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))
## # A tibble: 3 × 2
##   drv   ave_overall_mpg
##   <chr>           <dbl>
## 1 4                16.8
## 2 f                24.1
## 3 r                17.5

This is equivalent to

result1 = mutate(mpg, overall_mpg=(cty+hwy)/2)
result2 = group_by(result1, drv)
result3 = summarize(result2, ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))
result3
## # A tibble: 3 × 2
##   drv   ave_overall_mpg
##   <chr>           <dbl>
## 1 4                16.8
## 2 f                24.1
## 3 r                17.5
summarize( group_by( mutate(mpg, overall_mpg=(cty+hwy)/2), 
                     drv),
           ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))
## # A tibble: 3 × 2
##   drv   ave_overall_mpg
##   <chr>           <dbl>
## 1 4                16.8
## 2 f                24.1
## 3 r                17.5

5.6 Merge Data Sets via *_join()

We often obtain data from different sources. Therefore, we need to merge datasets before making visualization or any analysis.

df1=data.frame(id=c(1001,1002,1003),name=c("Tom","Jerry","Lucy"))
df2=data.frame(id=c(1001,1003,1004),weight=c(150,110,200))
df1
##     id  name
## 1 1001   Tom
## 2 1002 Jerry
## 3 1003  Lucy
df2
##     id weight
## 1 1001    150
## 2 1003    110
## 3 1004    200
inner_join(df1,df2,by="id")
##     id name weight
## 1 1001  Tom    150
## 2 1003 Lucy    110
left_join(df1,df2,by="id")
##     id  name weight
## 1 1001   Tom    150
## 2 1002 Jerry     NA
## 3 1003  Lucy    110
right_join(df1,df2,by="id")
##     id name weight
## 1 1001  Tom    150
## 2 1003 Lucy    110
## 3 1004 <NA>    200
full_join(df1,df2,by="id")
##     id  name weight
## 1 1001   Tom    150
## 2 1002 Jerry     NA
## 3 1003  Lucy    110
## 4 1004  <NA>    200

6 R Cheatsheets for dplyr Package

Summary of R command

R Cheatsheet at https://rstudio.com/resources/cheatsheets/