Ch3 Data Wrangling

The following R packages are needed for running the examples in this chapter.

library(tidyverse)
library(readxl)

1 What is Data Wrangling?

Here is a flowchart of data analysis.

Data wrangling includes preprocessing and transforming the data. It is essentially to convert what you have to what you need for visualization and modeling purposes. Data wrangling is best learned through examples. Here we go over a few most frequently used features. Since R is one of the best tools for data wrangling, descriptive analytics, data visualization, and modeling, we will use R throughout the course for demonstration. We start with some basics.

2 Set Up Working Directory

When working with R, the first thing is to set up a working directory to store all the code files, data files and others. All subsequent analysis will using files from this working directory.

Here I first create a directory called Rworkingdir in the C: drive in my computer and use it as my working directory. In RStudio, in the menu, click “Session” - “Set Working Directory” - “Choose Directory”. Navigate to the folder Rworkingdir and click “Open”. Then your working directory is set. Note that R console displays setwd("C:/Rworkingdir"). Alternatively, you can directly type it in R console.

setwd("C:/Rworkingdir")

Note that R use / to separe folder names which is different from Windows OS which uses \. You can also directly type in R console to set up the working directory. For example, the following code will set the working directory to D drive, folder “project1”,

setwd("D:/project1")

At any time, you can check the current working directory by getwd(). Lastly, R script are usually saved in the working directory.

3 Import and Export Data Files

We will demonstrate how to import comma-separated values (csv) files using read.csv(). Other file types will follow similarly. Suppose you have file containing data stored as my_data.csv. Suppose the first row of my_data.csv is the column/variable names. Then place this file in the working directory and type

d <- read.csv("my_data.csv", header = TRUE, stringsAsFactors = FALSE)

Note that when you read a file into R, it is by default a data frame.

If you create a subfolder in the working directory, say /data, and place the data file there, then you can use

d <- read.csv("data/my_data.csv", header = TRUE, stringsAsFactors = FALSE)

For other file types, for example, Excel files file_name.xls and file_name.xlsx, you can use R packages, such as readxl, readr, data.table, GDATA, XLConnect, ROBC, and RExcel. Here is an example

install.packages("readxl") # only need to run once

library(readxl)
excel_sheets("data/my_data.xls")

## [1] "Sheet1" "Sheet2"

d=read_excel("data/my_data.xls", sheet = "Sheet2")

To take a glance at the data, use head(d) and tail(d) to print the first and last a few rows of d.

head(d)

## # A tibble: 4 × 3
##   index  var1  var2
##   <dbl> <dbl> <dbl>
## 1     1  5342  4352
## 2     2  2352  2345
## 3     3  2435  1234
## 4     4  2345  1234

tail(d)

## # A tibble: 4 × 3
##   index  var1  var2
##   <dbl> <dbl> <dbl>
## 1     1  5342  4352
## 2     2  2352  2345
## 3     3  2435  1234
## 4     4  2345  1234

Exporting Data: Use the write.csv() function to save .csv files to your hard drive

write.csv(d,"my_exported_data.csv")

4 Missing Values

R shows missing values as NA. Missing values prevent R from calculating sums, means, equality checks, etc. Use the is.na() function to check for missing values. Use sum(is.na()) to calculate the total number of missing values.

my_vec=c(1,3,NA,8,NA,4,2)
mean(my_vec)

## [1] NA

mean(my_vec, na.rm = TRUE)

## [1] 3.6

is.na(my_vec)

## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

which(is.na(my_vec))

## [1] 3 5

mean(my_vec[!is.na(my_vec)])

## [1] 3.6

mean(my_vec[-which(is.na(my_vec))])

## [1] 3.6

sum(is.na(my_vec))

## [1] 2

For data frames, use colSums(is.na()) to calculate the number of missing values per column.

df <- data.frame(V1 = c(1,3, NA,7, 4, 3.4),
                 V2 = c("this", NA, "is", "text", "OM", "BANA"), 
                 V3 = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE), 
                 V4 = c(2.5, NA, NA, NA, NA, 3.3))
is.na(df)

##         V1    V2    V3    V4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE  TRUE
## [3,]  TRUE FALSE FALSE  TRUE
## [4,] FALSE FALSE FALSE  TRUE
## [5,] FALSE FALSE FALSE  TRUE
## [6,] FALSE FALSE FALSE FALSE

sum(is.na(df))

## [1] 6

colSums(is.na(df))

## V1 V2 V3 V4 
##  1  1  0  4

Use complete.cases() to keep only observations with no missing values. na.omit() also removes incomplete observations.

complete.cases(df)

## [1]  TRUE FALSE FALSE FALSE FALSE  TRUE

df[complete.cases(df), ]

##    V1   V2    V3  V4
## 1 1.0 this  TRUE 2.5
## 6 3.4 BANA FALSE 3.3

na.omit(df)

##    V1   V2    V3  V4
## 1 1.0 this  TRUE 2.5
## 6 3.4 BANA FALSE 3.3

Discuss with stakeholders/clients before simply removing observations! Maybe missing data exists for a reason!

5 Data Wrangling via `dplyr` Package

One of the most frequently used functions in R are in dplyr package. Here we go through a few key features. In this section, we will use a running example data set.

Example of mpg data set This is a data containing the fuel efficiency information of various cars. The data is in ggplot2 package.
It is a data frame with 234 rows and 11 variables. Each row represents a particular car. Each column represents a variable describing the cars. Here is a list of variable explanations.

Variable	Type	Description	Details
`manufacturer`	char	manufacturer name	15 manufacturers
`model`	char	model name	38 models
`displ`	numeric	engine displacement in liters, i.e., cylinder volume swept by pistons	range from 1.6 to 7.0 liter
`year`	integer	year of manufacture	two levels: 1999 and 2008
`cyl`	integer	number of cylinders	four levels: 4, 5, 6, 8 cylinders
`trans`	char	type of transmission	two levels: automatic, manual(int)
`drv`	char	type of drive train	three levels: f,r,4. f = front-wheel drive, r = rear wheel drive, 4 = 4wd
`cty`	integer	city miles per gallon	range from 9 to 35
`hwy`	integer	highway miles per gallon	range from 12 to 44
`fl`	char	fuel type	five levels: p,r,e,d,c. d=diesel, p=petrol, e=electric, etc.
`class`	char	vehicle class	seven levels: compact, midsize, suv, 2seater, minivan, pickup, subcompact

We can take a look at the data set.

library(tidyverse)
data(mpg)
#str(mpg)
head(mpg)

## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

5.1 Filter Observations via `filter()`

Using filter(), we can obtain a subset rows from the original data frame.

head(mpg)

## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

mpg_y1999 = filter(mpg,year==1999)
head(mpg_y1999)

## # A tibble: 6 × 11
##   manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
##   <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
## 1 audi         a4           1.8  1999     4 auto(… f        18    29 p     comp…
## 2 audi         a4           1.8  1999     4 manua… f        21    29 p     comp…
## 3 audi         a4           2.8  1999     6 auto(… f        16    26 p     comp…
## 4 audi         a4           2.8  1999     6 manua… f        18    26 p     comp…
## 5 audi         a4 quattro   1.8  1999     4 manua… 4        18    26 p     comp…
## 6 audi         a4 quattro   1.8  1999     4 auto(… 4        16    25 p     comp…

mpg_y1999_cyl4 = filter(mpg,year==1999,cyl==4)
head(mpg_y1999_cyl4)

## # A tibble: 6 × 11
##   manufacturer model       displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr>       <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi         a4            1.8  1999     4 auto… f        18    29 p     comp…
## 2 audi         a4            1.8  1999     4 manu… f        21    29 p     comp…
## 3 audi         a4 quattro    1.8  1999     4 manu… 4        18    26 p     comp…
## 4 audi         a4 quattro    1.8  1999     4 auto… 4        16    25 p     comp…
## 5 chevrolet    malibu        2.4  1999     4 auto… f        19    27 r     mids…
## 6 dodge        caravan 2wd   2.4  1999     4 auto… f        18    24 r     mini…

Similarly, you can use other criterion such as cyl>=1999, drv==“f”, and etc.

Note that mpg_y1999 = filter(mpg,year==1999) is equivalent to mpg_y1999 = mpg[mpg$year==1999,], the former runs faster and usually involves less synatx. For example, to achieve the effect of mpg_y1999_cyl4 = filter(mpg,year==1999,cyl==4) we need mpg_y1999_cyl4 = mpg[(mpg$year==1999)&(mpg&cyl==4),].

5.2 Select Variables via `select()`

Using select(), we can obtain a subset columns/variables from the original data frame.

mpg_engine = select(mpg,c("manufacturer","model","displ","cyl"))
head(mpg_engine)

## # A tibble: 6 × 4
##   manufacturer model displ   cyl
##   <chr>        <chr> <dbl> <int>
## 1 audi         a4      1.8     4
## 2 audi         a4      1.8     4
## 3 audi         a4      2       4
## 4 audi         a4      2       4
## 5 audi         a4      2.8     6
## 6 audi         a4      2.8     6

5.3 Create Variables via `mutate()`

Using mutate(), we can create new variables/columns from the original data frame.

mpg_ave = mutate(mpg,ave_mpg=(cty+hwy)/2,displ_per_cyl=displ/cyl)
head(mpg_ave)

## # A tibble: 6 × 13
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
## # ℹ 2 more variables: ave_mpg <dbl>, displ_per_cyl <dbl>

Note that these features can be achieved by

mpg$ave_mpg=(mpg$cty+mpg$hwy)/2
mpg$displ_per_cyl=mpg$displ/mpg$cyl)

5.4 Summary Statistics via `group_by()` and `summarize()`

Sometimes, we would like to first divide the original data frame into groups or strata by one or two variables. Within each strata, we would like to calculate some summary statistics. This can be achieved by combining group_by() and summarize().

unique(mpg$drv)

## [1] "f" "4" "r"

mpg_by_drv = group_by(mpg, drv)
ave_mpg_by_drv = summarize(mpg_by_drv,
                           ave_hwy=mean(hwy,na.rm = TRUE),
                           ave_cty=mean(cty,na.rm = TRUE))
head(ave_mpg_by_drv)

## # A tibble: 3 × 3
##   drv   ave_hwy ave_cty
##   <chr>   <dbl>   <dbl>
## 1 4        19.2    14.3
## 2 f        28.2    20.0
## 3 r        21      14.1

5.5 Pipeline Operator `%>%`

The functions we introduce above have couple things in common, their first arguments are all data frame, they also return a new data frame. Sometimes, a data frame will go through a lot of transformations shown above. The pipeline operator will make this process easy.

mpg %>% 
  mutate(overall_mpg=(cty+hwy)/2) %>% 
  group_by(drv) %>% 
  summarize(ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))

## # A tibble: 3 × 2
##   drv   ave_overall_mpg
##   <chr>           <dbl>
## 1 4                16.8
## 2 f                24.1
## 3 r                17.5

This is equivalent to

result1 = mutate(mpg, overall_mpg=(cty+hwy)/2)
result2 = group_by(result1, drv)
result3 = summarize(result2, ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))
result3

## # A tibble: 3 × 2
##   drv   ave_overall_mpg
##   <chr>           <dbl>
## 1 4                16.8
## 2 f                24.1
## 3 r                17.5

summarize( group_by( mutate(mpg, overall_mpg=(cty+hwy)/2), 
                     drv),
           ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))

## # A tibble: 3 × 2
##   drv   ave_overall_mpg
##   <chr>           <dbl>
## 1 4                16.8
## 2 f                24.1
## 3 r                17.5

5.6 Merge Data Sets via `*_join()`

We often obtain data from different sources. Therefore, we need to merge datasets before making visualization or any analysis.

df1=data.frame(id=c(1001,1002,1003),name=c("Tom","Jerry","Lucy"))
df2=data.frame(id=c(1001,1003,1004),weight=c(150,110,200))
df1

##     id  name
## 1 1001   Tom
## 2 1002 Jerry
## 3 1003  Lucy

df2

##     id weight
## 1 1001    150
## 2 1003    110
## 3 1004    200

inner_join(df1,df2,by="id")

##     id name weight
## 1 1001  Tom    150
## 2 1003 Lucy    110

left_join(df1,df2,by="id")

##     id  name weight
## 1 1001   Tom    150
## 2 1002 Jerry     NA
## 3 1003  Lucy    110

right_join(df1,df2,by="id")

##     id name weight
## 1 1001  Tom    150
## 2 1003 Lucy    110
## 3 1004 <NA>    200

full_join(df1,df2,by="id")

##     id  name weight
## 1 1001   Tom    150
## 2 1002 Jerry     NA
## 3 1003  Lucy    110
## 4 1004  <NA>    200

6 R Cheatsheets for `dplyr` Package

Summary of R command

R Cheatsheet at https://rstudio.com/resources/cheatsheets/

7 Reporting via R Markdown

Link: https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf

Ch3 Data Wrangling

Descriptive Analytics and Data Visualization

Yichen Qin (qinyn@ucmail.uc.edu), University of Cincinnati

2024-12-20

Ch3 Data Wrangling

1 What is Data Wrangling?

2 Set Up Working Directory

3 Import and Export Data Files

4 Missing Values

5 Data Wrangling via `dplyr` Package

5.1 Filter Observations via `filter()`

5.2 Select Variables via `select()`

5.3 Create Variables via `mutate()`

5.4 Summary Statistics via `group_by()` and `summarize()`

5.5 Pipeline Operator `%>%`

5.6 Merge Data Sets via `*_join()`

6 R Cheatsheets for `dplyr` Package

7 Reporting via R Markdown

Ch3 Data Wrangling

Descriptive Analytics and Data Visualization

Yichen Qin (qinyn@ucmail.uc.edu), University of Cincinnati

2024-12-20

Ch3 Data Wrangling

1 What is Data Wrangling?

2 Set Up Working Directory

3 Import and Export Data Files

4 Missing Values

5 Data Wrangling via dplyr Package

5.1 Filter Observations via filter()

5.2 Select Variables via select()

5.3 Create Variables via mutate()

5.4 Summary Statistics via group_by() and summarize()

5.5 Pipeline Operator %>%

5.6 Merge Data Sets via *_join()

6 R Cheatsheets for dplyr Package

7 Reporting via R Markdown

5 Data Wrangling via `dplyr` Package

5.1 Filter Observations via `filter()`

5.2 Select Variables via `select()`

5.3 Create Variables via `mutate()`

5.4 Summary Statistics via `group_by()` and `summarize()`

5.5 Pipeline Operator `%>%`

5.6 Merge Data Sets via `*_join()`

6 R Cheatsheets for `dplyr` Package