The following R packages are needed for running the examples in this chapter.
library(tidyverse)
library(readxl)
Here is a flowchart of data analysis.
Data wrangling includes preprocessing and transforming the data. It is essentially to convert what you have to what you need for visualization and modeling purposes. Data wrangling is best learned through examples. Here we go over a few most frequently used features. Since R is one of the best tools for data wrangling, descriptive analytics, data visualization, and modeling, we will use R throughout the course for demonstration. We start with some basics.
When working with R, the first thing is to set up a working directory to store all the code files, data files and others. All subsequent analysis will using files from this working directory.
Here I first create a directory called Rworkingdir
in
the C:
drive in my computer and use it as my working
directory. In RStudio, in the menu, click “Session” - “Set Working
Directory” - “Choose Directory”. Navigate to the folder
Rworkingdir
and click “Open”. Then your working directory
is set. Note that R console displays
setwd("C:/Rworkingdir")
. Alternatively, you can directly
type it in R console.
setwd("C:/Rworkingdir")
Note that R use /
to separe folder names which is
different from Windows OS which uses \
. You can also
directly type in R console to set up the working directory. For example,
the following code will set the working directory to D drive, folder
“project1”,
setwd("D:/project1")
At any time, you can check the current working directory by
getwd()
. Lastly, R script are usually saved in the working
directory.
We will demonstrate how to import comma-separated values (csv) files
using read.csv()
. Other file types will follow similarly.
Suppose you have file containing data stored as
my_data.csv
. Suppose the first row of
my_data.csv
is the column/variable names. Then place this
file in the working directory and type
d <- read.csv("my_data.csv", header = TRUE, stringsAsFactors = FALSE)
Note that when you read a file into R, it is by default a data frame.
If you create a subfolder in the working directory, say
/data
, and place the data file there, then you can use
d <- read.csv("data/my_data.csv", header = TRUE, stringsAsFactors = FALSE)
For other file types, for example, Excel files
file_name.xls
and file_name.xlsx
, you can use
R packages, such as readxl
, readr
,
data.table
, GDATA
, XLConnect
,
ROBC
, and RExcel
. Here is an example
install.packages("readxl") # only need to run once
library(readxl)
excel_sheets("data/my_data.xls")
## [1] "Sheet1" "Sheet2"
d=read_excel("data/my_data.xls", sheet = "Sheet2")
To take a glance at the data, use head(d)
and
tail(d)
to print the first and last a few rows of
d
.
head(d)
## # A tibble: 4 × 3
## index var1 var2
## <dbl> <dbl> <dbl>
## 1 1 5342 4352
## 2 2 2352 2345
## 3 3 2435 1234
## 4 4 2345 1234
tail(d)
## # A tibble: 4 × 3
## index var1 var2
## <dbl> <dbl> <dbl>
## 1 1 5342 4352
## 2 2 2352 2345
## 3 3 2435 1234
## 4 4 2345 1234
Exporting Data: Use the write.csv()
function to save
.csv files to your hard drive
write.csv(d,"my_exported_data.csv")
R shows missing values as NA
. Missing values prevent R
from calculating sums, means, equality checks, etc. Use the
is.na()
function to check for missing values. Use
sum(is.na())
to calculate the total number of missing
values.
my_vec=c(1,3,NA,8,NA,4,2)
mean(my_vec)
## [1] NA
mean(my_vec, na.rm = TRUE)
## [1] 3.6
is.na(my_vec)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE
which(is.na(my_vec))
## [1] 3 5
mean(my_vec[!is.na(my_vec)])
## [1] 3.6
mean(my_vec[-which(is.na(my_vec))])
## [1] 3.6
sum(is.na(my_vec))
## [1] 2
For data frames, use colSums(is.na())
to calculate the
number of missing values per column.
df <- data.frame(V1 = c(1,3, NA,7, 4, 3.4),
V2 = c("this", NA, "is", "text", "OM", "BANA"),
V3 = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
V4 = c(2.5, NA, NA, NA, NA, 3.3))
is.na(df)
## V1 V2 V3 V4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE TRUE FALSE TRUE
## [3,] TRUE FALSE FALSE TRUE
## [4,] FALSE FALSE FALSE TRUE
## [5,] FALSE FALSE FALSE TRUE
## [6,] FALSE FALSE FALSE FALSE
sum(is.na(df))
## [1] 6
colSums(is.na(df))
## V1 V2 V3 V4
## 1 1 0 4
Use complete.cases()
to keep only observations with no
missing values. na.omit()
also removes incomplete
observations.
complete.cases(df)
## [1] TRUE FALSE FALSE FALSE FALSE TRUE
df[complete.cases(df), ]
## V1 V2 V3 V4
## 1 1.0 this TRUE 2.5
## 6 3.4 BANA FALSE 3.3
na.omit(df)
## V1 V2 V3 V4
## 1 1.0 this TRUE 2.5
## 6 3.4 BANA FALSE 3.3
Discuss with stakeholders/clients before simply removing observations! Maybe missing data exists for a reason!
dplyr
PackageOne of the most frequently used functions in R are in dplyr package. Here we go through a few key features. In this section, we will use a running example data set.
Example of mpg data set This is a data containing
the fuel efficiency information of various cars. The data is in ggplot2
package.
It is a data frame with 234 rows and 11 variables. Each row represents a
particular car. Each column represents a variable describing the cars.
Here is a list of variable explanations.
Variable | Type | Description | Details |
---|---|---|---|
manufacturer |
char | manufacturer name | 15 manufacturers |
model |
char | model name | 38 models |
displ |
numeric | engine displacement in liters, i.e., cylinder volume swept by pistons | range from 1.6 to 7.0 liter |
year |
integer | year of manufacture | two levels: 1999 and 2008 |
cyl |
integer | number of cylinders | four levels: 4, 5, 6, 8 cylinders |
trans |
char | type of transmission | two levels: automatic, manual(int) |
drv |
char | type of drive train | three levels: f,r,4. f = front-wheel drive, r = rear wheel drive, 4 = 4wd |
cty |
integer | city miles per gallon | range from 9 to 35 |
hwy |
integer | highway miles per gallon | range from 12 to 44 |
fl |
char | fuel type | five levels: p,r,e,d,c. d=diesel, p=petrol, e=electric, etc. |
class |
char | vehicle class | seven levels: compact, midsize, suv, 2seater, minivan, pickup, subcompact |
We can take a look at the data set.
library(tidyverse)
data(mpg)
#str(mpg)
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
filter()
Using filter()
, we can obtain a subset rows from the
original data frame.
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
mpg_y1999 = filter(mpg,year==1999)
head(mpg_y1999)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manua… f 21 29 p comp…
## 3 audi a4 2.8 1999 6 auto(… f 16 26 p comp…
## 4 audi a4 2.8 1999 6 manua… f 18 26 p comp…
## 5 audi a4 quattro 1.8 1999 4 manua… 4 18 26 p comp…
## 6 audi a4 quattro 1.8 1999 4 auto(… 4 16 25 p comp…
mpg_y1999_cyl4 = filter(mpg,year==1999,cyl==4)
head(mpg_y1999_cyl4)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 4 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 5 chevrolet malibu 2.4 1999 4 auto… f 19 27 r mids…
## 6 dodge caravan 2wd 2.4 1999 4 auto… f 18 24 r mini…
Similarly, you can use other criterion such as cyl>=1999, drv==“f”, and etc.
Note that mpg_y1999 = filter(mpg,year==1999)
is
equivalent to mpg_y1999 = mpg[mpg$year==1999,]
, the former
runs faster and usually involves less synatx. For example, to achieve
the effect of
mpg_y1999_cyl4 = filter(mpg,year==1999,cyl==4)
we need
mpg_y1999_cyl4 = mpg[(mpg$year==1999)&(mpg&cyl==4),]
.
select()
Using select()
, we can obtain a subset columns/variables
from the original data frame.
mpg_engine = select(mpg,c("manufacturer","model","displ","cyl"))
head(mpg_engine)
## # A tibble: 6 × 4
## manufacturer model displ cyl
## <chr> <chr> <dbl> <int>
## 1 audi a4 1.8 4
## 2 audi a4 1.8 4
## 3 audi a4 2 4
## 4 audi a4 2 4
## 5 audi a4 2.8 6
## 6 audi a4 2.8 6
mutate()
Using mutate()
, we can create new variables/columns from
the original data frame.
mpg_ave = mutate(mpg,ave_mpg=(cty+hwy)/2,displ_per_cyl=displ/cyl)
head(mpg_ave)
## # A tibble: 6 × 13
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
## # ℹ 2 more variables: ave_mpg <dbl>, displ_per_cyl <dbl>
Note that these features can be achieved by
mpg$ave_mpg=(mpg$cty+mpg$hwy)/2
mpg$displ_per_cyl=mpg$displ/mpg$cyl)
group_by()
and summarize()
Sometimes, we would like to first divide the original data frame into
groups or strata by one or two variables. Within each strata, we would
like to calculate some summary statistics. This can be achieved by
combining group_by()
and summarize()
.
unique(mpg$drv)
## [1] "f" "4" "r"
mpg_by_drv = group_by(mpg, drv)
ave_mpg_by_drv = summarize(mpg_by_drv,
ave_hwy=mean(hwy,na.rm = TRUE),
ave_cty=mean(cty,na.rm = TRUE))
head(ave_mpg_by_drv)
## # A tibble: 3 × 3
## drv ave_hwy ave_cty
## <chr> <dbl> <dbl>
## 1 4 19.2 14.3
## 2 f 28.2 20.0
## 3 r 21 14.1
%>%
The functions we introduce above have couple things in common, their first arguments are all data frame, they also return a new data frame. Sometimes, a data frame will go through a lot of transformations shown above. The pipeline operator will make this process easy.
mpg %>%
mutate(overall_mpg=(cty+hwy)/2) %>%
group_by(drv) %>%
summarize(ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))
## # A tibble: 3 × 2
## drv ave_overall_mpg
## <chr> <dbl>
## 1 4 16.8
## 2 f 24.1
## 3 r 17.5
This is equivalent to
result1 = mutate(mpg, overall_mpg=(cty+hwy)/2)
result2 = group_by(result1, drv)
result3 = summarize(result2, ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))
result3
## # A tibble: 3 × 2
## drv ave_overall_mpg
## <chr> <dbl>
## 1 4 16.8
## 2 f 24.1
## 3 r 17.5
summarize( group_by( mutate(mpg, overall_mpg=(cty+hwy)/2),
drv),
ave_overall_mpg=mean(overall_mpg,na.rm = TRUE))
## # A tibble: 3 × 2
## drv ave_overall_mpg
## <chr> <dbl>
## 1 4 16.8
## 2 f 24.1
## 3 r 17.5
*_join()
We often obtain data from different sources. Therefore, we need to merge datasets before making visualization or any analysis.
df1=data.frame(id=c(1001,1002,1003),name=c("Tom","Jerry","Lucy"))
df2=data.frame(id=c(1001,1003,1004),weight=c(150,110,200))
df1
## id name
## 1 1001 Tom
## 2 1002 Jerry
## 3 1003 Lucy
df2
## id weight
## 1 1001 150
## 2 1003 110
## 3 1004 200
inner_join(df1,df2,by="id")
## id name weight
## 1 1001 Tom 150
## 2 1003 Lucy 110
left_join(df1,df2,by="id")
## id name weight
## 1 1001 Tom 150
## 2 1002 Jerry NA
## 3 1003 Lucy 110
right_join(df1,df2,by="id")
## id name weight
## 1 1001 Tom 150
## 2 1003 Lucy 110
## 3 1004 <NA> 200
full_join(df1,df2,by="id")
## id name weight
## 1 1001 Tom 150
## 2 1002 Jerry NA
## 3 1003 Lucy 110
## 4 1004 <NA> 200
dplyr
PackageSummary of R command
R Cheatsheet at https://rstudio.com/resources/cheatsheets/
Link: https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf