Ch2 Analytics Tools

In this chapter, we introduce some of the most popular tools in visualization. We mostly focus on R and touch a little bit on others.

1 R and RStudio

R is a free software environment for statistical computing and graphics. It can be downloaded at no cost from CRAN, the Comprehensive R Archive Network. This statistical programming language helps analysts process, transform, visualize, and analyze data, and present results. One of the most important features of R is its ability to generate beautiful visualizations. RStudio is an integrated development environment (IDE) for R that provides a powerful user interface for data analysis.

To install R, please go to https://www.r-project.org/, click “CRAN” on the left, select a mirror (i.e., a download location) in the USA, and click “Download R for Windows/Mac/Linux”. For Windows, select “base” and install. For Mac, select the R-x.y.z.pkg installer that matches your OS version. To install RStudio, please go to https://rstudio.com/, click “Products”, select “RStudio”, click “RStudio Desktop”, click “DOWNLOAD RSTUDIO DESKTOP”, select the free version, click “DOWNLOAD”, and install.

After the installation is complete, please open RStudio. In the menu, click “File” - “New File” - “R Script” to create your first R script file. In the R script file, type 1+1 and hit Ctrl+Enter; 1+1 is then sent to the R console for execution. You will see the R console output as follows.

1+1
## [1] 2

Typically, the RStudio window has four regions or panes:

  • R script(s): This is where your R code is. You can open multiple script files at the same time. Script files are saved as filename.R format.
  • R console: This is where output displays (i.e., the results from running your scripts). The output and code are not automatically saved.
  • Environment, history, and workspace related information: It displays the variables, objects, and functions currently loaded in memory.
  • Files, plots, packages and help: It shows files in the working directory, displays figures, lists installed R packages, and shows help pages.

You can click “Tools” - “Global Options” - “Pane Layout” to change the pane settings.

For R console, here are some tips.

  • You can directly type a command such as 1+1 and hit Enter.
  • The > is a prompt that indicates the R console is ready for the next command.
  • Code is in blue.
  • Output is black.
  • The [1] means the first value of the command’s results begins here.
  • Error messages are in red.
  • You can use R console as a calculator.
  • If you hit Enter before your command is complete, the prompt becomes +. Once you complete the command, the prompt changes back to >. For example, type 1/(2* in the console and hit Enter, and see what happens. Then finish the command with 3) and hit Enter. Alternatively, you can hit Esc to cancel the incomplete command.
  • If a command is taking too long to run, you can use the red “STOP” button on top of the console to terminate it (or press Esc in RStudio; Ctrl+C does the same when R runs in a terminal).

For the R script pane, here are some tips.

  • To execute one or more commands, you can highlight them and hit Ctrl+Enter. If no command is highlighted, Ctrl+Enter will execute the line the cursor is currently on. Alternatively, you can click the “Run” button on top of the script pane, or execute the entire script by pressing the “Source” button.
  • Place code in a script to save it for later use, editing, and additions.
  • Open a new script by going to “File” -> “New File” -> “R Script”.
  • You can use # to comment your code. For example:
1 + 2 # this line calculates the sum of 1 and 2
## [1] 3
# this is another line of comments. the next line is also commented out.
# 2 + 3 
# Note that the "+ 5" part of the command below is not executed.
3 + 4 # + 5
## [1] 7

You can change many other settings by going to “Tools” -> “Global Options.”

1.1 Assignment Operator

Use the assignment operator <- to create variables and store data in them. Equivalently, we can use =; however, <- is usually preferred. After assignment, you can type the variable’s name and press Enter to display its value.

a <- 1
a
## [1] 1
b = 2
b
## [1] 2
a + b
## [1] 3

Note that a and b are now created and can be seen in the environment pane in RStudio. Use ls() and rm() to list and remove objects in R.

ls()
## [1] "a" "b"
rm(list=ls()) #to remove one variable, use rm(a)
ls()
## character(0)

1.2 Coding Style

To name R objects or R code files, use meaningful words or phrases, with words separated by the underscore _. For example, male_height, ave_salary_2020, and visualize_salary.R. This naming convention is referred to as snake_case. In this style, R object names should begin with a letter and contain only letters, digits, and underscores (R itself also permits periods). Some reserved words in R cannot be used as names, e.g., TRUE and FALSE. R is case sensitive, which means male_height and Male_height refer to different variables. Assigning to an existing name overwrites its previous value without warning.
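The naming rules above can be tried directly in the console. The following is a small sketch; the variable names and values are made up for illustration.

```r
# snake_case names made of letters, digits, and underscores
male_height <- 175
Male_height <- 180   # a different object, because R is case sensitive

male_height == Male_height
## [1] FALSE

# assigning to an existing name silently replaces the old value
male_height <- 178
male_height
## [1] 178
```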

In addition to naming objects, it is highly recommended to provide detailed comments for your code. A good example is as follows.

# create variables
a <- 1
b <- 2
# calculate the sum
a + b
## [1] 3

Never write a block of code with no comments at all, like the following.

command
command
command
command

Lastly, please put spaces around operators and after commas. Below are examples.

ave_height <- mean(observed_height, na.rm = TRUE) # good
ave_height<-mean(observed_height,na.rm=TRUE) # bad

1.3 R Packages

So far, we have been using the basic features in R, which include only core functions that are widely needed for analysis. Sometimes, we may need a function to perform a special task that is not available in base R. Then we will need to use R packages.
R packages are simply sets of customized functions that are designed for a special set of tasks.
One of the greatest advantages of R is its extensibility through many R packages.
Some of the most frequently used packages include ggplot2, tidyverse, MASS, and many others.

In order to use functions in an R package, we need two steps:

  • Install the package. We use install.packages(). This step needs to be run only once per computer.
  • Load the package. We use library(). This step needs to be executed every time you open R and RStudio.

Here is an example.

install.packages("ggplot2") # installation, only need to run once.
library(ggplot2) # loading, need to run every time you open RStudio.

Note that you only need to run install.packages() once.
The installation procedure will download the package files from CRAN to your computer. On the other hand, you need to run library() every time you open R and RStudio, since library() takes these downloaded functions and loads them into memory for analysis.
When you open RStudio, none of these packages are loaded. The loading part cannot be skipped.
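One common pattern, sketched below, is to install a package only when it is missing and then load it. The helper name ensure_package is hypothetical (it is not part of R); requireNamespace(), install.packages(), and library() are the base R functions doing the work. The example calls the helper on stats, which ships with R, so nothing is downloaded.

```r
# ensure_package() is a hypothetical helper, not a built-in function
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE if the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from CRAN; needs internet access
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
}

ensure_package("stats")     # stats ships with R, so no download happens
"package:stats" %in% search()
## [1] TRUE
```

With a helper like this, a script can be rerun on a new computer without editing the install line by hand.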

To get help on packages, you can use

help(package="ggplot2")
vignette(package="ggplot2")

Now we introduce a few useful packages.

1.3.1 ggplot2 Package

ggplot2 is an R package developed by Hadley Wickham. It is a system for declaratively creating graphics, based on the book The Grammar of Graphics by Leland Wilkinson (Wilkinson 2005). We will mainly use this package for generating figures. Here we present a simple example.
Details will be covered in later chapters.

library(tidyverse)
library(ggplot2)
data(mpg)
ggplot(data=mpg) + geom_point(aes(x=displ,y=hwy))

1.3.2 dplyr and tidyr Packages

dplyr and tidyr are two R packages for data wrangling. Here we present a simple example.

library(ggplot2)
library(dplyr)
data(mpg)
mpg %>% 
  mutate(ave = (hwy + cty)/2) %>%
  group_by(class) %>%
  summarize(averageMPG_by_class = mean(ave))
## # A tibble: 7 × 2
##   class      averageMPG_by_class
##   <chr>                    <dbl>
## 1 2seater                   20.1
## 2 compact                   24.2
## 3 midsize                   23.0
## 4 minivan                   19.1
## 5 pickup                    14.9
## 6 subcompact                24.3
## 7 suv                       15.8

1.3.3 readr Package

readr is an R package for data importing.

1.3.4 tidyverse Package

All the packages mentioned above belong to a super R package called tidyverse, which is a “universe” of many R functions. For more details, please go to the tidyverse website. Therefore, in practice, you only need to run library(tidyverse), and you will then be able to use all of these packages, such as ggplot2, dplyr, tidyr, and readr.

install.packages("tidyverse")
library(tidyverse)

1.4 R Markdown

R Markdown is a file type. An R Markdown file contains narrative text and chunks of R code for an analytics project. After compilation (called “knitting”, after the knitr package that performs it), the R Markdown file will generate an HTML or PDF report that contains the text with simple formatting, the R code, and the R output from running the code. Therefore, an R Markdown file is able to combine everything about the analytics project into one single file. It allows you to turn your analyses into high quality documents, reports, presentations, and dashboards that are easy to modify.

R Markdown is different from a traditional word processor such as Microsoft Word. Word is a WYSIWYG processor (i.e., what you see is what you get) in the sense that the document is in a form that resembles its appearance when printed or displayed as a finished product. You click a button to change the format of a sentence and the change is immediately seen. A WYSIWYG processor requires users to pay attention to formatting and content at the same time. In contrast, an R Markdown file allows users to focus solely on the content and lets R Markdown decide the formatting automatically. Users simply specify a few format requirements, and R Markdown decides the layout, spacing, and other details. Therefore, users can pay more attention to the content. Such a document can be easily converted to another format. Meanwhile, since the text and R code are embedded in the same R Markdown file, the analysis can be easily reproduced.
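To make the structure of an R Markdown file concrete, the sketch below writes a minimal one from R. The file name first_report.Rmd and its contents are made up for illustration. The file has a header between the --- lines, one sentence of narrative text, and one code chunk delimited by three backticks.

```r
# Write a minimal R Markdown file to a temporary location.
fence <- strrep("`", 3)   # the three backticks that delimit a code chunk

rmd_lines <- c(
  "---",
  "title: \"My first report\"",
  "output: html_document",
  "---",
  "",
  "This sentence is narrative text.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
)

rmd_file <- file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_file)

# To compile ("knit") the file, click the Knit button in RStudio, or run
# the line below (this assumes the rmarkdown package is installed):
# rmarkdown::render(rmd_file)
```

In everyday use you would create the file through “File” - “New File” - “R Markdown” in RStudio rather than from code; writing it with writeLines() here only serves to show every line of the file.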

1.5 Get Help on R

No matter how proficient you are with programming, you will almost always get various errors in your code. To fix these issues, you need to be able to get help yourself. Here are some of the simplest ways to get help.

  • Google: just add “with R” or “in R” at the end of any search.
    • The key to using Google efficiently is to form your question precisely. For example, searching “how to read csv file in a different folder in R” is much better than searching “read csv file location”.
    • One trick that I use the most frequently is that I copy the error message to Google and see if something comes up.
  • Stack Overflow is an online community oriented toward programming issues.
  • Cross Validated is an online community oriented toward data science.
  • R-bloggers is an online community oriented toward news and tutorials about R.
  • YouTube offers tutorials on R, ggplot2, R Markdown, and many others.
  • If your computer does not work well with R, RStudio, or Rmarkdown, you can use the virtual machine.
  • For a particular R function, type help(function_name) or ?function_name to open its help file, ??function_name to search all help files for the term, and example(function_name) to run the examples from its help file. In RStudio, the documentation appears in the Help pane.
help(sqrt)
?sqrt
??sqrt
example(sqrt)

2 R Basics

In this section, we will discuss some basic features in R. These features are the foundation of more advanced analysis.

Each variable stored in R is called an object. R has four basic types of objects: vectors, matrices, data frames, and lists. We will go over them briefly.

2.1 Vector

A vector is a sequence of values of the same type, such as a sequence of numbers or a sequence of strings.

a=c(5,3,6,7)
a
## [1] 5 3 6 7
b=c("Tom","Jerry","John")
b
## [1] "Tom"   "Jerry" "John"
c=c(TRUE,TRUE,FALSE)
c
## [1]  TRUE  TRUE FALSE

Here we use the c() function to combine values into a vector (or a list).

There are three major types of data: numeric, logical, and character. You can use is.numeric(), is.logical(), and is.character() to test for them.

is.numeric(0.23)
## [1] TRUE
is.numeric("Tom")
## [1] FALSE
is.logical(FALSE)
## [1] TRUE
is.character("Tom")
## [1] TRUE
is.character(TRUE)
## [1] FALSE

To determine if an R object is a vector, use is.vector() as follows.

is.vector(a)
## [1] TRUE

length() can tell you the length of a vector. class() returns the type of data stored in the vector.

length(a)
## [1] 4
class(a)
## [1] "numeric"
class(b)
## [1] "character"
class(c)
## [1] "logical"

To select a subset of elements in a vector, use [] as follows.

a[1]
## [1] 5
a[3]
## [1] 6
a[c(1,3)]
## [1] 5 6
a[-c(1,3)]
## [1] 3 7
a[a>4]
## [1] 5 6 7

You can modify a vector’s elements by the following command.

a[3]=100
a
## [1]   5   3 100   7
a[c(2,3)]=c(50,10)
a
## [1]  5 50 10  7
a[c(2,3)]=-2
a
## [1]  5 -2 -2  7

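For element-wise comparisons, you can use the following operators. The %in% operator tests whether each element on its left appears anywhere in the vector on its right.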

x=c(5,1,3)
y=c(4,1,2)
x == y
## [1] FALSE  TRUE FALSE
x < y
## [1] FALSE FALSE FALSE
x <= y
## [1] FALSE  TRUE FALSE
x != y
## [1]  TRUE FALSE  TRUE
3 %in% x
## [1] TRUE
c(3,4,5,6,7) %in% x
## [1]  TRUE FALSE  TRUE FALSE FALSE

Here is a list of comparison operators.

operator                   syntax
greater than               a > b
greater than or equal to   a >= b
less than                  a < b
less than or equal to      a <= b
equal to                   a == b
belongs to                 a %in% b
not equal to               a != b

With these operators, you can modify vectors in many different ways. For example, the following replaces every negative element of a with 3.

a
## [1]  5 -2 -2  7
a[a<0]=3
a
## [1] 5 3 3 7
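The replacement above works in two steps: the comparison produces a logical vector, and that logical vector selects which elements to change. The sketch below spells the steps out on a made-up vector called scores.

```r
scores <- c(5, -2, -2, 7)      # made-up values for illustration

is_negative <- scores < 0      # element-wise comparison gives a logical vector
is_negative
## [1] FALSE  TRUE  TRUE FALSE

sum(is_negative)               # TRUE counts as 1, so this counts the negatives
## [1] 2

which(is_negative)             # positions of the TRUE elements
## [1] 2 3

scores[is_negative] <- 0       # replace only the selected elements
scores
## [1] 5 0 0 7
```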

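Here are some other ways to create vectors (seq(), the colon operator, and rep()), followed by a few common summary functions and element-wise arithmetic on vectors.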

seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10,2)
## [1] 1 3 5 7 9
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
rep(10,5)
## [1] 10 10 10 10 10
mean(a)
## [1] 4.5
sd(a)
## [1] 1.914854
length(a)
## [1] 4
b[2]
## [1] "Jerry"
sum(c)
## [1] 2
d=c(10,20,30,40)
a+d
## [1] 15 23 33 47
a*d
## [1]  50  60  90 280

Combining different data types, or applying a function to a data type it was not designed for, results in coercion, i.e., an automatic conversion from one type to another. Logical values can be converted to numeric 0 or 1. Numeric values can be converted to strings. Numeric values can also be converted to logical values (TRUE for nonzero and FALSE for zero).

as.numeric(TRUE)
## [1] 1
as.numeric(FALSE)
## [1] 0
as.character(123.456)
## [1] "123.456"
as.logical(123.456)
## [1] TRUE
as.logical(0)
## [1] FALSE

Sometimes, coercion can be subtle. For example, sum() and mean() silently treat TRUE as 1 and FALSE as 0.

c
## [1]  TRUE  TRUE FALSE
sum(c)
## [1] 2
mean(c)
## [1] 0.6666667
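Coercion also happens when values of different types are combined with c(), because a vector can hold only one type. The sketch below uses made-up values to show which type wins in each case.

```r
# mixing numbers and logicals: logicals become 1 and 0
c(1, TRUE, FALSE)
## [1] 1 1 0

# mixing in a string: everything becomes character
mixed <- c(1, "a", TRUE)
mixed
## [1] "1"    "a"    "TRUE"
class(mixed)
## [1] "character"

# a string that does not look like a number becomes NA (with a warning)
as.numeric(c("10", "abc"))
## [1] 10 NA
```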

2.2 Matrix

A matrix is a two-dimensional array whose elements are all of the same type (most often numbers). Use matrix() to create a matrix. By default, the values fill the matrix column by column.

a=matrix(1:12,4,3)
a
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
class(a)
## [1] "matrix" "array"
nrow(a)
## [1] 4
ncol(a)
## [1] 3
dim(a)
## [1] 4 3

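To access certain values in a matrix, use [] with a row index and a column index as follows. The apply() calls at the end apply a function to every row (second argument 1) or every column (second argument 2).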

a[2,3]
## [1] 10
a[2,]
## [1]  2  6 10
a[,2]
## [1] 5 6 7 8
a[,c(2,3)]
##      [,1] [,2]
## [1,]    5    9
## [2,]    6   10
## [3,]    7   11
## [4,]    8   12
dim(a)
## [1] 4 3
apply(a,1,mean)
## [1] 5 6 7 8
apply(a,1,sum)
## [1] 15 18 21 24
apply(a,1,sd)
## [1] 4 4 4 4
apply(a,2,sd)
## [1] 1.290994 1.290994 1.290994
apply(a,2,sum)
## [1] 10 26 42
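A few variations on the commands above are sketched below with made-up matrices: filling by row with the byrow argument, the built-in shortcuts rowMeans() and colSums(), and a matrix of character values.

```r
# fill row by row instead of the default column by column
m <- matrix(1:6, nrow = 2, byrow = TRUE)
m
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

# rowMeans() and colSums() are built-in shortcuts for common apply() calls
rowMeans(m)
## [1] 2 5
colSums(m)
## [1] 5 7 9

# a matrix can hold character values too, as long as all elements share a type
chars <- matrix(c("a", "b", "c", "d"), nrow = 2)
chars[2, 1]
## [1] "b"
```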

2.3 Data Frame

A data frame is a two-dimensional table in which all values within a column share one data type, while different columns may have different types. Use the data.frame() function to create a data frame. Use str() to examine the structure of a data frame.

a=data.frame(name=c("Tom","Jerry","John","Jane"),
             age=c(10,14,13,11),
             gender=c("Male","Male","Male","Female"))
a
##    name age gender
## 1   Tom  10   Male
## 2 Jerry  14   Male
## 3  John  13   Male
## 4  Jane  11 Female
str(a)
## 'data.frame':    4 obs. of  3 variables:
##  $ name  : chr  "Tom" "Jerry" "John" "Jane"
##  $ age   : num  10 14 13 11
##  $ gender: chr  "Male" "Male" "Male" "Female"
class(a)
## [1] "data.frame"
a[2,3]
## [1] "Male"
a[2,]
##    name age gender
## 2 Jerry  14   Male
a[,2]
## [1] 10 14 13 11
a$name
## [1] "Tom"   "Jerry" "John"  "Jane"
a$age[2]
## [1] 14
a[,"age"]
## [1] 10 14 13 11
a[["age"]]
## [1] 10 14 13 11
a[,c("name","age")]
##    name age
## 1   Tom  10
## 2 Jerry  14
## 3  John  13
## 4  Jane  11
a[2,c("name","age")]
##    name age
## 2 Jerry  14
dim(a)
## [1] 4 3

attributes() returns similar information (column names, row names, and class); we will use it in later chapters. names() returns the column names.

names(a)
## [1] "name"   "age"    "gender"

You can modify a data frame’s values by the following command.

a[3,2]=100
a
##    name age gender
## 1   Tom  10   Male
## 2 Jerry  14   Male
## 3  John 100   Male
## 4  Jane  11 Female
a[,2]=55
a
##    name age gender
## 1   Tom  55   Male
## 2 Jerry  55   Male
## 3  John  55   Male
## 4  Jane  55 Female
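Two more operations come up often with data frames: keeping only the rows that satisfy a condition, and adding a column. The sketch below uses a made-up data frame called students so that it does not overwrite a.

```r
students <- data.frame(name = c("Tom", "Jerry", "John", "Jane"),
                       age = c(10, 14, 13, 11))

# keep only the rows where a condition holds
students[students$age > 11, ]
##    name age
## 2 Jerry  14
## 3  John  13

# add a new column computed from an existing one
students$age_next_year <- students$age + 1
students
##    name age age_next_year
## 1   Tom  10            11
## 2 Jerry  14            15
## 3  John  13            14
## 4  Jane  11            12
```

The row filter reuses the logical indexing idea from vectors: students$age > 11 is a logical vector, and placing it before the comma selects rows.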

2.4 List

A list is a one-dimensional collection in which each element can be of a different type, including vectors, matrices, and even other lists. Use the list() function to create a list. Use [[ ]] or $ to extract an element.

e=list(current_rank=1,
       name="Tom",
       active=TRUE, 
       metric=matrix(1:6,2,3),
       family_members=c("Mary", "John"))
e
## $current_rank
## [1] 1
## 
## $name
## [1] "Tom"
## 
## $active
## [1] TRUE
## 
## $metric
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## $family_members
## [1] "Mary" "John"
e[[3]]
## [1] TRUE
e$active
## [1] TRUE
e[[4]]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
e$metric
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
e[[4]][1,2]
## [1] 3
e[[5]]
## [1] "Mary" "John"
e[[5]][2]
## [1] "John"
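Single brackets behave differently from double brackets on a list, which is a frequent source of confusion. The sketch below uses a small made-up list called player to show the difference, along with names() and adding an element.

```r
player <- list(name = "Tom", scores = c(90, 85))   # made-up example

class(player["name"])     # single brackets return a list of length one
## [1] "list"
class(player[["name"]])   # double brackets return the element itself
## [1] "character"

names(player)
## [1] "name"   "scores"

player$team <- "Red"      # assigning to a new name adds an element
length(player)
## [1] 3
```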

2.5 Control Statements, Iterations, and Functions

An if statement runs a block of code only when its condition is TRUE:

x <- 5
if (x > 0) {
  print("Positive number")
}
## [1] "Positive number"
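When the other cases need handling too, else if and else branches can be added. The following is a minimal sketch; the variable label is made up for illustration. Note that else must sit on the same line as the closing brace before it.

```r
x <- -3
if (x > 0) {
  label <- "Positive number"
} else if (x < 0) {
  label <- "Negative number"
} else {
  label <- "Zero"
}
print(label)
## [1] "Negative number"
```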

Sometimes we need to repeat one operation many times. In that case, we use a for loop.

a=matrix(1:12,3,4)
for (i in 1:nrow(a))
{
  print(
    mean(a[i,])/(max(a[i,])-min(a[i,]))
    )
}
## [1] 0.6111111
## [1] 0.7222222
## [1] 0.8333333
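The loop above prints its results, which means they cannot be reused afterwards. A possible variation, sketched below, stores them in a vector (the name ratios is made up) and then repeats the calculation with apply() instead of an explicit loop.

```r
a <- matrix(1:12, 3, 4)

# collect the results in a vector instead of printing them
ratios <- numeric(nrow(a))
for (i in seq_len(nrow(a))) {
  ratios[i] <- mean(a[i, ]) / (max(a[i, ]) - min(a[i, ]))
}
ratios
## [1] 0.6111111 0.7222222 0.8333333

# the same calculation without an explicit loop
apply(a, 1, function(row) mean(row) / (max(row) - min(row)))
## [1] 0.6111111 0.7222222 0.8333333
```

seq_len(nrow(a)) is used in place of 1:nrow(a) because it produces an empty sequence when the matrix has no rows, so the loop body is skipped rather than run with invalid indices.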

If a set of operations is used frequently, you can define a function instead of typing these operations repeatedly.

my_func = function(x)
{
  output = mean(x)/(max(x)-min(x))
  return(output)
}
my_func(a[1,])
## [1] 0.6111111

  • Data passed into a function is called the function’s argument.
  • Arguments can be results from another function.
  • Use = to specify the names of arguments, especially when there are multiple arguments, for readability and quality assurance.
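These points about arguments can be illustrated with a short sketch. The function scaled_mean and its arguments scale and na.rm are hypothetical, written only for this example; the default values after = in the definition are used whenever the caller omits that argument.

```r
# scaled_mean() is a hypothetical function written for this example
scaled_mean <- function(x, scale = 1, na.rm = FALSE) {
  mean(x, na.rm = na.rm) * scale
}

values <- c(2, 4, NA, 6)

scaled_mean(values)                  # NA, because of the missing value
## [1] NA
scaled_mean(values, na.rm = TRUE)    # default scale = 1
## [1] 4
scaled_mean(x = values, scale = 10, na.rm = TRUE)
## [1] 40

# the result of one function can be the argument of another
round(scaled_mean(sqrt(values), na.rm = TRUE), 2)
## [1] 1.95
```

Naming every argument, as in the third call, makes it clear which value goes where even if the order is changed.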

3 R Cheatsheets for R Basic Commands

Summary of R command

R Cheatsheet at https://rstudio.com/resources/cheatsheets/

References

Wilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. Springer-Verlag New York.