In this chapter, we introduce some of the most popular tools in visualization. We mostly focus on R and touch a little bit on others.
R is a free software environment for statistical computing and graphics. It can be downloaded at no cost from CRAN, the Comprehensive R Archive Network. This statistical programming language helps analysts process, transform, visualize, and analyze data, and present results. One of the most important features of R is its ability to generate beautiful visualizations. RStudio is an integrated development environment (IDE) for R that provides a powerful user interface for data analysis.
To install R, go to https://www.r-project.org/, click “CRAN” on the left, select a mirror (i.e., a download location) in the USA, and click “Download R for Windows/Mac/Linux”. For Windows, select “base” and install. For Mac, select the R-3.X.X.pkg file that matches your OS version. To install RStudio, go to https://rstudio.com/, click “Products”, select “RStudio”, click “RStudio Desktop”, click “DOWNLOAD RSTUDIO DESKTOP”, select the free version, click “DOWNLOAD”, and install.
After the installation is complete, open RStudio. In the menu, click “File” - “New File” - “R Script” to create your first R script file. In the R script file, type 1+1 and hit Ctrl+Enter; 1+1 is then sent to the R console for execution. You will see the R console output as follows.
1+1
## [1] 2
Typically, the RStudio window has four regions or panes. R script files are saved in the filename.R format. You can click “Tools” - “Global Options” - “Pane Layout” to change the pane settings.
For the R console, here are some tips.
- Type a command such as 1+1 and hit Enter to run it.
- The > symbol is a prompt that indicates the R console is ready for the next command.
- [1] means the first value of the command’s results begins here.
- If a command is incomplete, the prompt changes to +. Once you complete the command, the prompt changes back to >. For example, type 1/(2* in the console and hit Enter, and see what happens. Then finish the command with 3) and hit Enter. Alternatively, you can hit Esc to cancel the incomplete command.
For the R script pane, here are some tips.
- Use # to comment your code. For example:
1 + 2 # this line calculates the sum of 1 and 2
## [1] 3
# this is another line of comments. the next line is also commented out.
# 2 + 3
# Note that the "+ 5" in the command below part is not executed.
3 + 4 # + 5
## [1] 7
You can change many other settings by going to “Tools” -> “Global Options.”
Use the assignment operator <- to create variables and store data in them. Equivalently, we can use =; however, <- is usually preferred. After an assignment, you can type the variable’s name and press Enter to display its value.
a <- 1
a
## [1] 1
b = 2
b
## [1] 2
a + b
## [1] 3
Note that a and b are now created and can be seen in the environment pane in RStudio. Use ls() and rm() to list and remove objects in R.
ls()
## [1] "a" "b"
rm(list=ls()) #to remove one variable, use rm(a)
ls()
## character(0)
To name R objects or R code files, use meaningful words or phrases, with words separated by the underscore _. For example, male_height, ave_salary_2020, and visualize_salary.R. This naming convention is referred to as snake_case.
Meanwhile, R object names should begin with a letter and can contain only letters, digits, periods, and underscores. Some reserved words in R cannot be used as names, e.g., TRUE and FALSE. R is case sensitive, which means male_height and Male_height refer to different variables. Assigning a new value to an existing name overwrites the old value without warning.
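The naming rules above can be checked directly in the console. The following is a minimal sketch; the variable names and values are made up for illustration.

```r
# Hypothetical values, chosen only to illustrate the naming rules
male_height <- 175
Male_height <- 180   # a different variable, because R is case sensitive

# Compare the two variables: they hold different values
male_height == Male_height

# Assigning to an existing name replaces the old value without any warning
male_height <- 170
male_height
```

Because the overwrite happens silently, it is worth checking the environment pane (or running ls()) before reusing a name.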
In addition to naming objects, it is highly recommended to provide detailed comments for your code. A good example is as follows.
# create variables
a <- 1
b <- 2
# calculate the sum
a + b
## [1] 3
Never write code without comments, like the following.
command
command
command
command
Lastly, please use spaces generously: put them around operators and after commas. Below are examples.
ave_height <- mean(observed_height, na.rm = TRUE) # good
ave_height<-mean(observed_height,na.rm = TRUE) # bad
So far, we have been using the basic features of R, which include only the core functions that are widely needed for analysis. Sometimes, we may need a function that performs a special task and is not available in base R. Then we will need to use R packages.
An R package is simply a set of customized functions designed for a special set of tasks. One of the greatest advantages of R is its extensibility through its many packages. Some of the most frequently used packages include ggplot2, tidyverse, and MASS, among many others.
In order to use functions in an R package, we need two steps:
1. Install the package with install.packages(). This step needs to be run only once on each computer.
2. Load the package with library(). This step needs to be executed every time you open R and RStudio.
Here is an example.
install.packages("ggplot2") # installation, only need to run once.
library(ggplot2) # loading, need to run every time you open RStudio.
Note that you only need to run install.packages() once. The installation procedure downloads the package files from CRAN to your computer. On the other hand, you need to run library() every time you open R and RStudio, since library() takes these downloaded functions and loads them into memory for analysis. When you open RStudio, none of these packages are loaded, so the loading step cannot be skipped.
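Because install.packages() only needs to run once, some scripts first check whether a package is already installed. The sketch below shows one possible way to do this with base R’s requireNamespace(). The helper name is_installed is an assumption made for this illustration, not a built-in function, and the installation lines are left commented out so the example runs without an internet connection.

```r
# is_installed() is a hypothetical helper defined here for illustration
is_installed <- function(pkg) {
  # requireNamespace() reports whether the package can be loaded
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with every R installation
is_installed("stats")

# Install ggplot2 only when it is missing, then load it:
# if (!is_installed("ggplot2")) install.packages("ggplot2")
# library(ggplot2)
```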
To get help on packages, you can use
help(package="ggplot2")
vignette(package="ggplot2")
Now we introduce a few useful packages.
ggplot2 Package
ggplot2 is an R package developed by Hadley Wickham. It is a system for declaratively creating graphics, based on the book The Grammar of Graphics by Leland Wilkinson (Wilkinson 2005). We will mainly use this package for generating figures. Here we present a simple example. Details will be covered in later chapters.
library(tidyverse)
library(ggplot2)
data(mpg)
ggplot(data=mpg) + geom_point(aes(x=displ,y=hwy))
dplyr and tidyr Packages
dplyr and tidyr are two R packages for data wrangling. Here we present a simple example.
library(ggplot2)
library(dplyr)
data(mpg)
mpg %>%
mutate(ave = (hwy + cty)/2) %>%
group_by(class) %>%
summarize(averageMPG_by_class = mean(ave))
## # A tibble: 7 × 2
## class averageMPG_by_class
## <chr> <dbl>
## 1 2seater 20.1
## 2 compact 24.2
## 3 midsize 23.0
## 4 minivan 19.1
## 5 pickup 14.9
## 6 subcompact 24.3
## 7 suv 15.8
readr Package
readr is an R package for data importing.
tidyverse Package
All the packages mentioned above belong to a super R package called tidyverse, which is a “universe” of many R functions. For more details, please go to the tidyverse website. In practice, therefore, you only need to run library(tidyverse), and then you will be able to use all the packages such as ggplot2, dplyr, tidyr, and readr.
install.packages("tidyverse")
library(tidyverse)
R Markdown is a file type. An R Markdown file contains narrative text and chunks of R code for an analytics project. After compilation (i.e., “knitting” with the knitr package), the R Markdown file generates an HTML or PDF report that contains the text with simple formatting, the R code, and the R output from running the code. Therefore, an R Markdown file can combine everything about the analytics project into one single file. It allows you to turn your analyses into high-quality documents, reports, presentations, and dashboards that are easy to modify.
R Markdown is different from a traditional word processor such as Microsoft Word. Word is a WYSIWYG processor (i.e., what you see is what you get) in the sense that the document is in a form that resembles its appearance when printed or displayed as a finished product. You click a button to change the format of a sentence and the change is immediately seen. A WYSIWYG processor requires users to pay attention to formatting and content at the same time. On the other hand, an R Markdown file allows users to focus solely on the content and lets R Markdown decide the formatting automatically. Users simply specify a few format requirements, and R Markdown decides the layout, spacing, and many other details. Therefore, users can pay more attention to the content. Such a document can be easily converted to another format. Meanwhile, since the text and R code are embedded in the same R Markdown file, the analysis can be easily reproduced.
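To make the structure of an R Markdown file concrete, here is a minimal sketch that writes a tiny .Rmd file from R and compiles it. The file name, title, and contents are made up for illustration. The code chunk delimiter (three backticks) is built with strrep() only so that it can be shown inside this example. The rendering step assumes the rmarkdown package and pandoc are installed and is skipped otherwise.

```r
# Three backticks mark the start and end of a code chunk in R Markdown
fence <- strrep("`", 3)

# Hypothetical contents: a header, one sentence of text, and one R chunk
rmd_lines <- c(
  "---",
  "title: \"My first report\"",
  "output: html_document",
  "---",
  "",
  "This sentence is narrative text.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
)

# Write the file to a temporary folder
rmd_file <- file.path(tempdir(), "first_report.Rmd")
writeLines(rmd_lines, rmd_file)

# Compile ("knit") the file if the required tools are available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

In RStudio, the same compilation is usually triggered by the “Knit” button rather than by calling the render function directly.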
No matter how proficient you are with programming, you will almost always get various errors in your code. To fix these issues, you need to be able to get help yourself. Here are some of the simplest ways to get help.
- Use help(function_name), ?function_name, ??function_name, and example(function_name) to get help files. After these commands, a page listing the help files for the function should open. You can also find some examples of the function there.
help(sqrt)
?sqrt
??sqrt
example(sqrt)
In this section, we will discuss some basic features in R. These features are the foundation of more advanced analysis.
Each variable stored in R is called an object. R has four basic types of objects: vectors, matrices, data frames, and lists. We will go over them briefly.
A vector is a sequence of values of the same type, such as a sequence of numbers or a sequence of strings.
a=c(5,3,6,7)
a
## [1] 5 3 6 7
b=c("Tom","Jerry","John")
b
## [1] "Tom" "Jerry" "John"
c=c(TRUE,TRUE,FALSE)
c
## [1] TRUE TRUE FALSE
Here we use the c() function to combine values into a vector (or a list).
There are three major types of data: numeric, logical, and character. You can use is.numeric(), is.logical(), and is.character() to test them.
is.numeric(0.23)
## [1] TRUE
is.numeric("Tom")
## [1] FALSE
is.logical(FALSE)
## [1] TRUE
is.character("Tom")
## [1] TRUE
is.character(TRUE)
## [1] FALSE
To determine if an R object is a vector, use is.vector() as follows.
is.vector(a)
## [1] TRUE
length() returns the length of a vector. class() returns the type of data stored in the vector.
length(a)
## [1] 4
class(a)
## [1] "numeric"
class(b)
## [1] "character"
class(c)
## [1] "logical"
To select a subset of elements in a vector, use [] as follows.
a[1]
## [1] 5
a[3]
## [1] 6
a[c(1,3)]
## [1] 5 6
a[-c(1,3)]
## [1] 3 7
a[a>4]
## [1] 5 6 7
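The last command, a[a>4], works in two steps, which the following sketch spells out. The variable name keep is chosen here for illustration only.

```r
a <- c(5, 3, 6, 7)

# Step 1: the comparison returns one logical value per element
keep <- a > 4
keep

# Step 2: the logical vector selects the elements where it is TRUE
a[keep]

# which() returns the positions of the TRUE values instead of the values
which(keep)
```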
You can modify a vector’s elements by the following command.
a[3]=100
a
## [1] 5 3 100 7
a[c(2,3)]=c(50,10)
a
## [1] 5 50 10 7
a[c(2,3)]=-2
a
## [1] 5 -2 -2 7
For element-wise comparisons, you can use the following
x=c(5,1,3)
y=c(4,1,2)
x == y
## [1] FALSE TRUE FALSE
x < y
## [1] FALSE FALSE FALSE
x <= y
## [1] FALSE TRUE FALSE
x != y
## [1] TRUE FALSE TRUE
3 %in% x
## [1] TRUE
c(3,4,5,6,7) %in% x
## [1] TRUE FALSE TRUE FALSE FALSE
Here is a list of comparison operators.
operator | syntax |
---|---|
greater than | a > b |
greater than or equal to | a >= b |
less than | a < b |
less than or equal to | a <= b |
equal to | a == b |
belongs to | a %in% b |
not equal to | a != b |
With these operators, you can modify vectors in many different ways
a
## [1] 5 -2 -2 7
a[a<0]=3
a
## [1] 5 3 3 7
Here are some other ways to create vectors.
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(1,10,2)
## [1] 1 3 5 7 9
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
rep(10,5)
## [1] 10 10 10 10 10
Many built-in functions and arithmetic operators work directly on vectors.
mean(a)
## [1] 4.5
sd(a)
## [1] 1.914854
length(a)
## [1] 4
b[2]
## [1] "Jerry"
sum(c)
## [1] 2
d=c(10,20,30,40)
a+d
## [1] 15 23 33 47
a*d
## [1] 50 60 90 280
Combining different data types, or applying a function to a data type it was not designed for, results in coercion. Logical values can be converted to numeric 0 or 1. Numeric values can be converted to strings. Numeric values can also be converted to logical values (TRUE for nonzero and FALSE for zero).
as.numeric(TRUE)
## [1] 1
as.numeric(FALSE)
## [1] 0
as.character(123.456)
## [1] "123.456"
as.logical(123.456)
## [1] TRUE
as.logical(0)
## [1] FALSE
Sometimes, coercion can be subtle
c
## [1] TRUE TRUE FALSE
sum(c)
## [1] 2
mean(c)
## [1] 0.6666667
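Coercion also happens when c() receives values of different types, and it is what makes counting with sum() possible. The sketch below uses made-up values to illustrate both cases.

```r
# c() converts mixed inputs to a single common type.
# With a string present, everything becomes character.
mixed <- c(1, TRUE, "Tom")
class(mixed)

# With only numbers and logicals, everything becomes numeric
num_and_logical <- c(1.5, TRUE, FALSE)
class(num_and_logical)

# A comparison returns logicals; sum() and mean() coerce them to 0/1,
# giving the count and the proportion of elements that satisfy it
ages <- c(10, 14, 13, 11)
sum(ages > 11)
mean(ages > 11)
```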
A matrix is a two-dimensional array whose elements are all of the same type, most commonly numbers. Use matrix() to create a matrix.
a=matrix(1:12,4,3)
a
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
class(a)
## [1] "matrix" "array"
nrow(a)
## [1] 4
ncol(a)
## [1] 3
dim(a)
## [1] 4 3
To access certain values in a matrix, use [] as follows.
a[2,3]
## [1] 10
a[2,]
## [1] 2 6 10
a[,2]
## [1] 5 6 7 8
a[,c(2,3)]
## [,1] [,2]
## [1,] 5 9
## [2,] 6 10
## [3,] 7 11
## [4,] 8 12
dim(a)
## [1] 4 3
apply(a,1,mean)
## [1] 5 6 7 8
apply(a,1,sum)
## [1] 15 18 21 24
apply(a,1,sd)
## [1] 4 4 4 4
apply(a,2,sd)
## [1] 1.290994 1.290994 1.290994
apply(a,2,sum)
## [1] 10 26 42
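In the apply() calls above, the second argument selects the direction: 1 applies the function to each row and 2 applies it to each column. As a rough check, the sketch below compares apply() with the base R shortcuts rowMeans() and colSums(). The matrix name m is used here only to keep the example self-contained.

```r
m <- matrix(1:12, 4, 3)

# Margin 1: one result per row
row_means <- apply(m, 1, mean)
row_means

# Margin 2: one result per column
col_sums <- apply(m, 2, sum)
col_sums

# Base R offers shortcuts for these two common cases
all.equal(row_means, rowMeans(m))
all.equal(col_sums, colSums(m))
```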
A data frame is a two-dimensional table in which all values within a column share the same data type, while different columns can have different types. Use the data.frame() function to create a data frame. Use str() to examine the structure of a data frame.
a=data.frame(name=c("Tom","Jerry","John","Jane"),
age=c(10,14,13,11),
gender=c("Male","Male","Male","Female"))
a
## name age gender
## 1 Tom 10 Male
## 2 Jerry 14 Male
## 3 John 13 Male
## 4 Jane 11 Female
str(a)
## 'data.frame': 4 obs. of 3 variables:
## $ name : chr "Tom" "Jerry" "John" "Jane"
## $ age : num 10 14 13 11
## $ gender: chr "Male" "Male" "Male" "Female"
class(a)
## [1] "data.frame"
a[2,3]
## [1] "Male"
a[2,]
## name age gender
## 2 Jerry 14 Male
a[,2]
## [1] 10 14 13 11
a$name
## [1] "Tom" "Jerry" "John" "Jane"
a$age[2]
## [1] 14
a[,"age"]
## [1] 10 14 13 11
a[["age"]]
## [1] 10 14 13 11
a[,c("name","age")]
## name age
## 1 Tom 10
## 2 Jerry 14
## 3 John 13
## 4 Jane 11
a[2,c("name","age")]
## name age
## 2 Jerry 14
dim(a)
## [1] 4 3
attributes() returns similar information about a data frame, but it will mostly be used later. names() returns the column names.
names(a)
## [1] "name" "age" "gender"
You can modify a data frame’s values by the following command.
a[3,2]=100
a
## name age gender
## 1 Tom 10 Male
## 2 Jerry 14 Male
## 3 John 100 Male
## 4 Jane 11 Female
a[,2]=55
a
## name age gender
## 1 Tom 55 Male
## 2 Jerry 55 Male
## 3 John 55 Male
## 4 Jane 55 Female
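The indexing rules for vectors carry over to data frames, so rows can be selected with a logical condition and new columns can be created by assignment. The following is a minimal sketch. The data frame name students and the column age_next_year are made up for illustration.

```r
students <- data.frame(name = c("Tom", "Jerry", "John", "Jane"),
                       age = c(10, 14, 13, 11),
                       gender = c("Male", "Male", "Male", "Female"),
                       stringsAsFactors = FALSE)

# Each column has its own data type
col_types <- sapply(students, class)
col_types

# Keep only the rows where the condition is TRUE (all columns are kept)
older <- students[students$age > 11, ]
older

# Create a new column from an existing one
students$age_next_year <- students$age + 1
students
```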
A list is a one-dimensional collection whose elements can be of different types. Use the list() function to create a list.
e=list(current_rank=1,
name="Tom",
active=TRUE,
metric=matrix(1:6,2,3),
family_members=c("Mary", "John"))
e
## $current_rank
## [1] 1
##
## $name
## [1] "Tom"
##
## $active
## [1] TRUE
##
## $metric
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## $family_members
## [1] "Mary" "John"
e[[3]]
## [1] TRUE
e$active
## [1] TRUE
e[[4]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
e$metric
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
e[[4]][1,2]
## [1] 3
e[[5]]
## [1] "Mary" "John"
e[[5]][2]
## [1] "John"
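Note that the examples above use double brackets. Single brackets also work on a list, but they return a smaller list rather than the element itself. The sketch below illustrates the difference on a shortened version of the list. The element city is hypothetical and added only to show assignment.

```r
e <- list(current_rank = 1, name = "Tom", active = TRUE)

# Single brackets return a list that contains the selected element
sub_list <- e[3]
class(sub_list)

# Double brackets (or $) return the element itself
element <- e[[3]]
class(element)

# Elements can be modified or added by assignment
e$name <- "Jerry"
e$city <- "Boston"   # hypothetical new element
names(e)
```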
An if statement runs a block of code only when its condition is TRUE:
x <- 5
if(x > 0){
print("Positive number")
}
## [1] "Positive number"
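An if statement can be extended with else if and else branches to handle the remaining cases. The following is a minimal sketch. The value of x and the variable name result are chosen for illustration.

```r
x <- -3

# Only the first branch whose condition is TRUE is executed
if (x > 0) {
  result <- "Positive number"
} else if (x < 0) {
  result <- "Negative number"
} else {
  result <- "Zero"
}
print(result)
```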
Sometimes, we need to repeat one operation many times. In that case, we use a for loop.
a=matrix(1:12,3,4)
for (i in 1:nrow(a))
{
print(
mean(a[i,])/(max(a[i,])-min(a[i,]))
)
}
## [1] 0.6111111
## [1] 0.7222222
## [1] 0.8333333
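The loop above only prints its results. To reuse them later, one common approach is to create an empty vector first and fill it inside the loop, as in the sketch below. The name ratios is made up for illustration, and seq_len() is used in place of 1:nrow(a) because it also behaves sensibly when a matrix has zero rows.

```r
a <- matrix(1:12, 3, 4)

# Create a numeric vector with one slot per row
ratios <- numeric(nrow(a))

for (i in seq_len(nrow(a))) {
  # Store the result for row i instead of printing it
  ratios[i] <- mean(a[i, ]) / (max(a[i, ]) - min(a[i, ]))
}
ratios
```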
If a set of operations is used frequently, you can define a function instead of typing these operations repeatedly.
my_func = function(x)
{
output = mean(x)/(max(x)-min(x))
return(output)
}
my_func(a[1,])
## [1] 0.6111111
Data passed into a function is called the function’s argument.
- Arguments can be results from another function.
- Use = to specify the names of arguments, especially when there are multiple arguments, for readability/QA.
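The points about arguments can be illustrated with a variant of my_func that takes a second argument. This is only a sketch. The function name scaled_mean and the argument na_rm are hypothetical and not part of base R.

```r
# scaled_mean() is a hypothetical function with two arguments;
# na_rm has a default value, so it can be omitted in a call
scaled_mean <- function(x, na_rm = FALSE) {
  output <- mean(x, na.rm = na_rm) /
    (max(x, na.rm = na_rm) - min(x, na.rm = na_rm))
  return(output)
}

# Name the arguments with = for readability
r1 <- scaled_mean(x = c(1, 4, 7, 10))
r1

# With several arguments, names make the call easier to check
r2 <- scaled_mean(x = c(1, 4, NA, 10), na_rm = TRUE)
r2

# An argument can be the result of another function
r3 <- scaled_mean(x = seq(1, 10, 3))
r3
```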
Summary of R commands
R cheatsheets are available at https://rstudio.com/resources/cheatsheets/