In this chapter, we use a few case studies to demonstrate how data visualization techniques help us extract insights from data. The following R packages are needed for running the examples in this chapter.
library(tidyverse)
library(ggthemes)
library(grid)
library(gridExtra)
#library(Lahman)
#library(MASS) # conflict with dplyr package for select(), use dplyr::select
library(RColorBrewer)
library(ggridges)
library(ggrepel)
library(kableExtra)
We analyze a data set containing the offense and defense statistics
for the baseball teams in Major League Baseball in the early years, in
particular, from 1871 through 1900. The data set includes baseball teams
from many “major” leagues such as American Association, Union
Association, Players League, Federal League, and the National
Association (of 1871-1875). This data set was created by Sean Lahman,
who pioneered the effort to make baseball statistics freely available to
the general public. The complete data set includes most recent data is
available in the R package Lahman
.
We first load the data and print the first a few rows.
Teams = read.csv("data/baseballgame_bf1900.csv")
head(Teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 <NA> BS1 BNA NA 3 31 NA 20 10 NA NA N
## 2 1871 <NA> CH1 CNA NA 2 28 NA 19 9 NA NA N
## 3 1871 <NA> CL1 CFC NA 8 29 NA 10 19 NA NA N
## 4 1871 <NA> FW1 KEK NA 7 19 NA 7 12 NA NA N
## 5 1871 <NA> NY2 NNA NA 5 33 NA 16 17 NA NA N
## 6 1871 <NA> PH1 PNA NA 1 28 NA 21 7 NA NA Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## teamIDretro
## 1 BS1
## 2 CH1
## 3 CL1
## 4 FW1
## 5 NY2
## 6 PH1
Each row in the data set represents various statistics of one team in a particular season. The statistics include the number of runs scored by that team in that year/season (R), the number of home runs (HR), the number of games (G), lost games/losses (L), and wins (W) and many others variables. Suppose we are interested in the relationship between the number of runs (R) and the number of lost games (L). We can easily generate a scatter plot to show the relationship.
Teams %>%
ggplot(aes(R, L)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
Usually, more runs lead to higher scores and hence fewer losses. However, when we visualize the data, we actually get a positive correlation, which is puzzling. We further fit a regression line to the data, which confirms our observations.
In order to understand this puzzle, we recall that the data we have is the statistics of each team in the early years. Back in the 1870s and 1880s, the baseball league is in its infancy. The league has experienced the ups and downs when it developed to its current scale. Therefore, each season is quite different in all aspects from another season. Therefore, when we pool the data across all the years together, the data tells a very different story.
For example, the number of runs and the number of losses depend on how many games each team played during that season. Since the league is growing rapidly, the number of games in each season is much different. Some season hold more games and some hold less.
Therefore, we could reexamine the relationship between the number of runs and the number of losses conditional on the number of games that each team plays in that season. This is sometimes called stratification. Essentially we divide the observations into groups according to the value of a third variable, i.e., the number of games in this case, so that the observations in each group all have similar or identical values for that variable. Within the group, we investigate the data set focusing on the two variables of interest, i.e., the number of runs and losses in this case. Then our results and findings are now conditional on the third variable by which we stratify the data.
Since we want to stratify by the number of games and it is a quantitative variable, we need to cut this variable into several bins/intervals. We can first check the range of the number of games as follow.
range(Teams$G)
## [1] 6 158
It seems an interval of length 20 will give us about 8 strata. We use
the function cut()
to convert the quantitative variable to
a categorical variable with values of (0, 20], … , (140, 160]. To
visualize the strata, we use the visualization technique called
faceting. It essentially plots the relationship of interest in many
small multiple of panels which correspond to the strata. Here, we plot
the scatter plot between the numbers of runs and losses within each
stratum.
Teams %>%
mutate(G_strata = cut(G, seq(0,180,20))) %>%
ggplot(aes(R, L)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ G_strata)
We immediately find out that the relationship becomes negative within each stratum. If we overlay the fitted regression within each stratum on the fitted regression to the whole data set, we can see the contrast.
Teams %>%
mutate(G_strata = cut(G, seq(0,180,20))) %>%
ggplot(aes(R, L, color=G_strata)) +
geom_point(alpha = 0.5) +
geom_smooth(aes(group=G_strata, color=G_strata),
method = "lm")+
scale_color_viridis_d()+
geom_smooth(color = "blue", method = "lm")
Clearly, the relationship between the numbers of runs and losses within strata and over the entire data set is opposite. This is a typical case of so called Simpson’s paradox. It is called a paradox because we see the sign of the correlation flips when comparing the entire population and specific groups (or strata). It usually happens because there is a confounder variable which affects both of the variables of interest. Since the confounder’s variation is more dominant, the overall pattern between the two variables of interest, when marginalized over the confounder, shows an unexpected association.
In this case, the number of game in each season is the confounder. The more game the team plays, the more runs and more losses the team acquires. When the analysis is stratified over the confounder, the true relationship appears.
We can further investigate this issue. We plot the year against the number of games, the number of runs, and the number of losses. We can immediately see that, as the sport becomes more popular over the years, there are more games in each season, with a few exceptions around 1984-1985. Therefore, the increase of the number of games over the years is the main reason for observing this positive relationship between the number of runs and the number of losses.
plot1 <- Teams %>%
ggplot(aes(yearID, G)) + geom_jitter(alpha = 0.5)
plot2 <- Teams %>%
ggplot(aes(yearID, R)) + geom_jitter(alpha = 0.5)
plot3 <- Teams %>%
ggplot(aes(yearID, L)) + geom_jitter(alpha = 0.5)
plot4 <- Teams %>%
ggplot(aes(G, R)) + geom_jitter(alpha = 0.5)
plot5 <- Teams %>%
ggplot(aes(G, L)) + geom_jitter(alpha = 0.5)
grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol=3)
This example shows that how faceting in data visualization can help us identify/explain/investigate the Simpson’s paradox in the data set. One important tool for dealing with the Simpson’s paradox is conditionaling, or stratification. Conditional/stratification is achieved in data visualization using the facet feature, where small multiple figures are generated for comparison.
Next, we analyze a wind data set containing the hourly measurements
of the wind speed (ws) and direction (wd) in London in 1999. The data is
available in the R package openair
. We first load the data
and print the first a few rows.
wind = read.csv(file="data/LondonWind1999.csv", header = TRUE)
head(wind)
## date ws wd nox no2 o3 pm10 so2 co pm25
## 1 1999-01-01 00:00:00 5.04 140 88 35 4 21 3.835000 1.025000 18
## 2 1999-01-01 01:00:00 4.08 160 132 41 3 17 5.243333 2.700000 11
## 3 1999-01-01 02:00:00 4.80 160 168 40 4 17 6.513333 2.866667 8
## 4 1999-01-01 03:00:00 4.92 150 85 36 3 15 4.180000 1.625000 10
## 5 1999-01-01 04:00:00 4.68 150 93 37 3 16 4.250000 1.025000 11
## 6 1999-01-01 06:00:00 3.84 170 102 32 4 16 4.005000 0.800000 9
As we can see, each row represents one observation of the wind speed and direction at a particular hour in a particular day in 1999. We mostly focus on the wind speed (ws) and direction (wd) and ignore the other variables for now. For the wind direction, we list all possible directions which are equally spaced with 10 degrees aprt from each other.
sort(unique(wind$wd))
## [1] 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180
## [20] 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360
For this data set, we start with the scatter plot of direction versus speed in the upper left panel in the figure below. We can quickly visualize the data, but it does not show many insights because of the overlapping observations. We solve this issue by jittering the observations and immediately see that the wind direction is mostly focus on the 180-270 degree range. Meanwhile, the wind speed is also faster in these directions.
g1=ggplot(wind)+geom_point(aes(x=wd, y=ws),alpha=0.5, size=0.8)
g2=ggplot(wind)+geom_point(aes(x=factor(wd), y=ws),
position = position_jitter(0.3),
size = 0.8, alpha=0.2)+
scale_x_discrete(breaks=seq(0,360,30))
grid.arrange(g1,g2,ncol=2)
Even though the scatter plot is sufficient for faithfully presenting the data, but it is hard to obtain any insights. If we focus on the distribution of the wind speed in each direction, we could use side by side boxplot. In the left panel of the figure below, we can see that the wind speed distribution gradually shift to large values as the wind direction moves towards 220 degree. Once improvement we can make about this visualization is that since the wind direction is in a circular format, we can arrange these boxplots in a polar coordinate system as opposed to the Cartesian coordinate system.
g1=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+
scale_x_discrete(breaks=seq(0,360,30))
wind$wd[wind$wd==360]=0
g2=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+coord_polar(start = -pi/32)
grid.arrange(g1,g2,ncol=2)
In the right panel, we can see that boxplots are aligned in a circle. Through the median and first and third quartiles, we can clearly see that the wind speed is much faster in the southwest wind direction. The east wind has relatively fast speed. These insights are harder to extract from the Cartesian coordinate system, which is the advantage of the polar coordinate system.
Using the boxplot in the polar coordinate system is certainly useful, however, there is a shortcoming. For example, we can only visualize the distribution of wind speed in each direction, but cannot account for the frequency of each direction. From the jittered scatter plot, we know that not only does the wind speed is faster in the southwest direction, but there are also more observations of southwest wind than any other directions. Such information cannot be visualized in the figure above.
Alternatively, to visualize the frequency of the wind direction as well as the distribution of the wind speed in each direction, we could use barplot. We first convert the wind speed to different intervals of speed. Then we can use the stacked bar plot to show the frequencies of the wind speed intervals in each direction, shown in the left panel of the figure below. Lastly, we convert the stacked bar plot to the polar coordinate system and show it on the right panel.
wind$binned_speed = cut(wind$ws, c(seq(0,14,2),Inf), include.lowest = TRUE)
wind$direction =
cut(wind$wd,
c(-1,11.25,33.75,56.25,78.75,101.25,123.75,146.25,168.75,191.25,213.75,236.25,258.75,281.25,303.75,326.25,348.75,361),
labels = c("N","NNE","NE","ENE","E","ESE","SE","SSE","S","SSW","SW","WSW","W","WNW","NW","NNW","N"))
g3=ggplot(wind, aes(x=direction, fill = binned_speed)) +
geom_bar(position = position_stack(reverse = TRUE)) +
scale_fill_viridis_d()+
theme(legend.position = "top")
g4=ggplot(wind, aes(x=direction, fill = binned_speed)) +
geom_bar(width=1, position = position_stack(reverse = TRUE)) +
scale_fill_viridis_d()+
coord_polar(start = -pi/16)
grid.arrange(g3,g4,ncol=2)
In this visualization, we can see that the west, southwest, south bars take a large portion among all bars, which means the wind mostly comes from these directions in London. In addition, the wind blows fast in these directions because the colors of the bars show more yellow segments This visualization is called wind rose plot. It is powerful because it can visualize the wind speed and direction at the same time. Because of the polar coordinate system, the visualization is very easy to digest and intuitive. There are also many extension of this visualization, for example, the choice of the color scheme.
From Charlotte Wickham’s example
Where the Heat and the Thunder Hit Their Shots
shots = read.csv("data/shots_sum.csv", header = TRUE)
head(shots)
## x y num_shots num_made prop_made avg_points total_points
## 1 0 4 2 1 0.5 1.5 3
## 2 0 5 3 0 0.0 0.0 0
## 3 0 6 1 0 0.0 0.0 0
## 4 0 7 1 0 0.0 0.0 0
## 5 0 9 2 0 0.0 0.0 0
## 6 0 10 1 0 0.0 0.0 0
shots = filter(shots, num_shots < 1000)
ggplot(shots, aes(x, y))+
geom_point(aes(color = avg_points,
size = num_shots)) +
scale_color_distiller("Points", palette = "RdYlGn") +
scale_size("Attempts", range = c(0.1, 5) ) +
ylim(0, 35) +
coord_equal() +
theme_void() +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
axis.line = element_blank(),
axis.title = element_blank())
## Warning: Removed 54 rows containing missing values (`geom_point()`).
ggplot(shots, aes(x, y))+
geom_point(aes(color = avg_points)) +
scale_color_distiller("Points", palette = "RdYlGn") +
scale_size("Attempts", range = c(0.1, 5) ) +
ylim(0, 35) +
coord_equal() +
theme_void() +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
axis.line = element_blank(),
axis.title = element_blank())
## Warning: Removed 54 rows containing missing values (`geom_point()`).
From Charlotte Wickham’s example
library(dplyr)
library(hflights)
library(ggplot2)
hflights_df <- tbl_df(hflights)
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
hflights_df <- mutate(hflights_df,
DepHour = floor(DepTime/100),
DayOfWeek = factor(DayOfWeek,
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
Date = ISOdate(Year, Month, DayofMonth)
)
hou <- filter(hflights_df, Origin == "HOU")
hou_mon <- filter(hou, DayOfWeek == "Mon")
# over all mondays in 2011, avg delay of flights departing by hour
hou_mon_avg <- hou_mon %>%
group_by(DepHour) %>%
summarise(avg_delay = mean(DepDelay))
# initial plot
ggplot(hou_mon_avg, aes(DepHour, avg_delay)) +
geom_point() +
geom_line() +
ylab("Average delay (mins)") +
xlab("Departure time") +
scale_x_continuous(breaks = seq(0, 24, 6),
labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
theme_bw(18)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 row containing missing values (`geom_line()`).
#ggsave("08-monday.png", width = 6, height = 4)
# for each monday in 2011, avg delay of flights departing by hour
hou_mon_day <- filter(hou, DayOfWeek == "Mon") %>%
group_by(Date, DepHour) %>%
summarise(avg_delay = mean(DepDelay))
## `summarise()` has grouped output by 'Date'. You can override using the
## `.groups` argument.
# quantiles for delay by time
hou_mon_q <- hou_mon %>% group_by(DepHour) %>%
summarise(n = n(),
q25 = quantile(DepDelay, probs = 0.25, na.rm = TRUE),
q50 = quantile(DepDelay, probs = 0.5, na.rm = TRUE),
q75 = quantile(DepDelay, probs = 0.75, na.rm = TRUE))
hou_mon %>%
ggplot(aes(DepHour, DepDelay)) +
geom_point() +
#geom_line(aes(group = Date)) +
ylab("Average delay (mins)") +
xlab("Departure time") +
scale_x_continuous(breaks = seq(0, 24, 6),
labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
theme_bw(18)
## Warning: Removed 156 rows containing missing values (`geom_point()`).
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(tidyverse)
library(lubridate)
t = read_csv("data/temperature_airport.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 141294 Columns: 61
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): STATION, NAME, FMTM, PGTM
## dbl (45): LATITUDE, LONGITUDE, ELEVATION, ACMH, ACSH, AWND, GAHT, PRCP, PSU...
## lgl (11): ACMC, ACSC, DAPR, FRGT, MDPR, WT10, WT17, WT19, WV01, WV07, WV20
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
t2 = t %>%
mutate(YEAR = year(DATE),
MONTH = month(DATE),
DAY = day(DATE),
MONTHDAY = paste(ifelse(MONTH>=10, MONTH, paste("0",MONTH,sep="")),
ifelse( DAY>=10, DAY, paste("0", DAY,sep="")), sep="-"),
NAME = case_when(NAME == "CHICAGO OHARE INTERNATIONAL AIRPORT, IL US" ~ "ORD",
NAME == "CINCINNATI NORTHERN KENTUCKY INTERNATIONAL AIRPORT, KY US" ~ "CVG",
NAME == "JFK INTERNATIONAL AIRPORT, NY US" ~ "JFK",
NAME == "MIAMI INTERNATIONAL AIRPORT, FL US" ~ "MIA",
NAME == "SAN FRANCISCO INTERNATIONAL AIRPORT, CA US" ~ "SFO",
NAME == "MINNEAPOLIS ST. PAUL INTERNATIONAL AIRPORT, MN US" ~ "MSP")) %>%
group_by(NAME, MONTHDAY) %>%
summarize(TEMP = mean(TAVG, na.rm = TRUE) ) %>%
ungroup()
## `summarise()` has grouped output by 'NAME'. You can override using the
## `.groups` argument.
t2
## # A tibble: 1,830 × 3
## NAME MONTHDAY TEMP
## <chr> <chr> <dbl>
## 1 CVG 01-01 34.2
## 2 CVG 01-02 36.1
## 3 CVG 01-03 35.4
## 4 CVG 01-04 33.4
## 5 CVG 01-05 30
## 6 CVG 01-06 27.8
## 7 CVG 01-07 25.6
## 8 CVG 01-08 29.8
## 9 CVG 01-09 32.9
## 10 CVG 01-10 32.6
## # ℹ 1,820 more rows
ggplot(t2) +
geom_line(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME)) +
geom_smooth(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME), method = "gam") +
scale_x_discrete(breaks = c("01-01", "04-01", "07-01", "10-01"),
labels = c("01-01", "04-01", "07-01", "10-01")) +
scale_y_continuous(limits = c(0,100)) +
coord_polar()
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
This case study is originally developed in Irizarry (2019).
We have made some modifications to the case, but the majority of the analysis follows their book. We will analyze the gapminder data set1 which is about the statistics of different countries measured over the years.
The data set is created by Hans Rosling2, a professor of international health and a pioneer in data visualization. Using this data set and the visualization, he wants to educate the public about the common misunderstanding of the world and different countries. For this initiative, he has created the Gapminder Foundation3 which uses data to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events. Gapminder identifies systematic misconceptions about important global trends and proportions and uses reliable data to develop easy to understand teaching materials to rid people of their misconceptions. Its mission is to fight devastating ignorance with a fact-based worldview everyone can understand.
Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in peoples’ minds into an over-dramatic worldview and strong negative stress feelings: “The world is getting worse!”, “It’s we vs. them!”, “Other people are strange!”, “The population just keeps growing!” and “Nobody cares!” Hans Rosling conveyed actual data-based trends in a dramatic way of his own, using effective data visualization. This section is based on two talks that exemplify this approach to education: New Insights on Poverty4 and The Best Stats You’ve Ever Seen5.
On the Gapminder website, there are a list of questions about the wealth health and economics. You can go through these questions and test your knowledge and understanding. In his TED talk6, Hans Rosling tested the audience with questions on the world health such as the child mortality across different countries. For example, for each pair of countries below, which country do you think had the highest child mortality rates in 2015?
When answering these questions without data, the non-European countries are typically picked as having higher child mortality rates: Sri Lanka over Turkey, South Korea over Poland, and Malaysia over Russia. It is also common to assume that countries considered to be part of the developing world: Pakistan, Vietnam, Thailand, and South Africa, have similarly high mortality rates.
To answer these questions, we use data, we can use dplyr. For example, for the first comparison we see that:
gapminder = read.csv("data/gapminder.csv",header = TRUE)
gapminder %>%
filter(year == 2015 & country %in% c("Sri Lanka","Turkey")) %>%
select(country, infant_mortality)
## country infant_mortality
## 1 Sri Lanka 8.4
## 2 Turkey 11.6
Turkey has the higher infant mortality rate. We can use this code on all comparisons and find the following:
country | infant mortality | country | infant mortality |
---|---|---|---|
Sri Lanka | 8.4 | Turkey | 11.6 |
Poland | 4.5 | South Korea | 2.9 |
Malaysia | 6.0 | Russia | 8.2 |
Pakistan | 65.8 | Vietnam | 17.3 |
Thailand | 10.5 | South Africa | 33.6 |
We see that the European countries on this list have higher child mortality rates: Poland has a higher rate than South Korea, and Russia has a higher rate than Malaysia. We also see that Pakistan has a much higher rate than Vietnam, and South Africa has a much higher rate than Thailand. It turns out that when Hans Rosling gave this quiz to educated groups of people, the average score was less than 2.5 out of 5, worse than what they would have obtained had they guessed randomly. This implies that more than ignorant, we are misinformed. In this chapter we see how data visualization helps inform us.
Specifically, in this section, we use data to attempt to answer the following two questions:
We start by importing the data to R from the csv file.
gapminder = read.csv("data/gapminder.csv",header = TRUE)
dim(gapminder)
## [1] 10545 9
head(gapminder)
## country year infant_mortality life_expectancy fertility
## 1 Albania 1960 115.40 62.87 6.19
## 2 Algeria 1960 148.20 47.50 7.65
## 3 Angola 1960 208.00 35.98 7.32
## 4 Antigua and Barbuda 1960 NA 62.97 4.43
## 5 Argentina 1960 59.87 65.39 3.11
## 6 Armenia 1960 NA 66.86 4.55
## population gdp continent region
## 1 1636054 NA Europe Southern Europe
## 2 11124892 13828152297 Africa Northern Africa
## 3 5270844 NA Africa Middle Africa
## 4 54681 NA Americas Caribbean
## 5 20619075 108322326649 Americas South America
## 6 1867396 NA Asia Western Asia
As we can see there are 9 variables and 10545 observations. Each row represents one particular country in a particular year. These 9 variables include:
country
: name of the country. There are 185 countries
in total.year
: the year of observation. It ranges from 1960 to
2016.infant_mortality
: the infant mortality of the country
in that year.life_expectancy
: the life expectancy of the country in
that year.fertility
: the fertility rate of the country in that
year.population
: the population rate of the country in that
year.gdp
: the GDP of the country in that year.continent
: the continent the country belongs to. Values
include Europe, Africa, Americas, Asia, and Oceania.region
: the region the country belongs to the following
valuesunique(gapminder$region)
## [1] "Southern Europe" "Northern Africa"
## [3] "Middle Africa" "Caribbean"
## [5] "South America" "Western Asia"
## [7] "Australia and New Zealand" "Western Europe"
## [9] "Southern Asia" "Eastern Europe"
## [11] "Central America" "Western Africa"
## [13] "Southern Africa" "South-Eastern Asia"
## [15] "Eastern Africa" "Northern America"
## [17] "Eastern Asia" "Northern Europe"
## [19] "Melanesia" "Polynesia"
## [21] "Central Asia" "Micronesia"
unique(gapminder$year)
## [1] 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
## [16] 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
## [31] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
## [46] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
We check to see if there are any missing data and how the missing values distribute in the data.
colSums(is.na(gapminder))
## country year infant_mortality life_expectancy
## 0 0 1453 0
## fertility population gdp continent
## 187 185 2972 0
## region
## 0
As we can see, there are a lot of missing observations for the infant mortality rate, fertility rate, population, and gdp.
missing_by_year = gapminder %>%
group_by(year) %>%
summarize(num_missing_infant_mortality = sum(is.na(infant_mortality)),
num_missing_fertility = sum(is.na(fertility)),
num_missing_population = sum(is.na(population)),
num_missing_gdp = sum(is.na(gdp)))
missing_year_long = pivot_longer(missing_by_year, cols = 2:5, names_to = "variable", values_to = "missing_num")
ggplot(missing_year_long)+
geom_bar(aes(year,missing_num),width=0.5, stat = "identity")+ # geom_col()
facet_wrap(~variable)+
scale_x_continuous(breaks = seq(1960,2010,10))
On the other hand, we can look at the missing values for each country.
missing_by_coun = gapminder %>%
group_by(country) %>%
summarize(num_missing_infant_mortality = sum(is.na(infant_mortality)),
num_missing_fertility = sum(is.na(fertility)),
num_missing_population = sum(is.na(population)),
num_missing_gdp = sum(is.na(gdp)))
missing_by_coun %>%
arrange(desc(num_missing_infant_mortality)) %>%
select(country,num_missing_infant_mortality) %>%
head(10)
## # A tibble: 10 × 2
## country num_missing_infant_mortality
## <chr> <int>
## 1 Aruba 57
## 2 French Polynesia 57
## 3 Greenland 57
## 4 Hong Kong, China 57
## 5 Macao, China 57
## 6 New Caledonia 57
## 7 Puerto Rico 57
## 8 Antigua and Barbuda 31
## 9 Grenada 25
## 10 Montenegro 25
missing_by_coun %>%
arrange(desc(num_missing_fertility)) %>%
select(country,num_missing_fertility) %>%
head(10)
## # A tibble: 10 × 2
## country num_missing_fertility
## <chr> <int>
## 1 Greenland 3
## 2 Albania 1
## 3 Algeria 1
## 4 Angola 1
## 5 Antigua and Barbuda 1
## 6 Argentina 1
## 7 Armenia 1
## 8 Aruba 1
## 9 Australia 1
## 10 Austria 1
missing_by_coun %>%
arrange(desc(num_missing_population)) %>%
select(country,num_missing_population) %>%
head(10)
## # A tibble: 10 × 2
## country num_missing_population
## <chr> <int>
## 1 Albania 1
## 2 Algeria 1
## 3 Angola 1
## 4 Antigua and Barbuda 1
## 5 Argentina 1
## 6 Armenia 1
## 7 Aruba 1
## 8 Australia 1
## 9 Austria 1
## 10 Azerbaijan 1
missing_by_coun %>%
arrange(desc(num_missing_gdp)) %>%
select(country,num_missing_gdp) %>%
head(10)
## # A tibble: 10 × 2
## country num_missing_gdp
## <chr> <int>
## 1 Libya 46
## 2 Ireland 45
## 3 Qatar 45
## 4 West Bank and Gaza 45
## 5 Timor-Leste 44
## 6 Iraq 42
## 7 Montenegro 42
## 8 Aruba 40
## 9 Estonia 40
## 10 Maldives 40
Now we begin to analyze the data set. There is a common understanding about the world that is the world is divided into two groups: the developed countries which are associate with long life spans and small families, versus the developing countries (Africa, Asia, and Latin America) characterized by short life spans and large families. Is this true? To verify this, we propose our first question:
Is the world divided in terms of life expectancy and fertility rate (average number of children per woman)?
To answer this question, we use scatterplot of life expectancy versus fertility rate. We start by looking at data from about 50 years ago, when perhaps this view was first cemented in our minds.
library(RColorBrewer)
gapminder %>%
filter(year == 1962) %>% # data wrangling
ggplot(aes(fertility, life_expectancy, color = continent)) + # data visualization
geom_point() +
scale_color_brewer(palette = "Paired")
As we can see, most countries fall into two distinct categories: 1. Life expectancy around 70 years and 3 or fewer children per family. 2. Life expectancy lower than 65 years and more than 5 children per family. Therefore, it seems this understanding does not have some support from data and is somewhat true 50 years ago. However, is it still true today? In addition, is the situation different than 50 years ago? To answer this question, we need to regenerate a new figure for 2012 and compare with the previous one.
gapminder %>%
filter(year %in% c(1962, 2012)) %>%
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_wrap(~year) +
scale_color_brewer(palette = "Paired")
This figure shows the totality of the situation. If we want to focus on different continent, we can further break it down the comparison. As we can see, the most improvement takes place in Africa, Americas, and Asia.
gapminder %>%
filter(year %in% c(1962, 2012)) %>%
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_grid(year~continent) +
scale_color_brewer(palette = "Paired") +
theme(legend.position = "top")
This plot clearly shows that the majority of countries have moved from the developing world cluster to the western world one. In conclusion, the belief that the world is divided into two groups are no longer support nowadays. This is particularly true for Asia.
The figure focus on the overall improvement of the entire world or each continent. What is we want to focus on each country? We can plot the trajectory of each country.
gapminder %>%
filter(year %in% c(1962, 2012)) %>%
group_by(country) %>%
summarise(start_fert = fertility[year==1962],
end_fert = fertility[year==2012],
start_life = life_expectancy[year==1962],
end_life = life_expectancy[year==2012],
continent = continent) %>%
ggplot() +
geom_segment(aes(x=start_fert,xend=end_fert,
y=start_life,yend=end_life,
color=continent,
group=country)) +
facet_wrap(~continent) +
scale_color_brewer(palette = "Paired")
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.
As we can see, Asian and Americas countries collectively improved significantly. Part of African countries improved significantly in both life expectancy and fertility, but the others have only moderate improvements and these improvement are mostly life expectancy. Most of the European countries have mild improvements.
We are looking at the span of 50 years, it make sense to look at the
data most closely and visualize how the changes took place through the
years. We can add 1970, 1980, 1990, and 2000. If we do this, we will not
want all the plots on the same row, the default behavior of
facet_grid
, since they will become too thin to show the
data. Instead, we will want to use multiple rows and columns. The
function facet_wrap
permits us to do this by automatically
wrapping the series of plots so that each display has viewable
dimensions:
gapminder %>%
filter(year %in% c(1962, 1980, 1990, 2000, 2012)) %>%
ggplot( aes(fertility, life_expectancy, color = continent)) +
geom_point() +
facet_wrap(~year, ncol=5) +
scale_color_brewer(palette = "Paired") +
theme(legend.position = "top")
gapminder %>%
filter(year %in% seq(1962, 2012, 1)) %>%
ggplot() +
geom_path(aes(fertility, life_expectancy,
color=continent, group = country)) +
#geom_text_repel(aes(fertility, life_expectancy, label=year)) +
facet_wrap(~continent) +
scale_color_brewer(palette = "Paired")
This plot clearly shows how most Asian countries have improved at a much
faster rate than European ones. Meanwhile, an interesting pattern is
that most countries improve the life expectancy first and then reduce
the fertility rate. This makes sense because the increase of life
expectancy implies the improvement of health care quality. Once the
health care condition is improved, the family does not need give birth
to many children.
The default choice of the range of the axes is important. When not
using facet
, this range is determined by the data shown in
the plot. When using facet
, this range is determined by the
data shown in all plots and therefore kept fixed across plots. This
makes comparisons across plots much easier. For example, in the above
plot, we can see that life expectancy has increased and the fertility
has decreased across most countries. We see this because the cloud of
points moves. This is not the case if we adjust the scales: In the plot
above, we have to pay special attention to the range to notice that the
plot on the right has a larger life expectancy.
filter(gapminder, year%in%c(1962, 2012)) %>%
ggplot(aes(fertility, life_expectancy, col = continent)) +
geom_point() +
facet_wrap(. ~ year, scales = "free")
We can repeat the same analysis using another pair of variables.
gapminder %>%
filter(year %in% c(1962, 2011)) %>% # data wrangling
mutate(income = gdp/population) %>%
ggplot(aes(income, life_expectancy, color = continent)) + # data visualization
scale_x_log10()+
geom_point() +
facet_wrap(~year) +
scale_color_brewer(palette = "Paired")
## Warning: Removed 106 rows containing missing values (`geom_point()`).
Another commonly held belief is that the wealth inequality has became more serious over the years with the rich getting richer and poor getting poorer. We now shift our attention to the second question related to the commonly held notion that wealth distribution across the world has become worse during the last decades.
Did the income inequality in the world become worse or better?
When general audiences are asked if poor countries have become poorer and rich countries become richer, the majority answers yes. By using stratification, histograms, smooth densities, and boxplots, we will be able to understand if this is in fact the case. First we learn how transformations can sometimes help provide more informative summaries and plots.
We use the GDP per capita as a measure of individual wealth and create this variable in the data set. Note that income is adjusted for inflation and represent current US dollars, so it is comparable across the years. In addition, this variable represents the average income of that country. There are certainly variation in income within that country, which is not what we are studing right now because we do not have assess to this data.
gapminder <- gapminder %>%
mutate(dollars_per_day = gdp/population/365,
gdp_per_capita = gdp/population)
Here is a histogram of GDP per capita from 1970.
past_year <- 1970
# gapminder %>%
# filter(year == past_year & !is.na(gdp_per_capita)) %>%
# ggplot(aes(dollars_per_day)) +
# geom_histogram(binwidth = 1, color = "black")
gapminder %>%
filter(year == past_year & !is.na(gdp_per_capita)) %>%
ggplot(aes(gdp_per_capita)) +
geom_histogram(binwidth = 1000)
Because of the skeness of the data, the visualization mainly displays the countries with gdp per capita above $4000. So we are unclear about the distribution of the gdp per capita for the rest of the countries, even though they are the majority in the data.
To address this issue, we consider the log transformation which convert the scale into a multiplicative order.
In addition to log base 10 transformation. Other common choices are base \(\mathrm{e}\) (the natural log) and base 2.
In general, the natural log should be avoided just because \(\mathrm{e}^2, \mathrm{e}^3, \dots\) are harder and less intuitive to calculate. In comparison, log base 2 and log base 10 are better choices, because \(2^2, 2^3, 2^4, \dots\) or \(10^2, 10^3, \dots\) are easy to compute in our heads.
In the dollars per day example, we used base 2 instead of base 10 because the resulting range is easier to interpret. The range of the values being plotted is 119.3340388, 1.7843103^{4}.
In base 10, this turns into a range that includes very few integers: just 0 and 1. With base two, our range includes -2, -1, 0, 1, 2, 3, 4, and 5. It is easier to compute \(2^x\) and \(10^x\) when \(x\) is an integer and between -10 and 10, so we prefer to have smaller integers in the scale. Another consequence of a limited range is that choosing the binwidth is more challenging. With log base 2, we know that a binwidth of 1 will translate to a bin with range \(x\) to \(2x\).
For an example in which base 2 makes more sense, consider the daily gdp per capita which equals to gdp per capita divided by 365.
There are two ways to transform the data for visualization. -
Transform the values before plotting them, i.e.,
aes(x=log10(gdp_per_capita))
. I1n this case, the axis ticks
display the transformed values such as 1, 2, 3, … - Use transformed
scales in the axes, i.e.,
+scale_x_continuous(trans="log10")
. In this case, the axis
ticks display the original values such as 10, 100, 1000, … Both
approaches are useful and have different strengths. Usually, the latter
gives a better visualization of the data.
There are other transformations available through the
trans
argument.
sqrt
) transformation is useful when
considering counts.logit
) is useful when
plotting proportions between 0 and 1.reverse
transformation is useful when we want
smaller values to be on the right or on top.If we use the log transformation with base of 10, then we can easily show $100 (extremely poor), $500 (very poor), $1000 (poor), $5000 (middle), $10000 (well off), $50000 (rich), $100000 (very rich). This transformation provides a close-up of the mid to lower income countries. Here is the distribution if we apply a log base 10 transform:
g1=gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(x=gdp_per_capita)) +
geom_histogram(binwidth = .18)+
scale_x_continuous(trans="log10")
g2=gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(gdp_per_capita)) +
geom_histogram(binwidth =0.7) +
scale_x_continuous(trans="log2")
grid.arrange(g1,g2,ncol=2)
We see multiple bumps in the histogram. This is often referred to as multimodal distribution. The locations where the density reaches local maximum are called local modes.
The histogram above suggests that the 1970 country income distribution has two modes: one at about 1000 dollars and another at about 10000 dollars. This bimodality is consistent with a dichotomous world made up of countries with average incomes less than $5000 and countries above that. However, which countries consist of these two groups?
In addition to the histogram, we can visualize the same data using density plot.
A histogram shows that the 1970 income has a multimodal distirbution which implies that the there are multiple clusters of countries with similar incomes and the clusters are separated from each other. However, we still do not know what are these clusters. Do these clusters represent the developed world and devleoping world? We will learn about clustering analysis in the later chapters. For now, we can manually divide the countries according to their geographical locations, such as continent and region.
gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(gdp_per_capita, fill=continent)) +
geom_histogram(binwidth =.18) +
scale_x_continuous(trans="log10") +
scale_fill_brewer(palette = "Paired")
Let’s start by quickly examining the data by region. We reorder the regions by the median value and use a log scale.
gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(gdp_per_capita, reorder(continent,gdp_per_capita), fill=continent)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(height=0.1, shape = 4) +
scale_x_continuous(trans = "log10") +
annotation_logticks(sides = "b") +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "top")
gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
mutate(region = reorder(region, gdp_per_capita, FUN = median)) %>%
ggplot(aes(gdp_per_capita, region,
fill=continent,
label=country)) +
geom_boxplot(outlier.shape = NA) +
geom_point(shape = 4) +
geom_text_repel(color="blue", size=2)+
scale_x_continuous(trans = "log10") +
annotation_logticks(sides = "b") +
scale_fill_brewer(palette = "Paired") +
scale_color_brewer(palette = "Paired") +
theme(legend.position = "top")
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: ggrepel: 4 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
gapminder %>%
filter(year == past_year,
!is.na(gdp),
region == "Western Asia") %>%
select(country, region, gdp_per_capita)
## country region gdp_per_capita
## 1 Georgia Western Asia 1003.6948
## 2 Israel Western Asia 10058.2650
## 3 Oman Western Asia 4328.5134
## 4 Saudi Arabia Western Asia 7551.5971
## 5 Syria Western Asia 576.6997
## 6 Turkey Western Asia 2163.1682
We can already see that there is indeed a “west versus the rest”
dichotomy: we see two clear groups, with the rich group composed of
North America, Northern and Western Europe, New Zealand and Australia.
We define groups based on this observation. We turn this
group
variable into a factor to control the order of the
levels.
gapminder <- gapminder %>%
mutate(group = case_when(
region %in% c("Northern Europe",
"Northern America",
"Australia and New Zealand",
"Western Europe",
"Southern Europe") ~ "Developed",
country %in% c("Japan", "Israel", "Hong Kong, China", "French Polynesia", "") ~ "Developed",
region %in% c("Eastern Asia",
"South-Eastern Asia",
"Southern Asia",
"Western Asia") ~ "Asia",
region %in% c("Caribbean",
"Central America", "South America") ~ "Latin America",
continent == "Africa" ~ "Africa",
TRUE ~ "Others"))
gapminder <- gapminder %>%
mutate(group = factor(group, levels = c("Others","Latin America", "Asia", "Africa", "Developed")))
The exploratory data analysis above has revealed two characteristics
about average income distribution in 1970. Using a histogram, we found a
bimodal distribution with the modes relating to poor and rich countries.
We now want to compare the distribution across these five groups to
confirm the “west versus the rest” dichotomy. The number of points in
each category is large enough that a summary plot may be useful. We
could generate five histograms or five density plots, but it may be more
practical to have all the visual summaries in one plot. We therefore
start by stacking boxplots next to each other. Note that we add the
layer
theme(axis.text.x = element_text(angle = 90, hjust = 1))
to
turn the group labels vertical, since they do not fit if we show them
horizontally, and remove the axis label to make space. Boxplots have the
limitation that by summarizing the data into five numbers, we might miss
important characteristics of the data. One way to avoid this is by
showing the data.
g1=gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(gdp_per_capita, reorder(group,gdp_per_capita), fill=group)) +
geom_boxplot() +
geom_jitter(height = 0.2, shape=4) +
scale_x_continuous(trans = "log10") +
scale_fill_brewer(palette = "Dark2")
g2=gapminder %>%
filter(year == past_year & !is.na(gdp)) %>%
ggplot(aes(gdp_per_capita, fill=group)) +
geom_histogram(binwidth = .18) +
scale_x_continuous(trans="log10") +
scale_fill_brewer(palette = "Dark2")
grid.arrange(g1,g2,ncol=2)
The jitter scatter plot shows the location of each individual point. This does not always reveal important characteristics of the distribution. when the number of data points is so large that there is over-plotting, showing the data can be counterproductive.
The sistogram shows the interval each data point belongs to. This is less information than the scatter plot, but still sometimes can be overwhelming, and masks the true pattern in the data.
On the other hand, the boxplot further summarizes the data by providing a five-number summary, but this sometimes can be a over simplification of the data. For example, the boxplot does not permit us to discover bimodal distributions.
Apparenly, we need something in between, which is the density plot.
We can show stacked smooth densities or histograms. We refer to these as
ridge plots. Because we are used to visualizing densities with
values in the x-axis, we stack them vertically. Also, because more space
is needed in this approach, it is convenient to overlay them. A useful
geom_density_ridges
parameter is scale
, which
lets you determine the amount of overlap, with scale = 1
meaning no overlap and larger values resulting in more overlap.
Similar to the box plot above, we can add data to the ridge plot. The height of the points is jittered and should not be interpreted in any way. Alternatively, we could the rug representation of the data.
library(ggridges)
gapminder %>%
filter(year == past_year & !is.na(gdp_per_capita)) %>%
ggplot(aes(gdp_per_capita, group, color=group)) +
geom_density_ridges(scale=1, jittered_points = TRUE) +
geom_density_ridges(scale=1, jittered_points = TRUE,
position = position_points_jitter(height = 0),
point_shape = '|', point_size = 3,
point_alpha = 1, alpha = 0.7) +
scale_x_continuous(trans = "log10") +
scale_fill_brewer(palette = "Dark2")
gapminder %>%
filter(year == past_year & !is.na(gdp_per_capita)) %>%
ggplot(aes(gdp_per_capita, fill=group)) +
geom_density(alpha = 0.5, position="stack", bw = 0.2) +
scale_x_continuous(trans = "log10", limit = c(30, 150000)) +
scale_fill_brewer(palette = "Dark2")
Now versus then for income inequality
Visualization shows that there are two clusters in income. But does
this dichotomy persist? Let’s use facet_grid
see how the
distributions have changed. To start, we will focus on two groups: the
west and the rest. We make four histograms.
Before we interpret the findings of this plot, we notice that there are more countries represented in the 2010 histograms than in 1970: the total counts are larger. One reason for this is that several countries were founded after 1970. For example, the Soviet Union divided into several countries during the 1990s. Another reason is that data was available for more countries in 2010.
We remake the plots using only countries with data available for both
years. In the data wrangling part of this book, we will learn
tidyverse tools that permit us to write efficient code
for this, but here we can use simple code using the
intersect
function:
past_year <- 1970
present_year <- 2010
years <- c(past_year, present_year)
country_list_1 <- gapminder %>%
filter(year == past_year & !is.na(gdp_per_capita)) %>%
pull(country)
country_list_2 <- gapminder %>%
filter(year == present_year & !is.na(gdp_per_capita)) %>%
pull(country)
country_list <- intersect(country_list_1, country_list_2)
These 108 account for 86% of the world population, so this subset should be representative.
Let’s remake the plot, but only for this subset by simply adding
country %in% country_list
to the filter
function:
gapminder %>%
filter(year %in% years & country %in% country_list) %>%
mutate(west = ifelse(group == "Developed", "Developed", "Developing")) %>%
ggplot(aes(gdp_per_capita)) +
geom_histogram(binwidth = 0.18, color = "black") +
scale_x_continuous(trans = "log10") +
facet_grid(year ~ west)
We now see that the rich countries have become a bit richer, but percentage-wise, the poor countries appear to have improved more. In particular, we see that the proportion of developing countries earning more than $16 a day increased substantially.
To see which specific regions improved the most, we can remake the boxplots we made above, but now adding the year 2010 and then using facet to compare the two years.
gapminder %>%
filter(year %in% years & country %in% country_list) %>%
ggplot(aes(group, gdp_per_capita)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(trans = "log10") +
xlab("") +
facet_grid(. ~ year)
Here, we pause to introduce another powerful ggplot2 feature. Because we want to compare each region before and after, it would be convenient to have the 1970 boxplot next to the 2010 boxplot for each region. In general, comparisons are easier when data are plotted next to each other.
So instead of faceting, we keep the data from each year together and ask to color (or fill) them depending on the year. Note that groups are automatically separated by year and each pair of boxplots drawn next to each other. Because year is a number, we turn it into a factor since ggplot2 automatically assigns a color to each category of a factor. Note that we have to convert the year columns from numeric to factor.
g1 = gapminder %>%
filter(year %in% years & country %in% country_list) %>%
mutate(year = factor(year)) %>%
ggplot(aes(group, gdp_per_capita, fill = year)) +
geom_boxplot() +
scale_y_continuous(trans = "log10") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
g2 = gapminder %>%
filter(year %in% years & country %in% country_list) %>%
mutate(year = factor(year)) %>%
ggplot(aes(continent, gdp_per_capita, fill = year)) +
geom_boxplot() +
scale_y_continuous(trans = "log10") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
g3 = gapminder %>%
filter(year %in% years & country %in% country_list) %>%
mutate(year = factor(year)) %>%
ggplot(aes(region, gdp_per_capita, fill = year)) +
geom_boxplot() +
scale_y_continuous(trans = "log10") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
plots1=list(g1,g2,g3)
lay1 = rbind(c(1, 2), c(3, 3))
grid.arrange(grobs = plots1,
layout_matrix = lay1)
Finally, we point out that if what we are most interested in is comparing before and after values, it might make more sense to plot the percentage increases. We are still not ready to learn to code this, but here is what the plot would look like:
g1 = gapminder %>%
filter(year %in% years & country %in% country_list) %>%
mutate(year = ifelse(year == past_year, "past", "present")) %>%
select(country, group, year, gdp_per_capita) %>%
spread(year, gdp_per_capita) %>%
mutate(percent_increase = (present-past)/past*100) %>%
mutate(group = reorder(group, percent_increase, FUN = median)) %>%
ggplot(aes(group, percent_increase, fill=group)) +
geom_boxplot() +
geom_point(show.legend = FALSE, shape=4) +
scale_y_continuous(trans = "log10") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab(paste("Percent increase:", past_year, "to", present_year)) +
scale_fill_brewer(palette = "Dark2")
g2 = gapminder %>%
filter(year %in% years & country %in% country_list) %>%
mutate(year = ifelse(year == past_year, "past", "present")) %>%
select(country, region, year, gdp_per_capita, continent) %>%
spread(year, gdp_per_capita) %>%
mutate(percent_increase = (present-past)/past*100) %>%
mutate(continent = reorder(continent,
percent_increase, FUN = median)) %>%
ggplot(aes(continent, percent_increase, fill = continent)) +
geom_boxplot() +
geom_jitter(width=0.1,shape=4, show.legend = FALSE) +
scale_y_continuous(trans = "log10") +
scale_fill_brewer(palette = "Paired") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("") +
ylab(paste("Percent increase:", past_year, "to", present_year))
g3 = gapminder %>%
filter(year %in% years & country %in% country_list) %>%
mutate(year = ifelse(year == past_year, "past", "present")) %>%
select(country, region, year, gdp_per_capita, continent) %>%
spread(year, gdp_per_capita) %>%
mutate(percent_increase = (present-past)/past*100) %>%
mutate(region = reorder(region,
percent_increase, FUN = median)) %>%
ggplot(aes(region, percent_increase, fill = continent)) +
geom_boxplot() +
geom_jitter(width=0.1,shape=4, show.legend = FALSE) +
scale_y_continuous(trans = "log10") +
scale_fill_brewer(palette = "Paired") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("") +
ylab(paste("Percent increase:", past_year, "to", present_year))
plots1=list(g1,g2,g3)
lay1 = rbind(c(1, 2), c(3, 3))
grid.arrange(grobs = plots1,
layout_matrix = lay1)
The previous data exploration suggested that the income gap between rich and poor countries has narrowed considerably during the last 40 years. We used a series of histograms and boxplots to see this. We suggest a succinct way to convey this message with just one plot.
Summary
Let’s start by noting that density plots for income distribution in 1970 and 2010 deliver the message that the gap is closing:
gapminder %>%
filter(year %in% years & country %in% country_list) %>%
ggplot(aes(gdp_per_capita)) +
geom_density(fill = "grey") +
scale_x_continuous(trans = "log10", limit = c(30, 150000),
breaks=c(100,1000,10000,100000),
labels=c("100","1000","10000","100000")) +
facet_grid(. ~ year)
In the 1970 plot, we see two clear modes: poor and rich countries. In 2010, it appears that some of the poor countries have shifted towards the right, closing the gap.
The next message we need to convey is that the reason for this change in distribution is that several poor countries became richer, rather than some rich countries becoming poorer. To do this, we can assign a color to the groups we identified during data exploration.
However, we first need to learn how to make these smooth densities in a way that preserves information on the number of countries in each group. To understand why we need this, note the discrepancy in the size of each group:
Developed | Developing |
---|---|
24 | 84 |
But when we overlay two densities, the default is to have the area
represented by each distribution add up to 1, regardless of the size of
each group. This makes it appear as if there are the same number of
countries in each group. To change this, we will need to learn to access
computed variables with geom_density
function.
To have the areas of these densities be proportional to the size of
the groups, we can simply multiply the y-axis values by the size of the
group. From the geom_density
help file, we see that the
functions compute a variable called count
that does exactly
this. We want this variable to be on the y-axis rather than the
density.
In ggplot2, we access these variables by surrounding
the name with two dots. We will therefore use the following mapping. If
we want the densities to be smoother, we use the bw
argument so that the same bandwidth is used in each density. We selected
0.2 after trying out several values.
gapminder %>%
filter(year %in% years & country %in% country_list) %>%
mutate(group = ifelse(group == "Developed", "Developed", "Developing")) %>%
ggplot(aes(x = gdp_per_capita, y = ..count.., fill = group)) +
scale_x_continuous(trans = "log10", limit = c(30,100000)) +
geom_density(alpha = 0.2, bw = 0.2) +
facet_grid(year ~ .)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
This plot now shows what is happening very clearly. The developing world distribution is changing. A third mode appears consisting of the countries that most narrowed the gap.
To visualize if any of the groups defined above are driving this we can quickly make a ridge plot:
gapminder %>%
filter(year %in% years & !is.na(gdp_per_capita)) %>%
ggplot(aes(gdp_per_capita, group)) +
geom_density_ridges(adjust = 1.5) +
scale_x_continuous(trans = "log10", limit = c(30,100000)) +
facet_grid(. ~ year)
Another way to achieve this is by stacking the densities on top of each other:
gapminder %>%
filter(year %in% years & country %in% country_list) %>%
group_by(year) %>%
mutate(weight = population/sum(population)) %>%
ungroup() %>%
ggplot(aes(gdp_per_capita, fill = group)) +
geom_density(alpha = 0.2, bw = 0.2, position = "stack") +
scale_x_continuous(trans = "log10", limit = c(30, 150000)) +
facet_grid(year ~ .)
Here we can clearly see how the distributions for East Asia, Latin America, and others shift markedly to the right. While Sub-Saharan Africa remains stagnant.
Notice that we order the levels of the group so that the West’s density is plotted first, then Sub-Saharan Africa. Having the two extremes plotted first allows us to see the remaining bimodality better.
Weighted densities
As a final point, we note that these distributions weigh every
country the same. So if most of the population is improving, but living
in a very large country, such as China, we might not appreciate this. We
can actually weight the smooth densities using the weight
mapping argument. The plot then looks like this:
gapminder %>%
filter(year %in% years & country %in% country_list) %>%
group_by(year) %>%
mutate(weight = population/sum(population)) %>%
ungroup() %>%
ggplot(aes(gdp_per_capita, fill = group, weight = weight)) +
geom_density(alpha = 0.2, bw = 0.2, position = "stack") +
scale_x_continuous(trans = "log10", limit = c(30, 150000)) +
facet_grid(year ~ .)
This particular figure shows very clearly how the income distribution gap is closing with most of the poor remaining in Sub-Saharan Africa.