Ch5 Visualization Case Study

In this chapter, we use a few case studies to demonstrate how data visualization techniques help us extract insights from data. The following R packages are needed for running the examples in this chapter.

library(tidyverse)
library(ggthemes)
library(grid)
library(gridExtra)
#library(Lahman)
#library(MASS) # conflict with dplyr package for select(), use dplyr::select
library(RColorBrewer)
library(ggridges)
library(ggrepel)
library(kableExtra)

1 Case Study 1: Simpson’s Paradox

We analyze a data set containing the offense and defense statistics for the baseball teams in Major League Baseball in the early years, in particular, from 1871 through 1900. The data set includes baseball teams from many “major” leagues such as American Association, Union Association, Players League, Federal League, and the National Association (of 1871-1875). This data set was created by Sean Lahman, who pioneered the effort to make baseball statistics freely available to the general public. The complete data set includes most recent data is available in the R package Lahman.

We first load the data and print the first a few rows.

Teams = read.csv("data/baseballgame_bf1900.csv")
head(Teams)
##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871 <NA>    BS1      BNA    NA    3 31    NA 20 10     NA    NA     N
## 2   1871 <NA>    CH1      CNA    NA    2 28    NA 19  9     NA    NA     N
## 3   1871 <NA>    CL1      CFC    NA    8 29    NA 10 19     NA    NA     N
## 4   1871 <NA>    FW1      KEK    NA    7 19    NA  7 12     NA    NA     N
## 5   1871 <NA>    NY2      NNA    NA    5 33    NA 16 17     NA    NA     N
## 6   1871 <NA>    PH1      PNA    NA    1 28    NA 21  7     NA    NA     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840        New York Mutuals
## 6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro
## 1         BS1
## 2         CH1
## 3         CL1
## 4         FW1
## 5         NY2
## 6         PH1

Each row in the data set represents various statistics of one team in a particular season. The statistics include the number of runs scored by that team in that year/season (R), the number of home runs (HR), the number of games (G), lost games/losses (L), and wins (W) and many others variables. Suppose we are interested in the relationship between the number of runs (R) and the number of lost games (L). We can easily generate a scatter plot to show the relationship.

Teams %>% 
  ggplot(aes(R, L)) +  
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Usually, more runs lead to higher scores and hence fewer losses. However, when we visualize the data, we actually get a positive correlation, which is puzzling. We further fit a regression line to the data, which confirms our observations.

In order to understand this puzzle, we recall that the data we have is the statistics of each team in the early years. Back in the 1870s and 1880s, the baseball league is in its infancy. The league has experienced the ups and downs when it developed to its current scale. Therefore, each season is quite different in all aspects from another season. Therefore, when we pool the data across all the years together, the data tells a very different story.

For example, the number of runs and the number of losses depend on how many games each team played during that season. Since the league is growing rapidly, the number of games in each season is much different. Some season hold more games and some hold less.

Therefore, we could reexamine the relationship between the number of runs and the number of losses conditional on the number of games that each team plays in that season. This is sometimes called stratification. Essentially we divide the observations into groups according to the value of a third variable, i.e., the number of games in this case, so that the observations in each group all have similar or identical values for that variable. Within the group, we investigate the data set focusing on the two variables of interest, i.e., the number of runs and losses in this case. Then our results and findings are now conditional on the third variable by which we stratify the data.

Since we want to stratify by the number of games and it is a quantitative variable, we need to cut this variable into several bins/intervals. We can first check the range of the number of games as follow.

range(Teams$G)
## [1]   6 158

It seems an interval of length 20 will give us about 8 strata. We use the function cut() to convert the quantitative variable to a categorical variable with values of (0, 20], … , (140, 160]. To visualize the strata, we use the visualization technique called faceting. It essentially plots the relationship of interest in many small multiple of panels which correspond to the strata. Here, we plot the scatter plot between the numbers of runs and losses within each stratum.

Teams %>%
  mutate(G_strata = cut(G, seq(0,180,20))) %>% 
  ggplot(aes(R, L)) +  
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  facet_wrap( ~ G_strata)

We immediately find out that the relationship becomes negative within each stratum. If we overlay the fitted regression within each stratum on the fitted regression to the whole data set, we can see the contrast.

Teams %>% 
  mutate(G_strata = cut(G, seq(0,180,20))) %>%
  ggplot(aes(R, L, color=G_strata)) +  
  geom_point(alpha = 0.5) + 
  geom_smooth(aes(group=G_strata, color=G_strata),
              method = "lm")+
  scale_color_viridis_d()+
  geom_smooth(color = "blue", method = "lm")

Clearly, the relationship between the numbers of runs and losses within strata and over the entire data set is opposite. This is a typical case of so called Simpson’s paradox. It is called a paradox because we see the sign of the correlation flips when comparing the entire population and specific groups (or strata). It usually happens because there is a confounder variable which affects both of the variables of interest. Since the confounder’s variation is more dominant, the overall pattern between the two variables of interest, when marginalized over the confounder, shows an unexpected association.

In this case, the number of game in each season is the confounder. The more game the team plays, the more runs and more losses the team acquires. When the analysis is stratified over the confounder, the true relationship appears.

We can further investigate this issue. We plot the year against the number of games, the number of runs, and the number of losses. We can immediately see that, as the sport becomes more popular over the years, there are more games in each season, with a few exceptions around 1984-1985. Therefore, the increase of the number of games over the years is the main reason for observing this positive relationship between the number of runs and the number of losses.

plot1 <- Teams %>% 
  ggplot(aes(yearID, G)) + geom_jitter(alpha = 0.5)
plot2 <- Teams %>% 
  ggplot(aes(yearID, R)) + geom_jitter(alpha = 0.5)
plot3 <- Teams %>%
  ggplot(aes(yearID, L)) + geom_jitter(alpha = 0.5)
plot4 <- Teams %>%
  ggplot(aes(G, R)) + geom_jitter(alpha = 0.5)
plot5 <- Teams %>%
  ggplot(aes(G, L)) + geom_jitter(alpha = 0.5)
grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol=3)

This example shows that how faceting in data visualization can help us identify/explain/investigate the Simpson’s paradox in the data set. One important tool for dealing with the Simpson’s paradox is conditionaling, or stratification. Conditional/stratification is achieved in data visualization using the facet feature, where small multiple figures are generated for comparison.

2 Case Study 2: London Wind Map

Next, we analyze a wind data set containing the hourly measurements of the wind speed (ws) and direction (wd) in London in 1999. The data is available in the R package openair. We first load the data and print the first a few rows.

wind = read.csv(file="data/LondonWind1999.csv", header = TRUE)
head(wind)
##                  date   ws  wd nox no2 o3 pm10      so2       co pm25
## 1 1999-01-01 00:00:00 5.04 140  88  35  4   21 3.835000 1.025000   18
## 2 1999-01-01 01:00:00 4.08 160 132  41  3   17 5.243333 2.700000   11
## 3 1999-01-01 02:00:00 4.80 160 168  40  4   17 6.513333 2.866667    8
## 4 1999-01-01 03:00:00 4.92 150  85  36  3   15 4.180000 1.625000   10
## 5 1999-01-01 04:00:00 4.68 150  93  37  3   16 4.250000 1.025000   11
## 6 1999-01-01 06:00:00 3.84 170 102  32  4   16 4.005000 0.800000    9

As we can see, each row represents one observation of the wind speed and direction at a particular hour in a particular day in 1999. We mostly focus on the wind speed (ws) and direction (wd) and ignore the other variables for now. For the wind direction, we list all possible directions which are equally spaced with 10 degrees aprt from each other.

sort(unique(wind$wd))
##  [1]   0  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180
## [20] 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360

For this data set, we start with the scatter plot of direction versus speed in the upper left panel in the figure below. We can quickly visualize the data, but it does not show many insights because of the overlapping observations. We solve this issue by jittering the observations and immediately see that the wind direction is mostly focus on the 180-270 degree range. Meanwhile, the wind speed is also faster in these directions.

g1=ggplot(wind)+geom_point(aes(x=wd, y=ws),alpha=0.5, size=0.8)
g2=ggplot(wind)+geom_point(aes(x=factor(wd), y=ws),
                           position = position_jitter(0.3),
                           size = 0.8, alpha=0.2)+
  scale_x_discrete(breaks=seq(0,360,30))
grid.arrange(g1,g2,ncol=2)

Even though the scatter plot is sufficient for faithfully presenting the data, but it is hard to obtain any insights. If we focus on the distribution of the wind speed in each direction, we could use side by side boxplot. In the left panel of the figure below, we can see that the wind speed distribution gradually shift to large values as the wind direction moves towards 220 degree. Once improvement we can make about this visualization is that since the wind direction is in a circular format, we can arrange these boxplots in a polar coordinate system as opposed to the Cartesian coordinate system.

g1=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+
  scale_x_discrete(breaks=seq(0,360,30))
wind$wd[wind$wd==360]=0
g2=ggplot(wind)+geom_boxplot(aes(x=factor(wd), y=ws),outlier.size = 0.5)+coord_polar(start = -pi/32)
grid.arrange(g1,g2,ncol=2)

In the right panel, we can see that boxplots are aligned in a circle. Through the median and first and third quartiles, we can clearly see that the wind speed is much faster in the southwest wind direction. The east wind has relatively fast speed. These insights are harder to extract from the Cartesian coordinate system, which is the advantage of the polar coordinate system.

Using the boxplot in the polar coordinate system is certainly useful, however, there is a shortcoming. For example, we can only visualize the distribution of wind speed in each direction, but cannot account for the frequency of each direction. From the jittered scatter plot, we know that not only does the wind speed is faster in the southwest direction, but there are also more observations of southwest wind than any other directions. Such information cannot be visualized in the figure above.

Alternatively, to visualize the frequency of the wind direction as well as the distribution of the wind speed in each direction, we could use barplot. We first convert the wind speed to different intervals of speed. Then we can use the stacked bar plot to show the frequencies of the wind speed intervals in each direction, shown in the left panel of the figure below. Lastly, we convert the stacked bar plot to the polar coordinate system and show it on the right panel.

wind$binned_speed = cut(wind$ws, c(seq(0,14,2),Inf), include.lowest = TRUE)
wind$direction = 
  cut(wind$wd, 
      c(-1,11.25,33.75,56.25,78.75,101.25,123.75,146.25,168.75,191.25,213.75,236.25,258.75,281.25,303.75,326.25,348.75,361),
      labels = c("N","NNE","NE","ENE","E","ESE","SE","SSE","S","SSW","SW","WSW","W","WNW","NW","NNW","N"))
g3=ggplot(wind, aes(x=direction, fill = binned_speed)) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  scale_fill_viridis_d()+
  theme(legend.position = "top")
g4=ggplot(wind, aes(x=direction, fill = binned_speed)) +
  geom_bar(width=1, position = position_stack(reverse = TRUE)) +
  scale_fill_viridis_d()+
  coord_polar(start = -pi/16)
grid.arrange(g3,g4,ncol=2)

In this visualization, we can see that the west, southwest, south bars take a large portion among all bars, which means the wind mostly comes from these directions in London. In addition, the wind blows fast in these directions because the colors of the bars show more yellow segments This visualization is called wind rose plot. It is powerful because it can visualize the wind speed and direction at the same time. Because of the polar coordinate system, the visualization is very easy to digest and intuitive. There are also many extension of this visualization, for example, the choice of the color scheme.

3 Case Study 3: Basketball Shots

From Charlotte Wickham’s example

Where the Heat and the Thunder Hit Their Shots

shots = read.csv("data/shots_sum.csv", header = TRUE)
head(shots)
##   x  y num_shots num_made prop_made avg_points total_points
## 1 0  4         2        1       0.5        1.5            3
## 2 0  5         3        0       0.0        0.0            0
## 3 0  6         1        0       0.0        0.0            0
## 4 0  7         1        0       0.0        0.0            0
## 5 0  9         2        0       0.0        0.0            0
## 6 0 10         1        0       0.0        0.0            0
shots = filter(shots, num_shots < 1000)
ggplot(shots, aes(x, y))+
  geom_point(aes(color = avg_points, 
                 size = num_shots)) +
  scale_color_distiller("Points", palette = "RdYlGn") +
  scale_size("Attempts", range = c(0.1, 5) ) +
  ylim(0, 35) +
  coord_equal() +
  theme_void() +
  theme(axis.ticks = element_blank(),
        axis.text  = element_blank(),
        axis.line  = element_blank(),
        axis.title = element_blank())
## Warning: Removed 54 rows containing missing values or values outside the scale range
## (`geom_point()`).

ggplot(shots, aes(x, y))+
  geom_point(aes(color = avg_points)) +
  scale_color_distiller("Points", palette = "RdYlGn") +
  scale_size("Attempts", range = c(0.1, 5) ) +
  ylim(0, 35) +
  coord_equal() +
  theme_void() +
  theme(axis.ticks = element_blank(),
        axis.text  = element_blank(),
        axis.line  = element_blank(),
        axis.title = element_blank())
## Warning: Removed 54 rows containing missing values or values outside the scale range
## (`geom_point()`).

4 Case Study 4: Departure Time

From Charlotte Wickham’s example

library(dplyr)
library(hflights)
library(ggplot2)

hflights_df <- as_tibble(hflights)

hflights_df <- mutate(hflights_df, 
                      DepHour = floor(DepTime/100),
                      DayOfWeek = factor(DayOfWeek, 
                                         labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
                      Date = ISOdate(Year, Month, DayofMonth)
)
hou <- filter(hflights_df, Origin == "HOU")

hou_mon <- filter(hou, DayOfWeek == "Mon")

# over all mondays in 2011, avg delay of flights departing by hour
hou_mon_avg <- hou_mon %>%
  group_by(DepHour) %>%
  summarise(avg_delay = mean(DepDelay))

# initial plot
ggplot(hou_mon_avg, aes(DepHour, avg_delay)) + 
  geom_point() +
  geom_line() + 
  ylab("Average delay (mins)") +
  xlab("Departure time") +
  scale_x_continuous(breaks = seq(0, 24, 6),
                     labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
  theme_bw(18)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

#ggsave("08-monday.png", width = 6, height = 4)

# for each monday in 2011, avg delay of flights departing by hour
hou_mon_day <- filter(hou, DayOfWeek == "Mon") %>%
  group_by(Date, DepHour) %>%
  summarise(avg_delay = mean(DepDelay))
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by Date and DepHour.
## ℹ Output is grouped by Date.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(Date, DepHour))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.
# quantiles for delay by time
hou_mon_q <- hou_mon %>% group_by(DepHour) %>%
  summarise(n = n(),
            q25 = quantile(DepDelay, probs = 0.25, na.rm = TRUE),
            q50 = quantile(DepDelay, probs = 0.5, na.rm = TRUE),
            q75 = quantile(DepDelay, probs = 0.75, na.rm = TRUE))

hou_mon %>%
  ggplot(aes(DepHour, DepDelay)) + 
  geom_point() +
  #geom_line(aes(group = Date)) + 
  ylab("Average delay (mins)") +
  xlab("Departure time") +
  scale_x_continuous(breaks = seq(0, 24, 6),
                     labels = c("midnight", "6am", "noon", "6pm", "midnight")) +
  theme_bw(18)
## Warning: Removed 156 rows containing missing values or values outside the scale range
## (`geom_point()`).

5 Case Study 5: Daily Temperature

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(tidyverse)
library(lubridate)
t = read_csv("data/temperature_airport.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 141294 Columns: 61
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): STATION, NAME, FMTM, PGTM
## dbl  (45): LATITUDE, LONGITUDE, ELEVATION, ACMH, ACSH, AWND, GAHT, PRCP, PSU...
## lgl  (11): ACMC, ACSC, DAPR, FRGT, MDPR, WT10, WT17, WT19, WV01, WV07, WV20
## date  (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
t2 = t %>%
  mutate(YEAR = year(DATE),
         MONTH = month(DATE),
         DAY = day(DATE),
         MONTHDAY = paste(ifelse(MONTH>=10, MONTH, paste("0",MONTH,sep="")), 
                          ifelse(  DAY>=10,   DAY, paste("0",  DAY,sep="")), sep="-"),
         NAME = case_when(NAME == "CHICAGO OHARE INTERNATIONAL AIRPORT, IL US" ~ "ORD", 
                          NAME == "CINCINNATI NORTHERN KENTUCKY INTERNATIONAL AIRPORT, KY US" ~ "CVG", 
                          NAME == "JFK INTERNATIONAL AIRPORT, NY US" ~ "JFK", 
                          NAME == "MIAMI INTERNATIONAL AIRPORT, FL US" ~ "MIA", 
                          NAME == "SAN FRANCISCO INTERNATIONAL AIRPORT, CA US" ~ "SFO",
                          NAME == "MINNEAPOLIS ST. PAUL INTERNATIONAL AIRPORT, MN US" ~ "MSP")) %>% 
  group_by(NAME, MONTHDAY) %>%
  summarize(TEMP = mean(TAVG, na.rm = TRUE) ) %>%
  ungroup()
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by NAME and MONTHDAY.
## ℹ Output is grouped by NAME.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(NAME, MONTHDAY))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.
t2
## # A tibble: 1,830 × 3
##    NAME  MONTHDAY  TEMP
##    <chr> <chr>    <dbl>
##  1 CVG   01-01     34.2
##  2 CVG   01-02     36.1
##  3 CVG   01-03     35.4
##  4 CVG   01-04     33.4
##  5 CVG   01-05     30  
##  6 CVG   01-06     27.8
##  7 CVG   01-07     25.6
##  8 CVG   01-08     29.8
##  9 CVG   01-09     32.9
## 10 CVG   01-10     32.6
## # ℹ 1,820 more rows
ggplot(t2) +
  geom_line(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME)) +
  geom_smooth(aes(x = MONTHDAY, y = TEMP, group = NAME, color = NAME), method = "gam") +
  scale_x_discrete(breaks = c("01-01", "04-01", "07-01", "10-01"),
                   labels = c("01-01", "04-01", "07-01", "10-01")) +
  scale_y_continuous(limits = c(0,100)) +
  coord_polar()
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

6 Case Study 6: Gapminder

6.1 Data Description

We will analyze the gapminder data set1 which is about the statistics of health condition and income of different countries measured over the years. This case study is originally introduced in Irizarry (2019). We have made some modifications to the case, but the majority of our analysis follows their case.

We start by importing the data to R from the csv file.

library(tidyverse)
gapminder = read_csv("data/gapminder.csv")
## Rows: 10545 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): country, continent, region
## dbl (6): year, infant_mortality, life_expectancy, fertility, population, gdp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gapminder
## # A tibble: 10,545 × 9
##    country   year infant_mortality life_expectancy fertility population      gdp
##    <chr>    <dbl>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
##  1 Albania   1960            115.             62.9      6.19    1636054 NA      
##  2 Algeria   1960            148.             47.5      7.65   11124892  1.38e10
##  3 Angola    1960            208              36.0      7.32    5270844 NA      
##  4 Antigua…  1960             NA              63.0      4.43      54681 NA      
##  5 Argenti…  1960             59.9            65.4      3.11   20619075  1.08e11
##  6 Armenia   1960             NA              66.9      4.55    1867396 NA      
##  7 Aruba     1960             NA              65.7      4.82      54208 NA      
##  8 Austral…  1960             20.3            70.9      3.45   10292328  9.67e10
##  9 Austria   1960             37.3            68.8      2.7     7065525  5.24e10
## 10 Azerbai…  1960             NA              61.3      5.57    3897889 NA      
## # ℹ 10,535 more rows
## # ℹ 2 more variables: continent <chr>, region <chr>

As we can see there are 9 variables and 10545 observations. Each row represents one particular country in a particular year. These 9 variables include:

  • country: name of the country. There are 185 countries in total.
  • year: the year of observation. It ranges from 1960 to 2016.
  • infant_mortality: the infant mortality of the country in that year.
  • life_expectancy: the life expectancy of the country in that year.
  • fertility: the fertility rate of the country in that year.
  • population: the population rate of the country in that year.
  • gdp: the GDP of the country in that year.
  • continent: the continent the country belongs to. Values include Europe, Africa, Americas, Asia, and Oceania.
  • region: the region the country belongs to the following values

In terms of missing, we can see the number of missing values are mostly in gdp and infant_mortality rate.

colSums(is.na(gapminder))
##          country             year infant_mortality  life_expectancy 
##                0                0             1453                0 
##        fertility       population              gdp        continent 
##              187              185             2972                0 
##           region 
##                0

How many years of data do we have?

gapminder %>%
  select(year) %>%
  unique() %>%
  summarize(max = max(year),
            min = min(year))
## # A tibble: 1 × 2
##     max   min
##   <dbl> <dbl>
## 1  2016  1960

How many countries do we have?

gapminder %>%
  select(country) %>%
  unique() %>%
  dim()
## [1] 185   1

The total number of observations should be

(2016-1960+1)*185
## [1] 10545

This is exactly the sample size of this data set.

Let’s see if the missing are evenly distributed across years.

gapminder %>%
  group_by(year) %>%
  summarize(inf_count = sum(is.na(infant_mortality)),
            fer_count   = sum(is.na(fertility)),
            pop_count    = sum(is.na(population)),
            gdp_count    = sum(is.na(gdp))) %>%
  pivot_longer(cols = c(inf_count, fer_count, pop_count, gdp_count),
               names_to = "count_type",
               values_to = "count") %>%
  ggplot() +
  geom_line(aes(x = year, y = count, color = count_type)) +
  scale_color_discrete(name = "Variable") +
  ylab("Number of missing values")

As we can see, the missing values are heavy at the beginning and end of the period.

Let us see if the missing are evenly distributed across countries. First, we show the number of missing values for each variable for each country. Since we have too many countries, we filter out the countries with at least 25 missing values.

gapminder %>%
  group_by(country) %>%
  summarize(inf_count = sum(is.na(infant_mortality)),
            fer_count = sum(is.na(fertility)),
            pop_count = sum(is.na(population)),
            gdp_count = sum(is.na(gdp))) %>%
  mutate(sum_count = inf_count + fer_count + pop_count + gdp_count) %>%
  filter(sum_count > 25) %>%
  select(-sum_count) %>%
  pivot_longer(cols = c(inf_count, fer_count, pop_count, gdp_count),
               names_to = "count_type",
               values_to = "count") %>%
  mutate(count_type = factor(count_type, 
                      levels = c("gdp_count", "inf_count",  "fer_count", "pop_count"))) %>%
  ggplot() +
  geom_col(aes(y = reorder(country, count, sum), x = count, fill = count_type)) +
  theme(legend.position = "top")

Since GDP is one of the most important variables, let us look at the missing values’ pattern across time for each country. Again, we filter out the countries with at least 5 missing values for GDP.

country_list = gapminder %>%
  group_by(country) %>%
  summarize(gdp_count = sum(is.na(gdp))) %>%
  filter(gdp_count > 5) %>%
  pull(country)
gapminder %>%
  filter(country %in% country_list) %>%
  mutate(gdp_ind = is.na(gdp)) %>%
  ggplot() +
  geom_tile(aes(x = year, y = reorder(country, gdp_ind, sum), fill = gdp_ind))

If we look at the rest of the countries, then the missing values are much fewer.

6.2 Motivation and Research Questions

The data set is created by Hans Rosling2, a professor of international health and a pioneer in data visualization. Using this data set and the visualization, he wants to educate the public about the common misunderstanding of the world and different countries. For this initiative, he has created the Gapminder Foundation3 which uses data to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events . Gapminder identifies systematic misconceptions about important global trends and proportions and uses reliable data to develop easy to understand teaching materials to rid people of their misconceptions . Its mission is to fight devastating ignorance with a fact-based worldview everyone can understand .

Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in peoples’ minds into an over-dramatic worldview and strong negative stress feelings: “The world is getting worse!”, “It’s we vs. them!”, “Other people are strange!”, “The population just keeps growing!” and “Nobody cares!” Hans Rosling conveyed actual data-based trends in a dramatic way of his own, using effective data visualization. This section is based on two talks that exemplify this approach to education: New Insights on Poverty4 and The Best Stats You’ve Ever Seen5 .

On the Gapminder website, there are a list of questions about the wealth health and economics. You can go through these questions and test your knowledge and understanding. In his TED talk6, Hans Rosling tested the audience with questions on the world health such as the child mortality across different countries. For example, for each pair of countries below, which country do you think had the highest child mortality rates in 2015?

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa

When answering these questions without data, the non-European countries are typically picked as having higher child mortality rates: Sri Lanka over Turkey, South Korea over Poland, and Malaysia over Russia. It is also common to assume that countries considered to be part of the developing world: Pakistan, Vietnam, Thailand, and South Africa, have similarly high mortality rates.

To answer these questions, we use data. For example, for the first comparison we see that:

library(tidyverse)
gapminder = read_csv("data/gapminder.csv")
gapminder %>% 
  filter(year == 2015 & country %in% c("Sri Lanka","Turkey")) %>% 
  select(country, infant_mortality)
## # A tibble: 2 × 2
##   country   infant_mortality
##   <chr>                <dbl>
## 1 Sri Lanka              8.4
## 2 Turkey                11.6
gapminder %>% 
  filter(country %in% c("Sri Lanka","Turkey")) %>% 
  select(country, year, infant_mortality) %>%
  ggplot() +
  geom_line(aes(x= year, y = infant_mortality, color = country))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Turkey has the higher infant mortality rate. We can use this code on all comparisons and find the following:

country infant mortality country infant mortality
Sri Lanka 8.4 Turkey 11.6
Poland 4.5 South Korea 2.9
Malaysia 6.0 Russia 8.2
Pakistan 65.8 Vietnam 17.3
Thailand 10.5 South Africa 33.6

We see that the European countries on this list have higher child mortality rates: Poland has a higher rate than South Korea, and Russia has a higher rate than Malaysia. We also see that Pakistan has a much higher rate than Vietnam, and South Africa has a much higher rate than Thailand. It turns out that when Hans Rosling gave this quiz to educated groups of people, the average score was less than 2.5 out of 5, worse than what they would have obtained had they guessed randomly. This implies that more than ignorant, we are misinformed. In this case study, we see how data visualization helps inform us.

Specifically, we use data to attempt to answer the following two questions:

  1. Is it a fair characterization of today’s world to say it is divided into western rich nations and the developing world in Africa, Asia, and Latin America?
  2. Has income inequality across countries worsened during the last 40 years?

6.3 Divided World

Now we begin to analyze the data set. There is a common understanding about the world that is the world is divided into two groups: the developed countries which are associate with long life spans and small families, versus the developing countries (Africa, Asia, and Latin America) characterized by short life spans and large families. Is this true? To verify this, we propose our first question:

Is the world divided in terms of life expectancy and fertility rate (average number of children per woman)?

To answer this question, we use a scatterplot of life expectancy versus fertility rate. We start by looking at data from about 50 years ago, when perhaps this view was first cemented in our minds.

library(RColorBrewer)
gapminder %>%
  filter(year == 1962) %>% # data wrangling
  ggplot() + # data visualization
  geom_point(aes(fertility, life_expectancy, color = continent)) +
  scale_color_brewer(palette = "Paired")

As we can see, most countries fall into two distinct categories: 1. Life expectancy around 70 years and 3 or fewer children per family. 2. Life expectancy lower than 65 years and more than 5 children per family. Therefore, it seems this understanding does have some support from data and is somewhat true 50 years ago. However, is it still true today? In addition, is the situation different than 50 years ago? To answer this question, we need to regenerate a new figure for 2012 and compare with the previous one.

gapminder %>%
  filter(year %in% c(1962, 2015)) %>%
  ggplot(aes(fertility, life_expectancy, color = continent)) +
  geom_point() +
  facet_wrap(~year) +
  scale_color_brewer(palette = "Paired")

This figure shows the totality of the situation. If we want to focus on different continent, we can further break it down the comparison. As we can see, the most improvement takes place in Africa, Americas, and Asia.

gapminder %>%
  filter(year %in% c(1962, 2012)) %>%
  ggplot(aes(fertility, life_expectancy, color = continent)) +
  geom_point() +
  facet_grid(year~continent) +
  scale_color_brewer(palette = "Paired")

This plot clearly shows that the majority of countries have moved from the developing world cluster to the western world one. In conclusion, the belief that the world is divided into two groups are no longer support nowadays. This is particularly true for Asia.

The figure focus on the overall improvement of the entire world or each continent. What is we want to focus on each country? We can plot the trajectory of each country.

gapminder %>%
  filter(year %in% c(1962, 2012)) %>%
  group_by(country) %>%
  summarise(start_fert = fertility[year==1962],
            end_fert = fertility[year==2012],
            start_life = life_expectancy[year==1962],
            end_life = life_expectancy[year==2012],
            continent = continent[1]) %>%
  ggplot() +
  geom_point(aes(x=start_fert, 
                 y=start_life,
                 color=continent)) +
  geom_point(aes(x=end_fert, 
                 y=end_life,
                 color=continent)) +
  geom_segment(aes(x=start_fert,xend=end_fert,
                   y=start_life,yend=end_life,
                   color=continent,
                   group=country)) +
  facet_wrap(~continent) +
  scale_color_brewer(palette = "Paired")

As we can see, Asian and Americas countries collectively improved significantly. Part of African countries improved significantly in both life expectancy and fertility, but the others have only moderate improvements and these improvement are mostly life expectancy. Most of the European countries have mild improvements.

We are looking at the span of 50 years, it make sense to look at the data most closely and visualize how the changes took place through the years. We can add 1970, 1980, 1990, and 2000.

gapminder %>% 
  ggplot() +
  geom_path(aes(fertility, life_expectancy, 
                color=continent, group = country)) +
  facet_wrap(~continent) +
  scale_color_brewer(palette = "Paired")
## Warning: Removed 187 rows containing missing values or values outside the scale range
## (`geom_path()`).

This plot clearly shows how most Asian countries have improved at a much faster rate than European ones. Meanwhile, an interesting pattern is that most countries improve the life expectancy first and then reduce the fertility rate. This makes sense because the increase of life expectancy implies the improvement of health care quality. Once the health care condition is improved, the family does not need give birth to many children.

We can repeat the same analysis using another pair of variables.

library(GGally)
gapminder %>%
  filter(year %in% c(1960)) %>%
  mutate(log_pop = log10(population),
         log_gdp_per_capita = log10(gdp/population)) %>%
  select(infant_mortality,life_expectancy,fertility,log_pop, log_gdp_per_capita) %>%
  ggpairs(upper = list(continuous = ggally_density))

gapminder %>%
  filter(year %in% c(2010)) %>%
  mutate(log_pop = log10(population),
         log_gdp_per_capita = log10(gdp/population)) %>%
  select(infant_mortality,life_expectancy,fertility,log_pop, log_gdp_per_capita) %>%
  ggpairs(upper = list(continuous = ggally_density))

gapminder %>%
  filter(year %in% c(1960, 2010)) %>%
  mutate(gdp_per_capita = gdp/population) %>%
  ggplot(aes(gdp_per_capita, life_expectancy, color = continent)) +
  scale_x_log10()+
  geom_point() +
  facet_wrap(~year) +
  scale_color_brewer(palette = "Paired")

6.4 Income Inequality

Another commonly held belief is that the wealth inequality has became more serious over the years with the rich getting richer and poor getting poorer. We now shift our attention to the second question related to the commonly held notion that wealth distribution across the world has become worse during the last decades.

Did the income inequality in the world become worse or better?

When general audiences are asked if poor countries have become poorer and rich countries become richer, the majority answers yes. By using stratification, histograms, smooth densities, and boxplots, we will be able to understand if this is in fact the case. First we learn how transformations can sometimes help provide more informative summaries and plots.

We use the GDP per capita as a measure of individual wealth and create this variable in the data set. Note that income is adjusted for inflation and represent current US dollars, so it is comparable across the years. In addition, this variable represents the average income of that country. There are certainly variation in income within that country, which is not what we are studing right now because we do not have assess to this data.

gapminder <- gapminder %>%  
  mutate(gdp_per_capita = gdp/population)

Here is a histogram of GDP per capita from 1970.

past_year <- 1960
gapminder %>% 
  filter(year == past_year & !is.na(gdp_per_capita)) %>%
  ggplot(aes(gdp_per_capita)) + 
  geom_histogram(binwidth = 1000)

Because of the skewness of the data, the visualization mainly displays the countries with gdp per capita above $4000. So we are unclear about the distribution of the gdp per capita for the rest of the countries, even though they are the majority in the data.

To address this issue, we consider the log transformation which convert the scale into a multiplicative order.

There are two ways to transform the data for visualization. - Transform the values before plotting them, i.e., aes(x=log10(gdp_per_capita)). I1n this case, the axis ticks display the transformed values such as 1, 2, 3, … - Use transformed scales in the axes, i.e., +scale_x_continuous(trans="log10"). In this case, the axis ticks display the original values such as 10, 100, 1000, … Both approaches are useful and have different strengths. Usually, the latter gives a better visualization of the data.

There are other transformations available through the trans argument.

  • The square root (sqrt) transformation is useful when considering counts.
  • The logistic transformation (logit) is useful when plotting proportions between 0 and 1.
  • The reverse transformation is useful when we want smaller values to be on the right or on top.

If we use the log transformation with base of 10, then we can easily show $100 (extremely poor), $500 (very poor), $1000 (poor), $5000 (middle), $10000 (well off), $50000 (rich), $100000 (very rich). This transformation provides a close-up of the mid to lower income countries. Here is the distribution if we apply a log base 10 transform:

past_year = 1960
gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(x=gdp_per_capita)) + 
  geom_density(bw = 0.1)+
  geom_rug() +
  scale_x_continuous(trans="log10")

gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(x=gdp_per_capita)) + 
  geom_histogram(bins = 15)+
  geom_rug() +
  scale_x_continuous(trans="log10")

We see multiple bumps in the histogram. This is often referred to as multimodal distribution. The locations where the density reaches local maximum are called local modes.

The histogram above suggests that the country income distribution has two modes: one at about 1000 dollars and another at about 10000 dollars. This bimodality is consistent with a dichotomous world made up of countries with average incomes less than $5000 and countries above that. However, which countries consist of these two groups?

6.4.1 Comparison among Regions

A histogram shows that the 1960 income has a multimodal distirbution which implies that the there are multiple clusters of countries with similar incomes and the clusters are separated from each other. However, we still do not know what are these clusters. Do these clusters represent the developed world and developing world? We will learn about clustering analysis in the later chapters. For now, we can manually divide the countries according to their geographical locations, such as continent and region.

Let’s start by quickly examining the data by region. We reorder the regions by the median value and use a log scale.

gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(gdp_per_capita, 
             reorder(continent,gdp_per_capita,FUN = median ), 
             fill=continent)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_jitter(height=0.1, shape = 4) +
  scale_x_continuous(trans = "log10") +
  annotation_logticks(sides = "b") +
  scale_fill_brewer(palette = "Paired") +
  theme(legend.position = "top")

library(ggrepel)
gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  mutate(region = reorder(region, gdp_per_capita, FUN = median)) %>%
  ggplot(aes(gdp_per_capita, region, 
             fill=continent, 
             label=country)) +
  geom_boxplot(outlier.shape = NA) + 
  geom_point(shape = 4) +
  geom_text_repel(color="blue", size=2)+
  scale_x_continuous(trans = "log10") +
  annotation_logticks(sides = "b") +
  scale_fill_brewer(palette = "Paired") +
  scale_color_brewer(palette = "Paired") +
  theme(legend.position = "top")
## Warning: The following aesthetics were dropped during statistical transformation: label.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

We can already see that there is indeed a “west versus the rest” dichotomy: we see two clear groups, with the rich group composed of North America, Northern and Western Europe, New Zealand and Australia. We define groups based on this observation. We turn this group variable into a factor to control the order of the levels.

gapminder <- gapminder %>% 
  mutate(group = case_when(
    region %in% c("Northern Europe",
                  "Northern America",
                  "Australia and New Zealand",
                  "Western Europe",
                  "Southern Europe")  ~ "Developed",
    country %in% c("Japan", "Israel", "Hong Kong, China", "French Polynesia", "") ~ "Developed",
    region %in% c("Eastern Asia", 
                  "South-Eastern Asia",
                  "Southern Asia",
                  "Western Asia") ~ "Asia",
    region %in% c("Caribbean", 
                  "Central America", "South America") ~ "Latin America",
    continent == "Africa" ~ "Africa",
    TRUE ~ "Others"))
gapminder <- gapminder %>% 
  mutate(group = factor(group, levels = c("Others","Latin America", "Asia", "Africa", "Developed")))
gapminder %>%
  filter(group == "Others") %>%
  select(country) %>%
  unique()
## # A tibble: 24 × 1
##    country              
##    <chr>                
##  1 Belarus              
##  2 Bulgaria             
##  3 Czech Republic       
##  4 Fiji                 
##  5 Hungary              
##  6 Kazakhstan           
##  7 Kiribati             
##  8 Kyrgyz Republic      
##  9 Micronesia, Fed. Sts.
## 10 Moldova              
## # ℹ 14 more rows

The exploratory data analysis above has revealed two characteristics about average income distribution in 1970. Using a histogram, we found a bimodal distribution with the modes relating to poor and rich countries. We now want to compare the distribution across these five groups to confirm the “west versus the rest” dichotomy. The number of points in each category is large enough that a summary plot may be useful. We could generate five histograms or five density plots, but it may be more practical to have all the visual summaries in one plot. We therefore start by stacking boxplots next to each other. Note that we add the layer theme(axis.text.x = element_text(angle = 90, hjust = 1)) to turn the group labels vertical, since they do not fit if we show them horizontally, and remove the axis label to make space. Boxplots have the limitation that by summarizing the data into five numbers, we might miss important characteristics of the data. One way to avoid this is by showing the data.

library(gridExtra)
gapminder %>% 
  filter(year == past_year & !is.na(gdp)) %>%
  ggplot(aes(gdp_per_capita, reorder(group,gdp_per_capita,median), fill=group)) +
  geom_boxplot() +
  geom_jitter(height = 0.2, shape=4) +
  scale_x_continuous(trans = "log10") +
  scale_fill_brewer(palette = "Dark2")

The jitter scatter plot shows the location of each individual point. This does not always reveal important characteristics of the distribution. when the number of data points is so large that there is over-plotting, showing the data can be counterproductive.

The sistogram shows the interval each data point belongs to. This is less information than the scatter plot, but still sometimes can be overwhelming, and masks the true pattern in the data.

On the other hand, the boxplot further summarizes the data by providing a five-number summary, but this sometimes can be a over simplification of the data. For example, the boxplot does not permit us to discover bimodal distributions.

Apparenly, we need something in between, which is the density plot. We can show stacked smooth densities or histograms. We refer to these as ridge plots. Because we are used to visualizing densities with values in the x-axis, we stack them vertically. Also, because more space is needed in this approach, it is convenient to overlay them. A useful geom_density_ridges parameter is scale, which lets you determine the amount of overlap, with scale = 1 meaning no overlap and larger values resulting in more overlap.

Similar to the box plot above, we can add data to the ridge plot. The height of the points is jittered and should not be interpreted in any way. Alternatively, we could the rug representation of the data.

library(ggridges)
gapminder %>% 
  filter(year == past_year & !is.na(gdp_per_capita)) %>%
  ggplot(aes(gdp_per_capita, 
             reorder(group,gdp_per_capita,median), 
             fill=group)) + 
  geom_density_ridges(scale=1, jittered_points = TRUE, 
                      position = position_points_jitter(height = 0),
                      point_shape = '|', point_size = 3, 
                      point_alpha = 1, alpha = 0.7) +
  scale_x_continuous(trans = "log10") + 
  scale_fill_brewer(palette = "Dark2")

gapminder %>% 
  filter(year == past_year & !is.na(gdp_per_capita)) %>%
  ggplot(aes(gdp_per_capita, fill=group)) + 
  geom_density(alpha = 0.5, position="stack", bw = 0.2) +
  scale_x_continuous(trans = "log10", limit = c(30, 150000)) +
  scale_fill_brewer(palette = "Dark2")

6.4.2 Comparison between Developed and Developing Countries

Now versus then for income inequality

Visualization shows that there are two clusters in income. But does this dichotomy persist? Let’s use facet_grid see how the distributions have changed. To start, we will focus on two groups: the west and the rest. We make four histograms.

Before we interpret the findings of this plot, we notice that there are more countries represented in the 2010 histograms than in 1970: the total counts are larger. One reason for this is that several countries were founded after 1970. For example, the Soviet Union divided into several countries during the 1990s. Another reason is that data was available for more countries in 2010.

We remake the plots using only countries with data available for both years.

past_year <- 1960
present_year <- 2010
years <- c(past_year, present_year)
country_list_1 <- gapminder %>% 
  filter(year == past_year & !is.na(gdp_per_capita)) %>% 
  pull(country)
country_list_2 <- gapminder %>% 
  filter(year == present_year & !is.na(gdp_per_capita)) %>% 
  pull(country)
country_list <- intersect(country_list_1, country_list_2)

These 94 account for 84% of the world population, so this subset should be representative.

Let’s remake the plot, but only for this subset by simply adding country %in% country_list to the filter function:

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(west = ifelse(group == "Developed", "Developed", "Developing")) %>%
  ggplot(aes(gdp_per_capita)) +
  geom_histogram(binwidth = 0.18, color = "black") +
  scale_x_continuous(trans = "log10") + 
  facet_grid(year ~ west)

We now see that the rich countries have become a bit richer, but percentage-wise, the poor countries appear to have improved more.

To see which specific regions improved the most, we can remake the boxplots we made above, but now adding the year 2010 and then using facet to compare the two years.

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  ggplot(aes(reorder(group, gdp_per_capita, median), gdp_per_capita)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log10") +
  xlab("") +
  facet_grid( ~ year)

Here, we pause to introduce another powerful ggplot2 feature. Because we want to compare each region before and after, it would be convenient to have the 1960 boxplot next to the 2010 boxplot for each region. In general, comparisons are easier when data are plotted next to each other.

So instead of faceting, we keep the data from each year together and ask to color (or fill) them depending on the year. Note that groups are automatically separated by year and each pair of boxplots drawn next to each other. Because year is a number, we turn it into a factor since ggplot2 automatically assigns a color to each category of a factor. Note that we have to convert the year columns from numeric to factor.

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = factor(year)) %>%
  ggplot(aes(reorder(group, gdp_per_capita, median), 
             gdp_per_capita, 
             shape = year, 
             fill = year)) +
  geom_boxplot() +
  geom_point(
    position = position_jitterdodge(),#jitter.width = 0.15,dodge.width = 0.7
    alpha = 0.6,size = 4) +
  scale_y_continuous(trans = "log10") +
  xlab("") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = factor(year)) %>%
  ggplot(aes(reorder(region, gdp_per_capita, median), gdp_per_capita, fill = year)) +
  geom_boxplot() +
  scale_y_continuous(trans = "log10") +
  xlab("") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Finally, we point out that if what we are most interested in is comparing before and after values, it might make more sense to plot the percentage increases.

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = ifelse(year == past_year, "past", "present")) %>%
  select(country, group, year, gdp_per_capita) %>%
  pivot_wider(names_from = year, values_from= gdp_per_capita) %>%
  mutate(percent_increase = (present-past)/past*100) %>%
  mutate(group = reorder(group, percent_increase, FUN = median)) %>%
  ggplot(aes(group, percent_increase, fill=group)) +
    geom_boxplot() + 
    geom_point(show.legend = FALSE, shape=4) +
    scale_y_continuous(trans = "log10") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ylab(paste("Percent increase:", past_year, "to", present_year)) +
    scale_fill_brewer(palette = "Dark2")

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(year = ifelse(year == past_year, "past", "present")) %>%
  select(country, region, year, gdp_per_capita, continent) %>%
  pivot_wider(names_from = year, values_from= gdp_per_capita) %>%
  mutate(percent_increase = (present-past)/past*100) %>%
  #mutate(region = reorder(region,percent_increase, FUN = median)) %>%
  ggplot(aes(reorder(region,percent_increase, FUN = median), percent_increase, fill = continent)) +
    geom_boxplot() + 
    geom_jitter(width=0.1,shape=4, show.legend = FALSE) +
    scale_y_continuous(trans = "log10") +
    scale_fill_brewer(palette = "Paired") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("") + 
    ylab(paste("Percent increase:", past_year, "to", present_year)) 

The previous data exploration suggested that the income gap between rich and poor countries has narrowed considerably during the last 40 years. We used a series of histograms and boxplots to see this. We suggest a succinct way to convey this message with just one plot.

Summary

Let’s start by noting that density plots for income distribution in 1960 and 2010 deliver the message that the gap is closing:

gapminder %>% 
  filter(country %in% country_list) %>%
  ggplot(aes(gdp_per_capita)) +
    geom_density(fill = "grey") + 
    scale_x_continuous(trans = "log10", limit = c(30, 150000), 
                       breaks=c(100,1000,10000,100000),
                       labels=c("100","1000","10000","100000")) + 
    facet_wrap(~ year)
## Warning: Removed 477 rows containing non-finite outside the scale range
## (`stat_density()`).

In the 1960 plot, we see two clear modes: poor and rich countries. In 2010, it appears that some of the poor countries have shifted towards the right, closing the gap.

The next message we need to convey is that the reason for this change in distribution is that several poor countries became richer, rather than some rich countries becoming poorer. To do this, we can assign a color to the groups we identified during data exploration.

However, we first need to learn how to make these smooth densities in a way that preserves information on the number of countries in each group. To understand why we need this, note the discrepancy in the size of each group:

Developed Developing
20 74

But when we overlay two densities, the default is to have the area represented by each distribution add up to 1, regardless of the size of each group. This makes it appear as if there are the same number of countries in each group. To change this, we will need to learn to access computed variables with geom_density function.

To have the areas of these densities be proportional to the size of the groups, we can simply multiply the y-axis values by the size of the group. From the geom_density help file, we see that the functions compute a variable called count that does exactly this. We want this variable to be on the y-axis rather than the density.

In ggplot2, we access these variables by surrounding the name with two dots. We will therefore use the following mapping. If we want the densities to be smoother, we use the bw argument so that the same bandwidth is used in each density. We selected 0.2 after trying out several values.

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  ggplot(aes(x = gdp_per_capita)) +
    scale_x_continuous(trans = "log10", limit = c(30,100000)) + 
    geom_density(alpha = 0.2, bw = 0.2) + 
    facet_grid(year ~ .)

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(group = ifelse(group == "Developed", "Developed", "Developing")) %>%
  ggplot(aes(x = gdp_per_capita, y = ..count../sum(..count..), fill = group)) +
    scale_x_continuous(trans = "log10", limit = c(30,100000)) + 
    geom_density(alpha = 0.2, bw = 0.2, position = "stack") + 
    facet_grid(year ~ .)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  mutate(group = ifelse(group == "Developed", "Developed", "Developing")) %>%
  ggplot(aes(x = gdp_per_capita, y = ..count../sum(..count..), fill = group)) +
    scale_x_continuous(trans = "log10", limit = c(30,100000)) + 
    geom_density(alpha = 0.2, bw = 0.2) + 
    facet_grid(year ~ .)

This plot now shows what is happening very clearly. The developing world distribution is changing. A third mode appears consisting of the countries that most narrowed the gap.

To visualize if any of the groups defined above are driving this we can quickly make a ridge plot:

gapminder %>% 
  filter(year %in% years & !is.na(gdp_per_capita)) %>%
  ggplot(aes(gdp_per_capita, group, fill = group)) + 
    geom_density_ridges(adjust = 1.5) +
    scale_x_continuous(trans = "log10", limit = c(30,100000)) + 
    facet_grid(year ~ .)

Another way to achieve this is by stacking the densities on top of each other:

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  ggplot(aes(gdp_per_capita, fill = group)) +
    geom_density(aes(y = ..count../sum(..count..)), alpha = 0.2, bw = 0.2, position = "stack") + 
    scale_x_continuous(trans = "log10", limit = c(30, 150000)) + 
    facet_grid(year ~ .) 

Here we can clearly see how the distributions for East Asia, Latin America, and others shift markedly to the right. While Sub-Saharan Africa remains stagnant.

Notice that we order the levels of the group so that the West’s density is plotted first, then Sub-Saharan Africa. Having the two extremes plotted first allows us to see the remaining bimodality better.

Weighted densities

As a final point, we note that these distributions weigh every country the same. So if most of the population is improving, but living in a very large country, such as China, we might not appreciate this. We can actually weight the smooth densities using the weight mapping argument. The plot then looks like this:

gapminder %>% 
  filter(year %in% years & country %in% country_list) %>%
  group_by(year) %>%
  mutate(weight = population/sum(population)) %>%
  ungroup() %>%
  ggplot(aes(gdp_per_capita, fill = group, weight = weight)) +
  geom_density(aes(y = ..count../sum(..count..)), alpha = 0.2, bw = 0.2, position = "stack") + 
  scale_x_continuous(trans = "log10", limit = c(30, 150000)) + 
  facet_grid(year ~ .) +
  geom_vline(xintercept = gapminder %>%
               filter(country == "China",
                      year %in% c(1960,2010)) %>%
               select(gdp_per_capita) %>%
               unlist(),
             color = "blue") +
  geom_vline(xintercept = gapminder %>%
               filter(country == "India",
                      year %in% c(1960,2010)) %>%
               select(gdp_per_capita) %>%
               unlist(),
             color = "red")

This particular figure shows very clearly how the income distribution gap is closing with most of the poor remaining in Sub-Saharan Africa.

References

Irizarry, Rafael A. 2019. Introduction to Data Science: Data Analysis and Prediction Algorithms with r. 1st ed. Chapman; Hall/CRC. https://rafalab.github.io/dsbook/.