Etienne Bacher: Visualize data on Nobel laureates per country

Etienne Bacher

The Nobel laureates of 2020 were announced last week, and I thought it would be interesting to visualize the repartition of laureates per country, as there are several ways to do so. I’m going to use this dataset available on Kaggle, which contains information on the year, category, name of the laureate, country, city and date of birth and death, among other things. Notice that this dataset goes from 1901 to 2016 and therefore doesn’t contain the most recent laureates.

But first of all, we need to load all the packages we will use in this analysis:

library(tidyverse)
library(janitor)
library(ggthemes)
library(gganimate)
library(here)
library(tmap)
library(countrycode)
library(echarts4r)

Import and clean data

Now, we can import the dataset. To remove the capital letters and transform the column names in snake case (i.e names such as “column_name” instead of “Column Name”), we can use the function clean_names() of the package {janitor} ¹:

nobel_laureates_raw <- read_csv(here("_posts/2020-10-18-nobel-laureates/nobel-laureates.csv")) %>%
  janitor::clean_names()

The first thing that we have to correct before doing visualization concerns the country names. Indeed, many countries have changed since 1901. For example, Czechoslovakia no longer exists, as well as Prussia. In this dataset, the columns containing country names display first the official name at the time, and then put the current name of the country between brackets.

# A tibble: 6 x 2
  birth_country     death_country
  <chr>             <chr>        
1 Netherlands       Germany      
2 France            France       
3 Prussia (Poland)  Germany      
4 Switzerland       Switzerland  
5 France            France       
6 Prussia (Germany) Germany

Since we only want the current country names, we must modify these columns so that:

if the name doesn’t have brackets (i.e the country hasn’t changed in time), we let it as-is;
if the name has brackets (i.e the country has changed), we only want to keep the name between brackets.

Since I must do this for two columns (birth_country and death_country), I created a function (and this was the perfect example of losing way too much time by making a function to save time…):

clean_country_names <- function(data, variable) {
  data <- data %>%
    mutate(
      x = gsub(
        "(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.",
        "",
        {{variable}},
        perl = T
      ),
      x = ifelse(x == "", {{variable}}, x)
    ) %>%
    select(- {{variable}}) %>%
    rename({{variable}} := "x")
}

This function takes a dataset (data), and creates a new column (x) that will take the name between brackets if original variable has brackets, or the unique name if the original variable doesn’t have brackets. Then, x is renamed as the variable we specified first. I must admit that regular expressions (such as the one in gsub()) continue to be a big mystery for me, and I thank StackOverflow for providing many examples.

Now, we apply this function to our columns with countries:

nobel_laureates <- clean_country_names(nobel_laureates_raw, birth_country)
nobel_laureates <- clean_country_names(nobel_laureates, death_country)

The country names are now cleaned:

# A tibble: 6 x 2
  birth_country death_country
  <chr>         <chr>        
1 Netherlands   Germany      
2 France        France       
3 Poland        Germany      
4 Switzerland   Switzerland  
5 France        France       
6 Germany       Germany

From now on, there are several ways to visualize the repartition of Nobel laureates per country. We could do a static bar plot, an animated bar plot to see the evolution in time, a static map, or an interactive map.

Plot the data

Static plot

First of all, we need to compute the number of Nobel laureates per country:

nobel_per_country <- nobel_laureates %>%
  select(birth_country, full_name) %>%
  distinct() %>%
  group_by(birth_country) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  drop_na()

Then we can plot this number, only for the first 20 countries (so that the plot can be readable):

nobel_per_country %>%
  select(birth_country, n) %>%
  top_n(20) %>%
  mutate(birth_country = reorder(birth_country, n)) %>%
  ggplot(aes(x = birth_country, y = n)) +
  geom_col() +
  coord_flip() +
  xlab("Country") +
  ylab("") +
  geom_text(aes(label = n), nudge_y = 10) +
  ggthemes::theme_clean()

We can also check the repartition per country and per category:

# The 20 countries with the most nobels
top_20 <- nobel_per_country %>%
  top_n(10) %>%
  select(birth_country) %>%
  unlist(use.names = FALSE)

nobel_laureates %>%
  select(birth_country, full_name, category) %>%
  distinct() %>%
  group_by(birth_country, category) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  drop_na() %>%
  select(- full_name) %>%
  distinct() %>%
  filter(birth_country %in% top_20) %>%
  ggplot(aes(x = birth_country, y = n)) +
  geom_col() +
  coord_flip() +
  xlab("Country") +
  ylab("") +
  geom_text(aes(label = n), nudge_y = 10) +
  ggthemes::theme_clean() +
  facet_wrap(~category)

Animated plots

To observe the evolution of this number in time, one way would be to plot lines with year in x-axis. But we could also keep the first plot we made and animate it with {gganimate}.

First, we compute the cumulated sum of Nobel laureates. Indeed, the number of laureates per year is useless for us, we want to see the evolution of the total number:

nobel_per_country_year <- nobel_laureates %>%
  select(year, birth_country) %>%
  group_by(year, birth_country) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  drop_na() %>%
  arrange(birth_country, year) %>%
  complete(year, birth_country) %>%
  mutate(n = ifelse(is.na(n), 0, n),
         year = as.integer(year)) %>%
  filter(birth_country %in% top_20) %>%
  group_by(birth_country) %>%
  mutate(n_cumul = cumsum(n)) %>%
  arrange(birth_country)

Then, we use almost the same code as for the first plot, but we add arguments at the end that tell how we want the animation to be:

plot_evol <- nobel_per_country_year %>%
  select(birth_country, year, n_cumul) %>%
  filter((year %% 2) != 0) %>%
  ggplot(aes(x = reorder(birth_country, n_cumul), y = n_cumul)) +
  geom_col() +
  coord_flip() +
  xlab("Country") +
  ylab("") +
  geom_text(aes(label = as.character(round(n_cumul, 0))), nudge_y = 10) +
  ggthemes::theme_clean() +
  transition_time(year) +
  ggtitle("Year: {frame_time}") +
  ease_aes('linear', interval = 2)

animate(plot_evol, duration = 15, fps = 20)

This allows us to see that the USA have seen their number of Nobel laureates surge from the 1960’s and 1970’s, which corresponds more or less to the creation of the so-called “Nobel Prize in Economics” in 1969. The plot per category also indicates that this prize plays a major role in the domination of the USA.

Maps

Static maps

To create maps, we rely on the package {tmap}. In addition to its functions, this package also gives access to a dataset that we will use to show the number of laureates per country.

data(World)

We need to merge our dataset of Nobel laureates with this dataset. But the country names differ. Therefore, we have to use ISO codes instead. World already contains ISO codes, so we only have to create those for our dataset. This can be done very easily with the package {countrycode}. However, some countries in our dataset don’t have ISO codes, such as Scotland, Northern Ireland or Czechoslovakia. The two former can be recoded as United Kingdom, but Czechoslovakia was located on current Slovakia, Czech Republic and Ukraine, so we drop it of our dataset.

nobel_per_country <- nobel_per_country %>%
  mutate(
    iso_birth = countrycode(birth_country, origin = "country.name", destination = "iso3c"),
    iso_birth = case_when(
      birth_country == "Scotland" | birth_country == "Northern Ireland" ~ "GBR",
      TRUE ~ iso_birth
    )
  )

We can now merge the two datasets based on their ISO codes…

World <- World %>%
  full_join(nobel_per_country, by = c("iso_a3" = "iso_birth")) %>%
  rename("number" = "n") %>%
  mutate(number = ifelse(is.na(number), 0, number))

… and we can build the map and fill the countries with the number of laureates:

tm_shape(World, projection = 4326) +
  tm_fill("number", breaks = c(0, 5, 10, 50, 200, Inf), palette = "YlOrBr") +
  tm_polygons() +
  tm_legend(title = "Nobel prizes per country", legend.title.size = 10^(-4)) +
  tm_layout(legend.outside = TRUE)

Interactive maps

Finally, we will make interactive maps with {echarts4r}. Firstly, let’s make an identical map as the one above but with a few interactive features.

{echarts4r} uses specific country names, so we use once again {countrycode} to modify the names in our dataset.

nobel_per_country_echarts <- e_country_names(data = nobel_per_country,
                                             input = iso_birth,
                                             type = "iso3c")

Now we can plot the map:

nobel_per_country_echarts %>%
  e_charts(iso_birth) %>%
  e_map(n, roam = TRUE) %>%
  e_visual_map(max = max(nobel_per_country_echarts$n))

Hovering the countries gives us their name, and the number of laureates in the legend. We can also zoom in and out. We could see the evolution of laureates in time with timeline = TRUE:

nobel_per_country_year_map <- nobel_laureates %>%
  select(year, birth_country) %>%
  group_by(year, birth_country) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  drop_na() %>%
  arrange(birth_country, year) %>%
  complete(year, birth_country) %>%
  mutate(n = ifelse(is.na(n), 0, n),
         year = as.integer(year)) %>%
  group_by(birth_country) %>%
  mutate(n_cumul = cumsum(n)) %>%
  arrange(birth_country)

nobel_per_country_year_map <- nobel_per_country_year_map %>%
  mutate(
    iso_birth = countrycode(birth_country, origin = "country.name", destination = "iso3c"),
    iso_birth = case_when(
      birth_country == "Scotland" | birth_country == "Northern Ireland" ~ "GBR",
      TRUE ~ iso_birth
    )
  )

nobel_per_country_year_echarts <- e_country_names(data = nobel_per_country_year_map,
                                                  input = iso_birth,
                                                  type = "iso3c")

nobel_per_country_year_echarts %>%
  group_by(year) %>%
  e_charts(iso_birth, timeline = TRUE) %>%
  e_map(n_cumul, roam = TRUE) %>%
  e_visual_map(max = 257) %>%
  e_timeline_opts(
    playInterval = 250,
    symbol = "none"
  )

And that’s it! I used data about Nobel laureates to present a few plots and maps made with {ggplot2}, {gganimate}, {tmap}, and {echarts4r}. I used these packages but there are countless ways to make plots or maps, whether static or interactive, with R:

plots: base R, {highcharter}, {charter}, {plotly}, etc.
maps: base R, {leaflet}, {sf}, {ggmap}, etc.

I hope you enjoyed it!

Session Info

This is my session info, so that you can see the versions of packages used. This is useful if the results in my post are no longer reproducible because packages changed. The packages with a star (*) are those explicitely called in the script.

─ Session info ─────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.0.3 (2020-10-10)
 os       Ubuntu 18.04.5 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en                          
 collate  fr_FR.UTF-8                 
 ctype    fr_FR.UTF-8                 
 tz       Europe/Paris                
 date     2021-01-20                  

─ Packages ─────────────────────────────────────────────────────────
 package      * version  date       lib
 abind          1.4-5    2016-07-21 [1]
 assertthat     0.2.1    2019-03-21 [1]
 backports      1.2.1    2020-12-09 [1]
 base64enc      0.1-3    2015-07-28 [1]
 broom          0.7.3    2020-12-16 [1]
 cellranger     1.1.0    2016-07-27 [1]
 class          7.3-17   2020-04-26 [4]
 classInt       0.4-3    2020-04-07 [1]
 cli            2.2.0    2020-11-20 [1]
 codetools      0.2-16   2018-12-24 [4]
 colorspace     2.0-0    2020-11-11 [1]
 countrycode  * 1.2.0    2020-05-22 [1]
 crayon         1.3.4    2017-09-16 [1]
 crosstalk      1.1.1    2021-01-12 [1]
 DBI            1.1.1    2021-01-15 [1]
 dbplyr         2.0.0    2020-11-03 [1]
 dichromat      2.0-0    2013-01-24 [1]
 digest         0.6.27   2020-10-24 [1]
 distill        1.2      2021-01-13 [1]
 downlit        0.2.1    2020-11-04 [1]
 dplyr        * 1.0.3    2021-01-15 [1]
 e1071          1.7-4    2020-10-14 [1]
 echarts4r    * 0.3.4    2020-10-29 [1]
 ellipsis       0.3.1    2020-05-15 [1]
 evaluate       0.14     2019-05-28 [1]
 fansi          0.4.2    2021-01-15 [1]
 farver         2.0.3    2020-01-16 [1]
 fastmap        1.0.1    2019-10-08 [1]
 forcats      * 0.5.0    2020-03-01 [1]
 fs             1.5.0    2020-07-31 [1]
 generics       0.1.0    2020-10-31 [1]
 gganimate    * 1.0.7    2020-10-15 [1]
 ggplot2      * 3.3.3    2020-12-30 [1]
 ggthemes     * 4.2.0    2019-05-13 [1]
 gifski         0.8.6    2018-09-28 [1]
 glue           1.4.2    2020-08-27 [1]
 gtable         0.3.0    2019-03-25 [1]
 haven          2.3.1    2020-06-01 [1]
 here         * 1.0.1    2020-12-13 [1]
 hms            1.0.0    2021-01-13 [1]
 htmltools      0.5.1    2021-01-12 [1]
 htmlwidgets    1.5.3    2020-12-10 [1]
 httpuv         1.5.5    2021-01-13 [1]
 httr           1.4.2    2020-07-20 [1]
 janitor      * 2.1.0    2021-01-05 [1]
 jsonlite       1.7.2    2020-12-09 [1]
 KernSmooth     2.23-17  2020-04-26 [4]
 knitr          1.30     2020-09-22 [1]
 labeling       0.4.2    2020-10-20 [1]
 later          1.1.0.1  2020-06-05 [1]
 lattice        0.20-41  2020-04-02 [4]
 leafem         0.1.3    2020-07-26 [1]
 leaflet        2.0.4.1  2021-01-07 [1]
 leafsync       0.1.0    2019-03-05 [1]
 lifecycle      0.2.0    2020-03-06 [1]
 lubridate      1.7.9.2  2020-11-13 [1]
 lwgeom         0.2-5    2020-06-12 [1]
 magrittr       2.0.1    2020-11-17 [1]
 mime           0.9      2020-02-04 [1]
 modelr         0.1.8    2020-05-19 [1]
 munsell        0.5.0    2018-06-12 [1]
 pillar         1.4.7    2020-11-20 [1]
 pkgconfig      2.0.3    2019-09-22 [1]
 png            0.1-7    2013-12-03 [1]
 prettyunits    1.1.1    2020-01-24 [1]
 progress       1.2.2    2019-05-16 [1]
 promises       1.1.1    2020-06-09 [1]
 purrr        * 0.3.4    2020-04-17 [1]
 R6             2.5.0    2020-10-28 [1]
 raster         3.4-5    2020-11-14 [1]
 RColorBrewer   1.1-2    2014-12-07 [1]
 Rcpp           1.0.6    2021-01-15 [1]
 readr        * 1.4.0    2020-10-05 [1]
 readxl         1.3.1    2019-03-13 [1]
 reprex         0.3.0    2019-05-16 [1]
 rlang          0.4.10   2020-12-30 [1]
 rmarkdown      2.6.4    2021-01-19 [1]
 rprojroot      2.0.2    2020-11-15 [1]
 rstudioapi     0.13     2020-11-12 [1]
 rvest          0.3.6    2020-07-25 [1]
 scales         1.1.1    2020-05-11 [1]
 sessioninfo    1.1.1    2018-11-05 [1]
 sf           * 0.9-6    2020-09-13 [1]
 shiny          1.5.0    2020-06-23 [1]
 snakecase      0.11.0   2019-05-25 [1]
 sp             1.4-5    2021-01-10 [1]
 stars          0.4-3    2020-07-08 [1]
 stringi        1.5.3    2020-09-09 [1]
 stringr      * 1.4.0    2019-02-10 [1]
 tibble       * 3.0.5    2021-01-15 [1]
 tidyr        * 1.1.2    2020-08-27 [1]
 tidyselect     1.1.0    2020-05-11 [1]
 tidyverse    * 1.3.0    2019-11-21 [1]
 tmap         * 3.2      2020-09-15 [1]
 tmaptools      3.1      2020-07-01 [1]
 tweenr         1.0.1    2018-12-14 [1]
 units          0.6-7    2020-06-13 [1]
 utf8           1.1.4    2018-05-24 [1]
 vctrs          0.3.6    2020-12-17 [1]
 viridisLite    0.3.0    2018-02-01 [1]
 withr          2.4.0    2021-01-16 [1]
 xfun           0.20     2021-01-06 [1]
 XML            3.99-0.5 2020-07-23 [1]
 xml2           1.3.2    2020-04-23 [1]
 xtable         1.8-4    2019-04-21 [1]
 yaml           2.2.1    2020-02-01 [1]
 source                              
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 Github (JohnCoene/echarts4r@082e62c)
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.1)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.1)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 Github (rstudio/rmarkdown@2e8572e)  
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.1)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.2)                      
 CRAN (R 4.0.3)                      
 CRAN (R 4.0.0)                      
 CRAN (R 4.0.0)                      

[1] /home/etienne/R/x86_64-pc-linux-gnu-library/4.0
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

This function is very useful even when column names are much more messy.↩︎

Visualize data on Nobel laureates per country