Visualize data on Nobel laureates per country

A post where I make animated graphs and maps to visualize the repartition of Nobel laureates per country.

Etienne Bacher
2020-10-18

The Nobel laureates of 2020 were announced last week, and I thought it would be interesting to visualize the repartition of laureates per country, as there are several ways to do so. I’m going to use this dataset available on Kaggle, which contains information on the year, category, name of the laureate, country, city and date of birth and death, among other things. Notice that this dataset goes from 1901 to 2016 and therefore doesn’t contain the most recent laureates.

But first of all, we need to load all the packages we will use in this analysis:

Import and clean data

Now, we can import the dataset. To remove the capital letters and transform the column names in snake case (i.e names such as “column_name” instead of “Column Name”), we can use the function clean_names() of the package {janitor} 1:

nobel_laureates_raw <- read_csv(here("_posts/2020-10-18-nobel-laureates/nobel-laureates.csv")) %>%
  janitor::clean_names()

The first thing that we have to correct before doing visualization concerns the country names. Indeed, many countries have changed since 1901. For example, Czechoslovakia no longer exists, as well as Prussia. In this dataset, the columns containing country names display first the official name at the time, and then put the current name of the country between brackets.

# A tibble: 6 x 2
  birth_country     death_country
  <chr>             <chr>        
1 Netherlands       Germany      
2 France            France       
3 Prussia (Poland)  Germany      
4 Switzerland       Switzerland  
5 France            France       
6 Prussia (Germany) Germany      

Since we only want the current country names, we must modify these columns so that:

Since I must do this for two columns (birth_country and death_country), I created a function (and this was the perfect example of losing way too much time by making a function to save time…):

clean_country_names <- function(data, variable) {
  data <- data %>%
    mutate(
      x = gsub(
        "(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.",
        "",
        {{variable}},
        perl = T
      ),
      x = ifelse(x == "", {{variable}}, x)
    ) %>%
    select(- {{variable}}) %>%
    rename({{variable}} := "x")
}

This function takes a dataset (data), and creates a new column (x) that will take the name between brackets if original variable has brackets, or the unique name if the original variable doesn’t have brackets. Then, x is renamed as the variable we specified first. I must admit that regular expressions (such as the one in gsub()) continue to be a big mystery for me, and I thank StackOverflow for providing many examples.

Now, we apply this function to our columns with countries:

nobel_laureates <- clean_country_names(nobel_laureates_raw, birth_country)
nobel_laureates <- clean_country_names(nobel_laureates, death_country)

The country names are now cleaned:

# A tibble: 6 x 2
  birth_country death_country
  <chr>         <chr>        
1 Netherlands   Germany      
2 France        France       
3 Poland        Germany      
4 Switzerland   Switzerland  
5 France        France       
6 Germany       Germany      

From now on, there are several ways to visualize the repartition of Nobel laureates per country. We could do a static bar plot, an animated bar plot to see the evolution in time, a static map, or an interactive map.

Plot the data

Static plot

First of all, we need to compute the number of Nobel laureates per country:

nobel_per_country <- nobel_laureates %>%
  select(birth_country, full_name) %>%
  distinct() %>%
  group_by(birth_country) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  drop_na()

Then we can plot this number, only for the first 20 countries (so that the plot can be readable):

nobel_per_country %>%
  select(birth_country, n) %>%
  top_n(20) %>%
  mutate(birth_country = reorder(birth_country, n)) %>%
  ggplot(aes(x = birth_country, y = n)) +
  geom_col() +
  coord_flip() +
  xlab("Country") +
  ylab("") +
  geom_text(aes(label = n), nudge_y = 10) +
  ggthemes::theme_clean()

We can also check the repartition per country and per category:

# The 20 countries with the most nobels
top_20 <- nobel_per_country %>%
  top_n(10) %>%
  select(birth_country) %>%
  unlist(use.names = FALSE)

nobel_laureates %>%
  select(birth_country, full_name, category) %>%
  distinct() %>%
  group_by(birth_country, category) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  drop_na() %>%
  select(- full_name) %>%
  distinct() %>%
  filter(birth_country %in% top_20) %>%
  ggplot(aes(x = birth_country, y = n)) +
  geom_col() +
  coord_flip() +
  xlab("Country") +
  ylab("") +
  geom_text(aes(label = n), nudge_y = 10) +
  ggthemes::theme_clean() +
  facet_wrap(~category)

Animated plots

To observe the evolution of this number in time, one way would be to plot lines with year in x-axis. But we could also keep the first plot we made and animate it with {gganimate}.

First, we compute the cumulated sum of Nobel laureates. Indeed, the number of laureates per year is useless for us, we want to see the evolution of the total number:

nobel_per_country_year <- nobel_laureates %>%
  select(year, birth_country) %>%
  group_by(year, birth_country) %>%
  count(sort = TRUE) %>%
  ungroup() %>%
  drop_na() %>%
  arrange(birth_country, year) %>%
  complete(year, birth_country) %>%
  mutate(n = ifelse(is.na(n), 0, n),
         year = as.integer(year)) %>%
  filter(birth_country %in% top_20) %>%
  group_by(birth_country) %>%
  mutate(n_cumul = cumsum(n)) %>%
  arrange(birth_country)

Then, we use almost the same code as for the first plot, but we add arguments at the end that tell how we want the animation to be:

plot_evol <- nobel_per_country_year %>%
  select(birth_country, year, n_cumul) %>%
  filter((year %% 2) != 0) %>%
  ggplot(aes(x = reorder(birth_country, n_cumul), y = n_cumul)) +
  geom_col() +
  coord_flip() +
  xlab("Country") +
  ylab("") +
  geom_text(aes(label = as.character(round(n_cumul, 0))), nudge_y = 10) +
  ggthemes::theme_clean() +
  transition_time(year) +
  ggtitle("Year: {frame_time}") +
  ease_aes('linear', interval = 2)

animate(plot_evol, duration = 15, fps = 20)