Visualize data on Nobel laureates per country
The Nobel laureates of 2020 were announced last week, and I thought it would be interesting to visualize the repartition of laureates per country, as there are several ways to do so. I’m going to use this dataset available on Kaggle, which contains information on the year, category, name of the laureate, country, city and date of birth and death, among other things. Notice that this dataset goes from 1901 to 2016 and therefore doesn’t contain the most recent laureates.
But first of all, we need to load all the packages we will use in this analysis:
library(tidyverse)
library(janitor)
library(ggthemes)
library(gganimate)
library(here)
library(tmap)
library(countrycode)
library(echarts4r)
Import and clean data
Now, we can import the dataset. To remove the capital letters and transform the column names in snake case (i.e names such as “column_name” instead of “Column Name”), we can use the function clean_names() of the package {janitor} 11 This function is very useful even when column names are much more messy.:
nobel_laureates_raw <- read_csv("nobel-laureates.csv") %>%
janitor::clean_names()
The first thing that we have to correct before doing visualization concerns the country names. Indeed, many countries have changed since 1901. For example, Czechoslovakia no longer exists, as well as Prussia. In this dataset, the columns containing country names display first the official name at the time, and then put the current name of the country between brackets.
# A tibble: 6 × 2
birth_country death_country
<chr> <chr>
1 Netherlands Germany
2 France France
3 Prussia (Poland) Germany
4 Switzerland Switzerland
5 France France
6 Prussia (Germany) Germany
Since we only want the current country names, we must modify these columns so that:
-
if the name doesn’t have brackets (i.e the country hasn’t changed in time), we let it as-is;
-
if the name has brackets (i.e the country has changed), we only want to keep the name between brackets.
Since I must do this for two columns (birth_country and death_country), I created a function (and this was the perfect example of losing way too much time by making a function to save time…):
clean_country_names <- function(data, variable) {
data <- data %>%
mutate(
x = gsub(
"(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.",
"",
{{variable}},
perl = T
),
x = ifelse(x == "", {{variable}}, x)
) %>%
select(- {{variable}}) %>%
rename({{variable}} := "x")
}
This function takes a dataset (data), and creates a new column (x) that will take the name between brackets if original variable has brackets, or the unique name if the original variable doesn’t have brackets. Then, x is renamed as the variable we specified first. I must admit that regular expressions (such as the one in gsub()) continue to be a big mystery for me, and I thank StackOverflow for providing many examples.
Now, we apply this function to our columns with countries:
nobel_laureates <- clean_country_names(nobel_laureates_raw, birth_country)
nobel_laureates <- clean_country_names(nobel_laureates, death_country)
The country names are now cleaned:
# A tibble: 6 × 2
birth_country death_country
<chr> <chr>
1 Netherlands Germany
2 France France
3 Poland Germany
4 Switzerland Switzerland
5 France France
6 Germany Germany
From now on, there are several ways to visualize the repartition of Nobel laureates per country. We could do a static bar plot, an animated bar plot to see the evolution in time, a static map, or an interactive map.
Plot the data
Static plot
First of all, we need to compute the number of Nobel laureates per country:
nobel_per_country <- nobel_laureates %>%
select(birth_country, full_name) %>%
distinct() %>%
group_by(birth_country) %>%
count(sort = TRUE) %>%
ungroup() %>%
drop_na()
Then we can plot this number, only for the first 20 countries (so that the plot can be readable):
nobel_per_country %>%
select(birth_country, n) %>%
top_n(20) %>%
mutate(birth_country = reorder(birth_country, n)) %>%
ggplot(aes(x = birth_country, y = n)) +
geom_col() +
coord_flip() +
xlab("Country") +
ylab("") +
geom_text(aes(label = n), nudge_y = 10) +
ggthemes::theme_clean()
We can also check the repartition per country and per category:
# The 20 countries with the most nobels
top_20 <- nobel_per_country %>%
top_n(10) %>%
select(birth_country) %>%
unlist(use.names = FALSE)
nobel_laureates %>%
select(birth_country, full_name, category) %>%
distinct() %>%
group_by(birth_country, category) %>%
mutate(n = n()) %>%
ungroup() %>%
drop_na() %>%
select(- full_name) %>%
distinct() %>%
filter(birth_country %in% top_20) %>%
ggplot(aes(x = birth_country, y = n)) +
geom_col() +
coord_flip() +
xlab("Country") +
ylab("") +
geom_text(aes(label = n), nudge_y = 10) +
ggthemes::theme_clean() +
facet_wrap(~category)
Animated plots
To observe the evolution of this number in time, one way would be to plot lines with year in x-axis. But we could also keep the first plot we made and animate it with {gganimate}.
First, we compute the cumulated sum of Nobel laureates. Indeed, the number of laureates per year is useless for us, we want to see the evolution of the total number:
nobel_per_country_year <- nobel_laureates %>%
select(year, birth_country) %>%
group_by(year, birth_country) %>%
count(sort = TRUE) %>%
ungroup() %>%
drop_na() %>%
arrange(birth_country, year) %>%
complete(year, birth_country) %>%
mutate(n = ifelse(is.na(n), 0, n),
year = as.integer(year)) %>%
filter(birth_country %in% top_20) %>%
group_by(birth_country) %>%
mutate(n_cumul = cumsum(n)) %>%
arrange(birth_country)
Then, we use almost the same code as for the first plot, but we add arguments at the end that tell how we want the animation to be:
plot_evol <- nobel_per_country_year %>%
select(birth_country, year, n_cumul) %>%
filter((year %% 2) != 0) %>%
ggplot(aes(x = reorder(birth_country, n_cumul), y = n_cumul)) +
geom_col() +
coord_flip() +
xlab("Country") +
ylab("") +
geom_text(aes(label = as.character(round(n_cumul, 0))), nudge_y = 10) +
ggthemes::theme_clean() +
transition_time(year) +
ggtitle("Year: {frame_time}") +
ease_aes('linear', interval = 2)
animate(plot_evol, duration = 15, fps = 20)
This allows us to see that the USA have seen their number of Nobel laureates surge from the 1960′s and 1970′s, which corresponds more or less to the creation of the so-called “Nobel Prize in Economics” in 1969. The plot per category also indicates that this prize plays a major role in the domination of the USA.
Maps
Static maps
To create maps, we rely on the package {tmap}. In addition to its functions, this package also gives access to a dataset that we will use to show the number of laureates per country.
data(World)
We need to merge our dataset of Nobel laureates with this dataset. But the country names differ. Therefore, we have to use ISO codes instead. World already contains ISO codes, so we only have to create those for our dataset. This can be done very easily with the package {countrycode}. However, some countries in our dataset don’t have ISO codes, such as Scotland, Northern Ireland or Czechoslovakia. The two former can be recoded as United Kingdom, but Czechoslovakia was located on current Slovakia, Czech Republic and Ukraine, so we drop it of our dataset.
nobel_per_country <- nobel_per_country %>%
mutate(
iso_birth = countrycode(birth_country, origin = "country.name", destination = "iso3c"),
iso_birth = case_when(
birth_country == "Scotland" | birth_country == "Northern Ireland" ~ "GBR",
TRUE ~ iso_birth
)
)
We can now merge the two datasets based on their ISO codes…
World <- World %>%
full_join(nobel_per_country, by = c("iso_a3" = "iso_birth")) %>%
rename("number" = "n") %>%
mutate(number = ifelse(is.na(number), 0, number))
… and we can build the map and fill the countries with the number of laureates:
tm_shape(World, crs = 4326) +
tm_polygons(fill = "number", fill.scale = tm_scale_intervals(
breaks = c(0, 5, 10, 50, 200, Inf),
values = "YlOrBr"
)) +
tm_legend(title = "Nobel prizes per country", legend.title.size = 10^(-4)) +
tm_layout(legend.outside = TRUE)