A post where I make animated graphs and maps to visualize the repartition of Nobel laureates per country.
The Nobel laureates of 2020 were announced last week, and I thought it would be interesting to visualize the repartition of laureates per country, as there are several ways to do so. I’m going to use this dataset available on Kaggle, which contains information on the year, category, name of the laureate, country, city and date of birth and death, among other things. Notice that this dataset goes from 1901 to 2016 and therefore doesn’t contain the most recent laureates.
But first of all, we need to load all the packages we will use in this analysis:
Now, we can import the dataset. To remove the capital letters and transform the column names in snake case (i.e names such as “column_name” instead of “Column Name”), we can use the function clean_names()
of the package {janitor}
1:
nobel_laureates_raw <- read_csv(here("_posts/2020-10-18-nobel-laureates/nobel-laureates.csv")) %>%
janitor::clean_names()
The first thing that we have to correct before doing visualization concerns the country names. Indeed, many countries have changed since 1901. For example, Czechoslovakia no longer exists, as well as Prussia. In this dataset, the columns containing country names display first the official name at the time, and then put the current name of the country between brackets.
# A tibble: 6 x 2
birth_country death_country
<chr> <chr>
1 Netherlands Germany
2 France France
3 Prussia (Poland) Germany
4 Switzerland Switzerland
5 France France
6 Prussia (Germany) Germany
Since we only want the current country names, we must modify these columns so that:
if the name doesn’t have brackets (i.e the country hasn’t changed in time), we let it as-is;
if the name has brackets (i.e the country has changed), we only want to keep the name between brackets.
Since I must do this for two columns (birth_country
and death_country
), I created a function (and this was the perfect example of losing way too much time by making a function to save time…):
This function takes a dataset (data
), and creates a new column (x
) that will take the name between brackets if original variable has brackets, or the unique name if the original variable doesn’t have brackets. Then, x
is renamed as the variable we specified first. I must admit that regular expressions (such as the one in gsub()
) continue to be a big mystery for me, and I thank StackOverflow for providing many examples.
Now, we apply this function to our columns with countries:
nobel_laureates <- clean_country_names(nobel_laureates_raw, birth_country)
nobel_laureates <- clean_country_names(nobel_laureates, death_country)
The country names are now cleaned:
# A tibble: 6 x 2
birth_country death_country
<chr> <chr>
1 Netherlands Germany
2 France France
3 Poland Germany
4 Switzerland Switzerland
5 France France
6 Germany Germany
From now on, there are several ways to visualize the repartition of Nobel laureates per country. We could do a static bar plot, an animated bar plot to see the evolution in time, a static map, or an interactive map.
First of all, we need to compute the number of Nobel laureates per country:
nobel_per_country <- nobel_laureates %>%
select(birth_country, full_name) %>%
distinct() %>%
group_by(birth_country) %>%
count(sort = TRUE) %>%
ungroup() %>%
drop_na()
Then we can plot this number, only for the first 20 countries (so that the plot can be readable):
nobel_per_country %>%
select(birth_country, n) %>%
top_n(20) %>%
mutate(birth_country = reorder(birth_country, n)) %>%
ggplot(aes(x = birth_country, y = n)) +
geom_col() +
coord_flip() +
xlab("Country") +
ylab("") +
geom_text(aes(label = n), nudge_y = 10) +
ggthemes::theme_clean()
We can also check the repartition per country and per category:
# The 20 countries with the most nobels
top_20 <- nobel_per_country %>%
top_n(10) %>%
select(birth_country) %>%
unlist(use.names = FALSE)
nobel_laureates %>%
select(birth_country, full_name, category) %>%
distinct() %>%
group_by(birth_country, category) %>%
mutate(n = n()) %>%
ungroup() %>%
drop_na() %>%
select(- full_name) %>%
distinct() %>%
filter(birth_country %in% top_20) %>%
ggplot(aes(x = birth_country, y = n)) +
geom_col() +
coord_flip() +
xlab("Country") +
ylab("") +
geom_text(aes(label = n), nudge_y = 10) +
ggthemes::theme_clean() +
facet_wrap(~category)
To observe the evolution of this number in time, one way would be to plot lines with year
in x-axis. But we could also keep the first plot we made and animate it with {gganimate}
.
First, we compute the cumulated sum of Nobel laureates. Indeed, the number of laureates per year is useless for us, we want to see the evolution of the total number:
nobel_per_country_year <- nobel_laureates %>%
select(year, birth_country) %>%
group_by(year, birth_country) %>%
count(sort = TRUE) %>%
ungroup() %>%
drop_na() %>%
arrange(birth_country, year) %>%
complete(year, birth_country) %>%
mutate(n = ifelse(is.na(n), 0, n),
year = as.integer(year)) %>%
filter(birth_country %in% top_20) %>%
group_by(birth_country) %>%
mutate(n_cumul = cumsum(n)) %>%
arrange(birth_country)
Then, we use almost the same code as for the first plot, but we add arguments at the end that tell how we want the animation to be:
plot_evol <- nobel_per_country_year %>%
select(birth_country, year, n_cumul) %>%
filter((year %% 2) != 0) %>%
ggplot(aes(x = reorder(birth_country, n_cumul), y = n_cumul)) +
geom_col() +
coord_flip() +
xlab("Country") +
ylab("") +
geom_text(aes(label = as.character(round(n_cumul, 0))), nudge_y = 10) +
ggthemes::theme_clean() +
transition_time(year) +
ggtitle("Year: {frame_time}") +
ease_aes('linear', interval = 2)
animate(plot_evol, duration = 15, fps = 20)