Text analysis of the dialogues in “The Americans”

I love the TV show “The Americans” and decided to harvest and analyze the dialogues.

Etienne Bacher

The Americans is a TV show about two KGB agents who have infiltrated the USA in the 1980s. It follows the lives of these two people, who have to carry out espionage missions while raising their children and running the business that serves as their cover.

Recently, I watched Ryan Timpe’s talk at RStudio::conf 2020 about learning R with humorous side projects. It made me think about which projects I could develop to learn new things with R, and it pushed me to combine my interest in both The Americans and R. I thought it would be interesting to analyze the dialogues of this TV show, since doing so required learning two new skills: scraping web data to get the dialogues, and doing text analysis to explore them.

Get the dialogues

Find a source for the dialogues

Apparently, unlike for Friends, nobody has developed a package containing the dialogues of The Americans yet. I therefore had to search for them online, and I found this website, which contains all of the dialogues and other text (lyrics, stage directions, etc.), with one page per episode.

This website doesn’t provide the dialogues for the end of season 6, but this is not a big issue. Another drawback is that it doesn’t always indicate who is talking, so it’s not possible to analyze the words of a specific character. But it’s good enough for me: I just want to practice, and the results don’t matter here.

Import the dialogues

Let’s see how to import the dialogues with episode 1 of season 1. First of all, two packages will be needed:
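The package list itself is missing at this point. Judging only from the functions called in the rest of this section, loading rvest and xml2 should be enough; this is an inference from the code, not necessarily the two packages originally intended.

```r
# Assumption: inferred from the calls used in this section
library(rvest) # html_node(), html_nodes(), html_text(), and the %>% pipe
library(xml2)  # read_html(), xml_find_all()
```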

Now, we want to obtain the details of the page containing the dialogues for the first episode:

page <- xml2::read_html("http://transcripts.foreverdreaming.org/viewtopic.php?f=116&t=15871")

This stores the whole HTML document in a single object with two top-level parts (typically the head and the body). But we only need the dialogues, so we have to find the HTML element that contains them. To do so, we can use the Inspector in the browser (Ctrl+Shift+C). When hovering over elements on the webpage, we can see that there are several classes. Each line is wrapped in a p element, and we notice that the whole text sits inside div.postbody.

Therefore, we can select only this class:

page_text <- page %>%
  html_nodes("div.postbody") %>% # every div carrying the class "postbody"
  html_text(trim = TRUE) # keep only the text, without surrounding whitespace
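To see what this kind of selector does without touching the real site, here is a minimal offline sketch. The HTML string is a toy invented for illustration (the real page markup is more complex), and the names toy_page and toy_lines are hypothetical.

```r
library(rvest)

# Toy HTML, made up for this example: two lines of dialogue inside
# div.postbody, plus an unrelated div that should be ignored
toy_page <- xml2::read_html(
  '<div class="postbody"><p>First line.</p><p>Second line.</p></div><div class="signature">Not dialogue</div>'
)

# Select the p elements nested in div.postbody and extract their text
toy_lines <- toy_page %>%
  html_nodes("div.postbody p") %>%
  html_text(trim = TRUE)

print(toy_lines)
```

Selecting "div.postbody p" returns one element per line, whereas selecting "div.postbody" alone returns the text of the whole block at once.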

Now, page_text contains all the text of the page. However, not everything is relevant here: we don’t want to analyze the lyrics of the soundtrack, the stage directions, or the name of the person who is talking. The latter could be interesting if we had it for every sentence, but we only have it occasionally, which makes it useless. To remove this irrelevant text, we will use gsub(), a base R function.

page_text_cleaned <- page_text %>%
  gsub("♪♪", "", .) %>% # double music symbol
  gsub("♪ [^♪]+♪", "", .) %>% # text between music symbols (= lyrics)
  gsub("\\n", " ", .) %>% # newline
  gsub("\\t", " ", .) %>% # tab
  gsub("\\[[^\\]]*\\]", "", ., perl = TRUE) %>% # text between brackets
  # ads: must run after the bracket step (which strips the "[]") but before the
  # parenthesis step, which would otherwise remove part of this pattern first
  gsub("\\(adsbygoogle = window.adsbygoogle \\|\\| \\)\\.push\\(\\{\\}\\);", "", .) %>%
  gsub("\\([^\\)]*\\)", "", ., perl = TRUE) %>% # text between parentheses
  gsub("\\\\", "", .) %>% # backslash
  gsub("(Philip|Elizabeth|Paige|Henry|Stan):", "", .) # speaker labels

The text is now cleaned: most of the irrelevant text has been removed, and almost all of what’s left is dialogue.
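To make the effect of these substitutions concrete, here is a small sketch applied to a made-up string rather than to the real transcript. The sample sentence is invented, and the last step (collapsing repeated spaces) is an extra that is not part of the cleaning above.

```r
# Invented sample mixing dialogue, a stage direction, lyrics and speaker labels
sample_text <- "Philip: We should go. [door closes]\n♪ la la la ♪ (sighs) Stan: Hello."

cleaned <- sample_text
cleaned <- gsub("♪ [^♪]+♪", "", cleaned)                   # lyrics
cleaned <- gsub("\n", " ", cleaned)                        # newlines
cleaned <- gsub("\\[[^\\]]*\\]", "", cleaned, perl = TRUE) # [stage directions]
cleaned <- gsub("\\([^\\)]*\\)", "", cleaned, perl = TRUE) # (asides)
cleaned <- gsub("(Philip|Stan):", "", cleaned)             # speaker labels
cleaned <- trimws(gsub(" +", " ", cleaned))                # extra step: tidy spaces

print(cleaned)
# "We should go. Hello."
```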

Generalize this to all episodes

As I said before, there is a different page for each episode. How can we generalize the previous step to all these pages?

Well, the code we used before stays the same once we have the HTML of an episode page. The only problem is finding a way to download this HTML for all pages. We can notice that the URLs are almost identical for all episodes:

Episode 1, season 1: http://transcripts.foreverdreaming.org/viewtopic.php?f=116&t=15871

Episode 2, season 1: http://transcripts.foreverdreaming.org/viewtopic.php?f=116&t=15872

We see here that only the parameter t differs between those addresses. If we could collect the values of t for all the pages, we could then collect the HTML very easily. This raises another problem: what are the values of t? We could suppose that we just need to add 1 to the previous number (15871, 15872, 15873…). However, for episode 12 of season 4, for example, the value of t is 27447. Therefore, we must find another way to collect these t values.
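As a rough sketch of where this is heading: once the t values are known, the episode URLs can be built by pasting them onto the common prefix, and the import step can be wrapped in a function. The helper name get_episode_text is hypothetical, and the three t values are simply the ones quoted above, not the full list.

```r
# Prefix shared by every episode page (taken from the two addresses above)
base_url <- "http://transcripts.foreverdreaming.org/viewtopic.php?f=116&t="

# t values quoted in the text; the complete list still has to be collected
t_values <- c(15871L, 15872L, 27447L)

episode_urls <- paste0(base_url, t_values)

# Hypothetical helper: download one episode page and return its raw text
get_episode_text <- function(url) {
  page <- xml2::read_html(url)
  nodes <- rvest::html_nodes(page, "div.postbody")
  rvest::html_text(nodes, trim = TRUE)
}

# Not run here, since it sends one request per episode:
# all_text <- lapply(episode_urls, get_episode_text)
```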

Collect t values

To do so, we use the Inspector once again, but this time on the home page rather than on an episode page. Exploring the HTML tags, we notice that the t value appears, among other information, in the elements matching a.topictitle.
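One possible way to pull the t values out of such elements is sketched below on a toy page. The HTML is invented for illustration, and it assumes the value sits in the href attribute of each link; the markup of the real home page may differ.

```r
library(rvest)

# Toy stand-in for the home page (invented markup and link text)
home <- xml2::read_html(
  '<a class="topictitle" href="./viewtopic.php?f=116&amp;t=15871">Episode A</a>
   <a class="topictitle" href="./viewtopic.php?f=116&amp;t=15872">Episode B</a>
   <a class="other" href="./index.php">Home</a>'
)

# Keep only the links with the class "topictitle" and read their href
hrefs <- home %>%
  html_nodes("a.topictitle") %>%
  html_attr("href")

# Extract the digits that follow "t=" in each href
t_values <- as.integer(sub(".*[?&]t=([0-9]+).*", "\\1", hrefs))

print(t_values)
```

Because the pattern anchors on "?t=" or "&t=", it should not be confused by other parameters that happen to end in the letter t.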