
Text Analysis with tidytext

26 Jan 2019
Hello and welcome. Today we will dive into the fascinating world of text analysis using the tidytext R package. The tidytext package is fantastic because it lets us bring the familiar tidyverse toolbox to bear on text.

Before we begin analysing text we need some text to analyse! We're going to look at the State of the Union Addresses given by Presidents Bill Clinton, George W. Bush and Barack Obama, which you can find in the GitHub repository for this post. (You can also find me on Twitter @stevo_marky.)

For each president we will analyse seven State of the Union Addresses covering the period 1994 - 2016; we will look at:

  1. the key themes of each speech (term frequency and TF-IDF)

  2. the relationships between words (n-grams)

  3. the sentiment of each speech using different sentiment lexicons

  4. the speech structure and the length of sentences



This blogpost assumes good knowledge of tidyverse tools, such as stringr, tidyr, dplyr and ggplot2.

Get Speeches


Firstly, as I’ve saved the speeches as txt files, we use the readtext() function from the package of the same name to read in the speeches.

library(readtext)

# read in speeches using readtext
speeches_raw <- readtext("Transcripts/*")


We now have a dataframe where we have the name of each file and the speech in a row.
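
If you want a quick sanity check, something like the following (my own addition, output not shown) will display the two columns that readtext creates, doc_id and text:

library(dplyr)

# peek at the structure of the raw speeches dataframe
glimpse(speeches_raw)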

Prepare our Text for Analysis


Firstly, we will tidy up our data frame a little bit. We change the title of each speech from president_state_of_the_union_year.txt to president_year, which will be easier to work with going forwards. We also extract the year of the speech, which will be useful for ordering.

library(dplyr)
library(stringr)
library(tidyr)

# tidy up the titles of our speeches
speeches <- speeches_raw %>%
  # use dplyr::mutate and stringr::str_replace_all to change the title of each of the speeches
  # (the dot is escaped so that ".txt" is matched literally)
  mutate(speech = str_replace_all(doc_id,
                                  c("_state_of_the_union" = "", "\\.txt" = ""))) %>%
  separate(speech, into = c("president", "year"), sep = "_", remove = FALSE) %>%
  select(year, president, speech, text)


We want the text in a tidy data format, or tidy text in this context (for a full discussion of this, see the book Text Mining with R: A Tidy Approach by Julia Silge and David Robinson).

For our purposes, the most important points are that:

  1. we will tokenize the text, where "A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens." (Text Mining with R)

  2. we want one-token-per-row



Fortunately, there is a function that can do this for us: tidytext::unnest_tokens().

To see how this works, we will unnest a single speech: George W. Bush's 2002 address.

library(dplyr)
library(tidytext)

# run through one example
bush_2002 <- speeches %>%
  filter(speech == "bush_2002") %>%
  unnest_tokens(word, text, token = "words", to_lower = TRUE)


This code chunk takes Bush’s 2002 speech:

  1. takes the column text as an input

  2. unnests words (using token = "words", which is the default behaviour and therefore doesn't have to be specified as an argument; we will look at other tokens later!)

  3. outputs a column named “word”

  4. converts all words to lower case (again, this is the default behaviour and therefore doesn't have to be specified as an argument)



This gives us a dataframe like:

year president speech word
1 2002 bush bush_2002 thank
1.1 2002 bush bush_2002 you
1.2 2002 bush bush_2002 very
1.3 2002 bush bush_2002 much
1.4 2002 bush bush_2002 mr
1.5 2002 bush bush_2002 speaker


The good news is that we can unnest every speech at once, as we have a column in our input specifying which speech the text comes from, and it will be carried through to our output. Therefore our code is:

library(tidytext)

# unnest every speech all at once
words <- speeches %>%
  unnest_tokens(word, text)


This is good! But if you look at the result, you will see 137,987 words across all the speeches, which is a lot! Many of those words are common, everyday words such as "a", "the" and "be", known as stop words, which we don't want to include in our text analysis. To remove them, we use dplyr::anti_join() between our words dataframe and the stop_words dataframe that comes with the tidytext package, leaving only the non-stop words. Our code is:

library(tidytext)
library(dplyr)

words <- speeches %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)


We now have 55,137 words! Wow! That means that of our 137,987 words, 60% were stop words, which could have wreaked havoc on our analysis and led us to deliver no insight! But we are not quite ready yet: in our transcripts, the word "applause" is written to signify audience applause, so we need to remove it.

We also have numerous instances of words with "America" as their root (America, American, Americans). We mutate all of these into a single word, a simple manual form of stemming. A step such as this is optional - it depends on your text analysis.

library(tidytext)
library(dplyr)

words <- speeches %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  # speech transcripts contain the word applause to signify audience applause
  # filter out the word applause
  filter(word != "applause") %>%
  # mutate america root words into a single word
  mutate(word = case_when(word %in% c("america", "american", "americans") ~ "america",
                          TRUE ~ word))


The Key Themes of Each Speech


Let's find, as a proportion of total words (after removing all stop words), the most frequently occurring words in each speech. This will be our word frequency. We use frequency rather than raw counts so that we can compare between speeches of different lengths. The code to do this is:

library(dplyr)

# get word frequency
word_count <- words %>%
  count(year, president, speech, word)

# get total words in each speech
total_words <- word_count %>%
  group_by(year, president, speech) %>%
  summarise(total_words = sum(n))

# join and calculate word frequency
word_freq <- word_count %>%
  left_join(., total_words, by = c("speech", "year", "president")) %>%
  mutate(freq = n/total_words) %>%
  # take the top ten for each speech
  arrange(year, president, speech, desc(freq)) %>%
  group_by(year, speech) %>%
  top_n(10, wt = freq) %>%
  ungroup()


This gives us a dataframe:

year president speech word n total_words freq
1 1994 clinton clinton_1994 people 62 2737 0.0227
2 1994 clinton clinton_1994 america 52 2737 0.0190
3 1994 clinton clinton_1994 health 41 2737 0.0150
4 1994 clinton clinton_1994 care 39 2737 0.0142
5 1994 clinton clinton_1994 congress 24 2737 0.00877
6 1994 clinton clinton_1994 country 20 2737 0.00731


The elegance here is that once we have prepared our data, we can use standard tidyverse packages (dplyr, ggplot2) to perform the rest of our analysis.

We use ggplot2 to plot our results:

library(dplyr)
library(ggplot2)

# plot this dataframe
word_freq %>%
  arrange(year, president, speech, desc(freq)) %>%
  ggplot(aes(word, freq, fill = president)) +
  geom_col() +
  ggtitle("Most Frequent Words by Speech") +
  xlab("Word") +
  ylab("Frequency") +
  facet_wrap( ~ year, scales = "free", ncol = 5) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "bottom")




What can we see here? Well, "America" dominates the word frequency of each speech, but that isn't surprising. We can see that "21st" and "century" appear a couple of times in the late nineties, as the approach of the Millennium became an important topic. In 2002 we have "Afghanistan" in a speech's top ten for the only time, and "Iraq" appears in 2004, 2007 and 2008. The 2003 speech contains a lot of words that only appear that year, such as "hussein", "saddam" and "nuclear". Of course, this speech was two months prior to the Iraq War, and these terms show the single most important political topic at the time.

Let's try a different slant: let's pick a few terms and see how their frequency has changed over time. The words we will pick are "congress", "economy", "government", "jobs" and "tax".

The data wrangling steps involve creating an empty "shell" dataframe and then populating it with actual data. We have to create the shell dataframe to ensure that our line graph shows zero for any year in which a word was not spoken in the speech.

library(dplyr)
library(tidyr)

# create empty shell dataframe
freq_over_time_words <- data.frame(c("congress", "economy",
                                     "government", "jobs", "tax"))
years <- speeches %>% select(year)

# use tidyr::crossing to get the cartesian product
freq_over_time_words <- crossing(freq_over_time_words, years)
colnames(freq_over_time_words) <- c("word", "year")

# filter words
words_over_time <- word_freq %>%
  filter(word %in% c("congress", "economy", "government", "jobs", "tax")) %>%
  select(word, year, freq)

# left join so we have zeros where the word was not mentioned
freq_over_time_words <- freq_over_time_words %>%
  left_join(., words_over_time, by = c("year", "word")) %>%
  replace(is.na(.), 0)


And to create our line plot:

library(ggplot2)

# for a selection of words plot word frequency over time
freq_over_time_words %>%
  ggplot(aes(year, freq, col = word, group = word)) +
  geom_point() +
  geom_line() +
  ggtitle("Word Frequency Over Time") +
  xlab("Year") +
  ylab("Word Frequency") +
  theme_light() +
  theme(legend.position = "bottom") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_grid(rows = vars(word), scales = "free_y")




Interestingly, the frequency of the words we have chosen fluctuates from speech to speech. Even the word "economy" has many years where it is not mentioned. The word "jobs" was clearly politically important between 2008 and 2016, in the aftermath of the 2008 Financial Crisis. Of our words, "congress" seems to be the stickiest across speeches.

Another way of analysing the key themes of each speech is to calculate something called the Term Frequency-Inverse Document Frequency (TF-IDF) for each word. We are using TF-IDF here to see which words are most distinctive to each speech.
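
Before reaching for the function, here is a rough "by hand" sketch of the calculation for intuition (my own illustration, not code from the post; I believe bind_tf_idf() uses the natural log): the inverse document frequency of a word is ln(number of speeches / number of speeches containing the word), and tf-idf is the word's term frequency multiplied by that.

library(dplyr)

# a rough "by hand" version of the tf-idf calculation, for intuition only
n_speeches <- n_distinct(word_count$speech)

tf_idf_by_hand <- word_count %>%
  left_join(total_words, by = c("year", "president", "speech")) %>%
  # idf: how rare is the word across speeches?
  group_by(word) %>%
  mutate(idf = log(n_speeches / n_distinct(speech))) %>%
  ungroup() %>%
  # tf: how frequent is the word within its own speech?
  mutate(tf = n / total_words,
         tf_idf = tf * idf)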

To calculate TF-IDF we use the tidytext::bind_tf_idf() function.

library(tidytext)
library(dplyr)
library(ggplot2)

# tf-idf
tf_idf <- left_join(word_count, total_words) %>%
  bind_tf_idf(word, speech, n) %>%
  arrange(desc(tf_idf))

# plot tf-idf
tf_idf %>%
  # filter by president
  filter(president == "clinton") %>%
  arrange(desc(tf_idf)) %>%
  group_by(speech) %>%
  top_n(5, wt = tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = speech)) +
  geom_col() +
  ggtitle("Speech TF-IDF") +
  ylab("TF-IDF") +
  xlab("Word") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none") +
  facet_wrap(~speech, ncol = 2, scales = "free") +
  coord_flip()




This highlights different words compared to term frequency alone. In 1995, for example, we see the word "covenant": President Clinton used his 1995 address to explore his "New Covenant" political theme.

n-grams


Instead of looking at individual words, let’s instead look at n-grams. For the purposes of this blogpost we’ll look at bi-grams (two words together), but the concept can easily be applied to tri-grams and beyond.

To do this, we need to go back and unnest our speeches again, but this time using token = "ngrams", n = 2.

library(dplyr)
library(tidytext)
library(tidyr)

# unnest into bigrams
bigrams <- speeches %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # speech transcripts contain the word applause to signify audience applause
  # filter out the word applause in either position
  filter(word1 != "applause",
         word2 != "applause")


As before, we filter out the word "applause", but this time we have to look in two columns and filter both. The only thing left to add is to filter out rows that contain a stop word in either the first or second part of the bigram, as otherwise our bigram analysis will be full of mundane phrases such as "and the".

... %>%
  # ensure neither part of the bigram contains a stop word
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)


We now have our bigrams dataframe:

year president speech word1 word2
1 1994 clinton clinton_1994 103rd congress
2 1994 clinton clinton_1994 fellow americans
3 1994 clinton clinton_1994 teleprompter tonight
4 1994 clinton clinton_1994 tonight laughter
5 1994 clinton clinton_1994 grace tip
6 1994 clinton clinton_1994 tip o'neill


For further analysis, we unite columns word1 and word2:

library(dplyr)
library(tidytext)
library(tidyr)

bigrams <- speeches %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 != "applause",
         word2 != "applause") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(bigram, c("word1", "word2"), sep = " ")


Now we have our bigrams dataframe, we repeat the steps above to find the most common bigrams by frequency for each speech. The full code is available in the GitHub repository.
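
As a rough sketch (my own object names, not the exact code from the repository), those repeated steps might look something like this:

library(dplyr)

# count bigrams within each speech, convert to frequencies and keep the top four
bigram_freq <- bigrams %>%
  count(year, president, speech, bigram) %>%
  group_by(speech) %>%
  mutate(freq = n / sum(n)) %>%
  top_n(4, wt = freq) %>%
  ungroup() %>%
  arrange(year, desc(freq))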



For this plot, I have selected the top four bigrams by frequency for each speech. In the event of a tie, dplyr::top_n() keeps all tied rows, which is why some speeches show more bigrams than others. There are ways around this - one is sketched below.
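
For example (a sketch only, using the hypothetical bigram_freq object from the sketch above), you could break ties by sorting and slicing rather than relying on top_n():

library(dplyr)

# take exactly four rows per speech, breaking ties alphabetically
bigram_top4 <- bigram_freq %>%
  group_by(speech) %>%
  arrange(desc(freq), bigram) %>%
  slice(1:4) %>%
  ungroup()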

We can see that the bi-gram “21st century” is very prevalent in the mid to late nineties, more so than when we were looking at individual words. There are many ways to analyse text!

Sentiment Analysis


It still feels as if there is a lot more insight we can garner from the speeches though! We next consider the sentiment of each speech. What words mean is a massively complex subject, but we can use some of the sentiment lexicons that come with the tidytext package to get an indication. We will use three: AFINN, Bing and NRC. We can see what each one contains by calling get_sentiments() with the lexicon name, for example get_sentiments("afinn").

library(tidytext)

# three sentiment lexicons available
afinn <- get_sentiments("afinn")
bing <- get_sentiments("bing")
nrc <- get_sentiments("nrc")


We will follow a similar process for each. We will inner join our words dataframe onto each sentiment lexicon and then look at how the sentiment varies by speech, using ggplot2 to visualise our results.

So for AFINN we run the code:

library(dplyr)
library(ggplot2)

# afinn sentiment
# how positive and negative were each of the speeches
afinn_sentiment <- words %>%
  inner_join(., afinn) %>%
  group_by(year, speech) %>%
  summarise(sentiment = sum(score))

# plot speeches using afinn sentiment, ordered by sentiment
afinn_sentiment %>%
  ggplot(aes(reorder(speech, -sentiment), sentiment, fill = sentiment > 0)) +
  geom_col() +
  ggtitle("Sentiment of Each Speech") +
  xlab("Speech") +
  ylab("Sentiment") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")


To get:



Using the AFINN lexicon we can see that the majority of State of the Union Addresses are positive in sentiment, with Bush's 2003 speech the most negative.

We now use the Bing sentiment lexicon.

library(dplyr)

# bing sentiment
bing_sentiment <- words %>%
  inner_join(., bing) %>%
  group_by(year, speech, sentiment) %>%
  count() %>%
  mutate(net_n = case_when(sentiment == 'negative' ~ -n,
                           TRUE ~ n))


The Bing lexicon categorises words as either negative or positive. We count the instances of each in every speech and then negate the negative count, giving a net value we can plot.
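
The code for the plot below is in the GitHub repository; a sketch of how such a diverging bar chart could be drawn from bing_sentiment (my guess, not the exact code used) is:

library(dplyr)
library(ggplot2)

# plot positive counts above zero and negative counts below zero for each speech
bing_sentiment %>%
  ggplot(aes(reorder(speech, as.numeric(year)), net_n, fill = sentiment)) +
  geom_col() +
  ggtitle("Positive and Negative Words by Speech") +
  xlab("Speech") +
  ylab("Word Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "bottom")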



We can see that taking an overall sentiment hides fluctuations in the amount of negative or positive content. Whilst Clinton’s 2000 speech contains the most positivity, Obama’s 2011 speech contains the least negativity.

And we finally use the NRC lexicon. The NRC lexicon is a bit different in that it categorises words into ten different sentiments: trust, fear, negative, sadness, anger, surprise, positive, disgust, joy and anticipation.

We follow similar steps (the full code is available in the GitHub repository; a sketch is given below), and get a 100% stacked bar chart for each speech:
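
A sketch of those steps (my approximation, not the repository code) is to count each NRC sentiment within each speech and plot with position = "fill":

library(dplyr)
library(ggplot2)

# nrc sentiment: count each of the ten nrc categories within each speech
nrc_sentiment <- words %>%
  inner_join(., nrc) %>%
  count(year, speech, sentiment)

# 100% stacked bar chart of the sentiment mix in each speech
nrc_sentiment %>%
  ggplot(aes(reorder(speech, as.numeric(year)), n, fill = sentiment)) +
  geom_col(position = "fill") +
  ggtitle("NRC Sentiment Mix by Speech") +
  xlab("Speech") +
  ylab("Proportion of Words") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "bottom")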



This fascinating analysis shows a pretty similar sentiment breakdown for every one of the speeches, although in Bush's 2003 speech fear is more prevalent than in other years.

Negated Words


However, have we overlooked something in our sentiment analysis? What if, in 2003 Bush wasn’t using fearful words, like “disaster”, but instead was negating them: “no disaster”? Or what if instead of using joyous words such as “generous”, a President was saying “never generous”. This would substantially change our sentiment analysis and conclusions!

We should consider this possibility. To start to do this we create a list of negated words.

# consider the effect of negation on sentiment
# create list of negated words
negation_words <- c("ever", "never", "no", "not")


We follow very similar steps as we did before to break text into bigrams.

library(dplyr)
library(tidytext)
library(tidyr)

# as before separate words
bigrams_negate <- speeches %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(word1 != "applause",
         word2 != "applause")


We next look to see whether the first word (word1) in each bigram is a negation word, and if it is, we join the bigram to the AFINN sentiment lexicon using word2. Then we reverse and double the sentiment score previously applied: the original analysis has already counted word2 once with the wrong sign, so if, for example, "not good" appears and "good" scores +3, the earlier total included +3 when it should have included -3, making the adjustment -6. This gives us an adjusted score with which to correct our earlier analysis. We then inner join to our previous analysis and add the original sentiment and the negated adjustment together to get the total sentiment for each speech.

library(dplyr)

# see if we have a negated word in word1 of the bi-gram
# afinn negated sentiment
afinn_negated_sentiment <- bigrams_negate %>%
  filter(word1 %in% negation_words) %>%
  inner_join(afinn, by = c(word2 = "word")) %>%
  # reverse the sign of the afinn score and double it: once to cancel the score
  # already counted in the original analysis and once to apply the reversed value
  mutate(score_adj = -score*2) %>%
  group_by(speech) %>%
  summarise(negate_sentiment = sum(score_adj)) %>%
  inner_join(., afinn_sentiment) %>%
  mutate(total_sentiment = sentiment + negate_sentiment)




Let's compare this to our original AFINN sentiment graph: there are changes in both directions. Clinton's 2000 speech is now considered less positive; Obama's 2010 speech is now neutral, whilst Clinton's 1996 speech is much less negative when we consider the effect of negation.

Sentence Structure


Finally, we are going to look at the sentence length of each of the speeches. We start off in the familiar fashion, by tokenizing our text, this time specifying the argument token = "sentences", which splits the text on full stops. We also remove some sentences that are especially short or don't relate to actual speech text (such as "mr." or "(applause.)"). Finally, we add a new column containing the number of words in each sentence.

library(dplyr)
library(tidytext)
library(stringr)

# tokenise by sentence instead
sentences <- speeches %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  # tokenises on sentence which is specified by a "."
  # remove certain sentences
  filter(!sentence %in% c("mr.", "speaker, mr.",
                          "(applause.)", "(laughter and applause.)",
                          "[applause] thank you.", "applause.)")) %>%
  # get number of words in a sentence
  mutate(number_of_words = str_count(sentence, "\\S+"))


For each speech, we are going to plot the empirical cumulative distribution function for the length of a sentence. To do this we use ggplot2::stat_ecdf().

library(dplyr)
library(ggplot2)

# plot the empirical cumulative distribution function for sentence length
# obama
sentences %>%
  filter(president == "obama") %>%
  ggplot(aes(number_of_words, color = year)) +
  stat_ecdf(geom = "step", pad = FALSE) +
  ggtitle("Empirical Cumulative Distribution for Sentence Length") +
  ylab("Cumulative Distribution") +
  xlab("Number of Words") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = "bottom")




For President Obama, we can see that his sentence structure is very similar across speeches, but the 2014 speech used a greater proportion of sentences longer than 40 words.
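
As a numeric companion to the plot (my own addition, not from the original post), we could also summarise sentence length per speech:

library(dplyr)

# median and maximum sentence length for each speech
sentence_length_summary <- sentences %>%
  group_by(year, president, speech) %>%
  summarise(median_words = median(number_of_words),
            longest_sentence = max(number_of_words)) %>%
  arrange(president, year)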

I hope you've found this post interesting and informative. If you see any errors in the code, please let me know. If you have any comments, then please leave them below, and don't forget that you can find all the code for this post (and more!) at the post GitHub repository.

For other R blogs, you will find plenty on this site, and you can also visit R-Weekly, where this post was featured.

Session and Package Information


The details of my R session and packages installed are:

sessionInfo()


R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.2

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ggplot2_3.1.0 tidyr_0.8.2 bindrcpp_0.2.2 dplyr_0.7.8 stringr_1.3.1 tidytext_0.2.0
[7] readtext_0.71
