By: Sarah Thompson
I started off strong this election. I watched the first three Democratic presidential debates from start to finish in a state of elation as discussion topics miraculously contained words like “climate change” and “renewable energy.” My people, they were back. The future was looking pretty bright. Fast-forward to November and I hardly knew which candidates were still running, let alone what they were talking about. I decided to try another way of watching the debates. By scraping and analyzing the transcripts for each of the Democratic Presidential debates, I set out to see if I could establish candidate positioning on political topics that I care about.
To do this, I generated word embeddings for each of my topics of interest: climate, insurance, gun violence, teachers, military, and immigration. Using the same methodology, I also generated the mean word embeddings for each presidential candidate. Next, I calculated the similarity between each Democratic presidential candidate’s mean embedding and the embedding of each topic of interest. This allowed me to establish candidate positioning based on their alignment with the given topic. Finally, I used k-means clustering to group the candidates. This research is useful because it provides a way for busy individuals to get an unbiased examination of each candidate based on topics of interest.
MOST COMMON WORDS
To acquire the data, I scraped the full transcripts of each Democratic presidential debate and tokenized the transcripts by speaker statements. The data consists of roughly 260,000 words, about 9,600 of them unique. Statement lengths span a range of 480 words, with a mean of 56 words per statement.
Table 1 lists the most common words spoken by each current presidential candidate that were not among the most common words of any other current candidate. For this analysis, “stop words” such as and, because, and the were removed, because they would clearly have been the most common had they been included. This type of analysis gives insight into the words that each candidate, and only that candidate, says most often during the debates.
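The filtering described above can be sketched in a few lines of Python. This is a minimal illustration rather than the script behind Table 1: the stop-word list, the candidate names, and the statements are all made up, and splitting on whitespace ignores punctuation that a real transcript would contain.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "and", "because", "a", "to", "of", "we", "is", "in", "for"}

def top_words(statements, n):
    """Return the n most frequent non-stop words across a speaker's statements."""
    counts = Counter(
        word
        for statement in statements
        for word in statement.lower().split()
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(n)]

def unique_top_words(statements_by_speaker, n=50):
    """Keep each speaker's top-n words that are in no other speaker's top-n."""
    tops = {s: top_words(st, n) for s, st in statements_by_speaker.items()}
    unique = {}
    for speaker, words in tops.items():
        others = set()
        for other, other_words in tops.items():
            if other != speaker:
                others.update(other_words)
        unique[speaker] = [w for w in words if w not in others]
    return unique

# Made-up statements, for illustration only.
debate = {
    "Candidate A": ["we fight for families", "families pay and families fight"],
    "Candidate B": [
        "the climate is in crisis",
        "we fight for the climate",
        "we fight the climate crisis and fight",
    ],
}
result = unique_top_words(debate, n=2)
print(result)
```

Both made-up candidates say "fight" often, so it drops out of both lists, leaving only the word that is distinctive to each.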
Table 1: Most common 50 words that are not most common to any other current candidate
Warren sounds understanding and driven with words like fight, pay, families, and understand. Klobuchar is focused on winning and bettering the current state. Sanders’s words like change, believe, companies, and Medicare indicate his passion for a new system that impacts millions of people. While Biden focuses on the number of deals he has made, Buttigieg’s words point to his experience in Washington and being part of a community. Steyer’s words include climate, question, wrong, different, corporations, and government, showing his commitment to environmental challenges and the new approaches he intends to implement. Bloomberg talks about the policy, jobs, and agreements that he has perhaps passed in New York. It is also interesting that crosstalk is one of Bloomberg’s most common words, which indicates that he and another candidate were often speaking at the same time.
While insights, however subjective, can be gleaned from comparisons of simple counts of words, word embeddings can go even further in understanding the positioning of these candidates on important topics. A word’s embedding is its vector representation in multidimensional space, which can be used as a means for comparison. Word2vec is a popular tool for generating word embeddings; it uses a shallow neural network to compress each word into a dense vector, 300 dimensions in the model used here. The network learns from a corpus of text using one of two methods: CBOW, which predicts a word from its surrounding context, or skip-gram, which predicts the surrounding context from a word. Once the model is trained, each row of its weight matrix contains a compressed, lower-dimensional representation of a word it was trained on. This compressed representation is known as the word’s embedding, a vector in 300-dimensional space.
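To make the difference between the two training methods concrete, here is a small sketch of the training examples each one draws from a sentence. It only shows how the (input, target) examples are formed; the function names and the sample sentence are invented for illustration, and real word2vec implementations add refinements such as subsampling and negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word is used to predict each word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the words in the window are used together to predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

# Made-up sentence, for illustration only.
sentence = "we must act on climate change".split()
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)

print(sg[:3])   # first few (center, context) pairs
print(cb[2])    # (context words, center word) for "act"
```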
The performance of word2vec depends on the size of the corpus on which it was trained, which is why there are several pre-trained models that can be downloaded and used for analyses such as this one. Using the gensim library, I loaded the pre-trained Google News model, which is a dictionary of word vectors for roughly 3 million words and phrases; it was trained on part of the Google News dataset, around 100 billion words. In published research, these vectors have proven effective at capturing the relationships between pairs of words and at answering analogy questions.
One famous example of the power of word embeddings is the “King – man + woman” example, shown in Table 2. Here, I asked the pre-trained Google News model to return the words whose vectors are most similar to the king vector, minus the man vector, plus the woman vector; the results and script are shown below. The model performs as expected, almost magically returning “queen,” followed by “monarch” and “princess.” Additionally, Mikolov et al. showed that embeddings can be used to solve analogy questions like “apples – apple ≈ cars – car” and “Spanish is to Mexico as French is to France.” Word embeddings represent not only the “attributional similarities” between words but also the “relational similarities” between pairs of words, as shown above, making them a great tool for comparing the presidential candidates to topics of interest (Levy and Goldberg, 2014).
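The arithmetic behind this kind of query can be sketched with a handful of hand-made vectors. This is an illustration under stated assumptions, not the script used for Table 2: the three-dimensional vectors are invented, and the function ranks by cosine similarity to the raw sum of vectors. With gensim, the same kind of query is usually written as a call to KeyedVectors.most_similar with positive and negative word lists, which also normalizes the vectors before combining them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)  # do not return the query words
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented toy vectors; the axes loosely mean (royalty, maleness, femaleness).
toy = {
    "king":     [0.9, 0.8, 0.1],
    "queen":    [0.9, 0.1, 0.8],
    "prince":   [0.8, 0.7, 0.2],
    "princess": [0.8, 0.2, 0.7],
    "man":      [0.1, 0.9, 0.1],
    "woman":    [0.1, 0.1, 0.9],
    "apple":    [0.0, 0.2, 0.2],
}
ranked = analogy(toy, positive=["king", "woman"], negative=["man"])
for word, score in ranked:
    print(word, round(score, 3))
```

In this toy space, "queen" ranks first and "princess" second, because subtracting man and adding woman moves the king vector from the male axis to the female axis while keeping the royalty component.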
Table 2: A famous application of word embeddings
Using this pre-trained model, I looked up the word embedding for each topic of interest. By looking at the words most similar to each topic according to the pre-trained model, I was able to make sure that the vectors properly captured the ideas I sought. Table 3 lists the results for climate, insurance, and military, respectively. Aside from several oddities like “ambassador_Brice_Lalonde” and “WholesaleInsurance.net,” these vectors seem to be on the right track.
Table 3: The words most similar to climate, insurance, and military
There are several ways to calculate an embedding for each presidential candidate, but I chose to use the mean of the word embeddings. I then calculated how closely each candidate’s mean embedding aligns with each topic vector by taking the cosine similarity and scaling the results. What’s left is a score between zero and one, with one being the candidate most aligned with the topic and zero being the least aligned. Because I took the mean over the distinct words each candidate used, words that are said more than once are not counted again. This analysis therefore does not measure the number of times a candidate said a particular word but the number of words the candidate said that are similar to the topic.
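The scoring steps can be sketched as follows. This is a minimal illustration under stated assumptions: the two-dimensional vectors, the candidate labels, and their words are invented, and min-max scaling is assumed as the scaling step because it yields exactly one for the most aligned candidate and zero for the least aligned, as described above.

```python
import math

def mean_vector(words, vectors):
    """Average the embeddings of the distinct in-vocabulary words."""
    known = [vectors[w] for w in set(words) if w in vectors]
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def scaled_alignment(words_by_candidate, topic, vectors):
    """Cosine of each candidate's mean vector with the topic, min-max scaled to [0, 1]."""
    raw = {
        c: cosine(mean_vector(ws, vectors), vectors[topic])
        for c, ws in words_by_candidate.items()
    }
    lo, hi = min(raw.values()), max(raw.values())
    # Assumes at least two candidates with different raw scores (hi > lo).
    return {c: (s - lo) / (hi - lo) for c, s in raw.items()}

# Invented toy vectors; the axes loosely mean (environment, finance).
toy_vectors = {
    "climate": [1.0, 0.0],
    "carbon":  [0.9, 0.1],
    "solar":   [0.8, 0.2],
    "jobs":    [0.4, 0.6],
    "premium": [0.1, 0.9],
}
# Invented word lists; the repeated "carbon" counts once in the mean.
spoken = {
    "Candidate A": ["carbon", "solar", "carbon"],
    "Candidate B": ["premium", "jobs"],
    "Candidate C": ["solar", "jobs"],
}
scores = scaled_alignment(spoken, "climate", toy_vectors)
print(scores)
```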
Graph 1 shows the results for the climate and insurance vectors. Here we can see that Steyer, Sanders, and Buttigieg align the most with climate, all with results above 0.8. Sanders and Warren align the most with insurance, and Biden, Klobuchar, and Bloomberg align very little with both of these topics. Sanders is the clear winner in this pairing; his text corpus aligns strongly with both climate change and health insurance coverage.
Graph 1: Democratic presidential candidate debate corpus alignment with climate and insurance
Graph 2: Pairwise comparisons of Democratic presidential candidate corpus alignment
Graph 2 contains nine pairwise graphs of the topics of interest: immigration, military, gun violence, climate, teachers, and insurance. The results position Klobuchar, Warren, Buttigieg, and Steyer similarly with regard to immigration, with Bloomberg and Biden aligning the least with immigration and Sanders the most. Warren is by far the most aligned with teachers, and Buttigieg and Sanders are most aligned with military. Overall, Bloomberg aligns the least with all the topics of interest. Sanders is the clear champion of my topics of interest, dominating the top right-hand corner of almost all of the pairwise graphs.
In order to group candidates based on their alignment with the six topics of interest, I used the common k-means clustering algorithm. I decided on the four groups that are listed in Table 4. The first group, Odd Man Out, contains only Bloomberg; he spoke so little about my topics that he formed a group all of his own. The Issue Champions, Sanders and Warren, strongly aligned with all the topics that I care about; their campaigns seem most focused on my concerns. The Climate Warriors, Buttigieg and Steyer, align the most with climate change and align reasonably well with immigration, military, and gun violence. These two spoke very little about healthcare. And the Up in Arms members, Klobuchar and Biden, align moderately well with gun violence and military, show some focus on teachers and immigration, and spoke very little about climate change and healthcare.
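For readers unfamiliar with k-means, the algorithm can be sketched in plain Python. This is a bare-bones version of the standard procedure (Lloyd's algorithm) run on invented alignment scores for four hypothetical candidates and two topics; it is not the code behind Table 4, and in practice a library implementation such as scikit-learn's KMeans would normally be used instead.

```python
import random

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented (climate, insurance) alignment scores for hypothetical candidates.
names = ["Candidate A", "Candidate B", "Candidate C", "Candidate D"]
scores = [[0.9, 0.8], [1.0, 0.9], [0.1, 0.2], [0.0, 0.1]]

labels = kmeans(scores, k=2)
for name, label in zip(names, labels):
    print(name, "-> group", label)
```

The group numbers themselves are arbitrary; what matters is which candidates share a label. Here the two high-scoring hypothetical candidates land in one group and the two low-scoring ones in the other.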
Table 4: Groups of candidates based on k-means clustering
It’s important to remember that the point of this analysis is to establish candidate positioning on given topics. These findings fall short of casting any light on the candidates’ plans for handling a given topic, and the fact that a candidate talks about a specific topic is no indication of how they feel about that particular topic. For example, Biden may talk about gun violence with the intention of increasing school security and Sanders may talk about gun violence with the intention of banning assault rifles. From this research, there is no way to tell how the candidates intend to tackle my topics of interest.
Another way to explore candidate positioning would be to use a weighted average, which emphasizes words in proportion to how often they are spoken. By using the mean word embeddings as the vector for the candidates, all words are treated the same, including those that are said much more frequently. It would be interesting to compare the results of the weighted average with the current findings. Additionally, this analysis does not take into consideration the context of the words being said by the candidates, although it is possible to do so. By training a word2vec model on my corpus of all the debates, I could get an understanding of how these topics are being addressed by each candidate, which would be helpful in understanding the candidates’ plans for and feelings towards the topics.
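The difference between the two averages is easy to see on a tiny example. The sketch below uses two invented vectors and a made-up list of spoken words; the function names are hypothetical and only illustrate the idea of weighting each word by how often it was said.

```python
from collections import Counter

def unweighted_mean(words, vectors):
    """Mean over distinct words: repeated words count only once."""
    distinct = [w for w in set(words) if w in vectors]
    dim = len(vectors[distinct[0]])
    return [sum(vectors[w][i] for w in distinct) / len(distinct) for i in range(dim)]

def weighted_mean(words, vectors):
    """Mean in which each word is weighted by how often it was spoken."""
    counts = Counter(w for w in words if w in vectors)
    total = sum(counts.values())
    dim = len(next(iter(vectors.values())))
    return [sum(n * vectors[w][i] for w, n in counts.items()) / total for i in range(dim)]

# Invented toy vectors and word list, for illustration only.
toy_vectors = {"climate": [1.0, 0.0], "insurance": [0.0, 1.0]}
spoken = ["climate", "climate", "climate", "insurance"]

plain = unweighted_mean(spoken, toy_vectors)
weighted = weighted_mean(spoken, toy_vectors)
print(plain)     # [0.5, 0.5]
print(weighted)  # [0.75, 0.25]
```

With the unweighted mean, a candidate who says "climate" three times and "insurance" once sits exactly between the two topics; the weighted mean pulls the candidate toward climate.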
By using debate transcripts, the pre-trained Google News model loaded through Gensim, and k-means clustering, I was able to achieve my goal of establishing the positioning of the presidential candidates on topics of interest. Based on my analysis, Sanders and Warren are the candidates that align the most with my concerns. While these insights are certainly interesting and would be helpful for other individuals such as myself, there is much more left to explore. Next, I will look at the weighted average for calculating the candidates’ word embeddings and use word2vec to establish the context in which topics of interest were talked about.
Brinkerhoff, Douglas. “Word Embeddings.” Word Embeddings, Machine Learning, University of Montana, Missoula, MT, November 26, 2019.
Chandler, John. “Cluster Analysis.” Cluster Analysis, Applied Data Analytics, University of Montana, Missoula, MT, Fall, 2019.
Grus, Joe. Data Science From Scratch. 2nd ed., O’Reilly, 2019.
Karani, Dhruvil. “Introduction to Word Embedding and Word2Vec.” Towards Data Science, Medium, 1 Sept. 2018, https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
Levy, Omer, and Yoav Goldberg. “Linguistic regularities in sparse and explicit word representations.” Proceedings of the eighteenth conference on computational natural language learning. 2014.
McCormick, Chris. “Google’s trained Word2Vec model in Python.” Chris McCormick, 12 Apr. 2016,