News headlines text analysis
Introduction
In the present tutorial, I show an introductory text analysis of an ABC News headlines dataset. I will have a look at the most common words present in the headlines and run a sentiment analysis on them by taking advantage of the following sentiment lexicons:
NRC
Bing
AFINN
The NRC lexicon from Saif Mohammad and Peter Turney categorizes words into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust.
The Bing lexicon from Bing Liu and others categorizes words into a positive or a negative sentiment category.
The AFINN lexicon from Finn Årup Nielsen assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
For more information about those sentiment lexicons, see the references listed at the bottom.
Packages
I am going to take advantage of the following R packages.
suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidytext))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(textdata))
suppressPackageStartupMessages(library(widyr))
suppressPackageStartupMessages(library(ggplot2))
Package versions are herein listed.
packages <- c("stringr", "dplyr", "tidytext", "tidyr", "textdata", "widyr", "ggplot2")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages=packages, version = as.character(version_c))
I am running the following R version on Windows 10.
R.version
## _
## platform x86_64-w64-mingw32
## arch x86_64
## os mingw32
## system x86_64, mingw32
## status
## major 3
## minor 5.3
## year 2019
## month 03
## day 11
## svn rev 76217
## language R
## version.string R version 3.5.3 (2019-03-11)
## nickname Great Truth
Note
Before running this code, make sure you have downloaded the sentiment lexicons by executing the following operations:
get_sentiments("nrc")
get_sentiments("bing")
get_sentiments("afinn")
and accepting the download prompts shown by the interactive menu.
Getting Data
I then download our news dataset, which contains more than one million headlines, from:
https://www.kaggle.com/therohk/million-headlines/downloads/million-headlines.zip/7
Uncompressing it produces the abcnews-date-text.csv file. I load it into the news_data data frame and have a look at it.
news_data <- read.csv("abcnews-date-text.csv", header = TRUE, stringsAsFactors = FALSE)
dim(news_data)
## [1] 1103663 2
head(news_data)
tail(news_data)
Token Analysis
It is time to extract the tokens from our dataset. Selecting the column named headline_text and unnesting the word tokens determines the following.
news_df <- news_data %>% select(headline_text)
news_tokens <- news_df %>% unnest_tokens(word, headline_text)
head(news_tokens, 10)
tail(news_tokens, 10)
It is interesting to generate and inspect a table reporting how many times each token shows up within the headlines and its proportion with respect to the total.
news_tokens_count <- news_tokens %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))
The top-10 words which appear most frequently:
head(news_tokens_count, 10)
And the ones which appear least frequently:
tail(news_tokens_count, 10)
There is an issue with proceeding that way: there are words which do not play any relevant role in the sentiment analysis, the so-called stop words. The stop words collection provided by the tidytext package is shown herein below.
data(stop_words)
head(stop_words, 10)
To remove stop words as required, we take advantage of the anti_join operation.
news_tokens_no_sp <- news_tokens %>% anti_join(stop_words)
head(news_tokens_no_sp, 10)
Then, I count the news tokens again after having removed the stop words.
news_tokens_count <- news_tokens_no_sp %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))
head(news_tokens_count, 10)
tail(news_tokens_count)
Then, I filter the tokens, keeping those with more than 8,000 occurrences.
news_token_over8000 <- news_tokens_count %>% filter(n > 8000) %>% mutate(word = reorder(word, n))
nrow(news_token_over8000)
## [1] 32
head(news_token_over8000, 10)
tail(news_token_over8000, 10)
It is interesting to show the proportion as per-thousand by means of a horizontal bar plot.
news_token_over8000 %>%
ggplot(aes(word, proportion*1000, fill=ceiling(proportion*1000))) +
geom_col() + xlab(NULL) + coord_flip() + theme(legend.position = "none")
News Sentiment Analysis
In this paragraph, I focus on each single headline to evaluate its specific sentiment as determined by each lexicon. Hence, the output shall determine whether each specific headline carries a positive or negative sentiment.
head(news_df, 10)
I will analyse only the first 1,000 headlines, just for computational time reasons. Their token lists are computed as follows.
news_df_subset <- news_df[1:1000,,drop=FALSE]
tkn_l <- apply(news_df_subset, 1, function(x) {
  data.frame(headline_text = x, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, headline_text)
})
Removing the stop words from the token list.
single_news_tokens <- lapply(tkn_l, function(x) {anti_join(x, stop_words)})
str(single_news_tokens, list.len = 5)
## List of 1000
## $ 1 :'data.frame': 5 obs. of 1 variable:
## ..$ word: chr [1:5] "aba" "decides" "community" "broadcasting" ...
## $ 2 :'data.frame': 5 obs. of 1 variable:
## ..$ word: chr [1:5] "act" "fire" "witnesses" "aware" ...
## $ 3 :'data.frame': 4 obs. of 1 variable:
## ..$ word: chr [1:4] "calls" "infrastructure" "protection" "summit"
## $ 4 :'data.frame': 7 obs. of 1 variable:
## ..$ word: chr [1:7] "air" "nz" "staff" "aust" ...
## $ 5 :'data.frame': 6 obs. of 1 variable:
## ..$ word: chr [1:6] "air" "nz" "strike" "affect" ...
## [list output truncated]
As we can see, each headline is associated with a list of tokens. The sentiment of a headline is computed based on the sum of the positive/negative scores of its tokens.
single_news_tokens[[1]]
Bing lexicon
In this paragraph, the computation of the sentiment associated with the token lists is shown for the Bing lexicon. I first define a function named compute_sentiment() whose purpose is to output the positiveness score of a specific headline.
compute_sentiment <- function(d) {
  # headlines with no lexicon match get an NA score
  if (nrow(d) == 0) {
    return(NA)
  }
  # the score is the number of positive tokens minus the number of negative ones
  neg_score <- d %>% filter(sentiment == "negative") %>% nrow()
  pos_score <- d %>% filter(sentiment == "positive") %>% nrow()
  pos_score - neg_score
}
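As a quick sanity check, here is a minimal, hand-built token list mimicking the result of a lexicon join (the words and their labels are hypothetical): two positive tokens and one negative token should yield a score of +1.
# hypothetical token list, as if already joined with a lexicon
toy_joined <- data.frame(word = c("win", "great", "crash"),
                         sentiment = c("positive", "positive", "negative"),
                         stringsAsFactors = FALSE)
compute_sentiment(toy_joined)
## [1] 1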
Each single headline's token list is inner-joined with the Bing lexicon, and the result is given as input to the compute_sentiment() function to determine the sentiment score of each specific headline.
sentiments_bing <- get_sentiments("bing")
str(sentiments_bing)
## Classes 'tbl_df', 'tbl' and 'data.frame': 6786 obs. of 2 variables:
## $ word : chr "2-faces" "abnormal" "abolish" "abominable" ...
## $ sentiment: chr "negative" "negative" "negative" "negative" ...
single_news_sentiment_bing <- sapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_bing) %>% compute_sentiment()})
The result is a vector of integers whose element at the i-th position is the sentiment score associated with the i-th news headline.
str(single_news_sentiment_bing)
## Named int [1:1000] NA -1 1 -1 -1 2 0 NA NA NA ...
## - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
Here is the summary. Please note that:
- the median is negative
- NA’s show up
summary(single_news_sentiment_bing)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -3.000 -1.000 -1.000 -0.475 1.000 2.000 520
Collecting the results in a data frame as follows.
single_news_sentiment_bing_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_bing)
head(single_news_sentiment_bing_df, 10)
NRC lexicon
In this paragraph, the computation of the sentiment associated with the token lists is shown for the NRC lexicon. With respect to the previous analysis based on the Bing lexicon, some more pre-processing is needed, as explained in what follows. First, we get the NRC sentiment lexicon and see which sentiments are present therein.
sentiments_nrc <- get_sentiments("nrc")
(unique_sentiments_nrc <- unique(sentiments_nrc$sentiment))
## [1] "trust" "fear" "negative" "sadness" "anger"
## [6] "surprise" "positive" "disgust" "joy" "anticipation"
To have as output a positive/negative sentiment result, I define a mapping of the above-listed sentiments to a positive/negative string result as follows.
compute_pos_neg_sentiments_nrc <- function(the_sentiments_nrc) {
  s <- unique(the_sentiments_nrc$sentiment)
  # map each NRC sentiment to "positive" or "negative"; the mapped values
  # follow the order of the unique sentiments listed above
  df_sentiments <- data.frame(sentiment = s,
                              mapped_sentiment = c("positive", "negative", "negative", "negative",
                                                   "negative", "positive", "positive", "negative",
                                                   "positive", "positive"),
                              stringsAsFactors = FALSE)
  ss <- the_sentiments_nrc %>% inner_join(df_sentiments)
  the_sentiments_nrc$sentiment <- ss$mapped_sentiment
  the_sentiments_nrc
}
nrc_sentiments_pos_neg_scale <- compute_pos_neg_sentiments_nrc(sentiments_nrc)
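As a quick sanity check (a small addition to the original flow), the remapped lexicon should now contain only the two labels "positive" and "negative":
table(nrc_sentiments_pos_neg_scale$sentiment)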
The above function remaps the lexicon sentiments to the positive/negative scale. The remapped lexicon is then inner-joined with each headline's token list and given as input to the compute_sentiment() function.
single_news_sentiment_nrc <- sapply(single_news_tokens, function(x) { x %>% inner_join(nrc_sentiments_pos_neg_scale) %>% compute_sentiment()})
str(single_news_sentiment_nrc)
## Named int [1:1000] 1 -4 1 2 -2 2 4 NA 5 -2 ...
## - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
Here is the summary. Please note that:
- the median is equal to zero
- NA’s show up
summary(single_news_sentiment_nrc)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -9.0000 -2.0000 0.0000 -0.3742 2.0000 9.0000 257
single_news_sentiment_nrc_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_nrc)
head(single_news_sentiment_nrc_df, 10)
AFINN lexicon
In this paragraph, the computation of the sentiment associated with the token lists is shown for the AFINN lexicon.
sentiments_afinn <- get_sentiments("afinn")
colnames(sentiments_afinn) <- c("word", "sentiment")
str(sentiments_afinn)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2477 obs. of 2 variables:
## $ word : chr "abandon" "abandoned" "abandons" "abducted" ...
## $ sentiment: num -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
## - attr(*, "spec")=
## .. cols(
## .. word = col_character(),
## .. value = col_double()
## .. )
As we can see, the AFINN lexicon provides a numeric score for each token. We just need to sum up the scores of each headline's tokens to obtain the sentiment score of the headline under analysis.
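To make the mechanics concrete, here is a minimal, hypothetical example: a two-word token list is joined with the AFINN lexicon and its scores are summed. Both words are common AFINN entries, but their exact scores depend on the lexicon version, so no output is shown. The actual computation over all 1,000 headlines follows.
# hypothetical two-word headline, joined with AFINN and summed
toy_tokens <- data.frame(word = c("good", "bad"), stringsAsFactors = FALSE)
toy_tokens %>% inner_join(sentiments_afinn) %>% summarise(score = sum(sentiment))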
single_news_sentiment_afinn_df <- lapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_afinn)})
single_news_sentiment_afinn <- sapply(single_news_sentiment_afinn_df, function(x) {
ifelse(nrow(x) > 0, sum(x$sentiment), NA)
})
str(single_news_sentiment_afinn)
## Named num [1:1000] NA -2 NA -2 -1 6 3 NA NA -2 ...
## - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
Here is the summary. Please note that:
- the median is negative
- NA’s show up
summary(single_news_sentiment_afinn)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -9.000 -3.000 -2.000 -1.148 1.000 7.000 508
single_news_sentiment_afinn_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_afinn)
head(single_news_sentiment_afinn_df, 10)
Comparing results
Having obtained three potential sentiment evaluations for each news headline, we would like to compare their congruence. By congruence we mean the fact that all three lexicons express the same positive or negative result, in other words the same score sign, independently of its magnitude. If NA values are present, the congruence is computed as long as at least two non-NA values are available; otherwise it is equal to NA.
Furthermore, we compute the final news sentiment based upon the sum of each lexicon's sentiment score.
compute_congruence <- function(x, y, z) {
  v <- c(sign(x), sign(y), sign(z))
  # if only one lexicon (or none) reports a score, we cannot check for congruence
  if (sum(is.na(v)) >= 2) {
    return(NA)
  }
  # remove NA values; a zero sign counts as a disagreement in the test below
  v <- na.omit(v)
  v_sum <- sum(v)
  # congruent when all remaining signs are equal and non-zero
  abs(v_sum) == length(v)
}
compute_final_sentiment <- function(x, y, z) {
  if (is.na(x) && is.na(y) && is.na(z)) {
    return(NA)
  }
  s <- sum(x, y, z, na.rm = TRUE)
  # positive sentiments have a score strictly greater than zero
  # negative sentiments have a score strictly less than zero
  # neutral sentiments have a score equal to zero
  ifelse(s > 0, "positive", ifelse(s < 0, "negative", "neutral"))
}
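A few minimal calls on hypothetical scores may help clarify the behavior of both helpers. We then apply them to the 1,000 headlines.
compute_congruence(1, 2, -1)       # mixed signs, hence FALSE
compute_congruence(1, 2, NA)       # two non-NA scores with the same sign, hence TRUE
compute_congruence(NA, 2, NA)      # only one score available, hence NA
compute_final_sentiment(1, 2, NA)  # total score is 3, hence "positive"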
news_sentiments_results <- data.frame(headline_text = news_df_subset$headline_text,
bing_score = single_news_sentiment_bing,
nrc_score = single_news_sentiment_nrc,
afinn_score = single_news_sentiment_afinn,
stringsAsFactors = FALSE)
news_sentiments_results <- news_sentiments_results %>% rowwise() %>%
mutate(final_sentiment = compute_final_sentiment(bing_score, nrc_score, afinn_score),
congruence = compute_congruence(bing_score, nrc_score, afinn_score))
head(news_sentiments_results, 40)
It would be useful to replace the numeric scores with the same {negative, neutral, positive} scale.
replace_score_with_sentiment <- function(v_score) {
  # map the sign of the numeric score to a sentiment label; NA scores stay NA
  ifelse(v_score > 0, "positive", ifelse(v_score < 0, "negative", "neutral"))
}
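A quick check on a hypothetical vector of scores:
replace_score_with_sentiment(c(-2, 0, 3, NA))
## [1] "negative" "neutral"  "positive" NA
It is then applied to the three score columns.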
news_sentiments_results$bing_score <- replace_score_with_sentiment(news_sentiments_results$bing_score)
news_sentiments_results$nrc_score <- replace_score_with_sentiment(news_sentiments_results$nrc_score)
news_sentiments_results$afinn_score <- replace_score_with_sentiment(news_sentiments_results$afinn_score)
news_sentiments_results[,2:5] <- lapply(news_sentiments_results[,2:5], as.factor)
head(news_sentiments_results, 40)
Cross-tabulations of each lexicon's resulting sentiment against the final sentiment are herein shown.
table(news_sentiments_results$bing_score, news_sentiments_results$final_sentiment, dnn = c("bing", "final"))
## final
## bing negative neutral positive
## negative 278 15 14
## neutral 16 6 11
## positive 6 7 127
table(news_sentiments_results$nrc_score, news_sentiments_results$final_sentiment, dnn = c("nrc", "final"))
## final
## nrc negative neutral positive
## negative 353 10 4
## neutral 18 13 6
## positive 25 16 298
table(news_sentiments_results$afinn_score, news_sentiments_results$final_sentiment, dnn = c("afinn", "final"))
## final
## afinn negative neutral positive
## negative 326 10 12
## neutral 3 1 6
## positive 4 9 121
The cross-tabulation of congruence against final sentiment is herein shown.
table(news_sentiments_results$congruence, news_sentiments_results$final_sentiment, dnn = c("congruence", "final"))
## final
## congruence negative neutral positive
## FALSE 67 33 45
## TRUE 292 0 132
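As a further check (a small addition to the original flow), one could quantify the overall agreement rate, i.e. the share of congruent headlines among those whose congruence could be evaluated:
mean(news_sentiments_results$congruence, na.rm = TRUE)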
Conclusions
We analyzed the news headlines to determine their sentiment, taking advantage of three sentiment lexicons. We showed some basics of the methodology for this purpose. We also had the chance to compare the results obtained across all three lexicons and to set forth a final sentiment evaluation. If you are interested in understanding much more about text analysis, see ref. [4].
References
[1] NRC sentiment words baseline [http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm]
[2] BING sentiment words baseline [https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html]
[3] AFINN sentiment words baseline [https://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010]
[4] Text mining with R [https://www.tidytextmining.com/]