Saturday, June 13, 2020

Text Analysis

News headlines text analysis

Introduction

In the present tutorial, I show an introductory text analysis of an ABC News headlines dataset. I will have a look at the most common words found therein and run a sentiment analysis on those headlines by taking advantage of the following sentiment lexicons:

  • NRC

  • Bing

  • AFINN

The NRC sentiment lexicon from Saif Mohammad and Peter Turney categorizes words into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust.

The Bing sentiment lexicon from Bing Liu and collaborators categorizes words into positive and negative categories.

The AFINN sentiment lexicon from Finn Årup Nielsen assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

For more information about those sentiment lexicons, see the references listed at the bottom.
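Before diving in, it may help to peek at the structure of each lexicon (a quick sketch; it assumes the lexicons have already been downloaded, as discussed in the Note further below):

library(tidytext)
library(textdata)

# NRC and Bing label each word with a sentiment category,
# while AFINN assigns it a numeric value between -5 and 5
head(get_sentiments("nrc"), 3)    # columns: word, sentiment
head(get_sentiments("bing"), 3)   # columns: word, sentiment
head(get_sentiments("afinn"), 3)  # columns: word, value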

Packages

I am going to take advantage of the following R packages.

suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidytext))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(textdata))
suppressPackageStartupMessages(library(widyr))
suppressPackageStartupMessages(library(ggplot2))

The package versions are listed here below.

packages <- c("stringr", "dplyr", "tidytext", "tidyr", "textdata", "widyr", "ggplot2")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages=packages, version = as.character(version_c))
##   packages version
## 1  stringr   1.4.0
## 2    dplyr   0.8.4
## 3 tidytext   0.2.2
## 4    tidyr   1.0.2
## 5 textdata   0.3.0
## 6    widyr   0.1.2
## 7  ggplot2   3.2.1

I am running the following R version on Windows 10.

R.version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.3                         
## year           2019                        
## month          03                          
## day            11                          
## svn rev        76217                       
## language       R                           
## version.string R version 3.5.3 (2019-03-11)
## nickname       Great Truth

Note

Before running this code, make sure you have downloaded the sentiment lexicons by executing the following operations:

get_sentiments("nrc")
get_sentiments("bing")
get_sentiments("afinn")

and accepting all prompts shown by the interactive menu.

Getting Data

I then download the news dataset, which contains more than one million headlines, from:

“https://www.kaggle.com/therohk/million-headlines/downloads/million-headlines.zip/7”
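Kaggle downloads require authentication, so I fetch the archive manually; the unpacking step can then be scripted (a minimal sketch, assuming the zip has been saved to the working directory as million-headlines.zip):

# extract only the CSV we need from the manually downloaded archive
unzip("million-headlines.zip", files = "abcnews-date-text.csv")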

Uncompressing it produces the abcnews-date-text.csv file. I load it into the news_data data frame and have a look at it.

news_data <- read.csv("abcnews-date-text.csv", header = TRUE, stringsAsFactors = FALSE)
dim(news_data)
## [1] 1103663       2
head(news_data)
##   publish_date                                       headline_text
## 1     20030219 aba decides against community broadcasting licence
## 2     20030219     act fire witnesses must be aware of defamation
## 3     20030219     a g calls for infrastructure protection summit
## 4     20030219           air nz staff in aust strike for pay rise
## 5     20030219      air nz strike to affect australian travellers
## 6     20030219                  ambitious olsson wins triple jump
tail(news_data)
##         publish_date                                                headline_text
## 1103658     20171231              stunning images from the sydney to hobart yacht
## 1103659     20171231 the ashes smiths warners near miss liven up boxing day test
## 1103660     20171231                     timelapse: brisbanes new year fireworks
## 1103661     20171231                    what 2017 meant to the kids of australia
## 1103662     20171231             what the papodopoulos meeting may mean for ausus
## 1103663     20171231   who is george papadopoulos the former trump campaign aide

Token Analysis

It is time to extract the tokens from our dataset. Selecting the column named headline_text and unnesting the word tokens produces the following.

news_df <- news_data %>% select(headline_text)
news_tokens <- news_df %>% unnest_tokens(word, headline_text)
head(news_tokens, 10)
##             word
## 1            aba
## 1.1      decides
## 1.2      against
## 1.3    community
## 1.4 broadcasting
## 1.5      licence
## 2            act
## 2.1         fire
## 2.2    witnesses
## 2.3         must
tail(news_tokens, 10)
##                   word
## 1103662.7        ausus
## 1103663            who
## 1103663.1           is
## 1103663.2       george
## 1103663.3 papadopoulos
## 1103663.4          the
## 1103663.5       former
## 1103663.6        trump
## 1103663.7     campaign
## 1103663.8         aide

It is interesting to generate and inspect a table reporting how many times each token shows up within the headlines and its proportion with respect to the total.

news_tokens_count <- news_tokens %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))

The top ten most frequent words are:

head(news_tokens_count, 10)
##      word      n  proportion
## 1      to 214201 0.030294784
## 2      in 135981 0.019232007
## 3     for 130239 0.018419907
## 4      of  80759 0.011421872
## 5      on  73037 0.010329738
## 6    over  50306 0.007114857
## 7     the  49810 0.007044707
## 8  police  35984 0.005089274
## 9      at  31723 0.004486634
## 10   with  29676 0.004197123

And the least frequent ones:

tail(news_tokens_count, 10)
##        word n   proportion
##       zweli 1 1.414316e-07
##  zwitkowsky 1 1.414316e-07
##     zydelig 1 1.414316e-07
##       zygar 1 1.414316e-07
##     zygiefs 1 1.414316e-07
##   zylvester 1 1.414316e-07
##       zynga 1 1.414316e-07
##     zyngier 1 1.414316e-07
##          zz 1 1.414316e-07
##         zzz 1 1.414316e-07

There is an issue with proceeding this way: the top of the list is dominated by words that play no relevant role in sentiment analysis, the so-called stop words. The stop words bundled with the tidytext package are shown here below.

data(stop_words)
head(stop_words, 10)
##    word        lexicon
## 1  a           SMART
## 2  a's         SMART
## 3  able        SMART
## 4  about       SMART
## 5  above       SMART
## 6  according   SMART
## 7  accordingly SMART
## 8  across      SMART
## 9  actually    SMART
## 10 after       SMART

To remove the stop words, we take advantage of the anti_join operation.

news_tokens_no_sp <- news_tokens %>% anti_join(stop_words)
head(news_tokens_no_sp, 10)
##            word
## 1           aba
## 2       decides
## 3     community
## 4  broadcasting
## 5       licence
## 6           act
## 7          fire
## 8     witnesses
## 9         aware
## 10   defamation

Then, I count the tokens again after having removed the stop words.

news_tokens_count <- news_tokens_no_sp %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))
head(news_tokens_count, 10)
##         word     n  proportion
## 1     police 35984 0.006732918
## 2       govt 16923 0.003166440
## 3      court 16380 0.003064840
## 4    council 16343 0.003057917
## 5  interview 15025 0.002811308
## 6       fire 13910 0.002602681
## 7        nsw 12912 0.002415947
## 8  australia 12353 0.002311353
## 9       plan 12307 0.002302746
## 10     water 11874 0.002221728
tail(news_tokens_count)
##        word n   proportion
##     zygiefs 1 1.871087e-07
##   zylvester 1 1.871087e-07
##       zynga 1 1.871087e-07
##     zyngier 1 1.871087e-07
##          zz 1 1.871087e-07
##         zzz 1 1.871087e-07

Then, I filter for the tokens having more than 8,000 occurrences.

news_token_over8000 <- news_tokens_count %>% filter(n > 8000) %>% mutate(word = reorder(word, n))
nrow(news_token_over8000)
## [1] 32
head(news_token_over8000, 10) 
##         word     n  proportion
## 1     police 35984 0.006732918
## 2       govt 16923 0.003166440
## 3      court 16380 0.003064840
## 4    council 16343 0.003057917
## 5  interview 15025 0.002811308
## 6       fire 13910 0.002602681
## 7        nsw 12912 0.002415947
## 8  australia 12353 0.002311353
## 9       plan 12307 0.002302746
## 10     water 11874 0.002221728
tail(news_token_over8000, 10) 
##        word    n  proportion
## 23      day 8818 0.001649924
## 24 hospital 8815 0.001649363
## 25      car 8690 0.001625974
## 26    coast 8411 0.001573771
## 27    calls 8401 0.001571900
## 28      win 8315 0.001555809
## 29    woman 8213 0.001536723
## 30   killed 8129 0.001521006
## 31  accused 8094 0.001514458
## 32    world 8087 0.001513148

It is interesting to show the proportions as per-thousand values by means of a bar plot.

news_token_over8000 %>%  
  ggplot(aes(word, proportion*1000, fill=ceiling(proportion*1000))) +
  geom_col() + xlab(NULL) + coord_flip() + theme(legend.position = "none")

News Sentiment Analysis

In this section, I focus on each single headline to evaluate its specific sentiment as determined by each lexicon. The output shall tell whether each specific headline carries a positive or negative sentiment.

head(news_df, 10)
##                                          headline_text
## 1  aba decides against community broadcasting licence
## 2       act fire witnesses must be aware of defamation
## 3       a g calls for infrastructure protection summit
## 4             air nz staff in aust strike for pay rise
## 5        air nz strike to affect australian travellers
## 6                    ambitious olsson wins triple jump
## 7           antic delighted with record breaking barca
## 8    aussie qualifier stosur wastes four memphis match
## 9         aust addresses un security council over iraq
## 10         australia is locked into war timetable opp

I will analyse only the first 1,000 headlines, just for computational time reasons. Their token lists are built as follows.

news_df_subset <- news_df[1:1000, , drop = FALSE]
tkn_l <- apply(news_df_subset, 1, function(x) {
  data.frame(headline_text = x, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, headline_text)
})

Removing the stop words from the token list.

single_news_tokens <- lapply(tkn_l, function(x) {anti_join(x, stop_words)})
str(single_news_tokens, list.len = 5)
## List of 1000
##  $ 1   :'data.frame':    5 obs. of  1 variable:
##   ..$ word: chr [1:5] "aba" "decides" "community" "broadcasting" ...
##  $ 2   :'data.frame':    5 obs. of  1 variable:
##   ..$ word: chr [1:5] "act" "fire" "witnesses" "aware" ...
##  $ 3   :'data.frame':    4 obs. of  1 variable:
##   ..$ word: chr [1:4] "calls" "infrastructure" "protection" "summit"
##  $ 4   :'data.frame':    7 obs. of  1 variable:
##   ..$ word: chr [1:7] "air" "nz" "staff" "aust" ...
##  $ 5   :'data.frame':    6 obs. of  1 variable:
##   ..$ word: chr [1:6] "air" "nz" "strike" "affect" ...
##   [list output truncated]

As we can see, each headline is associated with a list of tokens. The sentiment of a headline is computed based on the sum of the positive/negative scores of its tokens.

single_news_tokens[[1]]
##           word
## 1          aba
## 2      decides
## 3    community
## 4 broadcasting
## 5      licence
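To make this concrete, one can preview the computation on a single headline by joining its tokens with a lexicon and tallying the matches (a quick sketch; the Bing lexicon used here is introduced formally in the next section):

# tokens of the second headline matched against the Bing lexicon;
# the headline score will be (# positive) - (# negative) matches
single_news_tokens[[2]] %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)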

Bing lexicon

In this paragraph, the computation of the sentiment associated with the token lists is shown for the Bing lexicon. I first define a function named compute_sentiment() whose purpose is to output the positivity score of a specific headline.

compute_sentiment <- function(d) {
  # no lexicon matches: the headline sentiment is undefined
  if (nrow(d) == 0) {
    return(NA)
  }
  # positivity score = (# positive matches) - (# negative matches)
  neg_score <- d %>% filter(sentiment == "negative") %>% nrow()
  pos_score <- d %>% filter(sentiment == "positive") %>% nrow()
  pos_score - neg_score
}
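As a quick sanity check on a toy input (illustrative values only, not drawn from any lexicon):

toy <- data.frame(word = c("good", "bad", "awful"),
                  sentiment = c("positive", "negative", "negative"))
compute_sentiment(toy)  # 1 positive - 2 negative matches = -1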

The token list of each single headline is inner-joined with the Bing lexicon, and the result is given as input to the compute_sentiment() function to determine the sentiment score of each specific headline.

sentiments_bing <- get_sentiments("bing")
str(sentiments_bing)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6786 obs. of  2 variables:
##  $ word     : chr  "2-faces" "abnormal" "abolish" "abominable" ...
##  $ sentiment: chr  "negative" "negative" "negative" "negative" ...
single_news_sentiment_bing <- sapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_bing) %>% compute_sentiment()})

The result is a vector of integers where the element at the i-th position is the sentiment score associated with the i-th headline.

str(single_news_sentiment_bing)
##  Named int [1:1000] NA -1 1 -1 -1 2 0 NA NA NA ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...

Here is the summary; please note that:

  • the median is negative
  • NA's show up

summary(single_news_sentiment_bing)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -3.000  -1.000  -1.000  -0.475   1.000   2.000     520

Collecting the results in a data frame as follows.

single_news_sentiment_bing_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_bing)
head(single_news_sentiment_bing_df, 10)
##                                          headline_text score
## 1  aba decides against community broadcasting licence     NA
## 2       act fire witnesses must be aware of defamation    -1
## 3       a g calls for infrastructure protection summit     1
## 4             air nz staff in aust strike for pay rise    -1
## 5        air nz strike to affect australian travellers    -1
## 6                    ambitious olsson wins triple jump      2
## 7           antic delighted with record breaking barca     0
## 8    aussie qualifier stosur wastes four memphis match    NA
## 9         aust addresses un security council over iraq    NA
## 10         australia is locked into war timetable opp     NA

NRC lexicon

In this paragraph, the computation of the sentiment associated with the token lists is shown for the NRC lexicon. With respect to the previous analysis based on the Bing lexicon, some more pre-processing is needed, as explained in what follows. First we get the NRC sentiment lexicon and see which sentiments are present therein.

sentiments_nrc <- get_sentiments("nrc")
(unique_sentiments_nrc <- unique(sentiments_nrc$sentiment))
##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"

To obtain a positive/negative output, I define a mapping of the above-listed sentiments onto positive/negative labels as follows.

compute_pos_neg_sentiments_nrc <- function(the_sentiments_nrc) {
  s <- unique(the_sentiments_nrc$sentiment)
  # map each NRC category onto a coarse positive/negative label;
  # the labels below follow the order of the unique categories above
  df_sentiments <- data.frame(sentiment = s,
                              mapped_sentiment = c("positive", "negative", "negative", "negative",
                                                   "negative", "positive", "positive", "negative",
                                                   "positive", "positive"))
  ss <- the_sentiments_nrc %>% inner_join(df_sentiments)
  the_sentiments_nrc$sentiment <- ss$mapped_sentiment
  the_sentiments_nrc
}

nrc_sentiments_pos_neg_scale <- compute_pos_neg_sentiments_nrc(sentiments_nrc)
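We can sanity-check the rescaled lexicon by tabulating its labels; only the two coarse categories should remain:

# every NRC category is now mapped to either "positive" or "negative"
table(nrc_sentiments_pos_neg_scale$sentiment)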

The rescaled lexicon produced by the function above is inner-joined with each headline's token list, and the result is given as input to the compute_sentiment() function.

single_news_sentiment_nrc <- sapply(single_news_tokens, function(x) { x %>% inner_join(nrc_sentiments_pos_neg_scale) %>% compute_sentiment()})
str(single_news_sentiment_nrc)
##  Named int [1:1000] 1 -4 1 2 -2 2 4 NA 5 -2 ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...

Here is the summary; please note that:

  • the median is equal to zero
  • NA's show up

summary(single_news_sentiment_nrc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -9.0000 -2.0000  0.0000 -0.3742  2.0000  9.0000     257
single_news_sentiment_nrc_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_nrc)
head(single_news_sentiment_nrc_df, 10)
##                                          headline_text score
## 1  aba decides against community broadcasting licence      1
## 2       act fire witnesses must be aware of defamation    -4
## 3       a g calls for infrastructure protection summit     1
## 4             air nz staff in aust strike for pay rise     2
## 5        air nz strike to affect australian travellers    -2
## 6                    ambitious olsson wins triple jump      2
## 7           antic delighted with record breaking barca     4
## 8    aussie qualifier stosur wastes four memphis match    NA
## 9         aust addresses un security council over iraq     5
## 10         australia is locked into war timetable opp     -2

AFINN lexicon

In this paragraph, the computation of the sentiment associated with the token lists is shown for the AFINN lexicon.

sentiments_afinn <- get_sentiments("afinn")
colnames(sentiments_afinn) <- c("word", "sentiment")
str(sentiments_afinn)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2477 obs. of  2 variables:
##  $ word     : chr  "abandon" "abandoned" "abandons" "abducted" ...
##  $ sentiment: num  -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   word = col_character(),
##   ..   value = col_double()
##   .. )

As we can see, the AFINN lexicon provides a numeric score for each token. We just need to sum up the scores of each headline's tokens to obtain the sentiment score of the headline under analysis.

single_news_sentiment_afinn_df <- lapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_afinn)})
single_news_sentiment_afinn <- sapply(single_news_sentiment_afinn_df, function(x) { 
      ifelse(nrow(x) > 0, sum(x$sentiment), NA)
  })
str(single_news_sentiment_afinn)
##  Named num [1:1000] NA -2 NA -2 -1 6 3 NA NA -2 ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
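As a side note, the same aggregation could be written by binding the per-headline token lists into one data frame. This is a sketch of an alternative formulation, not the code used for the results above; note that it silently drops headlines with no AFINN matches instead of marking them NA:

# .id records which headline each token came from
afinn_alt <- bind_rows(single_news_tokens, .id = "headline_id") %>%
  inner_join(sentiments_afinn, by = "word") %>%
  group_by(headline_id) %>%
  summarise(score = sum(sentiment))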

Here is the summary; please note that:

  • the median is negative
  • NA's show up

summary(single_news_sentiment_afinn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -9.000  -3.000  -2.000  -1.148   1.000   7.000     508
single_news_sentiment_afinn_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_afinn)
head(single_news_sentiment_afinn_df, 10)
##                                          headline_text score
## 1  aba decides against community broadcasting licence     NA
## 2       act fire witnesses must be aware of defamation    -2
## 3       a g calls for infrastructure protection summit     NA
## 4             air nz staff in aust strike for pay rise    -2
## 5        air nz strike to affect australian travellers    -1
## 6                    ambitious olsson wins triple jump      6
## 7           antic delighted with record breaking barca     3
## 8    aussie qualifier stosur wastes four memphis match    NA
## 9         aust addresses un security council over iraq    NA
## 10         australia is locked into war timetable opp     -2

Comparing results

Having obtained three potential sentiment evaluations for each headline, we would like to compare their congruence. By congruence we mean that all three lexicons express the same positive or negative verdict, in other words the same score sign, independently of its magnitude. If NA values are present, the congruence is computed as long as at least two non-NA values are available; otherwise it is equal to NA.

Furthermore, we compute the final sentiment of each headline based upon the sum of the three lexicon scores.

compute_congruence <- function(x, y, z) {
  v <- c(sign(x), sign(y), sign(z))
  # if at most one lexicon reports a score, we cannot check for congruence
  if (sum(is.na(v)) >= 2) {
    return (NA)
  }
  # removing NA values; a zero (neutral) sign always breaks congruence,
  # since the sum of signs can then never reach the vector length
  v <- na.omit(v)
  v_sum <- sum(v)
  abs(v_sum) == length(v)
}

compute_final_sentiment <- function(x,y,z) {
  if (is.na(x) && is.na(y) && is.na(z)) {
    return (NA)
  }

  s <- sum(x, y, z, na.rm=TRUE)
  # positive sentiments have score strictly greater than zero
  # negative sentiments have score strictly less than zero
  # neutral sentiments have score equal to zero 
  ifelse(s > 0, "positive", ifelse(s < 0, "negative", "neutral"))
}
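A few hand-picked score triples illustrate how the two helpers behave (illustrative values only):

compute_congruence(1, 2, 3)     # all signs agree            -> TRUE
compute_congruence(-1, 2, NA)   # two scores, opposite signs -> FALSE
compute_congruence(1, NA, NA)   # a single score             -> NA

compute_final_sentiment(-1, 2, NA)   # sum is  1 -> "positive"
compute_final_sentiment(-1, -4, -2)  # sum is -7 -> "negative"
compute_final_sentiment(NA, NA, NA)  # no scores -> NA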

news_sentiments_results <- data.frame(headline_text = news_df_subset$headline_text, 
                                      bing_score = single_news_sentiment_bing, 
                                      nrc_score = single_news_sentiment_nrc, 
                                      afinn_score = single_news_sentiment_afinn,
                                      stringsAsFactors = FALSE)

news_sentiments_results <- news_sentiments_results %>% rowwise() %>% 
  mutate(final_sentiment = compute_final_sentiment(bing_score, nrc_score, afinn_score),
         congruence = compute_congruence(bing_score, nrc_score, afinn_score))
 
head(news_sentiments_results, 10)
##                                          headline_text bing_score nrc_score afinn_score
## 1  aba decides against community broadcasting licence          NA         1          NA
## 2       act fire witnesses must be aware of defamation         -1        -4          -2
## 3       a g calls for infrastructure protection summit          1         1          NA
## 4             air nz staff in aust strike for pay rise         -1         2          -2
## 5        air nz strike to affect australian travellers         -1        -2          -1
## 6                    ambitious olsson wins triple jump           2         2           6
## 7           antic delighted with record breaking barca          0         4           3
## 8    aussie qualifier stosur wastes four memphis match         NA        NA          NA
## 9         aust addresses un security council over iraq         NA         5          NA
## 10         australia is locked into war timetable opp          NA        -2          -2

It would be useful to replace the numeric scores with the same {negative, neutral, positive} scale.

replace_score_with_sentiment <- function(v_score) {
  # nested ifelse maps the sign of each score to a label while keeping NA as NA
  ifelse(v_score > 0, "positive", ifelse(v_score < 0, "negative", "neutral"))
}
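A quick check on a toy score vector (illustrative values only):

replace_score_with_sentiment(c(-3, 0, 2, NA))
## [1] "negative" "neutral"  "positive" NA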

news_sentiments_results$bing_score <- replace_score_with_sentiment(news_sentiments_results$bing_score)
news_sentiments_results$nrc_score <- replace_score_with_sentiment(news_sentiments_results$nrc_score)
news_sentiments_results$afinn_score <- replace_score_with_sentiment(news_sentiments_results$afinn_score)

news_sentiments_results[,2:5] <- lapply(news_sentiments_results[,2:5], as.factor)

head(news_sentiments_results, 10)
##                                          headline_text bing_score nrc_score afinn_score
## 1  aba decides against community broadcasting licence          NA  positive          NA
## 2       act fire witnesses must be aware of defamation   negative  negative    negative
## 3       a g calls for infrastructure protection summit   positive  positive          NA
## 4             air nz staff in aust strike for pay rise   negative  positive    negative
## 5        air nz strike to affect australian travellers   negative  negative    negative
## 6                    ambitious olsson wins triple jump   positive  positive    positive
## 7           antic delighted with record breaking barca    neutral  positive    positive
## 8    aussie qualifier stosur wastes four memphis match         NA        NA          NA
## 9         aust addresses un security council over iraq         NA  positive          NA
## 10         australia is locked into war timetable opp          NA  negative    negative

Cross-tabulations of each lexicon's resulting sentiment against the final sentiment are shown here below.

table(news_sentiments_results$bing_score, news_sentiments_results$final_sentiment, dnn = c("bing", "final"))
##           final
## bing       negative neutral positive
##   negative      278      15       14
##   neutral        16       6       11
##   positive        6       7      127
table(news_sentiments_results$nrc_score, news_sentiments_results$final_sentiment, dnn = c("nrc", "final"))
##           final
## nrc        negative neutral positive
##   negative      353      10        4
##   neutral        18      13        6
##   positive       25      16      298
table(news_sentiments_results$afinn_score, news_sentiments_results$final_sentiment, dnn = c("afinn", "final"))
##           final
## afinn      negative neutral positive
##   negative      326      10       12
##   neutral         3       1        6
##   positive        4       9      121

The cross-tabulation of congruence against the final sentiment is shown here below.

table(news_sentiments_results$congruence, news_sentiments_results$final_sentiment, dnn = c("congruence", "final"))
##           final
## congruence negative neutral positive
##      FALSE       67      33       45
##      TRUE       292       0      132
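From the table above, the overall agreement rate can also be read off directly; with 292 + 132 = 424 congruent headlines out of the 569 for which congruence is computable, it works out to roughly 0.75:

# share of headlines (with a computable congruence) on which the lexicons agree
mean(news_sentiments_results$congruence, na.rm = TRUE)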

Conclusions

We analyzed news headlines to determine their sentiment, taking advantage of three sentiment lexicons, and showed some basics of the methodology for that purpose. We also had the chance to compare the results obtained across the three lexicons and to set forth a final sentiment evaluation. If you are interested in learning much more about text analysis, see ref. [4].

References

[1] NRC sentiment words baseline [http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm]

[2] BING sentiment words baseline [https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html]

[3] AFINN sentiment words baseline [https://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010]

[4] Text mining with R [https://www.tidytextmining.com/]
