Saturday, June 13, 2020

Text Analysis

News headlines text analysis

Introduction

In the present tutorial, I show an introductory text analysis of an ABC News headlines dataset. I will have a look at the most common words found therein and run a sentiment analysis on those headlines by taking advantage of the following sentiment lexicons:

  • NRC

  • Bing

  • AFINN

The NRC sentiment lexicon from Saif Mohammad and Peter Turney categorizes words into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust.

The Bing sentiment lexicon from Bing Liu and collaborators categorizes words into positive and negative categories.

The AFINN sentiment lexicon from Finn Årup Nielsen assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

For more information about those sentiment lexicons, see the references listed at the bottom.
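Before diving in, it may help to peek at the structure of each lexicon (a quick sketch; it assumes the lexicons have already been downloaded, as discussed in the Note further below):

library(tidytext)
library(textdata)

# NRC and Bing label each word with a sentiment category,
# while AFINN assigns it a numeric value between -5 and 5
head(get_sentiments("nrc"), 3)    # columns: word, sentiment
head(get_sentiments("bing"), 3)   # columns: word, sentiment
head(get_sentiments("afinn"), 3)  # columns: word, value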

Packages

I am going to take advantage of the following R packages.

suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidytext))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(textdata))
suppressPackageStartupMessages(library(widyr))
suppressPackageStartupMessages(library(ggplot2))

The package versions are listed here below.

packages <- c("stringr", "dplyr", "tidytext", "tidyr", "textdata", "widyr", "ggplot2")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages=packages, version = as.character(version_c))
##   packages version
## 1  stringr   1.4.0
## 2    dplyr   0.8.4
## 3 tidytext   0.2.2
## 4    tidyr   1.0.2
## 5 textdata   0.3.0
## 6    widyr   0.1.2
## 7  ggplot2   3.2.1

I am running the following R version on Windows 10.

R.version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.3                         
## year           2019                        
## month          03                          
## day            11                          
## svn rev        76217                       
## language       R                           
## version.string R version 3.5.3 (2019-03-11)
## nickname       Great Truth

Note

Before running this code, make sure you have downloaded the sentiment lexicons by executing the following operations:

get_sentiments("nrc")
get_sentiments("bing")
get_sentiments("afinn")

and accepting all prompts shown by the interactive menu.

Getting Data

I then download the news dataset, which contains more than one million headlines, from:

“https://www.kaggle.com/therohk/million-headlines/downloads/million-headlines.zip/7”
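Kaggle downloads require authentication, so I fetch the archive manually; the unpacking step can then be scripted (a minimal sketch, assuming the zip has been saved to the working directory as million-headlines.zip):

# extract only the CSV we need from the manually downloaded archive
unzip("million-headlines.zip", files = "abcnews-date-text.csv")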

Uncompressing it produces the abcnews-date-text.csv file. I load it into the news_data data frame and have a look at it.

news_data <- read.csv("abcnews-date-text.csv", header = TRUE, stringsAsFactors = FALSE)
dim(news_data)
## [1] 1103663       2
head(news_data)
##   publish_date                                       headline_text
## 1     20030219 aba decides against community broadcasting licence
## 2     20030219     act fire witnesses must be aware of defamation
## 3     20030219     a g calls for infrastructure protection summit
## 4     20030219           air nz staff in aust strike for pay rise
## 5     20030219      air nz strike to affect australian travellers
## 6     20030219                  ambitious olsson wins triple jump
tail(news_data)
##         publish_date                                                headline_text
## 1103658     20171231              stunning images from the sydney to hobart yacht
## 1103659     20171231 the ashes smiths warners near miss liven up boxing day test
## 1103660     20171231                     timelapse: brisbanes new year fireworks
## 1103661     20171231                    what 2017 meant to the kids of australia
## 1103662     20171231             what the papodopoulos meeting may mean for ausus
## 1103663     20171231   who is george papadopoulos the former trump campaign aide

Token Analysis

It is time to extract the tokens from our dataset. Selecting the column named headline_text and unnesting the word tokens produces the following.

news_df <- news_data %>% select(headline_text)
news_tokens <- news_df %>% unnest_tokens(word, headline_text)
head(news_tokens, 10)
##             word
## 1            aba
## 1.1      decides
## 1.2      against
## 1.3    community
## 1.4 broadcasting
## 1.5      licence
## 2            act
## 2.1         fire
## 2.2    witnesses
## 2.3         must
tail(news_tokens, 10)
##                   word
## 1103662.7        ausus
## 1103663            who
## 1103663.1           is
## 1103663.2       george
## 1103663.3 papadopoulos
## 1103663.4          the
## 1103663.5       former
## 1103663.6        trump
## 1103663.7     campaign
## 1103663.8         aide

It is interesting to generate and inspect a table reporting how many times each token shows up within the headlines and its proportion with respect to the total.

news_tokens_count <- news_tokens %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))

The top ten most frequent words are:

head(news_tokens_count, 10)
##      word      n  proportion
## 1      to 214201 0.030294784
## 2      in 135981 0.019232007
## 3     for 130239 0.018419907
## 4      of  80759 0.011421872
## 5      on  73037 0.010329738
## 6    over  50306 0.007114857
## 7     the  49810 0.007044707
## 8  police  35984 0.005089274
## 9      at  31723 0.004486634
## 10   with  29676 0.004197123

And the least frequent ones:

tail(news_tokens_count, 10)
##        word n   proportion
##       zweli 1 1.414316e-07
##  zwitkowsky 1 1.414316e-07
##     zydelig 1 1.414316e-07
##       zygar 1 1.414316e-07
##     zygiefs 1 1.414316e-07
##   zylvester 1 1.414316e-07
##       zynga 1 1.414316e-07
##     zyngier 1 1.414316e-07
##          zz 1 1.414316e-07
##         zzz 1 1.414316e-07

There is an issue with proceeding this way: the top of the list is dominated by words that play no relevant role in sentiment analysis, the so-called stop words. The stop words bundled with the tidytext package are shown here below.

data(stop_words)
head(stop_words, 10)
##    word        lexicon
## 1  a           SMART
## 2  a's         SMART
## 3  able        SMART
## 4  about       SMART
## 5  above       SMART
## 6  according   SMART
## 7  accordingly SMART
## 8  across      SMART
## 9  actually    SMART
## 10 after       SMART

To remove the stop words, we take advantage of the anti_join operation.

news_tokens_no_sp <- news_tokens %>% anti_join(stop_words)
head(news_tokens_no_sp, 10)
##            word
## 1           aba
## 2       decides
## 3     community
## 4  broadcasting
## 5       licence
## 6           act
## 7          fire
## 8     witnesses
## 9         aware
## 10   defamation

Then, I count the tokens again after having removed the stop words.

news_tokens_count <- news_tokens_no_sp %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))
head(news_tokens_count, 10)
##         word     n  proportion
## 1     police 35984 0.006732918
## 2       govt 16923 0.003166440
## 3      court 16380 0.003064840
## 4    council 16343 0.003057917
## 5  interview 15025 0.002811308
## 6       fire 13910 0.002602681
## 7        nsw 12912 0.002415947
## 8  australia 12353 0.002311353
## 9       plan 12307 0.002302746
## 10     water 11874 0.002221728
tail(news_tokens_count)
##        word n   proportion
##     zygiefs 1 1.871087e-07
##   zylvester 1 1.871087e-07
##       zynga 1 1.871087e-07
##     zyngier 1 1.871087e-07
##          zz 1 1.871087e-07
##         zzz 1 1.871087e-07

Then, I filter for the tokens having more than 8,000 occurrences.

news_token_over8000 <- news_tokens_count %>% filter(n > 8000) %>% mutate(word = reorder(word, n))
nrow(news_token_over8000)
## [1] 32
head(news_token_over8000, 10) 
##         word     n  proportion
## 1     police 35984 0.006732918
## 2       govt 16923 0.003166440
## 3      court 16380 0.003064840
## 4    council 16343 0.003057917
## 5  interview 15025 0.002811308
## 6       fire 13910 0.002602681
## 7        nsw 12912 0.002415947
## 8  australia 12353 0.002311353
## 9       plan 12307 0.002302746
## 10     water 11874 0.002221728
tail(news_token_over8000, 10) 
##        word    n  proportion
## 23      day 8818 0.001649924
## 24 hospital 8815 0.001649363
## 25      car 8690 0.001625974
## 26    coast 8411 0.001573771
## 27    calls 8401 0.001571900
## 28      win 8315 0.001555809
## 29    woman 8213 0.001536723
## 30   killed 8129 0.001521006
## 31  accused 8094 0.001514458
## 32    world 8087 0.001513148

It is interesting to show the proportions as per-thousand values by means of a bar plot.

news_token_over8000 %>%  
  ggplot(aes(word, proportion*1000, fill=ceiling(proportion*1000))) +
  geom_col() + xlab(NULL) + coord_flip() + theme(legend.position = "none")

News Sentiment Analysis

In this section, I focus on each single headline to evaluate its specific sentiment as determined by each lexicon. The output shall tell whether each specific headline carries a positive or negative sentiment.

head(news_df, 10)
##                                          headline_text
## 1  aba decides against community broadcasting licence
## 2       act fire witnesses must be aware of defamation
## 3       a g calls for infrastructure protection summit
## 4             air nz staff in aust strike for pay rise
## 5        air nz strike to affect australian travellers
## 6                    ambitious olsson wins triple jump
## 7           antic delighted with record breaking barca
## 8    aussie qualifier stosur wastes four memphis match
## 9         aust addresses un security council over iraq
## 10         australia is locked into war timetable opp

I will analyse only the first 1,000 headlines, just for computational time reasons. Their token lists are built as follows.

news_df_subset <- news_df[1:1000, , drop = FALSE]
tkn_l <- apply(news_df_subset, 1, function(x) {
  data.frame(headline_text = x, stringsAsFactors = FALSE) %>%
    unnest_tokens(word, headline_text)
})

Removing the stop words from the token list.

single_news_tokens <- lapply(tkn_l, function(x) {anti_join(x, stop_words)})
str(single_news_tokens, list.len = 5)
## List of 1000
##  $ 1   :'data.frame':    5 obs. of  1 variable:
##   ..$ word: chr [1:5] "aba" "decides" "community" "broadcasting" ...
##  $ 2   :'data.frame':    5 obs. of  1 variable:
##   ..$ word: chr [1:5] "act" "fire" "witnesses" "aware" ...
##  $ 3   :'data.frame':    4 obs. of  1 variable:
##   ..$ word: chr [1:4] "calls" "infrastructure" "protection" "summit"
##  $ 4   :'data.frame':    7 obs. of  1 variable:
##   ..$ word: chr [1:7] "air" "nz" "staff" "aust" ...
##  $ 5   :'data.frame':    6 obs. of  1 variable:
##   ..$ word: chr [1:6] "air" "nz" "strike" "affect" ...
##   [list output truncated]

As we can see, each headline is associated with a list of tokens. The sentiment of a headline is computed based on the sum of the positive/negative scores of its tokens.

single_news_tokens[[1]]
##           word
## 1          aba
## 2      decides
## 3    community
## 4 broadcasting
## 5      licence
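To make this concrete, one can preview the computation on a single headline by joining its tokens with a lexicon and tallying the matches (a quick sketch; the Bing lexicon used here is introduced formally in the next section):

# tokens of the second headline matched against the Bing lexicon;
# the headline score will be (# positive) - (# negative) matches
single_news_tokens[[2]] %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)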

Bing lexicon

In this paragraph, the computation of the sentiment associated with the token lists is shown for the Bing lexicon. I first define a function named compute_sentiment() whose purpose is to output the positivity score of a specific headline.

compute_sentiment <- function(d) {
  # no lexicon matches: the headline sentiment is undefined
  if (nrow(d) == 0) {
    return(NA)
  }
  # positivity score = (# positive matches) - (# negative matches)
  neg_score <- d %>% filter(sentiment == "negative") %>% nrow()
  pos_score <- d %>% filter(sentiment == "positive") %>% nrow()
  pos_score - neg_score
}
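As a quick sanity check on a toy input (illustrative values only, not drawn from any lexicon):

toy <- data.frame(word = c("good", "bad", "awful"),
                  sentiment = c("positive", "negative", "negative"))
compute_sentiment(toy)  # 1 positive - 2 negative matches = -1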

The token list of each single headline is inner-joined with the Bing lexicon, and the result is given as input to the compute_sentiment() function to determine the sentiment score of each specific headline.

sentiments_bing <- get_sentiments("bing")
str(sentiments_bing)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6786 obs. of  2 variables:
##  $ word     : chr  "2-faces" "abnormal" "abolish" "abominable" ...
##  $ sentiment: chr  "negative" "negative" "negative" "negative" ...
single_news_sentiment_bing <- sapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_bing) %>% compute_sentiment()})

The result is a vector of integers where the element at the i-th position is the sentiment score associated with the i-th headline.

str(single_news_sentiment_bing)
##  Named int [1:1000] NA -1 1 -1 -1 2 0 NA NA NA ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...

Here is the summary; please note that:

  • the median is negative
  • NA's show up

summary(single_news_sentiment_bing)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -3.000  -1.000  -1.000  -0.475   1.000   2.000     520

Collecting the results in a data frame as follows.

single_news_sentiment_bing_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_bing)
head(single_news_sentiment_bing_df, 10)
##                                          headline_text score
## 1  aba decides against community broadcasting licence     NA
## 2       act fire witnesses must be aware of defamation    -1
## 3       a g calls for infrastructure protection summit     1
## 4             air nz staff in aust strike for pay rise    -1
## 5        air nz strike to affect australian travellers    -1
## 6                    ambitious olsson wins triple jump      2
## 7           antic delighted with record breaking barca     0
## 8    aussie qualifier stosur wastes four memphis match    NA
## 9         aust addresses un security council over iraq    NA
## 10         australia is locked into war timetable opp     NA

NRC lexicon

In this paragraph, the computation of the sentiment associated with the token lists is shown for the NRC lexicon. With respect to the previous analysis based on the Bing lexicon, some more pre-processing is needed, as explained in what follows. First we get the NRC sentiment lexicon and see which sentiments are present therein.

sentiments_nrc <- get_sentiments("nrc")
(unique_sentiments_nrc <- unique(sentiments_nrc$sentiment))
##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"

To obtain a positive/negative output, I define a mapping of the above-listed sentiments onto positive/negative labels as follows.

compute_pos_neg_sentiments_nrc <- function(the_sentiments_nrc) {
  s <- unique(the_sentiments_nrc$sentiment)
  # map each NRC category onto a coarse positive/negative label;
  # the labels below follow the order of the unique categories above
  df_sentiments <- data.frame(sentiment = s,
                              mapped_sentiment = c("positive", "negative", "negative", "negative",
                                                   "negative", "positive", "positive", "negative",
                                                   "positive", "positive"))
  ss <- the_sentiments_nrc %>% inner_join(df_sentiments)
  the_sentiments_nrc$sentiment <- ss$mapped_sentiment
  the_sentiments_nrc
}

nrc_sentiments_pos_neg_scale <- compute_pos_neg_sentiments_nrc(sentiments_nrc)
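We can sanity-check the rescaled lexicon by tabulating its labels; only the two coarse categories should remain:

# every NRC category is now mapped to either "positive" or "negative"
table(nrc_sentiments_pos_neg_scale$sentiment)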

The rescaled lexicon produced by the function above is inner-joined with each headline's token list, and the result is given as input to the compute_sentiment() function.

single_news_sentiment_nrc <- sapply(single_news_tokens, function(x) { x %>% inner_join(nrc_sentiments_pos_neg_scale) %>% compute_sentiment()})
str(single_news_sentiment_nrc)
##  Named int [1:1000] 1 -4 1 2 -2 2 4 NA 5 -2 ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...

Here is the summary; please note that:

  • the median is equal to zero
  • NA's show up

summary(single_news_sentiment_nrc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -9.0000 -2.0000  0.0000 -0.3742  2.0000  9.0000     257
single_news_sentiment_nrc_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_nrc)
head(single_news_sentiment_nrc_df, 10)
##                                          headline_text score
## 1  aba decides against community broadcasting licence      1
## 2       act fire witnesses must be aware of defamation    -4
## 3       a g calls for infrastructure protection summit     1
## 4             air nz staff in aust strike for pay rise     2
## 5        air nz strike to affect australian travellers    -2
## 6                    ambitious olsson wins triple jump      2
## 7           antic delighted with record breaking barca     4
## 8    aussie qualifier stosur wastes four memphis match    NA
## 9         aust addresses un security council over iraq     5
## 10         australia is locked into war timetable opp     -2

AFINN lexicon

In this paragraph, the computation of the sentiment associated with the token lists is shown for the AFINN lexicon.

sentiments_afinn <- get_sentiments("afinn")
colnames(sentiments_afinn) <- c("word", "sentiment")
str(sentiments_afinn)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2477 obs. of  2 variables:
##  $ word     : chr  "abandon" "abandoned" "abandons" "abducted" ...
##  $ sentiment: num  -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   word = col_character(),
##   ..   value = col_double()
##   .. )

As we can see, the AFINN lexicon provides a numeric score for each token. We just need to sum up the scores of each headline's tokens to obtain the sentiment score of the headline under analysis.

single_news_sentiment_afinn_df <- lapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_afinn)})
single_news_sentiment_afinn <- sapply(single_news_sentiment_afinn_df, function(x) { 
      ifelse(nrow(x) > 0, sum(x$sentiment), NA)
  })
str(single_news_sentiment_afinn)
##  Named num [1:1000] NA -2 NA -2 -1 6 3 NA NA -2 ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
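As a side note, the same aggregation could be written by binding the per-headline token lists into one data frame. This is a sketch of an alternative formulation, not the code used for the results above; note that it silently drops headlines with no AFINN matches instead of marking them NA:

# .id records which headline each token came from
afinn_alt <- bind_rows(single_news_tokens, .id = "headline_id") %>%
  inner_join(sentiments_afinn, by = "word") %>%
  group_by(headline_id) %>%
  summarise(score = sum(sentiment))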

Here is the summary; please note that:

  • the median is negative
  • NA's show up

summary(single_news_sentiment_afinn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -9.000  -3.000  -2.000  -1.148   1.000   7.000     508
single_news_sentiment_afinn_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_afinn)
head(single_news_sentiment_afinn_df, 10)
##                                          headline_text score
## 1  aba decides against community broadcasting licence     NA
## 2       act fire witnesses must be aware of defamation    -2
## 3       a g calls for infrastructure protection summit     NA
## 4             air nz staff in aust strike for pay rise    -2
## 5        air nz strike to affect australian travellers    -1
## 6                    ambitious olsson wins triple jump      6
## 7           antic delighted with record breaking barca     3
## 8    aussie qualifier stosur wastes four memphis match    NA
## 9         aust addresses un security council over iraq    NA
## 10         australia is locked into war timetable opp     -2

Comparing results

Having obtained three potential sentiment evaluations for each headline, we would like to compare their congruence. By congruence we mean that all three lexicons express the same positive or negative verdict, in other words the same score sign, independently of its magnitude. If NA values are present, the congruence is computed as long as at least two non-NA values are available; otherwise it is equal to NA.

Furthermore, we compute the final sentiment of each headline based upon the sum of the three lexicon scores.

compute_congruence <- function(x, y, z) {
  v <- c(sign(x), sign(y), sign(z))
  # if at most one lexicon reports a score, we cannot check for congruence
  if (sum(is.na(v)) >= 2) {
    return (NA)
  }
  # removing NA values; a zero (neutral) sign always breaks congruence,
  # since the sum of signs can then never reach the vector length
  v <- na.omit(v)
  v_sum <- sum(v)
  abs(v_sum) == length(v)
}

compute_final_sentiment <- function(x,y,z) {
  if (is.na(x) && is.na(y) && is.na(z)) {
    return (NA)
  }

  s <- sum(x, y, z, na.rm=TRUE)
  # positive sentiments have score strictly greater than zero
  # negative sentiments have score strictly less than zero
  # neutral sentiments have score equal to zero 
  ifelse(s > 0, "positive", ifelse(s < 0, "negative", "neutral"))
}
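A few hand-picked score triples illustrate how the two helpers behave (illustrative values only):

compute_congruence(1, 2, 3)     # all signs agree            -> TRUE
compute_congruence(-1, 2, NA)   # two scores, opposite signs -> FALSE
compute_congruence(1, NA, NA)   # a single score             -> NA

compute_final_sentiment(-1, 2, NA)   # sum is  1 -> "positive"
compute_final_sentiment(-1, -4, -2)  # sum is -7 -> "negative"
compute_final_sentiment(NA, NA, NA)  # no scores -> NA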

news_sentiments_results <- data.frame(headline_text = news_df_subset$headline_text, 
                                      bing_score = single_news_sentiment_bing, 
                                      nrc_score = single_news_sentiment_nrc, 
                                      afinn_score = single_news_sentiment_afinn,
                                      stringsAsFactors = FALSE)

news_sentiments_results <- news_sentiments_results %>% rowwise() %>% 
  mutate(final_sentiment = compute_final_sentiment(bing_score, nrc_score, afinn_score),
         congruence = compute_congruence(bing_score, nrc_score, afinn_score))
 
head(news_sentiments_results, 10)
##                                          headline_text bing_score nrc_score afinn_score
## 1  aba decides against community broadcasting licence          NA         1          NA
## 2       act fire witnesses must be aware of defamation         -1        -4          -2
## 3       a g calls for infrastructure protection summit          1         1          NA
## 4             air nz staff in aust strike for pay rise         -1         2          -2
## 5        air nz strike to affect australian travellers         -1        -2          -1
## 6                    ambitious olsson wins triple jump           2         2           6
## 7           antic delighted with record breaking barca          0         4           3
## 8    aussie qualifier stosur wastes four memphis match         NA        NA          NA
## 9         aust addresses un security council over iraq         NA         5          NA
## 10         australia is locked into war timetable opp          NA        -2          -2

It would be useful to replace the numeric scores with the same {negative, neutral, positive} scale.

replace_score_with_sentiment <- function(v_score) {
  # nested ifelse maps the sign of each score to a label while keeping NA as NA
  ifelse(v_score > 0, "positive", ifelse(v_score < 0, "negative", "neutral"))
}
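A quick check on a toy score vector (illustrative values only):

replace_score_with_sentiment(c(-3, 0, 2, NA))
## [1] "negative" "neutral"  "positive" NA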

news_sentiments_results$bing_score <- replace_score_with_sentiment(news_sentiments_results$bing_score)
news_sentiments_results$nrc_score <- replace_score_with_sentiment(news_sentiments_results$nrc_score)
news_sentiments_results$afinn_score <- replace_score_with_sentiment(news_sentiments_results$afinn_score)

news_sentiments_results[,2:5] <- lapply(news_sentiments_results[,2:5], as.factor)

head(news_sentiments_results, 10)
##                                          headline_text bing_score nrc_score afinn_score
## 1  aba decides against community broadcasting licence          NA  positive          NA
## 2       act fire witnesses must be aware of defamation   negative  negative    negative
## 3       a g calls for infrastructure protection summit   positive  positive          NA
## 4             air nz staff in aust strike for pay rise   negative  positive    negative
## 5        air nz strike to affect australian travellers   negative  negative    negative
## 6                    ambitious olsson wins triple jump   positive  positive    positive
## 7           antic delighted with record breaking barca    neutral  positive    positive
## 8    aussie qualifier stosur wastes four memphis match         NA        NA          NA
## 9         aust addresses un security council over iraq         NA  positive          NA
## 10         australia is locked into war timetable opp          NA  negative    negative

Cross-tabulations of each lexicon's resulting sentiment against the final sentiment are shown here below.

table(news_sentiments_results$bing_score, news_sentiments_results$final_sentiment, dnn = c("bing", "final"))
##           final
## bing       negative neutral positive
##   negative      278      15       14
##   neutral        16       6       11
##   positive        6       7      127
table(news_sentiments_results$nrc_score, news_sentiments_results$final_sentiment, dnn = c("nrc", "final"))
##           final
## nrc        negative neutral positive
##   negative      353      10        4
##   neutral        18      13        6
##   positive       25      16      298
table(news_sentiments_results$afinn_score, news_sentiments_results$final_sentiment, dnn = c("afinn", "final"))
##           final
## afinn      negative neutral positive
##   negative      326      10       12
##   neutral         3       1        6
##   positive        4       9      121

The cross-tabulation of congruence against the final sentiment is shown here below.

table(news_sentiments_results$congruence, news_sentiments_results$final_sentiment, dnn = c("congruence", "final"))
##           final
## congruence negative neutral positive
##      FALSE       67      33       45
##      TRUE       292       0      132
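From the table above, the overall agreement rate can also be read off directly; with 292 + 132 = 424 congruent headlines out of the 569 for which congruence is computable, it works out to roughly 0.75:

# share of headlines (with a computable congruence) on which the lexicons agree
mean(news_sentiments_results$congruence, na.rm = TRUE)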

Conclusions

We analyzed news headlines to determine their sentiment, taking advantage of three sentiment lexicons, and showed some basics of the methodology for that purpose. We also had the chance to compare the results obtained across the three lexicons and to set forth a final sentiment evaluation. If you are interested in learning much more about text analysis, see ref. [4].

References

[1] NRC sentiment words baseline [http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm]

[2] BING sentiment words baseline [https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html]

[3] AFINN sentiment words baseline [https://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010]

[4] Text mining with R [https://www.tidytextmining.com/]
