Predicting the Delhi Election using Twitter Data

Introduction

Over the past two weeks, the 2015 Delhi Assembly elections have changed the way political battles between individuals and parties play out in India. Much of that battle was fought on Twitter, which saw a record number of tweets on voting day, February 7. #DelhiVotes was the top trending hashtag during the day.
The idea is to analyze the sentiment of tweets containing hashtags and words such as '#Kejriwal', 'Kejriwal', '@AamAadmiParty', '#KiranBedi', 'KiranBedi', '@BJPDelhiState', '#DelhiDecides' and 'DelhiVotes'. The next step is to identify whether a tweet expresses a positive or a negative sentiment about a particular candidate.

Sentiment analysis

Sentiment analysis is an active area of research in text mining that deals with the opinions, sentiments and subjectivity expressed in text.

Examples:

Politics: What do people think about this candidate or issue?
Products: What do people think about the new iPhone?
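
To make the idea concrete, here is a minimal sketch of polarity scoring by counting words. It is not the method used later in this article (the appendix uses the naive Bayes classifiers from the sentiment package); the word lists, the function name score_polarity and the sample tweets are all made up for illustration.

```r
# Tiny hand-made word lists, for illustration only (hypothetical, not a real lexicon)
positive_words <- c("good", "great", "honest", "hope", "win")
negative_words <- c("bad", "corrupt", "fail", "liar", "worst")

# Score = number of positive words minus number of negative words
score_polarity <- function(text) {
  # lower-case the text and split it on anything that is not a letter
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

# Made-up sample tweets
tweets <- c("Great rally today, real hope for Delhi",
            "Worst campaign ever, corrupt and bad",
            "Voting starts at 8 am")

labels <- sapply(tweets, score_polarity, USE.NAMES = FALSE)
print(labels)
```

A word-count approach like this ignores negation and sarcasm, which is one reason a trained classifier is usually preferred for real tweets.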

Building a data set

Before analyzing Twitter data, we need to obtain it. You need a Twitter developer account to pull the data. After creating the account, you authenticate your application with Twitter through its application programming interface (API), which allows you to mine tweets.

After connecting to Twitter, you specify the keywords you want information about: '#Kejriwal', 'Kejriwal', '@AamAadmiParty', '#KiranBedi', 'KiranBedi', '@BJPDelhiState'. I collected 50,932 tweets for both parties, covering 31st January 2015 to 5th February 2015 (two days before the election date). I then removed duplicate tweets, since many are retweets.
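
The merge-and-deduplicate step can be sketched in base R as follows. The two batches below are hypothetical stand-ins for the results of separate keyword searches; the tweet texts are invented.

```r
# Hypothetical batches, standing in for the results of two keyword searches
batch_hashtag <- c("RT @user: Delhi votes on Saturday", "Big rally in Delhi today")
batch_handle  <- c("RT @user: Delhi votes on Saturday", "Counting is next week")

# Join the batches, then keep only the first occurrence of each text
all_tweets    <- c(batch_hashtag, batch_handle)
unique_tweets <- all_tweets[!duplicated(all_tweets)]

print(length(all_tweets))
print(length(unique_tweets))
```

Note that duplicated() only catches exact matches, so a retweet with even slightly edited text would survive this step.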

The detailed R code is shown in the later portion of this article.


Making data ready to use

  1. Extract the text content of tweets
  2. Eliminate extra white-spaces
  3. Convert text to lower case
  4. Remove stopwords
  5. Build your own stopword list specifically for this data set
  6. Remove punctuation symbols
  7. Remove numbers
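
As a rough illustration of these steps on a single made-up tweet, here is a small base R sketch. The function name clean_tweet and the stopword list my_stopwords are hypothetical, the steps run in a slightly different order than listed, and link removal is an extra step; the full cleaning function actually used is in the appendix.

```r
# Hypothetical custom stopword list for this example
my_stopwords <- c("the", "is", "a", "on", "in", "rt", "delhi")

clean_tweet <- function(x) {
  x <- tolower(x)                    # convert to lower case
  x <- gsub("http\\S+", " ", x)      # drop links before punctuation is stripped
  x <- gsub("[[:punct:]]", " ", x)   # remove punctuation symbols
  x <- gsub("[[:digit:]]", " ", x)   # remove numbers
  words <- strsplit(x, "\\s+")[[1]]  # splitting on whitespace also removes extra spaces
  words <- words[!(words %in% c("", my_stopwords))]  # remove stopwords
  paste(words, collapse = " ")
}

cleaned <- clean_tweet("RT  The BIG rally is on 7 Feb in Delhi!! http://t.co/abc123")
print(cleaned)
```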

Findings

1. The word "isbaaraap" occurred most often, followed by "delhimodipmbedicm".




2. Wordcloud comparing the frequencies of words between BJP and AAP.



3. Sentiment Analysis of Tweets by Emotional Categories




4. Final Sentiment Analysis of Tweets - AAP has greater positive sentiments than BJP.
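
Findings 1 and 2 both rest on a term-by-party count matrix. Here is a rough base R sketch of that idea; the two toy documents and their words are made up, and the actual analysis in the appendix builds the matrix with TermDocumentMatrix from the tm package.

```r
# Two toy "documents", one per party (invented words, for illustration only)
docs <- c(AAP = "water power water corruption",
          BJP = "development power development")

tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))

# Count every term in every document: rows are terms, columns are parties
tdm <- sapply(tokens, function(w) as.integer(table(factor(w, levels = terms))))
rownames(tdm) <- terms
print(tdm)

# Overall word frequencies, as used for the frequency chart
word_freqs <- sort(rowSums(tdm), decreasing = TRUE)
print(word_freqs)

# Terms used by one party only, which is what a comparison cloud highlights
aap_only <- rownames(tdm)[tdm[, "BJP"] == 0]
print(aap_only)
```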


Summary:

Since AAP has more positive sentiment than BJP on Twitter, it is likely to get a higher share of the votes. This analysis cannot say anything about the number of seats each party will win.
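
When comparing parties it helps to look at the share of positive tweets rather than raw counts, because the number of tweets collected per party can differ. A small sketch with invented labels (the data frame toy is hypothetical, not the real results):

```r
# Hypothetical classified tweets; the labels are made up for illustration
toy <- data.frame(
  Candidate = c("AAP", "AAP", "AAP", "AAP", "BJP", "BJP", "BJP", "BJP"),
  polarity  = c("positive", "positive", "positive", "negative",
                "positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert each party's row to proportions
counts <- table(toy$Candidate, toy$polarity)
shares <- prop.table(counts, margin = 1)

print(round(shares[, "positive"], 2))
```

With the real data, toy would be replaced by the combined data frame of both parties that has a Candidate and a polarity column.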

Appendix:

Install the following R packages
install.packages("twitteR")
install.packages("wordcloud")
install.packages("plyr")
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")
install.packages("reshape2")
install.packages("tm")

Detailed R Code
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(reshape2)
library(twitteR)
library(wordcloud)
library(tm)

# The sentiment package is not available on CRAN; install it from the CRAN archive.
# It depends on the Rstem package, so install Rstem before sentiment.

install.packages("devtools")
require(devtools)
install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
require(sentiment)
ls("package:sentiment")

# Fetching all the tweets takes several search calls; not every call is shown here
Kejriwal_tweets = searchTwitter("#Kejriwal",since="2015-01-31",until="2015-02-05", n=1500,lang="en",cainfo="cacert.pem")
Kejriwal_tweets2 = searchTwitter("@AamAadmiParty",since="2015-01-31",until="2015-02-05", n=1500,lang="en",cainfo="cacert.pem")

bedi_tweets = searchTwitter("@BJPDelhiState", since="2015-01-31",until="2015-02-05",n=1500, lang="en",cainfo="cacert.pem")
bedi_tweets2 = searchTwitter("#KiranBedi", since="2015-01-31",until="2015-02-05",n=1500, lang="en",cainfo="cacert.pem")

# get the text
Kejriwal_txt = sapply( unlist(Kejriwal_tweets) , function(x) '$'( x , "text"))
Kejriwal_txt2 = sapply( unlist(Kejriwal_tweets2) , function(x) '$'( x , "text"))

bedi_txt = sapply( unlist(bedi_tweets) , function(x) '$'( x , "text"))
bedi_txt2 = sapply( unlist(bedi_tweets2) , function(x) '$'( x , "text"))

# how many tweets of each keyword
nd = c(length(Kejriwal_txt), length(Kejriwal_txt2), length(bedi_txt), length(bedi_txt2))

# join texts
Kejriwal_txt= c(Kejriwal_txt, Kejriwal_txt2)
bedi_txt= c(bedi_txt, bedi_txt2)


# Remove the duplicated tweets
Kejriwal_txt <- Kejriwal_txt[!duplicated(Kejriwal_txt)]
bedi_txt <- bedi_txt[!duplicated(bedi_txt)]

# how many unique tweets of each keyword
nd1 = c(length(Kejriwal_txt), length(bedi_txt))
nd1

# clean text function
clean.text <- function(some_txt)
{  # remove retweet markers, @mentions, punctuation and digits
   some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
   some_txt = gsub("@\\w+", "", some_txt)
   some_txt = gsub("[[:punct:]]", "", some_txt)
   some_txt = gsub("[[:digit:]]", "", some_txt)
   # links have already lost their punctuation at this point (e.g. "httptcoabc")
   some_txt = gsub("http\\w+", "", some_txt)
   # collapse runs of spaces/tabs into one space so that neighbouring words are not joined
   some_txt = gsub("[ \t]{2,}", " ", some_txt)
   # trim leading and trailing whitespace
   some_txt = gsub("^\\s+|\\s+$", "", some_txt)
   # remove non-ASCII characters
   some_txt = gsub("[^\x20-\x7E]", "", some_txt)

   # define "tolower error handling" function: returns NA if tolower fails
   try.tolower = function(x)
   {  y = NA
      try_error = tryCatch(tolower(x), error=function(e) e)
      if (!inherits(try_error, "error"))
         y = tolower(x)
      return(y)
   }

   some_txt = sapply(some_txt, try.tolower)
   # drop tweets that are empty after cleaning
   some_txt = some_txt[some_txt != ""]
   names(some_txt) = NULL
   return(some_txt)}

# clean text
Kejriwal_clean = clean.text(Kejriwal_txt)
bedi_clean = clean.text(bedi_txt)

# join cleaned texts in a single vector
Kejriwals = paste(Kejriwal_clean, collapse=" ")
bedis = paste(bedi_clean, collapse=" ")
kej_bed = c(Kejriwals, bedis)

# Corpus
kb_corpus = Corpus(VectorSource(kej_bed))

# Possible additions to the stopword list below: "delhiwithmodi", "modipmbedicm"

# remove stopwords
skipwords = c(stopwords("english"), "CM", "Chief Minister","year","years", "yes","bjp","aap","amp","delhipolls","delhielections",
"ravishaskskejriwal","delhi","elections","election","kejriwal", "kejriwals", "kiran","bedi", "todays", "reads", "live", "watch",
"zee","star","ndtv","congress","will","can","must","money","many","make","say","says","cant","kiranbedi","arvind","delhielection",
"arvindkejriwal","party","vote","even","now","namo","modi","nota","notamensrights","hey","world","class","create","men",
"vihar","sure","every","day","dont","get","media","one","see","said","feb","like",
"use","together")

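# Control options for TermDocumentMatrix. Current tm releases document
# wordLengths = c(4, Inf) and stemming = TRUE in place of minWordLength and stemDocument.
kb.tf <- list(weighting = weightTf, stopwords  = skipwords,
              removePunctuation = TRUE,
              tolower = TRUE,
              minWordLength = 4,
              removeNumbers = TRUE, stripWhitespace = TRUE,
              stemDocument= TRUE)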

# term-document matrix
tdm = TermDocumentMatrix(kb_corpus, control = kb.tf)

# convert as matrix
tdm = as.matrix(tdm)

# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE) 

# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)

p <- ggplot(subset(dm, freq>20), aes(word, freq))
p <-p+ geom_bar(stat="identity")
p <-p+ theme(axis.text.x=element_text(angle=45, hjust=1))

png("hist.png", 480,480)
print(p)
dev.off()

dev.new()

wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(6, "Dark2"),min.freq=10, scale=c(4,.2),rot.per=.15,max.words=80)

# add column names
colnames(tdm) = c("AAP","BJP")

#write.csv(tdm,"matrix.csv")

# comparison cloud
png(file="KejriwalvsBedi.png",height=600,width=1200)
par(mfrow=c(1,2))

comparison.cloud(tdm, random.order=FALSE, colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),title.size=1.5, max.words=100, scale=c(4,.2),rot.per=.15)
dev.off()

# commonality cloud
png(file="Common.png",height=600,width=1200)
par(mfrow=c(1,2))

commonality.cloud(tdm, random.order=FALSE, colors = brewer.pal(8, "Dark2"), max.words=100)
dev.off()

# Sentiment analysis code starts here
# run model
bjp_class_emo = classify_emotion(bedi_clean, algorithm="bayes", prior=1.0)

# Column 7 holds the best_fit emotion category, which is used here; readers are encouraged to try the other columns as well.
emotion = bjp_class_emo[,7]

# Replace NAs (if any were generated during classification) with "unknown"
emotion[is.na(emotion)] = "unknown"

# Polarity classification
bjp_class_pol = classify_polarity(bedi_clean, algorithm="bayes")

# Column 4 holds the best_fit polarity category, which is used here; readers are encouraged to try the other columns as well.
polarity = bjp_class_pol[,4]

# Let us now create a data frame with the above results obtained and rearrange data for plotting purposes
# creating data frame using emotion category and polarity results earlier obtained

sentiment_dataframe = data.frame(text=bedi_clean, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)

# rearrange data inside the frame by sorting it
sentiment_dataframe = within(sentiment_dataframe, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

write.csv(sentiment_dataframe,"BJP.csv")

sentiment_dataframe=read.csv("BJP.csv")

# In the next step we will plot the obtained results (in data frame)

# First let us plot the distribution of emotions according to emotion categories
# We will use ggplot function from ggplot2 Package (for more look at the help on ggplot) and RColorBrewer Package

ggplot(sentiment_dataframe, aes(x=emotion)) + geom_bar(aes(y=..count.., fill=emotion)) +
scale_fill_brewer(palette="Dark2") + ggtitle('Sentiment Analysis of Tweets on Twitter about BJP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotion Categories')

ggplot(sentiment_dataframe, aes(x=polarity))+geom_bar(aes(y=..count.., fill=polarity)) +
scale_fill_brewer(palette="RdGy") + ggtitle('Sentiment Analysis of Tweets on Twitter about BJP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Polarity Categories')

# Sentiment Analysis - AAM AADMI PARTY
# run model

Kejriwal_class_emo = classify_emotion(Kejriwal_clean, algorithm="bayes", prior=1.0)

# Use the best_fit emotion category (column 7), as for BJP above
emotion1 = Kejriwal_class_emo[,7]

# Replace NAs (if any were generated during classification) with "unknown"
emotion1[is.na(emotion1)] = "unknown"

# As above, classify the polarity of the text.
# classify_polarity returns four columns:
#   pos      - the absolute log likelihood of the document expressing a positive sentiment
#   neg      - the absolute log likelihood of the document expressing a negative sentiment
#   pos/neg  - the ratio of the two; 1 indicates a neutral sentiment, less than 1 a negative
#              sentiment, and greater than 1 a positive sentiment
#   best_fit - the most likely sentiment category (e.g. positive, negative, neutral) for the text

Kejriwal_class_pol = classify_polarity(Kejriwal_clean, algorithm="bayes")

# Use the best_fit polarity category (column 4)
polarity1 = Kejriwal_class_pol[,4]

# Create a data frame from the emotion and polarity results and rearrange it for plotting

sentiment_dataframe = data.frame(text=Kejriwal_clean, emotion=emotion1, polarity=polarity1, stringsAsFactors=FALSE)

# order the emotion levels by frequency
sentiment_dataframe = within(sentiment_dataframe, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

# Plot the distribution of tweets across emotion categories
# This uses ggplot from the ggplot2 package (see the help on ggplot for more) and the RColorBrewer palettes

ggplot(sentiment_dataframe, aes(x=emotion)) + geom_bar(aes(y=..count.., fill=emotion)) +
scale_fill_brewer(palette="Dark2") +
ggtitle('Sentiment Analysis of Tweets on Twitter about AAP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotion Categories')

write.csv(sentiment_dataframe,"AAP_Data.csv")

# Combine both parties into one data frame and label each row with its party
bjp_df = read.csv("BJP.csv", stringsAsFactors=FALSE)
aap_df = read.csv("AAP_Data.csv", stringsAsFactors=FALSE)
bjp_df$Candidate = "BJP"
aap_df$Candidate = "AAP"
cols = c("text", "emotion", "polarity", "Candidate")
sentiment_dataframe = rbind(bjp_df[, cols], aap_df[, cols])

ggplot(sentiment_dataframe, aes(x=factor(polarity), fill=Candidate)) + geom_bar(position="dodge")+
scale_fill_brewer(palette="Dark2") +
ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Sentiments')

ggplot(sentiment_dataframe, aes(x=factor(emotion), fill=Candidate)) + geom_bar(position="dodge")+
scale_fill_brewer(palette="Dark2") +
ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotional Categories')


About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like retail and commercial banking, Telecom, HR and Automotive.



20 Responses to "Predicting the Delhi Election using Twitter Data"

  1. Excellent Work... Keep Going :-)

  2. can u make same using SAS ?

  3. Wooww. Very helpful post. Great work. Thanks for share it with us.

  4. Durgesh Samariya, 18 October 2015 at 03:39

    I tried this code but i get one error,
    the package "sentiment" not able to install. shows its not available for R version 3.2.1.

    Please solve this problem

    Thanks in advance

    Replies
    1. Sentiment is not available on cran.
      install.packages("devtools")
      require(devtools)
      install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
      require(sentiment)
      ls("package:sentiment")

    2. I did that thing but showing same error

      ERROR: dependency ‘Rstem’ is not available for package ‘sentiment’

    3. Install package 'Rstem' prior to installing package sentiment. That's what error is talking about.

  5. Thank You for solving all problem,
    I have one more problem.

    on last graph

    ggplot(sentiment_dataframe, aes(x=factor(emotion), fill=Candidate)) + geom_bar(position="dodge")+
    scale_fill_brewer(palette="Dark2") +
    ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
    theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotional Categories')

    Showing error
    object 'Candidate' not found

    and i check your code you not define Candidate code

    and generate blank graph.

    Please solve this.

    Thank you in advance

  6. Great work and i love it. However in the last but one code, the "Polarity" has to be with a small "p". Also the definition of "Candidate" is missing in the code. I would be glad if you could amend the code.

    ggplot(sentiment_dataframe, aes(x=factor(Polarity), fill=Candidate)) + geom_bar(position="dodge")+
    scale_fill_brewer(palette="Dark2") +
    ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
    theme(legend.position='right') + ylab('Number of Tweets') + xlab('Sentiments')

  7. Please help me . How to define "candidate".please write code for that

    Replies
    1. Hello, this is what i did to define my "Candidate" variable:

      sentiment_dataframe1$Candidate <- "AAP"
      sentiment_dataframe$Candidate <- "BJP"

      compare.df = merge(sentiment_dataframe, sentiment_dataframe1, all = T)
      #, by = c("AAP", "BJP"))
      #, suffixes = c(".sentiment_dataframe"
      # , ".sentiment_dataframe1"))
      head(compare.df)

      j = ggplot(compare.df, aes(x = factor(emotion), fill = Candidate))
      j = j + geom_bar(position = "dodge")
      j = j + scale_fill_brewer(palette = "Dark2")
      j = j + ggtitle("Sentiment Analysis of Tweets - AAP vs BJP")
      j = j + theme(legend.position = "right")
      j = j + ylab("Number of Tweets") + xlab("Emotional Categories")
      print(j)

      k = ggplot(compare.df, aes(x = factor(polarity), fill = Candidate))
      k = k + geom_bar(position = "dodge")
      k = k + scale_fill_brewer(palette = "Dark2")
      k = k + ggtitle("Sentiment Analysis of Tweets - AAP vs BJP")
      k = k + theme(legend.position = "right")
      k = k + ylab("Number of Tweets") + xlab("Sentiments")
      print(k)

      #analsis without "unknown"
      compare.df %>%
      group_by(emotion) %>%
      filter(emotion != "unknown")%>%
      mutate(emotion1 = factor(emotion))%>%
      ggplot(aes(x = factor(emotion1), fill = Candidate)) +
      geom_bar(position = "dodge") +
      scale_fill_brewer(palette = "Dark2") +
      xlab("") + ylab("Number of Tweets") +
      ggtitle("Sentiment Analysis of Tweets - AAP vs BJP") +
      theme(legend.position = "right") +
      ylab("Number of Tweets") + xlab("Emotion Categories")


      Hope it helps. Enjoy

    2. thanks for your help. But there is one problem in ggplot of sentiment analysis of "emotion"

      For emotion "anger" ,results are shown for only one parameter"AAP". those of BJP are not shown.

    3. Hi, could it be that the number of tweets for anger of those of BJP are so small compared to all other emotions. Maybe you should check the scaling(xlim =c(,)) of your plot.

      Delete
    4. Thank you.Now I want to create some interface ,GUI for above code in R. How can I do that ?

  8. Thanxxx a lot . I want some GUI for above code.how can I get that ?

