Predicting the Delhi Election using Twitter Data

Introduction

Over the past two weeks, the Delhi Assembly Elections 2015 have redefined the way political battles between individuals and parties are fought in India. Twitter saw this battle play out on voting day: it was abuzz with activity, with a record number of tweets pouring in on February 7. The Delhi Assembly elections dominated Twitter, with #DelhiVotes the top trending hashtag during the day.

The idea is to analyze the sentiment of tweets containing hashtags and words such as '#Kejriwal', 'Kejriwal', '@AamAadmiParty', '#KiranBedi', 'KiranBedi', '@BJPDelhiState', '#DelhiDecides' and 'DelhiVotes'. The next step is to identify whether each tweet expresses a positive or a negative sentiment about a particular candidate.

Sentiment analysis

Sentiment analysis is an active area of research in text mining that deals with the opinions, sentiments and subjectivity expressed in text.

Examples:

Politics: What do people think about this candidate or issue?

Products: What do people think about the new iPhone?
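To make the idea concrete, here is a minimal sketch of polarity scoring with a tiny word list. It is only an illustration: the word lists and the `score_polarity` helper are hypothetical, and this is a simple lexicon count, not the naive Bayes classifier from the sentiment package used later in this article.

```r
# Toy lexicon-based polarity scorer (illustrative only; word lists are made up)
positive_words <- c("good", "great", "honest", "hope", "win")
negative_words <- c("bad", "corrupt", "fail", "liar", "lose")

score_polarity <- function(text) {
  # lower-case, keep only letters and spaces, then split into words
  words <- strsplit(gsub("[^a-z ]", "", tolower(text)), "\\s+")[[1]]
  score <- sum(words %in% positive_words) - sum(words %in% negative_words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

tweets <- c("Great rally, honest leader!", "Corrupt and bad campaign", "Voting today")
polarity <- vapply(tweets, score_polarity, character(1), USE.NAMES = FALSE)
print(polarity)
```

A tweet with more positive than negative words is labelled positive, and a tweet with no listed words is labelled neutral.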

Building a data set

Before analyzing Twitter data, we need to obtain it. You need a Twitter developer account to pull the data. After creating the account, you authenticate your application with Twitter through its application programming interface (API), which allows you to mine tweets.

After connecting to Twitter, you specify the keywords for which you want data: '#Kejriwal', 'Kejriwal', '@AamAadmiParty', '#KiranBedi', 'KiranBedi', '@BJPDelhiState'. I collected 50932 tweets posted from 31st January, 2015 to 5th February, 2015 (two days before the election date) for both parties. I then removed duplicate tweets, since some are retweets.

The detailed R code is shown later in this article.
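The de-duplication step can be sketched on a small sample as follows. The sample tweets are hypothetical; in the real analysis the input is the text of the fetched tweets.

```r
# Hypothetical sample; real input would be the text of the fetched tweets
tweets <- c("RT @user: Vote today", "Vote today",
            "RT @user: Vote today", "Results soon")

# duplicated() flags every repeat of an earlier element, so negating it keeps first occurrences
unique_tweets <- tweets[!duplicated(tweets)]
print(length(unique_tweets))
```

Note that this only removes tweets whose text is exactly identical. A retweet that carries an "RT @user:" prefix is still counted separately from the original tweet.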

Making data ready to use

Extract the text content of tweets
Eliminate extra white-spaces
Convert text to lower case
Remove stopwords
Build your own stopwords list tailored to this data set
Remove punctuation symbols
Remove numbers
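The steps above can be sketched on a single sample tweet using base R only. The `clean_tweet` and `remove_stopwords` helpers, the sample tweet and the short stopword list are all hypothetical; the full cleaning function used for the analysis appears in the appendix.

```r
clean_tweet <- function(x) {
  x <- gsub("http\\S+", " ", x)        # drop links
  x <- gsub("@\\w+", " ", x)           # drop @mentions
  x <- gsub("[[:punct:]]", " ", x)     # drop punctuation (including '#')
  x <- gsub("[[:digit:]]", " ", x)     # drop numbers
  x <- tolower(x)                      # convert to lower case
  x <- gsub("\\s+", " ", x)            # collapse repeated white-space
  trimws(x)
}

# Hypothetical custom stopword list for this data set
my_stopwords <- c("the", "is", "on", "delhi")

remove_stopwords <- function(x, stopwords) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  paste(words[!words %in% stopwords], collapse = " ")
}

raw <- "The mood in #Delhi is upbeat on Feb 7!! http://t.co/abc123 @AamAadmiParty"
cleaned <- remove_stopwords(clean_tweet(raw), my_stopwords)
print(cleaned)
```

Replacing removed characters with a space rather than an empty string keeps neighbouring words from being glued together.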

Findings

1. The word "isbaaraap" occurred the most times, followed by "delhimodipmbedicm".

2. Wordcloud comparing the frequencies of words between BJP and AAP.

3. Sentiment Analysis of Tweets by Emotional Categories

4. Final Sentiment Analysis of Tweets - AAP has greater positive sentiments than BJP.

Summary:

Since AAP has more positive sentiment than BJP on Twitter, it is likely to get a higher percentage of votes. This analysis cannot say anything about the number of seats each party will win.
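One way to make this comparison less dependent on how many tweets were collected for each party is to compare the share of positive tweets rather than raw counts. The sketch below uses a small hypothetical data frame; the labels are made up for illustration and are not results from this analysis.

```r
# Hypothetical polarity labels; real labels would come from the classifier output
polarity_df <- data.frame(
  Candidate = c("AAP", "AAP", "AAP", "AAP", "BJP", "BJP", "BJP", "BJP"),
  polarity  = c("positive", "positive", "positive", "negative",
                "positive", "negative", "negative", "neutral"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then convert each row (candidate) to proportions
counts <- table(polarity_df$Candidate, polarity_df$polarity)
shares <- prop.table(counts, margin = 1)

positive_share <- shares[, "positive"]
print(round(positive_share, 2))
```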

Appendix:

Install the following R packages

install.packages("twitteR")
install.packages("wordcloud")
install.packages("plyr")
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")
install.packages("tm")
install.packages("reshape2")

Detailed R Code

library(plyr)

library(dplyr)

library(stringr)

library(ggplot2)

library(reshape2)

library(twitteR)

library(wordcloud)

# tm provides Corpus, VectorSource, stopwords and TermDocumentMatrix used below

library(tm)

# The sentiment package is not available on CRAN; install it from the CRAN archive.

# It depends on the Rstem package, which must be installed first if it is missing.

install.packages("devtools")

require(devtools)

install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")

require(sentiment)

ls("package:sentiment")

# Authenticate with Twitter before searching (in recent twitteR versions, via setup_twitter_oauth()).

# Each call returns at most n tweets, so the searches must be repeated to fetch all tweets. Not all iterations are shown here.

# Recent twitteR versions may reject the cainfo argument; drop it if you get an "unused argument" error.

Kejriwal_tweets = searchTwitter("#Kejriwal",since="2015-01-31",until="2015-02-05", n=1500,lang="en",cainfo="cacert.pem")

Kejriwal_tweets2 = searchTwitter("@AamAadmiParty",since="2015-01-31",until="2015-02-05", n=1500,lang="en",cainfo="cacert.pem")

bedi_tweets = searchTwitter("@BJPDelhiState", since="2015-01-31",until="2015-02-05",n=1500, lang="en",cainfo="cacert.pem")

bedi_tweets2 = searchTwitter("#KiranBedi", since="2015-01-31",until="2015-02-05",n=1500, lang="en",cainfo="cacert.pem")

# get the text

Kejriwal_txt = sapply( unlist(Kejriwal_tweets) , function(x) '$'( x , "text"))

Kejriwal_txt2 = sapply( unlist(Kejriwal_tweets2) , function(x) '$'( x , "text"))

bedi_txt = sapply( unlist(bedi_tweets) , function(x) '$'( x , "text"))

bedi_txt2 = sapply( unlist(bedi_tweets2) , function(x) '$'( x , "text"))

# how many tweets of each keyword

nd = c(length(Kejriwal_txt), length(Kejriwal_txt2), length(bedi_txt), length(bedi_txt2))

# join texts

Kejriwal_txt= c(Kejriwal_txt, Kejriwal_txt2)

bedi_txt= c(bedi_txt, bedi_txt2)

# Remove the duplicated tweets

Kejriwal_txt <- Kejriwal_txt[!duplicated(Kejriwal_txt)]

bedi_txt <- bedi_txt[!duplicated(bedi_txt)]

# how many unique tweets of each keyword

nd1 = c(length(Kejriwal_txt), length(bedi_txt))

nd1

# clean text function

clean.text <- function(some_txt)

{ some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)

some_txt = gsub("@\\w+", "", some_txt)

some_txt = gsub("[[:punct:]]", "", some_txt)

some_txt = gsub("[[:digit:]]", "", some_txt)

some_txt = gsub("http\\w+", "", some_txt)

# collapse runs of spaces/tabs into a single space so that neighbouring words are not merged

some_txt = gsub("[ \t]{2,}", " ", some_txt)

some_txt = gsub("^\\s+|\\s+$", "", some_txt)

# Remove non-english characters

some_txt = gsub("[^\x20-\x7E]", "", some_txt)

# define "tolower error handling" function

try.tolower = function(x)

{ y = NA

try_error = tryCatch(tolower(x), error=function(e) e)

if (!inherits(try_error, "error"))

y = tolower(x)

return(y)

}

some_txt = sapply(some_txt, try.tolower)

some_txt = some_txt[some_txt != ""]

names(some_txt) = NULL

return(some_txt)}

# clean text

Kejriwal_clean = clean.text(Kejriwal_txt)

bedi_clean = clean.text(bedi_txt)

# join cleaned texts in a single vector

Kejriwals = paste(Kejriwal_clean, collapse=" ")

bedis = paste(bedi_clean, collapse=" ")

kej_bed = c(Kejriwals, bedis)

# Corpus

kb_corpus = Corpus(VectorSource(kej_bed))

# remove stopwords

skipwords = c(stopwords("english"), "CM", "Chief Minister","year","years", "yes","bjp","aap","amp","delhipolls","delhielections",

"ravishaskskejriwal","delhi","elections","election","kejriwal", "kejriwals", "kiran","bedi", "todays", "reads", "live", "watch",

"zee","star","ndtv","congress","will","can","must","money","many","make","say","says","cant","kiranbedi","arvind","delhielection",

"arvindkejriwal","party","vote","even","now","namo","modi","nota","notamensrights","hey","world","class","create","men",

"vihar","sure","every","day","dont","get","media","one","see","said","feb","like"

,"use","together")

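# Note: stripWhitespace and stemDocument are not among the documented control options of TermDocumentMatrix and have been dropped here; the minimum word length is set through wordLengths in current tm versions

kb.tf <- list(weighting = weightTf, stopwords = skipwords,

removePunctuation = TRUE,

tolower = TRUE,

wordLengths = c(4, Inf),

removeNumbers = TRUE)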

# term-document matrix

tdm = TermDocumentMatrix(kb_corpus, control = kb.tf)

# convert as matrix

tdm = as.matrix(tdm)

# get word counts in decreasing order

word_freqs = sort(rowSums(tdm), decreasing=TRUE)

# create a data frame with words and their frequencies

dm = data.frame(word=names(word_freqs), freq=word_freqs)

p <- ggplot(subset(dm, freq>20), aes(word, freq))

p <-p+ geom_bar(stat="identity")

p <-p+ theme(axis.text.x=element_text(angle=45, hjust=1))

png("hist.png", 480,480)

# print explicitly so the plot is also written when the script is run with source()

print(p)

dev.off()

dev.new()

wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(6, "Dark2"),min.freq=10, scale=c(4,.2),rot.per=.15,max.words=80)

# add column names

colnames(tdm) = c("AAP","BJP")

#write.csv(tdm,"matrix.csv")

# comparison cloud

png(file="KejriwalvsBedi.png",height=600,width=1200)

par(mfrow=c(1,2))

comparison.cloud(tdm, random.order=FALSE, colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),title.size=1.5, max.words=100, scale=c(4,.2),rot.per=.15)

# close the device so the image file is written

dev.off()

# commonality cloud (words shared by both parties)

png(file="Common.png",height=600,width=1200)

par(mfrow=c(1,2))

commonality.cloud(tdm, random.order=FALSE, colors = brewer.pal(8, "Dark2"), max.words=100)

dev.off()

#Sentiment Analysis code starts from here

# run model

bjp_class_emo = classify_emotion(bedi_clean, algorithm="bayes", prior=1.0)

# Fetch the emotion category best_fit for our analysis. Readers of this tutorial are encouraged to try the other classification columns as well.

emotion = bjp_class_emo[,7]

# Replace NA’s (if any, generated during classification process) by word “unknown”

emotion[is.na(emotion)] = "unknown"

# Polarity Classification

bjp_class_pol = classify_polarity(bedi_clean, algorithm="bayes")

# We fetch the polarity category best_fit for our analysis. As before, readers of this tutorial are encouraged to try the other classification columns as well.

polarity = bjp_class_pol[,4]

# Let us now create a data frame with the above results obtained and rearrange data for plotting purposes

# creating data frame using emotion category and polarity results earlier obtained

sentiment_dataframe = data.frame(text=bedi_clean, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)

# rearrange data inside the frame by sorting it

sentiment_dataframe = within(sentiment_dataframe, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

write.csv(sentiment_dataframe,"BJP.csv")

sentiment_dataframe=read.csv("BJP.csv")

# In the next step we will plot the obtained results (in data frame)

# First let us plot the distribution of emotions according to emotion categories

# We will use ggplot function from ggplot2 Package (for more look at the help on ggplot) and RColorBrewer Package

ggplot(sentiment_dataframe, aes(x=emotion)) + geom_bar(aes(y=..count.., fill=emotion)) +

scale_fill_brewer(palette="Dark2") + ggtitle('Sentiment Analysis of Tweets on Twitter about BJP') +

theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotion Categories')

ggplot(sentiment_dataframe, aes(x=polarity))+geom_bar(aes(y=..count.., fill=polarity)) +

scale_fill_brewer(palette="RdGy") + ggtitle('Sentiment Analysis of Tweets on Twitter about BJP') +

theme(legend.position='right') + ylab('Number of Tweets') + xlab('Polarity Categories')

#Sentiment Analysis - AAM AADMI PARTY

# run model

Kejriwal_class_emo = classify_emotion(Kejriwal_clean, algorithm="bayes", prior=1.0)

# Fetch the emotion category best_fit for our analysis. Readers of this tutorial are encouraged to try the other classification columns as well.

emotion1 = Kejriwal_class_emo[,7]

# Replace NA’s (if any, generated during classification process) by word “unknown”

emotion1[is.na(emotion1)] = "unknown"

# Similar to above, we will classify polarity in the text

# This process returns four columns for each text: pos (the absolute log likelihood of the document expressing a positive sentiment), neg (the absolute log likelihood of the document expressing a negative sentiment), pos/neg (the ratio of the absolute log likelihoods of the positive and negative sentiment scores, where 1 indicates a neutral sentiment, less than 1 a negative sentiment, and greater than 1 a positive sentiment) and best_fit (the most likely sentiment category, e.g. positive, negative or neutral, for the given text)

Kejriwal_class_pol = classify_polarity(Kejriwal_clean, algorithm="bayes")

# We fetch the polarity category best_fit for our analysis. As before, readers of this tutorial are encouraged to try the other classification columns as well.

polarity1 = Kejriwal_class_pol[,4]

# Let us now create a data frame with the above results obtained and rearrange data for plotting purposes

# creating data frame using emotion category and polarity results earlier obtained

sentiment_dataframe = data.frame(text=Kejriwal_clean, emotion=emotion1, polarity=polarity1, stringsAsFactors=FALSE)

# rearrange data inside the frame by sorting the emotion column (not the global emotion1 vector)

sentiment_dataframe = within(sentiment_dataframe, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

# In the next step we will plot the obtained results (in data frame)

# First let us plot the distribution of emotions according to emotion categories

# We will use ggplot function from ggplot2 Package (for more look at the help on ggplot) and RColorBrewer Package

ggplot(sentiment_dataframe, aes(x=emotion)) + geom_bar(aes(y=..count.., fill=emotion)) +

scale_fill_brewer(palette="Dark2") +

ggtitle('Sentiment Analysis of Tweets on Twitter about AAP') +

theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotion Categories')

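write.csv(sentiment_dataframe,"AAP_Data.csv")

# Combine both result files and add a Candidate column for the comparison plots

bjp_data = read.csv("BJP.csv", stringsAsFactors=FALSE)[, c("text","emotion","polarity")]

aap_data = read.csv("AAP_Data.csv", stringsAsFactors=FALSE)[, c("text","emotion","polarity")]

bjp_data$Candidate = "BJP"

aap_data$Candidate = "AAP"

sentiment_dataframe = rbind(bjp_data, aap_data)

ggplot(sentiment_dataframe, aes(x=factor(polarity), fill=Candidate)) + geom_bar(position="dodge")+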

scale_fill_brewer(palette="Dark2") +

ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +

theme(legend.position='right') + ylab('Number of Tweets') + xlab('Sentiments')

ggplot(sentiment_dataframe, aes(x=factor(emotion), fill=Candidate)) + geom_bar(position="dodge")+

scale_fill_brewer(palette="Dark2") +

ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +

theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotional Categories')

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.


23 Responses to "Predicting the Delhi Election using Twitter Data"

NIKHIL, February 10, 2015 at 9:26 PM
Awesome...
Deepanshu Bhalla, February 10, 2015 at 11:10 PM
Thank you so much!
Lija, February 11, 2015 at 12:48 AM
Excellent Work... Keep Going :-)
chandan, February 11, 2015 at 2:46 AM
awesome post...thanks..:)
soumya, February 13, 2015 at 2:41 AM
can u make same using SAS ?
Alok, July 2, 2015 at 4:02 AM
Wooww. Very helpful post. Great work. Thanks for share it with us.
Durgesh Samariya, October 18, 2015 at 3:39 AM
I tried this code but i get one error,
the package "sentiment" not able to install. shows its not available for R version 3.2.1.

Please solve this problem

Thanks in advance

Unknown, November 8, 2015 at 8:30 AM
Thank You for solving all problem,
I have one more problem.

on last graph

ggplot(sentiment_dataframe, aes(x=factor(emotion), fill=Candidate)) + geom_bar(position="dodge")+
scale_fill_brewer(palette="Dark2") +
ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotional Categories')

Showing error
object 'Candidate' not found

and i check your code you not define Candidate code

and generate blank graph.

Please solve this.

Thank you in advance
Unknown, February 10, 2016 at 5:57 PM
Great work and i love it. However in the last but one code, the "Polarity" has to be with a small "p". Also the definition of "Candidate" is missing in the code. I would be glad if you could amend the code.

ggplot(sentiment_dataframe, aes(x=factor(Polarity), fill=Candidate)) + geom_bar(position="dodge")+
scale_fill_brewer(palette="Dark2") +
ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Sentiments')
Unknown, October 6, 2016 at 9:05 AM
Please help me . How to define "candidate".please write code for that
Unknown, October 8, 2016 at 7:11 AM
Thanxxx a lot . I want some GUI for above code.how can I get that ?
Unknown, August 14, 2017 at 4:27 AM
Error in get_oauth_sig() : OAuth has not been registered for this session

Showing the above Error......Please help
Avijeet Biswal, December 11, 2017 at 10:34 PM

BJP_tweets = searchTwitter("#NarendraModi",since="2017-11-27",until="2017-11-12", n=3000,lang="en",cainfo="cacert.pem")

I am trying to predict the ongoing Gujarat election by using the above code. But I am getting an error as mentioned below:

Error in tw_from_response(out, ...) :
unused argument (cainfo = "cacert.pem")

Help me solve this issue.
Regards
Avijeet
Avijeet Biswal, December 12, 2017 at 7:54 AM
I am unable to plot a barplot for both the parties on the same plot.
Help me!
Unknown, July 3, 2018 at 10:49 AM
> install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
Downloading package from url: http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz
Installing sentiment
"C:/PROGRA~1/R/R-35~1.0/bin/x64/R" --no-site-file --no-environ --no-save \
--no-restore --quiet CMD INSTALL \
"C:/Users/kanda/AppData/Local/Temp/RtmpO2mAni/devtoolsd2ac36c2577/sentiment" \
--library="C:/Users/kanda/Documents/R/win-library/3.5" --install-tests

ERROR: dependency 'Rstem' is not available for package 'sentiment'
* removing 'C:/Users/kanda/Documents/R/win-library/3.5/sentiment'
In R CMD INSTALL
Installation failed: Command failed (1)