Introduction
The idea is to analyze the sentiment of tweets containing hashtags and words such as '#Kejriwal', 'Kejriwal', '@AamAadmiParty', '#KiranBedi', 'KiranBedi', '@BJPDelhiState', '#DelhiDecides' and 'DelhiVotes'. The next step is to identify whether a tweet expresses a positive or a negative sentiment about a particular candidate.
Sentiment analysis
Sentiment analysis is an active area of research in text mining that deals with the opinions, sentiments and subjectivity expressed in text.
Examples:
Politics: What do people think about this candidate or issue?
Products: What do people think about the new iPhone?
Building a data set
Before analyzing Twitter data, we need to obtain it. You need a Twitter developer account to pull the data. After creating the account, you authenticate your application with Twitter, which allows you to mine tweets through the Twitter application programming interface (API).
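As a rough sketch of the authentication step, the helper below registers OAuth credentials through twitteR's setup_twitter_oauth(). The function name register_twitter and the four environment variable names are assumptions made for this example; store your own keys wherever suits you. The helper skips registration when twitteR or a credential is missing.

```r
# Sketch only: register_twitter() and the environment variable names
# are placeholders invented for this example.
register_twitter <- function() {
  keys <- Sys.getenv(c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                       "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET"))
  # Skip registration when twitteR or any credential is missing
  if (!requireNamespace("twitteR", quietly = TRUE) || any(keys == "")) {
    message("twitteR or credentials missing; skipping OAuth registration")
    return(FALSE)
  }
  # Arguments: consumer key, consumer secret, access token, access secret
  twitteR::setup_twitter_oauth(keys[[1]], keys[[2]], keys[[3]], keys[[4]])
  TRUE
}

registered <- register_twitter()
```

Calling searchTwitter() before a successful registration typically fails with an error saying that OAuth has not been registered for the session.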
After integrating with Twitter, you need to specify the keywords for which you want tweets: '#Kejriwal', 'Kejriwal', '@AamAadmiParty', '#KiranBedi', 'KiranBedi', '@BJPDelhiState'. I collected 50932 tweets posted from 31st January, 2015 to 5th February, 2015 (two days before the election date) for both parties. I then removed duplicate tweets, since some are retweets.
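The duplicate removal can be illustrated on a few made-up tweets (the strings below are invented for this sketch, not taken from the collected data):

```r
# Toy tweets standing in for the text returned by the Twitter search
tweets <- c("Vote for change #Kejriwal",
            "RT @someone: Vote for change #Kejriwal",
            "Vote for change #Kejriwal",
            "Delhi needs #KiranBedi")

# duplicated() flags every repeat after the first occurrence
unique_tweets <- tweets[!duplicated(tweets)]

length(tweets)
length(unique_tweets)
```

Note that this only drops exact copies. A retweet that carries an "RT @user:" prefix differs from the original string, so it survives unless the text is cleaned before de-duplication.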
The detailed R code is shown in the later portion of this article.
Before the analysis, the tweet text is cleaned in the following steps:
- Extract the text content of tweets
- Eliminate extra white-spaces
- Convert text to lower case
- Remove stopwords
- Build a custom stopword list specifically for this data set
- Remove punctuation symbols
- Remove numbers
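The steps above can be sketched in base R as follows. This is a simplified illustration, not the function used in the appendix: the helper name clean_tweet, the sample tweets and the small stopword list are all assumptions made for the example, and the steps are ordered so that stopwords are matched after punctuation and numbers are gone.

```r
# Toy input; the custom stopwords are examples only
tweets <- c("RT @fan: Kejriwal will WIN 67 seats!!   http://t.co/abc",
            "  The Delhi election is on 7 Feb  ")
custom_stopwords <- c("the", "is", "on", "will", "delhi", "election")

clean_tweet <- function(x, stopwords) {
  x <- gsub("http\\S+", " ", x, perl = TRUE)       # remove URLs
  x <- gsub("@\\w+", " ", x, perl = TRUE)          # remove @mentions
  x <- gsub("\\bRT\\b", " ", x, perl = TRUE)       # remove retweet marker
  x <- tolower(x)                                  # convert to lower case
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)    # remove punctuation
  x <- gsub("[[:digit:]]", " ", x, perl = TRUE)    # remove numbers
  # split on whitespace, drop empty tokens and stopwords, re-join with
  # single spaces (this also eliminates extra white-space)
  words <- strsplit(x, "\\s+", perl = TRUE)
  vapply(words,
         function(w) paste(w[nzchar(w) & !(w %in% stopwords)], collapse = " "),
         character(1))
}

cleaned <- clean_tweet(tweets, custom_stopwords)
cleaned
```

Replacing removed characters with a space rather than an empty string keeps neighbouring words from being glued together.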
Findings
1. The word "isbaaraap" occurred most often, followed by "delhimodipmbedicm".
2. Wordcloud comparing the frequencies of words between BJP and AAP.
3. Sentiment Analysis of Tweets by Emotional Categories
4. Final Sentiment Analysis of Tweets - AAP has greater positive sentiments than BJP.
Since AAP attracts more positive sentiment than BJP on Twitter, it is likely to get a higher share of the votes. This analysis cannot say anything about the number of seats each party will win.
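When comparing the two parties, it can help to look at the share of positive tweets rather than raw counts, because the number of tweets collected per party differs. The sketch below uses hypothetical polarity labels invented for illustration; they are not the results reported above.

```r
# Hypothetical polarity labels, NOT the results of this analysis
toy <- data.frame(
  Candidate = c("AAP", "AAP", "AAP", "AAP", "BJP", "BJP", "BJP", "BJP"),
  polarity  = c("positive", "positive", "positive", "negative",
                "positive", "negative", "negative", "neutral"),
  stringsAsFactors = FALSE
)

# Count tweets per party and polarity, then convert each row to shares
counts <- table(toy$Candidate, toy$polarity)
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Each row of shares sums to 1, so the "positive" column can be compared directly across parties.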
Appendix :
Install the following R packages
install.packages("twitteR")
install.packages("wordcloud")
install.packages("plyr")
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")
install.packages("tm")
install.packages("reshape2")
Detailed R Code
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)
library(reshape2)
library(twitteR)
library(wordcloud)
library(tm)
# The sentiment package is not available on CRAN. You need to install it from the archive.
# It depends on the Rstem package, which must be installed first.
install.packages("devtools")
require(devtools)
install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
require(sentiment)
ls("package:sentiment")
# You have to iterate to fetch all tweets. Not all iterations are shown in this code.
# Newer versions of twitteR may reject the cainfo argument with an "unused argument" error; drop it in that case.
Kejriwal_tweets = searchTwitter("#Kejriwal", since="2015-01-31", until="2015-02-05", n=1500, lang="en", cainfo="cacert.pem")
Kejriwal_tweets2 = searchTwitter("@AamAadmiParty", since="2015-01-31", until="2015-02-05", n=1500, lang="en", cainfo="cacert.pem")
bedi_tweets = searchTwitter("@BJPDelhiState", since="2015-01-31", until="2015-02-05", n=1500, lang="en", cainfo="cacert.pem")
bedi_tweets2 = searchTwitter("#KiranBedi", since="2015-01-31", until="2015-02-05", n=1500, lang="en", cainfo="cacert.pem")

# get the text
Kejriwal_txt = sapply(unlist(Kejriwal_tweets), function(x) x$text)
Kejriwal_txt2 = sapply(unlist(Kejriwal_tweets2), function(x) x$text)
bedi_txt = sapply(unlist(bedi_tweets), function(x) x$text)
bedi_txt2 = sapply(unlist(bedi_tweets2), function(x) x$text)

# how many tweets of each keyword
nd = c(length(Kejriwal_txt), length(Kejriwal_txt2), length(bedi_txt), length(bedi_txt2))

# join texts
Kejriwal_txt = c(Kejriwal_txt, Kejriwal_txt2)
bedi_txt = c(bedi_txt, bedi_txt2)

# remove the duplicated tweets
Kejriwal_txt <- Kejriwal_txt[!duplicated(Kejriwal_txt)]
bedi_txt <- bedi_txt[!duplicated(bedi_txt)]

# how many unique tweets of each keyword
nd1 = c(length(Kejriwal_txt), length(bedi_txt))
nd1

# clean text function
clean.text <- function(some_txt)
{
  some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
  some_txt = gsub("@\\w+", "", some_txt)
  some_txt = gsub("[[:punct:]]", "", some_txt)
  some_txt = gsub("[[:digit:]]", "", some_txt)
  some_txt = gsub("http\\w+", "", some_txt)
  # collapse runs of white-space into a single space so that words are not glued together
  some_txt = gsub("[ \t]{2,}", " ", some_txt)
  some_txt = gsub("^\\s+|\\s+$", "", some_txt)
  # remove non-English (non-ASCII) characters
  some_txt = gsub("[^\x20-\x7E]", "", some_txt)
  # define "tolower error handling" function
  try.tolower = function(x)
  {
    y = NA
    try_error = tryCatch(tolower(x), error=function(e) e)
    if (!inherits(try_error, "error"))
      y = tolower(x)
    return(y)
  }
  some_txt = sapply(some_txt, try.tolower)
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}

# clean text
Kejriwal_clean = clean.text(Kejriwal_txt)
bedi_clean = clean.text(bedi_txt)

# join cleaned texts in a single vector
Kejriwals = paste(Kejriwal_clean, collapse=" ")
bedis = paste(bedi_clean, collapse=" ")
kej_bed = c(Kejriwals, bedis)

# Corpus
kb_corpus = Corpus(VectorSource(kej_bed))

# remove stopwords
skipwords = c(stopwords("english"), "CM", "Chief Minister", "year", "years", "yes", "bjp", "aap", "amp",
              "delhipolls", "delhielections", "ravishaskskejriwal", "delhi", "elections", "election",
              "kejriwal", "kejriwals", "kiran", "bedi", "todays", "reads", "live", "watch", "zee", "star",
              "ndtv", "congress", "will", "can", "must", "money", "many", "make", "say", "says", "cant",
              "kiranbedi", "arvind", "delhielection", "arvindkejriwal", "party", "vote", "even", "now",
              "namo", "modi", "nota", "notamensrights", "hey", "world", "class", "create", "men", "vihar",
              "sure", "every", "day", "dont", "get", "media", "one", "see", "said", "feb", "like", "use",
              "together")
# wordLengths = c(4, Inf) keeps only words with at least 4 characters
kb.tf <- list(weighting = weightTf, stopwords = skipwords, removePunctuation = TRUE, tolower = TRUE,
              wordLengths = c(4, Inf), removeNumbers = TRUE, stripWhitespace = TRUE, stemDocument = TRUE)

# term-document matrix
tdm = TermDocumentMatrix(kb_corpus, control = kb.tf)
# convert as matrix
tdm = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)

# bar chart of the most frequent words
p <- ggplot(subset(dm, freq>20), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
png("hist.png", 480, 480)
print(p)
dev.off()

dev.new()
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(6, "Dark2"), min.freq=10,
          scale=c(4,.2), rot.per=.15, max.words=80)

# add column names
colnames(tdm) = c("AAP", "BJP")
#write.csv(tdm,"matrix.csv")

# comparison cloud
png(file="KejriwalvsBedi.png", height=600, width=1200)
par(mfrow=c(1,2))
comparison.cloud(tdm, random.order=FALSE, colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),
                 title.size=1.5, max.words=100, scale=c(4,.2), rot.per=.15)
dev.off()

# commonality cloud
png(file="Common.png", height=600, width=1200)
par(mfrow=c(1,2))
commonality.cloud(tdm, random.order=FALSE, colors = brewer.pal(8, "Dark2"), max.words=100)
dev.off()

# Sentiment Analysis code starts from here
# Sentiment Analysis - BJP
# run model
bjp_class_emo = classify_emotion(bedi_clean, algorithm="bayes", prior=1.0)
# Fetch the emotion category best_fit for our analysis; readers of this tutorial are
# encouraged to play around with the other classifications as well.
emotion = bjp_class_emo[,7]
# Replace NAs (if any, generated during the classification process) by the word "unknown"
emotion[is.na(emotion)] = "unknown"

# Polarity classification
bjp_class_pol = classify_polarity(bedi_clean, algorithm="bayes")
# Fetch the polarity category best_fit for our analysis
polarity = bjp_class_pol[,4]

# Create a data frame from the emotion and polarity results
sentiment_dataframe = data.frame(text=bedi_clean, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)
# Order the emotion categories by frequency for plotting
sentiment_dataframe = within(sentiment_dataframe,
                             emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
write.csv(sentiment_dataframe, "BJP.csv")

# Plot the distribution of tweets by emotion category, then by polarity.
# We use the ggplot function from the ggplot2 package (see the help on ggplot) and the RColorBrewer package.
ggplot(sentiment_dataframe, aes(x=emotion)) +
  geom_bar(aes(y=..count.., fill=emotion)) +
  scale_fill_brewer(palette="Dark2") +
  ggtitle('Sentiment Analysis of Tweets on Twitter about BJP') +
  theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotion Categories')

ggplot(sentiment_dataframe, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +
  scale_fill_brewer(palette="RdGy") +
  ggtitle('Sentiment Analysis of Tweets on Twitter about BJP') +
  theme(legend.position='right') + ylab('Number of Tweets') + xlab('Polarity Categories')

# Sentiment Analysis - AAM AADMI PARTY
# run model
Kejriwal_class_emo = classify_emotion(Kejriwal_clean, algorithm="bayes", prior=1.0)
# Fetch the emotion category best_fit for our analysis
emotion1 = Kejriwal_class_emo[,7]
# Replace NAs (if any, generated during the classification process) by the word "unknown"
emotion1[is.na(emotion1)] = "unknown"

# Similar to above, we classify the polarity of the text.
# This process classifies the text data into four categories:
#   pos      - the absolute log likelihood of the document expressing a positive sentiment
#   neg      - the absolute log likelihood of the document expressing a negative sentiment
#   pos/neg  - the ratio of absolute log likelihoods between positive and negative sentiment scores,
#              where a score of 1 indicates a neutral sentiment, less than 1 a negative sentiment,
#              and greater than 1 a positive sentiment
#   best_fit - the most likely sentiment category (e.g. positive, negative, neutral) for the given text
Kejriwal_class_pol = classify_polarity(Kejriwal_clean, algorithm="bayes")
# Fetch the polarity category best_fit for our analysis
polarity1 = Kejriwal_class_pol[,4]

# Create a data frame from the emotion and polarity results
sentiment_dataframe1 = data.frame(text=Kejriwal_clean, emotion=emotion1, polarity=polarity1, stringsAsFactors=FALSE)
# Order the emotion categories by frequency for plotting
sentiment_dataframe1 = within(sentiment_dataframe1,
                              emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

# Plot the distribution of tweets by emotion category
ggplot(sentiment_dataframe1, aes(x=emotion)) +
  geom_bar(aes(y=..count.., fill=emotion)) +
  scale_fill_brewer(palette="Dark2") +
  ggtitle('Sentiment Analysis of Tweets on Twitter about AAP') +
  theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotion Categories')
write.csv(sentiment_dataframe1, "AAP_Data.csv")

# BJP vs AAP
# Label each data frame with its party and combine them for the comparison plots
sentiment_dataframe$Candidate = "BJP"
sentiment_dataframe1$Candidate = "AAP"
compare_df = rbind(sentiment_dataframe, sentiment_dataframe1)

ggplot(compare_df, aes(x=factor(polarity), fill=Candidate)) +
  geom_bar(position="dodge") +
  scale_fill_brewer(palette="Dark2") +
  ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
  theme(legend.position='right') + ylab('Number of Tweets') + xlab('Sentiments')

ggplot(compare_df, aes(x=factor(emotion), fill=Candidate)) +
  geom_bar(position="dodge") +
  scale_fill_brewer(palette="Dark2") +
  ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
  theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotional Categories')

Awesome...

Thank you so much!

Excellent Work... Keep Going :-)

awesome post...thanks..:)

Glad you liked it :-)

can u make same using SAS ?

Wooww. Very helpful post. Great work. Thanks for share it with us.

I tried this code but i get one error,
the package "sentiment" not able to install. shows its not available for R version 3.2.1.
Please solve this problem
Thanks in advance

Sentiment is not available on CRAN. Install it from the archive:
install.packages("devtools")
require(devtools)
install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
require(sentiment)
ls("package:sentiment")

I did that thing but showing same error
ERROR: dependency ‘Rstem’ is not available for package ‘sentiment’

Install package 'Rstem' prior to installing package sentiment. That's what error is talking about.

Thank You for solving all problem,
I have one more problem.
on last graph
ggplot(sentiment_dataframe, aes(x=factor(emotion), fill=Candidate)) + geom_bar(position="dodge")+
scale_fill_brewer(palette="Dark2") +
ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Emotional Categories')
Showing error
object 'Candidate' not found
and i check your code you not define Candidate code
and generate blank graph.
Please solve this.
Thank you in advance
Great work and i love it. However in the last but one code, the "Polarity" has to be with a small "p". Also the definition of "Candidate" is missing in the code. I would be glad if you could amend the code.
ggplot(sentiment_dataframe, aes(x=factor(Polarity), fill=Candidate)) + geom_bar(position="dodge")+
scale_fill_brewer(palette="Dark2") +
ggtitle('Sentiment Analysis of Tweets - BJP vs AAP') +
theme(legend.position='right') + ylab('Number of Tweets') + xlab('Sentiments')
Please help me . How to define "candidate".please write code for that

Hello, this is what i did to define my "Candidate" variable:
sentiment_dataframe1$Candidate <- "AAP"
sentiment_dataframe$Candidate <- "BJP"
compare.df = merge(sentiment_dataframe, sentiment_dataframe1, all = T)
#, by = c("AAP", "BJP"))
#, suffixes = c(".sentiment_dataframe"
# , ".sentiment_dataframe1"))
head(compare.df)
j = ggplot(compare.df, aes(x = factor(emotion), fill = Candidate))
j = j + geom_bar(position = "dodge")
j = j + scale_fill_brewer(palette = "Dark2")
j = j + ggtitle("Sentiment Analysis of Tweets - AAP vs BJP")
j = j + theme(legend.position = "right")
j = j + ylab("Number of Tweets") + xlab("Emotional Categories")
print(j)
k = ggplot(compare.df, aes(x = factor(polarity), fill = Candidate))
k = k + geom_bar(position = "dodge")
k = k + scale_fill_brewer(palette = "Dark2")
k = k + ggtitle("Sentiment Analysis of Tweets - AAP vs BJP")
k = k + theme(legend.position = "right")
k = k + ylab("Number of Tweets") + xlab("Sentiments")
print(k)
# analysis without "unknown"
compare.df %>%
group_by(emotion) %>%
filter(emotion != "unknown")%>%
mutate(emotion1 = factor(emotion))%>%
ggplot(aes(x = factor(emotion1), fill = Candidate)) +
geom_bar(position = "dodge") +
scale_fill_brewer(palette = "Dark2") +
xlab("") + ylab("Number of Tweets") +
ggtitle("Sentiment Analysis of Tweets - AAP vs BJP") +
theme(legend.position = "right") +
ylab("Number of Tweets") + xlab("Emotion Categories")
Hope it helps. Enjoy
thanks for your help. But there is one problem in ggplot of sentiment analysis of "emotion"
For emotion "anger" ,results are shown for only one parameter"AAP". those of BJP are not shown.

Hi, could it be that the number of tweets for anger of those of BJP are so small compared to all other emotions. Maybe you should check the scaling(xlim =c(,)) of your plot.

Thank you.Now I want to create some interface ,GUI for above code in R. How can I do that ?

Thanxxx a lot . I want some GUI for above code.how can I get that ?

Error in get_oauth_sig() : OAuth has not been registered for this session
Showing the above Error......Please help

BJP_tweets = searchTwitter("#NarendraModi",since="2017-11-27",until="2017-11-12", n=3000,lang="en",cainfo="cacert.pem")
I am trying to predict the ongoing Gujarat election by using the above code. But I am getting an error as mentioned below:
Error in tw_from_response(out, ...) :
unused argument (cainfo = "cacert.pem")
Help me solve this issue.
Regards
Avijeet
I am unable to plot a barplot for both the parties on the same plot.
Help me!

> install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
Downloading package from url: http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz
Installing sentiment
"C:/PROGRA~1/R/R-35~1.0/bin/x64/R" --no-site-file --no-environ --no-save \
--no-restore --quiet CMD INSTALL \
"C:/Users/kanda/AppData/Local/Temp/RtmpO2mAni/devtoolsd2ac36c2577/sentiment" \
--library="C:/Users/kanda/Documents/R/win-library/3.5" --install-tests
ERROR: dependency 'Rstem' is not available for package 'sentiment'
* removing 'C:/Users/kanda/Documents/R/win-library/3.5/sentiment'
In R CMD INSTALL
Installation failed: Command failed (1)