A word cloud is a text mining technique that lets us visualize the most frequently used keywords in a body of text.
How to Create a Word Cloud with R
Step 1 : Install the required packages
install.packages("wordcloud")
install.packages("tm")
install.packages("ggplot2")
Note : If these packages are already installed, you don't need to install them again.
Step 2 : Load the above installed packages
library(wordcloud)
library(tm)
library(ggplot2)
Step 3 : Import data into R
Import a single file
cname <- read.csv("C:/Users/Deepanshu Bhalla/Documents/Text.csv", header=TRUE)
Import multiple files from a folder
setwd("C:\\Users\\Deepanshu Bhalla\\Documents\\text mining")
cname <- getwd()
Note : In the above syntax, "text mining" is the folder name. All the text files are placed in this folder.
## Number of documents
length(dir(cname))
## list file names
dir(cname)
Step 4 : Locate and load the corpus
If you imported a single file
docs <- Corpus(VectorSource(cname[,1]))
If you imported multiple files from a folder
docs <- Corpus(DirSource(cname))
docs
summary(docs)
inspect(docs[1])
Step 5 : Data Cleaning
# Simple Transformation
for (j in seq(docs))
{
docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])  # retweet markers and the handles that follow them
docs[[j]] = gsub("@\\w+", "", docs[[j]])                        # remaining @mentions
docs[[j]] = gsub("http\\w+", "", docs[[j]])                     # link-like tokens
docs[[j]] = gsub("[ \t]{2,}", " ", docs[[j]])                   # collapse runs of spaces/tabs into one space
docs[[j]] = gsub("^\\s+|\\s+$", "", docs[[j]])                  # trim leading/trailing whitespace
docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])                 # drop non-printable / non-ASCII characters
}
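To make the regular expressions less cryptic, here is a small trace of cleaning rules like the ones above on a single made-up tweet. This is a sketch only: the tweet text is hypothetical, the whitespace run is collapsed to a single space, and the trimming pattern is assumed to be `^\\s+|\\s+$` (leading/trailing whitespace).

```r
# Hypothetical tweet used only to trace the cleaning rules
x <- "RT @alice wordclouds are neat @bob check https123"
x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)  # retweet marker + handle
x <- gsub("@\\w+", "", x)                        # remaining @mentions
x <- gsub("http\\w+", "", x)                     # link-like tokens
x <- gsub("[ \t]{2,}", " ", x)                   # collapse repeated spaces/tabs
x <- gsub("^\\s+|\\s+$", "", x)                  # trim leading/trailing whitespace
x
# "wordclouds are neat check"
```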
# Specify stopwords in addition to the built-in English stopwords
skipwords = c(stopwords("english"), "system","technology")
kb.tf <- list(weighting = weightTf, stopwords = skipwords,
removePunctuation = TRUE,
tolower = TRUE,
minWordLength = 4,
removeNumbers = TRUE, stripWhitespace = TRUE,
stemDocument= TRUE)
# term-document matrix
docs <- tm_map(docs, PlainTextDocument)
tdm = TermDocumentMatrix(docs, control = kb.tf)
Step 6 : Create WordCloud with R
# convert as matrix
tdm = as.matrix(tdm)
# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# Fix the random seed so the word cloud layout is reproducible
set.seed(123)
#Plot Histogram
p <- ggplot(subset(dm, freq>10), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
p
#Plot Wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(6, "Dark2"),
          min.freq=10, scale=c(4,.2), rot.per=.15, max.words=100)
Note : You can remove sparse terms with the following code :
tdm.frequent = removeSparseTerms(tdm, 0.1)
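For intuition on what the second argument does: removeSparseTerms() drops every term whose sparsity (the share of documents it does not appear in) exceeds the threshold, so 0.1 keeps only terms present in roughly 90% or more of the documents. A minimal sketch on a made-up three-document corpus:

```r
library(tm)

# Toy corpus: "apple" occurs in all 3 documents (sparsity 0),
# "banana" in 2 of 3 (~0.33), "cherry" in 1 of 3 (~0.67)
docs2 <- Corpus(VectorSource(c("apple banana", "apple cherry", "apple banana")))
tdm2  <- TermDocumentMatrix(docs2)

Terms(removeSparseTerms(tdm2, 0.4))  # drops "cherry"; keeps "apple", "banana"
Terms(removeSparseTerms(tdm2, 0.1))  # keeps only "apple"
```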
Awesome article.
Thank you! Glad you liked it :-)
Hi, I followed all the steps but failed at the transformation in Step 5.
I got this error message:
Error in UseMethod("content", x) :
no applicable method for 'content' applied to an object of class "character"
Steps 1 to 4 run fine.
Could you help me fix this problem?
Thank you in advance.
I have updated the code. That should solve your problem. Thanks!
Hi Deepanshu,
I followed your steps but I got the following error. Please help:
Error in eval(expr, envir, enclos) : object 'word' not found
May I know which section of the code returns this error?
Hey, I am using your code, but I can't add my own stopwords. I want to build a word cloud from a chat and keep certain words out of it. Can you please tell me how to add my own words to the stopwords? Thank you in advance.
ReplyDeleteHi !
ReplyDeleteGreat work.
Well, I am trying to create a word cloud from tweets, but all it shows in the word cloud is words like "object", "class" and "status".
LINK to screenshot: https://drive.google.com/file/d/0B4ZhibK97rv0SE9XOXgwOERYNjQ/view?usp=sharing
Help appreciated.
Thanks
Hi, thanks for sharing this.
When I executed the Step 5 syntax I got the error below:
Error: unexpected string constant in:
" docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])
docs[[j]] = gsub(""
> }
Error: unexpected '}' in "}"
The above error was resolved, but when I executed the next set of syntax in Step 5 I got another error. Please see below for more details. Can you let me know why I am getting this error?
> tdm = TermDocumentMatrix(docs, control = kb.tf)
Error: inherits(doc, "TextDocument") is not TRUE
>
> # convert as matrix
> tdm = as.matrix(tdm)
Error in as.matrix(tdm) : object 'tdm' not found
>
> # get word counts in decreasing order
> word_freqs = sort(rowSums(tdm), decreasing=TRUE)
Error in is.data.frame(x) : object 'tdm' not found
>
> # create a data frame with words and their frequencies
> dm = data.frame(word=names(word_freqs), freq=word_freqs)
Error in data.frame(word = names(word_freqs), freq = word_freqs) :
object 'word_freqs' not found
This is fantastic!!!
There is a small error in Step 5 (a ; is used where a " should be):
for (j in seq(docs))
{
docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])
docs[[j]] = gsub("@\\w+", "", docs[[j]])
docs[[j]] = gsub("http\\w+", "", docs[[j]])
docs[[j]] = gsub("[ \t]{2,}", "", docs[[j]])
docs[[j]] = gsub("^\\s+|\\s+quot;", "", docs[[j]])
docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])
}
Yes George, but the next set of syntax in Step 5 still gives me the error below. Not sure why:
> tdm = TermDocumentMatrix(docs, control = kb.tf)
Error: inherits(doc, "TextDocument") is not TRUE
>
> # convert as matrix
> tdm = as.matrix(tdm)
Error in as.matrix(tdm) : object 'tdm' not found
>
> # get word counts in decreasing order
> word_freqs = sort(rowSums(tdm), decreasing=TRUE)
Error in is.data.frame(x) : object 'tdm' not found
>
> # create a data frame with words and their frequencies
> dm = data.frame(word=names(word_freqs), freq=word_freqs)
Error in data.frame(word = names(word_freqs), freq = word_freqs) :
object 'word_freqs' not found
Hi George,
I am sorry, I didn't get your point. What exactly have you changed in this code?
for (j in seq(docs))
{
docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])
docs[[j]] = gsub("@\\w+", "", docs[[j]])
docs[[j]] = gsub("http\\w+", "", docs[[j]])
docs[[j]] = gsub("[ \t]{2,}", "", docs[[j]])
docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])
docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])
}
Hi Deepanshu,
This line in Step 5
docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])
is missing the " after "^\\s+|\\s+quot;
This is an issue with the version of tm you are using. Run
docs <- tm_map(docs, PlainTextDocument)
before running
tdm = TermDocumentMatrix(docs, control = kb.tf)
Thanks a lot George, it does work perfectly.
Just one question about the note mentioned in this step:
Note : You can remove sparse terms with the following code :
tdm.frequent = removeSparseTerms(tdm, 0.1)
What is the use of this?
Thanks a lot for your help
One more question: I increased my data points (I included more comments in the .csv file) but got only three words in the word cloud.
Earlier there were many more words displayed in the chart. Why does the word cloud show fewer words after I added more comments?
I'm using wordcloud in Shiny. When I select a data set, it should show the word cloud corresponding to that selection.
Can anyone help me with this issue?
ReplyDeleteWould appreciate it if you could provide a link to the data file(s)?
ReplyDelete