# Create WordCloud with R

A word cloud is a text mining visualization that shows the most frequently used words in a body of text. The more often a word occurs, the larger it is drawn.

An example word cloud is shown below :
[Image: Create WordCloud with R Programming]
How to create Word Cloud with R

Step 1 : Install the required packages
install.packages("wordcloud")
install.packages("tm")
install.packages("ggplot2")
Note : If these packages are already installed, you don't need to install them again.

Step 2 : Load the above installed packages
library("wordcloud")
library("tm")
library(ggplot2)

Step 3 : Import data into R

Import a single file
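
The code for this case is not shown in the post. Since Step 4 reads the text from `cname[,1]`, one plausible reading is that the file is a table whose first column holds the text. Here is a minimal sketch under that assumption; the file name and its contents are made up, and in practice you would point `read.csv()` at your own file.

```r
# Hypothetical example: create a small CSV file so the sketch is self-contained.
csv_path <- file.path(tempdir(), "comments.csv")
writeLines(c("text",
             "\"R makes text mining easy\"",
             "\"Word clouds show frequent words\"",
             "\"Text mining with R\""),
           csv_path)

# Read the file into a data frame; keep the text as character, not factor.
cname <- read.csv(csv_path, stringsAsFactors = FALSE)

# The first column holds one document (comment) per row.
nrow(cname)
head(cname[, 1])
```

With a data frame like this, `cname[, 1]` is a character vector with one element per document, which is the shape `VectorSource()` expects.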

Import multiple files from a folder

setwd("C:\\Users\\Deepanshu Bhalla\\Documents\\text mining")
cname <-getwd()
## Number of documents
length(dir(cname))
## list file names
dir(cname)

Note : In the above code, "text mining" is the folder name. All the text files are placed in this folder.
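
Before building the corpus, it can help to confirm that the folder contains only the text files you expect, because every file in it becomes a document. A minimal sketch in base R, using a temporary folder and made-up file names in place of the real "text mining" folder:

```r
# Hypothetical folder and files, created only so the example runs anywhere.
folder <- file.path(tempdir(), "text_mining_demo")
dir.create(folder, showWarnings = FALSE)
writeLines("first document", file.path(folder, "doc1.txt"))
writeLines("second document", file.path(folder, "doc2.txt"))
writeLines("not a text file", file.path(folder, "notes.csv"))

# Keep only files ending in .txt; full.names = TRUE returns usable paths.
txt_files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

length(txt_files)      # number of text documents found
basename(txt_files)    # their file names
```

If the folder does hold other file types, tm's `DirSource()` also accepts a `pattern` argument that can restrict the corpus in the same way.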

Step 4 : Locate and load the corpus

If you imported a single file

# cname is the imported table; its first column holds the text
docs <- Corpus(VectorSource(cname[, 1]))

If you imported multiple files from a folder

docs <- Corpus(DirSource(cname))
docs
summary(docs)
inspect(docs[1])

Step 5 : Data Cleaning

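# Simple transformations
for (j in seq(docs))
{
# remove retweet markers such as "RT @user" and "via @user"
docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])
# remove remaining @mentions
docs[[j]] = gsub("@\\w+", "", docs[[j]])
# remove links
docs[[j]] = gsub("http\\S+", "", docs[[j]])
# collapse runs of spaces or tabs into a single space
docs[[j]] = gsub("[ \t]{2,}", " ", docs[[j]])
# trim leading and trailing whitespace
docs[[j]] = gsub("^\\s+|\\s+$", "", docs[[j]])
# drop characters outside the printable ASCII range
docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])
}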
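
To see what these substitutions do, here is a small sketch that applies the same kind of patterns to one made-up tweet, using base R only. The sample text and the `clean_tweet()` helper are hypothetical. The whitespace patterns are written the way they are usually intended: runs of spaces collapse to a single space, and `$` anchors the trailing-whitespace match.

```r
# A made-up tweet used only for illustration.
tweet <- "  RT @someone Loving   text mining http://t.co/abc123 via @other  "

# Hypothetical helper that chains the substitutions in one place.
clean_tweet <- function(x) {
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)  # retweet / via markers
  x <- gsub("@\\w+", "", x)                         # remaining @mentions
  x <- gsub("http\\S+", "", x)                      # links
  x <- gsub("[^\x20-\x7E]", "", x)                  # non-printable / non-ASCII
  x <- gsub("[ \t]{2,}", " ", x)                    # collapse repeated spaces
  x <- gsub("^\\s+|\\s+$", "", x)                   # trim both ends
  x
}

cleaned <- clean_tweet(tweet)
print(cleaned)
```

Running the whitespace steps last means the gaps left behind by removed mentions and links are cleaned up as well.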
# Specify stopwords in addition to the built-in English stopwords
skipwords = c(stopwords("english"), "system","technology")

kb.tf <- list(weighting = weightTf, stopwords = skipwords,
removePunctuation = TRUE,
tolower = TRUE,
wordLengths = c(4, Inf),   # keep words with at least 4 characters
removeNumbers = TRUE,
stemming = TRUE)           # stemming needs the SnowballC package

# term-document matrix
# gsub() returns plain character vectors, so convert them back to text documents first
docs <- tm_map(docs, PlainTextDocument)
tdm = TermDocumentMatrix(docs, control = kb.tf)

# convert as matrix
tdm = as.matrix(tdm)

# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)

# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
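
To make the matrix-to-data-frame steps less abstract, the sketch below counts words in a few made-up sentences with base R only. It mirrors the idea of the code above (count each term, sort in decreasing order, store the result as a data frame) but does not use tm, so there is no stop word removal or stemming. The names `demo_texts`, `demo_freqs` and `demo_dm` are illustrative.

```r
# Made-up documents used only for illustration.
demo_texts <- c("text mining with R",
                "mining text data",
                "R makes text mining fun")

# Lower-case, split on anything that is not a letter, and drop empty strings.
demo_words <- unlist(strsplit(tolower(demo_texts), "[^a-z]+"))
demo_words <- demo_words[nchar(demo_words) > 0]

# Count each word and sort the counts in decreasing order.
demo_freqs <- sort(table(demo_words), decreasing = TRUE)

# Same layout as the data frame built above: one row per word.
demo_dm <- data.frame(word = names(demo_freqs),
                      freq = as.integer(demo_freqs),
                      stringsAsFactors = FALSE)
head(demo_dm, 3)
```

A data frame in this shape, with a word column and a frequency column, is all the plotting code in the next step needs.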

Step 6 : Create WordCloud with R
# Set a seed so the word cloud layout is reproducible
set.seed(123)

# Plot a bar chart of the words that occur more than 10 times
p <- ggplot(subset(dm, freq > 10), aes(word, freq))
p <- p + geom_bar(stat = "identity")
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1))

p

# Plot the word cloud
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(6, "Dark2"), min.freq = 10, scale = c(4, .2), rot.per = .15, max.words = 100)

Note : You can remove sparse terms with the following code. Run it on the TermDocumentMatrix object, that is, before the as.matrix() conversion in Step 5 :
tdm.frequent = removeSparseTerms(tdm, 0.1)
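
The second argument is the maximum allowed sparsity. A term is dropped when the share of documents that do not contain it reaches that value, so `0.1` keeps only terms found in more than 90% of the documents. The sketch below illustrates the idea on a small made-up term-document matrix with base R. It approximates the rule for illustration and is not tm's actual implementation; `keep_terms()` is a hypothetical helper.

```r
# Made-up term-document counts: rows are terms, columns are documents.
tdm_demo <- matrix(c(2, 1, 3, 1,    # "data"   appears in all 4 documents
                     1, 0, 0, 0,    # "cloud"  appears in 1 of 4
                     0, 2, 1, 1),   # "mining" appears in 3 of 4
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("data", "cloud", "mining"),
                                   paste0("doc", 1:4)))

# Sparsity of a term = share of documents in which it does not occur.
sparsity <- rowMeans(tdm_demo == 0)
print(sparsity)

# Keep terms whose sparsity is below the chosen threshold.
keep_terms <- function(m, sparse) m[rowMeans(m == 0) < sparse, , drop = FALSE]

strict <- rownames(keep_terms(tdm_demo, 0.1))  # only terms present everywhere
loose  <- rownames(keep_terms(tdm_demo, 0.5))  # terms present in most documents
print(strict)
print(loose)
```

A low threshold such as `0.1` is quite aggressive; with many documents it can leave very few terms, so values closer to 1 are often a gentler starting point.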
23 Responses to "Create WordCloud with R"
1. Awesome article.

1. Thank you! Glad you liked it :-)

2. I follow your tutorial and I tried to make a word cloud using "I have a dream" text from M. Luther King.

Your post is clear and amazing because I was able to reproduce it. Thank you!

http://www.sthda.com/english/wiki/easyggplot2

3. Hi, I followed all the steps but failed to transform a sentence into a short term. Which is step 5,8
I got the error message:
Error in UseMethod("content", x) :
no applicable method for 'content' applied to an object of class "character"

I can run all Step 1 to Step 4 till now.

Could you help me fix this problem?

1. I have updated the code. That should solve your problem. Thanks!

4. Hi Deepanshu

1. May I know which section of the code returns this error?

5. Hey, I am using your code, but I can't do my own stopwords. I want to do a wordcloud of a chat and specify some words that shouldn't go into the wordcloud. Can you please tell me how to get my own words into stopwords? Thank you in advance

6. Hi !
Great work.
Well, I am trying to create a word-cloud using tweets. But all it shows in the wordcloud is: object,class,status words
Help appreciated.
Thanks

7. Hi Thanks for sharing this.

When I executed step 5 syntax I got below error

Error: unexpected string constant in:
" docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])
docs[[j]] = gsub(""
> }
Error: unexpected '}' in "}"

1. The above error was resolved in step 5, but when I executed next set of syntax getting error. Please see below for more details. Can you please let me know why I am getting this error

> tdm = TermDocumentMatrix(docs, control = kb.tf)
Error: inherits(doc, "TextDocument") is not TRUE
>
> # convert as matrix
> tdm = as.matrix(tdm)
>
> # get word counts in decreasing order
> word_freqs = sort(rowSums(tdm), decreasing=TRUE)
>
> # create a data frame with words and their frequencies
> dm = data.frame(word=names(word_freqs), freq=word_freqs)
Error in data.frame(word = names(word_freqs), freq = word_freqs) :

8. This is fantastic!!!

There is a small error in step 5 (instead of " used ;)

for (j in seq(docs))
{
docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])
docs[[j]] = gsub("@\\w+", "", docs[[j]])
docs[[j]] = gsub("http\\w+", "", docs[[j]])
docs[[j]] = gsub("[ \t]{2,}", "", docs[[j]])

docs[[j]] = gsub("^\\s+|\\s+quot;", "", docs[[j]])

docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])
}

1. Yes George, but in step 5 for next set of syntax I am getting below error, not sure why I am getting this error

> tdm = TermDocumentMatrix(docs, control = kb.tf)
Error: inherits(doc, "TextDocument") is not TRUE
>
> # convert as matrix
> tdm = as.matrix(tdm)
>
> # get word counts in decreasing order
> word_freqs = sort(rowSums(tdm), decreasing=TRUE)
>
> # create a data frame with words and their frequencies
> dm = data.frame(word=names(word_freqs), freq=word_freqs)
Error in data.frame(word = names(word_freqs), freq = word_freqs) :

2. Hi George,

I am sorry, I didn't get your point. What exactly have you changed in the code -

for (j in seq(docs))
{
docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])
docs[[j]] = gsub("@\\w+", "", docs[[j]])
docs[[j]] = gsub("http\\w+", "", docs[[j]])
docs[[j]] = gsub("[ \t]{2,}", "", docs[[j]])
docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])
docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])
}

3. Hi Deepanshu,

This code in Step 5
docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])

missed the " after "^\\s+|\\s+quot;

9. Issue with the version of tm you are using.

run the following command before running
tdm = TermDocumentMatrix(docs, control = kb.tf)

docs <- tm_map(docs, PlainTextDocument)

1. Thanks a lot George, it does work perfectly.

Just one question about note mentioned, in the syntax and step

Note : You can remove sparse terms with the following code :
tdm.frequent = removeSparseTerms(tdm, 0.1)

What is the use of this?

Thanks a lot for your help

10. One more question: I increased my data points, meaning I included more comments in the .csv file, but got only three words in the word cloud, whereas earlier many words were displayed in the chart.

Why does the word cloud show fewer words after I added more comments?

11. I'm using a word cloud in Shiny. When I select a data set, it should show the word cloud corresponding to the data set I selected.

1. Can anyone help me with this issue?

13. Would appreciate it if you could provide a link to the data file(s)?

