Create WordCloud with R

Deepanshu Bhalla
A wordcloud is a text mining visualization that shows the most frequently used keywords in a body of text, with more frequent words drawn larger.

An example wordcloud is shown below :
[Image: Create WordCloud with R Programming]
How to create Word Cloud with R

Step 1 : Install the required packages
install.packages("wordcloud")
install.packages("tm")
install.packages("ggplot2")
Note : If these packages are already installed, you don't need to install them again.
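
The note above can be automated. The following is a minimal sketch, assuming you only want to install the packages that are not yet available. The helper name missing_packages is made up for this illustration and is not part of any package.

```r
# Hypothetical helper: return the packages from 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("wordcloud", "tm", "ggplot2"))

# Install only what is missing (skipped in non-interactive runs)
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

If to_install is empty, nothing is downloaded and you can go straight to Step 2.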


Step 2 : Load the installed packages
library(wordcloud)
library(tm)
library(ggplot2)

Step 3 : Import data into R

Import a single file 
cname <- read.csv("C:/Users/Deepanshu Bhalla/Documents/Text.csv", header = TRUE, stringsAsFactors = FALSE)

Import multiple files from a folder

setwd("C:\\Users\\Deepanshu Bhalla\\Documents\\text mining")
cname <- getwd()
## Number of documents
length(dir(cname))
## List file names
dir(cname)

Note : In the above syntax, "text mining" is a folder name. I have placed all the text files in this folder.
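
If you do not have a folder of text files at hand, you can build a small throwaway one to try the folder workflow. This is a minimal sketch using only base R. The file names and sentences are invented sample data, and the temporary folder (called practice_dir here, a made-up name) stands in for the "text mining" folder above.

```r
# Create a temporary folder that plays the role of the "text mining" folder
practice_dir <- file.path(tempdir(), "text mining")
dir.create(practice_dir, showWarnings = FALSE)

# Invented sample documents, one text file each
samples <- c(doc1 = "Text mining turns raw text into data.",
             doc2 = "A word cloud shows frequent words in a larger font.",
             doc3 = "Frequent words are counted before plotting.")
for (n in names(samples)) {
  writeLines(samples[[n]], file.path(practice_dir, paste0(n, ".txt")))
}

## Number of documents
length(dir(practice_dir))
## List file names
dir(practice_dir)
```

Setting cname <- practice_dir afterwards would let the folder-based steps run against this sample data.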

Step 4 : Locate and load the corpus

If you imported a single file

docs <- Corpus(VectorSource(cname[, 1]))

If you imported multiple files from a folder

docs <- Corpus(DirSource(cname))
docs
summary(docs)
inspect(docs[1])

Step 5 : Data Cleaning

# Simple Transformation
# content_transformer() applies each clean-up to the text of every document
# while keeping the documents in a form TermDocumentMatrix() accepts
cleanText <- content_transformer(function(x) {
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)  # retweet / via markers
  x <- gsub("@\\w+", "", x)                          # @mentions
  x <- gsub("http\\S+", "", x)                       # links
  x <- gsub("[ \t]{2,}", " ", x)                     # runs of blanks -> one space
  x <- gsub("^\\s+|\\s+$", "", x)                    # leading / trailing whitespace
  gsub("[^\x20-\x7E]", "", x)                        # non-ASCII characters
})
docs <- tm_map(docs, cleanText)

# Specify stopwords in addition to the built-in English stopwords
skipwords = c(stopwords("english"), "system", "technology")

kb.tf <- list(weighting = weightTf, stopwords = skipwords,
              removePunctuation = TRUE,
              tolower = TRUE,
              wordLengths = c(4, Inf),  # keep words with at least 4 characters
              removeNumbers = TRUE,
              stemming = TRUE)          # stemming needs the SnowballC package

# term-document matrix
tdm = TermDocumentMatrix(docs, control = kb.tf)
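
To see what the regex clean-up in this step does, you can try the patterns on a single string first. The following is a minimal base R sketch. The helper name clean_text and the sample tweet are invented, and the whitespace patterns (collapsing repeated blanks to one space, trimming with "^\\s+|\\s+$") are an assumption about the intended behaviour.

```r
# Hypothetical helper that applies the clean-up patterns to a character vector
clean_text <- function(x) {
  x <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", x)  # retweet / via markers
  x <- gsub("@\\w+", "", x)                          # remaining @mentions
  x <- gsub("http\\S+", "", x)                       # links
  x <- gsub("[ \t]{2,}", " ", x)                     # runs of blanks -> one space
  x <- gsub("^\\s+|\\s+$", "", x)                    # leading / trailing whitespace
  gsub("[^\x20-\x7E]", "", x)                        # non-ASCII characters
}

# Invented sample tweet with a retweet marker, a link and a non-ASCII letter
tweet <- "RT @user1 Check   this out http://example.com  caf\u00e9  "
clean_text(tweet)
```

Testing the patterns on one string like this makes it easier to spot a regex that removes too much before running it over the whole corpus.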

# convert as matrix
tdm = as.matrix(tdm)

# get word counts in decreasing order
word_freqs = sort(rowSums(tdm), decreasing=TRUE)

# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
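
To see what the term-document matrix, rowSums() and sort() compute together, here is a minimal base R sketch that counts words by hand for three invented sentences. It illustrates the idea only and is not how tm implements it. The names toy_texts, toy_skipwords, toy_freqs and toy_dm, and the four-character minimum, are assumptions chosen to mirror the settings above.

```r
toy_texts <- c("The system stores text data",
               "Text mining finds frequent words in text",
               "Frequent words make the word cloud")

# Lower-case, then split each document on anything that is not a letter
tokens <- unlist(strsplit(tolower(toy_texts), "[^a-z]+"))

# Drop short words and custom stopwords (hypothetical list)
toy_skipwords <- c("system", "technology")
tokens <- tokens[nchar(tokens) >= 4 & !(tokens %in% toy_skipwords)]

# Count the words and sort the counts in decreasing order
toy_freqs <- sort(table(tokens), decreasing = TRUE)

# Data frame with words and their frequencies
toy_dm <- data.frame(word = names(toy_freqs), freq = as.vector(toy_freqs),
                     stringsAsFactors = FALSE)
toy_dm
```

Adding your own words to the stopword vector is all it takes to keep them out of the counts, and therefore out of the word cloud.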

Step 6 : Create WordCloud with R
# Fix the random seed so the wordcloud layout is reproducible
set.seed(123)

# Plot a bar chart of the words that occur more than 10 times
p <- ggplot(subset(dm, freq > 10), aes(word, freq))
p <- p + geom_bar(stat = "identity")
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1))

p

# Plot the wordcloud
wordcloud(dm$word, dm$freq, random.order = FALSE,
          colors = brewer.pal(6, "Dark2"), min.freq = 10,
          scale = c(4, .2), rot.per = .15, max.words = 100)
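
The min.freq and max.words arguments decide which words are drawn: only words with a frequency of at least min.freq, and at most max.words of the most frequent words, are plotted. A rough base R sketch of that selection, using an invented frequency table, shows which words would be candidates. It only approximates the selection, since wordcloud() may additionally leave out words that do not fit on the page (it warns when it does). All names below are made up for the illustration.

```r
# Invented word frequencies
toy <- data.frame(word = c("text", "mining", "cloud", "corpus", "matrix"),
                  freq = c(42, 25, 12, 9, 3),
                  stringsAsFactors = FALSE)

min_freq  <- 10   # plays the role of min.freq
max_words <- 2    # plays the role of max.words

# Keep words that reach the minimum frequency ...
candidates <- toy[toy$freq >= min_freq, ]
# ... and, of those, at most 'max_words' of the most frequent ones
candidates <- head(candidates[order(-candidates$freq), ], max_words)
candidates$word
```

If your wordcloud shows fewer words than expected, lowering min.freq, raising max.words or reducing the first value of scale are the settings to try first.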

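Note : You can remove sparse terms with the following code. Run it on the term-document matrix before converting it with as.matrix(). The second argument is the maximum allowed sparsity, so 0.1 keeps only the terms that appear in more than 90% of the documents :
 tdm.frequent = removeSparseTerms(tdm, 0.1)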
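
To make the sparsity threshold concrete, here is a small base R sketch on an invented term-document count matrix. It computes, for each term, the share of documents in which the term does not occur, and keeps the terms whose sparsity stays below a threshold. This is a hand-rolled illustration of the idea, not tm's implementation, and the names toy_tdm and max_sparsity are made up.

```r
# Invented counts: rows are terms, columns are documents
toy_tdm <- matrix(c(2, 1, 3, 1,    # "text"   occurs in all 4 documents
                    0, 1, 0, 0,    # "cloud"  occurs in 1 of 4 documents
                    1, 0, 2, 1),   # "mining" occurs in 3 of 4 documents
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("text", "cloud", "mining"),
                                  paste0("doc", 1:4)))

# Sparsity of a term = share of documents that do not contain it
sparsity <- rowMeans(toy_tdm == 0)

# Keep the terms whose sparsity is below the threshold
max_sparsity <- 0.3
kept <- rownames(toy_tdm)[sparsity < max_sparsity]
kept
```

A low threshold such as 0.1 is strict and can leave very few terms, while a value close to 1 only removes terms that are missing from almost every document.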
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

23 Responses to "Create WordCloud with R"
  1. I followed your tutorial and tried to make a word cloud using the "I have a dream" text from M. Luther King.

    Your post is clear and amazing because I was able to reproduce it. Thank you!

    http://www.sthda.com/english/wiki/easyggplot2

  2. Hi, I followed all the steps but failed to transform a sentence into a short term. Which is step 5,8
    I got the error message:
    Error in UseMethod("content", x) :
    no applicable method for 'content' applied to an object of class "character"

    I can run all Step 1 to Step 4 till now.

    Could you help me fix this problem?
    Thank you in advance.

    Replies
    1. I have updated the code. That should solve your problem. Thanks!
  3. Hi Deepanshu

    I followed your steps but I got the following error...please help


    Error in eval(expr, envir, enclos) : object 'word' not found

    Replies
    1. May I know which section of the code returns this error?
  4. Hey, I am using your code, but I can't do my own stopwords. I want to do a wordcloud of a chat and specify some words that shouldn't go into the wordcloud. Can you please tell me how to get my own words into stopwords? Thank you in advance

  5. Hi !
    Great work.
    Well, I am trying to create a word-cloud using tweets. But all it shows in the wordcloud is: object,class,status words
    LINK to screenshot: https://drive.google.com/file/d/0B4ZhibK97rv0SE9XOXgwOERYNjQ/view?usp=sharing
    Help appreciated.
    Thanks

  6. Hi, thanks for sharing this.

    When I executed the Step 5 syntax, I got the error below:

    Error: unexpected string constant in:
    " docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])
    docs[[j]] = gsub(""
    > }
    Error: unexpected '}' in "}"

    Replies
    1. The above error was resolved in step 5, but when I executed next set of syntax getting error. Please see below for more details. Can you please let me know why I am getting this error


      > tdm = TermDocumentMatrix(docs, control = kb.tf)
      Error: inherits(doc, "TextDocument") is not TRUE
      >
      > # convert as matrix
      > tdm = as.matrix(tdm)
      Error in as.matrix(tdm) : object 'tdm' not found
      >
      > # get word counts in decreasing order
      > word_freqs = sort(rowSums(tdm), decreasing=TRUE)
      Error in is.data.frame(x) : object 'tdm' not found
      >
      > # create a data frame with words and their frequencies
      > dm = data.frame(word=names(word_freqs), freq=word_freqs)
      Error in data.frame(word = names(word_freqs), freq = word_freqs) :
      object 'word_freqs' not found

  7. This is fantastic!!!

    There is a small error in step 5 (instead of " used ;)


    for (j in seq(docs))
    {
    docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])
    docs[[j]] = gsub("@\\w+", "", docs[[j]])
    docs[[j]] = gsub("http\\w+", "", docs[[j]])
    docs[[j]] = gsub("[ \t]{2,}", "", docs[[j]])

    docs[[j]] = gsub("^\\s+|\\s+quot;", "", docs[[j]])

    docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])
    }

    Replies
    1. Yes George, but in step 5 for next set of syntax I am getting below error, not sure why I am getting this error

      > tdm = TermDocumentMatrix(docs, control = kb.tf)
      Error: inherits(doc, "TextDocument") is not TRUE
      >
      > # convert as matrix
      > tdm = as.matrix(tdm)
      Error in as.matrix(tdm) : object 'tdm' not found
      >
      > # get word counts in decreasing order
      > word_freqs = sort(rowSums(tdm), decreasing=TRUE)
      Error in is.data.frame(x) : object 'tdm' not found
      >
      > # create a data frame with words and their frequencies
      > dm = data.frame(word=names(word_freqs), freq=word_freqs)
      Error in data.frame(word = names(word_freqs), freq = word_freqs) :
      object 'word_freqs' not found

    2. Hi George,

      I am sorry, I didn't get your point. What exactly have you changed in the code?

      for (j in seq(docs))
      {
      docs[[j]] = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", docs[[j]])
      docs[[j]] = gsub("@\\w+", "", docs[[j]])
      docs[[j]] = gsub("http\\w+", "", docs[[j]])
      docs[[j]] = gsub("[ \t]{2,}", "", docs[[j]])
      docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])
      docs[[j]] = gsub("[^\x20-\x7E]", "", docs[[j]])
      }

    3. Hi Deepanshu,

      This code in Step 5
      docs[[j]] = gsub("^\\s+|\\s+quot;, "", docs[[j]])

      missed the " after "^\\s+|\\s+quot;

  8. Issue with the version of tm you are using.

    run the following command before running
    tdm = TermDocumentMatrix(docs, control = kb.tf)

    docs <- tm_map(docs, PlainTextDocument)

    Replies
    1. Thanks a lot George, it does work perfectly.

      Just one question about note mentioned, in the syntax and step

      Note : You can remove sparse terms with the following code :
      tdm.frequent = removeSparseTerms(tdm, 0.1)

      What is the use of this?

      Thanks a lot for your help

  9. One more question: I increased my data points, that is, I included more comments in the .csv file, but got only three words in the word cloud, whereas earlier many more words were displayed in the chart.

    Why does the word cloud show fewer words after I added more comments?

  10. I'm using a word cloud in Shiny. When I select a data set, it has to show the word cloud corresponding to the data set I selected.

  11. Would appreciate it if you could provide a link to the data file(s)?