I am trying to create a word cloud from the most-used category tags in some videos.
Everything runs OK, but when the term-document matrix is created, some of the categories are split into individual words. The affected categories use the “&” symbol between words.
(examples: River & Lake, Sea & Islands, Beach & Cliffs,…)
How can I keep those words together and create the word cloud correctly?
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

# load the text data into docs variable
docs <- Corpus(VectorSource(textos))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Text mining.
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, stripWhitespace)
# A term-document matrix is a table containing the frequency of the words.
# Row names are words (terms) and column names are documents.
# The function TermDocumentMatrix() from the tm package can be used as follows:
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
# plot the wordcloud
wordcloud(words = d$word, freq = d$freq, scale = c(3, .4), min.freq = 1,
          max.words = Inf, random.order = FALSE, rot.per = 0.15,
          colors = brewer.pal(6, "Dark2"))
Your first screenshot shows that you can create a vector of words like this:
docs = c("A & B", "A & B", "C", "C", "C", NA, "A & B", "A & B", "A & B", NA)
Here your words still include the “&” symbol. You can then just skip the process that splits on “&” and run this instead:
library(dplyr)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

df_docs_counts = data.frame(docs, stringsAsFactors = F) %>%  # create a dataframe of words
  na.omit() %>%                                              # exclude NAs
  count(docs, sort = T)                                      # count number for each word

wordcloud(df_docs_counts$docs, df_docs_counts$n)
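If you would rather not depend on dplyr, the same idea can be sketched in base R: count each whole tag with table() so that nothing is ever tokenized on spaces or “&”. The sample `textos` values and the comma separator below are assumptions made for illustration only; your real data may hold one tag per element, or use a different delimiter.

```r
# Hypothetical sample data standing in for the asker's `textos` vector.
# Assumption: several tags may share one element, separated by commas.
textos <- c("River & Lake, Sea & Islands",
            "River & Lake",
            NA,
            "Beach & Cliffs, River & Lake")

# Split only on the assumed separator, never on spaces or "&",
# then trim the whitespace left around each tag.
tags <- trimws(unlist(strsplit(textos, ",", fixed = TRUE)))

# table() ignores NA by default and counts each whole tag as one term.
tag_counts <- sort(table(tags), decreasing = TRUE)

d <- data.frame(word = names(tag_counts),
                freq = as.vector(tag_counts),
                stringsAsFactors = FALSE)
print(d)

# Draw the cloud only if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = d$word, freq = d$freq,
                       min.freq = 1, random.order = FALSE)
}
```

Because the tags never pass through TermDocumentMatrix(), "River & Lake" stays a single entry in `d$word`.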