I am trying to create a word cloud from the most-used category tags in some videos.
Everything runs fine, but when the term-document matrix is created, some of the categories are split into individual words. The affected categories use the “&” symbol between words.
(examples: River & Lake, Sea & Islands, Beach & Cliffs,…)
How can I keep those words together so that each category appears as a single entry in the word cloud?
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

# Load the text data into the docs variable
docs <- Corpus(VectorSource(textos))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Text mining
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, stripWhitespace)

# The document matrix is a table containing the frequency of the words.
# Column names are words and row names are documents.
# The function TermDocumentMatrix() from the text mining package can be used as follows
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

# Plot the word cloud
wordcloud(words = d$word, freq = d$freq, scale = c(3, .4),
          min.freq = 1, max.words = Inf, random.order = FALSE,
          rot.per = 0.15, colors = brewer.pal(6, "Dark2"))
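
One direction that might work, sketched here under assumptions: since the term-document matrix counts individual words, the tokenization step could be skipped and the whole tags counted directly with base R's table(). The sample `textos` vector below is hypothetical; it assumes the real data holds exactly one category tag per element, which may not be the case. The plotting call is guarded so the counting part runs even without the plotting packages.

```r
# Hypothetical sample data: one category tag per element.
# (Assumption: the real `textos` has the same shape.)
textos <- c("River & Lake", "Sea & Islands", "Beach & Cliffs",
            "River & Lake", "Mountains", "Sea & Islands", "River & Lake")

# Count whole tags directly instead of tokenizing them into words.
tag_counts <- sort(table(trimws(textos)), decreasing = TRUE)
d <- data.frame(word = names(tag_counts),
                freq = as.integer(tag_counts),
                stringsAsFactors = FALSE)
head(d, 10)

# Plot the word cloud from the tag frequencies (only if the packages are installed).
if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  wordcloud::wordcloud(words = d$word, freq = d$freq, scale = c(3, 0.4),
                       min.freq = 1, max.words = Inf, random.order = FALSE,
                       rot.per = 0.15,
                       colors = RColorBrewer::brewer.pal(6, "Dark2"))
}
```

If an element can hold several tags separated by a delimiter, it would first need to be split on that delimiter (for example with strsplit()) before counting; that case is not covered by this sketch.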