R – create wordcloud from most used categories

Question:

I am trying to create a word cloud from the most-used category tags in some videos.

Everything runs fine, but when the document matrix is created, some of the categories are split into individual words. The affected categories use the “&” symbol between words.

(examples: River & Lake, Sea & Islands, Beach & Cliffs,…)

How can I keep those words together and create the word cloud correctly?

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

#load the text data into docs variable
docs <- Corpus(VectorSource(textos))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

#Text Mining. 
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, stripWhitespace)

[Screenshot: output of inspect(docs) showing the words]

#The term-document matrix is a table containing the frequency of the words.
#Row names are words (terms) and column names are documents.
#The function TermDocumentMatrix() from the tm (text mining) package can be used as follows

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
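As a side note on the frequency step: as.matrix(dtm) has one row per term and one column per document, so rowSums() yields the total count of each term. The sketch below mimics that with a small hand-built matrix in base R; the terms and counts are made up for illustration and do not come from the question's data.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents.
# matrix() fills column by column, so document "1" holds 2, 0, 1.
m <- matrix(c(2, 0, 1,
              0, 3, 1),
            nrow = 3,
            dimnames = list(Terms = c("beach", "lake", "river"),
                            Docs  = c("1", "2")))

# Total frequency of each term across all documents, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
d
```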

[Screenshot: after applying TermDocumentMatrix(), the categories containing the “&” symbol are split into individual words]
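The split can be reproduced outside tm. The sketch below is only an approximation, assuming the tokenizer breaks text on whitespace; tm's actual tokenizer and its default filters are not reproduced here. The `tags` vector is made-up sample data based on the examples in the question.

```r
# Hypothetical sample of category tags, taken from the examples in the question
tags <- c("River & Lake", "Sea & Islands", "Beach")

# Splitting on whitespace breaks every multi-word category into pieces,
# so "River & Lake" is no longer a single unit
tokens <- unlist(strsplit(tags, "[[:space:]]+"))
tokens
```

Once a category has been broken up like this, counting tokens can no longer recover "River & Lake" as one entry, which is why the categories have to be kept whole before counting.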

#plot the wordcloud

wordcloud(words = d$word, freq = d$freq, scale = c(3,.4), min.freq = 1,
          max.words=Inf, random.order=FALSE, rot.per=0.15, 
          colors=brewer.pal(6, "Dark2"))

[Image: resulting word cloud showing the most used categories]

Comments on the question:

Why do you split words on &? You have to find a way to skip that step in your process. I’ll post something that might help you….
    
Also, have a look at the tidytext package: cran.r-project.org/web/packages/tidytext/vignettes/…
    
All of this was mainly trial and error, but I will check out what you suggest! Thanks!
– SaoCricalho
1 hour ago

Answers:

Answer 1:

Your first screenshot shows that you can create a vector of words like this:

docs = c("A & B", "A & B", "C", "C", "C", NA, "A & B", "A & B", "A & B", NA)

Where your words still include &.

Then you can just skip the process that splits on & and run this instead:

library(dplyr)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

df_docs_counts = data.frame(docs, stringsAsFactors = FALSE) %>%  # create a data frame of words
      na.omit() %>%                                              # exclude NAs
      count(docs, sort = TRUE)                                   # count occurrences of each word

wordcloud(df_docs_counts$docs, df_docs_counts$n)
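If the raw strings in textos hold several categories each, a base R alternative is to split only on the separators and never on spaces. This is a sketch, not the answer author's method: the sample `textos` values, and the assumption that "/", "@" and "|" are the separators, are guesses based on the toSpace() calls in the question.

```r
# Hypothetical input: each element holds one or more category tags,
# assumed to be separated by "/", "@" or "|"
textos <- c("River & Lake/Beach & Cliffs",
            "Sea & Islands",
            "River & Lake | Sea & Islands",
            NA)

# Split on the separators only, so "River & Lake" stays in one piece
tags <- unlist(strsplit(textos[!is.na(textos)], "[/@|]"))
tags <- trimws(tags)          # drop spaces left around the separators
tags <- tags[nzchar(tags)]    # drop empty strings

# Count each full category and sort by frequency
freq <- sort(table(tags), decreasing = TRUE)
d <- data.frame(word = names(freq),
                freq = as.integer(freq),
                stringsAsFactors = FALSE)
d
```

The resulting data frame has the same word and freq columns as in the question, so it can be passed to wordcloud(words = d$word, freq = d$freq, min.freq = 1) without going through TermDocumentMatrix().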

Original source:

https://stackoverflow.com/questions/47752408/r-create-wordcloud-from-most-used-categories
