Extract ngrams with R

I am trying to extract 3-grams from the Nirvana lyrics below, and for this I am using the ngramrr package.

require(ngramrr)
require(tm)
require(magrittr)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
             "hello hello hello how low", "hello hello hello",
             "with the lights out", "it less dangerous", "here we are now", "entertain us",
             "i feel stupid", "and contagious", "here we are now", "entertain us",
             "a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay")

ngramrr(nirvana[1], ngmax = 3)

Corpus(VectorSource(nirvana))

      

I get this result:

[1] "hello"             "hello"             "hello"             "how"               "low"               "hello hello"       "hello hello"      
 [8] "hello how"         "how low"           "hello hello hello" "hello hello how"   "hello how low"   

      

I would like to know how I can build a TermDocumentMatrix in which the terms are the trigrams.

Thanks.



1 answer


My comment above covers most of it, but with the quanteda package it goes something like this:



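library(quanteda)  # tokens(), tokens_ngrams(), dfm(), convert()
library(magrittr)  # provides %>%

nirvana %>% tokens() %>% # generate word tokens
  tokens_ngrams(n = 1:3, concatenator = " ") %>% # 1- to 3-grams; quanteda < 2.0 used tokens(ngrams = 1:3), use n = 3 for trigrams only
  dfm() %>% # generate dfm
  convert(to = "tm") %>% # convert to tm document-term-matrix (needs tm installed)
  t() # transpose it to term-document-matrix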
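
If you would rather stay within tm, another option is to pass a custom tokenizer to TermDocumentMatrix. The following is only a minimal sketch, assuming tm is installed (it attaches NLP, which supplies ngrams() and words()). The names lyrics and TrigramTokenizer are hypothetical and introduced here for illustration, and the short lyrics vector stands in for the full nirvana vector from the question.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical sample input standing in for the full lyrics vector
lyrics <- c("hello hello hello how low", "with the lights out", "entertain us")

# Hypothetical helper: returns the word trigrams of one document as strings.
# vapply() yields character(0) for documents with fewer than three words.
TrigramTokenizer <- function(doc) {
  vapply(NLP::ngrams(NLP::words(doc), 3L), paste, character(1), collapse = " ")
}

# Assumption: VCorpus is used instead of Corpus because, in recent tm versions,
# Corpus(VectorSource(...)) builds a SimpleCorpus that ignores custom tokenizers.
corpus <- VCorpus(VectorSource(lyrics))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

inspect(tdm)
```

Documents with fewer than three words, such as "entertain us", contribute no terms here but still appear as (empty) columns of the matrix.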

      
