Extract ngrams with R

Question


I am trying to extract trigrams from the nirvana text below, and for this I am using the ngramrr package.

require(ngramrr)
require(tm)
require(magrittr)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
             "hello hello hello how low", "hello hello hello",
             "with the lights out", "it less dangerous", "here we are now", "entertain us",
             "i feel stupid", "and contagious", "here we are now", "entertain us",
             "a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay")

ngramrr(nirvana[1], ngmax = 3)

Corpus(VectorSource(nirvana))

I get this result:

[1] "hello"             "hello"             "hello"             "how"               "low"               "hello hello"       "hello hello"      
 [8] "hello how"         "how low"           "hello hello hello" "hello hello how"   "hello how low"

I would like to know how I can build a TermDocumentMatrix where the terms are the trigrams.

Thanks.


r text-mining

dr.nasri84 05 May '17 at 14:24


1 answer

amatsuo_net · Accepted Answer · 2017-05-05T14:53:27+0000

My comment above is almost complete, but with the quanteda package it goes something like this:

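library(quanteda)  # provides tokens(), dfm() and convert()
library(magrittr)  # provides the %>% pipe

nirvana %>% tokens(ngrams = 1:3) %>% # generate unigram, bigram and trigram tokens
  dfm() %>% # build a document-feature matrix
  convert(to = "tm") %>% # convert to a tm DocumentTermMatrix
  t() # transpose it to a TermDocumentMatrix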
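
As far as I know, the `ngrams` argument of `tokens()` is no longer available in more recent quanteda releases, where n-grams are built with `tokens_ngrams()` instead. The following is a minimal sketch under that assumption. It keeps only trigrams (`n = 3`), since that is what the question asks for, and uses a shortened version of the `nirvana` vector for illustration. It also assumes the tm package is installed, because `convert(to = "tm")` relies on it.

```r
library(quanteda)

# Shortened sample of the lyrics from the question (illustration only)
nirvana <- c("hello hello hello how low",
             "with the lights out",
             "here we are now",
             "entertain us")

# Tokenize into words, then build trigrams joined by a space
toks <- tokens(nirvana)
tri  <- tokens_ngrams(toks, n = 3, concatenator = " ")

# Document-feature matrix -> tm DocumentTermMatrix -> TermDocumentMatrix
tdm <- t(convert(dfm(tri), to = "tm"))
tdm
```

Documents with fewer than three words (such as "entertain us") contribute no trigrams, so their column in the matrix is empty. Use `n = 1:3` to reproduce the mix of unigrams, bigrams and trigrams shown in the question.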
