Use DocumentTermMatrix in R with dictionary parameter

I want to use R to classify text. I am using DocumentTermMatrix to return a word matrix:

library(tm)
crude <- "japan korea usa uk albania azerbaijan"
corps <- Corpus(VectorSource(crude))
# Matrix built from every term found in the document
dtm <- DocumentTermMatrix(corps)
inspect(dtm)

# Matrix restricted to the terms listed in a fixed dictionary
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")
test <- DocumentTermMatrix(corps, control = list(dictionary = words))
inspect(test)

      

The first call, inspect(dtm), works as expected with the result:

    Terms
Docs albania azerbaijan japan korea usa
   1       1          1     1     1   1

      

But the second call, inspect(test), shows this result:

    Terms
Docs argentina australia japan korea turkey uganda
   1         0         1     0     1      0      0

      

The result I expected is:

    Terms
Docs argentina australia japan korea turkey uganda
   1         0         0     1     1      0      0

      

Is this a bug or am I using it incorrectly?



1 answer


Corpus() seems to have a bug when indexing word frequencies.

Use VCorpus() instead; it gives the expected result.
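As a minimal sketch of that change, here is the question's code with only the corpus constructor swapped. The variable name counts is added for illustration, and the comment about SimpleCorpus is my understanding of recent tm versions rather than something confirmed above.

```r
library(tm)

crude <- "japan korea usa uk albania azerbaijan"
words <- c("australia", "korea", "uganda", "japan", "argentina", "turkey")

# VCorpus() builds a volatile corpus explicitly. Assumption: in recent tm
# versions, Corpus() on a VectorSource returns a SimpleCorpus instead,
# which is where the dictionary counts appear to go wrong.
corps <- VCorpus(VectorSource(crude))

# Same dictionary-restricted matrix as in the question
test <- DocumentTermMatrix(corps, control = list(dictionary = words))
inspect(test)

# Plain matrix form, convenient for checking individual term counts
counts <- as.matrix(test)
```

With this version, japan and korea should be the two terms counted once, matching the expected result in the question.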
