tm loses metadata when applying tm_map

I have a (small) problem with the tm library for R. Let's say I have a corpus:

# boilerplate
library(tm)
bcorp <- c("one","two","three","four","five")
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

      

Result:

[1] "1" "2" "3" "4" "5"

      

It works. But when I try to apply a transformation with tm_map():

# this does not work
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
tdm <- TermDocumentMatrix(myCorpus)

      

gives

Error: inherits(doc, "TextDocument") is not TRUE

      

The solution suggested in this case was to convert to PlainTextDocument.

# this works but erases the metadata
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

      

Result:

[1] "character(0)" "character(0)" "character(0)" "character(0)" "character(0)"

      

Now it works, but it removes all metadata (in this case, document names). Is there a way to preserve metadata, or save and then restore it?



1 answer


I found it.

Line:

myCorpus <- tm_map(myCorpus, PlainTextDocument)

      

solves the problem but removes the metadata.

I found an answer that explains the best way to use tm_map(). I just need to replace:



myCorpus <- tm_map(myCorpus, tolower)

      

with:

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

      

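And everything works! content_transformer() applies tolower only to the content of each document, so the documents remain TextDocument objects and keep their metadata.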
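For reference, the whole fix can be sketched end to end as below. This is only a minimal sketch: it assumes tm 0.6 or later (where content_transformer() is available), it calls VCorpus() with readerControl instead of the Corpus() call from the question so the corpus type is explicit, and doc_ids is a hypothetical helper added just to compare the document ids before and after the transformation.

```r
library(tm)

bcorp <- c("ONE", "Two", "three", "four", "five")

# Assumption: VCorpus() is used here instead of the question's Corpus() call
myCorpus <- VCorpus(VectorSource(bcorp),
                    readerControl = list(language = "en_US"))

# Hypothetical helper: collect the "id" metadata of every document
doc_ids <- function(corpus) {
  vapply(seq_along(corpus),
         function(i) meta(corpus[[i]], "id"),
         character(1))
}

ids_before <- doc_ids(myCorpus)

# Wrap tolower so it changes only the content, not the document object
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

ids_after <- doc_ids(myCorpus)

# No PlainTextDocument conversion is needed before building the matrix
tdm <- TermDocumentMatrix(myCorpus)

print(identical(ids_before, ids_after))
print(Docs(tdm))
```

If the metadata survives, the first print should report TRUE and Docs(tdm) should list the original document names rather than "character(0)".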
