TermDocumentMatrix doing unsolicited cleanup (like removing punctuation)

Question

The TermDocumentMatrix function from the tm package does not behave the way I understand the documentation. It seems to apply processing that I did not ask for.

Here's an example:

require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf), 
                                                 removePunctuation = FALSE))
rownames(tdm)

You can see from the output that the punctuation has been removed, and the token "rising...what" has been split into two terms:

 [1] "a"         "about"     "am"        "and"       "astrology" "cap"       "capricorn" "does"      "i"         "me"        "moon"      "rising"    "say"       "sun"       "that"     
[16] "what"

In a related SO question, the problem was a tokenizer that removed punctuation. However, I am using the default words tokenizer, which I do not believe does this:

> sapply(corpus, words)
      [,1]           
 [1,] "Astrology:"   
 [2,] "I"            
 [3,] "am"           
 [4,] "a"            
 [5,] "Capricorn"    
 [6,] "Sun"          
 [7,] "Cap"          
 [8,] "moon"         
 [9,] "and"          
[10,] "cap"          
[11,] "rising...what"
[12,] "does"         
[13,] "that"         
[14,] "say"          
[15,] "about"        
[16,] "me?"

Is the observed behavior wrong, or what is the misunderstanding?

r tm

James hirschorn May 7 '17 at 2:35

1 answer

lukeA · Accepted Answer · 2017-05-07T13:35:01+0000

You have a SimpleCorpus object, a class that was introduced in version 0.7 of the tm package and which, according to ?SimpleCorpus,

adopts internal shortcuts to improve performance and minimize memory pressure

class(corpus)
# [1] "SimpleCorpus" "Corpus"
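A quick way to check which kind of corpus you have before building the matrix is sketched below. The helper name uses_fixed_pipeline is made up for this illustration and is not part of tm; the rest uses only Corpus, VCorpus and VectorSource from the question and answer.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Corpus() on a VectorSource gave a SimpleCorpus in the question (tm >= 0.7);
# VCorpus() asks for a volatile corpus explicitly.
simple   <- Corpus(VectorSource(sentence))
volatile <- VCorpus(VectorSource(sentence))

# Hypothetical helper (not a tm function): TRUE when the corpus is a
# SimpleCorpus, i.e. when TermDocumentMatrix will not use a custom tokenizer.
uses_fixed_pipeline <- function(corpus) inherits(corpus, "SimpleCorpus")

uses_fixed_pipeline(simple)
uses_fixed_pipeline(volatile)
```

If the first call returns TRUE, the control options you pass are handled by the SimpleCorpus code path rather than by termFreq.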

Now, help(TermDocumentMatrix) points out:

The available local options are documented in termFreq and are delegated internally to a termFreq call. This is different for a SimpleCorpus. In this case, all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) ...

So you are not using words as the tokenizer, which would indeed give you

words(sentence)
 [1] "Astrology:"    "I"             "am"            "a"             "Capricorn"     "Sun"           "Cap"          
 [8] "moon"          "and"           "cap"           "rising...what" "does"          "that"          "say"          
[15] "about"         "me?"

As pointed out in the comments, you can explicitly make your corpus a volatile one (see ?VCorpus) to get full flexibility:

A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object

corpus <- VCorpus(VectorSource(sentence)) 
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
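To tie this back to the control options from the question, here is a minimal sketch that combines the VCorpus approach with wordLengths = c(1, Inf) and removePunctuation = FALSE. It assumes the termFreq defaults otherwise apply, so terms are still lower-cased; the variable names tdm and terms are only for illustration.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# A volatile corpus, so the control options are delegated to termFreq.
corpus <- VCorpus(VectorSource(sentence))

tdm <- TermDocumentMatrix(
  corpus,
  control = list(
    tokenize          = "words",     # the whitespace-based tokenizer from the question
    removePunctuation = FALSE,       # keep ":" "?" and "..." inside tokens
    wordLengths       = c(1, Inf)    # keep one- and two-character terms such as "a", "am"
  )
)

terms <- Terms(tdm)
print(terms)

# The token that was split in the question should now survive as one term.
"rising...what" %in% terms
```

Lower-casing still happens because tolower defaults to TRUE in termFreq; add tolower = FALSE to the control list if you need the original case.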
