TermDocumentMatrix doing unsolicited cleanup (like removing punctuation)

The TermDocumentMatrix function from the tm package does not behave the way I understand the documentation. It seems to apply processing that I didn't ask for.

Here's an example:

require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf), 
                                                 removePunctuation = FALSE))
rownames(tdm)

      

You can see from the output that the punctuation has been removed, and the expression "rising...what" has been split:

 [1] "a"         "about"     "am"        "and"       "astrology" "cap"       "capricorn" "does"      "i"         "me"        "moon"      "rising"    "say"       "sun"       "that"     
[16] "what"  

      

In a related SO question, the problem was a tokenizer that removed punctuation. However, I am using the default words tokenizer, which I do not believe does this:

> sapply(corpus, words)
      [,1]           
 [1,] "Astrology:"   
 [2,] "I"            
 [3,] "am"           
 [4,] "a"            
 [5,] "Capricorn"    
 [6,] "Sun"          
 [7,] "Cap"          
 [8,] "moon"         
 [9,] "and"          
[10,] "cap"          
[11,] "rising...what"
[12,] "does"         
[13,] "that"         
[14,] "say"          
[15,] "about"        
[16,] "me?" 

      

Is the observed behavior a bug, or what am I misunderstanding?


1 answer


Your corpus is a SimpleCorpus object, a class introduced in version 0.7 of the tm package, which, according to ?SimpleCorpus, adopts internal shortcuts to improve performance and minimize memory pressure:

class(corpus)
# [1] "SimpleCorpus" "Corpus"  

      

help(TermDocumentMatrix) then points out:

The available local options are documented in termFreq and are delegated internally to a termFreq call. This is different for a SimpleCorpus. In this case, all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) ...



So you are not actually using words as the tokenizer, which would indeed give you

words(sentence)
 [1] "Astrology:"    "I"             "am"            "a"             "Capricorn"     "Sun"           "Cap"          
 [8] "moon"          "and"           "cap"           "rising...what" "does"          "that"          "say"          
[15] "about"         "me?"  

      

As pointed out in the comments, you can explicitly make your corpus volatile (see ?VCorpus) to get full flexibility:

A volatile corpus is kept entirely in memory, so all changes affect only the corresponding R object.

corpus <- VCorpus(VectorSource(sentence)) 
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
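
To see the difference side by side, here is a minimal sketch that builds both corpus types from the question's sentence and passes the same control list to each. The variable names (ctrl, simple, volatile, terms_simple, terms_volatile) are just illustrative, and it assumes tm 0.7 or later, where Corpus() on a VectorSource returns a SimpleCorpus. The exact terms produced for the SimpleCorpus may vary between tm versions, so no output is shown here.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Same options as in the question, with the tokenizer named explicitly.
ctrl <- list(tokenize = "words",
             wordLengths = c(1, Inf),
             removePunctuation = FALSE)

# Corpus() on a VectorSource gives a SimpleCorpus (assumes tm >= 0.7),
# for which the tokenize option is not used.
simple <- Corpus(VectorSource(sentence))

# VCorpus() gives a volatile corpus, whose options are passed on to termFreq.
volatile <- VCorpus(VectorSource(sentence))

class(simple)
class(volatile)

terms_simple   <- Terms(TermDocumentMatrix(simple,   control = ctrl))
terms_volatile <- Terms(TermDocumentMatrix(volatile, control = ctrl))

# Terms that only show up when the words tokenizer is actually applied
setdiff(terms_volatile, terms_simple)
```

With the VCorpus, tokens such as "rising...what" should survive intact (lowercased, since tolower is on by default), which is what the question expected in the first place.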

      
