TermDocumentMatrix doing unsolicited cleanup (like removing punctuation)
The TermDocumentMatrix function from the tm package does not work the way I understand the documentation. It seems to apply processing that I didn't ask for.
Here's an example:
require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf),
removePunctuation = FALSE))
rownames(tdm)
You can see from the output that the punctuation has been removed, and the expression "rising...what" has been split:
[1] "a" "about" "am" "and" "astrology" "cap" "capricorn" "does" "i" "me" "moon" "rising" "say" "sun" "that"
[16] "what"
In a related SO question, the problem was a tokenizer that removed punctuation. However, I am using the default words tokenizer, which I do not believe does this:
> sapply(corpus, words)
[,1]
[1,] "Astrology:"
[2,] "I"
[3,] "am"
[4,] "a"
[5,] "Capricorn"
[6,] "Sun"
[7,] "Cap"
[8,] "moon"
[9,] "and"
[10,] "cap"
[11,] "rising...what"
[12,] "does"
[13,] "that"
[14,] "say"
[15,] "about"
[16,] "me?"
Is the observed behavior wrong, or what am I misunderstanding?
You have a SimpleCorpus object, which was introduced in version 0.7 of the tm package and which, according to ?SimpleCorpus,
adopts internal shortcuts to improve performance and minimize memory pressure
class(corpus)
# [1] "SimpleCorpus" "Corpus"
Now, help(TermDocumentMatrix) points out:
Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) ...
So you are not actually using words as the tokenizer, which would indeed give you
words(sentence)
[1] "Astrology:" "I" "am" "a" "Capricorn" "Sun" "Cap"
[8] "moon" "and" "cap" "rising...what" "does" "that" "say"
[15] "about" "me?"
As pointed out in the comments, you can explicitly make your corpus a volatile one (see ?VCorpus) to get full flexibility:
A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object
corpus <- VCorpus(VectorSource(sentence))
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
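To see the difference side by side, here is a minimal sketch that builds both corpus types from the same sentence and compares the resulting terms. It assumes tm version 0.7 or later, where Corpus() on a VectorSource returns a SimpleCorpus; the variable names (simple, volatile, terms_simple, terms_volatile) are only illustrative.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Assumption: tm >= 0.7, so Corpus() on a VectorSource gives a SimpleCorpus
simple <- Corpus(VectorSource(sentence))
# VCorpus() gives a volatile corpus whose control options are passed on to termFreq()
volatile <- VCorpus(VectorSource(sentence))

# Same options as in the question; no tokenizer is requested for the SimpleCorpus
terms_simple <- Terms(TermDocumentMatrix(
  simple,
  control = list(wordLengths = c(1, Inf), removePunctuation = FALSE)
))

# For the VCorpus, the words tokenizer is requested explicitly
terms_volatile <- Terms(TermDocumentMatrix(
  volatile,
  control = list(tokenize = "words", wordLengths = c(1, Inf), removePunctuation = FALSE)
))

# Terms that survive only when the words tokenizer is actually applied
setdiff(terms_volatile, terms_simple)
```

If the two corpus classes behave as described above, the setdiff() call should list the tokens that still carry their punctuation, such as "rising...what".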