Remove meaningless words from corpus in R

I use tm and wordcloud to do some basic text mining in R. The processed text contains a lot of meaningless words like asfdg or aawptkr, and I need to filter them out. The closest solution I have found is using library(qdapDictionaries) and writing a custom function to validate words.

library(qdapDictionaries)

# TRUE for each element of x that appears in the GradyAugmented word list
is.word <- function(x) x %in% GradyAugmented

# example
> is.word("aapg")
[1] FALSE
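
Since %in% is vectorized, is.word() also works on a whole vector of words at once:

> is.word(c("apple", "asfdg", "house"))
[1]  TRUE FALSE  TRUE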


The rest of the text-processing code I use:

curDir <- "E:/folder1/"  # folder1 contains a.txt, b.txt
myCorpus <- VCorpus(DirSource(curDir))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

myCorpus <- tm_map(myCorpus, foo)  # foo should clear the meaningless words; this is the missing piece


The problem is that is.word() works well for checking individual words, but how do I apply it to the whole corpus?

Thanks.


2 answers


I'm not sure if this is the most resource-efficient method (I don't know the package very well), but it should work:



tdm <- TermDocumentMatrix(myCorpus)
all_tokens       <- findFreqTerms(tdm, 1)                # every token that occurs in the corpus
tokens_to_remove <- setdiff(all_tokens, GradyAugmented)  # tokens not in the word list
myCorpus <- tm_map(myCorpus, content_transformer(removeWords),
                   tokens_to_remove)
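
One caveat I'll add (an assumption on my part, not something from the answer above): removeWords() pastes the whole word list into a single regular expression, so a very long tokens_to_remove vector can exceed the regex engine's pattern-size limit. If that happens, removing the words in chunks should work; a minimal sketch, with an arbitrary chunk size:

chunk_size <- 1000  # arbitrary; small enough to keep each regex manageable
for (i in seq(1, length(tokens_to_remove), by = chunk_size)) {
  chunk <- tokens_to_remove[i:min(i + chunk_size - 1, length(tokens_to_remove))]
  myCorpus <- tm_map(myCorpus, content_transformer(removeWords), chunk)
}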


If you want to try a different text mining package, this will work:

library(readtext)
library(quanteda)
myCorpus <- corpus(readtext("E:/folder1/*.txt"))

# tokenize the corpus
myTokens <- tokens(myCorpus, remove_punct = TRUE, remove_numbers = TRUE)
# keep only the tokens found in an English dictionary
myTokens <- tokens_select(myTokens, names(data_int_syllables))




From there, you can form a document-feature matrix (called a "dfm" in quanteda) for analysis, and it will contain only the features that matched English terms in the dictionary (which contains about 130,000 words).
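
For instance, a minimal sketch of that step (my addition, not part of the original answer):

# build the document-feature matrix from the filtered tokens
myDfm <- dfm(myTokens)
topfeatures(myDfm, 10)  # the ten most frequent remaining features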
