Remove meaningless words from corpus in R

I use tm and wordcloud to do some basic text mining in R. The processed text contains a lot of meaningless words like asfdg or aawptkr, and I need to filter them out. The closest solution I have found is using library(qdapDictionaries) and writing a custom function to validate words.

library(qdapDictionaries)

# TRUE for each element of x that appears in the GradyAugmented word list
is.word <- function(x) x %in% GradyAugmented

# example
> is.word("aapg")
[1] FALSE
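
Since %in% is vectorized, is.word() also works on a whole vector of words at once:

> is.word(c("apple", "asfdg", "house"))
[1]  TRUE FALSE  TRUE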


The rest of the text-processing code I use:

curDir <- "E:/folder1/"  # folder1 contains a.txt, b.txt
myCorpus <- VCorpus(DirSource(curDir))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)

myCorpus <- tm_map(myCorpus, foo)  # foo should clear the meaningless words; this is the missing piece


The problem is that is.word() works well for checking individual words, but how do I apply it to the whole corpus?

Thanks.


2 answers


I'm not sure if this is the most resource-efficient method (I don't know the package very well), but it should work:



tdm <- TermDocumentMatrix(myCorpus)
all_tokens       <- findFreqTerms(tdm, 1)                # every token that occurs in the corpus
tokens_to_remove <- setdiff(all_tokens, GradyAugmented)  # tokens not in the word list
myCorpus <- tm_map(myCorpus, content_transformer(removeWords),
                   tokens_to_remove)
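
One caveat I'll add (an assumption on my part, not something from the answer above): removeWords() pastes the whole word list into a single regular expression, so a very long tokens_to_remove vector can exceed the regex engine's pattern-size limit. If that happens, removing the words in chunks should work; a minimal sketch, with an arbitrary chunk size:

chunk_size <- 1000  # arbitrary; small enough to keep each regex manageable
for (i in seq(1, length(tokens_to_remove), by = chunk_size)) {
  chunk <- tokens_to_remove[i:min(i + chunk_size - 1, length(tokens_to_remove))]
  myCorpus <- tm_map(myCorpus, content_transformer(removeWords), chunk)
}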


If you want to try a different text mining package, this will work:

library(readtext)
library(quanteda)
myCorpus <- corpus(readtext("E:/folder1/*.txt"))

# tokenize the corpus
myTokens <- tokens(myCorpus, remove_punct = TRUE, remove_numbers = TRUE)
# keep only the tokens found in an English dictionary
myTokens <- tokens_select(myTokens, names(data_int_syllables))




From there, you can form a document-feature matrix (called a "dfm" in quanteda) for analysis, and it will contain only the features that matched English terms in the dictionary (which contains about 130,000 words).
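
For instance, a minimal sketch of that step (my addition, not part of the original answer):

# build the document-feature matrix from the filtered tokens
myDfm <- dfm(myTokens)
topfeatures(myDfm, 10)  # the ten most frequent remaining features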
