How to lemmatize an entire corpus in R faster than my current approach

I have tried several ways of lemmatizing a huge chunk of words in R. In the end I got it working with the koRpus package, which wraps the TreeTagger application.

content.cc is my corpus, containing about 7000 documents with an average length of about 300 words. I defined this function:

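library(koRpus)

lemmatizeCorpus <- function(x) {
  # Return empty or missing documents unchanged, so that sapply()
  # still yields a character vector instead of a list with NULLs
  if (is.na(x) || x == "") {
    return(x)
  }

  # Tag a single document; x is passed as an object (format = "obj"), not a file
  words.cc <- treetag(x, treetagger = "manual", format = "obj",
                      TT.tknz = FALSE, lang = "en",
                      TT.options = list(path = "c:/TreeTagger", preset = "en"))

  tagged <- words.cc@TT.res

  # Use the lemma unless TreeTagger reports "<unknown>"; then keep the token
  words.lm <- ifelse(tagged$lemma != "<unknown>", tagged$lemma, tagged$token)

  # Rebuild the document as one space-separated string
  paste(words.lm, collapse = " ")
}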

      

and I run it like this:

content.lw <- sapply(X = content.cc$content, FUN = lemmatizeCorpus, USE.NAMES = FALSE)

      

This gives the desired result: it replaces the words that have a lemma in the TreeTagger dictionary and, importantly, keeps the structure the same as in the corpus (the number of documents, the words, and the number of words). The problem is that it takes about an hour to run (on my rather slow machine, but the machine is beside the point).

I also tried concatenating the entire corpus into one character matrix with stri_extract_all_words(content.cc$content) and passing the corpus as a whole to treetag. It was about 5x faster (same function body), but I got lost trying to work out which words belong to which document, because the number of words extracted by stri and the number returned by treetag were slightly different. The per-document loop, in contrast, is stable.
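One way to keep the document boundaries when tagging the whole corpus in a single call might be to put a marker token between documents and split on it afterwards, instead of matching word counts from two different tokenizers. The following is only a sketch under assumptions: doc_sep, lemmatize_batch and fake_tagger are hypothetical names, the marker is assumed never to occur in the corpus and to survive tokenization as a single token, and fake_tagger is a stand-in that only mimics the token and lemma columns of words.cc@TT.res so that the example runs without TreeTagger.

```r
# Hypothetical marker token; assumed never to occur in the real corpus
doc_sep <- "DOCSEPXYZ"

# tag_fun: any function that takes one string and returns a data.frame
# with character columns `token` and `lemma` (like words.cc@TT.res above)
lemmatize_batch <- function(docs, tag_fun, sep = doc_sep) {
  # Join all documents, with the marker as its own token between them
  joined <- paste(docs, collapse = paste0(" ", sep, " "))
  tagged <- tag_fun(joined)

  # Every marker token closes one document and opens the next
  is_sep <- tagged$token == sep
  stopifnot(sum(is_sep) == length(docs) - 1L)
  doc_id <- cumsum(is_sep) + 1L

  # Same rule as in the question: keep the token when the lemma is unknown
  lemma <- ifelse(tagged$lemma != "<unknown>", tagged$lemma, tagged$token)

  # Group lemmas by document; empty documents keep their slot
  keep <- !is_sep
  groups <- factor(doc_id[keep], levels = seq_along(docs))
  pieces <- split(lemma[keep], groups)
  unname(vapply(pieces, paste, character(1), collapse = " "))
}

# Stand-in tagger for demonstration only: whitespace tokens, tiny dictionary
fake_tagger <- function(text) {
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  dict <- c(cats = "cat", ran = "run", is = "be")
  lemma <- ifelse(tokens %in% names(dict), dict[tokens], "<unknown>")
  data.frame(token = tokens, lemma = unname(lemma), stringsAsFactors = FALSE)
}

docs <- c("cats ran", "", "dogs is here")
result <- lemmatize_batch(docs, fake_tagger)
print(result)
```

With the real tagger, tag_fun would presumably be a thin wrapper that calls treetag with the same arguments as in lemmatizeCorpus and returns its TT.res slot. The stopifnot check fails early if the marker was split or dropped during tokenization. Whether a single call over all 7000 documents fits in memory is not something this sketch checks, so tagging in chunks of a few hundred documents may be a safer middle ground.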

Another attempt was to use the stemmer from the tm package, which is popular and for which help and solutions can be found on this forum as well, but it hits the regex memory limit very quickly, so I end up back in a loop, with the same effect as the current approach.

All I need are some suggestions: what can I do about this? Is it possible at all? Maybe it cannot be sped up because that is simply how TreeTagger works and it cannot go any faster. I know this is difficult. Using sapply, for example, is about 2x faster than a plain loop, so that is already some improvement.
