Finding the best way to replace a list of patterns in long documents

Using the tm package, I have a corpus of 10,900 docs.

docs = Corpus(VectorSource(abstracts$abstract))

      

I also have a list of terms together with all their synonyms and alternative spellings, which I use to convert each synonym or spelling into a single canonical term.

Term, Synonyms
term1, synonym1
term1, synonym2
term1, synonym3
term2, synonym1
... etc
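
For context, in R this list is just a two-column data frame; something like the following would load it (the file name here is only an example):

termslist <- read.csv("terms_and_synonyms.csv",  # example file name
                      stringsAsFactors = FALSE)
# column 1 = canonical term, column 2 = one synonym or spelling per row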

      

The way I am doing it right now is to iterate over all documents, with a nested loop over all terms, and find-and-replace.

for (s in 1:length(docs)){                # loop over every document
  for (i in 1:nrow(termslist)){           # loop over every synonym row
    # replace the synonym (column 2) with its canonical term (column 1)
    docs[[s]]$content <- gsub(termslist[i, 2], termslist[i, 1], docs[[s]]$content)
  }
  print(s)                                # crude progress indicator
}

      

This currently takes about a second per document (with roughly 1,000 rows in the term list), which means about 10,900 seconds, or roughly three hours!

Is there a more optimized way to do this in the tm package, or inside R in general?

UPDATE:

Following the first answer below, I rebuilt the table so that the unique terms are the rows and the second column holds their synonyms separated by '|', then just looped over that. It now takes significantly less time than before.

The (messy) code for creating the new table:

newtermslist <- list()
authname <- unique(termslist[, 1])              # one row per unique term
newtermslist <- cbind(newtermslist, authname)
syns <- list()
for (i in seq(authname)) {
  # collapse all synonyms of this term into one regex alternation: (syn1|syn2|...)
  syns <- rbind(syns,
                paste0('(',
                       paste(termslist[which(termslist[, 1] == authname[i]), 2], collapse = '|'),
                       ')'))
}
newtermslist <- cbind(newtermslist, syns)
# flatten the list columns into a plain two-column character matrix
newtermslist <- cbind(unlist(newtermslist[, 1]), unlist(newtermslist[, 2]))
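
For what it's worth, the same two-column structure can be built more directly in base R (a sketch, assuming as above that column 1 of termslist holds the canonical term and column 2 a synonym):

terms <- unique(termslist[, 1])
# one alternation regex per term, e.g. "(synonym1|synonym2|synonym3)"
regexes <- vapply(terms, function(x) {
  paste0('(', paste(termslist[termslist[, 1] == x, 2], collapse = '|'), ')')
}, character(1))
newtermslist <- cbind(terms, regexes)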

      



2 answers


I think if you want to do many replacements, this may be the only way to do it (that is, sequentially saving the result and using it as the input for the next replacement).

However, you can gain some speed (you will have to benchmark to compare):



  • use fixed=T (since your synonyms are not regexes but literal spellings), and possibly useBytes=T (see ?gsub - if you have a multibyte locale this may or may not be a good idea). Or
  • compress your list of terms: if blue has the synonyms cerulean, cobalt and sky, then your regular expression can be (cerulean|cobalt|sky), substituted with blue, so that all synonyms for blue are replaced in one iteration rather than three separate ones. To do this, preprocess your term list - for example newtermslist <- ddply(terms, .(term), summarize, regex=paste0('(', paste(synonym, collapse='|'), ')')) - and then run your current loop. In that case keep fixed=F (the default, i.e. using regexes). A sketch of this is given after the list.
  • see also ?tm_map and ?content_transformer. I'm not sure if these will speed things up, but you can try.
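
A minimal sketch of that second point, assuming the term list is a data frame called terms with columns term and synonym (these names are illustrative):

library(plyr)

# collapse each term's synonyms into a single alternation regex
lookup <- ddply(terms, .(term), summarize,
                regex = paste0('(', paste(synonym, collapse = '|'), ')'))

# then one gsub per term instead of one per synonym
for (s in 1:length(docs)) {
  for (i in 1:nrow(lookup)) {
    docs[[s]]$content <- gsub(lookup$regex[i], lookup$term[i], docs[[s]]$content)
  }
}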

(Re benchmarking: try library(rbenchmark); benchmark(expression1, expression2, ...), or good ol' system.time for timing and Rprof for profiling.)
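
For instance, a rough comparison of the literal and collapsed-regex variants on a single document could look like this (just a sketch; termslist and newtermslist are the objects defined in the question and its update):

library(rbenchmark)

txt <- docs[[1]]$content  # time the replacements on one document's text
benchmark(
  literal = {  # one gsub per synonym, fixed = TRUE
    tmp <- txt
    for (i in 1:nrow(termslist)) tmp <- gsub(termslist[i, 2], termslist[i, 1], tmp, fixed = TRUE)
  },
  regex = {    # one gsub per term, synonyms collapsed into a regex
    tmp <- txt
    for (i in 1:nrow(newtermslist)) tmp <- gsub(newtermslist[i, 2], newtermslist[i, 1], tmp)
  },
  replications = 10
)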



Answering my own question here with a parallel solution that runs the replacement loop in parallel. It should run faster, but I haven't compared the two solutions yet.



library(doParallel)
library(foreach)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
system.time({  # print how long the whole expression takes to evaluate
  # %dopar% workers operate on copies of `docs`, so return the rewritten
  # text from the loop and assign it back on the master afterwards
  newcontent <- foreach(s = 1:length(docs)) %dopar% {
    txt <- docs[[s]]$content
    for (i in 1:nrow(newtermslist)) {
      txt <- gsub(newtermslist[i, 2], newtermslist[i, 1], txt)
    }
    txt
  }
  for (s in 1:length(docs)) docs[[s]]$content <- newcontent[[s]]
})
stopCluster(cl)

      







