Finding the best way to replace a list of terms across a long list of documents
Using the tm package, I have a corpus of 10,900 docs.
docs = Corpus(VectorSource(abstracts$abstract))
I also have a list of terms together with all their synonyms and spelling variants. I use it to convert each synonym or spelling variant into one canonical term.
Term, Synonyms
term1, synonym1
term1, synonym2
term1, synonym3
term2, synonym1
... etc
The way I am doing it right now is to iterate over all documents, with a nested loop over all terms, and find and replace.
for (s in 1:length(docs)) {
  for (i in 1:nrow(termslist)) {
    docs[[s]]$content <- gsub(termslist[i, 2], termslist[i, 1], docs[[s]]$content)
  }
  print(s)
}
This currently takes about a second per document (with roughly 1,000 rows in the term list), which means about 10,900 seconds in total, roughly 3 hours!
Is there a more optimized way to do this in the tm package, or inside R in general?
UPDATE:
Following the answer below, I recreated the table with the unique terms as rows and a second column holding each term's synonyms separated by '|', then ran the same loop over it (a sketch of that loop follows the table-creation code below). It now takes significantly less time than before. The (messy) code for creating the new table:
# Build a two-column table: one row per unique term, with all of that term's
# synonyms collapsed into a single "(syn1|syn2|...)" alternation regex.
newtermslist <- list()
authname <- unique(termslist[, 1])
newtermslist <- cbind(newtermslist, authname)
syns <- list()
for (i in seq(authname)) {
  syns <- rbind(syns,
                paste0('(',
                       paste(termslist[which(termslist[, 1] == authname[i]), 2], collapse = '|'),
                       ')'))
}
newtermslist <- cbind(newtermslist, syns)
newtermslist <- cbind(unlist(newtermslist[, 1]), unlist(newtermslist[, 2]))
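A minimal sketch of the replacement loop over this compressed table, assuming (as built above) that column 1 holds the canonical term and column 2 its synonym regex; it is the same nested loop as before, but now only one gsub per unique term:

for (s in 1:length(docs)) {
  for (i in 1:nrow(newtermslist)) {
    # newtermslist[i, 2] is a "(syn1|syn2|...)" pattern, so one pass replaces
    # every synonym of that term at once
    docs[[s]]$content <- gsub(newtermslist[i, 2], newtermslist[i, 1], docs[[s]]$content)
  }
}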
I think if you want to do this many replacements, this might be the only way to do it (that is, sequentially, storing the result of each replacement as the input for the next one).
However, you can gain some speed (you will have to benchmark to compare):
- use fixed=T (since your synonyms are literal spellings, not regexes) and useBytes=T (see ?gsub; if you have a multibyte locale this may or may not be a good idea). A short sketch of this appears after the list. Or
- compress your list of terms: if blue has the synonyms cerulean, cobalt and sky, then your regex can be (cerulean|cobalt|sky) with replacement blue, so that all synonyms for blue are replaced in one iteration rather than 3 separate ones. To do this, preprocess your term list, for example newtermslist <- ddply(terms, .(term), summarize, regex=paste0('(', paste(synonym, collapse='|'), ')')), and then run your current loop over it. Here you will use fixed=F (the default, i.e. regex matching).
- see also ?tm_map and ?content_transformer. I'm not sure whether these will speed things up, but you can try.
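A minimal sketch of the first option, applied to the loop from the question; fixed and useBytes are documented arguments of gsub, and nothing else here goes beyond what the question already shows:

# Literal (non-regex) replacement: still one synonym per pass, but each gsub
# call skips regex compilation; useBytes = TRUE may help further (see ?gsub
# for caveats with multibyte locales).
for (s in 1:length(docs)) {
  for (i in 1:nrow(termslist)) {
    docs[[s]]$content <- gsub(termslist[i, 2], termslist[i, 1], docs[[s]]$content,
                              fixed = TRUE, useBytes = TRUE)
  }
}

The second option (compressing the term list) is what the update in the question ends up doing.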
(Regarding benchmarking: try library(rbenchmark); benchmark(expression1, expression2, ...), or good ol' system.time for timing and Rprof for profiling; a small example follows.)
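For instance, a sketch of how the two variants might be compared on a small sample before committing to the full 10,900-document run; the 50-document sample and the use of the first row of each table are assumptions for illustration, not from the question:

library(rbenchmark)

# Pull the raw text of a small sample of documents to benchmark against.
sample_texts <- sapply(docs[1:50], function(d) d$content)

# Compare one literal pass against one regex (compressed-synonym) pass.
benchmark(
  literal = gsub(termslist[1, 2], termslist[1, 1], sample_texts, fixed = TRUE),
  regex   = gsub(newtermslist[1, 2], newtermslist[1, 1], sample_texts),
  replications = 100
)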
Here I am answering my own question, after coming across a parallel solution that runs the replacements in parallel. It should run faster, but I haven't compared the two solutions yet.
library(doParallel)
library(foreach)
cl<-makeCluster(detectCores())
registerDoParallel(cl)
system.time({  # print how long the expression takes to evaluate
  # Results are returned from the workers and written back afterwards;
  # assignments made inside %dopar% would be lost on the master process.
  new_content <- foreach(s = 1:length(docs)) %dopar% {
    txt <- docs[[s]]$content
    for (i in 1:nrow(newtermslist)) txt <- gsub(newtermslist[i, 2], newtermslist[i, 1], txt)
    txt
  }
  for (s in 1:length(docs)) docs[[s]]$content <- new_content[[s]]
})
stopCluster(cl)