Finding the best way to replace a list of terms across a long list of documents
Using the tm package, I have a corpus of 10,900 docs.
docs = Corpus(VectorSource(abstracts$abstract))
I also have a list of terms together with all their synonyms and spelling variants. I use it to convert each synonym or spelling variant into one canonical term.
Term, Synonyms
term1, synonym1
term1, synonym2
term1, synonym3
term2, synonym1
... etc
The way I am doing it right now is to iterate over all documents, with a nested loop over all terms, and find and replace.
for (s in 1:length(docs)) {
  for (i in 1:nrow(termslist)) {
    docs[[s]]$content <- gsub(termslist[i, 2], termslist[i, 1], docs[[s]]$content)
  }
  print(s)
}
This currently takes about a second per document (with roughly 1,000 rows in the term list), which means about 10,900 seconds in total, roughly 3 hours!
Is there a more optimized way to do this in the tm package, or inside R in general?
UPDATE:
Following the answer below, I recreated the table with the unique terms as rows and a second column holding each term's synonyms separated by '|', then ran the same loop over it (a sketch of that loop follows the table-creation code below). It now takes significantly less time than before. The (messy) code for creating the new table:
# Build a two-column table: one row per unique term, with all of that term's
# synonyms collapsed into a single "(syn1|syn2|...)" alternation regex.
newtermslist <- list()
authname <- unique(termslist[, 1])
newtermslist <- cbind(newtermslist, authname)
syns <- list()
for (i in seq(authname)) {
  syns <- rbind(syns,
                paste0('(',
                       paste(termslist[which(termslist[, 1] == authname[i]), 2], collapse = '|'),
                       ')'))
}
newtermslist <- cbind(newtermslist, syns)
newtermslist <- cbind(unlist(newtermslist[, 1]), unlist(newtermslist[, 2]))
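A minimal sketch of the replacement loop over this compressed table, assuming (as built above) that column 1 holds the canonical term and column 2 its synonym regex; it is the same nested loop as before, but now only one gsub per unique term:

for (s in 1:length(docs)) {
  for (i in 1:nrow(newtermslist)) {
    # newtermslist[i, 2] is a "(syn1|syn2|...)" pattern, so one pass replaces
    # every synonym of that term at once
    docs[[s]]$content <- gsub(newtermslist[i, 2], newtermslist[i, 1], docs[[s]]$content)
  }
}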
I think if you want to do this many replacements, this might be the only way to do it (that is, sequentially, storing the result of each replacement as the input for the next one).
However, you can gain some speed (you will have to benchmark to compare):
- use fixed=T (since your synonyms are literal spellings, not regexes) and useBytes=T (see ?gsub; if you have a multibyte locale this may or may not be a good idea). A short sketch of this appears after the list. Or
- compress your list of terms: if blue has the synonyms cerulean, cobalt and sky, then your regex can be (cerulean|cobalt|sky) with replacement blue, so that all synonyms for blue are replaced in one iteration rather than 3 separate ones. To do this, preprocess your term list, for example newtermslist <- ddply(terms, .(term), summarize, regex=paste0('(', paste(synonym, collapse='|'), ')')), and then run your current loop over it. Here you will use fixed=F (the default, i.e. regex matching).
- see also ?tm_map and ?content_transformer. I'm not sure whether these will speed things up, but you can try.
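A minimal sketch of the first option, applied to the loop from the question; fixed and useBytes are documented arguments of gsub, and nothing else here goes beyond what the question already shows:

# Literal (non-regex) replacement: still one synonym per pass, but each gsub
# call skips regex compilation; useBytes = TRUE may help further (see ?gsub
# for caveats with multibyte locales).
for (s in 1:length(docs)) {
  for (i in 1:nrow(termslist)) {
    docs[[s]]$content <- gsub(termslist[i, 2], termslist[i, 1], docs[[s]]$content,
                              fixed = TRUE, useBytes = TRUE)
  }
}

The second option (compressing the term list) is what the update in the question ends up doing.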
(Regarding benchmarking: try library(rbenchmark); benchmark(expression1, expression2, ...), or good ol' system.time for timing and Rprof for profiling; a small example follows.)
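For instance, a sketch of how the two variants might be compared on a small sample before committing to the full 10,900-document run; the 50-document sample and the use of the first row of each table are assumptions for illustration, not from the question:

library(rbenchmark)

# Pull the raw text of a small sample of documents to benchmark against.
sample_texts <- sapply(docs[1:50], function(d) d$content)

# Compare one literal pass against one regex (compressed-synonym) pass.
benchmark(
  literal = gsub(termslist[1, 2], termslist[1, 1], sample_texts, fixed = TRUE),
  regex   = gsub(newtermslist[1, 2], newtermslist[1, 1], sample_texts),
  replications = 100
)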
Here I am answering my own question, after coming across a parallel solution that runs the replacements in parallel. It should run faster, but I haven't compared the two solutions yet.
library(doParallel)
library(foreach)
cl<-makeCluster(detectCores())
registerDoParallel(cl)
system.time({  # print how long the expression takes to evaluate
  # Results are returned from the workers and written back afterwards;
  # assignments made inside %dopar% would be lost on the master process.
  new_content <- foreach(s = 1:length(docs)) %dopar% {
    txt <- docs[[s]]$content
    for (i in 1:nrow(newtermslist)) txt <- gsub(newtermslist[i, 2], newtermslist[i, 1], txt)
    txt
  }
  for (s in 1:length(docs)) docs[[s]]$content <- new_content[[s]]
})
stopCluster(cl)