All possible endings of a text word (biomedical) phrase

Question

I am familiar with word stemming and stem completion from the tm package in R.

I am trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus). For example, I would like to get "leukocytes" and "leukocytic" if my input is "leukocyte".

If I had to do this right now, I would probably just go with something like:

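library(tm)     # provides the example corpus "crude"
library(RWeka)  # provides LovinsStemmer()

data("crude")   # load the example corpus shipped with tm
# Vocabulary: every distinct word in the corpus
dictionary <- unique(unlist(lapply(crude, words)))
# Keep the words that contain the Lovins stem of the query
grep(pattern = LovinsStemmer("company"), x = dictionary,
     ignore.case = TRUE, value = TRUE)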

I used the Lovins stemmer because the Snowball Porter stemmer doesn't seem to be aggressive enough.

I am open to suggestions for other stemmers, scripting languages (Python?), or completely different approaches.

python r nlp bioinformatics text-mining

Mark Miller, 23 Jul 2015 at 19:30

1 answer

BioGeek · Accepted Answer · 2017-04-04T12:21:23+0000

This solution requires pre-processing your corpus. But once that is done, finding the variants of a word is a very fast dictionary lookup.

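from collections import defaultdict
from stemming.porter2 import stem

# Read the word list, one word per line
with open('/usr/share/dict/words') as f:
    words = f.read().splitlines()

# Pre-processing: map each stem to every word that reduces to it
stems = defaultdict(list)

for word in words:
    word_stem = stem(word)
    stems[word_stem].append(word)

if __name__ == '__main__':
    # Lookup: stem the query, then fetch all words sharing that stem
    word = 'leukocyte'
    word_stem = stem(word)
    print(stems[word_stem])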

For the /usr/share/dict/words corpus, this gives the result:

['leukocyte', "leukocyte's", 'leukocytes']

The code uses the stemming module, which can be installed with:

pip install stemming
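
To apply the same idea to your own corpus rather than the system word list, the index can be built from raw text. The following is only a minimal sketch under stated assumptions: `toy_stem`, `build_stem_index`, `variants_by_prefix` and the sample sentence are hypothetical names and data introduced for illustration, and `toy_stem` is a crude placeholder so the example runs without extra packages (it is not Porter2; in practice you would pass `stem` from `stemming.porter2` as `stem_func`). The prefix search is a looser alternative, similar in spirit to the grep call in the question, for cases where the stemmer is not aggressive enough.

```python
import re
from collections import defaultdict


def toy_stem(word):
    # Placeholder only (NOT Porter2): strip one common suffix,
    # but always keep at least 4 letters of the word.
    for suffix in ("es", "ic", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word


def build_stem_index(text, stem_func=toy_stem):
    # Map each stem to the sorted, de-duplicated word forms found in the text.
    index = defaultdict(set)
    for token in re.findall(r"[a-z]+", text.lower()):
        index[stem_func(token)].add(token)
    return {stem: sorted(forms) for stem, forms in index.items()}


def variants_by_prefix(stem, vocabulary):
    # Looser match: every word that merely starts with the stem.
    return sorted({w for w in vocabulary if w.startswith(stem)})


# Hypothetical sample text standing in for a real corpus
sample = ("Leukocytes migrate to the tissue; a leukocyte count, "
          "leukocytic infiltration and leukocytosis were noted.")

index = build_stem_index(sample)
query_stem = toy_stem("leukocyte")

# Exact-stem lookup, as in the answer above
same_stem = index.get(query_stem, [])

# Prefix lookup over the whole vocabulary
vocabulary = [w for forms in index.values() for w in forms]
by_prefix = variants_by_prefix(query_stem, vocabulary)

print(same_stem)
print(by_prefix)
```

With this sample text, the exact-stem lookup returns the three forms that the placeholder stemmer reduces to the same stem, while the prefix lookup additionally picks up "leukocytosis". The prefix variant scans the whole vocabulary on each query, so it trades the constant-time lookup for broader recall.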
