Python extract words from a string based on a large list of words

Question

First, I have a large list of words:

words = ['about', 'black', 'red', ...]  # 20000+ words in total

Then if given a string like:

s = 'blackingabouthahah'

I want to receive ['black', 'about']

I tried using regex for this:

import re

pattern = re.compile('|'.join(words))
print(pattern.findall(s))

This works, but I'm worried about the speed and memory usage of this method.

Is there a better solution?

python regex

wong2 June 10, 2015 at 6:58

2 answers

Wiktor Stribiżew · Answer 1 · 2015-06-10T07:07:37+0000

You can take a non-regex approach, using str.find in a list comprehension:

words = ['about', 'black', 'red']
s = 'blackingabouthahah'
print([x for x in words if s.find(x) > -1])

See IDEONE demo
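
Note that the comprehension returns matches in the order of the words list, so the example gives ['about', 'black'] rather than the ['black', 'about'] asked for in the question. A minimal sketch that orders the matches by where they first appear in the string, assuming that is the order wanted (the name matches is only illustrative):

```python
words = ['about', 'black', 'red']
s = 'blackingabouthahah'

# Keep only the words that occur in s, then order them by the index
# of their first occurrence in s.
matches = sorted((w for w in words if w in s), key=s.find)
print(matches)  # ['black', 'about']
```

The "in" operator performs the same substring test as s.find(x) > -1.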

This returns each matching word once, however many times it occurs in the string. If you need to count all occurrences:

words = ['about', 'black', 'red']
s = 'blackingabouthahahabout'
print([s.count(x) for x in words])

This counts the first about and the second about alike, since I don't see a difference between them. See another demo.
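
The counting code gives one number per entry in words, including zeros, which can be hard to read for a long list. A small sketch, assuming a word-to-count mapping limited to the words that actually occur is more convenient (the name counts is only illustrative):

```python
words = ['about', 'black', 'red']
s = 'blackingabouthahahabout'

# Map each word that occurs at least once to its number of
# non-overlapping occurrences in s.
counts = {w: s.count(w) for w in words if w in s}
print(counts)  # {'about': 2, 'black': 1} on Python 3.7+
```

Keep in mind that str.count only counts non-overlapping occurrences.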

Shruti srivastava · Answer 2 · 2015-06-10T09:12:47+0000

If you just want to print the matches, here is a solution:

import re

words = ['about', 'black', 'red']
s = 'dsjhdgblackingabouthahah'

# Print every word that occurs somewhere in s.
for item in words:
    if re.search(item, s):
        print(item)
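
One caveat: re.search treats each word as a regular expression, so a word containing characters such as + or . is not matched literally and may even raise an error. A minimal sketch of one way around this, assuming the words are meant as plain text (the 'c++' entry and the sample string are made up for illustration):

```python
import re

# Hypothetical word list; 'c++' contains regex metacharacters.
words = ['about', 'black', 'red', 'c++']
s = 'blackc++about'

# re.escape makes every character of the word match literally.
found = [w for w in words if re.search(re.escape(w), s)]
print(found)  # ['about', 'black', 'c++']
```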

If you want results in a new list, you can try this:

import re

words = ['about', 'black', 'red']
s = 'dsjhdgblackingabouthahah'

# Collect every word that occurs somewhere in s.
mylist = []
for item in words:
    if re.search(item, s):
        mylist.append(item)

print(mylist)
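
With 20000+ words, the loops above scan the string once per word. One possible alternative, sketched here and not benchmarked, is to put the words in a set and look up substrings of s. The helper name find_words is hypothetical, and whether this is actually faster depends on the length of s and on how many distinct word lengths the list contains.

```python
def find_words(s, words):
    """Return the words that occur in s, ordered by first position in s."""
    word_set = set(words)  # average O(1) membership tests
    # Only try substring lengths that some word actually has.
    lengths = sorted({len(w) for w in word_set if w})
    found = []
    seen = set()
    for start in range(len(s)):
        for n in lengths:
            if start + n > len(s):
                break  # longer substrings would run past the end of s
            candidate = s[start:start + n]
            if candidate in word_set and candidate not in seen:
                seen.add(candidate)
                found.append(candidate)
    return found


words = ['about', 'black', 'red']
s = 'blackingabouthahah'
print(find_words(s, words))  # ['black', 'about']
```

Each word is reported once, in the order it first appears in the string, which matches the output requested in the question.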
