Problem with joining a list of multiple lines to a list of one line in Python
I am trying to write a Python program that checks whether the phrases listed in one file occur in a document. My program works great until it hits a phrase like "happy (+) feet". I think the error is related to the "(+)" in the phrase; however, I am not sure how to revise my regex to get it to work.
This is my code:
import re

handle = open('document.txt', 'r')
text = handle.read()

lst = list()
with open('phrases.txt', 'r') as phrases:
    for phrase in phrases:
        phrase = phrase.rstrip()
        if len(phrase) > 0 and phrase not in lst:
            lst.append(phrase)

counts = {}
for each_phrase in lst:
    word = each_phrase.rsplit()
    pattern = re.compile(r'%s' % '\s+'.join(word), re.IGNORECASE)
    counts[each_phrase] = len(pattern.findall(text))

for key, value in counts.items():
    if value > 0:
        print key, ',', value

handle.close()
You need to use re.escape when declaring word:
word = map(re.escape, each_phrase.rsplit())
And maybe change \s+ to \s* to make the space optional:
pattern = re.compile(r'%s' % '\s*'.join(word), re.IGNORECASE)
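To see what the choice between \s+ and \s* changes, here is a minimal sketch. The helper build_pattern and the sample string are made up for illustration and are not part of the original code. Note that \s* also matches the words run together with no space at all, which may or may not be what you want when counting.

```python
import re

def build_pattern(phrase, separator):
    # Hypothetical helper: escape each word so characters such as ( ) +
    # are matched literally, then join the words with the separator.
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(separator.join(words), re.IGNORECASE)

strict = build_pattern("happy (+) feet", r"\s+")  # at least one whitespace character
loose = build_pattern("happy (+) feet", r"\s*")   # whitespace is optional

# Made-up sample text with three spacing variants of the phrase.
sample = "Happy (+) feet, happy(+)feet and happy  (+)\nfeet"

strict_count = len(strict.findall(sample))  # 2: the variant without spaces is skipped
loose_count = len(loose.findall(sample))    # 3: all variants are counted
```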
The parentheses ( and ), the plus sign +, and the other regex special characters must be escaped when they appear outside a character class in a regular expression, so that they match literally.
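As a rough illustration of why the unescaped phrase fails, the sketch below compiles it both ways. The variable names and the sample text are assumptions chosen for the demo. Unescaped, "(+)" is read as a group whose content is a quantifier with nothing to repeat, so re.compile raises an error; escaped, or placed inside character classes, the same characters match literally.

```python
import re

phrase = "happy (+) feet"
text = "Those happy (+) feet keep dancing."  # made-up sample text

# Unescaped: "(+)" is parsed as a group holding a bare quantifier,
# so compiling the pattern raises re.error.
try:
    re.compile(r"\s+".join(phrase.split()))
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaped: every special character in each word is matched literally.
escaped = r"\s+".join(re.escape(w) for w in phrase.split())
matches = re.findall(escaped, text, re.IGNORECASE)

# Inside a character class, ( + ) need no backslash.
in_class = re.findall(r"[(][+][)]", text)
```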