Python: using re.sub to replace multiple substrings multiple times
I am trying to fix text that has some very common scanning errors (l mistaken for I and vice versa). Basically, I would like the replacement string in re.sub to
depend on the number of times "I" is encountered, something like this:
re.sub("(\w+)(I+)(\w*)", "\g<1>l+\g<3>", "I am stiII here.")
What's the best way to achieve this?
Pass a function as the replacement argument, as described in the documentation for re.sub. Your function can identify the error and build the best replacement based on that.
import re

def replacement(match):
    if "I" in match.group(2):
        return match.group(1) + "l" * len(match.group(2)) + match.group(3)
    # Add additional cases here and as ORs in your regex
    return match.group(0)  # no case applied: leave the text unchanged

print(re.sub(r"(\w+)(II+)(\w*)", replacement, "I am stiII here."))
# prints: I am still here.
(Note that I changed your regex so that the repeated Is appear in the same group.)
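To illustrate the "additional cases as ORs" comment above, here is a minimal sketch with two alternatives in one pattern. The names OCR_PATTERN, fix_ocr and correct, the named groups, and the second case (a lowercase l standing alone as a word) are my own assumptions for illustration, not part of the original answer. It does not try to protect acronyms or Roman numerals such as "XIII".

```python
import re

# Hypothetical pattern with one alternative per kind of scanning error:
#   "run":  two or more capital I's right after a word character ("stiII")
#   "lone": a lowercase l standing alone as a word ("l am")
OCR_PATTERN = re.compile(r"(?<=\w)(?P<run>II+)|\b(?P<lone>l)\b")

def fix_ocr(match):
    # Only one of the named groups is set for any given match.
    if match.group("run") is not None:
        return "l" * len(match.group("run"))
    if match.group("lone") is not None:
        return "I"
    return match.group(0)  # fallback: leave the text unchanged

def correct(text):
    return OCR_PATTERN.sub(fix_ocr, text)

print(correct("l am stiII here."))  # I am still here.
```

Using named groups keeps the callback readable as more cases are added, since each branch checks the group it cares about instead of counting parentheses.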
You can use lookarounds instead, to replace only an I that is followed or preceded by another I:
import re
print(re.sub(r"(?<=I)I|I(?=I)", "l", "I am stiII here."))
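A small sketch of how the lookaround version behaves on a few inputs. The helper name fix_doubled_i and the sample strings are mine, added only for illustration. Because the lookarounds only fire when two I's are adjacent, a single misread I (the last letter of "iIIegaI") is left alone, and an all-caps run such as "III" would be rewritten too.

```python
import re

# An I is replaced only if another I directly precedes or follows it.
# The lookarounds inspect the original string, so both I's in "II" match.
DOUBLED_I = re.compile(r"(?<=I)I|I(?=I)")

def fix_doubled_i(text):
    return DOUBLED_I.sub("l", text)

for sample in ["I am stiII here.", "I wiII teII", "iIIegaI"]:
    print(sample, "->", fix_doubled_i(sample))
# I am stiII here. -> I am still here.
# I wiII teII -> I will tell
# iIIegaI -> illegaI   (the lone trailing I is not caught)
```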
Based on the answer suggested by DNS, I built something more sophisticated to catch all cases (or at least most of them), being careful not to add too many bugs:
import re

def Irepl(matchobj):
    # Catch acronyms
    if matchobj.group(0).isupper():
        return matchobj.group(0)
    else:
        # Replace group 2 with 'l's
        return matchobj.group(1) + 'l' * len(matchobj.group(2)) + matchobj.group(3)

# Impossible to know if first letter is correct or not (possibly a name)
I_FOR_l_PATTERN = r"([a-zA-HJ-Z]+?)(I+)(\w*)"

# lines: the text lines to correct (defined elsewhere)
for line in lines:
    tmp_line = line.replace("l'", "I'").replace("'I", "'l").replace(" l ", " I ")
    tmp_line = re.sub(r"^l ", "I ", tmp_line)
    cor_line = re.sub(I_FOR_l_PATTERN, Irepl, tmp_line)
    # Loop to catch all errors in a word (iIIegaI for example)
    while cor_line != tmp_line:
        tmp_line = cor_line
        cor_line = re.sub(I_FOR_l_PATTERN, Irepl, tmp_line)
Hope this helps someone else!
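For anyone who wants to try it quickly, here is the same logic wrapped in a function, as a sketch rather than a tested library. The function name correct_line and the sample lines are hypothetical additions of mine. The pattern and the Irepl callback follow the code above.

```python
import re

# First group excludes capital I, so the leading letter is never rewritten.
I_FOR_l_PATTERN = re.compile(r"([a-zA-HJ-Z]+?)(I+)(\w*)")

def Irepl(matchobj):
    # Leave all-uppercase matches (likely acronyms) untouched.
    if matchobj.group(0).isupper():
        return matchobj.group(0)
    return matchobj.group(1) + "l" * len(matchobj.group(2)) + matchobj.group(3)

def correct_line(line):
    # Simple l -> I fixes first (pronoun "I", contractions such as "I'm").
    line = line.replace("l'", "I'").replace("'I", "'l").replace(" l ", " I ")
    line = re.sub(r"^l ", "I ", line)
    # Repeat until nothing changes, so several errors in one word are caught.
    while True:
        corrected = I_FOR_l_PATTERN.sub(Irepl, line)
        if corrected == line:
            return corrected
        line = corrected

# Hypothetical sample input
lines = ["l am stiII here.", "That is iIIegaI.", "NASA and the FBI agreed."]
for line in lines:
    print(correct_line(line))
# I am still here.
# That is illegal.
# NASA and the FBI agreed.
```

The loop stops as soon as a pass leaves the line unchanged, which also covers the acronym case where Irepl returns the match as-is.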