Regex that only matches text that is not part of the HTML markup? (Python)
How can I write a pattern that matches only when the text is not inside an HTML tag?
Here's my attempt below. Does anyone have a better / different approach?
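import re

inputstr = 'mary had a <b class="foo"> little loomb</b>'
rx = re.compile('[aob]')
repl = 'x'
outputstr = ''
i = 0
# Splitting on a capturing group keeps the tags in the result, so the
# pieces alternate: text, tag, text, tag, ...
for astr in re.compile(r'(<[^>]*>)').split(inputstr):
    i = 1 - i
    if i:
        # Only the text pieces (odd-numbered iterations) are substituted.
        astr = rx.sub(repl, astr)
    outputstr += astr
print(outputstr)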
output:
mxry hxd x <b class="foo"> little lxxmx</b>
Notes:
- The <[^>]*> pattern for matching HTML tags is clearly wrong: I wrote it quickly and did not consider angle brackets inside quoted attributes (e.g. '<img alt="next>" />'). It also makes no special provision for <script> or <style> elements (whose contents would be treated as ordinary text) or for comments.
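For what it's worth, the quoted-attribute case from the note can be patched within the regex approach. The following is only a rough sketch, not a robust solution: TAG_RX and replace_outside_tags are names made up for this illustration, and <script> and <style> contents are still treated as ordinary text.

```python
import re

# Hypothetical improved tag pattern (an assumption, not a complete HTML lexer):
#   - a whole comment, or
#   - '<', then any run of quoted strings or characters other than
#     quotes and '>', then the closing '>'.
TAG_RX = re.compile(r'''(<!--.*?-->|<(?:"[^"]*"|'[^']*'|[^'">])*>)''', re.DOTALL)

def replace_outside_tags(pattern, repl, html):
    """Apply re.sub(pattern, repl, ...) only to the text between tags."""
    parts = TAG_RX.split(html)
    # With a single capturing group, re.split returns text at even
    # indexes and the captured tags/comments at odd indexes.
    parts[::2] = [re.sub(pattern, repl, text) for text in parts[::2]]
    return ''.join(parts)

print(replace_outside_tags('[aob]', 'x', 'a <img alt="next>" /> boat'))
# prints: x <img alt="next>" /> xxxt
```

Slicing with parts[::2] replaces the 1 - i toggle from the original loop; the behaviour on the sample input is the same.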
1 answer
Since you are using Python anyway, if I were you I would take a look at Beautiful Soup, which is a Python HTML/XML parser. There are so many special cases and headaches in writing your own parser that it just isn't worth the effort. Your regex will become unmanageably large and still won't give correct results in all cases.
Just use Beautiful Soup.
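To illustrate the parser-based idea without requiring an install, here is a minimal sketch that uses the standard library's html.parser module instead of Beautiful Soup (a deliberate substitution, written for Python 3). The class TextOnlySubstituter and the function sub_outside_markup are hypothetical names for this example. The parser reports text separately from tags, so the substitution is applied only in handle_data.

```python
import re
from html.parser import HTMLParser

class TextOnlySubstituter(HTMLParser):
    """Re-emit the document, substituting only in text nodes (sketch)."""

    def __init__(self, pattern, repl):
        # convert_charrefs=False routes entities such as &amp; to their
        # own handlers instead of decoding them into handle_data.
        super().__init__(convert_charrefs=False)
        self.rx = re.compile(pattern)
        self.repl = repl
        self.out = []
        self.in_raw = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text
        if tag in ('script', 'style'):
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <img ... />

    def handle_endtag(self, tag):
        # The end tag is rebuilt, so its original case/spacing is lost.
        self.out.append('</%s>' % tag)
        if tag in ('script', 'style'):
            self.in_raw = False

    def handle_data(self, data):
        self.out.append(data if self.in_raw else self.rx.sub(self.repl, data))

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)

def sub_outside_markup(pattern, repl, html):
    parser = TextOnlySubstituter(pattern, repl)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

print(sub_outside_markup('[aob]', 'x', 'mary had a <b class="foo"> little loomb</b>'))
# prints: mxry hxd x <b class="foo"> little lxxmx</b>
```

With Beautiful Soup the same shape should apply: walk the text nodes and replace each one, leaving tags and attributes untouched.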