Regex that only matches text that is not part of the HTML markup? (Python)

How can I make a pattern match only when the text is not inside an HTML tag?

Here's my attempt below. Does anyone have a better / different approach?

import re

inputstr = 'mary had a <b class="foo"> little loomb</b>'

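rx = re.compile('[aob]')
repl = 'x'

outputstr = ''

# Splitting on a capturing group keeps the tags in the result, so the
# pieces alternate: text, tag, text, tag, ...
for i, astr in enumerate(re.compile(r'(<[^>]*>)').split(inputstr)):
    if i % 2 == 0:
        # Even positions are text outside any tag; substitute only there.
        astr = rx.sub(repl, astr)

    outputstr += astr

print(outputstr)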

      

output:

mxry hxd x <b class="foo"> little lxxmx</b>

      

Notes:

  • The <[^>]*> pattern for matching HTML tags is clearly wrong. I wrote it quickly and did not consider the possibility of angle brackets in quoted attributes (e.g. '<img alt="next>" />'). It also does not handle <script> or <style> elements or comments.
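
To make the quoted-attribute problem concrete, here is a minimal sketch that wraps the same split-and-substitute idea in a helper and runs it on a tag containing '>' inside an attribute value. The helper name replace_outside_tags and the added src attribute are made up for illustration.

```python
import re

TAG = re.compile(r'(<[^>]*>)')  # the same naive tag pattern as above


def replace_outside_tags(text, pattern, repl):
    """Substitute only in the pieces that the naive pattern treats as text."""
    pieces = TAG.split(text)
    # Even positions are "text", odd positions are "tags".
    pieces[::2] = [re.sub(pattern, repl, piece) for piece in pieces[::2]]
    return ''.join(pieces)


# Hypothetical input: the '>' inside the alt value ends the "tag" too early.
html = '<img alt="next>" src="boat.png" />'
print(TAG.split(html))
print(replace_outside_tags(html, '[aob]', 'x'))
```

The pattern stops at the first '>', so the rest of the tag is treated as ordinary text and the src value gets rewritten even though it is markup.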


1 answer


Since you are using Python anyway, I would take a look at Beautiful Soup, a Python HTML/XML parser. Writing your own parser involves so many special cases and headaches that it just isn't worth the effort: your regex will become unmanageably large and still won't give correct results in all cases.



Just use Beautiful Soup.
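
For example, something along these lines might do it. This is only a sketch: it assumes the beautifulsoup4 package is installed and uses the standard library's "html.parser" backend, and the helper name replace_outside_tags is made up for illustration. It walks the text nodes of the parsed document and applies the substitution to those only, skipping comments and the contents of <script> and <style> elements.

```python
import re

from bs4 import BeautifulSoup, Comment  # third-party: pip install beautifulsoup4


def replace_outside_tags(html, pattern, repl):
    """Apply the substitution to text nodes only, leaving markup alone."""
    soup = BeautifulSoup(html, "html.parser")
    rx = re.compile(pattern)
    # find_all returns a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=True):
        # Skip comments and anything inside <script> or <style>.
        if isinstance(node, Comment) or node.parent.name in ("script", "style"):
            continue
        node.replace_with(rx.sub(repl, str(node)))
    return str(soup)


print(replace_outside_tags('mary had a <b class="foo"> little loomb</b>', '[aob]', 'x'))
```

On the sample string from the question this should give the same result as the regex version, but the parser decides where tags begin and end, so a '>' inside a quoted attribute is not mistaken for the end of a tag. Note that the parser may normalise the markup slightly when it is written back out (for instance, how void tags or escaped characters are rendered).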
