Regex that only matches text that is not part of the HTML markup? (Python)

Question

How can I write a regex that matches text only when it is not inside an HTML tag?

Here's my attempt below. Does anyone have a better / different approach?

import re

inputstr = 'mary had a <b class="foo"> little loomb</b>'

rx = re.compile('[aob]')
repl = 'x'

outputstr = ''
i = 0

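# Splitting on a capturing group keeps the tags in the result, so the
# pieces alternate: text, tag, text, tag, ...
for astr in re.compile(r'(<[^>]*>)').split(inputstr):
    i = 1 - i  # i is 1 for text pieces and 0 for tag pieces

    if i:
        astr = rx.sub(repl, astr)

    outputstr += astr

print(outputstr)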

Output:

mxry hxd x <b class="foo"> little lxxmx</b>

Notes:

The <[^>]*> pattern for matching HTML tags is clearly flawed: I wrote it quickly and did not account for angle brackets inside quoted attributes (e.g. '<img alt="next>" />'). It does not account for <script> or <style> tags or comments, either.
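To make the first limitation concrete, here is a small sketch comparing the pattern from the question with a quote-aware variant. The name quoted_aware and its pattern are illustrative assumptions, not part of the original post, and the variant still does nothing about <script>, <style>, or comments.

```python
import re

# Pattern from the question: the tag ends at the first '>' it sees.
naive = re.compile(r'(<[^>]*>)')

# Hypothetical variant: quoted attribute values are consumed whole,
# so a '>' inside quotes does not end the tag.
quoted_aware = re.compile(r'''(<(?:"[^"]*"|'[^']*'|[^>"'])*>)''')

sample = 'go <img alt="next>" /> on'

naive_parts = naive.split(sample)
aware_parts = quoted_aware.split(sample)

print(naive_parts)  # ['go ', '<img alt="next>', '" /> on']
print(aware_parts)  # ['go ', '<img alt="next>" />', ' on']
```

With the naive pattern, the tail of the tag ('" />') ends up in a text piece and would be exposed to the substitution.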

python regex

ʞɔıu · Dec 30 at 22:33

1 answer

Tamas Czinege · Answer 1 · 2008-12-30T22:56:26+0000

Since you are using Python anyway, I would take a look at Beautiful Soup, a Python HTML/XML parser. Writing your own parser involves so many special cases and headaches that it just isn't worth the effort: your regex will become unmanageably large and still won't give correct results in all cases.

Just use Beautiful Soup.
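As a rough illustration of the parser-based approach, here is a minimal sketch. Beautiful Soup is a third-party package, so this sketch swaps in the standard library's html.parser module (Python 3) to stay dependency-free. The names TextOnlySubber and sub_outside_markup are hypothetical helpers made up for this example. The idea is to let the parser decide what is markup, re-emit tags unchanged, and apply the regex only to text nodes, leaving <script> and <style> contents alone.

```python
import re
from html.parser import HTMLParser


class TextOnlySubber(HTMLParser):
    """Re-emit the input, applying a regex substitution to text nodes only."""

    def __init__(self, pattern, repl):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.pattern = pattern
        self.repl = repl
        self.out = []
        self.raw = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.raw = tag in ('script', 'style')
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.raw = False
        self.out.append('</%s>' % tag)  # note: tag name is lowercased

    def handle_data(self, data):
        self.out.append(data if self.raw else self.pattern.sub(self.repl, data))

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)

    def handle_pi(self, data):
        self.out.append('<?%s>' % data)


def sub_outside_markup(pattern, repl, html):
    parser = TextOnlySubber(re.compile(pattern), repl)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


print(sub_outside_markup('[aob]', 'x', 'mary had a <b class="foo"> little loomb</b>'))
# mxry hxd x <b class="foo"> little lxxmx</b>
```

This is a sketch rather than a faithful serializer: end tags come back lowercased, and unusual or malformed markup may not round-trip exactly, which is where a dedicated library such as Beautiful Soup is likely to do better.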
