Regex capable of matching anything other than a specific token

I'm trying to write a regex that matches anything except a specific token. I followed this answer ("Match everything except the specified strings"), but it didn't work for me at all.

Here's an example

text = '<a> whatever href="obviously_a_must_have" whatever <div> this div should be accepted </div> ... </a>'

regex = r'<a[^><]*href=\"[^\"]+\"(?!.*(</a>))*</a>' #(not working as intended)

[^><]* #- should accept any number of characters except < and >, meaning it shouldn't close the tag nor open a new one - *working*;
href=\"[^\"]+\" #- should match an href - *working*;
(?!.*(</a>))* #- should match anything but the end of the tag a - *NOT WORKING*.

      



1 answer


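The problem is that in

(?!.*(</a>))*

you have two errors.

  • The / must be escaped when the pattern is delimited by slashes, as regex101 does by default. Use \/ instead. In a Python string the escape is optional but harmless.

  • You cannot put * on a lookahead group: a lookahead consumes no characters, so there is nothing to repeat. Try it on regex101, and it will say "The preceding token is not quantifiable". I highly recommend this site for testing and understanding regular expressions.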
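Both points can be checked quickly in Python, the language used in the question. This is only a small sketch: it shows that the escaped and unescaped slash behave the same in a Python pattern, and that a negative lookahead on its own matches an empty span.

```python
import re

# In a Python string pattern the slash is an ordinary character,
# so both spellings match the same text.
escaped_ok = re.fullmatch(r"<\/a>", "</a>") is not None
plain_ok = re.fullmatch(r"</a>", "</a>") is not None

# A negative lookahead only tests what follows; it consumes nothing,
# so the match it produces is empty.
zero_width_span = re.match(r"(?!x)", "abc").span()
```

Since the lookahead never advances through the text, repeating it cannot make it "skip over" characters; some consuming token such as . has to be paired with it.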

Your first part doesn't match either: in your text the tag is closed by > right after <a, before the href, and [^><]* cannot match that >.

Try this to get started:

<a>[^><]*href=\"[^\"]+\".*(?:<\/a>) 
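As a sketch in Python, using the sample text from the question: the prefix of the original regex finds nothing, while the starter pattern above matches. The `extra_end` string is a made-up input, added here only to show how far the greedy .* reaches.

```python
import re

text = ('<a> whatever href="obviously_a_must_have" whatever '
        '<div> this div should be accepted </div> ... </a>')

# Prefix of the regex from the question: '>' sits between '<a' and 'href',
# and [^><]* refuses to cross it.
question_prefix = re.compile(r'<a[^><]*href=\"[^\"]+\"')
# Starter pattern from this answer (without the trailing space).
starter = re.compile(r'<a>[^><]*href=\"[^\"]+\".*(?:<\/a>)')

prefix_matches = question_prefix.search(text) is not None
starter_matches = starter.search(text) is not None

# Hypothetical input with a second closing tag appended.
extra_end = text + ' trailing </a>'
# The greedy .* runs on to the last </a>, so the whole string is matched.
greedy_match = starter.search(extra_end).group(0)
```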

      

This regex is much better: it will match your text. But it isn't finished yet, because it also matches text that contains extra closing tags. We don't want an extra </a> to appear anywhere before the real end. So add a negative lookbehind:

<a>[^><]*href=\"[^\"]+\"(?:(?<!<\/a>).)*(?:<\/a>)
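To see what the repeated group does, here is a minimal Python sketch. The short strings `s` and `sample` are invented for illustration. Each step of the group consumes one character, but only if that character is not directly preceded by </a>, so the repetition can run through the first closing tag and no further.

```python
import re

# One step of the repeated group: any character not directly preceded by "</a>".
step = re.compile(r'(?<!<\/a>).')

s = 'x</a>y'
# Start positions of the characters the step is allowed to consume.
allowed = [m.start() for m in step.finditer(s)]

# The full unanchored pattern stops at the first closing tag.
unanchored = re.compile(r'<a>[^><]*href=\"[^\"]+\"(?:(?<!<\/a>).)*(?:<\/a>)')
sample = '<a> href="x" one </a> two </a>'
first_end = unanchored.search(sample).group(0)
```

Only the final 'y' in `s` is refused, because it is the one character that comes right after a complete </a>.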

      



But as you can see here, it just matches up to the first closing tag and ignores the rest. We want to reject such text instead. Also, we don't want any additional start tags. So let's anchor the match to the start and end of the string.

^<a>[^><]*href=\"[^\"]+\"(?:(?<!<\/a>).)*(?:<\/a>)$
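A small Python check of the anchored pattern, assuming the sample text from the question plus a made-up variant with a second closing tag:

```python
import re

anchored = re.compile(
    r'^<a>[^><]*href=\"[^\"]+\"(?:(?<!<\/a>).)*(?:<\/a>)$'
)

text = ('<a> whatever href="obviously_a_must_have" whatever '
        '<div> this div should be accepted </div> ... </a>')
# Hypothetical input: the same text followed by another closing tag.
extra_end = text + ' trailing </a>'

accepts_sample = anchored.search(text) is not None
# The repetition cannot step past the first </a>, and $ forbids stopping there.
accepts_extra_end = anchored.search(extra_end) is not None
```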

      

Here are the tests.

Maybe you would rather keep the href inside the <a ...> tag? Something like:

'<a whatever href="obviously_a_must_have"> whatever <div> this div should be accepted </div> ... </a>'

      

Then the regex would be:

^<a[^><]*href=\"[^\"]+\"[^><]*>(?:(?<!<\/a>).)*(?:<\/a>)$
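In Python this could be wrapped as follows. The helper name `is_valid_anchor` and the two failing inputs are assumptions made for this sketch; only the pattern and the first test string come from the answer.

```python
import re

ANCHOR_RE = re.compile(
    r'^<a[^><]*href=\"[^\"]+\"[^><]*>(?:(?<!<\/a>).)*(?:<\/a>)$'
)

def is_valid_anchor(s: str) -> bool:
    """Return True if s is a single <a ...href="..."...> ... </a> element."""
    return ANCHOR_RE.search(s) is not None

good = ('<a whatever href="obviously_a_must_have"> whatever '
        '<div> this div should be accepted </div> ... </a>')
no_href = '<a whatever> whatever </a>'   # hypothetical: href is missing
extra_end = good + ' </a>'               # hypothetical: second closing tag

results = [is_valid_anchor(good), is_valid_anchor(no_href), is_valid_anchor(extra_end)]
```

Note that . does not match newlines by default, so multi-line input would need the re.DOTALL flag.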

      

The tests are here.

When developing regular expressions, I advise you to start with something simple, with plenty of .* that match anything, and then replace them step by step with the real parts.
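One way to follow that advice in Python is to keep every intermediate pattern and re-run all of them against the sample after each refinement. The three stages below are illustrative guesses at such a progression, not patterns taken from the answer.

```python
import re

sample = '<a whatever href="obviously_a_must_have"> whatever </a>'

# Hypothetical refinement stages, from loose to strict.
stages = [
    r'<a.*</a>',                           # just an opening and a closing tag
    r'<a[^><]*href=.*</a>',                # href must sit inside the opening tag
    r'<a[^><]*href="[^"]+"[^><]*>.*</a>',  # href value quoted, tag properly closed
]

# Each stage should still match the sample; a False pinpoints the step that broke it.
results = [re.search(pattern, sample) is not None for pattern in stages]
```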
