Negative lookbehind in regex with optional prefix

We are using the following regex to recognize URLs (derived from this gist by Jim Gruber). This is done in Scala using scala.util.matching, which in turn uses java.util.regex:

(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b/?(?!@)))

      

This version has escaped slashes, for Rubular:

(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))

      

Previously, the front end only sent plain text to the back end; however, users can now create anchor tags for URLs. The back end therefore needs to recognize URLs other than those already inside anchor tags. I first tried to do this with a negative lookbehind, ignoring URLs prefixed with href=":

(?i)\b((?<!href=")((?:https?: ... etc

      

The problem is that our URL regex is very liberal, recognizing http://www.google.com, www.google.com and google.com. Given

 <a href="http://www.google.com">Google</a>

      

a negative lookbehind will ignore http://www.google.com, but the regex will then still recognize www.google.com. I'm wondering whether there is a concise way to tell the regex to "ignore www.google.com and google.com if they are substrings of an ignored http(s)://www.google.com".

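To make the problem concrete, here is a minimal Scala sketch. It assumes a much simplified stand-in for the URL pattern (simpleUrl is a hypothetical name, not the full regex above) and applies the href=" lookbehind to it:

```scala
import scala.util.matching.Regex

// Simplified stand-in for the liberal URL pattern (an assumption, not the full regex).
val simpleUrl = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b"""

// The first attempt: reject a match that starts right after href="
val withLookbehind: Regex = raw"""(?i)\b((?<!href=")${simpleUrl})""".r

val html = """<a href="http://www.google.com">Google</a>"""

// The match starting at http:// is rejected, but the scan continues
// and picks up the scheme-less tail of the same URL.
val found = withLookbehind.findAllMatchIn(html).map(_.group(1)).toList
```

Here found contains only www.google.com: the lookbehind blocks the match at http://, and the liberal pattern then matches again further along inside the same attribute value.
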
I am currently filtering the URL regex matches instead (the code is in Scala). The filter also ignores URLs used as link text (<a href="http://www.google.com">www.google.com</a>) by skipping URLs that are prefixed with > and suffixed with </a>. I would rather stick with the filter if doing this in the regex would make an already complex regex even more unreadable.

urlPattern.findAllMatchIn(text).toList.filter(m => {
  val start: Int = m.start(1)
  val end: Int = m.end(1)
  // The URL is an href value if the 6 characters before it are href="
  val isHref: Boolean = (start >= 6) &&
    text.substring(start - 6, start) == """href=""""
  // The URL is link text if it sits between > and </a> (4 characters)
  val isAnchor: Boolean = (start >= 1 && end + 4 <= text.length &&
    text.substring(start - 1, start) == ">" &&
    text.substring(end, end + 4) == "</a>")
  !(isHref || isAnchor) && Option(m.group(1)).isDefined
})

      



3 answers


<a href=\S+|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))

      

or

<a href=(?:(?!<\/a>).)*<\/a>|\b((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))

      

Try this. It basically does the following:

  • Consumes every href link so that it cannot be matched later.
  • Does not capture it, so it won't show up in the groups.
  • Processes the rest as before.

See the demo:

http://regex101.com/r/vR4fY4/17
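In Scala, the three steps above could be used roughly as follows. This is only a sketch: simpleUrl is a hypothetical, much simplified stand-in for the full URL pattern, and plainUrls is an illustrative helper name.

```scala
import scala.util.matching.Regex

// Simplified stand-in for the liberal URL pattern (an assumption, not the full regex).
val simpleUrl = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b"""

// First alternative swallows a whole anchor element without capturing it;
// second alternative captures any remaining URL in group 1.
val skipAnchors: Regex =
  raw"""(?i)<a href=(?:(?!</a>).)*</a>|\b(${simpleUrl})""".r

// Keep only matches where group 1 took part, i.e. URLs outside anchors.
def plainUrls(text: String): List[String] =
  skipAnchors.findAllMatchIn(text).map(_.group(1)).filter(_ != null).toList
```

Because the anchor alternative consumes the href value and the link text in one go, neither can be matched again by the URL alternative; the caller only has to drop matches whose group 1 is null.
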



It seems you want to ignore not just www.google.com and google.com when they are substrings of an ignored http(s)://www.google.com, but any match that is a fragment of a previously ignored section. In that case, you can use a little code to get around this. Consider this regular expression:

(a href=")?(?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@))))
^^^^^^^^^^^

      

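I am not good at Scala, but you can probably do something like this:

import scala.util.matching.Regex
// "unwanted" names the first capturing group, the optional a href=" prefix
val links = new Regex("""(a href=")?(?i)\b(((?:https?:... """, "unwanted")
val unwanted = for (o <- links findAllMatchIn text) yield o group "unwanted"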

      

If unwanted is null for a match (the optional prefix group did not take part in it), then that match is a useful one.
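A self-contained sketch of this idea follows. It assumes a simplified stand-in pattern (simpleUrl, a hypothetical name) instead of the full URL regex, and it checks the prefix by group number rather than by name, so it does not depend on how a given Scala version handles named groups. wantedUrls is an illustrative helper name.

```scala
import scala.util.matching.Regex

// Simplified stand-in for the liberal URL pattern (an assumption, not the full regex).
val simpleUrl = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b"""

// Group 1: the optional a href=" prefix. Group 2: the URL itself.
val links: Regex = raw"""(a href=")?(?i)\b(${simpleUrl})""".r

// Keep a URL only when the prefix group did not take part in the match.
def wantedUrls(text: String): List[String] =
  links.findAllMatchIn(text)
    .filter(m => m.group(1) == null)
    .map(_.group(2))
    .toList
```

Since the prefixed match consumes the whole href value, its fragments are not matched again. Note that, as sketched here, a URL used as link text (as in <a href="http://www.google.com">www.google.com</a>) is still returned, because nothing marks it as unwanted.
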



For replacement, you can work around this with an alternation:

a href="(?i)\b(?:(?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))|((?i)\b(((?:https?:(?:\/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?!js)[a-z]{2,6}\/)(?:[^\s()<>{}\[\]]+)(?:[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?!js)[a-z]{2,6}\b\/?(?!@)))))

      

The second part of the regex, after the pipe |, is wrapped in a capturing group. In a replacement you can then refer to that first group: \1
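One possible way to use this alternation from Scala is to pass a function to replaceAllIn instead of a \1 template, so that href-prefixed matches are put back unchanged and only the captured URLs are rewritten. This is a sketch under assumptions: simpleUrl is a simplified stand-in for the full pattern, and replaceUrls and its parameter f are hypothetical names.

```scala
import scala.util.matching.Regex

// Simplified stand-in for the liberal URL pattern (an assumption, not the full regex).
val simpleUrl = """(?:https?://)?[a-z0-9]+(?:[.\-][a-z0-9]+)*[.][a-z]{2,6}\b"""

// First alternative: a URL right after a href=" (not captured).
// Second alternative: any other URL, captured in group 1.
val hrefOrUrl: Regex = raw"""(?i)a href="\b${simpleUrl}|\b(${simpleUrl})""".r

// Rewrite captured URLs with f; leave href-prefixed matches as they were.
def replaceUrls(text: String)(f: String => String): String =
  hrefOrUrl.replaceAllIn(text, m =>
    Option(m.group(1)) match {
      case Some(url) => Regex.quoteReplacement(f(url))
      case None      => Regex.quoteReplacement(m.matched)
    })
```

For example, replaceUrls(text)(u => s"[$u]") would bracket free-standing URLs and leave href values alone. Regex.quoteReplacement is used so that any $ or \ in the URL is not interpreted as a group reference.
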



How about simply adding the <a href= part as an optional group, and then, when checking the matches, returning only those in which that group is empty?
