
Excluding characters from a \S match in a regex

I have the following regular expression to match HTML links:

<a\s*href=['|"](http:\/\/(.*?)\S['|"]>

      

It kind of works. Well, not really, since it grabs everything after href= and just keeps going. I want to exclude the quotation marks from the last \S match. Is there a way to do this?

EDIT: That would force it to capture only up to the quotes, rather than everything after them.



6 answers


I don't think your regex does what you want.

<a\s*href=['|"](http:\/\/(.*?)\S['|"]>

      

This lazily grabs everything from http:// up to the first non-space character that precedes a double quote, single quote, or pipe. For that matter, I'm not sure how it even parses, since it doesn't seem to have enough closing parentheses.
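
A quick way to check the point about parentheses is to hand the pattern to a regex compiler. The sketch below assumes Python's `re` module (the question does not say which engine is in use), and the helper name `compile_error` is made up for illustration.

```python
import re

# The pattern exactly as posted in the question.
QUESTION_PATTERN = r"""<a\s*href=['|"](http:\/\/(.*?)\S['|"]>"""


def compile_error(pattern):
    """Return the compiler's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# Two groups are opened but only one is closed, so Python refuses to compile it.
print(compile_error(QUESTION_PATTERN))
```

Other engines may report the problem differently, but two opening parentheses against one closing parenthesis is unbalanced in any of them.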



If you're trying to grab the href, you can try something like this:

<a .*?href=['"](http:\/\/.*?)['"].*?>

      

It uses .*? (a non-greedy match) to allow for other attributes (target, title, etc.). It matches an href that begins and ends with either a single or a double quote (it does not distinguish between them, so it allows the href to open with one and close with the other).
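
As a rough illustration of that behaviour, here is a minimal Python sketch. The function name `first_href` and the sample tags are assumptions, not part of the original answer.

```python
import re

# Non-greedy .*? skips over other attributes before and after href.
HREF_RE = re.compile(r"""<a .*?href=['"](http://.*?)['"].*?>""")


def first_href(html):
    """Return the first http:// href found in an <a ...> tag, or None."""
    match = HREF_RE.search(html)
    return match.group(1) if match else None


# Other attributes around href are tolerated.
print(first_href('<a title="x" href="http://example.com/page" target="_blank">'))

# Mismatched quotes are accepted too, because either quote may open or close.
print(first_href("""<a href="http://example.com/'>"""))
```

Because the opening and closing quotes are matched independently, a URL that itself contains a quote character would be cut short at that quote.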



Why are you trying to match HTML links with a regex?

Depending on what you are trying to do, the right thing will differ.



You can try using an HTML parser. There are several available; there is even one in the Python standard library: https://docs.python.org/library/htmlparser.html
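
For instance, a minimal sketch using `html.parser.HTMLParser` from Python 3 might look like this. The class name `LinkCollector` and the helper `extract_links` are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes are already stripped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = """<p><a class="x" href='http://example.com/a'>A</a>
<a href="http://example.com/b?q=1">B</a></p>"""
print(extract_links(sample))
```

The parser handles quoting, attribute order, and extra whitespace itself, so none of the quote bookkeeping discussed in this thread is needed.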

Hope this helps!



>>> import re
>>> regex = r'<a\s+href=["\'](http://(.*?))["\']>'
>>> string = '<a href="http://google.com/test/this">'
>>> match = re.search(regex, string)
>>> match.group(1)
'http://google.com/test/this'
>>> match.group(2)
'google.com/test/this'

      

explanations:

 \s+   = match at least one whitespace character ("<ahref" is not a valid tag)
 ["\'] = character class; | has no special meaning within square brackets
         (it would match a literal pipe "|")
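
The point about the pipe is easy to verify. A small sketch, with variable names invented for the example:

```python
import re

# Inside square brackets, | is just another member of the character class.
with_pipe = re.compile(r"""['|"]""")
without_pipe = re.compile(r"""['"]""")

print(bool(with_pipe.fullmatch("|")))     # the pipe itself is accepted
print(bool(without_pipe.fullmatch("|")))  # the pipe is rejected
print(bool(without_pipe.fullmatch("'")))  # quotes still match
```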

      



\S matches any character that is not a whitespace character, just like [^\s]

Written that way, you can easily exclude the quotes as well: [^\s"']

Note that you may have to give the .*? in your regex the same treatment. The dot matches any character that is not a newline, like [^\r\n]

Again, written like that, you can easily exclude the quotes: [^\r\n'"]
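
Put together, the idea might look like the following Python sketch. The surrounding `<a\s+href=` part and the name `hrefs` are assumptions added to make the example runnable.

```python
import re

# [^\s"']+ consumes URL characters but can never run past a quote or a space.
LINK_RE = re.compile(r"""<a\s+href=["'](http://[^\s"']+)["']""")


def hrefs(html):
    """Return every http:// href value found in the given text."""
    return LINK_RE.findall(html)


sample = """<a href="http://a.example/x"> and <a href='http://b.example/y'>"""
print(hrefs(sample))
```

Because the class excludes the quotes, the capture stops at the closing quote without relying on lazy matching.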



Read the book Mastering Regular Expressions by Jeff Friedl.

As written:

<a\s*href=['|"](http:\/\/(.*?)\S['|"]>

      

You have unbalanced parentheses in your expression. Maybe the problem is that the first match is treated as "reading to the end of the regexp". Also, why don't you want the last nonspace character of the URL?

The .*? (lazy, non-greedy) operator is interesting. I must say, however, that I would be more inclined to write:

<a\s+href=(['"])http://([^'"><]+)\1>

      

This distinguishes between "<ahref" (a non-existent HTML tag) and "<a href" (a valid HTML tag). It does not capture the "http://" prefix. I'm not sure whether you need to escape the slashes; in Perl, where I mostly work, I don't. The capturing part uses greedy matching, but only over characters that can semi-legitimately appear in a URL. In particular, it excludes both quotes and the end-of-tag character (and, for good measure, the start-of-tag character as well). If you really want the "http://" prefix, move the capturing parenthesis accordingly.
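
Here is a hedged Python sketch of that approach. It assumes the intent is for \1 to refer back to the opening quote, so it wraps the quote class in its own capturing group; that makes the URL part group 2. The function name `url_without_scheme` is hypothetical.

```python
import re

# Group 1 captures the opening quote; \1 then demands the same closing quote.
# Group 2 captures the URL without its http:// prefix.
LINK_RE = re.compile(r"""<a\s+href=(['"])http://([^'"<>]+)\1>""")


def url_without_scheme(tag):
    """Return the part of the href after http://, or None if nothing matches."""
    match = LINK_RE.search(tag)
    return match.group(2) if match else None


print(url_without_scheme('<a href="http://example.com/page">'))
print(url_without_scheme("""<a href="http://example.com/page'>"""))  # mismatched quotes
print(url_without_scheme('<ahref="http://example.com/">'))           # not a valid tag
```

Unlike a plain ['"] at both ends, the backreference rejects an href that opens with one kind of quote and closes with the other.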



I ran into the problem with single quotes in some URLs like this one from Fox Sports. I made a small adjustment that I think should take care of this.

http://msn.foxsports.com/mlb/story/9152594/Fehr : "Increased" -Consumer Market without Ads

/<a\h+href\s*=\s*["'](http:\/\/.*)["'][>\s]/i

This requires the closing quote to be followed by whitespace or a closing angle bracket.
