Regex: Why are empty strings included (in the list of tuples) in re.findall()?

According to the pattern match here, the matches are 213.239.250.131 and 014.10.26.06.

However, when I run the generated Python code and print out the value of re.findall(p, test_str), I get:

[('', '', '213.239.250.131'), ('', '', '014.10.26.06')]

      

I could hack through the list and its tuples to get the values I'm looking for (the IP addresses), but (i) they might not always be in the same position in the tuples and (ii) I would rather understand what's going on here, so that I can either tighten up the regex or extract only the IPs using Python's own re functionality.

Why am I getting this list of tuples, why do the apparently empty matches appear in them, and how can I ensure that only the IP addresses are returned?

+3




2 answers


Whenever you use a capturing group, it is always returned as a submatch, even if it is empty / null. You have 3 capturing groups, so you will always have all three in your findall results.
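As a minimal sketch of this behavior, the snippet below runs the three-group pattern from the question against a shortened sample string (made up for illustration, not the original header). It shows that findall fills the slots of non-participating groups with empty strings, while a match object reports the same groups as None.

```python
import re

# The pattern from the question: three capturing groups in total.
pattern = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)

# Shortened sample input, made up for illustration.
sample = "example.com. [213.239.250.131]"

# With more than one capturing group, findall returns one tuple per match,
# with one entry per group; groups that took no part become ''.
tuples = pattern.findall(sample)
print(tuples)

# On a match object, the same non-participating groups are reported as None.
match = pattern.search(sample)
print(match.groups())
```

Only the third slot is filled here because the text was matched by the branch after the |.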

At regex101.com you can see these non-participating groups by enabling them in Options.

You can tighten up your regex by removing the capturing groups:

(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

      

Or even:

(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}



See the regex demo.

And since the regex pattern no longer contains any capturing groups, re.findall will return the whole matches rather than the contents of the groups:

import re
p = re.compile(r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
test_str = "from mail.example.com (example.com. [213.239.250.131]) by\n mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58 for\n <alex@example.com> (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256\n bits=128/128); Sun, 26 Oct 2014 06:16:58 -0700 (PDT)"
print(re.findall(p, test_str))

      

Output from the online Python demo:

['213.239.250.131', '014.10.26.06']
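If the capturing groups have to stay in the pattern for some reason, one possible alternative is to iterate over match objects with re.finditer and take group 0, which is always the whole match regardless of how many groups exist. A sketch under that assumption, using a shortened, made-up input rather than the full header:

```python
import re

# Original pattern from the question, capturing groups left in place.
p = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)

# Shortened sample input, made up for illustration.
text = "[213.239.250.131] id lbb.82.2014.10.26.06.16.58"

# group(0) is the entire match, so the position of the filled
# capturing group inside the tuple no longer matters.
ips = [m.group(0) for m in p.finditer(text)]
print(ips)
```

This avoids picking values out of tuples by position, which was concern (i) in the question.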

      

+6




These are capture groups. When you use an or (|), findall returns empty strings for the groups in the alternative that did not match.

(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

The part before the | (the "or") has 2 groups:
(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})



and the part after the | has the third:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})

To put it simply, each pair of parentheses defines a capturing group, and that group is returned whether or not it matched.
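The group numbering can be checked directly in Python. The sketch below uses the compiled pattern's groups attribute to count the capturing groups, then lists which group numbers took part in a match for each side of the |. Both input strings are hypothetical examples chosen to hit one branch each.

```python
import re

p = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)

# Number of capturing groups in the pattern.
group_count = p.groups

# Hypothetical inputs, one per branch of the alternation.
left = p.search("2001:db8:0:1")      # matched by the part before the |
right = p.search("213.239.250.131")  # matched by the part after the |

# Numbers of the groups that actually took part in each match.
left_used = [i for i in range(1, group_count + 1) if left.group(i) is not None]
right_used = [i for i in range(1, group_count + 1) if right.group(i) is not None]
print(group_count, left_used, right_used)
```

Groups 1 and 2 are only ever filled by the left side and group 3 only by the right side, so every findall tuple has at least one empty slot.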

+1

