Regex: Why are empty strings included (in the list of tuples) in re.findall()?
According to the pattern match here, the regex matches 213.239.250.131 and 014.10.26.06.
However, when I run the generated Python code and print the value of re.findall(p, test_str), I get:
[('', '', '213.239.250.131'), ('', '', '014.10.26.06')]
I could hack through the list and its tuples to get the values I'm looking for (the IP addresses), but (i) they might not always be in the same position in the tuples and (ii) I would rather understand what's going on here, so that I can either tighten up the regex or extract only the IPs using Python's own regex functionality.
Why am I getting this list of tuples, why do they contain apparently empty matches, and how do I ensure that only the IP addresses are returned?
Whenever you use a capturing group, re.findall always returns a submatch for it, even if it is empty/null because the group did not participate in the match. You have 3 capturing groups, so every tuple in your findall results will always have 3 items.
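A minimal sketch of that behavior, using the three-group pattern from the question on a shortened sample string (the variable names here are only illustrative):

```python
import re

# Pattern from the question: one outer group and one nested group for the
# first alternative, plus one group for the dotted (IPv4-style) alternative.
p = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)
sample = "example.com. [213.239.250.131]"

group_count = p.groups       # number of capturing groups in the pattern
tuples = p.findall(sample)   # one tuple per match, one item per group

print(group_count)  # 3
print(tuples)       # [('', '', '213.239.250.131')]
```

Only the third group took part in the match, so the first two slots are filled with empty strings.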
At regex101.com you can see these non-participating groups by enabling them in Options.
You can tighten up your regex by removing the capturing groups:
(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Or even, more compactly:
(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}
See the regex demo.
And since the regex pattern no longer contains any capturing groups, re.findall will return the whole matches rather than the contents of the groups:
import re
p = re.compile(r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
test_str = "from mail.example.com (example.com. [213.239.250.131]) by\n mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58 for\n <alex@example.com> (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256\n bits=128/128); Sun, 26 Oct 2014 06:16:58 -0700 (PDT)"
print(re.findall(p, test_str))
Output from the online Python demo:
['213.239.250.131', '014.10.26.06']
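The more compact variant with \d{1,3}(?:\.\d{1,3}){3} can be checked the same way. The following is only a sketch on a shortened piece of the test string; the names long_form and short_form are made up for the comparison:

```python
import re

# Both patterns use only non-capturing groups, so findall returns plain strings.
long_form = re.compile(
    r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
)
short_form = re.compile(
    r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}'
)

sample = "[213.239.250.131] id xc4si15480310lbb.82.2014.10.26.06.16.58"

long_result = long_form.findall(sample)
short_result = short_form.findall(sample)

print(long_result)                  # ['213.239.250.131', '014.10.26.06']
print(long_result == short_result)  # True
```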
Those are capturing groups. If you use an alternation (an "or", written |), findall returns empty strings for the groups on the side that did not match.
(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
The part before the | has 2 groups: (([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})
and the part after the | has the third: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
To put it simply, each pair of parentheses defines a capturing group, and that group shows up in the result whether or not it matched anything.
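If the pattern has to keep its capturing groups, one possible way to get only the matched text is to iterate with re.finditer and read group(0), which is the whole match regardless of which group took part. This is a sketch under that assumption; the sample string and its second address are made up for illustration:

```python
import re

# Original three-group pattern, left unchanged.
p = re.compile(
    r'(([a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
)

# Made-up sample text containing two dotted addresses.
sample = "relay [213.239.250.131] forwarded to [10.0.0.7]"

# findall would return 3-tuples here; group(0) gives the full match instead.
whole_matches = [m.group(0) for m in p.finditer(sample)]

print(whole_matches)  # ['213.239.250.131', '10.0.0.7']
```

This avoids depending on the position of the non-empty item inside each tuple.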