Python regex insufficient match
I am writing a regex to match phone numbers. One of the problems I ran into is that some zip codes are similar to phone numbers. For example, in Brazil, postal codes are as follows:
30.160-0131
Thus, a simple regex will treat them as false positives:
In [63]: re.search(r"(?P<phone>\d+\.\d+-\d+)", "30.160-0131")
Out[63]: <_sre.SRE_Match at 0x102150990>
Fortunately, such postal codes often have a prefix that usually means "postal code", for example:
CEP 30.160-0131
So, if you see CEP in front of something that looks like a phone number, then it's not a phone number, it's a postal code. I'm trying to write a regex to capture this using a negative lookbehind, but it doesn't work. It still matches:
In [62]: re.search(r"(?<!CEP )(\d+\.\d+-\d+)", "CEP 30.160-0131")
Out[62]: <_sre.SRE_Match at 0x102150eb8>
Why does it still match, and how can I get the negative lookbehind to make the match fail?
You can avoid negative lookbehinds altogether if you let the regex match these postal codes too, but capture only the phone numbers:
m = re.search(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)
And then check whether m.group(1) contains anything; if it does, it is a phone number.
A little demo with re.findall:
>>> import re
>>> s = "There is a CEP 30.160-0131 and a 30.160-0132 in that sentence, which repeats itself like there is a CEP 30.160-0131 and a 30.160-0132 in that sentence."
>>> m = re.findall(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)
>>> print(m)
['', '30.160-0132', '', '30.160-0132']
And from there you can filter out the empty strings.
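Filtering out the empty entries could look like the minimal sketch below. The helper name find_phones is made up for illustration; the pattern is the same alternation used in the demo.

```python
import re

# Same alternation as in the demo: the first branch consumes "CEP <code>"
# without capturing; the second branch captures any other candidate.
PATTERN = re.compile(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)")

def find_phones(text):
    """Return only the captured candidates (hypothetical helper name)."""
    # findall yields '' for matches where group 1 did not participate,
    # i.e. the ones that were consumed by the "CEP ..." branch.
    return [m for m in PATTERN.findall(text) if m]

s = "There is a CEP 30.160-0131 and a 30.160-0132 in that sentence."
print(find_phones(s))  # ['30.160-0132']
```

Note that a single re.search only reports the first match, so if a CEP comes first in the text, m.group(1) will be None even when a phone number follows later. That is why findall (or finditer) is the safer choice for longer strings.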
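The expression still matches because nothing anchors the start of the number. For example, in "CEP 11.213-132" it will match 1.213-132, since that substring does not immediately follow "CEP ". The lookbehind only rejects a match starting at the first digit, so the regex engine moves one character forward and matches from the second digit instead. But you can require whitespace or the start of the line before the first digit: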
re.search(r"(?<!CEP)(?:\s+|^)(\d+\.\d+-\d+)", s)
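To see the difference concretely, here is a small sketch that runs both the pattern from the question and the anchored pattern on the same input. The sample strings are made up for illustration.

```python
import re

s = "CEP 11.213-132"

# Pattern from the question: the lookbehind only blocks a match that starts
# right after "CEP ", so the match simply starts one digit later.
m = re.search(r"(?<!CEP )(\d+\.\d+-\d+)", s)
print(m.group(1))  # 1.213-132

# Anchored pattern: the number must be preceded by whitespace or the start
# of the string, and that whitespace must not directly follow "CEP".
anchored = r"(?<!CEP)(?:\s+|^)(\d+\.\d+-\d+)"
print(re.search(anchored, s))  # None

# A number without the CEP prefix is still found.
phone = re.search(anchored, "call 11.213-132")
print(phone.group(1))  # 11.213-132
```

One caveat with the anchored version: it assumes exactly one whitespace character between CEP and the number. With two spaces ("CEP  11.213-132") the second space no longer directly follows "CEP", so the pattern would match again.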