Python regex: negative lookbehind still matches

I am writing a regex to match phone numbers. One problem I ran into is that some postal codes look like phone numbers. For example, in Brazil, postal codes look like this:

30.160-0131

      

So a simple regex produces false positives on them:

In [63]: re.search(r"(?P<phone>\d+\.\d+-\d+)", "30.160-0131")
Out[63]: <_sre.SRE_Match at 0x102150990>

      

Fortunately, such postal codes often have a prefix that usually means "postal code", for example:

CEP 30.160-0131

      

So, if you see CEP in front of something that looks like a phone number, it's not a phone number, it's a postal code. I'm trying to write a regex that captures this using a negative lookbehind, but it doesn't work. It still matches:

In [62]: re.search(r"(?<!CEP )(\d+\.\d+-\d+)", "CEP 30.160-0131")
Out[62]: <_sre.SRE_Match at 0x102150eb8>

      

Why does it still match, and how can I get the negative lookbehind to prevent the match?



2 answers


You can avoid the negative lookbehind altogether if you let the regex match these postal codes as well and capture only the phone numbers:

m = re.search(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)

      

And then check whether m.group(1) contains anything; if it does, it is a phone number.




A little demo with re.findall:

>>> import re
>>> s = "There is a CEP 30.160-0131 and a  30.160-0132 in that sentence, which repeats itself like there is a CEP 30.160-0131 and a  30.160-0132 in that sentence."
>>> m = re.findall(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)
>>> print(m)
['', '30.160-0132', '', '30.160-0132']

      

And from there you can filter out the empty strings.
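
A minimal sketch of that filtering step, assuming a helper named find_phones (a hypothetical name, not part of the answer above) that wraps the same pattern:

```python
import re

# Same alternation as above: the first branch consumes "CEP <number>" without
# capturing anything; the second branch captures any other number-like text.
PATTERN = re.compile(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)")

def find_phones(text):
    """Return only the captured numbers, dropping the empty strings
    that findall() yields whenever the CEP branch matched."""
    return [m for m in PATTERN.findall(text) if m]

s = "There is a CEP 30.160-0131 and a  30.160-0132 in that sentence."
print(find_phones(s))  # ['30.160-0132']
```

The CEP branch has to come first in the alternation, so that the postal code is consumed before the capturing branch gets a chance to match part of it.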



The expression matches because nothing anchors the start of the number. For example:

"CEP 11.213-132"

      



will match 1.213-132, since that part does not immediately follow "CEP ". But you can require either whitespace or the start of the line (the ^ anchor) before the first digit:

re.search(r"(?<!CEP)(?:\s+|^)(\d+\.\d+-\d+)", s)
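
As a rough check of the explanation above, the following sketch compares the pattern from the question with the anchored one. The variable names and the "call ..." test string are made up for illustration:

```python
import re

# Pattern from the question: the lookbehind only blocks a match that
# starts directly after "CEP ", so the engine retries one digit later.
original = re.compile(r"(?<!CEP )(\d+\.\d+-\d+)")
# Pattern from this answer: the number must follow whitespace or start the string.
anchored = re.compile(r"(?<!CEP)(?:\s+|^)(\d+\.\d+-\d+)")

# The first digit is skipped and the rest still matches.
print(original.search("CEP 30.160-0131").group(1))    # 0.160-0131

# With the anchor, the postal code is rejected and a plain number is kept.
print(anchored.search("CEP 30.160-0131"))             # None
print(anchored.search("call 30.160-0131").group(1))   # 30.160-0131
```

One caveat with this sketch: it assumes exactly one space after CEP. With two spaces, the lookbehind is also tried at the second space, where the preceding three characters are no longer "CEP", so the number matches again.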

      
