Find characters that don't match a given regex

I am writing a program to check and correct a date given as a string. Let's take 04121987 as a date in the format ddmmyyyy. A regular expression for a date like this:

(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])(19\d\d|20\d\d)

      

If I match my string against a regex it works well. In Python:

>>> import re
>>> regex = re.compile(r'(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])(19\d\d|20\d\d)')
>>> regex.findall('04121987')
[('04', '12', '1987')]

      

If I have the string 04721987, you can clearly see that 72 is not a valid month, and therefore the string will not match the regex.

>>> regex.findall('04721987')
[]

      

What I would like to know is the character that makes the regex fail, and its position. In this case it is 7. How can I do this in Python?



4 answers


I believe you can't do what you want, because the _sre module is implemented in C :(

You could use this package instead (by monkey patching sre_compile, changing the path first and importing your new _sre, etc.), but I don't think it's worth it. It is an implementation of the _sre module written entirely in Python, so you can read the source code, edit it, and do something at the point where the next character doesn't match.

You could do something similar instead:

  • Split the date string into three parts (day, month and year) and match each part against its own regex
  • Validate the date in some other way that doesn't involve regex

You won't get the exact digit where the error is, but I don't think that matters much in this scenario, as long as you tell the user which part is wrong (day, month, or year).
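
The first option above could be sketched roughly as follows. This is only an illustration, not a tested solution: the names PARTS and find_bad_parts are made up here, and the three patterns are simply the three groups of the regex in the question. It assumes the input is meant to be an 8-character ddmmyyyy string; anything past the eighth character is ignored.

```python
import re

# Hypothetical table: (name, start, end, pattern) for each part of a
# ddmmyyyy string. The patterns are the groups of the question's regex.
PARTS = [
    ("day", 0, 2, r"0[1-9]|[12][0-9]|3[01]"),
    ("month", 2, 4, r"0[1-9]|1[012]"),
    ("year", 4, 8, r"19\d\d|20\d\d"),
]

def find_bad_parts(s):
    """Return (name, start_index, text) for every part that fails its pattern."""
    bad = []
    for name, start, end, pattern in PARTS:
        chunk = s[start:end]
        # fullmatch requires the whole chunk to match, not just a prefix.
        if re.fullmatch(pattern, chunk) is None:
            bad.append((name, start, chunk))
    return bad

print(find_bad_parts("04121987"))  # []
print(find_bad_parts("04721987"))  # [('month', 2, '72')]
```

The start index tells you where the offending part begins, even though it does not single out one character within it.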



This solution is a beast, and I hope you find a better method. This code is easy to test and may suffice. The errorindex() function takes a date as a string and returns a list of the indices of the invalid digits. There are ambiguities, though: if the first digit of the month is incorrect, it is impossible to tell whether the second digit is correct. Here is the code. Note: I forgot about leap years!



def errorindex(s):
  # Returns the indices of the digits that cannot be right in a ddmmyyyy string.
  # Assumes s contains only digits; int() raises ValueError otherwise.
  err = []
  for i in range(len(s)):
    d = int(s[i])
    if i == 0:  #day1: 0-3
      if d > 3:
        err.append(i)
    if i == 1:  #day2: depends on day1 and on the month
      if int(s[0]) == 0:
        if d < 1:  #00 is not a day
          err.append(i)
      elif int(s[0]) == 2 and s[2:4] == '02':
        if d > 8:  #February, ignoring leap years
          err.append(i)
      elif int(s[0]) == 3:
        if d > 1:  #30 or 31 at most
          err.append(i)
    if i == 2:  #month1: 0-1
      if d > 1:
        err.append(i)
    if i == 3:  #month2: depends on month1
      if int(s[2]) == 0:
        if d < 1:  #00 is not a month
          err.append(i)
      else:  #month1 is 1, or invalid and therefore ambiguous
        if d > 2:
          err.append(i)
    if i == 4:  #year1: 1 or 2
      if d < 1 or d > 2:
        err.append(i)
    if i == 5:  #year2: 19xx or 20xx only
      if int(s[4]) == 1:
        if d != 9:
          err.append(i)
      elif int(s[4]) == 2:
        if d != 0:
          err.append(i)
    #year3 and year4 (i == 6 and i == 7) can be any digit
  return err

s = '04721987'

print(errorindex(s))  # prints [2]

      



Well, the most obvious answer to me is to use a regex library built on state machines, or to write your own, which could, with some changes, determine exactly where the match failed. But I suppose this is not something you are willing to do.

Otherwise, if you know that the input will always have the exact size and date format, you can split it into three sections (dd, mm, yyyy) and apply the corresponding regular expression to each section, or even to each character, separately. It's not a good solution, but you get what you want.
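
A minimal sketch of the per-character variant, assuming a fixed ddmmyyyy layout. The table PREFIX_PATTERNS and the function first_bad_index are hypothetical names invented for this example; each entry says what the current section must look like up to and including that position.

```python
import re

# Hypothetical table: for position i, (start of its section, pattern the
# section must fully match once character i has been read).
PREFIX_PATTERNS = [
    (0, r"[0-3]"),                   # day, first digit
    (0, r"0[1-9]|[12][0-9]|3[01]"),  # day, both digits
    (2, r"[01]"),                    # month, first digit
    (2, r"0[1-9]|1[012]"),           # month, both digits
    (4, r"[12]"),                    # year, first digit
    (4, r"19|20"),                   # year, century
    (4, r"(?:19|20)\d"),             # year, three digits
    (4, r"(?:19|20)\d\d"),           # year, all four digits
]

def first_bad_index(s):
    """Return the index of the first character that breaks the date,
    or None if s is an acceptable ddmmyyyy string."""
    for i, (section_start, pattern) in enumerate(PREFIX_PATTERNS):
        if re.fullmatch(pattern, s[section_start:i + 1]) is None:
            return i  # for a string that is too short, this equals len(s)
    # Anything after the eighth character is unexpected.
    return len(PREFIX_PATTERNS) if len(s) > len(PREFIX_PATTERNS) else None

print(first_bad_index("04721987"))  # 2
print(first_bad_index("04121987"))  # None
```

Only the first failure is reported, and like the regex in the question this does not know about month lengths or leap years.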



One possible approach is to construct a regex that matches anything, but puts the good matches and the bad matches in different groups. Examine which groups are populated in the results to see which group failed.

>>> import re
>>> regex = re.compile(r'(?:(0[1-9]|[12][0-9]|3[01])|(.{,2}))(?:(0[1-9]|1[012])|(.{,2}))(?:(19\d\d|20\d\d)|(.{,4}))')
>>> regex.match('04121987').groups()
('04', None, '12', None, '1987', None)
>>> regex.match('04721987').groups()
('04', None, None, '72', '1987', None)
>>> regex.match('0412').groups()
('04', None, '12', None, None, '')
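
To turn those groups into a position, one option is to name them and ask the match object where each fallback group starts. The sketch below is the same pattern with named groups added; the group names and the bad_groups helper are assumptions made for this example only.

```python
import re

# Same pattern as above, with hypothetical group names added.
# Every fallback group can match the empty string, so match() always succeeds.
REGEX = re.compile(
    r'(?:(?P<day>0[1-9]|[12][0-9]|3[01])|(?P<bad_day>.{,2}))'
    r'(?:(?P<month>0[1-9]|1[012])|(?P<bad_month>.{,2}))'
    r'(?:(?P<year>19\d\d|20\d\d)|(?P<bad_year>.{,4}))'
)

def bad_groups(s):
    """Return (name, start_index, text) for each part caught by a fallback group."""
    m = REGEX.match(s)
    return [
        (name, m.start('bad_' + name), m.group('bad_' + name))
        for name in ('day', 'month', 'year')
        if m.group('bad_' + name) is not None
    ]

print(bad_groups('04121987'))  # []
print(bad_groups('04721987'))  # [('month', 2, '72')]
print(bad_groups('0412'))      # [('year', 4, '')]
```

This gives the start of the failing part rather than the exact character inside it.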

      


Another approach is to take a known-good string as a base and replace it with the input string one character at a time, checking on each iteration. Here I am using datetime.datetime.strptime to do the checking. You could use a regex as well, although it would need to accept years up to 2999 (the intermediate strings borrow their trailing digits from the base date), so the one in the question wouldn't work.

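from datetime import datetime

def str_to_date(s):
    # Known-good ddmmyyyy date used to pad the part of s not yet checked.
    good_date = '01011999'
    for i in range(len(good_date)):
        try:
            # The first i+1 characters come from s, the rest from good_date.
            d = datetime.strptime(s[:i+1] + good_date[i+1:], '%d%m%Y')
        except ValueError:
            raise ValueError("Bad character '%s' at index %d" % (s[i:i+1], i))
    # Caveat: 29 February is rejected even in a leap year, because the
    # intermediate strings borrow the year from good_date.
    return d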

      
