Search pattern includes square brackets

I am trying to find exact words in a file. I read the file line by line and loop through the lines to find the exact words. Since the "in" keyword is not suitable for finding exact words, I use a regex pattern.

import re

def findWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

      

The problem with this function is that it doesn't recognize square brackets, as in [xyz]. For example,

findWord('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]') 

      

returns None, whereas

findWord('data_var_cod')('Cod_Byte1 = DATA_VAR_COD') 

      

returns <_sre.SRE_Match object at 0x0000000015622288>

Can anyone help me customize the regex pattern?

+3




3 answers


This is because the regex engine treats square brackets as a character class, since they are regex metacharacters. To get around this problem, you need to escape the regex characters. You can use the function re.escape:

def findWord(w):
    return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search

      

Also, as a more Pythonic way to get all matches, you can use re.findall(), which returns a list of matches, or re.finditer(), which returns an iterator of match objects.
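As a rough sketch of the difference between the two calls, the snippet below compiles the same escaped, word-bounded pattern and runs both on one line. The helper name word_pattern and the sample line are made up for illustration.

```python
import re

def word_pattern(w):
    # Same pattern as in findWord: the escaped keyword between word boundaries.
    return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE)

line = 'Cod_Byte1 = DATA_VAR_COD; tmp = data_var_cod'
pattern = word_pattern('data_var_cod')

# findall returns the text captured by the single group, one entry per match.
all_matches = pattern.findall(line)

# finditer yields match objects, so the positions are available as well.
positions = [m.start(1) for m in pattern.finditer(line)]

print(all_matches)
print(positions)
```

Because the pattern has exactly one capturing group, findall returns the captured keyword text rather than tuples.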

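But this method is not complete, because \b matches only between a word character and a non-word character. If the keyword starts or ends with a non-word character such as [ or ], the pattern fails whenever that character sits next to a space or at the start or end of the string: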



>>> ss = 'hello string [processing] in python.'
>>> re.compile(r'\b({0})\b'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss)
>>>
>>> re.compile(r'({})'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss).group(0)
'[processing]'

      

So I suggest removing the word boundaries if your keywords start or end with non-word characters.

But, as a more general way, you can use the following regex, which uses a positive lookahead to match words that are surrounded by spaces, or that come at the start or the end of the line:

r'(?: |^)({})(?=[. ]|$)'
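A minimal sketch of how this space-delimited pattern might be used; find_delimited is a hypothetical helper name, and the keyword is assumed to be passed through re.escape first. Nothing follows the lookahead in the pattern, so a keyword at the end of the line can still match.

```python
import re

def find_delimited(w):
    # Hypothetical helper: the keyword must follow a space or the line start
    # and be followed by a space, a period, or the line end.
    pattern = r'(?: |^)({})(?=[. ]|$)'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).search

ss = 'hello string [processing] in python.'

bracketed = find_delimited('[processing]')(ss)
at_end = find_delimited('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]')
partial = find_delimited('str')(ss)  # 'str' is only a prefix of 'string'

# group(1) is the keyword itself; group(0) would include the leading space.
print(bracketed.group(1))
print(at_end.group(1))
print(partial)
```

Note that this pattern only treats a space, a period, or the line edges as delimiters, so a keyword followed by a comma or a parenthesis would not match.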

      

+1




This is because [ and ] have a special meaning in regular expressions. You must escape the string you are looking for:

re.escape(regex)

      

This escapes the regex metacharacters for you. Change your code to:



return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search
                                      ↑↑↑↑↑↑↑↑↑

      

You can see what re.escape does to your string, for example:

>>> w = '[xyz]'
>>> print(re.escape(w))
\[xyz\]

      

+1




You need a clever way to create a regular expression:

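def findWord(w):
    esc = re.escape(w)  # match the keyword literally, including [ and ]
    # Word characters at both ends: boundaries on both sides.
    if re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'\b{0}\b'.format(esc), flags=re.IGNORECASE).search
    # Non-word characters at both ends: no boundaries.
    if not re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'{0}'.format(esc), flags=re.IGNORECASE).search
    # Word character only at the end: trailing boundary.
    if not re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'{0}\b'.format(esc), flags=re.IGNORECASE).search
    # Word character only at the start: leading boundary.
    if re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'\b{0}'.format(esc), flags=re.IGNORECASE).search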

      

The problem is that some of your keywords will have word characters only at the beginning, others only at the end, most will have word characters at both ends, and some will have non-word characters at both ends. To check a word boundary reliably, you need to know whether a word character is present at the beginning or end of the keyword.

Thus, with re.match(r'\w', x) we can check if a keyword starts with a word character and, if so, add \b to the start of the pattern, and with re.search(r'\w$', x) we can check if a keyword ends with a word character.
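The same two checks can also be folded into a shorter form. This is only a sketch under the assumption that the keyword should be matched literally, so it is passed through re.escape; the name find_word is hypothetical.

```python
import re

def find_word(w):
    # Add \b only on a side where the keyword itself has a word character.
    prefix = r'\b' if re.match(r'\w', w) else ''
    suffix = r'\b' if re.search(r'\w$', w) else ''
    pattern = prefix + re.escape(w) + suffix
    return re.compile(pattern, flags=re.IGNORECASE).search

# Keyword ending in ']' gets a leading boundary only.
with_index = find_word('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]')

# Plain keyword keeps both boundaries, so it does not match inside a longer word.
inside_longer = find_word('data_var_cod')('Cod_Byte1 = DATA_VAR_CODE')

# Keyword with brackets at both ends gets no boundaries at all.
bracketed = find_word('[processing]')('hello string [processing] in python.')

print(with_index.group(0))
print(inside_longer)
print(bracketed.group(0))
```

Building the prefix and suffix separately covers all four start/end combinations without repeating the compile call.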

If you have multiple keywords for string validation, you can check this post of mine.

0

