Maximum match of a substring in a regular expression

I would like to extract the chemical element symbols (if any) from a word. For this, I prepared a regular expression pattern consisting of all the element symbols in the periodic table.

H|He|Li|Be|B|C|N|O|F|Ne|Na|Mg|Al|Si|P|S|Cl|Ar|K|Ca|Sc|Ti|V|Cr|Mn|Fe|Co|Ni|Cu|Zn|Ga|Ge|As|Se|Br|Kr|Rb|Sr|Y|Zr|Nb|Mo|Tc|Ru|Rh|Pd|Ag|Cd|In|Sn|Sb|Te|I|Xe|Cs|Ba|La|Ce|Pr|Nd|Pm|Sm|Eu|Gd|Tb|Dy|Ho|Er|Tm|Yb|Lu|Hf|Ta|W|Re|Os|Ir|Pt|Au|Hg|Tl|Pb|Bi|Po|At|Rn|Fr|Ra|Ac|Th|Pa|U|Np|Pu|Am|Cm|Bk|Cf|Es|Fm|Md|No|Lr|Rf|Db|Sg|Bh|Hs|Mt

      

Now, for a given word, I would like to extract elements from it using the above regex pattern. The problem I am facing now is that for words like

CuIn2Se

      

I get

C, In, S

which is not correct. What I need is

Cu, In, Se

I believe the reason is that the pattern lists "C" before "Cu" and "S" before "Se", so the shorter symbol matches first. For this word, the pattern effectively behaves like

C | In | S | Cu | Se

To solve this problem, I think I need the regex to match the longest possible symbol at each position, taking all the alternatives in the pattern into account.



4 answers


The correct way to do this is to list the alternatives in descending order of length, so that a two-letter symbol is tried before the one-letter symbol that is its prefix:

>>> import re
>>> pat = re.compile('Cu|In|Se|C|S')
>>> s = 'CuIn2Se'
>>> pat.findall(s)
['Cu', 'In', 'Se']

      

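This is explained in the re documentation: alternatives separated by | are tried from left to right, and as soon as one of them matches, the remaining alternatives are not tried, even if one of them would produce a longer match.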
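A minimal sketch of that left-to-right behaviour, using the five-symbol pattern from the question in two different orders. The variable names are illustrative only:

```python
import re

word = "CuIn2Se"

# Shorter alternatives listed first: "C" matches at position 0,
# so "Cu" is never tried there.
short_first = re.compile("C|In|S|Cu|Se")
# Longer alternatives listed first: "Cu" and "Se" get the first chance.
long_first = re.compile("Cu|Se|In|C|S")

short_result = short_first.findall(word)
long_result = long_first.findall(word)

print(short_result)  # ['C', 'In', 'S']
print(long_result)   # ['Cu', 'In', 'Se']
```

The two patterns contain exactly the same alternatives; only their order changes the result.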



Small note

Since your pattern string is very long, here is a one-liner that sorts its alternatives in descending order of length. Here s is the pattern string from your question, not the word being searched:

'|'.join(sorted(s.split('|'), key=len, reverse=True))
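Putting the two steps together, the whole pipeline might look like the sketch below. The names symbols, element_pattern, and extract_elements are illustrative and not part of the original answer, and re.escape is added only as a precaution (none of the element symbols contain regex metacharacters).

```python
import re

# Pattern string from the question (alternatives in periodic-table order).
s = ("H|He|Li|Be|B|C|N|O|F|Ne|Na|Mg|Al|Si|P|S|Cl|Ar|K|Ca|Sc|Ti|V|Cr|Mn|Fe|"
     "Co|Ni|Cu|Zn|Ga|Ge|As|Se|Br|Kr|Rb|Sr|Y|Zr|Nb|Mo|Tc|Ru|Rh|Pd|Ag|Cd|In|"
     "Sn|Sb|Te|I|Xe|Cs|Ba|La|Ce|Pr|Nd|Pm|Sm|Eu|Gd|Tb|Dy|Ho|Er|Tm|Yb|Lu|Hf|"
     "Ta|W|Re|Os|Ir|Pt|Au|Hg|Tl|Pb|Bi|Po|At|Rn|Fr|Ra|Ac|Th|Pa|U|Np|Pu|Am|"
     "Cm|Bk|Cf|Es|Fm|Md|No|Lr|Rf|Db|Sg|Bh|Hs|Mt")

# Sort longest-first so two-letter symbols are tried before one-letter ones.
symbols = sorted(s.split("|"), key=len, reverse=True)
element_pattern = re.compile("|".join(map(re.escape, symbols)))


def extract_elements(word):
    """Return the element symbols found in word, preferring longer symbols."""
    return element_pattern.findall(word)


print(extract_elements("CuIn2Se"))  # ['Cu', 'In', 'Se']
print(extract_elements("H2O"))      # ['H', 'O']
```

Compiling the pattern once and reusing it avoids rebuilding the sorted alternation for every word.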

      



You can also use the named list feature of the third-party regex module:



>>> import regex
>>> s='H|He|Li|Be|B|C|N|O|F|Ne|Na|Mg|Al|Si|P|S|Cl|Ar|K|Ca|Sc|Ti|V|Cr|Mn|Fe|Co|Ni|Cu|Zn|Ga|Ge|As|Se|Br|Kr|Rb|Sr|Y|Zr|Nb|Mo|Tc|Ru|Rh|Pd|Ag|Cd|In|Sn|Sb|Te|I|Xe|Cs|Ba|La|Ce|Pr|Nd|Pm|Sm|Eu|Gd|Tb|Dy|Ho|Er|Tm|Yb|Lu|Hf|Ta|W|Re|Os|Ir|Pt|Au|Hg|Tl|Pb|Bi|Po|At|Rn|Fr|Ra|Ac|Th|Pa|U|Np|Pu|Am|Cm|Bk|Cf|Es|Fm|Md|No|Lr|Rf|Db|Sg|Bh|Hs|Mt'
>>> p=regex.compile(r"\L<options>", options=s.split('|'))
>>> p.findall('CuIn2Se')
['Cu', 'In', 'Se']

      



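Another easy way is to add a lookahead that requires each symbol to be followed by an uppercase letter, a digit, whitespace, or the end of the string. A short match such as "C" followed by "u" then fails the lookahead, and the regex falls back to the longer alternative:

import re

x = "CuIn2Se"
print(re.findall(r"(?:C|In|S|Cu|Se)(?=[A-Z0-9]|$|\s)", x))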

      



I would take a different approach, just to be different. Rather than listing all the elements in one big regex, it might be faster to put them in a set, grab anything that looks like an element symbol, and filter the candidates afterwards.

import re

molecule = "CuIn2Se"

compounds = re.findall("[A-Z][a-z]?", molecule)

all_compounds = set(("H, He, Li, Be, B, C, N, O, F, Ne, Na, Mg, "
                     "Al, Si, P, S, Cl, Ar, K, Ca, Sc, Ti, V, "
                     "Cr, Mn, Fe, Co, Ni, Cu, Zn, Ga, Ge, As, Se, "
                     "Br, Kr, Rb, Sr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, "
                     "Ag, Cd, In, Sn, Sb, Te, I, Xe, Cs, Ba, La, Ce, "
                     "Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, "
                     "Yb, Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, "
                     "Pb, Bi, Po, At, Rn, Fr, Ra, Ac, Th, Pa, U, Np, "
                     "Pu, Am, Cm, Bk, Cf, Es, Fm, Md, No, Lr, Rf, Db, "
                     "Sg, Bh, Hs, Mt").split(", "))

# Keep only the candidates that are real element symbols.
actual_compounds = [ch for ch in compounds if ch in all_compounds]

      

This should be faster if you have tons of strings to search, as set membership testing is much faster than matching a long regex alternation. If you only have a few, the cost of building the set can outweigh the time saved when parsing the strings. The golden rule is to profile your code and remember that premature optimization is the root of all evil.
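To act on the advice to profile, a rough timing harness built on the standard-library timeit module could look like the sketch below. The function names by_alternation and by_set and the repetition count are assumptions made for illustration; no timings are claimed here, since they depend on the machine and the input.

```python
import re
import timeit

# Element symbols from the question, split on whitespace.
SYMBOLS = (
    "H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe "
    "Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In "
    "Sn Sb Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf "
    "Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am "
    "Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt"
).split()

# Approach 1: one big alternation, longest symbols first.
ALTERNATION = re.compile("|".join(sorted(SYMBOLS, key=len, reverse=True)))
# Approach 2: a loose candidate pattern followed by a set lookup.
CANDIDATE = re.compile("[A-Z][a-z]?")
SYMBOL_SET = set(SYMBOLS)


def by_alternation(word):
    return ALTERNATION.findall(word)


def by_set(word):
    return [c for c in CANDIDATE.findall(word) if c in SYMBOL_SET]


word = "CuIn2Se"
for func in (by_alternation, by_set):
    # Time 10,000 calls of each approach on the same word.
    seconds = timeit.timeit(lambda: func(word), number=10000)
    print(f"{func.__name__}: {func(word)} in {seconds:.4f}s")
```

The two approaches agree on this word but are not equivalent for every input: for a candidate such as "Cx", the set approach discards the whole candidate, whereas the alternation still matches "C".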
