Maximum match of a substring in a regular expression
I would like to extract chemical element symbols (if any) from a word. For this, I prepared a regular expression that alternates over all the element symbols in the periodic table:
H|He|Li|Be|B|C|N|O|F|Ne|Na|Mg|Al|Si|P|S|Cl|Ar|K|Ca|Sc|Ti|V|Cr|Mn|Fe|Co|Ni|Cu|Zn|Ga|Ge|As|Se|Br|Kr|Rb|Sr|Y|Zr|Nb|Mo|Tc|Ru|Rh|Pd|Ag|Cd|In|Sn|Sb|Te|I|Xe|Cs|Ba|La|Ce|Pr|Nd|Pm|Sm|Eu|Gd|Tb|Dy|Ho|Er|Tm|Yb|Lu|Hf|Ta|W|Re|Os|Ir|Pt|Au|Hg|Tl|Pb|Bi|Po|At|Rn|Fr|Ra|Ac|Th|Pa|U|Np|Pu|Am|Cm|Bk|Cf|Es|Fm|Md|No|Lr|Rf|Db|Sg|Bh|Hs|Mt
Now, for a given word, I would like to extract elements from it using the above regex pattern. The problem I am facing now is that for words like
CuIn2Se
I can extract
C,In,S
as items. This is not a correct extraction as I need
Cu, In, Se
from the regex. I believe the reason is that the pattern lists "C" before "Cu" and "S" before "Se", so the shorter symbol is matched first. In effect, the pattern behaves like
C|In|S|Cu|Se
To solve this, I think I need the regex to match the longest possible symbol at each position, considering every alternative in the pattern.
The correct way to do this is to list the alternatives in descending order of length, so that longer symbols are tried before their one-letter prefixes:
>>> import re
>>> pat = re.compile('Cu|In|Se|C|S')
>>> s = 'CuIn2Se'
>>> pat.findall(s)
['Cu', 'In', 'Se']
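This is explained in the Python re documentation: alternatives separated by '|' are tried from left to right, and once one of them matches, the remaining ones are not tested, even if they would produce a longer match. In other words, '|' is never greedy.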
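To make the left-to-right rule concrete, here is a minimal sketch that runs the same five alternatives in two different orders. The names short_first and long_first are only illustrative.

```python
import re

word = "CuIn2Se"

# Alternatives are tried left to right; the first one that matches wins,
# even if a later alternative would match more characters.
short_first = re.compile("C|S|In|Cu|Se")
long_first = re.compile("Cu|Se|In|C|S")

print(short_first.findall(word))  # ['C', 'In', 'S']
print(long_first.findall(word))   # ['Cu', 'In', 'Se']
```

Only the order of the alternatives differs between the two patterns, which is why sorting by length is enough to fix the extraction.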
Small note
Since your pattern is very long, here is a one-liner that sorts its alternatives in descending order of length. Here s is the pattern string, not the word being searched. This might help you.
'|'.join(sorted(s.split('|'),key = len,reverse = True))
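Putting the two pieces together, the following sketch builds the length-sorted pattern from the full symbol list in the question and applies it to a few words. The helper name extract_elements is hypothetical and only used for illustration.

```python
import re

# Element symbols exactly as listed in the question.
symbols = (
    "H|He|Li|Be|B|C|N|O|F|Ne|Na|Mg|Al|Si|P|S|Cl|Ar|K|Ca|Sc|Ti|V|Cr|Mn|Fe|Co|Ni|"
    "Cu|Zn|Ga|Ge|As|Se|Br|Kr|Rb|Sr|Y|Zr|Nb|Mo|Tc|Ru|Rh|Pd|Ag|Cd|In|Sn|Sb|Te|I|"
    "Xe|Cs|Ba|La|Ce|Pr|Nd|Pm|Sm|Eu|Gd|Tb|Dy|Ho|Er|Tm|Yb|Lu|Hf|Ta|W|Re|Os|Ir|Pt|"
    "Au|Hg|Tl|Pb|Bi|Po|At|Rn|Fr|Ra|Ac|Th|Pa|U|Np|Pu|Am|Cm|Bk|Cf|Es|Fm|Md|No|Lr|"
    "Rf|Db|Sg|Bh|Hs|Mt"
).split("|")

# Two-letter symbols go first so that "Cu" is tried before "C".
pattern = re.compile("|".join(sorted(symbols, key=len, reverse=True)))


def extract_elements(word):
    """Return the element symbols found in word, in order of appearance."""
    return pattern.findall(word)


print(extract_elements("CuIn2Se"))  # ['Cu', 'In', 'Se']
print(extract_elements("NaCl"))     # ['Na', 'Cl']
print(extract_elements("H2O"))      # ['H', 'O']
```

Compiling the pattern once and reusing it avoids rebuilding the long alternation for every word.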
You can also use the named-list feature of the third-party regex module:
>>> import regex
>>> s='H|He|Li|Be|B|C|N|O|F|Ne|Na|Mg|Al|Si|P|S|Cl|Ar|K|Ca|Sc|Ti|V|Cr|Mn|Fe|Co|Ni|Cu|Zn|Ga|Ge|As|Se|Br|Kr|Rb|Sr|Y|Zr|Nb|Mo|Tc|Ru|Rh|Pd|Ag|Cd|In|Sn|Sb|Te|I|Xe|Cs|Ba|La|Ce|Pr|Nd|Pm|Sm|Eu|Gd|Tb|Dy|Ho|Er|Tm|Yb|Lu|Hf|Ta|W|Re|Os|Ir|Pt|Au|Hg|Tl|Pb|Bi|Po|At|Rn|Fr|Ra|Ac|Th|Pa|U|Np|Pu|Am|Cm|Bk|Cf|Es|Fm|Md|No|Lr|Rf|Db|Sg|Bh|Hs|Mt'
>>> p=regex.compile(r"\L<options>", options=s.split('|'))
>>> p.findall('CuIn2Se')
['Cu', 'In', 'Se']
I would take a different approach, just to be different. Rather than listing all the elements in one big regex, it might be faster to put them in a set, grab anything that looks like an element symbol, and filter the candidates afterwards.
import re
molecule = "CuIn2Se"
# Candidates: one uppercase letter, optionally followed by one lowercase letter.
compounds = re.findall("[A-Z][a-z]?", molecule)
# Every known element symbol, stored in a set for fast membership tests.
all_compounds = set(("H, He, Li, Be, B, C, N, O, F, Ne, Na, Mg, "
"Al, Si, P, S, Cl, Ar, K, Ca, Sc, Ti, V, "
"Cr, Mn, Fe, Co, Ni, Cu, Zn, Ga, Ge, As, Se, "
"Br, Kr, Rb, Sr, Y, Zr, Nb, Mo, Tc, Ru, Rh, Pd, "
"Ag, Cd, In, Sn, Sb, Te, I, Xe, Cs, Ba, La, Ce, "
"Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, "
"Yb, Lu, Hf, Ta, W, Re, Os, Ir, Pt, Au, Hg, Tl, "
"Pb, Bi, Po, At, Rn, Fr, Ra, Ac, Th, Pa, U, Np, "
"Pu, Am, Cm, Bk, Cf, Es, Fm, Md, No, Lr, Rf, Db, "
"Sg, Bh, Hs, Mt").split(", "))
# Keep only the candidates that are real symbols (a list on Python 2 and 3 alike).
actual_compounds = [ch for ch in compounds if ch in all_compounds]
This should be faster if you have tons of strings to search, since a set membership test is much cheaper than matching against a long regex alternation. If you only have a few strings, the cost of building the set can outweigh the time saved while parsing. The golden rule is to profile your code and remember that premature optimization is the root of all evil.
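As a rough way to check that claim on your own data, here is a minimal timing sketch using timeit from the standard library. The function names and the words workload are made up for illustration, and no particular result is implied; the numbers depend on your machine and inputs.

```python
import re
import timeit

symbols = (
    "H|He|Li|Be|B|C|N|O|F|Ne|Na|Mg|Al|Si|P|S|Cl|Ar|K|Ca|Sc|Ti|V|Cr|Mn|Fe|Co|Ni|"
    "Cu|Zn|Ga|Ge|As|Se|Br|Kr|Rb|Sr|Y|Zr|Nb|Mo|Tc|Ru|Rh|Pd|Ag|Cd|In|Sn|Sb|Te|I|"
    "Xe|Cs|Ba|La|Ce|Pr|Nd|Pm|Sm|Eu|Gd|Tb|Dy|Ho|Er|Tm|Yb|Lu|Hf|Ta|W|Re|Os|Ir|Pt|"
    "Au|Hg|Tl|Pb|Bi|Po|At|Rn|Fr|Ra|Ac|Th|Pa|U|Np|Pu|Am|Cm|Bk|Cf|Es|Fm|Md|No|Lr|"
    "Rf|Db|Sg|Bh|Hs|Mt"
).split("|")

symbol_set = frozenset(symbols)
alternation = re.compile("|".join(sorted(symbols, key=len, reverse=True)))
candidate = re.compile("[A-Z][a-z]?")


def by_alternation(word):
    # One big regex with the longest symbols first.
    return alternation.findall(word)


def by_set_filter(word):
    # Cheap regex for candidates, then a set membership test per candidate.
    return [c for c in candidate.findall(word) if c in symbol_set]


# Hypothetical workload; substitute your real strings here.
words = ["CuIn2Se", "NaCl", "H2O", "C6H12O6"] * 250

for fn in (by_alternation, by_set_filter):
    seconds = timeit.timeit(lambda: [fn(w) for w in words], number=10)
    print(f"{fn.__name__}: {seconds:.3f} s")
```

The two functions agree on well-formed formulas but not on arbitrary text: for "Cx", by_alternation returns ['C'], while by_set_filter returns [] because the candidate "Cx" is not a symbol.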