Regular expression for multiple occurrences in Python
I need to parse lines that contain multiple language codes, like the one below:
008800002 Bruxelles-Nord$Br ussel Nord$<deu>$Brussel Noord$<nld>
- 008800002 is the id
- Bruxelles-Nord$Br ussel Nord$ is name 1
- deu is language 1
- $Brussel Noord$ is name 2
- nld is language 2
So the idea is that a name and a language can appear N times, and I need to collect them all. The language in <> is always 3 characters long, and all names end with a $ sign.
I tried this one but it doesn't give the expected result.
x = re.compile('(?P<stop_id>\d{9})\s(?P<authority>[[\x00-\x7F]{3}|\s{3}])\s(?P<stop_name>.*)(?P<lang_code>(?:[<]\S{0,4}))',flags=re.UNICODE)
I don't know how to get the repeated items. Required: Bruxelles-Nord$Br ussel Nord$<deu>$Brussel Noord$ as stop_name and <nld> as language.
Do this in two steps. First, separate the identifier from the name/language pairs; then use re.finditer on the name/language section to iterate over the pairs and store them in a dict keyed by language code.
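import re

line = u"008800002 Bruxelles-Nord$Br ussel Nord$<deu>$Brussel Noord$<nld>"

# Step 1: split the numeric id from the rest of the line.
m = re.search(r"(\d+)\s+(.*)", line, re.UNICODE)
stop_id = m.group(1)

# Step 2: walk over each "name<lang>" pair and key the name by its language code.
names = {}
for pair in re.finditer(r"(.*?)<(.*?)>", m.group(2), re.UNICODE):
    names[pair.group(2)] = pair.group(1)

print(stop_id, names)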
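If you also want the individual names split on the $ sign, the same two-step approach can be wrapped in a small function. This is only a sketch under a few assumptions: the names parse_stop_line, LINE_RE and PAIR_RE are made up for illustration, the id is taken to be exactly 9 digits and the language code exactly 3 word characters (as described in the question), and the authority field from the attempted regex is ignored because the sample line does not contain one.

```python
import re

# Hypothetical names; the patterns assume the format described in the question.
LINE_RE = re.compile(r"(\d{9})\s+(.*)")   # 9-digit id, whitespace, rest of line
PAIR_RE = re.compile(r"(.*?)<(\w{3})>")   # lazy name part, then a 3-character code in <>


def parse_stop_line(line):
    """Return (stop_id, [(lang, [names...]), ...]), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    stop_id, rest = m.groups()
    pairs = []
    for pair in PAIR_RE.finditer(rest):
        raw_names, lang = pair.groups()
        # Names are terminated by "$"; drop the empty pieces around leading/trailing "$".
        names = [n for n in raw_names.split("$") if n]
        pairs.append((lang, names))
    return stop_id, pairs


print(parse_stop_line("008800002 Bruxelles-Nord$Br ussel Nord$<deu>$Brussel Noord$<nld>"))
```

A list of (language, names) tuples is used here instead of a dict so that the order of the pairs is kept and a language code that appears twice on one line does not overwrite the earlier entry.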