Python - splitting strings with a variable repeating substring

I am trying to do something that I thought would be simple (and probably is), but I am hitting a wall. I have a string containing document numbers. In most cases the format is ###### - # - ###. In some cases, however, where there should be one digit there are several individual digits separated by commas (e.g. ###### - #, # , # - ### ). The number of comma-separated digits is variable.

For the line below:

('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')

      

I need to return:

['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']

      

So far I have only managed to match the plain pattern ###### - # - ### :

import re
p = re.compile(r'\d{6}-\d{1}-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)

      

Thanks in advance for your help!

Matt



4 answers


Perhaps something like this:

>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003

      



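Or using a list comprehension (the iterator is re-created first, because a finditer iterator can only be consumed once):

>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]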
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']

      



Use '\d{6}-\d(,\d)*-\d{3}'. The * means "as many times as you want (0 is included)". It applies to the previous item, here '(,\d)'.
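One thing to watch for when trying this pattern with findall: if a pattern contains a capturing group, findall returns the group rather than the whole match. A minimal sketch of the difference, assuming Python 3 and using the sample string from the question (the variable names are only for illustration):

```python
import re

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# With the capturing group (,\d), findall returns only what the group
# captured: its last repetition, or '' when the group did not take part.
captured = re.findall(r'\d{6}-\d(,\d)*-\d{3}', s)

# With a non-capturing group (?:,\d), findall returns each whole match.
whole = re.findall(r'\d{6}-\d(?:,\d)*-\d{3}', s)

print(captured)
print(whole)
```

The second form yields the four document-number groups, which still have to be expanded into individual numbers.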



I wouldn't use a single regex to try to parse this. Since this is essentially a list of strings, you might find it easier to replace "&" with a comma globally in the string and then use split() to put the items into a list.

Looping over that list lets you write one function that parses and corrects each item; you can then push the results to a new list and join them back into a string.

string = string.replace('&', ',')
initialList = string.split(',')
newList = []
for item in initialList:
    # myfunction is a placeholder for your own parsing/correcting code.
    # Note: split(',') also cuts inside groups such as '1,2', so it has
    # to cope with fragments like '030421-1' and '2-001'.
    newItem = myfunction(item.strip())
    newList.append(newItem)

newstring = ','.join(newList)
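A possible way to fill in the split-then-process idea, as a sketch only: it assumes Python 3, swaps the plain split(',') for re.split with a lookahead so that commas between single digits are left alone, and uses a hypothetical helper named expand in place of the unspecified per-item function.

```python
import re

def expand(item):
    # Hypothetical helper: '030421-1,2-001' -> ['030421-1-001', '030421-2-001']
    prefix, digits, suffix = item.split('-')
    return ['-'.join((prefix, d.strip(), suffix)) for d in digits.split(',')]

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# Split on '&' or ',' only when a new six-digit prefix follows, so the
# commas inside '1,2' and '1,2,3' do not break an item apart.
items = re.split(r'\s*[&,]\s*(?=\d{6}-)', s.strip())

result = [doc for item in items for doc in expand(item)]
print(result)
```

This relies on every item having exactly two hyphens; malformed items would raise a ValueError in expand and would need separate handling.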

      



(\d{6}-)((?:\d,?)+)(-\d{3})

This uses 3 capture groups. The first and last parts are matched in the simple way. The middle part can repeat and optionally contains a ",". A repeated capturing group only keeps its last repetition, so the inner group is marked with ?: to make it non-capturing, and the outer group captures the whole middle part instead. What's left is the following output:

>>> import re
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]

      

You will have to process the second element of each tuple yourself, splitting it on "," and joining each digit with the first and last elements, but a list comprehension should be able to do this.
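The remaining step could look roughly like this. It is a sketch assuming Python 3; the names head, middle, tail and docs are only illustrative.

```python
import re

p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

# For each (head, middle, tail) tuple, split the middle on ',' and rebuild
# one document number per digit. The "if digit" guard skips the empty piece
# that a trailing comma (e.g. '1,2,') would otherwise produce.
docs = [head + digit + tail
        for head, middle, tail in p.findall(s)
        for digit in middle.split(',')
        if digit]

print(docs)
```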







