Python - parsing strings with a variable repeating substring
I am trying to do something that I thought would be simple (and probably is), but I am hitting a wall. I have a string containing document numbers. In most cases the format is ###### - # - ### , but in some cases, where there should be one digit, there are several individual digits separated by commas (e.g. ###### - #, # , # - ### ). The number of comma-separated digits varies. Below is an example:
For the line below:
('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
I need to return:
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
So far I have only managed to match the plain ###### - # - ### pattern:
import re
p = re.compile(r'\d{6}-\d-\d{3}')
m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
print(m)
Thanks in advance for your help!
Matt
Perhaps something like this:
>>> import re
>>> s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> for m in it:
...     a, b, c = m.groups()
...     for x in b.split(','):
...         print(a + x + c)
...
030421-1-001
030421-2-001
030421-1-002
030421-1-002
030421-2-002
030421-3-002
030421-1-003
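Or using a list comprehension (re-create the iterator first, because the loop above has already consumed it):
>>> it = re.finditer(r'(\b\d{6}-)(\d(?:,\d)*)(-\d{3})\b', s)
>>> [a+x+c for a, b, c in (m.groups() for m in it) for x in b.split(',')]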
['030421-1-001', '030421-2-001', '030421-1-002', '030421-1-002', '030421-2-002', '030421-3-002', '030421-1-003']
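If this needs to be reused, the same idea could be wrapped in a function. This is only a sketch: the names DOC_RE and expand_doc_numbers are made up for illustration, and it assumes there is no whitespace inside a document number, as in the sample string.

```python
import re

# Six-digit prefix, one or more comma-separated single digits, three-digit suffix.
DOC_RE = re.compile(r'\b(\d{6})-(\d(?:,\d)*)-(\d{3})\b')

def expand_doc_numbers(text):
    """Return every document number in text, one entry per middle digit."""
    return [
        '-'.join((prefix, digit, suffix))
        for prefix, digits, suffix in DOC_RE.findall(text)
        for digit in digits.split(',')
    ]

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
print(expand_doc_numbers(s))
```

Capturing the parts without their hyphens and joining with '-' keeps the pattern and the output format separate, which may make later changes easier.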
I wouldn't use a single regex to try to parse this. Since this is essentially a list of strings, you might find it easier to replace the "&" with a comma globally in the string and then use split() to put the items into a list.
Looping over the list lets you write one function that parses and corrects each item; you can then append the results to a new list and join them back into a string.
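s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'

def my_function(item):
    # Placeholder: parse and correct a single item here.
    return item.strip()

initial_list = s.replace('&', ',').split(',')
new_list = []
for item in initial_list:
    new_list.append(my_function(item))
new_string = ','.join(new_list)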
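One thing to watch with the split approach: a plain split(',') also cuts at the commas inside a number such as 030421-1,2,3-002. A possible workaround, sketched below, is to split only where a new six-digit prefix follows and then expand each item. The helper names split_items and expand_item are hypothetical, and the sketch assumes every item has exactly two hyphens (anything else raises ValueError).

```python
import re

def split_items(text):
    # Split on '&' or ',' only when a new six-digit prefix follows,
    # so the commas inside '1,2,3' are left alone.
    return re.split(r'\s*[&,]\s*(?=\d{6}-)', text.strip())

def expand_item(item):
    # Turn one item into one document number per middle digit.
    prefix, digits, suffix = item.split('-')
    return ['-'.join((prefix.strip(), d.strip(), suffix.strip()))
            for d in digits.split(',')]

s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
new_list = []
for item in split_items(s):
    new_list.extend(expand_item(item))
print(new_list)
```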
(\d{6}-)((?:\d,?)+)(-\d{3})
This uses three capture groups. The first and last parts are matched in a simple way. The middle part is one or more digits, each optionally followed by a ",". A repeated capture group only keeps its last repetition, so the inner group is made non-capturing with ?: and the outer group captures the whole run instead. The output is then:
>>> import re
>>> p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
>>> m = p.findall('030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003')
>>> m
[('030421-', '1,2', '-001'), ('030421-', '1', '-002'), ('030421-', '1,2,3', '-002'), ('030421-', '1', '-003')]
You will still have to process the middle item of each tuple yourself, splitting it on "," and joining each digit with the first and last items, but a list comprehension should be able to do this.
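A minimal sketch of that last step, assuming the tuples come from the findall call above (the name result is just for illustration):

```python
import re

p = re.compile(r'(\d{6}-)((?:\d,?)+)(-\d{3})')
s = '030421-1,2-001 & 030421-1-002,030421-1,2,3-002, 030421-1-003'
m = p.findall(s)

# One output string per digit in the middle group. The `if d` filter skips
# the empty piece that a trailing comma in the middle group would leave.
result = [a + d + c for a, b, c in m for d in b.split(',') if d]
print(result)
```

The filter is there because (?:\d,?)+ also accepts a middle part that ends in a comma, such as "1,".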