Parsing a file into a dictionary in Python
I have a file, a small snippet of which you can see below:
Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser
....
The number of lines between one ClutchXXX line and the next may vary, but it is never zero. Is it possible to take a specific line from the file as a key (in my case, ClutchXXX) and use the text up to the next occurrence of such a line as the value in a dictionary? I want to get a dictionary like this:
d = {'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
     'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
     'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
I'm mostly interested in the part where we take the line matching the pattern and store it as the key, with the text after it as the value. Any suggestions or guidance on a useful approach would be appreciated.
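import re
from itertools import groupby
from functools import partial
from pprint import pprint

# Clutch lines get a match object as their key; all other lines get None.
key = partial(re.match, r'Clutch\d\d\d')
with open('foo.txt') as f:
    # Consecutive non-Clutch lines share the key None and form one group.
    groups = (', '.join(map(str.strip, g)) for k, g in groupby(f, key=key))
    # Pair each Clutch group with the group of lines that follows it.
    pprint(dict(zip(*[iter(groups)]*2)))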
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
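The zip(*[iter(groups)]*2) expression above is probably the least obvious part of that snippet. Here is a minimal sketch of the same pairing idiom in isolation; the helper name pairs and the sample list are made up for illustration and are not part of the answer:

```python
def pairs(items):
    """Return an iterator of consecutive (first, second) pairs."""
    it = iter(items)    # a single shared iterator
    # zip pulls from the same iterator twice per tuple, so it consumes
    # the items two at a time; equivalent to zip(*[iter(items)] * 2).
    return zip(it, it)

# Alternating key/value groups, as the groupby expression would produce them.
groups = [
    'Clutch001', 'Albino X Pastel, Bumble Bee X Albino Lesser',
    'Clutch002', 'Bee X Fire Bee',
]
result = dict(pairs(groups))
print(result)
```

This only works because the Clutch groups and the value groups strictly alternate, which the question guarantees by saying the number of lines between two Clutch lines is never zero.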
Collect the lines in lists, storing each list in a dictionary as you go:
d = {}
values = None
with open(filename) as inputfile:
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            # start a new list and register it under this key
            values = d[line] = []
        else:
            values.append(line)
This gives you:
{'Clutch001': ['Albino X Pastel', 'Bumble Bee X Albino Lesser'],
 'Clutch002': ['Bee X Fire Bee', 'Albino Cinnamon X Albino', 'Mojave X Bumble Bee'],
 'Clutch003': ['Black Pastel X Banana Ghost Lesser']}
It's easy enough to join each of these lists into a single string after loading the file:
d = {key: ', '.join(value) for key, value in d.items()}
You can also do the joining while reading the file; I would use a generator function to process the file in groups:
def per_clutch(inputfile):
    clutch = None
    lines = []
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            if lines:
                # a new key starts, so emit the finished group
                yield clutch, lines
            clutch, lines = line, []
        else:
            lines.append(line)
    # emit the last group, which no following key terminates
    if clutch and lines:
        yield clutch, lines
then just collect all the groups into a dictionary:
with open(filename) as inputfile:
    d = {clutch: ', '.join(lines) for clutch, lines in per_clutch(inputfile)}
Demonstration of the latter:
>>> def per_clutch(inputfile):
...     clutch = None
...     lines = []
...     for line in inputfile:
...         line = line.strip()
...         if line.startswith('Clutch'):
...             if lines:
...                 yield clutch, lines
...             clutch, lines = line, []
...         else:
...             lines.append(line)
...     if clutch and lines:
...         yield clutch, lines
...
>>> sample = '''\
... Clutch001
... Albino X Pastel
... Bumble Bee X Albino Lesser
... Clutch002
... Bee X Fire Bee
... Albino Cinnamon X Albino
... Mojave X Bumble Bee
... Clutch003
... Black Pastel X Banana Ghost Lesser
... '''.splitlines(True)
>>> {clutch: ', '.join(lines) for clutch, lines in per_clutch(sample)}
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
>>> from pprint import pprint
>>> pprint(_)
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
As noted in the comments, if "Clutch" (or any other keyword) can be relied upon not to appear in the non-keyword lines, you can use the following:
keyword = "Clutch"
with open(filename) as inputfile:
    t = inputfile.read()
# split() returns an empty first piece when the file starts with the keyword; skip it
d = {keyword + s[:3]: s[3:].strip().replace('\n', ', ') for s in t.split(keyword) if s}
This reads the entire file into memory at once, so it should be avoided if your file might get very large.
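One detail worth checking with the split-based approach is what str.split returns when the text begins with the separator: the first piece is an empty string, and it would turn into a bogus 'Clutch' key unless it is skipped. A small sketch on an in-memory string (the sample text is an assumption standing in for the file contents):

```python
keyword = "Clutch"
# Stand-in for inputfile.read(); the real data would come from the file.
text = (
    "Clutch001\nAlbino X Pastel\nBumble Bee X Albino Lesser\n"
    "Clutch002\nBee X Fire Bee\n"
)

pieces = text.split(keyword)
# pieces[0] is '' because the text starts with the keyword.

# Skip empty pieces; the first three characters of each remaining
# piece are the digits that complete the key.
d = {
    keyword + s[:3]: s[3:].strip().replace("\n", ", ")
    for s in pieces
    if s
}
print(d)
```

Note that slicing three characters assumes every key has exactly three digits, as in the sample file.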
You can use re.split() to split the file into its "Clutch" parts:
import re

with open(filename) as f:
    # the capturing group keeps each Clutch line in the result list
    tokens = iter(re.split(r'(^Clutch\d{3}\s*$)\s+', f.read(), flags=re.M))
next(tokens)  # skip everything before the first Clutch
print({k: ', '.join(v.splitlines()) for k, v in zip(tokens, tokens)})
Output
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
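The re.split() approach relies on the fact that a capturing group in the pattern makes re.split keep the matched delimiters in the result list. A short sketch on an in-memory string shows the intermediate list; the sample text, including its "header" line, is made up for illustration:

```python
import re

# Hypothetical input with one stray line before the first key.
text = "header\nClutch001\nA X B\nC X D\nClutch002\nE X F\n"

# Because the pattern has a capturing group, each matched Clutch line
# appears in the output between the text chunks that surround it.
parts = re.split(r'(^Clutch\d{3}\s*$)\s+', text, flags=re.M)
print(parts)

tokens = iter(parts)
next(tokens)  # drop whatever precedes the first Clutch line
d = {k: ', '.join(v.splitlines()) for k, v in zip(tokens, tokens)}
print(d)
```

The list alternates key, value, key, value after the first element, which is why zipping the iterator with itself pairs them up correctly.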
Let the file file.txt contain:
Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser
To get a dictionary, try the following:
import re

with open('file.txt', 'r') as f:
    result = re.split(
        r'(Clutch\d{3}).*?',
        f.read(),
        flags=re.DOTALL  # including '\n'
    )[1:]  # result is ['Clutch001', '\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', 'Clutch002', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', 'Clutch003', '\nBlack Pastel X Banana Ghost Lesser\n']
keys = result[::2]  # keys is ['Clutch001', 'Clutch002', 'Clutch003']
values = result[1::2]  # values is ['\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', '\nBlack Pastel X Banana Ghost Lesser\n']
values = map(
    lambda value: value.strip().replace('\n', ', '),
    values
)  # values yields 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Black Pastel X Banana Ghost Lesser'
d = dict(zip(keys, values))  # d is {'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
Here's a version that works, more or less. I'm not sure how Pythonic this is (maybe it can be compressed, and it could definitely be improved):
import re
import fileinput

d = dict()
key = ''
rx = re.compile(r'^Clutch\d\d\d$')
for line in fileinput.input():
    line = line[0:-1]  # drop the trailing newline
    if rx.match(line):
        key = line
        d[key] = ''
    else:
        d[key] += line
print(d)
for key in d:
    print(key, d[key])
Output (which repeats information):
{'Clutch001': 'Albino X PastelBumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
Clutch001 Albino X PastelBumble Bee X Albino Lesser
Clutch002 Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee
Clutch003 Black Pastel X Banana Ghost Lesser
If for some reason the first line is not a Clutch line, you get an error because of the empty key.
A version with comma joining that also copes with broken text files (no trailing newline), etc.:
import fileinput

d = {}
for line in fileinput.input():
    line = line.rstrip('\r\n')  # line.strip() for leading and trailing space
    if line.startswith('Clutch'):
        key = line
        d[key] = ''
        pad = ''  # no separator before the first value
    else:
        d[key] += pad + line
        pad = ', '
print(d)
for key in d:
    print("'%s': '%s'" % (key, d[key]))
The "pad" technique is something I like in other contexts and works great here. I'm pretty sure this won't be considered pythonic.
Revised output sample:
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser'
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee'
'Clutch003': 'Black Pastel X Banana Ghost Lesser'
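The caveat mentioned earlier in this answer, a file whose first line is not a Clutch line, can be handled explicitly instead of surfacing as a confusing error. The following is only a sketch under that assumption: the function name parse_clutches and its prefix parameter are hypothetical, and it collects values in lists and joins them at the end rather than using the pad technique:

```python
def parse_clutches(lines, prefix="Clutch"):
    """Group data lines under the most recent line starting with prefix."""
    groups = {}
    key = None  # no key seen yet
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(prefix):
            key = line
            groups[key] = []
        elif key is None:
            # Data before the first key: fail with a clear message.
            raise ValueError(
                "data line before the first %s line: %r" % (prefix, line)
            )
        else:
            groups[key].append(line)
    return {k: ", ".join(v) for k, v in groups.items()}


sample = [
    "Clutch001\n", "Albino X Pastel\n", "Bumble Bee X Albino Lesser\n",
    "Clutch003\n", "Black Pastel X Banana Ghost Lesser",  # no final newline
]
print(parse_clutches(sample))
```

Whether to raise, skip the stray lines, or file them under a default key depends on how much you trust the input; raising is just one choice.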
Assuming the word Clutch only occurs on a line of its own, the following would work:
import re

d = {}
with open(filename) as f:
    for line in f:
        if re.match("^Clutch[0-9]+", line):
            match = line.strip()  # the matched line becomes the key
            d[match] = ''
        else:
            # every line without the word 'Clutch' is appended
            # to the value of the most recent key
            d[match] += line.replace('\n', ' ')