Parsing a file into a dictionary in Python

I have a file, a small snippet of which you can see below:

Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser


The number of lines between one ClutchXXX and the next ClutchXXX may vary, but is never zero. Is it possible to take a specific line from the file as a key (in my case ClutchXXX) and the text up to the next occurrence of such a line as the value in a dictionary? I want to get a dictionary like this:

d = {'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
     'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
     'Clutch003': 'Black Pastel X Banana Ghost Lesser'}


I'm mostly interested in the part where we take the line matching the pattern as the key and the text after it as the value. Any suggestions or guidance on a useful approach would be appreciated.



7 replies

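from itertools import groupby
from functools import partial
import re

# Returns a match object for header lines and None for every other line
key = partial(re.match, r'Clutch\d\d\d')

with open('foo.txt') as f:
    groups = (', '.join(map(str.strip, g)) for k, g in groupby(f, key=key))
    # groups alternates header, joined body; pair them up while the file is open
    d = dict(zip(groups, groups))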

{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}




Collect the lines in lists, storing each list in a dictionary:

d = {}
values = None
with open(filename) as inputfile:
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            # start a new list and store it under this key
            values = d[line] = []
        else:
            values.append(line)


This gives you:

{'Clutch001': ['Albino X Pastel', 'Bumble Bee X Albino Lesser'],
 'Clutch002': ['Bee X Fire Bee', 'Albino Cinnamon X Albino', 'Mojave X Bumble Bee'],
 'Clutch003': ['Black Pastel X Banana Ghost Lesser']}


It's easy enough to join each of these lists into a single string after loading the file:

d = {key: ', '.join(value) for key, value in d.items()}


You can also do the joining while reading the file; I would use a generator function to process the file in groups:

def per_clutch(inputfile):
    clutch = None
    lines = []
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            if lines:
                yield clutch, lines
            clutch, lines = line, []
        else:
            lines.append(line)
    if clutch and lines:
        yield clutch, lines


then just collect all the groups into a dictionary:

with open(filename) as inputfile:
    d = {clutch: ', '.join(lines) for clutch, lines in per_clutch(inputfile)}


Demonstration of the latter:

>>> def per_clutch(inputfile):
...     clutch = None
...     lines = []
...     for line in inputfile:
...         line = line.strip()
...         if line.startswith('Clutch'):
...             if lines:
...                 yield clutch, lines
...             clutch, lines = line, []
...         else:
...             lines.append(line)
...     if clutch and lines:
...         yield clutch, lines
>>> sample = '''\
... Clutch001
... Albino X Pastel
... Bumble Bee X Albino Lesser
... Clutch002
... Bee X Fire Bee
... Albino Cinnamon X Albino
... Mojave X Bumble Bee
... Clutch003
... Black Pastel X Banana Ghost Lesser
... '''.splitlines(True)
>>> {clutch: ', '.join(lines) for clutch, lines in per_clutch(sample)}
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
>>> from pprint import pprint
>>> pprint(_)
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}




As noted in the comments, if "Clutch" (or whatever other keyword you use) can be relied upon not to appear in the non-keyword lines, you can use the following:

keyword = "Clutch"
with open(filename) as inputfile:
    t = inputfile.read()
    # skip the empty chunk that precedes the first keyword
    d = {keyword + s[:3]: s[3:].strip().replace('\n', ', ') for s in t.split(keyword) if s.strip()}


This reads the entire file into memory at once, so it should be avoided if your file might get very large.



You can use re.split() to enumerate the "Clutch" parts in a file:

import re

with open(filename) as inputfile:
    text = inputfile.read()

tokens = iter(re.split(r'(^Clutch\d{3}\s*$)\s+', text, flags=re.M))
next(tokens) # skip until the first Clutch
print({k: ', '.join(v.splitlines()) for k, v in zip(tokens, tokens)})



{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}




Let the file file.txt contain:

Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser

To get a dictionary, try the following:

import re

with open('file.txt', 'r') as f:
    result = re.split(
        r'(Clutch\d{3})', # capturing group keeps the keys in the result
        f.read(),
        flags=re.DOTALL # including '\n'
    )[1:] # result is ['Clutch001', '\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', 'Clutch002', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', 'Clutch003', '\nBlack Pastel X Banana Ghost Lesser\n']

    keys = result[::2] # keys is ['Clutch001', 'Clutch002', 'Clutch003']
    values = result[1::2] # values is ['\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', '\nBlack Pastel X Banana Ghost Lesser\n']

    values = list(map(
        lambda value: value.strip().replace('\n', ', '),
        values
    )) # values is ['Albino X Pastel, Bumble Bee X Albino Lesser', 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Black Pastel X Banana Ghost Lesser']

    d = dict(zip(keys, values)) # d is {'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}




Here's a version that works, more or less. I'm not sure how Pythonic this is (it can probably be compressed and could definitely be improved):

import re
import fileinput

d = dict()
key = ''
rx = re.compile(r'^Clutch\d\d\d$')

for line in fileinput.input():
    line = line[0:-1]  # drop the trailing newline
    if rx.match(line):
        key = line
        d[key] = ''
    else:
        d[key] += line

print(d)

for key in d:
    print(key, d[key])


Output (which repeats information):

{'Clutch001': 'Albino X PastelBumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
Clutch001 Albino X PastelBumble Bee X Albino Lesser
Clutch002 Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee
Clutch003 Black Pastel X Banana Ghost Lesser


If for some reason the first line is not a Clutch line, you get an error because of the empty key.
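One way to avoid that error is to remember whether a header has been seen yet. The sketch below is only an illustration: parse_clutches and its prefix parameter are hypothetical names that do not come from any of the answers, and silently skipping stray lines (rather than raising) is an assumed policy.

```python
def parse_clutches(lines, prefix='Clutch'):
    """Map each header line to its comma-joined body lines.

    Lines seen before the first header are skipped instead of raising.
    """
    d = {}
    key = None  # no header seen yet
    for line in lines:
        line = line.rstrip('\r\n')
        if line.startswith(prefix):
            key = line
            d[key] = []
        elif key is not None and line:
            # only collect non-empty lines once a header is known
            d[key].append(line)
    return {k: ', '.join(v) for k, v in d.items()}


sample = [
    'stray line before any header\n',
    'Clutch001\n',
    'Albino X Pastel\n',
    'Bumble Bee X Albino Lesser\n',
    'Clutch002\n',
    'Bee X Fire Bee',  # last line without a trailing newline
]
result = parse_clutches(sample)
```

Any iterable of lines works here, so an open file object can be passed in place of the list.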

A version that joins with commas, copes with broken text files (no trailing newline), etc.:

import fileinput

d = {}

for line in fileinput.input():
    line = line.rstrip('\r\n') # line.strip() for leading and trailing space
    if line.startswith('Clutch'):
        key = line
        d[key] = ''
        pad = ''  # nothing goes before the first value
    else:
        d[key] += pad + line
        pad = ', '  # later values are preceded by a separator

print(d)

for key in d:
    print("'%s': '%s'" % (key, d[key]))


The "pad" technique is something I like in other contexts and works great here. I'm pretty sure this won't be considered pythonic.
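For comparison, here is a minimal sketch of the "pad" idea next to str.join. The helper name join_with_pad is hypothetical and used only for illustration. Both expressions build the same string; the pad form can be handy when items arrive one at a time and you do not want to keep a list.

```python
def join_with_pad(items, sep=', '):
    out = ''
    pad = ''  # empty before the first item
    for item in items:
        out += pad + item
        pad = sep  # every later item is preceded by the separator
    return out


items = ['Bee X Fire Bee', 'Albino Cinnamon X Albino', 'Mojave X Bumble Bee']
padded = join_with_pad(items)
joined = ', '.join(items)
```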

Revised output sample:

{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser'
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee'
'Clutch003': 'Black Pastel X Banana Ghost Lesser'




Assuming the word Clutch always occurs on a line of its own, the following would work:

import re
d = {}
with open(filename) as f:
    for line in f:
        if re.match("^Clutch[0-9]+", line):
            match = line.rstrip('\n')   # match is the key searched for,
                                        # without its trailing newline
            d[match] = ''
        else:
            line = line.replace('\n', ' ')   # newlines are replaced
            d[match] += line  # all lines without the word 'Clutch'
                              # are added to the matched key



