Parsing a file into a dictionary in python

Question

I have a file, a small snippet of which you can see below:

Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser
....

The number of lines between one ClutchXXX and the next ClutchXXX may vary, but it is never zero. Is it possible to take a specific line from the file and use it as a key (in my case ClutchXXX), with the text up to the next occurrence of such a line as the value? I want to get a dictionary like this:

d={'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
   'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
   'Clutch003': 'Black Pastel X Banana Ghost Lesser'}

I'm mostly interested in the part where we take the line matching the pattern and store it as a key, with the text after it as the value. Any suggestions or guidance on a useful approach would be appreciated.

+3

python dictionary file python-2.7

tinySandy Dec 28. 14 at 2:09 am


7 replies

Collect the lines in a list, while storing that list in the dictionary:

d = {}
values = None
with open(filename) as inputfile:
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            values = d[line] = []
        else:
            values.append(line)

This gives you:

{'Clutch001': ['Albino X Pastel', 'Bumble Bee X Albino Lesser'],
 'Clutch002': ['Bee X Fire Bee', 'Albino Cinnamon X Albino', 'Mojave X Bumble Bee'],
 'Clutch003': ['Black Pastel X Banana Ghost Lesser']}

It's easy enough to join each of these lists into a single string after loading the file:

d = {key: ', '.join(value) for key, value in d.items()}

You can also do the joining while reading the file; I would use a generator function to process the file in groups:

def per_clutch(inputfile):
    clutch = None
    lines = []
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            if lines:
                yield clutch, lines
            clutch, lines = line, []
        else:
            lines.append(line)
    if clutch and lines:
        yield clutch, lines

then just collect all the groups into a dictionary:

with open(filename) as inputfile:
    d = {clutch: ', '.join(lines) for clutch, lines in per_clutch(inputfile)}

Demonstration of the latter:

>>> def per_clutch(inputfile):
...     clutch = None
...     lines = []
...     for line in inputfile:
...         line = line.strip()
...         if line.startswith('Clutch'):
...             if lines:
...                 yield clutch, lines
...             clutch, lines = line, []
...         else:
...             lines.append(line)
...     if clutch and lines:
...         yield clutch, lines
... 
>>> sample = '''\
... Clutch001
... Albino X Pastel
... Bumble Bee X Albino Lesser
... Clutch002
... Bee X Fire Bee
... Albino Cinnamon X Albino
... Mojave X Bumble Bee
... Clutch003
... Black Pastel X Banana Ghost Lesser
... '''.splitlines(True)
>>> {clutch: ', '.join(lines) for clutch, lines in per_clutch(sample)}
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
>>> from pprint import pprint
>>> pprint(_)
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}

+3

Martijn pieters Dec 28. '14 at 2:16


As noted in the comments, if "Clutch" (or whatever keyword you use) can be relied upon not to appear in the non-keyword lines, you can use the following:

keyword = "Clutch"
with open(filename) as inputfile:
    t = inputfile.read()
    # skip the empty piece that split() returns before the first keyword
    d = {keyword + s[:3]: s[3:].strip().replace('\n', ', ') for s in t.split(keyword) if s.strip()}

This reads the entire file into memory at once, so it should be avoided if your file might get very large.
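As a rough sketch of the same split-on-keyword idea, here it is applied to an in-memory string. The helper name split_on_keyword and the sample text are made up for illustration. It assumes the keyword never appears inside a value line, and it takes the ID from the rest of the header line instead of a fixed three characters:

```python
def split_on_keyword(text, keyword="Clutch"):
    """Build {keyword + id: 'line, line, ...'} by splitting text on keyword."""
    result = {}
    for chunk in text.split(keyword):
        if not chunk.strip():
            continue  # skip the empty piece before the first keyword
        # The first line of each chunk is the id (e.g. '001'); the rest are values.
        ident, _, body = chunk.partition('\n')
        values = [line.strip() for line in body.splitlines() if line.strip()]
        result[keyword + ident.strip()] = ', '.join(values)
    return result


sample = (
    "Clutch001\n"
    "Albino X Pastel\n"
    "Bumble Bee X Albino Lesser\n"
    "Clutch002\n"
    "Bee X Fire Bee\n"
)
d = split_on_keyword(sample)
print(d)
```

Like the one-liner above, this still holds the whole text in memory.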

+2

Stuart Dec 28. 14 at 4:03


You can use re.split() to split the file into its "Clutch" parts:

import re

tokens = iter(re.split(r'(^Clutch\d{3}\s*$)\s+', file.read(), flags=re.M))
next(tokens) # skip until the first Clutch
print({k: ', '.join(v.splitlines()) for k, v in zip(tokens, tokens)})

Output

{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
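The key point is that re.split() keeps the text matched by a capturing group in its result, so headers and bodies alternate in the returned list. A minimal sketch on an in-memory sample (the sample string and the slightly simplified pattern are assumptions for illustration, not the answer's exact code):

```python
import re

sample = """Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
"""

# Because the pattern contains a capturing group, re.split() returns the
# matched header names interleaved with the text found between them.
parts = re.split(r'^(Clutch\d{3})[ \t]*\n', sample, flags=re.M)

# parts[0] is whatever precedes the first header (empty here), so headers
# sit at the odd indexes and their bodies at the even indexes after that.
headers, bodies = parts[1::2], parts[2::2]
d = {h: ', '.join(b.splitlines()) for h, b in zip(headers, bodies)}
print(d)
```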

+2

jfs Dec 28. 14 at 5:51


Let the file file.txt contain:

Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser

To get a dictionary, try the following:

import re

with open('file.txt', 'r') as f:
    result = re.split(
        r'(Clutch\d{3}).*?',
        f.read(),
        flags=re.DOTALL # including '\n'
    )[1:] # result is ['Clutch001', '\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', 'Clutch002', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', 'Clutch003', '\nBlack Pastel X Banana Ghost Lesser\n']

    keys = result[::2] # keys is ['Clutch001', 'Clutch002', 'Clutch003']
    values = result[1::2] # values is ['\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', '\nBlack Pastel X Banana Ghost Lesser\n']

    values = map(
        lambda value: value.strip().replace('\n', ', '),
        values
    ) # values is ['Albino X Pastel, Bumble Bee X Albino Lesser', 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Black Pastel X Banana Ghost Lesser']

    d = dict(zip(keys, values)) # d is {'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}

+2

Fomalhaut Dec 28. '14 at 7:26


Here's a version that works, more or less. I'm not sure how Pythonic this is (maybe it can be compressed, and it could definitely be improved):

import re
import fileinput

d = dict()
key = ''
rx = re.compile(r'^Clutch\d\d\d$')  # raw string so \d is not treated as a string escape

for line in fileinput.input():
    line = line[0:-1]
    if rx.match(line):
        key = line
        d[key] = ''
    else:
        d[key] += line

print d

for key in d:
    print key, d[key]

Output (which repeats information):

{'Clutch001': 'Albino X PastelBumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
Clutch001 Albino X PastelBumble Bee X Albino Lesser
Clutch002 Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee
Clutch003 Black Pastel X Banana Ghost Lesser

If for some reason the first line is not a Clutch line, you get an error (a KeyError) because the key is still empty.
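One way to avoid that error, sketched under the assumption that lines before the first Clutch header can simply be ignored (the function name parse_clutches and the sample input are hypothetical):

```python
def parse_clutches(lines):
    """Group lines under the most recent 'Clutch...' header line."""
    groups = {}
    key = None  # no header seen yet
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('Clutch'):
            key = line
            groups[key] = []
        elif key is not None:
            groups[key].append(line)
        # Lines seen before the first header are skipped rather than
        # raising KeyError.
    return {k: ', '.join(v) for k, v in groups.items()}


result = parse_clutches([
    "stray line\n",
    "Clutch001\n",
    "Albino X Pastel\n",
    "Bumble Bee X Albino Lesser",  # a last line without a newline still works
])
print(result)
```

Whether to skip such lines or raise a clearer error depends on how much you trust the input file.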

With comma joining, handling of broken text files (no newline at the end), etc.:

import fileinput

d = {}

for line in fileinput.input():
    line = line.rstrip('\r\n') # line.strip() for leading and trailing space
    if line.startswith('Clutch'):
        key = line
        d[key] = ''
        pad = ''
    else:
        d[key] += pad + line
        pad = ', '

print d

for key in d:
    print "'%s': '%s'" % (key, d[key])

The "pad" technique is something I like in other contexts and works great here. I'm pretty sure this won't be considered pythonic.

Revised output sample:

{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser'
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee'
'Clutch003': 'Black Pastel X Banana Ghost Lesser'

+1

Jonathan Leffler Dec 28. '14 at 2:27


Assuming the word Clutch always occurs on a line of its own, the following would work:

import re
d = {}
with open(filename) as f:
    for line in f:
        if re.match("^Clutch[0-9]+", line):
            match = line.strip()   # match is the key searched for, newline removed
            d[match] = ''
        else:
            line = line.replace('\n', ' ')   # newlines are replaced
            d[match] += line  # all lines without the word 'Clutch'
                              # are added to the matched key

+1

Bolboa Dec 28. 14 at 6:43


jamylak · Accepted Answer · 2014-12-28T03:04:42+0000

import re
from itertools import groupby
from functools import partial
from pprint import pprint

key = partial(re.match, r'Clutch\d\d\d')

with open('foo.txt') as f:
    groups = (', '.join(map(str.strip, g)) for k, g in groupby(f, key=key))
    pprint(dict(zip(*[iter(groups)]*2)))

{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
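The pairing step relies on passing the same iterator to zip() twice, so each tuple consumes two consecutive groups: a header and the run of lines after it. A small sketch of the idiom in isolation, using an in-memory list and a plain boolean key instead of re.match (both are assumptions for illustration; with a boolean key, two adjacent Clutch lines would be merged into one group):

```python
from itertools import groupby

lines = [
    "Clutch001\n",
    "Albino X Pastel\n",
    "Bumble Bee X Albino Lesser\n",
    "Clutch002\n",
    "Bee X Fire Bee\n",
]

# groupby yields alternating runs: a header run, then the run of lines
# under it. Each run is joined into one string as soon as it is produced.
runs = (', '.join(s.strip() for s in group)
        for _, group in groupby(lines, key=lambda s: s.startswith('Clutch')))

# zip(it, it) with the *same* iterator takes two items per tuple, pairing
# each header with the run that follows it.
it = iter(runs)
d = dict(zip(it, it))
print(d)
```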
