Optimize Python loop

The following loop is a major bottleneck in my program, particularly since records can contain more than 500k items.

records = [item for sublist in records for item in sublist]  # flatten the list
for rec in records:
    if len(rec) > 5:
        tag = '%s.%s' % (rec[4], rec[5].strip())
        if tag in mydict:
            mydict[tag][0] += 1
            mydict[tag][1].add(rec[6].strip())
        else:
            # {x} is a one-element set; set(x) would split the string into characters
            mydict[tag] = [1, {rec[6].strip()}]

      

I don't see a way to do this with a dictionary or list comprehension, and I'm not sure whether using map would help much. Is there a way to optimize this loop?

Edit: The dictionary holds information about some of the operations taking place in the program. rec[4] is the package that contains the operation, and rec[5] is the name of the operation. The raw logs contain an int instead of the actual name, so when the log files are read into the list, the int is looked up and replaced with the name of the operation. The counter tracks how many times each operation has been performed, and the set holds the parameters of the operation. I am using a set because I do not want duplicate parameters. The strip simply removes whitespace. The presence of this whitespace is unpredictable in rec[6], but consistent in rec[4] and rec[5].

+3




2 answers


Instead of flattening such a huge list, you can iterate directly over a flattened iterator with itertools.chain.from_iterable.

from itertools import chain

for rec in chain.from_iterable(records):
    pass  # rest of the code goes here
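
For illustration, here is a minimal sketch of the whole loop from the question running on top of chain.from_iterable. The helper name count_operations and the sample records are made up for the example. It also assumes that the parameter should be stored as a whole string and that a record needs at least seven fields for rec[6] to exist, which is slightly stricter than the check in the question.

```python
from itertools import chain


def count_operations(records):
    """Build {tag: [count, set_of_params]} from a list of lists of records."""
    mydict = {}
    # chain.from_iterable walks the sublists lazily; no flattened copy is built.
    for rec in chain.from_iterable(records):
        if len(rec) > 6:  # assumption: rec[6] must exist, so require 7 fields
            tag = '%s.%s' % (rec[4], rec[5].strip())
            param = rec[6].strip()
            if tag in mydict:
                mydict[tag][0] += 1
                mydict[tag][1].add(param)
            else:
                mydict[tag] = [1, {param}]
    return mydict


# Hypothetical sample data: fields 0-3 are not used by the loop.
sample = [
    [['', '', '', '', 'pkg', 'open ', ' a.txt']],
    [['', '', '', '', 'pkg', 'open', 'b.txt'],
     ['', '', '', '', 'pkg', 'close', 'a.txt'],
     ['too', 'short']],
]
result = count_operations(sample)
```

The only change to the loop itself is where the records come from; the counting logic stays the same.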

      



This is about 3x faster than the equivalent generator expression with nested for-loops:

In [13]: records = [[None]*500]*10000

In [14]: %%timeit
    ...: for rec in chain.from_iterable(records): pass
    ...: 
10 loops, best of 3: 54.7 ms per loop

In [15]: %%timeit
    ...: for rec in (item for sublist in records for item in sublist): pass
    ...: 
10 loops, best of 3: 170 ms per loop

In [16]: %%timeit #Your version
    ...: for rec in [item for sublist in records for item in sublist]: pass
    ...: 
1 loops, best of 3: 249 ms per loop
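
If you want to repeat the comparison outside IPython, something along these lines should work with the standard timeit module. The function names are made up for the example, and the absolute numbers will depend on your machine and Python version, so treat the figures above as indicative only.

```python
import timeit
from itertools import chain

# Same shape as the IPython session: 10,000 sublists of 500 items each.
records = [[None] * 500] * 10000


def with_chain():
    for rec in chain.from_iterable(records):
        pass


def with_genexp():
    for rec in (item for sublist in records for item in sublist):
        pass


def with_list():
    for rec in [item for sublist in records for item in sublist]:
        pass


timings = {}
for func in (with_chain, with_genexp, with_list):
    # Best of 3 runs, one full pass over the data per run.
    timings[func.__name__] = min(timeit.repeat(func, number=1, repeat=3))
    print('%-12s %.1f ms' % (func.__name__, timings[func.__name__] * 1000))
```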

      

+6




I don't know if it will make it faster or not, but instead of ...

if tag in mydict:
    mydict[tag][0] += 1
    mydict[tag][1].add(rec[6].strip())
else:
    mydict[tag] = [1, {rec[6].strip()}]

      



you may try...

element = mydict.setdefault(tag, [0, set()])
element[0] += 1
element[1].add(rec[6].strip())
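
One thing to keep in mind with setdefault is that its default argument, here [0, set()], is built on every call, even when the key already exists. A possible alternative, which is not what the snippet above uses, is collections.defaultdict, whose factory is only called for missing keys. Below is a rough sketch with made-up sample records; the len(rec) > 6 check is an assumption so that rec[6] is guaranteed to exist. Whether it is actually faster on your data is something you would have to measure.

```python
from collections import defaultdict
from itertools import chain

# The factory runs only when a tag is seen for the first time.
mydict = defaultdict(lambda: [0, set()])

# Hypothetical sample data in the same list-of-lists shape as the question.
records = [
    [['', '', '', '', 'pkg', 'open ', ' a.txt'],
     ['', '', '', '', 'pkg', 'open', 'a.txt']],
    [['', '', '', '', 'pkg', 'close', 'b.txt']],
]

for rec in chain.from_iterable(records):
    if len(rec) > 6:  # assumption: rec[6] must exist
        element = mydict['%s.%s' % (rec[4], rec[5].strip())]
        element[0] += 1
        element[1].add(rec[6].strip())
```

Note that merely looking up a missing key in a defaultdict creates the entry, so convert it with dict(mydict) if later code relies on KeyError or on membership tests after lookups.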

      

+3

