Optimizing a loop that extracts substrings from a field based on another field

I am dealing with a large dataset (4 million rows by 4 columns). The first column is a name identifier, and many rows share the same name. The second column is a position that starts at -6 and keeps counting up until a new identifier is encountered, at which point it starts again from -6. The third column is a random number, not important here. The fourth column is a long string of digits, like a long barcode. The data looks something like this:

YKLOI    -6    01    123456789012345678901234
YKLOI    -5    25    123456789012345678901234
YKLOI    -4    05    123456789012345678901234
YKLOI    -3    75    123456789012345678901234
YKLOI    -2    83    123456789012345678901234
YKLOI    -1    05    123456789012345678901234
YKLOI     0    34    123456789012345678901234
YKLOI     1    28    123456789012345678901234
YKLJW    -6    87    569845874254658425485
YKLJW    -5    87    569845874254658425485
...

And I would like to do this:

YKLOI    -6    01    123   #puts 1st triplet in position -6
YKLOI    -5    25    456   #puts 2nd triplet in position -5
YKLOI    -4    05    789   #puts 3rd triplet in position -4
YKLOI    -3    75    012   #puts 4th triplet in position -3
YKLOI    -2    83    345                ...
YKLOI    -1    05    678
YKLOI     0    34    901
YKLOI     1    28    234   #puts last triplet in the last position
YKLJW    -6    87    569   #puts 1st triplet in position -6
YKLJW    -5    87    845   #puts 2nd triplet in position -5
...

The length of the string in the fourth column varies a lot, but the numbers in the second column are always in sequence.
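
In other words, for a row whose position value is p, the triplet I want is the one at index p + 6 in the list of triplets. A small illustration with the first barcode above:

# the position value p selects triplet index p + 6
barcode = '123456789012345678901234'
triplets = [barcode[i:i + 3] for i in range(0, len(barcode), 3)]
# triplets == ['123', '456', '789', '012', '345', '678', '901', '234']
print(triplets[-4 + 6])   # '789', the triplet for position -4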

Below is the code I have. It does the job, but it takes forever: at the moment it has been running for almost 18 hours and has barely passed 1 million rows.

I tried several alternatives, such as only building the triplet list when the name in the first column changes between consecutive rows, but that just adds another instruction and makes the code even slower.

Does anyone have any suggestions on how to improve the performance of this task?

import pandas as pd

# import the data; there is no header row, and the barcode column has to be
# read as a string so it can be sliced
d = pd.read_csv('INPUT_FILE', sep='\t', header=None, dtype={3: str})

# acknowledge that the data was imported
print("Import Okay")

# set the output path
output = 'OUTPUT_FILE'

# loop from the first row to the last
for z in range(len(d)):

    # take the fourth column, split its content every 3 characters and
    # build a list of these triplets
    mop = list(map(''.join, zip(*[iter(d.iloc[z, 3])] * 3)))

    # replace the content of the fourth column in row z with the triplet
    # at index (position value + 6)
    d.iloc[z, 3] = mop[int(d.iloc[z, 1]) + 6]

    # append the new row z to the output file
    d.loc[[z]].to_csv(output, sep='\t', header=False, index=False, mode='a')

# acknowledge that the code is done
print("Done")


1 answer


Two simple changes to get you started. First, don't write the output file row by row as you go; it adds a lot of unnecessary overhead and is by far your biggest problem.

Second, you are going through a lot of steps to pull each triplet out. Something like this would be more efficient, and .apply removes some of the overhead:



def triplet(row):
    # the triplet for position p starts at character (p + 6) * 3
    loc = (row[1] + 6) * 3
    return row[3][loc:loc + 3]

d[3] = d.apply(triplet, axis=1)

# save the whole file once
d.to_csv(output, sep='\t', header=False, index=False)
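
If that is still too slow, a plain list comprehension over the two columns should be faster again, since .apply(axis=1) builds a Series object for every row. A rough sketch, assuming the columns are labeled 0-3 as above:

# compute all the start offsets at once, then slice each string directly
starts = (d[1] + 6) * 3
d[3] = [s[i:i + 3] for s, i in zip(d[3], starts)]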

