Improve Python code in terms of speed

I have a very large file (1.5 billion lines) in the following form:

1    67108547    67109226    gene1$transcript1    0    +    1    0
1    67108547    67109226    gene1$transcript1    0    +    2    1
1    67108547    67109226    gene1$transcript1    0    +    3    3
1    67108547    67109226    gene1$transcript1    0    +    4    4 

                                 .
                                 .
                                 .

1    33547109    33557650    gene2$transcript1    0    +    239    2
1    33547109    33557650    gene2$transcript1    0    +    240    0

                                 .
                                 .
                                 .

1    69109226    69109999    gene1$transcript1    0    +    351    1
1    69109226    69109999    gene1$transcript1    0    +    352    0 

      

I want to do a refactoring / sorting of this file based on the ID in column 4. The file is made up of blocks. If you concatenate columns 4, 1, 2 and 3, you get a unique ID for each block. This ID is the key of the dictionary all_exons, and the value is a numpy array containing all the column 8 values of that block. A second dictionary, unique_identifiers, has the gene part of column 4 (the part before the $) as its key and maps it to the set of matching block IDs. As output, I write the file as follows:

>gene1
0
1
3
4
1
0
>gene2
2
0
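
For illustration only, after parsing the first block of the example above the two dictionaries would contain something like this (values taken from the sample lines; the later gene1 block would be added as a second entry):

import numpy as np

all_exons = {
    "gene1$transcript1$1$67108547$67109226": np.array([0., 1., 3., 4.]),
}
unique_identifiers = {
    "gene1": {"gene1$transcript1$1$67108547$67109226"},
}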

      

I have already written code (see below) that does this, but my implementation is very slow. This will take about 18 hours.

import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np


def parse_blocks(bedtools_file):
    unique_identifiers = {} # Dictionary with key: gene, value: set of block IDs
    all_exons = {} # Dictionary containing all exons (column 8 values per block)

    # Parse file and ...
    with open(bedtools_file) as fp:
        sp_line = []
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]

            identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
            if(identifier in all_exons):
                item = float(sp_line[7])
                all_exons[identifier] = np.append(all_exons[identifier], item)
            else:
                all_exons[identifier] = np.array([sp_line[7]], float)

            if(current_id in unique_identifiers):
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] = set([identifier])
    return unique_identifiers, all_exons

identifiers, introns = parse_blocks(options.bed)

w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">"+str(gene)+"\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base)+"\n")
w.close()

      

How can I change the above code so that it runs faster?


2 answers


You are already importing pandas, so here is a pandas solution that basically only needs two lines of code. However, I don't know how it behaves on very large datasets or whether it is faster than your approach (but I'm fairly confident it is).

In the example below, the data you provided is stored in table.txt. I then use groupby to collect all the values of your eighth column in a list for the corresponding ID in column 4 (note that my indices start at 0), and convert that data structure into a dictionary, which can then easily be printed.

import pandas as pd

df = pd.read_csv("table.txt", header=None, sep=r"\s+")  # replace the separator by e.g. '\t'

op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))

      

So in this case op looks like this:

{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}

      

Now you can print the output like this and pipe it into a file:

for k, v in op.iteritems():
    print k.split('$')[0]
    for val in v:
        print val

      

This gives the desired output:

gene1
0
1
3
4
1
0
gene2
2
0
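
If you want to write the result directly to a file instead of piping, and keep the '>' prefix from your example, a small sketch would be (the output filename here is just a placeholder):

# write each gene header with a '>' prefix, then its values, one per line
with open("sorted_output.txt", "w") as out:
    for k, v in op.iteritems():
        out.write(">" + k.split("$")[0] + "\n")
        for val in v:
            out.write(str(val) + "\n")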

      

Maybe you can give it a try and let me know how it compares to your solution!
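
One caveat: with 1.5 billion lines, reading the whole file into a single DataFrame may not fit into memory. A chunked variant of the same idea (only a sketch, assuming a tab-separated file and that the aggregated dictionary itself still fits in memory) could look like this:

import pandas as pd
from collections import defaultdict

# read only columns 4 and 8 (0-based: 3 and 7), a few million rows at a time,
# and extend the per-ID lists chunk by chunk
op = defaultdict(list)
for chunk in pd.read_csv("table.txt", header=None, sep="\t",
                         usecols=[3, 7], chunksize=5000000):
    for key, values in chunk.groupby(3)[7]:
        op[key].extend(values.tolist())

After the loop, op can be printed exactly as above.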

Edit2:

You mentioned in the comments that you want to print the genes in the correct order. You can do it like this:



# add some fake genes to op from above
op['gene0$stuff'] = [7, 9]
op['gene4$stuff'] = [5, 9]

# print using 'sorted'
for k, v in sorted(op.iteritems()):
    print k.split('$')[0]
    for val in v:
        print val

      

which gives you:

gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9

      

EDIT1:

I'm not sure whether the duplicates are intended, but you can easily get rid of them like this:

op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))

      

Now op2 looks like this:

{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}

      

You print the output as before:

for k, v in op2.iteritems():
    print k.split('$')[0]
    for val in v:
        print val

      

which gives you:

gene1
0
1
3
4
gene2
0
2

      



I'll try to simplify your problem; my solution is this:

  • Scan the big file once. For each distinct current_id, open a temporary file and append the column 8 value to that file.
  • After the full scan, concatenate all the chunks into the resulting file.

Here's the code:



# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess


class ChunkBoss(object):
    """Boss for file chunks"""

    def __init__(self):
        self.opened_files = {}

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            self.opened_files[current_id] = open(tempfile.mktemp(), 'wb')
            self.opened_files[current_id].write('>%s\n' % current_id)

        self.opened_files[current_id].write('%s\n' % value)

    def cat_result(self, filename):
        """Catenate chunks to one big file
        """
        # Sort the chunks
        chunk_file_list = []
        for current_id in sorted(self.opened_files.keys()):
            chunk_file_list.append(self.opened_files[current_id].name)

        # Flush chunks
        [chunk.flush() for chunk in self.opened_files.values()]

        # By calling cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat', ] + chunk_file_list, stdout=fp, stderr=fp)

    def clean_up(self):
        [os.unlink(chunk.name) for chunk in self.opened_files.values()]


def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]

            # Write value to temp chunk
            boss.write_chunk(current_id, value)

    boss.cat_result('result.txt')
    boss.clean_up()

if __name__ == '__main__':
    main()

      

I have tested the performance of my script with a bigfile.data of around 150k lines. It took about 0.5 s on my laptop. Maybe you can give it a try.
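
If cat is not available (for example on Windows), the concatenation step can also be done in pure Python. A sketch of a drop-in replacement for the cat_result method above, using shutil.copyfileobj so the chunk files are streamed rather than loaded into memory:

import shutil

# replace ChunkBoss.cat_result with this method
def cat_result(self, filename):
    """Concatenate chunks into one big file without calling `cat`."""
    # flush pending writes so the chunk files are complete on disk
    for chunk in self.opened_files.values():
        chunk.flush()
    # stream the chunks into the result file in sorted ID order
    with open(filename, 'wb') as out:
        for current_id in sorted(self.opened_files):
            with open(self.opened_files[current_id].name, 'rb') as src:
                shutil.copyfileobj(src, out)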
