Get unique rows based on ONLY 2 columns

I have large (50k lines) files formatted this way

chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    35720   36081   NR_026820_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028325_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028327_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028325_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028327_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143

      

The second and third columns are the start and end positions. I want to delete all lines that share the same start and end positions (the rest of the line doesn't matter) and keep only the first occurrence. For example, in the sample data I would like to keep rows 14 and 15: although their start positions are the same, their end positions are not. Rows 15 and 16 have the same start and end, so I want to remove row 16. I'm trying to do this in Python, but I don't understand how to handle the requirement that two columns together be unique.

Any ideas on the best / easiest way to do this?

+3




6 answers


Consider using the excellent pandas library to load and manipulate data like this:

data_string = """
chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    35720   36081   NR_026820_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028325_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    323891  324060  NR_028327_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028325_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324287  324345  NR_028327_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
chr1    324438  328581  NR_028325_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
"""

# this looks suspiciously CSV-like

import io
import pandas

buf = io.StringIO(data_string)

# this will create a DataFrame object with header: 0, 1, 2, ...
# if you have the file path, you can use that instead of the StringIO buffer
df = pandas.read_csv(buf, delim_whitespace=True, header=None)

>>> print(df)

      0       1       2                                   3   4  5         6   \
0   chr1   35276   35481     NR_026820_exon_1_0_chr1_35277_r   0  -  0.526829   
1   chr1   35720   36081     NR_026818_exon_2_0_chr1_35721_r   0  -  0.398892   
2   chr1   35720   36081     NR_026820_exon_2_0_chr1_35721_r   0  -  0.398892   
3   chr1   69090   70008  NM_001005484_exon_0_0_chr1_69091_f   0  +  0.571895   
4   chr1  134772  139696    NR_039983_exon_0_0_chr1_134773_r   0  -  0.366775   
5   chr1  139789  139847    NR_039983_exon_1_0_chr1_139790_r   0  -  0.551724   
6   chr1  140074  140566    NR_039983_exon_2_0_chr1_140075_r   0  -  0.475610   
7   chr1  323891  324060    NR_028322_exon_0_0_chr1_323892_f   0  +  0.426035   
8   chr1  323891  324060    NR_028325_exon_0_0_chr1_323892_f   0  +  0.426035   
9   chr1  323891  324060    NR_028327_exon_0_0_chr1_323892_f   0  +  0.426035   
10  chr1  324287  324345    NR_028322_exon_1_0_chr1_324288_f   0  +  0.551724   
11  chr1  324287  324345    NR_028325_exon_1_0_chr1_324288_f   0  +  0.551724   
12  chr1  324287  324345    NR_028327_exon_1_0_chr1_324288_f   0  +  0.551724   
13  chr1  324438  326938    NR_028327_exon_2_0_chr1_324439_f   0  +  0.375200   
14  chr1  324438  328581    NR_028322_exon_2_0_chr1_324439_f   0  +  0.378228   
15  chr1  324438  328581    NR_028325_exon_2_0_chr1_324439_f   0  +  0.378228  

# ... more data skipped...

      



Now it's very simple:

# drop rows that duplicate the values in columns 1 and 2 (start, end)
no_dups = df.drop_duplicates(subset=[1, 2])

>>> print(no_dups)
          0       1       2                                   3   4  5         6   \
0   chr1   35276   35481     NR_026820_exon_1_0_chr1_35277_r   0  -  0.526829   
1   chr1   35720   36081     NR_026818_exon_2_0_chr1_35721_r   0  -  0.398892   
3   chr1   69090   70008  NM_001005484_exon_0_0_chr1_69091_f   0  +  0.571895   
4   chr1  134772  139696    NR_039983_exon_0_0_chr1_134773_r   0  -  0.366775   
5   chr1  139789  139847    NR_039983_exon_1_0_chr1_139790_r   0  -  0.551724   
6   chr1  140074  140566    NR_039983_exon_2_0_chr1_140075_r   0  -  0.475610   
7   chr1  323891  324060    NR_028322_exon_0_0_chr1_323892_f   0  +  0.426035   
10  chr1  324287  324345    NR_028322_exon_1_0_chr1_324288_f   0  +  0.551724   
13  chr1  324438  326938    NR_028327_exon_2_0_chr1_324439_f   0  +  0.375200   
14  chr1  324438  328581    NR_028322_exon_2_0_chr1_324439_f   0  +  0.378228 
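For the real 50k-line file, the same two calls do the whole job, and `to_csv` writes the result back out. A minimal sketch, assuming the data lives in a tab-delimited file named `exons.tsv` (a hypothetical name; a tiny stand-in file is written first so the snippet runs on its own):

```python
import pandas as pd

# write a tiny stand-in for the real file so this sketch is self-contained
with open("exons.tsv", "w") as f:
    f.write("chr1\t35720\t36081\tNR_026818\n"
            "chr1\t35720\t36081\tNR_026820\n"
            "chr1\t69090\t70008\tNM_001005484\n")

df = pd.read_csv("exons.tsv", sep="\t", header=None)
no_dups = df.drop_duplicates(subset=[1, 2])   # keep the first row per (start, end)
no_dups.to_csv("exons_unique.tsv", sep="\t", header=False, index=False)
```

`drop_duplicates` keeps the first occurrence by default, which matches the requirement of keeping the first line for each (start, end) pair.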

      

+1




You can use pandas to load your file and then drop rows that duplicate the values in two columns of your choice (simple example):



import pandas as pd

raw_data = {'firstcolumn': ['chr1', 'chr1', 'chr1'],
            'start_position': [35276, 35720, 35720],
            'end_position': [35481, 36081, 36081],
            'A': [4, 4, 31],
            'B': [25, 25, 57]}
df = pd.DataFrame(raw_data, columns = ['firstcolumn', 'start_position','end_position', 'A', 'B'])

df.drop_duplicates(['start_position','end_position']) #drop duplicate rows based on these 2 columns
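Note that `drop_duplicates` returns a new DataFrame rather than modifying `df` in place, so assign the result if you want to keep it. A small sketch, using the same (hypothetical) column names as above:

```python
import pandas as pd

df = pd.DataFrame({'start_position': [35276, 35720, 35720],
                   'end_position':   [35481, 36081, 36081],
                   'name':           ['a', 'b', 'c']})

deduped = df.drop_duplicates(['start_position', 'end_position'])
print(len(df), len(deduped))  # the original still has 3 rows; the copy has 2
```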

      

+2




You can use something like this:

with open('out_lines.dat', 'w') as out_file:
    with open('in_lines.dat', 'r') as in_file:
        seen_start_end = set()  # a set gives O(1) membership tests (a list would be O(n))
        for line in in_file:
            line_data = line.split()
            if len(line_data) == 15:
                start = line_data[1]
                end = line_data[2]
                if (start, end) not in seen_start_end:
                    out_file.write(line)
                    seen_start_end.add((start, end))

      

And the output for your input will be:

chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143

      

0




As someone suggested, store the key fields (start and end) in a set, and skip printing a line if its key has already been seen:

s = set()
with open('datafile.tsv', 'r') as f:
    for line in f:
        fields = line.split('\t')
        key = tuple(fields[1:3])
        if key in s:
            continue
        s.add(key)
        print(line, end='')  # the line already ends with a newline

You can redirect the script's output to a new file.

0




def unique_positions(filename):
    with open(filename) as lines:
        seen_positions = set()
        for line in lines:
            position = tuple(line.split()[1:3])
            if position not in seen_positions:
                seen_positions.add(position)
                yield line

for line in unique_positions('data.csv'):
    print(line, end='')

      

0




If you just want to remove duplicates and write them to a file, you can use groupby, grouping consecutive lines by the two columns and calling next to take only the first row of each group, whether it has one line or several. This also keeps very little in memory:

from itertools import groupby

with open("in.csv") as f, open("out.csv", "w") as out:
    for _, v in groupby(f, key=lambda x: x.split()[1:3]):
        out.write(next(v))

      

Output:

chr1    35276   35481   NR_026820_exon_1_0_chr1_35277_r 0       -       0.526829        0.473171        54      37      60      54      0       0       205
chr1    35720   36081   NR_026818_exon_2_0_chr1_35721_r 0       -       0.398892        0.601108        73      116     101     71      0       0       361
chr1    69090   70008   NM_001005484_exon_0_0_chr1_69091_f      0       +       0.571895        0.428105        212     218     175     313     0       0       918
chr1    134772  139696  NR_039983_exon_0_0_chr1_134773_r        0       -       0.366775        0.633225        997     1194    1924    809     0       0       4924
chr1    139789  139847  NR_039983_exon_1_0_chr1_139790_r        0       -       0.551724        0.448276        13      12      14      19      0       0       58
chr1    140074  140566  NR_039983_exon_2_0_chr1_140075_r        0       -       0.475610        0.524390        126     144     114     108     0       0       492
chr1    323891  324060  NR_028322_exon_0_0_chr1_323892_f        0       +       0.426035        0.573964        37      41      56      35      0       0       169
chr1    324287  324345  NR_028322_exon_1_0_chr1_324288_f        0       +       0.551724        0.448276        19      15      11      13      0       0       58
chr1    324438  326938  NR_028327_exon_2_0_chr1_324439_f        0       +       0.375200        0.624800        400     1013    549     538     0       0       2500
chr1    324438  328581  NR_028322_exon_2_0_chr1_324439_f        0       +       0.378228        0.621772        678     1580    996     889     0       0       4143
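One caveat: itertools.groupby only merges consecutive lines with equal keys, so this one-pass approach relies on duplicate (start, end) pairs being adjacent, as they are in the sorted sample above. A small illustration of what happens when they are not:

```python
from itertools import groupby

lines = ["chr1 100 200 a\n",
         "chr1 100 200 b\n",   # adjacent duplicate: merged into the group above
         "chr1 300 400 c\n",
         "chr1 100 200 d\n"]   # same key again, but no longer adjacent

firsts = [next(group) for _, group in groupby(lines, key=lambda x: x.split()[1:3])]
print(len(firsts))  # 3 groups, not 2 -- the non-adjacent duplicate survives
```

If the file may be unsorted, one of the set-based answers above is the safer choice.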

      

If you want to change the original file in place, write to a NamedTemporaryFile and then replace the original with shutil.move:

from itertools import groupby
from shutil import move
from tempfile import NamedTemporaryFile

# open the temp file in text mode ("w"); the default is binary
with open("in.csv") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for _, v in groupby(f, key=lambda x: x.split()[1:3]):
        out.write(next(v))

move(out.name, "in.csv")  # replace the original file

      

0








