Get unique rows based on ONLY 2 columns
I have large (50k-line) files formatted like this:
chr1 35276 35481 NR_026820_exon_1_0_chr1_35277_r 0 - 0.526829 0.473171 54 37 60 54 0 0 205
chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r 0 - 0.398892 0.601108 73 116 101 71 0 0 361
chr1 35720 36081 NR_026820_exon_2_0_chr1_35721_r 0 - 0.398892 0.601108 73 116 101 71 0 0 361
chr1 69090 70008 NM_001005484_exon_0_0_chr1_69091_f 0 + 0.571895 0.428105 212 218 175 313 0 0 918
chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 0.366775 0.633225 997 1194 1924 809 0 0 4924
chr1 139789 139847 NR_039983_exon_1_0_chr1_139790_r 0 - 0.551724 0.448276 13 12 14 19 0 0 58
chr1 140074 140566 NR_039983_exon_2_0_chr1_140075_r 0 - 0.475610 0.524390 126 144 114 108 0 0 492
chr1 323891 324060 NR_028322_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 323891 324060 NR_028325_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 323891 324060 NR_028327_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 324287 324345 NR_028322_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324287 324345 NR_028325_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324287 324345 NR_028327_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324438 326938 NR_028327_exon_2_0_chr1_324439_f 0 + 0.375200 0.624800 400 1013 549 538 0 0 2500
chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 0.378228 0.621772 678 1580 996 889 0 0 4143
chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 0.378228 0.621772 678 1580 996 889 0 0 4143
The second and third columns are the start and end positions. I want to delete every line whose start and end positions match an earlier line (the rest of the line doesn't matter) and keep only the first occurrence. For example, I would like to keep both rows 16 and 17 of the sample data because, although the start position is the same, the end position is not; rows 17 and 18 have the same start and end, so I want to remove row 18. I'm trying to do this in Python, but I don't understand how to handle the requirement that two columns together be unique.
Any ideas on the best / easiest way to do this?
Consider using the excellent pandas library to load and manipulate data like this:
data_string = """
chr1 35276 35481 NR_026820_exon_1_0_chr1_35277_r 0 - 0.526829 0.473171 54 37 60 54 0 0 205
chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r 0 - 0.398892 0.601108 73 116 101 71 0 0 361
chr1 35720 36081 NR_026820_exon_2_0_chr1_35721_r 0 - 0.398892 0.601108 73 116 101 71 0 0 361
chr1 69090 70008 NM_001005484_exon_0_0_chr1_69091_f 0 + 0.571895 0.428105 212 218 175 313 0 0 918
chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 0.366775 0.633225 997 1194 1924 809 0 0 4924
chr1 139789 139847 NR_039983_exon_1_0_chr1_139790_r 0 - 0.551724 0.448276 13 12 14 19 0 0 58
chr1 140074 140566 NR_039983_exon_2_0_chr1_140075_r 0 - 0.475610 0.524390 126 144 114 108 0 0 492
chr1 323891 324060 NR_028322_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 323891 324060 NR_028325_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 323891 324060 NR_028327_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 324287 324345 NR_028322_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324287 324345 NR_028325_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324287 324345 NR_028327_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324438 326938 NR_028327_exon_2_0_chr1_324439_f 0 + 0.375200 0.624800 400 1013 549 538 0 0 2500
chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 0.378228 0.621772 678 1580 996 889 0 0 4143
chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 0.378228 0.621772 678 1580 996 889 0 0 4143
"""
# this looks suspiciously csv-like
import pandas
from io import StringIO

buf = StringIO(data_string)
# this creates a DataFrame with a default integer header: 0, 1, 2, ...
# if you have the file path, you can pass it instead of the StringIO buffer
df = pandas.read_csv(buf, sep=r"\s+", header=None)
>>> print(df)
0 1 2 3 4 5 6 \
0 chr1 35276 35481 NR_026820_exon_1_0_chr1_35277_r 0 - 0.526829
1 chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r 0 - 0.398892
2 chr1 35720 36081 NR_026820_exon_2_0_chr1_35721_r 0 - 0.398892
3 chr1 69090 70008 NM_001005484_exon_0_0_chr1_69091_f 0 + 0.571895
4 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 0.366775
5 chr1 139789 139847 NR_039983_exon_1_0_chr1_139790_r 0 - 0.551724
6 chr1 140074 140566 NR_039983_exon_2_0_chr1_140075_r 0 - 0.475610
7 chr1 323891 324060 NR_028322_exon_0_0_chr1_323892_f 0 + 0.426035
8 chr1 323891 324060 NR_028325_exon_0_0_chr1_323892_f 0 + 0.426035
9 chr1 323891 324060 NR_028327_exon_0_0_chr1_323892_f 0 + 0.426035
10 chr1 324287 324345 NR_028322_exon_1_0_chr1_324288_f 0 + 0.551724
11 chr1 324287 324345 NR_028325_exon_1_0_chr1_324288_f 0 + 0.551724
12 chr1 324287 324345 NR_028327_exon_1_0_chr1_324288_f 0 + 0.551724
13 chr1 324438 326938 NR_028327_exon_2_0_chr1_324439_f 0 + 0.375200
14 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 0.378228
15 chr1 324438 328581 NR_028325_exon_2_0_chr1_324439_f 0 + 0.378228
# ... more data skipped...
Now it's very simple:
# drop rows whose pair of values in columns 1 and 2 (start, end) was already seen
no_dups = df.drop_duplicates(subset=[1, 2])
>>> print(no_dups)
0 1 2 3 4 5 6 \
0 chr1 35276 35481 NR_026820_exon_1_0_chr1_35277_r 0 - 0.526829
1 chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r 0 - 0.398892
3 chr1 69090 70008 NM_001005484_exon_0_0_chr1_69091_f 0 + 0.571895
4 chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 0.366775
5 chr1 139789 139847 NR_039983_exon_1_0_chr1_139790_r 0 - 0.551724
6 chr1 140074 140566 NR_039983_exon_2_0_chr1_140075_r 0 - 0.475610
7 chr1 323891 324060 NR_028322_exon_0_0_chr1_323892_f 0 + 0.426035
10 chr1 324287 324345 NR_028322_exon_1_0_chr1_324288_f 0 + 0.551724
13 chr1 324438 326938 NR_028327_exon_2_0_chr1_324439_f 0 + 0.375200
14 chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 0.378228
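To save the deduplicated result, `DataFrame.to_csv` can write it back out. A minimal self-contained sketch (the two sample lines and the output filename are just placeholders):

```python
import io
import pandas as pd

# two sample lines with the same (start, end) pair
data = io.StringIO(
    "chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r\n"
    "chr1 35720 36081 NR_026820_exon_2_0_chr1_35721_r\n"
)
df = pd.read_csv(data, sep=r"\s+", header=None)
no_dups = df.drop_duplicates(subset=[1, 2])

# write tab-separated, without the integer header or the index column
no_dups.to_csv("deduped.tsv", sep="\t", header=False, index=False)
```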
You can use pandas to load your file and then drop rows based on any two columns you choose (simple example):
import pandas as pd

raw_data = {'firstcolumn': ['chr1', 'chr1', 'chr1'],
            'start_position': [35276, 35720, 35720],
            'end_position': [35481, 36081, 36081],
            'A': [4, 4, 31],
            'B': [25, 25, 57]}
df = pd.DataFrame(raw_data, columns=['firstcolumn', 'start_position', 'end_position', 'A', 'B'])
# drop_duplicates returns a new DataFrame, so reassign to keep the result
df = df.drop_duplicates(['start_position', 'end_position'])
You can use something like this:
with open('in_lines.dat') as in_file, open('out_lines.dat', 'w') as out_file:
    seen = set()  # a set makes the membership test fast even for 50k lines
    for line in in_file:
        line_data = line.split()
        if len(line_data) == 15:
            start, end = line_data[1], line_data[2]
            if (start, end) not in seen:
                out_file.write(line)
                seen.add((start, end))
And the output for your input will be:
chr1 35276 35481 NR_026820_exon_1_0_chr1_35277_r 0 - 0.526829 0.473171 54 37 60 54 0 0 205
chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r 0 - 0.398892 0.601108 73 116 101 71 0 0 361
chr1 69090 70008 NM_001005484_exon_0_0_chr1_69091_f 0 + 0.571895 0.428105 212 218 175 313 0 0 918
chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 0.366775 0.633225 997 1194 1924 809 0 0 4924
chr1 139789 139847 NR_039983_exon_1_0_chr1_139790_r 0 - 0.551724 0.448276 13 12 14 19 0 0 58
chr1 140074 140566 NR_039983_exon_2_0_chr1_140075_r 0 - 0.475610 0.524390 126 144 114 108 0 0 492
chr1 323891 324060 NR_028322_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 324287 324345 NR_028322_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324438 326938 NR_028327_exon_2_0_chr1_324439_f 0 + 0.375200 0.624800 400 1013 549 538 0 0 2500
chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 0.378228 0.621772 678 1580 996 889 0 0 4143
As someone else suggested, store your keys (the start and end fields) in a set, and skip printing a line if its key has already been seen:
s = set()
with open('datafile.tsv') as f:
    for line in f:
        fields = line.split('\t')
        key = tuple(fields[1:3])
        if key in s:
            continue
        s.add(key)
        print(line, end='')
You can then redirect the script's output to a new file.
If you just want to remove duplicates and write to a file, you can use itertools.groupby, grouping on the two columns and calling next(v) to take only the first row of each group (whether the group holds one row or several). This also keeps very little in memory:
from itertools import groupby

with open("in.csv") as f, open("out.csv", "w") as out:
    for _, v in groupby(f, key=lambda x: x.split()[1:3]):
        out.write(next(v))
Output:
chr1 35276 35481 NR_026820_exon_1_0_chr1_35277_r 0 - 0.526829 0.473171 54 37 60 54 0 0 205
chr1 35720 36081 NR_026818_exon_2_0_chr1_35721_r 0 - 0.398892 0.601108 73 116 101 71 0 0 361
chr1 69090 70008 NM_001005484_exon_0_0_chr1_69091_f 0 + 0.571895 0.428105 212 218 175 313 0 0 918
chr1 134772 139696 NR_039983_exon_0_0_chr1_134773_r 0 - 0.366775 0.633225 997 1194 1924 809 0 0 4924
chr1 139789 139847 NR_039983_exon_1_0_chr1_139790_r 0 - 0.551724 0.448276 13 12 14 19 0 0 58
chr1 140074 140566 NR_039983_exon_2_0_chr1_140075_r 0 - 0.475610 0.524390 126 144 114 108 0 0 492
chr1 323891 324060 NR_028322_exon_0_0_chr1_323892_f 0 + 0.426035 0.573964 37 41 56 35 0 0 169
chr1 324287 324345 NR_028322_exon_1_0_chr1_324288_f 0 + 0.551724 0.448276 19 15 11 13 0 0 58
chr1 324438 326938 NR_028327_exon_2_0_chr1_324439_f 0 + 0.375200 0.624800 400 1013 549 538 0 0 2500
chr1 324438 328581 NR_028322_exon_2_0_chr1_324439_f 0 + 0.378228 0.621772 678 1580 996 889 0 0 4143
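Note that groupby only merges consecutive lines with the same key, so this relies on the file being sorted by position, as the sample data is. A tiny illustration with made-up lines:

```python
from itertools import groupby

lines = [
    "chr1 100 200 a",
    "chr1 100 200 b",  # consecutive duplicate (start, end) -> dropped
    "chr1 100 300 c",  # same start, different end -> kept
]
# keep only the first line of each consecutive (start, end) run
firsts = [next(v) for _, v in groupby(lines, key=lambda x: x.split()[1:3])]
```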
If you want to replace the original file, write to a NamedTemporaryFile and then move it over the original with shutil.move:

from itertools import groupby
from tempfile import NamedTemporaryFile
from shutil import move

# open the temp file in text mode ("w"); the default is binary
with open("in.csv") as f, NamedTemporaryFile("w", dir=".", delete=False) as out:
    for _, v in groupby(f, key=lambda x: x.split()[1:3]):
        out.write(next(v))

move(out.name, "in.csv")