Order a column, then print a specific line with awk on the command line
I have a txt file:
ID row1 row2 row3 score
rs16 ... ... ... 0.23
rs52 ... ... ... 1.43
rs87 ... ... ... 0.45
rs89 ... ... ... 2.34
rs67 ... ... ... 1.89
Row1-row3 don't matter.
I have about 8 million lines and the scores range from 0 to 3. I would like to get the score that corresponds to the 1% mark. I was thinking about reordering the data by score and then printing out the ~80,000th line. What do you guys think would be the best code to do this?
With GNU coreutils, you can do it like this:
sort -k5gr <(tail -n+2 infile) | head -n80KB
You can increase the speed of the above pipeline by removing columns 2 through 4 as follows:
tr -s ' ' < infile | cut -d' ' -f1,5 > outfile
Or together (after cut only two columns remain, so the score is now column 2 and the sort key becomes -k2gr):
sort -k2gr <(tail -n+2 <(tr -s ' ' < infile | cut -d' ' -f1,5)) | head -n80KB
Edit
I noticed that you are only interested in the 80,000th line of the result. In that case, sed -n '80000{p;q}' instead of head, as you suggested, is the way to go.
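The fixed 80000 assumes exactly 8 million data rows. Here is a minimal sketch that derives the line number from the file itself, assuming a whitespace-separated file with one header row. The file name sample.txt, the sample rows, and the variable names (frac, rows, n, cutoff) are made up for illustration, and 0.4 is used only so that the tiny sample gives a visible result:

```shell
# Build a tiny stand-in for the real file (hypothetical sample data).
cat > sample.txt <<'EOF'
ID row1 row2 row3 score
rs16 a b c 0.23
rs52 a b c 1.43
rs87 a b c 0.45
rs89 a b c 2.34
rs67 a b c 1.89
EOF

frac=0.4   # fraction of top scores to cut at; use 0.01 for the 1% mark

# Count data rows (everything after the header).
rows=$(tail -n +2 sample.txt | wc -l)

# Line number of the cutoff, rounded to nearest and at least 1.
n=$(awk -v r="$rows" -v f="$frac" 'BEGIN { n = int(r * f + 0.5); if (n < 1) n = 1; print n }')

# Sort by score, highest first, and print only the n-th line.
cutoff=$(tail -n +2 sample.txt | LC_ALL=C sort -k5,5gr | sed -n "${n}{p;q;}")
echo "$cutoff"
```

On the real data you would point it at infile and set frac=0.01. The key is written as -k5,5 so that it is limited to the score column.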
Explanation
tail:
- -n+2 - skip the header.
sort:
- k5 - sort by the 5th column.
- gr - flags that make sort use reverse general-numeric ordering.
head:
- n - the number of lines to keep.
- KB - a multiplier of 1000; see info head for others.
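As a small illustration of why the general-numeric flag g can matter compared with plain n: the sketch below assumes GNU sort and uses three made-up scores, one of them in exponent notation. The scores in the question may never look like that, in which case both flags give the same order.

```shell
# Three hypothetical scores, one written in exponent notation (0.001).
scores=$(printf '%s\n' 0.5 1e-3 2)

# -n stops reading at the "e", so 1e-3 is compared as 1.
by_n=$(printf '%s\n' "$scores" | LC_ALL=C sort -n | paste -s -d ' ' -)

# -g parses the full floating-point value, so 1e-3 is compared as 0.001.
by_g=$(printf '%s\n' "$scores" | LC_ALL=C sort -g | paste -s -d ' ' -)

echo "sort -n: $by_n"
echo "sort -g: $by_g"
```

The trade-off is speed: g is generally slower than n, so n is a reasonable choice if the column is known to hold only plain decimals.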
With GNU awk, you can sort values by setting PROCINFO["sorted_in"] to "@val_num_desc". For example:
parse.awk
# Set sorting method
BEGIN { PROCINFO["sorted_in"] = "@val_num_desc" }

# Print header
NR == 1 { print $1, $5 }

# Save 1st and 5th columns in g and h hashes respectively
NR > 1 { g[NR] = $1; h[NR] = $5 }

# Print values from g and h until ratio is reached
END {
  for (k in h) {
    if (i++ >= int(0.5 + NR*ratio_to_keep)) exit
    print g[k], h[k]
  }
}
Run it like this:
awk -f parse.awk OFS='\t' ratio_to_keep=.01 infile