Order a column, then print a specific line with awk on the command line

Question

I have a txt file:

ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89

Row1-row3 don't matter.

I have about 8 million lines, and the scores range from 0 to 3. I would like to get the score that corresponds to 1%. I was thinking about reordering the data by score and then printing out the ~80,000th line. What do you think would be the best code to do this?

Evan Jul 24 15 at 20:49

2 answers

With GNU awk, you can sort values by setting PROCINFO["sorted_in"] to "@val_num_desc". For example:

parse.awk

# Set sorting method
BEGIN { PROCINFO["sorted_in"]="@val_num_desc" }

# Print header
NR == 1 { print $1, $5 }

# Save the 1st and 5th columns in arrays g and h, keyed by line number
NR > 1 { g[NR] = $1; h[NR] = $5 }

# Print entries in descending score order until the requested share
# of the data rows (NR - 1, excluding the header) has been printed
END {
  for (k in h) {
    if (i++ >= int(0.5 + (NR - 1) * ratio_to_keep))
      exit
    print g[k], h[k]
  }
}

Run it like this:

awk -f parse.awk OFS='\t' ratio_to_keep=.01 infile

Thor Jul 24 At 21:45

Thor · Accepted Answer · 2015-07-24T20:57:33+0000

With GNU coreutils, you can do it like this:

sort -k5gr <(tail -n+2 infile) | head -n80KB

You can increase the speed of the above pipeline by removing columns 2 through 4 as follows:

tr -s ' ' < infile | cut -d' ' -f1,5 > outfile
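To see what the trimming step produces, here is a minimal sketch on two rows of the sample from the question. The file name trim_demo.txt and the variable trimmed are assumptions made for illustration. Note that the score ends up in the 2nd column of the trimmed output.

```shell
# Two rows of the question's sample table (hypothetical file name).
printf '%s\n' \
  'ID   row1   row2   row3   score' \
  'rs16 ...    ...    ...    0.23' > trim_demo.txt

# Squeeze each run of spaces to a single space, then keep fields 1 and 5.
trimmed=$(tr -s ' ' < trim_demo.txt | cut -d' ' -f1,5)
echo "$trimmed"
# ID score
# rs16 0.23
```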

Or together (after cut, the score is the 2nd column, so the sort key changes to -k2):

sort -k2gr <(tail -n+2 <(tr -s ' ' < infile | cut -d' ' -f1,5)) | head -n80KB

Edit

I noticed that you are only interested in the 80,000th line of the result, so sed -n '80000{p;q}' instead of head is the way to go.
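As a rough sketch of that last step, the line number can be computed from the row count instead of being hard-coded. This runs on the five-row sample from the question. The file name sample.txt, the variable names, and the ratio of 0.4 (chosen so the tiny sample has a meaningful cutoff; the real data would use 0.01) are all assumptions for illustration.

```shell
# Build the sample table from the question (hypothetical file name).
cat > sample.txt <<'EOF'
ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89
EOF

ratio=0.4   # assumed for the demo; 0.01 for the 1% cutoff

# Number of data rows, excluding the header line.
rows=$(( $(wc -l < sample.txt) - 1 ))

# Target line number: rows * ratio rounded to nearest, and at least 1.
n=$(awk -v r="$rows" -v p="$ratio" \
  'BEGIN { n = int(r * p + 0.5); print (n < 1 ? 1 : n) }')

# Sort by score in descending order and print only the n-th line.
cutoff=$(tail -n +2 sample.txt | sort -k5,5gr | sed -n "${n}{p;q}")
echo "$cutoff"
```

With five rows and a ratio of 0.4, n is 2, so this prints the row with the second-highest score (rs67, 1.89).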

Explanation

tail:

-n+2 - skip the header line.

sort:

k5 - sort by the 5th column.
gr - flags that make sort use reverse general-numeric order.

head:

n - the number of lines to keep.
KB - multiplier of 1000; see info head for others.
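One way to see why general-numeric (g) can matter instead of plain numeric (n) is a small sketch with a value in scientific notation. The scores below are made up for illustration; the question does not say whether the real data contains such values.

```shell
# Hypothetical scores, one written in scientific notation.
scores=$(printf '%s\n' 0.5 1e-3 2)

# -n stops parsing at the "e", so 1e-3 is read as 1.
by_n=$(printf '%s\n' "$scores" | LC_ALL=C sort -n | paste -sd' ' -)

# -g parses the whole token as a floating-point number (0.001).
by_g=$(printf '%s\n' "$scores" | LC_ALL=C sort -g | paste -sd' ' -)

echo "-n: $by_n"   # -n: 0.5 1e-3 2
echo "-g: $by_g"   # -g: 1e-3 0.5 2
```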
