Order a column, then print a specific line with awk on the command line

I have a txt file:

ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89

      

Row1-row3 don't matter.

I have about 8 million lines, and the scores range from 0 to 3. I would like to find the score that corresponds to the 1% cutoff. I was thinking of sorting the data by score and then printing the ~80,000th line. What do you think would be the best code to do this?

2 answers


With GNU coreutils, you can do it like this:

sort -k5gr <(tail -n+2 infile) | head -n80KB

      

You can increase the speed of the above pipeline by removing columns 2 through 4 as follows:

tr -s ' ' < infile | cut -d' ' -f1,5 > outfile

      

Or together (after cut only two columns remain, so the score is now in column 2):

sort -k2,2gr <(tail -n+2 <(tr -s ' ' < infile | cut -d' ' -f1,5)) | head -n80KB
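As a quick check of the reduced pipeline, here is a sketch run on the five sample rows from the question. The temporary file and the choice of keeping the top 2 rows are assumptions for illustration. It uses a plain pipe instead of process substitution and sorts on field 2, because after cut the score is no longer in column 5.

```shell
# Recreate the question's sample table in a temporary file
# (columns 2-4 are placeholders, as in the question).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ID   row1   row2   row3   score
rs16 ...    ...    ...    0.23
rs52 ...    ...    ...    1.43
rs87 ...    ...    ...    0.45
rs89 ...    ...    ...    2.34
rs67 ...    ...    ...    1.89
EOF

# Skip the header, squeeze repeated spaces, keep ID and score,
# sort by score (field 2) in descending order, keep the top 2 rows.
# LC_ALL=C makes sure "." is treated as the decimal point.
top=$(tail -n +2 "$sample" | tr -s ' ' | cut -d' ' -f1,5 \
      | LC_ALL=C sort -k2,2gr | head -n 2)

echo "$top"
rm -f "$sample"
```

On the sample data this prints the rs89 and rs67 rows, the two highest scores.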

      

Edit

I noticed that you are only interested in the 80,000th line of the result. In that case, as you suggested, use sed -n '80000{p;q}' instead of head: it prints only that line and then quits.
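Rather than hard-coding 80000, the line number can be derived from the file itself. The following is a minimal sketch under stated assumptions: the 200-row generated test file and the variable names (scores, rows, n, cutoff) are made up for illustration, and the file is assumed to have at least 100 data rows so that n is not zero.

```shell
# Build a hypothetical test file: a header plus 200 rows with scores 0.01 .. 2.00.
scores=$(mktemp)
awk 'BEGIN {
  print "ID row1 row2 row3 score"
  for (i = 1; i <= 200; i++) printf "rs%d . . . %.2f\n", i, i / 100
}' > "$scores"

# Number of data rows (total lines minus the header).
rows=$(( $(wc -l < "$scores") - 1 ))

# Line number that marks the top 1% (integer division; assumes rows >= 100).
n=$(( rows / 100 ))

# Sort by score descending, print only line n and quit, then extract the score.
cutoff=$(tail -n +2 "$scores" | LC_ALL=C sort -k5,5gr \
         | sed -n "${n}{p;q;}" | awk '{ print $5 }')

echo "top 1% cutoff: $cutoff"
rm -f "$scores"
```

With 200 rows, n is 2, so the reported cutoff is the second-highest score.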



Explanation

tail:

  • -n+2 - skip the header line.

sort:

  • k5 - sort by the 5th column.
  • gr - flags that make sort use reverse general-numeric order.

head:

  • n - the number of lines to keep.
  • KB - multiplier of 1000; see info head for others.
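To illustrate what the g flag changes compared with plain numeric sorting (n), here is a small sketch with made-up values. General-numeric sorting understands scientific notation, while -n reads 1e-3 as 1. For plain decimals like the scores in the question, both give the same order, and the GNU documentation notes that -n is the faster of the two.

```shell
# Two made-up values: 0.5 and 1e-3 (that is, 0.001).
# LC_ALL=C makes sure "." is treated as the decimal point.

# -g parses 1e-3 as 0.001, so it sorts first in ascending order.
general=$(printf '0.5\n1e-3\n' | LC_ALL=C sort -g | head -n 1)

# -n stops parsing at the "e" and sees 1, so 0.5 sorts first.
plain=$(printf '0.5\n1e-3\n' | LC_ALL=C sort -n | head -n 1)

echo "smallest with -g: $general"
echo "smallest with -n: $plain"
```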

With GNU awk, you can sort values by setting PROCINFO["sorted_in"] to "@val_num_desc". For example:

parse.awk

# Set sorting method
BEGIN { PROCINFO["sorted_in"]="@val_num_desc" }

# Print header
NR == 1 { print $1, $5 }

# Save 1st and 5th columns in g and h hashes respectively
NR>1 { g[NR] = $1; h[NR] = $5 }

# Print values from g and h until ratio is reached
END {
  for(k in h) { 
    # NR-1 excludes the header line from the row count
    if(i++ >= int(0.5 + (NR-1)*ratio_to_keep))
      exit
    print g[k], h[k]
  }
}

      



Run it like this:

awk -f parse.awk OFS='\t' ratio_to_keep=.01 infile

      
