How do I extract every N columns and write them to new files?

I have been struggling to write code that extracts every N columns from an input file and writes each group to its own output file, in extraction order.

(My real task is to extract every 800 columns from a file with 24005 columns, starting at column 6, so I need a loop.)

A simpler case is extracting every 3 columns (fields) from the input file, starting at the second column.

For example, if the input file looks like this:

aa 1 2 3 4 5 6 7 8 9 
bb 1 2 3 4 5 6 7 8 9 
cc 1 2 3 4 5 6 7 8 9 
dd 1 2 3 4 5 6 7 8 9 

      

and I want the result to look like this. output_file_1:

1 2 3
1 2 3
1 2 3
1 2 3

      

output_file_2:

4 5 6  
4 5 6 
4 5 6 
4 5 6 

      

output_file_3:

7 8 9
7 8 9 
7 8 9
7 8 9

      

I tried this but it doesn't work:

awk 'for(i=2;i<=10;i+a) {{printf "%s ",$i};a=3}' <inputfile>

      

This gave me a syntax error, and the more I try to fix it, the more problems appear.

I also tried the Linux cut command, but that seems tedious when dealing with large files. I'm also wondering whether cut could extract every 3 fields in a loop, the way awk can.

Can someone please help me with this and give a short explanation? Thanks in advance.

+3




4 answers


The actions awk performs on its input must be enclosed in curly braces, so your awk one-liner results in a syntax error because the for loop doesn't follow this rule. A syntactically correct version would be:

awk '{for(i=2;i<=10;i+a) {printf "%s ",$i};a=3}' <inputfile>

      

This is syntactically correct (almost, see the end of this post), but doesn't do what you think.

To write the column groups to different files, it is best to use awk's redirection operator >. This will give you the output you want, provided that your input file always has 10 columns:

awk '{ print $2,$3,$4 > "file_1"; print $5,$6,$7 > "file_2"; print $8,$9,$10 > "file_3"}' <inputfile>

      

Remember to enclose the filenames in double quotes (" ").


EDIT: REAL WORLD CASE

If you need to loop through the columns because there are too many of them, you can still use awk (gawk) with two loops: one over the output files and one over the columns per file. Here is one possibility:



#!/usr/bin/gawk -f

BEGIN {
  CTOT  = 24005  # total number of columns, you can use NF as well
  DELTA = 800    # columns per file
  START = 6      # first useful column
  # number of complete output files; only columns START..CTOT are counted
  d = int((CTOT - START + 1) / DELTA)
}
{
  for (i = 0; i < d; i++)
  {
    out = "file_out_" i   # build the name once so the redirection is unambiguous
    for (j = 0; j < DELTA; j++)
    {
      printf("%f\t", $(START + j + i * DELTA)) > out
    }
    printf("\n") > out
  }
}

      

I tried this on the simple input file from your example. It works if the columns to extract divide evenly into groups of DELTA. I assumed you have floats (%f); change the format to whatever you need.

Let me know how it goes.
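If the number of columns is not an exact multiple of the group size, one way to avoid hard-coding the total is to derive the output file from the column index. The sketch below is a variant under stated assumptions, not the script above: it uses NF instead of CTOT, prints fields as strings rather than floats, and the names start, delta, prefix and input.txt are placeholders chosen for illustration.

```shell
# Sample input from the question (input.txt is a placeholder name).
printf '%s\n' 'aa 1 2 3 4 5 6 7 8 9' 'bb 1 2 3 4 5 6 7 8 9' > input.txt

awk -v start=2 -v delta=3 -v prefix="group_" '
{
    for (c = start; c <= NF; c++) {
        k = int((c - start) / delta) + 1     # 1-based index of the output file
        out = prefix k
        # true when c is the last column of its group or of the record
        last = (c == NF || (c - start + 1) % delta == 0)
        printf "%s%s", $c, (last ? "\n" : " ") > out
    }
}' input.txt
```

With these sample values the columns land in group_1, group_2 and group_3. A short final group simply gets fewer fields. Some awk implementations limit how many files can be open at once, so this may need gawk when there are many groups.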


P.S. Note that the loop is infinite because i never grows: i+a must be replaced with i+=a, and a=3 must be inside the inner curly braces:

awk '{for(i=2;i<=10;i+=a) {printf "%s ",$i;a=3}}' <inputfile>

      

This assigns a = 3 on every iteration, which is a bit pointless. A better version:

awk '{for(i=2;i<=10;i+=3) {printf "%s ",$i}}' <inputfile>

      

However, this will just print the 2nd, 5th and 8th columns of your file, which is not what you want.

+3




awk '{ print $2, $3,  $4 >"output_file_1";
       print $5, $6,  $7 >"output_file_2";
       print $8, $9, $10 >"output_file_3";
     }' input_file

      

This makes a single pass through the input file, which is preferable to multiple passes. Obviously, the code above only handles a fixed number of columns (and therefore a fixed number of output files). It can be modified if necessary to handle varying numbers of columns, generate variable file names, and so on.


(My real deal is to extract every 800 columns from the whole 24005 columns file starting at column 6, so I need a loop)



In this case, you are correct; you need a loop. You actually need two loops:

awk 'BEGIN { gap = 800; start = 6; filebase = "output_file_"; }
     {
         for (i = start; i < start + gap; i++)
         {
             file = sprintf("%s%d", filebase, i);
             for (j = i; j <= NF; j += gap)
                  printf("%s ", $j) > file;
             printf "\n" > file;
         }
     }' input_file

      

I verified this to my satisfaction with an input file of 25 columns (containing the numbers 1-25 in the corresponding columns), with gap set to 8 and start set to 2. Each file receives columns i, i + gap, i + 2*gap, and so on. The output below shows the 8 resulting files pasted side by side.

2 10 18    3 11 19    4 12 20    5 13 21    6 14 22    7 15 23    8 16 24    9 17 25
2 10 18    3 11 19    4 12 20    5 13 21    6 14 22    7 15 23    8 16 24    9 17 25
2 10 18    3 11 19    4 12 20    5 13 21    6 14 22    7 15 23    8 16 24    9 17 25
2 10 18    3 11 19    4 12 20    5 13 21    6 14 22    7 15 23    8 16 24    9 17 25

      

+2




With GNU awk:

$ awk -v d=3 '{for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3",""); print "----"}' file
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----
1 2 3
4 5 6
7 8 9
----

      

Just redirect the output to files:

$ awk -v d=3 '{sfx=0; for(i=2;i<NF;i+=d) print gensub("(([^ ]+ +){" i-1 "})(([^ ]+( +|$)){" d "}).*","\\3","") > ("output_file_" ++sfx)}' file

      

The idea is simply to tell gensub() to skip the first (i-1) fields, print the number of fields you want (d = 3), and ignore the rest (.*). If the number of fields to print is not an exact multiple of d, you will need to massage the number of fields to be printed on the last iteration of the loop. Do the math ...
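The arithmetic for the last iteration can be sketched as follows: a group starting at field i holds min(d, NF - i + 1) fields. This small check is plain awk and is not part of the gensub() command; the variable names nf, start, d and sizes are placeholders for illustration.

```shell
# Compute the size of each group for a record with nf fields.
sizes=$(awk -v nf=10 -v start=2 -v d=4 'BEGIN {
    for (i = start; i <= nf; i += d) {
        n = nf - i + 1          # fields remaining from i to the end of the record
        if (n > d) n = d        # a full group never exceeds d fields
        printf "group starting at field %d has %d field(s)\n", i, n
    }
}')
printf '%s\n' "$sizes"
```

With 10 fields, a start of 2 and groups of 4, only the final group (starting at field 10) is short.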

Here's a version that will work in any awk. It takes 2 loops and changes the spaces between fields, but is probably easier to understand:

$ awk -v d=3 '{sfx=0; for(i=2;i<=NF;i+=d) {str=fs=""; for(j=i;j<i+d;j++) {str = str fs $j; fs=" "}; print str > ("output_file_" ++sfx)} }' file

      

+2




I got it working with the following command line. :) It uses a for loop and feeds the awk program to awk on stdin with -f -. The awk program itself is generated using bash arithmetic expansion.

for i in 0 1 2; do 
    echo "{print \$$((i*3+2)) \" \" \$$((i*3+3)) \" \" \$$((i*3+4))}" \
  | awk -f -  t.file   > "file$((i+1))"
done
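Since the question also asks about cut, the same three-pass loop can be sketched with cut instead of a generated awk program. This assumes the fields are separated by exactly one space, because cut does not collapse repeated delimiters; t.file and the output names follow the loop above, and the variables first and last are introduced here for illustration.

```shell
# Recreate the sample input from the question.
printf '%s\n' 'aa 1 2 3 4 5 6 7 8 9' 'bb 1 2 3 4 5 6 7 8 9' > t.file

for i in 0 1 2; do
    first=$((i * 3 + 2))          # first field of this group
    last=$((first + 2))           # last field of this group
    cut -d ' ' -f "${first}-${last}" t.file > "file$((i + 1))"
done
```

Like the awk loop, this reads the input once per group, so for many groups a single-pass awk script is likely to be the better fit.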

      


Update: After the question was updated, I tried to hack together a script that dynamically generates the requested 800-column awk script (a version along the lines of Jonathan Leffler's answer) and pipes it to awk. While the script looks good (to me), it produces an awk syntax error. The question is: is this too much for awk, or am I missing something? I would really appreciate feedback!

Update: I investigated this and found documentation saying that awk has a lot of limitations, and that gawk (the GNU awk implementation) should be used in such situations. I did that, but I still get a syntax error. Feedback is still appreciated!

#!/bin/bash

# Note! Although the script output looks ok (for me)
# it produces an awk syntax error. is this just too much for awk?

# open pipe to stdin of awk
exec 3> >(gawk -f - test.file)

# verify output using cat
#exec 3> >(cat)

echo '{' >&3

# write dynamic script to awk
for i in {0..24005..800} ; do
    echo -n " print " >&3
    for (( j=$i; j <= $((i+800)); j++ )) ; do
        echo -n "\$$j " >&3
        if [ $j = 24005 ] ; then
            break
        fi
    done
    echo "> \"file$((i/800+1))\";" >&3
done
echo "}"
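One likely cause of the syntax error is that the final echo "}" is not redirected to file descriptor 3, so gawk never receives the closing brace of the generated program. A minimal sketch of a corrected generator follows. It is not the original script: it writes the program to a temporary file (prog.awk, a placeholder name), separates the fields with commas so that print inserts spaces between them, and uses small demo values for ncols, start and width in place of 24005, 6 and 800.

```shell
# Demo values; for the real file these would be ncols=24005, start=6, width=800.
ncols=10; start=2; width=3

printf '%s\n' 'aa 1 2 3 4 5 6 7 8 9' 'bb 1 2 3 4 5 6 7 8 9' > test.file

{
    echo '{'
    n=1
    i=$start
    while [ "$i" -le "$ncols" ]; do
        fields=""
        j=$i
        # collect up to $width field references, e.g. "$2, $3, $4"
        while [ "$j" -lt $((i + width)) ] && [ "$j" -le "$ncols" ]; do
            fields="${fields:+$fields, }\$$j"
            j=$((j + 1))
        done
        echo "  print $fields > \"gen_file$n\";"
        n=$((n + 1))
        i=$((i + width))
    done
    echo '}'            # the closing brace has to reach awk as well
} > prog.awk

awk -f prog.awk test.file
```

Writing the program to a file first also makes it easy to inspect prog.awk when awk reports an error.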

      

+1

