Clean up the header of CSV files with awk

I have a bunch of csv files that look like this:

Time,Channel A
(s),(V)

0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560


And I want to convert it to:

Time (s),Channel A (V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560


The units are not necessarily identical between files (e.g. they might be (ms),(µV), etc.).

I got as far as removing the empty line (which is always line #3) with awk 'FNR!=3', but extracting the fields of the second line and appending them to the corresponding fields of the first line is beyond me :(

Please help me! Motivation: there are many such files, so doing this manually is not an option...

EDIT: First of all, thanks for helping me. It seems the same problem shows up with two of your answers, so could it be an issue with my file? I suspect the empty line is causing problems in some way:

  • The shell script gives me the error 'expected empty line, read ...'.

  • The perl script and the awk script give me a wrong first line:

     (V) (s),Channel A
    0.00000000,0.58058560
    

In case that helps, I have uploaded a sample file.



4 answers


You can use perl instead. On the first line, the units line is slurped into $x, the two parenthesised units are captured into $T and $C and spliced in after the corresponding column names, and the trailing <> consumes the empty third line:

perl -lpe 'if($.==1){$x=<>;($T,$C)=$x=~/(\([^\)]+\))/g;s/[^,]+\K,.*/ $T$& $C/;<>}' file

Time (s),Channel A (V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560




If you have a file with CRLF (Windows) line endings, as yours seems to be:

perl -lpe 'chop;if($.==1){$x=<>;($T,$C)=$x=~/(\([^\)]+\))/g;s/[^,]+\K,.*/ $T$& $C/;<>}' file
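
Since there are many such files, the one-liner can be driven from a small shell loop. This is only a sketch; the *.csv glob and the _clean.csv output name are assumptions about your layout, so adjust them as needed:

# Clean every CSV in the current directory, writing NAME_clean.csv next to
# the original (assumed naming; change the glob and output suffix to taste).
for f in *.csv; do
    perl -lpe 'chop;if($.==1){$x=<>;($T,$C)=$x=~/(\([^\)]+\))/g;s/[^,]+\K,.*/ $T$& $C/;<>}' "$f" > "${f%.csv}_clean.csv"
done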



If you'd rather stay in bash, here's a script that also does error checking for you:

#!/usr/bin/env bash

# Print error messages on stderr.
function Error() { 1>&2 echo "$@"; }

function cleanup_csv() {
    # Read the column names and the units; having \r in IFS also swallows
    # the carriage return of CRLF line endings.
    IFS=$',\r' read -a Header || { Error "could not read header"; return 1; }
    IFS=$',\r' read -a Units || { Error "could not read units"; return 1; }
    declare -r NCols=${#Header[@]}
    declare -r n_units=${#Units[@]}

    [[ $NCols -eq $n_units ]] || {
        Error "number of columns in Header ($NCols) not equal to Units ($n_units)"
        return 1
    }

    # Line 3 must be empty (stray blanks, tabs and a carriage return are tolerated).
    if IFS=$' \t\r' read -a words; then
        if [[ $(echo "${words[*]}") ]]; then
            Error "expected empty line, read '${words[*]}'"
            return 1
        fi
    else
        Error "could not read line 3, expected empty line"
        return 1
    fi

    # Emit the merged header: "Name (unit),Name (unit),..."
    local i= sep=
    for ((i=0; i<NCols; ++i)); do
        printf "%s" "$sep${Header[i]} ${Units[i]}"
        sep=,
    done
    printf "\n"
    # Copy the remaining data lines through unchanged.
    cat
}

cleanup_csv "$@"


Given the file 'in' with your example input

Time,Channel A
(s),(V)

0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560


and called this way

./cleanup_csv.sh < in


it produces the desired output

Time (s),Channel A (V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560




I also checked it with more than 2 columns

Time,Channel A,Channel B
(s),(V),(mV)

0.00000000,0.58058560,12.34
0.00666667,0.58913630,3.1415
0.02000002,0.58058560,0.913


And this also works

Time (s),Channel A (V),Channel B (mV)
0.00000000,0.58058560,12.34
0.00666667,0.58913630,3.1415
0.02000002,0.58058560,0.913
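
To see the error checking in action, you can feed it a deliberately broken header, e.g. a units line with too few fields. The message below is what the script above should emit; it is a sketch, not a run verified against your real files:

# A units line with only one field triggers the column-count check.
printf 'Time,Channel A\n(s)\n\n1.0,2.0\n' | ./cleanup_csv.sh
# expected on stderr: number of columns in Header (2) not equal to Units (1)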


Edit

I updated the reading and the empty-line detection because your input has Windows-style line endings (CRLF). You can see the carriage returns with od -xc. I also added tab to the set of characters used as field separators (IFS), and then check whether any non-whitespace characters are left after the echo.

Here's a hex dump of the input under test (with added space and tab on the "blank" line):

$ od -xc in.csv
0000000    6954    656d    432c    6168    6e6e    6c65    4120    0a0d
          T   i   m   e   ,   C   h   a   n   n   e   l       A  \r  \n
0000020    7328    2c29    5628    0d29    200a    2009    0d20    300a
          (   s   )   ,   (   V   )  \r  \n      \t          \r  \n   0
0000040    302e    3030    3030    3030    2c30    2e30    3835    3530
          .   0   0   0   0   0   0   0   0   ,   0   .   5   8   0   5
0000060    3538    3036    0a0d    2e30    3030    3636    3636    3736
          8   5   6   0  \r  \n   0   .   0   0   6   6   6   6   6   7
0000100    302c    352e    3938    3331    3336    0d30    300a    302e
          ,   0   .   5   8   9   1   3   6   3   0  \r  \n   0   .   0
0000120    3331    3333    3333    2c34    2e30    3835    3530    3538
          1   3   3   3   3   3   4   ,   0   .   5   8   0   5   8   5
0000140    3036    0a0d    2e30    3230    3030    3030    3230    302c
          6   0  \r  \n   0   .   0   2   0   0   0   0   0   2   ,   0
0000160    352e    3038    3835    3635    0d30    300a    302e    3632
          .   5   8   0   5   8   5   6   0  \r  \n   0   .   0   2   6
0000200    3636    3636    2c39    2e30    3835    3139    3633    3033
          6   6   6   6   9   ,   0   .   5   8   9   1   3   6   3   0
0000220    0a0d    2e30    3330    3333    3333    3633    302c    352e
         \r  \n   0   .   0   3   3   3   3   3   3   6   ,   0   .   5
0000240    3938    3331    3336    0d30    300a    302e    3034    3030
          8   9   1   3   6   3   0  \r  \n   0   .   0   4   0   0   0
0000260    3030    2c33    2e30    3835    3139    3633    3033    0a0d
          0   0   3   ,   0   .   5   8   9   1   3   6   3   0  \r  \n
0000300
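
To illustrate what the IFS setting does in isolation, here is a minimal sketch (not part of the script above; the header line is just the one from your sample):

# Split a CRLF-terminated header line on commas; the \r in IFS keeps the
# carriage return out of the last field.
printf 'Time,Channel A\r\n' | { IFS=$',\r' read -a Header; printf '<%s>' "${Header[@]}"; echo; }
# expected output: <Time><Channel A>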



awk -F , -v RS='\r\n' 'FNR == 1 { x = $1; y = $2 } FNR == 2 { print x " " $1 "," y " " $2 } FNR > 3'


Short explanation: awk accepts boolean expressions as patterns (anything in scope can be used), so FNR == n simply means the associated action applies to line number n of the current file, and the last pattern, FNR > 3, matches everything after line #3. Omitting the action means printing the input as read, so lines #1 and #2 are handled explicitly, each with its own action, and line #3 is skipped because no pattern matches it.

UPDATE: Edited to set RS (the record separator) as per @123's suggestion. An alternative way would be to set it in the script itself, perhaps along with FS (the field separator), like this:

awk 'BEGIN { FS=","; RS="\r\n" } FNR == 1 { x = $1; y = $2 } FNR == 2 { print x " " $1 "," y " " $2 } FNR > 3'
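
Note that a multi-character RS is not portable: POSIX leaves it unspecified, while gawk treats it as a regular expression. With a strictly POSIX awk you could strip the carriage return yourself instead; a minimal sketch of that variant (same logic, with a sub() added up front, not tested against the uploaded sample):

awk -F , '{ sub(/\r$/, "") }                 # drop the CR of CRLF line endings
          FNR == 1 { x = $1; y = $2 }
          FNR == 2 { print x " " $1 "," y " " $2 }
          FNR > 3'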



You can do:

awk 'BEGIN{
           FS=OFS=","
           }
     FNR==1{                          # remember the column names
           for(i=1;i<=NF;i++) l1[i]=$i
           }
     FNR==2{                          # remember the units
           for (i=1;i<=NF;i++) l2[i]=$i
           }
     FNR==3{                          # build and print the merged header
           s=""
           for (i=1;i in l1 || i in l2; i++)
               s=s ? s OFS l1[i] " " l2[i] : s l1[i] " " l2[i]
           print s
           }
     FNR<=3{ next }                   # nothing more to do for lines 1-3
     1' file                          # print the data lines unchanged


This prints:

Time (s),Channel A (V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560
