Clean up the header of CSV files with awk
I have a bunch of csv files that look like this:
Time,Channel A
(s),(V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560
And I want to convert it to:
Time (s),Channel A (V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560
The units are not necessarily identical between files (e.g. they might be (ms),(µV), etc.)
I got as far as removing the empty line (which is always line 3) with awk 'FNR!=3'
, but extracting the fields of the second line and appending them to the corresponding fields of the first line is beyond me :(
Please help me! Motivation: there are many such files, so doing this manually is not feasible ...
EDIT: First of all: thanks for helping me. It seems the same problem occurs with both of your answers; could it be an issue in my file? I suspect the empty line is causing problems in some way:
- The shell script gives me the error
'expected empty line, read '
- The perl script and the awk script give me a wrong first line:
(V) (s),Channel A 0.00000000,0.58058560
If that helps, I have uploaded a sample file.
You can use perl instead:
perl -lpe 'if($.==1){$x=<>;($T,$C)=$x=~/(\([^\)]+\))/g;s/[^,]+\K,.*/ $T$& $C/;<>}' file
Time (s),Channel A (V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560
If you have a CRLF (Windows-style) file like yours, chop off the trailing carriage return first:
perl -lpe 'chop;if($.==1){$x=<>;($T,$C)=$x=~/(\([^\)]+\))/g;s/[^,]+\K,.*/ $T$& $C/;<>}' file
If you decide to stay in bash, here's a script that will do error checking for you
#!/usr/bin/env bash
function Error() { 1>&2 echo "$@"; }
function cleanup_csv() {
    IFS=$',\r' read -a Header || { Error "could not read header"; return 1; }
    IFS=$',\r' read -a Units  || { Error "could not read units";  return 1; }
    declare -r NCols=${#Header[@]}
    declare -r n_units=${#Units[@]}
    [[ $NCols -eq $n_units ]] || {
        Error "number of columns in Header ($NCols) not equal to Units ($n_units)"
        return 1
    }
    if IFS=$' \t\r' read -a words; then
        if [[ $(echo "${words[*]}") ]]; then
            Error "expected empty line, read '${words[*]}'"
            return 1
        fi
    else
        Error "could not read line 3, expected empty line"
        return 1
    fi
    local i= sep=
    for ((i=0; i<NCols; ++i)); do
        printf "%s" "$sep${Header[i]} ${Units[i]}"
        sep=,
    done
    printf "\n"
    cat
}
cleanup_csv "$@"
Given the file 'in' with your example input
Time,Channel A
(s),(V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560
and called this way
./cleanup_csv.sh < in
it produces the desired output
Time (s),Channel A (V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560
I also checked it with more than 2 columns
Time,Channel A,Channel B
(s),(V),(mV)
0.00000000,0.58058560,12.34
0.00666667,0.58913630,3.1415
0.02000002,0.58058560,0.913
And this also works
Time (s),Channel A (V),Channel B (mV)
0.00000000,0.58058560,12.34
0.00666667,0.58913630,3.1415
0.02000002,0.58058560,0.913
Edit
I updated the reading and the empty-line detection because your input has Windows-style line endings (CRLF). You can see the carriage returns with od -xc
. I also added space and tab to the set of field-separator characters (IFS), and then check whether any non-whitespace characters remain after the echo
.
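The effect of the extra \r in IFS can be seen in isolation. A small sketch (read -a is a bashism, so it is run under bash explicitly; the sample line is invented):

```shell
# The trailing carriage return of a CRLF line is an IFS character here,
# so it is consumed as a delimiter instead of sticking to the last field.
printf 'Time,Channel A\r\n' |
bash -c "IFS=\$',\r' read -r -a Header; printf '<%s>\n' \"\${Header[@]}\""
# prints:
# <Time>
# <Channel A>
```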
Here's a hex dump of the input under test (with added space and tab on the "blank" line):
$ od -xc in.csv
0000000 6954 656d 432c 6168 6e6e 6c65 4120 0a0d
T i m e , C h a n n e l A \r \n
0000020 7328 2c29 5628 0d29 200a 2009 0d20 300a
( s ) , ( V ) \r \n \t \r \n 0
0000040 302e 3030 3030 3030 2c30 2e30 3835 3530
. 0 0 0 0 0 0 0 0 , 0 . 5 8 0 5
0000060 3538 3036 0a0d 2e30 3030 3636 3636 3736
8 5 6 0 \r \n 0 . 0 0 6 6 6 6 6 7
0000100 302c 352e 3938 3331 3336 0d30 300a 302e
, 0 . 5 8 9 1 3 6 3 0 \r \n 0 . 0
0000120 3331 3333 3333 2c34 2e30 3835 3530 3538
1 3 3 3 3 3 4 , 0 . 5 8 0 5 8 5
0000140 3036 0a0d 2e30 3230 3030 3030 3230 302c
6 0 \r \n 0 . 0 2 0 0 0 0 0 2 , 0
0000160 352e 3038 3835 3635 0d30 300a 302e 3632
. 5 8 0 5 8 5 6 0 \r \n 0 . 0 2 6
0000200 3636 3636 2c39 2e30 3835 3139 3633 3033
6 6 6 6 9 , 0 . 5 8 9 1 3 6 3 0
0000220 0a0d 2e30 3330 3333 3333 3633 302c 352e
\r \n 0 . 0 3 3 3 3 3 3 6 , 0 . 5
0000240 3938 3331 3336 0d30 300a 302e 3034 3030
8 9 1 3 6 3 0 \r \n 0 . 0 4 0 0 0
0000260 3030 2c33 2e30 3835 3139 3633 3033 0a0d
0 0 3 , 0 . 5 8 9 1 3 6 3 0 \r \n
0000300
awk -F , -v RS='\r\n' 'FNR == 1 { x = $1; y = $2 } FNR == 2 { print x " " $1 "," y " " $2 } FNR > 3'
Short explanation: awk
accepts boolean expressions as patterns, so FNR == n
simply means the associated action applies to line number n of the current file. The last pattern applies to everything after line # 3; omitting its action means the matching lines are printed as read. Lines # 1 and # 2 are each handled on purpose with their own action, and line # 3 is dropped because no pattern matches it.
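The pattern/action behavior is easy to see with a toy input (invented here): a pattern with an action runs the action, a pattern with no action prints the matching line as-is, and a line matching no pattern is dropped.

```shell
# Line 2 triggers the explicit action; line 4 matches the bare
# pattern FNR > 3 and is printed unchanged; lines 1 and 3 vanish.
printf 'one\ntwo\nthree\nfour\n' |
awk 'FNR == 2 { print "got: " $0 } FNR > 3'
# prints:
# got: two
# four
```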
UPDATE: Edited to set RS
(record separator) per @123's suggestion. An alternative would be to set it inside the script itself, perhaps along with FS
(field separator), like this:
awk 'BEGIN { FS=","; RS="\r\n" } FNR == 1 { x = $1; y = $2 } FNR == 2 { print x " " $1 "," y " " $2 } FNR > 3'
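An end-to-end check on CRLF input (sample values invented; note that a multi-character RS is a gawk/mawk extension, not strict POSIX):

```shell
# The blank third record matches no pattern and is dropped; the header
# of record 1 is merged with the units of record 2.
printf 'Time,Channel A\r\n(s),(V)\r\n\r\n0.1,0.2\r\n' |
awk 'BEGIN { FS=","; RS="\r\n" }
     FNR == 1 { x = $1; y = $2 }
     FNR == 2 { print x " " $1 "," y " " $2 }
     FNR > 3'
# prints:
# Time (s),Channel A (V)
# 0.1,0.2
```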
You can do:
awk 'BEGIN{
  FS=OFS=","
}
FNR==1{
  for (i=1;i<=NF;i++) l1[i]=$i
}
FNR==2{
  for (i=1;i<=NF;i++) l2[i]=$i
}
FNR==3{
  s=""
  for (i=1; i in l1 || i in l2; i++)
    s=s ? s OFS l1[i] " " l2[i] : s l1[i] " " l2[i]
  print s
}
FNR<=3{ next }
1' file
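This generalizes to any column count. A quick check with a third column (values invented; the blank third line is assumed, as in the original files):

```shell
# Lines 1 and 2 are cached, the merged header is printed when the
# blank line 3 is reached, and everything after passes through via 1.
printf 'Time,Channel A,Channel B\n(s),(V),(mV)\n\n1,2,3\n' |
awk 'BEGIN{ FS=OFS="," }
FNR==1{ for (i=1;i<=NF;i++) l1[i]=$i }
FNR==2{ for (i=1;i<=NF;i++) l2[i]=$i }
FNR==3{
  s=""
  for (i=1; i in l1 || i in l2; i++)
    s=s ? s OFS l1[i] " " l2[i] : s l1[i] " " l2[i]
  print s
}
FNR<=3{ next }
1'
# prints:
# Time (s),Channel A (V),Channel B (mV)
# 1,2,3
```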
This prints
Time (s),Channel A (V)
0.00000000,0.58058560
0.00666667,0.58913630
0.01333334,0.58058560