Replace a column value in a CSV file with (g)awk when lines contain the delimiter

I am using gawk 4.0.1 and I know how to replace a column value in a CSV file, for example:

> ROW='1,2,3,4,5,6'
> echo $ROW | gawk -F, -vOFS=, '$2="X"'
1,X,3,4,5,6

      

However, I am dealing with a file whose quoted fields contain the delimiter. The column is read fine, but when the value is replaced, an extra separator is added:

> ROW='1,"2,3",4,5,6'
> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '{print $2}'
"2,3"
> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '$2="X"'
1,X,,4,5,6

      

This is what I was expecting:

> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '$2="X"'
1,X,4,5,6

      

The value "2,3" is replaced with "X", but an extra comma is left behind. How can I solve this?

EDIT: I forgot to mention that I also have empty fields, so a better example of a line would be:

ROW='1,,"2,3",4,5,6'

      

UPDATE 2: Based on Dawg's answer, I believe this is not possible in pure awk. While I agree that the Python solution is better, the only awk-based workaround is to add some pre- and post-processing to handle empty fields.

#!/bin/bash
ROW='1,,"2,3",4,"",5'
for col in {1..6}; do
    echo "$ROW" |
        sed 's:,,:, ,:g' |
        gawk -v c=$col -v OFS=, -v FPAT='([^,]+)|("[^"]*")' '$c="X"' |
        sed 's:, ,:,,:g'
done

      

Output:

X,,"2,3",4,"",5
1,X,"2,3",4,"",5
1,,X,4,"",5
1,,"2,3",X,"",5
1,,"2,3",4,X,5
1,,"2,3",4,"",X
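For comparison, here is a minimal Python sketch of the same per-column replacement (my own illustration, not from the original answers). Note one difference: Python's csv writer normalizes the quoted-empty field "" to a bare empty field.

```python
import csv
import io

ROW = '1,,"2,3",4,"",5'

def replace_col(line, col, value="X"):
    # csv.reader handles quoted delimiters and empty fields correctly
    fields = next(csv.reader([line]))
    fields[col - 1] = value
    out = io.StringIO()
    # QUOTE_MINIMAL re-quotes only fields that contain the delimiter
    csv.writer(out, quoting=csv.QUOTE_MINIMAL).writerow(fields)
    return out.getvalue().rstrip("\r\n")

for c in range(1, 7):
    print(replace_col(ROW, c))
```

Replacing column 3, for instance, yields 1,,X,4,,5 — the "" field comes back unquoted, but the field count and positions are preserved.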

      



3 answers


$ echo $ROW | awk -vOFS=, -vFPAT="([^,]+)|(\"[^\"]+\")" '$2="X"'
1,X,4,5,6

      

I used a pattern from the GNU Awk manual's section Defining Fields by Content.

Compare with * in the same pattern:

$ echo $ROW | awk -vOFS=, -vFPAT="([^,]*)|(\"[^\"]*\")" '$2="X"'
1,X,,4,5,6

      

So the answer is - in this limited example - to use -vFPAT="([^,]+)|(\"[^\"]+\")", but that then fails with empty fields, as in 1,"2,3",4,,"","should be 6th field"

Here is the result with both kinds of empty fields ( ,, and "" ):

$ echo $ROW2 | awk -vOFS=, -vFPAT="([^,]+)|(\"[^\"]+\")" '$2="X"'
1,X,4,"","should be 6th field"
      ^^                    - missing the ',,' field
            ^^^             - now the 5th field  -- BUG!

      

By convention, ROW2 should be treated as having 6 fields, with the empty fields ,, and "" each counting as one field. If you don't count empty fields as fields, every field after the gap shifts position. Add that to the list of complications of handling CSV with awk regexes.

Be aware that CSV is surprisingly complex, and handling all of its possibilities is not trivial with awk or regexes.

A better approach for CSV is to use Perl or Python, which have more complete, standardized CSV libraries. In Python's case, the csv module is part of the standard distribution.



Here is a Python solution that is fully RFC 4180 compliant:

$ echo $ROW | python -c '
> import csv, fileinput
> for line in csv.reader(fileinput.input()):
>     print ",".join(e if i!=1 else "X" for i, e in enumerate(line))'
1,X,4,5,6

      

This makes it easy to handle more complex CSV files.

Here are 4 CSV records of 5 fields each, with CRLF inside quoted fields, escaped quotes inside quoted fields, and both kinds of empty fields ( ,, and "" ):

1,"2,3",4,5,6
"11,12",13,14,15,16
21,"22,
23",24,25,"26
27"
31,,"33\"not 32\"","",35

      

With this same script (using repr to see the full field values, though you would normally use str), all these cases are handled correctly according to RFC 4180:

$ cat /tmp/3.csv | python -c '
import csv, fileinput
for line in csv.reader(fileinput.input()):
   print ",".join(repr(e) if i!=1 else "X" for i, e in enumerate(line))'
'1',X,'4','5','6'
'11,12',X,'14','15','16'
'21',X,'24','25','26\n27'
'31',X,'33\\not 32\\""','','35'

      

This is tricky with awk: \n ends every record, empty fields are handled incorrectly, and escaped quotes are not handled at all:

$ cat /tmp/3.csv | awk -vOFS=, -vFPAT='[^,]+|"[^"]*"' '$2="X"'
1,X,4,5,6
"11,12",X,14,15,16
21,X
23",X,25,"26
27",X
31,X,"",35

      

Now you would need to redefine RS with a regex that finds quotes spanning a CR and rejoins the short records in awk... add support for escaped quotes... build a more complex field-splitting regex... Difficult. Good luck!
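For contrast, a small sketch showing how csv.reader joins the physical lines of a quoted multiline record (using the third record from the sample file above):

```python
import csv
import io

data = '21,"22,\n23",24,25,"26\n27"\n'
# the reader treats newlines inside quoted fields as data, not record ends,
# so three physical lines parse as a single logical record
rows = list(csv.reader(io.StringIO(data)))
print(rows)
```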



Output for:

$ ROW='1,"2,3",4,5,6' 
$ echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"].*"' '$2="X"'
1,X,4,5,6

      

Both of these commands work fine; the * in the second command was dropped when pasting here.



Perl:

$var='1,"2,3",4,5,6';
$var=~s/\".*\"/X/g;
print $var;
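Note that the greedy .* in that substitution will swallow everything between the first and the last quote on the line, so two quoted fields get merged into one replacement. A quick demonstration (shown in Python, where the regex behaves the same way; the sample row with two quoted fields is my own):

```python
import re

row = '1,"2,3",4,"5,6",7'
# greedy: matches from the first quote to the last, merging both fields
print(re.sub(r'".*"', 'X', row))   # 1,X,7
# non-greedy: replaces each quoted field separately
print(re.sub(r'".*?"', 'X', row))  # 1,X,4,X,7
```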

      



  • $ echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"].*"' '$2="X"'

    ... the * must come after [^"]:

    echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"]*"' '$2="X"'

Both of these commands give the output 1,X,4,5,6 for ROW='1,"2,3",4,5,6'







