Using strings from multiple input files as search criteria for selected columns in a CSV file using AWK
The nature of the problem:
I have a CSV file with 10 columns, of which 4 columns indicate disease codes. Let's say these are columns 1 - 4. I have 2 text files that contain the "include" and "exclude" codes.
The include file looks like this: a file with n input lines, each code on its own line.
Example:
123
12300
12301
124
12400
12401
1250
The exclusion file looks like this: a file with m input lines, each code also on its own line.
Example:
456
457
458
459
The truncated version of the CSV file will look like this:
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4
Using AWK, how can I take the 2 files named inclusion and exclusion, together with the CSV file, and return this:
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4
The CSV file can have millions of lines, and the files inclusion and exclusion may each have dozens of lines. This is not homework and I appreciate the help.
Using grep
$ head -n1 <file; grep -E "(^|,)($(tr '\n' '|' <inclusion))(,|$)" file | grep -Ev "(^|,)($(tr '\n' '|' <exclusion))(,|$)"
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4
Using awk
$ awk -v inc="(^|,)($(tr '\n' '|' <inclusion))(,|$)" -v exc="(^|,)($(tr '\n' '|' <exclusion))(,|$)" 'NR==1 || ($0 ~ inc && ! ($0 ~ exc))' file
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4
How it works
For both the grep and awk solutions, the key step is to create a regular expression that matches the codes in the include or exclude file. Since it is shorter, take exclusion as an example. We can create a regular expression for it like this:
$ echo "(^|,)($(tr '\n' '|' <exclusion))(,|$)"
(^|,)(456|457|458|459|)(,|$)
The regular expression for inclusion is built in the same way. After creating the include and exclude regexes, we can use them with either grep or awk. When using awk, we use the condition:
NR==1 || ($0 ~ inc && ! ($0 ~ exc))
If this condition is true, awk performs its default action, which is to print the line. The condition is true if (1) we are on the first line, NR==1, or if (2) the line matches the include regular expression, inc, and does not match the exclude regular expression, exc.
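One caveat with the pattern shown above: because tr also turns the final newline into a `|`, the alternation ends with an empty branch, which GNU grep and gawk accept and which can then also match an empty field. A minimal sketch of one way around this, assuming the sample data from the question and a scratch directory (the names tmpdir, inc_re, exc_re and result are only for illustration), is to join the codes with paste instead:

```shell
# Sketch only: recreate the question's sample files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 123 12300 12301 124 12400 12401 1250 > "$tmpdir/inclusion"
printf '%s\n' 456 457 458 459 > "$tmpdir/exclusion"
cat > "$tmpdir/file" <<'EOF'
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4
EOF

# paste -s joins all lines with the delimiter and leaves no trailing "|".
inc_re="(^|,)($(paste -sd'|' "$tmpdir/inclusion"))(,|\$)"
exc_re="(^|,)($(paste -sd'|' "$tmpdir/exclusion"))(,|\$)"

# Same condition as above: keep the header, plus lines that match the
# include pattern and do not match the exclude pattern.
result=$(awk -v inc="$inc_re" -v exc="$exc_re" \
    'NR==1 || ($0 ~ inc && $0 !~ exc)' "$tmpdir/file")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

With the sample data this prints the header and the A1 and A4 rows, the same result as the commands above.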
Alternative awk solution
$ gawk -F, -v inc="$(<inclusion)" -v exc="$(<exclusion)" 'BEGIN{n=split(inc,x,"\n"); for (j=1;j<=n;j++)incl[x[j]]=1; n=split(exc,x,"\n"); for (j=1;j<=n;j++)excl[x[j]]=1;} NR==1{print;next} {p=0;for (j=1;j<=NF;j++) if ($j in incl)p=1; for (j=1;j<=NF;j++) if ($j in excl) p=0;} p' file
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4
The same code, written over several lines, looks like this:
gawk -F, -v inc="$(<inclusion)" -v exc="$(<exclusion)" '
BEGIN{
    n=split(inc,x,"\n")
    for (j=1;j<=n;j++) incl[x[j]]=1
    n=split(exc,x,"\n")
    for (j=1;j<=n;j++) excl[x[j]]=1
}
NR==1{
    print
    next
}
{
    p=0
    for (j=1;j<=NF;j++) if ($j in incl) p=1
    for (j=1;j<=NF;j++) if ($j in excl) p=0
}
p
' file
The code above builds the arrays incl and excl from the contents of the files inclusion and exclusion. Any line with a field in incl is marked for printing, p=1. If, however, the line also contains a field in excl, then p is set back to false, p=0.
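Both solutions above test every field of a line, whereas the question says only columns 1 - 4 hold disease codes. A possible variation, offered as a sketch rather than the answer's own method, has awk read the two code lists as ordinary file arguments (so it does not depend on the bash-only `$(<file)` form) and checks only the first four fields. The scratch directory and the names keep, drop and filtered are assumptions for illustration:

```shell
# Sketch only: recreate the question's sample files in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 123 12300 12301 124 12400 12401 1250 > "$tmpdir/inclusion"
printf '%s\n' 456 457 458 459 > "$tmpdir/exclusion"
cat > "$tmpdir/file" <<'EOF'
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4
EOF

filtered=$(awk -F, '
    # First file argument: include codes (blank lines are skipped).
    FILENAME == ARGV[1] { if ($0 != "") incl[$0] = 1; next }
    # Second file argument: exclude codes.
    FILENAME == ARGV[2] { if ($0 != "") excl[$0] = 1; next }
    # Third file argument: the CSV. Always keep its header line.
    FNR == 1 { print; next }
    {
        keep = 0; drop = 0
        # Only columns 1-4 carry disease codes.
        for (j = 1; j <= 4; j++) {
            if ($j in incl) keep = 1
            if ($j in excl) drop = 1
        }
        if (keep && !drop) print
    }
' "$tmpdir/inclusion" "$tmpdir/exclusion" "$tmpdir/file")
printf '%s\n' "$filtered"

rm -rf "$tmpdir"
```

Because the lookups are array membership tests rather than one long alternation, the cost per line stays small even when the code lists grow, which may matter for a CSV with millions of lines.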