Using strings from multiple input files as search criteria for selected columns in a CSV file using AWK

The nature of the problem:

I have a CSV file with 10 columns, 4 of which contain disease codes. Let's say these are columns 1-4. I have two text files that contain the "include" and "exclude" codes.

The include file contains n lines, one code per line.

Example:

123
12300
12301
124
12400
12401
1250

      

The exclude (exception) file contains m lines, also one code per line.

Example:

456
457
458
459

      

A truncated version of the CSV file looks like this:

D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3
567,1250,010,321,A4,B4,C4,D4,E4,F4

      

Using AWK, how can I take the two files named inclusion and exclusion, together with the CSV file, and get output like this:

D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

      

The CSV file can have millions of lines, while the inclusion and exclusion files may have dozens of lines each. This is not homework, and I appreciate the help.



1 answer


Using grep

$ head -n1 <file; grep -E "(^|,)($(tr '\n' '|' <inclusion))(,|$)" file | grep -Ev "(^|,)($(tr '\n' '|' <exclusion))(,|$)"
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4
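
For readability, the same pipeline can be split up by storing the regexes in shell variables first (a minor rearrangement of the command above; the variable names inc_re and exc_re are only illustrative):

inc_re="(^|,)($(tr '\n' '|' <inclusion))(,|$)"
exc_re="(^|,)($(tr '\n' '|' <exclusion))(,|$)"
head -n1 file
grep -E "$inc_re" file | grep -Ev "$exc_re"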

      

Using awk

$ awk -v inc="(^|,)($(tr '\n' '|' <inclusion))(,|$)" -v exc="(^|,)($(tr '\n' '|' <exclusion))(,|$)" 'NR==1 || ($0 ~ inc && ! ($0 ~ exc))' file
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

      

How it works

For both the grep and awk solutions, the key step is to create a regular expression from the include and exclude files. Since it is shorter, take exclusion as an example. We can create a regular expression for it like this:

$ echo "(^|,)($(tr '\n' '|' <exclusion))(,|$)"
(^|,)(456|457|458|459|)(,|$)
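
Note that because the last line of exclusion also ends with a newline, tr leaves a trailing |, which adds an empty alternative to the group. For the sample data this is harmless, but if your CSV may contain empty fields you can build the group without it, for example with paste (a small variation on the command above, not required here):

$ echo "(^|,)($(paste -sd'|' exclusion))(,|$)"
(^|,)(456|457|458|459)(,|$)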

      

The regular expression for inclusion is built the same way. After creating the include and exclude regexes, we can use them with either grep or awk. When using awk, we use the condition:



NR==1 || ($0 ~ inc && ! ($0 ~ exc))

      

If this condition is true, awk performs its default action, which is to print the line. The condition is true if (1) we are on the first (header) line, NR==1, or (2) the line matches the include regex inc and does not match the exclude regex exc. For example, the sample row 12301,456,00,145,... matches inc (via 12301) but also matches exc (via 456), so it is not printed.
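
To check what either regex matches on its own, you can run it separately; for example, with the sample data the exclude regex matches the two rows containing the code 456:

$ awk -v exc="(^|,)($(tr '\n' '|' <exclusion))(,|$)" '$0 ~ exc' file
890,001,456,0009,A2,B2,C2,D2,E2,F2
12301,456,00,145,A3,B3,C3,D3,E3,F3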

Alternative awk solution

$ gawk -F, -v inc="$(<inclusion)" -v exc="$(<exclusion)" 'BEGIN{n=split(inc,x,"\n"); for (j=1;j<=n;j++)incl[x[j]]=1; n=split(exc,x,"\n"); for (j=1;j<=n;j++)excl[x[j]]=1;} NR==1{print;next} {p=0;for (j=1;j<=NF;j++) if ($j in incl)p=1; for (j=1;j<=NF;j++) if ($j in excl) p=0;} p' file
D1,D2,D3,D4,A,B,C,D,E,F
123,00,145,567,A1,B1,C1,D1,E1,F1
567,1250,010,321,A4,B4,C4,D4,E4,F4

      

The same code, written over several lines, looks like this:

gawk -F, -v inc="$(<inclusion)" -v exc="$(<exclusion)" '
BEGIN{
    # Build lookup arrays from the newline-separated code lists
    n=split(inc,x,"\n")
    for (j=1;j<=n;j++) incl[x[j]]=1
    n=split(exc,x,"\n")
    for (j=1;j<=n;j++) excl[x[j]]=1
}
NR==1{
    # Always print the header line
    print
    next
}

{
    # Mark the line for printing if any field is an include code,
    # then unmark it if any field is an exclude code
    p=0
    for (j=1;j<=NF;j++) if ($j in incl) p=1
    for (j=1;j<=NF;j++) if ($j in excl) p=0
}
p
' file

      

The BEGIN block creates the arrays incl and excl from the contents of inclusion and exclusion. Any line with a field in incl is marked for printing with p=1. If, however, the line also contains a field in excl, then p is reset to false with p=0.
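
As an alternative to passing the code lists through -v, a sketch using gawk's ARGIND variable (not part of the answer above) reads the two code files as ordinary input before the CSV:

gawk -F, '
ARGIND==1 {incl[$1]=1; next}    # first argument: include codes
ARGIND==2 {excl[$1]=1; next}    # second argument: exclude codes
FNR==1 {print; next}            # header line of the CSV
{
    p=0
    for (j=1;j<=NF;j++) if ($j in incl) p=1
    for (j=1;j<=NF;j++) if ($j in excl) p=0
}
p
' inclusion exclusion file

Here inclusion and exclusion must be listed before file on the command line, since ARGIND tracks which input file gawk is currently reading.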
