Awk: full join of 2 files on one column

I have 2 CSV files that I want to merge using AWK.

file1.csv:

A1,B1,C1
"apple",1,2
"orange",2,3
"pear",5,4

      

file2.csv:

A2,D2,E2,F2
"apple",1,3,4
"peach",2,3,3
"pear",5,4,2
"mango",6,5,1

      

This is the result I want:

A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
"orange",2,3,NULL,NULL,NULL,NULL
"pear",5,4,"pear",5,4,2
NULL,NULL,NULL,"peach",2,3,3
NULL,NULL,NULL,"mango",6,5,1

      

I want to do a full join of file1 and file2 where A1 = A2. File2 has more lines than file1. For records that have no matching column value in the other file, NULL values should be inserted instead.



4 answers


You can use the standard utility join for simplicity.

Note: join requires sorted input, so the solution must sort the inputs first.

An approximate join:



tail -n +2 file1.csv | sort -t , -k 1,1 1>file3.csv;
tail -n +2 file2.csv | sort -t , -k 1,1 1>file4.csv;
paste -d, file1.csv file2.csv | head -n 1 1>output.txt;
join -a 1 -a 2 -t , -e NULL -1 1 -2 1 \
     -o 1.1,1.2,1.3,2.1,2.2,2.3,2.4 \
     file3.csv file4.csv 1>>output.txt;

      

Output:

A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
NULL,NULL,NULL,"mango",6,5,1
"orange",2,3,NULL,NULL,NULL,NULL
NULL,NULL,NULL,"peach",2,3,3
"pear",5,4,"pear",5,4,2

      
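The same join approach can be wrapped so that the `-o` field list is derived from the header widths instead of being typed by hand. This is only a sketch: the function name `full_join_sorted` is hypothetical, and it assumes one header line per file, the key in column 1, and no commas inside fields.

```shell
# full_join_sorted LEFT RIGHT  (hypothetical helper name)
# Full outer join of two CSV files on column 1, missing sides padded with NULL.
# Assumes one header line per file and no commas inside fields.
full_join_sorted() (
    left=$1
    right=$2
    n1=$(head -n 1 "$left"  | awk -F, '{ print NF }')
    n2=$(head -n 1 "$right" | awk -F, '{ print NF }')

    # Build the -o list 1.1,...,1.n1,2.1,...,2.n2 from the header widths.
    fmt=$(awk -v n1="$n1" -v n2="$n2" 'BEGIN {
        for (i = 1; i <= n1; i++) s = s "1." i ","
        for (i = 1; i <= n2; i++) s = s "2." i ","
        sub(/,$/, "", s); print s
    }')

    tmpdir=$(mktemp -d) || exit 1
    # Sort on the key field only, in the same locale that join will use.
    tail -n +2 "$left"  | LC_ALL=C sort -t , -k 1,1 > "$tmpdir/left"
    tail -n +2 "$right" | LC_ALL=C sort -t , -k 1,1 > "$tmpdir/right"

    printf '%s,%s\n' "$(head -n 1 "$left")" "$(head -n 1 "$right")"
    LC_ALL=C join -t , -a 1 -a 2 -e NULL -o "$fmt" "$tmpdir/left" "$tmpdir/right"
    rm -rf "$tmpdir"
)
```

It would be called as `full_join_sorted file1.csv file2.csv > output.txt`. The body runs in a subshell, so its variables do not leak into the calling shell.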



You can use this awk command:



awk -F, 'FNR==1{if (NR==1)print "A1,B1,C1,A2,D2,E2,F2";next} 
         FNR==NR{a[$1]=$0;next}
         {print $0 FS (($1 in a)? a[$1]:"NULL,NULL,NULL,NULL"); delete a[$1]}
         END{for (i in a) print "NULL,NULL,NULL," a[i]}' file2.csv file1.csv
A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
"orange",2,3,NULL,NULL,NULL,NULL
"pear",5,4,"pear",5,4,2
NULL,NULL,NULL,"mango",6,5,1
NULL,NULL,NULL,"peach",2,3,3
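A variant of the same idea that avoids hard-coding the header and the NULL counts, and that prints unmatched rows of the second file in their original order, might look like the sketch below. The function name `full_join_awk` and the helper `nulls` are hypothetical; the sketch assumes one header line per file, non-empty files, and no commas inside fields.

```shell
# full_join_awk LEFT RIGHT  (hypothetical helper name)
# Full outer join on column 1; RIGHT is read first and held in memory.
full_join_awk() {
    awk -F, '
        function nulls(n,   s, i) {            # n NULLs joined by commas
            s = "NULL"
            for (i = 2; i <= n; i++) s = s ",NULL"
            return s
        }
        NR == 1   { head = $0; n2 = NF; next }           # header of RIGHT
        FNR == NR {                                       # body of RIGHT
            if (!($1 in row)) order[++cnt] = $1           # remember first-seen order
            row[$1] = $0
            next
        }
        FNR == 1  { print $0 FS head; n1 = NF; next }     # header of LEFT
        {
            print $0 FS (($1 in row) ? row[$1] : nulls(n2))
            delete row[$1]                                # mark as matched
        }
        END {
            # Unmatched RIGHT rows, in the order they appeared in RIGHT.
            for (i = 1; i <= cnt; i++)
                if (order[i] in row) print nulls(n1) FS row[order[i]]
        }
    ' "$2" "$1"
}
```

Called as `full_join_awk file1.csv file2.csv`, this keeps the rows of file1.csv in file order and then appends the unmatched rows of file2.csv in file order, at the cost of one extra array.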

      



Try this:

awk -F',' 'BEGIN{flag=2}NR==FNR{if(flag==2){head=$0;--flag;}else{a[$1]=$0}}
NR>FNR{if(flag==1){print head","$0;flag=0}else{if(a[$1]){print a[$1]","$0;delete a[$1]}
else{print "NULL,NULL,NULL,"$0}}}END{for(i in a){if(a[i]){print a[i]",NULL,NULL,NULL,NULL"}}}' file1.csv file2.csv


Output:

A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
NULL,NULL,NULL,"peach",2,3,3
"pear",5,4,"pear",5,4,2
NULL,NULL,NULL,"mango",6,5,1
"orange",2,3,NULL,NULL,NULL,NULL

      



For completeness, a bash script can also solve the problem, with some limitations. The following script works for the example, but it expects matching keys to be on the same line number in each of the input files. If that is not the case for your csv files, the files need to be checked and preprocessed first.

This was more of an exercise than a contender, but it ended up reproducing the requested row order as well as or better than some of the other solutions. If you have any questions, let me know:

#!/bin/bash

declare -i l1=0   # lines in file 1
declare -i l2=0   # lines in file 2
declare -i a1s=0  # array 1 stride
declare -i a2s=0  # array 2 stride

while IFS=, read -r -a fields; do   ## fill array from file1 (split on commas only)
    a1+=( "${fields[@]}" )
    ((l1++))
done <"$1"

while IFS=, read -r -a fields; do   ## fill array from file2 (split on commas only)
    a2+=( "${fields[@]}" )
    ((l2++))
done <"$2"

a1s=$((${#a1[@]}/l1))   ## stride of array 1
a2s=$((${#a2[@]}/l2))   ## stride of array 2

[ $l1 -lt $l2 ] && lim=$l1 || lim=$l2   ## which has more rows?

for ((i = 0; i < lim; i++)); do         ## for common rows
    if [ "$i" -eq 0 ] || [ "${a1[$((i*a1s))]}" = "${a2[$((i*a2s))]}" ]; then
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "%s" ${a1[$((i*a1s+j))]} || printf ",%s" ${a1[$((i*a1s+j))]}
        done
        for ((j = 0; j < a2s; j++)); do printf ",%s" ${a2[$((i*a2s+j))]}; done
        printf "\n"
    else
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "%s" ${a1[$((i*a1s+j))]} || printf ",%s" ${a1[$((i*a1s+j))]}
        done
        for ((j = 0; j < a2s; j++)); do printf ",NULL"; done
        printf "\n"
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "NULL" || printf ",NULL"
        done
        for ((j = 0; j < a2s; j++)); do printf ",%s" ${a2[$((i*a2s+j))]}; done
        printf "\n"
    fi
done

if [ $l1 -lt $l2 ]; then    ## for excess rows (longest row-wise)
    last=$l2
    for ((i = lim; i < last; i++)); do
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "NULL" || printf ",NULL"
        done
        for ((j = 0; j < a2s; j++)); do printf ",%s" ${a2[$((i*a2s+j))]}; done
        printf "\n"
    done
else
    last=$l1
    for ((i = lim; i < last; i++)); do
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "%s" ${a1[$((i*a1s+j))]} || printf ",%s" ${a1[$((i*a1s+j))]}
        done
        for ((j = 0; j < a2s; j++)); do printf ",NULL"; done
        printf "\n"
    done
fi

exit 0

      

Output

$ bash ../read2redir.sh A1.txt A2.txt
A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
"orange",2,3,NULL,NULL,NULL,NULL
NULL,NULL,NULL,"peach",2,3,3
"pear",5,4,"pear",5,4,2
NULL,NULL,NULL,"mango",6,5,1
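The limitation mentioned above (a key present in both files has to sit on the same line number in each) can be tested before running the script. The following is a minimal sketch of such a check; the function name `check_aligned` is hypothetical, and it assumes comma-separated fields with the key in column 1 and a non-empty first file.

```shell
# check_aligned LEFT RIGHT  (hypothetical helper name)
# Report keys that occur in both files but on different line numbers.
# Exit status is 1 if any such key is found, 0 otherwise.
check_aligned() {
    awk -F, '
        FNR == NR { line[$1] = FNR; next }   # remember where each LEFT key sits
        ($1 in line) && line[$1] != FNR {
            printf "%s: line %d in first file, line %d in second\n", $1, line[$1], FNR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1" "$2"
}
```

For the example files it prints nothing and returns 0, because the two keys that occur in both files sit on lines 2 and 4 in each. A non-zero status means the row-by-row comparison in the script would miss at least one match.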

      
