Awk: full outer join of 2 files on one column
I have 2 CSV files that I want to merge using AWK.
file1.csv:
A1,B1,C1
"apple",1,2
"orange",2,3
"pear",5,4
file2.csv:
A2,D2,E2,F2
"apple",1,3,4
"peach",2,3,3
"pear",5,4,2
"mango",6,5,1
This is the result I want:
A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
"orange",2,3,NULL,NULL,NULL,NULL
"pear",5,4,"pear",5,4,2
NULL,NULL,NULL,"peach",2,3,3
NULL,NULL,NULL,"mango",6,5,1
I want to do a full join of file1 and file2 where A1 = A2. File2 has more lines than file1. For records that have no matching value in the other file, NULL should be inserted instead.
You can use the standard utility join for simplicity.
Note: join requires sorted input, so the solution must sort the inputs first.
A rough version of the join:
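# strip the headers and sort both files on the first comma-separated field
tail -n +2 file1.csv | sort -t, -k1,1 > file3.csv
tail -n +2 file2.csv | sort -t, -k1,1 > file4.csv
# build the combined header from the first line of each file
paste -d, file1.csv file2.csv | head -n 1 > output.txt
# full outer join: -a 1 -a 2 keeps unpaired lines, -e NULL fills missing fields
join -a 1 -a 2 -t , -e NULL -1 1 -2 1 \
-o 1.1,1.2,1.3,2.1,2.2,2.3,2.4 \
file3.csv file4.csv >> output.txt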
Output (contents of output.txt):
A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
NULL,NULL,NULL,"mango",6,5,1
"orange",2,3,NULL,NULL,NULL,NULL
NULL,NULL,NULL,"peach",2,3,3
"pear",5,4,"pear",5,4,2
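A variation on the sort-then-join approach, as a minimal sketch rather than a definitive script. sort and join have to agree on collation, and the GNU coreutils documentation suggests running both under LC_ALL=C to guarantee that. The function name full_join_sorted and the use of a mktemp directory are my own choices for illustration. The -o list is hard-coded for the 3-column and 4-column sample files.

```shell
# full_join_sorted is a hypothetical helper name used only for this sketch.
# Usage: full_join_sorted LEFT.csv RIGHT.csv
full_join_sorted() {
    tmpdir=$(mktemp -d) || return 1

    # Drop the header lines and sort on the first comma-separated field,
    # using the same locale that join will use below.
    tail -n +2 "$1" | LC_ALL=C sort -t, -k1,1 > "$tmpdir/left"
    tail -n +2 "$2" | LC_ALL=C sort -t, -k1,1 > "$tmpdir/right"

    # Combined header: first line of each file, separated by a comma.
    printf '%s,%s\n' "$(head -n 1 "$1")" "$(head -n 1 "$2")"

    # Full outer join on field 1; unpaired lines are kept (-a 1 -a 2)
    # and their missing fields are filled with NULL (-e NULL).
    LC_ALL=C join -t, -a 1 -a 2 -e NULL \
        -o 1.1,1.2,1.3,2.1,2.2,2.3,2.4 "$tmpdir/left" "$tmpdir/right"

    rm -rf "$tmpdir"
}
```

Calling full_join_sorted file1.csv file2.csv writes the joined rows to standard output, sorted by key, so redirect it if you want a file.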
You can use this awk command:
awk -F, 'FNR==1{if (NR==1)print "A1,B1,C1,A2,D2,E2,F2";next}
FNR==NR{a[$1]=$0;next}
{print $0 FS (($1 in a)? a[$1]:"NULL,NULL,NULL,NULL"); delete a[$1]}
END{for (i in a) print "NULL,NULL,NULL," a[i]}' file2.csv file1.csv
A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
"orange",2,3,NULL,NULL,NULL,NULL
"pear",5,4,"pear",5,4,2
NULL,NULL,NULL,"mango",6,5,1
NULL,NULL,NULL,"peach",2,3,3
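The order of a for (i in a) loop is unspecified in awk, so the unmatched file2 rows printed from END can come out in any order. If the order from the question matters (file1 order first, then leftover file2 rows in file2 order), something like the following sketch may help. The function name full_join_ordered is hypothetical, the NULL padding is computed from the header widths instead of being hard-coded, and it assumes both files are non-empty and plain comma-separated with no commas inside quoted fields.

```shell
# full_join_ordered is a hypothetical helper name, not part of the answer above.
# Usage: full_join_ordered LEFT.csv RIGHT.csv
full_join_ordered() {
    awk -F, '
        # Build a string of n comma-separated NULLs.
        function nulls(n,    s, i) {
            for (i = 1; i <= n; i++) s = s (i > 1 ? "," : "") "NULL"
            return s
        }
        # First file on the command line (the right-hand file):
        # remember each row by key, plus the order the keys appeared in.
        FNR == NR {
            if (FNR == 1) { rhead = $0; rcols = NF; next }
            row[$1] = $0
            order[++n] = $1
            next
        }
        # Second file on the command line (the left-hand file).
        FNR == 1 { print $0 FS rhead; lcols = NF; next }
        {
            if ($1 in row) { print $0 FS row[$1]; delete row[$1] }
            else print $0 FS nulls(rcols)
        }
        END {
            # Unmatched right-hand rows, in their original file order.
            for (i = 1; i <= n; i++)
                if (order[i] in row) {
                    print nulls(lcols) FS row[order[i]]
                    delete row[order[i]]
                }
        }
    ' "$2" "$1"
}
```

With the sample files, full_join_ordered file1.csv file2.csv prints the rows in the same order as the result requested in the question.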
Try this:
awk -F',' 'BEGIN{flag=2}NR==FNR{if(flag==2){head=$0;--flag;}else{a[$1]=$0}}
NR>FNR{if(flag==1){print head","$0;flag=0}else{if(a[$1]){print a[$1]","$0;delete a[$1]}
else{print "NULL,NULL,NULL,"$0}}}END{for(i in a){if(a[i]){print a[i]",NULL,NULL,NULL,NULL"}}}' \
file1.csv file2.csv
Output:
A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
NULL,NULL,NULL,"peach",2,3,3
"pear",5,4,"pear",5,4,2
NULL,NULL,NULL,"mango",6,5,1
"orange",2,3,NULL,NULL,NULL,NULL
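A side note on print in awk, since it decides whether joined rows come out with a comma or a space between the two halves: items separated by a comma in a print statement are joined with OFS, which is a single space by default, while strings written next to each other are simply concatenated. A small sketch of the difference (both function names are made up for the illustration):

```shell
# Both function names are hypothetical, chosen only for this sketch.

# Default OFS: the comma in print inserts a single space.
with_default_ofs() { awk -F, '{ print $1, $2 }'; }

# FS and OFS both set to a comma: the comma in print inserts a comma.
with_comma_ofs() { awk 'BEGIN { FS = OFS = "," } { print $1, $2 }'; }
```

For example, printf '"apple",1\n' | with_default_ofs prints "apple" 1, while the same input piped to with_comma_ofs prints "apple",1.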
For completeness, a bash script can also solve the problem, with some limitations. The following script works for the example, but it expects matching records to sit on the same line number in each of the input files. If that is not the case with your csv files, a simple preprocessing pass over the files is needed first.
It was more of an exercise than a contender, but it ended up reproducing the requested row order as well as or better than some of the other solutions. If you have any questions, let me know:
#!/bin/bash
declare -i l1=0     # lines in file 1
declare -i l2=0     # lines in file 2
declare -i a1s=0    # array 1 stride
declare -i a2s=0    # array 2 stride
while read -r line; do    ## fill array from file1
    a1+=( $(tr ',' ' ' <<<$line) )
    ((l1++))
done <"$1"
while read -r line; do    ## fill array from file2
    a2+=( $(tr ',' ' ' <<<$line) )
    ((l2++))
done <"$2"
a1s=$((${#a1[@]}/l1))    ## stride of array 1
a2s=$((${#a2[@]}/l2))    ## stride of array 2
[ $l1 -lt $l2 ] && lim=$l1 || lim=$l2    ## which has more rows?
for ((i = 0; i < lim; i++)); do    ## for common rows
    if [ $i -eq 0 -o ${a1[$((i*a1s))]} = ${a2[$((i*a2s))]} ]; then
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "%s" ${a1[$((i*a1s+j))]} || printf ",%s" ${a1[$((i*a1s+j))]}
        done
        for ((j = 0; j < a2s; j++)); do printf ",%s" ${a2[$((i*a2s+j))]}; done
        printf "\n"
    else
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "%s" ${a1[$((i*a1s+j))]} || printf ",%s" ${a1[$((i*a1s+j))]}
        done
        for ((j = 0; j < a2s; j++)); do printf ",NULL"; done
        printf "\n"
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "NULL" || printf ",NULL"
        done
        for ((j = 0; j < a2s; j++)); do printf ",%s" ${a2[$((i*a2s+j))]}; done
        printf "\n"
    fi
done
if [ $l1 -lt $l2 ]; then    ## for excess rows (longest row-wise)
    last=$l2
    for ((i = lim; i < last; i++)); do
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "NULL" || printf ",NULL"
        done
        for ((j = 0; j < a2s; j++)); do printf ",%s" ${a2[$((i*a2s+j))]}; done
        printf "\n"
    done
else
    last=$l1
    for ((i = lim; i < last; i++)); do
        for ((j = 0; j < a1s; j++)); do
            [ $j -eq 0 ] && printf "%s" ${a1[$((i*a1s+j))]} || printf ",%s" ${a1[$((i*a1s+j))]}
        done
        for ((j = 0; j < a2s; j++)); do printf ",NULL"; done
        printf "\n"
    done
fi
exit 0
Output
$ bash ../read2redir.sh A1.txt A2.txt
A1,B1,C1,A2,D2,E2,F2
"apple",1,2,"apple",1,3,4
"orange",2,3,NULL,NULL,NULL,NULL
NULL,NULL,NULL,"peach",2,3,3
"pear",5,4,"pear",5,4,2
NULL,NULL,NULL,"mango",6,5,1