Join data from two files in bash
I'll show you an example of what I need to do with my data. I have two tab-separated text files.
cat in1.tsv
111 A B C
111 D E F
111 G H I
222 A B C
333 A B C
333 D E F
This table can have about a thousand rows and fewer than 100 columns. The first column can contain duplicate values (for example, 111 and 333).
cat in2.tsv
111 a b c
222 a b c
333 d e f
This file contains each value of column 1 only once. I need to join the two files on matching values in their first column.
cat output.tsv
111 A B C 111 a b c
111 D E F 111 a b c
111 G H I 111 a b c
222 A B C 222 a b c
333 A B C 333 d e f
333 D E F 333 d e f
My solution only works if both files have the same number of rows:
paste <(sort in1.tsv) <(sort in2.tsv) > output.tsv
I appreciate any help with awk, bash, or other programs that are fast for a lot of lines.
Awk to the rescue!
awk 'BEGIN{FS=OFS="\t"}FNR==NR{for(i=2;i<=NF;i++) map[$1]=(map[$1] FS $i); next}$1 in map{print $0,$1 map[$1]}' in2.tsv in1.tsv
This prints the result in tab-separated format, as expected. Change FS=OFS="\t" to FS="\t" if you don't want the output to be tab-separated.
As for the logic: first build a hash map, map[], keyed on column 1 of in2.tsv and holding the remaining columns of that row. Then, while reading in1.tsv, select the lines whose $1 is a key in that map and print each line followed by the stored contents.
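The same two-pass lookup can be sketched in Python. This is only an illustration of the logic described above, not a replacement for the awk one-liner; the function name hash_join and the sample rows (including the unmatched key 444) are made up for the example.

```python
def hash_join(in1_lines, in2_lines, sep="\t"):
    """Join rows of in1 to rows of in2 on the first column via a dict lookup."""
    # Pass 1: map each key in in2 to its full row (keys in in2 are unique).
    lookup = {}
    for line in in2_lines:
        row = line.rstrip("\n")
        lookup[row.split(sep, 1)[0]] = row
    # Pass 2: emit each in1 row followed by the matching in2 row, if any.
    joined = []
    for line in in1_lines:
        row = line.rstrip("\n")
        key = row.split(sep, 1)[0]
        if key in lookup:  # same role as `$1 in map` in the awk one-liner
            joined.append(row + sep + lookup[key])
    return joined


# Hypothetical sample data; key 444 has no partner and is dropped.
in1 = ["111\tA\tB\tC\n", "111\tD\tE\tF\n", "222\tA\tB\tC\n", "444\tX\tY\tZ\n"]
in2 = ["111\ta\tb\tc\n", "222\ta\tb\tc\n"]
for out in hash_join(in1, in2):
    print(out)
```

Because the lookup is a hash map, neither file needs to be sorted, and the rows come out in the order they appear in the first file.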
Here's a bash approach:
First sort each file:
LC_ALL=C sort init1.tsv -S75% -t$'\t' -k1,1 > init1.tsv.sorted
LC_ALL=C sort init2.tsv -S75% -t$'\t' -k1,1 > init2.tsv.sorted
Then, instead of using paste, use join to join them on the first column:
join init1.tsv.sorted init2.tsv.sorted -1 1 -2 1 -t$'\t'
If you want a specific type of join, such as a left outer join, I would do the following:
join init1.tsv.sorted init2.tsv.sorted -1 1 -2 1 -t$'\t' -a1
A quick note: -S sets how much memory sort may use for its buffer. The more you give it, the faster the sort tends to run on large files.
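To see why both inputs must be sorted on the key first, here is a minimal Python sketch of a merge join. It assumes the keys in the second file are unique, as the question states, and it is not how join itself is implemented; the name merge_join and the keep_unmatched flag are invented for the illustration.

```python
def merge_join(left_rows, right_rows, keep_unmatched=False):
    """Merge-join two lists of rows (lists of fields) sorted on field 0.

    Assumes keys in right_rows are unique, as in the question's second file.
    """
    out = []
    j = 0
    for row in left_rows:
        # Advance the right-hand cursor past keys smaller than the current one.
        while j < len(right_rows) and right_rows[j][0] < row[0]:
            j += 1
        if j < len(right_rows) and right_rows[j][0] == row[0]:
            out.append(row + right_rows[j][1:])  # matched: key printed once
        elif keep_unmatched:
            out.append(row)                      # unmatched: like join -a1
    return out


# Hypothetical sample data; sorted() plays the role of the sort step.
left = sorted([["333", "A", "B", "C"], ["111", "A", "B", "C"],
               ["222", "A", "B", "C"], ["111", "D", "E", "F"]])
right = sorted([["333", "d", "e", "f"], ["111", "a", "b", "c"]])

for row in merge_join(left, right, keep_unmatched=True):
    print("\t".join(row))
```

The right-hand cursor only ever moves forward, so a key that appears out of order would be missed. That is the reason for sorting both files with the same collation before joining.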
The join command seems to do almost what you want:
$ join in1.tsv in2.tsv
111 A B C a b c
111 D E F a b c
111 G H I a b c
222 A B C a b c
333 A B C d e f
333 D E F d e f
The default behavior is to join rows on the first column, print the key only once, and separate the output fields with a space. Using the format option -o gives the requested layout, with the key column repeated. Sorting is also required, as Dmitry Polonsky says in the comments:
join -o 1.1,1.2,1.3,1.4,2.1,2.2,2.3,2.4 <(sort in1.tsv) <(sort in2.tsv)
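Each entry in the -o list has the form FILENUM.FIELD, with both numbers starting at 1. As a rough illustration of how such a list picks fields from a matched pair of rows, here is a small Python sketch; the helper select_fields is hypothetical and handles only FILENUM.FIELD entries.

```python
def select_fields(row1, row2, spec):
    """Pick fields named as FILENUM.FIELD (1-based), e.g. "1.1,2.3"."""
    rows = {"1": row1, "2": row2}
    picked = []
    for item in spec.split(","):
        filenum, field = item.split(".")
        picked.append(rows[filenum][int(field) - 1])
    return picked


row1 = ["111", "A", "B", "C"]   # a row from in1.tsv
row2 = ["111", "a", "b", "c"]   # the matching row from in2.tsv
print(" ".join(select_fields(row1, row2, "1.1,1.2,1.3,1.4,2.1,2.2,2.3,2.4")))
# prints: 111 A B C 111 a b c
```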
In Python, without relying on sorted files:
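#!/usr/bin/env python3
with open("in1.tsv") as in1, open("in2.tsv") as in2:
    # Map each first-column value to its full line from in2.tsv.
    d = {line.split()[0]: line for line in in2}
    for line in in1:
        # d[...] keeps its trailing newline, hence end="".
        print(line.strip(), d[line.split()[0]], sep="\t", end="")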
This basically creates a mapping from the values of the first column to the rows of in2.tsv, then traverses the lines of in1.tsv and concatenates each one with the corresponding row of in2.tsv using that mapping.