Join data from two files in bash

I'll show you an example of what I need to do with my data. I have two tab-separated text files.

cat in1.tsv

111 A B C
111 D E F
111 G H I
222 A B C
333 A B C
333 D E F

      

This table can have about a thousand rows and fewer than 100 columns. The first column can contain duplicate values (for example, 111 and 333).

cat in2.tsv

111 a b c 
222 a b c 
333 d e f

      

In this file, each value in column 1 appears only once. I need to join the two files on matching values in their first column.

cat output.tsv

111 A B C 111 a b c
111 D E F 111 a b c
111 G H I 111 a b c
222 A B C 222 a b c 
333 A B C 333 d e f
333 D E F 333 d e f 

      

My solution works only if both files have the same number of rows:

paste  <(sort in1.tsv) <(sort in2.tsv) > output.tsv

      

I appreciate any help with awk, bash, or other programs that are fast for a lot of lines.

+3




5 answers


Awk to the rescue!

awk 'BEGIN{FS=OFS="\t"}FNR==NR{for(i=2;i<=NF;i++) map[$1]=(map[$1] FS $i); next}$1 in map{print $0,$1 map[$1]}' in2.tsv in1.tsv

      



This prints the result in tab-separated format, as you would expect. Delete the OFS="\t" assignment if you don't want tab-separated output.

As for the logic: the first pass reads in2.tsv and stores the remaining columns of each row in the hash map map[], keyed by the value in column 1. The second pass reads in1.tsv, selects the lines whose $1 is a key in that map, and prints each such line followed by the stored contents.
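The two-pass logic described above may be easier to follow when spread over several lines. This is only a sketch of a variant, not the one-liner itself: it stores each whole row of the lookup file instead of rebuilding it field by field, the function name lookup_join is made up for illustration, and the sample rows are a shortened copy of the question's data.

```shell
# Hypothetical helper: hash-join two tab-separated files on column 1.
# Usage: lookup_join LOOKUP_FILE MAIN_FILE
lookup_join() {
    awk 'BEGIN { FS = OFS = "\t" }
         # First file: remember each whole row under its first field.
         FNR == NR { row[$1] = $0; next }
         # Second file: print the row plus its match, if one exists.
         $1 in row { print $0, row[$1] }' "$1" "$2"
}

# Sample data, shortened from the question.
tmp=$(mktemp -d)
printf '111\tA\tB\tC\n111\tD\tE\tF\n222\tA\tB\tC\n' > "$tmp/in1.tsv"
printf '111\ta\tb\tc\n222\ta\tb\tc\n' > "$tmp/in2.tsv"
lookup_join "$tmp/in2.tsv" "$tmp/in1.tsv"
```

Neither input has to be sorted, because the lookup file is held in memory. Rows of the main file whose key is missing from the lookup file are silently dropped.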

+3




Here's a bash approach:

First, sort each file:

LC_ALL=C sort init1.tsv -S75% -t$'\t' -k1,1 > init1.tsv.sorted

LC_ALL=C sort init2.tsv -S75% -t$'\t' -k1,1 > init2.tsv.sorted

      

Then, instead of using paste, let's join them on the first column:



join init1.tsv.sorted init2.tsv.sorted -1 1 -2 1 -t$'\t'

      

If you want a specific type of join, such as a left outer join, I would do the following:

join init1.tsv.sorted init2.tsv.sorted -1 1 -2 1 -t$'\t' -a1

      

A quick note: -S sets how much memory sort may use for its buffer. The faster you want the operation to run, the more memory you should give it.
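For anyone who wants the sort and join steps in one place, here is a rough sketch. The function name sorted_join and the temporary directory are my own additions, and the sample rows are invented to show the effect of -a 1. It is written in portable sh (so it runs under bash too) and uses temporary files rather than process substitution.

```shell
# Hypothetical wrapper: sort both inputs on column 1, then left-outer-join them.
# Usage: sorted_join FILE1 FILE2
sorted_join() {
    tab=$(printf '\t')
    work=$(mktemp -d)
    # Using the same C locale for sort and join keeps their orderings consistent.
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$work/left"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$work/right"
    # -a 1 also prints rows of the first file that have no partner.
    LC_ALL=C join -t "$tab" -1 1 -2 1 -a 1 "$work/left" "$work/right"
    rm -rf "$work"
}

# Invented sample rows: key 444 has no match in the second file.
tmp=$(mktemp -d)
printf '333\tD\n111\tA\n444\tX\n' > "$tmp/init1.tsv"
printf '111\ta\n333\td\n' > "$tmp/init2.tsv"
sorted_join "$tmp/init1.tsv" "$tmp/init2.tsv"
```

Note that join prints the key only once per output row, and the unmatched 444 row comes out on its own because of -a 1.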

+2




The join command seems to do almost what you want:

$ join in1.tsv in2.tsv
111 A B C a b c
111 D E F a b c
111 G H I a b c
222 A B C a b c
333 A B C d e f
333 D E F d e f

      

The default behavior is to join rows on the first column, using a space as the output separator and printing the key only once. The format option -o lets us list the fields explicitly, which reproduces the requested output with the key repeated. The input must also be sorted, as Dmitry Polonsky says in the comments:

join -o 1.1,1.2,1.3,1.4,2.1,2.2,2.3,2.4 <(sort in1.tsv) <(sort in2.tsv)
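Typing out the -o list by hand gets tedious with close to 100 columns, so it could be generated instead. The sketch below is just one way to do that: the helper names field_list, count_cols and join_all_fields are invented here, the column count is taken from the first line of each file, and the files are assumed to be tab-separated with the same number of columns on every row.

```shell
# Hypothetical helper: print "F.1,F.2,...,F.N" for file number F with N columns.
field_list() {
    awk -v f="$1" -v n="$2" 'BEGIN {
        for (i = 1; i <= n; i++) printf "%s%s.%d", (i > 1 ? "," : ""), f, i
        print ""
    }'
}

# Hypothetical helper: count tab-separated columns on the first line of a file.
count_cols() {
    awk -F '\t' 'NR == 1 { print NF; exit }' "$1"
}

# Sort both files on column 1, then join them, printing every field of both.
join_all_fields() {
    tab=$(printf '\t')
    work=$(mktemp -d)
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$work/left"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$work/right"
    fmt="$(field_list 1 "$(count_cols "$1")"),$(field_list 2 "$(count_cols "$2")")"
    LC_ALL=C join -t "$tab" -o "$fmt" "$work/left" "$work/right"
    rm -rf "$work"
}

# Sample rows shortened from the question.
tmp=$(mktemp -d)
printf '111\tA\tB\n111\tD\tE\n222\tA\tB\n' > "$tmp/in1.tsv"
printf '111\ta\tb\n222\tc\td\n' > "$tmp/in2.tsv"
join_all_fields "$tmp/in1.tsv" "$tmp/in2.tsv"
```

Unlike the plain join call, this passes -t with a tab, so the output stays tab-separated.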

      

+2




In Python, without relying on sorted files:

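#!/usr/bin/env python3

with open("in1.tsv") as in1, open("in2.tsv") as in2:
    # Map each first-column value to its full row from in2.tsv.
    d = {line.split()[0]: line for line in in2}
    for line in in1:
        # The stored row keeps its trailing newline, hence end="".
        print(line.strip(), d[line.split()[0]], sep="\t", end="")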

      

This basically creates a mapping from the values of the first column to the rows of in2.tsv, then traverses the lines of in1.tsv and concatenates each of them with the corresponding row of in2.tsv using that mapping.

+2




This might work for you (GNU sed):

 sed -r 's#^(\S+)\s.*#/^\1/s/$/ &/#' file2 | sed -f - file

      

This creates a sed script from the second file. The script consists of one command per row: a regular expression that, when it matches a line of the first file, appends the corresponding row of the second file to that line.
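It may help to look at the intermediate script before piping it onward. The sketch below assumes GNU sed (for -r, \S and \s); the function name gen_script and the two-row sample files are made up for the illustration.

```shell
# Requires GNU sed. The sample rows are invented for illustration.
tmp=$(mktemp -d)
printf '111\ta\n333\td\n' > "$tmp/file2"
printf '111\tA\n333\tD\n' > "$tmp/file"

# Stage 1 only: turn each row of the second file into one sed command.
gen_script() {
    sed -r 's#^(\S+)\s.*#/^\1/s/$/ &/#' "$1"
}

# Show the generated script, one command per row of file2.
gen_script "$tmp/file2"

# Stage 2: feed that script to a second sed that edits the first file.
gen_script "$tmp/file2" | sed -f - "$tmp/file"
```

Because each row is pasted verbatim into the replacement part of a generated s command, data containing /, & or \ would break the script. The address /^111/ would also match a longer key such as 1111.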

+2








