Subset multiple large tab-delimited files by row and column using awk

I have some very large (several gigabytes) tab-delimited files with named rows (4.5e6 rows) and named columns (10 to several hundred).

E.g. InputFile1.txt

            A           B           C          D
Row1        1           2           1          3
Row2        2           4           5          3
Row3        3           6           6          4
Row4        4           8           9          4
Row5        5           2           0          1

      

InputFile2.txt

            E           F           G        
Row1        7           1           5          
Row2        7           5           5          
Row3        6           4           7          
Row4        5           4           8          
Row5        4           9           0        

      

I also have two index files, one for rows and one for columns, e.g.:

IndexRows.txt (all of these rows are present in every file)

Row1
Row3
Row4

      

IndexCols.txt (no column name occurs in more than one file)

B
C
F

      

I want to efficiently extract the rows and columns listed in the index files from each tab-delimited file and then combine all the extracted columns into one file. I have experience with R and could do this in R, but the files are so large that R would be pushed to its limits, if it can handle them at all.

Can anyone suggest an efficient way to do this using bash / awk?

For this example, the output would look like this:

            B       C       F  
Row1        2       1       1
Row3        6       6       4
Row4        8       9       4

      

Thanks


1 answer


I would approach the problem with data.table in R, as follows.

library(data.table)

# File names are placeholders; fread() needs them as quoted strings
DT   <- fread("f.txt",          sep="\t",  header=TRUE)
ROWS <- fread("file_rows.txt",  sep="\t",  header=FALSE)
COLS <- fread("file_cols.txt",  sep="\t",  header=FALSE)

setkey(DT, id)   # assumes the row-name column of DT is called `id`
setkey(ROWS)     # sets the key to the single column (V1)

## Note that this filters DT to only those rows with `id` in ROWS$V1
DT[ROWS]

      



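Finally, to filter both rows and columns:

# intersect() keeps only the indexed columns that exist in this file;
# .SDcols raises an error if it is given a name that is not a column of DT
DT[ROWS, .SD, .SDcols=intersect(COLS$V1, names(DT))]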
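
Since the question asks for bash / awk, the same selection can also be sketched in awk. This is a minimal sketch rather than a tuned solution, and it rests on a few assumptions: the header line of each data file starts with an empty first field (a leading tab), so header names line up with data fields; `subset_tables` is a hypothetical helper name; and the selected rows and columns fit in memory, because the result is accumulated per row name and printed at the end. Columns come out in file order, not in the order of IndexCols.txt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: subset_tables ROWS_INDEX COLS_INDEX INPUT [INPUT...]
# Assumes every INPUT is tab-delimited and its header starts with a leading tab.
subset_tables() {
  awk -F'\t' -v OFS='\t' '
    FILENAME == ARGV[1] { rows[$1]; next }   # first file: wanted row names
    FILENAME == ARGV[2] { cols[$1]; next }   # second file: wanted column names

    FNR == 1 {                               # header line of a data file
      n = 0                                  # number of wanted columns in this file
      for (i = 2; i <= NF; i++)
        if ($i in cols) { keep[++n] = i; header = header OFS $i }
      next
    }

    $1 in rows {                             # data line for a wanted row
      if (!($1 in seen)) { seen[$1]; order[++m] = $1 }   # remember first-seen order
      for (j = 1; j <= n; j++) out[$1] = out[$1] OFS $(keep[j])
    }

    END {
      print header                           # begins with OFS, i.e. an empty name field
      for (k = 1; k <= m; k++) print order[k] out[order[k]]
    }
  ' "$@"
}
```

Called as `subset_tables IndexRows.txt IndexCols.txt InputFile1.txt InputFile2.txt > Output.txt`, this should reproduce the example output from the question. Because rows are matched by name, the files do not need to list their rows in the same order; the input files are streamed once each, and only the selected cells are held in memory.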

      
