Subset multiple large tab-delimited files by row and column names using awk
I have some very large (several gigabytes) tab-delimited files with named rows (4.5e6 rows) and named columns (10 to several hundred).
I.e. InputFile1.txt
A B C D
Row1 1 2 1 3
Row2 2 4 5 3
Row3 3 6 6 4
Row4 4 8 9 4
Row5 5 2 0 1
InputFile2.txt
E F G
Row1 7 1 5
Row2 7 5 5
Row3 6 4 7
Row4 5 4 8
Row5 4 9 0
I also have two index files, one for rows and one for columns. I.e:
IndexRows.txt (all of these rows are present in every file)
Row1
Row3
Row4
IndexCols.txt (no column appears in more than one file)
B
C
F
I want to efficiently extract the rows and columns specified in the index files from the tab-delimited files and then concatenate all the columns into one file. I have experience with R and could do it there, but these files are very large, and R would be pushed to its limits, if it can handle them at all.
Can anyone suggest an efficient way to do this using bash/awk?
In this example, the output will look like this:
B C F
Row1 2 1 1
Row3 6 6 4
Row4 8 9 4
Thanks!
I would approach the problem as follows.
library(data.table)
DT   <- fread("f.txt", sep="\t", header=TRUE)
ROWS <- fread("file_rows.txt", sep="\t", header=FALSE)
COLS <- fread("file_cols.txt", sep="\t", header=FALSE)
setkey(DT, id)
setkey(ROWS)  # sets the key to the single column
## Note that this filters DT to only those rows with `id` in ROWS$V1
DT[ROWS]
Finally, to filter the rows and keep only the indexed columns:
DT[ROWS, .SD, .SDcols=COLS$V1]
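Since the question involves several input files whose columns must end up in one output file, here is a minimal sketch along the same lines. It is an assumption-laden illustration, not a drop-in solution: it assumes fread names the unnamed first column (the row labels) V1 and renames it to id, it uses the file names from the example above, and it intersects IndexCols.txt with each file's header because no single file contains all of the indexed columns.
library(data.table)

rows <- fread("IndexRows.txt", header=FALSE)$V1
cols <- fread("IndexCols.txt", header=FALSE)$V1

subset_one <- function(path) {
  DT <- fread(path, sep="\t", header=TRUE)
  setnames(DT, 1, "id")                # assumption: the first column holds the row names
  keep <- intersect(cols, names(DT))   # each file carries only some of the indexed columns
  DT[id %chin% rows, c("id", keep), with=FALSE]
}

pieces <- lapply(c("InputFile1.txt", "InputFile2.txt"), subset_one)

## Join the pieces on the row label so columns from all files line up by row,
## then write a single tab-delimited output file.
out <- Reduce(function(x, y) merge(x, y, by="id"), pieces)
fwrite(out, "Output.txt", sep="\t")
If memory is a concern, fread's select argument can restrict which columns are read in the first place, so each multi-gigabyte file never has to be loaded in full.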