Merge two massive tables based on common rows

I have two large data tables (or rather I will have them; I still need to get them into the same format) containing SNP genetic data.

They are huge tables, so everything I do with them has to be done on a cluster.

Both tables have more than 600,000 rows containing data for different but overlapping SNPs. Each column is an individual person (one table has 942 samples, the other has 92). Once the second table is formatted correctly, both tables will look like this:

dbSNP_RSID  Sample1 Sample2 Sample3 Sample4 Sample5
rs10000011  CC  CC  CC  CC  TC
rs1000002   TC  TT  CC  TT  TT
rs10000023  TG  TG  TT  TG  TG
rs1000003   AA  AG  AG  AA  AA
rs10000041  TT  TG  TT  TT  TG
rs10000046  GG  GG  AG  GG  GG
rs10000057  AA  AG  GG  AA  AA
rs10000073  TC  TT  TT  TT  TT
rs10000092  TC  TC  CC  TC  TT
rs1000014   GG  GG  GG  GG  GG
rs10000154  GG  AG  AG  AA  AG
rs10000159  GG  AG  GG  GG  AG
rs1000016   AA  AG  AA  AG  GG
rs10000182  AA  AA  AG  AA  AA
rs1000020   TC  TC  TT  TT  TC

      

I want to create one large table with more than 1,000 columns that contains the intersection of the roughly 600,000 rows present in both tables. R seems like a good language for this. Does anyone have suggestions on how to do it? Thank you!

+3




2 answers


You can just use merge

for example:

mergedTable <- merge(table1, table2, by = "dbSNP_RSID")

      

If the two tables have overlapping sample column names, the merged table will contain (for example) columns named Sample1.x and Sample1.y. You can avoid or fix this by renaming the columns before or after the merge.
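As a minimal sketch of the renaming step, assuming two toy tables (t1, t2 and their contents are hypothetical, not the asker's data) that both contain a Sample1 column: merge() accepts a suffixes argument that replaces the default .x/.y, or the sample columns of one table can be renamed before merging.

```r
# Toy tables (hypothetical data) that share the column name "Sample1"
t1 <- data.frame(dbSNP_RSID = c("rs1", "rs2", "rs3"),
                 Sample1 = c("CC", "TC", "TT"),
                 stringsAsFactors = FALSE)
t2 <- data.frame(dbSNP_RSID = c("rs2", "rs3", "rs4"),
                 Sample1 = c("AA", "AG", "GG"),
                 stringsAsFactors = FALSE)

# Option 1: let merge() label the clashing columns with custom suffixes
m1 <- merge(t1, t2, by = "dbSNP_RSID", suffixes = c("_study1", "_study2"))
colnames(m1)

# Option 2: rename the sample columns of one table before merging
colnames(t2)[-1] <- paste0("B_", colnames(t2)[-1])
m2 <- merge(t1, t2, by = "dbSNP_RSID")
colnames(m2)
```

Renaming beforehand (option 2) is probably the easier route when every sample column of the smaller table needs a prefix, since one paste0() call covers all of them.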



Reproducible example:

# Two tables of 100,000 random SNP IDs each, with 100 random genotype columns
x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
  matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
    sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
  matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
    sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
# Give the sample columns distinct names so no .x/.y suffixes are needed
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
# Keep only the SNPs that occur in both tables
mergedDf <- merge(x, y, by = "dbSNP_RSID")
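One way to sanity-check such a merge, sketched here on small hypothetical tables (a, b, S1 and S2 are made-up names) so that it runs instantly: by default merge() keeps only the keys found in both inputs, so, assuming the IDs are unique within each table, the number of rows should equal the number of shared IDs.

```r
set.seed(1)
# Two small toy tables with unique, partially overlapping SNP IDs
a <- data.frame(dbSNP_RSID = paste0("rs", sample(50, 20)), S1 = "AA",
                stringsAsFactors = FALSE)
b <- data.frame(dbSNP_RSID = paste0("rs", sample(50, 20)), S2 = "GG",
                stringsAsFactors = FALSE)

# IDs present in both tables
shared <- intersect(a$dbSNP_RSID, b$dbSNP_RSID)

# Default merge() behaves as an inner join on the key column
ab <- merge(a, b, by = "dbSNP_RSID")

# Rows: one per shared ID; columns: all of a plus all of b, minus the duplicated key
rows_ok <- nrow(ab) == length(shared) && setequal(ab$dbSNP_RSID, shared)
cols_ok <- ncol(ab) == ncol(a) + ncol(b) - 1
```

If the IDs were not unique within a table, merge() would return every pairwise match, and the row count could exceed the number of shared IDs.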

      

+2




Use data.table, where DT1 is the first table and DT2 is the second:



library(data.table)
setkey(DT1,"id")
setkey(DT2,"id")
DT <- merge(DT1,DT2,by = "id")
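
Adapted to the key column from the question, the same idea might look like the sketch below. It assumes the data.table package is installed and uses tiny hypothetical tables; the second form uses data.table's own join syntax with on =, which should give the same rows as the keyed merge.

```r
library(data.table)

# Toy tables (hypothetical data) keyed by the SNP identifier
DT1 <- data.table(dbSNP_RSID = c("rs1", "rs2", "rs3"),
                  Sample1 = c("CC", "TC", "TT"))
DT2 <- data.table(dbSNP_RSID = c("rs2", "rs3", "rs4"),
                  Sample2 = c("AA", "AG", "GG"))

# Keyed merge, following the answer above but with the question's key column
setkey(DT1, dbSNP_RSID)
setkey(DT2, dbSNP_RSID)
DT <- merge(DT1, DT2, by = "dbSNP_RSID")

# Inner join in data.table syntax: look up each row of DT2 in DT1;
# nomatch = NULL drops rows of DT2 whose ID is missing from DT1
DT_alt <- DT1[DT2, on = "dbSNP_RSID", nomatch = NULL]
```

For files of the size described in the question, data.table's fread() is commonly used to read the tables in before merging.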

      

+5

