Merge two massive tables based on common rows
I have two large data tables (or rather I will have, once I get them into the same format) containing SNP genetic data.
They are huge tables, so everything I do with them has to run on a cluster.
Both tables have > 600,000 rows containing data for different but overlapping SNPs. Each column is an individual person (one table has 942 samples, the other has 92). Once the other table is formatted correctly, both tables will look like this:
dbSNP_RSID Sample1 Sample2 Sample3 Sample4 Sample5
rs10000011 CC CC CC CC TC
rs1000002 TC TT CC TT TT
rs10000023 TG TG TT TG TG
rs1000003 AA AG AG AA AA
rs10000041 TT TG TT TT TG
rs10000046 GG GG AG GG GG
rs10000057 AA AG GG AA AA
rs10000073 TC TT TT TT TT
rs10000092 TC TC CC TC TT
rs1000014 GG GG GG GG GG
rs10000154 GG AG AG AA AG
rs10000159 GG AG GG GG AG
rs1000016 AA AG AA AG GG
rs10000182 AA AA AG AA AA
rs1000020 TC TC TT TT TC
I want to create one large table with > 1000 columns that contains the intersection of the ~600,000 rows present in both tables. R seems like a good language to use. Does anyone have suggestions on how to do this? Thank you!
You can just use merge(), for example:
mergedTable <- merge(table1, table2, by = "dbSNP_RSID")
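To make the behaviour concrete, here is a minimal sketch on two tiny made-up tables (the IDs and genotypes are hypothetical, not taken from the question). It relies on the default all = FALSE, which keeps only the rows whose dbSNP_RSID appears in both inputs:

```r
# Two toy tables that share only some SNP IDs (hypothetical data).
table1 <- data.frame(dbSNP_RSID = c("rs1", "rs2", "rs3"),
                     Sample1 = c("CC", "TC", "TG"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(dbSNP_RSID = c("rs2", "rs3", "rs4"),
                     Sample2 = c("TT", "TG", "AA"),
                     stringsAsFactors = FALSE)

# all = FALSE is the default: rows found in only one table are dropped.
mergedTable <- merge(table1, table2, by = "dbSNP_RSID")
print(mergedTable)
```

Only rs2 and rs3 should survive here, since rs1 and rs4 each occur in just one of the tables.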
If your samples have overlapping column names, you will find that there are (for example) columns named Sample1.x and Sample1.y in the merged table. You can fix this by renaming the columns before or after the merge.
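As a rough sketch of the two ways around the name clash, assuming two small hypothetical tables a and b that both contain a Sample1 column: either rename the columns of one table before merging, or pass the suffixes argument of merge() to replace the default .x/.y. The "B_" prefix and the ".cohortA"/".cohortB" suffixes are made up for illustration:

```r
# Hypothetical tables that both use the column name "Sample1".
a <- data.frame(dbSNP_RSID = c("rs1", "rs2"),
                Sample1 = c("CC", "TC"),
                stringsAsFactors = FALSE)
b <- data.frame(dbSNP_RSID = c("rs1", "rs2"),
                Sample1 = c("AA", "AG"),
                stringsAsFactors = FALSE)

# Option 1: rename the sample columns of one table before merging.
b_renamed <- b
colnames(b_renamed)[-1] <- paste0("B_", colnames(b_renamed)[-1])
merged1 <- merge(a, b_renamed, by = "dbSNP_RSID")

# Option 2: keep the names and choose the suffixes merge() appends.
merged2 <- merge(a, b, by = "dbSNP_RSID",
                 suffixes = c(".cohortA", ".cohortB"))

print(colnames(merged1))
print(colnames(merged2))
```

Renaming up front is probably the easier of the two to keep track of when there are hundreds of sample columns.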
Reproducible example:
# Two simulated tables: 100,000 random SNP IDs each, 100 samples per table.
x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)),
                       ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)),
                       ncol = 100))
# Give the sample columns distinct names so no .x/.y suffixes are needed.
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
# Inner join: keeps only the SNP IDs present in both tables.
mergedDf <- merge(x, y, by = "dbSNP_RSID")
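Before running this on the full data, it may be worth sanity-checking the result. The sketch below repeats the same idea on a much smaller simulated data set and checks that the merged table has one row per shared SNP ID and every sample column from both inputs. The seed, the table sizes and the genotypes() helper are arbitrary choices made for this illustration:

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical helper: n random two-letter genotypes.
genotypes <- function(n) {
  bases <- c("A", "C", "T", "G")
  paste0(sample(bases, n, replace = TRUE), sample(bases, n, replace = TRUE))
}

# Small stand-ins for the two tables: 200 unique SNP IDs each.
x <- data.frame(dbSNP_RSID = paste0("rs", sample(1000, 200)),
                matrix(genotypes(200 * 5), ncol = 5),
                stringsAsFactors = FALSE)
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1000, 200)),
                matrix(genotypes(200 * 3), ncol = 3),
                stringsAsFactors = FALSE)
colnames(x)[2:6] <- paste0("Sample", 1:5)
colnames(y)[2:4] <- paste0("Sample", 6:8)

mergedDf <- merge(x, y, by = "dbSNP_RSID")

# Because IDs are unique within each table, the merge should have
# exactly one row per ID found in both tables.
shared <- intersect(x$dbSNP_RSID, y$dbSNP_RSID)
stopifnot(nrow(mergedDf) == length(shared))
stopifnot(setequal(mergedDf$dbSNP_RSID, shared))
# One key column plus all sample columns from both tables.
stopifnot(ncol(mergedDf) == ncol(x) + ncol(y) - 1)
```

If the row-count check fails on real data, duplicated IDs in one of the tables are a likely cause, since merge() then returns one row per matching pair.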