Isolating Data Regions Based on Multiple Columns
I have a dataset shortened here:
SNP chr BP log10 PPA
rs10068 17 56555 1.16303 0.030
rs10032 17 56561 26.364 0.975
rs10354 17 34951 4.3212 0.626
rs10043 17 20491 0.00097 0.006
rs10457 17 69572 -0.38403 0.014
rs10465 17 69872 8.19547 0.927
where PPA is the posterior association probability. Since I have some high log10 values โโ(> 6), I would like to define reliable intervals around these regions to determine exactly how large or small they are.
To do this, I would like to first determine the SNP with log10> 6, which is simple enough using a subset.
newdata <- subset(data, log10 > 6)
However, I would also like to include in this subset SNPs that are physically close to these lead SNPs using BP 500 +/- BP of lead SNPs (with log10> 6). This is where I'm not sure how best to move on. Is this something I can work in subset
, or should I first define these lead SNPs in my raw data and then a subset from there?
Once I isolate these regions, I can move forward.
Any suggestions are greatly appreciated!
source to share
s <- read.table(header=T, text="SNP chr BP log10 PPA
rs10068 17 56555 1.16303 0.030
rs10032 17 56561 26.364 0.975
rs10354 17 34951 4.3212 0.626
rs10043 17 20491 0.00097 0.006
rs10457 17 69572 -0.38403 0.014
rs10465 17 69872 8.19547 0.927")
Distance to any line where s $ log10> 6:
outer(s$BP[s$log10 > 6], s$BP, '-')
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 6 0 21610 36070 -13011 -13311
## [2,] 13317 13311 34921 49381 300 0
Any column above with an absolute value <500 is the one you want to keep:
s[apply(outer(s$BP[s$log10 > 6], s$BP, '-'), 2, function(x) any(abs(x) < 500)),]
## SNP chr BP log10 PPA
## 1 rs10068 17 56555 1.16303 0.030
## 2 rs10032 17 56561 26.36400 0.975
## 5 rs10457 17 69572 -0.38403 0.014
## 6 rs10465 17 69872 8.19547 0.927
source to share
For what it's worth, here's a solution using a package GenomicRanges
. This package is based on IRanges
and very efficiently processes interval data using interval trees
. And that should also handle all the different chromosomes.
-
First boot data:
# load data df <- read.table(header=T, text="SNP chr BP log10 PPA rs10068 17 56555 1.16303 0.030 rs10032 17 56561 26.364 0.975 rs10354 17 34951 4.3212 0.626 rs10043 17 20491 0.00097 0.006 rs10457 17 69572 -0.38403 0.014 rs10465 17 69872 8.19547 0.927")
-
Get a subset where log10> 6
df.f <- df[df$log10 > 6, ]
-
Load and create ranges for source data and subsets
require(GenomicRanges) df.gr <- GRanges(Rle(df$chr), IRanges(df$BP, df$BP)) df.f.gr <- GRanges(Rle(df.f$chr), IRanges(df.f$BP, df.f$BP))
-
Find all overlaps of data with log10> 6 with all other SNPs with break = +/- 500
olaps <- findOverlaps(df.f.gr, df.gr, maxgap=500)
-
Get withdrawal
# if you just want a whole data.frame with ALL SNPs that have # log > 6 or within 500 bases of one with log > 6, then, out <- df[subjectHits(olaps), ] # SNP chr BP log10 PPA # 1 rs10068 17 56555 1.16303 0.030 # 2 rs10032 17 56561 26.36400 0.975 # 5 rs10457 17 69572 -0.38403 0.014 # 6 rs10465 17 69872 8.19547 0.927 # In case you want for each SNP that has log > 6, all of the # snps that are within 500 bases (either side) apart for # this SNP, separately, then, out.list <- split(out, df$BP[queryHits(olaps)]) # $`56555` # SNP chr BP log10 PPA # 1 rs10068 17 56555 1.16303 0.030 # 2 rs10032 17 56561 26.36400 0.975 # # $`56561` # SNP chr BP log10 PPA # 5 rs10457 17 69572 -0.38403 0.014 # 6 rs10465 17 69872 8.19547 0.927
source to share