Isolating Data Regions Based on Multiple Columns

Question

Isolating Data Regions Based on Multiple Columns

I have a dataset shortened here:

SNP chr       BP log10   PPA
rs10068  17 56555 1.16303 0.030
rs10032  17 56561 26.364 0.975
rs10354  17 34951 4.3212 0.626
rs10043  17 20491 0.00097 0.006
rs10457  17 69572 -0.38403 0.014
rs10465  17 69872 8.19547 0.927

where PPA is the posterior association probability. Since I have some high log10 values (> 6), I would like to define reliable intervals around these regions to determine exactly how large or small they are.

To do this, I would like to first determine the SNP with log10> 6, which is simple enough using a subset.

newdata <- subset(data, log10 > 6)

However, I would also like to include in this subset SNPs that are physically close to these lead SNPs using BP 500 +/- BP of lead SNPs (with log10> 6). This is where I'm not sure how best to move on. Is this something I can work in subset

, or should I first define these lead SNPs in my raw data and then a subset from there?

Once I isolate these regions, I can move forward.

Any suggestions are greatly appreciated!

+3

r intervals subset

mfk534 Feb 17 13 at 23:52

source to share

2 answers

For what it's worth, here's a solution using a package GenomicRanges

. This package is based on IRanges

and very efficiently processes interval data using interval trees

. And that should also handle all the different chromosomes.

First boot data:

# load data
df <- read.table(header=T, text="SNP chr       BP log10   PPA
rs10068  17 56555 1.16303 0.030
rs10032  17 56561 26.364 0.975
rs10354  17 34951 4.3212 0.626
rs10043  17 20491 0.00097 0.006
rs10457  17 69572 -0.38403 0.014
rs10465  17 69872 8.19547 0.927")

Get a subset where log10> 6

df.f <- df[df$log10 > 6, ]

Load and create ranges for source data and subsets

require(GenomicRanges)
df.gr <- GRanges(Rle(df$chr), IRanges(df$BP, df$BP))
df.f.gr <- GRanges(Rle(df.f$chr), IRanges(df.f$BP, df.f$BP))

Find all overlaps of data with log10> 6 with all other SNPs with break = +/- 500

olaps <- findOverlaps(df.f.gr, df.gr, maxgap=500)

Get withdrawal

# if you just want a whole data.frame with ALL SNPs that have 
# log > 6 or within 500 bases of one with log > 6, then,        
out <- df[subjectHits(olaps), ] 

#       SNP chr    BP    log10   PPA
# 1 rs10068  17 56555  1.16303 0.030
# 2 rs10032  17 56561 26.36400 0.975
# 5 rs10457  17 69572 -0.38403 0.014
# 6 rs10465  17 69872  8.19547 0.927

# In case you want for each SNP that has log > 6, all of the 
# snps that are within 500 bases (either side) apart for 
# this SNP, separately, then,
out.list <- split(out, df$BP[queryHits(olaps)])
# $`56555`
#       SNP chr    BP    log10   PPA
# 1 rs10068  17 56555  1.16303 0.030
# 2 rs10032  17 56561 26.36400 0.975
#     
# $`56561`
#       SNP chr    BP    log10   PPA
# 5 rs10457  17 69572 -0.38403 0.014
# 6 rs10465  17 69872  8.19547 0.927

+1

Arun 18 Feb At 8:51 am

source to share

Matthew lundberg · Accepted Answer · 2013-02-18T00:00:16+0000

s <- read.table(header=T, text="SNP chr       BP log10   PPA
rs10068  17 56555 1.16303 0.030
rs10032  17 56561 26.364 0.975
rs10354  17 34951 4.3212 0.626
rs10043  17 20491 0.00097 0.006
rs10457  17 69572 -0.38403 0.014
rs10465  17 69872 8.19547 0.927")

Distance to any line where s $ log10> 6:

outer(s$BP[s$log10 > 6], s$BP, '-')
##       [,1]  [,2]  [,3]  [,4]   [,5]   [,6]
## [1,]     6     0 21610 36070 -13011 -13311
## [2,] 13317 13311 34921 49381    300      0

Any column above with an absolute value <500 is the one you want to keep:

s[apply(outer(s$BP[s$log10 > 6], s$BP, '-'), 2, function(x) any(abs(x) < 500)),]
##       SNP chr    BP    log10   PPA
## 1 rs10068  17 56555  1.16303 0.030
## 2 rs10032  17 56561 26.36400 0.975
## 5 rs10457  17 69572 -0.38403 0.014
## 6 rs10465  17 69872  8.19547 0.927

Isolating Data Regions Based on Multiple Columns

More articles: