Isolating Data Regions Based on Multiple Columns

I have a dataset shortened here:

SNP chr       BP log10   PPA
rs10068  17 56555 1.16303 0.030
rs10032  17 56561 26.364 0.975
rs10354  17 34951 4.3212 0.626
rs10043  17 20491 0.00097 0.006
rs10457  17 69572 -0.38403 0.014
rs10465  17 69872 8.19547 0.927

      

where PPA is the posterior association probability. Since I have some high log10 values โ€‹โ€‹(> 6), I would like to define reliable intervals around these regions to determine exactly how large or small they are.

To do this, I would like to first determine the SNP with log10> 6, which is simple enough using a subset.

newdata <- subset(data, log10 > 6)

      

However, I would also like to include in this subset SNPs that are physically close to these lead SNPs using BP 500 +/- BP of lead SNPs (with log10> 6). This is where I'm not sure how best to move on. Is this something I can work in subset

, or should I first define these lead SNPs in my raw data and then a subset from there?

Once I isolate these regions, I can move forward.

Any suggestions are greatly appreciated!

+3


source to share


2 answers


s <- read.table(header=T, text="SNP chr       BP log10   PPA
rs10068  17 56555 1.16303 0.030
rs10032  17 56561 26.364 0.975
rs10354  17 34951 4.3212 0.626
rs10043  17 20491 0.00097 0.006
rs10457  17 69572 -0.38403 0.014
rs10465  17 69872 8.19547 0.927")

      

Distance to any line where s $ log10> 6:

outer(s$BP[s$log10 > 6], s$BP, '-')
##       [,1]  [,2]  [,3]  [,4]   [,5]   [,6]
## [1,]     6     0 21610 36070 -13011 -13311
## [2,] 13317 13311 34921 49381    300      0

      



Any column above with an absolute value <500 is the one you want to keep:

s[apply(outer(s$BP[s$log10 > 6], s$BP, '-'), 2, function(x) any(abs(x) < 500)),]
##       SNP chr    BP    log10   PPA
## 1 rs10068  17 56555  1.16303 0.030
## 2 rs10032  17 56561 26.36400 0.975
## 5 rs10457  17 69572 -0.38403 0.014
## 6 rs10465  17 69872  8.19547 0.927

      

+6


source


For what it's worth, here's a solution using a package GenomicRanges

. This package is based on IRanges

and very efficiently processes interval data using interval trees

. And that should also handle all the different chromosomes.



  • First boot data:

    # load data
    df <- read.table(header=T, text="SNP chr       BP log10   PPA
    rs10068  17 56555 1.16303 0.030
    rs10032  17 56561 26.364 0.975
    rs10354  17 34951 4.3212 0.626
    rs10043  17 20491 0.00097 0.006
    rs10457  17 69572 -0.38403 0.014
    rs10465  17 69872 8.19547 0.927")
    
          

  • Get a subset where log10> 6

    df.f <- df[df$log10 > 6, ]
    
          

  • Load and create ranges for source data and subsets

    require(GenomicRanges)
    df.gr <- GRanges(Rle(df$chr), IRanges(df$BP, df$BP))
    df.f.gr <- GRanges(Rle(df.f$chr), IRanges(df.f$BP, df.f$BP))
    
          

  • Find all overlaps of data with log10> 6 with all other SNPs with break = +/- 500

    olaps <- findOverlaps(df.f.gr, df.gr, maxgap=500)
    
          

  • Get withdrawal

    # if you just want a whole data.frame with ALL SNPs that have 
    # log > 6 or within 500 bases of one with log > 6, then,        
    out <- df[subjectHits(olaps), ] 
    
    #       SNP chr    BP    log10   PPA
    # 1 rs10068  17 56555  1.16303 0.030
    # 2 rs10032  17 56561 26.36400 0.975
    # 5 rs10457  17 69572 -0.38403 0.014
    # 6 rs10465  17 69872  8.19547 0.927
    
    # In case you want for each SNP that has log > 6, all of the 
    # snps that are within 500 bases (either side) apart for 
    # this SNP, separately, then,
    out.list <- split(out, df$BP[queryHits(olaps)])
    # $`56555`
    #       SNP chr    BP    log10   PPA
    # 1 rs10068  17 56555  1.16303 0.030
    # 2 rs10032  17 56561 26.36400 0.975
    #     
    # $`56561`
    #       SNP chr    BP    log10   PPA
    # 5 rs10457  17 69572 -0.38403 0.014
    # 6 rs10465  17 69872  8.19547 0.927
    
          

+1


source







All Articles