Comparing, splitting and merging two data frames

How can I compare two data frames, df1 and df2, by gene name, extract the corresponding values for each gene from df2, and insert them into df1? The expected result is shown as df3 below.

df1 <-

Genes    sample.ID  chrom  loc.start  loc.end    num.mark
Klri2    LO.WGS     1      3010000    173490000  8430
Rrs1     LO.WGS     1      3010000    173490000  8430
Serpin   LO.WGS     1      3010000    173490000  8430
Myoc     LO.WGS     1      3010000    173490000  8430
St18     LO.WGS     1      3010000    173490000  8430


df2 <-

    RL  pValue.   chr   start            end    CNA     Genes
    2   2.594433   1    129740006   129780779   gain    Klri2   
    2   3.941399   1    130080653   130380997   gain    Serpin,St18,Myoc

df3<-

Genes   sample.ID  chrom  loc.start  loc.end num.mark   RL  pValue      CNA
Klri2    LO.WGS     1   3010000   173490000     8430    2   2.594433    gain
Rrs1     LO.WGS     1   3010000   173490000     8430    0     0          0
Serpin   LO.WGS     1   3010000   173490000     8430    2   3.941399    gain
Myoc     LO.WGS     1   3010000   173490000     8430    2   3.941399    gain
St18     LO.WGS     1   3010000   173490000     8430    2   3.941399    gain

      



2 answers


You may try:

library(splitstackshape)
out <- cSplit(df2, "Genes", sep = ",", direction = "long")

      

This reshapes df2 into the correct format (one row per gene):

#   RL  pValue. chr     start       end  CNA  Genes
#1:  2 2.594433   1 129740006 129780779 gain  Klri2
#2:  2 3.941399   1 130080653 130380997 gain Serpin
#3:  2 3.941399   1 130080653 130380997 gain   St18
#4:  2 3.941399   1 130080653 130380997 gain   Myoc
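For readers who would rather not load an extra package, the same reshaping can be sketched in base R. This is only an illustration under assumptions: df2 is rebuilt by hand from the question's table, and the names gene_list and out_base are hypothetical.

```r
# Hypothetical reconstruction of df2 from the question
df2 <- data.frame(
  RL      = c(2, 2),
  pValue. = c(2.594433, 3.941399),
  chr     = c(1, 1),
  start   = c(129740006, 130080653),
  end     = c(129780779, 130380997),
  CNA     = c("gain", "gain"),
  Genes   = c("Klri2", "Serpin,St18,Myoc"),
  stringsAsFactors = FALSE
)

# Split each comma-separated entry into a character vector of gene names
gene_list <- strsplit(df2$Genes, ",", fixed = TRUE)

# Repeat every row once per gene it contains, then overwrite the Genes column
out_base <- df2[rep(seq_len(nrow(df2)), lengths(gene_list)), ]
out_base$Genes <- trimws(unlist(gene_list))
rownames(out_base) <- NULL

print(out_base)
```

The row order follows the order of the genes inside each comma-separated entry, so it should match the cSplit() output above.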

      

Then you just need to use merge() or left_join() from dplyr:



library(dplyr)
df3 <- left_join(df1, out)

      

If you want to replace NA with 0, you can do:

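# mutate_each() and funs() are deprecated in current dplyr; across() (dplyr >= 1.0.0) replaces them
df3 <- left_join(df1, out) %>% mutate(across(everything(), ~ ifelse(is.na(.x), 0, .x)))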

      

Or, if you prefer subset assignment:

df3 <- left_join(df1, out) %>% (function(x) { x[is.na(x)] <- 0; x })

      



This is a merge operation, but first you must bring df2 into the correct format, with one row per gene (rather than one row holding several comma-separated genes). The tidyr package has a handy function for this, unnest():

df2 <- tidyr::unnest(
         transform(df2, Genes = strsplit(as.character(df2$Genes), ",")),
         Genes)

      

The result looks like this:



df2
#  RL  pValue. chr     start       end  CNA  Genes
#1  2 2.594433   1 129740006 129780779 gain  Klri2
#2  2 3.941399   1 130080653 130380997 gain Serpin
#3  2 3.941399   1 130080653 130380997 gain   St18
#4  2 3.941399   1 130080653 130380997 gain   Myoc

      

Now you can simply use merge(df1, df2, all.x = TRUE) or left_join() from dplyr (or another package such as data.table, depending on which one you want to explore). Note that this yields NA where you want zeros, but you can easily replace them.
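As a rough sketch of that last step in base R, the merge and the NA replacement could look like the following. The data frames are rebuilt by hand from the question, df2_long is a hypothetical name for the already-split df2, and only the columns that appear in the desired df3 are kept.

```r
# Hypothetical reconstruction of df1 from the question
df1 <- data.frame(
  Genes     = c("Klri2", "Rrs1", "Serpin", "Myoc", "St18"),
  sample.ID = "LO.WGS",
  chrom     = 1,
  loc.start = 3010000,
  loc.end   = 173490000,
  num.mark  = 8430,
  stringsAsFactors = FALSE
)

# df2 after splitting, reduced to the columns wanted in df3 (assumed layout)
df2_long <- data.frame(
  RL      = 2,
  pValue. = c(2.594433, 3.941399, 3.941399, 3.941399),
  CNA     = "gain",
  Genes   = c("Klri2", "Serpin", "St18", "Myoc"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of df1, fill unmatched genes with NA
df3 <- merge(df1, df2_long, by = "Genes", all.x = TRUE)

# merge() sorts by the key, so restore the original order of df1
df3 <- df3[match(df1$Genes, df3$Genes), ]
rownames(df3) <- NULL

# Replace NA with 0; in the character column CNA this becomes the string "0"
df3[is.na(df3)] <- 0

print(df3)
```

Passing by = "Genes" explicitly keeps the join from picking up any other column names the two data frames might happen to share.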
