Compare, split, and merge two data frames
How can I compare two datasets, df1 and df2, by gene name, extract the corresponding values for each gene from df2, and insert them into df1?
df1 <-
Genes sample.ID chrom loc.start loc.end num.mark
Klri2 LO.WGS 1 3010000 173490000 8430
Rrs1 LO.WGS 1 3010000 173490000 8430
Serpin LO.WGS 1 3010000 173490000 8430
Myoc LO.WGS 1 3010000 173490000 8430
St18 LO.WGS 1 3010000 173490000 8430
df2 <-
RL pValue. chr start end CNA Genes
2 2.594433 1 129740006 129780779 gain Klri2
2 3.941399 1 130080653 130380997 gain Serpin,St18,Myoc
df3<-
Genes sample.ID chrom loc.start loc.end num.mark RL pValue CNA
Klri2 LO.WGS 1 3010000 173490000 8430 2 2.594433 gain
Rrs1 LO.WGS 1 3010000 173490000 8430 0 0 0
Serpin LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
Myoc LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
St18 LO.WGS 1 3010000 173490000 8430 2 3.941399 gain
You may try:
library(splitstackshape)
out <- cSplit(df2, "Genes", sep = ",", direction = "long")
This reshapes df2 into long format (one row per gene):
# RL pValue. chr start end CNA Genes
#1: 2 2.594433 1 129740006 129780779 gain Klri2
#2: 2 3.941399 1 130080653 130380997 gain Serpin
#3: 2 3.941399 1 130080653 130380997 gain St18
#4: 2 3.941399 1 130080653 130380997 gain Myoc
Then you just need to use merge() or left_join() from dplyr:
library(dplyr)
df3 <- left_join(df1, out)
If you want to replace NA with 0, you can do:
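# mutate_each() and funs() are deprecated in current dplyr; across() is the replacement
df3 <- left_join(df1, out) %>% mutate(across(everything(), ~ ifelse(is.na(.x), 0, .x)))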
Or, if you prefer subsetting:
df3 <- left_join(df1, out) %>% (function(x) { x[is.na(x)] <- 0; x })
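One caveat with the NA replacement above: if a column such as CNA is stored as a factor (the default for text columns in R versions before 4.0), assigning 0 into it triggers an "invalid factor level" warning and leaves the NA in place. A minimal base R sketch that handles this case, where replace_na_zero and demo are hypothetical names used only for illustration:

```r
# Hypothetical helper: replace NA with 0 column by column, respecting column types.
replace_na_zero <- function(x) {
  for (col in names(x)) {
    if (is.factor(x[[col]])) {
      # A factor cannot take a value that is not one of its levels,
      # so convert it to character before filling in "0".
      x[[col]] <- as.character(x[[col]])
    }
    x[[col]][is.na(x[[col]])] <- 0
  }
  x
}

# Small illustrative data frame (not from the question).
demo <- data.frame(RL = c(2, NA), CNA = factor(c("gain", NA)))
res <- replace_na_zero(demo)
```

Numeric columns receive a numeric 0, while text columns receive the string "0".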
This is a merge operation, but first you need to bring df2 into the right format, with one row per gene (rather than one row holding several comma-separated genes). The tidyr package has a handy function for this, unnest():
df2 <- tidyr::unnest(
transform(df2, Genes = strsplit(as.character(df2$Genes), ",")),
Genes)
The result looks like this:
df2
# RL pValue. chr start end CNA Genes
#1 2 2.594433 1 129740006 129780779 gain Klri2
#2 2 3.941399 1 130080653 130380997 gain Serpin
#3 2 3.941399 1 130080653 130380997 gain St18
#4 2 3.941399 1 130080653 130380997 gain Myoc
Now you can simply use merge(df1, df2, all.x = TRUE) or left_join() from dplyr (or another package such as data.table, depending on which one you want to explore). Note that this produces NA where you want zeros, but you can easily replace them.
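For reference, the whole split-then-merge workflow can also be sketched in base R without extra packages. This is only one possible approach. It rebuilds the question's data by hand, introduces the intermediate names gene_list and df2_long for illustration, and assumes that only RL, pValue. and CNA should be carried over from df2, as in the desired df3:

```r
# Rebuild the example data from the question (text kept as character, not factor).
df1 <- data.frame(
  Genes     = c("Klri2", "Rrs1", "Serpin", "Myoc", "St18"),
  sample.ID = "LO.WGS",
  chrom     = 1,
  loc.start = 3010000,
  loc.end   = 173490000,
  num.mark  = 8430,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  RL      = c(2, 2),
  pValue. = c(2.594433, 3.941399),
  chr     = c(1, 1),
  start   = c(129740006, 130080653),
  end     = c(129780779, 130380997),
  CNA     = c("gain", "gain"),
  Genes   = c("Klri2", "Serpin,St18,Myoc"),
  stringsAsFactors = FALSE
)

# Split the comma-separated gene lists and repeat each df2 row once per gene.
gene_list <- strsplit(df2$Genes, ",", fixed = TRUE)
df2_long  <- df2[rep(seq_len(nrow(df2)), lengths(gene_list)), c("RL", "pValue.", "CNA")]
df2_long$Genes <- trimws(unlist(gene_list))

# Left join on Genes; merge() sorts by the key, so restore df1's row order afterwards.
df3 <- merge(df1, df2_long, by = "Genes", all.x = TRUE)
df3 <- df3[match(df1$Genes, df3$Genes), ]
rownames(df3) <- NULL

# Replace NA with 0 (the character column CNA receives "0").
df3[is.na(df3)] <- 0
```

Genes without a match in df2 (here Rrs1) end up with 0 in RL, pValue. and CNA, while the other rows carry the values from the df2 row that listed them.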