R: digitize pie plot - how to plot unrelated areas between sectors with minimal overlap

I have a data frame with common features between 4 patient groups and cell types. I have many different features, but the common ones (present in more than 1 group) are just a few.

I want to make a circus plot that reflects several connections between common functions across patient groups and cell types, and also gives an idea of ​​how many unseparated traits there are in each group.

As I think about it, it should be a 4-quadrant graph (one for each patient group and cell type) with several links between them. Each sector size should reflect the total number of functions in the group, and most of this area should not be connected to other groups, but empty.

This is what I have so far, but I don't want sectors dedicated to each function, just for each patient group and cell type.

MWE:

library(circlize)

patients <- c(rep("patient1",20), rep("patient2",10))
cell.types <- c(rep("cell1",12), rep("cell2",8),rep("cell1",6), rep("cell2",4))
features <- c(paste("feature",1:12,sep="_"), paste("feature",9:16,sep="_"), paste("feature",c(1,2,9,10,17,18),sep="_"), paste("feature",c(1,18,19,20),sep="_"))
dat <- data.frame(patient=patients, cell.type=cell.types, feature=features)
dat
dat <- with(dat, table(paste(patient,cell.type,sep='|'), feature))
dat

chordDiagram(as.data.frame(dat), transparency = 0.5)

      

fig1

EDIT !!

What @ m-dz is showing in his answer is actually the format I'm looking for, 4 sectors for 4 different patient / cell.type combinations showing only connections, while unrelated functions, although not shown. must take into account the size of the sector.

However, I realize that I have a more complex scenario than the one listed in the MWE above.

The function is thought to appear in 2 patient / cell groups. Not only when it is identical in two groups, but also when it is similar ... (sequence identity is above the threshold). So I have layoffs ...

Function A in patient1-cell1 can be connected to function A in patient2-cell1, but also for function B ... Function A must be counted only once (unique values) for patient1-cell1 and expand to 2 different functions in patient2-cell1 ...

Below is an example of how my actual data looks more accurate, and see if you can work with this example, we can get the final graph! Thank!!

##MWE
#NON OVERLAPPING SETS!

#1: non-shared features
nonshared <- data.frame(patient=c(rep("pat1",20), rep("pat2",10)), cell.type=c(rep("cell1",12), rep("cell2",8),rep("cell1",6), rep("cell2",4)), feature=paste("a",1:30,sep=''))
nonshared

#2: features shared between cell types within same patient
sharedcells <- data.frame(patient=c(rep("pat1",3), rep("pat2",4)), cell.types=c(rep("cell1||cell2",3),rep("cell1||cell2",4)), features=c("b1||b1","b1||b1","b1||b1","b2||b2","b3||b3","b4||b4","b4||b5"))
sharedcells

#3: features shared between patients within same cell types
sharedpats <- data.frame(patients=c(rep("pat1||pat2",2), rep("pat1||pat2",6)), cell.type=c(rep("cell1",2),rep("cell2",6)), features=c("c1||c1","c2||c1","c3||c3","c3||c4","c3||c5","c6||c5","c7||c7","c8||c8"))
sharedpats

#4: features shared between patients and cell types
#4.1: shared across pat1-cell1, pat1-cell2, pat2-cell1, pat2-cell2
sharedall1 <- data.frame(both=c(rep("pat1-cell1||pat1-cell2||pat2-cell1||pat2-cell2",4)), features=c("d1||d1||d1||d1","d2||d2||d2||d3","d4||d4||d3||d3","d5||d5||d5||d5"))
#4.2: shared across pat1-cell1, pat1-cell2, pat2-cell1
sharedall2 <- data.frame(both=c(rep("pat1-cell1||pat1-cell2||pat2-cell1",2)), features=c("d6||d6||d6","d7||d7||d7"))
#4.3: shared across pat1-cell1, pat1-cell2, pat2-cell2
sharedall3 <- data.frame(both="pat1-cell1||pat1-cell2||pat2-cell2", features="d8||d8||d9")
#4.4: shared across pat1-cell1, pat2-cell1, pat2-cell2
sharedall4 <- data.frame(both="pat1-cell1||pat2-cell1||pat2-cell2", features="d10||d10||d9")
#4.5: shared across pat1-cell2, pat2-cell1, pat2-cell2
sharedall5 <- data.frame(both=c(rep("pat1-cell2||pat2-cell1||pat2-cell2",3)), features=c("d11||d11||d11","d12||d13||d13","d12||d14||d14"))
#4.6: shared across pat1-cell1, pat2-cell2
sharedall6 <- data.frame()
#4.7: shared across pat1-cell2, pat2-cell1
sharedall7 <- data.frame(both=c(rep("pat1-cell2||pat2-cell1",2)), features=c("d15||d16","d17||d17"))

sharedall <- rbind(sharedall1, sharedall2, sharedall3, sharedall4, sharedall5, sharedall6, sharedall7)
sharedall
#you see there might be overlaps between the different subsets of sharedall, but not between sharedall, sharedparts, sharedcells, and nonshared

#I NEED A CIRCOS PLOT THAT SHOWS ALL THE CONNECTIONS. THE NON-CONNECTED (nonshared) FEATURES SHOULD NOT BE SHOWN, BUT THE SHOULD COUNT TO THE SIZE OF THE SECTOR (CORRESPONDING TO A PATIENT-CELL COMBINATION)

#THE FEATURES SHOULD BE COUNT UNIQUELY, SO IF THERE ARE ENTRIES LIKE:
#3 pat1||pat2     cell2   c3||c3
#4 pat1||pat2     cell2   c3||c4
#5 pat1||pat2     cell2   c3||c5
#THE FEATURE c3 SHOULD BE COUNT ONCE FOR pat1, AND EXPAND TO 3 DIFFERENT FEATURES IN pat2

      

+3


source to share


3 answers


A note on expected result: the goal was to create a graph showing how many common functions are there, ignoring individual functions (1st graph below) or overlapping common functions (e.g. 2nd graph looks like the same functions shared between all groups, which is incorrectly considered by the 1st plot, but the ratio of functions shared between groups is important here).

The code below shows the following two digits (Figure 1 on the left for reference):

All individual functions

enter image description here

Easy count of unique and common features



enter image description here

One of them must meet expectations.

# Prep. data --------------------------------------------------------------

nonshared <- data.frame(patient=c(rep("pat1",20), rep("pat2",10)), cell.type=c(rep("cell1",12), rep("cell2",8),rep("cell1",6), rep("cell2",4)), feature=paste("a",1:30,sep=''))
sharedcells <- data.frame(patient=c(rep("pat1",3), rep("pat2",4)), cell.types=c(rep("cell1||cell2",3),rep("cell1||cell2",4)), features=c("b1||b1","b1||b1","b1||b1","b2||b2","b3||b3","b4||b4","b4||b5"))
sharedpats <- data.frame(patients=c(rep("pat1||pat2",2), rep("pat1||pat2",6)), cell.type=c(rep("cell1",2),rep("cell2",6)), features=c("c1||c1","c2||c1","c3||c3","c3||c4","c3||c5","c6||c5","c7||c7","c8||c8"))
sharedall1 <- data.frame(both=c(rep("pat1-cell1||pat1-cell2||pat2-cell1||pat2-cell2",4)), features=c("d1||d1||d1||d1","d2||d2||d2||d3","d4||d4||d3||d3","d5||d5||d5||d5"))
sharedall2 <- data.frame(both=c(rep("pat1-cell1||pat1-cell2||pat2-cell1",2)), features=c("d6||d6||d6","d7||d7||d7"))
sharedall3 <- data.frame(both="pat1-cell1||pat1-cell2||pat2-cell2", features="d8||d8||d9")
sharedall4 <- data.frame(both="pat1-cell1||pat2-cell1||pat2-cell2", features="d10||d10||d9")
sharedall5 <- data.frame(both=c(rep("pat1-cell2||pat2-cell1||pat2-cell2",3)), features=c("d11||d11||d11","d12||d13||d13","d12||d14||d14"))
sharedall6 <- data.frame()
sharedall7 <- data.frame(both=c(rep("pat1-cell2||pat2-cell1",2)), features=c("d15||d16","d17||d17"))
sharedall <- rbind(sharedall1, sharedall2, sharedall3, sharedall4, sharedall5, sharedall6, sharedall7)

#I NEED A CIRCOS PLOT THAT SHOWS ALL THE CONNECTIONS. THE NON-CONNECTED (nonshared) FEATURES SHOULD NOT BE SHOWN, BUT THE SHOULD COUNT TO THE SIZE OF THE SECTOR (CORRESPONDING TO A PATIENT-CELL COMBINATION)

#THE FEATURES SHOULD BE COUNT UNIQUELY, SO IF THERE ARE ENTRIES LIKE:
#3 pat1||pat2     cell2   c3||c3
#4 pat1||pat2     cell2   c3||c4
#5 pat1||pat2     cell2   c3||c5
#THE FEATURE c3 SHOULD BE COUNT ONCE FOR pat1, AND EXPAND TO 3 DIFFERENT FEATURES IN pat2



# Start -------------------------------------------------------------------

library(circlize)
library(data.table)
library(magrittr)
library(stringr)
library(RColorBrewer)

# Split and pad with 0 ----------------------------------------------------
fun <- function(x) unlist(tstrsplit(x, split = '||', fixed = TRUE))

nonshared %>% setDT()
sharedcells %>% setDT()
sharedpats %>% setDT()
sharedall %>% setDT()

nonshared <- nonshared[, .(group = paste(patient, cell.type, sep = '-'), feature)][, feature := paste0('a', str_pad(str_extract(feature, '[0-9]+'), 2, 'left', '0'))]
sharedcells <- sharedcells[, lapply(.SD, fun), by = 1:nrow(sharedcells)][, .(group = paste(patient, cell.types, sep = '-'), feature = features)][, feature := paste0('b', str_pad(str_extract(feature, '[0-9]+'), 2, 'left', '0'))]
sharedpats <- sharedpats[, lapply(.SD, fun), by = 1:nrow(sharedpats)][, .(group = paste(patients, cell.type, sep = '-'), feature = features)][, feature := paste0('c', str_pad(str_extract(feature, '[0-9]+'), 2, 'left', '0'))]
sharedall <- sharedall[, lapply(.SD, fun), by = 1:nrow(sharedall)][, .(group = both, feature = features)][, feature := paste0('d', str_pad(str_extract(feature, '[0-9]+'), 2, 'left', '0'))]

dt_split <- rbindlist(
  list(
    nonshared,
    sharedcells,
    sharedpats,
    sharedall
  )
)

# Set key and self join to find shared features ---------------------------
setkey(dt_split, feature)
dt_join <- dt_split[dt_split, .(group, i.group, feature), allow.cartesian = TRUE] %>%
  .[group != i.group, ]

# Create a "sorted key" ---------------------------------------------------
# key := paste(sort(.SD)...
# To leave only unique combinations of groups and features
dt_join <-
  dt_join[,
          key := paste(sort(.SD), collapse = '|'),
          by = 1:nrow(dt_join),
          .SDcols = c('group', 'i.group')
          ] %>%
  setorder(feature, key) %>%
  unique(by = c('key', 'feature')) %>%
  .[, .(
    group_from = i.group,
    group_to = group,
    feature = feature)]

# Rename and key ----------------------------------------------------------

dt_split %>% setnames(old = 'group', new = 'group_from') %>% setkey(group_from, feature)
dt_join %>% setkey(group_from, feature)



# Individual features -----------------------------------------------------

# Features without connections --------------------------------------------

dt_singles <- dt_split[, .(group_from, group_to = group_from, feature)] %>%
  .[, N := .N, by = feature] %>%
  .[!(N > 1 & group_from == group_to), !c('N')]

# Bind all, add some columns etc. -----------------------------------------

dt_bind <- rbind(dt_singles, dt_join) %>% setorder(group_from, feature, group_to)

dt_bind[, ':='(
  group_from_f = paste(group_from, feature, sep = '.'),
  group_to_f = paste(group_to, feature, sep = '.'))]
dt_bind[, feature := NULL]  # feature can be removed

# Colour
dt_bind[, colour := ifelse(group_from_f == group_to_f, "#FFFFFF00", '#00000050')]  # Change first to #FF0000FF to show red blobs

# Prep. sectors -----------------------------------------------------------

sectors_f <- union(dt_bind[, group_from_f], dt_bind[, group_to_f]) %>% sort()

colour_lookup <-
  union(dt_bind[, group_from], dt_bind[, group_to]) %>% sort() %>%
  structure(seq_along(.) + 1, names = .)
sector_colours <- str_replace_all(sectors_f, '.[a-d][0-9]+', '') %>%
  colour_lookup[.]

# Gaps between sectors ----------------------------------------------------

gap_sizes <- c(0.0, 1.0)
gap_degree <-
  sapply(table(names(sector_colours)), function(i) c(rep(gap_sizes[1], i-1), gap_sizes[2])) %>%
  unlist() %>% unname()
# gap_degree <- rep(0, length(sectors_f))  # Or no gap



# Plot! -------------------------------------------------------------------

# Each "sector" is a separate patient/cell/feature combination

circos.par(gap.degree = gap_degree)
circos.initialize(sectors_f, xlim = c(0, 1))
circos.trackPlotRegion(ylim = c(0, 1), track.height = 0.05, bg.col = sector_colours, bg.border = NA)

for(i in 1:nrow(dt_bind)) {
  row_i <- dt_bind[i, ]
  circos.link(
    row_i[['group_from_f']], c(0, 1),
    row_i[['group_to_f']], c(0, 1),
    border = NA, col = row_i[['colour']]
  )
}

# "Feature" labels
circos.trackPlotRegion(track.index = 2, ylim = c(0, 1), panel.fun = function(x, y) {
  sector.index = get.cell.meta.data("sector.index")
  circos.text(0.5, 0.25, sector.index, col = "white", cex = 0.6, facing = "clockwise", niceFacing = TRUE)
}, bg.border = NA)

# "Patient/cell" labels
for(s in names(colour_lookup)) {
  sectors <- sectors_f %>% { .[str_detect(., s)] }
  highlight.sector(
    sector.index = sectors, track.index = 1, col = colour_lookup[s],
    text = s, text.vjust = -1, niceFacing = TRUE)
}

circos.clear()



# counts of unique and shared features ------------------------------------

xlims <- dt_split[, .N, by = group_from][, .(x_from = 0, x_to = N)] %>% as.matrix()
links <- dt_join[, .N, by = .(group_from, group_to)]
colours <- dt_split[, unique(group_from)] %>% structure(seq_along(.) + 1, names = .)

library(circlize)

sectors = names(colours)
circos.par(cell.padding = c(0, 0, 0, 0))
circos.initialize(sectors, xlim = xlims)
circos.trackPlotRegion(ylim = c(0, 1), track.height = 0.05, bg.col = colours, bg.border = NA)

for(i in 1:nrow(links)) {
  link <- links[i, ]
  circos.link(link[[1]], c(0, link[[3]]), link[[2]], c(0, link[[3]]), col = '#00000025', border = NA)
}

# "Patient/cell" labels
for(s in sectors) {
  highlight.sector(
    sector.index = s, track.index = 1, col = colours[s], 
    text = s, text.vjust = -1, niceFacing = TRUE)
}

circos.clear()

      

Edit: just by adding a link from the deleted comment: see this answer for a nice example of marking!

+3


source


@ m-dz provides the right direction. I can provide more details on your simulated data.

Let's start here:

patients <- c(rep("patient1",20), rep("patient2",10))
cell.types <- c(rep("cell1",12), rep("cell2",8),rep("cell1",6), rep("cell2",4))
features <- c(paste("feature",1:12,sep="_"), paste("feature",9:16,sep="_"), paste("feature",c(1,2,9,10,17,18),sep="_"), paste("feature",c(1,18,19,20),sep="_"))
dat <- data.frame(patient=patients, cell.type=cell.types, feature=features)
dat <- with(dat, table(paste(patient,cell.type,sep='|'), feature))

      

as.data.frame

converts dat

to a three-column data frame (i.e. an adjacency list where references start at the first column and point to the second column)

dat = as.data.frame(dat, stringsAsFactors = FALSE)

      

Generate colors for patients / cells and functions.

features = unique(dat[[2]])
features_col = structure(rand_color(length(features)), names = features)
patients_col = structure(2:5, names = unique(dat[[1]]))

      



If the feature only exists in one patient / cell combination, you don't want to show it, but still want to keep your position in the plot, you can just set #FFFFFF00

as color (white with full transparency so that it won't be covered by other links). Here we want the color of the link to be the same as for the function sectors.

col = ifelse(dat[[3]], features_col[dat[[2]]], "#FFFFFF00")
col = gsub("FF$", "80", col) # half transparent
features_count = tapply(dat[[3]], dat[[2]], sum)
# set color to white if it only exists in one patient/cell
col[features_count[dat[[2]]] == 1] = "#FFFFFF00" 

      

And the final chord diagram:

chordDiagram(dat, col = col, grid.col = c(features_col, patients_col))

      

You can see that there are at least two links in the function sectors that point to patients / cage.

enter image description here

+2


source


get prepared data

    library(circlize)
    patients <- c(rep("patient1",20), rep("patient2",10))
    cell.types <- c(rep("cell1",12), rep("cell2",8),rep("cell1",6), rep("cell2",4))
    features <- c(paste("feature",1:12,sep="_"), paste("feature",9:16,sep="_"),     paste("feature",c(1,2,9,10,17,18),sep="_"), paste("feature",c(1,18,19,20),sep="_"))
    dat <- data.frame(patient=patients, cell.type=cell.types, feature=features)
    dat <- with(dat, table(paste(patient,cell.type,sep='|'), feature))
    dat<-as.data.frame(dat,stringsAsFactors = FALSE)

      

enter image description here

get all combinations of patients and cell types

    df=NULL
    for(i in levels(as.factor(dat$feature))){
        temp<-as.data.frame(matrix(combn(dat[which(dat$feature==i),1],2),byrow = TRUE,ncol=2),stringsAsFactors = FALSE)
        temp$feature=i
        temp$Freq=1
        Freq_0<-subset(dat$Var1,dat$feature==i & dat$Freq==0)
        for(j in Freq_0){
          temp$Freq[temp$V1==j | temp$V2==j]=0
        }
        df<-rbind(df,temp)
    }

      

enter image description here

add color

    df$color=rainbow(dim(df)[1])
    df[which(df$Freq==0),5]="white"
    df$Freq=1
    chordDiagram(df[,c(-3,-5)], transparency = 0.5,col = df$color)

      

different link means different function and link color is white where "Freq" is 0 enter image description here

I turn the color "white" into "black", and black - more visible enter image description here

If you want to keep the "feature" attribute ...... prepare the data first

    library(circlize)
    patients <- c(rep("patient1",20), rep("patient2",10))
    cell.types <- c(rep("cell1",12), rep("cell2",8),rep("cell1",6), rep("cell2",4))
    features <- c(paste("feature",1:12,sep="_"), paste("feature",9:16,sep="_"), paste("feature",c(1,2,9,10,17,18),sep="_"), paste("feature",c(1,18,19,20),sep="_"))
    dat <- data.frame(patient=patients, cell.type=cell.types, feature=features)
    dat <- with(dat, table(paste(patient,cell.type,sep='|'), feature))
    dat<-as.data.frame(dat,stringsAsFactors = FALSE)
    df=NULL
    for(i in levels(as.factor(dat$feature))){
      temp<-as.data.frame(matrix(combn(dat[which(dat$feature==i),1],2),byrow = TRUE,ncol=2),stringsAsFactors = FALSE)
      temp$feature=i
      temp$Freq=1
      Freq_0<-subset(dat$Var1,dat$feature==i & dat$Freq==0)
      for(j in Freq_0){
        temp$Freq[temp$V1==j | temp$V2==j]=0
      }
      df<-rbind(df,temp)
    }

      

processed

    library(dplyr)
    df1<-subset(df,df$Freq==1)
    df0<-subset(df,df$Freq==0)
    df1_mod<-summarise(group_by(df1,V1,V2),Freq=n())
    df0_mod<-summarise(group_by(df0,V1,V2),Freq=n())

      

add color

    df1_mod$color<-rainbow(5)
    df0_mod$color<-"white"
    df_res<-rbind(df0_mod,df1_mod)

      

enter image description here

draw it

chordDiagram(df_res, transparency = 0.5,col = df_res$color)

      

enter image description here

enter image description here

This figure shows that there are many zeros in "Freq".

+1


source







All Articles