R-Programming - ggplot2 - boxplot questions (varwidth & position_dodge / stat_summary & position_dodge)

Question


I am currently using ggplot2 to display some distributions as boxplots. I can create simple boxplots and change their color, shape, etc., but I cannot get the ones that combine several parameters to work.

1°) My goal is to display the boxes for men and women side by side, which can be done with `position = position_dodge(width = 0.9)`. I also want the width of each box to be proportional to the size of its sample, which can be done with `varwidth = TRUE`. First problem: when I combine the two parameters it doesn't work, and I get the following message:

position_dodge requires non-overlapping x intervals

(Image: boxplot when `varwidth = TRUE` and `position_dodge` are used together)

I tried resizing the plot, but it didn't help. If I leave out `varwidth = TRUE`, the boxes dodge correctly. Is there a way around this, or is it a limitation of ggplot2?
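To make the first problem easier to reproduce without the CSV file, here is a minimal sketch on synthetic data. The `toy` data frame and its values are hypothetical stand-ins for my real data. It uses `position_dodge2()` instead of `position_dodge()`; as far as I understand, `position_dodge2()` (available in recent ggplot2 releases, where it is the default position for `geom_boxplot()`) is meant to handle boxes of differing widths, but I have not confirmed that it behaves the same on every version.

```r
library(ggplot2)

# Hypothetical synthetic data with deliberately unequal group sizes
set.seed(1)
toy <- data.frame(
  Client.Class   = rep(c("Middle Class", "Subprime"), times = c(60, 20)),
  Gender         = rep(c("Female", "Male", "Female", "Male"),
                       times = c(15, 45, 8, 12)),
  nbyears_client = round(rexp(80, rate = 0.4), 2)
)

# varwidth = TRUE scales box widths by group size;
# position_dodge2() is used here in place of position_dodge()
p_varwidth <- ggplot(toy, aes(x = Client.Class, y = nbyears_client,
                              color = Gender, fill = Gender)) +
  geom_boxplot(varwidth = TRUE, alpha = 0.3,
               position = position_dodge2(preserve = "single"))

print(p_varwidth)
```

If this draws four separate boxes of different widths, the combination is possible on that ggplot2 version and the limitation is specific to `position_dodge()`.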

2°) I also want to show the size of each sample behind the boxes. I can compute it with `stat_summary(fun.data = give.n, ...)`, but unfortunately I haven't found a way to stop the numbers from overlapping when the boxes share the same x position. I tried using `hjust` and `vjust` to move the numbers, but they all seem to have the same origin, so it doesn't help.

(Image: overlapping numbers generated by `stat_summary` when the boxes are dodged)

Since the data has no label column, I couldn't use `geom_text`, or at least I didn't find a way to pass the computed stat to `geom_text`. So the second problem is this: how can I display each number on its own box?
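For the second problem, one approach that may work is to give the text layer the same dodge as the boxes, so that each count is shifted along with its box. The sketch below assumes synthetic data (the `toy` data frame is hypothetical) and a variant of `give.n` that returns a data frame rather than a named vector; the shared `dodge` object is my own naming, not something ggplot2 requires.

```r
library(ggplot2)

# Label position (y = group median) and label text (group size)
give.n <- function(x) {
  data.frame(y = median(x), label = length(x))
}

# Hypothetical synthetic data with known group sizes: 10, 30, 25, 15
set.seed(1)
toy <- data.frame(
  Client.Class   = rep(c("Middle Class", "Subprime"), each = 40),
  Gender         = rep(c("Female", "Male", "Female", "Male"),
                       times = c(10, 30, 25, 15)),
  nbyears_client = round(rexp(80, rate = 0.4), 2)
)

# One position object shared by both layers keeps boxes and labels aligned
dodge <- position_dodge(width = 0.9)

p_labels <- ggplot(toy, aes(x = Client.Class, y = nbyears_client,
                            color = Gender, fill = Gender)) +
  geom_boxplot(position = dodge, alpha = 0.3) +
  stat_summary(fun.data = give.n, geom = "text", size = 3, vjust = -0.5,
               position = dodge, show.legend = FALSE)

print(p_labels)
```

The grouping comes from the `color = Gender` mapping inherited from `ggplot()`, so the text layer is split into the same groups as the boxes and dodged by the same amount.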

Here is my code:

```r
library(ggplot2)

# Summary function for stat_summary: puts the label at the group median (y)
# and uses the group size as the label text
give.n <- function(x){
  return(c(y = median(x), label = length(x)))
}

plot_boxes <- function(mydf, mycolumn1, mycolumn2) {

  # Capture the argument expressions as text for the axis labels and title
  mylegendx <- deparse(substitute(mycolumn1))
  mylegendy <- deparse(substitute(mycolumn2))

  # substring(..., 11) strips the leading "df_client$" from the captured names
  g2  <- ggplot(mydf, aes(x=as.factor(mycolumn1), y=mycolumn2, color=Gender, 
    fill=Gender)) +
  geom_boxplot( data=mydf, aes(x=as.factor(mycolumn1), y=mycolumn2, 
     color=Gender), position=position_dodge(width=0.9), alpha=0.3) +
  stat_summary(fun.data = give.n, geom = "text", size = 3, vjust=1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_discrete(name = mylegendx ) +
  labs(title=paste("Boxplot ", substring(mylegendy, 11), " by ", 
     substring(mylegendx, 11))  , x = mylegendx, y = mylegendy)

  print(g2)
}

#setwd("~/data")
filename <- "df_stackoverflow.csv"

# The file is semicolon-separated with "." as the decimal mark
df_client <- read.csv(file=filename, header=TRUE, sep=";", dec=".")

plot_boxes(df_client, df_client$Client.Class, df_client$nbyears_client)
```

And the data looks like this (a small sample of the dataset, which has 20,000 rows):

Client.Id;Client.Status;Client.Class;Gender;nbyears_client
3;Active;Middle Class;Male;1.38
4;Active;Middle Class;Male;0.9
5;Active;Retiree;Female;0.21
6;Active;Middle Class;Male;0.9
7;Active;Middle Class;Male;3.55
8;Active;Subprime;Male;1.16
9;Active;Middle Class;Male;1.21
10;Active;Part-time;Male;3.38
17;Active;Middle Class;Male;1.83
19;Active;Subprime;Female;5.81
20;Active;Farming;Male;8.99
21;Active;Subprime;Female;6.49
22;Active;Middle Class;Male;1.54
23;Active;Middle Class;Female;2.74
24;Active;Subprime;Male;0.46
25;Active;Executive;Female;0.49
26;Active;Middle Class;Female;3.55
27;Active;Middle Class;Male;3.83
29;Active;Subprime;Female;2.66
30;Active;Middle Class;Male;2.72
31;Active;Middle Class;Female;4.88
32;Active;Subprime;Male;1.46
34;Active;Middle Class;Female;7.16
41;Active;Middle Class;Male;0.65
44;Active;Middle Class;Male;2
45;Active;Subprime;Male;1.13