Stacked bar plot of a two-level variable across multiple factors in R
First of all, I am still a beginner. I am trying to interpret and draw a stacked bar plot with R. I have already looked at several answers, but some were not specific to my case and others I simply didn't understand:
- https://stats.stackexchange.com/questions/31597/graphing-a-probability-curve-for-a-logit-model-with-multiple-predictors
- https://stats.stackexchange.com/questions/47020/plotting-logistic-regression-interaction-categorical-in-r
- Output Multivariate Logistic Regression Model Results in R
I have a dataset dvl that has five columns: Variant, Region, Time, Person, and PrecededByPrep. I would like to do a multivariate comparison of Variant with the other four predictors. Each column can have one of two possible values:
- Variant: elke or iedere
- Region: VL or NL
- Time: time or no time
- Person: person or no person
- PrecededByPrep: 1 or 0
Here's the logistic regression
From the answers I gathered that ggplot2 might be the best plotting library. I've read its documentation, but for the life of me I can't figure out how to do this: how can I plot Variant against the three other factors?
It took me a while, but I mocked up something close to what I want in Photoshop (fictitious values!).
Dark gray / light gray: the possible values of Variant
y-axis: frequency
x-axis: each column, subdivided into its possible values
I know how to make individual bar plots, both stacked and grouped, but I don't know how to make bar plots that are stacked and grouped at the same time. ggplot2 can be used, but if it can be done without it I would prefer that.
I think this can be seen as a sample dataset, although I'm not entirely sure. I am starting with R and I have been reading about creating a sample set.
t <- data.frame(Variant = sample(c("iedere","elke"),size = 50, replace = TRUE),
Region = sample(c("VL","NL"),size = 50, replace = TRUE),
PrecededByPrep = sample(c("1","0"),size = 50, replace = TRUE),
Person = sample(c("person","no person"),size = 50, replace = TRUE),
Time = sample(c("time","no time"),size = 50, replace = TRUE))
I would like the plot to be aesthetically pleasing. What I mean:
- Chart color (i.e. for the bars): col=c("paleturquoise3", "palegreen3")
- Bold for axis labels (font.lab=2), but not for value labels (for example, Region is in bold, but VL and NL are not)
- #404040 as font, axis and line color
- Axis labels: x: factors, y: frequency
Here is one possibility: start with the untabulated data frame, melt it, plot it with geom_bar in ggplot2 (which does the counting per group), and split the plot by variable using facet_wrap.
Create toy data:
set.seed(123)
df <- data.frame(Variant = sample(c("iedere", "elke"), size = 50, replace = TRUE),
Region = sample(c("VL", "NL"), size = 50, replace = TRUE),
PrecededByPrep = sample(c("1", "0"), size = 50, replace = TRUE),
Person = sample(c("person", "no person"), size = 50, replace = TRUE),
Time = sample(c("time", "no time"), size = 50, replace = TRUE))
Reshape the data:
library(reshape2)
df2 <- melt(df, id.vars = "Variant")
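If reshape2 is not available, roughly the same long format can be built in base R. This is only a minimal sketch, assuming the toy data frame from above; the names predictors and df2_base are hypothetical and not part of the original answer.

```r
# Toy data as above; stringsAsFactors = FALSE keeps the columns as character
set.seed(123)
df <- data.frame(Variant = sample(c("iedere", "elke"), size = 50, replace = TRUE),
                 Region = sample(c("VL", "NL"), size = 50, replace = TRUE),
                 PrecededByPrep = sample(c("1", "0"), size = 50, replace = TRUE),
                 Person = sample(c("person", "no person"), size = 50, replace = TRUE),
                 Time = sample(c("time", "no time"), size = 50, replace = TRUE),
                 stringsAsFactors = FALSE)

# Every column except Variant is stacked into one long "value" column
predictors <- setdiff(names(df), "Variant")

df2_base <- data.frame(
  Variant  = rep(df$Variant, times = length(predictors)),                    # repeated once per predictor
  variable = factor(rep(predictors, each = nrow(df)), levels = predictors),  # which column a value came from
  value    = unlist(df[predictors], use.names = FALSE),                      # the stacked values
  stringsAsFactors = FALSE
)

str(df2_base)
```

The result has one row per original row and predictor (50 x 4 rows here), which is the shape the facetted plot expects.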
Plot:
library(ggplot2)
ggplot(data = df2, aes(factor(value), fill = Variant)) +
geom_bar() +
facet_wrap(~variable, nrow = 1, scales = "free_x") +
scale_fill_grey(start = 0.5) +
theme_bw()
There are many options for customizing the plot, such as setting the order of factor levels, rotating axis labels, wrapping facet labels onto two lines (for example, for the longer variable name "PrecededByPrep"), or changing the spacing between facets.
Customization (following updates to the question and comments from OP)
# labeller used in facet_grid to wrap "PrecededByPrep" on two lines
# see http://www.cookbook-r.com/Graphs/Facets_%28ggplot2%29/#modifying-facet-label-text
# written with as_labeller(), assuming ggplot2 >= 2.0, where labellers
# no longer take (var, value) arguments
my_lab <- as_labeller(c(Region = "Region",
                        PrecededByPrep = "Preceded\nByPrep",
                        Person = "Person",
                        Time = "Time"))
ggplot(data = df2, aes(factor(value), fill = Variant)) +
geom_bar() +
facet_grid(~variable, scales = "free_x", labeller = my_lab) +
scale_fill_manual(values = c("paleturquoise3", "palegreen3")) + # manual fill colors
theme_bw() +
theme(axis.text = element_text(face = "bold"), # axis tick labels bold
axis.text.x = element_text(angle = 45, hjust = 1), # rotate x axis labels
line = element_line(colour = "gray25"), # line colour gray25 = #404040
strip.text = element_text(face = "bold")) + # facet labels bold
xlab("factors") + # set axis labels
ylab("frequency")
Add counts to each bar (edit following comments from OP).
The basic principles of calculating the y coordinates can be found in this Q&A. Here I use dplyr to calculate the counts per bar (i.e. label in geom_text) and their y coordinates, but this can of course be done in base R, plyr or data.table.
# calculate counts (i.e. labels for geom_text) and their y positions.
library(dplyr)
df3 <- df2 %>%
group_by(variable, value, Variant) %>%
summarise(n = n()) %>%
mutate(y = cumsum(n) - (0.5 * n))
# plot
ggplot(data = df2, aes(x = factor(value), fill = Variant)) +
geom_bar() +
geom_text(data = df3, aes(y = y, label = n)) +
facet_grid(~variable, scales = "free_x", labeller = my_lab) +
scale_fill_manual(values = c("paleturquoise3", "palegreen3")) + # manual fill colors
theme_bw() +
theme(axis.text = element_text(face = "bold"), # axis tick labels bold
axis.text.x = element_text(angle = 45, hjust = 1), # rotate x axis labels
line = element_line(colour = "gray25"), # line colour gray25 = #404040
strip.text = element_text(face = "bold")) + # facet labels bold
xlab("factors") + # set axis labels
ylab("frequency")
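The counts and label positions computed with dplyr above can also be derived in base R. The following is a minimal sketch under the same toy data; the names predictors and counts are hypothetical, and the label position follows the same rule as above (cumulative count within a bar minus half the segment's own count).

```r
# Toy data as above
set.seed(123)
df <- data.frame(Variant = sample(c("iedere", "elke"), size = 50, replace = TRUE),
                 Region = sample(c("VL", "NL"), size = 50, replace = TRUE),
                 PrecededByPrep = sample(c("1", "0"), size = 50, replace = TRUE),
                 Person = sample(c("person", "no person"), size = 50, replace = TRUE),
                 Time = sample(c("time", "no time"), size = 50, replace = TRUE),
                 stringsAsFactors = FALSE)

predictors <- setdiff(names(df), "Variant")

counts <- do.call(rbind, lapply(predictors, function(p) {
  # cross-tabulate one predictor against Variant and flatten to long format
  tab <- as.data.frame(table(value = df[[p]], Variant = df$Variant),
                       stringsAsFactors = FALSE)
  # sort so that the segments of one bar are adjacent
  tab <- tab[order(tab$value, tab$Variant), ]
  tab$variable <- p
  # midpoint of each segment: running total within the bar minus half the segment
  tab$y <- ave(tab$Freq, tab$value, FUN = cumsum) - 0.5 * tab$Freq
  tab
}))

counts
```

Depending on the ggplot2 version, the bars may be stacked in the reverse of the order used here. If the labels land in the wrong segment, reverse the order of Variant before taking the cumulative sum.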
Here is my suggestion for a solution with the base R function barplot:
1. Calculate counts
l_count_df<-lapply(colnames(t)[-1],function(nomcol){table(t$Variant,t[,nomcol])})
count_df<-l_count_df[[1]]
for (i in 2:length(l_count_df)){
count_df<-cbind(count_df,l_count_df[[i]])
}
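The loop above can be condensed into a single do.call. This is only a sketch, assuming the sample data from the question; it is stored under the hypothetical name dat, which avoids masking base R's t() function.

```r
# Sample data as in the question, under a different (hypothetical) name
set.seed(123)
dat <- data.frame(Variant = sample(c("iedere", "elke"), size = 50, replace = TRUE),
                  Region = sample(c("VL", "NL"), size = 50, replace = TRUE),
                  PrecededByPrep = sample(c("1", "0"), size = 50, replace = TRUE),
                  Person = sample(c("person", "no person"), size = 50, replace = TRUE),
                  Time = sample(c("time", "no time"), size = 50, replace = TRUE))

# One 2 x 2 table per predictor, bound side by side into a 2 x 8 matrix
count_df <- do.call(cbind,
                    lapply(names(dat)[-1],
                           function(col) table(dat$Variant, dat[[col]])))

count_df
```

Each pair of adjacent columns belongs to one predictor, so every pair sums to the number of rows in the data.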
2. Draw the barplot without axis names, keeping the bar coordinates
par(las=1,col.axis="#404040",mar=c(5,4.5,4,2),mgp=c(3.5,1,0))
bp<-barplot(count_df,width=1.2,space=rep(c(1,0.3),4),col=c("paleturquoise3", "palegreen3"),border="#404040", axisnames=FALSE, ylab="Frequency",
legend=row.names(count_df),ylim=c(0,max(colSums(count_df))*1.2))
3. Label the bars
mtext(side=1,line=0.8,at=bp,text=colnames(count_df))
mtext(side=1,line=2,at=(bp[seq(1,8,by=2)]+bp[seq(2,8,by=2)])/2,text=colnames(t)[-1],font=2)
4. Add values inside the bars
for(i in 1:ncol(count_df)){
val_elke<-count_df[1,i]
val_iedere<-count_df[2,i]
text(bp[i],val_elke/2,val_elke)
text(bp[i],val_elke+val_iedere/2,val_iedere)
}
This is what I get (with my random data):
I am basically answering a different question. I suppose this can be seen as perverse on my part, but I really don't like barplots of almost any kind. They have always seemed to me to waste space, because the numeric values they present are less useful than a well-constructed table. The vcd package offers an extended mosaicplot function, which I think deserves the name "multivariate barplot" more than any of the ones I've seen so far. It requires you to first build a contingency table, for which the xtabs function seems to be perfect.
install.packages("vcd")
library(vcd)
help(package=vcd,mosaic)
col=c("paleturquoise3", "palegreen3")
vcd::mosaic(xtabs(~Variant+Region + PrecededByPrep + Time, data=ttt)
,highlighting="Variant", highlighting_fill=col)
That was a 4-way plot and this is a 5-way plot:
png(); vcd::mosaic( xtabs(
~Variant+Region + PrecededByPrep + Person + Time,
data=ttt)
,highlighting="Variant", highlighting_fill=col); dev.off()
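The contingency table that mosaic consumes can also be inspected on its own, which may help when reading the plot. A minimal sketch in base R, assuming the sample data from the question stored under the hypothetical name dat:

```r
# Sample data as in the question, under a hypothetical name
set.seed(123)
dat <- data.frame(Variant = sample(c("iedere", "elke"), size = 50, replace = TRUE),
                  Region = sample(c("VL", "NL"), size = 50, replace = TRUE),
                  PrecededByPrep = sample(c("1", "0"), size = 50, replace = TRUE),
                  Person = sample(c("person", "no person"), size = 50, replace = TRUE),
                  Time = sample(c("time", "no time"), size = 50, replace = TRUE))

# 4-way contingency table: one dimension per variable in the formula
tab <- xtabs(~ Variant + Region + PrecededByPrep + Time, data = dat)

# Flat view: predictor combinations as rows, Variant as columns
ftable(tab, row.vars = c("Region", "PrecededByPrep", "Time"), col.vars = "Variant")
```

Each cell of the flat table corresponds to one tile of the mosaic, with the two Variant columns matching the two highlighting colors.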