Calculation of Gene Expression Data by Means in a Randomized Experiment
I am (new to R) analyzing a randomized study of the effect of two treatments on gene expression. We evaluated 5 different genes at baseline and 1 year later. Gene fold is calculated as the 1 year value divided by the baseline value.
Gene example: IL10_BL IL10_1Y IL10_fold
Gene expression is measured as a continuous variable, usually in the range from 0.1 to 5.0. 100 patients were randomized to either a statin regimen or a diet.
I would like to make the following plot: - the Y-axis should represent the average gene expression with a 95% confidence limit - The X-axis should be categorical, with a baseline of 1 year and a fold for each of the 5 genes grouped by treatment. So 5 genes with 3 values for each gene in two groups would mean 30 categories on the X-axis. It would be very nice if the points for the same gene would be linked to a line.
I have tried doing this myself (using ggplot2) without any success. I tried to do it directly from raw data that looks like this (first 6 observations and 2 different genes):
genes <- read.table(header=TRUE, sep=";", text =
"treatment;IL10_BL;IL10_1Y;IL10_fold;IL6_BL;IL6_1Y;IL6_fold;
diet;1.1;1.5;1.4;1.4;1.4;1.1;
statin;2.5;3.3;1.3;2.7;3.1;1.1;
statin;3.2;4.0;1.3;1.5;1.6;1.1;
diet;3.8;4.4;1.2;3.0;2.9;0.9;
statin;1.1;3.1;2.8;1.0;1.0;1.0;
diet;3.0;6.0;2.0;2.0;1.0;0.5;")
I would really appreciate any help (or link to a similar thread) to do this.
source to share
First, you need to melt your data into a long format so that one column (your X column) contains a categorical variable indicating whether the observation is BL
, 1Y
or fold
.
(your team creates an empty column, which may be required for the first: genes$X = NULL
)
library(reshape2)
genes.long = melt(genes, id.vars='treatment', value.name='expression')
Then you will need the gene and measurement (baseline, 1 year, fold) in different columns (from this question ).
genes.long$gene = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 1))
genes.long$measurement = as.character(lapply(strsplit(as.character(genes.long$variable), split='_'), '[', 2))
And put the measurement in the order you expect:
genes.long$measurement = factor(genes.long$measurement, levels=c('BL', '1Y', 'fold'))
Then you can plot using calls stat_summary()
for mean and confidence intervals. Use edges to separate groups (healing and gene combinations).
ggplot(genes.long, aes(measurement, expression)) +
stat_summary(fun.y = mean, geom='point') +
stat_summary(fun.data = 'mean_cl_boot', geom='errorbar', width=.25) +
facet_grid(.~treatment+gene)
You can change the order to facet_grid(.~gene+treatment)
if you want the top level to be the genome instead of the cure.
source to share