How can I pass values to ddply based on a column?
I want to pass two sets of values, grouped by the column Category, to a function. Is there a way to do this using ddply from the plyr package?
I want to do something like this:
ddply(idata.frame(data), .(Category), wilcox.test, data[Type=="PRE",], data[Type=="POST",])
wilcox.test is the following function:
Description
Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as ‘Mann-Whitney’ test.
Usage
wilcox.test(x, ...)
Arguments
x
numeric vector of data values. Non-finite (e.g. infinite or missing) values will be omitted.
y
an optional numeric vector of data values: as with x non-finite values will be omitted.
.... rest of the arguments snipped ....
I have the following output from dput:
structure(list(Category = c("A", "C",
"B", "C", "D", "E",
"C", "A", "F", "B",
"E", "C", "C", "A",
"C", "A", "B", "H",
"I", "A"), Type = c("POST", "POST",
"POST", "POST", "PRE", "POST", "POST", "PRE", "POST",
"POST", "POST", "POST", "POST", "PRE", "PRE", "POST",
"POST", "POST", "POST", "POST"), Value = c(1560638113,
1283621, 561329742, 2727503, 938032, 4233577690, 0, 4209749646,
111467236, 174667894, 1071501854, 720499, 2195611, 1117814707,
1181525, 1493315101, 253416809, 327012982, 538595522, 3023339026
)), .Names = c("Category", "Type", "Value"), row.names = c(21406L,
123351L, 59875L, 45186L, 126720L, 94153L, 48067L, 159371L, 54303L,
63318L, 104100L, 58162L, 41945L, 159794L, 57757L, 178622L, 83812L,
130655L, 30860L, 24513L), class = "data.frame")
Any suggestions?
I always use an anonymous function for this:
ddply(idata.frame(data), .(Category),
function(x) wilcox.test(x[Type == "PRE",], x[Type == "POST",]))
I'm not sure whether wilcox.test returns anything that can be neatly combined into a data.frame by default, so you may have to tweak it a bit yourself. Alternatively, use dlply to get a list of wilcox.test results.
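As a minimal sketch of the dlply route, assuming plyr is installed: the columns are referenced explicitly through the group's data frame (x$Value, x$Type), because a bare Type inside the function is probably not looked up among the group's columns. The toy data frame and its values are made up for illustration and are not the asker's data.

```r
library(plyr)

# Hypothetical toy data: two categories, each with three PRE and three POST values
toy <- data.frame(
  Category = rep(c("A", "B"), each = 6),
  Type     = rep(c("PRE", "POST"), times = 6),
  Value    = c(1, 5, 2, 7, 3, 9, 10, 4, 12, 6, 14, 8)
)

# One wilcox.test result per Category, returned as a list
tests <- dlply(toy, .(Category), function(x) {
  wilcox.test(x$Value[x$Type == "PRE"], x$Value[x$Type == "POST"])
})

# Pull a single component out of each test object
p_values <- sapply(tests, function(w) w$p.value)
```

Each element of tests is a full test object, so the statistic, p-value, or any other component can be extracted afterwards. A group that has no PRE or no POST values would most likely make wilcox.test raise an error, so such groups need to be filtered out or guarded against first.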
There are two problems here:
- Paul's solution doesn't seem to work in my case, even though I'm using the same data. I think the subset syntax is the cause, but I couldn't pin down the error.
- Your data is actually too small for any statistical test to be computed given the structure you want to use (i.e. Category x Type). After all, if you look at the number of values per category in your data frame, they all have fewer than 30 values, and half have only one:
> table(data$Category)
A B C D E F H I
5 3 6 1 2 1 1 1
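To see how sparse the Category x Type cells are, a cross-tabulation helps. The sketch below uses base R only; the two vectors are copied from the Category and Type columns of the question's dput output, and the names counts and testable are just illustrative.

```r
# Category and Type columns copied from the dput output in the question
Category <- c("A", "C", "B", "C", "D", "E", "C", "A", "F", "B",
              "E", "C", "C", "A", "C", "A", "B", "H", "I", "A")
Type <- c("POST", "POST", "POST", "POST", "PRE", "POST", "POST", "PRE",
          "POST", "POST", "POST", "POST", "POST", "PRE", "PRE", "POST",
          "POST", "POST", "POST", "POST")

# Number of rows in each Category x Type cell
counts <- table(Category, Type)

# Categories with at least one PRE and at least one POST value
testable <- rownames(counts)[counts[, "PRE"] > 0 & counts[, "POST"] > 0]
```

Only categories A and C have values of both types, and C has a single PRE value, so a two-sample test per category cannot say much on the original 20 rows.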
But the good news is that I have found a solution for you.
First, I had to create a larger table. And since I was (very) lazy, I just did this:
for(i in 1:10){data <- rbind(data,data)}
data$Value <- jitter(data$Value,5e3)
data$Type <- sample(c("POST","PRE"),size=nrow(data),replace=T,prob=c(0.80,0.20))
I doubled the table 10 times (so it ends up 1,024 times its original size, 20,480 rows), added noise to the numerical values, and randomly reassigned "PRE" and "POST" in roughly the same proportion as in the original data frame (80% POST, 20% PRE). Please note that the values themselves are not very important here; I am just using the same data structure you gave us.
This way we end up with a much larger table, and more importantly, a denser table:
> table(data$Category, data$Type)
POST PRE
A 4135 985
B 2470 602
C 4881 1263
D 814 210
E 1634 414
F 815 209
H 846 178
I 813 211
So there you go!
Now we can work out a solution. For the sake of clarity, I've written the function that runs the Wilcoxon test separately. The trick is that it has to return a vector that will be included in the data frame you want as output.
I called the function wx:
wx <- function(d){
  w <- wilcox.test(
    # First vector (x): the PRE values of this group
    subset(d, Type == "PRE", select = Value )[,1],
    # Second vector (y): the POST values of this group
    subset(d, Type == "POST", select = Value )[,1]
  )
  # c(1,3) returns the statistic and the p-value (tweak that if you want something else)
  return(w[c(1,3)])
}
Finally, you just need to apply the function to your dataframe:
> ddply(data, .(Category), .fun = wx )
Category V1 V2
A 2047794 0.7862484
B 725554 0.3585648
C 3071435 0.8459535
D 80693 0.2112926
E 347314 0.3984288
F 83304 0.6252554
H 71762 0.3247840
I 88874 0.4177269
The results themselves are meaningless, of course, given how I built the table, but you get the test statistic in V1 and the p-value in V2.
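If the generic V1 and V2 column names are inconvenient, one possible variant is to return a one-row data frame with named columns. The sketch below assumes plyr is installed; wx_named and the toy data are hypothetical names and values introduced only for illustration. It also returns NA for groups that lack PRE or POST values, which is one way (not the only one) of handling sparse groups like those in the original data.

```r
library(plyr)

# Hypothetical variant of wx that returns named columns
wx_named <- function(d) {
  pre  <- d$Value[d$Type == "PRE"]
  post <- d$Value[d$Type == "POST"]
  # A two-sample test needs values of both types; otherwise return NA
  if (length(pre) == 0 || length(post) == 0) {
    return(data.frame(W = NA_real_, p.value = NA_real_))
  }
  w <- wilcox.test(pre, post)
  data.frame(W = unname(w$statistic), p.value = w$p.value)
}

# Toy data: category A has both types, category B has POST values only
toy <- data.frame(
  Category = c(rep("A", 6), rep("B", 2)),
  Type     = c("PRE", "POST", "PRE", "POST", "PRE", "POST", "POST", "POST"),
  Value    = c(1, 5, 2, 7, 3, 9, 4, 6)
)

res <- ddply(toy, .(Category), wx_named)
```

The result has one row per Category with columns W and p.value, and category B shows NA in both because it has no PRE values.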