How can I pass values to ddply based on a column?

I want to pass two sets of values to a function, grouped by the column Category. Is there a way to do this using ddply from the plyr package?

I want to do something like this:

ddply(idata.frame(data), .(Category), wilcox.test, data[Type=="PRE",], data[Type=="POST",])

      

wilcox.test is the following function:

Description

Performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as ‘Mann-Whitney’ test.

Usage

wilcox.test(x, ...)

Arguments

x   
numeric vector of data values. Non-finite (e.g. infinite or missing) values will be omitted.

y   
an optional numeric vector of data values: as with x non-finite values will be omitted.

.... rest of the arguments snipped ....

      

I have the following output from dput:

structure(list(Category = c("A", "C", 
"B", "C", "D", "E", 
"C", "A", "F", "B", 
"E", "C", "C", "A", 
"C", "A", "B", "H", 
"I", "A"), Type = c("POST", "POST", 
"POST", "POST", "PRE", "POST", "POST", "PRE", "POST", 
"POST", "POST", "POST", "POST", "PRE", "PRE", "POST", 
"POST", "POST", "POST", "POST"), Value = c(1560638113, 
1283621, 561329742, 2727503, 938032, 4233577690, 0, 4209749646, 
111467236, 174667894, 1071501854, 720499, 2195611, 1117814707, 
1181525, 1493315101, 253416809, 327012982, 538595522, 3023339026
)), .Names = c("Category", "Type", "Value"), row.names = c(21406L, 
123351L, 59875L, 45186L, 126720L, 94153L, 48067L, 159371L, 54303L, 
63318L, 104100L, 58162L, 41945L, 159794L, 57757L, 178622L, 83812L, 
130655L, 30860L, 24513L), class = "data.frame")

      

Any suggestions?

2 answers


I always use an anonymous function:

ddply(idata.frame(data), .(Category), 
    function(x) wilcox.test(x[Type == "PRE",], x[Type == "POST",]))

      



I'm not sure whether wilcox.test returns anything that combines nicely into a data.frame by default, so you may have to tweak the result a bit yourself. Alternatively, use dlply to get a list of wilcox.test results.
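As a rough sketch of the dlply route, assuming a made-up data frame (here called toy, a hypothetical name) with the same Category, Type and Value columns as in the question: inside the anonymous function the columns are looked up on x explicitly, because each call receives only the rows of one group, and wilcox.test expects numeric vectors rather than data frames.

```r
library(plyr)

# Hypothetical toy data with the question's column layout;
# the values are invented purely for illustration.
toy <- data.frame(
  Category = rep(c("A", "B"), each = 6),
  Type     = rep(c("PRE", "PRE", "PRE", "POST", "POST", "POST"), times = 2),
  Value    = c(1, 2, 3, 10, 11, 12, 5, 9, 6, 8, 7, 4),
  stringsAsFactors = FALSE
)

# dlply returns a list with one htest object per Category.
# x is the sub-data-frame for a single group.
tests <- dlply(toy, .(Category), function(x) {
  wilcox.test(x$Value[x$Type == "PRE"], x$Value[x$Type == "POST"])
})

# Extract the p-value from each list element
p_values <- sapply(tests, function(w) w$p.value)
```

Keeping the full htest objects in a list leaves every field of the test available; the pieces of interest can be extracted afterwards, as done for the p-values here.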


There are two problems here:

  • Paul's solution doesn't work for me, even though I'm using the same data. I suspect the subset syntax is the cause, but I was unable to pin down the error.

  • Your data is actually too small for a statistical test to compute any comparison with the structure you want to use (i.e. Category x Type). If you look at the number of observations per category in your dataframe, every category has fewer than 30 values, and more than half have only one or two:

    > table(data$Category)
    A B C D E F H I 
    5 3 6 1 2 1 1 1
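
To see how sparse the Category x Type structure is, a small sketch such as the following can cross-tabulate the two columns and list the categories that have at least one PRE and one POST observation. The Category and Type vectors are copied from the dput output in the question; the names counts and testable are only illustrative.

```r
# Category and Type columns copied from the dput output in the question
Category <- c("A", "C", "B", "C", "D", "E", "C", "A", "F", "B",
              "E", "C", "C", "A", "C", "A", "B", "H", "I", "A")
Type <- c("POST", "POST", "POST", "POST", "PRE", "POST", "POST", "PRE",
          "POST", "POST", "POST", "POST", "POST", "PRE", "PRE", "POST",
          "POST", "POST", "POST", "POST")

# Cross-tabulate: rows are categories, columns are POST and PRE
counts <- table(Category, Type)

# Categories with at least one observation of each type
testable <- rownames(counts)[counts[, "PRE"] > 0 & counts[, "POST"] > 0]
```

With the original data only categories A and C have both types, which is why a two-sample test cannot be run for most groups.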
    
          

But the good news is that I have found a solution for you.

First, I had to create a larger table. And since I was (very) lazy, I just did this:

for(i in 1:10){data <- rbind(data,data)}

data$Value <- jitter(data$Value,5e3) 

data$Type <- sample(c("POST","PRE"),size=nrow(data),replace=T,prob=c(0.80,0.20))

      

I doubled the table 10 times (making it 1,024 times larger), added noise to the numerical values, and randomly reassigned "PRE" and "POST" in roughly the same proportions as in the original data frame. Please note that the values themselves are not very important here; I am just reusing the data structure you gave us.

This way we end up with a much larger table, and more importantly, a denser table:

    > table(data$Category, data$Type)

      POST  PRE
    A 4135  985
    B 2470  602
    C 4881 1263
    D  814  210
    E 1634  414
    F  815  209
    H  846  178
    I  813  211

      



So there you go!

Now we can work out a solution. For the sake of clarity, I've written a function that runs the Wilcoxon test separately. The trick is that it has to return a vector that will be included in the dataframe that you need for your output.

I called the function wx:

wx <- function(d) {
  w <- wilcox.test(
    # First vector (x): the PRE values
    subset(d, Type == "PRE", select = Value)[, 1],
    # Second vector (y): the POST values
    subset(d, Type == "POST", select = Value)[, 1]
  )
  # c(1, 3) returns the statistic and the p-value (tweak that if you want something else)
  return(w[c(1, 3)])
}

      

Finally, you just need to apply the function to your dataframe:

> ddply(data, .(Category), .fun = wx  )
    Category      V1        V2
           A 2047794 0.7862484
           B  725554 0.3585648
           C 3071435 0.8459535
           D   80693 0.2112926 
           E  347314 0.3984288
           F   83304 0.6252554
           H   71762 0.3247840
           I   88874 0.4177269

      

The numbers themselves mean nothing, of course, given how I built the table, but V1 holds the test statistic and V2 the p-value.
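
If named columns are preferred over V1 and V2, one possible variant (a sketch, not the code that produced the output above) is to return a one-row data frame from the function. The names wx_named and toy are hypothetical, and the data is randomly generated for illustration only.

```r
library(plyr)

# Hypothetical variant of wx: returns a one-row data frame so that
# ddply produces the columns "statistic" and "p.value".
wx_named <- function(d) {
  w <- wilcox.test(d$Value[d$Type == "PRE"], d$Value[d$Type == "POST"])
  data.frame(statistic = unname(w$statistic), p.value = w$p.value)
}

# Made-up data with the question's column layout
set.seed(1)
toy <- data.frame(
  Category = rep(c("A", "B"), each = 20),
  Type     = rep(c("PRE", "POST"), times = 20),
  Value    = runif(40),
  stringsAsFactors = FALSE
)

# One row per Category, with named result columns
result <- ddply(toy, .(Category), wx_named)
```

Returning a data frame makes the column names explicit and leaves room for extra fields from the htest object if they are needed later.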
