Correlation in R: with use = "pairwise.complete.obs" I get the warning "the standard deviation is zero"

I'm trying to do some group correlation and have used this very helpful thread:

spearman correlation by group in R

However, there are some NA values in my two variables and in my grouping variable, so I get NA as the result for each group.

So I tried this:

> j <- lapply(split(HTNPS, HTNPS$callcat), function(HTNPS){cor(HTNPS$NPS_int, 
HTNPS$holdtime_int,use="pairwise.complete.obs", method = "spearman")})

      

Now I get more reasonable numbers, but also this warning: In cor(HTNPS$NPS_int, HTNPS$holdtime_int, use = "pairwise.complete.obs", : the standard deviation is zero

As requested, here is the output of dput(head(HTNPS, 40)) for the relevant columns:

> dput(head(HTNPS[,20:24], 40))
structure(list(holdtime_int = structure(c(6, 11, 7, 7, 5, 7, 
6, 5, 3, 6, 3, 5, 6, 105, 7, 6, 353, 5, 6, 9, 6, 6, 12, 5, 5, 
5, 249, 5, 7, 11, 5, 7, 5, 290, 6, 6, 6, 6, 5, 6), .Dim = c(40L, 
1L)), NPS_int = structure(c(1, NA, NA, 3, NA, 1, 1, 2, NA, NA, 
NA, NA, 3, 2, 1, NA, 2, 4, 1, 2, NA, 3, 1, 1, 1, 1, 1, 1, 1, 
2, 1, 3, 1, 1, 1, 2, 4, 2, 1, 1), .Dim = c(40L, 1L)), HTnot0 = structure(c(6, 
11, 7, 7, 5, 7, 6, 5, 3, 6, 3, 5, 6, 105, 7, 6, 353, 5, 6, 9, 
6, 6, 12, 5, 5, 5, 249, 5, 7, 11, 5, 7, 5, 290, 6, 6, 6, 6, 5, 
6), .Dim = c(40L, 1L)), callcat = structure(c(NA, NA, "CARD", 
"CARD", "GENERAL", "LOAN", "CHANGE DETAILS", "GENERAL", "LOAN", 
"CHANGE DETAILS", "LOAN", "CARD", "FUNDS TRANSFER", "FEE", "BALANCE", 
NA, "CARD", NA, NA, "STATEMENT", "CARD", "CARD", "GENERAL", "CARD", 
"CARD", "TERM DEPOSIT", "CARD", "GENERAL", "CARD", "CARD", "GENERAL", 
NA, NA, NA, NA, "CARD", "CARD", "FUNDS TRANSFER", "GENERAL", 
"MyBusinessOverride"), .Dim = c(40L, 1L), .Dimnames = list(NULL, 
"callcat")), HTcat = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 12L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 9L, 1L, 1L, 1L, 1L, 1L, 1L, 10L, 1L, 1L, 
1L, 1L, 1L, 1L), .Dim = c(40L, 1L), .Dimnames = list(NULL, "HTcat"))), .Names = c("holdtime_int", 
"NPS_int", "HTnot0", "callcat", "HTcat"), row.names = c(NA, 40L
), class = "data.frame")

      



1 answer


If you split the data this way, many of the groups are left with only one observation once the NA values are removed. A correlation cannot be computed from a single observation.
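To see how many usable observations each group actually has, you could count the complete pairs per group before computing anything. This is a minimal sketch on made-up data: the data frame `toy` and its values are hypothetical and only mimic the column names from the question.

```r
# Hypothetical toy data that mimics the columns in the question
toy <- data.frame(
  callcat      = c("CARD", "CARD", "CARD", "LOAN", "LOAN", "FEE", NA),
  holdtime_int = c(7, 5, 353, 7, 3, 105, 6),
  NPS_int      = c(3, NA, 2, 1, NA, 2, 1)
)

# split() drops rows whose grouping value is NA, so the last row is ignored.
# For each remaining group, count the rows where both variables are present.
n_complete <- sapply(split(toy, toy$callcat), function(d) {
  sum(complete.cases(d$holdtime_int, d$NPS_int))
})

n_complete
```

Groups with a count of 0 or 1 cannot yield a correlation, whichever `use` option is chosen.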

The warning appears whenever one of the two variables takes only a single distinct value within a group. In your example this happens, among others, in the data frame for callcat == "FUNDS TRANSFER": holdtime_int has only one value (6), so its standard deviation is 0 (hence the warning) and the resulting correlation is NA.
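The effect can be reproduced in isolation. The sketch below uses two short vectors (the names `holdtime` and `nps` are just illustrative) whose values match the two FUNDS TRANSFER rows in the posted data, and captures the warning instead of printing it.

```r
holdtime <- c(6, 6)  # constant within the group
nps      <- c(3, 2)

# tryCatch() returns the warning text instead of the correlation
msg <- tryCatch(
  cor(holdtime, nps, method = "spearman"),
  warning = function(w) conditionMessage(w)
)
msg

# With the warning suppressed, the returned value is NA
res <- suppressWarnings(cor(holdtime, nps, method = "spearman"))
is.na(res)
```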

I don't know why you are looking at these correlations, but on the data you provided they make little sense to me. If you want to get rid of the warning, you can add a check, for example:



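lapply(split(HTNPS, HTNPS$callcat), function(x) {
  # keep only rows where both variables are non-missing
  x <- na.exclude(x[c("holdtime_int", "NPS_int")])
  # with fewer than two distinct values the correlation is undefined
  if (any(sapply(x, function(i) length(unique(i))) < 2)) {
    NA
  } else {
    cor(x[, 1], x[, 2], method = "spearman")
  }
})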

      

This should give you the same result, but without the warning. Note the use of na.exclude to drop the rows that contain NA.
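One way to convince yourself that the check changes nothing but the warning is to run both versions on a small data set and compare. The data frame `toy` below is hypothetical, and the names `guarded` and `unguarded` are only for illustration.

```r
# Hypothetical toy data: LOAN has a constant holdtime_int, FEE has one row
toy <- data.frame(
  callcat      = c("CARD", "CARD", "CARD", "CARD", "LOAN", "LOAN", "FEE"),
  holdtime_int = c(7, 5, 353, 6, 6, 6, 105),
  NPS_int      = c(3, NA, 2, 1, 1, 2, 2)
)

# Version with the explicit check, as in the answer above
guarded <- sapply(split(toy, toy$callcat), function(x) {
  x <- na.exclude(x[c("holdtime_int", "NPS_int")])
  if (any(sapply(x, function(i) length(unique(i))) < 2)) {
    NA
  } else {
    cor(x[, 1], x[, 2], method = "spearman")
  }
})

# Version from the question, with its warnings silenced for the comparison
unguarded <- suppressWarnings(
  sapply(split(toy, toy$callcat), function(x) {
    cor(x$NPS_int, x$holdtime_int,
        use = "pairwise.complete.obs", method = "spearman")
  })
)

all.equal(guarded, unguarded)
```

For two plain vectors, "pairwise.complete.obs" keeps exactly the rows where both values are present, which is why dropping the NA rows first leads to the same numbers.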
