Correlation in R: with use = "pairwise.complete.obs" I get the warning "standard deviation is zero"
I'm trying to compute correlations by group and have used this very helpful thread:
spearman correlation by group in R
However, there are some NA values in my two variables and in my groups, so I get NA as the result for each group.
So I tried this:
> j <- lapply(split(HTNPS, HTNPS$callcat), function(HTNPS){cor(HTNPS$NPS_int,
HTNPS$holdtime_int,use="pairwise.complete.obs", method = "spearman")})
But then, although I get more reasonable numbers, I get this warning: In cor(HTNPS$NPS_int, HTNPS$holdtime_int, use = "pairwise.complete.obs", : the standard deviation is zero
As requested, here is dput() of the first 40 rows of the relevant columns:
> dput(head(HTNPS[,20:24], 40))
structure(list(holdtime_int = structure(c(6, 11, 7, 7, 5, 7,
6, 5, 3, 6, 3, 5, 6, 105, 7, 6, 353, 5, 6, 9, 6, 6, 12, 5, 5,
5, 249, 5, 7, 11, 5, 7, 5, 290, 6, 6, 6, 6, 5, 6), .Dim = c(40L,
1L)), NPS_int = structure(c(1, NA, NA, 3, NA, 1, 1, 2, NA, NA,
NA, NA, 3, 2, 1, NA, 2, 4, 1, 2, NA, 3, 1, 1, 1, 1, 1, 1, 1,
2, 1, 3, 1, 1, 1, 2, 4, 2, 1, 1), .Dim = c(40L, 1L)), HTnot0 = structure(c(6,
11, 7, 7, 5, 7, 6, 5, 3, 6, 3, 5, 6, 105, 7, 6, 353, 5, 6, 9,
6, 6, 12, 5, 5, 5, 249, 5, 7, 11, 5, 7, 5, 290, 6, 6, 6, 6, 5,
6), .Dim = c(40L, 1L)), callcat = structure(c(NA, NA, "CARD",
"CARD", "GENERAL", "LOAN", "CHANGE DETAILS", "GENERAL", "LOAN",
"CHANGE DETAILS", "LOAN", "CARD", "FUNDS TRANSFER", "FEE", "BALANCE",
NA, "CARD", NA, NA, "STATEMENT", "CARD", "CARD", "GENERAL", "CARD",
"CARD", "TERM DEPOSIT", "CARD", "GENERAL", "CARD", "CARD", "GENERAL",
NA, NA, NA, NA, "CARD", "CARD", "FUNDS TRANSFER", "GENERAL",
"MyBusinessOverride"), .Dim = c(40L, 1L), .Dimnames = list(NULL,
"callcat")), HTcat = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 12L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 9L, 1L, 1L, 1L, 1L, 1L, 1L, 10L, 1L, 1L,
1L, 1L, 1L, 1L), .Dim = c(40L, 1L), .Dimnames = list(NULL, "HTcat"))), .Names = c("holdtime_int",
"NPS_int", "HTnot0", "callcat", "HTcat"), row.names = c(NA, 40L
), class = "data.frame")
If you split the data this way, many of the groups consist of only one observation (after removing the NAs). A correlation cannot be computed from a single observation.
The warning appears when one of the two variables contains only one distinct value within a group. In your example this happens, for instance, in the data frame for callcat == "FUNDS TRANSFER": holdtime_int has only one value (6), so the standard deviation is 0 (hence the warning) and the resulting correlation is NA.
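To see the mechanism in isolation, here is a minimal sketch on made-up vectors (x, y, sd_x, warn_msg and res are illustrative names, not part of your data). It assumes nothing beyond base R: one vector is constant, so cor() has no variation to work with.

```r
# Toy vectors (hypothetical values): x is constant, y varies
x <- c(6, 6, 6)
y <- c(3, 2, 1)

# The standard deviation of a constant vector is 0
sd_x <- sd(x)

# Capture the warning text raised by cor() instead of printing it
warn_msg <- tryCatch(
  {
    cor(x, y, method = "spearman")
    NA_character_  # reached only if no warning was raised
  },
  warning = function(w) conditionMessage(w)
)

# The value cor() returns in this situation is NA
res <- suppressWarnings(cor(x, y, method = "spearman"))
```

Here warn_msg holds the text of the warning and res is NA, which matches what happens inside a group where one variable takes a single value.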
I don't know why you are looking at these correlations, but with the data you provided they make almost no sense to me. If you want to get rid of the warning, you can add a check, for example:
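lapply(split(HTNPS, HTNPS$callcat), function(x) {
  # keep only the rows where both variables are observed
  x <- na.exclude(x[c("holdtime_int", "NPS_int")])
  # a variable with fewer than 2 distinct values has zero standard deviation
  if (any(sapply(x, function(i) length(unique(i))) < 2)) {
    NA
  } else {
    cor(x[, 1], x[, 2], method = "spearman")
  }
})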
This should give you the same result, but without the warning. Note the use of na.exclude to drop the rows that contain NA.
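As a rough usage sketch, the same check can be wrapped in a helper and run on a small made-up data frame. The name safe_spearman and the toy values are assumptions for illustration only; the column names simply mirror yours.

```r
# Hypothetical toy data, not the HTNPS data frame from the question
toy <- data.frame(
  callcat      = c("CARD", "CARD", "CARD", "CARD",
                   "FUNDS TRANSFER", "FUNDS TRANSFER", "FEE"),
  holdtime_int = c(6, 7, 5, 11, 6, 6, 105),
  NPS_int      = c(1, 3, NA, 2, 3, 2, 2)
)

# Hypothetical helper: Spearman correlation for one group, or NA when
# either variable has fewer than 2 distinct values after dropping NAs
safe_spearman <- function(d) {
  d <- na.exclude(d[c("holdtime_int", "NPS_int")])
  if (any(sapply(d, function(col) length(unique(col))) < 2)) {
    NA_real_
  } else {
    cor(d[, 1], d[, 2], method = "spearman")
  }
}

# One value per group, returned as a named numeric vector
res <- sapply(split(toy, toy$callcat), safe_spearman)
```

With these toy values, CARD yields 0.5, while FEE (a single row) and FUNDS TRANSFER (constant holdtime_int) yield NA, and no warning is raised. Returning NA_real_ rather than NA keeps the result numeric when sapply() combines the groups.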