Replacing missing values with group means in R: "index out of bounds" error
I have a huge file that looks like this:
V1 SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9
GROUP1 1 NA 2 1 1 NA 1 1 2
GROUP1 1 2 NA 0 0 2 1 1 NA
GROUP1 0 2 2 0 NA 1 1 1 2
GROUP2 1 2 1 1 1 NA 2 0 2
GROUP2 1 1 1 NA 0 1 0 1 NA
GROUP2 1 1 NA 1 0 1 NA 1 0
What I need to do is replace each missing value with the mean of its column within the same group. This works on a small example, but on the large file I get an "index out of bounds" error.
Here is what I am doing. First, I create a list of the groups that I want to keep for further analysis:
group.list = unique(data_file$V1)
Next, I compute the mean of each column within each group:
A <- colMeans(data_file[data_file$V1 == group.list[1], -1], na.rm = TRUE)
for (i in 2:length(group.list)) {
  A <- rbind(A, colMeans(data_file[data_file$V1 %in% group.list[i], -1], na.rm = TRUE))
}
rownames(A) <- group.list
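The same table of per-group column means can probably be built without growing `A` inside a loop. This is only a minimal sketch, assuming `data_file` has the layout shown above (group label in `V1`, numeric SNP columns after it). The small `data_file` built here just recreates the six sample rows for illustration.

```r
# Recreate the six sample rows shown above (illustration only).
data_file <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
V1 SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9
GROUP1 1 NA 2 1 1 NA 1 1 2
GROUP1 1 2 NA 0 0 2 1 1 NA
GROUP1 0 2 2 0 NA 1 1 1 2
GROUP2 1 2 1 1 1 NA 2 0 2
GROUP2 1 1 1 NA 0 1 0 1 NA
GROUP2 1 1 NA 1 0 1 NA 1 0
")

# Split the SNP columns by group, take the column means within each group,
# and stack the results into a groups x SNPs matrix.
A <- do.call(rbind, lapply(split(data_file[, -1], data_file$V1),
                           colMeans, na.rm = TRUE))
```

The rows of this `A` are named after the groups, in the sorted order that `split()` uses, which may differ from the order returned by `unique()`. Looking rows up by name (for example `A["GROUP1", ]`) is therefore safer than looking them up by position.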
Some column means (SNPs) are missing for some groups, so I keep only the SNPs that have a mean in every group and then fill in the missing values:
SNP.present <- which(A[1, ] >= 0)
for (i in 2:length(group.list)) {
  SNP.present <- intersect(SNP.present, which(A[i, ] >= 0))
}
A <- A[, SNP.present]
data_file1 = data_file[, c(1, SNP.present + 1)]
for (i in 1:dim(data_file1)[1]) {
  a <- which(is.na(data_file1[i, ]))
  if (length(a) > 0) {
    data_file1[i, a] <- A[data_file1$V1[i], a]
  }
}
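For comparison, the fill-in step can also be written column by column with base R's `ave()`, which returns a group-wise summary aligned with the original rows. This is a sketch under the same layout assumptions, not the code from the question. `impute_group_mean` and `data_filled` are hypothetical names introduced here, and the sample rows are recreated so that the block runs on its own.

```r
# Recreate the six sample rows shown above (illustration only).
data_file <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
V1 SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9
GROUP1 1 NA 2 1 1 NA 1 1 2
GROUP1 1 2 NA 0 0 2 1 1 NA
GROUP1 0 2 2 0 NA 1 1 1 2
GROUP2 1 2 1 1 1 NA 2 0 2
GROUP2 1 1 1 NA 0 1 0 1 NA
GROUP2 1 1 NA 1 0 1 NA 1 0
")

# Hypothetical helper: replace each NA in x with the mean of the
# non-missing values of x that belong to the same group g.
impute_group_mean <- function(x, g) {
  x <- as.numeric(x)
  m <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE))
  x[is.na(x)] <- m[is.na(x)]
  x
}

# Apply the helper to every SNP column, leaving V1 untouched.
snp_cols <- names(data_file)[-1]
data_filled <- data_file
data_filled[snp_cols] <- lapply(data_file[snp_cols], impute_group_mean,
                                g = data_file$V1)
```

If a group has no observed value at all for some SNP, its mean is `NaN` and the missing entries in that column stay missing, so such columns would still need to be dropped, as the question does with `SNP.present`.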
When I run this on a small dataset, it works. However, when I run it on the full dataset, I get this error:
Error in A[data_file1$V1[i], a] : index out of bounds
Does anyone know what might be wrong?
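One possible cause, offered as a guess rather than a confirmed diagnosis: in the loop above, `a` holds column positions within `data_file1`, which still contains `V1` as its first column, while `A` has no `V1` column. Every position is then shifted by one when used on `A`, and a missing value in the last column points past the end of `A`. Separately, if `V1` is a factor, indexing `A` with it uses the integer level codes, which need not match the row order of `A`. The toy objects below are assumed values with the same shapes, used only to show the offset.

```r
# Toy objects with the same shapes as in the loop above (assumed values).
data_file1 <- data.frame(V1 = c("GROUP1", "GROUP2"),
                         SNP1 = c(1, NA), SNP2 = c(NA, 2),
                         stringsAsFactors = FALSE)
A <- rbind(GROUP1 = c(SNP1 = 1, SNP2 = 0.5),
           GROUP2 = c(SNP1 = 0, SNP2 = 2))

i <- 1
a <- which(is.na(data_file1[i, ]))  # positions within data_file1, V1 included
a                                   # 3: SNP2 is the third column of data_file1
ncol(A)                             # 2: A has no V1 column, so A[, 3] does not exist

# Looking the mean up by group name and SNP name avoids the offset.
snp <- names(data_file1)[a]
data_file1[i, a] <- A[as.character(data_file1$V1[i]), snp]
```

Using `a - 1` as the column index into `A` would address the offset as well. The name-based lookup is shown because it also sidesteps the factor issue.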