How can I change a specific field in a list of data frames?
Suppose I am writing the following R code:
first.value <- sample(100, 100, replace=TRUE)
second.value <- sample(10, 100, replace=TRUE)
X <- data.frame(first.value, second.value)
split.X <- split(X, second.value)
This code creates a data frame with two fields and splits it into bins according to the second. Now suppose I want to normalize every bin; that is, subtract the mean and divide by the standard deviation. I could accomplish this with:
normalized.first.value <- sapply(split.X, function(X) {(X$first.value - mean(X$first.value)) / sd(X$first.value)})
But this creates a new list with normalized versions of each bin. I really want to replace the copy of the data in split.X with my normalized version.
To illustrate, here is an example:
> first.value <- sample(100, 100, replace=TRUE)
> second.value <- sample(10, 100, replace=TRUE)
> X <- data.frame(first.value, second.value)
> split.X <- split(X, second.value)
> normalized.first.value <- sapply(split.X, function(X) {(X$first.value - mean(X$first.value)) / sd(X$first.value)})
> split.X[[1]]
   first.value second.value
4           34            1
8           40            1
24          21            1
31          34            1
37          23            1
40          22            1
> normalized.first.value[[1]]
[1] 0.625 1.375 -1.000 0.625 -0.750 -0.875
What I really want to do is put the values of normalized.first.value[[1]] into split.X[[1]]$first.value, and the same for the other indices.
This can be achieved with a for loop like this:
for (i in seq_along(split.X)) {
  split.X[[i]]$first.value <- (split.X[[i]]$first.value - mean(split.X[[i]]$first.value)) / sd(split.X[[i]]$first.value)
}
But for loops are BAD in R, and I would like to use sapply, lapply, etc. if possible. Unfortunately, when working with a list of data frames, sapply and lapply don't seem to iterate the way I want.
You can use Map, as both lists are the same length. It replaces the first.value column of each data frame in split.X with the corresponding list element of normalized.first.value:
Map(function(x,y) {x[['first.value']] <- y;x} ,split.X, normalized.first.value)
Or we can loop over the indices of split.X, extract the elements of split.X and normalized.first.value by index, and then do the replacement:
lapply(seq_along(split.X), function(i) {
x1 <- split.X[[i]]
x1[,'first.value'] <- normalized.first.value[[i]]
x1})
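If the normalized values are not needed as a separate object, the two steps can probably be folded into a single lapply call that returns each data frame with its column already replaced. The following is only a sketch: the helper name normalize.column is made up for illustration, and the set.seed call is added just to make the run reproducible. It also avoids relying on sapply, which returns a matrix rather than a list when every group happens to have the same size.

```r
set.seed(1)  # assumption: fixed seed, only so the example is reproducible
first.value <- sample(100, 100, replace = TRUE)
second.value <- sample(10, 100, replace = TRUE)
X <- data.frame(first.value, second.value)
split.X <- split(X, X$second.value)

# Hypothetical helper: return a copy of d with first.value standardized
# (mean subtracted, divided by the standard deviation).
normalize.column <- function(d) {
  d$first.value <- (d$first.value - mean(d$first.value)) / sd(d$first.value)
  d
}

# lapply always returns a list, one data frame per group,
# and keeps the names of split.X.
new.split.X <- lapply(split.X, normalize.column)
```

Assigning the result back with split.X <- lapply(split.X, normalize.column) overwrites the original list. Note that a group with a single row ends up with NA, because sd is undefined for one value.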
Here's a more arcane way (although I still think the for loop is fine in this case):
new.split.X <- mapply(`[<-`, split.X, T, 'first.value', normalized.first.value,
SIMPLIFY=F)
How it works: it applies `[<-` to every split.X[[i]]. T (that is, TRUE) is the row index i to replace (i.e. all rows), 'first.value' is the column index j to replace, and normalized.first.value contains the replacements.
In the end the loop may be easier to read, and it is not necessarily slower than the more complex *apply solutions.
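To see what the mapply call does for one element, `[<-` can be called as an ordinary function on a single data frame. The small data frame df below is invented purely for this illustration.

```r
# Hypothetical toy data frame, not taken from the question.
df <- data.frame(a = 1:3, b = 4:6)

# Same effect as: df2 <- df; df2[TRUE, "a"] <- c(10, 20, 30)
# Arguments are, in order: the object, row index i, column index j, new values.
df2 <- `[<-`(df, TRUE, "a", c(10, 20, 30))

df2$a  # the replaced column
df2$b  # the other column is left alone
```

mapply simply repeats this call once per list element, recycling the length-one arguments T and 'first.value' for every data frame.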
library(rbenchmark)
benchmark(loop={
for (i in 1:length(split.X))
split.X[[i]]$first.value <- normalized.first.value[[i]]
},
mapply={
mapply(`[<-`, split.X, T, 'first.value', normalized.first.value,
SIMPLIFY=F)
},
Map={
Map(function(x,y) {x[['first.value']] <- y;x} ,split.X, normalized.first.value)
},
lapply={
lapply(seq_along(split.X), function(i) {
x1 <- split.X[[i]]
x1[,'first.value'] <- normalized.first.value[[i]]
x1})
})
    test replications elapsed relative user.self sys.self user.child sys.child
4 lapply          100   0.034    4.857     0.035        0          0         0
1   loop          100   0.007    1.000     0.007        0          0         0
3    Map          100   0.012    1.714     0.013        0          0         0
2 mapply          100   0.030    4.286     0.032        0          0         0
So the explicit loop is the fastest in this benchmark, and the easiest to read anyway.