How do I update the values in a column based on the values in the same column but on different rows?
Let's take an example:
> set.seed(42)
> ids <- c("u1", "u2", "u3")
> groups <- c(rep("A",3), rep("B",3), rep("C",3))
> reps <- c(rep("r1",9), rep("r2",9), rep("r3",9))
> vals <- rnorm(27, 0, 2)
>
> df = data.frame(ids = rep(ids, 9), groups = rep(groups,3), reps = reps, vals = vals)
> df
ids groups reps vals
1 u1 A r1 2.7419169
2 u2 A r1 -1.1293963
3 u3 A r1 0.7262568
4 u1 B r1 1.2657252
5 u2 B r1 0.8085366
6 u3 B r1 -0.2122490
7 u1 C r1 3.0230440
8 u2 C r1 -0.1893181
9 u3 C r1 4.0368474
10 u1 A r2 -0.1254282
11 u2 A r2 2.6097393
12 u3 A r2 4.5732908
13 u1 B r2 -2.7777214
14 u2 B r2 -0.5575775
15 u3 B r2 -0.2666427
16 u1 C r2 1.2719008
17 u2 C r2 -0.5685058
18 u3 C r2 -5.3129108
19 u1 A r3 -4.8809339
20 u2 A r3 2.6402267
21 u3 A r3 -0.6132772
22 u1 B r3 -3.5626169
23 u2 B r3 -0.3438347
24 u3 B r3 2.4293494
25 u1 C r3 3.7903869
26 u2 C r3 -0.8609383
27 u3 C r3 -0.5145388
What I want to do is subtract, for each id, the average of that id's values in C.r1, C.r2 and C.r3 from every value for that id. The idea is to use group C as a baseline for the other groups.
So, in terms of the expected output, for the first two rows:
- (u1, A, r1) should become 2.74 - mean(3.02, 1.27, 3.79) = 0.047
- (u2, A, r1) should become -1.13 - mean(-0.19, -0.57, -0.86) = -0.59
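As a quick check of the arithmetic, here is a minimal sketch that recomputes those two expected values from the full-precision numbers in the printed table. The variable names are only illustrative, and the results may differ slightly from hand-rounded figures.

```r
# Values copied from the printed data frame in the question
u1_baseline <- mean(c(3.0230440, 1.2719008, 3.7903869))     # u1: C.r1, C.r2, C.r3
u2_baseline <- mean(c(-0.1893181, -0.5685058, -0.8609383))  # u2: C.r1, C.r2, C.r3

# Subtract each id's group-C baseline from its (A, r1) value
u1_A_r1 <- 2.7419169 - u1_baseline
u2_A_r1 <- -1.1293963 - u2_baseline

print(c(u1_A_r1 = u1_A_r1, u2_A_r1 = u2_A_r1))
```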
How can I get this to work with all rows in a large table (about 1M rows) that contains a number of other columns besides the relevant ones here? Obviously I need to group by ids, but picking out the values that specifically match groups == "C" while computing the mean of vals is a little trickier.
> dt <- setDT(df)
> dt[groups == "C", cmean := mean(vals), ids]
gives me the group C mean for each id (repeated on every group C row), but I can't use those values directly, because all other rows have already been filtered out. I think I might need a chain, but I don't know how.
I would be equally interested in solutions with data.table and dplyr.
We can do a join after subsetting the rows where 'groups' is 'C': grouped by 'ids', get the mean of 'vals', then join the result back to the original dataset on 'ids', subtract the 'Meanvals' of the second dataset from the 'vals' of the first, and assign ( := ) the result to 'newvals'.
library(data.table)
# Update join: per-id mean of the group C rows, joined back onto df by ids
setDT(df)[df[groups == "C", .(Meanvals = mean(vals)), ids],
          newvals := vals - Meanvals, on = .(ids)]
head(df)
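As an alternative to the join, the same baseline subtraction can be sketched as a single grouped update. This is not the approach described above, just a variant under the same assumptions: it rebuilds the question's example data as `dt` and writes the result to a new column `newvals`.

```r
library(data.table)

# Rebuild the example data from the question
set.seed(42)
dt <- data.table(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)

# Within each id, subtract the mean of that id's group-C rows from every row
dt[, newvals := vals - mean(vals[groups == "C"]), by = ids]

head(dt)
```

One difference worth knowing: if an id has no group C rows, this version yields NaN for that id (the mean of an empty vector), whereas the update join leaves `newvals` as NA for ids that have no match.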
One possible dplyr solution:
library(dplyr)
df %>% group_by(ids) %>%
  mutate(mean = mean(vals[groups=="C"]),
         vals = vals - mean) %>% select(-mean)
# A tibble: 27 × 4
     ids groups   reps        vals
  <fctr> <fctr> <fctr>       <dbl>
1     u1      A     r1  0.04680632
2     u2      A     r1 -0.58980895
3     u3      A     r1  1.32312422
4     u1      B     r1 -1.42938536
5     u2      B     r1  1.34812404
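If you would rather keep the original vals and drop the grouping afterwards, a hedged variant of the same idea is sketched below. The column name `newvals` and the object name `result` are assumptions made for illustration, and the example data is rebuilt so the block runs on its own.

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(42)
df <- data.frame(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)

result <- df %>%
  group_by(ids) %>%
  # Baseline is the mean of this id's group-C rows; vals itself is left untouched
  mutate(newvals = vals - mean(vals[groups == "C"])) %>%
  ungroup()

head(result)
```

Calling ungroup() at the end means later operations on `result` are not silently applied per id.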