How do I update the values in a column based on values in the same column but in different rows?

Let's take an example:

> set.seed(42)
> ids <- c("u1", "u2", "u3")
> groups <- c(rep("A",3), rep("B",3), rep("C",3))
> reps <- c(rep("r1",9), rep("r2",9), rep("r3",9))
> vals <- rnorm(27, 0, 2)
> 
> df = data.frame(ids = rep(ids, 9), groups = rep(groups,3), reps = reps, vals = vals)
> df
   ids groups reps       vals
1   u1      A   r1  2.7419169
2   u2      A   r1 -1.1293963
3   u3      A   r1  0.7262568
4   u1      B   r1  1.2657252
5   u2      B   r1  0.8085366
6   u3      B   r1 -0.2122490
7   u1      C   r1  3.0230440
8   u2      C   r1 -0.1893181
9   u3      C   r1  4.0368474
10  u1      A   r2 -0.1254282
11  u2      A   r2  2.6097393
12  u3      A   r2  4.5732908
13  u1      B   r2 -2.7777214
14  u2      B   r2 -0.5575775
15  u3      B   r2 -0.2666427
16  u1      C   r2  1.2719008
17  u2      C   r2 -0.5685058
18  u3      C   r2 -5.3129108
19  u1      A   r3 -4.8809339
20  u2      A   r3  2.6402267
21  u3      A   r3 -0.6132772
22  u1      B   r3 -3.5626169
23  u2      B   r3 -0.3438347
24  u3      B   r3  2.4293494
25  u1      C   r3  3.7903869
26  u2      C   r3 -0.8609383
27  u3      C   r3 -0.5145388

      

What I want to do is, for each id, subtract the mean of that id's group C values (C.r1, C.r2 and C.r3) from every value. The idea is to use group C as a baseline for the other groups.

So, in terms of the expected output, for the first two lines:

  • (u1, A, r1) should become 2.74 - mean(3.02, 1.27, 3.79) = 0.047

  • (u2, A, r1) should become -1.13 - mean(-0.19, -0.57, -0.86) = -0.59

How can I get this to work for all rows in a large table (about 1M rows) that contains a number of other columns besides the relevant ones here? Obviously I need to group by ids, but picking out the values where groups == "C" and taking the mean of vals at the same time is a little trickier.

> dt <- setDT(df)
> dt[groups == "C", cmean := mean(vals), ids]

      

gives me the group C mean for each id (repeated on every C row), but I can't apply those means to the other rows, because they have already been filtered out. I think I might need to chain operations, but I don't know how.

I would be equally interested in solutions with data.table and dplyr.



2 answers


We can do a join after subsetting. Subset the rows where 'groups' is 'C', get the mean of 'vals' grouped by 'ids', then join that summary to the original dataset on 'ids', subtract 'Meanvals' (from the summary) from 'vals' (from the original dataset) and assign (:=) the result to 'newvals'.



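library(data.table)
# setDT() converts df to a data.table by reference; the inner df[...] is the per-id mean of group C
setDT(df)[df[groups=="C", .(Meanvals = mean(vals)), ids], 
                         newvals := vals - Meanvals, on = .(ids)]
head(df)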
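As an alternative to the join, the baseline can also be computed inside the grouped expression itself. This is only a sketch: it rebuilds the question's example data, and it assumes every id has at least one row in group C (otherwise `mean()` of an empty vector gives NaN for that id).

```r
library(data.table)

# Rebuild the example data from the question
set.seed(42)
df <- data.frame(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)
dt <- as.data.table(df)

# Within each id, subtract the mean of that id's group "C" rows from every row
dt[, newvals := vals - mean(vals[groups == "C"]), by = ids]
head(dt)
```

Whether this or the join is faster on a table of about 1M rows probably depends on the number of distinct ids, so it is worth timing both on the real data.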

      



One possible dplyr solution:



library(dplyr)
df %>% group_by(ids) %>%
  mutate(mean = mean(vals[groups=="C"]), 
         vals = vals - mean) %>% select(-mean)

# A tibble: 27 × 4
      ids groups   reps        vals
   <fctr> <fctr> <fctr>       <dbl>
1      u1      A     r1  0.04680632
2      u2      A     r1 -0.58980895
3      u3      A     r1  1.32312422
4      u1      B     r1 -1.42938536
5      u2      B     r1  1.34812404
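
To sanity-check these numbers without any packages, a base R sketch along the following lines should give the same result. The names `is_c`, `baseline` and `newvals` are made up for this illustration, and the data is rebuilt from the question's example.

```r
# Rebuild the example data from the question
set.seed(42)
df <- data.frame(
  ids    = rep(c("u1", "u2", "u3"), 9),
  groups = rep(rep(c("A", "B", "C"), each = 3), 3),
  reps   = rep(c("r1", "r2", "r3"), each = 9),
  vals   = rnorm(27, 0, 2)
)

# Mask everything except group "C", then take the per-id mean ignoring the NAs;
# ave() returns a vector aligned with the rows of df
is_c     <- df$groups == "C"
baseline <- ave(ifelse(is_c, df$vals, NA), df$ids,
                FUN = function(x) mean(x, na.rm = TRUE))

df$newvals <- df$vals - baseline
head(df, 5)
```

Unlike the dplyr code above, this keeps the original `vals` column and writes the adjusted values to a new column, which makes the two easy to compare.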

      
