Aggregation based on "side by side" values
I have a very messy (web-scraped) dataframe which unfortunately has a lot of duplicate and even triplicate entries. Most of the data looks like this:
> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
var1 var2 var3
1 a right.a correct.a
2 a <NA> correct.a
3 b right.b correct.b
4 b <NA> correct.b
5 c right.c correct.c
6 c <NA> correct.c
7 d right.d correct.d
8 d <NA> correct.d
"var1" is my id variable that I need to use for aggregation. My goal is to have a data view that looks like this:
var1 var2 var3
1 a right.a correct.a
2 b right.b correct.b
3 c right.c correct.c
4 d right.d correct.d
However, the main problem is that not all of the data looks like this. In fact, I have other parts that look like this:
> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
var1 var2 var3
1 e <NA> correct.e
2 e <NA> correct.e
3 e right.e <NA>
4 f <NA> correct.f
5 f <NA> <NA>
6 g <NA> correct.g
7 g right.g correct.g
8 g <NA> <NA>
And there are other variations. In the end, each identifier should have a single row with the right var2 and the correct var3. This is where I am stuck: my var1 is not unique. However, I know that duplicate ids that "belong" together sit next to each other in the dataframe (as seen in my examples); for example, rows 4102 and 4103 may contain another "a".
What I think could work is an aggregate with var1 as the ID that additionally tells R to look only at rows within ±2 of the current var1 while doing so. Any ideas how to code this?
Thanks!
Here is a method using data.table
library(data.table)
setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
var1 V1 V2
1: a right.a correct.a
2: b right.b correct.b
3: c right.c correct.c
4: d right.d correct.d
and
setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
var1 V1 V2
1: e right.e correct.e
2: f NA correct.f
3: g right.g correct.g
The idea is that var2[!is.na(var2)][1], for example, takes the first non-missing value of var2. If all values are missing, it returns NA. This operation is performed for both variables, grouped by var1.
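To make the behaviour of that idiom concrete, here is a minimal sketch in base R. The helper name first_non_na is hypothetical and only used for illustration; it is not part of data.table.

```r
# Hypothetical helper wrapping the idiom x[!is.na(x)][1]
first_non_na <- function(x) x[!is.na(x)][1]

# Group "e" from df2: the only non-missing value is picked
first_non_na(c(NA, NA, "right.e"))             # "right.e"

# Group "f" from df2: everything is missing, so indexing the
# empty vector with [1] yields NA
first_non_na(c(NA_character_, NA_character_))  # NA
```

Indexing a zero-length vector with [1] gives NA rather than an error, which is why groups without any valid value still produce a row.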
If you have more than two variables, you can switch to lapply. For example:
df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
var1 var2 var3
1: a right.a correct.a
2: b right.b correct.b
3: c right.c correct.c
4: d right.d correct.d
In the case where a value of var1 has more than one valid row, and this is indicated by a non-missing var2, you can achieve the intended result by using a join.
Data from the comment,
df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))
Then, using this data,
setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
var1 var2 var3
1: a right.a correct.a
2: a right.a1 correct.a1
3: b right.b correct.b
4: c right.c correct.c
5: d right.d correct.d
Here, all non-missing observations of var2, together with var1, are joined back onto the original dataset.
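If the duplicates that belong together are always adjacent, as the question states, another option is to group by runs of consecutive identical var1 values instead of by var1 alone. The following is a sketch under that assumption, using data.table's rleid(); the names dt, block and first_non_na are introduced here for illustration only.

```r
library(data.table)

# Data from the comment: a second, separate block of "a" at the end
dt <- data.table(
  var1 = c("a","a","b","b","c","c","d","d","a","a"),
  var2 = c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
  var3 = c("correct.a","correct.a","correct.b","correct.b","correct.c",
           "correct.c","correct.d","correct.d","correct.a1","correct.a1")
)

# Hypothetical helper: first non-missing value, NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# rleid(var1) increases by one each time var1 changes from one row to
# the next, so each contiguous block of equal ids forms its own group
res <- dt[, lapply(.SD, first_non_na), by = .(block = rleid(var1), var1)]
res
```

Unlike the join, this does not rely on var2 being present to tell the two "a" blocks apart, but it does rely on rows of the same entry never being interrupted by a different id.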
If var2 and var3 have only one unique value for each level of var1, then:
library(dplyr)
df = rbind(df1,df2)
df %>% group_by(var1) %>%
# funs() is deprecated since dplyr 0.8.0; across() needs dplyr >= 1.0.0
summarise(across(everything(), ~ .x[!is.na(.x)][1]))
var1 var2 var3
1 a right.a correct.a
2 b right.b correct.b
3 c right.c correct.c
4 d right.d correct.d
5 e right.e correct.e
6 f <NA> correct.f
7 g right.g correct.g