Aggregation based on "side by side" values

Question

I have a very messy, web-scraped dataframe which unfortunately contains many duplicated and even triplicated entries. Most of the data looks like this:

> df1<-data.frame(var1=c("a","a","b","b","c","c","d","d"),var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA),var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d"))
> df1
  var1    var2      var3
1    a right.a correct.a
2    a    <NA> correct.a
3    b right.b correct.b
4    b    <NA> correct.b
5    c right.c correct.c
6    c    <NA> correct.c
7    d right.d correct.d
8    d    <NA> correct.d

"var1" is the id variable that I need to use for aggregation. My goal is a dataframe that looks like this:

  var1    var2      var3
1    a right.a correct.a
2    b right.b correct.b
3    c right.c correct.c
4    d right.d correct.d

However, the main problem is that not all of the data is structured like this. In fact, other parts look like this:

> df2<-data.frame(var1=c("e","e","e","f","f","g","g","g"),var2=c(NA,NA,"right.e",NA,NA,NA,"right.g",NA),var3=c("correct.e","correct.e",NA,"correct.f",NA,"correct.g","correct.g",NA))
> df2
  var1    var2      var3
1    e    <NA> correct.e
2    e    <NA> correct.e
3    e right.e      <NA>
4    f    <NA> correct.f
5    f    <NA>      <NA>
6    g    <NA> correct.g
7    g right.g correct.g
8    g    <NA>      <NA>

There are other variations as well. In the end, each identifier should have exactly one row with the right var2 and the correct var3. What has me at a loss is that var1 is not unique. However, I know that duplicate ids that "belong" together sit next to each other in the dataframe (as seen in my examples); for example, rows 4102 and 4103 may contain another, unrelated "a".

What I think could work is to use aggregate with var1 as the id, but additionally tell R to only check the rows within ±2 of each var1 entry while doing so. Any ideas how to code this?

Thanks!

+2

r aggregate dataframe na

Mr. Zen 29 Mar 17 at 17:45

2 answers

If var2 and var3 have only one unique value for each level of var1, then:

library(dplyr)

df = rbind(df1,df2)

df %>% group_by(var1) %>%
  summarise_all(funs(.[!is.na(.)][1]))

   var1    var2      var3
1     a right.a correct.a
2     b right.b correct.b
3     c right.c correct.c
4     d right.d correct.d
5     e right.e correct.e
6     f    <NA> correct.f
7     g right.g correct.g
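
In current dplyr releases, funs() and summarise_all() have been deprecated or superseded in favour of across(). A minimal sketch of the same idea in that style, assuming dplyr 1.0.0 or later and using a shortened stand-in dataframe (df_small is a hypothetical name, not from the question):

```r
library(dplyr)

# Shortened stand-in for rbind(df1, df2); df_small is a hypothetical name.
df_small <- data.frame(
  var1 = c("a", "a", "e", "e", "e", "f", "f"),
  var2 = c("right.a", NA, NA, NA, "right.e", NA, NA),
  var3 = c("correct.a", "correct.a", "correct.e", "correct.e", NA, "correct.f", NA),
  stringsAsFactors = FALSE
)

# For every non-grouping column, keep the first non-missing value per var1.
# If a group has no non-missing value, the result for that column is NA.
res <- df_small %>%
  group_by(var1) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

print(res)
```

As in the original answer, this only gives a sensible result if each var1 level has a single valid value per column.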

+1

eipi10 29 Mar 17 at 17:55

lmo · Accepted Answer · 2017-03-29T18:12:59+0000

Here is a method using data.table:

library(data.table)

setDT(df1)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

and

setDT(df2)[, .(var2[!is.na(var2)][1], var3[!is.na(var3)][1]), by=var1]
   var1      V1        V2
1:    e right.e correct.e
2:    f      NA correct.f
3:    g right.g correct.g

The idea is that var2[!is.na(var2)][1], for example, takes the first non-missing value of var2. If all values are missing, it returns NA. This operation is performed for both variables, grouped by var1.
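
To see why the all-missing case yields NA, it may help to try the expression on plain vectors. The following is a small base R sketch; the helper name first_non_na is hypothetical and only wraps the expression from the answer:

```r
# Hypothetical helper wrapping the expression used in the answer.
first_non_na <- function(x) x[!is.na(x)][1]

some_missing <- c(NA, "right.e", NA)
all_missing  <- c(NA_character_, NA_character_)

# Subsetting drops the NAs, then [1] picks the first remaining element.
first_non_na(some_missing)

# Here the subset is empty, and indexing an empty vector with [1] gives NA.
first_non_na(all_missing)
```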

If you have more than two variables, you can switch to lapply over .SD, for example:

df1[, lapply(.SD, function(i) i[!is.na(i)][1]), by=var1]
   var1    var2      var3
1:    a right.a correct.a
2:    b right.b correct.b
3:    c right.c correct.c
4:    d right.d correct.d

If the same var1 value can occur in more than one valid group, and each valid row is indicated by a non-missing var2, you can achieve the desired result with a join.

Data from the comment,

df1<-data.frame(var1=c("a","a","b","b","c","c","d","d","a","a"),
                var2=c("right.a",NA,"right.b",NA,"right.c",NA,"right.d",NA,"right.a1",NA),
                var3=c("correct.a","correct.a","correct.b","correct.b","correct.c","correct.c","correct.d","correct.d","correct.a1","correct.a1"))

Then, using this data,

setDT(df1)[df1[, .(var2=var2[!is.na(var2)]), by=var1], on=.(var1, var2)]
   var1     var2       var3
1:    a  right.a  correct.a
2:    a right.a1 correct.a1
3:    b  right.b  correct.b
4:    c  right.c  correct.c
5:    d  right.d  correct.d

Here, all non-missing values of var2, together with their var1, are joined back onto the original dataset.
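
Since the question states that ids which belong together sit next to each other, another option is to group by consecutive runs of var1 rather than by var1 alone. This is only a sketch, not part of the original answer: it uses data.table's rleid() and assumes that two unrelated groups with the same id are never directly adjacent. The names dt and run are hypothetical.

```r
library(data.table)

# Shortened version of the data from the comment; "a" appears in two separate runs.
dt <- data.table(
  var1 = c("a", "a", "b", "b", "a", "a"),
  var2 = c("right.a", NA, "right.b", NA, "right.a1", NA),
  var3 = c("correct.a", "correct.a", "correct.b", "correct.b",
           "correct.a1", "correct.a1")
)

# rleid(var1) numbers consecutive runs of equal values (here 1, 1, 2, 2, 3, 3),
# so the second block of "a" rows forms its own group.
res <- dt[, lapply(.SD, function(x) x[!is.na(x)][1]),
          by = .(run = rleid(var1), var1)]

# Drop the helper column once the grouping is done.
res[, run := NULL]

print(res)
```

Unlike the join, this does not rely on var2 marking the valid rows, but it does rely on the adjacency assumption stated above.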
