Collapsing a data frame
How can I collapse my dataframe when many observations have multiple rows but no more than one value for each of several different variables?
Here's what I have:
id  title  info                var1      var2      var3
1   foo    Some string here    string 1
1   foo    Some string here              string 2
1   foo    Some string here                        string 3
2   bar    A different string  string 4  string 5
2   bar    A different string                      string 6
3   baz    Something else      string 7            string 8
This is what I want:
id  title  info                var1      var2      var3
1   foo    Some string here    string 1  string 2  string 3
2   bar    A different string  string 4  string 5  string 6
3   baz    Something else      string 7            string 8
I think I could use something like:
ddply(merged, .(id, title, info), summarize, var1 = max(var1), var2 = max(var2), var3 = max(var3))
But the problem is that there are many more variables like var1-var3, and they are programmatically generated. As a result, I need a way to insert var1 = max(var1), etc. programmatically, based on a list of variable names.
1 answer
There are many possible ways to achieve this; here are two.
First, define a helper function that keeps only the non-empty values:
Myfunc <- function(x) x[x != '']
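One thing to watch with a helper like this: it returns a zero-length vector when a group has no non-empty value for a column (as with var2 for id 3), and more than one value if a group has several. A slightly more defensive variant is sketched below, assuming blank cells are stored as "" or NA; the name first_nonempty is made up for illustration and is not part of the original answer.

```r
# Hypothetical helper (assumption, not from the original answer):
# return the first non-empty, non-NA value of x.
# Indexing a zero-length vector with [1] yields NA of the same type,
# so a group with no values collapses to a single NA.
first_nonempty <- function(x) x[!is.na(x) & x != ""][1]

a <- first_nonempty(c("", "string 2", ""))  # one non-empty value present
b <- first_nonempty(c("", ""))              # nothing to keep, so NA
```

This always yields exactly one value per group and column, which is what a one-row-per-group summary expects.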
Using data.table
library(data.table)
setDT(df)[, lapply(.SD, Myfunc), by = list(id, title, info)]
#    id title               info     var1     var2     var3
# 1:  1   foo   Some string here string 1 string 2 string 3
# 2:  2   bar A different string string 4 string 5 string 6
# 3:  3   baz     Something else string 7       NA string 8
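Since the question asks for a solution driven by a list of variable names, here is a rough sketch of the same data.table call with the columns passed as character vectors through `by` and `.SDcols`. The toy data, the names id_cols and var_cols, and the helper first_nonempty are assumptions for illustration; blank cells are assumed to be empty strings.

```r
library(data.table)

# Toy data mirroring the question; blank cells are assumed to be "".
df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3),
  title = c("foo", "foo", "foo", "bar", "bar", "baz"),
  info  = c(rep("Some string here", 3),
            rep("A different string", 2),
            "Something else"),
  var1  = c("string 1", "", "", "string 4", "", "string 7"),
  var2  = c("", "string 2", "", "string 5", "", ""),
  var3  = c("", "", "string 3", "", "string 6", "string 8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-empty value, or NA if there is none.
first_nonempty <- function(x) x[!is.na(x) & x != ""][1]

# Column names held in character vectors, e.g. generated elsewhere.
id_cols  <- c("id", "title", "info")
var_cols <- setdiff(names(df), id_cols)

# as.data.table() copies df; setDT(df) would convert it in place instead.
res <- as.data.table(df)[, lapply(.SD, first_nonempty),
                         by = id_cols, .SDcols = var_cols]
print(res)
```

Any character vector of column names can be substituted for var_cols, so nothing has to be typed out per variable.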
Or, similarly, with dplyr:
library(dplyr)
df %>%
  group_by(id, title, info) %>%
  summarise_each(funs(Myfunc))
# Source: local data table [3 x 6]
# Groups: id, title
#
#   id title               info     var1     var2     var3
# 1  1   foo   Some string here string 1 string 2 string 3
# 2  2   bar A different string string 4 string 5 string 6
# 3  3   baz     Something else string 7       NA string 8
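In current dplyr releases, summarise_each() and funs() are deprecated in favour of across(). A possible equivalent is sketched below, assuming dplyr 1.0.0 or later; the toy data, the vector var_cols, and the helper first_nonempty are made up for illustration. The helper returns exactly one value per group, which avoids the zero-length result the original helper would give for var2 in id 3.

```r
library(dplyr)

# Small toy data set in the shape of the question; blanks are "".
df <- data.frame(
  id    = c(2, 2, 3),
  title = c("bar", "bar", "baz"),
  info  = c("A different string", "A different string", "Something else"),
  var1  = c("string 4", "", "string 7"),
  var2  = c("string 5", "", ""),
  var3  = c("", "string 6", "string 8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-empty value, or NA if there is none.
first_nonempty <- function(x) x[!is.na(x) & x != ""][1]

# Names of the columns to collapse, e.g. generated programmatically.
var_cols <- c("var1", "var2", "var3")

res <- df %>%
  group_by(id, title, info) %>%
  summarise(across(all_of(var_cols), first_nonempty), .groups = "drop")
print(res)
```

all_of() takes the character vector as is, so the list of variables can be built elsewhere and passed in unchanged.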