R Dataframe: Concatenate rows per column, row by row, by group
I have what seems to be a very inefficient solution to a specific problem. I have text data that for various reasons is split across data lines at arbitrary intervals. However, some subsets are known to belong to each other based on unique combinations of other variables in the data frame. See for example the MWE demonstrating the structure and my original solution:
# Data
df <- read.table(text="page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy",header=T,stringsAsFactors=F)
master<-data.frame()
for (i in 123:max(df$passage)) {
print(paste0('passage ',i))
tempset <- df[df$passage==i,]
concat<-''
for (j in 1:nrow(tempset)) {
print(paste0('index ',j))
concat<-paste(concat, tempset$text[j])
}
tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
master<-rbind(master, tempdf)
rm(concat, tempset, tempdf)
}
master
> master
tempset.page.1. tempset.passage.1. tempset.person.1. concat
1 1 123 A hello my name is guy
2 1 124 B well hello guy
In this example, as in my real-life case, "walkthrough" is a unique grouping variable, so it is not necessary to bring other parts along with it, although I would like them to be available in my dataset.
My current estimates are that this procedure, which I developed, will take hours for a dataset that is otherwise easily processed by R on my computer. Perhaps there are some benefits that can be achieved by either other functions or packages, or not creating and deleting so many objects?
Thanks for any help here!
source to share
data.table
Here's one way:
require(data.table)
DT <- data.table(df)
DT[,.(concat=paste0(text,collapse=" ")),by=.(page,passage,person)]
# page passage person concat
# 1: 1 123 A hello my name is guy
# 2: 1 124 B well hello guy
I guess the additional variables (besides passage
) in are by
not worth much. I think.
dplyr Analog
df %>%
group_by(page,passage,person) %>%
summarise(concat=paste0(text,collapse=" "))
# Source: local data frame [2 x 4]
# Groups: page, passage, person
#
# page passage person concat
# 1 1 123 A hello my name is guy
# 2 1 124 B well hello guy
base R One way:
df$concat <- with(df,ave(text,passage,FUN=function(x)paste0(x,collapse=" ")))
unique(df[,which(names(df)%in%c("page","passage","person","concat"))])
# page passage person concat
# 1 1 123 A hello my name is guy
# 6 1 124 B well hello guy
source to share