R Dataframe: Concatenate rows per column, row by row, by group

Question

R Dataframe: Concatenate rows per column, row by row, by group

I have what seems to be a very inefficient solution to a specific problem. I have text data that for various reasons is split across data lines at arbitrary intervals. However, some subsets are known to belong to each other based on unique combinations of other variables in the data frame. See for example the MWE demonstrating the structure and my original solution:

# Data
df <- read.table(text="page passage  person index text
1  123   A   1 hello      
1  123   A   2 my
1  123   A   3 name
1  123   A   4 is
1  123   A   5 guy
1  124   B   1 well
1  124   B   2 hello
1  124   B   3 guy",header=T,stringsAsFactors=F)

master<-data.frame()
for (i in 123:max(df$passage)) {
  print(paste0('passage ',i))
  tempset <- df[df$passage==i,]
  concat<-''
  for (j in 1:nrow(tempset)) {
    print(paste0('index ',j))
    concat<-paste(concat, tempset$text[j])
  }
  tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
  master<-rbind(master, tempdf)
  rm(concat, tempset, tempdf)
}
master
> master
  tempset.page.1. tempset.passage.1. tempset.person.1.                concat
1               1                123                 A  hello my name is guy
2               1                124                 B        well hello guy

In this example, as in my real-life case, "walkthrough" is a unique grouping variable, so it is not necessary to bring other parts along with it, although I would like them to be available in my dataset.

My current estimates are that this procedure, which I developed, will take hours for a dataset that is otherwise easily processed by R on my computer. Perhaps there are some benefits that can be achieved by either other functions or packages, or not creating and deleting so many objects?

Thanks for any help here!

+3

string r string-concatenation aggregate-functions dataframe

SOConnell May 15 '15 at 19:05

source to share

2 answers

data.table

Here's one way:

require(data.table)
DT <- data.table(df)

DT[,.(concat=paste0(text,collapse=" ")),by=.(page,passage,person)]
#    page passage person               concat
# 1:    1     123      A hello my name is guy
# 2:    1     124      B       well hello guy

I guess the additional variables (besides passage

) in are by

not worth much. I think.

dplyr Analog

df %>% 
  group_by(page,passage,person) %>% 
  summarise(concat=paste0(text,collapse=" "))

# Source: local data frame [2 x 4]
# Groups: page, passage, person
# 
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 2    1     124      B       well hello guy

base R One way:

df$concat <- with(df,ave(text,passage,FUN=function(x)paste0(x,collapse=" ")))
unique(df[,which(names(df)%in%c("page","passage","person","concat"))])
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 6    1     124      B       well hello guy

+7

Frank May 15 '15 at 19:11

source to share

Matthew plourde · Accepted Answer · 2015-05-15T19:46:15+0000

There are two ways:

base R

aggregate(
    text ~ page + passage + person, 
    data=df, 
    FUN=paste, collapse=' '
)

dplyr

library(dplyr)
df %>% 
    group_by_(~page, ~passage, ~person) %>%
    summarize_(text=~paste(text, collapse=' '))

R Dataframe: Concatenate rows per column, row by row, by group

More articles: