R Dataframe: Concatenate rows per column, row by row, by group

I have what seems to be a very inefficient solution to a specific problem. I have text data that for various reasons is split across data lines at arbitrary intervals. However, some subsets are known to belong to each other based on unique combinations of other variables in the data frame. See for example the MWE demonstrating the structure and my original solution:

# Data
df <- read.table(text="page passage  person index text
1  123   A   1 hello      
1  123   A   2 my
1  123   A   3 name
1  123   A   4 is
1  123   A   5 guy
1  124   B   1 well
1  124   B   2 hello
1  124   B   3 guy",header=T,stringsAsFactors=F)

master<-data.frame()
for (i in 123:max(df$passage)) {
  print(paste0('passage ',i))
  tempset <- df[df$passage==i,]
  concat<-''
  for (j in 1:nrow(tempset)) {
    print(paste0('index ',j))
    concat<-paste(concat, tempset$text[j])
  }
  tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
  master<-rbind(master, tempdf)
  rm(concat, tempset, tempdf)
}
master
> master
  tempset.page.1. tempset.passage.1. tempset.person.1.                concat
1               1                123                 A  hello my name is guy
2               1                124                 B        well hello guy

      

In this example, as in my real-life case, "walkthrough" is a unique grouping variable, so it is not necessary to bring other parts along with it, although I would like them to be available in my dataset.

My current estimates are that this procedure, which I developed, will take hours for a dataset that is otherwise easily processed by R on my computer. Perhaps there are some benefits that can be achieved by either other functions or packages, or not creating and deleting so many objects?

Thanks for any help here!

+3


source to share


2 answers


There are two ways:

base R

aggregate(
    text ~ page + passage + person, 
    data=df, 
    FUN=paste, collapse=' '
)

      



dplyr

library(dplyr)
df %>% 
    group_by_(~page, ~passage, ~person) %>%
    summarize_(text=~paste(text, collapse=' '))

      

+4


source


data.table

Here's one way:

require(data.table)
DT <- data.table(df)

DT[,.(concat=paste0(text,collapse=" ")),by=.(page,passage,person)]
#    page passage person               concat
# 1:    1     123      A hello my name is guy
# 2:    1     124      B       well hello guy

      

I guess the additional variables (besides passage

) in are by

not worth much. I think.


dplyr Analog



df %>% 
  group_by(page,passage,person) %>% 
  summarise(concat=paste0(text,collapse=" "))

# Source: local data frame [2 x 4]
# Groups: page, passage, person
# 
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 2    1     124      B       well hello guy

      


base R One way:

df$concat <- with(df,ave(text,passage,FUN=function(x)paste0(x,collapse=" ")))
unique(df[,which(names(df)%in%c("page","passage","person","concat"))])
#   page passage person               concat
# 1    1     123      A hello my name is guy
# 6    1     124      B       well hello guy

      

+7


source







All Articles