Concatenate dataframe rows using parsing

Question

I am trying to import a conversation with the following structure into a dataframe:

conversation<-data.frame(
             uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                         "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                         "01/08/2015 2:59:19 pm: Person 1: Same here"))

This structure makes it easy to parse the date, time, person and message. But in a few cases the message contains a newline, so it is split across several rows and the data frame loses its structure, for example:

conversation_errors<-data.frame(
                     uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                                 "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                                 "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
                                 "lend me your arms,",
                                 "fast as thunderbolts,",
                                 "for a pillow on my journey."))

How would you merge these instances? Is there a package that I am not aware of?

The desired function would simply recognize the missing structure and merge the line with the previous one, so that I get:

conversation_fixed<-data.frame(
                    uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                                "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                                "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms, fast as thunderbolts, for a pillow on my journey."))

Any thoughts?

+3

string text r string-concatenation dataframe

eflores89 07 jul. At 4:49 am

2 answers

Here's an alternative approach:

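# Prefix each timestamped line with a newline, join everything with spaces,
# then let read.table() split the text back into one row per message.
read.table(text=paste(gsub("(^\\d{2}/\\d{2}/\\d{4}\\s)", "\n\\1", conversation_errors$uniquerow),
                      collapse = " "), sep = "\n", stringsAsFactors = FALSE)[,1]

Which gives: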

[1] "01/08/2015 2:49:49 pm: Person 1: Hello "                                                                                                   
[2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you "                                                                                        
[3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."

(Thanks Ken for the borrowed regex)

0

Jay 07 jul. '15 at 8:11

Ken Benoit · Accepted Answer · 2015-07-07T06:03:15+0000

Assuming that you can reliably identify properly structured strings by their leading timestamp (captured below in properDataRegex), this will do the job:

mydata <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
            "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
            "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
            "lend me your arms,",
            "fast as thunderbolts,",
            "for a pillow on my journey.",
            "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
            "but it will get the job done.")

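# Lines that start with a dd/dd/dddd timestamp are treated as well-formed.
properDataRegex <- "^\\d{2}/\\d{2}/\\d{4}\\s"
improperDataBool <- !grepl(properDataRegex, mydata)
# Repeat until no continuation lines remain. This assumes the first element
# is well-formed; a leading continuation line is not handled.
while (sum(improperDataBool)) {
    # Continuation lines whose predecessor is well-formed
    mergeWPrevIndex <- which(c(FALSE, !improperDataBool[-length(improperDataBool)]) & 
                             improperDataBool)
    # Append each of them to its predecessor, then drop it
    mydata[mergeWPrevIndex - 1] <- 
        paste(mydata[mergeWPrevIndex - 1], mydata[mergeWPrevIndex])
    mydata <- mydata[-mergeWPrevIndex]
    improperDataBool <- !grepl(properDataRegex, mydata)
}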

mydata
## [1] "01/08/2015 2:49:49 pm: Person 1: Hello"                                                                                                    
## [2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you"                                                                                         
## [3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."
## [4] "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method but it will get the job done."
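
The loop above merges one continuation line per message on each pass. As a rough single-pass sketch of the same idea (not part of the original answer), each line can be assigned to a message with cumsum() over the timestamp matches, and each group then pasted together. The helper name merge_continuations and the sample vector lines_raw are made up for illustration; the regex is the one from the answer.

```r
# Hypothetical helper (not from the answers): glue every line that lacks a
# leading timestamp onto the most recent line that has one.
merge_continuations <- function(x, pattern = "^\\d{2}/\\d{2}/\\d{4}\\s") {
  starts <- grepl(pattern, x)   # TRUE where a new message begins
  group  <- cumsum(starts)      # running message id for every line
  # Lines before the first timestamp fall into group 0 and are merged together.
  unname(vapply(split(x, group), paste, character(1), collapse = " "))
}

# Illustrative input only
lines_raw <- c("01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
               "lend me your arms,",
               "fast as thunderbolts,",
               "for a pillow on my journey.",
               "07/07/2015 3:29:00 pm: Person 1: One more line")
fixed <- merge_continuations(lines_raw)
print(fixed)
```

Like the loop, this joins lines with a single space, so a line that already ends in a space produces a double space in the merged message.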

Here mydata is a character vector, but of course it is now trivial to put the result into a data.frame as in the question, or to parse it with read.table() or read.fwf().
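
As one possible way to do that parsing step, here is a minimal sketch that splits merged lines into date, time, person and message columns with regexec() and regmatches() instead of read.table(). The pattern and the column names are assumptions based on the sample rows in the question ("date time am/pm: person: message"). The date is left as text, because the sample does not say whether 01/08/2015 is day/month or month/day.

```r
# Illustrative input: lines that already start with a timestamp
fixed <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
           "01/08/2015 2:51:49 pm: Person 2: Nice to meet you")

# Assumed layout: "<date> <time> <am|pm>: <person>: <message>"
pattern <- "^(\\d{2}/\\d{2}/\\d{4}) (\\d{1,2}:\\d{2}:\\d{2} [ap]m): ([^:]+): (.*)$"

# One character vector per line: the full match followed by the four groups
parts <- regmatches(fixed, regexec(pattern, fixed))

# Pick capture group i from every line (NA where the line did not match)
grab <- function(i) vapply(parts, function(p) p[i + 1], character(1))

parsed <- data.frame(date    = grab(1),
                     time    = grab(2),
                     person  = grab(3),
                     message = grab(4),
                     stringsAsFactors = FALSE)
print(parsed)
```

Because the person field is matched with [^:]+, colons inside the message itself (as in the haiku line) stay in the message column.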
