Concatenate dataframe rows using parsing

I am trying to import a conversation with the following structure into a dataframe:

    data.frame(uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                           "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                           "01/08/2015 2:59:19 pm: Person 1: Same here"))


This structure makes it easy to parse the date, time, person and message. But in a few cases the message contains a newline, so the data frame ends up with rows that lack this structure, for example:

    conversation_errors <- data.frame(
        uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                    "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                    "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
                    "lend me your arms,",
                    "fast as thunderbolts,",
                    "for a pillow on my journey."))


How would you merge these instances? Is there a package for this that I am not aware of?

The desired function will simply recognize the missing structure and "merge" with the previous line so that I get:

    data.frame(uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                           "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                           "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms, fast as thunderbolts, for a pillow on my journey."))


Any thoughts?



2 answers

Assuming that you can reliably identify properly structured strings by the date at the start of the line (the pattern properDataRegex below), this will do it:

mydata <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
            "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
            "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
            "lend me your arms,",
            "fast as thunderbolts,",
            "for a pillow on my journey.",
            "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
            "but it will get the job done.")

properDataRegex <- "^\\d{2}/\\d{2}/\\d{4}\\s"
improperDataBool <- !grepl(properDataRegex, mydata)
while (sum(improperDataBool)) {
    # continuation lines whose preceding line is properly structured
    mergeWPrevIndex <- which(c(FALSE, !improperDataBool[-length(improperDataBool)]) & 
                             improperDataBool)
    mydata[mergeWPrevIndex - 1] <- 
        paste(mydata[mergeWPrevIndex - 1], mydata[mergeWPrevIndex])
    mydata <- mydata[-mergeWPrevIndex]
    improperDataBool <- !grepl(properDataRegex, mydata)
}
mydata

## [1] "01/08/2015 2:49:49 pm: Person 1: Hello"                                                                                                    
## [2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you"                                                                                         
## [3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."
## [4] "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method but it will get the job done."


Here mydata is a character vector, but it is trivial to put the result back into a data.frame as in the question, or to parse it with read.table() or read.fwf().
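
As a rough sketch of that parsing step, the merged strings could be split into columns with a regular expression instead of read.table(). The names merged, linePattern and conversation are hypothetical, and the pattern assumes every line looks exactly like the samples in the question (date, time with am/pm, person, message, separated by ": "):

```r
# Sample input: lines already merged so that each one starts with a date
merged <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
            "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
            "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms")

# Capture groups: date, time with am/pm, person, message (hypothetical pattern)
linePattern <- "^(\\d{2}/\\d{2}/\\d{4}) (\\d{1,2}:\\d{2}:\\d{2} [ap]m): ([^:]+): (.*)$"

# regexec() finds the match positions; regmatches() extracts the full match
# plus one element per capture group, giving one row per input line
parts <- do.call(rbind, regmatches(merged, regexec(linePattern, merged)))

# Column 1 is the full match, columns 2 to 5 are the capture groups
conversation <- data.frame(date    = parts[, 2],
                           time    = parts[, 3],
                           person  = parts[, 4],
                           message = parts[, 5],
                           stringsAsFactors = FALSE)
```

The columns are kept as character strings here because the question does not say whether 01/08/2015 is day/month or month/day. A line that does not match the pattern yields an empty match, so it is worth checking that nrow(conversation) equals length(merged).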




Here's an alternative approach:

read.table(text=paste(gsub("(^\\d{2}/\\d{2}/\\d{4}\\s)", "\n\\1", conversation_errors$uniquerow),
                      collapse = " "), sep = "\n", stringsAsFactors = FALSE)[,1]


Which gives:

[1] "01/08/2015 2:49:49 pm: Person 1: Hello "                                                                                                   
[2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you "                                                                                        
[3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."


(Thanks to Ken for the borrowed regex)
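
For comparison, here is a loop-free sketch of the same merge that does not go through read.table(). It is not taken from either answer: it reuses the same regex, numbers the lines with cumsum() so that each properly structured line starts a new group, and pastes each group together. The names group and merged are hypothetical:

```r
mydata <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
            "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
            "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
            "lend me your arms,",
            "fast as thunderbolts,",
            "for a pillow on my journey.",
            "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
            "but it will get the job done.")

properDataRegex <- "^\\d{2}/\\d{2}/\\d{4}\\s"

# TRUE counts as 1, so the running sum increases at every line that starts
# with a date; continuation lines keep the group number of the line above
group <- cumsum(grepl(properDataRegex, mydata))

# Paste the lines of each group into one string, separated by a space
merged <- unname(vapply(split(mydata, group), paste, character(1), collapse = " "))
```

With this input, merged has four elements, one per dated line. If the very first line were a continuation line, it would land in group 0 and stay as a separate element rather than being merged into anything.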


