R: merge text documents by index

Question

I have a dataframe that looks like this:

        | id               | text
    1   | 7821             | "some text here"
    2   | 7821             | "here as well"
    3   | 7821             | "and here"
    4   | 567              | "etcetera"
    5   | 567              | "more text"
    6   | 231              | "other text"

I would like to merge the texts by ID so that I can run a clustering algorithm on the result:

        | id               | text
    1   | 7821             | "some text here here as well and here"
    2   | 567              | "etcetera more text"
    3   | 231              | "other text"

Is there a way to do this? I am importing from a database table and have a lot of data, so I cannot do it manually.

r text-mining

d12n 28 jan. 13 at 16:37

1 answer

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2013-01-28T16:43:56+0000

You are actually looking for aggregate, not merge, and there should be many examples on SO demonstrating different aggregation options. Here's the simplest and most direct approach, using the formula interface to tell aggregate which column to combine (text) and which column to group by (id).

Here's your data in copy-and-paste form:

mydata <- structure(list(id = c(7821L, 7821L, 7821L, 567L, 567L, 231L), 
    text = structure(c(6L, 3L, 1L, 2L, 4L, 5L), .Label = c("and here", 
    "etcetera", "here as well", "more text", "other text", "some text here"
    ), class = "factor")), .Names = c("id", "text"), class = "data.frame", 
    row.names = c(NA, -6L))

Here's the aggregated result:

aggregate(text ~ id, mydata, paste, collapse = " ")
#     id                                 text
# 1  231                           other text
# 2  567                   etcetera more text
# 3 7821 some text here here as well and here
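
Note that the output above comes back sorted by id, while the desired result in the question lists the IDs in order of first appearance. If that order matters, one possible way to restore it is sketched below. This is only an illustration: the data frame is rebuilt here with a plain character column (an assumption, so that the snippet is self-contained), and the name merged is made up for the example.

```r
# Same data as above, rebuilt with a character column instead of a factor
mydata <- data.frame(
  id = c(7821L, 7821L, 7821L, 567L, 567L, 231L),
  text = c("some text here", "here as well", "and here",
           "etcetera", "more text", "other text"),
  stringsAsFactors = FALSE
)

# Paste all texts that share an id into a single string
merged <- aggregate(text ~ id, data = mydata, FUN = paste, collapse = " ")

# aggregate() returns the groups sorted by id; reorder the rows so that
# they follow the first appearance of each id in the input
merged <- merged[match(unique(mydata$id), merged$id), ]
rownames(merged) <- NULL

merged
```

The match() call looks up each id, in input order, in the aggregated result and uses those positions to reorder the rows.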

Of course, there is also data.table, which has a nice compact syntax (and amazing speed):

library(data.table)
DT <- data.table(mydata)
DT[, paste(text, collapse = " "), by = "id"]
#      id                                   V1
# 1: 7821 some text here here as well and here
# 2:  567                   etcetera more text
# 3:  231                           other text
