R: merge text documents by index
I have a dataframe that looks like this:
  |   id | text
1 | 7821 | "some text here"
2 | 7821 | "here as well"
3 | 7821 | "and here"
4 | 567 | "etcetera"
5 | 567 | "more text"
6 | 231 | "other text"
And I would like to group the texts by IDs, so I can run the clustering algorithm:
  |   id | text
1 | 7821 | "some text here here as well and here"
2 | 567 | "etcetera more text"
3 | 231 | "other text"
Is there a way to do this? I am importing from a database table and I have a lot of data so I cannot do it manually.
You are actually looking for aggregate, not merge, and there should be many examples on SO demonstrating different options for aggregation. Here's the simplest and most direct approach, using the formula interface of aggregate to say which column to collapse (text) and which column to group by (id).
Here's your data in copy-and-paste form:
mydata <- structure(list(id = c(7821L, 7821L, 7821L, 567L, 567L, 231L),
text = structure(c(6L, 3L, 1L, 2L, 4L, 5L), .Label = c("and here",
"etcetera", "here as well", "more text", "other text", "some text here"
), class = "factor")), .Names = c("id", "text"), class = "data.frame",
row.names = c(NA, -6L))
Here's the aggregated result.
aggregate(text ~ id, mydata, paste, collapse = " ")
# id text
# 1 231 other text
# 2 567 etcetera more text
# 3 7821 some text here here as well and here
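Data imported from a database often arrives with text as character strings rather than factors. As a minimal sketch under that assumption, the same result can be obtained with the non-formula interface of aggregate. The name merged is only illustrative, and the sample data is re-created here so the snippet runs on its own.

```r
# Same sample data as above, but with text stored as plain character strings
mydata <- data.frame(
  id = c(7821L, 7821L, 7821L, 567L, 567L, 231L),
  text = c("some text here", "here as well", "and here",
           "etcetera", "more text", "other text"),
  stringsAsFactors = FALSE
)

# x holds the column(s) to collapse, by holds the grouping column(s);
# extra arguments such as collapse are passed on to paste
merged <- aggregate(x = mydata["text"], by = mydata["id"],
                    FUN = paste, collapse = " ")

print(merged)
```

Note that aggregate returns the groups sorted by id, not in the order in which the ids first appear in the data.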
Of course, there is also data.table, which has a nice compact syntax (and amazing speed):
> library(data.table)
> DT <- data.table(mydata)
> DT[, paste(text, collapse = " "), by = "id"]
id V1
1: 7821 some text here here as well and here
2: 567 etcetera more text
3: 231 other text
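The result column above gets the default name V1. As a small sketch, assuming the data.table package is installed, the collapsed column can be named directly by wrapping the expression in .(). The name merged_dt is only illustrative, and the sample data is re-created so the snippet runs on its own.

```r
library(data.table)

# Same sample data as in the question, with text as character strings
mydata <- data.frame(
  id = c(7821L, 7821L, 7821L, 567L, 567L, 231L),
  text = c("some text here", "here as well", "and here",
           "etcetera", "more text", "other text"),
  stringsAsFactors = FALSE
)

DT <- as.data.table(mydata)

# .() builds the result columns, so the collapsed text is named "text" instead of V1
merged_dt <- DT[, .(text = paste(text, collapse = " ")), by = id]

print(merged_dt)
```

Unlike aggregate, this keeps the groups in the order in which each id first appears.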