String concatenation using the apply function in R
I have the following code, the purpose of which is to convert a sequence into tuples of three. It executes correctly, but is especially slow when applied to very large datasets (i.e. millions of rows).
I suspect the culprit is the for-loops over the vector (specifically the y loop) and feel that there must be a more efficient method using one of the apply functions. Unfortunately I am not too familiar with this approach, so I wanted to ask for some help (please!).
M.Order <- function(in.vector) {
  return.str <- vector()
  in.vector <- strsplit(in.vector, ' > ', fixed = TRUE)
  for (x in 1:length(in.vector)) {
    output <- NULL
    if (length(in.vector[[x]]) == 1) {
      output <- paste0(in.vector[[x]], '|NULL|NULL')
    } else if (length(in.vector[[x]]) == 2) {
      output <- paste(c(in.vector[[x]][1], in.vector[[x]][2], 'NULL'), collapse = '|')
    } else if (length(in.vector[[x]]) == 3) {
      output <- paste(in.vector[[x]], collapse = '|')
    } else for (y in 1:(length(in.vector[[x]]) - 2)) {
      output <- ifelse(length(output) == 0
        ,paste(in.vector[[x]][y:(y+2)], collapse = '|')
        ,paste0(output, ' > ', paste(in.vector[[x]][y:(y+2)], collapse = '|'))
      )
    }
    return.str[x] <- output
  }
  return (return.str)
}
orig.str <- rbind.data.frame(
'A > B > C > B > B > A > B > A > C',
'A > B',
'A > C > B',
'A',
'A > B > D > C')
colnames(orig.str) <- 'Original'
orig.str$Processed <- M.Order(as.character(orig.str$Original))
orig.str
which returns (correctly)
                           Original                                             Processed
1 A > B > C > B > B > A > B > A > C A|B|C > B|C|B > C|B|B > B|B|A > B|A|B > A|B|A > B|A|C
2                             A > B                                              A|B|NULL
3                         A > C > B                                                 A|C|B
4                                 A                                           A|NULL|NULL
5                     A > B > D > C                                         A|B|D > B|D|C
EDIT: I removed the rollapply function since it was slow and wrote my own function instead. Runtime on 327,680 rows:
- My code: 5.62 seconds
- Your code: 5.66 seconds
Thus, no significant difference.
First, split the strings on " > " and pad the vector with "NULL" if it contains fewer than three elements. Then concatenate each sliding group of three elements with "|" (originally done with rollapply, now with the groups function) and finally collapse those groups with " > ".
# sample data
df = data.frame(Original=c("A > B > C > B > B > A > B > A > C","A > B","A > C > B","A","A > B > D > C"),stringsAsFactors = FALSE)
for(i in 1:16) df=rbind(df,df)
groups <- function(x)
{
  result <- vector("character", length(x)-2)
  for(k in 1:(length(x)-2) )
  {
    result[k] = paste(x[k:(k+2)],collapse="|")
  }
  return(paste(result,collapse=" > "))
}
array1 = lapply(strsplit(df$Original," > "), function(x) {
  if (length(x) == 1) {
    c(x[1],"NULL","NULL")
  } else if (length(x) == 2) {
    c(x[1:2],"NULL")
  } else {
    x
  }
})
df$modified = lapply(array1,groups)
Output: (as a list for readability)
[[1]]
[1] "A|B|C > B|C|B > C|B|B > B|B|A > B|A|B > A|B|A > B|A|C"
[[2]]
[1] "A|B|NULL"
[[3]]
[1] "A|C|B"
[[4]]
[1] "A|NULL|NULL"
[[5]]
[1] "A|B|D > B|D|C"
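A possible variation, offered only as a sketch: since paste is vectorized, the inner loop can be replaced by pasting three shifted copies of the vector, and vapply returns a plain character vector instead of a list. The names groups_vec and pad3 are hypothetical helpers made up for this illustration, and I have not benchmarked this against the timings above.

```r
# Hypothetical helper: pad short vectors with the literal string "NULL".
pad3 <- function(x) {
  if (length(x) < 3) c(x, rep("NULL", 3 - length(x))) else x
}

# Hypothetical helper: build all windows of three with one vectorized paste.
# Assumes length(x) >= 3, which pad3 guarantees.
groups_vec <- function(x) {
  i <- seq_len(length(x) - 2)  # start index of each window
  paste(paste(x[i], x[i + 1], x[i + 2], sep = "|"), collapse = " > ")
}

original <- c("A > B > C > B", "A > B", "A")
parts <- lapply(strsplit(original, " > ", fixed = TRUE), pad3)

# vapply yields a character vector, so it can be assigned as an ordinary column.
modified <- vapply(parts, groups_vec, character(1))
print(modified)
```

Whether this is actually faster on millions of rows would need to be measured on the real data.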
Hope this helps!
The basic logic seems to be described by the following rules:
- Split lines by ' > '.
- For each line, starting at each position, concatenate the next 3 elements with '|'.
- Concatenate all resulting tuples with spaces.
Step 2 is the most difficult. It can be solved using the following generic function:
merge_tuples = function (str, len, sep) {
    start_positions = seq_len(max(length(str) - len + 1, 1))
    tuple_indices = lapply(start_positions, seq, length.out = len)
    lapply(tuple_indices, function (i) paste(str[i], collapse = sep))
}
This has been generalized to work with any tuple size (not just 3) and any separator (not just '|').
Example:
> merge_tuples(c('A', 'B', 'C'), 2, ':')
[[1]]
[1] "A:B"
[[2]]
[1] "B:C"
With this function, the rest is easily solved:
orig = c('A > B > C > B > B > A > B > A > C',
'A > B',
'A > C > B',
'A',
'A > B > D > C')
tuples = lapply(strsplit(orig, ' > '), merge_tuples, len = 3, sep = '|')
merged = sapply(tuples, paste, collapse = ' ')
This will output NA instead of NULL (as in your code) in places where there are not enough elements. I guess it doesn't really matter. If it does, replace the occurrences using gsub.
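For what it's worth, the gsub replacement could look roughly like the sketch below. It repeats merge_tuples so that it runs on its own, and it joins the tuples with ' > ' (as in the question's expected output) rather than with a space. It assumes that no real element is literally "NA"; if one could be, padding the split vectors with "NULL" before merging would be safer.

```r
# merge_tuples as defined in this answer, repeated so the block is self-contained.
merge_tuples <- function(str, len, sep) {
  start_positions <- seq_len(max(length(str) - len + 1, 1))
  tuple_indices <- lapply(start_positions, seq, length.out = len)
  lapply(tuple_indices, function(i) paste(str[i], collapse = sep))
}

orig <- c("A > B > D > C", "A > B", "A")
tuples <- lapply(strsplit(orig, " > ", fixed = TRUE), merge_tuples, len = 3, sep = "|")
merged <- sapply(tuples, paste, collapse = " > ")

# Assumption: "NA" only ever appears as padding, never as real data.
# \\b restricts the match to whole tokens, so letters inside longer names are untouched.
fixed <- gsub("\\bNA\\b", "NULL", merged)
print(fixed)
```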
Partial solution ...
The following function converts one string:
makes = function (S)
{
    L = strsplit(gsub(" > ", "", S), "")[[1]]
    m = outer(1:3, 0:(length(L) - 3), "+")
    m[] = L[m]
    paste(apply(m, 2, function(x) {
        paste0(x, collapse = "|")
    }), collapse = " > ")
}
It works by using outer to make a matrix of offsets and then using that to pick the elements out of the string, after the string has been stripped down to just the letters and split into a vector (so it assumes every element is a single character). Then it is just a matter of pasting it all together:
> makes(orig.str$Original[1])
[1] "A|B|C > B|C|B > C|B|B > B|B|A > B|A|B > A|B|A > B|A|C"
It makes a hash of inputs with fewer than 3 elements, though:
> makes(orig.str$Original[2])
[1] "A|B|NA > A|B|A"
Warning message:
In m[] = L[m] :
number of items to replace is not a multiple of replacement length
> makes(orig.str$Original[3])
[1] "A|C|B"
> makes(orig.str$Original[4])
Error in L[m] : only 0 may be mixed with negative subscripts
> makes(orig.str$Original[5])
[1] "A|B|D > B|D|C"
It might be worthwhile to detect these edge cases explicitly (a check for length(L) < 3 should do it) and handle them separately.
Then apply the function to each row of the data frame.
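One way the edge-case handling might look, as a rough sketch rather than a finished solution: pad short inputs with "NULL" before building the offset matrix. The name makes_padded is made up for this illustration. Unlike makes above, it splits on " > " instead of on single characters, which is an assumption on my part about what the elements look like.

```r
# Hypothetical variant of makes() that pads short inputs before windowing.
makes_padded <- function(S) {
  L <- strsplit(S, " > ", fixed = TRUE)[[1]]
  # Pad with the literal string "NULL" so there are always at least 3 elements.
  if (length(L) < 3) L <- c(L, rep("NULL", 3 - length(L)))
  # Column j of m holds the indices j, j + 1, j + 2.
  m <- outer(1:3, 0:(length(L) - 3), "+")
  # Replace each index by the element it points to, keeping the matrix shape.
  m[] <- L[m]
  paste(apply(m, 2, paste, collapse = "|"), collapse = " > ")
}

orig <- c("A > B > C > B", "A > B", "A")
# Apply to every string; vapply returns a character vector usable as a column.
processed <- vapply(orig, makes_padded, character(1), USE.NAMES = FALSE)
print(processed)
```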