How do I change the suffix array to search for multiple strings?

Question

How do I change the suffix array to search for multiple strings?

I recently updated my knowledge of algorithms and read in suffix arrays. Each text I read defined them as an array of suffixes one search string at a time, but some articles have mentioned it "trivially" generalizing to an entire list of search strings, but I can't see how.

Suppose I am trying to implement a simple substring search over a list of words, and I want to return a list of words that match the given substring. A naive approach would seem to be to insert the lexicographic trailing "$" character between the words in my list, concatenate them all together, and create a suffix tree from the result. But this appears to be generating a large number of irrelevant entries. If I create the original banana $ muffin string, then I end up creating suffixes for ana $ muffin, which I will never use.

I would appreciate any hints on how to do this, or better yet, a pointer to some text of the algorithms that handle this case.

+3

list algorithm suffix-array

swestrup Jul 17 At 15:02

source to share

2 answers

Now, after reading most of Dan Gusfield's "String, Tree and Sequence Algorithms" book, the answer seems clear.

If you start with a tree of strings with multiple strings, one of the standard conversion algorithms will still work. However, instead of getting an array of integers, you get an array of lists. Each list contains one or more pairs of string identifiers and a starting offset in that string.

The resulting structure is still useful, but not as efficient as a regular suffix array.

0

swestrup 02 Feb 16 at 19:38

source to share

Nicolas · Accepted Answer · 2015-07-30T09:15:52+0000

In Suffix-Arrays, you usually don't use strings, only one string. It will be a concatenated version of multiple lines with some endtoken (different for each line). For suffix arrays, you use pointers (or array index) to refer to the suffix (only the position is required for the first marker / character). So the space required is an array + for each pointer suffix. (it's just a pretty simple implementation, you have to do more to get more performance).

In this case, you can optimize the sorting algorithm for suffixes, since you only need to sort those suffixes referenced by pointers to the end. Anything behind the endtoken doesn't need to be used in the sorting algorithm.

How do I change the suffix array to search for multiple strings?

More articles: