Algorithm for dividing a set of rows into a minimal set of mutually exclusive groups of approximately the same size

I have a large set of lines. I want to split strings into subsets so that:

  • Each item in the subset shares 1 or more contiguous characters.
  • Common contiguous characters that define a subset are unique to the set of subsets (that is, the common characters are sufficient to define a subset of strings that are mutually exclusive with other subsets).
  • The subsets are roughly the same size.
  • The resulting set of subsets is the minimum number of subsets that meet the above criteria.

For example, given the following set of names:

Alan, Larry, Alfred, Barbara, Alphonse, Karl

I can split this set into two equal sized subsets. Subset 1 defined by adjacent "AL" characters will be

Alan, Alfred, Alphonse

Subset 2, defined by adjacent symbols ar, will be

Larry, Barbara, Karl.

I'm looking for an algorithm that will do this for any arbitrary set of rows. The resulting set of subsets should not be 2, but it should be minimal, and the resulting subsets should be approximately equal.

Elliot

+3


source to share


2 answers


Take a look at http://en.wikipedia.org/wiki/Suffix_array . It is possible that you really want to create a suffix array for each document, and they concatenate all the suffix arrays with pointers to the original versions, so that you can search the collection as one for a string, looking for it as a suffix in the array.



+2


source


It's complicated. I wonder if there is some higher purpose (like indexing words) or is this just an academic question?

It is not solvable at all unless you make the trivial solution of one set defined by empty order (which occurs in all words). For example, take the line: a

, ab

, b

.

  • a

    must go into the set defined a

    .
  • b

    must go into the set defined b

    .
  • ab

    must go to both, because it contains both subsequences.

Would there be a similar example with the type of words you are dealing with? I dont know. You may be dealing with word matching with more than one set, or you may have a linking system to determine where to put it.



Assuming this isn't a problem, the wheel-burrow conversion can help find good substrings.

Or how about something like:

  • Generate all subsequences in words.
  • Build an interference graph of subsequences, where an edge connects two subsequences if they both occur in the same word.
  • Graph color.
  • Choose a representative subsequence for each color.
  • Make a set defined by each representative subsequence. If all words of this color have this substring, put them all in this set.
  • Otherwise, remove this substring from the graph and repeat from step 3.

This algorithm is probably broken, but it might give you some ideas for a solution (or at least some insight into the trickery of your question;).

+2


source







All Articles