Is there an algorithm for selecting multiple lines from a list so that the number of lines equals the number of different letters in them?

Edit2: I think David Eisenstat's solution works, but I'll check it before answering the question.

Example of a list of strings:

1.) "a"

2.) "ab"

3.) "bc"

4.) "dc"

5.) "efa"

6.) "ef"

7.) "gh"

8.) "hi"

You can choose number 1.) there is 1 line and 1 letter in it: "a" You can also select 1.) and 2.) these are 2 lines with two letters in them "a" and "b"

other valid string combinations:

1.) 2.) 3.)

1.) 5.) 6.)

there is no correct combination with "h" (it would be ideal if such cases could be proved, but you can assume that the program should only work if there is a valid answer)

There may be an additional condition that the strings you select must contain one specified letter, however, simply detecting all possible combinations will also solve the problem. eg. the indicated letter "c" the only solution in this case would be: 1.) 2.) 3)

[more info] Purpose of this : I want to create a program that can choose from a large list of equations (probably about 100) that can be used to solve for a variable. Each equation gets one line, each letter in the line representing one unknown. The list of equations is different, for example. cannot be derived from each other, so you need as many equations as there are unknowns in them. The solution for unknowns will be done in CAS, so you don't need to worry about that. However, I believe that CAS (Maxima) may have a limit on the number of equations it can solve at the same time, and may be too slow if you give too many unnecessary equations at once.

As a start, I would use an algorithm to reduce the number of lines to speed it up. At first, all strings containing the specified letter are listed in the abbreviated list, then all strings containing letters from the strings in the given list are part of the abbreviated list until none are added. for example, the abbreviated list "g" would be 7.) "gh" and 8.) "hi" This will only remove unnecessary lines, but the task remains the same with the others.

I think this can be solved by removing unnecessary lines from the above list until all others are needed, however I do not know how to explicitly determine which lines will be unnecessary (other than those mentioned in the previous paragraph).

If you are working with an additional condition: This is an optimization problem. I don't need a perfect solution, only an optimal solution . The program does not need to find the absolute minimum number of lines that yield a solution. Having a few extra lines in the solution will probably only slow down the computer, but it will be acceptable.

Edit: optional clarification about the meaning of the strings: Each letter in the string represents an unknown in the equation, so the equation a = 2 would be represented by "a" because the only unknown, Equation a + b = 0 would be represented by "ab" and b ^ 2- c = 0 to "bc"

+3


source to share


2 answers


I'm not sure how to call this problem. Seems to be NP-hard, so I'm going to suggest an integer programming formulation that a ready-made solver might attack.

Let be a x_i

0-1 variable indicating whether an equation is included in the output i

. Let be a y_j

0-1 variable indicating whether the variable is included j

in the output. We have limitations

for all equations i, for all variables j in equation i, y_j - x_i >= 0.

      

We need as many equations as there are variables in the output.

(sum over all equations i of x_i) - (sum over all variables j of y_j) = 0

      



As you point out, empty set must be specifically disabled. Let be the k

variable that should appear in the output.

sum over all equations i containing variable k of x_i >= 1

      

Naturally the goal

minimize sum over all equations i of x_i.

      

0


source


I'm not sure if thinking about this in terms of strings is helpful. Instead, it is an elementary problem in graph theory, namely detecting connected graph components .

This can be done as follows. Define each vertex in your graph as one of the solution variables. Then the given variable xi will have a direct relationship with the variable xj if both appear in one or more of your equations. That is, if you have the equation "F (x1, x2, ..., xi) = 0", then this sets the graph graphs between all combinations of pairs of vertices x1-xi. (Strictly speaking, to detect connected components, it is enough to establish links between adjacent pairs of variables, that is, a link x1 to x2 to x3 ... to x [i-1] to xi.) Sample data structures for such a data model can be found here: Unspecified presentation of graphs ; or here: 4.1 Indirect graphs .

By plotting the relationship between the variables in your equation, you can easily detect the related parts using the elementary graph coloring algorithm :



  • Start at the first vertex x0 of the graph.

  • Mark the top as visited.

  • Recursively for each connected vertex ...

  • Visit it and mark it as having visited, but if you encounter a vertex already marked as visited, stop the recursion.

The set of vertices marked as visited in this way form a connected component of the graph. For any irregularities noted, revert the process to detect any remaining connected components.

This algorithm is linear in the number of vertices and edges.

0


source







All Articles