Restore original order based on set of incomplete ordered sets
Let's say my raw data is
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
It is corrupted and I have several incomplete sets where the order is valid but not all elements are present.
1, 4, 6, 7, 8, 11, 12 1, 2, 4, 5, 6, 9, 10, 12 2, 4, 7, 9, 10, 11 4, 7, 9, 12
and etc.
I also have a list of all original items without any ordering.
I need to recover as much of the original data as possible. I have no guarantee that I have enough information to recover everything. I need to figure out what I have and figure out which parts are reliable.
Complications may arise (but I will solve the problem without them first):
Incomplete set ordering is mostly valid but may have a few bugs here and there written by humans.
I may have additional information for each pair of elements in incomplete sets such as
- "there is nothing between 5 and 6",
- "of course there is something else between 7 and 12, but I'm not sure how much or what exactly",
- "there may or may not be anything between 3 and 4",
- "there is only one unknown item between 7 and 9"
I would like to incorporate this information into an algorithm in order to recover more data.
My best idea:
Use incomplete arrays in a sort function as follows: conclude that A> B if there is an incomplete set in which B precedes A. If there is no set in which both A and B are present, return A == B.
What I don't like about this is that I don't know which parts are fully rebuilt and random. To help me that I'm going to shuffle the original list of items, sort again and see which items change and which don't. And do this several thousand times (the number of items in the list is <50, so I can use the most bulldozing methods for this problem)
Any better suggestions?
source to share
Create a directed graph from your incomplete sets and do topological sort
Some errors can be found as loops (no loops in acyclic graph)
source to share
Create a directed graph as MBo suggests. If the resulting graph is a directed acyclic graph (DAG), that is considered a validation of the original data and can perform a topological sort to recover the original ordering information.
Some additional information may be included in the schedule if you believe that all of your information is reliable. For example, "there is nothing between 5 and 6" means (if you understand that 5 → 6) that each edge of the graph from v to 6 (v is not equal to 5) can be replaced by an edge from v to 5. "There may or may not be be nothing between 3 and 4 ": all of this tells us that it is 3 → 4, if he even talks so much.
Other information is more complex. The "something between 7 and 12" can be included in the digraph as 7 → 12, but the "something" part cannot, as far as I can tell. There might be a way to use it by increasing the graph to include "something" in the vertices, but I can't get it to work. Instead, I recommend getting your top-level algorithm to spit out each vertex (assuming there aren't many) and evaluating them by the number of additional constraints they fit into. As a bonus, you will find out how many answers are possible. You can also use it during your sorting, for example if you are looking for an item immediately after 7, do not select 12, but it is useless to me and you will not get any result at all if the information is inconsistent.
If the resulting graph is not a DAG, you can still split it into tightly coupled components (like the Tarjan algorithm). Strongly connected components are parts that are unreliable. The tightly coupled components will themselves constitute a DAG that can be approximated, but each component larger than 1 vertex will require additional special treatment. One way to deal with this is to try to find the minimum set of feedback arcs , i.e. The minimum number of edges to exclude in a tightly connected component to turn it into a DAG. The minimum arc feedback problem is NP-hard, but the problem is the "fixed parameter": http://dl.acm.org/citation.cfm?doid=1411509.1411511... Less sane approaches will probably work as well, like defining a loop and removing the random edge of the loop until there are no more loops.
source to share
I guess there might be some kind of algorithm if you take a long string and try to compare that string with all the other two neithebor characters (including start and end): if there is something in between them on other lines, add that to the long line and start again. If nothing is added, check the next pair.
source to share