Finding the most tree hierarchy that explains the data

Consider the following data file:

      A  B  C
1    A1 B1 C1
2    A2 B2 C2
3    A3 B1 C1
4    A1 B1 C2
5    A2 B1 C1
6    A1 B4 C2

      

where A

, B

and C

represent the attributes. I hope to deduce the hierarchical structure most likely A

, B

and C

. By that I mean looking for an order {A,B,C}

that yields a hierarchy with the fewest nodes with multiple parents.

For example, consider one hierarchical possibility:

A->B->C

      

Note that it has multiple-parent nodes. To see this, notice what A1

happens with B1

and B4

in the combinations A1 B1 C1

and A1 B4 C1

. However, it A3

also happens with B1

in line 3

, with A3 B1 C1

.

In other words, just by focusing on this part of the graph, if we assume a hierarchy A->B->C

, we will have a node B1

with two parents:

                      enter image description here

So the question is being asked by an arbitrary dataframe like the one above, how can I find a hierarchical column ordering that gives the smallest number of multi-parent nodes?

Notes:

There are many variations on this problem, for example.

  • Finding the hierarchy with the fewest (additional) multithreaded edges
  • Finding the hierarchy with the least number of edges

A solution to any of these options would be great.

+3


source to share


1 answer


Here is an undirected graph with your dataframe. Edge (x, y) means that there is some row of data that mentions both x, y.

For example, the last line "A1, B4, C2" adds edges (A1, B4), (B4, C2), (A1, C2)

Now you can sort A, B, C according to your wishes.



Finding the hierarchy with the least number of (additional) multithreaded edges

We can fulfill all the conventions (this is pretty fast for N = 8..10 ) and find the cheapest (smallest, shortest) one. The boundary costs in such a tree (below) can be calculated from the above graph.

Mb might be some greedy approach, smth like "Pick the cheapest one at the current step", I'm not sure right now, but I'm sure this presentation of the problem is foreseen.

+2


source







All Articles