Chained data without iteration

I have a large set of From / To pairs representing a hierarchy of related nodes. For example, the hierarchy:

     4 -- 5 -- 8
    / 
   2 --- 6 - 9 -- 10
  /           \ 
 1              -- 11
  \
   3 ----7

      

encapsulated as:

{(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}

      

I would like to be able to create a function that returns all the nodes above the specified node, for example:

nodes[2].us
> [4, 5, 6, 8, 9, 10, 11]

      

My actual node set is in the tens of thousands, so I would like to return a list of all upstream nodes very quickly without having to recurse through the entire set every time I want to get the upstream set.

This is my best attempt, but it doesn't go beyond two levels.

class Node:
    def __init__(self, fr, to):
        self.fr = fr
        self.to = to
        self.us = set()

def build_hierarchy(nodes):
    for node in nodes.values():
        if node.to in nodes:
            nodes[node.to].us.add(node)
    for node in nodes.values():
        for us_node in node.us.copy():
            node.us |= us_node.us
    return nodes

from_to = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3), (1, 0)}
nodes = {fr: Node(fr, to) for fr, to in from_to} # node objects indexed by "from"
nodes = build_hierarchy(nodes)

print [node.fr for node in nodes[2].us]
> [4, 6, 5, 9]

      

+3


source to share


2 answers


I'll show you two ways to do this. First, we'll just change your attribute us

to intelligently compute and cache descendant search results. Second, we will use graphs library networkx

.

I would recommend that you go with a graph library if your data naturally has a graph structure. This way you will save a lot of hassle.

Caching us

property of nodes

You can make your attribute a us

property and cache the results of previous searches:

class Node(object):

    def __init__(self):
        self.name = None
        self.parent = None
        self.children = set()
        self._upstream = set()

    def __repr__(self):
        return "Node({})".format(self.name)

    @property
    def upstream(self):
        if self._upstream:
            return self._upstream
        else:
            for child in self.children:
                self._upstream.add(child)
                self._upstream |= child.upstream
            return self._upstream

      

Please note that I am using a slightly different view than you. I will create a graph:

import collections

edges = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
nodes = collections.defaultdict(lambda: Node())

for node, parent in edges:
    nodes[node].name = node
    nodes[parent].name = parent
    nodes[node].parent = nodes[parent]
    nodes[parent].children.add(nodes[node])

      

and I will search for the top nodes for node 2:



>>> nodes[2].upstream
{Node(5), Node(4), Node(11), Node(9), Node(6), Node(8), Node(10)}

      

After calculating nodes upstream of 2, they will not be recalculated if you call for example nodes[1].upstream

. If you make any changes to your chart, the ascending nodes will be wrong.

Using networkx

If we use networkx

to represent our graph, finding all descendants of a node is very simple:

>>> import networkx as nx
>>> from_to = [(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), 
               (2, 1), (3, 1), (7, 3), (1, 0)]
>>> graph = nx.DiGraph(from_to).reverse()
>>> nx.descendants(graph, 2)
{4, 5, 6, 8, 9, 10, 11}

      

This does not fully answer your question, which seemed to be aimed at optimizing the search for descendants so that the work does not repeat itself on subsequent calls. However, as far as we know, it networkx.descendants

can do smart caching.

So this is what I suggest: avoid premature optimization and use libraries. If it's networkx.descendants

too slow, you can examine the code networkx

to see if it caches the search. If not, you can create your own cache lookup using more primitive functions networkx

. My bet is that it networkx.descendants

will work perfectly and you will not need to do any extra work.

+1


source


Here's a function that will compute the entire upstream list for a single node:

def upstream_nodes(start_node):
    result = []
    current = start_node
    while current.to:  # current.to == 0 means we're at the root node
        result.append(current.to)
        current = nodes[current.to]
    return result

      

You said that you don't want to iterate over the entire set of nodes every time you request an upstream, but that is not the case: it will just ask the parent and its parent all the way to the root. So if a node is four levels down, it will do four dictionary lookups.



Or, if you want to be really smart, here is a version that will do each parent search every time, and then store that search in the node object attribute .us

so you never have to compute the value again. (If the parent links of your nodes don't change after the graph is created, this will work - if you change your graph, of course it won't).

def caching_upstream_nodes(start_node, nodes):
    # start_node is the Node object whose upstream set you want
    # nodes is the dictionary you created mapping ints to Node objects
    if start_node.us:
        # We already calculated this once, no need to re-calculate
        return start_node.us
    parent = nodes.get(start_node.to)
    if parent is None:
        # We're at the root node
        start_node.us = set()
        return start_node.us
    # Otherwise, our upstream is our parent upstream, plus the parent
    parent_upstream = caching_upstream_nodes(parent, nodes)
    start_node.us = parent_upstream.copy()
    start_node.us.add(start_node.to)
    return start_node.us

      

One of these two features should be what you are looking for. (NOTE: Use a little caution when doing these exercises, as I just wrote them and did not take the time to test them. I believe the algorithm is correct, but there is always the chance that I made a basic mistake in writing.)

+1


source







All Articles