Chained data without iteration
I have a large set of From / To pairs representing a hierarchy of related nodes. For example, the hierarchy:
4 -- 5 -- 8
/
2 --- 6 - 9 -- 10
/ \
1 -- 11
\
3 ----7
encapsulated as:
{(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
I would like to be able to create a function that returns all the nodes above the specified node, for example:
nodes[2].us
> [4, 5, 6, 8, 9, 10, 11]
My actual node set is in the tens of thousands, so I would like to return a list of all upstream nodes very quickly without having to recurse through the entire set every time I want to get the upstream set.
This is my best attempt, but it doesn't go beyond two levels.
class Node:
def __init__(self, fr, to):
self.fr = fr
self.to = to
self.us = set()
def build_hierarchy(nodes):
for node in nodes.values():
if node.to in nodes:
nodes[node.to].us.add(node)
for node in nodes.values():
for us_node in node.us.copy():
node.us |= us_node.us
return nodes
from_to = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3), (1, 0)}
nodes = {fr: Node(fr, to) for fr, to in from_to} # node objects indexed by "from"
nodes = build_hierarchy(nodes)
print [node.fr for node in nodes[2].us]
> [4, 6, 5, 9]
source to share
I'll show you two ways to do this. First, we'll just change your attribute us
to intelligently compute and cache descendant search results. Second, we will use graphs library networkx
.
I would recommend that you go with a graph library if your data naturally has a graph structure. This way you will save a lot of hassle.
Caching us
property of nodes
You can make your attribute a us
property and cache the results of previous searches:
class Node(object):
def __init__(self):
self.name = None
self.parent = None
self.children = set()
self._upstream = set()
def __repr__(self):
return "Node({})".format(self.name)
@property
def upstream(self):
if self._upstream:
return self._upstream
else:
for child in self.children:
self._upstream.add(child)
self._upstream |= child.upstream
return self._upstream
Please note that I am using a slightly different view than you. I will create a graph:
import collections
edges = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
nodes = collections.defaultdict(lambda: Node())
for node, parent in edges:
nodes[node].name = node
nodes[parent].name = parent
nodes[node].parent = nodes[parent]
nodes[parent].children.add(nodes[node])
and I will search for the top nodes for node 2:
>>> nodes[2].upstream
{Node(5), Node(4), Node(11), Node(9), Node(6), Node(8), Node(10)}
After calculating nodes upstream of 2, they will not be recalculated if you call for example nodes[1].upstream
. If you make any changes to your chart, the ascending nodes will be wrong.
Using networkx
If we use networkx
to represent our graph, finding all descendants of a node is very simple:
>>> import networkx as nx
>>> from_to = [(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2),
(2, 1), (3, 1), (7, 3), (1, 0)]
>>> graph = nx.DiGraph(from_to).reverse()
>>> nx.descendants(graph, 2)
{4, 5, 6, 8, 9, 10, 11}
This does not fully answer your question, which seemed to be aimed at optimizing the search for descendants so that the work does not repeat itself on subsequent calls. However, as far as we know, it networkx.descendants
can do smart caching.
So this is what I suggest: avoid premature optimization and use libraries. If it's networkx.descendants
too slow, you can examine the code networkx
to see if it caches the search. If not, you can create your own cache lookup using more primitive functions networkx
. My bet is that it networkx.descendants
will work perfectly and you will not need to do any extra work.
source to share
Here's a function that will compute the entire upstream list for a single node:
def upstream_nodes(start_node):
result = []
current = start_node
while current.to: # current.to == 0 means we're at the root node
result.append(current.to)
current = nodes[current.to]
return result
You said that you don't want to iterate over the entire set of nodes every time you request an upstream, but that is not the case: it will just ask the parent and its parent all the way to the root. So if a node is four levels down, it will do four dictionary lookups.
Or, if you want to be really smart, here is a version that will do each parent search every time, and then store that search in the node object attribute .us
so you never have to compute the value again. (If the parent links of your nodes don't change after the graph is created, this will work - if you change your graph, of course it won't).
def caching_upstream_nodes(start_node, nodes):
# start_node is the Node object whose upstream set you want
# nodes is the dictionary you created mapping ints to Node objects
if start_node.us:
# We already calculated this once, no need to re-calculate
return start_node.us
parent = nodes.get(start_node.to)
if parent is None:
# We're at the root node
start_node.us = set()
return start_node.us
# Otherwise, our upstream is our parent upstream, plus the parent
parent_upstream = caching_upstream_nodes(parent, nodes)
start_node.us = parent_upstream.copy()
start_node.us.add(start_node.to)
return start_node.us
One of these two features should be what you are looking for. (NOTE: Use a little caution when doing these exercises, as I just wrote them and did not take the time to test them. I believe the algorithm is correct, but there is always the chance that I made a basic mistake in writing.)
source to share