Fast cycle detection on an acyclic chart?
I have a program in Ruby 1.9.3 that creates a RubyTree . My data is best described as Directed Acyclic Graph (DAG); note that this is not a polygon. Well, at least the data should be a DAG, even though users are doing their best to frustrate my program with bad data.
I am creating a DAG dynamically by parsing an XML document. An XML document does not explicitly indicate a tree structure, but it provides cross-referenced integer identifiers that establish relationships between elements in the document.
I need to make sure the RubyTree does not contain loops. The original data may (erroneously) have a loop, and if so, my program needs to know about it and not introduce an infinite loop or crash. To accomplish this currently, I have mixed in the standard Ruby TSort library in the RubyTree class Tree::TreeNode
. This uses Tarjan's algorithm to perform a topological sort on the plot every time a node is added. During topological sorting, if a loop is found, an exception is thrown - exactly what I want.
For example:
module Tree
class TreeNode
include TSort
def tsort_each_node(&block)
self.each(&block)
end
def tsort_each_child(node, &block)
node.get_children().each { |child| yield child }
end
def add(child, at_index = -1)
#The standard RubyTree implementation of add goes here
begin
self.tsort()
rescue TSort::Cyclic => exce
self.remove!(child)
raise exce
end
return child
end
end
end
I had to change a few more methods. Basically all it takes to traverse the tree or children needed to implement TSort, or get rid of its move dependency (for example, I simplified Tree::TreeNode#to_s()
it to get it back Tree::TreeNode#name
.)
My program is now functionally correct. I have done significant testing and the results are working fine: all I have to do is salvage TSort::Cyclic
at the correct points in my code, and if I ever try to add a node that calls the loop, the node is removed and I can log the problem to the report. to deal with later (by correcting the original data).
The problem is that on a RubyTree of 75,000 or so, where the number of edges is very close to the number of vertices minus 1, iteratively running Tarjan's algorithm creates an algorithmic complexity that looks rather quadratic. Tarjan itself O(|V| + |E|)
, which in my case is about O(2*|V|)
, but every time I call add()
it |V|
increments by 1 when I create a graph node on node. I can't just call Tarjan at the end, because I might need to traverse the graph or parts of it during the read-compare-add loop, and any attempt to traverse might hang the program or collapse if there is actually a loop. (It goes without saying that my code is single threaded, and if it weren't, we'd have a huge problem. At its core, I rely on the fact thatadd()
never returns without throwing an exception if there is a loop, and even if there is loop, and the node is removed in such a way as to clean up the loop before returning add()
.)
But this is too slow! It takes more than half an hour for this loop alone, and my program consists of several other steps that take their fair share of time. But, since it's worth it, just doing Tarjan eats up the lion's share of performance, judging by the results ruby-perf
. I tried switching from arrays to linked lists in RubyTree each
, but only reduced the execution time by about 1% by removing the call loads Array#concat
.
I found this awesome article by Tarjan who came up with the tightly coupled component algorithm that Ruby relies on TSort
, and it seems that incremental loop detection is an active area of ββresearch. However, the level of dialogue in the document is much higher than my head, and I am sure that I lack the mathematical background to translate the results of my work into Ruby code. Not only that, but from reading the Notes section of the document, it seems that their best-effort algorithms have rather alarming worst time periods, so it might not even be faster than my current method, depending on the specifics of my data.
Am I missing something stupid here, or is it the best choice to put into mental work to analyze Tarjan's work and try to come up with a Ruby implementation of one of the algorithms? Note that I don't really care about the topological aspect of the sorting algorithm; this is a side effect of what I really want. If the tree weren't topologically sorted, but there were still no cycles guaranteed, I would be perfectly happy.
It's also worth noting that loops are somewhat rare in my original data. That is, loops can happen due to manual error in the data entry process, but they will never happen intentionally and must always be reported to the program so that he can tell me, so I can beat someone over the head with a billclub for entering incorrect data. ... Also, the program is absolutely obligated to keep smoking even if it detects a particularly nasty cycle, so I can't just stick my head in the sand and hope there won't be any cycles.
What's the problem?
At the request of some, here's a demo that you can run to see the problem at work.
Install the stable RubyTree version (I am using MRI 1.9.3). Then compare the output of the two programs:
Figure 1: Hangs permanently at 100% CPU usage on the main thread after printing "Third time"
require 'tree'
a = Tree::TreeNode.new('a', nil)
b = Tree::TreeNode.new('b', nil)
c = Tree::TreeNode.new('c', nil)
a.add(b)
a.add(c)
puts "First time"
b.add(c)
puts "Second time"
b.add(a)
puts "Third time"
c.add(b)
puts "Fourth time"
c.add(a)
puts "Fifth time"
puts "Done"
Figure 2: Goes all the way and prints "Done" and the result has no loop
Please note that I usually do things in blocks rescue
to log this cycle (s) and complain loudly about the person doing these cycles.
require 'tree'
require 'tsort'
module Tree
class TreeNode
include TSort
def tsort_each_node(&block)
self.each(&block)
end
def tsort_each_child(node, &block)
node.get_children().each { |child| yield child}
end
def to_s
name
end
def get_children()
return @children
end
def add(child, at_index = -1)
unless child
raise ArgumentError, "Attempting to add a nil node" # Only handles the immediate child scenario
end
if self.equal?(child)
raise TSort::Cyclic, "Cycle detected: [#{child.name}, #{child.name}]"
end
# Lazy man unique test, won't test if children of child are unique in this tree too.
if @children_hash.include?(child.name)
raise "Child #{child.name} already added!"
end
if insertion_range.include?(at_index)
@children.insert(at_index, child)
else
raise "Attempting to insert a child at a non-existent location (#{at_index}) when only positions from #{insertion_range.min} to #{insertion_range.max} exist."
end
@children_hash[child.name] = child
child.parent = self
#CYCLE DETECTION - raises TSort::Cyclic if this caused a cycle
begin
self.tsort()
rescue TSort::Cyclic => exce
self.remove!(child)
raise exce
end
return child
end
end
end
a = Tree::TreeNode.new('a', nil)
b = Tree::TreeNode.new('b', nil)
c = Tree::TreeNode.new('c', nil)
a.add(b)
a.add(c)
puts "First time"
b.add(c)
puts "Second time"
begin
b.add(a)
rescue
end
puts "Third time"
begin
c.add(b)
rescue
end
puts "Fourth time"
begin
c.add(a)
rescue
end
puts "Fifth time"
puts "Done"
The goal is to develop code that is functionally equivalent to Appendix 2, but scales better for more vertices (I don't expect more than 10 ^ 6 vertices, in which case I would be fine with it, taking a few minutes ("go to get a cup of coffee ") on a modern desktop workstation, but not hours or longer.)
source to share
The Plexus gem for Ruby seems to have solved the worst of my problems. I've tried GRATR before, but it won't load because it's incompatible with Ruby 1.9.3, but Plexus is a fork of the GRATR fork that works with 1.9.3.
My problem was that I was using a data structure (RubyTree) that was not designed to handle loops, but Plexus Digraph can actually keep working with loops. The API is designed with these in mind.
The solution I came across is quite simple: basically, now that my graph data structure doesn't hang in a loop, I can just call the Tarjan algorithm at the end of the plotting procedure - there is actually a nice wrapper in there acyclic?
, but it just calls topsort()
under hood and implements topological sorting using Tarjan's tightly coupled algorithm components like Ruby stdlib TSort
. However, it uses its own implementation instead TSort
. I'm not sure why.
Unfortunately, I am now stuck trying to develop an implementation of the minimal feedback arc problem (minimal FAS problem) that is NP-hard. A minimal FAS problem is required because I need to remove the least intrusive number of arcs in the graph to make it acyclic.
My plan now is to get a list of strongly related components from Plexus, which is an array of arrays; if any of the second-level arrays contains more than one element, then this array describes an element with a loop based on the definition of a strongly coupled component. Then I need (using a minimum of FAS or approximation) to remove edges and / or vertices to make an arithmetic plot, and iteratively run Tarjan until each SCC subarm is 1 in length.
I think brute force is probably the best approach to solving the minimum FAS: I don't need to be too smart because the number of nodes in any SCC in my dataset will almost never exceed, say, 5 or 6 Exponential by 5 or 6 fine. I seriously doubt that I will have an SCC set of many hundreds of nodes with dozens of different cycles; it will be an extremely pathological worst case, which I think will never happen. However, if this were the case, the execution time would be quite long.
Basically I need to try to remove the set of cardinality of the arcs of the graph, one subset at a time with the set of subsets sorted in ascending order by subset size, and "guess and check" if the graph is cyclic (Tarjan), then add edges back if that power gain does not fix the cycle.
If the number of edges and nodes is less than 20 or so, which is almost guaranteed, it shouldn't take significant execution time.
Removing the iterative Tarjan definitely solved my problem of complexity on the happy path (no loops or just a trivial loop), which is actually where it gave me the most heartburn - instead of plotting 25 minutes to plot it, it takes 15 seconds ...
Lesson learned: If your program is slow, it is probably because you are doing a lot of unnecessary work. In my case, unnecessary work was doing Tarjan topological sorting every time a new vertex was added to the plot, which was only required because of the implementation details of the library I originally chose to model my data.
source to share