"Agglomerative" graph clustering based on node weight in network X?

I have a very large connected graph (millions of nodes). Each edge has a weight that determines the proximity of the connected nodes. I want to find "clusters" in the graph (many nodes that are very close to each other). For example, if the nodes were every city in the US, and the edges were the distance between cities, the clusters could be {Dallas, Houston, Fort Worth) and {New York, Bridgeport, Jersey City, Trenton}.

Clusters do not have to be the same size, and not all nodes need to be in a cluster. Instead, the clusters should have some average minimum weight, W which is (sum of weights in a cluster) / (number of edges in a cluster).

I am very comfortable with Python and NetworkX seems to be the standard tool for this

It seems like it wouldn't be too hard to program, though not particularly efficient. Is there a name for the algorithm I am describing? Is there an implementation in NetworkX already?


source to share

1 answer

I know of some graph splitting algorithms, the goal of which is to make all the pieces roughly the same size and the smallest edge cut as possible, but as you described you don't need such an algorithm. Anyway, I think your NP problem is complete, as are many other graph splitting problems. Maybe there are some algorithms that specifically fit your problem perfectly (and I think there are, but I don't know them), but I think you can still find good and acceptable solutions with a slight change to some of the algorithms originally intended to find the minimum edge with the same component size.

For example see this . i think you can use k-way layered partitioning with some modifications. For example, during the coarsening phase, you can use Light Edge Matching. Consider a situation where, in the coarsening phase, you mapped A and B into one group, and C and D into another group. the weight of the edge between these two groups is the sum of the edges of its members to each other, for example. W = Wac + Wad + Wbc + Wbd, where W is the edge weight, Wac is the edge weight between A and C, and so on. I also think that considering the Wac, Wad, Wbc and Wbd averages instead of their sum is also worth trying.

In my experience this algorithm is very fast and I'm not sure if you can find a pre-built library in python to make changes to it.



All Articles