Heuristic for geographic dispersion

I am trying to implement a hill-climbing algorithm to decide which sites to choose from a set of candidates based on specific criteria. There are up to 5,000 sites to choose from.

One of those criteria is geographic variance, so I need to be able to assign any subset of my locations a value that represents the variance.

Each location has latitude and longitude data.

Speed is an issue, so I need a heuristic that estimates how dispersed a particular set of locations (i.e. a candidate solution) is.

I tried summing the pairwise distances between all the locations in my potential solution, but that turned out to be too slow.

Then I tried summing the distances from each location to the center of all locations in the potential solution. That turned out to be faster, but it doesn't work as well: this approach tends to favor solutions made up of several separate clusters.
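For concreteness, the two metrics tried above can be sketched as follows. This is a minimal illustration that treats latitude/longitude as planar coordinates; a real implementation would swap in haversine distance:

```python
import math
from itertools import combinations

def dist(a, b):
    # Planar approximation; use haversine for real lat/lon coordinates.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def pairwise_sum(points):
    # First attempt: sum over all pairs. O(n^2), too slow to evaluate
    # repeatedly inside a hill climber on thousands of sites.
    return sum(dist(a, b) for a, b in combinations(points, 2))

def centroid_sum(points):
    # Second attempt: sum of distances to the center. O(n), but a few
    # far-flung outliers inflate the score, so the optimizer tends to
    # produce several separate clusters.
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return sum(dist(p, (cx, cy)) for p in points)
```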

Any other suggestions are welcome.





2 answers


First, can you clarify what you mean by the pairwise sum? I ask because it sounds like you are forming all possible pairs, which will be very slow. If so, how about 1) finding nearest neighbors first, then 2) the longest path?

1) If I recall correctly, you can do this in less than O(n log n) time. 2) If the trees that are formed are disconnected, you also need to find the shortest distances between the trees. And because they are trees, this is not an NP-complete problem; a shortest-path algorithm is enough.
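One possible reading of this suggestion (an assumption on my part, not necessarily the construction the answer intends) is to score dispersion by the total length of a spanning structure over the points. A simple sketch using Prim's minimum-spanning-tree algorithm, O(n^2) here, though spatial indexing can do better:

```python
import math

def mst_length(points):
    # Sketch: total length of a minimum spanning tree over the points,
    # built with Prim's algorithm. A tight cluster gives a short tree;
    # dispersed points give a long one.
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest edge connecting each node to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.hypot(points[u][0] - points[v][0],
                               points[u][1] - points[v][1])
                best[v] = min(best[v], d)
    return total
```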



Failing that, and with a strong suspicion that I have misunderstood your problem: how about some kind of variance of the counts of locations falling into geographic bins, either spaced evenly between the extreme points or chosen with some preliminary heuristic?
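Assuming the binning idea above means counting how many locations fall into each cell of a grid spanning the extreme points, a sketch might look like this (the grid size `bins` is a hypothetical parameter):

```python
import statistics

def bin_count_variance(points, bins=10):
    # Variance of occupancy counts over a bins x bins grid spanning the
    # extreme points. Low variance = points spread evenly; high variance
    # = points clumped into few cells.
    min_x = min(p[0] for p in points); max_x = max(p[0] for p in points)
    min_y = min(p[1] for p in points); max_y = max(p[1] for p in points)
    counts = [0] * (bins * bins)
    for x, y in points:
        # Small epsilon keeps the max point inside the last cell.
        i = min(int((x - min_x) / (max_x - min_x + 1e-12) * bins), bins - 1)
        j = min(int((y - min_y) / (max_y - min_y + 1e-12) * bins), bins - 1)
        counts[i * bins + j] += 1
    return statistics.pvariance(counts)
```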

Can you define or refine the concept of variance?





Consider three situations:

  1. All of your nodes are in a dense cluster.
  2. All of your nodes are in a cluster, but the cluster is wide.
  3. Many of your nodes are in a cluster, but a few nodes are far from the cluster.

Your sum over all pairwise distances handles (1) and (2): tight clusters yield lower results than wide ones. How does it do for (3)? Here, a fraction e of the total number of nodes N is far away, at an average distance D. The remaining (1-e)*N nodes are clustered, at an average distance d.

Now, how many pairwise connections do you need to consider? For the clustered nodes there are ((1-e)*N)^2 = e^2*N^2 - 2*e*N^2 + N^2 such connections. For the remote nodes there are (e*N)^2 = e^2*N^2 connections.

Now weight these counts by the average distances. This gives an overall pairwise average of (d*(e^2*N^2 - 2*e*N^2 + N^2) + D*(e^2*N^2)) / N^2. If we now suppose that e is small, we can neglect the terms involving e^2. The average is then approximately d*(N^2 - 2*e*N^2)/N^2 = d*(1 - 2*e).



Now look at your second metric: the distance of every node from the midpoint. This works well for (1) and (2): tight clusters give lower results than wide ones. How does it do for (3)? Use the same e as above to represent the outliers. The average distance of the nodes from the center point is now (d*(1-e)*N + D*e*N)/N = d*(1-e) + D*e. In other words, the outliers now enter at order e rather than e^2, so the clustered nodes carry much less relative weight.

Is there a way to get a cheap calculation that still weights the clustered nodes appropriately? I think so.

My suggestion is that you pick random pairs of nodes from your list, calculate the distance between them, and average the results. For (1), dense clusters, all the nodes are close together, so any random pair you draw will be close to the pairwise average you would get from the full calculation. For (2), wide clusters, the same holds. For (3), a cluster with outliers, you are much more likely to draw both nodes from inside the cluster than to draw an outlier, so the outliers end up being largely ignored.

As the number of sampled pairs increases, the cluster will dominate the estimate. I expect this will give decent accuracy in O(N) time rather than O(N^2).
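A minimal sketch of this random-pair sampling heuristic, again treating coordinates as planar for simplicity (the `samples` budget is a tunable assumption; swap in haversine distance for real lat/lon):

```python
import math
import random

def sampled_dispersion(points, samples=200, rng=random):
    # Average the distance over `samples` randomly drawn pairs instead
    # of all O(N^2) pairs. Outlier pairs are rarely drawn, so the
    # estimate is dominated by the main cluster, as argued above.
    total = 0.0
    for _ in range(samples):
        (ax, ay), (bx, by) = rng.sample(points, 2)
        total += math.hypot(ax - bx, ay - by)
    return total / samples
```

Passing a seeded `random.Random` as `rng` makes the score reproducible across hill-climbing runs, which helps when comparing neighboring solutions.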









