Filtering neighboring points from the list

I half-answered a question about finding clusters of mass in a bitmap. I say half-answered because I left it in a state where I had all the points in the raster sorted by mass, and I left it to the reader to filter the list down by removing points that belong to the same cluster.

Then, when I actually thought about that step, I found that the solution didn't jump out at me the way I assumed it would. So now I'm asking you folks for help. We have a list of points with their masses (a list of Python tuples, but you can represent it however you see fit in any language):

[ (6, 2, 6.1580555555555554),
  (2, 1, 5.4861111111111107),
  (1, 1, 4.6736111111111107),
  (1, 4, 4.5938888888888885),
  (2, 0, 4.54),
  (1, 5, 4.4480555555555554),
  (4, 7, 4.4480555555555554),
  (5, 7, 4.4059637188208614),
  (4, 8, 4.3659637188208613),
  (1, 0, 4.3611111111111107),
  (5, 8, 4.3342191043083904),
  (5, 2, 4.119574829931973),
  ...
  (8, 8, 0.27611111111111108),
  (0, 8, 0.24138888888888888) ]


Each tuple has the form:

(x, y, mass)


Note that the list is sorted by mass, descending. If your solution prefers the points unsorted, that's fine too.

The task, if you remember, is to find the main clusters of mass. The number of clusters is not known, but the dimensions of the bitmap are. Sometimes several points in one cluster carry more mass than the center of the next (smaller) cluster. So what I want to do is walk down from the highest-mass points and remove the other points that sit in the same cluster (points nearby).

When I tried this, I ended up iterating over parts of the list again and again. I have a feeling I'm just being dense about it. How would you do it? Pseudocode or real code, either is fine. Of course, if you can simply pick up from where I left the Python code in that answer, it's easier for me to experiment with.
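To make it concrete, here is a minimal sketch of the greedy pass I have in mind; the radius threshold is an arbitrary placeholder I made up, and the scan over the kept list is exactly the part I suspect can be done better:

    # Walk the mass-sorted list; keep a point only if no already-kept
    # (heavier) point lies within `radius` pixels of it.
    def filter_neighbors(points, radius=2.0):
        kept = []
        r2 = radius * radius
        for x, y, mass in points:  # assumes points sorted by mass, descending
            if all((x - kx) ** 2 + (y - ky) ** 2 > r2 for kx, ky, _ in kept):
                kept.append((x, y, mass))
        return kept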

The next step is figuring out how many clusters are actually in the bitmap. I'm still struggling to define that problem, so I may come back with a separate question about it.

EDIT: I should clarify that I know there is no single "correct" answer to this question, and that the title of the question is the key part: this is the first phase of my clustering. I'm on the lookout for a fast, accurate-enough ("sufficient") method of filtering out neighboring points.

Let me know if you see how I can make the question clearer.



6 answers


As you recognize, you are asking for a solution to an ill-posed problem: there is no definitive answer. That's fine... it just makes it more fun. Your difficulty comes mainly from not knowing how many clusters you want. Clustering is one of the key problems in machine learning, and quite a few approaches have been developed over the years.

As Arachnid pointed out, the k-means algorithm tends to work well here and is fairly easy to implement. The results depend critically on the initial guess and on the desired number of clusters. To overcome the initial-guess problem, run the algorithm many times with random initializations and pick the best result. You will have to define what "best" means; one measure would be the mean squared distance from each point to its cluster center. If you also want to guess the number of clusters automatically, run the algorithm with a whole range of cluster counts. For any good "best" measure, more clusters will always look better than fewer, so you need a way to penalize having too many. The MDL discussion on Wikipedia is a good starting point.
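As a rough illustration of that restart loop (a sketch only, assuming scikit-learn is available; the cluster-count range and the use of mass as a sample weight are my own choices here):

    import numpy as np
    from sklearn.cluster import KMeans

    # Run k-means for a range of cluster counts, with several random
    # restarts each (n_init), scoring by within-cluster sum of squares.
    def kmeans_sweep(points, max_k=6):
        xy = np.array([(x, y) for x, y, _ in points], dtype=float)
        mass = np.array([m for _, _, m in points])
        results = []
        for k in range(1, max_k + 1):
            km = KMeans(n_clusters=k, n_init=10).fit(xy, sample_weight=mass)
            results.append((k, km.inertia_, km.cluster_centers_))
        # inertia_ always improves as k grows; penalize k (e.g. MDL/BIC)
        return results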

K-means clustering is basically the simplest mixture model. Sometimes it is worth stepping up to a mixture of Gaussians learned by expectation maximization (described in the link just given), which can be more robust than k-means. It takes a little more effort to understand, but once you do, it's not much harder than k-means to implement.
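A minimal sketch with scikit-learn's GaussianMixture, which also exposes BIC as one ready-made penalty for extra clusters. GaussianMixture takes no per-sample weights, so folding mass in by replicating points is my own crude workaround:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Fit Gaussian mixtures by EM for several cluster counts and keep the
    # model with the lowest BIC (which penalizes extra components).
    def best_gmm(points, max_k=6):
        xy = np.array([(x, y) for x, y, _ in points], dtype=float)
        reps = np.maximum(1, np.round([m for _, _, m in points]).astype(int))
        xy = np.repeat(xy, reps, axis=0)  # crude mass weighting
        models = [GaussianMixture(n_components=k).fit(xy)
                  for k in range(1, max_k + 1)]
        return min(models, key=lambda g: g.bic(xy))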



There are many other clustering techniques, such as agglomerative clustering and spectral clustering. Agglomerative clustering is fairly easy to implement, but deciding when to stop building clusters can be tricky. If you go the agglomerative route, you will probably want to look at kd-trees for faster nearest-neighbor searches. smacl's answer describes a slightly different way of doing agglomerative clustering, using a Voronoi diagram.
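For illustration, a sketch of single-linkage agglomerative clustering via SciPy, which handles the nearest-pair bookkeeping internally; the distance cutoff that stops the merging is a made-up placeholder:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Merge points bottom-up, then cut the merge tree at a distance
    # threshold to decide when cluster-building stops.
    def agglomerate(points, cutoff=3.0):
        xy = np.array([(x, y) for x, y, _ in points], dtype=float)
        merge_tree = linkage(xy, method='single')
        return fcluster(merge_tree, t=cutoff, criterion='distance')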

There are models that can pick the number of clusters for you automatically, for example ones based on Latent Dirichlet Allocation, but they are much harder to get right.

You can also look at the mean-shift algorithm to see if it comes close to what you really want.
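Mean-shift picks the number of clusters on its own, driven by a kernel bandwidth. A sketch using scikit-learn (estimate_bandwidth is just one way to choose that knob):

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth

    # Modes of the estimated density become cluster centers; no cluster
    # count is specified up front.
    def mean_shift(points):
        xy = np.array([(x, y) for x, y, _ in points], dtype=float)
        ms = MeanShift(bandwidth=estimate_bandwidth(xy)).fit(xy)
        return ms.cluster_centers_, ms.labels_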

+5




It sounds to me like you are looking for the k-means algorithm.



+4




As I mentioned in a comment on your question, the answer depends on whether mass can be considered a scalar in this context. If so, color-based solutions probably won't work, as color is often not perceived as being scalar.

For example, if I have a given area with one point of high mass, is that the same as the same area with ten points of 1/10 the mass? If so, mass is not scalar in this context, and I would look at algorithms used to spatially group similar non-scalar values, e.g. Voronoi diagrams.

[Figure: example of a Voronoi diagram]

In this case, where two adjacent Voronoi regions are close enough in both mass and distance, they can be grouped into one cluster. You can repeat this to find all the clusters.
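A sketch of that idea with SciPy's Voronoi: ridge_points gives exactly the pairs of cells that share an edge, and the mass tolerance for merging is an arbitrary placeholder:

    import numpy as np
    from scipy.spatial import Voronoi

    # Find pairs of adjacent Voronoi cells whose masses are close enough
    # to be merge candidates; union-find over these pairs yields clusters.
    def merge_candidates(points, mass_tol=0.5):
        xy = np.array([(x, y) for x, y, _ in points], dtype=float)
        mass = np.array([m for _, _, m in points])
        vor = Voronoi(xy)
        return [(i, j) for i, j in vor.ridge_points
                if abs(mass[i] - mass[j]) < mass_tol]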

If, on the other hand, your mass is scalar, so that the mass at an unsampled location can be interpolated from surrounding points, I would tend to triangulate and contour the input, and use the areas between contours to find clusters of similar mass.

+3




This sounds like color quantization, where you reduce the number of colors in an image. One way is to plot the colors in some color space, and then combine each cluster into the cluster's center (or weighted average).
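The collapsing step is the easy part; here is a sketch of reducing one group of points to its mass-weighted center:

    # Replace a group of points with its mass-weighted average position,
    # the same move quantizers make when collapsing a cell of color space.
    def weighted_center(cluster):
        total = sum(m for _, _, m in cluster)
        cx = sum(x * m for x, _, m in cluster) / total
        cy = sum(y * m for _, y, m in cluster) / total
        return (cx, cy, total)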

The exact name of the algorithm this reminded me of escapes me, and I will edit this answer if it comes back to me, but in the meantime you should look at color quantization and see whether some of those algorithms are useful.

+1




Start with the convex hull problem. You are, in effect, looking for "convex hull"-like clusters.

Note that "clusters" are undefined. You have an average mass in your field. Some items are above average and some are below average. How much above average means you've found a cluster? How far apart should the nodes be part of a cluster or a separate cluster?

What's the difference between two mountain peaks and a ridge?

You have to compute a "topography": the union of all points of equal density into regions. This requires that you pick a seed location and work outward from that point radially, locating the positions where the densities are equal. You can then connect those positions into regions.

If you've chosen your starting points well, the regions will nest. Choosing the starting points is easy: you start at the local maxima.
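A sketch of that local-maxima starting step, rebuilding the grid from the point list (the grid dimensions are known, as the question says; SciPy's maximum_filter is my choice of neighborhood test):

    import numpy as np
    from scipy.ndimage import maximum_filter

    # A cell is a peak if it holds the maximum of its 3x3 neighborhood.
    def local_peaks(points, width, height):
        grid = np.zeros((height, width))
        for x, y, mass in points:
            grid[y, x] = mass
        peaks = (grid == maximum_filter(grid, size=3)) & (grid > 0)
        return list(zip(*np.nonzero(peaks)))  # (row, col) per peak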

+1




Since you're already talking about mass, why not solve it with gravity? A simple particle system wouldn't have to be super precise, and you wouldn't have to run it for long before you could make a much better guess at the number of clusters.
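A toy version of that particle system (the step size, iteration count, and merge radius are all made-up placeholders):

    import numpy as np

    # Let every point drift toward the net gravitational pull of the
    # others, then count how many distinct locations the points fall into.
    def gravity_collapse(points, steps=50, dt=0.1, merge_radius=0.5):
        pos = np.array([(x, y) for x, y, _ in points], dtype=float)
        mass = np.array([m for _, _, m in points])
        for _ in range(steps):
            diff = pos[None, :, :] - pos[:, None, :]   # offsets i -> j
            dist = np.linalg.norm(diff, axis=2) + 1e-9
            w = mass[None, :] / dist ** 3              # m_j/d^2 times unit direction
            np.fill_diagonal(w, 0.0)                   # no self-attraction
            pos += dt * (w[:, :, None] * diff).sum(axis=1)
        cells = {tuple(c) for c in np.round(pos / merge_radius).astype(int)}
        return len(cells), pos  # cluster-count guess, final positions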

Once you have a better idea of the number of clusters, k-means or nearest-neighbor approaches become feasible.

+1








