Handling a very large distance matrix in C (or C++ if it might help)

I am implementing this clustering algorithm http://www.sciencemag.org/content/344/6191/1492.full in C in my software, and I need to compute a distance matrix. In some cases the dataset (after removing redundancy) is huge: n > 1,500,000, and up to 4,000,000 in more complex cases. My problem is that even the allocation of the upper triangular matrix alone would be ((1500000*1500000) - 1500000) * 0.5 * sizeof(float) =~ 4.5e12 bytes, so the allocation fails (even on our compute nodes with 256GB of RAM), and writing to disk is not an option in this case.
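For anyone who wants to reproduce the size estimate, here is a minimal sketch in C. The function names (`tri_count`, `tri_bytes`, `tri_index`) are made up for illustration and do not come from any library; `tri_index` assumes the upper triangle is packed row by row without the diagonal.

```c
#include <assert.h>
#include <stdint.h>

/* Number of entries in the strict upper triangle of an n x n matrix. */
static uint64_t tri_count(uint64_t n)
{
    return n * (n - 1) / 2;
}

/* Bytes needed to store those entries as float. */
static uint64_t tri_bytes(uint64_t n)
{
    return tri_count(n) * (uint64_t)sizeof(float);
}

/* Position of pair (i, j), with i < j < n, in a packed upper triangle
 * stored row by row: rows 0..i-1 come first, then the offset in row i. */
static uint64_t tri_index(uint64_t n, uint64_t i, uint64_t j)
{
    assert(i < j && j < n);
    return i * n - i * (i + 1) / 2 + (j - i - 1);
}
```

Calling `tri_bytes(1500000)` gives the number of bytes for the smaller dataset; the same call with 4,000,000 shows how quickly the requirement grows, since it scales with n squared.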

Aside from reducing the size of the dataset to cluster (which I'll be looking at), does anyone have an idea of a method I could use to approximate and store this amount of information?

NB: As I said in the title, I am using C and can also use C++. Also, if anyone knows another clustering algorithm (one where the number of clusters is determined by the algorithm itself), please suggest it.

Thanks in advance for your time,



1 answer


You may have to step back and revise your algorithm.

First, you may not need a matrix of distances between all pairs of data points. Perhaps you could group similar data points into bins and then create a matrix of distances between bins.

That is, start by calculating the pairwise distances between points, but keep only the relatively small distances, along with pointers to the "other" point. This gives a representation of a very sparse matrix of short distances. This can be done in parallel.
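A minimal sketch of that first pass in C might look like the following. It assumes Euclidean distance on points stored as `n` rows of `dim` floats, and it keeps squared distances to avoid the square root; the names `Edge`, `EdgeList`, `edge_push` and `short_edges` are hypothetical, not from any library.

```c
#include <assert.h>
#include <stdlib.h>

/* One retained short distance: both point indices and the squared distance. */
typedef struct {
    size_t i, j;
    float d2;
} Edge;

/* Growable array of retained edges; initialise with {0}. */
typedef struct {
    Edge *data;
    size_t len, cap;
} EdgeList;

/* Squared Euclidean distance between two points of dimension dim. */
static float dist2(const float *a, const float *b, size_t dim)
{
    float s = 0.0f;
    for (size_t k = 0; k < dim; k++) {
        float d = a[k] - b[k];
        s += d * d;
    }
    return s;
}

/* Append one edge, doubling the capacity when full. Returns 0 on success. */
static int edge_push(EdgeList *l, size_t i, size_t j, float d2)
{
    if (l->len == l->cap) {
        size_t cap = l->cap ? l->cap * 2 : 1024;
        Edge *p = realloc(l->data, cap * sizeof *p);
        if (!p)
            return -1;
        l->data = p;
        l->cap = cap;
    }
    l->data[l->len++] = (Edge){ i, j, d2 };
    return 0;
}

/* Keep only the pairs closer than cutoff. Returns 0 on success. */
static int short_edges(const float *pts, size_t n, size_t dim,
                       float cutoff, EdgeList *out)
{
    assert(pts && out);
    float c2 = cutoff * cutoff;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++) {
            float d2 = dist2(pts + i * dim, pts + j * dim, dim);
            if (d2 < c2 && edge_push(out, i, j, d2) != 0)
                return -1;
        }
    return 0;
}
```

Memory now grows with the number of retained pairs rather than with n squared, although the loop still evaluates every pair. Each iteration of the outer loop is independent of the others, so the rows could be split across threads with one `EdgeList` per thread.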



Then create bins that contain groups of points with mutually small distances between them. For example, if you choose the "short" distance threshold so that the bins hold on average, say, 50 data points, you would get 1,500,000 / 50 = 30,000 bins.
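One possible way to form such bins (an assumption on my part, not the only option) is a greedy single pass in which each point joins the first bin whose opening point lies within a chosen radius, or opens a new bin otherwise. The sketch below again assumes Euclidean distance and row-major float data; `assign_bins`, `bin_of` and `leader` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two points of dimension dim. */
static float dist2(const float *a, const float *b, size_t dim)
{
    float s = 0.0f;
    for (size_t k = 0; k < dim; k++) {
        float d = a[k] - b[k];
        s += d * d;
    }
    return s;
}

/*
 * Greedy single pass over the points. bin_of[i] receives the bin id of
 * point i, and leader[b] the index of the point that opened bin b.
 * Both arrays must have room for n entries. Returns the number of bins.
 */
static size_t assign_bins(const float *pts, size_t n, size_t dim,
                          float radius, size_t *bin_of, size_t *leader)
{
    assert(pts && bin_of && leader);
    size_t nbins = 0;
    float r2 = radius * radius;
    for (size_t i = 0; i < n; i++) {
        size_t b = 0;
        /* Look for the first bin whose leader is within radius. */
        while (b < nbins &&
               dist2(pts + i * dim, pts + leader[b] * dim, dim) > r2)
            b++;
        if (b == nbins)          /* no bin is close enough: open one */
            leader[nbins++] = i;
        bin_of[i] = b;
    }
    return nbins;
}
```

Every point ends up within `radius` of its bin's leader, so two points in the same bin are at most `2 * radius` apart. The radius plays the role of the "short" threshold and would need tuning to reach the desired average bin size; the result also depends on the order of the points.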

Then iterate over the data again and calculate the distances between the bins. This results in 30,000^2 distances, which is a matrix of about 4GB (3.6GB with 4-byte floats). You also still have 30,000 bins with 50^2 distances inside each, which is another 300MB. This amount of data is quite manageable.
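These two figures can be checked with a couple of helper functions. This is only a sketch: the function names are invented for illustration, and both matrices are counted as full squares of floats, as in the estimate above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes for a full nbins x nbins float matrix of bin-to-bin distances. */
static uint64_t between_bins_bytes(uint64_t nbins)
{
    assert(nbins < (1ULL << 31));   /* keep nbins * nbins * 4 in range */
    return nbins * nbins * (uint64_t)sizeof(float);
}

/* Bytes for one full per_bin x per_bin float matrix inside every bin. */
static uint64_t within_bins_bytes(uint64_t nbins, uint64_t per_bin)
{
    assert(nbins < (1ULL << 20) && per_bin < (1ULL << 20));
    return nbins * per_bin * per_bin * (uint64_t)sizeof(float);
}
```

With 4-byte floats, `between_bins_bytes(30000)` is 3,600,000,000 and `within_bins_bytes(30000, 50)` is 300,000,000. Storing only the upper triangle of each symmetric matrix would roughly halve both numbers.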

Whether replacing the distance between data points with the distance between the corresponding bins is accurate enough depends on the type of data you are dealing with and the accuracy requirements of your application.
