Sorting structures in order of least change

My original wording turned out to be incomprehensible, so I will rephrase.

Is there an algorithm or approach that will allow an array of the following structures to be ordered in such a way that the differences between consecutive elements are minimized?

struct element
{
    uint32 positions[8];
};

      

These records are order insensitive.
The output file format is defined as:

byte   present;      // each bit i indicates whether positions[i] is written
uint32 position0;    // only the positions whose bits are set in "present"
...                  // are actually written to the file
uint32 positionN;    // N is the number of set bits in "present"
byte   nextpresent;  // the "present" byte of the next record

      

All records are guaranteed to be unique, so a "present" byte of 0 can never occur for a real record and therefore marks EOF. The file is parsed by applying each record's present fields on top of a running "current" structure and appending the result to the list.

For example, take the records {1, 2, 3}, {2, 3, 2}, {4, 2, 3} (showing only three positions for brevity). Reordered as {1, 2, 3}, {4, 2, 3}, {2, 3, 2}, they would be stored as:

111b 1 2 3 001b 4 111b 2 3 2

which saves two numbers compared with writing them in the original, unsorted order.
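
To make the format concrete, here is a rough sketch of how one record might be written as a delta against the previous one. The helper name and the use of stdio are illustrative only, not existing code, and endianness and error handling are ignored.

#include <cstdint>
#include <cstdio>

struct element
{
    std::uint32_t positions[8];
};

// Write "rec" as a delta against "prev": one "present" bitmask byte,
// followed only by the positions that differ from the previous record.
void write_delta(std::FILE* out, const element& prev, const element& rec)
{
    std::uint8_t present = 0;
    for (int i = 0; i < 8; ++i)
        if (rec.positions[i] != prev.positions[i])
            present |= static_cast<std::uint8_t>(1u << i);

    std::fwrite(&present, 1, 1, out);
    for (int i = 0; i < 8; ++i)
        if (present & (1u << i))
            std::fwrite(&rec.positions[i], sizeof(std::uint32_t), 1, out);
}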

My goal is to minimize the size of the output file.





2 answers


Your problem

I think this question should really be labeled as "compression".

As I understand it, you have unordered records, each consisting of eight 4-byte integers: 32 bytes in total. You want to store these records in as small a file as possible, and you have decided to use a form of delta encoding based on Hamming distance. You are asking how best to order the data for the compression scheme you have created.

Your assumptions

From what you have told us, I see no real reason to split your 32 bytes the way you described (other than the fact that word boundaries are handy)! If you get the same data back, do you really care whether it is encoded as eight lots of 4 bytes, sixteen lots of 2 bytes, or one huge 32-byte integer?

Also, unless there is something about your problem domain that makes your own method preferable, you are better off using a tried-and-true compression scheme. You should be able to find code that is already written, and you will get good performance on typical data.



Your question

Coming back to your original question, if you really want to go this route: it is easy to imagine picking a starting record (I don't think it makes a big difference, but it probably makes sense to choose the "smallest" or "largest"), computing the Hamming distance to all remaining records, selecting the one with the minimum distance, writing it out, and repeating. Obviously this is O(n^2) in the number of records. Unfortunately, this paper (which I haven't read or understood in detail) suggests that computing the minimum Hamming distance from one string to a set of others is inherently hard and doesn't have very good approximations.
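
A sketch of that greedy nearest-neighbour ordering is below, measuring distance as the number of differing fields (which is what the output format actually pays for; raw bitwise Hamming distance could be substituted). The function and variable names are mine.

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct element
{
    std::uint32_t positions[8];
};

// Number of fields that differ; each such field costs an extra uint32 in the file.
int field_distance(const element& a, const element& b)
{
    int d = 0;
    for (int i = 0; i < 8; ++i)
        if (a.positions[i] != b.positions[i]) ++d;
    return d;
}

// Greedy nearest-neighbour ordering: start from some record (the first here;
// "smallest" or "largest" would also do) and repeatedly append the unused
// record closest to the one just placed.  O(n^2) distance evaluations.
std::vector<element> greedy_order(std::vector<element> records)
{
    std::vector<element> ordered;
    if (records.empty()) return ordered;

    ordered.push_back(records.front());
    records.erase(records.begin());

    while (!records.empty())
    {
        std::size_t best = 0;
        int best_d = std::numeric_limits<int>::max();
        for (std::size_t i = 0; i < records.size(); ++i)
        {
            int d = field_distance(ordered.back(), records[i]);
            if (d < best_d) { best_d = d; best = i; }
        }
        ordered.push_back(records[best]);
        records.erase(records.begin() + static_cast<std::ptrdiff_t>(best));
    }
    return ordered;
}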

Alternatively, you could get better complexity by sorting your records on their Hamming weight (which boils down to counting the set bits of each 32-byte value), which is O(n log n) in the number of records, and then using some differential encoding of the result. But I don't think it will make a terribly good compression scheme: the integers 0 to 7 might end up ordered something like:

000, 100, 010, 001, 101, 011, 110, 111

(in decimal: 0, 4, 2, 1, 5, 3, 6, 7)
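
Put another way, that weight-based ordering is nothing more than a popcount-keyed sort; a sketch (the helper names are mine):

#include <algorithm>
#include <cstdint>
#include <vector>

struct element
{
    std::uint32_t positions[8];
};

// Hamming weight of the whole 32-byte record: total number of set bits.
int hamming_weight(const element& e)
{
    int bits = 0;
    for (int i = 0; i < 8; ++i)
        for (std::uint32_t v = e.positions[i]; v != 0; v >>= 1)
            bits += static_cast<int>(v & 1u);
    return bits;
}

// O(n log n) ordering by Hamming weight, as described above.
void sort_by_weight(std::vector<element>& records)
{
    std::sort(records.begin(), records.end(),
              [](const element& a, const element& b)
              { return hamming_weight(a) < hamming_weight(b); });
}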

Which brings us back to the question I asked earlier: Are you sure your compression scheme is better than something more standard for your specific data?





You are looking at a couple of subproblems: determining the difference between structures, then ordering them.

I am not entirely clear on your description of the structure or on how the differences should be weighted, but I assume you can define and compute the difference between two instances. For files, there are well-known algorithms for this sort of thing, such as the ones used in diff.
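
For these particular structures, assuming the file format from the question, one natural "difference" between two instances is simply the number of bytes it would cost to write one immediately after the other. A sketch of that edge cost (my reading of the format, not established code):

#include <cstdint>

struct element
{
    std::uint32_t positions[8];
};

// Bytes needed to encode "next" when it immediately follows "prev" in the
// file: one "present" byte plus 4 bytes for every field that changed.
int delta_cost_bytes(const element& prev, const element& next)
{
    int changed = 0;
    for (int i = 0; i < 8; ++i)
        if (prev.positions[i] != next.positions[i]) ++changed;
    return 1 + 4 * changed;
}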



For the ordering, you are looking at the classic traveling salesman problem. If you only have a few of these things to order, it's easy. If you have a lot of them, you will have to settle for "good" rather than optimal, unless you are prepared to apply domain knowledge and the many little tricks developed for TSP.
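
One of those standard tricks is local improvement of an existing ordering, for example a 2-opt pass, sketched here using the number of differing fields as a symmetric edge cost (illustrative code, not a reference implementation):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct element
{
    std::uint32_t positions[8];
};

// Symmetric edge cost: number of fields that differ between two records.
int field_distance(const element& a, const element& b)
{
    int d = 0;
    for (int i = 0; i < 8; ++i)
        if (a.positions[i] != b.positions[i]) ++d;
    return d;
}

// One 2-opt pass over a path: try reversing each segment [i, j] and keep the
// reversal whenever it lowers the cost of the two boundary edges (interior
// edge costs are unchanged because the cost is symmetric).  Returns true if
// any improvement was made, so the caller can loop until it returns false.
bool two_opt_pass(std::vector<element>& order)
{
    bool improved = false;
    const std::size_t n = order.size();
    for (std::size_t i = 1; i + 1 < n; ++i)
    {
        for (std::size_t j = i + 1; j < n; ++j)
        {
            int before = field_distance(order[i - 1], order[i]);
            int after  = field_distance(order[i - 1], order[j]);
            if (j + 1 < n)
            {
                before += field_distance(order[j], order[j + 1]);
                after  += field_distance(order[i], order[j + 1]);
            }
            if (after < before)
            {
                std::reverse(order.begin() + static_cast<std::ptrdiff_t>(i),
                             order.begin() + static_cast<std::ptrdiff_t>(j) + 1);
                improved = true;
            }
        }
    }
    return improved;
}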









