C - Fastest way to sort a large 2D integer array
I have a 2D array where each line contains 6 integers that are already sorted in ascending order. Example:
1 2 3 4 5 6
6 8 9 10 13 15
1 4 5 6 7 9
1 4 5 6 7 8
3 18 19 20 25 34
Expected Result:
1 2 3 4 5 6
1 4 5 6 7 8
1 4 5 6 7 9
3 18 19 20 25 34
6 8 9 10 13 15
The actual data contains 8 to 33 meters of records like this. I am trying to figure out the fastest way to sort this array. I currently have a working code using qsort:
qsort call:
qsort(allRecords, lineCount, sizeof(int*), cmpfunc);
cmpfunc:
int cmpfunc (const void * a, const void * b)
{
const int *rowA = *(const int **)a;
const int *rowB = *(const int **)b;
if (rowA[0] > rowB[0]) return 1;
if (rowA[0] < rowB[0]) return -1;
if (rowA[1] > rowB[1]) return 1;
if (rowA[1] < rowB[1]) return -1;
if (rowA[2] > rowB[2]) return 1;
if (rowA[2] < rowB[2]) return -1;
if (rowA[3] > rowB[3]) return 1;
if (rowA[3] < rowB[3]) return -1;
if (rowA[4] > rowB[4]) return 1;
if (rowA[4] < rowB[4]) return -1;
if (rowA[5] > rowB[5]) return 1;
if (rowA[5] < rowB[5]) return -1;
return 0;
}
For a sample of 33 million records, it takes about 35.6 seconds (gcc-O1), which is pretty fast, but I'm wondering if there is a faster way to do this, given the pre-sorted values ββon each line.
This naturally applies to data where the most common first digit is 1, so a 33 m record file can have 12 m records starting at 1, then 8 m records starting at 2, 5 m records starting at 3, etc. ... I'm not sure if this lends itself to one particular sort of sort over another (like heapsort).
I understand that qsort has a fair amount of overhead due to the fact that it has to call a function all the time, so I'm hoping for even better performance.
I don't usually write C code, so I am very open to suggestions and criticism as I collect all of this from tutorials and other StackOverflow questions / answers.
EDIT: As requested, my initialization code is:
// Empty record
int recArray[6] = {0,0,0,0,0,0};
// Initialize allRecords
int** allRecords;
allRecords = (int**) malloc(lineCount*sizeof(int*));
for(i=0; i < lineCount; i++)
{
allRecords[i] = (int*) malloc(6*sizeof(int));
}
// Zero-out all records
for(i=0; i < lineCount; i++)
{
memcpy(allRecords[i], recArray, 6 * sizeof(int));
}
I'm still learning the right way to do things, so I wouldn't be surprised if I get it all wrong. A guide to the correct right would be appreciated.
Someone else asked about the range of values ββ- I'm not sure if the range will change in the future, but currently the values ββare between 1 and 99.
Also, for profiling - I built a little function that uses gettimeofday () to output seconds / microseconds and then compare before and after. I am open to better ways. The result looks like this:
// <-- Here I capture the gettimeofday() structure output
Sorting...
Sorted.
Time Taken: 35.628882s // <-- Capture it again, show the difference
EDIT: Per @doynax - I now "box" 6 values ββof each line into unsigned long long int:
// Initialize allRecords
unsigned long long int* allRecords;
allRecords = (unsigned long long int*) malloc(lineCount*sizeof(unsigned long long int));
for(i=0; i < lineCount; i++)
{
allRecords[i] = 0;
}
...
// "Pack" current value (n0) into an unsigned long long int
if(recPos == 0) { lineSum += n0 * UINT64_C(1); }
else if(recPos == 1) { lineSum += n0 * UINT64_C(100); }
else if(recPos == 2) { lineSum += n0 * UINT64_C(10000); }
else if(recPos == 3) { lineSum += n0 * UINT64_C(1000000); }
else if(recPos == 4) { lineSum += n0 * UINT64_C(100000000); }
else if(recPos == 5) { lineSum += n0 * UINT64_C(10000000000); }
...
allRecords[linecount] = lineSum;
lineSum = 0;
I can also later "unpack" one of those unsigned long long int values ββback to the original 6 ints.
However, when I try to sort:
qsort(allRecords, lineCount, sizeof(unsigned long long int), cmpfunc);
...
int cmpfunc (const void * a, const void * b)
{
if (*(unsigned long long int*)a > *(unsigned long long int*)b) return 1;
if (*(unsigned long long int*)a < *(unsigned long long int*)b) return -1;
return 0;
}
... the results are not sorted as expected. If I show the first and last lines before and after sorting using this:
printf("[%i] = %llu = %i,%i,%i,%i,%i,%i\n", j, lineSum, recArray[0]...recArray[5]);
Output:
First and last 5 rows before sorting:
[#] = PACKED INT64 = UNPACKED
[0] = 462220191706 = 6,17,19,20,22,46
[1] = 494140341005 = 5,10,34,40,41,49
[2] = 575337201905 = 5,19,20,37,53,57
[3] = 504236262316 = 16,23,26,36,42,50
[4] = 534730201912 = 12,19,20,30,47,53
[46] = 595648302516 = 16,25,30,48,56,59
[47] = 453635251108 = 8,11,25,35,36,45
[48] = 403221161202 = 2,12,16,21,32,40
[49] = 443736310604 = 4,6,31,36,37,44
[50] = 575248312821 = 21,28,31,48,52,57
First and last 5 rows after sorting:
[0] = 403221161202 = 2,12,16,21,32,40
[1] = 413218141002 = 2,10,14,18,32,41
[2] = 443736310604 = 4,6,31,36,37,44
[3] = 444127211604 = 4,16,21,27,41,44
[4] = 453028070302 = 2,3,7,28,30,45
[46] = 585043260907 = 7,9,26,43,50,58
[47] = 593524170902 = 2,9,17,24,35,59
[48] = 595248392711 = 11,27,39,48,52,59
[49] = 595251272612 = 12,26,27,51,52,59
[50] = 595648302516 = 16,25,30,48,56,59
I am guessing that I am somehow comparing the wrong values ββ(like pointer values ββinstead of actual values), but I'm not entirely sure what the correct syntax is.
On the positive side, it flares up quickly. :)
Sorting 33-bit 64-bit ints takes about 4-5 seconds (at least in its current, erroneous form).
source to share
Jonathan Leffler's comment on packaging reordering is the place and I thought about it while looking through your code. The following would be my approach:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <string.h> // for memcpy
#define ROW_LENGTH 6
#define ROW_COUNT 15
// 1, 2, 3, 4, 5, 6, 6, 8, 9, 10, 13, 15, 1, 4, 5, 6, 7, 9, 1, 4, 5, 6, 7, 8, 3, 18, 19, 20, 25, 34
/*
1 2 3 4 5 6
6 8 9 10 13 15
1 4 5 6 7 9
1 4 5 6 7 8
3 18 19 20 25 34
*/
// Insertion sorting taken from /questions/19059/fastest-sort-of-fixed-length-6-int-array/141732#141732 with modification
static __inline__ int sortUlliArray(unsigned long long int *d, int length){
int i, j;
for (i = 1; i < length; i++) {
unsigned long long int tmp = d[i];
for (j = i; j >= 1 && tmp < d[j-1]; j--)
d[j] = d[j-1];
d[j] = tmp;
}
return i; // just to shutup compiler
}
int cmpfunc (const void * a, const void * b)
{
if (*(unsigned long long int*)a > *(unsigned long long int*)b) return 1;
if (*(unsigned long long int*)a < *(unsigned long long int*)b) return -1;
return 0;
}
int main(){
int array[ROW_COUNT][ROW_LENGTH],
decodedResultsArray[ROW_COUNT][ROW_LENGTH];
const int rawData[] = { 1, 2, 3, 4, 5, 6,
6, 8, 9, 10, 13, 15,
1, 4, 5, 6, 7, 9,
1, 4, 5, 6, 7, 8,
3, 18, 19, 20, 25, 34,
6,17,19,20,22,46,
5,10,34,40,41,49,
5,19,20,37,53,57,
16,23,26,36,42,50,
12,19,20,30,47,53,
16,25,30,48,56,59,
8,11,25,35,36,45,
2,12,16,21,32,40,
4,6,31,36,37,44,
21,28,31,48,52,57
};
memcpy(array, rawData, sizeof(rawData)/sizeof(*rawData)); // copy elements into array memory
// Sort
// precompute keys
unsigned long long int *rowSums = calloc(ROW_COUNT, sizeof(unsigned long long int));
unsigned long long int *sortedSums = rowSums ? calloc(ROW_COUNT, sizeof(unsigned long long int)) : NULL; // if rowSums is null, don't bother trying to allocate.
if(!rowSums || !sortedSums){
free(rowSums);
free(sortedSums);
fprintf(stderr, "Failed to allocate memory!\n");
fflush(stderr); // should be unnecessary, but better to make sure it gets printed
exit(100);
}
int i=0, j=0, k=0;
for(; i < ROW_COUNT; i++){
rowSums[i] = 0; // this should be handled by calloc, but adding this for debug
for(j=0; j < ROW_LENGTH; j++){
unsigned long long int iScalar=1;
for(k=ROW_LENGTH-1; k > j; --k)
iScalar *= 100; // probably not the most efficient way to compute this, but this is meant more as an example/proof of concept
unsigned long long int iHere = array[i][j];
rowSums[i] += (iHere * iScalar);
// printf("DEBUG ITERATION REPORT\n\t\tRow #%d\n\t\tColumn #%d\n\t\tiScalar: %llu\n\t\tiHere: %llu\n\t\tCurrent Sum for Row: %llu\n\n", i, j, iScalar, iHere, rowSums[i]);
fflush(stdout);
}
}
memcpy(sortedSums, rowSums, sizeof(unsigned long long int)*ROW_COUNT);
// Some debugging output:
/*
printf("Uncopied Sums:\n");
for(i=0; i < ROW_COUNT; i++)
printf("SortedRowSums[%d] = %llu\n", i, rowSums[i]);
printf("Memcopyed sort array:\n");
for(i=0; i < ROW_COUNT; i++)
printf("SortedRowSums[%d] = %llu\n", i, sortedSums[i]);
*/
clock_t begin = clock();
//qsort(sortedSums, ROW_COUNT, sizeof(unsigned long long int), cmpfunc);
sortUlliArray(sortedSums, ROW_COUNT);
clock_t end = clock();
double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Time for sort: %lf\n", time_spent);
printf("Before sort array:\n");
for(i=0; i<ROW_COUNT; i++){
for(j=0; j < ROW_LENGTH; j++){
printf("Unsorted[%d][%d] = %d\n", i, j, array[i][j]);
}
}
printf("Values of sorted computed keys:\n");
for(i=0; i < ROW_COUNT; i++)
printf("SortedRowSums[%d] = %llu\n", i, sortedSums[i]);
// Unpack:
for(i=0; i < ROW_COUNT; i++){
for(j=0; j < ROW_LENGTH; j++){
unsigned long long int iScalar=1;
for(k=ROW_LENGTH-1; k > j; --k)
iScalar *= 100;
unsigned long long int removalAmount = sortedSums[i]/iScalar;
decodedResultsArray[i][j] = removalAmount;
sortedSums[i] -= (removalAmount*iScalar);
// DEBUG:
// printf("Decoded Result for decoded[%d][%d] = %d\n", i, j, decodedResultsArray[i][j]);
}
}
printf("\nFinal Output:\n");
for(i=0; i < ROW_COUNT; i++){
printf("Row #%d: %d", i, decodedResultsArray[i][0]);
for(j=1; j < ROW_LENGTH; j++){
printf(", %d", decodedResultsArray[i][j]);
}
puts("");
}
fflush(stdout);
free(rowSums);
free(sortedSums);
return 1;
}
Note that this is not optimized for maximum efficiency and is littered with debug leads, but it is a proof of concept for packaging nonetheless. Also, given the number of lines you have to handle, you would probably be better off using it qsort()
, but I'm using it with sortUlliArray(...)
(which is a modified version of the Insert-sort function from fooobar.com/questions/19059 / ... ). You will need to give him a test to find out what works best for your case.
All in all, the end result of running this code on 15 hard-coded lines is:
Row #0: 1, 2, 3, 4, 5, 6
Row #1: 1, 4, 5, 6, 7, 8
Row #2: 1, 4, 5, 6, 7, 9
Row #3: 2, 12, 16, 21, 32, 40
Row #4: 3, 18, 19, 20, 25, 34
Row #5: 4, 6, 31, 36, 37, 44
Row #6: 5, 10, 34, 40, 41, 49
Row #7: 5, 19, 20, 37, 53, 57
Row #8: 6, 8, 9, 10, 13, 15
Row #9: 6, 17, 19, 20, 22, 46
Row #10: 8, 11, 25, 35, 36, 45
Row #11: 12, 19, 20, 30, 47, 53
Row #12: 16, 23, 26, 36, 42, 50
Row #13: 16, 25, 30, 48, 56, 59
Row #14: 21, 28, 31, 48, 52, 57
So this seems to handle cases where the numbers are very similar, which was a problem with the order in which the numbers were packed.
Anyway, the code above should work, but this is meant as an example, so I'll leave it to you to apply the optimizations you need.
The code was tested on a 64-bit MacBook Air with 1.6GHz Intel Core i5 and 4GB 1600MHz DDR3. So, quite weak processor and slow memory, but it was able to do the sort for 15 rows of 0.004 milliseconds, so pretty fast in my opinion. (This is just a measure of the sorting speed for the above test case, not the prepackaging or unpacking speeds as they might use some optimization.)
The main credit goes to Doinax and Jonathan Leffler.
source to share