C++ benchmark for data access

I'm writing a simple benchmark in C++ to compare the execution time of data access on different platforms, and I got strange results. I am measuring sequential access time and indirection (pointer-based) access time. To do this, I simply copy one data array to another in two ways. The code and results are shown below. The timings I got are mixed: for the int data type, sequential access is faster (which is expected), but for float and double it is just the opposite (see the results below). Am I benchmarking wrong, or are there pitfalls I did not take into account? Or can you suggest some benchmarks for comparing data access or the performance of simple operations for different data types?

template<typename T>
std::chrono::nanoseconds::rep PerformanceMeter<T>::testDataAccessArr()
{
    T* arrDataIn = new T[k_SIZE];
    T* arrDataOut = new T[k_SIZE];

    std::generate_n(arrDataIn, k_SIZE, DataProcess<T>::valueGenerator);
    DataProcess<T>::clearCache();

    std::chrono::nanoseconds::rep timeSequential = measure::ns(copySequentialArr, arrDataIn, arrDataOut, k_SIZE);

    std::cout << "Sequential order access:\t" << timePrint(timeSequential) << "\t";
    std::cout.flush();

    T** pointers = new T*[k_SIZE];
    T** pointersOut = new T*[k_SIZE];
    for (size_t i = 0; i < k_SIZE; ++i)
    {
        pointers[i] = &arrDataIn[i];
        pointersOut[i] = &arrDataOut[i];
    }

    std::generate_n(arrDataIn, k_SIZE, DataProcess<T>::valueGenerator);
    std::generate_n(arrDataOut, k_SIZE, DataProcess<T>::valueGenerator);

    DataProcess<T>::clearCache();

    std::chrono::nanoseconds::rep totalIndirection = measure::ns(copyIndirectionArr, pointers, pointersOut, k_SIZE);

    std::cout << std::endl << "Indirection order access:\t" << timePrint(totalIndirection) << std::endl;
    std::cout.flush();

    delete[] arrDataIn;
    delete[] arrDataOut;
    delete[] pointers;
    delete[] pointersOut;

    return timeSequential;
}

template <typename T>
void PerformanceMeter<T>::copySequentialArr(const T* dataIn, T* dataOut, size_t dataSize)
{
    for (size_t i = 0; i < dataSize; ++i)
        dataOut[i] = dataIn[i];
}

template <typename T>
void PerformanceMeter<T>::copyIndirectionArr(T** dataIn, T** dataOut, size_t dataSize)
{
    for (size_t i = 0; i < dataSize; ++i)
        *dataOut[i] = *dataIn[i];
}
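
measure::ns and DataProcess<T>::clearCache are helpers that are not shown above. For completeness, here is a minimal sketch of what they might look like; this is my approximation, not the original code: measure::ns is assumed to time a callable in nanoseconds, and the cache-clearing helper is assumed to evict the data caches by writing to a buffer larger than the last-level cache (the 64 MB figure and the name clearCacheSketch are assumptions).

#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

namespace measure
{
    // Time a callable with the given arguments; return elapsed nanoseconds.
    template <typename F, typename... Args>
    std::chrono::nanoseconds::rep ns(F&& f, Args&&... args)
    {
        const auto start = std::chrono::steady_clock::now();
        std::forward<F>(f)(std::forward<Args>(args)...);
        const auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    }
}

// In the same spirit as DataProcess<T>::clearCache: write to a buffer
// larger than the last-level cache so the benchmark arrays are evicted.
inline void clearCacheSketch()
{
    static std::vector<char> dummy(64 * 1024 * 1024);
    for (size_t i = 0; i < dummy.size(); i += 64) // one write per cache line
        ++dummy[i];
}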

      

Results:

------------------- Measuring int ----------------

data: 10 MB; iterations: 1

Sequential order access: 8.50454ms

Indirection order access: 11.6925ms

------------------- Measuring float --------------

data: 10 MB; iterations: 1

Sequential order access: 8.84023ms

Indirection order access: 8.53148ms

------------------- Measuring double -------------

data: 10 MB; iterations: 1

Sequential order access: 5.57747ms

Indirection order access: 3.72843ms



1 answer


Here is an example (using T = int) of the assembly output from GCC 6.3 at -O2 for copySequentialArr and copyIndirectionArr.

The assembly shows that the two functions are very similar to each other, but copyIndirectionArr needs two more mov instructions than copySequentialArr. From this we can conclude that copySequentialArr is the faster of the two.
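
To see where the two extra mov instructions come from, it helps to hand-desugar the loop bodies. This is an illustrative sketch; the function names are mine, not from the question:

#include <cstddef>

template <typename T>
void copySequentialDesugared(const T* dataIn, T* dataOut, size_t dataSize)
{
    for (size_t i = 0; i < dataSize; ++i)
    {
        T value = dataIn[i]; // one load...
        dataOut[i] = value;  // ...and one store per element
    }
}

template <typename T>
void copyIndirectionDesugared(T** dataIn, T** dataOut, size_t dataSize)
{
    for (size_t i = 0; i < dataSize; ++i)
    {
        T* src = dataIn[i];  // extra load: fetch the source pointer
        T* dst = dataOut[i]; // extra load: fetch the destination pointer
        *dst = *src;         // then the actual element copy
    }
}

Each indirection iteration performs two additional pointer loads, which show up as the two extra mov instructions in the assembly.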

The same is true when using T = double: the assembly for copySequentialArr and copyIndirectionArr shows the same pattern.

Vectorization



Things get interesting when we move to -O3: copyIndirectionArr is unchanged, but copySequentialArr is now vectorized by the compiler. Under normal conditions this vectorization makes it even faster than before.
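
The sequential version vectorizes because its accesses are provably contiguous; in effect the compiler turns the loop into a block copy. A sketch of that equivalence (assuming T is trivially copyable; the name copySequentialBlock is mine):

#include <algorithm>
#include <cstddef>

// Equivalent to the sequential loop: a contiguous block copy that the
// compiler can implement with SIMD loads and stores at -O3.
template <typename T>
void copySequentialBlock(const T* dataIn, T* dataOut, size_t dataSize)
{
    std::copy(dataIn, dataIn + dataSize, dataOut);
}

// There is no such rewrite for the indirection version: each address is
// known only after loading pointers[i], so the compiler keeps scalar code.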

Disclaimer

These examinations of the generated assembly are "out of context" in the sense that the compiler would optimize the code even further if it knew the context in which it is called.
