Time mismatch with pthreads

My multithreaded C program does the following procedure:

#include <pthread.h>
#include <stdio.h>

#define NUM_LOOP 500000000
long long sum = 0;    /* shared by all threads */

void* add_offset(void *n){
        int offset = *(int*)n;
        for(int i = 0; i < NUM_LOOP; i++) sum += offset;   /* unsynchronized update of the shared sum */
        pthread_exit(NULL);
}


Of course, sum should be updated while holding a lock, but before getting to that I have a problem with the running time of this simple program.
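
For reference, a minimal sketch of what that locked variant could look like, added to the program above (the mutex and function names are my assumptions, not part of the original program; taking the lock on every iteration would of course add even more overhead):

/* Sketch only (assumed names): the same loop with a mutex around the
   shared update. Correct, but the per-iteration locking adds yet more
   overhead on top of the cache traffic discussed in the answers below. */
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

void* add_offset_locked(void *n){
        int offset = *(int*)n;
        for(int i = 0; i < NUM_LOOP; i++){
                pthread_mutex_lock(&sum_lock);
                sum += offset;
                pthread_mutex_unlock(&sum_lock);
        }
        pthread_exit(NULL);
}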

With the following main function (Single Thread):

int main(void){

        pthread_t tid1;
        int offset1 = 1;
        pthread_create(&tid1,NULL,add_offset,&offset1);
        pthread_join(tid1,NULL);
        printf("sum = %lld\n",sum); 
        return 0;
}


Output and running time:

sum = 500000000

real    0m0.686s
user    0m0.680s
sys     0m0.000s


With the following main function (Multi Threaded Sequential):

int main(void){

        pthread_t tid1;
        int offset1 = 1;
        pthread_create(&tid1,NULL,add_offset,&offset1);
        pthread_join(tid1,NULL);

        pthread_t tid2;
        int offset2 = -1;
        pthread_create(&tid2,NULL,add_offset,&offset2);
        pthread_join(tid2,NULL);

        printf("sum = %lld\n",sum);

        return 0;
}


Output and running time:

sum = 0

real    0m1.362s
user    0m1.356s
sys     0m0.000s


So far, the program works as expected. But with the following main function (Multi Threaded Concurrent):

int main(void){

        pthread_t tid1;
        int offset1 = 1;
        pthread_create(&tid1,NULL,add_offset,&offset1);

        pthread_t tid2;
        int offset2 = -1;
        pthread_create(&tid2,NULL,add_offset,&offset2);

        pthread_join(tid1,NULL);
        pthread_join(tid2,NULL);

        printf("sum = %lld\n",sum);

        return 0;
}


Output and running time:

sum = 166845932

real    0m2.087s
user    0m3.876s
sys     0m0.004s


The erroneous value of sum due to the lack of synchronization is not the issue here; the running time is. The real execution time of the concurrent run is much longer than that of the sequential run. This is the opposite of what is expected from parallel execution on a multi-core processor.

Please explain what could be the problem here.

2 answers


This is not an unusual effect when multiple threads access the same shared state (at least on x86). It is commonly referred to as cache ping-pong:

Whenever one core wants to update the value of the variable, it first has to take ownership of the cache line (acquire it for writing) from the other core, which takes some time. Then the other core wants the cache line back...



Thus, even without any synchronization primitive, you are paying a significant overhead compared to the sequential case.
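
A common way to sidestep this ping-pong, sketched below with assumed names (add_offset_merged, merge_lock) as an addition to the program above, is to keep the hot loop in a thread-local variable and touch the shared sum only once per thread, under a lock:

/* Sketch (assumed names): accumulate in a local variable so the loop
   never touches the shared cache line, then merge the result once. */
pthread_mutex_t merge_lock = PTHREAD_MUTEX_INITIALIZER;

void* add_offset_merged(void *n){
        int offset = *(int*)n;
        long long local = 0;
        for(int i = 0; i < NUM_LOOP; i++) local += offset;   /* no sharing in the loop */

        pthread_mutex_lock(&merge_lock);
        sum += local;                 /* one contended access per thread */
        pthread_mutex_unlock(&merge_lock);
        pthread_exit(NULL);
}

This is essentially what the change in the next answer does, plus a lock around the single merge step so that the final value also stays correct.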


As suggested by @spectras, I made the following changes to the procedure add_offset:

#define NUM_LOOP 500000000
long long sum = 0;

void* add_offset(void *n){
        int offset = *(int*)n;
        long long sum_local = sum; //read sum
        for(int i = 0; i<NUM_LOOP; i++) sum_local += offset;
        sum = sum_local; //write to sum
        pthread_exit(NULL);
}


The main function for the multithreaded concurrent execution remains the same as above, and the running time is now as expected:

sum = 500000000

real    0m0.683s
user    0m1.356s
sys     0m0.000s




Another output and running time:

sum = -500000000

real    0m0.686s
user    0m1.360s
sys     0m0.000s


These two, and only these two, output values are expected because the threads are not synchronized. The output value of sum reflects which thread (with offset = 1 or offset = -1) was the last one to write sum.
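
If a deterministic result is wanted without any lock, one option (a sketch with names I made up, not code from the answer) is to have each thread hand its partial sum back through pthread_exit and let main combine the values after pthread_join:

/* Sketch (assumed names): each thread returns its own partial sum,
   so no thread ever writes shared state and main adds the pieces up.
   Build with the usual -pthread flag. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_LOOP 500000000

void* add_offset_ret(void *n){
        int offset = *(int*)n;
        long long local = 0;
        for(int i = 0; i < NUM_LOOP; i++) local += offset;

        long long *result = malloc(sizeof *result);
        *result = local;
        pthread_exit(result);         /* retrieved by pthread_join below */
}

int main(void){
        pthread_t tid1, tid2;
        int offset1 = 1, offset2 = -1;
        void *r1, *r2;

        pthread_create(&tid1, NULL, add_offset_ret, &offset1);
        pthread_create(&tid2, NULL, add_offset_ret, &offset2);
        pthread_join(tid1, &r1);
        pthread_join(tid2, &r2);

        long long total = *(long long*)r1 + *(long long*)r2;
        free(r1);
        free(r2);
        printf("sum = %lld\n", total);  /* always 0 for offsets +1 and -1 */
        return 0;
}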
