Time mismatch with pthreads
My multithreaded C program contains the following procedure:
#define NUM_LOOP 500000000
long long sum = 0;
void* add_offset(void *n){
    int offset = *(int*)n;
    for(int i = 0; i < NUM_LOOP; i++) sum += offset;
    pthread_exit(NULL);
}
Of course I should update sum only after acquiring a lock, but before getting to that I have a problem with the running time of this simple program.
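For completeness, here is a minimal sketch of what the lock-protected version could look like, using a pthread_mutex_t around the update. The names (add_offset_locked, run_locked) are mine, and the loop count is reduced from 500000000 so the demo runs quickly:

```c
#include <pthread.h>

/* Reduced iteration count, only to keep the demo fast. */
#define N 1000000

static long long locked_sum = 0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_offset_locked(void *n) {
    int offset = *(int *)n;
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&sum_lock);   /* serialize access to locked_sum */
        locked_sum += offset;
        pthread_mutex_unlock(&sum_lock);
    }
    return NULL;
}

/* Run the +1 and -1 threads concurrently; with the lock the
   final value is always exactly 0. */
long long run_locked(void) {
    pthread_t t1, t2;
    int up = 1, down = -1;
    locked_sum = 0;
    pthread_create(&t1, NULL, add_offset_locked, &up);
    pthread_create(&t2, NULL, add_offset_locked, &down);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return locked_sum;
}
```

Note that taking and releasing the mutex on every iteration makes this version far slower than any of the timings below; it only fixes the correctness problem.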
With the following main function (single thread):
int main(void){
    pthread_t tid1;
    int offset1 = 1;
    pthread_create(&tid1, NULL, add_offset, &offset1);
    pthread_join(tid1, NULL);
    printf("sum = %lld\n", sum);
    return 0;
}
Output and running time:
sum = 500000000
real 0m0.686s
user 0m0.680s
sys 0m0.000s
With the following main function (multithreaded, sequential):
int main(void){
    pthread_t tid1;
    int offset1 = 1;
    pthread_create(&tid1, NULL, add_offset, &offset1);
    pthread_join(tid1, NULL);
    pthread_t tid2;
    int offset2 = -1;
    pthread_create(&tid2, NULL, add_offset, &offset2);
    pthread_join(tid2, NULL);
    printf("sum = %lld\n", sum);
    return 0;
}
Output and running time:
sum = 0
real 0m1.362s
user 0m1.356s
sys 0m0.000s
So far, the program works as expected. But with the following main function (multithreaded, concurrent):
int main(void){
    pthread_t tid1;
    int offset1 = 1;
    pthread_create(&tid1, NULL, add_offset, &offset1);
    pthread_t tid2;
    int offset2 = -1;
    pthread_create(&tid2, NULL, add_offset, &offset2);
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    printf("sum = %lld\n", sum);
    return 0;
}
Output and running time:
sum = 166845932
real 0m2.087s
user 0m3.876s
sys 0m0.004s
The erroneous value of sum due to the lack of synchronization is not the issue here; the runtime is. The wall-clock time of the concurrent run is much longer than that of the sequential run. This is the opposite of what is expected from parallel execution on a multi-core processor.
Please explain what could be the problem here.
This is not an unusual effect when multiple threads access the same shared state (at least on x86). It is commonly referred to as cache ping-pong:
Whenever one core wants to update the value of this variable, it first needs to take ownership of the cache line (locking the cache line for writing) away from the other core, which takes some time. Then the other core wants the cache line back, and so on.
Thus, even without the synchronization primitive, you are paying significant overhead compared to the sequential case.
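One way to see this is to give each thread its own accumulator, padded so the two counters land on different cache lines (64 bytes is assumed here; it is the common line size on x86, but not guaranteed). Each core can then keep its line exclusively and no ownership transfers occur during the loop. This sketch and its names (padded_counter, run_padded) are mine, with the loop count reduced for speed:

```c
#include <pthread.h>

#define N 1000000
#define CACHE_LINE 64   /* assumed line size */

/* Pad each counter to a full cache line so the two threads
   never write to the same line. */
struct padded_counter {
    long long value;
    char pad[CACHE_LINE - sizeof(long long)];
};

static struct padded_counter counters[2];

struct task { int index; int offset; };

static void *add_private(void *arg) {
    struct task *t = arg;
    for (int i = 0; i < N; i++)
        counters[t->index].value += t->offset;   /* private line, no ping-pong */
    return NULL;
}

/* Combine the per-thread results only once, after joining. */
long long run_padded(void) {
    pthread_t t1, t2;
    struct task a = {0, 1}, b = {1, -1};
    counters[0].value = counters[1].value = 0;
    pthread_create(&t1, NULL, add_private, &a);
    pthread_create(&t2, NULL, add_private, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counters[0].value + counters[1].value;
}
```

Without the padding the two counters would sit on the same cache line and you would get the same ping-pong effect even though the threads write to different variables (false sharing).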
As suggested by @spectras, I made the following changes to the add_offset procedure:
#define NUM_LOOP 500000000
long long sum = 0;
void* add_offset(void *n){
    int offset = *(int*)n;
    long long sum_local = sum; // read sum
    for(int i = 0; i < NUM_LOOP; i++) sum_local += offset;
    sum = sum_local; // write to sum
    pthread_exit(NULL);
}
The main function for the multithreaded concurrent run remains the same as above, and the runtime is now as expected:
sum = 500000000
real 0m0.683s
user 0m1.356s
sys 0m0.000s
Another output and runtime:
sum = -500000000
real 0m0.686s
user 0m1.360s
sys 0m0.000s
These two, and only these two, output values are expected because the threads are not synchronized. The output value of sum reflects which thread (offset = 1 or offset = -1) updated sum last.
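If a deterministic result is wanted without a lock, each thread can hand its local sum back through pthread_exit and let the caller combine the two partial sums after joining. This is a sketch of my own (the names add_offset_ret and run_combined are not from the original), with a reduced loop count:

```c
#include <pthread.h>
#include <stdlib.h>

#define N 1000000

/* Each thread accumulates privately and returns its partial sum. */
static void *add_offset_ret(void *n) {
    int offset = *(int *)n;
    long long *local = malloc(sizeof *local);
    *local = 0;
    for (int i = 0; i < N; i++)
        *local += offset;
    pthread_exit(local);   /* hand the partial sum back to the joiner */
}

/* Join both threads and add their results; instead of the last
   writer winning, both contributions are kept, so the total is 0. */
long long run_combined(void) {
    pthread_t t1, t2;
    int up = 1, down = -1;
    void *r1, *r2;
    pthread_create(&t1, NULL, add_offset_ret, &up);
    pthread_create(&t2, NULL, add_offset_ret, &down);
    pthread_join(t1, &r1);
    pthread_join(t2, &r2);
    long long total = *(long long *)r1 + *(long long *)r2;
    free(r1);
    free(r2);
    return total;
}
```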