Pthreads performance difference
I am writing performance-sensitive code. I am implementing a simple scheduler to distribute workloads, and the main thread is responsible for running the scheduler.
cpu_set_t cpus;
pthread_attr_t attr;
pthread_attr_init(&attr);
for (int i_group = 0; i_group < n_groups; i_group++) {
    std::cout << i_t << "\t" << i_group << "th group of cpu" << std::endl;
    for (int i = index; i < index + group_size[i_group]; i++) {
        struct timeval start, end;
        double spent_time;
        gettimeofday(&start, NULL);
        arguments[i].i_t = i_t;
        arguments[i].F_x = F_xs[i_t];
        arguments[i].F_y = F_ys[i_t];
        arguments[i].F_z = F_zs[i_t];
        CPU_ZERO(&cpus);
        CPU_SET(arguments[i].thread_id, &cpus);
        int err = pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
        if (err != 0) {
            std::cout << err << std::endl;
            exit(-1);
        }
        pthread_create(&threads[i], &attr, &cpu_work, &arguments[i]);
        gettimeofday(&end, NULL);
        spent_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
        std::cout << "create: " << spent_time << "s " << std::endl;
    }
    i_t++;
    cpu_count++;
    arr_finish[i_group] = false;
}
}   // closes an enclosing loop not shown here
That is how the main thread creates the child threads. For a simple explanation, assume i_group = 1. The child threads divide and conquer a batch of matrix-matrix multiplications; here rank stands for thread_id.
int local_first = size[2] * (rank - 1) / n_compute_thread;
int local_end   = size[2] * rank / n_compute_thread - 1;
//mkl_set_num_threads_local(10);
gettimeofday(&start, NULL);
for (int i_z = local_first; i_z <= local_end; i_z++) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                size[0], size[1], size[0], 1.0, F_x, size[0],
                rho[i_z], size[1], 0.0, T_gamma[i_z], size[1]);
}
for (int i_z = local_first; i_z <= local_end; i_z++) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                size[0], size[1], size[1], 1.0, T_gamma[i_z], size[0],
                F_y, size[1], 0.0, T_gamma2[i_z], size[0]);
}
gettimeofday(&end, NULL);
spent_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
std::cout << i_t << "\t" << arg->thread_id << "\t" << sched_getcpu() << "\t"
          << "compute: " << spent_time << "s" << std::endl;
Even though the workload is distributed quickly enough, the per-thread performance varies far too much. See the result below:
5 65 4 4 calculate: 0.270229s
5 64 1 1 calculate: 0.284958s
5 65 2 2 calculate: 0.741197s
5 65 3 3 calculate: 0.76302s
The second column shows how many matrix-matrix multiplications were performed by a particular thread; the last column shows the elapsed time. When I first saw this result, I thought it was due to the placement of the threads, so I added a few lines to control thread binding. However, that did not change the trend in the last column.
My computer has 20 physical cores and 20 virtual cores. I have only tested with 4 child threads. The tests were, of course, run on a Linux machine.
Why does per-thread performance vary so much, and how can I fix it?
First, are you actually creating a scheduler? Your code example shows that you are using the Linux scheduler and setting a thread-attribute object, thread-binding parameters, etc. This distinction matters when choosing a solution to the problem.
The question is broad in any case, and there are a few additional questions/topics that can be raised to help clarify the terms and get closer to a real answer. Let's start with the following:
1 - Benchmark duration. Sub-second measurements seem insufficient for assessing per-thread performance in a thread pool. Increase the evaluation time to give the scheduler time to settle, perhaps to a few minutes.
(For an example of the typical durations used by an existing benchmarking utility, read this.)
2 - Thread priorities. Your threads are not the only ones. Is it possible that the kernel scheduler is periodically displacing your benchmark threads with threads owned by other processes (not the ones you created) that have higher priority, thereby crowding out yours and distorting task completion times?
3 - Task size. Is the number of operations required to complete each task small enough to fit within the time slice allocated by the scheduler? This can contribute to the perception of threading problems, especially if there are differences in the operation count between tasks. (Processes that exceed their allocated CPU time slice are automatically pushed down to a lower "tier", while processes making I/O or blocking requests are promoted to higher "tiers".)
4 - Equality of tasks. You talk about dividing and conquering a set of matrix-matrix multiplications, but are the matrices identical in size and similar in content? That is, are you sure that the operation count of each task equals the operation count of every other task? Given the time slice the scheduler assigns to each equally prioritized thread, a task whose operation count exceeds what can be completed in one slice is more susceptible to longer completion times (through context switches to other, higher-priority OS processes) than a task with few enough operations to fit comfortably in one slice.
5 - Other processes. I have mentioned this in the paragraphs above, but it deserves a number of its own. Using multiple cores requires multiple concurrent threads, but the converse is not true: one core is not limited to one thread. The OS can preempt one of your threads on a specific core with a higher-priority process at any time (without interrupting any other core), possibly distorting your timing. Again, a longer benchmark run helps reduce the influence of these individual events on the differences between threads.