Tuning mathematical parallel codes

Assuming I am interested in performance rather than portability of my multi-threaded iterative linear algebra solver, and that I have the results of profiling my code in hand, how do I go about tuning my code to run optimally on the machine of my choice?

The algorithm involves matrix-vector multiplications, norms and dot products. (FWIW, I'm working on CG and GMRES.)

I am working on codes whose matrix size is roughly equivalent to the full RAM size (~6 GB). I will be working on an Intel i3 laptop and linking my codes with Intel MKL.

In particular,

  • Is there a good resource (PDF / book / paper) for learning manual tuning? There are many things I've learned by doing, for example that manual unrolling isn't always optimal, or how compiler flags behave, but I'd prefer a centralized resource.

  • I need something to translate profiler information into improved performance. For example, my profiler tells me that the stack of one processor is being accessed by another, or that a mulpd instruction in my assembly is taking too long. I don't know what this means or how I can use this information to improve my code.

I intend to spend as much time as necessary to squeeze out as much processing power as possible. It's more of a learning experience than it is for actual use or distribution at the moment.

(I'm concerned with manual tuning, not autotuning.)


  • This differs from normal performance tuning because most of the code is linked against Intel's proprietary MKL library.
  • Because of the O(N^2) memory bandwidth cost of matrix-vector multiplication, and because of data dependencies, there is a limit to what I can manage on my own through simple observation.
  • I write in C and Fortran. I have tried both and, as discussed a million times on SO, found no difference between them provided I tune them correctly.
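
To put rough numbers on the bandwidth point above: a dense double-precision matrix-vector product reads each of the N^2 matrix entries once (8 bytes each) and does 2 flops per entry, so about 0.25 flop per byte. Below is a minimal sketch of the resulting lower bound on run time. The function names are made up for illustration, and the bandwidth argument is an assumed figure you would have to measure on the actual machine.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound (seconds) on one dense y = A*x with an n-by-n matrix of
   doubles, assuming the matrix is streamed from RAM exactly once and the
   traffic for x and y is negligible by comparison.
   bandwidth_bytes_per_s is a hypothetical sustained memory bandwidth;
   measure it on your own machine rather than trusting a datasheet. */
double matvec_min_seconds(size_t n, double bandwidth_bytes_per_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    double matrix_bytes = (double)n * (double)n * sizeof(double);
    return matrix_bytes / bandwidth_bytes_per_s;
}

/* Arithmetic intensity of the same kernel: 2 flops (multiply and add)
   per 8-byte matrix entry read. */
double matvec_flops_per_byte(void)
{
    return 2.0 / (double)sizeof(double);
}
```

For a matrix of about 6 GB and an assumed 10 GB/s of sustained bandwidth, the bound is about 0.6 s per product, however well the arithmetic itself is vectorized.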

1 answer

Gosh, this still has no answers. After reading this, you will still have no helpful answers ...

You imply that you have already done all the obvious and generic things to make your codes fast. Specifically, you have:

  • chosen the fastest algorithm for your problem (either that, or your task is to optimize the implementation of an algorithm rather than to optimize the search for a solution to the problem);
  • worked your compiler like a dog to squeeze out the last drop of execution speed;
  • linked in the best libraries you can find that are of any use at all (and tested to make sure they really do improve the performance of your program);
  • hand-crafted your memory access to optimize r/w performance;
  • done all the obvious little tricks that we all do (for example, when comparing the norms of 2 vectors, you don't need to take the square root to determine that one is "larger" than the other, ...);
  • hammered the parallel scalability of your program to within a gnat's whisker of the S == P line (speedup equal to processor count) on your performance graphs;
  • always executed your program at the right job size for a given number of processors, in order to maximize some measure of performance;

and yet you are not satisfied!
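
As an illustration of the norm-comparison trick in the list above, here is a minimal sketch in C. The function names are hypothetical, and the code ignores overflow and underflow of the squares.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squares of a vector, i.e. the squared 2-norm. */
static double squared_norm(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += v[i] * v[i];
    return sum;
}

/* Returns 1 if ||a||_2 > ||b||_2, else 0. The square root is monotonic
   on non-negative numbers, so comparing the squared norms gives the same
   answer without calling sqrt at all. */
int norm_is_larger(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    return squared_norm(a, n) > squared_norm(b, n);
}
```

The same idea applies to a convergence test of the form ||r|| < tol * ||b||, which can be checked as r.r < tol^2 * (b.b) using dot products that an iteration such as CG computes anyway.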

Now, unfortunately, you are close to the bleeding edge, and the information you are looking for cannot easily be found in books or on websites. Not even here on SO. Part of the reason for this is that you are now optimizing your code on your platform, and you are in the best position to diagnose problems and fix them. But these problems are likely to be very local; you might conclude that no one else outside your immediate research group would be interested in what you are doing. I know you would not be interested in any micro-optimization that I do in my code on my platform.

The second reason is that you have stepped into an area that is still an active research front, and the useful lessons (if any) are published in the scientific literature. For that you need access to a good research library; if you don't have one nearby, the ACM and IEEE-CS digital libraries are good places to start. (Post or comment if you don't know what these are.)

In your position, I would look at journals on 2 topics: peta- and exa-scale computing for science and engineering, and compiler development. I'm sure the former is obvious. The latter may be less obvious, but if your compiler already did all the (useful) cutting-edge optimizations you would not be asking this question, and compiler writers are working hard so that your successors won't need to.

You are probably looking for optimizations which, like loop unrolling, were relatively hard to find implemented in compilers 25 years ago and were therefore bleeding edge back then, and which will themselves be old and established in another 25 years.


First, let me make explicit something that was originally only implicit in my "answer": I am not prepared to spend enough time on SO to guide you through even a summary of the knowledge that I have acquired over 25 years in scientific / engineering and high performance computing. I am not given to writing books, but many people are, and Amazon will help you find them. This answer was already much longer than most I care to post before I added this bit.

Now, to pick up the points in your comment:

  • on "hand-crafted memory access": start at the Wikipedia article on "loop tiling" (see, you can't even rely on me to paste the URL here) and read on from there; you should quickly pick up terms that you can use in your further searches.
  • on "working your compiler like a dog": I really do mean reading its documentation and gaining a detailed understanding of the intentions and realities of the various options; in the end, you will have to test a lot of compiler options to determine which are "best" for your code on your platform.
  • on "micro-optimization": well, here's a start: Performance Optimization of Numerically Intensive Codes. Don't run away with the idea that you will learn all (or even much) of what you want to learn from this book. It is now about 10 years old. The take-away messages are:
    • performance optimization requires intimacy with the machine architecture;
    • performance optimization consists of 1001 individual steps, and it is usually impossible to predict which of them will be most useful (and which actually harmful) without a detailed understanding of the program and its run-time environment;
    • performance optimization is a participation sport; you cannot learn it without doing it;
    • performance optimization requires close attention to detail and good record-keeping.
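
To make the loop tiling pointer above a little more concrete, here is a minimal sketch of a tiled matrix-vector product in C. It is an illustration only: the function name and the tile parameter are hypothetical, the best tile size can only be found by measurement, and a tuned library such as MKL will generally already block its own kernels, so this matters mainly for loops you write yourself.

```c
#include <assert.h>
#include <stddef.h>

/* y = A*x for a row-major n-by-n matrix. The column loop is split into
   tiles so that each tile of x is reused across all rows while it is
   still in cache. The matrix itself is still streamed exactly once, so
   tiling only helps here when x is too large to stay cached.
   tile is a tuning parameter to be chosen by experiment. */
void matvec_tiled(const double *A, const double *x, double *y,
                  size_t n, size_t tile)
{
    assert(tile > 0);
    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0;

    for (size_t jj = 0; jj < n; jj += tile) {
        /* Last tile may be shorter than the others. */
        size_t j_end = (jj + tile < n) ? jj + tile : n;
        for (size_t i = 0; i < n; ++i) {
            double sum = y[i];              /* partial result so far */
            for (size_t j = jj; j < j_end; ++j)
                sum += A[i * n + j] * x[j];
            y[i] = sum;
        }
    }
}
```

The payoff from tiling is usually much larger in kernels with more data reuse, such as matrix-matrix products, which is where the Wikipedia article starts.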

Oh, and never write a clever piece of optimization that you can't easily undo when the next release of the compiler implements a better approach. I've spent quite a bit of time removing clever tricks from 20-year-old Fortran that were justified (if at all) on the grounds of improved execution performance, but which now just confuse the programmer (which annoys me too) and get in the way of the compiler doing its job.
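
One way to keep a hand optimization easy to undo is to leave the plain version in the source and select between the two at build time. The sketch below does this for a dot product; USE_HAND_UNROLLED_DOT and the function names are hypothetical, and whether the unrolled version is actually faster on a given compiler is something only a measurement can tell you.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: simple enough for the compiler to optimize freely. */
double dot_reference(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-unrolled by 4 with independent accumulators. The additions happen
   in a different order than in dot_reference, so results may differ in
   the last bits for general floating-point data. */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)                  /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}

/* Callers use dot(); dropping the trick later means removing one flag. */
#ifdef USE_HAND_UNROLLED_DOT
#define dot dot_unrolled4
#else
#define dot dot_reference
#endif
```

Keeping both versions also gives you a ready-made correctness check and a baseline to time against after every compiler upgrade.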

Finally, one piece of wisdom that I am willing to share: these days I do very little optimization that is not one of the items on my first list above; I find that the cost-benefit ratio of micro-optimization is unfavorable for my employers.


