MPI on a multi-core machine
My situation is quite simple: I want to run MPI-enabled software on a single multiprocessor/multicore machine, say with 8 cores.
My MPI implementation is MPICH2.
As I understand it, I have several options:
$ mpiexec -n 8 my_software
$ mpiexec -n 8 -hosts localhost:8 my_software
or I could tell Hydra to fork rather than use ssh:
$ mpiexec -n 8 -launcher fork my_software
Could you please tell me whether there will be any differences, or will the behavior be the same?
Of course, since all my processes will be on the same machine, I don't want "messaging" to go over the network (not even the local loopback) but over shared memory. As I understand it, MPI will figure this out by itself, and that will be the case for all three options.
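(My real program is more complex, but for testing the launch options a minimal stand-in for my_software like the following is enough; it just prints each rank and the host it runs on, so with any of the three commands above it should report the same host for all 8 ranks.)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);   /* host this rank is running on */
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

compiled with
$ mpicc -o my_software my_software.c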
Simple answer:
All of these methods should lead to the same performance: you will end up with 8 processes running on your cores and communicating through shared memory.
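As a quick sanity check that everything is spawned locally, Hydra will also launch a non-MPI executable, e.g.
$ mpiexec -n 8 hostname
which should print the same host name eight times.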
Technical answer:
The fork launcher has the advantage of working on systems where spawning via rsh/ssh would be a problem; beyond that, I think, it simply starts the processes locally.
In the end (unless MPI is configured otherwise), all processes on the same machine will end up communicating through shared memory, and the launcher or host specification method shouldn't matter for this. The communication method is controlled by a separate setting (the channel/device, if I remember correctly).
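For what it's worth, in MPICH2 the channel/device is normally chosen when the library itself is built rather than on the mpiexec command line; I believe the Nemesis channel, which uses shared memory for communication within a node, is the default in recent MPICH2 releases. A build that selects it explicitly would look like:
$ ./configure --with-device=ch3:nemesis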
The specific syntax of the host specification may let you bind processes to particular CPU cores, in which case you may get slightly better or worse performance depending on your application.
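If you want to experiment with binding, Hydra does have an option for it; the exact spelling depends on your version (newer MPICH uses -bind-to, MPICH2's Hydra used -binding), so check mpiexec -help, but roughly:
$ mpiexec -n 8 -bind-to core my_software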
If you have everything set up correctly, I don't see how your program's behavior could depend on how you launch it; if it does behave differently with one option or another, that would mean you didn't have everything set up correctly in the first place.
Whether the way messages are passed changes depends, if memory serves, on the MPI device you use. It used to be that you would use the ch_shmem MPI device (if I recall the name correctly). This handled the transfer of messages between processes, but it did use a buffer space, and messages were copied into and out of that space. So message passing was still performed, just at the speed of the memory bus.
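To make the buffer-space idea concrete, here is a rough sketch (not MPICH code, just an illustration of that data path on a POSIX system, with error handling omitted): the sender copies the message into a shared buffer and the receiver copies it back out, so the transfer runs at memory speed but still costs two copies.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

int main(void)
{
    /* Shared "buffer space" visible to both processes, standing in
       for the device's internal message buffers. */
    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int pipefd[2];
    pipe(pipefd);                       /* used only to signal "message ready" */

    if (fork() == 0) {                  /* receiver process */
        char msg[64], ready;
        read(pipefd[0], &ready, 1);     /* wait for the sender */
        memcpy(msg, buf, sizeof msg);   /* copy *out of* the shared buffer */
        printf("receiver got: %s\n", msg);
        _exit(0);
    }

    /* sender process */
    const char msg[] = "hello through shared memory";
    memcpy(buf, msg, sizeof msg);       /* copy *into* the shared buffer */
    write(pipefd[1], "x", 1);
    wait(NULL);
    return 0;
}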
I am writing in the past tense because it has been a while since I was close enough to the hardware to know (or, frankly, to care about) the low-level implementation details, and more modern MPI installations may be a little, or a lot, more sophisticated. I would be surprised and pleased to learn that any modern MPI installation actually replaces message passing with shared-memory reads and writes on a multicore/multiprocessor machine. I would be surprised because it would require translating message passing into shared-memory accesses, and I am not sure that is easy (or easy enough) to do for all of MPI. I think it is much more likely that current implementations still rely on passing messages across the memory bus through some buffer area. But, as I say, this is just my best guess, and I am often wrong about these matters.