MPI Foundation

I attended MPI Foundation, a nice training given by TACC staff, at the end of March. This post briefly documents the basic MPI programming knowledge I learnt (:sweat: it should have been written long ago, but I did not even start until the day before the follow-up training).

Overview

MPI Worldview

Message Passing Interface (MPI) was invented in the 1990s, when multicore processors had not yet appeared but there were already distributed systems of machines connected by a network. The MPI programming model, to my understanding, is: duplicate a single program (static copy), create one process (active copy) on each machine to process a different set of data, and, whenever processes need to communicate, pass messages (network packets) through the network to exchange data or synchronize.

Basic Concepts

Compiling and running

There is no separate compiler for MPI, only scripts wrapping regular compilers, such as mpicc, mpicxx, mpiicc, and mpiicpc, and you can use all the usual flags. To run the MPI program, use mpiexec or mpirun, for example:

mpiexec -n 4 your_mpi_executable [your program arguments ...]
mpirun -np 4 your_mpi_executable [your program arguments ...]

On TACC, you can use the wrapper ibrun to get support from SLURM.

MPI definitions

Here is a skeleton MPI program written in C++:
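
A minimal sketch along these lines (the exact code from the training may differ, but it reproduces output like the one shown below):

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    // Code outside the MPI region is executed by every process.
    printf("By default, the part of the program that is not in the MPI region "
           "will be repeated by each process in sequence.\n");

    MPI_Init(&argc, &argv);

    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // who am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // how many of us are there?
    MPI_Get_processor_name(name, &name_len); // which host am I running on?

    printf("Processes are running in parallel for the MPI region.\n");
    printf("This is MPI process %d in an MPI world of %d processes, "
           "running on processor %s.\n", rank, size, name);

    // Processes can branch on their rank and work on different data.
    if (rank == 0) {
        printf("Processes could go through different paths, processing "
               "different parts of the data in different ways.\n");
        printf("Path 0 for data %d.\n", rank);
    } else {
        printf("Path 1 for data %d.\n", rank);
    }

    MPI_Finalize();

    // Code after MPI_Finalize is again executed by every process.
    printf("After the MPI region, processes execute the remaining part "
           "of the program sequentially.\n");
    printf("This was MPI process %d.\n", rank);
    return 0;
}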

Here is the example output:

By default, the part of the program that is not in the MPI region will be repeated by each process in sequence.
By default, the part of the program that is not in the MPI region will be repeated by each process in sequence.
By default, the part of the program that is not in the MPI region will be repeated by each process in sequence.
By default, the part of the program that is not in the MPI region will be repeated by each process in sequence.
Processes are running in parallel for the MPI region.
This is MPI process 0 in an MPI world of 4 processes, running on processor c456-103.stampede2.tacc.utexas.edu.
Processes could go through different paths, processing different parts of the data in different ways.
Path 0 for data 0.
Processes are running in parallel for the MPI region.
This is MPI process 1 in an MPI world of 4 processes, running on processor c456-103.stampede2.tacc.utexas.edu.
Path 1 for data 1.
Processes are running in parallel for the MPI region.
This is MPI process Processes are running in parallel for the MPI region.
This is MPI process 3 in an MPI world of 4 processes, running on processor c456-104.stampede2.tacc.utexas.edu.
Path 1 for data 3.
2 in an MPI world of 4 processes, running on processor c456-104.stampede2.tacc.utexas.edu.
Path 1 for data 2.
After the MPI region, processes execute the remaining part of the program sequentially.
This was MPI process 2.
After the MPI region, processes execute the remaining part of the program sequentially.
This was MPI process 3.
After the MPI region, processes execute the remaining part of the program sequentially.
This was MPI process 1.
After the MPI region, processes execute the remaining part of the program sequentially.
This was MPI process 0.

Python has a module called mpi4py that provides the MPI methods; here is a basic example:
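
A minimal sketch of what such an example could look like (the file name hello_mpi.py is just for illustration; run it with something like mpiexec -n 4 python3 hello_mpi.py):

# hello_mpi.py -- a basic mpi4py sketch
from mpi4py import MPI

comm = MPI.COMM_WORLD            # the default communicator
rank = comm.Get_rank()           # rank of this process
size = comm.Get_size()           # total number of processes
name = MPI.Get_processor_name()  # host this process is running on

print(f"This is MPI process {rank} in an MPI world of {size} processes, running on {name}.")

# The lowercase methods pass generic Python objects, e.g. a broadcast from rank 0:
message = {"greeting": "hello"} if rank == 0 else None
message = comm.bcast(message, root=0)
print(f"Process {rank} received {message}.")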

Collectives

Collectives in MPI are a collection of predefined communication patterns (a small sketch of Bcast and Reduce follows the list), such as:

  • Reduce: aggregates all the values owned by different processes into one value in the root process;
  • Bcast: makes every process have an identical copy of the data from the root process;
  • Gather: copies data owned by different processes into non-overlapping memory regions of the root process;
  • Scatter: copies different regions of data from the root process to different processes;
  • Scan: reduction with partial results.
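
For instance, a minimal sketch (not from the training) showing the calling convention of Bcast and Reduce:

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bcast: rank 0 fills in a value, every process ends up with a copy.
    int value = (rank == 0) ? 42 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // Reduce: sum one contribution per process into the root (rank 0).
    int contribution = rank, sum = 0;
    MPI_Reduce(&contribution, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Broadcast value %d; sum of all ranks is %d.\n", value, sum);

    MPI_Finalize();
    return 0;
}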

There are some variants of those collectives (a small Allreduce/Exscan sketch follows the list):

  • All-variant: performs the same operation but leaves the result on every process instead of only the root process, such as Allreduce, Allgather;
  • v-variant: lets different processes take part in the same operation with different amounts of data, such as Gatherv, Scatterv;
  • Ex-variant: Scan is inclusive (process i holds the partial reduction result for ranks 0~i), while Exscan does the scan exclusively (process i holds the partial reduction result for ranks 0~i-1);
  • I-variant: nonblocking function calls, such as Ibcast, Iallgather.
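
As a small illustration of the All- and Ex- variants (again a sketch, not the training code):

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int one = 1, total = 0, before_me = 0;

    // Allreduce: every process gets the full sum (here, the process count).
    MPI_Allreduce(&one, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    // Exscan: process i gets the sum over ranks 0..i-1 (undefined on rank 0).
    MPI_Exscan(&one, &before_me, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) before_me = 0;  // give rank 0 a well-defined value

    printf("Rank %d: total = %d, ranks before me = %d.\n", rank, total, before_me);

    MPI_Finalize();
    return 0;
}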

Here is an example that uses some of the collectives to solve a linear system of equations with the Gauss-Jordan elimination method. It is worth mentioning that if MPI_Allgather() is used instead of MPI_Gather(), the receive count is not the sum but the receive buffer length divided by the number of processes, i.e., the number of elements received from each process.
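
A rough sketch of this approach (not the exact training code; it assumes the matrix size is divisible by the number of processes, a block distribution of rows, and a system that needs no pivoting):

#include <cstdio>
#include <vector>
#include <mpi.h>

// Solve A x = b with Gauss-Jordan elimination; rows are distributed block-wise.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 8;                 // global problem size (toy value)
    const int nlocal = N / size;     // rows owned by this process
    const int row0 = rank * nlocal;  // first global row index owned here

    // Each process builds its block of a diagonally dominant system whose
    // exact solution is x_j = 1 for all j.
    std::vector<double> A(nlocal * N), b(nlocal);
    for (int i = 0; i < nlocal; ++i) {
        double rowsum = 0.0;
        for (int j = 0; j < N; ++j) {
            A[i * N + j] = (row0 + i == j) ? N + 1.0 : 1.0;
            rowsum += A[i * N + j];
        }
        b[i] = rowsum;               // so that x = (1,...,1) solves the system
    }

    // Gauss-Jordan: for every pivot k, the owner broadcasts the pivot row,
    // then every process eliminates column k from all of its other rows.
    std::vector<double> pivot(N + 1);    // pivot row plus its right-hand side
    for (int k = 0; k < N; ++k) {
        int owner = k / nlocal;
        if (rank == owner) {
            int i = k - row0;
            for (int j = 0; j < N; ++j) pivot[j] = A[i * N + j];
            pivot[N] = b[i];
        }
        MPI_Bcast(pivot.data(), N + 1, MPI_DOUBLE, owner, MPI_COMM_WORLD);

        for (int i = 0; i < nlocal; ++i) {
            if (row0 + i == k) continue;  // do not eliminate the pivot row itself
            double factor = A[i * N + k] / pivot[k];
            for (int j = 0; j < N; ++j) A[i * N + j] -= factor * pivot[j];
            b[i] -= factor * pivot[N];
        }
    }

    // The matrix is now diagonal: each local row i satisfies A[i][i] * x[i] = b[i].
    std::vector<double> xlocal(nlocal), x(N);
    for (int i = 0; i < nlocal; ++i)
        xlocal[i] = b[i] / A[i * N + (row0 + i)];

    // Allgather: note the receive count is nlocal (per process), not N.
    MPI_Allgather(xlocal.data(), nlocal, MPI_DOUBLE,
                  x.data(), nlocal, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("Solution:");
        for (int j = 0; j < N; ++j) printf(" %.2f", x[j]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}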

All the other collectives can be found in the MPI standard or in Dr. Eijkhout's book.

Point-to-Point Communication

MPI supports communication between any two user-specified processes through the send-receive protocol:

  • Send: copies a specified block of data in the buffer to a destination process; it has a synchronous variant Ssend and a non-blocking variant Isend;
  • Recv: waits for a specified amount of data from a source process; it is always blocking and sets a status object, and it also has a non-blocking variant Irecv.

Here is an example where each process gives a value to the process on the right. It is worth mentioning that each send-receive pair can carry a unique tag to distinguish this transaction from others.
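
A minimal sketch of this ring shift (assuming a periodic wrap-around, and using MPI_Sendrecv so the paired send and receive cannot deadlock; plain Send/Recv with an even/odd ordering would also work):

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;         // neighbor that receives my value
    int left  = (rank - 1 + size) % size;  // neighbor whose value I receive
    int tag = 17;                          // any tag, as long as both sides agree

    int myvalue = rank * rank;             // something to pass along
    int received = -1;
    MPI_Status status;

    MPI_Sendrecv(&myvalue, 1, MPI_INT, right, tag,
                 &received, 1, MPI_INT, left, tag,
                 MPI_COMM_WORLD, &status);

    printf("Process %d sent %d to %d and received %d from %d.\n",
           rank, myvalue, right, received, left);

    MPI_Finalize();
    return 0;
}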

There will be examples of non-blocking point-to-point communication in the next post.

One-Sided Communication

As an alternative to the two communication modes introduced above, one-sided communication makes a one-time effort to set up windows of memory, which can then be accessed by other processes at any time. Apart from the window setup and MPI_Put()/MPI_Get() calls (see the sketch below), there are MPI_Accumulate(), MPI_Get_accumulate(), and so on.
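
A minimal sketch of the fence-based pattern (assuming each process exposes a single integer; not the exact training example):

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each process exposes one integer through a window.
    int window_buffer = rank * 100;
    MPI_Win win;
    MPI_Win_create(&window_buffer, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // Fences open and close an epoch during which remote accesses are allowed.
    MPI_Win_fence(0, win);

    // Every process reads the value exposed by its right neighbor.
    int right = (rank + 1) % size;
    int fetched = -1;
    MPI_Get(&fetched, 1, MPI_INT, right, 0, 1, MPI_INT, win);

    MPI_Win_fence(0, win);  // the Get is only guaranteed complete after this fence

    printf("Process %d fetched %d from process %d.\n", rank, fetched, right);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}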

Credits

  • TACC - Texas Advanced Computing Center
  • Victor Eijkhout - Thank you for your wonderful lecture.