Intel® MPI Library Reference Manual for Linux* OS
Intel® MPI Library provides extra functionality to enable fault tolerance support in MPI applications. The MPI standard does not define the behavior of an MPI implementation if one or more processes of an MPI application terminate abnormally. By default, Intel® MPI Library aborts the whole application if any process stops.
To change this behavior, set the environment variable I_MPI_FAULT_CONTINUE to on. For example:
$ mpirun -env I_MPI_FAULT_CONTINUE on -n 2 ./test
An application can continue working after an MPI process fails if it meets the following requirements:
The application sets the error handler MPI_ERRORS_RETURN on the communicator MPI_COMM_WORLD (all new communicators inherit the error handler from it).
The application uses the master-slave model. In this case, the application stops only if the master finishes or does not respond.
The application uses only point-to-point communication between the master and the slaves. It does not use inter-slave communication or MPI collective operations.
The application handles the MPI error code that a point-to-point operation with a particular failed slave rank returns, and avoids further communication with that rank. The operation can be a blocking or non-blocking send, receive, probe, or test.
Any communication operation can be used on a subset communicator. If an error occurs in a collective operation, any further communication inside that communicator is prohibited.
If the master fails, the job stops.
Fault Tolerance functionality is not available for spawned processes.