Fault Tolerance

Intel® MPI Library provides additional functionality to support fault tolerance in MPI applications. The MPI standard does not define the behavior of an MPI implementation when one or more processes of an MPI application terminate abnormally. By default, Intel® MPI Library aborts the whole application if any process stops.

To change this behavior, set the environment variable I_MPI_FAULT_CONTINUE to on. For example:

$ mpirun -env I_MPI_FAULT_CONTINUE on -n 2 ./test
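As an alternative to the -env option, the variable can be exported in the launching shell before mpirun is called. The sketch below assumes that the launcher propagates exported variables to the MPI processes, which should be verified for your launcher configuration. The guard around the launch is added only for illustration, so that the snippet does nothing when mpirun or the example binary ./test is missing.

```shell
# Assumption: variables exported in the launching shell are
# propagated to the MPI processes by the launcher.
export I_MPI_FAULT_CONTINUE=on

# Launch only if mpirun and the example binary ./test are available.
if command -v mpirun >/dev/null 2>&1 && [ -x ./test ]; then
    mpirun -n 2 ./test
fi
```

Exporting the variable applies it to every later launch from the same shell session, while -env limits it to a single mpirun invocation.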

An application can continue working after one of its MPI processes encounters an issue if the issue meets the following requirements: