Intel® MPI Library Reference Manual for Linux* OS
The Intel® MPI Library recognizes the following events to detect hardware issues:
IBV_EVENT_QP_FATAL        Error occurred on a QP and it transitioned to error state
IBV_EVENT_QP_REQ_ERR      Invalid request local work queue error
IBV_EVENT_QP_ACCESS_ERR   Local access violation error
IBV_EVENT_PATH_MIG_ERR    A connection failed to migrate to the alternate path
IBV_EVENT_CQ_ERR          CQ is in error (CQ overrun)
IBV_EVENT_SRQ_ERR         Error occurred on an SRQ
IBV_EVENT_PORT_ERR        Link became unavailable on a port
IBV_EVENT_DEVICE_FATAL    CA is in FATAL state
If one of these events is detected, the Intel® MPI Library stops using the affected port or the whole adapter for communications. If the application is running in multi-rail mode, communications continue through another available port or adapter. If no healthy port or adapter is available, the application is aborted.
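The failover behavior described above can be illustrated with a minimal sketch. It is not the library's implementation: the rail table, the struct rail type, and the functions rail_mark_failed and rail_pick are hypothetical names introduced only for this illustration, and the sketch assumes that each port or adapter is tracked as one rail with a single health flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical rail table entry: one per port or adapter the job may use. */
struct rail {
    const char *name;   /* illustrative label only */
    bool healthy;       /* false once an error event was reported for this rail */
};

/* Stop using the rail that reported one of the error events. */
static void rail_mark_failed(struct rail *rails, size_t n, size_t idx)
{
    if (idx < n)
        rails[idx].healthy = false;
}

/* Return the index of the first healthy rail, or -1 if none is left.
 * In this sketch, -1 corresponds to the case where the application is aborted. */
static int rail_pick(const struct rail *rails, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (rails[i].healthy)
            return (int)i;
    }
    return -1;
}
```

With two rails, marking the first one as failed makes rail_pick return the second. Marking both as failed makes it return -1, which models the abort condition.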
The Intel® MPI Library also recognizes the following event:
IBV_EVENT_PORT_ACTIVE Link became active on a port
This event indicates that the port can be used again. The library re-enables the port for communications.
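The way a port leaves and re-enters service can be pictured as a two-state toggle driven by events. The sketch below is an illustration under stated assumptions, not the library's code: enum port_event, struct port_state, and port_on_event are hypothetical names. Real code would receive values of enum ibv_event_type from <infiniband/verbs.h> instead of the local enum used here.

```c
#include <stdbool.h>

/* Hypothetical event codes for this sketch only. They stand in for
 * IBV_EVENT_PORT_ERR and IBV_EVENT_PORT_ACTIVE from <infiniband/verbs.h>. */
enum port_event {
    PORT_EVENT_ERR,     /* link became unavailable on the port */
    PORT_EVENT_ACTIVE   /* link became active on the port */
};

/* Hypothetical per-port bookkeeping. */
struct port_state {
    bool enabled;       /* true while the port may carry traffic */
};

/* Update the port state when an event arrives. */
static void port_on_event(struct port_state *p, enum port_event ev)
{
    switch (ev) {
    case PORT_EVENT_ERR:
        p->enabled = false;   /* stop using the port */
        break;
    case PORT_EVENT_ACTIVE:
        p->enabled = true;    /* port can be used again */
        break;
    }
}
```

A port that starts enabled becomes disabled after PORT_EVENT_ERR and enabled again after PORT_EVENT_ACTIVE.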