Intel® MPI Library User's Guide for Linux* OS
Environmental errors may occur when there are problems with the system environment, such as mandatory system services not running or shared resources being unavailable.
When you encounter environmental errors, check the environment. For example, verify the current state of important services.
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
or:
CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
The OFED* stack is not loaded. The application was run over the dapl fabric. In this case, the MPI application may hang.
See the OFED* documentation for details about OFED* stack usage.
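For example, a quick way to verify whether the OFED* stack is loaded is to check the openibd service, assuming your installation manages the stack through that init script (the exact name may differ on your system):
$ service openibd status
If the service is stopped, start it and rerun the application.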
[0] MPI startup(): Multi-threaded optimized library
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] MPI startup(): dapl data transfer mode
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[0] MPI startup(): dapl data transfer mode
The Subnet Manager (opensmd*) service is not running. The application was run over the dapl fabric. In this case, the MPI application may hang. The output shown above appears when you set I_MPI_DEBUG=2.
Check the current status of the service. See the OFED* documentation for details on opensmd* usage.
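For example, on the node where the Subnet Manager is expected to run, a check similar to the following may help (the exact command depends on your init system):
$ service opensmd status
If the service is not running, start it as described in the OFED* documentation.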
node01-mic0:MCM:2b66:e56a0b40: 2379 us(2379 us): scif_connect() to port 68, failed with error Connection refused
node01-mic0:MCM:2b66:e56a0b40: 2494 us(115 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c at line 761: 0
internal ABORT - process 0
The mpxyd daemon (CCL-proxy) is not running. The application was run over the dapl fabric. In this case, the MPI application may hang.
Check the current status of the service. See the DAPL* documentation for details on mpxyd usage.
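For example, assuming the daemon is registered as the mpxyd service on the host where the proxy is expected to run, a check could look like:
$ service mpxyd status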
node01-mic0:SCM:2b94:14227b40: 201 us(201 us): open_hca: ibv_get_device_list() failed
node01-mic0:SCM:2b94:14227b40: 222 us(222 us): open_hca: ibv_get_device_list() failed
node01-mic0:CMA:2b94:14227b40: 570 us(570 us): open_hca: getaddr_netdev ERROR:No such device. Is ib0 configured?
...
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(784).................:
MPID_Init(1326).......................: channel initialization failed
MPIDI_CH3_Init(141)...................:
dapl_rc_setup_all_connections_20(1386): generic failure with errno = 872609295
getConnInfoKVS(849)...................: PMI_KVS_Get failed
The ofed-mic service is not running. The application was run over the dapl fabric. In this case, the MPI application may hang.
Check the current status of the service. See the Intel® MPSS documentation for details on ofed-mic usage.
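For example, assuming the service is registered under the name ofed-mic on the host, you can check it with:
$ service ofed-mic status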
pmi_proxy: line 0: exec: pmi_proxy: not found
The Intel® MPI Library runtime scripts are not available. A possible reason is that the shared space cannot be reached. In this case, the MPI application may hang.
Check if the shared path is available across all the nodes.
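For example, assuming /opt/intel/impi is the shared installation path (a placeholder; substitute your actual path) and node01 is one of the nodes, a quick per-node check could look like:
$ ssh node01 ls /opt/intel/impi
$ ssh node01 which pmi_proxy
If either command fails on a node, the runtime scripts are not reachable from that node over a non-interactive connection.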
[0] DAPL startup: RLIMIT_MEMLOCK too small
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled
or:
node01:SCM:1c66:3f226b40: 6815816 us: DAPL ERR reg_mr Cannot allocate memory
The system limits are set incorrectly: the maximum locked memory size is too small. The application was run over the dapl fabric.
Check the system limits and update them if necessary. The following example shows the correct system limits configuration:
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256273
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
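If your output shows a small value for max locked memory, one common way to raise it, assuming the limits are managed through /etc/security/limits.conf (consult your distribution's documentation), is to add lines such as:
* soft memlock unlimited
* hard memlock unlimited
Then open a new login session so the updated limits take effect.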
The authenticity of host 'node01 (<node01_ip_address>)' can't be established.
Are you sure you want to continue connecting (yes/no)?
This message may repeat continuously until manual interruption.
The MPI remote node access mechanism is SSH. SSH is not configured properly: unexpected messages appear in the standard input (stdin).
Check the SSH connection to the problem node.
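For example, one possible way to add the host key in advance and then verify the connection (node01 stands for the problem node; check your site's security policy before accepting host keys automatically):
$ ssh-keyscan node01 >> ~/.ssh/known_hosts
$ ssh node01 hostname
The second command should complete without any interactive prompt.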
Password:
The MPI remote node access mechanism is SSH. SSH is not configured for password-less login. In this case, the MPI application may hang.
Check the SSH settings: password-less authentication with public keys should be enabled and configured.
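For example, a minimal sketch of setting up and verifying public-key authentication (run on the launching node; node01 stands for a compute node) is:
$ ssh-keygen -t rsa
$ ssh-copy-id node01
$ ssh node01 hostname
The last command should print the remote host name without asking for a password.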