Environment Problems

Environmental errors occur when there are problems with the system environment: for example, mandatory system services are not running, or shared resources are unavailable.

When you encounter environmental errors, check the environment. For example, verify the current state of important services.
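
One way to do this is to query a service's status directly on the node in question. The host name node01 and <service_name> below are placeholders for your actual host and service names:

$ ssh node01 service <service_name> status

On systems managed by systemd, systemctl status <service_name> provides the same information.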

Example 1

Symptom/Error Message

librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list

or:

CMA: unable to get RDMA device list
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4

Cause

The OFED* stack is not loaded. The application was run over the dapl fabric. In such cases, the MPI application may hang.

Solution

See the OFED* documentation for details about OFED* stack usage.
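
One way to verify that the stack is loaded is to list the InfiniBand* kernel modules and RDMA devices; the openibd service name below is typical for OFED* installations and may differ on your system:

$ lsmod | grep -i ib_
$ ibv_devinfo
$ service openibd status

If ibv_devinfo reports no devices, start or restart the OFED* stack as described in its documentation.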

Example 2

Symptom/Error Message

[0] MPI startup(): Multi-threaded optimized library
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-mlx4_0-1
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] MPI startup(): dapl data transfer mode
[0] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[0] MPI startup(): dapl data transfer mode

In such cases, the MPI application may hang.

Cause

The Subnet Manager (opensmd*) service is not running. The application was run over the dapl fabric. The output shown above appears when you set I_MPI_DEBUG=2.

Solution

Check the current status of the service. See the OFED* documentation for details on opensmd* usage.
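
For example, assuming the Subnet Manager is installed as the opensmd service, you can check and start it as follows:

$ service opensmd status
$ service opensmd start

Typically, the Subnet Manager runs on a single node of the InfiniBand* fabric.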

Example 3

Symptom/Error Message

node01-mic0:MCM:2b66:e56a0b40: 2379 us(2379 us): scif_connect() to port 68, failed
with error Connection refused
node01-mic0:MCM:2b66:e56a0b40: 2494 us(115 us): open_hca: SCIF init ERR for mlx4_0
Assertion failed in file ../../src/mpid/ch3/channels/nemesis/netmod/dapl/dapls_module_init.c
at line 761: 0
internal ABORT - process 0

Cause

The mpxyd daemon (CCL-proxy) is not running. The application was run over the dapl fabric. In such cases, the MPI application may hang.

Solution

Check the current status of the service. See the DAPL* documentation for details on mpxyd usage.
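
For example, assuming the CCL-proxy is installed as the mpxyd service, you can check and start it as follows:

$ service mpxyd status
$ service mpxyd start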

Example 4

Symptom/Error Message

node01-mic0:SCM:2b94:14227b40: 201 us(201 us): open_hca: ibv_get_device_list() failed
node01-mic0:SCM:2b94:14227b40: 222 us(222 us): open_hca: ibv_get_device_list() failed
node01-mic0:CMA:2b94:14227b40: 570 us(570 us): open_hca: getaddr_netdev ERROR:No
such device. Is ib0 configured?
...
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(784).................:
MPID_Init(1326).......................: channel initialization failed
MPIDI_CH3_Init(141)...................:
dapl_rc_setup_all_connections_20(1386): generic failure with errno = 872609295
getConnInfoKVS(849)...................: PMI_KVS_Get failed

Cause

The ofed-mic service is not running. The application was run over the dapl fabric. In such cases, the MPI application may hang.

Solution

Check the current status of the service. See the Intel® MPSS documentation for details on ofed-mic usage.
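
For example, assuming your Intel® MPSS installation provides ofed-mic as a regular system service on the host, you can check and restart it as follows:

$ service ofed-mic status
$ service ofed-mic restart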

Example 5

Symptom/Error Message

pmi_proxy: line 0: exec: pmi_proxy: not found

Cause

The Intel® MPI Library runtime scripts are not available. A possible reason is that the shared space where the library is installed cannot be reached. In such cases, the MPI application may hang.

Solution

Check that the shared path is accessible on all the nodes.
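
For example, you can verify that the pmi_proxy executable is reachable from every node with a loop like the following; the node names and <installdir> are placeholders for your actual host names and Intel® MPI Library installation path:

$ for host in node01 node02 node03; do ssh $host ls <installdir>/intel64/bin/pmi_proxy; done

Every node should list the pmi_proxy binary; a "No such file or directory" response indicates that the shared space is not mounted on that node.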

Example 6

Symptom/Error Message

[0] DAPL startup: RLIMIT_MEMLOCK too small
[0] MPI startup(): dapl fabric is not available and fallback fabric is not enabled

or:

node01:SCM:1c66:3f226b40: 6815816 us: DAPL ERR reg_mr Cannot allocate memory

Cause

Incorrect system limits: the maximum locked memory size is too small. The application was run over the dapl fabric.

Solution

Check the system limits and update them if necessary. The following example shows the correct system limits configuration:

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256273
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
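
If the max locked memory value is too small, you can typically raise it in /etc/security/limits.conf and then log in again; the entries below are a sketch, so adjust the values to your site policy:

* soft memlock unlimited
* hard memlock unlimited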

Example 7

Symptom/Error Message

The authenticity of host 'node01 (<node01_ip_address>)' can't be established.
Are you sure you want to continue connecting (yes/no)?

This message may repeat continuously until manual interruption.

Cause

The MPI remote node access mechanism is SSH, and SSH is not configured properly: the host authenticity prompt requires interactive input on the standard input (stdin), which the MPI process launcher does not provide.

Solution

Check the SSH connection to the problem node.
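
For example, you can test the connection manually and pre-populate the known_hosts file so that the prompt no longer appears; node01 is a placeholder for the problem node:

$ ssh node01 hostname
$ ssh-keyscan node01 >> ~/.ssh/known_hosts

Alternatively, connect to the node interactively once and answer yes to the prompt, so that the host key is stored.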

Example 8

Symptom/Error Message

Password:

Cause

The MPI remote node access mechanism is SSH, and SSH is not password-less. In such cases, the MPI application may hang.

Solution

Check the SSH settings: password-less authentication with public keys should be enabled and configured.
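
As a minimal sketch, password-less SSH with public keys can be set up as follows; node01 is a placeholder, and your cluster may already provide a dedicated tool for this purpose:

$ ssh-keygen -t rsa
$ ssh-copy-id node01
$ ssh node01 hostname

The last command should complete without prompting for a password.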