FAQ:
Running MPI jobs

Table of contents:

  1. What pre-requisites are necessary for running an Open MPI job?
  2. Do I need a common filesystem on all my nodes?
  3. How do I add Open MPI to my PATH and LD_LIBRARY_PATH?
  4. What if I can't modify my PATH and/or LD_LIBRARY_PATH?
  5. How do I launch Open MPI parallel jobs?
  6. How do I run a simple SPMD MPI job?
  7. How do I run an MPMD MPI job?
  8. I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why?
  9. When I build Open MPI with the Intel compilers, I get warnings about "orted" or my MPI application not finding libimf.so. What do I do?
  10. When I build Open MPI with the PGI compilers, I get warnings about "orted" or my MPI application not finding libpgc.so. What do I do?
  11. When I build Open MPI with the Pathscale compilers, I get warnings about "orted" or my MPI application not finding libmv.so. What do I do?
  12. Can I run non-MPI programs with mpirun / mpiexec?
  13. Can I run GUI applications with Open MPI?
  14. Can I run ncurses-based / curses-based / applications with funky input schemes with Open MPI?
  15. What other options are available to mpirun?
  16. How do I use the --host option to mpirun?
  17. How do I control how my processes are scheduled across nodes?
  18. I'm not using a hostfile. How are slots calculated?
  19. Can I run multiple parallel processes on a uniprocessor machine?
  20. Can I oversubscribe nodes (run more processes than processors)?
  21. Can I force Aggressive or Degraded performance modes?
  22. How do I run with the TotalView parallel debugger?
  23. How do I run with the DDT parallel debugger?
  24. What launchers are available?
  25. How do I specify to the rsh launcher to use rsh or ssh?
  26. How do I run with the SLURM and PBS/Torque launchers?
  27. How do I run with the SGE launcher?
  28. Can I suspend and resume my job?
  29. Does the SGE tight integration support the -notify flag to qsub?
  30. How do I run with LoadLeveler?
  31. How do I load libmpi at runtime?
  32. What MPI environmental variables exist?
  33. How do I get my MPI job to wireup its MPI connections right away?


1. What pre-requisites are necessary for running an Open MPI job?

In general, Open MPI requires that its executables are in your PATH on every node that you will run on. If Open MPI was compiled as dynamic libraries (which is the default), the directory where its libraries are located must also be in your LD_LIBRARY_PATH on every node.

Specifically, if Open MPI was installed with a prefix of /opt/openmpi, then the following should be in your PATH and LD_LIBRARY_PATH:

PATH:            /opt/openmpi/bin
LD_LIBRARY_PATH: /opt/openmpi/lib

Depending on your environment, you may need to set these values in your shell startup files (e.g., .profile, .cshrc, etc.).

NOTE: there are exceptions to this rule -- notably the --prefix option to mpirun.

See this FAQ entry for more details on how to add Open MPI to your PATH and LD_LIBRARY_PATH.

Additionally, Open MPI requires that jobs can be started on remote nodes without any input from the keyboard. For example, if using rsh or ssh as the remote agent, you must have your environment set up to allow execution on remote nodes without entering a password or passphrase.
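
One way to confirm the "no keyboard input" requirement is to attempt a login that is forbidden from prompting. The following is a minimal sketch, assuming ssh is the remote agent; check_remote_login is a hypothetical helper name and the host names in the usage line are placeholders. BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# check_remote_login is a hypothetical helper, not part of Open MPI.
# It returns 0 only if every host accepts a login without keyboard input.
check_remote_login() {
    rc=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "$host: ok"
        else
            echo "$host: cannot log in without a prompt"
            rc=1
        fi
    done
    return "$rc"
}
```

It would be invoked as, for example, "check_remote_login node1.example.com node2.example.com" before the first mpirun across those hosts.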


2. Do I need a common filesystem on all my nodes?

No, but it certainly makes life easier if you do.

A common environment to run Open MPI is in a "Beowulf"-class or similar cluster (e.g., a bunch of 1U servers in a bunch of racks). Simply stated, Open MPI can run on a group of servers or workstations connected by a network. As mentioned above, there are several prerequisites, however (for example, you typically must have an account on all the machines, and you must be able to rsh or ssh between the nodes without using a password).

Regardless of whether Open MPI is installed on a shared / networked filesystem or independently on each node, it is usually easiest if Open MPI is available in the same filesystem location on every node. For example, if you install Open MPI to /opt/openmpi-1.4.2 on one node, ensure that it is available in /opt/openmpi-1.4.2 on all nodes.

This FAQ entry has a bunch more information about installation locations for Open MPI.


3. How do I add Open MPI to my PATH and LD_LIBRARY_PATH?

Open MPI must be able to find its executables in your PATH on every node (if Open MPI was compiled as dynamic libraries, then its library path must appear in LD_LIBRARY_PATH as well). As such, your configuration/initialization files need to add Open MPI to your PATH / LD_LIBRARY_PATH properly.

How to do this may be highly dependent upon your local configuration, so you may need to consult with your local system administrator. Some system administrators take care of these details for you, some don't. YMMV. Some common examples are included below, however.

You must have at least a minimal understanding of how your shell works to get Open MPI into your PATH / LD_LIBRARY_PATH properly. Note that Open MPI must be added to your PATH and LD_LIBRARY_PATH in two situations: (1) when you log in to an interactive shell, and (2) when you log in to non-interactive shells on remote nodes.

  • If (1) is not configured properly, executables like mpicc will not be found, and it is typically obvious what is wrong. The Open MPI executable directory can manually be added to the PATH, or the user's startup files can be modified such that the Open MPI executables are added to the PATH every login. This latter approach is preferred.

    All shells have some kind of script file that is executed at login time to set things like PATH and LD_LIBRARY_PATH and perform other environmental setup tasks. This startup file is the one that needs to be edited to add Open MPI to the PATH and LD_LIBRARY_PATH. Consult the manual page for your shell for specific details (some shells are picky about the permissions of the startup file, for example). The table below lists some common shells and the startup files that they read/execute upon login:

    Shell                                   Interactive login startup file
    sh (Bourne shell, or bash named "sh")   .profile
    csh                                     .cshrc followed by .login
    tcsh                                    .tcshrc if it exists, .cshrc if it does not, followed by .login
    bash                                    .bash_profile if it exists, or .bash_login if it exists, or .profile if it exists (in that order). Note that some Linux distributions automatically come with .bash_profile scripts for users that automatically execute .bashrc as well. Consult the bash man page for more information.

  • If (2) is not configured properly, executables like mpirun will not function properly, and it can be somewhat confusing to figure out (particularly for bash users).

    The startup files in question here are the ones that are automatically executed for a non-interactive login on a remote node (e.g., "rsh othernode ps"). Note that not all shells support this, and that some shells use different files for this than listed in (1). Some shells will supersede (2) with (1). That is, fulfilling (2) may automatically fulfill (1). The following table lists some common shells and the startup file that is automatically executed, either by Open MPI or by the shell itself:

    Shell                             Non-interactive login startup file
    sh (Bourne or bash named "sh")    This shell does not execute any file automatically, so Open MPI will execute the .profile script before invoking Open MPI executables on remote nodes
    csh                               .cshrc
    tcsh                              .tcshrc if it exists, or .cshrc if it does not
    bash                              .bashrc if it exists
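
For Bourne-style shells, the lines added to the startup file might look like the sketch below. It assumes the /opt/openmpi prefix used earlier in this FAQ; OMPI_PREFIX and prepend_dir are illustrative names, not Open MPI settings. The helper avoids adding the same directory twice when the file is read more than once.

```shell
# Sketch for a Bourne-style startup file (.profile, .bashrc).
# OMPI_PREFIX and prepend_dir are illustrative names, not Open MPI settings.
OMPI_PREFIX="${OMPI_PREFIX:-/opt/openmpi}"

# Print "dir:list", or just "list" when dir is already present,
# so that repeated logins do not grow the variable.
prepend_dir() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)        printf '%s\n' "$1${2:+:$2}" ;;
    esac
}

PATH=$(prepend_dir "$OMPI_PREFIX/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_dir "$OMPI_PREFIX/lib" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

Some distributions ship a .bashrc that returns early for non-interactive shells; on such systems these lines would need to appear before that check for case (2) to be covered.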


4. What if I can't modify my PATH and/or LD_LIBRARY_PATH?

There are some situations where you cannot modify the PATH or LD_LIBRARY_PATH -- e.g., some ISV applications prefer to hide all parallelism from the user, and therefore do not want to make the user modify their shell startup files. Another case is where you want a single user to be able to launch multiple MPI jobs simultaneously, each with a different MPI implementation. Hence, setting shell startup files to point to one MPI implementation would be problematic.

In such cases, you have two options:

  1. Use mpirun's --prefix command line option (described below).
  2. Modify the wrapper compilers to include directives to include run-time search locations for the Open MPI libraries (see this FAQ entry)

mpirun's --prefix command line option takes as an argument the top-level directory where Open MPI was installed. While relative directory names are possible, they can become ambiguous depending on the job launcher used; using absolute directory names is strongly recommended.

For example, say that Open MPI was installed into /opt/openmpi-1.4.2. You would use the --prefix option like this:

shell$ mpirun --prefix /opt/openmpi-1.4.2 -np 4 a.out

This will prefix the PATH and LD_LIBRARY_PATH on both the local and remote hosts with /opt/openmpi-1.4.2/bin and /opt/openmpi-1.4.2/lib, respectively. This is usually unnecessary when using resource managers to launch jobs (e.g., SLURM, Torque, etc.) because they tend to copy the entire local environment -- including the PATH and LD_LIBRARY_PATH -- to remote nodes before execution. As such, if PATH and LD_LIBRARY_PATH are set properly on the local node, the resource manager will automatically propagate those values out to remote nodes. The --prefix option is therefore usually most useful in rsh or ssh-based environments (or similar).

Beginning with the 1.2 series, it is possible to make this the default behavior by passing to configure the flag --enable-mpirun-prefix-by-default. This will make mpirun behave exactly the same as "mpirun --prefix $prefix ...", where $prefix is the value given to --prefix in configure.

Finally, note that specifying the absolute pathname to mpirun is equivalent to using the --prefix argument. For example, the following is equivalent to the above command line that uses --prefix:

shell$ /opt/openmpi-1.4.2/bin/mpirun -np 4 a.out
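
The relationship between the two forms comes from the install layout: mpirun lives in the bin directory directly under the installation prefix. The sketch below only manipulates the path string to show that relationship; prefix_of is a hypothetical helper, and this is not a description of how mpirun itself computes the value.

```shell
# prefix_of is a hypothetical helper: it strips the trailing /bin/mpirun
# from an absolute path to recover the installation prefix.
prefix_of() {
    dirname "$(dirname "$1")"
}

prefix_of /opt/openmpi-1.4.2/bin/mpirun    # prints /opt/openmpi-1.4.2
```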


5. How do I launch Open MPI parallel jobs?

Similar to many MPI implementations, Open MPI provides the commands mpirun and mpiexec to launch MPI jobs. Several of the questions in this FAQ category deal with using these commands.

Note, however, that these commands are identical. Specifically, they are symbolic links to a common back-end launcher command named orterun (Open MPI's run-time environment interaction layer is named the Open Run-Time Environment, or ORTE -- hence orterun).

As such, the rest of this FAQ usually refers only to mpirun, even though the same discussions also apply to mpiexec and orterun (because they are all, in fact, the same command).


6. How do I run a simple SPMD MPI job?

Open MPI provides both mpirun and mpiexec commands. A simple way to start a single program, multiple data (SPMD) application in parallel is:

shell$ mpirun -np 4 my_parallel_application

This starts a four-process parallel application, running four copies of the executable named my_parallel_application.

The rsh starter component accepts the --hostfile (also known as --machinefile) option to indicate which hosts to start the processes on:

shell$ mpirun --hostfile my_hostfile -np 4 my_parallel_application

The hostfile my_hostfile is a text file with hosts specified, one per line. Each host can also specify a default and a maximum number of slots to be used on that host (i.e., the number of available processors on that host). Comments are also supported. For example:

# This is an example hostfile.  Comments begin with #
#
# The following node is a single processor machine:
foo.example.com

# The following node is a dual-processor machine:
bar.example.com slots=2

# The following node is a quad-processor machine, and we absolutely
# want to disallow over-subscribing it:
yow.example.com slots=4 max-slots=4

slots and max-slots are discussed more in this FAQ entry.
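
To see how many default slots a hostfile offers in total, you can sum its slots values. The following is a minimal sketch: count_slots is a hypothetical helper (not an Open MPI tool), it handles only the simple syntax shown above, and it counts a host with no slots keyword as 1.

```shell
# count_slots is a hypothetical helper, not an Open MPI tool.
# It sums the default slots in a hostfile; a host with no "slots="
# keyword counts as 1. Comment lines and blank lines are ignored.
count_slots() {
    awk '
        /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
        {
            n = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=/) { split($i, kv, "="); n = kv[2] }
            total += n
        }
        END { print total + 0 }
    ' "$1"
}

cat > my_hostfile <<'EOF'
# This is an example hostfile.  Comments begin with #
foo.example.com
bar.example.com slots=2
yow.example.com slots=4 max-slots=4
EOF

count_slots my_hostfile    # prints 7
```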

Note, however, that not all environments require a hostfile. For example, Open MPI will automatically detect when it is running in batch / scheduled environments (such as SGE, PBS/Torque, SLURM, and LoadLeveler) and use host information provided by those systems (i.e., it will ignore any provided hostfiles).

Also note that if using a launcher that uses a hostfile and no hostfile is specified, all processes are launched on the local host.


7. How do I run an MPMD MPI job?

Both the mpirun and mpiexec commands support multiple program, multiple data (MPMD) style launches, either from the command line or from a file. For example:

shell$ mpirun -np 2 a.out : -np 2 b.out

This will launch a single parallel application, but the first two processes will be instances of the a.out executable, and the second two processes will be instances of the b.out executable. In MPI terms, this will be a single MPI_COMM_WORLD, but the a.out processes will be ranks 0 and 1 in MPI_COMM_WORLD, while the b.out processes will be ranks 2 and 3 in MPI_COMM_WORLD.

mpirun (and mpiexec) can also accept a parallel application specified in a file instead of on the command line. For example:

shell$ mpirun --app my_appfile

where the file my_appfile contains the following:

# Comments are supported; comments begin with #
# Application context files specify each sub-application in the
# parallel job, one per line.  The first sub-application is the 2
# a.out processes:
-np 2 a.out
# The second sub-application is the 2 b.out processes:
-np 2 b.out

This will result in the same behavior as running a.out and b.out from the command line.
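
The rank numbering described above (ranks are handed out to each sub-application in the order listed) can be sketched by walking through an appfile. appfile_ranks is a hypothetical helper that understands only lines of the form "-np N executable"; it is an illustration, not how mpirun parses appfiles.

```shell
# appfile_ranks is a hypothetical helper that only understands lines of
# the form "-np N executable"; it is not how mpirun parses appfiles.
appfile_ranks() {
    awk '
        /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
        $1 == "-np" {
            printf "%s: ranks %d-%d\n", $3, first, first + $2 - 1
            first += $2
        }
    ' "$1"
}

cat > my_appfile <<'EOF'
# The first sub-application is the 2 a.out processes:
-np 2 a.out
# The second sub-application is the 2 b.out processes:
-np 2 b.out
EOF

appfile_ranks my_appfile
# a.out: ranks 0-1
# b.out: ranks 2-3
```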

Note that mpirun and mpiexec are identical in command-line options and behavior; using the above command lines with mpiexec instead of mpirun will result in the same behavior.


8. I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. Why?

If you can run ompi_info and possibly even launch MPI processes locally, but fail to launch MPI processes on remote hosts, it is likely that you do not have your PATH and/or LD_LIBRARY_PATH set up properly on the remote nodes.

Specifically, the Open MPI commands usually run properly even if LD_LIBRARY_PATH is not set properly because they encode the Open MPI library location in their executables and search there by default. Hence, running ompi_info (and friends) usually works, even in some improperly set up environments.

However, Open MPI's wrapper compilers do not encode the Open MPI library locations in MPI executables by default (the wrappers only specify a bare minimum of flags necessary to create MPI executables; we consider any flags beyond this bare minimum set a local policy decision). Hence, attempting to launch MPI executables in environments where LD_LIBRARY_PATH is either not set or was set improperly may result in messages about libmpi.so not being found.

You can change Open MPI's wrapper compiler behavior to specify the run-time location of Open MPI's libraries, if you wish.

Depending on how Open MPI was configured and/or invoked, it may even be possible to run MPI applications in environments where PATH and/or LD_LIBRARY_PATH is not set, or is set improperly. This can be desirable for environments where multiple MPI implementations are installed, such as multiple versions of Open MPI.


9. When I build Open MPI with the Intel compilers, I get warnings about "orted" or my MPI application not finding libimf.so. What do I do?

The problem is usually because the Intel libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but it failed because one of orted's dependent libraries could not be found. This particular library, libimf.so, is an Intel compiler library. As such, it is likely that the user did not set up the Intel compiler library in their environment properly on this node.

Double check that you have set up the Intel compiler environment on the target node, for both interactive and non-interactive logins. It is a common error to ensure that the Intel compiler environment is set up properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com

Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit

shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
shell$

The above example shows that running a trivial C program compiled by the Intel compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the Intel compiler environment is set up properly for non-interactive logins.
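
To list every missing library at once rather than discovering them one failure at a time, you can run ldd (the standard Linux tool for listing shared library dependencies) through the same non-interactive login. The following is a minimal sketch; missing_libs is a hypothetical helper, and the sample lines fed to it imitate ldd output rather than being captured from a real node.

```shell
# missing_libs is a hypothetical helper: it reads ldd-style output on
# stdin and prints the names of libraries reported as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Intended use (assumes ssh access and that ldd exists on node1):
#   ssh node1.example.com ldd $HOME/mpi_hello | missing_libs
# Self-contained demonstration on sample text:
printf '%s\n' \
    '    libimf.so => not found' \
    '    libc.so.6 => /lib64/libc.so.6 (0x00007f0000000000)' \
    | missing_libs    # prints libimf.so
```

The same check applies to other compiler run-time libraries; only the library names differ.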


10. When I build Open MPI with the PGI compilers, I get warnings about "orted" or my MPI application not finding libpgc.so. What do I do?

The problem is usually because the PGI libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but it failed because one of orted's dependent libraries could not be found. This particular library, libpgc.so, is a PGI compiler library. As such, it is likely that the user did not set up the PGI compiler library in their environment properly on this node.

Double check that you have set up the PGI compiler environment on the target node, for both interactive and non-interactive logins. It is a common error to ensure that the PGI compiler environment is set up properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com

Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit

shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libpgc.so: cannot open shared object file: No such file or directory
shell$

The above example shows that running a trivial C program compiled by the PGI compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the PGI compiler environment is set up properly for non-interactive logins.


11. When I build Open MPI with the Pathscale compilers, I get warnings about "orted" or my MPI application not finding libmv.so. What do I do?

The problem is usually because the Pathscale libraries cannot be found on the node where Open MPI is attempting to launch an MPI executable. For example:

shell$ mpirun -np 1 --host node1.example.com mpi_hello
orted: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 11893) died unexpectedly with status 127 while
attempting to launch so we are aborting.
[...more error messages...]

Open MPI first attempts to launch a "helper" daemon (orted) on node1.example.com, but it failed because one of orted's dependent libraries could not be found. This particular library, libmv.so, is a Pathscale compiler library. As such, it is likely that the user did not set up the Pathscale compiler library in their environment properly on this node.

Double check that you have set up the Pathscale compiler environment on the target node, for both interactive and non-interactive logins. It is a common error to ensure that the Pathscale compiler environment is set up properly for interactive logins, but not for non-interactive logins. For example:

shell$ cd $HOME
shell$ mpicc mpi_hello.c -o mpi_hello
shell$ ./mpi_hello
Hello world, I am 0 of 1.
shell$ ssh node1.example.com

Welcome to node1.
node1 shell$ ./mpi_hello
Hello world, I am 0 of 1.
node1 shell$ exit

shell$ ssh node1.example.com $HOME/mpi_hello
mpi_hello: error while loading shared libraries: libmv.so: cannot open shared object file: No such file or directory
shell$

The above example shows that running a trivial C program compiled by the Pathscale compilers works fine on both the head node and node1 when logging in interactively, but fails when run on node1 non-interactively. Check your shell script startup files and verify that the Pathscale compiler environment is set up properly for non-interactive logins.


12. Can I run non-MPI programs with mpirun / mpiexec?

Yes.

Indeed, Open MPI's mpirun and mpiexec are actually synonyms for our underlying launcher named orterun (i.e., the Open Run-Time Environment layer in Open MPI, or ORTE). So you can use mpirun and mpiexec to launch any application. For example:

shell$ mpirun -np 2 --host a,b uptime

This will launch a copy of the unix command uptime on the hosts a and b.

Other questions in the FAQ section deal with the specifics of the mpirun command line interface; suffice it to say that it works equally well for MPI and non-MPI applications.


13. Can I run GUI applications with Open MPI?

Yes, but it will depend on your local setup and may require additional setup.

In short: you will need to have X forwarding enabled from the remote processes to the display where you want output to appear. In a secure environment, you can simply allow all X requests to be shown on the target display and set the DISPLAY environment variable in all MPI processes' environments to the target display, perhaps something like this:

shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out

However, this technique is not generally suitable for insecure environments (because it allows anyone to read and write to your display). A slightly more secure way is to only allow X connections from the nodes where your application will be running:

shell$ hostname
my_desktop.secure-cluster.example.com
shell$ xhost +compute1 +compute2 +compute3 +compute4
compute1 being added to access control list
compute2 being added to access control list
compute3 being added to access control list
compute4 being added to access control list
shell$ mpirun -np 4 -x DISPLAY=my_desktop.secure-cluster.example.com a.out

(assuming that the four nodes you are running on are compute1 through compute4).

Other methods are available, but they involve sophisticated X forwarding through mpirun and are generally more complicated than desirable.


14. Can I run ncurses-based / curses-based / applications with funky input schemes with Open MPI?

Maybe. But probably not.

Open MPI provides fairly sophisticated stdin / stdout / stderr forwarding. However, it does not work well with curses, ncurses, readline, or other sophisticated I/O packages that generally require direct control of the terminal.

Every application and I/O library is different -- you should try to see if yours is supported. But chances are that it won't work.

Sorry. :-(


15. What other options are available to mpirun?

mpirun supports the "--help" option which provides a usage message and a summary of the options that it supports. It should be considered the definitive list of what options are provided.

Several notable options are:


16. How do I use the --host option to mpirun?

The --host option to mpirun takes a comma-delimited list of hosts on which to run. For example:

shell$ mpirun -np 3 --host a,b,c hostname

This will launch one copy of hostname on each of the hosts a, b, and c.

--host works in two different ways:

  • Exclusionary: If a list of hosts to run on has been provided by another source (e.g., by a hostfile or a batch scheduler such as SLURM, PBS/Torque, SGE, etc.), the hosts provided by the --host option must be in the already-provided host list. If the --host-specified nodes are not in the already-provided host list, mpirun will abort without launching anything.

    In this case, the --host option acts like an exclusionary filter -- it limits the scope of where processes will be scheduled from the original list of hosts to produce a final list of hosts.

    For example, say that the hostfile my_hosts contains the hosts node1 through node4. If you run:

    shell$ mpirun -np 1 --hostfile my_hosts --host node3 hostname
    

    This will run a single copy of hostname on the host node3. However, if you run:

    shell$ mpirun -np 1 --hostfile my_hosts --host node17 hostname
    

    This is an error (because node17 is not listed in my_hosts); mpirun will abort.

    Finally, note that in exclusionary mode, processes will only be executed on the --host-specified hosts, even if it causes oversubscription. For example:

    shell$ mpirun -np 4 --host a uptime
    

    This will launch 4 copies of uptime on host a.

  • Inclusionary: If a list of hosts has not been provided by another source, then the hosts provided by the --host option will be used as the original and final host list.

    In this case, --host acts as an inclusionary agent; all --host-supplied hosts become available for scheduling processes. For example (assume that you are not in a scheduling environment where a list of nodes is being transparently supplied):

    shell$ mpirun -np 3 --host a,b,c hostname
    

    This will launch a single copy of hostname on each of the hosts a, b, and c.

Note, too, that --host is essentially a per-application switch. Hence, if you specify multiple applications (as in an MPMD job), --host can be specified multiple times:

shell$ mpirun -np 1 --host a hostname : -np 1 --host b uptime

This will launch hostname on host a and uptime on host b.
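
The exclusionary check amounts to a set-membership test: every host named with --host must already be in the provided list. The following is a minimal model of that rule only; filter_hosts is a hypothetical function, not Open MPI code, and it does not attempt to model slot counts or scheduling.

```shell
# filter_hosts is a hypothetical model of the exclusionary check, not
# Open MPI code. $1 is a comma-delimited --host list; the remaining
# arguments are the hosts already provided (e.g., from a hostfile).
filter_hosts() {
    requested=$1
    shift
    for host in $(printf '%s' "$requested" | tr ',' ' '); do
        found=no
        for allowed in "$@"; do
            if [ "$host" = "$allowed" ]; then found=yes; fi
        done
        if [ "$found" = no ]; then
            echo "error: $host is not in the provided host list" >&2
            return 1
        fi
    done
    # Every requested host is allowed, so the request is the final list.
    printf '%s\n' "$requested"
}

filter_hosts node3 node1 node2 node3 node4                       # prints node3
filter_hosts node17 node1 node2 node3 node4 || echo "aborted"    # prints aborted
```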


17. How do I control how my processes are scheduled across nodes?

The short version is that if you are not oversubscribing your nodes (i.e., trying to run more processes than you have told Open MPI are available on that node), scheduling is pretty simple and occurs either on a by-slot or by-node round robin schedule. If you're oversubscribing, the issue gets much more complicated -- keep reading.

The more complete answer is: Open MPI schedules processes to nodes by asking two questions of each application on the mpirun command line:

  • How many processes should be launched?
  • Where should those processes be launched?

The "how many" question is directly answered with the -np switch to mpirun. The "where" question is a little more complicated, and depends on three factors:

  • The final node list (e.g., after --host exclusionary or inclusionary processing)
  • The scheduling policy (which applies to all applications in a single job)
  • The default and maximum number of slots on each host

As briefly mentioned in this FAQ entry, slots are Open MPI's representation of how many processors are available on a given host.

The default number of slots on any machine, if not explicitly specified, is 1 (e.g., if a host is listed in a hostfile but has no corresponding "slots" keyword). Schedulers (such as SLURM, PBS/Torque, SGE, etc.) automatically provide an accurate default slot count.

Max slot counts, however, are rarely specified by schedulers. The max slot count for each node will default to "infinite" if it is not provided (meaning that Open MPI will oversubscribe the node if you ask it to -- see more on oversubscribing in this FAQ entry).

Open MPI currently supports two scheduling policies: by slot and by node:

  • By slot: This is the default scheduling policy, but can also be explicitly requested by using either the --byslot option to mpirun or by setting the MCA parameter rmaps_base_schedule_policy to the string "slot".

    In this mode, Open MPI will schedule processes on a node until all of its default slots are exhausted before proceeding to the next node. In MPI terms, this means that Open MPI tries to maximize the number of adjacent ranks in MPI_COMM_WORLD on the same host without oversubscribing that host.

    For example:

    shell$ cat my-hosts
    node0 slots=2 max_slots=20
    node1 slots=2 max_slots=20
    shell$ mpirun --hostfile my-hosts -np 8 --byslot hello | sort
    Hello World I am rank 0 of 8 running on node0
    Hello World I am rank 1 of 8 running on node0
    Hello World I am rank 2 of 8 running on node1
    Hello World I am rank 3 of 8 running on node1
    Hello World I am rank 4 of 8 running on node0
    Hello World I am rank 5 of 8 running on node0
    Hello World I am rank 6 of 8 running on node1
    Hello World I am rank 7 of 8 running on node1
    

  • By node: This policy can be requested either by using the --bynode option to mpirun or by setting the MCA parameter rmaps_base_schedule_policy to the string "node".

    In this mode, Open MPI will schedule a single process on each node in a round-robin fashion (looping back to the beginning of the node list as necessary) until all processes have been scheduled. Nodes are skipped once their default slot counts are exhausted.

    For example:

    shell$ cat my-hosts
    node0 slots=2 max_slots=20
    node1 slots=2 max_slots=20
    shell$ mpirun --hostfile my-hosts -np 8 --bynode hello | sort
    Hello World I am rank 0 of 8 running on node0
    Hello World I am rank 1 of 8 running on node1
    Hello World I am rank 2 of 8 running on node0
    Hello World I am rank 3 of 8 running on node1
    Hello World I am rank 4 of 8 running on node0
    Hello World I am rank 5 of 8 running on node1
    Hello World I am rank 6 of 8 running on node0
    Hello World I am rank 7 of 8 running on node1
    

In both policies, if the default slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will loop through the list of nodes again and try to schedule one more process to each node until all processes are scheduled. Nodes are skipped in this process if their maximum slot count is exhausted. If the maximum slot count is exhausted on all nodes while there are still processes to be scheduled, Open MPI will abort without launching any processes.
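
The difference between the two policies is easiest to see on a job that fits within the default slot counts. The following is a small model of that case only: place_ranks is a hypothetical function, not Open MPI's mapping code, and it deliberately refuses jobs that would oversubscribe rather than guessing at that behavior.

```shell
# place_ranks is a hypothetical model, not Open MPI's RMAPS code.
# Usage: place_ranks slot|node NP HOSTFILE
# It only handles jobs that fit within the default slot counts.
place_ranks() {
    awk -v policy="$1" -v np="$2" '
        /^[ \t]*(#|$)/ { next }             # skip comments and blank lines
        {
            host[++n] = $1
            free[n] = 1                     # default slot count is 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=/) { split($i, kv, "="); free[n] = kv[2] + 0 }
            total += free[n]
        }
        END {
            np += 0
            if (np > total) { print "oversubscription is not modeled"; exit 1 }
            i = 1
            rank = 0
            while (rank < np) {
                if (free[i] > 0) {
                    printf "rank %d -> %s\n", rank, host[i]
                    rank++
                    free[i]--
                    # by node: move on after every process
                    # by slot: stay until this node is full
                    if (policy == "node" || free[i] == 0) i = i % n + 1
                } else {
                    i = i % n + 1           # skip exhausted nodes
                }
            }
        }
    ' "$3"
}

cat > my-hosts <<'EOF'
node0 slots=2 max_slots=20
node1 slots=2 max_slots=20
EOF

place_ranks slot 4 my-hosts    # ranks 0,1 -> node0; ranks 2,3 -> node1
place_ranks node 4 my-hosts    # ranks 0,2 -> node0; ranks 1,3 -> node1
```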

NOTE: This is the scheduling policy in Open MPI because of a long historical precedent in LAM/MPI. However, the scheduling of processes to processors is a component in the RMAPS framework in Open MPI; it can be changed. If you don't like how this scheduling occurs, please let us know.


18. I'm not using a hostfile. How are slots calculated?

If you are using a supported resource manager, Open MPI will get the slot information directly from that entity. If you are using the --host parameter to mpirun, be aware that each instance of a hostname bumps up the internal slot count by one. For example:

shell$ mpirun --host node0,node0,node0,node0 ....

This tells Open MPI that host "node0" has a slot count of 4. This is very different from, for example:

shell$ mpirun -np 4 --host node0 a.out

This tells Open MPI that host "node0" has a slot count of 1 but you are running 4 processes on it. Specifically, Open MPI assumes that you are oversubscribing the node.
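
The counting rule can be sketched by tallying how often each name appears in the list. host_slots is a hypothetical helper that only illustrates the arithmetic; it does not query Open MPI.

```shell
# host_slots is a hypothetical helper: it prints "host count" for each
# distinct host in a comma-delimited --host list, in order of first use.
host_slots() {
    printf '%s\n' "$1" | tr ',' '\n' | awk '
        NF && !($1 in count) { order[++n] = $1 }
        NF { count[$1]++ }
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}

host_slots node0,node0,node0,node0    # prints node0 4
host_slots node0                      # prints node0 1
```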


19. Can I run multiple parallel processes on a uniprocessor machine?

Yes.

But be very careful to ensure that Open MPI knows that you are oversubscribing your node! If Open MPI is unaware that you are oversubscribing a node, severe performance degradation can result.

See this FAQ entry for more details on oversubscription.


20. Can I oversubscribe nodes (run more processes than processors)?

Yes.

However, it is critical that Open MPI knows that you are oversubscribing the node, or severe performance degradation can result.

The short explanation is as follows: never specify a number of slots that is more than the available number of processors. If you want to run 4 processes on a uniprocessor, indicate that you have only 1 slot but want to run 4 processes. For example:

shell$ cat my-hostfile
localhost
shell$ mpirun -np 4 --hostfile my-hostfile a.out

Specifically: do NOT have a hostfile that contains "slots=4" (because there is only one available processor).

Here's the full explanation:

Open MPI basically runs its message passing progression engine in two modes: aggressive and degraded.

For example, on a uniprocessor node:

shell$ cat my-hostfile
localhost slots=4
shell$ mpirun -np 4 --hostfile my-hostfile a.out

This would cause all 4 MPI processes to run in aggressive mode because Open MPI thinks that there are 4 available processors to use. This is actually a lie (there is only 1 processor -- not 4), and can cause extremely bad performance.
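
One way to catch this mistake before launching is to compare the slot counts in a hostfile with the real processor count. The Python sketch below is a minimal checker under stated assumptions: it handles only simple "host" and "host slots=N" lines, treats "#" as starting a comment, and the names `parse_hostfile` and `overstated_slots` are hypothetical helpers, not part of Open MPI.

```python
import os
import re


def parse_hostfile(text):
    """Return {hostname: slots} for simple 'host [slots=N]' lines."""
    slots = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        host = line.split()[0]
        match = re.search(r"\bslots\s*=\s*(\d+)", line)
        # A host listed without "slots=" counts as one slot.
        slots[host] = int(match.group(1)) if match else 1
    return slots


def overstated_slots(slots, processors):
    """Hosts whose declared slots exceed the given processor count."""
    return {h: n for h, n in slots.items() if n > processors}


declared = parse_hostfile("localhost slots=4\n")
# On a uniprocessor this reports localhost as overstated.
print(overstated_slots(declared, processors=1))
# Check against the machine this script runs on instead.
print(overstated_slots(declared, processors=os.cpu_count() or 1))
```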


21. Can I force Aggressive or Degraded performance modes?

Yes.

The MCA parameter mpi_yield_when_idle controls whether an MPI process runs in Aggressive or Degraded performance mode. Setting it to zero forces Aggressive mode; any other value forces Degraded mode (see this FAQ entry to see how to set MCA parameters).

Note that this value only affects the behavior of MPI processes when they are blocking in MPI library calls. It does not affect behavior of non-MPI processes, nor does it affect the behavior of a process that is not inside an MPI library call.

Open MPI normally sets this parameter automatically (see this FAQ entry for details). Users are cautioned against setting this parameter unless they are really, absolutely, positively sure of what they are doing.


22. How do I run with the TotalView parallel debugger?

Generally, you can run Open MPI processes with TotalView as follows:

shell$ mpirun --debug ...mpirun arguments...

Assuming that TotalView is the first supported parallel debugger in your path, Open MPI will automatically invoke the correct underlying command to run your MPI process in the TotalView debugger. Be sure to see this FAQ entry for details about what versions of Open MPI and TotalView are compatible.

For reference, this underlying command form is the following:

shell$ totalview mpirun -a ...mpirun arguments...

So if you wanted to run a 4-process MPI job of your a.out executable, it would look like this:

shell$ totalview mpirun -a -np 4 a.out

Alternatively, Open MPI's mpirun offers the "-tv" convenience option which does the same thing as TotalView's "-a" syntax. For example:

shell$ mpirun -tv -np 4 a.out

Note that by default, TotalView will stop deep in the machine code of mpirun itself, which is not what most users want. It is possible to get TotalView to recognize that mpirun is simply a "starter" program and should be (effectively) ignored. Specifically, TotalView can be configured to skip mpirun (and mpiexec and orterun) and jump right into your MPI application. This can be accomplished by placing some startup instructions in a TotalView-specific file named $HOME/.tvdrc.

Open MPI includes a sample TotalView startup file that performs this function (see etc/openmpi-totalview.tcl in Open MPI distribution tarballs; it is also installed, by default, to $prefix/etc/openmpi-totalview.tcl in the Open MPI installation). This file can be either copied to $HOME/.tvdrc or sourced from the $HOME/.tvdrc file. For example, placing the following line in your $HOME/.tvdrc (replacing /path/to/openmpi/installation with the proper directory name, of course) will use the Open MPI-provided startup file:

source /path/to/openmpi/installation/etc/openmpi-totalview.tcl


23. How do I run with the DDT parallel debugger?

If you've used DDT at least once before (to use the configuration wizard to set up support for Open MPI), you can start it on the command line with:

shell$ mpirun --debug ...mpirun arguments...

Assuming that you are using Open MPI v1.2.4 or later, and assuming that DDT is the first supported parallel debugger in your path, Open MPI will automatically invoke the correct underlying command to run your MPI process in the DDT debugger. For reference (or if you are using an earlier version of Open MPI), this underlying command form is the following:

shell$ ddt -n {nprocs} -start {exe-name}

Note that passing arbitrary arguments to Open MPI's mpirun is not supported with the DDT debugger.

You can also attach to already-running processes with either of the following two syntaxes:

shell$ ddt -attach {hostname1:pid} [{hostname2:pid} ...] {exec-name}
# Or
shell$ ddt -attach-file {filename of newline separated hostname:pid pairs} {exec-name}
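
The file passed to -attach-file is just one hostname:pid pair per line, so it is easy to generate. The Python sketch below writes such a file; the function name `write_attach_file` and the hostnames and PIDs are invented for illustration, and how you collect the real PIDs is up to you.

```python
import os
import tempfile


def write_attach_file(path, processes):
    """Write one 'hostname:pid' pair per line."""
    with open(path, "w") as f:
        for hostname, pid in processes:
            f.write(f"{hostname}:{pid}\n")


# Example with made-up hostnames and PIDs, written to a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "attach.txt")
write_attach_file(path, [("node0", 15303), ("node1", 15305)])
with open(path) as f:
    print(f.read(), end="")
```

The resulting file name is what goes in the {filename ...} position of the ddt -attach-file command shown above.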

DDT can even be configured to operate with cluster/resource schedulers such that it can run on a local workstation, submit your MPI job via the scheduler, and then attach to the MPI job when it starts.

See the official DDT documentation for more details.


24. What launchers are available?

The documentation contained in the Open MPI tarball will have the most up-to-date information, but as of v1.0, Open MPI supports:

  • BProc versions 3 and 4 with LSF
  • Sun Grid Engine (SGE), and the open source Grid Engine (support first introduced in Open MPI v1.2)
  • PBS Pro, Torque, and Open PBS
  • LoadLeveler scheduler (full support since 1.1.1)
  • rsh / ssh
  • SLURM
  • XGrid
  • Yod (Cray XT-3 and XT-4)


25. How do I specify to the rsh launcher to use rsh or ssh?

See this FAQ entry.


26. How do I run with the SLURM and PBS/Torque launchers?

If support for these systems is included in your Open MPI installation (which you can check with the ompi_info command -- look for components named "slurm" and/or "tm"), Open MPI will automatically detect when it is running inside such jobs and will just "do the Right Thing."

See this FAQ entry for a description of how to run jobs in SLURM; see this FAQ entry for a description of how to run jobs in PBS/Torque.


27. How do I run with the SGE launcher?

Support for SGE is included in Open MPI version 1.2 and later.

NOTE: To build SGE support in v1.3, you will need to explicitly request the SGE support with the "--with-sge" command line switch to Open MPI's configure script.

See this FAQ entry for a description of how to correctly build Open MPI with SGE support.

To verify if support for SGE is configured into your Open MPI installation, run ompi_info as shown below and look for gridengine. The components you will see are slightly different between v1.2 and v1.3.

For Open MPI 1.2:

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v1.0, API v1.0, Component v1.2)
                 MCA pls: gridengine (MCA v1.0, API v1.0, Component v1.2)

For Open MPI 1.3:

shell$ ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
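
If you want to perform this check from a script rather than by eye, the ompi_info lines shown above can be parsed directly. The following Python sketch assumes the output format seen in these examples ("MCA framework: component (versions)"); the regular expression and the function name `find_components` are written for this example only.

```python
import re

# Matches lines such as:
#   MCA ras: gridengine (MCA v1.0, API v1.0, Component v1.2)
COMPONENT_RE = re.compile(r"^\s*MCA\s+(\w+):\s+(\w+)\s+\((.*)\)\s*$")


def find_components(ompi_info_output, name):
    """Return the MCA frameworks that list a component called `name`."""
    frameworks = []
    for line in ompi_info_output.splitlines():
        m = COMPONENT_RE.match(line)
        if m and m.group(2) == name:
            frameworks.append(m.group(1))
    return frameworks


sample = """\
                 MCA ras: gridengine (MCA v1.0, API v1.0, Component v1.2)
                 MCA pls: gridengine (MCA v1.0, API v1.0, Component v1.2)
"""
print(find_components(sample, "gridengine"))
# prints ['ras', 'pls']
```

In real use the text would come from running ompi_info, for example with `subprocess.run(["ompi_info"], capture_output=True, text=True).stdout`. An empty result means no gridengine component was found.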

Open MPI will automatically detect when it is running inside SGE and will just "do the Right Thing."

Specifically, if you execute an mpirun command in a SGE job, it will automatically use the SGE mechanisms to launch and kill processes. There is no need to specify what nodes to run on -- Open MPI will obtain this information directly from SGE. For example, this will run the 4 MPI processes on the nodes that were allocated by SGE:

# Get the environment variables for SGE
# (Assuming SGE is installed at /opt/sge and $SGE_CELL is 'default' in your environment)
# C shell settings
shell% source /opt/sge/default/common/settings.csh

# bourne shell settings
shell$ . /opt/sge/default/common/settings.sh

# Allocate a SGE interactive job with 4 slots
# from a parallel environment (PE) named 'orte'
shell$ qsh -pe orte 4
# Now run a 4-process Open MPI job
shell$ mpirun -np 4 a.out

There are also other ways to submit jobs under SGE:

# Submit a batch job with the 'mpirun' command embedded in a script
shell$ qsub -pe orte 4 my_mpirun_job.csh

# Submit an SGE and OMPI job and mpirun in one line
shell$ qrsh -V -pe orte 4 mpirun -np 4 hostname

# Use qstat(1) to show the status of SGE jobs and queues
shell$ qstat -f

For the setup, be sure you have a Parallel Environment (PE) defined for submitting parallel jobs. You don't have to name your PE "orte". The following example shows what a PE named 'orte' might look like:

% qconf -sp orte
pe_name           orte
slots             8
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

And be sure the queue will make use of the PE that you specified:

% qconf -sq all.q
...
pe_list               make cre orte
...

To determine whether the SGE parallel job was successfully launched on the remote nodes, you can pass the MCA parameter "--mca pls_gridengine_verbose 1" to mpirun.

This adds a -verbose flag to the qrsh -inherit command that is used to send parallel tasks to the remote SGE execution hosts. It will show whether the connections to the remote hosts are established successfully or not.


28. Can I suspend and resume my job?

A new feature was added into Open MPI 1.3.1 that supports suspend/resume of an MPI job. To suspend the job, you send a SIGTSTP (not SIGSTOP) signal to mpirun. mpirun will catch this signal and forward it to the a.outs as a SIGSTOP signal. To resume the job, you send a SIGCONT signal to mpirun which will be caught and forwarded to the a.outs.

By default, this feature is not enabled. This means that both the SIGTSTP and SIGCONT signals will simply be consumed by the mpirun process. To have them forwarded, you have to run the job with --mca orte_forward_job_control 1. Here is an example on Solaris.

shell$ mpirun -mca orte_forward_job_control 1 -np 2 a.out

In another window, we suspend and continue the job.

shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:00:21 5.9% a.out/1
 15303 rolfv     158M   22M cpu2     0    0   0:00:21 5.9% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

shell$ kill -TSTP 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15303 rolfv     158M   22M stop    30    0   0:01:44  21% a.out/1
 15305 rolfv     158M   22M stop    20    0   0:01:44  21% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

shell$ kill -CONT 15301
shell$ prstat -p 15301,15303,15305
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 15305 rolfv     158M   22M cpu1     0    0   0:02:06  17% a.out/1
 15303 rolfv     158M   22M cpu3     0    0   0:02:06  17% a.out/1
 15301 rolfv    8128K 5144K sleep   59    0   0:00:00 0.0% orterun/1

Note that all this does is stop the a.outs. It does not, for example, free any pinned memory when the job is in the suspended state.
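
The same two signals can be sent from a script instead of from the shell. The Python sketch below is a minimal illustration using the standard `os.kill` call; the function name `suspend_then_resume` is made up here, and the job is assumed to have been started with orte_forward_job_control enabled as described above.

```python
import os
import signal
import time


def suspend_then_resume(mpirun_pid, pause_seconds):
    """Send SIGTSTP to mpirun, wait, then send SIGCONT."""
    os.kill(mpirun_pid, signal.SIGTSTP)   # SIGTSTP, not SIGSTOP
    time.sleep(pause_seconds)
    os.kill(mpirun_pid, signal.SIGCONT)
```

With the process IDs from the example above, `suspend_then_resume(15301, 60)` would suspend the job for about a minute and then let it continue.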

To get this to work under the SGE environment, you have to change the suspend_method entry in the queue. It has to be set to SIGTSTP. Here is an example of what a queue should look like.

shell$ qconf -sq all.q
qname                 all.q
[...snip...]
starter_method        NONE
suspend_method        SIGTSTP
resume_method         NONE 

Note that if you need to suspend other types of jobs with SIGSTOP (instead of SIGTSTP) in this queue then you need to provide a script that can implement the correct signals for each job type.


29. Does the SGE tight integration support the -notify flag to qsub?

If you are running SGE 6.2 Update 3 or later, then the -notify flag is supported. If you are running earlier versions, then the -notify flag will not work and using it will cause the job to be killed.

To use -notify, one has to be careful. First, let us review what -notify does. Here is an excerpt from the qsub man page for the -notify flag.

-notify
This flag, when set causes Sun Grid Engine to send
warning signals to a running job prior to sending the
signals themselves. If a SIGSTOP is pending, the job
will receive a SIGUSR1 several seconds before the SIGSTOP.
If a SIGKILL is pending, the job will receive a SIGUSR2
several seconds before the SIGKILL. The amount of time
delay is controlled by the notify parameter in each
queue configuration.

Let us assume that the reason you want to use the -notify flag is to get the SIGUSR1 signal prior to getting the SIGTSTP signal. As mentioned in this FAQ entry, one could run the job as shown in this batch script.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
mpirun -np 16 -mca orte_forward_job_control 1 a.out

However, one has to make one of two changes to this script for things to work properly. By default, a SIGUSR1 signal will kill a shell script. So we have to make sure that does not happen. Here is one way to handle it.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
exec mpirun -np 16 -mca orte_forward_job_control 1 a.out

Alternatively, one can catch the signals in the script instead of doing an exec on the mpirun.

#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 16
#$ -j y
#$ -l h_rt=00:20:00
function sigusr1handler()
{
        echo "SIGUSR1 caught by shell script" 1>&2
}
function sigusr2handler()
{
        echo "SIGUSR2 caught by shell script" 1>&2
}
trap sigusr1handler SIGUSR1
trap sigusr2handler SIGUSR2
mpirun -np 16 -mca orte_forward_job_control 1 a.out
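
For comparison, the same catch-the-signals idea can be expressed as a Python wrapper. This is only a sketch of the technique, not a tested SGE job script: the names `note_signal` and `run_with_notify_handlers` are invented for this example, and whether your site can run a Python script as the job script is an assumption you would need to check.

```python
import signal
import subprocess
import sys

caught = []   # signal numbers seen by the wrapper


def note_signal(signum, frame):
    # Record and report the warning signal instead of dying from it.
    caught.append(signum)
    print(f"{signal.Signals(signum).name} caught by wrapper", file=sys.stderr)


def run_with_notify_handlers(command):
    """Install SIGUSR1/SIGUSR2 handlers, then run `command` and wait for it."""
    signal.signal(signal.SIGUSR1, note_signal)
    signal.signal(signal.SIGUSR2, note_signal)
    return subprocess.call(command)
```

A call such as `run_with_notify_handlers(["mpirun", "-np", "16", "-mca", "orte_forward_job_control", "1", "a.out"])` then plays the role of the last line of the bash script above.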


30. How do I run with LoadLeveler?

If support for LoadLeveler is included in your Open MPI installation (which you can check with the ompi_info command -- look for components named "loadleveler"), Open MPI will automatically detect when it is running inside such jobs and will just "do the Right Thing."

Specifically, if you execute an mpirun command in a LoadLeveler job, it will automatically determine what nodes and how many slots on each node have been allocated to the current job. There is no need to specify what nodes to run on. Open MPI will then attempt to launch the job using whatever resource is available (on Linux rsh/ssh is used).

For example:

# Job to submit
shell$ cat job
#@ output  = job.out
#@ error   = job.err
#@ job_type = parallel
#@ node = 3
#@ tasks_per_node = 4
mpirun a.out

# Submit batch job to LoadLeveler 
shell$ llsubmit job

This will run 4 MPI processes per node on the 3 nodes that were allocated by LoadLeveler for this job.

For users of the Open MPI 1.1 series: version 1.1.0 has a problem that prevents Open MPI from determining what nodes are available to it if the job has more than 128 tasks. In the 1.1.x series starting with version 1.1.1, this can be worked around by passing "-mca ras_loadleveler_priority 110" to mpirun. Version 1.2 and above work without any additional flags.


31. How do I load libmpi at runtime?

If you want to load the shared library libmpi explicitly at runtime, either by using dlopen() from C/C++ or something like the ctypes package from Python, some extra care is required. The default configuration of Open MPI uses dlopen() internally to load its support components. These components rely on symbols available in libmpi. In order to make the symbols in libmpi available to the components loaded by Open MPI at runtime, libmpi must be loaded with the RTLD_GLOBAL option.

In C/C++, this option is specified as the second parameter to dlopen(). When using ctypes with Python, this can be done with the second (optional) parameter to CDLL(). For example (shown below in Mac OS X, where Open MPI's shared library name ends in ".dylib"; other operating systems use other suffixes, such as ".so"):

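  from ctypes import *
  # RTLD_GLOBAL makes libmpi's symbols visible to the components
  # that Open MPI later loads with dlopen().
  mpi = CDLL('libmpi.0.dylib', RTLD_GLOBAL)
  # Fetch the interpreter's argc/argv to pass through to MPI_Init.
  f = pythonapi.Py_GetArgcArgv
  argc = c_int()
  argv = POINTER(c_char_p)()
  f(byref(argc), byref(argv))
  mpi.MPI_Init(byref(argc), byref(argv))
  mpi.MPI_Finalize()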

Other scripting languages should have similar options when dynamically loading shared libraries.


32. What MPI environmental variables exist?

Beginning with the 1.3 release, Open MPI provides the following environmental variables that will be defined on every MPI process:

  • OMPI_COMM_WORLD_SIZE - the number of processes in this process' MPI Comm_World
  • OMPI_COMM_WORLD_RANK - the MPI rank of this process
  • OMPI_COMM_WORLD_LOCAL_RANK - the relative rank of this process on this node within its job. For example, if four processes in a job share a node, they will each be given a local rank ranging from 0 to 3.
  • OMPI_UNIVERSE_SIZE - the number of process slots allocated to this job. Note that this may be different than the number of processes in the job.
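
A script or wrapper can read the variables listed above straight from its environment. The Python sketch below collects them into a dictionary; the function name `ompi_job_info` and the dictionary keys are choices made for this example, and the function returns None when the variables are absent (for instance, when the process was not started by mpirun).

```python
import os


def ompi_job_info(environ=os.environ):
    """Read the Open MPI 1.3+ job variables; None if any is missing."""
    names = {
        "size": "OMPI_COMM_WORLD_SIZE",
        "rank": "OMPI_COMM_WORLD_RANK",
        "local_rank": "OMPI_COMM_WORLD_LOCAL_RANK",
        "universe_size": "OMPI_UNIVERSE_SIZE",
    }
    if any(var not in environ for var in names.values()):
        return None
    return {key: int(environ[var]) for key, var in names.items()}


info = ompi_job_info()
if info is None:
    print("not running under Open MPI's mpirun")
else:
    print(f"rank {info['rank']} of {info['size']}")
```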

Open MPI guarantees that these variables will remain stable throughout future releases.


33. How do I get my MPI job to wire up its MPI connections right away?

By default, Open MPI opens MPI connections between processes in a "lazy" fashion - i.e., the connections are only opened when the MPI process actually attempts to send a message to another process for the first time. This is done since (a) Open MPI has no idea what connections an application process will really use, and (b) creating the connections takes time. Once the connection is established, it remains "connected" until one of the two connected processes terminates, so the creation time cost is paid only once.

Applications that require a fully connected topology, however, can see improved startup time if they automatically "pre-connect" all their processes during MPI_Init. Accordingly, Open MPI provides the MCA parameter "mpi_preconnect_mpi" which directs Open MPI to establish a fully connected topology during MPI_Init. This is accomplished in a somewhat scalable fashion to help minimize startup time.

Users can set this parameter in two ways:

  • in the environment as OMPI_MCA_mpi_preconnect_mpi=1
  • on the cmd line as mpirun -mca mpi_preconnect_mpi 1

See this FAQ entry for more details on how to set MCA parameters.
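
When mpirun is launched from a Python script, the two forms above map onto an environment dictionary and an argument list. The sketch below shows both; the helper names `mca_as_env` and `mca_as_args` are invented for this example, and the parameter name is passed in as an argument rather than hard-coded.

```python
def mca_as_env(name, value):
    """Environment form: OMPI_MCA_<name>=<value>."""
    return {f"OMPI_MCA_{name}": str(value)}


def mca_as_args(name, value):
    """Command-line form: -mca <name> <value>."""
    return ["-mca", name, str(value)]


# Build an mpirun command line that sets the parameter explicitly.
command = ["mpirun", *mca_as_args("mpi_preconnect_mpi", 1), "-np", "4", "a.out"]
print(" ".join(command))
# prints mpirun -mca mpi_preconnect_mpi 1 -np 4 a.out
```

The environment form would typically be merged into a copy of `os.environ` and handed to `subprocess.run(..., env=...)`.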