IBM SP-cluster systems reference manual
last update: 09/13/2004

Understanding parallel job management and message passing on IBM SP systems

Before you can understand how IBM SP systems manage parallel jobs and message passing, some distinctions are necessary. This section describes how IBM SP systems manage parallel processes.

Code parallelization is offered in two distinctly different paradigms for Symmetric Multi Processor (SMP) and cluster computing systems.

The first paradigm is "threads," where a serial-process task spawns multiple threads to compute different parts of a loop or to compute different code sections. The important thing to remember about threaded parallelization is that threads split off one process, they are intimately connected to that process, and they rejoin it before the process terminates. The typical method of thread parallelization is using OpenMP directives. The thread paradigm is not the subject of the discussion in this section.

This section is about the second paradigm, "message passing," where multiple coordinated independent executables (tasks) run simultaneously on multiple processing units of the system to achieve a common goal. The coordination between the independent tasks is accomplished by intercommunication between processing units (CPUs or nodes or both). This requires system resources from potentially more than one node of the cluster to:

  1. Assign CPUs and nodes to the job
  2. Spawn daemons to run the individual tasks
  3. Tell daemons to start running the individual tasks, then monitor progress
  4. Shut down the job when all tasks finish
  5. Write reports, close files, release processing units, and exit

It is not appropriate to require users to perform all these tasks, so generally some sort of system-broker user interface is provided. The parallel system broker for the Parallel System Support Programs (PSSP) on the IBM SP is called POE (for Parallel Operating Environment).

Note: Tasks can talk to each other through an industry-standard interface such as MPI, but this is not required. There is a production parallel-task job currently running under POE at NCAR that uses only disk files to communicate between tasks. POE provides a system library through which a parallel process can find the number of tasks running under this parallel process, the names of the tasks, and the ordinals of the tasks. This information is normally required by the message-passing library (such as MPI).
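For example (a hypothetical sketch, not taken from this manual), a file-based scheme like the one mentioned above could key its file names off each task's ordinal. The sketch below assumes that POE exports the task ordinal to each task in the MP_CHILD environment variable; confirm the variable name against your POE release documentation, and note that the wrapper, executable, and file names are placeholders.

  #! /bin/ksh
  # Hypothetical per-task wrapper started by POE.
  # Assumes POE exports the task ordinal in MP_CHILD.
  me=${MP_CHILD:-0}                          # this task's ordinal
  print "task $me starting" > task.$me.log   # per-task file named by ordinal
  ./mytask < input.$me > output.$me          # tasks coordinate through these files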

POE is normally given a task and told how many times to start that task on a node or across a set of nodes. Alternatively, you can give POE a set of different tasks in a "cmdfile" to start on a node or nodes. The first method is called SPMD, for Single Program Multiple Data. The second method is called MPMD, for Multiple Program Multiple Data.
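For example (a sketch only; the command-line flags mirror the MP_ environment variables listed later in this section, and the executable names are placeholders):

  # SPMD: start the same executable as 4 tasks
  poe ./model -procs 4 -rmpool 1

  # MPMD: start a different executable for each task, listed one per line in a cmdfile
  cat > cmdfile << 'EOF'
  ./reader
  ./solver
  ./solver
  ./writer
  EOF
  poe -pgmmodel mpmd -cmdfile cmdfile -procs 4 -rmpool 1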

What POE does -- POE functional states

POE software is a polling and interrupt-driven state machine. (You can use the whatis.com glossary to learn about the concept of a state machine.)

This POE state diagram shows how POE processes parallel jobs on multiple processing units. (This link opens a new browser window so you can alternate between the large diagram and the following text.)

POE is invoked through user initiation (when a user issues a "poe" command or when a LoadLeveler script starts POE); it also responds to user intervention (for example, the llcancel command) and to system intervention (for example, an unresolved communication error).

Normally, POE is in one state at a time. Some states occur serially (state 1, state 2, state 3), and some states occur repeatedly (state 3).

State 1: When invoked, POE's first action is to assemble a set of system resource requirements from your MP_ environment variables, your POE command-line options, and the system defaults.

POE then makes a request through LoadLeveler for PSSP system resources (such as number of nodes, node use, process mapping, and so on). (See "State 1" on the POE state diagram.)

Note: LoadLeveler is the "gatekeeper" of all system resources. It builds and maintains a database for all batch jobs and for interactive jobs that require resources other than the interactive login node. This is why you see a LoadLeveler informational comment on your workstation screen when you start POE interactively. It is also why the queue name "interactive" -- analogous to the batch queue names -- is used on the system. It is also why you must set MP_RMPOOL=1 for interactive jobs -- this tells LoadLeveler to request nodes from the interactive pool.

Note: LoadLeveler builds a resource database for your job that POE will use to set environment variables at run time. The information in this resource database can take precedence over some MP_ environment variables that you may have set in your code. POE reads your local MP_ environment variables when it starts; later, it asks for the contents of the resource database for your job. The LoadLeveler directives that take precedence appear in the table below (POE parameters and corresponding LoadLeveler directives for an MPI job), and this issue is also explained further with the "Job control interactions" diagram below. For now, just remember this rule: your LoadLeveler directives take precedence over the corresponding MP_ environment variables.

Note: If a parallel job is interactive, the LoadLeveler batch subsystem is still involved: it helps create a resource database from your environment and from your POE command line. It asks IBM's PSSP for the resources that will be used to build this database. (This process is explained further in the "Job control interactions" diagram below.)

State 2: POE's second major activity after being invoked is to spawn subordinate Partition Manager Daemons (PMDs) on each of the nodes given to POE through the LoadLeveler negotiation. These PMDs on the nodes are always in communication with their master, POE. If POE loses communication with a PMD, then POE shuts down the job. (The minimum frequency of this communication is controlled by the user through the environment variable MP_PULSE; however, it typically needs adjusting only for specialized applications.) (See "State 2" on the diagram.)

State 3: POE assigns parallel application tasks to each of the PMDs. The PMDs start the task(s) on their node(s) and serve as stdin, stdout, and stderr terminals for their task(s). The PMDs relay task status to the master POE session, which monitors all running tasks. POE can also send system and user interrupts to the tasks running on the nodes. The PMDs "pipe" all their stdin, stdout, and stderr information from/to POE, and this is how users send and receive all "terminal" input and output. Most of the running time of any parallel job is spent performing these interactions. (See "State 3" on the diagram.)
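The MP_ variables that control this terminal piping appear in the table later in this section. A typical interactive setup (values drawn from that table and from the example script at the end of this section; the executable name is a placeholder) might be:

  export MP_LABELIO=yes          # prefix each output line with its task ordinal
  export MP_STDOUTMODE=ordered   # collect stdout and print it in task order
  export MP_STDINMODE=all        # send a copy of stdin to every task
  poe ./model -procs 2 -rmpool 1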

State 4: POE receives a shutdown signal from executable(s) on node(s). For example, if this is an MPI job: MPI_FINALIZE from all executables or MPI_ABORT from one. (See "State 4" on the diagram.)

State 5: To clean up before exiting, POE may sort output, then write stdout and stderr files, close user files, shut down PMDs, and notify PSSP through LoadLeveler that it is releasing nodes. When these tasks are completed, POE exits. (See "State 5" on the diagram.)

Parallel batch job control on IBM SP-cluster systems

When preparing parallel jobs to run on IBM SP-cluster systems, you should know how job control is passed between LoadLeveler and POE. This knowledge helps you understand why your LoadLeveler directives can override the MP_ environment variables you set, and why even interactive jobs pass through the batch subsystem.

As stated above, LoadLeveler needs you to provide information about the system resources your job will need. This section helps you understand how LoadLeveler and POE share control of how your job runs on IBM SP-cluster systems.

When processing system resource requests, LoadLeveler requires any two of these three instructions about the parallel tasks in your job and the node(s) on which they will run:

  1. The total number of tasks in the job (#@total_tasks)
  2. The number of nodes to use (#@node)
  3. The number of tasks to place on each node (#@tasks_per_node)
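For example, a 32-task job spread over 4 nodes could be requested with either pair of directives below (a sketch; the keyword spellings match the example script at the end of this section):

  #@node=4
  #@total_tasks=32

or, equivalently:

  #@node=4
  #@tasks_per_node=8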

Your LoadLeveler script must also provide information about the inter-node network resources your job will need. (The IBM SP architecture includes a high-bandwidth low-latency crossbar switch that provides, among other things, network services between nodes to support message passing for user codes. The short name for this inter-node network is "the switch.") The network resources line in your LoadLeveler script has the form #@network.* and it provides these four pieces of information:

  1. The message passing library you want the switch configured to use: MPI or LAPI
  2. Which network adapter to use: csss or en0
  3. How the network adapter will be used: shared or not_shared
  4. The network protocol to use: us or ip ("us" is the native protocol for the switch; "ip" is the internet protocol.)

This is the network directive most commonly used on IBM SP-cluster systems: #@network.MPI=csss,shared,us

But POE uses a different set of names than LoadLeveler for these same network resources, and POE runs the job on IBM SP-cluster systems. So for batch jobs, the LoadLeveler directives have to be translated into POE environment variables while the job is running. This is described in the "Job control interactions" diagram below.

For interactive jobs, the MP_ environment variables you supply are transmitted to LoadLeveler when the POE command is issued and then the translation described above takes place. (It may seem confusing that the batch subsystem is used for interactive jobs, but this process is described with the diagram below.)

Here are the MP_ environment variables that POE uses for network resources (in C shell format):

  1. The message passing library you want the switch configured to use: setenv MP_MSG_API MPI or setenv MP_MSG_API LAPI
  2. Which network adapter to use: setenv MP_EUIDEVICE csss or setenv MP_EUIDEVICE en0
  3. How the network adapter will be used: setenv MP_ADAPTER_USE dedicated or setenv MP_ADAPTER_USE shared
  4. The network protocol to use: setenv MP_EUILIB us or setenv MP_EUILIB ip

Note: If you run your parallel job interactively, you must set these MP_ environment variables yourself (in your shell or run script) before invoking POE.
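The same settings in Korn/Bourne shell format, as used in the example script at the end of this section, are:

  export MP_MSG_API=MPI
  export MP_EUIDEVICE=csss
  export MP_ADAPTER_USE=shared
  export MP_EUILIB=us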

As noted above, some of the environment variables set by LoadLeveler take precedence over some MP_ environment variables that you may have set in your code. This Job control interactions diagram shows how control is passed among the functional units of the system, and it clarifies why your LoadLeveler directives take precedence over your MP_ environment variables. (This link opens a new browser window so you can see both the large diagram and the following text.)

The diagram shows job control flowing from the upper left to the lower right. A parallel batch job gets node and network resources using LoadLeveler directives at queueing time and again later at POE execution time. This second use of LoadLeveler directives explains why user-set MP_ environment variables in the run script are overwritten.

In the diagram, a LoadLeveler script named "my.job.script" is submitted to LoadLeveler via the llsubmit command; it requests resources for a four-process parallel execution flow. The Scheduler generates a database from my.job.script that contains a list of system resources the job will need. Then the Negotiator holds your job in a queue until it finds the necessary resources available on the system. The Negotiator then passes the job on to the Starter, which reads the list of resources in the database and starts execution.

As the batch job runs, it will require POE to coordinate the independently running tasks. POE then reads the resource database assigned to the job again. The configuration information that was derived from the LoadLeveler directives and stored in the resource database is now translated into POE configuration information and used for job execution. This is when your LoadLeveler directives, which are stored as configuration information in the resource database, can overwrite the MP_ environment variables in your code. Not all MP_ environment variables will be overwritten, however. See the table below ("POE parameters and corresponding LoadLeveler directives for an MPI job") to determine which environment variables can be overwritten.
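As a small illustration of this precedence (a sketch, not a recommended configuration): in the fragment below, the #@total_tasks directive is stored in the resource database at queueing time, so POE starts four tasks even though the script body exports MP_PROCS=2. The executable name is a placeholder.

  #! /bin/ksh
  #@job_type=parallel
  #@total_tasks=4
  #@node=1
  #@class=share
  #@queue
  export MP_PROCS=2      # overwritten by the #@total_tasks directive above
  poe ./model            # runs 4 tasks, not 2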

What MPI does

MPI is the industry-standard Message Passing Interface. It is one of the three message-passing libraries that are configured to work with POE. The other two are PVM and LAPI. (PVM stands for Parallel Virtual Machine, and LAPI stands for Low-level Application Programming Interface for communication.)

This does not mean that other message-passing libraries (such as P4) would not work with POE, but if you do your message passing with P4, you would have to embed POE environment calls in your code to determine task ordinal, number of processes, and so on. Also, if you want to use the inter-node network for high-speed communication, you would need to map your P4_sends and P4_receives to low-level LAPI protocol. Unless IBM decides to support P4 as a message-passing library, it is much easier to substitute MPI calls for P4 calls in your code.

MPI strategies and parameters for best performance

There are some basic MPI programming strategies that increase MPI system library performance. There are a number of POE parameter choices that also increase MPI library performance. This section lists these strategies and parameters. In addition, this section provides a diagram to help you understand how kernel buffer communication works for MPI jobs on the IBM SP architecture so you can optimize your MPI application and avoid costly kernel buffer overflows.

POE environment variables and coding strategies that can improve MPI library performance:

In most cases the system default values are adequate; the main exception is MP_SHARED_MEMORY=yes, which should be set as described in the first item of the list below.

System-specific environment variable options

To improve the performance of your MPI job, try setting these POE environment variables. They are presented in order of importance:

  1. To use the high-performance shared-memory path for MPI messages between tasks on the same node, set the POE environment variable MP_SHARED_MEMORY=yes (see table below).
  2. If your MPI_RECV routines cannot be posted before the matching MPI_SEND routines, control the kernel setaside buffer for incoming MPI messages by setting the POE environment variable MP_BUFFER_MEM=64MB (see table below).
  3. If you need no setaside buffer and want to minimize data movement from the communication buffer to user space, set the POE environment variable MP_EAGER_LIMIT=0, which controls the message length used to divide up the setaside MPI buffer (see table below).
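A typical setup reflecting the items above, in Korn shell format (example values only, as in the table below):

  # When more than one task runs on a node:
  export MP_SHARED_MEMORY=yes    # shared-memory path for on-node MPI messages

  # If receives cannot be posted before the matching sends:
  export MP_BUFFER_MEM=64MB      # kernel setaside buffer for early-arriving messages

  # If receives are always posted before the matching sends:
  export MP_EAGER_LIMIT=0        # avoid eager copies into the setaside buffer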

Coding strategies that can improve MPI library performance

If communication bottlenecks still exist after you have changed POE environment variables as outlined above, then consider using these coding strategies:

Understanding kernel buffer communication for MPI jobs

This kernel buffer communication diagram shows how data flow between the user's MPI process and the nodes on IBM SP-cluster systems. By managing the volume of data moving from your process through the kernel memory and controlling the timing of the data moving through the kernel memory, you can achieve significant MPI performance improvements. The key is to avoid the long path through the receive pipe overflow buffer and the cache memories shown in red on the diagram. (This link opens a new browser window so you can alternate between the large diagram and the following text.)

You can maximize data communication efficiency under MPI if you use the low-latency path through the kernel memory shown in green on the diagram. From your user process, post your MPI_IRECV routines early in your loop calculations, then use the MPI_SEND routine to control the quantity of data moving into the send pipe in the kernel memory. Your process thus allows those data to be moved through the kernel memory, through the Direct Memory Access (DMA) device, and across the switch to the other nodes before your process sends the next quantity of data into the kernel memory. By controlling your memory usage in this way, your process also avoids conflict with data moving between other CPUs and kernel memory.

The kernel uses "pinned" memory pages, which are not swappable, for the kernel send/recv cyclic buffer. If too many data arrive at the receive pipe, a setaside transfer is enabled using the current values of the MP_EAGER_LIMIT and MP_BUFFER_MEM environment variables, and the most recent data are shunted to the receive pipe overflow buffer. This is when significant delays can occur, because the data in the overflow buffer must move through a processor and its caches one more time before returning to the user process memory. Further, these delays create more congestion: the processor involved in the data move is no longer working on the user code, and because its caches are flushed to do the data move, they must be reloaded before the processor can return to its work on the user code.
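The example script at the end of this section uses the settings that correspond to the green path: because every MPI_IRECV is posted before the matching MPI_SEND, the setaside buffer is not needed at all. In Korn shell format:

  export MP_SHARED_MEMORY=yes   # high-performance path for on-node MPI messages
  export MP_BUFFER_MEM=0        # no setaside buffer; receives are pre-posted
  export MP_EAGER_LIMIT=0       # no eager copies into a setaside buffer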

POE parameters and corresponding LoadLeveler directives for an MPI job

You can set POE environment variables (MP_*) with LoadLeveler directives (#@keyword=value) instead of putting the POE variables in the body of your script. Many programmers use this technique to make the directives more visible and accessible.

Reference: IBM Parallel Environment for AIX Version 2 Release 4.0: Operation and Use, Volume 2, Part 1 and Part 2.

Note: The environment variables described here can also be set on the POE command line.

Note: A relevant LoadLeveler directive is not always available because there is not a one-to-one correspondence between LoadLeveler directives and POE environment variables. Consult IBM's LoadLeveler manual IBM LoadLeveler for AIX, Using and Administering for a complete discussion of LoadLeveler directives. The MP_ variables are described in the "Environmental variables" section of IBM's manual Operation and Use, Using the Parallel Operating Environment.

Note: The following table provides example settings; these are not suggested values for the variables you should use for your programs.

Parameter description | POE environment variable (for interactive use) | LoadLeveler directive (for batch jobs)

How to use the communication network | MP_ADAPTER_USE=shared | #@network.MPI=csss,shared,us
Node and processor use | MP_CPU_USE=unique | #@node_usage=shared
Specifying the switch | MP_EUIDEVICE=csss | #@network.MPI=csss,shared,us
Specifying the communication protocol | MP_EUILIB=us | #@network.MPI=csss,shared,us
Specifying the number of program tasks | MP_PROCS="user specified" | #@total_tasks="user specified"
Specifying the message-passing library | MP_MSG_API=MPI | #@network.MPI=csss,shared,us   Note: You only need to set this explicitly if you are using LAPI, or both MPI and LAPI.
Specifying the interactive resource node pool | MP_RMPOOL=1 | N/A
Specifying the number of nodes wanted for this job | MP_NODES="user specified" | #@node="user specified"
Specifying how many tasks to put on a node | MP_TASKS_PER_NODE="user specified" | #@tasks_per_node="user specified"   Note: At NCAR, your LoadLeveler script or MPI environment must not specify more tasks per node than there are CPUs on a compute node. Use the spinfo command to find out how many CPUs are on the node your job will run on.
Parallel programming execution model | MP_PGMMODEL="user specified"   Note: spmd or mpmd; spmd is the default, and mpmd must be used with a cmdfile. | N/A
File containing pathlist of executables | MP_CMDFILE="path to file"   Note: MP_CMDFILE is used only if MP_PGMMODEL=mpmd. | N/A
Labeling where a stdout message is coming from | MP_LABELIO=yes | N/A
Specifying who gets to read stdin in SPMD mode | MP_STDINMODE=all | N/A
Specifying the reporting level of POE communication | MP_INFOLEVEL=3   Note: This is a good error-diagnostic level. | N/A
Specifying the kernel set-aside buffer for incoming MPI messages | MP_BUFFER_MEM=64MB   Note: The default is 64MB. This space is needed only if MPI_RECV or MPI_IRECV routines aren't posted before the message arrives. The disadvantages of this buffer are that it takes memory space from the heap and it greatly increases message latency, because one additional memory copy must occur when it is used. | N/A
Specifying the message length used to divide up the set-aside MPI buffer | MP_EAGER_LIMIT=0   Note: If you have the node to yourself and you post your MPI_RECV and MPI_IRECV routines before your MPI_SEND routines, then you can set MP_BUFFER_MEM=0 and get the smallest MPI message latency. | N/A
Specifying high-performance internal communication between tasks on a node | MP_SHARED_MEMORY=yes | N/A

Example demonstrating POE function

Command script "run.exmpl"

This script can be executed both interactively and in batch:

Interactively:   run.exmpl

Batch:   llsubmit run.exmpl

--------------------------------------------------------------------------

#! /bin/ksh
# The following are LoadLeveler batch directives. These directives
# will be ignored if this script is run interactively (rather than
# submitted with llsubmit).
#
#@job_name=simple_clone
#@output=$(job_name).$(jobid).out
#@error=$(job_name).$(jobid).err
#@job_type=parallel
#@network.MPI=csss,shared,us
#@node_usage=shared
#@total_tasks=2
#@node=1
#@class=share
#@queue
#
# A simple MPI program
#
# Create a file named clone.f to debug:

cat > clone.f << EOF
      program imawho
      include 'mpif.h'
      integer istatus(MPI_STATUS_SIZE)
      real*8 msg0(1000),msg1(1000)
! Initialize arrays and message tags (tags must agree on both tasks):
      itag0=0
      itag1=1
      do i=1,1000
      msg0(i)=0.0
      msg1(i)=0.0
      end do
! Start MPI and connect to other process:
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD,irank,ierr)
! Calculation loop:
      do j = 1, 10
! Post MPI_Irecvs in each clone:
      if (irank .eq. 0) then
        call mpi_irecv(msg0,1000,MPI_REAL8,1,itag0,MPI_COMM_WORLD,ireq0,ierr)
      else
        call mpi_irecv(msg1,1000,MPI_REAL8,0,itag1,MPI_COMM_WORLD,ireq1,ierr)
      end if
! Calculate contents of send buffer:
      do i=1,1000
        if(irank .eq.0) then
                msg1(i)=msg1(i)+irank+i
        else
                msg0(i)=msg0(i)+irank+i
        endif
      end do
! MPI_send msg to other process:
        if( irank .eq. 0) then
          call mpi_send(msg1,1000,MPI_REAL8,1,itag1,MPI_COMM_WORLD,ierr)
          call mpi_wait(ireq0,istatus,ierr)
        else
          call mpi_send(msg0,1000,MPI_REAL8,0,itag0,MPI_COMM_WORLD,ierr)
          call mpi_wait(ireq1,istatus,ierr)
        end if
! End j step loop:
      end do
!
      print*,' At step=',(j-1)
      print*,' process 0 sen(t|ding)= ',(msg1(i),i=1,10)
      print*,' process 1 sen(t|ding)= ',(msg0(i),i=1,10)
! Terminate MPI processes:
      call mpi_finalize(ierr)
      stop
      end
EOF
#
# Compile clone.f to clone, with -g option on!
#
mpxlf_r -g -qfixed=132 -o clone clone.f
#
#
# If this job is interactive, then execute the next if clause:
#
# Recall that in the Korn shell an exit status of zero means "true"; when
# LOADBATCH is unset (an interactive run), the empty command succeeds and
# the clause below is executed.
#
if $LOADBATCH
then
export MP_ADAPTER_USE=shared
export MP_EUIDEVICE=csss
export MP_EUILIB=us
export MP_MSG_API=MPI
export MP_CPU_USE=multiple
export MP_PGMMODEL=spmd
export MP_PROCS=2
export MP_NODES=1
export MP_RMPOOL=1
fi
#----------------------------------------------------------------------------

# The following are set regardless of whether this is an interactive
# or a LoadLeveler run:
#
# Set programming mode and MPI performance parameters:

export MP_PGMMODEL=spmd
export MP_SHARED_MEMORY=yes

# Set stdout options

export MP_LABELIO=yes
export MP_STDOUTMODE=ordered
export MP_INFOLEVEL=2

echo `env |grep "MP_"`

# Since MPI_IRECVs are posted before MPI_SENDs, no setaside buffer is
# needed, thus saving 64 MB of memory, and the eager limit can be set
# to zero, which minimizes the message moves:

export MP_BUFFER_MEM=0
export MP_EAGER_LIMIT=0

# end of environment setting ----------------------------------------

# Execute a two-MPI-process job with POE:

poe clone

# Remove the evidence:

rm clone*

# --------------------------------------------------------------------

Here are the results of running the command script "run.exmpl".

** imawho   === End of Compilation 1 ===
1501-510  Compilation successful for file clone.f.
MP_SHARED_MEMORY=yes MP_EUIDEVICE=csss MP_PROCS=2 MP_NODES=1
MP_STDOUTMODE=ordered MP_RMPOOL=1 MP_INFOLEVEL=2 MP_ADAPTER_USE=shared
MP_MSG_API=MPI MP_PGMMODEL=spmd MP_EUILIB=us MP_LABELIO=yes
MP_CPU_USE=unique
INFO: 0031-364  Contacting LoadLeveler to set and query information for
interactive job

Default project 43310028 being used since none was specified in script...

llsubmit: Processed command file through Submit Filter:
"/usr/local/loadl/submit_filter".
INFO: 0031-380  LoadLeveler step ID is bb0001en.75931.0
INFO: 0031-119  Host bb0005en allocated for task 0
INFO: 0031-119  Host bb0005en allocated for task 1
   0:INFO: 0031-724  Executing program: 
   1:INFO: 0031-724  Executing program: 
   0:INFO: 0031-619  MPCI library was compiled at  Thu Aug 30 16:01:22 2001
   0:
   0:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
   1:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
INFO: 0031-656  I/O file STDOUT closed by task 0
INFO: 0031-656  I/O file STDERR closed by task 0
INFO: 0031-656  I/O file STDOUT closed by task 1
INFO: 0031-656  I/O file STDERR closed by task 1
INFO: 0031-251  task 0 exited: rc=0
INFO: 0031-251  task 1 exited: rc=0
   0:  At step= 10
   0:  process 0 sen(t|ding)=  10.0000000000000000 20.0000000000000000
30.0000000000000000 40.0000000000000000 50.0000000000000000
60.0000000000000000 70.0000000000000000 80.0000000000000000
90.0000000000000000 100.000000000000000
   0:  process 1 sen(t|ding)=  20.0000000000000000 30.0000000000000000
40.0000000000000000 50.0000000000000000 60.0000000000000000
70.0000000000000000 80.0000000000000000 90.0000000000000000
100.000000000000000 110.000000000000000
   1:  At step= 10
   1:  process 0 sen(t|ding)=  10.0000000000000000 20.0000000000000000
30.0000000000000000 40.0000000000000000 50.0000000000000000
60.0000000000000000 70.0000000000000000 80.0000000000000000
90.0000000000000000 100.000000000000000
   1:  process 1 sen(t|ding)=  20.0000000000000000 30.0000000000000000
40.0000000000000000 50.0000000000000000 60.0000000000000000
70.0000000000000000 80.0000000000000000 90.0000000000000000
100.000000000000000 110.000000000000000
INFO: 0031-639  Exit status from pm_respond = 0

---------------End of Example-----------------------------------------


If you have questions about this document, please contact SCD Customer Support. You can also reach us by telephone 24 hours a day, seven days a week, at 303-497-1278. Additional contact methods: email consult1@ucar.edu, or visit NCAR Mesa Lab Suite 39 during business hours.

© Copyright 2002-2003. University Corporation for Atmospheric Research (UCAR). All Rights Reserved.
