Running Jobs

  • Modules and user environment
  • Batch System
  • Submitting jobs
  • File System
  • Example Scripts
  • Mixed Mode

1. Introduction

The default environment selects the most generic set of compilers and HPC-related libraries on the system. These are:
  • Intel compiler
  • Open MPI (message passing library)
  • SGE (batch scheduler)
  • Institutional module
The user environment on Polaris is managed using the popular module package often found on HPC clusters and some Linux distributions (with some modifications described below). Software packages can be easily added or removed with simple commands. 

2. Module structure and institutional modules

Modules are organised by categories such as applications, compilers and libraries, with a number of applications and software packages in each category. In addition, some institutions will also install software specifically for their own users (e.g. due to licences); this software is installed in that institution's module area. To make software from a specific category available, the module category must first be loaded. For instance, to make the modules from York available for loading, you must load the module via the command module add york. By default, each user will have their parent institution's module loaded and its software available. 
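For example, to load the York institutional module and make its software available:

$ module add york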

3. Examining modules

Information on the currently loaded modules can be found with the command:

$ module list
This will return a list of currently loaded modules. For example, the modules loaded by default for a Leeds user are:
Currently Loaded Modulefiles:
  1) licenses           3) bit/64             5) openmpi/1.6.1      7) user
  2) sge                4) intel/12.1.5.339   6) leeds
Details of alternatives and additions to the environment that are generally available can be found with the module avail command; they are listed according to their type:
------------------- /apps/Modules/infrastructure ---------------------
lancaster licenses manchester sge system test user

--------------------- /apps/Modules/developers/architecture ---------------
bit/64

--------------- /apps/Modules/developers/compilers --------------------
gnu/4.7.1 gnu/native(default) intel/12.1.5.339 java/7u4

-------------------- /apps/Modules/developers/libraries --------------------
acml/5.2.0     fftw/3.3.2   intelmpi/4.1.0.024   mvapich2/1.8     pgplot/5.2.2
atlas/3.10.0   gsl/1.15     mkl/10.3u11          netcdf/4.2.1.1
fftw/2.1.5     hdf5/1.8.9   mpe2/1.3.0           openmpi/1.6.1

-------------------- /apps/Modules/applications -------------------
R/2.15.1
Other useful commands:

Command                  What
module whatis            Gives a one-line description of all available modules
module show <module>     Shows what environment variables a module will modify
module help <module>     Gives more information about the package (e.g. website, licence)
module purge             Unloads all modules from the environment
module find <module>     Searches institutional modules as well as central modules
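For example, to inspect the environment changes a module would make (using a version from the listing above):

$ module show fftw/3.3.2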
 

3.1. Listing contents of institutional modules

The module avail command will list all generally available software, i.e. it will exclude the packages installed in other institutions' module areas. As discussed above, some software packages/applications are installed in institutional modules, and by default users will have their parent institution's module loaded at login. To list the software in the additional categories/institutional modules, users will need to use the command module find, which is a locally developed and implemented extension to the standard module package commands. A simple module find will list all available modules in all categories. For example:

$ module find

------------------- /apps/york/Modules --------------------
blast/2.2.27+ ngspice/2012-12-20 ngspice/2012-12-20-noX ngspice/24 ngspice/24-noX

------------------- /apps/manchester/Modules --------------
OpenFOAM/2.1.1 libsamplerate/0.1.8 libsndfile/1.0.25

------------------ /apps/Modules/system --------------------
IMB/3.2.3 bonnie/1.03e i7z/0.27.1 iozone/3.318 stress/1.0.0

------------------ /apps/leeds/Modules ---------------------
molpro/2010.1.26

------------------ /apps/Modules/infrastructure ------------
durham lancaster leeds licenses liverpool manchester newcastle sge sheffield system test user york

------------------ /apps/Modules/developers/architecture ------
bit/64

------------- /apps/Modules/developers/compilers ---------
gnu/4.7.1 gnu/native(default) intel/12.1.5.339 java/7u4

------------------ /apps/Modules/developers/libraries ----------
acml/5.2.0     fftw/2.1.5   gsl/1.15     intelmpi/4.1.0.024   mkl/10.3u11      mvapich2/1.8     openmpi/1.6.1
atlas/3.10.0   fftw/3.3.2   hdf5/1.8.9   itac/8.0.3.007       mpe2/1.3.0       netcdf/4.2.1.1   pgplot/5.2.2

------------------ /apps/Modules/applications ---------------
R/2.15.1
The command module find <name> will search all the module files for the application <name>. For instance,

$ module find blast

--------------------- /apps/york/Modules ----------------------------------
blast/2.2.27+
In contrast, the command module avail blast will return nothing, as this package resides in the York module area. Please note that the search is based on the module name, and will not match partial names. 

4. Adding and removing modules

To add/load a software package to your environment through modules, use module add <module>. For example, the following command will add version 3.3.2 of the fftw library to your environment: 

$ module add fftw/3.3.2
If the version number (everything after the / character) is omitted, the default version of that software package will be added - normally the most recent. Note that the default version of a package may be changed without warning. The commands module rm <module> or module unload <module> remove a particular module from your environment. Sometimes a module will refuse to be added, because it conflicts with another already part of your environment. For example, this will happen if:
  • a module providing a particular version of some software is already loaded, and an attempt is made to load a second version
  • a module providing some functionality (e.g. Fortran/C/C++ compilers, MPI, BLAS/LAPACK libraries) is already loaded, and an attempt is made to load another module providing similar functionality
If this happens, you will see a message identifying the modules involved. For example, if the Intel compilers are already loaded and you attempt to load the GNU GCC compilers:

$ module add gnu
gnu/native conflicts with loaded module 'intel/12.1.5.339'
If you also have a package depending on a set of compilers being present, a simple module rm intel/12.1.5.339 will not resolve the problem:

$ module unload intel
intelmpi/4.0u3 depends on one of the module(s) 'gnu intel pgi'
The easy way to resolve this is to use the module switch command:

$ module switch intel gnu
 

5. Module conventions

There are some general rules that apply to modules on the system:
  • Modules providing C/C++/Fortran compilers always set the variables CC, CXX, FC, F77 and F90 to the appropriate value for that compiler (see the sketch after this list). This allows automated tools such as GNU configure scripts to automatically use the selected compiler. It also means that the gnu/native module ought to be loaded if you want to use the Operating System-provided GNU GCC compiler.
  • Library modules providing BLAS and LAPACK linear algebra interfaces set N8_LINALG_FFLAG and N8_LINALG_CFLAG variables, providing appropriate Fortran and C link flags for that implementation. Please look at the numerical libraries page.
  • Sometimes it is useful to look at the installed files of a given package, e.g. if tutorials are installed. A <MODULE>_HOME variable is generally present to allow you to find them.
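A minimal sketch of these conventions in use; the icc/ifort values shown are assumptions for the Intel compiler module, and prog.f90 is a hypothetical source file:

$ module add intel
$ echo $CC $FC
icc ifort
$ ./configure                              # configure picks up CC/CXX/FC from the environment
$ $FC prog.f90 $N8_LINALG_FFLAG -o prog    # link BLAS/LAPACK via the module-provided flag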
 

6. Dealing with flavours

A common problem on general-purpose HPC machines with multiple compiler versions and MPI implementations is that installed libraries need to be compiled against each of those versions. On some systems this can lead to multiple, very similar, module names (e.g. fftw/3.3.2/gnu/openmpi/1.6.0, fftw/3.3.2/intel, etc.) that can be difficult to manage. Polaris takes a slightly different approach. There is a single module for each version of software, e.g. fftw/3.3.2. Inside that module, the software has been compiled against each permutation of its dependencies. We call these permutations flavours. The module's job is to examine what other modules are loaded, and select the appropriate flavour. 

$ module add fftw/3.2.2
$ echo $FFTW_HOME
/apps/libraries/fftw/3.2.2/1/bit-64-intel-12.1.5.339-intelmpi-4.0u3
$ module switch intel gnu/native
$ echo $FFTW_HOME
/apps/libraries/fftw/3.2.2/1/bit-64-gnu-native-intelmpi-4.0u3
Here you can see that adding module fftw/3.2.2 on top of those loaded by default automatically selects a version compiled against the Intel compiler and Intel MPI, in 64-bit mode. If you then change your compiler from Intel to GNU GCC, the fftw/3.2.2 module is automatically updated to give you the appropriate version. If you find this disconcerting, setting the environment variable MODULE_MESG to flavour will provide more feedback on what is going on. For example, switching the Intel compiler back in:

$ export MODULE_MESG=flavour

$ module switch gnu intel/12.1.5.339
intelmpi/4.0u3 - loading flavour bit-64-intel-12.1.5.339 revision 1
fftw/3.3.2 - loading flavour bit-64-intel-12.1.5.339-intelmpi-4.0u3 revision 1

7. How to get your test jobs running quickly

A number of users have requested that we implement a test queue on Polaris to facilitate the running of very short jobs, as some users feel that they have to wait too long in the normal queue just to run a short test job. However, with careful selection of the resources requested in the job script (i.e. a short runtime, low core count and low memory requirement), test jobs should be able to make use of a scheduler optimisation known as backfill. Although this may mean that your job will not run to completion, it will run long enough for you to confirm it is behaving as expected.

Backfill allows the scheduler to make better use of the resources available on the cluster by running some jobs out of order. When SGE (GridEngine, the scheduler) schedules a job start time, it prioritises the jobs in the queue according to a number of factors and then puts the jobs into a sorted list with the highest priority jobs first. The scheduler progresses through the list, starting the jobs one at a time until it gets to a job it cannot start because sufficient resources are not available on the cluster. For those jobs that cannot be started, the scheduler reserves resources so that the job can be started at some point in the future.

However, these jobs will not use up all of the available resources on the cluster; there will naturally be some holes in the schedule. The scheduler tracks these resources and can therefore determine which additional, lower priority, jobs can be started so as not to delay the start time of the highest priority jobs. Although it isn’t possible to guarantee that jobs are eligible for backfill, the chance of a job being eligible can be improved by carefully selecting the resources requested in your submission script.

For a typical test job, you might be interested in confirming that your script works as expected or your application reads or writes data correctly. You might therefore request: a single core, 30 minutes of runtime, 2GB memory and then submit it in the normal way using the qsub command. You should notice that it will start to execute sooner than a production job. Once you are happy that your application is behaving as expected, then it can be submitted to the queue with a more appropriate set of parameters.
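For example, a minimal test submission along these lines (script name hypothetical), using the qsub flags described under Submitting Jobs, might be:

$ qsub -l h_rt=0:30:00 -l h_vmem=2G test_script.sh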

Batch System

When you log in, you will be directed to one of several login nodes. These allow regular command line access to the system, necessary for the setup of runs, compiling and some analysis work. Usage is shared amongst all who are logged in, so these systems will very quickly become overloaded if they are used for regular computation. The compute power behind the system is accessible through a batch submission system. When a job executes through the batch system, processors on the back-end are made available, exclusively, for the purposes of running the job. The batch queue system installed is Son of Grid Engine version 8.1.1, plus locally developed and implemented patches. In order to interact with the batch system, the user must give some indication of the resources they require. At a minimum these include:
  • how long the job needs to run for
  • how many processors it needs (assumed to be 1 unless otherwise specified)
Armed with this information, the scheduler is able to dispatch the jobs at some point in the future when the resources become available. A fair-share policy is in operation to guide the scheduler towards allocating resources fairly between different Institutions, Faculties, projects and users. 

1. Resource reservation and backfill

By default all jobs are eligible for resource reservation, in that the scheduler will ensure the highest priority jobs have their start times booked in the future. The qsched -a command can be used to generate a list of the anticipated start times of these jobs. At the moment only the top 128 jobs are considered for resource reservation. The system will backfill jobs if they will start and finish before the highest priority jobs are scheduled to start. Therefore, indicating a realistic runtime for a job (rather than the queue maximum) will make short jobs eligible to be backfilled, potentially shortening their wait time. There is also a facility to book an amount of HPC resource for some time in the future, through advance reservation. Jobs eligible to run in that reservation can then be submitted to run within it. Advance reservation is not enabled for users by default; however, reservations can be enabled upon request, provided there is a valid case for their use and the fairness policies allow it. 
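For example, to see the anticipated start times mentioned above:

$ qsched -a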

2. Queue configuration

Currently the facility is configured with a single general access queue, allowing submission to all available compute resources. Thus, there is no need to specify a queue name in job submissions. 

2.1. Time limits

Jobs requesting a time up to the maximum runtime of the queue are eligible to be run. At the moment the maximum runtime is 48 hours. Should a job run beyond the length of time it requested, it will be killed by the queuing system. To change the time requested by a batch job, change the time specified in the -l h_rt flag, e.g.: 
$ qsub -l h_rt=6:00:00 script.sh
will request six hours of runtime. 

3. Memory usage

So that programs do not compete for the available memory on a machine, memory usage is treated as a consumable resource. This ensures that, for example, if one job is consuming 60GB of memory on a machine with 64GB in total, the maximum total size of all other jobs allowed to execute on that system is 4GB. By default a 1GB/process (or 1GB/slot) limit is defined for all batch jobs. To override this behaviour use the -l h_vmem switch to qsub, e.g. to run a 1 process code using 6GB of memory for 6 hours: 
$ qsub -l h_vmem=6G -l h_rt=6:00:00 script.sh
Memory is specified per slot, therefore: 
$ qsub -l h_vmem=2G -l h_rt=6:00:00 -pe smp 4 script.sh
will request a total of 8GB of memory, shared between 4 processes. Jobs will be run on nodes provided that the total memory requested per node does not exceed the physical memory of the node. Please note that if a job requests more memory than is physically available, the job will not run, though it will still show up in the queue. If an executing program exceeds the memory it requested, it will be automatically terminated by the queuing system. Note: we have modified the scheduler to make a better measurement of memory usage than on many other HPC clusters which also run the Grid Engine batch scheduler. You may find that jobs require less h_vmem to run on Polaris than on other machines. Polaris has 316 nodes (5,056 cores), each with a total of 64GB of memory. There are an additional 16 nodes (256 cores), each with a total of 256GB of memory. To access the latter you need to use the 
-l node_type=16core-256G
flag. Please see Submitting Jobs for more details. 
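For example, a job needing 10GB/process on the large memory nodes might be submitted as:

$ qsub -l node_type=16core-256G -l h_vmem=10G -l h_rt=6:00:00 script.sh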

4. Job submission

The general command to submit a job with the qsub command is as follows: 
$ qsub [options] script_file_name [--script-args]
where script_file_name is a file containing commands to be executed by the batch request. For commonly used options and more details about qsub, please look at the Submitting Jobs tab. For example submission scripts, please look at the Script Examples tab. 

4.1. Submitting shared-memory parallel jobs

Shared memory parallel jobs are jobs that run multiple threads or processes on a single multi-core machine; for instance, OpenMP programs are shared memory parallel jobs. There is a shared memory parallel environment (pe) called smp that is set up to enable the submission of these types of jobs. The option needed to submit this type of job is: 
-pe smp <cores>
For example: 
$ qsub -l h_rt=6:00:00 -pe smp 4 script.sh
will request 4 processes on a shared memory node, running for 6 hours. 

4.2. Submitting distributed parallel jobs

This type of parallel job runs multiple processes over multiple processors, either on the same machine or, more commonly, over multiple machines. A significant change made to the batch system on Polaris is that, in addition to the standard Grid Engine submission syntax, we have also implemented an alternative "nodes" syntax. This is designed to give jobs dedicated access to entire nodes. This should provide more predictable job performance, for instance due to placement and dedicated use of InfiniBand cards, as well as providing a more flexible specification of processes or threads for mixed-mode programming. It can take either of the following forms: 
-l nodes=<w>[,ppn=<y>][,tpp=<z>]
-l np=<x>[,ppn=<y>][,tpp=<z>]
Where:
  • w = number of nodes requested
  • x = number of processes requested
  • y = number of processes per node (rewrites MPI hostfile to this)
  • z = number of threads per process (sets OMP_NUM_THREADS to this)
If y and z are omitted, Grid Engine sets y = number of cores in each machine and z = 1. If y is present and z omitted, Grid Engine sets z = int(num cores / y). If z is present and y omitted, Grid Engine sets y = int(num cores / z). If using this syntax, the amount of memory available to the job on each node is automatically set to the node_type spec (i.e. 64G by default). These options also support mixed mode (MPI+OpenMP) programming. In addition, the standard Grid Engine method for requesting the number of cores is available via the parallel environment, in this instance pe ib. So the option needed would be: 
-pe ib <cores>
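For example, the two requests below both ask for 32 cores for 1 hour; the first (node syntax) takes exclusive use of two whole nodes, while the second (standard syntax) allocates 32 cores wherever they are available:

$ qsub -l h_rt=1:00:00 -l np=32 script.sh
$ qsub -l h_rt=1:00:00 -pe ib 32 script.sh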
 

5. Querying queues

The qstat command may be used to display information on the current status of Grid Engine jobs and queues. The basic format for this command is: 
$ qstat [switches]
Important switches are as follows:
Switch           Action
-help            Prints a list of all options
-f               Prints full display output of all queues
-g c             Prints a 'cluster queue' summary - good for understanding what resources are free across different queue types
-g t             Prints 'traditional' output, i.e. a line per queue used rather than a line per job
-u username      Displays all jobs for a particular username
The switches are documented in the man pages; for example, to check all options for the qstat command type: 
$ man qstat
By default, users will only see their jobs in the qstat output. To see all jobs use a username wildcard: 
$ qstat -u \*
 

6. Job deletion

To delete a job from the queues issue the following command: 
$ qdel jobid
where jobid is a number referring to the specified job (available from qstat). To force deletion of a running job, issue the following command: 
$ qdel -f jobid
A user can delete all their jobs from the batch queues through the command: 
$ qdel -u username
 

Submitting Jobs

This is a general description of commonly used options to qsub. Aside from those options specific to Polaris, you can find a full listing of options by looking at the manpage:
$ man qsub
 

Submitting jobs using qsub

The general syntax to submit a job with the qsub command is as follows:
$ qsub [options] script_file_name [--script-args]
where script_file_name is a file containing commands to be executed by the batch request. Commonly used options are:
Option (default, where applicable, in square brackets):

-l h_rt=hh:mm:ss
    The wall clock time (amount of real time needed by the job). This parameter must be specified; failure to include it will result in an error message. [Required]

-l h_vmem=memory
    Sets the limit of virtual memory required (for parallel jobs, per process). If this is not given it is assumed to be 1GB/process. If you require more memory than 1GB/process you must specify this flag, e.g. -l h_vmem=12G will request 12GB of memory. [1G]

-l h_stack=memory
    Sets the stack size to memory. [unlimited]

-help
    Prints a list of options.

-l nodes=x[,ppn=y][,tpp=z]
    Specifies a job for parallel programs using MPI. Assigns whole compute nodes. x is the number of nodes, y is the number of processes per node, z is the number of threads per process.

-l np=x[,ppn=y][,tpp=z]
    Specifies a job for parallel programs using MPI. Assigns whole compute nodes. x is the number of processes, y is the number of processes per node, z is the number of threads per process.

-pe ib np
    Specifies a job for parallel programs using MPI; np is the number of cores to be used by the parallel job.

-pe smp np
    Specifies a job for parallel programs using OpenMP or threads; np is the number of cores to be used by the parallel job.

-hold_jid prevjob
    Holds the job until the previous job (prevjob) has completed - useful for chaining runs together, or resuming runs from a restart file.

-l placement=type
    Choose optimal for launching a process topology which provides fully non-blocking communication, minimising latency and maximising bandwidth. Choose good for launching a process topology which provides 2:1 blocking communications. Choose scatter for running processes anywhere on the system without topology considerations. [good]

-l node_type=16core-64G
    Directs the job to machines with a specific number of cores and amount of memory. Available types are 16core-64G and 16core-256G. [16core-64G]

-t start-stop
    Produces an array of sub-tasks (loop) from start to stop, setting the $SGE_TASK_ID variable to identify the individual sub-tasks.

-cwd / -l nocwd
    Execute the job from the current working directory; output files are sent to the directory from which the job was submitted. If -l nocwd is used, they are sent to the user's home directory. [-cwd]

-m be
    Send mail at the beginning and at the end of the job to the owner.

-M email_address@<domain>
    Specify the mail address for the -m option. The default <username>@leeds.ac.uk will automatically redirect to your email address at your parent institution, which was used for your registration on the facility. [<username>@leeds.ac.uk]

-V
    Export all current environment variables to all spawned processes. Necessary for the current module environment to be transferred to the SGE shell. [Recommended]

-P project
    Account usage to a particular project. If a user belongs to multiple projects, specifying the project name is compulsory; it can be omitted for users with a single project membership. [The user's project, for members of a single project; no default for users with multiple projects]

N8 HPC File Systems

1. Home directories

On account creation, users are allocated a home directory with a quota of 10GB, and the variable $HOME is set up automatically to point at it. The home directories are backed up, providing some protection against data loss; however, this comes at the expense of capacity and performance. Exceeding your home directory quota can cause various issues, such as login problems and issues with X11 forwarding, as well as jobs failing if they write out to this location. To check your usage please use the command:
$ quota
In general, users should only store important data that they wish to have backed up in their home directory (for example environment files, source code and some input data). Files for running programs and data generated from applications should be directed to the /nobackup Lustre file system, described below. NOTE: University of Leeds Arc1 users will currently get access to their existing 5GB Arc1 home directory. It is anticipated that this will be increased to the same level as users from the other participating institutions. 

2. High speed parallel file system

The facility has a large, globally visible temporary file system mounted on /nobackup. It has a much higher capacity than the home directories and is optimised for input/output from running jobs. This file system is constructed using the Lustre parallel file system. It has been benchmarked to deliver 4GB/s of I/O throughput to the compute nodes over the InfiniBand network. It provides 175TB of usable disk, and there are no quotas enforced. Programs should be run from, and their data produced into, the /nobackup area. Important data should then be transferred from /nobackup to a suitable backed up location. It must be stressed that data on this file system is not backed up and will be periodically cleared to ensure enough capacity is available for running programs. If the file system fills up, jobs will potentially fail. Files unused for 90 days will be automatically deleted from the system. Users will generally receive automated warning emails, usually at two weeks and one week before the deletion of files, as well as a list of any files deleted. When using the file system for the first time, you are encouraged to make a directory under /nobackup to store your data, e.g.:
$ mkdir /nobackup/<username>
 

2.1. Important note regarding small file sizes

As part of its throughput optimisation settings, the /nobackup file system uses a block size of 1 MB for storage. This means that even files that are a few kilobytes in size will occupy a 1 MB block. As a consequence, small files can end up occupying a considerably larger chunk of disk than they actually need. For instance, a few MB of data spread over numerous files of a few kilobytes each can end up occupying several GB of disk space instead of a few MB. If you regularly use large numbers of small files, it may be best to compress them into a smaller number of large files, as they will use substantially less disk space. 
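For example, a directory of many small files (directory and file names here are hypothetical) can be bundled into a single compressed archive:

$ tar -czf small_files.tar.gz small_files_dir/
$ rm -r small_files_dir/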

3. Local node temporary space

The best I/O performance by far can be achieved by making use of the local disks attached to the compute nodes. Each node has a directory mounted on /scratch, which is only visible to the local node and is ~800GB in size. As there is a limited amount of disk space available, it is important that you move all data off the local disks after your job has executed. As filling up these local disks can cause jobs to fail, directories may be deleted at any time by the system administrator. Data on these disks is not backed up; it is therefore imperative that important data is not kept in these directories.
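A minimal sketch of a job script using the local disk. The directory layout under /scratch and the file names input.dat/results.dat are assumptions; $JOB_ID and $SGE_O_WORKDIR are standard SGE variables:

#$ -cwd -V
#$ -l h_rt=1:00:00
# create a per-job working directory on the node-local disk (layout assumed)
SCRATCH_DIR=/scratch/$USER/$JOB_ID
mkdir -p $SCRATCH_DIR
cp input.dat $SCRATCH_DIR
cd $SCRATCH_DIR
serial_prog input.dat
# copy results back to the submission directory and clean up the local disk
cp results.dat $SGE_O_WORKDIR
rm -rf $SCRATCH_DIR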

Example Scripts

This page contains a number of simple batch submission scripts. These can be used to submit jobs via the command:

$ qsub scriptname
The examples below are all independent of the executing shell. The default executing shell is the Bourne shell (/bin/sh). If you require a different shell (e.g. /bin/csh) then this can be specified in a #!/bin/csh directive at the top of the script. Directives to the batch scheduling system must be preceded by #$, so for instance to specify the current working directory add #$ -cwd to your script.

Script Examples

1. General queue settings

There are several default settings for the batch queue system:
  • The runtime must be specified, otherwise jobs will be rejected. The maximum runtime of all queues is 48 hours and no default value is set.
  • Unless otherwise specified, the default of 1GB/process (or 1GB/slot) is defined for all jobs.
  • Unless otherwise specified, jobs are executed from the current working directory, and output is directed to the directory from which the job was submitted. Please note this is different to standard Sun Grid Engine behaviour, which defaults to a user's home directory. The option -cwd is included in the scripts below for completeness, though it is not strictly needed.
  • Environment variables, set up by modules and licence settings for example, are not exported by default to the compute nodes. So, if not using the option to export variables (-V), modules will need to be loaded within the submission script.
 

1.1. Multiple projects

All users on the facility are members of research projects, with project codes in the form of N8HPC_PROJECT_NAME. These are used, in conjunction with other factors, for resource allocation.
  • Members of multiple projects will need to specify the project to which each job should be allocated. This is done via the -P directive to the batch system. So, in all the script examples below, please add the directive #$ -P N8HPC_PROJECT_NAME. Jobs with no project specified will be rejected by the batch system.
  • Users belonging to a single project will have their jobs attributed to that project by default and can safely omit the -P directive.
 

2. Serial jobs

 

2.1. Simple serial job

To launch a simple serial job, serial_prog for instance, you must at the very least specify the runtime. For example, for a job running in the current working directory (-cwd), exporting variables (-V), for 1 hour:

#$ -cwd
#$ -V
#$ -l h_rt=1:00:00
serial_prog
 

2.2. More memory

The default allocation is 1GB/slot; to request more memory use the -l h_vmem option. For example, to request 1500M of memory:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l h_vmem=1500M
serial_prog
 

2.3. Using the large-memory nodes

By default, jobs are directed to the 16 core, 64GB per node machines, which form the majority of the facility and are designated 16core-64G. There are also large memory nodes, which have 16 cores and 256GB per node, with designation 16core-256G. To make use of the large memory nodes, use the option -l node_type=16core-256G. For example, for a job requesting 70GB of memory, running for one hour from the current working directory and exporting the current environment:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l h_vmem=70G
#$ -l node_type=16core-256G
serial_prog
 

2.4. Task array

To run a large number of identical jobs, for instance for parameter sweeps or a large number of input files, it is best to make use of task arrays. The batch system will automatically set the variable $SGE_TASK_ID to correspond to the task number, and input and output files can be indexed by the task number. For instance, running tasks 1 to 100:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -t 1-100
serial_prog ${SGE_TASK_ID}
SGE_TASK_ID will take the values 1, 2, ..., 100.
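For instance, to index input and output files by the task number (the file names here are hypothetical):

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -t 1-100
serial_prog input_${SGE_TASK_ID}.dat > output_${SGE_TASK_ID}.log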

3. Parallel jobs

 

3.1. Shared memory

Shared memory jobs should be submitted using the -pe smp <cores> flag. The maximum number of cores that can be requested for shared memory jobs is limited by the number of cores available in a single node; in the case of Polaris this is 16 cores. Note that the OMP_NUM_THREADS environment variable is automatically set to the requested number of cores by the batch system. To run a 16 process shared memory job, for 1 hour:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -pe smp 16
openmp_code
 

3.2. Larger-Shared memory

To request more than the default 1GB/process of memory, use the -l h_vmem flag. For instance, to request 8 processes and 3500M/process:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l h_vmem=3500M
#$ -pe smp 8
openmp_code
Please note that the majority of Polaris currently comprises 16 core nodes with a total of 64GB of memory, which can be thought of as 4GB/core. 

3.3. Large-Shared memory to large-memory nodes

To run a large memory job on the large memory nodes, use the -l node_type=16core-256G flag. For example, running 16 processes at 16GB/process:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l h_vmem=16G
#$ -pe smp 16
#$ -l node_type=16core-256G
openmp_code
 

3.4. MPI jobs

On Polaris a number of locally developed patches are applied to the batch system. A significant change is the way resources for parallel MPI jobs can be requested. 

3.4.1. Node syntax

The preferred method for large jobs to request the required number of cores is using the -l np flag. This will ensure that jobs are given exclusive use of entire nodes. In addition, nodes will be allocated in order to minimise latency, by giving jobs the best possible placement. For example to request 64 processes for 1 hour:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l np=64
mpirun mpi_prog
Using this syntax will allocate exclusive use of nodes, with all available memory in each node. In the case of standard nodes, the above will give 4 nodes, each with 16 cores and 64GB of memory, or 4GB/core. Alternatively, you can explicitly request the number of nodes. For example, to request 4 nodes and use all the cores available in those nodes, use:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l nodes=4
mpirun mpi_prog
In the case of the default nodes, this will give 64 cores and all the memory available in those nodes, i.e. 4GB/core. To request 8 nodes with 8 processes per node, use:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l nodes=8,ppn=8
mpirun mpi_prog
This will provide exclusive use of 8 nodes, with 8 processes per node and all the memory, i.e. 64 cores with 8GB/core. However, please note that the output of qstat will reflect that the job is occupying 8 full nodes, i.e. 128 cores. 

3.4.2. Standard SGE syntax

The standard SGE submission syntax is also available, and is better suited to smaller jobs. To request 64 processes for 1 hour:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -pe ib 64
mpirun mpi_prog
This will allocate 64 processes, not necessarily on the same node but will guarantee a minimum of 2:1 blocking. 

3.4.3. Larger-memory MPI jobs

When using the node syntax, the job is by default allocated all the memory in a node, so you will not usually need to adjust the memory requirement. When using the standard SGE syntax, more memory can be requested by using the -l h_vmem flag. For example, to ask for 64 cores with 4GB/core for 1 hour: 

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l h_vmem=4G
#$ -pe ib 64
mpirun mpi_prog
 

3.5. Large-memory MPI jobs to large memory nodes

 

3.5.1. Node syntax

To request 64 cores on the large memory nodes, i.e. 16GB/core:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l np=64
#$ -l node_type=16core-256G
mpirun mpi_prog
The above will allocate exclusive use of 4 large memory nodes and is equivalent to:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l nodes=4
#$ -l node_type=16core-256G
mpirun mpi_prog
 

3.5.2. Standard SGE syntax

To request 64 cores on the large memory nodes, asking for 16GB/core, using the standard syntax:

#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -pe ib 64
#$ -l h_vmem=16G
#$ -l node_type=16core-256G
mpirun mpi_prog
This will allocate any available 64 cores on the large memory nodes. 

4. Mixed Mode Programming

Please see the "Mixed Mode Programming" tab for information on mixed mode programming.

Mixed Mode Programming

The batch system also supports mixed mode (MPI+OpenMP) programming. Polaris supports the MPI+OpenMP programming model, typically with MPI processes across nodes and OpenMP threads on each node, with the total number of processes (MPI*OpenMP) equalling the number of physical processor cores. Your code will need to call MPI_Init and make use of OpenMP directives. You will compile your code using an MPI wrapper, enabling OpenMP support, for example:
mpif90 -openmp example.f90 -o mixed.exe
You will need to determine ppn, the number of MPI processes per node, and tpp, the number of OpenMP threads per MPI process. Additionally, you can either ask for a given number of nodes, nodes, or for the total number of MPI processes, np. Note that ppn is related to np since ppn=np/nodes. Your qsub command would then be either
qsub -V -l h_rt=01:00:00 -l nodes=$nodes,ppn=$ppn,tpp=$tpp script
or
qsub -V -l h_rt=01:00:00 -l np=$np,ppn=$ppn,tpp=$tpp script
The script will then contain an mpirun command of the form:

mpirun ./a.out

Please note you do not need to supply a -np argument to mpirun. Given there are 16 cores per node, you would typically ensure ppn*tpp=16. 

1. Example

To run an MPI+OpenMP executable mixed.exe with 64 MPI processes, each launching 4 OpenMP threads, the following qsub command
qsub -V -cwd -b y -l h_rt=01:00:00 -l np=64,ppn=4,tpp=4 mpirun ./mixed.exe
will allocate 16 nodes (16*16=256 cores). Each node will have 4 MPI processes, each of which will have 4 OpenMP threads (so 4*4=16 processes per node in total, and 16*16=256 (=64 MPI * 4 OpenMP) processes in total). Alternatively, the same effect can be achieved by
qsub -V -cwd -b y -l h_rt=01:00:00 -l nodes=16,ppn=4,tpp=4 mpirun ./mixed.exe
Note that the OMP_NUM_THREADS environment variable is automatically set by the batch system and so you do not need to set this in your environment. 

1.1. Job Scripts

The equivalent job scripts for the above examples are as follows. To run an MPI+OpenMP executable mixed.exe with 64 MPI processes, each launching 4 OpenMP threads (therefore 4 MPI processes per node):
#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l np=64,ppn=4,tpp=4
mpirun ./mixed.exe
Alternatively, the same effect can be achieved by specifying the number of nodes (16 in this case) 
#$ -cwd -V
#$ -l h_rt=1:00:00
#$ -l nodes=16,ppn=4,tpp=4
mpirun ./mixed.exe
