
SLURM

This documentation only covers site-specific topics. Please use the web to find solutions and/or read the SLURM documentation:

SLURM Documentation
Cheatsheet

Short SLURM Introduction

There are basically two methods of running a SLURM job: using srun and sbatch. We will skip salloc for now; check the man page salloc(1) if you think you need it. Running a simple job is easy using srun:

$ srun hostname
node404.cluster

Running it on two nodes:

$ srun --nodes=2 hostname
node404.cluster
node405.cluster

Two tasks on the same node:

$ srun --ntasks=2 hostname
node404.cluster
node404.cluster

Note that srun blocks the shell and only returns once the job has finished. This is usually not the intended behavior, so let's take a look at using sbatch.

The following simple sequential job sets the job name, the number of tasks and the memory per CPU (--mem-per-cpu). The batch job prints all SLURM environment variables describing the allocation. It also prints our site-specific SCRATCH variable, which always points to the local scratch area on the allocated node.

#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=128

echo
echo "SLURM ENVIRONMENT"
echo "-----------------"
env | grep SLURM_ | sort | while read var; do
    echo " * $var"
done

echo
echo "STENO ENVIRONMENT"
echo "-----------------"
echo " * SCRATCH=$SCRATCH"

# Two job "steps"
srun hostname
srun sleep 10

The job can be submitted using the sbatch command. If you have submit access to multiple partitions, you can specify the correct one using -p <partition>. If you have multiple “bank accounts”, you can specify the correct one using -A <account>, but none of you should need that for now.
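
For example, assuming the script above is saved as slurmtest.sh (the file name and the job ID in the output are only illustrative):

$ sbatch -p <partition> slurmtest.sh
Submitted batch job 123456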

Note that SLURM distinguishes between “jobs” and “job steps”. Each srun command inside a job script denotes a “step” and is subject to the various job arguments such as --ntasks and --nodes. All commands not run through srun will be executed only once on the first allocated node (the node where the batch script is running).

Accounting is done for each step. This can be inspected at run-time using sstat <jobid>. To check up on your jobs, use squeue. Jobs may be canceled using scancel.
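
A minimal sketch of those commands (the job ID 123456 is just a placeholder):

$ squeue -u $USER      # list your queued and running jobs
$ sstat 123456         # per-step accounting for a running job
$ scancel 123456       # cancel the job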

Parallel jobs

MPI jobs

If you are using an MPI implementation with support for SLURM (most implementations have such support), a parallel job is just as simple:

#!/bin/bash
#SBATCH --job-name=barrier
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=128

# If not using the `mpi-selector` that ships with the Mellanox OFED
# (on machines with Mellanox InfiniBand devices), use `module` to load
# an implementation:
module load openmpi-x86_64

srun ./barrier 10000

Note the missing mpirun. SLURM detects that “barrier” is an MPI application and automatically launches the job according to the “SBATCH” specifications. That’s ice-cold cool, right?! Of course, you can still execute it with mpirun to take “personal” control of the launch.
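
A sketch of the mpirun variant, assuming the loaded Open MPI is built with SLURM support so it picks up the allocation by itself:

#!/bin/bash
#SBATCH --job-name=barrier
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mem-per-cpu=128

module load openmpi-x86_64

# Launch through mpirun instead of srun; a SLURM-aware Open MPI reads
# the process count and node list from the allocation.
mpirun ./barrier 10000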

Multithreaded jobs

These are jobs that run in one process with multiple threads.

#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --threads-per-core=1
#SBATCH --mem-per-cpu=128

# The program that starts the threads

./program

--threads-per-core=1 tells SLURM that it should only use one CPU per core. If you want to utilize Hyper-Threading, you can remove it.

Hybrid jobs

A mix of MPI and threading. This is done by setting --cpus-per-task to define the number of threads per MPI process.
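
A minimal sketch of such a hybrid job, assuming ./program is an MPI program that also spawns OpenMP threads (the name is only illustrative):

#!/bin/bash
#SBATCH --job-name=hybrid
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=128

module load openmpi-x86_64

# One MPI rank per task, each rank running 4 OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./program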

OpenMP jobs and Hyper-Threading

OpenMP starts as many threads as there are logical cores to run on. When you ask for a number of threads with --cpus-per-task, you will get that number of physical cores. However, with Hyper-Threading SLURM will give you access to all logical cores (typically two per physical core).

When you start an OpenMP program without telling it how many threads you want, it will use as many as are available. On systems with Hyper-Threading (most today), --cpus-per-task=4 will only give you 2 physical cores, each with 2 logical cores. If you only want one thread per physical core, you can set the number of threads explicitly like this:

#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --threads-per-core=1
#SBATCH --mem-per-cpu=128

# The program that starts the threads

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./program

Miscellaneous

Local scratch

The path to local node scratch storage is defined in the SCRATCH environment variable.
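
A sketch of the typical pattern, assuming input.dat, output.dat and ./program are placeholders for your own files: stage data to local scratch, run there, and copy the results back.

#!/bin/bash
#SBATCH --job-name=scratchtest
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=128

# Stage the input on the node-local scratch area and work from there
cp input.dat "$SCRATCH"
cd "$SCRATCH"

# Run the program (still located in the submit directory) on local disk
"$SLURM_SUBMIT_DIR"/program input.dat > output.dat

# Copy the results back before the job ends
cp output.dat "$SLURM_SUBMIT_DIR"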

Hyper-Threading

Hyper-threading is a technology supported by most nodes with Intel CPUs. For some applications it might improve performance.

It is off by default. If you want to use it, add the following option to your job script:

#SBATCH --ntasks-per-core=2

Note: By default SLURM will only allocate one task per core, but that task will have 2 CPUs on a Hyper-Threaded node. This option will give you two tasks per core.
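
A minimal sketch of a job that packs two tasks on each physical core (the task count and program name are only illustrative):

#!/bin/bash
#SBATCH --job-name=htjob
#SBATCH --ntasks=8
#SBATCH --ntasks-per-core=2
#SBATCH --mem-per-cpu=128

srun ./program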

Interactive sessions

A new feature is interactive sessions. You can request one with e.g. srun --pty bash. Even though SLURM uses cgroups to control how many cores a user can use, we have no control over memory consumption. Thus, it would often be best to request the node in exclusive mode using --exclusive as a parameter to srun. Note that this is NOT the default, so if you mess up each other's job allocations, that's on you ;)
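
For example:

$ srun --pty bash                # interactive shell on an allocated node
$ srun --exclusive --pty bash    # same, but with the whole node to yourself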

Fair-share scheduling

With SLURM we have introduced fair-share scheduling on all nodes. This means that users who are under-serviced will receive a higher priority in the queue. You can check up on your usage with sshare. For now, all users have the same shares, but this can of course be changed to reflect the wishes of the queue owners - just let us know.

In general we have far better methods of prioritizing users and managing the queues. If you have any specific requests or similar, please contact us and we’ll discuss it.