SLURM

This documentation only covers site-specific topics. Please use the web to find solutions and/or read the SLURM documentation:

SLURM Documentation
Cheatsheet

About -p partition

Note that we don't set a default partition, so the -p <partition> option is required. To get a list of partitions you have access to, you can use:

sacctmgr show associations User=$(whoami)
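
If you only want the account and partition names, a more compact query along these lines should also work (the exact fields may vary with the SLURM version):

# Parsable output without a header, showing only account and partition
sacctmgr -nP show associations User=$(whoami) format=Account,Partition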

Short SLURM Introduction

There are basically three ways of running a SLURM job: using srun, sbatch or salloc. Running a simple job is easy using srun:

$ srun -p partition hostname
node404.cluster

Running it on two nodes:

$ srun -p partition --nodes=2 hostname
node404.cluster
node405.cluster

Two tasks on the same node:

$ srun -p partition --ntasks=2 hostname
node404.cluster
node404.cluster

Note that running jobs with srun blocks the shell and only returns once the job has finished. This is usually not the intended behavior, so let's take a look at using sbatch.

The following simple sequential job sets the job name, the number of tasks and the memory for the job (--mem). The batch job prints all SLURM environment variables describing the allocation. It also prints our site-specific SCRATCH variable, which always points to the local scratch area on the allocated node.

#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --mem=1G

echo
echo "SLURM ENVIRONMENT"
echo "-----------------"
env | grep SLURM_ | sort | while read var; do
    echo " * $var"
done

echo
echo "STENO ENVIRONMENT"
echo "-----------------"
echo " * SCRATCH=$SCRATCH"

# Two job "steps"
srun hostname
srun sleep 10

The job must be submitted using the sbatch command, like this:

sbatch -p partition name-of-script.sh

If you have submit access to multiple partitions, you specify the correct one using -p <partition>. If you have multiple “bank accounts”, you can specify the correct one using -A <account>. Both options can go in the script as well:

#SBATCH --partition=name-of-partition
#SBATCH --account=name-of-account

Note that SLURM distinguishes between “jobs” and “job steps”. Each srun command inside a job script denotes a “step” and is subject to the various job arguments such as --ntasks and --nodes. All commands not run through srun will be executed only once on the first allocated node (the node where the batch script is running).

Accounting is done for each step. This can be inspected at run time using sstat -j <jobid>. To check up on your jobs, use squeue. Jobs may be canceled using scancel.
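
A few typical invocations (the job ID is a placeholder and the sstat field selection is just an example):

squeue -u $(whoami)                               # list your pending and running jobs
sstat -j <jobid> --format=JobID,AveCPU,MaxRSS     # per-step usage of a running job
scancel <jobid>                                   # cancel a job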

Parallel jobs

MPI jobs

If you are using an MPI implementation with support for SLURM (most implementations have such support), a parallel job is just as simple:

#!/bin/bash
#SBATCH --job-name=barrier
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mem=1G

# If not using the `mpi-selector` that ships with the Mellanox OFED
# (on machines with Mellanox InfiniBand devices), use `module` to load
# an implementation:
module load openmpi-x86_64

srun ./barrier 10000

Note the missing mpirun. srun launches the tasks according to the “SBATCH” specifications, and the SLURM-aware MPI implementation picks up the allocation automatically. That’s ice-cold cool, right?! Of course, you can still execute it with mpirun to take “personal” control of the launch.
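
If you do want to use mpirun, the launch line inside the same batch script could look roughly like this (assuming an Open MPI built with SLURM support, so it reads the allocation from the environment):

# Alternative to srun: let mpirun pick up the SLURM allocation
mpirun -np $SLURM_NTASKS ./barrier 10000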

Multithreaded jobs

These are jobs that run in one process with multiple threads.

#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --threads-per-core=1
#SBATCH --mem=1G

# The program that starts the threads

./program

--threads-per-core=1 tells SLURM that it should only use one logical core per physical core. If you want to utilize Hyper-Threading, you can remove it.

Hybrid jobs

A mix of MPI and threading. This is done by setting --cpus-per-task to define the number of threads per MPI process.
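
A hybrid job script could look roughly like this (the executable name is a placeholder; the module line matches the MPI example above):

#!/bin/bash
#SBATCH --job-name=hybrid
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --threads-per-core=1
#SBATCH --mem=1G

module load openmpi-x86_64

# Each MPI rank starts $SLURM_CPUS_PER_TASK OpenMP threads
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./hybrid-program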

OpenMP jobs and Hyper-Threading

OpenMP starts as many threads as there are logical cores available to it. Without Hyper-Threading, --cpus-per-task gives you that number of physical cores. With Hyper-Threading, however, SLURM counts logical cores and gives you access to all of them (typically two per physical core).

When you start an OpenMP program without telling it how many threads to use, it will start as many as are available. On systems with Hyper-Threading (most systems today), --cpus-per-task=4 will therefore only give you 2 physical cores, each with 2 logical cores. If you only want one thread per physical core, you can set the number of threads explicitly like this:

#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --threads-per-core=1
#SBATCH --mem=1G

# The program that starts the threads

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./program

Hyper-Threading and single threaded jobs

If you want to run single-threaded jobs, you need to add --cpu_bind=threads, i.e. run a job on each thread (usually 2 per core). Example:

#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --cpu_bind=threads
#SBATCH --mem=1G

./program

If you want to run a single thread on a core, then you can add:

#SBATCH --threads-per-core=1

Miscellaneous

Local scratch

The path to local node scratch storage is defined in the SCRATCH environment variable.
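
A typical pattern in a job script is to stage data to the local scratch area, run there, and copy the results back (file names below are placeholders):

# Copy input to node-local scratch, run there, copy results back
cp input.dat "$SCRATCH"/
cd "$SCRATCH"
./program input.dat > output.dat
cp output.dat "$SLURM_SUBMIT_DIR"/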

Hyper-Threading

Hyper-threading is a technology supported by most nodes with Intel CPUs. For some applications it might improve performance.

It is off by default. If you want to use it, add the following option to your job script:

#SBATCH --ntasks-per-core=2

Note: By default, SLURM will only allocate one task per core, but that task will have 2 CPUs on a Hyper-Threaded node. This option will give you two tasks per core.
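
A minimal job script that opts into Hyper-Threading could look like this (the program is a placeholder):

#!/bin/bash
#SBATCH --job-name=ht-test
#SBATCH --ntasks=2
#SBATCH --ntasks-per-core=2
#SBATCH --mem=1G

# The two tasks may share the two hardware threads of a single physical core
srun ./program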

Interactive sessions

With the update to 21.08.8 (May 2022), the steps to allocate an interactive session were changed to salloc -p partition-name, e.g.:

[user@fend01 ~]$ salloc -p astro_short -N4
salloc: Pending job allocation 33291741
salloc: job 33291741 queued and waiting for resources
salloc: job 33291741 has been allocated resources
salloc: Granted job allocation 33291741
salloc: Waiting for resource configuration
salloc: Nodes node[837-840] are ready for job
[user@fend01 ~]$ hostname
fend01.cluster
[user@fend01 ~]$ srun hostname
node838.cluster
node839.cluster
node837.cluster
node840.cluster
[user@fend01 ~]$ 
Notice how the allocation is for 4 machines, and how srun is needed to run the command on all 4.

Even though SLURM uses cgroups to control how many cores a user can use, we have no control over memory consumption. Thus, it would often be best to request the node in exclusive mode using --exclusive as a parameter to srun. Note that this is NOT the default, so if you mess up each other’s job allocations, that’s on you ;)
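
For example, to get a whole node to yourself for an interactive session (the partition name is taken from the example above; --exclusive is also accepted by salloc and sbatch):

salloc -p astro_short -N1 --exclusive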

More info is available at: NREL SLURM Changes and SLURM 20.11 - Release notes

Fair-share scheduling

With SLURM we have introduced fair share scheduling on all nodes. This means that users who are under-serviced will receive a higher priority in the queue. You can check up on your usage with sshare. For now, all users have the same shares, but this can of course be changed to reflect the wishes of the queue owners - just let us know.
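
For example, to see only your own share and usage:

sshare -u $(whoami)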

In general we have far better methods of prioritizing users and managing the queues. If you have any specific requests or similar, please contact us and we’ll discuss it.