SLURM
This documentation will only cover site specific topics.
Please, use the web to find solutions and/or read the SLURM documentation:
SLURM Documentation
Cheatsheet
About -p partition
Note that we don't set a default partition - so the -p partition
is needed. To get a list of partitions you have access to, you can use:
sacctmgr show associations User=$(whoami)
Short SLURM Introduction
There are basically three methods of
running a SLURM job; using srun, sbatch and salloc. Running a
simple job is easy using srun:
$ srun -p partition hostname
node404.cluster
Running it on two nodes:
$ srun -p partition --nodes=2 hostname
node404.cluster
node405.cluster
Two task on the same node:
$ srun -p partition --ntasks=2 hostname
node404.cluster
node404.cluster
Note that running jobs with srun blocks the shell and only returns
once the job has finished. This is usually not the intended behavior, so
let's take a look at using sbatch.
The following simple sequential job sets the job name, number of tasks and memory per task (--mem-per-cpu). The batch job
prints all SLURM environment variables describing the allocation. It also prints
out our site-specific SCRATCH variable which always points to the local scratch area on the
allocated node.
#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --mem=1G
echo
echo "SLURM ENVIRONMENT"
echo "-----------------"
env | grep SLURM_ | sort | while read var; do
echo " * $var"
done
echo
echo "STENO ENVIRONMENT"
echo "-----------------"
echo " * SCRATCH=$SCRATCH"
# Two job "steps"
srun hostname
srun sleep 10
The job must be submitted using the sbatch command like:
sbatch -p partition name-of-script.sh
If you have submit access to multiple partitions, you specify the correct one
using the -p<partition>. If you have multiple “bank
accounts”, you can specify the correct one
using -A <account>. Both options can go in the script as well:
#SBATCH --partition=name-of-partition
#SBATCH --account=name-of-account
Note that SLURM distinguishes between “jobs” and “job steps”. Each
srun command inside a job script denotes a “step” and is subject to
the various job arguments such as “–ntasks” and “–nodes”. All commands
not run through srun will be executed only once on the first allocated
node (the node where the batch script is running).
Accounting is done for each step. This can be inspected at run-time using sstat
<jobid>. To check up on your jobs, use squeue. Jobs may be canceled
using scancel.
Parallel jobs
MPI jobs
If using an MPI implementation with support for SLURM (most
implementations has such support), a parallel job is just as simple:
#!/bin/bash
#SBATCH --job-name=barrier
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --mem=1G
# If not using the `mpi-selector` that ships with the Mellanox OFED
# (on machines with Mellanox InfiniBand devices), use `module` to load
# an implementation:
module load openmpi-x86_64
srun ./barrier 10000
Note the missing mpirun. SLURM detects that “barrier” is an MPI
application an automatically launches the job according to the “SBATCH”
specifications. That’s ice-cold cool right?! Of course, you can still
execute it with mpirun to take “personal” control of the launch.
Multithreaded jobs
These are jobs that run in one process with multiple threads.
#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --threads-per-core=1
#SBATCH --mem=1G
# The program that starts the threads
./program
--threads-per-core=1 tells slurm that it should only use one logical core
per physical core. If you want to utilize Hyperthreading you can remove it.
Hybrid jobs
A mix of MPI and threading. This is done by setting --cpus-per-task to define the number of threads
per MPI process.
OpenMP jobs and Hyper-Threading
OpenMP starts as many threads as there are logical cores to run on. When you ask for a number of threads with
--cpus-per-task you will get that number of physical cores. However, with Hyper-Threading
SLURM will give you access to all logical cores (typical two per physical core).
When you start an OpenMP program without telling how many threads you want, it will use as many as available.
On systems with Hyper-Threading (most today), --cpus-per-task=4 will only give you 2 cores each with 2 logical cores. If you only want one thread per physical core, you can
set the number of threads explicitely like this:
#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --threads-per-core=1
#SBATCH --mem=1G
# The program that starts the threads
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./program
Hyper-Threading and single threaded jobs
If you want to run single threaded jobs you need to add --cpu_bind=threads. I.e. Run a job on each thread (usually 2 per core). Example:
#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --ntasks=1
#SBATCH --cpu_bind=threads
#SBATCH --mem=1G
./program
If you want to run a single thread on a core, then you can add:
#SBATCH --threads-per-core=1
Miscellaneous
Local scratch
The path to local node scatch storage is defined in the SCRATCH environment variable.
Hyper-Threading
Hyper-threading
is a technology supported by most nodes with Intel CPUs. For some applications it might improve
performance.
It is off by default. If you want to use it, add the following option to your job script:
#SBATCH --ntasks-per-core=2
Note: By default slurm will only allocate one task per core, but that task will have 2 CPUs on a Hyper-threaded node.
This option will give you two tasks per core.
With the update to 21.08.8 (May 2022), the steps to allocate an
interactive session was changed to:
salloc -p partition-name - eg:
[user@fend01 ~]$ salloc -p astro_short -N4
salloc: Pending job allocation 33291741
salloc: job 33291741 queued and waiting for resources
salloc: job 33291741 has been allocated resources
salloc: Granted job allocation 33291741
salloc: Waiting for resource configuration
salloc: Nodes node[837-840] are ready for job
[user@fend01 ~]$ hostname
fend01.cluster
[user@fend01 ~]$ srun hostname
node838.cluster
node839.cluster
node837.cluster
node840.cluster
[user@fend01 ~]$
notice how the allocation is for 4 machines - and how srun
is needed to run the command on all 4.
Even though SLURM uses cgroups to control how many
cores a user can use, we have no control over memory consumption. Thus,
I would often be best to request the node in exclusive mode using
–exclusive as a parameter to srun. Note that this is NOT the
default, so if you mess up each others job allocations, that’s on you ;)
More info is available at:
NREL SLURM Changes and
SLURM 20.11 - Release notes
Fair-share scheduling
With SLURM we have introduced fair share scheduling on all nodes. This
means that users who are under-serviced will receive a higher priority
in the queue. You can check up on your usage with sshare. For now, all
users have the same shares, but this can of course be changed to reflect
the wishes of the queue owners - just let us know.
In general we have far better methods of prioritizing users and managing
the queues. If you have any specific requests or similar, please contact
us and we’ll discuss it.
|