SCIENCE HPC Center - High Performance Computing Centre at the University of Copenhagen

	Documentation SLURM Lustre MPI Compilers VGL ssh Visual Studio Dynamic firewall OLD: Shared filesystem OLD: Volume Storage Account How to get an account Contact Contact
			SLURM This documentation will only cover site specific topics. Please, use the web to find solutions and/or read the SLURM documentation: SLURM Documentation Cheatsheet About `-p partition` Note that we don't set a default partition - so the `-p partition` is needed. To get a list of partitions you have access to, you can use: `sacctmgr show associations User=$(whoami)` Short SLURM Introduction There are basically three methods of running a SLURM job; using `srun`, sbatch and salloc. Running a simple job is easy using srun: `$ srun -p partition hostname node404.cluster` Running it on two nodes: `$ srun -p partition --nodes=2 hostname node404.cluster node405.cluster` Two task on the same node: `$ srun -p partition --ntasks=2 hostname node404.cluster node404.cluster` Note that running jobs with srun blocks the shell and only returns once the job has finished. This is usually not the intended behavior, so let's take a look at using sbatch. The following simple sequential job sets the job name, number of tasks and memory per task (--mem-per-cpu). The batch job prints all SLURM environment variables describing the allocation. It also prints out our site-specific SCRATCH variable which always points to the local scratch area on the allocated node. `#!/bin/bash #SBATCH --job-name=slurmtest #SBATCH --ntasks=1 #SBATCH --mem=1G echo echo "SLURM ENVIRONMENT" echo "-----------------" env \| grep SLURM_ \| sort \| while read var; do echo " * $var" done echo echo "STENO ENVIRONMENT" echo "-----------------" echo " * SCRATCH=$SCRATCH" # Two job "steps" srun hostname srun sleep 10` The job must be submitted using the sbatch command like: `sbatch -p partition name-of-script.sh` If you have submit access to multiple partitions, you specify the correct one using the -p<partition>. If you have multiple “bank accounts”, you can specify the correct one using -A <account>. Both options can go in the script as well: `#SBATCH --partition=name-of-partition #SBATCH --account=name-of-account` Note that SLURM distinguishes between “jobs” and “job steps”. Each srun command inside a job script denotes a “step” and is subject to the various job arguments such as “–ntasks” and “–nodes”. All commands not run through srun will be executed only once on the first allocated node (the node where the batch script is running). Accounting is done for each step. This can be inspected at run-time using sstat <jobid>. To check up on your jobs, use squeue. Jobs may be canceled using scancel. Parallel jobs MPI jobs If using an MPI implementation with support for SLURM (most implementations has such support), a parallel job is just as simple: #!/bin/bash #SBATCH --job-name=barrier #SBATCH --nodes=2 #SBATCH --ntasks-per-node=8 #SBATCH --mem=1G # If not using the `mpi-selector` that ships with the Mellanox OFED # (on machines with Mellanox InfiniBand devices), use `module` to load # an implementation: module load openmpi-x86_64 srun ./barrier 10000 Note the missing mpirun. SLURM detects that “barrier” is an MPI application an automatically launches the job according to the “SBATCH” specifications. That’s ice-cold cool right?! Of course, you can still execute it with mpirun to take “personal” control of the launch. Multithreaded jobs These are jobs that run in one process with multiple threads. `#!/bin/bash #SBATCH --job-name=slurmtest #SBATCH --ntasks=1 #SBATCH --cpus-per-task=4 #SBATCH --threads-per-core=1 #SBATCH --mem=1G # The program that starts the threads ./program` --threads-per-core=1 tells slurm that it should only use one logical core per physical core. If you want to utilize Hyperthreading you can remove it. Hybrid jobs A mix of MPI and threading. This is done by setting --cpus-per-task to define the number of threads per MPI process. OpenMP jobs and Hyper-Threading OpenMP starts as many threads as there are logical cores to run on. When you ask for a number of threads with --cpus-per-task you will get that number of physical cores. However, with Hyper-Threading SLURM will give you access to all logical cores (typical two per physical core). When you start an OpenMP program without telling how many threads you want, it will use as many as available. On systems with Hyper-Threading (most today), --cpus-per-task=4 will only give you 2 cores each with 2 logical cores. If you only want one thread per physical core, you can set the number of threads explicitely like this: `#!/bin/bash #SBATCH --job-name=slurmtest #SBATCH --nodes=2 #SBATCH --ntasks=2 #SBATCH --cpus-per-task=4 #SBATCH --threads-per-core=1 #SBATCH --mem=1G # The program that starts the threads OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK ./program` Hyper-Threading and single threaded jobs If you want to run single threaded jobs you need to add --cpu_bind=threads. I.e. Run a job on each thread (usually 2 per core). Example: `#!/bin/bash #SBATCH --job-name=slurmtest #SBATCH --ntasks=1 #SBATCH --cpu_bind=threads #SBATCH --mem=1G ./program` If you want to run a single thread on a core, then you can add: `#SBATCH --threads-per-core=1` Miscellaneous Local scratch The path to local node scatch storage is defined in the SCRATCH environment variable. Hyper-Threading Hyper-threading is a technology supported by most nodes with Intel CPUs. For some applications it might improve performance. It is off by default. If you want to use it, add the following option to your job script: `#SBATCH --ntasks-per-core=2` Note: By default slurm will only allocate one task per core, but that task will have 2 CPUs on a Hyper-threaded node. This option will give you two tasks per core. Interactive sessions With the update to 21.08.8 (May 2022), the steps to allocate an interactive session was changed to: salloc -p partition-name - eg: `[user@fend01 ~]$ salloc -p astro_short -N4 salloc: Pending job allocation 33291741 salloc: job 33291741 queued and waiting for resources salloc: job 33291741 has been allocated resources salloc: Granted job allocation 33291741 salloc: Waiting for resource configuration salloc: Nodes node[837-840] are ready for job [user@fend01 ~]$ hostname fend01.cluster [user@fend01 ~]$ srun hostname node838.cluster node839.cluster node837.cluster node840.cluster [user@fend01 ~]$` notice how the allocation is for 4 machines - and how srun is needed to run the command on all 4. Even though SLURM uses cgroups to control how many cores a user can use, we have no control over memory consumption. Thus, I would often be best to request the node in exclusive mode using –exclusive as a parameter to srun. Note that this is NOT the default, so if you mess up each others job allocations, that’s on you ;) More info is available at: NREL SLURM Changes and SLURM 20.11 - Release notes Fair-share scheduling With SLURM we have introduced fair share scheduling on all nodes. This means that users who are under-serviced will receive a higher priority in the queue. You can check up on your usage with sshare. For now, all users have the same shares, but this can of course be changed to reflect the wishes of the queue owners - just let us know. In general we have far better methods of prioritizing users and managing the queues. If you have any specific requests or similar, please contact us and we’ll discuss it.