Getting started with SLURM

For running computationally expensive tasks for long periods of time, you will want to create a job and run it on a cluster. This page is intended to teach you everything required to run your first job on a SLURM CPU cluster (sinister, volkhan). For convenience, we will refer primarily to sinister, but the information is transferable to other SLURM clusters, with a small note about SLURM versions here. This page does not contain information about PBS (dexter), but there is some information here. If you are running GPU (pat) jobs, you will need additional information from here.

Basic terminology

A cluster is a large computer that has many processors. The processors are grouped into nodes, which operate semi-independently of each other. Each node has its own disk (accessed at /scratch/) and memory, but is capable of communicating with other nodes through a network. When you log in to sinister with SSH, you are logged into the head node. This is a special node designed for interacting with users. The other nodes are compute nodes, designed for running long and intensive tasks. SLURM (Simple Linux Utility for Resource Management), now officially called the Slurm Workload Manager, is a programme that manages the compute nodes, allocating resources, starting and ending jobs, and managing the queue. To get the compute nodes to do anything, you need to ask through SLURM.
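
If you want to get a feel for what nodes a cluster has and what state they are in, SLURM's standard sinfo command is a quick way to look around. The exact partition and node names, and how much detail you see, depend on how the cluster is configured:

 $ sinfo        # summary of partitions and node states
 $ sinfo -N -l  # one line per node, with CPU and memory details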

How to not be anti-social

When you are using sinister, the most important thing to be aware of is that other people are using it too. Be careful not to act in a way that impedes other users from getting on with their work. Primarily this means not overloading the head node: if you do, other users will see very slow response times on their SSH sessions and will not be able to work. There are some specific things to avoid.

  • Do not run long expensive tasks on the head node: that is what the compute nodes are for. Anything longer than a few minutes should not be run on the head node. For a short expensive task (like compiling GMIN), prefix your command with 'nice', which tells the operating system to give your process a lower priority, meaning any simple commands other people might be running are not slowed down. You are not likely to notice a significant impact on the speed of your process by doing this.
 $ nice make
  • Do not rapidly copy a lot of data over the network. Although the nodes are networked together and moving data between them is a simple process (using NFS, the Network File System), it is relatively slow and computationally expensive. If you constantly move data over the NFS, for example by writing a verbose log file, other users will notice. This is most likely to be an issue when writing to the directory /sharedscratch/, which is a large working space partition accessible over the network from any node. Each node also has its own space, accessed at /scratch/ on the node, or through /nodescratch/ from the head node. Best practices to avoid this problem are detailed below, but it is mentioned here due to its great importance.
  • Be realistic about the requirements for your job. The cluster is a shared resource that is not infinite. To make the most efficient use of the resource, do not do things like request 32 cores for a job that can only make use of 4, or request a time limit of 1 week for a job that will only take 1 hour. The better the information you give to the queuing system, the better it can serve you.
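
A good way to calibrate your requests is to look at what a finished job actually used. If job accounting is enabled on the cluster (usually the case, but this is an assumption), sacct can report the elapsed time and peak memory of a past job, which you can feed back into the limits you request next time:

 $ sacct -j <job_ID> --format=JobID,Elapsed,NCPUS,MaxRSS,State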

Submitting a job

To submit a job, you create a job submission script (here we'll call it submit.sh) and then run

 $ sbatch submit.sh

which tells SLURM to execute the contents of your submission script. When you do this, SLURM will first look through any information at the top of the job script for configuration instructions like the job time limit. Then it will place your job into a queue. When the necessary resources become available, any commands in your submission script will be executed in sequence on a compute node. Let's have a look at a simple submission script for running GMIN.

 #!/bin/bash
 #SBATCH -J GMIN_LJ38
 #SBATCH --time=1:00:00
 #SBATCH -n1
 
 echo Starting job $SLURM_JOB_ID
 hostname
 
 TMP=/scratch/jwrm2/$SLURM_JOB_ID
 mkdir -p $TMP
 cp $SLURM_SUBMIT_DIR/data $SLURM_SUBMIT_DIR/coords $TMP/
 cd $TMP
 
 /home/jwrm2/bin/GMIN > logfile
 
 cp -p lowest logfile $SLURM_SUBMIT_DIR && rm -rf $TMP
 cd $SLURM_SUBMIT_DIR

We'll go through this line by line. The first line (affectionately known as the 'shebang') tells SLURM what shell to use to interpret the commands. You don't have to use Bash, but it's what most people are used to, so it's recommended if you want anyone else to be able to help you.

Next we have a series of configuration commands for SLURM. Each of these starts with #SBATCH and they are ignored by Bash. Some of the most useful are:

  • -J <name> The name of the job, which will show up in diagnostic information. Make your life easier by choosing a suitably descriptive one.
  • --time=hh:mm:ss The maximum amount of time your job will run for. If this time limit is reached and the job hasn't finished, SLURM will kill it. It is in your and everyone else's interest to make this a fairly accurate estimate of how long the job will actually take (plus a little bit).
  • --ntasks=<number> The number of processors your job requires.
  • --nodes=<number> The number of nodes your job requires.
  • --mem=<value> The amount of memory your job needs. The default is often fine.

Consult the SLURM documentation [1] for a full list. Here, we request one processor, for a maximum time of 1 hour, and we set the name to 'GMIN_LJ38'.
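
Putting a few of these options together, the header of a submission script for a hypothetical 4-core, single-node job needing 8 GB of memory for at most 12 hours might look like the following sketch. The job name and resource numbers are made up for illustration; substitute your own:

 #!/bin/bash
 #SBATCH -J my_job           # job name shown in squeue
 #SBATCH --time=12:00:00     # kill the job after 12 hours
 #SBATCH --ntasks=4          # 4 processors
 #SBATCH --nodes=1           # all on one node
 #SBATCH --mem=8G            # 8 GB of memory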

The remaining commands are executed by Bash once the job has begun. First we write out some diagnostic information: the ID of the job and which compute node it is running on. This is written to a file slurm-<job_ID>.out in the directory we run sbatch from, for example

 $ cat slurm-214233.out
 Starting job 214233
 compute-0-17.local

The first line is a useful check that the job actually began. Knowing which node your job ran on can be useful if it terminates early/unexpectedly.

The next section of the submission script is about reducing the NFS load. Instead of running the job on /sharedscratch/, which the node accesses over the NFS, we create a directory in the node's own /scratch/ space. Change 'jwrm2' to your own username. We copy only the files GMIN needs in order to run, then change to the new directory. $SLURM_SUBMIT_DIR is a variable holding the directory from which sbatch was run.
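
As an aside, you can avoid hard-coding your username by using the $USER environment variable (the same variable the nodes.info examples below rely on). A variant of the same lines, assuming /scratch/ on the compute node is writable by you, is:

 TMP=/scratch/$USER/$SLURM_JOB_ID
 mkdir -p $TMP
 cp $SLURM_SUBMIT_DIR/data $SLURM_SUBMIT_DIR/coords $TMP/
 cd $TMP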

Next we actually run the programme. We redirect the GMIN output to a log file. If you do not do this redirection, output will instead go to the slurm-<job_ID>.out file, which most likely resides on /sharedscratch/.

Finally, we have clean up actions to be run after GMIN has finished. We copy back only the data we are interested in to the directory from which sbatch was run, then delete the temporary directory we created on the nodescratch. Note the '&&' syntax, which means 'run the command after the && only if the command before the && was successful'. In this case, it means that the temporary directory will not be deleted if we couldn't copy those files back.
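
If you prefer something more explicit than '&&', the same logic can be written with an if statement, which also lets you leave a note in the output file when the copy fails. This is just an equivalent sketch, not part of the original script:

 if cp -p lowest logfile $SLURM_SUBMIT_DIR; then
     rm -rf $TMP      # copy succeeded, safe to clean up
 else
     echo "Copy failed; files left in $TMP on $(hostname)" >&2
 fi
 cd $SLURM_SUBMIT_DIR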

The queuing system

When you submit your job with sbatch, it may not be run instantly. It is quite possible that the resources your job needs are currently being used by someone else. Your job enters the queue. You can see the queue by typing

 $ squeue
              JOBID PARTITION                                               NAME     USER ST       TIME  NODES NODELIST(REASON)
             214233   CLUSTER                                          GMIN_LJ38    jwrm2 PD       0:00      1 (Resources)
             214207   CLUSTER                                              GLP-1    kr366  R      29:31      2 compute-0-[22-23]
             214112   CLUSTER                                    pentamer_1_1292    ld506  R 1-02:03:44      1 compute-0-12

There is a lot of information here. The first column gives the ID of the job. The next column is not usually relevant. Then we see the name of the job (assigned with -J in the submission script) and the user who submitted the job. The ST column tells you the current status of the job: 'PD' means 'pending', i.e. the job is waiting to run, but is not yet running; 'R' means 'running'. We also have the time the job has been running for, the number of nodes the job is using and which they are. If the job is not running, the final column may display a reason. If it shows (Resources), then your job will run as soon as the necessary resources are available. If it instead shows (Priority), then there is another waiting job in the queue that will be run before yours. If you only want to see information about your jobs (or any other user's), you can use the -u option:

 $ squeue -u jwrm2

will only show the jobs submitted by the user jwrm2.
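
A couple of other squeue options can be handy: -t filters by job state, and --start asks SLURM for its current estimate of when a pending job will begin (the estimate is only as good as the time limits other users have supplied):

 $ squeue -u jwrm2 -t PD         # only pending jobs for this user
 $ squeue -u jwrm2 --start       # estimated start times for pending jobs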

How does SLURM decide which jobs to run? Principally, it works out a user's priority by a fairshare system: the more compute time you have recently used, the lower your priority will be. You can see everyone's fairshare values by looking at the final column of the output of

 $ sshare -a

Higher numbers mean a higher priority for that user. However, SLURM will also try to fit smaller jobs into gaps. If there is a 12 node job waiting from a user with a very high priority, but only one node is currently available, a short one node job from a user with low priority may be able to fit in. Therefore it is in your interest to make realistic estimates of the resources your job needs: it may end up running sooner.
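
If you are curious why a particular pending job is ranked where it is, the sprio command breaks a job's priority down into its components (fairshare, queue age, job size and so on). Which of these factors actually carry weight depends on how the cluster is configured:

 $ sprio -j <job_ID>             # priority components for one pending job
 $ sprio -u jwrm2                # all pending jobs for this user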

Oops, I just submitted the wrong job

You can abort a job when it is waiting to run or when it is running. Find the job ID, either by looking at squeue or the output file in the submission directory. Let's say it was 214233:

 $ scancel 214233

will immediately kill the job.
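
scancel can also select jobs by user, name, or state, which saves typing if you need to abort a whole batch of submissions at once:

 $ scancel -u jwrm2              # cancel all of your jobs
 $ scancel -n GMIN_LJ38          # cancel your jobs with this name
 $ scancel -u jwrm2 -t PD        # cancel only your pending jobs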

Running PATHSAMPLE with SLURM

PATHSAMPLE runs a little differently from standard GMIN: it launches new OPTIM jobs and sends them to the other processors available. This is pretty much taken care of if you include 'SLURM' in the pathdata file, but there are a couple of extra considerations. Firstly, PATHSAMPLE needs a nodes.info file with information about the available processors. Generate this file at the start of your submission script with:

 echo $SLURM_NTASKS > nodes.info
 srun hostname >> nodes.info
 echo $USER >> nodes.info
 pwd >> nodes.info

Note the use of srun here. It means: run this command on every one of the available processors.
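
To make the format concrete, for a 4-task job running entirely on one node the resulting nodes.info might look something like the following; the hostname and working directory shown here are purely illustrative:

 $ cat nodes.info
 4
 compute-0-17.local
 compute-0-17.local
 compute-0-17.local
 compute-0-17.local
 jwrm2
 /sharedscratch/jwrm2/LJ38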

Secondly, we need to decide whether PATHSAMPLE itself will be run from /sharedscratch/ (not the OPTIM jobs; PATHSAMPLE will always use /nodescratch/ for those). Running PATHSAMPLE from /sharedscratch/ is simpler, and it means that if the job terminates before PATHSAMPLE completes all the requested cycles, the database on /sharedscratch/ will be in the most current state. Here is an example PATHSAMPLE script that does this:

 #!/bin/bash
 #SBATCH --time=20:0:0
 #SBATCH --job-name=PATHSAMPLE_LJ38
 #SBATCH --ntasks=16
 
 echo $SLURM_NTASKS > nodes.info
 srun hostname >> nodes.info
 echo $USER >> nodes.info
 pwd >> nodes.info
 
 /home/jwrm2/bin/PATHSAMPLE > logfile

We run PATHSAMPLE for 20 hours on 16 cores. SLURM is free to assign those cores over multiple nodes as it sees fit. Note there is no creation of a temporary directory on /scratch/: PATHSAMPLE accesses the database stored in /sharedscratch/. Whenever an OPTIM job is created, PATHSAMPLE creates a temporary directory on /scratch/, copies the files required for OPTIM to run, launches OPTIM with srun, waits for OPTIM to finish, copies the files back and adds new stationary points to the database, then deletes the temporary directory. That is all taken care of for you. Because this approach involves writing to /sharedscratch/, it is only appropriate if each OPTIM job takes a while to run (more than a few minutes) and if the amount of data required for each OPTIM job is small. If either of those conditions is violated, users may notice a slowdown on the head node and be annoyed with you. In that case a different approach is required.

Alternatively, PATHSAMPLE can be run from /nodescratch/. This method reduces the NFS load, but is slightly more complicated, and if the job terminates early you will need to retrieve your expanded database manually (see here). Here is an example PATHSAMPLE script that does this:

 #!/bin/bash
 #SBATCH -J PATHSAMPLE_LJ38
 #SBATCH --time=20:0:0
 #SBATCH --ntasks=4
 #SBATCH --nodes=1
 
 echo Starting job $SLURM_JOB_ID
 hostname
 
 TMP=/scratch/jwrm2/$SLURM_JOB_ID
 mkdir -p $TMP
 cp $SLURM_SUBMIT_DIR/pathdata $SLURM_SUBMIT_DIR/min.data $SLURM_SUBMIT_DIR/ts.data $SLURM_SUBMIT_DIR/points.min $SLURM_SUBMIT_DIR/points.ts $SLURM_SUBMIT_DIR/min.A $SLURM_SUBMIT_DIR/min.B $SLURM_SUBMIT_DIR/odata.connect $TMP/
 cd $TMP
 
 echo $SLURM_NTASKS > nodes.info
 srun hostname >> nodes.info
 echo $USER >> nodes.info
 pwd >> nodes.info
 
 /home/jwrm2/svn/PATHSAMPLE/builds/gfortran/PATHSAMPLE > logfile
 
 cp min.data ts.data points.min points.ts min.A min.B logfile $SLURM_SUBMIT_DIR/ && rm -rf $TMP
 cd $SLURM_SUBMIT_DIR/

We run for 20 hours on 4 cores. Since we are worried about the amount of data we're copying around, it makes sense to request these on the same node. We create a temporary directory on the node's /scratch/ space and copy the whole PATHSAMPLE database to it. Then we generate the nodes.info file and we're ready to run PATHSAMPLE itself. At the end, we copy the whole database back to /sharedscratch/ and delete the temporary directory if successful. This solution is not perfect, as we still have to copy the whole database over the NFS at the start and the end, but if each OPTIM job does a lot of writing, or writes data very rapidly, it may be preferable.

My job got cancelled, where are my files?

If your job has finished, it will no longer appear in the output of squeue. It may have successfully completed, or it might have run out of time (or memory, etc.). Look at the end of the slurm-<job_ID>.out file. If it shows

 slurmstepd: *** JOB 214233 ON compute-0-36 CANCELLED AT 2020-05-18T10:45:16 DUE TO TIME LIMIT ***

then your job ran out of time and was killed. Maybe this was a week long GMIN run that ran out of time 10 steps from the end. How annoying. Time to run it again from the beginning with a longer time limit, because there is no GMIN.dump file in the submission directory? Hardly... In our job submission script, we only deleted the temporary directory after the successful completion of the job. Therefore if the job did not successfully complete, the directory and its contents are still there. The information at the top of the slurm-<job_ID>.out file will help you find them. We created the temporary directory at

 /scratch/jwrm2/214233

To access this from the head node, we need to know what node it was created on. That's why we ran 'hostname' at the start of the job. Let's say the output of that was

 compute-0-17.local

We can access our files at

 /nodescratch/compute-0-17/jwrm2/214233

Note that normal Bash tab completion of the directory name may not work here. That doesn't necessarily mean the directory doesn't exist, it's just a peculiarity of NFS automount.

It would be polite to delete the temporary directory after recovering any files you need.
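
Putting that together, a typical recovery from the head node looks something like this; the job ID, node name and filenames are just the ones from the example above:

 $ ls /nodescratch/compute-0-17/jwrm2/214233              # also triggers the NFS automount
 $ cp /nodescratch/compute-0-17/jwrm2/214233/GMIN.dump .  # grab what you need
 $ rm -rf /nodescratch/compute-0-17/jwrm2/214233          # then tidy up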

A smarter way to get your files

Although recovering files from /nodescratch/ is fine, if you've ever had 52 GMIN jobs terminate and needed to recover their GMIN.dump files for restarting, it gets a little tedious. Fortunately there is a better way, leveraging the signal mechanism of Linux. Here is an example:

 #!/bin/bash
 #SBATCH -J GMIN_LJ38
 #SBATCH --time=5:00
 #SBATCH -n1
 #SBATCH --signal=B:USR1@120
 
 signal_handler()
 {
     # Timeout commands go here
     echo "Caught USR1 signal"
     cp GMIN.dump logfile $SLURM_SUBMIT_DIR
     exit
 }
 trap 'signal_handler' USR1
 
 echo Starting job $SLURM_JOB_ID
 hostname
 
 TMP=/scratch/jwrm2/$SLURM_JOB_ID
 mkdir -p $TMP
 cp $SLURM_SUBMIT_DIR/data $SLURM_SUBMIT_DIR/coords $TMP/
 cd $TMP
 
 /home/jwrm2/bin/GMIN > logfile &
 wait
 
 cp -p lowest logfile $SLURM_SUBMIT_DIR && rm -rf $TMP
 cd $SLURM_SUBMIT_DIR

This script will run any commands enclosed in braces at '# Timeout commands go here', 120 seconds before the job times out. It will terminate itself immediately after completing those commands, so you will not get a 'CANCELLED' message in the slurm-<job_ID>.out file (but only because of the 'exit', otherwise it would continue execution after the 'wait', which is probably not what you want). You can use it to copy back recovery files like GMIN.dump, or the updated PATHSAMPLE database, before the job terminates. If 120 seconds is not enough time for your timeout commands, change the 120 in the line '#SBATCH --signal=B:USR1@120'. The temporary directory will not be deleted if the timeout is reached, although you could put 'rm -rf $TMP' in the timeout section if you were really confident that you were grabbing everything you needed. You should occasionally go through /nodescratch/ and delete any leftover temporary directories that may be hanging around.
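
If you want to check that your handler works without waiting for a real timeout, you can send USR1 to the running job yourself with scancel; the --batch flag delivers the signal to the batch shell (where the trap lives) rather than to the job steps:

 $ scancel --signal=USR1 --batch 214233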

Theory

You don't need to understand how the above example works to use it, but we're all curious scientists, aren't we? The Linux kernel can communicate with a running process by sending it 'signals'. Common ones include 'SIGTERM' and 'SIGKILL'. In fact, when your job reaches the time limit, it gets sent a 'SIGTERM' signal, telling it to terminate. If the process has not set up anything specific to do on receiving a particular signal, the default action is to exit. However, most signals can have custom responses set up. An exception is 'SIGKILL', for which immediate termination cannot be overridden.

The line '#SBATCH --signal=B:USR1@120' tells SLURM to please send the signal 'USR1' (user specified signal number 1) to the job 120 seconds before reaching the time limit. 'USR1' is not sent by any system processes, so we're free to use it as we wish. Next we create a Bash function, 'signal_handler()', with our sequence of commands. We tell the Bash script to jump to our function on receiving the 'USR1' signal, overriding the default exit behaviour (trap 'signal_handler' USR1).

The final finicky bit is how we run our main programme. If we ran 'GMIN', the signal would get sent to GMIN, as the active process, rather than Bash. GMIN wouldn't know what to do with it, so would immediately exit. Not good. By running 'GMIN &' instead, the signal goes to Bash. However, if we did just that, the job would quickly reach the end after launching GMIN, which would terminate the job and all its children, including GMIN. Oops. By writing 'wait', we tell Bash to go into an 'interruptible sleep', meaning it will sit there not doing anything until it receives a signal. Therefore it is able to execute the commands in the 'signal_handler()' function on receiving the 'USR1' signal. As a matter of interest, execution of the main script then resumes where it left off before the signal, but the wait has now been completed (it only waits for the first signal) and execution proceeds to the 'cp' line. Putting 'exit' at the end of the function means we don't jump back to after the wait. We could have put another 'wait' instead of 'exit', but that would only be useful if GMIN finished in the short period of time between completing the function and reaching the job time limit. Better to just save 2 minutes of CPU time by exiting immediately.

How does our job finish normally, if we've put it to sleep then? Well, when GMIN exits it sends, you guessed it... a signal, in this case 'SIGCHLD'. That will wake up Bash from its wait and the script will proceed to completion.
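
If you want to play with this mechanism without tying up a compute node, here is a minimal sketch you can run in an ordinary Bash shell, with 'sleep' standing in for GMIN and a manual 'kill' standing in for SLURM's timeout signal:

 #!/bin/bash
 # Minimal demonstration of the trap/background/wait pattern, outside SLURM.
 
 signal_handler()
 {
     echo "Caught USR1: recovery commands would go here"
     exit
 }
 trap 'signal_handler' USR1
 
 sleep 600 &                # stands in for GMIN; backgrounded so Bash receives the signal
 echo "Send the signal from another terminal with: kill -USR1 $$"
 wait                       # interruptible sleep until a signal arrives or the child finishes
 echo "Reached only if sleep finished before any signal arrived"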