Submitting jobs, interactively or to a cluster queue system


This is a brief summary for those who haven't found/read the documentation provided by the COs.

Interactive

The basic command for submitting jobs to a queue is qsub.

To submit interactively, use the following syntax:

qsub -I -q [queue]

When sufficient nodes become available, a session opens and you can execute commands on the assigned nodes until the walltime specified by the queue expires. Submitting interactively is handy if you need to keep a close eye on what's going on (though hopefully you know what your job is meant to be doing anyway...). It is best avoided where possible, however: if no nodes are free you can be left waiting around for the session to start.
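
For example, a short interactive session might look like this (the queue name is a placeholder and the walltime request is optional):

qsub -I -q [queue] -l walltime=1:00:00
# once the prompt returns you are on a compute node
hostname                 # check which node you have been given
cd $PBS_O_WORKDIR        # move to the directory you submitted from
# ...run and inspect whatever you need...
exit                     # end the session and release the node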

Batch

Batch jobs are generally executed from scripts. An example script might be:

#PBS -q [queue]
#PBS -l walltime=[hhh:mm:ss]

cd $PBS_O_WORKDIR
[commands to execute]

The first line specifies the queue to submit to. Each queue has a set walltime (indicated by the initial letter of the queue name) and number of cores (indicated by the number in the name). The queue configurations for each cluster can be found here:

The second line can be used to request a walltime shorter than the queue's default. This is handy if your job doesn't match the timeframes of the available queues (for example, the shortest queue walltime on Volkhan is 24 hours).

The third line changes to the directory from which you submitted the job ($PBS_O_WORKDIR), so that the commands that follow are executed there.

Better example scripts for each cluster are available at /info/pbs on each cluster.
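
Once a script is saved (as submit.pbs, say, although the file name is arbitrary), it is submitted and monitored with the standard PBS commands:

qsub submit.pbs      # submit the script; prints the job ID
qstat -u $USER       # list the state of your queued and running jobs
qdel [jobid]         # remove a job from the queue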

Running on /scratch/ on a node

To reduce the NFS load on the clusters, it's probably beneficial to use the PBS script to set up and run the job in /scratch/ on the node, rather than the NFS-mounted /sharedscratch/ (or /home/). Note that PATHSAMPLE jobs set up and run their OPTIM processes in this way already, so no changes are needed to a PATHSAMPLE job submission script.

Add the following, for example, to your PBS script as the commands to execute (i.e. after the specification of the queue and walltime etc):

TMP=/scratch/<your-user-ID>/$PBS_JOBID
# -p means "no error if existing, make parent directories as needed"
mkdir -p $TMP
# set up the necessary input files in this directory:
cp odata finish perm.allow input.crd $TMP
cd $TMP
# run the executable in the usual way (and NOT in the background!)
/home/wales/bin/COPTIM35 >& logfile
# copy all required output back to the directory from which you submitted the script
cp logfile $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
rm -rf $TMP
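
If the job hits the walltime, the output is left stranded in /scratch/ on the node. PBS typically sends the batch shell a SIGTERM shortly before the hard kill, so a trap can be used to rescue the log file; the sketch below assumes such a grace period exists on your cluster (and uses $USER instead of a hard-coded user ID), so test it before relying on it:

TMP=/scratch/$USER/$PBS_JOBID
mkdir -p $TMP
cp odata finish perm.allow input.crd $TMP
cd $TMP
# if PBS terminates the job at the walltime, try to copy the log back before the kill
trap 'cp logfile $PBS_O_WORKDIR' TERM
/home/wales/bin/COPTIM35 >& logfile
cp logfile $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
rm -rf $TMP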

The group clusters will all eventually change from PBS to Slurm. The following sbatch script copies the job to local scratch in the same way as above:

#!/bin/bash
#SBATCH --time=670:0:0
#SBATCH -n 1
#SBATCH -J LJ8
#
# record the task count, node names, user and submission directory for reference
echo $SLURM_NTASKS > nodes.info
srun hostname >> nodes.info
echo $USER >> nodes.info
pwd >> nodes.info
#
# stage the job to local scratch on the node
TMP=/scratch/wales/$SLURM_JOB_ID
mkdir -p $TMP
cp -r $SLURM_SUBMIT_DIR/* $TMP
cd $TMP
#
srun -N1 -n1 /sharedscratch/wales/OPTIM.pgi/OPTIM > output
#
# copy everything back to the submission directory and clean up scratch
cp -r ./* $SLURM_SUBMIT_DIR && cd $SLURM_SUBMIT_DIR && rm -rf $TMP
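
Under Slurm, submission and monitoring use sbatch, squeue and scancel rather than the qsub family, for example (the script file name is arbitrary):

sbatch submit.slurm      # submit the script above; prints the job ID
squeue -u $USER          # list your queued and running jobs
scancel [jobid]          # remove a job from the queue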

To run multiple OPTIM jobs via PATHSAMPLE, use the SLURM keyword in pathdata (with no PBS or CPUS line), request the required number of tasks, and launch PATHSAMPLE itself on a single task:

#SBATCH -n 24
#
# then, in the body of the script:
#
srun -N1 -n1 /sharedscratch/wales/PATHSAMPLE.pgi/PATHSAMPLE > output

To use all of the available memory, add:

#SBATCH --mem-per-cpu=10000
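
Putting these pieces together, a PATHSAMPLE submission script might look like the sketch below (the job name, walltime and memory request are placeholders to adjust for your system; the executable path follows the example above):

#!/bin/bash
#SBATCH --time=670:0:0
#SBATCH -n 24
#SBATCH -J pathsample-run
#SBATCH --mem-per-cpu=10000
#
# pathdata contains the SLURM keyword (and no PBS or CPUS line);
# PATHSAMPLE then launches its OPTIM jobs through srun itself
srun -N1 -n1 /sharedscratch/wales/PATHSAMPLE.pgi/PATHSAMPLE > output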

Running on /scratch/ on a node for parallel jobs

This is appropriate for parallel tempering jobs using the keywords PTMC, BSPT, or BHPT, not for PATHSAMPLE jobs.

Getting your job to run on multiple nodes with the output going to the local /scratch/ is a bit more involved, because the startup files must be copied to every node and the data retrieved from every node afterwards. The submit_par script described below can be used to help with this. The following command submits it as a job with 16 processors spread over 4 nodes:

$ qsub -l nodes=4:ppn=4 submit_par

(I can't upload the file here because only image file types are permitted on the wiki, but you can find it on sinister at sinister:/home/js850/research/library/gitworking/ptmc/submit_par .)
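
For reference, the general shape of such a script (this is a sketch of the approach, not the actual submit_par; the binary path and input file names are placeholders) is to loop over the node list that PBS provides in $PBS_NODEFILE, stage the input onto each node's local /scratch/, run the parallel job, and then collect the results:

#!/bin/bash
#PBS -q [queue]

cd $PBS_O_WORKDIR
TMP=/scratch/$USER/$PBS_JOBID
NODES=$(sort -u $PBS_NODEFILE)              # one entry per node assigned to the job
# stage the input files onto the local scratch of every node
for node in $NODES; do
    ssh $node "mkdir -p $TMP"
    scp data $node:$TMP/                    # 'data' stands for whatever input your job needs
done
# launch the parallel run with the local scratch as the working directory on each node
mpirun -wdir $TMP [path-to-parallel-binary]
# retrieve the output from every node and clean up
for node in $NODES; do
    scp -r $node:$TMP/* $PBS_O_WORKDIR/
    ssh $node "rm -rf $TMP"
done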