Identifying job on a node

Sometimes a node is running more than one of your jobs and you need to kill only one of them. How do you find which one it is? If all you know is which output file the job you want to kill is writing, here is a way to find the process:

Get the PIDs of the candidate processes and examine their open files, which are listed in /proc/<pid>/fd on the node in question. Run 'ls -l' on that directory for each PID. Each entry is a symbolic link to an open file, so look for the process that has your output file open.
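The check above can also be scripted. What follows is a minimal Python sketch, not an established tool: the function name pids_with_file_open and its proc_root parameter are made up for illustration, and it assumes a Linux node where /proc/<pid>/fd is readable for your own processes. It follows each symbolic link in the fd directories and reports the PIDs whose link points at the given file.

```python
import os


def pids_with_file_open(target_path, proc_root="/proc"):
    """Return the sorted PIDs that have target_path open.

    Hypothetical helper: proc_root defaults to /proc and is only a
    parameter so the function can be tried against a test directory.
    """
    target = os.path.realpath(target_path)
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process has exited, or is not yours to inspect
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were looking
            if link == target:
                matches.append(int(entry))
                break  # one match per process is enough
    return sorted(matches)
```

Run on the node itself, a call such as pids_with_file_open("/path/to/your/output.file") (a placeholder path) should return the PIDs to consider. Processes owned by other users are skipped silently because their fd directories cannot be read.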

If the job is serial, you can identify the processes belonging to it by looking at parent process IDs. There will be one process on the node named after your PBS job identifier; this is your job script. Get the PID of that process, look for its children (i.e. processes with this PID as their parent PID), and kill those. You can see parent process IDs by running top, pressing f, then pressing b, then return. To restrict the display to just your own processes, press u and then type your username.
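As an alternative to reading the parent PID column in top, the children of the job script can be listed with a short script. This is only a sketch under assumptions: the function name children_of and its proc_root parameter are hypothetical, and it relies on the "PPid:" line that Linux puts in /proc/<pid>/status. It returns direct children only, not grandchildren.

```python
import os


def children_of(parent_pid, proc_root="/proc"):
    """Return the sorted PIDs whose parent PID is parent_pid.

    Hypothetical helper: proc_root defaults to /proc and is only a
    parameter so the function can be tried against a test directory.
    """
    children = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path) as status:
                for line in status:
                    # The line looks like "PPid:<tab>1234".
                    if line.startswith("PPid:"):
                        if int(line.split()[1]) == parent_pid:
                            children.append(int(entry))
                        break
        except OSError:
            continue  # process exited between listing and reading
    return sorted(children)
```

Given the PID of the job script process, children_of(pid) lists the candidates to kill. Check the list against top before sending any signal.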

The parent PID method doesn't always work with parallel jobs, because they can spawn processes via rsh, and those processes are not children of the script process.