Difference between revisions of "Identifying job on a node"


Latest revision as of 08:40, 15 July 2008

Sometimes a node is running more than one of your jobs and you need to kill only one of them. How do you find which one it is? If all you know is which output file the job you want to kill is writing, here is a way to find the process:

Get the PIDs of the candidate processes and examine their open files, which are listed in /proc/<pid>/fd on that node. Run 'ls -l' on that directory for each PID to see its open files, and check which process has your output file open.
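The manual 'ls -l' check can also be scripted. The following is only a minimal sketch, assuming a Linux node with /proc mounted and Python available; the function name pids_with_file_open and its proc_root parameter are made up for illustration. It reads the symbolic links in each /proc/<pid>/fd directory and reports the PIDs whose links point at the given file.

```python
import os


def pids_with_file_open(target_path, proc_root="/proc"):
    """Return the sorted PIDs that have target_path open.

    Hypothetical helper: it walks proc_root/<pid>/fd, which is the same
    information 'ls -l /proc/<pid>/fd' displays.
    """
    target = os.path.realpath(target_path)
    matches = []
    for entry in os.listdir(proc_root):
        # Only the all-digit directory names under /proc are processes.
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            # The process has exited, or it belongs to another user.
            continue
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if link == target:
                matches.append(int(entry))
                break
    return sorted(matches)
```

Run on the node itself, a call such as pids_with_file_open("/path/to/output.log") (the path is a placeholder) would return the PIDs to inspect further. An ordinary user can normally read the fd directory only for their own processes, so processes owned by other users are silently skipped.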

If the job is serial you can identify the processes belonging to it by looking at parent process IDs. There will be one process on the node named after your PBS job identifier; this is your job script. Get the PID of that process, look for its children (i.e. processes with this PID as their parent PID) and kill those. You can see parent process IDs by running top, pressing f, then pressing b, then Return. To restrict the display to just your own processes, press u and then type your username.
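As an alternative to reading the parent PID column in top, the children of the job script can be listed from /proc. This is a hedged sketch, assuming a Linux node; parse_ppid and children_of are hypothetical names, and the code only lists PIDs and does not kill anything. It relies on the parent PID being the second field after the parenthesised command name in /proc/<pid>/stat.

```python
import os


def parse_ppid(stat_line):
    """Extract the parent PID from the contents of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split on the last ')' before reading the fields.
    """
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state, fields[1] is the parent PID.
    return int(fields[1])


def children_of(parent_pid, proc_root="/proc"):
    """Return the sorted PIDs whose parent PID equals parent_pid."""
    children = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                ppid = parse_ppid(handle.read())
        except (OSError, IndexError, ValueError):
            # The process exited while we were looking, or the file is odd.
            continue
        if ppid == parent_pid:
            children.append(int(entry))
    return sorted(children)
```

Given the PID of the job script, children_of(script_pid) returns the direct children only; grandchildren would need the same call repeated on each result.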

The parent PID method doesn't always work with parallel jobs, because they can spawn processes via rsh, and those processes are not children of the script process.