enegado

How to catch a qdel signal


Hello

Is there a way to catch a qdel signal in the PBS script to force clean up before exiting?

I have an Ansys MPI job, and when a user runs qdel on their job, I would like to copy back all the scratch data before the process dies on the compute node.

 

Thank you


If you want to go about this from the job script itself, here is a *very* simple example of a SIGTERM signal handler in a Bash script:
 

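#!/bin/bash

# Runs when the script receives SIGTERM (for example, after a qdel).
function file_transfer {
    echo "SIGTERM caught!  Copying out files"
    sleep 30    # placeholder for the real file copy
    echo "DONE copying files"
}

trap file_transfer SIGTERM

# Placeholder for the real workload.
sleep 1000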
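
A slightly fuller variation on the example above, as a rough sketch only. Bash does not run a trap until the foreground command it is waiting on has finished, so this version starts the workload in the background and uses wait, which a trapped signal interrupts right away. SCRATCH_DIR, RESULT_DIR, copy_back, handle_term and run_with_cleanup are all names made up for illustration; the default paths are assumptions, so adjust them to your site.

```bash
#!/bin/bash
# Sketch only: both directories are hypothetical defaults.
SCRATCH_DIR="${SCRATCH_DIR:-/tmp/scratch}"
RESULT_DIR="${RESULT_DIR:-${PBS_O_WORKDIR:-$PWD}/results}"

# Copy everything in the scratch directory to the results directory.
copy_back() {
    mkdir -p "$RESULT_DIR"
    cp -a "$SCRATCH_DIR"/. "$RESULT_DIR"/
}

# SIGTERM handler: stop the workload, copy the data, then exit.
handle_term() {
    echo "SIGTERM caught!  Copying out files"
    kill "$workload_pid" 2>/dev/null || true   # workload may already be gone
    copy_back
    echo "DONE copying files"
    exit 143                                   # 128 + 15 (SIGTERM)
}

# Run the given command in the background and wait for it, so that the
# trap can fire while the workload is still running.
run_with_cleanup() {
    trap handle_term TERM
    "$@" &
    workload_pid=$!
    workload_status=0
    wait "$workload_pid" || workload_status=$?
    trap - TERM
    return "$workload_status"
}
```

In a job script the last line would then be something like `run_with_cleanup sleep 1000`, with the real solver command in place of the sleep.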

 

There is more to consider when using PBS Pro, though.  When you do a qdel, this is what happens:

For a single-vnode job:
1. PBS sends the job a SIGTERM
2. PBS waits for the amount of time specified in the kill_delay queue attribute
3. PBS sends the job a SIGKILL

For a multi-vnode job:
1. Mother Superior sends a SIGTERM to all processes on the primary execution host
2. If any of the processes of the top task of the job are still running, PBS waits a minimum of kill_delay seconds
3. Mother Superior sends a SIGKILL to all remaining job processes on the primary execution host
4. The subordinate MoMs send a SIGKILL to all their processes belonging to this job


kill_delay is 10 seconds by default.


So you have to set the kill_delay queue attribute to more than 30 seconds for the example above, or else the file_transfer function will be killed before it has finished. That is just an example, of course. In your actual script, where a real file copy command takes the place of the "sleep 30", kill_delay needs to be longer than the longest time the copy can take.
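
If you want to try a handler without a PBS cluster, the single-vnode sequence described above can be imitated locally. This is only a rough sketch: simulate_qdel is a made-up helper, not a PBS command, and it signals a single PID, whereas PBS signals the job's processes.

```bash
#!/bin/bash
# Hypothetical helper, not part of PBS: SIGTERM, wait, then SIGKILL on one PID.
simulate_qdel() {
    target_pid=$1
    kill_delay=${2:-10}     # seconds; 10 mirrors the default quoted above

    kill -TERM "$target_pid" 2>/dev/null || return 1    # no such process

    waited=0
    while [ "$waited" -lt "$kill_delay" ]; do
        # kill -0 sends no signal; it only checks that the PID still exists.
        kill -0 "$target_pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        waited=$((waited + 1))
    done

    # Still alive after the grace period: force termination.
    kill -KILL "$target_pid" 2>/dev/null || true
    return 0
}
```

For example, start your job script in the background, note its PID from `$!`, and call `simulate_qdel "$pid" 10`; if the copy gets cut off, the grace period is too short. On a real system the attribute itself would be set by a PBS manager with something like `qmgr -c "set queue workq kill_delay = 60"`, where workq is an assumed queue name.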

 

There are other options aside from a signal handler in the job script that could help you achieve more or less the same thing in PBS Pro (assuming you are an admin on this PBS Pro installation and could get these configured):

1) Use an execjob_epilogue hook.  This would be a Python hook that the pbs_mom runs for all jobs just after executing or killing a job, but before the job is cleaned up.  The hook could be written to detect whether the job in question is an Ansys job and whether it exited normally or as a result of SIGTERM/SIGKILL, and then perform the file copy if necessary.

Perhaps less attractive, but still worth mentioning at least in passing, are:

2) Utilize PBS Pro's stageout mechanism to copy all files from the job execution directory to some predetermined location.

3) Use an action terminate script so that the pbs_mom runs a custom script instead of simply sending SIGTERM.
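
For option 2, a stage-out request is made with the -W stageout option to qsub, or the matching #PBS directive in the job script. A minimal sketch of the directive form, assuming a scratch directory, storage host and destination path that are all hypothetical:

```bash
#!/bin/bash
# Sketch only: /scratch/ansys_run, storagehost and /home/user/ansys_results
# are hypothetical. The general form is
#   execution_path@storage_hostname:storage_path
#PBS -W stageout=/scratch/ansys_run@storagehost:/home/user/ansys_results

# ... solver command goes here ...
```

Check how stage-out behaves for deleted jobs on your PBS Pro version before relying on it for the qdel case.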

 

If you'd like more details about any of these options please contact pbssupport@altair.com and we can work with you.

Thanks.

-Scott Campbell
