Pan

job was terminated in 2 seconds


Are you sure that PBS killed the job?


 


The first thing I would suggest is looking at the STDERR and STDOUT files of the job. Typically these files are copied back to the directory and host the job was submitted from. Another place to look is in the application's specific log file. 


 


If the job script was never executed and you did not receive the STDOUT and STDERR files, then you will need to examine the $PBS_HOME/mom_logs on the system the job was dispatched to. If you don't know which execution node(s) were allocated to the job, you will need to know the JOBID of the job in question. On the PBS server, execute the command 



tracejob $PBS_JOBID

where $PBS_JOBID is the ID of the job in question. By default, tracejob parses only the current day's log files. If you need to include older log files, use the -n option to set how many days back to search. See the tracejob man page for more details. 


 


Based on the output of tracejob you should be able to identify which execution nodes were selected. 


 


Here is an example.




[root@mic01 ~]# tracejob -n 20 30

Job: 30.mic01.prog.altair.com

02/18/2014 16:46:41  L    Considering job to run
02/18/2014 16:46:41  S    enqueuing into workq, state 1 hop 1
02/18/2014 16:46:41  S    Job Queued at request of scott@mic01.prog.altair.com, owner = scott@mic01.prog.altair.com, job name = STDIN, queue = workq
02/18/2014 16:46:41  S    Job Run at request of Scheduler@mic01.prog.altair.com on exec_vnode (mic01:ncpus=1)
02/18/2014 16:46:41  S    Job Modified at request of Scheduler@mic01.prog.altair.com
02/18/2014 16:46:41  L    Job run
02/18/2014 16:46:41  M    Started, pid = 25696
02/18/2014 16:46:41  A    queue=workq
02/18/2014 16:46:41  A    user=scott group=scott project=_pbs_project_default jobname=STDIN queue=workq ctime=1392760001 qtime=1392760001 etime=1392760001
                          start=1392760001 exec_host=mic01/4 exec_vnode=(mic01:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1
                          Resource_List.place=pack Resource_List.select=1:ncpus=1 resource_assigned.ncpus=1 
02/18/2014 16:48:21  S    Obit received momhop:1 serverhop:1 state:4 substate:42
02/18/2014 16:48:21  S    Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=2788kb resources_used.ncpus=1
                          resources_used.vmem=317040kb resources_used.walltime=00:01:40
02/18/2014 16:48:21  M    task 00000001 terminated
02/18/2014 16:48:21  M    Terminated
02/18/2014 16:48:21  M    task 00000001 cput= 0:00:00
02/18/2014 16:48:21  M    kill_job
02/18/2014 16:48:21  M    mic01 cput= 0:00:00 mem=2788kb
02/18/2014 16:48:21  M    no active tasks
02/18/2014 16:48:21  M    Obit sent
02/18/2014 16:48:21  M    copy file request received
02/18/2014 16:48:21  M    staged 2 items out over 0:00:00
02/18/2014 16:48:21  M    no active tasks
02/18/2014 16:48:21  M    delete job request received
02/18/2014 16:48:21  S    dequeuing from workq, state 5
02/18/2014 16:48:21  A    user=scott group=scott project=_pbs_project_default jobname=STDIN queue=workq ctime=1392760001 qtime=1392760001 etime=1392760001
                          start=1392760001 exec_host=mic01/4 exec_vnode=(mic01:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1
                          Resource_List.place=pack Resource_List.select=1:ncpus=1 session=25696 end=1392760101 Exit_status=0 resources_used.cpupercent=0
                          resources_used.cput=00:00:00 resources_used.mem=2788kb resources_used.ncpus=1 resources_used.vmem=317040kb
                          resources_used.walltime=00:01:40 run_count=1

Once you have identified the first exec_host, examine the mom_logs on that machine by searching them for the job's PBS_JOBID.
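If you would rather not pick the host out of the trace by eye, the lookup can be scripted. This is only a sketch: the function name exec_hosts is made up for illustration, the sample text is trimmed from the trace above, and the handling of multi-node values (host/index chunks joined by "+") is an assumption about the exec_host format that you should check against your own output.

```python
import re


def exec_hosts(tracejob_output):
    """Return the distinct host names found in exec_host=... fields, in order."""
    hosts = []
    for value in re.findall(r"exec_host=(\S+)", tracejob_output):
        # Assumed format: "mic01/4" for one host, "n1/0*2+n2/0" for several.
        for chunk in value.split("+"):
            host = chunk.split("/")[0]
            if host not in hosts:
                hosts.append(host)
    return hosts


# Lines trimmed from the tracejob output shown in this thread.
sample = """\
02/18/2014 16:46:41  S    Job Run at request of Scheduler@mic01.prog.altair.com on exec_vnode (mic01:ncpus=1)
02/18/2014 16:46:41  A    user=scott group=scott project=_pbs_project_default jobname=STDIN queue=workq
                          start=1392760001 exec_host=mic01/4 exec_vnode=(mic01:ncpus=1) Resource_List.ncpus=1
"""

print(exec_hosts(sample))
```

The first entry in the returned list is the host whose mom_logs you would search first.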



Hi, I got the error log, and its content is:

./exec: line 36: 12644 已杀死 (Killed)               ./ampt < nseed_runtime > nohup.out

and the mom log is:

 

03/04/2014 10:26:52;0008;pbs_mom;Job;62.hpc100;Started, pid = 12512
03/04/2014 10:26:54;0080;pbs_mom;Job;62.hpc100;task 00000001 terminated
03/04/2014 10:26:54;0008;pbs_mom;Job;62.hpc100;Terminated
03/04/2014 10:26:54;0100;pbs_mom;Job;62.hpc100;task 00000001 cput= 0:00:02
03/04/2014 10:26:54;0008;pbs_mom;Job;62.hpc100;kill_job
03/04/2014 10:26:54;0100;pbs_mom;Job;62.hpc100;hpc100 cput= 0:00:02 mem=0kb
03/04/2014 10:26:55;0008;pbs_mom;Job;62.hpc100;no active tasks
03/04/2014 10:26:55;0100;pbs_mom;Job;62.hpc100;Obit sent
03/04/2014 10:26:55;0100;pbs_mom;Req;;Type 54 request received from root@hpc100, sock=10
03/04/2014 10:26:55;0080;pbs_mom;Job;62.hpc100;copy file request received
03/04/2014 10:26:55;0100;pbs_mom;Job;62.hpc100;staged 2 items out over 0:00:00
03/04/2014 10:26:55;0008;pbs_mom;Job;62.hpc100;no active tasks
03/04/2014 10:26:55;0100;pbs_mom;Req;;Type 6 request received from root@hpc100, sock=10
03/04/2014 10:26:55;0080;pbs_mom;Job;62.hpc100;delete job request received

 

 

The program produced some output, but it is incomplete, and the job never lasts more than 2 seconds.
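The 2-second lifetime can be read straight from the mom log timestamps. Below is a rough sketch of that calculation. It assumes each mom log line has six semicolon-separated fields (timestamp, event code, daemon, object type, object name, message), which matches the excerpt in this post; the function names are hypothetical.

```python
from datetime import datetime


def parse_mom_line(line):
    """Split one mom log line into (timestamp, object name, message)."""
    stamp, _code, _daemon, _obj_type, obj_name, message = line.strip().split(";", 5)
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"), obj_name, message


def run_seconds(log_text, job_id):
    """Seconds between the 'Started' and 'Terminated' records of one job."""
    started = ended = None
    for line in log_text.splitlines():
        if line.count(";") < 5:
            continue  # not a complete log record
        when, name, message = parse_mom_line(line)
        if name != job_id:
            continue
        if message.startswith("Started"):
            started = when
        elif message == "Terminated":
            ended = when
    if started is None or ended is None:
        return None
    return (ended - started).total_seconds()


# Records copied from the mom log excerpt in this post.
mom_log = """\
03/04/2014 10:26:52;0008;pbs_mom;Job;62.hpc100;Started, pid = 12512
03/04/2014 10:26:54;0080;pbs_mom;Job;62.hpc100;task 00000001 terminated
03/04/2014 10:26:54;0008;pbs_mom;Job;62.hpc100;Terminated
03/04/2014 10:26:55;0100;pbs_mom;Req;;Type 54 request received from root@hpc100, sock=10
"""

print(run_seconds(mom_log, "62.hpc100"))
```

For job 62.hpc100 the two records are stamped 10:26:52 and 10:26:54, so the sketch reports 2 seconds between start and termination.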


Hmm.. I am not sure what the following means:



./exec: line 36: 12644 已杀死 (Killed)               ./ampt < nseed_runtime > nohup.out

I am assuming that this application error message is something important? 


 


According to the PBS mom_logs, the task started and then terminated; because the session exited, PBS considers the job done. 


 


I assume that this application can execute fine when not running via PBS, correct? If this is true, then I would start to examine what might be different in the user's environment. 


 


Experiment:


 


Log into the compute node and execute the job script. If you are relying on PBS_* environment variables, take care to provide reasonable values so that the job script will run. Once you have the job script running, record the user's environment (env) to a file.


 


Now, submit an interactive job (qsub -I). Work through the same steps you did above to get the job script to execute cleanly. 
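If you save the env output from both runs, the two dumps can be compared mechanically. The sketch below assumes each dump is plain NAME=value text with one variable per line (multi-line values such as exported shell functions would need extra handling). The variable names and values in the example are invented for illustration and are not taken from this thread.

```python
def load_env(text):
    """Parse the text of an `env` dump into a dict of NAME -> value."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            env[name] = value
    return env


def diff_env(direct, pbs):
    """Map each differing variable to (value in direct run, value under PBS)."""
    changes = {}
    for name in sorted(set(direct) | set(pbs)):
        if direct.get(name) != pbs.get(name):
            changes[name] = (direct.get(name), pbs.get(name))
    return changes


# Hypothetical dumps: one from the direct login run, one from the qsub -I session.
direct_env = load_env("PATH=/usr/bin:/opt/app/bin\nLD_LIBRARY_PATH=/opt/app/lib\nHOME=/home/user\n")
pbs_env = load_env("PATH=/usr/bin\nHOME=/home/user\nPBS_ENVIRONMENT=PBS_INTERACTIVE\n")

for name, (before, after) in diff_env(direct_env, pbs_env).items():
    print(f"{name}: direct={before!r} pbs={after!r}")
```

A value of None on one side means the variable is missing from that run, which is often the interesting case.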



Hi, I ran the job in interactive mode and got the same error; see the pictures below.

 

I also ran another program through PBS in interactive mode and got the same error.

post-21767-0-91448400-1393909612_thumb.p

post-21767-0-28570300-1393909768_thumb.p


Thanks for sharing the screenshots. This is helpful. It looks like the job is being submitted with a -l cput argument, which the execution host always enforces while the job runs. 


 


Can you tell me how the job is being submitted? 


 


1. All of the arguments used with qsub, including any PBS directives (#PBS) in the job script  


2. As root, the qstat -f output for the job. 
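To see why a small cput request would explain a job that dies after 2 seconds, here is a rough sketch that compares the CPU time reported in the mom log with a requested limit. The limit value below is hypothetical, since the actual request was only visible in the screenshots, and the parser assumes the [[hours:]minutes:]seconds form without fractional seconds.

```python
def to_seconds(duration):
    """Convert '[[HH:]MM:]SS' (for example '0:00:02' or '90') to whole seconds."""
    total = 0
    for part in duration.strip().split(":"):
        total = total * 60 + int(part)
    return total


def cput_reached(used, limit):
    """True when the CPU time used is at or above the requested cput limit."""
    return to_seconds(used) >= to_seconds(limit)


used_cput = "0:00:02"        # from the mom log line "cput= 0:00:02"
assumed_limit = "00:00:02"   # hypothetical -l cput value, not taken from this thread

print(to_seconds(used_cput), cput_reached(used_cput, assumed_limit))
```

The qstat -f output requested above would show the limit that was actually set for the job.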


Hi, I fixed the problem. I had just selected the wrong item in xpbs. Thanks.
