Marcelo

Cannot delete hung job


I have a problem with one job whose status is held (H), but I cannot delete it.

# qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1535.admin        impi             vbrito84          00:00:00 H workq

# qstat -a

sansao:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1535.admin      vbrito84 workq    impi        55072   4  96    --  200:0 H 00:00

# qdel 1535.admin
qdel: Unauthorized Request  1535.admin

# qdel -W force 1535.admin
qdel: Unauthorized Request  1535.admin
#
 

Here is some configuration information from my system:

# /opt/pbs/11.3.2.130131/bin/pbs_hostn -v sansao
primary name: sansao.peno.coppe.ufrj.br (from gethostbyname())
aliases:           sansao
     address length:  4 bytes
     address:        146.164.57.11   (188327058 dec)  name:  sansao.peno.coppe.ufrj.br
# /opt/pbs/11.3.2.130131/bin/pbs_hostn -v admin
primary name: admin.default.domain (from gethostbyname())
aliases:           admin
aliases:           loghost
     address length:  4 bytes
     address:            10.0.10.1   (17432586 dec)  name:  admin.default.domain

# qstat -Qf
Queue: workq
    queue_type = Execution
    total_jobs = 1
    state_count = Transit:0 Queued:0 Held:1 Waiting:0 Running:0 Exiting:0 Begun
        :0
    enabled = True
    started = True

# qmgr -c 'print server'
#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq enabled = True
set queue workq started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_info = /opt/altair/licensing11.0/altair_lic.dat
set server pbs_license_min = 1
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server license_count = "Avail_Global:0 Avail_Local:0 Used:0 High_Use:0 Avail_Sockets:26 Unused_Sockets:0"
set server eligible_time_enable = False
set server max_concurrent_provision = 5

 # cat /etc/pbs.conf
PBS_EXEC=/opt/pbs/default
PBS_HOME=/var/spool/PBS
PBS_START_SERVER=1
PBS_START_MOM=0
PBS_START_SCHED=1
PBS_SERVER=sansao
PBS_SCP=/usr/bin/scp
PBS_RSHCOMMAND=ssh

Does anyone have an idea how I can delete this held job?

 

Thanks.

 


Hi Marcelo,

 

Please run qstat -an to determine which node(s) the hung job is assigned to. Then run pbsnodes <node_name> for each of those nodes and reply with the output of both commands. Also, please run tracejob 1535.admin and attach that output as well.

 

Thanks,

 

Chris
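
The three commands requested above can be collected into a single report with a small wrapper. This is only a sketch: the function names, the argument layout, and the assumption that an exec_host string looks like "host/index*ncpus+host/index*ncpus" are illustrative and are not taken from the thread, so check them against your own qstat -an output.

```shell
#!/bin/sh
# Hypothetical helper: gather qstat -an, pbsnodes, and tracejob output in one run.
# Assumed usage: sh pbs_diag.sh 1535.admin 'hostA/0*24+hostB/0*24' > report.txt

# Turn an exec_host string into one unique host name per line.
# Assumes the "host/index*ncpus+host/index*ncpus" layout; verify on your site.
hosts_from_exec_host() {
    printf '%s\n' "$1" | tr '+' '\n' | sed 's|/.*$||' | sort -u
}

# Print a labelled header, then run the given command, capturing errors too.
section() {
    printf '\n===== %s =====\n' "$*"
    "$@" 2>&1
}

# Run the diagnostics for one job and the nodes named in its exec_host string.
collect_diag() {
    jobid=$1
    exec_host=$2
    section qstat -an "$jobid"
    for node in $(hosts_from_exec_host "$exec_host"); do
        section pbsnodes "$node"
    done
    section tracejob "$jobid"
}

# Only run when a job ID is supplied and the PBS client tools are on PATH.
if [ -n "$1" ] && command -v qstat >/dev/null 2>&1; then
    collect_diag "$1" "$2"
fi
```

Because tracejob reads the PBS log files, it is usually run as root on the server host; the second argument can be left empty if the job has no nodes assigned.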


What do the server_logs and mom_logs say about why the job cannot be deleted with qdel?

Who is trying to qdel the job? Is it root or the user who submitted the job?

Can you share the logs?
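
One way to pull the relevant lines out of the logs before sharing them is sketched below. It assumes the logs are plain daily files in a directory such as server_logs under PBS_HOME (/var/spool/PBS in the pbs.conf shown earlier); the function name is made up for illustration, and tracejob remains the usual tool for this.

```shell
#!/bin/sh
# Hypothetical helper: print every log line that mentions a job ID.
# $1 = directory holding the daily log files, $2 = job ID (e.g. 1535.admin)
job_log_lines() {
    dir=$1
    jobid=$2
    # -F treats the job ID as a fixed string, so the dot is not a wildcard.
    # The extra /dev/null forces grep to prefix each match with its file name.
    grep -F -- "$jobid" "$dir"/* /dev/null 2>/dev/null
}
```

For example, job_log_lines /var/spool/PBS/server_logs 1535.admin would be run on the server host; the MoM logs live on the execution hosts, so the same call would have to be repeated there against their mom_logs directory.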

The job ID contains "admin", but your /etc/pbs.conf says PBS_SERVER=sansao. It looks like your cluster has multiple interfaces and more than one point of name resolution, so you will need to describe what your cluster looks like from a network and naming point of view.
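
The naming difference can be checked mechanically. The sketch below compares the server part of a job ID with the PBS_SERVER entry in a pbs.conf-style file; the function names are hypothetical, and a mismatch is only a hint that the server is known under more than one name, not proof of what causes the Unauthorized Request.

```shell
#!/bin/sh
# Server part of a job ID: everything after the first dot ("1535.admin" -> "admin").
job_server() {
    printf '%s\n' "${1#*.}"
}

# Value of PBS_SERVER from a pbs.conf-style file (path defaults to /etc/pbs.conf).
conf_server() {
    sed -n 's/^PBS_SERVER=//p' "${1:-/etc/pbs.conf}"
}

# Report whether the two names agree; returns non-zero on a mismatch.
check_server_name() {
    js=$(job_server "$1")
    cs=$(conf_server "$2")
    if [ "$js" = "$cs" ]; then
        echo "match: $js"
    else
        echo "mismatch: job ID says '$js', pbs.conf says '$cs'"
        return 1
    fi
}
```

With the values posted in this thread, check_server_name 1535.admin /etc/pbs.conf would report a mismatch between 'admin' and 'sansao'.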

I noticed in your other post about the MoM attribute that you are running 11.3.2. This is a pretty old version of PBS. I am not saying 11.3.2 has any issues, but I wanted to make sure you were aware that v13.0 has much better support for multi-interface clusters.
