  • Announcements

    • admin

      PBS Forum Has Closed   06/12/17

      The PBS Works Support Forum is no longer active.  For PBS community-oriented questions and support, please join the discussion at http://community.pbspro.org.  Any new security advisories related to commercially-licensed products will be posted in the PBS User Area (https://secure.altair.com/UserArea/). 

Leaderboard


Popular Content

Showing most liked content since 06/20/13 in all areas

  1. 4 points
    Tomtom

    Node Down PBS Pro

Hello, I have a problem with a node in PBS. I'm trying to run PBS on virtual machines using VirtualBox with two guests:
1) server guest, hostname server.localdomain, with a PBS server installation and the Altair license manager, OS: CentOS 6.5
2) node1 guest, hostname nodo1.localdomain, with an execution-only PBS installation, OS: CentOS 6.5

I created nodo1 with qmgr using the command: create node nodo1 mom=nodo1.localdomain
I started the MoM on node1 first, and then the scheduler, server and MoM on the server. When I run "pbsnodes -a" I get:

server
     Mom = server.localdomain
     Port = 15002
     pbs_version = PBSPro_12.2.1.140292
     ntype = PBS
     state = free
     pcpus = 1
     resources_available.arch = linux
     resources_available.host = server
     resources_available.mem = 995772kb
     resources_available.ncpus = 1
     resources_available.vnode = server
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.netwins = 0
     resources_assigned.vmem = 0kb
     comment = node down: communication closed
     resv_enabled = True
     sharing = default_shared

nodo1
     Mom = nodo1.localdomain
     Port = 15002
     pbs_version = unavailable
     ntype = PBS
     state = state-unknown
     pcpus = 1
     resources_available.host = nodo1
     resources_available.ncpus = 1
     resources_available.vnode = nodo1
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.netwins = 0
     resources_assigned.vmem = 0kb
     comment = node down: communication closed
     resv_enabled = True
     sharing = default_shared

All the IPs are static:
server: 192.168.56.200
nodo1: 192.168.56.201

I edited the /etc/hosts file:
192.168.56.200 server.localdomain server
192.168.56.201 nodo1.localdomain nodo1

When I ping the server from node1, and node1 from the server, I get a response.

node1 mom log:
06/11/2014 10:29:05;0002;pbs_mom;Svr;Log;Log opened
06/11/2014 10:29:05;0002;pbs_mom;Svr;pbs_mom;pbs_version=PBSPro_12.2.1.140292
06/11/2014 10:29:05;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=linux:security=:configure_args='--set-cflags=-g -O2 -D_LARGEFILE_SOURCE ' --with-database-dir=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/pgsql --with-database-user=pbsdata --with-editline=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/editline --with-tcl=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/tcltk --with-libical=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/libical --with-python=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/python --with-gsoap=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/gSOAP --with-swig=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/swig --with-expat=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/expat --with-hwloc=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/hwloc --with-licensing=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/alsdk --with-licensing-libtype=linux_x64 --prefix=/usr/pbs --enable-sharedlib
06/11/2014 10:29:05;0100;pbs_mom;Svr;parse_config;file config
06/11/2014 10:29:05;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.56.200 as authorized
06/11/2014 10:29:05;0002;pbs_mom;n/a;set_restrict_user_maxsysid;setting 499
06/11/2014 10:29:05;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
06/11/2014 10:29:05;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as
authorized 06/11/2014 10:29:05;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.56.201 as authorized 06/11/2014 10:29:05;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path. 06/11/2014 10:29:05;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/PBS/checkpoint/ 06/11/2014 10:29:05;0002;pbs_mom;n/a;initialize;pcpus=1, OS reports 1 cpu(s) 06/11/2014 10:29:05;0006;pbs_mom;Fil;pbs_mom;Version PBSPro_12.2.1.140292, started, initialization type = 0 06/11/2014 10:29:05;0002;pbs_mom;Svr;pbs_mom;Mom pid = 3443 ready, using ports Server:15001 MOM:15002 RM:15003 06/11/2014 10:29:05;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at server.localdomain:15001 server log: 06/11/2014 10:38:37;0002;Server@server;Svr;Log;Log opened 06/11/2014 10:38:37;0002;Server@server;Svr;Server@server;pbs_version=PBSPro_12.2.1.140292 06/11/2014 10:38:37;0002;Server@server;Svr;Server@server;pbs_build=mach=linux:security=:configure_args='--set-cflags=-g -O2 -D_LARGEFILE_SOURCE ' --with-database-dir=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/pgsql --with-database-user=pbsdata --with-editline=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/editline --with-tcl=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/tcltk --with-libical=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/libical --with-python=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/python --with-gsoap=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/gSOAP --with-swig=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/swig --with-expat=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/expat --with-hwloc=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/hwloc --with-licensing=/home/pbsbuild/PBSPro_12_2_1/PBSPro_12.2.1.140292/linux26_x86_64-work/alsdk --with-licensing-libtype=linux_x64 --prefix=/usr/pbs --enable-sharedlib 06/11/2014 10:38:37;0006;Server@server;Fil;Server@server;Version PBSPro_12.2.1.140292, started, initialization type = 1 06/11/2014 10:38:37;0002;Server@server;Svr;Server@server;Starting PBS dataservice 06/11/2014 10:38:39;0002;Server@server;Svr;Server@server;connected to PBS dataservice@server.localdomain 06/11/2014 10:38:39;0086;Server@server;Svr;pbs_python_ext_quick_start_interpreter;--> Python Interpreter quick started, compiled with version:'2.5.1' <-- 06/11/2014 10:38:39;0086;Server@server;Svr;pbs_python_ext_quick_start_interpreter;--> inserted Altair PBS Python modules dir='/opt/pbs/default/lib/python/altair:/opt/pbs/default/lib/python/python2.5:/opt/pbs/default/lib/python/python2.5/shared' <-- 06/11/2014 10:38:39;0002;Server@server;Fil;Server@server;PBS Server hostname is server.localdomain, Server-id is 1 06/11/2014 10:38:39;0002;Server@server;n/a;setup_env;read environment from /var/spool/PBS/pbs_environment 06/11/2014 10:38:39;0004;Server@server;Svr;Server@server;Hostid is 0xa8c0c838 06/11/2014 10:38:39;0004;Server@server;Svr;Server@server;Using license server at 6200@127.0.0.1 06/11/2014 10:38:39;0002;Server@server;Svr;Act;Account file /var/spool/PBS/server_priv/accounting/20140611 opened 06/11/2014 10:38:39;0086;Server@server;Svr;Server@server;Recovered queue workq 06/11/2014 10:38:39;0002;Server@server;Svr;Server@server;Expected 1, recovered 1 queues 06/11/2014 10:38:39;0080;Server@server;Svr;Server@server;No jobs to open 06/11/2014 10:38:39;0086;Server@server;Svr;Server@server;Found hook PBS_translate_mpp 
type=pbs 06/11/2014 10:38:39;0086;Server@server;Svr;Server@server;Found hook PBS_ibwins type=pbs 06/11/2014 10:38:39;0080;Server@server;Hook;print_hook;ALLHOOKS hook[0] = {PBS_translate_mpp, order=1000, type=1, enabled=0 user=0, event=(queuejob,resvsub), alarm=90, freq=120} 06/11/2014 10:38:39;0080;Server@server;Hook;print_hook;ALLHOOKS hook[1] = {PBS_ibwins, order=0, type=1, enabled=1 user=0, event=(queuejob), alarm=30, freq=120} 06/11/2014 10:38:39;0080;Server@server;Hook;print_hook;queuejob hook[0] = {PBS_ibwins, order=0, type=1, enabled=1 user=0, event=(queuejob), alarm=30, freq=120} 06/11/2014 10:38:39;0080;Server@server;Hook;print_hook;queuejob hook[1] = {PBS_translate_mpp, order=1000, type=1, enabled=0 user=0, event=(queuejob,resvsub), alarm=90, freq=120} 06/11/2014 10:38:39;0080;Server@server;Hook;print_hook;resvsub hook[0] = {PBS_translate_mpp, order=1000, type=1, enabled=0 user=0, event=(queuejob,resvsub), alarm=90, freq=120} 06/11/2014 10:38:39;0086;Server@server;Svr;pbs_python_ext_quick_shutdown_interpreter;--> Stopping Python interpreter <-- 06/11/2014 10:38:39;0002;Server@server;Svr;Server@server;Server pid = 5249 ready; using ports Server:15001 Scheduler:15004 MOM:15002 MOM_GLOBUS:15005 RM:15003 06/11/2014 10:38:39;0100;Server@server;Svr;Server@server;--> Python Interpreter started, compiled with version:'2.5.1' <-- 06/11/2014 10:38:39;0100;Server@server;Svr;Server@server;--> inserted Altair PBS Python modules dir='/opt/pbs/default/lib/python/altair:/opt/pbs/default/lib/python/python2.5:/opt/pbs/default/lib/python/python2.5/shared' <-- 06/11/2014 10:38:39;0100;Server@server;Svr;Server@server;sys.modules= {'zipimport': <module 'zipimport' (built-in)>, 'pbs._pbs_v1': None, 'pbs.pbs': None, 'pbs': <module 'pbs' from '/opt/pbs/default/lib/python/altair/pbs/__init__.pyo'>, 'pbs.v1': <module 'pbs.v1' from '/opt/pbs/default/lib/python/altair/pbs/v1/__init__.pyo'>, 'signal': <module 'signal' (built-in)>, '__builtin__': <module '__builtin__' (built-in)>, 'pbs.v1._export_types': <module 'pbs.v1._export_types' from '/opt/pbs/default/lib/python/altair/pbs/v1/_export_types.pyo'>, 'sys': <module 'sys' (built-in)>, 'pbs.v1._base_types': <module 'pbs.v1._base_types' from '/opt/pbs/default/lib/python/altair/pbs/v1/_base_types.pyo'>, '_pbs_v1': <module '_pbs_v1' (built-in)>, '_pbs_v1.svr_types': <module '_pbs_v1.svr_types' (built-in)>, '__main__': <module '__main__' (built-in)>, 'exceptions': <module 'exceptions' (built-in)>, 'pbs.v1._svr_types': <module 'pbs.v1._svr_types' from '/opt/pbs/default/lib/python/altair/pbs/v1/_svr_types.pyo'>, 'pbs.v1._pbs_v1': None, 'pbs.v1._exc_types': <module 'pbs.v1._exc_types' from '/opt/pbs/default/lib/python/altair/pbs/v1/_exc_types.pyo'>} 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;BEGIN setting up all resource attributes 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;DONE setting up all resource attributes, number set <50> 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;BEGIN setting up all queue attributes 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;DONE setting up all queue attributes, number set <51> 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;BEGIN setting up all job attributes 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;DONE setting up all job attributes, number set <91> 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;BEGIN setting up all server attributes 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;DONE setting up all server attributes, number set <83> 
06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;BEGIN setting up all reservation attributes 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;DONE setting up all reservation attributes, number set <43> 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;BEGIN setting up all vnode attributes 06/11/2014 10:38:39;0106;Server@server;Svr;Server@server;DONE setting up all vnode attributes, number set <31> 06/11/2014 10:38:39;0040;Server@server;Svr;server.localdomain;Scheduler sent command 10 06/11/2014 10:38:39;0040;Server@server;Svr;server.localdomain;Scheduler sent command 0 06/11/2014 10:38:39;0002;Server@server;Node;server.localdomain;update2 state:0 ncpus:1 06/11/2014 10:38:39;0002;Server@server;Node;server.localdomain;Mom reporting 1 vnodes as of Wed Jun 11 10:38:37 2014 06/11/2014 10:38:39;0002;Server@server;Node;server.localdomain;node up 06/11/2014 10:38:39;0100;Server@server;Req;;Type 9 request received from Scheduler@server.localdomain, sock=11 06/11/2014 10:38:39;0004;Server@server;Sched;Server@server;attributes set: at request of Scheduler@server.localdomain 06/11/2014 10:38:39;0004;Server@server;Sched;Server@server;attributes set: sched_host = server.localdomain 06/11/2014 10:38:39;0004;Server@server;Sched;Server@server;attributes set: pbs_version = PBSPro_12.2.1.140292 06/11/2014 10:38:39;0100;Server@server;Req;;Type 82 request received from Scheduler@server.localdomain, sock=11 06/11/2014 10:38:39;0100;Server@server;Req;;Type 21 request received from Scheduler@server.localdomain, sock=11 06/11/2014 10:38:39;0100;Server@server;Req;;Type 81 request received from Scheduler@server.localdomain, sock=11 06/11/2014 10:38:39;0100;Server@server;Req;;Type 71 request received from Scheduler@server.localdomain, sock=11 06/11/2014 10:38:39;0100;Server@server;Req;;Type 58 request received from Scheduler@server.localdomain, sock=11 06/11/2014 10:38:39;0100;Server@server;Req;;Type 20 request received from Scheduler@server.localdomain, sock=11 06/11/2014 10:38:39;0100;Server@server;Req;;Type 51 request received from Scheduler@server.localdomain, sock=11 06/11/2014 10:38:41;0002;Server@server;Node;server.localdomain;Mom restarted on host 06/11/2014 10:38:41;0002;Server@server;Node;server.localdomain;node down: ping no stream 06/11/2014 10:38:41;0002;Server@server;Node;server.localdomain;update2 state:0 ncpus:1 06/11/2014 10:38:41;0002;Server@server;Node;server.localdomain;Mom reporting 1 vnodes as of Wed Jun 11 10:38:37 2014 06/11/2014 10:38:41;0002;Server@server;Node;server.localdomain;node up 06/11/2014 10:38:44;0004;Server@server;Svr;Server@server;cancelled a license server task 06/11/2014 10:38:44;0006;Server@server;Svr;Server@server;License file location set to 6200@127.0.0.1 06/11/2014 10:38:44;0080;Server@server;Svr;Server@server;One or more of the required features checked out will expire in 29 day(s) 06/11/2014 10:39:04;0002;Server@server;Node;nodo1.localdomain;node down: communication closed 06/11/2014 10:48:39;0040;Server@server;Svr;server.localdomain;Scheduler sent command 3 06/11/2014 10:48:39;0040;Server@server;Svr;server.localdomain;Scheduler sent command 0 06/11/2014 10:48:39;0100;Server@server;Req;;Type 21 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:48:39;0100;Server@server;Req;;Type 81 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:48:39;0100;Server@server;Req;;Type 71 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:48:39;0100;Server@server;Req;;Type 58 request received from 
Scheduler@server.localdomain, sock=12 06/11/2014 10:48:39;0100;Server@server;Req;;Type 20 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:48:39;0100;Server@server;Req;;Type 51 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:55:38;0100;Server@server;Req;;Type 0 request received from root@server.localdomain, sock=12 06/11/2014 10:55:38;0100;Server@server;Req;;Type 49 request received from root@server.localdomain, sock=13 06/11/2014 10:55:38;0100;Server@server;Req;;Type 9 request received from root@server.localdomain, sock=12 06/11/2014 10:55:38;0004;Server@server;Node;nodo1;attributes set: at request of root@server.localdomain 06/11/2014 10:55:38;0004;Server@server;Node;nodo1;attributes set: state - offline 06/11/2014 10:55:38;0004;Server@server;Node;nodo1;attributes set: state - down 06/11/2014 10:55:52;0100;Server@server;Req;;Type 0 request received from root@server.localdomain, sock=12 06/11/2014 10:55:52;0100;Server@server;Req;;Type 49 request received from root@server.localdomain, sock=13 06/11/2014 10:55:52;0100;Server@server;Req;;Type 58 request received from root@server.localdomain, sock=12 06/11/2014 10:58:39;0040;Server@server;Svr;server.localdomain;Scheduler sent command 3 06/11/2014 10:58:39;0040;Server@server;Svr;server.localdomain;Scheduler sent command 0 06/11/2014 10:58:39;0100;Server@server;Req;;Type 21 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:58:39;0100;Server@server;Req;;Type 81 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:58:39;0100;Server@server;Req;;Type 71 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:58:39;0100;Server@server;Req;;Type 58 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:58:39;0100;Server@server;Req;;Type 20 request received from Scheduler@server.localdomain, sock=12 06/11/2014 10:58:39;0100;Server@server;Req;;Type 51 request received from Scheduler@server.localdomain, sock=12 06/11/2014 11:08:39;0040;Server@server;Svr;server.localdomain;Scheduler sent command 3 06/11/2014 11:08:39;0040;Server@server;Svr;server.localdomain;Scheduler sent command 0 06/11/2014 11:08:39;0100;Server@server;Req;;Type 21 request received from Scheduler@server.localdomain, sock=12 06/11/2014 11:08:39;0100;Server@server;Req;;Type 81 request received from Scheduler@server.localdomain, sock=12 06/11/2014 11:08:39;0100;Server@server;Req;;Type 71 request received from Scheduler@server.localdomain, sock=12 06/11/2014 11:08:39;0100;Server@server;Req;;Type 58 request received from Scheduler@server.localdomain, sock=12 06/11/2014 11:08:39;0100;Server@server;Req;;Type 20 request received from Scheduler@server.localdomain, sock=12 06/11/2014 11:08:39;0100;Server@server;Req;;Type 51 request received from Scheduler@server.localdomain, sock=12 Why i always get node1 down? Have i to install a DNS server or it is sufficient to set all the hostname in /etc/hosts ?
  2. 3 points
    Gabe Turner

    Command history in qmgr

    Though, like many, I would like to see qmgr use readline to support things like command history and alternate key bindings (vi keys? yes please), I thought I'd share a work-around that I've been using: Install rlwrap (which wraps things in readline), and alias qmgr to 'rlwrap qmgr'. HTH, Gabe
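    For reference, a minimal sketch of that workaround, assuming a bash shell and that rlwrap is available from your distribution's package repositories (package name may differ on your platform):

    # install rlwrap with your platform's package manager
    yum install rlwrap            # or: apt-get install rlwrap
    # wrap qmgr in readline so it gets history and editable key bindings
    alias qmgr='rlwrap qmgr'      # add to ~/.bashrc to make it permanent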
  3. 2 points
    jshelley

    Hook to check limits

    Here is an example hook that checks the walltime, ncpus, and memory limits (user, queue, cluster). I hope someone may find it useful. Jon MY_check_limits.txt
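    For readers unfamiliar with loading a hook file like the attachment above, a hedged sketch of the usual qmgr steps, run as root on the server host. The hook name "check_limits" and the queuejob event are assumptions here, not taken from the attachment:

    # create a server hook, attach it to the queuejob event, and import the script body
    qmgr -c "create hook check_limits"
    qmgr -c "set hook check_limits event = queuejob"
    qmgr -c "import hook check_limits application/x-python default MY_check_limits.txt"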
  4. 1 point
    pgarg

    Add Abaqus Job Template in Compute Manager

    I have been trying to add an Abaqus job in Compute Manager. We are using the Dassault license server for Abaqus. The Abaqus job works fine outside the PBS environment, but it throws an error message when reading the Abaqus license variable.

    /cm/shared/apps/Abaqus/DassaultSystemes/SIMULIA/Commands/abaqus memory=2000 cpus=4 job=nut8mm.inp /var/DassaultSystemes/Licenses/DSLicSrv.txt /var/DassaultSystemes/Licenses/DSLicSrv.txt
    Abaqus Warning: The .inp extension has been removed from the job identifier
    Abaqus Warning: Keyword (dsls_license_config) must point to an existing file.
    Abaqus Error: License for standard with cpus=4 is not available.
    Abaqus/Analysis exited with error(s).

    Can anyone please share your Abaqus templates?
  5. 1 point
    jshelley

    Some hook examples

    Here are some other examples that others may find useful. ecores_time_limit.txt does a calculation of ecores * walltime to come up with a usage percentage. This usage percentage is determined by a function of interpolated points that defines the desired boundary. MY_pbs_data.txt is a module that can be used to calculate ecores as well as output job and node information in XML. I hope that others in the PBS hooks community will find this useful. Jon MY_pbs_data.txt ecores_time_limits.txt
  6. 1 point
    Well, the ideal solution would be to get your friendly neighborhood system admin to set the built-in "software" resource on the nodes that have Python on them (because that's what that resource is there for) and configure the scheduler to take that resource into consideration for scheduling purposes. Then, all you'd have to do is something like this: qsub -lselect=1:software=python .... and the scheduler would take care of the rest for you. This takes exactly one qmgr command and a one-line change to the scheduler configuration. But, as you said, you don't have the requisite permission to do that.
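    For illustration, a hedged sketch of that one-command-plus-one-line change on the admin side. The vnode name node01 is hypothetical, and the sched_config edit is the usual way of making the scheduler match on a resource:

    # mark the vnodes that actually have Python installed (vnode name assumed)
    qmgr -c "set node node01 resources_available.software = python"
    # then add 'software' to the resources: line in $PBS_HOME/sched_priv/sched_config,
    # e.g. resources: "ncpus, mem, arch, host, vnode, software"
    # and HUP the scheduler so it re-reads its configuration
    kill -HUP $(pgrep -x pbs_sched)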
  7. 1 point
    Scott Suchyta

    PBS Professional Open Source

    We are excited to inform you that the PBS Professional team has successfully completed its goal of releasing the open source licensing option of PBS Professional by mid-2016.

    Now Available for Download
    Visit the brand new website www.pbspro.org to learn more about the initiative and download the software packages.

    Join the Community
    We want you to be a part of the open source project community! Join our forum to continue to receive announcements and interact with one another to discuss topics and help answer questions. Everyone is welcome to contribute to the code in a variety of ways, including developing new capabilities, testing, etc. Visit www.pbspro.org to learn about our different mailing lists and the numerous ways to participate.

    Thank you,
    The PBS Professional Open Source Team
  8. 1 point
    If you run into this problem you can substitute the system zip command for the one that PAS provides. To do so, simply run these commands (as root on the PAS host):

    . /etc/pas.conf
    mv $PAS_EXEC/bin/Linux-x86_64/zip $PAS_EXEC/bin/Linux-x86_64/zip.pas_orig
    ln -s $(which zip) $PAS_EXEC/bin/Linux-x86_64/zip

    No restart of PAS or CM is required.
  9. 1 point
    Hello Roland,

    There are a couple of factors involved based on whether or not you're running Windows, what version of PBS you are running, whether you need to upgrade PBS when you migrate to the new server (i.e. the older PBS version doesn't support the newer OS version, or you simply want to upgrade), and whether or not you want to keep your jobs running during the move. You'll also need to make sure that your license server is relocated to the new server if it is running on the server you are retiring.

    In a perfect world, where you are not running Windows, the license server is on a separate system, the PBS versions are identical, the operating system versions are identical, and jobs are no longer running, you can copy things over and be able to start PBS back up on the new server without changes. If you are not living in this perfect world, please let us know:
    - The OS and version of the server you are retiring
    - The OS and version of the server you are migrating to
    - Where your license server resides
    - What version of PBS you are currently running
    - What version of PBS Pro you wish to run on the new server
    - Do you need to keep jobs running during the move?

    If you are living in the perfect world I described and wish to use the copy method (a shell sketch of steps 4-6 follows this list):
    1) on the old server and all execution nodes, /etc/init.d/pbs stop
    2) make backup copies of /etc/pbs.conf, $PBS_EXEC (/opt/pbs/default by default), and $PBS_HOME (/var/spool/PBS by default)
    3) copy /etc/pbs.conf to the new server
    4) modify the PBS_SERVER line in /etc/pbs.conf on the new server to reflect the new server hostname
    5) modify the PBS_SERVER line in /etc/pbs.conf on each execution node to reflect the new server hostname
    6) modify all $clienthost lines in $PBS_HOME/mom_priv/config on each execution host to reflect the new server hostname
    7) copy $PBS_HOME to the same location on the new server
    8) copy $PBS_EXEC to the same location on the new server
    9) copy /etc/init.d/pbs to the new server
    10) on the new server, chkconfig pbs on
    11) on the new server and all execution hosts, /etc/init.d/pbs start
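    A minimal sketch of steps 4-6, assuming bash, the default $PBS_HOME of /var/spool/PBS, and a hypothetical new hostname of newserver.example.com:

    # on the new server and on each execution node, point PBS at the new server host
    sed -i 's/^PBS_SERVER=.*/PBS_SERVER=newserver.example.com/' /etc/pbs.conf
    # on each execution host, update the $clienthost entry the MoM trusts
    sed -i 's/^[$]clienthost.*/$clienthost newserver.example.com/' /var/spool/PBS/mom_priv/config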
  10. 1 point
    cjm

    removing an execution host

    Hi Ingrid, First, I would make sure no jobs are running on the nodes, then stop pbs on the hosts you are planning to remove (/etc/init.d/pbs stop), then remove the nodes on the PBS server using qmgr (qmgr -c 'd node <node_name>'). If you are planning to use the nodes for other purposes, I would also make sure you disable the PBS daemons from startup.
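    A condensed sketch of those steps, using a hypothetical node name (node05) and assuming the stock init script and chkconfig are in use:

    # on node05, after confirming no jobs are running there
    /etc/init.d/pbs stop
    chkconfig pbs off                  # keep the PBS daemons from starting at boot
    # on the PBS server
    qmgr -c 'd node node05'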
  11. 1 point
    scc

    PBS Job dependency script on Windows

    This will work from a Windows .bat file:

    @echo off
    for /f "delims=" %%a in ('qsub job1.pbs') do @set FIRST=%%a
    echo %FIRST%
    for /f "delims=" %%a in ('"qsub -W depend=afterany:%FIRST% job2.pbs"') do @set SECOND=%%a
    echo %SECOND%

    -Scott
  12. 1 point
    Hi NNN, sorry for the delayed response; I was on vacation. I do not have an example execjob_epilogue at my fingertips - sorry. Given that hooks are written in Python, you can code it as you would like; I believe PEP style would prefer this to be handled in a try: block. In order for the job to be requeued, the job must have the rerunnable flag set to 'y'. Otherwise, requeuing a non-rerunnable job will result in the job exiting the queue. Rerunning the job (via a hook or qrerun) will result in the job losing its queue wait time, so this does not guarantee the job will have higher priority when it is requeued. You may want to consider a tunable job sort formula, whereby your execjob_epilogue modifies a numeric resource used by the formula to increase the job's priority (see the sketch below). Does this make sense?
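    A rough sketch of the tunable-formula idea under PBS Pro 12.x conventions. The custom resource name "boost", the formula itself, and the job ID are all assumptions for illustration, not something from this thread, and defining a resource in resourcedef requires a server restart on these versions:

    # define a job-wide numeric resource named 'boost' (12.x style: server_priv/resourcedef)
    echo "boost type=long" >> /var/spool/PBS/server_priv/resourcedef
    # after restarting the server, reference the resource in the job sort formula
    qmgr -c "set server job_sort_formula = 'queue_priority + boost'"
    # an epilogue hook (or an admin by hand) can then raise a requeued job's priority, e.g.
    qalter -l boost=100 1234.server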
  13. 1 point
    Most sites perform their node health checks and sanitize the environment before the job launches on the execution nodes, using the execjob_prologue event. I may not be exposed to the applications you are running, but I have not found applications reliable at diagnosing an issue with the environment. Assuming that you can get reliable diagnostic information from the job's err/out/log file to make a decision, I believe you will want to use the execjob_epilogue event, because the execjob_end event will *not* let you rerun the job (see the PBS Professional v12.2 Admin Guide, Sections 6.9.9 and 6.9.11). One idea to consider is to have the application definition, which becomes the job script, write out an application-specific script that parses the log/err/out file. The execjob_epilogue would then execute this known script, if it exists.
  14. 1 point
    Scott Suchyta

    net stop/start pbs_server

    ingrid, thanks for confirming that state=free for the nodes that had full drives. I will make a note of this in our systems and see what can be done to avoid this situation. WRT the ER/OU files being written back to the same network drive - you might be on the right path. IIRC, once the mapped network drive is in use, it cannot be used again by other 'batch' processes; at least I could not map the U: drive multiple times on the same system in multiple batch processes. I suspect that you should have seen copy issues in the mom_log files of the affected hosts. There should have been some error message accompanying the failure.
  15. 1 point
    scc

    Insufficient amount of resource ncpus

    Hello Ingrid, please send details to pbssupport@altair.com (please include the qsub commands you are using and a copy of "pbsnodes -av" output) and we can work with you to understand what is going on. It sounds like you might simply need assistance understanding PBS Pro's "qsub -lselect=..." resource specification syntax which allows you to control things like how many ncpus are needed from each host vs. needing to place all of the ncpus requested on a single host. Thanks. -Scott
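    For illustration, a hedged example of that select/place syntax (the chunk counts and script name are hypothetical):

    # ask for 8 cpus as 4 chunks of 2 ncpus, each chunk placed on a different host
    qsub -l select=4:ncpus=2 -l place=scatter job.sh
    # or pack all 8 cpus onto a single host
    qsub -l select=1:ncpus=8 job.sh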
  16. 1 point
    sameermdesh

    Jobs status shows Waiting

    Thanks Scott for the reply... This got resolved after adding the line below to /etc/pbs.conf on the client node:

    PBS_SCP=/usr/bin/scp
  17. 1 point
    Thanks for your response. The solution we went with was to write all program output to an external file instead of relying on the output being written to the .o file. It looks like we can safely assume the writes within the PBS script finish before the next job runs, even if we can't assume the .o and .e files are completely written. So we no longer use the .o files for anything during the run of jobs (except for possible error-reporting -- we now just combine the .o and .e files since the .o files will always now be empty).
  18. 1 point
    Scott Suchyta

    pbsnodes error

    I am not sure if you have referenced section 13.6.2 of the PBS Professional AG v12.2, but I am going to quote it here:
    - Verify hostname resolution between the MoM and the server. You may want to use pbs_hostn -v for this activity.
    - Verify the PBS MoM daemon is running on the node.
    - Verify that /etc/pbs.conf has the correct hostname of the PBS server.
    - Verify that the $clienthost entry in $PBS_HOME/mom_priv/config has the correct server name.
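    A quick sketch of those checks from a shell on the execution host; the server hostname here is a placeholder, and $PBS_HOME is assumed to be the default /var/spool/PBS:

    pbs_hostn -v myserver.example.com                 # does the MoM host resolve the server?
    ps -e | grep pbs_mom                              # is the MoM daemon running?
    grep PBS_SERVER /etc/pbs.conf                     # does pbs.conf name the right server?
    grep clienthost /var/spool/PBS/mom_priv/config    # does $clienthost match?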
  19. 1 point
    Hello TomTom, Currently there are two separate library paths for the python distributed with PBS Professional, one for running things outside the server context and one for running things inside the server context (i.e. server hooks). In order to import socket, you will need to add the external PBS python library paths to the hook's sys.path. Beware, however, that you should be careful what you do with these modules, as every server hook is blocking by design. You would not, for instance, want to create some kind of service layer inside a hook as it will need to complete all processing before control is released back to the server. Caveat aside, to add the external PBS python paths to a hook, you can follow the PBS Pro Admin Guide 12.2, section 6.16 "Python Modules and PBS". For your benefit, assuming $PBS_EXEC is /opt/pbs/default:

    import sys
    my_paths = ['/opt/pbs/default/python/lib/python25.zip',
                '/opt/pbs/default/python/lib/python2.5',
                '/opt/pbs/default/python/lib/python2.5/plat-linux2',
                '/opt/pbs/default/python/lib/python2.5/lib-tk',
                '/opt/pbs/default/python/lib/python2.5/lib-dynload',
                '/opt/pbs/default/python/lib/python2.5/site-packages']
    for my_path in my_paths:
        if my_path not in sys.path:
            sys.path.append(my_path)

    Be sure to check for existence of the path in sys.path first, as the python process does not get reloaded for every server hook execution as it does for mom hooks. EDIT: corrected my_paths
  20. 1 point
    PBS Pro license issue: "Not Running: PBS Error: Floating License unavailable" (09-May-2014)

    System:
    - PBS Pro version 12.0 installed on SLES 11 SP2
    - LMX version 12.0 license manager
    - Network license schema using a three-server HAL setup

    Symptom:
    - qstat -sw reports "Not Running: PBS Error: Floating License unavailable"
    - One or more jobs in Q state affected
    - Potentially your LMX license server is down and needs to be restarted
    - Potentially an update was done to LMX licensing (such as a new license file) and, despite normal appearances where other Altair LMX based applications are able to check out licenses (look at the LMX logs on the license server and run lmx license stat), PBS Pro is reporting this problem
    - Existing jobs in the PBS queue may be running unaffected
    - Some new jobs submitted to the PBS queue may transition to R state

    Notes:
    - Attempting to qrun an affected job fails because PBS cannot allocate the needed licenses
    - Verify the PBS license location string is correct
    - If using a HAL schema, the host order in the string has to match the HAL server order in the LMX server configuration file, altair-serv.cfg
    - Sending the PBS scheduler a HUP will not work
    - You do not want to attempt a qterm and warm start of the PBS server in the given license circumstance

    How to resolve: re-insert the pbs_license_info string using qmgr, causing PBS to reinitialize the licensing internally.
    - Use qmgr -c "p s" | grep pbs_license_info to display the license string; this is a good time to confirm it is correct, see Notes above
    - Using the license string captured above, or a corrected one, reinitialize the PBS license connection:
      qmgr -c "set server pbs_license_file_location = 6200@huey:6200@louie:6200@dewey"

    Example of condition:

    qstat -ws
    vulcan:
                                                                               Req'd  Req'd   Elap
    Job ID         Username Queue    Jobname         SessID  NDS TSK  Memory  Time  S Time
    -------------- -------- -------- --------------- ------- --- ---- ------- ----- - --------
    106976.vulcan  krockon  gp       MRB_FC          3221    2   4    126gb   --    R 76:20:11
       Job run at Tue May 06 at 05:56 on (vulcan[22]:ncpus=2:mem=66060288kb)+(vulcan[23]:ncpus=2:mem=66060288kb)
    107077.vulcan  choosew  cfd_gp   Cfg6_CP_2       279272  10  60   300gb   --    R 65:52:12
       Job run at Tue May 06 at 16:24 on (vulcan[62]:ncpus=6:mem=31457280kb)+(vulcan[63]:ncpus=6:mem=31457280kb)+(...
    107282.vulcan  hammero  gp       HX22E_full_f    379874  1   4    254gb   --    R 02:59:10
       Job run at Fri May 09 at 07:17 on (vulcan[30]:mem=67108864kb:ncpus=4+vulcan[31]:mem=67092480kb+vulcan[28]...
    107293.vulcan  nuktynm  shortgp  nxn_43n_bm      --      1   4    63gb    01:00 Q --
       Not Running: PBS Error: Floating License unavailable
    107294.vulcan  nuktynm  shortgp  nxn_43m_bm      --      1   6    63gb    01:00 Q --
       Not Running: PBS Error: Floating License unavailable
    107299.vulcan  aaelidk  gp       SOF5_Heb_SYS12  --      1   4    63gb    --    Q --
       Not Running: PBS Error: Floating License unavailable
    107301.vulcan  saracki  gp       MCDV-Down-Af    158128  1   1    254gb   --    R 00:04:14
       Job run at Fri May 09 at 10:12 on (vulcan[26]:mem=67108864kb:ncpus=1+vulcan[27]:mem=67092480kb+vulcan[24]...
  21. 1 point
    Clear a hung (zombie) vnode (22-April-2014)

    System: PBS Pro version 12.0 installed on SLES 11 SP2 running on an SGI UV1000

    Symptom:
    - PBS Pro qstat reports a job or jobs in R state without a SessID or Elapsed Time.
    - The job is not running on the OS, nor is there an existing vnode cpuset; look in /dev/cpuset/PBSPro, there will not be a cpuset directory for the affected job.
    - qdel is ineffective.

    How to resolve:
    - Use "qstat -r" to list the affected jobs
    - Use "qalter -r y PBS_JobID" to mark the job rerunnable
    - You can check it is now marked rerunnable: "qstat -f PBS_JobID | grep Rerunable" should return "Rerunable = True"
    - Use "qrerun -W force PBS_JobID" to re-queue the job
    - You can check it is re-queued: "qstat -a" will list the job in Q state
    - Use "qdel PBS_JobID" to terminate the job

    Tips: fixing several jobs at the same time, targeting a particular user:
    qstat -ru a_users_id
    qalter -r y PBS_JobID_1 PBS_JobID_2 PBS_JobID_3 PBS_JobID_4
    qstat -f PBS_JobID_1 PBS_JobID_2 PBS_JobID_3 PBS_JobID_4 | grep Rerunable
    qrerun -W force PBS_JobID_1 PBS_JobID_2 PBS_JobID_3 PBS_JobID_4
    qstat -au a_users_id
    qdel PBS_JobID_1 PBS_JobID_2 PBS_JobID_3 PBS_JobID_4
  22. 1 point
    The -p specifies that when starting, pbs_mom should allow any running jobs to continue running, and not have them requeued. This option can be used for single-host jobs only; multi-host jobs cannot be preserved. To answer your question as to how to kill pbs_mom so that you can preserve the job on a single-node, you will want to use SIGINT. Here is the snippet from the PBS Professional Admin Guide v12.2 on how to start and stop the pbs_mom daemon.
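    For completeness, a hedged sketch of stopping the MoM with SIGINT so single-host jobs keep running, then restarting it with -p. The pid file path assumes the default $PBS_HOME of /var/spool/PBS, and the binary path assumes the /usr/pbs prefix mentioned elsewhere in this thread:

    # send SIGINT to the running pbs_mom; running jobs are left alone
    kill -INT $(cat /var/spool/PBS/mom_priv/mom.lock)
    # later, restart the MoM telling it to preserve (not requeue) those jobs
    /usr/pbs/sbin/pbs_mom -p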
  23. 1 point
    WRT the "memory overload" issue - nice job identifying the issue. Something else you may want to consider in your setup is verifying that the interactive (not running in PBS) environment and ulimits are the _same_ as the batch environment and ulimits. For instance (a shell version is sketched below):
    1. On the compute node, as the user, execute env and ulimit -a
    2. Submit a batch script, as the same user, and have the script execute env and ulimit -a
    3. Compare the output of each.
    Now, I am not saying that all of the env and ulimit -a settings need to be part of the batch job. In fact, you really want the environment of the batch job to be very clean to ensure repeatability in the users' jobs. So, you may need to identify the required env to be set in the job.

    Ok, let's look at the second part of your issue. PBS Professional goes out of its way to recover from a failure, especially in the event of a node failure. PBS will attempt to rerun the job ASAP. By default, PBS jobs are configured to be rerunnable (-r y). So, if the node had failed, the job would be eligible to be rerun on a different set of nodes. You can change this behavior by setting the default_qsub_arguments attribute via qmgr:
    qmgr -c "set server default_qsub_arguments='-r n'"
    In addition, pbs_server will periodically check in with the primary vnode's pbs_mom daemon to see if everything is 'ok'. If the pbs_server daemon cannot reach the primary pbs_mom, then pbs_server will requeue the job. Refer to the 12.2 Reference Guide; there is also a fairly good write-up about node_fail_requeue in the 12.2 Admin Guide, Section 9.4.2 Node Fail Requeue: Jobs on Failed Vnodes.

    So, if your goal is to _not_ restart the job from the beginning when a node failure occurs, you may want to consider setting the job's -r attribute to n (no). Then, if there is a failure, the job will be terminated and will exit the queue. If you want to be more sophisticated with the node failure _and_ the application supports its own checkpoint/restart facilities, you could have the application write out periodic checkpoints to use in the event of a requeue. The user's job script would then need to detect that an application restart file exists and begin execution from the last known restart file. See Admin Guide 12.2, Section 9.3 Checkpoint Restart, and review the $action attributes of pbs_mom.
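    A minimal shell sketch of that comparison; the node name and output file names are hypothetical, and where the files land depends on the job's working directory:

    # interactively on the compute node, as the user
    env | sort > interactive.env ; ulimit -a > interactive.ulimit
    # the same commands from inside a batch job pinned to that node
    echo 'env | sort > batch.env ; ulimit -a > batch.ulimit' | qsub -l select=1:host=node01
    # once the job finishes, compare
    diff interactive.env batch.env ; diff interactive.ulimit batch.ulimit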
  24. 1 point
    Looks like you have made some really good progress. Nice work! When you execute mpirun interactively, you are providing your own hostfile. I am assuming that you are also launching mpirun from the same machine when you try 4, 8, and 16 nodes. Is this correct? The reason I am asking is that I am wondering whether your batch jobs are getting assigned the same node order as your interactive test. For instance, your interactive test hostfile says:
    compute-0-2
    compute-0-3
    compute-0-4
    compute-0-5
    But when the job is run via PBS, the hostfile list order is:
    compute-0-3
    compute-0-4
    compute-0-5
    compute-0-2
    So, with a smaller number of nodes (e.g., 4), the job may not be starting on the same node as your interactive test. Therefore, I am wondering if the environment or PATHs might be different on the other nodes. I am coming to this hypothesis after reading other posts on the internet referencing your error message: "Those error messages are coming from MPICH. Check your path and ensure you put the MPI install location first, or use the absolute path to the mpirun." So, please double check that:
    1. the mpirun location is the same on all systems
    2. the PATH order on all systems is the same; otherwise, you may want to consider redefining the PATH variable with the mpirun command line arguments
    3. the LD_LIBRARY_PATH is consistent, too
    Also, for the 4-node batch job that failed, you should see which nodes were assigned to the job - note the node order, too (see the sketch below). Then update your interactive hostfile with the same node list order, and try running it interactively.
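    A small sketch of how the job could record what it was actually given; these lines go inside the PBS job script and the names are generic, not from this thread:

    # which hosts, in which order, and which mpirun the primary host finds
    cat $PBS_NODEFILE
    echo "PATH on primary host: $PATH"
    which mpirun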
  25. 1 point
    Pan

    job was terminated in 2 second

    Hi, I found that jobs which take more than 2 seconds are terminated by PBS. Why does PBS kill jobs once they have run for 2 seconds? Looking for help.
  26. 1 point
    Pan

    job was terminated in 2 second

    Hi, I ran the jobs in interactive mode and got the same error, see the picture below. I also ran another program under PBS in interactive mode and got the same error.
  27. 1 point
    You can do this via qmgr:

    qmgr -c "set queue workq resources_default.place=scatter"

    Below is an example.

    [root@mic01 ~]# qmgr -c "p q workq"
    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue workq
    #
    create queue workq
    set queue workq queue_type = Execution
    set queue workq enabled = True
    set queue workq started = True
    [root@mic01 ~]# qmgr -c "set queue workq resources_default.place=scatter"
    [root@mic01 ~]# qmgr -c "p q workq"
    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue workq
    #
    create queue workq
    set queue workq queue_type = Execution
    set queue workq resources_default.place = scatter
    set queue workq enabled = True
    set queue workq started = True

    You can read more in the PBS Professional Admin Guide (v12.2, section 5.9.3.6 Specifying Default Job Placement).
  28. 1 point
    Scott Suchyta

    job was terminated in 2 second

    Hmm.. I am not sure what the following means:
    ./exec: line 36: 12644 已杀死 ./ampt < nseed_runtime > nohup.out
    I am assuming that this application error message is something important? From the PBS mom_logs, the task started, then it terminated, and because the session exited PBS considers the job done. I assume that this application can execute fine when not running via PBS, correct? If this is true, then I would start to examine what might be different in the user's environment.
    Experiment:
    1. Log into the compute node and execute the job script. If you are relying on PBS_* environment variables, take care to provide reasonable values so that the job script will run.
    2. Once you have the job script running, record the user's environment (env) to a file.
    3. Now, submit an interactive job (qsub -I) and work through the same steps you did above to get the job script to execute cleanly.
  29. 1 point
    Pan

    job was terminated in 2 second

    Hi, I got the error log, and the content is:
    ./exec: line 36: 12644 已杀死 ./ampt < nseed_runtime > nohup.out
    and the mom log is:
    03/04/2014 10:26:52;0008;pbs_mom;Job;62.hpc100;Started, pid = 12512
    03/04/2014 10:26:54;0080;pbs_mom;Job;62.hpc100;task 00000001 terminated
    03/04/2014 10:26:54;0008;pbs_mom;Job;62.hpc100;Terminated
    03/04/2014 10:26:54;0100;pbs_mom;Job;62.hpc100;task 00000001 cput= 0:00:02
    03/04/2014 10:26:54;0008;pbs_mom;Job;62.hpc100;kill_job
    03/04/2014 10:26:54;0100;pbs_mom;Job;62.hpc100;hpc100 cput= 0:00:02 mem=0kb
    03/04/2014 10:26:55;0008;pbs_mom;Job;62.hpc100;no active tasks
    03/04/2014 10:26:55;0100;pbs_mom;Job;62.hpc100;Obit sent
    03/04/2014 10:26:55;0100;pbs_mom;Req;;Type 54 request received from root@hpc100, sock=10
    03/04/2014 10:26:55;0080;pbs_mom;Job;62.hpc100;copy file request received
    03/04/2014 10:26:55;0100;pbs_mom;Job;62.hpc100;staged 2 items out over 0:00:00
    03/04/2014 10:26:55;0008;pbs_mom;Job;62.hpc100;no active tasks
    03/04/2014 10:26:55;0100;pbs_mom;Req;;Type 6 request received from root@hpc100, sock=10
    03/04/2014 10:26:55;0080;pbs_mom;Job;62.hpc100;delete job request received
    The program produced some output, but it is incomplete, and the job never lasts more than 2 seconds.
  30. 1 point
    Scott Suchyta

    job was terminated in 2 second

    Are you sure that PBS killed the job? The first thing I would suggest is looking at the STDERR and STDOUT files of the job. Typically these files are copied back to the directory and host the job was submitted from. Another place to look is the application's specific log file. If the job script was never executed and you did not receive the STDOUT and STDERR files, then you will need to examine $PBS_HOME/mom_logs on the system the job was dispatched to. If you don't know which execution node(s) were allocated to the job, you will need the JOBID of the job in question. On the PBS server, execute the command tracejob $PBS_JOBID, where $PBS_JOBID is the job in question. By default, tracejob parses the current log files; if you need to include past log files, you will need to use the -n option. See the tracejob man page for more details. Based on the output of tracejob you should be able to identify which execution nodes were selected. Here is an example.

    [root@mic01 ~]# tracejob -n 20 30

    Job: 30.mic01.prog.altair.com

    02/18/2014 16:46:41 L Considering job to run
    02/18/2014 16:46:41 S enqueuing into workq, state 1 hop 1
    02/18/2014 16:46:41 S Job Queued at request of scott@mic01.prog.altair.com, owner = scott@mic01.prog.altair.com, job name = STDIN, queue = workq
    02/18/2014 16:46:41 S Job Run at request of Scheduler@mic01.prog.altair.com on exec_vnode (mic01:ncpus=1)
    02/18/2014 16:46:41 S Job Modified at request of Scheduler@mic01.prog.altair.com
    02/18/2014 16:46:41 L Job run
    02/18/2014 16:46:41 M Started, pid = 25696
    02/18/2014 16:46:41 A queue=workq
    02/18/2014 16:46:41 A user=scott group=scott project=_pbs_project_default jobname=STDIN queue=workq ctime=1392760001 qtime=1392760001 etime=1392760001 start=1392760001 exec_host=mic01/4 exec_vnode=(mic01:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=pack Resource_List.select=1:ncpus=1 resource_assigned.ncpus=1
    02/18/2014 16:48:21 S Obit received momhop:1 serverhop:1 state:4 substate:42
    02/18/2014 16:48:21 S Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=2788kb resources_used.ncpus=1 resources_used.vmem=317040kb resources_used.walltime=00:01:40
    02/18/2014 16:48:21 M task 00000001 terminated
    02/18/2014 16:48:21 M Terminated
    02/18/2014 16:48:21 M task 00000001 cput= 0:00:00
    02/18/2014 16:48:21 M kill_job
    02/18/2014 16:48:21 M mic01 cput= 0:00:00 mem=2788kb
    02/18/2014 16:48:21 M no active tasks
    02/18/2014 16:48:21 M Obit sent
    02/18/2014 16:48:21 M copy file request received
    02/18/2014 16:48:21 M staged 2 items out over 0:00:00
    02/18/2014 16:48:21 M no active tasks
    02/18/2014 16:48:21 M delete job request received
    02/18/2014 16:48:21 S dequeuing from workq, state 5
    02/18/2014 16:48:21 A user=scott group=scott project=_pbs_project_default jobname=STDIN queue=workq ctime=1392760001 qtime=1392760001 etime=1392760001 start=1392760001 exec_host=mic01/4 exec_vnode=(mic01:ncpus=1) Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=pack Resource_List.select=1:ncpus=1 session=25696 end=1392760101 Exit_status=0 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=2788kb resources_used.ncpus=1 resources_used.vmem=317040kb resources_used.walltime=00:01:40 run_count=1

    Once you have identified the first exec_host, you will need to examine the mom_logs on that machine by searching for the PBS_JOBID on that system.
  31. 1 point
    PBS Professional 10.1 extended the user and group limit capabilities, introducing new syntax for defining named-user, named-group, and named-project limits. NOTE that you will not be able to use the old and new syntax at the same time. Referring to the PBS Professional 12.2 Admin Guide (Section 5.15.16 Ways To Limit Resource Usage at Server and Queues) and using the new syntax, the following will allow you to define a maximum number of processors per user in one queue:

    Qmgr: set queue workq max_run_res.ncpus = [u:PBS_GENERIC=4]

    In this example, I have defined that each user (PBS_GENERIC) can have a maximum of 4 ncpus running within workq.
  32. 1 point
    When I was trying to set "pbs_license_info" as the linux user "root" using the following command:
    qmgr -c "set server pbs_license_info=6200@admin"
    the error message was:
    qmgr obj= svr=default: Unauthorized Request
    qmgr: Error (15007) returned from server
    PBS Pro was not installed by me. Now I have just installed the Altair license server and have it running. I guess the problem is with the user or server name? Any suggestions about how to figure out this problem will be appreciated very much!
  33. 1 point
    Hello, I installed PBS Pro 12.1, created a trial license, and downloaded the altair_lic.dat file. How can I register the license with PBS? I tried installing Altair Licensing 11.0.2 and running this command as Admin:
    qmgr -c "set server pbs_license_info=????"
    but where can I find my port and host?
  34. 1 point
    Hi everyone, I have this PBS script to submit a job, and I want to make multiple simultaneous runs with different values of my environment variable $var1. My script looks like this:

    ############################
    #!/bin/csh
    #
    #PBS -l walltime=01:00:00
    #PBS -l nodes=2:ppn=12
    #
    #PBS -l mem=10gb
    #PBS -o out.dat
    #PBS -j oe
    #PBS -W umask=022
    #PBS -r n
    #PBS -N test_param
    #PBS -v var1='500,750'
    #PBS -t 1-2

    setenv OMP_NUM_THREADS 1
    cd $PBS_O_WORKDIR
    mpiexec ./go_laser
    ############################

    So I want, for example, this to create 2 directories named test_param_500/ and test_param_750/ with the values var1=500 and var1=750 respectively. Note that in my program "go_laser" the variable $var1 is a single value and not an array. Thanks for your help!
  35. 1 point
    Traditionally, PBS Professional uses the user's primary GID for job submission (e.g., ACLs, accounting). However, it is typical for users to belong to multiple groups. If you want a secondary group to be used in determining ACLs, then you need to submit the job with -W group_list=. Below is an example of how to use a user's secondary group with ACLs.

    scott@sles11-00:~> qmgr -c "p q PM"
    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue PM
    #
    create queue PM
    set queue PM queue_type = Execution
    set queue PM acl_group_enable = True
    set queue PM acl_groups = +PM
    set queue PM enabled = True
    set queue PM started = True

    Now, my gid and groups are:

    scott@sles11-00:~> id scott
    uid=1000(scott) gid=100(users) groups=1006(PM),16(dialout),33(video),100(users)

    Notice that gid=users and groups contains PM.

    Submit a simple qsub... and no go:

    scott@sles11-00:~> qsub -I -q PM
    qsub: Unauthorized Request

    Submit with the group_list defined, and success:

    scott@sles11-00:~> qsub -I -q PM -W group_list=PM
    qsub: waiting for job 129.sles11-00 to start
    qsub: job 129.sles11-00 ready

    scott@sles11-00:~>

    You will also notice that the accounting logs reflect the group requested at submission.

    scott@sles11-00:~> sudo tracejob 129 | grep group
    07/07/2011 15:04:06 A user=scott group=PM jobname=STDIN queue=PM ctime=1310065445 qtime=1310065445 etime=1310065445 start=1310065446 exec_host=sles11-00/0 exec_vnode=(sles11-00:ncpus=1) Resource_List.cput=00:05:00 Resource_List.higher=1 Resource_List.ncpus=1 Resource_List.nodect=1 Resource_List.place=pack Resource_List.select=1:ncpus=1 Resource_List.W_prio=1 resource_assigned.ncpus=1
  36. 1 point
    Hi Andrea, Based on your example, J2 would automatically be put into a held state and only released if J1 finished successfully (exit status equal to zero). Unfortunately, you stumbled across a race condition where J1 exited the queue before J2 was successfully submitted. One idea to work around this race condition would be to use a qsub hook to validate that the job being depended on is still in the system. If that job has exited the system, you could inform the user that the job no longer exists, or allow the new job into the system without a dependency check. Scott
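    For readers landing on this thread, a minimal sketch of the kind of dependency being discussed; the script names are hypothetical, and afterok releases J2 only if J1 exits with status 0:

    # submit J1, capture its job id, then make J2 depend on it
    J1=$(qsub job1.sh)
    qsub -W depend=afterok:$J1 job2.sh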