jerryleo

  1. Hi, thanks for the input. Suspension is not possible on our HPC system. To release resources for preemption, the scheduler kills the low-priority job and automatically requeues it instead of suspending it. When a preempted job is killed, the PBS server sends SIGTERM, so the job cannot tell whether it was simply killed or was vacated by the scheduler and requeued. This is a real problem for workflow control. For example: job A is running, job B has a higher priority and needs A's resources, so the scheduler vacates A and requeues it automatically. If PBS could send SIGUSR1 or another signal instead of SIGTERM, the workflow application would know that the job was vacated and requeued by the scheduler, and could set the job status back to submitted rather than failed and avoid taking improper action. Is there any way to have the PBS server send a different signal in this case? Or is there a workaround for this kind of situation? Any further comments would be appreciated. Regards, Jerry
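A minimal sketch of the workflow-side handling described in this post, assuming the job is launched through a small Python wrapper. It only shows how a wrapper could map the received signal to a workflow status. Whether PBS can be made to send SIGUSR1 on preemption is the open question here, and the status names `requeued` and `failed` are invented for illustration.

```python
import os
import signal
import time

# Hypothetical mapping from the signal a job receives to the status the
# workflow tool records. SIGUSR1-on-preemption is an assumption, not
# documented PBS behaviour.
STATUS_BY_SIGNAL = {
    signal.SIGUSR1: "requeued",
    signal.SIGTERM: "failed",
}

received = []


def handler(signum, frame):
    """Only remember the signal; the main code decides what to do with it."""
    received.append(signum)


def install_handlers():
    """Register the handler for every signal the wrapper knows about."""
    for sig in STATUS_BY_SIGNAL:
        signal.signal(sig, handler)


def status_for(signum):
    """Translate a received signal number into a workflow status."""
    return STATUS_BY_SIGNAL.get(signum, "unknown")


if __name__ == "__main__":
    install_handlers()
    # Simulate the scheduler vacating the job by signalling ourselves.
    os.kill(os.getpid(), signal.SIGUSR1)
    time.sleep(0.1)  # give the interpreter a chance to run the handler
    if received:
        print(status_for(received[-1]))
```

With SIGTERM alone the wrapper can only ever report `failed`, which is the ambiguity the post describes.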
  2. group name issue in the job log

    Hi, thanks for your kind input. I found the root cause of this issue. The /etc/passwd files are not consistent between the login node and the PBS server node: the user account was missing from /etc/passwd on the PBS server node. What I do not understand is why the user could still submit and run jobs without any problem even though the account information is missing from /etc/passwd on the PBS server node. Is something wrong with the system settings or configuration? Any further comments would be appreciated. Regards, Jerry
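A small sketch of how the inconsistency described above could be detected, assuming copies of both nodes' /etc/passwd contents are available as text. The function names and the sample entries (including the UID 1001) are made up for illustration.

```python
def passwd_users(text):
    """Return the set of account names found in passwd-format text."""
    users = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The account name is the first colon-separated field.
        users.add(line.split(":", 1)[0])
    return users


def missing_on_server(login_passwd, server_passwd):
    """Accounts present on the login node but absent on the server node."""
    return sorted(passwd_users(login_passwd) - passwd_users(server_passwd))


if __name__ == "__main__":
    # Made-up sample contents; in practice read each node's /etc/passwd.
    login = (
        "root:x:0:0:root:/root:/bin/bash\n"
        "songgt:x:1001:1001::/home/songgt:/bin/bash\n"
    )
    server = "root:x:0:0:root:/root:/bin/bash\n"
    print(missing_on_server(login, server))
```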
  3. Hi, is it possible to have PBS send a different signal, such as SIGUSR1, to a job that is being preempted? Thanks, Jerry
  4. Hi, we are using PBSPro_12.1.1.131502. For one particular user, '-default-' is recorded as the group name instead of the user's own group, like this:

    20170110:01/10/2017 21:52:26;E;553720.sdb;user=songgt group=-default- project=_pbs_project_default accounting_id="0x600026632" jobname=roms queue=workq ctime=1484083146 qtime=1484083146 etime=1484083146 start=1484083146 exec_host=login1/0+login1/1+login1/2+login1/3+login1/4+login1/5+login1/6+login1/7+login1/8+login1/9+login1/10+login1/11+login1/12+login1/13+login1/14+login1/15 exec_vnode=(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1) Resource_List.arch=XT Resource_List.mppwidth=16 Resource_List.ncpus=16 Resource_List.nodect=16 Resource_List.place=free Resource_List.select=16:vntype=cray_compute Resource_List.walltime=02:00:00 session=5842 alt_id=439506 end=1484085146 Exit_status=0 resources_used.cpupercent=7 resources_used.cput=00:00:19 resources_used.mem=7036kb resources_used.ncpus=16 resources_used.vmem=135980kb resources_used.walltime=00:33:17 run_count=1

    I have confirmed that the group exists and that the user is in the group. Other users do not show this issue; it affects only this one user. I have no idea what is wrong. Any ideas? Thanks for your time. Regards
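To spot such records in bulk, the accounting line can be split into its fields. The sketch below assumes the layout visible in the record above (`timestamp;type;job id;key=value ...`) and uses a shortened copy of that record without the `20170110:` file-name prefix. The function name is hypothetical.

```python
import shlex


def parse_accounting_record(record):
    """Split one accounting line into its header parts and key=value fields."""
    stamp, rec_type, job_id, message = record.split(";", 3)
    fields = {}
    # shlex.split keeps quoted values such as accounting_id="..." together
    # and strips the surrounding quotes.
    for item in shlex.split(message):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return {"time": stamp, "type": rec_type, "job": job_id, "fields": fields}


if __name__ == "__main__":
    sample = (
        "01/10/2017 21:52:26;E;553720.sdb;user=songgt group=-default- "
        'project=_pbs_project_default accounting_id="0x600026632" '
        "jobname=roms queue=workq"
    )
    rec = parse_accounting_record(sample)
    if rec["fields"].get("group") == "-default-":
        print(rec["job"], rec["fields"]["user"], "has no resolved group")
```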
  5. Hi, max_run_res.ncpus only applies limits per user, group, or project. How can I limit the maximum number of CPUs per job? For example, each job should be able to request at most 168 CPUs. Thanks for your time. Regards, Jerry
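One attribute that may fit is the queue-level `resources_max.ncpus`, which appears in the qmgr output elsewhere in this thread. Treating it as the per-job cap is an assumption to verify against the PBS documentation for your version. The sketch below only builds the qmgr directive; the queue name `workq` and the helper names are placeholders, and nothing is executed unless `dry_run` is turned off.

```python
import subprocess


def per_job_cpu_limit_directive(queue, max_ncpus):
    """Build a qmgr directive that caps ncpus for jobs in one queue."""
    if max_ncpus < 1:
        raise ValueError("max_ncpus must be a positive integer")
    return f"set queue {queue} resources_max.ncpus = {max_ncpus}"


def apply_directive(directive, dry_run=True):
    """Return the qmgr command line; run it only when dry_run is False."""
    cmd = ["qmgr", "-c", directive]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Placeholder queue name; substitute the queue that should be limited.
    print(apply_directive(per_job_cpu_limit_directive("workq", 168)))
```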
  6. Steve, thanks for the further comments, and sorry for not stating PBS_HOME clearly. For the individual MoMs, PBS_HOME is /var/spool/PBS; only the PBS server node's PBS_HOME is on the GPFS share.

    On log05, every q* command now leaves behind a new pbs_iff process in state D that cannot be killed. This is only a guess, but the pbs_iff processes may be sleeping while waiting on FD_CLOEXEC for all the file handles they opened, or something similar. I also noticed two sleeping flush processes:

        [root@log05 public]# ps aux | grep -i flush
        root 487 0.0 0.0 0 0 ? S May01 0:40 [flush-8:0]
        root 6401 0.0 0.0 103300 876 pts/0 S+ 19:43 0:00 grep -i flush
        root 22033 0.0 0.0 0 0 ? S 19:39 0:00 [flush-0:22]

    I am not sure why this happened, and I am worried about side effects or a worse impact on the jobs running on the cluster. For the past couple of days everything has seemed to work well except the q commands on log05. Some of the running jobs have no restart capability, so I want a workaround for rebooting the PBS server with zero impact on running jobs. Please tell me whether I have understood a safe reboot of the PBS server correctly:

    1. It should be safe to reboot the PBS server whether or not failover is configured.
    2. PBS_PRIMARY and PBS_SECONDARY also need to be set in pbs.conf for the MoMs.
    3. After rebooting log05 and starting the secondary server on log06, the individual MoMs need to be restarted for the new configuration to take effect.

    Thanks, Jerry
  7. Scott, thanks for the comments. PBS_HOME is on a GPFS share and is visible from every node. PBS_EXEC is on an NFS share. I already tried 'kill -HUP' on the PBS server, but it still does not work and the q commands still hang in pbs_iff. I now want to reboot the host, which will restart not only the PBS server but also the scheduler. The site has not been configured for PBS failover, so I am afraid that rebooting the host will affect the jobs running on the cluster. I would like to build a secondary PBS server/scheduler to do a 'hot' takeover or something similar. However, I have little experience with this kind of situation and limited resources for building a test environment before implementing plan B. So I am looking for ideas on doing a hot takeover at a site that has not been configured for PBS failover. Any further comments would be appreciated. Jerry
  8. Hi, just a couple of updates after more investigation.

    1. There were two pbs_iff processes in state D:

        root 17815 0.0 0.0 106144 1196 ? D 16:23 0:00 sh -c /opt/pbs/default/sbin/pbs_iff -i 192.168.0.113 log05 15001 3 36018
        root 23275 0.0 0.0 106144 1196 pts/4 D 16:26 0:00 sh -c PBS_IFF_CLIENT_ADDR=192.168.0.113 /opt/pbs/default/sbin/pbs_iff log05 15001 3 51408

    2. The PPID of both processes is 1, so neither pbs_iff process can be killed.

    I suspect these two "uninterruptible" pbs_iff processes are the reason the PBS server no longer responds to local q commands. If the only way out is to reboot log05, we want no jobs to be impacted. Plan B is to set up a secondary PBS server on log06 and do a hot takeover from log05, if that is possible. Here are my steps:

    1. pbs.conf (log06 and log05):

        PBS_EXEC=/opt/pbs/default
        PBS_HOME=/nuist/p/public/PBS_HOME
        PBS_START_SERVER=1
        PBS_START_MOM=0
        PBS_START_SCHED=1
        PBS_CORE_LIMIT=unlimited
        PBS_SCP=/usr/bin/scp
        PBS_SERVER=log05
        PBS_PRIMARY=log05
        PBS_SECONDARY=log06

    2. In the start_pbs() function of the init.d/pbs script on log06, change line 354 and line 363 from

        if ${PBS_EXEC}/sbin/pbs_server

       to

        if ${PBS_EXEC}/sbin/pbs_server -F 60

    3. Start the secondary server daemon on log06 by issuing

        /etc/init.d/pbs start

    4. HUP the PBS server on log05 like this:

        kill -HUP `ps aux | grep -i pbs_server.bin | grep -v grep | awk '{print $2}'`

    5. log05 can be rebooted safely once log06 has taken over all jobs from log05 successfully.

    Is anything still missing? Thanks, Jerry
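Before trying the takeover, the pbs.conf files on both hosts can be checked for the two failover keys named in step 1. The sketch below is only a consistency check on the file contents, written under the assumption that both keys must be present and must name different hosts. It says nothing about whether the takeover itself will succeed, and the function names are hypothetical.

```python
def parse_pbs_conf(text):
    """Parse KEY=VALUE lines from pbs.conf-style text into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


def failover_problems(conf):
    """List obvious gaps in the failover-related settings."""
    problems = []
    for key in ("PBS_PRIMARY", "PBS_SECONDARY"):
        if not conf.get(key):
            problems.append(f"{key} is not set")
    primary = conf.get("PBS_PRIMARY")
    if primary and primary == conf.get("PBS_SECONDARY"):
        problems.append("PBS_PRIMARY and PBS_SECONDARY name the same host")
    return problems


if __name__ == "__main__":
    # Settings taken from step 1 of the post above.
    planned = "PBS_SERVER=log05\nPBS_PRIMARY=log05\nPBS_SECONDARY=log06\n"
    print(failover_problems(parse_pbs_conf(planned)))
```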
  9. Thanks for the reply. Here is the further information.

    1. pbs_hostn output

        [root@log05 ~]# pbs_hostn -v log05
        primary name: log05 (from gethostbyname())
        aliases: log05.nuist.edu.cn
        address length: 4 bytes
        address: 192.168.0.113 (1895868608 dec) name: log05
        [root@log05 pbs_jobs_admin]# hostname
        log05
        [root@log05 ~]# pbs_hostn -v `hostname`
        primary name: log05 (from gethostbyname())
        aliases: log05.nuist.edu.cn
        address length: 4 bytes
        address: 192.168.0.113 (1895868608 dec) name: log05
        [root@log05 ~]# pbs_hostn -v log05.nuist.edu.cn
        primary name: log05 (from gethostbyname())
        aliases: log05.nuist.edu.cn
        address length: 4 bytes
        address: 192.168.0.113 (1895868608 dec) name: log05
        [root@log05 ~]#

    2. netstat output

        [root@log05 ~]# netstat -anp | grep pbs
        tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN 5979/pbs_server.bin
        tcp 0 0 0.0.0.0:15004 0.0.0.0:* LISTEN 5740/pbs_sched
        tcp 0 0 192.168.0.113:594 192.168.19.5:15002 ESTABLISHED 5979/pbs_server.bin
        tcp 0 0 192.168.0.113:41437 192.168.0.115:6200 ESTABLISHED 5979/pbs_server.bin
        tcp 0 0 192.168.0.113:595 192.168.19.5:15002 ESTABLISHED 5979/pbs_server.bin
        tcp 0 0 192.168.0.113:55706 192.168.0.113:15007 ESTABLISHED 5979/pbs_server.bin
        udp 0 0 0.0.0.0:15001 0.0.0.0:* 5979/pbs_server.bin
        udp 0 0 0.0.0.0:1023 0.0.0.0:* 5979/pbs_server.bin

    3. ifconfig output

        [root@log05 ~]# ifconfig
        eth0 Link encap:Ethernet HWaddr 5C:F3:FC:B9:57:90
             inet addr:192.168.0.113 Bcast:192.168.63.255 Mask:255.255.192.0
             inet6 addr: fe80::5ef3:fcff:feb9:5790/64 Scope:Link
             UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
             RX packets:606749091 errors:0 dropped:0 overruns:0 frame:0
             TX packets:606212598 errors:0 dropped:0 overruns:0 carrier:0
             collisions:0 txqueuelen:1000
             RX bytes:66974910799 (62.3 GiB) TX bytes:123103425255 (114.6 GiB)
             Interrupt:28 Memory:96000000-96012800
        eth1 Link encap:Ethernet HWaddr 5C:F3:FC:B9:57:92
             inet addr:202.195.237.178 Bcast:202.195.237.191 Mask:255.255.255.192
             inet6 addr: fe80::5ef3:fcff:feb9:5792/64 Scope:Link
             UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
             RX packets:1983508 errors:0 dropped:0 overruns:0 frame:0
             TX packets:265894 errors:0 dropped:0 overruns:0 carrier:0
             collisions:0 txqueuelen:1000
             RX bytes:211738110 (201.9 MiB) TX bytes:230187244 (219.5 MiB)
             Interrupt:40 Memory:98000000-98012800
        Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
        Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
        Ifconfig is obsolete! For replacement check ip.
        ib0  Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
             inet addr:172.20.0.113 Bcast:172.20.255.255 Mask:255.255.0.0
             inet6 addr: fe80::202:c903:53:2ce9/64 Scope:Link
             UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
             RX packets:8375985 errors:0 dropped:0 overruns:0 frame:0
             TX packets:8525745 errors:0 dropped:22 overruns:0 carrier:0
             collisions:0 txqueuelen:1024
             RX bytes:36199675648 (33.7 GiB) TX bytes:39837433798 (37.1 GiB)
        lo   Link encap:Local Loopback
             inet addr:127.0.0.1 Mask:255.0.0.0
             inet6 addr: ::1/128 Scope:Host
             UP LOOPBACK RUNNING MTU:16436 Metric:1
             RX packets:19991594 errors:0 dropped:0 overruns:0 frame:0
             TX packets:19991594 errors:0 dropped:0 overruns:0 carrier:0
             collisions:0 txqueuelen:0
             RX bytes:4534262707 (4.2 GiB) TX bytes:4534262707 (4.2 GiB)

    4. Comparison with the working nodes. On log05, qstat appears to wait for the answer string "\0\0\0\0" with no timeout.

        working nodes:
        fcntl(4, F_SETFD, 0) = 0
        read(4, "\0\0\0\0", 4) = 4
        close(4) = 0

        log05:
        fcntl(4, F_SETFD, 0) = 0
        read(4,

    I already checked the /etc/hosts file. It has never been touched since it was created, and replacing it with the hosts file from a working node did not help either.
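The difference in item 4 is a `read` on a pipe that never returns because nothing is written to it. The sketch below illustrates that behaviour in general terms and shows how a reader can bound the wait with `select`. It is not how qstat or pbs_iff is implemented, and it cannot rescue a process that is already stuck in state D; the helper name is hypothetical.

```python
import os
import select


def read_with_timeout(fd, nbytes, timeout):
    """Read up to nbytes from fd, or return None if nothing arrives in time."""
    ready, _, _ = select.select([fd], [], [], timeout)
    if not ready:
        return None  # the writer never answered; a plain read() would block
    return os.read(fd, nbytes)


if __name__ == "__main__":
    r, w = os.pipe()
    # Nothing has been written yet, so the call gives up after the timeout.
    print(read_with_timeout(r, 4, 0.2))
    # Once the four-byte answer is written, the same call returns at once.
    os.write(w, b"\0\0\0\0")
    print(read_with_timeout(r, 4, 0.2))
    os.close(r)
    os.close(w)
```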
  10. Scott, thanks for the response.

    1. >> Do you have your site configured for PBS Failover?
       Sorry, I don't know how to check that, so I am not sure.

    2. pbs.conf

        [root@log05 ~]# cat /etc/pbs.conf
        PBS_EXEC=/opt/pbs/default
        PBS_HOME=/nuist/p/public/PBS_HOME
        PBS_START_SERVER=1
        PBS_START_MOM=0
        PBS_START_SCHED=1
        PBS_SERVER=log05
        PBS_CORE_LIMIT=unlimited
        PBS_SCP=/usr/bin/scp

    3. Double checked: SELinux is disabled and no firewall is running.

        [root@log05 ~]# getenforce
        Disabled
        [root@log05 ~]# chkconfig --list
        ip6tables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
        iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
        [root@log05 ~]# service --status-all
        ip6tables: Firewall is not running.
        iptables: Firewall is not running.

    4. No issues were detected by pbs_probe.

        [root@log05 ~]# $PBS_EXEC/sbin/pbs_probe
        ====== System Information =======
        sysname=Linux
        nodename=log05
        release=2.6.32-220.el6.x86_64
        version=#1 SMP Wed Nov 9 08:03:13 EST 2011
        machine=x86_64
        === No PBS Infrastructure Problems Detected ===

    4. 
running trace on the qstat, it halts until ctrl+c pressed [root@log05 ~]# strace qstat execve("/opt/pbs/default/bin/qstat", ["qstat"], [/* 35 vars */]) = 0 brk(0) = 0x24d4000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc236000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=73358, ...}) = 0 mmap(NULL, 73358, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b63fc237000 close(3) = 0 open("/lib64/libdl.so.2", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\r \2560\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=22536, ...}) = 0 mmap(0x30ae200000, 2109696, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x30ae200000 mprotect(0x30ae202000, 2097152, PROT_NONE) = 0 mmap(0x30ae402000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x30ae402000 close(3) = 0 open("/lib64/libcrypt.so.1", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\f \3307\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=43392, ...}) = 0 mmap(0x37d8200000, 2318816, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x37d8200000 mprotect(0x37d8207000, 2097152, PROT_NONE) = 0 mmap(0x37d8407000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x37d8407000 mmap(0x37d8409000, 184800, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x37d8409000 close(3) = 0 open("/lib64/libpthread.so.0", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \\\240\2560\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=141576, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc249000 mmap(0x30aea00000, 2208672, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x30aea00000 mprotect(0x30aea17000, 2093056, PROT_NONE) = 0 mmap(0x30aec16000, 8192, 
PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x30aec16000 mmap(0x30aec18000, 13216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x30aec18000 close(3) = 0 open("/lib64/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\355a\2560\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1979000, ...}) = 0 mmap(0x30ae600000, 3803304, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x30ae600000 mprotect(0x30ae797000, 2097152, PROT_NONE) = 0 mmap(0x30ae997000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x197000) = 0x30ae997000 mmap(0x30ae99c000, 18600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x30ae99c000 close(3) = 0 open("/usr/lib64/libfreebl3.so", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\2602`\3307\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=386040, ...}) = 0 mmap(0x37d8600000, 2496224, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x37d8600000 mprotect(0x37d865d000, 2093056, PROT_NONE) = 0 mmap(0x37d885c000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5c000) = 0x37d885c000 mmap(0x37d885e000, 14048, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x37d885e000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc24a000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc24b000 arch_prctl(ARCH_SET_FS, 0x2b63fc24b200) = 0 mprotect(0x37d885c000, 4096, PROT_READ) = 0 mprotect(0x30ae997000, 16384, PROT_READ) = 0 mprotect(0x30aec16000, 4096, PROT_READ) = 0 mprotect(0x37d8407000, 4096, PROT_READ) = 0 mprotect(0x30ae402000, 4096, PROT_READ) = 0 mprotect(0x30ae01f000, 4096, PROT_READ) = 0 munmap(0x2b63fc237000, 73358) = 0 set_tid_address(0x2b63fc24b4d0) = 2885 set_robust_list(0x2b63fc24b4e0, 0x18) = 0 futex(0x7fff1a76319c, FUTEX_WAKE_PRIVATE, 1) = 0 
futex(0x7fff1a76319c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, NULL, 2b63fc24b200) = -1 EAGAIN (Resource temporarily unavailable) rt_sigaction(SIGRTMIN, {0x30aea05aa0, [], SA_RESTORER|SA_SIGINFO, 0x30aea0f4a0}, NULL, 8) = 0 rt_sigaction(SIGRT_1, {0x30aea05b30, [], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x30aea0f4a0}, NULL, 8) = 0 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0 getrlimit(RLIMIT_STACK, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0 futex(0x625520, FUTEX_WAKE_PRIVATE, 2147483647) = 0 brk(0) = 0x24d4000 brk(0x24f5000) = 0x24f5000 getuid() = 0 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(3) = 0 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(3) = 0 open("/etc/nsswitch.conf", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=1688, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc237000 read(3, "#\n# /etc/nsswitch.conf\n#\n# An ex"..., 4096) = 1688 read(3, "", 4096) = 0 close(3) = 0 munmap(0x2b63fc237000, 4096) = 0 open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=73358, ...}) = 0 mmap(NULL, 73358, PROT_READ, MAP_PRIVATE, 3, 0) = 0x2b63fc237000 close(3) = 0 open("/lib64/libnss_files.so.2", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360!\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=65928, ...}) = 0 mmap(NULL, 2151824, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x2b63fc24c000 mprotect(0x2b63fc258000, 2097152, PROT_NONE) = 0 mmap(0x2b63fc458000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xc000) = 0x2b63fc458000 close(3) = 0 mprotect(0x2b63fc458000, 4096, PROT_READ) = 0 munmap(0x2b63fc237000, 73358) = 0 open("/etc/passwd", 
O_RDONLY|O_CLOEXEC) = 3 fcntl(3, F_GETFD) = 0x1 (flags FD_CLOEXEC) fstat(3, {st_mode=S_IFREG|0644, st_size=42018, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc237000 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 4096 close(3) = 0 munmap(0x2b63fc237000, 4096) = 0 futex(0x625528, FUTEX_WAKE_PRIVATE, 2147483647) = 0 open("/etc/services", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=640999, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc237000 lseek(3, 0, SEEK_CUR) = 0 lseek(3, 0, SEEK_SET) = 0 read(3, "# /etc/services:\n# $Id: services"..., 4096) = 4096 read(3, "ervice\nfinger 79/tcp\nfi"..., 4096) = 4096 read(3, " 209/udp "..., 4096) = 4096 read(3, "a-cluster 694/tcp "..., 4096) = 4096 read(3, " 1494/tcp "..., 4096) = 4096 read(3, "603/udp #"..., 4096) = 4096 read(3, " # BPRD (VERITAS NetBack"..., 4096) = 4096 read(3, "432/tcp postgresql #"..., 4096) = 4096 read(3, "GS Protocol\nmpm 45/t"..., 4096) = 4096 read(3, " 103/udp # Genes"..., 4096) = 4096 read(3, "ransfer Program\nsgmp "..., 4096) = 4096 read(3, " Location Service\ndls-mon "..., 4096) = 4096 read(3, "# Tobit David Service Layer\ntd-s"..., 4096) = 4096 read(3, "# DATEX-ASN\ndatex-asn "..., 4096) = 4096 read(3, "Data Transfer\nnetcp 39"..., 4096) = 4096 read(3, " 431/tcp # UT"..., 4096) = 4096 read(3, "# hybrid-pop\nhybrid-pop 473"..., 4096) = 4096 read(3, " # IBM-DB2\nibm-db2 "..., 4096) = 4096 read(3, "mmi 575/tcp "..., 4096) = 4096 read(3, " # SCO System Administra"..., 4096) = 4096 read(3, " 655/udp "..., 4096) = 4096 read(3, " # almanid Connec"..., 4096) = 4096 read(3, " 763/udp #"..., 4096) = 4096 read(3, "ftps-data 989/tcp "..., 4096) = 4096 read(3, " # Sun's NEO Object Reque"..., 4096) = 4096 read(3, " 1085/tcp # We"..., 4096) = 4096 read(3, "rotocol\nbnetfile 1120/udp"..., 4096) = 4096 read(3, " # ANSI C12.22 Port\nc"..., 4096) = 4096 read(3, "luster Manager\nalias 1"..., 4096) = 
4096 read(3, " 1223/tcp # T"..., 4096) = 4096 read(3, "B\nqnts-orb 1262/udp "..., 4096) = 4096 read(3, " 1301/udp # CI3-"..., 4096) = 4096 read(3, "erver\nnaap 1340/tcp "..., 4096) = 4096 read(3, "re Systems\nmolly 1374/"..., 4096) = 4096 read(3, "on Starter\nibm-res 1405/"..., 4096) = 4096 read(3, "n X25/SNA Gateway\neicon-slp "..., 4096) = 4096 read(3, "openmath 1473/tcp "..., 4096) = 4096 read(3, "on Ltd. Lic. Man.\nmvx-lm "..., 4096) = 4096 read(3, " Shiva Hose\nshivasound 1549"..., 4096) = 4096 read(3, " 1585/udp "..., 4096) = 4096 read(3, " # softdataphone\nontime"..., 4096) = 4096 read(3, "udp # netview-aix"..., 4096) = 4096 read(3, " # RSVP-ENCAPSULATION-2\nrsvp-e"..., 4096) = 4096 read(3, "# encore\ncisco-net-mgmt 1741/tc"..., 4096) = 4096 read(3, " # answersoft-lm\nanswerso"..., 4096) = 4096 read(3, "r\nunisys-lm 1823/udp "..., 4096) = 4096 read(3, " # LeCroy VICP\nlecroy-vicp "..., 4096) = 4096 read(3, " # MetaAgent\ncymtec-por"..., 4096) = 4096 read(3, "/tcp # CTT Broker"..., 4096) = 4096 read(3, "ote 1967/udp "..., 4096) = 4096 read(3, "knet 2005/tcp csync "..., 4096) = 4096 read(3, " # EPNSDP\nclearvi"..., 4096) = 4096 read(3, " # SunCluster Geographic\ns"..., 4096) = 4096 read(3, " # ELATELINK\nelatelink"..., 4096) = 4096 read(3, " Remote Debug Port\ngdbremote "..., 4096) = 4096 read(3, " 2193/tcp # "..., 4096) = 4096 read(3, " # eHome Message Se"..., 4096) = 4096 read(3, "# CoMotion Master Server\ncomotio"..., 4096) = 4096 read(3, "ager (FLEX)\nadvant-lm 2295"..., 4096) = 4096 read(3, " # RCC Host\nrcc-host 2"..., 4096) = 4096 read(3, "L3-HBMon\nl3-hbmon 2370/ud"..., 4096) = 4096 read(3, "414/udp # Beeyond"..., 4096) = 4096 read(3, " # Rapido_IP\nrapido-"..., 4096) = 4096 read(3, " 2492/udp # GROO"..., 4096) = 4096 read(3, "erce\nvrcommerce 2530/udp "..., 4096) = 4096 read(3, "568/udp # SPAM TR"..., 4096) = 4096 read(3, "/udp # SMNTUBoots"..., 4096) = 4096 read(3, "site 2651/udp "..., 4096) = 4096 read(3, "ata 2690/udp "..., 4096) = 4096 read(3, "ay 
Control Protocol Call Agent\ns"..., 4096) = 4096 read(3, " 2767/udp "..., 4096) = 4096 read(3, "anager products\ndvr-esm "..., 4096) = 4096 read(3, " 2845/tcp #"..., 4096) = 4096 read(3, "/tcp # TopFlow\nto"..., 4096) = 4096 read(3, "/udp # CESD Conte"..., 4096) = 4096 read(3, " # BOLDSOFT-LM\nboldsoft-lm "..., 4096) = 4096 read(3, " 2998/tcp # Rea"..., 4096) = 4096 read(3, " 3033/tcp "..., 4096) = 4096 read(3, "onitor Port\ncsd-monitor 3072"..., 4096) = 4096 read(3, " # Business protocol\ngeoloc"..., 4096) = 4096 read(3, " # CSI-LFAP\ncsi-lf"..., 4096) = 4096 read(3, "/udp # H2GF W.2m "..., 4096) = 4096 read(3, " # Survey Instrument\nsurve"..., 4096) = 4096 read(3, "p # VIEO Fabric E"..., 4096) = 4096 read(3, " # VS Server\nvs-server 328"..., 4096) = 4096 read(3, " Manager\nofficelink2000 3320/tc"..., 4096) = 4096 read(3, "ilm 3362/tcp "..., 4096) = 4096 read(3, " ch 2\nnokia-ann-ch2 3406/udp "..., 4096) = 4096 read(3, "Admin\nhri-port 3439/tcp "..., 4096) = 4096 read(3, "isar-port 3475/udp "..., 4096) = 4096 read(3, "eb\nvt-ssl 3509/tcp "..., 4096) = 4096 read(3, "lookup 3543/tcp "..., 4096) = 4096 read(3, " 3578/tcp # Data "..., 4096) = 4096 read(3, "Port\nhp-dataprotect 3612/tcp "..., 4096) = 4096 read(3, "Cyc\nxss-srv-port 3646/tcp "..., 4096) = 4096 read(3, " 3678/tcp # Da"..., 4096) = 4096 read(3, " # Anoto Rendezvous Por"..., 4096) = 4096 read(3, " # CimTrak\ncimtrak "..., 4096) = 4096 read(3, "Impact Mgr./PEM Gateway\nbfd-cont"..., 4096) = 4096 read(3, "mite Tech Tapeware\ntapeware "..., 4096) = 4096 read(3, "spw-dnspreload 3849/tcp "..., 4096) = 4096 read(3, "# DTS Service Port\nmsdts1 "..., 4096) = 4096 read(3, "5/udp # Auto-Grap"..., 4096) = 4096 read(3, " 3948/tcp # Anto"..., 4096) = 4096 read(3, " 3980/udp "..., 4096) = 4096 read(3, " # Talarian Mcast\ntalarian-mcas"..., 4096) = 4096 read(3, "eract 4052/tcp "..., 4096) = 4096 read(3, " # Lorica outside facing (SS"..., 4096) = 4096 read(3, " # Netadmin Systems"..., 4096) = 4096 read(3, " 4151/udp # Men 
&"..., 4096) = 4096 read(3, "udp # Vatata Peer"..., 4096) = 4096 read(3, " # File System Port Map"..., 4096) = 4096 read(3, " 4403/tcp # ASIG"..., 4096) = 4096 read(3, " # isigate\nworldscores "..., 4096) = 4096 read(3, "te It! Message Service\nnoteit "..., 4096) = 4096 read(3, "hronization\nnetxms-sync 4702"..., 4096) = 4096 read(3, "42/udp # nCode IC"..., 4096) = 4096 read(3, "nitor\nccss-qsm 4970/tcp "..., 4096) = 4096 read(3, "ix I/O daemon (stat)\nqvr "..., 4096) = 4096 read(3, "cp # Qpur File Pr"..., 4096) = 4096 read(3, "ica-Online\naol-1 5191/"..., 4096) = 4096 read(3, "NLG Data Service\nhacl-hb "..., 4096) = 4096 read(3, "tion 5363/tcp # "..., 4096) = 4096 read(3, " # SGI Array Servic"..., 4096) = 4096 read(3, " # inin secure messagi"..., 4096) = 4096 read(3, " proshare conf video\nprosharevid"..., 4096) = 4096 read(3, " # NetAgent\ndali-port 57"..., 4096) = 4096 read(3, "cp # SSDTP\nssdtp "..., 4096) = 4096 read(3, "e Manager\nmontage-lm 6147/t"..., 4096) = 4096 read(3, "udp # Metatude Di"..., 4096) = 4096 read(3, "Domain\nsun-sr-iiops 6486/udp "..., 4096) = 4096 read(3, "eros V5 FTP Data\nkftp-data "..., 4096) = 4096 read(3, "-https 6789/udp "..., 4096) = 4096 read(3, " 7030/udp # "..., 4096) = 4096 read(3, "dezvous 7392/udp "..., 4096) = 4096 read(3, "gnaling Transport Layer\nnls-tl "..., 4096) = 4096 read(3, "RAQMON PDU\nprgp 7747/"..., 4096) = 4096 read(3, "teradataordbms 8002/udp "..., 4096) = 4096 read(3, "doms-migr 8101/tcp "..., 4096) = 4096 read(3, "ium) Network Protocol\ntnp "..., 4096) = 4096 read(3, " # Sun App Server - JMX/RMI\nsu"..., 4096) = 4096 read(3, " 8999/tcp # "..., 4096) = 4096 read(3, "evice Discovery\napani1 "..., 4096) = 4096 read(3, "# VERITAS Information Serve\nvisd"..., 4096) = 4096 read(3, "Agent\ncba8 9593/udp "..., 4096) = 4096 read(3, "erver channel\nenrp-sctp 99"..., 4096) = 4096 read(3, "r Desktop Patrol\nswdtp "..., 4096) = 4096 read(3, " # sun cacao rmi registry a"..., 4096) = 4096 read(3, "udp # LinoGrid Eng"..., 
4096) = 4096 read(3, " # hde-lcesrvr-2\nhde-lc"..., 4096) = 4096 read(3, "tcp # ZigBee IP Tr"..., 4096) = 4096 read(3, " 20005/udp # O"..., 4096) = 4096 read(3, "22763/udp # Talika"..., 4096) = 4096 read(3, " 25003/tcp # icl-"..., 4096) = 4096 read(3, "45\nxqosd 31416/tcp "..., 4096) = 4096 read(3, " # PROFInet Context Manager\ne"..., 4096) = 4096 read(3, "ice\nqdb2service 45825/udp "..., 4096) = 2023 read(3, "", 4096) = 0 close(3) = 0 munmap(0x2b63fc237000, 4096) = 0 open("/etc/pbs.conf", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=176, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc237000 read(3, "PBS_EXEC=/opt/pbs/default\nPBS_HO"..., 4096) = 176 read(3, "", 4096) = 0 close(3) = 0 munmap(0x2b63fc237000, 4096) = 0 open("/etc/pbs.conf", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=176, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc237000 read(3, "PBS_EXEC=/opt/pbs/default\nPBS_HO"..., 4096) = 176 read(3, "", 4096) = 0 close(3) = 0 munmap(0x2b63fc237000, 4096) = 0 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3 socket(PF_NETLINK, SOCK_RAW, 0) = 4 bind(4, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0 getsockname(4, {sa_family=AF_NETLINK, pid=2885, groups=00000000}, [12]) = 0 sendto(4, "\24\0\0\0\26\0\1\3\363\246IU\0\0\0\0\0\0\0\0", 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20 recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\363\246IUE\v\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 284 recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"@\0\0\0\24\0\2\0\363\246IUE\v\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 320 recvmsg(4, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, 
msg_iov(1)=[{"\24\0\0\0\3\0\2\0\363\246IUE\v\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20 close(4) = 0 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4 connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(4) = 0 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4 connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(4) = 0 open("/etc/host.conf", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc237000 read(4, "multi on\n", 4096) = 9 read(4, "", 4096) = 0 close(4) = 0 munmap(0x2b63fc237000, 4096) = 0 futex(0x30ae99f384, FUTEX_WAKE_PRIVATE, 2147483647) = 0 open("/etc/resolv.conf", O_RDONLY) = 4 fstat(4, {st_mode=S_IFREG|0644, st_size=340, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc237000 read(4, "# Generated by NetworkManager\n\n\n"..., 4096) = 340 read(4, "", 4096) = 0 close(4) = 0 munmap(0x2b63fc237000, 4096) = 0 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4 fstat(4, {st_mode=S_IFREG|0644, st_size=30554, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b63fc237000 read(4, "127.0.0.1 localhost localhost."..., 4096) = 4096 read(4, "0.4.3 c04n03-ib0 c04n03-ib0.nuis"..., 4096) = 4096 read(4, "n08.nuist.edu.cn \n172.20.7.8 c07"..., 4096) = 4096 read(4, "2-ib0 c10n12-ib0.nuist.edu.cn \n1"..., 4096) = 4096 read(4, ".nuist.edu.cn \n172.20.14.2 c14n0"..., 4096) = 4096 read(4, " \n192.168.17.6 c17n06 c17n06.nui"..., 4096) = 4096 read(4, "-ib0 c20n09-ib0.nuist.edu.cn \n19"..., 4096) = 4096 read(4, "edu.cn\n192.168.0.107 io07 io07.n"..., 4096) = 1882 read(4, "", 4096) = 0 close(4) = 0 munmap(0x2b63fc237000, 4096) = 0 connect(3, {sa_family=AF_INET, sin_port=htons(15001), sin_addr=inet_addr("192.168.0.113")}, 16) 
= 0
        write(3, "+2+1+0+4root+0", 14) = 14
        poll([{fd=3, events=POLLIN}], 1, 600000) = 1 ([{fd=3, revents=POLLIN}])
        read(3, "+2+1+0+0+1", 1024) = 10
        getsockname(3, {sa_family=AF_INET, sin_port=htons(56393), sin_addr=inet_addr("192.168.0.113")}, [16]) = 0
        pipe2([4, 5], O_CLOEXEC) = 0
        clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b63fc24b4d0) = 2972
        close(5) = 0
        fcntl(4, F_SETFD, 0) = 0
        read(4, ^C <unfinished ...>
        [root@log05 ~]#

    5. q commands ran well until today

    6. ownership or permission on log05

        [root@log05 ~]# echo $PBS_HOME
        /nuist/p/public/PBS_HOME
        [root@log05 ~]# ll /nuist/p/public
        drwxr-xr-x 8 pbsadmin root 65536 Oct 22 2014 pbsadmin
        drwxr-xr-x 13 root root 65536 Mar 18 16:10 PBS_HOME
        [root@log05 ~]# ls -lrt $PBS_HOME
        total 1088
        -rw-r--r-- 1 root root 36 Sep 23 2012 pbs_environment.old
        drwxrwxrwt 2 root root 65536 Sep 23 2012 undelivered
        drwx------ 2 root root 65536 Sep 23 2012 checkpoint
        drwxr-xr-x 2 root root 65536 Sep 23 2012 aux
        -rw-r--r-- 1 root root 36 Jan 10 2014 pbs_environment
        -rw-r--r-- 1 root root 14 Jan 10 2014 pbs_version
        drwxr-xr-x 2 root root 65536 Nov 14 16:27 mom_logs
        drwxr-x--x 4 root root 65536 Dec 6 17:30 mom_priv
        -rw-r--r-- 1 root root 9565 Dec 6 17:59 set_sharing
        drwxr-x--- 9 root root 196608 Apr 29 11:07 server_priv
        drwx------ 11 pbsadmin root 65536 May 1 20:25 datastore
        drwxrwxrwt 2 root root 65536 May 1 20:25 spool
        drwxr-x--- 2 root root 65536 May 1 22:27 sched_priv
        drwxr-xr-x 2 root root 65536 May 6 00:00 server_logs
        drwxr-xr-x 2 root root 65536 May 6 00:07 sched_logs
  11. Hi,

    There is a weird issue: all q commands, such as qstat and qmgr, hang on the node where the PBS server is installed, but work fine on other nodes. Because the PBS server on the problem node is running jobs and no secondary server is available, neither the node nor the PBS server can be restarted. Is there any workaround for this sort of problem? Thanks.

    For example, the PBS server is installed on log05. On log05, all q commands hang with no response and no timeout; they just sit there without any output until I press Ctrl+C to abort:

    [root@log05 ~]# qstat
    ^C
    [root@log05 ~]# qstat @log05
    ^C
    [root@log05 ~]# qmgr -c "p s"

    On other nodes there is no such issue, for example on log01:

    [root@log01 ~]# qstat -f -B log05
    Server: log05
        server_state = Active
        server_host = log05
        scheduling = True
        total_jobs = 2980
        state_count = Transit:0 Queued:37 Held:0 Waiting:0 Running:38 Exiting:0 Begun:0
        operators = root@*
        default_queue = Economy
        log_events = 511
        mail_from = adm
        query_other_jobs = True
        resources_default.ncpus = 1
        resources_default.walltime = 06:00:00
        default_chunk.ncpus = 1
        resources_assigned.mpiprocs = 3012
        resources_assigned.ncpus = 3232
        resources_assigned.nodect = 278
        scheduler_iteration = 600
        flatuid = True
        FLicenses = 768
        resv_enable = True
        node_fail_requeue = 310
        max_array_size = 10000
        node_group_enable = False
        node_group_key = node_g
        pbs_license_info = 6200@log07
        pbs_license_min = 3696
        pbs_license_max = 4000
        pbs_license_linger_time = 31536000
        license_count = Avail_Global:498 Avail_Local:464 Used:3232 High_Use:3696 Avail_Sockets:0 Unused_Sockets:0
        pbs_version = PBSPro_12.2.0.133411
        eligible_time_enable = False
        job_history_enable = True
        max_concurrent_provision = 5

    [root@log01 ~]# qmgr -c "p s"
    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue Intl
    #
    create queue Intl
    set queue Intl queue_type = Execution
    set queue Intl resources_default.walltime = 06:00:00
    set queue Intl max_run = [u:PBS_GENERIC=1]
    set queue Intl max_run_res.ncpus = [u:PBS_GENERIC=1]
    set queue Intl enabled = False
    set queue Intl started = False
    #
    # Create and define queue PGI
    #
    create queue PGI
    set queue PGI queue_type = Execution
    set queue PGI resources_default.walltime = 06:00:00
    set queue PGI max_run = [u:PBS_GENERIC=1]
    set queue PGI max_run_res.ncpus = [u:PBS_GENERIC=1]
    set queue PGI enabled = False
    set queue PGI started = False
    #
    # Create and define queue Bigcpu
    #
    create queue Bigcpu
    set queue Bigcpu queue_type = Execution
    set queue Bigcpu Priority = 50
    set queue Bigcpu acl_users = Khan15
    set queue Bigcpu acl_users += caijinjie
    set queue Bigcpu acl_users += qhzhao
    set queue Bigcpu resources_max.ncpus = 1500
    set queue Bigcpu resources_max.walltime = 06:00:00
    set queue Bigcpu resources_default.walltime = 06:00:00
    set queue Bigcpu default_chunk.Qlist = Bigcpu
    set queue Bigcpu max_run = [u:PBS_GENERIC=1]
    set queue Bigcpu max_run_res.ncpus = [u:PBS_GENERIC=1500]
    set queue Bigcpu enabled = False
    set queue Bigcpu started = False
    #
    # Create and define queue Economy
    #
    create queue Economy
    set queue Economy queue_type = Execution
    set queue Economy Priority = 30
    set queue Economy max_queued = [u:PBS_GENERIC=5]
    set queue Economy resources_max.ncpus = 512
    set queue Economy resources_max.walltime = 24:00:00
    set queue Economy resources_default.walltime = 24:00:00
    set queue Economy default_chunk.Qlist = Economy
    set queue Economy resources_available.ncpus = 1008
    set queue Economy max_run = [u:PBS_GENERIC=2]
    set queue Economy max_run_res_soft.ncpus = [u:PBS_GENERIC=512]
    set queue Economy enabled = True
    set queue Economy started = True
    #
    # Create and define queue Regular
    #
    create queue Regular
    set queue Regular queue_type = Execution
    set queue Regular Priority = 50
    set queue Regular max_queued = [u:PBS_GENERIC=5]
    set queue Regular resources_max.ncpus = 512
    set queue Regular resources_max.walltime = 12:00:00
    set queue Regular resources_default.walltime = 12:00:00
    set queue Regular default_chunk.Qlist = Regular
    set queue Regular resources_available.ncpus = 1344
    set queue Regular max_run = [u:PBS_GENERIC=2]
    set queue Regular max_run_res_soft.ncpus = [u:PBS_GENERIC=512]
    set queue Regular enabled = True
    set queue Regular started = True
    #
    # Create and define queue Longtime
    #
    create queue Longtime
    set queue Longtime queue_type = Execution
    set queue Longtime Priority = 130
    set queue Longtime max_queued = [u:PBS_GENERIC=2]
    set queue Longtime acl_user_enable = False
    set queue Longtime resources_max.ncpus = 336
    set queue Longtime default_chunk.Qlist = Longtime
    set queue Longtime resources_available.ncpus = 1008
    set queue Longtime max_run = [u:PBS_GENERIC=1]
    set queue Longtime max_run_res_soft.ncpus = [u:PBS_GENERIC=336]
    set queue Longtime enabled = True
    set queue Longtime started = True
    #
    # Create and define queue Premium
    #
    create queue Premium
    set queue Premium queue_type = Execution
    set queue Premium Priority = 100
    set queue Premium max_queued = [u:PBS_GENERIC=5]
    set queue Premium acl_user_enable = True
    set queue Premium acl_users = wuyang
    set queue Premium resources_max.ncpus = 336
    set queue Premium default_chunk.Qlist = Premium
    set queue Premium max_run = [u:PBS_GENERIC=2]
    set queue Premium max_run_res_soft.ncpus = [u:PBS_GENERIC=336]
    set queue Premium enabled = True
    set queue Premium started = True
    #
    # Create and define queue Debug
    #
    create queue Debug
    set queue Debug queue_type = Execution
    set queue Debug Priority = 100
    set queue Debug acl_user_enable = True
    set queue Debug acl_users = Caiwenyi0507
    set queue Debug acl_users += deng
    set queue Debug acl_users += fucaifang
    set queue Debug acl_users += jliu
    set queue Debug default_chunk.Qlist = Debug
    set queue Debug enabled = True
    set queue Debug started = True
    #
    # Set server attributes.
    #
    set server scheduling = True
    set server operators = root@*
    set server default_queue = Economy
    set server log_events = 511
    set server mail_from = adm
    set server query_other_jobs = True
    set server resources_default.ncpus = 1
    set server resources_default.walltime = 06:00:00
    set server default_chunk.ncpus = 1
    set server scheduler_iteration = 600
    set server flatuid = True
    set server resv_enable = True
    set server node_fail_requeue = 310
    set server max_array_size = 10000
    set server node_group_enable = False
    set server node_group_key = node_g
    set server pbs_license_info = 6200@log07
    set server pbs_license_min = 3696
    set server pbs_license_max = 4000
    set server pbs_license_linger_time = 31536000
    set server license_count = "Avail_Global:498 Avail_Local:416 Used:3280 High_Use:3696 Avail_Sockets:0 Unused_Sockets:0"
    set server eligible_time_enable = False
    set server job_history_enable = True
    set server max_concurrent_provision = 5

    I appreciate your time.

    Regards,
    Jerry
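    Because the q commands on log05 hang with no timeout, one way to probe them without tying up a terminal is to bound each call. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed on the node; the `probe` helper is a hypothetical name, and it only reports whether a command finished, failed, or was cut off. It does not address the cause of the hang.

```shell
#!/bin/sh
# Sketch only: `probe` is a hypothetical helper, not part of PBS.
# Assumes GNU coreutils `timeout` is available on the node.

# Run a command with a time limit and classify the outcome.
#   $1 = time limit in seconds, remaining arguments = command to run
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case $rc in
        0)   echo "ok" ;;            # command finished normally
        124) echo "hung" ;;          # timeout had to stop the command
        *)   echo "failed rc=$rc" ;; # command exited with an error
    esac
}

# Possible use on the server node (not executed here):
#   probe 10 qstat -B log05
```

    Running the same probe on log01 and on log05 would give a like-for-like comparison of the two nodes.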
  12. How to alter the queue order of jobs

    Scott,

    Got the idea, thanks. I appreciate your time.

    Regards,
    Jerry
  13. How to alter the queue order of jobs

    Hi,

    It seemed that redefining the priority of a batch job via qalter did not help much in rearranging the queue order of jobs. Is there a way to alter the queue order of jobs directly? For example, if the queue order of jobs is:

    job1
    job2
    job3

    I want to rearrange it to:

    job3
    job1
    job2

    Thanks for your time.

    Jerry
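    One possible approach, assuming the PBS `qorder` command (which exchanges the positions of two jobs) is available and permitted for these jobs, is to express the reordering as pairwise swaps. The sketch below only simulates the swaps on a list of job names and prints the matching `qorder` calls; `swap_in_list` is a hypothetical helper and job1, job2, job3 stand in for real job IDs.

```shell
#!/bin/sh
# Sketch only: simulates pairwise swaps on a list of job names.
# swap_in_list is a hypothetical helper; job1..job3 stand in for real job IDs.

# Print the list with the two named jobs exchanged.
#   $1 = space-separated job list, $2 and $3 = the jobs to swap
swap_in_list() {
    out=""
    for j in $1; do
        case $j in
            "$2") j=$3 ;;
            "$3") j=$2 ;;
        esac
        out="$out${out:+ }$j"
    done
    echo "$out"
}

order="job1 job2 job3"
order=$(swap_in_list "$order" job1 job3)   # job3 job2 job1
echo "qorder job1 job3   -> $order"
order=$(swap_in_list "$order" job2 job1)   # job3 job1 job2
echo "qorder job2 job1   -> $order"
```

    Two swaps are enough for the three-job example in this post; a longer queue would need more.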
  14. Issue with setting sharing attr. of vnode

    Scott,

    Thank you very much indeed for your kind and detailed instructions. I appreciate your time.

    Regards,
    Jerry
  15. Issue with setting sharing attr. of vnode

    Scott,

    I have to apply this change on a cluster where jobs are running, so I'd like to double-check that I understood correctly:

    1. One node per file.
    2. The 2nd parameter, 'set_sharing', is the configuration file that will be created and used by pbs_mom.
    3. The 3rd parameter, 'set_sharing', is the 'configuration 2' file I will need to create.
    4. Apply 'pbs_mom -s insert' node by node with the appropriate 'configuration 2' file.
    5. -HUP or -INT if the node is busy?

    Thanks for your time.

    Jerry
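    Steps 1 to 4 above can be sketched as a dry run. This is a minimal sketch under stated assumptions: the vnode name `node01` and the sharing value `default_excl` are hypothetical placeholders, the file layout follows the PBS version 2 configuration format as I understand it (a `$configversion 2` header, then `vnode: attribute = value`), and the `pbs_mom -s insert` command is only printed, not run. The signal question in step 5 is left open here.

```shell
#!/bin/sh
# Sketch only: node01 and default_excl are hypothetical placeholders,
# not values taken from this thread.

# Write a version 2 configuration file that sets `sharing` for one vnode.
#   $1 = vnode name, $2 = sharing value, $3 = file to write
make_vnode_file() {
    printf '$configversion 2\n%s: sharing = %s\n' "$1" "$2" > "$3"
}

# Print (do not run) the insert command, so it can be reviewed before
# being executed as root on the node itself.
#   $1 = name pbs_mom stores the file under, $2 = input file
show_insert_cmd() {
    echo "pbs_mom -s insert $1 $2"
}

input="${TMPDIR:-/tmp}/set_sharing.node01"
make_vnode_file node01 default_excl "$input"
show_insert_cmd set_sharing "$input"
```

    Keeping one input file per node, as in step 1, makes it easy to review each file before the insert is actually run on that node.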