All Activity


  1. Yesterday
  2. Last week
  3. Okay. What I am trying to do is allow interactive jobs via PBS but not allow users to ssh directly to other execution nodes. For the admin to log in to an execution node, they log in to the login node and then hop through the head node to reach the exec node.
  4. Earlier
  5. Referencing the latest PBS Professional Installation and Upgrade Guide (14.2.1). A few more points: which ports need to be open for interactive qsub jobs? The port is not fixed, and it cannot be, since each interactive session -- and there may be an unlimited number of them -- listens on its own port. These are not privileged ports, because the sessions run as the user rather than as root (the submission host simply communicates its port to the execution nodes by setting a job attribute). So on the login/submission host, allow traffic _from_ the execution hosts.
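      As a rough sketch only (assuming plain iptables, and assuming 192.168.1.0/24 stands in for your execution-node network; adapt to whatever firewall tool and addressing you actually use), the rule on the login/submission host would be along the lines of:
          # Let execution hosts open the ephemeral connection back to the interactive qsub session
          iptables -A INPUT -p tcp -s 192.168.1.0/24 -m state --state NEW -j ACCEPT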
  6. You could try the following as root: qhold `qselect`. The qselect command lists all of the jobs in the queue(s); since no arguments are supplied to qselect, it will list all jobs. You may want to review the qselect documentation to understand how you can filter the job list. You reference Compute Manager, which suggests to me that you (your company) should have a support contract. I don't know whether your company is current on the support contract, but it wouldn't hurt to send email to the PBS Support team to get real-time help. See http://www.pbsworks.com/ContactSupport.aspx
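      A couple of illustrative filters (untested here; see the qselect man page for the full option list, and the user name below is a placeholder):
          qhold `qselect`              # hold every job known to the server
          qhold `qselect -s Q`         # hold only jobs that are still queued
          qhold `qselect -u someuser`  # hold only jobs owned by one user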
  7. Hello. A very urgent issue has come up on our cluster, but the IT manager isn't on duty now. I have the root account and only know how to hold one job using "qhold [job ID]". Could you tell me which command holds all jobs on our cluster? Thanks. By the way, I wish somebody could give me a command reference for the cluster manager... Thanks again.
  8. Hi, I have been tightening down our cluster. Our login node accepts ssh from anywhere; execution nodes accept ssh only from the head node. That all seems to work fine for submitting normal non-interactive jobs. Exec nodes, from the head node allow: proto (tcp udp) mod state state NEW dport 15001:15004 ACCEPT. Login node, from the head node allow: proto (tcp udp) mod state state NEW dport 15001:15004 ACCEPT. This allows normal jobs to be submitted, but interactive jobs fail with "Job cannot be executed" and an exit status of -1. If I add a rule on the login node to allow NEW connections from a specific exec node, then interactive jobs on that node work. I thought all PBS communications went via the head node and were not direct node-to-node, such as exec node to/from login node. Are there ports that need to be allowed to let interactive jobs run? A netstat during an interactive job showed "login node:33796 to exec node:39424 ESTABLISHED". Mike
  9. Thank you, Scott. We are already a PBS Pro customer. I also want to install the open source PBS. May I install PBS Pro and the open source PBS on the same head node?
  10. The OS is not required to be the same on the head node and the compute nodes. So your configuration, "server parts on Redhat 6.7 and install pbs_mom on CentOS 7.3", is a valid configuration.
  11. Dear All; do the head node and compute nodes need to run the same OS? I want to install the server parts on Redhat 6.7 and install pbs_mom on CentOS 7.3. Will it work? Regards;
  12. Thank you, Scott. I have now done that for the node. It was better to check here before doing something wrong and ending users' jobs :-) Thanks
  13. Job(s) running on the "offline" node will continue to execute. Setting a node to "offline" will NOT stop the executing job(s) on the node. Referring to the reference guide: pbsnodes -o <nodename> marks the listed hosts as OFFLINE even if currently in use. This is different from being marked DOWN. A host that is marked OFFLINE will continue to execute the jobs already on it, but will be removed from the scheduling pool (no more jobs will be scheduled on it). For hosts with multiple vnodes, pbsnodes operates on a host and all of its vnodes, where the hostname is resources_available.host, which is the name of the natural vnode. To offline a single vnode in a multi-vnoded system, use: qmgr -c "set node <nodename> state=offline". If you want to set the node comment at the same time as offlining the node, you can run: pbsnodes -C "Note: not accepting new jobs" -o <nodename> [<nodename2> ...]. With qmgr, you would run: qmgr -c "set node <nodename> comment = 'Note: not accepting new jobs'". A rough end-to-end sketch follows below.
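      Roughly, and assuming PBS commands are on the PATH and that pbsnodes -r clears the OFFLINE state on your version (please confirm against the reference guide; node01 is a placeholder), a drain-and-reboot sequence might look like:
          # Mark the node offline and leave a comment for users
          pbsnodes -C "Note: not accepting new jobs" -o node01
          # Watch until the jobs already running on the node have finished
          pbsnodes node01
          # Reboot the node, then clear OFFLINE so it rejoins the scheduling pool
          pbsnodes -r node01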
  14. If I put a node offline, what will happen to existing jobs? I thought that would stop existing jobs when they tried to communicate back to the head node. Also, by keeping the node online I can add a comment "Note: not accepting new jobs." and users with jobs on that node won't worry about their jobs, as they won't see the node is down.
  15. Have you looked at setting the node offline? pbsnodes -o <nodename> or you can use qmgr -c "set node <nodename> state = offline"
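      For the "stop ALL new jobs but let running jobs finish" case, one possible approach (a sketch only, assuming you are comfortable pausing scheduling cluster-wide) is to toggle the server's scheduling attribute:
          # Stop the scheduler from starting any new jobs; running jobs continue
          qmgr -c "set server scheduling = False"
          # ... do the maintenance ...
          # Resume normal scheduling
          qmgr -c "set server scheduling = True"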
  16. Hi, I have some nodes that need a reboot. I want to stop new jobs from starting on those specific nodes, so that when the currently running jobs are finished there will be no jobs running on those nodes and I can reboot them. So I don't want existing jobs to be ended, just new jobs not started. The manual for qhold suggests that might be what I should use, but it's not clear. Also, what's the best way to stop ALL new jobs from starting but allow current jobs to continue? Setting dedicated_time to a short time in the future still allows new jobs to start. Thanks, Mike
  17. Can the "sharing" state on a node be changed without a MOM restart? Would like to set up a few nodes as both exclusive and shared, with sharing alternating based on the queue using the node. Thank you,
  18. I have been trying to add an Abaqus job in Compute Manager. We are using the Dassault license server for Abaqus. The Abaqus job works fine outside the PBS environment, but it throws an error when reading the Abaqus license variable:
          /cm/shared/apps/Abaqus/DassaultSystemes/SIMULIA/Commands/abaqus memory=2000 cpus=4 job=nut8mm.inp
          /var/DassaultSystemes/Licenses/DSLicSrv.txt
          /var/DassaultSystemes/Licenses/DSLicSrv.txt
          Abaqus Warning: The .inp extension has been removed from the job identifier
          Abaqus Warning: Keyword (dsls_license_config) must point to an existing file.
          Abaqus Error: License for standard with cpus=4 is not available.
          Abaqus/Analysis exited with error(s).
      Can anyone please share your Abaqus templates?
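      A guess at the cause (not verified here): the dsls_license_config warning suggests Abaqus cannot see the DSLS license file on the execution host. Assuming /var/DassaultSystemes/Licenses/DSLicSrv.txt actually exists on the compute nodes, one sketch is to point the Abaqus environment file the job reads at it explicitly (keyword names per the Abaqus licensing docs; treat the exact syntax as an assumption to verify):
          # abaqus_v6.env (or a custom env file picked up by the job)
          license_server_type=DSLS
          dsls_license_config="/var/DassaultSystemes/Licenses/DSLicSrv.txt"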
  19. Dear Users, I am trying to use OpenMPI 1.6.5 with PBSPro_13.1.501.161802, but I am facing some problems. I can run the case on one node, but when I try to split the job across several nodes the application does not start, although qstat shows the job as running. I found some references saying that it is necessary to compile OpenMPI with a special flag to integrate it with PBS Pro. Has anyone successfully integrated OpenMPI 1.6.5 with PBS Pro? Was it necessary to compile OpenMPI with a special flag? If yes, which flags? Thanks for your help. Regards, Marcelo
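      For what it's worth, the flag usually mentioned for PBS integration is OpenMPI's TM support. A minimal sketch, assuming PBS is installed under /opt/pbs and the install prefix below is just a placeholder:
          # Build OpenMPI against the PBS Task Management (TM) interface so mpirun
          # launches remote ranks through pbs_mom instead of ssh
          ./configure --prefix=/opt/openmpi-1.6.5 --with-tm=/opt/pbs
          make -j 4 && make install
      You can then check the build with `ompi_info | grep tm` to see whether the tm components are present.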
  20. I've finished setting up ohpc 1.2 with PBS Pro. The setup is as follows: the master is connected to the LAN on eth0 and to the compute nodes (via a switch) on eth1. pbs_comm, however, defaults to the IP address of eth0, which the compute nodes of course cannot reach. Output from /var/spool/pbs/comm_logs:
          01/20/2017 14:54:41;0002;Comm@ricr-cluster;Svr;Comm@ricr-cluster;Exiting
          01/20/2017 14:54:41;0002;Comm@ricr-cluster;Svr;Log;Log closed
          01/20/2017 14:55:08;0002;Comm@ricr-cluster;Svr;Log;Log opened
          01/20/2017 14:55:08;0002;Comm@ricr-cluster;Svr;Comm@ricr-cluster;pbs_version=14.1.0
          01/20/2017 14:55:08;0002;Comm@ricr-cluster;Svr;Comm@ricr-cluster;pbs_build=mach=N/A:security=N/A:configure_args=N/A
          01/20/2017 14:55:08;0002;Comm@ricr-cluster;Svr;Comm@ricr-cluster;/opt/pbs/sbin/pbs_comm ready (pid=16276), Proxy Name:ricr-cluster:17001, Threads:4
          01/20/2017 14:55:08;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 1);tfd=18, Leaf registered address 10.155.198.146:15004
          01/20/2017 14:55:14;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 2);tfd=19, Leaf registered address 10.155.198.146:15001
          01/20/2017 14:55:41;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 3);tfd=20, Leaf registered address 192.168.1.4:15003
      I don't really understand what is going on here, since 192.168.1.4 is the IP of compute node1. Output from /var/spool/pbs/mom_logs/:
          01/20/2017 14:55:41;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest 192.168.1.5:15001, msg="tfd=20, pbs_comm:10.155.198.146:17001: Dest not found"
          01/20/2017 14:55:46;0001;pbs_mom;Svr;pbs_mom;Access from host not allowed, or unknown host (15008) in is_request, bad connect from 10.155.198.146:15001
      This is the pbs.conf:
          PBS_SERVER=ricr-cluster
          PBS_START_SERVER=1
          PBS_START_SCHED=1
          PBS_START_COMM=1
          PBS_START_MOM=0
          PBS_EXEC=/opt/pbs
          PBS_HOME=/var/spool/pbs
          PBS_CORE_LIMIT=unlimited
          PBS_SCP=/bin/scp
      For the compute nodes, the 0/1 flags are interchanged, of course. All services are running:
          [root@c1 ~]# ps -ef | grep pbs
          root 3952 1 0 14:55 ? 00:00:00 /opt/pbs/sbin/pbs_mom
          root 4099 3812 0 15:23 pts/0 00:00:00 grep --color=auto pbs
          [root@ricr-cluster ~]# ps -ef | grep pbs
          root 16276 1 0 14:55 ? 00:00:00 /opt/pbs/sbin/pbs_comm
          root 16293 1 0 14:55 ? 00:00:00 /opt/pbs/sbin/pbs_sched
          root 16753 1 0 14:55 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
          postgres 16853 1 0 14:55 ? 00:00:00 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
          postgres 16861 16853 0 14:55 ? 00:00:00 postgres: postgres pbs_datastore 10.155.198.146(40520) idle
          root 16870 1 0 14:55 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
          root 17700 14501 0 15:21 pts/0 00:00:00 grep --color=auto pbs
      The IP listed after pbs_datastore is the unwanted IP of eth0. Pinging and sshing work in both directions. The nodes are all listed as down; I'm guessing this is due to them not communicating with pbs_comm:
          [root@ricr-cluster ~]# pbsnodes -av
          c1
               Mom = c1.localdomain
               Port = 15002
               pbs_version = unavailable
               ntype = PBS
               state = state-unknown,down
               pcpus = 1
               resources_available.host = c1
               resources_available.ncpus = 1
               resources_available.vnode = c1
               resources_assigned.accelerator_memory = 0kb
               resources_assigned.mem = 0kb
               resources_assigned.naccelerators = 0
               resources_assigned.ncpus = 0
               resources_assigned.netwins = 0
               resources_assigned.vmem = 0kb
               comment = node down: communication closed
               resv_enable = True
               sharing = default_shared
      How do I reconfigure pbs_comm?
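      One thing that might help (a guess based on multihomed-host setups; please verify the parameters against the Admin Guide for your version): the daemons pick the address to register by resolving their hostname, and pbs.conf has leaf-name/router settings to pin this. Assuming ricr-cluster-int is a hostname that resolves to the eth1 (internal) address on the head node, a sketch would be:
          # /etc/pbs.conf on the head node (added lines only; the name is a placeholder)
          PBS_LEAF_NAME=ricr-cluster-int
          # /etc/pbs.conf on each compute node: point the MOM at the comm via the internal name
          PBS_LEAF_ROUTERS=ricr-cluster-int
      followed by restarting the PBS services on the head node and the MOMs on the compute nodes.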
  21. Hi, thanks for the inputs. It's not possible to do suspension on our HPC. To release the preempted resources, the low-priority job is simply killed and automatically requeued instead of being suspended. When it kills the preempted job, the PBS server sends SIGTERM, so the application can't tell whether the job was killed outright or vacated by the scheduler and requeued; that is really annoying for workflow control. Say A was running and B, with higher priority, wanted A's resources: the scheduler vacates A and requeues it automatically. If it could send SIGUSR1 or some other signal instead of SIGTERM, the workflow application would know the job had just been vacated and requeued by the scheduler, could put the job status back to the submitted state rather than failed, and could avoid improper action. Is there any way to have the PBS server send a different signal for the two cases, or is there any workaround for this sort of situation? Appreciating further comments/inputs. Regards, Jerry
  22. Hi, thanks for your kind inputs. I found the root cause of this issue: the /etc/passwd files are not consistent on the login node and the PBS server node; the user account was missing from /etc/passwd on the PBS server node. I'm wondering why the user could still submit and run jobs without any problem even though the account info was missing from /etc/passwd on the PBS server node. Is something screwy in the system settings/configuration? Appreciating further comments/inputs. Regards, Jerry
  23. The group = -default- is set by the pbs_server for a job if the system function getpwnam(userid) returns null. Was this user created after the pbs_server last started, or was the user added to the group since the last server start? Maybe something (try restarting nscd, if it is in use?) is caching old information. You can test getpwnam() with the following simple program:
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>
          #include <sys/types.h>
          #include <pwd.h>
          #include <errno.h>

          int main(int argc, char *argv[])
          {
              char command[256];
              struct passwd *pwdp;
              char *name;

              if (argc < 2) {
                  fprintf(stderr, "usage: %s <username>\n", argv[0]);
                  return 1;
              }
              name = argv[1];

              errno = 0;                 /* reset so a stale errno is not reported */
              pwdp = getpwnam(name);
              printf("errno: %s\n", strerror(errno));

              if (pwdp == NULL)
                  printf("No password entry for user %s (getpwnam returned NULL)\n", name);
              else
                  printf("Password entry FOUND for user %s\n", name);

              /* Cross-check with getent, which consults the same NSS sources */
              snprintf(command, sizeof(command), "getent passwd %s", name);
              return system(command);
          }
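      To use it (the file name is arbitrary), compile and run it on the server host as root or as the same account pbs_server runs under:
          gcc -o checkpw checkpw.c
          ./checkpw songgt
      If the pointer comes back NULL on the server host but getent still finds the user, suspect stale caching (e.g. nscd).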
  24. Hello Jerry, if you mean a different signal for job suspension specifically (there are other preemption methods as well), then yes, have a look at "suspendsig" in the documentation.
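      As a sketch only (from memory; check the exact parameter name, value format, and file location in your version's Admin Guide), suspendsig is a MOM-side setting, so you would add it to each execution host's MOM config and re-read the config:
          # $PBS_HOME/mom_priv/config on each execution host (assumed location)
          $suspendsig 10    # SIGUSR1 is 10 on Linux x86_64; some versions may accept a name instead
          # then re-read the config, e.g. send SIGHUP to the running pbs_mom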
  25. Hi, I'm wondering if it's possible to have PBS send a different signal for preempted jobs, such as SIGUSR1? Thanks, Jerry
  26. Hi, we are using PBSPro_12.1.1.131502. With one particular user, '-default-' is used as the group name instead of the user's group, like this:
          20170110:01/10/2017 21:52:26;E;553720.sdb;user=songgt group=-default- project=_pbs_project_default accounting_id="0x600026632" jobname=roms queue=workq ctime=1484083146 qtime=1484083146 etime=1484083146 start=1484083146 exec_host=login1/0+login1/1+login1/2+login1/3+login1/4+login1/5+login1/6+login1/7+login1/8+login1/9+login1/10+login1/11+login1/12+login1/13+login1/14+login1/15 exec_vnode=(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_1:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1)+(clogin86_8_0:ncpus=1) Resource_List.arch=XT Resource_List.mppwidth=16 Resource_List.ncpus=16 Resource_List.nodect=16 Resource_List.place=free Resource_List.select=16:vntype=cray_compute Resource_List.walltime=02:00:00 session=5842 alt_id=439506 end=1484085146 Exit_status=0 resources_used.cpupercent=7 resources_used.cput=00:00:19 resources_used.mem=7036kb resources_used.ncpus=16 resources_used.vmem=135980kb resources_used.walltime=00:33:17 run_count=1
      I have made sure that the group exists and the user is in the group. There is no such issue with other users, only this particular user. I have no idea what's wrong. Any ideas? Thanks for your time. Regards
  27. Actually, I solved my problem. It seems that with Torque 4.2.10, any setup that has the same host name for pbs_server and pbs_mom is seen as a NUMA system. The key to getting the MOM node to state=free was to edit /var/lib/torque/mom_priv/mom.layout and enter the single line "nodes=0" in it. After restarting the pbs daemons (e.g., service pbs_server restart), a "qnodes" shows the single 16-core compute node in state=free and jobs now do run. I'll have to do further investigation to determine for sure that all 16 processors indeed get used, but I think I'm on the right track now. Hopefully, this will be of help to other Torque users who want to set up a single-node queueing system. - Gary
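      For reference, the change boiled down to something like this (paths per a default package install; adjust for your layout, and restart whichever daemons your init system manages):
          # Tell the MOM there is a single NUMA board (node 0)
          echo "nodes=0" > /var/lib/torque/mom_priv/mom.layout
          # Restart the daemons so the layout is picked up
          service pbs_server restart
          service pbs_mom restart
          qnodes    # the node should now show state = free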
  28. What do you get if you execute rpm -qi <package_name>? Otherwise, you may want to check with the company that maintains Torque: http://goo.gl/j04vz