• Announcements

    • admin

      PBS Forum Has Closed   06/12/17

      The PBS Works Support Forum is no longer active.  For PBS community-oriented questions and support, please join the discussion at http://community.pbspro.org.  Any new security advisories related to commercially-licensed products will be posted in the PBS User Area (https://secure.altair.com/UserArea/). 

All Activity


  2. I have installed PBS Pro 14.2 on a failover cluster. When I run a job and try tracejob on its ID, it shows an error: "06/06/2017 11:34:01 L Failed to run: (15016)". How do I solve this problem?

     [root@HN01 network-scripts]# tracejob 342
     Job: 342.HN01
     06/06/2017 11:34:01 S enqueuing into workq, state 1 hop 1
     06/06/2017 11:34:01 S Job Run at request of Scheduler@hn01 on exec_vnode (cn01:ncpus=32)+(cn02:ncpus=32)+(cn03:ncpus=32)+(cn04:ncpus=32)+(cn05:ncpus=32)+(cn06:ncpus=32)
     06/06/2017 11:34:01 S (req_movejob) Request invalid for state of job, state=4
     06/06/2017 11:34:01 L Considering job to run
     06/06/2017 11:34:01 L Job run
     06/06/2017 11:34:01 L Considering job to run
     06/06/2017 11:34:01 L Failed to run: (15016)
     06/06/2017 11:34:01 S Job Queued at request of danish@mgnt02, owner = danish@mgnt02, job name = pbs_run2.job, queue = workq
     06/06/2017 11:34:01 A queue=workq
     06/06/2017 11:34:01 A user=danish group=navy project=_pbs_project_default jobname=pbs_run2.job queue=workq ctime=1496720041 qtime=1496720041 etime=1496720041 start=1496720041 exec_host=cn01/0*32+cn02/0*32+cn03/0*32+cn04/0*32+cn05/0*32+cn06/0*32 exec_vnode=(cn01:ncpus=32)+(cn02:ncpus=32)+(cn03:ncpus=32)+(cn04:ncpus=32)+(cn05:ncpus=32)+(cn06:ncpus=32) Resource_List.mpiprocs=192 Resource_List.ncpus=192 Resource_List.nodect=6 Resource_List.nodes=6:ppn=32 Resource_List.place=scatter Resource_List.select=6:ncpus=32:mpiprocs=32 Resource_List.walltime=72:00:00 resource_assigned.ncpus=192

     cn01 mom_logs:
     06/06/2017 11:32:33;0008;pbs_mom;Job;341.HN01;nprocs: 630, cantstat: 0, nomem: 0, skipped: 0, cached: 0, max excluded PID: 0
     06/06/2017 11:32:33;0008;pbs_mom;Job;341.HN01;Started, pid = 19512
     06/06/2017 11:32:33;0080;pbs_mom;Job;341.HN01;task 00000001 terminated
     06/06/2017 11:32:33;0008;pbs_mom;Job;341.HN01;Terminated
     06/06/2017 11:32:33;0100;pbs_mom;Job;341.HN01;task 00000001 cput= 0:00:00
     06/06/2017 11:32:33;0008;pbs_mom;Job;341.HN01;kill_job
     06/06/2017 11:32:33;0100;pbs_mom;Job;341.HN01;CN01 cput= 0:00:00 mem=432kb
     06/06/2017 11:32:33;0100;pbs_mom;Job;341.HN01;cn02 cput= 0:00:00 mem=0kb
     06/06/2017 11:32:33;0100;pbs_mom;Job;341.HN01;cn03 cput= 0:00:00 mem=0kb
     06/06/2017 11:32:33;0100;pbs_mom;Job;341.HN01;cn04 cput= 0:00:00 mem=0kb
     06/06/2017 11:32:33;0100;pbs_mom;Job;341.HN01;cn05 cput= 0:00:00 mem=0kb
     06/06/2017 11:32:33;0100;pbs_mom;Job;341.HN01;cn06 cput= 0:00:00 mem=0kb
     06/06/2017 11:32:33;0008;pbs_mom;Job;341.HN01;no active tasks
     06/06/2017 11:32:33;0100;pbs_mom;Job;341.HN01;Obit sent
     06/06/2017 11:32:33;0100;pbs_mom;Req;;Type 54 request received from root@, sock=1
     06/06/2017 11:32:33;0080;pbs_mom;Job;341.HN01;copy file request received
     06/06/2017 11:32:34;0100;pbs_mom;Job;341.HN01;staged 2 items out over 0:00:01
     06/06/2017 11:32:34;0008;pbs_mom;Job;341.HN01;no active tasks
     06/06/2017 11:32:34;0100;pbs_mom;Req;;Type 6 request received from root@, sock=1
     06/06/2017 11:32:34;0080;pbs_mom;Job;341.HN01;delete job request received
     06/06/2017 11:32:34;0008;pbs_mom;Job;341.HN01;kill_job
     06/06/2017 11:34:01;0100;pbs_mom;Req;;Type 1 request received from root@, sock=1
     06/06/2017 11:34:01;0100;pbs_mom;Req;;Type 3 request received from root@, sock=1
     06/06/2017 11:34:01;0100;pbs_mom;Req;;Type 5 request received from root@, sock=1
     06/06/2017 11:34:01;0008;pbs_mom;Job;342.HN01;Type 5 request received from root@, sock=1
     06/06/2017 11:34:01;0008;pbs_mom;Job;342.HN01;Started, pid = 20071

     Attachments: mom_log_341.txt, tracejob_341.txt
  3. More on this topic, reading the error message "Abaqus Warning: Keyword (dsls_license_config) must point to an existing file.": could it be that the file pointed to by that keyword (/var/DassaultSystemes/Licenses/DSLicSrv.txt?) does not exist on the given execution host?
  4. Hello, perhaps the environment is not set up correctly in the application script? Have you tried inserting an "env" debug statement somewhere in your application script to see whether the environment variables are set as expected? If Abaqus expects the licensing information to be delivered by means other than an environment variable, you have to verify that the same information is available on the execution host.
  5. Hello Jakub, At each scheduling cycle, the server_dyn_res script will execute and set the value of resources_available.foo_widget. The Scheduler will decrement the resources_available.foo_widget value for each job requesting foo_widget until there is no more available. So, at the next scheduling cycle, the server_dyn_res script will execute again and the Scheduler will use the value your script returns. WRT your background, I don't believe I see the entire picture. I can imagine several different approaches for using "special" nodes, which I am going to assume you are calling vCloud resources. I am assuming that vCloud resources are compute nodes that are somewhere else, for instance "off site" (maybe in a cloud). Am I assuming correctly? If so, then I am also assuming there is a single PBS Server/Scheduler managing both the local compute nodes and the "off site" compute nodes. Is this true? It seems to me that you want to make a decision on a job-by-job basis. A simple approach: if you only have 50 'units' of vCloud, set this as a static value on the PBS Server (qmgr -c "s s resources_available.vCloud=50"), and when a job requests vCloud=<value>, the Server will keep track of how many vCloud resources are assigned (resources_assigned.vCloud), and the Scheduler will take this into consideration during the scheduling cycle. Although the "decision" to approve the vCloud resource would need to be made earlier, at job submission (e.g., in a queuejob hook). The queuejob hook would be responsible for making the decision and setting the vCloud value. In addition, the queuejob hook would modify the job's request and add an attribute to it to make sure it runs on the vCloud resources. You may want to consider moving such a job to a "special" vCloud queue, where the vCloud resources only run jobs from the vCloud queue.
If you have a PBS Server/Scheduler running on the vCloud resources, you could consider using peer scheduling. Maybe a runjob hook could be used to make the decision on whether to run the job or not.
  6. Hi Scott, Thanks for your reply, and thanks for the pointer to a different community forum. So if I understand your scenario correctly: if each job instead requested foo_widget=2 and there were 50 jobs in the queue, then 25 would run in the first pass of the scheduler if the script returned 50, and another 25 would run in the next pass if the script returned 50 again. And if I'm correct, each job is only processed once per scheduler cycle, so if a job finished really quickly a new one would not start during that pass (i.e., the usage of foo_widget would not be decreased by 2)? Just to give you some background on what I'm doing: we are using PBS Pro to govern approval of our vCloud requests. We queue up all the requests and then approve those for which we have enough resources. So we have a script which obtains CPU and memory and assigns them to server_dyn_res variables, and each approval passes its usage as resource requirements. As such we would like to allow as many approvals as we can without oversubscribing. Cheers, Jakub
  7. Hi Jakub, The availability of a dynamic server-level resource is determined by running a script or program specified in the server_dyn_res line of PBS_HOME/sched_priv/sched_config. The value of resources_available.<resource> is updated at each scheduling cycle with the value returned by the script. The script is run on the host where the scheduler runs, once per scheduling cycle, and must return the value via stdout in a single line ending with a newline. The scheduler tracks how much of each numeric dynamic server-level custom resource has been assigned to jobs, and will not overcommit these resources. So, let's go through a scenario. Assume there are sufficient resources to satisfy the jobs' native resource requests (e.g., ncpus, mem), and we have a server_dyn_res script querying for foo_widget. The script returned the value 50, so resources_available.foo_widget=50 for this scheduling cycle. If each job asked for foo_widget=1, the scheduler could schedule up to 50 jobs (1 job : 1 foo_widget). At the next scheduling cycle, the server_dyn_res script would execute and report back a new number. Does this help? BTW, the PBS Works User Forum is not very active (well, besides SPAM) these days. I recommend that you raise your question on the http://community.pbspro.org forum, which has more folks monitoring it. Scott
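[Editorial note] The server_dyn_res mechanics described above can be sketched as a script. The following is a minimal, hypothetical provider for the foo_widget example from this thread; the total of 50 and the "in use" query are placeholder assumptions, and the sched_config wiring would look something like: server_dyn_res: "foo_widget !/path/to/foo_widget.py".

```python
#!/usr/bin/env python3
# Hypothetical server_dyn_res provider for the custom resource "foo_widget".
# PBS runs this on the scheduler host once per scheduling cycle and reads a
# single newline-terminated value from stdout.
import sys

def available_foo_widgets(total=50, in_use=0):
    """Return how many foo_widget units are currently free.
    In a real script, 'in_use' would be queried from whatever external
    system owns the resource (a license server, a cloud API, ...);
    here it is just a placeholder argument."""
    return max(total - in_use, 0)

if __name__ == "__main__":
    # The value must be the only thing on stdout, ending with a newline.
    sys.stdout.write("%d\n" % available_foo_widgets())
```

The scheduler then treats the printed number as resources_available.foo_widget for that cycle and decrements its in-memory copy as it places jobs.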
  8. Hello Pavel The PBS Works User Forum is not very active (well besides SPAM) these days. I recommend that you raise your question on the http://community.pbspro.org forum, which has more folks monitoring. Scott
  9. I'm trying to submit jobs to PBS Pro using the C++ API, but I have some issues with the attropl structure, especially the resource list. What is the correct format for it?
     1: (name="Resource_List", resource="select", value="1:ncpus=1")
     2: (name="Resource_List", resource="select", value="1") -> (name="Resource_List", resource="ncpus", value="1")
     3: (name="Resource_List", resource="select", value="1:ncpus=1") -> (name="Resource_List", resource="ncpus", value="1")
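[Editorial note] For illustration: a -l select=1:ncpus=1 request is normally carried as a single Resource_List entry whose resource is "select" and whose value is the whole chunk string, i.e. format 1 above. The sketch below models that one attropl node in Python with ctypes; the field layout is an assumption mirroring the declaration in pbs_ifl.h, so verify it against your header before relying on it.

```python
# Sketch: modeling the C 'struct attropl' linked list (declared in pbs_ifl.h)
# to show how format 1 would be populated. The field order below is an
# assumption based on the header, not the header itself.
import ctypes

class Attropl(ctypes.Structure):
    pass

# _fields_ assigned after the class exists so 'next' can self-reference.
Attropl._fields_ = [
    ("next", ctypes.POINTER(Attropl)),
    ("name", ctypes.c_char_p),
    ("resource", ctypes.c_char_p),
    ("value", ctypes.c_char_p),
    ("op", ctypes.c_int),   # enum batch_op; SET is the usual choice at submit
]

def make_select(chunks="1:ncpus=1"):
    """Build the single attropl entry for a select request: the whole
    chunk specification goes in 'value' under resource 'select'."""
    a = Attropl()
    a.name = b"Resource_List"
    a.resource = b"select"
    a.value = chunks.encode()
    a.next = None  # end of list; chain further Resource_List entries here
    return a
```

In real C++ code the same node would be passed (via its address) as the attrib argument of pbs_submit(); additional resources such as walltime would be chained through the next pointer, each as its own Resource_List entry.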
  10. I believe the #PBS directives are interpreted by qsub so I think the behavior is expected.
  11. I'm using server_dyn_res variables to track whether a job should run. This is working fine, but I would like to be able to run multiple jobs per scheduler iteration. Does PBS adjust the values of the variables using the amounts requested by each job, or do I have to continue running only one job at a time? Thanks, Jakub
  12. Hello everyone, I've been using the PBS APIs for about a year, and just noticed that all PBS directives contained in scripts submitted through pbs_submit() are ignored, e.g., even if the script has this: #PBS -q dev, the job ends up in the default queue. Is this behavior expected? Is there a way to force script directives to be honored by the API?
  14. Okies. What I am trying to do is to allow interactive jobs via PBS but not allow users the ability to use ssh to log in to other execution nodes. For the admin to log in to execution nodes, they log in to the login node and then, via the head node, they can log in to an exec node.
  15. Referencing the latest PBS Professional Installation and Upgrade Guide (14.2.1). A few more points: which ports to open for interactive qsub jobs? They can't be known in advance, since each interactive session (and there can be an unlimited number of them) uses its own port, and these are not privileged ports because the sessions run not as root but as the user (qsub simply communicates the port to the execution nodes by setting an attribute). So allow traffic _from_ the execution hosts.
  16. You could try the following as root: qhold `qselect`. The qselect command lists all of the jobs in the queue(s); since no arguments are supplied to qselect here, it lists every job. You may want to review the qselect documentation to understand how you can filter the job list. You reference Compute Manager, which suggests to me that you (your company) should have a support contract. I wouldn't know whether your company is current on the support contract, but it wouldn't hurt to send email to the PBS Support team to get real-time help. See http://www.pbsworks.com/ContactSupport.aspx
  17. Hello, A very urgent issue has come up on our cluster, but the IT manager isn't on duty now. I have the root account and only know how to hold one job using "qhold [job ID]". Could you tell me what command holds all jobs on our cluster? Thanks. By the way, I wish somebody could give me a command reference for the cluster manager... Thanks again.
  18. Hi, I have been tightening down our cluster. Our login node accepts ssh from anywhere; execution nodes accept ssh only from the head node. That all seems to work fine for submitting normal non-interactive jobs.
      Exec nodes, from the head node allow: proto (tcp udp) mod state state NEW dport 15001:15004 ACCEPT;
      Login node, from the head node allow: proto (tcp udp) mod state state NEW dport 15001:15004 ACCEPT;
      This allows normal jobs to be submitted, but interactive jobs fail with "Job cannot be executed" and an exit status of -1. If I add a rule on the login node to allow NEW ACCEPT from a specific node, then interactive jobs will work on that node. I thought all PBS communication went via the head node, not direct node-to-node (e.g., exec node to/from login node). Are there ports that need to be allowed to let interactive jobs run? A netstat during an interactive job showed "login node:33796 to exec node:39424 ESTABLISHED". Mike
  19. Thank you Scott; we are already a PBS Pro customer. I want to install the open source PBS as well. May I install PBS Pro and open source PBS on the same head node?
  20. The OS is not required to be the same on the head node and compute nodes. So your configuration, server parts on Red Hat 6.7 and pbs_mom on CentOS 7.3, is valid.
  21. Dear All, Do the head node and compute nodes need to run the same OS? I want to install the server parts on Red Hat 6.7 and pbs_mom on CentOS 7.3. Will it work? Regards,
  22. Thank you, Scott. I have now done that for the node. It was better to check here before doing something wrong and ending users' jobs :-) Thanks
  23. Job(s) running on the "offline" node will continue to execute; setting a node to "offline" will NOT stop the executing job(s) on the node. Referring to the reference guide: pbsnodes -o <nodename> marks listed hosts as OFFLINE even if currently in use. This is different from being marked DOWN. A host that is marked OFFLINE will continue to execute the jobs already on it, but will be removed from the scheduling pool (no more jobs will be scheduled on it). For hosts with multiple vnodes, pbsnodes operates on a host and all of its vnodes, where the hostname is resources_available.host, which is the name of the natural vnode. To offline a single vnode in a multi-vnoded system, use: qmgr -c "set node <nodename> state=offline". If you want to set the node comment at the same time as offlining the node, you can use: pbsnodes -C "Note: not accepting new jobs" -o <nodename> [<nodename2> ...]. With qmgr, you would use: qmgr -c "set node <nodename> comment='Note: not accepting new jobs'"
  24. If I put a node offline, what will happen to existing jobs? I thought that would stop existing jobs when they tried to communicate back to the head node. Also, by keeping the node online I can add a comment, "Note: not accepting new jobs.", and users with jobs on that node won't worry about their jobs, as they won't see the node is down.
  25. Have you looked at setting the node offline? pbsnodes -o <nodename> or you can use qmgr -c "set node <nodename> state = offline"
  26. Hi, I have some nodes that need a reboot. I want to stop new jobs from starting on those specific nodes, so that when the currently running jobs finish there will be no jobs on those nodes and I can reboot them. So I don't want existing jobs to be ended, just new jobs not started. The manual for qhold suggests that might be what I should use, but it's not clear. Also, what's the best way to stop ALL new jobs from starting but allow current jobs to continue? Setting dedicated_time to a short time in the future still allows new jobs to start. Thanks, Mike