• Announcements

    • admin

      PBS Forum Has Closed   06/12/17

      The PBS Works Support Forum is no longer active.  For PBS community-oriented questions and support, please join the discussion at http://community.pbspro.org.  Any new security advisories related to commercially-licensed products will be posted in the PBS User Area (https://secure.altair.com/UserArea/). 

Scott Suchyta

  1. Hello Jakub, True, true, true, and true. At each scheduling cycle, the server_dyn_res script is executed and its output sets the value of resources_available.foo_widget. The Scheduler then decrements that value by the amount each job requests until no more foo_widget is available. At the next scheduling cycle, the server_dyn_res script runs again and the Scheduler starts from whatever value your script returns.

     Regarding your background, I don't believe I see the entire picture. I can imagine several different approaches for using "special" nodes, which I take to be what you are calling vCloud resources. I am assuming that vCloud resources are compute nodes somewhere else, for instance "off site" (perhaps in a cloud). Am I assuming correctly? If so, I am also assuming there is a single PBS Server/Scheduler managing both the local compute nodes and the "off site" compute nodes. Is this true? It seems to me that you want to make the decision on a job-by-job basis.

     Simple approach: if you only have 50 "units" of vCloud, set this as a static value on the PBS Server (qmgr -c "s s resources_available.vCloud=50"). When a job requests vCloud=<value>, the Server keeps track of how many vCloud resources are assigned (resources_assigned.vCloud), and the Scheduler takes this into consideration during the scheduling cycle. The "decision" to approve the vCloud request, however, would need to be made earlier in the job's life, e.g., in a queuejob hook. The queuejob hook would be responsible for making the decision and setting the vCloud value; in addition, it could modify the job's request so that the job is guaranteed to run on the vCloud resources. You may want to consider moving such jobs into a "special" vCloud queue, where the vCloud resources only run jobs from that queue. If you have a separate PBS Server/Scheduler running on the vCloud resources, you could consider peer scheduling instead. A runjob hook could also be used to decide at run time whether to let the job run.
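     Here is a rough sketch of what such a queuejob hook could look like. Treat it as an illustration only: the resource name vCloud, the queue name vcloud, and the 50-unit cap are placeholders taken from the discussion above, not a tested implementation.

       # queuejob hook sketch (Python, PBS Professional hooks API)
       import pbs

       e = pbs.event()            # the queuejob event being processed
       j = e.job                  # the job as submitted

       requested = j.Resource_List["vCloud"]
       if requested:
           # Site-specific approval logic would go here; the 50-unit cap is only
           # an example mirroring the static value set on the server above.
           if int(requested) > 50:
               e.reject("vCloud request exceeds the configured limit")
           # Route approved vCloud jobs to a dedicated queue (queue name is hypothetical).
           j.queue = pbs.server().queue("vcloud")

       e.accept()

     If you go this route, the hook is created and loaded with qmgr (create the hook, set its event to queuejob, then import the Python file); see the Hooks Guide for the exact commands for your version.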
  2. Hi Jakub, The availability of a dynamic server-level resource is determined by running a script or program specified on the server_dyn_res line of PBS_HOME/sched_priv/sched_config. The value of resources_available.<resource> is updated at each scheduling cycle with the value returned by the script. The script runs on the host where the scheduler runs, once per scheduling cycle, and must return its value on stdout as a single line ending with a newline. The scheduler tracks how much of each numeric dynamic server-level custom resource has been assigned to jobs and will not overcommit these resources.

     So, let's go through a scenario. Assume there are sufficient resources to satisfy the jobs' native resource requests (e.g., ncpus, mem), and we have a server_dyn_res script querying for foo_widget. The script returns the value 50, so resources_available.foo_widget=50 for this scheduling cycle. If each job asks for foo_widget=1, the scheduler can schedule up to 50 jobs (1 job : 1 foo_widget). At the next scheduling cycle, the server_dyn_res script executes again and reports back a new number. Does this help?

     BTW, the PBS Works User Forum is not very active (well, besides SPAM) these days. I recommend that you raise your question on the http://community.pbspro.org forum, which has more folks monitoring it. Scott
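     As a concrete illustration only (the resource name foo_widget comes from the scenario above; the path and the counting logic are hypothetical), the pieces involved are a sched_config entry such as

       server_dyn_res: "foo_widget !/opt/pbs/scripts/count_foo_widget.py"

     and an executable script on the scheduler host that prints a single number, for example:

       #!/usr/bin/env python3
       # Hypothetical server_dyn_res helper: report how many foo_widget units
       # are currently free. Replace the stub below with whatever actually
       # knows about your widgets (a license server, a database, a REST call).

       def count_available_widgets():
           return 50   # stub value for illustration

       if __name__ == "__main__":
           # PBS reads a single value from stdout, terminated by a newline.
           print(count_available_widgets())

     The custom resource itself still has to be defined to the server (e.g., as a long), and the scheduler told to reread its configuration after sched_config is edited; check the Administrator's Guide for the steps that match your version.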
  3. Hello Pavel, The PBS Works User Forum is not very active (well, besides SPAM) these days. I recommend that you raise your question on the http://community.pbspro.org forum, which has more folks monitoring it. Scott
  4. Referencing the latest PBS Professional Installation and Upgrade Guide (14.2.1), a few more points. Which ports need to be open for interactive qsub jobs? There is no fixed port, and there cannot be: each interactive session (and there can be an arbitrary number of them) listens on its own port, and those ports are not privileged because the listener runs as the submitting user rather than as root. The job simply communicates the port to the execution nodes by setting a job attribute. The practical rule is therefore to allow traffic _from_ the execution hosts to the submission host.
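     On a submission host this can be as simple as trusting the execution-host network; the subnet below is purely a placeholder for whatever addressing your cluster actually uses:

       # iptables: accept TCP traffic originating from the execution hosts
       iptables -A INPUT -s 10.10.0.0/16 -p tcp -j ACCEPT

       # or with firewalld: put the execution-host subnet in the trusted zone
       firewall-cmd --permanent --zone=trusted --add-source=10.10.0.0/16
       firewall-cmd --reload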
  5. You could try the following as root: qhold `qselect`. The qselect command lists all of the jobs in the queue(s); since no arguments are supplied, it lists every job. You may want to review the qselect documentation to understand how you can filter the job list. You reference Compute Manager, which suggests to me that you (your company) should have a support contract. I don't know whether your company is current on the support contract, but it wouldn't hurt to send email to the PBS Support team to get real-time help. See http://www.pbsworks.com/ContactSupport.aspx
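     A couple of filtered variants, with a made-up user name to illustrate:

       # hold only the jobs that are still queued (state Q)
       qhold `qselect -s Q`

       # hold only jsmith's queued jobs, then release them again later
       qhold `qselect -u jsmith -s Q`
       qrls  `qselect -u jsmith -s H`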
  6. The OS is not required to be the same on the head node and the compute nodes. So your configuration, "server parts on Redhat 6.7 and install pbs_mom on CentOS 7.3", is a valid configuration.
  7. Job(s) running on the "offline" node will continue to execute; setting a node to "offline" will NOT stop the job(s) already executing on it.

     From the reference guide: pbsnodes -o <nodename> marks the listed hosts as OFFLINE even if they are currently in use. This is different from being marked DOWN. A host marked OFFLINE continues to execute the jobs already on it, but it is removed from the scheduling pool (no more jobs will be scheduled on it). For hosts with multiple vnodes, pbsnodes operates on a host and all of its vnodes, where the hostname is resources_available.host, the name of the natural vnode. To offline a single vnode in a multi-vnoded system, use qmgr -c "set node <nodename> state=offline".

     If you want to set a node comment at the same time as offlining the node, you can run pbsnodes -C "Note: not accepting new jobs" -o <nodename> [<nodename2> ...]; with qmgr, that would be qmgr -c "set node <nodename> comment = 'Note: not accepting new jobs'".
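     If memory serves, the OFFLINE state can later be cleared so the host rejoins the scheduling pool; verify the exact option against the reference guide for your version:

       # hypothetical node name; clears OFFLINE so the node accepts jobs again
       pbsnodes -r node01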
  8. Have you looked at setting the node offline? You can use pbsnodes -o <nodename>, or qmgr -c "set node <nodename> state = offline".
  9. What do you get if you execute rpm -qi <package_name>? Otherwise, you may want to review http://goo.gl/j04vz (the company that maintains Torque).
  10. Sorry, garygo, this forum is for PBS Professional; I cannot comment on Torque. Have you verified that the firewall is not blocking the communication ports? I ask because we receive several PBS Professional support calls related to pbs_server and pbs_mom communication issues that turn out to be caused by firewall rules.
  11. The URL you referenced is based on TORQUE, not PBS Professional. Can you confirm that your site is using PBS Professional? There can be several reasons why jobs are pending (e.g., scheduling policies, resource requests). As a user, you have some commands you can use to figure out why a job is not running.

      The most common is qstat -f $PBS_JOBID, where $PBS_JOBID is the job in question. This shows the full details of the job, including a comment attribute that contains a string; usually this is the last statement from the PBS Scheduler, although admins have the ability to overwrite this comment field.

      Another command, which needs to be executed on the PBS Server host, is tracejob -z -n $DAYS $PBS_JOBID, where $DAYS is how many days in the past you want to parse the daemon logs on the local machine. Since this runs on the PBS Server host, you will see Server and Scheduler logs. If the output of tracejob is NOT clear, the admin may have filtered out DEBUG and other verbose records, so you may want to look through the Scheduler logs ($PBS_HOME/sched_logs/) yourself.

      I don't know whether your site is using any other 3rd-party integrations (e.g., allocation managers, cluster managers) to influence scheduling, so you will need to provide more details about your site's configuration in order for someone on the list to give you better guidance.
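      For example (the job ID, server name, and day count below are placeholders):

        # show the scheduler's most recent comment on the job
        qstat -f 1234.headnode | grep -i comment

        # walk the last 3 days of daemon logs on this host for job 1234
        tracejob -z -n 3 1234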
  12. We are excited to inform you that the PBS Professional team has successfully completed its goal of releasing the open source licensing option of PBS Professional by mid-2016.

      Now Available for Download: Visit the brand new website www.pbspro.org to learn more about the initiative and download the software packages.

      Join the Community: We want you to be a part of the open source project community! Join our forum to continue to receive announcements and interact with one another to discuss topics and help answer questions. Everyone is welcome to contribute to the code in a variety of ways, including developing new capabilities, testing, etc. Visit www.pbspro.org to learn about our different mailing lists and the numerous ways to participate. Thank you, The PBS Professional Open Source Team
  13. What do the server_logs and mom_logs say about not being able to qdel the job? Who is trying to qdel it: root, or the user that submitted the job? Can you share the logs? The JOBID has "admin", but your /etc/pbs.conf says PBS_SERVER=sansao.. It looks like your cluster has multiple interfaces and points of name resolution, so you will need to describe what your cluster looks like from a network and naming point of view. I noticed in your other post about the Mom attribute that you are running 11.3.2, which is a pretty old version of PBS. I am not saying 11.3.2 has any issues, but I wanted to make sure you were aware that v13.0 has much better support for multi-interface clusters.
  14. That is very interesting. The Mom attribute represents the hostname of the host on which the MoM daemon runs; the server can set this to the FQDN of the host on which MoM runs if the vnode name is the same as the hostname. By chance, does n016 resolve to both n016.default.domain and n013.default.domain? On the PBS Server and on the PBS MOM (n016), please execute pbs_hostn -v n016 to see the possible hostname resolutions. If you have a name resolution issue where n016 thinks it is n016.default.domain and n013.default.domain, make sure you fix that before proceeding with the following suggestion. Your node configuration looks very basic, meaning there are no custom resources, so your fastest method of cleaning this up would be to delete the node in qmgr and re-create it: qmgr -c "d n n016" followed by qmgr -c "c n n016".