  • Announcements

    • admin

      PBS Forum Has Closed   06/12/17

      The PBS Works Support Forum is no longer active.  For PBS community-oriented questions and support, please join the discussion at http://community.pbspro.org.  Any new security advisories related to commercially-licensed products will be posted in the PBS User Area (https://secure.altair.com/UserArea/). 

Scott Suchyta

Moderators
  • Content count

    160
  • Joined

  • Last visited

  • Days Won

    9

Everything posted by Scott Suchyta

  1. using server_dyn_res

    Hello Jakub,

    True, true, true, true.

    At each scheduling cycle, the server_dyn_res script will execute and set the value of resources_available.foo_widget. The Scheduler will decrement the resources_available.foo_widget value for each job requesting foo_widget until there are no more available. At the next scheduling cycle, the server_dyn_res script will execute again, and the Scheduler will use the value your script returns.

    WRT your background, I don't believe I see the entire picture. I can imagine several different approaches for using "special" nodes, which I am going to assume you are calling vCloud resources. I am assuming that vCloud resources are compute nodes that are somewhere else, for instance "off site" (maybe in a cloud). Am I assuming correctly? If so, then I am also assuming there is a single PBS Server/Scheduler managing both the local compute nodes and the "off site" compute nodes. Is this true? It seems to me that you want to make a decision on a job-by-job basis.

    Simple approach: if you only have 50 'units' of vCloud, then set this as a static value on the PBS Server (qmgr -c "s s resources_available.vCloud=50"), and when a job requests vCloud=<value>, the Server will keep track of how many vCloud resources are assigned (resources_assigned.vCloud), and the Scheduler will take this into consideration during the scheduling cycle. However, the "decision" to approve the vCloud resource would need to be made earlier, at job submission (e.g., in a queuejob hook). The queuejob hook would be responsible for making the decision and setting the vCloud value. In addition, the queuejob hook would modify the job's request and add an attribute to make sure it runs on the vCloud resources. You may also want to consider moving the job into a "special" vCloud queue, where the vCloud resources only run jobs from that queue.

    If you have a PBS Server/Scheduler running on the vCloud resources themselves, you could consider using peer scheduling. Maybe a runjob hook could be used to make the decision on whether to run the job or not.
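    The queuejob-hook decision described above can be sketched in plain Python. Everything here is illustrative (MAX_VCLOUD, the queue names, and route_job are made-up names); a real PBS hook would import the pbs module, inspect pbs.event().job, and either set the job's destination queue or reject the event.

```python
# Sketch of the decision logic a queuejob hook might apply.
# All names here are hypothetical; a real PBS hook would read the
# job from pbs.event().job and set job attributes instead of
# returning a dict.

MAX_VCLOUD = 50  # total vCloud 'units' configured on the server

def route_job(requested_vcloud, currently_assigned):
    """Decide whether a job's vCloud request can be approved and,
    if so, which queue it should land in."""
    if requested_vcloud <= 0:
        # Job does not touch vCloud resources: let it go to the
        # normal queue untouched.
        return {"approved": True, "queue": "workq"}
    if currently_assigned + requested_vcloud > MAX_VCLOUD:
        # Not enough vCloud units: reject at submission time.
        return {"approved": False, "queue": None}
    # Approved: route into the dedicated vCloud queue so that only
    # vCloud nodes run it.
    return {"approved": True, "queue": "vcloud"}
```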
  2. using server_dyn_res

    Hi Jakub,

    The availability of a dynamic server-level resource is determined by running a script or program specified in the server_dyn_res line of PBS_HOME/sched_priv/sched_config. The value of resources_available.<resource> is updated at each scheduling cycle with the value returned by the script. The script runs on the host where the scheduler runs, once per scheduling cycle, and must return the value via stdout in a single line ending with a newline. The scheduler tracks how much of each numeric dynamic server-level custom resource has been assigned to jobs, and will not overcommit these resources.

    So, let's go through a scenario. Assume there were sufficient resources to satisfy the jobs' native resource requests (e.g., ncpus, mem), and we have a server_dyn_res script querying for foo_widget. The script returned the value 50, so resources_available.foo_widget=50 for this scheduling cycle. If each job asked for foo_widget=1, then the scheduler could schedule up to 50 jobs (1 job : 1 foo_widget). At the next scheduling cycle, the server_dyn_res script would execute and report back a new number.

    Does this help?

    BTW, the PBS Works User Forum is not very active (well, besides SPAM) these days. I recommend that you raise your question on the http://community.pbspro.org forum, which has more folks monitoring.

    Scott
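    A minimal server_dyn_res script can be sketched as follows. The counting function is a stand-in for whatever your site actually queries (a license server, disk space, etc.); the only hard requirement from the scenario above is that the value is written to stdout as a single line ending in a newline.

```python
#!/usr/bin/env python
# Minimal server_dyn_res sketch. PBS runs this once per scheduling
# cycle and takes the single line written to stdout as the new value
# of resources_available.foo_widget.

def count_foo_widgets():
    # Stand-in: a real script would query a license server, a
    # database, free disk space, etc., and return the current
    # availability as an integer.
    return 50

if __name__ == "__main__":
    # print() supplies the required trailing newline.
    print(count_foo_widgets())
```

    It would then be wired into PBS_HOME/sched_priv/sched_config with a line such as server_dyn_res: "foo_widget !/path/to/count_foo_widgets.py" (the path is hypothetical), after defining foo_widget as a custom resource per the Administrator's Guide.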
  3. pbs_submit attropl structure example

    Hello Pavel, The PBS Works User Forum is not very active (well, besides SPAM) these days. I recommend that you raise your question on the http://community.pbspro.org forum, which has more folks monitoring. Scott
  4. iptables and interactive jobs

    Referencing the latest PBS Professional Installation and Upgrade Guide (14.2.1), a few more points. Which ports need to be open for interactive qsub jobs? That is not fixed, and cannot be, since each interactive session -- and there can be an unlimited number of them -- listens on its own port. These are not privileged ports, because qsub runs as the user rather than as root (the port is simply communicated to the execution nodes via a job attribute). The practical rule: allow traffic _from_ the execution hosts.
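    As an illustration of "allow traffic from the execution hosts", rules on a submission host might look like the following. This is a hypothetical fragment: the subnet 10.0.0.0/24 and the exact port range are placeholders, and a real ruleset would also need exceptions for SSH and any other services the host runs.

```
# Hypothetical iptables rules on a submission host: accept inbound
# connections from the execution-host subnet (placeholder 10.0.0.0/24)
# to the ephemeral ports an interactive qsub listens on, and drop
# other new inbound TCP connections.
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 1024:65535 -j ACCEPT
iptables -A INPUT -p tcp -m state --state NEW -j DROP
```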
  5. job manage on Compute Manager

    You could try the following as root: qhold `qselect` The qselect command will list all of the jobs in the queue(s). Since there are no arguments supplied to qselect, it will list all jobs. You may want to review the qselect documentation to understand how you can filter the job list. You reference Compute Manager, which suggests to me that you (your company) should have a support contract. I don't know whether your company is current on its support contract, but it wouldn't hurt to send email to the PBS Support team to get real-time help. See http://www.pbsworks.com/ContactSupport.aspx
  6. Install on different Linux Versions

    The OS is not required to be the same on the head node and the compute nodes. So, your configuration -- server parts on Red Hat 6.7 and pbs_mom on CentOS 7.3 -- is valid.
  7. How to prevent new jobs from starting on specific nodes

    Job(s) running on the "offline" node will continue to execute. Setting a node to "offline" will NOT stop the executing job(s) on that node. Referring to the reference guide:

    pbsnodes -o <nodename>

    Marks listed hosts as OFFLINE even if currently in use. This is different from being marked DOWN. A host that is marked OFFLINE will continue to execute the jobs already on it, but will be removed from the scheduling pool (no more jobs will be scheduled on it). For hosts with multiple vnodes, pbsnodes operates on a host and all of its vnodes, where the hostname is resources_available.host, which is the name of the natural vnode.

    To offline a single vnode in a multi-vnoded system, use:

    qmgr -c "set node <nodename> state=offline"

    If you want to set the node comment at the same time as offlining the node, you can use:

    pbsnodes -C "Note: not accepting new jobs" -o <nodename> [<nodename2> ...]

    With qmgr, you would use:

    qmgr -c "set node <nodename> comment = 'Note: not accepting new jobs'"
  8. How to prevent new jobs from starting on specific nodes

    Have you looked at setting the node offline? pbsnodes -o <nodename> or you can use qmgr -c "set node <nodename> state = offline"
  9. single-node host name woes

    What do you get if you execute: rpm -qi <package_name> Otherwise, you may want to review http://goo.gl/j04vz, the site of the company that maintains Torque.
  10. single-node host name woes

    Sorry, garygo, this forum is for PBS Professional; I cannot comment on Torque. Have you verified that the firewall is not blocking communication ports? I am asking because we receive several PBS Professional support calls about pbs_server and pbs_mom communication issues that turn out to be caused by firewall rules.
  11. resource balance information/manipulation

    The URL you referenced is based on TORQUE - not PBS Professional. Can you confirm that your site is using PBS Professional? There can be several reasons why jobs are pending (e.g., scheduling policies, resource requests). As a user, you have some commands you can use to figure out why a job is not running. The most common is:

    qstat -f $PBS_JOBID

    where $PBS_JOBID is the job in question. This will show the full details of the job, including a comment attribute that contains a string. Usually, this is the last statement from the PBS Scheduler; however, admins have the ability to overwrite this comment field. Another command, which needs to be executed on the PBS Server host, is:

    tracejob -z -n $DAYS $PBS_JOBID

    where $DAYS is how many days in the past you want to parse the daemon logs on the local machine. Since tracejob parses the logs on the local machine, running it on the PBS Server host will show both Server and Scheduler records. If the output of tracejob is NOT clear, the admin may have filtered out DEBUG and other verbose records; you may want to look through the Scheduler logs ($PBS_HOME/sched_logs/) yourself. I don't know if your site is using any other 3rd-party integrations (e.g., allocation managers, cluster managers) to influence scheduling, so you will need to provide more details about your site's configuration in order for someone on the list to give you better guidance.
  12. PBS Professional Open Source

    We are excited to inform you that the PBS Professional team has successfully completed its goal of releasing the open source licensing option of PBS Professional by mid-2016.

    Now Available for Download

    Visit the brand new website www.pbspro.org to learn more about the initiative and download the software packages.

    Join the Community

    We want you to be a part of the open source project community! Join our forum to continue to receive announcements and interact with one another to discuss topics and help answer questions. Everyone is welcome to contribute to the code in a variety of ways, including developing new capabilities, testing, etc. Visit www.pbspro.org to learn about our different mailing lists and the numerous ways to participate.

    Thank you,
    The PBS Professional Open Source Team
  13. PBS Professional Open Source

    We are excited to inform you that the PBS Professional team has successfully completed its goal of releasing the open source licensing option of PBS Professional by mid-2016.

    Now Available for Download

    Visit the brand new website www.pbspro.org to learn more about the initiative and download the software packages.

    Join the Community

    We want you to be a part of the open source project community! Join our forum to continue to receive announcements and interact with one another to discuss topics and help answer questions. Everyone is welcome to contribute to the code in a variety of ways, including developing new capabilities, testing, etc. Visit www.pbspro.org to learn about our different mailing lists and the numerous ways to participate.

    Thank you,
    The PBS Professional Open Source Team
  14. Can not delete Hang job

    What do the server_logs and mom_logs say about not being able to qdel the job? Who is trying to qdel the job - root or the user that submitted it? Can you share the logs? The JOBID has "admin", but your /etc/pbs.conf says PBS_SERVER=sansao.. It looks like your cluster has multiple interfaces and points of name resolution, so you will need to describe what your cluster looks like from a network and naming point of view. I noticed in your other post about the Mom attribute that you are running 11.3.2. This is a pretty old version of PBS. I am not saying 11.3.2 has any issues, but I wanted to make sure you were aware that v13.0 has much better support for multi-interface clusters.
  15. 'Mom' atrrib

    That is very interesting. The Mom attribute represents the hostname of the host on which the MoM daemon runs. The server can set this to the FQDN of the host on which MoM runs, if the vnode name is the same as the hostname. By chance, does n016 resolve to both n016.default.domain and n013.default.domain? On the PBS Server and on the PBS MOM (n016), please execute the following to see the possible hostname resolution:

    pbs_hostn -v n016

    If you have a name resolution issue where n016 thinks it is both n016.default.domain and n013.default.domain, make sure you fix that before proceeding with the following suggestion. Your node configuration looks very basic - meaning there are no custom resources - so your fastest method of cleaning this up would be to delete the node in qmgr and re-create it:

    qmgr -c "d n n016"
    qmgr -c "c n n016"
  16. Problem using mom hook script >4Kb in PBS Pro v12.0

    Based on what I can research, it looks like this was resolved in v12.1.
  17. Does any version of PBS Pro work on Windows Server 2012 R2?

    Referencing the PBS Professional 12.2.6 Release Notes, it does indicate the following:
  18. Command history in qmgr

    This is somewhat _old news_, but still worth making sure people know: qmgr history (and editing) is available as of PBS Professional v12.2. See a snippet from the manual below:
  19. How to move PBS server and scheduler to new server?

    This is described in the manuals in good detail. Not knowing what version of PBS Professional you are using, I will assume 13.0. For all steps described in a backup for migration, see Installation Guide section 7.8.1 (for UNIX/Linux). To dump qmgr configurations, here are some examples:

    qmgr -c "print server" > qmgr-ps.out
    qmgr -c "print node @default" > qmgr-pn.out
  20. How to move PBS server and scheduler to new server?

    I am not sure what version of PBS Professional you are using. Questions to consider before proceeding:

    - Will the new system be running the same version of the operating system? If so, this can make things easier.
    - Will the host of the new PBS Server/Scheduler be named the same as the old? PBS is very sensitive to hostnames, so if you are not using the same hostname, you run the risk of having to do a lot more updating of the configuration (e.g., on the execution nodes in /etc/pbs.conf and PBS_HOME/mom_priv/config).
    - Do you have intentions of upgrading the version of PBS Professional during this migration?
    - Do you have Advance or Standing Reservations? If yes AND PBS is older than 12.2, delete each reservation.
    - Do you have running jobs? If yes, then you will need to allow them to run to completion or requeue them. You cannot migrate PBS while there are running jobs. If you requeue jobs, be sure to turn scheduling off so that the jobs do not get dispatched again.
    - Are you using wrapped MPIs? If so, you will need to unwrap them.

    I would recommend reviewing the Installation & Upgrade Guide > Migration Upgrade section, as it explains all of the steps involved in a migration upgrade and is applicable to a migration between servers, too.
  21. How to limit the maximum number of processors per job

    max_run_res.<res> can be applied to generic user, generic group, generic project, named user, named group, named project, and/or overall (o:PBS_ALL). Since you want to impose a limit on ncpus per job, I am assuming that you don't want to restrict the total ncpus that could be running or queued. IOW, UserA could submit three 168-ncpus jobs, and all three jobs would be eligible to run if there were enough resources available. If this is true, then my suggestion would be to introduce a queuejob event hook that validates that the total ncpus being requested is less than the maximum you want to set. In addition, the same logic would be used in a modifyjob event hook, so the user would not be able to change the total ncpus to something larger after the job was successfully submitted.
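    That validation can be sketched in plain Python. MAX_NCPUS and the select-parsing helper are illustrative; a real queuejob/modifyjob hook would import the pbs module, read the select string from pbs.event().job.Resource_List, and call pbs.event().reject() when the limit is exceeded.

```python
# Sketch of a per-job ncpus cap. Sums ncpus across the chunks of a
# select specification such as "2:ncpus=8+1:ncpus=4".

MAX_NCPUS = 168  # illustrative site-wide per-job cap

def total_ncpus(select_spec):
    """Sum ncpus over all chunks: '2:ncpus=8+1:ncpus=4' -> 20."""
    total = 0
    for chunk in select_spec.split("+"):
        parts = chunk.split(":")
        # A leading bare integer is the chunk count; default is 1.
        count = int(parts[0]) if parts[0].isdigit() else 1
        ncpus = 1  # PBS default when ncpus is not specified
        for kv in parts:
            if kv.startswith("ncpus="):
                ncpus = int(kv.split("=", 1)[1])
        total += count * ncpus
    return total

def accept_job(select_spec):
    """True if the job's total ncpus request is within the cap."""
    return total_ncpus(select_spec) <= MAX_NCPUS
```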
  22. PBS Node Unknown

    firly,

    Thanks for sharing the screen shots - they were helpful.

    WRT Abaqus jobs not submitting: It would appear that the abaqus command wraps our qsub command. I will not be able to help diagnose any issues with the command-line arguments of abaqus - this will need to be pursued via Abaqus support, as they are providing you the interface. However, I can provide you some guidance on what I think might be happening. Are you the administrator of these systems, AND did you install PBS Professional? You mention one server and 6 clients; I need to clarify terminology. Based on your screen shots, it looks like you have:

    1. ONE PBS Server/Scheduler [gttgr-abaqus]; I see this from the xpbs screenshot > server
    2. ONE PBS MOM (execution node) [gttgr-abaqus]; I see this from qmgr -c "list nodes @active"
    3. SIX clients; I am assuming that you are referring to submission clients and not execution nodes.

    WRT BAD UID: The error message looks similar to what I would have expected our qsub command to provide, BUT it looks like the abaqus command is modifying the error message. Based on the description of the error, I would assume that the PBS Professional administrator has not added the Admin user accounts to the acl_roots attribute of the PBS Server. I will reference Section 8.10 of the PBS Professional v12.2 Admin Guide, which describes this.

    WRT the error with the queue: If you were submitting the job to a specified queue via PBS Professional's qsub command, you would have used the "-q workq" argument. Again, I don't know what the abaqus --queue argument considers valid. I see you specified "--queue workq:fsshared". The appended fsshared argument is unknown to me, so this must be something specific to Abaqus?

    WRT the pbsnodes -a command not returning nodes: You received an error message that does not make sense to me. The -a argument is valid, and I would have expected the command to list all of the nodes defined within the PBS Server, similar to the output of qmgr -c "list nodes @default" or qmgr -c "list nodes @active". Can you confirm the absolute path of the pbsnodes command? I am wondering if Abaqus has wrapped more of the PBS Professional commands.

    WRT the pbs_hostn command not returning expected output: Again, the pbs_hostn command is not returning the expected output. You have provided legitimate arguments, but the command is outputting the help/usage. Can you confirm the absolute path of the pbs_hostn command?

    It might make sense to take abaqus out of the equation and make sure that the PBS Professional installation and configuration are set up correctly.
  23. There is a job attribute that determines whether a job is rerunnable. See -r or rerunnable in the guides. If you find yourself setting this job attribute to "n" by default, you can define default_qsub_arguments on the server via qmgr: qmgr -c 'set server default_qsub_arguments = "-r n"' Or, you can create a hook to enforce "-r n" for qsub (queuejob) and qalter (modifyjob). I am suggesting modifyjob because it is possible for the user to change attributes of the job after submission.
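    The enforcement half of such a hook can be sketched in plain Python. The dict stands in for the job object; a real queuejob/modifyjob hook would import the pbs module and assign pbs.event().job.Rerunable = False instead.

```python
# Plain-Python sketch: regardless of what the user passed to qsub or
# qalter, force the job's rerunnable flag to "n". The attribute name
# mirrors PBS's Rerunable job attribute; the dict is a stand-in for
# the real job object a hook would receive.

def enforce_not_rerunnable(job_attrs):
    """Return a copy of the job's attributes with rerunnable forced off."""
    fixed = dict(job_attrs)  # do not mutate the caller's mapping
    fixed["Rerunable"] = "n"
    return fixed
```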
  24. pbs failed in workstation.

    Sorry albumns, this forum is for PBS Professional - not TORQUE.
  25. torque pbs in workstation failed

    Sorry albumns, this forum is for PBS Professional - not TORQUE.