Scott Suchyta

Moderators
  • Content count

    152
  • Days Won

    9

Scott Suchyta last won the day on June 21 2016

Scott Suchyta had the most liked content!

About Scott Suchyta

  • Rank
    Advanced Member

Profile Information

  • Gender
    Not Telling
  1. What do you get if you execute rpm -qi <package_name>? Otherwise, you may want to check with the company that maintains Torque (http://goo.gl/j04vz).
  2. Sorry, garygo, this forum is for PBS Professional; I cannot comment on Torque. Have you verified that the firewall is not blocking communication ports? I am asking because we receive several PBS Professional support calls about pbs_server and pbs_mom communication issues, and the cause is the firewall rules.
  3. The URL you referenced is based on TORQUE, not PBS Professional. Can you confirm that your site is using PBS Professional? There can be several reasons why jobs are pending (e.g., scheduling policies, resource requests). As a user, you have some commands you can use to figure out why the job is not running. The most common command is qstat -f $PBS_JOBID, where $PBS_JOBID is the job in question. This shows the full details of the job, including a comment attribute that contains a string. Usually, this is the last statement from the PBS Scheduler. However, admins have the ability to overwrite this comment field. Another command, which needs to be executed on the PBS Server, is tracejob -z -n $DAYS $PBS_JOBID, where $DAYS is how many days in the past you want to parse the daemon logs on the local machine. Since I mentioned logging into the PBS Server, you will see Server and Scheduler logs. If the output of tracejob is NOT clear, then the admin could have filtered out DEBUG and other verbose records. You may want to look through the Scheduler logs ($PBS_HOME/sched_logs/) yourself. I don't know if your site is using any other 3rd party integrations (e.g., allocation managers, cluster managers) to influence scheduling. So, you will need to provide more details about your site's configuration in order for someone on the list to provide you better guidance.
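The qstat -f check described above could be scripted roughly as follows. This is a minimal sketch, not an official PBS tool: the function names extract_comment and job_comment are hypothetical, the sample job text is made up for illustration, and it assumes that qstat -f prints attributes as "name = value" lines and wraps long values onto continuation lines that start with a tab.

```python
import subprocess


def extract_comment(qstat_output):
    """Return the comment attribute from `qstat -f` text, or None if absent."""
    comment = None
    for line in qstat_output.splitlines():
        if comment is not None:
            # Assumption: a wrapped value continues on lines starting with a tab.
            if line.startswith("\t"):
                comment += line[1:]
                continue
            break
        stripped = line.strip()
        if stripped.startswith("comment = "):
            comment = stripped[len("comment = "):]
    return comment


def job_comment(job_id):
    """Run `qstat -f <job_id>` and return the job's comment (needs PBS installed)."""
    result = subprocess.run(
        ["qstat", "-f", job_id], capture_output=True, text=True, check=True
    )
    return extract_comment(result.stdout)


# Illustrative input only; real output contains many more attributes.
SAMPLE = (
    "Job Id: 101.pbsserver\n"
    "    Job_Name = STDIN\n"
    "    job_state = Q\n"
    "    comment = Not Running: Insufficient amount of resource: ncpus\n"
    "    etime = Mon Jun 20 10:00:00 2016\n"
)

if __name__ == "__main__":
    print(extract_comment(SAMPLE))
```

Keep in mind that, as noted above, an admin may have overwritten the comment, so an empty or unexpected value does not necessarily reflect the Scheduler's last decision.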
  4. We are excited to inform you that the PBS Professional team has successfully completed its goal of releasing the open source licensing option of PBS Professional by mid-2016. Now Available for Download: Visit the brand new website www.pbspro.org to learn more about the initiative and download the software packages. Join the Community: We want you to be a part of the open source project community! Join our forum to continue to receive announcements and interact with one another to discuss topics and help answer questions. Everyone is welcome to contribute to the code in a variety of ways, including developing new capabilities, testing, etc. Visit www.pbspro.org to learn about our different mailing lists and the numerous ways to participate. Thank you, The PBS Professional Open Source Team
  5. We are excited to inform you that the PBS Professional team has successfully completed its goal of releasing the open source licensing option of PBS Professional by mid-2016. Now Available for Download: Visit the brand new website www.pbspro.org to learn more about the initiative and download the software packages. Join the Community: We want you to be a part of the open source project community! Join our forum to continue to receive announcements and interact with one another to discuss topics and help answer questions. Everyone is welcome to contribute to the code in a variety of ways, including developing new capabilities, testing, etc. Visit www.pbspro.org to learn about our different mailing lists and the numerous ways to participate. Thank you, The PBS Professional Open Source Team
  6. What do the server_logs and mom_logs say about not being able to qdel the job? Who is trying to qdel the job? Is it root or the user that submitted the job? Can you share the logs? The JOBID has "admin", but your /etc/pbs.conf says the PBS_SERVER=sansao.. Looks like your cluster has multiple interfaces and points of name resolution, so you will need to describe what your cluster looks like from a network and naming point of view. I noticed in your other post about the Mom attribute that you are running 11.3.2. This is a pretty old version of PBS. I am not saying 11.3.2 has any issues, but I wanted to make sure you were aware that v13.0 has much better support for multi-interface clusters.
  7. That is very interesting. The MOM attribute represents the hostname of the host on which the MoM daemon runs. The server can set this to the FQDN of the host on which MoM runs, if the vnode name is the same as the hostname. By chance, does n016 resolve to n016.default.domain and n013.default.domain? On the PBS Server and on the PBS MOM (n016), please execute the following to see the possible hostname resolution: pbs_hostn -v n016. If you have a name resolution issue where n016 thinks it is n016.default.domain and n013.default.domain, make sure you fix that before proceeding with the following suggestion. Your node configuration looks very basic, meaning there are no custom resources. Your fastest method of cleaning this up would be to delete the node from qmgr and re-create it: qmgr -c "d n n016" followed by qmgr -c "c n n016"
  8. Based on what I can find, it looks like this was resolved in v12.1.
  9. Referencing the PBS Professional 12.2.6 Release Notes, it does indicate the following:
  10. This is somewhat _old news_, but still worth making sure people know: qmgr history (and editing) is available as of PBS Professional v12.2. See a snippet from the manual below:
  11. This is described in the manuals in good detail. Not knowing what version of PBS Professional you are using, I will assume 13.0. For all steps described in a backup for migration, see installation guide section 7.8.1 (for UNIX/Linux). To dump qmgr configurations, here are some examples: qmgr -c "print server" > qmgr-ps.out and qmgr -c "print node @default" > qmgr-pn.out
  12. I am not sure what version of PBS Professional you are using. Questions to consider before proceeding: - Will the new system be running the same version of the operating system? If so, this can make things easier. - Will the host of the new PBS Server/Scheduler be named the same as the old? PBS is very sensitive to hostnames, so if you are not using the same hostname, you run the risk of having to do a lot more updating in the configuration (e.g., on the execution nodes in /etc/pbs.conf, PBS_HOME/mom_priv/config). - Do you intend to upgrade the version of PBS Professional during this migration? - Do you have Advance or Standing Reservations? If yes AND PBS is older than 12.2, delete each reservation. - Do you have running jobs? If yes, then you will need to allow them to run to completion or requeue them. You cannot migrate PBS while there are running jobs. If you requeue jobs, be sure to turn scheduling off so that the jobs do not get dispatched again. - Are you using wrapped MPIs? If so, you will need to unwrap them. I would recommend reviewing the Installation & Upgrade Guide > Migration Upgrade, as it explains a lot of the steps involved with a migration upgrade, and it is applicable to a migration between servers, too.
  13. max_run_res.<res> can be applied to generic user, generic group, generic project, named user, named group, named project, and/or overall (o:PBS_ALL). Since you want to impose the limit of ncpus per job, I am assuming that you don't want to restrict the total ncpus that could be running or queued. IOW, UserA could submit three 168-ncpus jobs, and the three jobs would be eligible to run if there were enough resources available. If this is true, then my suggestion would be to introduce a queuejob event hook that validates that the total ncpus being requested does not exceed the maximum you want to set. In addition, the same logic would be used for the modifyjob event hook. This way the user would not be able to change the total ncpus to something larger after the job was successfully submitted.
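A per-job ncpus check of the kind suggested above might look roughly like this. It is a sketch under stated assumptions, not a tested production hook: MAX_NCPUS and the helper total_ncpus are hypothetical, the job is assumed to be submitted with a select statement, and a chunk that omits ncpus is counted as one CPU, which may not match your site's defaults. The pbs module only exists inside the PBS hook interpreter, so the import is guarded to let the helper be exercised on its own.

```python
try:
    import pbs  # provided by PBS Professional when the script runs as a hook
except ImportError:
    pbs = None

MAX_NCPUS = 168  # hypothetical per-job limit; choose your own value


def total_ncpus(select):
    """Sum ncpus over all chunks of a select string such as '2:ncpus=8+1:ncpus=4'."""
    total = 0
    for chunk in select.split("+"):
        parts = chunk.split(":")
        count = 1
        if parts[0].isdigit():
            # A leading integer is the number of identical chunks.
            count = int(parts[0])
            parts = parts[1:]
        ncpus = 1  # assumption: a chunk without ncpus uses one CPU
        for part in parts:
            key, _, value = part.partition("=")
            if key == "ncpus":
                ncpus = int(value)
        total += count * ncpus
    return total


if pbs is not None:
    # Same body for queuejob and modifyjob. In modifyjob, select is only
    # set when the user is changing it, hence the None check.
    e = pbs.event()
    select = e.job.Resource_List["select"]
    if select is not None and total_ncpus(str(select)) > MAX_NCPUS:
        e.reject("Requested ncpus exceeds the per-job limit of %d" % MAX_NCPUS)
    else:
        e.accept()
```

Jobs submitted with the older -l ncpus=N form would not carry a select value at queuejob time, so a real hook would likely need to check Resource_List["ncpus"] as well.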
  14. firly, Thanks for sharing the screen shots; this was helpful. WRT Abaqus jobs not submitting: It would appear that the abaqus command wraps our qsub command. I will not be able to help diagnose any issues with the command line arguments of abaqus; this will need to be pursued via Abaqus support, as they are providing you the interface. However, I can provide you some guidance on what I think might be happening. Are you the administrator of these systems AND did you install PBS Professional? You mention one server and 6 clients. I need to clarify terminology. Based on your screen shots, it looks like you have 1. ONE PBS Server/Scheduler [gttgr-abaqus]; I see this from the xpbs screenshot > server 2. ONE PBS MOM (execution node) [gttgr-abaqus]; I see this from the qmgr -c "list nodes @active" 3. SIX clients; I am assuming that you are referring to submission clients and not execution nodes. WRT BAD UID: The error message looks similar to what I would have expected our qsub command to provide, BUT it looks like the abaqus command is modifying the error message. Based on the description of the error message, I would assume that the PBS Professional administrator has not added the Admin user account to the acl_roots attribute of the PBS Server. I will reference Section 8.10 of the PBS Professional v12.2 Admin Guide that describes this: WRT error with queue: If you were submitting the job to a specified queue via PBS Professional's command, qsub, then you would have used the "-q workq" argument. Again, I don't know what the abaqus argument --queue requires in order to be valid. I see you specified "--queue workq:fsshared". The appended fsshared argument is unknown to me, so this must be something specific to Abaqus? WRT pbsnodes -a command not returning nodes: You received an error message that does not make sense to me. Using the -a argument is valid, and I would have expected the command to list all of the defined nodes within the PBS Server, similar to the output of qmgr -c "list nodes @default" or qmgr -c "list nodes @active". Can you confirm the absolute path of the pbsnodes command? I am wondering if Abaqus has wrapped more of the PBS Professional commands. WRT pbs_hostn command not returning expected output: Again, the pbs_hostn command is not returning the expected output. You have provided legitimate arguments, but the command is outputting the help/usage. Again, can you confirm the absolute path of the pbs_hostn command? It might make sense to take abaqus out of the equation and make sure that the PBS Professional installation and configuration is set up correctly.
  15. There is a job attribute that determines whether a job is rerunnable. See -r or rerunnable in the guides. If you want this job attribute set to "n" by default, you can define default_qsub_arguments on the server via qmgr: qmgr -c 'set server default_qsub_arguments = "-r n"' Or, you can create a hook to enforce the "-r n" for qsub (queuejob) and qalter (modifyjob). I am suggesting modifyjob because the user can change attributes of the job after submission.
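The hook alternative mentioned in the last post could be sketched as below. This is only an illustration: the helper name force_not_rerunnable is hypothetical, and it assumes the hook job object exposes the attribute as Rerunable (the spelling shown in qstat -f output); please confirm the name against the hooks guide for your PBS Professional version. As before, the pbs import is guarded because the module only exists inside the hook interpreter.

```python
try:
    import pbs  # provided by PBS Professional when the script runs as a hook
except ImportError:
    pbs = None


def force_not_rerunnable(job):
    """Mark the job as not rerunnable, the hook equivalent of `qsub -r n`."""
    # Assumption: the attribute is named Rerunable on the hook job object.
    job.Rerunable = False
    return job


if pbs is not None:
    # Intended for both queuejob (qsub) and modifyjob (qalter) events, so a
    # later qalter -r y is overridden in the same way.
    e = pbs.event()
    force_not_rerunnable(e.job)
    e.accept()
```

Compared with default_qsub_arguments, which a user can override on the command line, a hook applies the setting on every submission and modification.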