
Steve Gombosi (Altair)


About Steve Gombosi (Altair)

  • Rank
    Member

Profile Information

  • Location
    Boulder, CO, USA
  1. how to set up stack-size to unlimited in PBS pro

    Zhifa, You could have just emailed ;-). That limits file is invoked during MoM startup, and the limits defined there will apply to MoM and her children. Just edit that file to set ulimit -s unlimited and restart the MoM. Steve
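    In other words, roughly the following (the limits file path depends on your installation; the init script shown is the usual default and may differ on your system):

        # In the limits file that's sourced at MoM startup, add:
        ulimit -s unlimited

        # Then restart the MoM on that execution host:
        /etc/init.d/pbs restart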
  2. Exclude (or include) specific nodes in PBS Pro

    Well, the ideal solution would be to get your friendly neighborhood system admin to set the built-in "software" resource on the nodes that have Python on them (because that's what that resource is there for) and configure the scheduler to take that resource into consideration for scheduling purposes. Then, all you'd have to do is something like this: qsub -lselect=1:software=python .... and the scheduler would take care of the rest for you. This takes exactly one qmgr command and a one-line change to the scheduler configuration. But, as you said, you don't have the requisite permission to do that.
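    For reference, here is roughly what that admin-side change looks like (hypothetical vnode name; PBS_HOME is assumed to be the common default of /var/spool/pbs):

        # One qmgr command per vnode that has Python installed:
        qmgr -c "set node good_node1 resources_available.software = python"

        # One-line change: add "software" to the "resources:" line in
        # /var/spool/pbs/sched_priv/sched_config, then have the scheduler reread it:
        kill -HUP $(cat /var/spool/pbs/sched_priv/sched.lock)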
  3. Exclude (or include) specific nodes in PBS Pro

    Hi Jerry, PBS Professional does not currently support using logical operators in resource specifiers. This is something we're working on for a future release. Since you don't have administrative privileges on the machine, you can't configure PBS to identify vnodes that have Python on them. That's too bad, because then you could just request a vnode with Python and PBS could automatically schedule you there. As it is, you're going to have to request a specific vnode or host for each job, e.g.: qsub -l select=1:ncpus=1:host=good_node1 Steve
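    If you want a single job to span more than one specific host, you can chain chunks in the select statement (hypothetical node names and job script):

        qsub -l select=1:ncpus=1:host=good_node1+1:ncpus=1:host=good_node2 myjob.sh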
  4. Job submission from NATed machine

    It would be helpful to have a server log in this case. For security, PBS does a forward/backward name resolution check to ensure that it's not being spoofed. It's possible that your NAT configuration is interfering with that, since this requires that the name of the submitting host be resolvable from the server. The interface library also invokes an authentication program (pbs_iff) to authenticate itself to the server when it issues the "connect" request. Pbs_iff must be setuid root, since it has to open privileged ports to authenticate.
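    Two quick checks along those lines (the hostname, IP address, and install path shown are only examples; pbs_iff lives under your PBS_EXEC/sbin directory):

        # From the server host, confirm forward and reverse resolution of the submission host:
        getent hosts submit_host.example.com
        getent hosts 192.0.2.10

        # On the submission host, confirm pbs_iff is setuid root:
        ls -l /opt/pbs/sbin/pbs_iff
        # expect the setuid bit: -rwsr-xr-x 1 root root ... pbs_iff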
  5. All q commands hung up on one of nodes but works fine on others

    Jerry, Whether or not you have failover configured, restarting the server should not cause job loss. Anything that's running on a compute node should continue to run. Jobs, queues, and nodes are stored in the PBS database and should be recovered on restart.

    The settings in pbs.conf are shell environment variables. They're only read and placed in the environment of the PBS daemons on startup, so a kill -HUP won't cause them to be reprocessed. If you want to implement failover, you have to restart PBS to do it so the PBS_PRIMARY and PBS_SECONDARY variables will be set in the environment. You'll have to change pbs.conf on all the MoM nodes as well, so that they'll recognize the secondary server.

    Speaking of which: when you say that PBS_HOME is shared on GPFS, I hope that's not true for the individual MoMs. Each MoM needs her own PBS_HOME.

    Can you determine what event the pbs_iff processes are sleeping on? Steve
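    For reference, the failover-related settings in /etc/pbs.conf look like this (hypothetical hostnames), and the last question can usually be answered with ps on the affected node:

        # Failover entries in /etc/pbs.conf, on the server hosts and on every MoM node:
        PBS_PRIMARY=server1.example.com
        PBS_SECONDARY=server2.example.com

        # Show what the stuck pbs_iff processes are sleeping on (Linux, wchan column):
        ps -o pid,stat,wchan:32,args -C pbs_iff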
  6. Queue Chunks/Nodes

    Bryan, Alexis's advice for handling hyperthreaded machines is excellent. He's our senior support person in Europe.

    To answer the broader question about limiting nodes and chunks - when we virtualized the concept of "nodes" by introducing "select" and vnodes back in PBS Pro 7.x, we made limiting physical nodes rather difficult to do (;-)). There is a deprecated resource, "nodect", which used to be the number of physical nodes a job requested. Since users no longer request physical nodes but rather "chunks" of resources with "-l select", there's no longer a one-to-one correspondence between resource chunks and physical nodes. As a result, the nodect resource currently contains the number of chunks requested by a job.

    Since nodect is deprecated (and has been for quite some time), it may be removed in a future release. As long as it's still around, you can limit it the way you would limit any other resource. This may not end up doing exactly what you want, since users may not request a chunk size that maps directly to a physical node (you can always fix this for them in a submit hook).

    By the way, it looks like you have a valid support contract, so you can always contact us directly for help (via the support number in the front of all our documentation or the pbssupport@altair.com email alias). That's what we're here for.
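    For example, a per-job limit on nodect can be set on a queue like any other resources_max limit (hypothetical queue name and value):

        qmgr -c "set queue workq resources_max.nodect = 4"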
  7. Job held, too many failed attempts to run

    I suspected that was cn10. Could I please see the MOM log for the same time period from cn10?
  8. Job held, too many failed attempts to run

    Halil, What version of PBS are you running? The format of the job_start_error message is somewhat different in recent versions of PBS Professional. I suspect that 15003 is not the error code. 15003 is the port number typically used for MOM communication. What hostname corresponds to IP address 10.128.1.10?
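    A quick way to check that reverse mapping on the server host:

        getent hosts 10.128.1.10    # shows the hostname the system resolves for that address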
  9. Job held, too many failed attempts to run

    PBS maintains a "hop count" of the number of times it has attempted to run a job. There is an internal maximum hop count - when the number of attempts exceeds this value, the job is placed in a hold state because we assume there's something either in the job or in the system configuration that is preventing it from running. This maximum hop count is currently an internal constant - we've had some internal discussions about making it site-configurable.

    The job was probably rejected by the MOM daemon (the execution control daemon that runs on each execution node in the cluster). Your best bet to diagnose the cause would be to look in the MOM log on the primary execution host for the job (the first node appearing in the exec_vnode attribute of the job). The log can be found in the $PBS_HOME/mom_logs directory on this node (each MOM maintains its own log on the host it's running on). The individual log files are named after the date, in the format YYYYMMDD - for example, the log for Jan 1, 2012 would be named 20120101.

    Steve Gombosi
    Sr. Application Engineer
    Altair Engineering, Inc.
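    Putting that together as commands (the job id and the date in the log file name are just examples):

        # Find the primary execution host for the job:
        qstat -f 1234 | grep exec_vnode

        # On that host, read the MOM log for the day in question
        # ($PBS_HOME is defined in /etc/pbs.conf, often /var/spool/pbs):
        less $PBS_HOME/mom_logs/20120101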
  10. collect usage statistics of jobs

    Rodrigo, You might want to check out the pbs-report command to see if it will generate the sort of report you're looking for.

    Steve Gombosi
    Sr. Application Engineer
    Altair Engineering, Inc.
  11. newbie needs help

  12. newbie needs help

    Yes, as long as a PBS MOM (the PBS daemon that supervises execution of user jobs) is running on it. I'm assuming this is a Linux system, and it sounds like you're using xpbsmon to check the status - is that correct? Could I get you to check a couple of things?

    1) Is there a process named "pbs_mom" running on machine 1? (Do a "ps -ef | grep pbs_mom" from the Linux command line to check this.)

    2) If not, what does your PBS configuration file (/etc/pbs.conf) look like? This file controls which daemons are brought up when PBS starts on a machine (among other things). The line of particular interest here is the one that looks like this: PBS_START_MOM=1. If the value here is "1", PBS will attempt to start the MOM when it initializes; if it's "0", MOM will not be started.

    Steve
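    If the MOM turns out not to be running, the pbs.conf check and the restart look like this (the init script path is the usual default, but may differ on your system):

        grep PBS_START_MOM /etc/pbs.conf   # should show PBS_START_MOM=1
        /etc/init.d/pbs start              # start the PBS daemons on machine 1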