  • Announcements

    • admin

      PBS Forum Has Closed   06/12/17

      The PBS Works Support Forum is no longer active.  For PBS community-oriented questions and support, please join the discussion at http://community.pbspro.org.  Any new security advisories related to commercially-licensed products will be posted in the PBS User Area (https://secure.altair.com/UserArea/). 

scc

  1. group name issue in the job log

    The pbs_server sets group = -default- for a job if the system function getpwnam(userid) returns NULL. Was this user created after the pbs_server was last started, or was the user added to the group since the last server start? Something may also be caching old information; if nscd is in use, try restarting it. You can test getpwnam() with the following simple program:

        #include <errno.h>
        #include <pwd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/types.h>

        int main(int argc, char *argv[])
        {
            char command[100];
            int returnval;
            struct passwd *pwdp;

            if (argc != 2) {
                fprintf(stderr, "Usage: %s username\n", argv[0]);
                return 1;
            }

            /* getpwnam() only sets errno on error, so clear it first */
            errno = 0;
            pwdp = getpwnam(argv[1]);
            printf("errno: %s\n", strerror(errno));

            if (pwdp == NULL) {
                printf("No Password Entry for User %s\n", argv[1]);
            } else {
                printf("Password Entry for User FOUND for %s\n", argv[1]);
            }

            /* compare with what the name service switch returns */
            snprintf(command, sizeof(command), "getent passwd %s", argv[1]);
            returnval = system(command);
            return returnval == 0 ? 0 : 1;
        }
  2. Hello Jerry, if you mean a different signal for job suspension specifically (there are other preemption methods as well), then yes, have a look at "suspendsig" in the documentation.
  3. How to catch a qdel signal

    If you want to go about this from the job script itself, here is a *very* simple example of a signal handler for SIGTERM in a bash script:

        #!/bin/bash

        function file_transfer {
            echo "SIGTERM caught! Copying out files"
            sleep 30
            echo "DONE copying files"
        }

        trap file_transfer SIGTERM
        sleep 1000

    There is more to consider when using PBS Pro, though. When you do a qdel, this is what happens:

    For a single-vnode job:
      1. PBS sends the job a SIGTERM
      2. PBS waits for the amount of time specified in the kill_delay queue attribute
      3. PBS sends the job a SIGKILL

    For a multi-vnode job:
      1. Mother Superior sends a SIGTERM to all processes on the primary execution host
      2. If any of the processes of the top task of the job are still running, PBS waits a minimum of kill_delay seconds
      3. Mother Superior sends a SIGKILL to all remaining job processes on the primary execution host
      4. The subordinate MoMs send a SIGKILL to all their processes belonging to this job

    kill_delay is 10 seconds by default. So you have to set the queue attribute kill_delay to more than 30 seconds in the example above, or else the file_transfer function will be KILLed before it has finished. That is just an example, of course: in your actual script, where you'd put a real file copy command in place of the "sleep 30", kill_delay would need to be longer than the copy command can take.

    There are other options aside from a signal handler in the job script that could help you achieve more or less the same thing in PBS Pro (assuming you are an admin on this PBS Pro installation and could get these configured):

    1) Use an execjob_epilogue hook. This would be a Python hook that the pbs_mom runs for all jobs just after executing or killing a job, but before the job is cleaned up. The hook could be written to detect whether or not the job in question is an Ansys job and whether it exited normally or as a result of SIGTERM/KILL, and then perform the file copy if necessary.

    Perhaps less attractive but still worth mentioning at least in passing are:

    2) Utilize PBS Pro's stageout mechanism to copy all files from the job execution directory to some predetermined location.

    3) Use an action terminate script so that the pbs_mom runs a custom script in place of simply sending SIGTERM.

    If you'd like more details about any of these options please contact pbssupport@altair.com and we can work with you.

    Thanks. -Scott Campbell
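One detail worth knowing about shell trap handlers: the shell runs a trap only after the current foreground command returns, whereas a `wait` on a background process is interrupted as soon as the signal arrives. The following is a minimal sketch of that pattern and is not taken from the original post. `copy_results` and `run_job` are hypothetical names, the `sleep` stands in for the real workload, and the demo at the bottom delivers SIGTERM itself rather than relying on qdel.

```shell
#!/bin/bash
# Sketch only: copy_results and run_job are hypothetical names.

copy_results() {
    echo "SIGTERM caught, copying out files"
    # A real job would run cp or rsync here; it must finish within kill_delay.
    echo "done copying files"
}

run_job() {
    # $1: number of seconds the stand-in workload sleeps
    sleep "$1" &
    worker=$!
    # On SIGTERM: stop the workload, save results, exit with 128 + 15.
    trap 'kill "$worker" 2>/dev/null; copy_results; exit 143' TERM
    # wait returns as soon as a trapped signal arrives, so the handler runs at once.
    wait "$worker"
}

# Demo: run the job in a background subshell, then send it SIGTERM,
# which is the first thing a qdel would do.
( run_job 60 ) &
job=$!
sleep 1
kill -TERM "$job"
wait "$job"
echo "job exit status: $?"
```

In a real job script only the handler and the background-plus-wait pattern would remain, and kill_delay would still need to exceed the time the copy takes.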
  4. 4G download limit with Compute Manager

    If you run into this problem you can substitute the system zip command for the one that PAS provides. To do so, run these commands as root on the PAS host:

        . /etc/pas.conf
        mv $PAS_EXEC/bin/Linux-x86_64/zip $PAS_EXEC/bin/Linux-x86_64/zip.pas_orig
        ln -s $(which zip) $PAS_EXEC/bin/Linux-x86_64/zip

    No restart of PAS or CM is required.
  5. To add to what CJM said: For a job to be accepted by the PBS server, the user at the submitting host must pass an ruserok() test. From the RCMD(3) man page (check your local Linux distro documentation, it may differ from this):

        * The iruserok() and ruserok() functions take a remote host's IP address or name, respectively, two user names and a flag indicating whether the local user's name is that of the superuser. Then, if the user is NOT the superuser, it checks the /etc/hosts.equiv file. If that lookup is not done, or is unsuccessful, the .rhosts in the local user's home directory is checked to see if the request for service is allowed.

        If this file does not exist, is not a regular file, is owned by anyone other than the user or the superuser, or is writeable by anyone other than the owner, the check automatically fails. Zero is returned if the machine name is listed in the hosts.equiv file, or the host and remote user name are found in the .rhosts file; otherwise iruserok() and ruserok() return -1. If the local domain (as obtained from gethostname(2)) is the same as the remote domain, only the machine name need be specified.

    If the pbs_server attribute flatuid is set to true, this test is skipped and the job is accepted based on the submitting user's name alone (with obvious security implications, which can be tempered by using acl_host_enable and acl_hosts).

    Here is a test program to see if ruserok() passes for a given user and host:

        /*
         * Two use cases:
         * 1) User submitting job from remote host to server getting unexpected
         *    "Bad UID" message. That is, user doesn't have access when he
         *    thinks he should.
         * 2) User(s) can delete, etc. other users' jobs. That is, one user is
         *    able to act as what he thinks is a different user, server sees
         *    them as being equivalent.
         *
         * Build with "cc ruserok.c -o ruserok"
         *
         * Usage (run on the PBS server system):
         *     ruserok remote_host remote_user1 local_user2
         * where:
         *     remote_host:  the host from which the job is being submitted, or
         *                   where the PBS client command is issued
         *     remote_user1: the username of the user submitting the job, or
         *                   issuing the client command
         *     local_user2:  the username of the user remote_user1 is trying to
         *                   submit the job as, or owner of the job that
         *                   remote_user1 is trying to act on with the client
         *                   command
         *
         * In most cases both user names given will be the same, unless testing
         * to see if a job can be submitted as a different user with the
         * qsub -u option.
         */
        #include <errno.h>
        #include <netdb.h>   /* declares ruserok() on glibc */
        #include <stdio.h>
        #include <unistd.h>

        int main(int argc, char *argv[])
        {
            int rc;
            char hn[257];

            if (argc != 4) {
                fprintf(stderr, "Usage: %s remote_host remote_user1 local_user2\n", argv[0]);
                return 1;
            }
            if (gethostname(hn, 256) < 0) {
                perror("unable to get hostname");
                return 2;
            }
            hn[256] = '\0';
            printf("on local host %s, from remote host %s\n", hn, argv[1]);

            /* second argument 0: the local user is not the superuser */
            rc = ruserok(argv[1], 0, argv[2], argv[3]);
            if (rc == 0)
                printf("remote user %s is allowed access as local user %s\n", argv[2], argv[3]);
            else
                printf("remote user %s is denied access as local user %s\n", argv[2], argv[3]);
            return 0;
        }
  6. This will work from a Windows .bat file:

        @echo off
        for /f "delims=" %%a in ('qsub job1.pbs') do @set FIRST=%%a
        echo %FIRST%
        for /f "delims=" %%a in ('"qsub -W depend=afterany:%FIRST% job2.pbs"') do @set SECOND=%%a
        echo %SECOND%

    -Scott
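For comparison, the same chaining can be sketched for a Linux shell. This is an illustration and not part of the original answer. It relies only on qsub printing the new job identifier on standard output and on the `-W depend=afterany:` option used above. The `submit_chain` function and the `QSUB` override variable are hypothetical names, added so the sketch can be tried without a PBS installation.

```shell
#!/bin/bash
# Sketch only: submit_chain and QSUB are hypothetical names.
# QSUB can be pointed at a stand-in command for testing; it defaults to qsub.
QSUB=${QSUB:-qsub}

submit_chain() {
    # $1: first job script, $2: job script that must wait for the first one
    chain_first=$("$QSUB" "$1") || return 1
    # afterany: the second job may start once the first has ended, whatever its exit status
    chain_second=$("$QSUB" -W "depend=afterany:$chain_first" "$2") || return 1
    echo "$chain_first"
    echo "$chain_second"
}
```

Calling `submit_chain job1.pbs job2.pbs` would print the two job identifiers, one per line, mirroring the `echo %FIRST%` and `echo %SECOND%` lines of the batch file.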
  7. Not getting resources_used from array jobs

    Hi Mike, job array subjobs are designed to be "lighter weight" than individual batch jobs, and as such not all job attributes are retained by the server in memory nor ever even written to the internal PBS database (FYI, this is the reason that subjobs are restarted upon server restart). The resources_used values for the subjobs are recorded in the accounting log E record, though. I hope this helps! -Scott
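As a rough illustration of pulling those values out of the accounting log, here is a small sketch. It assumes the usual record layout of `timestamp;record_type;job_id;key=value key=value ...` and that the log files live under $PBS_HOME/server_priv/accounting; check both against your installation. `subjob_usage` is a hypothetical helper name, and the simple whitespace split would mis-handle any value that itself contains spaces.

```shell
#!/bin/bash
# Sketch only: subjob_usage is a hypothetical helper name.
# Prints every resources_used.* field from the E (end) records of one job or subjob.
subjob_usage() {
    # $1: job or subjob identifier exactly as it appears in the log
    # $2: accounting log file to search
    awk -F';' -v id="$1" '
        $2 == "E" && $3 == id {
            n = split($4, kv, " ")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^resources_used\./)
                    print kv[i]
        }' "$2"
}
```

It would be called with a subjob identifier and one day's accounting file, for example `subjob_usage '<subjob id>' "$PBS_HOME/server_priv/accounting/<date>"`.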
  8. PBS Pro 12 unable to kill via qdel

    Ok, so it looks like there are 2 open questions:

    1) Why did the job fail to launch properly? A job would normally be in substate 41 only for a short time while the job is in the process of launching on the execution hosts. In order to look further into this, we'd need to see the mom log from node n16 from June 22nd at ~17:57 (a few minutes' worth of messages before and after this time would be good to include for context).

    2) Why can't the job be deleted? It looks like the qdel -W force is being run by root@kanadlogin1. Is that a completely separate host from kanad.hpc.iiserb (where the pbs_server is running), or is it a different hostname/alias for the same host? It may be helpful to post the output of these commands as run by root on kanad.hpc.iiserb:

        qmgr -c "p s managers"
        pbs_hostn -v kanad
        pbs_hostn -v kanad.hpc.iiserb
        pbs_hostn -v kanadlogin1
  9. PBS Pro 12 unable to kill via qdel

    Hello Aniruddh, please post the output of "qstat -f 33600" as run by root on kanad so that the details of this job are shown. For the "unauthorized" error, were you running the qdel command as the user or as root, and on which host were you running the command? If you were running it as root on kanad, please check whether you are able to run this command without seeing an error: qmgr -c "s s scheduling=t". Thanks. -Scott
  10. This is very likely hostname resolution related. Please post the output of these commands as run on log05:

        pbs_hostn -v log05
        hostname
        pbs_hostn -v $(hostname)
        netstat -anp | grep pbs
        ifconfig

    We may need to see further pbs_hostn output depending on what is seen from the above. Thanks. -(The other) Scott
  11. $PBS_HOME/mom_priv/config

    Hmm, that file should certainly exist in any default PBS Pro installation. Please double check the value for $PBS_HOME in /etc/pbs.conf on the host on which you are looking, and also please make sure you are looking for the $PBS_HOME/mom_priv/config file on a host where the pbs_mom component was actually installed (that is, not on a host where you installed the "commands only" choice from the INSTALL script). If you still cannot find it, please contact pbssupport@altair.com and we'll check some things out with you. Thanks. -Scott
  12. How to alter the queue order of jobs

    Hi Jerry, the order in which jobs are run is heavily dependent on the scheduling policy in place. Using the qorder command or changing the job's priority attribute may or may not make a difference. I'd recommend forgetting about qorder, and let's just talk about job priority. In order for the job's priority attribute to make a difference it must be included as a job_sort_key in the $PBS_HOME/sched_priv/sched_config file:

        job_sort_key: "job_priority HIGH"

    Job ordering can be a rather complex topic since it depends not only on job_sort_key, but also on whether or not the scheduling formula is in use, whether or not fair share is in use, differing queue priorities, etc. The job_sort_key usually acts as a "tie breaker" for the previously mentioned scheduling features (that is, something like fair share picks the user whose job will run first, then job_sort_key picks which of his/her jobs runs first, though there may be other users' jobs with a higher priority attribute). The default sched_config will not take the priority attribute into account at all, though.

    Please try adding the above job_sort_key (as either the primary key or a secondary one, depending on what job_sort_key you may already be using; lower sort keys again act as "tie breakers" for higher ones), HUP the pbs_sched, and re-try your test. If you do not get the expected result, it is likely that there are other factors at play, so please contact pbssupport@altair.com and include your test procedure, sched_config, and output of qmgr -c "p s". Thanks. -Scott
  13. Please contact pbssupport@altair.com and include the output of these commands while jobs are queued:

        qstat -Bf
        qstat -ans
        pbsnodes -av

    Please also send a copy of the pbs_server log from the day you were having this problem ($PBS_HOME/server_priv/logs/<date>; $PBS_HOME is /var/spool/PBS by default, but the value can be found in /etc/pbs.conf). Thanks. -Scott
  14. Queue bound issue

    Hi Jerry, is your primary goal to explicitly tie specific nodes to specific queues, or to allow PBS to select node groups to run jobs on based on their network topology in predefined groups? You are attempting to use the "placement set" (AKA node_group_enable) feature, which was designed to do something like the latter rather than the former. If your goal is simply to configure PBS so that jobs in the debug queue run on a predefined set of 14 nodes, then I suggest that you use the resource-based method described in "4.8.2.2.i Procedure to Associate Vnodes with Multiple Queues" of the 12.2 PBS Pro Administrator's Guide (and set node_group_enable back to false in qmgr). If this is not actually your goal, please contact pbssupport@altair.com and we can discuss in more depth. As an aside, we recommend against using the "qsub -lnodes=..." style syntax if at all possible. It is provided for backwards compatibility only; "qsub -lselect=..." is the current and preferred method of job submission. Thanks. -Scott
  15. Hello Ingrid, please send details to pbssupport@altair.com (please include the qsub commands you are using and a copy of "pbsnodes -av" output) and we can work with you to understand what is going on. It sounds like you might simply need assistance understanding PBS Pro's "qsub -lselect=..." resource specification syntax which allows you to control things like how many ncpus are needed from each host vs. needing to place all of the ncpus requested on a single host. Thanks. -Scott