
begou

Members
  • Content count

    8
  1. why log files go to "undelivered" dir ?

    OK Adam, I will try the $usecp config. What I had not understood on page 268 was that I can use several instances of $usecp. This is not explained in the doc I have read, but maybe I missed a paragraph where it is detailed. Updating PBSPro could be a solution, maybe this summer when fewer people are working. These weeks we have many students and researchers running long CFD simulations and I can't stop the servers. Thanks for your detailed answer. Patrick
  2. why log files go to "undelivered" dir ?

    Hi Adam,

    I'm back after many tests! Your suggestion gave me a starting point for browsing the admin guide efficiently. However, as I have several independent paths (/home/user/* and /data/*), I could not find in the doc how to set $usecp for all of them. But I understand that setting PBS_SCP=/usr/bin/scp in /etc/pbs.conf tells PBS to use scp to copy these files. So I spent a lot of time setting up host-based authentication between the nodes, the Xeon front-end and the Itanium front-end. Now a user can ssh from any node to any node with this host-based authentication.

    But I still have jobs staying in the E state and, for a job running on node cl1n004, the hanging processes are:

        decaixj 4461 4353 0 15:25 ? 00:00:00 /usr/bin/scp -Brvp /var/spool/PBS/spool/13810.calcul9sv4.OU decaixj@cl1n003 /home/users/decaixj/calcul/Venturi_4/KLSST_c02_m042/KLSST_c02_m042.o13810
        decaixj 4462 4461 0 15:25 ? 00:00:00 /usr/bin/ssh -x -oForwardAgent no -oClearAllForwardings yes -oBatchmode yes -v -ldecaixj cl1n003 scp -v -r -p -t /home/users/decaixj/calcul/Venturi_4/KLSST_c02_m042/KLSST_c02_m042.o13810

    I think this job was submitted by a previous job running on node cl1n003 and was allocated to node cl1n004. And files still go to the undelivered directory.

    Patrick
  3. Hi,

    I've tried a search with the keyword "undelivered" on the forum but did not find anything. On Google, my search only gives me information on how to remove files from the "undelivered" PBS directory. But my question is: why do log files go into this undelivered directory? I did not find any help in the PBS Professional 9.1 admin guide either.

    My PBSPro is running on a mixed cluster (Itanium and Xeon) with queues for the Itanium part and queues for the Xeon part. If I launch a job for an Itanium queue from the Itanium host, I get the log file of the job in my directory. If I launch the same job for the same Itanium queue from the Xeon host, the log file goes into the "undelivered" dir on the Itanium host! The paths of all my disk areas are the same on the 2 hosts, and the disk is NFS-mounted on the Xeon from the Itanium.

    Any idea? I think it is related to the definition of the output path with this syntax:

        Output_Path = submission-host:/path/to/the/file

    but even if I use

        #PBS -o /absolute/path/to/the/out
        #PBS -e /absolute/path/to/the/err

    the submission hostname is added (as shown by "qstat -f jobid") and the log files go into the "undelivered" dir!

    Thanks for your help
    Patrick
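A rough way to picture the $usecp idea from this thread, for anyone reading later: each `$usecp` line maps a `host:prefix` pattern to a local prefix, and when an `Output_Path` matches one of them the file can be copied locally instead of going through scp. The sketch below is only a model of that matching, not PBS code. The two config lines, the `*` host wildcard and the function names `parse_usecp` and `local_destination` are assumptions made for illustration; check the admin guide for the exact syntax your version accepts.

```python
from fnmatch import fnmatch

# Hypothetical mom_priv/config content: one $usecp line per path prefix.
# The "*" host pattern and these prefixes are assumptions for illustration.
MOM_CONFIG = """\
$usecp *:/home/ /home/
$usecp *:/data/ /data/
"""


def parse_usecp(config_text):
    """Return (host_pattern, source_prefix, local_prefix) for each $usecp line."""
    rules = []
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "$usecp":
            host_pattern, _, source_prefix = parts[1].partition(":")
            rules.append((host_pattern, source_prefix, parts[2]))
    return rules


def local_destination(output_path, rules):
    """Map 'host:/path' to a local path if a rule matches, else None.

    None stands for "no rule matched", i.e. a remote copy would be needed.
    """
    host, _, path = output_path.partition(":")
    for host_pattern, source_prefix, local_prefix in rules:
        if fnmatch(host, host_pattern) and path.startswith(source_prefix):
            return local_prefix + path[len(source_prefix):]
    return None


if __name__ == "__main__":
    rules = parse_usecp(MOM_CONFIG)
    print(local_destination("calcul9sv3:/home/users/begou/mytest.o1", rules))
    print(local_destination("calcul9sv3:/tmp/mytest.o1", rules))
```

With the two rules above, a path under /home/ or /data/ resolves to a local path whatever the submission host is, while anything else falls through to the remote-copy case.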
  4. Pbs_mom is down, how to restart safely ?

    Thanks Adam for your quick answer!

    I'm using PBSPro_9.1.0.72982 and my administrator guide is only the 9.1 version, so it does not have this section 6.4.9. But I have read the man page. Let me check what you suggest, to be sure that I understand it. At this time I have 5 jobs depending on pbs_mom:

        ├─pbs_mom─┬─bash───bash───bash───cavka_jan_2010
        │         ├─csh───7369.calcul9sv4───solverbaro_amin
        │         ├─csh───7374.calcul9sv4───solverbaro_amin
        │         ├─csh───7386.calcul9sv4───solverbaro_amin
        │         └─csh───7399.calcul9sv4───solverbaro_amin

    You mean I can kill the pbs_mom process without killing the jobs, and that restarting with pbs_mom -p will keep the jobs running and detach automatically (without the -N option) from the terminal? And restarting pbs_mom without options from /etc/init.d will kill the jobs (as I did previously).

    As I killed several users' jobs some days ago, I must be very careful....

    Patrick
  5. Hi,

    On one node of my cluster, pbs_mom no longer communicates with the server located on another node. From the PBS server:

        Qmgr: list node cl1n005
        ....
        state = down
        ....
        resources_available.ncpus = 8
        resources_available.vnode = cl1n005
        resources_assigned.ncpus = 6
        ....

    On this cl1n005 node I have 6 sequential jobs running, as shown by the resources_assigned.ncpus attribute and confirmed by a top command on the node.

    Some days ago I had the same problem on this node and I restarted PBS:

        /etc/init.d/pbs restart

    But this command killed all the running jobs on this node! Is there a way to "refresh/restart" the pbs_mom on my node without killing all the jobs? A kill -SIGHUP command does not solve this problem.

    Thanks for your advice.
    Patrick
  6. Jobs always start on the front-end

    Thanks Adam for your suggestion. I've set priorities on my nodes with qmgr and my last test was launched on a node successfully! I have to check the behavior of PBS with this new config over the next few days. I used "qmgr" and:

        set node cl1n006 priority=126
        ....
        set node cl1n001 priority=121
        set node calcul9sv3 priority=1

    Is the "priority" value a 4-byte integer or a single-byte one? Just to know how high the priority can be. It is not detailed in the admin guide (the example was 123, so I limited my choices to a 1-byte signed integer).

    I wasn't clear when speaking of the "front-end". My front-end calcul9sv3 runs neither the scheduler nor the server. I call it the front-end because this host allows interactive sessions for running tests, post-processing, compilations... and batch submission. So it is important that its load remains minimal.

    Can you tell me why the parameters

        $ideal_load 0.2
        $max_load 8.0

    are not used by PBS with my config? I would like to avoid too much load on this front-end, even if the nodes are overloaded. Maybe I should restrict resources_available.ncpus to 4 or 6, to keep at least 2 CPUs available for interactive sessions?

    Patrick
  7. Jobs always start on the front-end

    I'm using PBSPro_9.1.0.72982. My job submission file is very simple, just selecting the architecture. This is how I submit a test. The application /home/users/begou/sleep60 is a script calling the Unix command "sleep 60" to let me see where the batch job runs on the cluster.

        #PBS -N mytest
        #PBS -j oe
        #PBS -l cput=1000
        #PBS -l select=1:ncpus=1
        #PBS -l place=free:shared
        #PBS -q xeon
        /home/users/begou/sleep60

    The xeon queue is a routing queue. The jobs are dispatched to 2 execution queues:
      - seq-xeon for sequential jobs
      - par-xeon for parallel jobs

    In the scheduler config I have:

        node_sort_key: "sort_priority HIGH" ALL
        sort_queues: true ALL
        resources: "ncpus, mem, arch, host, vnode, qlist"
        load_balancing: false ALL
        #load_balancing: true ALL <<== hangs the Itanium queues
        smp_cluster_dist: round_robin

    Maybe I should set smp_cluster_dist to lowest_load? But that would not address my need to set a lower priority for the front-end, so that it only accepts jobs when all the cluster nodes are overloaded. The front-end and all my nodes (except the Itanium part) have the same number of CPUs and the same memory config.

    This morning I launched my test, and this is the resulting PBS placement.

    The X86_64 front-end runs 7 sequential jobs:

        exec_vnode = (calcul9sv3:ncpus=1)   Job_Owner = aumelas@calcul9sv3
        exec_vnode = (calcul9sv3:ncpus=1)   Job_Owner = begou@calcul9sv3 (my job)
        exec_vnode = (calcul9sv3:ncpus=1)   Job_Owner = decaixj@calcul9sv3
        exec_vnode = (calcul9sv3:ncpus=1)   Job_Owner = decaixj@calcul9sv3
        exec_vnode = (calcul9sv3:ncpus=1)   Job_Owner = goncalves@cl1n002
        exec_vnode = (calcul9sv3:ncpus=1)   Job_Owner = pellone@calcul9sv3
        exec_vnode = (calcul9sv3:ncpus=1)   Job_Owner = pellone@calcul9sv3

    These are on the Itanium part:

        exec_vnode = (calcul9sv4[0]:ncpus=3+calcul9sv4[1]:ncpus=4+calcul9sv4[2]:ncpus=1)   Job_Owner = balarac@calcul9sv4
        exec_vnode = (calcul9sv4[4]:ncpus=4+calcul9sv4[5]:ncpus=4)   Job_Owner = balarac@calcul9sv4

    These are the X86_64 nodes. Nodes 1 to 4 only run one sequential job each while the front-end runs 7!!!!

        exec_vnode = (cl1n001:ncpus=1)   Job_Owner = aumelas@calcul9sv3
        exec_vnode = (cl1n002:ncpus=1)   Job_Owner = pellone@calcul9sv3
        exec_vnode = (cl1n003:ncpus=1)   Job_Owner = pellone@calcul9sv3
        exec_vnode = (cl1n004:ncpus=1)   Job_Owner = aumelas@calcul9sv3
        exec_vnode = (cl1n005:ncpus=1)   Job_Owner = aumelas@calcul9sv3
        exec_vnode = (cl1n005:ncpus=1)   Job_Owner = larroude@calcul9sv3
        exec_vnode = (cl1n006:ncpus=1)   Job_Owner = aumelas@calcul9sv3
        exec_vnode = (cl1n006:ncpus=6)   Job_Owner = duprat@calcul9sv3

    Thanks for your help Adam.
    Patrick
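Counting a placement like the one in the post above by hand gets tedious, so here is a minimal sketch that tallies allocated CPUs per host from `exec_vnode` lines. It assumes the lines look like the `qstat -f` excerpts quoted in the post. The sample text, the regular expression and the function name `cpus_per_host` are illustrative and not part of PBS. Vnode indices such as `calcul9sv4[0]` are folded into their host name.

```python
import re
from collections import Counter

# A few exec_vnode lines in the shape quoted in the post (sample data only).
QSTAT_EXCERPT = """\
exec_vnode = (calcul9sv3:ncpus=1)
exec_vnode = (calcul9sv3:ncpus=1)
exec_vnode = (cl1n006:ncpus=6)
exec_vnode = (calcul9sv4[0]:ncpus=3+calcul9sv4[1]:ncpus=4+calcul9sv4[2]:ncpus=1)
"""

# One chunk is "<host>[optional vnode index]:ncpus=<n>".
CHUNK = re.compile(r"([A-Za-z0-9_.-]+)(?:\[\d+\])?:ncpus=(\d+)")


def cpus_per_host(text):
    """Sum the ncpus values of every exec_vnode chunk, grouped by host name."""
    totals = Counter()
    for line in text.splitlines():
        if "exec_vnode" not in line:
            continue
        for host, ncpus in CHUNK.findall(line):
            totals[host] += int(ncpus)
    return totals


if __name__ == "__main__":
    for host, ncpus in sorted(cpus_per_host(QSTAT_EXCERPT).items()):
        print(host, ncpus)
```

Feeding it the full `qstat -f` output instead of the sample string should show at a glance whether the front-end is carrying more CPUs than the compute nodes.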
  8. Hello,

    This is my first post on the forum. I'm using PBSPro for scheduling jobs on an Altix450 (ia64) and a cluster of Xeons (6 dual-Xeon quad-core nodes + 1 front-end). The PBS config was done by SGI but is not documented :-(

    My problem is that x86_64 jobs frequently start on the cluster front-end, even if some nodes are idle. I found some parameters in the "PBS administrator guide" but they do not change this behavior.

    In the mom config file /var/spool/PBS/mom_priv/config of the cluster front-end I've set:

        $ideal_load 0.2
        $max_load 8.0

    In the mom config file /var/spool/PBS/mom_priv/config of the cluster nodes I've set:

        $ideal_load 4.0
        $max_load 8.5

    In the mom config file /var/spool/PBS/mom_priv/config of the Altix450 (28 cores) I've set:

        $ideal_load 28
        $max_load 30

    but this doesn't change the scheduling of the jobs (I've sent a kill -SIGHUP to every mom). They still start on the front-end even if some nodes are idle.

    Then I tried to modify the scheduler config, as it seems to be required too, and I've set in the /var/spool/PBS/sched_priv/sched_config file:

        load_balancing: true ALL

    But when I set this parameter, jobs still start on the front-end and no jobs start on the Itanium anymore (mom is the cpuset version on the Altix450). The message is "no ressources available".

    I can stop mom on the front-end, but I want to configure the front-end to run some jobs when the X86_64 nodes are overloaded.

    Thanks for your help and advice.
    Patrick
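For readers unsure what the `$ideal_load` / `$max_load` pair is meant to do, the sketch below models it as a simple hysteresis. It assumes a node is flagged busy once its load average rises above `$max_load` and is cleared only when the load falls back below `$ideal_load`. That is a reading of the admin guide, not PBS source code, and the function names `update_busy` and `trace_states` are made up for this example. It does not model how the scheduler then uses the busy flag.

```python
def update_busy(busy, load, ideal_load, max_load):
    """Return the new busy flag for one load-average sample.

    Assumed model: above max_load the node becomes busy, below ideal_load
    it stops being busy, and in between it keeps its previous state.
    """
    if load > max_load:
        return True
    if load < ideal_load:
        return False
    return busy


def trace_states(loads, ideal_load, max_load):
    """Apply update_busy over a sequence of load samples, starting not busy."""
    states = []
    busy = False
    for load in loads:
        busy = update_busy(busy, load, ideal_load, max_load)
        states.append(busy)
    return states


if __name__ == "__main__":
    # Front-end thresholds quoted in the post: $ideal_load 0.2, $max_load 8.0
    samples = [0.1, 5.0, 8.5, 5.0, 0.1]
    print(trace_states(samples, ideal_load=0.2, max_load=8.0))
```

Under this model the front-end thresholds of 0.2 and 8.0 would only flag the host as busy after its load exceeds 8.0, which suggests why a low `$ideal_load` alone would not keep jobs away from it.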