
TUNGXP

Members
  • Content count: 16
  • Rank: Member
  • Gender: Not Telling
  1. Dear Altair support team,

     I have a problem: when I submit a job to run bwa on two nodes (compute-0-0 and compute-0-1), it only runs on one node. This is my pbs_submit.qsub script:

         #PBS -N bwa_program
         #PBS -q workq
         #PBS -l select=2:ncpus=10
         #PBS -l place=scatter
         cd $PBS_O_WORKDIR
         # run the program
         module load pbs_module
         /data/bwa/bwa mem -t 20 Build37.1/homo_full.fa test1.fastq test2.fastq > test_homo.sa

     I use this command to submit the job:

         qsub pbs_submit.qsub

     The job does run, but only on compute-0-0, with 20 CPUs (I checked by using the top command on each compute node). Here is the output of pbsnodes -a:

         compute-0-0
              Mom = compute-0-0.local
              Port = 15002
              pbs_version = PBSPro_13.0.1.152391
              ntype = PBS
              state = free
              pcpus = 20
              jobs = 19.hpc.igr.ac.vn/0, 19.hpc.igr.ac.vn/1, 19.hpc.igr.ac.vn/2, 19.hpc.igr.ac.vn/3, 19.hpc.igr.ac.vn/4, 19.hpc.igr.ac.vn/5, 19.hpc.igr.ac.vn/6, 19.hpc.igr.ac.vn/7, 19.hpc.igr.ac.vn/8, 19.hpc.igr.ac.vn/9
              resources_available.arch = linux
              resources_available.host = compute-0-0
              resources_available.mem = 66059224kb
              resources_available.ncpus = 20
              resources_available.vnode = compute-0-0
              resources_assigned.accelerator_memory = 0kb
              resources_assigned.mem = 0kb
              resources_assigned.naccelerators = 0
              resources_assigned.ncpus = 10
              resources_assigned.netwins = 0
              resources_assigned.vmem = 0kb
              resv_enable = True
              sharing = default_shared

         compute-0-1
              Mom = compute-0-1.local
              Port = 15002
              pbs_version = PBSPro_13.0.1.152391
              ntype = PBS
              state = free
              pcpus = 20
              jobs = 19.hpc.igr.ac.vn/0, 19.hpc.igr.ac.vn/1, 19.hpc.igr.ac.vn/2, 19.hpc.igr.ac.vn/3, 19.hpc.igr.ac.vn/4, 19.hpc.igr.ac.vn/5, 19.hpc.igr.ac.vn/6, 19.hpc.igr.ac.vn/7, 19.hpc.igr.ac.vn/8, 19.hpc.igr.ac.vn/9
              resources_available.arch = linux
              resources_available.host = compute-0-1
              resources_available.mem = 66059224kb
              resources_available.ncpus = 20
              resources_available.vnode = compute-0-1
              resources_assigned.accelerator_memory = 0kb
              resources_assigned.mem = 0kb
              resources_assigned.naccelerators = 0
              resources_assigned.ncpus = 10
              resources_assigned.netwins = 0
              resources_assigned.vmem = 0kb
              resv_enable = True
              sharing = default_shared

     The two attached pictures show the top command output while the program runs, on compute-0-0 and on compute-0-1.
     Even when I run with "select=2:ncpus=10" and "bwa mem -t 40", it only runs on compute-0-0. Here are the MOM log files from each compute node while the program runs (job id: 20):

     compute-0-0:

         11/05/2015 00:26:05;0080;pbs_mom;Job;19.hpc.igr.ac.vn;copy file request received
         11/05/2015 00:26:05;0100;pbs_mom;Job;19.hpc.igr.ac.vn;staged 2 items out over 0:00:00
         11/05/2015 00:26:05;0008;pbs_mom;Job;19.hpc.igr.ac.vn;no active tasks
         11/05/2015 00:26:05;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.10:15001, sock=1
         11/05/2015 00:26:05;0080;pbs_mom;Job;19.hpc.igr.ac.vn;delete job request received
         11/05/2015 00:26:05;0008;pbs_mom;Job;19.hpc.igr.ac.vn;kill_job
         11/05/2015 00:30:49;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.10:15001, sock=1
         11/05/2015 00:30:49;0100;pbs_mom;Req;;Type 3 request received from root@192.168.1.10:15001, sock=1
         11/05/2015 00:30:50;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.10:15001, sock=1
         11/05/2015 00:30:50;0008;pbs_mom;Job;20.hpc.igr.ac.vn;Type 5 request received from root@192.168.1.10:15001, sock=1
         11/05/2015 00:30:50;0008;pbs_mom;Job;20.hpc.igr.ac.vn;Started, pid = 31959

     compute-0-1:

         11/05/2015 00:12:05;0080;pbs_mom;Job;18.hpc.igr.ac.vn;delete job request received
         11/05/2015 00:12:05;0008;pbs_mom;Job;18.hpc.igr.ac.vn;kill_job
         11/05/2015 00:23:42;0008;pbs_mom;Job;19.hpc.igr.ac.vn;JOIN_JOB as node 1
         11/05/2015 00:27:40;0008;pbs_mom;Job;19.hpc.igr.ac.vn;KILL_JOB received
         11/05/2015 00:27:40;0008;pbs_mom;Job;19.hpc.igr.ac.vn;kill_job
         11/05/2015 00:27:40;0008;pbs_mom;Job;19.hpc.igr.ac.vn;DELETE_JOB received
         11/05/2015 00:27:40;0008;pbs_mom;Job;19.hpc.igr.ac.vn;kill_job
         11/05/2015 00:32:25;0008;pbs_mom;Job;20.hpc.igr.ac.vn;JOIN_JOB as node 1

     Why doesn't it start on compute-0-1, but only on compute-0-0? Thank you.
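     For context: a plain command in a PBS job script executes only on the primary execution host. The compute-0-1 log above shows the second chunk joining the job (JOIN_JOB as node 1) but nothing being launched there, because bwa is a multithreaded single-process program and one process cannot span hosts. A minimal sketch of driving both nodes with pbsdsh follows; the split input file names are illustrative assumptions, not from the post:

         #PBS -N bwa_program
         #PBS -q workq
         #PBS -l select=2:ncpus=10
         #PBS -l place=scatter
         cd $PBS_O_WORKDIR
         # bwa is not MPI-aware, so one process per node must be launched explicitly.
         # pbsdsh -n N runs a command on the N-th allocated vnode; here each bwa
         # works on its own (pre-split, hypothetical) chunk of the reads.
         pbsdsh -n 0 /data/bwa/bwa mem -t 10 Build37.1/homo_full.fa chunk1_1.fastq chunk1_2.fastq > chunk1.sam &
         pbsdsh -n 1 /data/bwa/bwa mem -t 10 Build37.1/homo_full.fa chunk2_1.fastq chunk2_2.fastq > chunk2.sam &
         wait   # block until both remote bwa processes finish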
  2. I still have many free CPUs, but when I submit more than three jobs to each queue, the extra jobs cannot run. Here is my server configuration: (I only use the first two queues, "workq" and "default".) Here is the output of qstat: Why does the last job have the "H" state? The comment on the last job is "job held, too many failed attempts to run". There are many free resources, so why can the jobs not run, and how do I resolve that? I'm using PBS Pro 12. Can anyone help me? Thank you.
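     For reference, a job held with that comment can usually be diagnosed with standard PBS commands before being released; a minimal sketch (the job id 23 is illustrative):

         qstat -f 23 | grep -E 'comment|Hold_Types'   # confirm why the job was held
         tracejob 23                                  # server/scheduler/MOM history: look for the failed run attempts
         qrls -h s 23                                 # release the system hold once the underlying cause is fixed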
  3. Dear Scott, I have just read in the PBS Pro 12 User Guide that I can't submit a job whose name is more than 15 characters long. How can I increase the maximum length of the job name (the -N option), perhaps to 30 characters? Thank you.
  4. I have just tried this command on the cluster head node and on a compute node; it returns the same value.

     On the cluster:

         [root@cluster ~]# id testuser
         uid=501(testuser) gid=100(users) groups=100(users)

     On the compute node:

         [root@compute-0-24 ~]# id testuser
         uid=501(testuser) gid=100(users) groups=100(users)

     It is still not working.
  5. Hi, how can I use qsub on each compute node? For example, when I run this command on the master, it works fine:

         [testuser@cluster ~]$ qsub my_job.qsub

     But when I change to a compute node:

         [testuser@cluster ~]$ ssh compute-0-0
         [testuser@compute-0-0 ~]$ qsub my_job.qsub
         qsub: Bad UID for job execution
         [testuser@compute-0-0 ~]$

     How can I solve that? Thank you,
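     For what it's worth, "qsub: Bad UID for job execution" generally means the submitting user is not authorized to submit from that host. In PBS Pro this is commonly addressed either with the server's flatuid attribute or with per-host authorization; a sketch (run on the PBS server host with manager privilege, and verify against your version's admin guide):

         qmgr -c "set server flatuid = True"   # treat identical usernames on all hosts as the same user
         # alternatively, authorize submission hosts via /etc/hosts.equiv or the user's ~/.rhosts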
  6. Thank you, we'll reconfigure the PBS server following your guidance. I've seen this topic: http://forum.pbsworks.com/index.php?/topic/94-pbs-mom-is-down-how-to-restart-safely/ So I wonder: should the MOM be stopped with SIGKILL or SIGINT? I had stopped the MOM with the command "/etc/init.d/pbs stop", then started it as in the topic above with "pbs_mom -p", and the job on that node was suddenly killed. Please show me how to restart the MOM so that its jobs are recovered when the node dies.
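     For reference, a sketch of stopping and restarting a MOM without killing its running jobs (the lock-file path and exact behavior should be verified against your PBS version's admin guide):

         # find the MOM's pid (PBS_HOME is often /var/spool/PBS)
         pid=$(cat /var/spool/PBS/mom_priv/mom.lock)
         kill -INT "$pid"   # SIGINT stops the daemon but leaves job processes running
         pbs_mom -p         # restart, preserving (reattaching to) the running jobs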
  7. Thank you for your support. We had to reinstall all compute nodes. After that, we realized the errors were caused by memory overload: we had to set the stack size to unlimited in /etc/security/limits.conf for all users, and even had to set "ulimit -s unlimited" in the qsub script before running the program. We still have a problem when using PBS. When we submit a job with 4 nodes (e.g. compute-0-0 + compute-0-1 + compute-0-2 + compute-0-3), the job runs normally for about 10 hours, then suddenly one of the nodes dies (compute-0-3 died!). We reboot compute-0-3 and the node returns to normal working order. But when compute-0-3 died, PBS immediately and automatically switched to another live node (e.g. compute-0-4), so the job then ran on compute-0-0 + compute-0-1 + compute-0-2 + compute-0-4. However, the job restarted from the beginning, losing the previous 10 hours. How can we configure it to continue running rather than restart from the beginning? (Maybe by disabling the automatic node switch, or something like that...) Thank you,
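     For reference, whether PBS requeues and reruns a job after a node failure is controlled by the job's rerunnable flag; marking the job not rerunnable disables the automatic restart. (Actually resuming from hour 10 would additionally require checkpoint/restart support in the application itself, which PBS cannot supply.) A sketch:

         #PBS -r n                # in the script: do not rerun this job if its node fails
         qsub -r n program.qsub   # or equivalently on the command line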
  8. Dear Scott Suchyta, I am very thankful for your earlier help. I still have a problem while using PBS Pro (though perhaps it is no fault of PBS). Our system uses the mvapich compiler over InfiniBand. This is my program.qsub file:

         #PBS -N wrf_program
         #PBS -q default
         #PBS -l select=8:ncpus=16
         #PBS -l place=scatter
         cd $PBS_O_WORKDIR
         # run the program
         module unload rocks-openmpi
         module load rocks-openmpi-mvapich
         which mpirun
         mpirun -np 128 -hostfile $PBS_NODEFILE wrf.exe
         exit 0

     In that file we use 8 nodes with 16 CPUs per node, and we run it with:

         cluster@testuser> qsub program.qsub

     It works fine! But when I change my program.qsub file to 4 nodes:

         #PBS -N wrf_program
         #PBS -q default
         #PBS -l select=4:ncpus=16
         #PBS -l place=scatter
         cd $PBS_O_WORKDIR
         # run the program
         module unload rocks-openmpi
         module load rocks-openmpi-mvapich
         which mpirun
         mpirun -np 4 -hostfile $PBS_NODEFILE wrf.exe
         exit 0

     it dies with this log:

         [proxy:0:1@compute-0-5.local] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:913): assert (!closed) failed
         [proxy:0:1@compute-0-5.local] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
         [proxy:0:1@compute-0-5.local] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
         [mpiexec@compute-0-4.local] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
         [mpiexec@compute-0-4.local] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
         [mpiexec@compute-0-4.local] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
         [mpiexec@compute-0-4.local] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

     I think this error is not caused by any particular node on our system, because when the program runs with 8 nodes it also uses compute-0-4 and compute-0-5. Besides that, we also tested the case another way: instead of using PBS Pro, we call the nodes directly via a hostfile. In the working directory we create a file named hostfile with this content:

         compute-0-2
         compute-0-3
         compute-0-4
         compute-0-5

     and we use the command below to run wrf_program:

         mpirun -np 4 -hostfile hostfile wrf.exe

     It works fine! It also works with any number of nodes (4, 8 or 16 nodes are all OK). The issue I want to ask about is: when we use PBS Pro, why does it work with 8 nodes but not with 4 nodes? (Without PBS Pro, it works with any number of nodes.) Did we perhaps forget some PBS configuration? Thank you very much,
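     One observation worth checking (a guess, not a confirmed diagnosis): the working 8-node script starts 128 ranks (8 x 16), while the failing 4-node one starts only 4; and without mpiprocs in the select statement, $PBS_NODEFILE contains one entry per chunk rather than one per rank. A sketch of a 4-node request that matches ranks to the allocated CPUs:

         #PBS -l select=4:ncpus=16:mpiprocs=16   # 16 entries per host in $PBS_NODEFILE
         #PBS -l place=scatter
         mpirun -np 64 -hostfile $PBS_NODEFILE wrf.exe   # 4 nodes x 16 ranks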
  9. OK, I'll try it. Thank you!
  10. How do I set up place=scatter as the default on a queue?
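     A sketch of one way to do this with qmgr (the queue name workq is illustrative; resources_default applies only when the job does not specify its own placement):

         qmgr -c "set queue workq resources_default.place = scatter"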
  11. Thank you. Please help me with one more question: how do I set the maximum number of processors per node in one queue?
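     For reference, queue resource limits in PBS Pro are enforced per job rather than per node, so the closest standard control is a per-job cap; a sketch (queue name and value illustrative):

         qmgr -c "set queue workq resources_max.ncpus = 16"   # reject jobs requesting more than 16 CPUs in total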
  12. Dear support team, I have a question about PBS Pro: how do I set the maximum number of processors per user in one queue? Thank you,
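     A sketch using the per-user run limits available in PBS Pro 12 (queue name and value illustrative):

         # cap the total ncpus across each user's running jobs in this queue
         qmgr -c 'set queue workq max_run_res.ncpus = [u:PBS_GENERIC=20]'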
  13. Node properties in PBS Pro

    Wow, I've just done what your guide says, and it worked! Thank you so much,
  14. Node properties in PBS Pro

    Dear Scott, when I submit my script to run on some node (including the cluster head node), it ends quickly and produces the output log files. But when I submit it to run on a compute node only, it ends quite slowly and produces no log files. I have tried ssh from the cluster to a compute node (no username or password needed) and ssh from a compute node back to the cluster (username and password required); both connections work. This is the script I submitted:

         #!/bin/sh
         #PBS -q serialq
         #PBS -o sleep.out
         #PBS -e sleep.err
         #PBS -N sleep_test
         #PBS -l select=1:ncpus=1
         for i in {1..5} ; do
             echo $i
             sleep 1
         done

    This is my log in /var/spool/mail/root: How can I solve that? Thank you,
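    For what it's worth, PBS MOM normally delivers a job's stdout/stderr back to the destination host with scp/rcp, so the password prompt seen when sshing from a compute node back to the cluster is a plausible reason the log files never arrive. A sketch of enabling passwordless ssh for the user (standard OpenSSH; assumes the home directory is shared or the key is copied to the target host):

         ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa        # create a key with no passphrase
         cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
         chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys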
  15. Node properties in PBS Pro

    Thank you very much. I've just tried it, and it worked! But I have one more question. When I submit any job (whether a simple one or a complex one), it stays in the E state (job is exiting after having run) for quite a long time, about 10 to 15 minutes, before finishing. How can I debug that? Thank you,
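    For reference, a long E (exiting) state is often time spent staging output files back or running epilogue work; a sketch of where to look (the job id 25 and log path are illustrative):

         tracejob 25                                    # server and MOM events around the job's exit
         grep '25\.' /var/spool/PBS/mom_priv/logs/*     # on the execution host: look for slow copy-file / staging steps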