Announcement (admin, 06/12/17): PBS Forum Has Closed

The PBS Works Support Forum is no longer active. For PBS community-oriented questions and support, please join the discussion at http://community.pbspro.org. Any new security advisories related to commercially-licensed products will be posted in the PBS User Area (https://secure.altair.com/UserArea/).
TUNGXP

Cannot run bwa with multiple nodes


Dear Altair support team,

I have a problem: when I submit a job to run bwa on multiple nodes (compute-0-0 and compute-0-1), it only runs on one node.

This is my pbs_submit.qsub script:



#PBS -N bwa_program
#PBS -q workq
#PBS -l select=2:ncpus=10
#PBS -l place=scatter
cd $PBS_O_WORKDIR
# run the program
module load pbs_module
/data/bwa/bwa mem -t 20 Build37.1/homo_full.fa test1.fastq test2.fastq >test_homo.sa

I submit the job with this command:

qsub pbs_submit.qsub

The job completes, but it only runs on compute-0-0 with 20 CPUs (I checked with the top command on each compute node).


Here is the output of pbsnodes -a:
compute-0-0
Mom = compute-0-0.local
Port = 15002
pbs_version = PBSPro_13.0.1.152391
ntype = PBS
state = free
pcpus = 20
jobs = 19.hpc.igr.ac.vn/0, 19.hpc.igr.ac.vn/1, 19.hpc.igr.ac.vn/2, 19.hpc.igr.ac.vn/3, 19.hpc.igr.ac.vn/4, 19.hpc.igr.ac.vn/5, 19.hpc.igr.ac.vn/6, 19.hpc.igr.ac.vn/7, 19.hpc.igr.ac.vn/8, 19.hpc.igr.ac.vn/9
resources_available.arch = linux
resources_available.host = compute-0-0
resources_available.mem = 66059224kb
resources_available.ncpus = 20
resources_available.vnode = compute-0-0
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 10
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

compute-0-1
Mom = compute-0-1.local
Port = 15002
pbs_version = PBSPro_13.0.1.152391
ntype = PBS
state = free
pcpus = 20
jobs = 19.hpc.igr.ac.vn/0, 19.hpc.igr.ac.vn/1, 19.hpc.igr.ac.vn/2, 19.hpc.igr.ac.vn/3, 19.hpc.igr.ac.vn/4, 19.hpc.igr.ac.vn/5, 19.hpc.igr.ac.vn/6, 19.hpc.igr.ac.vn/7, 19.hpc.igr.ac.vn/8, 19.hpc.igr.ac.vn/9
resources_available.arch = linux
resources_available.host = compute-0-1
resources_available.mem = 66059224kb
resources_available.ncpus = 20
resources_available.vnode = compute-0-1
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 10
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

The two attached screenshots show the output of top while the program runs on compute-0-0 and compute-0-1.

Even when I submit with "select=2:ncpus=10" and run "bwa mem -t 40", it only runs on compute-0-0.

Here is the mom log from each compute node (job id: 20):

compute-0-0:



11/05/2015 00:26:05;0080;pbs_mom;Job;19.hpc.igr.ac.vn;copy file request received
11/05/2015 00:26:05;0100;pbs_mom;Job;19.hpc.igr.ac.vn;staged 2 items out over 0:00:00
11/05/2015 00:26:05;0008;pbs_mom;Job;19.hpc.igr.ac.vn;no active tasks
11/05/2015 00:26:05;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.10:15001, sock=1
11/05/2015 00:26:05;0080;pbs_mom;Job;19.hpc.igr.ac.vn;delete job request received
11/05/2015 00:26:05;0008;pbs_mom;Job;19.hpc.igr.ac.vn;kill_job
11/05/2015 00:30:49;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.10:15001, sock=1
11/05/2015 00:30:49;0100;pbs_mom;Req;;Type 3 request received from root@192.168.1.10:15001, sock=1
11/05/2015 00:30:50;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.10:15001, sock=1
11/05/2015 00:30:50;0008;pbs_mom;Job;20.hpc.igr.ac.vn;Type 5 request received from root@192.168.1.10:15001, sock=1
11/05/2015 00:30:50;0008;pbs_mom;Job;20.hpc.igr.ac.vn;Started, pid = 31959

compute-0-1:



11/05/2015 00:12:05;0080;pbs_mom;Job;18.hpc.igr.ac.vn;delete job request received
11/05/2015 00:12:05;0008;pbs_mom;Job;18.hpc.igr.ac.vn;kill_job
11/05/2015 00:23:42;0008;pbs_mom;Job;19.hpc.igr.ac.vn;JOIN_JOB as node 1
11/05/2015 00:27:40;0008;pbs_mom;Job;19.hpc.igr.ac.vn;KILL_JOB received
11/05/2015 00:27:40;0008;pbs_mom;Job;19.hpc.igr.ac.vn;kill_job
11/05/2015 00:27:40;0008;pbs_mom;Job;19.hpc.igr.ac.vn;DELETE_JOB received
11/05/2015 00:27:40;0008;pbs_mom;Job;19.hpc.igr.ac.vn;kill_job
11/05/2015 00:32:25;0008;pbs_mom;Job;20.hpc.igr.ac.vn;JOIN_JOB as node 1

Why doesn't the job start on compute-0-1? Why does it only run on compute-0-0?

Thank you.


Attachments: post-54-0-92813200-1446632954_thumb.png, post-54-0-18577000-1446632963_thumb.png


Hello TungXP,


PBS Pro assigns the resources, but it's up to the job script to set up any parallelism. You can refer to the $PBS_NODEFILE variable to determine what you've been assigned. The job script will only run on the primary execution host, so typically MPI is used to spawn processes on other nodes assigned to the job.
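For example, the job script can read $PBS_NODEFILE to see which hosts the job was assigned. A minimal sketch (the sample file and its host names are only for illustration when running outside of a PBS job, where $PBS_NODEFILE is unset):

```shell
#!/bin/sh
# Inside a PBS Pro job, $PBS_NODEFILE points to a file listing one
# line per allocated vnode/CPU. Outside PBS, fall back to a sample
# file so this sketch can be run standalone.
printf 'compute-0-0\ncompute-0-0\ncompute-0-1\n' > /tmp/sample_nodefile
NODEFILE="${PBS_NODEFILE:-/tmp/sample_nodefile}"

# List the unique hosts assigned to the job
sort -u "$NODEFILE"

# An MPI-aware program would then be launched across those hosts, e.g.:
#   mpirun -machinefile "$NODEFILE" ./my_mpi_program
echo "hosts assigned: $(sort -u "$NODEFILE" | wc -l)"
```

A plain command like your bwa invocation has no such launcher in front of it, so it only ever starts on the primary execution host.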


 


I'm not very familiar with bwa, but if you're using bio-bwa as it appears, it is not MPI aware and will not execute tasks on multiple nodes. The parallel fork pbwa does, but it doesn't look like it supports the mem command you're using.
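Since bwa mem is multithreaded within a single host (via -t) but not MPI-aware, one practical option is to request all of the CPUs on a single node instead of scattering across two. A sketch of such a script, keeping your paths and file names:

```shell
#PBS -N bwa_program
#PBS -q workq
#PBS -l select=1:ncpus=20
cd $PBS_O_WORKDIR
module load pbs_module
# 20 threads, all on the single node PBS assigned
/data/bwa/bwa mem -t 20 Build37.1/homo_full.fa test1.fastq test2.fastq >test_homo.sa
```

With select=1:ncpus=20, the thread count given to -t matches the CPUs actually allocated on that host.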


 


Josh


