I've finished setting up ohpc 1.2 with PBS Pro. The setup is as follows:

The master is connected to the LAN on eth0 and to the compute nodes (via a switch) on eth1. However, pbs_comm defaults to the IP address of eth0, which the compute nodes of course cannot reach.

Output from /var/spool/pbs/comm_logs:

01/20/2017 14:54:41;0002;Comm@ricr-cluster;Svr;Comm@ricr-cluster;Exiting
01/20/2017 14:54:41;0002;Comm@ricr-cluster;Svr;Log;Log closed
01/20/2017 14:55:08;0002;Comm@ricr-cluster;Svr;Log;Log opened
01/20/2017 14:55:08;0002;Comm@ricr-cluster;Svr;Comm@ricr-cluster;pbs_version=14.1.0
01/20/2017 14:55:08;0002;Comm@ricr-cluster;Svr;Comm@ricr-cluster;pbs_build=mach=N/A:security=N/A:configure_args=N/A
01/20/2017 14:55:08;0002;Comm@ricr-cluster;Svr;Comm@ricr-cluster;/opt/pbs/sbin/pbs_comm ready (pid=16276), Proxy Name:ricr-cluster:17001, Threads:4
01/20/2017 14:55:08;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 1);tfd=18, Leaf registered address 10.155.198.146:15004
01/20/2017 14:55:14;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 2);tfd=19, Leaf registered address 10.155.198.146:15001
01/20/2017 14:55:41;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 3);tfd=20, Leaf registered address 192.168.1.4:15003


I don't really understand what is going on here, since 192.168.1.4 is the IP of compute node 1 (c1).
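
To see at a glance which addresses registered with pbs_comm, the "Leaf registered address" lines above can be grouped by IP with a small script. This is only a sketch: the regular expression is inferred from the three log lines shown here, not from any PBS documentation, and `leaves_by_ip` and `SAMPLE` are made-up names for illustration.

```python
import re
from collections import defaultdict

# Pattern inferred from the comm log lines in this post (an assumption,
# not a documented PBS log format).
LEAF_RE = re.compile(r"Leaf registered address (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def leaves_by_ip(log_lines):
    """Group the ports of registered leaves by their IP address."""
    found = defaultdict(list)
    for line in log_lines:
        match = LEAF_RE.search(line)
        if match:
            ip, port = match.groups()
            found[ip].append(int(port))
    return dict(found)


# The three TPP lines copied from the comm log above.
SAMPLE = [
    "01/20/2017 14:55:08;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 1);tfd=18, Leaf registered address 10.155.198.146:15004",
    "01/20/2017 14:55:14;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 2);tfd=19, Leaf registered address 10.155.198.146:15001",
    "01/20/2017 14:55:41;0c06;Comm@ricr-cluster;TPP;Comm@ricr-cluster(Thread 3);tfd=20, Leaf registered address 192.168.1.4:15003",
]

if __name__ == "__main__":
    for ip, ports in leaves_by_ip(SAMPLE).items():
        print(ip, ports)
```

On the lines above this shows two leaves registered from the eth0 address 10.155.198.146 and one from 192.168.1.4.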

 

Output from /var/spool/pbs/mom_logs/ on c1:

01/20/2017 14:55:41;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest 192.168.1.5:15001, msg="tfd=20, pbs_comm:10.155.198.146:17001: Dest not found"
01/20/2017 14:55:46;0001;pbs_mom;Svr;pbs_mom;Access from host not allowed, or unknown host (15008) in is_request, bad connect from 10.155.198.146:15001

This is the pbs.conf on the master:

PBS_SERVER=ricr-cluster
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

 

On the compute nodes the 0/1 flags are swapped, of course.

All services are running:

[root@c1 ~]# ps -ef | grep pbs
root      3952     1  0 14:55 ?        00:00:00 /opt/pbs/sbin/pbs_mom
root      4099  3812  0 15:23 pts/0    00:00:00 grep --color=auto pbs


[root@ricr-cluster ~]# ps -ef | grep pbs
root     16276     1  0 14:55 ?        00:00:00 /opt/pbs/sbin/pbs_comm
root     16293     1  0 14:55 ?        00:00:00 /opt/pbs/sbin/pbs_sched
root     16753     1  0 14:55 ?        00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 16853     1  0 14:55 ?        00:00:00 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
postgres 16861 16853  0 14:55 ?        00:00:00 postgres: postgres pbs_datastore 10.155.198.146(40520) idle
root     16870     1  0 14:55 ?        00:00:00 /opt/pbs/sbin/pbs_server.bin
root     17700 14501  0 15:21 pts/0    00:00:00 grep --color=auto pbs

The IP listed after pbs_datastore is the unwanted IP of eth0.

 

Ping and SSH work in both directions.

The nodes are all listed as down. I'm guessing this is because they are not communicating with pbs_comm:

[root@ricr-cluster ~]# pbsnodes -av
c1
     Mom = c1.localdomain
     Port = 15002
     pbs_version = unavailable
     ntype = PBS
     state = state-unknown,down
     pcpus = 1
     resources_available.host = c1
     resources_available.ncpus = 1
     resources_available.vnode = c1
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.netwins = 0
     resources_assigned.vmem = 0kb
     comment = node down: communication closed
     resv_enable = True
     sharing = default_shared

How do I reconfigure pbs_comm?
