  • Announcements

    • admin

      PBS Forum Has Closed   06/12/17

      The PBS Works Support Forum is no longer active.  For PBS community-oriented questions and support, please join the discussion at http://community.pbspro.org.  Any new security advisories related to commercially-licensed products will be posted in the PBS User Area (https://secure.altair.com/UserArea/). 
garygo

single-node host name woes


I am attempting to set up Torque to run on a single node with 20 logical cores, configured as np=16. Both the server and the single MOM node are meant to use the short hostname (hostname -s), which is dev1-linux. The setup is mostly working: a queue exists and I can submit jobs to it. But qnodes shows the node with state=down, and the jobs do not run.

I am running on CentOS 6.8 using Torque 4.2.10. From having tried this in the past, I suspect a communication problem between pbs_server and pbs_mom, with some components seeing the full hostname (hostname -f) and others the short name. The log files don't reveal any obvious errors, except that the server_logs file shows the server as 'dev1-linux.attlocal.net' when I have used 'dev1-linux' in all the places I can think of where a host name is specified.
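To illustrate the short-versus-full name mismatch described above, here is a minimal Python sketch that prints the names the system resolves and compares a configured name against them. The helper names short_name and names_differ are made up for this example and are not part of Torque; the sketch only shows what the operating system reports, not what pbs_server or pbs_mom use internally.

```python
import socket


def short_name(hostname: str) -> str:
    """Return the part before the first dot, similar to `hostname -s`."""
    return hostname.split(".", 1)[0]


def names_differ(configured: str, resolved: str) -> bool:
    """True when a configured name and a resolved name are not textually equal."""
    return configured.strip().lower() != resolved.strip().lower()


def report(configured: str) -> None:
    """Print the resolved names next to the configured one."""
    host = socket.gethostname()   # whatever the kernel has as the host name
    fqdn = socket.getfqdn()       # name after resolver lookup, often the full name
    print("configured:   ", configured)
    print("gethostname():", host)
    print("getfqdn():    ", fqdn)
    print("short name:   ", short_name(fqdn))
    print("mismatch with getfqdn():", names_differ(configured, fqdn))


if __name__ == "__main__":
    # 'dev1-linux' is the short name used in the post.
    report("dev1-linux")
```

If the last line reports a mismatch, that matches the symptom in server_logs, where the full name appears even though only the short name was configured.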

Any suggestions about the state=down problem in general, or about other places (besides server_name and mom_priv/config) that control the host name?

Sorry, garygo, this forum is for PBS Professional; I cannot comment on Torque.

Have you verified that the firewall is not blocking communication ports? 

I am asking because we receive several PBS Professional support calls about pbs_server and pbs_mom communication issues that turn out to be caused by firewall rules.
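A quick way to test whether the daemon ports accept connections is sketched below in Python. It assumes the commonly used default ports (15001 for pbs_server, 15002 for pbs_mom); a site may have configured different ones, and the function name port_open is just for this example.

```python
import socket

# Assumed default ports; check your own installation before relying on these.
DEFAULT_PORTS = {"pbs_server": 15001, "pbs_mom": 15002}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    for name, port in DEFAULT_PORTS.items():
        status = "reachable" if port_open("localhost", port) else "not reachable"
        print(f"{name} (TCP {port}): {status}")
```

A "not reachable" result does not by itself point to the firewall, since it can also mean the daemon is not running. On a single node, the check is more telling when run against the host's real name rather than localhost, because loopback traffic is often exempt from firewall rules.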

Thanks for your comments, Scott. Yes. I have verified that the firewall allows communication ports. I also tried turning off the firewall altogether.

I realize that this forum is not exactly for Torque. I was just hoping that some knowledge here of host name issues would translate to my situation. I have also been unable to find any forum that directly addresses Torque as installed via yum on CentOS. Any suggestions on appropriate forums?

Thanks,

Gary

Actually, I solved my problem. It seems that with Torque 4.2.10, any setup that has the same host name for pbs_server and pbs_mom is treated as a NUMA system. The key to getting the MOM node to state=free was to edit /var/lib/torque/mom_priv/mom.layout and enter the single line "nodes=0" in it. After restarting the pbs daemons (e.g., service pbs_server restart), qnodes shows the single 16-core compute node with state=free, and jobs now run.
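For anyone who wants to script that step, here is a minimal Python sketch that writes the one-line layout file. The path and the "nodes=0" content come from the description above; the function name write_mom_layout and the ".bak" backup file are choices made for this example only.

```python
from pathlib import Path

# Path and content as given in the post; adjust if your install differs.
MOM_LAYOUT = Path("/var/lib/torque/mom_priv/mom.layout")


def write_mom_layout(path: Path = MOM_LAYOUT, content: str = "nodes=0") -> Path:
    """Write the single-line layout file, keeping a copy of any existing file."""
    if path.exists():
        # Hypothetical backup name, not something Torque itself reads.
        backup = path.with_name(path.name + ".bak")
        backup.write_text(path.read_text())
    path.write_text(content + "\n")
    return path
```

Calling write_mom_layout() with no arguments needs root privileges because of where the file lives. The daemons still have to be restarted afterwards, for example with service pbs_server restart as above, and presumably the same for pbs_mom.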

I'll have to investigate further to confirm that all CPUs and all 16 processors actually get used, but I think I'm on the right track now. Hopefully, this will help other Torque users who want to set up a single-node queueing system.
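As a first step in that check, the node's reported state and np can be read back programmatically. The Python sketch below assumes that pbsnodes -a prints each node name on an unindented line, followed by indented "key = value" lines; that layout is an assumption, so verify it against your own output. The function names are made up for this example.

```python
import subprocess


def parse_pbsnodes(text: str) -> dict:
    """Parse node blocks into {node_name: {key: value}} (assumed output layout)."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate node blocks
        if not line[0].isspace():
            current = line.strip()  # unindented line starts a new node
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)  # split on the first '=' only
            nodes[current][key.strip()] = value.strip()
    return nodes


def node_is_free(nodes: dict, name: str, expected_np: int) -> bool:
    """True if the named node reports state 'free' and the expected np."""
    attrs = nodes.get(name, {})
    return attrs.get("state") == "free" and attrs.get("np") == str(expected_np)


def query_pbsnodes() -> dict:
    """Run `pbsnodes -a` and parse its output; requires Torque on the PATH."""
    result = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    )
    return parse_pbsnodes(result.stdout)
```

On the node itself, node_is_free(query_pbsnodes(), "dev1-linux", 16) would then confirm the configuration. This only shows what the server believes about the node; whether all 16 slots are used in practice still has to be checked by submitting jobs.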

- Gary

 
