cdoherty

Having the Internet between the server and MOM


We're finishing up a first-phase PBS implementation. Looking ahead, I'm about to try running a PBS server in our colo, coordinating a node in Amazon EC2. Subject to the expected caveats about network reliability, data locality, and everything else one expects when running over a WAN, this seems like a really good way to scale our node availability.

The obvious questions:

1. Has anyone tried this already? How'd it go?

2. Is there anything about the server-MOM communication specifically that would react poorly to, say, 1000 ms of latency or some percentage of packet loss? It seems like the worst that would happen is that they lose touch, and the server already has policies for handling that.

Thanks,

Chris


Hi Chris, I don't know of anyone who has tried this, and unfortunately I know nothing about the level of network freedom you are given to access nodes through Amazon EC2.

I can tell you that the server and mom need to be able to communicate with each other both via TCP and UDP.

You could mitigate some of the potential UDP communication loss by raising the rpp_retry server attribute to something like 60. This attribute determines how many times the PBS daemons will retry RPP (UDP) communication with each other before declaring a communication failure. The default value is 10, and each try takes about 2 seconds.
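As a rough back-of-the-envelope check, the time before a communication failure is declared is approximately the retry count multiplied by the time per try. A minimal Python sketch, assuming the roughly 2 seconds per try quoted above; the function name is hypothetical and is not part of PBS:

```python
# Approximate duration of one RPP retry, as quoted above (an estimate, not a PBS constant).
SECONDS_PER_TRY = 2.0


def rpp_failure_window(rpp_retry: int, seconds_per_try: float = SECONDS_PER_TRY) -> float:
    """Rough number of seconds of lost RPP (UDP) traffic tolerated before failure is declared."""
    if rpp_retry < 0:
        raise ValueError("rpp_retry must be non-negative")
    return rpp_retry * seconds_per_try


print(rpp_failure_window(10))  # default rpp_retry: about 20 seconds
print(rpp_failure_window(60))  # suggested value for a WAN: about 120 seconds
```

So raising rpp_retry from 10 to 60 stretches the tolerated outage from roughly 20 seconds to roughly 2 minutes, under that 2-second assumption.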

In addition to the rpp_retry attribute, there is also the node_fail_requeue attribute. This is the number of seconds the server can be out of communication with the primary execution host before it decides to requeue any jobs running on that host. The count starts WHEN THE FAILURE IS DECLARED, so the moment it begins depends on the rpp_retry value set above. It can be set to 0 so that jobs are never requeued from nodes that the server deems "state down".
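To see how the two attributes combine, here is a small Python sketch of the timeline described above. It assumes about 2 seconds per RPP try and treats node_fail_requeue = 0 as "never requeue"; the function name and the sample value of 300 seconds are illustrative assumptions, not PBS defaults:

```python
from typing import Optional

# Approximate duration of one RPP retry (an estimate, not a PBS constant).
SECONDS_PER_TRY = 2.0


def seconds_until_requeue(rpp_retry: int, node_fail_requeue: int) -> Optional[float]:
    """Rough time from first lost contact until jobs on the host are requeued.

    Returns None when node_fail_requeue is 0, meaning jobs are never requeued.
    """
    if rpp_retry < 0 or node_fail_requeue < 0:
        raise ValueError("attribute values must be non-negative")
    if node_fail_requeue == 0:
        return None
    # The failure is declared after the retries run out; the requeue timer starts then.
    return rpp_retry * SECONDS_PER_TRY + node_fail_requeue


print(seconds_until_requeue(60, 300))  # about 120 s of retries plus 300 s of waiting
print(seconds_until_requeue(60, 0))    # None: jobs stay on the "down" node
```

With these assumed numbers, a node would have to be unreachable for about 7 minutes before its jobs were requeued.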

I hope this helps. Please share any results you have with this project in this thread. If you need more general PBS networking questions answered, please feel free to ask here as well.

Thanks!

--

Scott Campbell

PBS Professional Support Engineer
