  1. Thanks for your response. We ended up writing all program output to an external file instead of relying on the .o file. It appears safe to assume that writes made inside the PBS script finish before the next job runs, even though we can't assume the .o and .e files are completely written by then. So we no longer use the .o files for anything while jobs are running, except possibly for error reporting: since the .o files will now always be empty, we simply combine the .o and .e files.
  2. I'm not sure, but I may be seeing an internal PBS race condition when running dependent jobs. The error is intermittent and rare. I think PBS tells the next job in the chain that it's OK to run before it has finished writing the .o and .e files of the previous jobs. A simplified scenario: I have three jobs, and the third depends on the first two. The dependency is set up so that job3 runs only after job1 and job2 have both completed successfully (depend=afterok:job1:job2). The third job reads the output of the first two jobs, which is in their .o files. Sometimes the third job fails with a message that one or both of those .o files don't exist. When I look afterwards, of course, the .o files are there, and I can verify that the dependency chain in the PBS script was correct. I suspect that job1 and job2 finished and PBS released job3, but PBS didn't make sure the .o files were completely written before job3 started. Is this possible? Is there some way to ensure that job3 doesn't start until job1 and job2 are TRULY done, meaning their .o files have been written?
  3. PBS stopped working in our environment, most likely because we didn't allocate enough disk space for it. The space is now full of large log files. We are looking into settings that reduce the number of log messages written, and into how to purge old log files. In the meantime, while PBS refuses connections for lack of space, it keeps writing failure messages to the MOM_log. Some polling process keeps writing "Cannot send Obit", hundreds of entries per minute, all of them useless after the first. So the lack of space caused the error in the first place, and now the MOM constantly uses up more space repeating the same error message.
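
For the log purging mentioned in the last post, here is a minimal Python sketch, not a PBS-provided tool. It removes plain files in a log directory once their modification time is older than a chosen number of days. The function name `purge_old_logs`, the 14-day retention, and the directory path in the usage note are all assumptions for illustration. It defaults to a dry run so you can inspect the list before anything is deleted.

```python
import time
from pathlib import Path


def purge_old_logs(log_dir, max_age_days, now=None, dry_run=True):
    """Remove plain files in log_dir last modified more than max_age_days ago.

    Returns the paths that were removed (or would be removed in a dry run).
    `purge_old_logs` is a hypothetical helper, not part of PBS.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for entry in sorted(Path(log_dir).iterdir()):
        # Only plain files are candidates; skip subdirectories and symlinks.
        if entry.is_symlink() or not entry.is_file():
            continue
        if entry.stat().st_mtime < cutoff:
            if not dry_run:
                entry.unlink()
            removed.append(entry)
    return removed
```

A call such as `purge_old_logs("/var/spool/pbs/mom_logs", 14, dry_run=False)` would delete files older than two weeks, assuming that path is where your MOM logs live (check your own PBS_HOME). Because the check uses modification time, the log file currently being written is left alone. This only frees space; it does not stop the MOM from repeating the "Cannot send Obit" message.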