sam8902

MOM_log being constantly written to with error messages, making the problem worse


PBS stopped working in our environment, most likely because we did not allocate enough space for it. That space is now full of large log files. We are looking into settings that reduce the number of log messages written and into how to purge old log files. In the meantime, although PBS is refusing connections for lack of space, the MOM keeps writing failure messages to the MOM log. Some polling process keeps writing "Cannot send Obit": there are hundreds of entries per minute, all useless after the first. Lack of space caused the error in the first place, and now the MOM is consuming even more space to repeat the same error message over and over.


 


 


sam8902, this is definitely an unfortunate situation. I hope that your site was able to recover.


 


I can share short-term remedies that I know other customers use, as well as techniques used within Altair's HPC environment. 


 


Verbosity of PBS Daemon Log Files


Ensure that the PBS daemon log events, or log_filter for pbs_sched, are set to an appropriate verbosity. Sites sometimes set these attributes to log EVERYTHING and forget to put them back to the defaults. See the PBS Professional Admin Guide v12.2, Section 12.4.4, "Log Event Classes", for details on setting these values.
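
Before changing anything, it can help to see what verbosity is currently configured. The following is only a sketch. It assumes PBS_HOME is /var/spool/PBS (the path used in the cron examples in this post), that MOM verbosity is set by a $logevent line in mom_priv/config, and that the scheduler's log_filter lives in sched_priv/sched_config. The function names are made up for illustration. Check the Admin Guide for the exact locations and default values in your version.

```shell
#!/bin/sh
# Sketch: report the log verbosity settings currently configured for PBS daemons.
# PBS_HOME defaults to /var/spool/PBS here (assumption; adjust for your site).
PBS_HOME="${PBS_HOME:-/var/spool/PBS}"

# mom_logevent FILE: print the $logevent mask from a MOM config file,
# or "default" if the file has no such line (or cannot be read).
mom_logevent() {
    val=$(awk '$1 == "$logevent" { v = $2 } END { print v }' "$1" 2>/dev/null)
    echo "${val:-default}"
}

# sched_log_filter FILE: print the log_filter value from a sched_config file,
# or "default" if the file has no such line (or cannot be read).
sched_log_filter() {
    val=$(awk -F'[: \t]+' '$1 == "log_filter" { v = $2 } END { print v }' "$1" 2>/dev/null)
    echo "${val:-default}"
}

echo "MOM \$logevent:       $(mom_logevent "$PBS_HOME/mom_priv/config")"
echo "Scheduler log_filter: $(sched_log_filter "$PBS_HOME/sched_priv/sched_config")"

# The server's verbosity is a server attribute, so ask qmgr if it is available.
if command -v qmgr >/dev/null 2>&1; then
    qmgr -c "print server" | grep log_events || echo "server log_events: not listed"
fi
```

A result of "default" only means that no explicit setting was found in that file; it does not say what the default is.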


 


PBS Daemon Logs


The PBS daemon logs are mainly useful to administrators doing forensics on jobs that are not behaving, or did not behave, correctly. For instance, a user asks why a job is still queued or why a job died. Otherwise, the daemon log files just take up space. The rule of thumb I use is to set up a cron job that purges the log files once the longest running job has exited the system, plus a few days for the user to post-process the results.

Let me give you an example. Say the longest running job my HPC cluster has ever seen ran for 20 days. I give the users 10 days to tell me about any issues they had with the job. I picked 10 days because most users take a week's worth of vacation, then there are the weekends, then the time they need to catch up on email, and so on. In total, I allow the daemon log files to exist on the system for 30 days before they are purged. Below is an example cron I put on all systems running PBS.



30 3 * * 0      find /var/spool/PBS/mom_logs -type f -mtime +30 -exec rm -f {} \;
32 3 * * 0      find /var/spool/PBS/sched_logs -type f -mtime +30 -exec rm -f {} \;
34 3 * * 0      find /var/spool/PBS/server_logs -type f -mtime +30 -exec rm -f {} \;
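
Before enabling cron entries like the ones above, a dry run shows which files they would remove. This is a minimal sketch under the same assumptions as the cron lines (PBS_HOME at /var/spool/PBS, a 30-day retention window). The helper name list_old_logs and the KEEP_DAYS variable are hypothetical, introduced only for this example.

```shell
#!/bin/sh
# Dry run: list (but do not delete) daemon log files older than a cutoff.
PBS_HOME="${PBS_HOME:-/var/spool/PBS}"   # assumed default location
KEEP_DAYS="${KEEP_DAYS:-30}"             # retention window used in this post

# list_old_logs DIR DAYS: print regular files under DIR that were last
# modified more than DAYS days ago. Prints nothing if DIR does not exist.
list_old_logs() {
    [ -d "$1" ] || return 0
    find "$1" -type f -mtime +"$2" -print
}

for d in mom_logs sched_logs server_logs; do
    echo "== $PBS_HOME/$d: files older than $KEEP_DAYS days =="
    list_old_logs "$PBS_HOME/$d" "$KEEP_DAYS"
done
```

The listing uses the same find tests as the cron entries, so whatever it prints is what the weekly purge would delete. Running it with a smaller KEEP_DAYS shows what a more aggressive purge would remove.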

Accounting Logs


Accounting logs contain the history of the cluster relative to job throughput. Nearly all sites keep these for years. Sites review these logs to analyze trends and user and group utilization, and as input into future procurements. Many sites use a cron to manage this data. For instance, accounting logs that are more than 30 days old are archived or compressed (e.g., tar or gzip). The compression is usually really good. Below is an example of a cron that is used on a PBS Server.



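# Skip files that are already compressed, so gzip does not warn about them every week.
36 3 * * 0      find /var/spool/PBS/server_priv/accounting -type f -mtime +30 ! -name '*.gz' -exec gzip {} \;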

I am aware of an RFE in the ticket system for making PBS smarter about disk space usage. I know there was a lot of internal debate about how PBS should behave when the disk fills up (e.g., disk space thresholds, reactions/triggers). In addition, if the disk fills up, then PBS is not the only service that is going to have issues.
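
In the meantime, a site can approximate a disk space threshold itself. The sketch below is not a PBS feature; it is an illustration that could be run from cron. It assumes PBS_HOME is /var/spool/PBS. The 90% limit is an arbitrary example value, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: warn when the filesystem holding PBS_HOME passes a usage threshold.
PBS_HOME="${PBS_HOME:-/var/spool/PBS}"   # assumed default location
THRESHOLD="${THRESHOLD:-90}"             # percent used; arbitrary example value

# used_percent PATH: print the integer percent used on PATH's filesystem.
# df -P gives the POSIX output format: one header line, one line per filesystem.
used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_threshold PERCENT LIMIT: succeed if PERCENT is at or above LIMIT.
over_threshold() {
    [ "$1" -ge "$2" ]
}

if [ -d "$PBS_HOME" ]; then
    pct=$(used_percent "$PBS_HOME")
    if over_threshold "$pct" "$THRESHOLD"; then
        echo "WARNING: $PBS_HOME filesystem is ${pct}% full (limit ${THRESHOLD}%)" >&2
    fi
fi
```

When a cron job writes output, cron normally mails it to the crontab owner if mail is configured on the host, so the warning gets through without extra plumbing. What to do when the threshold is crossed (purge sooner, lower verbosity, page someone) is a site decision.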


 


I should mention that PBS Professional 12.0 has a hook event called exechost_periodic, which is similar to cron BUT is not a replacement for cron. I have not tried it because cron works fine for me, but if you want to be adventurous, you could consider this option.


 


I welcome you to open an official RFE with the PBS Works support team, too. If you would like me to put you in contact with your local representatives, please private message me. 

