Prevent and detect queuing issues in Tivoli Workload Scheduler


Tivoli Workload Scheduler 9.2 FP1 has recently been published and introduces a small but very important enhancement that can help TWS administrators prevent delays and service interruptions in Tivoli Workload Scheduler.

 

TWS message files

TWS is based on a distributed architecture where different components running on different machines exchange information using messages in an internal format. The architecture is designed to prevent any loss of information and is mostly based on persistent message files (the .msg files) that work as queues for each TWS process.

Each message queue has a single reader that has to process the messages sequentially.

The main message files are the following:

  • Mailbox.msg is the input queue for mailman, the process that routes messages to different nodes and components.
  • Intercom.msg is the input queue for batchman, the process that applies changes to the Symphony file (containing the current plan), resolves dependencies and selects the jobs to start.
  • Courier.msg is the input queue for jobman, the process that actually starts jobs.
  • pobox/server<X>.msg are the input queues for mailman servers, processes used to increase the scalability of master and domain managers by distributing the workstations managed by the master or domain manager across several mailman processes.
  • pobox/tomaster.msg and pobox/<workstation>.msg are message files used to temporarily store messages that cannot be sent to the parent workstation or to a specific child workstation because it is unlinked.

 

Impacts and causes of full message files

To avoid filling the filesystems, every message file has a maximum size that cannot be exceeded. This size is 10 MB by default and can be changed using the evtsize utility. The default value is used when a new message file is created and can be modified by setting the EVSIZE environment variable to the desired number of bytes.
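For example, the current usage and maximum size of the mailman input queue can be checked with the evtsize utility, and its maximum size raised, as in the commands below (the 20 MB value is only an example):

evtsize -show Mailbox.msg
evtsize Mailbox.msg 20000000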

Since no message can be lost, if a message file is full, every process that needs to write a message waits until some message is consumed and there is room to write the new one. In the worst conditions this can cause a chain reaction where a queue is full, a process cannot write a message and is therefore not able to process messages from its own queue, which starts growing.

There are 3 main scenarios that can occur in TWS environments and cause issues to the scheduling activities.

 

Agent down for a long period

When an agent is unlinked, mailman caches the messages to be sent to that agent in the pobox/<workstation>.msg file. If the agent is down for too long and there are a lot of messages for that agent, the message file in pobox can become full; the consequence is that the mailman process hangs until the agent link is reestablished or the message file size is increased.

Mailbox.msg starts growing and changes are not applied to the plan until the issue is resolved.

Monitoring the size of the pobox message files can help prevent these issues, allowing you to fix the problem before it affects mailman by resizing the pobox file or by running a "JnextPlan -for 0000".
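For example, if a workstation named FTA_DOWN (a hypothetical name) is going to stay down for a while, the corresponding pobox queue can be checked and enlarged, or the plan can be regenerated as mentioned above to clean up the situation:

evtsize -show pobox/FTA_DOWN.msg
evtsize pobox/FTA_DOWN.msg 20000000
JnextPlan -for 0000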

 

Unreachable agents

When the mailman process on the master or domain manager tries to connect to an unreachable agent it can receive 2 network errors: Connection Refused or Connection Timeout. A refused connection is usually reported within a few milliseconds and does not introduce any side effect. But if the agent is unreachable due to firewall issues or routing problems, mailman can remain stuck for a while until the connect attempt times out. The "tcp connect timeout" option in localopts can help reduce the wait before the connect attempt terminates, but mailman still has to wait some seconds, and if the workstation is defined as autolink, this time is spent every few minutes (according to "mm retrylink" in localopts, which is set to 10 minutes by default).

The consequence is that mailman is slowed down and its capacity to process messages is reduced; Mailbox.msg starts growing and there is an increasing delay in applying changes to the plan.

The timeout can be seen in the TWSMERGE log, and the impact can be reduced by moving problematic agents under a mailman server process.

Monitoring the Mailbox.msg can help to detect the issue.
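For reference, the two localopts options mentioned above are set in the localopts file of the master or domain manager; the fragment below is only an illustration (times are expressed in seconds, and the right values depend on your network):

# localopts fragment on the master/domain manager (illustrative values)
tcp connect timeout = 5
mm retrylink = 600

As with any localopts change, the agent has to be restarted for the new values to take effect.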

 

Mailman capacity exceeded

Every time mailman processes a message, it decides which nodes need to receive that message and sends it to the right nodes accordingly. Some messages, like submissions, need to be sent to all FTAs, which means that mailman has to send them one by one to all the directly connected workstations. If the message incoming rate is higher than mailman's capacity to process events, Mailbox.msg starts growing and there is an increasing delay in applying changes to the plan.

This is usually experienced during workload peak hours, and can appear unexpectedly if there is a slowdown on a child agent. The slowdown can be due to network issues that increase the network latency or reduce the network capacity, or to slow message processing on the agent, which results in a full Mailbox.msg on the agent and causes mailman on the parent node to wait for space in the agent's Mailbox.msg before it can proceed.

As long as the incoming rate is below mailman's maximum capacity, Mailbox.msg is almost empty and there is almost no delay introduced by mailman; but if it exceeds the maximum capacity, Mailbox.msg starts growing and with it the delay in applying changes to the plan, and the delay keeps growing for as long as mailman is running above its maximum capacity.

If the incoming rate is close to the maximum capacity, a small increase in the incoming rate or a slowness on an agent that reduces the mailman capacity can cause Mailbox.msg to grow.

Many factors can influence the mailman capacity, e.g. OS resources and status, network latency and speed, network timeouts, and slowness on child workstations.

As in the previous scenario, mailman servers can be used to distribute the workstations across multiple processes and increase the overall mailman capacity.
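As an illustration of this, a workstation is moved under an additional mailman server by setting the SERVER attribute in its definition in composer; the workstation name, node and server ID below are hypothetical, and the other attributes should match your existing definition:

CPUNAME FTA_SLOW
  DESCRIPTION "moved under mailman server A"
  OS UNIX
  NODE fta-slow.example.com
  TCPADDR 31111
  DOMAIN MASTERDM
  FOR MAESTRO
    TYPE FTA
    AUTOLINK ON
    SERVER A
END

With a definition like this, the master (or domain manager) runs a dedicated mailman server process for ID "A", whose input queue is the pobox/serverA.msg file described above, and that process handles the communication with all the workstations assigned to that ID.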

To prevent this kind of issue we need to know how much of the mailman capacity is currently used, how much is left, and how much buffer we have to handle fluctuations in incoming rate or capacity. This was not possible until now: 9.2 FP1 introduces specific tracing that can be used to track it.

 

Monitoring message files

To monitor a message file it is not correct to look at the file size: message files are in fact implemented as circular queues and the file size can be large even if the actual usage is small.

TWS provides 3 built-in mechanisms to monitor the message files:

  • the evtsize utility 
  • EDWA events
  • ITM integration

 

The "evtsize -show <msg file>" and "evtsize -show pobox" commands can be used to manually monitor the msg files, or for integration with any monitoring tool that can periodically call the utility to report the actual size of the message files and generate alerts if they grow over a specific threshold.
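As a minimal sketch of such an integration (not part of the product), the script below could be run periodically by any monitoring agent: it calls evtsize -show for a list of queues and warns when usage exceeds a threshold. It assumes the script runs as the TWS user in the TWS home directory, that evtsize is in the PATH, and that the first two integers in the -show output are the used and maximum sizes; verify these assumptions against the output format of your TWS version.

#!/usr/bin/env python3
# Minimal sketch: warn when a TWS message file grows past a threshold.
# Assumptions (verify for your environment): the script runs as the TWS user
# in the TWS home directory, evtsize is in the PATH, and the first two
# integers in the "evtsize -show" output are the used and maximum bytes.
import re
import subprocess
import sys

THRESHOLD_PCT = 70.0                                      # arbitrary example threshold
QUEUES = ["Mailbox.msg", "Intercom.msg", "Courier.msg"]   # add pobox/*.msg files as needed

def queue_usage(msg_file):
    """Return (used, maximum) bytes parsed from 'evtsize -show <msg_file>'."""
    out = subprocess.run(["evtsize", "-show", msg_file],
                         capture_output=True, text=True, check=True).stdout
    numbers = [int(n) for n in re.findall(r"\d+", out)]
    if len(numbers) < 2:
        raise ValueError("unexpected evtsize output: " + out)
    return numbers[0], numbers[1]

def main():
    exit_code = 0
    for queue in QUEUES:
        used, maximum = queue_usage(queue)
        pct = 100.0 * used / maximum
        if pct >= THRESHOLD_PCT:
            print("WARNING: %s is %.1f%% full (%d of %d bytes)" % (queue, pct, used, maximum))
            exit_code = 1
        else:
            print("OK: %s is %.1f%% full" % (queue, pct))
    sys.exit(exit_code)

if __name__ == "__main__":
    main()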

Within TWS there are prebuilt events available in the Event Driven Workload Automation (EDWA, a.k.a. Event Management) that can be used to create event rules that send alerts or even submit commands to automatically resize a message file.

The ITM integration provides an out-of-the-box integration that uses evtsize to monitor the status of the message files from the TEP console.

To define thresholds, consider that if a workstation is unlinked it is normal to have events in its pobox message file; for these files it is important to be alerted before they become full.

For Mailbox.msg and Intercom.msg, instead, only small fluctuations are normal, and messages stored in these queues always translate into a delay in applying changes to the plan.

 

How to prevent full message files with TWS 9.2 FP1

As said, TWS 9.2.0.1 introduces the capability for the mailman process to trace some basic information about its activity. Let us call this capability the "mailman profiler".

This information can be used to monitor mailman performance during the day, providing indications about the mailman process working rate during a predefined observation period and the amount of time spent by the mailman process with each connected workstation.

The mailman profiler is always active on all TWS 9.2.0.1 nodes acting as managers (including mailman servers). During mailman processing, the profiler collects information about the number of records that have been sent to each workstation and the time spent on that workstation.

The default sampling period is 12 hours; the suggestion is to customize and refine the sampling period to obtain information that is easier to consume for your own purposes. This can be done by adding the following property to the localopts file of the TWS manager node to monitor and restarting the agent to activate the new setting.

mm mailman profile = <sampling period in minutes>

The profiler prints the following information in the TWSMERGE trace file every <mm mailman profile> minutes:

00:00:48 18.12.2014|MAILMAN:INFO:=== MAILMAN WORKING PROFILER OF THE LAST ===> 306 seconds

00:00:48 18.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE ===> 7.8431%

00:00:48 18.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu NC005080_92  ===> 7.09534739% (242 processed records)

 

00:36:00 01.12.2014|serverA:INFO:=== MAILMAN WORKING PROFILER OF THE LAST ===> 311 seconds

00:36:00 01.12.2014|serverA:INFO:=== MAILMAN WORKING RATE ===> 3.5370%

00:36:00 01.12.2014|serverA:INFO:=== MAILMAN WORKING RATE for cpu NC005062  ===> 3.50608939% (78 processed records)

The same information is also printed when mailman stops, reporting each time the data gathered since the previous profiler snapshot.

Note that data for a mailman server are also available. 

In every snapshot it's possible to see if the information relates to a mailman server or not and the sampling period (in this case 5 minutes).

 

Looking at the trace entries it is possible to monitor the following information:
  • mailman working rate, which is the indicator of mailman capacity: the higher the value of this indicator, the closer mailman is to its maximum processing capacity. Values near 100% mean that mailman spends all its processing time communicating with workstations, and even small changes in the workload or in the mailman capacity may result in messages starting to queue in Mailbox.msg. In this case the overall working rates for the main mailman and mailman server "A" are respectively 7.8% and 3.5%.
  • the list of workstations that have been contacted in the sampling period, the amount of exchanged records and the time spent with each one: in this example 242 records have been passed to the FTA named NC005080_92, and 7.09% of mailman's processing time has been spent contacting NC005080_92. Only workstations that have an impact on mailman performance are listed in the profiler output (that is, only if the working rate for the cpu is greater than 0).

By monitoring one or more profiler samples it is possible to focus on two main aspects:

  1. check the way 'mailman working rate' changes: continuous increases could indicate a factor of instability. Approaching the limit of 80% may be dangerous.
  2. check the way 'mailman working rate for <cpu>' changes: anomalies for a specific set of cpus indicate that something is happening in managing them and these FTAs could be problematic.

These cases can be combined, and in both cases it is possible to put in place proper actions to avoid the upcoming issues.
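As an example of how these trace entries can be consumed, the short script below scans a TWSMERGE log and prints a summary of each profiler snapshot; the regular expressions simply match the message format shown above, while the log path and the 80% warning threshold (the rule of thumb mentioned in the first point) are only illustrative.

#!/usr/bin/env python3
# Minimal sketch: summarize mailman profiler snapshots from a TWSMERGE log.
# The regular expressions match the "MAILMAN WORKING ..." lines shown above;
# the default log path and the 80% warning threshold are only illustrative.
import re
import sys

PERIOD  = re.compile(r"=== MAILMAN WORKING PROFILER OF THE LAST ===> (\d+) seconds")
OVERALL = re.compile(r"=== MAILMAN WORKING RATE ===> ([\d.]+)%")
PER_CPU = re.compile(r"=== MAILMAN WORKING RATE for cpu (\S+)\s+===> ([\d.]+)% \((\d+) processed records\)")

def summarize(log_path):
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = PERIOD.search(line)
            if m:
                print("--- snapshot covering %s seconds ---" % m.group(1))
                continue
            m = PER_CPU.search(line)
            if m:
                cpu, rate, records = m.group(1), float(m.group(2)), int(m.group(3))
                print("    %-20s %6.2f%%  (%d records)" % (cpu, rate, records))
                continue
            m = OVERALL.search(line)   # matches the main mailman and the mailman servers
            if m:
                rate = float(m.group(1))
                warning = "  <-- close to max capacity" if rate >= 80.0 else ""
                print("  overall working rate: %.2f%%%s" % (rate, warning))

if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "TWSMERGE.log")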

How to find the root cause with TWS 9.2 FP1

We now give some practical examples of possible data returned by the mailman profiler, to understand how to recognize the root cause of the queuing phenomenon and the best actions to implement to anticipate the occurrence of the issue.

The basic idea is to monitor whether mailman is approaching its maximum capacity and to identify any problematic workstation.

Consider a sampling period of 4 hours and a TWS manager handling 4 FTAs. After the first observation we have the following:

03:00:27 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING PROFILER OF THE LAST ===> 14405 seconds

03:00:27 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE ===> 18.3%

03:00:27 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_A  ===> 4.27083% (4213 processed records)

03:00:27 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_B  ===> 5.2569% (5601 processed records)

03:00:27 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_C  ===> 4.7222% (4744 processed records)

03:00:27 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_D  ===> 3.9722% (3900 processed records)

Suppose now that the second observation period includes the workload peak hours, so the number of events to manage is doubled:

15:20:45 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING PROFILER OF THE LAST ===> 14403 seconds

15:20:45 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE ===> 36.5%

15:20:45 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_A  ===> 8.5213% (8102 processed records)

15:20:45 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_B  ===> 10.7719% (12078 processed records)

15:20:45 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_C  ===> 9.2132% (9112 processed records)

15:20:45 24.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_D  ===> 8.0012% (7789 processed records)

In this case we can verify that mailman is still far from reaching its maximum capacity.

Suppose that 3 FTAs are added to the manager we are monitoring to face another increase in the workload. These are the results reported:

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING PROFILER OF THE LAST ===> 14401 seconds

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE ===> 69.5%

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_A  ===> 8.6715% (8310 processed records)

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_B  ===> 10.5212% (11178 processed records)

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_C  ===> 9.2132% (9352 processed records)

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_D  ===> 25.2722% (25789 processed records)

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_E  ===> 4.2361% (4105 processed records)

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_F  ===> 6.3888% (6378 processed records)

05:31:02 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_G  ===> 5.2777% (5366 processed records)

No delay is observed in the environment yet, but it is evident that mailman has increased its internal processing and is now closer to its maximum capacity. It is also clear that FTA_D has a working rate much higher than the others.

Another run of the profiler shows this:

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING PROFILER OF THE LAST ===> 14407 seconds

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE ===> 82.4%

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_A  ===> 8.7652% (8398 processed records)

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_B  ===> 10.4387% (11103 processed records)

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_C  ===> 9.1745% (9323 processed records)

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_D  ===> 25.5643% (25699 processed records)

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_E  ===> 16.8402% (4203 processed records)

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_F  ===> 6.5554% (6356 processed records)

17:01:52 25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_G  ===> 5.4333% (5401 processed records)

This shows that mailman is dangerously approaching its maximum capacity. Also, comparing the last 2 profiler reports, we can see that all the FTAs seem to behave in a similar way except FTA_E: its working rate rose to 16.8% (from 4.2%) while processing the same amount of events.

In this case mailman is spending 16.8% of its time to process an amount of events for FTA_E that was previously handled in about 4% of its time. FTA_E is the problematic FTA and it is necessary to fix its behavior to avoid the impending filling of the message queue.
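To put rough numbers on it: in the previous snapshot FTA_E accounted for about 4.2% of a roughly 14,400-second period, that is around 600 seconds for about 4,100 records, or 0.15 seconds per record; in the last snapshot it accounts for 16.8% of a similar period, roughly 2,400 seconds for about 4,200 records, or 0.58 seconds per record. Each FTA_E record is now costing mailman about four times as much.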

Typical reasons for FTA_E slowness are:

  • network problems in the connection between the manager and FTA_E (from the manager to the FTA and/or vice versa)
  • problems on FTA_E accessing the local filesystem, causing delays in reading incoming messages.

Let's assume now that the data about FTA_E was instead:

17:01:52  25.12.2014|MAILMAN:INFO:=== MAILMAN WORKING RATE for cpu FTA_E  ===> 16.8402% (16485 processed records)

So, the working rate increased a lot, but so did the number of processed events, which is comparable with the other FTAs.

In this case we don't have a problematic FTA that is spending a lot of time to process few records. We have FTA_D and FTA_E processing a huge amount of events, and they require at least a dedicated manager.

 

Authors

Franco Mossotto (@FMossottoTWA) - IBM Tivoli Workload Scheduler Chief Designer
Paolo Salerno - IBM Tivoli Workload Scheduler L3 Team Leader