Optimizing IT and Service Management

Is Alarm to Trouble Ticket always one way traffic?


Back in the last century, one of the earliest third-party integrations developed for Netcool OMNIbus was a gateway to the Remedy Action Request System (ARS). That gateway allowed an operator to create a trouble ticket from an OMNIbus event, and thus initiated the now standard hand-off from event management to incident management. This process is essentially one way, the posting back of the trouble ticket number notwithstanding, and the event manager and service request server are treated as two entirely separate systems that occasionally exchange information. The question is: should that remain so, or should the event manager be more closely integrated into the incident management process?

We tend to think of incident management as a linear process, which can be drawn as:

In most implementations the process transitions from the event manager to the incident manager at the Assign stage, and the users of the event management system often lose visibility of progress until their event is closed, either because a clear alarm is received from the repaired device or because the ticket is closed and that in turn clears the initiating alarm. However, at a large telco I have seen operators use a tool to update the Summary field of an alarm with information about the progress of the fix, so some feedback is clearly welcome.

The linear process also assumes that only a single string of teams is involved: a monitoring team that detects and diagnoses, then hands over to field engineering for the actual fix. That is rarely the case. In my time in network support I could tell the potential impact of a problem by the number of senior managers hovering around my desk, and more formally there was a User Help team who were supposed to keep the business informed. If anything, things are now even more complex. That linear flow is actually more like this:

The problem now for an incident manager is to keep track of those individual assignments. In this blog I would like to propose a means of doing so using events, event management and the principles behind the Common Alerting Protocol (CAP). I propose that true incident management means identifying the people who should be doing things and tracking whether or not they are doing so. That requires systems to provide and apply a process and to monitor the progress of that process. Processes generally exist already, if only in paper form or as custom and practice, so one step will be to bring those into the system. Once they are in the system, the actioning of those steps and the responses to those actions can be made into events that can subsequently be monitored with an event manager. To begin with, though, we need to think about how we will contain all the events relating to an incident and show the relationships between them. Here is where the Common Alerting Protocol can give us a methodology.

The Common Alerting Protocol has its origin in US government research into public warning systems. A proposal was made by the NSTC in November 2000 for a standard method of disseminating hazard and other public safety warnings. This was followed up by other agencies, no doubt given a hefty kick by the terrorist attacks of 9/11, resulting in the first draft of CAP being presented to OASIS in 2004. With this history, CAP comes prepared to handle multi-agency responses to incidents in a standardised manner and is vendor independent. We may not necessarily use CAP in this incident management proposal, but some of the principles used to define CAP can be usefully applied.

The first is a means of creating incident containers. Events in CAP are provided with three ID fields. The first, MessageID, is the unique identifier of the event itself. The second, ReferenceID, provides a one to one link to another event, while the third, IncidentID, provides a one to many link which can be used to identify the container. This diagram shows how those three IDs can be used to correlate related events.

There are a couple of points to note. The first is that the event used to headline the container is not necessarily the same as the root cause event. This means that service impact can be in a separate event to the actual technical cause. "Application XYZ is unavailable" can be treated separately to "Hard Disk failure", which is the underlying reason for the application being down. Why might we want to do that? Well, there will probably be two separate activities here: systems engineers will be endeavouring to restore the application and fix the hardware, while a user liaison team handles the impact of that application being unavailable.

The second point to note is that there is a cascade of parent-child events. Application XYZ may have a warm standby which needs an instruction to another team of systems administrators to activate it. So these three related events might record the instruction to start the warm standby, an acknowledgement that the instruction has been received and then a notification that the action has been completed. Having all these steps logged as events allows for an "at a glance" visualisation to be built. This would be really useful for a busy incident manager trying to track a dozen incidents at once.
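As a rough illustration of the container and the parent-child cascade described above, here is a minimal Python sketch. It is not OMNIbus or CAP code: the `Event` class, the field names `message_id`, `reference_id` and `incident_id`, and the sample IDs such as `E1` and `INC1` are all hypothetical stand-ins for the three ID fields. It also assumes the container has its own ID rather than reusing the headline event's ID.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    message_id: str                      # unique ID of this event
    summary: str
    reference_id: Optional[str] = None   # one-to-one link to a parent event
    incident_id: Optional[str] = None    # one-to-many link to the container


def group_by_incident(events):
    """Return {incident_id: [events]} for events that carry an incident ID."""
    containers = defaultdict(list)
    for ev in events:
        if ev.incident_id is not None:
            containers[ev.incident_id].append(ev)
    return dict(containers)


def chain_to_root(event, by_id):
    """Follow reference_id links upward; return the chain from event to root."""
    chain, seen = [event], {event.message_id}
    while chain[-1].reference_id in by_id:
        parent = by_id[chain[-1].reference_id]
        if parent.message_id in seen:    # guard against a reference loop
            break
        chain.append(parent)
        seen.add(parent.message_id)
    return chain


# Hypothetical sample data: a headline event, a separate root cause event,
# and the instruction / acknowledgement / completion cascade.
events = [
    Event("E1", "Application XYZ is unavailable", incident_id="INC1"),
    Event("E2", "Hard disk failure", incident_id="INC1"),
    Event("E3", "Instruction: start warm standby", reference_id="E1", incident_id="INC1"),
    Event("E4", "Instruction acknowledged", reference_id="E3", incident_id="INC1"),
    Event("E5", "Warm standby started", reference_id="E4", incident_id="INC1"),
]
by_id = {ev.message_id: ev for ev in events}
containers = group_by_incident(events)
chain = [ev.message_id for ev in chain_to_root(by_id["E5"], by_id)]
```

Here `containers["INC1"]` holds all five events, while `chain` walks from the completion notice back through the acknowledgement and the instruction to the headline event. An "at a glance" view could be built from exactly these two lookups.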

Another thing CAP defined was the timeline of an event, and defined it to include the prequel as well as the sequel. Because many of the warnings envisaged to be sent by CAP would be forecasts - storm warnings for example - the time a warning would become effective could be in the future. And if it was in the future then some grading would be needed to set the likelihood of the event actually occurring. This can be useful if our incident management needs to account for planned maintenance and maintenance windows. Our event timeline under CAP would be:

The CAP v1.2 standard also gives space to include instructions to the recipient in the event as well as options to include affected areas - useful in the context of utilities and telcos - and to provide URIs to relevant documentation. As part of the smarter city research IBM also proposed an extension to cover asset information such as the asset tag, asset owner and asset status.
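The timeline and likelihood grading can be sketched in a few lines of Python, applied here to a planned maintenance window. This is only an illustration under assumptions: the `TimedEvent` class and `phase` function are hypothetical, the field names loosely follow the CAP 1.2 elements `sent`, `effective`, `onset`, `expires`, `certainty` and `instruction`, and the phase names and sample dates are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TimedEvent:
    summary: str
    sent: datetime                     # when the message was issued
    effective: datetime                # when the information takes effect
    expires: datetime                  # when the information stops applying
    onset: Optional[datetime] = None   # expected start of the event itself
    certainty: str = "Unknown"         # CAP 1.2: Observed, Likely, Possible, Unlikely, Unknown
    instruction: str = ""              # what the recipient should do


def phase(event, now):
    """Classify where 'now' falls on the event timeline (phase names are made up)."""
    start = event.onset or event.effective
    if now >= event.expires:
        return "expired"
    if now >= start:
        return "active"
    if now >= event.effective:
        return "pending"               # notice is in force, event has not begun
    return "forecast"


# Hypothetical maintenance window announced at 09:00 for later the same day.
t0 = datetime(2024, 1, 10, 9, 0)
window = TimedEvent(
    summary="Planned maintenance: router upgrade",
    sent=t0,
    effective=t0 + timedelta(hours=1),
    onset=t0 + timedelta(hours=13),
    expires=t0 + timedelta(hours=17),
    certainty="Likely",
    instruction="Suppress link-down alarms from this router during the window",
)
phases = [phase(window, t0 + timedelta(hours=h)) for h in (0, 2, 14, 18)]
```

Evaluated at the four sample times, `phases` moves through forecast, pending, active and expired. An event manager could use the same test to decide whether an incoming alarm falls inside a maintenance window.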

However CAP is just a messaging protocol. The key component in this approach to incident management is the process that should be followed. Obviously the process will depend on the type of incident, so the first step is to detect there is an incident and then to identify what sort of incident it is. Once that is done then a policy engine such as Netcool Impact can call up the appropriate process and start applying it. If we have also created a messaging system where instructions and responses to instructions are exchanged as events then we can use Netcool OMNIbus together with Impact to provide full incident monitoring. Maximo tools such as EAM and SRM are still needed to handle the details of assets and trouble tickets, as are configuration tools such as NCM and IEM, but the approach I am proposing would provide a single point where all activities could be viewed as they happen or are scheduled.
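To make the idea of applying a process and monitoring responses a little more concrete, here is a small Python sketch. It is plain Python, not Netcool Impact policy code. The process table, team names, time limits and the `Step`, `start_process` and `record` helpers are all hypothetical, and time is simplified to minutes since the incident opened.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical process table: incident type -> steps as
# (team, action, minutes allowed before an acknowledgement is overdue).
PROCESSES = {
    "application_outage": [
        ("systems_admin", "Start warm standby", 10),
        ("user_liaison", "Notify affected users", 15),
        ("field_engineering", "Replace failed disk", 60),
    ],
}


@dataclass
class Step:
    team: str
    action: str
    ack_within: int                  # minutes allowed to acknowledge
    issued_at: int                   # minutes since the incident opened
    acked_at: Optional[int] = None
    done_at: Optional[int] = None

    def status(self, now):
        if self.done_at is not None:
            return "completed"
        if self.acked_at is not None:
            return "acknowledged"
        if now - self.issued_at > self.ack_within:
            return "overdue"
        return "issued"


def start_process(incident_type, now=0):
    """Issue every step of the process for this incident type."""
    return [Step(team, action, limit, issued_at=now)
            for team, action, limit in PROCESSES[incident_type]]


def record(steps, team, kind, at):
    """Apply an 'ack' or 'done' response event from a team."""
    for step in steps:
        if step.team == team:
            if kind == "ack":
                step.acked_at = at
            elif kind == "done":
                step.done_at = at


steps = start_process("application_outage")
record(steps, "systems_admin", "ack", at=3)
record(steps, "systems_admin", "done", at=8)
record(steps, "field_engineering", "ack", at=12)
board = {s.team: s.status(now=20) for s in steps}
```

Twenty minutes in, `board` shows the standby step completed, the disk replacement acknowledged, and the user notification overdue because no acknowledgement arrived within its limit. That last case is what the incident manager needs to see, and in the proposed design it would itself be raised as an event.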

I close with a sample architecture of how this might be implemented using Tivoli products:

Netcool OMNIbus would be the central event store, with probes detecting alarms. Incidentally, a probe does exist for CAP events, based on the message bus integration. Netcool Impact would be the policy engine and could also be the tool to do alarm correlation, though ITNM and TBSM are other candidates. Maximo would be the candidate for asset management and human activity management, and TCR the archive and reporting tool.



