When Bad Things Happen to Good Messages

by Robert Blackhall

The Role of IMS in Today’s Enterprise

Legacy Data is Often at the Core of the Enterprise

IMS continues to be a robust, reliable data server of choice in the enterprise; its ability to flawlessly execute billions of transactions provides evidence of its staying power. IMS prevails because it is reliable, dependable, and it works. Legacy applications and data persist because IBM continues investing in IMS and enhancing its functionality. The additional investment allows IMS to keep pace with the times, so that new interfaces and subsystems can access applications and data.

With the evolution of IMS in the enterprise, not only have the formats and structures of the databases changed, but so too has the often-overlooked messaging environment. Because of data interface and access improvements, IMS is now open to access from other subsystems and the Web. IMS has evolved from local queues to shared queues, and now processes message types including Advanced Program-to-Program Communication (APPC) and Open Transaction Manager Access (OTMA) from other subsystems. Ongoing improvements to operating system architecture and hardware performance have permitted increased workloads and improved throughput in IMS. As a consequence, the amount of message traffic that must flow through the message queue has increased. The architecture changes, new interfaces, and additional traffic have created further dependence on message queue availability and performance. The availability of the IMS message queues has become as critical as the availability of the IMS databases.

Life Used to Be So Simple; a Message was Just a Message

In the old days, the IMS SYSGEN was a static, clearly defined, and predictable entity. The network gurus defined the environment and their word was law; as an IMS system programmer, you either met or didn’t meet their schedule. The good news is that a new application staff coding in C/Java can now access IMS. This benefit comes at the cost of a less controllable and less predictable workload, along with resource usage that is whimsical and dynamic. A less controllable environment is prone to application errors, loops, and message flooding.

Another cost is paid in overhead for messages on the queue. With each new IMS release, the number and size of IMS message prefixes increase within every IMS message. Sometimes, the message prefixes dwarf the business data itself. Consequently, prefix overhead consumes more of the message queue. The effect of the increased message prefix size is not always noticeable in smaller test environments, but can significantly impact larger production systems. With IMS V6.1, IBM introduced a subtle architectural change that further enlarges the size of IMS messages. IBM now places the conversational Scratch Pad Areas (SPAs) in the message itself, rather than in a SPA data set or in memory. Users with heavy conversational processing feel the most pain from this change.

You Can Make the Queue Bigger, but Will It Be Big Enough?

The message queues are an ever-increasing focal point of operational concern. Larger messages, consolidated workloads, and faster processors that allow more throughput result in unpredictable queue utilization. If the balance between incoming and outgoing work is not maintained, the queue can start to fill at an express rate. Data integrity is paramount within IMS. If data accuracy or delivery cannot be guaranteed, IMS will terminate rather than continue with the anomaly. If the message queue fills up, then IMS comes down.
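The balance between incoming and outgoing work can be put into rough numbers. The following is a minimal sketch, not an IMS facility: the function name, its parameters, and the figures in the usage note are all hypothetical, and it assumes you can obtain queue capacity, current usage, and average arrival and drain rates from your own monitoring.

```python
def seconds_until_full(capacity_bytes, used_bytes, inflow_bytes_per_sec, outflow_bytes_per_sec):
    """Estimate how long until the queue fills at the current net rate.

    Returns None when the queue is draining or holding steady.
    All inputs are hypothetical measurements supplied by the caller.
    """
    net_rate = inflow_bytes_per_sec - outflow_bytes_per_sec
    if net_rate <= 0:
        return None  # outgoing work keeps up with incoming work
    return (capacity_bytes - used_bytes) / net_rate
```

For example, with 600 bytes of free space and a net inflow of 20 bytes per second, the estimate is 30 seconds. The same arithmetic shows why a reallocated, larger queue only buys time if the imbalance itself is not corrected.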

When an IMS that supports today’s consolidated workloads comes down, the impact is widespread and severe. IMS recovery is lengthy, expensive, and has significant business impact. Worse yet, if you encounter an overflow and outage and reallocate the queues larger, you have to hope that they are now large enough. If your best calculation isn’t right, the outage soon recurs and the painful experience unmercifully repeats. With Service Level Agreements (SLAs) to meet and bonus criteria based on SLAs, an application or IMS outage is costly for your company and for you.

Managing IMS in the Enterprise

Since message queue overflows are so disruptive, the IMS queues require careful monitoring and management. But trouble is real-time and investigation is not. Native commands are fine for operating IMS, but are ineffective at problem determination and problem analysis. Display commands can reveal a count of messages, but not the true impact of the messages on the message queue; 10,000 small messages might be tolerable, whereas 10,000 huge messages are likely catastrophic! For true problem determination, you need to know much more than a simple count. For example: What is the intended application, origin, or destination for these messages? Who is responsible for this data and when did it make its way on the queue? Is there a true problem and, if so, how can you stop it? Worse yet, you simply can’t see some queue space users. If you are forced to make decisions in a vacuum, it is too easy to inadvertently make the problem worse.
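To illustrate why a simple count is not enough, here is a small sketch that groups queued messages by destination and reports count, total bytes, and the age of the oldest message. It assumes you already have some way to extract per-message details from the queue; the `QueuedMessage` record and its fields are hypothetical and do not correspond to any IMS control block or command output.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QueuedMessage:
    # Hypothetical record; field names are illustrative only.
    destination: str
    source: str
    size_bytes: int
    enqueued_at: float  # seconds since some epoch


def usage_by_destination(messages):
    """Return (destination, stats) pairs, largest byte consumer first."""
    totals = defaultdict(lambda: {"count": 0, "bytes": 0, "oldest": None})
    for msg in messages:
        stats = totals[msg.destination]
        stats["count"] += 1
        stats["bytes"] += msg.size_bytes
        if stats["oldest"] is None or msg.enqueued_at < stats["oldest"]:
            stats["oldest"] = msg.enqueued_at
    return sorted(totals.items(), key=lambda item: item[1]["bytes"], reverse=True)
```

Sorting by bytes rather than by count puts the destination holding one huge message ahead of the destination holding many small ones, which is closer to the true impact on queue space.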

Types of Queue Polluters

Some queue space users wreck all the fun for everyone else. These polluters overuse and abuse queue space and place IMS at risk. There are two general categories of polluters: slow creeps and hot spot single polluters.

Slow Creeps

Sometimes, messages that cannot be delivered fill the IMS queues so slowly that it takes days or weeks to cause an outage. Perhaps an old printer was removed or simply broke and no one fixed it. Day after day, reports are still written to it and remain on the queue waiting for delivery. The message queue slowly nears a critical state for some time without notice, and the problem only becomes apparent when it’s too late.
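One way to catch a slow creep early is to look at message age rather than queue size. The sketch below assumes you can produce, by whatever means, a mapping from each destination to the enqueue time of its oldest waiting message; the function name, the mapping, and the threshold are hypothetical.

```python
def stale_destinations(oldest_by_destination, now, max_age_seconds):
    """Return destinations whose oldest waiting message exceeds the age limit.

    oldest_by_destination maps a destination name to the enqueue time
    (in seconds) of its oldest undelivered message.
    """
    return sorted(
        destination
        for destination, oldest in oldest_by_destination.items()
        if now - oldest > max_age_seconds
    )
```

A printer that was removed weeks ago would show up here on the first day its reports went undelivered, long before the queue neared a critical state.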

Hot Spot Single Polluters

Because of faster hardware and improved throughput, a single application or interface can deliver a high volume of work to the queue and cause a lot of havoc. Problem determination in this scenario has a very short fuse; by the time you recognize the problem and the lucky associate walks to the monitor to investigate, the "IMS_IS_COMING_DOWN" system dump is being taken, and the phone is ringing off the hook.
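Because the fuse is so short, hot spots are better caught by growth rate than by absolute size. As a rough sketch, assuming you can take two snapshots of queue bytes per source a known interval apart, the following flags any source growing faster than a chosen limit. The snapshot format, the function name, and the threshold are hypothetical.

```python
def hot_spots(previous, current, interval_seconds, max_bytes_per_second):
    """Return {source: growth rate} for sources growing faster than the limit.

    previous and current map a source name to the bytes it holds on the
    queue at the start and end of the interval.
    """
    flagged = {}
    for source, bytes_now in current.items():
        rate = (bytes_now - previous.get(source, 0)) / interval_seconds
        if rate > max_bytes_per_second:
            flagged[source] = rate
    return flagged
```

A check like this is only useful if it runs automatically and frequently; by the time someone walks to the monitor, the interval that mattered has already passed.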

Invalid Messages

Invalid messages can be another cause of outages of a company's IMS resources. Two types of invalid message can be introduced onto the message queues: messages whose source terminal is undefined, and messages whose source terminal has an incorrect SYSID. Either can result in U820 IMS outages at transaction schedule time. These situations can be caused by IMSGEN errors or by tools that do not check the integrity of messages before inserting them onto the message queues.
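The kind of integrity check a tool could run before inserting a message can be sketched as follows. This is an illustration only: the message layout, the `terminals` lookup table, and the function name are hypothetical, and a real check would need to consult the actual system definitions.

```python
def validate_message(message, terminals):
    """Return a list of problems; an empty list means both checks passed.

    message is a dict with "source_terminal" and "sysid" keys.
    terminals maps each defined terminal name to its expected SYSID.
    """
    problems = []
    terminal = message["source_terminal"]
    if terminal not in terminals:
        problems.append("undefined source terminal")
    elif message["sysid"] != terminals[terminal]:
        problems.append("incorrect SYSID")
    return problems
```

Rejecting a message at insert time, with a reason attached, is far cheaper than discovering the same problem at transaction schedule time.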

Managing IMS in Shared Queues

Most companies planning for the future are trying to determine how to accommodate workload increases while keeping the associated costs down. Since IMS V6, messages can be shared between multiple IMS systems. By moving the IMS message queues into the coupling facility, as many as 32 IMS systems can have access to the same message queues. This process allows messages to be placed on the message queues by one IMS system and then processed by any IMS system in the IMSPLEX. This capability of the IMSPLEX is commonly referred to as IMS shared message queues. The transition of the message queues into the coupling facility means these messages are no longer associated with a single IMS system. Users of shared queues often assume that queue overflow problems are resolved. In fact, shared queues supplement local queues, and local queue overflows may still occur in a shared queues environment. These overflows can result in U758 abends, S80A abends, or IMS hangs, depending on how the queue buffers are defined.

Solution

We all have too much to do and too little time to do it. We certainly have less time to diagnose problems or write and maintain homegrown utilities that manage messages. But in today’s dynamic environment, the IMS message queues must be well managed and protected in order to fulfill SLAs and preserve business data. A method for identifying the root cause of queue problems is needed, and a means to quickly protect IMS should be deployed prior to a crisis. With this careful planning and implementation, it is easy for you to be the hero, save the good messages, and save the day. You should also look for tools that will allow you to automate as many processes as possible.

Contributors : Robert Blackhall
Last modified 2005-05-19 11:25 AM

DBAzine.com
