Recovering IMS Systems - Can I Do It Alone?
This article describes some problems and considerations involved with recovering IMS systems. Important facets of recovery include automating wherever possible to reduce the likelihood of human error and ensuring business-critical applications are back online as quickly as possible with minimal data loss. Some background concepts may already be familiar to the experienced database administrator (DBA); newer DBAs will find it helpful to consider possible recovery scenarios and contingencies that may need to be taken into account.
IMS Recoveries: Environment and Background
IMS is a large database management system used in major corporations around the world. With it, thousands of users can access vast quantities of business-critical data quickly and reliably.
IMS systems consist of many components including databases, application programs, control blocks, data communications and system software. IMS systems run on extremely fast mainframe computers in the OS/390 operating system environment. At IMS startup, various events occur to make the stored data available.
Since these components "live" on storage devices and are maintained by humans, opportunities for problems arise. Although DASD has grown to be extremely reliable, failures are possible. OS/390 systems are also quite reliable, yet operator errors and hardware malfunctions can present problems. Hurricanes, earthquakes, tornadoes and other natural disasters can render a data center inoperable. Even if the physical components of the system perform without fault, the maintenance of data and programs introduces another area of concern – human error. Application program errors and human errors at storage installation time are still other ways that massive problems can enter otherwise reliable systems.
In a perfect world, there would be no system problems, no disasters and no errors. In reality, it helps to have a way to recover from these challenges.
Hypothetical Situation #1: Disaster Recovery
What if your data center experiences either a declared or real disaster, such as Chicago’s 1992 flood? That very real flood wiped out the processing power of several downtown businesses. What about Hurricanes Andrew and Hugo that each caused billions of dollars of damage to the southeastern coast of the United States, leaving entire cities without electricity, housing and food? Buildings were flattened by high winds, and life stood still for thousands of residents. The 1994 Northridge earthquake near Los Angeles also wreaked havoc in one of the most populous centers of business and entertainment in the United States. Real disasters, such as these, will happen in the future.
In preparation for a real emergency, many companies engage in business-continuity planning and create disaster-recovery plans. The point of these plans is to ensure that the companies can resume their business activities following an interruption.
However, many obstacles arise in reaching this goal. Often the first thing to do following a real or declared disaster is to determine the exact timestamp to which you need to recover your data. Or, more simply stated, you want to make the data look exactly like it did at a very precise time, down to hundredths of a second. To ensure data integrity, you will want to choose a timestamp when no work was in process.
With standard recovery procedures, you select recovery points that occur when an IMS database is stopped. Typically, these points happen at best once per day, or perhaps weekly; 24x365 shops may not have even that many opportunities. You may therefore be unable to recover to the exact timestamp required, and data could easily be lost. For example, a utility or credit card payment made during the lost interval would not be recorded.
Once you have selected a timestamp for recovery, you must often endure a tedious and error-prone requirement to manipulate the data in the IMS RECON (REcovery CONtrol) data sets to allow IMS to be properly started. This requirement can take many hours of manual work and can yield results of varying degrees of accuracy.
Then you must code the JCL statements for the actual recovery job. This requirement is another time-consuming task that you must perform perfectly to ensure data integrity. Also, many IMS systems are closely associated with DB2® subsystems. When problems arise, how do you recover both sets of data and still remain logically consistent?
Hypothetical Situation #2: Human Error
Humans maintain IMS system components. As with the rest of us, application programmers, DBAs and systems programmers are capable of mistakes. An occasional logic error will inevitably creep into most IMS systems. If we could prevent 100% of these situations, we would do so. But in case a problem does occur, it is critical to restore data integrity by correcting the error. But how?
Many of the techniques for mending such errors are similar to those for disaster recovery. Several realistic scenarios can cause the need to recover data. Human error, though, represents the greatest threat. So even if you are not expecting a hurricane, flood or earthquake, you should have recovery procedures designed, tested and ready for implementation.
Recovery to the Rescue
Whatever the cause of the problem, you must be able to recover data quickly so your business can continue. While your data is unavailable, your company stands to lose hundreds, thousands or even millions of dollars. However, recognizing the need for a recovery is vastly different from making a recovery happen quickly and correctly. The problem is that IMS recovery is a very complex task, quite prone to errors and generally difficult to manage.
Recovery is different from just restoring an image of the data as it appeared at some earlier point in time. Rather, recovery involves bringing the data back to its state at the time of the problem. Often recovery means restoring databases and then reapplying the correct changes to those databases, in the correct sequence. Even though it sounds simple, this assignment can be overwhelming. For a complex database management system such as IMS, a recovery can take hours. Meanwhile, critical business opportunities come and go.
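The restore-then-reapply sequence described above can be shown in miniature. This is a toy model, not IMS itself: a dict stands in for the database and simple key/value records stand in for log records, where real IMS recovery operates on image copies and log data sets.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    timestamp: float   # toy timestamp; real IMS logs are far more precise
    key: str
    value: str

def recover_to_point_in_time(image_copy: dict, log: list, target: float) -> dict:
    """Restore the image copy, then reapply logged changes in
    timestamp order up to and including the target recovery point."""
    db = dict(image_copy)                      # step 1: restore the backup
    for rec in sorted(log, key=lambda r: r.timestamp):
        if rec.timestamp > target:             # stop at the recovery point
            break
        db[rec.key] = rec.value                # step 2: reapply each change
    return db
```

The sketch makes the key point visible: the result depends on applying the changes in the correct sequence and stopping at exactly the chosen timestamp.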
There are various tools and methodologies available when an IMS database recovery is needed. Some procedures allow automation to simplify the process. It is important to account for many scenarios, including those of disaster and human error as described earlier.
Standard recovery procedures allow selecting recovery points only when an IMS database is stopped. These times are normally considered to be an outage, since access to the data is prevented; there are very few of these timeframes available under normal circumstances. Some tools allow creation and selection of many more recovery timestamps, including "quiet times" between transaction activity, without making the database unavailable or taking an outage. Rather than settling for a single candidate recovery point, you could have hundreds or thousands of useable recovery points. This multitude of recovery points lets you minimize or completely eliminate loss of data from "in-flight" transactions, such as the utility and credit card payment example suggested previously.
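Finding "quiet times" amounts to scanning transaction activity for gaps in which no work is in flight. A minimal sketch of that idea, assuming transactions are represented simply as (start, end) pairs:

```python
def quiet_points(transactions, window_start, window_end):
    """Return the maximal intervals within [window_start, window_end]
    during which no transaction is in flight. Any timestamp inside one
    of these intervals is a candidate recovery point."""
    busy = sorted(transactions)                # (start, end) pairs
    quiet, cursor = [], window_start
    for start, end in busy:
        if start > cursor:
            quiet.append((cursor, start))      # gap: no work in flight
        cursor = max(cursor, end)              # extend past overlapping work
    if cursor < window_end:
        quiet.append((cursor, window_end))
    return quiet
```

For example, with transactions active during (1, 3), (2, 5) and (8, 9) in a window from 0 to 10, the quiet intervals are (0, 1), (5, 8) and (9, 10), yielding far more candidate recovery points than the single daily outage.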
Remember the problem of cleaning up the RECON data sets? Using manual procedures, this can take an experienced DBA hours or days, with a relatively high possibility for error. Various tasks must be performed, such as:
- Close any necessary open records: PRILOG, SECLOG, PRISLD, SECSLD
- Update and/or delete ALLOC and LOGALL records as necessary
- Delete active SUBSYSTEM records, including DL/I batch jobs
- Mark database data set groups and areas as "recovery required" as necessary
- Delete PRIOLD and SECOLD records as necessary
There are tools on the market that can perform these tasks automatically in only a few minutes. When the clock is ticking and thousands of dollars of revenue are going unclaimed during a disaster, your recovery plan must allow time to perform these functions, regardless of whether a manual or automated process is employed.
Since recovery is such a critical factor to business continuity, many installations perform practice drills before the real thing hits. This usually involves many individuals going to a disaster recovery site (sometimes called a "hot site") and actually going through the steps of simulating a recovery following a disaster. These drills are highly publicized within the company and attract the attention of the highest levels of management. Typically, IT professionals are not in a position to test their ability to perform local recoveries (such as may be required following a human error or device failure). These activities are left to the DBAs to handle "on the fly," which could leave the installation exposed to greater chance for missing some key factor in the recovery. Some vendors provide tools that equip DBAs with the ability to practice recovery before the actual event, by ensuring that all elements required to run a recovery (such as copies of the databases and system logs) are available. These can be used "in the privacy of the office," rather than in the spotlight at a full-blown disaster recovery practice. Such tools allow deficiencies in current backup procedures to be identified and corrected before a recovery is required.
Cross System and Point-in-Time Recoveries
Recovery scenarios can involve added complexity when multiple systems must be brought back to the same time. Coordinating both IMS and DB2 environments to a common recovery point is not a trivial matter and may involve hours of manual analysis and intervention on the part of the DBAs involved.
Point-in-Time (PIT) recoveries bring additional special considerations to bear. While a system manager may wish to recover to a particular timestamp (and there are tools available to easily handle this requirement), it is important to identify the work going on at that exact time to ensure data integrity. While this is a daunting manual process, with much work required to analyze data from many logs, there are tools available to automate this process and make it much more manageable.
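Coordinating IMS and DB2 to a common recovery point can be framed as intersecting each subsystem's quiet intervals: a timestamp is safe only if no work was in flight on either side. A small sketch of that intersection, assuming quiet intervals have already been derived from each subsystem's logs:

```python
def common_quiet_intervals(ims_quiet, db2_quiet):
    """Intersect two lists of (start, end) quiet intervals. Any
    timestamp in the result is a consistent recovery point for
    both subsystems."""
    result = []
    for a_start, a_end in ims_quiet:
        for b_start, b_end in db2_quiet:
            lo, hi = max(a_start, b_start), min(a_end, b_end)
            if lo < hi:                  # non-empty overlap
                result.append((lo, hi))
    return sorted(result)
```

The shrinking overlap explains why coordinated recovery points are scarcer than single-system ones, and why tooling that surfaces them automatically saves hours of manual log analysis.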
Recovery Success Stories
In August 1999, a major Texas-based insurance carrier completed its first successful disaster recovery drill. With the help of certain vendor products and services, they created the environment for a successful drill. Before that time, these exercises could not recover to a known data point; resumption of their business would have been clouded with doubt and loss of data integrity.
Tools proved invaluable during this massive effort. The products performed these and other tasks:
- Grouped whole sets of databases to synchronize activity among them
- Identified the assets that would be needed for recovery (to make sure they were available when needed)
- Estimated the length of time required to run the recovery
- Found and created recovery points (compared to those available for their corresponding DB2 subsystem)
- Conditioned the RECON data sets to allow IMS to be started cleanly
- Created comprehensive JCL to recreate the databases
Relying on manual efforts, this customer might have continued to experience frustration in its disaster recovery efforts, as it had for many years. With these automated tools, recovery became another opportunity for success.
Another organization performed a successful disaster recovery drill in the summer of 1999. A major outsourcing partner, this company is a large IMS Fast Path site with thousands of data areas that implement data sharing in various configurations. Customers include one of the largest and best-known retailers in the country.
By using automation and recovery tools, their disaster recovery drills have been consistently successful. The company has recovered three times the amount of data in half the time through the capabilities of these systems. Disaster recovery exercises are vastly improved over manual processes, both in duration (length of time required for recovery) and accuracy (data integrity).
Hardware Isn't the Answer
Many planners feel that since they use mirroring technology to duplicate their data on multiple DASD images, they have no need for recovery. But what about application program errors? Having multiple copies of logically incorrect data is not a solution. If an operator accidentally initializes the wrong mirrored string of DASD, you need to be able to recover that data as well. Hardware alone is not the answer to recoverability.
Some vendors' tools provide automatic creation of recovery jobs in a simplified, flexible, comprehensive, single-step approach. Be careful in selecting the products you incorporate, though; other products' processes are manual, complicated, rigid, incomplete and accomplished only in multiple steps. For example, would you rather perform multiple RECON modifications for each recovery timestamp and manual RECON notifications for "recovery required" conditions, or have this accomplished in one step generated by a single utility?
In evaluating tools, check to ensure that you can create meaningful groups of databases on which to perform your recovery steps. Easier manipulation of resources when your business applications are in jeopardy makes recoveries easier and more effective.
There are several options available when recoveries are needed; manual processes and a variety of automated solutions can be pieced together to provide a comprehensive plan for getting the business' data back online. Relying strictly on manual intervention leaves the installation open to many opportunities for human error to corrupt the applications' data. By thoroughly planning for both likely and unlikely scenarios and utilizing automation wherever possible, DBAs will be better prepared to handle whatever emergencies arise in a timely manner that ensures data integrity. Through use of various features and functions available, successful recoveries are much easier and more likely to be attained.
Terrell Smith is the product line manager for RECOVERY MANAGER for IMS with BMC Software. She has worked in the computer industry 13 years and joined BMC Software in 1993. Prior to that, she was in technical support for Southwestern Bell Telephone Company.
Last modified 2005-08-04 08:27 AM