A DB2 Health Check - Part 2
In Part 1 of this article, I concentrated on what DB2 health really means: what it is, and what it’s not. In Part 2, I will go into more technical detail on specific DB2 health tuning categories and techniques.
Although the Capability Maturity Model™ (CMM) was originally created for the software development process, it can also be used as a framework for organizing a quality improvement program for documentation and processes. (More information on CMM may be found at www.sei.cmu.edu/cmm/cmm.html.)
Based on what we’ve covered so far, we now have enough information to define measurements and scoring for the various health areas. Each IT department should review itself based on the Capability Maturity Model™ with an emphasis on classifying the measure of health. (Refer to Part 1 of this article for how this is done.) Here, we concentrate on DB2 infrastructure support.
We also need to have a way of rating what we do. The following is the scale used by one company:
|CMM level||Documentation||Processes||Score|
| ||Does not exist||No process in place||0|
|Initial||Exists||Process in place||1|
|Defined||Understood by all||Consistently applied||3|
|Managed||Regularly reviewed and updated||Regularly reviewed and updated||4|
|Optimized||Regular quality improvement||Regular quality improvement||5|
This company chose to concentrate on the areas of Documentation and Processes. (The third area, People, was assigned as a separate project; we will concentrate on the technical aspects of DB2 system health in this article. For more information on quality improvement of people in an IT environment, see The MIS Manager’s Appraisal Guide by Lockwood Lyon and Fred Gluckson, McGraw-Hill, 1994.)
Our case study company then evaluated each area of its IT subsystems in terms of the previously noted chart. They logged the score for each area (Documentation and Processes). When they finished, they reviewed their findings for accuracy and completeness. Next they began to set priorities for improving documentation and processes.
Each IT enterprise should follow this procedure: evaluate current health measures and systems, devise health measures relevant to IT, rate each area, and prioritize areas that require improvement either based on a low score or on your enterprise or departmental goals. My recommendation: Any area receiving a score of zero for either documentation or processes should get a high priority.
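The prioritization step can be sketched in a few lines of Python. This is only an illustration: the area names are borrowed from the case study later in this article, while the scores, the `AreaRating` class, and the `prioritize` function are invented for the example. Sorting on the lower of the two scores puts any area with a zero at the top of the list, which matches the recommendation above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AreaRating:
    area: str
    documentation: int  # score on the 0-5 rating scale
    processes: int      # score on the 0-5 rating scale


def prioritize(ratings: List[AreaRating]) -> List[str]:
    """Return area names ordered from most to least urgent.

    Areas are ordered by their lowest score (so a zero in either column
    sorts first), then by combined score, then by name.
    """
    def key(r: AreaRating):
        lowest = min(r.documentation, r.processes)
        return (lowest, r.documentation + r.processes, r.area)

    return [r.area for r in sorted(ratings, key=key)]


# Hypothetical scores, for illustration only.
ratings = [
    AreaRating("ZParm settings", documentation=3, processes=1),
    AreaRating("Work files", documentation=0, processes=3),
    AreaRating("Disaster recovery readiness", documentation=1, processes=1),
]

for rank, area in enumerate(prioritize(ratings), start=1):
    print(rank, area)
```

An enterprise that prioritizes by departmental goals rather than by low score would replace the sort key with its own weighting.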
Documentation and Process Upgrades
The process of updating documentation should follow the levels listed in the previously noted chart. First, ensure that documentation exists for each area. Then, ensure that it is centrally available to all who might use it, and make certain it can be read and understood by all. Next, implement a process where the department regularly reviews and updates the documentation. Finally, institute a quality improvement process.
Process updates are more difficult, primarily because of their complexity. Processes such as regular tablespace reorgs or statistics gathering may not be implemented consistently across the enterprise, or even across a single DB2 subsystem.
Strategies and tactics for improving documentation and processes are part of the CMM; interested readers should reference the CMM Web site cited earlier in this article.
Indicators and Events for Automation
Part of any review of IT processes and procedures is looking at automation. Before describing automation that is built into the DBMS, we need to review automation that is either built by DBAs or support staff or implemented as part of a third-party software tool.
Most shops use simple “If-Then” logic to define the automation they want. For example, “If any pageset is over 30 extents, execute a Reorg utility for the pageset.” These rules are then typically implemented as batch jobs that run when the condition is true.
The conditions are defined in terms of indicators, of which there are two types: state-based and threshold-based. State-based indicators usually have two values such as “On/Off” or “True/False.” Examples of these would be whether or not DB2 is up, whether or not an Active Log is being archived, or whether or not an index is defined as the clustering index.
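As a rough sketch of such a rule, the following Python separates the condition ("over 30 extents") from the action. The pageset names and extent counts are made up, and the action here simply records the pageset name; in a real shop the action would submit the batch job that runs the utility.

```python
from typing import Callable, Dict, List


def over_extent_limit(extents: int, limit: int = 30) -> bool:
    """The 'If' part of the rule: true when a pageset exceeds the limit."""
    return extents > limit


def apply_rule(
    extent_counts: Dict[str, int],
    condition: Callable[[int], bool],
    action: Callable[[str], None],
) -> List[str]:
    """Run the action for every pageset whose extent count meets the condition.

    Returns the names of the pagesets that triggered the rule.
    """
    triggered = []
    for name in sorted(extent_counts):
        if condition(extent_counts[name]):
            action(name)  # the 'Then' part, e.g. submit a reorg job
            triggered.append(name)
    return triggered


# Hypothetical pagesets and extent counts.
sample = {"DBA1.TSORDERS": 12, "DBA1.TSITEMS": 31, "DBB2.TSCUST": 30}
submitted: List[str] = []
result = apply_rule(sample, over_extent_limit, submitted.append)
print(result)
```

Keeping the condition and the action as separate functions makes it easy to reuse the same condition for a notification instead of a utility run.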
Threshold-based indicators are used when a measurement varies over a range of values. Usually they indicate one of three conditions for the current value:
- It is within an acceptable range (normal)
- It has increased (or decreased) to a level that merits some concern (warning)
- It has increased (or decreased) to a level for which immediate action must be taken (danger)
An example of a threshold-based indicator is the number of extents of a pageset. For most enterprises, this number ranges from 1 to 255 depending on the initial allocation of the pageset and whether or not it can extend to multiple volumes. Along with the measurement itself (say, 30 extents), you must also define thresholds that designate the boundaries between the normal/warning areas and the warning/danger areas.
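A minimal sketch of this classification follows. The boundary values (30 and 100 extents, and the free-space percentages) are assumptions chosen for illustration, not recommendations. The function treats the indicator as "higher is worse" when the warning boundary is below the danger boundary, and as "lower is worse" when the boundaries are given the other way around.

```python
from enum import Enum


class Level(Enum):
    NORMAL = "normal"
    WARNING = "warning"
    DANGER = "danger"


def classify(value: float, warning_at: float, danger_at: float) -> Level:
    """Place a measurement in the normal, warning, or danger range."""
    if warning_at <= danger_at:
        # Rising indicator: larger values are worse (e.g. extents).
        if value >= danger_at:
            return Level.DANGER
        if value >= warning_at:
            return Level.WARNING
        return Level.NORMAL
    # Falling indicator: smaller values are worse (e.g. free space).
    if value <= danger_at:
        return Level.DANGER
    if value <= warning_at:
        return Level.WARNING
    return Level.NORMAL


# Extents of a pageset, with assumed boundaries of 30 and 100.
print(classify(12, warning_at=30, danger_at=100))
print(classify(30, warning_at=30, danger_at=100))
print(classify(150, warning_at=30, danger_at=100))
```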
With the indicators now defined, the event definitions become clear. An event occurs when either:
- A state-based indicator changes state
- A threshold-based indicator value crosses a threshold
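Both kinds of event can be detected by comparing consecutive samples of an indicator. The sketch below assumes the samples arrive as a simple list and that a threshold-based indicator has two ascending boundaries; the function names and sample values are illustrative only.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

BANDS = ("normal", "warning", "danger")


def band(value: float, thresholds: Tuple[float, float]) -> str:
    """Map a value to a band; thresholds = (warning_at, danger_at), ascending."""
    return BANDS[bisect_right(thresholds, value)]


def state_events(name: str, samples: Sequence) -> List[tuple]:
    """Return (name, old, new) for each change between consecutive samples."""
    events = []
    for old, new in zip(samples, samples[1:]):
        if old != new:
            events.append((name, old, new))
    return events


def threshold_events(
    name: str, samples: Sequence[float], thresholds: Tuple[float, float]
) -> List[tuple]:
    """Return an event each time consecutive samples fall in different bands."""
    return state_events(name, [band(v, thresholds) for v in samples])


# A state-based indicator: is DB2 up?
print(state_events("DB2 up", [True, True, False, True]))

# A threshold-based indicator: extents, with assumed boundaries 30 and 100.
print(threshold_events("extents", [10, 28, 31, 35, 120, 20], (30, 100)))
```

Note that moving from 31 to 35 extents produces no event, because the value stays within the warning band.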
In summary, support personnel:
- define the conditions and processes.
- define the indicators and events.
- implement data gathering processes to regularly calculate indicator values.
- implement processes to signal events based on indicators.
- implement additional processes that execute based on the events.
For more on this process, see the recently published IBM Redbook Event Management and Best Practices, SG24-6094.
As your automation effort proceeds, you go through certain levels of sophistication, corresponding to the levels of the CMM. Here, the levels might represent the following:
1. Produce and gather multiple sources of event, threshold, and statistical data.
2. Consolidate, summarize, and report data using management tools.
3. Implement monitoring software that correlates and recommends actions.
4. Implement action scripts for monitoring software to automatically take action.
5. Integrate components for dynamic management.
We now have all of the tools required to determine DB2 health. In Part 1 of this article, we have:
- Identified classes of health:
  - Ability to recover from a disaster
  - Availability of extra capacity
  - Proactive, predictive, self-healing
- Identified categories of DB2 health:
  - Subsystem configuration
  - Catalog and directory
  - Access Paths
  - Data: volumetric and configurational
  - Process objects
So far in Part 2, we have:
- Developed methods for rating the health of our documentation and processes.
- Optionally developed methods for rating the health of our people.
- Defined indicators, events, and processes for each health category.
- Optionally implemented these processes as automation.
This, in a nutshell, is the way that you go about implementing a DB2 health strategy.
We conclude with an example from our case study company.
Implementing Health Category Measurement and Automation
Our case study company identified several categories of DB2 system health measures. One in particular was Subsystem Configuration. This was broken down into more granular areas as follows:
1. Subsystem configuration
   a. MVS environment; DBMS operational state
   b. DBMS WLM assignments
   c. Data sharing and Parallel Sysplex exploitation
   d. DBMS maintenance
   e. IRLM configuration
   f. Disaster recovery readiness
   g. ZParm settings
   h. Logs, archives, and log utilization
   i. Work files
   j. Data sharing
   k. Virtual pool sizing, thresholds, and tuning
   l. Global buffer pool tuning
   m. Memory pool sizing (EDM, RID, Sort) and usage
   n. Processes for regular reporting
   o. Processes for automated changes
For each area, they defined several indicators. For example, for the first area (MVS environment; DBMS operational state), the indicators chosen were:
- (S) DB2 IRLM address space active
- (S) DB2 MSTR address space active
- (S) DB2 DBM1 address space active
- (S) DB2 DDF address space active
- (T) DB2 subsystem startup: recovery complete
- (T) Production DB2 WLM resource “on”
The company then implemented measurements for these indicators through a series of REXX procedures and console commands. State changes kicked off console messages, e-mails, and other notifications.
In addition to this monitoring and reporting activity, additional logic combined indicator states and date/time information and stored records in a data warehouse for historical reporting and additional monitoring. As the database grew, they were able to spot trends and further refine their measurements and indicator/event definitions.
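The idea of timestamping indicator states and keeping them for trend reporting can be sketched as follows. This is not the company's implementation: an in-memory SQLite database stands in for the data warehouse, and the table name, column names, and sample records are all invented for the example.

```python
import sqlite3
from datetime import datetime, timezone


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the history store and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicator_history ("
        "recorded_at TEXT NOT NULL, indicator TEXT NOT NULL, state TEXT NOT NULL)"
    )
    return conn


def record(conn, indicator: str, state: str, when: datetime = None) -> None:
    """Store one indicator state together with its timestamp."""
    when = when or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO indicator_history VALUES (?, ?, ?)",
        (when.isoformat(), indicator, state),
    )
    conn.commit()


def records_per_day(conn, indicator: str):
    """Count stored records per calendar day, a simple trend report."""
    rows = conn.execute(
        "SELECT substr(recorded_at, 1, 10) AS day, COUNT(*) "
        "FROM indicator_history WHERE indicator = ? GROUP BY day ORDER BY day",
        (indicator,),
    )
    return rows.fetchall()


# Hypothetical state changes for one indicator.
conn = open_history()
utc = timezone.utc
record(conn, "DDF address space active", "off", datetime(2006, 1, 3, 8, 0, tzinfo=utc))
record(conn, "DDF address space active", "on", datetime(2006, 1, 3, 8, 5, tzinfo=utc))
record(conn, "DDF address space active", "off", datetime(2006, 1, 4, 2, 0, tzinfo=utc))
print(records_per_day(conn, "DDF address space active"))
```

A growing count of state changes per day for the same indicator is the kind of trend that might prompt a closer look or a refined threshold.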
For another example, let’s look at the area of Virtual Pool sizing, thresholds, and tuning. This is much more granular, focusing as it does on the DB2 DBM1 address space and its allocation of memory to the virtual pools.
A set of indicators was defined for each virtual pool. Since each pool was used for a certain class of pagesets (e.g., workfiles, small tables, indexes), the threshold values, and even some of the indicators themselves, differed across the pools. Here is a selection of the indicators they defined:
- (T) Active pages as a percent of pool size
- (T) Deferred Write Threshold exceeded
- (T) Additional thresholds exceeded
- (T) Virtual pool to Hiperpool movement
- (T) Page writes to DASD per unit time
In these circumstances, our case study company used the DB2-related SMF records as a data source and analyzed the results with a third-party software tool. As before, they also summarized and timestamped the data and stored it in the data warehouse.
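To show how the same indicator can carry different thresholds for different pools, here is a small sketch for "active pages as a percent of pool size." The pool assignments, the threshold percentages, and the page counts are all assumptions for illustration; real values would come from your own measurements and from the role each pool plays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolThresholds:
    warning_pct: float
    danger_pct: float


# Hypothetical per-pool thresholds; each pool serves a different class of pagesets.
THRESHOLDS = {
    "BP0": PoolThresholds(warning_pct=70.0, danger_pct=90.0),
    "BP7": PoolThresholds(warning_pct=50.0, danger_pct=80.0),
}


def active_pct(active_pages: int, pool_size_pages: int) -> float:
    """Active pages as a percent of the pool size."""
    if pool_size_pages <= 0:
        raise ValueError("pool size must be positive")
    return 100.0 * active_pages / pool_size_pages


def rate_pool(pool: str, active_pages: int, pool_size_pages: int) -> str:
    """Rate one pool against its own thresholds."""
    limits = THRESHOLDS[pool]
    pct = active_pct(active_pages, pool_size_pages)
    if pct >= limits.danger_pct:
        return "danger"
    if pct >= limits.warning_pct:
        return "warning"
    return "normal"


# The same 60 percent reading is rated differently in the two pools.
print(rate_pool("BP0", 6000, 10000))
print(rate_pool("BP7", 6000, 10000))
```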
In Part 2 of this article, we’ve gone into more detail on rating documentation, processes, and people. We then discussed ratings and indicators of health, how to define events, and the process of automating event collection and notification. We concluded with some examples.
The health of your DB2 subsystems depends on a combination of factors. The system itself is a combination of software, processes, documentation, and people. Good health can be defined as a blend of recoverability, capacity, self-healing, stability, maturity, and more.
Smart companies will adopt the Capability Maturity Model as a way to organize and guide their health strategy. The model has the side effect of helping you implement quality improvements via best practices. As you begin to implement these practices, try to be proactive. Implement automation when you can, and gather and store performance data for later trend analysis.
Finally, realize that the health check is not a one-time effort, but an ongoing process. The point of a health check is to implement a process of continuous quality improvement. While it’s nice to have a healthy DB2, it’s even better to keep it that way.
Lockwood Lyon is a DB2 for z/OS Systems and Database Performance specialist. He has over twenty years of experience in Information Technology as an IMS and DB2 database analyst, systems analyst, manager, and consultant. Most recently he's spent quite a lot of time on DB2 subsystem installation and performance tuning. Lockwood is the author of MIS Manager's Appraisal Guide (McGraw-Hill 1993), Migrating to DB2 (John Wiley & Sons 1991), and The IMS/VS Expert's Guide (Van Nostrand 1990).
Contributors : Lockwood Lyon
Last modified 2006-01-04 02:51 PM