
DB2 Subsystem Tuning in an ERP Environment - Part 3: Monitoring and Tuning

by Lockwood Lyon

Part 1  |  Part 2  |  Part 3

In Part 1, we reviewed the most important issues facing the system programmer and DBA when planning for ERP package implementation. In Part 2 we addressed DB2 subsystem configuration issues including sizing the catalog and directory, virtual storage management, log management, and recommended ZParm settings.

In this last section of our three-part series we cover the directions in which enterprises must alter their monitoring and tuning efforts. Poorly-tuned ERP package instances are characterized by long (perceived) transaction turnaround times and long I/O wait times. You must analyze the monitoring data and deduce the underlying root causes of these symptoms.

First, some basics.

Monitoring the ERP System

Monitoring an ERP application suite involves several basic measurement techniques. Since the package probably co-exists with other applications within a DB2 subsystem or data sharing group, you must find a way to separate ERP package performance measures from those of the other applications; likewise, when measuring subsystem performance, you must isolate the effects of the package from the effects of everything else.

Creating Measurement Processes

There are three things to remember when creating measurement processes:

1. Implement Automation. One-time measures are certainly useful – performance 'snapshots' are sometimes essential for debugging performance problems; however, an ERP package contains many thousands of objects and hundreds of processes (or more). You need a process that develops meaningful measures that allow you to get overall status information as well as the ability to drill down to the details. To do this you must find the means to automate both the data gathering and measurement processes.

Many performance measurement products have features for automating either data gathering or performance reporting.

2. Save Historical Data. It takes time to develop a complete picture of ERP package behavior. It is unrealistic to believe that you will be able to do a good job of monitoring and tuning within a few weeks — or months. You must gather data on a regular basis and store it for later processing. (Relational databases are great for this!) Then, over the course of time, you can do trending analysis and capacity planning. Historical data is essential for regression testing; for example, if you are considering making major performance changes you will want a set of measures both before and after the changes to see if there was any improvement.
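The historical store the text recommends can be as simple as one table in a relational database. Here is a minimal sketch using SQLite; the table and function names (`perf_history`, `record`, `trend`) are illustrations of the approach, not part of any DB2 or ERP tooling.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a small relational store for performance history."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS perf_history (
                        capture_date TEXT,
                        metric       TEXT,
                        value        REAL,
                        PRIMARY KEY (capture_date, metric))""")
    return conn

def record(conn, capture_date, metric, value):
    """Store one automated measurement, e.g. daily DBM1 storage in MB."""
    conn.execute("INSERT OR REPLACE INTO perf_history VALUES (?, ?, ?)",
                 (capture_date, metric, value))

def trend(conn, metric):
    """Return (date, value) pairs for trending, capacity planning,
    and before/after regression comparisons."""
    return conn.execute("""SELECT capture_date, value FROM perf_history
                           WHERE metric = ? ORDER BY capture_date""",
                        (metric,)).fetchall()
```

A nightly automation job would call `record` once per metric; the `trend` output is what you graph and what you compare before and after a major performance change.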

3. Develop graphical reports. A spreadsheet filled with numbers will not be enough. You must decide on the best way to present your data in order to interpret it correctly. Bar charts, scatter diagrams, and pie charts can all be used to good effect.

What to Monitor; What to Measure

There are two general categories of measurements: system-related and application-related. These categories overlap somewhat, but generally speaking system-related measurements are at a high, or global level. Applications demand work and resources from the DB2 subsystem (such as data storage and retrieval) and the subsystem then performs work on the applications' behalf.

Some common system-related measurements are:

      • Virtual pool threshold attainment
      • Sortpool, EDM Pool, Rid Pool filling or failure
      • Logging rate and Log Write performance
      • Global buffer pool usage
      • Subsystem address space CPU usage

Application-related measurements are more detailed and specific. They deal with suites of SQL statements, either separately or taken in sets. Here, the most common measures deal with I/O:

      • Buffer Pool hit rate (or page residency times)
      • Waits for synchronous I/O
      • Prefetch rates
      • Hardware Cache usage
      • Index efficiency
      • Data access paths

Measurement Classes

There are also two classes of measurements that you will need: measurements of resources (which may be overused, underutilized, or constrained) and measurements of behaviors (which may be symptoms of underlying problems).

Resources:

      • Memory: DBM1 Address Space (virtual storage constraints); EDM Pool, Sort Pool, RID Pool; Virtual Pools (page residency)
      • CPU
      • I/O Subsystem
      • Network

Behaviors:

      • Locking
      • Compression
      • Work File Extents
      • Deadlocks, Timeouts

The difficulty in dealing with behaviors is that they are symptoms only; they are not the problems themselves. It is easy to fall into the trap of thinking that you have isolated a problem to a few SQL statements. True, making changes (adding OPTIMIZE FOR n ROWS, re-designing indexes, running multi-column RunStats) may increase statement execution speed; however, the net result may be that you have obscured or totally missed the underlying problem(s).

This is why you need to gather measurements from all classes: resources and behaviors, system and application.

In order to better understand what you are measuring, here's an example from one installation in the early stages of an ERP package tuning effort. Table 1 shows a cross-reference of their measurement categories and classes with specific measurements. Table 2 shows some typical tactics and tuning efforts for each quadrant.

 

Resources / System-related:
      • Memory: DBM1 Addr Space (total), SortPool, RID Pool, EDM Pool
      • CPU: Addr Space Usage
      • I/O Subsystem: Log Writing, Pageset Placement, Global BP Sizing

Resources / Application-related:
      • CPU: CPU used by SQL stmts, Class 1 & Class 2 Times
      • I/O Subsystem: Pageset Placement, Hardware Cache Usage, “xxLARGE” Tablespaces
      • Network: 3-Tier Configuration, DB2 Connect Settings

Behaviors / System-related:
      • Compression: Hardware Compression
      • Work Files: Total Size, Bpool Assignment, Bpool Thresholds, Physical Separation
      • Locking Behaviors: IRLM Parms, Incidence of Lock Escalation

Behaviors / Application-related:
      • Compression: Selection of Tablespaces
      • Access Paths: Access Path Analysis, Index Selection
      • Locking Behaviors: Incidence of Deadlocks, Incidence of Lock Escalation, LOCKSIZE settings

Table 1: Measurement categories and classes for an ERP installation.

Behaviors / System-related:
      • Memory: Pool Re-sizing
      • CPU: Forbid Parallelism, Remove Compression
      • I/O Subsystem: Catalog/Directory Reorg, Consider HiperPools, BP Thresholds, LOGLOAD/CHKFREQ
      • Network: Re-Configure DB2Connect

Behaviors / Application-related:
      • CPU: No Compression for “small” tables, Access Path Tuning
      • I/O Subsystem: Implement Data Spaces, FREEPAGE, PCTFREE

Resources / System-related:
      • Work Files: Enlarge
      • Locking Behaviors: NUMLKUS, NUMLKTS, IRLMRWT, DEADLOK
      • Compression: OFF for Small Tables

Resources / Application-related:
      • Access Paths: BIND Acquire and Release, Partitioning, Optimizer Hints, Parallelism, RunStats (multi-column)
      • Locking Behaviors: Row-level Locking, Large PCTFREE; Reorg

Table 2: Typical tuning efforts by measurement category and class.

So, What Should I Measure?

For ERP systems there are three initial areas to emphasize, depending upon data and transaction volumes. These relate to the package's use of memory and the I/O subsystem:

      • DBM1 Address Space total real storage
      • the EDM Pool
      • the virtual pools (buffer pools)

The DBM1 Address Space

The DBM1 address space contains many areas of memory called pools essential to proper ERP functioning. Among these are the Virtual Pools, EDM Pool, Sort Pool, RID Pool, and others. (The composition of the DBM1 address space was covered in detail in Part 2 of this article). As recommended in Part 2, you should construct a spreadsheet that lists your various storage allocations.

You can estimate usage based on ZParm settings for most of these pools; however, in a memory-constrained environment you should use whatever tools you have available to get real-time storage measurements. For example, it is possible to use DSNWDMP (described briefly in the Diagnosis Guide) to dump the DBM1 address space and get actual pool size measurements. Other third-party tools may have summary or detail statistics you can use.

The important things to do (as mentioned at the beginning of this article) are to:

      • Automate the measurements
      • Retain historical data
      • Implement graphical reports

So, you can begin with a simple daily (or hourly) measurement of total DBM1 address space real storage usage. Store the measurements over time, and graph the result. You will probably find that memory usage increases sharply on DB2 subsystem startup. This should be no surprise; many of the pools are allocated only on first use; also, some control blocks stored in DBM1 are related to datasets that are not open yet.

Ensure that ZParms and other changes are controlled (SRTPOOL, MAXRBLK, MAXKEEPD, EDMBSPAC, EDMPOOL, DSMAX, CTHREAD, MAXDBAT, buffer pools, and so on). The settings for these and other ZParms will affect memory allocation in the DBM1 address space.

As the production week rolls on, your DBM1 real storage graph should stabilize, particularly as ERP package functions are executed. For example, if your package includes a payroll function, you should see real storage allocations jump as payroll begins running.

Over time, if you do not see real storage usage stabilize, it is possible that there is a DB2 subsystem problem. Check your maintenance level and refer to IBMLink for APARs related to storage issues. In particular, see Informational APAR II10817, “DB2 R510 R610 R710 Storage Usage FixList.”

One last note. While the current limit for address space real storage size is 2 GB, the effective usable real storage for the DBM1 address space is closer to 1.6 GB. Any total storage usage that approaches this size may signal an impending DB2 subsystem shutdown! Review your DBM1 storage spreadsheet for high water marks. If possible, implement some real-time monitoring (some third-party tools have this capability) with a notification to your group should the total DBM1 storage exceed 1.4 GB. Should this happen, you may need to act immediately to reduce DBM1 storage usage (reduce some bufferpool sizes, reset some ZParms dynamically, and so on).
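The alerting logic just described is simple enough to sketch. The 1.4 GB warning point and the 1.6 GB effective ceiling come from the text above; the function and threshold names are our own, and a real implementation would feed this from whatever monitor supplies the DBM1 storage figure.

```python
GB = 1024 ** 3
EFFECTIVE_LIMIT = 1.6 * GB   # usable DBM1 real storage below the 2 GB limit
WARN_THRESHOLD = 1.4 * GB    # point at which the text suggests notifying your group

def dbm1_status(total_bytes):
    """Classify a DBM1 total-storage measurement against the thresholds above."""
    if total_bytes >= EFFECTIVE_LIMIT:
        return "critical"    # impending subsystem shutdown is possible
    if total_bytes >= WARN_THRESHOLD:
        return "warning"     # reduce bufferpool sizes / reset ZParms soon
    return "ok"
```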

The EDM Pool

As seen in Part 2, the EDM Pool contains database objects such as database descriptors (DBDs) and, if you have implemented dynamic SQL caching, the dynamic SQL cache. Most third-party tools have a process for measuring the EDM pool contents, including the percentage of the pool taken up by each object category.

Begin with this and (as above) map your EDM Pool usage over time. Pay particular attention to the ratio of Pages Already in Pool to Requests for Pages for each object category. If these ratios are all above 99 percent, then, in general, your EDM Pool is functioning well: whenever DB2 needed a database object or a prepared SQL statement, 99 percent of the time the object was already in the EDM Pool and no I/O to disk was required to fetch it.
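The ratio arithmetic is worth making explicit. Below is a sketch, with the counter values assumed to come from your monitor or a DB2PM statistics report; the function names and the dictionary layout are illustrative only.

```python
def edm_hit_ratio(requests_for_pages, pages_already_in_pool):
    """Percentage of page requests satisfied from the EDM Pool."""
    if requests_for_pages == 0:
        return None  # no activity yet for this object category
    return 100.0 * pages_already_in_pool / requests_for_pages

def edm_pool_healthy(stats, threshold=99.0):
    """stats maps object category -> (requests, already_in_pool).
    True if every category meets the 99 percent rule of thumb."""
    ratios = [edm_hit_ratio(req, hit) for req, hit in stats.values()]
    return all(r is not None and r >= threshold for r in ratios)
```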

The dynamic SQL cache needs to be treated a bit differently. ERP packages typically implement much of their functionality with dynamic SQL, so good performance will necessitate that you cache the prepared versions of these statements. The global dynamic cache is allocated when ZParm CACHEDYN is set to YES. Based on your DBM1 storage allocation report you may consider your environment to be virtual storage constrained. In that case, consider setting ZParm EDMDSPAC to YES. This will place the global cache in a Data Space, thus outside of the DBM1 address space. (Note that you must also set additional ZParms that control the size of this data space; these ZParms all begin with “EDM”.)

One final note on the EDM Pool. There seems to be some difference of opinion on the Pages in Use statistic, especially when it is near 100 percent. Many experts recommend sizing the EDM Pool to allow a certain percentage (say 10 percent) of the EDM Pool to be unused. Theoretically this allows for a cushion in case a long-running (and not committing) transaction is responsible for holding object definitions in the EDM Pool for an extended period.

Continue to measure your EDM Pool usage over time. As the maximum size stabilizes note the high-water mark. If the pool is near 100 percent used, increase its size gradually to allow for about 10 percent non-used. Continue to monitor over time. A sudden jump to 100 percent used may signal the appearance of a rogue application or process.

The Virtual Pools

With most ERP package processes being I/O-constrained, the virtual pools deserve special attention. We have already covered virtual pool sizing and object assignments in Part 2; here, we will concentrate on measurements.

First, it is assumed that you have assigned objects to the virtual pools based on activity (random access, sequential access, and so on). (You can verify these assignments by a simple query against SYSTABLESPACE and SYSINDEXSPACE.) Now you will need to collect measurements of virtual pool use. You can begin with the –DISPLAY BPOOL . . . DETAIL command if you wish; alternatively, there are several IBM and third-party tools available for measuring and tuning buffer pools.

Your initial measurements should concentrate on when and how often thresholds were exceeded. After that you can continue with hit ratios. Threshold settings are described in the DB2 Administration Guide, including how best to determine them for random and sequential workloads.
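After thresholds, the hit ratio is the other core virtual pool measure. The sketch below uses one common definition (the fraction of getpage requests satisfied without any read I/O); counter names are illustrative stand-ins for what -DISPLAY BPOOL or your monitoring tool reports.

```python
def bpool_hit_ratio(getpages, sync_reads, async_pages_read):
    """Percentage of getpage requests satisfied without a read I/O.
    async_pages_read covers prefetch (sequential, list, dynamic)."""
    if getpages == 0:
        return None  # pool saw no activity in the interval
    pages_read = sync_reads + async_pages_read
    return 100.0 * (getpages - pages_read) / getpages
```

Trending this per pool, alongside the threshold counts, shows whether a pool assigned to randomly-accessed ERP objects is actually keeping its pages resident.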

A complete discussion of virtual pool measurements and tuning is beyond the scope of this article. I recommend you begin with the materials on virtual pool tuning in the guide; for additional information, consider a trial of one of the tuning tools mentioned earlier.

Tuning the ERP System

Most ERP package vendors provide product tuning recommendations, usually in the form of white papers or documentation downloadable from their website. In addition, there are many user groups and conferences (like the International DB2 Users Group) where presentations on tuning are given.

I will not cover all of these issues in this article -- you should research this on your own. Instead, I will provide you with a point at which to start developing your tuning process.

Start Here

In an ERP package environment the two most common reasons for “tuning” are:

      • Long perceived transaction turnaround time (“response time”)
      • Poorly-performing or long-running SQL statements

Since these reasons are both classified as behaviors, most tuning efforts tend to concentrate on application-related behaviors. The primary culprits are assumed to be poorly-written SQL statements, poor index design, and lack of good data distribution statistics. While these may be good places to start your investigation, remember that you want to determine and fix the underlying cause of the behavior, not just the behavior itself.

The starting point for tuning poor application behavior is usually “problem” SQL statements and index construction and selection. There are many good references and guides available for beginning this process, both from vendors and consultants. Meanwhile, let’s look at the system-related behaviors.

ERP Package System Behavior

For most ERP packages, tuning issues typically revolve around two things:

      • Minimizing end-to-end perceived transaction turnaround time (“online response time”)
      • Reducing data-intensive SQL operations (DASD I/O tuning)

For the first of these, the most common issues revolve around n-tier hardware and software configuration. To address them you must measure and monitor network flows, and document and confirm proper configurations. This typically involves things like TCP/IP tuning, DB2Connect parameter settings, and the like. Regrettably, much of this will be beyond your control.

To do proper DASD I/O tuning you must be aware of your DASD configuration. Many shops have moved away from standard “round, brown, and spinning” hardware to RAID configurations. We won't cover specific technology tactics here; instead, we will concentrate on what to measure.

As stated before, ERP Packages are usually I/O-bound. Further, much of the I/O time is spent waiting for synchronous reads. These are reads issued by DB2 on behalf of an ERP process, usually an SQL statement, during which the process must wait for the read to complete. This is in contrast to asynchronous reads, such as prefetch, that can execute independent of the SQL statement or process.

Reducing I/O Wait Times

Much of your initial ERP package tuning efforts will be concentrated on reducing waits for synchronous reads. The most common reason for such a wait is an SQL statement that requires access to a table in a non-clustering sequence (usually through an index that has a low ClusterRatio). Thus, reads to the table are random, hopping from one page to another across the tablespace. While this application-related behavior can usually be fixed (perhaps by re-clustering, access path tuning, index changes, and so on) we first need to know that it is happening.

The best source for synchronous I/O wait times is the SMF records produced by DB2. (While it is possible to get a general idea of DASD I/O times from such things as RMF, these tools don't really give an accurate picture of DB2 I/O. This is because the DB2 read and write engines function independently of SQL and other processes. This means that synchronous and asynchronous I/Os will be combined in any reporting.) There are many tools available for gathering this data. The most common seem to be the Accounting Detail and Statistics reports available with the IBM DB2 Performance Monitor (DB2PM). Again, there are also several third-party tools that can produce many of the equivalent reports.

Begin your tuning efforts with the Accounting Detail report, and gather measurements of Class 3 Sync I/O Wait Time and Sync Read I/Os. Take the ratio of wait time to I/Os. The rule of thumb according to the DB2 Administration Guide: less than 40 ms per I/O is good.
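That ratio check can be sketched directly. Field names follow the text (Class 3 sync I/O wait time, sync read I/Os), and the 40 ms limit is the rule of thumb cited above; everything else here is an illustrative wrapper around the arithmetic.

```python
def avg_sync_wait_ms(class3_sync_wait_sec, sync_read_ios):
    """Average synchronous read wait, in milliseconds per I/O."""
    if sync_read_ios == 0:
        return None  # no synchronous reads in the interval
    return 1000.0 * class3_sync_wait_sec / sync_read_ios

def within_rule_of_thumb(class3_sync_wait_sec, sync_read_ios, limit_ms=40.0):
    """True if the average wait beats the Administration Guide's 40 ms guideline."""
    avg = avg_sync_wait_ms(class3_sync_wait_sec, sync_read_ios)
    return avg is not None and avg < limit_ms
```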

If you are experiencing longer average wait times, you don't have many choices at the system level. Typical responses are:

1. Reallocate Virtual Pools to a larger size

The problem may lie in the DASD subsystem. A larger buffer pool may allow page residency time to increase, reducing the number of read I/Os. You can also consider allocating Hiperpools or Data Spaces for the virtual pools, as these will reside outside the DBM1 address space. Have your operating systems group monitor system paging.

2. Reduce dynamic prefetch activity

Consider more frequent commits, and use OPTIMIZE FOR 1 ROW in certain SQL statements.

3. Research dataset placement

In a RAID environment it may no longer make sense to “physically separate” tablespaces from indexes, since these devices simulate MVS volumes at a logical level. However, some hardware implementations have the capability to “pin” datasets or volumes in the RAID cache, basically making the datasets resident in memory on the device. This should reduce waits for I/Os, although it is expensive in terms of hardware usage.

Once you have pre-diagnosed that your wait time(s) are high you can use RMF reporting to get statistics at the volume level to see if there is a DASD subsystem problem.

Your next step will be to look at the following:

      • I/O Rates (GetPages)
      • CPU Usage
      • Elapsed Times; In-DB2 Times
      • Logging

The DB2 Administration Guide has additional tuning suggestions regarding these items.

Additional Subsystem Tuning

On a subsystem-wide note, you need to determine if your total DB2 subsystem is read-intensive or update-intensive. To do this, research your normal DB2 logging rate. Print a copy of a recent Bootstrap Dataset (BSDS) and pick a succession of archive logs that were created during ERP package execution. (Warning! Most of the times in the BSDS are in GMT, not local time. Check the administration guide and the comments at the beginning of the BSDS report to determine this.)

Note the begin and end RBAs and times of the Archives and use subtraction to determine elapsed time and RBA range. Divide the total RBA size by the total elapsed time to determine logging rate.

A general guideline:

      Logging less than 1 MB/sec:   read-intensive DB2 subsystem
      Logging more than 1 MB/sec:   update-intensive DB2 subsystem

Note that one MB/sec is the equivalent of about 80 cylinders of 3390 DASD per minute.
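The RBA subtraction described above is easy to get wrong by hand, since RBAs in a BSDS print are hexadecimal byte offsets into the log. The sketch below assumes the timestamps have already been converted from GMT to a common base; function names are illustrative.

```python
from datetime import datetime

MB = 1024 * 1024

def logging_rate_mb_per_sec(begin_rba_hex, end_rba_hex, begin_time, end_time):
    """Logging rate across a run of archive logs: RBA range / elapsed time."""
    rba_bytes = int(end_rba_hex, 16) - int(begin_rba_hex, 16)
    elapsed_sec = (end_time - begin_time).total_seconds()
    return (rba_bytes / MB) / elapsed_sec

def classify(rate_mb_per_sec):
    """Apply the ~1 MB/sec dividing line from the guideline above."""
    return "update-intensive" if rate_mb_per_sec > 1.0 else "read-intensive"
```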

If your subsystem is update-intensive then you need to consider additional monitoring and tuning. Here are the usual tactics used in this situation. Many of these are not possible in an ERP package environment (re-arranging table columns, for example); however, it is more and more common to implement ERP packages in non-exclusive DB2 environments. Consider the following for your DB2 subsystems as a whole:

      • Keep frequently-updated columns close to each other
        Help reduce logging volume by keeping frequently-updated columns physically adjacent in table definitions; the DB2 Administration Guide covers logging and such tuning tactics in great detail.
      • Reconsider COMPRESS YES
        The cost to decode the row, do the update, and re-encode the row may be significant in an update-intensive subsystem. There is also the overhead of placing rows in overflow. Consider setting COMPRESS to NO for tables that are small in total pages.
      • Analyze Log Write performance for possible bottleneck(s)
        It is possible that your logging process is the bottleneck in an update-intensive system. Check the MSTR address space logs for “Log Full” messages; consider increasing the size of the log buffers (the OUTBUFF ZParm); analyze DASD I/O times for your log volumes.

Summary

I certainly hope that this three-part series has been valuable to you. Should you have any questions, feel free to contact me. Good luck!

--

Lockwood Lyon is a Systems and Database Performance specialist in the Midwest. He has over twenty years of experience in Information Technology as an IMS and DB2 database analyst, systems analyst, manager, and consultant. Most recently he has spent quite a lot of time on DB2 subsystem installation and performance tuning. Lockwood is the author of MIS Manager's Appraisal Guide (McGraw-Hill, 1993), Migrating to DB2 (John Wiley & Sons, 1991), and The IMS/VS Expert's Guide (Van Nostrand, 1990).


Contributors : Lockwood Lyon
Last modified 2006-01-04 03:21 PM