Fundamentals of Trialing ISV High-Speed Utilities

by Dale Franklin

Pulling it all together and avoiding the pitfalls: a checklist for those planning to do in-depth performance comparisons of various vendors’ high-speed utilities (DB2 or IMS)

After more than 32 years in the software business, I’ve approached the issue of trialing third-party, high-speed utilities from three different perspectives:

      • As a customer (several times; different companies)
      • As a software engineer, writing and maintaining utilities
      • As a performance analyst, running a mainframe utilities performance lab

This article provides a review of the basics of this subject and, perhaps, some guidelines for better managing your production system for optimal performance.

Let’s begin with the assumption that you’ve already identified why you need a high-speed utility and how it alleviates your ‘pain,’ and that you’ve financially justified purchasing it. Now you must prepare to contact and evaluate solutions from two or three Independent Software Vendors (ISVs).

Your focus might be on speed only. But there are other important issues to consider. This article will cover the following:

      • Strategic:

Considering an Independent Software Vendor (ISV)

Benchmark Stats and Graphs from ISVs

Program Architecture

      • Tactical:

Document and Agree on your Objectives

Test Scenario Planning

Multitasking

Controller Overload

Load Distribution

Size of Testbed

“Wait-Time” Compensation Software

Iterative Runs

Apples-to-Apples

Apples-to-Oranges

Random vs Sequential Updating (while utility is running)

Documentation and Processing SMF logs

In Summary

Considering an Independent Software Vendor (ISV)

When you plan to make a significant expenditure on ISV software, you can’t be sure what you’re buying without doing some research and gathering testimonials yourself by talking with other users directly.

In time, a positive consensus opinion among those users you’ve contacted may inspire your confidence. If the consensus is mixed, you should probably dig deeper to sift out opinion from fact. In short, if you talk to enough users, you will notice a trend (favorable or unfavorable).

One of my former ISV employers wanted to enter the DB2 utilities market and purchased a series of products written by another ISV that was relatively unknown at the time. Although brilliant in functionality, those products unfortunately performed poorly and were unreliable and unstable.

As a result, our management decided to ‘adopt’ the ISO-9000 standard so our marketing department could tell potential customers that they “conform to the ISO.” The question is, were those products developed from inception under ISO processes? Some reputable ISVs even exceed the ISO standard.

Before it even becomes necessary to buy products from an ISV, do some groundwork. Always be inquisitive about this topic at meetings and at technical conferences, whether you’re in the market to buy or not. Qualify the individuals you talk with - some independent “consultants” receive a commission if they can talk you into buying an ISV’s product, so watch for bias and motive.

It’s also a good idea to observe both the vendor and the customers: Who brags but doesn’t deliver on time? Who is making “desperate” deals? They might deliver now, but all too soon, you might find yourself painted into a corner. Does the latest release merely tolerate, or actually exploit, the latest DBMS release?

What has been the ISV’s financial profile over the last three to six years? Although they all have extremely impressive color brochures that preach superior innovation, can they deliver on quality? Failing to verify that might lead to the ugly reality of being kicked out of bed at 3:00 a.m. by a phone call from the operator.

Some ISVs will do anything to run just a few seconds faster -- even compromise the customer’s data. Most utilities must access the catalog to get the statistics and parameters of the object they are working with before they begin crunching. Instead of issuing SQL calls, they do native VSAM reads to the catalog (for speed). This means they must keep up with the row layout for each release or maintenance level.

Consider the scenario in which an ISV just plain “missed it” -- reorganized a customer’s V2.3 tablespace and compressed it. Since this version did not support compression, the utility obviously was off a bit (or two). You can imagine what happened to the customer’s applications when they read a compressed row (real garbage in -- garbage out).

A few ISVs lean toward exploiting system memory as if it were disk while insisting that they “do not invoke the sort” or “don’t do sorts.” However, it seems self-evident that if you reorganize, you will need to rearrange rows in order. Whatever label is used, it’s still a sort (including mini-merges). In short, ISVs who use this practice will GETMAIN as much storage as possible and do lots of quick sorts and merges so that when they are finished, the space is 98 percent organized.

But the problem is, depending on the size of your testbed and the level of disorganization, performance with this kind of utility exponentially erodes. And if your testbed is small and fairly well organized, then the ISV utility will perform impressively.

When it comes to innovation, there may be some merit to the saying, “Yesterday, the big ate the small; today, the fast eat the slow.” But it doesn’t mean you should ignore the obvious. Utility technology is a complex business and needs a stable company with a good track record, one that acquires, retains, and develops the very best people as well-compensated engineers, not highly paid mechanics. To ignore this might mean you’ll get the “privilege” of explaining to your upper management why reorg dropped some columns (a situation that’s happened more than once).

It’s true that software problems are inevitable; however, why accelerate the ‘inevitable’ with unethical, archaic, or incompetent design?


Benchmark Stats and Graphs from ISVs

You are likely considering certain ISVs because you’ve obtained some benchmark statistics. The term ‘benchmark’ is hotly debated. Almost everyone has an emotional opinion, and here’s why: some ISVs will blatantly skew their test scenarios with excessive memory, a small testbed, tiny rows, data that is 98 percent organized to begin with, and so on.

Doing this generates outrageously good results, instead of using test scenarios that reflect the “middle of the road” situations most of their customers are likely to have. Such unethical behavior hides behind a disclaimer of “results may vary in your environment … etc.” On the other hand, keep in mind there are some very good vendors who intend to maintain their credibility and customer trust.

The credibility of a benchmark is directly proportional to the number of technical statistics the ISV is willing to volunteer. Some of these statistics are so outrageously skewed (in the ISV’s favor) that they become laughable. Reminder: if you are looking at an Excel graph, you positively must see the zero-zero X-Y axis, or else the chart is meaningless.

Say, for example, there are two lines on the wall. One is 100 inches long and the other 101 inches. From a distance you can’t tell them apart because you are seeing the whole picture. However, if you were shown only the last three inches, then one would appear 33 percent longer than the other. Unfortunately, some ISVs do the latter, so be vigilant and challenge them.

One fact omitted may throw a wrench into the credibility of the statistics (like buffer allocation and values). Suppose you found out that very short rows were crammed into the pages so the ISV could boast of reorganizing 80 million rows? You might say, “We’d never run a benchmark with those parameters!” When I view any ISV’s benchmarks (in a presentation or in a document), I expect to see the following:

      • Hardware Profile:

- Stand-alone or shared devices
- Type of CPU
- Amount of memory
- Type and number of controllers
- Type of RAID

      • Software Profile:

- Operating system release
- DBMS release
- ISV(s) release

      • Testbed Profile (for each testbed)

- Configuration (partitioned, non-partitioned, etc.): # of parts
- Number of pages/blocks per part (showing computed cyls)
- Number of rows (showing computed rows-per-page)
- Average size of row (including prefix)
- For IMS, whether FTB and LTB pointers are used, and the hierarchical map
- Number of NPIs or secondary indexes. For each index, you should know the following: how many source columns make up the index, the index record size, cardinality (uniqueness), number of index records and number of cylinders.
- Testing Scenario description
- Buffer assignment and values
- Number of times benchmark was done (if not stand-alone)
- Brief description of the context of the test. For example:
- partial online reorg of 3-of-12 parts @ 20 RANDOM updates per second across the entire 12 parts
- recover to current: inputs=full i/c, 500K random application updates, partial i/c, 500K sequential updates, partial i/c, 500K mix of random and sequential updates
- full image-copy of 1,000 datasets with CPU pegged and maintained at 97% by 5 applications computing prime numbers

Such information received voluntarily from an ISV enables you to interpolate and see whether this benchmark could be applicable to your installation. Executives might skip this, but DBAs are trained and skillful in spotting fraud when they see it. Moreover, the ISV that volunteers this information shows it has nothing to hide and is sincerely trying to emulate your production environment. They have this information anyway; making it available to you is the right thing to do. This technical nomenclature is easily presented in two or three Excel or PowerPoint slides.

Obviously, the reputable ISV will try to hit the “middle of the road” testbed configuration (what they presume their potential customers have). You’re not likely spending big bucks for high-speed utilities unless you have substantial amounts of data. A benchmark of 2 million rows would probably be laughed at. Alternatively, a benchmark with 2 billion rows might seem very suspicious to you.


Program Architecture

Some ISVs may, unfortunately, utilize obsolete and archaic architecture left over from the early 1980s. Such architecture was built around the DBMS at that time and has not been updated to reflect subsequent advances in hardware and software.

In DB2 V1.2, the catalog consisted of about 30 files. In V6, the catalog contains about 60. A certain migration tool was designed to read most (or all) of the catalog into memory to avoid encumbering the catalog. Today it must GETMAIN and GETMAIN and GETMAIN, and may easily cause some of your production programs to abend.

How current is the software you intend to buy? Some products have so many fixes (i.e., patches and zaps) that they have patches on patches on patches. It may be a good idea to look at a product’s history to get a more realistic view of the obstacles you may encounter.

As Y2K taught us, legacy software does not go away, even when it was written in the 1960s. Some products have been rearchitected, but most have been “enhanced” with lots of S0C4 patches.

For fun, I do underwater videography in some exotic part of the world every 6 months. For editing videos, I wanted some morphing software that would create dazzling transitions. After searching the Web I found three software vendors and asked them these questions:

When was the product originally written?

What is the current release?

When was that release GA?

One response to my questions read something like, “[the product] was originally written in 1982 and the current release is v1.1, GA in 1996. It will run on Mac O/S 6 or 7; anything else is unpredictable. Price $1300.” Needless to say, this reply did not inspire confidence, especially since the current Mac O/S is V.9 and V.10.

Getting technical support for mainframe EBCDIC modules is also a challenge, and obtaining fixes for products may be cumbersome. Some ISVs, for example, still don’t make use of the powerful functionality of the Internet to distribute fixes. Sometimes, after debugging the program’s logic, the developer must also code a corresponding zap and fax it to the customer. The customer must then code the zap -- hoping both he and the developer did everything correctly -- and apply the zap.

Not only does such a dated process cause superfluous work for the developer and for the customer, but it is also time-consuming. A more efficient way to apply a fix is for the customer to sign on to the Web, download the entire CSECT object (in about 5 - 10 seconds), transfer it to the mainframe, and link-edit it (case closed; without human intervention). The flipping of certain bits during the EBCDIC-to-ASCII-to-EBCDIC round trip is no longer an issue (and beyond the scope of this article). Some smaller ISVs have a policy: no zaps; they send you the entire updated program.
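To illustrate the “download and link” approach, here is a minimal link-edit job sketch. The dataset and member names (YOUR.DOWNLOAD.OBJ, ISV.PROD.LOADLIB, FIXCSECT, UTILMOD) are placeholders of my own, not any particular vendor’s procedure; your ISV’s instructions and libraries will differ.

//APPLYFIX JOB (ACCT),'LINK ISV FIX',CLASS=A,MSGCLASS=X
//*
//* Link-edit the downloaded CSECT object into the ISV load library.
//* All dataset and module names below are placeholders.
//*
//LINK     EXEC PGM=IEWL,PARM='LIST,MAP,XREF'
//SYSPRINT DD  SYSOUT=*
//SYSUT1   DD  UNIT=SYSDA,SPACE=(CYL,(1,1))
//SYSLMOD  DD  DSN=ISV.PROD.LOADLIB,DISP=SHR
//SYSLIN   DD  DSN=YOUR.DOWNLOAD.OBJ(FIXCSECT),DISP=SHR
//         DD  *
  NAME UTILMOD(R)
/*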

After careful consideration of what’s been discussed so far, it is now time to begin planning the details of your project.


Document and Agree on Your Objectives

Be very explicit about your objective, since you will likely distribute your findings to various levels of management for their approval. That objective is to determine the performance gap between the products.

Others within your immediate group may differ with you about these objectives. A consensus is desirable; if one cannot be reached, these issues should be resolved early in your endeavor.


Test Scenario Planning

Spend a whole day planning the specific scenarios (full, partial, etc.) and review this with your coworkers and staff. Do them in this order:

Full, partial, index

Full, partial, index

Full, partial, index

Full, partial, index

not,

full, full, full, full

partial, partial, partial, partial

index, index, index, index

(More detail on this is found in the following section, Size of Testbed.)

Do some trial runs with your incumbent utility and have the infrastructure in place (testbed restores) so that when you code up the ISV JCL, it’s just another run.

Ensure you have some quality time on the machine. Since you’ve done some preliminary tests, you should have a good estimate.

In planning the test scenarios, it is important that you take control: plan the testing scenarios yourself and own them (before the ISV consultants arrive). Some consultants are reputable, and some have a “plan” to make their product look unrealistically favorable with their scenario. You should be able to tell them you’ve already worked out the logistics with your technical staff and that you will appreciate their syntactic suggestions.

Don’t be hesitant to ask for help and input from your local tech services folks. Often, they can’t wait to show off their monitors and “stuff” ... so include them in your planning phase and remember to buy them lunch and credit them in your published results.


Multitasking

Multitasking is when two or more tasks run simultaneously to divide the workload so the same work is done in less time. To multitask, you must have multiple file objects for testbeds (the bigger, the better). Don’t waste your time benchmarking a single large object: most likely, all vendors will perform comparably (unless that single object has lots of indexes).

In the DBMS bundle (DB2 or IMS), there are two major components: the production engine that services the business applications and the utilities that usually run during off-peak hours. So when someone says, “Version-X is screaming!”, you need to assess which they are referring to. Given budget and time deadlines, it is reasonable to assume that IBM’s priority is in the production engine as this runs the business and has the highest visibility.

Sub-second response is vital for online applications when customers are waiting on the phone or ATM. Alternatively, if a utility runs another 5 minutes at 3:00 a.m., then who cares? Those who run the IT department care, since it becomes a cumulative load with hundreds of jobs doing thousands of maintenance functions in a short window.

Performance demands on IBM resulted in multitasking in the V.4 production engine. If a space was partitioned and the predicates were applicable, the query would run in parallel (multitasked). Unfortunately, too many shops still had non-partitioned spaces left over from when the pyramids were built. Some mature shops, though, have availed themselves of this technology by partitioning their large spaces. This makes sense: by partitioning, you:

Reorg only the part needing reorg (other parts remain online)

Recover only the relevant part (other parts remain online)

Have more hardware working for you simultaneously

Do not put all your marbles in one bag

Until DB2 could multitask for the applications (production engine), there wasn’t a big push to partition - but now there is. This has tremendous implications when it comes to utility performance.

Exceptions: for a single large object, some products allow multitask unloading via a parameter that tells the utility which keyrange/blockrange to assign each task. As the song says, “this could be heaven or hell” when it comes to performance. This “keyrange unload” is applicable for single datasets spanning multiple vols (and you just happen to know which keyrange is on each physical raid vol) so you won’t contend. (Let’s also throw in a level of disorganization.)

The bottom line is, multitasking is constrained with a single object, and all products will run in about the same time: you’ve shot yourself in the foot while believing you’ve done a fair performance evaluation. This is one of the most significant issues in your study. So select a large object with 10 - 20 partitions to reorg, load, or unload. If you are benchmarking COPY, then copy a thousand datasets; not just one.

Read the ISV documentation on multitasking. This information is usually near the end of the manual, under headings such as “performance considerations” or “asynchronous multitasking.” It would be advantageous to acquire the manuals for study a few weeks before the software is installed.

During the pre-analysis phase, a smart utility normally analyzes the object in 5 - 30 seconds before doing anything else as it assesses the scope of the load by calculating and assigning key ranges, memory, disk, and other resources including retrieving history from previous runs.

The ability to accurately distribute the load evenly over all tasks is the mark of true developer genius. This might comprise 50 - 60 percent of the product’s code, which is run only once during initialization. Ideally, all tasks should run about the same time, but sometimes this is not doable or practical.

If nine tasks are fired off, eight finish in two minutes, and the last runs for three hours, is this multitasking? Technically, yes. Performance-wise, no; and it is certainly not load balancing. Beware of ISVs that boast, “we do lots of multitasking.” You know what to look for on your monitors.


Controller Overload

Spread your database parts evenly over four (or more) controllers. Get a controller-to-volser map and plan the even distribution of your testbed parts including exactly where the testbed image copy (or DSN1COPY) files will be placed.

Aside from not allowing the utility to multitask, the next most significant impact on the outcome of a benchmark is the level of stress on the controllers. “Controllers” are hardware devices (3990 or 2105) that control the switching of data between disk/tape and CPU. One controller may have 10 - 15 vols under its control. When I use the term “controller,” I also imply its channels and paths (which vary in number per site).

When too much load is placed on a controller, performance drops 49 to 51 percent. This means 30-minute jobs run in an hour; 10-minute jobs require 20 minutes. This pertains to all the jobs using the controller, not just the job causing the slowdown. Controllers, according to my research, appear to operate either at 50 percent or 100 percent (nothing in-between). Obviously this skews your benchmark times artificially high.

I saw this phenomenon firsthand during one ISV benchmark I was conducting. One of my testbeds was 36M rows, 12 parts at two million 4k pages. Nine parts were under one controller and three were under another. I benchmarked an ISV’s reorg, and it was only three times faster than a competitor’s product. The ISV developers and consultants were perplexed. Even the tech services guys were crawling over their box and looking at all kinds of monitors and SMF data since they expected better.

It then occurred to me while driving home that maybe the controllers, channels, and paths were saturated. Next day I used DSN1COPY to move all 12 parts, indexes, and NPIs evenly over four controllers at three parts each. I changed nothing else. As a result, performance jumped from three times faster to eight times faster.

Once, when I designed some IMS testbeds (100M segments, 260 DBDs, 320 datasets), I spent an entire week studying and carefully distributing each generic group across the controllers. This extra effort resulted in one of the utilities running six to 15 times faster than the competition.

Usually in big shops, the DBA and disk administrators work in different departments and are at each other’s throats: the DBAs emphatically want to know how and where their data is distributed, while the disk administrators (who are typically unaware of the processing characteristics of a database) control the placement of data. They may insist, “Just use this generic parameter we give you ... and all will be well.”

It is my experience, however, that following this advice does not automatically ensure that “all will be well.” If the tablespace partitions are not evenly distributed across the controllers, a very serious performance problem will certainly occur. Aside from reorganization, the copy utility might copy a thousand files per night. If these files are not distributed evenly, the same kind of slowdowns occur (and usually nobody notices). With hundreds of files added annually, I believe that DBAs need autonomy when evenly distributing their objects.

If those of you who are DBAs do have autonomy in this arena, you should obtain a controller-to-vol map from your disk administration staff, decide which databases have performance priority, and spread them accordingly. You’ll reap benefits with both the applications and the utilities. I would be skeptical of statements like, “That’s not a problem any more; we now have hi-tech smarter raid and smarter controllers” (like SHARK raid and HITACHI raid ... which offer some relief).


Load Distribution

Obviously if the files vary radically in size, some tasks will finish early while others continue running for a long time, thus prolonging the outcome and making the utility appear artificially slow.

Granted, in the real world, there will be partitions of various sizes, but for the purpose of your research, selecting a testbed with good balance between the partitions (middle of the road) will better serve you in determining the “relative performance gap.” Remember, if you have a lopsided partitioned space, the DBAs will eventually balance it. If this is the object of your focus, it might be better to have this done sooner rather than later.


Size of Testbed

Your testbed should be big enough (at least 8GB) so it cannot fit into the buffers: the utility must be forced to do some serious disk reading and writing.

A small object (say, 100K pages or 3M rows) can easily fit into the buffers and wildly skew the elapsed time in the utility’s favor. Don’t be fooled by the number of rows. Utilities invoke VSAM macros that read pages and CIs into memory, not rows.

The number of rows is of secondary importance because once the page is read into memory, the utility then de-rows the page (or in the case of IMS, de-segments the block). The number of rows (and IMS segments) is relevant when it comes to actions like building indexes and recovery.

For benchmarking, you also need a variety of testbeds (three minimum). Two large partitioned objects and one non-partitioned object (with lots of indexes) provide a good start. With Excel as a documentation tool, carefully plan your benchmark scenarios for each. This is not trivial, since the scenarios multiply quickly.

If you intend to reorg, then plan on FULL, PARTIAL, and NPI on the partitioned, and FULL and INDEX on the non-partitioned. That’s eight benchmarks so far; double that for each ISV tested (which brings the number up to 16). You will then need to run them three to four times each to get the best time, which brings the number up to 64 benchmarks just for reorg alone. Oh, I forgot about online! Make that 128 benchmarks. Get the picture?

Ideally, the best benchmark testbed involves taking an image copy of your production object and laying it down in a testing environment (using OBIDXLAT).

If this is not allowed due to internal procedures about working with sensitive data, then there are several vendors on the market that specialize in reading your production image copies/unloads and scrambling the data. The problem with scrambling the data is that some vital production values are needed for a realistic benchmark.

If you intend to reorganize an index, you need production values (in those columns, or source fields) so the indexes reflect what’s in production. You need the same RI fields. In the case of IMS, you need the same LPCK, root key, and RMOD values so that they randomize the same.

Here’s another workaround: Do an unload and write a program that writes each row/segment to a flat file and, in the process, preserves the key and index fields. Space out (or asterisk) the rest of the data. Write every fifth row (or entire IMS record) to a second, “maintenance” file. Use a load utility for the bulk and an application to add the maintenance file, which disorganizes the space. Once RUNSTATS (or your database analyzer) indicates the disorganization level you want (80 percent is good), then image copy everything, including indexes (if you intend to do a partial reorg, and certainly for online reorg). If you’re on V.5 and can’t copy indexes, then DSN1COPY the indexes and data.

You need all of this in place and tested. Run a few trials with your incumbent utility and reinstate the testbed after each benchmark (however, for utilities like unload and load, you obviously needn’t bother). If you intend to reset the modification bits with COPY, then you’ve changed the testbed, and you should reinstate it. That DSN1COPY restoration parm is: PARM='CHECK,NUMPARTS(&PTS),RESET'
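As a minimal sketch of one restoration step (one part per step; the DB2 library, dataset names, and the 12-part figure are placeholders, not a prescription), the restore job might look like this:

//RESTP01  EXEC PGM=DSN1COPY,PARM='CHECK,NUMPARTS(12),RESET'
//STEPLIB  DD  DSN=DSN610.SDSNLOAD,DISP=SHR
//SYSPRINT DD  SYSOUT=*
//* SYSUT1 = the DSN1COPY backup; SYSUT2 = the part's VSAM pageset
//SYSUT1   DD  DSN=TESTBED.BACKUP.TS1.PART01,DISP=SHR
//SYSUT2   DD  DSN=DB2HLQ.DSNDBC.TESTDB.TS1.I0001.A001,DISP=OLD

Clone a step like this for each part and index, group the steps into several jobs, and submit the jobs in parallel, as discussed under “Apples-to-Apples” below.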

A good testbed size is at least two million 4k pages (or 8GB). Bigger is better.


“Wait-time” Compensation Software

“Wait-compensation” software is an ISV product that is supposed to subtract so-called “wait” times out of your benchmark job. Sometimes simple SMF data is used to compensate for the “wait” times. Some users don’t care how busy their machine is and trust this to emulate a stand-alone machine.

However, when it comes to asynchronous multitasking, this software generally cannot deliver on this promise. These products work well for:

Running business applications that don’t multitask, or

The developer who wants to focus on a specific issue in his program.

I strongly recommend against trying to “subtract out” wait times from your benchmarks. For example, assume your utility goes through a preprocessing analysis phase for load balancing (three master tasks). Before finishing, each of the three master tasks spawns three subtasks, so nine tasks run. If just one of the master tasks waits for something, it cannot spawn its three subtasks; the other six tasks run asynchronously until they reach a point where they become synchronous, waiting for the remaining three tasks to catch up.

What are you going to subtract out? You may think that because the job ran in one hour with 30 minutes of “wait” time, it would run in 30 minutes. Unfortunately, in a true stand-alone environment, where spawning tasks could run freely, it might run in 12 minutes.

Or let’s assume that you have two synchronous tasks that wait on each other. There might be some accuracy subtracting out wait times. But if they are asynchronous, you could have one task waiting for something for one hour and the other running independently, doing lots of work - and what would you subtract in that instance? It would be very easy to misstep, either way.


Iterative Runs

Plan on making iterative runs for each scenario and select the best times; forget about using “wait-compensation” software to subtract out wait times.

Think about monitoring some of the benchmarks, but remember, the monitor utility must not encumber the benchmark. If it does, then simply run again without monitoring. From both runs, you’ll have a good picture.

Unless you have a stand-alone laboratory for benchmarking, you will need to run your benchmarks several times (for each scenario). Don’t think because your mainframe CPU is stable at the 20 to 30 percent level that you have a quiet environment. Quite the contrary: backups are usually done after normal working hours and on weekends.

For purely I/O bound applications like BACKUP, there is relatively little CPU needed; however, during backups, the controllers become saturated. When this happens, jobs can take two to nine times longer to run even with a CPU that is stable at around the 15 to 20 percent level.

When benchmarking, there are two issues to be aware of: CPU and controller saturation. Both of these can severely skew your benchmark efforts. If you schedule stand-alone time, then make sure you are the only one signed on the mainframe and check periodically if other jobs are submitted “accidentally” by an automated job scheduler. Double-check by reviewing your SMF data.

Since you will be making iterative runs, I advise tuning the disk work (and sortwork) allocations in your JCL. My visit with the sort gurus confirmed what I’d found in my lab: in some cases, allocating too much sortwork can prolong the utility outcome. Each sort listing shows “tracks allocated” and “tracks used.” Many times there are multiple sort listings, so use the highest “tracks used” in your JCL.
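For example, if the highest “tracks used” figure on the sort listings were around 2,400 tracks, a sketch of the sortwork DDs might look like this (the 2,400-track figure is illustrative only, taken from your own listings, not a recommendation):

//* Size sortwork from the highest "tracks used" on the sort listings;
//* over-allocating can actually lengthen the sort phase.
//SORTWK01 DD UNIT=SYSDA,SPACE=(TRK,(2400),,CONTIG)
//SORTWK02 DD UNIT=SYSDA,SPACE=(TRK,(2400),,CONTIG)
//SORTWK03 DD UNIT=SYSDA,SPACE=(TRK,(2400),,CONTIG)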


Apples-to-Apples

To fairly compare two products’ performance, the testbed must be reinstated (byte-per-byte) every time after every benchmark. This reinstatement must include the indexes if you intend to do partial reorg or change modification bits during copy. Failure to do so will wildly skew the subsequent benchmark.

For instance, say you reorg a space that’s 20 percent cluster ratio, and after the reorg, it is 0 percent. If you don’t reinstate the disorganized testbed, then the subsequent reorg benchmark will reorg a space that’s already organized (done very quickly with no sort). Don’t laugh; it’s been done ... many times! In the case of partial COPY and if the mod bits are reset, a subsequent partial COPY can’t find any pages to COPY and the runtime is three to five seconds instead of three to five minutes.

Code several jobs to run in parallel to reinstate the testbed(s). Six DSN1COPY jobs running simultaneously for six to eight minutes will restore 2M 4k pages (spaces and indexes, 26 datasets in all).

In the case of an IMS KSDS, however, I’m not aware of any method to back it up at the CI level and preserve the disorganization. Specifically, I’m referring to maintaining the CI/CA splits. Standard and ISV image copies do the equivalent of an unload, so when you ‘restore’, it reloads your KSDS in a reorged state. DFDSS and IDCAMS export/import also lose the splits after a restore.

You MUST externalize your buffers before each benchmark. This means that you must stop/start your spaces/databases in a preceding jobstep. If half your indexes are in the buffers (from a previous run), this will thoroughly skew your next benchmark if you don’t externalize the buffers. If you cycle your DB2 ssid before each benchmark, obviously, this step is unnecessary.
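A minimal sketch of that preceding jobstep follows; DSN1 and TESTDB are placeholder names for your subsystem and testbed database, and the DB2 commands are issued through the DSN command processor in batch TSO:

//* Stop/start the testbed spaces so no pages remain in the buffer pools
//FLUSHBP  EXEC PGM=IKJEFT01
//SYSTSPRT DD  SYSOUT=*
//SYSTSIN  DD  *
  DSN SYSTEM(DSN1)
  -STOP DATABASE(TESTDB) SPACENAM(*)
  -START DATABASE(TESTDB) SPACENAM(*)
  END
/*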

Regarding determining CPU used: Some utilities work in concert with your DB2 ssid, so both utility and DB2 are accumulating CPU. To obtain this information, you’ll have to look in two places: the job listing and in the SMF log (for DB2). Or, if you have a fancy monitor (that also uses up CPU resources), you should look there.

If you’re serious about determining CPU use, I recommend that you recycle your DB2 ssid before each benchmark so that SMF times and utility times are in sync with the utility runs. If you don’t, this can get out of hand quickly when doing lots of benchmarks. The bottom line is, you should cycle DB2 before each benchmark, and when you are finished with all your benchmarks, then extract the SMF data.


Apples-to-Oranges

What usually distinguishes an ISV is when they use unique processing techniques, and usually, you, the customer, pay for this technology. Say that ISV-A does not multitask and ISV-B does. Does this mean you disable multitasking for a “fair” apples-to-apples evaluation? This can be argued either way.

For example, there are different techniques to do an online reorg in which the original data is copied to a shadow target. IBM uses a temporary RIDMAP translation tablespace to resolve old/new addresses. Alternatively, BMC Software uses XBM snapshot technology. Is comparing these two products that use two different techniques comparing apples-to-apples or apples-to-oranges? In my view, this argument is rendered moot if you give both ISVs the same amount of memory, and here’s how: Your catalog tables and indexes should already be using bufp-0. Assign the testbed tablespace(s) to bufp-1 and testbed indexes to bufp-2. Then, assign the RID translation map (both data and index) to bufp-3. Now give bufp-3 100M of space and XBM 100M. Both products may not use all the resources assigned, but you’ve leveled the playing field for a fair comparison.
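A sketch of the leveling step, assuming 4KB pools (25,600 buffers is roughly 100M) and a placeholder subsystem name of DSN1; the testbed tablespaces and indexes are assumed to have been assigned to bufp-1 and bufp-2 beforehand with ALTER TABLESPACE/ALTER INDEX ... BUFFERPOOL:

//* Give the RID translation pool (BP3) the same 100M that XBM receives
//SETBP    EXEC PGM=IKJEFT01
//SYSTSPRT DD  SYSOUT=*
//SYSTSIN  DD  *
  DSN SYSTEM(DSN1)
  -ALTER BUFFERPOOL(BP3) VPSIZE(25600)
  END
/*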

Another example is when smarter utility-A divides up a sort into smaller chunks for in-memory (hiperspace) sorts and utility-B doesn’t (one huge, long disk sort). One needs to be vigilant to determine what’s “fair” and what is superior processing technology.


Random Versus Sequential Updating

When it comes to benchmarking online-anything, give some careful thought to exactly how you intend to apply stress (in a repeatable way) with either sequential or random processing.

Until we have solid-state storage, disk speed is STILL an issue. If an application is updating the disk and a reorg utility is accessing the same disk, they contend for resources. In an online reorg, the update pattern of the applications has EVERYTHING to do with the performance outcome of the utility -- my lab research shows it affects utility speed by as much as two to six times.

Are the applications updating randomly or sequentially? Random updates cause the read-write heads to bounce chaotically while the reorg is also trying to read the same disk. On the target side, the read-write heads will also bounce chaotically (not as much), while the changes are applied from the log apply task.

Conversely, if the application updates are done sequentially, then changes are in physical block sequence and most of this is stored in cache. When full, the changes are written and applied on the same track (including the target). This generates very little relative movement.

I decided to test this theory (and affix some numbers to it), so I spent two months writing a throttling mechanism that was a STIMER macro manipulator subroutine. I could then precisely control the update rate of an application (with external governing parameters).

I tried about a dozen intermix formulas since performance of the DBMS is slow in the beginning (until the buffers get loaded) then speeds up -- radically. DBMS speed also varies widely depending on where the reorg utility is reading on the disk. Throttling is like riding a bucking bronco (slow, fast, slow, fast).

I then wrote a COBOL program that would do seven updates, one isrt, one delete, and one update of an NPI (secondary indexed) column ... then repeat.

To get random numbers within a certain key range, I copied the IMS randomizer and added a few lines of code. The output from this randomizer subroutine could then be converted and used for an ordinary SQL call. This is probably the only DB2 program in the galaxy that uses an IMS randomizer. Via a control parm, I could direct the program to do either sequential or random updates.

The program was also designed to work within certain key ranges (parameters) so that I could run several iterations of the same program simultaneously (necessary for high random update rates). Here’s what I discovered concerning the differences between sequential and random updating:

Sequentially, I could maintain and sustain 1800 updates per second (ups), and the reorg would finish (with LIMIT=1800). Randomly, it quit (longlog termination). This means the END of the log kept moving further away as the log apply task was reading batches of the log and applying them on the target (it couldn’t catch up). To my utter amazement, 1800 ups sequentially dropped to 30 - 50 ups randomly.

I attended a technical conference shortly afterward and talked with a frustrated DBA who had come to the exact same conclusion. Even with the next release of (product X) and the next hardware release of (product Y), it was clear that there is a phenomenal performance difference depending on whether random or sequential updates are being done during reorg. I noticed recover times varied widely for the same reason.

Most modern online systems are huge, so they are kept simple and random. Good examples are the systems that facilitate credit card (e.g., American Express), FedEx, UPS, and banking transactions: all require immediate, random processing.

Even if performance were to improve by three times (90 ups), this is still a far cry from 1800 sequentially. If you are a large user and know your minimum random ups rate is around 80 - 120 ups, you might try various ISVs to see if they can handle such a load. One ISV I benchmarked could sustain 170 ups randomly and 2100 ups sequentially.

In summary, when it comes to random versus sequential updating, the benchmark should represent your normal production environment. Otherwise, you could be significantly disappointed in the utility’s performance once deployed.


Documentation and Processing SMF Logs

In Excel, document each benchmark while it is in progress; don’t wait until the project is over. To document and process SMF logs, keep all your listings in a PDS for each product, since you’ll need to tie SMF times to utility run times.

When convenient, each listing should be carefully studied to make sure there are no “negative” messages and to ensure the run parameters you specified were invoked. For instance, for some utilities, DYNAMIC ALLOCATION (yes/no) might pertain only to primary databases (not work files). Sometimes it’s the reverse. Principal issues to document are:

Wall clock time

CPU time

EXCPs

Most ISV products work outside of DB2, so CPU and EXCPs (on the listing) are accurate. If you are benchmarking an IBM utility that works in concert with DB2, then you obtain the rest of the CPU from the SMF log. Be sure to include ____MSTR, ____DBM1 and ____IRLM tasks from your DB2. For this, SAS works well for me.

The command /I SMF will cause the SMF region to switch active logs (there are two) and begin archiving the now inactive log that contains your SMF data to a flat file. You then input this into your SAS job (//SMF DD statement). Check with your tech services staff to obtain the name of the archived SMF file.
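If your site does not archive SMF automatically, a sketch of a dump step using the standard IFASMFDP program follows. SYS1.MAN1 and the output dataset name are placeholders for your own; the TYPE(30) filter keeps only the records the SAS job below needs, and the resulting flat file is what goes on the //SMF DD statement.

//DUMPSMF  EXEC PGM=IFASMFDP
//SYSPRINT DD  SYSOUT=*
//INDD1    DD  DSN=SYS1.MAN1,DISP=SHR
//OUTDD1   DD  DSN=YOUR.SMF.FLAT,DISP=(NEW,CATLG,DELETE),
//             UNIT=SYSDA,SPACE=(CYL,(50,10),RLSE),
//             DCB=(RECFM=VBS,LRECL=32760,BLKSIZE=27998)
//SYSIN    DD  *
  INDD(INDD1,OPTIONS(DUMP))
  OUTDD(OUTDD1,TYPE(30))
/*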

I use these parameters in SAS to extract SMF data:

//SAS609 EXEC PGM=SASHOST,PARM='SORT=4 ',REGION=0M
//*
//*
//SYSIN DD * 
%INCLUDE SOURCLIB(TYPE30) ; 
DATA DALE2(KEEP=STRTDATE STRTTIME JOB CPUTIME ELAPSED) ; 
SET TYPE30_4 ; 
IF (JOB=: 'ssidMSTR' AND STEPNAME=: 'IEFPROC ') 
OR (JOB=: 'ssidIRLM') 
OR (JOB=: 'XBM') 
OR (JOB=: 'ssidSPAS') 
OR (JOB=: 'ssidDBM1' AND STEPNAME=: 'IEFPROC') ; 
STRTDATE=DATEPART(INITTIME) ; 
STRTTIME=TIMEPART(INITTIME) ; 
FORMAT STRTDATE DATE. STRTTIME TIME. ; 
HH=HOUR(CPUTM) ; 
MM=MINUTE(CPUTM) ; 
SS=SECOND(CPUTM) ; 
CPUTIME=(HH*60*60)+(MM*60)+SS ; 
HH=HOUR(SELAPSTM) ; 
MM=MINUTE(SELAPSTM) ; 
SS=SECOND(SELAPSTM) ; 
ELAPSED=(HH*60*60)+(MM*60)+SS ;
PROC SORT DATA=DALE2; 
BY STRTDATE STRTTIME JOB; 
PROC PRINT DATA=DALE2; 
VAR STRTDATE STRTTIME JOB CPUTIME ELAPSED ; 
RUN; 

Simply copy the above code and change 'ssid' to your DB2 subsystem ID.

Be sure to cycle your DB2 before each benchmark so CPU times are nailed to that benchmark. Create two SDSF batch jobs: one to start and another to stop your DB2. On the DB2 startup job, add a second jobstep that contains DSN SYSTEM (ssid) RETRY (30) to make sure it’s up. Your startup/shutdown cycle CPU varies from 1.4 to 2.3 seconds, depending on your machine. Do a couple of cycles to determine how much startup/shutdown CPU to subtract (according to your SMF extract data).
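A sketch of those two jobs follows (shown back-to-back for brevity; DSN1 is a placeholder subsystem name, and SDSF batch passes console commands prefixed with “/”):

//* Job 1: stop DB2
//STOPDB2  EXEC PGM=SDSF
//ISFOUT   DD  SYSOUT=*
//ISFIN    DD  *
/-DSN1 STOP DB2 MODE(QUIESCE)
/*

//* Job 2: start DB2, then wait until it accepts connections
//STARTDB2 EXEC PGM=SDSF
//ISFOUT   DD  SYSOUT=*
//ISFIN    DD  *
/-DSN1 START DB2
/*
//WAITUP   EXEC PGM=IKJEFT01
//SYSTSPRT DD  SYSOUT=*
//SYSTSIN  DD  *
  DSN SYSTEM(DSN1) RETRY(30)
  END
/*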

Also, consider putting some pizzazz into your published results: on your Excel graphs, use 3-D vertical bar columns and stack your CPU (in different colors) to differentiate utility CPU time from DB2 CPU time. I suggest two separate graphs: wall clock and CPU. Print them to see how your colors look in black and white.

In the case of DB2 EXCPs, since DB2 handles the I/O for some utilities, the EXCPs do not appear on the utility’s listing; you’ll need a third-party product to capture them.


In Summary

Consider reliability and trust. Company data is your most precious resource. Theoretically, you could fire everyone at your company and hire new folks: it would take time and it would be slow, but you’d survive. If, on the other hand, you lose your data ... you’re dead meat. I’ve read articles that chronicled a data center burning down or being flooded. After such a disaster, you don’t know who owes you money or whom you owe, and you have no data on which to base business decisions; you’re really dead! Which ISV you trust with your production data is vital to your career.

--

Over the last 32 years, Dale Franklin has worked at various Fortune 100 companies in the Houston area. He is currently an independent consultant. He can be reached at dale_diver@yahoo.com.

