
DBAzine.com

An Interview with Scott Hayes on Optimizing Your DB2 UDB Databases - Part 1

Part 1  |  Part 2

Scott Hayes is a well known DB2 UDB performance expert. He is a regular speaker at International DB2 User Group conferences, IBM DB2/Data Management conferences, and is frequently sought as a guest speaker for regional DB2 user group meetings. Scott has authored several published articles on DB2 UDB performance, is an IBM DB2 GOLD Consultant (an elite, exclusive group of IBM recognized top DB2 advocates), has obtained Advanced IBM DB2 Certifications, and is widely regarded by the worldwide DB2 community as the top performance expert for DB2 on distributed (UNIX, Linux, and Windows) platforms. Most recently, he co-founded Database-Brothers Incorporated (DBI), a group of database performance and accountability specialists.

SH: I started working with DB2 back in 1989. It was DB2 1.3 on a mainframe. I followed DB2 on the mainframe for some time through about DB2 V4, and then I had an opportunity to work on a DB2/6000 project to migrate a mainframe DB2 application to the UNIX open systems environment. Since it was a chance to learn something new, I jumped at it to learn about UNIX and learn about this new DB2.

In the course of migrating the applications for that environment, I realized that our application customers were going to have a lot of questions about performance, and that it would be important to be able to monitor and understand it. So, in the early days of DB2 on UNIX, I took quite an interest in understanding what all the performance numbers meant, and I helped to interpret them.

After about a year of working in that environment, I’d really enjoyed doing some presentations for DB2 user group conferences, and I’d received good feedback. Friends and peers suggested that, since I seemed to know so much about UDB performance, it might be a good idea to set out on my own and start a consulting practice for DB2 UDB performance. So, I went back to my employer and resigned.

At the same time that I resigned, I gave them a letter offering my services as a consultant. Fortunately, they agreed to my proposal, and with that $24,000 check, I started Database-GUYS, Inc., which later became known as DGI. DGI began principally by doing a lot of performance consulting and performance troubleshooting for different clients and customers.

While consulting, I started creating some tools that would help rapidly analyze the performance problems my clients were having, and these tools proved to be quite useful and very valuable, because at the end of every consulting visit, customers would invariably ask if they could also have the tools. I got the idea that perhaps DGI should be in the software tools business in addition to consulting: this is what moved DGI from being purely a consulting organization to a software tools vendor.

As the tools matured, they became very powerful, and we continued to do consulting engagements. Probably the most common engagement was what we called a “performance audit,” which took two days.

The first day, we’d fly out to the customer site (somebody who was reportedly having a performance problem) and, using our tools, very quickly analyze the problem areas. Then we’d develop some solutions, implement them, and re-measure to validate the performance gains we had achieved. That was very powerful and very rewarding, because everywhere we went, we had a track record of success in helping people get past the performance problems they were experiencing. That reputation was pretty well known in the industry.

DGI would get calls at 9:00, 10:00, 11:00 at night from organizations in performance crisis, and the first question they would usually ask was, “How soon can you get here?” not “How much will it cost?” The customary practice in most shops was to put new software through development, test, and QA environments before it had the opportunity to get anywhere near production, but the normal course for the DGI tools was to go directly into the production environment. I suppose I am quite flattered by that in hindsight, but perhaps these customers took the view that, since performance was already so bad, how much worse could it get? Either way, the DGI tools usually started in the production environment first, which was quite a testimonial to DGI’s industry reputation.

There was also a big vote of confidence from the IBM Global Services organization across the country: in Poughkeepsie, New York; Atlanta; Research Triangle Park; and Boulder, Colorado. Global Services was DGI’s largest customer in terms of product licensing. The internal IBM organization was using DGI tools to help make sure that it would meet service level agreements for different customers. The DGI team took a great deal of pride in knowing that the company that built DB2 relied on DGI for performance tools.

Table Scans

DBAzine: Those are some great anecdotes and references from IBM leaning heavily on DGI to be successful. You present to lots of DBAs at user groups and conferences; since most of your audience members are in the “firefighting” mode, do you get questions from them looking for quick fixes? Are they asking you, “Hey, if I’m looking to troubleshoot a performance problem in the database, is there any specific area that you advise DBAs to focus on first?”

SH: Well, usually what gets people in trouble is that they have table scans going on and they’re not aware of them. Companies are buying large machines, large quantities of memory. They’re creating large buffer pools in DB2 to store the data. When they monitor the buffer pool performance, they see that they have great hit ratios.

But if you have a one-gigabyte buffer pool and a 500-megabyte table, DB2 can still be doing a table scan, because the entire table can be stored in memory. They could have a 100 percent hit ratio, but that table scan is going to consume an incredible amount of CPU. Instead, they need to look at all of the DB2 performance data, not just the hit ratios. Companies would often get in trouble with SQL that had a sub-optimal access path, and they wouldn’t be aware of it because they’d have these big, beefy machines and lots of memory.
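To make the arithmetic concrete, here is a minimal Python sketch (the counter names and values are illustrative, not actual DB2 monitor element names) showing how a perfect hit ratio and an expensive in-memory table scan can coexist:

```python
def hit_ratio_pct(logical_reads, physical_reads):
    """Classic buffer pool hit ratio: the share of page requests
    served from memory rather than from disk."""
    if logical_reads == 0:
        return 0.0
    return 100.0 * (1.0 - physical_reads / logical_reads)

# A 500 MB table fully cached in a 1 GB buffer pool: after warm-up
# there are no physical reads at all, so the ratio looks perfect.
print(hit_ratio_pct(1_000_000, 0))  # 100.0

# Yet every execution of a scanning statement still touches every row,
# burning CPU time that the hit ratio never reveals.
rows_in_table = 5_000_000
scan_executions = 1_000
rows_read_in_memory = rows_in_table * scan_executions  # 5,000,000,000 rows
```

The point of the sketch: the hit ratio measures only where pages come from, not how much work is done once they are in memory.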

Big machines with fast processors and a lot of memory can compensate for an inadequate physical design until the number of users grows and there’s so much concurrent demand on the machine that there aren’t enough CPU cycles to go around. A common scenario for companies that called DGI at the eleventh hour: a new application they’d put into production had been running just fine for 50 users, then fell flat after the company brought a new region or territory into the application and added another 200 users. Then the company would be virtually out of business, with woefully inadequate performance. The art of quickly addressing the most common performance problems becomes an exercise in finding the SQL that consumes disproportionate or excessive resources during its execution, and excessive resources can be measured in different ways.

In particular, if a machine is CPU bound, we look at the SQL that’s consumed the most CPU. But we can also look for the SQL with the highest sort times; there’s a very strong correlation between statements with high sort time and high CPU time. We can also look at statements doing a large amount of I/O, at the average number of rows read per statement execution, and at the average elapsed times to see whether service level agreements are being met.

Table Snapshot

DBAzine: DBAs seem to be attributing a lot of these application performance problems to inefficient SQL after the application goes into production and are trying to nail down specific performance metrics to go after those problematic SQL statements. Is there any specific snapshot information that you find to be most valuable in doing that?

SH: Generally during the audit process, we wouldn’t spend very much time with the snapshot information. In trying to make rapid problem determinations of what’s going on outside of DB2, the most important information that we would look at would come in part from a snapshot for tables. We use a snapshot for tables to find out which tables had the highest amount of read I/O executed against them and how many rows were read on average per transaction. So if a table had 100,000 rows read and there had been 1,000 transactions executed, that would be 100 rows read for every transaction against that table.

We recognize, of course, that not every table is accessed in every transaction, but it’s a very valuable metric for getting an idea of where the I/O workload is occurring across all the tables. Once we know which tables are being severely harmed by excessive I/O, the next task is to find the SQL that’s driving the I/O to those tables. Using the DGI tools (either SQLGUY or Flight Deck), you would begin with the table name, then search for all of the SQL that was potentially accessing that table. (This includes finding out whether the table is being accessed through an alias, through a view, or through the base table name itself.)
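The per-transaction arithmetic above can be sketched in a few lines of Python (the table names and counter values here are made up purely for illustration):

```python
def rows_read_per_tx(rows_read, transactions):
    """Average rows read from a table per transaction, computed from
    table-snapshot-style counters."""
    return rows_read / transactions if transactions else float("inf")

# The example from the text: 100,000 rows read over 1,000 transactions.
print(rows_read_per_tx(100_000, 1_000))  # 100.0

# Rank tables by the metric to see where the read I/O workload falls;
# the hot table at the top is where to go hunting for the driving SQL.
snapshot_rows_read = {"ORDERS": 4_200_000, "CUSTOMER": 90_000, "EMPLOYEE": 12_000}
transactions = 1_000
ranked = sorted(snapshot_rows_read,
                key=lambda t: rows_read_per_tx(snapshot_rows_read[t], transactions),
                reverse=True)
print(ranked[0])  # ORDERS
```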

We rely very heavily within the DGI tools on the DB2 event monitor. The DB2 SQL event monitor output is written to a pipe and then processed by the DGI tools in memory, without any physical I/O whatsoever; that’s a great contrast to other tools that may be available on the market. As we look at the event information, it’s very important to understand total aggregated SQL costs and to have an accurate representation of frequency of execution.

Suppose you have the statements “select * from employee where EMPNO = 210” and “select * from employee where EMPNO = 220.” The snapshot information will report that two different statements were executed, each one time. But a statement that says EMPNO equals something could be a very popular, frequently executed statement in the course of the application’s operation, run thousands of times. With the snapshot reporting each occurrence only once, the cost of any one occurrence, looked at individually, is trivial, or at least appears to be.

It is very important to do a total cost analysis. DGI’s method for analyzing SQL would be to reduce the statement to “select * from employee where EMPNO = ‘some value’,” then ask: have we seen a statement that looks like that before?

DBAzine: Right.

SH: And if we have seen that statement before, we increment the occurrence counter and aggregate the costs. “Last time it ran, it took 0.1 seconds; this time it ran, it took 0.1 seconds. Total cost of this statement is 0.2 seconds. Oh look, it ran again. Total cost is 0.3 seconds of CPU. Oh, it ran again. Total cost is 0.4 seconds of CPU time.” Now we can look at that 0.4 seconds of total CPU time and four executions. We can compute an average CPU per execution if we want. That’s easy: 0.4 divided by four.

We can look at the number of rows read on average per execution, and we can look at total rows read to see where all the I/O is coming from. It’s important to understand that if this statement has used 0.4 seconds of CPU time, and the total CPU time consumed by all statements that have been through the DB2 engine is one second, then that otherwise simple statement, “select * from employee where EMPNO = ‘some value’,” has actually consumed 40 percent of all of the CPU time in the application. Unless grouping and aggregation make the total costs accurately represented, the DBA would very likely be inclined to try to solve the wrong performance problem.

You can easily be misled by DB2 snapshot information into thinking that a statement reported to have executed two or three times, with a very high single-execution cost, is the statement hurting database performance the most. Everywhere we went, in every case of crisis performance troubleshooting, we would invariably find a SQL statement with low single-execution cost but extremely high total aggregated cost.
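A toy version of that normalize-and-aggregate loop might look like this in Python. This is a sketch, not DGI’s actual implementation: the regex-based literal stripping is a simplification (a real tool would parse the SQL properly), and the CPU figures follow the 0.1-second example from the text:

```python
import re
from collections import defaultdict

def normalize(sql):
    """Fold literals so "EMPNO = 210" and "EMPNO = 220" become one
    statement pattern."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals -> placeholder
    return re.sub(r"\b\d+\b", "?", sql)  # numeric literals -> placeholder

stats = defaultdict(lambda: {"count": 0, "cpu": 0.0})

def record(sql, cpu_seconds):
    """Have we seen this statement shape before? Bump its occurrence
    counter and aggregate its cost."""
    entry = stats[normalize(sql)]
    entry["count"] += 1
    entry["cpu"] += cpu_seconds

# Four executions, each a trivial-looking 0.1 seconds of CPU:
for empno in (210, 220, 230, 240):
    record(f"select * from employee where EMPNO = {empno}", 0.1)

key = "select * from employee where EMPNO = ?"
total_db2_cpu = 1.0  # CPU consumed by all statements in the engine
avg_cpu = stats[key]["cpu"] / stats[key]["count"]      # ~0.1 s per execution
cpu_share = 100.0 * stats[key]["cpu"] / total_db2_cpu  # ~40% of all CPU
```

The aggregation, not the per-execution snapshot view, is what surfaces the statement consuming 40 percent of the engine’s CPU.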

Here’s a case study and a history example:

There was a company in Florida that had a Web application, and this application was their entire business. They had no storefront, but they had quite a following. This company’s application was worldwide: there were thousands of customers using the Web site, and the company had worked very carefully to develop some application improvements. They had taken all of the diligent DBA actions: they developed a new application feature, got that functionality working to their satisfaction, put it in a test environment, and then in a QA environment where a number of internal users exercised a system test.

They had done DB2 Explains to look at the estimated costs of all the SQL statements for the workload in the application, and it all looked good. They went into production on Monday, and on Monday morning at about 10 a.m., they called us in a panic, asking, “How fast can you get here? Because we’re out of business right now!”

After remotely analyzing some data they’d sent over the Internet, we found that they had an RS/6000 that had eight CPUs — a pretty beefy machine with lots of memory. But when we did the grouping and aggregation, total cost, and then the percentage distribution analysis of where the resources were going, we found that there was a single SQL statement that was selecting eight small columns from a table with only 350 rows!

DBAzine: Wow.

SH: The statement said, “where columnX = ‘some value’ order by columnY,” and the table had only 350 rows. They had an index on columnX to satisfy that “equal” predicate. The statement was consuming 70 percent of all of the CPU time on a machine with eight CPUs. DB2 was using the index to find the 8 to 10 rows that qualified on the equal predicate, and then performing a sort to order the data according to the order by clause. So imagine this: sorting eight to ten lousy rows on a 350-row table practically put a company out of business.

When they had run the DB2 Explain utility, Explain had told them that the statement didn’t cost anything to execute. There were only 350 rows; the estimated cost was only 37 timerons; who cares? Nobody paid any attention to it. But because of the statement’s frequency of execution and its total aggregate cost relative to the entire workload, that simple statement practically crippled the company.

The solution that we came up with was pretty easy. One of my favorite sayings is a Chinese proverb that I was told years ago: A well-stated problem is a half-solved problem. Once we have a good problem statement, the solution is usually relatively easy.

We created a clustering index on those two columns, reorganized the table, ran the runstats utility, and the company was back in business. In fact, they had excess CPU capacity on the machine and room for growth. A remarkable story, but one that powerfully illustrates the necessity of looking at the right performance information to correctly understand the true nature of the problem within the DB2 database.


Last modified 2006-07-28 02:29 PM