Tuning Disk Architectures for Databases

by Mike Ault

Introduction

When I began work in the computer industry, RAID (Redundant Array of Inexpensive Disks) technology was still in its infancy and Winchester hard drive technology wasn’t available except in huge frame-mounted drives of around 90 megabytes in size that weighed in at a svelte 200 to 300 pounds. Disk technology has progressed far in terms of capacity since those days, and the many controllers and optimizers boggle the imagination. Gone are the days when tuning a disk meant a simple defragmentation and placing frequently accessed files near the outer rim of the disk.

Disk Capacity

While disk volume capacity has been growing exponentially, disk I/O capacity has grown only linearly with disk rotational speed. So while volume capacity has jumped from 100 megabytes or less at 3600 rpm to 200 gigabytes at 10K or 15K rpm, the corresponding I/O rates have only increased from about 30 IOPS to around 150 IOPS (for sequential I/O only). In other words, capacity has increased by a factor of roughly two thousand, while I/O rates increased by only a factor of five. What this means to you as a system tuner is that unless you build out (increase the number of drives) beyond what is needed strictly for volume of data, your ability to service I/O requests will actually decrease. This need to increase I/O rate to keep up with ever-increasing capacity is what drove the move to RAID technologies.
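
To make the arithmetic concrete, here is a minimal sketch in Python, using an assumed workload (a 2-terabyte database that must sustain 3,000 I/Os per second) and the drive figures quoted above:

import math

# Assumed workload and drive characteristics (illustrative numbers only).
DB_SIZE_GB = 2000      # database volume
PEAK_IOPS = 3000       # peak I/O demand
DRIVE_SIZE_GB = 200    # capacity of one modern high-capacity drive
DRIVE_IOPS = 150       # per-drive I/O ceiling quoted above

drives_for_volume = math.ceil(DB_SIZE_GB / DRIVE_SIZE_GB)   # 10 drives
drives_for_iops = math.ceil(PEAK_IOPS / DRIVE_IOPS)         # 20 drives

print(f"Drives needed for volume alone: {drives_for_volume}")
print(f"Drives needed to service peak I/O: {drives_for_iops}")

Ten drives would hold the data, but twenty are needed to service the peak I/O rate; sizing for volume alone would leave the system I/O-bound.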

Disk Types

Adding to the confusion over disk tuning are the many disk types available: ATA, SCSI, USB, and IEEE 1394 (FireWire) are all examples of disk technologies in use today. Each disk technology has different tuning requirements, and adding RAID to the mix further muddies the tuning waters. RAID can be based in software, hardware, or a mix of both; the highest performance usually comes from hardware-based RAID solutions.

Each of the disk types is characterized by its interface. Each interface type has its own maximum bandwidth, and you must take this into account when specifying your disk requirements. The following table shows an example comparison of these interfaces. Note that there are also limits on the number of devices that can be attached to each type of interface, ranging, for example, from two on most ATA buses to 63 on IEEE 1394.

Interface               Transfer Rate
Serial ATA              150-300 MB/s
Ultra320 SCSI           320 MB/s
IEEE 1394 (FireWire)    400 Mbit/s (~50 MB/s)
USB 2.0                 480 Mbit/s (~60 MB/s)

Memory Caches

In an effort to improve the I/O characteristics of their disks, manufacturers added memory caches to allow read and write caching. Unfortunately, these caches are usually too small to help much with database-level performance, since they amount to a tiny fraction of a percent of total disk capacity. Of what real use is a disk cache of a couple of megabytes on a 200-gigabyte disk holding a 60-gigabyte database table? The cache soon becomes saturated, and any I/O benefit is eliminated.

Disk array manufacturers such as EMC and Hitachi provide much larger caches, in the tens-of-gigabytes range, which in some situations do help improve I/O rates, depending on how efficiently the application uses the data stored in the database. But if the application is poorly written and does multiple full scans of multi-gigabyte tables, the cache again becomes saturated, and you fall back to disk I/O speeds. For that matter, in some environments, such as data warehouses and decision support systems, scanning large tables, sometimes hundreds of gigabytes in size, is commonplace. This can quickly tax even the largest cache systems.

I/O Profiles

All of this points to the fact that to get the maximum performance from your disk system, you must understand the I/O characteristics (the profile) of your database system, be it Oracle, SQL Server, Informix, UDB, or MySQL. You must tune your disk architecture to support the expected I/O profile and you must tune the database system to take advantage of the disk architecture. For example, an Oracle database has different I/O characteristics depending on whether it is reading or writing data and what type of read or write it is doing. Other databases have fixed read/write sizes.

You must determine the I/O profile for your database and use it to establish the maximum and minimum I/O sizes. The profile tells you what percentage of the I/O is large and what percentage is small, and gives the expected I/O rate in I/Os per second.

Once you have the I/O per second, you can determine the I/O capacity (number of drives) needed to support your database.
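
As a simple illustration, an I/O profile can be derived from two samples of cumulative I/O counters taken some interval apart. The counter names below are hypothetical; on a real system they would come from the database's statistics views or from operating system tools:

from dataclasses import dataclass

@dataclass
class IOSample:
    seconds: float    # time the sample was taken (relative)
    small_ios: int    # cumulative I/Os at or below the small-I/O size
    large_ios: int    # cumulative I/Os above it (e.g., multiblock reads)

def io_profile(first: IOSample, second: IOSample) -> dict:
    interval = second.seconds - first.seconds
    small = second.small_ios - first.small_ios
    large = second.large_ios - first.large_ios
    total = small + large
    return {
        "iops": total / interval,             # expected I/O rate
        "pct_small": 100.0 * small / total,   # small-I/O share
        "pct_large": 100.0 * large / total,   # large-I/O share
    }

# Example: 9,000 small and 3,000 large I/Os over a 60-second window.
print(io_profile(IOSample(0, 0, 0), IOSample(60, 9000, 3000)))
# -> {'iops': 200.0, 'pct_small': 75.0, 'pct_large': 25.0}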

The first rule of tuning your disk system is:

1. Size first for I/O capacity, then for volume.

Some back-of-the-envelope calculations for the number of spindles needed to support I/O rate (IOR) are:

  • RAID10 with active read/write to all mirrors:
      • MAX(CEILING(IOR / (NSIOR × M), M), 2 × M)

Where:
IOR is the expected maximum I/O rate in I/Os per second
NSIOR is the average non-sequential I/O rate of the disks in I/Os per second (a range of 90 to 100 for RAID10)
M is the number of mirrors
(That is, divide the I/O rate by the per-disk non-sequential I/O rate times the number of mirrors, round up to the nearest multiple of M, and use no fewer than 2 × M drives)

  • RAID5 assuming one parity disk:
      • MAX((IOR / CNSIOR) + 1, 3)

Where:
IOR is the expected maximum I/O rate in I/Os per second
CNSIOR is the corrected average non-sequential I/O rate of the disks in I/Os per second (a range of 60 to 90 for RAID5)
(That is, divide the I/O rate by the per-disk non-sequential I/O rate corrected for RAID5 penalties, add one disk for parity, and use no fewer than three drives)

The correction to the non-sequential I/O rate for RAID5 is due to the write penalty of up to 400 percent (writes take four times longer than reads on the same drive). In some cases on RAID5, I have seen this go as high as 6,400 percent (writes take 64 times as long as reads for the same file) when combined with other problems such as fragmentation.
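
Expressed as code, the two rules of thumb look like this. The sketch below implements the formulas exactly as written, with a spreadsheet-style CEILING that rounds a value up to the nearest multiple of its second argument:

import math

def ceiling(value: float, significance: int) -> int:
    """Spreadsheet-style CEILING: round value up to a multiple of significance."""
    return math.ceil(value / significance) * significance

def raid10_spindles(ior: float, nsior: float, m: int) -> int:
    """MAX(CEILING(IOR / (NSIOR * M), M), 2 * M): minimum of 2 * M drives."""
    return max(ceiling(ior / (nsior * m), m), 2 * m)

def raid5_spindles(ior: float, cnsior: float) -> int:
    """MAX((IOR / CNSIOR) + 1, 3): data disks plus one parity disk."""
    return max(math.ceil(ior / cnsior) + 1, 3)

# An assumed workload of 2,000 I/Os per second:
print(raid10_spindles(2000, 95, 2))   # 12 drives in a two-way mirror
print(raid5_spindles(2000, 75))       # 28 drives including parity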

A case in point: Early RAID architectures utilized the “stripe shallow and wide” mindset in which files were broken into small pieces and spread over a large number of disks. For example, stripe units per disk of as small as 8K were common. Many systems read in I/O sizes of 64K or larger. This means that to satisfy a single I/O request, eight disks of the RAID set were required. If there were fewer than eight disks in the set, disks would have to undergo two or more I/Os to satisfy the request. This sounds fine if you are talking about a single user wanting to read a large file from a large group of disks very quickly; however, what happens when you have 10, or 100, or 1000 concurrent users all wanting to do the same thing?

Tune for Concurrency

This problem of concurrent access to RAID arrays is one of the most prevalent in the industry. The ubiquitous I/O wait is usually the predominant wait event in any database system, simply because I/O to memory is in the nanosecond range while I/O to disk is in the millisecond range. When you add in blocked access due to multi-disk I/O requests, you get a snowball effect that can cripple your I/O subsystem.

Array manufacturers have begun to recognize this concurrent access problem and have increased the base stripe unit per disk to 64K, matching the I/O unit for many systems. Of course, now systems such as Sun and Windows utilize maximum I/O sizes of one megabyte or larger, so again the array manufacturers are playing catch-up to the server manufacturers.

So what is our second rule for tuning disks? Based on the previous information, the rule is:

2. Always ensure that the primary I/O size for your database system is matched to the I/O size of the disk array system.

Of course the inverse also holds true:

Always match the stripe unit per disk to the expected majority I/O request from your (database) application.
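
For example, in an Oracle database the primary large-I/O size is the product of the block size and the multiblock read count. The sketch below uses those Oracle parameter names, but the matching logic, not the specific database, is the point:

# Sketch: check that the database's primary I/O size matches the stripe unit.
# db_block_size and db_file_multiblock_read_count are Oracle parameter names;
# the stripe unit would come from your array configuration.
def check_io_match(db_block_size_kb: int, multiblock_read_count: int,
                   stripe_unit_kb: int) -> None:
    primary_io_kb = db_block_size_kb * multiblock_read_count
    if primary_io_kb == stripe_unit_kb:
        print(f"Matched: {primary_io_kb}K I/O on a {stripe_unit_kb}K stripe unit")
    else:
        disks = max(1, primary_io_kb // stripe_unit_kb)
        print(f"Mismatch: {primary_io_kb}K I/O on a {stripe_unit_kb}K stripe unit "
              f"touches ~{disks} disk(s) per request")

check_io_match(8, 8, 64)   # 64K reads on a 64K stripe unit: one disk per request
check_io_match(8, 8, 8)    # 64K reads on an 8K stripe unit: eight disks per request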

In the 1990s, Peter M. Chen, then at the University of California, Berkeley, published a series of papers on tuning disk array stripe unit size based on expected concurrency. In these papers, Chen and his associates determined that a disk's positioning speed (as measured by average seek time) and its transfer rate (as measured in megabytes per second) determine the best stripe size for an array, even when the number of concurrent accesses is not known. Three formulae were derived from these papers:

      • For non-RAID5 arrays when concurrency is known:
          • SU = (S × APT × DTR × (CON − 1) × 1.024) + 0.5K

Where:
SU — Striping unit per disk (kilobytes)
S — Concurrency slope coefficient (~0.25)
APT — Average positioning time (milliseconds)
DTR — Data transfer rate (megabytes/second)
CON — Number of concurrent users
1.024 — 1s/1000ms × 1024K/1M (conversion factor for units)

So for a drive that has an average positioning time of 5.6 ms and a transfer rate of 20 megabytes/second, the calculated stripe unit for a 20-concurrent-user base would be:

  • (0.25 × 5.6 × 20 × (20 − 1) × 1.024) + 0.5 = 545K (or ~512K)

For a system for which you don't know the concurrency, the calculation becomes:

  • SU = 2/3 × APT × DTR × 1.024

So for the same drive:

  • 2/3 × 5.6 × 20 × 1.024 = 76.46K, so rounding up ~128K or rounding down 64K

And from Chen's final paper, the formula for RAID5 arrays is SU = 1/2 × APT × DTR × 1.024. So for the same drive:

  • 0.5 × 5.6 × 20 × 1.024 = 57.34K (rounding up, 64K)
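
The three formulae translate directly into a few lines of Python. This sketch simply reproduces the arithmetic above (positioning time in milliseconds, transfer rate in megabytes per second, stripe unit in kilobytes):

def stripe_unit_known(apt: float, dtr: float, con: int, s: float = 0.25) -> float:
    """Non-RAID5, known concurrency: SU = (S * APT * DTR * (CON - 1) * 1.024) + 0.5"""
    return s * apt * dtr * (con - 1) * 1.024 + 0.5

def stripe_unit_unknown(apt: float, dtr: float) -> float:
    """Non-RAID5, unknown concurrency: SU = 2/3 * APT * DTR * 1.024"""
    return (2.0 / 3.0) * apt * dtr * 1.024

def stripe_unit_raid5(apt: float, dtr: float) -> float:
    """RAID5: SU = 1/2 * APT * DTR * 1.024"""
    return 0.5 * apt * dtr * 1.024

# The example drive: 5.6 ms average positioning time, 20 MB/s transfer rate.
print(round(stripe_unit_known(5.6, 20, 20), 2))   # 545.27 -> use ~512K
print(round(stripe_unit_unknown(5.6, 20), 2))     # 76.46  -> use 64K or 128K
print(round(stripe_unit_raid5(5.6, 20), 2))       # 57.34  -> use 64K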

The values for average positioning time and transfer rate used in these examples are actually fairly low compared to more advanced drives, so the stripe sizes shown previously are probably low by at least a factor of two. As average seek times drop, transfer rates increase: on an Ultra3 SCSI 15K drive, for example, the average seek time drops to 4.7 milliseconds while the transfer rate leaps to 70 megabytes per second, so the combined APT × DTR factor goes from 112 to 329, nearly a threefold increase.

The I/O Path

The other major factor to consider is the ability to transfer data from the disks to the CPU, a job carried out by the controllers and other interface components. Each controller in a system usually has two or more paths to the disk arrays, and the more paths to and from the array, the more data can be transferred. This leads to the third rule for tuning disk arrays:

3. Always plan transfer paths for peak load, not average load.

In one situation, I had a client who switched to a more advanced disk array with faster disks, a new server with faster CPUs, and yet the system performed at half the speed of the older one. Investigation showed they had decided to reduce the available disk paths to two-thirds of the previous number (from 12 to 8) based on “average” I/O requirements. They soon discovered that while this was fine for day-to-day processing, it spelled real trouble for end-of-month, end-of-quarter, and end-of-year processing as well as payday processing.

Make sure I/O is spread evenly across the controller paths using a load-balanced methodology, not a cascade methodology. In a load-balanced methodology, the load is spread evenly across all of the paths allocated to a given server. In a cascade methodology, subsequent paths are not utilized until the primary path reaches a certain percentage of load.
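
The difference between the two methodologies is easy to see in code. The following is purely illustrative; in practice, path selection is handled by the operating system's or array vendor's multipathing driver, not by application code:

import itertools

class LoadBalancedPaths:
    """Load-balanced: rotate through every path so each carries an even share."""
    def __init__(self, paths):
        self._cycle = itertools.cycle(paths)

    def pick(self, load_by_path=None):
        return next(self._cycle)

class CascadePaths:
    """Cascade: stay on the primary path until it crosses a load threshold."""
    def __init__(self, paths, threshold=0.80):
        self.paths, self.threshold = paths, threshold

    def pick(self, load_by_path):
        for path in self.paths:
            if load_by_path.get(path, 0.0) < self.threshold:
                return path
        return self.paths[-1]   # all paths saturated; fall back to the last one

paths = ["c0p0", "c0p1", "c1p0", "c1p1"]
balanced = LoadBalancedPaths(paths)
print([balanced.pick() for _ in range(4)])          # every path used once
cascade = CascadePaths(paths)
print(cascade.pick({"c0p0": 0.95, "c0p1": 0.40}))   # spills over to c0p1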

The 100 Percent Myth

Many system administrators are guilty of perpetuating the 100 percent myth: the belief that you don't need more assets (be they disk, CPU, or memory) until the existing assets are 100 percent utilized. Applied to disks, this thinking leads directly to performance problems.

Because of the physics of disks, the best performance for a disk is at the outer edges; once you get toward the inner sectors, performance decreases because of the distance the head must travel to read the data and other factors. In the good old days, administrators spent much time positioning frequently used files on the outer edges of disks.

While physically positioning files on disks is difficult, if not impossible, in modern RAID systems, you should still avoid filling the disks to 100 percent of capacity. Some experts say to use no more than 30 percent if you want maximum performance; others say 50 percent. I say it depends on how the system is used, the operating system, and the RAID array system.

Should I Worry About Fragmentation?

The answer here is: it depends. Windows-based environments (NT, W2K) are prone to fragmentation because of the opportunistic write algorithm, which writes blocks back to the first available sector instead of their original sector. This is true even for the pre-allocated files used by most databases. Another fragmentation-prone environment is AIX. And of course, anyone still using OpenVMS will know all they ever wanted to know about the costs of fragmentation.

In Windows, AIX, and OpenVMS, you will need to monitor for fragmentation and defragment as required. Be sure to verify with your disk array vendor that defragmentation is possible or allowed on their systems. If fragmentation becomes severe, you may need to back up the array, rebuild it, and then repopulate it to work around specific vendor limitations.

Summary

The entire underlying foundation of disk tuning, and for that matter database tuning, is to open up the flow path for data. Make sure there are no restrictions or kinks in the path from the disks to the memory and CPUs. Then, if you follow up with proper database tuning with the goal of reducing logical and physical I/Os to a minimum, you will have an optimally performing system.

References

Peter M. Chen and David A. Patterson. “Maximizing Performance in a Striped Disk Array.” Proceedings of the 1990 International Symposium on Computer Architecture, pages 322–331, May 1990.

Peter M. Chen and Edward K. Lee. “Striping in a RAID Level 5 Disk Array.” Proceedings of the 1995 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems.

Drew Robb. “Hard Disks and How They Are Organized.” Enterprise Operations Management, Auerbach Publications, CRC Press, 2003.

--

Mike Ault is one of the leading names in Oracle technology. The author of more than 20 Oracle books and hundreds of articles in national publications, Mike Ault has five Oracle Masters Certificates and was the first popular Oracle author with his book Oracle7 Administration and Management. Mike also wrote several of the “Exam Cram” books, and enjoys a reputation as a leading author and Oracle consultant.

