Terabyte Data Warehouse Table Design Choices, Part 1

by David Beulke

Introduction

Data warehouse tables continue to fill at amazing speeds with larger and larger amounts of data, a trend that has persisted since the earliest days of data processing. Now, with the advances in DASD and the arrival of DB2 Version 8, there are many design choices. This article discusses some of the options for designing tables that hold terabytes of data.

Regardless of whether the data arrives all at once or over a long period of time, there are many concerns to address, but query access and maintenance are usually the most important. Query access means being able to get at the information you want immediately, through indexed access or by scanning only a small portion of the data. Maintenance means reasonable backup and recovery costs and never having to reorganize the data because of its size or the associated downtime.

Where Table Design Starts

Table design starts with the underlying data sets supported by the operating system and the database. In DB2, the tablespace and indexspace map to these data sets, and this is where the table and index are eventually defined. DB2 Version 8 for z/OS has expanded many of the data set size limits and embraced 64-bit addressing, which allows much larger tables, so far more DB2 data rows can be stored in a single table than ever before.

Concentrating a large number of rows in a single table data set is not optimal for maintenance or performance. Tablespace partitioning should be used to separate the table data rows into many data sets. Partition separation gives the administrator better manageability through optimal data set placement, quicker backups, faster recovery, and fewer table reorganizations. Since tablespace partitioning is defined on data element keys used in the table, partitioning can also be leveraged to run multiple processes in parallel across the table.
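
As a minimal sketch of this separation, the DDL below spreads a hypothetical SALES_HISTORY table across four quarterly data sets using V8 table-controlled partitioning; the database, tablespace, table, and column names are illustrative assumptions, not from a real system:

    CREATE TABLESPACE TSSALES IN DBSALES
        NUMPARTS 4          -- four partitions, each in its own data set
        DSSIZE 4G;          -- per-partition data set size, expanded in V8

    CREATE TABLE SALES_HISTORY
       (TRAN_DATE DATE          NOT NULL,
        SKU_ID    INTEGER       NOT NULL,
        CUST_NBR  INTEGER       NOT NULL,
        SALE_AMT  DECIMAL(11,2) NOT NULL)
      IN DBSALES.TSSALES
      PARTITION BY (TRAN_DATE)                 -- quarter-end limit keys
       (PARTITION 1 ENDING AT ('2005-03-31'),
        PARTITION 2 ENDING AT ('2005-06-30'),
        PARTITION 3 ENDING AT ('2005-09-30'),
        PARTITION 4 ENDING AT ('2005-12-31'));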

Leveraging Partitioning for Query Optimization

DB2 also leverages the partitioning information for query optimization, letting queries access only the appropriate partitions of data. This partition elimination can be a tremendous optimization and performance help because it focuses application access on the appropriate tablespace parts. In DB2 V8 on z/OS, partitioning and clustering are now independent of each other, allowing further flexibility across the table design options for optimum SQL and processing performance.
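
For instance, because the clustering index no longer has to match the partitioning key in V8, the hypothetical table above, partitioned by date, could be clustered by customer. A sketch:

    CREATE INDEX CUSTIX
        ON SALES_HISTORY (CUST_NBR)
        CLUSTER;   -- clustering key independent of the TRAN_DATE partitioning key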

Partitioning keys are used to separate the data into different physical data sets. The table designer should consider the best keys available, the events that cause a key value to change, and how each key separates the data into different data sets for performance or maintenance. The most common keys are associated with time, location codes, or people, and the designer needs to evaluate the parallel processing and maintenance options presented by the different key design combinations and alternatives.

Associating Partitioning Keys to Time

For example, using transaction time, product SKU id, and customer number offers six different choices for partitioning key designs. The table below highlights the different alternatives. Each alternative should be evaluated to determine the number of rows that would fall into each partition given the different key values or domains. For example, there could be tens of millions of customers or only ten thousand, and the time frame could cover years of data or only six months of history. The values for each of the keys should be evaluated to see how the frequency or grain of the key values helps separate the data into different data sets. The keys should also be evaluated for how they can be leveraged to provide the best query optimization and parallel processing. The goal is to balance the number of rows in each partition, keep the data set sizes manageable, promote parallel processing, and provide a good key structure for excellent query access.

Additional consideration should be given to how the data will flow through the partitioning key design over the lifetime of the system. The time aspect of the partitioning key scheme should be analyzed and leveraged to direct the data into the appropriate number of partitions. By limiting new data to a small number of partitions, the daily backup and recovery requirements can also be minimized.
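
For example, if all of a day's inserts land in the current quarter's partition, only that partition needs an image copy that night. A sketch of a partition-level backup with the COPY utility, against the hypothetical tablespace above:

    COPY TABLESPACE DBSALES.TSSALES
         DSNUM 4                -- copy only the partition receiving new data
         COPYDDN (SYSCOPY)
         SHRLEVEL REFERENCE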

              First Key   Second Key   Third Key
Design 1      Date        SKU Id       Cust Nbr
Design 2      SKU Id      Date         Cust Nbr
Design 3      Cust Nbr    SKU Id       Date
Design 4      SKU Id      Cust Nbr     Date
Design 5      Date        Cust Nbr     SKU Id
Design 6      Cust Nbr    Date         SKU Id
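
As an illustration of Design 1, the PARTITION BY clause of the CREATE TABLE could carry all three columns in a composite limit key; the limit values below are placeholders, not recommendations:

    PARTITION BY (TRAN_DATE, SKU_ID, CUST_NBR)
     (PARTITION 1 ENDING AT ('2005-06-30', 100, 2000),
      PARTITION 2 ENDING AT ('2005-12-31', 200, 4000))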

Associating Partitioning Keys to Value

The partitioning definition can be further configured by using all the keys, a subset of the keys, or a single key to separate the data into different partitions. For example, value changes in the SKU id and the customer number could be used to separate partitions, while any value of Date would be acceptable in any partition. The different key values could also be grouped to separate the partitions; for example, partitions could be defined for every two thousand customer numbers and every one hundred SKU ids. This flexibility lets the designer adjust the separation frequency to fit the number of unique customer and SKU id values to the physical attributes desired for query efficiency and process parallelism.

Key Range     Date    SKU Id    Cust Nbr
Upper Limit   Any     100       2000
              Any     200       4000
              Any     300       6000
              Any     Etc....   Etc....
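
In V8 limit-key terms, "Any" date can be expressed as the highest possible DATE value, so that only the SKU id and customer number ranges separate the partitions. A sketch of the first three rows above, using the hypothetical column names from earlier:

    PARTITION BY (TRAN_DATE, SKU_ID, CUST_NBR)
     (PARTITION 1 ENDING AT ('9999-12-31', 100, 2000),
      PARTITION 2 ENDING AT ('9999-12-31', 200, 4000),
      PARTITION 3 ENDING AT ('9999-12-31', 300, 6000))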

The number of partitions should also be analyzed and leveraged for parallelism, backup, and recovery considerations. Since DB2 Version 8 for z/OS can support up to 4096 partitions, care must be taken to define the appropriate number of partitions. Analysis should be done to justify the number of partitions, and the maximum should be avoided if possible to leave room for later expansion or emergencies.

Associating Partitioning Keys to Date

A Date key offers a different type of partitioning scheme, one that rotates the data through all the partitions over a given timeframe. This scheme is very appropriate for large-scale tables.

For example, using only the month and day portions of the Date key values rotates the data through all the different partitions over time: with the month as the major component, the day as the minor component, and the year ignored, each year's data cycles through the same set of partitions. The appropriate partitioning values can then be used to define the desired number of partitions for a complete year. The example in the following table shows partition separation per month. The same technique could be used with an accounting period, week, or day to arrive at the appropriate number of partitions for your multiple-terabyte table design.

Key Range     Date         SKU Id    Cust Nbr
Upper Limit   9999/12/31   100       2000
              9999/11/30   200       4000
              9999/10/31   300       6000
              9999/09/30   Etc....   Etc....
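
One way to implement this year-blind rotation is to carry a derived month-day column and partition on it. The sketch below assumes a TRAN_MMDD column populated as 'MMDD' when each row is loaded, and a companion tablespace TSROTATE defined with NUMPARTS 12; both are illustrative, not the only way to define the scheme:

    CREATE TABLE SALES_ROTATE
       (TRAN_DATE DATE    NOT NULL,
        TRAN_MMDD CHAR(4) NOT NULL,   -- derived 'MMDD'; the year is deliberately dropped
        SKU_ID    INTEGER NOT NULL,
        CUST_NBR  INTEGER NOT NULL)
      IN DBSALES.TSROTATE
      PARTITION BY (TRAN_MMDD)        -- one partition per month, reused every year
       (PARTITION 1  ENDING AT ('0131'),
        PARTITION 2  ENDING AT ('0229'),
        PARTITION 3  ENDING AT ('0331'),
        PARTITION 4  ENDING AT ('0430'),
        PARTITION 5  ENDING AT ('0531'),
        PARTITION 6  ENDING AT ('0630'),
        PARTITION 7  ENDING AT ('0731'),
        PARTITION 8  ENDING AT ('0831'),
        PARTITION 9  ENDING AT ('0930'),
        PARTITION 10 ENDING AT ('1031'),
        PARTITION 11 ENDING AT ('1130'),
        PARTITION 12 ENDING AT ('1231'));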

Table and Index DASD Free Space

Table and index DASD free space is another very important aspect of multi-terabyte tables, along with clustering, compression, appropriate PCTFREE and FREEPAGE settings, and index definitions. The row size stored in the database, compressed or uncompressed, should be analyzed and the PCTFREE and FREEPAGE settings adjusted appropriately. The designer needs to make sure there is enough free space to store every row without having to reorganize the data or indexes at the end of a busy day or season. This is only accomplished by having enough free space in the database and having the clustering definition order the data evenly across the tablespace and indexspace definitions.
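
As a sketch of how these settings might be applied to the hypothetical tablespace above (the percentages are placeholders to be derived from measured insert rates; the new values take effect at the next LOAD or REORG):

    ALTER TABLESPACE DBSALES.TSSALES
          PCTFREE 10     -- keep 10% of each page free for new rows
          FREEPAGE 31;   -- leave an empty page after every 31 pages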

Data Clustering

Just as there are many alternatives for partitioning, equal attention should be spent analyzing clustering keys for the data. The clustering should spread the data evenly across the large terabyte tablespace partition data sets. Good partitioning and clustering analysis should be done against all of the different possible definitions: just as the different combinations of data elements were evaluated for partitioning, the designer should analyze how the different clustering elements spread the data throughout the tablespace.

For example, in a terabyte table, using customer id might be a good clustering choice if there is a large number of customers, because the data will then spread out over the full range of customer values. The number of unique values for each clustering alternative should be evaluated to see how it spreads the data over a single partition and over all the tablespace partitions. Clustering order also needs to be evaluated against the SQL activities manipulating the data so that performance is maintained during peak processing periods. The designer should make sure the data is not clustered too narrowly, creating a tablespace or index “hot spot” that needs frequent reorganizations.
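
One way to watch how well the data stays in clustering sequence is the cluster ratio that RUNSTATS records in the catalog; a sketch, assuming the hypothetical table name used throughout:

    SELECT NAME, CLUSTERING, CLUSTERRATIOF
      FROM SYSIBM.SYSINDEXES
     WHERE TBNAME = 'SALES_HISTORY';   -- a ratio well below 1.0 suggests a growing hot spot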

Conclusion

The partitioning discussion points in this article relate mainly to DB2 Version 8 for z/OS, but they can also be applied to DB2 Version 8 for Linux, UNIX, and Windows (LUW) by using UNION ALL views and other table definition techniques. In the next article, I will discuss partitioning on DB2 LUW along with Multi-Dimensional Clustering and other design options for large terabyte tables.

--

David Beulke is an internationally recognized DB2 consultant, author, and lecturer. He is known for his extensive expertise in database performance, data warehouses, and internet applications. He is currently a member of the IBM DB2 Gold Consultant program, co-author of the IBM V8 and V7 z/OS DB2 Certification exams, co-author of the Business Intelligence Certification exam, past president of the International DB2 Users Group (IDUG), columnist for DB2 Magazine, former instructor for The Data Warehouse Institute (TDWI), and former editor of the IDUG Solutions Journal. He has helped clients with system and application issues in mainframe, UNIX, and Windows environments and has over 19 years of experience in numerous industries. He can be reached at DaveBeulke@cs.com or 703 798-3283.

