Chapter 5. Storage Engines and Data Types - Part 5

by Michael Kruckenberg and Jay Pipes

From Pro MySQL, Berkeley, Apress, July 2005.

The ARCHIVE Storage Engine

New in MySQL 4.1.3 is the ARCHIVE storage engine. Its purpose is to compress large volumesof data into a smaller size. While this storage engine should not be used for regular data access (normal operations), it is excellent for storing log or archive information that is taking up too much regular space on disk.

The ARCHIVE storage engine is not available on default installations of MySQL. In order to create an ARCHIVE table, you will need to build MySQL with the --with-archive-storage-engine option.

No indexes are allowed when creating an ARCHIVE table; indeed, the only access method for retrieving records is through a table scan. Typically, you would want to convert stale log data tables into ARCHIVE tables. On the rare occasion that analysis is needed, you could create a temporary MyISAM table by selecting the entire data set and create indexes on the MyISAM table.

The CSV Storage Engine

As of version 4.1.4, MySQL introduced the CSV storage engine. The CSV storage engine is not available on default installations of MySQL. In order to create a CSV table, you will need to build MySQL with the --with-csv-storage-engine option.

Although this is not the most useful of all the storage engines, it can have its advantages. The CSV engine stores table meta information, like all storage engines, in an .frm file in the database directory. However, for the actual data file, the CSV engine creates a file with the table name and a .CSV extension. This is handy in that the table data can be easily copied from the database directory and transferred to a client, like Microsoft Excel, to open the table in spreadsheet format.

Practically speaking, however, there is little use to the CSV storage engine given two facts. First, no indexing is available for the storage engine, making the use of it in normal operations somewhat implausible. Second, the INTO OUTFILE extension clause of the SELECT statement negates some of the benefits we mentioned. In order to output a section of an existing table into a CSV format, you could just as easily run the following statement:

mysql> SELECT * INTO OUTFILE '/home/user1/my_table.csv'
     > FIELDS TERMINATED BY ',' ENCLOSED BY '"'
     > FROM my_table;

This would dump the file to the server’s location of /home/user1/my_table.csv. If you wanted to place the output file onto the local client (say, if you were connecting remotely), you could execute the following from a shell prompt:

mysql -t -e "SELECT * FROM my_schema.my_table" | tr "\011" "," > my_file.csv

This would pipe tabbed results (the -t option) from MySQL to the tr program, which would translate the tab character (\011) to a comma character and dump the results to a CSV file on the local computer.

The FEDERATED Storage Engine

If any of you are coming from a Microsoft SQL Server background and have wondered whether MySQL implements anything like the concept of linked servers in SQL Server, look no further. Starting in version 5.0.3, you can use the FEDERATED storage engine to access databases located on remote servers in the same way you would on the local server.

The FEDERATED storage engine is not available on default installations of MySQL. In order to create a FEDERATED table, you will need to build a version of MySQL 5.0.3 or later with the --with-federated-storage-engine option.

On the local server, only an .frm file is created when a table with ENGINE=FEDERATED is created. Naturally, the data file is stored on the remote server, and thus there is no actual data file stored on the local server.

When accessing records from a FEDERATED table, MySQL uses the mysql client API to request resultsets from the remote server. If any results are returned, the FEDERATED storage engine converts the results to the format used by the underlying remote server’s storage engine. So, if an InnoDB table on the remote server were accessed on the local server via a FEDERATED table, the local FEDERATED storage engine would create an InnoDB handler for the request and issue the requested statements to the remote server, which would execute and return any resultset needed in the standard client format. The FEDERATED storage engine would then convert the returned results into the internal format of the needed table handler (in this case, InnoDB), and use that handler to return the results through the handler’s own API.

The NDB Cluster Storage Engine

The MySQL Cluster (NDB) is not really a storage engine, in that it delegates the responsibility of storage to the actual storage engines used in the databases that it manages. Once a cluster of database nodes is created, NDB controls the partitioning of data across the nodes to provide redundancy of data and performance benefits. We discuss clustering and NDB in detail in Chapter 19.

Guidelines for Choosing a Storage Engine

At this point, you might be wondering how to go about determining which storage engine is a good match for your application databases. MySQL gives you the ability to mix and match storage engines to provide the maximum flexibility when designing your schema. However, there are some caveats and some exceptions to keep in mind when making your choices.

Take time to investigate which index algorithms best fit the type of data you wish to store. As different storage engines provide different index algorithms, you may get a significant performance increase by using one over another. In the case of InnoDB, the storage engine will actually pick a B-tree or hash index algorithm based on its assessment of your data set. This takes away some of the control you might have by using a combination of MEMORY and MyISAM tables for storage; however, it might be the best fit overall. When it comes to requirements for FULLTEXT or spatial indexing, your only choice currently is MyISAM. Look for implementation of these other indexing algorithms to appear in other storage engines in the future.

Here, we present some general guidelines for choosing an appropriate storage engine for your various tables.

Use MyISAM for logging. For logging purposes, the MyISAM storage engine is the best choice. Its ability to serve concurrent read requests and INSERT operations is ideal for log tables where data is naturally inserted at the end of the data file, and UPDATE and DELETE operations are rare.

Use MyISAM for SELECT COUNT(*) queries. If you have an application that relies on multiple SELECT COUNT(*) FROM table queries, use the MyISAM storage engine for those tables. MyISAM’s index statistics make this type of query almost instantaneous. InnoDB’s performance degrades rapidly on larger data sets because it must do a scan of the index data to find the number of rows in a table. There are plans to improve InnoDB’s performance of this type of query in the near future.

Use InnoDB for transaction processing. When your application needs to ensure a specific level of isolation across read requests over multiple statements, use the InnoDB storage engine. Before deciding to use InnoDB, however, be sure that the transactions issued by your application code are indeed necessary. It is a mistake to start a multistatement transaction for statements that can be reduced to a single SQL request.

Use InnoDB for enforcing foreign key constraints. If the data you are storing has relationships between a master and child table (foreign keys), and it is absolutely imperative that the relationship be enforced, use InnoDB tables for both the master and child tables. Consider using the ON UPDATE CASCADE and ON UPDATE DELETE options to enforce any business rules. These options can cut down significantly on application code that would normally be required to handle enforcing business rules.

Use InnoDB for web site session data. If you store web application sessions in your database, as opposed to a file-based storage, consider using InnoDB for your storage engine. The reason for this is that typical web application sessions are UPDATE-intensive. New sessions are created, modified, and destroyed in a continual cycle as new HTTP requests are received and their sessions mapped and remapped to the database table. InnoDB’s row-level locking enables fast, concurrent access for both read and write requests. Why not use a MEMORY table instead, since it is fast for reads and writes, including DELETE operations? Well, the reason is that MEMORY tables cannot support TEXT or BLOB data because the hashing algorithm used in the storage engine cannot include the TEXT and BLOB pointers. Web session data is typically stored in TEXT columns, often as a serialized array of values.

You should not assume that an initial choice of a storage engine will be appropriate throughout the life of your applications. Over time, not only will your storage and memory requirements change, but the selectivity and the data types of your tables may change. If you feel that changing a table’s storage engine would have an impact, first create a test environment and make the changes there. Use the information in Chapter 6 to benchmark both schema and determine if the application would perform better with the changes.

Data Type Choices

As you know, MySQL offers various data types that you can apply to your table columns. Here, we cover the different data types in terms of some common recommendations for their use and knowledge areas we feel are often misunderstood.

Tip: If you’re unsure about a specific data type, or simply want a reference check, consider picking up a copy of Jon Stephens and Chad Russell’s excellent Beginning MySQL Database Design and Optimization (Apress, 2004).

Numeric Data Considerations

MySQL provides an array of numeric data types in different sizes and flavors. Choose numeric column types based on the size of the storage you need and whether you need precise or imprecise storage.

For currency data, use the DECIMAL column type, and specify the precision and scale in the column specification. Do not use the DOUBLE column type to store currency data. The DOUBLE column type is not precise, but approximate. If you are doing calculations involving currency data, you may be surprised at some of the results of your queries.

For instance, assume you defined a field product_price as DECIMAL(9,4) NOT NULL, and populate a data record with the value 99.99:

mysql> CREATE TABLE product (id INT NOT NULL, product_price DECIMAL(9,4) NOT NULL);
mysql> INSERT INTO product (id, product_price) VALUES (1, 99.99);

Next, you go to select the data records where product_price is equal to 99.99:

mysql> SELECT * FROM product WHERE product_price = 99.99;

Everything works as expected:

+----+---------------+
| id | product_price |
+----+---------------+
|  1 |       99.9900 |
+----+---------------+
1 row in set (0.00 sec)

However, you may be surprised to learn that the following query:

mysql> SELECT * FROM product WHERE 100 - product_price = .01;

yields different results depending on the data type definition:

mysql> SELECT * FROM product WHERE 100 - product_price = .01;
+----+---------------+
| id | product_price |
+----+---------------+
|  1 |       99.9900 |
+----+---------------+
1 row in set (0.03 sec)

mysql> ALTER TABLE product CHANGE COLUMN product_price product_price DOUBLE;
Query OK, 1 row affected (0.10 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM product WHERE 100 - product_price = .01;
Empty set (0.00 sec)

As you can see, the same query produces different results depending on the precision of the data type. The DOUBLE column type cannot be relied on to produce accurate results across all hardware architectures, whereas the DECIMAL type can. But calculations involving DECIMAL data can yield unexpected results due to the underlying architecture’s handling of floatingpoint arithmetic. Always test floating-point calculations thoroughly in application code and when upgrading MySQL versions. We have noticed differences in the way precision arithmetic is handled even across minor version changes. See http://dev.mysql.com/doc/mysql/en/problems-with-float.html for details about floating-point arithmetic issues.

String Data Considerations

As with all data types, don’t go overboard when choosing the length of string fields. Be conservative, especially when deciding on the length of character columns that will be frequently indexed. For those columns that should be indexed because they frequently appear in WHERE conditions, consider using an INDEX prefix to limit the amount of actual string data stored in the index rows.

Character fields are frequently used in the storage of address data. When determining how these character columns should be structured, first consider the maximum number of characters that can be stored in the specific data field. For example, if your data set includes only United States zip codes, use a CHAR(5) column type. Don’t make the field have a length of 10 just because you may need to store the zip+4 format data for some of the records. Instead, consider having two fields, one CHAR(5) to store the main, non-nullable zip code, and another nullable field for storing the +4 extension. If you are dealing with international postal addresses, investigate the maximum characters needed to store the postal code; usually, a CHAR(8) will do nicely.

Tip: Store spaces that naturally occur in a postal code. The benefit of removing the single space character is almost nonexistent compared to the pain of needing to remove and restore the space for display and storage purposes.

Also consider how the data will actually be searched, and ensure that separate search fields are given their own column. For instance, if your application allows end users to search for customer records based on a street address, consider using separate fields for the street number and the street name. Searches, and indexes, can then be isolated on the needed field, and thus made more efficient. Also, following rules for good database normalization, don’t have separate fields for different address lines. Instead of having two fields of type VARCHAR(50) named address_1 and address_2, have a single field address, defined as VARCHAR(100). Address information can be inserted into the single field with line breaks if needed.

For applications where search speed is critical, consider replacing city and region (state) fields with a single numeric lookup value. Because they are denser, numeric indexes will have faster search speeds. You can provide a lookup system defined something like this:

mysql> CREATE TABLE region (
     > id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
     > , region_name VARCHAR(30) NOT NULL
     > , country CHAR(2) NOT NULL);
mysql> CREATE TABLE location (
     > id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
     > , region INT NOT NULL
     > , city VARCHAR(30) NOT NULL
     > , INDEX (region, city));

You would populate your region and location tables with the data for all cities in your coverage area. Your customer table would then need only store the 4-byte pointer to the parent location record. Let’s assume you have a region with an ID of 23, which represents the state of California, in the United States. In order to find the names of customers in this region living in the city Santa Clara, you would do something like this:

mysql> SELECT c.name FROM location l
     > INNER JOIN customer c ON l.id = c.location
     > WHERE l.region = 23 AND l.city = 'Santa Clara';

This search would be able to use the index on location to find the records it needs. If you had placed the city and region information in the customer table, an index on those two fields would have needed to contain many more entries, presuming that there would be many more customer entries than location entries.

Again, indexes on numeric columns will usually be faster than character data columns of the same length, so think about converting data from a character format into a normalized lookup table with a numeric key.

Michael Kruckenberg started his career with web technologies more than 10 years ago. His first major undertaking was bringing a small mail-order company online (using MySQL). After hopping around a bit during the 1990s Internet boom and spending time in the Internet startup world, Mike put his feet down at his current gig, senior programmer at Tufts University. He is now the technical manager for the Apache/Perl/MySQL-driven Tufts University Sciences Knowledgebase (TUSK), a content repository for faculty and students. Mike likes to spend his elusive free time with his wife and kids on New England adventures, has a healthy addiction to music, and likes to dabble in the arts (photography, video, and literature).

For the past 10 years, Jay Pipes has worked with all kinds of companies, large and small, to identify the value of the information they collect and help them build software that best addresses the needs of their businesses. From e-commerce to work-order management systems, Jay has been involved in projects with both Microsoft and open-source technologies. Passionate about programming and all things technical, Jay now runs his own consulting business, based in Columbus, Ohio. When not being bothered by his two cats, two dogs, and a constantly ringing phone, you can find him, headphones pulled across his ears, happily coding away at home.

Contributors : Michael Kruckenberg, Jay Pipes
Last modified 2006-05-12 05:02 PM

DBAzine.com

Sections

Personal tools

Menu