Chapter 5. Storage Engines and Data Types - Part 6

by Michael Kruckenberg and Jay Pipes

From Pro MySQL, Berkeley, Apress, July 2005.

Temporal Data Considerations

If you need to store only the date part of your temporal data, use the DATE column type. Don’t use an INT or TIMESTAMP column type if you don’t need that level of granularity. MySQL stores all the temporal data types as integers internally. The only real difference between the variations is how the data is formatted for display. So use the smallest data type (sound familiar?)that you can.

Use the TIMESTAMP column type if you have audit needs for some tables. Timestamps are an easy, efficient, and reliable method of determining when applications make changes to a record. Just remember that the first column defined as a TIMESTAMP column in the CREATE TABLE statement will be used for the create time of the record. The columns after that field can be updated to the current system time by updating the column equal to NULL. For instance, suppose you define your orders table like so:

mysql> CREATE TABLE orders (
     > id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
     > , customer INT NOT NULL
     > , update_time TIMESTAMP NOT NULL
     > , create_time TIMESTAMP NOT NULL);

If you insert a record into the orders table, explicitly set the create_time field; otherwise, it will be entered as a 0. The update_time, being the first TIMESTAMP column defined, will automatically be set to the current timestamp:

mysql> INSERT INTO orders (customer, create_time) VALUES (3, NOW());

When you do a SELECT from the table, you will notice that the first TIMESTAMP is set to the Unix timestamp value of the current time. The second TIMESTAMP field will be 0.

mysql> SELECT * FROM orders;
+----+----------+----------------+----------------+
| id | customer | update_time    | create_time    |
+----+----------+----------------+----------------+
|  1 |        3 | 20050122190832 | 20050122190832 |
+----+----------+----------------+----------------+
1 row in set (0.00 sec)

When updating the order record later on, the update_time will automatically get updated with the current system timestamp, while the create_time will stay the same:

mysql> UPDATE orders SET customer = 4 WHERE id = 1;

Selecting from the table then yields the following:

mysql> SELECT * FROM orders;
+----+----------+----------------+----------------+
| id | customer | update _time   | create_time    |
+----+----------+----------------+----------------+
|  1 |        3 | 20050122192244 | 20050122190832 |
+----+----------+----------------+----------------+
1 row in set (0.00 sec)

So, in this way, you have a good way of telling not only when records have been updated, but also which records have not been updated:

mysql> SELECT COUNT(*) FROM orders WHERE update_time = create_time;

Tip: Starting with version 4.1.2, you can tell MySQL how to handle the automatic updates of individual TIMESTAMP columns, instead of needing to explicitly set the TIMESTAMP to its own value during INSERT statements. See http://dev.mysql.com/doc/mysql/en/timestamp-4-1.html for more information and suggestions.

Spatial Data Considerations

The Spatial Data Extensions for MySQL will become more and more useful as more of the OpenGIS specification is implemented, and, in particular, when MySQL implements the ability to load geographic information system (GIS) data through the LOAD DATA INFILE command directly from well-known text (WKT) or well-known binary (WKB) values. Until then, using spatial types may be a little cumbersome, but you can still reap some benefits. As far as the actual data types go, the MySQL online manual provides a good lesson on how the myriad geometry types behave.

SET and ENUM Data Considerations

Now we come to a topic about which people have differing opinions. Some folks love the SET and ENUM column types, citing the time and effort saved in not having to do certain joins. Others dismiss these data types as poor excuses for not understanding how to normalize your database.

These data types are sometimes referred to as inline tables or array column types, which can be a bit of a misnomer. In actuality, both SET and ENUM are internally stored as integers. The shared meta information struct for the table handler contains the string values for the numeric index stored in the field for the data record, and these string values are mapped to the results array when returned to the client.

The SET column type differs from the ENUM column type only in the fact that multiple values can be assigned to the field, in the way a flag typically is. Values can be ANDed and ORed together when inserting in order to set the values for the flag. The FIND_IN_SET function can be used in a WHERE clause and functionally is the same as bitwise ANDing the column value. To demonstrate, the following two WHERE clauses are identical, assuming that the SET definition is option_flags SET('Red','White','Blue') NOT NULL:

mysql> SELECT * FROM my_table WHERE FIND_IN_SET('White', option_flags);
mysql> SELECT * FROM my_table WHERE option_flags & 2;

For both ENUM and SET column types, remember that you can always retrieve the underlying numeric value (versus the string mapping) by appending a +0 to your SELECT statement:

mysql> SELECT option_flags+0 FROM my_table;

Boolean Values

For Boolean values, you will notice that there is no corresponding MySQL data type. To mimic the functionality of Boolean data, you have a few different options:

You can define the column as a TINYINT, and populate the field data with either 0 or 1. This option takes a single byte of storage per record if defined as NOT NULL.
You may set the column as a CHAR(1) and choose a single character value to put into the field; 'Y'/'N' or '0'/'1' or 'T'/'F', for example. This option also takes a single byte of storage per record if defined as NOT NULL.
An option offered in the MySQL documentation is to use a CHAR(0) NOT NULL column specification. This specification uses only a single bit (as opposed to a full byte), but the values inserted into the records can only be NULL¹⁰ or '' (a null string).

Of these choices, one of the first two is probably the best route. One reason is that you will have the flexibility to add more values over time if needed — say, because your is_active Boolean field turned into a status lookup field. Also, the NULL and '' values are difficult to keep separate, and application code might easily fall into interpreting the two values distinctly.

We hope that, in the future, the BIT data type will be a full-fledged MySQL data type as it is in other databases, without the somewhat ungraceful current definition.

STORING DATA OUTSIDE THE DATABASE

Before you store data in a database table, first evaluate if a database is indeed the correct choice of storage. For certain data, particularly image data, the file system is the best choice — storing binary image data in a database adds an unnecessary level of complexity. The same rule applies to storing HTML or large text values in the database. Instead, store a file path to the HTML or text data.

There are, of course, exceptions to this rule. One would be if image data needed to be replicated across multiple servers, in which case, you would store the image data as a BLOB and have slave servers replicate the data for retrieval. Another would be if there were security restrictions on the files you want to display to a user. Say, for instance, you need to provide medical documents to doctors around the country through a web site. You don’t want to simply put the PDF documents on a web server, as doctors may forward a link to one another, and trying to secure each web directory containing separate documents with an .htaccess file would be tedious. Instead, it would be better to write the PDF to the database as a BLOB field and provide a link in your secure application that would download the BLOB data and display it.

Some General Data Type Guidelines

Your choice of not only which data types you use for your field definitions, but the size and precision you specify for those data types can have a huge impact on database performance and maintainability. Here are some tips on choosing data types:

Use an auto-incrementing primary key value for MyISAM tables that require many reads and writes. As shown earlier, the MyISAM storage engine READ LOCAL table locks do not hinder SELECT statements, nor do they impact INSERT statements, as long as MySQL can append the new records to the end of the .MYD data file.

Be minimalistic. Don’t automatically make your auto-incrementing primary key a BIGINT if that’s not required. Determine the realistic limits of your storage requirements and remember that, if necessary, you can resize data types later. Similarly, for DECIMAL fields, don’t waste space and speed by specifying a precision and scale greater than you need. This is especially true for your primary keys. Making them as small as possible will enable more records to fit into a single block in the key cache, which means fewer reads and faster results.

Use CHAR with MyISAM; VARCHAR with InnoDB. For your MyISAM tables, you can see a performance benefit by using fixed-width CHAR fields for string data instead of VARCHAR fields, especially if only a few columns would actually benefit from the VARCHAR specification. The InnoDB storage engine internally treats CHAR and VARCHAR fields the same way. This means that you will see a benefit from having VARCHAR columns in your InnoDB tables, because more data records will fit in a single index data page.

Note: From time to time, you will notice MySQL silently change column specifications upon table creation. For character data, MySQL will automatically convert requests for CHAR data types to VARCHAR data types when the length of the CHAR field is greater than or equal to four and there is already a variable length column in the table definition. If you see column specifications change silently, head to http://dev.mysql.com/doc/mysql/en/Silent_column_changes.html to see why the change was made.

Don’t use NULL if you can avoid it. NULLs complicate the query optimization process and increase storage requirements, so avoid them if you can. Sometimes, if you have a majority of fields that are NOT NULL and a minority that are NULL, it makes sense to create a separate table for the nullable data. This is especially true if the NOT NULL fields are a fixed width, as MyISAM tables can use a faster scan operation algorithm when the row format is fixed length. However, as we noted in our coverage of the MyISAM record format, you will see no difference unless you have more than seven NULL fields in the table definition.

Use DECIMAL for money data, with UNSIGNED if it will always be greater than zero. For instance, if you want to store a column that will contain prices for items, and those items will never go above $1,000.00, you should use DECIMAL(6,2) UNSIGNED, which accounts for the maximum scale and precision necessary without wasting any space.

Consider replacing ENUM column types with separate lookup tables. Not only does this encourage proper database normalization, but it also eases changes to the lookup table values. Changing ENUM values once they are defined is notoriously awkward. Similarly, consider replacing SET columns with a lookup table for the SET values and a relationship (N-M) table to join lookup keys with records. Instead of using bitwise logic for search conditions, you would look for the existence or absence of values in the relational table.

If you are really unsure about whether a data type you have chosen for a table is appropriate, you can ask MySQL to help you with your decision. The ANALYSE() procedure returns suggestions for an appropriate column definition, based on a query over existing data, as shown in Listing 5-7. Use an actual data set with ANALYSE(), so that your results are as realistic as possible.

mysql> SELECT * FROM http_auth_idb PROCEDURE ANALYSE() \G
*************************** 1. row ***************************
             Field_name: test.http_auth_idb.username
              Min_value: aaafunknufcnhmiosugnsbkqp
              Max_value: yyyxjvnmrmsmrhadwpwkbvbdd
             Min_length: 25
             Max_length: 25
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 25.0000
                    Std: NULL
      Optimal_fieldtype: CHAR(25) NOT NULL
*************************** 2. row ***************************
             Field_name: test.http_auth_idb.pass
              Min_value: aaafdgtvorivxgobgkjsvauto
              Max_value: yyyllrpnmuphxyiffifxhrfcq
             Min_length: 25
             Max_length: 25
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 25.0000
                    Std: NULL
      Optimal_fieldtype: CHAR(25) NOT NULL
*************************** 3. row ***************************
             Field_name: test.http_auth_idb.uid
              Min_value: 1
              Max_value: 90000
             Min_length: 1
             Max_length: 5
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 45000.5000
                    Std: 54335.7692
      Optimal_fieldtype: MEDIUMINT(5) UNSIGNED NOT NULL
*************************** 4. row ***************************
             Field_name: test.http_auth_idb.gid
              Min_value: 1210
              Max_value: 2147446891
             Min_length: 4
             Max_length: 10
       Empties_or_zeros: 0
                  Nulls: 0
Avg_value_or_avg_length: 1073661145.4308
                    Std: 0.0000
      Optimal_fieldtype: INT(10) UNSIGNED NOT NULL
4 rows in set (1.53 sec)

Listing 5-7. Using PROCEDURE ANALYSE() to Find Data Type Suggestions

As you can see, the ANALYSE() procedure gives suggestions on an optimal field type based on its assessment of the values contained within the columns and the minimum and maximum lengths of those values. Be aware that ANALYSE() tends to recommend ENUM values quite often, but we suggest using separate lookup tables instead. ANALYSE() is most useful for quickly determining if a NULL field can be NOT NULL (see the Nulls column in the output), and for determining the average, minimum, and maximum values for textual data.

Summary

In this chapter, we’ve covered information that will come in handy as you develop an understanding of how to implement your database applications in MySQL. Our discussion on storage engines focused on the main differences in the way transactions, storage, and indexing are implemented across the range of available options. We gave you some recommendations in choosing your storage engines, so that you can learn from the experience of others before making any major mistakes.

We also examined the various data types available to you as you define the schema of your database. We looked at the strengths and peculiarities of each type of data, and then provided some suggestions to guide you in your database creation.

In the next chapter, you will learn some techniques for benchmarking and profiling your database applications. These skills will be vital to our exploration of SQL and index optimization in the following chapters.

Footnotes

¹⁰ Yes, you did read that correctly. The column must be defined as NOT NULL, but can have NULL values inserted into data records for the field.

Michael Kruckenberg started his career with web technologies more than 10 years ago. His first major undertaking was bringing a small mail-order company online (using MySQL). After hopping around a bit during the 1990s Internet boom and spending time in the Internet startup world, Mike put his feet down at his current gig, senior programmer at Tufts University. He is now the technical manager for the Apache/Perl/MySQL-driven Tufts University Sciences Knowledgebase (TUSK), a content repository for faculty and students. Mike likes to spend his elusive free time with his wife and kids on New England adventures, has a healthy addiction to music, and likes to dabble in the arts (photography, video, and literature).

For the past 10 years, Jay Pipes has worked with all kinds of companies, large and small, to identify the value of the information they collect and help them build software that best addresses the needs of their businesses. From e-commerce to work-order management systems, Jay has been involved in projects with both Microsoft and open-source technologies. Passionate about programming and all things technical, Jay now runs his own consulting business, based in Columbus, Ohio. When not being bothered by his two cats, two dogs, and a constantly ringing phone, you can find him, headphones pulled across his ears, happily coding away at home.

Contributors : Michael Kruckenberg, Jay Pipes
Last modified 2006-05-12 05:02 PM

DBAzine.com

Sections

Personal tools

Menu