Chapter 5: Storage Engines and Data Types - Part 4

by Michael Kruckenberg and Jay Pipes

From Pro MySQL, Berkeley, Apress, July 2005.


Internal InnoDB Buffers

InnoDB caches information in two major internal buffers:

      • Buffer pool: This buffer contains cached index data pages (both leaf and non-leaf). The innodb_buffer_pool_size configuration variable controls the size of this buffer.
      • Log buffer: This buffer contains cached log records. The innodb_log_buffer_size configuration variable controls the size of the log buffer.

Note: It is unfortunate that InnoDB currently does not allow the configuration variables associated with these internal buffers to be changed on the fly. A restart of the mysqld process is required for the changes to take effect, which, considering InnoDB was designed for always-on, high-availability systems, may be a significant downside. We hope that, in the future, these values will be modifiable through SQL commands.

In addition to these two main buffers, InnoDB also keeps a separate cache of memory for its internal data dictionary about the table and index structures in the tablespace.
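Both buffer sizes are set in the [mysqld] section of the MySQL option file. The snippet below is a sketch; the values are illustrative only, not recommendations, and should be sized to your workload:

```ini
# my.cnf (illustrative values, not recommendations)
[mysqld]
innodb_buffer_pool_size = 256M   # cached index data pages (leaf and non-leaf)
innodb_log_buffer_size  = 8M     # cached log records awaiting flush
```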

InnoDB Doublewrite Buffer and Log Format

In order to ensure the ACID properties of transaction control, InnoDB uses a write-ahead logging system called a doublewrite buffer system. Remember from Chapters 2 and 3 that there is a difference between a write and a flush of data. A write simply changes the in-memory copy of a piece of data. A flush commits those writes to disk.

The doublewrite buffer refers to the dual-write process that occurs when InnoDB records changes issued under a transaction, as illustrated in Figure 5-1. Because of the principles of write-ahead logging, InnoDB must ensure that any statement that modifies the in-memory data set is first recorded on disk (in a log) before a COMMIT is issued for the entire transaction. This ensures that, in the case of a disk failure or software crash, the changes can be re-created from the log records. However, the designers of InnoDB realized that if a transaction were rolled back before a COMMIT was received, the statements on log records representing those changes would not need to be reissued during a recovery. So, InnoDB inserts transactional statements as log records into the log buffer (described in the previous section), while simultaneously executing the modifications those statements make against the in-memory copy of the record data available in the buffer pool. This dual-buffer write explains the doublewrite buffer terminology.
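The dual-buffer write can be sketched as follows. This is a toy model with made-up names (log_buffer, buffer_pool, modify), not InnoDB's actual API; it only illustrates that each change is appended to the in-memory log buffer at the same moment it is applied to the cached page in the buffer pool.

```python
# Toy model of InnoDB's dual-buffer write (illustrative names, not InnoDB code).
log_buffer = []                          # cached log records, flushed to ib_logfile on COMMIT
buffer_pool = {"page_1": {"qty": 10}}    # cached in-memory data pages

def modify(page_id, column, value, txn_id):
    """Record the change as a log record AND apply it to the cached page."""
    log_buffer.append((txn_id, page_id, column, value))  # write-ahead log record
    buffer_pool[page_id][column] = value                 # in-memory data change

modify("page_1", "qty", 42, txn_id=7)
# The statement now exists in both buffers: as a log record and as changed page data.
```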

When a COMMIT is received for a transaction, by default, InnoDB flushes to disk (to the ib_logfile files) the log records in the log buffer representing the transaction in question. The reason we say "by default" is that you can instead tell InnoDB to flush the transaction log files to disk only about once per second, based on the operating system's process scheduling, by setting innodb_flush_log_at_trx_commit to 0. This practice is not, however, recommended for mission-critical applications.
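The option-file setting for this trade-off looks like the following sketch (values per the text; the comments summarize the behavior described above):

```ini
[mysqld]
# 1 (default): flush the log buffer to the log files at every COMMIT (safest)
# 0: flush roughly once per second; up to about a second of committed
#    transactions can be lost on a crash
innodb_flush_log_at_trx_commit = 1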

Figure 5-1. The doublewrite buffer process

Caution: Even if innodb_flush_log_at_trx_commit is set to 1, if the operating system on which the MySQL server is running does not have a reliable flushing mechanism, or if the disk drives fool the operating system into thinking a flush has occurred when, in fact, it has not, InnoDB may lose data. This is not a fault of the storage engine, but rather of the operating system or hardware. For more information about this problem, see Peter Zaitsev's (one of the InnoDB developers) article at

InnoDB log files contain a fixed number of log records.7 Because the log files cannot grow to an infinite size, and because log records are going to continue to be inserted into the log, there must be a way of overwriting log records that have been flushed to disk, and therefore are redundant.

InnoDB’s log record flushing system is circular in this way: when the log file’s size limit is reached, it overwrites log records from the beginning of the log file with newer log records. Figure 5-2 depicts a sample log file with a maximum of 14 log records.
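The circular overwrite behavior can be sketched with a toy fixed-size log (illustrative code, not InnoDB's), using the 14-record file from Figure 5-2:

```python
class CircularLog:
    """Toy fixed-size log file: once full, new records overwrite the oldest."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.next_lsn = 0                # monotonically increasing log serial number

    def append(self, record):
        # The slot index wraps around, so record 14 lands back in slot 0.
        self.slots[self.next_lsn % self.capacity] = (self.next_lsn, record)
        self.next_lsn += 1

log = CircularLog(capacity=14)
for i in range(16):                      # two more records than the file can hold
    log.append(f"change-{i}")
# Slots 0 and 1 now hold records 14 and 15; records 0 and 1 have been overwritten.
```

This is exactly why the Caution below matters: once a slot is overwritten, the old record is unrecoverable from the log.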

Caution: Because of InnoDB’s process of overwriting logs, you must ensure that you provide enough room in your log file to cover the data changes made between backups. See Chapter 17 for more information about these administrative precautions.

Figure 5-2. InnoDB’s log file overwrites itself from the beginning of the file when it’s full.

The actual log record format is itself quite compact. It contains a log serial number (LSN), an 8-byte value giving the file and byte offset of the particular log record. Along with the LSN is a compressed form of the data modifications made by the statement.

When the buffer pool exceeds its size limit, InnoDB is forced to flush data pages to disk in order to evict the least recently used data pages. Before it does so, however, InnoDB uses the LSN element of the page header section of the data page to check that the page header’s LSN is less than that of the last log record flushed to the log file. If it’s not, InnoDB writes the pending log records before flushing the data pages.
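That write-ahead rule can be sketched like so (hypothetical names; flushed_lsn tracks how far the log file has been written):

```python
# Toy write-ahead check performed before evicting a dirty page (illustrative only):
# a data page may not reach disk before the log records describing its changes do.

flushed_lsn = 100                        # highest LSN already durable in the log file

def flush_log_to(lsn):
    """Write pending log records out to the log file, up through lsn."""
    global flushed_lsn
    flushed_lsn = lsn

def evict_page(page):
    """Evict a dirty page, honoring the write-ahead rule first."""
    if page["lsn"] > flushed_lsn:
        flush_log_to(page["lsn"])        # log records must hit disk before the page
    return f"page {page['id']} flushed at LSN {page['lsn']}"

result = evict_page({"id": "p9", "lsn": 150})
```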

The Checkpointing and Recovery Processes

As we explained in Chapter 3, transaction processing systems employ a checkpointing process to mark in a log file that all data pages that have had changes made to records have been flushed to disk. We explained that this checkpoint mark contained a list of open transaction numbers at the time that the checkpoint was made. In the InnoDB checkpointing process, the checkpoint contains a linked list of data pages that may still be dirty because of pending transactions.

A separate background thread, spawned specifically to handle the InnoDB checkpointing process, wakes on a timed interval to flush changed data pages in the buffer pool to the ibdata files. However, InnoDB may not actually flush any data pages at this interval. This is because InnoDB is a fuzzy checkpointer, meaning that it will not flush data changes in memory as long as all of the following conditions hold:

      • Neither the log buffer nor the buffer pool is filled beyond a certain percentage of its total size limit
      • Log file writes have not fallen behind data page writes (a separate thread handles each)
      • No data pages have a page header LSN the same as a log record about to be overwritten by the log writer
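The three conditions above amount to a simple decision function. The sketch below uses made-up names and an arbitrary threshold, purely to show how the conditions combine (flushing is skipped only while all of them hold):

```python
# Toy fuzzy-checkpoint decision (illustrative names and thresholds, not InnoDB's).
def should_skip_flush(log_buffer_pct, buffer_pool_pct,
                      log_behind_data, page_lsn_at_risk, threshold_pct=75):
    return (log_buffer_pct < threshold_pct       # log buffer under its fill limit
            and buffer_pool_pct < threshold_pct  # buffer pool under its fill limit
            and not log_behind_data              # log writes keeping pace with data writes
            and not page_lsn_at_risk)            # no page LSN about to be overwritten

# All conditions hold, so the fuzzy checkpointer does nothing this interval.
skip = should_skip_flush(40, 60, False, False)
```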

After a crash, the InnoDB recovery process automatically kicks in on startup. InnoDB uses the LSN values in the log records to bring the data store to a consistent state based on the last checkpoint record’s LSN.

Other Storage Engines

Although MyISAM and InnoDB are the most commonly used storage engines, MySQL also offers other storage engines that are more specialized. In the following sections, we’ll cover the MERGE, MEMORY, ARCHIVE, CSV, FEDERATED, and NDB Cluster choices.

The MERGE Storage Engine

If you have a situation where, for performance or space reasons, you need to cut large blocks of similar data into smaller blocks, the MERGE storage engine can virtually combine identical MyISAM tables into an aggregated virtual table. A MERGE table must be created with an identical table definition to the MyISAM tables for which it offers a combined view. To create a MERGE table from multiple MyISAM tables, follow the CREATE TABLE statement with the ENGINE=MERGE UNION=(table_list) construct, as shown in Listing 5-2 for a fictional set of tables.8

mysql> CREATE TABLE t1 (
    -> a INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -> message CHAR(20));
mysql> CREATE TABLE t2 (
    -> a INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -> message CHAR(20));
mysql> INSERT INTO t1 (message) VALUES ('Testing'),('table'),('t1');
mysql> INSERT INTO t2 (message) VALUES ('Testing'),('table'),('t2');
mysql> CREATE TABLE total (
    -> a INT NOT NULL AUTO_INCREMENT,
    -> message CHAR(20), INDEX(a))
    -> ENGINE=MERGE UNION=(t1,t2);

Listing 5-2. Creating a MERGE Table from Two Identical Tables

Note that column definitions for all three tables (the original tables and the MERGE table) are identical. Instead of a PRIMARY KEY being defined in the MERGE table, a normal index is used on the column a.

The most common example of MERGE storage engine usage is in archival and logging schemes. For instance, suppose we have a MyISAM table that stores the web site traffic information from our online toy store. It is a fairly simple log table, which looks something like Listing 5-3.

mysql> CREATE TABLE traffic_log (
    -> hit_date TIMESTAMP NOT NULL,
    -> refer_site VARCHAR(255) NOT NULL,
    -> requested_uri VARCHAR(255) NOT NULL,
    -> user_agent VARCHAR(255) NOT NULL,
    -> INDEX (hit_date, refer_site(30))) ENGINE=MYISAM;

Listing 5-3. A Simple MyISAM Web Traffic Log Table

Although this table is simple, it could quickly fill up with a lot of information over the course of a month’s average web traffic to our store. Let’s imagine that, in the first month of tracking, our traffic_log table logged 250,000 hits. MyISAM is buzzing along, inserting records at light speed because of its ability to do thousands of writes per second on an incrementing numeric primary key. But this growth rate will eventually make the table unwieldy, consuming a massive amount of disk space or, worse, using up all the available space for the data file. So, we decide to make some slight changes to the application code that inserts the log records. Instead of one giant table, we create monthly tables, named traffic_log_yymm, where y and m are the year and month, respectively. We create the tables for the year up front and change the application code to insert into the appropriate month’s table based on the log record’s timestamp.
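The application-side change is mostly a matter of deriving the table name from each log record's timestamp. A sketch (the traffic_log_yymm naming scheme is from the text; the helper name is ours):

```python
from datetime import datetime

def monthly_log_table(ts):
    """Map a log timestamp to its monthly table, e.g. March 2005 -> traffic_log_0503."""
    return f"traffic_log_{ts.strftime('%y%m')}"

table = monthly_log_table(datetime(2005, 3, 17))
# The application then issues its INSERT against this table name.
```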

A couple of months into our new system’s lifetime, we make a bold move and compress the older logs using the myisampack utility (discussed earlier, in the section about the MyISAM compressed record format). Then we seem to be getting more and more requests to provide reporting across a number of months of data at a time. Manually UNIONing table results together has started to get a bit annoying.

So, we decide to investigate the MERGE storage engine option. We define a MERGE table as in Listing 5-4.

mysql> CREATE TABLE traffic_log_05 (
    -> id INT NOT NULL,
    -> hit_date TIMESTAMP NOT NULL,
    -> refer_site VARCHAR(255) NOT NULL,
    -> requested_uri VARCHAR(255) NOT NULL,
    -> user_agent VARCHAR(255) NOT NULL,
    -> INDEX (id),
    -> INDEX (hit_date, refer_site(30)))
    -> ENGINE=MERGE INSERT_METHOD=NO
    -> UNION=( traffic_log_0501, traffic_log_0502, … , traffic_log_0512);

Listing 5-4. Creating a MERGE Table from Monthly Log Tables

This creates a table that aggregates the data for the entire year of 2005. Setting INSERT_METHOD to NO is appropriate because our application inserts records into one of the underlying tables anyway. We create indexes on id and on (hit_date, refer_site) because most of the requests from the sales team have been for reports grouped by week and by referring site. We put a limit of 30 characters on the refer_site index because, after analyzing the data, the majority of referring site values fall at or below 30 characters. Note that we did not define a PRIMARY KEY index for the MERGE table; doing so would produce an error, since the MERGE storage engine cannot enforce primary key constraints across all of its UNIONed underlying tables.

We now have a method of accessing the varying log tables using a single table interface, as shown in Listing 5-5.

mysql> SELECT LEFT(refer_site, 30) AS Referer, COUNT(*) AS Referrals
    -> FROM traffic_log_05
    -> WHERE hit_date BETWEEN '2005-01-01' AND '2005-01-10'
    -> GROUP BY LEFT(refer_site, 30)
    -> HAVING Referrals > 1000
    -> ORDER BY Referrals DESC
    -> LIMIT 5;

Listing 5-5. Aggregated Results from the MERGE Table

This would return the top five highest referring sites (limited to 30 characters) in the first ten days of January 2005, counting only sites with more than a thousand referrals. Internally, the MERGE engine will access the traffic_log_0501 table, but now we don’t need to use different table names in order to access the information. All we need to do is issue our query against the MERGE table name, in this case traffic_log_05. Furthermore, we could create a separate MERGE table, traffic_log (replacing our original table), which houses all records for our web site traffic.

Be aware that MERGE tables have some important limitations. You cannot use the REPLACE command on a MERGE table, and UNIQUE INDEX constraints are not enforced across the entire combined data set. For more information, see the MySQL manual at

When you’re considering using a MERGE table, also investigate using views, available as of MySQL 5.0, instead. They provide much more flexibility than the MERGE storage engine. See Chapter 12 for a detailed discussion of views.

The MEMORY Storage Engine

The MEMORY storage engine,9 as its name aptly implies, stores its entire contents, both data and index records, in memory. The trick with MEMORY tables, however, is that the data in the table is lost when the server is restarted. Therefore, data stored in MEMORY tables should be data that can be easily re-created on startup by using application scripts, such as lookup sets, or data that represents a time period of temporary data, like daily stock prices.

When you create a MEMORY table, a single file, named table_name.frm, is created in the data_dir/schema_name/ directory. This file contains the definition of the table.

To automatically re-create data in MEMORY tables, you should use the --init-file=file startup option. The file specified should contain the SQL statements used to populate the MEMORY tables from a persistent table. You do not need to issue a CREATE TABLE statement in the file, because the table definition is persistent across restarts.
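For example, the option file might point at a SQL script like the following (both file paths here are hypothetical):

```ini
# /etc/my.cnf
[mysqld]
# Run these statements at every startup to repopulate MEMORY tables;
# the script contains INSERT ... SELECT statements, not CREATE TABLE.
init-file = /etc/mysql/load_memory_tables.sql
```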

For instance, if you wanted to increase the speed of queries asking for information on zip code radius searching (a topic we will cover in Chapter 8), you might have an InnoDB table zips for persistent, transaction-safe storage, coupled with a MEMORY table zips_mem, which contains the zip codes entirely in memory. While the InnoDB zips table might contain a lot of information about each zip code — population statistics, square mileage, and so on — the zips_mem table would contain only the information needed to do radius calculations (longitude and latitude of the zip code’s centroid). In the startup file, you could have the following SQL statement, which would populate the MEMORY table:

INSERT INTO zips_mem SELECT zip, latitude, longitude FROM zips;

The downside is that any changes to the zip code information would need to be replicated against the zips_mem table to ensure consistency. This is why static lookups are ideally suited for MEMORY tables. After all, how often do the latitude and longitudes of zip codes change?

As you learned in Chapter 2, certain data sets and patterns can be searched more efficiently using different index algorithms. Starting with version 4.1, you can specify that either a hash (the default) or a B-tree index be created on a MEMORY table. Do so with the USING algorithm clause, where algorithm is the index algorithm, in your CREATE TABLE statement. For example, Listing 5-6 demonstrates how to implement a B-tree algorithm on a MEMORY table where you expect a lot of range queries to be issued against a temporal data type. This query pattern is often best served by a B-tree algorithm, versus the default hash algorithm.

mysql> CREATE TABLE daily_stock_prices (
    -> symbol VARCHAR(8) NOT NULL,
    -> date DATE NOT NULL,
    -> high DECIMAL(6,2) NOT NULL,
    -> low DECIMAL(6,2) NOT NULL,
    -> INDEX USING BTREE (date, symbol)) ENGINE = MEMORY;

Listing 5-6. Making a MEMORY Table Use a B-Tree Index
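The hash-versus-B-tree distinction is easy to see outside of MySQL: a hash structure answers only exact-match lookups, while an ordered structure can serve a range scan. A small illustration, using Python's bisect over a sorted key list as a stand-in for a B-tree:

```python
import bisect

# Hash-style index: O(1) exact lookups, but keys have no usable order.
hash_index = {"2005-01-05": "row1", "2005-01-20": "row2", "2005-02-03": "row3"}

# Ordered index (stand-in for a B-tree): keys kept sorted, so ranges are cheap.
ordered_keys = sorted(hash_index)

def range_scan(low, high):
    """Return all keys in [low, high] by slicing the ordered key list."""
    lo = bisect.bisect_left(ordered_keys, low)
    hi = bisect.bisect_right(ordered_keys, high)
    return ordered_keys[lo:hi]

# A range query like "WHERE date BETWEEN ... AND ..." maps naturally onto this;
# a hash index could only answer it by checking every possible key.
january = range_scan("2005-01-01", "2005-01-31")
```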


7 The number of records depends on the number of log files set in the innodb_log_files_in_group configuration setting and the actual size of the file set with innodb_log_file_size.

8 The example in Listing 5-2 is adapted from the MySQL manual.

9 Prior to MySQL 4.1, the MEMORY storage engine was called the HEAP table type.


Michael Kruckenberg started his career with web technologies more than 10 years ago. His first major undertaking was bringing a small mail-order company online (using MySQL). After hopping around a bit during the 1990s Internet boom and spending time in the Internet startup world, Mike put his feet down at his current gig, senior programmer at Tufts University. He is now the technical manager for the Apache/Perl/MySQL-driven Tufts University Sciences Knowledgebase (TUSK), a content repository for faculty and students. Mike likes to spend his elusive free time with his wife and kids on New England adventures, has a healthy addiction to music, and likes to dabble in the arts (photography, video, and literature).

For the past 10 years, Jay Pipes has worked with all kinds of companies, large and small, to identify the value of the information they collect and help them build software that best addresses the needs of their businesses. From e-commerce to work-order management systems, Jay has been involved in projects with both Microsoft and open-source technologies. Passionate about programming and all things technical, Jay now runs his own consulting business, based in Columbus, Ohio. When not being bothered by his two cats, two dogs, and a constantly ringing phone, you can find him, headphones pulled across his ears, happily coding away at home.

Contributors : Michael Kruckenberg, Jay Pipes
Last modified 2006-05-12 05:01 PM