Chapter 5. Storage Engines and Data Types - Part 2

by Michael Kruckenberg and Jay Pipes

From Pro MySQL, Berkeley, Apress, July 2005.

The .MYI File Structure

The .MYI file contains the disk copy of all MyISAM B-tree and R-tree indexes built on a single MyISAM table. The file consists of a header section and the index records.

Note: The developer’s documentation (/Docs/internals.texi) contains a very thorough examination of the structures composing the header and index records.We’ll cover these basic structures from a bird’s-eye view.We encourage you to take a look at the TEXI documentation for more technical details.

The .MYI File Header Section

The .MYI header section contains a blueprint of the index structure, and is used in navigating through the tree. There are two main structures contained in the header section, as well as three other sections that repeat for the various indexes attached to the MyISAM table:

A single state structure contains meta information about the indexes in the file. Some notable elements include the number of indexes, type of index (B-tree or R-tree), number of key parts in each index, number of index records, and number of records marked for deletion.
A single base structure contains information about the table itself and some additional offset information, including the start address (offset) of the first index record, length of each index block (index data page in the key cache), length of a record in the base table or an average row length for dynamic records, and index packing (compression) information.
For each index defined on the table, a keydef struct is inserted in the header section, containing information about the size of the key, whether it can contain NULL values, and so on.
For each column in the index, a keyseg struct defines what data type the key part contains, where the column is located in the index record, and the size of the column’s data type.
The end of the header section contains a recinfo struct for each column in the indexes, containing (somewhat redundant) information about the data types in the indexes. An extra recinfo struct contains information about removal of key fields on an index.

Note: You can find the definition for these data structures in /myisam/myisamdef.h. Additionally, /myisam/mi_open.c contains functions that write the respective header section elements to the .MYI file. Each section has its own function; for instance, the recinfo struct is written to file in the mi_recinfo_write() function.

The .MYI File Index Records

After the header section, the MyISAM index blocks compose the remainder of the .MYI file. The index blocks are 1KB on-disk pages of data, representing the B-tree leaf and non-leaf nodes. A single index block contains only key values pertaining to one index in the table. The header section (detailed in the previous section) contains information about how the MyISAM storage engine should find the root node of each index by supplying the offset for the root node’s index block in the keydef elements.

Each index block contains the following:

A single 2-byte block header. The first bit of the 16 bits in the header indicates whether the block is a leaf node (0 for leaf; 1 for non-leaf). The remaining 15 bits contain the total length of bytes used in the block (nonfree space).
Following the header, index keys and record identifiers are laid out in a balanced organization (the B-tree format). With each key is stored the key value (of a length equal to the data type of the indexed field) and a 4-byte record pointer.
The remainder of the index block is junk bytes (filler bytes), which become used as the B-tree index “fills out” with inserts. This is the “fill factor” for MyISAM B-tree index pages, and typically represents between 65% and 80% of the data used within the index block under normal operations, to allow for split-free growth along with the insertions.

Tip: Running #> myisamchk -rq on a MyISAM table will cause the fill factor to rise to close to 100%, as it fills the index blocks as compactly as possible, which may be advisable on static or infrequently modified MyISAM tables.

MyISAM Table-Level Locking

To ensure the integrity of its data, the MyISAM storage engine supports only a single type of locking level: table-level locking. Much has been made of this “deficiency,” but for many applications, this level of locking, and its specific implementation in the MyISAM storage engine, works quite well and can be effective even in very high concurrency scenarios.

MyISAM issues one of three separate types of locks on its resources (data records), depending on the request issued to it by the connecting thread:

READ LOCAL: If the thread issues a SELECT statement against the in-memory copy of the data records, MyISAM asks for a READ LOCAL lock on the data. This type of lock does not prevent INSERTs into the table, as long as the data will be appended to the end of the data file. If the INSERT would push data into the middle of the data file, then the INSERT statement would need to wait until the READ LOCAL lock was released by the SELECT statement’s thread.
READ: If the actual .MYD data file is used to get information for the requesting client (for instance, the myisamchk utility), as opposed to the in-memory cache of the table data, a lock of type READ (sometimes called a shared lock) is issued. While a READ lock is placed on the resource, all UPDATE, INSERT, and DELETE statements are blocked from executing against the table’s data.
WRITE: A WRITE lock (sometimes called an exclusive lock) is placed on the table resource whenever an UPDATE or DELETE request is received, or if an INSERT is received that would fill an existing space in the data file that had previously been removed via a DELETE request.

So, with the READ LOCAL lock type, MyISAM tables can write data to the table without blocking simultaneous reads of the table’s data. You may wonder, given your understanding of data isolation levels, how this is possible. MyISAM recognizes that INSERT operations occurring on a table in which the primary key is an auto-incrementing number can write the new key data at the end of the index file, as opposed to reading into the index file to find an appropriate place to insert new data. Because the insertion of new keys will always occur at the end of the index file for this type of table, there is no need to hold up SELECT statements that have requested keys or data from anywhere else in the table.

For this reason, MyISAM makes an excellent choice for tables that primarily accomplish logging activity. For instance, it’s ideal for a table containing web site traffic data, where you may want to issue queries against a part of the traffic data, while continuing to insert thousands of new records a minute.

MyISAM Index Choices

Although the actual data is not stored in the order of the table’s primary key, MyISAM does maintain a list of pointers (think of them as internal record numbers) to those data records within its indexes. This key cache contains a linked list of pointers referencing address spaces inside the .MYD file where the actual data rows are stored. Regardless of the number of indexes attached to the MyISAM table, all indexes are implemented using this non-clustered organization (see Chapter 2).

You can have up to 64 separate indexes on a MyISAM table (32 in versions prior to 4.1.2). MyISAM supports three indexing options through which it can retrieve data from its key cache: B-tree, R-tree, and FULLTEXT.

B-Tree Indexes

In order to quickly locate information within the non-clustered index buffers, MyISAM uses a B-tree search algorithm. Therefore, keys are inserted into the index based on the key’s logical location in the index tree. If the key has a string data type and can be compressed using prefix compression, it will be. Alternatively, you can manually specify that compression should happen on INSERT by using the PACK_KEYS=1 option in the CREATE TABLE or CREATE INDEX statement. This can be useful for integer keys where you have a data set with most, if not all, key values using just the low-byte value (see the earlier section on the MyISAM fixed-record format). Packing the keys will strip the nonunique high-byte part of the integer value to allow for higher density indexing.

Note: The MyISAM key buffer system can be found in the /myisam/mi_key.c and /myisam/mi_keycache.c files. The B-tree algorithm is implemented in /myisam/mi_search.c. Table scans on MyISAM are implemented in /myisam/mi_scan.c.

R-Tree Indexes

For those of you who require the ability to work with spatial data types (geographical coordinates or three-dimensional shapes), the MyISAM storage engine supports R-tree indexing for that spatial data. Currently, MyISAM is the only storage engine that supports R-tree analysis. Effectively, the implementation of R-tree indexing on the MyISAM storage engine is a kind of extension to its existing key cache organization. It used the same informational structures as the B-tree indexes, but implements the comparison of values in a different way (the spatial way).

Note: The R-tree algorithm is implemented in the /myisam/rt_* files. Notably, rt_mbr.c contains the implementation for how key values are compared. By the way, mbr stands for minimum bounding rectangle.

FULLTEXT Indexes

MyISAM is currently the only storage engine supporting the FULLTEXT index option. A FULLTEXT index can be defined on any CHAR, VARCHAR, or TEXT field of a MyISAM table. When a record is inserted into a MyISAM table containing a FULLTEXT index, the data for the indexed fields is analyzed and split into “words.” For each word, an index entry is created, with the following elements:

The word itself
The number of times the word is found in the text being inserted
A floating-point weight value designed to express the importance of this word in relation to the entire string of data
The record identifier of the record, used as a pointer into the .MYD file

When a query is run against the index, the index entries are queried and, by default, returned in an order based on the weight value in the index entries. To query a FULLTEXT index, use the MATCH … AGAINST construct, as follows:

SELECT * FROM some_table
WHERE MATCH(fulltext_field1, fulltext_field2) AGAINST ('some search string');

In order to see the weighting of the index in your query results, simply use the MATCH construct in the SELECT clause, like so:

SELECT some_field, MATCH (fulltext_field1, fulltext_field2)
AGAINST ('some search string') FROM some_table;

Tip: You can make numerous tweaks to your FULLTEXT indexes, such as changing the minimum word length, altering the stopword file, and running queries in Boolean mode. http://dev.mysql.com/doc/mysql/en/fulltext-search.html has more information about various FULLTEXT options. In addition, Peter Gulutzan’s article, “The Full-Text Stuff That We Didn’t Put in the Manual” (http://dev.mysql.com/tech-resources/articles/full-text-revealed.html) has some excellent material.

MyISAM Limitations

Despite MyISAM’s various strengths, it does have a few downsides, primarily its lack of foreign key constraints and multiple-statement transaction safety.

Despite plans to include it, there is currently no way of making the MyISAM storage engine enforce a foreign key constraint. Though the FOREIGN KEY clause in your CREATE TABLE statement is parsed by the DDL compiler, nothing is actually stored or done to protect foreign key relational integrity.

The protection of foreign key constraints is a principle of sound database design, yet some in the database community have come out against foreign key constraints because of performance reasons. The MySQL development team is determined to keep performance as a top priority, and has indicated that the MyISAM storage engine may support foreign key constraints in the future, but only if doing so would not seriously impact the performance of the engine.

Unfortunately, at the time of this writing, if you are designing an application that has foreign key dependency support as a top priority, your storage engine choice is limited to InnoDB.² As with other things, enforcing relational integrity for foreign keys comes with a performance cost in InnoDB. However, we should stress that for most applications, this performance difference will be negligible, partly due to InnoDB’s row-level locking scheme, discussed in the next section.

MyISAM also does not give you the ability to ensure the atomicity, consistency, and durability of multiple statements executed with a transaction. The ACID test (see Chapter 3) cannot be applied to statement sets run against MyISAM tables. Although it is possible to mix and match storage engines in the database, if you have a transaction executing against both InnoDB tables (which do support transaction control) and MyISAM tables, you can be assured only that the statements executed against the InnoDB tables will be written to disk and recovered in the event of a crash.

Footnotes

² Technically, you could also use the BDB storage engine, but there are few to no advantages to using this earlier transaction-safe engine over InnoDB.

Michael Kruckenberg started his career with web technologies more than 10 years ago. His first major undertaking was bringing a small mail-order company online (using MySQL). After hopping around a bit during the 1990s Internet boom and spending time in the Internet startup world, Mike put his feet down at his current gig, senior programmer at Tufts University. He is now the technical manager for the Apache/Perl/MySQL-driven Tufts University Sciences Knowledgebase (TUSK), a content repository for faculty and students. Mike likes to spend his elusive free time with his wife and kids on New England adventures, has a healthy addiction to music, and likes to dabble in the arts (photography, video, and literature).

For the past 10 years, Jay Pipes has worked with all kinds of companies, large and small, to identify the value of the information they collect and help them build software that best addresses the needs of their businesses. From e-commerce to work-order management systems, Jay has been involved in projects with both Microsoft and open-source technologies. Passionate about programming and all things technical, Jay now runs his own consulting business, based in Columbus, Ohio. When not being bothered by his two cats, two dogs, and a constantly ringing phone, you can find him, headphones pulled across his ears, happily coding away at home.

Contributors : Michael Kruckenberg, Jay Pipes
Last modified 2006-05-12 05:01 PM

DBAzine.com

Sections

Personal tools

Menu