Improving SQL Full Text Search Performance

by Hilary Cotter

In this article, we will look at tips to improve SQL Server 2000 Full Text Search (FTS) querying and indexing performance. I have divided this article into two sections: one on improving query performance and another on improving indexing performance.

Improving Search Performance

In this section, we will review how to improve the search performance of your SQL FTS search solution.

Limit your results set

SQL Full Text Search querying or search performance is most sensitive to the number of rows you return in your query. There is seldom a reason to return all of the rows that match your search criteria to the client. Consider a search that returns 10,000 rows. Do you really expect your Web-based client to wade through all 10,000 rows? Of course not; returning all the rows only overwhelms your client. Granted, if an application is consuming your search results set, you may need to return all rows.

You will get optimal performance when your query returns fewer than 2,000 rows. Although you could use the TOP operator in your SQL statement, you will get better performance using ContainsTable's top_n_by_rank argument. This makes the query slightly more complex and involves a join, but the overall performance gain is worth it.

Here is a sample stored procedure illustrating this:

CREATE PROCEDURE SimpleSQLFTSSearch(@searchPhrase varchar(200))
AS
DECLARE @searchString varchar(1000)
--escaping single and double quotation marks so they cannot break the dynamic SQL or the search phrase
SET @searchPhrase = replace(@searchPhrase, char(39), char(39) + char(39))
SET @searchPhrase = replace(@searchPhrase, char(34), char(34) + char(34))
--returning a syntax message if no search phrase is passed
IF len(isnull(@searchPhrase, '')) = 0
BEGIN
    PRINT 'usage is SimpleSQLFTSSearch ''Your Search Phrase goes here'''
    RETURN -1
END
--building the dynamic query; top_n_by_rank (200) limits the result set returned by ContainsTable
SET @searchString = 'SELECT * FROM authors as a join CONTAINSTABLE(authors,'
SELECT @searchString = @searchString + '*,' + char(39) + char(34) + @searchPhrase
SELECT @searchString = @searchString + char(34) + char(39) + ',200) as t on '
SELECT @searchString = @searchString + 't.[KEY]=a.au_id order by rank desc'
EXEC (@searchString)
RETURN @@rowcount

Note:

      1. We are escaping any single or double quotation marks that might otherwise cause our dynamic query to fail.
      2. I have wrapped the build string statement over four lines for readability. You should be able to consolidate this to two lines.
      3. We are ordering by rank in descending order. Ranking is a statistical measure of how relevant each row is to your search phrase. Do not consider it a quantitative measure (for instance, you can't consider a row with a rank of 500 to be five times as relevant as a row with a rank of 100); rather, consider ranking a way to order your search results so that the rows most likely to be relevant come first. In your client, you may want to implement paging to display your search results in groups of 10 or 25 rows at a time. You will find that the most efficient caching mechanism is to repeatedly issue the same query and do the paging on the client. The reason this works so well is that most users find what they are looking for on the first page; in general, more than 90 percent of searches are satisfied by the first page of results.
      4. I have deliberately left off any error checking as the error numbers returned by MSSearch (for instance, your MSSearch service being stopped) are not trappable by TSQL. MSSearch is the engine that provides querying and indexing services.
      5. I am returning a value of -1 if no search phrase is passed, and the row count otherwise. This is because I prefer to use a DataReader on the client, which does not report a row count on its own (unless you return it separately, as we are doing here through the procedure's return value; see the example after these notes). DataReaders provide optimal performance for read-only lists.
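
For reference, here is a rough example of calling the procedure from Query Analyzer and capturing the returned row count (the search phrase shown is just an illustrative value):

DECLARE @rows int
EXEC @rows = SimpleSQLFTSSearch 'database design'
PRINT 'rows returned: ' + cast(@rows as varchar(10))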

Contains/FreeText

Contains/ContainsTable offers faster performance than FreeText/FreeTextTable. A FreeText/FreeTextTable search returns more results because it is a fuzzy search, whereas Contains/ContainsTable is a strict search. By default, a search using the Contains predicate on rows containing the word book will only match book, while a search on the word book using the FreeText predicate would return rows containing the words book, books, booking, booker, and booked. FreeText is considered a more natural way of searching, but its performance can be up to an order of magnitude slower than using Contains, especially for larger tables. Google, for instance, uses a strict, Contains-style search. You also have the option of making a Contains query fuzzier; consult BOL (Books Online) for more information on this.
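
The following is a rough sketch against the pubs authors table from the earlier example (the word book is used purely for illustration; it does not actually appear in that table). It contrasts a strict Contains search, a FreeText search, and the FORMSOF(INFLECTIONAL, ...) syntax that lets a Contains query match word variations:

--strict search: by default only the exact word "book" matches
SELECT a.au_lname, t.rank FROM authors AS a
    JOIN CONTAINSTABLE(authors, *, '"book"', 200) AS t ON t.[KEY] = a.au_id
ORDER BY t.rank DESC

--fuzzy search: book, books, booked, booking, and so on all match
SELECT a.au_lname, t.rank FROM authors AS a
    JOIN FREETEXTTABLE(authors, *, 'book') AS t ON t.[KEY] = a.au_id
ORDER BY t.rank DESC

--a fuzzier Contains query: FORMSOF matches the inflectional forms of "book"
SELECT a.au_lname, t.rank FROM authors AS a
    JOIN CONTAINSTABLE(authors, *, 'FORMSOF(INFLECTIONAL, book)', 200) AS t ON t.[KEY] = a.au_id
ORDER BY t.rank DESC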

Split text data into another table

There are some performance benefits gained from splitting your text data into a child table. When you have large char or varchar columns in your base table, overall performance is hampered by the table width. This does not apply to Text or Image data-type columns, as they are stored in a separate data page (unless you are using the Text In Row table option and your text data is small).

The following is an example of a Full Text Query in which we separate the textual data into a separate table:

CREATE PROCEDURE SQLFTSSearch(@searchPhrase varchar(200))
AS
DECLARE @searchString varchar(1000)
--escaping single and double quotation marks so they cannot break the dynamic SQL or the search phrase
SET @searchPhrase = replace(@searchPhrase, char(39), char(39) + char(39))
SET @searchPhrase = replace(@searchPhrase, char(34), char(34) + char(34))
--returning a syntax message if no search phrase is passed
IF len(isnull(@searchPhrase, '')) = 0
BEGIN
    PRINT 'usage is SQLFTSSearch ''Your Search Phrase goes here'''
    RETURN -1
END
--the narrow Parent table supplies the columns we return; the full-text search runs against the Child table
SET @searchString = 'SELECT Col1, Col2, Col3 FROM Parent as p join '
SELECT @searchString = @searchString + 'CONTAINSTABLE(Child,'
SELECT @searchString = @searchString + '*,' + char(39) + char(34) + @searchPhrase
SELECT @searchString = @searchString + char(34) + char(39) + ',200) as t on '
SELECT @searchString = @searchString + 't.[KEY]=p.PK order by rank desc'
EXEC (@searchString)
RETURN @@rowcount

in which the Parent table looks like:

CREATE TABLE Parent
(
PK int not null Primary Key identity(1,1),
Col1 char(20),
Col2 char(20),
Col3 char(20)
)
and the child looks like:
CREATE TABLE Child
(
PK int not null Primary Key,    --holds the same value as the corresponding Parent.PK
TextData varchar(8000)          --the wide textual column that is full-text indexed
)

The advantage is that the wide Child table, which holds the text data, does not participate in the join at all; only the narrow Parent table and the result set returned by ContainsTable do. Consequently, you get better performance.

Partitioning

You will often run into situations in which you are searching extremely large tables or need to restrict your search results based on another criterion. Consider the case in which you are searching a table, are limiting your search to the top 200 rows by rank, and only want to return rows that have an EntryDate greater than GetDate()-365:

SELECT * FROM TableName as a join
CONTAINSTABLE(TableName, *, '"SearchPhrase"', 200) as t on t.[KEY]=a.PK
where EntryDate > GetDate()-365 order by rank desc

Suppose the top 200 rows by rank all have an entry date older than one year; in that case the WHERE clause filters every one of them out, and you receive no search results at all. In such cases, you will have to return all rows, or else partition your data into yearly tables, check the search criteria, and have your stored procedure branch to the correct table, as the sketch below illustrates.
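
Here is a minimal sketch of that branching approach, using the same dynamic-SQL style as the earlier procedures. The yearly table names (TableName2004 and TableName2005), the PK column, and the cut-off date are all hypothetical, and each yearly table is assumed to have its own full-text index; quote escaping is omitted for brevity (see SimpleSQLFTSSearch above):

CREATE PROCEDURE SearchByYear(@searchPhrase varchar(200), @entryDate datetime)
AS
DECLARE @searchString varchar(1000), @tableName sysname
--branch to the yearly table that covers the requested date
IF @entryDate >= '20050101'
    SET @tableName = 'TableName2005'
ELSE
    SET @tableName = 'TableName2004'
SET @searchString = 'SELECT * FROM ' + @tableName + ' as a join CONTAINSTABLE('
SELECT @searchString = @searchString + @tableName + ',*,' + char(39) + char(34) + @searchPhrase
SELECT @searchString = @searchString + char(34) + char(39) + ',200) as t on t.[KEY]=a.PK order by rank desc'
EXEC (@searchString)
RETURN @@rowcount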

Caching

You get the best search performance when you don't have to hit the catalog at all. If your users' search patterns are predictable, you can cache your searches as pre-generated search results pages. The accompanying code sample illustrates how to do this: the text file redirect.txt contains a list of search phrases and the HTML pages to which the ISAPI extension will redirect. In one search solution, we achieved a staggering 1,400-percent increase in performance by caching. You can't do this, however, if your searches are unpredictable or if you require real-time search results.

Improving Search Indexing Performance

The indexing speed of SQL FTS has always been its Achilles heel. In this section we review some tips on how to improve indexing performance.

Change Tracking

The best way to improve SQL FTS indexing performance is to use change tracking with the "update index in background" method. If you update a column that is being full-text indexed, an identifier is written to the sysfulltextnotify table in the database you are full-text indexing. MSSearch polls this table, and the data in the rows corresponding to these identifiers is extracted and indexed; this process normally takes only a few seconds. You also have the option of using Full or Incremental populations, which extract every row (or, for an Incremental population on a table with a timestamp column, every changed row) from the tables you are full-text indexing; for million-row tables, a Full population can take longer than a day. Change-tracked indexing can also be scheduled when you want to minimize the impact of indexing on overall SQL Server performance. For instance, you might want to schedule the indexing to occur at night, or during another period of low activity.
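
As a quick sketch (the pubs authors table is used here purely as an example, and the table must already have a full-text index defined), change tracking with background index updates is enabled through sp_fulltext_table:

--begin tracking changes to the full-text indexed columns
EXEC sp_fulltext_table 'authors', 'start_change_tracking'
--propagate tracked changes to the full-text index as they occur ("update index in background")
EXEC sp_fulltext_table 'authors', 'start_background_updateindex'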

Use a Separate Build Server

You should use a separate build server if:

      1. you don't need real-time indexing,
      2. you can live with catalogs that are, perhaps, a day or more out of date, and
      3. you can take your search functionality offline for a short time.

You could build the catalog on a different server, then copy the catalog files to your search server. See http://support.microsoft.com/default.aspx?scid=kb;en-us;240867 for instructions on how to do this.

Hardware recommendations

There is a performance sweet spot for SQL FTS on machines with eight processors (or a quad-processor machine with hyper-threading). If you have a limited budget, you should invest your money in the fastest disk subsystem possible. Ideally, you should place your Full Text catalogs on a separate RAID 10 array with a different controller from the one serving your page file and transaction log. You may find that a RAID 5 array with write-back caching offers performance close to that of a RAID 10 array.

Miscellaneous Tweaks

Here are registry settings that can increase the number of threads available for SQL FTS (use them only on multi-processor systems):

HKLM\Software\Microsoft\Search\1.0\Gathering Manager\Servers\SQLServer\EvaluationOrder

HKLM\Software\Microsoft\Search\1.0\Gathering Manager\Servers\SQLServer\HitInterval

When rows are extracted, the data is stored on disk in a location specified in:

HKLM\Software\Microsoft\Search\1.0\Gathering Manager\TempPath

Consider pointing this entry to a different drive or array from the one holding your page file. You can also use SetTempPath.vbs to do this. The script can be found in C:\Program Files\Common Files\System\MSSearch\Bin.

SQL FTS 2000 can index binary file types if they are stored in an image column and you store the document type (the file extension) in a char(3) or char(4) column. Although this feature is valuable, you will get better indexing performance if you are able to convert the data to text and store it in columns of the text data type.
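
As a sketch with hypothetical table and column names (Documents, DocContent, DocType), the @type_colname parameter of sp_fulltext_column is what tells MSSearch which column holds the document type for an image column:

--DocContent is an image column holding the binary documents;
--DocType is a char(4) column holding the file extension (for example, doc or xls)
EXEC sp_fulltext_column @tabname = 'Documents', @colname = 'DocContent',
     @action = 'add', @type_colname = 'DocType'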

Conclusion

This completes our look at tips for improving SQL Server 2000 FTS indexing and search performance. SQL Server 2005 offers radical performance increases; we will look at the new features of SQL Server 2005 FTS in a later article.

--

Hilary Cotter has been involved in IT for more than 20 years as a Web and database consultant in the tri-state area of New York, New Jersey and Connecticut. He was first recognized by Microsoft in 2001 with the Microsoft SQL Server MVP award. After receiving his Bachelor of Applied Science degree in Mechanical Engineering from the University of Toronto, he studied both economics at the University of Calgary and Computer Science at the University of California at Berkeley. Hilary has worked for Microsoft, Merrill Lynch, UBS-Paine Webber, MetLife, VoiceStream, Tiffany & Co., Pacific Bell, Cahners, Novartis, Petro-Canada, and Johnson and Johnson. He has just completed A Guide to SQL Server 2000 Transactional and Snapshot Replication and has a companion volume on merge replication in the works for 2005.

