Filling in the Gaps

by Joe Celko

As I get older, I am convinced that there really is no such animal as a simple programming problem. Oh, they might look simple when you start but that is just a trick. Under the covers, are all kinds of devils just waiting to get out.

Darren Taft posted what seems like an easy problem on the SQL Server newsgroup in 2000 October. Let me quote him: "I have an ordering system that allocates numbers within predefined ranges. I do this at the moment using this: ..." At this point, he posted a stored procedure written in T-SQL dialect. This procedure had a loop that incremented the request_id number in a loop until it either found a gap in the numbering or failed. Mr. Taft then continued: "This is fine for the first few numbers, but when the ranges are anything up to 10,000 between the minimum and the maximum, it starts to get a little slow. Can anyone think of a better way of doing this?

Basically it needs to find the next number within the range for which there isn't a row in the Requests table (the primary key is the request_id, which is an integer column with a clustered index). Rows can be deleted from within the range, so the next number will not always be the current maximum plus one."

Before you go further, try to write a procedural solution yourself. Now, put down your pencils and start reading again. As an aside, the original stored procedure was wrong because it did not test for an upper bound. If the range was completely used, the stored procedure would return the upper limit plus one.

Graham Shaw immediately proposed this query:

SELECT MIN (R1.request_id          + 1) 
  FROM Requests AS R1
       LEFT OUTER JOIN 
       Requests AS R2 
       ON R1.request_id + 1 = R2.request_id
  WHERE R2.request_id IS NULL;

The idea is that there is a leftmost value in the Requests table just before a gap. Therefore, when (request_nbr +1) is not in the table, we have found a gap. This is what the incremental approach in the stored procedure was doing, one row at a time.

Too bad this does not work. First of all, there is no checking for an upper bound. In effect, the flaw in the original stored procedure has become part of the specification! This is like the story about the Englishman who sent a favorite old jacket to a Chinese tailor and told him to make an exact copy of it in heavy silk. The tailor did exactly that, right down to the cigarette burns, stains and frayed elbows. The second problem is that you cannot get the first position in the range if it is the only one vacant.

Umachandar Jayachandranm, another regular to the newsgroup, saw that the OUTER JOIN should be expensive and suggested that Darren try this query:

 SELECT MIN(R1.request_id) + 1
  FROM Requests AS R1
 WHERE NOT EXISTS
       (SELECT * 
          FROM Requests AS R2
         WHERE R2.request_id = R1.request_id + 1 
           AND R2.request_id >= {{low range boundary}}) 
   AND R1.request_id >= {{low range boundary}}

He also proposed a proprietary solution based on the TOP(n) operator in SQL Server, but I will not go into that answer. But again, this answer has the same two flaws as before.

I agreed with Umachandar that the OUTER JOIN solution was needlessly complex. I proposed a more set-oriented solution in the form of a VIEW of the all gaps in the numbering, instead. That query looked like this:

CREATE VIEW Gaps (gap_start, gap_end)
AS SELECT DISTINCT R1.request_id + 1, MIN(R2.request_id -1)
     FROM Requests AS R1,
          Requests AS R2
    WHERE R1.request_id <= R2.request_id
      AND R1.request_id + 1 
      NOT IN (SELECT request_id FROM Requests)
      AND R2.request_id - 1 
      NOT IN (SELECT request_id FROM Requests)
      AND R1.request_id + 1 <= {{high range boundary}}
      AND R2.request_id - 1 >= {{low range boundary}}
    GROUP BY R1.request_id;

I was happy with this answer, since it found all the desired numbers and solved the problems at the extremes of the range. By using the plus and minus one, I am finding the gaps from both their left and right sides, so I will catch an open slot in both the high and low range boundaries. The only improvement I found was that you might want to change the NOT IN () predicates to NOT EXISTS() predicates for performance in some SQL products. You can also use this view to get reports on the density of allocated numbers, use it to compress the gaps, to insert new requests in a well distributed manner, and so on.

I was proud of myself until Darren replied, "Interesting response, but it doesn't actually provide the answer. I would need a further query on the view to get what I want. This view actually runs slower than the OUTER JOIN suggestion, so with a query on top of that, it has to be the slowest answer so far." He did concede that the query is handy for analyzing gaps and that he would keep it for future reference. That helped my wounded ego a little bit.

So it was time to do more thinking about the boundary problems and how to return only one number. I finally came up with this nightmare query:

SELECT MIN (X.request_id)
  FROM (SELECT (CASE WHEN (R1.request_id + 1) 
                           NOT IN (SELECT request_id 
                                     FROM Requests)
                     THEN (R1.request_id + 1)
                     WHEN (R1.request_id - 1) 
                           NOT IN (SELECT request_id 
                                    FROM Requests)
                     THEN (R1.request_id - 1)
                     ELSE NULL END) 
  FROM Requests AS R1
 WHERE R1.request_id + 1 
       BETWEEN {low range boundary} AND {high range boundary}
   AND R1.request_id - 1 
       BETWEEN {low range boundary} AND {high range boundary}
 GROUP BY R1.request_id) AS X(request_id);

The outermost query is simply returning the first number in the derived query. The derived query, X, finds gaps from both the left and the right sides by incrementing and decrementing values in the Requests table. It also does a range check in the WHERE clause. The real trick is in the CASE expression; when a gap exists to the right of a number, return it; when a gap exists to the left of a number, return it; when there are no gaps, return a NULL. This will solve the boundary problem at the extremes of the range. It might be ugly, but at least it works!

There is also a subtle third problem here. All these approaches tend to favor picking a new request_id value in the lower end of the range. The clustered B-tree index would have to be re-balanced more often than if you were to pick new request_id numbers randomly from the possible values in the gaps. The table will be reorganized more than you would really wish it to be.

For a situation with a great number of transactions, the real trick is to replace the clustered index with an unclustered index.

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is SQL for Smarties (Morgan-Kaufmann Publishers, 1999). He is the Vice President of RDBMS at North Face Learning in Salt Lake City.

Contributors : Joe Celko
Last modified 2005-04-20 10:32 AM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

Filling in the Gaps