# Integer Labeling in Nested Intervals Model

This article completes the series of articles exploring Binary Rational labeling for Nested Intervals Model. The advantage of this model is that every node has a permanent label so that insertion of a new node doesn’t affect existing nodes. In this article, we’ll reduce the node labeling to just a single positive integer number.

## Single Number Labeling

The Nested Intervals Model labeled each node with a binary rational number, which is essentially a pair of integer numbers. Given that a set of all pairs of integer numbers can be enumerated, it easily follows that we can code each node with one integer only. This is not a very interesting idea, however. We save some storage, but for all practical querying purposes, we would have to translate into binary rational numbers. Plus, who cares about storage limitations nowadays?

We’ll explore a different approach: employing a triangular navigation method. This method is similar to the enumeration mentioned in the previous paragraph, but we do not refer to numerator and denominator values explicitly. The picture below illustrates the idea:

We start with node <x=1,y=1/2>, which we label with 1, move diagonally left down along the “red” line, but find no more nodes. Next, we navigate along the red, diagonally aligned line just above it. Our starting point is <x=1,y=3/4>, which we label with the next integer, 2. Next, we move diagonally left down along the “red” line, find node <x=1/2,y=1/4>, and label it with integer 3. In the following step, we’ll navigate the next red diagonal just above our last one, and label them nodes 4,5,6, and 7.

In other words, numbers at every diagonally aligned line form an arithmetic progression. Hence, we’ll call them *arithmetic siblings.* Don’t confuse those with real siblings, which are, as you might recall from the introductory article,
aligned along the lines that meet at a breadth-first convergence point.
The other distinction is that each line drawn through the sibling nodes
has a slope less than that of the diagonal.

You might have already noticed that every first (or left-most) child
is twice as big as its parent. Therefore, we call all nodes on each
vertical line *geometric descendants*.

An immediate property of the previously mentioned labeling schema is *density*:
all integer positive numbers are utilized. Since some database
implementations have integers of limited range only, density might
become an attractive feature, because one would be able to store and
manipulate bigger trees before hitting the arithmetic overflow
exception. (Even though we’e getting a little ahead of ourselves, I’ll
mention that this idea didn’t prove to work; please note the conclusion
section of the paper for more comments.)

## Partial Order Relation

Single integer labeling allows a very simple definition of the
transitive closure relation. Given 2 nodes, i and j, such that i <=
j, let k be the maximum integer, satisfying i*2^{k} <= j. For example, if i=1 and j=11, then k=3 and i*2^{k}=8. Geometrically, the i*2^{k} is the intersection of the arithmetic siblings line originated in j and geometric descendants line originated in i.

Consider the binary relation **ancestor of** satisfying the following condition:

iancestor ofj <=> j-i*2^{k}< 2^{k-1}

Informally, we’ve required j to be closer to i*2^{k} than i to k if we adjust for “geometric” distance between i and i*2^{k}. For example, let i=1, j=12, then k=3 and i*2^{k}=8. We have j-i*2^{k} = 4 and 2^{k-1} = 4, therefore 12 is not an ancestor of 1. Indeed, 12 = 3*2^{2} is an ancestor of node 3, which is a sibling of node 1 (if we are allowed to consider the roots of the forest as siblings).

From the definition itself, that **ancestor of** a transitive
relation is not quite obvious. Our new integer labeling schema,
however, has a very simple connection to the Materialized path and
Binary Rational Encoding, which makes transitivity obvious.

## Ordered Integer Partitions

For each node labeled with **Numer** and **Denom** by Binary Rational Encoding schema, let

N = Denom - (Numer+1)/4

Here is a list of 16 tree nodes in 3 three different encodings:

N |
PATH |
NUMER/DENOM |

1 | .1 | 3/2 |

2 | .1.1 | 7/4 |

3 | .2 | 3/4 |

4 | .1.1.1 | 15/8 |

5 | .1.2 | 11/8 |

6 | .2.1 | 7/8 |

7 | .3 | 3/8 |

8 | .1.1.1.1 | 31/16 |

9 | .1.1.2 | 27/16 |

10 | .1.2.1 | 23/16 |

11 | .1.3 | 19/16 |

12 | .2.1.1 | 15/16 |

13 | .2.2 | 11/16 |

14 | .3.1 | 7/16 |

15 | .4 | 3/16 |

16 | .1.1.1.1.1 | 63/32 |

Notice that the

*PATH*encoding for each node between N=2

^{k-1}, and N=2

^{k}-1 is an ordered partition of integer k. Also, note that Denom in each partition of k equals 2

^{k}. It immediately follows that the sum of dot-separated numbers in any path is the logarithm of Denom.

N is the same encoding that we introduced in previous sections! The relation **ancestor of**, therefore, is just a reflection of the relation **is prefix of** defined on Materialized Paths, which is obviously transitive.

## Distance Function

Distance function is a straightforward implementation of ancestor relation defined in the previous section:

functiondist ( childinteger, parentinteger)RETURNinteger IS

ret integer;

arifgeom integer;

pwr integer;BEGIN

ifparent > childthen

RETURN-1;

end if;

ifparent = child then

RETURN0;

end if;

arifgeom := parent;

pwr := 0;

whilearifgeom*2 <= childloop

arifgeom := arifgeom*2;

pwr := pwr+1;

end loop;

ifchild-arifgeom < arifgeom/(parent*2)then

RETURNpwr;

end if;

RETURN-1;END;

selectdist(11,3), dist(12,3), dist(13,3), dist(14,3)fromdual

-1 2 2 -1

In theory, distance function is all that is required to answer
hierarchical queries. In practice, however, query with distance
function implies scanning the whole hierarchy table and applying the **ancestor of** predicate to every row. With this method, we couldn’t advance beyond small hierarchies.

## Querying Node’s Ancestors

A typical query that doesn’t require scanning the whole hierarchy might ask to find the list of node’s ancestors. Similarly, when a materialized path, nested intervals, and binary rational labelings answer such a query, this doesn’t involve querying the hierarchy relation at all. The Parent label calculation in our new integer labeling schema is simplicity itself. If the child number is even, then child divided by 2 gives us the parent. Otherwise, we won’t jump to the parent right away, but instead, navigate to adjacent (real) sibling with smaller encoding. Technically this means subtracting 1, and dividing the result by 2:

functionimmediate_parent ( childinteger)RETURNinteger IS

retinteger;BEGIN

ret := child;

whilemod(ret,2)=1loop

ret := (ret-1)/2;

end loop;

RETURNret/2;END;

/

selectimmediate_parent(21)fromdual

5

Therefore, queries involving ancestors can be answered efficiently by iterative navigation through the ancestors chain. These iterating calls to the immediate_parent function can be conveniently wrapped inside a table function. (Another common name for the table function is “virtual view”).

If a list of ancestors is used as a part of more complex report, then it would make sense creating index on a node number column (having a primary key on the node number might be a good idea, anyway). Assuming that the list of ancestors is small compared to the total number of rows in the hierarchy table, it would be more efficient to access each individual node on the ancestors list by a unique index scan.

## Mapping to Interval Encoding

Unlike ancestor queries, we can’t leverage the immediate_parent
function when querying the node’s descendants. Even if we develop a
dual version, immediate_child(child# integer*),* we still couldn’t
plug it into a single query. In this circumstances, the only efficient
method I’m aware of is Nested Sets approach. Given a node encoded as an
interval [x,y], we qualify all the nodes [x',y'] as the descendants
whenever y < x' < x (or, equivalently, y < y' < x). This is
typically an index range scan. Therefore, we need to be able to convert
our single number labels into the interval.

First, given a node N, we define k as

k = floor(log(2,N))

Then, we notice that arithmetic siblings are evenly spaced with horizontal distance 1/2^{k} between neighbor siblings. Therefore, the horizontal distance between nodes N and 2^{k} is (N-2^{k})/2^{k}. On the other hand, this distance is the same as the difference between x coordinate of node 2^{k} and x coordinate of the node N, that is 1-x. Our first equation tying the interval coordinate x to N follows immediately:

1-x = (N-2^{k})/2^{k}

can be rewritten as

x = (2^{k+1}-N)/2^{k}

Next, we leverage the constraint that node N lies on the arithmetic siblings line. Each arithmetic siblings line has a slope of -1, so the equation must have the form

y-x = const

Knowing that the line intersects the (x=1, y=1-1/2^{k+1}) point, we refine the line equation to

y-x = -1/2^{k+1}

which, after substituting x, can be rewritten as

y = (2^{k+2}-2*N-1)/2^{k+1}

When implementing these equations, be aware of the artifacts like limited precision of the logarithm function:

selectlog(2,4)-2, floor(log(2,4))fromdualLOG(2,4)-FLOOR(2) FLOOR(LOG(2,4))

----------------- ---------------

-2.000E-38 1

Therefore, we can’t rely on floor(log(2,N)), but have to use our homegrown function:

functionfloorLog2 ( Ninteger)RETURN integer IS

k integer;

kpow2 integer;BEGIN

k := 0;

kpow2 := 1;

whilekpow2 <= Nloop

k := k+1;

kpow2 := kpow2*2;

end loop;

RETURNk-1;END;

Now that the snag is removed we can proceed:

functionx_num ( Ninteger)RETURN integer IS

ret_numinteger;

ret_deninteger;

kinteger;BEGIN

k := floorlog2(N);

ret_num := power(2,k+1)-N;

ret_den := power(2,k);

whilemod(ret_num,2)=0

ret_num := ret_num/2;

ret_den := ret_den/2;

end loop;

RETURNret_num;END;

functionx_den ( Ninteger)RETURN integer IS

...

RETURNret_den;END;

functiony_num ( Ninteger)RETURN integer ISret_numinteger;

ret_deninteger;

kinteger;BEGIN

k := floorlog2(N);

ret_num := power(2,k+2)-2*N-1;

ret_den := power(2,k+1);

whilemod(ret_num,2)=0loop

ret_num := ret_num/2;

ret_den := ret_den/2;

end loop;

RETURNret_num;END;

functiony_den ( Ninteger)RETURN integer IS...

RETURNret_den;END;

selectpath(normalize_numer(numer,denom),

normalize_denom(numer,denom))from(

selectx_num(N)*y_den(N)+y_num(N)*x_den(N) numer,

x_den(N)*y_den(N) denomfrom(

selectrownum Nfromemp

)

);N PATH

-- ----------

1 .1

2 .1.1

3 .2

4 .1.1.1

5 .1.2

6 .2.1

7 .3

8 .1.1.1.1

9 .1.1.2

10 .1.2.1

11 .1.3

12 .2.1.1

13 .2.2

14 .3.1

In the test above, we have selected a list of nodes numbered from 1 to 14, then calculated interval boundaries x and y. Next, we added x and y together to convert to a single binary rational encoding. Since numerator, as a result of the previous step, can be an even number, we have to normalize the numerator and denominator before submitting them as arguments to the path function.

The final steps would be amending the hierarchy table to leverage new encoding

createtableemps (

nodeinteger,

employee_namevarchar2(30),

salarynumber,

...

)

and replacing the view

createor replaceviewhierarchyas

selectnode,

y_num(node) numer_left,

y_den(node) denom_left,

x_num(node) numer_right,

x_den(node) denom_right,

integer2path(node) path,

dist(node,1) depth,

employee_name,

salary,

...

fromemps

Writing integer2path(node) function is left as an exercise for you.

## Conclusion

Even though we are able to encode a tree node with a single integer, we still have to use Binary Rational encoding for query purposes. Unfortunately, denominators in the encoding sequence of all the children of the node “1”

PATH |
BINARY ENCODING |

.1.1 | 7/4 |

.1.2 | 11/8 |

.1.3 | 19/16 |

are growing exponentially. Likewise, the members of a depth-first sequence

PATH |
BINARY ENCODING |

.1 | 3/2 |

.1.1 | 7/4 |

.1.1.1 | 15/8 |

are growing exponentially as well. As of today, databases implement integers of a limited range and, therefore, one won’t be able to store and manipulate trees of any significant size before hitting the arithmetic overflow exception.

This flaw can be fixed by switching to a different encoding. Database developers are welcome to implement a new efficient method proposed in

http://arxiv.org/html/cs.DB/0401014

and

http://arxiv.org/abs/cs.DB/0402051

## Acknowledgments

This is a revision of the paper that was originally submitted to SIGMOD Record. The author thanks the anonymous reviewer who spotted a flaw in the transitivity proof.

--

**Vadim Tropashko** still works for the Real World Performance group at Oracle Corp.

Contributors : Vadim Tropashko

Last modified 2005-06-30 01:06 AM

## Powers of 2

## Replies to this comment