# Path Finder

I got an email asking me how to find paths in a graph using SQL. The author of the email had seen my chapter on graphs in *SQL for Smarties*, and read that I was not happy with my own answers. What he wanted was a list of paths from any two nodes in a directed graph, and I would assume that he wanted the cheapest path.

After thinking about this for awhile, the best way is probably to do the Floyd-Warshall or Johnson algorithm in a procedural language and load a table with the results. But I want to do this in pure SQL as an exercise.

Let's start with a simple graph and represent it as an adjacency list with weights on the edges.

CREATE TABLE Graph (source CHAR(2) NOT NULL, destination CHAR(2) NOT NULL, cost INTEGER NOT NULL, PRIMARY KEY (source, destination));

I got data for this table from the book *Introduction to Algorithms* by Cormen, Leiserson and Rivest (ISBN 0-262-03141-8), page 518. This book is very popular in college courses in the United States. I made one decision that will be important later; I added self-traversal edges (i.e., the node is both the source and the destination) with weights of zero.

INSERT INTO Graph VALUES ('s', 's', 0); INSERT INTO Graph VALUES ('s', 'u', 3); INSERT INTO Graph VALUES ('s', 'x', 5); INSERT INTO Graph VALUES ('u', 'u', 0); INSERT INTO Graph VALUES ('u', 'v', 6); INSERT INTO Graph VALUES ('u', 'x', 2); INSERT INTO Graph VALUES ('v', 'v', 0); INSERT INTO Graph VALUES ('v', 'y', 2); INSERT INTO Graph VALUES ('x', 'u', 1); INSERT INTO Graph VALUES ('x', 'v', 4); INSERT INTO Graph VALUES ('x', 'x', 0); INSERT INTO Graph VALUES ('x', 'y', 6); INSERT INTO Graph VALUES ('y', 's', 3); INSERT INTO Graph VALUES ('y', 'v', 7); INSERT INTO Graph VALUES ('y', 'y', 0);

I am not happy about this approach, because I have to decide the maximum number of edges in path before I start looking for an answer. But this will work and I know that a path will have no more than the total number of nodes in the graph. Let's create a table to hold the paths:

CREATE TABLE Paths (step1 CHAR(2) NOT NULL, step2 CHAR(2) NOT NULL, step3 CHAR(2) NOT NULL, step4 CHAR(2) NOT NULL, step5 CHAR(2) NOT NULL, total_cost INTEGER NOT NULL, path_length INTEGER NOT NULL, PRIMARY KEY (step1, step2, step3, step4, step5));

The step1 node is where I begin the path. The other columns are the second step, third step, fourth step, and so forth. The last step column is the end of the journey. The total_cost column is the total cost, based on the sum of the weights of the edges, on this path. The path length column is harder to explain, but for now, let's just say that it is a count of the nodes visited in the path.

To keep things easier, let's look at all the paths from "s" to "y" in the graph. The INSERT INTO statement for construction that set looks like this:

INSERT INTO Paths SELECT G1.source, -- it is 's' in this example G2.source, G3.source, G4.source, G4.destination, -- it is 'y' in this example (G1.cost + G2.cost + G3.cost + G4.cost), (CASE WHEN G1.source NOT IN (G2.source, G3.source, G4.source) THEN 1 ELSE 0 END + CASE WHEN G2.source NOT IN (G1.source, G3.source, G4.source) THEN 1 ELSE 0 END + CASE WHEN G3.source NOT IN (G1.source, G2.source, G4.source) THEN 1 ELSE 0 END + CASE WHEN G4.source NOT IN (G1.source, G2.source, G3.source) THEN 1 ELSE 0 END) FROM Graph AS G1, Graph AS G2, Graph AS G3, Graph AS G4 WHERE G1.source = 's' AND G1.destination = G2.source AND G2.destination = G3.source AND G3.destination = G4.source AND G4.destination = 'y';

I put in "s" and "y" as the source and destination of the path, and made sure that the destination of one step in the path was the source of the next step in the path. This is a combinatorial explosion, but it is easy to read and understand.

The sum of the weights is the cost of the path, which is easy to understand. The path_length calculation is a bit harder. This sum of CASE expressions looks at each node in the path. If it is unique within the row, it is assigned a value of one, if it is not unique within the row, it is assigned a value of zero.

All paths will have five steps in them because that is the way to table is declared. But what if a path exists between the two nodes which is shorter than five steps? That is where the self-traversal rows are used! Consecutive pairs of steps in the same row can be repetitions of the same node.

Here is what the rows of the Paths table look like after this INSERT INTO statement, ordered by descending path_length, and then by ascending cost.

Paths step1 step2 step3 step4 step5 total_cost path_length ====================================================== s s x x y 11 0 s s s x y 11 1 s x x x y 11 1 s x u x y 14 2 s s u v y 11 2 s s u x y 11 2 s s x v y 11 2 s s x y y 11 2 s u u v y 11 2 s u u x y 11 2 s u v v y 11 2 s u x x y 11 2 s x v v y 11 2 s x x v y 11 2 s x x y y 11 2 s x y y y 11 2 s x y v y 20 4 s x u v y 14 4 s u v y y 11 4 s u x v y 11 4 s u x y y 11 4 s x v y y 11 4

Clearly, all pairs of nodes could be picked from the original Graph table and the same INSERT INTO run on them with a minor change in the WHERE clause. However, this example is big enough for a short magazine article. And it is too big for most applications. It is safe to assume that people really want the cheapest path. In this example, the total_cost column defines the cost of an path, so we can eliminate some of the paths from the Paths table with this statement.

DELETE FROM Paths WHERE total_cost > (SELECT MIN(total_cost) FROM Paths);

Again, if you had all the paths for all possible pairs of nodes, the subquery expression would have a WHERE clause to correlate it to the subset of paths for each possible pair.

In this example, it got rid of 3 out of 22 possible paths. It is helpful and in some situations we might like having all the options. But these are not distinct options.

As one of many examples, the paths

(s, x, v, v, y, 11, 2)

and

(s, x, x, v, y, 11, 2)

are both really the same path, (s, x, v, y). Before we decide to write a statement to handle these equivalent rows, let's consider another cost factor. People do not like to change airplanes or trains. If they can go from Amsterdam to New York City on one plane without changing planes for the same cost, they are happy. This is where that path_length column comes in. It is a quick way to remove the paths that have more edges than they need to get the job done.

DELETE FROM Paths WHERE path_length > (SELECT MIN(path_length) FROM Paths);

In this case, that last DELETE FROM statement will reduce the table to one row: (s, s, x, x, y, 11, 0) which reduces to (s, x, y). This single remaining row is very convenient for my article, but if you look at the table, you will see that there was also a subset of equivalent rows that had higher path_length numbers.

(s, s, s, x, y, 11, 1) (s, x, x, x, y, 11, 1) (s, x, x, y, y, 11, 2) (s, x, y, y, y, 11, 2)

Your task is to write code to handle equivalent rows. Hint: the duplicate nodes will always be contigous across the row.

--

**Joe Celko** was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is *SQL for Smarties* (Morgan-Kaufmann Publishers, 1999). He is the Vice President of RDBMS at Northface University in Salt Lake City.

Contributors : Joe Celko

Last modified 2005-04-20 10:20 AM