Selection

by Joe Celko

Part 3 in a series

Introduction

Continuing the look at basic relational operators and SQL, we get to an operation with an unfortunate name: Selection. Selection removes rows from a table which do not pass a search condition. It is the counterpart of Projection, which removes columns from tables.

The reason the name is unfortunate is that SQL uses the keyword, SELECT, for the clause in a query that matches to the Projection operator and the keyword where for the clause in a query that matches to the selection operator. It is a little confusing, but just wait: things will get worse.

The search conditions are logical predicates -- things that return TRUE, FALSE or UNKNOWN. But wait a minute, most programming languages work with Boolean logic and have only TRUE and FALSE logical values. SQL and Codd's first relational model have a thing called a NULL and it makes things ... interesting.

The Null of It All

A NULL is not a value; it is a marker for a value that is missing. SQL does not know why the value is missing -- semantics is your job. But SQL does have syntax to handle NULLs. The basic rules are:

NULLs propagate in calculations. That makes sense; if I don't know what something is, then why would I know what a calculation done with it is?
NULLs return an UNKNOWN value in a logical expression. In fact, even (NULL = NULL) is UNKNOWN. Again, this makes sense. How can you tell one unknown
NULLs group together. This property has nothing to do with simple search conditions, so don't worry about it for now; I will cover this point in another article on the GROUP BY clause later.

All of the SQL datatypes can use the basic comparison operators like equal (=), greater than (>), less than (<), not less than (>=), not greater than (<=) and not equal (<>). With the exception of the rules for NULLs, they behave pretty much as in every other programming language.

The logical operators are also familiar looking. They are AND, OR and NOT, and they are found in pretty much every other programming language. The gimmick is that these are three valued logical operators and not two valued ones.

The UNKNOWN value results from using NULLs in comparisons and other predicates, but UNKNOWN is a logical value and not the same as a NULL, which is a data value.

x NOT 
==================
TRUE FALSE
UNK UNK
FALSE TRUE

AND | TRUE UNK FALSE
=============================
TRUE | TRUE UNK FALSE
UNK | UNK UNK FALSE
FALSE | FALSE FALSE FALSE

OR | TRUE UNK FALSE
============================
TRUE | TRUE TRUE TRUE
UNK | TRUE UNK UNK
FALSE | TRUE UNK FALSE

There is anther predicate of the form (x IS [NOT] NULL) in SQL that exits because you cannot use (x = NULL) to test for a NULL value. Almost all other predicates in SQL resolve themselves to chains of these three operators.

In the WHERE clause, the rows that test FALSE or UNKNOWN are removed from the table. Now, you are probably thinking that if we are going to treat FALSE and UNKNOWN alike, then why go to all the trouble to define a three-valued logic in the first place?

Defining a Three-valued Logic

SQL has three sub-languages: DML, DDL, and DCL. The Data Control Language (DCL) controls user access to the database and does not use predicates. In the Data Manipulation Language (DML), users can ask queries (SELECT statements) or change the data (INSERT INTO, UPDATE, and DELETE FROM statements). The Data Declaration Language (DDL) is where administrators control the schema objects like tables, views, stored procedures and so forth. The FALSE and UNKNOWN remove rows from the results of a query in the DML. In the DDL, a TRUE or UNKNOWN test result in a CHECK() constraint will preserve a row -- give it the benefit of the doubt, so to speak. Otherwise, no column could be NULL-able.

Wonder Shorthands

SQL also came up with some wonder "shorthands" that improve the readability of the code. The logical operator "x BETWEEN y AND z" means "((y <= x) AND (x <= z))" -- note the order of comparison and the inclusion of the endpoints of the range. Likewise, "x IN (a,b,c,..)" expands out to "((x = a) OR (x = b) OR (x = c) OR ...)" at run time.

Most SQL engines are pretty good about optimizing the predicates and not that good about optimizing calculations. For example, the engine might not change (x + 0) or (x * 1) to (x) when they are compiling the code. This means that you need to write very clear logical expression with the simplest calculations in SQL.

Procedural languages like Fortran or Pascal are very good about optimizing calculations, which only makes sense because all they do is calculations! But SQL is a data retrieval language and the goal is to get back the right set of data as fast as possible from the secondary storage. Calculations are done at the speed of electricity, while data is retrievedby mechanical disk reads. The biggest improvements come from faster retrieval methods, not improved calculations.

The next article in this series will deal with the "fancy stuff," when we get to Selection, Part II.

---

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is SQL for Smarties (Morgan-Kaufmann Publishers, 1999). He is the Vice President of RDBMS at North Face Learning in Salt Lake City.

Contributors : Joe Celko
Last modified 2005-04-20 10:26 AM

DBAzine.com

Sections

Personal tools

Menu