Skip to content

DBAzine.com

Sections
Personal tools
You are here: Home » Of Interest » Articles of Interest » Calculated Columns
Seeking new owner for this high-traffic DBAzine.com site.
Tap into the potential of this DBA community to expand your business! Interested? Contact us today.
Who Are You?
I am a:
Mainframe True Believer
Distributed Fast-tracker

[ Results | Polls ]
Votes : 3558
 

Calculated Columns

by Joe Celko

Introduction

You are not supposed to put a calculated column in a table in a pure SQL database. And as the guardian of pure SQL, I should oppose this practice. Too bad the real world is not as nice as the theoretical world.

There are many types of calculated columns. The first are columns which derive their values from outside the database itself. The most common examples are timestamps, user identifiers, and other values generated by the system or the application program. This type of calculated column is fine and presents no problems for the database.

The second type is values calculated from columns in the same row. In the days when we used punch cards, you would take a deck of cards, run them thru a machine that would do the multiplications and addition, then punch the results in the right hand side of the cards. For example, the total cost of a line in an order could be described as price times quantity.

The reason for this calculation was simple; the machines that processed punch cards had no secondary storage, so the data had to be kept on the cards themselves. There is truly no reason for doing this today; it is much faster to re-calculate the data than it is to read the results from secondary storage.

The third type of calculated data uses data in the same table, but not always in the same row in which it will appear. The fourth type uses data in the same database.

These last two types are used when the cost of the calculation is higher than the cost of a simple read. In particular, data warehouses love to have this type of data in them to save time.

When and how you do something is important in SQL. Here is an example, based on a thread in a SQL Server discussion group. I am changing the table around a bit, and not telling you the names of the guilty parties involved, but the idea still holds. You are given a table that look like this and you need to calculate a column based on the value in another row of the same table.

     CREATE TABLE StockHistory
  (stock_id CHAR(5) NOT NULL,
  sale_date DATE NOT NULL DEFAULT CURRENT_DATE,
  price DECIMAL (10,4) NOT NULL,
  trend INTEGER NOT NULL DEFAULT 0
        CHECK(trend IN(-1, 0, 1))
  PRIMARY KEY (stock_id, sale_date));

It records the final selling price of many different stocks. The trend column is +1 if the price increased from the last reported selling price, 0 if it stayed the same and -1 if it dropped in price. The trend column is the problem, not because it is hard to compute, but because it can be done several different ways. Let's look at the methods for doing this calculation.

Triggers

You can write a trigger which will fire after the new row is inserted. While there is an ISO Standard SQL/PSM language for writing triggers, the truth is that every vendor has a proprietary trigger language and they are not compatible. In fact, you will find many different features from product to product and totally different underlying data models. If you decide to use triggers, you will be using proprietary, non-relational code and have to deal with several problems.

One problem is what a trigger does with a bulk insertion. Given this statement which inserts two rows at the same time:

      INSERT INTO StockHistory (stock_id, sale_date, price)
  VALUES ('XXX', '2000-04-01',  10.75),
        ('XXX', '2000-04-03', 200.00);

Trend will be set to zero in both of these new rows using the DEFAULT clause. But can the trigger see these rows and figure out that the 2000 April 03 row should have a +1 trend or not? Maybe or maybe not, because the new rows are not always committed before the trigger is fired. Also, what should that status of the 2000 April 01 row be? That depends on an already existing row in the table.

But assume that the trigger worked corectly. Now, what if you get this statement?

      INSERT INTO StockHistory (stock_id, sale_date, price)
VALUES ('XXX', '2000-04-02', 313.25);

Did your trigger change the trend in the 2000 April 03 row or not? If I drop a row, does your trigger change the trend in the affected rows? Probably not.

As an exercise, write some trigger code for this problem.

INSERT INTO Statement

I admit I am showing off a bit, but here is one way of inserting data one row at a time. Let me put the statement into a stored procedure.

      CREATE PROCEDURE NewStockSale
  (new_stock_id CHAR(5) NOT NULL,
  new_sale_date DATE NOT NULL DEFAULT CURRENT_DATE,
  new_price DECIMAL (10,4) NOT NULL)
AS INSERT INTO
   StockHistory (stock_id, sale_date, price, trend)
   VALUES (new_stock_id, new_sale_date, new_price,
           SIGN(new_price -
                (SELECT H1.price
                   FROM StockHistory AS H1
                  WHERE H1.stock_id = StockHistory.stock_id
                    AND H1.sale_date =
                        (SELECT MAX(sale_date)
                           FROM StockHistory AS H2
                          WHERE H2.stock_id = H1.stock_id
                            AND H2.sale_date < H1.sale_date)
                  ))) AS trend
           );

This is not as bad as you first think. The innermost subquery finds the sale just before the current sale, then returns its price. If the old price minus the new price is positive negative or zero, the SIGN() function can computer the value of TREND. Yes, I was showing off a little bit with this query.

The problem with this is much the same as the triggers. What if I delete a row or add a new row between two existing rows? This statement will not do a thing about changing the other rows.

But there is another problem; this stored procedure is good for only one row at a time. That would mean that at the end of the business day, I would have to write a loop that put one row at a time into the StockHistory table.

Your next exercise is to improve this stored procedure.

UPDATE the Table

You already have a default value of 0 in the trend column, so you could just write an UPDATE statement based on the same logic we have been using.

      UPDATE StockHistory
   SET trend
       = SIGN(price -
             (SELECT H1.price
                FROM StockHistory AS H1
               WHERE H1.stock_id = StockHistory.stock_id
                 AND H1.sale_date =
                     (SELECT MAX(sale_date)
                        FROM StockHistory AS H2
                       WHERE H2.stock_id = H1.stock_id
                         AND H2.sale_date < H1.sale_date)));

While this statement does the job, it will re-calculate trend column for the entire table. What if we only looked at the columns that had a zero? Better yet, what if we made the trend column NULL-able and used the NULLs as a way to locate the rows that need the updates?

      UPDATE StockHistory
   SET trend = ...
  WHERE trend IS NULL;

But this does not solve the problem of inserting a row between two existing dates. Fixing that problem is your third exercise.

Use a VIEW

This approach will involve getting rid of the trend column in the StockHistory table and creating a VIEW on the remaining columns:

       CREATE TABLE StockHistory
      (stock_id CHAR(5) NOT NULL,
       sale_date DATE NOT NULL DEFAULT CURRENT_DATE,
       price DECIMAL (10,4) NOT NULL,
       PRIMARY KEY (stock_id, sale_date));

      CREATE VIEW StockTrends (stock_id, sale_date, price, trend)
      AS SELECT H1.stock_id, H1.sale_date, H1.price,
                 SIGN(MAX(H2.price) - H1.price)
           FROM StockHistory AS H1 StockHistory AS H2
          WHERE H1.stock_id = H2.stock_id
            AND H2.sale_date < H1.sale_date
          GROUP BY H1.stock_id, H1.sale_date, H1.price;

This approach will handle the insertion and deletion of any number of rows, in any order. The trend column will be computed from the existing data each time. The primary key is also a covering index for the query, which helps performance. A covering index is one which contains all of the columns used the WHERE clause of a query.

The major objection to this approach is that the VIEW can be slow to build each time, if StockHistory is a large table.

I will send a free book to the reader who submits the best answers top these exercises.you can contact me at my email address or you can go to my website at http://www.celko.com.

---

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is SQL for Smarties (Morgan-Kaufmann Publishers, 1999). He is the Vice President of RDBMS at North Face Learning in Salt Lake City.


Contributors : Joe Celko
Last modified 2005-04-20 10:28 AM
Transaction Management
Reduce downtime and increase repeat sales by improving end-user experience.
Free White Paper
Database Recovery
Feeling the increased demands on data protection and storage requirements?
Download Free Report!
 
 

Powered by Plone