Long-Term Data Retention Drives Database Archiving

A closer look at what it takes to successfully archive data from your operational databases and preserve it for the long haul.

Regular readers of this blog know that I’ve discussed database archiving here before. I have written about the market forces driving the need to archive, how archiving can improve the performance of operational databases, how legal and regulatory compliance impact database archiving, and even precisely defined the term "database archiving." If you are interested in any of those issues click on the links to reads those entries. But I've also been discussing this topic in one of my other blogs over at neonesoft.com, my Data Management Today Blog.

There are a lot of issues that need to be researched and explored in depth before you begin archiving database data. With that in mind, consider how you might approach archiving data from your production databases today. Oftentimes organizations believe they are archiving their data for long-term retention, but the “solutions” that are being employed to do so actually are falling short of the mark. This situation typically occurs when insufficient attention is paid to the query requirements against the archived data. Let’s examine some of these “solutions” in a bit more depth.

File and database backups are sometimes viewed as a solution to data archival. The idea proffered is that backups need to be done for recoverability anyway so why not use them for the archive? This is an incorrect assumption for several reasons. Why?

The first problem is that backup copies are not in a format that can be queried. In order to access the data it needs to be recovered back to a database format. Conducting such a recovery to enable queries against data no longer in the database is problematic, though. Do you recover the data directly into the existing operational database? This can cause data integrity problems because the backup data contains not just the data no longer in the database, but perhaps also data currently in the database but in a different state. In other words, although you might put the old data back in the database, you are likely also to corrupt existing data in the database.

OK, maybe you decide to create a "reference" database that looks exactly like the database from which the data was backed up. But will this work? Some DBMSs check internal identifiers to ensure that the recovery is being done to the same database objects. A new reference database, even if it looks exactly the same as the operational database, will be different -- at least in terms of how the DBMS views it. You may be able to manipulate the database structures or deploy utilities that operate at a lower level to get the recovery to work, but it will require additional work and effort to make it happen.

And what if you need to recover several backups? Do you create multiple reference databases or try to recover them all to a single reference database. Of course, the second option sounds better, doesn't it? But it might not be feasible. What if the database schema changed over the course of those backups? Maybe you added a column or changed a data type. Well, then, that recovery will most likely fail because the schema does not match.

Of course, you could use an unload utility to unload the data from the image copy backup (if that technology is available to you). But then you would have to go through that data to determine what is and what isn’t needed before you load it into the production database. And you still would have to jump the hurdle of dealing database schema changes since the backup was made. So you will have to consider that when re-loading the data from the backup. And is the data actually “the same” if it is loaded into a different schema definition? It might seem to be, but technically it is not – and that might be a problem from a legal perspective.

And think about the work involved when you run into an audit situation, where you are asked to produce transaction records spanning multiple years. This requires the very labor-intensive, manual process of slogging through the data and matching it against every database change that was made during the timeframe in question (which could be several decades). It might be (somewhat) feasible for time spans of 2 to 3 years perhaps, but not for archived data that spans 10, 20 or more years.

So there are many reasons why backups are not exactly useful for archival purposes. A similar approach to using backups is to use UNLOAD files. But, once again, this has all of the same problems as using image copy backups, with the additional problem that the UNLOAD files have to be created in addition to the backups.

Another approach might be to create a “reference” copy of the operational database to house the data. But, of course, we still have to deal with the issue of changing database schemata. Furthermore, if the data is maintained in a traditional database system then it is not protected from modification by database language commands such as SQL INSERT, UPDATE, and DELETE statements. And even if you use database security commands to prohibit modification this will not stop super users (e.g. SYSADM) from changing data.

Sometimes organizations choose just to leave the data resident in the operational database. This has all of the problems of the “reference” database approach, along with the additional problem of negatively impacting the performance of production applications and queries against the operational database.

These “solutions” are really no solution to the problem of archiving data for long-term retention. None of them are really useful in today’s environment. The proper solution, of course, is to set up archive policies for your operational data such that data is moved from the production databases to an archive data store that is explicitly designed for the long-term storage and protection of data. The process of doing so is called database archiving.

To elaborate on some of the things you need to consider in terms of database archiving I’m going to point you to several posts on my Data Management Today blog.

First up, in The Growth of Database Archiving, I discuss the rising awareness of the need to archive database data and ponders on why this is so.

Next up, in Long-Term Data Retention = Archiving, I look at the laws that dictate long-term data retention and examine how they are the primary drivers of database archiving, but not the only ones.

How to Define What to Archive covers the facts that database archiving requires significant up-front planning to succeed. This blog entry offers an overview of what you need to do to determine what to archive from your operational databases and systems.

In yet another entry, Direct Query From the Archive is Essential, I expound on the need to have ready query access to archived data. If we are archiving data then it stands to reason that it may be needed in the future. As such, data in an archive should be easily accessible and this blog posting talks about this necessity and its importance.

Another important aspect of database archiving that is often overlooked is maintaining the authenticity of data. In Authentic Data is Required for a Legal Archive I cover the need to ensure the on-going authenticity of data in your archives for the life of that data. If archived data is not authentic – that is, if it has been changed in any way – then it is of no use in a legal proceeding.

And Archive Administration is important, too. As with any important data store, the archive will require on-going care and feeding. Administering the archive is different, although similar, to typical database administration tasks. But administration cannot be ignored.

Part of archive administration is Archive Backup and Recovery and, of course, you will need to ensure that your database archives are recoverable. But the processes for backing up and recovering your archives are a bit different than standard database backup and recovery.

Data cannot live forever in the archive; when it is no longer required for regulatory and business purposes you have to get rid of it. In Discarding Archived Data I tackle the topic of discarding data from the archives.

And finally, the father of the data warehouse has been writing about database archiving, too. Here are links to his thoughts on the topic: Bill Inmon on Database Archiving.

I hope you take the time to read through the linked blog posts if you are interested in learning more about database archiving. And if you enjoy the material, take the time to check out my Data Management Today blog in addition to this one.

Sunday, May 06, 2007 | Permalink | Comments (0)

trackback URL: http://www.dbazine.com/blogs/blog-cm/craigmullins/blogentry.2007-05-05.0882314926/sbtrackback

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

Long-Term Data Retention Drives Database Archiving