TDAN: The Data Administration Newsletter, Since 1997

THE DATA ADMINISTRATION NEWSLETTER – TDAN.com
ROBERT S. SEINER – PUBLISHER

Database Archiving for Long-Term Data Retention

by Craig S. Mullins
Published: October 1, 2006

Published in TDAN.com October 2006

Organizations are generating and keeping more data now than at any time in history. There are many reasons for this. First of all, the amount of data in general is growing. According to industry analysts, enterprise databases are growing at the rate of 125% annually. Even more interesting is that as much as 80% of the information in those databases is not actively used (in other words, it is ready for archiving).

But why are we producing so much data? True, advances in technology have improved our ability to capture and store data. But technology alone is not sufficient to account for the current rate of data growth.

Data may need to be retained for both internal and external reasons. Internal reasons are driven by company needs. If an organization requires the data to conduct business and make money, then that data will be retained. Today's organizations are storing more data for longer periods of time for many internal reasons. Typically, data is stored longer than it used to be so that analytical processes can be run against it. Data warehousing, data mining, OLAP, and similar technologies have delivered more and better techniques for extracting information out of data. So businesses are inclined to keep the data around for longer periods of time.

But external reasons, typically driven by the mandate to comply with legal and governmental regulations, are another significant factor driving the need to store more data.

Legal Requirements to Archive

The corporate accounting scandals of the past few years have caused an onslaught of new laws to be written. These laws place regulations on how businesses are to treat their sensitive, business-critical data. Additionally, older laws that have been on the books are being enforced more rigorously than in the past. Basically, government regulations are being adopted to ensure that corporations are "doing the right thing" with their data. One of the things that is being mandated by these regulations is longer data retention periods.

Indeed, the number one driver of data management initiatives is likely to be government regulations. The growing number of regulations and the need for organizations to be in compliance is driving data retention. The Sarbanes-Oxley Act, HIPAA, and Basel II are some of the laws and regulations governing how long data must be retained. Moreover, industry analysts have estimated that there are over 150 federal and state laws that dictate how long data must be retained.

Many of these laws greatly expand the duration over which data must be retained. Until recently most organizations dealt with mandatory retention periods of only a few years for important business data. And this data was kept around longer because of business reasons, not legal requirements. But the situation has changed due to the bevy of new regulations at the federal, state, and local levels. Depending on the industry, what were once five- or seven-year retention periods are now expanding to 20, 30, or even 70 years. Today, retention periods are determined almost exclusively by government regulations and not by business needs.

To comply with these laws corporations must re-evaluate their established methods and policies for managing and retaining data. What worked in the past to retain data for a few years will no longer be sufficient over a much longer period.

Perhaps the most significant piece of legislation impacting data governance is the Sarbanes-Oxley Act. Section 802 of this act defines penalties for altering or deleting important business data and documents. Additionally, this legislation supports the records preservation rule defined under the Securities Exchange Act of 1934 (Rule 240.17a-4). This means that electronic storage media must preserve the records in a non-rewritable, non-erasable format. Clearly, Sarbanes-Oxley requires organizations to implement a robust data retention solution. But, of course, Sarbanes-Oxley is not the only legislation driving data retention requirements.

According to research conducted by Enterprise Strategy Group - in its report titled "Digital Archiving: End-User Survey & Market Forecast 2006-2010" - digital archive capacity will increase roughly tenfold between 2005 and 2010. Total worldwide digital archive capacity in the commercial and government sectors will grow from about 2,500 petabytes in 2005 to more than 27,000 petabytes by 2010. The report states that the major factors driving this growth will be regulatory compliance, corporate governance, litigation support, records management, and data management initiatives.
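To put the forecast in annual terms, the short calculation below derives the compound annual growth rate implied by the two capacity figures quoted above. The function name is illustrative, and the assumption of smooth compound growth over the five years is ours, not the report's.

```python
def implied_annual_growth(start, end, years):
    """Return the compound annual growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / years) - 1


# Capacity figures quoted from the forecast, in petabytes (2005 and 2010).
rate = implied_annual_growth(2500, 27000, 5)
print(f"Implied growth: {rate:.0%} per year")
```

Under that assumption, archive capacity would need to grow by roughly 61% each year to reach the 2010 figure.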

Clearly, organizations will be retaining more data over longer periods of time. And this will create the need for new policies, procedures, methodologies, and software to support storage, management and access of archived data.

The Lifecycle of Data

So how can we determine when data needs to be archived? In order to accurately answer that question we need to understand the different states of data as it progresses through its lifespan.

The diagram in Figure 1 delineates the various states of data over its useful life. Data is created at some point, usually by means of a transaction: a product is released, an order is processed, a deposit is made, etc. For a period of time after creation, the data is in its first state: it is operational. That is, the data is needed to complete ongoing business transactions. This is where it serves its primary business purpose. Transactions are enacted upon data in this state.

Figure 1. The Lifecycle of Data

The operational state is followed by the reference state. This is the time during which the data is still needed for reporting and query purposes. The data might be used to produce internal reports or external statements, or it might simply be kept in case a customer asks for it.

Then, after some additional period of time, the data moves into an area where it is no longer needed for completing business transactions and the chance of it being needed for querying and reporting is small to none. However, the data still needs to be saved for regulatory compliance and other legal purposes, particularly if it pertains to a financial transaction. This is the archive state. It is the requirements for data in this state which this white paper addresses.

Finally, after a designated period of time in the archive, the data is no longer needed at all and it can be discarded. This actually should be emphasized much more strongly: the data must be discarded. In most cases the only reason older data is being kept at all is to comply with regulations, many of which help to enable lawsuits. When there is no legal requirement to maintain such data, it is only right and proper for organizations to demand that it be destroyed - why enable anyone to sue you if it is not a legal requirement to do so?
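The four states described above can be sketched as a simple classification. This is only an illustration: the thresholds are hypothetical policy values, and using the age of a record as the sole criterion is a simplifying assumption. In practice the boundaries depend on how the business uses the data and on the regulations that apply to it.

```python
from datetime import date
from enum import Enum


class DataState(Enum):
    OPERATIONAL = "operational"
    REFERENCE = "reference"
    ARCHIVE = "archive"
    DISCARD = "discard"


# Hypothetical policy values, in days since the record was created.
OPERATIONAL_DAYS = 90        # still needed to complete transactions
REFERENCE_DAYS = 2 * 365     # still needed for queries and reports
RETENTION_DAYS = 7 * 365     # must be kept for compliance, then destroyed


def classify(created: date, today: date) -> DataState:
    """Return the lifecycle state of a record, based only on its age."""
    age = (today - created).days
    if age < OPERATIONAL_DAYS:
        return DataState.OPERATIONAL
    if age < REFERENCE_DAYS:
        return DataState.REFERENCE
    if age < RETENTION_DAYS:
        return DataState.ARCHIVE
    return DataState.DISCARD
```

Note that the final state is an explicit outcome rather than an afterthought: once the retention period has passed, the record is flagged for destruction, not left in the archive indefinitely.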

Don't think in terms of databases or technologies that you already know when considering these data states. The data could be in three separate databases, a single database, or any combination thereof. Furthermore, don't think about data warehousing in this context - here we are talking about the single, official store of data - and its production lifecycle.

From here on out we will use the terms introduced here for the various states of data throughout its lifecycle, with the emphasis being on archiving database data and the issues arising from doing so.

What is Database Archiving?

Database Archiving is part of a larger topic, namely Data Archiving. Data exists in many formats and for many purposes, and only a small percentage of it is actually in a database. Physical documents, electronic documents, computer files and data sets, e-mail, and multimedia files are all examples of data that may reasonably need to be archived at some point. Refer to Figure 2. Each of these "things" needs to be archived to fulfill regulatory, legal, and business requirements.

But each type of data has different archival processing requirements due to its form and nature. What works to archive e-mail is not sufficient for archiving database data, and so on. In other words, each type of data may need its own technology. This is most certainly true for database data. Why?

Well, data stored in a database is different from other types of data in many ways. The main advantage of using a DBMS is to impose a logical, structured organization on the data. A DBMS provides a layer of independence between the data and the applications that use the data. In other words, applications are insulated from how data is structured and stored. The interface to the data is through the DBMS data language, whether it is SQL for relational databases, DL/I for IMS, or even XQuery for XML databases. So the archival of data from a database requires knowledge of, and operation in conjunction with, the mechanisms and interfaces of the DBMS.

Figure 2. All Types of Data Need to be Archived

OK, if we now accept that database archiving is a subset of data archiving, let's define exactly what we mean by the term. Database Archiving is the process of removing selected data records that are not expected to be referenced again from operational databases and storing them in an archive data store where they can be retrieved if needed.

Let's examine each of the major components of that last sentence. We say removing because the data is deleted from the operational database when it is moved to the data archive. Recall our earlier discussion of the data lifecycle. When data moves into the archive state, query and access are no longer expected to be required.

Next, we say selected records. This is important because we do not want to archive database data at the file level. We need only those specific pieces of data that are no longer needed for operational and reference purposes by the business. This means that the archive needs to be able to selectively choose particular pieces of related data for archival... not the whole database, not an entire table or segment, and not even a specific row. Instead, all of the data that represents a business object is archived at the same time. For example, if we choose to archive order data, we would also want to archive the specifics about each item on that order. This data likely spans multiple constructs within the database (tables for DB2 or Oracle; segments and/or databases for IMS).

The next interesting piece of the definition is this: and storing them (the data) in an archive data store. This implies that the data is stored separately from the operational database and does not require either the DBMS or the operational applications any longer. Archived data is separate and independent from the production systems from which it was moved.
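The removal and storage steps described so far can be sketched in a few lines. This is a minimal illustration, not a production design: the `orders` and `order_items` tables, the cutoff date, and the use of a second SQLite database as the archive data store are all assumptions made for the example. A real archive store would be independent of the operational DBMS, as noted above.

```python
import sqlite3

# Hypothetical schema for one business object: an order and its line items.
SCHEMA = """
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE order_items (order_id INTEGER, line_no INTEGER, sku TEXT);
"""


def archive_orders(operational, archive, cutoff):
    """Move every order dated before `cutoff`, with its items, to the archive.

    Dates are ISO strings (YYYY-MM-DD), so string comparison orders them.
    Returns the number of orders moved.
    """
    orders = operational.execute(
        "SELECT order_id, order_date FROM orders WHERE order_date < ?",
        (cutoff,),
    ).fetchall()
    moved = 0
    for order_id, order_date in orders:
        items = operational.execute(
            "SELECT order_id, line_no, sku FROM order_items WHERE order_id = ?",
            (order_id,),
        ).fetchall()
        # Write the whole business object to the archive first ...
        archive.execute("INSERT INTO orders VALUES (?, ?)", (order_id, order_date))
        archive.executemany("INSERT INTO order_items VALUES (?, ?, ?)", items)
        archive.commit()
        # ... and only then remove it from the operational database.
        operational.execute("DELETE FROM order_items WHERE order_id = ?", (order_id,))
        operational.execute("DELETE FROM orders WHERE order_id = ?", (order_id,))
        operational.commit()
        moved += 1
    return moved
```

The order and its items travel together, so the archive never holds half of a business object. Committing to the archive before deleting from the operational database means a failure between the two steps leaves a duplicate rather than a loss.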

The final component of the definition that warrants clarification is... where they can be retrieved if needed. The whole purpose of archiving is to maintain the data in case it is required for some purpose. The purpose may be external, in the form of a lawsuit or to support a governmental regulation; or the purpose may be internal, in the form of a new business practice or requirement. At any rate, the data needs to be readily accessible in a reasonable timeframe without requiring a lot of manual manipulation. I mean, let's face it, anyone can archive data if they don't have to worry about how to query it later, right?

So, what do you think? Does your organization have the technology and resources at your disposal to archive your database data in accordance with legal requirements?

Craig S. Mullins

Craig S. Mullins is a data management strategist and an executive with NEON Enterprise Software, Inc. (www.neonesoft.com). Craig has extensive experience in the field of database management, having worked as an application developer, a DBA, and an instructor with multiple database management systems, including working with DB2 for z/OS since Version 1. Craig is also an IBM Gold Consultant and is the author of two books:

  • DB2 Developer's Guide and
  • Database Administration: The Complete Guide to Practices and Procedures.

You can contact Craig via his web site at www.craigsmullins.com