Tera-Scale Data Appliances for Business Intelligence
Published in TDAN.com, April 1, 2003
Business Intelligence (BI) is vital to the decision-making process within a company. Enterprises need to analyze ever-increasing amounts of data on a regular basis to better understand all aspects of their business. Existing BI infrastructures are unable to handle the sheer quantity of data being stored, accessed and analyzed. Appliances are devices designed for high performance, efficiency and ease of use. The concept of a database appliance has been debated in the past, but has gained new relevance for today's BI activities in an environment of explosive data growth. Given the wealth of third-party applications, the maturity of DBMSs, and inexpensive storage and processing power, the time for the application of tera-scale data appliances in BI is now.
In the current age of data warehousing, Business Intelligence (BI) can make or break a company. The timely processing and retrieval of vast amounts of data is vital to the decision-making process. However, just as important as timeliness is the depth of data analysis possible. With the growing size of the average data warehouse, achieving these goals has become increasingly difficult; already, terabyte-sized data warehouses are fairly common.
According to Greg’s Law, data is estimated to double on average every nine months. Vendors have so far coped with this rapid growth in database size through costly and constant upgrading of hardware and software, but over the past few years it has become clear that existing infrastructures cannot effectively handle the demands of in-depth analysis on large amounts of data. Furthermore, the Internet has brought a greater level of user access to databases. As demand for access and analysis continues to grow, users relying on general-purpose hardware and software will have to seek out solutions specifically designed to address this problem.
The challenge is to provide a purpose-built solution for the problem that is both specific and flexible. That is, it must be suited to the task of handling vast amounts of data and yet be compatible with the customer’s existing BI applications and infrastructure. Furthermore, such a solution should be relatively simple to put into place, in comparison with the highly complex (from a database administration point of view) systems currently available. Such systems are purpose-built appliances—expandable, affordable and uniquely suited to the ever-growing needs of users in terms of speed and sophistication of data analysis.
The Current State of Business Intelligence
The current BI infrastructure is a patchwork of hardware, software and storage that is growing ever more complex. Consider a typical BI solution:
Some systems are optimized for performance, but these optimizations have been performed in stages over time, and the underlying architecture has remained general in nature. Several database administration (DBA) tools and DBMS packages are in place; Symmetric Multiprocessing (SMP) servers and disk arrays from a variety of vendors serve the data; and an even larger selection of client applications sits on top of this warehouse behemoth. For example, a company may use an Oracle DBMS, an HP server and a storage solution from EMC, and as their system grows, they may add Hitachi storage and a second server. With these types of systems, data and user applications have to be continuously tuned and optimized.
Tera-scale databases that continue to grow steadily put tremendous strain on these systems. In addition, the queries run against the database grow more complex. Sophisticated analytical methods require complex queries and models; for example, Web log and customer segmentation analyses are taxing current database systems. The problem here is two-fold: first, the complex queries strain the system and slow the other queries being run. Second, if a business user is unable to get results in real-time, he is unlikely to try another query of equal or greater complexity. Therefore, the process behind obtaining useful information quickly becomes impaired.
Even in cases where the user base and data set are relatively stable, current BI systems often fail to meet their basic goal of delivering vital business information so that timely decisions may be made. From an administration standpoint, this current ‘patchwork’ of solutions is a nightmare. From the point of view of the business user, it is frustrating and does not provide the agility and performance the users are looking for. These strains occur because vendors have upgraded these systems incrementally over the years rather than change the underlying architecture to address the unique requirements of today’s tera-scale databases.
The issues with current BI architectures are evident across a broad range of companies and industries. While the system strain will become worse in the next few years, these problems exist today and are plaguing both business users and database administrators. Patchwork solutions can only hold together for so long; as database growth continues its exponential rise, the weak points in current systems are only going to become more aggravated. A new solution must be engineered now—that solution is a tera-scale data appliance that is purpose-built for BI.
The concept of a tera-scale data appliance is rooted in decades of academic and industry discussion about database appliances, or machines. A brief review of this history is helpful in understanding the evolution of the appliance, and its challenges over time.
Evolution of the Database Appliance (Machine)
In 1983, Haran Boral and David J. DeWitt wrote a paper hailing the end of the era of database machines (Boral and DeWitt, 1983). Almost twenty years ago, progress in the development of database machines was halted by the I/O bottleneck: I/O speeds grow slowly compared with the rate of growth of CPU speeds dictated by Moore’s Law, and this lag continues to this day. The database machines of the early 1980s attempted to sidestep this problem with custom hardware (memory and disks) that promised greater reliability and speed. However, the authors argued that database machines were doomed precisely because they were built around specialized hardware, which was expensive and difficult to maintain and integrate.
Ten years later, DeWitt, one of the authors of the original paper, published a second paper with Jim Gray claiming that parallel database systems were the future of database technology (DeWitt and Gray, 1992). Making reference to such successful database machine ventures as Teradata and Tandem, DeWitt pointed out that a system based on ‘conventional shared-nothing hardware,’ rather than specialized hardware, has the potential to be more robust and yield a higher level of performance.
In 1995, Kjell Bratbergsengen of the Norwegian Institute of Technology released a paper charting the rocky history of the database machine (Bratbergsengen, 1995). The author discussed previous attempts, including intelligent secondary storage devices, filters, associative memory systems, multiprocessor database computers and text processors. He claimed the area of most promise was in multiprocessor machines—and in fact, this is the direction most research and commercial ventures have taken.
Examples of database appliances/machines span both academic and commercial sectors, from commercial ventures such as Teradata and Tandem to research prototypes such as the GAMMA dataflow machine (DeWitt et al., 1986).
One of the challenges of the early attempts was the lack of powerful off-the-shelf components, including disk storage and memory, which would make a database appliance affordable. In addition, massively parallel processors were still at an early stage, and not powerful enough to handle tera-scale databases. Shared-nothing architectures were not sufficiently developed to help alleviate the disk I/O bottleneck. At the time that Bratbergsengen’s paper was written, disks were just reaching several gigabytes in size, and the transfer rates were only on the order of 3MB/sec. Today, although disk I/O is still a bottleneck for all systems, intelligent architectures have found ways of circumventing this problem.
What is an Appliance?
Webster defines an appliance as “an instrument or device designed for a particular use.” An appliance is an opaque drop-in solution that provides interfaces for all manner of tools without disturbing the inner workings of the appliance. An appliance is built with the end user’s point of view at the forefront of the design process. Appliances in any market come about as a result of the maturity of the technology. Introducing the appliance to a market is a logical next step, because appliances are efficient and affordable.
There are simple appliances we now take for granted, like the toaster: the consumer of toast is going to want a toasted piece of bread as quickly as possible, toasted to the level of his choice. Thus, the user is given a place to put the bread as well as a knob to control the level of toasting. Most important, a toaster can be put in place with an absolutely minimal amount of ‘configuration.’
Appliances in the computer world are so common that we often forget about them. Such an appliance is an integrated box that can retrieve information at the request of external applications and keeps its inner workings hidden in order to maintain simplicity and ease of use. Take, for example, the network router. The majority of routers can be put in place with almost no configuration (other than setting the router’s IP address) and will start storing and forwarding packets. Hubs and switches are even simpler. The point is that these devices aid network transport greatly, yet are essentially transparent to the user.
Another example is the video streaming appliance. These devices are put in place to enhance video stream quality and the speed of video delivery. The device is transparent to the end user, but the performance boost is evident. Most important, from an administration standpoint, the device is simple to install and configure, and requires a low level of maintenance.
These devices, network routers, switches and video streaming appliances among them, were developed to address particular problems and marketed as elegant solutions.
The concepts presented by these appliances were extended to the world of databases, and development of a database machine began. Academic and commercial research sought out solutions to the problems facing databases and proposed machines to handle these issues.
Simplicity is the name of the game—we would not expect our toaster owner (‘administrator’) to have to open up the toaster and tinker with it to add extra slots for toast or to make the toast crispier. Likewise, why should database administrators be required to fine-tune the database system as the requirements increase? Appliances make our lives simpler. Why can’t this analogy be carried into the database world?
The Case for a Tera-Scale Data Appliance for Business Intelligence
Applied to BI, a tera-scale data appliance is a purpose-built machine capable of retrieving valuable decision-aiding intelligence from terabytes of data on the order of seconds or minutes as opposed to hours or days. Appliances represent the difference between making a decision using stale data and making a decision with the freshest information possible. Tera-scale data appliances are engineered for the purpose of delivering results while the results are still relevant.
A tera-scale data appliance that is purpose-built for BI is defined by four characteristics: optimization, scalability, reliability and ease of use.
Optimization. Optimization affects both the storage and retrieval of data. A data appliance is engineered to deliver intelligence quickly and efficiently, no matter the database size. The appliance also allows for real-time updates to data, eliminating the delivery of stale data to the end user. The most important factors in BI are the timeliness and freshness of the results; they should be returned in a useful time frame, allowing a company to maximize their options. The appliance provides the real-time updates and retrievals critical to BI; such optimizations are done automatically by the appliance, without heavy DBA involvement.
Scalability. A tera-scale data appliance should be truly scalable. That is, the addition of extra storage to accommodate a larger data warehouse should not adversely affect performance. Specifically, the business users running queries against the data should not feel the effects of the growth. In order to accomplish this, the major bottleneck points must be distributed in the system rather than placed centrally. For large data transfers, bottlenecks are internal network speed and disk transfer speed; for complex queries, the bottleneck is often the CPU. An ideal data appliance should be able to scale to support a multi-terabyte-sized database without major performance degradation.
Reliability. Reliability is critical. One level of reliability comes from the inherent abstraction of an appliance. By keeping the inner workings from being modified by the users or administrators, the potential for failure decreases. Another level of reliability is provided by the homogeneous nature of an appliance; all parts of the system come from one vendor. The customer does not have to integrate disk arrays, operating systems, and database software, hoping that they will all work together flawlessly. Reliability increases as the number of vendors decreases, and multiple general-purpose offerings are replaced with a single solution.
Ease of Use. Obviously, we cannot do away with database administration entirely, as a certain level of management is necessary to maintain database integrity and performance. However, we can make the database administrator’s job much easier, particularly in the area of end-user software compatibility. By making the appliance compatible with all common database standards (ODBC, etc.) and putting it through rigorous testing, the appliance manufacturer can ensure that applications interoperate with the appliance. Thus, ongoing support issues can be minimized.
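The scalability point above, that the disk bottleneck must be distributed across the system rather than placed centrally, can be made concrete with a back-of-envelope model. The figures below are illustrative assumptions only, reusing the 3MB/sec transfer rate cited earlier for mid-1990s disks:

```python
# Back-of-envelope model: time to scan a warehouse table from a single
# disk versus scanning it in parallel across the nodes of a
# shared-nothing system. The figures (1 TB table, 3 MB/sec per-disk
# transfer rate, 100 nodes) are illustrative assumptions.

TB = 1024 ** 4          # bytes in a terabyte
MB = 1024 ** 2          # bytes in a megabyte

def scan_hours(table_bytes, mb_per_sec, nodes=1):
    """Hours to scan a table striped evenly across `nodes` disks."""
    per_node = table_bytes / nodes
    return per_node / (mb_per_sec * MB) / 3600

single = scan_hours(1 * TB, 3)           # one disk does all the work
parallel = scan_hours(1 * TB, 3, 100)    # 100 nodes share the scan

print(f"single disk: {single:.0f} hours")      # roughly four days
print(f"100 nodes:   {parallel * 60:.0f} minutes")
```

Under this idealized model the same scan drops from days to under an hour, which is the whole argument for pushing the bottleneck out to the nodes.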
Given the long history of database development and the existence of previous attempts at database appliances/machines, why is now the time for a tera-scale data appliance in BI?
There are several reasons that the appliance is now possible, but the most important of these is the maturity of database technology. The database standards have been set, and this allows the system to be built completely around the desires and needs of the end user. Furthermore, the concept of a relational database is well defined and the users are experienced and eager to run increasingly complex queries. A wide variety of sophisticated applications and tools with standard interfaces allow widespread access to the database. And, as noted earlier, terabyte-sized databases, an influx of users and a demand for complex queries have placed unprecedented strain on the existing patchwork infrastructure.
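The value of standard interfaces mentioned above can be sketched in code. The example below uses Python's standard DB-API pattern, with the built-in sqlite3 module standing in for an appliance's driver purely for illustration; ODBC drivers expose the same connect/cursor/execute shape, which is why client applications can work unchanged regardless of the backend:

```python
# Sketch of why standard interfaces matter: a BI client written against
# a standard call pattern works unchanged whether the backend is a
# conventional DBMS or an appliance. sqlite3 is a stand-in here; the
# table and data are hypothetical.
import sqlite3

def top_customers(conn, limit=3):
    """Client code sees only the standard interface, not the backend."""
    cur = conn.cursor()
    cur.execute(
        "SELECT name, revenue FROM customers "
        "ORDER BY revenue DESC LIMIT ?", (limit,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, revenue REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Acme", 120.0), ("Globex", 340.0), ("Initech", 75.0)])

print(top_customers(conn, 2))   # [('Globex', 340.0), ('Acme', 120.0)]
```

Swapping the backend means swapping the connection object; `top_customers` itself never changes.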
Users of BI and data warehousing, therefore, need a system that yields high performance, both in speed and storage. High-powered specialized hardware drove the database machines of the past, but now there is a need for better performance at a lower cost. The power of current technology is great enough that commercial, off-the-shelf components, which are dropping in price, can be used to construct a tera-scale data appliance. This appliance can provide valuable BI at a fraction of the cost of current industry database systems.
What is Today’s Tera-Scale Data Appliance for BI?
People often associate appliances with simplicity, and databases by nature are not simple. The high-performance, tera-scale data appliance is not mechanically simple; rather, it hides that complexity to make BI more useful to the end user. The appliance starts from a clean slate, addressing the problems and concerns of the end user and the issues raised by the growing size of databases. The tera-scale data appliance is clean, efficient, expandable and powerful.
A tera-scale data appliance integrates the hardware, DBMS and storage into one opaque device. It combines the best elements of SMP and Massively Parallel Processing (MPP) architectures into a new architecture to allow a query to be processed in the most optimized way possible. It is architected to remove all the bottlenecks to data flow so that the only remaining limit is the disk speed—a ‘data flow’ architecture where data moves at ‘streaming’ speeds. Through standard interfaces, it is fully compatible with existing BI applications, tools and data. And it is extremely simple to use.
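The ‘data flow’ notion above can be illustrated conceptually: rather than loading a table into memory and then filtering it, each record is filtered and aggregated as it streams past, so throughput is bounded only by how fast records arrive. The generator pipeline below is a toy sketch of the idea, not the appliance's actual mechanism, and the record source simply simulates a disk scan:

```python
# Conceptual sketch of a dataflow pipeline: records stream through a
# filter and into a running aggregate without ever being materialized
# as a whole in memory.
def scan():                       # simulates records streaming off disk
    for i in range(1_000_000):
        yield {"region": i % 4, "sales": i % 100}

def restrict(records, region):    # filter applied mid-stream
    return (r for r in records if r["region"] == region)

def aggregate(records):           # running sum; nothing is materialized
    return sum(r["sales"] for r in records)

total = aggregate(restrict(scan(), region=2))
print(total)
```

Because each stage consumes records lazily, memory use stays constant no matter how large the scanned table is, which is the property a streaming architecture is after.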
How Businesses Benefit from a Tera-Scale Data Appliance for Business Intelligence
A tera-scale data appliance for BI provides speed for the business user. The time of waiting hours or days for queries to finish is past. Patience may be a virtue, but when it comes to BI, decision makers need results now. The size of the average data warehouse is increasing and showing no signs of slowing down, and with this increased store of knowledge comes an increased demand for BI. Businesses should not need to discard customer data from two months ago because their database slows to a crawl when the data is kept.
A tera-scale data appliance for BI provides freedom to the business user. Right now, users are limited in the queries they can run because of the time required to run them. Thus, users end up running the same set of queries against the database. With the time required to run a complex query reduced to seconds, users can not only run their old queries more often, but they have the time to devise and run whole new sets of queries.
A tera-scale data appliance provides simplicity for the administrator. The integrated nature of an appliance means that the time typically spent troubleshooting a complex database system can be spent in more productive endeavors. The goal is not to simplify a complex system, but rather to remove the appearance of complexity by abstracting away the mechanical details. The end result is the removal of legacy systems and piecemeal components.
A tera-scale data appliance provides ease of database growth. The inherent scalability in a modified-MPP architecture stems from the modularity of the nodes. Ideally, we want a database with linear scaleup (DeWitt and Gray, 1992); that is, with n times the hardware, we should be able to handle a task n times as large in the same amount of time. The tera-scale data appliance provides us just that flexibility.
A tera-scale data appliance provides the lowest total cost of ownership. Because it is purpose-built from commodity components, it avoids the overhead of special-purpose hardware. The appliance has one source, one vendor, and therefore the costs associated with support are reduced. With existing technologies, data growth typically incurs costs: hardware must be added and ongoing maintenance must be performed. The tera-scale data appliance reduces these costs with inexpensive yet powerful hardware from one source.
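The linear scaleup cited above (DeWitt and Gray, 1992) can be sketched under an idealized cost model in which work is perfectly parallelizable with no coordination overhead; that is an assumption of the sketch, not a property of any real system:

```python
# Sketch of the scaleup metric: elapsed time of a small job on a small
# system divided by elapsed time of an n-times-larger job on an
# n-times-larger system. Linear scaleup means the ratio stays at 1.0.
def elapsed(work_units, nodes):
    """Idealized: perfectly parallelizable work, no coordination cost."""
    return work_units / nodes

def scaleup(base_work, n):
    small = elapsed(base_work, 1)          # small job, small system
    large = elapsed(base_work * n, n)      # n-times job, n-times system
    return small / large

for n in (1, 10, 100):
    print(n, scaleup(1_000_000, n))        # stays at 1.0 in the ideal case
```

Real systems fall short of 1.0 as coordination costs grow; the appliance argument is that a modular, shared-nothing design keeps those costs small enough to stay close to linear.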
With the simple, efficient solution provided by a tera-scale data appliance for BI, businesses will run more efficiently. Results will be returned within seconds or minutes—orders of magnitude faster than with current architectures. Businesses today demand rapid response times to generate rapid results.
The success of decision-making in a company relies on Business Intelligence. BI, in turn, relies on the underlying database architecture. Current database architectures are patchwork systems, built in pieces and not optimized for delivering timely results. The maturity and stability of the relational database, paired with the power of commodity computer components, makes it possible to rethink the database system from the ground up. Starting with a clean slate, the next-generation database system should be engineered with the end user in mind. The system should be clean, scalable and enable optimized BI. A new generation of tera-scale data appliances holds promise for companies that depend on Business Intelligence.
References
Boral, H. and D. J. DeWitt, ‘Database Machines: An Idea Whose Time Has Passed? A Critique of the Future of Database Machines,’ Proceedings of the 1983 Workshop on Database Machines, Springer-Verlag (1983), 166-187.
Bratbergsengen, K., ‘Parallel Database Machines,’ Rivista di Informatica, Vol. XXV, No. 4 (October-December 1995).
DeWitt, D. J. and J. Gray, ‘Parallel Database Systems: The Future of High Performance Database Processing,’ Communications of the ACM, Vol. 35, No. 6 (June 1992), 85-98.
DeWitt, D. J., R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar and M. Muralikrishna, ‘GAMMA: A High Performance Dataflow Database Machine,’ Proceedings of the 1986 VLDB Conference, Japan (August 1986), 228-237.
DeWitt, D. J., S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H. Hsiao and R. Rasmussen, ‘The Gamma Database Machine Project,’ IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1 (March 1990), 44-62.
Sood, A. K. and A. H. Qureshi (Eds.), ‘Database Machines: Modern Trends and Applications,’ NATO ASI Series F: Computer and Systems Sciences, Vol. 24, Springer-Verlag (1986).
Stonebraker, M., ‘The Case for Shared Nothing,’ Database Engineering, Vol. 9, No. 1 (1986), 4-9.
Foster D. Hinshaw
Foster D. Hinshaw, CTO and co-founder of Netezza Corporation, is accomplished in designing and developing large complex systems for business-critical enterprise and departmental applications, as well as Web-based e-commerce systems. Previously, he provided Internet and Y2K consulting services to marquee clients including Staples. He also served as a consultant and Y2K Practice Manager at Keane, Inc., a leading software consulting company.
Prior to his consultancy experience, Hinshaw held management positions with VideoGuide, a developer of the leading on-screen TV guide, as well as hardware and software development positions at the Department of Environmental Protection, Stone Associates, Design Marketing and Maplewood Enterprises.
He earned a BS and MS in Electrical Engineering from Cornell University and an MBA from Harvard University.