|
Four Ways to Build a Data Warehouse
It has been said there are as many ways to build data warehouses as there are companies to build them.
It has been said there are as many ways to build data warehouses as there are companies to build them. Each data warehouse is unique because it must adapt to the needs of business users in different functional areas, whose companies face different business conditions and competitive pressures. Nonetheless, four major approaches to building a data warehousing environment exist. These architectures are generally referred to as 1) top-down 2) bottom-up 3) hybrid, and 4) federated. Most organizations—wittingly or not—follow one or another of these approaches as a blueprint for development. Although we have been building data warehouses since the early 1990s, there is still a great deal of confusion about the similarities and differences among these architectures. This is especially true of the “top-down” and “bottom-up” approaches, which have existed the longest and occupy the polar ends of the development spectrum. As a result, some companies fail to adopt a clear vision for the way the data warehousing environment can and should evolve. Others, paralyzed by confusion or fear of deviating from prescribed tenets for success, cling too rigidly to one approach or another, undermining their ability to respond flexibly to new or unexpected situations. Ideally, organizations need to borrow concepts and tactics from each approach to create environments that uniquely meets their needs. Semantic and Substantive Differences The two most influential approaches are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data warehousing field. Inmon, who is credited with coining the term “data warehousing” in the early 1990s, advocates a top-down approach, in which companies first build a data warehouse followed by data marts. Kimball’s approach, on the other hand, is often called bottom-up because it starts and ends with data marts, negating the need for a physical data warehouse altogether. On the surface, there is considerable friction between top-down and bottom-up approaches. But in reality, the differences are not as stark as they may appear. Both approaches advocate building a robust enterprise architecture that adapts easily to changing business needs and delivers a single version of the truth. In some cases, the differences are more semantic than substantive in nature. For example, both approaches collect data from source systems into a single data store, from which data marts are populated. But while “top-down” subscribers call this a data warehouse, “bottom-up” adherents often call this a “staging area.” Nonetheless, significant differences exist between the two approaches (see chart.) Data warehousing professionals need to understand the substantial, subtle, and semantic differences among the approaches and which industry “gurus” or consultants advocate each approach. This will provide a clearer understanding of the different routes to achieve data warehousing success and how to translate between the advice and rhetoric of the different approaches.
Top-Down Approach Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data before it can be moved and integrated within the data warehouse. A separate staging area is particularly useful if there are numerous source systems, large volumes of data, or small batch windows with which to extract data from source systems. The major benefit of a “top-down” approach is that it provides an integrated, flexible architecture to support downstream analytic data structures. First, this means the data warehouse provides a departure point for all data marts, enforcing consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in the warehouse lets organizations re-purpose that data in any number of ways to meet new and unexpected business needs. For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of the data. On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the marts before deploying their applications or reports. (Of course, depending on the size of an implementation, organizations can deploy all three “tiers” within a single database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their reports.
Bottom-Up Approach The “bottom-up” approach consciously tries to minimize back-office operations, preferring to focus an organization’s effort on developing dimensional designs that meet end-user requirements. The “bottom-up” staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject areas. Pros/Cons. The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data structures using dimensional, star schema models. It also delivers value rapidly because it doesn’t lay down a heavy infrastructure up front. Without an integration infrastructure, the bottom-up approach relies on a “dimensional bus” to ensure that data marts are logically integrated and stovepipe applications are avoided. To integrate data marts logically, organizations use “conformed” dimensions and facts when building new data marts. Thus, each new data mart is integrated with others within a logical enterprise dimensional model. Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users do not have to “drill through” from a data mart to another structure to obtain detailed or transaction data. The use of a staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data marts. One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and facts to ensure integration and deliver a single version of the truth. When data marts are logically arrayed within a single physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to ask departments and business units to adhere and reuse references and rules for calculating facts. There can be a tendency for organizations to create “independent” or non-integrated data marts. In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need to create additional data structures outside of the bottom-up architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved simply by pulling a subset of data from a data mart at night when users are not active on the system.
Hybrid Approach The hybrid approach recommends spending about two weeks developing an enterprise model in third normal form before developing the first data mart. The first several data marts are also designed in third normal form but deployed using star schema physical models. This dual modeling approach fleshes out the enterprise model without sacrificing the usability and query performance of a star schema. The hybrid approach relies on an extraction, transformation, and load (ETL) tool to store and manage the enterprise and local models in the data marts as well as synchronize the differences between them. This lets local groups, for example, develop their own definitions or rules for data elements that are derived from the enterprise model without sacrificing long-term integration. Organizations also use the ETL tool to extract and load data from source systems into the dimensional data marts at both the atomic and summary levels. Most ETL tools today can create summary tables on the fly. After deploying the first few “dependent” data marts, an organization then backfills a data warehouse behind the data marts, instantiating the “fleshed out” version of the enterprise data model. The organization then transfers atomic data from the data marts to the data warehouse and consolidates redundant data feeds, saving the organization time, money, and processing resources. Organizations typically backfill a data warehouse once business users request views of atomic data across multiple data marts. The major benefit of a hybrid approach is that it combines rapid development techniques within an enterprise architecture framework. It develops an enterprise data model iteratively and only develops a heavyweight infrastructure once it’s really needed (e.g. when executives start asking for reports that cross data mart boundaries.) However, backfilling a data warehouse can be a highly disruptive process that delivers no ostensible value and therefore may never be funded. In addition, few query tools can dynamically and intelligently query atomic data in one database (i.e. the data warehouse) and summary data in another database (i.e. the data marts.) Users may be confused when to query which database. This approach also relies heavily on an ETL tool to synchronize meta data between enterprise and local versions, develop aggregates, load detail data, and orchestrate the transition to a data warehousing infrastructure. Although ETL tools have matured considerably, they can never enforce adherence to architecture. The hybrid approach may make it too easy for local groups to stray irrevocably from the enterprise data model.
Federated Approach However, the federated approach—as defined by its most vocal proponent, Doug Hackney—is not a methodology or architecture per se, but a concession to the natural forces that undermine the best laid plans for deploying a perfect system. A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business conditions. In short, it’s a salve for the soul of the stressed out data warehousing project manager who must sacrifice architectural purity to meet the immediate (and ever-changing) needs of his business users. Hackney says the federated approach is “an architecture of architectures.” It recommends how to integrate a multiplicity of heterogeneous data warehouses, data marts, and packaged applications that companies have already deployed and will continue to implement in spite of the IT group’s best effort to enforce standards and adhere to a specific architecture. Hackney concedes that a federated architecture will never win awards for elegance or be drawn up on clean white boards as an “optimal solution.” He says it provides the “maximum amount of architecture possible in a given political and implementation reality.” The approach merely encourages organizations to share the “highest value” metrics, dimensions, and measures wherever possible, however possible. This may mean, for example, creating a common staging area to eliminate redundant data feeds or building a data warehouse that sources data from multiple data marts, data warehouses, or analytic applications. The major problem with the federated approach is that it is not well documented. There are only a few columns written on the subject. But perhaps this is enough, as it doesn’t prescribe a specific end-state or approach. Another potential problem is that without a specific architecture in mind, a federated approach can perpetuate the continued decentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating meta data is a pernicious problem in a heterogeneous, ever-changing environment.
Summary Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own way through the data warehousing thicket. Since each organization must respond to unique needs and business conditions, having a foundation of best practice models to start with augurs a successful outcome.
Recent articles by Wayne Eckerson
Wayne Eckerson -
Wayne W. Eckerson is the Director of Research for The Data Warehousing Institute (TDWI), the leading provider of high-quality, in-depth education and research to business intelligence and data warehousing professionals worldwide. Eckerson has a B.A. in American Studies from Williams College and a M.A.L.S. in literature from Wesleyan University. Eckerson lives and works in the coastal town of Hingham, Massachusetts with his wife and two children. Wayne can be reached at weckerson@tdwi.org. |
|||||||||||||||||||||||||||||||||||||||||