Business Intelligence Best Practices - BI-BestPractices.com

Collaboration. Communication. Community.

 
 
 Printer-friendly
 E-mail to friend
  Comments
ADVERTISEMENT
Four Ways to Build a Data Warehouse

by Wayne Eckerson
It has been said there are as many ways to build data warehouses as there are companies to build them.

It has been said there are as many ways to build data warehouses as there are companies to build them. Each data warehouse is unique because it must adapt to the needs of business users in different functional areas, whose companies face different business conditions and competitive pressures.

Nonetheless, four major approaches to building a data warehousing environment exist. These architectures are generally referred to as 1) top-down 2) bottom-up 3) hybrid, and 4) federated. Most organizations—wittingly or not—follow one or another of these approaches as a blueprint for development.

Although we have been building data warehouses since the early 1990s, there is still a great deal of confusion about the similarities and differences among these architectures. This is especially true of the “top-down” and “bottom-up” approaches, which have existed the longest and occupy the polar ends of the development spectrum.

As a result, some companies fail to adopt a clear vision for the way the data warehousing environment can and should evolve. Others, paralyzed by confusion or fear of deviating from prescribed tenets for success, cling too rigidly to one approach or another, undermining their ability to respond flexibly to new or unexpected situations. Ideally, organizations need to borrow concepts and tactics from each approach to create environments that uniquely meets their needs.

Semantic and Substantive Differences The two most influential approaches are championed by industry heavyweights Bill Inmon and Ralph Kimball, both prolific authors and consultants in the data warehousing field. Inmon, who is credited with coining the term “data warehousing” in the early 1990s, advocates a top-down approach, in which companies first build a data warehouse followed by data marts. Kimball’s approach, on the other hand, is often called bottom-up because it starts and ends with data marts, negating the need for a physical data warehouse altogether.

On the surface, there is considerable friction between top-down and bottom-up approaches. But in reality, the differences are not as stark as they may appear. Both approaches advocate building a robust enterprise architecture that adapts easily to changing business needs and delivers a single version of the truth. In some cases, the differences are more semantic than substantive in nature. For example, both approaches collect data from source systems into a single data store, from which data marts are populated. But while “top-down” subscribers call this a data warehouse, “bottom-up” adherents often call this a “staging area.”

Nonetheless, significant differences exist between the two approaches (see chart.) Data warehousing professionals need to understand the substantial, subtle, and semantic differences among the approaches and which industry “gurus” or consultants advocate each approach. This will provide a clearer understanding of the different routes to achieve data warehousing success and how to translate between the advice and rhetoric of the different approaches.

Top-Down Approach
The top-down approach views the data warehouse as the linchpin of the entire analytic environment. The data warehouse holds atomic or transaction data that is extracted from one or more source systems and integrated within a normalized, enterprise data model. From there, the data is summarized, dimensionalized, and distributed to one or more “dependent” data marts. These data marts are “dependent” because they derive all their data from a centralized data warehouse.

Sometimes, organizations supplement the data warehouse with a staging area to collect and store source system data before it can be moved and integrated within the data warehouse. A separate staging area is particularly useful if there are numerous source systems, large volumes of data, or small batch windows with which to extract data from source systems.

The major benefit of a “top-down” approach is that it provides an integrated, flexible architecture to support downstream analytic data structures. First, this means the data warehouse provides a departure point for all data marts, enforcing consistency and standardization so that organizations can achieve a single version of the truth. Second, the atomic data in the warehouse lets organizations re-purpose that data in any number of ways to meet new and unexpected business needs. For example, a data warehouse can be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users can query the data warehouse if they need cross-functional or enterprise views of the data.

On the downside, a top-down approach may take longer and cost more to deploy than other approaches, especially in the initial increments. This is because organizations must create a reasonably detailed enterprise data model as well as the physical infrastructure to house the staging area, data warehouse, and the marts before deploying their applications or reports. (Of course, depending on the size of an implementation, organizations can deploy all three “tiers” within a single database.) This initial delay may cause some groups with their own IT budgets to build their own analytic applications. Also, it may not be intuitive or seamless for end users to drill through from a data mart to a data warehouse to find the details behind the summary data in their reports.

Bottom-Up Approach
In a bottom-up approach, the goal is to deliver business value by deploying dimensional data marts as quickly as possible. Unlike the top-down approach, these data marts contain all the data—both atomic and summary—that users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and query performance. Each data mart builds on the next, reusing dimensions and facts so users can query across data marts, if desired, to obtain a single version of the truth as well as both summary and atomic data.

The “bottom-up” approach consciously tries to minimize back-office operations, preferring to focus an organization’s effort on developing dimensional designs that meet end-user requirements. The “bottom-up” staging area is non-persistent, and may simply stream flat files from source systems to data marts using the file transfer protocol. In most cases, dimensional data marts are logically stored within a single database. This approach minimizes data redundancy and makes it easier to extend existing dimensional models to accommodate new subject areas.

Pros/Cons. The major benefit of a bottom-up approach is that it focuses on creating user-friendly, flexible data structures using dimensional, star schema models. It also delivers value rapidly because it doesn’t lay down a heavy infrastructure up front.

Without an integration infrastructure, the bottom-up approach relies on a “dimensional bus” to ensure that data marts are logically integrated and stovepipe applications are avoided. To integrate data marts logically, organizations use “conformed” dimensions and facts when building new data marts. Thus, each new data mart is integrated with others within a logical enterprise dimensional model.

Another advantage of the bottom-up approach is that since the data marts contain both summary and atomic data, users do not have to “drill through” from a data mart to another structure to obtain detailed or transaction data. The use of a staging area also eliminates redundant extracts and overhead required to move source data into the dimensional data marts.

One problem with a bottom-up approach is that it requires organizations to enforce the use of standard dimensions and facts to ensure integration and deliver a single version of the truth. When data marts are logically arrayed within a single physical database, this integration is easily done. But in a distributed, decentralized organization, it may be too much to ask departments and business units to adhere and reuse references and rules for calculating facts. There can be a tendency for organizations to create “independent” or non-integrated data marts.

In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need to create additional data structures outside of the bottom-up architecture to accommodate data mining, ODSs, and operational reporting requirements. However, this may be achieved simply by pulling a subset of data from a data mart at night when users are not active on the system.

Hybrid Approach
The hybrid approach tries to blend the best of both “top-down” and “bottom-up” approaches. It attempts to capitalize on the speed and user-orientation of the “bottom-up” approach without sacrificing the integration enforced by a data warehouse in a “top down” approach. Pieter Mimno, an independent consultant who teaches at TDWI conferences, is currently the most vocal proponent of this approach.

The hybrid approach recommends spending about two weeks developing an enterprise model in third normal form before developing the first data mart. The first several data marts are also designed in third normal form but deployed using star schema physical models. This dual modeling approach fleshes out the enterprise model without sacrificing the usability and query performance of a star schema.

The hybrid approach relies on an extraction, transformation, and load (ETL) tool to store and manage the enterprise and local models in the data marts as well as synchronize the differences between them. This lets local groups, for example, develop their own definitions or rules for data elements that are derived from the enterprise model without sacrificing long-term integration. Organizations also use the ETL tool to extract and load data from source systems into the dimensional data marts at both the atomic and summary levels. Most ETL tools today can create summary tables on the fly.

After deploying the first few “dependent” data marts, an organization then backfills a data warehouse behind the data marts, instantiating the “fleshed out” version of the enterprise data model. The organization then transfers atomic data from the data marts to the data warehouse and consolidates redundant data feeds, saving the organization time, money, and processing resources. Organizations typically backfill a data warehouse once business users request views of atomic data across multiple data marts.

The major benefit of a hybrid approach is that it combines rapid development techniques within an enterprise architecture framework. It develops an enterprise data model iteratively and only develops a heavyweight infrastructure once it’s really needed (e.g. when executives start asking for reports that cross data mart boundaries.)

However, backfilling a data warehouse can be a highly disruptive process that delivers no ostensible value and therefore may never be funded. In addition, few query tools can dynamically and intelligently query atomic data in one database (i.e. the data warehouse) and summary data in another database (i.e. the data marts.) Users may be confused when to query which database.

This approach also relies heavily on an ETL tool to synchronize meta data between enterprise and local versions, develop aggregates, load detail data, and orchestrate the transition to a data warehousing infrastructure. Although ETL tools have matured considerably, they can never enforce adherence to architecture. The hybrid approach may make it too easy for local groups to stray irrevocably from the enterprise data model.

Federated Approach
The federated approach is sometimes confused with the hybrid approach above or “hub-and-spoke” data warehousing architectures that are a reflection of a top-down approach.

However, the federated approach—as defined by its most vocal proponent, Doug Hackney—is not a methodology or architecture per se, but a concession to the natural forces that undermine the best laid plans for deploying a perfect system. A federated approach rationalizes the use of whatever means possible to integrate analytical resources to meet changing needs or business conditions. In short, it’s a salve for the soul of the stressed out data warehousing project manager who must sacrifice architectural purity to meet the immediate (and ever-changing) needs of his business users.

Hackney says the federated approach is “an architecture of architectures.” It recommends how to integrate a multiplicity of heterogeneous data warehouses, data marts, and packaged applications that companies have already deployed and will continue to implement in spite of the IT group’s best effort to enforce standards and adhere to a specific architecture.

Hackney concedes that a federated architecture will never win awards for elegance or be drawn up on clean white boards as an “optimal solution.” He says it provides the “maximum amount of architecture possible in a given political and implementation reality.” The approach merely encourages organizations to share the “highest value” metrics, dimensions, and measures wherever possible, however possible. This may mean, for example, creating a common staging area to eliminate redundant data feeds or building a data warehouse that sources data from multiple data marts, data warehouses, or analytic applications.

The major problem with the federated approach is that it is not well documented. There are only a few columns written on the subject. But perhaps this is enough, as it doesn’t prescribe a specific end-state or approach. Another potential problem is that without a specific architecture in mind, a federated approach can perpetuate the continued decentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating meta data is a pernicious problem in a heterogeneous, ever-changing environment.

Summary
The four approaches described here represent the dominant strains of data warehousing methodologies. Data warehousing managers need to be aware of these methodologies but not wedded to them. These methodologies have shaped the debate about data warehousing best practices, and comprise the building blocks for methodologies developed by practicing consultants.

Ultimately, organizations need to understand the strengths and limitations of each methodology and then pursue their own way through the data warehousing thicket. Since each organization must respond to unique needs and business conditions, having a foundation of best practice models to start with augurs a successful outcome.

Top-Down Bottom-Up Hybrid Federated

Major Characteristics

· Emphasizes the DW.
· Starts by designing an enterprise model for a DW.
· Deploys multi-tier architecture comprised of a staging area, a DW, and “dependent” data marts.
· The staging area is persistent.
· The DW is enterprise-oriented; data marts are function-specific.
· The DW has atomic-level data; data marts have summary data.
· The DW uses an enterprise-based normalized model; data marts use a subject-specific dimensional model.
· Users can query the data warehouse and data marts.
 
· Emphasizes data marts.
· Starts by designing a dimensional model for a data mart.
· Uses a “flat” architecture consisting of a staging area and data marts.
· The staging area is largely non-persistent.
· Data marts contain both atomic and summary data.
· Data marts can provide both enterprise and function-specific views.
· A data mart consists of a single star schema, logically or physically deployed.
· Data marts are deployed incrementally and “integrated” using conformed dimensions.
 
· Emphasizes DW and data marts; blends “top-down” and “bottom-up” methods.
· Starts by designing enterprise and local models synchronously.
· Spends 2–3 weeks creating a high-level, normalized, enterprise model; fleshes out model with initial marts. · Populates marts with atomic and summary data via a non-persistent staging area.
· Models marts as one or more star schemas.
· Uses ETL tool to populate data marts and exchange meta data between ETL tool and data marts.
· Backfills a DW behind the marts when users want views at atomic level across marts; instantiates the “fleshed out” enterprise model, and moves atomic data to the DW.
 
· Emphasizes the need to integrate new and existing heterogeneous BI environments.
· An architecture of architectures.
· Acknowledges the reality of change in organizations and systems that make it difficult to implement a formalized architecture.
· Rationalizes the use of whatever means possible to implement or integrate analytical resources to meet changing needs or business conditions.
· Encourages organizations to share dimensions, facts, rules, definitions, and data wherever possible, however possible.
 

Pros

· Enforces a flexible, enterprise architecture.
· Once built, minimizes the possibility of renegade “independent” data marts.
· Supports other analytical structures in an architected environment, including data mining sets, ODSs, and operational reports.
· Keeps detailed data in normalized form so it can be flexibly re-purposed to meet new and unexpected needs.
· Data warehouse eliminates redundant extracts.
· Focuses on creating user-friendly, flexible data structures.
· Minimizes “back office” operations and redundant data structures to accelerate deployment and reduce cost.
· No drill-through required since atomic data is always stored in the data marts.
· Creates new views by extending existing stars or building new ones within the same logical model.
· Staging area eliminates redundant extracts.
 
· Provides rapid development within an enterprise architecture framework.
· Avoids creation of renegade “independent” data marts.
· Instantiates enterprise model and architecture only when needed and once data marts deliver real value.
· Synchronizes meta data and database models between enterprise and local definitions.
· Backfilled DW eliminates redundant extracts.
 
· Provides a rationale for “band aid” approaches that solve real business problems.
· Alleviates the guilt and stress data warehousing managers might experience by not adhering to formalized architectures.
· Provides pragmatic way to share data and resources.
 

Cons

· Upfront modeling and platform deployment mean the first increments take longer to deploy and cost more.
· Requires building and managing multiple data stores and platforms.
· Difficult to drill through from summary data in marts to detail data in DW.
· Might need to store detail data in data marts anyway.
 
· Few query tools can easily join data across multiple, physically distinct marts.
· Requires groups throughout an organization to consistently use dimensions and facts to ensure a consolidated view.
· Not designed to support operational data stores or operational reporting data structures or processes.
 
· Requires organizations to enforce standard use of entities and rules.
· Backfilling a DW is disruptive, requiring corporate commitment, funding, and application rewrites.
· Few query tools can dynamically query atomic and summary data in different databases.
 
· The approach is not fully articulated.
· With no predefined end-state or architecture in mind, it may give way to unfettered chaos.
· It might encourage rather than reign in independent development and perpetuate the disintegration of standards and controls.
 

Major Proponents

Bill Inmon and co-authors
 
Ralph Kimball and co-authors
 
Many practitioners
 
Doug Hackney
 

Recent articles by Wayne Eckerson

Wayne Eckerson - Wayne has been a thought leader in the business intelligence field since the early 1990s. He has conducted numerous research studies and is a noted speaker, blogger, and consultant. He is the author of two widely read books: Performance Dashboards: Measuring, Monitoring, and Managing Your Business (2005, 2010) and The Secrets of Analytical Leaders: Insights from Information Insiders (2012).

Wayne is founder and principal consultant at Eckerson Group, a research and consulting company focused on business intelligence, analytics and big data.