Business Intelligence Best Practices


Integrating Data Quality into Your Data Warehouse Architecture

By Jean-Pierre Dijcks

Why is the quality of data—the most important element in a data warehouse—so often ignored until it is too late? Is it because it is too complex to solve, or is it simply not our business? Answering this fundamental question is tough, but part of the answer may be the lack of solid methodology to deal with data quality. This article sheds light on what to do to assure data quality is not ignored.

This article will describe a methodology for embedding data quality into the very heart of your data warehouse architecture. We present key components to include in your data warehouse design and processes, and explain their impact on data quality.

To frame the data quality methodology in a common reference, we will layer it on top of some classic methodologies for designing a data warehouse. This is not a discussion on general methodologies so we will only use them as a reference point. Regardless of the methodology you use, the points shown in this article can easily be embedded and applied.

The Baseline—Building Your Data Warehouse
Once you begin to build a data warehouse, you will understand that it is not the easiest task. A number of methodologies are published in various books and articles to guide you in this task.

Think Big, Start Small
A cornerstone of successful warehouses is the usage of iteration in delivery. As opposed to the “big bang” delivery mechanism, iteration in delivery allows for a phased and manageable project. Segmenting your work into phases offers two benefits to your end users:

  • Doubt and insecurity are reduced when you solve one business problem now (instead of solving all of them “some day” in the future)
  • You can gradually embed knowledge and information in the organization, creating trust and usage (instead of overwhelming the organization with untested information and delivery mechanisms)

Of course, there are also project management advantages to iteration (e.g., lessons learned in the first phase can smooth the second phase of a larger project). You will also be able to see your success help the business, which in turn helps your own team become believers in the data warehouse. The basis for any data warehouse building methodology should be defining an ambitious project and then dividing it into small, manageable chunks, which you can deliver on time to meet the business target.

Sponsors Should Also Think Big
If all goes well, your project will be initiated from the business side of your organization, in which case you will work with your executive sponsor. You can’t always pick your sponsor, but you need to be sure the sponsor is well aware he has signed on for the long haul. You do not want your sponsor to leave the project once his specific business issue is resolved. In other words, your sponsor should sign on for the “think big” part of your project.

This step is vital for ensuring the common goals (information standardization across the organization) of a data warehouse are met and that individual departmental problems are understood and addressed. The question “What is a customer?” should be answered companywide, not on a per-department basis. Otherwise, your data consolidation dream and a single version of the truth rapidly disappear.

Many more things can be said about executive sponsorship in the context of a data warehouse project. Here we will leave it at emphasizing the importance once again. Just like data quality, it is frequently overlooked and will almost certainly come back to haunt you if it is not present.

Enterprise versus Mart
You may think the subject matter discussed to this point is only applicable to a large-scale enterprise data warehouse. While it is the author’s opinion that it is the best way to solve your business problems, it is not always possible to have such a project scope. This does not, however, render this article useless. Even if you create a departmental data mart, you still need a sponsor who understands that you are building an infrastructure to solve many problems, not just the most pressing issue of the day. This awareness will allow you to actually invest in architecture rather than in ad hoc problem solving.

In any case, even if you are creating an enterprise data warehouse, don’t get stuck in a one-model-solves-all mentality. Be sure to keep delivery in mind and be prepared to compromise the model design while pragmatically solving the issue. Remember, the business user does not care about the model—only about solving his problem.

Business Users Rule Your World
As with your executive sponsor, be sure to have business users who understand the problem and, if possible, the solution. Such warehouse users will be the judge of your project’s success. As we will see later in this article, they become even more critical when you layer the data quality methodology onto your project.

Figure 1. A Data Warehouse Methodology


The Data Warehouse Methodology
Let’s examine the high-level methodology. Figure 1 shows an overview of the general data warehouse methodology we will use in this article. The methodology incorporates a foundation created using the sponsorship and “think big, start small” blocks. Project management, also a key factor, tops off the structure of your project.

As you can see in Figure 1, the methodology consists of five major building blocks layered on top of the foundation described earlier in this article. The five building blocks are:

  • Scope: The phase in which you determine the area or problem you will solve. This phase should play a role in both the enterprise data warehouse and the data mart project.
  • Design: The “grunt work” phase. During design, you discover the information required from your end users and design the actual system that will meet these requirements.
  • Build: The design is refined and a data warehouse is developed.
  • Test: This phase includes unit testing and end-user testing. You should be testing both the infrastructure and the delivery mechanism with your end users.
  • Deliver: The actual rollout of the system into production. This phase also encompasses tasks such as training and organizational change management.

We have discussed the involvement of end users in your project, but we must first define how these building blocks work within the context of a phased delivery approach. See Figure 2.

The Scope phase is designed to determine the project’s extent. It is structured atop the foundation and guarded by project management. In this more detailed look at the general methodology, note that end-user involvement becomes more prominent, spanning all phases of your project and influencing all building blocks. This methodology forms the basis for your data quality assurance methodology.

Figure 2. Iteration and Common Phases


Introducing and Scoping Data Quality
Data quality is a big but neglected problem. To understand how to tackle the problem, we should first define the subject and elements of data quality. Then we can define how to deal with data quality in our methodology (and in real life).

Defining Core Elements
A simple definition of data quality is: the level at which data is perceived to be true and accurate. The problem: quality is very subjective. High quality to one person can be low or poor quality to another. This makes data quality hard to measure.

Another question we need to ask is: What actually comprises data (and, therefore, data quality)? Are we merely concerned by transactions stored in our operational systems (“hard data”), or are we concerned by “soft data” such as terminology, business rules, and other forms of metadata?

In this article we place the hard and soft data under the banner of data quality. Therefore, the scope of data quality includes all data elements that play a role in the delivery of information to end users. Including metadata in the scope of our problem will have a major impact on the proposed methodology.

Defining the Data Quality Context
We are discussing data quality in the context of a data warehouse because a data warehouse is one of the areas most heavily impacted by data quality problems. Keep in mind that the problem is usually not caused by a data warehouse. Data quality for “hard data” is typically caused at the origin of the data—the operational system.

For “soft data,” quality issues often arise from data not actually stored anywhere until the data warehouse is built. Again, the data warehouse does not cause the quality issues; they are merely highlighted by it. For “soft data,” the methodology we define is based on how to extract and manage the data appropriately. On the “hard data” side, we will only cover how to detect and “fix” data quality in the warehouse context.

The Data Quality Methodology Defined
With the definition and context of data quality in mind, we can define the methodology that will help ensure data quality in your data warehouse.

As Figure 3 shows, the data quality methodology is based on four distinct phases and two iterative streams of information. These two streams are extremely important in the methodology, as you will need to continue working on data quality, even after you have delivered your initial result set to the end users.

Figure 3. The Data Quality Methodology


Investigate Data Quality
This is the most crucial phase in the methodology. Data quality assurance starts before your project actually starts. That is, at the moment you initiate your data warehouse project, you must also start thinking about data quality.

Your project sponsor and the end users are core to your project. It is important to correctly set data quality expectations from the start. The biggest problem with beginning early is that you will be hard pressed to find the time and resources to do this investigation. If you cannot do this before the project starts, you should add an extra step to the project plan to ensure you will assess the data quality before scoping and design begins. Failure to do so will lead to numerous problems later on.

This first step is the one step most data warehouse projects omit. Omitting this step will affect later phases of the project and lead to late delivery of incorrect information.

Scope Data Quality
After you assess the organization and data quality, you will need to scope the data quality project within the data warehouse project. Your assessment will profoundly impact your scope, as you will have a good idea of the quality of the data you are working with.

Poor data quality has two important consequences:

  • It reduces the number of problems you can solve in a given period of time
  • It increases the effort expended to solve a single problem

Set expectations accordingly with all project participants, and back them up with the investigation results you gathered before starting the scope.

Architect the Solution
In this phase, you solve the problem(s) you discovered previously.

Essentially, there are two places to solve the data quality problem: on the source side and in the target data warehouse. The former is something that you may not have in hand or control over. The latter is most certainly within your project scope.

To architect the solution in the data warehouse, you need to:

  • Evaluate the technology needed and/or available
  • Evaluate the resources needed
  • Identify business rules that solve the data quality issues

When you know what you need, you will have to get the resources and technology in house. As you have now determined who is doing what at which moment in the data warehouse, this phase will have an impact on the design of the warehouse. We will see this impact later on when we layer the data quality methodology over the phases of the warehouse methodology.

Solve Data Quality Issues
Solving the data quality issue is the last step. Here you will apply the technology, resources, and business rules to clean and scrub the data. How you actually solve the problem is, of course, a result of the technology and resources available.

End-User Feedback and Quality Reporting
Now that you have solved the problem, you might expect to be finished. Unfortunately, this is not the case. Data quality is not a “one shot” exercise but a continuous process. One aspect of data quality is that it changes over time, even if all other aspects remain constant. You may see an improvement if the source systems are cleaned up, but you also may see quality deteriorate because of new conditions (e.g., new sources or processes). Therefore, it is prudent to place iteration into the model.

End users will probably give you feedback when the data is deteriorating, but you may not hear about improvements. Therefore, you must stimulate any feedback and ensure that it is recorded in order to allow you to reassess the quality in your warehouse and the processes you have in place. This way you are in the position to fine tune the quality, and you will be able to continue delivering stable and expected quality.

While end-user perception is important, you may not want to solely depend on it for your quality assessment. Implement a reporting and measurement system to independently measure the quality of the data. This reporting loop is another continuous flow of information; if the reports indicate a change, you must act accordingly and re-investigate.

Both feedback loops allow you to maintain the warehouse and its quality.
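Such an independent measurement system can be small to start with. The following is a minimal sketch in Python, assuming records arrive as rows from the warehouse; the column names and thresholds are hypothetical, not part of any specific tool:

```python
# Minimal independent data quality report: measure completeness and
# uniqueness per column, and flag any metric that falls below its target.

def quality_report(rows, key_field, thresholds):
    """Return metrics per column plus alerts for metrics below threshold."""
    total = len(rows)
    report = {}
    if total == 0:
        return {"metrics": report, "alerts": []}
    for col in rows[0].keys():
        filled = sum(1 for r in rows if r.get(col) not in (None, ""))
        report[col] = {"completeness": filled / total}
    # Uniqueness only makes sense for the business key of the table.
    distinct_keys = len({r[key_field] for r in rows})
    report[key_field]["uniqueness"] = distinct_keys / total
    alerts = [
        (col, metric, value)
        for col, metrics in report.items()
        for metric, value in metrics.items()
        if value < thresholds.get(metric, 0.0)
    ]
    return {"metrics": report, "alerts": alerts}

rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 2, "email": "b@example.com"},
]
result = quality_report(rows, "customer_id",
                        {"completeness": 0.9, "uniqueness": 1.0})
```

Run on a schedule, a report like this gives you the re-investigation trigger independently of end-user perception: any alert means the quality has drifted from the agreed baseline.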

Figure 4. The Combined Methodology


Embed the Data Quality Methodology
Now that we have defined both methodologies, we can merge them and come to an integrated picture of how data quality will influence your data warehouse design.

The data quality methodology becomes an integral part of the complete methodology. You want to ensure that your user community is involved, both in data quality and in the overall data warehouse. However, end-user involvement starts earlier in the overall process if we add data quality to the picture.

Data Quality Method in Practice
Having defined a methodology, we next look at applying it to real-life projects.

Investigate—The Importance of Metadata
While we have not spent much time discussing metadata (or “soft data”) to this point, it does play a significant part in the methodology. As discussed earlier, you will most likely be solving the data quality problem in the data warehouse and not in the source systems. For this example, we are going to assume that you are aware of what the metadata can do for you. However, it is important to understand that metadata is, in itself, data. Therefore it is prone to quality errors, so the entire methodology should be applied to metadata as well.

When you start investigating and scoping your data warehouse, metadata brings yet another complicating factor to the table. Depending on the technology you use, this may become a big problem in your environment. If you use custom-written logic to fix data quality issues, you will be storing metadata implicitly within the code. This means:

  • It is hard to apply data quality assessments on logic embedded in code, as you cannot pinpoint the exact piece of code at fault.
  • Iteration becomes hard; each change requires you to investigate your entire code base and change it if necessary, making delivery cycles longer.

The solution: treat business rules as data in itself—that is, metadata. You can now measure and manage metadata within the data quality methodology using the same mechanisms.
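As a minimal sketch of this idea, the rules below are plain data records rather than code, applied by one generic routine. The rule names and fields here are illustrative assumptions, not any specific tool’s rule format:

```python
# Business rules expressed as metadata: each rule is a record that can be
# stored in a repository, queried, reported on, and changed without
# touching the cleansing code itself.

RULES = [
    {"field": "country", "rule": "map_value",
     "params": {"USA": "US", "U.S.": "US"}},
    {"field": "email", "rule": "lowercase", "params": None},
]

def apply_rules(record, rules):
    """Apply each metadata-driven rule to the matching field of a record."""
    cleaned = dict(record)
    for rule in rules:
        field = rule["field"]
        value = cleaned.get(field)
        if value is None:
            continue
        if rule["rule"] == "map_value":
            cleaned[field] = rule["params"].get(value, value)
        elif rule["rule"] == "lowercase":
            cleaned[field] = value.lower()
    return cleaned

row = apply_rules({"country": "U.S.", "email": "Ann@Example.COM"}, RULES)
```

Because the rules are data, a quality assessment can now pinpoint the exact rule at fault, and a change in a rule is a data update rather than a code release—which is what keeps the iteration cycles short.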

Investigate—The Technology Impact of Metadata
The fact that you should be working with metadata has a direct technology impact on your investigation. To allow metadata quality assessments and management, you will need a place to store the metadata—a repository.

Repositories come in many shapes and forms, which we will not discuss. Instead, we will choose the pragmatic approach of looking at a repository from the data warehouse perspective.

The minimum requirements of a repository in the data warehouse scope are:

  • Storage of data warehouse schema design information for sources and targets
  • Storage of extraction, transformation, and loading (ETL) rules and processes
  • Storage of business rules, applied within the ETL process
  • Reporting capabilities (preferably enabled for a wide audience: your end users)

Be aware that the repository and metadata discussion is likely to influence the selection of the tool you will use in your project.

Investigate—Starting Your Project
Now that you have your metadata sorted, it is time to investigate the heart of the matter—your data quality issues—typically by performing two tasks:

  • Investigate the source systems by direct access
  • Interview key users of the source systems

Perform the first task to create your own understanding of the data. It is especially useful to investigate differences between source systems. You can start this just before the project “really” starts. Interviewing key users is something you usually do to design your data warehouse. You should include questions about data integrity, issues the users may have seen, and even the thoughts they have on the quality of source systems. Most business users have a pretty good feeling about their data sources and have learned to deal with data problems in their job. This information is very useful in your investigation.
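A first pass at investigating a source by direct access can be as simple as profiling its columns. The sketch below summarizes fill rate and the most common values so the same attribute can be compared across source systems; the sample data and column are hypothetical:

```python
# Profile a source column: how often is it filled, and what values
# dominate? Running this per source system quickly exposes differences.

from collections import Counter

def profile_column(values):
    """Summarize a column for comparison across source systems."""
    total = len(values)
    non_null = [v for v in values if v not in (None, "")]
    return {
        "total": total,
        "fill_rate": len(non_null) / total if total else 0.0,
        "top_values": Counter(non_null).most_common(3),
    }

# Comparing the same attribute in two sources often exposes exactly the
# kind of inconsistency ("NL" vs. "Netherlands") worth raising in the
# key-user interviews.
crm_profile = profile_column(["NL", "NL", "US", None, "UK"])
erp_profile = profile_column(["Netherlands", "USA", "USA", "United Kingdom"])
```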

Investigate—Establish the Baseline
After you have investigated your data sources and combined this information with the interview results, you must establish the data quality baseline. What is the data quality level of the organization, and where do you want to go?

It may be difficult to establish a measurement for quality. Use a known method, such as Six Sigma. This allows you to quantify the number of defects in a sigma (or standard deviation) number. Simply stated, the lower the number, the bigger the problem.

You can then use the same scale to establish the goal for data quality. This baseline and the data quality goals you set are documented and reported to both the project sponsors and the end users.
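As an illustration of quantifying that baseline, the sketch below converts an observed defect count into a sigma level using the conventional 1.5-sigma shift from Six Sigma practice; the record counts are hypothetical:

```python
# Convert observed defects into a sigma level. The lower the sigma
# number, the bigger the data quality problem; six sigma corresponds to
# roughly 3.4 defects per million opportunities.

from statistics import NormalDist

def sigma_level(defects, opportunities, shift=1.5):
    """Return the (shifted) sigma level for the observed defect rate."""
    defect_rate = defects / opportunities
    yield_rate = 1.0 - defect_rate
    return NormalDist().inv_cdf(yield_rate) + shift

# Baseline: 2,500 bad records found in 100,000 loaded rows.
baseline = sigma_level(2_500, 100_000)
```

The same function then expresses the goal on the same scale (say, raising the baseline by one sigma in the first increment), which gives sponsors and end users a number to hold the project to.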

Scope—Determining What to Do
Based on the assessment of your data quality issues and the goals for the data warehouse, you must now scope the work required to deliver your solution. The important aspect of this phase is to ensure that the effort needed to achieve the set data quality goal is embedded in the scoping for the project.

Architect—Data Quality in Your Design
The biggest impact data quality will have on your design is in the ETL area of the project. The ETL portion (especially the transform) will bear the brunt of the work. In order to fix data quality issues, you will need to transform data from its original value into the true or expected value.

This is where you need to be prudent and establish the correct tool to assist you. We have discussed the importance of metadata and a repository. We are now adding another requirement to your tool: the integration of data quality in the ETL tool. We will discuss tools in more depth later, but first let us look at the design impact of data quality on ETL.

When you design ETL processes, the typical focus is on collecting and consolidating information. You will decide where a data element originates and how to consolidate the data sources. If you add data quality rules, your design becomes more complex. Based on the quality of your data for each source system, you must now describe how you want to fix data quality issues while transforming and consolidating information.

Here you see why it is so important to work with metadata. You should be able to use the business rules discovered in the investigation phase in the ETL process.

Architect—Performance and Scalability
Performance and scalability affect the data quality within your ETL process since you are working with complex rules and code. It is important to realistically assess the performance you need and the performance you can achieve.

With your architecture in place and the scoping done, you will want to establish a set of project checkpoints. In particular, for the first increment of your project, spend time with your end users and show them results frequently, which:

  • Establishes and maintains their involvement in the project
  • Verifies the rules you distilled from your investigation against real data
  • Identifies new problems with data quality
  • Shows both progress and potential to the users

While you are setting these checkpoints, be sure to capture measurements on the data and verify them with what you expected from the investigation phase. The earlier you can discover you are not on track, the easier it is to change the project plans.

Solve—Performance and Scalability
As part of your checkpoints, make sure you test your business rules and the ETL process for correctness and ensure that you can handle real data volumes. In other words, verify that your architecture can handle the data load performance needed—both now and in the future.

Selecting a Tool for the Methodology
After understanding the challenges and the requirements of tools, we should look at some tools in the market and assess what toolset best supports the methodology.

Overview of the Tools Market
As with many areas surrounding data warehousing, data quality software is a fairly fragmented market. It seems that most tools focus on a single solution area within the data warehouse. This does not help you implement a consistent methodology across various subject areas.

The biggest problem you will face in the current market is that apart from complex data quality issues, you will also solve complex architecture problems. In particular, metadata integration will become an issue.

Luckily, it is not all bad news. Vendors are consolidating or working with partners to offer more integrated solutions. Looking at market trends, consolidation is the name of the game—smaller data quality specialists are consolidating into larger ETL specialists. With database vendors quickly moving into the ETL tool market, even these ETL specialists are hard pressed to keep up with the consolidation, and they become niche players themselves. With this in mind, let’s take a look at the vendors and how they fit within this methodology.

Pure Data Quality Tools
Trillium and First Logic dominate the market for data quality standalone tools. Both vendors offer complete toolsets focused on data quality. These vendors:

  • Supply reference data libraries
  • Supply tools to cleanse and scrub data
  • Allow you to work with business rules and metadata

The biggest issue with these highly specialized tools is that they require specific knowledge and training. They typically have some ETL functionality but lack the specialization to be first-class ETL tools.

Most vendors that fit this category partner extensively with ETL vendors and deliver some of their functionality inside specific ETL tools. As a result, many specialized ETL tools claim data quality savvy through reseller deals with specialized data quality vendors.

The partner solution is not always better than the standalone tool: the integrated portion may expose only a subset of the data quality capabilities. For more capabilities, you will have to leave the ETL tool and work directly in the partner tool. These solutions work best in an environment with limited ETL requirements.

Pure ETL Tools
Informatica and Ascential are two of the better-known vendors in the ETL market. Informatica is an example of a vendor that partners with Trillium; in Informatica’s case, the integration delivers a subset of Trillium’s functionality in the ETL tool. Pure ETL tools fit the data warehouse methodology better than they fit the combined methodology. While there is some integration with a data quality tool, it is still a separate toolset. ETL is done well, but the integration of data quality is limited.

In addition, these pure ETL tools typically have limited design capabilities beyond ETL. You typically do the schema design in a CASE tool, then do the ETL in the ETL tool, and do the data quality (partially) in the data quality tool. As with pure data quality tools, you will end up integrating software on top of the complex problem of building a data warehouse with quality data.

Consolidated Tools
Ascential is moving in this direction with some recent acquisitions. However, the vendor farthest along with consolidating ETL and data quality is Oracle with its Warehouse Builder product. Like Ascential, Oracle bought a data quality vendor (Carleton) and integrated that functionality in Warehouse Builder, delivering an integrated ETL and data quality tool. SAS also offers a fairly complete solution as the product stacks contain both ETL and data quality tools. In general, consolidated tools let you work with one user interface with complete functionality—the metadata is shared by default.

Methodology Compliance
To fit with a methodology for both designing a warehouse and warranting data quality, it is best for a tool to do many tasks. Ideally, you want to have the following:

  • A single metadata store (repository) for all data-warehouse-related information
  • Schema design capabilities for both dimensional and relational objects
  • Complete ETL capabilities
  • Embedded data quality capabilities

Vendors are moving toward consolidation. Eventually, tools in the marketplace will achieve a more integrated fit between the toolset and methodology. Once this fit is established, it will be easier to incorporate data quality into your projects.

Factoring data quality into your data warehouse design will definitely change your construction techniques. The initial phases of the data warehouse are most severely impacted by giving data quality the attention it needs.

Working with this methodology offers several technical advantages:

  • A clear assessment on what the actual data quality problems are before you start the project
  • A better architecture to deal with data quality (since you have designed it to do so)
  • Potentially, a better fit with the tool you use (since you anticipated data quality tasks)

For the business this means:

  • Less risk of a delay in the delivery of projects

  • Less risk of a failure due to unexpected data quality issues
  • Better quality of data in your final data warehouse, enabling decisions based on more accurate data

It is worthwhile to consider data quality early in your projects and work actively to embed data quality measures and capabilities in your project from its inception.

1 Six Sigma is a method applied in various companies, often in manufacturing, to quantify defects. Ideally there are no defects in a series of products (for example, zero defects per 1,000 products produced). If you have defects, you are deviating from the optimal, expressed as a sigma number. When the cost of reducing defects is added to the equation, six sigma turns out to be the ideal level to have: cost versus defects is in perfect (that is, lowest cost) balance.

Jean-Pierre Dijcks
is a senior product manager for Oracle Warehouse Builder, Oracle’s enterprise business intelligence integration design tool.