|
Integrating Data Quality into Your Data Warehouse Architecture
Why is the quality of data—the most important element in a data warehouse—so often ignored until it is too late? Is it because it is too complex to solve, or is it simply not our
business?
By Jean-Pierre Dijcks Why is the quality of data—the most important element in a data warehouse—so often ignored until it is too late? Is it because it is too complex to solve, or is it simply not our business? Answering this fundamental question is tough, but part of the answer may be the lack of solid methodology to deal with data quality. This article sheds light on what to do to assure data quality is not ignored. This article will describe a methodology for embedding data quality into the very heart of your data warehouse architecture. We present key components to include in your data warehouse design and processes, and explain their impact on data quality. To frame the data quality methodology in a common reference, we will layer it on top of some classic methodologies for designing a data warehouse. This is not a discussion on general methodologies so we will only use them as a reference point. Regardless of the methodology you use, the points shown in this article can easily be embedded and applied. The Baseline—Building Your Data Warehouse Once you begin to build a data warehouse, you will understand that it is not the easiest task. A number of methodologies are published in various books and articles to guide you in this task.
Think Big, Start Small
Of course, there are also project management advantages to iteration (e.g., lessons learned in the first phase can smooth the second phase of a larger project). You will also be able to see your success help the business, which in turn helps your own team become believers in the data warehouse. The basis for any data warehouse building methodology should be defining an ambitious project and then dividing it into small, manageable chunks, which you can deliver on time to meet the business target.
Sponsors Should Also Think Big This step is vital for ensuring the common goals (information standardization across the organization) of a data warehouse are met and that individual departmental problems are understood and addressed. The question “What is a customer?” should be answered companywide, not on a per-department basis. Otherwise, your data consolidation dream and a single version of the truth rapidly disappears. Many more things can be said about executive sponsorship in the context of a data warehouse project. Here we will leave it at emphasizing the importance once again. Just like data quality, it is frequently overlooked and will almost certainly come back to haunt you if it is not present.
Enterprise versus Mart In any case, even if you are creating an enterprise data warehouse, don’t get stuck in defining the one-modelsolves- all mentality. Be sure to keep delivery in mind and be prepared to compromise the model design while pragmatically solving the issue. Remember, the business user does not care about the model—only about solving his problem.
Business Users Rule Your World Figure 1. A Data Warehouse Methodology
The Data Warehouse Methodology As you can see in Figure 1, the methodology consists of five major building blocks layered on top of the foundation described earlier in this article. The five building blocks are:
We have discussed the involvement of end users in your project, but we must first define how these building blocks work within the context of a phased delivery approach. See Figure 2. The Scope phase is designed to determine the project’s extent. It is structured atop the foundation and guarded by project management. In this more detailed look at the general methodology, note that end-user involvement becomes more prominent, spanning all phases of your project and influencing all building blocks. This methodology forms the basis for your data quality assurance methodology. Figure 2. Iteration and Common Phases
Introducing and Scoping Data Quality
Defining Core Elements Another question we need to ask is: What actually comprises data (and, therefore, data quality)? Are we merely concerned by transactions stored in our operational systems (“hard data”), or are we concerned by “soft data” such as terminology, business rules, and other forms of metadata? In this article we place the hard and soft data under the banner of data quality. Therefore, the scope of data quality includes all data elements that play a role in the delivery of information to end users. Including metadata in the scope of our problem will have a major impact on the proposed methodology.
Defining the Data Quality Context For “soft data,” quality issues often arise from data not actually stored anywhere until the data warehouse is built. Again, the data warehouse does not cause the quality issues, they are merely highlighted by it. For “soft data,” the methodology we define is based on how to extract and manage the data appropriately. On the “hard data” side, we will only cover how to detect and “fix” data quality in the warehouse context.
The Data Quality Methodology Defined As Figure 3 shows, the data quality methodology is based on four distinct phases and two iterative streams of information. These two streams are extremely important in the methodology, as you will need to continue working on data quality, even after you have delivered your initial result set to the end users. Figure 3. The Data Quality Methodology
Investigate Data Quality Your project sponsor and the end users are core to your project. It is important to correctly set data quality expectations from the start. The biggest problem with beginning early is that you will be hard pressed to find the time and resources to do this investigation. If you cannot do this before the project starts, you should add an extra step to the project plan to ensure you will assess the data quality before scoping and design begins. Failure to do so will lead to numerous problems later on. This first step is the one step most data warehouse projects omit. Omitting this step will affect later phases of the project and lead to late delivery of incorrect information.
Scope Data Quality Poor data quality has two important consequences:
Architect the Solution Essentially, there are two places to solve the data quality problem: on the source side and in the target data warehouse. The former is something that you may not have in hand or control over. The latter is most certainly within your project scope. To architect the solution in the data warehouse, you need to:
When you know what you need, you will have to get the resources and technology in house. As you have now determined who is doing what at which moment in the data warehouse, this phase will have impact on the design of the warehouse. We will see this impact later on when we layer the data quality methodology over the phases of the warehouse methodology.
Solve Data Quality Issues
End-User Feedback and Quality Reporting End users will probably give you feedback when the data is deteriorating, but you may not hear about improvements. Therefore, you must stimulate any feedback and ensure that it is recorded in order to allow you to reassess the quality in your warehouse and the processes you have in place. This way you are in the position to fine tune the quality, and you will be able to continue delivering stable and expected quality. While end-user perception is important, you may not want to solely depend on it for your quality assessment. Implement a reporting and measurement system to independently measure the quality of the data. This reporting loop is another continuous flow of information; if the reports indicate a change, you must act accordingly and re-investigate. Both feedback loops allow you to maintain the warehouse and its quality. Figure 4. The Combined Methodology
Embed the Data Quality Methodology The data quality methodology becomes an integral part of the complete methodology. You want to ensure that your user community is involved, both in data quality and in the overall data warehouse. However, end-user involvement starts earlier in the overall process if we add data quality to the picture.
Data Quality Method in Practice
Investigate—The Importance of Metadata When you start investigating and scoping your data warehouse, metadata brings yet another complicating factor to the table. Depending on the technology you use, this may become a big problem in your environment. If you use custom-written logic to fix data quality issues, you will be storing metadata implicitly within the code. This means:
The solution: treat business rules as data in itself—that is, metadata. You can now measure and manage metadata within the data quality methodology using the same mechanisms.
Investigate—The Technology Impact of Metadata Repositories come in many shapes and forms, which we will not discuss. Instead, we will choose the pragmatic approach of looking at a repository from the data warehouse perspective. The minimum requirements of a repository in the data warehouse scope are:
Be aware that the repository and metadata discussion is likely to influence the selection of the tool you will use in your project.
Investigate—Starting Your Project
Perform the first task to create your own understanding of the data. It is especially useful to investigate differences between source systems. You can start this just before the project “really” starts. Interviewing key users is something you usually do to design your data warehouse. You should include questions about data integrity, issues the users may have seen, and even the thoughts they have on the quality of source systems. Most business users have a pretty good feeling about their data sources and have learned to deal with data problems in their job. This information is very useful in your investigation.
Investigate—Establish the Baseline It may be difficult to establish a measurement for quality. Use a known method, such as Six Sigma. This allows you to quantify the number of defects in a sigma (or standard deviation) number. Simply stated, the lower the number, the bigger the problem. You can then use the same scale to establish the goal for data quality. This baseline and the data quality goals you set are documented and reported to both the project sponsors and the end users.
Scope—Determining What to Do
Architect—Data Quality in Your Design This is where you need to be prudent and establish the correct tool to assist you. We have discussed the importance of metadata and a repository. We are now adding another requirement to your tool: the integration of data quality in the ETL tool. We will discuss tools in more depth later, but first let us look at the design impact of data quality on ETL. When you design ETL processes, the typical focus is on collecting and consolidating information. You will decide where a data element originates and how to consolidate the data sources. If you add data quality rules, your design becomes more complex. Based on the quality of your data for each source system, you must now describe how you want to fix data quality issues while transforming and consolidating information. Here you see why it is so important to work with metadata. You should be able to use the business rules discovered in the investigation phase in the ETL process.
Architect—Performance and Scalability
Solve—Checkpoints
While you are setting these checkpoints, be sure to capture measurements on the data and verify them with what you expected from the investigation phase. The earlier you can discover you are not on track, the easier it is to change the project plans.
Solve—Performance and Scalability
Selecting a Tool for the Methodology
Overview of the Tools Market The biggest problem you will face in the current market is that apart from complex data quality issues, you will also solve complex architecture problems. In particular, metadata integration will become an issue. Luckily, it is not all bad news. Vendors are consolidating or working with partners to offer more integrated solutions. Looking at market trends, consolidation is the name of the game—smaller data quality specialists are consolidating into larger ETL specialists. With database vendors quickly moving into the ETL tool market, even these ETL specialists are hard pressed to keep up with the consolidation, and they become niche players themselves. With this in mind, let’s take a look at the vendors and how they fit within this methodology.
Pure Data Quality Tools
The biggest issue with these highly specialized tools is that they require specific knowledge and training. They typically have some ETL functionality but lack the specialization to be first-class ETL tools. Most vendors that fit this category are extensively partnered with ETL vendors and deliver some functionality in specific ETL tools. So many of the tools that are specialized ETL tools pretend to be data quality savvy through reseller deals with specialized data quality vendors. The partner solution is not always better than the standalone tool, as there may be limited capabilities in the integrated part only allowing you to use limited data quality capabilities. For more capabilities, you will have to leave the ETL tool and go work directly in the partner tool. These solutions work best in an environment with limited ETL requirements.
Pure ETL Tools In addition, these pure ETL tools typically have limited design capabilities beyond ETL. You typically do the schema design in a CASE tool, then do the ETL in the ETL tool, and do the data quality (partially) in the data quality tool. As with pure data quality tools, you will end up integrating software on top of the complex problem of building a data warehouse with quality data.
Consolidated Tools
Methodology Compliance
Vendors are moving toward consolidation. Eventually, tools in the marketplace will achieve a more integrated fit between the toolset and methodology. Once this fit is established, it will be easier to incorporate data quality into your projects.
Conclusion Working with this methodology offers several technical advantages:
For the business this means: Less risk of a delay in the delivery of projects
It is worthwhile to consider data quality early in your projects and work actively to embed data quality measures and capabilities in your project from its inception. 1 Six Sigma is a method applied in various companies, often in manufacturing, to quantify the defects. Ideally there are no defects in a series (for example 1000) of products. So there are zero defects per 1000 products produced. If you have defects, you are deviating from the optimal, expressed in a sigma number. If you add cost to reduce defects to the equation it turns out that Six Sigma is the ideal number to have. Cost versus defects is in a perfect (that means lowest cost) balance.
|