A Business Approach to Data Quality: Achieving and Maintaining First-Class Organizational Data
The key to understanding any company’s operations is understanding the data. Consequently, the better quality the data is, the better the understanding.

By Stacey Herdlein

Abstract

The key to understanding any company’s operations is understanding the data. Consequently, the better quality the data is, the better the understanding. Conversely, the poorer the data quality is, the less clear the picture is of the company’s operations flows and efficiencies. Data quality is often a technical and business quagmire for organizations, but there is an approach—one that can be guided by business people—to thoroughly check the current state of data quality in practically any organization. This approach is a three-step “bottom-up” method that begins with analyzing the actual organizational data elements, moves on to analyzing the entities that these elements make up, and finishes by analyzing the relationship between the data and the business rules it must support. When augmented by an organizational commitment to develop a “data quality culture,” this methodology can be extremely useful in helping companies create and maintain superior organizational data quality.

Introduction

As the business environment grows increasingly competitive and IT dollars become more and more scarce, savvy organizations continue to seek better ways to understand their operations, as well as the expectations of their customers, suppliers, investors, and employees. In order to do this, organizations must be able to seamlessly integrate their strategy, people, processes, and technology with the strategy, people, processes, and technology of their customers, suppliers, investors, and employees. Such integration enables stakeholders to transact business and access needed information when, where, and how they want.

While seamless integration sounds good in theory, in practice it is incredibly difficult to execute. Many organizations still cannot articulate—let alone measure and understand—the current state of their internal business practices, much less reach out to their partners. Even if a business can articulate its desired best practice business operations in theory, chances are that its practices “on the floor” are very different. So how do organizations begin to understand what is truly occurring in their current business processes—and begin to measure how those processes might improve?

The key to understanding any company’s operations is understanding the data. Consequently, the better quality the data is, the better the understanding. Conversely, the poorer the data quality is, the less clear the picture is of the company’s operations flows and efficiencies.

Before You Begin: Understanding the Current Business Process Flows

The first step in understanding organizational operations flows—and thus the first step in improving them—is to understand the current-state operations environment. This process is actually quite simple. It starts with creating process flows of the current-state operations. Once these process flows are created, the real work of understanding the current state begins—with an analysis of the information required to support the processes. A good high-level first step in the process analysis is to ask questions such as: What information is an input to the process? What information is changed or created during the process? What happens to the information once the process is complete? How is the output distributed to downstream processes? Once the process information is analyzed—at least at a high level—the information can be broken down into required data elements. The same questions listed previously apply, but at the data element level. The resultant answers provide a baseline set of data on which to begin to examine data quality.

One caveat, however: begin at the beginning. If the data is poor at the beginning of a process, then the “ripple effect” of that bad data “downstream” from where it was generated is magnified. It’s more efficient and effective to start the examination at the data source and correct errors there—and propagate those corrections downstream—rather than start at the end of the process and try to correct errors backwards.

First Things First: A Few Definitions

Handling data quality issues is a challenge for any organization, but the challenge is made more difficult by the fact that people often have different notions in mind when they use the term “data quality.” So, to be sure that we’re all on the same page, I’d like to begin with a few definitions. There are essentially three types of data quality:

Data Validity—does the data do what it’s supposed to do?

  • Does the data really describe what it is intended to describe? For example, does the field “social security number” really contain social security numbers? Or is the field being used for a different purpose? In another example, does the “checking account number” field really contain checking account numbers for the bank’s patrons? In healthcare, does the field “type of lab test” contain information that indicates a type of lab test, such as an Epstein Barr virus titre test?
  • Is the data being matched to a table of standard values and ranges? Are there range checks on the value that do not allow these standards to be circumvented? For example, do the social security numbers reflect “real” possibilities for social security numbers, or do values such as “000-00-0000” exist? Are there checks at the point of entry that do not allow a “000-00-0000” value to be entered? For the checking account numbers, if the pattern of these values is two alpha characters followed by six numeric characters, is this pattern followed? In the Epstein Barr virus test, does the result show a value consistent with a titre (2, 4, 8, 16, 32, etc.)? (A sketch of such checks follows this list.)
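
To make these validity checks concrete, here is a minimal Python sketch. The field formats it assumes—a nine-digit social security number, a checking account number of two letters followed by six digits, and a titre result that should be a power of two—are taken from the examples above and are illustrative assumptions, not fixed standards.

```
import re

# Illustrative validity checks. The formats below are assumptions drawn from
# the examples in the text, not authoritative standards.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
ACCOUNT_PATTERN = re.compile(r"^[A-Za-z]{2}\d{6}$")   # two alpha, six numeric

def is_valid_ssn(value: str) -> bool:
    """Format check plus a guard against placeholder values such as 000-00-0000."""
    return bool(SSN_PATTERN.match(value)) and value != "000-00-0000"

def is_valid_account_number(value: str) -> bool:
    """Two alpha characters followed by six numeric characters."""
    return bool(ACCOUNT_PATTERN.match(value))

def is_valid_titre(value: int) -> bool:
    """Titre results should be powers of two: 2, 4, 8, 16, 32, and so on."""
    return value >= 2 and (value & (value - 1)) == 0

print(is_valid_ssn("123-45-6789"))           # True
print(is_valid_ssn("000-00-0000"))           # False: placeholder, not a real SSN
print(is_valid_account_number("AB123456"))   # True
print(is_valid_titre(16), is_valid_titre(12))  # True False
```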

Data Consistency—does the data do the same thing all the time?

  • Within the same facility, does data entry operator “A” fill in the same information as data entry operator “B” for the same set of circumstances?
  • Across multiple facilities, is the same data captured when the same set of circumstances presents itself? When new banking customers open new accounts, is the same set of data captured in different branches for the different customers?
  • Is the process of gathering the needed data the same across multiple facilities? Do all facilities have the same definition and set of standard values for all fields? (See the sketch after this list.)
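
As one illustration of how consistency can be examined mechanically, the following Python sketch profiles a single field’s values per facility (or per data entry operator) and flags facilities whose set of captured values differs from the rest. The record layout and column names are hypothetical; the interpretation of any differences still belongs to the subject matter expert.

```
from collections import Counter

# Hypothetical cross-facility consistency check: compare the domain of one
# field across facilities and surface facilities that never capture values
# the other facilities do capture.

def domain_by_facility(records, facility_column, field):
    """Value counts of `field`, broken out per facility or operator."""
    domains = {}
    for rec in records:
        domains.setdefault(rec[facility_column], Counter())[rec.get(field)] += 1
    return domains

def inconsistent_facilities(records, facility_column, field):
    """Facilities missing values that appear elsewhere in the organization."""
    domains = domain_by_facility(records, facility_column, field)
    all_values = set().union(*(set(counts) for counts in domains.values()))
    return {fac: all_values - set(counts)
            for fac, counts in domains.items()
            if set(counts) != all_values}

records = [
    {"branch": "Atlanta", "account_type": "checking"},
    {"branch": "Atlanta", "account_type": "savings"},
    {"branch": "Macon",   "account_type": "checking"},  # never captures "savings"
]
print(inconsistent_facilities(records, "branch", "account_type"))
# {'Macon': {'savings'}}
```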


Data Completeness/Accuracy—is the data good enough to do the job?

  • Is all the data relevant to a business process captured? Every effort must be made to ferret out all data applicable to each process. The picture used for analysis purposes must be as complete as possible—at an acceptable cost-to-benefit ratio, of course. No data picture can ever be 100-percent complete, but it can be close.
  • Is the data accurate to the reality of the business (i.e., do the data values correctly reflect the objective values or metrics they are intended to measure)? “Good”—as in logical—decisions can be made on inaccurate data. However, if the data used as a basis for decision making is an inaccurate reflection of the reality of the marketplace or environment, then no matter how logically the decision follows the data, the decision will be bad.

As the definitions above suggest, examining data quality requires both a structured, “hard” approach and a change management effort to instill a culture and business process improvements that will support improved data quality.

A Business Approach to Examining Data Quality

Once an organization decides to examine the data quality associated with its business processes, and once it defines and understands its business process flows, the next logical step is to find the best approach to actually examining the current state of data quality. The following guidelines develop an approach to thoroughly check the current state of data quality in practically any organization. It is a three-step “bottom-up” approach that begins with analyzing the actual organizational data elements, moves on to analyzing the entities that these elements make up, and finishes by analyzing the relationship between the data and the business rules that it must support.1


 

A Business Approach to Data Quality
  • Examine organizational data elements for relevant ranges and validity
  • Examine database entity relationships
  • Ensure that organizational data conforms with business rules
  • Involve SMEs at each point in the data quality process

The first step in any data quality analysis is to evaluate the existing state of individual data elements in the organization. This includes doing a check to see what data ranges are present and valid for data elements within each business process. Only organizational knowledge workers and knowledge managers—in concert with subject matter experts—will know which results will make sense in the analysis, and which results are errors.

The current-state data quality evaluation begins with a domain analysis (domain is a fancy term for a set of valid values) performed for the following types of data:

  • Primary keys. Primary keys are unique identifiers of data entities. Checking the range of values in primary keys will help ensure that there are no duplicates, or values that may not be unique. For example, in many databases, social security numbers are used as part of the primary key. A domain check of the social security number will reveal the presence of any blanks, nulls, or repeats.
  • Indicators. Indicators are types of data in which only two values should be valid. Examples include yes/no, male/female, and true/false. A domain analysis of indicators will reveal how many of each value are present, and the presence of any other values within this particular data element. For example, in a healthcare system, one might expect to see a roughly equal gender ratio in a domain analysis. If there is an unequal mix, it does not mean that the data itself is wrong; it simply means that this might be an area worthy of further investigation in concert with the subject matter expert. Similarly, if values in this data set include, say, “M,” “F,” and “A,” it doesn’t necessarily mean “A” is invalid, just that the organization must understand what “A” means in order to make a judgment on the validity of “A.”
  • Codes. Codes specify the acceptable ranges for attributes that help describe primary keys. For example, an auto manufacturer may offer cars in a variety of colors. A red finish on the car may have a value of color_code = 1, blue may have a value of color_code = 2, and so on. If there are 10 car colors, the associated color code values should range from 1 to 10. Consequently, any code values out of the range of 1 to 10 may be invalid. If the code value of “11” appears, that may indicate an experimental color, or it may be that the value “1” or “10” was mistyped. A subject matter expert would know for sure, but the out-of-range value must be investigated.
  • Date/time. Date/time values are tested in two ways: against an upper and lower limit, and for format. For example, if a company started conducting business on May 1, 1990, then there should be no dates in the systems that precede that date. If the quality check for the upper limit of the date is performed on December 31, 2000, then theoretically, there should not be any dates beyond that date. If the format specified is mm/dd/yyyy, then all date values must be in that specified format to have a correct value.
  • Quantity. Quantity values are examined to see if the values returned in the domain assessment “make sense” from a business perspective. For example, salaries for a particular job type may range from $20,000 to $80,000. Only an organizational subject matter expert could confirm whether or not these ranges make sense.
  • Text. Text data is tricky, to say the least. There may be no rules that text data can be tested against. However, organizational subject matter experts can provide clarity here. If a text field is supposed to contain free-form comments, then it may be impossible to check it for “valid” comments. On the other hand, if a free-form field is used to capture particular information, it may be more easily tested and validated. (A sketch of these domain checks follows the next paragraph.)

One example of textual data that often stumps data quality project teams is street address data. Address data is often a combination of textual and numerical data that, of necessity, must be of a “text” data type in the database. However, as with many data quality challenges, there is a software solution. Address-checking software that can help simplify this process is readily available and relatively inexpensive for the time and effort it saves in checking this type of combination data.
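
As a minimal illustration of the domain checks above, the following Python sketch profiles a set of records for duplicate or blank primary keys, out-of-range codes, and out-of-bounds dates. The column names, the 1-to-10 color code range, and the May 1, 1990 / December 31, 2000 date bounds are assumptions taken from the examples in this section, not requirements.

```
from collections import Counter
from datetime import date

# Illustrative domain-analysis helpers over a list of dict "records".
# Column names, code ranges, and date bounds are assumptions from the text.

def domain_report(records, column):
    """How often each distinct value appears in a column."""
    return Counter(rec.get(column) for rec in records)

def suspect_primary_keys(records, key_column):
    """Primary-key values that repeat, or that are blank or null."""
    counts = domain_report(records, key_column)
    return {value: n for value, n in counts.items()
            if n > 1 or value in (None, "")}

def out_of_range_codes(records, code_column, valid_codes=range(1, 11)):
    """Code values outside the expected range (e.g., color_code 1 through 10)."""
    return [rec for rec in records if rec.get(code_column) not in valid_codes]

def out_of_bounds_dates(records, date_column,
                        lower=date(1990, 5, 1), upper=date(2000, 12, 31)):
    """Dates that are missing, precede the company's start, or exceed the check date."""
    bad = []
    for rec in records:
        d = rec.get(date_column)
        if d is None or not (lower <= d <= upper):
            bad.append(rec)
    return bad

# A domain_report on an indicator field (e.g., gender) simply shows how many of
# each value are present; interpreting unexpected values is left to the SME.
```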

The next step in the data quality analysis is to examine data entities, on a database by database basis, as to how they are constructed, and how they relate to each other in the database.


For primary keys, the key structure itself must support the tenet that what is held in the primary key is unique. The primary key—by its very definition—must define a unique record in the database. One primary key (or a combination of data elements that makes up the key) MUST support one—and only one—record in the database. For example, name alone is not enough to distinguish an employee in a personnel database. Even the combination of name and birthday is not enough; consider the very likely case of two people with the name “Timothy Williams,” both born on the same day. However, the triple-key combination of name, birth date, and social security number is usually sufficient to establish a unique primary key.


For foreign keys (a foreign key in one table must correspond to a primary key in another table), referential integrity is the issue. Typically, these records can be thought of as a parent/child relationship. The value in the parent table must have a corresponding value in the child table, and the child must have a corresponding parent. For example, if the parent table contains a listing of supplier names and supplier codes, the supplier code (sup_code) is the primary key. If the child table contains a listing of products (prod_code), then all the products in the child table must have an associated supplier parent, and all of the suppliers must have an associated child product.
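
Here is a minimal Python sketch of these entity-level checks. The table layouts (sup_code, prod_code) and the ABC Paper Co. supplier code 515 come from the examples in this article; the remaining rows are invented for illustration.

```
# Hypothetical parent/child tables; only the layout and supplier 515 come from the text.
suppliers = [                                   # parent table: sup_code is the primary key
    {"sup_code": 515, "name": "ABC Paper Co."},
    {"sup_code": 622, "name": "Acme Pulp Inc."},
]
products = [                                    # child table: each product references a supplier
    {"prod_code": 100, "sup_code": 515},
    {"prod_code": 200, "sup_code": 515},
    {"prod_code": 900, "sup_code": 777},        # orphan: supplier 777 does not exist
]

# Primary-key uniqueness: a duplicate sup_code would collapse two suppliers into one.
parent_keys = {s["sup_code"] for s in suppliers}
assert len(parent_keys) == len(suppliers), "duplicate supplier codes found"

# Children must reference an existing parent...
orphan_products = [p for p in products if p["sup_code"] not in parent_keys]

# ...and, per the example above, every supplier should have at least one product.
child_refs = {p["sup_code"] for p in products}
childless_suppliers = [s for s in suppliers if s["sup_code"] not in child_refs]

print(orphan_products)      # [{'prod_code': 900, 'sup_code': 777}]
print(childless_suppliers)  # [{'sup_code': 622, 'name': 'Acme Pulp Inc.'}]
```
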
The third step in the data quality analysis is to ensure that organizational data conforms to accepted business rules. There are four types of business rules that must be checked:

  • Valid value combinations. Do two or more data elements make sense in how they relate to each other? For example, in a hospital, if a female is noted as being treated for prostate cancer, this is not a valid value combination. Obviously something is inherently wrong in this data combination: either the gender of the patient is incorrect, or the diagnosis code is incorrect.
  • Computations. Do all the records in a summary table equate to the records in the detail table? For an employee paid for piecework, does the total pay equal the amount paid per piece multiplied by the total number of pieces?
  • Time. Do time ranges make sense? In other words, does the project inception date occur before the project completion date? Does the date the order was placed by the customer precede the date the order was packed by the supplier, which in turn should precede the date the order was shipped by the shipping company, with all of these dates falling before the date the order was received by the customer?
  • If/then/else logic. Most business rules are logic combinations that follow if/then/else logic. For instance, if a particular course offered at a college is titled “Calculus 200,” and that course requires a particular prerequisite, such as “Calculus 100,” then a student who has a grade for Calculus 200 but not Calculus 100 may be defying required business logic (then again, the student may have been particularly bright and may have tested out of Calculus 100). Regardless, the logic of the combination must be checked and validated in order to ensure data quality. (The sketch below illustrates all four rule types.)
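
The following Python sketch illustrates the four business-rule checks, assuming hypothetical record layouts drawn from the examples above (a patient record, a piecework pay record, an order with four dates, and a course transcript). Real rules would come from the organization and its SMEs, not from this sketch.

```
from datetime import date

# Illustrative business-rule checks; record layouts and rule values are assumptions.

def valid_value_combination(patient):
    """Valid value combinations: flag a male-only diagnosis recorded for a female patient."""
    male_only_diagnoses = {"prostate cancer"}
    return not (patient["gender"] == "F" and patient["diagnosis"] in male_only_diagnoses)

def piecework_pay_ok(pay_record):
    """Computations: total pay should equal rate per piece times number of pieces."""
    return pay_record["total_pay"] == pay_record["rate_per_piece"] * pay_record["pieces"]

def order_dates_ok(order):
    """Time: placed <= packed <= shipped <= received."""
    dates = [order["placed"], order["packed"], order["shipped"], order["received"]]
    return all(earlier <= later for earlier, later in zip(dates, dates[1:]))

def prerequisite_ok(transcript, course="Calculus 200", prereq="Calculus 100"):
    """If/then/else logic: a grade in a course implies a grade in its prerequisite.
    (A student who tested out is a legitimate exception an SME would confirm.)"""
    return course not in transcript or prereq in transcript

print(valid_value_combination({"gender": "F", "diagnosis": "prostate cancer"}))  # False
print(piecework_pay_ok({"total_pay": 500, "rate_per_piece": 5, "pieces": 100}))  # True
print(order_dates_ok({"placed": date(2000, 1, 2), "packed": date(2000, 1, 3),
                      "shipped": date(2000, 1, 4), "received": date(2000, 1, 7)}))  # True
print(prerequisite_ok({"Calculus 200": "A"}))  # False: no Calculus 100 grade on file
```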


Subject Matter Experts: A Critical Component of the Data Quality Equation

Consider the role of the subject matter expert in the steps above. The subject matter expert is the person who knows whether or not the actual data itself is correct—or could be correct based on the company’s business rules. Therefore, it is critical, in any data quality project and for ongoing data quality maintenance, that SMEs play an integral part in the day-to-day evaluation of data quality and the resolution of any data quality issues.

For instance, consider the task of an SME for a paper products retailer. In the database tables for suppliers and products, there will most likely be multiple suppliers of paper goods in the “suppliers” table, as well as a multitude of types of paper goods in the “products” table. Suppose a referential integrity check revealed that there were three suppliers in the supplier table, and 100 products in the product table. As long as each supplier has its products listed in the product table, and each product can be matched back to a supplier, then the referential integrity of the database looks good.

However, suppose that a particular supplier, for example, ABC Paper Co. with a supplier code of 515, only supplies products with product codes 100, 200, and 300. An SME would run a domain check to ensure that the data for supplier 515 only contains products with product codes listed as 100, 200, or 300. Anything outside this range would require further investigation. If product code 400 came up, that might indicate a discontinued product, or perhaps a seasonal item. Again, this kind of data interpretation is a very human function. The data quality checks described in this paper can point out potential areas for investigation, but the SMEs are in the best position to interpret, clarify, and explain the results.
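
A short Python sketch of that check follows, reusing the hypothetical product-table layout from the earlier referential integrity example; product code 400 is invented here to show a value that would be referred to the SME.

```
# Hypothetical product rows for supplier 515; code 400 is invented for illustration.
allowed_codes = {100, 200, 300}        # the only codes supplier 515 should supply

products = [
    {"prod_code": 100, "sup_code": 515},
    {"prod_code": 300, "sup_code": 515},
    {"prod_code": 400, "sup_code": 515},   # discontinued? seasonal? ask the SME
]

suspect = [p for p in products
           if p["sup_code"] == 515 and p["prod_code"] not in allowed_codes]
print(suspect)   # [{'prod_code': 400, 'sup_code': 515}]
```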

Note: this article addresses issues of data quality with data that is resident in organizational databases, not data that is being moved from one environment to another. When data movement occurs, further testing must be performed to ensure that the data conforms to defined business transformation rules. This topic will not be addressed in this article.

Creating a Data Quality Culture

Now that we have discussed the mechanics of doing a base quality check, let’s discuss creating an environment that supports high quality data. Since data tells a company’s story—where it has been, where it is currently, and what might happen in the future—it is critical that the data be of as high a quality as it possibly can be, so the story is as accurate as it can be.
As organizations increasingly realize the value of their data, and treat that data as the most valuable intangible asset that the company has, it makes sense to have business processes in place that support and reward stewards of data quality. Some organizations take this so seriously that they even have Chief Data Officers, or Data Governance Committees.

However, who are the true stewards of data quality? Ironically, the true stewards of data quality in an organization—the employees who enter the data at its source—are often the least respected, least compensated, and least incentivized employees within the company. Moreover, they are typically paid on the quantity of data that they input versus the quality of data that they input.

Consider a typical order entry operator (OEO). Most OEOs are paid based on the number of orders they enter during their shift. Suppose a typical OEO can enter 1,000 correct orders in a shift, but because of a corporate dictum that quantity of data entered is paramount, the OEO actually enters 1,200 orders in a shift, disregarding quality. As a result, of the final 300 he enters, only 20 percent (60 orders) are correct.
Because he is being paid based solely on the quantity of the orders, he is delighted with his extra pay and does not really care that most of his last 300 are wrong. Alternatively, if OEOs were paid based on the quality of the data they entered, then that same good OEO may only enter 1,000 orders, of which only two are incorrect.

In terms of rework and missed or incorrect orders, having 998 orders that are completely correct the FIRST time makes more sense than having 960 that are correct and 240 that are incorrect—and that leads to customer dissatisfaction and angry suppliers. It is far less costly in terms of time and money to do the job right the first time than to waste time searching through incorrect orders, correcting them, and placating dissatisfied customers and/or suppliers.

So is it wise to choose quantity over quality? Not if organizations today believe—correctly so—that the future of their organization is only as secure as the quality of their data. However, wise or not, very few organizations have a culture that expects and rewards high data quality. High-level committees often discuss data quality issues without considering the significance of the process by which the data is captured. Until the job of capturing and entering data is rewarded appropriately, data quality will not improve substantially.

Some authors advocate creating data quality circles and quality committees, complete with Corporate Data Stewards and quality teams to continually monitor data quality changes and perform fixes. That’s a good approach. However, the best starting point to creating a data quality culture starts at the source: the front-line employee who handles organizational data first. Pay these employees for the quality of their work, not the quantity. Doing the job right the first time is the best, least costly alternative, and it will ensure that when organizations do get around to forming data quality teams and electing Data Quality Stewards, the data quality culture is already deeply embedded in the organization.

Conclusion

Data is the true heart of a business. Business people don’t make decisions based solely on hunches—at least not all the time. For good decision making, good quality data is a must-have. As businesses seek to expand and to better serve their customers, suppliers, investors, and employees—and as they tackle the issues that impede their ability to do so—the common theme across all those issues is data. Without good, quality data, any efforts aimed at improving the business will be in vain. How can organizations build a solid decision-making environment on a foundation of shoddy data? If data quality is not a priority in the culture of the organization, and, as a result, business people make decisions based on bad data, then instead of expanding into new markets, the organization will stay stuck in a vicious pattern of “garbage in, garbage out,” trying to solve operational issues with bad data.



REFERENCES

Quality Manager: Methodology and Application Guide, Vol. 4.8.2. Westboro, Mass.: Ascential Software, 2000–2001, pp. 1–17.

 

 


BIOGRAPHY

Stacey Herdlein is a Senior Manager with Arthur Andersen, LLP’s Atlanta Business Consulting practice. Ms. Herdlein is the Director of Andersen’s Global Data Management Center of Excellence’s Americas Division. She has more than 12 years of information technology experience and has designed and built data warehouses and business intelligence solutions for clients in the healthcare arena, as well as in numerous other industries. Ms. Herdlein is an accomplished writer and speaker on data warehousing and data quality. Her work has appeared in publications such as “For the Record” and “Healthcare Informatics” magazines, and in the “Journal of Data Warehousing.” Ms. Herdlein has a Master of Science in Health Systems from the Georgia Institute of Technology and is active in several industry organizations, including HIMSS, the Institute for Industrial Engineers, and the Data Warehousing Institute.