Master Data Management, Synchronization, and Coherence

by David Loshin
Master data management (MDM) is a popular topic among enterprise information management professionals. But it isn’t a technology or a shrink-wrapped product.

MDM comprises the business applications, methods, and tools that implement the policies, procedures, and infrastructure to support the capture, integration, and subsequent shared use of accurate, timely, consistent, and complete master data.

An MDM program is intended to:

  • Assess the use of core information objects, data value domains, and business rules in the range of applications across the enterprise.
  • Identify core information objects that are used in different application data sets and that would benefit from centralization.
  • Create a standardized model to manage those key information objects in a shared repository.
  • Manage collected and discovered metadata as an accessible, browsable resource, and use it to facilitate consolidation.
  • Collect and harmonize unique instances to populate the shared repository.
  • Integrate the harmonized view of data object instances with existing and newly developed business applications via a service-oriented approach.
  • Institute the proper data governance policies and procedures at the corporate or organizational level to continuously maintain the master data repository.

Numerous technologies have been expected to address parts of this problem, including customer master tables or industry-specific consolidated product management systems. But what distinguishes MDM from earlier attempts is not necessarily improvements in technology, but rather the tighter integration of information management and business information governance. Coupling business stewardship and oversight with more precise data accessibility, availability, and consistency requirements creates an operational environment that is more conducive to effective information management practices.

Conceptually, a master data management program is intended to provide a “single source of truth” for enterprise applications that is both consistent and synchronized. Redundant data instances that exist in isolation across different application data stores are identified, and a process of extraction, consolidation, and aggregation resolves multiple instances into a unique version that represents the managed data object (person, location, product, and so on). This is then managed (either physically or virtually) as the “master record.” In some implementations, a single master repository holds the records, while in others, replicated copies of the master records are published out to each participating application. All applications that use the master record coordinate through the MDM environment to gain access to the consistent, synchronized view.
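
To make that consolidation step concrete, here is a rough sketch of how duplicate customer records from two application stores might be resolved into one master record. The match key (a normalized e-mail address), the survivorship rule (keep the freshest non-empty value for each attribute), and the function name consolidate are illustrative assumptions, not features of any particular MDM product.

  # Minimal sketch: resolving redundant customer records from two application
  # stores into a single master record. The match key and survivorship rule
  # below are illustrative choices only.
  from collections import defaultdict

  def consolidate(records):
      """Group records by a match key and merge each group into one master record."""
      groups = defaultdict(list)
      for rec in records:
          groups[rec["email"].strip().lower()].append(rec)

      masters = {}
      for key, dupes in groups.items():
          master = {}
          # Survivorship rule: for each attribute, keep the value from the
          # record with the latest 'updated' timestamp that actually has it.
          for rec in sorted(dupes, key=lambda r: r["updated"]):
              for field, value in rec.items():
                  if value not in (None, ""):
                      master[field] = value
          masters[key] = master
      return masters

  crm = {"email": "Pat@Example.com", "name": "Pat Smith", "phone": "", "updated": "2006-01-10"}
  billing = {"email": "pat@example.com", "name": "P. Smith", "phone": "555-0101", "updated": "2006-03-02"}
  print(consolidate([crm, billing]))
  # -> one master record combining the freshest non-empty values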

While this concept is appealing, what do consistent and synchronized really mean in the context of a master repository? At a high level, “consistent” implies that different copies of any record have the same information; “synchronized” implies that the result of transactions applied to any copy of a record will be observable at the same time in any other copy. But there are a few challenges in pinning these notions down further, since business requirements may imply different levels of consistency and synchronization.

Consider two different extremes: absolute synchronization versus loosely coupled consistency. Let’s use customer data integration (CDI) as an example. In the first case, every action that modifies a customer record must be tightly coordinated with a central repository. In the second case, actions that modify customer records can take place at any time without coordination.

In case one, we derive all the information-oriented benefits of MDM: a single source of truth, complete synchronization, the 360-degree customer view. In fact, any analytical functions can take advantage of the up-to-date view and integrate precise predictive analytics in real time. However, tight synchronization introduces transaction semantics into a distributed application architecture where such requirements did not previously exist. This in turn may introduce a performance penalty associated with record-locking and multi-phase commits.

In case two, the benefits of MDM are somewhat limited to offline processing. Because of the loose synchronization, there is no performance impact, since the aggregation and resolution do not need to be done each time a customer record is modified. However, there is a functional penalty, as real-time actions cannot necessarily be incorporated into operational applications without the risk of depending on out-of-date data.
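
The contrast between the two extremes can be sketched in a few lines of Python. The function names and the in-memory stand-ins for the master repository and the change queue are hypothetical; the point is only that the tightly coupled path pays its coordination cost on every update, while the loosely coupled path defers that cost to a batch consolidation and lets the master lag in the meantime.

  # Illustrative contrast between the two extremes, not a vendor API: in the
  # "absolute" path every modification is applied to the master record inside
  # one coordinated call; in the "loose" path modifications are only queued
  # locally and folded into the master during a later batch run.
  import queue

  master = {}                    # shared master repository (in-memory stand-in)
  pending = queue.Queue()        # loosely coupled changes awaiting consolidation

  def update_absolute(customer_id, changes):
      """Tightly coordinated: the master is current as soon as the call returns."""
      master.setdefault(customer_id, {}).update(changes)

  def update_loose(customer_id, changes):
      """Loosely coupled: cheap for the application, but the master lags."""
      pending.put((customer_id, changes))

  def consolidate_batch():
      """Periodic job that brings the master back in line with queued changes."""
      while not pending.empty():
          customer_id, changes = pending.get()
          master.setdefault(customer_id, {}).update(changes)

  update_absolute("C1", {"phone": "555-0101"})   # visible to everyone immediately
  update_loose("C1", {"phone": "555-0202"})      # invisible until the batch runs
  print(master["C1"])                            # {'phone': '555-0101'}
  consolidate_batch()
  print(master["C1"])                            # {'phone': '555-0202'}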

Each extreme has benefits and drawbacks. It all comes down to the concept of coherence—that is, the solution’s ability to allow local copies of a single, shared record while notifying each copy when an update has been made to the master copy. Essentially, what we are looking for is a way to correlate a business’s synchronization needs with the MDM solution’s level of coherence.

Anyone who has worked with multiprocessor systems with shared memory may recognize a similarity between cache coherence algorithms and the challenge of maintaining MDM coherence in a way that does not impact performance. In a multiprocessor environment, different processors may request data from main (that is, shared) memory. Since accessing main memory incurs a high degree of latency, architects insert a faster memory called a cache between the processor and main memory. Each time data is loaded from memory, a copy is placed in this faster cache memory, reducing the latency for subsequent accesses of the same data. The problem occurs when two different processors have loaded the same chunk of data, and one of the processors modifies that data value. Suddenly, one of the cached copies is different from the other, which introduces an inconsistency. To address this, one common protocol writes the main memory copy every time a cache copy is written by one processor (a “write-through”) and invalidates all the copies held in other processors’ caches. This means that the next time one of those processors accesses that data value, it cannot use its cached copy and must reload the data from main memory.
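
A toy version of that write-through-with-invalidation behavior, written in Python rather than hardware, might look like the following. The class names are invented for illustration; the behavior to note is that a write updates main memory immediately and invalidates the same address in every other cache, so the next read there misses and reloads the current value.

  # A toy write-through protocol in the spirit of the description above: each
  # processor keeps a local cache; a write updates main memory immediately and
  # invalidates the same address in every other cache, forcing a reload there.
  class Memory:
      def __init__(self):
          self.data = {}
          self.caches = []

  class Cache:
      def __init__(self, memory):
          self.memory = memory
          self.lines = {}
          memory.caches.append(self)

      def read(self, addr):
          if addr not in self.lines:                 # miss: reload from main memory
              self.lines[addr] = self.memory.data.get(addr)
          return self.lines[addr]

      def write(self, addr, value):
          self.lines[addr] = value
          self.memory.data[addr] = value             # write-through to main memory
          for other in self.memory.caches:
              if other is not self:
                  other.lines.pop(addr, None)        # invalidate stale copies

  mem = Memory()
  cpu0, cpu1 = Cache(mem), Cache(mem)
  mem.data["x"] = 1
  print(cpu0.read("x"), cpu1.read("x"))   # 1 1 -- both caches hold a copy
  cpu0.write("x", 2)                      # write-through + invalidate cpu1's copy
  print(cpu1.read("x"))                   # 2 -- cpu1 misses and reloads the new value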

For absolute synchronization, this maps directly to the MDM environment, where each application can read a copy of a customer record (making the read copy the “cached” copy). When an application modifies a customer record, the master copy must be updated (i.e., written through). At this point, the applications that have copies of that customer record must be notified that their copies are invalid, and that the next time any application wants to access that specific (local copy of the) record, it may actually need to access the master copy.

This kind of mechanism introduces two modifications to the service components supporting the MDM solution. The first is centralized directory management, which tracks which applications have copies of which customer records. The second is a hook embedded in that directory framework so that every access to a customer record (whether or not it has already been accessed) checks the copy’s validity before allowing its use.
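
A minimal sketch of those two additions, with hypothetical names, might look like this: a hub keeps a directory of which applications hold copies of which customer records, a write-through updates the master and marks the other holders’ copies invalid, and a read hook refreshes an invalidated copy from the master before allowing its use.

  # Sketch of the two additions described above, under assumed names: a
  # directory of which applications hold copies of which customer records,
  # and a validity hook consulted on every access to a local copy.
  class MasterHub:
      def __init__(self):
          self.master = {}                  # customer_id -> master record
          self.directory = {}               # customer_id -> set of application names
          self.invalid = {}                 # application -> set of invalidated ids

      def read(self, app, customer_id):
          self.directory.setdefault(customer_id, set()).add(app)
          return dict(self.master[customer_id])

      def write_through(self, app, customer_id, changes):
          self.master.setdefault(customer_id, {}).update(changes)
          for holder in self.directory.get(customer_id, set()):
              if holder != app:
                  self.invalid.setdefault(holder, set()).add(customer_id)

      def access(self, app, customer_id, local_copy):
          """Validity hook: use the local copy only if it has not been invalidated."""
          if customer_id in self.invalid.get(app, set()):
              self.invalid[app].discard(customer_id)
              return self.read(app, customer_id)     # stale: refetch the master copy
          return local_copy

  hub = MasterHub()
  hub.master["C1"] = {"name": "Pat Smith"}
  crm_copy = hub.read("crm", "C1")
  hub.write_through("billing", "C1", {"name": "Patricia Smith"})
  print(hub.access("crm", "C1", crm_copy))   # refreshed: {'name': 'Patricia Smith'}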

So we have a potential solution to the absolute synchronization challenge, but the real question becomes: when do applications require absolute synchronization with the master repository? Recent conversations I have had with vendors that provide CDI and MDM solutions suggest that different kinds of MDM deployments have different underlying business drivers, and these drivers affect the need for absolute synchronization. For example, a personalization profile for your customer base can be created even if the data is not completely up to date. On the other hand, authenticating the access rights of any particular individual may require fully synchronized data.

In fact, the degree of synchronization becomes one aspect of application integration that must be considered when deploying an MDM program. But by analyzing each application’s coherence needs and adjusting the MDM infrastructure to best accommodate them, you may both reduce performance bottlenecks and enforce synchronization where it is needed.
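
One simple way to express that per-application adjustment is as a declared coherence policy that the data access layer consults on each read. The policy names and the routing function below are assumptions for illustration, not part of any specific MDM offering.

  # Hypothetical per-application coherence policy: each consuming application
  # declares how fresh its view must be, and the access layer routes the read
  # either straight to the master or to a cheaper (possibly lagging) replica.
  COHERENCE_POLICY = {
      "access_authentication": "absolute",   # must see fully synchronized data
      "personalization": "loose",            # tolerates a slightly stale profile
  }

  def get_customer(app, customer_id, master_store, local_replica):
      """Route a read according to the application's declared coherence needs."""
      if COHERENCE_POLICY.get(app, "absolute") == "absolute":
          return master_store[customer_id]       # always the synchronized master copy
      return local_replica.get(customer_id) or master_store[customer_id]

  master_store = {"C1": {"segment": "gold", "verified": True}}
  local_replica = {"C1": {"segment": "silver", "verified": True}}   # lagging copy
  print(get_customer("personalization", "C1", master_store, local_replica))       # served from the replica
  print(get_customer("access_authentication", "C1", master_store, local_replica)) # served from the master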


David Loshin - David is the President of Knowledge Integrity, Inc., a consulting and development company focusing on customized information management solutions, including information quality solutions consulting, information quality training, and business rules solutions. Loshin is the author of The Practitioner's Guide to Data Quality Improvement, Master Data Management, Enterprise Knowledge Management: The Data Quality Approach, and Business Intelligence: The Savvy Manager's Guide. He is a frequent speaker on maximizing the value of information. David can be reached at loshin@knowledge-integrity.com or at (301) 754-6350.

Editor's Note: More articles and resources are available in David's BeyeNETWORK Expert Channel. Be sure to visit today!