Key Technologies Enabling a Seismic Shift in Enterprise Data Management
The authors describe the importance of data lineage and metadata management as key enablers for these initiatives, and discuss how emerging technology is beginning to address this growing need with automated tools.

By Jeff Reagan, Ian Rowlands

Abstract

An externally-driven generational change is reshaping the data management landscape. The traditional approach to enterprise data management created an organization structured around skill sets, such as data modeling and database administration, and activities that were often application-driven.

Data management activities are regrouping, as enterprises recycle enterprise architectures, seek to protect themselves from risks arising from compliance and regulation failures, and strive to rationalize information to improve business agility.

The adoption of these new enterprise architectures to support rapid change—while delivering the necessary governance—is best achieved by creating the necessary supporting infrastructures and by the understanding and rationalization of information assets. This critical understanding and rationalization is being supported and enabled by two key initiatives: data lineage and metadata management.

The authors describe the importance of data lineage and metadata management as key enablers for these initiatives, and discuss how emerging technology is beginning to address this growing need with automated tools that can share metadata in a complementary way across an organization’s information sources.

It is easy to forget that information technology (IT) is a young discipline. There is as yet no fixed body of professional knowledge, structure, or standards. At the same time, the impact of IT on the enterprise can be transformational—in a positive or negative sense. This combination of factors subjects IT to three types of change:

  • Externally-driven generational change (for example, the shift from largely mainframe-centric IT, which dominated from the 1960s through the turn of the century, to Internet-centric IT, which emerged in the 1990s and is now rapidly becoming a dominant model)
  • Internally-driven strategic change (such as an outsourcing or offshore initiative driven by a desire to change the enterprise’s cost model)
  • Externally- and internally-driven tactical change (for example, the replacement of one vendor’s software solution by another’s, or shifts in hardware approaches to achieve short-term cost savings)

An externally-driven generational change is reshaping the data management landscape. The traditional approach to enterprise data management created an organization structured around skill sets, such as data modeling and database administration, and activities that were often application-driven. IT is examining its data management activities as organizations begin to:

  • Recycle enterprise architectures
  • Protect themselves against risks arising from compliance and regulation failures
  • Rationalize information to improve business agility
Recycling Enterprise Architectures

It’s hard to resist the analogy between early enterprise computing and the age of the dinosaurs. Massive applications were developed, each in its own silo, with severe risks emerging when applications overlapped. Changes in applications and infrastructures were slow, and data management was an undeveloped science. Several technology waves have changed the picture radically.

From an infrastructure perspective, starting in the early 1980s, three distinct shifts are clear:

  • The emergence and integration of the personal computer and local area network
  • The evolution of distributed peer-to-peer networks
  • The shift from processor-centric to network-centric models (predominantly, of course, Internet-based)

Similar macro-level shifts can be seen in the data storage and application development spheres. Data storage shifted from monolithic application-owned data stores, to managed databases, to segregation of operational and informational (data mart and data warehouse) stores. In the application space, the shift has been from monolithic applications through client-server applications to object-based technology and finally to service-oriented architectures. This last shift is having a radical impact on data management practices.

There are two conflicting trends. One, probably short term, is that data management professionals are—in some cases—being marginalized because of lack of expertise in services-related technologies. The other trend, more far-reaching, is that a desire to share information across application and even enterprise boundaries is elevating the role of data architecture and forcing:

  • Consolidation of entities (identifying things that appear different as being the same is challenging)
  • Exposure of entity definitions and descriptions to a broad user community
Responding to Compliance and Regulatory Pressures

Having failed to regulate itself, IT is increasingly having regulation forced upon it. Table 1 shows, in no particular order, a random selection of the myriad regulations to which IT organizations must be sensitive. Sarbanes-Oxley is truly the tip of a very threatening iceberg.

Many of the regulations and standards are a response to failures in enterprise governance (consider Sarbanes-Oxley). Others simply encapsulate a desire for better and more uniform ways of doing business (for instance, the establishment of the Basel II Capital framework). Yet others, such as the ASC X12 transaction sets, are intended to facilitate business agility.

A Sample of Regulations Pertaining to IT

  • ASC X12 (350+ transaction sets)
  • Basel II
  • California Identity Theft Protection Law (affects any entity doing business in CA)
  • Export Controls
  • Clinger-Cohen Act of 1996
  • Data Protection Act (UK)
  • Document Retention Management (DoD 5015.2)
  • EU Markets in Financial Instruments Directive (MiFID)
  • European Union Data Protection Directive (EUDPD)
  • FTC Act
  • Gramm-Leach-Bliley (Title V)
  • Health Information Technology Promotion Act of 2006
  • Health Insurance Portability and Accountability Act (HIPAA)
  • HL7
  • International Traffic in Arms Regulations
  • ISO 17799:2005 Code of Practice for Information Security Management
  • Pennsylvania Deceptive Privacy Law (affects any entity doing business in PA)
  • Personal Data Protection Act (Netherlands)
  • PIPEDA (Canada)
  • Privacy Act (Australia)
  • Restriction of Hazardous Substances (RoHS)
  • Sarbanes-Oxley Act of 2002 (Sections 301, 302, 403, 404, 409, 802 ...)
  • Section 508 Rehabilitation Act Amendments of 1998
  • TREAD Act
  • Uniform Electronic Transactions Act (UETA)
  • USA PATRIOT Act

Table 1. IT is facing a growing burden caused by the expanding list of compliance and regulatory issues.

Ensuring the IT portfolio is compliant—particularly where there is a substantial applications legacy—is a huge challenge. The reaction is to establish data governance as a key discipline in data management. Data governance ensures that the organization retains necessary and sufficient data entities and controls such issues as data stewardship, information definition, information access, and information lifecycle management.

Many organizations are supplementing data governance with operational initiatives, such as the implementation of master data management processes—the identification of core business entities and their abstraction into a “system of record.”

The emergent data management disciplines—and sub-disciplines such as data cleansing, data consolidation and reconciliation, and data migration—depend on facilities such as:

  • Enterprise metadata management
  • Data consolidation and relationship discovery
  • Enterprise data registration
Improving Business Agility

Several distinct time pressures are impacting enterprises:

  • The drive to execute individual transactions more quickly
  • The need to make faster (and better) business decisions
  • The demand for faster response to new or changing market opportunities

Responding to these drivers is made easier by successful adoption of adaptable and scalable enterprise architectures and by implementation of the supporting infrastructures for compliance and governance.

Critical Requirements to Support Change

Adopting new enterprise architectures to support rapid change—while delivering the necessary governance—is best achieved by creating the necessary supporting infrastructures and by two key activities:

  • Understanding information assets: understanding what information exists across an organization, where it resides, and how it relates to the business
  • Rationalizing information assets—eliminating duplication and clarifying the relationships between data items

This critical understanding and rationalization can be enabled and supported by two distinct disciplines:

  • Data lineage: the understanding of where a piece of data originated, how it has moved through systems within an organization, and what transformations have changed it.
  • Metadata management: often simply referred to as data about data, metadata is more aptly described as the information that gives context to IT’s support of the business. It answers the “who, what, why, how, where, and when” questions of information assets (a minimal illustration follows this list). It follows that metadata’s scope is broad, the information is dynamic, and traceability is critical.
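
To make the “who, what, why, how, where, and when” idea concrete, the following minimal sketch (in Python) shows what a single metadata entry might record. The structure, field names, and example values are hypothetical illustrations, not a prescribed schema.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class MetadataRecord:
        """A minimal, hypothetical metadata entry capturing the
        who/what/why/how/where/when context for one information asset."""
        asset_name: str           # what: the data element being described
        business_definition: str  # why: the business meaning it carries
        owner: str                # who: the accountable data steward
        source_system: str        # where: the system of record
        last_refreshed: str       # when: currency of the asset
        derivation: str           # how: the transformation that produced it
        upstream_assets: List[str] = field(default_factory=list)  # lineage hooks

    # Example: a revenue figure appearing on a management report
    record = MetadataRecord(
        asset_name="net_revenue_q3",
        business_definition="Quarterly revenue net of returns and discounts",
        owner="Finance data steward",
        source_system="ORDERS_DW",
        last_refreshed="2006-10-01",
        derivation="SUM(order_amount) - SUM(return_amount)",
        upstream_assets=["ORDERS.order_amount", "RETURNS.return_amount"],
    )

The upstream_assets list is what ties such a record into data lineage, the other discipline described above.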
Understanding Information Assets

Understanding information assets is critical to the efficient growth of an organization and can be a complex undertaking, but is necessary to provide these benefits:

  • Reduction of development time for new applications
  • Increased speed of implementation for complex integrations
  • Leveraged information infrastructure across all initiatives
  • Improved information quality and decision making
  • Improved organizational productivity and efficiency
  • Improved agility to respond to business needs (reduced time to market)
  • Extended value of existing information assets across the enterprise
  • Controlled IT costs and increased ROI

For organizations of all sizes, the reduction of development time for new applications is always looked upon favorably, but perhaps the greatest benefit to understanding information assets is the leverage it gives to IT organizations needing to understand the “big picture.” Such knowledge provides the basis for controlling IT costs and improving the organization’s productivity and efficiency.

Rationalizing Information Assets

Three issues have contributed significantly to the complexity, confusion, and duplication most organizations face when working with their data:

  • The evolution of applications, most created for individual business unit use, has created silos of overlapping and duplicate information, with no clarity about the relationships between data sources.
  • Merger and acquisition activities have produced multiple instances of the same data inside organizations.
  • The development of multiple instances of the same functionality to support different business units has created overlap and duplication between data entities and the relationships between them.

The practice of rationalizing information assets, or eliminating duplication and clarifying the relationships between data items, attempts to eliminate these issues. This rationalization process enables organizations to reap several benefits:

  • Identify and reduce data redundancy: hidden data redundancies have hard costs that roll straight to the bottom line. Removing these redundancies produces more consistent data throughout the organization.
  • Support standards-based development: this has a direct and positive effect on an organization’s ability to create applications quickly, thereby providing better time-to-value. In addition, a shift to development that conforms to service-oriented architecture, based on a standard Web services stack, is proving a key contributor to business agility. However, it is only effective when information assets are clearly identified, redundancies eliminated, and the relationships between assets clarified.
  • Accelerate development of information asset reuse: this provides efficiencies as well as cost and time savings across the organization.
Data Lineage as a Key Enabler

Data lineage analysis (answering the questions, “Where did this piece of information come from and how has it been calculated, derived, and transformed?”) is now recognized as a critical discipline in IT management. At face value, it is “merely” a technical issue. IT should understand data lineage to optimize and maintain systems and processes, but data lineage is also a critical business issue. Consider senior executives’ fiduciary responsibility for the quality of information mandated under the Sarbanes-Oxley legislation, or consider the “transparency” requirements of the Basel II accord which demand the ability to audit systems that contribute to a specific risk-weighted capital amount. The absence of data lineage documentation puts the enterprise at risk of serious compliance failure.

Understanding data lineage is not easy. A single item on a report may consolidate information from many data elements. To further add complexity, each of these sources may have entered the process at different stages, been through different manipulations and transformations while moving from one database to another, and, in a data warehousing environment, been through sophisticated ETL and business intelligence toolsets (which were not meant to interact with each other).

There has been much good academic work on analyzing data lineage,1 but analysis first requires discovery. The discovery depends on metadata—the information that describes what data is available at what stage in the transformation process and how it got there. Identifying data relationships is a key component of data lineage analysis, beginning at the system of record for each element of interest, and traversing the data’s lifecycle as it is transformed and moves up the data-value food chain. Tools that enable organizations to view data lineage graphically across the enterprise allow analysts and designers to better choose where to source data for an application. These tools also promote the notion of “reusable information assets,” improve and simplify transparency, and underpin many other data management initiatives.
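
As a minimal sketch of such a traversal, the Python fragment below walks a small, invented lineage graph from a report item back toward its systems of record, reporting the transformation applied at each hop. The element names, edges, and transformations are assumptions made up for illustration; a real tool would populate the graph from harvested metadata.

    from collections import deque

    # Hypothetical lineage edges: each target element maps to the source
    # elements and the transformation that produced it.
    LINEAGE = {
        "report.total_exposure": [("dw.risk_weighted_assets", "SUM by counterparty")],
        "dw.risk_weighted_assets": [("staging.exposure", "apply Basel II risk weights")],
        "staging.exposure": [("crm.contract_value", "currency conversion"),
                             ("erp.outstanding_balance", "currency conversion")],
    }

    def trace_lineage(element):
        """Breadth-first walk from a report element back to its sources,
        yielding (target, source, transformation) hops."""
        queue, seen = deque([element]), set()
        while queue:
            current = queue.popleft()
            for source, transform in LINEAGE.get(current, []):
                yield current, source, transform
                if source not in seen:
                    seen.add(source)
                    queue.append(source)

    for target, source, transform in trace_lineage("report.total_exposure"):
        print(f"{target} <- {source} ({transform})")

The walk terminates at elements with no recorded upstream edges, which is exactly the system-of-record boundary an auditor would want to see documented.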

Metadata Management as a Key Enabler

The issues we have discussed thus far are, essentially, issues of maturity. IT capabilities have often run ahead of the tools available to IT managers. One critical tool, as identified previously, is enterprise metadata management.

IT is subject to generational change. So, too, is metadata management. Metadata has always existed, at least as the information that allowed programs to handle data and end users to interpret reports. At first, however, there was no sharing of metadata, no separation of metadata from the assets it described, and no management. The first formal management emerged as facilities evolved to manage individual technologies—databases such as IMS and IDMS had their own catalogs, for instance.

The next generation recognized the value of sharing entity definitions across technologies, giving birth to products such as MSP’s Data Manager. A separate technology branch began to manage documentation. The next step, in the 1980s and 1990s, saw the emergence of general-purpose technologies such as the DataDictionary/Solution from Brownstone Solutions, DB Excel from Reltec Products, and Rochade from Rottger & Osterberg.

In the 1990s, the metadata management discipline changed little—a reflection of the maturity of the (largely) mainframe-based enterprise computing environment. All that changed as the year 2000 panic and the generational IT changes of the new century kicked in. These changes stimulated the emergence of a new generation of metadata management technologies, characterized by a new breadth of metadata managed, a new breadth of applications exploiting metadata, and a new level of sophistication of facilities.

Figure 1 depicts the impact of the current IT pressures on the metadata management environment.


Figure 1. IT pressures trigger a metadata surge.

As enterprises seek to become more agile and to cope with regulatory pressures, metadata usage is shifting from being the prerogative of the metadata specialist to being a requirement for business analysts and end users to work successfully. At the same time, the areas of interest have expanded from the essentially structural (syntactic metadata) to include the meaning of information assets (semantic metadata). As the scope of metadata management has expanded, so have the applications—metadata management is now being applied to challenges such as:

  • Application discovery and understanding
  • Application portfolio management
  • Service-oriented architecture management
  • Enterprise application (ERP, CRM, etc.) management
  • Data dictionary
  • Data management and integration
  • Data warehouse metadata management
  • Master data management
  • Enterprise architecture

The new generation of metadata management technologies will recognize the distinction between enterprisewide and usage-specific requirements, and will accommodate both in a “hub-and-spoke” architecture.
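
As a rough illustration of that hub-and-spoke split, the sketch below keeps canonical, enterprisewide definitions in a hub and lets usage-specific “spokes” register their local names against them. The class and method names are hypothetical and greatly simplified relative to any real repository product.

    class MetadataHub:
        """Hypothetical hub holding enterprisewide (canonical) definitions."""
        def __init__(self):
            self.canonical = {}          # name -> canonical business definition

        def publish(self, name, definition):
            self.canonical[name] = definition

    class MetadataSpoke:
        """Hypothetical spoke holding usage-specific metadata (for example,
        a data warehouse team's names) reconciled against the hub."""
        def __init__(self, hub, domain):
            self.hub, self.domain = hub, domain
            self.local = {}              # local name -> (canonical name, notes)

        def register(self, local_name, canonical_name, notes=""):
            if canonical_name not in self.hub.canonical:
                raise KeyError(f"{canonical_name} not published in the hub")
            self.local[local_name] = (canonical_name, notes)

    hub = MetadataHub()
    hub.publish("Customer", "A party with at least one executed contract")
    dw_spoke = MetadataSpoke(hub, domain="data warehouse")
    dw_spoke.register("DIM_CUSTOMER", "Customer", notes="conformed dimension")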

The third and final characteristic of the latest generation of metadata management tools is a deeper sophistication in capabilities. Required capabilities include fully extensible and customizable metamodeling, configuration and version management, the ability to import any required metadata, role-based user interfaces, sophisticated structured and unstructured search capabilities, and the automated intelligent discovery of relationships between entities.

That last capability deserves a deeper look. Business and technical activities have driven a proliferation of entities in enterprises. Often an acquisition leads to duplication of applications (and, hence, duplication of entities) with different names and formats, stored in different places. Similar challenges emerge where legacy applications communicate with one another by transferring data.

Until recently, repository tools have not been able to automatically determine relationships between disparate data sources within the enterprise. The ability to import such metadata or to determine it heuristically is a critical prerequisite to better understanding the growing complexity of data lineages and identifying and inventorying the proper data sources for new applications.


Figure 2. Data lineage and metadata management tools provide the "who, what, why, when, where, and how" for an organization’s data.

Automated Heuristic Matching for Metadata

To support the efforts of a larger metadata management initiative, an organization must first complete the process of deriving and collecting relationships across the metadata. This matching or “mapping” process can affect the overall time and cost of the initiative. It is also a traditionally manual process, which is costly as well as:

  • Time consuming: a typical analyst can map only one file at a time at the sluggish pace of 1.5 to 2 relationships per hour.
  • Resource-intensive: in many cases, subject-matter experts are considered part-time resources that will complete the mapping exercises after they do their “real” jobs; in some organizations, this process is outsourced completely to consultants who have no prior knowledge of the source systems.
  • Error prone: this is the case with any manual process that deals with poorly documented systems and is executed by an ever-changing team.
Column Name     Data Type
Full_name       Varchar(30)
Contact         Varchar(25)
Person          Char(30)
Owner           String
Lien_Holder     Varchar(40)

Table 2. Schema metadata: object names and data types.

Fortunately, new tools can automate and accelerate much of this necessary function. There are two predominant types of heuristic matching: metadata and data content. Metadata heuristic matching is based on the naming conventions used to name storage objects such as columns in relational tables, flat files, or Excel spreadsheets. Metadata heuristics also strongly consider the data type associated with each storage object when determining if a relationship exists between two objects. Content matching is focused on the underlying data associated with a given storage object.

Both types of heuristic matching use algorithms that should yield weighted results, depending on the strength of the match. In the case of relationship matching based on metadata, exact-name matches and enhanced-name matches should produce different scores, with exact-name matching providing a basis for a stronger relationship. The data type of each storage element should also be considered in the weighted strength of a metadata-based match.
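
A minimal sketch of such weighting follows, assuming invented weights, a toy synonym list, and a crude notion of “enhanced” name matching (case, underscore, and camel-case normalization plus abbreviation expansion); commercial tools apply far richer heuristics.

    import re

    SYNONYMS = {"cust": "customer", "nbr": "number", "amt": "amount"}  # illustrative

    def normalize(name):
        """Lowercase, split on underscores/camel case, expand a few abbreviations."""
        lowered = re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
        parts = re.findall(r"[a-z]+|\d+", lowered)
        return [SYNONYMS.get(p, p) for p in parts]

    def type_compatible(t1, t2):
        """Crude compatibility check on declared data types."""
        families = {"varchar": "text", "char": "text", "string": "text",
                    "int": "number", "decimal": "number"}
        f1 = families.get(re.sub(r"\(.*\)", "", t1).lower(), t1.lower())
        f2 = families.get(re.sub(r"\(.*\)", "", t2).lower(), t2.lower())
        return f1 == f2

    def metadata_match_score(col1, type1, col2, type2):
        """Weighted score: exact name > enhanced name, boosted by type compatibility."""
        if col1.lower() == col2.lower():
            score = 0.8                    # exact-name match: strong evidence
        elif set(normalize(col1)) & set(normalize(col2)):
            score = 0.5                    # enhanced-name match: weaker evidence
        else:
            score = 0.0
        if score and type_compatible(type1, type2):
            score += 0.2                   # compatible data types strengthen the match
        return score

    # Enhanced-name match with compatible types -> 0.7
    print(metadata_match_score("Cust_Nbr", "Varchar(10)", "CustomerNumber", "Char(10)"))

In practice the weights would be tuned, and the synonym list replaced by an organization-specific naming-standard dictionary.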

When using content matching, the object naming conventions and data type metadata may sometimes be irrelevant. Instead, it is the correlation of the underlying data that determines the weighted relationship match results. For example, there is no direct name match and data-type linkage between the storage elements listed in Table 2.

However, if a sample of the underlying data for each column shows a significant overlap (e.g., “John L. Smith”), then there is a high probability that relationships exist among these storage elements.
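
Continuing the Table 2 example, a simple way to score such overlap is a Jaccard coefficient over sampled values, as sketched below; the sample values are invented, and a real tool would draw much larger probe sets and normalize values before comparing.

    def jaccard(sample_a, sample_b):
        """Overlap of two value samples: |intersection| / |union|."""
        a, b = set(sample_a), set(sample_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Invented probe samples for two of the columns in Table 2
    full_name = {"John L. Smith", "Mary Jones", "Ravi Patel"}
    lien_holder = {"John L. Smith", "First National Bank", "Mary Jones"}

    score = jaccard(full_name, lien_holder)
    print(f"content match score: {score:.2f}")   # 0.50 -> columns are likely related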

A combination of metadata and data-content heuristic matching yields results that can also show referential primary-foreign-key relationships that roll relationships up from the column level to the table level for relational data sources. Such analysis is useful to show broader relationships across tables within an application, across systems, and between disparate data sources across the enterprise and even across enterprise boundaries. This “comprehensive analysis,” applied across many disparate data sources, is useful in mergers and corporate acquisitions, where a quick, high-level understanding of new data sources is necessary.
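
One plausible way to blend the two kinds of evidence and roll column-level matches up to the table level is sketched below. The blending weights, the averaging rule, and the containment test used to suggest primary-foreign-key relationships are illustrative assumptions rather than a description of any particular product.

    from collections import defaultdict

    def combined_score(meta_score, content_score, w_meta=0.4, w_content=0.6):
        """Blend metadata and content evidence (weights are illustrative)."""
        return w_meta * meta_score + w_content * content_score

    def rollup_to_tables(column_matches):
        """Aggregate column-level scores into table-level relationship strength.
        column_matches: list of ((table_a, col_a), (table_b, col_b), score)."""
        table_scores = defaultdict(list)
        for (ta, _), (tb, _), score in column_matches:
            table_scores[(ta, tb)].append(score)
        return {pair: sum(s) / len(s) for pair, s in table_scores.items()}

    def looks_like_foreign_key(parent_sample, child_sample):
        """Containment test: child values drawn (almost) entirely from parent
        values suggest a primary-key/foreign-key relationship."""
        child = set(child_sample)
        return len(child & set(parent_sample)) / len(child) >= 0.95 if child else False

    matches = [
        (("CUSTOMER", "Full_name"), ("LOAN", "Lien_Holder"), 0.55),
        (("CUSTOMER", "Cust_Id"),   ("LOAN", "Customer_Id"), 0.90),
    ]
    print(rollup_to_tables(matches))   # {('CUSTOMER', 'LOAN'): 0.725}

    parent_ids = ["C001", "C002", "C003"]
    child_ids = ["C001", "C003", "C003"]
    print(looks_like_foreign_key(parent_ids, child_ids))   # True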

For traditional end-user reporting, data warehouse, or other analytic applications, a “hub-and-spoke” analysis is typically performed. In this data architecture, a target is typically defined beforehand by user requirements, and automated metadata discovery tools are employed to locate the best data sources to supply users with the information and knowledge they need. Once again, a holistic view of enterprisewide relationships is needed to quickly facilitate prototyping and development. Emphasis is often placed on a federated architecture, which can exploit this metadata to deliver quicker turnaround (less development) with a smaller footprint (fewer storage requirements).

In selecting a data rationalization technology, weigh these factors:

  • Does the toolset provide customizable heuristics? For example, does name matching allow for variable-name pattern matching algorithms to be applied? Selectively including metadata in the underlying matching heuristics provides a powerful basis for analysis and should be part of the toolset.
  • Can regular expressions be used in pattern matching so the underlying data is normalized for comparison? This is important when comparing phone numbers, for example: one data storage element may contain hyphens; another may not (see the sketch after this list).
  • Are user-definable “probe sets” provided when taking samples of data? Sampling too little data often provides no value; sampling too much data is likely overkill and may result in longer (and unnecessary) access time on operational systems.
  • Does the technology under consideration provide an efficient graphical user interface, with the capability to share its metadata with other tools in the data management space (such as Erwin, Powerdesigner, and Rochade), and with productivity software used for design documents (e.g., Microsoft Office)?
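
As mentioned in the list above, a brief sketch of the normalization and probe-set ideas follows; the regular expression, probe size, and sample data are illustrative assumptions.

    import re
    import random

    def normalize_phone(value):
        """Strip everything except digits so '212-555-0101' and '(212) 555 0101'
        compare as equal."""
        return re.sub(r"\D", "", value)

    def probe_sample(values, probe_size=1000, seed=42):
        """Take a bounded, reproducible sample of column values instead of
        scanning the whole source; probe_size stands in for a user-definable
        probe set."""
        random.seed(seed)
        return random.sample(values, min(probe_size, len(values)))

    phones_a = ["212-555-0101", "(914) 555 0199"]
    phones_b = ["2125550101", "9145550199"]
    sample_a = {normalize_phone(v) for v in probe_sample(phones_a, probe_size=2)}
    sample_b = {normalize_phone(v) for v in probe_sample(phones_b, probe_size=2)}
    print(sample_a == sample_b)   # True once punctuation and spaces are removed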

Data harmonization and rationalization technology must be capable of working with many sources and targets simultaneously to maximize process efficiency. Other desirable features in an automated mapping tool include:

  • An intuitive user interface to allow subject-matter experts to review and validate relationships and perform exception-based mapping
  • The ability to find relationships based on a sample of the data, as opposed to requiring access to the entire data source; in many cases it is cost prohibitive to copy the entire data source for analysis
  • A framework that allows for concurrent processing against multiple data sets rather than one that analyzes only one or two sources at a time; for large organizations, the value of this cannot be overstated
  • The ability to interface easily with existing data management tools (e.g., ETL, design, and modeling)

By automating and accelerating the mapping process, time-to-value improves while the need for scarce or costly human resources is held to a minimum. At the same time, automated mapping tools can reduce errors, simplify mapping maintenance, and provide an easy way to maintain documentation about mappings.

Summary

Data lineage and metadata management initiatives are emerging as organizational “must-haves” for enabling the critical understanding and rationalization of organizations’ information—with metadata management providing the “who, what, why, when, where, and how” answers and data lineage providing the source for “who” and “where.” Technologies for accelerating and automating much of these processes are emerging quickly. Although the power of metadata is well understood, its value diminishes rapidly if it becomes stale. Any technology in consideration must be able to schedule refreshes, spot changes, and alert the users of such changes.

The same holds true for technology used to construct data lineage—it should enable and empower users to access lineage information on demand. The most useful technologies will facilitate an understanding of the lifecycle of the data—from entry into the organization through the many systems—determining along the way where it has been and how it has transformed as it moves up the data-value food chain. Tools that allow organizations to view data lineage graphically across the enterprise allow analysts and designers to better source data for an application. These tools also promote the notion of “reusable information assets,” improve and simplify transparency, and underpin many other data management initiatives.

Increasingly, dynamic business environments and survival in the current regulatory climate place a heavy burden on corporations and their data management organizations to become more agile than ever. The sea of corporate data must be understood better than ever before. Vendors in the data mapping, metadata, and repository tools marketplace have responded with offerings that are beginning to address this growing need with automated tools that can share metadata in a complementary way across platforms.

There will likely be more, not fewer, compliance and regulatory pressures facing the CIO of tomorrow. The seismic shifts occurring in enterprise data management are being heard and felt throughout all enterprises, large and small alike. Careful planning for a flexible and adaptive metadata infrastructure softens this effect and enables survival in the inevitable tsunami of change.

1 See, for instance, Y. Cui and J. Widom, “Lineage Tracing for General Data Warehouse Transformations,” The VLDB Journal 12 (2003): 41–58.