Business Intelligence Best Practices - BI-BestPractices.com

Collaboration. Communication. Community.

 
 
 Printer-friendly
 E-mail to friend
  Comments
ADVERTISEMENT
Extending Enterprise BI Capabilities: New Patterns for Data Integration

by Anupam Manglik, Vivek Mehra
This article discusses the benefits and constraints of new data integration patterns that can augment traditional physical methodologies to provide a complete BI view for the enterprise.

An enterprise’s BI needs evolve as business strategies adapt to new business environments. Mergers and acquisitions, regulatory requirements, and the introduction of new channels can drive changes in BI requirements. To arm decision makers with a complete view of business, data from existing sources needs to be integrated with new data sources that may arise from such changes.

To ensure an ongoing alignment of BI strategies with business needs, data integration strategies also need careful and continuous assessment. Physical data integration has traditionally been the primary mechanism for creating an integrated BI view using data warehouses and data marts. With the advent of EII (enterprise information integration) tools, new virtual data integration patterns have now become viable.

This article discusses the benefits and constraints of new data integration patterns that can augment traditional physical methodologies to provide a complete BI view for the enterprise.

Introduction

Organizations face a constantly changing business environment; they must deal with regulatory requirements, mergers and acquisitions, and the introduction of new channels for the business. This requires enterprise business intelligence (BI) capabilities to evolve in lockstep with the new business strategies. To ensure an ongoing alignment of BI strategies with business needs, data integration strategies also need careful and continuous assessment.

The Information Supply Chain (ISC)

To arm decision makers with a complete view of business, data from existing sources needs to be integrated with new data sources. It is critical to first understand an enterprise’s existing data flow.

Enterprise data can be broadly divided into two categories: (a) data supporting operational systems (such as accounting, human resources, general ledger, and so on), and (b) data that supports BI functions (such as executive dashboards, reporting, and analytics).

Data supporting BI activities typically flows through four logical steps across the enterprise. These steps form the information supply chain:

  • Acquiring data from systems of record
  • Consolidating and harmonizing the data within a given domain
  • Integrating data across domains, performing calculations, and deriving information
  • Disseminating data to end users via BI applications

Each of these steps may include activities such as data cleansing; extraction, transformation, and loading (ETL); validation against metadata; and data quality feedback loops that ensure data trustworthiness and minimal data degradation.

Data Integration Patterns

Integration and dissemination functions have traditionally used physical stores (data warehouses and marts) as the primary mechanism for creating an integrated BI view. However, with the advent of EII tools (enterprise information integration), new virtual data integration patterns have now become viable. We will explore these data integration patterns and discuss their benefits and constraints.

Three primary data integration patterns can be applied to integrating disparate sources to provide an integrated BI view.

1. The Physical Pattern
This pattern (see Figure 1) uses a data warehouse to integrate and persist data from various sources, and employs data marts to disseminate data. These data marts are created and optimized for specific business needs. BI applications such as dashboards, reports, and analytics use data marts to present the data to business users.

alt

The physical pattern has traditionally been the preferred approach for data integration. One of the significant drivers for adopting this approach is its superior query performance. The integrated data in data marts is stored in a manner that optimizes query performance for a particular BI need.

This approach is also popular because, by definition, the physical data integration pattern provides for the business need of capturing historic data and capturing summary information. Multi-year trending analysis requires historical information, but transactional systems typically capture limited historical data due to performance reasons. In such cases, a data warehouse is used to capture historical data. Data marts house summary information required by a specific set of users. This avoids repeated summarization of data for each user query for summary data, and enhances query performance. Data marts are often also used to store and disseminate atomic level data for specific user groups.

Implementation of the physical data integration pattern is a multi-stage process and can be complex, lengthy, and costly. Organizational constraints, such as transfer of ownership of physical data to a different group, may also impede the adoption of physical integration.

2. The Virtual Pattern
This pattern (see Figure 2) integrates data from disparate sources virtually and uses virtual data marts to disseminate data. This pattern therefore combines the data integration and dissemination functions in the traditional BI information supply chain. Virtual integration eliminates physical movement of data from data sources to target integrated data platform.

alt

Implementation of the virtual integration pattern has been made viable by the availability of a new set of enterprise information integration (EII) tools, which promise real-time data integration from a variety of data sources such as relational databases, delimited text files, Web services, and multidimensional databases. EII tools use predefined metadata to populate views that make integrated data appear relational to an end user. Some EII products available in the market include BEA’s Liquid Data, Composite Software’s Information Server, and MetaMatrix’s Metamatrix Server. These product suites vary in features and technical strengths and should be selected based on a specific set of business and technical requirements.

The primary obstacles to adopting such tools for data integration have been linked to the performance of queries. Virtual integration accesses data from source systems directly, which in many situations is not optimized for BI queries. Hence query performance may be slow in such cases. However, EII features (such as the caching of query results and availability of powerful hardware that can execute these queries in parallel mode) help overcome the performance constraints and make these tools a viable alternative for some situations.

EII offers benefits such as access to real-time data and transparency to the source data systems. EII tools can be used to provide integrated BI across different data formats as they can integrate data available in different formats. Further, EII tools can disseminate data to end users or systems in a variety of formats (Excel, Web services, and text files, among others) and integrate with BI reporting and analytics tools as well. Output interfaces like Web services provide reusable data services to organizations that can then be leveraged by different applications.

The virtual pattern does not work well when data summarization is required or when historical information must be captured. Also, since the data is provided directly from the source data systems, availability of source systems is an important consideration in applying the virtual data integration pattern. Further, EII does not match the data cleansing capabilities provided by ETL routines involved in physical integration and provides limited cleansing capabilities. For repeatable, non-trivial data cleansing needs such as de-duping, an ETL tool will be required to cleanse the data. Typically, the cleansing requires setting up a physical staging area. The dirty data goes through ETL, validation, and data quality routines and moves to the staging area before the EII tool can access it for integration.

3. The Hybrid Pattern
This pattern (see Figure 3) manifests itself in combinations of virtual and physical patterns to integrate data for BI. The use of virtual or physical pattern combinations for integrating a set of data sources is driven by the constraints of those data sources. Some of the possible combinations include:

  • Using the physical pattern for integrating data sources and then disseminating the integrated data virtually. The virtual dissemination is attained using EII tools that expose the integrated data as Web services. This combination is appropriate when there is a need to capture historical data, or there are data source constraints such as low data source availability. Both of these imply a requirement for persistence or physical integration. However, at the same time there may be a need for a high degree of reuse of the disseminated data service through Web services, implying a need for virtual integration to coexist with the physically integrated data.
  • Using a physical pattern for integrating a set of data sources, then virtually integrating this physical data set with another set of data sources. This integrated data can then be disseminated virtually. An administrator can use the EII tool to create views that access the integrated data, as well as create “über-views” that integrate underlying views. These views can be queried by the end user or application. This combination is appropriate when some of the source data systems necessitate physical integration due to various constraints but others can be integrated virtually. A real-life scenario is integrating data from transactional systems with data in an existing data warehouse owned by a different organization. The transactional systems can be integrated physically, then virtually integrated with the existing data warehouse to provide an integrated BI view.

The hybrid pattern provides benefits of both the physical and virtual approaches when applied in a combination that meets data integration constraints specific to the enterprise.

alt

Constraints Influencing Data Integration Patterns

Business requirements, architectural requirements, and various constraints drive the choice of integration pattern. There are constraints that are applicable regardless of the pattern applied. An example of such constraints is the lack of a business sponsor for the BI initiative. Absence of a sponsor who has identified the ISC for the organization and can drive all BI data movement through it (so that there are no leaks) is a primary reason BI initiatives fail.

Other constraints influence the adoption of a particular data integration pattern. In many situations, any of the patterns satisfies the functional needs for BI, but organizational constraints or the enterprise architecture strategy influence the choice of integration pattern. This section discusses such constraints and how they influence applicability of a design patterns. These constraints can be divided into several categories.

Technology and Data Architecture Constraints
These factors result from technical requirements or technical characteristics of various architectural components that influence the applicability of a data integration pattern. Factors include:

  • Complexity of Transformation: Integrating data from multiple sources requires data to be harmonized with respect to syntax and semantics. This requires data transformation rules to be executed during the data integration cycle. Complex transformations may impact query performance as these transformation rules are executed in real time during query execution for virtual integration. Virtual integration is therefore not an appropriate choice when complex transformations are required. Physical integration performs these transformations as part of the ETL process in batch mode, and hence impacts the data latency but not performance.
  • Performance Requirements: Virtual integration typically integrates data across multiple sources by conducting a join in real time (instead of batch mode) for physical integration. Therefore, virtual integration may provide worse query performance than physical integration. However, performance constraint should be considered in conjunction with the performance requirements and other constraints to make the appropriate decision about the use of an integration pattern.
  • Data Source Profiles:
    • External: External data sources, like third-party data, do not provide real-time access to the data source and may require staging before they can be integrated or published. In addition, they may not be available in a consistent format and may require some cleansing and translation, implying a need for persistence. Hence, virtual integration may not be feasible when integrating with external data sources.
    • Governance: Virtual integration requires accessing data sources in real time. Any operational or design changes to data sources directly impact the integration. Thus, lack of control or influence over data sources makes use of virtual integration an unattractive choice.
    • Availability: Availability of data sources plays a big role in the application of a design pattern. Since virtual integration accesses data sources in real time, any downtime of a data source, whether a development, maintenance, or network issue, translates into unavailability of BI information. If service level agreements of data sources are not aligned with the BI application availability requirements, virtual integration may not be an appropriate choice.
  • Data Quality: Physical integration typically involves ETL processes that move data from sources into a staging area before it can be published for BI use. Staging facilitates data scrubbing and translation, and in some cases manual validation or annotation. However, virtual integration provides very limited data cleansing facilities.
  • Data Periodicity: Sometimes data from different sources for a particular time period is not available at the same time. In such cases, the data from these sources needs to be staged until the data from all sources for that period is available. Virtual integration is not feasible in this case.
  • Data Source Transparency: Sometimes architectural flexibility is required such that the BI application can be insulated from data source changes. Virtual integration provides an attractive means to support a business environment when data-source changes are planned for the future.
  • Architectural Direction: Since EII tools can expose the integrated data as Web services, a virtual integration pattern is a good choice for organizations that are standardizing on services-oriented architectures.
  • Technology Adoption Culture: The propensity to adopt emerging technology varies in each organization. Since the EII set of tools is relatively new, organizations averse to using emerging technologies may not choose virtual integration as a data integration pattern.
Business Constraints

These factors influence the applicability of a data integration pattern resulting from requirements for BI, or business factors such as cost and time to market. Such constraints include:

  • Historical Data: Capturing historical data, by definition, requires persistence. Hence, the virtual integration pattern is not suitable when there is a need for historical data.
  • Data Summarization: Business requirements that involve a high degree of summarization make physical integration a better choice. Virtual integration is not a suitable choice because of the real-time performance impact of data summarization queries.
  • Time to Market: Projects using virtual integration tend to have a shorter time to market as the absence of physical data movement eliminates the staging and hosting of data, and no ETL processes need to be implemented to move data for integration purposes.
  • Maintenance Cost: Typically, the virtual integration pattern lowers long-term maintenance costs since it eliminates actual data movement and operational procedures linked to it.
  • Data Freshness: Business requirements for realtime snapshots of data are better served using the virtual integration pattern since the physical integration pattern will, by its very nature, introduce data latency.
  • Business Requirements Evolution: Sometimes the requirements for BI applications evolve with subsequent phases of the project. Applying the virtual integration pattern in such cases is a suitable choice as it provides the flexibility to create new integration views, and modify existing views with significantly less effort than the physical integration pattern.
Organizational Constraints

These factors influence the applicability of a data integration pattern resulting from the ownership and location of disparate data sources. They include:

  • Budget and Resources: Integrating data across multiple domains or organizations often requires an overarching, common budgetary allocation since the benefits are shared with all the domains involved. In many organizations, such sharing of budgets is not possible due to an unwillingness to part with their own budgets. This impedes the implementation of a cross-domain BI solution. In such cases, virtual integration may afford a lower cost to entry and help overcome budgeting barriers.
  • Cross-organization Governance and Change Management: A lack of cross-organization governance and change management processes impedes the creation of cross-domain data warehouses. Since data warehouses require significant investments, without cross-organization governance, they do not provide long-term ROI and are operationally challenging to maintain. In such cases, virtual integration is a better solution since it provides loosely coupled integration and a logical harmonization of data across domains.
  • Culture: Withholding information or an inclination against sharing data beyond individual domains or silos is a common cultural affliction in large organizations. Since information sharing is the very basis of BI, departments may choose to persist data to circumvent inhibitions against data sharing. Persisted data will enable domain level validation and annotation before it is sent out for consumption at a higher executive level.
Compliance and Regulatory Constraints

These factors influence the applicability of a data integration pattern driven by regulatory or compliance requirements. Among the constraints:

Regulatory Requirements: Requirements imposed by governments or standards organizations may require some persistent data and an application of the physical integration pattern.

Physical Separation: In certain cases, compliance requirements necessitate that the data from two different domains within the larger organization remain physically separate. In such cases, virtual integration provides a suitable approach for providing an integrated BI view across these sources.

Decision Matrix for Data Integration Patterns

The decision to adopt a particular design pattern to meet business and technical requirements is influenced by a set of constraints and how the pattern meets those constraints. This decision matrix provides an overview of various constraints and how virtual or physical patterns meet these constraints.

EII tools are part of an evolving technology space and introduce project risks associated with any emerging technology. These risks should be taken into consideration when selecting a data integration pattern. Proofs of concept should be developed based on realistic project scenarios before a virtual integration pattern is applied.

Constraint Value Physical Pattern Virtual Pattern
Technology and Data Architecture
Complexity of transformation High alt
Performance requirements High alt
Data source profile: internal or external External alt
Data source profile: governance Lack of control alt
Data source profile: availability Low Availability alt
Data quality Low alt
Data periodicity Staging required to synchronize data periodicity alt
Data source transparency High alt
Architectural direction Web Services alt
Technology adoptation culture Mature technologies only alt
Business Constraints
Historical data New to capture historic data alt
Data summarization needed High alt
Time to market Sooner alt
Maintenance cost Low alt
Data freshness Real-time snapshot alt
Business requirement evolution Significant evolution alt
Organizational Constraints
Budget and resources Cannot be shared alt
Cross-organizational governance Not present alt
Culture Information withholding alt
Regulatory and Compliance
Regulatory requirements Persist data alt
Compliance and confidentiality Physical separation alt
Why Patterns?

Today’s rapidly changing business environment makes the evolving design and execution of BI strategies critical to the success of an organization. To craft an effective BI strategy, architects first need to identify systems or data sources of record and align them with the enterprise information supply chain. Architects also need to understand data integration patterns currently in use and contrast them with the available patterns discussed in this article. Being able to recognize and adapt a data pattern to a data integration issue, even at a high level of abstraction, will aid in articulating and assessing potential solutions.

A careful assessment is required to determine enterprise fit for a particular data integration pattern. It involves analyzing business and user needs with regard to data timeliness, freshness, history, and regulatory requirements, an understanding of ingrained organizational culture and governance, as well as existing maturity of technologies involved. As the virtual data integration space evolves and matures, it holds the promise of lending itself to more flexible and robust BI strategies.

Scenarios The following scenarios offer some real-life examples of how constraints influence the choice of a data integration pattern:
Scenario 1.

Goal: To create a leadership team dashboard of financial metrics data for a financial services company. The market data is available through a third-party company database.

Key Constraint: The third-party data source does not provide real-time integration.

Solution: The hybrid pattern—persist with the third-party data and integrate virtually with the financial data available in the company’s financial data warehouse.

Scenario 2.

Goal: To integrate financial metrics data for U.S. and European sales data for a global software product company for head-office management reporting.

Key Constraints:

  • U.S. and European sales data owned by separate domains with no cross-organization governance
  • Project driven by the U.S. sales organization and no shared budget exists
  • European sales data exists in a distinct data warehouse
Solution: The virtual pattern—use virtual integration to integrate U.S. sales data with European sales data residing in an existing data warehouse.
Scenario 3.

Goal: To create common reports in a pharmaceutical company for clinical trial metrics across sub-organizations involved with pre- and post-market trials.

Key Constraints:

  • Lack of cross-organizational governance
  • Separate data warehouses with integrated clinical data for sub-organizations
  • Budget owned by a single organization
  • Compressed time to market

Solution: The virtual pattern—integrate the two data warehouses virtually to provide common reports using harmonized metadata.

Scenario 4.

Goal: Provide enhanced BI for HR applications (performance management, talent management, etc.) for a diversified, global insurance company. Integrate corporate HR data with organization-specific HR process data.

Key Constraints:

  • Corporate data available in a data warehouse
  • Multiple organization-specific, transactional applications provide enhanced, processed data
  • Availability of the transactional applications cannot be guaranteed
  • Organization’s desire to capture historical data
  • Corporate unwillingness to house organization-specific data in the corporate data warehouse

Solution: The hybrid pattern—integrate the organization’s transactional systems into a physical data warehouse, and virtually integrate corporate data with the data warehouse.

Anupam Manglik -

Anupam Manglik is Senior Architect at NetNumina Solutions. He has more than 15 years of experience developing architecture and data integration strategies and solutions.

Vivek Mehra -

Vivek Mehra is a Senior Architect at NetNumina Solutions. He has more than 10 years of consulting experience in developing technology and data integration strategies.