Ten Mistakes to Avoid When Constructing a Real-Time Data Warehouse

by Stephen Brobst
This article identifies the 10 most commonly made mistakes when designing a real-time data warehouse and gives advice to help you avoid these pitfalls.

Foreword
Real-time data warehousing is clearly emerging as a new breed of decision support. Providing both tactical and strategic decision support from a single, consistent repository of information has compelling advantages. Such an implementation naturally encourages the alignment of strategy development with execution of the strategy. In many cases, however, it will require a radical rethinking of existing data warehouse architectures. Evolution toward stricter service levels in the areas of data freshness, performance, and availability is critical. The pages that follow identify the 10 most commonly made mistakes when designing a real-time data warehouse and give advice to help you avoid these pitfalls.

Mistake 1: Focusing on “Real” Time Rather than “Right” Time
There is an enormous amount of focus around creating the “real-time” enterprise in the IT industry today. Creating real-time feeds into a data warehouse is often perceived as a critical aspect of bringing intelligence to the real-time enterprise. While technology vendors may be enamored with real time, it is critical to focus on business-driven solutions and avoid getting caught up in the technology hype. Real time should not be the goal for a data warehouse implementation in and of itself. A better goal would be right-time data warehousing—whereby the service levels for data freshness are driven by business goals rather than technology hype.

Just because real time can be implemented does not mean that it should be implemented in all cases. The idea of right-time data warehousing is that data freshness service levels are aligned with and driven by the needs of specific business processes. Aligning service levels for data acquisition with the business processes within an organization will result in more cost-effective implementations and, ultimately, better return on investment. For example, consider the requirements for an analytic application designed to support decisions related to exception handling for late flights in an airline data warehouse.

Does data acquisition of late flight events really need to occur in a small number of seconds from the operational bookkeeping systems? Probably not. The airline will know at least 10 minutes prior to the flight landing time whether it will be late or not. Assuming that the analytic decisions related to gate assignment, holding connecting flights, re-accommodation, and so on can be accomplished within a reasonable amount of time (minutes), the need for immediate (small number of seconds) acquisition of the late flight event would be overkill. A more cost-effective capacity plan and implementation solution can be realized when the data freshness requirements are not overstated relative to the business requirements.

On the other hand, a data warehousing solution designed to perform the analytics related to quality control on a high-speed assembly line may very well need data freshness measured in (near) “real” time. The immediate capture and analysis of test data from the assembly lines is essential for process control and quality management. Proactively detecting machine drift and taking corrective action before missed tolerances force shutdown of an assembly line for more drastic repairs can mean millions of dollars in savings.

Focusing on “right time,” using data freshness service level agreements (SLAs) based on an understanding of business process requirements, leads to far more effective implementations than technology-driven efforts striving for “real time” under the influence of market hype. However, it is important not to make short-sighted decisions in the design of the data acquisition architecture for the data warehouse. A well-designed architecture will allow data freshness SLAs to increase as business requirements evolve. It is important to implement a scalable solution whose capacity can be adjusted upward to support more aggressive data freshness according to the needs of maturing business processes. Re-writing or re-architecting a data acquisition infrastructure due to lack of foresight can be a significant drain on the ROI for a real-time data warehouse. On the other hand, over-engineering the initial implementation can be just as big a drain. The key is a scalable architecture that allows just the right amount of capacity to be deployed at each stage of evolution in deployment of an organization's real-time data warehousing capabilities (without code re-writes!).
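
To make the idea concrete, the sketch below treats data freshness as a per-process configuration setting, so that tightening an SLA later is a configuration change rather than a code rewrite. It is illustrative only; the process names, SLA values, and the load_pending_events helper are assumptions, not part of any particular product.

# Illustrative sketch: data freshness as per-process configuration.
# Process names, SLA values, and load_pending_events() are assumptions.
import time

FRESHNESS_SLA_SECONDS = {
    "late_flight_handling": 600,   # minutes-level freshness is sufficient here
    "assembly_line_qc": 5,         # near-real-time freshness is required here
}

def load_pending_events(process: str) -> int:
    """Placeholder for the actual data acquisition step; returns rows loaded."""
    return 0

def run_acquisition(process: str, cycles: int) -> None:
    # The loader's cycle time is driven by the business SLA, so tightening the
    # SLA later is a configuration change rather than a code rewrite.
    interval = FRESHNESS_SLA_SECONDS[process]
    for _ in range(cycles):
        rows = load_pending_events(process)
        print(f"{process}: loaded {rows} rows; next cycle in {interval}s")
        time.sleep(interval)

run_acquisition("assembly_line_qc", cycles=1)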

Mistake 2: Confusion between Bookkeeping, Decision Making, and Action Taking
Life was much simpler in the early days of data warehousing. It was possible to make almost black and white distinctions between the online transaction processing (OLTP) systems and the data warehouse. The OLTP systems were online and operational in nature. Data warehouse solutions were batch oriented and definitely non-operational. However, significant changes have taken place over the last 15 years of evolution in data warehousing. Data warehouses are becoming more real time and decision support is certainly considered a mission-critical, operational capability in many organizations.

Despite the aggressive service levels in performance, availability, and data freshness demanded of a real-time data warehouse solution, it is critical to understand that the data warehouse is not a replacement for traditional OLTP systems. Bookkeeping within an organization is best performed using an OLTP systems infrastructure. The OLTP systems are designed to handle many concurrent bookkeeping transactions with the goal of minimizing latency for transaction commit times. OLTP systems are often functionally oriented and have a relatively high write-read ratio. Data warehouses, however, are focused more on throughput of data acquisition than the latency of individual transaction commits. The data acquisition architecture for a data warehouse is typically stream oriented rather than transaction oriented. Moreover, the read-write ratio in the data warehouse workload is high.

OLTP systems are focused on bookkeeping operations. Data warehouse systems are focused on decision making operations. Real-time data warehouse solutions are capable of providing both strategic and tactical decision support. A real-time data warehouse does not replace OLTP functionality. The nature of the underlying file system organization, database locking protocols, and overall engineering tradeoffs are quite different for OLTP and data warehousing (including real-time data warehousing). Although tactical decision support often requires very up-to-date data and high performance, it differs from OLTP because the read-write ratio of the workload is much higher and there is more emphasis on throughput of data acquisition than latency of individual transaction commits. Tactical decision support also requires integration of data from across multiple functional areas within an organization and typically requires narrow and deep access to historical information. A real-time data warehouse should be designed to enable tactical decision support in addition to strategic decision support.

Action taking requires cooperation between the bookkeeping systems and the decision support environment. Consider, for example, an airline implementation of real-time data warehousing. An important capability within an airline is the handling of exceptions resulting from flight delays. When a flight delay occurs, the operational control systems will publish a new “wheels down” time for the aircraft. Under conditions of such an event, there is a complex set of decisions that needs to be made regarding passenger re-accommodation, whether or not to hold connecting flights, and many others. Re-accommodation involves figuring out which passengers can be placed on later flights, how to allocate seats on the later flights if more passengers have missed their connection than there are seats available, how to allocate hotel rooms (class of hotel, level of compensation, etc.) to customers with forced overnights, and so on. A real-time data warehouse for an airline would subscribe to events indicating late flights to facilitate participation in the re-accommodation (and other) decisions. A sophisticated set of optimization decisions then takes place, based on a complex decision tree, to assess the aggregate lifetime value impact associated with the various options when applied across all customers on the flight.

These re-accommodation decisions require integration of information from many different sources: flight schedules, boarding information, flight bookings, checked luggage, customer value, historical information regarding previously misconnected flights for each passenger, and much more. The real-time data warehouse is where this information all comes together. However, once the re-accommodation decisions are made, further cooperation with the bookkeeping systems is required to put them into action. The real-time data warehouse needs to provide the decisions that have been made back to the bookkeeping systems so that seats can be held for the allocated passengers on the later connecting flights, luggage can be transferred to the appropriate flights, hotel rooms can be allocated to stranded passengers, etc. Sometimes a human intermediary is involved in the feedback loop; at other times, process-level integration can take place on an automated basis using enterprise application integration (EAI) software. In either case, taking action requires cooperation between the real-time data warehouse (for decision making) and the bookkeeping systems (to put the decision into action).
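
The following minimal sketch shows the shape of this feedback loop under stated assumptions: Python's queue.Queue stands in for the EAI message bus, and the event fields and decide_reaccommodation logic are invented for the example rather than taken from any airline implementation.

# Illustrative sketch of the action-taking loop: subscribe to late-flight events,
# decide in the warehouse tier, publish decisions back to the bookkeeping systems.
import queue

late_flight_events = queue.Queue()        # inbound events from operational control
reaccommodation_actions = queue.Queue()   # outbound decisions to the bookkeeping systems

def decide_reaccommodation(event):
    """Placeholder for the lifetime-value-aware optimization across all passengers."""
    return [{"passenger": p, "action": "rebook_next_connection"}
            for p in event.get("affected_passengers", [])]

def handle_events():
    while not late_flight_events.empty():
        event = late_flight_events.get()
        for decision in decide_reaccommodation(event):
            # Feedback loop: bookkeeping systems hold seats, re-route luggage, etc.
            reaccommodation_actions.put(decision)

late_flight_events.put({"flight": "XY123", "new_wheels_down": "18:42",
                        "affected_passengers": ["P1", "P2"]})
handle_events()
print(reaccommodation_actions.qsize(), "actions published back to bookkeeping systems")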

Mistake 3: Using Legacy ETL Infrastructure
Traditional data warehouses have been built with batch file processing as the foundation for data acquisition. Homegrown and commercially available ETL (extract, transform, and load) tools have been built around a model of data-level integration using non-intrusive extracts from the operational bookkeeping systems, with metadata defining the source-to-target mappings via code generators or engine-based transformation tools. The execution framework for these tools is to take “raw” source files as input and then use the metadata to drive transformation of the source files into outputs that are load ready for insert, update, and delete operations into the data warehouse.

This batch file processing framework is very efficient for periodic data acquisition into a data warehouse. The overhead of starting up a batch job is easily outweighed by the superior efficiency of sequential file processing with large volumes of data. However, in a real-time data warehousing implementation, it is no longer acceptable to accumulate the large volumes of data necessary to make file processing an efficient method of data acquisition. It is clearly unacceptable to wait until the end of the day (or week) to load data into a real-time data warehouse with extreme service levels for data freshness. Since real-time data warehousing means frequent acquisition of data into the warehouse, there will not be sufficient volume accumulated in any given load cycle to merit batch data processing implementations. If the number of rows being loaded into a data warehouse is less than the number of blocks in the target table, it is typically more efficient to use continuous data acquisition with SQL inserts than bulk data loading (it also depends on the number of indices on the table and will vary by RDBMS product).
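
As a rough illustration of that rule of thumb, a loader might choose its method as follows. The threshold, and the way index count is folded in, are assumptions for the example and will vary by RDBMS.

# Illustrative decision rule: trickle rows in with SQL inserts when the pending
# volume is small relative to the target table; otherwise use the bulk loader.
def choose_load_method(pending_rows, table_blocks, index_count):
    # One illustrative way to fold index count into the rule of thumb: extra
    # indexes raise per-row insert cost, nudging the tradeoff toward bulk loads.
    threshold = table_blocks // (1 + index_count)
    return "sql_inserts" if pending_rows < threshold else "bulk_load"

print(choose_load_method(pending_rows=5_000, table_blocks=200_000, index_count=3))      # sql_inserts
print(choose_load_method(pending_rows=2_000_000, table_blocks=200_000, index_count=3))  # bulk_load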

Real-time data warehousing has two very important implications for the ETL infrastructure: (1) data delivery needs to be more explicitly supported from the operational source systems, and (2) data transformation implementations must evolve from a file processing orientation to a stream processing orientation. Legacy ETL infrastructures are primarily non-intrusive. Very little involvement from the source systems is required because the ETL processes extract from file structures using back-door processes without requiring changes to legacy applications. However, to facilitate real-time data warehousing, the operational bookkeeping applications must be enhanced to explicitly publish data to EAI message bus structures. This will require changes to the legacy applications to interface with the EAI infrastructure within the enterprise.

Real-time data warehousing requires continuous delivery of data from the bookkeeping systems into the data warehouse repository. Old style batch data processing is not appropriate in the new world of real-time data warehousing. Batch file processing typically revolves around relatively infrequent (e.g., once per day or even less often) data acquisition. In the real-time enterprise the data spigot is always turned on. The implication is that ETL infrastructure must be stream oriented rather than file oriented. The leading ETL vendors have already begun to enhance their product offerings with adapters to allow stream processing as an additional option beyond file processing. These adapters typically interface with reliable queuing systems or publish/subscribe mechanisms within the EAI infrastructure for the enterprise.
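
A minimal sketch of what stream-oriented acquisition looks like, assuming a reliable queue feeding small, frequently committed micro-batches; sqlite3 and queue.Queue stand in here for the warehouse RDBMS and the EAI publish/subscribe infrastructure, and the schema is invented.

# Illustrative stream-oriented (rather than file-oriented) data acquisition.
import queue
import sqlite3

bus = queue.Queue()                # stand-in for the EAI message bus
dw = sqlite3.connect(":memory:")   # stand-in for the warehouse RDBMS
dw.execute("CREATE TABLE sales (sale_id INTEGER, amount REAL, load_ts TEXT)")

def consume(batch_size=100):
    """Drain whatever has arrived, committing small micro-batches."""
    rows = []
    while not bus.empty() and len(rows) < batch_size:
        rows.append(bus.get())
    if rows:
        dw.executemany(
            "INSERT INTO sales (sale_id, amount, load_ts) VALUES (?, ?, datetime('now'))",
            rows,
        )
        dw.commit()

for i in range(5):                 # simulated publisher on the message bus
    bus.put((i, 19.99))
consume()
print(dw.execute("SELECT COUNT(*) FROM sales").fetchone()[0], "rows acquired")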

Capacity planning for the data warehouse will need to be re-visited when the transition from batch to continuous data acquisition takes place. There will no longer be a shrinking batch window at night or on weekends to contend with after converting to real-time data acquisition. However, incremental workload throughout the day (and night) must be accounted for in the capacity plan to allow for simultaneous query execution and data acquisition within the real-time data warehouse. Moreover, the per-record resource requirements for data acquisition will be higher than in the batch processing world. The capacity plan must account for peak workload conditions when considering the resources necessary for implementation—do not fall into the trap of using average workload requirements.
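
A back-of-the-envelope calculation shows why: with invented, purely illustrative arrival rates and per-row costs, sizing to the average would leave the peak badly under-served.

# Illustration of peak versus average sizing; all numbers are made up.
avg_rows_per_sec = 500
peak_rows_per_sec = 4_000          # e.g., a surge arriving alongside concurrent queries
cpu_ms_per_row = 0.8               # per-row acquisition cost, higher than batch loading

avg_cpu_demand = avg_rows_per_sec * cpu_ms_per_row / 1000    # fraction of one CPU
peak_cpu_demand = peak_rows_per_sec * cpu_ms_per_row / 1000

print(f"average demand: {avg_cpu_demand:.1f} CPUs, peak demand: {peak_cpu_demand:.1f} CPUs")
# Sizing for the 0.4-CPU average would leave the 3.2-CPU peak unserviced.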

Mistake 4: Too Much Summary Data
A common vehicle for enhancing performance in traditional data warehousing is to make use of summary tables or cubes. By pre-aggregating data along commonly used dimensions of analysis it is possible to avoid dynamic calculation of summary data during query execution. The theory is to invest in construction of summary tables when new data is acquired into the warehouse in order to avoid dynamically building the summaries over and over again during query execution. Many query tools and applications have built-in capability for “aggregate awareness” to allow automatic exploitation of summary data.

However, reliance on pre-aggregated summary tables is usually undesirable in a real-time data warehouse environment. The higher data volatility in the real-time enterprise changes the performance economics of building summary tables. In the days when summary tables were built once per week (or even once per month) and then used all week (month) before they were re-built again, the many times that the summaries were accessed during query execution easily justified the investment in their construction. However, with real-time data warehousing the content in the warehouse changes much more frequently. This means fewer uses of the summary data in between summary builds. As the warehouse moves toward a true continuous update, the investment required to keep summary tables up to date becomes less and less attractive.

For historical data that has stabilized, it may still be attractive to keep pre-aggregated data. On the other hand, building pre-aggregated summaries on the most current data, which is changing on a minute-by-minute (or even second-by-second) basis, is usually cost prohibitive. Keeping a summary table up to date is generally much more expensive than inserting new rows into the base tables of the data warehouse. As the frequency of data acquisition increases relative to the arrival rate of new queries, the cost tradeoff for maintenance overhead on the summary tables versus performance advantage for query execution becomes less and less compelling. Any summaries maintained on the most up-to-date data must be designed so that row level updates can be used for summary maintenance rather than requiring re-aggregation from the detailed data every time new data is acquired into the warehouse.
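
For example, a summary of sales by store and day can be maintained by touching only the one summary row each new detail row affects. The sketch below is illustrative only; sqlite3 stands in for the warehouse RDBMS and the schema is invented.

# Illustrative row-level summary maintenance instead of full re-aggregation.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_detail (store_id INTEGER, sale_date TEXT, amount REAL)")
dw.execute("CREATE TABLE sales_by_store_day (store_id INTEGER, sale_date TEXT, "
           "total REAL, PRIMARY KEY (store_id, sale_date))")

def acquire_sale(store_id, sale_date, amount):
    dw.execute("INSERT INTO sales_detail VALUES (?, ?, ?)", (store_id, sale_date, amount))
    # Row-level maintenance: touch only the one summary row this sale affects.
    cur = dw.execute(
        "UPDATE sales_by_store_day SET total = total + ? WHERE store_id = ? AND sale_date = ?",
        (amount, store_id, sale_date))
    if cur.rowcount == 0:   # first sale for this store/day
        dw.execute("INSERT INTO sales_by_store_day VALUES (?, ?, ?)",
                   (store_id, sale_date, amount))
    dw.commit()

acquire_sale(7, "2005-06-01", 25.00)
acquire_sale(7, "2005-06-01", 10.00)
print(dw.execute("SELECT total FROM sales_by_store_day").fetchone()[0])  # 35.0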

In the select cases where the performance economics of summarized data for a real-time data warehouse are compelling, it is advisable to implement using materialized view technology rather than traditional summary tables. This technology has been introduced within leading RDBMS products over the last few years and has been improving steadily in its practical capabilities. Use of a best-of-breed materialized view implementation facilitates automatic maintenance of the summary structures rather than manually implemented updates with SQL. Furthermore, this approach allows analytic applications to be coded against the business view of the data model with transparent translation to the denormalized physical structures embodied in the materialized views. Essentially, the aggregate awareness is moved from the analytic application into the cost-based optimizer of the RDBMS.

Moreover, tactical decision support workloads usually do not benefit from traditional summary tables that aggregate away the detail. Unlike traditional (strategic) decision support, which often focuses on dimensionally oriented analysis against summary data, tactical decision support will normally go narrow and deep, making summary tables much less useful. Rather than relying on summary data, focus on high-performance design for providing access to the detailed data with summarization on demand. Appropriate indexing techniques and parallel database designs can deliver query performance without the maintenance overhead associated with summary tables.
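
As a simple illustration of summarization on demand, a composite index can let a tactical query go narrow (one customer) and deep (full history) with no pre-built aggregate; the schema and data below are invented for the example, with sqlite3 again standing in for the warehouse RDBMS.

# Illustrative narrow-and-deep tactical query against detail data.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE txn_detail (customer_id INTEGER, txn_date TEXT, amount REAL)")
dw.execute("CREATE INDEX ix_txn_customer_date ON txn_detail (customer_id, txn_date)")
dw.executemany("INSERT INTO txn_detail VALUES (?, ?, ?)",
               [(42, "2005-06-01", 120.0), (42, "2005-06-03", 80.0), (7, "2005-06-02", 15.0)])

# The index drives access to one customer's history; the SUM happens on demand.
total = dw.execute(
    "SELECT SUM(amount) FROM txn_detail WHERE customer_id = ? AND txn_date >= ?",
    (42, "2005-01-01")).fetchone()[0]
print(total)  # 200.0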

Mistake 5: Lack of High Availability
The engineering focus of real-time data warehousing is often directed toward performance in the areas of data acquisition and query execution. Real-time data warehousing is most closely associated with aggressive service levels for data freshness. And if the query performance for getting data out of the warehouse is not equally aggressive, then there is not much point in high-performance data acquisition. However, availability service levels are equally important. The whole point of a real-time data warehouse is to deliver value for enhancing decision making as part of operational business processes. Strategic decision making doesn't need real-time data (or anything close to it), so if there is a business driver for real-time data warehousing it means that there is a need for tactical decision making.

A real-time data warehouse is, in most cases, an operational system. Not for bookkeeping, but for decision making. This brings a whole new perspective to the management of the data warehouse. Down time on a traditional data warehouse is bad for the productivity of knowledge workers, but delivering the answer to a strategic question can usually be delayed by a day without having a major impact upon an organization. On the other hand, down time of a data warehouse that has been integrated into the real-time enterprise impacts operational processes. Lack of the facilities necessary for optimal decision making when it comes to execution of the business strategy often has a direct or indirect influence on the customer experience. This, of course, is unacceptable.

When deploying a real-time data warehouse it is essential to architect high availability into all aspects of the platform and its operational procedures. Both planned and unplanned down time need to be minimized. When the data warehouse is involved in the decision making necessary for execution (not just development) of the business strategy, the opportunity cost associated with even small amounts of down time can be very large. Evolution toward high availability in a data warehouse must be engineered based on the business drivers for balancing the cost and value of the availability solution. Most real-time data warehouses will not start out with full disaster recovery and zero down time as out-of-the-box service level requirements—but the trend is definitely in this direction.

Mistake 6: Failure to Initiate Business Process Changes
Installing a well-designed technical implementation of a real-time data warehouse will not, by itself, deliver value. Technology on its own rarely does. It is improvement in the underlying business processes of an organization that drives the real value. The real-time data warehouse is simply a tool for delivering a technical capability for high-performance decision making against very up-to-date data. But it is what an enterprise does with this capability that really matters.

Some visualize the real-time data warehouse as a source of immediate updates for dashboards that sit on the desks of top executives in the enterprise. While this may make for a high-visibility deployment of technology, it is unlikely to have bottom-line (positive) impact on the organization unless there is a change in business processes to make use of the information. The biggest source of business value from the real-time data warehouse is likely to stem from deployment of information to the front lines of a business rather than focusing on the corporate ivory tower.

It is the agents (human and software) on the front lines of an organization that are making the operational decisions that require up-to-date data and real-time analytics—not those in strategic planning, marketing, and finance positions. However, for the real-time information to have value, it must be delivered so that execution of optimized decision making is practical and effective for the front lines of the organization. This will usually demand significant changes in existing business processes. The technology component necessary for realizing the value from a real-time data warehouse is the easy part of the implementation. The inertia and resistance to change encountered when dealing with business process change is generally much more of a hindrance than the difficulty in evolving the underlying technology for realization of the real-time enterprise.

Mistake 7: Separate ODS Deployment per Channel
Implementations of real-time data warehousing capability using technologies unable to handle both tactical and strategic decision support on a single platform will often deploy operational data store (ODS) repositories for storing and providing access to the most current information needed for decision making. In this approach, the ODS is used for handling real-time data acquisition and tactical decision support. A separate repository, the enterprise data warehouse (EDW), is used to store historical data that is used for strategic decision support. The ODS is updated continuously, whereas the EDW is updated on a less frequent periodic basis.

Unfortunately, many ODS repositories are deployed using channel-specific implementations. For example, one ODS might be used to enhance decision making for personalizing product recommendations on a Web site and a separate ODS is used to drive the decision making related to call center offers. This is a business disaster. Lack of integration across the channels results in uncoordinated customer experiences. Product recommendations delivered on the two channels are likely to be redundant and inconsistent. The customers end up having channel relationships rather than relationships to the enterprise as a whole. This leads to very poor execution of relationship optimization.

A more appropriate architecture would have a single repository integrated across the enterprise and accessible across all channels. Emerging standards around .NET and J2EE allow seamless access to the analytic applications for managing decisions across channels through application server technology and Web Services. This approach obviates the need to create redundant and poorly integrated ODS repositories across channels. The end result is a single source of truth for tactical decision making whereby each channel application has visibility to the activities that have taken place through other channels within the enterprise.

The proliferation of ODS repositories across channels is reminiscent of the data mart deployment that was common a few years ago. At the height of the data mart craze, every line of business or department within the enterprise wanted to have its own data mart. The perception was that separate data marts were cheaper, faster, and had higher performance than an enterprise data warehouse. This trend was often fueled by technology vendors who were more interested in quick sales than enterprise architecture. The lack of scalability in many of the platforms popular at the time of the data mart craze contributed to the “divide and conquer” mentality influencing implementations of that era.

Experience, however, has taught the industry that the total cost of ownership (TCO) characteristics of data mart proliferation get out of hand very quickly. Not only is TCO higher with a multi-mart deployment architecture as compared to an enterprise data warehouse, but business value is significantly lower due to lack of flexibility and integration across multiple subject areas (see recent research from Gartner, Meta Group, AMR, etc.). So now organizations are feverishly consolidating data marts into the single source of truth embodied in their enterprise data warehouses to lower costs and increase business value.

So why all this talk about data marts in the midst of a discussion on real-time data warehousing? Because an ODS is like a data mart in sheep's clothing. The trend toward ODS construction has all of the worst characteristics of the data mart proliferation from the late twentieth century. The main difference is that data marts were built to “protect” the EDW from strategic reporting and decision support whereas the ODS repositories are designed to “protect” the EDW from operational reporting and tactical decision support.

It is clear that the trend is toward integration of strategic and tactical decision support within the domain of the real-time data warehouse. There may be specific applications for which a separate ODS may be appropriate (as is also the case for selective data mart deployment), but proliferation of ODS repositories should be avoided. The high cost of redundancy in storing data multiple times, transforming and moving data multiple times, maintaining multiple systems, etc. is to be avoided whenever possible. As the technology matures for handling coexistence of strategic and tactical decision support within a single source of truth for decision making in the enterprise, the ODS repositories will get consolidated into the real-time data warehouse. The best plan is to avoid the mistake of building them in the first place by evolving a traditional EDW into a real-time data warehouse. All serious RDBMS players in the real-time data warehousing marketplace are in various stages of evolving their products in the direction of handling mixed workloads for both tactical and strategic decision support in a single relational repository.

Mistake 8: Underestimating the Importance of Historical Data
There is a common myth that the real-time data warehouse only needs the most recent data and that historical data should be relegated to the traditional data warehouse infrastructure. This approach presents the real-time data warehouse as a thin layer of data that sits apart from the strategic data warehouse. In a sense, the real-time data warehouse gets relegated into an ODS role with only a small amount of information that is kept very up to date and is periodically fed to the data warehouse.

While simple decisions may be possible with only the most recent data from the enterprise, more sophisticated business processes require information about the current situation in the context of the historical detail. The decision of what to offer a customer during a Web interaction should be based on more than simply what products are being browsed at a particular moment in time. The best offers consider the current context along with past browsing and purchasing behavior. Moreover, information from additional channels, such as call center interactions, should also be considered.

While some pre-scoring can be performed using the historical data, it is not possible to score all relevant offer combinations. The best results will be obtained by dynamically combining information from the current interaction with historical context. Optimized decision making in supply chain management, logistics, fraud management, and most other tactical decision support applications requires a combination of both up-to-date information and historical content from the real-time data warehouse. Integration of the present with the past is essential for best results in tactical decision making.
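
The sketch below is purely illustrative, not a prescribed scoring method; the offers, features, and weights are invented. It shows only the shape of a decision that blends the current interaction with historical context drawn from the warehouse.

# Illustrative blending of current context with historical context for an offer decision.
def choose_offer(current_browse_category, history):
    offers = {"premium_card_upsell": 0.0, "savings_account_offer": 0.0}
    # Current context: what the customer is looking at right now.
    if current_browse_category == "credit_cards":
        offers["premium_card_upsell"] += 0.5
    # Historical context from the warehouse: past responses, balances, channel contacts.
    if history.get("declined_card_offer_last_90_days"):
        offers["premium_card_upsell"] -= 0.4        # do not repeat a rejected offer
    if history.get("avg_monthly_balance", 0) > 10_000:
        offers["savings_account_offer"] += 0.6
    return max(offers, key=offers.get)

print(choose_offer("credit_cards",
                   {"declined_card_offer_last_90_days": True, "avg_monthly_balance": 12_000}))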

Mistake 9: Failure to Integrate the Data
It is common to confuse real-time data access with real-time data warehousing. In most cases, real-time data access can be best provided by going directly to the bookkeeping (OLTP) systems within the organization. By definition, the bookkeeping systems have the most up-to-date information about the state of the business. Successful deployment of real-time data warehousing focuses on more than just real-time data access. The key difference is integration.

Data warehousing, first and foremost, is about data integration. Real-time data warehousing means that data is integrated on a more timely basis. But without integration, the value of the data warehouse is marginal. A set of stand-alone tables lacking integration, even if updated in (near) real time, is not going to provide a solid foundation for real-time data warehousing. Consider a banking data warehouse implementation. Only when demographics, account holdings across all lines of business products, and transactional behaviors across all channels are integrated will analytic applications have a complete picture of a customer. The value of information is maximized when the organization can exploit relationships that cross product and channel boundaries within the enterprise.
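
The value of that integration shows up in the queries it makes possible. In the illustrative sketch below (schema and data invented, sqlite3 standing in for the warehouse), a single query can assemble demographics, holdings, and cross-channel behavior only because all three subject areas share a conformed customer key.

# Illustrative cross-subject-area query enabled by integrated customer keys.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
CREATE TABLE customer_demographics (customer_id INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE account_holdings (customer_id INTEGER, product TEXT);
CREATE TABLE channel_transactions (customer_id INTEGER, channel TEXT, amount REAL);
INSERT INTO customer_demographics VALUES (1, 'affluent');
INSERT INTO account_holdings VALUES (1, 'mortgage'), (1, 'brokerage');
INSERT INTO channel_transactions VALUES (1, 'web', 250.0), (1, 'call_center', 40.0);
""")

row = dw.execute("""
    SELECT d.segment, COUNT(DISTINCT a.product), COUNT(DISTINCT t.channel)
    FROM customer_demographics d
    JOIN account_holdings a ON a.customer_id = d.customer_id
    JOIN channel_transactions t ON t.customer_id = d.customer_id
    WHERE d.customer_id = 1
""").fetchone()
print(row)  # ('affluent', 2, 2)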

In some cases, the effort to quickly implement a real-time data warehousing capability will lead an IT organization down the path of a “virtual” data warehouse. The idea is to “federate” data from across multiple (often operational bookkeeping) data sources to form the illusion of a data warehouse with very up-to-date data. This approach is attractive because middleware is used as a substitute for constructing a physical instantiation of the real-time data warehouse. However, this approach misses the fundamental underpinning for a successful data warehouse: integration.

The notion that integration will be “automagically” performed using a middleware layer at query execution time is impractical for all but the smallest of data sets and simplest of organizations. Imagine running a query that requires scanning, joining, and aggregating data from across multiple source systems. It is completely impractical to perform the necessary integration of information in middleware for anything resembling real-life data volumes and source system complexities. While federation is appropriate for small look-up tables that have high volatility, using this approach as a substitute for integrating core data of the organization into a properly implemented real-time data warehouse is a ruse.

Mistake 10: Assuming All Knowledge Workers Want Real-Time Data
A common assumption is that faster is better. Real-time acquisition of data must certainly be more desirable than a weekly batch acquisition of data into the data warehouse, right? The correct answer is “it depends.” In the case of tactical decision making, faster is often better. More up-to-date information means that decisions are made with more complete information and important business events can be detected and acted upon on a more timely basis.

However, for some types of decision making, the real-time update of information in the data warehouse is not at all desirable. The finance person who is engaged in strategic planning does not want the data changing in the middle of his analysis. For long-term decision making, the stability of the data is much more important than keeping the information up to date. In fact, there are likely to be many different service level requirements for data freshness within an enterprise. Some knowledge workers need the most up-to-date information while others prefer their data to remain stable for a day, week, or even month.

For a real-time data warehouse to effectively serve the enterprise, multiple data freshness service levels must be delivered. However, it is certainly not practical to build a different set of tables for each service level requirement. An effective way to deploy the real-time data warehouse with multiple data freshness service levels is to use time stamping and view constructs. All rows acquired into the data warehouse are assigned a date-timestamp. Each data freshness service level desired by a knowledge worker community has its own set of “tables” that are simulated with views designed to filter out any data more current than the service level allows. All views for a given entity resolve to the same underlying physical representation of the entity—only the filtering criteria change according to the data freshness service levels for each knowledge worker community. In this way, a single source of truth can be constructed to serve multiple data freshness service levels without unnecessary replication of data.
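
A minimal sketch of this approach, using sqlite3 purely for illustration and invented table and view names: every row carries a load timestamp, and each community's view filters out data fresher than its service level.

# Illustrative time-stamping/view approach to multiple freshness service levels.
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (sale_id INTEGER, amount REAL, load_ts TEXT)")

# Real-time community: sees everything, including rows loaded seconds ago.
dw.execute("CREATE VIEW sales_realtime AS SELECT * FROM sales")

# Stable-daily community (e.g., strategic planning in finance): sees only rows
# loaded before the start of the current day, so data does not shift mid-analysis.
dw.execute("""
    CREATE VIEW sales_stable_daily AS
    SELECT * FROM sales
    WHERE load_ts < datetime('now', 'start of day')
""")

dw.execute("INSERT INTO sales VALUES (1, 10.0, datetime('now', '-2 days'))")
dw.execute("INSERT INTO sales VALUES (2, 99.0, datetime('now'))")   # just arrived

print(dw.execute("SELECT COUNT(*) FROM sales_realtime").fetchone()[0])      # 2
print(dw.execute("SELECT COUNT(*) FROM sales_stable_daily").fetchone()[0])  # 1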

Stephen Brobst

Stephen Brobst specializes in the design and construction of DW solutions for Fortune 500 companies in the U.S. and internationally. Stephen performed his graduate work in computer science at MIT where his master's and PhD research focused on high-performance parallel processing. He also completed an MBA with joint course and thesis work at the Harvard Business School and the MIT Sloan School of Management. Stephen has been on the TDWI faculty since 1996.