
|
Drowning in Data, Thirsty for Answers: Filling the Information Gap
Even with the great technological advances of recent times, the gaps between business intelligence (BI) data, warehouse user requirements, and IT delivery capability is not narrowing.
Introduction: Two All-Too-Familiar Scenarios Scenario #2: A CIO tries to explain to the CEO why user dissatisfaction is at an all-time high, despite huge investments in best-of-breed technology, a never-higher consulting budget, and a highly talented IT organization. If these situations are immediately recognizable, it’s because the gap between user requirements and IT delivery capability is not narrowing, despite the great technological advances of recent times. This article explores some of the reasons for the existence of this gap, and offers some alternatives that challenge the conventional wisdom about meeting data delivery requirements.
The Growing Demand for Data
The Organizational Understanding Gap On the IT side of the house, the situation is not much different: IT staff have typically built up an impressive understanding of the business and the associated data, often surpassing that of the business users, but this expertise is just as likely to be focused on a narrow view of the business. The IT manager depends on the business manager to provide enough guidance to establish clear deliverables, which can be used to design a solution and to plan and schedule work. The business manager, perhaps struggling with an unfamiliar context, truly does not know what answers will be required tomorrow, since they are a function of what happens today, both inside and outside the walls of the enterprise. Their only response to the “Just tell me what you want” statement is, “I want everything, and I want to be able to ask any question, and I want the answers right away, so that I can figure out my next question.” The IT manager knows through painful experience that vague guidelines are a recipe for open-ended projects that fail miserably while sucking budgets dry. So back comes the response: “Look, I can’t use a one-line statement of requirements in an RFI to work effectively with my development team. You have to tell me specifically what you want or I can’t help you.” There is no quick way to resolve this to the satisfaction of either party. Discussing future specifics in a dynamic environment is typically a colossal waste of time and effort: as soon as the specifics are agreed upon, the business environment changes again, and so the specifics change. The gap in organizational understanding is just another constraint that has to be addressed (like shortened delivery windows), and drives a fundamental need for flexibility and adaptability.
Architecture by Vendor Hype Sometimes, the acquisition of the multidimensional tool turns out to be the perfect solution, and both business users and the IT department live happily ever after. This relatively rare scenario will most probably occur when the multidimensional tool can be bolted on to a stand-alone transactional application of moderate size and complexity. Here, however, we will take a look at some common situations where the outcome might be less idyllic.
User Resistance In large and complex organizations, there is a wide range of business processes that have historically been supported by customized applications; even in the ERP world, specialized modules are used to provide requirements in functional areas. Perhaps more significant differences exist between the technical aptitudes of users in a large population. Given extensive disparity across the enterprise in the work performed, the data used, and the technical expertise available, arriving at a single tool to support data delivery is a daunting task.
Data Quality Issues Development of an overall data architecture is a starting point for ensuring adequate data quality and accurate reporting of business metrics, but the data architecture exercise is arduous, time-consuming, and expensive, and requires highly skilled (read “scarce”) resources. The only way to expedite the data assessment process is to provide users with unfettered access to all the data so they can explore it and find out, for instance, how many transactions occurred outside some threshold value. Of course, if this capability were readily available, the users would not be complaining in the first place.
Data Source Scale and Complexity However, when the initial “honeymoon” is over, users soon learn that multidimensional tools have their own scope and scale issues: when the number of attributes the user wants to include passes some (not very large) threshold, we start to see “data explosion,” and load times become an issue, just as they do when the number of indices in a RDBMS reaches a certain point. Admittedly, the threshold problem in a multidimensional tool can be avoided by creating another cube, which addresses the flexibility issue discussed earlier. The effort involved in maintaining a large number of less complex cubes is not appreciably different from the requirements for maintaining a smaller number of more complex cubes. The threshold problem of adding more indices in a RDBMS cannot be easily solved, which is why the market for multidimensional tools continues to be robust despite the issues just noted.
Multiple Data Sources Of course, the modeling effort is subject to the same issues as data architecture: it takes time and highly skilled resources. Completion of this work positions IT to start loading the new data, at which point it is promptly discovered that what was modeled is not exactly what exists in the data, and this requires another iteration of the model. When loading resumes, data quality problems are promptly encountered—those quirks again!—and when these issues are wrestled to the ground, someone notices that the carefully constructed security measures at the legacy application level have been adroitly bypassed, exposing sensitive data. Long before that issue can be addressed, more data is discovered that has to be included. It is no wonder that data warehousing is normally a difficult, expensive, and never-ending process.
Ad Hoc Query Capability Whether this is true or not (as an ex-user who attended lots of conferences, I would add sitting in the hot tub to the golfing and drinking part!), it is certainly not related to the increasing need of users to ask unpredictable questions. The needs of businesses increasingly include high-end or complex analytics. Ad hoc questions are often triggered by events that are themselves unpredictable: the business may learn that a major customer or supplier or competitor is shutting down, merging, changing product lines, was affected by a natural disaster, or any of a myriad of other possibilities, and it is critical that the implications of the event be clearly understood and a response quickly formulated. The acceptable time-to-response is inevitably far outside the comfort zone of IT, and is steadily shrinking. The scope, scale, complexity, and responsiveness of the required analytics fall well outside the “sweet spot” of any RDBMS or multidimensional tool, regardless of vendor claims to the contrary. These scenarios are likely more typical than the “live happily ever after” outcome, and explain why “architecture by vendor hype” is a significant source of IT budget strain and user disillusionment. Business users have been conditioned to believe that technology is a business enabler. When one takes into account the growing need for realtime access to data from multiple sources, and the almost daily announcements from RDBMS or multidimensional tool vendors proclaiming yet another quantum leap in capability, it is no wonder that user expectations exceed IT’s capacity to deliver. But there are no silver bullets: if it sounds too good to be true, one would be well advised to test it in a real-life situation before investing.
The CIO’s Dilemma On the other hand, the data has to be concentrated somewhere, since there isn’t a single front-end tool that can “talk” to the typical range of data sources—and, as already mentioned, if there were such a tool, it would probably be in some way unsuitable for some portion of the organization. Without consolidation and integration of the data, users will be unable to sift for relevant information, since functional interpretations hold sway and collisions between “facts” become inevitable. In such an environment, everybody works from his or her own set of numbers, and reconciliation becomes a full-time job for far too many people. Finding an acceptable solution for this situation presents a real dilemma for the CIO. The traditional approach to data delivery in a complex environment has involved a data repository and a front-end tool. This approach has not been successful in many organizations for the reasons already described. Changing the recipe is almost as hard as forcing the recipe to work around the obstacles because of one of the most powerful forces in the known universe: what might be called the “IT Cultural Immune System.” This system has been finely honed as IT personnel develop defenses against technically sophisticated users. Although all organizations resist change, in IT this resistance carries the added passion that is characteristic of the “true believer.” It would be interesting to find a situation involving a data delivery problem where an Oracle DBA and a SAS programmer would not each loudly proclaim the superiority of their respective approach. While these solutions might, in fact, be perfect for a wide range of problems, they would also be the equivalent of using a sledgehammer to drive in a thumbtack; it’s feasible, but it represents a misuse of the tools. However, just try telling that to the true believer!
The Missing Element In effect, this type of platform/repository enables a “prototyping” approach to data delivery. The strategy of prototyping has proven effective in many disciplines: to test out materials, ideas, and products, auto manufacturers develop full-scale mock-ups of their cars; fashion designers produce sketches, then use live models; architects build scale models. In all these fields, the prototype is designed to promote dialogue—it provides a shared frame of reference for discussing and refining a product. Prototyping accelerates design convergence because exchanges can be reduced to “I like this,” “I don’t like that,” “more of this,” or “less of that.” One of the greatest benefits of prototyping is the participatory approach to data delivery it enables. Instead of IT saying “no” to user requests for data access, or “wait six weeks/months” (which is often the same as “no”), IT and users work together. IT dumps data into the platform/repository and users can start asking questions immediately. As users become acquainted with those data quality “quirks” and confusing data labels, there are at least three very positive outcomes. Users:
The platform/repository for data-driven BI delivery must provide the ability to perform:
The following sections discuss each of these requirements in detail.
Data Assessment The platform must provide ODBC/SQL access with full update, insert, and delete capability. Any and all fields or columns should be available for interrogation without penalty. Queries that return minimum and maximum values, the number of nulls or missing values, a count of the distinct values, reports showing the distinct values, and value distributions are extremely useful in characterizing a large data set without actually looking at all records, and provide an early impression of data quality and content. The ultimate product of data assessment comes in the form of data cleansing rules and data transformation rules. If the platform supports live insert, update, and delete operations along with the ability to create and destroy tables on the fly, proposed data cleansing and transformation approaches can be tried out and evaluated by the business experts themselves.
Dependency and Relationship Validation The platform should allow data extracted from different applications and different departments to be brought together for a thorough process of dependency and relationship validation. Dependency validation usually involves examining two or more fields from a particular data source or single table at once. For example, one might choose a pair of columns within a particular table and then determine the frequency with which pairs of values occur in those columns. The associations and frequencies uncovered by these kinds of queries show functional dependencies between fields. The queries themselves are simple; it’s the ability to run them arbitrarily on two or more fields without having to worry about indexing or tuning that separates an effective prototyping facility from an ineffective one. An axiom of relational theory holds that logically related tables have one or more pairs of attributes or columns that reference one another. This is usually referred to as a primary-key-foreign-key relationship. For such a relationship to be correctly formed, the pair of columns must reference a common set of values, and all of the values that appear in one of the columns must appear in the other (though the reverse does not necessarily hold true). To find out whether columns are suitable for primary-key-foreign-key relations, our platform has to permit very fast evaluation of data set overlap, responding rapidly to questions such as, “How many values in column A are not in column B?” and “How many values in column B are not in column A?” With the answers to questions such as these, BI system builders can develop the data transformations that would bring potential key columns in line with referential integrity rules.
Integration Validation Ambiguity or duplication needs to be avoided when introducing data from different sources into the system. Therefore, a prototyping platform must provide the ability to create and remove tables and populate them on the fly with query results in order to evaluate different approaches to data source integration. System builders and their customers (the business experts) can work through scenarios for scores of tables from multiple sources in real time, collapsing months of work into weeks or even days. Like data assessment, this activity produces data transformation and integration rules that can become regular processes in the production system.
Anomaly Investigation
Data and Query Audition The business-side participants in the process have an opportunity to try the kinds of queries and reports that they will want to run in the production system against full-sized data sets that are test-cleansed, transformed, and integrated. They can evaluate the results for usability, time to execute, completeness, and correctness. If the system builders have provided summarized or aggregated data, the business participants can determine—using live data—whether the summarized data provides too much or too little detail. When aggregations or summarization are inappropriate, the system builders can go back and reconstruct them immediately or overnight. Business users can also explore further derivations from data provided by the system builders. Depending on how it is performed, derivation can be thought of as cube building. To visualize this, think of a query such as, “For all stores and all products, return the sales volume for all credit card holders with cards assigned in the last 12 months.” If you look at “store,” “product,” and perhaps “account holder” as the three dimensions in this scenario, the results appear as a “cube.” When these results are preserved in a table, the cube becomes available for further querying. In our example, this would probably involve joining the result with tables containing information about stores and products. This is why the key relationships embodied in the store and product attributes have to be preserved and remain valid throughout the cube-building process. Variations on these kinds of cubes will be auditioned repeatedly in the prototype system. For architectures that include some form of OLAP technology, this phase will produce a blueprint for cubes that will be built on a regular basis. The following questions, among others, must be answered:
Data-Driven Delivery of Business Intelligence
This gap between the core competencies of these two layers is the reason the gap between user requirements and IT delivery capability isn’t narrowing. Figure 1. The Build Phase
A Proposed Architecture The computer monitors along the top represent both users and IT accessing the data discovery mart through a user interface layer. This layer can be any standard ODBC/SQL-compliant front end, which allows existing company front-end tools to be used where appropriate. The doubleheaded arrow on the right reflects the existing reporting environments that are typically in place for each of the data sources. There should be no need to displace these reporting mechanisms while development work is underway. The success of the data discovery marts is predicated on the use of a column-oriented, tokenized RDBMS, which alone is capable of delivering the required “load and go” ad hoc analytics capability. The need for column orientation reflects a fundamental characteristic of analytic environments. In online transaction processing (OLTP) systems, the typical requirement is for individual records to be quickly found, operated upon, and then possibly rewritten. For this reason, all related fields are stored together in that record (a row of data in a relational table) to enable speedy performance of these operations. However, for the kind of queries typical of data analysis environments, the opposite is true: it is more common for a large number of records to be processed with reference to a small number of attributes (columns in relational tables). For example, a user might want to see information about total sales in different countries. Millions of records may need to be scanned to provide a response, but data is required from only two columns. All the remaining columns in the table are irrelevant to the query, and thus don’t really need to be accessed. In a traditional database management system (DBMS), however, these columns are still an integral part of each row that is processed, and so still need to be moved in and out of memory regardless of whether they are required to answer the query. This adds to the burden on the I/O subsystem, which tends to be the slowest part of a computer. As already mentioned, record-oriented database systems usually attack these problems by indexing the columns that are most frequently involved in the database operations of a particular application—but this then becomes an ongoing task that requires continuous attention from scarce, specially skilled personnel. Tokenization is another approach to achieving efficient data analytics that differs radically from conventional relational database technologies. In a typical relational database, when a record is added to the system, a physical representation of the data is recorded on disk. For example, each time a new customer is recorded, a new set of data values is added to the database, whose size increases correspondingly. Here the scaling of data is linear, because the volume of data in the database is directly proportional to the number of records it contains. Since query performance is usually a function of data volume, large and complex relational databases are hard to handle without summarization. In a tokenized database, however, data values in records are not simply appended to the database. Rather, each distinct value is stored only once, and is assigned an integer “token” that is used in representing the value’s occurrences in table columns. If a new record contains values that are already present in the database, all that needs to be done is to create the appropriate references to existing tokens. This means that all the redundancy within databases is automatically removed, bringing economy in both performance and size. Anybody interested enough in this subject to have read this far will undoubtedly be familiar with the concepts espoused by Ralph Kimball in his book The Data Warehouse Toolkit, and will have noticed that this article and the architecture it describes are primarily a “how to” guide for implementation of those concepts. The differences are mainly semantic in nature. Kimball defines a data mart as “a flexible set of data, ideally based on the most atomic (granular) data possible to extract from an operational source, and presented in a symmetric (dimensional) model that is most resilient when faced with unexpected user queries.” Although few would argue with this definition (certainly not me!), complex environments that include fact tables with multiple billions of rows present a daunting challenge for all the reasons previously mentioned. Using a column-oriented, tokenized database as a foundation for the architecture described in this article represents a practical way to achieve the objectives inherent in Kimball’s definition. Some of the newer challenges, including an increasing requirement to deal with data from outside the enterprise, are also well supported. External data, by its very nature, may never be a candidate for inclusion in a warehouse, but still needs to be accessible via data marts. The capabilities described above allow the iterative but rapid development of a truly effective reporting environment. Figure 2 shows both the build and operate phases of such an environment. After the first completed set of iterations, these phases will coexist, since the data delivery process is highly dynamic and continually adapting to new business realities. Figure 2. The Build and Operate Phases
Above all, because this approach is a collaborative one, the traditional roles and mindset of both user and IT groups are transformed. The “IT immune system” is deflected so that it exerts at most an indirect impact on the process. Although it is probably too much to hope that the “true believers” will be completely converted, the speed and flexibility of this approach are a winning combination when dealing with a large and complex business environment. Even the most die-hard traditionalists will recognize that “Just give me more money and more time and I could do that too!” is not a career-advancing position in today’s business world.
REFERENCES |