Business Intelligence Best Practices - BI-BestPractices.com

A Data Mining Primer for the Data Warehouse Professional
The key to data mining is ensuring that you have a foundation of quality data that is clean, consistent, and accurate.

By James Kashner, Arlene Zaima

You have probably heard about the rewards data mining can bring to business. However, very little has been written to explain the challenges that face many information technology (IT) organizations and professionals. With that in mind, this article explores data mining from the IT perspective—giving a quick overview of the data mining technology, technical challenges, and solutions.

The article explores:
  • How data mining is used for business advantages
  • The integral relationship between data mining and data warehousing
  • Data mining terms and techniques
  • The challenges encountered with data mining
  • How to get started with a data mining project
  • Examples of customer results achieved through data mining

Data Mining Brings Results
Ever-increasing global economic challenges are prompting companies to explore new ways to get more from their data warehousing investment. Technologies that offer valuable insight and predictive capabilities to drive business growth and improve ROI are a great next step after the data warehouse is in place.

Data mining is just the right technology for supercharging CRM and analytic applications by inserting intelligence in the form of predictions, scores, descriptions, and profiles (where data mining excels). Volumes of historical data containing facts about what occurred in business operations can be analyzed and used to predict what will happen in the future.

Data mining is one of the fastest growing business intelligence technologies because it pays off in quantitative value. Here are just a few results from companies that have embraced data mining:

  • A European financial institution saved $8.2 million by gaining a better understanding of its customers’ ATM behavior. It was able to strategically place ATMs to reduce fees and increase loyalty.
  • A South American telecommunications provider retained 98 percent of its high-value customers during deregulation. It identified high-value customers, understood their profile and customer satisfaction level, and successfully marketed to different customer segments.
  • A U.S. telecommunications provider improved its marketing response rate tenfold by targeting customers that were identified using data mining techniques.

What Exactly Is Data Mining?
Data mining is a powerful technology that converts detail data into competitive intelligence that businesses can use to predict future trends and behaviors. Some vendors define data mining as a tool or as the application of an algorithm to data. The truth is, data mining is not just a tool or algorithm. Data mining is a process of discovering and interpreting previously unknown patterns in data to solve business problems. Data mining is an iterative process, which means that each cycle further refines the result set. This can be a complex process, but there are tools and approaches available today to help you navigate successfully through the steps of data mining projects.

From an IT perspective, the data mining process requires support for the following activities:

  • Exploring the data
  • Creating the analytic data set
  • Building and testing the model
  • Integrating the results into business applications

Therefore, the IT organization must provide an environment capable of addressing the following challenges:

  • Exploring and pre-processing large data volumes
  • Providing sufficient processing power to efficiently analyze many variables (columns) and records (rows) in a timely manner
  • Integrating data mining results into the business process
  • Creating an extensible and manageable data mining environment

Data Mining Makes Its Way to the Business World
Data mining has been very effective in focused areas, such as medical diagnosis, scientific research, and behavioral profiling since the mid-1980s. In the past 10 years, data mining technology has journeyed into the business world where it has added the new dimension of predictive analysis. To be effective in the business world, the data mining process had to be adapted to deliver models in a much more time-sensitive manner. Today, with the advent of in-database data mining techniques, businesses have finally found it possible and affordable to benefit from the advanced capabilities of this powerful technology.

What Can Data Mining Do for Your Business?
For years, businesses have relied on reports and ad hoc query tools to glean useful information from data. However, as data volumes continue to increase, finding valuable information becomes a daunting task. Data mining technology was designed to sift through detailed historical data to identify hidden patterns that are not obvious to humans or query tools. Many of these previously hidden patterns reveal intelligence that can be integrated into business processes to provide predictive capabilities for improving strategic business decision making.

Data mining makes analytical business applications, such as CRM, smarter by providing insights into many new areas of your business that would otherwise go unnoticed. By making your applications smarter, data mining translates into a higher return on your warehouse investment.

The Difference Between OLAP and Data Mining
A commonly asked question is “What’s the difference between data mining and online analytical processing (OLAP)?”

OLAP is a business intelligence tool that allows a business person to analyze and understand particular business drivers in “factual terms.” Typically, a specific “descriptive” or factual question is formulated and either validated or refuted through ad hoc queries. OLAP results are also factual results. For example, you may ask, “What were the average monthly dollar sales of portable CD players during the past three months in our southwest regional stores?” These results are factual answers that enable you to validate or question past order and inventory decisions.

But what if you want to make a prediction about future demand for portable CD players with a high degree of confidence so that the amount in inventory will fulfill demand? What are the errors that you are most likely to make, and how do those potential errors match your organization’s specific approach to risk? These types of business questions challenge traditional query and OLAP techniques beyond their capabilities. OLAP techniques don’t produce predicted or estimated values with associated expectations of accuracies and errors.

Data mining, on the other hand, is a form of discovery-driven analysis where statistical and machine-learning techniques are used to make predictions or estimates about outcomes or traits before knowing their true values. With data mining, predictions are accompanied by specific estimates of the sources and number of errors that are likely to be made. Estimates of errors translate directly to estimates of risk. Consequently, with data mining, making business decisions in the presence of uncertainty can be done with detailed and reliable information about associated risks. Data mining techniques are used to find meaningful, often complex, and previously unknown patterns in data.

For example, you may ask, “How many of a particular product should I order for inventory to fulfill 97.5 percent of the expected demand for the next three months?” Data mining techniques can be used to build models based on detail data to predict the most likely number of a specific item that will be sold within a given time period along with the likely errors in that prediction. Typically, OLAP analyses use predefined, summarized, or aggregated data, such as multi-dimensional cubes. Data mining requires detail data so that it can be aggregated to, and analyzed at, optimal levels during exploratory analyses. The optimal levels are unique to a specific business question and the data attributes available to address it in a specific data warehouse.
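
As a concrete illustration of the inventory question above, here is a minimal sketch (using made-up monthly sales figures, and assuming demand is approximately normally distributed) of how a 97.5 percent service level translates into an order quantity:

```python
import statistics
from statistics import NormalDist

# Hypothetical monthly unit sales for one product (illustrative numbers only).
monthly_sales = [120, 135, 128, 150, 142, 138, 125, 160, 148, 133, 129, 155]

mean = statistics.mean(monthly_sales)
stdev = statistics.stdev(monthly_sales)

# Order enough stock to cover 97.5% of expected demand in one month,
# assuming demand is approximately normally distributed.
service_level = 0.975
order_qty = NormalDist(mu=mean, sigma=stdev).inv_cdf(service_level)

print(f"mean demand: {mean:.1f}, order quantity: {order_qty:.0f} units")
```

A real model would be built from detail data at the optimal aggregation level for the business question, as the text describes; the normal assumption here is only one simple choice.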

Although these technologies are used for different purposes, OLAP and data mining are complementary. During the data mining exploration phase, you may use OLAP technology to help you understand your data. Data mining results can also be used in OLAP applications by incorporating new predictive variables or scores as dimensions or attributes in your OLAP tool. For example, if you calculate a new predictive variable called “Customer Value” that characterizes the value of a customer to your business in terms of current and estimated future profitability, you can include this new variable as an attribute in your OLAP tool. When retailers analyze which products to stock, they can consider products that attract high-value or profitable customers.

How Does Data Mining Work?
Data mining leverages artificial intelligence and statistical techniques to build models. Data mining models are built from situations where you know the outcome. These models are then applied to other situations where you do not know the outcome. For example, if your data warehouse identifies customers who have responded to past marketing campaigns, you can create a model that identifies the characteristics of those customers. This model can be applied to a wider customer database, identifying customers who demonstrate the same characteristics, allowing you to target those likely to respond, thereby improving response rates and reducing marketing cost.
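
The train-then-score pattern just described can be sketched as follows. The campaign-response data and features are invented for illustration, and scikit-learn's logistic regression stands in for whatever technique your tool provides:

```python
# A minimal sketch of building a model where outcomes are known,
# then applying it where they are not.
from sklearn.linear_model import LogisticRegression

# Training set: customers with a known outcome (1 = responded to a past campaign).
# Features are illustrative: [income in $k, years as customer].
X_train = [[45, 2], [80, 7], [30, 1], [95, 10], [60, 4], [25, 1], [70, 6], [50, 3]]
y_train = [0, 1, 0, 1, 1, 0, 1, 0]

model = LogisticRegression().fit(X_train, y_train)

# Apply the model to customers whose response is unknown and rank them by score.
X_new = [[85, 8], [28, 1], [55, 5]]
scores = model.predict_proba(X_new)[:, 1]  # probability of responding
for customer, score in zip(X_new, scores):
    print(customer, round(score, 2))
```

Targeting only the customers with the highest scores is what improves response rates and reduces marketing cost.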

Business problems that lend themselves to data mining are predictive and descriptive in nature. Predictive models are used to predict an outcome, referred to as the dependent or target variable, based on the value of other variables in the data set. For example, a predictive model could determine the likelihood that a customer will purchase a product based on her income, number of children, current product ownership, or debt. Predictive techniques build models based on a “training” set of data with a known outcome, such as prior buying patterns. The algorithm analyzes the values of all input variables and identifies which variables are significant as predictors for a desired outcome.

Unlike predictive models, descriptive models do not predict variables based on known outcomes. Instead, they describe a particular pattern that has no known outcome. Common techniques include data visualization, where large volumes of data are reduced to a picture that can be easily understood. Another common descriptive technique is clustering, where data is grouped into subsets based on common attributes. For example, you may use descriptive techniques to determine customer segments and their attributes.

In many cases, both descriptive and predictive models are used to solve business problems. A descriptive technique may identify customer segments based on value in terms of profitability to your business, and a predictive technique may identify the likelihood a particular segment will defect to your competitor. By combining results of the descriptive technique to predict customer defection, you can act to prevent attrition of your high-value customers.

The Data Mining Process
You cannot simply buy a data mining product, apply it to data, and expect it to generate a meaningful model. Data mining models are built as part of a data mining process—an ongoing process requiring maintenance throughout the life of the model.

The data mining process is not linear, but an iterative process in which you loop back to previous phases. For example, the initial model you create may lead to insight that requires you to return to the data pre-processing phase to create new analytical variables. The data mining process contains four high-level steps: (1) define the business problem, (2) explore and pre-process the data, (3) develop the data model, and (4) deploy knowledge. Tasks for each step are listed in Figure 1 to provide a brief overview of the data mining process. Although each step is important, most of your time will be spent in the data exploration and pre-processing phase. A well-structured data warehouse can significantly reduce the pain felt in this phase.

The Relationship Between Data Mining and Data Warehousing
Data mining is all about data. You can mine inconsistent or dirty data and find patterns. However, the patterns will be meaningless if your data does not accurately reflect the business you are modeling. The key to data mining is ensuring that you have a foundation of quality data that is clean, consistent, and accurate.

A data warehouse provides the right foundation for data mining. Although data mining can be done without having a warehouse in place, the process of gathering, cleansing, and transforming the data from multiple data sources can be arduous. Once the process has been completed for one model, you must repeat it for subsequent data mining projects. Approximately 70 percent of the data mining process involves accessing, exploring, and preparing the data. The data warehouse makes data mining more viable by removing many of the data redundancy and system management issues, allowing people to focus on analysis.

Data Mining Terms and Techniques
Following are some data mining terms and techniques commonly used to solve predictive and descriptive analytical problems.

Analytic Model
A model is a set of logical rules or a mathematical formula that represents patterns found in data that are useful for a business purpose. Once a model has been built based on one set of data, it can be reused to search for the discovered patterns in similar data. Models are sometimes called predictive models since they can be used to predict behaviors based on the discovered patterns.

Association
This modeling technique is commonly referred to as affinity analysis and is used to identify items that occur together during a particular event. Affinity analysis is commonly used to study market baskets by identifying which combinations of products are most likely to be purchased together. Another form of this technique is sequence analysis in which you can understand the order in which customers tend to purchase specific products. These results may be helpful in the early phases of establishing cross-selling strategies.
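
A toy version of affinity analysis can be written with nothing but the standard library: count how often product pairs appear together across (made-up) market baskets and report their support:

```python
# Count which product pairs occur together most often across baskets.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"beer", "chips", "salsa"},
    {"bread", "butter", "jam"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = fraction of baskets containing the pair.
for pair, count in pair_counts.most_common(3):
    print(pair, f"support={count / len(baskets):.2f}")
```

Production affinity analysis adds measures such as confidence and lift, but the pair-counting idea is the same.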

Clustering
Clustering is a class of modeling techniques that can be used to place items into groups based on like characteristics. The goal is to create groups of items that are similar based on their attributes within a given group, but very different from items in other groups. Clustering is frequently used to create customer segments based on behavior or other characteristics. Customers in the same segment share similar characteristics and tend to behave consistently. Knowledge of the typical behavior of a segment can be powerful information if you want to predict the behavior of a new member of that segment.
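
A minimal clustering sketch, using scikit-learn's KMeans on invented customer attributes, shows how segments and their "typical" members fall out of the data:

```python
# Group customers into segments by two illustrative attributes:
# [annual spend in $k, visits per month].
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [2, 1], [3, 1], [2, 2],      # low spend, infrequent
    [20, 8], [22, 9], [19, 7],   # high spend, frequent
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # segment assignment per customer
print(kmeans.cluster_centers_)  # the "typical" member of each segment
```

A new customer can then be assigned to the nearest segment, and the segment's typical behavior used as a prediction for that customer.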

Data Visualization
This process takes a large quantity of data and reduces it into more easily interpreted graphs, charts, or tables.

Decision Tree
This class of techniques produces a tree-shaped structure that represents a set of decisions to predict values of a target variable. One of several algorithms can be used to produce estimated values or classify data based upon rules. Decision trees are commonly used to model good/bad risk or loan approval/rejection. The models they produce are often intuitively appealing because they are represented by sets of “rules” that humans can read and easily understand.
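
The readable-rules property is easy to see in practice. The sketch below fits a shallow scikit-learn decision tree to invented good/bad-risk data and prints the fitted rules:

```python
# Fit a small tree for good/bad credit risk and print it as readable rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [income in $k, existing debt in $k]; target: 1 = good risk.
X = [[20, 15], [25, 18], [30, 5], [60, 10], [80, 4], [90, 30]]
y = [0, 0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "debt"]))
```

The printed output is a set of if/then threshold rules that a credit analyst can read and sanity-check directly.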

Linear Regression
A class of statistical techniques used to find the best-fitting linear relationship between a numeric target variable and a set of predictor variables. For example, linear regression can be used to estimate an appropriate amount of overdraft protection to offer a customer on their checking account based on account balances, years the account has been open, and income.
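
The overdraft example can be sketched with an ordinary least-squares fit; the account figures below are invented, and NumPy's `lstsq` stands in for whatever regression routine your tool provides:

```python
# Fit a linear model of overdraft limit from balance, account age, and income.
import numpy as np

# Columns: average balance ($k), years account open, income ($k).
X = np.array([
    [1.2, 2, 40],
    [3.5, 5, 60],
    [0.8, 1, 35],
    [5.0, 8, 90],
    [2.1, 3, 50],
])
# Overdraft protection granted ($k) for each customer.
y = np.array([0.5, 1.2, 0.3, 2.0, 0.8])

# Add an intercept column and solve for the best-fitting coefficients.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

new_customer = np.array([1, 2.0, 4, 55])  # intercept, balance, years, income
pred = coef @ new_customer
print(f"suggested overdraft: ${pred * 1000:.0f}")
```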

Logistic Regression
A class of statistical techniques used to find the best-fitting, natural-logarithmic relationship between a categorical target variable and a set of predictors. It is commonly used to predict yes/no outcomes.

Neural Networks
A family of non-linear modeling techniques that are often used to predict a future outcome based on historical data. They require substantial expertise to understand the rationale underlying the estimates and predictions they make. Neural networks are sometimes referred to as “black boxes” because they can produce models that are very difficult to understand. However, for some very complex phenomena, they can produce more accurate models than other techniques.
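
A small illustration of that non-linearity, assuming scikit-learn is available: a one-hidden-layer network fit to the XOR pattern, a relationship that linear techniques such as regression cannot capture:

```python
# Fit a tiny neural network to the XOR pattern (illustrative only).
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10  # repeated so fitting is stable
y = [0, 1, 1, 0] * 10

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=5000, random_state=1).fit(X, y)
print(net.predict([[0, 1], [1, 1]]))
```

The fitted weights, unlike a decision tree's rules, offer no direct explanation of why a given prediction was made — which is why the text calls these models "black boxes."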

Score
A score is an outcome of a model that represents a predicted or inferred value for some trait or characteristic of interest. If your model calculates customer value, each customer's score may be a number indicating that customer's current and estimated future value.

Data Mining Challenges
Although it may appear that data mining is the next logical step for companies that have already implemented their data warehouse, the reality is that many businesses struggle with getting their data mining projects to deliver meaningful results. To be successful, data mining requires the right team, the right methodology, the right architecture, and the right technology.

Challenge #1: The Right Team
Data mining projects must be a collaborative effort driven by business experts, developed by analytic modelers, and supported by IT. Your internal skill sets may be developed over time, which may mean initially hiring data mining consultants to develop your data mining capability with the ultimate objective of transferring knowledge to your team. To ensure a successful data mining outcome, you will need the following three classes of experts on the team: business domain experts, information technology support, and analytic modelers/data miners.

Business Domain Experts
It’s imperative to have the business analysts involved in the data mining project. They should be the champions and drivers of every data mining project. They need the answers that result from the project, and therefore, they must clarify the business issues to be solved by the project. The business domain experts should ultimately be held accountable for the results of the data mining project.

Information Technology Support
The IT organization responsible for the data warehouse provides the bulk of the IT support. However, other groups may be called upon to assist with data cleansing and model integration.

Analytic Modelers/Data Miners
Analytic modelers/data miners are responsible for preparing the data, designing the model, building the model, and deploying it against the data. The analytic modeler works with the IT organization to integrate the model into the decision support infrastructure and business processes.

Challenge #2: The Right Methodology
Data mining, like data warehousing, is an ongoing process that must be maintained and changed as business drivers change. The key to a successful project is to base it on a proven methodology. Below is a data mining methodology that has delivered successful models that have uncovered millions of dollars in revenue and cost savings for customers. This section defines this data mining methodology. Although all tasks are equally important, for the purpose of this paper, the primary focus is on the activities that affect the data warehouse.

Project Management
Every successful project requires clearly defined objectives, requirements, deliverables, and resources. Project management activities are required throughout the project’s life. The project manager ensures the project will produce satisfactory deliverables from both a technical and business perspective.

Business Problem Definition
Successful data mining begins with a clearly defined business objective. Everything from data pre-processing to model selection is driven by the business objective. The business problem is described in operational terms so initial data availability and the analytic approach can be determined.

Architecture and Technology Preparation
Before tackling a data mining problem, the development and implementation requirements for the analytic models must be understood. These requirements determine how the models are built, what software is required, and whether or not new hardware is required. In most cases, your development and production environments will be different. However, you may leverage the same environment with appropriate resources. There are several approaches to building models. Based on your environment and requirements, the right balance of client/server and/or in-database mining must be chosen.

Data Preparation
This is the most time-consuming step, but also the most critical. You must first collect all the data necessary for your project. If you have an enterprise warehouse, you’re in luck. However, you may still need to pull data from different sources. First, examine your data sources to see what is available to address the business problem. Second, ensure that data is computationally valid and consistent. For example, if you are pulling from different data sources, you must resolve conflicts among data—which can be a daunting task. To avoid these issues, we highly recommend starting with a data warehouse where these conflicts are resolved.

Once data is gathered, you can explore your data. This task is often called exploratory data analysis. Data visualization and descriptive statistical techniques are used to uncover data quality issues and better understand data characteristics. You may uncover data quality issues or missing data, which can jeopardize the integrity of any analytic model, so you must compensate for, if not correct, the data issue. For example, you must determine the best method for filling in missing data values. You can consider using a data mining technique to predict the value of a missing variable based on other data points.

Next, you must isolate and prepare your data for the particular model. You may exclude outliers for some models, whereas for others you may build the model around them. For example, if you were predicting baseball attendance and revenue, you would need to exclude abnormal attendance data, such as attendance data from 1994, the year of the baseball players’ strike. In other cases, such as fraud detection, you should include outliers since they may represent fraudulent transactions.

Once you have selected your data, some level of transformation may be required. Detail data, as it exists in the data warehouse, is not necessarily ready for data mining. You may want to derive optimal aggregations or new analytic variables to build a better model. For example, a customer’s debt-to-income ratio may be a better predictor than just debt or income. Some statistical techniques and algorithms also require numeric data or data within a certain range. For those variables, you need to recode or transform them into the appropriate input variable for the data mining technique.
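
The preparation steps above (filling missing values, deriving a ratio, and recoding to a fixed range) might look like this in pandas, with invented customer figures:

```python
# Fill a missing value, derive a new analytic variable, and recode a column.
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, 75_000, None, 55_000],
    "debt":   [10_000, 15_000, 8_000, 30_000],
})

# Compensate for missing data (here, a simple median fill).
df["income"] = df["income"].fillna(df["income"].median())

# Derive a new analytic variable that may predict better than either input.
df["debt_to_income"] = df["debt"] / df["income"]

# Recode income into the [0, 1] range for techniques that require it.
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)
print(df)
```

Median fill and min-max scaling are only two of many reasonable choices; the point is that the warehouse's detail data usually needs such transformations before modeling.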

Model Development, Test, and Validation
The next step is to build an analytical model—an iterative process of applying analytical techniques to the analytical data set and interpreting mathematical equations. The resulting equations are refined as more iterations are performed, with each iteration resulting in higher statistical and conceptual confidence in the results.

Earlier in the process, you identified a preliminary analytical approach required to solve the business problem. Now you must select the specific analytical algorithms or statistical techniques that are most appropriate for building your model. Your selection of specific analytical techniques often requires revisiting some aspects of data pre-processing that you performed in the previous step. Once you have selected the algorithms, it’s time to build the model. Building an analytical model requires at least three broad steps: (1) training or fitting, (2) testing, and (3) validation, which in turn requires you to segment data into at least three different data sets: (1) training, (2) test, and (3) validation. Your model is built using the training data, then tested using the test data to assess the model accuracy. The data mining tool you use should have sufficient model, parameter, and row-level diagnostics that allow you to identify and understand specific strengths and weaknesses in your model during these first two steps.
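
The three-way split can be sketched with scikit-learn's `train_test_split` applied twice; the data here is synthetic, and the 60/20/20 proportions are only one common choice:

```python
# Segment data into training, test, and validation sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 synthetic rows, 2 columns
y = np.arange(50) % 2

# First carve off 20% as the validation set ...
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)
# ... then split the remainder into training (75%) and test (25%) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_test), len(X_val))
```

The model is fit on the training set, refined against the test set, and the validation set is held out untouched until the end, as described above.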

After you have refined your model based upon the diagnostics, it’s time to validate your model. Model validation is a process where an analytical modeler attempts to establish and maximize the generalizability of a model beyond the data set with which the model was created. The validation data is used as an independent source of information to assess the degree to which your model’s accuracy might be overstated.

Overstated accuracy is frequently referred to as overfitting, a case where a model is built to closely fit the training and test data, but not the data that you intend to score. Overfitting has a direct and adverse effect on the usefulness, or validity, of your model. If the rules or formulas in a model are too tightly bound to one particular data set, you won't be able to use the model for the purpose you built it for: to produce scores for data with unknown outcomes that you want to predict with high confidence. The amount of effort put into maximizing the validity of a model is directly proportional to its business value.

The analytical models are tested using statistical techniques, models developed from different analytical techniques are compared, and the results are further validated against the business criteria for the project. Once the model is developed, you must also establish a process to validate and refresh it as the data changes. It is also necessary to monitor the continuing business validity of the analytical models.

Knowledge Delivery and Deployment
Knowledge derived through analytical models unlocks the ROI from your warehouse. There are several methods for deploying the models. Your IT organization may run the model and deliver the results to your business users for business decisions. The model or intelligence generated from the model can also be integrated into your customer relationship management (CRM) or analytical applications to facilitate business user access to the results. Regardless of your implementation, data mining adds intelligence to your business in the form of scores, predictions, descriptions, and profiles.

Knowledge Transfer
The right data mining methodology should include knowledge transfer, which spans the entire data mining project, beginning with initial interviews with each data mining team member to determine their knowledge transfer objectives for the project. Mentoring and education throughout the project arm the data mining team with the modeling and process knowledge needed to interpret results, maintain the modeling environment, and monitor the analytical model.

Challenge #3: The Right Architecture
Several data mining architectures are commonly used today: the distributed independent data mart, the data warehouse with dependent data marts, and the centralized data warehouse and mining architecture.

Distributed Independent Data Marts
The distributed independent data mart approach requires that data be extracted from multiple sources to analytical servers. Data gathered from the various sources must be converted into a common, consistent format and then merged into an analytic data mart. Because data mining is iterative, this process must be repeated many times. It’s true that you don’t need a data warehouse to mine data; however, the data movement and data management can be time consuming and lengthen your data mining project. Data mining tool and database vendors highly recommend beginning with a data warehouse if you plan to integrate data mining into your business intelligence strategy. One reason an analyst may opt for a distributed data mart model is data autonomy: once you extract data from your sources, you have full control over your analytical environment.

Data Warehouse with Dependent (Analytic) Data Marts
Using a data warehouse simplifies the data management issues since the data has already been gathered, cleansed, and transformed to meet your warehouse criteria. Although you are pulling from a single source, you must still contend with the data movement from your warehouse to your analytical server, potential human error that can occur with sampling, and analytic server management issues. In addition to data movement, you must ensure the data you select is a sample that accurately reflects the business environment. Building models using unrepresentative data samples will produce poor models.
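One common guard against unrepresentative samples is stratified sampling: drawing the same fraction from each class so that a rare outcome (such as churn) appears in the sample at its true population rate. The sketch below is purely illustrative; the customer table and the 5 percent churn rate are invented.

```python
import random

random.seed(1)

# Hypothetical customer table: 5% churned, 95% retained -- the kind of
# skew that can make a naive random sample misleading.
customers = [{"id": i, "churned": i % 20 == 0} for i in range(10_000)]

def stratified_sample(records, key, frac):
    """Draw the same fraction from each stratum so the sample
    mirrors the population's class mix."""
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * frac))
        sample.extend(random.sample(group, k))
    return sample

sample = stratified_sample(customers, key=lambda r: r["churned"], frac=0.1)
rate = sum(r["churned"] for r in sample) / len(sample)
print(f"sample churn rate: {rate:.3f}")  # matches the population's 5%
```

A plain random draw would usually land near the true rate too, but stratifying guarantees it, which matters most when the outcome you are modeling is rare.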

Centralized Data Warehouse and Mining
As data mining projects are implemented across the enterprise, the number of users leveraging the data mining models continues to grow as does the need to access large data infrastructures. Data warehouse solution providers recognize this situation and are incorporating data mining extensions within the database to offer a centralized data mining architecture. The analytic processing performed within the database minimizes data movement in and out of the database and leverages the parallelism of the database. A massively parallel database provides a massively parallel analytical engine that you can use to build, test, and deploy analytical models.

The data warehouse becomes a centralized repository for your analytical data, data mining models, and data mining results providing an ideal foundation for data mining projects. Data is available for multiple mining projects across your entire enterprise. Your analytical models can be run against your entire customer table within your warehouse. Data mining models and results combined with your detailed customer records give you insight about customer value, buying patterns, and preferences.

The data warehouse with analytical data marts architecture is the most commonly used architecture today because of the limitations of databases and data mining tools. Most data mining tool vendors require data to be converted into their proprietary format for efficient processing. Technology limitations are discussed in the next section.

Challenge #4: The Right Technology
The right technology begins with the right foundation: the database. Effective data mining depends on a comprehensive and robust data warehouse, not a summarized data mart, because it’s difficult to predict the specific attributes that will contribute to a data mining model. Some companies are trying to do data warehousing with a database that was designed for OLTP—operational processing of high-speed transactions. The operations performed in databases optimized for OLTP—adding, deleting, modifying records, and other row-level update functions—are quite different from those that are necessary to analyze large volumes of historical data, and therefore require different database capabilities.

Scalability and Performance
To get a higher return on their data warehousing investments, data warehouse users are asking more complex questions that require access to large amounts of data. As data volumes and the complexity of the business problems grow, analyses will inevitably take longer to process on platforms with limited scalability, requiring the search for new ways to accelerate the data mining process. Users who analyze data warehouses that scale to the multi-terabyte range struggle with desktop and client/server data mining tools that do not scale to meet their requirements. This has required data mining to move from desktop and general-purpose toolboxes on client/server configurations to enterprise applications on massively parallel processing configurations. Several database vendors provide in-database approaches to data mining, which make for more efficient data processing. Mining directly in the database streamlines the data mining process by eliminating data movement and leveraging the parallelism of the database engine for the performance and scalability required to analyze large volumes of detail data.

Data I/O
As large volumes of data are processed and models are deployed across the enterprise, the I/O required by most tools creates a network bandwidth problem. As gigabytes and even terabytes are moved from database to analytic server to business server, the I/O puts a strain on the entire enterprise network. In-database mining eliminates the I/O issues by moving the functions to the data instead of moving data to the functions.
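As a purely illustrative sketch of “moving the function to the data,” the following uses an in-memory SQLite table as a stand-in for a warehouse and computes per-customer aggregates with a single SQL statement, so only summary rows, not the detail rows, ever leave the engine. The table, columns, and values are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE txn (customer_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO txn VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 15.0), (2, 5.0), (2, 10.0)],
)

# The aggregation runs inside the database engine; a client/server
# mining tool would instead pull every detail row across the network
# before computing the same summary.
scores = con.execute("""
    SELECT customer_id,
           COUNT(*)    AS frequency,
           SUM(amount) AS monetary
    FROM txn
    GROUP BY customer_id
    ORDER BY monetary DESC
""").fetchall()

print(scores)
```

Scaled up to a parallel warehouse, this is the I/O argument in miniature: the network carries one summary row per customer rather than the full transaction history.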

Tools
The right technology includes tools that provide a comprehensive set of statistical and machine-learning functions along with visualization and data pre-processing techniques. Many tools provide a sophisticated set of analytical algorithms and graphical interfaces. However, they fail to provide a robust set of data visualization and data pre-processing functions. Since the bulk of the data mining process is spent exploring and conditioning data, you need tools that will facilitate data exploration, visualization, transformation, and data management. Tools must also process large data volumes and provide an interface that enables integration of analytical models into business applications.
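A minimal sketch of the data conditioning work that dominates a mining project, showing two common pre-processing steps: missing-value imputation and min-max rescaling. The column name and values are invented for illustration.

```python
def fill_missing(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale to [0, 1] so attributes with large numeric ranges
    don't dominate distance-based mining algorithms."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical "age" column with gaps, conditioned before modeling.
ages = [34, None, 58, 22, None, 41]
clean = min_max_scale(fill_missing(ages))
print(clean)
```

Steps like these are why a toolset needs strong exploration and transformation support and not just algorithms: the conditioning has to be repeated every time the data or the model changes.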

Packaged and Custom Solutions
The availability of packaged data mining solutions and templates is increasing, especially for certain classes of applications in specific industries—such as credit card fraud detection, credit risk modeling, customer attrition, and cross-selling in the banking industry. Your decision to use a packaged solution or to build a custom data mining solution is a very important one. For specific business problems in specific industries, a packaged solution may be appropriate. However, two organizations could have similar business questions but different data and different needs. Some degree of customization, and even retrofitting, of a packaged offering will be required, making a true “shrink wrap” solution difficult to achieve.

It’s also important to note that building a custom data mining solution does not necessarily mean “starting from scratch.” Highly qualified data mining professionals can often help you build a customized solution, while simultaneously achieving the same objective that packaged solutions offer—significantly accelerating the time to build and implement a solution.

Regardless of the direction chosen, it’s important to carefully assess any data mining solution based on your specific criteria and your underlying business principles.

Summary
To develop analytic solutions that can be applied throughout your enterprise, you need a powerful infrastructure built for analytic processing. The sheer volume of data being created and captured can cause massive bottlenecks in decision flow: thousands of variables, millions of transactions per day, and millions of customers. Reports and OLAP techniques provide the capabilities for navigating massive data warehouses, but not all of the insight required to stay ahead of the competition. Data mining offers the analytic foundation to unlock additional intelligence from your enterprise data warehouse.


Arlene Zaima is the data mining marketing manager for Teradata, a division of NCR. arlene.zaima@ncr.com


James Kashner is the chief technology officer of data mining for Teradata, a division of NCR, and co-founder of Teradata’s Data Mining Lab. james.kashner@ncr.com