Business Intelligence Best Practices


Inside Data Mining Freeform Text

As data storage continues to grow at a rapid clip, data mining technologies continue to evolve as key tools for analytical work on large data stores. These technologies help businesses make sense of, and wring competitive advantage from, their ever-growing storehouses of data.

Data mining technologies are widely applied across diverse industries (from lending institutions to wireless carriers) and occupations (from direct marketers to collections agents). Using your data to drive more profitable business decisions moves data mining from being an interesting research project to a mission-critical business initiative.

Cheap Storage, Valuable Information

Over the last 20 years, the growing size and sophistication of databases coupled with constantly decreasing storage prices have helped businesses amass very large amounts of data throughout the enterprise.

A 2002 Mentis survey of U.S. banks and thrifts with at least one currently operational data warehouse indicated that almost one-third of the respondents had, or were planning, databases of 500 gigabytes to 1 terabyte. Terabyte-sized databases, once exclusively the province of Fortune 10 companies and governments, are now becoming commonplace. In 1957, the first hard drive, IBM’s RAMAC, cost about $100,000 for 5 megabytes of storage—roughly $20 million per gigabyte. In 1998, hard drive storage averaged $60 per gigabyte—more than 300,000 times cheaper. Today, as the trend continues, you can purchase 200+ gigabyte hard drives for as little as $150.

Large quantities of cheap storage can simplify business decisions about what data to store: buy more cheap storage and store all the information. While data mining has made this data valuable—indeed, in many areas, indispensable—there is still more business value to be gleaned from this abundance of data. But what benefits can you realize from these large quantities of data? What business advantages can you derive from your enormous data stores?

The trend is clear—rationalized, data-driven decisions are replacing intuitive decision making based primarily on experiential judgment. Businesses that can develop data-driven decisions quickly and accurately are competing more successfully in the marketplace.

What businesses really need for the volume of data is information. Who are my most profitable customers? What customers might be interested in Product A or Product B? How many customers will I lose this quarter and who are they? Which customers are likely to pay on time and which might declare bankruptcy?

Although data mining can be used to respond to specific questions, data mining techniques are better suited to discovering patterns and relationships that exist in the data.

In this sense, data mining differs from other analytical methodologies (such as business intelligence) because it provides useful information that you might not have known you wanted and almost certainly did not know how to ask for. In other words, you do not have to ask explicit questions. Instead, you can use data mining technologies to expose relationships between previously disparate data elements, enabling valuable observations such as: “Eighty percent of our customers who bought a 40-inch plasma TV and then bought new speakers ended up buying a surround sound system within 60 days.”
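The kind of pattern behind the plasma-TV observation above can be sketched as an association-rule confidence calculation. The following Python sketch uses invented transaction data; the item names and the resulting 80 percent figure are illustrative only.

```python
# Hypothetical sketch: computing the confidence of an association rule
# ("customers who bought A and B also bought C") from transaction data.
# The transactions below are invented for illustration.

def rule_confidence(transactions, antecedent, consequent):
    """Confidence = fraction of antecedent-matching transactions
    that also contain the consequent."""
    matching = [t for t in transactions if antecedent <= t]
    if not matching:
        return 0.0
    return sum(1 for t in matching if consequent <= t) / len(matching)

purchases = [
    {"plasma_tv", "speakers", "surround_sound"},
    {"plasma_tv", "speakers", "surround_sound"},
    {"plasma_tv", "speakers", "surround_sound"},
    {"plasma_tv", "speakers", "surround_sound"},
    {"plasma_tv", "speakers"},   # bought TV and speakers, no system
    {"speakers"},                # speakers only -- outside the rule
]

conf = rule_confidence(purchases, {"plasma_tv", "speakers"}, {"surround_sound"})
print(f"confidence: {conf:.0%}")  # -> confidence: 80%
```

In practice, discovery algorithms enumerate and score many such candidate rules rather than testing one chosen by hand.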

Data mining is an iterative process with several components. The discovery component is used to initially find the relationships in your data. The analysis component is used to determine what business value these relationships might provide and what new insights can be gained from them. This analysis helps generate more targeted discovery work. The examples below provide additional insight into this process.

Data-Driven Customer Understanding

After 9/11, air travel diminished substantially, to the point where commercial airlines received a government bailout. Across the travel and leisure industry, profits were down, but by utilizing data mining (specifically, customer relationship analytics techniques), travel and leisure businesses were able to learn more about customer behavior.

This analysis revealed that most customers were coming from local areas. People were not traveling to destination resorts, but instead were visiting places near home, and were driving, rather than flying.

Based on this discovery, many resorts and lodges were able to adjust their marketing tactics. Instead of marketing in geographically remote areas, they began to offer “weekend getaways” in advertisements placed in local media.

The bottom line: In a relatively short period, occupancy rates were dramatically improved. Driven by the revised marketing programs, these businesses were able to make up some of the shortfall by targeting their customers.

Mining Freeform Text

Until recently, regardless of application, data mining technologies have been almost exclusively employed to analyze structured data (e.g., database fields such as DateLastPaid, AmountDue, CrLimit, ZipCode, and ItemNumber). Structured data mining is a well-worked field that has been extensively explored for decades. Researchers, statisticians, and modelers (among others) have developed sophisticated technologies for mitigating risk, understanding relationships between customers and products, and enabling data-driven business strategies.

However, the last few years have seen significant market penetration from the next generation of data mining technologies—data mining of freeform text, also known as text mining.

Text mining transforms and mines the freeform text data in customer records, such as textual data from customer emails, customer service notes, or from collections notes. In addition, some text-mining technologies evaluate the semi-structured data in these records.

The business objectives in text mining are identical to those for structured data mining and, indeed, the two can be integrated to provide better analytical results than either can separately.

Text mining is actually an umbrella concept that represents a collection of fundamental capabilities. To understand how text mining works, it is useful to understand what functions are performed and how they are provided. At a conceptual level, text mining functionality can be divided into a few major categories. Recognizing that many taxonomies of functionality are possible, we will describe the basic functions of cleaning, categorization, extraction, and modeling.

The techniques used to deliver text-mining functionality fall into one of two broad classes: natural (or symbolic) language processing (NLP) and statistical language processing (SLP).

NLP techniques generally rely on a rule-based understanding of human language, recognizing the roles of parts of speech such as nouns, adjectives, and verbs and other elements of syntax and semantics. SLP techniques, on the other hand, utilize a statistical analysis of the text to identify patterns in the data rather than rely on a prior understanding of natural language rules. Frequently, functions such as cleaning, categorization, extraction, and modeling mix both natural and statistical language processing techniques to accomplish a task. This section briefly discusses each of the major text mining functions, along with typical methods and some of the common business uses.

Cleaning prepares content for downstream use, and is typically a fundamental component of any application. For many real-world uses, the available text is in less than pristine condition. Cleaning, much like the extraction, transformation, and load processes (ETL) associated with standard structured data, can represent a major part of the processing effort.

Cleaning tasks include:

  • Filtering: the separation of the potentially valuable content from the irrelevant; for example, separating the header in a call center log from the body.
  • Repair: the correction of spelling, punctuation, or other elements in the content that will be required for more effective downstream processing.
  • Normalization: the mapping of multiple expressions to a common concept; this task includes the replacement of synonyms and hyponyms, the expansion of abbreviations, stemming or tense adjustment, and other tasks relevant to the reduction in language variation. This task partially addresses two of the major difficulties associated with text mining efforts of even moderate complexity: one concept can be expressed in many different ways, and one word can be used to express many different concepts. In database terminology, this is an example of the dreaded many-to-many relationship.

In terms of techniques, a mix of both symbolic and statistical language processing methods is typically utilized. For example, a symbolic technique known as a digital thesaurus can be used to automatically substitute synonyms (e.g., substituting satisfied for pleased), whereas statistical techniques can be applied to correct spelling automatically, using corrections learned from examples of misspellings.
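To make the cleaning step concrete, here is a minimal Python sketch that combines a hand-built thesaurus (a symbolic technique) with edit-distance spelling repair against a known vocabulary (a statistical technique). The word lists are invented for illustration; a production system would use far larger resources.

```python
# Illustrative normalization sketch: synonym substitution plus
# similarity-based spelling repair. SYNONYMS and VOCABULARY are
# invented stand-ins for real cleaning resources.
import difflib

SYNONYMS = {"pleased": "satisfied", "glad": "satisfied"}    # digital thesaurus
VOCABULARY = ["customer", "account", "payment", "service"]  # known-good terms

def normalize_token(token):
    token = token.lower()
    token = SYNONYMS.get(token, token)              # symbolic: synonym map
    if token not in VOCABULARY:
        # statistical: snap near-misses to the closest known word
        close = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
        if close:
            token = close[0]
    return token

def clean(text):
    return " ".join(normalize_token(t) for t in text.split())

print(clean("Customar pleased with servce"))
# -> customer satisfied with service
```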

In categorization, text content is automatically assigned to one or more categories based on some measure of similarity. If the categories are predefined and seeded with an initial set of examples, the process is known as supervised learning. If the categories are derived by statistical algorithms with no training examples, the process is defined as unsupervised categorization or clustering.

The categorization techniques applied in text mining are most often examples of statistical language processing and are similar to the techniques applied to purely structured data. The difference is that the algorithms typically applied to text-based problems must contend with the greater dimensionality and noise that text implies. Some of the more common and successful machine-learning algorithms applied to text are kNN, SVM, Naive Bayes, and C4.5, or variants of tree classifiers.
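As a concrete illustration of the supervised case, the following is a minimal Naive Bayes text classifier in plain Python. The category names and the handful of training examples are invented; real deployments train on large labeled corpora and use many refinements beyond this sketch.

```python
# Minimal multinomial Naive Bayes text categorizer with Laplace
# smoothing. Training data and category labels are invented examples.
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (text, category). Returns a model tuple."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    doc_counts = Counter()               # category -> number of documents
    vocab = set()
    for text, cat in labeled_docs:
        words = text.lower().split()
        word_counts[cat].update(words)
        doc_counts[cat] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, text):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_cat, best_score = None, float("-inf")
    for cat in doc_counts:
        score = math.log(doc_counts[cat] / total_docs)       # class prior
        total_words = sum(word_counts[cat].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words do not zero out a category
            score += math.log((word_counts[cat][w] + 1) /
                              (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

model = train([
    ("late payment fee dispute", "billing"),
    ("refund charge on my bill", "billing"),
    ("phone screen is cracked", "hardware"),
    ("battery will not hold a charge", "hardware"),
])
print(classify(model, "dispute a fee on my bill"))  # -> billing
```

The same routing logic could sit behind an e-mail triage system, sending each incoming message to the specialist group whose category it matches.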

Text categorization is used for a variety of real-world problems, including routing customer e-mails to the appropriate specialist for service, placing competitive intelligence documents into appropriate bins for later analysis, and mapping call center notes to the appropriate product issue category.

Extraction involves pulling content of interest directly from the text. There are several types of extraction, each differing significantly in the technological complexity required. One type of extraction is key word and phrase extraction—selecting the most descriptive or relevant words and phrases that capture the essence of the content.

A similar but more complicated type of extraction is summarization, which involves ranking individual sentences by importance or relevance, then selecting and combining the highest-ranking sentences into a readable summary of the content.
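A minimal sketch of this rank-and-select approach, scoring sentences by the frequency of the words they contain; the sample text is invented, and real summarizers use far richer relevance signals.

```python
# Frequency-based extractive summarization sketch: score each sentence
# by average word frequency, keep the top-ranked sentences in their
# original order. Sample text is invented.
from collections import Counter

def summarize(text, num_sentences=1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    word_freq = Counter(w.lower() for s in sentences for w in s.split())
    def score(sentence):
        words = sentence.lower().split()
        return sum(word_freq[w] for w in words) / len(words)
    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return ". ".join(s for s in sentences if s in ranked) + "."

text = ("Customers report billing errors every month. "
        "Billing errors frustrate customers. "
        "The cafeteria menu changed on Tuesday.")
print(summarize(text, num_sentences=1))
# -> Billing errors frustrate customers.
```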

Still another form of text mining extraction is entity extraction. This involves the automatic identification of proper names—people, places, companies, and so on. Competitive intelligence and similar activities can benefit significantly from such automation.
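A deliberately simplified sketch of entity extraction follows: a regular expression proposes runs of capitalized words as candidates, and a small gazetteer (here, an invented lookup table) labels the ones it recognizes. Real entity extractors combine such lookups with statistical disambiguation.

```python
# Toy entity extraction: capitalized-word candidates filtered through
# a gazetteer. GAZETTEER is an invented, illustrative lookup resource.
import re

GAZETTEER = {"Acme Corp": "COMPANY", "Jane Smith": "PERSON",
             "Seattle": "PLACE"}

def extract_entities(text):
    # candidate = one or more capitalized words in sequence
    candidates = re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)
    return [(c, GAZETTEER[c]) for c in candidates if c in GAZETTEER]

note = "Jane Smith of Acme Corp called from Seattle about her invoice."
print(extract_entities(note))
# -> [('Jane Smith', 'PERSON'), ('Acme Corp', 'COMPANY'), ('Seattle', 'PLACE')]
```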

Possibly the most sophisticated kind of extraction is role or relational extraction. Facts are automatically extracted from textual content through a process that involves parsing the sentence into subject, verb, and object relationships. With this structure understood, the data can be stored relationally and retrieved by action (verb), actor (subject), or recipient (object). Though this kind of extraction can be powerful, it typically requires grammatically correct content not often found in business applications, as well as significant understanding of the application domain and sophisticated text-mining technology.

Extraction heavily utilizes both statistical and symbolic natural language techniques. For example, gazetteers (an NLP tool) may be used to look up potential named entities during extraction, but statistical techniques are also used to identify such structural elements as sentence boundaries, a task required to increase the accuracy of the named-entity extraction process.

Modeling is perhaps the most operationally relevant of the text-mining functions. It combines key elements or attributes of the text with structured data attributes to either describe or predict a phenomenon. Thus, it is more of a mixed data (i.e., combining structured and unstructured data) modeling solution.

Modeling with text comes in two primary forms: descriptive and predictive. Descriptive modeling examines an outcome and attempts to identify the data attributes that explain why the outcome occurred; it is useful for understanding business-relevant phenomena such as attrition, customer loyalty, and lifetime value. A descriptive model combines structured customer data, such as the date and time of activity, frequency, and monetary data, with unstructured data, such as direct customer communications complaining about price, product, or service. By combining these data, the model can explain, statistically speaking, why customers churn or why share of wallet may be down. Listening to your customers by analyzing e-mails, chat sessions, or inbound and outbound call center notes is possibly the best way to understand them.

A second form of text modeling is predictive modeling. Describing past behavior can be beneficial, but predicting future behavior can be more valuable. As with descriptive modeling, predictive modeling requires an outcome (i.e., what happened, such as customer attrition) and historical data about those customers. In predictive modeling, however, statistical learning techniques, such as linear or logistic regression, capture regularities in the data that enable the prediction of future behavior in similar accounts, permitting proactive treatment of the account.
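As an illustration of predictive modeling on mixed data, the toy sketch below combines two structured attributes with one text-derived flag and fits a logistic regression by gradient descent. All feature names, data values, and the text-derived flag are invented for illustration.

```python
# Toy mixed-data predictive model: logistic regression over
# [tenure_years, late_payments, text_mentions_complaint] -> attrited.
# Data and features are invented; real models use many more of both.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    w = [0.0] * (len(X[0]) + 1)                    # weights plus bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi + [1.0])))
            for j, xj in enumerate(xi + [1.0]):    # gradient ascent step
                w[j] += lr * (yi - p) * xj
    return w

def predict(w, xi):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi + [1.0])))

X = [[5, 0, 0], [4, 1, 0], [1, 3, 1], [2, 2, 1], [6, 0, 0], [1, 4, 1]]
y = [0, 0, 1, 1, 0, 1]   # 1 = customer attrited
w = fit_logistic(X, y)
print(f"attrition risk: {predict(w, [1, 3, 1]):.2f}")  # near 1 = high risk
```

Note how the third feature, derived from text (a complaint mentioned in call notes), carries signal the structured fields alone would miss; that is the essence of the mixed-data advantage described above.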

The operational uses of this type of modeling are many, particularly in areas related to customer relationship management. As with categorization, modeling relies almost exclusively on statistical technology. Many of the algorithms applied to modeling using only structured data have been used in mixed data mining as well, but systems that are designed specifically for more complex mixed data problems often perform substantially better.

The Text Mining Vendor Continuum

Text mining vendors come in a variety of sizes, from very large tool vendors that have tacked text mining software onto existing structured data product offerings, to startups that focus on text and text only, to the exclusion of the broader enterprise data. Text mining products can be broken into four categories that reflect the customer effort required to create value. The categories are listed from least to greatest effort required:

  • Solution vendors provide advanced text mining technology and the domain expertise and services to successfully integrate and deploy the solution
  • Application vendors supply the technology that addresses specific business issues, but the skill to use and implement must reside in house
  • Text mining tool vendors require companies to have an in-house expert or analyst who knows and understands text mining techniques and how best to apply them to the specific business problems; often, IT personnel need to be involved before a tool is fully functional
  • Library vendors provide text mining functionality that must be embedded into existing solutions or applications; libraries are heavily dependent on IT resources to implement

Table 1, while not exhaustive, displays several text mining vendors and the level of sophistication they provide.

Evaluating Vendors

There are several vendors who provide text mining technologies. As such, it is important to understand where they fall in the continuum of sophistication and to be aware of some key capabilities that ensure proper application of text mining techniques, and successful operational deployment.

Mixed Data Capabilities
Does the vendor support mixed data analysis where some information comes from new text mining capabilities and other pieces of information come from traditional numeric data? Some vendors continue to treat these sources separately, forcing the end user to merge the results of the two techniques.

Operational Deployment and Integration
Can the vendor’s solution be operationally deployed at your site or through an ASP (application service provider)? Some vendors provide tools for mining but lack true deployment capabilities to leverage the discoveries and predictive models on a go-forward basis.

Breadth and Depth of Text Mining
How broad and deep are the vendor’s text mining capabilities? Some vendors have adopted a few text mining technologies and attached them to their traditional solutions. Unfortunately, as with many complex problems, text mining has many elements, and each is appropriate only in certain circumstances. Be certain the vendor provides a broad set of well-developed technologies that can be matched to various business problems.

Solution Focus
Does the vendor provide more than just a toolkit of technologies? Look for vendors who have experience solving the kinds of problems you want to solve. Their technology and offerings will be more tailored to your needs and provide less risk and faster time to value.

Text Mining Challenges

Text mining presents new challenges from both a technology and a business standpoint. Unlike structured data, collections of textual data rarely have a consistent internal infrastructure, or metadata. When they do, that metadata more often describes format than content, making the text far more complex to analyze and harder to model.

Another issue is that the structured data elements (such as dates, values, times, and account IDs) are often system-derived and therefore more reliably accurate. Almost all unstructured text is entered into various systems by people, and is therefore subject to human error.

Additionally, text data contains familiar rhetorical elements such as hyperbole and innuendo, as well as less obvious forms of understatement. Consider the customer comment, “I have not been unhappy with your service.” This sentence’s meaning—that the customer is content with the service—is evident to us, but any machine technology employed in reading this same text will have a far more difficult time classifying and interpreting it.

To further complicate matters, text data frequently contains acronyms, abbreviations, and misspellings, as well as multiply aliased terms, where several terms (some of which may be abbreviated or misspelled) map to a single root meaning. For example, a single text record may contain the text strings “customer,” “cust,” “csmr,” “customar,” or sometimes even just “C.,” as in “called C., they said they had just strtd new jb.” This type of mapping makes text more difficult to manage in the ETL process.

ETL Steps
The first step in ETL is to extract the text data you want from diverse databases, applications, legacy data stores, call-center notes, and other systems—a task made difficult by the complexity of the data. Typically, this involves significant preprocessing of the text data to massage it into a consistent format that can be successfully extracted to a single source.

Second, you apply specialized text transformations such as synonym substitutions, spelling corrections, tense modifications, stemming, stopping, phrase extractions, and phrase mappings, to name just a few. The object of these transformations is the same as with any transformation: to optimize the data into a consistent, usable format.

Finally, once you have performed the transformations you want on the extracted text data, you can load the transformed data into your data-mining system, which in itself can be a critical and time-consuming process. The data you load may have multiple representations, such as words, phrases, n-grams, and vectors.
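The representations mentioned above can be sketched in a few lines of Python; the sample text and vocabulary here are illustrative only.

```python
# Sketch of common text representations: word tokens, word n-grams,
# and a term-count vector over a fixed vocabulary (all invented data).
from collections import Counter

def word_ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def to_vector(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

tokens = "customer called about late payment".split()
print(word_ngrams(tokens, 2))
# -> ['customer called', 'called about', 'about late', 'late payment']
vocab = ["customer", "late", "payment", "refund"]
print(to_vector(tokens, vocab))  # -> [1, 1, 1, 0]
```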

Mining freeform text offers additional wrinkles to the traditional challenges of ETL data. A wide range of surveys and studies show that ETL absorbs a substantial percentage of the labor involved in large-scale data warehousing and data mining operations. It is the same with text mining, but with added complexity.

A typical data mining operation on traditional structured data may work with as many as several hundred variables (although some begin with substantially more), and standard structured data mining techniques become impractical beyond a few hundred. Text mining operations, by contrast, typically produce tens or even hundreds of thousands of variables.

A final challenge for text mining processes lies in “operationalizing” the results. Once you have actually analyzed the textual data and organized the results, the next challenge is integration. It’s not enough simply to generate information from your freeform text data—in order to make it profitable, you must integrate this information into your existing operational workflows to help you build actionable, data-driven strategies.

Text Mining Benefits

Despite these difficulties, incorporating text mining technologies into business processes is a growing trend. What makes text mining worth this effort? Indeed, what makes any form of analytics worth the effort?

Ultimately, it is the bottom line. If you care about profits, then you care about better data mining and analytics.

Analytics have two fundamental determinants for quality and applicability—the quality and applicability of the algorithms, and the quality and applicability of the data. While gains from algorithmic improvements are still available, the increase expected is generally marginal compared to the gains that can be realized by adding entirely new sources of relevant data.

Text offers a brand new source of relevant and previously underutilized data that has been shown to provide additional analytic value. Businesses are beginning to obtain substantial improvements to the accuracy and relevance of their analytics initiatives by incorporating data mining of freeform text into their analytics mix. For example, an article in DM Review notes that adding text mining has proven to increase model accuracy routinely by more than 20 percent across a wide range of modeling problems.

Neither data nor text mining technologies on their own solve business problems or directly generate operational information. Mining structured and text data is performed by technologies that encompass a number of disciplines from the worlds of statistics, artificial intelligence, and machine learning. However, in order for these technologies to generate useful business information, they must be correctly applied to a specific business need, and management must be willing to act effectively on the results. In other words, the information generated by mining technologies is only as useful as its business application.

In the financial services industry, text and data mining techniques are used to generate predictive models that enable lenders and financial institutions to understand which customers will respond best to certain offers, which collections accounts to outsource to agencies, and how best to manage credit lines. In the medical and bioinformatics fields, text and data mining technologies are used to generate information about how diseases are transmitted, to uncover medical risk factors, and to assess very large quantities of bioinformatics information. Retailers use these techniques to determine the likelihood of acceptance of a particular offer or promotion and to reduce attrition.

Text Mining and Customer Relationship Management

A major concern of business today involves better understanding of customers. This means understanding what motivates them, how they are segmented, what issues they are having with your products and services, how you are treating them, and how they are reacting to that treatment. In fact, it ideally means understanding them better than your competition, because you are both competing for their business and, in many cases, their wallet share.

The use of text mining in customer relationship management (CRM) initiatives supports the overall goal of developing more loyal, profitable customer relationships. By integrating the mining of structured and freeform text, companies are able to generate more accurate predictive models and gain deeper insight into their customers, and more importantly, better understand their behaviors (e.g., why they might cancel a service or buy a particular product). This enables more accurate customer segmentation for better-targeted marketing programs.

(1. Mike Meyer, “Better Prediction Using Unstructured Data in Mixed-Data Modeling,” DM Direct Newsletter, July 30, 2004; 2. Intelligent Results, “Mixed-Data Analytics Drive Additional Value from Settlement Offers,” Strategy Series, May 2004.)

Text Mining in a CRM Environment
In both of the following examples, taken from actual business cases, text mining provides predictive information as well as descriptive information leading to valuable new understanding of customers.

Why Are They Leaving Us? Attrition in Financial Services
The situation:
A bank observed that approximately 7.8 percent of its customers with marginal delinquency issues (such as a slightly late payment or modest over-limit charging) were canceling after settling their accounts. For one significant population of customers, canceling meant not only closing the charge card at issue but also withdrawing from other bank services, such as checking accounts; they were effectively terminating their business relationship with the bank. This discovery was cause for concern.

The process: To help them better understand this customer behavior, the bank’s analysts ran the customer records in question through a mixed-data-mining process that included looking at the text data and structured customer records. They focused on the customers who had canceled their charge accounts over a selected time period following their delinquency.

The results: The data mining analysis revealed that a high percentage of the canceled accounts belonged to valuable customers whose records contained requests to stop calling, along with explanations for the delinquency. The analysts looked deeper into these accounts and determined that these otherwise good customers were being automatically dropped into dialer queues without regard for the larger banking relationship or previous interactions with the bank. Simply put, after a good customer committed even a very minor credit indiscretion, they were automatically dunned via phone on a daily basis.

Understanding this issue and “operationalizing” the mixed-data-mining process to more accurately segment the delinquent accounts enabled the bank to deploy a different strategy for each account.

Are These Customers Good Guys or Bad Guys? Attrition in Telecommunications
The situation:
A telecommunications provider observed that a number of their good customers would suddenly stop paying their bills for one or two months. After paying up, these customers would leave for unknown reasons. This group represented a significant part of the telecommunications provider’s overall attrition.

The process: In order to better understand why these customers were leaving, the provider analyzed the text of thousands of customer comments taken by service representatives. They categorized this text and generated summary labels to make it easier to investigate areas of interest.

The results: One label in particular jumped out at the analysts: “Trouble with Web site.” How would trouble with a Web site relate to customer attrition? Keyed by this label, the analysts reviewed a number of the records; the text of these records revealed that there was a problem.

Customers who were changing their address using the Web site were not getting their addresses correctly updated in the master record. Accordingly, bills were being sent, sometimes for months, to the old address. When the change finally caught up with the master record, the accounts were most often seriously delinquent and already in the queue for dunning. Most of these customers, upon being provided with a statement, either paid promptly or made arrangements to pay. Many, however, viewed the issue as the provider’s problem and ultimately left for another provider, as they did not agree with the assessed late charges or want to be dunned for statements that had never been delivered.

In fact, most of these customers were good customers, but due to a system glitch, were being incorrectly tagged as bad, and treated accordingly. Ultimately, this discovery, based on mixed-data mining of the customer records, prompted the telecommunications provider to fix its Web-site glitch. Customers were happier, and the provider stopped losing good customers.


Increases in data storage, along with advances in data mining technologies, have enabled businesses to manage their customers more effectively and to understand their behaviors in greater depth. Businesses that care about their bottom lines care about better data mining and predictive analytics. Text mining provides businesses with a substantial leap forward in the accuracy of their analytics by introducing an entirely new class of data, improving bottom-line results through more accurate predictions and greater customer understanding.



Betts, Mitch. “Unexpected Insights From Data Mining,” Computerworld, April 14, 2003, http://www.computerworld.com/databasetopics/data/story/0,10801,80222,00.html.

Intelligent Results. “Mixed-Data Analytics Drive Additional Value from Settlement Offers,” Strategy Series, May 2004.

Meyer, Mike. “Better Prediction Using Unstructured Data in Mixed-Data Modeling,” DM Direct Newsletter, July 30, 2004.