Grid Computing Accelerates BI Analytics

by Stephen Swoyer
When scientists at the National Institute of Environmental Health Sciences (NIEHS) felt that a lack of computing and analytic horsepower was slowing down their groundbreaking research into the environmental causes of cancer, they developed a unique solution marrying data mining with grid computing.

Over the last decade, NIEHS researchers advanced our understanding of human biology by identifying the first breast cancer gene, along with a gene that suppresses prostate cancer. It was at NIEHS that researchers first demonstrated the deadly effects of asbestos exposure, the developmental impairment of children exposed to lead, and the health effects associated with urban pollution.

The groundbreaking work done at NIEHS is enabled first and foremost by the innovations of its researchers, one of whom was a recipient of the 1994 Nobel Prize in Medicine. But NIEHS researchers also rely on sophisticated data modeling, data mining, and analytic software programs. It’s doubtful, after all, that even the most innovative of Nobel Prize-winning research scientists could make sense of the more than three billion chemical base pairs in human DNA without the help of a data warehouse.

Not surprisingly, then, NIEHS scientists rely upon a variety of homegrown and commercial data mining and analytic software applications to support their research efforts. Among other packaged software vendors, NIEHS has tapped the expertise of SAS Institute Inc., perhaps because they’re next door neighbors—SAS is in Cary, North Carolina, NIEHS in Research Triangle Park—or, more likely, because SAS is one of the most respected names in data mining and analytics. Regardless, NIEHS has deployed SAS Enterprise Miner data mining software, along with other SAS applications.

According to IT security officer and systems administrator Roy Reter, NIEHS research scientists are using SAS to mine extremely large datasets that can tax even the beefiest of server hardware platforms. These datasets aggregate not just human genetic data, but also air quality data and other environmental variables. As a result, Reter admits, he wasn’t all that surprised when a team of scientists performing research into the environmental causes of cancer expressed frustration with the limitations of their server horsepower. They were dealing, after all, with multi-terabyte datasets.

What did surprise him, Reter acknowledges, was a suggestion from one NIEHS researcher that they use a distributed processing technology called grid computing to scale SAS’ data mining software across several dozen different servers. “We had built several grid computers for our scientists to use, and one of them in particular came to us with the idea that we could do [environmental cancer research] on a grid,” he explains. (There’s more about grid computing in our BI Backgrounder, which follows this case study.)

Intrigued, NIEHS researchers contacted a colleague at SAS, who helped them to load instances of SAS on 32 individual Linux servers. NIEHS and SAS researchers then used an SAS tool called SAS/Connect to intelligently distribute application data to each of the servers. SAS/Connect is based on an SAS feature called MP Connect, which allows multiple SAS sessions to run in parallel, each handling a portion of a larger job. Theoretically, MP Connect can distribute a workload to an unlimited number of systems across a network.
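
SAS/Connect and MP Connect have their own session-oriented syntax, which isn’t shown here. Purely to illustrate the underlying pattern (split a large job into slices, run the same analysis on each slice in parallel, then merge the partial results), here is a minimal Python sketch; the chunking scheme and the score_chunk function are hypothetical stand-ins, not SAS code.

```python
# Illustrative fan-out/fan-in only; not SAS/Connect syntax.
from concurrent.futures import ProcessPoolExecutor
from statistics import mean

def score_chunk(chunk):
    """Stand-in for the analytic work one parallel session would perform."""
    return {"rows": len(chunk), "mean_value": mean(chunk)}

def split(data, n_workers):
    """Divide the dataset into roughly equal slices, one per worker."""
    size = max(1, len(data) // n_workers)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    dataset = list(range(1_000_000))        # placeholder for a large dataset
    chunks = split(dataset, n_workers=32)   # one slice per "server"
    with ProcessPoolExecutor(max_workers=32) as pool:
        partials = list(pool.map(score_chunk, chunks))  # fan out the slices
    total_rows = sum(p["rows"] for p in partials)       # fan in / merge
    print(total_rows, "rows scored across", len(partials), "parallel workers")
```

In the NIEHS configuration the 32 slices run on 32 physical Linux servers rather than local processes, but the fan-out/fan-in structure is the same.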

The result was what’s called a grid: 32 servers connected as a single system. The beauty of a technology like SAS/Connect, Reter says, is that it allows an application to run unmodified across a grid. As a result, NIEHS researchers didn’t have to rewrite their applications to communicate with each of the distributed instances of SAS. As far as the applications are concerned, then, they’re communicating with one (admittedly supercharged) instance of SAS.

The upshot, says Reter, is that deploying applications to exploit multiple instances of SAS would have been almost impossible outside of the context of a grid: “Basically, you’d be looking at manually trying to break up a process against 32 computers, so you’re looking at the cost of 32 computers, and then having to run their own instance, then you manually having to break apart that process.”

When NIEHS researchers finally tested their applications on the SAS grid, they found that the scalability benefits were enormous, says Reter. “The SAS grid has helped us to reduce by up to 95 percent in some cases the execution time required for these key projects,” he reports. “I think the particular test that we did with this ran just short of a day, and probably just running one piece of this on one computer would have taken over a week.”

With 32 Intel servers and instances of SAS running on each of them, the NIEHS’ grid isn’t exactly inexpensive. Still, Reter refuses to speculate about the cost of achieving similar performance using only one very large system. In the first place, NIEHS probably wouldn’t be able to purchase an Intel-based system large enough to match the performance of its 32-system grid. Instead, the research institute would have to invest in larger and more expensive systems from vendors such as Hewlett-Packard Co., IBM Corp., or Sun Microsystems Inc.

As a result of this experience, Reter says he’s now a grid computing enthusiast. He acknowledges that the technology isn’t a good choice for all applications (see accompanying article), but says that for organizations that support data mining or analytic applications that exploit large datasets, grid computing is the way to go.

“I think you’re looking at a revolutionary way of analyzing large amounts of data in a way that’s just not practical otherwise,” he comments. “I know for us, with the microarray, air quality, and genetic data that our scientists are looking at, you’re definitely looking at very large amounts of data. How else are you going to economically analyze it all?”

As for ROI, says Reter, it’s a no-brainer. To the extent that the SAS grid furthers cancer research even one iota, he observes, it will more than have paid for itself. Because the grid has so drastically ratcheted up the performance of some of the NIEHS’ key cancer research applications, it has more than delivered the goods. “It’s amazing the power that grid computing has given us at such a reduced cost,” he concludes.

BI Backgrounder: Are Grids and BI Set to Converge?

Chances are that you’ve heard of grid computing, although there’s an equally good chance that you haven’t given much thought to its potential usefulness for business intelligence (BI) or data mining. The irony, of course, is that since their inception, grid computing technologies have been used extensively to support applications such as data mining and analytics.

After all, grid computing has had its proving ground in a variety of highly successful public computing projects, including the University of California Berkeley’s SETI@Home distributed computing project (see http://www.setiathome.com for details). When you think about it, SETI@Home is nothing less than a massive distributed data mining effort, parceling out data collected by radio telescopes to hundreds of thousands of users who download a portion and analyze it on a client module that runs on their computer when it’s not in use. Once complete, the results of their analysis are uploaded to a centralized server.
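
The work-unit pattern behind a project like SETI@Home can be sketched in a few lines. The toy Python example below is only a conceptual illustration: the queues, the analyze function, and the “peak value” metric are hypothetical stand-ins, and this is not the actual SETI@Home client/server protocol.

```python
# Conceptual coordinator/worker loop in the spirit of public-computing
# projects; all names and the "analysis" are hypothetical stand-ins.
import queue

work_units = queue.Queue()   # work parceled out by the central server
results = queue.Queue()      # finished analyses sent back to the server

# Pretend telescope data, pre-chunked into small work units.
for unit_id in range(8):
    work_units.put((unit_id, list(range(unit_id, unit_id + 100))))

def analyze(unit):
    """Stand-in for the client-side analysis done while a PC sits idle."""
    unit_id, samples = unit
    return unit_id, max(samples)   # e.g. the strongest "signal" in the chunk

# Each pass through the loop plays the role of one volunteer client.
while not work_units.empty():
    results.put(analyze(work_units.get()))

# The central server collects and reports the returned results.
while not results.empty():
    unit_id, peak = results.get()
    print(f"work unit {unit_id}: peak value {peak}")
```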

At least one vendor (SAS) argues that grid computing is a natural fit not just for data mining but for other BI applications as well, many of which also work with large data sets. More to the point, says Tho Nguyen, program director of data integration with SAS, many customers are already exploiting technologies similar to grid computing and may not even realize it.

As a result, when customers ask about grid computing, Nguyen says, “We try to explain to them that you’ve already been doing it, either by parallel processing or distributing workloads across a network. These things have been utilized already, but grid computing gives [customers] a more efficient way to utilize them. We’re finding that some customers are coming to us because they understand the potential value here.”

How does grid computing enable greater efficiencies than parallel or distributed processing, both of which have been mainstays of data mining for quite some time? For starters, grid computing isn’t a strictly server-centric proposition. Instead, it proposes to exploit the unutilized or underutilized power of all computing resources in a network environment—including desktop PCs.

The typical desktop PC has changed a lot over the last 20 years: The term “PC” may once have described a low-end machine powered by an 8-MHz 8088 or 80286 microprocessor and outfitted with scanty memory resources, but today’s “PC” is more properly an entry-level server. That’s because it often sports a 1-, 2- or even 3-GHz processor, hundreds of gigabytes of hard disk storage and—frequently—a gigabyte or more of memory under its hood.

SAS’ Nguyen says that the success of initiatives such as SETI@Home has demonstrated that idle processing power in client workstations can be exploited in a grid. In an enterprise grid, where workstations are connected on dedicated internal networks and aren’t subject to the vicissitudes of the Internet, he suggests, this value proposition is even stronger. “There’s an opportunity there to take advantage of that computing horsepower, which is underutilized during the day and which is typically unutilized during off hours,” he argues.

SAS is pushing this argument with SAS/Connect, even though it has publicly trumpeted only one customer win: the National Institute of Environmental Health Sciences (NIEHS), which exploits a dedicated grid of 32 connected servers rather than idle client workstations. (See our Case Study for details.)

In fact, says Roy Reter, IT security officer and systems administrator with the NIEHS, the idea of tapping under-utilized client processing power—while certainly intriguing—probably isn’t a good fit for his organization. “A couple of us in the systems administration group have thought about that, but right now, it’s kind of hard to do, due to the fact that the scientists do their work around here, science goes on around here 24 hours a day, five days a week,” he concedes.

Nevertheless, says Nguyen, there’s an opportunity for many customers to use a technology like SAS/Connect to find idle computer resources and put them to work. “[SAS/Connect] enables the grid computing technology by identifying the computers in the network and going out there and using them,” he explains. “We’re offering this to customers who have a need today, but we plan to evolve it and add more intelligence to it within probably the next six to twelve months, working with existing customers as well as potential customers to really identify what features they most want to see.”

Where’s the Market?
Market research firm Insight Research recently projected that worldwide grid spending will grow by almost 2000 percent over the next five years, from $250 million this year to almost $5 billion by 2008. Although no projections are available, it’s likely that demand for BI or data mining grid solutions will account for a very small percentage of that total.

Still, SAS isn’t alone in talking up a potential convergence of BI and grid computing. A couple of grid computing pure plays—Avaki Corp. and Platform Computing Inc.—have successfully marketed grid-based BI solutions to Fortune 1000 stalwarts Pfizer Inc. and Advanced Micro Devices Inc.

Avaki, for example, ships Data Grid 4.0, which it positions as a data aggregation platform for distributed environments. More precisely, says Craig Muzilla, vice president of marketing and strategy with Avaki, Data Grid 4.0 is a mature solution for enterprise information integration (EII). “We first came out with a J2EE-based product in the fall of last year, and that focused on data problems, [such as] how do you provision unstructured data or flat file data across an organization,” he explains. “Now, with [Data Grid] 4.0, we’ve added relational capabilities, so that you can set up an SQL statement or a stored procedure and bring that data into the grid, cache that, and do manipulation or aggregation of the data.”
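
As a rough sketch of that “query, cache, then aggregate” idea, the Python snippet below uses sqlite3 as a stand-in data source; Avaki Data Grid has its own interfaces, and nothing here reflects its actual API.

```python
# Hypothetical illustration of provisioning relational data into a local
# cache and aggregating the cached copy; not Avaki Data Grid code.
import sqlite3

# Stand-in "source system": a small relational table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (region TEXT, amount REAL)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 120.0), ("east", 80.0), ("west", 200.0)])

# Bring the result of a SQL statement into a local cache.
cache = src.execute("SELECT region, amount FROM sales").fetchall()

# Downstream manipulation and aggregation run against the cached copy,
# not against the source system.
totals = {}
for region, amount in cache:
    totals[region] = totals.get(region, 0.0) + amount
print(totals)   # {'east': 200.0, 'west': 200.0}
```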

Why on earth would an organization choose to exploit grid computing to further its EII efforts? For the simple reason, says Muzilla, that grid vendors have already solved many of the security and provisioning issues that the EII point players are only now starting to tackle. “Using a grid, you can give local data owners the chance to manage their resources without going to a central administrator to manage security and provisioning.”

Excepting SAS, traditional BI players have been slow to warm to grid computing. The opposite has been the case in the grid community, where pure play Platform Computing last year established an original equipment manufacturer (OEM) relationship with BI powerhouse Cognos Inc., under the terms of which it agreed to OEM Cognos’ PowerPlay OLAP tool along with Cognos’ Upfront portal. Platform markets a grid solution for corporate performance management (CPM) called Platform Intelligence.

Analysts are intrigued by a possible convergence of BI and grid computing but suggest that grids aren’t ideal for most or even many BI applications. Says Doug Laney, vice president and director of technology research service with consultancy META Group: “The need to distribute data and then hit the data hard with a lot of CPUs is decidedly analytic in nature, but doesn’t really follow into an operational scenario as much. If you look at the highly publicized scenarios, it’s strictly for analytic purposes, but only certain kinds of analytics lend themselves to being chunked like that.”

That’s the rub, says Mike Schiff, a principal with data warehousing consultancy MAS Strategies. Computational grids grew out of the academic and theoretical computing spaces, and haven’t caught on as quickly for conventional business applications, which typically deal with transactional or operational data rather than static or very large datasets. As a result, Schiff says that BI grids are a “future technology” that most shops aren’t seriously evaluating right now.

SAS, for its part, claims that it has had some success selling BI grids to non-traditional customers. For example, says Nguyen, a major financial institution has deployed an SAS grid to manage millions of credit card customers and to mine petabytes—yes, petabytes—of historical data. Like the NIEHS, this financial institution is searching for patterns, trends, or other anomalies across literally years of historical information.

Nguyen believes that deep analysis on historical data is one application that could broaden the appeal of BI grids. “What [an SAS customer that is a] financial institution as well as the NIEHS is doing is looking at years and years of data. They’re collecting back to five years ago, trying to see if there are some trends, some anomalies, things like that,” he explains. “Most of these customers have terabytes of data, but I am anticipating that it will eventually escalate to petabytes. It’s just not practical to keep all of this [data] in a data warehouse.”

Even some grid advocates have their doubts, however. “I’m not convinced that there are enormous unsolved warehousing problems, [I think] that there’s less really relevant unsolved data problems than people think,” says Brian MacDonald, a product marketing manager with grid pure play Platform Computing. “I think that some people believe that what they just need is an enormous data warehouse, and if only they had better ETL tools that could help. It’s not clear that there’s as much of a demand for that, although if you’re going to do it, it would make sense to use a grid.”

META Group’s Laney believes that there’s a potential market—albeit a small one—for BI grids of this kind. “It’s not unreasonable to think that there’s a lot of untapped computing power during the dark of the night, so if there’s a way to tap that, then somebody is going to do it,” he agrees. “To the extent that some analytic processes require background processing, non-interactive processes that look for patterns, look for trends, those are the kinds of solutions that lend themselves to grid computing.”


Stephen Swoyer is a technology writer based in Athens, Ga. You can contact Stephen via E-mail at swoyerse@percipient-analytics.com.