
Grid Computing Accelerates BI Analytics
When NIEHS scientists felt that a lack of computing and analytic horsepower was slowing down their groundbreaking research into the environmental causes of cancer, they developed a unique solution: marrying data mining with grid computing.
Over the last decade, researchers at the National Institute of Environmental Health Sciences (NIEHS) advanced our understanding of human biology by identifying the first breast cancer gene, along with a gene that suppresses prostate cancer. It was at NIEHS that researchers first demonstrated the deadly effects of asbestos exposure, the developmental impairment of children exposed to lead, and the health effects associated with urban pollution.

The groundbreaking work done at NIEHS is enabled first and foremost by the innovations of its researchers, one of whom was a recipient of the 1994 Nobel Prize in Medicine. But NIEHS researchers also rely on sophisticated data modeling, data mining, and analytic software programs. It’s doubtful, after all, that even the most innovative of Nobel Prize-winning research scientists could make sense of the more than three billion chemical base pairs in human DNA without the help of a data warehouse.

Not surprisingly, then, NIEHS scientists rely upon a variety of homegrown and commercial data mining and analytic software applications to support their research efforts. Among other packaged software vendors, NIEHS has tapped the expertise of SAS Institute Inc., perhaps because they’re next-door neighbors (SAS is in Cary, North Carolina, and NIEHS is in Research Triangle Park) or, more likely, because SAS is one of the most respected names in data mining and analytics. Regardless, NIEHS has deployed SAS Enterprise Miner data mining software, along with other SAS applications.

According to IT security officer and systems administrator Roy Reter, NIEHS research scientists are using SAS to mine extremely large datasets that can tax even the beefiest of server hardware platforms. These datasets aggregate not just human genetic data, but also air quality data and other environmental variables. As a result, Reter admits, he wasn’t all that surprised when a team of scientists performing research into the environmental causes of cancer expressed frustration with the limitations of their server horsepower.
They were dealing, after all, with multi-terabyte datasets. What did surprise him, Reter acknowledges, was a suggestion from one NIEHS researcher that they use a distributed processing technology called grid computing to scale SAS’ data mining software across several dozen different servers. “We had built several grid computers for our scientists to use, and one of them in particular came to us with the idea that we could do [environmental cancer research] on a grid,” he explains. (There’s more about grid computing in our BI Backgrounder, which follows this case study.)

Intrigued, NIEHS researchers contacted a colleague at SAS, who helped them to load instances of SAS on 32 individual Linux servers. NIEHS and SAS researchers then used a SAS tool called SAS/Connect to intelligently distribute application data to each of the servers. SAS/Connect is based on a SAS feature called MP Connect, which allows multiple SAS sessions to run in parallel, each running one portion of a larger application. Theoretically, MP Connect can distribute a workload to an unlimited number of systems across a network.

The result was a grouping of 32 different servers (what’s called a grid) that are connected as a single system. The beauty of a technology like SAS/Connect, Reter says, is that it allows an application to run unmodified across a grid. As a result, NIEHS researchers didn’t have to rewrite their applications to communicate with each of the distributed instances of SAS. As far as the applications are concerned, then, they’re communicating with one (admittedly supercharged) instance of SAS.
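To make the idea concrete, here is a minimal Python sketch of the split, run-in-parallel, and combine pattern that a grid applies to a large job. It is an analogy only: it is not SAS code and does not use SAS/Connect or MP Connect. The function names (split_into_chunks, analyze_chunk, grid_mean) and the toy dataset are hypothetical, and local threads stand in for the 32 Linux servers.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 32  # mirrors the 32 servers in the article; all else is assumed


def split_into_chunks(records, n_chunks):
    """Split records into n_chunks contiguous slices of nearly equal size."""
    size, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional record.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def analyze_chunk(chunk):
    """Hypothetical per-worker job: return a partial sum and a record count."""
    return sum(chunk), len(chunk)


def grid_mean(records, n_workers=NUM_WORKERS):
    """Compute a mean by farming chunks out to workers and merging the partials."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(analyze_chunk, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    data = list(range(1, 1001))  # toy stand-in for a multi-terabyte dataset
    print(grid_mean(data))
```

The caller invokes grid_mean once and never addresses the individual workers, which loosely mirrors Reter’s point that the application sees a single instance. On a real grid the workers would be separate machines rather than threads in one process.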
The upshot, says Reter, is that deploying applications to exploit multiple instances of SAS would have been almost impossible outside of the context of a grid: “Basically, you’d be looking at manually trying to break up a process against 32 computers, so you’re looking at the cost of 32 computers, and then having to run their own instance, then you manually having to break apart that process.”

When NIEHS researchers finally tested their applications on the SAS grid, they found that the scalability benefits were enormous, says Reter. “The SAS grid has helped us to reduce by up to 95 percent in some cases the execution time required for these key projects,” he reports. “I think the particular test that we did with this ran just short of a day, and probably just running one piece of this on one computer would have taken over a week.”

With 32 Intel servers and instances of SAS running on each of them, the NIEHS’ grid isn’t exactly inexpensive. Still, Reter refuses to speculate about the cost of achieving similar performance using only one very large system. For one thing, NIEHS probably wouldn’t be able to purchase an Intel-based system large enough to match the performance of its 32-system grid. Instead, the research institute would have to invest in larger and more expensive systems from vendors such as Hewlett-Packard Co., IBM Corp., or Sun Microsystems Inc.

As a result of this experience, Reter says he’s now a grid computing enthusiast. He acknowledges that the technology isn’t a good choice for all applications (see accompanying article), but says that for organizations that support data mining or analytic applications that exploit large datasets, grid computing is the way to go. “I think you’re looking at a revolutionary way of analyzing large amounts of data in a way that’s just not practical otherwise,” he comments.
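Reter’s timing figures can be restated as speedup factors with a little arithmetic. The Python sketch below is illustrative only: the helper names are hypothetical, and the 168-hour and 24-hour inputs are round-number assumptions standing in for “over a week” and “just short of a day.”

```python
def speedup_from_reduction(reduction):
    """Convert a fractional cut in run time (0.95 = 95 percent) to a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_from_times(before_hours, after_hours):
    """Fraction of run time saved when a job drops from before_hours to after_hours."""
    return 1.0 - after_hours / before_hours


def parallel_efficiency(speedup, n_servers):
    """Share of the ideal n_servers-fold speedup that was actually achieved."""
    return speedup / n_servers


if __name__ == "__main__":
    best_case = speedup_from_reduction(0.95)  # the "up to 95 percent" figure
    print(f"95% reduction -> {best_case:.1f}x speedup")
    print(f"efficiency on 32 servers: {parallel_efficiency(best_case, 32):.1%}")

    # Assumed inputs: one week (168 h) on one machine, one day (24 h) on the grid.
    test_cut = reduction_from_times(168, 24)
    print(f"168 h -> 24 h is a {test_cut:.1%} reduction, "
          f"{speedup_from_reduction(test_cut):.1f}x speedup")
```

A 95 percent cut in execution time corresponds to roughly a 20-fold speedup, or 62.5 percent of the ideal 32-fold gain from 32 servers. The week-to-a-day test works out to about a 7-fold speedup under the assumed inputs, and more if the single-machine run really would have taken longer than a week.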
“I know for us, with the micro array of air quality data and genetic data that our scientists are looking at, you’re definitely looking at very large amounts of data. How else are you going to economically analyze it all?”

As for ROI, says Reter, it’s a no-brainer. To the extent that the SAS grid furthers cancer research even one iota, he observes, it will more than have paid for itself. Because the grid has so drastically ratcheted up the performance of some of the NIEHS’ key cancer research applications, it has more than delivered the goods. “It’s amazing the power that grid computing has given us at such a reduced cost,” he concludes.
Stephen Swoyer is a technology writer based in Athens, Ga. You can contact Stephen via E-mail at swoyerse@percipient-analytics.com.