While data de-duplication is a relatively easy way for companies to consolidate storage systems, the technology has yet to attract the widespread attention of IT managers. According to Gartner Inc., only about 30% of corporations have deployed any form of de-duplication technology.
Gartner says the technology can cut storage needs by a ratio of 20:1 to 30:1 on average — far less than vendor claims of 200:1 to 300:1, but still enough to greatly reduce network bandwidth requirements and stave off storage hardware purchases by making better use of existing equipment. The technology could also eliminate the need for costly tape backup systems.
Gartner analyst Valdis Filks suggested that despite the potential benefits, IT managers remain reluctant buyers of the technology because of high start-up costs.
“De-duplication is becoming very fashionable and lays claim to large cost savings, but quite a few companies I speak to get turned off by the purchase price,” Filks said.
Simply put, data de-duplication, or single-instance storage, involves the elimination of redundant data. Hash algorithms mark data blocks with unique numbers, and those numbers are compared so that duplicate pieces of data can be left out of the storage process.
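The block-hashing idea described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the 4KB fixed block size, the choice of SHA-256 and the function names `deduplicate` and `restore` are all assumptions made for the example. Strictly speaking, a hash is not guaranteed to be unique, but with a strong hash a collision is unlikely enough that matching digests are treated as matching data.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep each distinct block once.

    Returns (store, recipe): store maps a block's SHA-256 digest to its
    bytes, and recipe is the ordered list of digests needed to rebuild
    the original data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:  # first time this content is seen
            store[digest] = block
        recipe.append(digest)  # duplicates cost only a reference
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by following the recipe of digests."""
    return b"".join(store[digest] for digest in recipe)


# Three identical blocks plus one different block.
sample = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
store, recipe = deduplicate(sample)
print(len(recipe), "blocks referenced,", len(store), "stored")  # 4 referenced, 2 stored
```

Commercial tools differ in the details, for example in whether blocks are fixed or variable in size, but the store-once, reference-many pattern is the same.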
To date, the primary corporate use of the technology has been for e-mail archiving. The benefits are obvious when you consider that a single e-mail sent to many recipients can leave thousands of copies of one attachment in the archive.
Today's de-duplication tools are mostly point products, such as prepackaged appliances that sit between a primary disk storage system and the backup process, software that runs on commercially available host systems, or virtual tape library (VTL) disk subsystems. The market is starting to expand because companies like Sun Microsystems Inc. and NetApp Inc. have begun shipping tools that provide de-duplication on primary storage systems.
Analysts say that as IT managers consider deploying de-duplication technology, they should keep in mind that it will continue to expand beyond point solutions to primary storage systems. “You should definitely roll it out today but implement something they can adapt and change when de-duplication becomes ubiquitous in other areas of the [data center],” Filks said.
The top makers of de-duplication point products include Data Domain, Sepaton, EMC's Avamar unit, Diligent Technologies, FalconStor Software (whose technology is resold by vendors like IBM, Sun and EMC) and Quantum.
Some users have found that de-duplication can deliver significant benefits in specific environments, such as at companies running older tape library systems.
Boston Medical Center was able to reduce 400TB of data stored on its tape libraries and secondary disk storage system to 3.5TB while also eliminating the need for tape backup by using Data Domain Corp.'s DD690 virtual tape library de-duplication system, said Brad Blake, director of IT.
The medical center is projecting that the product's return on investment will total $300,000 to $400,000 over the next three years. In addition, Blake credits the tool with eliminating the need for Iron Mountain Inc.'s data archiving services, further cutting costs over the period by about $70,000.
Blake noted that while the two Data Domain VTLs were expensive — $700,000 for both — the price is about $400,000 less than it would have cost to replace the medical center's aging backup infrastructure, which included a tape library from ADIC fronted by a Centera secondary disk array from EMC.
Blake's IT shop manages medical, clinical, e-mail, database, HR and financial data stored on more than 120 application servers. Boston Medical Center's data grows at 40% to 50% annually. The hospital uses one Data Domain box to back up primary data and then replicates that data to a second array at an off-site facility for disaster recovery purposes.
“Initially, I was nervous about stepping into a space not a lot of other people had tried yet, but Data Domain had another customer that I was able to chat with. Taking time to understand the technology and talking to someone else using it — learning what he knew — gave me a comfort level,” Blake said.
Computer technician Paul Rivera's comfort level sank when his company's first attempt at de-duplication failed.
When his employer, ENGlobal Corp., wanted to consolidate its e-mail archive, it initially turned to a service provider, Verizon Communications Inc., which used EMC's Avamar software to do the job. But some of ENGlobal's older application servers had problems with Avamar's agents and wouldn't run backups.
James Saar, backup administrator at Houston-based ENGlobal, a provider of engineering and professional services, noted that Verizon had someone else managing the Avamar product, so it took two or three tiers of people to get the backups working again.
So three months ago, ENGlobal decided to bring the de-duplication process in-house using Symantec Corp.'s PureDisk software and NetBackup servers.
With PureDisk, agents can be placed on host servers at remote sites to de-duplicate data before it is sent over a WAN to a data center for backup. Alternatively, PureDisk can run at the remote site itself for local backup and quick restore.
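To show why de-duplicating before the WAN hop saves bandwidth, here is a rough sketch of the general source-side pattern: the agent hashes its blocks, asks the data center which ones are missing, and sends only those. It is not PureDisk's actual protocol. The `BackupServer` class, the `agent_backup` function and the 4KB block size are hypothetical names and values chosen for illustration.

```python
import hashlib


class BackupServer:
    """Hypothetical data-center block store, keyed by SHA-256 digest."""

    def __init__(self):
        self.blocks = {}

    def missing(self, digests):
        # Report only the digests the server does not hold yet.
        return {d for d in digests if d not in self.blocks}

    def upload(self, digest, block):
        self.blocks[digest] = block


def agent_backup(data: bytes, server: BackupServer, block_size: int = 4096) -> int:
    """Back up data from a remote site.

    Returns the number of block payload bytes sent over the WAN.
    The digests exchanged with the server are not counted.
    """
    blocks = {}
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks[hashlib.sha256(block).hexdigest()] = block

    sent = 0
    for digest in server.missing(blocks):
        server.upload(digest, blocks[digest])
        sent += len(blocks[digest])
    return sent


server = BackupServer()
week1 = b"x" * 8192 + b"y" * 4096
print(agent_backup(week1, server))                 # 8192: two distinct blocks sent
print(agent_backup(week1, server))                 # 0: nothing new to send
print(agent_backup(week1 + b"z" * 4096, server))   # 4096: only the new block
```

In this sketch a repeat backup of unchanged data moves no block payload at all, which is the effect that makes weekly backups from remote offices practical over a WAN.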
Rivera said ENGlobal didn't give up on de-duplication because it believed the technology could significantly slow the growth of storage requirements, let the company avoid using tape for restores, and reduce the need for expensive service contracts. He said the company is amortizing the cost of the equipment at about $10,000 per month over the next three years, or about 25% of the price it was paying for Verizon's hosted backup services.
With Symantec's help, Rivera and his team took about three weeks to set up PureDisk agents at three remote sites. Each site backs up about 1TB of data per week.
“With de-duplication, we can also restore anytime within the retention period and it's all live,” Saar said. “There's no need to worry about finding tapes in a vault. Being able to access data immediately really alleviates management overhead for a backup system.”
In all, ENGlobal's backup systems store about 800GB from three offices, down from 4TB prior to the installation of the Symantec tools.
ENGlobal has one PureDisk appliance in its main data center connected to a single 8TB iSCSI unit, and a second appliance in its Washington office that's connected to a Dell PowerVault MD1000 DAS array.
Difference of Opinion
Scott Heffner, network operations manager at Wentworth Douglas Hospital in Dover, N.H., praised the EMC Avamar software that had stalled ENGlobal's de-duplication plans.
The hospital rolled out Avamar last October to replace a poorly performing IBM Tivoli backup system. Today, Heffner said, “we're seeing 100:1 ratios. It's unbelievable. That means when we would have been backing up 100GB in the past, we're only having to back up 1GB of data now.” He noted that Wentworth Douglas' SAN contains about 60TB of data.
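The ratios quoted in this article all come from the same simple arithmetic, sketched below. The helper names `dedup_ratio` and `space_saved_pct` are illustrative, and the conversion of ENGlobal's 4TB to 4,000GB assumes decimal units.

```python
def dedup_ratio(before: float, after: float) -> float:
    """Return the reduction ratio, e.g. 100.0 for 100:1.

    Both arguments must be in the same unit.
    """
    if before <= 0 or after <= 0:
        raise ValueError("sizes must be positive")
    return before / after


def space_saved_pct(before: float, after: float) -> float:
    """Return the percentage of storage eliminated."""
    return 100.0 * (1 - after / before)


# Figures as quoted in the article.
cases = {
    "Heffner's example (100GB to 1GB)": (100, 1),
    "Boston Medical Center (400TB to 3.5TB)": (400, 3.5),
    "ENGlobal (4TB to 800GB, in GB)": (4000, 800),
}
for name, (before, after) in cases.items():
    ratio = dedup_ratio(before, after)
    saved = space_saved_pct(before, after)
    print(f"{name}: about {ratio:.0f}:1, {saved:.1f}% saved")
```

By this measure Boston Medical Center's reduction works out to roughly 114:1 and ENGlobal's to 5:1. A 20:1 ratio, the low end of Gartner's average, already removes 95% of the stored data, so higher ratios add less saved space than the headline numbers suggest.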
While the hospital implemented de-duplication to ensure fast restoration of data in the event of a disaster, the technology has significantly cut its backup costs as well, Heffner said. He noted that the annual cost of backing up 91 nodes using the Tivoli storage products was about $121,000. Today, Wentworth Douglas expects to pay $100,000 for 140 nodes.
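Because the node count changed along with the bill, a per-node comparison is more telling than the totals. The sketch below works it out from the figures above. It assumes the $100,000 figure is also an annual cost, which the article does not state outright, and `cost_per_node` is an illustrative helper name.

```python
def cost_per_node(total_cost: float, nodes: int) -> float:
    """Return the backup cost per protected node."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return total_cost / nodes


# Figures quoted by Heffner; the second is assumed to be annual as well.
tivoli = cost_per_node(121_000, 91)
avamar = cost_per_node(100_000, 140)
reduction_pct = 100 * (1 - avamar / tivoli)

print(f"Tivoli: ${tivoli:,.0f} per node")
print(f"Avamar: ${avamar:,.0f} per node")
print(f"Per-node reduction: {reduction_pct:.0f}%")
```

Under that assumption the cost falls from about $1,330 to about $714 per node, a drop of roughly 46%, even though the total bill shrinks by only about 17%.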
“The only thing going through our heads [at first] was how we could improve our disaster recovery capability,” Heffner said. “When things are at their worst, hospitals are expected to be at their best, and how can we be that if our [records] system can't come up for three days after a disaster?”
The hospital runs the Avamar software on Dell PowerEdge 2950 servers. It also uses EMC's Clariion Disk Library for backups in order to keep all data online for fast recovery.
Its data center is in transition to a virtual environment, and 100 of the hospital's 150 virtual machines now back up through the Avamar appliance, Heffner said. The majority of data going to the de-duplication appliance comes from Exchange e-mail servers, medical forms servers, Citrix servers, the hospital's Web portal server and a medical notes transcription server.