One of the more prosaic parts of data warehousing is getting information into a warehouse in the first place. Vendors that sell data-loading tools have long operated in the shadows of the business intelligence market, with little pizazz or glory.
Even in the isolated world of extract, transform and load (ETL) software, the focus has traditionally been more on the problem of cleansing and modifying data to prepare it for analytical uses. Data loading seemed to be an afterthought — a piece of cake — in comparison.
But that's changing as BI and analytics become more of a near-real-time affair at many companies. In addition, the biggest BI users now operate data warehouses larger than a petabyte in size and need to import huge amounts of data into them. For instance, decision-support database vendor Teradata Corp. says that eBay Inc. loads 50TB of online auction and purchase data on a daily basis.
Over the past few months, several start-ups and relatively unknown vendors have tried to take advantage of such needs by touting screaming-fast data-loading speeds that they claim to have achieved either in the lab or with users in the field.
Database start-up Greenplum Inc. said customer Fox Interactive Media Inc. routinely loads 2TB of Web usage data into its data warehouse in half an hour. Meanwhile, rival Aster Data Systems Inc. claimed that its nCluster technology supports load speeds of up to 3.6TB per hour.
Not to be outdone, Expressor Software Corp., an ETL start-up offering so-called semantic data integration tools, said in-house tests show that its data processing engine can scale to a rate of nearly 11TB per hour.
Even Syncsort Inc., a 41-year-old company that began as a mainframe software vendor, has gotten into the act. Syncsort said that in lab tests, its data integration software loaded 5.4TB of data into a warehouse built around Vertica Systems Inc.'s column-based database in less than an hour.
If Syncsort and the other vendors are actually achieving those kinds of load rates, that's “really impressive,” said James Kobielus, an analyst at Forrester Research Inc. “Anything above a terabyte per hour is good.”
And what about more-established vendors? Two years ago, SAS Institute Inc. and Sun Microsystems Inc. demonstrated a SAS data warehouse running on Sun hardware and StorageTek disk arrays that could push through 1.7TB of data in 17 minutes, the equivalent of about 6TB per hour.
But other big-name vendors have posted performance benchmarks that fall short of the load rates claimed by the upstarts. Last fall, for instance, Oracle Corp. and Hewlett-Packard Co. said their joint BI-oriented HP Oracle Database Machine could load up to 1TB per hour. And Microsoft Corp. said early last year that the data integration software built into SQL Server 2008 had loaded data at a rate of 2.36TB per hour.
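Because the vendors quote their figures over different time windows, it can help to put them on a common terabytes-per-hour footing. The short Python sketch below does that using only the numbers cited in this article, taken at face value. The function name `tb_per_hour` and the `claims` table are illustrative choices, not anything the vendors publish. Two entries rest on assumptions: Syncsort's "less than an hour" is treated as a full 60 minutes, which makes its rate a lower bound, and Expressor's "nearly 11TB per hour" is rounded up to 11.

```python
def tb_per_hour(terabytes: float, minutes: float) -> float:
    """Convert a volume loaded over a number of minutes into TB per hour."""
    return terabytes * 60.0 / minutes


# Figures as quoted in the article; see the assumptions noted on two entries.
claims = {
    "Expressor (in-house test)": 11.0,             # assumption: "nearly 11TB" rounded up
    "SAS / Sun": tb_per_hour(1.7, 17),             # 1.7TB in 17 minutes
    "Syncsort / Vertica": tb_per_hour(5.4, 60),    # assumption: a full hour, so a lower bound
    "Greenplum / Fox Interactive": tb_per_hour(2.0, 30),  # 2TB in half an hour
    "Aster Data nCluster": 3.6,                    # quoted directly per hour
    "Microsoft SQL Server 2008": 2.36,             # quoted directly per hour
    "HP Oracle Database Machine": 1.0,             # "up to 1TB per hour"
}

# Print the claims from fastest to slowest.
for name, rate in sorted(claims.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<30} {rate:5.2f} TB/hour")
```

None of these numbers come from a shared benchmark, so the ranking only shows how the claims compare on paper.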
But do customers really need the ultrafast loading speeds that vendors have been touting recently?
Not surprisingly, John Russell, Expressor's chief scientist, contends that many do. “Every financial firm we talk with says they want . . . something close to 1TB per day,” he said. “For clickstream data [from Web sites], those figures could be as high as 200 billion clicks, or nearly 24TB a day.”
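Russell's clickstream figures can be sanity-checked with a little arithmetic. The sketch below assumes decimal units (1TB = 10^12 bytes) and treats "nearly 24TB" as exactly 24TB; the variable names are illustrative only.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes

clicks_per_day = 200 * 10**9           # 200 billion clicks, as quoted
bytes_per_day = 24 * BYTES_PER_TB      # assumption: "nearly 24TB" taken as 24TB

# Average record size implied by the two quoted figures.
bytes_per_click = bytes_per_day / clicks_per_day

# Average sustained load rate if the data arrived evenly over 24 hours.
tb_per_hour_average = 24 / 24

print(f"Implied size per click record: {bytes_per_click:.0f} bytes")
print(f"Average load rate over a day:  {tb_per_hour_average:.1f} TB/hour")
```

Under those assumptions, each click record works out to about 120 bytes, and the daily volume averages 1TB per hour if it is spread evenly across the day.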
A longtime data warehouse architect for Fortune 100 companies, Russell said he co-founded Expressor partly “out of the frustration I felt when dealing with the performance limitations and bottlenecks of those high-end [data integration] tools.”
Loading speeds not just a high-end issue
Small organizations may not have a place in the rarefied realm of terabyte-per-hour data-loading rates. But they have load-speed issues of their own to deal with.
For instance, Mamasource, a Corte Madera, Calif.-based company that runs an online community for mothers, recently moved its 300GB MySQL data warehouse to a specialized warehousing appliance from Kickfire Inc.
Mamasource adds about 1GB of clickstream data to the warehouse daily. That process previously took four hours — a rate of just 250MB per hour. But with the Kickfire upgrade, the loads can be done in less than an hour, said Steve Keptchel, Mamasource's director of research and analytics.
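The Mamasource numbers can be worked through the same way. This sketch assumes 1GB = 1,000MB and treats "less than an hour" as a full hour, so the new rate and the speedup are lower bounds rather than measured values. The names are illustrative.

```python
MB_PER_GB = 1000  # assumption: decimal units

daily_load_gb = 1.0   # about 1GB of clickstream data per day
old_hours = 4.0       # previous load time
new_hours_max = 1.0   # assumption: "less than an hour" capped at one hour

old_rate_mb_per_hour = daily_load_gb * MB_PER_GB / old_hours
new_rate_mb_per_hour_min = daily_load_gb * MB_PER_GB / new_hours_max
speedup_min = old_hours / new_hours_max

print(f"Old rate: {old_rate_mb_per_hour:.0f} MB/hour")
print(f"New rate: at least {new_rate_mb_per_hour_min:.0f} MB/hour")
print(f"Speedup:  at least {speedup_min:.0f}x")
```

Even at the improved rate, the load is still measured in gigabytes per hour, far below the terabyte-per-hour figures the vendors are advertising.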
The University of Pennsylvania Health System has a mere 5GB in its Oracle-based data warehouse now. Its current loading rate is fast enough, said Brian Wells, chief technology officer at the Philadelphia-based health system. But Wells expects that over time, he will have to add more processing capacity to his Oracle server and to the systems that run IBM's ETL tools.
That's partly because of an expected increase in data volume. In addition, Wells said health system officials want to ensure that data is being made available in the warehouse no later than two days after it is created.
Kobielus said load rates of multiple terabytes per hour “are becoming the norm” for warehouses that store large volumes of event-based data, such as Web clickstream info or the call-detail records generated by telecommunications systems.
They can also be useful when companies need to populate new warehouses or data marts with historical information for quick-turnaround data-mining projects, Kobielus said.
But such uses are outside the mainstream of enterprise data warehousing, he added. According to Kobielus, most warehouses still store less than 10TB of data and only need “gigabytes per hour” load rates.
Independent database analyst Curt Monash made a similar point last December in a blog post about the performance claim made by Syncsort and Vertica. Monash acknowledged that data loading “is an increasingly nontrivial subject.” But in general, he contended, commercial databases “will provide most users with much more load speed than they actually need.”
Peter Schmidt is director of business intelligence at Centro LLC, a 100-employee online advertising services firm in Chicago. Centro has modest data-loading needs, but Schmidt has worked in BI for more than 20 years and has held jobs at companies with much greater storage needs, such as United Air Lines Inc. and OfficeMax Inc.
So, could Schmidt ever imagine needing an ETL tool that could load 4TB or more per hour? “No, not to that level,” he said. And, he noted, “the performance tests out there are never apples-to-apples [with real-world applications]. So what if you can bring in 11TB per hour but aren't transforming anything, just moving data from Point A to Point B?”
Even Teradata is skeptical about how broad the demand is for high-end loading tools. “Extreme data-load rates are irrelevant to most customer environments,” said Randy Lea, the vendor's vice president of products and services.
Most data warehousing systems, including Teradata's, can be configured to load multiple terabytes of data per hour, Lea said. But he cautioned that such systems are at risk of becoming unbalanced and performing badly in other areas, such as data reads and queries.
In addition, he said, “the current crop of 'gee whiz' data-loading boasts have little value because there are no benchmark standards.”
The lack of benchmark standards is now being addressed by the Transaction Processing Performance Council, a benchmark development group known as the TPC. It formed an ETL benchmark subcommittee last November, and an initial meeting is scheduled for this month. That may at least enable BI users to see how vendors really stack up against each other when it comes to pressing the load pedal to the floor.