Data lakes are gaining momentum in the IT space. However, there is still a lot that information leaders need to understand about the concept.
- Data lakes are still relatively new. The term, credited to Pentaho CTO James Dixon, has been discussed for several years. But the idea of data lakes as corporate resources is still in its infancy, according to IDC analyst Ashish Nadkarni. A data lake is defined as a massive – and relatively cheap – storage repository, such as Hadoop, that can hold all types of data until it is needed for business analytics or data mining. A data lake holds data in its rawest form, unprocessed and ungoverned.
- You can’t buy a ready-to-use data lake. Vendors are marketing data lakes as a panacea for Big Data projects, but that’s a fallacy. According to Gartner analyst Nick Heudecker, “Like data warehouses, data lakes are a concept, not a technology. You can use several technologies to build a data lake. At its core, a data lake is a data storage strategy.”
- Lakes have big appetites for data. Data lakes are designed for data ingestion – the procedure that involves gathering, importing and processing data for storage or later use. “Where the storage cost model of a data warehouse may not lend itself to wholesale data ingestion, a data lake does,” Heudecker says. “Also, a data lake doesn’t require the users to create a schema before data is available for use. Data can simply be ingested and the schema created and applied when the data is read.”
- You must involve multiple facets of the business. Data lakes are resources for the entire organisation, not just IT. Therefore, all interested parties should be involved in planning data lake projects. “It is central to the firm’s Big Data architecture, and therefore, cannot be implemented in isolation,” Nadkarni says. In addition to IT managers, a data lake project should involve business leaders and users. Storage experts also need to play a key role. “At the end of the day,” Nadkarni says, “it is a storage platform, and therefore companies should involve the storage team in its design and implementation.”
- The biggest benefits don’t come from technology. The business value of a data lake has very little to do with the underlying technologies chosen, Heudecker says. “Instead, the business value is derived from the data science skills you can apply to the lake. Data lakes aren’t a replacement for existing analytical platforms or infrastructure. Instead, they complement existing efforts and support the discovery of new questions.” Once those questions are discovered, you then need to “optimise” for the answers. Optimising may mean moving data out of the lake and into data marts or data warehouses.
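
Heudecker's point that data can be ingested first and the schema "created and applied when the data is read" may be easier to picture with a small sketch. The Python below is an illustration only, not any vendor's implementation: the in-memory list `raw_lake`, the functions `ingest` and `read_with_schema`, and the sample records are all hypothetical names and data invented for this example. A real data lake would sit on a storage platform such as Hadoop rather than a Python list.

```python
import json

# Hypothetical in-memory "lake": raw records are kept exactly as they arrive.
raw_lake = []


def ingest(raw_line):
    """Store the raw text untouched; no schema is checked at write time."""
    raw_lake.append(raw_line)


def read_with_schema(schema):
    """Apply a schema at read time.

    `schema` maps a field name to a casting function. Records that are not
    JSON objects, lack a requested field, or fail a cast are skipped.
    """
    rows = []
    for line in raw_lake:
        try:
            record = json.loads(line)
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (ValueError, KeyError, TypeError):
            continue  # this record does not fit the schema requested by this reader
    return rows


# Ingestion accepts anything, including records with different shapes.
ingest('{"user": "ana", "amount": "19.90", "channel": "web"}')
ingest('{"user": "ben", "amount": "5", "note": "no channel recorded"}')
ingest('free-text log line that is not JSON')

# Two readers impose two different schemas on the same raw data.
sales = read_with_schema({"user": str, "amount": float})
channels = read_with_schema({"user": str, "channel": str})
```

In this sketch the same raw records answer two different questions because each reader brings its own schema. A data warehouse would instead reject or reshape the records at load time, which is the contrast Heudecker draws.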