The term data lake has been with us for many years. Its origin is attributed to James Dixon, who coined the term while writing, “If you think of a data mart as a store of bottled water – cleansed, packaged, and structured for easy consumption – the data lake is a large body of water in a more natural state.”
Many subsequent writers have questioned whether organizations were creating data lakes with business value or data swamps with limited or no value. Given this, Marco Iansiti and Karim Lakhani have suggested that the data lake, which holds data in its original source form, is part of a data platform with “data flowing from bottom to top…And the data platform aggregates, cleans, refines, and processes data” captured in the data lake.
Given this more refined view, the question is: where is the data lake within its hype cycle? To answer this question, I asked CIOs and industry experts for their opinions.
What Have Data Lakes Yielded?
CTO Steve Jones: “I’ve deployed quite a few data lakes and generally they have three things at their base: 1) reducing the conformance chasm of traditional enterprise data warehouses; 2) shifting to store the history of change making machine learning and AI easier; and 3) industrializing the ingestion and distillation of data. The aim is to allow the business to concentrate on outcomes while IT concentrates on provisioning data rather than integrating systems into data marts/warehouses. If you have this then the use cases become pretty much endless.”
As a data lake implementer, CIO Deb Gildersleeve says her organization has “implemented subject specific data lakes for business units and they’ve been really helpful at gaining insights and giving business users access to them.”
CIO Jim Russell has taken a similar self-service approach to Gildersleeve and deployed a vendor-specific lake. It is “part of a three-year maturation plan to fix data and start looking at processes. Traction is difficult to judge yet because it represents a total paradigm shift for our organization. So, it validates but does not surprise or refute us yet.”
Meanwhile, Enterprise Architect Craig Milroy says that he has “inherited 3 data lakes each on a different platform (AWS, Azure, and Cloudera). Each are focused on specific business outcomes from digital to 5G. I think we are at the start of our value driven journey. There is much more work to do to align business value and outcomes with their technology investment.”
CIO Melissa Woo, by contrast, is unsure about the business outcomes that her data lakes will deliver. She says, “our head of analytics implemented a data lake before it became a thing, but for our organization there really hasn’t been much uptake. Our customers still want traditional data warehouses and report writing. On the upside, our president remains really interested in outcomes that can result from having the infrastructure in place and really likes the term data lake.”
Even worse, former CIO Ben Haines says, “a lot of data lakes turned into data swamps, wastelands of data opportunities.” The above discussion led Mark Thiele to ask everyone: does a data lake supplant other data repositories, or is it a value-add extra?
Biggest Opportunities for Data Lakes in Contrast to Data Warehouses?
For Milroy, a telecom executive, data lakes provide “support for the volume of unstructured data from 5G endpoints. This wouldn’t fit within traditional data warehouse approaches especially with online/real-time streaming data and analytics capabilities.” He continues by stressing that “fit for purpose workloads be deployed to data lakes for specific business requirements.”
Stephen diFilipo agrees with Milroy and suggests “data lakes provide for collecting, storing and analyzing all data, formats, unstructured, meta data that were not possible with traditional data warehouse repositories.”
Having a similar perspective, Gildersleeve argues “the biggest opportunity for data lakes is the ability to apply focus and move a bit faster than with a traditional data warehouse. This can allow access for more people to dig into data.”
Jones asserted at this point that the difference between data lakes and data warehouses is the ability to move from “change data capture to the history of all change. With a data lake, you don’t need to extract just the data required for reporting, you can capture the whole history. This is the difference between a Cliff notes and the Complete Works of Shakespeare.”
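One way to picture the distinction Jones draws is the minimal Python sketch below. It is an illustration only, not anything described by the contributors: the `ChangeEvent` record, the `latest_state` and `history_of` helpers, and the sample customer tiers are all hypothetical names invented here. The first helper keeps just the current value per key, roughly what a reporting extract would hold, while the second returns every recorded change for a key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """One recorded change to a business entity (hypothetical shape)."""
    key: str
    value: str
    changed_at: int  # assumed logical timestamp; larger means later


def latest_state(events):
    """Reporting-style view: only the most recent value per key survives."""
    state = {}
    for event in sorted(events, key=lambda e: e.changed_at):
        state[event.key] = event.value  # later events overwrite earlier ones
    return state


def history_of(events, key):
    """History-style view: every recorded value for one key, oldest first."""
    ordered = sorted(events, key=lambda e: e.changed_at)
    return [event.value for event in ordered if event.key == key]


# Illustrative data only.
events = [
    ChangeEvent("customer-1", "bronze", 1),
    ChangeEvent("customer-1", "silver", 2),
    ChangeEvent("customer-2", "bronze", 3),
    ChangeEvent("customer-1", "gold", 4),
]

current = latest_state(events)
trail = history_of(events, "customer-1")
```

Under these assumptions, `current` answers "what tier is each customer in now?", while `trail` preserves the path a customer took to get there, which is the kind of history that later analysis or model training might draw on.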
Former Gartner analyst Nick Heudecker summarized the discussion by saying “data lakes should be viewed as systems of exploration. They supplement data warehousing approaches.”
What Drives Success or Failure of a Data Lake Project?
CIOs had varying opinions regarding the nature of successful data lake projects. Some believe that a data lake is best utilized when data from multiple business groups combines to create a fusion rather than the sum of historical reporting. For Woo, “this has been part of our uptake problem. There is little value if different groups are unwilling to contribute data.” Clearly, analytical maturity remains important. Organizations that succeed at this, however, become analytical companies and competitors, as suggested by Tom Davenport.
When companies can operate together for common ends, former CIO McBreen says “it’s like a bunch of streams of data from many devices, partners, universe that we know are important but we have only scratched surface of utilizing. For AI and ML, it could be information that we will use later to enhance both of them.” In terms of success or failure drivers, it is important that CIOs help executive teams understand the difference between a valuable data lake and a data swamp. Common failure points include:
1) Lack of business-defined use cases/outcomes
2) Lack of people skills
3) Lack of resources
4) Inflated expectations
5) Weak data literacy and fluency
6) Poor data quality
7) Weak data governance
Heudecker says that “data lake deployments often struggle because the audience hasn’t been identified. That impacts the tools available, the level of data literacy required, and so on. The view that data lakes are one size serves all isn’t correct.”
With the Emergence of CDOs, Where Can CIOs Add the Most Value?
diFilipo suggested “this is $64M question. Let me know when someone nails this piece of Jell-O to the tree.” However, Jones suggests “[if] the CIO is a chief information officer then they can be the data asset manager for the organization providing the CDO with the data platform.”
For this reason, Milroy says that CIOs should “make it easy for analytics, data science to access high quality and well understood data to drive business value and outcomes.”
Parting Remarks
CIOs continue to have an important role in data management, and data lakes offer them the potential to add value. It seems clear that many organizations will head down into the trough of disillusionment as data lake experiments yield mixed results.
But for those that consider data lakes as part of generating a data platform or, in the words of analysts, a data fabric, the opportunity to accelerate business transformation will make the ride and any resulting disillusionment worthwhile.
ABOUT THE AUTHOR:
Myles Suer is Head of Global Enterprise Marketing at Dell Boomi. He is also the facilitator of the #CIOChat, and is the #1 influencer of CIOs, according to LeadTails. He is a top 100 digital influencer. Among other career highlights, he has led a data and analytics organization.