Buzzwords come and go, but if they manage to stick around a while, it means the concept is catching on. A buzzword you’re likely hearing more and more is “Data Lake,” which evokes a body of water, and in many ways a data lake is like a lake. It’s a repository of data (or water) with multiple feeders in and perhaps a few out. It’s vast, still, and unmoving, unlike a river. Data (or water) accumulates until it is needed.
So the name, while not very high tech, is actually appropriately descriptive.
What is a Data Lake?
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed for processing. Unlike the more structured data warehouse, which uses hierarchical structures such as folders, rows, and columns, a data lake uses a flat architecture that preserves the original structure of the data as it was ingested.
Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata. When someone performs a business query based on certain metadata, all of the data carrying those tags is then pulled in and analyzed to answer the question.
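To make the identifier-and-tags idea concrete, here is a minimal sketch in Python. The catalog structure, paths, and tag names are illustrative assumptions, not any particular product’s API.

```python
# A minimal sketch of data-lake metadata tagging, in plain Python.
# Field names and paths are illustrative assumptions.
import uuid

catalog = {}  # unique identifier -> metadata for each data element

def ingest(raw_object_path, tags):
    """Register a raw object in the lake and tag it with metadata."""
    element_id = str(uuid.uuid4())          # unique identifier per element
    catalog[element_id] = {"path": raw_object_path, "tags": set(tags)}
    return element_id

def query(tag):
    """Find every data element tagged with the given metadata value."""
    return [meta["path"] for meta in catalog.values() if tag in meta["tags"]]

ingest("s3://lake/raw/orders-2017-03.json", ["finance", "orders", "json"])
ingest("s3://lake/raw/clickstream-0001.log", ["web", "clickstream"])
print(query("finance"))   # every element relevant to a finance query
```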
“The reason data lakes exist is because everyone is collecting huge amounts of information from everywhere, especially from IoT, and they need to store it somewhere. And the historical storage medium was a relational database. But these technologies just don’t work well for all these data fragments we’re collecting from all over the place. They’re too structured, too expensive and they typically require an enormous amount of prior setup,” said Avi Perez, CTO of Pyramid Analytics, developer of BI and analytics software.
“A data lake is a lot more forgiving, cheaper and can accommodate unstructured data. Though the problem is that if you can put something in there, you will just stick everything in there. That is what is happening with data lakes today. And it’s causing the ‘data graveyard effect’ whereby data becomes inaccessible and unusable,” he cautioned.
A data lake must be scalable to meet the demands of rapidly expanding data storage.
Benefits of Data Lakes
The data lake is a response to the challenge of massive data inflow. Internet data, sensor data, machine data, IoT data: it comes in many forms and from many sources, and as fast as servers are these days, not everything can be processed in real time.
- Ability to look at original data. The volume, variety, and velocity of data could cause you to miss something when it first comes in, but by storing it in the data lake you can go back and look later.
- Easy analysis. Because the data is unstructured, you can apply whatever analytics or schema you need at the time of analysis, an approach sketched in the example after this list. With a data warehouse, the data is preprocessed, so if you want to run a search or type of query the data wasn’t prepared for, you might have to start the processing all over again, if you can do it at all.
- Availability. Another advantage is that the data is available to anyone in the organization. Something stored in a data warehouse might be accessible only to business analysts, but if the security team wants to check for potential compromises, they can go through the historical data themselves to look for signs of a break-in.
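Here is a short sketch of that “schema on read” idea: the same raw events answer two different questions, with structure applied only at read time. The file contents and field names are illustrative assumptions.

```python
# A sketch of "schema on read": raw events are stored as-is, and structure
# is applied only when a question is asked. Fields are assumptions.
import json

RAW_EVENTS = [
    '{"ts": "2017-03-01T09:00:00", "user": "alice", "action": "login", "ip": "10.0.0.5"}',
    '{"ts": "2017-03-01T09:02:10", "user": "bob", "action": "purchase", "amount": 42.0}',
    '{"ts": "2017-03-01T09:05:33", "user": "alice", "action": "login", "ip": "203.0.113.9"}',
]

def read_with_schema(lines, fields):
    """Parse raw lines and project only the fields this analysis needs."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# A business analyst's view: who bought what?
sales = [r for r in read_with_schema(RAW_EVENTS, ["user", "amount"])
         if r["amount"] is not None]

# A security analyst's view of the same raw data: logins per source IP.
logins = [r for r in read_with_schema(RAW_EVENTS, ["user", "ip"])
          if r["ip"] is not None]

print(sales, logins, sep="\n")
```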
Data Lake Architecture
The data lake has a deep end and a shallow end, says Mark Beyer, research vice president and distinguished analyst for data and analytics at Gartner. The deep end is for data scientists and engineers who know how to manipulate and massage the data, while the shallow end is for more general users doing less specific searches.
“Those two groups of users always want to use the lake but the advanced users prove out the lake. They build models, come up with theoreticals, and challenge existing business process models,” he said.
No special hardware is needed to build a data lake, since its storage mechanism is a flat file system; you could use a mainframe if you want. The data will be moved out to other servers for processing. Most users, though, are likely to go with the Hadoop Distributed File System (HDFS), a distributed, scale-out file system, because it supports faster processing of large data sets.
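As a rough sketch of what landing raw data in that flat layout can look like, the snippet below uses pyarrow’s HDFS filesystem interface. It assumes pyarrow is installed with libhdfs available; the namenode host, port, and paths are placeholders, not a prescribed layout.

```python
# A sketch of landing raw data in HDFS via pyarrow's filesystem interface.
# Host, port, and paths are placeholder assumptions for your own cluster,
# and the local source file is assumed to exist.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Flat layout: raw objects land unchanged under a single ingest prefix,
# keyed by source and arrival date rather than a deep folder hierarchy.
target = "/lake/raw/sensors/2017-03-01/reading-0001.json"

with open("reading-0001.json", "rb") as src, \
        hdfs.open_output_stream(target) as dst:
    dst.write(src.read())   # store the bytes exactly as they arrived
```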
That said, there needs to be some kind of structure or order to make it work. The data needs to be timely, so that when users need immediate access, they can get at it. It must be flexible, so users can process and analyze the data with their tools of choice, not just the ones IT provides.
There must be some integrity and quality to the data, because the old adage about garbage in, garbage out applies here. If the data is missing or inaccurate, users might not use it at all, and then what good is it? Finally, it must be easily searchable.
Pivotal, a cloud development firm, recommends multiple tiers for a data lake, starting with the source, i.e., the flat file repository. Above that sit the ingestion tier, where data is taken in; the unified operations tier, where it is processed; the insights tier, where the answers are found; and the action tier, where decisions are made and acted on based on the findings.
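The sketch below models that tiered flow as a simple pipeline of functions. The stage names mirror the tiers; the stage bodies are placeholder assumptions for illustration, not Pivotal’s implementation.

```python
# A rough sketch of the tiered flow described above, one function per tier.
# The stage bodies are placeholder assumptions.
def ingestion_tier(source_records):
    """Pull raw records into the lake."""
    return list(source_records)

def unified_operations_tier(records):
    """Process and normalize the ingested records."""
    return [r.strip().lower() for r in records]

def insights_tier(records):
    """Derive an answer; here, a simple frequency count."""
    counts = {}
    for r in records:
        counts[r] = counts.get(r, 0) + 1
    return counts

def action_tier(insight):
    """Act on the finding; here, flag anything seen repeatedly."""
    return [value for value, n in insight.items() if n > 1]

raw = ["ERROR ", "ok", "error", "OK", "ok "]   # the source: flat raw data
print(action_tier(insights_tier(unified_operations_tier(ingestion_tier(raw)))))
```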
Building a Data Lake
Wei Zheng, vice president of products at Trifacta, a data manipulation and visualization developer, said that while data lakes are more open structurally than a data warehouse, one thing users should do is build zones for different data, quarantining it by its level of cleanliness.
“In a data lake model, if you don’t know how the data is consumed but want to catalog everything in the lake, you have to group and organize it on the cleanliness and how mature the data might be,” she said.
She recommended four zones. The first holds completely raw data, not cleaned, filtered, or examined at all. Second is the ingestion zone, where you do early standardization around categories: does this data fit into finance, security, customer information, and so on? Third is data ready for exploration, where you might still need to pull a few key ingredients out of the raw data to focus on.
The consumption layer is fourth. This is the closest match to a data warehouse, with a defined schema and clear attributes understood by everyone. Between each of these zones, some form of ingestion and transformation takes place.
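A minimal sketch of those four zones as directories, with a promotion step between each, might look like the following. The paths and transformation hooks are illustrative assumptions, not Trifacta’s implementation.

```python
# A sketch of Zheng's four zones as directories, with a promotion step
# between each. Paths and transforms are illustrative assumptions.
from pathlib import Path
import shutil

ZONES = ["raw", "ingestion", "exploration", "consumption"]
LAKE = Path("/data/lake")          # placeholder root for the lake

def promote(filename, from_zone, to_zone, transform=None):
    """Move a data set to the next zone, transforming it on the way."""
    src = LAKE / from_zone / filename
    dst = LAKE / to_zone / filename
    dst.parent.mkdir(parents=True, exist_ok=True)
    if transform:
        dst.write_text(transform(src.read_text()))
    else:
        shutil.copy(src, dst)      # raw data moves untouched
    return dst

# Example flow (hypothetical transforms): standardize categories on the way
# into ingestion, then project key fields on the way to exploration.
# promote("orders.csv", "raw", "ingestion", transform=standardize_categories)
# promote("orders.csv", "ingestion", "exploration", transform=pick_key_fields)
```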
While this allows for a more freewheeling method of data processing, it can also get expensive if you have to reprocess the data every time you use it. “Generally you will pay less dollars if you define it up front because a lot has to do with how you organize the info in your data lake. There is a cost with repartitioning the data,” said Zheng.
Data Lake Tools
Tools for data lake preparation and processing come in several forms, and many are still early-stage, as the data lake concept is only around five years old. The old guard of BI and data warehousing vendors has not yet moved into the data lake space, so most of what is out there comes from startups and open source projects. But there are notable vendors.
Amazon, Microsoft, Google, and IBM all offer a variety of data lake tools along with the basic storage service. While most data lakes reside on premises, some are born in the cloud and stored with a provider like the big four, all of which offer tools for data ingestion, transformation, examination, and reporting. In addition, there are other notable vendors of data lake tools.
HVR: HVR offers software for moving data in and out of the lake in real time from multiple sources; it performs real-time comparisons to ensure data integrity and scales across multiple systems.
Apache NiFi: This is an Apache-licensed open-source tool, also available as a commercially supported product from Hortonworks under the name DataFlow. NiFi itself is file-oriented and schema-agnostic, though individual processors may operate on one specific format. It’s used for data routing and transformation.
Podium Data: Podium offers an easy-to-use tool for building an enterprise-class managed data lake with no specialized Hadoop skills required. It claims a secure, managed enterprise data lake can be built and deployed in less than a week.
Snowflake Software: Snowflake has a custom SQL database for building repositories to store and process a wide variety of data, including corporate data, weblogs, clickstreams, event data, and email. It can also ingest semi-structured data from a variety of sources without transforming it first.
Zaloni: Zaloni offers a complete enterprise data lake platform called Data Lake 360, which includes a management platform, a data catalog, and self-service data prep tools that cover end-to-end processing.