Is the data lake the data warehouse 2.0? Or is it something we’ve seen before? And why is the term data swamp everywhere lately? The term data lake has become a catch-all for any data management concept that doesn’t fit into data warehousing, and unfortunately, it gets a bad rap when the Big Data BI strategies it supports fail to live up to their potential. Gartner estimates that 85% of Big Data projects fail, and according to Gartner's Nick Heudecker, “The problem isn't technology.”
Defining the Data Lake
Simply put, a data lake is a system of data stored in its source format. It’s usually a centralized store of all enterprise data, and tends to be confused with a data warehouse because it also enables reporting, business intelligence, and advanced analytics such as machine learning. While data warehouses and data lakes do many of the same things, the way they do those things is different.
The key differences are the way data is structured and the platform that hosts the data. Data warehouses are usually built on relational database management systems (RDBMS), while data lakes are usually built on Hadoop-based repositories or file systems. This is because RDBMS are not designed to store object blobs or other unstructured documents like emails, PDFs, image files and audio files. Hosting a data lake on Hadoop makes it easier to work with files in their natural format.
So where does the problem come in? When organizations lose control of their data lakes. “We see customers creating big data graveyards, dumping everything into HDFS [Hadoop Distributed File System] and hoping to do something with it down the road. But then they just lose track of what’s there,” says Sean Martin of Cambridge Semantics. “The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.”
How We Got Here
Every business intelligence pipeline involves doing three things: moving the data, preparing it, and analyzing it.
The first people to try doing this on Hadoop were data scientists—perhaps not the best group of people to manage all enterprise data in a big-picture, disciplined way. Some data scientists focused narrowly on analytics problems, and were squeezed into data management only because an organization’s data engineering role was lagging.
Because data scientists were working on complex, big data problems using Hadoop as a platform, the notorious “data silo” emerged. A data silo is a repository of fixed data that remains under the control of one department and is isolated from the rest of the organization.
Organizations tried to eliminate the data silo by moving all data for analytics into one repository.
Thus arose the data lake! But this time, the big data was much harder to control than tabular data. Data warehouses control data with formal logic and a physical model around the data, but data lakes define no model, because they deal with much more complex data.
Where Data Lakes Can Fail
Creating a data lake is easy: You just dump all your data into Hadoop or S3. But for the data lake to succeed, someone has to manage it. Organizations need to commit resources with the right skill sets to data cleansing and data governance, or else data lakes lose value.
Data cleansing. Cleansing data is a challenge for all but the largest organizations with a team of data scientists. The typical data analyst is comfortable working closely with the data warehousing team. The data warehousing team presents data in prepared structures such as star schemas, cubes, and denormalized views, so data analysts can expect clean, quality, curated data. If these same data analysts are told to get data from the data lake, however, and this data has not been prepared or cleaned, the analysts will likely be frustrated by the unfamiliar skill set needed to complete this step.
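To make the cleansing step concrete, here is a minimal Python sketch of the kind of preparation an analyst would otherwise have to do by hand on raw data lake records. The field names, the sample records, and the two date formats are assumptions made for illustration; real data would need its own rules.

```python
from datetime import datetime

# Hypothetical raw records, as they might land in a data lake unprepared.
RAW_RECORDS = [
    {"customer_id": " 001 ", "email": "Ann@Example.com", "signup": "2019-03-01"},
    {"customer_id": "002", "email": "", "signup": "03/05/2019"},
    {"customer_id": " 001 ", "email": "ann@example.com ", "signup": "2019-03-01"},
]

# Assumed date formats; extend this tuple for other sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(value):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def cleanse(records):
    """Trim whitespace, normalize case and dates, and drop duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        row = (
            rec["customer_id"].strip(),
            rec["email"].strip().lower() or None,  # empty email becomes None
            parse_date(rec["signup"].strip()),
        )
        if row in seen:  # duplicate after normalization
            continue
        seen.add(row)
        cleaned.append(dict(zip(("customer_id", "email", "signup"), row)))
    return cleaned


if __name__ == "__main__":
    for row in cleanse(RAW_RECORDS):
        print(row)
```

Note that the first and third records only turn out to be duplicates after normalization, which is why cleansing has to happen before deduplication.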
Data governance. Typically, data stewards and compliance teams have not had access to manage the processes that go on in the data lake. This means that metadata has not been managed and data lineage has not been tracked throughout the entire data lifecycle. Data has not been mastered either, meaning there is no golden record available for key business entities. This problem extends into a lack of workflow automation: workflows have not been automated to ensure data quality and integrity. Data privacy and role-based security have been difficult to implement, and the entire data lifecycle from ingestion to output has not been managed. Without data governance, the data lake can quickly turn into a data swamp.
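As a rough illustration of what managing metadata and tracking lineage can mean in practice, the Python sketch below keeps a tiny in-memory catalog. The `CatalogEntry` and `Catalog` classes, their fields, and the dataset names are all hypothetical; this is not the API of any particular governance tool.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata for one dataset in the lake (hypothetical fields)."""
    name: str
    location: str
    owner: str
    contains_pii: bool = False
    derived_from: list = field(default_factory=list)


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse datasets whose upstream sources were never cataloged,
        # so lineage cannot silently break.
        missing = [p for p in entry.derived_from if p not in self._entries]
        if missing:
            raise ValueError(f"unregistered upstream datasets: {missing}")
        self._entries[entry.name] = entry

    def lineage(self, name):
        """Return every upstream dataset of `name`, nearest first."""
        result = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in result:
                result.append(parent)
                queue.extend(self._entries[parent].derived_from)
        return result

    def touches_pii(self, name):
        """True if the dataset or anything upstream of it holds PII."""
        names = [name, *self.lineage(name)]
        return any(self._entries[n].contains_pii for n in names)


def build_example_catalog():
    """Register a small, made-up set of datasets for demonstration."""
    catalog = Catalog()
    catalog.register(CatalogEntry("raw_orders", "/lake/raw/orders", "sales"))
    catalog.register(
        CatalogEntry("raw_customers", "/lake/raw/customers", "crm", contains_pii=True)
    )
    catalog.register(
        CatalogEntry("clean_orders", "/lake/clean/orders", "data-eng",
                     derived_from=["raw_orders"])
    )
    catalog.register(
        CatalogEntry("order_report", "/lake/reports/orders", "bi",
                     derived_from=["clean_orders", "raw_customers"])
    )
    return catalog
```

With a record like this in place, a compliance team can ask which reports depend on personal data (`touches_pii`) without reading every pipeline by hand.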
Fortunately, innovative new technologies make data cleansing and data governance more straightforward.
New RDBMS solutions, like the Snowflake cloud data warehouse, can handle more data types. Snowflake can ingest semi-structured data, such as JSON and XML documents, eliminating the need to use Hadoop or object storage to augment a data warehouse. Most organizations won’t need to resort to having both a data lake and a data warehouse.
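The general idea of landing semi-structured documents in a relational table can be sketched with Python's standard library alone. This is not Snowflake's API: SQLite stands in as the relational engine, and the event documents, table name, and column names are assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical JSON events with an uneven shape, typical of semi-structured data.
EVENTS = [
    '{"user": {"id": 1, "name": "Ann"}, "action": "login"}',
    '{"user": {"id": 2}, "action": "purchase", "amount": 19.99}',
]


def load_events(conn, raw_events):
    """Flatten JSON documents into rows, keeping the raw text alongside."""
    conn.execute(
        "CREATE TABLE events ("
        "user_id INTEGER, user_name TEXT, action TEXT, amount REAL, raw TEXT)"
    )
    for raw in raw_events:
        doc = json.loads(raw)
        user = doc.get("user", {})
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
            (
                user.get("id"),
                user.get("name"),   # missing keys become NULL
                doc.get("action"),
                doc.get("amount"),
                raw,                # original document preserved for later use
            ),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_events(conn, EVENTS)
    for row in conn.execute("SELECT user_id, action, amount FROM events"):
        print(row)
```

Keeping the raw document in its own column preserves the source format, while the flattened columns give analysts the tabular view they are used to.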
Data lake automation tools like WhereScape, data cataloging and stewardship tools, and MDM stewardship tools like Tamr make it easier and faster to clean, steward, master and prepare data for analytics.
For organizations interested in data relationships, there is an alternative to a data lake. A Data Vault system not only stores data, it makes it possible to create relationships among unstructured, semi-structured and structured data. For an overview of the Data Vault system and why you may want to consider it, view this webinar.
Data lakes succeed when there is a process for moving, preparing, and analysing the data, and when data engineers and data stewards can meet data scientists in the middle. If we get data lakes right, they can be flexible storage spaces that help companies perform business intelligence and analytics, and that’s a change in the right direction.