Is your organization challenged to activate its data to make more informed decisions? Does it take weeks or even months to obtain a report or visualization because data is unclean, inaccessible, and disorganized? If you answered yes to these questions, there’s no reason to keep following the same laborious processes. Today’s organizations need a 360-degree view of their data that is agile, accessible, and accurate. With quick access to real-time information, decision makers and leaders can share better insights and drive informed outcomes.
Good news! There are modern methodologies and technologies that make it much easier to achieve that 360-degree view. Methodologies such as the Data Vault and the Data Lake have gained popularity in recent years. While both can be used to store raw data, the end results can differ based on the challenges and opportunities I outline below. There are circumstances where one path makes more sense than the other, but how do you know which is better suited to your organization?
You must determine your organization’s prioritized architectural quality attributes, such as agility, accessibility, and accuracy. Other considerations include compliance requirements such as FINRA and HIPAA, and security measures such as two-factor authentication. Last, but definitely not least, feasibility is determined by the complexities of data design, extensibility, and data quality.
Simply put, a Data Vault is a modern take on the enterprise data warehouse (EDW). It is both a prescriptive data modeling technique and a methodology that accommodates historical data, auditing, and tracking. The Data Vault model combines third normal form and star schema approaches, and its methodology is a light hybrid of Agile, PMP, CMMI 5, TQM, and Six Sigma. Its architecture allows for unstructured data, takes modern technologies into account, and enables organizations to focus on real-time business intelligence and analytic decision making.
A Data Lake stores raw data in its original format. It is a storage repository that holds a vast amount of raw, native data until it is needed for transformation. In my opinion, a Data Lake is a roll-your-own (RYO) methodology: there is a task to be performed, but no established method for completing it, so you will need to be creative to solve the problem. Successful Data Lakes rely heavily on design, methodology, metadata, and governance. There are a few books on Data Lakes that focus on the metadata and governance aspects of the methodology, but nothing fully prescriptive. Honestly, “Data Lake” appears to be more a term for marketing technologies than a prescriptive, fully baked approach to data warehousing. That is fine for organizations that want more (endless) decisions to make, but in my 30 years of experience, I find this is seldom the case! The majority of IT leaders today want a known precedent; they want a repeatable, prescriptive solution.
How Each Method Works
A Data Vault can be classified as a place to store unchanged data from multiple sources. The data is arranged primarily around three types of tables: hubs, satellites, and links. Hubs contain the business keys of entities (which is nice: one place to find your keys), satellites contain descriptive data, and links contain associations or hierarchies of data. Satellites can hang off of both hubs and links, and every table records its source and load date. An important thing to remember is that satellites can be added to links or hubs without changing existing satellites, so you can add new sources without changing existing designs. This is the secret sauce that gives the Data Vault its extensibility. The links give the method a relational flair, while hubs and satellites reflect star schemas. You can even develop business and operational vault concepts to act as the true source for the EDW, information marts, OLAP, and ODSs.
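The three table types above can be sketched as simple records. This is a minimal illustration, not a full Data Vault implementation; the class and field names (Hub, business_key, and so on) are my own shorthand for the concepts described:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Hub:
    business_key: str   # the enterprise-wide key for an entity
    record_source: str  # which source system supplied the key
    load_date: date     # when the row was loaded

@dataclass(frozen=True)
class Link:
    hub_keys: tuple     # business keys of the hubs this link associates
    record_source: str
    load_date: date

@dataclass(frozen=True)
class Satellite:
    parent_key: str     # key of the hub (or link) it hangs off of
    attributes: dict    # the descriptive data, e.g. name, address
    record_source: str
    load_date: date

# A hub row and a satellite hanging off it; note every row carries
# its source and load date, which is what enables built-in lineage.
customer = Hub("CUST-42", "crm", date(2024, 1, 1))
profile = Satellite("CUST-42", {"name": "Acme"}, "crm", date(2024, 1, 1))
```

Adding a new source later just means adding another satellite row (or table) keyed to the same hub; the existing hub and satellites are never modified.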
A Data Lake is a place to store all raw data in its original format. There is no prescriptive approach to Data Lake design, and a lake typically resembles a collection of relational tables. Data Lakes can reside in platforms such as Hadoop, an RDBMS (Oracle, MySQL, MS SQL Server), or an MPP store (Snowflake, Redshift, or Teradata). They can handle structured and unstructured data sources, but because there is no prescriptive way to store the data, each source entity has to have its own table. So, if you have dozens of applications or sources with similar entities, each will have its own set of tables. To make sense of all the raw data in disparate tables, you need to virtualize its extraction and/or create a canonicalized business representation, such as an information mart or an operational data store (ODS), for consumption.
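The virtualization step described above can be sketched as follows. Here each source keeps its own raw “table” (a list of dicts standing in for real tables), and a thin mapping layer unions them into one canonical customer view; the source names and field mappings are invented for illustration:

```python
# Each source lands in its own raw table with its own schema.
raw_crm = [{"cust_id": "42", "cust_name": "Acme"}]
raw_erp = [{"id": "42", "legal_name": "Acme Corp"}]

# Hand-maintained mappings -- this is the "roll your own" part.
MAPPINGS = {
    "crm": {"key": "cust_id", "name": "cust_name"},
    "erp": {"key": "id", "name": "legal_name"},
}

def canonical_customers(sources):
    """Virtualize disparate raw tables into one business representation."""
    for source, rows in sources.items():
        m = MAPPINGS[source]
        for row in rows:
            yield {"source": source, "key": row[m["key"]], "name": row[m["name"]]}

customers = list(canonical_customers({"crm": raw_crm, "erp": raw_erp}))
```

Every new source means another entry in the mapping layer, which is exactly the per-source maintenance burden the article describes.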
For both Data Vault and Data Lake, you should consider data governance measures because each could benefit from the following:
- Data dictionaries that represent the technical metadata schema of sources
- Data glossaries that represent the business meaning of the data
- Data lineage maps of the data to support the need for traceability
- Impact analysis of changes for planning and lessening the load of maintenance
Regardless of approach, we recommend embarking on a journey through the data quality maturity curve to turn raw data into useful information.
- Data Profiling - Analyze your data to gain an understanding of, and a barometer for, its quality
- Master Data Management - Use matching, merging, and survivorship capabilities to reach golden records for individuals, products, groups, and organizations
- Standardization and Cleansing - Use validation, synonyms, regular expressions, and third-party APIs as fundamental data quality measures
- Data Stewardship - Create workflows and manual tasks for intervention and enrichment
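The first step on that curve, data profiling, can be as simple as counting rows, nulls, and distinct values per column. A minimal sketch (the sample rows and column names are invented):

```python
# Profile each column: row count, null count, and distinct values --
# a simple barometer of data quality before any cleansing begins.
rows = [
    {"email": "a@example.com", "state": "WI"},
    {"email": None,            "state": "WI"},
    {"email": "b@example.com", "state": None},
]

def profile(rows):
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "rows": len(values),
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return report
```

A report like this quickly surfaces which columns need standardization, cleansing, or stewardship intervention.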
At TESCHGlobal (TG), our team leverages Talend as a data management tool for both methods. Whether you choose to go with a Data Vault or a Data Lake approach, Talend is a great tool that supports ingestion, data quality, master data management, and governance at any scale.
Pros and Cons of Each Method
Traditionally, we choose to implement a Data Vault methodology over a Data Lake. We do, however, have customers who are using a Data Lake model. Data Lakes have some really relevant and successful use cases. So, choosing the correct approach depends on weighing the factors we lay out below.
The key benefit of a Data Vault is its extensibility. To include a new data source in a Data Vault design, you simply add a satellite table to your raw and business vault. This has no impact on existing structures or business key attributes. With a Data Lake model, on the other hand, a new source means adding a table to the raw staging area and possibly refactoring your EDW or ODS. The amount of work and the potential for risk are much greater.
Data Vault Pros:
- Business keys are in one place and satellites are created off of hubs and links
- It’s a cleaner model that is well defined and prescriptive
- It’s more extensible; you can add new sources/satellites (descriptive data) without impacting existing data
- It has a published methodology and architecture
- Point-in-Time analysis makes it possible to reflect the state of data within a given time frame
- Same-as links allow the cross-referencing of keys across lineages and systems
- Built-in data lineage by storing source system and load date
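The point-in-time capability listed above falls out of storing every satellite row with its load date: to see the state of the data on a given day, take, for each key, the latest row loaded on or before that day. A minimal sketch with an invented address satellite:

```python
from datetime import date

# Satellite history: both rows are kept, never updated in place.
address_sat = [
    {"key": "CUST-42", "city": "Madison",   "load_date": date(2023, 1, 5)},
    {"key": "CUST-42", "city": "Green Bay", "load_date": date(2024, 6, 1)},
]

def as_of(satellite, as_of_date):
    """For each key, return the latest row loaded on or before as_of_date."""
    latest = {}
    for row in satellite:
        if row["load_date"] <= as_of_date:
            prev = latest.get(row["key"])
            if prev is None or row["load_date"] > prev["load_date"]:
                latest[row["key"]] = row
    return latest

# As of the end of 2023, the customer is still at the Madison address.
snapshot = as_of(address_sat, date(2023, 12, 31))
```

Because no row is ever overwritten, the same query answers “what did we believe on any given date,” which is the basis of the auditing and tracking claims made earlier.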
Data Vault Cons:
- The learning curve and rigor are greater than for simply creating a Data Lake, which is a representation of the source systems’ raw data
- Turning raw data into meaningful information marts is only marginally easier, so you have to weigh this benefit over the lifetime of the vault
Personally, I have seen Data Lakes succeed when the data does not vary greatly, delivery and design processes are well defined, and the correct data quality and governance measures are applied. That said, what I hear time and time again from users and tenants of Data Lakes is that the data is not usable in its raw format, and it’s very difficult to build consumable information marts due to a lack of governance and data quality.
Data Lake Pros:
- Has technologies to support data at any scale: databases; MPPs such as Snowflake, Redshift, and Google BigQuery; Hadoop; Spark
- Easiest approach to store raw data
- Great for simple data sets of very high volume
- Successful at high levels of maturity
Data Lake Cons:
- Not fully defined from a design or governance perspective
- Transforming raw data in a Data Lake into a usable format can be infeasible given constraints on understanding, time, budget, and maturity
- Should not be used where the need for extensibility is a high priority
- Success can be limited to simple static datasets, and a low number of sources
- If the data is highly complex, a Data Lake should only be attempted by an organization with a very high level of maturity: well-defined and managed processes, managed metadata, recorded and available data lineage, and enforced data governance
Conclusions to Remember
- Data Lakes can be repurposed to feed a Data Vault as a persistent staging area; this is a natural next step as an organization’s data journey matures
- Both approaches should equally emphasize the importance of independent data governance and data quality processes.
- For both approaches, you will need data designers/architects and technical business analysts to drive and iterate the delivery process. Data Vaults make this more prescriptive and extensible by allowing the addition of data sources without refactoring the model at the raw or post-data-quality representation layers.
- The best approach for determining which model is better for your organization is to listen to the business requirements and apply a roadmap that aligns with the reality of all the factors of complexity, maturity, and the balance of flexibility vs. prescriptiveness.
Interested in learning more? Feel free to reach out to us as we have experience in implementing both approaches with Startup to Fortune X organizations.