Data Management Life Cycle

Robert Griswold, VP Professional Services

This article addresses how to leverage best practices throughout the data management life cycle, taking into account data sources, processing, storage capabilities, constraints and end-user needs.

The opportunities and complexities presented by today’s information life cycle present us with options and decisions that deserve consideration. We will break these decisions and opportunities down into a common batch life cycle pattern within the context of a reference architecture.


Modernized Data Management Milestones

Modern considerations

Source Data

  • What new data sources are available, and what is their data management life cycle? Web activity logs, Social Media, System Logs, Sensor data 
  • Is there a way to process legacy sources that aligns more accurately with SLAs, technical drivers and business objectives? Cost, Performance, Maintenance windows

Staged Data (Raw Data)

  • How to ingest raw data in an efficient and accurate manner
    • Accomplish ingestion with the least amount of human touch
    • Metadata governance replaces drag-and-drop design
    • Reuse of ingestion jobs
    • Operationalization of metadata
  • Where to store

Warehouse Data

  • Which design to use: Kimball, Inmon, or Linstedt
  • Where to process
  • Where to store
  • What types of business rules are applied within the warehouse
    • Hard Business Rules - type changes, normalizing, denormalizing, tabularizing hierarchical data, or restructuring
    • Soft Business Rules - changes to meaning, content or granularity

Information - Data Marts and OLAP with “Soft Business Rules” (transformations, aggregations, granularity changes and data quality) applied

  • Again: When and where are business rules applied?
  • Format: OLAP or Star Schema
  • Operationalized feeds back to sources
  • In Place Analytics

Step 1: Ingest Source into Stage

 

Data Sources

Data sources can come from traditional means of files, applications, databases, media and transactions. New unstructured data sources can also add recognized value to the data management life cycle. These new log, web, media and textual sources can be landed on low cost storage such as HDFS, Google Storage or Amazon S3. Ultimately this raw unstructured data can be indexed and linked to EDW (Enterprise Data Warehouse) information. Beyond the warehouse, this data can also be used for operations or to build Data Marts and OLAP tables for decisioning and analytics.
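
As a minimal sketch, assuming PySpark and an illustrative S3 bucket and path layout, landing raw web activity logs on low cost storage might look like the following. Nothing is parsed or typed at this point; the records are simply tagged and written out so they can later be indexed and linked to EDW information.

    # Minimal sketch (PySpark assumed): land raw web activity logs on low cost
    # storage such as S3 or HDFS. Bucket names and paths are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("land-raw-clickstream").getOrCreate()

    # Read the raw logs as plain text; no parsing or typing is applied yet.
    raw_logs = spark.read.text("s3a://landing-bucket/clickstream/incoming/")

    # Tag each record with its source and load date, then write to cheap storage,
    # partitioned by load date so downstream jobs can pick up daily slices.
    landed = (raw_logs
              .withColumn("record_source", F.lit("web_clickstream"))
              .withColumn("load_date", F.current_date()))

    (landed.write
           .mode("append")
           .partitionBy("load_date")
           .parquet("s3a://landing-bucket/clickstream/raw/"))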

Modern Data Sources

  • Relational databases that house application, transaction and batch data
  • Files
  • Transactions
  • Document Management
  • Emerging Sources
    • Clickstream
    • Web and Social Media
    • Server Logs
    • Sensor and Machine Data - IoT (Internet of Things)
    • Video
    • Audio
    • Unstructured
      • Blogs
      • Comments
      • Descriptions
      • Articles
      • Audio Transcriptions
    • Facial recognition

TESCHGlobal Value Proposition:

  • Identify:
    • New sources
    • Access methods
    • Ingestion strategies
  • Leverage low cost, highly performant storage options
    • Hadoop, Google, Amazon and Microsoft low cost, high density storage offerings
    • Cloud warehousing offerings such as Redshift, Snowflake and Greenplum

Ingestion Processing

Data ingestion is the landing of raw data into a database or a high density file system such as HDFS, S3 or Google Storage. This data can subsequently be loaded into your data warehouse. Staged data mimics the source data in schema (data typing may be relaxed, e.g., everything is a string) and is referred to as raw data. The goal is to make this process as low touch as possible, since not a lot of value is added at this step. Useful tools include Spark Import, bulk utilities, Talend Dynamic Schemas (non Big Data), and Talend Java APIs that use metadata for schemas and maps. The choice between static and dynamic schemas and jobs is driven by the volume of changes to existing schemas and the rate at which new sources are added. Industries and sectors such as government, healthcare, manufacturing and retail that have a large volume of ingestion sources will typically opt for metadata driven dynamic ingestion jobs, as opposed to drag-and-drop design time static jobs, which handle only finite schemas.
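
A minimal sketch of what a metadata driven dynamic ingestion job can look like, assuming PySpark; the metadata layout, connection strings and target paths are illustrative, would normally come from a governed metadata store, and are not a Talend API.

    # Minimal sketch of metadata driven ingestion (PySpark assumed). JDBC
    # credentials and drivers are omitted; sources, tables and targets are
    # illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("metadata-driven-ingest").getOrCreate()

    # One metadata entry per source: where it lives and where to land it.
    ingestion_metadata = [
        {"url": "jdbc:postgresql://erp-db:5432/erp", "table": "orders",
         "target": "hdfs:///stage/erp/orders"},
        {"url": "jdbc:postgresql://crm-db:5432/crm", "table": "contacts",
         "target": "hdfs:///stage/crm/contacts"},
    ]

    for entry in ingestion_metadata:
        # Read the source table, then cast every column to string so the staged
        # copy mimics the source schema with relaxed typing.
        df = (spark.read.format("jdbc")
              .option("url", entry["url"])
              .option("dbtable", entry["table"])
              .load())
        df = df.selectExpr(*[f"cast(`{c}` as string) as `{c}`" for c in df.columns])
        df.write.mode("overwrite").parquet(entry["target"])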

TESCHGlobal Value Proposition:

  • Identify what tools are available in your enterprise for ingestion
  • Leverage Talend out of the box capabilities when possible
    • Non Big Data Dynamic Schemas
    • Spark Import and other Bulk Load Components
  • Extend Talend capabilities by making them metadata driven, using a small amount of Java code and metadata sources
  • Use the proper tools and processing environment for ingestion based on requirements, constraints, best practices and enterprise architecture compliance.

Staging Raw Data

The main purpose of landing data in the staging area, as opposed to a transactional or operational store, is to reduce the impact on those systems. The staging area can be a relational, MPP, NoSQL or Hadoop database. The data in the staging table should generally be in the format of the source system. The staging area is used to build the data warehouse or data vault.

Some of the possible columns generated for the staging table include (a minimal code sketch follows this list):

  • Load Date
  • Record Source
  • Sequence Number - for ordering purposes only, not a key
  • Hash Keys - Generated from business keys
  • Checksums - Used to determine if a change has taken place in descriptive attributes of satellite rows
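
As a minimal sketch, assuming PySpark and illustrative business key and attribute column names, these generated columns can be appended to a staged DataFrame like this:

    # Minimal sketch (PySpark assumed) of adding the staging columns listed above.
    # Business key and attribute column names are illustrative.
    from pyspark.sql import functions as F

    def add_staging_columns(df, business_keys, record_source):
        """Append load date, record source, sequence number, hash key and checksum."""
        descriptive_cols = [c for c in df.columns if c not in business_keys]
        return (df
                .withColumn("load_date", F.current_timestamp())
                .withColumn("record_source", F.lit(record_source))
                # Sequence number is for ordering only, not a key.
                .withColumn("sequence_number", F.monotonically_increasing_id())
                # Hash key generated from the business key columns.
                .withColumn("hash_key", F.sha2(F.concat_ws("||", *business_keys), 256))
                # Checksum over descriptive attributes, used to detect changes in
                # satellite rows.
                .withColumn("checksum", F.md5(F.concat_ws("||", *descriptive_cols))))

    # Example usage: staged = add_staging_columns(raw_orders, ["order_id"], "erp.orders")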

TESCHGlobal Value Proposition:

  • Design staging jobs that can be converted between technologies with minimal work
  • Leverage job designs that utilize a storage strategy meeting enterprise architecture guidelines for maintainability and scalability.

Loading the EDW from Staged Data

Processing Stage Data

After data is staged it can be processed and loaded into the data warehouse. The data in the staging area can be processed in place via ELT using SQL and APIs, or moved through ETL data management runtimes. Where to process staging data is typically driven by the data management requirements: you can choose database ELT, Hadoop/Spark ELT or data management ETL runtime servers, and the decision on which to leverage is typically based on cost, capacity, performance and enterprise architecture direction. The amount of processing that takes place when loading the data warehouse depends on the design pattern used. If you are using a relational warehouse design such as Inmon’s, or a star schema as Kimball has advocated, you will transform the data away from the source system format prior to loading the warehouse. If you are using a Data Vault design strategy, loading the EDW consists of loading hub, link and satellite tables. The hubs and satellites of a data vault resemble the source system formats, require far fewer transformations, and have only hard business rules applied.
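
As a minimal sketch of the Data Vault path, assuming PySpark with Hive support and illustrative table and column names, a hub load applies only hard business rules and inserts business keys that are not already present:

    # Minimal sketch (PySpark with Hive support assumed) of loading a Data Vault
    # hub from staged data. Table and column names are illustrative; only hard
    # business rules (keying, typing) are applied here.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("load-hub-customer")
             .enableHiveSupport()
             .getOrCreate())

    staged = spark.table("stage.customers")
    hub = spark.table("edw.hub_customer")

    # Keep only business keys that are not already in the hub, then append them.
    new_keys = (staged
                .select("hash_key", "customer_id", "load_date", "record_source")
                .join(hub.select("hash_key"), on="hash_key", how="left_anti")
                .dropDuplicates(["hash_key"]))

    new_keys.write.mode("append").saveAsTable("edw.hub_customer")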

TESCHGlobal Value Proposition:

  • Leverage ELT components in both relational and Big Data technologies
  • Help with the proper job composition for vertical and horizontal scalability; specifically for jobs targeted at Hadoop/Spark and data management runtimes
  • Help with complex transformations using metadata APIs, industry specifications such as HL7 and X12 EDI, and data mappers
  • Determine best practice processing patterns that meet your acceptance criteria
  • We recommend and enable Data Vault architecture for improved extensibility, scalability and maintainability.

 

Provide Actionable Information

This step provides actionable information to the EDW, sources, Data Marts and OLAP cubes.


Traditional data warehouse

If you are using a traditional data warehouse strategy, a majority of your business rules, both hard and soft, have already been applied. Changes to information requirements or sources could impact the gathering of information, and vice versa. This adds a level of complexity and coupling that could increase maintenance costs and reduce agility.

Data Vault

If you are using a data vault strategy you have raw, high fidelity data. Most of your heavy lifting (soft business rules) gets applied in this actionable information stage. I like to say “it is what it is, now what can we do with it?” Over time your abilities, requirements and understanding of the data will increase, so why not start with the raw materials? It’s a bit like whole wheat or macrobiotic diets: keep it real and less processed, with organic quality measures that don’t impact the truth or wholeness of the information.
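
As a minimal sketch, assuming PySpark with Hive support and illustrative vault and mart table names, applying a soft business rule at this stage might look like the following; the raw vault tables stay untouched while the mart changes granularity.

    # Minimal sketch (PySpark with Hive support assumed) of applying a soft
    # business rule downstream of the vault: rolling raw sales up to a daily
    # store-level mart. Table and column names are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("build-sales-mart")
             .enableHiveSupport()
             .getOrCreate())

    link_sale = spark.table("edw.link_sale")          # relationships between hubs
    sat_sale = spark.table("edw.sat_sale_details")    # descriptive attributes

    daily_sales = (link_sale
                   .join(sat_sale, "sale_hash_key")
                   # Soft rule: change granularity from individual sales to daily totals.
                   .groupBy(F.to_date("sale_timestamp").alias("sale_date"),
                            "store_hash_key")
                   .agg(F.sum("sale_amount").alias("total_sales"),
                        F.count("*").alias("transaction_count")))

    daily_sales.write.mode("overwrite").saveAsTable("marts.daily_store_sales")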

TESCHGlobal Value Proposition: Many of the steps are the same as those for transforming and loading your data warehouse; how much work falls in each phase of the life cycle depends on your strategy.

  • Again: How to leverage ELT components in both relational and Big Data technologies
  • Again: Help with the proper job composition for vertical and horizontal scalability for jobs targeted at Hadoop/Spark and data management runtimes
  • Again: Help with complex transformations using metadata APIs, industry specifications such as HL7 and X12 EDI and data mappers
  • Again: Determine best practice processing patterns that meet your acceptance criteria

Conclusion:

Whatever your data mission statement is, “One Click”, “360 View”, “Right Data at the Right Place and Time”, etc., there is no single right or wrong answer. Let TESCHGlobal help you get the most value and insight from your data, resources and capabilities. We can help with the full life cycle using industry-leading data scientists.

 

Modern Data Management Series