Why a modern approach to data management?
“Modern Data Management” doesn't change the classic definition of data management, but it does take into consideration the evolution of:
- Constraints: Cost, Time, Resources, Internal and external compliance
- Data sources
- Data formats
This article will address modern data management using a reference architecture. This reference architecture will be used in subsequent blogs to give a context to each discussion.
What is a reference architecture?
A reference architecture provides a template solution for an architecture for a particular domain. It also provides a common vocabulary with which to discuss implementations, often with the aim to stress commonality. The objective is to provide useful guidance on the development of a solution architecture.
A common pitfall of a reference architecture is to jump straight into the physical landscape and vendor implementation. It’s more important to lay the groundwork with a simple, agnostic reference architecture which can then be used to create other conceptual and physical views and perspectives. In other words, don’t get too carried away with activities, models, or best practices until you have a simple reference architecture diagram and narrative in place. Everything else is an addendum to this reference placemat.
Goals of a reference architecture
- Create a standardized terminology
- Taxonomy - Classification of components
- Ontology - Shows the relationship between components
Modern Data Management Reference Architecture
The data management sources above include the typical databases, applications and files. You can also see the addition of data sources not typically stored in databases or in files used to transfer information. These include the many things in today’s society that generate data holding information that could be used to make decisions and gain insight. This data is textual and unstructured in the sense of not being stored in a database. There are two types of unstructured data: repeating (delimited or fixed-length columns in each row) and non-repeating (emails, comments, dialog, blogs, etc.). We will focus on repeating unstructured data, which can be integrated into your data warehouse strategy in much the same way as any file-based data.
Data management Repositories
RDBMS, EDW and MPP
Repositories can be traditional RDBMS, EDW and MPP (Teradata, Netezza, Vertica, Greenplum, Snowflake, SQL Data Warehouse, and Redshift).
The main change to repositories is that much data warehouse activity is moving to the cloud. This allows compute power and storage to be scaled up when needed, and turned off or snapshotted, respectively, when not needed.
High-density file storage options such as HDFS, S3 and Google Cloud Storage have added a new dimension of fast and cheap file I/O. Big data repositories are built and defined leveraging these cheap and highly performant storage advances.
Big data repositories have added Hive, HBase, and NoSQL stores (non-tabular databases designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure), using Hadoop MapReduce (transformations on disk) and Spark (transformations in memory). These big data stores can be used to create and/or store your EDW. The combination of traditional or cloud EDW storage and big data storage can serve a high volume of concurrent users while providing cheap, efficient, scalable storage and processing for colder data.
Data Management Processing
There are several options for data management processing. You basically have ETL engines such as Talend, Informatica, DataStage, SSIS, and Ab Initio. A second choice is to have your database or MPP do the work. This method on MPP has been losing favor due to cost, so many are trying out big data as a processing engine for loading the data warehouse. HDFS and Hive can be used with Hive SQL to accomplish ELT, and Spark batch can also be used for ELT with HDFS and Hive. Spark batch had not gained wide usage in 2016, but I think that will change in 2017 given the improvements in tooling and awareness. Spark with HDFS and Hive is more performant than MapReduce since transformations happen in memory. Another important note is that Spark can run under the Mesos or YARN cluster managers within Hadoop vendor offerings, and it also provides a simple standalone deploy mode that can run on a single server or a cluster. Standalone mode can remove the coupling, and any cost or support issues, associated with Hadoop access, capacity and chargeback.
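The ELT idea above can be sketched in a few lines. This is a minimal illustration only, using SQLite as a stand-in for Hive or a cloud MPP; the table and column names are hypothetical:

```python
import sqlite3

# ELT sketch: land the data raw first, then let the engine do the
# transformation in SQL (SQLite stands in for Hive here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, region TEXT)")

# Extract + Load: raw rows arrive untyped and uncleansed.
raw = [("1001", "250.00", "east"), ("1002", "99.50", "WEST"), ("1003", "10.25", "east")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw)

# Transform: casting, standardizing and aggregating happen in the engine.
conn.execute("""
    CREATE TABLE orders_by_region AS
    SELECT UPPER(region) AS region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_orders
    GROUP BY UPPER(region)
""")

totals = dict(conn.execute("SELECT region, total FROM orders_by_region"))
print(totals)  # {'EAST': 260.25, 'WEST': 99.5}
```

The same pattern scales up: the load step stays dumb and fast, and all the logic lives in SQL the engine can parallelize.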
Data Management Capabilities
Extract, Transform and Load (ETL). The processing takes place in the ETL engine. Older engines like Informatica have resource managers that load balance work across hardware. These proprietary engines tend to be very expensive and somewhat inflexible. Engines such as Talend are available at a lower cost and can be used in a way that distributes the load, but jobs need to be designed to leverage both vertical and horizontal scale. For vertical scale Talend can use subjobs; for horizontal scale, smaller, modularized jobs can be triggered by a schedule or execution plan that load balances across hardware, or by queues using the competing consumer pattern. With the competing consumer pattern, the same job can be deployed across hardware, with each instance listening for the next event on a queue. The message on the queue identifies the context of the work to be done; the context can contain directories, filenames or elements used to build a WHERE clause.
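The competing consumer pattern can be sketched with nothing more than a queue and a few worker threads. This is a toy illustration, not a production design; in practice the queue would be a broker like ActiveMQ or RabbitMQ and the workers would be deployed on separate servers:

```python
import queue
import threading

# Competing consumers: several copies of the same "job" listen on one
# queue; each message carries the context for the work to be done.
work = queue.Queue()
results = []
lock = threading.Lock()

def consumer(worker_id: int) -> None:
    while True:
        msg = work.get()
        if msg is None:          # poison pill: shut this worker down
            work.task_done()
            return
        # A real job would use the context to drive a query or file load.
        with lock:
            results.append((worker_id, msg["region"]))
        work.task_done()

workers = [threading.Thread(target=consumer, args=(i,)) for i in range(3)]
for w in workers:
    w.start()

for region in ["east", "west", "north", "south"]:
    work.put({"region": region})
for _ in workers:
    work.put(None)

work.join()
print(sorted(r for _, r in results))  # ['east', 'north', 'south', 'west']
```

Whichever consumer is free takes the next message, so load spreads itself across the available hardware without a central scheduler.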
Extract, Load and Transform (ELT). The processing takes place in the database engine. The database is a very efficient place to prepare and transform data, and tools like Talend will help you generate the SQL. Some constraints may prohibit or limit this method, such as the cost to scale or contention with online users. The cost constraint may be relieved by the movement of databases into the cloud, where you can throttle and suspend both storage and compute resources based on need.
Transactions are typically thought of in terms of ESB queues and services. With the lower cost of distributed parallel processing in ETL engines and big data loads, feeding the EDW with transactions may gain more popularity.
Connector components are used to eliminate complexity of connections to applications, databases, files and things like email servers to name a few.
APIs are used in the form of Java or .NET libraries or RESTful web services. A lot of data management tools are drag and drop; the use of APIs can add capabilities in the form of responses from web services. These types of APIs will prosper in an age focused on moving platforms and applications to the cloud. Code-level APIs can be used to make processes more metadata driven.
Utilities can be used for bulk importing and exporting, and where APIs are not available.
Metadata can be used to define schemas, business rules and mappings. Some drag and drop tools don’t operationalize metadata; the metadata is designed once and exists as static code. By utilizing ETL and API capabilities, metadata can be operationalized. An example would be a single job that can ingest many different schemas by feeding schema and mapping metadata into the process.
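The single-job-many-schemas example can be sketched like this. The feed metadata, column names and mapping are all hypothetical; the point is that the schema lives in data, not in code, so the same job handles every feed:

```python
import csv
import io

# Metadata-driven ingestion: the schema and mapping are data, so one
# generic job can ingest many different feeds.
feed_metadata = {
    "columns": ["cust_id", "cust_name", "dob"],
    "mapping": {"cust_id": "customer_id", "cust_name": "name", "dob": "birth_date"},
}

def ingest(raw_file, metadata):
    reader = csv.reader(raw_file)
    rows = []
    for record in reader:
        source = dict(zip(metadata["columns"], record))
        # Apply the mapping metadata to produce target-model rows.
        rows.append({target: source[src] for src, target in metadata["mapping"].items()})
    return rows

sample = io.StringIO("42,Ada Lovelace,1815-12-10\n")
print(ingest(sample, feed_metadata))
# [{'customer_id': '42', 'name': 'Ada Lovelace', 'birth_date': '1815-12-10'}]
```

To onboard a new feed you add a metadata entry, not a new job.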
Data quality is the process of cleansing, matching, merging, de-duplicating, standardizing and fixing data. The placement of data quality in the life cycle depends on your approach. Traditionally, data quality was applied during data capture or while building the data warehouse. With today’s push toward data vaults that house raw data, data quality is instead applied when preparing information for feedback to source systems, analytics, marts and cubes. This approach takes the complexity out of the data warehouse design, making it more extensible and representative of the source.
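A tiny data-quality pass might look like the following. The cleansing rules and fields are illustrative only; real implementations use dedicated data quality tooling and fuzzier matching:

```python
# Data-quality sketch: standardize records, then match, merge and
# de-duplicate them on a standardized key.
def standardize(record):
    return {
        "name": record["name"].strip().title(),
        "phone": "".join(ch for ch in record["phone"] if ch.isdigit()),
    }

def dedupe(records):
    merged = {}
    for rec in map(standardize, records):
        key = rec["phone"]                 # match on the standardized phone
        merged.setdefault(key, {}).update({k: v for k, v in rec.items() if v})
    return list(merged.values())

raw = [
    {"name": "  jane smith ", "phone": "(555) 123-4567"},
    {"name": "Jane Smith", "phone": "555.123.4567"},
    {"name": "bob jones", "phone": "555-999-0000"},
]
print(dedupe(raw))
# [{'name': 'Jane Smith', 'phone': '5551234567'}, {'name': 'Bob Jones', 'phone': '5559990000'}]
```

Note that standardization has to come first; the two Jane Smith rows only match once the phone numbers are normalized.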
In databases, change data capture (CDC) is a design pattern used to determine (and track) the data that has changed, so that action can be taken using the changed data. Often this process involves using triggers or logs to identify inserts, updates and deletes. With new technologies for handling large files, database change logs can be processed offline to determine which rows of a table need to be included in the CDC process. Triggers are usually not an option due to strict governance around database design and performance.
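When logs or triggers are unavailable, CDC can also be done by comparing snapshots offline. A minimal sketch, assuming each table has a primary key and the snapshots fit the comparison process:

```python
# Offline CDC sketch: compare yesterday's snapshot with today's to
# classify each primary key as an insert, update, or delete.
def capture_changes(previous, current):
    changes = {"insert": [], "update": [], "delete": []}
    for key, row in current.items():
        if key not in previous:
            changes["insert"].append(key)
        elif previous[key] != row:
            changes["update"].append(key)
    changes["delete"] = [key for key in previous if key not in current]
    return changes

yesterday = {1: {"status": "open"}, 2: {"status": "open"}, 3: {"status": "closed"}}
today = {1: {"status": "open"}, 2: {"status": "closed"}, 4: {"status": "open"}}
print(capture_changes(yesterday, today))
# {'insert': [4], 'update': [2], 'delete': [3]}
```

At scale the same comparison is done as a distributed join over the two snapshots rather than in-memory dictionaries.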
A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. Implementing SCDs is considered one of the most critical ETL tasks, as it tracks the history of dimension records. Lineage and source can be tracked through the SCD process.
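The most common variant, SCD Type 2, never overwrites a changed dimension row; it expires the current version and inserts a new one. A minimal sketch with illustrative fields:

```python
from datetime import date

# SCD Type 2 sketch: close the current version of a dimension row and
# insert a new one, preserving history.
def apply_scd2(dimension, key, new_attrs, today):
    for row in dimension:
        if row["key"] == key and row["current"]:
            if row["attrs"] == new_attrs:
                return dimension          # no change, nothing to do
            row["current"] = False        # expire the old version
            row["end_date"] = today
    dimension.append({"key": key, "attrs": new_attrs,
                      "start_date": today, "end_date": None, "current": True})
    return dimension

dim = [{"key": "C1", "attrs": {"city": "Austin"},
        "start_date": date(2016, 1, 1), "end_date": None, "current": True}]
apply_scd2(dim, "C1", {"city": "Denver"}, date(2017, 3, 1))
print([(r["attrs"]["city"], r["current"]) for r in dim])
# [('Austin', False), ('Denver', True)]
```

Queries against the dimension can then ask either "what is true now" (current flag) or "what was true on a given date" (date range).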
Self-service capabilities are becoming more important as the volume and variability of data increase. Self service gets more data into users’ hands quickly, but care must be taken with regard to security, data governance, and siloed data stores. The goal is a hybrid approach that is quick to market but also considers the operationalization and homogenization of data management practices. Self-service designs can act as a catalyst feeding into the formal design, canonicalization and implementation phases of data management teams. The term catalyst is very important here: the process is hastened, but the end-goal governance practices remain in place.
Data mapping is the process of creating data element mappings between two distinct data models. Often the mapping is simple, from one schema to another, or it could involve business rules and industry specifications such as X12 (EDI) or HL7 (healthcare). Tools such as Talend Data Mapper or Altova’s MapForce can be used.
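At its core a mapping is field renames plus rules. A hypothetical sketch between two made-up models, with one business rule normalizing a coded value (loosely in the spirit of healthcare admin codes, not a real HL7 implementation):

```python
# Data-mapping sketch: field renames plus a business rule.
FIELD_MAP = {"pt_last": "last_name", "pt_first": "first_name"}
GENDER_RULE = {"M": "male", "F": "female", "U": "unknown"}

def map_record(source):
    # Straight renames from the source model to the target model.
    target = {dst: source[src] for src, dst in FIELD_MAP.items()}
    # Business rule: normalize the coded value, defaulting when unrecognized.
    target["gender"] = GENDER_RULE.get(source["sex"], "unknown")
    return target

print(map_record({"pt_last": "Doe", "pt_first": "Jane", "sex": "F"}))
# {'last_name': 'Doe', 'first_name': 'Jane', 'gender': 'female'}
```

Tools like Talend Data Mapper capture exactly this kind of rename-plus-rules table graphically instead of in code.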
Web services are web application components that can be published, found, and used on the Web. Basically, you can expose capabilities as SOAP or REST services and give consumers access across the internet or an intranet. This is relevant to data management transactions and gives batch jobs high availability, modularization, orchestration, and loose coupling.
Data management systems can publish and subscribe to queues that contain messages. For transactional systems the messages can contain business objects and events for processing. In the batch world, queues can be used to create work events. These events are a best practice for distributing load and creating data-driven integrations. The batch messages can contain nested contexts, which can be thought of as partitions for distributing load across servers. As an example, if you have a job that processes a set of database data, the message could carry the values used in a WHERE clause to select what you need. If subscribers on different servers are watching the queue, each will process its partition of the context on the server on which it is running. This pattern can also be used to keep each job small and modular: when one event is complete you can publish the next event, and that job can also use the competing consumer pattern described above.
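A work event of this kind is just a small message whose context the consumer turns into a predicate. A sketch with hypothetical table, column and job names (a real system would use a message broker and parameterized queries rather than string-built SQL):

```python
import json

# Work-event sketch: the message context carries the values that
# partition the work; the consumer builds a WHERE clause from them.
def build_where(context):
    clauses = [f"{col} = '{val}'" for col, val in sorted(context.items())]
    return "WHERE " + " AND ".join(clauses)

# The publisher serializes the event onto a queue...
event = json.dumps({"job": "load_orders",
                    "context": {"region": "east", "load_date": "2017-01-15"}})

# ...and the subscriber deserializes it and scopes its query to its partition.
msg = json.loads(event)
sql = f"SELECT * FROM orders {build_where(msg['context'])}"
print(sql)
# SELECT * FROM orders WHERE load_date = '2017-01-15' AND region = 'east'
```

Publishing one such event per region (or date, or file) is what turns a single monolithic load into many small, parallel, restartable units.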
- Eclipse for design
- JvisualVM for monitoring system & JVM health
- Bamboo and Jenkins for Automated Configuration Management and Continuous Integration
- Administration of users
- Installation & Configuration of the platform
- Creation of source code projects
- Management of licenses
- Scheduling Jobs
- User and role security
- Data security
- Server security
- Communication security
- Assess and act on data quality in the proper place in the data life cycle ensuring regulatory compliance
- Auditability through the storage of raw data and the transparency on data lineage and traceability on the overall data flows
- Data Custodial capabilities - technical control activities
- Data Stewardship capabilities - management and oversight of data assets for availability and quality
I have included the definitions of how enterprises use information. These definitions (Wikipedia sourced or paraphrased) will help give context to the overall architecture.
A data mart is a subset of the data warehouse and is usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.
Online Analytical Processing (OLAP) performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling. In OLAP cubes, data (measures) is categorized by dimensions. OLAP cubes are often pre-summarized across dimensions to dramatically improve query time over relational databases.
OLAP consists of three basic analytical operations:
- Consolidation (roll-up)
- Drill-down
- Slicing and dicing
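Roll-up (consolidation) is the easiest of these to show concretely: a measure is aggregated up a dimension hierarchy. A sketch with made-up sales facts, consolidating from city level to region level:

```python
from collections import defaultdict

# Roll-up sketch: consolidate a measure (sales) up a dimension
# hierarchy, from city level to region level.
facts = [
    ("east", "Boston", 100), ("east", "NYC", 250),
    ("west", "Seattle", 80), ("west", "Portland", 40),
]

def roll_up(rows):
    totals = defaultdict(int)
    for region, _city, sales in rows:
        totals[region] += sales          # aggregate away the city dimension
    return dict(totals)

print(roll_up(facts))  # {'east': 350, 'west': 120}
```

Drill-down is the inverse (region back to city), and slicing/dicing fixes or filters one or more dimensions before the aggregation runs.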
Per Wikipedia, analytics is the discovery, interpretation, and communication of meaningful patterns in data. Analytics can tell you where you are and where you could be relative to your enterprise goals.
Analytics’ area of focus includes:
- Enterprise decision management
- Mix modeling
- Web analytics
- Price and promotion modeling
- Predictive science
- Risk analysis
- Fraud analytics
Big Data capabilities aid in analytics processing by giving access to various sources such as unstructured data and the ability to process the sheer volume of information available.
Visualization is the capability to communicate the meaning of underlying data in a visual format. These depictions can be used for planning and operations.
Author Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message:
Time-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.
Ranking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the sales persons.
Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.
Deviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period. A bar chart can show comparison of the actual versus the reference amount.
Frequency distribution: Shows the number of observations of a particular variable for a given interval, such as the number of years in which the stock market return is between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis. A boxplot helps visualize key statistics about the distribution, such as median, quartiles, outliers, etc.
Correlation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.
Nominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.
Geographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.