How does cloud integration play into Modern Data Management?
About the Author
Robert Griswold is the VP of professional services for Tesch Global, a Talend Gold Partner. Robert has worked with Talend for 4 years and with data management architecture and tools for over 30 years. His passion is to enable his team and customers to make data management a cornerstone of successful businesses.
Why should I read this?
This article is for those who want a better understanding of what the Talend Integration Cloud (TIC) is and what it can do. A cloud strategy is inevitable for all companies: large, medium and small. Talend has steadily gained market share thanks to its affordability and its willingness to embrace new technologies such as Big Data and cloud-hosted data sources. Now it is entering the realm of data management in the cloud.
What is TIC?
The Talend Integration Cloud, known simply as TIC, is a cloud-based data management platform. Data management involves the life cycle of data, and TIC gives you an option to eliminate or contain the need for an on-premise integration platform. You will most likely still have sources and targets that are local, or private in terms of network accessibility. The goal is to securely enable agile connections to all intranet and internet data using a cloud-hosted data management platform.
There must be a transitional period from on-premise to the cloud in terms of tools as well. The user-facing interfaces (administration, operations, self-service, data quality and data stewardship) can easily be served through web pages. We should see cloud coverage (excuse the pun) on most of these applications before year end or shortly thereafter. The studio (Eclipse-based) can still be thought of as a rich, local client and will likely be desktop-based for the foreseeable future. Cloud-based Eclipse projects like Che and Orion have been around for a while and will eventually be used more in tandem with their desktop-based counterparts.
Why is it important?
It is logical to transition the data management platform to the cloud, since that is where most Enterprise Data Warehouses (EDW) and applications exist or are headed. Data management platforms are broad and continually evolving in terms of capabilities and complexity. The number of servers and services needed to make each capability highly available can be daunting. There are a few compelling reasons to consider the cloud:
- Instantly Available Platforms - You can have a new data management platform within hours, not days.
- Upgradable - Your upgrade path is much easier.
  - Back up and restore your code and database configurations to a new repository so it can be upgraded to a new version without disrupting older versions.
  - Upgrade your remote engine agents.
  - Upgrade your job artifacts or continue to run the old version until changes are needed.
- Scalable - The future will bring with it virtualization and autoscaling.
  - Increase your cloud data management footprint.
  - Install new virtualized remote or cloud engines.
- Highly Available - Let the onus of high availability (HA) lie with your cloud vendor
- Pliable - Loosen the version and vendor lock-in driven by opportunity costs, risk, budget and complexity.
- Abstraction of the technical details - Get out of the data center and infrastructure business and focus on what is important for your industry: the data and processes needed to operate and excel.
- Security - Cloud security is typically locked down completely by default by most vendors. This means you need to operate within the VPC firewalls or open access up with a VPN. For most operations you will need an OAuth 2.0 token or a similar web authentication and authorization strategy set up for your user.
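To make the token requirement concrete, here is a minimal sketch of how a client might attach an OAuth 2.0 bearer token to a web request. The URL and the function name are hypothetical placeholders, not part of any Talend API; only the "Authorization: Bearer" header format is standard OAuth 2.0 usage.

```python
from urllib.request import Request


def build_authorized_request(url: str, access_token: str) -> Request:
    """Build (but do not send) an HTTP request carrying a bearer token.

    Hypothetical helper for illustration only; the endpoint is an assumption.
    """
    if not access_token:
        raise ValueError("an access token is required")
    return Request(
        url,
        headers={
            # Standard OAuth 2.0 bearer token header.
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )


# Placeholder endpoint, not a real service.
request = build_authorized_request("https://api.example.com/jobs", "my-token")
```

The request is only constructed here, never sent, so the sketch runs without network access.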
What makes it TIC?
GIT - A cloud based source code repository for Talend jobs.
TIC - Talend Integration Cloud Data Management platform as a service.
Public Cloud Sources - Sources available over the internet. No special network connections are needed.
Private Cloud and On-Premise
Web Apps - Cloud-based web applications are typically accessed from behind enterprise firewalls, but in most cases this is not necessary.
Talend Studio - On-premise based full Talend development tool.
Remote Engines - Talend job engines where you deploy and run jobs within your firewall.
On-Premise Sources - Anything such as files, databases, and applications that are on-premises.
Private Cloud Sources - Anything such as files, databases, and applications within a private cloud.
Networks do not lose significance when you move to the cloud; they gain importance. In fact, the need for complex network strategies such as multi-tenanted and/or siloed subnet private networks increases.
Below are basic definitions of the concepts that are important to grasp in a cloud network strategy:
- VPC - A Virtual Private Cloud (VPC) is an on-demand configurable pool of shared computing resources allocated within a public cloud environment, providing a certain level of isolation between the different organizations using the resources.
- VPS - A Virtual Private Server (VPS) is a share of the computing resources of a main host in a data center. A single host is partitioned into several virtual compartments, each capable of functioning independently, and each such ‘instance’ is called a virtual private server.
- VPN - A Virtual Private Network (VPN) connection refers to the connection between your VPC and your own network.
- OPS - An On-Premise Server (OPS) is a server that resides inside your network.
Key points for bridging the gap:
- Using VPCs and VPSs allows you to work completely outside your firewall, inside the cloud-based firewalls.
- If you add a VPN to the mix, you can leverage VPS and/or OPS agents to bridge the cloud and on-premise worlds.
Today’s modern data management landscape contains many sources. These data sources can be on-premise or in a public or private cloud. Of the items below, very few are on-premise, and when they are, the environments tend to be internal private clouds.
Here is a breakdown of the TIC Platform’s main components. The goal is to have everything in the cloud, with the exception of remote engines and the studio. So let’s assume that in the not-so-distant future all the web apps, administration and operations capabilities will be in the cloud.
Administration and Operations
In the cloud, the administration and operations activities will eventually be more integrated in terms of DevOps. The setup for development and code collaboration will feather into administration and operations. Today an on-premise Talend Administration Center (TAC) server handles locking, authentication, and authorization with GIT; this will also move to the cloud.
Key administrative activities:
- Creating Users
- Creating Environments
- Creating WorkSpaces
- Creating Remote Engines
Key operations activities:
- Scheduling Jobs
- Monitoring Jobs
- Continuous Integration
- Continuous Deployment
The remote engine agents will run on VPSs or OPSs and will be responsible for deploying, running and monitoring the integration jobs. Each job is run from its script, which is triggered by a scheduler. These jobs can spawn additional JVMs as designed within the Talend job. Therefore, all you need is a server with a JVM and the remote engine installed and coupled to the cloud with a security token. No inbound ports need to be opened, since the engine initiates the connection outward; once the server is started and paired, communication is bidirectional. Remote engines run on top of Apache Karaf and will be ESB capable.
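The outbound pairing idea can be illustrated with a small in-memory simulation. This is a sketch of the general pattern only, not Talend's actual protocol: the class names, the polling approach and the token value are all assumptions made for illustration. The point is that the cloud side only queues work, and the engine is always the one that initiates the call.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CloudControlPlane:
    """Hypothetical stand-in for the cloud side: paired tokens and queued runs."""
    paired_tokens: set = field(default_factory=set)
    pending: dict = field(default_factory=dict)  # token -> deque of job names

    def schedule(self, token: str, job_name: str) -> None:
        # The scheduler only queues work; it never connects to the engine.
        self.pending.setdefault(token, deque()).append(job_name)

    def poll(self, token: str) -> Optional[str]:
        # Called BY the engine (outbound). Unpaired tokens are rejected.
        if token not in self.paired_tokens:
            raise PermissionError("engine is not paired")
        queue = self.pending.get(token)
        return queue.popleft() if queue else None


@dataclass
class RemoteEngine:
    """Hypothetical engine living inside the firewall."""
    token: str
    completed: list = field(default_factory=list)

    def run_once(self, cloud: CloudControlPlane) -> bool:
        job = cloud.poll(self.token)  # the engine initiates the exchange
        if job is None:
            return False
        # A real engine would launch the job script in a JVM here.
        self.completed.append(job)
        return True


cloud = CloudControlPlane(paired_tokens={"tok-123"})
cloud.schedule("tok-123", "load_customers")
engine = RemoteEngine(token="tok-123")
engine.run_once(cloud)
```

Because every exchange starts from the engine, nothing in this model requires the firewall to accept incoming connections.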
Cloud engines are very similar to remote engines with the following exceptions:
- Can be installed with a push of a button.
- Can only connect to sources visible on the public internet (that is, to everyone). You still need authentication and authorization (and in some cases tokens), but no private network access is required.
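The difference between the two engine types comes down to a simple reachability rule, sketched below. The enum and function names are hypothetical, and the sketch assumes that a remote engine inside your network also has outbound internet access.

```python
from enum import Enum


class Visibility(Enum):
    PUBLIC = "public"                # reachable over the internet
    PRIVATE_CLOUD = "private_cloud"  # inside a private cloud network
    ON_PREMISE = "on_premise"        # inside your own firewall


def engines_able_to_reach(visibility: Visibility) -> list:
    """Return the engine types that can connect to a source.

    Assumption: remote engines have outbound internet access, so they can
    also reach public sources. Cloud engines only see public sources.
    """
    if visibility is Visibility.PUBLIC:
        return ["cloud engine", "remote engine"]
    return ["remote engine"]


choices = engines_able_to_reach(Visibility.ON_PREMISE)
```

In other words, any job that touches a private or on-premise source needs a remote engine, while jobs that only touch public sources can run on either.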
Virtualization of Engines
You can make both cloud and remote engines virtual, allowing for high availability (HA). This means Talend can load-balance traffic across a bank of servers by pointing to a single virtual server. A server with available CPU, RAM, and disk space will be chosen.
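As a rough illustration of choosing a server with available CPU, RAM, and disk space, here is a minimal selection sketch. The thresholds, field names and tie-breaking rule are assumptions for the example; Talend's actual selection logic is not described in this article.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EngineStatus:
    """Hypothetical health snapshot of one engine in the pool."""
    name: str
    cpu_free: float    # fraction of CPU idle, 0.0 to 1.0
    ram_free_mb: int
    disk_free_mb: int


def pick_engine(
    pool: Sequence[EngineStatus],
    min_cpu: float = 0.2,
    min_ram_mb: int = 1024,
    min_disk_mb: int = 2048,
) -> Optional[EngineStatus]:
    """Pick the engine with the most headroom, or None if none qualifies."""
    eligible = [
        e for e in pool
        if e.cpu_free >= min_cpu
        and e.ram_free_mb >= min_ram_mb
        and e.disk_free_mb >= min_disk_mb
    ]
    if not eligible:
        return None
    # Assumed tie-break order: free CPU first, then RAM, then disk.
    return max(eligible, key=lambda e: (e.cpu_free, e.ram_free_mb, e.disk_free_mb))


pool = [
    EngineStatus("engine-a", cpu_free=0.10, ram_free_mb=8192, disk_free_mb=50000),
    EngineStatus("engine-b", cpu_free=0.60, ram_free_mb=4096, disk_free_mb=20000),
    EngineStatus("engine-c", cpu_free=0.45, ram_free_mb=16384, disk_free_mb=90000),
]
chosen = pick_engine(pool)
```

Callers point at the pool rather than at a named server, which is what lets an individual engine drop out without the job schedule changing.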
Data Quality Portal
The Data Quality Portal is a web-based application that uses the SpagoBI business-intelligence software in conjunction with Talend Data Quality to monitor data quality metrics. The metrics are presented graphically.
Data Stewardship allows for the creation, resolution, and application of human-workflow data quality tasks. Talend jobs are used to create data stewardship tasks and to apply their resolutions. A web application is used to create data stewardship campaigns, manage data stewardship users, and administer and operate the human workflow tasks.
Data Preparation enables anyone to access and cleanse data using browser-based, point-and-click tools. Data sets are created to represent files, tables, jobs and Hadoop storage. Results can be exported into BI/Analytics tools (Qlik, Tableau, Excel, etc.) or operationalized in Talend jobs for a one-time or recurring data set.
- The ability for business or operations users to:
  - Hasten the on-boarding and ingestion of new sources
  - Discover and/or fix data quality issues
- Benefits of operationalizing and automating the handling of data quality issues:
  - Valuable insights gained in validation and data stewardship jobs
  - The increased usage and governance of Data Prep recipes in jobs and human processes
The studio is an IDE built on the Eclipse framework. The studio stores jobs and projects as XML, which it uses to generate and build Java code. Design in the studio is mostly drag and drop, accompanied by deeper configuration. You can also use languages like Java and .NET in Talend jobs.
Talend Studio is one of the many reasons Talend is so popular. It's really fun and easy to build data-management jobs. If you know how to build a standard job, learning things like Big Data is a snap! You don’t need to know how to code, and the configuration is simple.
Basic Studio Workflow
- Drag and configure a source
- Drag and drop a map
- Drag and configure the target
- Map the source to the target
- Add any special processing such as filters, joins or aggregations.
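For readers who think in code, the workflow above corresponds roughly to the following source, map, filter and target pipeline. This is a plain Python analogy and not the Java that the studio actually generates; the column names and sample data are made up for illustration.

```python
import csv
import io

# Made-up sample source; in the studio this would be a configured source component.
SOURCE_CSV = "id,name,amount\n1,Ann,10.5\n2,Bob,-3\n3,Cy,7\n"


def read_source(text: str) -> list:
    """Source: read rows from delimited text."""
    return list(csv.DictReader(io.StringIO(text)))


def map_row(row: dict) -> dict:
    """Map: rename and convert source columns to the target schema."""
    return {
        "customer_id": int(row["id"]),
        "customer_name": row["name"].upper(),
        "amount": float(row["amount"]),
    }


def run_job(text: str) -> list:
    """Run source -> map -> filter -> target and return the target rows."""
    target = []
    for row in read_source(text):
        mapped = map_row(row)
        if mapped["amount"] > 0:   # special processing: a simple filter
            target.append(mapped)  # target: here just an in-memory list
    return target


rows = run_job(SOURCE_CSV)
```

In the studio each of these functions would be a component on the canvas, and the mapping would be drawn rather than typed.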
Talend has chosen to keep the studio intact for the TIC offering and offer the full integration capabilities of standard and big data jobs. Rumor has it the ESB is soon to follow. I really love this strategy because you simplify the platform, but you don't dilute the integration capabilities.
We at TESCHGlobal are extremely excited to help Talend, and their customers, ride the wave of modernizing data management in the cloud! This is a defining evolution for Talend and we are urging all our customers to consider the benefits.
How Can TESCHGlobal Help?
Having worked several years in Talend professional services, I decided to establish a practice with some groundbreaking approaches. These rules apply to the “One Team” concept. Our consultants will work closely, as “One Team”, with your data management folks on a proven, intentional approach.
- Enable people from day one in basic and advanced topics across the whole platform.
- Know the inner workings of the platform.
- Have an extensive knowledge base of jobs and articles.
- Each developer should have access to a Talend platform sandbox environment to use for training, engagement preparation and troubleshooting.
- Have a leveraged team where senior staff are savvy in process, technology and architecture. Coding and contextual understanding are encouraged at all levels!
This equates to teams of people who can help you put your best foot forward in optimizing and modernizing your on-premise, cloud, or hybrid approach.