Introduction
Thanks for continuing to follow my exploration of streaming data use cases with Talend Data Streams. For the last article of this series, I want to walk you through a complete IoT integration scenario that uses a low-consumption device and leverages only cloud services.
In my previous posts, I’ve used a Raspberry Pi and some sensors as my main devices. This single-board computer is pretty powerful, and you can install a light version of Linux on it as well. But in real life, enterprises will more likely use System on Chip (SoC) devices such as an Arduino, a PLC or an ESP8266. These SoCs are less powerful, consume less energy and are mostly programmed in C, C++ or Python. I’ll be using an ESP8266, which has embedded Wi-Fi and a few GPIO pins to attach sensors. If you want to know more about IoT hardware, have a look at my last article, “Everything You Need to Know About IoT: Hardware”.
Our use case is straightforward. First, the IoT device will send sensor values to Amazon Web Services (AWS) IoT using MQTT. Then, we will create a rule in AWS IoT to redirect the device payload to a Kinesis stream. Next, from Talend Data Streams, we will connect to the Kinesis stream and transform our raw data using standard components. Finally, with the Python processor, we will create an anomaly detection model using the Z-score, and all anomalies will be stored in HDFS.
<<Download Talend Data Streams for AWS Now>>
Prerequisites
If you want to build your pipelines along with me, here’s what you’ll need:
- An Amazon Web Services (AWS) account
- AWS IoT service
- AWS Kinesis streaming service
- AWS EMR cluster (version 5.11.1 and Hadoop 2.7.X) on the same VPC and Subnet as your Data Streams AMI.
- Talend Data Streams from Amazon AMI Marketplace. (If you don’t have one follow this tutorial: Access Data Streams through the AWS Marketplace)
- An IoT device (can be replaced by any IoT data simulator)
High-Level Architecture
Currently, Talend Data Streams doesn’t feature an MQTT connector. To get around this, you’ll find below a sample architecture that leverages Talend Data Streams to ingest IoT data in real time and store it in a Hadoop cluster.
Preparing Your IoT Device
As mentioned previously, I’m using an ESP8266 (also called a NodeMCU), which has been programmed to:
- Connect to a Wi-Fi hotspot
- Connect securely to the AWS IoT broker using the MQTT protocol
- Read distance, temperature and humidity sensor values every second
- Publish the sensor values over MQTT to the topic IoT
If you are interested in how to develop an MQTT client on the ESP8266, take a look at this link. However, you could use any device simulator.
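If you don’t have a physical device at hand, a few lines of Python on a laptop can play the same role. Here’s a minimal sketch of such a simulator using the paho-mqtt library; the endpoint and certificate file names are placeholders for the ones you get from AWS IoT (see the next section), and the semicolon-separated payload matches the CSV format the pipeline will parse later on.

import ssl
import time
import random

import paho.mqtt.client as mqtt

# Placeholders: replace with your own AWS IoT endpoint and the certificate
# files downloaded when registering the thing.
ENDPOINT = "your-endpoint-ats.iot.eu-west-1.amazonaws.com"
TOPIC = "IoT"

client = mqtt.Client()
client.tls_set(ca_certs="AmazonRootCA1.pem",
               certfile="device-certificate.pem.crt",
               keyfile="device-private.pem.key",
               tls_version=ssl.PROTOCOL_TLSv1_2)
client.connect(ENDPOINT, port=8883)
client.loop_start()

while True:
    # Three semicolon-separated readings; in this walkthrough, humidity is the
    # value that ends up in field1 of the parsed CSV.
    payload = "%d;%d;%d" % (random.randint(10, 200),   # distance
                            random.randint(30, 90),    # humidity
                            random.randint(18, 30))    # temperature
    client.publish(TOPIC, payload)
    time.sleep(1)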
IoT Infrastructure: AWS IoT and Kinesis
AWS IoT:
The AWS IoT service is a secure, managed MQTT broker. In this first step, I’ll walk you through registering your device and generating the public/private keys and CA certificate needed to connect securely.
First, log in to your Amazon Web Services account and look for IoT. Then, select IoT Core in the list.
Register your connected thing. From the left-hand side menu click on “Manage”, select “Things” and click on “Create”.
Now, select “Create a single thing” from your list of options (alternatively, you can select “Create many things” for bulk registration of things).
Now give your thing a name (you can also create device types, groups and other searchable attributes). For this example, let’s keep the default settings and click on “Next”.
Now, secure the device authentication using the “One-click certificate creation” option. Click on “Create a Certificate”.
Download all the files; they have to be stored on the edge device and used by the MQTT client to connect securely to AWS IoT. Click on “Activate”, then “Done”.
In order to allow our device to publish messages and subscribe to topics, we need to attach a policy from the menu. Click on “Secure” and select “Policies”, then click on “Create”.
Give the policy a name. Under Action, start typing “iot” and select “iot:*” so that all actions are allowed, tick the “Allow” box below and click on “Create”.
Let’s attach this policy to a certificate. From the left menu, click on “Secure”, select “Certificates” and click on the certificate of your thing.
If you have multiple certificates, click on “Things” to make sure you have selected the right certificate. Next, click on “Actions” and select “Attach Policy”.
Select the policy we’ve just created and click on “Attach”.
Your thing is now registered and can connect, publish messages and subscribe to topics securely! Let’s test it (it’s now time to turn on the ESP).
Testing Your IoT Connection in AWS
From the menu, click on “Test”, select “Subscribe to a topic”, type IoT as the topic and click on “Subscribe to Topic”.
You can see that sensor data is being sent to the IoT topic.
Setting Up AWS Kinesis
On your AWS console search for “Kinesis” and select it.
Click on “Create data stream”.
Give your stream a name and select 1 shard to start out; later on, if you add more devices, you’ll need to increase the number of shards. Next, click on “Create Kinesis stream”.
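If you prefer scripting to clicking through the console, the same stream can be created with boto3. This is just a sketch; the stream name and region are assumptions, so replace them with your own.

import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")       # your region
kinesis.create_stream(StreamName="iot-stream", ShardCount=1)     # 1 shard to start out
kinesis.get_waiter("stream_exists").wait(StreamName="iot-stream")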
OK, now we are all set on the Kinesis side. Let’s go back to AWS IoT: on the left menu, click on “Act” and press “Create”.
Name your rule, select all the attributes by typing “*” and filter on the topic IoT; the resulting rule query statement should look something like SELECT * FROM 'IoT'.
Scroll down and click on “Add Action” and select “Sends messages to an Amazon Kinesis Stream”. Then, click “Configure action” at the bottom of the page.
Select the stream you’ve previously created, and use an existing role or create a new one that grants AWS IoT access to Kinesis. Click on “Add action” and then “Create Rule”.
We are all set at this point: the sensor data collected from the device over MQTT will be redirected to the Kinesis stream, which will be the input source for our Talend Data Streams pipeline.
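Before building the pipeline, you can optionally check that messages are actually landing in the stream. Here’s a quick sketch with boto3; again, the stream name and region are assumptions to adapt to your setup.

import time
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")
shard_id = kinesis.describe_stream(StreamName="iot-stream")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(StreamName="iot-stream",
                                      ShardId=shard_id,
                                      ShardIteratorType="LATEST")["ShardIterator"]
while True:
    out = kinesis.get_records(ShardIterator=iterator, Limit=25)
    for record in out["Records"]:
        print(record["Data"])              # should print the semicolon-delimited sensor values
    iterator = out["NextShardIterator"]
    time.sleep(1)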
Cloud Data Lake: AWS EMR
Currently, with the Talend Data Streams free version, you can use HDFS but only with an EMR cluster. In this part, I’ll describe how to provision a cluster and how to set up Data Streams to use HDFS in our pipeline.
Provision your EMR cluster
Continuing on your AWS Console, look for EMR.
Click on “Create cluster”.
Next, go to advanced options.
Let’s choose a release that is fully compatible with Talend Data Streams; release 5.11.1 or below will do. Then select the components of your choice (Hadoop, Spark, Livy, Zeppelin and Hue in my case). We are almost there, but don’t click on “Next” just yet.
In “Edit software settings”, we are going to edit the core-site.xml that is applied when the cluster is provisioned, in order to use the specific compression codecs required by Data Streams and to allow root impersonation.
Paste the following code into the config:
[
  {
    "Classification": "core-site",
    "Properties": {
      "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec",
      "hadoop.proxyuser.root.hosts": "*",
      "hadoop.proxyuser.root.groups": "*"
    }
  }
]
On the next step, select the same VPC and subnet as your Data Streams AMI and click “Next”. Then, name your cluster and click “Next”.
Select an EC2 key pair, go with the default settings for the rest and click on “Create Cluster”. After a few minutes, your cluster should be up and running.
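For reference, the same cluster can also be provisioned programmatically. Below is a boto3 sketch that reuses the core-site settings from above; the cluster name, instance types, key pair and subnet ID are placeholders for your own values, so treat it as an illustration rather than a copy-paste recipe.

import boto3

emr = boto3.client("emr", region_name="eu-west-1")
emr.run_job_flow(
    Name="datastreams-emr",                       # hypothetical cluster name
    ReleaseLabel="emr-5.11.1",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Livy"},
                  {"Name": "Zeppelin"}, {"Name": "Hue"}],
    Configurations=[{
        "Classification": "core-site",
        "Properties": {
            "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,"
                                     "org.apache.hadoop.io.compress.DefaultCodec,"
                                     "org.apache.hadoop.io.compress.BZip2Codec,"
                                     "org.apache.hadoop.io.compress.SnappyCodec",
            "hadoop.proxyuser.root.hosts": "*",
            "hadoop.proxyuser.root.groups": "*",
        },
    }],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.large", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.large", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",              # your EC2 key pair
        "Ec2SubnetId": "subnet-0123456789",       # same subnet as the Data Streams AMI
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)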
Talend Data Streams and EMR set up
Still on your AWS Console, look for EC2.
You will find 3 new instances with blank names that we need to rename. By looking at the security groups, you can identify which one is the master node.
Now we need to connect to the master node through SSH (check that your client computer can access port 22; if not, add an inbound security rule to allow your IP). Because we need to retrieve the Hadoop config files, I’m using Cyberduck (alternatively, use FileZilla or any tool that supports SFTP); use the EC2 DNS for the server, hadoop as the user and the related EC2 key pair to connect.
Now, using your favorite SFTP tool, connect to your Data Streams EC2 machine with the ec2-user (again, allow your client to access port 22). If you don’t have the Data Streams free AMI yet, follow this tutorial to provision one: Access Data Streams through the AWS Marketplace.
Navigate to /opt/data-streams/extras/etc/hadoop and drop there the Hadoop config files retrieved from the EMR master node (they typically live under /etc/hadoop/conf; core-site.xml and hdfs-site.xml in particular). NOTE: The folders /etc/hadoop might not exist in /opt/data-streams/extras/, so you may need to create them.
Restart your Data Streams EC2 machine so that it picks up the Hadoop config files.
The last step is to allow all traffic between Data Streams and your EMR cluster, in both directions. To do so, create security rules that allow all inbound traffic on both sides, referencing the Data Streams and EMR security group IDs.
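If you like to keep this kind of change in code, here is roughly what those rules look like with boto3. The security group IDs are placeholders for your Data Streams and EMR groups (EMR actually uses separate master and slave groups, so repeat for each of them).

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
# Allow all traffic from each security group to the other.
for group, peer in [("sg-emr-master", "sg-datastreams"),
                    ("sg-datastreams", "sg-emr-master")]:
    ec2.authorize_security_group_ingress(
        GroupId=group,
        IpPermissions=[{
            "IpProtocol": "-1",                        # all protocols and ports
            "UserIdGroupPairs": [{"GroupId": peer}],   # source: the peer security group
        }],
    )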
Talend Data Streams: IoT Streaming pipeline
Now it’s time to finalize our real-time anomaly detection pipeline based on the Z-score. This pipeline builds on my previous article, so if you want to understand the math behind the scenes, you should read it first.
All the infrastructure is in place and the required setup is done, so we can now start building some pipelines. Log on to your Data Streams free AMI using its public IP and the instance ID.
Create your Data Sources and add Data Set
In this part, we will create two data sources:
- Our Kinesis Input Stream
- HDFS using our EMR cluster
From the landing page select Connection on the left-hand side menu and click on “ADD CONNECTION”.
Give a name to your connection, and for the Type select “Amazon Kinesis” in the drop-down box.
Now use an IAM user that has access to Kinesis with an access key. Fill in the connection fields with the access key and secret, click on “Check connection”, then click on “Validate”. Now, from the left-hand side menu, select “Datasets” and click on “ADD DATASET”.
Give your dataset a name and select the Kinesis connection we’ve created before from the drop-down box. Select the region of your Kinesis stream, then your stream, CSV for the format and Semicolon for the delimiter. Once that is done, click on “View Sample”, then “Validate”.
Our input data source is set up and our samples are ready to be used in our pipeline. Let’s create our output data source connection: on the left-hand side menu, select “CONNECTIONS”, click on “ADD CONNECTION” and give a name to your connection. Then select “HDFS” for the type, use “hadoop” as the user name and click on “Check Connection”. If the connection succeeds, click on “Validate”.
That should do it for now; we will create the dataset within the pipeline. But before going further, make sure that the Data Streams AMI can reach the EMR master and slave nodes (add an inbound network security rule on the EMR EC2 machines to allow all traffic from the Data Streams security group), or you will not be able to read from and write to the EMR cluster.
Build your Pipeline
From the left-hand side menu select Pipelines, click on Add Pipeline.
In the pipeline, on the canvas, click on “Create source”, select the Kinesis stream dataset and click on “Select Dataset”.
Back in the pipeline canvas, you can see the sample data at the bottom. As you’ve noticed, incoming IoT messages are quite raw at this point, so let’s convert the current value types (strings) to numbers: click on the green + sign next to the Kinesis component and select the Type Converter processor.
Let’s convert all our fields to “Integer”. To do that, select the first field (.field0) and change the output type to Integer. To convert the remaining fields, click on NEW ELEMENT and repeat the operation. Once you have done this for all fields, click on SAVE.
Next to the Type Converter processor on your canvas, click on the green + sign and add a Window processor; in order to calculate a Z-score, we need to define a processing window.
Now let’s set up our window. My ESP8266 sends sensor values every second, and I want to create a fixed-time window that contains roughly 20 values, so I’ll set both the window duration and the window slide length to 20000 ms. Don’t forget to click Save.
Since I’m only interested in humidity, which I know is in field1, I’ll make things easier for myself later by converting the humidity values in my window into a list of values (an array in Python) by aggregating on field1. To do this, add an Aggregation processor next to the Window processor. Within the Aggregation processor, choose .field1 as the Field and List as the Operation (since you will be aggregating field1 into a list).
The next step is to calculate the Z-score for the humidity values. In order to create a more advanced transformation, we need to use the Python processor, so next to the Aggregation processor, add a Python Row processor.
Change the Map type from FLATMAP to MAP, click on the 4 arrows to open up the Python editor, paste the code below and click SAVE. In the Data Preview, you can see what we’ve calculated in the Python processor: the average humidity, the standard deviation, the Z-score array and the humidity values for the current window.
Even though the code below is simple and self-explanatory, let me sum up the different steps:
- Calculate the average humidity within the window
- Find the number of sensor values within the window
- Calculate the variance
- Calculate the standard deviation
- Calculate Z-Score
- Output Humidity Average, Standard Deviation, Zscore and Humidity values.
# Import standard Python libraries
import math

# Average function
def mean(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)

# Initialize variables
std = 0

# Load the input list and compute the average value for the window
avg = mean(input['humidity'])

# Length of the window
mylist = input['humidity']
lon = len(mylist)

# x100 in order to work around a Python limitation
lon100 = 100 / lon

# Calculate the variance
for i in range(len(mylist)):
    std = std + math.pow(mylist[i] - avg, 2)

# Calculate the standard deviation
stdev = math.sqrt(lon100 * std / 100)

# Re-import all sensor values within the window
myZscore = input['humidity']

# Calculate the Z-score for each sensor value within the window
for j in range(len(myZscore)):
    myZscore[j] = (myZscore[j] - avg) / stdev

# Output results
output['HumidityAvg'] = avg
output['stdev'] = stdev
output['Zscore'] = myZscore
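If you want to sanity-check the math outside of Data Streams, the same computation takes only a few lines with the Python standard library. The window of readings below is made up for illustration.

import statistics

window = [52, 53, 51, 52, 54, 53, 52, 90, 51, 52]   # hypothetical humidity window
avg = statistics.mean(window)
stdev = statistics.pstdev(window)                   # population standard deviation, as in the processor code
zscores = [(x - avg) / stdev for x in window]
print(avg, stdev, zscores)                          # the 90 reading gets by far the largest Z-score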
If you open up the Z-score array, you’ll see the Z-score for each sensor value.
Next to the Python processor, add a Normalize processor to flatten the Python array into records: in the column to normalize, type Zscore, tick the “Is list” option, then save.
Let’s now recalculate the initial humidity value from the sensor. To do that, we will use another Python Row processor and write the code below:
# Output results: rebuild the original humidity reading from its Z-score
output['HumidityAvg'] = input['HumidityAvg']
output['stdev'] = input['stdev']
output['Zscore'] = input['Zscore']
output['humidity'] = round(input['Zscore'] * input['stdev'] + input['HumidityAvg'])
Don’t forget to change the Map type to MAP and click save. Let’s go one step further and select only the anomalies. If you had a look at my previous article, anomalies are Z-scores that fall outside the -2 to +2 standard deviation range; in our case, the threshold is around -1.29 and +1.29. Now add a FilterRow processor. The product doesn’t yet allow us to filter on a range of values, so we will filter on the absolute value of the Z-score being greater than 1.29; we test on the absolute value because the Z-score can be negative.
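In plain Python, the filter applied by the FilterRow processor boils down to the check below; the 1.29 threshold comes from the article, and the list of Z-scores is made up for illustration.

zscores = [0.3, -1.5, 0.8, 2.1, -0.2]
anomalies = [z for z in zscores if abs(z) > 1.29]
print(anomalies)   # -> [-1.5, 2.1]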
The last output shows 5 records that are anomalies out of the 50 sample records. Let’s now store those anomalies in HDFS: click on “Create a Sink” on the canvas and click on “Add Dataset”. Set it up as shown below and click on “Validate”.
You will end up with an error message; don’t worry, it’s just a warning that Data Streams cannot fetch a sample of a file that has not been created yet. We are now all set, so let’s run the pipeline by clicking the play button at the top.
Let’s stop the pipeline and have a look at our cluster. Using Hue on EMR, you can easily browse HDFS; go to user/Hadoop/anomalies.csv. Each partition file contains the records that are anomalies for each processing window.
There you go! We’ve built our anomaly detection pipeline with Talend Data Streams, reading sensor values from a SoC-based IoT device and using only cloud services. The beauty of Talend Data Streams is that we accomplished all of this without writing any code (apart from the Z-score calculation); I’ve only used the beautiful web UI.
To sum up, we’ve read data from Kinesis, used the Type Converter, Aggregation and Window processors to transform our raw data, and then a Python Row processor to calculate the standard deviation, average and Z-score for each individual humidity sensor reading. Then we’ve filtered out normal values and stored the anomalies in HDFS on an EMR cluster.
That was my last article on Data Streams for the year. Stay tuned, I’ll write the next episodes when the product becomes generally available at the beginning of 2019. Again, happy streaming!