
Introducing Talend API Services: Providing Best in Class Purpose-Built Applications


Have you heard? Talend Fall ’18 is here and continues Talend’s plan to meet the challenges today’s data professionals face around organizing, processing, and sharing data at scale. Earlier, Jean-Michel Franco wrote about the Data Catalog portion of this exciting Fall 2018 launch. In this blog, I’d like to focus on our new API features.

For many organizations, APIs are no longer just technological creations of engineers to connect components of a distributed system. Today, APIs directly drive business revenue, enable innovation, and serve as the source of connectivity with partners.

With our Fall ‘18 launch, Talend Cloud includes a new API delivery platform, Talend Cloud API Services. Our delivery platform provides best-in-class, purpose-built applications for API-first creation, testing, and documentation. Essentially, this platform enables organizations to be more agile in their API development. The platform provides productivity gains over hand coding through an easy-to-use graphical design experience that supports both technical and less-technical personnel.

Additional enhancements to existing tools for API implementation and operation ensure organizations have a comprehensive approach to building user-friendly data APIs. And finally, Talend Cloud’s full support for open standards such as OAS, Swagger, and RAML makes the Talend API delivery platform complementary to existing third-party API gateways and catalogs, allowing for easy integration with the gateway or catalog you already have.

Talend Cloud API Designer

With Fall ’18, Talend Cloud provides a new purpose-built application for designing API contracts visually instead of having to go the traditional route of hand coding. Developers can start from scratch or import an existing OAS / RAML definition. The interface allows developers to define API data types, resources, operations, parameters, responses, media types, and errors.

Once a developer is finished defining their contract, the API designer will generate the OAS / RAML definition for you! This can be used later as part of the service(s) creation or imported into an existing API gateway / API catalog. I took a quick screenshot to show what the interface is going to look like.

Now I know how much everyone likes to write documentation (or maybe not). Thankfully with the API designer, the basic documentation is auto-generated for you. Users can then host it on Talend Cloud and easily share it with end consumers in a public or private mode. Talend Cloud API Designer also provides users with the ability to extend the generated documentation through an included rich text editor. Below is an example of the documentation generated by Talend Cloud API Designer.

Talend Cloud API Designer also provides automatic API mocking that can act as a live preview for end consumers, decoupling consumer application development from the backend services still being built. Mocked APIs can return data specified during contract design or automatically generate data based on the defined data structure.

This mock is kept up to date throughout the development process and is enabled using the interface below. This will be a huge benefit for the consumers of my API: they won’t have to wait for me to finish building the back end before they start writing their applications. It’s easy to turn on inside API Designer; a single click and users are off to the races.
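As a quick illustration, once the mock is live a consumer can exercise it with any HTTP client. The host and path below are hypothetical placeholders, not the actual Talend Cloud mock address.

# Call a hypothetical mocked endpoint and request a JSON response
curl -s -H "Accept: application/json" https://api-mock.example.com/customers/42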

Talend Cloud API Tester

Fall ’18 also includes a new application within Talend Cloud called Talend Cloud API Tester. Through this interface, QA and DevOps teams can easily call and inspect any HTTP-based API. It works with complex JSON or XML responses, enabling teams to validate the API’s behavior. Calls are stored, so I can easily look back through my history at what I’ve done before. An example of the interface is shown below.

My favorite feature of Talend Cloud API Tester is the ability to chain API calls together to create scenarios. These chained requests can use data returned from a previous call as parameters in the next call, enabling teams to create real-world examples of how the API will be used. Thankfully, this will keep my Notepad++ tabs down to a minimum. An example of this scenario design is provided below.
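The scenario designer itself is visual, but as a rough command-line sketch of the same idea, two calls can be chained with curl and jq, feeding a value from the first response into the second request. The endpoint and field names here are hypothetical.

# Step 1: create a resource and capture the generated id from the JSON response
id=$(curl -s -X POST -H "Content-Type: application/json" \
  -d '{"name":"Jane Doe"}' \
  https://api-mock.example.com/customers | jq -r '.id')

# Step 2: reuse that id as a parameter in the next call of the scenario
curl -s https://api-mock.example.com/customers/"$id"/orders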

Throughout the testing process, I can define assertions to help validate API responses. Assertions can check a payload for completeness, verify how timely a response was, or even confirm that a field has a specific value. Here’s an example of an assertion I made recently.

The last benefit I’d like to highlight is that test cases created using API Tester can be exported for use within a continuous integration / continuous delivery pipeline, ensuring further updates to the API maintain consistency.

Talend Studio for API implementation

We’ve made it simple to start working with the APIs built in API Designer. There’s a new metadata group called REST API definitions. A couple of clicks and I’ve downloaded the API definition and am ready to start building.

We can use this contract to bootstrap a service using the contract’s defined URIs, media types, parameters, etc. This approach expedites delivery of the backend service by reducing the complexity of defining the various functions.

There are also some updates to Talend Data Mapper: I can use the defined schema from the API definition as the return schema from TDM. This makes it a lot easier to convert data into the expected media type and structure.

Talend Cloud for API Operation     

Yep, Talend Cloud can now manage the services you’ve built in the Studio, just like we manage data integration jobs. If this is your first time hearing about Talend Cloud, it provides a managed environment in which developers can publish services developed in Talend Studio into the Talend Management Console’s artifact repository or a third-party repository, and manage the various environments the service needs to be deployed to as part of the QA / DevOps process. An example of this management can be seen in the snippet below.

As you can see, there is a mountain of functionality available in the new Talend Cloud API Services. If you’d like to see more, stay tuned; we have a series of videos and enablement material to get you all up to speed.

I look forward to hearing about the APIs you plan to build. Stay tuned for upcoming blogs from Talend if you’re looking for some inspiration; I’ll be following this post with a series of use cases we’ve seen and are hearing about!



Beachbody Gets Data Management in Shape with Talend Solutions


This post is co-authored by Hari Umapathy, Lead Data Engineer at Beachbody and Aarthi Sridharan, Sr.Director of Data (Enterprise Technology) at Beachbody.

Beachbody is a leading provider of fitness, nutrition, and weight-loss programs that deliver results for our more than 23 million customers. Our 350,000 independent “coach” distributors help people reach their health and financial goals.

The company was founded in 1998 and has more than 800 employees. Digital business and the management of data are a vital part of our success. We average more than 5 million monthly unique visits across our digital platforms, which generates an enormous amount of data that we can leverage to enhance our services, provide greater customer satisfaction, and create new business opportunities.

Building a Big Data Lake

One of our most important decisions with regard to data management was deploying Talend’s Real Time Big Data platform about two years ago. We wanted to build a new data environment, including a cloud-based data lake, that could help us manage the fast-growing volumes of data and the growing number of data sources. We also wanted to glean more and better business insights from all the data we are gathering, and respond more quickly to changes.

We are planning to gradually add at least 40 new data sources, including our own in-house databases as well as external sources such as Google Adwords, Doubleclick, Facebook, and a number of other social media sites.

We have a process in which we ingest data from the various sources, store the data that we ingested into the data lake, process the data and then build the reporting and the visualization layer on top of it. The process is enabled in part by Talend’s ETL (Extract, Transform, Load) solution, which can gather data from an unlimited number of sources, organize the data, and centralize it into a single repository such as a data lake.

We already had a traditional, on-premise data warehouse, which we still use, but we were looking for a new platform that could work well with both cloud and big data-related components, and could enable us to bring on the new data sources without as much need for additional development efforts.

The Talend solution enables us to execute new jobs again and again when we add new data sources to ingest in the data lake, without having to code each time. We now have a practice of reusing the existing job via a template, then just bringing in a different set of parameters. That saves us time and money, and allows us to shorten the turnaround time for any new data acquisitions that we had to do as an organization.

The Results of Digital Transformation

For example, whenever a business analytics team or other group comes to us with a request for a new job, we can usually complete it over a two-week sprint. The data will be there for them to write any kind of analytics queries on top of it. That’s a great benefit.

The new data sources we are acquiring allow us to bring all kinds of data into the data lake. For example, we’re adding information such as reports related to the advertisements that we place on Google sites, the user interaction that has taken place on those sites, and the revenue we were able to generate based on those advertisements.

We are also gathering clickstream data from our on-demand streaming platform, and all the activities and transactions related to that. And we are ingesting data from the Salesforce.com marketing cloud, which has all the information related to the email marketing that we do. For instance, there’s data about whether people opened the email, whether they responded to the email and how.

Currently, we have about 60 terabytes of data in the data lake, and as we continue to add data sources we anticipate that the volume will at least double in size within the next year.

Getting Data Management in Shape for GDPR

One of the best use cases we’ve had that’s enabled by the Talend solution relates to our efforts to comply with the General Data Protection Regulation (GDPR). The regulation, a set of rules created by the European Parliament, European Council, and European Commission that took effect in May 2018, is designed to bolster data protection and privacy for individuals within the European Union (EU).

We leverage the data lake whenever we need to quickly access customer data that falls under the domain of GDPR. So when a customer asks us for data specific to them, we have our team create the files from the data lake.

The entire process is simple, making it much easier to comply with such requests. Without a data lake that provides a single, centralized source of information, we would have to go to individual departments within the company to gather customer information. That’s far more complex and time-consuming.

When we built the data lake it was principally for the analytics team. But when different data projects such as this arise we can now leverage the data lake for those purposes, while still benefiting from the analytics use cases.

Looking to the Future

Our next effort, which will likely take place in 2019, will be to consolidate various data stores within the organization with our data lake. Right now different departments have their own data stores, which are siloed. Having this consolidation, which we will achieve using the Talend solutions and the automation these tools provide, will give us an even more convenient way to access data and run business analytics on the data.

We are also planning to leverage the Talend platform to increase data quality. Now that we’re increasing our data sources and getting much more into data analytics and data science, quality becomes an increasingly important consideration. Members of our organization will be able to use the data quality side of the solution in the upcoming months.

Beachbody has always been an innovative company when it comes to gleaning value from our data. But with the Talend technology we can now take data management to the next level. A variety of processes and functions within the company will see use cases and benefits from this, including sales and marketing, customer service, and others.

About the Authors: 

Hari Umapathy

Hari Umapathy is a Lead Data Engineer at Beachbody working on architecting, designing and developing their Data Lake using AWS, Talend, Hadoop and Redshift.  Hari is a Cloudera Certified Developer for Apache Hadoop.  Previously, he worked at Infosys Limited as a Technical Project Lead managing applications and databases for a huge automotive manufacturer in the United States.  Hari holds a bachelor’s degree in Information Technology from Vellore Institute of Technology, Vellore, India.

 

Aarthi Sridharan

Aarthi Sridharan is the Sr. Director of Data (Enterprise Technology) at Beachbody LLC, a health and fitness company in Santa Monica. Aarthi’s leadership drives the organization’s ability to make data-driven decisions for accelerated growth and operational excellence. Aarthi and her team are responsible for ingesting and transforming large volumes of data into the traditional enterprise data warehouse and the data lake, and for building analytics on top of it.


Getting Started with Talend Open Studio: Building Your First Job


In the previous blog, we walked through the installation and set-up of Talend Open Studio and briefly demonstrated key features to familiarize you with the Studio interface. In this blog, we will build a simple job to load data from a local file into Snowflake, a cloud data warehouse technology. More specifically, we will build a new job that takes customer data from your local machine and maps it to a target table within Snowflake.

To follow along in this tutorial, you will need Talend Open Studio for Data Integration (download here), some customer data (either use your own customer data or generate some dummy data), and a Snowflake data warehouse with a database already created. If you don’t have access to a Snowflake data warehouse, you can use another relational database technology.

To see a video of this tutorial, please feel free to see our step-by-step webinar—just skip to the third video.

To get started, right-click within the Job Designs folder in the repository and create a new folder titled “Data Integration” to house your job. Next, dive into that new folder and choose “create a job” and name your job Customer_Load. As a best practice, enter the job’s intended purpose and a general description of its overall function. Once you click finish, the new job will be available within your new folder.

Bringing Data from a CSV into a Talend Job

Before building out your flow, bring your customer data into the Repository. To do this, create a new file delimited element within your Metadata folder by right-clicking on the File Delimited button and choosing “Create File Delimited.” Then, name the file “Customer” and click Next.

From there, browse to locate your customer data file. Once selected, the data is visible within the File Viewer. To define the settings and your data elements, click “Next”. In the next window, choose to use a comma as a field separator. Because we selected a CSV file, set the Escape Char setting to CSV and the Text Enclosure to be double quotes. Make sure to check “Set heading row as column names” before proceeding.

Now the customer data is imported and organized. One final time, click “Next” to confirm your data schema. Talend Open Studio will guess each column’s type based on each column’s contents—be sure to double check that everything is correct.

After checking this data set, you can see that Talend Open Studio guessed the “phone2” column was a date, which is incorrect, so instead, change it to String and then click Finish.

Next, you can drag your Customers delimited file onto the Design window as a tFileInputDelimited component. This brings your customer data into the Talend job.

Creating Your Snowflake Connection

Next, you need to create a new connection to your existing Snowflake table. First, find the Snowflake heading in the Repository and right-click to “Create a new connection”. Give your new connection a name, and then enter your account name (so if your Snowflake URL is talend.snowflake.com, your account name would be talend), User ID, and password. Also identify the Snowflake warehouse, schema and database you will be moving your data to.

After you input all of the necessary information, test the connection and make sure it is successful. Following a successful connection to Snowflake, select from the listed tables those you want to be added to the Talend Repository for this connection then click finish. This will import the schema of those tables from Snowflake into the Repository. From there, you can now choose your table of interest from the repository (in this case, Customers) and drag and drop it into the Design window as a tSnowflakeOutput component. As a side note, we have chosen to use an existing table in this tutorial; however, you can also use Talend to create a table in an existing database.

To map the source data (customer file) to the target table (Snowflake), add a tMap component by clicking within the Design window and searching for “tMap”. The tMap component is a very robust component that can be used for a wide range of functions, but for now, we will be using it to simply link the fields between two tables (to learn more about tMap, stay tuned for the next blog in the series). To start using the tMap, connect the CSV file to tMap by dragging the orange arrow from the file delimited component to the tMap.

Next, to connect the tMap to your Snowflake output, right click on tMap and select Row, and click *New Output* to create a new output connection and give it a name like “Customers”. Then, select “Yes” when asked whether you would like to get the schema of the target component.

Your Design window should now look like this: 

To configure the tMap component, double-click on the component itself within the Design Window to reveal the input and output tables. Here you must link both table columns together. You can either drag and drop to link each corresponding field individually, or select Automap, which works great in this case to link the fields between the two tables. Make any adjustments necessary. Once you have ensured the types have been properly auto-selected, click Ok to save this configuration.

If you haven’t installed the additional licenses yet, this Snowflake output component will flash an error. If that’s the case, simply select to install the additional packages which are located within the Help drop-down.

You’re now ready to run the job and populate the data tables within Snowflake. Within the Run tab, simply click Run. You can watch the process run from start to finish within Studio, pushing 500 rows out to Snowflake.

Once the run has been completed successfully, you can head to your Snowflake account. In this example, you can see that 500 records were successfully processed through Talend Studio and loaded into your Snowflake Cloud Data Warehouse.

And that’s how to build your first job within Talend Open Studio. In our next blog, we will go through some more complex functionalities of tMap, and we will also give a few tips on running and debugging your Talend jobs. Please leave a comment and let us know if there are any other things that would help you get started on Talend Open Studio.


A Serverless Architecture for Big Data


This post is co-authored by Jorge Villamariona and Anselmo Barrero at Qubole.

A popular term emerging from the software industry over the last few years is serverless computing, more commonly referred to as just “serverless”. So what does it mean? In its simplest form, a serverless architecture is a computing model where a service provider dynamically manages the allocation of computing resources based on a Service Level Agreement (SLA), provisioning and running resources only for the time needed and without requiring end-user involvement. 

With a serverless architecture, the service provider automatically increases computing capacity when demand for resources is high and intelligently scales down when demand goes down. In this architecture, end users only care about the tasks they want to execute (get a report, execute a query, execute a data pipeline, etc.) without the hassle of procuring, provisioning, and managing the underlying infrastructure.

Traditional vs. Serverless Architectures

So, what are some major advantages of going serverless? Cost, scale, and environment options, to start. Traditional architectures rely on the infrastructure administrator’s ability to estimate workloads and size hardware and software accordingly. Moving to the cloud represents an improvement over on-premises architectures because it allows the infrastructure to scale on demand.

However, administrators still need to be involved to define the conditions and rules to scale and manage the cloud infrastructure. The next step forward is to leverage a serverless architecture and allow the infrastructure to automatically decide behind the scenes when to provision, scale and decommission resources as workloads change.  Qubole is a great example of a serverless architecture.

The Qubole platform automatically determines the infrastructure needed and scales it intelligently based on the workloads and SLAs.  As a result, Qubole’s serverless architecture saves customers over 50% in annual infrastructure costs compared to traditional and other managed cloud big data architectures.

This intelligent automation allows Qubole to process over an exabyte of data per month for customers deploying AI, machine learning, and analytics, without requiring customers to provision and manage any infrastructure.

Value of adopting a serverless architecture for Big Data

Big data deals with large volumes of data arriving at high speed, which makes it difficult and inefficient to estimate the infrastructure required to process it ahead of time. On-premises infrastructures impose limits on processing power and are expensive and complex to manage and maintain. Deploying big data in the cloud on your own, or as a managed service from cloud providers (Amazon AWS, Microsoft Azure, Google Cloud, etc.), eases the processing limitations and the capex burden, but it creates overhead in managing and attempting to optimize the infrastructure. Underutilization or overutilization during certain time periods can lead to cloud costs that are much higher than on-premises processing. This, combined with scarce skilled resources, results in a very low success rate: according to Gartner, only 15% of big data projects succeed.

To successfully leverage a serverless platform for big data you need to look for a solution that addresses the following questions:

  • Will it reduce big data infrastructure costs?
  • Does it provide automation and resources to execute data pipelines and provide analytics at any scale?
  • Will it reduce operational costs?
  • Will it help my data team scale and not be overrun by business demands for data? 

A serverless platform like Qubole is very appealing to teams deploying big data because it addresses the factors that cause big data projects to fail: it reduces infrastructure complexity and costs, as well as reliance on scarce experts.

Qubole reduces administration overhead by providing a simple interface to define the run-time characteristics of big data engines. Users only need to specify the minimum and maximum cluster sizes, whether to leverage spot instances (in the case of AWS), and the cluster composition to meet their price/performance objectives. Qubole then takes over and automatically manages the infrastructure based on the business requirements and the workloads’ SLA, without the need for further manual intervention.

Qubole’s serverless architecture auto-scales to avoid latencies when dealing with large bursty incoming loads and it also down-scales to avoid idle wasted resources.  Qubole can scale from 5 nodes up to 200 nodes in less than 5 minutes. For reference, Qubole also manages the largest Spark cluster in the cloud (500+ nodes).

TCO of a Serverless Big Data Architecture 

When it comes to pricing, Qubole’s serverless architecture offers the best performance by adding computing capacity only when needed and orderly downscaling it as soon as resources become idle. 

With Qubole there are no infrastructure administration overheads or cloud resources overspent. Additionally, as we can see in the chart above, data teams leveraging Qubole don’t suffer from delays in provisioning computing resources when workloads suddenly increase.

The combination of Talend Cloud and Qubole not only lowers infrastructure costs, but also increases the productivity of the data team, since they don’t need to worry about cluster procurement, configuration, and management. Data teams build their data pipelines in Talend Cloud and push their execution to the Qubole serverless platform, all without having to write complex code or manage infrastructure.

This partnership allows these teams to focus on building highly functional end-to-end data pipelines, and lets data scientists more quickly deploy IoT, machine learning, and advanced analytics applications that have a high impact on the business. With Talend and Qubole, data teams build scalable serverless data pipelines that run at low operating cost while often being engineered and maintained by a single developer. This cost reduction makes the benefits of big data accessible to a wider audience.

To learn more about Qubole and test-drive the serverless platform visit https://www.qubole.com/lp/testdrive/

About the Authors

Jorge Villamariona works for the Product Marketing team at Qubole.  Over the years Mr. Villamariona has acquired extensive experience in relational databases, business intelligence, big data engines, ETL,  and CRM systems. Mr. Villamariona enjoys complex data challenges and helping customers gain greater insight and value from their existing data.

 
 
Anselmo Barrero is a Director of Business Development at Qubole with more than 25 years of experience in IT and three patents granted. Mr. Barrero is passionate about building products and strategic partnerships to address market opportunities. He has created products that yield more than 50% YoY growth and established strategic partnerships in areas such as Data Warehouse that resulted in more than 100% consecutive YoY growth.
 
In his current role, Mr. Barrero is responsible for establishing strategic partnerships in big data and the cloud to allow customers to reduce the cost and time of getting value out of their data.


5 Recipes for Not Becoming the Data Turkey of Your Organization


With Thanksgiving around the corner, it’s a perfect moment to take a step back and get some recipes to be data savvy within your organization. Fortunately, Talend experts have a recipe for data success that will help you to stay above the fray.

As companies become more data driven, being ahead of the curve will be seen as a sign of curiosity and a way to differentiate. It is also a means for your company to anticipate incoming trends and thrive in a changing world where data has become the subject of concern and heavy regulation.

Follow these simple recipes to anticipate trends, follow regulations or better manage your data.

 

Recipe #1: Learn more about the Data Kitchen and how to be GDPR Compliant

Recent news is a reminder that not meeting data compliance standards can be damaging for any type of organization. As gdprtoday stated, data complaints appear to be widespread, and it won’t stop there.

 

To better understand GDPR, avoid penalties, and build proper data governance, follow the guidance of our GDPR whitepaper, which explains how to regain control of your data and get ready for data protection regulations.

 

Recipe #2:  Open the fridge and discover your Data

While taking GDPR into consideration, it will also be the right time to identify how to better value the data you do have. To do that, you first need to understand what’s inside your data sources and assess it.

Data profiling is the process of examining the data available in different data sources and collecting statistics and information about this data. It helps to assess the quality level of the data against defined goals. If data is of poor quality, or managed in structures that cannot be integrated to meet the needs of the enterprise, business processes and decision-making suffer.

The best advice would be to read this post that explains the principles of data profiling. If you’re a data engineer, also follow this introduction to Talend Open Studio for Data Quality.

 

Recipe #3: Engage your guests, cook and enrich data together.

You alone will have a hard time solving all your data quality problems. It is far better to treat data as an organizational priority rather than a sole IT responsibility. Managing data quality beyond IT involves different roles across your organization, making your data strategy an enterprise-wide success. This webinar will walk you through the very first steps of collaborative data management. And if you don’t want to fail, this post provides some good recommendations.

 

Recipe #4: Set the table and let the trust flow freely

Once data is cleaned, you will need to provide your teams with a way to share and crawl datasets easily. Follow this webinar about creating a single point of trust with the newly announced Talend Data Catalog. You’ll learn why a data catalog would benefit your entire company and how to take advantage of it.

 

Recipe #5: Don’t cook solo. Learn from experienced cooks.

You may look for customer references or good recipes from companies in your industry. Don’t hesitate to download this guide to see how companies fight their data integrity challenges with modern Talend Tools.

 

Want a dessert? Why not enjoy a good pecan pie with this IDC thought leadership whitepaper about intelligent governance?

And if you’re still hungry, don’t hesitate to download our Definitive Guide to Data Quality.

Happy Thanksgiving!

 


Getting Started with Talend Open Studio: Building a Complex tMap Job


In our previous blog, we walked through a simple job moving data from a CSV file into a Snowflake data warehouse.  In this blog, we will explore some of the more advanced features of the tMap component.

Similar to the last blog, you will be working with customer data in a CSV file and writing out to a Snowflake data warehouse; however, you will also be joining your customer CSV file with transaction data. As a result, you will need Talend Open Studio for Data Integration, two CSV data sources that you would like to join (in this example we use customer and transaction data sets), and a Snowflake warehouse for this tutorial. If you would like to follow a video-version of this tutorial, feel free to watch our on-demand webinar and skip to the fourth video.

First, we will join and transform customer data and transaction data.  As you join the customer data with transaction data, any customer data that does not find matching transactions will be pushed out to a tLogRow component (which will present the data in a Studio log following run time). The data that is successfully matched will be used to calculate top grossing customer sales before being pushed out into a Sales Report table within our Snowflake database.

Construct Your Job

Now, before beginning to work on this new job, make sure you have all the necessary metadata configurations in your Studio’s Repository. As demonstrated in the previous blog, you will need to import your Customer metadata, and you will need to use the same process to import your transaction metadata. In addition, you will need to import your Snowflake data warehouse connection as mentioned in the previous blog if you haven’t done so already.

So that you don’t have to start building a new job from scratch, you can take the original job that you created from the last blog (containing your customer data, tMap and Snowflake table) and duplicate it by right-clicking on the job and selecting Duplicate from the dropdown menu. Rename this new job – in this example we will be calling the new job “Generate_SalesReport”.

Now in the Repository you can open the duplicated job and begin adjusting the job as needed. More specifically, you will need to delete the old Snowflake output component and the Customers table configuration within t-Map. 

Once that is done, you can start building out the new flow. 

Start building out your new job by first dragging and dropping your Transactions metadata definition from the Repository onto the Design Window as a tFileInputDelimited component, connecting this new component to the tMap as a lookup.  An important rule-of-thumb to keep in mind when working with the tMap component is that the first source connected to a tMap is the “Main” dataset.  Any dataset linked to the tMap after the “Main” dataset is considered a “Lookup” dataset.

At this point it is a good idea to rename the source connections to the tMap.  Naming connections will come in handy when it’s time to configure the tMap components. To rename connections, perform a slow double-click on the connection arrow. The name will become editable.  Name the “Main” connection (the Customer Dataset) “Customers” and the “Lookup” connection (the Transactions dataset) “Transactions”.  Later, we will come back to this tMap and configure it to perform a full inner join of customer and transaction data.  For now, we will continue to construct the rest of the job flow.

To continue building out the rest of the job flow, connect a tLogRow component as an output from the tMap (in the same way as discussed above, rename this connection Cust_NoTransactions). This tLogRow will capture customer records that have no matching transactions, allowing you to review non-matched customer data within the Studio log after you run your job. In a productionalized job flow, this data would be more valuable within a database table making it available for further analysis, but for simplicity of this discussion we will just write it out to a log.

The primary output of our tMap consists of customer data that successfully joins to transaction data. Once joined, this data will be collected using a tAggregateRow component to calculate the total quantity and sales of items purchased. To add the tAggregateRow component to the design window, either search for it within the Component Palette and then drag and drop it into the Design Window, OR click directly in the design window and begin typing “tAggregateRow” to automatically locate and place it into your job flow. Now, connect your tAggregateRow to the tMap and name the connection “Cust_Transactions”.

Next, you will want to sort your joined, aggregated data, so add the tSortRow component.

In order to map the data to its final destination, your Snowflake target table, you will need one more tMap. To distinguish between the two tMap components and their intended purposes, make sure to rename this tMap to something like “Map to Snowflake”.

Finally, drag and drop your Snowflake Sales Report table from within the Repository to your Design window and ensure the Snowflake output is connected to your job. Name that connection “Snowflake” and click “Yes” to get the schema of the target component.

As a best practice, give your job a quick look over and ensure you’ve renamed any connections or components with clear and descriptive labels. With your job constructed, you can now configure your components.

Configuring Your Components

First, double-click to open the Join Data tMap component configuration. On the left, you can see two source tables, each identified by their connection name. To the right, there are two output tables: one for the customers not matched to any transactions and one for the joined data.

Start by joining your customers and transactions data. Click and hold ID within the Customers table and drag and drop it onto ID within the Transactions table. The default join type in a tMap component is a Left Outer Join. But you will want to join only those customer IDs that have matching transactions, so switch the Join Model to an “Inner Join”.

Within this joined table, we want to include the customer ID in one column and the customer’s full name in a separate column. Since our data has first name and last name as two separate columns, we will need to combine them, creating what is called a new “expression”. To do this, drag and drop both the “first_name” and “last_name” columns onto the same line within the table. We will complete the expression in a bit.

Similarly, we want the Quantity column from the transaction data on its own line, but we also want to use it to complete a mathematical expression. By dragging and dropping Unit Price and Quantity onto the same line within the new table, we can do just that.

You can now take advantage of the “Expression Builder”, which gives you even more control of your data. It offers a list of defined pre-coded functions that you can apply directly to this expression—I highly recommend that you look through the Expression Builder to see what it can offer. And even better, if you know the Java code for your action, you can enter it manually. In this first case, we want to concatenate the first and last names. After adding the correct syntax within the expression builder, click Ok. 

You will want to use the Expression Builder again for your grouped transaction expression. With the Unit Price and Quantity expression, complete an arithmetic action to get the total transaction value by multiplying the Unit Price by the Quantity. Then, click Ok.

Remember, we set our Join Model to an Inner Join. However, Talend offers a nice way to capture just those customers who didn’t have transactions. To capture these “rejects” from an Inner Join, first drag and drop ALL the fields from the customers table to the Cust_NoTransactions output table. Then, select the tool icon at the top right of this table definition and switch “Catch lookup inner join reject” to “true”.

With the fields properly mapped, it is time to move on and review the data below. Rename the first_name field to be simply “name” (since it now includes the last name) and rename the Unit Price column to “transaction cost” (since it now has the mathematical expression applied). Then, ensure no further adjustments are necessary to the table’s column types to avoid any mismatched type conflicts through the flow. 

With this tMap properly configured, click Ok. And then click “Yes” to propagate the changes.

Next, you will need to configure the Aggregate component. To do this, enter the Component Tab (below the Design Workspace) and edit the schema.

To properly configure the output schema of the tAggregateRow component, first choose the columns on the left that will be grouped. In this case we want to group by ID and Name, so select “id” and “name” and then click the yellow arrow button pointing to the right. Next, we want to create two new output columns to store our aggregated values. By clicking the green “+” button below the “Aggregate Sales (Output)” section, you can add the desired number of output columns. First, create a new output column for the total quantity (“total_qty”) and identify it as an Integer type. Then create another for the total sales (“total_sales”) and set it as a Double type. Next, click Ok, making sure to propagate the changes.

With the output schema configured properly within the tAggregateRow component, we can now configure the Group By and Operations sections of the tAggregateRow component. To add your two Group By output columns and two Operations output columns, go back to the Component Tabs. Click the green plus sign below the Group By section twice and the Operations section twice to account for the output columns configured in the tAggregateRow schema. Then, in the Operations section, set the “total_qty” column’s function as “sum” and identify the input column position as “qty”. This configures the tAggregateRow component to add all the quantities for the grouped customer IDs and output the total value in the “total_qty” column. Likewise, set the “total_sales” function as “sum” and its input column position as “transaction_cost”.

Next, head to the sorting component and configure it to sort by total sales to help us identify who our highest paying customers are. To do this, click on the green “+” sign in the Component Tab, select “total_sales” in the Schema Column, and select “num” to ensure that your data is sorted numerically. Last, choose “desc” so your data will be shown to you in descending order.

Now, configure your final tMap component, by matching the customer name, total quantity and total sales. Then click Ok and click Yes to propagate the changes.

Finally, make sure your tLogRow component is set to present your data in table format, making it easier for you to read the inner join reject data.

Running Your Job

At last, you are ready to run your job!

Thanks to the tLogRow component, within the log, you can see the six customers that were NOT matched with transaction data.

If you head to Snowflake, you can view your “sales_report” worksheet and review the top customers in order of highest quantity and sales.

And that’s how to create a job that joins different sources, captures rejects, and presents the data the way you want it. In our next blog, we will be going through running and debugging your jobs. As always, please comment and let us know if there are any other basic skills you would like us to cover in a tutorial.


Talend and Red Hat OpenShift Integration: A Primer


One of the aspects I am always fascinated about Talend is its ability to run programs according to multiple job execution methodologies. Today I wanted to write an overview of a new way of executing data integration jobs using Talend and Red Hat OpenShift Platform.

First and foremost, let us do a quick recap of the standard ways of running Talend jobs. Users usually run Talend jobs using Talend schedulers, which can be either in the cloud or on-premises. Other methods include creating standalone jobs, building web services from Talend jobs, building an OSGi bundle for ESB and, the latest entry to this list from Talend 7.1 onwards, building the job as a Docker image. For this blog, we are going to focus on the Docker route and show you how Talend Data Integration jobs can be used with the Red Hat OpenShift Platform.

I would also highly recommend reading two other interesting Talend blogs related to the interaction between Talend and Docker, which are:

  1. Going Serverless with Talend through CI/CD and Containers by Thibaut Gourdel
  2. Overview: Talend Server Applications with Docker by Michaël Gainhao 

Before going into other details, let’s cover the basics of containers, Docker and the Red Hat OpenShift Platform. For those who are already proficient in container technology, I would recommend skipping ahead to the next section of the blog.

Containers, Docker, Kubernetes and Red Hat OpenShift

What is a container? A container is a standardized unit of software which is quite lightweight and can be executed without environment-related constraints. Docker is the most popular container platform, and it has helped the information technology industry on two major fronts: reducing infrastructure and maintenance costs, and reducing the turnaround time to bring applications to market.

The diagram above shows how the various layers of the Docker container platform and Talend jobs are stacked in application containers. The Docker platform interacts with the underlying infrastructure and host operating system, and it helps the application containers run in a seamless manner without knowing the complexities of the underlying layers.

Kubernetes

Next, let us quickly talk about Kubernetes and how it has helped in the growth of container technology. When we are building more and more containers, we need an orchestrator that can control the management, automatic deployment and scaling of the containers, and Kubernetes is the software platform which does this orchestration in a magical way.

Kubernetes helps to coordinate a cluster of computers as a single unit, and we can deploy containerized applications on top of the cluster. It consists of Pods, which act as logical hosts for the containers, and these Pods run on top of worker machines in Kubernetes called Nodes. There are a lot of other concepts in Kubernetes, but let us limit ourselves to the context of this blog, since Talend job containers are executed on top of these Pods.

Red Hat OpenShift

OpenShift is the open source container application platform from Red Hat which is built on top of Docker containers and the Kubernetes container cluster manager. I am republishing the official OpenShift block diagram from the Red Hat website for your quick reference.

OpenShift comes in a few flavors apart from the free (Red Hat OpenShift Online Starter) version.

  1. Red Hat OpenShift Online Pro
  2. Red Hat OpenShift Dedicated
  3. Red Hat OpenShift Container Platform

OpenShift Online Pro and Dedicated run on top of Red Hat hosted infrastructure, while OpenShift Container Platform can be set up on top of a customer’s own infrastructure.

Now let’s move to our familiar territory where we are planning to convert the Talend job to Docker container.

Talend Job Conversion to Container and Image Registry Storage

For customers who are using older versions of Talend, we will first create a Docker image from a sample Talend job. If you are already using Talend 7.1, you have the capability to export Talend jobs to Docker as mentioned in the introduction section, so you can safely move to the next section where the Docker image is already available, and we will meet you there. For those still with me, let us quickly build a Docker image for a sample job 😊.

I have created a simple job where I am generating random first and last names and then printing them on the console.  We are going to build a standalone job zip file from the Talend job and the zip will be placed in the target directory of the server, where Docker is available.

The next step will be to create a Dockerfile which stores the instructions to perform while building a Docker image from the Talend standalone zip file. The steps in the Dockerfile are as shown below.

FROM anapsix/alpine-java:8u121b13_jdk

ARG talend_job
ARG talend_version

LABEL maintainer="nthampi@talend.com" \
    talend.job=${talend_job} \
    talend.version=${talend_version}

ENV TALEND_JOB ${talend_job}
ENV TALEND_VERSION ${talend_version}
ENV ARGS ""

WORKDIR /opt/talend

COPY ${TALEND_JOB}_${talend_version}.zip .

RUN unzip ${TALEND_JOB}_${TALEND_VERSION}.zip && \
    rm -rf ${TALEND_JOB}_${TALEND_VERSION}.zip && \
    chmod +x ${TALEND_JOB}/${TALEND_JOB}_run.sh

CMD ["/bin/sh","-c","${TALEND_JOB}/${TALEND_JOB}_run.sh ${ARGS}"]

Looking at the commands in the Dockerfile, you can see that we start from a base Alpine Java image and add additional instructions on top of it in a layered format. The instructions unzip the file that contains the Talend job and execute the job’s run shell script. Now we have created the Dockerfile that will be used for the image build.

The command to build the Docker image for the Talend job is below.

docker build /home/centos/talend/ -f /home/centos/talend/dockerfile.txt -t nikhilthampi/helloworld:0.1 --build-arg talend_job=helloworld --build-arg talend_version=0.1

The docker images command will list the newly created image with its name and tag, in this case “nikhilthampi/helloworld” and “0.1” respectively.

If you are interested in moving the Docker image to a Docker repository, you can log in to Docker using the command below and push the image to Docker Hub.
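The original shows the exact command as a screenshot; a typical sequence, assuming the image tag built above and a Docker Hub account named nikhilthampi, would be:

# Authenticate against Docker Hub (prompts for username and password)
docker login

# Push the image built earlier to the Docker Hub repository
docker push nikhilthampi/helloworld:0.1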

The image will now be available in the Docker Hub repository as shown below.

Similarly, you can push the image to a Red Hat OpenShift image registry. The first step is to configure the OpenShift client on the server; follow the steps below to install it on CentOS.

wget https://mirror.openshift.com/pub/openshift-v3/clients/3.9.31/linux/oc.tar.gz
tar -xvf oc.tar.gz
cd /opt
mkdir oc
mv /home/centos/oc /opt/oc/oc
export PATH=$PATH:/opt/oc

The next step is to go to the OpenShift Console and get the login credentials from the site as shown below. You will be provided with login credentials in the form of a token.

Using the token, you will be able to log in to OpenShift, and the details of a successful login are shown below. I have already created a project called “docker” inside OpenShift, and OpenShift will start using this project.
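The login output appears as a screenshot in the original; the command the console provides follows this pattern (the server URL and token below are placeholders), after which we switch to the existing “docker” project:

# Log in to the OpenShift cluster with the token copied from the console
oc login https://api.openshift.example.com --token=<token-from-console>

# Switch to the project that will hold the image and the job
oc project docker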

We can now tag the image we have created and push it to the OpenShift image registry; the general pattern is shown below.

docker tag <docker image id> <OpenShift region registry>/<project>/<container image name>

docker push <OpenShift region registry>/<project>/<container image name>
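Filled in with the internal registry address that appears in the job definition later in this post, the “docker” project, and the helloworld image, the commands would look roughly like this (the image ID is a placeholder):

# Tag the local image for the OpenShift integrated registry
docker tag <image-id> docker-registry.default.svc:5000/docker/helloworld

# Push it into the OpenShift image registry
docker push docker-registry.default.svc:5000/docker/helloworld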

The screenshot below shows the sample output we get from OpenShift after executing the commands.

The container image can also be viewed through the OpenShift Console, where it will be available under the Image Streams section of the project.

Alright! We have completed the tasks involved in transferring the Talend job’s Docker image to the OpenShift image registry. Don’t you think it is easy? Instead of doing the container image migration manually, CI/CD can also be used to deploy to Docker registries. That is not in the scope of the current blog, and I would recommend going through Talend’s CI/CD blogs to automate the steps above.

Talend Job execution in OpenShift

Now, let’s get to Talend job execution in OpenShift. The first step to create a job in OpenShift is to configure the corresponding YAML file. Below is the sample YAML file which I have created for the “helloworld” job.

apiVersion: batch/v1
kind: Job
metadata:
  name: helloworld
spec:
  template:
    spec:
      activeDeadlineSeconds: 180
      containers:
      - name: helloworld
        image: docker-registry.default.svc:5000/docker/helloworld
      restartPolicy: Never
  backoffLimit: 4

Instead of a Pod, Route or Service, we have created a resource of kind Job, and we have also added the source image registry details to the YAML file. Once the YAML file is ready, the command below must be executed on the command line to create the job in OpenShift.
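The command itself appears as an image in the original; assuming the YAML above is saved as helloworld-job.yaml (a hypothetical filename), a typical invocation would be:

# Create the Job resource from the YAML definition
oc create -f helloworld-job.yaml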

Once the success message is returned by the command, we will be able to see the entry created under the Other Resources -> Job section of OpenShift.

If you go to the Pods section of OpenShift, you will be able to see that the Talend job has been executed successfully and the logs have been captured as shown below.
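As an optional command-line check (not shown in the original), the following oc commands list the pods created by the job and fetch its logs; the job name matches the YAML definition above.

# List the pods spawned by the job and inspect the job's log output
oc get pods
oc logs job/helloworld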

I hope your journey through this blog on executing a Talend job on Red Hat OpenShift was easy and interesting. There are a lot of other interesting blogs on various Talend subject areas, and I would highly recommend checking them out as well to deepen your knowledge of Talend and how it interacts with many interesting technologies in IT.

Till I come back with a new blog, enjoy your time using Talend! 

References

https://www.docker.com/resources/what-container

https://kubernetes.io/docs/tutorials/kubernetes-basics/

https://www.openshift.com/learn/what-is-openshift/

https://docs.openshift.com/container-platform/3.5/dev_guide/jobs.html#creating-a-job


What the Healthcare Industry Can Teach Companies About Their Data Strategy


The information revolution – which holds the promise of a supercharged economy through the use of advanced analytics, data management technologies, the cloud, and knowledge – is affecting every industry. Digital transformation requires major IT modernization and the ability to shorten the time from data to insights in order to make the right business decisions. For companies, it means being able to efficiently process and analyze data from a variety of sources at scale, all in the hope of streamlining operations, enhancing customer relationships, and providing new and improved products and services.

The healthcare and pharmaceutical industries are the perfect embodiment of what is at stake with the data revolution. Opportunities lie at all the steps of the health care value chain for those who succeed in their digital transformation:

  • Prevention: Predicting patients at risk for disease or readmission.
  • Diagnosis: Accurately diagnosing patient conditions, matching treatments with outcomes.
  • Treatment: Providing optimal and personalized health care through the meaningful use of health information.
  • Recovery and reimbursement: Reducing healthcare costs, fraud and avoidable healthcare system overuse. Providing support for reformed value-based incentives for cost effective patient care, effective use of Electronic Health Records (EHR), and other patient information.

Being able to unlock the relevance of healthcare data is the key to having a 360-view of the patient and, ultimately, delivering better care.

Data challenges in the age of connected care

But that’s easier said than done. The healthcare industry faces the same challenge as others: business insights are often missed due to the speed of change and the growing complexity of data users and needs. Healthcare organizations have to deal with massive amounts of data housed in a variety of data silos, such as information from insurance companies and patient records from multiple physicians and hospitals. To access this data and quickly analyze healthcare information, it is critical to break down the data silos.

Healthcare organizations are increasingly moving their data warehouse to a cloud-based solution and creating a single, unified platform for modern data integration and management across cloud and on-premise environments. Cloud-based integration solutions provide broad and robust connectivity, data quality, and governance tracking, simple pricing, data security and big data analysis capabilities.

Decision Resources Group (DRG) finds success in the cloud

Decision Resources Group (DRG) is a good example of the transformative power of the cloud for healthcare companies. DRG provides healthcare analytics, data and insight products and services to the world’s leading pharma, biotech and medical technology companies. To extend its competitive edge, DRG made the choice to build a cloud data warehouse to support the creation of its new Real-World Data Platform, a comprehensive claim and electronic health record repository that covers more than 90% of the US healthcare system. With this platform, DRG is tracking the patient journey, identifying influencers in healthcare decision making and segmenting data so that their customers have access to relevant timely data for decision making.

DRG determined that their IT infrastructure could not scale to handle the petabytes of data that needed to be processed and analyzed. They looked for solutions that contained a platform with a SQL engine that works with big data and could run on Amazon Web Services (AWS) in the cloud.

DRG selected data integration provider Talend and the Snowflake cloud data warehouse as the foundation of its new Real-World Data Platform. With integrations with Spark for advanced machine learning and Tableau for analysis, DRG gets scalable compute performance without complications, allowing its developers to build data integration workflows with little coding involved. DRG now has the infrastructure needed to accommodate and sustain massive growth in data assets and user groups over time and is able to perform big data analytics at the speed of the cloud. This is the real competitive edge.

The right partner for IT modernization

DRG is not the only healthcare company that chose to modernize its enterprise information architecture in the cloud. AstraZeneca, the world’s seventh-largest pharmaceutical company, chose to build a cloud data lake with Talend and AWS for its digital transformation. This architecture enables it to scale up and down based on business needs.

Healthcare and pharmaceutical companies are at the forefront of a major transformation happening across all industries, one that requires advanced analytics and big data technologies such as AI and machine learning to process and analyze data and turn it into insights. This digital transformation requires IT modernization: using hybrid or multi-cloud environments and providing a way to easily combine and analyze data from various sources and formats. Talend is the right partner for these healthcare companies, and for any other company going through digital transformation.

Additional Resources: 

Read more about the DRG case study: https://www.talend.com/customers/drg-decision-resources-group/

Read more about the AstraZeneca case study: https://www.talend.com/customers/astrazeneca/

Talend Cloud: https://www.talend.com/products/integration-cloud/

The post What the Healthcare Industry Can Teach Companies About Their Data Strategy appeared first on Talend Real-Time Open Source Data Integration Software.


Accelerate the Move to Cloud Analytics with Talend, Snowflake and Cognizant

$
0
0

In the last few years, we’ve seen the concept of the “Cloud Data Lake” gain more traction in the enterprise. When done right, a data lake can provide the agility needed for digital transformation around customer experience by enabling access to historical and real-time data for analytics.

However, while the data lake is now a widely accepted concept both on-premises and in the cloud, organizations still have trouble making them usable and filling them with clean, reliable data. In fact, Gartner has predicted that through 2018, 90% of deployed data lakes will be useless.  This is largely due to the diverse and complex combinations of data sources and data models that are popping up more than ever before.            

Migrating enterprise analytics from on-premises to the cloud requires significant effort before it delivers value. Cognizant has just accelerated your time to value with a new Data Lake Quickstart solution. In this blog, I want to show you how this Quickstart lets you run analytics migration projects to the cloud significantly faster and with lower risk, delivering in weeks instead of months.

Cognizant Data Lake Quickstart with Talend on Snowflake

First, let’s go into detail on what this Quickstart solution comprises. The Cognizant Data Lake Quickstart Solution includes:

  • A data lake reference architecture based on:
    • Snowflake, the data warehouse built for the cloud
    • Talend Cloud platform
    • Amazon S3 and Amazon RDS
  • Data migration from on-premises data warehouses (Teradata/Exadata/Netezza) to Snowflake using metadata migration
  • Pre-built jobs for data ingestion and processing (pushdown to Snowflake and EMR)

Data Lake Reference Architecture

How It Works

  • Uses Talend to extract data files (structured/semi-structured) from on-premises sources and ingest them into Amazon S3, using a metadata-based approach to store data quality rules and target layouts
  • Stores data on Amazon S3 as an enterprise data lake for processing
  • Leverages the Talend Snowflake data loader to move files from Amazon S3 to Snowflake (see the sketch after this list)
  • Runs Talend jobs that connect to Snowflake and process the data there
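To make the load step more concrete, here is a minimal Python sketch of moving staged S3 files into Snowflake with a COPY INTO statement. It is illustrative only: in the Quickstart this step is handled by Talend's Snowflake components, and the stage name, table name and credentials below are hypothetical.

```python
# Illustrative sketch only: bulk-load staged S3 files into Snowflake.
# Assumes an external stage "quickstart_stage" pointing at the S3 bucket
# and a target table "sales_raw" already exist (both names are hypothetical).
import snowflake.connector

conn = snowflake.connector.connect(
    user="ETL_USER",          # hypothetical credentials and account
    password="********",
    account="my_account",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # COPY INTO reads the staged files and loads them into the target table.
    cur.execute("""
        COPY INTO sales_raw
        FROM @quickstart_stage/daily/
        FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
    """)
    print(cur.fetchall())     # per-file load results returned by COPY INTO
finally:
    conn.close()
```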

With this Quickstart architecture, you can get your cloud data lake up and running quickly. Comment below and let me know what you think!

The post Accelerate the Move to Cloud Analytics with Talend, Snowflake and Cognizant appeared first on Talend Real-Time Open Source Data Integration Software.

Getting Started with Talend Open Studio: Run and Debug Your Jobs

$
0
0

In the previous blogs, we learned how to install Talend Open Studio, how to build a basic job loading data into Snowflake, and how to use a tMap component to build more complex jobs. In this blog, we will equip you with some helpful debugging techniques and provide additional resources that you can leverage as you continue to learn more about Talend Open Studio.

As with our past blogs, you are welcome to follow along in our on-demand webinar. This blog corresponds with the last video of the webinar.

In this tutorial, we will quickly address how to successfully debug your Talend Jobs, should you run into errors. Talend classifies errors into two main categories: Compile errors and Runtime errors. A Compile error prevents your Java code from compiling properly (this usually includes syntax errors or Java class errors). A Runtime error prevents your job from completing successfully, resulting in the job failing during execution.

In the previous blog, we designed a Talend job to generate a Sales report to get data into a Snowflake cloud data warehouse environment. For the purposes of this blog, we have altered that job, so when we try to run it, we will see both types of errors.  In this way, we can illustrate how to resolve both types of errors.

Resolving Compile Errors in Talend Open Studio

Let’s look at a compile error. When we execute this job in Talend Studio, it will first attempt to compile; however, the compilation will fail with the error below.

You can review the Java error details in the log, which states that the “quantity cannot be resolved or is not a field”. Conveniently, it also highlights the component the error is most closely associated with.

To locate the specific source of the problem within the tMap component, you can either dive into the tMap and search yourself, or you can switch to the Code view. Although you cannot directly edit the code here, you can select the red box highlighted to the right of the scroll bar to jump straight to the source of the issue.

In this case, the arithmetic operator is missing from the Unit Price and Quantity equation.

Next, head into the tMap component and correct the Unit Price and Quantity equation by adding a multiplication operator (*) between Transactions.Unit_Price and Transactions.qty. Click OK and run the job again.

And now you see the compile error has been resolved.

Resolving Runtime Errors in Talend Open Studio

Next, the job attempts to send the data out to Snowflake, and a runtime error occurs. Reading the log, it says, “JDBC driver not able to connect to Snowflake” and “Incorrect username or password was specified”.

To address this issue, we’ll head to the Snowflake component and review the credentials. It looks like the Snowflake password was incorrect, so re-enter the Snowflake password, and click run again to see if that resolved the issue.

And it did! This job has been successfully debugged and the customer data has been published to the Snowflake database.

Conclusion

This was the last of our planned blogs on getting started with Talend Open Studio, but there are other resources that you can access to improve your skills with Talend Open Studio. Here are some videos that we recommend you look at to strengthen and add to the skills that you have gained from these past four blogs:

Joining Two Data Sources with the tMap Component – This tutorial will give you some extra practice using tMap to join your data complete with downloadable dummy data and PDF instructions.

Adding Condition-Based Filters Using the tMap Component – tMap is an incredibly powerful and versatile component with many uses, and in this tutorial, you will learn how to use tMap and its expression builder to filter data based on certain criteria.

Using Context Variables – Learn how to use context variables, which allow you to run the same job in different environments.

For immediate questions, be sure to visit our Community, and feel free to let us know what types of tutorials would be helpful for you.  

The post Getting Started with Talend Open Studio: Run and Debug Your Jobs appeared first on Talend Real-Time Open Source Data Integration Software.

Hit the “Easy” Button with Talend & Databricks to Process Data at Scale in the Cloud

$
0
0

The challenge today is that 85% of on-premises Big Data projects fail to meet expectations and over two-thirds of Big Data potential is not being realized by organizations. Why is that, you ask? Well, simply put, on-premises “Big Data” programs are not that easy.

Your data engineers will need to be proficient in new programming languages and architectural models while your system admins will need to learn how to set up and manage a data lake. So you’re not really focusing on what you do best and instead paying top dollar for data engineers with the programming skills while spending (wasting) a lot of time configuring infrastructure – and not reaping the benefits of a big data program.

In short, making big data available at scale is hard and can be very expensive, and the complexity is killing big data projects.

Welcoming modern data engineering in the cloud

Data engineers ensure the data the organization is using is clean, reliable, and prepped for whichever use cases may present themselves. In spite of the challenges with on-premises “big data,” technologies like Apache Spark have become a best practice due to their ability to scale as jobs get larger and SLAs become more stringent.

But using Spark on-premises as we’ve highlighted is not that easy.  The market and technologies have come to an inflection point where it is agreed that what is needed is the ability to:

  1. Eliminate the complexity of system management to lower operations costs and increase agility
  2. Have automatic scale up/down of processing power, to grow and shrink as needed while only paying for what you use
  3. Enable a broader set of users to utilize these services without requiring a major upgrade in their education or hiring expensive external expertise

To simplify success with big data programs, market leaders have moved from an on-premises model to a cloud model.  Cloud based environments offer the ability to store massive volumes of data as well as all varieties (structured to unstructured). Now what is needed is the ability to process that data for consumption in BI tools, data science, or machine learning.

Databricks, founded by the original creators of Apache Spark, provides the Databricks Unified Analytics Platform. Databricks accelerates innovation by bringing data and ML together. This service solves many of the hard challenges discussed above by automatically handling software provisioning, upgrades, and management. Databricks also manages scaling up and down to ensure that you have the right amount of processing power, saving money by shutting down clusters when they are not needed. By taking this workload off the table for their customers, Databricks allows those customers to focus on the next level of analytics – machine learning and data science.

While Databricks solves two out of three of the big challenges posed, there is still the third challenge of making the technology more accessible to “regular” data engineers that do not have the programming expertise to support these massively parallel, cloud environments.  But that is where Talend comes in.  Talend provides a familiar data flow diagram design surface and will convert that diagram into an expertly programmed data processing job native to Databricks on Azure or AWS.

The combination of Databricks and Talend then provides a massively scalable environment that has a very low configuration overhead while having a highly productive and easy to learn/use development tool to design and deploy your data engineering jobs.  In essence, do more with less work, expense, and time.

For further explanation and a few examples, keep reading….

Example use case

Watch these videos and see for yourself how easy it is to run a Spark Serverless in the Cloud.

Movie recommendation use case with machine learning and Spark Serverless

 

 

Create and connect to a Databricks Spark Cluster with Talend

 

Click here to learn more about serverless and how to modernize your architecture.

Check out our GigaOM webinar with Databricks and Talend to learn how to accelerate your analytics and machine learning.

The post Hit the “Easy” Button with Talend & Databricks to Process Data at Scale in the Cloud appeared first on Talend Real-Time Open Source Data Integration Software.

Data Matching with Different Regional Data Sets

$
0
0

When it comes to Data Matching, there is no ‘one size fits all menu’. Different matching routines, different algorithms and different tuning parameters will all apply to different datasets. You generally can’t take one matching setup used to match data from one distinct data set and apply it to another. This proves especially true when matching datasets from different regions or countries. Let me explain.

Data Matching for Attributes that are Unlikely to Change

Data Matching is all about identifying unique attributes that a person, or object, has, and then using those attributes to match individual members within that set. These attributes should be things that are ‘unlikely to change’ over time. For a person, these would be things like “Name” and “Date of Birth”. Attributes like “Address” are much more likely to change and therefore of less importance, although this does not mean you should not use them. It’s just that they are less unique and therefore of less value, or lend less weight, to the matching process. In the case of objects, they would be attributes that uniquely identify that object; so in the case of, say, a cup (if you manufactured cups) those attributes would be things like “Size”, “Volume”, “Shape”, “Color”, etc. The attributes themselves are not too important; the point is that they should be ‘things’ that are unlikely to change over time.

So, back to data relating to people, which is generally the main use case for data matching. Here comes the challenge: can’t we use one set of data matching routines for a ‘person database’ and simply reuse the same routines for another dataset? Unfortunately, the answer is no. There are always going to be differences in the data that will manifest themselves during matching, and none more so than when using datasets from different geographical regions such as different countries. Data matching routines are always tuned for a specific dataset, and while there are always going to be differences from dataset to dataset, the difference becomes much more distinct when you choose data from different geographical regions. Let us explore this some more.

Data Matching for Regional Data Sets

First, I must mention a caveat. I am going to assume that matching is done in western character sets, using Romanized names, not in languages or character sets such as Japanese or Chinese. This does not mean the data must contain only English or western names, far from it, it just means the matching routines are those which we can use for names that we can write using western, or Romanized characters. I will not consider matching using non-western characters here. 

Now, let us consider the matching of names. To do this for the name itself, we use matching routines that do things like phoneticize the names and then look for differences between the results. But first, the methodology involves blocking on names: sorting the data into different piles that have similar attributes. It’s the age-old ‘matching the socks’ problem. You wouldn’t match socks in a great pile of fresh laundry by picking one sock at a time from the whole pile and then trying to find its duplicate. That would be very inefficient and take ages to complete. You instinctively know what to do: you sort them out first into similar piles, or ‘blocks’, of similar socks. Say, a pile of black socks, a pile of white socks, a pile of colored socks, and then you sort through those smaller piles looking for matches. It’s the same principle here. We sort the data into blocks of similar attributes, then match within those blocks. Ideally, these blocks should be of a manageable and similar size. Now, here comes the main point.

Different geographic regions will produce different distributions of block sizes and types, which changes the matching that needs to be done within those blocks; this manifests itself in the performance, efficiency, accuracy and overall quality of the matching. The distribution of names varies widely from region to region, and therefore from dataset to dataset, causing big differences in the results obtained.

Let’s look specifically at surnames for a moment. In the UK, according to the Office for National Statistics, there are around 270,000 surnames that cover around 95% of the population. Now obviously, some surnames are much more common than others. Surnames such as Jones, Brown and Patel, for example, are amongst the most common, but the important thing is that there is a distribution of these names that follows a specific graphical shape if we choose to plot them. There will be a big cluster of common names at one end, followed by a specific tailing-off of names to the other, and the shape of the curve would be specific to the UK and to the UK alone. Different countries or regions would have different shapes to their distributions. This is an important point. Some regions would have a much narrower distribution, where names are much more similar or common, whilst some regions would be broader, where names are much less common. The overall number of distinct names could be much larger or much smaller, and this would, therefore, affect the results of any matching we did within datasets emanating from those regions. A smaller distribution of names would result in bigger block sizes and therefore more data to match on within those blocks. This could take longer, be less efficient and could even affect the accuracy of those matches. A larger distribution of names would result in many more blocks of a smaller size, each of which would need to be processed.
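To make the block-size point concrete, here is a minimal Python sketch (with made-up records) that blocks on a crude key – the first letter of the surname – and counts the pairwise comparisons each block implies; the n*(n-1)/2 growth is why a few very large blocks cost far more than many small ones.

```python
# Illustrative only: block records on a crude key (first letter of surname)
# and count how many pairwise comparisons each block implies.
from collections import defaultdict

records = [
    {"surname": "Jones"}, {"surname": "Brown"}, {"surname": "Patel"},
    {"surname": "Jenkins"}, {"surname": "Park"}, {"surname": "Baker"},
]

blocks = defaultdict(list)
for rec in records:
    blocks[rec["surname"][0].upper()].append(rec)   # blocking key

for key, members in blocks.items():
    n = len(members)
    comparisons = n * (n - 1) // 2                  # pairs to compare within the block
    print(key, n, "records ->", comparisons, "comparisons")
```

As the examples in the next section show, a dataset where a handful of surnames dominate pushes most records into a few huge blocks, while a dataset with almost no shared surnames produces a vast number of near-singleton blocks.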

Data Matching Variances Across the Globe

Let’s take a look at how this varies across the globe. A good example of regional differences comes from Taiwan. Roughly forty percent of the population share just six different surnames (when using the Romanised form). Matching within datasets using names from Taiwanese data will, therefore, result in some very large blocks. Thailand, on the other hand, presents a completely different challenge. In Thailand, there are no common surnames. There is actually a law called the ‘Surname Act’ that states surnames cannot be duplicated and families should have unique surnames. In Thailand, it is incredibly rare for any two people to share the same name. In our blocking exercise, this would result in a huge number of very small blocks.

The two examples above may be extreme, but they perfectly illustrate the challenge. Datasets containing names vary from region to region, and therefore the blocking and matching strategy can vary widely from place to place. You cannot simply use the same routines and algorithms for different datasets; each dataset is unique and must be treated as such. Different matching strategies must be adopted for each set, and each matching exercise must be ‘tuned’ for that specific dataset in order to find the most effective strategy; the results will vary. It doesn’t matter which toolset you choose to use; the same principle applies to all, as the issue is in the data itself and cannot be changed or ignored.

To summarize, regional, geographic, cultural and language variations make a big difference to how you go about matching personal data. Each dataset must be treated differently: understand the data it contains, then tune and optimize your matching routines and strategy for that specific dataset. Blocking and matching strategies will vary, and you cannot simply reuse the exact same approach and routines from dataset to dataset or from region to region. Until next time!

The post Data Matching with Different Regional Data Sets appeared first on Talend Real-Time Open Source Data Integration Software.

Talend Cloud: A hybrid-friendly, secure Cloud Integration Platform

$
0
0

As enterprises move towards massively scaled interconnected software systems, they are embracing the cloud like never before. Very few would dispute the notion that the cloud has become one of the biggest drivers of change in the enterprise IT landscape and that the cloud has provided IT a powerful way to deploy services in a timely and cost-effective manner.

However, the tremendous benefits you tap into with the cloud have to be balanced against the need to adopt, configure and manage services in a secure manner. Digital transformation with an insecure cloud solution is impossible.

When we look at the integration world, it is often a fragmented place where applications and data may each exist on multiple platforms; IT teams are frequently supporting a plethora of on-premises and cloud apps while moving petabytes of sensitive data among those apps. Data leakage, corruption, or loss can be devastating to a business.

A cloud integration solution, such as an Integration Platform as a Service (iPaaS), with subpar security mechanisms may leave your sensitive data such as customer records, passwords, or personal information susceptible to breach and misuse. One way to avoid this is to choose an iPaaS that has robust security mechanisms such as data encryption, user and access protection, security certifications, and information security standards in place.

Talend Cloud is a unified cloud integration platform (iPaaS) that integrates data, people, and applications across all types of data sources – public cloud, on-premises, and SaaS applications – to deliver fast and reliable big data analytics. Besides natively inheriting AWS’s security practices, the platform is built from the ground up with all pieces of the infrastructure puzzle taken into account to best secure each layer: physical, network, operating system, database, and application. The entire stack is secure for each customer, partner, and workload. As part of this holistic security design, Talend Cloud has earned industry-leading security certifications to ensure that the platform meets exacting standards for a range of customers and industries.

Talend Cloud provides cloud platform security certifications such as:

  • SSAE 16
  • SOC 2 Type II certification
  • ISAE 3402 certification
  • Cloud Security Alliance Certification (Level 1)

It also addresses compliance requirements such as GDPR via AWS. Additionally, Talend Cloud provides security features and services such as data encryption, key management for tenant isolation and the remote engine, bi-annual penetration testing, SSO support for Okta, IAM management and many more.

To learn more about how Talend provides security in iPaaS, take a look at the complete list of powerful security features detailed in this Talend Cloud Security White Paper or drop us a line at  https://www.talend.com/contact/ for any questions.

 

 

The post Talend Cloud: A hybrid-friendly, secure Cloud Integration Platform appeared first on Talend Real-Time Open Source Data Integration Software.

How to Architect, Engineer and Manage Performance (Part 1)

$
0
0

This is the first of a series of blogs on how to architect, engineer and manage performance.  In it, I’d like to attempt to demystify performance by defining it clearly as well as describing methods and techniques to achieve performance requirements. I’ll also cover how to make sure the requirements are potentially achievable.

Why is Performance Important?

To start, let’s look at a few real-world scenarios to illustrate why performance is critical in today’s business:

  • It is the January sales and you are in the queue at your favorite clothing store. There are ten people in front of you, all with 10-20 items to scan, and scanning is slow. Are you going to wait or give up? Are you happy? What would a 30% increase in scan rates bring to the business? Would it mean fewer tills and staff at quiet times? Money to save, more money to make and better customer satisfaction.
  • Let’s say you work at a large bank and you are struggling to meet the payment deadline for inter-bank transfers on peak days. If you fail to meet these, you get fined 1% of the value of all transfers waiting. You need to make the process faster, increase capacity, improve parallelism. Failure could lose you money and, worse, damage your reputation with other banks and customers.
  • You need a mortgage. XYZ Bank offers the best interest rate and can give you a decision within a month. ABC Bank costs 0.1% more but guarantees a decision in 72 hours. The vendors of the house you want to buy need a decision within the week.
  • Traders in New York and London can make millions by simply being 1ms faster than others. This is one of the rare performance goals, where there is no limit to how fast, except the cost versus the return.

Why is performance important? Because it means happy customers, cost savings, avoiding lost sales opportunities, differentiating your services, protecting your reputation and much more.

The Outline for Better Performance Processes

Let’s start with performance testing. While this is part of the performance process in its own right, it is usually a case of “too little, too late” and the costs of late change are often severe.

So, let me outline a better process for achieving performance and then I’ll talk a little about each part now, and in more detail in the next few blogs:

  1. Someone should be ultimately responsible for performance – the buck stops with someone, not some team.
  2. This performance owner leads the process – they help others achieve the goals.
  3. Performance goals need to be measurable and realistic, and potentially achievable – they are the subject of discussions and agreement between parties.
  4. Goals will be broken down amongst the team delivering the project – performance is a team game.
  5. Performance will be achieved in stages using tools and techniques – there will be no miracles, magic or silver bullets.
  6. Performance must be monitored, measured and managed – this is how we deal with the problems which will occur.

Responsibility for Performance

Earlier, I stated that one person should be solely responsible for performance. Let me expand upon that. When you have more than one person in charge, you can no longer be sure how the responsibility is owned. Even with two people, there will be things you haven’t considered that do not fit neatly into one role or the other. If possible, do not combine roles like Application Architect and Performance Architect (Leader), as this may lead to poor compromises. In my opinion, it is far better for each person to fight for one thing and have the PM or Chief Architect judge the arguments, rather than one person trying to weigh the benefits on their own. Clearly, in small projects, this is not possible, and care is necessary to ensure anyone carrying multiple roles can balance them without bias. It is very easy to favour the role we know best or enjoy most.

I probably won’t return to this topic until towards the end of the series; it will be easier to understand after looking at the other parts in more detail.

The Performance Architect – Performance Expert, Mentor, Leader and Manager

In an ideal scenario, a Performance Architect’s role is to guide others through the performance improvement process, not to do it all themselves. No one is enough of an expert in all aspects of performance to do this. Performance Architects should:

  • Manage and orchestrate the process on behalf of the project.
  • Lead by taking responsibility for the division of a requirement(s) and alignment of the performance requirements across the project.
  • Provide expertise in performance, giving more specialised roles ways to solve the challenges of performance – estimation, measurement, tuning, etc.

This needs a bit more explanation in a future blog in this series, but I will cover some of the other parts first.

Setting Measurable Goals – Clear Goals that Reflect Reality

Setting measurable performance goals requires more work than you first think. Consider this example, “95% of transactions must complete in < 5 secs”. On the face of it this seems ok, but consider:

  • Are all transactions equal – logon, account update, new account opening?
  • What do the five seconds mean – elapsed time, processing time, what about the thinking time of the user?
  • What if the transaction fails?
  • Data variations – all customers named “Peabody” versus “Mohammed” or “Smith”.

That one requirement will need to be expanded and broken down into much more detail.  The whole process of getting the performance requirements “right” is a major task and the requirements will continue to be refined during the project as requirements change.

It is worth stating that trying to tune something to go as fast as possible is not the aim of performance work. It isn’t a measurable goal: you don’t know when you’ve achieved it, and it would be very expensive.

This area can be involved; it takes time, effort and practice, and it is the basis for the rest of the work, so it is a topic for more discussion in the next blog.

Breaking the Goals Down – Dividing the Cake

If we have five seconds to perform the login transaction, we need to divide that time between the components and systems that will deliver the transaction. The Performance Architect will do this with the teams involved, but I’ve found that they’ll get it wrong at the start. They don’t have all the facts yet, and that doesn’t matter, provided it is right at the finish. You are probably struggling to see where to start; don’t worry about that for a bit, the next section will help.

I’ll look at this in a future blog, but it is probably best discussed after looking at estimation techniques.

Achieving Performance – The Tools and Techniques

When most people drop a ball, they know it falls downwards due to gravity. Not many of us are physicists, and four-year-old children haven’t studied any physics, yet they don’t seem to struggle to use that experience (knowledge).

How long will an ice cube take to melt? An immediate reaction goes something like this: “I don’t know, it depends on the temperature of the water/air around the ice cube”. So make some assumptions, provide an estimate, then start asking questions and doing research to confirm the assumptions or produce a new estimate.

How long will the process X take?  How long did it take last time? What is different this time? What is similar? What does the internet suggest? Could we do an experiment? Could we build a model? Has it been done before?

Start with an estimate (a guess, an educated guess, a rule of thumb – use the best method or methods available) and then, as the project progresses, use other techniques to model, measure and refine.

You might be saying, “But I still don’t know”! That’s correct, and you probably never will at the outset.

Statistically, it is almost certain you won’t die by being struck by lightning (estimates found via Google range from 6,000 to 24,000 people per year on Earth passing away this way – no one knows the real figure; think about that, we don’t know the real answer). Most of us (or all, I hope) assume we won’t be sick tomorrow, but some people will. Nothing is certain; you don’t know the answer to nearly anything in the future with accuracy, but that doesn’t stop you making reasonable assumptions and dealing with the surprises, both good and bad.

This is a huge topic and I need to spend some time on it in this series to build your confidence in your skills by showing you just some of the options you have now and can learn and develop.

“Data Science” and “Statistics” are whole areas of academic study interested in prediction, so this is more than a topic. There are probably more ways to produce estimates than there are to solve the IT problem you are estimating.

Keeping on track – Monitoring, measuring and managing

Donald Rumsfeld made the point that there are things we know, things we know we don’t know, and things that surprise us (unknown unknowns). Actually, it is worse: people often think they know and are wrong, or assume they don’t know and then realise their estimate was better than the one the project used.

Risk management is how we deal with the whole performance process. Risk management, just like that used by the Project Manager, will help us manage the process. As we progress through the project, we build up our knowledge of and confidence in the estimates, reduce the risk, and use that information to focus our effort where the greatest risk is.

A Project Manager measures the likelihood of the risk occurring and the impact.  For performance, we measure the chance of us achieving the performance goal and how confident we are of the current estimate.

This will be easier to understand as we look at other parts in more depth; we’ll revisit it in a future blog.

In future blogs in the series I will cover:

  • Setting goals – Refining the performance requirements.
  • Tools and techniques – Where to start – Estimation, Research,
  • Monitoring, measuring and managing – risk and confidence.
  • Breaking the goals down across the team.
  • Tools and techniques – Next Steps – Volumetrics, Model and Statistics
  • Tools and techniques – Later Stages – Some testing (and monitoring) options.
  • Responsibility and Leadership

The Author

Chris first became interested in computers aged 9. His formal IT education was from ages 13 to 21, informally it has never stopped. He joined the British Computer Society (BCS) at 17 and is a Chartered Engineer, Chartered IT Professional and Fellow of the BCS. He is proud and honoured to have held several senior positions in the BCS including Trustee, Chair of Council and Chair of Membership Committee, and remains committed to IT professionalism and the development of IT professionals.

He worked for two world-leading companies before joining his third, Talend, as a Customer Success Architect. He has over 30 years of professional working experience with data and information systems. He is passionate about customer success, understanding requirements and the engineering of IT solutions.

Our Team

The Talend Customer Success Architecture (CSA) team is a growing worldwide team of 20+ highly experienced technical information professionals. It is part of Talend’s Customer Success organisation dedicated to the mission of making all Talend’s customers successful and encompasses Customer Success Management, Professional Services, Education and Enablement, Support and the CSA teams. 

Talend

Built for cloud data integration, Talend Cloud allows you to liberate your data from everything holding it back. Data integration is a critical step in digital transformation. Talend makes it easy.

The post How to Architect, Engineer and Manage Performance (Part 1) appeared first on Talend Real-Time Open Source Data Integration Software.

An Introduction to the Service Mesh

$
0
0

In the last few years, microservices, or microservice architecture, has become a popular approach in IT due to its benefits and the flexibility this architectural style brings.

Before we get into working with microservices and Talend, we should review the basics of microservices or a microservice architecture.

In a previous blog, Ravi Chebolu provided great insight into microservices:

“Microservices is often quoted as an architectural style for software development as a variant derived from the foundations of Service Oriented Architecture(SOA)”

To make it simple, we often compare microservices versus a monolith.

  • A monolith is an application that holds a group of operations together, like a frontend interface and the backend services which receives the data.
  • A microservices architecture will take the same operations but instead of creating one big application, it will decompose it into a collection of loosely coupled services.

For more information on microservices and Talend, I recommend looking at these articles:

A microservices architecture can bring a ton of benefits, including easier maintenance and upgrades, better fault isolation, easier continuous integration, better integration with containers, scalability and so much more. But if not well managed, it can also bring some complexity to day-to-day operations and management, such as request tracing and monitoring.

So what is a Service Mesh?

A service mesh is a network of microservices that makes service-to-service scenarios secure, performant and reliable. Istio is a service mesh backed by Google, which has added direct Istio access to its Kubernetes Engine.

Istio’s Core features include:

  • Traffic Management
  • Security
  • Observability
  • Platform support
  • Integration and customization

In terms of implementation, a service mesh can help by adding more features to assist in the deployment of microservices like:

  • A/B testing
  • Blue / Green deployment
  • Canary releases
  • Rate limit
  • Access control
  • End-to-end authentication.

Usually, a service mesh is implemented as an infrastructure layer that relies on proxies, as shown in the corresponding architecture diagram below.

Source: https://istio.io/docs/concepts/what-is-istio/#architecture

To understand more about service mesh with Talend, I invite you to register for the webinar we just put together that dives into the technical details. In it, I’ll show you how to use service mesh with Talend and Istio. Until next time!          

Join the Webinar! Talend Microservices: Service Mesh with Istio 

 

The post An Introduction to the Service Mesh appeared first on Talend Real-Time Open Source Data Integration Software.


7 Factors for a Successful Deployment

$
0
0

Deploying a successful technology solution, especially in data management, takes more than just installing software and writing a job (or multiple jobs… thousands in some cases), and running those jobs. If you’re taking on a new data management initiative, deploying using containers and serverless technology, migrating from traditional data sources to Hadoop, or from on-premises to the cloud, you may be sailing in unfamiliar waters.

The rate of change in technology, the explosion of data, and the requirements of business all demand that IT organizations continually innovate. Whenever you’re doing something transformational and cutting-edge, you’re taking a bit of a risk and you’re going to have to work with a limited skill set or expertise in the area to which you’re trying to migrate, learn quickly as you go, and iterate quickly to incorporate those learnings.

The Risk/Reward Profile

Companies have historically been rewarded for innovation. Being first to market with a new technology or innovation can provide great financial outcomes, but the side of the road is also littered with initiatives that were either too early with an innovation or didn’t adequately plan to mitigate risk.

At Talend, we’ve built a framework for adoption that includes key elements that both manage risk and set you up for success. It’s a framework through which we internally look at our customer deployments and one which we’re starting to share with customers directly so that we can have a transparent conversation about what’s going well, where we see potential challenges in the future, and how we can work together proactively to address those issues. We’ve created a framework called the ‘’Seven Factors to Successful Adoption.” These include:

  1. Stakeholder Engagement: Are your stakeholders effectively engaged? Do you have business users involved and invested in the successful outcome of the project and do they have something to gain or lose with the outcome of your deployment? Ideally that stakeholder is tied to an objective that drives business outcome. At our most successful customers, CIOs and CDOs have a stake in the outcome of the business problem we’re trying to solve, share their vision and strategy, and ensure that their project roadmap and key initiatives map to our technology roadmap – which we regularly share. Can you get access to your executive sponsors easily and are they willing to invest the time and effort to ensure that a project is set up for success because there’s a meaningful business outcome for them?
  2. Project and Business Value Alignment: Is this IT initiative tied to an outcome that has real business value and are the goals of this IT initiative tied to tangible business outcomes? For example, our customer Euronext didn’t simply embark upon a science experiment with real time streaming, Talend Data Catalog and our APIs because they thought the technology was interesting. They had real financial objectives associated with getting access to trusted information in a matter of seconds rather than weeks. Given the quicker access to data they could trust, they were able to make that data available to external customers to derive value from it for the first time. They now receive 20% of their revenue from data that they’re able to provide to external customers.
  3. Team Readiness: Is the project team set up for success? In many cases, it’s about more than just having smart developers on the team. In the case of containerization, serverless deployments or migrations to the cloud, the team on the project (like a large portion of the global IT population) may be performing their first cloud deployment. The more you can supplement your internal team with knowledge and skills from others who have real world experience (and yes, some battle scars) from deploying cloud technologies, the more value you can drive with greater speed and less risk. A successful project team will require more than just smart people, it will require assistance from others who have navigated the waters of cloud adoption. We regularly partner with our customers and SI partners to ensure that we’re providing guidance in new technologies, deploying and designing the Talend solution properly, and supporting a team collaboratively.
  4. Architecture: Do you have the right architecture? While this may again sound straightforward, cloud can be a significant paradigm shift, and an architecture that performed well in the legacy world may not provide expected performance in a new architectural paradigm. For example, managing database connection pools from an on-premises “always connected” application can differ significantly from how you manage connections from within a container that spins up and down as needed to take advantage of the cloud cost model. Also, if you’re moving data from on-premises to the cloud, or from cloud-to-cloud, there are best practices to follow that can make a big difference with respect to security or performance.
  5. Job Design: Has your job design been validated and have you built consistency around it so that it will perform well now and at scale? Do you have consistent error handling? Are you using context variables effectively? Our Customer Success Architecture and Professional Services teams have published best practices and spent time consulting with customers in order to ensure that their jobs are able to scale. In addition, it’s important that your team publish internal job design standards as you scale so that additional developers are creating jobs that follow those best practices.
  6. Growth Strategy: Does your team have a growth strategy beyond the initial project? Projects tend to be more successful when they are either part of a greater initiative or if there is a logical next step. For example, if simple ingestion is your “end game”, you may soon find that your users won’t trust the data without a good understanding of where it came from. They may require a catalog or information about the lineage of that data. You may also want to ensure that you’re deduplicating records and performing other data quality steps that provide needed governance. Additionally, the uptake of technology tends to have a tailwind if the project is seen as part of a greater initiative as opposed to a one-off. And if you aren’t thinking about cloud, containerization, and serverless technologies, we should probably talk.
  7. Ongoing Alignment and Checkpoints: We love to hear success stories from our customers. At Talend Connect this year, a number of our customers shared the business value they received working with Talend. We also like to hear from you how we can be doing a better job to help you meet your objectives. Are we aligning regularly? Do we know your major initiatives and milestones? Do we have regular calls or meetings between our team members and do we have a communication framework that allows us to check in with each other on those milestones?  A close relationship between partners provides an opportunity to surface observations that, if managed proactively, can ensure alignment between product roadmap/features and implementation strategy/requirements. At the speed the IT world is changing, and at the rate the cloud is exploding, we should be meeting and discussing the trends that we’re seeing in data management on a regular basis.

I hope you find this framework useful. We’re excited about the successes it has enabled so far as well as the risks and challenges that it has surfaced as we’ve shared it with customers. We always come away from these conversations with a better understanding of what we should do more of and what we should do differently.

To learn more about our best practices, or to connect with us for a conversation, please go to our Customer Success page or contact your CSM directly for more information on how we can help ensure your journey to the cloud is a success. I look forward to sharing more information on our framework in future blog posts.

 

 

The post 7 Factors for a Successful Deployment appeared first on Talend Real-Time Open Source Data Integration Software.

Talend Performance Tuning Strategy

$
0
0

As a Customer Success Architect with Talend, I spend a significant amount of my time helping customers with optimizing their data integration tasks – both on the Talend Data Integration Platform and the Big Data Platform. While most of the time the developers have a robust toolkit of solutions to address different performance tuning scenarios, a common pattern I notice is that there is no well-defined strategy for addressing the root causes of performance issues. Without a strategy, you may fix some of the immediate issues, but in the longer term the same performance issues resurface because the core issues in the original design were not addressed. That’s why I recommend that customers take a structured approach to performance tuning of their data integration tasks. One of the key benefits of having a strategy is that it is repeatable – irrespective of what your data integration tasks do, how simple or complicated they are, and the volume of data that is being moved as part of the integration.

Where is the bottleneck?

The first step in a performance tuning strategy is to identify the source of the bottleneck. There could be bottlenecks in various steps of your design. The goal is not to address all the bottlenecks at the same time but to tackle them one at a time. The strategy is to identify the biggest bottleneck first, find the root causes creating the bottleneck, find a solution and implement it. Once the solution has been implemented, we look for the next biggest bottleneck and address it. We keep iterating through all the bottlenecks until we have reached an optimal solution.

Here’s an example to help you understand. You have a Talend Data Integration Standard job that reads from an Oracle OLTP database, transforms it in tMap and loads it into a Netezza data warehouse.

If this task is not meeting your performance requirements, my recommendation would be to break down this task into three different parts:

  1. Read from Oracle
  2. Transform in Talend, and
  3. Write to Netezza

One or more of the tasks listed above may be causing a slowdown of your process. Our goal is to address them one at a time. A simple way of finding out what is causing the bottleneck is to create three test Talend jobs to replicate the functionality of the one Talend job. This would look something like this:

  1. Job 1 – Read from Oracle: this job would read from Oracle using tOracleInput and write to a file in the local file system of the Talend Job Server using tFileOutputDelimited. Run this job and capture the throughput (rows/second). If the throughput numbers do not look reasonable, the query from Oracle source is one of your bottlenecks.

  2. Job 2 – Transformation: Read the file created in Job 1 using tFileInputDelimited, apply your tMap transformations and write out to another file using tFileOutputDelimited in the same local file system. How do the throughput numbers look? Are they much faster, much slower, or the same compared to Job 1?

  3. Job 3 – Write to Netezza: Read the file created in Job 2 and load it into the Netezza database, then look at the throughput numbers. How do they compare to Job 1 and Job 2?

There are a couple of things you need to pay attention to when running these jobs:

  • First, these test jobs should be writing to and reading from a local file system – this is to make sure we eliminate any possible network latency.
  • The second thing – throughput (the rate at which data is read/transformed/written) – is a more accurate measure of performance than elapsed time. Our goal is to reduce elapsed time and we address that by increasing the throughput at each stage of the data integration pipeline.

Let’s assume that this was the outcome of running our tests:

Job      Description             Throughput
Job 1    Read from Oracle        20000 rows/sec
Job 2    tMap transformation     30000 rows/sec
Job 3    Write to Netezza        250 rows/sec

 

Based on the scenario above, we can easily point to Netezza being the bottleneck in our scenario since it has the lowest throughput*.

If the outcome was something like below, we can conclude that we have bottlenecks both in the read from Oracle and write to Netezza and we need to address both*.

Job      Description             Throughput
Job 1    Read from Oracle        500 rows/sec
Job 2    tMap transformation     30000 rows/sec
Job 3    Write to Netezza        250 rows/sec

 

* In my simple use case above, I assume that the row lengths do not change across the entire pipeline i.e. if we read 10 columns from Oracle, the same 10 columns are passed through the Transform and Write jobs. However, in real life scenarios, we do add or drop columns as part of the pipeline and we need to pick alternate measures of throughput like MBs/sec.
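As a minimal sketch of the bookkeeping behind this approach – assuming you have captured row counts and elapsed times for the three test jobs, with numbers roughly in line with the first scenario above – the bottleneck is simply the stage with the lowest throughput:

```python
# Illustrative only: compute throughput per stage and flag the bottleneck.
# Row counts and elapsed times are assumed to come from the three test jobs.
stages = {
    "Read from Oracle":    {"rows": 1_000_000, "seconds": 50.0},
    "tMap transformation": {"rows": 1_000_000, "seconds": 33.3},
    "Write to Netezza":    {"rows": 1_000_000, "seconds": 4000.0},
}

for name, s in stages.items():
    s["rows_per_sec"] = s["rows"] / s["seconds"]
    print(f"{name}: {s['rows_per_sec']:,.0f} rows/sec")

bottleneck = min(stages, key=lambda n: stages[n]["rows_per_sec"])
print("Biggest bottleneck:", bottleneck)   # tune this stage first, then re-measure
```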

Let’s eliminate those bottlenecks

In the previous section, I talked about identifying “where” the bottleneck is. In this section, we will provide you a summary of “how” we can eliminate the different types of bottlenecks.

Source Bottlenecks

  • If your source is a relational database, you can work with your database administrators to make sure that the query is optimized and executing based on the best query plan. They can also provide optimizer hints to improve the throughput of the query. They should also be able to add new indexes for queries that have a GROUP BY or ORDER BY clause.
  • For Oracle and some other databases, Talend allows you to configure the Cursor Size in the t<DB>Input component. The Cursor Size defines the fetch size of the result set. Once the result set is retrieved from the database, it is stored in memory for faster processing. The ideal size for this is defined by your dataset and requirements. You can also work with the database administrators to increase the network packet size which allows for larger packets of data to be transported over the network at one time.
  • For very large reads, create parallel read partitions as multiple subjobs using multiple t<DB>Input components with non-overlapping where clauses. Pick columns that are indexed for the where clauses – this will enable an equal distribution of data across the multiple reads. Each of these subjobs can run in parallel by enabling “Multi thread execution” in the job properties (see the sketch after this list).
  • For file sources stored on network shared storage, please make sure that there is no network latency between the server on which the Talend Job Server is running and the file system in which the files are hosted. The file system should ideally be dedicated to storing and managing files for your data integration tasks. In one of my assignments, the file system where the source files were stored was shared with mail server backups – so when the nightly email backups ran, our reads from the file system would significantly slow down. Work with your storage architect to eliminate all such bottlenecks.
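As an analogy only (this is not Talend-generated code), the partitioned-read idea from the list above can be sketched in plain Python; the read_partition helper and the ID ranges on an indexed column are hypothetical, and in Talend the same effect comes from multiple t<DB>Input subjobs with “Multi thread execution” enabled.

```python
# Illustrative analogy only: parallel reads over non-overlapping ranges
# of an indexed column, similar in spirit to multiple t<DB>Input subjobs.
from concurrent.futures import ThreadPoolExecutor

RANGES = [(1, 250_000), (250_001, 500_000), (500_001, 750_000), (750_001, 1_000_000)]

def read_partition(low, high):
    # Hypothetical helper: would run a query such as
    #   SELECT ... FROM orders WHERE order_id BETWEEN :low AND :high
    # against the source database and return the rows.
    return []

with ThreadPoolExecutor(max_workers=len(RANGES)) as pool:
    # Each range is read concurrently; the ranges never overlap, so the
    # workload is split evenly without duplicating any rows.
    results = list(pool.map(lambda r: read_partition(*r), RANGES))
```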

Target Bottlenecks

  • Most modern relational databases support bulk loading. With bulk loaders, Talend bypasses the database log and thus improves performance. For some of the databases, we also provide the option to use named pipes with external loaders. This eliminates the need to write the intermediate files to disk.
  • Sometimes dropping indexes and key constraints before load helps with the performance. You can recreate the indexes and constraints after the load has successfully completed
  • For updates, having database indexes on the same columns as the ones that are defined as Keys in the t<DB>Output component will improve performance
  • For file targets on a network shared storage, follow the guidelines from above for source files stored on network shared storage

Transformation Bottlenecks

  • Reduce the volume of data being processed by Talend by eliminating the unnecessary rows and columns early in the pipeline. You can do this by using tFilterRows and tFilterColumns components
  • For some memory intensive components like tMap and tSortRow, Talend provides the option to store intermediate results on disk. A fast disk that’s local to the Job Server is recommended. This reduces the requirement for adding more memory as data volumes grow.
  • Sometimes transformation bottlenecks happen because of a large monolithic job that tries to do many things at once. Break down such large jobs into smaller jobs that are more efficient for data processing.

There are some additional optimization techniques to address bottlenecks at a job level (like parallelization, ELT, memory optimization, etc.) that are not discussed as part of this blog, but you can find information on them and other techniques in Talend Job Design Patterns and Best Practices – Part 1, Part 2, Part 3 and Part 4.

Conclusion

The key element to successfully tune your jobs for optimal performance is to identify and eliminate bottlenecks. The first step in the performance tuning is to identify the source of bottlenecks. And yes, it does involve creating additional test jobs. But don’t be discouraged that you have to put in additional effort and time to build these out. They are well worth the effort based on my experience doing this for 20+ years. A strategic and repeatable approach to performance and tuning is a lot more efficient than a tactical trial and error method. You can also incorporate the lessons learnt into your process and improve it over time. I hope this article gets you started in your performance tuning journey and wish you the best.

The post Talend Performance Tuning Strategy appeared first on Talend Real-Time Open Source Data Integration Software.

An Introduction to Apache Airflow and Talend: Orchestrate your Containerized Data Integration and Big Data Jobs


Introduction

In my last blog I described how to achieve continuous integration, delivery and deployment of Talend Jobs into Docker containers with Maven and Jenkins. This is a good start for reliably building your containerized jobs, but the journey doesn’t end there. The next step to go further with containerized jobs is scheduling, orchestrating and monitoring them. While there are plenty of solutions you can take advantage of, I want to introduce an effective way to address this need for containerized Talend jobs in this blog.

What Challenges are we Addressing?

When it comes to data integration or even big data processing, you need to go beyond simple task scheduling. You may want to run several jobs sequentially or in parallel and monitor them in an efficient manner. This is what we call workflow management. Even though this is not a new topic in the industry, we now need to consider that we are dealing with containers and their ecosystem.

Speaking of the container ecosystem, the rising adoption of Kubernetes shows us that more and more companies are betting on containers (as we do at Talend). As a result, our main challenge is to be able to provide a solution that can adapt to this fast-growing environment but also respond to our existing needs.

Why Apache Airflow?

According to the Apache Airflow website:

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows.

In short, Apache Airflow is an open-source workflow management system. It allows you to design workflow pipelines as code, using Python – a very popular scripting language with an extensive set of available libraries. Apache Airflow brings native management of dependencies, failures, and retries. Everything is easily configurable, and Airflow provides a great graphical interface to monitor your workflows. Like most of its competitors (such as Luigi or Pinball), it offers scalability and resilience for your workflows.

To sum up, Apache Airflow meets all the expectations we would have of a workflow management system. A key reason to choose Apache Airflow, however, is the community behind it. A number of service integrations have already been developed, which considerably extends Airflow’s capabilities, and its future is promising, with a strong roadmap and a great number of contributors.
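
To make the “pipelines as code” idea concrete before we move on, here is a minimal DAG sketch written against the Airflow 1.x-style API; the task names and bash commands are placeholders and are not part of the workflow built later in this article.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-team",
    "retries": 2,                       # automatic retries on failure
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_workflow",
    default_args=default_args,
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",         # run once a day
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

extract >> transform >> load            # dependencies expressed in code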

How to Use Apache Airflow with Containerized Talend Jobs

Before we begin, please be aware of the following requirements needed to follow our example:

The goal in this article is to be able to orchestrate containerized Talend Jobs with Apache Airflow.

  1. First, we are going to build 3 jobs as Docker container images.
  2. These jobs will run on a Databricks cluster.
  3. Then we will create an Apache Airflow workflow that creates a Databricks cluster, runs our jobs, and eventually terminates the cluster.

The workflow is represented in the following schema:

1)    Talend 7.1 and Docker Containers

First, let’s drill into the new Talend 7.1 release. You now have two different ways to package your Talend jobs into Docker containers. First, you can use the Talend CI/CD capabilities introduced in Talend 7.0.

The new 7.1 release improves and simplifies the configuration for building your container images; please read the documentation to see the updates. Talend has also introduced a new feature that lets you build your jobs as Docker images directly from Talend Studio. Let me show you how easy it is:

Right-click on any Standard or Big Data job:

It will build your job into a Docker image using your Docker daemon, either locally or on a remote Docker host.

You can also publish it directly to a Docker registry:

Whether you are using the CI/CD Builder or the Studio to build Docker images you are now all set up to orchestrate your containerized jobs.

2)    Serverless Big Data with Databricks

The new Talend 7.1.1 release introduces Databricks support. Basically, Databricks is a managed service for Spark available on AWS or Azure. You can start a Spark cluster in a matter of minutes, and your cluster can automatically scale depending on the workload, making it easier than ever to set up a Spark cluster.

First, let’s orchestrate a workflow around Databricks.

From the Databricks website:

Databricks unifies data science and engineering across the Machine Learning lifecycle from data preparation, to experimentation and deployment of ML applications.

Databricks targets different types of users on its platform, such as data engineers, machine learning engineers, data scientists, and data analysts. While data engineers deploy their Spark jobs and build ML models, data scientists analyze and interpret, through notebooks, the data made available to them. In this context, it becomes even more important to develop ETL and big data jobs and to be able to orchestrate, schedule, and monitor them in order to speed up the model creation and data availability iterations.

In light of this, using Talend to operationalize and Apache Airflow to orchestrate and schedule becomes an efficient way to address this use case.

To illustrate my point, I chose the following workflow example:

  • Create a Databricks Cluster
  • Copy files from AWS S3 to Databricks DBFS
  • Run two Databricks Jobs packaged in containers (train a model and test this model)
  • Stop the Databricks cluster once the jobs are done

I have 3 different jobs. The first one is a standard data integration job that copies a file from AWS S3 to Databricks DBFS (Databricks file system):

The second one trains a machine learning model using a decision tree.

As you can see we have defined three context variables. These variables will be my only setup needed to run a job on a Databricks cluster.

In my case:

DATABRICKS_ENDPOINT=https://westeurope.azuredatabricks.net/?o=3888544531850695#
DATABRICKS_CLUSTER_ID=1114-095830-pop434
DATABRICKS_TOKEN=you can create one under user settings.

My last job is the testing of the previously trained model. It returns a confusion matrix based on the testing dataset.

Finally, you need to specify these three parameters in your job’s Spark configuration, and you are all set. I won’t go into the details of the jobs because the purpose of this article is to show how to orchestrate containers, but they are simple big data processing jobs that train a machine learning model on one dataset and then test the model on a different dataset. If you are interested in this use case, you can find a detailed description here, as these jobs are available in our Talend Big Data & Machine Learning Sandbox.

As an example, once built either through Continuous Integration/Continuous Delivery (CI/CD) or from Talend Studio, you can then run your job as follows:

docker run trainmodelriskassessment:latest \
--context_param DATABRICKS_ENDPOINT="https://westeurope.azuredatabricks.net/?o={IDENTIFIER}#" \
--context_param DATABRICKS_CLUSTER_ID="1114-095830-pop434" \
--context_param DATABRICKS_TOKEN="{TOKEN}"

At this point, we have three containerized Talend jobs published in a Docker registry, allowing us to pull and run them anywhere.

3)    Apache Airflow

Let’s get started with Apache Airflow. If you have never tried Apache Airflow, I suggest you run this Docker Compose file. It will run Apache Airflow alongside its scheduler and Celery executors. If you want more details on the Apache Airflow architecture, please read its documentation or this great blog post.

Once it is running, you should have access to this:

As you can see, I have created one DAG (Directed Acyclic Graph) called databricks_workflow. Please read the description of Apache Airflow concepts if needed. In short, the DAG is where you define the tasks, their execution flow, and the scheduling settings. You can find the Python file defining my workflow here.

In the Python file above you can see we have several tasks:

  • create_cluster + create_cluster_notify
  • s3_list_files
  • run_job (for each file listed in the previous task) + run_job_notify
  • train_model + train_model_notify
  • test_model + test_model_notify
  • terminate_cluster + terminate_cluster_notify

Each of these tasks requires an operator. We have added a notification to each task using the Slack Operator, which allows us to send Slack messages to follow the progress of the workflow.
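
As a rough sketch, one of these notification tasks built with Airflow’s Slack operator might look like the following; module paths vary between Airflow versions, and the token, channel, and the DummyOperator standing in for the real upstream task are all placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.slack_operator import SlackAPIPostOperator

dag = DAG("databricks_workflow_demo", start_date=datetime(2018, 11, 1),
          schedule_interval=None)

# Stand-in for the real create_cluster task defined in the workflow file
create_cluster = DummyOperator(task_id="create_cluster", dag=dag)

create_cluster_notify = SlackAPIPostOperator(
    task_id="create_cluster_notify",
    token="{SLACK_API_TOKEN}",      # placeholder - keep real tokens out of the DAG file
    channel="#data-pipelines",      # hypothetical channel
    username="airflow",
    text="Databricks cluster created, starting the Talend jobs.",
    dag=dag,
)

# Notify only once the cluster creation task has succeeded
create_cluster >> create_cluster_notify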

Example of notifications sent by Apache Airflow to Slack

We use the Python Operator for the create_cluster and terminate_cluster tasks; it simply calls a Python function you can see in the file. The train_model and test_model tasks use the ECS Operator, which allows us to easily run a Docker container in an ECS cluster.
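
As an illustration of that pattern, here is a minimal sketch of what those two Python callables could look like against the Databricks REST API 2.0; the endpoint, token, and cluster settings are placeholders, and this is not the exact code from the workflow file.

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

DATABRICKS_ENDPOINT = "https://westeurope.azuredatabricks.net"   # placeholder
DATABRICKS_TOKEN = "{TOKEN}"                                     # placeholder
HEADERS = {"Authorization": "Bearer " + DATABRICKS_TOKEN}

dag = DAG("databricks_cluster_demo", start_date=datetime(2018, 11, 1),
          schedule_interval=None)

def create_cluster(**context):
    resp = requests.post(
        DATABRICKS_ENDPOINT + "/api/2.0/clusters/create",
        headers=HEADERS,
        json={"cluster_name": "airflow-talend",
              "spark_version": "4.3.x-scala2.11",   # hypothetical runtime version
              "node_type_id": "Standard_DS3_v2",
              "num_workers": 2})
    resp.raise_for_status()
    # share the new cluster id with downstream tasks through XCom
    context["ti"].xcom_push(key="cluster_id", value=resp.json()["cluster_id"])

def terminate_cluster(**context):
    cluster_id = context["ti"].xcom_pull(key="cluster_id", task_ids="create_cluster")
    requests.post(DATABRICKS_ENDPOINT + "/api/2.0/clusters/delete",
                  headers=HEADERS, json={"cluster_id": cluster_id}).raise_for_status()

create_cluster_task = PythonOperator(task_id="create_cluster",
                                     python_callable=create_cluster,
                                     provide_context=True, dag=dag)
terminate_cluster_task = PythonOperator(task_id="terminate_cluster",
                                        python_callable=terminate_cluster,
                                        provide_context=True, dag=dag)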

In our case, we use the containerized Databricks jobs we built earlier, and we specify the 3 parameters to target our newly created Databricks cluster. The graph associated with this workflow is rendered as follows:

As you can see, we use the S3 List Operator to list all the files in an S3 bucket and then use our containerized job to copy each of these files into Databricks DBFS in parallel. Apache Airflow dynamically passes the parameters to the container at runtime.

Then we run our other containerized jobs to train and test the machine learning model. In fact, only the training_dataset.csv and testing_dataset.csv files are used, but to clearly show the loop that Apache Airflow performs, I have added two files to the bucket that are not used.
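
To show what the “one copy task per file” loop can look like, here is a simplified sketch that lists the bucket with boto3 at DAG definition time (rather than through the S3 List Operator’s output) and creates one ECS Operator task per file; the bucket name, ECS names, and the extra S3_KEY context parameter are all hypothetical.

from datetime import datetime

import boto3
from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator
from airflow.operators.dummy_operator import DummyOperator

BUCKET = "my-training-data"                    # placeholder bucket name

dag = DAG("databricks_copy_demo", start_date=datetime(2018, 11, 1),
          schedule_interval=None)

# Stand-ins for the real cluster creation and training tasks
create_cluster = DummyOperator(task_id="create_cluster", dag=dag)
train_model = DummyOperator(task_id="train_model", dag=dag)

s3 = boto3.client("s3")
keys = [obj["Key"] for obj in
        s3.list_objects_v2(Bucket=BUCKET).get("Contents", [])]

for key in keys:
    copy_task = ECSOperator(
        task_id="copy_" + key.replace("/", "_").replace(".", "_"),
        task_definition="copy-s3-to-dbfs",     # hypothetical ECS task definition
        cluster="airflow-ecs-cluster",         # hypothetical ECS cluster
        overrides={"containerOverrides": [{
            "name": "copy-s3-to-dbfs",
            "command": ["--context_param", "S3_KEY=" + key,
                        "--context_param", "DATABRICKS_ENDPOINT={ENDPOINT}",
                        "--context_param", "DATABRICKS_TOKEN={TOKEN}"],
        }]},
        dag=dag,
    )
    # each copy runs in parallel, between cluster creation and model training
    create_cluster >> copy_task >> train_model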

Conclusion

Thanks to Talend’s container capabilities combined with an orchestrator such as Apache Airflow, you can easily create complex and efficient workflows. Apache Airflow is still a young open source project, but it is growing very quickly as more and more DevOps engineers, data engineers, and ETL developers adopt it.

The above example shows how you can take advantage of Apache Airflow to automate the startup and termination of Databricks Spark clusters and run your containerized Talend jobs on them. It can potentially reduce your cloud processing costs and help you monitor your data pipelines more efficiently. Monitoring the success and performance of your workflows is also a core concern for ETL and data engineering teams. As you have seen, you can track all the tasks with email or Slack notifications, as well as collect logs depending on the services you use. This is very useful for tracking errors and being more agile with your processes.

The post An Introduction to Apache Airflow and Talend: Orchestrate your Containerized Data Integration and Big Data Jobs appeared first on Talend Real-Time Open Source Data Integration Software.

How Uniper is providing sustainable energy for everyone with a modern digital platform


The energy industry supplies electrical power to consumers from a variety of sources, including gas-based and hydroelectric plants, as well as nuclear and coal-based power plants. As temperatures fluctuate and economic and political events occur, along with changes in demography, preferences, and technology, shifting demand and supply interact to form prices in competitive energy markets.

Supply and demand need to be managed second by second, every day; otherwise blackouts may occur. In this complex world of ever-changing technologies and markets, Uniper, which ranks among Europe’s large energy generation and trading companies, didn’t have data readily available to make decisions. Once the idea of an organization-wide data strategy emerged, Uniper decided to go with a public cloud solution for reasons of scalability and cost, and concluded that Talend would be the best cloud integration platform for its cloud architecture.

Providing self-service data and analytics in real time

To create the Uniper Data Analytics Platform, Uniper selected Talend to integrate more than 120 internal and external data sources into a Snowflake central data lake in the Microsoft Azure Cloud. Data governance, provided by Talend Data Catalog, was also essential to the success of the data lake, which contains data from very disparate sources. By enabling integration, centralization, and standardization of Uniper’s most valuable data, Talend gives Uniper a single point of truth for decision support. Employees in selected departments now have access to data in self-service mode, enabling them to make the right decisions faster while ensuring high data quality and governance. Integration costs have been reduced by more than 80%, and the data lake project was profitable within 6 months.

Trading successfully and managing risk

Uniper’s team is a mix of business and IT staff who collaborate to address specific business use cases and solve real pain points for the company.

Now that the relevant information is aggregated in the data lake, questions from traders that previously required months of research can be answered by market analysis teams right away. Speed is critical, because the earlier trading teams can react, the earlier they can take a position, and that can make a difference of millions of euros. Once traders have taken a position, Uniper gets real-time information from sensors in the plants, monitors their status, receives early warnings if components need to be replaced, and ramps production up or down depending on demand. Post-trade, the Uniper data lake and Talend have also greatly reduced operational risk and helped the company easily meet regulatory challenges.

 

The post How Uniper is providing sustainable energy for everyone with a modern digital platform appeared first on Talend Real-Time Open Source Data Integration Software.

Data skills – Many hands make light work in a world awash with data


While the transformation to a data-driven culture needs to come from the top of the organization, data skills must permeate through all areas of the business.

Rather than being the responsibility of one person or department, assuring data availability and integrity must be a team sport in modern data-centric businesses. Everyone must be involved and made accountable throughout the process.  

The challenge for enterprises is to effectively enable greater data access among the workforce while maintaining oversight and quality.

The Evolution of the Data Team

Businesses are recognizing the value and opportunities that data creates. There is an understanding that data needs to be handled and processed efficiently. For some companies, this has led to the formation of a new department of data analysts and scientists.

The data team is led by a Chief Data Officer (CDO), a role that is set to become key to business success in the digital era, according to recent research from Gartner. While earlier iterations of roles within the data team centered on data governance, data quality and regulatory issues, the focus is shifting. Data analysts and scientists are now expected to contribute and deliver a data-driven culture across the company, while also driving business value. According to the Gartner survey, the skills required for roles within the data team have expanded to span data management, analytics, data science, ethics, and digital transformation.

Businesses are clearly recognizing the importance of the data team’s functions and are making significant investments in it. Office budgets for the data team increased by an impressive 23% between 2016 and 2017 according to Gartner. What’s more, some 15% of the CDOs that took part in the study revealed that their budgets were more than $20 million for their departments, compared with just 7% who said the same in 2016. The increasing popularity and evolution of these new data roles has largely been driven by GDPR in Europe and by new data protection regulations in the US. And the evidence suggests that the position will be essential for ensuring the successful transfer of data skills throughout businesses of all sizes.

The Data Skills Shortage

Data is an incredibly valuable resource, but businesses can only unlock its full potential if they have the talent to analyze that data and produce actionable insights that help them to better understand their customers’ needs. However, companies are already struggling to cope with the big data ecosystem due to a skills shortage and the problem shows little sign of improving. In fact, Europe could see a shortage of up to 500,000 IT professionals by 2020, according to the latest research from consultancy firm Empirica.

The rapidly evolving digital landscape is partly to blame, as the skills required have changed radically in recent years. The data science skills needed at today’s data-driven companies are more wide-ranging than ever before. The modern workforce is now required to have a firm grasp of computer science, including everything from databases to the cloud, according to strategic advisor and best-selling author Bernard Marr. In addition, analytical skills are essential to make sense of the ever-increasing data gathered by enterprises, while mathematical skills are also vital, as much of the captured data will be numerical, largely due to IoT and sensor data. These skills must sit alongside more traditional business and communication skills, as well as the ability to be creative and adapt to developing technologies.

The need for these skills is set to increase, with IBM predicting that the number of jobs for data professionals will rise by a massive 28% by 2020. The good news is that businesses are already recognizing the importance of digital skills in the workforce, with the role of Data Scientist taking the number one spot in Glassdoor’s Best Jobs in America for the past three years, with a staggering 4,524 positions available in 2018. 

Data Training Employees

Data quality management is a task that extends across all functional areas of a company. It therefore makes sense to give employees in the specialist departments self-service tools to ensure data quality. Cloud-based tools that can be rolled out quickly and easily in those departments are essential. This way, companies can gradually improve their data quality while also increasing the value of their data.

As the number of data workers triples and businesses work to stay compliant with GDPR, good data management must be treated as a team sport. Investing in the Chief Data Officer role and in data skills now will enable forward-thinking businesses to reap the rewards, both in the short term and further into the future.

The post Data skills – Many hands make light work in a world awash with data appeared first on Talend Real-Time Open Source Data Integration Software.
