
Best Practices for Using Context Variables with Talend – Part 4


Last night it occurred to me that everything in the last three parts of this blog series (Part 1 / Part 2 / Part 3) had been oriented towards the Talend on-premise solution. Many developers I have worked with over the years are moving to Talend Cloud, and I too have moved much of my code to the cloud. In fact, to build a lot of the collateral for this blog, I had to go back to my old on-premise environment to put it together. That got me thinking: “would these best practices work in Talend Cloud?”

What’s a Remote Engine?

Before I explain whether they do and, if so, how, I should point out that one of the benefits of Talend Cloud is that there is significantly less server admin work to carry out. If you run everything in the cloud, all you need to worry about is building your jobs and configuring them to run.

I know that when I was a full-time developer, I always thought too much of my time was taken up dealing with server administration. This not only frustrated me, but it swallowed up time that could have been better spent doing what I was good at. Having the freedom to just develop is a massive benefit, but having everything in a cloud that is managed by another entity can sometimes limit flexibility… or so you might think. With everything running in the cloud, getting access to set operating system environment variables or create local properties files on the servers is likely to be a challenge, if it is possible at all. Talend recognized this and overcame it with the Remote Engine.

The Remote Engine is what bridges the gap between an entirely cloud-hosted solution and an on-premise solution. You keep total control of your Remote Engine’s configuration while Talend handles the Management Console. You also have your data processed where you need it processed (none of the data goes from the Remote Engine to anywhere you don’t expressly send it), which means your current on-premise code can usually be migrated without the complications of pulling it away from the other on-premise tools it works alongside.

The reason I have focused on the Remote Engine is that it is what allows us to do pretty much exactly what was described in the previous blog posts, using Talend Cloud. When I tried it out, there were a few subtle changes I had to make, but I will go through these as I explain how I got it working.

However, first I feel I owe an apology to non-Windows users….

Environment Variables on Systems other than Windows

I started my investigations into achieving this with Talend Cloud by setting up on a Mac. I am a new convert to the world of Macs, and I haven’t yet worked through all of the situations on my Mac that I had experienced on Windows. Environment variables were an area I thought might be interesting. As it turned out, it went from interesting to downright silly. I spent a couple of hours trying to figure out why my variables, which worked in all of my terminals, would not be picked up by my Talend Studio.

It turns out that .profile, .bashrc, .bash_profile, etc. are all useless when you want a GUI app to pick up your variables. You need to use a plist file. I won’t go into detail about this here; I’ll just point you to this useful link. This approach solved my issue on my Mac. Once I had set up my environment variables this way, Talend Studio was able to see them and I could use this functionality just as I could on Windows.
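In case a concrete starting point helps, below is a minimal sketch of the kind of launchd plist the linked article describes. The label and paths are placeholders (I am reusing the FILEPATH and ENCRYPTIONKEY variables that appear later in this post); save something like this as a .plist file under ~/Library/LaunchAgents and load it with launchctl load, so GUI applications such as Talend Studio can see the variables when you log in.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.talend.setenv</string>
    <key>ProgramArguments</key>
    <array>
        <string>/bin/sh</string>
        <string>-c</string>
        <string>launchctl setenv FILEPATH /Users/richard/Documents/env.txt; launchctl setenv ENCRYPTIONKEY 12345678</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>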

However, there is a part two to this apology. I must also apologise to those of you who may have tried to configure an on-premise Talend Runtime on Linux as well. I’m kind of hoping that this is a very small number, but I suspect that someone may have (or will in the future, and will be pulling their hair out). The Talend Runtime is an Apache Karaf-based OSGi container that runs as a system service on Linux. As such, any environment variables set in .profile, .bashrc, .bash_profile, etc. will be ignored by anything that runs inside it.

The Remote Engine is based upon Apache Karaf as well. However, we can get around this VERY easily. When you install the Talend Runtime or the Remote Engine as a service on Linux, you will make use of a wrapper.conf file. For the Talend Runtime it will be called something like Talend-ESB-Container-wrapper.conf and for the Remote Engine, it will be called something like Talend-Remote-Engine-wrapper.conf. The file will be located in the installation’s /etc folder. All you need to do is to stop the service from running and add a couple of lines to the beginning of the wrapper.conf file.

Look in the file to find some code like this….

set.default.JAVA_HOME=${java.home}

set.default.KARAF_HOME=${karaf.home}

set.default.KARAF_BASE=${karaf.base}

set.default.KARAF_DATA=${karaf.data}

….and add the following with the settings you require for your variables….

set.default.FILEPATH=/home/Richard/Documents/env.txt

set.default.ENCRYPTIONKEY=12345678

These variables will be picked up in exactly the same way as system environment variables, by anything running inside the Talend Runtime or Remote Engine.
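For reference, the file that FILEPATH points at is just the plain context file read by the Implicit Context Load (as set up in Part 3). Assuming the field separator is configured as “=”, it is simply a list of context-variable names and values; the names and values below are purely illustrative.

DBHost=dev-db.example.com
DBPort=5432
DBUser=talend_user
DBPassword=my_encrypted_password_value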


So, how is this done using the Remote Engine?

Once we have all of the possible environment variable issues resolved, it is extremely easy to get this working by using the Remote Engine. First, we need to install a Remote Engine. If you haven’t done this, there are instructions which can be followed here.

Once the Remote Engine is installed and the updates to the wrapper.conf (described above) are implemented, we can configure our first Task. I’ll assume that you have created a job for this (following the instructions in Part 3 of this blog) and have uploaded it to the artifact repository. If so, you can follow the steps below to see this working in the Remote Engine.

1) Go to the Management Console and click on the “Operations” link in the left sidebar. Then click on the “View Tasks & Plans” button.

2) Click on the “Add” button and select “Task”.

3) Select the “Workspace”, “Artifact type” and “Artifact”. The job being set here is a test job that has been configured to use the Implicit Context Load.

4) Leave all of the context variables blank, because these will be set via the Implicit Context Load.

5) Select the “Runtime” and “Run type”. We are selecting the Remote Engine here. This is important. The “Run type” can be left as “Manual” or you can set this to be scheduled if you want.

6) Once we click “Go Live” the job will start (if we left the “Run type” as “Manual”). The next screen will show the job running on your Remote Engine.

7) If everything has been configured correctly, the next screen will show a success status.

Using the method described in this blog series, you can easily control your context variable usage across all of your environments, so long as you can add environment variables to your servers. If the Implicit Context Load settings are configured for your project, you needn’t ever think about which context is used. When you build a new job, it will automatically be set to use the Implicit Context Load, which will be controlled by the settings on the machines you use to run your jobs.

I hope the series has been useful and that you learned a few new tricks. Until next time!



6 Ways to Start Utilizing Machine Learning with Amazon Web Services and Talend


A common perspective that I see amongst software designers and developers is that Machine Learning and Artificial Intelligence (AI) are technologies that are only meant for an elite group. However, if a particular technology is to truly succeed and scale, it should be accessible to the everyday practitioner (in this case, the average software developer).

In this blog, I would like to discuss how Amazon Web Services (AWS) has changed the landscape of Machine Learning and AI through its various services, and how Talend is further evangelizing this idea of “democratizing machine learning” among the IT masses. We will also review some sample scenarios where Talend is integrated with different Amazon Machine Learning services. I have also added links to the corresponding Talend KB articles, with sample Talend job details, for readers who would like to get some real hands-on experience.

Before going further into the topic, I would like to invite readers to go through three interesting blogs to get a better idea of Machine Learning and Artificial Intelligence.

Getting Started with Amazon Web Services and Talend

Now, let’s get back to our original discussion point, i.e., how we can further democratize machine learning so that an everyday IT developer can make use of these capabilities without writing any complex code.

The relevance of AWS Machine Learning and AI services comes in this context, as these services hide the complex underlying processes from the end user. Talend augments these services further by packaging the service calls neatly with multiple data flows. This approach helps to build various everyday applications in a simple yet sophisticated manner.

Age of “SaaS”

Software as a Service (“SaaS”) has become the new normal for Information Technology. In the context of Machine Learning and Artificial Intelligence, Amazon Web Services has created a wonderful series of SaaS modules. These modules help users perform tasks like real-time predictions based on machine learning models, sentiment analysis of messages, language detection and translation, image moderation, text extraction from images, speech synthesis and much more. Talend helps to integrate these Amazon machine learning and AI services in a seamless manner to create end-to-end applications in a graphical format.

Prediction made easy

Predicting a real-time response based on a machine learning model was a daunting task until quite recently. Talend helps to make this task easy by integrating with Amazon’s Machine Learning real-time prediction service. For example, a botanist can now instantly predict the class of a flower by analyzing its petal and sepal features, based on the machine learning model he or she has created.

In the diagram above, Talend helps in dataset creation (the dataset used to build the ML model) by fetching flower classification data from different source systems in multiple formats. Talend also helps in the real-time prediction of a flower category, based on the input data provided by the botanist, by making a request to the Amazon Machine Learning service. Instead of writing complex code spread across thousands of lines, programmers can create real-time prediction services with the help of Amazon Web Services and Talend.

The Talend KB article outlining the various steps involved in real-time prediction using machine learning can be found at the link below.

Introduction to Talend and Amazon Real-Time Machine Learning

Language Processing – Key to Becoming a “Global Village”

The world has become a global village, and interactions between people from different parts of the world are increasing day by day. Language was one of the major roadblocks to free communication between people all over the world. Amazon’s natural language processing services, such as Amazon Comprehend and Amazon Translate, help us to detect the dominant language of any given text, translate it, and perform sentiment analysis on incoming textual information. Talend integrates these Amazon AI services to create end-to-end applications such as real-time sentiment analysis dashboards and multilingual customer care systems.

A quick example is the sentiment analysis dashboard shown below. Talend is integrated with Amazon’s Comprehend service to identify customer sentiment in real time and send the sentiment analysis details to downstream dashboards.

Another example which showcases Talend’s integration capabilities with the Amazon Comprehend and Amazon Translate services is the creation of a multilingual customer care system. The incoming messages are analyzed to understand the dominant language used in them, and text in unsupported languages is automatically translated into a supported language.

The two Talend KB articles I would recommend for a detailed overview of, and hands-on experience with, Talend’s integration with these two Amazon services are shown below.

Talend and Amazon Comprehend Integration

Talend and Amazon Translate Integration

Image Processing

Image processing and analysis opens up a lot of new opportunities to make our lives easier. Amazon Rekognition is one of the top image processing services available. Talend helps you build end-to-end applications such as image moderation systems, celebrity image analysis, text recognition from images, and facial analysis.

A sample illustration of the various possibilities is shown in the diagram below, where Talend is integrated with Amazon Rekognition to perform real-time vehicle number plate analysis. This helps in various areas, such as parking monitoring, toll collection, and capturing automobiles running red lights.

Another interesting use case is setting up image moderation for websites, as shown below, where Talend verifies each incoming image by making a request to Amazon Rekognition. Based on the result, the image is either allowed to be consumed by downstream systems or quarantined, with warning messages sent to the relevant teams.

Talend and Amazon Rekognition can also be used for other use cases, such as celebrity recognition from images for media houses, facial analysis for home security systems, and crime scene analysis. I would recommend you go through the KB article below to understand the details of the integration, and I would highly recommend trying some of the above scenarios by creating sample jobs using Talend.

Talend and Amazon Rekognition Integration

Speech Synthesis

Another interesting Talend integration with an Amazon AI service is the combination of Talend with Amazon Polly. Speech synthesis is a remarkable game changer in many applications. Amazon Polly can synthesize speech in multiple languages. Talend helps to integrate this AI service with the other areas of a complex enterprise application.

A simple use case where Talend is seamlessly integrated with Amazon Polly is a flight tracking and automatic audio notification system. As shown in the diagram below, Talend transfers data from various sources as text messages to Amazon Polly. The converted voice messages are then fetched and transferred to downstream systems.

The KB article explaining the steps for integrating Talend with Amazon Polly can be found at the link below. It also includes a sample job where Talend supplies the input text and converts it to MP3 audio files.

Talend and Amazon Polly Integration

Conclusion

Many Talend users understand that Talend can be used to connect to services with the pre-built components in the Talend Palette, but a true Talend enthusiast will explore other possibilities, like creating custom components or using routines, to integrate with additional services. Integration of Talend with Amazon’s machine learning and AI services is one such fascinating experiment to showcase the true power of Talend. It also illustrates how Talend and Amazon Web Services are turning the complex world of Machine Learning into a more democratic one for software engineers. I hope you found this overview useful; please let me know about your machine learning use cases with Talend in the comments.


All the Ways to Connect Your Azure SQL Data Warehouse with Talend


Azure SQL Data Warehouse (DW) has quickly become one of the most important elements of the Azure Data Services landscape. Customers are flocking to Azure SQL DW to take advantage of its rich functionality, broad availability and ease-of-use. As a result, Talend’s world-class capabilities in data integration, data quality and preparation, and data governance are a natural fit with Azure SQL DW.

With Talend’s vast component library, as well as new products being brought in, there are many options for connecting into Azure SQL DW. Here, we will explore those options and help you identify which one is best, based on your organizational needs for data movement, integration, preparation, quality, and governance.

Simple, Fast Ingestion with Stitch Data Loader

The first scenario is that you are simply looking to ingest data from a cloud application data source into your Azure SQL Data Warehouse. In other words, you want to populate your data warehouse. You are not yet concerned with data quality or preparation, as this is “Phase 1”.

In this “Phase 1” approach, your best option is to use Talend Stitch Data Loader. Stitch Data Loader connects to a huge array of popular cloud application sources and allows you to set up ingestion of your data into Azure SQL DW in mere minutes. This is a low-cost subscription option that gives you the ability to populate your data warehouse as quickly and easily as creating a new account on Stitch, choosing your data ingestion source, then setting the destination. Stitch allows you to set an ingestion schedule, as well as options for ingesting only new or updated data.

Get a free 14-day trial of Stitch Data Loader 

Data Warehouse Modernization

The second scenario is that you have far more complex data integration needs. A perfect example of this would be a data warehouse modernization project. You need to move your data from multiple on-premise and cloud data sources into your Azure SQL Data Warehouse. While integrating these disparate data sources together, you are also looking to clean that data and verify that data for accuracy. In this more robust use case, your best option is to use Talend Cloud with Talend Studio. Talend Studio is the design environment with over 900 connectors and components to facilitate all your data integration, ETL, data quality, and data governance needs.

Talend Studio contains over a dozen components for Azure SQL DW alone, including components for bulk loading. This allows you to build a DI/ETL job that moves huge amounts of data in bulk directly into your Azure SQL DW. These jobs can be further enhanced and augmented to include functions for data cleansing, data quality, and data governance.

Once you have built your job in Studio, you can then upload or promote your job to Talend Cloud. With Talend Cloud, you can schedule and deploy this job in a cloud-based application environment, giving you a truly modern architecture for enterprise data integration. With Talend Cloud, you can further enhance this data for the line-of-business user with Talend Data Preparation. These business users can create, define and share preparations, providing key insight into the preferred shape and format of that data, as defined by their business project requirements. Those preparations are then used directly in live data integration jobs.

A Self-Service Data Scenario

The third and final scenario here falls in between the first two, in terms of complexity and needs. In the third scenario, you have multiple data sources to which you need to connect, to populate your Azure SQL DW, and you may want to bulk load this data during this process. You still want to provide business users with the ability to define data preparations, to satisfy their business project requirements. However, you may not need all the functionality around data quality and governance.

Later this year, Talend will look to introduce new and innovative ways to connect to Azure SQL DW, giving even more flexibility with integrating data into your Azure SQL Data Warehouse.

Whether your connectivity needs into Azure SQL DW are simple loading and ingestion, or more complex enterprise data integration, with data quality, data preparation, and data governance, Talend has a variety of options available to cover any use case. As you need to move your data into Azure SQL DW, Talend can help.


Microsoft Azure & Talend : 3 Real-World Architectures


We know that data is a key driver of success in today’s data-driven world. Often, companies struggle to efficiently integrate and process enterprise data for fast and reliable analytics, due to reliance on legacy ETL solutions and data silos.

To solve this problem, companies are adopting cloud platforms like Microsoft Azure to modernize their IT infrastructure. To get the most out of their Microsoft Azure investment, however, these organizations need a data integration provider like Talend to seamlessly integrate with their data sources and Azure Cloud services such as Azure Cloud Storage and Azure SQL Data Warehouse.

In Talend’s newly published Architect’s Handbook on Microsoft Azure, we featured a few companies that couple the power of Talend and Microsoft Azure solutions to overcome application and data integration challenges in order to modernize their cloud platform for Big Data analytics.

Let’s take a closer look at each use case to pinpoint exactly how each company is utilizing Talend and Azure.

Use Case 1: Maximizing Customer Engagement to Keep a Liquid Petroleum Gas Supplier ahead of the Competition

Maintaining a high level of customer engagement is critical to keeping the competition at bay for any company, yet for a leading British liquid petroleum gas supplier, it requires a Herculean effort. They must keep customer engagement high across a number of criteria ranging from product quality to pricing to supply and operations to a compelling branding and positioning strategy.

One way to ensure great customer engagement is to find the right customer segment and target them with the right messaging at the right time through the right channel. The challenge, however, lies in getting relevant, accurate, and in-depth data of individual customers. Using Talend Big Data Platform to build a cloud data lake on the Microsoft Azure Cloud Platform, this company was able to integrate and cleanse data from multiple sources and deliver real-time insights. With a clear view of each customer segment’s profitability, they could target their customers with customized offers at the right time to maximize engagement.

Use Case 2: Enabling GDPR Compliance and Social Media Analytics to Improve Marketing

Balancing visibility into customer data in order to design effective marketing campaigns while complying with data regulations is not an easy task for the highly-regulated liquor industry, as wine, beer, and spirits companies are not allowed to collect customer or retail store data first-hand with surveys.

The CTO of a century-old large European food, beverage and brewing company with 500 brands was able to achieve this balance, however, with a GDPR-compliant solution that delivers insights on how customers and prospects talk about their products and services on social media platforms in real-time. Using Talend Big Data Platform and Microsoft Azure to build an enterprise cloud data lake, the company was able to analyze various social media data from 450 topics with a daily sample set of up to 80GB and transform over 50 thousand rows of customer data in a time span of 90 days.

Use Case 3: Delivering Real-Time Package Tracking Services by Building a Cloud Data Warehouse

To maintain a premium level of package tracking and delivery service, a leading logistics solution provider needed to consolidate, process and accurately analyze raw data from scanning, transportation, and last mile delivery from a wide range of in-network applications and databases.

They selected Talend for its open source and hybrid nature, its developer-friendly UI, and simple pricing. By deploying Talend Real-Time Big Data Platform in the Microsoft Azure cloud environment, they were able to re-architect a legacy infrastructure and build a modern cloud data warehouse that allows them to provide cutting-edge services and shrink package tracking information delays from 6 hours to less than 15 minutes.

Building Your Microsoft Azure and Talend Solution

What’s next? You can get the full whitepaper on how to modernize your cloud platform for Big Data analytics with Talend and Azure here. Additionally – you can try Talend Cloud and start testing Azure integration for free for 30 days.


How Kent State University is streamlining processes for recruiting and admitting students


Kent State University is a public research university located in Kent, Ohio, with an enrollment of nearly 41,000 students. 

Success in recruiting qualified students in sufficient numbers is the lifeblood of any university. In its efforts to aggregate data related to admissions, Kent State found itself dealing with a “spaghetti mess”, further complicated by its hybrid environment. Currently, the university relies on an on-premises Banner ERP system, but its Salesforce CRM and other SaaS applications live in the cloud.

Facilitating the transition to a cloud-based environment

To find the right solution to serve as a centralized integration hub, Kent State put out an RFP and evaluated software from several vendors. The university considered Talend because it provides data integration, ESB, data quality, and master data management all in one solution, and decided to deploy Talend Cloud on Amazon Web Services (AWS). The cloud-native character of Talend also helped sway the decision in its favor.

“Data is the currency of higher education. It enables us to build relationships and understand how to engage students, faculty, staff, researchers, and alumni more effectively” – John Rathje, CIO

Talend played a huge role in supplying a wide range of data to multiple organizations in Salesforce. Talend enabled the school to integrate between 25 and 50 separate sources containing purchased lists of names of prospective students and import the data into Salesforce to be used in recruiting communications. Talend’s prebuilt connectors, and especially the Salesforce connector, streamlined Kent State’s typical processes.

Managing the admissions process

A key Kent State system that relies heavily on Talend is CollegeNET, the school’s CRM system for managing the admissions process for graduate and international students. Talend is the critical component that integrates CollegeNET with Banner, the ERP widely used in higher education.

By catching faulty data early, Talend Cloud Data Stewardship has also eliminated the need for admissions staff to manually change data in Banner. That data cleansing process used to take up to 20 minutes per applicant and has now been significantly reduced. Currently, Kent State’s main ERP and data warehouse are on-premises, but plans are to move both source and target systems to the cloud. “Once we’re there,” says Holly Slocum, Director of Process Evaluation and Improvement for Kent State, “the flexibility the Talend cloud engines give us will enable us to avoid moving data in the cloud to an on-prem remote engine, then back up to the cloud. We also plan to look into installing Talend in an AWS or Microsoft Azure instance. If we want to take advantage of services from cloud providers, we’re not stuck with running the engine on-prem.” 

<<Read the full case study here>>


Life Might Be Like a Box of Chocolates, But Your Data Strategy Shouldn’t Be


“My momma always said, ‘Life was like a box of chocolates. You never know what you’re gonna get.’” Even if everyone’s life remains full of surprises, the truth is that what applied to Forrest Gump in the 1994 movie by Robert Zemeckis shouldn’t apply to your data strategy. As you take the very first steps of your data strategy, you first need to know what’s inside your data, and this part is critical. To do so, you need the tools and methodology to step up your data-driven strategy.

<<ebook: Download our full Definitive Guide to Data Governance>>

Why Data Discovery?

With the increased affordability and accessibility of data storage over recent years, data lakes have grown in popularity. This has left IT teams with a growing number of diverse known and unknown datasets polluting the data lake in volume and variety every day. As a consequence, everyone is facing a data backlog. It can take weeks for IT teams to publish new data sources in a data warehouse or data lake. At the same time, it takes hours for line-of-business workers or data scientists to find, understand and put all that data into context. IDC found that only 19 percent of the time spent by data professionals and business users can really be dedicated to analyzing information and delivering valuable business outcomes.

Given this new reality, the challenge is now to overcome these obstacles by bringing clarity, transparency and accessibility to your data, as well as to extract value from legacy systems and new applications alike. Wherever the data resides (in a traditional data warehouse or hosted in a cloud data lake), you need to establish proper data screening, so you can get the full picture and make sure you have an entire view of the data flowing in and out of your organization.

Know Your Data

When it’s time to get started working on your data, it’s critical to start exploring the different data sources you wish to manage. The good news is that the newly released Talend Data Catalog coupled with the Talend Data Fabric is here to help.

As mentioned in this post, Talend Data Catalog will intelligently discover all the data coming into your data lake so you get an instant picture of what’s going on in any of your datasets.

One of the many interesting use cases of Talend Data Catalog is to identify and screen any datasets that contain sensitive data so that you can further reconcile them and apply data masking, for example, to enable relevant people to use them within the entire organization. This will help reduce the burden of any data team wishing to operationalize regulations compliance across all data pipelines. To discover more about how Talend Data Catalog will help to be compliant with GDPR, take a look at this Talend webcast.

Auto Profiling for All with Data Catalog

The auto-profiling capabilities of Talend Data Catalog make data screening easier for non-technical people within your organization. Simply put, the data catalog will provide you with automated discovery and intelligent documentation of the datasets in your data lake. It comes with easy-to-use profiling capabilities that will help you to quickly assess data at a glance. With trusted and auto-profiled datasets, you will have powerful and visual profiling indicators, so users can easily find the right data in a few clicks.

Not only can Talend Data Catalog bring all of your metadata together in a single place, but it can also automatically draw the links between datasets and connect them to a business glossary. In a nutshell, this allows organizations to:

  • Automate the data inventory
  • Leverage smart semantics for auto-profiling, relationships discovery and classification
  • Document and drive usage now that the data has been enriched and becomes more meaningful

Go further with Data Profiling

Data profiling is a technology that will enable you to discover your datasets in depth and accurately assess multiple data sources based on the six dimensions of data quality. It will help you to identify if and how your data is inaccurate, inconsistent, or incomplete.

Let’s put this in context. Think about a doctor’s exam to assess a patient’s health. Nobody wants to undergo surgery without a precise and close examination. The same applies to data profiling: you need to understand your data before fixing it. As data will often come into the organization inoperable, in hidden formats, or unstructured, an accurate diagnosis will give you a detailed overview of the problem before fixing it. This will save time for you, your team, and your entire organization, because you will have mapped this potential minefield in advance.

Easy profiling for power users with Talend Data Preparation: Data profiling shouldn’t be complicated. Rather, it should be simple, fast and visual. For use cases such as Salesforce data cleansing, you may wish to gauge your data quality by delegating some of the basic data profiling activities to business users. They will then be able to do quick profiling on their favorite datasets. With tools like Talend Data Preparation, you will have powerful yet simple built-in profiling capabilities to explore datasets and assess their quality with the help of indicators, trends and patterns.

Advanced profiling for data engineers: Using Talend Data Quality in the Talend Studio, data engineers can start connecting to data sources to analyze their structure (catalogs, schemas, and tables), and store the description of their metadata in the metadata repository. Then, they can define the available data quality analyses, including database content analysis, column analysis, table analysis, redundancy analysis, correlation analysis, and more. These analyses carry out the data profiling processes that define the content, structure, and quality of highly complex data structures. The analysis results are then displayed visually as well.

To go further into data profiling take a look at this webcast: An Introduction to Talend Open Studio for Data Quality.

Keep in mind that your data strategy should first and foremost start with data discovery. Failure to profile your data would obviously put your entire data strategy at risk. It’s really about surveying the ground to make sure your data house can be built on solid foundations.


Data Matching and Combinations Using Large Data Sets


When doing data matching with large sets of data, consideration should be given to the number of combinations that can be generated and their associated effect on performance. This matters when using Talend’s Data Integration matching and Data Quality components, because matching routines do not scale in a linear fashion.

For those who have read my various blogs and articles on data matching, you already know that the basic premise behind data matching is the mathematics of probabilities. For the objects being matched (people, widgets, etc.), the basic idea is to identify attributes that are unlikely to change, and then block and match on those attributes. You then use math to work out the probability that each of the attributes in turn matches, and then sum up all the weighted probabilities to get a final match value that classifies the pair as a match, a non-match or a possible match. A simple sketch of this weighted scoring idea is shown below.
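To make that concrete, here is a minimal, self-contained sketch of weighted match scoring. It is purely illustrative (the weights, thresholds and attribute choices are hypothetical) and is not the algorithm used inside Talend’s matching components.

// Minimal illustration of weighted match scoring. Weights, thresholds and
// attributes are hypothetical; this is not Talend's internal implementation.
public class MatchScorer {

    private static final double[] WEIGHTS = {0.6, 0.4};      // e.g. last name, first name
    private static final double MATCH_THRESHOLD = 0.80;      // "match" at or above this score
    private static final double POSSIBLE_THRESHOLD = 0.60;   // "possible match" at or above this

    // Weighted average of per-attribute similarity scores (each between 0.0 and 1.0).
    public static double score(double[] similarities) {
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            total += WEIGHTS[i] * similarities[i];
            weightSum += WEIGHTS[i];
        }
        return total / weightSum;
    }

    public static String classify(double[] similarities) {
        double s = score(similarities);
        if (s >= MATCH_THRESHOLD) return "MATCH";
        if (s >= POSSIBLE_THRESHOLD) return "POSSIBLE MATCH";
        return "NON-MATCH";
    }

    public static void main(String[] args) {
        // Last names very similar (0.95), first names fairly similar (0.70):
        // score = 0.6 * 0.95 + 0.4 * 0.70 = 0.85, so this pair is a MATCH at an 80% threshold.
        System.out.println(classify(new double[] {0.95, 0.70}));
    }
}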

Blocking Keys

Now, these simple combinations cut both ways. Blocking on columns reduces the number of combinations the matching routines have to evaluate, because records are only matched within a block, but it does mean that the larger the data gets, the more combinations need to be checked overall.

Here is a simple example to illustrate what I mean. Imagine matching data from two datasets, “A” and “B”, each containing 10 records. Ignoring blocking keys, each record in “A” must be checked against the 10 records in “B”, resulting in 100 combinations, or 100 checks, in total.

Now, this can be reduced by making a good choice of blocking key, but the maximum is still 100. The result could be that record number 1 in dataset “A” matches with, say, 2 records in dataset “B”. Records 2 and 3 may have no matches, record 4 may have a possible match in “B” and 1 definite match… and so it goes on. Ten records in each set could therefore result in 100 checks and potentially more than 10 results. That’s why you could start with 10 records in each dataset and get something like 3 matches, 4 possible matches, and 6 non-matches; in other words, you end up with more results than the number of input records you started with. Simply put, one record may match more than once, so it’s usual to get more results out than you would first expect. As discussed, blocking will reduce the overall number of combinations to be checked, but there will usually be more combinations, and therefore results, than the initial number of records, unless there are only a few distinct duplicates or nothing matches at all. The short sketch below shows how blocking cuts down the number of comparisons.
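Here is a small, self-contained sketch of the effect of a blocking key on the number of comparisons. The records and the “county” blocking key are made up purely for illustration; a real matching job would of course also score the candidate pairs.

// Toy illustration of how a blocking key reduces the number of comparisons.
import java.util.*;
import java.util.stream.*;

public class BlockingExample {
    public static void main(String[] args) {
        // Each record is {surname, county}; the county is used as the blocking key.
        List<String[]> datasetA = List.of(
                new String[] {"Smith", "Kent"}, new String[] {"Jones", "Kent"},
                new String[] {"Brown", "Essex"}, new String[] {"Taylor", "Essex"});
        List<String[]> datasetB = List.of(
                new String[] {"Smyth", "Kent"}, new String[] {"Browne", "Essex"},
                new String[] {"Wilson", "Essex"});

        // Without blocking: every record in A is compared with every record in B.
        long withoutBlocking = (long) datasetA.size() * datasetB.size();           // 4 x 3 = 12

        // With blocking: only records that share the same county are compared.
        Map<String, Long> blockSizesB = datasetB.stream()
                .collect(Collectors.groupingBy(r -> r[1], Collectors.counting()));
        long withBlocking = datasetA.stream()
                .mapToLong(r -> blockSizesB.getOrDefault(r[1], 0L))
                .sum();                                                            // 2x1 + 2x2 = 6

        System.out.println("Comparisons without blocking: " + withoutBlocking);
        System.out.println("Comparisons with blocking:    " + withBlocking);
    }
}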

An example of this is shown below. Here, I have built a simple matching job using Talend components. In this case, I am matching a dataset of one hundred thousand demographic records against a reference set of ten thousand records. The data is read in, and a Talend tRecordMatch component is used to match the records on just two fields, ‘First Name’ and ‘Last Name’, blocking on ‘County of residence’, i.e. the area where the people live. Now, this is not ideal, but I am using it simply to demonstrate the scenario above. The match threshold is set at 80%.

When we run the job, the results show that we have found over 32,000 potential matches, over 74,000 records that don’t match and around 1,200 matches. If we add all these up, we see the total come out to more than 100,000. As described above, this is due to some records matching more than once, resulting in many more records than we started with. More importantly, this means that a very large number of combinations have been checked, and of course, will result in the job requiring sufficient resources in order to process all those combinations. In this example we can immediately see that we have far too many potential matches.

Dealing With Data Matching In Large Datasets

Now, for small numbers of records it’s not a problem, but for large datasets the combinations can get overwhelming, and this can obviously affect performance as well. The main drivers of performance here are the blocking keys and the matching rules. In general, for records with, say, 10 attributes, you would want around 2-3 blocking keys. Within those blocks, you would want to match around 3-4 attributes, but remember that the most important thing is the accuracy of the matches.

The higher the match threshold, the fewer matches you get. However, you run the risk of missing real matches and increasing the number of possible matches that need to be checked, which is usually done manually. The lower the threshold, the more matches you get and the more combinations are possible, but this time you run the risk of getting false positive matches. It’s a trade-off, and this is where the size of the dataset becomes important. For datasets in the thousands and hundreds of thousands it’s not usually an issue, but for datasets in the millions and tens of millions, it can be.

The Importance of Tuning Your Data Matching

This is why tuning your data matching, blocking routines, and algorithms is crucially important, but unfortunately there is no simple ‘one stop’ recipe to setting it up. You need to start with a first pass and check the results. Make changes and then run it again. The process then repeats until you get to a point of diminishing returns. This is also why a careful choice of your overall strategy and consideration for the possible combinations involved is key too.

So, the overall message is straightforward. When doing data matching against large datasets you want a good choice of blocking keys, matching algorithms and thresholds in order to do two things:

  1. Reduce the number of false positives and negatives you will get, and therefore improve the quality of your results.
  2. Reduce the overall number of combinations that have to be checked in order to reduce the amount of resources needed to do your matching.

As discussed, this can be a trade-off, but it is crucially important when matching with large datasets. Finally, it’s important to note that this does not simply scale linearly: a dataset twice as large as the last one does not take twice as long to process. Mathematically, matching does not scale linearly (like a straight line), but more like a power relationship; without blocking, for example, doubling the size of both datasets quadruples the number of comparisons. It is possible to estimate what this could be in practice, but you would need to first understand how many combinations are possible. The point, in practice, is therefore to carefully consider those possible combinations and take them into account when matching data from large datasets.

How are your data matching efforts going? Is there anything you’d like me to cover in my next article? Let me know in the comments and until then, happy data matching!


Talend increases its investments in Research & Development in Nantes


Almost three years ago to the day, Talend opened its fourth global research and development center, and its second in France, in Nantes. It was clear to Talend from the very beginning that this new innovation center would not be a simple satellite of existing centers, but a key element in our strategy and overall R&D efforts.

Dedicated to innovative cloud, big data, and machine-learning technologies, this center plays a key role in our research and development efforts, creating new products, adding new features and services, increasing functionality, and improving our cloud data integration infrastructure. 

 

A winning investment!

Today, Talend is increasing its investments in France with the expansion of its research and development center in Nantes. This new innovation center of more than 2,600 m² will make it possible to support Talend’s strong growth and also strengthen our foothold in the region’s digital ecosystem.

In 2016, when this center of excellence was opened, our objective was to recruit up to 100 engineers by the end of 2018. This target has been exceeded, with the recruitment of 120 engineers. We are now planning to increase our workforce in Nantes to 250 by 2022. 

We are proud of our ability to attract and retain the best talent to our R&D team, to create a challenging but also rewarding environment where employees can thrive, solve complex issues, and find innovative ways to address current and future challenges in data integration, processing, and governance.

At Talend, we apply agile development methodologies, work with the latest technologies, and have created a modern, flexible, and automated software development process that allows us to deliver high-quality applications and quickly adapt to market changes and the new requirements of our customers and partners.

 

A local footprint

Today, our team is moving into a new office space where they will have every opportunity to thrive in an environment that is conducive to innovation and collaboration. We also hope that this innovation hub will contribute to the development of the digital economy in Nantes and the broader region. We will therefore also have the pleasure of opening this space to the booming local technology community by organizing regular meetings and events, meetups, or hackathons.

By establishing ourselves in Nantes, we chose a dynamic, innovative city and region that benefits from a living environment recognized by those who live there on a daily basis. Nantes has a highly developed digital ecosystem with many startups and innovative companies. And what better example to illustrate this than Talend’s acquisition in November 2017 of Restlet, a Nantes-based leader in cloud API design and testing.

But the area of Nantes also benefits from a pool of students and leading engineering schools that are recognized internationally for the quality of their training. We will work closely with these educational centers of excellence to create joint programs around new cloud and big data technologies, work-linked training, or through the sharing of our expertise around open source technologies such as Apache Spark, Apache Beam, or Hadoop.

It is with great pride and emotion that I would like to thank all of Talend’s employees – developers, DevOps, UX designers and other automation specialists – the public stakeholders who have supported us and made our implementation a success, the digital and educational ecosystem for the opportunities we are given to exchange and learn together.

 



Key Considerations for Converting Legacy ETL to Modern ETL – Part II


Let me start by thanking all those who read the first part of our blog series on converting legacy ETL to modern ETL! Before I begin the second part of this three-blog series, let’s recap the three key aspects under consideration for converting from legacy ETL to modern ETL.

  1. Will the source and/or target systems change? Is this just an ETL conversion from their legacy system to modern ETL like Talend?
  2. Is the goal to re-platform as well? Will the target system change?
  3. Will the new platform reside in the cloud or continue to be on-premise?

In the first blog of the series, we focused on the first question in that list. If you haven’t caught up on that yet, please do so before continuing. In this blog, I will be focusing on the second question mentioned above.

Let’s assume that we have an organization that wants to move away from their legacy backend system to a more advanced, distributed, columnar backend system and at the same time would like to migrate away from legacy ETL to a more modern ETL platform. What’s the best way to approach this initiative? Today, we are going to find out!

Where Should You Get Started?

The first step is to understand the environment. To do that, I often ask myself a couple of key questions:

  • What will the new backend system be based on? Will it be an MPP, columnar database platform or a Hadoop cluster?
  • How much data are they processing and how do they anticipate the growth of their data in the next x months/years?
  • Finally, is the business looking for new nuggets of insight in the data or is the goal to simply get off the overall legacy environment for a better turnaround?

If the answer to the above questions is an MPP database platform that is expected to see the same amount of data growth they have been experiencing, and the ultimate goal is to get off a legacy environment for better turnaround, then the strategy outlined in the first blog can be applied here too, with one additional capability of the conversion tool: replacing the existing output component with the relevant new output component, for instance replacing an Oracle output with a Netezza/Snowflake output. You should also leverage the ELT components in Talend where it makes sense to push the processing of data down to the MPP database rather than processing it outside the database.

Both these aspects can be leveraged using the strategy outlined in the first blog. Please note, at this point, the strategy remains pretty much the same whether you decide to keep this migration on-prem or in the cloud. Some special considerations need to be taken care of for cloud migration which I’ll be covering in the upcoming third blog.

Now, should the answer to the questions above be a Hadoop cluster to build out a data lake, anticipating medium to large data growth with added business functionalities such as self-service queries of data on the data lake as well as additional metrics for reporting, then the strategy I’ll go through below has proven to be useful for many of our customers.

Bright minds bring in success. Keeping that mantra in mind again, first build your team:

  • Core Team – Identify architects, senior developers and SMEs (data analysts, business analysts, people who live and breathe data in your organization)
  • Talend Experts – Bring in experts in the tool so that they can guide you and provide you with the best practices and solutions for all of your conversion-related effort. They will also participate in performance tuning activities
  • Conversion Team – A System Integrator partner who can provide people trained and skilled in Talend
  • QA Team – Seasoned QA professionals that help you breeze through your QA testing activities

Now comes the approach: Divide the effort into logical waves and follow this approach for each wave.

Based on the existing ETL code and the new functionalities that need to be incorporated in order to migrate to, for instance, a data lake on HDFS or S3, each wave follows these steps:

Identify Data Ingestion & Processing Patterns – Analyze the ETL jobs and categorize them based on overall business and technical functionality. Given that migration needs to happen, there will be new technical functionalities that may be identified, such as an “Ingestion Framework” that will ingest data in its raw format into the data lake sitting on HDFS or an S3 bucket. Write down all these patterns. This is where your SMEs, Talend Experts and Architects from your System Integrator partner can help you define the right categories working closely in conjunction with each other.

Design Job Templates for those Patterns – Once the patterns are identified for a given wave, Talend Experts can help you design the right template for such patterns. Be it templates for “Ingestion Framework” or data loads following specific business rules. Designs will most likely leverage big data components.

Develop – Now that the designs for the identified patterns are ready, work in an iterative manner across multiple sprints to develop the jobs required to ingest and process the data. Any deviation to the template, which could be the case in a handful of jobs, given data processing complexities, needs to be approved by a governing body consisting of your SMEs and Talend Experts.

Optimize – Focus on job design and performance tuning. This will primarily be driven by volume and data processing complexities. Given the use case, usage of big data components will be common; for instance, the focus will be more on tuning Spark parameters and queries (see the sample settings after this list of steps). Here again, Talend Experts will be helpful in providing the most appropriate performance tuning guidelines for each scenario.

Complete – Unit test and ensure all functionalities and performance acceptance criteria are satisfied before handing over the job to QA

QA – A mix of SIT, UAT and an automated approach to compare result sets produced by the old set of ETL jobs and the new ETL jobs (the latter may not be applicable for all jobs, and hence proper SIT and UAT will be required). It is extremely important to introduce an element of regression testing to ensure fixes are not breaking other functionalities, and performance testing to ensure SLAs are met.
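To give a feel for what the Spark tuning mentioned in the Optimize step typically involves, here is a hypothetical set of Spark properties of the kind you might adjust, usually in the Spark configuration of a Talend big data job. The values are illustrative only and depend entirely on your data volumes and cluster size.

spark.executor.instances=10
spark.executor.memory=8g
spark.executor.cores=4
spark.sql.shuffle.partitions=200
spark.serializer=org.apache.spark.serializer.KryoSerializer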

In the past, we have witnessed that such conversions take a significant amount of time to complete. It is, therefore, extremely important to set the right expectations with the stakeholders and ensure setting milestones to define success criteria for each wave. Have a clear roadmap and work towards it. Keep business informed and ensure key resources are available from business and IT during critical design discussions and during UAT. This will ensure a smooth transition of your legacy data management platform to a whole new modern big data management platform.

Conclusion:

This brings me to the end of the second part of the three-blog series. Below are the five key takeaways of this blog:

  1. Define roadmap and spread the conversion effort across multiple waves
  2. Set milestones at critical junctures and define success criteria for each of them
  3. Identify core team, Talend experts, a good System Integrator partner and seasoned QA professionals
  4. Identify patterns, design templates and follow an iterative approach across multiple sprints to implement those patterns
  5. Leverage Talend experts for the most optimum performance tuning guidelines

Stay tuned for the last one!!


How TI Media is Using Data to Build Meaningful Relationships with 14 million Consumers


TI Media is the UK’s third-largest consumer magazine and digital publisher with more than 40 brands that reach 14.1 million UK adults monthly across print and digital. The company sells 117 million magazines per year and has 37 million global online users.

TI Media faced similar challenges to other consumer magazine publishers—an eventual drop in print readership and advertising revenues that couldn’t be made up with online advertising. The company needed to generate more revenue from each customer that interacted with its brands in order to close this revenue gap.

But in the UK, only 20% of readers get their magazines through subscriptions, with the rest buying them at newsstands. This makes it impossible to capture customer data at the point of sale. Further, the move to digital formats means that the company’s customers consume content in myriad ways, including print, websites, digital editions, social media, video, apps, podcasts, and events, presenting further challenges for capturing customer data.

Creating a single customer view

The cornerstone of the TI Media data strategy is bringing together customer data, advertising data, content data, and online data all into one place where TI Media can analyze it faster and more easily.

TI Media chose to integrate all of its customer data into a cloud-based data lake. It implemented Talend Cloud Data Quality on a Snowflake database in Amazon Web Services (AWS). Talend enables TI Media to reduce the time to onboard new data sources by 85% and shorten the time to fix data quality issues by 90%, all without the help of IT. Data budgets have also been reduced by 50% in the first year, with further reductions expected.

“If we convert just one percent more of our readers coming to our websites to sales, we can gain £40 million of additional revenue. Data is the answer to drive revenues.” – Lee Wilmore, Data Intelligence Director

And because the project had to be done in two months, speed was essential for TI Media. The new customer data domain was indeed built in just two months.

Content + context + data = results

Consumers interact with TI Media brands more than a quarter of a billion times a year. That’s over 10 meaningful interactions per second. With a holistic view of all customer data in one domain and clean customer data, TI Media can better target its customers while being GDPR compliant.

The company sends out 30 million emails per month in order to drive customers back to its websites to engage with content online. Since implementing Talend, the company is executing marketing campaigns 10% faster, and response rates have increased by 5%. If TI Media converts just 1% more of the readers coming to their websites to sales, they can gain £40 million of additional revenue.

All of these benefits translate to reduced costs and increased revenue. The team at TI Media feel they have just scratched the surface of what the new system can provide. In addition to customer data, they are looking at how to use the data lake to improve revenues and reduce costs for other domains in the company.

<<Download the full case study>>


Making Sense of the 2019 Gartner Magic Quadrant for Data Quality Tools


By now, you know that data is the lifeblood of digital transformation. But the true digital leaders have taken a step further by starting to understand the need to preserve this lifeblood with people, processes and tools. That’s why data quality is so important: it gives you the ability to take control of the health of your data assets, from diagnosis to treatment to monitoring, with whistleblowers along the way.

In this respect, we are especially proud that Talend was recognized by Gartner as a Leader for the second time in a row in the 2019 edition of Gartner’s Magic Quadrant for Data Quality Tools.

<<Download the 2019 Gartner Magic Quadrant for Data Quality Tools>>

Strong Dynamics in the Data Quality Market

The ability to deliver trusted data at speed across the organization and beyond has become imperative. In Gartner’s Hype Cycle for Data Management, “Data Quality Tools” was removed from the 2018 version because the category had reached the plateau.

However, Gartner pinpoints that: “This market is still among the fastest-growing in the infrastructure software subsector of the enterprise software market. We forecast compound annual revenue growth of 8.1% in this market for the period 2017 through 2022.”

Today’s Outlook in the Gartner Magic Quadrant

In that respect, Gartner analysts Melody Chien and Ankush Jain have brought their deep subject matter expertise and thought leadership to examine the Data Quality Tools vendors.

“Data quality vendors are competing to address these requirements by introducing an array of new technologies, such as machine learning, interactive visualization, and predictive/prescriptive analytics, all of which they are embedding in data quality tools.”

But succeeding in data quality is also a matter of adopting data quality principles at speed, in a flexible way, and across the organization. This is where Gartner analysts Melody Chien and Ankush Jain pinpoint how vendors are bringing data quality tools to a wider range of organizations and audiences through cloud deployment models, noting that “They also are offering new pricing models, based on open source and subscriptions.”

Talend Is Named in the 2019 Gartner Magic Quadrant for Data Quality Tools

For the last 10+ years, we’ve been on a mission to help our customers to deliver trusted data at speed. Today, data must be timely, because digital transformation is all about speed and accelerating time to market, whether that’s providing real-time answers to business teams or delivering personalized customer experiences. While speed is critical, it’s not enough. For data to enable effective decision-making and deliver remarkable customer experiences, organizations need data they can trust.

With respect to our data quality vision, this means that from the very beginning, when we introduced our data quality products back in early 2010, we decided not to deliver Data Quality as a stand-alone tool, but rather as a capability that spans across each and every component within our unified platform. We were recognized for the first time in the Magic Quadrant for Data Quality Tools in 2011.

We believe that this pervasive approach to data quality is unique and that it had an important impact on Talend being recognized in the 2019 Gartner Magic Quadrant for Data Quality Tools.

Pervasiveness allows Talend Data Quality to run anywhere and enables users to apply the same data quality sensors, controls, and metrics just in time and consistently across the data chain. And now, thanks to the cloud, our data quality capabilities can be fully ubiquitous, accessing any data anywhere in the data chain and making it trustworthy. This applies to streaming or real-time data as well as data at rest, no matter whether it is stored in an enterprise application, a cloud data warehouse, or a data lake cluster.

Start to Think Data Quality Before You Think Data Quantity

Many data quality projects fail because data quality is an afterthought. Data issues are remediated downstream, but the root cause is not addressed. This is why thinking “quality”, as well as “quantity” is needed.

Let’s take the example of correcting your Salesforce data in your data warehouse because that is where you found the issue. You might have better analytics for a while, but the problem will come back for other use cases or when you modernize your data warehouse into a new environment. Using a pervasive tool allows you to remediate data quality issues upstream in the data chain: directly in your Salesforce application, while you capture a stream from the IoT, when a data engineer turns raw data into something more structured and ready to share, or as an API that can be pushed to a third party so that they can remediate data quality at its roots.

Pervasiveness also allows users to democratize data quality and make it relevant for any audience, not just the IT department or CDO. Within the Talend platform, the same data quality capabilities can empower a developer or data engineer using our development framework. It can also be used by a data steward through the Talend Data Stewardship App or by a business analyst with Talend Data Preparation.

“Data and analytics leaders face a challenge to enable business users as the primary audience for new data quality tools and to adopt a more flexible trust-based data governance model”, as stated in Gartner’s new Magic Quadrant for Data Quality report. Operationalization and orchestration remain essential to deliver business-friendly apps under IT supervision. Talend also strongly believes that data must become a team sport for businesses to win, which is why governed self-service data access tools like Talend Data Preparation and Talend Data Stewardship are such important investments for Talend.

And now, with Talend Data Catalog, this same data quality stack can be applied to a whole data landscape in a systematic and automatic way, making data more meaningful and searchable by anyone through automatic sampling, profiling, categorization, masking, and relationship discovery. Data quality used to be a discipline that only a happy few could really turn into business value. Now, thanks to advancements in pattern recognition, AI, and machine learning, data quality can get into anyone’s hands. This encourages data literacy now that profiling morphs into data discovery, and it also engages a wider audience to collaborate for better data.

As I said at the beginning of the blog, our evolution has been a journey and we invite you to come along with us, try Talend for yourself and become part of this fantastic growing user community.

<<Download the 2019 Gartner Magic Quadrant for Data Quality Tools>>

1) Gartner Magic Quadrant for Data Quality Tools, Melody Chien, Ankush Jain, 27 March 2019.

2) Gartner Magic Quadrant for Data Quality Tools, (Authors), Date of Publication, 2011.

3) Gartner Press Release, Magic Quadrant for Data Quality Tools, 20 August 2018. https://www.gartner.com/en/newsroom/press-releases/2018-08-20-gartner-identifies-five-emerging-technology-trends-that-will-blur-the-lines-between-human-and-machine

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

GARTNER is a federally registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

The post Making Sense of the 2019 Gartner Magic Quadrant for Data Quality Tools appeared first on Talend Real-Time Open Source Data Integration Software.

Introducing Pipeline Designer: Reinventing Data Integration


I am very excited to introduce Pipeline Designer, a next-generation cloud data integration design environment that enables developers to develop and deploy data pipelines in minutes, design seamlessly across batch and streaming use cases, and scale natively with the latest hybrid and multi-cloud technologies.

<<Try Pipeline Designer Now>>

Why Pipeline Designer?

It is no secret that data has become a competitive edge of companies in every industry. And in order to maintain your competitive edge, your organization needs to ensure three things:

  1. That you are gathering all the data that will bring the best insights
  2. That business units depending on the data receive it in a timely fashion to make quick decisions
  3. That there is an easy way to scale and innovate as new data requirements arise.

Achieving this can be very difficult given the emergence of a multitude of new data types and technologies. For example, one of the biggest challenges that businesses face today is working with all types of streaming paradigms as well as dealing with new types of data permeating everywhere from social media, web, sensors, cloud and so on. Companies see processing and delivering data in real time as a game changer that can bring real-time insight, but easily collecting and transforming this data has proven to be a challenge.

Take clickstream data for example. Data is constantly being sent from websites and the stream of data is non-stop and flowing all the time. The typical batch approach to ingest or process data which relies on a definitive “start” and “stop” of the data is obsolete with streaming data and takes away the potential value of real-time reactivity to the data. For example, online retailers rely on clickstream data to understand their users’ engagement with their websites—which is essential to understanding how to target users with the products that they will purchase. In an industry with razor-thin margins, it is essential to have real-time insight to customer activity and competitor pricing data in order to make the fast decisions to win market share.

Additionally, if you are relying on data from different applications, your company’s data integration tool may not cope well with data format changes and data pipelines may break every time a new field is added to the source data. And even if IT is able to handle the dynamic nature of the data, the business units who need access to the data may have to wait several weeks before they can make any actionable insights due to the increasing amount of work put on those who are responsible for distributing the data to the rest of the business.  

In fact, in a recent data scientist survey, over 30% of data scientists reported the unavailability of data and the difficulty of accessing data as their top challenges. The market demand for increased access to actionable data is further supported by job postings showing four times more openings for data engineers than for data scientists.

The data engineering skillset (accessing, collecting, transforming, and delivering all types of data to the business) is in demand, and data engineers today need to be more productive than ever before while working in a constantly changing data environment. At the same time, ad hoc integrators need to be able to empower themselves to access and integrate their data, taking away their reliance on IT.

And last, with more of the business demanding quicker turnaround times, both data engineers and ad hoc integrators need to integrate their data right away, and their data integration tools need to help them meet these new demands. Data engineers and ad hoc integrators now require a born-in-the-cloud integration tool that is not only accessible and intuitive but is also capable of working with the variety and volumes of data that they work with every day.  

These problems may sound daunting, but don’t worry. We wouldn’t make you read this far without having an answer.

Introducing Pipeline Designer

As we saw this scenario play out over and over again with customers and prospects, we knew we could help. That’s why we built Pipeline Designer.

Pipeline Designer is a self-service web UI, built in the cloud, that makes data integration faster, easier, and more accessible in an age where everyone expects easy-to-use cloud apps and where data volumes, types, and technologies are growing at a seemingly impossible pace.

It enables data engineers to quickly and easily address lightweight integration use cases including transforming and delivering data into cloud data warehouses, ingesting and processing streaming data into a cloud data lake, and bulk loading data into Snowflake and Amazon Redshift. Because of the modern architecture of Pipeline Designer, users can work with both batch and streaming data without needing to worry about completely rebuilding their pipelines to accommodate growing data volumes or changing data formats, ultimately enabling them to transform and deliver data faster than before.

<<Try Pipeline Designer Now>>

So what makes Pipeline Designer so unique? Here are a few highlights we want to share with you:

Live Preview

The live preview capabilities in Pipeline Designer allow you to do continuous data integration design. You no longer need to design, compile, deploy, and run the pipeline to see what the data looks like.

Instead, you can see your data changes in real time, at every step of your design process, in the exact same design canvas. Click on any processor in your pipeline and see the data before and after your transformation to make sure the output data is exactly what you’re looking for. This will dramatically reduce development time and speed up your digital transformation projects.

As a quick example, let’s take a look at the input and output of the Python transformation below:

Schemaless Design

Schema-on-read is a strategy for modern data integration scenarios such as streaming data into big data platforms, messaging systems, and NoSQL stores. It saves time by not requiring incoming data, which is often less structured, to be mapped to a fixed schema.

Pipeline Designer provides schema-on-read support, removing the need to define schemas before building pipelines and keeping pipelines resilient when the schema changes. There is no strong definition of schema when defining a connection or dataset in Pipeline Designer. The structure of the data is inferred at the moment the pipeline is run, i.e. it will gather data and guess its structure. If there is a change in the source schema, then at the next run the pipeline will adapt to take the changes into account. This means you can start to work with your data immediately and add data sources “on the fly” because the schemas are dynamically discovered. In summary, it brings more resilience and flexibility compared to a “rigid” metadata definition.

Integrate Any Data with Unparalleled Portability

Talend has long been a leader in “future proofing” your development work. You model your pipeline and can then select the platform to run it on (on-premises, cloud or big data). And when your requirements change, you just select a different platform. An example is when we turned our code generator from MapReduce to Spark, so you could switch your job to run optimized, native Spark in a few clicks. But now, it’s even better. By building on top of the open source project Apache Beam, we are able to decouple design and runtime, allowing you to build pipelines without having to think about the processing engine you will run them on.

Even more, you are able to design both streaming and batch pipelines in the same palette.

So you could plug the same pipeline into a bounded source, like a SQL query, or an unbounded source, for example a message queue, and it will work as a batch pipeline or a stream pipeline simply based on the source of data. At runtime, you can choose to run natively in the cloud platform where your data resides, and you can even choose to run on EMR for ultimate scalability. Pipeline Designer truly achieves “design once and run anywhere” and allows you to run on multiple clouds in a scalable way.

Embedded Python Component

With Python being both the fastest-growing programming language and one commonly used by data engineers, we wanted Pipeline Designer to allow users to take advantage of their own Python skills and extend the tool to address any custom transformations they need. So, Pipeline Designer embeds a Python component for scripting customizable transformations.

Looking to put more data to work?

What’s even better with Pipeline Designer is that it’s not a standalone app or a single point solution. It is part of the Talend Data Fabric platform, which solves some of the most complex aspects of the data value chain from end to end. With Data Fabric, users can collect data across systems, govern it to ensure proper use, transform it into new formats, improve its quality, and share it with internal and external stakeholders.

Pipeline Designer is managed by the same application as the rest of Talend Cloud: the Talend Management Console. This continuity ensures that IT is able to have a full view of the Talend platform, providing the oversight and governance that can only come from a unified platform like Talend Cloud. And of course, IT gets all the other benefits of Talend Data Fabric including being in control of data usage, so it’s easy to audit and to ensure privacy, security and data quality.

Users new to Talend can start with Pipeline Designer knowing that there is a suite of purpose-built applications that are designed to work with each other in order to support a culture of comprehensive data management that spans throughout the business. As your needs grow, Talend will be able to support you through your data journey.

We are excited to bring you a free, zero-download trial of the product where you can see how Pipeline Designer makes lightweight integration easier. You can find more details of the product features on the product page here, or try it free for 14 days!

The post Introducing Pipeline Designer: Reinventing Data Integration appeared first on Talend Real-Time Open Source Data Integration Software.

Building a CI/CD pipeline with Talend and Azure DevOps


DevOps is all the rage right now, and it is only the beginning.

In this blog, I’ll cover how to get started with Talend continuous integration, delivery and deployment (CI/CD) on Azure. The first part of the blog will briefly present some basic DevOps and CI/CD concepts. I will then show you the Talend CI/CD architecture and how it fits into the Azure ecosystem with a hands-on example.

 

What is DevOps?

Before digging into the technical depth, let me briefly explain what the term DevOps refers to. Basically, it’s a methodology that companies embrace to bring together software development (developing the applications) and operations (deploying to production), hence the term Dev + Ops. Typically, a DevOps team functions as the bridge between development and operations. It combines the needs of both teams in order to be more efficient. Through this philosophy, the entire lifecycle is controlled from design to production thanks to continuous integration, delivery and deployment. Adopting a DevOps strategy implies a wide range of benefits:

•          Better agility -> solutions fit expectations more closely

•          Faster deployments -> better time to market, earlier feedback

•          More automation -> fewer human errors

•          Better testing and security -> more reliability

Figure 1: Software Development Lifecycle

 

How Does Talend CI/CD work?

That being said, let’s look at how Talend evolves in the DevOps world.

Figure 2: Talend Continuous Integration, Continuous Delivery and Continuous Deployment

Everything starts with designing jobs and unit tests in Talend Studio. These jobs are source-controlled in a versioning system such as Git. Source control is very important in CI/CD as it guarantees effective collaboration and reproducibility in our continuous integration process.

The Maven builds follow the design phase with the help of the Talend CommandLine and CI Builder. Depending on the build type, the build creates either a Talend artifact (basically jars and execution scripts) or a Docker image. The latter is the Talend artifact packaged in a container image using Docker. The aggregation and automation of these two steps is called continuous integration.

We move into continuous delivery when a publication step is added. Publishing means that we have a place to store and distribute our artifacts or Docker images. For Talend artifact builds, you can publish to artifact repositories (such as Nexus or Artifactory) or publish to Talend Cloud. For Docker images, we use what are called Docker registries. Either way, the goal remains the same: we use these locations to distribute and deploy our jobs.

The continuous deployment part depends on where the jobs are published. If an artifact repository is used, you are most likely on-premise and must use a JobServer to execute your jobs. In the case of Talend Cloud you can use your own Remote Engine or take advantage of Talend Cloud Engines. For containers, you can run them anywhere Docker is available: a standalone machine with a Docker daemon, cloud provider container services (AWS Fargate, AWS ECS, Azure ACI, Kubernetes, OpenShift, etc.).
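To make the container deployment option concrete, here is a minimal sketch of running a containerized Talend job on any host with a Docker daemon. The registry, image name, tag, and environment variable are placeholders invented for illustration, not actual Talend build outputs, and it assumes you are already authenticated against the registry.

# Hypothetical registry/image/tag; replace them with the image produced by your own build.
docker pull myregistry.azurecr.io/talend/customer_load_job:0.1.0
# Run the job once and remove the container when it finishes. How context parameters are
# passed depends on how the job image was built; the variable below is only an assumed convention.
docker run --rm -e JOB_CONTEXT=PROD myregistry.azurecr.io/talend/customer_load_job:0.1.0

The same image can then be handed to AWS Fargate, Azure ACI, or a Kubernetes cluster without changing the job itself, which is the portability benefit described above.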

 

Talend CI/CD pipeline on Azure DevOps

Now that we have the basics, let’s get our hands dirty and build our first pipeline on Azure DevOps. In this blog post, we are only going to outline the continuous delivery of containerized jobs in Azure. The deployment part would necessitate another full article as there are several possibilities to tackle it.

Requirements

You need to already have a good knowledge of the usual Talend project management through Talend Management Console and Talend Studio. Please refer to Talend help documentation. You can also read this blog post where the project configuration is detailed.

·       Azure DevOps Services

·       Talend Platform license

·       Talend 7.1.1

·       Azure Repos for versioning your projects (or GitHub, the following applies to both) already set up in Talend Management Console

·       Talend Cloud to manage your projects

Azure DevOps

Azure DevOps is a set of tools to manage CI/CD pipelines. According to the Azure DevOps product page, it comprises:

·       Azure Boards: Deliver value to your users faster using proven agile tools to plan, track, and discuss work across your teams.

·       Azure Pipelines: Build, test, and deploy with CI/CD that works with any language, platform, and cloud. Connect to GitHub or any other Git provider and deploy continuously.

·       Azure Repos: Get unlimited, cloud-hosted private Git repos and collaborate to build better code with pull requests and advanced file management.

·       Azure Test Plans: Test and ship with confidence using manual and exploratory testing tools.

·       Azure Artifacts: Create, host, and share packages with your team, and add artifacts to your CI/CD pipelines with a single click.

As you can see, the scope is very large, from project management to git or artifact hosting. However, Azure Pipelines is the one managing the whole CI/CD pipeline.

 

Talend CI/CD for containers with Azure DevOps

Figure 3: Talend CI/CD with Azure DevOps

In this example, we will focus on building container images in the Azure ecosystem. The Talend CI/CD flow for containers in Azure DevOps is represented in the diagram above. We are going to use the in-house Git service called Azure Repos. You can use GitHub as well; they are both well integrated with Azure Pipelines. Speaking of Azure Pipelines, it is the equivalent of the well-known Jenkins if you are more familiar with that tool. The goal here is to continuously build our jobs into Docker container images and push them to Azure Container Registry. To learn more about container registry authentication, please refer to my previous blog post.
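As a quick reference, a minimal way to let a build machine authenticate against Azure Container Registry is sketched below; the registry name and service principal credentials are placeholders, and the Azure CLI route assumes “az login” has already been run.

# Option 1: use the Azure CLI to wire the local Docker daemon to the registry.
az acr login --name myregistry
# Option 2: plain docker login with a service principal (credentials are placeholders).
docker login myregistry.azurecr.io -u <service-principal-app-id> -p <service-principal-password>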

Azure Pipelines allows you to manage a CI/CD pipeline, but it needs build agents to actually perform the builds. There are many Microsoft-hosted agents available on demand, such as a Maven agent, but Talend builds need external components like the Talend CommandLine or a Docker daemon. That is why we need a custom build agent to run our builds successfully. These are called self-hosted agents.

Let’s start:

1.     Prepare permissions

Please follow this Azure documentation to prepare the permissions to set up a self-hosted agent. It shows you how to create a PAT token that will allow you to connect your self-hosted agent to your Azure DevOps account.

2.     Create the virtual machine (self-hosted agent)

Figure 4: Azure Virtual Machine settings

In your Azure portal, start by creating a virtual machine with CentOS 7.5. Once it has launched successfully, make sure you perform the following instructions (a consolidated setup sketch follows the list):

1.     Install OpenJDK8 and Maven 3

2.     Install Docker as a non-root user

3.     Install Git

4.     Pull OpenJDK docker image: docker pull openjdk:8-jre-slim

5.     Install Talend CommandLine

  1. Download Talend Studio 7.1.1 and unzip it in your machine (it will be your CommandLine folder)
  2.  Copy your license at the root of your CommandLine folder
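Here is a condensed sketch of that preparation on CentOS 7.5, run as a sudo-capable user. The Maven version, download URLs and the Talend Studio archive name are placeholders or assumptions; adapt them to your environment and to the official Talend installation instructions.

# 1. Java 8 and Git from the standard repositories
sudo yum install -y java-1.8.0-openjdk-devel git
# 2. Maven 3 (manual install; the version below is only an example)
curl -sLO https://archive.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
sudo tar xzf apache-maven-3.6.3-bin.tar.gz -C /opt
sudo ln -sf /opt/apache-maven-3.6.3/bin/mvn /usr/local/bin/mvn
# 3. Docker, enabled at boot and usable by the current non-root user
curl -fsSL https://get.docker.com | sudo sh
sudo systemctl enable --now docker
sudo usermod -aG docker "$USER"      # log out and back in for the group change to take effect
# 4. Base image used by the Talend Docker builds
sudo docker pull openjdk:8-jre-slim
# 5. Talend CommandLine: unzip the Talend Studio 7.1.1 archive (placeholder file name) and add your license
unzip Talend-Studio-7.1.1.zip -d ~/commandline
cp license ~/commandline/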

3.     Configure the Nexus for third-party libraries

If you plan to build jobs using third-party libraries, you will have to set up an artifact repository such as Nexus. All the jar files needed to complete the build of these jobs will be uploaded to it.

  • Install Nexus 3
    • Install a Nexus instance wherever makes sense for you, either on the same machine or a different one.
    • Create at least the following repositories
      • talend-custom-libs-release (maven2, hosted, release version policy and permissive)
      • talend-custom-libs-snapshot (maven2, hosted, release version policy and permissive)
      • release (maven2, hosted, release version policy and permissive)
  • Configure it in Talend Management Console:

Figure 5: Talend Management Console Nexus setup

Once the Nexus is fully configured restart your Studio and you should see that at startup the third-party libraries used in your project are being uploaded to the Nexus server:

Figure 6: Talend Studio with Third-Party libraries being uploaded

You can check these libraries are available in your Nexus Web Interface:

Figure 7: Nexus with Third-Party libraries

4.     Install the Azure agent on the virtual machine

Let’s come back to your self-hosted agent. Now that all the requirements on this virtual machine are set up, we can bind the machine to what we call an agent pool. An agent pool is one or more agents that will execute your pipelines in Azure Pipelines. To configure your self-hosted agent, go to your Azure DevOps Organization Settings, then follow the Azure documentation.

Figure 8: Azure DevOps Self-Hosted agent installation

Configuration on the virtual machine should look like this:

Figure 9: Azure DevOps Self-Hosted agent configuration

If you are missing dependencies needed to configure the Azure agent, Microsoft provides a script to install them:

Run: “sudo ./bin/installdependencies.sh” in the agent folder

Once you have configured your agent, you can run it with the run.sh script (and, of course, you can register it as a service so that it launches at startup; see the sketch below).
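For reference, a minimal sketch of configuring and running the agent from the unpacked agent directory; the organization URL, PAT token, and pool and agent names are placeholders.

# Organization URL, token, pool and agent names are placeholders.
./config.sh --url https://dev.azure.com/your-organization \
            --auth pat --token "$AZP_TOKEN" \
            --pool TalendSelfHosted --agent talend-build-agent --acceptTeeEula
# Interactive run, useful for a first test
./run.sh
# Or install and start it as a systemd service so it survives reboots
sudo ./svc.sh install
sudo ./svc.sh start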

Once launched, you can see it online in your Agent Pools in Azure DevOps:

Figure 10: Azure DevOps Self-Hosted agent online

Finally, connect to your self-hosted agent again via SSH and pre-configure your CommandLine:

  • Run the Talend CommandLine once “./commandline-linux.sh” to initialize the configuration and exit it.
  • Then modify the file “commandline-linux.sh” and replace the command with this:

./Talend-Studio-linux-gtk-x86_64 -nosplash -application org.talend.commandline.CommandLine -consoleLog -data workspace startServer -p 8002

  • Edit the YOUR_PATH/configuration/maven_user_settings.xml with this file (please change the Talend CommandLine path to match your own installation path).

 

5.     Modify Project POM in Studio

  1. You need to modify the project POM in your Studio: Settings -> Maven -> Build -> Project
  2. In the Docker profile, look for the <autoPull> tag, change its value from "once" to "false", and push your changes to Git (a command-line sketch of this edit follows below).
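If you prefer editing the generated project POM directly rather than going through the Studio preferences, the one-liner below shows the idea; the POM path and the exact element layout vary by project and Talend version, so treat it purely as a sketch and verify the file first.

# The path below is an assumption; locate the project POM in your own Git working copy first.
sed -i 's|<autoPull>once</autoPull>|<autoPull>false</autoPull>|' poms/pom.xml
git add poms/pom.xml && git commit -m "Disable autoPull in the Docker profile" && git push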

6.     Create a Pipeline in Azure DevOps

Figure 11: Azure DevOps pipeline

It’s finally time for us to create a pipeline in Azure DevOps. Select your Azure Repos or GitHub source and the self-hosted agent pool you created previously. Then copy-paste this file, which is the YAML file describing the pipeline, and change the variables and the Nexus URL accordingly.

The docker_password variable is not defined in the pipeline file. For security reasons, set it as a secret variable (see the sketch below).
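You can create the secret in the pipeline’s Variables panel in the web UI or, assuming the Azure DevOps CLI extension is installed, from the command line. The organization, project and pipeline names below are placeholders.

# Requires the Azure DevOps CLI extension: az extension add --name azure-devops
az pipelines variable create \
  --name docker_password \
  --value "$DOCKER_PASSWORD" \
  --secret true \
  --pipeline-name talend-docker-ci \
  --organization https://dev.azure.com/your-organization \
  --project your-project
# Supplying the value from an environment variable keeps it out of your shell history.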

As you can see, only one Maven command allows you to build and push your Docker image. Everything is taken care of by Maven and the Talend CommandLine.

7.     Run the Pipeline in Azure DevOps

You can now run your pipeline! You should see that your agent pool is selected and be able to access the logs of your builds.

 

Conclusion
 

To conclude this article, let’s sum it all up. After a brief description of the DevOps and CI/CD concepts and how Talend fits into them, we created a simple CI/CD pipeline with Azure DevOps. This pipeline allows us to continuously build our jobs as Docker images and push them to a Docker registry. Taking advantage of CI/CD and containers can help you overcome many challenges in your organization. It will improve your agility, reproducibility and flexibility by letting you run your jobs anywhere.

 

The post Building a CI/CD pipeline with Talend and Azure DevOps appeared first on Talend Real-Time Open Source Data Integration Software.

PointsBet is Tuning Up for a US Market Breakthrough


Founded in 2016 and launched in February 2017, PointsBet is a cutting-edge online bookmaker in Australia offering both traditional fixed odds markets and “PointsBetting”, where winnings or losses aren’t fixed but depend on how accurate the bet turns out to be.

Online betting platforms have long recognized the need for a strong, resilient IT infrastructure. Even a minor glitch during a major sporting event can be disastrous, with losses running into millions of dollars. Good data systems are also needed to comply with the myriad regulations that online bookmakers are subject to, such as providing reports and information to authorities and racing organizations to meet license conditions.

Eyes on a more reliable Cloud data management solution

PointsBet standardized on Talend Cloud and Microsoft Azure due to their scalability to handle peak online gaming requests and their agility to quickly spin up new projects. Talend was a natural fit, with out-of-the-box native support for Azure Blob Storage, Cosmos DB, SQL Data Warehouse, Azure SQL Server and native SQL Server, plus the flexibility to run workloads in the cloud or on-premises.

Talend Cloud was live in days and for PointsBet, that meant connecting everything to everything, extracting data from every part and system in the organization – transaction, betting, customer and statistical systems – and providing a unified view of all the requested data. Using Talend, PointsBet managed to accelerate development time from eight hours to one.

“Talend Cloud’s quick and successful introduction meant that we were able to comply with regulations and keep our promise and launch into the United States as planned.” – Maayan Dermer, Data Analytics & Business Intelligence Lead / Solution Architect

Entering the US market

As PointsBet starts launching its online sports betting products throughout the US, Talend Cloud plays a vital role in supporting PointsBet’s ability to quickly expand while ensuring the company maintains compliance with varying state regulations. This is important for compliance but also to gain license permission to operate in new countries.

<<Read the Full Case Study>>

Using Talend, PointsBet managed to off-load many software components from their backend engineers. For example, one ETL process that was initially estimated at 2 weeks of work was done in 4-6 hours using Talend. In the future, Talend is set to be expanded to underpin the data needs of the entire Australian business. Once PointsBet expands into more US states, the company is likely to stand up another data warehouse “to consolidate all the data from all of our instances around the world. We’re looking to use Talend for that as well,” Mr. Dermer added.

The post PointsBet is Tuning Up for a US Market Breakthrough appeared first on Talend Real-Time Open Source Data Integration Software.

Can You Trust Your Analytics Dashboard? 3 Steps To Build a Foundation of Trusted Data


 

We are in the era of the information economy. Now, more than ever, companies have the capabilities to optimize their processes through the use of data and analytics. While the possibilities of data analysis are endless, there are still challenges in maintaining, integrating, and cleaning data to ensure that it empowers people to make decisions.

 

Bottom up or top down? Which is best?

As IT teams begin to tackle the data deluge, a question often asked is whether this problem should be approached from the bottom up or the top down. There is no “one-size-fits-all” answer here, but all data teams need a high-level view that gives them a quick overview of their data subject areas. Think of this high-level view as a map you create to define priorities and identify problem areas. This map will allow you to set up a phased approach to optimize the data assets that contribute the most value.

The high-level view, unfortunately, is not enough to turn your data into valuable assets. You also need to know the details of your data.

 

 

Getting the details of your data is where a data profile comes into play. This profile tells you what your data is from the technical perspective, while the high-level view (the enterprise information model) gives you the view from the business perspective. Real business value comes from the combination of both: a transversal, holistic view of your data assets that allows you to zoom in or out. The high-level view enriched with technical details (even without profiling) allows you to start with the most important phase of the digital transformation: discovery of your data assets.

 

Not Only Data Integration, But Data Integrity

With all the data travelling around in different types and sizes, integrating the data streams across various partners, apps and sources has become critical, but it’s more complex than ever.

Due to the size and variety of data being generated, not to mention ever-faster go-to-market cycles, companies are looking for technology partners that can help them achieve this integration and integrity, either on-premises or in the cloud.

Talend is one of the companies determined to be this partner. Starting as an open source ETL tool, Talend has evolved into an enterprise-grade cloud data integration and data integrity platform. This vision becomes clear in the unified suite of applications it offers and its focus on getting the foundation of your data initiatives right.

Talend strategically moves data management to the cloud to provide scalability, security and agility. The recent acquisition of the Stitch Data platform and full support for Snowflake, the “made for the cloud” data warehouse platform, make the offering even more complete.

 

Your 3 Step Plan to Trusted Data

 

Step 1: Discover and cleanse your data

A recent IDC study found that only 19% of data professionals’ time is spent analyzing information and delivering valuable business outcomes. They spend 37% of their time preparing data and 24% protecting it. The challenge is to overcome these obstacles by bringing clarity, transparency, and accessibility to your data assets.

This discovery platform, which at the same time allows you to profile your data, understand its quality, and build a confidence score that establishes trust with the business users of the data assets, takes the form of an auto-profiling data catalog.

Thanks to the application of artificial intelligence and machine learning in data catalogs, data profiling can be provided as a self-service capability for power users.

Bringing transparency, understanding and trust to the business brings out the value of the data assets.

 

Step 2: Organize Data You Can Trust and Empower People

According to the Gartner Magic Quadrant for Business Intelligence and Analytics Platforms, 2017: “By 2020, organizations that offer users access to a curated catalog of internal and external data will realize twice the business value from analytics investments than those that do not.”

An important phase in a successful data governance framework is establishing a single point of trust. From the technical perspective, this translates to collecting all the data sets together in a single point of control. The governance aspect is the capability to assign roles and responsibilities directly in that central point of control, which allows you to instantly operationalize your governance from the place the data originates.

The organization of your data assets goes hand in hand with the business understanding of the data, transparency and provenance. The end-to-end view of your data lineage ensures compliance and risk mitigation.

With the central compass in place and the roles and responsibilities assigned, it’s time to empower people for data curation and remediation, where ongoing communication is of vital importance for the adoption of a data-driven strategy.

 

Step 3: Automate Your Data Pipelines & Enable Data Access

Different layers and technologies don’t make it easier to keep our data flows and streams aligned and to adapt to swift changes in business needs.

The needed transitions, data quality profiling and reporting can be extensively automated.

Start small and scale big. Part of this intelligence can these days be achieved by applying machine learning and artificial intelligence. These algorithms take the cumbersome work out of the hands of analysts and can also be scaled more easily. This automation gives analysts a faster understanding of the data and lets them build better and more numerous insights in a given time.

Putting data at the center of everything, implementing automation and provisioning it through one single platform is one of the key success factors in your digital transformation and in becoming a real data-driven organization.

 

This article was made in collaboration with Talend, and represents my view on data management, and how these align with Talend’s vision and platform.

 

About the author Yves Mulkers:

Yves is an industry thought leader, analyst and practicing BI and analytics consultant, with a focus on data management and architecture. He runs a digital publication platform 7wData, where he shares stories on what you can do with data and how you should do it. 

7wData works together with major brands worldwide, on their B2B marketing strategy, online visibility and go to market strategy.

Yves is also an explorer of new technologies and keeps his finger on what’s happening with Big Data, Blockchain, cloud solutions, Artificial Intelligence, IoT, Augmented Reality / Virtual Reality, the future of work and smart cities from an architecture point of view, helping businesses build value from their data.

The post Can You Trust Your Analytics Dashboard? 3 Steps To Build a Foundation of Trusted Data appeared first on Talend Real-Time Open Source Data Integration Software.


Time-Tested Insights on Creating Competitive API Programs


This post is authored by Datagrate. Datagrate helps its clients launch, manage and scale Talend’s best application integration solutions, solving the most complex communication challenges.

When the Application Programming Interface (API) first came into existence, developers viewed it as a revolutionary approach to creating reusable software fragments. Instead of creating new code from scratch for every new program, they could now use existing functionality to develop new features. Not only did this decrease the amount of time needed to deploy a program, but it also meant they could leverage existing code that was already tried and tested.

Though the original concepts applied in software engineering do not have much in common with modern APIs, the fundamental idea has not changed much. Basically, developers use an existing code base to develop new programs. Advancements in technology have created countless practical use cases and opportunities for seasoned developers to set up competitive API frameworks.

Figure 1 Public API Growth – Source: programmableweb.com

The Role of the Internet in Shaping Current Trends

One major step forward came with the rise of the World Wide Web in the 1990s. In addition to reusing APIs, software engineers were able to execute software remotely from any part of the world and get feedback in real time.

<<Download the Field Guide to Web APIs>>

As the web has continued to mature, so has the use of these resourceful tools. Since the start of the 21st century, more and more people have gained access to the internet. Moreover, the web has created a whole new market for both new and existing businesses. By virtue of these changes, people, devices and business systems today need a standard for seamless communication. To address these and other needs, developers have created various API frameworks based on varying application scenarios.

Application Strategies for APIs

1.     Creating an API to Solve Business Challenges

APIs are the connecting points that facilitate interaction between people, their devices and the digital systems they use. They are comparable to a waiter in a restaurant who connects the cook (software) to the diner (user), getting the order and communicating it so as to get food and deliver it.

Within an organization, one might want to get data from a given database and display it on a different application. An API endpoint will communicate with the database to draw out the required statistics and communicate it to the user. To illustrate this, consider how travel sites work. They gather information from the traveler about budgetary and cabin preferences as well as travel dates among other details. They connect to airline APIs which dip into airline databases and offer options for the trip.
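As a purely illustrative sketch of that interaction, the request below queries a hypothetical airline API; the endpoint, parameters and authentication scheme are invented for illustration and do not correspond to a real provider.

# Hypothetical endpoint and parameters, shown only to illustrate the request/response pattern.
curl -s "https://api.example-airline.com/v1/flights?origin=LHR&destination=JFK&date=2019-09-15&cabin=economy" \
  -H "Authorization: Bearer $API_TOKEN"
# A typical JSON response would list matching flights and fares, which the travel site merges
# with results from other airline APIs before presenting options to the traveler.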

2.     Creating an API as a Business Case

On the other hand, an API can in itself become a full business for your company. For instance, Twilio has built a communication API that lets businesses reach their users over WhatsApp, SMS or other channels. They sell their API to interested parties as a business case for enabling communication.

Opportunities for API Implementation

With the above points in mind, take a look at the following enterprise strategies for APIs:

·      As an Information Platform – Companies with different data buckets, databases or multiple cloud-based applications often need solid APIs. These enable them to collect data from different sources and display them from a consolidated endpoint. As such, the user will get a more holistic perspective of their business.

·      As a Product – This option requires the most experience as it involves creating a unique product to fill a market gap. In this case, the design and purpose of the API is the core determining factor of the success or failure of the business. Its key benefit is the competitive advantage that comes with bringing a new product to the market.

·      As a Service – Modern enterprises usually have IT departments that include multiple systems and applications, each of which handles a specific role. Using an API to bundle the applications, each having a subset of features, is a great way to create a single all-inclusive service. Such services are reusable internally for the implementation of additional business needs.

API Integration in Practical Scenarios

In order to get inspiration for possible business applications of APIs, take a look at some of the ways the technology can be used in various business processes and sectors:

a)    eCommerce

eCommerce was among the earliest beneficiaries of API integration, and for good reason. The sector involves a high volume of transactions that require speed. Third-party APIs are therefore a necessity and come in all conceivable forms. Credit card processing is one of the most common of these, but there is potential for numerous other innovations.

For instance, business entities can use exposed APIs to:

·      offer information about stock movement to partners

·      help in automatic transaction processing

·      allow access to customer portals

b)    Obtaining Customer Insights

For most business models, there are different departments, each of which has its own data pool. Notably, all such information is a valuable resource for understanding the customer. While CRM and ERP systems usually have more than a single database, this might be limited to financial or transactional aspects.

Using an API offers the opportunity to pull and combine all business data, presenting a comprehensive, real-time picture of the client. This combination of data and easy-to-digest presentation can give an entrepreneur a peek into different client personas. Armed with this information, they can tailor the approach to optimize customer experience.

c)     IoT

Advances in technology have made it possible to computerize almost everything around us. Applying the API concept in this area facilitates real-time interaction between such devices and the relevant parties. IoT devices connect “things” such as automobiles, thermostats and medical devices within ecosystems. APIs come in handy by exposing these items as interfaces, allowing access to them through apps. By virtue of this, users get control and can access the devices at any time and request live data.

d)    Data Governance

More and more organizations are looking for ways to share data across their enterprises. In cases where such organizations have a multi-shop model, the data being shared may vary from one environment to the next. For example, a business which offers both B2B and B2C solutions might have to offer the different customers different pricing information and other data.

An API can help in such situations if it is implemented as a layer on top of existing data management frameworks. It will act as a control center ensuring that the correct data flows to the correct customer.

e)     Real-Time Supplier Communication

For cloud application users, APIs are more of a necessity than an option. Some providers require the installation of a third-party endpoint to allow access to data. These applications have the objective of offering live updates. The use of a reliable endpoint in this case is key to ensuring real-time communication with suppliers. It could also mean getting updates faster than competitors and thus give you an edge.

Figure 2 API Usage by Sector – Source: programmableweb.com

From Theory to Practice

Considering the above opportunities and practical use cases for APIs, creating a valuable application that solves a real market need seems achievable. One of the most important factors that determine the success (or lack thereof) has to do with identifying a business case for your potential API. A viable concept does not necessarily have to make headlines but needs to fulfill a specific need in an organization. The need should be the driver of the project.

Business changes such as acquisitions and mergers are a good example of situations that create a need. Changes of this sort create significant needs for new systems, data streams or applications.

To start with, a developer can create lightweight endpoints offering access to data or other functionality. With time, they can promote this to business partners, clients or other organizational departments.

In order to sell an API as a product or service, it is important to meet client expectations and deliver value for the investment. Some of the basic features that will make this possible are creating one that is easy to use, well-documented, reliable as an endpoint and future-proof.

Keep in mind, though, that many business operators consider reusability a major strength of a good API. As such, creating a flexible, generic API opens up a wide door of opportunity as it offers increased output for the end user. While still on the matter of flexibility, remember that modern APIs have remote capability and can thus be used from any part of the world and on any device.

Possible Challenges in API Development

Whether you are setting up your very first custom API within your organization or creating an API ecosystem, there are 2 major challenges likely to stand in your way:

·      Business acceptance – The target users need to understand the benefits they are likely to gain from the API. Since APIs are intangible, this can be a serious challenge. They usually operate silently in the background as part of middleware. A good selling strategy to make the benefits obvious involves highlighting the flexibility, reusability and generic design that your model offers. To enhance user acceptance, ensure that it offers the desired set of features and is not too complex or error-prone.

·      Technical maturity – Business operators will often only notice the benefits of the model when it becomes unavailable. Much as this might seem like a desirable effect, operational reliability takes higher priority. Solid architecture and a reasonable level of experience will reduce the chances and frequency of system downtime. Ensure that it meets high-quality standards before taking it to the market.

Staying Ahead of the Competition

In recent times, there has been an increase in service providers looking to address business needs using various API designs. Companies no longer have to endure a one-size-fits-all approach, but rather, they use a best-of-breed approach. This means seeking out the best available software to meet their individual business challenges. For developers, software vendors and service partners, it underscores the need to ensure that applications offer flexible, comprehensive solutions which facilitate optimal integration.

<< Watch Next: Taming Complex and Hierarchical Data Structures>>

About the Author – Andre Sluczka

Andre founded Datagrate, a US-based Talend partner focusing on Talend ESB and Cloud API solutions, in 2012. He has helped customers create complex enterprise application integrations while leveraging leading-edge technologies. Over the last 15 years Andre has worked globally, including in Europe, the United States and Singapore, building sustainable platforms for clients like US Cellular, GE, SIEMENS and Parkway Hospitals, among others.

The post Time-Tested Insights on Creating Competitive API Programs appeared first on Talend Real-Time Open Source Data Integration Software.

3 Key Takeaways from the 2019 Gartner Market Guide for Data Preparation


Gartner has recently released its 2019 Market Guide for Data Preparation ([1]), the fourth edition of a guide that was first published in the early days of the market, back in 2015, when Data Preparation was mostly intended to support self-service use cases. Compared to Magic Quadrants, the Market Guide series generally covers early, mature or smaller markets, with less detailed information about competitive positioning between vendors, but more information about the market itself and how it evolves over time.

While everyone’s priority with these kinds of documents might be to check the vendor profiles (where you’ll find Talend Data Preparation listed with a detailed profile), I would recommend focussing on the thought leadership and market analysis that the report provides. Customers should consider the commentary delivered by the authors, Ehtisham Zaidi and Sharat Menon, on how to successfully expand the reach and value of Data Preparation within their organization.

After going through the report myself, I thought I’d share three takeaways addressing our customers’ requirements in this exciting market.

Data Preparation turns data management into a team sport

Self-Service was the trend that started the data preparation market. This happened at a time when business users had no efficient way to discover new data sources before they could get insights, even after they were empowered with modern data discovery tools such as Tableau or Power BI. They had to depend on IT… or alternatively create data silos by using tools like Microsoft Excel in an ungoverned way.

Data Preparation tools addressed these productivity challenges, an area where reports have shown that data professionals and business analysts spend 80% of their time searching, preparing and protecting data before they can actually turn it into insights. Data Preparation came to the rescue by empowering a larger audience with data integration and data quality management.

This was the challenge in the early days of the 21st century, but since that time data has turned into a bigger game. It is not only about personal productivity but also about creating a corporate culture for data-driven insights. Gartner’s Market Guide does a great job at highlighting that trend: as disciplines and tools are maturing, the main challenge is now to turn data preparation into a team sport where everybody in the business and IT can collaborate to reap the benefits of data.

As a result, what’s critical is operationalization: capturing what line-of-business users, business analysts, data scientists or data engineers are doing ad hoc and turning it into an enterprise-ready asset that can run repeatedly in production in a governed way. Ultimately, this approach can benefit enterprise-wide initiatives such as data integration, analytics and business intelligence, data science, data warehousing and data quality management.

Smarter people with smarter tools… and vice-versa

Gartner’s market report also highlights how tools are embedding the most modern technologies, such as data cataloging, pattern recognition, schema on read or machine learning. This empowers the less skilled users to do complex activities with their data, while automating tasks such as transformation, integration, reconciling or remediation as soon as they become repetitive.

What’s even more interesting is that Gartner relates this technology innovation to a market convergence, as stated in this prediction: “By 2024, machine-learning-augmented data preparation, data catalogs, data unification and data quality tools will converge into a consolidated modern enterprise information management platform”.

In fact, a misconception might have been to consider Data Preparation as a separate discipline geared towards a targeted audience of business users. Rather, it should be envisioned as a game-changing technology for information management due to its ability to enable potentially anyone to participate. Armed with innovative technologies, enterprises can organize their data value chain in a new collaborative way, a discipline that we refer to at Talend as collaborative data management, and that some analysts, including Gartner in the Market Guide, also refer to as DataOps.

Take data quality management as an example. Many companies are struggling to address their data quality issues because their approach relies too heavily on a small number of data quality experts from a central organization such as central IT or the office of the CDO. Although those experts can play a key role in orchestrating data quality profiling and remediation, they are not the ones in the organization who know the data best. They need to delegate some of the data cleansing effort to colleagues who work closer to where the data is sourced. Empowering those people with simple data preparation tools makes data quality management much more efficient.

The value of the hybrid cloud

Gartner also heard growing customer demand for Data Preparation delivered through innovative Platform-as-a-Service deployment models. What they highlight are requirements for much more sophisticated deployment models that go beyond basic SaaS. The report notes that “organizations need the flexibility to perform data preparations where it makes the best sense, without necessarily having to move data first”. They need a hybrid model to meet their constraints, both technical (such as pushing down the data preparation so that it runs where the data resides) and business (such as limiting cross-border data transfers for data privacy compliance).

This is a brilliant highlight, one that we are seeing very concretely at Talend: we are hearing sophisticated requirements for hybrid deployments of our Data Preparation tool. Some of our cloud customers require running their preparations on-premises. Others want a cloud deployment, but with the ability to remotely access data inside the company’s firewall through our Remote Engines. Others want to be able to operationalize their data preparations so they can run natively inside big data clusters.

Are you ready for Data Preparation? Why don’t you give it a try?

Enabling a wider audience to collaborate on data has been a major focus for Talend over the last three years. We introduced Talend Data Preparation in 2016 to address the needs of business analysts and line-of-business workers. One year later, we released Talend Data Stewardship, the brother in arms of Data Preparation for data certification and remediation. Both applications were delivered as part of Talend Cloud in 2017. In fall 2018, we brought out a new application, Talend Data Catalog, to foster collaborative data governance, data curation, and search-based access to meaningful data.

And now we are launching Pipeline Designer. As we see more and more roles from central organizations or lines of business that want to collaborate on data, we want to empower those new data heroes with a whole set of applications on top of a unified platform. Those applications are designed for the needs of each of those roles in a governed way, from analysts to engineers, from developers to business users, and from architects to stewards.

2019 is an exciting year for Data Preparation and Data Stewardship. We added important smart features in the Spring release: for example, splitting a name into its respective sub-parts with machine learning, or splitting a field into sub-parts based on semantic type definitions, i.e. the ability to break a field composed of several parts into its individual components. We improved the data masking capabilities, a highly demanded set of functions now that GDPR, CCPA, and other regulations are raising the bar for privacy management. Stay tuned for other innovations coming this year that leverage machine learning, deliver more options for hybrid deployment and operationalization, or enable a wider range of data professionals and business users to collaborate on trusted data in a governed way.
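To make the idea of semantic-type-based splitting more concrete, here is a minimal, purely illustrative sketch in Python. The semantic types, regular expressions, and helper function below are assumptions invented for this example; they are not how Talend Data Preparation implements the feature.

import re

# Hypothetical semantic-type patterns, invented for illustration only.
SEMANTIC_PATTERNS = {
    "full_name": re.compile(r"^(?P<first>\w+)\s+(?:(?P<middle>\w\.?)\s+)?(?P<last>\w+)$"),
    "us_phone": re.compile(r"^\((?P<area>\d{3})\)\s*(?P<exchange>\d{3})-(?P<line>\d{4})$"),
}

def split_by_semantic_type(value, semantic_type):
    """Split a composite field into named sub-parts based on its semantic type."""
    pattern = SEMANTIC_PATTERNS.get(semantic_type)
    match = pattern.match(value.strip()) if pattern else None
    if match is None:
        return {"value": value}  # unknown type or non-conforming value: leave the field as-is
    return {name: part for name, part in match.groupdict().items() if part is not None}

print(split_by_semantic_type("Ada M. Lovelace", "full_name"))
# {'first': 'Ada', 'middle': 'M.', 'last': 'Lovelace'}
print(split_by_semantic_type("(555) 867-5309", "us_phone"))
# {'area': '555', 'exchange': '867', 'line': '5309'}

The point is simply that the declared semantic type, rather than a hard-coded delimiter, determines how a composite value breaks apart into sub-fields.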

Data Preparation is a hot topic. Do you want to know more? What about seeing these apps in action? Or simply giving them a try?

The post 3 Key Takeaways from the 2019 Gartner Market Guide for Data Preparation appeared first on Talend Real-Time Open Source Data Integration Software.

How PT Bank Danamon Uses Data to Understand and Respond to Customers

PT Bank Danamon is Indonesia’s sixth-largest bank, with four million customers and a network that stretches across the archipelago.

Competition for customers in the Indonesian market is especially fierce, with 115 banks to choose from, according to recent government statistics. Finding a way to stand out is crucial. PT Bank Danamon has an ambitious goal to more than double the share of customers using mobile and internet banking channels over the next year – from 30 percent to 80 percent – and believes it can achieve that mark by harnessing the power of big data.

Differentiate on customer experience in a crowded marketplace

Though the bank has a lot of data, until recently it was held in around ten siloed data marts. Analysis was often performed within each silo, and it could be difficult to understand how particular decisions had been reached.

A shift by Indonesian banks towards real-time customer engagement became a catalyst for the PT Bank Danamon project. That meant overhauling the bank’s existing static digital channels to allow two-way communication with customers, and overhauling the data infrastructure underpinning those channels so that opportunities for engagement can be recognized and a personalized ‘next-best action’ recommended for each customer.

Talend, together with channel partner Artha Solutions, won the bank over. PT Bank Danamon liked that Talend could run on-premises or in the cloud. The new big data infrastructure consists of an on-premises Hadoop cluster and Talend, which ingests data from more than 40 source systems, including the core banking and credit card systems, and sets governance standards across the bank for how data is to be organized and used.

“Using data, we want to make sure once a customer opens a relationship with us, that our products and our bank remain top of mind.” – Billie Setiawan, Head of Decision Management for Data & Analytics

It now takes half the time it previously did to produce reports, saving PT Bank Danamon additional time and money.

Real-time engagement drives change

The bank is using the big data platform for two initial use cases. In the first, the platform is being used to build a 360-degree profile of customers in a bid to better understand their behavior and, based on propensity modeling, recommend products or services they might like.
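As a rough illustration of what propensity modeling involves, here is a toy sketch in Python using scikit-learn. The features, data, and model choice are invented for the example; this is not PT Bank Danamon’s actual implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical customer features: [monthly_logins, avg_balance_in_millions, products_held]
X = np.array([
    [2,  1.5, 1],
    [25, 4.0, 3],
    [8,  2.2, 2],
    [30, 6.5, 4],
    [1,  0.8, 1],
    [18, 3.1, 2],
])
# 1 = the customer accepted a past product offer, 0 = they did not
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

# The propensity score is the predicted probability that a customer responds
# to the next offer; customers can then be ranked by that score.
new_customers = np.array([[5, 1.9, 1], [22, 5.0, 3]])
print(model.predict_proba(new_customers)[:, 1].round(2))

A score like this, fed by the 360-degree profile, is the kind of signal a ‘next-best action’ recommendation can be ranked on.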

The second use case for big data at the bank is detecting suspected fraudulent incidents faster. The fraud team can now quickly generate a list of customers “who the bank thinks are conducting fraudulent activity” for further investigation. The bank believes it can further improve fraud detection with big data, moving towards more proactive detection.

Further use cases are under development. For now, work is ongoing to bring all data users across the bank onto the new governed big data platform, which gives IT administrators control over data, reports, roles, and functionality permissions for all users. It will ultimately be a single source of truth for all data held by PT Bank Danamon. “Eventually, all our divisions will be accessing the same data from a single repository. From the data analytics team to finance, risk management, operations, and even our branch network, everyone will be using it,” Setiawan said.

The post How PT Bank Danamon Uses Data to Understand and Respond to Customers appeared first on Talend Real-Time Open Source Data Integration Software.

The Fundamentals of Data Governance – Part 1

Introduction

Among Talend’s blog posts are many outstanding ones on data governance, such as David Talaga’s “Life Might Be Like a Box of Chocolates, But Your Data Strategy Shouldn’t Be” which encourages us to know our data, and his two-part post on “6 Dos and Don’ts of Data Governance,” in which David offers steps to take and pitfalls to avoid when starting out on data governance. In “5 Key Considerations for Building a Data Governance Strategy,” my colleague Nitin Kudikala describes five data governance best practices and success factors.

I plan to add a few more posts to the category, focusing on specific Talend products that contribute to operationalizing data governance. Before I talk in detail about these tools and how they add value, I feel it necessary to say something about data governance first, as a way of establishing a foundation on which to build.

Rudyard Kipling’s children’s tale of how the elephant got his trunk, “The Elephant’s Child,” is accompanied by this short poem:

I keep six honest serving-men

(They taught me all I knew);

Their names are What and Why and When

And How and Where and Who.

These questions, “What? Why? When? How? Where? and Who?” are fundamental to solution-seeking and information gathering. I’m going to use them to frame this introduction to data governance fundamentals. Part 1 covers the What, Why and Who. Part 2 will cover the When, Where, and most importantly, How. Keep in mind as you read that the “5 Ws and 1 H” are neither mutually exclusive nor as separate as their presentation suggests. They pop up together and continually throughout whatever journey to data governance you may take. Ready? Let’s get started!

The “What” of Data Governance

When talking with someone about a topic, it’s always worthwhile to confirm that all parties are aligned on what exactly they’re talking about, so let’s begin by establishing what data governance is. In a tie-in to the earlier poem, recall the parable of the blind men and the elephant, which originated in India (fig. 1).

Figure 1: Parable of the blind men and the elephant

There are many interpretations of this story, but the one that applies here is that context matters, and that the truth of what something is tends to be a blend of several observations and impressions.

<<ebook: Download our full Definitive Guide to Data Governance>>

What Data Governance is vs. What it is Not

To that end, I offer these definitions from noted luminaries within data governance, including our own product team. These four sources are the wellspring from which I drew most of my content:

  • DAMA’s DMBoK, via John Ladley: The exercise of authority and control (planning, monitoring, enforcement) over the management of data assets.
  • Steve Sarsfield: The means by which business users and technologists form a cross-functional team that collaborates on data management.
  • Bob Seiner: Formalizing and guiding existing behavior over the definition, production, and use of information assets.
  • Talend: A collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals.

Several important themes recur within these definitions:

  1. Data governance is not an IT-only function—it’s a partnership between the business and technologists. Indeed, John Ladley, a noted data governance consultant, advises against the CIO “owning” data governance.
  2. Data governance is a bit of a misnomer, as what’s really governed is the management of data. More on this momentarily.
  3. Data governance is cross-functional—it’s not done in isolation but rather pervades the entire organization. Once it’s live, no single group owns it.
  4. Data governance views data as assets—not in some nebulous figurative sense, but as real, tangible assets that must be formally managed.
  5. At its core, data governance is not about technology but rather about guiding behavior around how data is created, managed, and used. Its focus is on improving the processes that underlie those three events.
  6. Data governance is intended to be an enabling function, not exclusively a command-and-control one. Its purpose is to align data management practices with organizational goals, not be the data police.

Having established what data governance is, I’d like to say a few words about what it is not:

  1. It is not simply a project—it’s a program. What’s the difference, you may ask? According to the Project Management Institute, “A project has a finite duration and is focused on a deliverable, while a program is ongoing and is focused on delivering beneficial outcomes.” Projects have ROIs; programs are enablers—they’re required for other organizational initiatives to succeed. Let me hasten to add that the establishment, i.e., the “standing up,” of data governance is indeed a project, but its business-as-usual state is that of a program.
  2. It is not achieved through technology alone. It’s achieved through changes in organizational behavior. Technology plays a big part in getting there, but to succeed at data governance you need to establish and institutionalize the activities and behaviors the tool is supporting. It may seem strange that a person who works for a technology company is downplaying technology, but keep in mind my role as a customer success architect is to help ensure just that, and critical to the successful leveraging of any tool is the success of the program it supports.
  3. It is not a stand-alone department. As a program, it may have a Center of Excellence, but it’s not a distinct functional area. Instead, you’re putting functionality in place throughout the organization.
  4. It is not the same as data management. Management is concerned with execution, while governance is oversight—an audit and small-“c” control function. Data governance ensures that management is done right by establishing, maintaining, and enforcing standards of data management. Recall I mentioned earlier that data governance is a bit of a misnomer—what’s really being governed is data management. Data governance is to data management what accounting is to finance: there’s the governed and the governor.

The “Why” of Data Governance

The benefits of governing data are many, but quite simply, those organizations that govern their data get more value from it.

Here are a few macro-level benefits of DG:

  • Data governance leads to trusted data. As Supreme Court Justice Louis Brandeis said, “sunlight is said to be the best of disinfectants.” By putting eyeballs on data, data governance enables better-quality data. When data quality goes up, trust in the data does too.
  • Data governance enables benefits at every management level by enabling and improving the processes around the creation, management, and use of data. Strategic benefits include aligning business needs with technology and data, better customer outcomes, and a better understanding of the organization’s competitive ecosystem. Tactical benefits include data silo-busting, i.e., greater data sharing and re-use, and timely access to critical data. Operational benefits include increased efficiencies and better coordination, cooperation, and communications among data stakeholders.
  • When deciding whether to do something, organizations commonly decide the value of that “something” by the extent to which it impacts the “Big-3”: revenue, costs, and risks. Governing data increases revenue, reduces costs and mitigates risk in manifold ways, but how it does so is highly specific to an organization. These particulars are identified when the business case for DG is developed, which I cover later.

The “Who” of Data Governance

There’s an old joke that asks, “How many psychiatrists does it take to change a lightbulb?” “Only one,” goes the answer, “but the lightbulb has to want to change.” Change—especially organizational change—is, of course, difficult, but with data governance, the juice is worth the squeeze.

I mentioned above that those organizations that govern their data get more out of it, so the glib answer to the “who?” question is “everyone.” Bob Seiner argues that “you only need data governance if there’s significant room for improvement in your data and the decisions it drives.” As you can imagine, he believes that description applies to most organizations.

Organizations that recognize data as a strategic enabler govern their data to ensure that data management responsibilities are aligned with business drivers. It’s also the case that organizations in heavily regulated industries such as banking and financial services are driven to implement data governance to ensure they’re doing the right things the right way. If you’re data-driven, you should govern that data.

Conclusion

In this post, the first of two on data governance fundamentals, I’ve discussed the what, why, and who of a governance program. What data governance is (and isn’t), why it’s worth doing, and who should govern their data. In part 2, I’ll conclude with the when, where, and how of data governance. Thanks for reading!

The post The Fundamentals of Data Governance – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.

The Secret Recipe for Digital Transformation? Speed & Trust at Scale

Today’s ever-increasing competitive market is forcing organizations to become more data-driven. To support key business objectives such as growth, profitability, and customer satisfaction, businesses must digitally transform and rely on more and more data to make laser-sharp decisions faster.

Per IDC, the global data sphere will more than quadruple by 2025, growing from 40 ZB in 2019 to a whopping 175 ZB in size in 2025 (Source: IDC DataAge 2025 whitepaper).

Mastering Data – One Pizza at a Time

Let’s look at how Domino’s Pizza transformed itself into a digital company that sells pizza.

 

“We’ve become an e-commerce company that sells pizza. Talend has helped us make that digital transformation.”

Dan Djuric, VP, Global Infrastructure and Enterprise Information Management, Domino’s Pizza, Inc.

 

Domino’s Pizza is the world’s largest pizza delivery chain: it operates around 15,000 pizza restaurants in more than 85 countries and 5,700 cities worldwide and delivers more than 2 million pizzas daily.

In 2009, Domino’s Pizza was worth $500M. Today the business is worth $11B (more than 20X growth in 10 years!).

The story of their growth started six years ago, when they began coupling business transformation with digital transformation. They first reinvented their core pizza, then aggressively invested in data analytics and a digital platform to reimagine the customer experience. Domino’s can now own its customer experience, optimize customer data, and iterate quickly.

They also implemented a modern data platform that improves operational efficiency, optimizes marketing programs, and empowers franchisees with store-specific analytics and best practices.

Domino’s integrates customer data across multiple platforms—including mobile, social and email—into a modern data platform to increase efficiency and provide a more flexible customer ordering and delivery experience.

The company implemented a digital strategy by enabling customers to order pizzas on their favorite devices and apps, any way they want, anywhere. Domino’s knows each member of the household and their buying patterns, and can send personalized promotions and proactively suggest orders.

Building Your Data-Driven Enterprise

According to Gartner, through 2020, integration tasks will consume 60% of the time and cost of building a digital platform.

The data value chain

To enable the business with data, you must solve two problems at the same time – speed and trust – and do it at scale.

The data must be timely, because digital transformation is all about speed and accelerating time to market – whether for real-time decision-making or for delivering personalized customer experiences. However, most companies are behind the curve. Per Forrester, only 40% of CIOs are delivering results at the speed required.

But speed isn’t enough, because the question remains: do you trust your data? For data to enable decision-making and deliver exceptional customer experiences, data integrity is required. This means delivering accurate data that provides a complete picture for making the right decision and that has traceability: you know where the data is coming from.

This is also a major challenge for organizations. According to the Harvard Business Review, on average, 47% of data records are created with critical errors that impact work.
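To make “accurate, complete, and traceable” slightly more concrete, here is a purely illustrative sketch in Python of the kind of basic completeness check and lineage tagging a data platform might apply to incoming records. The required fields, the “orders” source name, and the helper function are all hypothetical, not any particular product’s API.

from datetime import datetime, timezone

REQUIRED_FIELDS = ["customer_id", "order_id", "amount"]  # hypothetical schema

def validate_and_tag(record, source_system):
    """Check a record for completeness and attach lineage metadata."""
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    record["_source"] = source_system  # traceability: where the data came from
    record["_ingested_at"] = datetime.now(timezone.utc).isoformat()
    record["_errors"] = [f"missing field: {f}" for f in missing]
    return len(missing) == 0, record

ok, rec = validate_and_tag({"customer_id": "C042", "order_id": None, "amount": 19.99}, "orders")
print(ok, rec["_errors"])  # False ['missing field: order_id']

Trivial as it is, a check like this is the difference between catching the error at ingestion time and discovering it later in a report.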

Companies that are digital leaders, like Domino’s, can rapidly combine new sources of data to produce insights for innovation or to respond to a new threat or opportunity, because they can deliver accurate and complete data that the business can trust. And they do this at the speed required to compete and innovate. For these organizations, data has become a strategic differentiator.

To see how your team’s competencies match up to a digital leader’s, see this white paper from Gartner: Build a Data-Driven Enterprise.

The post The Secret Recipe for Digital Transformation? Speed & Trust at Scale appeared first on Talend Real-Time Open Source Data Integration Software.
