
Understanding what Machine Learning is and what it can do


As machine learning continues to address common use cases, it is crucial to consider what it takes to operationalize your data into a practical, maintainable solution. This is particularly important if you want to predict customer behavior more accurately, make more relevant product recommendations, personalize a treatment, or improve the accuracy of research. In this blog, we will attempt to understand what machine learning means, what it takes to make it work, and what the best machine learning practices are.

 

What is ML?

Machine learning is a computer programming technique that uses statistical probabilities to give computers the ability to “learn” without being explicitly programmed. Put simply, machine learning ‘learns’ based on its exposure to external information. Machine learning makes decisions according to the data it interacts with and uses statistical probabilities to determine each outcome. These statistics are supported by various algorithms modeled on the human brain. In this way, every prediction it makes is backed up by solid factual, mathematical evidence derived from previous experience.

A good example is the sunrise scenario. A computer cannot know that the sun will rise every day unless it is already programmed with the inner workings of the solar system, our planets, and so on. Alternatively, a computer can learn that the sun rises daily by observing and recording the relevant events over a period of time.

After the computer has witnessed the sunrise at the same time for 365 consecutive days, it will calculate, with high probability, that the sun will rise again on the 366th day. There will, of course, still be an infinitesimal chance that the sun won't rise the day after, because the statistical data collected so far will never allow for a 100% probability.
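To make the statistics behind that intuition concrete, here is a minimal Python sketch. It uses Laplace's rule of succession as one possible (assumed) way to turn a run of observed sunrises into a probability estimate; it illustrates the idea and is not part of any Talend product.

```python
# Toy illustration: estimating the probability of another sunrise after
# observing n consecutive sunrises, using Laplace's rule of succession.
# Purely illustrative; the modeling choice is an assumption.

def probability_of_next_sunrise(observed_sunrises: int) -> float:
    # Rule of succession: (s + 1) / (n + 2) with s successes out of n trials.
    return (observed_sunrises + 1) / (observed_sunrises + 2)

if __name__ == "__main__":
    for days in (1, 30, 365):
        print(f"After {days} days: {probability_of_next_sunrise(days):.4f}")
    # Even after 365 days the estimate never reaches 1.0, which mirrors the
    # "infinitesimal chance the sun won't rise" point above.
```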

 

There are three types of machine learning:

1. Supervised Machine Learning

In supervised machine learning, the computer learns the general rule that maps inputs to desired target outputs. Also known as predictive modeling, supervised machine learning can be used to make predictions about unseen or future data, such as predicting the market value of a car (output) from the make (input) and other inputs (age, mileage, etc.).
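As a rough illustration of that car-value example, the sketch below fits a linear model with scikit-learn. The tiny in-memory dataset, the column names, and the "premium make" flag are all invented for illustration; a real project would use your own data and feature engineering.

```python
# Minimal supervised-learning sketch (assumes scikit-learn and pandas are
# installed; the tiny in-memory dataset and column names are made up).
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "age_years": [1, 3, 5, 8, 10],
    "mileage_km": [15000, 40000, 70000, 120000, 160000],
    "make_is_premium": [1, 0, 1, 0, 0],   # crude stand-in for "make"
    "market_value": [30000, 18000, 21000, 8000, 5000],
})

X = data[["age_years", "mileage_km", "make_is_premium"]]
y = data["market_value"]

model = LinearRegression().fit(X, y)       # learn the input -> output mapping

unseen_car = pd.DataFrame([[4, 55000, 1]], columns=X.columns)
print(model.predict(unseen_car))           # predicted market value for unseen data
```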

2. Unsupervised Machine Learning

In unsupervised machine learning, the algorithm is left on its own to find structure in its input and discover hidden patterns in the data. This is also known as "feature learning."

For example, a marketing automation program can target audiences based on the demographics and purchasing habits it learns from the data.
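A minimal sketch of the same idea, assuming scikit-learn is available: k-means groups customers by demographics and spend without being given any labels. The data and the choice of three clusters are purely illustrative.

```python
# Minimal unsupervised-learning sketch: k-means clustering of customers by
# age and annual spend (scikit-learn assumed installed; data is made up).
import numpy as np
from sklearn.cluster import KMeans

# columns: age, annual_spend
customers = np.array([
    [22,  300], [25,  450], [41, 2200],
    [45, 2500], [63,  800], [60,  950],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment per customer
print(kmeans.cluster_centers_)  # "discovered" segments, no labels required
```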

3. Reinforcement Machine Learning

In reinforcement machine learning, a computer program interacts with a dynamic environment in which it must achieve a certain goal, such as driving a vehicle or playing a game against an opponent. The program is given feedback in the form of rewards and punishments as it navigates the problem space, and it learns to determine the best behavior in that context.
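For readers who want to see the reward-and-punishment loop in code, below is a toy tabular Q-learning sketch on a made-up five-cell corridor. It only shows how feedback shapes behavior over repeated episodes; it is not tied to any Talend functionality.

```python
# Toy reinforcement-learning sketch: tabular Q-learning on a 5-cell corridor.
# The agent starts in cell 0 and is rewarded for reaching cell 4.
import random

n_states, actions = 5, [-1, +1]          # move left / move right
q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2    # learning rate, discount, exploration

for _ in range(2000):                    # repeated episodes of trial and error
    state = 0
    while state != n_states - 1:
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda x: q[(state, x)])
        next_state = min(max(state + a, 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else -0.01   # reward / punishment
        best_next = max(q[(next_state, x)] for x in actions)
        q[(state, a)] += alpha * (reward + gamma * best_next - q[(state, a)])
        state = next_state

# Best learned action per non-terminal state (should favor moving right).
print({s: max(actions, key=lambda x: q[(s, x)]) for s in range(n_states - 1)})
```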

 

Making ML work with data quality

Machine learning depends on data, and good-quality data is needed for successful ML. The more reliable, accurate, up-to-date, and comprehensive that data is, the better the results will be. However, typical issues such as missing data, inconsistent data values, autocorrelation, and so forth will affect the statistical properties of the datasets and interfere with the assumptions made by algorithms. It is vital to implement data quality standards with your team from the very beginning of the machine learning initiative.
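As a hedged example of what such checks might look like in practice, the pandas sketch below profiles missing values, duplicate keys, and inconsistent category codes before any training happens. The dataset and the standardization rule are invented for illustration.

```python
# Quick data-quality sketch with pandas (assumed installed): profile missing
# values, duplicate keys and inconsistent category values before training.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "country": ["FR", "France", None, "DE"],   # inconsistent + missing values
    "revenue": [120.0, None, 85.5, 40.0],
})

print(df.isna().sum())                             # missing values per column
print(df.duplicated(subset="customer_id").sum())   # duplicate keys
print(df["country"].value_counts(dropna=False))    # spot inconsistent codes

# One possible (assumed) standardization step before modeling:
df["country"] = df["country"].replace({"France": "FR"})
```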

Watch Fundamentals of Machine Learning now.
Watch Now

Democratizing and Operationalizing

Machine Learning can appear complex and hard to deliver, but if you have the right people with the right skills and knowledge involved from the beginning, there will be less to worry about.

Get the right people on your team involved who:

  • can identify the data task, choose the right model and apply the appropriate algorithms to address the specific business case
  • have skills in data engineering, as machine learning is all about data
  • will choose the right programming language or framework for your needs
  • have a background in general logic and basic programming
  • have a good understanding of core mathematics – especially linear algebra, calculus, probability, statistics and data frameworks – to help you manage most standard machine learning algorithms effectively

Most importantly, share the wealth. What good is a well-designed machine learning strategy if the rest of your organization cannot join in on the fun? Provide a comprehensive ecosystem of user-friendly, self-service tools that incorporates machine learning into your data transformation for equal access and quicker insights. A single platform that brings all your data together from public and private clouds as well as on-premises environments will enable your IT and business teams to work more closely and constructively while remaining at the forefront of innovation.

Machine Learning best practices

Now that you are prepared to take a data integration project that involves machine learning head-on, it is worth following the best practices below to ensure the best outcome:

  1. Understand the use case – Assessing the problem you are trying to solve will help determine whether machine learning is necessary or not.
  2. Explore data and scope – It is essential to assess the scope, type, variety and velocity of data required to solve the problem.
  3. Research model or algorithm – Finding the best-fit model or algorithm is about balancing speed, accuracy and complexity.
  4. Pre-process – Data must be collated into a format or shape which is suitable for the chosen algorithm.
  5. Train – Teach your model with existing data and known outcomes.
  6. Test – Validate the model against held-out data it has not seen, comparing its predictions with the known outcomes to measure accuracy (see the sketch after this list).
  7. Operationalize – After training and validating, start calculating and predicting outcomes with new data.
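Here is a minimal sketch of steps 5 and 6 using scikit-learn on a synthetic dataset: train on one portion of the data, then measure accuracy on a held-out portion the model has never seen. The model choice and split ratio are illustrative assumptions.

```python
# Hedged sketch of the "Train" and "Test" steps (scikit-learn assumed installed;
# the dataset is synthetic and purely illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)  # step 5: train
predictions = model.predict(X_test)                                   # step 6: test
print("Hold-out accuracy:", accuracy_score(y_test, predictions))
```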

 

As data increases, more observations are made – which results in more accurate predictions. Thus, a key part of a successful data integration project is creating a scalable machine learning strategy that starts with good quality data preparation and ends with valuable and intelligible data. 

To learn more about common problems machine learning addresses and where it works best, download the Guide on Machine Learning. It will define what machine learning means for commercial use, how to prepare for a successful machine learning strategy with data quality and how exploring machine learning best practices will improve the way your data is processed.

Download the Guide on Machine Learning

 



The first Pay-as-You-Go design environment for accelerating integration projects


The integration landscape is changing.

According to Gartner, “Two-thirds of all business leaders believe that their companies must pick up the pace of digital transformation to remain competitive.”

One of the byproducts of this increasing pace is the desire to get results quickly. In a cloud-first world, that means expectations are changing for how products are trialed, procured, and billed. People expect things to be simpler, faster, and more intuitive.

Despite this fervor, enterprises are wary of procuring simple single-task solutions that don't scale to meet more complex needs or standards. Sometimes it's about more than simply moving the data. The ability to quickly produce and leverage trusted data requires advanced capabilities that simple solutions don't provide – including hybrid integration, data quality, and governance.

We need the best of both.

Today, we’re excited to announce the upcoming availability of our Summer ’19 release. With this release, Talend provides the industry’s first end to end integration platform that allows users to get started quickly and grow in complexity as their needs change. Summer ’19 enables companies to balance immediate business needs for initiating data projects with long-term scale requirements for integration, such as governance and cross-cloud support. 

To help customers get started quicker, we're introducing new Pay-as-You-Go (PAYG) functionality to Pipeline Designer. This means users can trial the full product for 14 days and purchase it at any time with a credit card – no need to contact a salesperson or process a PO. At the same time, we're introducing hourly billing, so customers only pay for what they use. It's the flexibility to buy a product in a budget-friendly way: customers can scale up and down and are billed accordingly, automatically.

Within our product portfolio, we’re introducing In Product Chat (IPC) starting with our Pipeline Designer PAYG and trials. Users can interact with a live Talend employee and get help from directly within the product! There’s no need to open another browser window or exit what you’re doing. Users can ask questions about function, connectivity, or even billing.

As needs grow, the Summer '19 release enables data professionals, from citizen integrators to more technical developers, to scale to more complex and trusted multi-cloud and hybrid use cases.

We’re introducing new machine-learning-driven data preparation (MagicFill). Data stewards are automatically suggested transformation recipes based on the actual work they’re already doing. It’s an amazing way to raise productivity while lowering errors. Summer ’19 also introduces data trust scores, data stewardship analytics, bi-directional format preserving encryption, multi-step workflows, and more to help organizations build and leverage trusted data.

Our partner ecosystem continues to grow with new connectivity to Azure SQL DWH, Blob, and ADLS v2 in Pipeline Designer. With Databricks, we’re adding support to run Pipeline Designer jobs with a simple configuration change and connect to their new Delta Lake platform (technical preview) within the Talend Cloud platform. These steps make it easier for customers to take advantage of the latest data technologies.

New scalable DevOps with Docker container support for Data Services and Routes ensures organizations can take advantage of containerization to scale up and down with business requirements. We’ve even added new automated zero configuration continuous integration (CI) support to ensure that organizations using Talend will support enterprise projects of any scale or complexity.

Talend is committed to helping organizations quickly transform their raw data into insight-ready, trusted data. Our Summer '19 release enables users to start quickly and scale as their needs grow to achieve this reality. This release will be available in Q3 for cloud or on-premises deployments. To learn more about the product update, go here.

See What’s New

 


Microsoft Azure & Talend : 3 Real-World Architectures


We know that data is a key driver of success in today's data-driven world. In fact, according to Forrester, data- and insight-driven businesses are growing at an average of more than 30% annually. However, becoming a data-driven organization is not easy. Companies often struggle with speed in accessing and analyzing their data, as well as with ensuring the delivery of trustworthy data that is free of critical errors. Legacy ETL solutions, siloed data, and cumbersome governance processes are just a few of the reasons why 69% of recently surveyed companies do not identify themselves as being data-driven.

To solve this problem, organizations are increasingly adopting cloud platforms like Microsoft Azure to modernize their IT infrastructure and take advantage of the latest innovations to improve their business on a number of fronts, such as analytics, decision making, vendor management, and customer engagement. To get the most out of their Microsoft Azure investment, however, these organizations need a data integration provider like Talend to seamlessly integrate with their data sources and Azure cloud services such as Azure Cloud Storage and Azure SQL Data Warehouse.

Talend can help you innovate in the cloud faster by accelerating analytics, automating and scaling data transformation, ensuring delivery of trusted data, and making it easier to share and monetize your data. Earlier this year, Talend published an Architect’s Handbook on Microsoft Azure, where we featured a few companies that couple the power of Talend and Microsoft Azure solutions to overcome application and data integration challenges in order to modernize their cloud platform.

Let’s take a closer look at each use case to pinpoint exactly how each company is utilizing Talend and Azure.

Use Case 1: Maximizing Customer Engagement to Keep a Liquid Petroleum Gas Supplier ahead of the Competition

Maintaining a high level of customer engagement is critical to keeping the competition at bay for any company, yet for a leading British liquid petroleum gas supplier, it requires a Herculean effort. They must keep customer engagement high across a number of criteria ranging from product quality to pricing to supply and operations to a compelling branding and positioning strategy.

One way to ensure great customer engagement is to find the right customer segment and target them with the right messaging at the right time through the right channel. The challenge, however, lies in getting relevant, accurate, and in-depth data of individual customers. Using Talend Big Data Platform to build a cloud data lake on the Microsoft Azure Cloud Platform (consisting of Microsoft Azure Data Lake Store, Azure HD Insight, and Azure SQL Data Warehouse), this company was able to integrate and cleanse data from multiple sources and deliver real-time insights. With a clear view of each customer segment’s profitability, they could target their customers with customized offers at the right time to maximize engagement.

Use Case 2: Enabling GDPR Compliance and Social Media Analytics to Improve Marketing

Balancing visibility into customer data in order to design effective marketing campaigns while complying with data regulations is not an easy task for the highly-regulated liquor industry, as wine, beer, and spirits companies are not allowed to collect customer or retail store data first-hand with surveys.

The CTO of a century-old large European food, beverage and brewing company with 500 brands was able to achieve this balance, however, with a GDPR-compliant solution that delivers insights on how customers and prospects talk about their products and services on social media platforms in real-time. Using Talend Big Data Platform and Microsoft Azure (specifically, Microsoft Azure HDFS, Microsoft Azure Hive Data Lake, and Microsoft SQL Data Warehouse) to build an enterprise cloud data lake, the company was able to analyze various social media data from 450 topics with a daily sample set of up to 80GB and transform over 50 thousand rows of customer data in a time span of 90 days.

Use Case 3: Delivering Real-Time Package Tracking Services by Building a Cloud Data Warehouse

To maintain a premium level of package tracking and delivery service, a leading logistics solution provider needed to consolidate, process and accurately analyze raw data from scanning, transportation, and last mile delivery from a wide range of in-network applications and databases.

They selected Talend for its open source and hybrid nature, its developer-friendly UI, and simple pricing. By deploying Talend Real-Time Big Data Platform on the Microsoft Azure cloud environment (specifically, a Microsoft Azure SQL Data Warehouse), they were able to re-architect a legacy infrastructure and build a modern cloud data warehouse that allows them to provide cutting edge services, and shrink package tracking information delays from 6 hours to less than 15 minutes.

Building Your Microsoft Azure and Talend Solution

What’s next? You can get the full whitepaper on how to modernize your cloud platform for Big Data analytics with Talend and Azure here. Additionally – you can try Talend Cloud and start testing Azure integration for free.

 


How to bulk load Snowflake tables using Talend Cloud Platform


Talend Cloud is an integration platform as a service (iPaaS) offering from Talend. It is a fully managed cloud option with capabilities for data integration, data stewardship, data preparation, API design and testing, and pipeline design. These tools can be used for lightweight ETL and for detecting schemas on the fly. One of the unique features of Talend Cloud is that it provides both on-premises and cloud execution environments. Remote Engines can be installed behind the firewall, which allows you to run tasks on premises while scheduling and managing them from the Talend Management Console. Similarly, Talend provides a Cloud Engine, an execution environment fully managed by Talend. With many companies undertaking the journey to the cloud for its ease of use, minimal maintenance, and cost effectiveness, Talend Cloud can certainly ease the process and make it frictionless.

 

Snowflake is a completely cloud-managed database service for building an analytic data warehouse. It is highly scalable and easy to use, as it provides a SQL engine to interact with the underlying database storage.

 

Prerequisites for the use case:

  1. The user should have a valid Talend Cloud license and Talend Studio version 7.1.1 or higher installed.
  2. The user should have a valid Snowflake account.
  3. The user should have a valid AWS account (since the files to be loaded are staged in an S3 bucket).

 

 

Use case description:

In today's world, data is the new oil: it is highly valued and sought after everywhere. Sources like sensors and IoT devices produce datasets that run into gigabytes. Just as oil needs a refining process to convert it into a usable form, data needs proper cleansing and aggregation to be converted into a consumable form called information. Hence, the processing of these large datasets and files must be handled efficiently. For this, Snowflake provides bulk loading in the form of the COPY command, which ingests data from large files quickly and efficiently into Snowflake tables. Talend has built a component around the COPY command: by simply filling in a few required parameters, you are ready to use the COPY command, which makes it easy to ingest the data. For the COPY command to load the data, the files must be staged in AWS S3, Google Cloud Storage, or Microsoft Azure. In our current use case, the files are staged in an S3 bucket.
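For readers curious about what the underlying statement roughly looks like outside of Talend, below is a hedged sketch issued through the snowflake-connector-python library. The account details, table name, stage path, and credentials are placeholders, and the exact options that tSnowflakeBulkExec generates may differ.

```python
# Rough sketch of the COPY command that a bulk-load component wraps, issued
# here via snowflake-connector-python. Account, table, stage path and
# credentials are placeholders -- adapt them to your own environment.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
try:
    conn.cursor().execute("""
        COPY INTO employee
        FROM 's3://my-bucket/snowflake/'
        CREDENTIALS = (AWS_KEY_ID='***' AWS_SECRET_KEY='***')
        FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)
    """)
finally:
    conn.close()
```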

 

Creating Talend job for bulk load:

  1. Talend has a prebuilt component called tSnowflakeBulkExec, which will be used to execute the bulk load; the job design is shown in the screenshot below. For this exercise, Talend Studio for Cloud version 7.1.1 has been used.

  2. SnowflakeConnection (tSnowflakeConnection) creates the connection to the Snowflake database. LoadEmployee (tSnowflakeBulkExec) executes the COPY command on the Snowflake database and loads the Employee table. CommitLoad (tSnowflakeRow) commits the load, and finally CloseConnection (tSnowflakeClose) closes the Snowflake connection.

  3. The screenshot below shows the configuration of the bulk component. The component is configured so that it loads the Employee table in Snowflake using the file staged in the Amazon S3 bucket, inside the Snowflake folder.

Creating and Executing Task in Talend Cloud

  1. After designing the job, right-click it and choose Publish to Cloud to deploy the job to Talend Cloud.

  2. After publishing to the cloud, the corresponding artifact will be available in the respective workspace. Click ADD TASK to create a task for the artifact.

Note that an artifact is just a binary or executable; to execute it, a task must be created for that binary.

 

  3. Verify the configuration below and click Continue.

  4. Choose Cloud as the Runtime to execute the job on a Cloud Engine, choose Manual as the Run Type to run it as and when required, and click the GO LIVE button to execute the task.

  5. After going live, the job will run on the Talend Cloud Engine and its status will change to SUCCESSFUL, as shown in the screenshot below. The data in the staged file will be loaded into the Employee table. You can go back to the Snowflake query console and query the Employee table for the data.

 

Conclusion

With most organizations leaning towards cloud solutions and integrating different technologies to handle huge data volumes, Talend Cloud with Snowflake provides an easy and seamless way of doing this, as depicted here.

 


Ingesting data from AWS s3 to Snowflake using Stitch


Stitch is a simple yet powerful cloud service that connects to a diversity of data sources. It is a very user-friendly ingestion system and works in a very simple way: Stitch connects to the data sources, pulls the data, and loads it to a target.

In this blog, I am going to connect to Amazon S3, read a file, and load the data to Snowflake – but first, let's understand a few Stitch concepts. For Stitch to work, you need an integration and a destination.

  • An integration is your data source. Each integration contains details about a source system like AWS S3, MySQL, etc.
  • A destination is a place, like Snowflake, which holds all the data from the different sources pulled via integrations.

Now that we know the concepts, we are ready to start ingesting with Stitch. In the rest of this blog, I am going to create a destination and an integration and do the data load.

Integration

Let's look at the process of creating an integration using the example of AWS S3. As a first step, log in to Stitch and click Integrations.

 

Click on Add Integration and select your data source from the list. Let’s select Amazon S3 CSV

 

Configuring the integration is very simple, but in case you still need assistance, every data source listed has documentation available. Provide details like your integration name, S3 bucket name, AWS account ID, and your file name or a file pattern.

 

If you want Stitch to limit its file search to a particular directory, then specify the Directory. I have left it open as I just have one file to try. Next specify the Table name, Primary key and file Delimiter.

 

You can also configure more than one table for an Integration. You can also sync the historical data and set a frequency for data replication. I have left ‘Sync Historical Data’ as default and have configured the Replication Frequency to once in every 24 hours at 10am UTC.

Click Continue. To be able to read the file from S3, Stitch needs access to the S3 bucket. Create a new IAM policy and IAM role to grant that access. Once the right access is set, Stitch will test the connection. On a successful connection you will get the following message.

 

Destination

You can either navigate to Destination from the main menu or click 'Configure Your Warehouse' on the above screen.

Select Snowflake from the list provided

 

Configure the snowflake destination. Give the Host, Port, username and password.

 

To grant Stitch access to Snowflake, you need to add the required IP addresses to the Snowflake security policies. Once done, click Check and Save.

 

On successful connection you will get the following message

 

Data Replication

Now go back to Integrations and select the integration we created, 's3_stitch_test'.

Click on the integration and select 'Choose Tables to Replicate'.

 

Select the table 'Test_Stitch'. Until now, Stitch has not read the file; at this point, it reads the file's metadata from the S3 bucket. Select the fields you want to replicate – I have selected all of them – then click 'Finalize Your Selections'.

 

Selecting the columns is an important step because, from this point on, Stitch is going to fetch only these columns at every run. Click 'Yes, Continue' and navigate to the Extractions menu.

 

For testing purposes, click Run Extraction Now and notice the status change.

 

Check the extraction log

Log start:

(Screenshot: AWS S3 to Snowflake extraction log in Stitch)

Log end:

(Screenshot: AWS S3 to Snowflake extraction log in Stitch)

Now, let's verify the tables in Snowflake.
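One possible way to do that verification from a script (this is my own suggestion, not part of Stitch) is with the snowflake-connector-python library; all connection details, and the schema and table names below, are placeholders you would replace with your own.

```python
# Assumed verification step, outside of Stitch: count the replicated rows with
# snowflake-connector-python. Every identifier here is a placeholder.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="stitch_db", schema="s3_stitch_test",
)
try:
    row = conn.cursor().execute('SELECT COUNT(*) FROM "TEST_STITCH"').fetchone()
    print("Rows replicated:", row[0])
finally:
    conn.close()
```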

Stitch keeps track of what it has extracted: if I run the extraction again, Stitch doesn't re-ingest the data. In my next blog I will be creating a new integration or destination. Keep watching this space and, until then, happy reading!

 


The Gendarmerie Nationale uses data to strengthen French national security


In 2010, the Union of European Football Associations (UEFA) named France the host of the 2016 European Championship. Paris was also set to host the UN Climate Change Conference, COP 21, in late 2015 – another major international event with high-stakes security issues. How could they make sure dangerous individuals would not have access to very sensitive areas?

Two deadly terrorist attacks in 2015 and 2016 pushed the French Ministry of Interior to further strengthen its authority over the automated processing of personal data in the context of the fight against terrorism. Called ACCReD (Automation of Centralized Consultation of Data Information), this system allows the automatic and simultaneous analysis of files, in particular, that of individuals considered threats to national security. 

The initial difficulty was that all the databases were independent. To access the information, it was necessary to consult them one by one and sometimes to type the same name four or five times, which took an enormous amount of time. This is where Talend came in.

Cross-referencing multiple government security files

Talend has become a strategic figure in the National Gendarmerie’s efforts to ensure inter-application exchanges for security data in the interest of public safety. Talend receives a list of identities to screen, accesses hundreds of applications and dozens of existing files, processes 3 TB of data every day and returns the information, all in real time.

This automated access of all security data is highly sensitive. Given such complexity, the government adopted CNIL recommendations and has issued a restricted list of users authorized to access intelligence files. And every night, Talend uses various data sources to construct a “pivotal HR repository,” which factors in certain elements such as relocations or assignment changes, thereby restricting file access to authorized individuals only. Ultimately, a person rather than the system decides who is qualified to access the data. And the overall results are subject to the approval of the Officers of the Judicial Police (OPJ).

At the core of French national security applications

Today, Talend can analyze more than one million identities screened per month compared to 300,000 requests per year at the beginning of the project.

The system has been opened up to security-sensitive operators (public transport operators, transporters, nuclear facilities, etc.) for monitoring authorizations to access these vitally important facilities and for ensuring that their employees' records do not show any misconduct. Police officer mobility applications also benefit from the system: today, during a roadside check, an agent enters an identity and/or a license plate and chooses the verifications to be made. The exact same procedure is now used for automated border controls in French airports, and takes between 1 and 20 seconds.

The system also handles operational issues, such as geo-localizing patrols, monitoring vehicle consumption, and sending fuel bills to the Ministry of Finance for payment. There is not a single application of the Gendarmerie or the Police that consumes or produces data without going through Talend.

One million identities screened each month

1 to 30 seconds to return information


 


Modernizing the IT architecture for a successful digital transformation


In a competitive world driven by consumers, companies are all facing the same business challenge: they need to make their data available to their business teams, enabling them to deliver better experiences and to streamline operations. IT plays a strategic role here as it can accelerate delivery of trusted data. But this is easier said than done because digital transformation comes with a wide array of technology challenges – from capped IT budgets to lack of resources with the right skillsets, from demanding SLAs to ever-changing technology stacks.

Cloud offerings have expanded beyond storage and computing to offer a wide array of services. The adoption of other cloud computing models such as Platform as a Service (PaaS) and Infrastructure as a Service (IaaS), along with the growing usage of connected devices and IoT platforms, means that additional data and processes are also moving outside the firewall and into the cloud. So much so that some companies now rely on a full cloud-native stack with managed services. With their whole architecture in the cloud, these companies do not have to manage anything, which gives cloud computing yet another interesting value proposition: reducing the overhead expenses of on-premises equipment and lowering the total cost of ownership.

 

The cloud as a pre-requisite for any data strategy

Modernizing the enterprise IT architecture with the cloud is a universal answer to any digital transformation. And, indeed, companies are now all on a journey to the cloud: an astounding 96% of companies are using the cloud for some of their data projects. Yet we're still in the very early days of adoption. Today, only 19% of enterprise IT spending goes to the cloud, a number that should go up to 28% by 2022 according to Gartner, and these figures actually hide a great variety of situations. Depending on business priorities, level of risk, and time constraints, some companies will decide to simply migrate some of their databases to the cloud, while others will re-architect their data stack for the cloud. There is no one-way ticket to the cloud, but we've seen success with all of these approaches.

 

Migrating databases and data centers to the cloud

Sometimes, the trigger that puts organizations on the path to the cloud is as simple as a renewal deadline: we've worked with companies that looked at cloud hosting options because they were about to extend their current on-premises solutions for another 5 years. When these companies do decide to migrate their databases, applications, or project workloads to the cloud, the process has to be staggered – no server is switched off overnight. During that transition period, where data and information live in two different environments, data synchronization is key. Talend is part of a specific program with AWS dedicated to such migration projects: the Workload Migration Program (WMP) helps companies seamlessly migrate their data workloads and applications into the cloud on AWS, allowing customers to achieve their business goals and accelerate their cloud journey. WMP works with AWS Partner Network (APN) Technology and Consulting Partners to create a repeatable migration process and methodology, helping APN Partners drive and deliver ISV workload migrations while enhancing their cloud practices and customer success on AWS.

 

Moving to a cloud data lake or data warehouse to accelerate data delivery and analytics

Migrating databases to the cloud only provides companies with part of the advantages of the cloud. In the cloud, computing is decoupled from storage, which gives companies far more flexibility and scalability: more data sources can be handled and processed faster. As such, companies are increasingly moving their data to the cloud to accelerate analytics and get quicker, self-service access to trusted insights.

Decision Resources Group (DRG), a leading provider of healthcare industry analysis and insights, already had part of its infrastructure on AWS. But they wanted to go a step further and accelerate their data access and analytics by building their data platform in the cloud. DRG selected Talend and the Snowflake cloud data warehouse as the foundation of its new Real-World Data Platform, a comprehensive claims and electronic health record repository that covers more than 90% of all data for the US healthcare system. DRG on-boarded more than 100 TB of data in three months. In addition to the cost savings, DRG's Real-World Data Platform has helped them scale from serving a handful of users to serving more than 200 users.

 

Beyond speed, some cloud platforms come with astounding computing performance. Employsure is an Australian-based company that provides workplace relations support to employers and business owners in Australia and New Zealand. While Employsure did reduce its infrastructure costs by moving to the cloud, the benefits for its analytics went beyond. Employsure uses Talend Stitch Data Loader to import data from Salesforce and other operational systems and leverages Google Cloud Platform capabilities to perform more complex AI-driven analytics in a frictionless way. Employsure’s AI-based system now monitors transactional data for factors that suggest which customers may be at risk of attrition, ultimately enabling the team to innovate and develop new client-facing services.

 

Overhauling its entire architecture to becoming cloud-first

If public and private clouds can accelerate business innovation, companies need to leverage the best cloud services of all service providers. A cloud-first architecture can allow them to scale infinitely and the choice of services should be based on the type of workloads they run.

Back in 2017, AstraZeneca, a pharmaceutical company operating in over 100 countries, resolved to build a data lake on Amazon S3 to hold the data from its wide range of source systems. But AstraZeneca went beyond a simple lift and shift project. For its long-running and compute intensive workloads, AstraZeneca built data pipelines with Talend to execute on an AWS container infrastructure. After some transformation work, Talend then bulk loads that into Amazon Redshift for the analytics. Talend is also being used to connect to Amazon Aurora. These seamless data projects are made possible because AstraZeneca completely overhauled its IT architecture to make it cloud-first.

This is easier to do for "born in the cloud" companies that are not going through a digital transformation process. This is the case of PointsBet, an online bookmaker in Australia and, most recently, the United States. Online betting platforms have long recognized the need for a strong, resilient IT infrastructure. Even a minor glitch during a major sporting event can be disastrous, and the losses can run into millions of dollars. As a fully digital operation, PointsBet already understood the power that data offered its organization. PointsBet operates with a platform-as-a-service that runs on Microsoft Azure. This fully managed environment provides convenience, reliability, and peace of mind. In a regular week, PointsBet engineers spend only an hour troubleshooting, versus a traditional 4-8 hours.

 

To succeed in their digital transformation, companies must find the right balance for their IT architecture when it comes to cloud adoption. For scalability and flexibility, cloud platforms can truly accelerate computing power for data analytics without compromising the resources dedicated to data storage. To ensure that the data moving between the cloud and the enterprise is trusted, you need to get it right, get it clean, and govern it so you know what has happened to it along the way. Finally, bringing everything – storage, processing, and management – together in a cloud environment will help the company reach its ROI quicker with a much smaller TCO.

Data transformations must happen where the data lies, which means data integration has to happen in the cloud. When companies re-architect their data ecosystem to collect all their data in a cloud data lake or data warehouse and keep it there for processing, that is where the digital transformation begins. Wherever companies are on their transformation journey, modernizing the data architecture with the cloud is the only answer, whatever their business case.

 

Try Talend Data Fabric

 


Standardization of Customer Data – Hidden gems in Talend Component Palette


If you were a banker, would you like to hear the bad news that your company's name is in the headlines of every news channel because the bank accidentally delivered a credit card to a scammer instead of the genuine customer? Unfortunately, this scenario occurs when address records are not in a standard format and the letter is delivered to a similar-looking address. This issue can happen not only in banks but also in government institutions, hospitals, and, for that matter, any customer-facing organization.

  

In this era of data privacy and high corporate scrutiny, every customer-friendly institution would like to be in the good books of its customers. Since customer address, email, and telephone are the three major mediums of customer interaction for any institution, data quality for these attributes should be given the utmost importance and care.

 

Data quality and standardization of the above three interaction mediums is a crucial step for customer satisfaction in the long run. We have already discussed a scenario involving address standardization. Similarly, if phone numbers or emails are not correct, companies may not be able to serve the right offers and recommendations at the right time, because an SMS or email will bounce due to an invalid phone number or email address.

In this blog, I would like to reinforce the importance of some of the hidden gems in the Talend component palette. At times, they are overlooked by Talend developers who are unaware of the importance of data quality and standardization. I hope the ideas mentioned in this blog will help them plan customer data standardization tasks in a more efficient way.

 

Address Standardization

Address standardization is one of the most important aspects of ensuring data privacy for a customer. The most popular customer merge rule used by companies is based on the customer's name, address, and date of birth. Many customers have common and popular names with only minor variations, like Junior, II, etc. For non-essential websites, customers tend to provide random values, like 1st January of a year, as the date of birth. This means that, to distinguish the correct customer and get a 360-degree view of the customer, the customer address eventually becomes a crucial parameter.

 

Talend helps with address standardization by providing components that can integrate with the Experian QAS, Loqate, MelissaData, and Google address standardization services. The various components available in the Talend palette are shown below.

Let us see a quick scenario of an address validation service. In this example, we are using the tMelissaDataAddress component to standardize the input address information.

The input address details are as shown below.

The data will be verified and enriched by the MelissaData component and you will get standardized address output data as shown below.

   

The standardization of address records speeds up the address match process, which in turn reduces the overall time required for customer de-duplication efforts. Using these components will also help to identify wrong or non-existent addresses at an earlier stage in the data processing flow. The processing can be done either through a real-time API call to the cloud or in batch mode, depending on the Talend component and the choice of address standardization vendor.

Other address lookup components, like tGoogleMapLookup and tGoogleGeoCoder, also help Talend developers identify an address from geographical coordinates (latitude and longitude) and vice versa.

The detailed specifications and associated scenarios of the Talend components for each vendor can be found at the links below.

 

Vendor – Talend Component Reference Link

  • Experian – Experian QAS address standardization
  • Google – Google address standardization
  • Loqate – Loqate address standardization
  • MelissaData – MelissaData address standardization

 

A cloud version of the Talend address standardization is also available, with two components in this category. Data can be processed either in real time or in batch mode with the cloud components.

 

Email Standardization

Email has become the primary medium of communication for customers of the digital age. This has resulted in increased scrutiny and validation of email addresses by most companies.

Email standardization is achieved in Talend using the tVerifyEmail component, which helps verify and format email addresses against patterns and regular expressions. The component can also whitelist or blacklist specific email domains based on business requirements.
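To illustrate the underlying idea outside of Talend, here is a small standalone Python sketch that validates email syntax with a regular expression and rejects blacklisted domains. The pattern and the domain list are illustrative assumptions, not the rules tVerifyEmail applies internally.

```python
# Standalone sketch of pattern-based email validation with a domain blacklist.
# The regex and domains are illustrative only, not tVerifyEmail's internals.
import re

EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
BLACKLISTED_DOMAINS = {"tempmail.example"}   # hypothetical blocked domain

def classify_email(address: str) -> str:
    if not EMAIL_PATTERN.match(address):
        return "INVALID"
    domain = address.rsplit("@", 1)[1].lower()
    if domain in BLACKLISTED_DOMAINS:
        return "REJECTED_DOMAIN"
    return "VALID"

for email in ("jane.doe@example.com", "not-an-email", "foo@tempmail.example"):
    print(email, "->", classify_email(email))
```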

 

A sample scenario for email verification is as shown below.

The input data for the sample scenario has been added via the inline table of the input component.

 

In this example, instead of regular expression verification, name column contents are used to create validation rules as shown below.

 

The emails are validated and categorized into multiple groups, along with suggested emails, as shown below.

 

The detailed specification of the various properties and usage of this component can be found here.

 

Phone Standardization

Standardization of phone numbers is another basic requirement for customer interactions and the match process. Every customer uses a mobile phone, which means invalid or junk phone numbers often become a headache for companies trying to reach their customers. Talend helps with telephone number standardization through the tStandardizePhoneNumber component. The telephone number can be standardized to one of the formats below (a standalone sketch using an open-source library follows the list).

  1. E164
  2. International
  3. National
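As promised above, here is a standalone sketch of those three formats using the open-source phonenumbers library (pip install phonenumbers). It simply illustrates the output formats and validity checks; it is not the internal logic of tStandardizePhoneNumber, and the sample number is made up.

```python
# Standalone illustration of E164 / International / National formatting and
# validity checks using the "phonenumbers" library; the number is made up.
import phonenumbers
from phonenumbers import PhoneNumberFormat

raw = "0044 20 7946 0958"                       # sample UK-style input
parsed = phonenumbers.parse(raw, "GB")

print(phonenumbers.is_possible_number(parsed))  # possible phone number?
print(phonenumbers.is_valid_number(parsed))     # valid phone number?
print(phonenumbers.format_number(parsed, PhoneNumberFormat.E164))
print(phonenumbers.format_number(parsed, PhoneNumberFormat.INTERNATIONAL))
print(phonenumbers.format_number(parsed, PhoneNumberFormat.NATIONAL))
```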

A sample scenario for the Telephone number standardization is as shown below.

 

Telephone numbers in various formats have been added inline to the input flow as shown below.

The telephone standardization component verifies the data and provides the standardized format. It also provides Boolean values indicating whether the input value is a valid phone number, a possible phone number, or already in standard form, along with the phone number type and any error message.

The detailed specification of the various properties and usage of this component can be found here.

 

First Name Standardization

The last data standardization component I would like to discuss in this blog is tFirstNameMatch, which helps standardize the first name of a customer. The process of standardizing the first name can be explained with the quick example shown below.

 

The sample names have been added as inline to the input component as shown below.

In the above example, the details are filtered to send only name and gender. It is also possible to select the columns that contain the gender or country, which will optimize system performance and give more precise results. You can also use the Fuzzy Logic option for more precise results.

The output of the flow is as shown below where standardized name details are provided for further usage.

The detailed specification of the various properties and usage of this component can be found here.

 

Conclusion

Standardizing the customer name, address, phone number, and email will help customer data matching and de-duplication efforts in a big way during the later stages of the data process flow. The above components also let Talend developers avoid writing time-consuming logic with various regular-expression rules inside tMap. So next time a business user asks a Talend designer or developer whether they can standardize customer interaction data, tell them with confidence and a smile: I can do it! 😊

 



Be the first to try Talend Cloud on Azure through Early Access Program (EAP)


Earlier in May of this year we announced that Talend Cloud, our Integration Platform as a Service (iPaaS), will soon be available on Microsoft Azure starting in Q3 2019. For those who choose Azure as their cloud platform of choice, Talend Cloud running natively on Azure will provide enhanced connectivity options for modern data integration needs as well as better performance.

Today, we’re announcing the availability of our Early Access Program (EAP) for Talend Cloud on Azure.

As a participant in the EAP, you’ll gain access to try out this release before the official launch date and provide direct feedback to Talend. It’s a great chance to experience the benefits and even shape the product’s future.

With the EAP, you can leverage the benefits of Talend on Azure including:

  • Accelerating Cloud Migration
  • End-to-end Integration on Azure Cloud
  • Faster time to Analytics and Business Transformation
  • Reduced Compliance and Operation Risks

The EAP has limited enrollment and is now open. If you’re interested, be sure to sign up soon before it ends!

 


Provisioning and executing Talend ETL Jobs on Serverless platforms using Apache Airflow


Talend 7.1 brings many new features; one that deserves special discussion is containerization, which opens the door to designing and implementing new architectures, namely microservices. With a Maven plugin, the Studio is now able to build and publish standard jobs as Docker images into a Docker registry. Talend jobs built as Docker images give us portability and ease of deployment on Linux, Windows, and macOS, and allow us to choose between on-premises and cloud infrastructures.

This blog illustrates, with two examples, how we can run containerized Talend ETL jobs on the Amazon cloud leveraging its container services (CaaS), namely EKS and Fargate. If you are interested in containerizing Talend microservices and orchestrating them on Kubernetes, please read my KB article.

Scheduling and Orchestrating containerized Talend Jobs with Airflow

While we are comfortable running our containerized Talend jobs using the docker run command or as container services on ECS, running more complex jobs requires us to address some additional challenges:

  • run several jobs with a specific dependency/relationship
  • run jobs sequentially or in parallel
  • skip/retry jobs conditionally when an upstream job succeeded or failed.
  • monitor running/failed jobs
  • monitor the execution times of all the tasks across several runs

Airflow is an open source project for programmatically creating complex workflows as directed acyclic graphs (DAGs) of tasks. It offers a rich user interface which makes it easy to visualize complex pipelines and the tasks in a pipeline (our Talend jobs/containers), and to monitor and troubleshoot those tasks.

 

 Publish Talend ETL Jobs to Amazon ECR

In Talend Studio, I created two standard ETL jobs, tmap_1 and tmap_2, and published them to Amazon Elastic Container Registry (ECR).

Title: Talend jobs published to Amazon ECR

 

Example 1:

Provision and execute ETL Jobs on Amazon EKS

The Airflow KubernetesOperator provides integration with Kubernetes using the Kubernetes Python client library. The operator communicates with the Kubernetes API server, generates a request to provision a container on the Kubernetes cluster, launches a pod, executes the Talend job, monitors it, and terminates the pod upon completion.

Logical Architecture Talend Docker Kubernetes

Title: Logical Architecture

 

A simple DAG with two Talend Jobs tmap_1, tmap_2

Title: DAG with KubernetesOperator
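Since the DAG itself only appears as a screenshot, below is a hedged reconstruction of what such a DAG could look like with Airflow 1.10-era imports and the KubernetesPodOperator (the concrete class typically used for this pattern). The ECR image URIs, namespace, and scheduling settings are placeholders, not the exact values used in the original job.

```python
# Hedged sketch of a DAG that runs two containerized Talend jobs on Kubernetes.
# Image URIs, namespace and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(dag_id="talend_jobs_on_eks", start_date=datetime(2019, 7, 1),
         schedule_interval=None, catchup=False) as dag:

    tmap_1 = KubernetesPodOperator(
        task_id="tmap_1", name="tmap-1", namespace="default",
        image="123456789012.dkr.ecr.eu-west-1.amazonaws.com/tmap_1:latest",
        get_logs=True, is_delete_operator_pod=True,   # clean up the pod afterwards
    )
    tmap_2 = KubernetesPodOperator(
        task_id="tmap_2", name="tmap-2", namespace="default",
        image="123456789012.dkr.ecr.eu-west-1.amazonaws.com/tmap_2:latest",
        get_logs=True, is_delete_operator_pod=True,
    )

    tmap_1 >> tmap_2   # run the second Talend job after the first succeeds
```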

 

A Graph view of DAG run with Kubernetes tasks and execution status

Title: DAG Graph view with Execution status

 

Kubernetes Dashboard with Talend Jobs

Title: Kubernetes Dashboard – Pods overview

 

Example 2:

Provision and execute standard ETL Jobs on Amazon Fargate

The Airflow ECSOperator launches and executes a task on an ECS cluster. In this blog the Fargate launch type is used, since it supports the pay-as-you-go model.

Logical Architecture Talend Kubernetes Fargate

Title: Logical Architecture

In the previous example with the KubernetesOperator, we defined our task, and where it should pull the Docker image from, directly in our DAG. The ECSOperator, however, eases our effort by retrieving the task definitions directly from the Amazon ECS service.

Using the AWS console, I created two task definitions with the Fargate launch type and the repository URLs of my Docker images, tmap_1 and tmap_2.

Fargate Task Definitions

Title: Fargate Task Definitions

Then I created my DAG in Airflow leveraging the ECSOperator.

Observe that the definition of the DAG is much simpler and requires only the name of the task_definition that was created in ECS in the previous step. The operator communicates with ECS using the cluster name and subnet settings.

DAG with ECSOperator

Title: DAG with ECSOperator
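Again, because the DAG is only shown as a screenshot, here is a hedged sketch of an ECSOperator-based DAG for the Fargate launch type. The cluster name, subnets, security group, region, and connection id are placeholders standing in for the values created in the AWS console.

```python
# Hedged sketch of a Fargate-based DAG using the ECSOperator (Airflow 1.10-era
# import path). Cluster, network and connection values are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator

NETWORK = {"awsvpcConfiguration": {
    "subnets": ["subnet-0123456789abcdef0"],
    "securityGroups": ["sg-0123456789abcdef0"],
    "assignPublicIp": "ENABLED",
}}

with DAG(dag_id="talend_jobs_on_fargate", start_date=datetime(2019, 7, 1),
         schedule_interval=None, catchup=False) as dag:

    task_tmap_1 = ECSOperator(
        task_id="Task_tmap_1", task_definition="tmap_1", cluster="talend-fargate",
        launch_type="FARGATE", overrides={"containerOverrides": []},
        network_configuration=NETWORK, aws_conn_id="aws_default",
        region_name="eu-west-1",
    )
    task_tmap_2 = ECSOperator(
        task_id="Task_tmap_2", task_definition="tmap_2", cluster="talend-fargate",
        launch_type="FARGATE", overrides={"containerOverrides": []},
        network_configuration=NETWORK, aws_conn_id="aws_default",
        region_name="eu-west-1",
    )

    task_tmap_1 >> task_tmap_2   # run the second ECS task after the first
```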

 

A Graph view of DAG run with ECS tasks and execution status

DAG Graph view with Execution status

Title: DAG Graph view with Execution status

 

When the DAG runs, the operator provisions containers for tmap_1 and tmap_2, executes the jobs, and, after completion, stops and deprovisions the containers.

Running container - Task_tmap_1  

Title: Running container – Task_tmap_1  

provisioning container - Task_tmap_2  

Title: provisioning container – Task_tmap_2  

 

Conclusion

Thanks to the Maven and assembly features of Talend Studio, we can also build our Talend jobs as Docker images. The two examples above illustrate how, with orchestration tools like Airflow, we can construct complex workflows with containerized jobs and provision and deprovision containers on EKS and Fargate without worrying about managing the infrastructure.

 

References

Airflow on Kubernetes (Part 1): A Different Kind of Operator

Airflow concepts

Docker Apache Airflow

 


Talend Named a Leader Again in 2019 Gartner Magic Quadrant for Data Integration Tools


From its inception back in 2005 to today, Talend has always sought to be a different kind of data integration company – one rooted in open source to provide our customers with greater choice and flexibility and not afraid to change and embrace new innovations, such as multi-cloud and hybrid support and native integration with serverless cloud platforms.  It gives us great pleasure to announce that Gartner is recognizing Talend as a leader for the fourth time in a row in their 2019 Gartner Magic Quadrant for Data Integration Tools, which we believe validates our vision and market impact. 

Talend leader Gartner Magic Quadrant Data integration 2019

Talend has been able to sustain its place as a market leader by focusing on where data integration and data management are headed rather than where they’ve been.  Earlier this year, Gartner identified “Top 10 Data and Analytics Trends for 2019.” Of the ten trends, two, in particular, pertain to data integration and management.  They are: (1) the implementation of a data fabric for access and sharing of data; and (2) augmented data management utilizing ML and AI, for tasks such as data quality, metadata management, and data integration.  (See, Gartner Identifies Top 10 Data and Analytics Trends for 2019, Trend Nos. 2 and 6, February 18, 2019).

Both of these, along with a laser focus on cloud – deployment of Talend in the cloud, as well as support for hybrid and multi-cloud environments – have been key areas of emphasis for Talend for the last several years. We have continued to build Talend Data Fabric, our unified environment for data integration, governance, and sharing, on-premises and in the cloud. And we have progressively added more and more ML-based capabilities to automate data quality tasks, make data pipelines more intelligent, and enable more non-technical users to gain self-service access to data.

It is this focus on continuous improvement and a willingness to change and change quickly that has helped Talend become a firmly established data integration leader in a short period of time. 

 

The Evolution of Data Integration: Hybrid Cloud, Augmented Automation, APIs, Governance, and More

Today, the amount of data is growing year after year. And this data is increasingly moving to the Cloud.  This, combined with increasing processing power, is providing organizations with vast opportunities to take advantage of their data – from ML and AI-powered analytics to optimize operations, improve forecasting, and enhance customer experiences, to deploying data hubs for greater internal and external sharing of data, and to the reduction of costs through deployment of native serverless cloud platforms. 

According to Gartner, to be considered a leader in this environment, a vendor needs to have a strong vision and ability to execute and a data integration solution that: supports and combines a full range of data delivery styles for both traditional and modern integration use cases, has advanced metadata capabilities with design assistance and ML over active metadata to support design and implementation automation, includes integrated data and application integration as well as hybrid and multi-cloud integration, and has the ability to be efficiently deployed across many locations.

 

Best of Both Worlds Approach Makes Talend a Different Kind of Leader

Data integration is messy and complicated.  Our customers have data on-premises and across multiple cloud environments.  They are looking to move to native serverless cloud platforms to reduce costs and integrate ML and AI into their operations and analytics to increase efficiencies and improve decision making.  And, they want to become more data-driven and allow self-service access to data for both internal and external users. As Gartner indicates, vendors who can make this process easier and less complicated are the ones who are going to remain leaders in the data integration tools market.  

At Talend, we have adopted a best-of-both-worlds approach to meet our customers’ needs and simplify their lives.  We want to make sure that they can deliver data they can trust at the speed they need to succeed.

We have invested heavily in our Talend Data Fabric, which provides complete data integration and data governance. With a single, unified environment that delivers data ingestion, data integration, data preparation, data catalog, data stewardship, data quality, and API management capabilities, our customers do not have to learn multiple tools or deal with multiple support organizations. They can confidently deliver the data their constituents need with both speed and trust. We have also focused greatly on the cloud and provide the ideal solution for on-premises, cloud, and hybrid environments, with native support for cloud deployments and cloud-to-cloud integration. And we deliver an enterprise solution with open source roots and commitment. This enables us to deliver broader and faster adoption of innovative technologies while providing our customers with the benefit of a dedicated user community committed to support and development.

We believe that our best-of-both-worlds approach – making our customers’ lives easier by providing a single, adaptable solution that is not bogged down by legacy technology or optimized for a single environment – has been a key reason for our success. We are confident that our focus on the future and willingness to change and adapt are ideally suited to the constantly evolving data integration and management market.  We encourage you to download a copy of the Gartner report and try Talend for yourself.

 

Download Gartner Report

 

The post Talend Named a Leader Again in 2019 Gartner Magic Quadrant for Data Integration Tools appeared first on Talend Real-Time Open Source Data Integration Software.

Generating a Heat Map with Twitter data using Pipeline Designer – Part 2

First of all, sorry about the delay in getting the second part of this blog written up. Things are moving very fast at Talend at the moment, which is great, but it has meant delaying this second part. I’ve been spending quite a bit of time working with the Pipeline Designer Summer 2019 release. Some really exciting additions are coming to Pipeline Designer, but we will look into those at a later date. For now, we will start looking into configuring our AWS data targets and setting up the beginning of our Pipeline. For those of you who may be reading this without having seen part 1 of this blog series, take a look here.

So, in the last blog we got to the point of having an AWS Kinesis Stream being fed by a Talend Route consuming data from Twitter. AWS Kinesis will be our Pipeline’s source of data. Before we can configure our Pipeline we need to configure our targets for the data.

 

Configuring AWS Data Targets

There will be two AWS targets for our Pipeline. One of them will be an Elasticsearch Service, so that we can analyse our collected data and create a heat map there. The other target will be an AWS S3 Bucket. This will be used to collect any rejected data from our Pipeline.  

 

Configuring an AWS S3 Bucket

First of all, I want to point out that this bucket is not expected to be used very much (given how this project is configured), but it will be useful if we want to extend the functionality to include some extra data filtering. In the Pipeline we have a filter component which filters out messages without suitable GPS data. Since our data is being acquired based on GPS parameters, nothing should end up being filtered out. However, we need to have a target for data that is filtered out. As such, we are using an Amazon S3 Bucket. 

Creating an Amazon S3 Bucket is possibly the easiest thing to set up within AWS. As such, I will simply point to the AWS documentation and give you the details I am using for my bucket so that you can follow the section where I configure the Pipeline to use this bucket. The AWS documentation for this is here.

The S3 Bucket that I created is called “rh-tweets”. 

 

Configuring an AWS Elasticsearch Service

The first thing I want to point out is that you do not need to use Elasticsearch with AWS. You can use your own server or anyone else’s server (if you have permission). I am simply using the service provided by AWS because it is quick and easy for me to set up from scratch. There are cost implications for using AWS services, and for these blogs I am choosing to use the cheapest configurations I can within AWS to achieve my goal. I am pointing this out as these configurations will not be the most efficient and if you want to build something upon what you have learnt in these tutorials, you will likely have to put a bit more thought into the amount of data you will be processing and the performance you will require.

As with the previous blog, I will point to third party documentation when it comes to configuring tools other than Talend. However, for this you will need to follow some of the steps I am taking in order for your Elasticsearch Service to work with everything else in this tutorial. So if you are going to follow the AWS documentation, please pay attention to the steps below. Once you are happy you understand everything, feel free to play around and extend this if required.

 

  1. Log into your AWS console and go to the Services page. Click on “Elasticsearch Service” which is surrounded by a red box below.

  2. On the next screen click on the “Create a new domain” button. This is surrounded by the red box below.
  3. On the next screen select the “Development and testing” radio button for “Deployment type”. This is surrounded by the green box. We are doing this to make this service as open as possible for our development and learning purposes. If you were building something for a production system, you would clearly not use this. 

    For the “Elasticsearch version” we are using “6.3”. At present, this is the highest version that is supported by Pipeline Designer. This is surrounded by the red box.

    Once these have been selected, click the “Next” button, which is surrounded by the blue box.

  4. Now we need to configure our domain. I have chosen “rhall” for my “Elasticsearch domain name” (surrounded by the green box), but you can put whatever you like here.

    For the “Instance type” (surrounded by the blue box) I chose “t2.small.elasticsearch”. This should be OK for our purposes here, but you may wish to go a bit bigger in a production environment.

    For the “Number of instances” (surrounded by the red box) I selected 1. Again, you may wish to play with this for your versions.

    Everything else on this page I have left as standard. It is entirely up to you if you want to tweak these settings, but you will not need to for this tutorial.

    As mentioned above, none of the settings below have been modified from the defaults. All you need to do here is click on “Next” (at the bottom of the page).

  5. When setting up access to our server, we are really cutting a few security corners here. This IS NOT a configuration I would recommend at all, but it is useful for the purposes of this demonstration. This enables us to play around and develop this environment quickly without having to deal with complicated security settings. This is NOT recommended for anything but a sandpit environment. Please keep in mind that third parties that get access to your server details can flood it with data given this configuration.

    OK, now the warning message has been consumed, we can go back to the configuration. I have set the “Network Configuration” to “Public access” (surrounded by the red box).

    Next you need to click on “Select a template” and select the “Allow open access to the domain” option (surrounded by the green box). 

    You will be asked to tick a warning box which will pop-up, pretty much warning you of what I stated above. Do this and click “OK”.

    Once you have clicked the security warning above, you should be left with a screen like below. Click “Next”.

  6. The final configuration screen simply lists your choices. You do not need to make any changes here. Simply click on “Confirm”
  7. We should now be left with the Elasticsearch Service dashboard, showing instances that are up and running or loading. Your instance will stay with a “Domain status” of “Loading” for approximately 10 mins. Once everything is loaded, the screen will look similar to the one below.

    In the screenshot below you will notice two boxes; a red one and a blue one. These are surrounding two important URLs. The red box surrounds the Elasticsearch Endpoint that we will need for the Pipeline Designer Elasticsearch Connection. You will need to use this later. The blue box surrounds the Kibana URL. We will be using Kibana in the next step where we configure Elasticsearch for the data mapping we will need for our data. Once that is done, we are ready to move to Pipeline Designer.

  8. Copy the Kibana URL (surrounded by the blue box above) and load the page in a web browser. You will see the screen below.

    Click on “Dev Tools” towards the bottom of the left sidebar.
  9. The “Dev Tools” screen looks like the one below. We need to add a mapping here. Essentially, this screen makes it easy to send API messages to Elasticsearch. You could do everything here using a third-party REST API tool, but it is just easier to do it here.

    You will notice the following code in the screenshot above….

    PUT heatmap_index
    {  
       "settings":{  
          "number_of_shards":1
       },
       "mappings":{  
          "sm_loc":{  
             "properties":{  
                "latitude":{  
                   "type":"double"
                },
                "longitude":{  
                   "type":"double"
                },
                "location":{  
                   "type":"geo_point"
                },
                "id":{  
                   "type":"text"
                },
                "created_date":{  
                   "type":"text"
                },
                "text":{  
                   "type":"text"
                },
                "type":{  
                   "type":"text"
                }
             }
          }
       }
    }
    

    Essentially, this identifies the elements being supplied by the Pipeline we will build next and assigns each one a data type. Most of these would normally be identified automatically by Elasticsearch; however, there is one special data type that we need to ensure is typed correctly. This is “location”, which needs to be identified as a “geo_point”.

    By loading the above code and clicking on the green triangle next to “PUT heatmap_index”, we are creating an index called “heatmap_index” with mapping information for the data we will be sending (“sm_loc”). You can see more about this here. If you prefer to script this step rather than use the Kibana console, see the sketch just after this list.

    Once we have got to this point, our Elasticsearch Service is ready to start consuming data.
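
If you would rather create the index outside of Kibana, the same PUT request can be sent with any HTTP client. Below is a minimal Python sketch using the requests library; the endpoint URL is a placeholder for the Elasticsearch Endpoint you copied from the AWS console, and the body is the same mapping shown above. Because we left the domain open for this demonstration, no authentication is included.

    # Minimal sketch: create the "heatmap_index" index and its "sm_loc" mapping
    # via the Elasticsearch REST API instead of the Kibana Dev Tools console.
    # ENDPOINT is a placeholder - replace it with your own Elasticsearch Endpoint.
    import requests

    ENDPOINT = "https://YOUR-ELASTICSEARCH-ENDPOINT"

    mapping = {
        "settings": {"number_of_shards": 1},
        "mappings": {
            "sm_loc": {
                "properties": {
                    "latitude": {"type": "double"},
                    "longitude": {"type": "double"},
                    "location": {"type": "geo_point"},
                    "id": {"type": "text"},
                    "created_date": {"type": "text"},
                    "text": {"type": "text"},
                    "type": {"type": "text"},
                }
            }
        },
    }

    response = requests.put(ENDPOINT + "/heatmap_index", json=mapping)
    print(response.status_code, response.json())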

 

Creating a Pipeline to process our Data in AWS Kinesis

We have finally got to the point where we can start playing with Pipeline Designer. However, first we may need to install a Pipeline Designer Remote Engine.

Installing a Pipeline Designer Remote Engine

Depending on how you are using Pipeline Designer, there will be different steps you need to take here. If you are using the Free Trial then you shouldn’t need to worry about installing or configuring a Pipeline Designer Remote Engine and can move on to the next section. You may have been using Pipeline Designer for a while and have already installed a Remote Engine. If so, you can move on as well. But should you need one, some of my colleagues have put together some step-by-step videos on how to do this. You have the following choices of installation…

Remote Engine for Pipeline Designer – The most basic local install

Remote Engine for Pipeline Designer Setup in AWS – An AWS install

Remote Engine for Pipeline Designer with AWS EMR Run Profile – An AWS EMR install

For my purposes here, I have chosen the basic install. However, this project will work with any of the above.

The Pipeline

OK, we are ready to start building. First of all, I will share a screenshot of the Pipeline I have built. For the remainder of this blog and the first part of the final blog (yes, there is one more after this one) I will describe how this is built.

As can be seen, this Pipeline is made up of 7 components. There are 3 Datasets (“Twitter Data”, “S3 Bucket” and “ElasticSearch Tweets”), which need to be created after suitable Connections have been created for them, and then there are the 4 processors (“Window”, “FieldSelector”, “Filter” and “Python”).

I will start by detailing how each of the Datasets and Connections are constructed, then I will demonstrate how each of the components are added to the Pipeline.

Creating the new Connections and Datasets

We have 3 new Connections and Datasets to set up in Pipeline Designer. We need a Connection and Dataset for AWS Kinesis, a Connection for AWS S3 and a Connection for Elasticsearch. These are all pretty straightforward, but they will refer to some values we created while building the collateral in both this blog and the previous blog. You will need to be aware of this and remember where you can find your values, since you will not necessarily be able to simply copy the values I am using (particularly if you have modified any names).

In order to reduce the amount of repeated information, I will cover the initial stages for creating all Connections here. It is the same for every type of Connection, so there is no need to demonstrate this 3 times.

 

  1. Inside Pipeline Designer, click on the “Connections” link in the left sidebar (surrounded by the red box), then press the “Add Connection” button (surrounded by the blue box).
  2. This will reveal the standard Connection form as seen below.

    Depending upon what you select, this form will change. Every Connection will need a “Name” (Red box), a “Remote Engine” (Green box) and a “Type” (Orange box). The “Description” (Blue box) can be useful, but is not essential. Once you have selected a “Type”, further fields will appear to help you configure the Connection correctly for that “Type”.

    Notice the “Add Dataset” button at the bottom of the screen. Once the Connections have been created, we will re-open each Connection and create a Dataset using this button. This saves a tiny bit of configuration, since the Dataset is automatically assigned to the Connection.

Amazon S3 Bucket Connection and Dataset

For the S3 Bucket Connection we will need our AWS Access Key and our AWS Secret Access Key. If you are unsure of how to find these, take a quick look here.

  1. Fill in the “Name”, “Remote Engine” and “Type” as below (you don’t have to call yours the same as mine and your Remote Engine will be from a list available to you), and you will see a “Specify Credentials” slider appear. Click on it to switch it on.
  2. You now need to add your “Access Key” and “Secret Key” (the red and blue boxes). Once you have done this, you can test them by clicking on the “Check Connection” button. 

    If your details are valid, you will see a popup notification like below.

    If everything is OK, click on the “Validate” button to complete configuration.

  3. We now need to create the Dataset. Reopen the Connection we have just created and click on the “Add Dataset” button. You will see a screen which looks like below. This image has all of the settings already applied. I will go through this underneath the image.

    You will see that the “Connection” has already been set because you initiated this from the Connection that is used. You simply need to set the “Name”, the “Bucket”, the “Object” and the “Format”. If you recall, I set my “Bucket” to be called “rh-tweets”. Set this to whatever you called yours. The “Object” refers to the name of the data; think of it like a folder name. The “Format” should be set to “Avro”. This corresponds to the Avro schema we talked about in this blog.

    Once everything has been set, simply click on “Validate” and the Dataset is ready to be used.

Amazon Kinesis Connection and Dataset

As with the S3 Connection, for the Amazon Kinesis Connection we will also need our AWS Access Key and our AWS Secret Access Key. 

  1. Fill in the “Name”, “Remote Engine” and “Type” as below, and you will see a “Specify Credentials” slider appear. Click on it to switch it on.

  2. You now need to add your “Access Key” and “Secret Key” (the red and blue boxes). Once you have done this, you can test them by clicking on the “Check Connection” button. 

    If everything is set up OK and your test works, click on the “Validate” button.

  3. We now need to create the Dataset. As before, reopen the Connection we have just created and click on the “Add Dataset” button. You will see a screen which looks like below. 

    Again the “Connection” has already been set because you initiated this from the Connection that is used. You simply need to set the “Name”, the “AWS Region”, the “Stream”, the “Value Format” and the “Avro Schema”.

    The region I created the Kinesis Stream in is London. AWS uses region codes which do not necessarily correspond to what you see in the console. To make this easier for you, here is a link to the codes that AWS uses.

    The “Stream” I created in the previous blog was called “twitter” and the “Value Format” is Avro. The “Avro Schema” I am using is the schema used in this blog. To make it easier, I will include it below…

    {  
       "type":"record",
       "name":"outer_record_1952535649249628006",
       "namespace":"org.talend",
       "fields":[  
          {  
             "name":"geo_bounding_box",
             "type":{  
                "type":"array",
                "items":{  
                   "type":"record",
                   "name":"gps_coordinates",
    	       "namespace":"",
                   "fields":[  
                      {  
                         "name":"latitude",
                         "type":[
                         	"null",  
                            "double"
                         ]
                      },
                      {  
                         "name":"longitude",
                         "type":[  
                            "null",  
                            "double"
                         ]
                      }
                   ]
                }
             }
          },
          {  
             "name":"gps_coords",
             "type":["null","gps_coordinates"]
          },
          {  
             "name":"created_at",
             "type":[ 
             	"null", 
                "string"
             ]
          },
          {  
             "name":"text",
             "type":[  
                "null", 
                "string"
             ]
          },
          {  
             "name":"id",
             "type":[  
                "null", 
                "string"
             ]
          },
          {  
             "name":"type",
             "type":[  
                "null", 
                "string"
             ]
          }
       ]
    }
    
  4. There is one more very important step for this Dataset that will help us when building our Pipeline. It is REALLY useful for Datasets used as sources (as this one will be) to have a sample of the data they will be delivering, because as you build the Pipeline you can see exactly what will happen to the data. In order to get a sample of our data, there are a couple of things we need to do. The first thing is to ensure that our Kinesis Stream is live. The configuration of the Kinesis Stream can be seen here.
  5. Once the Kinesis Stream is live, we need to send Twitter data to it using the Talend Route we created in this blog. Simply load the Route and start it running.
  6. Once a few records have been loaded into our Kinesis Stream, we can start retrieving the sample for our Dataset. Assuming that we are still looking at our Dataset after having copied the Avro schema into it, click on the “View Sample” button (this is where the “Refresh Sample” button, surrounded by the red box below, is located).

    Once the sample has been obtained (this might take a minute or so), you will see the data at the bottom of the screen. This can be seen inside the blue box above. You can explore this data and refresh it with new data if you choose.

    Once you are happy, simply click on the “Validate” button and the Dataset is ready to be used. If you want to double-check outside of Pipeline Designer that records really are reaching the stream, see the sketch just below this list.
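
For completeness, here is a quick boto3 sketch that reads a few raw records back from the stream. It is only a verification aid: it assumes your AWS credentials are configured locally (e.g. via aws configure), uses the “twitter” stream name and the London (eu-west-2) region from this walkthrough, and the payloads it prints are Avro-encoded bytes rather than readable JSON.

    # Minimal sketch: read a few raw records from the "twitter" Kinesis stream.
    # Assumes local AWS credentials and the London (eu-west-2) region used here.
    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-west-2")

    shard_id = kinesis.describe_stream(StreamName="twitter")["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName="twitter",
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    records = kinesis.get_records(ShardIterator=iterator, Limit=5)["Records"]
    print("Fetched", len(records), "record(s)")
    for record in records:
        # The Route writes Avro-encoded payloads, so the data is binary, not plain JSON.
        print(record["PartitionKey"], len(record["Data"]), "bytes")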

 

Elasticsearch Connection and Dataset

For the Elasticsearch Connection we are not concerned with AWS credentials since we have left this open as explained here. So this Connection is pretty easy to configure.

 

  1. Simply fill out the details required for the other Connections and use the Endpoint given to you here for the “Nodes*” config (in Orange). Click the “Check Connection” button and if everything is OK, click the “Validate” button.
  2. Finally we need to create the Dataset. Reopen the Connection as before and click on the “Add Dataset” button. You will see a screen which looks like below. 

    Simply add a “Name”, the “Index” and the “Type” here. The “Index” and “Type” are briefly spoken about here. If you did something different when configuring your index and mapping for Elasticsearch, ensure that you put the appropriate values here.

We are now ready to start putting our Pipeline together.

In the final part of this blog series, I will describe how this Pipeline is built, how to run it and how we can create a Heatmap to display the data that has been processed.

The post Generating a Heat Map with Twitter data using Pipeline Designer – Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

Talend Data Trust Readiness Report: How Companies Cope with Modern Data Challenges

We’ve entered the era of the information economy, where data has become the most critical asset of every organization. To support business objectives such as revenue growth, profitability, and customer satisfaction, organizations are increasingly reliant on data to make decisions and drive their operations.

Since data is such an important strategic asset for most organizations, we wanted to know how much confidence data users (e.g. data architects, developers, and data engineers) as well as executives actually put in their current organization’s data. This report aims to help you to answer the following critical questions: Do you trust your data? Is your organization able to deliver it at the speed of the business? Is your organization a leader in digital transformation or is it lagging behind the competition?

 

In April 2019, Talend commissioned Opinion Matters to survey 763 data professionals to evaluate their confidence in delivering trusted data at speed.

3 gaps revealed: integrity gap, speed gap and execution gap

 

 

 

Data quality confidence remains low

The survey shows that only 38% of respondents believe their organizations excel in controlling data quality. Less than one in three (29%) data operational workers are confident their companies’ data is always accurate and up-to-date. 

360 real-time data integration is still a challenge

Having data on time accelerates change and drives decisions when and where they make the most business impact. According to the survey, only 34% of data workers believe in their organization’s capability to succeed in a 360 real-time data integration process, whereas executives feel more confident (46%). The real-time challenge is not trivial. It also relies on the organization’s willingness to invest in modern cloud-based systems such as data warehouses, data lakes or data hubs. And the challenge doesn’t stop at collecting and connecting data; the data also has to be made actionable in real time.

Significant difference between management and operational workers

The closer people are to enterprise data, the less confident they are about their organization’s ability to deliver trusted data at speed; while 49% of respondents at a management level feel very confident about having standing access to data, only 31% of data operational workers feel the same. This execution gap is highlighted even on the compliance side, where trust is a regulatory mandate that all organizations need to enforce: while the majority of managers feel very optimistic (52%), only 39% of operational workers share this perception.

10 capabilities needed to cope with modern data challenges

The journey to data excellence

Talend publishes this Data Trust Readiness report to provide step-by-step recommendations and best practices to achieve data leadership. Talend has identified 10 capabilities (5 related to speed, 5 related to integrity) that are essential to guarantee trusted data in any organization. The report covers what, why and when it is essential to master each of these capabilities over the data lifecycle process to become a digital leader. The top leaders scored highest on each of these capabilities: they feel fully confident about the ability of their organizations to deliver the right data, at the right time, to any user. As they feel fully in control of their data, they feel very confident about their organizations’ capability to protect data to comply with regulations such as GDPR. However, at the same time, they have confidence in their ability to unlock their data for broader, self-service access, thereby enabling their employees and stakeholders while avoiding shadow IT.

Are you a leader in delivering trusted data at speed? Download the report and benchmark yourself against your peers according to the 10 capabilities identified to achieve data excellence. Does your journey still lie ahead? Follow the step-by-step recommendations as well as best practices and get inspired by real-world examples.

 

Download Talend Data Trust Readiness Report

 

The post Talend Data Trust Readiness Report: How Companies Cope with Modern Data Challenges appeared first on Talend Real-Time Open Source Data Integration Software.

Data Privacy through shuffling and masking – Part 1

Protecting sensitive data can be a challenging task for companies. In a connected world in which data privacy regulations are continually changing, some techniques offer strong solutions for staying compliant with the latest requirements, such as the California Consumer Privacy Act (CCPA) in the United States or the General Data Protection Regulation (GDPR) in Europe.

In this two-part blog series, we aim to explore some of the techniques giving you the ability to selectively share production quality data across your organization for development, analysis and more without exposing Personally Identifiable Information (PII) to people who are not authorized to see it.

To guarantee Data Privacy, several approaches exist – especially the following four:

  • Encryption
  • Hashing
  • Shuffling
  • Masking

Choosing one of them – or a mix of them – mainly depends on the type of data you are working with and your functional needs. Plenty of literature is already available regarding Encryption and Hashing techniques. In this first part of the two-part blog series, we will take a deep dive into Data Shuffling techniques. We will cover Data Masking in the second part.

Data Shuffling

Simply put, shuffling techniques aim to mix up data and can optionally retain logical relationships between columns. They randomly shuffle data from a dataset within an attribute (e.g. a column in a pure flat format) or a set of attributes (e.g. a set of columns).
You can shuffle sensitive information to replace it with other values for the same attribute from a different record.

These techniques are generally a good fit for analytics use cases, guaranteeing that any metric or KPI computed on the whole dataset remains perfectly valid. This allows production data to be safely used for purposes such as testing and training, since all the statistical distributions stay valid.
One typical use case is the generation of “test data”, where data that looks like real production data is needed as input for a new project (e.g. for a new environment), while guaranteeing anonymity and keeping the data statistics exactly the same.

Random Shuffling

Let’s take a first simple example where we want to shuffle the following dataset.

By applying a random shuffling algorithm, we can get:

The different fields from phone and country have been correctly mixed up, and the original link between the columns has been completely lost. Per column, all statistical distributions remain valid, but this shuffling method causes some inconsistencies. For instance, the telephone prefixes vary within the same country and the countries are no longer correctly paired with names.

This method is a good fit and might be sufficient if you only need statistical distributions to remain valid per column. If you need to keep consistency between columns, specific group and partition settings can be used.
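
To make the idea concrete, here is a minimal pandas sketch of per-column random shuffling on a small, made-up dataset (the names, phones and countries are purely illustrative):

    # Minimal sketch: independently shuffle each sensitive column.
    # Per-column statistics are preserved, but links between columns are lost.
    import pandas as pd

    df = pd.DataFrame({
        "name":    ["Alice", "Bruno", "Chen", "Dana"],
        "phone":   ["+33 6 11 11 11", "+49 151 22 22", "+86 138 33 33", "+1 415 44 44"],
        "country": ["France", "Germany", "China", "USA"],
    })

    shuffled = df.copy()
    for column in ["phone", "country"]:
        # sample(frac=1) returns the column in a random order; resetting the index
        # writes the shuffled values back row by row, breaking the original pairing.
        shuffled[column] = df[column].sample(frac=1).reset_index(drop=True)

    print(shuffled)

Running this a few times reproduces exactly the kind of inconsistency described above: phone prefixes and countries no longer line up with the names.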

Designating Groups

Groups can help preserve value association between fields in the same row. Columns that belong to the same group are linked and their values are shuffled together. By grouping the phone and the country columns, we can get:

After the shuffling process, values in the phone and country columns are still associated. But the link between names and countries has completely been lost.

This method is a good fit and might be sufficient if you need to keep functional cross-dependencies between some of your attributes. By definition, the main drawback is that grouped columns are not shuffled between themselves, which leaves some of the initial relationships accessible.

Designating Partitions

Partitions can help preserve dependencies between columns. Data is shuffled within partitions, and values from different partitions are never associated.
By creating a partition for a specific country, we can get:

 

Here, the name column still stays unchanged. The phone column has been shuffled within each partition (country). Names and phones are now consistent with the country.

In other words, the shuffling process was applied only to rows sharing the same value in the country column.
Sensitive personal information in the input data has been shuffled, but the data still looks real and consistent. The shuffled data is still usable for purposes other than production.

This method is a good fit and might be sufficient if you need to keep functional dependencies within some of your attributes. By definition, the main drawback is that values from different partitions are never associated, which leaves some of the initial relationships accessible.
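
Staying with the same kind of toy dataset, the sketch below contrasts the two behaviours just described: grouped columns are shuffled together as a unit, while partitioned shuffling only permutes values among rows that share the same partition key (here, the country). The values are again purely illustrative.

    # Minimal sketch of group-based and partition-based shuffling with pandas.
    import pandas as pd

    df = pd.DataFrame({
        "name":    ["Alice", "Bruno", "Chen", "Dana", "Emile", "Farid"],
        "phone":   ["+33 6 11", "+33 6 22", "+49 151 33", "+49 151 44", "+33 6 55", "+49 151 66"],
        "country": ["France", "France", "Germany", "Germany", "France", "Germany"],
    })

    # Group: shuffle (phone, country) together so they stay paired with each other,
    # but their link to the name column is broken.
    grouped = df.copy()
    grouped[["phone", "country"]] = (
        df[["phone", "country"]].sample(frac=1).reset_index(drop=True)
    )

    # Partition: shuffle phone only within each country, so each phone still
    # matches its country (though not necessarily its original owner).
    partitioned = df.copy()
    partitioned["phone"] = df.groupby("country")["phone"].transform(
        lambda s: s.sample(frac=1).values
    )

    print(grouped)
    print(partitioned)

The grouped result keeps phone and country paired, while the partitioned result keeps every phone consistent with its country, mirroring the behaviour described above.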

As a reminder, shuffling algorithms randomly shuffle data from a dataset within a column or a set of columns. Groups and partitions can be used to keep logical relationships between columns:

  • When using groups, columns are shuffled together, and values from the same row are always associated.
  • When using partitions, data is shuffled inside partitions; values from different partitions are never associated.

Shuffling techniques might look like a magical solution for analytics use cases. Be careful. Since the data is not altered per se, it does not always prevent someone from getting back to some of the original values using pure statistical inference.

  • Take the example where you have a dataset with the population of French cities and you want to perform data shuffling on it. Since you have about 2.5M individuals in Paris, which is by far the biggest city in France, you will have just as many occurrences of that value after shuffling as before. This would make it easy to “un-shuffle” this specific value and retrieve the original one.
  • Even worse – take the example of a single value representing more than 50% of a dataset. After having been shuffled, some of the input values will have exactly the same output values, as if they had not been shuffled at all!

Ultimately, shuffling techniques can certainly be used in addition to other techniques such as masking, especially if the predominance of a given value has to be broken. This concludes the first part of the two-part series on data privacy techniques. Stay tuned for the second part, where we will discuss, in detail, the data masking technique as an effective measure for data privacy. 

 

The post Data Privacy through shuffling and masking – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.

Continuous Integration made easy in the Cloud with Talend

In this blog I would like to highlight a new capability from our latest 7.2 Studio and its integration with Talend Cloud in the Summer ’19 release. While this feature could be seen as a minor improvement aimed at easing the life of Talend users, it’s in fact much more than that!

While Talend has been delivering CI/CD capabilities for a very long time, the continuous integration tools landscape is evolving fast. Here at Talend we have kept pace to allow you to use the latest technologies and run your CI/CD pipelines faster than ever. That is why Talend has introduced the Zero Install CI feature.

Simply put, you won’t have to manually install the Talend CommandLine. The CommandLine is the application that, along with Maven, assumes all CI/CD-related tasks such as builds, tests, and publications within the Talend product.

Why is it a game changer?

This new capability has much more impact than just facilitating the setup of CI/CD pipelines. It allows us to take advantage of cloud-hosted environments for our builds. Before I explain what these environments are and what they can bring to us, let me remind you how we used to approach continuous integration. Most of the time, we would dedicate one or more servers to CI tasks, meaning that we would have to provide physical servers or virtual machines (on-prem or in the cloud) to run our builds and tests. We would also need to manage, update, patch, and monitor these environments daily.

Talend Continuous Integration, Continuous Delivery and Continuous Deployment

 

While this situation was suitable and worked great, the rise of new cloud-native services such as AWS CodeBuild or Azure Pipelines introduced a new way to perform these tasks. Instead of providing our own machines, these new CI/CD tools give us the possibility to use on-demand environments hosted and managed by these cloud service providers.

AWS calls them build environments whereas Azure DevOps uses the term Microsoft-hosted agents. In both cases, these environments will take care of the builds and tests for you when requested. There are no servers to set up, and you pay as you go. These environments can be virtual machines or containers that spin up on-demand and come with everything you would expect to build artifacts, such as Maven, npm, Gradle or even Docker.

Undoubtedly, the Zero Install CI feature allows us to take advantage of these on-demand environments by not requiring the manual setup of a Talend CommandLine beforehand. We can leverage one of these environments and let it automatically install the Talend CommandLine on the fly and build our artifacts right afterwards.

Another benefit of the Zero Install CI is that it is also much easier to leverage Kubernetes or any other container platform for your build executions. Indeed, most of these tools now allow you to execute your builds in containers to simplify and standardize your CI/CD pipelines.

Great, what are the benefits?

The benefits of such architectures are multiple.

  1. No more CI software to install, you can stay within your favorite cloud provider console.
  2. Installation, maintenance and upgrades are taken care of for you. No need to worry about disk space or network slowness anymore.
  3. You get brand new environments each time you run a pipeline. Your builds are immutable and consistent between two builds.
  4. You scale up and down automatically depending on your build workload. Pipelines are run independently and concurrently. No more builds are left behind in the queues!
  5. And finally, you are charged based on the time to complete your builds. No more paying for unused servers.

 

How does it work in practice?

Technically, we have integrated within our Maven plugin a new mechanism to pull the Talend CommandLine from a given location. Instead of indicating where you have installed your CommandLine, you now specify a URL from which it can be pulled and installed for you. As you need to pull it, you will have to host it. A simple HTTP server will be enough, or you can choose to host it in Amazon S3 or Azure Blob Storage. The only thing needed is access from your build workers. For the Zero Install CI, we chose to distribute the CommandLine as a P2 repository. P2 repositories are very common in the Eclipse space and are composed of artifacts and metadata. They are independent of Maven repositories but fit well within the Maven ecosystem.
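
For a quick local test of this setup, any static file server will do. As an illustration only, assuming the unzipped CommandLine P2 repository sits in a folder called commandline-p2 (a hypothetical path), Python’s built-in HTTP server (Python 3.7+) is enough to expose it to your build workers:

    # Minimal sketch: serve the CommandLine P2 repository over HTTP so build
    # workers can pull it. The "commandline-p2" directory name is a placeholder.
    from functools import partial
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    handler = partial(SimpleHTTPRequestHandler, directory="commandline-p2")
    HTTPServer(("0.0.0.0", 8000), handler).serve_forever()

In practice you would of course put the repository somewhere durable such as Amazon S3 or Azure Blob Storage, as described above.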

Figure 1: CI/CD workflow with hosted build environments

As previously mentioned, there are several services giving you the possibility to use hosted build environments on-demand. Here is a non-exhaustive list:

 

Of course, you don’t have to stick to hosted build environments to use the Zero Install CI. If you still want to use Jenkins with your own provisioned machines, it will help you install the Talend CommandLine in an easy and efficient manner. In a static mode like this, the Zero Install CI will only be actioned the first time, while subsequent builds will continue to use the same CommandLine.

Conclusion

Talend CI/CD capabilities are fully compatible with all CI/CD tools providing Maven support, which is one of the most popular standard package management and building tools. With the Zero Install CI, we are going even further. No more complex install and configuration, and you can take advantage of hosted build environments in your favorite cloud native CI/CD tool.

 

The post Continuous Integration made easy in the Cloud with Talend appeared first on Talend Real-Time Open Source Data Integration Software.


5 best practices to deliver trust in your data project: Tip #1 Master Data Quality

Over the summer, the Talend Blog Team will take turns sharing fruitful tips to help you securely kick off your data project. This week, we’ll start with the first capability: making sure the data you create, develop and share within your organization stays clean and governed.

 

Master data quality with trustworthy, complete and up-to-date data assets

According to Harvard Business Review, 47% of newly created data has at least one critical error. Poor data quality adversely affects all organizations on many levels, while good data quality is a strategic asset and a competitive advantage to the organization. Having the ability to master data quality can really make the difference: it’s a key component for any organization willing to gain value out of its data.

 Data quality is in far worse shape than most managers realize

Harvard Business Review reveals that 47% of newly-created data records have at least one critical error

 

Bad data can come from every department within your organization – sales, marketing or engineering – and in diverse forms.

Some examples of bad data include:

  • First names and surnames with missing marks
  • National ID numbers with an invalid suffix
  • Credit cards exposed to unauthorized persons
  • Obsolete post codes or incomplete billing addresses
  • Heavily abbreviated names, surnames and addresses
  • Miscellaneous remarks not stored in designated fields
  • Wrong or missing product references

 

All of these examples can negatively affect your income statement in the long term if nothing is done.

Our recent Data Trust Readiness report reveals that only 43% of data professionals believe their organizations’ data is always accurate and up-to-date. That figure falls to 29% for data practitioners. This shows that the problem seems to be under control at the top level, but data workers are less confident. Download the Guide to learn more.

 

When is Data Quality needed?

Data Quality is required at every stage of the data lifecycle.

Data quality is a process that needs to be pervasive throughout your data lifecycle – all the time, all users, for all projects: you will need to provide inflight data quality self-service tools to enable business experts and empower business people with stewardship applications to resolve missing data over time.

What are key capabilities to look at?

When we’re talking about Data Quality, some key capabilities rise to the surface: profiling, deduplication, matching, classification, standardization, remediation and masking are among them. It is good to know that they are all integrated into the Talend Platform, making them accessible to a wide array of technical and business users.
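
To give a flavour of the most basic of these capabilities, profiling, outside of the Talend tools themselves, the sketch below computes a few simple quality indicators on a made-up customer extract (all column names and values are illustrative only):

    # Minimal profiling sketch: completeness, distinct counts and a simple
    # validity check on a made-up customer extract (illustrative values only).
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [1, 2, 3, 4, 4],
        "email":       ["a@example.com", "not-an-email", None, "d@example.com", "d@example.com"],
        "post_code":   ["75001", "69002", "ABCDE", None, None],
    })

    profile = pd.DataFrame({
        "non_null_ratio":  customers.notna().mean(),
        "distinct_values": customers.nunique(),
    })
    # A very naive validity rule, just to show the idea.
    profile.loc["email", "valid_email_ratio"] = (
        customers["email"].str.contains("@", na=False).mean()
    )

    print(profile)

In the Talend Platform, this kind of profiling is of course done for you by Data Catalog and the Data Quality components, at much greater depth.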

How to get started:

Regardless of its size, function, or market, every organization needs to take data quality seriously. Data quality cannot be an afterthought; otherwise it will soon become the main obstacle to your data-driven transformation. Start by discovering your data assets and understanding the data quality issues and how they can negatively impact your decisions and operations. Then cleanse your data as soon as it enters your information chain, with the right stakeholders on board and tools that can automate whenever possible.

How Talend tools can help

Data Quality is everywhere in the Talend Platform and shared with everyone. It starts with Data Catalog, which can automatically and systematically profile and categorize any data set, present data samples and profiling results as part of the data discovery process, and assign accountabilities, so that data owners can be responsible for the most critical data sets and take action when data quality issues are highlighted.

Profiling is also delivered across tools and personas, from a business analyst using Data Prep to a data engineer using Pipeline Designer and its new trust index (a great new feature from the Summer 18 release), or an IT developer using Talend Studio.
Once data quality issues have been identified, they can be handled automatically as part of Talend Data Pipelines in Talend Studio or Data Prep. Rules for data protection, for example data masking, can be applied there as well.

In some cases, data remediation requires manual intervention for arbitration and correction. This is where workflows in which anyone can participate matter, and this is where Data Stewardship comes into play.

What happens next?

Once you have established a data quality culture for every department, with data literacy programs and modern accessible data quality tools, behaviors will change, and people will start taking care of bad data and avoid polluting data systems with inaccurate or incomplete data. 

 

Some questions to ask yourself, your IT team and your organization:

1. How do you discover your data?

As a matter of fact, you cannot solve any data quality issue if you don’t have a global view of your data quality.

2. How do you measure the cost of bad data and the ROI of data quality?

Making sure you can track progress will help you highlight the problems and gains associated with it.

3. How do you engage data stewards for data consistency and accountability?

Experts need to be part of the Data Quality loop. Talend Data Stewardship helps them correct and reduce human errors in data pipelines.

4. How do you automate data quality remediation?

Data Quality is not a set of repetitive manual tasks. It can be fully automated. This short video shows you all the power of Talend Studio with Data Quality Components.

 

Want to explore more capabilities?

This is the first of ten trust & speed capabilities. Can’t wait to discover our second capability?

Go and download our Data Trust Readiness Report to discover other findings and the other 9 capabilities, or stay tuned for next week’s Tip #2: get control over your organization’s data.

The post 5 best practices to deliver trust in your data project: Tip #1 Master Data Quality appeared first on Talend Real-Time Open Source Data Integration Software.

The Privacy Hazard in High Tech Heritage

DNA kits like 23andMe, Helix and AncestryDNA topped holiday gift guides again this past year. Kits range from $60 to $200 on the market, and they’re meant to help consumers understand family history and genealogy, and can even connect unknown family members. Collecting genetic data can also have broader impacts in healthcare and justice for law enforcement. We’ve seen this play out in the arrest of the suspected Golden State Killer Joseph James DeAngelo as a result of DNA from an ancestry website.

However, companies compensate for low product cost through investors, who can have potentially unsavory intentions. While these products can explore how DNA impacts your health, lead to innovations in the healthcare system and solve more crimes, the amassed information and third parties involved also lend themselves to privacy concerns.

Putting your family history in someone else’s hands

Profit and business will always impact companies no matter the company vision and goals. Recently, GlaxoSmithKline (GSK) invested $300 million in 23andMe. While both companies insist that this partnership will only further the availability of beneficial medications, it is important to note that consumers who shared their genetic data with 23andMe before the investment will still be impacted. Consumers who do not consent to sharing their identity will be anonymized, but the rest of their genetic information will be available to GSK.

But, if you dig into the details, it becomes clear that genetic information shared with DNA testing companies can never be truly anonymized, and in using these services consumers risk losing control of the data they provide. With an opt-in status that is unclear, consumers are left unsure of who is using their data – and for what. When you look at data privacy scandals like Cambridge Analytica, the problem was that privacy related data collected by a first party was then shared with a third party. This is where the control was lost – and once actors in the field are driven by profit through services delivered to third parties, they run the risk of deprioritizing data privacy.

Direct-to-consumer products like AncestryDNA and 23andMe do not have to abide by the same regulations, such as the Health Insurance Portability and Accountability Act (HIPAA), to which doctors are bound. And at times it’s in the best interest of the government that this genetic information be shared. FamilyTreeDNA gave the FBI access to its database of more than 1 million users’ data that allowed agents to test DNA samples from crime scenes against customers’ genetic information. In future incidents, the government could hypothetically subpoena these companies for access to DNA. It’s also important to note that individuals do not need to be the first-party granter of their DNA to be found in a system. If a family member, even down to your third cousin, takes a DNA test kit, the privacy of your DNA is also at risk. Furthermore, scientists are now able to identify unique mutations in an anonymous sample of DNA to pinpoint its owner.

The fact that this data shares information about an individual’s family members takes this case beyond “personal”. We need to find a way to better control “personal data once removed” with DNA.

How to protect DNA data for beneficial use cases

Despite potential risks, it’s clear that there are advantages to collecting such a vast amount of genetic data. Leaps can be made in healthcare discoveries, killers can be brought to justice, and families can be reunited. It’s all a question of how to ensure the data is only used for good.

Consumers should be able to trust the companies they share their genetic information with and use the platforms for their original intent: to connect users with their genealogic history and family members. Similar to Facebook as it relates to social data and trust, DNA testing companies are facing a huge trust challenge when it comes to growth, from addressing risk to the future of their business to ultimately reaping the full benefits that DNA data can bring to both the economy and society. In order to regain public trust, DNA testing companies must implement regulations on data sharing and privacy.

Additionally, there is a strong need for regulation in this space. In healthcare, regulations such as HIPAA have established very strong rules for data protection that demand providers anonymize and scrub out any obvious characteristics before medical companies can share and eventually sell a patient’s data. Unfortunately, with DNA-specific regulations in their infancy, this data does not fall within HIPAA regulations. This raises a large number of questions given the increase in consumer interest (AncestryDNA alone sold about 1.5 million kits) and the dropping cost of at-home kits as demand continues to grow.

The Future of DNA Kits
Genetic testing kits provide benefits from family background and healthcare innovation to the justice system, but privacy regulations in the field are lax and allow data to be shared without consumer consent. Companies are influenced by investors and profit, and this impacts their decisions on how to use the data they’ve collected. Whether or not the government decides to implement more stringent regulations is yet to be seen, but in the meantime companies should employ data governance to protect the personal data they amass.

The post The Privacy Hazard in High Tech Heritage appeared first on Talend Real-Time Open Source Data Integration Software.

Generating a Heat Map with Twitter data using Pipeline Designer – Part 3

If you have got through part 1 and part 2 of this series of blogs, there are only a few more steps to carry out before you can see the end-to-end flow of data and create your Heatmap. If you have not read the first two blogs, the links to them are above. Although these blogs have been quite lengthy, I hope you understand that I have tried to make sure that readers of any level of experience can achieve this. Since Pipeline Designer is a new product, I felt that it made sense to be as explicit as possible.

In this last blog I will talk about putting your Pipeline together and running it. Once you have finished this, I am hoping you will be able to extrapolate from this to build a Pipeline to achieve your own use cases or to extend this use case for your purposes.

But now, it’s time to build your Pipeline….

Building the Pipeline

By now we should have all of our Datasets created and be ready to use them. I will describe each Processor we are using as shown in the Pipeline screenshot that can be seen here. However, first we need to create our Pipeline.

  1. To create the Pipeline, select the “Pipelines” link in the left sidebar (in a red box in the image below), then click on the “Add Pipeline” button (in the blue box below).
  2. When the Pipeline screen opens, we need to give it a name (where the blue box in the screenshot below is), we need to select a “Remote Engine” (the drop down in the red box), and we need to add our “Source” (click on the “Add Source” + symbol surrounded by the green box). 

  3. You will see the next screen pop-up. This is where we can select our Dataset to use as a source. Select the required Dataset (I’ve selected “Twitter Data” as seen below) and click on  “Select Dataset” (surrounded by the red box).

    When the pop-up screen disappears, we will be presented with the layout below.

  4. If you went through the process of retrieving the sample data for our Dataset (described here), you will see the “Data Sample” shown in the green box below. This will be massively useful in our debugging while we are building this Pipeline. 

  5. Now it is time to add our first Processor. The first Processor we will add is a “Window” Processor. These are described here. Windows allow us to chunk our streaming data into mini batches. This allows us to build our Pipelines in the same way for both batch and streaming Pipelines, and it also lets us control how the stream of data will be batched. Take a look at the link I gave you above to see more about the different ways in which you can use Windows.

    To add a “Window” Processor simply click on the green “+” symbol (surrounded by the orange box below) and an “Add a processor” pop-up box will appear. Select the “Window” Processor and the box will disappear, leaving your new Processor on your Pipeline.

  6. Once the “Window” Processor is on your Pipeline, select it and you will see the configuration side panel showing you the config options. I have chosen to leave the defaults shown in the blue and red boxes. They essentially create a batch of 5 seconds worth of data, every 5 seconds. This is known as a “Window train”, the simplest type of batching.

  7. Our next Processor is the “Field Selector”. We click on the green “+” symbol to the right of our “Window” Processor (surrounded by the red box below), then select the “Field Selector” option. This will then appear in our Pipeline.
  8. The “Field Selector” Processor is used to filter, reorganise and rename fields. This component is described here. We will not be removing any fields; we will simply be changing the order and renaming a couple. As seen in the screenshot below, the first field we are going to deal with is the “id” field. We will keep its name of “id” (in the blue box), but it will be selected first, therefore changing its position in the document. Its path is “.id”, which is the avpath of the field.

    After each field has been configured, click the “New Element” button to create a new field. Once all fields are created, click the “Save” button. The following fields need to be configured:

    Field Name      Path
    id              .id
    gps_coords      .gps_coords
    bounding_box    .geo_bounding_box
    created_date    .created_at
    text            .text
    type            .type

    Once all of the fields have been configured and the Processor has been saved, your Pipeline’s Data Preview will look like below. 


    Notice that the input and output are different. This is an example of the usefulness of having the Data Preview while you are configuring your Pipeline.

  9. Our next Processor is the “Filter” Processor. This is used for filtering data into two sets. It is described here. As with other Processors, we click on the green “+” symbol to the right of the last Processor we added, then select the Processor required from the pop-up. In this case, we are selecting the “Filter” Processor.

  10. This “Filter” is going to be used to ensure that all data being sent to our Elasticsearch instance has GPS data. As soon as this Processor is dropped into the Pipeline, it will have one filter element waiting to be configured. We will need to have two filter elements. To add a second, we click on the “New Element” button.

    As you can see, there are 4 properties to configure per element. The table below lists the settings I am using for both fields being used to filter.

    Field Path          Apply a function first    Operator    Value
    .gps_coords         NONE                      !=          NONE
    .bounding_box[0]    NONE                      !=          NONE

    All of the “Field Path” values follow the avpath specification.

    Once both of our filter elements are configured, we need to set the “Select rows that match” to “ANY”. This means that as long as one of the filter elements is true, the data can proceed. Once that is done, select “Save”.

  11. You will notice that there are two outputs for the “Filter” Processor. There is an output for the data filtered in (the data we want) with a blue line, and an output for the data filtered out (the data we do not want) which has a yellow line. We are adding our “S3 Bucket” Dataset as the Destination for our rejects here.  The data we will be gathering for this Pipeline should always have GPS data, so we shouldn’t see much, if anything going to S3. However, we may wish to add some more filter elements in the future to filter out Tweets based on other logic. So this is useful to have in place.

    Click on the “+” symbol above “Add Destination” which is in the box linked to the yellow output from the “Filter” Processor (surrounded by the blue box below). This will reveal a pop-up showing available Datasets. We need to select the “S3 Bucket” Dataset, then click on “Select Dataset”.

  12. When we configure this Dataset in the configuration sidebar on the right, we just need to make one change to the defaults. We need to change the “Merge Output” slider to be set to on, as seen below. This will mean that the data will be appended for every run of the Pipeline.

  13. The next Processor to be added is the “Python” one. This allows us to use the power of Python to carry out some changes to our data before it is sent to Elasticsearch. 

    To do this, click on the green “+” symbol along the blue “Filter” output line (surrounded by the blue box) and select the “Python” Processor.

  14. When we configure this Processor in the configuration sidebar on the right, we need to carry out two tasks. The first is simply to ensure the “Map type” is set to “Map”. The second task is to set the “Python Code”. I have included the code to use after the image below.

    The following code needs to be copied and pasted into the “Python Code” section of the  configuration window.

    # Here you can define your custom MAP transformations on the input
    # The input record is available as the "input" variable
    # The output record is available as the "output" variable
    # The record columns are available as defined in your input/output schema
    # The return statement is added automatically to the generated code,
    # so there's no need to add it here
    
    # Code Sample :
    
    # output['col1'] = input['col1'] + 1234
    # output['col2'] = "The " + input['col2'] + ":"
    # output['col3'] = CustomTransformation(input['col3'])
    
    # Start from an empty output record and default the GPS fields to None
    output = json.loads("{}")
    output['latitude']=None
    output['longitude']=None
    output['location']=None
    output['id']=input['id']
    output['created_date']=input['created_date']
    output['text']=input['text']
    output['type']=input['type']
    
    if input['gps_coords']==None:
        # No exact GPS point was supplied, so approximate one by averaging the
        # four corners of the Tweet's bounding box (a simple centroid)
        output['latitude']=(input['bounding_box'][0]['latitude']+input['bounding_box'][1]['latitude']+input['bounding_box'][2]['latitude']+input['bounding_box'][3]['latitude'])/4
        output['longitude']=(input['bounding_box'][0]['longitude']+input['bounding_box'][1]['longitude']+input['bounding_box'][2]['longitude']+input['bounding_box'][3]['longitude'])/4
        # Elasticsearch accepts a geo_point as a "lat,lon" string
        output['location']=str(output['latitude'])+','+str(output['longitude'])
    else:
        # Exact GPS coordinates are available, so use them directly
        output['latitude']=input['gps_coords']['latitude']
        output['longitude']=input['gps_coords']['longitude']
        output['location']=str(output['latitude'])+','+str(output['longitude'])
    
    

    The code above creates a new field called “location” and also sets the “latitude” and “longitude” values. If a bounding box supplied our only GPS data, we compute an average “latitude” and “longitude” from its four corners. The “location” is a comma-separated String value made up of “latitude” and “longitude”. The other fields are output along with these computed values.

    Once this is done, simply click the “Save” button.

  15. Once the “Python” Processor has been saved, you will see how it has changed the input record (surrounded by the blue box) to the output record (surrounded by the red box).

  16. The last step we need to carry out is to add our Destination. To do this, click on the “+” symbol above “Add Destination” (surrounded by the red box), then select your “ElasticSearch Tweets” Dataset. Click “Select Dataset”.

    Your Pipeline is now created and ready to run.

 

Running the Pipeline and Generating the Heatmap

The last part of this, and the part you have been waiting for, is to join all of the dots and run this project. So far we have created a Talend Route to collect the Twitter data and send it to an AWS Kinesis Stream, built a Pipeline to consume that data and send it to Elasticsearch, created the Elasticsearch service and configured the required Elasticsearch index and mapping using Kibana. Before we can see the Heatmap, we need to switch on all of the components, get some data into Elasticsearch and then create our Heatmap. This section will talk you through each of those steps.

  1. The first thing we need to do is make sure that our AWS Kinesis Stream and Elasticsearch service are running. When I described building each of these, we got them to a point where they were live. Check that they are still live.
  2. Once they are live, we need to go back to our Talend Route. This was configured to use Twitter credentials. So long as those credentials are still valid, you should be able to simply turn the Route on.

    Open the Route and click the “Run” button (surrounded by the blue box). In the screenshot below the Route is running. You will notice the “Kill” button (surrounded by the red box). You can use that to stop it. However, don’t stop it before you have managed to get some data into Elasticsearch.

  3. We now need to check that our Kinesis Stream is populating. To do this, log into AWS and go to the Kinesis Stream you created earlier. You should see the screen below.

    We need to switch to the “Monitoring” tab to see what is happening. Click on the “Monitoring” tab (surrounded by a red box above).

  4. On the “Monitoring” tab, scroll down until you see the “Put Record (Bytes) - Sum” chart. That is the chart surrounded by the red box below. If you can see a line like the one in the chart below, which starts at about the time you started your Talend Route, you are good to go.

  5. We are now ready to switch on our Pipeline. Load your Pipeline and start it by clicking on the start button which is surrounded by the green box below.

  6. The Pipeline might take a minute or two to start up. Once it does, you will see some stats being generated in the “Pipeline Details” panel on the right (surrounded by the green box). As soon as you see some data here, your Pipeline is working and sending data to Elasticsearch. When you want to switch off the Pipeline, click on the stop button (surrounded by the blue box). Like the Talend Route, you might not want to do this yet. The more data you have, the better your Heatmap will look. I waited until I had about 100,000 records.
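
    If you would like a more direct check on how much data has reached Elasticsearch, you can query the index’s document count. The sketch below uses Python’s standard library and the Elasticsearch _count API; the endpoint URL is a placeholder for your own Elasticsearch service address, and you may need to add authentication depending on how your domain is secured.

    # Rough check of how many Tweets have reached the "heatmap_index" index.
    # Replace the host with your own Elasticsearch endpoint; add auth if required.
    import json
    import urllib.request

    ES_ENDPOINT = "https://your-elasticsearch-domain.example.com"  # placeholder

    with urllib.request.urlopen(ES_ENDPOINT + "/heatmap_index/_count") as response:
        body = json.loads(response.read().decode("utf-8"))

    print("Documents indexed so far:", body["count"])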

  7. We now need to load Kibana. This can be done by loading the Kibana URL shown here. Load that URL and you will see the following screen. If not, click on the “Discover” link; if you have not used Elasticsearch for anything before, it will show you this screen.

    EDIT: I realised while proofreading (and after I had finished with the screenshots) that you may need to click on the “Management” option to get this screen.

    In the “Index pattern” box (surrounded by the red box above), add “heatmap_index”. Then click on the “Next step” button (surrounded by the blue box above). “heatmap_index” is the name of the index we created in the previous blog.

  8. On the next screen, simply press the “Create index pattern” button (surrounded by the green box below).

  9. The next screen shows the “Index Pattern” created. You do not need to do anything here.

  10. If you click on the “Discover” link (surrounded by the green box in the image below), you will see the data that has been sent to Elasticsearch with the index “heatmap_index”.

  11. Click on the “Visualize” link (surrounded by the green box in the image below) and you will see the “Visualize” screen. Click on the “Create a visualization” button.

  12. The next screen gives you a choice of visualization types. Although we want a Heatmap, we do not want the “Heat Map” option. Click on the “Coordinate Map” option.

  13. On the next screen, simply click on “heatmap_index” (surrounded by the red box below).

  14. You will be presented with the following screen. Click on the “Value” selector (surrounded by the red box below).

  15. Ensure that “Count” is selected for the “Aggregation” (surrounded by the blue box below) and click on “Geo Coordinates” (surrounded by the red box below).

  16. Once the “Geo Coordinate” section is expanded, select “Geohash” (surrounded by the blue box below) and select “location” for the field (surrounded by the red box below).

  17. The final step before we can see the Heatmap is to set the “Options”. Click on the “Options” tab (surrounded by the red box below), select a “Map type” of “Heatmap” (surrounded by the blue box), set a “Cluster size” using the slider (anything will do, you can tweak this later) and set the “Layers” to “road_map” (surrounded by the orange box).

    Once the above has been set, you are ready to start the Heatmap.

  18. Below is an example of the sort of result you might get once you start the Heatmap. To start it, click on the “Start” button (surrounded by the red box). You will see the map produced with the data that Elasticsearch has up to the point at which you started it. If you want it to refresh periodically, you can set this using the controls surrounded by the purple box. You can also zoom in and out of the map using the controls surrounded by the green box. If you want to tweak the look of the map, simply have a play around with the “Options” panel we edited in step 17.

 

Hopefully you have found these blogs useful and they have revealed a few possibilities for how you can use Pipeline Designer. If you have any questions about the blogs, please feel free to add comments. If you have any questions about Pipeline Designer, log on to the Talend Community and go to the Pipeline Designer board. This board is monitored by several Talend employees who have been working with Pipeline Designer, and it is also viewed by fellow Community members who may be able to help you out.

 

The post Generating a Heat Map with Twitter data using Pipeline Designer – Part 3 appeared first on Talend Real-Time Open Source Data Integration Software.

5 best practices to deliver trust in your data project : Tip #2 control your data wherever


During the summer, the Talend Blog Team will take turns sharing fruitful tips to help you securely kick off your data project. This week, we’ll continue with the second capability: make sure you stay in control of your data pipelines from the ingestion phase to consumption and beyond.

Get control over all the organization’s data and prevent shadow IT

Whenever an IT system, application or personal productivity tool is used inside an organization without explicit organizational approval, we talk about shadow IT. Shadow IT is not only a security and compliance nightmare; it also creates a data sprawl where each group can build its own data silos.

As per Wikipedia: Shadow IT, also known as Stealth IT or Client IT, refers to information technology (IT) systems built and used within organizations without explicit organizational approval, for example, systems specified and deployed by departments other than the IT department.

According to a 2016 Cisco customer survey, the number of services used in an organization without IT involvement is 15 to 25 times higher than IT departments estimate. Furthermore, the explosion of cloud services is likely to accelerate this trend.

Why it’s important

The more shadow IT develops, the harder it becomes for users to access and protect data.

IDC estimates that data professionals spend 81% – and waste 24% – of their time searching for, preparing and protecting data before they can actually take advantage of it. When data is not a team sport, everyone spends time creating silos and their own version of the truth, which drives up costs. Decisions end up being influenced by questionable data, ultimately putting the organization at risk.

IDC went even deeper into this analysis in a data governance webinar, highlighting how frequently business users rely on spreadsheets as a data integration tool. Data silos start here, as copy/paste is the most frequently used approach to bring this data in. To avoid shadow IT, equipping people with modern tools such as Talend Data Preparation is essential to prevent those uncontrolled copies of data. Data citizens can then process data from source to destination without keeping local storage, unknown or unprotected folders and systems, on-premises storage or uncontrolled cloud-based storage.

This is no longer acceptable with the rise of regulations (Basel II, IFRS, GDPR, CCPA, etc.) that mandate companies to take control of their data assets. If they don’t, companies run the risk of being non-compliant and exposed to significant regulatory fines.

When it’s important

Data control should take place everywhere: when data enters the system, along data pipelines, and at data consumption points through apps, APIs or analytics.

As more and more data professionals get closer to operations to drive business outcomes “where the action is”, there is a growing risk of data fragmentation and misalignment. There is a need for a central organization that can enable people with data in a governed way while tracking and tracing data flows through data lineage.

Our recent Data Trust Readiness Report reveals that 46% of executives believe their organization is always in control of data. That figure falls to 28% for data practitioners, which suggests the problem looks under control at the top level while data workers are less confident.

Download Data Trust Readiness Report now.

How Talend tools can help

Talend Data Catalog helps you to create a central, governed catalog of enriched data that can be shared and collaborated on easily. It can automatically discover, profile, organize and document your metadata and makes it easily searchable.

Imagine that you find some inconsistent data that has been created and perpetuated in one of your datasets, and you’re asked to identify it, explain it and correct it. Data lineage will dramatically accelerate your speed to resolution by helping you spot the right problem in the right place. Moreover, if new datasets come into your data lake, establishing data lineage will help you identify these new sources very quickly.

But the data lineage provided by Talend Data Catalog is not enough on its own. You also have to cleanse the data without leaving local files on unsafe systems. Here again, equipping people with modern tools such as Talend Data Preparation is essential: it helps them cleanse data while avoiding local working files such as Excel spreadsheets, so data moves from source to destination without uncontrolled local, on-premises or cloud-based copies.

How to get started

If you want to become a digital leader, you need to be in control of applications wherever they’re located within the company. In our survey, top leaders combine speed and integrity by reclaiming control of shadow IT while still enabling the business with data. Incorporating appropriate data controls in your data chain is vital for the success of your data governance initiatives. Establish a step-by-step approach to data governance, from discovery to sharing.

Download our Definitive Guide to Data Governance to learn how to take control of your data assets and maximize their value. This bottom-up, data-driven approach to governance will foster a data-driven culture as well as the benefits of delivering trusted data at speed.

 

Want to explore more capabilities?

This is the second of ten trust & speed capabilities. Can’t wait to discover our third capability?

Go and download our Data Trust Readiness Report to discover other findings and the other nine capabilities, or stay tuned for next week’s Tip #3: Make people accountable around a single source of trusted data.

https://www.talend.com/products/data-catalog/

 

The post 5 best practices to deliver trust in your data project : Tip #2 control your data wherever appeared first on Talend Real-Time Open Source Data Integration Software.

Data Privacy through shuffling and masking – Part 2


In the first part of this two-part blog series, we took a deep dive into Data Shuffling techniques, which aim to mix up data while optionally retaining logical relationships between columns. In this second part, we will focus on Data Masking techniques as one of the main approaches to guaranteeing Data Privacy.

Data Masking

Simply put, masking techniques allow you to block visibility of specific fields or pieces of data. Masking hides data while preserving the overall format and semantics. It creates a structurally similar but inauthentic version of the data by applying specific functions to data fields.

Note that, when using the most common techniques for data masking, the original data cannot be retrieved after it has been masked. Still, some encryption-based algorithms allow you to encrypt and decrypt data while preserving its format, as we will see at the very end of this section.

In the following, we first describe some of the numerous data transformation functions used to hide pieces of data. Then we detail the different masking modes and their implications at runtime.

Data Transformation Functions

To mask data, many transformation functions can be applied to the original data. Let’s first dig into the most common ones. This list is not exhaustive, and other transformations can easily be applied to create other inauthentic versions of the data.

Text handling functions

The following table lists some of the available masking routines for text and their effects on the example value “Talend in 2019 is awesome”.
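
To give a feel for what such text routines do, here is a small hand-rolled Python sketch applied to that example value. The function names and behaviors are illustrative only and are not the actual Talend masking routines.

    import random
    import string

    SAMPLE = "Talend in 2019 is awesome"

    def replace_all_with_x(value):
        # Replace every letter and digit with 'X', keeping spaces and punctuation.
        return ''.join('X' if c.isalnum() else c for c in value)

    def keep_first_n_chars(value, n=6):
        # Keep the first n characters visible and mask the rest.
        return value[:n] + ''.join('X' if c.isalnum() else c for c in value[n:])

    def replace_chars_randomly(value):
        # Replace letters with random letters and digits with random digits.
        out = []
        for c in value:
            if c.isalpha():
                out.append(random.choice(string.ascii_lowercase))
            elif c.isdigit():
                out.append(random.choice(string.digits))
            else:
                out.append(c)
        return ''.join(out)

    print(replace_all_with_x(SAMPLE))      # "XXXXXX XX XXXX XX XXXXXXX"
    print(keep_first_n_chars(SAMPLE))      # "Talend XX XXXX XX XXXXXXX"
    print(replace_chars_randomly(SAMPLE))  # e.g. "qwkzpd gh 4821 xt lmcoqwe"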

 

Numeric values handling functions

The following table lists the available masking routines for a column containing numeric values and their effect on the example value 21803.


Date handling functions

The following table lists the available masking routines for a column containing date values and their effects on the example value 05/04/2018.


Patterns handling function

Specific algorithms can be applied to mask data that follows a specific pattern. This is ideal for masking records such as credit card numbers, Social Security Numbers (SSN), account IDs and IP addresses, which are structured and standardized data.

For example, if we want to mask a French Social Security Number, the input values consist of 15 characters, excluding spaces, and use the pattern “s yy mm ll ooo kkk cc” where:

  • s is the gender: 1 for a male, 2 for a female,
  • yy are the last two digits of the year of birth,
  • mm is the month of birth,
  • ll is the number of the department of origin,
  • ooo is the commune of origin,
  • kkk is an order number to distinguish people being born at the same place in the same year and month,
  • cc is the “control key”.

 

Specifying exactly how to mask each part, using defined ranges, allows you to transform the original data in a consistent manner. For example, you can specify that:

  • s must be generated between 1 and 2,
  • yy must be generated between 00 and 99,
  • mm must be generated between 01 and 12,
  • etc.

 

You also get the ability to mask specific parts of the input and keep other parts unmasked. For example, you might want to mask all of the SSN characters except the first one. This would allow you to keep real statistics for gender (represented by the first character) while preserving the anonymity of the real person – the other characters being fully masked.
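
As a rough illustration of that idea (not the actual Talend pattern routine), the sketch below masks a French-SSN-shaped value by regenerating each group within a plausible range while leaving the first character untouched:

    import random

    def mask_french_ssn(ssn):
        # Pattern "s yy mm ll ooo kkk cc": keep the gender digit (s) as-is and
        # regenerate the other groups within plausible ranges.
        parts = ssn.split(' ')
        s = parts[0]                          # kept unmasked so gender statistics stay real
        yy = '%02d' % random.randint(0, 99)
        mm = '%02d' % random.randint(1, 12)
        ll = '%02d' % random.randint(1, 95)
        ooo = '%03d' % random.randint(1, 999)
        kkk = '%03d' % random.randint(1, 999)
        cc = '%02d' % random.randint(1, 97)
        return ' '.join([s, yy, mm, ll, ooo, kkk, cc])

    print(mask_french_ssn("1 90 04 94 184 376 21"))
    # e.g. "1 37 11 52 803 119 46" -- same format, first character preserved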

Of course, the same behavior can be applied to dedicated data types such as emails, phone numbers, addresses, etc.

 

Masking Modes and Runtime Behaviors

When masking data, besides the technical routines applied to transform the data, another component is also key: the masking modes and the behavior the functions have at runtime.

Depending on the targeted use case, data masking routines can be purely random, but they can also be repeatable from one execution to another on the same dataset. This opens up huge possibilities, especially because it allows joins and lookups on masked data. Let’s dig into that…

 

Random Data Masking

Random masking consists of masking an input value with a randomly generated value. As a consequence, when there are multiple occurrences of the same value in the input dataset, they can be masked to different values. Conversely, different values from the input dataset can be masked to the same value.

For example, the following diagram shows an example of pure random data masking:

  • The A value is masked to D when it first appears in the input dataset.
  • The B and C values are masked to E.
  • The A value is masked to F when it appears in the input dataset for the second time.


The following table shows examples of generated masked values using a “Replace n first chars” function:

Here, two of the input values are the same. Once masked, the output values are completely different: “newuser” is masked as “uãáìser” in the first occurrence and as “åõzoser” in the second.

The following table shows examples of generated masked values from a French SSN number:


Again, two of the input values are identical. Once masked, the corresponding output values are completely different: “1 90 04 94 184 376 21” is masked as “2 59 04 592 221 47 22” in the first occurrence and as “2 73 03 64 078 284 70” in the second.

 

Random data masking is a good fit, and might be sufficient, if you only need to hide data while preserving the overall format and semantics, without any further constraint on keeping a relationship between initial values and masked values.
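
A minimal way to picture random masking in code: each call draws fresh random characters and keeps no state, so two occurrences of the same input end up with unrelated masked values. This sketch is illustrative and is not the Talend “Replace n first chars” routine itself.

    import random
    import string

    def random_mask_first_n(value, n=4):
        # Replace the first n characters with random letters; no state is kept,
        # so masking the same input twice gives two unrelated outputs.
        masked_prefix = ''.join(random.choice(string.ascii_lowercase) for _ in range(n))
        return masked_prefix + value[n:]

    print(random_mask_first_n("newuser"))  # e.g. "qzpkser"
    print(random_mask_first_n("newuser"))  # e.g. "vmtrser" -- different, by design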

 

Consistent Data Masking

When the same value appears twice in the input data, consistent masking functions output the same masked value. However, two different input values can be replaced with the same masked value in the output.

For example, the following diagram shows an example of consistent masking:

  • The A value is masked to D, regardless of the number of occurrences in the input dataset.
  • The B and C values are masked to E.


The following table shows examples of generated masked values using the “Mask email left part of domain” function with consistent items (i.e. the left part of the domain is replaced by one of the items set in the extra parameters list):

 

Here, the same input values are always masked by the exact same output values: “domain” is always masked as “newcompany” and “company” is always masked as “value”.

Consistent data masking can be seen as a first step prior to bijective data masking.
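
One simple way to obtain this behavior is to derive the masked value deterministically from the input, for example by hashing it and using the hash to pick a replacement from a fixed list. The sketch below is a toy version of that idea; because several inputs can map to the same list entry, it is consistent but not bijective.

    import hashlib

    REPLACEMENT_DOMAINS = ["newcompany", "value", "acme", "example"]

    def mask_email_domain_left_part(email):
        # Deterministically pick a replacement for the left part of the domain,
        # so the same input always produces the same masked output.
        local, domain = email.split('@', 1)
        left, _, tld = domain.partition('.')
        digest = hashlib.sha256(left.encode('utf-8')).hexdigest()
        replacement = REPLACEMENT_DOMAINS[int(digest, 16) % len(REPLACEMENT_DOMAINS)]
        return local + '@' + replacement + '.' + tld

    print(mask_email_domain_left_part("alice@domain.com"))   # always the same output
    print(mask_email_domain_left_part("alice@domain.com"))   # ...for the same input
    print(mask_email_domain_left_part("bob@company.org"))    # may collide with another input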

 

Bijective Data Masking

Bijective masking functions have the following characteristics:

  • They are consistent masking functions.
  • They always output two different masked values for two different input values.

For example, the following diagram shows an example of bijective masking:

  • The A value is masked to D, regardless of the number of occurrences in the input dataset.
  • The B value is masked to E.
  • The C value is masked to F.


The following table shows examples of generated masked values from a French SSN number:

Here, the same input value is always masked by the exact same output value: “1 90 04 94 184 376 21” is always masked as “2 89 05 24 283 319 01”.

Bijective data masking is a good fit if you need to ensure a one-to-one correspondence between initial values and masked values. As we will see in the next section, this property is key if you aim to join or look up across several masked datasets while keeping the correct relationships.

 

Repeatable data masking

Repeatable masking allows you to maintain consistency between Job executions.
A seed is defined so that, for a given combination of input and seed values, the same output masked value is produced.

Combining bijective data masking and repeatable data masking has very powerful properties. In particular, it allows you to join different datasets on a key that has already been masked.

Let’s say you want to perform business intelligence on an insurance database and a healthcare database, but we lack explicit consent to access those data directly. We might still be able to compute some statistics on the data once it has been shuffled and masked.

Since we have to join both databases, we must make sure that the data used for the join is masked exactly the same way everywhere.

  • By leveraging bijective data masking, we can ensure the same input value is always masked with the same output value, and vice versa.
  • By leveraging repeatable masking, we can ensure the above is true… at each job execution.
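
Here is a minimal sketch of that combination using a keyed hash (HMAC with a seed shared by the masking jobs): the same input and seed always produce the same masked key, run after run, so two datasets masked with the same seed can still be joined on it. Note that a keyed hash is consistent and repeatable but not strictly bijective (collisions are theoretically possible); strict bijectivity calls for an explicit mapping or the format-preserving encryption described next.

    import hmac
    import hashlib

    SEED = b"shared-masking-seed"  # same seed used when masking both datasets

    def repeatable_mask(value, seed=SEED):
        # Same (value, seed) pair -> same masked key, run after run.
        return hmac.new(seed, value.encode('utf-8'), hashlib.sha256).hexdigest()[:16]

    insurance = {repeatable_mask("1 90 04 94 184 376 21"): {"premium": 1200}}
    healthcare = {repeatable_mask("1 90 04 94 184 376 21"): {"visits": 3}}

    # The masked keys match, so the two datasets can still be joined.
    for masked_key, policy in insurance.items():
        if masked_key in healthcare:
            print(masked_key, policy, healthcare[masked_key])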

 

Format-Preserving Encryption (FPE)

Format-Preserving Encryption algorithms are cryptographic algorithms that keep the format of the input values. These algorithms require a secret key to be specified in order to generate unique masked values.

Since those methods are based on encryption, several advantages exist:

  • Once encrypted, the data can be decrypted – meaning that it is possible to unmask the output values (knowing the secret key, of course). Such a masking function is therefore reversible – you can retrieve the original input data.
  • It natively implies bijectivity and repeatability.

The FF1 algorithm is the NIST-standard Format-Preserving Encryption algorithm.

Conclusion

We have seen several techniques that help to ensure data privacy. The key takeaway is that there is no easy, magic solution that will solve all of your data privacy concerns.

Depending on the type of data you are working with and the use cases you want to address, some techniques are more relevant than others. A mix of different techniques, such as data shuffling sprinkled with a bit of repeatable data masking and a pinch of hashing, is often the right path to correctly addressing such complex data privacy projects.

The good news is that Talend Data Fabric provides all of these technical capabilities to help you address your data privacy needs!

 

Want to see these capabilities in action with Talend Studio and Talend Data Preparation?
Watch this online video on Managing Data Privacy with Data Masking and Talend Data Quality.

 

Take the next step: Download a Free Trial

 

The post Data Privacy through shuffling and masking – Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

Viewing all 824 articles
Browse latest View live