Blog – Talend Real-Time Open Source Data Integration Software

Migrate Data from one Database to another with one Job using the Dynamic Schema


“How can I migrate data from all tables from one database to another with one generic Talend Job… with transformations as well?” is a question I get again and again on Talend Community. As an integration developer with over 15 years of honing my skills, this question used to get me banging my head against my desk. I now sit at a desk with a subtle but definitely present dent, which is slightly discoloured from classic pine to pine with a rosé tinge.

My attitude was always that Talend is a tool which helps developers build complex integration Jobs and should be used by experts who realise that there is no universal, one-size-fits-all Job for everything. However, I was being somewhat harsh and maybe somewhat elitist. Looking back at my frustrations I can see that they came from the fact that I had spent a lot of time and energy building my expertise, and I was a little resentful of the expectation that what we integration experts do should be considered so trivial and easy.
I recently received an example of that sort of question again and it got me considering my attitude.

The question wasn’t quite the same as those I have received in the past. They didn’t want to migrate the data, create dynamic transformations of the data and filter ad hoc rows, all by simply joining 3 components together and pressing “Go”. This individual wanted to take a database and migrate the data from source tables to target tables, with a change in table name and possibly a change in column order as well. I had a bit of time, so I thought I would give it a go. It sounded possible, and like the sort of thing I could use to dust off my skills a little, having spent the last few months looking at some of Talend’s newer tools. In this blog I will demonstrate a method of achieving this requirement.

First of all, I should point out that I am not going to demonstrate a complete multi-table to multi-table migration. What I will demonstrate is an example of a Job that could easily be extended to do that; I will talk about how to extend it at the end. In this blog I am focused on creating a Job which will move data from one table to another using a Dynamic schema, a column mapping table and a bit of Java.

 

The DynamicTableMigration Job

Below is a screenshot of the Job. You will see that the components are numbered from 1 to 13. I will use this numbering when talking through what each component does and how you can recreate this.

I have created 3 tables in a MySQL database to demonstrate this. A source table, a target table and a column mapping table. I’ve used a single database, but in reality you will likely be using different databases. It doesn’t make much difference, but you will need to make sure that the database column types are the same if you are following this. It would be possible to add some code to dynamically change the column types, but this would require extra data in the column mapping table and some extra Java code. This is not covered here.

 

The Source Table

The Source table is a simple table holding a few person details. You can see the schema below as a MySQL create statement.

CREATE TABLE `source` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(45) DEFAULT NULL,
  `last_name` varchar(45) DEFAULT NULL,
  `house_number` int(11) DEFAULT NULL,
  `street` varchar(45) DEFAULT NULL,
  `city` varchar(45) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=201 DEFAULT CHARSET=latin1;

 

The Target Table

The Target table is also a simple table holding a few person details, but slightly different. You can see the schema below as a MySQL create statement.

CREATE TABLE `target` (
  `id` int(11) NOT NULL,
  `firstName` varchar(45) DEFAULT NULL,
  `lastName` varchar(45) DEFAULT NULL,
  `cityName` varchar(45) DEFAULT NULL,
  `road` varchar(45) DEFAULT NULL,
  `addressNumber` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The column order has changed slightly, as have some of the names. You will also notice I have not set the “id” as an auto-increment field. This was just done to save from faffing around with that in this simple example. However, there are plenty of ways of dealing with this if you have to.

 

The Column_Mapping Table

The Column_Mapping table is used to translate the column names and change the column order in this example. You can see the schema below as a MySQL create statement.

CREATE TABLE `column_mapping` (
  `old_column_name` varchar(45) DEFAULT NULL,
  `new_column_name` varchar(45) DEFAULT NULL,
  `order` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

 

I will also show how I have populated this for this tutorial. It is a basic example with only a few changes. But it should give you the idea.

old_column_name   new_column_name   order
id                id                1
first_name        firstName         2
last_name         lastName          3
house_number      addressNumber     6
street            road              5
city              cityName          4

 

Component Configuration

Below I will describe the configuration of each of the components.

 

1. “Source” (tDBInput)

This component is used to read the data from the source table and output it in a Dynamic schema column. You can see the configuration of this component below…

Other than connection credentials, the only thing that needs to be done is to configure a single column (in this case called “dynamicColumns”) of type “Dynamic”. The query is simply…. 

"Select * From source"

Once this is configured, this component is finished with.

 

2. “Identify column names” (tJavaFlex)

This component is used to identify the column names and the order of the columns retrieved by the Dynamic schema column. I will not show a screenshot of this component as it is simply code. The code is broken down into 3 sections: Start Code, Main Code and End Code. These are shown below so that you can copy and paste. Ensure that all columns and rows are named the same as in this example, otherwise you will get errors. Alternatively, ensure that you change those variable names.

Start Code

//Create an ArrayList to contain the column names of the input source
java.util.ArrayList<String> columns = new java.util.ArrayList<String>();
//Row count variable to count rows processed
int rowCount = 0;

Main Code

//Only carry out this code for the first row
if(rowCount==0){
	//Set the dynamicColumnsTmp variable
	Dynamic dynamicColumnsTmp = row3.dynamicColumns;

	//Cycle through the columns stored in the Dynamic schema column and add the column names
	//to the ArrayList
	for (int i = 0; i < dynamicColumnsTmp.getColumnCount(); i++) { 
    	DynamicMetadata columnMetadata = dynamicColumnsTmp.getColumnMetadata(i); 
      	columns.add(columnMetadata.getName());
      
	}
	
	//Append 1 to the rowCount
	rowCount++;
}

End Code

//Set the columns ArrayList to the globalMap for later
globalMap.put("columns", columns);

The above code is described in-line.

 

3. “Initial dataset” (tHashOutput)

This component is used to store the data from the source component. It is passed through the tJavaFlex while that component is calculating the column order. The configuration of this component can be seen below.

You simply need to ensure that your column “dynamicColumns” is added.

 

4. “Column name” (tJavaFlex)

This component takes the globalMap variable created in the End Code section of the first tJavaFlex and returns each column name found, in order, one at a time to the following flow. 

Start Code

//Retrieve the columns ArrayList to be used here
java.util.ArrayList<String> columns = (java.util.ArrayList<String>) globalMap.get("columns");
java.util.Iterator<String> it = columns.iterator();

//Set a loop to produce a row for each column name
while(it.hasNext()){

Main Code

//Set each column name to a new row
row2.columnName = it.next();

End Code

}

This essentially works as a While loop, iterating over the ArrayList.
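
To picture how the three sections fit together, here is a rough, hand-written approximation of what this tJavaFlex ends up doing once Talend stitches the sections into the generated Job (an illustration only; the real generated code contains a lot more around it):

//Illustration only: roughly how the Start/Main/End sections combine at code generation time
java.util.ArrayList<String> columns = (java.util.ArrayList<String>) globalMap.get("columns"); //Start Code
java.util.Iterator<String> it = columns.iterator();                                           //Start Code

while(it.hasNext()){                     //Start Code opens the loop
	//Main Code runs once per iteration, producing one "row2" output row per column name
	String columnName = it.next();
	//...the rest of the subjob processes this row before the next iteration...
}                                        //End Code closes the loop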

 

5. “Column mapping” (tDBInput)

This component is used to retrieve the data from the Column_Mapping table. This holds the old and new column names; it also holds the order of the columns. This data is used to dictate the order of the output data. This component is used as a lookup for the tMap, which will be described next. The configuration of this component can be seen below.

Like the first tDBInput component, this is pretty simple to set up. Just set the schema and the query. The query can be seen below…

"SELECT 
  `column_mapping`.`old_column_name`, 
  `column_mapping`.`new_column_name`, 
  `column_mapping`.`order`
FROM `column_mapping`"

 

6. “Replace column names” (tMap)

This component is used to take the column names found in the first subjob and compare them against the lookup data from the previous component. The configuration of this component can be seen below…

This is a pretty simple component in terms of its configuration. We simply bring our data row from the tJavaFlex and look it up against the dataset from the tDBInput. There is a join on the column name against the old column name from the lookup. The “columnName”, “ColumnNameNew” and “Order” columns are returned. The data will hold the translation of the column names and the required order of those columns in the output data.

 

7. “Set column order” (tSortRow)

This component is used to order the output from the tMap into the required column order for the final output. The configuration of this component can be seen below….

First, ensure that all input columns are sent to the output. Next, set the “Criteria” for the ordering to be on “Order”, “num” and “asc”. This will return the columns ordered from smallest “Order” value to largest.

 

8. “Create ordered column list String” (tMap)

This component is used to merge all of the column data into a single String. What it actually does is to append the rows together, returning as many rows as input, but with the final row holding a concatenation of all of the data. First, the old column name is concatenated with the new column name. These values are separated by a comma. Then the rows are concatenated using a semicolon. So the first row output might look like this….

oldColumnOne,newColumnOne

The last row output will look like this….

oldColumnOne,newColumnOne;oldColumnTwo,newColumnTwo;oldColumnThree,newColumnThree;oldColumnFour,newColumnFour;oldColumnFive,newColumnFive;oldColumnSix,newColumnSix

We need to keep the last row, but this will be handled by the next component. 

The configuration of this tMap can be seen below….

This is relatively straightforward, but the tMap variables will need explaining. As I regularly mention in my tutorials, the tMap variables are processed per row from top to bottom. They also retain their values between rows. This makes this process possible. The variables and expressions are shown below…

Expression                                                                                    Type     Variable

row6.columnName + "," + row6.ColumnNameNew                                                    String   mergedColumns

Var.mergedRecords == null ? Var.mergedColumns : Var.mergedRecords + ";" + Var.mergedColumns   String   mergedRecords

The top variable (mergedColumns) is used to concatenate the old column name with the new column name, separated by a comma. The second variable (mergedRecords) is used to concatenate the mergedColumns values of every row. This is made possible by the fact that it references itself (using Var.mergedRecords) in the concatenation. Since this is the case, its appended value is retained between rows. So, as explained previously, by the last row of data all of the records will have been concatenated.
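
If it helps to see the same logic outside of the tMap, here is a small, hand-written Java sketch of what the two variables do as the rows pass through (an illustration of the logic only, not Talend-generated code; the sample rows are made up):

//Illustration of the per-row tMap variable logic (sample rows are made up)
java.util.List<String[]> rows = java.util.Arrays.asList(
	new String[]{"id", "id"},
	new String[]{"first_name", "firstName"},
	new String[]{"last_name", "lastName"});

String mergedRecords = null;                         //retained between rows, like Var.mergedRecords
for(String[] row : rows){
	String mergedColumns = row[0] + "," + row[1];    //Var.mergedColumns: old name, new name
	mergedRecords = (mergedRecords == null)
		? mergedColumns
		: mergedRecords + ";" + mergedColumns;       //Var.mergedRecords accumulates across rows
}
//After the last row, mergedRecords holds "id,id;first_name,firstName;last_name,lastName"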

 

9. “Return last row” (tAggregateRow)

This component is used to return only the last row of data from the previous tMap. First of all, we ensure that the single input column is set as our output column. After this, we do not set a “Group by” field. This essentially puts all rows into one group. Then for the “Operations”, we set the “mergedRecords” column to have the “Function” of “last”. This will return only the last record.

The configuration of this component can be seen below…

 

10. “Save ordered list String” (tJavaFlex)

Here we use another tJavaFlex component. Now here, I didn’t really need to use a tJavaFlex. I could have used a tSetGlobalVar. I’ll be totally honest and say that I produced all of the screenshots for this before thinking that it was a bit of overkill using the tJavaFlex. However, it doesn’t hurt. All I am doing here is setting the value returned by the previous component to a globalMap variable. The code for this takes place in the Main Code section of the tJavaFlex. No other code is used. This can be seen below…

 

//Set the column translation record to the globalMap
globalMap.put("records", row7.mergedRecords);

 

11. “Initial dataset” (tHashInput)

This component is used to read in the data stored in our tHashOutput from the first subjob. It is simply linked to the first tHashOutput and set with the same schema. The configuration of this component can be seen below…

 

12. “Reorder dynamic schema” (tJavaFlex)

This is our last tJavaFlex component and arguably the most complicated. I’ll show the code below. The code is described in-line, but it essentially takes the data passed in the first subjob from our source, takes the mergedRecords String stored in the globalMap with a key of “records”, splits up the mergedRecords String, then uses that data to match with column names from the data set passed from the tHashOutput. For each record, it checks the order and the new name required, then creates a new Dynamic schema record with the columns reordered and renamed. 

Please see the code below…

Start Code

//Retrieve the "records" globalMap String which holds the record order
String records = ((String)globalMap.get("records"));
//Split the columns up using the semicolon
String[] columns = records.split(";");

Main Code

//Create a Dynamic schema variable to hold the incoming Dynamic column
routines.system.Dynamic dynamicColumnsTmp = row8.dynamicColumns;
//Create a brand new Dynamic column variable to be used for the newly formatted record
routines.system.Dynamic newDynamicColumns = new routines.system.Dynamic();

//Cycle through the column data supplied by the globalMap
for(int x = 0; x<columns.length; x++){
	//Cycle through the columns inside the Dynamic column holding the data
	for (int i = 0; i < dynamicColumnsTmp.getColumnCount(); i++) { 
		//Retrieve the value of the current column inside the Dynamic column
	  	Object obj = dynamicColumnsTmp.getColumnValue(i);
	  	//Retrieve a DynamicMetadata object from the column inside the Dynamic column
      	DynamicMetadata columnMetadata = dynamicColumnsTmp.getColumnMetadata(i); 
		
		//If the current column inside the Dynamic column starts with the same name
      	if(columns[x].startsWith(columnMetadata.getName()+",")){
      			//Identify the old and new column names from the column record
      			String newColumnName = columns[x].substring(columns[x].indexOf(',')+1);
      			String oldColumnName = columns[x].substring(0,columns[x].indexOf(','));
      			//Create a new DynamicMetadata object
      			DynamicMetadata tmpColumnMetadata = new DynamicMetadata();
      			tmpColumnMetadata.setName(newColumnName);
      			//Copy the remaining metadata from the original column
      			tmpColumnMetadata.setDbName(columnMetadata.getDbName());
    			tmpColumnMetadata.setType(columnMetadata.getType());
    			tmpColumnMetadata.setDbType(columnMetadata.getDbType());
    			tmpColumnMetadata.setLength(columnMetadata.getLength());
    			tmpColumnMetadata.setPrecision(columnMetadata.getPrecision());
    			tmpColumnMetadata.setFormat(columnMetadata.getFormat());
    			tmpColumnMetadata.setDescription(columnMetadata.getDescription());
    			tmpColumnMetadata.setKey(columnMetadata.isKey());
    			tmpColumnMetadata.setNullable(columnMetadata.isNullable());
    			tmpColumnMetadata.setSourceType(columnMetadata.getSourceType());
    			
    			//Set the new metadata for the new column inside the new Dynamic schema column
      			newDynamicColumns.metadatas.add(tmpColumnMetadata);
      		
      			//Set the value for the new column inside the new Dynamic schema column
      			newDynamicColumns.addColumnValue(obj);

      		}
      }
      
      
}

//Set the output Dynamic schema column
row9.dynamicColumns = newDynamicColumns;

There is no End Code section for this component.

 

13. “Target” (tDBOutput)

This component is used to send the data to the target table. As with the first component, most of the configuration data depends upon your environment. You will need to make sure you set it up with a Dynamic schema column. This should automatically be set simply by connecting your component. The way this will work is that, so long as the new column names you selected match those in the database table, the DB component will identify these and create the appropriate insert statement. The configuration of this component can be seen below….

Once all of the above has been completed, you should be able to run your job, check your database and see that the data has been copied from source to target correctly.

 

What if I want to use this to support multiple tables in one Job?

In my introduction I mentioned that this method can be used to migrate multiple tables. This example only supports 1 table to 1 table. All that is needed to extend this is the following.

 

1. Use Context Variables in your DB components

First of all, you will need to make your DB components more dynamic. The tDBInput component has a hardcoded SELECT statement. This needs changing, but not by much. If you add a Context variable for the source name, you can change your query to ….

"Select * From "+context.source

No further changes to this component are needed. 

You need to do something similar with the tDBOutput component. But instead of setting a query, you will need a Context variable for the “Table” parameter. So you’d set….

context.target

…for your “Table” parameter.

 

2. Add columns to your Column_Mapping table to hold the Table Name

You will need to add a bit more supporting data to your Column_Mapping table. If you add an “Old_Table_Name” column and a “New_Table_Name” column, you can query the Column_Mapping table using the “Old_Table_Name” field and the context.source Context variable. That will return the mapping configurations for your source table and return the new table name. This will need to be set as your context.target Context variable value.
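
For example (a sketch only, assuming the new columns are called “Old_Table_Name” and “New_Table_Name”), the query in the “Column mapping” tDBInput could become something along these lines:

"SELECT 
  `column_mapping`.`old_column_name`, 
  `column_mapping`.`new_column_name`, 
  `column_mapping`.`order`,
  `column_mapping`.`New_Table_Name`
FROM `column_mapping`
WHERE `column_mapping`.`Old_Table_Name` = '" + context.source + "'"

The “New_Table_Name” value returned by this lookup can then be assigned to context.target (in a tJava or tJavaRow, for example) before the output component runs.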

 

3. Create a wrapper Job to call this Job and supply the Table data as Context Variables

The final step for this will be to create a wrapper Job. This is a Job that will query a data set (maybe your Column_Mapping table) to return a list of source tables to be migrated. This data will then be sent to this Job, which is run using a tRunJob. For every source table identified in the wrapper Job, this Job will be run. Therefore you can start the wrapper Job; it will return each of the source tables, and this Job will dynamically run for each of them.

 

Further Considerations

What I have described above is the most basic version of what you would need to do to meet this requirement. I have not included any logging or any error catching. You should do this if you want to achieve this successfully. You will likely also have to consider any referential integrity issues that might crop up. Assuming that all of your Primary and Foreign keys will be the same, this may just mean having to switch off any DB constraints on your target DB before running this. However, it is important to think this through before jumping to use this, as there will be further considerations to take into account.

Finally, good luck 😉 

 



Birds migrate. But why do data warehouses?


Well, let’s be specific here. Birds migrate either north or south. Data warehouses are only going in one direction. Up, to the cloud.

It’s a common trend we’re seeing across every vertical and across every region. Companies are moving their existing data warehouses to cloud environments like Amazon Redshift. And more often than not, unlike their feathered counterparts, once they migrate to the cloud, they never come back.

But why?

Simply put, it just makes sense.

The cloud has fundamentally changed the game of leveraging data effectively. There are far fewer constraints on resources, procurement, scale, setup, or speed. In other words, cloud computing has made most on-premises datacenter strategies obsolete. So why wouldn’t you invest in something like that?

Now that I’ve got your attention, let’s get into the nitty gritty details.

Cloud data warehouses are awesome, but they’re not perfectly wrapped packages. How will you get the data you have from all your cloud and on-premises applications and systems into them? How will you make sure the data is right? You still need to make the projects that are driven off of them successful. You still need to make them work with everything else. This is not trivial. In fact, it’s really, really hard.

That’s where we come in.

In a typical scenario, companies will utilize a data lake like Amazon S3 for structured and unstructured data. The biggest challenge at this stage is just getting the data in there. Talend speeds up the ingestion with over 1000 connectors and components. It doesn’t matter if the data is from Teradata, Salesforce, DB2, or any other system. You don’t need to worry about version numbers, data formats, or even where the system resides. We make it effortless to feed the data lake.

Next, there’s a fundamental need to standardize, parse, and cleanse the data. This is where the G-word comes in. That’s right: Governance. In the end, governed data is useful data. That’s what we need to drive projects and make them successful. At this stage we’re paring down the data and throwing out the bits we don’t need. To do this effectively, we can leverage machine learning and big data processing coupled with insight from people who understand the business. The machine learning can help identify data that needs to be fixed and automatically correct it based on prior corrections. We have tools like Data Preparation and Data Stewardship that enable different users (including business analysts) to contribute to refining the data. Finally, we generate native code to run on big data services like EMR to help consolidate large data sets so that only the important, concise data remains.

All this clean and governed data can now be moved into a cloud data warehouse like Redshift for optimized performance in a structured data environment. At this point, the cloud data warehouse serves as the quick and clean repository for all uses of this trusted data. Analytics is a huge use case for cloud data warehouses.

In fact, you can see a terrific example of this in this video featuring the University of Pennsylvania where they modernized a legacy application with a hybrid AWS implementation and reduced runtime by 3x.

 

To learn more about how a modern data warehouse on AWS can drive business results, check out the Cloud Architects’ Handbook on How Leading Enterprises Achieve Business Transformation with Talend and AWS. I hope this helps you on your migratory journey to cloud data warehousing.

Have questions? Feel free to reach out to us and we’ll help you out. 

By the way, if you happen to be in Las Vegas in early December for AWS re:Invent 2019, please drop by our booth #613! We’d love to show you all of this in person!

Cheers.


5 best practices to innovate at speed in the Cloud: Tip #4 Perform faster root cause analysis thanks to data lineage


Starting last September, the Talend Blog Team has been sharing fruitful tips to securely kick off your data project in the cloud at speed. This week, we’ll continue with the fourth speed capability: performing faster root cause analysis thanks to data lineage.

 

Like any supply chain that aspires to be lean and frictionless, data chains need transparency and traceability. There is a need for automated data lineage to understand where data comes from, where it goes, how it is processed and who consumes it. There is also a need for whistleblowers for data quality or data protection, and for impact analysis whenever change happens.


 

Why it’s important

The faster data flows, and the more it is used to automate and drive decisions rather than just influence them, the more important it is to sense issues or changes and react accordingly. A modern data platform establishes an audit trail for impact analysis, data error resolution, internal control or regulatory compliance.

 

When it’s important

Regulators ask for data transparency when organizations manage sensitive data to mitigate risks, manage privacy and move data across borders. The cost of data errors only compounds with time. As such, the sooner in the data flow errors are identified, the better.

If a business wants to review, for example, where sales information entered the system in order to test an idea about a new product or process, data lineage can quickly provide that information. An extraordinary amount of data enters a business system each day, and data lineage reduces risk by providing data origin and information about how it is traveling through the system.

When it comes to trusting data and ensuring governance, lineage information becomes especially important. For example, the healthcare and finance industries are subject to strict regulatory reporting and must rely on data provenance and demonstrate lineage, especially with today’s large open source technologies. Providing a record of where data came from, how it was used, who viewed it and whether it was sent, copied, transformed or received, all in real time, assures that full details about any person or system in contact with data are available at any time.

Our recent data trust readiness report reveals that only 38% of respondents believe their organizations are excellent at tracing errors back to files.

Download Data Trust Readiness Report now.

How Talend tools can help

Data lineage is a map of the data journey, which includes its origin, each stop along the way, and an explanation on how and why the data has moved over time. The data lineage can be documented visually from source to eventual destination — noting stops, deviations, or changes along the way. The process simplifies tracking for operational aspects like day-to-day use and error resolution.

 

Data lineage is a core component of Talend Data Catalog. As well as providing data lineage, Talend Data Catalog helps you create a central, governed catalog of enriched data that can be shared and collaborated on easily. It can automatically discover, profile, organize and document your metadata and make it easily searchable. You can manage metadata by searching for, documenting, analyzing and comparing it, tracing end-to-end data lineage and performing impact analysis.


Figure 1: Talend Data Catalog builds end-to-end lineage down to the attribute level.

 

Talend Data Catalog supports data lineage across multiple platforms, including enterprise apps like SAP, cloud apps like Salesforce.com, data stores like file systems, Hadoop, SQL and NoSQL, BI and analytical tools, ETLs, and more.

Feel free to watch this on-demand webinar where Stewart Bond, Research Director of IDC’s Data Integration and Integrity Software Service, and Talend will highlight this modern approach to data governance. You can also download our definitive guide to data governance to explore other capabilities of trust & speed.

 

Want to explore more capabilities?

This is the fourth out of five speed capabilities. Can’t wait to discover our last capability?

Go and download our Data Trust Readiness Report to discover other findings and the other 9 trust and speed capabilities.

 


Want to Build a Responsive and Intelligent Data Pipeline? Focus on Lifecycle


Today, enterprises need to collect and analyze more and more data to drive greater business insight and improve customer experiences.

To process this data, technology stacks have evolved to include cloud data warehouses and data lakes, big data processing, serverless computing, containers, machine learning, and more.

Increasingly, layered across this network of systems, is a new architectural model – “data pipelines” that collect, transform, and deliver data to the people and machines that need it for transactions and decision making. Like circulatory systems for blood to deliver nutrients to the body, data pipelines deliver data to fuel insights that can improve a company’s revenue and profitability.

 

 

However, even as data pipelines gain in popularity, many enterprises underestimate how difficult they can be to manage properly. Enterprises simply do not have the time or resources to introspect data as it moves across the cloud, so data lineage and relationships are often not captured, and data pipelines become islands unto themselves. Likewise, many first-generation data pipelines have to be rearchitected as the underlying systems and schemas change. Without proper attention, they can even break.

To ensure their data pipelines work appropriately, we suggest adopters start by making sure the pipeline’s “nervous system” is built to support reliability, change and future-proof innovation.

 

Understanding the Data Pipeline Lifecycle

While data pipelines can deliver a wide range of benefits, getting them right requires a broad perspective. So, we suggest organizations get familiar with the full data pipeline lifecycle, from design and deployment to operation and governance.

Taking a lifecycle approach will greatly improve the chances that your data pipeline is intelligent and responsive to your company’s needs – and will provide quick access to trusted data. Having deep experience in data pipeline design and delivery, I’d like to share the overall lifecycle – and tips for a successful implementation.

Design

It is vital to design data pipelines so they can easily adapt to different connectivity protocols (database, application, API, sensor protocol), different processing speeds (batch, micro-batch, streaming), different data structures (structured, unstructured), and different qualities of service (throughput, resiliency, cost, etc.).

For example, during the design phase, these challenges can be addressed through a flexible, intuitive, and intelligent design interface, including autosuggestion, data sampling for live preview, and design optimization.

Some potential challenges within the Design phase include how to access the data, what its structure is, and whether it can be trusted. This means it’s important to have live feedback on what you’re building, or you end up in a tedious design-test-debug-design cycle. The framework must have the right level of instrumentation so developers can capture and act on events, addressing changes in data structure and content in real time.

 

Modern data pipelines also need to support:

  • Data semantic and data structure changes while ensuring compatibility through a schema-less approach
  • Data quality validation rules to detect anomalies in the content flowing through the pipeline
  • Full data lineage to address governance requirements such as GDPR
  • “Out of order” real-time data processing in the case of data latency

 

Deployment

When building a data pipeline, it’s important to develop the pipeline to be as portable and agile as possible. This will ensure your technology choices from the beginning will prove long-lasting – and not require a complete re-architecture in the future.

During the deployment phase, challenges can include where to deploy each part of the pipeline (local to data or at the edge), on what runtime (cloud, big data, containers), and how to effectively scale to meet demand.

For example, we often see clients who start on-premises, then go to a Cloud/Hybrid platform, then incorporate a multi-cloud and/or serverless computing platform with machine learning. Working through an abstraction layer like Apache Beam, for example, ensures this level of flexibility and portability, where the data pipeline is abstracted from its runtime.
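
As an illustration of that point, a minimal Apache Beam pipeline in Java looks exactly the same whether it later runs on the direct runner, Spark or another supported runtime; only the runner option changes (a generic sketch, not Talend-specific code; the file paths are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortablePipeline {
    public static void main(String[] args) {
        //The runner (DirectRunner, SparkRunner, DataflowRunner...) is supplied at launch time,
        //e.g. --runner=SparkRunner, so the pipeline code itself stays runtime-agnostic
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadInput", TextIO.read().from("input.csv"))
         .apply("UpperCase", MapElements.into(TypeDescriptors.strings())
                                        .via((String line) -> line.toUpperCase()))
         .apply("WriteOutput", TextIO.write().to("output"));

        p.run().waitUntilFinish();
    }
}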

Another consideration is scale. Data pipelines and their underlying infrastructure need to be able to scale to handle increasing volumes of data. In today’s cloud era, the good news is that you can get the scalability you need at a cost you can afford. A technique that works is to use a distributed processing strategy where you process some data locally (e.g. IoT data), and/or utilize new serverless Spark platforms where you just pay for what you need when you need it.

 

Operate and Optimize

This phase presents a range of challenges to capturing and correlating data, as well as delivering analytics and insights as outcomes. Among the challenges are how to handle changing data structures and pipelines that fail, and how to optimize and improve data pipelines over time. We find that AI/ML is of sufficient maturity to be very helpful here.

At runtime, data pipelines need to have capabilities to intelligently respond and improve, rather than fail. For example, autoscaling as the volume of data increases through serverless infrastructure provisioning and auto load balancing, dynamically adjusting to changing schemas, and autocorrection. All of this is AI-driven thanks to technical, business, or operational historical or real-time metadata.

AI is also used to optimize data pipeline operations and highlight bottlenecks, decreasing the mean time to detect errors, investigate, and troubleshoot. And with auto-detection or adaptation to schema changes at runtime, AI keeps your pipelines running.

One last tip here. Practitioners also can optimize their pipelines with machine learning through a framework like Apache Spark. Spark’s machine learning algorithms and utilities (packaged through MLlib) allow data practitioners to introduce intelligence into their Spark data pipelines.

 

Governance

As companies integrate many more types of structured and unstructured data, it is a requirement to understand the lineage of data, cleanse, and govern it. Having a well-crafted data governance strategy in place from the start is a fundamental practice for any project, helping to ensure consistent, common processes and responsibilities.

We suggest users start by identifying business drivers for data that needs to be carefully controlled and the benefits expected from this effort. This strategy will be the basis of your data governance framework.

Common governance challenges for the adaptive data pipeline include complying (or facing severe penalties for not complying) with recent regulations such as the General Data Protection Regulation (GDPR) for European data or the California Consumer Privacy Act (CCPA).

If built correctly, your data pipeline can remain accurate, resilient, and hassle-free – and even grow smarter over time to keep pace with your changing environments – whether it be batch to streaming and real-time; hybrid to cloud or multi-cloud; or from Spark 1.0 to 2.4 to the next big thing.

 

This article was originally published on Integration Developer News.


5 best practices to innovate at speed in the Cloud: Tip #5 Accelerate data delivery to third party applications and teams through APIs


Starting last September, The Talend Blog Team started sharing fruitful tips to securely kick off your data project in the cloud at speed.  This week, we’re concluding the series with the fifth and last capability: Accelerate data delivery to third party applications and teams through APIs.

 

Billions of times each day, application programming interfaces (APIs) facilitate the transfer of data between people and systems, serving as the fabric that connects businesses with customers, suppliers, and employees. Having the right API strategy in place can make the difference between success and failure when it comes to utilizing APIs to deliver results, reduce response times, and improve process efficiency.

An API is a building block of code that helps programmers connect their applications to data services. Once data is accessible through an API, it can be reused in a controlled way by potentially anyone within and beyond an organization.

Why it’s important

Digital transformation doesn’t stop with analytics. Organizations need to deliver trusted data into everything they do, across their operations and as part of customer experiences. APIs significantly improve the value of your data by making it ubiquitous, allowing any application or connected service to embed it in an easy and sustainable way.

 

When it’s important

APIs are the small bricks that build successful consumer or user experience. An API makes data actionable because it brings all the needed data together into applications to trigger actions or guide human interactions. Whether we are making a purchase, looking for a new doctor, or checking a book out of the library, every stakeholder in the transaction can benefit from APIs.

 

Our recent data trust readiness report reveals that only 43% of management-level respondents are fully confident in their organization’s ability to accelerate data delivery to third party applications and teams. The rate falls to 28% among operational data workers.

Download Data Trust Readiness Report now.

 

How Talend tools can help

With the transformation of the retail industry, frictionless methods of payment, PSD2 and the Open Banking Standard, organizations need to continue to improve their API strategy and ensure secure transactions in the cloud.

Increased reliance on APIs, as well as the shift toward open source, REST, and cloud-native API technologies mean that companies need a comprehensive API solution to stay competitive and remain ahead of the technology curve. 

 

Talend Cloud API Services

Talend Cloud API Services brings full development lifecycle support for APIs—from design to deployment—to Talend Cloud. Now you can build APIs in days, not months, and extend your platform to new business models and partners, so you can get ahead of the next wave of data services and data monetization with Talend Cloud.


Figure 1: Easily Test APIs with Cloud API Service

 

How to get started

Read this comprehensive guide about API to learn how to build an API Platform strategy at your company.

Watch this short video on how to share data at scale and see how to:

  • Design APIs faster with the graphical API Designer
  • Accelerate testing through live API preview and iterative prototyping
  • Add data quality, advanced routing, and transformation with Talend Studio
  • Enable DevOps and deploy anywhere

 

Want to explore more capabilities?

This is the last of our five speed capabilities. Go and download our Data Trust Readiness Report to discover other findings and the other 9 trust and speed capabilities.

 

If you missed any of our previous Trust & Speed tips, you can find them all below:

Trust

Tip #1 | Tip #2 | Tip #3 | Tip #4 | Tip #5

Speed

Tip #1 | Tip #2 | Tip #3 | Tip #4


Why I stopped practicing law? Because data is king.


 

When I was in my mid-twenties, I thought I had it all. I had just recently graduated from a top law school, passed the California Bar Exam, and was working as a junior associate at a prestigious San Francisco law firm. Three short years later, I had turned my back on law and embarked on a career in the technology field, which, after many twists and turns, including stints as an analyst at Gartner, positions at a number of start-ups (some of which were actually somewhat successful) and some of the world’s largest companies (Dell and EMC), has landed me in my current position on Talend’s product marketing team.

Over the years, I have been asked many times why I left the practice of law. My usual answer has always been what you would expect. Quality of life (i.e. no time left for a personal life), office politics (need to cozy up to the right partners to advance), and an unhealthy dislike for billable hours (who wants to document and charge for every minute of every day) were some of my go-to responses. But now that I have been working at Talend for more than half a year, I have realized that the true reason went much deeper than that. Let me try to explain.

Talend provides data integration, quality and management solutions to organizations of all sizes – from smaller companies to some of the world’s largest enterprises. Our number one goal is to make sure that organizations have all the data they need to make the right decisions and take the right actions – whether it is to have more compelling engagements with customers, develop better products, or make more efficient and cost-effective operational decisions. And I believe in this goal. When you think about it, this is the exact opposite of what a lawyer does.

 

 

A lawyer’s job (and I am speaking from the perspective of a trial lawyer, which is what I did) is to limit the amount of data – evidence in the legal parlance – that is used by the ultimate decision maker (whether it is a jury or a judge) as much as possible to what favors your client’s side. Through a variety of motions before a trial and objections during trial (think of terms like hearsay, prejudicial, or irrelevant that you have heard in numerous TV shows or movies), lawyers try to limit the data or evidence that should be considered in making the ultimate decision.

 

While this seems to work fine in an adversarial situation, think what it would be like if business decisions were made the same way. What if a company decided to develop one product over another because the product development team for the chosen product was able to limit what the other team could share with the executive decision makers? Or if a decision to expand into a new territory was made based on market data that was incomplete and didn’t cover all regions?

 

I have always been a data head deep down – in college, my favorite class (and my highest grade) was statistics. Looking back on it, I think I realized at a subconscious level that limiting or hiding data was not what I wanted to do for a living. That’s why I find it so appropriate that I ultimately ended up at Talend, a company whose goal is the opposite.

 

If you are guilty of being as data driven as I am and want to ensure that you have all the information you need to make the right decisions and take the right actions, consider how your organization can benefit from improved data transparency and data access. Check out Talend Data Fabric to learn how your data can work for you.

So, how do you plead?


How Databricks on AWS and Stitch Data Loader can help deliver sales and marketing analytics


Sales and marketing analytics is one of the most high-profile focus areas for organizations today, and for good reason – it’s the place to look to boost revenues. Savvy executives are already on board with the impact – in a survey of senior sales and marketing executives, McKinsey finds that “extensive users of customer analytics are significantly more likely to outperform the market.”

For organizations thinking seriously about revenue growth, teams of data scientists and analysts are tasked with transforming raw data into insights.

Data scientists and analytics teams are poring over ever-increasing volumes of customer data, but also identifying more and more sources of data relevant to their efforts.

So for today’s topic, let’s take a look at how you can quickly access the myriad of customer data sources with Talend’s Stitch Data Loader and get that data into an advanced analytics platform, in this case Databricks running on AWS and using a Lambda function to automate a common manual task.

Companies like Databricks, with their Unified Data Analytics Platform, provide powerful tooling for data teams to deliver actionable insights and predictions to the business. Stitch is our web-based ingestion service that allows users to connect to sources in minutes and schedule ongoing ingestion processes to move that data into common cloud data services.

Stitch is very extensible (see Singer.io) and to date has over 100 sources, most of which integrate with sales and marketing platforms, and supports 8 of the most widely used cloud data analytics platforms as destinations.

The Ingestion Problem

Databricks’ Unified Data Analytics Platform, widely used by data scientists, allows users to store, explore, analyze and process incredible volumes of data quickly. But first, data needs to get into the environment. This is the very first obstacle anyone needs to tackle before they can begin analyzing their data.

Enter Stitch. By utilizing Stitch and a small AWS Lambda function, we can ingest data from many different sales and marketing SaaS applications and land the data in an AWS S3 bucket. From there, the Lambda function automatically moves the data into the Databricks File System. Now the Databricks user has the data in the foundational layer of the Databricks platform. From this point we can execute some Python code to take the data loaded into the directory and begin to model or visualize the data until we are happy with the results. Once completed, we can then save the DataFrame as a global table in the cluster for others to use, or put it in Delta format and save it as a Delta Lake table.

 

Example (Create table from data in directory):

dataFrame = "/FileStore/tables"

data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(dataFrame)

data.write.mode("overwrite").saveAsTable("MOCKDATA")

 

Let’s examine the small bit of Python code above. The first line has a variable ‘dataFrame’ that points to the directory that we have the Lambda function loading data into from above. The next line loads all the data in that directory; here we have told it the format, whether it has a header and, if so, whether it should infer the schema from the file.

Finally, we tell the cluster to write the data but as a Global table so that it persists beyond our execution of this code block. We also tell it to overwrite the data if we have this set to process new data loads. We give it the name ‘MOCKDATA’ and now anyone can easily go into Databricks and issue a simple SQL statement such as ‘select * from MOCKDATA’ or connect to your ETL tool of choice (Talend) and process the data as any regular result set.

Additionally, we can also create a Delta Lake table very quickly now with the following SQL statement:

create table MOCKDATA_DELTA USING DELTA LOCATION '/delta2' AS select * from MOCKDATA

 

Here we create a Delta Lake table called MOCKDATA_DELTA and store this in a directory in our Databricks Filesystem called delta2. We take the query of the batch table and load its results into the newly created table.

How To

For starters, we will need an AWS S3 bucket and a Stitch account. I am not going to discuss how to set up S3 in this blog, but there are a number of tutorials on the AWS website on how to achieve this. For the Stitch account, just go to the Stitch website, sign up for a free account and create a new Integration.

The Lambda Function

Creating a Lambda function is not all that difficult, and the code for this example is very short and to the point. However, the purpose of this blog is to provide you with easy and configurable artifacts so you can quickly duplicate them and get to work. The first thing I did was code all the Databricks FS REST APIs, as there are not that many of them and, to be honest, they are very simple and easy to work with. So, why not?

Let’s discuss in a bit of detail the Lambda function I wrote to take the data landed in the S3 bucket by Stitch and stream it into Databricks.

this.dbfs = DBFS.getInstance(System.getenv("DB_REGION"), System.getenv("DB_TOKEN"));
String path = System.getenv("DB_PATH");

Our Lambda function looks for three environment variables to execute properly. If you look at the code above, you will see the following

  • DB_REGION: This is the region that your Databricks workspace is deployed in
  • DB_TOKEN: This is the token that will Authenticate your requests
  • DB_PATH: This is the directory path that you want the files to be streamed into

 

From the AWS Lambda Function this is what it looks like:

 

context.getLogger().log("Received event: " + event);
List<S3EventNotificationRecord> records = event.getRecords();
try {
    dbfs.mkdirs(path);
} catch (DBFSException e) {
    context.getLogger().log(e.getMessage());
}

In the above lines of code, we are logging the event that has triggered our Lambda function and retrieving the list of records from this event. Then we attempt to create the full path that was used in the DB_PATH environment variable. If it already exists, then the catch block will log this; if not, the entire path will be created on the Databricks FS.

 

 

for (S3EventNotificationRecord record : records)
{
    S3Object fullObject = null;
    String filename = record.getS3().getObject().getKey();
    AmazonS3 client = new AmazonS3Client();
    S3Object object = client.getObject(new GetObjectRequest(record.getS3().getBucket().getName(),filename));

    context.getLogger().log("FileName : " + filename);
    String xpath = path+"/"+filename;

    try {
        if (!paths.contains(xpath)) {
            context.getLogger().log("Creating Path " + xpath);
            int handle = dbfs.create(path + "/" + filename);
            context.getLogger().log("Handle: " + handle);
            processFile(object.getObjectContent(), context, handle);
            dbfs.close(handle);
            context.getLogger().log("Closing Handle: " + handle);
        } else {
            context.getLogger().log(xpath + " already exists!");
        }
    } catch(DBFSException e)
    {
        context.getLogger().log(e.getMessage());
    }
}

In the preceding block of code, we are looping over every record from the event to get the file name and create the variable xpath. This variable is a combination of both the DB_PATH environment variable as well as the file name that was just written to your S3 bucket.

Next, we make a call to Databricks to create the file and have Databricks return the handle to this file. The handle will be used going forward to write data into the Databricks FS. The processFile function takes the S3 object’s content stream and the Databricks file handle and loops through the file until it has written the entire file into Databricks. Finally, we close the Databricks file by passing in the handle and calling the close function.
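
The processFile helper itself is not shown in the article. As a rough idea of what it could look like, assuming the author's DBFS wrapper exposes an addBlock(handle, base64Data) method around the DBFS REST API's add-block endpoint (which accepts base64-encoded chunks of up to 1 MB), it might be something like this:

//Hypothetical sketch only - the real helper depends on the author's DBFS wrapper class
private void processFile(java.io.InputStream in, com.amazonaws.services.lambda.runtime.Context context, int handle)
        throws DBFSException {
    byte[] buffer = new byte[1024 * 1024];                 //add-block accepts up to 1 MB per call
    int bytesRead;
    try {
        while ((bytesRead = in.read(buffer)) != -1) {
            //Base64-encode the chunk that was actually read and append it to the open DBFS handle
            String base64 = java.util.Base64.getEncoder()
                    .encodeToString(java.util.Arrays.copyOf(buffer, bytesRead));
            dbfs.addBlock(handle, base64);                 //assumed wrapper around /api/2.0/dbfs/add-block
        }
        in.close();
    } catch (java.io.IOException e) {
        context.getLogger().log("Error streaming file into DBFS: " + e.getMessage());
    }
}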

Now that our data is loading into the Databricks file system the Data Scientist / Data Engineer now has control to do whatever they want.

 

Next steps

If you would like to reproduce the steps from this article, you can do so from the following.

 

Good luck and happy loading!


Talend Solution Templates – Try on your superhero suit


Hello Talend Superheroes!!! Hope all of you are enjoying your time in the Talend realm playing with different types of data. Many of you are old-timers of Talend, but every day new developers are happily joining the band to learn about various superpowers related to data. Like any superhero movie, there is a learning curve to understand and manage the newly acquired capabilities the right way. As Uncle Ben said, “With great power comes great responsibility.” Being an Avengers fan, the first thing that comes to my mind is the Spidey Suit given to Peter Parker so that he can hone his newly acquired superpowers in a controlled manner 😊

Talend Academy – Your own data superpower training suit

In the Talend world, the best way to learn various data processing superpowers is through Talend Academy. Talend makes sure that developers at both customers and partners benefit from the Talend learning platform. The details for Talend Academy can be accessed from the links below.

Category           Link
Customer Website   https://academy.talend.com
Partner Website    https://partneracademy.talend.com


 

Talend Academy provides amazing learning experiences even for budding data superheroes who are new to the data realm. Some of the key features of this awesome learning platform include:

  • All-you-can-learn access to 180+ hours of online training
  • Instantly available, cloud-based virtual machines (VMs)
  • Fully functional, preinstalled Talend software
  • Just-in-time training for licensed Talend software users

 

Industry veterans have already taken note of this initiative, and Talend Academy was recognized with a 2019 CEdMA Innovation Award for outstanding innovation in education business methods or practices. The details of the press release can be found at this link.

 

Talend Solution Templates

Talend Solution Templates are the new way to get started quickly with Talend integration projects once the developers complete the initial set of trainings from Talend academy. The Solution Templates are fully functional practical projects that developers can import into their studio to speed up their development, while leveraging already built-in job design best practices. Developers will benefit from fully customized real-world solutions and can extend or replicate these examples within their projects.


 

The solution templates focus on specific integration patterns like ingestion and transformation of data from various source systems to the data lake. Each solution template provides a fully functional Talend code base along with associated documentation, as shown below.


 

Let us take a simple example of the Salesforce to Snowflake Incremental Lift and Shift template job. The zip file of the Talend Solution Template provides fully functional Talend jobs to extract data from Salesforce and load it into Snowflake. The jobs illustrate Talend data extraction flows for 17 major Salesforce objects using the recommended methodology.


 

Associated documentation for templates comes in handy for Talend Developers who are interested in knowing the details of each stage of the template job.


 

Talend Solution Template Library is growing quickly, so keep an eye out regularly for new template jobs that will appear in the above space for integrating an immense variety of different systems.

 

Free Solution Template Modules for the budding data superheroes

Talend Academy is quite happy to welcome all Talend developers (a.k.a. data superheroes) to try the Talend Solution Templates. You can register for a free trial as shown below.


 

Talend is happy to announce that the following modules are currently available in the free tier version of Talend Solution Templates.

 

  1. Context Management Common Framework
  2. Salesforce to Snowflake Incremental Lift and Shift
  3. Salesforce to Amazon Redshift Incremental Lift and Shift
  4. Salesforce to Microsoft Azure SQL DWH Incremental Lift and Shift


 

I hope all of you enjoy the new feature in your Talend Superhero Suit and get to try it out soon.
Until the next super blog topic, enjoy your time using Talend!

 



10 Things You’re Doing Wrong in Talend


…and how to fix them!

We’ve asked our team of Talend experts to compile this top ten list of their biggest bugbears when it comes to jobs they see in the wild – and here it is!

10. Size does matter

Kicking off our list is a common problem – the size of individual Talend jobs. Whilst it is often convenient to contain all similar logic and data in a single job, you can soon run into problems when building or deploying a huge job, not to mention trying to debug a niggling nasty in the midst of all those tMaps!

Also, big jobs often contain big sub-jobs (the blue-shaded blocks that flows are grouped into), and you will eventually come up against a Java error telling you some code is “exceeding the 65535 bytes limit”. Java imposes a hard limit on the size of a single method's code, and Talend generates a method for each sub-job.

Our best advice is to break down big jobs into smaller, more manageable (and testable) units, which can then be combined together using your favourite orchestration technique.

 

9. Joblets that take on too much responsibility

Joblets are great! They provide a simple way to encapsulate standard routines and reuse code across jobs, whilst maintaining some transparency into their inner workings. However, problems can arise when joblets are given tasks that operate outside the scope of the code itself, such as reading and writing files, databases, etc.

This usage adds complexity when reusing joblets across different jobs and contexts, and can lead to unexpected side effects and uncertain testability.

We have found that you can get the best from joblets when treating them as the equivalent of pure functions, limiting their scope to only the inputs and outputs defined by the joblet’s connections and variables.

8. “I’m not a coder, so I’m not looking at the code”

Ever seen a NullPointerException? If you’re a Talend engineer, then of course you have! However, this kind of run-time error can be tricky to pin down, especially if it lives in the heart of a busy tMap. Inexperienced Talend engineers will often spend hours picking apart a job to find the source of this kind of bug, when a faster method is available.

If you’ve worked with Talend for a while, or if you have Java experience, you will recognise the stack trace, the nested list of exceptions that are bubbled up through the job as it falls over. You can pick out the line number of the first error in the list (sometimes it’s not the first, but it’ll be near the top), switch to the job’s code view and go to that line (ctrl-L).

Even if the resulting splodge of Java code doesn’t mean much, Eclipse (the technology that Talend Studio is built on) will helpfully point out where the problem is, and in the case of a tMap it’s then clear which mapping or variable is at fault.

7. No version of this makes sense

Talend, Git (and SVN) and Nexus all provide great methods to control, increment, freeze and roll back versions of code – so why don’t people use them? Too often we encounter a Talend project that uses just a single, master branch in source control, has all the jobs and metadata still on version 0.1 in Studio, and no clear policy on deployment to Nexus.

Without versioning, you’ll miss out on being able to clearly track changes across common artefacts, struggle to trace back through the history of part of a project, and maybe get into a pickle when rolling back deployed code.

It’s too huge a topic to go into here, but our advice is to learn how your source control system works – Git is a fantastic bit of software once you know what it can do, and come up with a workable policy for versioning projects sources and artefacts.

6. What’s in a name?

Whilst we’re on the topic of not having policies for things, what about naming? A busy project gets in a mess really quickly if there’s no coherent approach to the naming of jobs, context groups/variables, and other metadata items.

It can be daunting to approach the topic of naming conventions, as we’ve all seen policy documents that would put Tolstoy to shame, but it doesn’t need to be exhaustive. Just putting together a one-pager to cover the basics, and making sure all engineers can find, read and understand it, will go a long way.

Also, while you’re about it, think about routinely renaming components and flows within jobs to give them more meaningful names. Few of us would disagree that tFileInputDelimited405 is not quite as clear and informative LoadCustomerCSV, so why don’t we do it more? And renaming flows, particularly those leading into tMaps, will make everyday chores a breeze!

5. Don’t touch the contexts!

So often we see context variables being updated at different points in a job – it’s so easy to do as they’re right there at context.whatever, with the right data type and everything. But then how can you refer back to the original parameter values? And what if you then want to pass the original context into a child job, or log it to a permanent record?

If you need a place to store and update variables within a job, the globalMap is your friend. Just be careful to cast the correct type when reading values back, or you may get Java errors. For example, if you put an integer into the map, make sure you read it back as ((Integer)globalMap.get(“my_variable”)), as auto-generated code will often put (String) as the cast by default.
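For instance, here is a minimal sketch of the pattern as it might appear in a tJava component (the variable name and value are made up for illustration):

// Store a value once, e.g. in a tJava component at the start of the job
globalMap.put("my_batch_size", 500);

// ...and read it back later with a cast that matches the stored type
Integer batchSize = (Integer) globalMap.get("my_batch_size");
// Casting to (String) here would compile but fail at runtime with a ClassCastException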

4. But I like all these columns, don’t make me choose!

Something that adds to the unnecessary complexity and memory requirements of a job is all those columns that are pulled from the database or read from the file and not needed. Add to that, all the rows of data which get read from sources, only to be filtered out or discarded much later in the job, and no wonder the DevOps team keep complaining there’s no memory left on the job server!

We find that a good practice is to only read the columns and rows from the source that the job will actually need. That does mean you may need to override the metadata schema and choose your own set of columns (although that can, and often should, then be tracked and controlled as a generic schema), and while you’re about it, maybe throw a WHERE clause into that tDbInput’s SQL statement, as in the sketch below. If that’s impossible or impractical, then at least drop in a tFilterColumns and/or tFilterRow at the earliest possible point in the flow.
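As a rough sketch (the table, columns and context variable here are hypothetical), the tDbInput query expression can push the filtering down to the database rather than reading everything:

// tDbInput query expression: select only the columns and rows the job actually needs
"SELECT customer_id, first_name, last_name, email "
+ "FROM customers "
+ "WHERE status = 'ACTIVE' "
+ "AND updated_at >= '" + context.last_run_date + "'"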

3. Hey, let’s keep all these variables, just in case

Context groups are great – just sprinkle them liberally on a job and you suddenly have access to all the variables you’ll ever need. But it can soon get to the point where, especially for smaller jobs, you find that most of them are irrelevant. There can also be cases where the purposes of certain context groups overlap, leading to uncertainty about which variable should actually be set to achieve a given result.

A good approach is first to keep context groups small and focussed – limit the scope to a specific purpose, such as managing a database connection. Combine this with a clear naming convention, and you’ll reduce the risk of overlapping other sets of variables. Also, feel free to throw away variables you don’t need when adding a group to a job – if you only need to know the S3 bucket name then do away with the other variables! They can always be added back in later if needed.

2. Jumping straight in

By missing out a tPrejob and tPostjob, you’re missing out on some simple but effective flow control and error handling. These orchestration components give developers the opportunity to specify code to run at the start and end of the job.

It could be argued that you can achieve the same effect by controlling flow with onSubjobOk triggers, but code linked to tPrejob and tPostjob is executed separately from the main flow, and, importantly, the tPostjob code in particular will execute even if there’s an exception in the code above.

1. Mutant tMap spiders

Too many lookup inputs in a tMap, often coming in from all points of the compass, can make a Talend job look like something pulled from a shower drain. And just as nice to deal with. Add to this a multitude of variables, joins and mappings, and your humble tMap will quickly go from looking busy and important to a pile of unmanageable spaghetti.

A common solution, and one we advocate, is to enforce a separation of concerns between several serial tMaps, rather than doing everything at once. Maybe have one tMap to transform to a standard schema, one for data type clean-up, one to bring in some reference tables, etc. Bear in mind though that there’s a performance overhead that comes with each tMap, so don’t go overboard and have 67 all in a line!

At the end of the day, each job, sub-job, and component should be a well-defined, well-scoped, readable, manageable and testable unit of code. That’s the dream, anyway!

 

Special thanks to our Talend experts Kevin, Wael, Ian, Duncan, Hemal, Andreas and Harsha for their invaluable contributions to this article.

 

This article was originally published on Datalytyx.

The post 10 Things You’re Doing Wrong in Talend appeared first on Talend Real-Time Open Source Data Integration Software.

This year’s Talend at AWS re:invent recap: Bigger, busier, and better than ever before


Whether you stopped by to visit us in person at AWS re:Invent last week or you planned on it but didn’t get a chance, here are some of the most important key points and takeaways:

Paving the way with ML and AI

An ever-present part of AWS re:Invent is Andy Jassy announcing a seemingly endless list of new features, services, and capabilities coming to customers, developers, and partners. It was very clear that AWS sees Machine Learning and AI as a critical path forward in 2020, as it focused a significant number of announcements on Amazon SageMaker, its managed service for machine learning modeling and deployment.

Talend gets in on the announcing, too

It was an event for big announcements, indeed. We were very pleased to announce that Talend Cloud is now available for purchase on the AWS Marketplace.

This is massive news! Now that both Talend Cloud and Talend’s Stitch Data Loader are available in AWS Marketplace, you will have access to incredible capabilities such as:

  • Broad connectivity for on-premises, cloud, hybrid, and multi-cloud environments and support for more than 70 AWS services including Amazon Redshift, Amazon Simple Storage Service (Amazon S3), Amazon EMR, and Amazon Relational Database Service (Amazon RDS)
  • Built-in pervasive data quality to ensure trusted data and regulatory compliance
  • Governed self-service apps, such as Pipeline Designer, Data Preparation, Data Stewardship, to democratize data and facilitate transition to cloud for data-driven organizations
  • Predictable pricing (user-based) with pay-as-you-go options

In our formal press release, Mike Pickett, Senior Vice President of Business Development and Ecosystems at Talend, highlights the importance of this business milestone by pointing out that, “[c]ustomers often want to purchase through AWS Marketplace, and this is an excellent opportunity to expand our footprint in the market.” Pickett expresses our commitment by adding, “We look forward to continuing to expand our relationship with AWS and providing a truly seamless experience for our customers using Talend Cloud in an AWS environment.”

Talend in more AWS programs

Adding to the excitement, we were also honored to announce that Talend is an inaugural participant and partner in two separate AWS programs:

  1. Amazon Redshift Service Ready Program

The combination of Talend and Redshift has been used by many customers over the years. Talend has always made it easy for customers to not only move their data into Amazon Redshift, but to govern, clean, and organize that data on an on-going basis. Achieving the Amazon Redshift Ready partner designation means that Talend has been recognized for demonstrating successful integration with Amazon Redshift and is generally available and fully supported for AWS customers.

AWS has some exciting new features and capabilities coming, as announced during re:Invent, and we look forward to enhancing Talend to best leverage those features for our joint customers.

  2. AWS Retail Competency designation

Achieving the AWS Retail Competency in the Advanced Retail Data Science category differentiates Talend as an AWS Partner Network (APN) member that delivers highly specialized technical proficiency and possesses deep AWS expertise to deliver solutions seamlessly on AWS. With over 50 shared customers between Talend and AWS in retail and consumer products, this designation only furthers our relationship.

 

Even if you couldn’t come by to visit us at re:Invent, with these great points, we hope you will be able to share that message with more people and encourage them to broaden their use of both Talend and AWS.

The post This year’s Talend at AWS re:invent recap: Bigger, busier, and better than ever before appeared first on Talend Real-Time Open Source Data Integration Software.

The best part of holiday travel: Flight tracking with Talend Pipeline Designer


Nobody likes air travel during the holiday season, especially when it involves long queues at the airport security checkpoints or delayed departures thanks to unfavorable weather. Of course, once onboard, then you have to deal with a crying baby nearby or an eager conversationalist at your side. Oh, the agony!

Luckily, amidst the unpleasantries associated with air travel, there is still consolation for those of us who have opted out of the airfare and have instead chosen to sit idly by at our computers to watch the air traffic chaos unfold from a safe distance.

When a new tool is released, full end-to-end examples are hard to find, and yet this is one of the best ways to get hands-on experience. “We learn by example,” as we used to say. And what better way to learn than with a cool real-time example? With Talend Pipeline Designer, available since April 2019, I will show you how you can build a detailed air traffic tracking dashboard in real time from the comfort of your spacious armchair. Sounds cool, huh?

What I am about to show you dives deeper into the awesome capabilities of Talend Pipeline Designer, and I am hoping that by following it you will get a good overview of the product. Usually, when I want to learn a new tool, I start with a guided tutorial or blog such as this.

Disclaimer: A getting started guide is always available in the documentation and I highly recommend you give it a look before continuing on with this blog.

How to Access Pipeline Designer?

There is no download required to build your data pipelines—the entire interface is web-based, and the product is hosted in the cloud. If you are a current Talend Cloud customer, ask your administrator to activate Pipeline Designer (it comes free with all Talend Cloud developer licenses!). If you aren’t a Talend Cloud customer, please give Pipeline Designer a try during a 14-day trial, it takes less than two minutes to get started.

<< Sign up to the free trial >>

Talend Pipeline designer

What are we going to achieve?

The goal here is first to ingest data in real time from an open API providing current aircraft information. Thanks to Talend Pipeline Designer we will process the incoming data along the way in order to aggregate and transform it, and eventually display aircraft positions on a nice visual dashboard.

Note: This post is written specifically for beginner-level users. However, a basic grasp of technologies such as Apache Kafka and Elasticsearch will help.

Before we start, what do we need?

As you may have understood, we are going to process data in real time, which means in streaming mode as opposed to batch mode. To do so we will use a message broker as a source to ingest data, and I am calling out Apache Kafka! To store and display our airplane information we are going to use Elasticsearch as a destination, with Grafana on top of it. In this article I won’t go into much detail about how these tools work, so don’t hesitate to look at their documentation if needed. However, rest assured, this tutorial is detailed enough that you can follow it without expertise in any of these tools.

To sum up, Talend Pipeline Designer will ingest data from a source, in our case Kafka, and push the processed data to a destination, Elasticsearch. The following requirements are needed:

  • Zookeeper + Kafka 0.10.0.x to 1.1.x
  • Elasticsearch v6.5
  • Grafana (latest version preferred)
  • Talend Pipeline Designer (sign-in for a free trial here)

Talend Pipeline designer

To help you, we have provided a docker-compose file to deploy Kafka, Elasticsearch, and Grafana anywhere you want, provided you have Docker installed. You only need to make sure the pipeline will be able to access them. See the next section to understand where your pipeline will be executed.

With the docker compose file provided, you can run the following command in the same folder on an accessible machine:

docker-compose up

Alternatively, you can use managed Kafka and Elasticsearch services in the cloud.

Where will the pipeline be executed?

With Talend Pipeline Designer you have multiple ways to run your pipelines. If you subscribe to a free trial, the easiest way to go is to use the cloud engines provided to you. These are engines managed by Talend, and you don’t need to set up anything. If you want to run your pipeline in your own environment (a VPC or even your laptop) you can install what we call a Remote Engine for Pipelines. Please look at the documentation to install a remote engine.

Quick tip: The fastest way to deploy a Remote Engine for Pipelines is to use the CloudFormation template for AWS or the ARM deployment template for Azure.

Where does the data come from?

Good question! We will use the OpenSky Network open API, based on the ADS-B protocol. ADS-B is a technology used by aircraft to periodically broadcast their position. It allows air traffic control stations to track airplanes in addition to their regular radars. Luckily these signals are not encrypted for non-military aircraft. Consequently, a community of individuals has built a network of antennas around the world to gather most of the ADS-B signals in real time and openly offer them through an API. You can have a look at the data fields available here. To ease the ingestion of the data, we have prepared a script that periodically pushes the data from the API to a Kafka topic. This is a NodeJS script, so you need to install npm and nodejs to run it. As this can be cumbersome, and if you like Docker as much as I do, just pull the following Docker image and I will explain how to run it later.

docker pull talendinc/flight-tracker

For your information, Talend Pipeline Designer works natively with Avro. To put it simply, this is a serialization format that stores data as binary. It allows much more compact data transmission, meaning better throughput, but the drawback is the need to specify a schema. Indeed, as the data is in binary, we need a schema in order to retrieve the different fields and values from the documents. Therefore, the data requested from the open API will be converted into Avro to speed up our pipeline!
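If you are curious what that conversion involves, here is a minimal plain-Java sketch using the Apache Avro library (the record values are made up, and the schema is a shortened version of the one shown further down); the actual ingestion script is NodeJS and already does this for you:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // A shortened version of the aircraft schema used by the Kafka dataset
        String schemaJson = "{\"type\":\"record\",\"name\":\"aircrafts\",\"fields\":["
                + "{\"name\":\"icao24\",\"type\":\"string\"},"
                + "{\"name\":\"latitude\",\"type\":\"double\"},"
                + "{\"name\":\"longitude\",\"type\":\"double\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build one record and serialize it to compact binary bytes
        GenericRecord record = new GenericData.Record(schema);
        record.put("icao24", "4ca1d3");   // hypothetical aircraft identifier
        record.put("latitude", 48.85);
        record.put("longitude", 2.35);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();

        // Without the schema, these bytes cannot be decoded back into fields
        System.out.println(out.toByteArray().length + " bytes of Avro binary");
    }
}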

Now let’s get our hands-on Pipeline Designer!

In the next part of this blog you will have to replace {KAFKA_IP:PORT} and {ELASTICSEARCH_IP:PORT} with your own relative IP addresses and ports.

1) Data source

Once you have Talend Pipeline Designer at your fingertips we are going to start by creating a connection for Kafka. To do so click on “Connections” on the left menu and press the button “Add a connection.”

Fill the form with the following and test the connection.

Talend Pipeline designer

The next step is to create a Dataset. A Dataset is a particular set of data from a connection. For example, for MySQL it would be a table or for AWS S3 it would be a bucket and file. However, for a message broker such as Apache Kafka it is a topic. Create it as follows:

Talend Pipeline designer

Copy and paste the Avro schema:

{ 
   "type":"record",
   "name":"aircrafts",
   "fields":[ 
      { 
         "name":"icao24",
         "type":"string"
      },
      { 
         "name":"origin_country",
         "type":"string"
      },
      { 
         "name":"callsign",
         "type":"string"
      },
      { 
         "name":"time_position",
         "type":"long"
      },
      { 
         "name":"latitude",
         "type":"double"
      },
      { 
         "name":"longitude",
         "type":"double"
      }
   ]
}

 

If you have already tried to look at the sample, you will of course have found that it is empty. This makes sense because your Kafka topic is not populated yet. Let’s use the Docker image I made you pull previously:

docker run -e KAFKA_TOPIC=flights -e KAFKA_HOSTS=<KAFKA_IP>:9092 -e CITY=Paris talendinc/flight-tracker

This will send aircraft detected around Paris to the Kafka queue. You can change the name of the city; it works with the biggest cities in the world.

Please wait a bit and then try the preview again. You should get some records in your sample:

Talend Pipeline designer

We have our data source, let’s continue.

2) Data destination

As mentioned in the introduction, we chose Elasticsearch as a destination. The reason Elasticsearch is a good fit is that it is particularly effective for time-series data. In our case we are going to get aircraft data on a periodic basis with a timestamp. In addition, Elasticsearch can easily be paired with visual dashboard tools such as Kibana or Grafana. In the context of this blog we chose to use the latter.

Before creating a connection to Elasticsearch in Talend Pipeline Designer, we are going to prepare our Elasticsearch index. An index is comparable to a table in a regular SQL database. Elasticsearch can be managed through its RESTful API, so let’s create an index called “aircrafts” and specify its mapping. To do so, perform the following request:

curl -i -X PUT \
   -H "Content-Type:application/json" \
   -d \
' { 
   "mappings":{ 
      "aircrafts":{ 
         "properties":{ 
            "date":{ 
               "type":"date",
               "format":"epoch_second"
            },
            "location":{ 
               "type":"geo_point"
            },
            "id":{ 
               "type":"keyword"
            },
            "country":{ 
               "type":"keyword"
            }
         }
      }
   }
}' \
 'http://{ELASTICSEARCH_IP:PORT}/aircrafts'

 

Now that your Elasticsearch instance is ready, you can create a connection and a dataset for it in Talend Pipeline Designer.

Talend Pipeline designer

Talend Pipeline designer

3) Data Processing

We are now going to create the pipeline. Click on Pipelines in the left menu and click on “Add a pipeline”. You should get an empty canvas where you can add a new source. That is convenient, because you created one in the previous section: click on that box and select your dataset “Kafka Flights”. You should see your sample data; check that everything looks right before the following steps, as we want to see the live preview feature of Talend Pipeline Designer in action!

a). Window

The first component we are adding is a Window. The aircraft data is pushed to Kafka every five seconds by the script (provided previously). This is streaming data, so if we want to aggregate it, we need to set time boundaries in order to perform the aggregations. That’s why we are going to add a fixed window of 5 seconds. Please have a look at the documentation to understand windowing notions such as fixed, sliding, and session windows. In our demonstration we only want aircraft position updates in the last 5-second time frame, which corresponds to a fixed window.

Talend Pipeline designer

b). Aggregate

Now that the data flow is cut into time-frame windows, let’s aggregate the aircraft data in each of them. The goal of this aggregation is to remove duplicates. In fact, you could get the same aircraft twice in the same 5-second window. To avoid that, we “Group By” icao24 and origin_country. If the same aircraft appears several times in the 5-second window, we want to list its different positions. As operations we want:

MAX of time_position

LIST of latitude and longitude

You should get something like the following screenshot:

Talend Pipeline designer

c). Field Selector

This is where we are going to see one of the greatest Talend Pipeline Designer features. The Field Selector is a pretty simple component: you can select which fields you want to keep as the output of this component.

But where it becomes interesting is the availability of Avpath, an expression language to select, update, and filter Avro data. As mentioned before, Talend Pipeline Designer natively supports the Avro format. That is why you can use Avpath (an equivalent of JSONPath or XPath) in Talend Pipeline Designer. The Talend Avpath documentation is available here and you can take a look at its usage in the examples provided. In our case it is going to help us select the latest position of an aircraft when we have multiple positions for the same one. We simply use “[-1]” from Avpath, which selects the latest element of the list. This works in our example because we know that the open API guarantees a chronological order of the positions.

Talend Pipeline designer

d). Python

The last step requires the Python component, which allows us to use custom Python snippets to fulfill our custom needs in Talend Pipeline Designer. As you know, our goal is to display aircraft positions on a world map in Grafana. To achieve this, we are going to use the Worldmap Panel plugin of Grafana, which only accepts Geohash positions. A Geohash is a string that represents a geolocation in the world. It is particularly useful because it can also represent rectangular surfaces depending on the precision used to encode the position. In our case we use it as coordinates. Paste the following code in the Python component (Flat Map mode):

 

def encode(latitude, longitude, precision=12):
    # Standard base32 alphabet used by the Geohash algorithm
    __base32 = '0123456789bcdefghjkmnpqrstuvwxyz'

    lat_interval, lon_interval = (-90.0, 90.0), (-180.0, 180.0)
    geohash = []
    bits = [ 16, 8, 4, 2, 1 ]
    bit = 0
    ch = 0
    even = True
    # Repeatedly halve the longitude/latitude intervals, building the hash 5 bits at a time
    while len(geohash) < precision:
        if even:
            mid = (lon_interval[0] + lon_interval[1]) / 2
            if longitude > mid:
                ch |= bits[bit]
                lon_interval = (mid, lon_interval[1])
            else:
                lon_interval = (lon_interval[0], mid)
        else:
            mid = (lat_interval[0] + lat_interval[1]) / 2
            if latitude > mid:
                ch |= bits[bit]
                lat_interval = (mid, lat_interval[1])
            else:
                lat_interval = (lat_interval[0], mid)
        even = not even
        if bit < 4:
            bit += 1
        else:
            # Every 5 accumulated bits become one base32 character of the Geohash
            geohash += __base32[ch]
            bit = 0
            ch = 0
    return ''.join(geohash)

# Build the output record expected by Elasticsearch, encoding lat/lon as a Geohash
output = json.loads("{}")
output['id'] = input['id']
output['country'] = input['country']
output['date'] = input['date']
output['location'] = encode(input['lat'], input['lon'])

outputList.append(output)

Talend Pipeline designer

4) Run the pipeline

Now add your Elasticsearch dataset as the destination of your pipeline and run it!

As explained before you can send real-time data to your Kafka queue with this Docker image:

docker run -e KAFKA_TOPIC=flights -e KAFKA_HOSTS=<KAFKA_IP>:9092 -e CITY=Paris talendinc/flight-tracker

 

Talend Pipeline designer

5) Dashboard

Now open your Grafana tool and import this dashboard. Make sure to connect the data source to your Elasticsearch instance. The dashboard is set up to get the last 5 minutes of data in Elasticsearch and refreshes every 5 minutes. You can see the aircraft moving quickly and you can guess the paths they follow.

Talend Pipeline designer

Conclusion

As described in this article, you can create data pipelines, batch or streaming, in a matter of minutes with Talend Pipeline Designer. In our case we focused on streaming processing with Kafka as a source and Elasticsearch as a destination. It allows us to get aircraft positions in real time, which we lightly transform and store to follow the aircraft’s live positions. Building such a pipeline in code would have taken much more time, yet Talend Pipeline Designer requires no code to create efficient and powerful Spark processing pipelines.

If you are interested in seeing more, don’t hesitate to sign up for a free 14-day trial. You’ll be able to reproduce this example and create your own pipelines!

 

The post The best part of holiday travel: Flight tracking with Talend Pipeline Designer appeared first on Talend Real-Time Open Source Data Integration Software.

How to create a business glossary on Talend Data Catalog using API Services and Data Stewardship


We often come across people talking about managing their data by one means such as a data lake, MDM, or data governance. Modern data management is not only about managing your data but also about making data useful for the business. Furthermore, data management is also about providing the ability to relate frequently used business terminologies to data in the systems. Most big enterprises spend months discovering and identifying the impact of any change on their entire data supply chain.

For example, introducing a change to a business terminology could cause a domino effect of change on the dependent systems; companies usually spend a large amount of time accurately identifying the impact of such a change on the downstream systems, and even that does not guarantee a 100% success rate. Incorrect impact analysis usually results in breaks in the lineage and the propagation of incorrect information across the data supply chain. Talend Data Catalog is a one-source platform which can help you leverage a single source of truth, with the flexibility of using APIs to add or remove new terminologies and relate them to data within the systems.

With the introduction of the Talend Data Catalog API, we can now leverage REST API calls to automate actions on business terms such as creating, updating, and deleting terms.

A sample job shown in Figure 5 demonstrates how we can use Talend Data Catalog’s REST APIs through a Talend DI Job to set attributes and custom attributes for new business terms in a Talend Data Catalog glossary model as needed. Additionally, we can rely on Talend Data Stewardship to accept or reject changes made to terms in the business glossary.

Figure 1 below shows the Swagger documentation of the Talend Data Catalog APIs available. As seen from the documentation, Talend Data Catalog APIs provide a rich feature set to programmatically access and manipulate metadata content.

Rest API Calls

Figure 1: Talend Data Catalog Rest API Calls

Talend Data Stewardship for business terms

Data Stewardship plays a critical role in a successful data-driven glossary across the enterprise. Data stewards make a significant contribution to cleaning, refining, and approving the data. Talend provides a Data Stewardship portal and Studio components to leverage stewards for validating and approving terminologies that should be part of the enterprise glossary. The work of a data steward is dictated by two core components called campaigns and tasks. There are four types of campaigns: Arbitration, Resolution, Merging, or Grouping.

To begin, we will use a Resolution campaign to create a data model in the Data Stewardship portal. The data model we create needs to have attributes such as name, glossarypath, categorypath, and description, as shown in Figure 2. Data stewards will then explore the data that relates to their tasks and resolve the tasks on a one-by-one basis or for a whole set of records.

 
Data Stewardship Portal

Figure 2: Data Stewardship Portal to create Campaign and data model

In the example below we have created a new Talend Data Integration job to fetch business terms and their corresponding descriptions from a database and assign them to data stewards for approval. As shown in Figure 3, we can leverage enterprise databases that have a predefined table with terms and their definitions and push them through data stewards for changes/approvals, or we can pass the terms in through a file.

Talend Data catalog

Figure 3: Fetch terms from database and assign to stewards.

 

In the tStewardshipTaskOutput component, enter the correct URL of the Data Stewardship portal and the corresponding user credentials. Create columns in the schema matching the attributes of the Data Stewardship data model. Select the campaign type as Resolution. You can assign the task to a particular steward or select “No Assignee” as shown below.

 

Talend Data catalog

Figure 4: Create Steward Task for approval

Create another job with components to connect to Talend Data Catalog using the REST API, as shown in Figure 5. Then select the terms approved by data stewards and add them to the Data Catalog glossary. We can also export all terms as a CSV file using the export API call, and update or add custom attributes to the terms. Finally, close the REST API connection and, if you want to, delete the task in Data Stewardship.

fetch approved terms by data Stewards

Figure 5: Job to fetch approved terms by data Stewards and create into Data Catalog Glossary

As shown in Figure 6, create a tRestClient connection to access the Data Catalog portal. Provide the correct URL, the HTTP method as GET, and the Accept Type as JSON. Provide the query parameters user, password, and forceLogin as “true”.

 

create a REST connection to data catalog

Figure 6: tRestClient component to create a REST connection to data catalog

 

Extract the JSON fields and map them to the “Session_token” column to store the approved connection token for future access, as shown in Figure 7.

 

Talend Data Catalog

Figure 7: Extract access token from response

 

Set the global variable with the access token and add another key, “id”, corresponding to the object_id of the glossary in Talend Data Catalog, as shown in Figure 8.

 

Talend Data Catalog

Figure 8: Set Global variable with glossary Object_id and access token

 

Stewards should select the tasks assigned to them and resolve them by clicking on the corresponding rows and validating their choices to approve or reject, as shown below in Figure 9.

Data stewards portal

Figure 9: Resolving terms by data Stewards in Stewardship portal

 

Select all the terms approved by stewards, either for a particular steward or for any assignee, which have State set to “Resolved” or a custom state such as “Ready to Publish”, using the tDataStewardshipTaskInput component as shown below in Figure 10.

 

Talend Data Catalog

Figure 10: Selecting resolved terms ready to publish in data catalog glossary

 

Use a tREST component to add a term to the Data Catalog glossary. As shown below in Figure 11, provide the correct URL with the API token parameter and the accept type as JSON.

 

Glossary Talend Data catalog

Figure 11: Adding term to glossary

 

Once you have added terms into the glossary you can either export them as CSV using a tREST component as shown in Figure 12, or you can add a custom attribute to a term as shown in Figure 13.

Glossary terms, data Catalog

Figure 12: Downloading glossary terms as CSV file

Attributes Glossary Terms

Figure 13: Setting attributes to a glossary term

 

In summary, the Talend Data Catalog REST API feature provides a lot of flexibility for businesses to populate business terminologies into the Talend Data Catalog glossary by various means, and a platform to incorporate data governance and regulatory compliance with the involvement of the right stakeholders and data stewards. This blog is a starting point for exploring ways to make use of the REST APIs for Talend Data Catalog.

The post How to create a business glossary on Talend Data Catalog using API Services and Data Stewardship appeared first on Talend Real-Time Open Source Data Integration Software.

Countdown to CCPA compliance: Top 10 data governance tips


In just a couple of days, the new year will be upon us and the California Consumer Privacy Act (CCPA) will be in full effect. This means, going forward, businesses that collect personal data from people who reside in the Golden State must honor their requests to access, delete, and opt out of the sharing or selling of their information.

 

Sound familiar? That’s probably because your organization is already on top of it and has been dedicating its resources to preparing for the January 1, 2020 compliance deadline. If the CCPA is news to you, however, then I’d advise you contact your legal and IT departments immediately if you have any pressing questions.

Otherwise, this blog will serve as a refresher, highlight the main points about the CCPA, and count down the top ten data governance tips that will make CCPA compliance easier to follow for everyone in your organization.

Recap on the CCPA

Similar to 2018’s General Data Protection Regulation (GDPR), the CCPA will ensure Californians similar data privacy protections already offered to European consumers. However, this mandate also requires businesses to have the impetus to not only update their privacy policy protocols but to also cater to their consumers’ data privacy preferences and requests.

Who is affected

The CCPA applies to any for-profit entity that:

  1. does business in the State of California
  2. collects, shares, or sells California consumers’ personal data (or does so through third parties)
  3. solely or jointly with others determines the purposes or means of processing of that data

Twenty-five-million-dollar club

More specifically, the CCPA targets larger businesses that fall into any of the following categories:

  1. generate an annual gross revenue in excess of $25 million
  2. annually buy, receive, sell, or share the personal information of 50,000 or more consumers for commercial purposes
  3. derive 50 percent or more of their annual revenues from selling consumers’ personal information

Organizations affected by the CCPA are required to inform the consumers of how their data will be used. Not only that, the handling of consumer requests must be done in a nondiscriminatory way and the requested information must be delivered to consumers in a reasonable time and free of charge.

Bracing for impact

According to the California Department of Finance’s Economic and Fiscal Impact Statement, up to 400,000 businesses may be impacted. An estimated 9,776 jobs will be lost, and the total statewide dollar costs that businesses and individuals may incur to comply with this regulation over its lifetime could reach around $16.4 billion.

Enforcement of the CCPA will begin as of July 1, 2020, and violations could result in millions of dollars of fines for companies. Civil penalties can range from $2,500 to $7,500 per violation, including up to $750 per individual California resident afflicted by a data breach, or the cost of damages caused by the same.

Although the economic impact is significant, the benefits far outweigh the financial burdens: the regulations benefit California residents by implementing the CCPA. They provide clear direction to businesses on how to inform consumers of their rights and how to handle their requests, making it easier for consumers to exercise those rights. They also provide greater transparency on how businesses collect, use, and share consumers’ personal information (PI).

 

The final countdown

Although consumer data is the focus of this new law, data governance is at the heart of ensuring that CCPA compliance is carried out effectively throughout any organization.

So, without further delay, here is the countdown of the top ten data governance tips that we hope will allow your organization to better adapt to CCPA compliance.

10. Align your data governance standards with your updated Privacy Policy

In addition to making sure your Privacy Policy complies with CCPA rules, take the necessary measures to adjust your data governance practices to account for the increased information requests from consumers and compliance queries from auditors.

 

9. Know how to respond to requests for “who, what, & why” to their data collection

Make sure business users as well as IT teams know what happens to consumers’ personal data or personally identifiable information when it’s shared within the business. Define the policies for its users, such as anonymization and ownership.

 

8. Make the option to delete data easy for everyone

One of the biggest problems with data quality is when customer data is either duplicated in or missing from multiple data sources. Deduplication tools can help assure that when customers want to erase their records, it can be done effectively.

 

7. Use data lineage to track data processing history

History answers many questions. Tracking the “who, what, where, when and why” for your consumer’s data will make it easier to fill missing gaps and account for data breaches.

 

6. Keep your data mapping accurate through a data catalog

Organizations need to map data correctly to create and maintain a holistic data inventory of the Personally Identifiable Information (PII) they have stored and processed. A data catalog will be helpful when addressing requests from California consumers to access or delete their PII.

 

5. Data preparation is key  

With data preparation tools and machine learning, you can quickly identify errors, apply rules to massive datasets, and reuse and share consumer data when requested. In a recent benchmark survey that questioned how well companies met GDPR compliance, one of the main reasons companies failed to comply was the lack of a consolidated view of data and clear internal ownership over pieces of data. The most effective way to resolve this data disparity is by having appropriate data preparation tools that prepare consumer data in such a way that it will be quickly and easily retrievable.

 

4. Honor opt-out decisions holistically and take customers’ new rights seriously

Giving consumers the option not only to access and delete their personal data but also to deny a company the ability to sell their personal information to third parties are fundamental rights afforded by the CCPA. It is imperative that all systems within an organization reflect these decisions accurately to avoid penalties for violations. Aside from facing steep fines, organizations risk losing their consumers’ trust and tarnishing their reputation.

 

3. Appoint data stewards and a chief data officer

Enforcing the rules for data protection across all the systems cannot be done by a single person. Collaborative data stewardship that is empowered by self-service apps will be critical in successfully supporting this self-service approach and in fostering accountability across all stakeholders. Contrary to GDPR, the CCPA doesn’t mandate the naming of a data protection officer (DPO). However, appointing a chief data officer who is accountable for compliance and can act as the change agent to engage the rest of the company is key to successfully internalizing and complying with the new law.

 

2. Leverage APIs for efficient notice communication

A primary element stipulated in the CCPA is the presentation of a Notice to Consumers which informs consumers of the categories of personal information to be collected from them and the purposes for which the categories of personal information will be used. With the prevalence of mobile devices as a means of consumer interactions, this important notice can be most effectively served through API integrations with privacy policy programs.

 

1. Bring all the consumer data into a data lake

Bringing together consumer data from across disparate data sources into a cloud-based data lake will achieve a single source of truth where all consumer data can be referenced, reconciled and linked to their provenance. Big data management tools can help to populate it. Data quality tools can match disparate data from various sources. Data governance and stewardship can prepare, cleanse, de-dupe and create a 360 view of consumer PII for quick access, modification or deletion upon request.

 

The disruptive nature of this law makes it a hallmark for US privacy laws; a first of its kind. Although the GDPR did come first, the CCPA may well be the catalyst that leads to similar laws being written in the near future for the sake of privacy and in the interests of residents of other states and countries as well.

To learn about how one of our partners was able to help their customer prepare for the CCPA, check out this webinar.

Happy New Year!

 

For your data privacy, CCPA, and GDPR compliance needs, Talend has a solution fit for your organization.

The post Countdown to CCPA compliance: Top 10 data governance tips appeared first on Talend Real-Time Open Source Data Integration Software.

Data privacy hidden gems in Talend Component Palette: Part 2


Data privacy is becoming the main buzzword in technical circles day by day. Some time back, we thought that the illegal gathering of personally identifiable information from data servers could happen only in James Bond and Mission Impossible movies. But technology is changing quite rapidly, and in this era of global virtual connectivity, customers’ private information is becoming more and more insecure. News of customer data being misused by data analytics companies, data theft from major banks, and the like is no longer a front-page headline on news channels.

 

Privacy protection policies 

The growing outrage against these data thefts has forced lawmakers to think about data privacy laws. The European Union introduced the most famous data privacy law, the General Data Protection Regulation (GDPR), and more and more countries and states are creating similar laws to protect their citizens’ data privacy rights. Some of the other popular data privacy acts are the California Consumer Privacy Act (CCPA), created by the State of California and implemented just a few days ago, and the Personal Information Protection and Electronic Documents Act (PIPEDA), created by Canadian lawmakers. We are seeing more and more countries moving in this direction to safeguard the private and confidential data of their citizens.

Meeting the privacy demand

The increasing demand for data privacy has forced software vendors to bring more and more functionality to address the concerns of software developers. Today we are going to discuss some of the hidden gems available in the Talend Component Palette to address concerns related to data privacy. This is the second part of the Hidden Gems blog series; if you have missed the first part, please refer to the link here.

 

Talend Data Privacy Components

Talend has created an array of data privacy components to handle concerns related to data privacy. Broadly, we can classify the components into the categories below.

  • Data Encryption and Decryption
  • Data Masking and Unmasking
  • Data Shuffling
  • Data Duplicate Row Generation

Talend Component Palette Data Privacy group

 

In the subsequent sections, we will take a quick glance at each of these categories and the Talend components available under each of them. The diagram below shows the full list of components available in the Talend Component Palette under the Data Privacy group.

talend components

 

Data Encryption and Decryption

Encryption comes in handy when you are handling confidential and sensitive information which needs to be stored in the most secure manner. In the Talend world, the original data can be converted into unreadable cipher text by the tDataEncrypt component, and the original data can be retrieved using the tDataDecrypt component.

 

Talend allows developers to select one of the below cryptographic methods for encryption:

  • AES-GCM
  • Blowfish
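To give a feel for what the component does conceptually, here is a minimal plain-Java sketch of AES-GCM using the standard javax.crypto API; it is only an illustration of the cipher method, not the internal implementation of tDataEncrypt.

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class AesGcmSketch {
    public static void main(String[] args) throws Exception {
        // Generate a 256-bit AES key (in practice the key is managed, not generated per run)
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        // GCM needs a unique initialization vector for every encryption
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        // Encrypt a sample value such as a date of birth
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] cipherText = cipher.doFinal("1984-07-21".getBytes("UTF-8"));

        // Decrypt with the same key and IV to recover the original value
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        System.out.println(new String(cipher.doFinal(cipherText), "UTF-8"));
    }
}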

 

A simple example to demonstrate the encryption and decryption capabilities of Talend is shown below.

 

The input data containing name, postal code, and date of birth is transferred from the input component.

 

Required columns are selected in tDataEncrypt as shown below.

The output data after encryption will be transferred to the file as shown below.

If the developer would like to convert the encrypted data back to its original format, they can easily do so using the tDataDecrypt component, as shown in the job below.

Talend component tDataDecrypt

Please refer to the links below to understand more about the encryption and decryption components available in Talend.

Data Privacy Requirement – Talend Component:

  • Data Encryption – tDataEncrypt
  • Data Decryption – tDataDecrypt

 

Data Masking and Unmasking

Data masking is the process of hiding the original data with random characters or figures, using functional substitutes to protect the actual sensitive data. This process is used to conceal the original confidential data while doing activities like data testing, user training, etc. It is widely used when the Talend developer needs to handle personally identifiable information like customer name, address, email, phone number, or SSN, or financial information like credit card number, salary, etc.
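As a rough illustration of the idea (this is a hand-rolled sketch, not how the Talend components are implemented), masking often means substituting characters while preserving the format of the value:

import java.util.Random;

public class MaskingSketch {
    // Replace every digit with a random digit and every letter with a random letter,
    // keeping separators such as '-' or '@' so the value still looks plausible
    public static String mask(String value) {
        Random random = new Random();
        StringBuilder masked = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (Character.isDigit(c)) {
                masked.append((char) ('0' + random.nextInt(10)));
            } else if (Character.isLetter(c)) {
                masked.append((char) ('a' + random.nextInt(26)));
            } else {
                masked.append(c);
            }
        }
        return masked.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("4111-1111-1111-1111")); // e.g. a random but well-formed card number
        System.out.println(mask("jane.doe@example.com"));
    }
}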

 

Talend helps developers perform data masking with two pairs of components:

  • tDataMasking and tDataUnmasking components to perform masking operations on heterogeneous input data. A simple example for this category of operation is shown below.

Talend Component Palette Data Privacy

 

  • tPatternMasking and tPatternUnmasking components, which replace pattern-specific and generic data with random characters from a specified range of date and numeric values or a set of named values. A simple example of pattern-based processing for phone numbers is shown below.

Talend Component Palette Data Privacy

 

Now, let us discuss each of these functionalities with an example. The first scenario is related to the tDataMasking and tDataUnmasking components, where customer information such as credit card number, name, and email is masked.

Talend Component Palette Data Privacy

 

The data is transmitted from an input file as shown below.

Talend Component Palette Data Privacy

 

Data masking is done using the tDataMasking component as shown below.

Talend Component Palette Data Privacy

 

The output after data masking will be generated as shown below. All the input records will have the ORIGINAL_MARK column set to true, and all the modified records will have this column value set to false.

Talend Component Palette Data Privacy

 

The data can be unmasked in a similar way using the tDataUnmasking component, as shown in the Talend job below.

Talend Component Palette Data Privacy

 

The second scenario is related to the tPatternMasking and tPatternUnmasking components, where we will be masking and unmasking phone numbers in a specific pattern.

Talend Component Palette Data Privacy

 

The sample input data component contains phone numbers as shown below.

Talend Component Palette Data Privacy

 

The masking pattern for phone numbers can be created using the tPatternMasking component as shown below.

Talend Component Palette Data Privacy

 

The masked output will be stored to a file as shown below. All the input records will have the ORIGINAL_MARK column set to true, and all the modified records will have this column value set to false.

Talend Component Palette Data Privacy

 

The unmasking of the data is done using the same methodology; the only change is to use the tPatternUnmasking component.

Talend Component Palette Data Privacy

 

To understand more about the masking components available in Talend, please refer to the links below.

Data Privacy Requirement – Talend Component:

  • Data Masking – tDataMasking
  • Data Unmasking – tDataUnmasking
  • Pattern Masking – tPatternMasking
  • Pattern Unmasking – tPatternUnmasking

 

Data Shuffling

Data shuffling is the process of moving sensitive information available in a column from one row to another by shuffling it. This method is widely used to quickly create data sets for testing purposes. The tDataShuffling component in the Talend Palette helps Talend developers perform data shuffling.
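Conceptually (again, a hand-rolled sketch rather than the component's actual logic), shuffling detaches the values of a sensitive column from their original rows while keeping the overall set of values intact:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ShufflingSketch {
    public static void main(String[] args) {
        // Hypothetical sensitive column: one credit card number per row
        List<String> creditCards = Arrays.asList(
                "4111-1111-1111-1111",
                "5500-0000-0000-0004",
                "3400-0000-0000-009");

        // Shuffle the column so each value ends up on a different row,
        // while the set of values in the column stays the same
        Collections.shuffle(creditCards);
        System.out.println(creditCards);
    }
}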

A quick example of the data shuffling method using Talend components is shown below.

Talend Component Palette Data Privacy

The input data is loaded from a file as shown below.

Talend Component Palette Data Privacy

 

The data in the content column is as shown below.

Talend Component Palette Data Privacy

 

The data shuffling is done using the tDataShuffling component as shown below.

Talend Component Palette Data Privacy

 

In this example, the credit_card column is considered the first group to shuffle and is allocated group id 1. Similarly, the lname, fname, and mi columns are grouped together under group id 2.

In many cases, we would like to shuffle the data within a specific partition. In this example, the data shuffling happens based on the country column.

Talend Component Palette Data Privacy

 

The shuffled data will be loaded to the output file, and a quick review of the data is shown below.

Talend Component Palette Data Privacy

 

Please refer to the link below to understand more about the shuffling component available in Talend.

Data Privacy Requirement – Talend Component:

  • Data Shuffling – tDataShuffling

 

 

Data Duplicate Row Generation

Duplicate row generation is performed to quickly create sample data for data quality checks and functional testing. The tDuplicateRow component in the Talend Palette helps to generate duplicate records based on the criteria specified in the component, which can then be used for further data processing.
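A trivial sketch of the idea (not the component's logic) is to emit each input row once with an original flag set to true, followed by a configurable number of copies flagged as false:

import java.util.ArrayList;
import java.util.List;

public class DuplicateRowSketch {
    public static void main(String[] args) {
        // One hypothetical input row: id, first name, last name
        List<String[]> input = new ArrayList<String[]>();
        input.add(new String[] { "1", "John", "Doe" });

        // For every input row, keep the original (flag true) and add two duplicates (flag false)
        List<String[]> output = new ArrayList<String[]>();
        for (String[] row : input) {
            output.add(new String[] { row[0], row[1], row[2], "true" });      // ORIGINAL_MARK = true
            for (int i = 0; i < 2; i++) {
                output.add(new String[] { row[0], row[1], row[2], "false" }); // ORIGINAL_MARK = false
            }
        }
        System.out.println(output.size() + " rows produced from " + input.size() + " input row(s)");
    }
}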

A simple example for tDuplicateRow is as shown below.

Talend Component Palette Data Privacy

 

Input records are loaded from a file; there are 6 input records in the file.

Talend Component Palette Data Privacy

 

The configuration rules to generate duplicate rows will be specified in the tDuplicateRow component as shown below.

Talend Component Palette Data Privacy

 

 

The output will be printed to the console using tLogRow, and you can see that the data was replicated into multiple output records. The original input records can be recognized by the ORIGINAL_MARK column, whose value will be true for these records. All the records generated by the component will have ORIGINAL_MARK set to false.

Talend Component Palette Data Privacy

 

Please refer to the link below to understand more about the duplicate row generation component available in Talend.

Data Privacy Requirement – Talend Component:

  • Duplicate Row Generation – tDuplicateRow

 

Conclusion

Talend data privacy components enable Talend developers to handle customer-sensitive data with more confidence. Gone are the days when you used to write a lot of custom code to complete activities related to data privacy. You can easily achieve the same functionality in Talend with its signature graphical user interface. Till we meet again for another blog topic, enjoy your time using Talend 😊

 

 

The post Data privacy hidden gems in Talend Component Palette: Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

From GDPR to CCPA, the right to data access is the Achilles' heel of data privacy compliance and customer trust – Part 1


This blog is the first of a series dedicated to Data Subject Access Requests (DSARs) and their importance in regaining customer trust.

 

In December 2019 we released the second edition of our data privacy benchmark, and this year again the results are shocking: 18 months after GDPR came into force, 58% of surveyed companies are failing to deliver on data privacy. The issue relates to the right of access, which gives individuals the right to obtain a copy of their personal data. This is worrying for companies since:

  • regulators reported that Data Subject Access Requests (DSARs) make up many of the complaints they receive and have started to deliver fines accordingly;
  • failing to meet this requirement directly and negatively impacts customer relationships.

 

Targeting the companies that failed to respond in our previous survey one year ago, we’ve also shown that only 32% of the organizations that failed in 2018 have since fixed the issue. Although this shows progress, it also hints that meeting this requirement might be tougher than expected, while many organizations might be overwhelmed with those requests to the point that they fail to deliver in a professional way and on time. Organizations are struggling to operationalize data privacy properly; however, they are taking it very seriously, according to a recent LinkedIn survey which shows that Data Privacy Officer is the fastest-growing job across Europe, together with Artificial Intelligence Specialist.

 

In this first blog post, I'll explain what a DSAR is and why it is so important for organizations.

 

What is DSAR?

Most data privacy regulations – such as GDPR in Europe, CCPA in California, PIPEDA in Canada, PDPA in Singapore, LGPD in Brazil, DPA in the Philippines, and PoPI in South Africa – include rights of data access that give individuals control over their personal data. Individuals, referred to as data subjects (depending on the regulation, these can be consumers, stakeholders in a B2B relationship, employees, members…), can make a subject access request verbally or in writing. Organizations have a limited time to respond to the request (one month under GDPR, 45 days under CCPA, 15 days under LGPD…) and generally cannot charge a fee to fulfill it.

 

Why should companies care about Data Subject Access Rights?

There is a misconception that data privacy is only a matter of governance, risk, and compliance. Indeed, with the rise of regulations such as GDPR and CCPA, and the record fines they can trigger in case of violations, data privacy has now caught the attention of executives.

 

But, first and foremost, data privacy is about customer relationships and digital transformation: the real challenge is to use customer data efficiently while protecting it according to each customer's privacy preferences. Organizations should not only care because it is regulated, but also because their customers urge them to do so. A Pega survey has shown that a whopping 82 percent of EU consumers welcome their new data privacy rights, including the right to know what personal data organizations hold about them and to have control over it.

 

Dealing with the surge

Customers are exercising their data access rights and, as a result, organizations are facing a surge of requests, and will face even more when CCPA enters into effect in 2020: a Capgemini survey indicates that one third of organizations have received more than 1,000 requests (including 50% of US organizations). For its part, the ICO, the UK regulator, revealed that 46% of the GDPR complaints received so far are linked to this topic.

 

Following the complaints, fines are appearing as well. In the highest fine delivered so far by the German data privacy regulator, an online food delivery service was fined for non-observance of the rights of data subjects. In addition, the Austrian data privacy activist NGO NOYB (None Of Your Business) has filed a wave of complaints against 8 streaming companies, including Amazon Prime, Apple Music, SoundCloud, Spotify, and YouTube, along with 3 smaller businesses. Finally, we are only at the beginning of a wave of collective actions, where groups of customers might sue organizations that fail to respect their privacy rights, including answering their data access requests. While GDPR allows this to some extent, CCPA might bring it to a whole new level in a litigious country where class actions have become a common phenomenon.

 

What’s also important to note is that DSARs are expensive, time-consuming, and involve really hard data crunching work unless organizations streamline the fulfilment process and automate it with modern data management technologies such as a data catalog, data integration, and privacy centers. Gartner estimates that a DSAR costs an average of $1,406 and that only 15% of organizations can fulfill one in less than a week. With the surge of DSARs, the related costs are exploding, and you can no longer handle them as an ad hoc, manual process.

 

The second blog post will focus on the customer side and how painful the DSAR process can be. 

For more information about achieving global data privacy compliance, read this practical guide on data privacy compliance.

The post From GDPR to CCPA, the right to data access is the achille’s heel of data privacy compliance and customer trust – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.


Talend’s Next Chapter


 

Today we open a new chapter at Talend, in which we begin our journey from a $250M company to a $1 billion cloud market leader. Over the last six years, I’ve been honored to help build and lead the team that brought Talend from a $50M startup, through its IPO in 2016 to become a quarter-billion-dollar company. Together, we built one of the fastest-growing cloud businesses in the world. Now it’s time for me to take a step back and welcome our new leader who will help propel Talend through its next growth phase. As of today, I’ll be handing the reins over to Christal Bemont, who will become Talend’s new CEO.

 

I’m thrilled to welcome Christal to Talend. She joins us from SAP Concur, where she was most recently its CRO and responsible for leading the global go-to-market team that grew its business into a multi-billion-dollar cloud market leader. Prior to being CRO, she spent 15 years at Concur in an expanding series of sales and leadership positions. During that time, she was instrumental in shaping the company’s go-to-market strategy, growing its largest global clients, and leading multiple sales teams to success at scale. She brings exceptional leadership skills along with unique expertise in scaling a large cloud business. This unique leadership experience is exactly the skill set we need to grow Talend into the $1B cloud market leader that I know we can become.

 

To win the lion’s share of the exploding cloud data market, we also need to double down on our customer-first strategy. To spearhead that effort, we’re bringing in Ann-Christel Graham to the newly-created position of Chief Revenue Officer and Jamie Kiser as Chief Customer Officer. AC and Jamie worked closely together with Christal at SAP Concur and bring significant cloud industry go-to-market expertise.

 

What does this new phase mean for our customers and partners? Talend has come this far in large part because our mission has always been to help our customers eliminate roadblocks in their journeys to becoming data-driven. As more companies migrate to the cloud, we believe those challenges will grow and become more complex. After all, data is everywhere: it’s being generated by every system that companies use to power their businesses, and it’s being collected at every customer touchpoint. Properly deployed, data can redefine a company’s fortune and future, but it’s often inaccessible, frequently bad, and in the case of customer data, extremely risky to manage.

 

As we pursue our next phase of growth and cement our leadership position, we’ll continue developing a dynamic roadmap that removes the roadblocks data-driven customers face. As always, we’ll play a vital role in ensuring the data that companies use to make critical business decisions is accessible, clean, and compliant—and we’ll help companies do so wherever they are in their data journey. As we have over the past several years, we’ll also work to ensure we’re delivering not just the best product set for our customers, but also the best service and support capabilities worldwide.

 

We believe the cloud will drive the majority of our future growth, and the progress we’ve made in the past year positions us for unprecedented success. Our cloud business continues to grow well over 100%; it’s now over half of what we sell and we’re solving our customers’ most demanding cloud data requirements. We have an incredible team, a great product set and, in Christal, Ann-Christel and Jamie, new leaders with the expertise ideally suited to help us meet our goals. Our new team members have the go-to-market, cloud and leadership skills needed to tackle the challenges ahead. As a board member, I’ll continue to participate in Talend’s business and welcome you to stay in touch and let me know how your data journeys are progressing. I’m sure you all share my excitement and optimism for the future of the company and will welcome Christal, Ann-Christel, and Jamie in their new roles.

 

The post Talend’s Next Chapter appeared first on Talend Real-Time Open Source Data Integration Software.

Like the Infinity Stones, keep your Talend services as far apart as possible


Ok, so we all probably know why keeping all Infinity Stones in one place is a bad idea, right? You must now be wondering what the relationship between Infinity Stones and Talend could be. Worry not, Thanos isn’t coming and there is a reasonable explanation behind the MCU fandom references, I promise.

I like to use this analogy to emphasize how distributing your Talend services among different servers according to Talend recommended architecture guidelines is as important as keeping the Infinity stones scattered across the universe and away from the clutches of evil. Are you following me?

 

In this blog, we will explore some common customer questions which come up during Talend architecture discussions and our answers to those queries.

Most Common query during planning stage

One of the most common queries I used to get while interacting with customers:

What should be the right methodology to allocate various Talend services in our ecosystem? Shall I give you one gigantic server which can handle all our computational requirements?

After going through this blog, I am sure you will be able to answer this query yourself. Before going into the details, we need to acknowledge that Cloud technology has given customers the power to select any type of server quickly and get it up and running in a matter of minutes. This means that, theoretically, you could put all the Talend services on a single server for demonstration purposes. But is it the right approach? Before making the final call, we will discuss the various factors surrounding this scenario.

 

Going back to Monolithic server approach

The world had already moved away from monolithic systems when I started my IT career (which doesn’t mean I am too old or too young 😉). One of the tendencies we are seeing is that some users are trying to go back to the same monolithic patterns by taking a different route. This time, the monolithic server concept is carefully wrapped in gift paper in the form of a single high-computation server available in various Cloud environments. Customers often overlook the fact that those servers are offered in the Cloud for specific high-computation use cases like graphics processing, scientific modeling, dedicated gaming servers, machine learning, etc.

Talend services

 

From the Talend perspective, it is always ideal to distribute the various services (remember the Infinity Stones) across separate servers as per the recommended architecture.

 

Keeping all your eggs in one basket

For argument’s sake, let’s put all the Talend services for a Production environment on a single server. When a server failure occurs, this approach brings down the entire Talend ecosystem.

 

This can be a costly affair, especially if the enterprise has specific data transfer SLAs between various systems. But if you distribute the Talend environments according to the recommended architecture, you can manage these types of failures gracefully.

 

The battle for system resources

Still not convinced? Then I will take you to the next phase, where the battle to grab system resources takes place. The basic scenario remains the same: you have installed all the services on a single server.

 

Imagine you are going to pump a lot of data from source to target in batch mode, and that multiple transformation rules need to be applied to the incoming data. This means your Job server will be ready to grab a lot of the available “treasure” (which is nothing more than system resources like CPU, memory, disk space, etc.). At the same time, you need to make sure that system resources remain available for other services like the TAC, Continuous Integration systems, and Runtime servers.

The tug-of-war for system resources will eventually lead to a big battle among the various services. For example, let’s assume the TAC fails due to lack of memory. This means you have lost all control over the ecosystem and there is nobody left to manage the services. If the victim is the Runtime server, your data flows through various web services and Routes will start failing.

At the same time, if you use a single gigantic server, you may not use its entire computational capacity all the time. It would be like gold-plating all of your weapons: the result is an excessive cost to maintain the underlying infrastructure.

 

Refreshing our minds with Talend Architecture

I hope by now all of you are convinced by the rationale for not keeping all the Infinity Stones (read: Talend services) on a single server. Before going further into the details of the recommended architecture, let us quickly refresh our minds about the various services involved in the Talend architecture. I will start with the On-premises version. The diagram below will help you understand the various services involved in handling both batch and streaming data. If you would like to understand more about each service, you can get the details from this link.

Talend Architecture

 

The Talend Cloud Architecture simplifies the overall landscape. You need to remember that you may still have to manage Remote Engines (either for Studio jobs or Pipeline Designer), Runtime Servers, Continuous Integration related activities etc.

Talend Architecture

 

If you would like to know more about Talend Cloud Architecture, I would recommend you have a look at the Cloud Data Integration User Guide.

 

Talend Recommended Architecture

A detailed description of the Talend recommended architecture (including server sizing) for On-premises products can be found at this link. I am not going to repeat that content, but I would like to show a high-level view of the Talend recommended server layout for your quick reference.

Talend Recommended Architecture

 

The recommended Cloud architecture layout is much simpler, since the Talend services are managed from the Cloud environment. You can refer to the recommended Talend Cloud architecture at this link. A quick peek at the server layout in the case of Talend Cloud is shown below.

Talend Cloud Architecture

 

I hope this discussion on the “Infinity Stones” of Talend was as interesting for  you as it was for me 😊. Until I come up with another clever analogy to write a blog around, enjoy your time using Talend and keep those “stones” safe!

 

The post Like the Infinity Stones, keep your Talend services as far apart as possible appeared first on Talend Real-Time Open Source Data Integration Software.

From GDPR to CCPA, the right to data access is the achille’s heel of data privacy compliance and customer trust – Part 2


This blog is the second of a series dedicated to Data Subject Access Requests (DSARs) and their importance in regaining customer trust.

 

 

In the first part of this series, I explained what a DSAR is and why organizations should care about it. Now, let’s take a look at how the process can be perceived by customers. Our recent GDPR benchmark research shows that the road can be tortuous.

A bumpy ride for customers

There has always been a significant gap between how organizations believe they perform in terms of customer experience and their customers’ perception. Data privacy is no exception: among the companies we surveyed, 93% proudly claim in their privacy legal notice that they will fulfill customer requests for data access on time, and they document the procedure for triggering such a request. But in most cases, following the instructions reveals a huge execution gap, highlighting organizations’ failure to meet their own (legal) promises.

Through this execution gap, not only do organizations put themselves in an embarrassing situation, giving formal evidence of non-compliance, but they also expose their inefficiencies right in front of their customers, negatively impacting their reputation and customer loyalty.

 

Shortfalls and inefficiencies

Fulfilling a DSAR proves to be a long and winding road. Our benchmark has shown that inefficiencies often start as soon as a DSAR is sent. Only a few of the surveyed organizations sent a notification to confirm that they had received the request and were taking care of it. Then, only 20% of organizations had a formal process for an identity check. This is a serious issue, as other research has highlighted that privacy regulations, when not properly implemented, can open security breaches that enable identity theft.

Although we clearly stated that our request was related to data access and portability, it created confusion for some. Some companies asked us to explain what we meant by that, and even what GDPR was! Some inadvertently assumed that we were requesting a right to erasure and started to delete our personal data. All those failures show the lack of a well-defined process, or the difficulty of running it with properly trained staff. All this severely hurts customer trust.

 

Broken experiences

Other failures were related to broken experiences, with some organizations redirecting us to other channels or to additional steps requiring our action, and then missing the follow-ups. Some organizations seemed to make the process intentionally tedious, such as a public organization fulfilling the request through printed pages physically available at their local agency, which was only open during our office hours. Some companies asked for a range of personal data as a prerequisite to fulfilling our request (ID, loyalty number, birthday, transactional data…) and then we never heard back from them once the ball was in their court.

Others sent incomplete data, like the insurance company that forgot to mention half of the requester’s active contracts. Would you trust such a company to insure your life, wealth, and other members of your family when it can’t even find the open contracts you have with it? Most companies seem to ignore the fact that the customers who exercise their right of data access are those who care the most about their privacy, and who are therefore likely to leave the company or share their experience with their peers when their request is not properly fulfilled. The impact on the brand is even bigger when those bad experiences are shared on social media.
We all know that delivering a customer 360 view can be tricky, but DSARs enable organizations to make the process transparent to their customers and to the regulators.

 

Delayed response

In addition, delay matters. Our survey has shown that many companies struggle with the one-month deadline set by GDPR, either failing to respond or responding late. Those that succeeded responded with an average delay of 16 days. Although this is good enough from a legal perspective, think about it from a customer perspective: in the digital age, a response time beyond two weeks feels like a decade!

Not to mention, our research has shown a correlation between the speed and the quality of the answer to a DSAR: overall, the 50% of companies that were able to fulfill the request in less than 16 days tended to deliver a better outcome than the other 50% that fulfilled it with longer delays.

 

From the eye of the customer

One takeaway from this survey is that you can’t hide your weaknesses from your customers when answering a DSAR. When it is processed in an ad hoc way, or in a way that is not customer-centric, your customers will see it. It might not result in a fine, but it hurts your customer relationship one way or another.

That’s why our recommendation is to look at this process through the eyes of the customer, beyond the pure compliance side. Data governance processes should be defined with customers’ expectations in mind.

 

The third and final blog post will focus on the keys to succeed.

For more information about the power of compliance on customer experience, read the Air France KLM story.   

 

 

 

The post From GDPR to CCPA, the right to data access is the achille’s heel of data privacy compliance and customer trust – Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

From GDPR to CCPA, the right to data access is the Achilles’ Heel of data privacy compliance and customer trust – Part 3


This blog is the third and last of a series dedicated to Data Subject Access Requests (DSARs) and their importance in regaining customer trust.

 

In the first and second blog posts, we explained the importance of DSARs, as well as how the customer experience can be impacted if the process is not well managed. In this last part, we will go through a few tips that could help you become a DSAR champion!

 

data access rights

 

 

How to succeed with DSARs

Although our GDPR benchmark research highlighted a low level of maturity with respect to data privacy, it was also extremely useful in highlighting how best-in-class companies differentiate themselves.

 

Integrate DSAR in your customer experience

To the data subject, a DSAR appears as a customer service rather than a legal procedure: best-in-class companies understand that privacy matters to their customers and that answering DSARs creates a differentiated customer experience. Once their requests are answered, customers are reassured of their suppliers’ ability to protect their personal data and use it only for the right purposes.

 

Set up a workflow management

DSARs are managed as a workflow: when a request is not addressed in real time (which only a very small fraction of organizations do), the data subject is informed of the progress of his or her request and of the current step in the process. When the data subject needs to provide additional information, for example for an identity check, he or she is informed clearly and reminded if he or she doesn’t answer in time.

 

Make it frictionless and smooth

Data subject data is rendered in a meaningful and easy-to-consume way. Whether the data is rendered online through a portal or as an electronic file, best-in-class companies not only deliver personal data in a complete and understandable format, they also do it in a didactic way. This helps explain to the data subject why the supplier needs this data and how the data subject benefits from it.

 

Automate the process

The data collection process is automated, and best-in-class companies answer in a timely manner. As mentioned in part 2 of this series, 30 days might be good enough to comply with the regulation, but it is beyond what a data subject would consider a decent timeline. For a data subject, a DSAR sounds like a basic request. Why should it take so long for an organization to share personal data? Could this indicate a lack of transparency? A lack of control? Or slowness to enter the digital age? Best-in-class companies leave no doubt that they take privacy very seriously and address it in a professional and timely way.

 

Lessons learned from the best in class

One great example of how to address DSARs with a customer-centric approach is Accor. Customer experience is key in hospitality, and as a global leader in this industry, Accor has transformed its business around the idea of “Enhanced Hospitality,” with tailor-made services to anticipate the slightest wishes of its guests and make them experience moments of emotion. The customer loyalty program is so important to Accor that you can see the brand of its new program on the chests of some of the most globally known sports stars, such as Neymar, Mbappé, and the rest of the players of the Paris Saint-Germain soccer team.

Quick results

This type of personalization requires responsibility with data. So, when GDPR came into effect, Accor engaged in a modernization of its privacy program, leveraging its brand-new data lake and using Talend Data Fabric’s modern data management technologies, such as Talend Data Catalog and Talend Big Data, to discover, categorize, protect, locate, reconcile, and share personal data. As a result, they have been able to reduce the time it takes to answer a DSAR from 30 days to 6.
Accor’s data transformation program, with data privacy at its core, is further described in this success story.

Taking these first steps and putting data governance at the heart of your data strategy will help you master DSARs for GDPR, CCPA, and other data protection regulations that could come into force in 2020 and beyond.

For more information about GDPR and CCPA compliance, please visit: https://www.talend.com/solutions/data-protection-gdpr-compliance/

 

The post From GDPR to CCPA, the right to data access is the Achilles’ Heel of data privacy compliance and customer trust – Part 3 appeared first on Talend Real-Time Open Source Data Integration Software.

Talend on Talend: How to use machine learning for your marketing database segmentation


 

In today’s business world, marketing segmentation is a must-have for every organisation. It helps you divide the different targets in a market into multiple customer or prospect segments so you can sharpen your marketing actions. Through this discipline, you can hold a crucial competitive advantage over your competitors, because you can adapt your offer and your communication to the identified groups of personas you want to address.

 

Successful marketing segmentation optimises marketing efforts to produce more satisfied customers and therefore increases revenue and profitability. In fact, though, it is rare for companies that define personas to have that segmentation ready in their database. If you already have the data for each individual, you may want to tag them into bundles so you can derive strategies from your database faster. And that’s where Talend can help you accomplish your customised segmentation.

 

You may think that, nowadays, we live in a 1:1 personalised world and marketing shouldn’t care about bucketing business leads. Amazon does personalize its homepage based on your interests, right? Well, most companies may first start with a group-of-many approach before moving to individual personal customisation. It’s always better to learn to walk before running.

First of all, you have to define your segments. There are two typical methods available to you: carrying out market research or analysing your existing database. Combined, these two techniques allow a more precise identification of the segments, a better knowledge of the company’s targets, and therefore an optimization of the marketing campaigns and actions to be implemented.

 

As you write down your segments, keep in mind that they should not change over time. As you plan who your customers can be, make sure they won’t switch into new categories or segments within a short span. Be precise, but leave enough room. Secondly, think about your products and their use case for each of the segments you are listing. You want your segments to be relevant to your business cases.

 

Once you have your segments listed out, you can set the theory aside at the corner of a table. Now it is about setting up your machine learning actions.

In Talend’s case, we had job title data for the individuals in our database. The data was poor in quality and represented 600k+ distinct values in the database. It would have taken centuries to manually segment each of them into the buckets we had drafted with any decent accuracy. So, we started with the biggest volumes and assigned them to bundles to create a base where we could start supervising the learning of the machine. Once we had covered at least 5% of the database with almost 98% accuracy, we took this dataset into a semantic analysis approach.

 

We considered a syntactic approach, where the distance between instances (strings) is based on character-level similarity (e.g. apple, appeal), and a semantic approach, where the distance between instances (strings) is obtained with word embeddings that can capture the closeness between two semantically close words (e.g. apple, banana).
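To make the distinction concrete, here is a small, hypothetical Python sketch (not the code we actually used) contrasting the two distances: character-level similarity computed with the standard library’s difflib, and a semantic similarity computed as the cosine between word vectors. The toy vectors below are invented for the example; in practice they would come from pre-trained embeddings.

# Hypothetical illustration of syntactic vs semantic similarity;
# the toy embedding vectors below are made up for the example.
from difflib import SequenceMatcher
import math

def syntactic_similarity(a, b):
    """Character-level similarity in [0, 1] (e.g. 'apple' vs 'appeal')."""
    return SequenceMatcher(None, a, b).ratio()

# Toy 3-dimensional "embeddings"; real ones would be pre-trained vectors.
TOY_EMBEDDINGS = {
    "apple":  [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "appeal": [0.1, 0.9, 0.3],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm if norm else 0.0

def semantic_similarity(a, b, embeddings=TOY_EMBEDDINGS):
    """Embedding-based similarity (e.g. 'apple' vs 'banana')."""
    return cosine(embeddings[a], embeddings[b])

if __name__ == "__main__":
    print("syntactic apple/appeal:", round(syntactic_similarity("apple", "appeal"), 2))
    print("syntactic apple/banana:", round(syntactic_similarity("apple", "banana"), 2))
    print("semantic  apple/appeal:", round(semantic_similarity("apple", "appeal"), 2))
    print("semantic  apple/banana:", round(semantic_similarity("apple", "banana"), 2))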

 

This approach gave us a dictionary of words used in job titles, which we could use as an input to the machine learning model to start making predictions. As expected, this first approach had poor generalization capabilities. For that reason, we decided to implement an “active-learning”-inspired method and include a human expert in the learning loop.

 

The learning loop includes someone who validates or invalidates the predictions of the machine learning model, so accuracy improves over the multiple iterations performed. You validate, it predicts, and again, you validate, it predicts… We did this 6 times and achieved 80% accuracy on annotated data. So, in a short amount of time, we went from 3% of the database with almost total accuracy to 100% of the database segmented with 80% accuracy.
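As a rough illustration of that loop, here is a hypothetical scikit-learn sketch (our real pipeline, features, and personas were different): train on a small labelled seed, score the unlabelled job titles, send the least confident prediction to a human for validation, fold the answer back into the training set, and retrain.

# Hypothetical active-learning sketch (scikit-learn); the job titles and
# persona labels below are made-up examples, not Talend's real data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = [("vp of engineering", "IT leader"),
           ("data engineer", "Practitioner"),
           ("chief data officer", "IT leader"),
           ("etl developer", "Practitioner")]
unlabeled = ["head of data platform", "senior integration developer",
             "directeur des systemes d'information", "bi consultant"]

def ask_human(title):
    # Stand-in for the expert validation step; in the process described
    # above, a person confirms or corrects the model's prediction here.
    return input(f"Persona for '{title}'? ")

for iteration in range(2):                      # the blog describes ~6 such iterations
    texts, labels = zip(*labeled)
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vec.fit_transform(texts)
    model = LogisticRegression(max_iter=1000).fit(X, labels)

    if not unlabeled:
        break
    proba = model.predict_proba(vec.transform(unlabeled))
    confidence = proba.max(axis=1)              # how sure the model is per title
    worst = int(np.argmin(confidence))          # least confident prediction
    title = unlabeled.pop(worst)
    labeled.append((title, ask_human(title)))   # human validates, model relearns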

As of now, the more data we validate, the more accurate it gets. We are achieving almost 90% accuracy on a dataset that contains more than 600,000 unique values.

 

Now what? It is fantastic to have the predictions, but now you have to write the data back into your system. The idea was to take the job title information from our CRM, run the machine learning model, and push a cleaned job title plus the persona predictions back into the database. The easiest way we found to complete that task was with our own tool, Talend Pipeline Designer.

 

Talend Pipeline Designer is a lightweight data transformation tool that allows you to build data pipelines with a schema-ready user interface. It allows you to integrate any data — structured or unstructured — and design seamlessly in batch or streaming from a single interface. It connects easily to leading data sources such as Salesforce, Marketo, Amazon Redshift, Snowflake…

 

In the Talend Pipeline Designer Summer ’19 release, the connector component for Marketo arrived. We just had to use it to connect to our database and retrieve the job titles. Essentially, the pipeline is built like this:

You can also incorporate the active learning loop, with human validation of the predictions, into the pipeline by adding a flow that captures a certain number of records; there you can validate the predictions of the machine learning model as an output and decide whether you want to write the value into the connected CRM.
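For readers who prefer pseudo-code to screenshots, the overall flow just described might be sketched like this; read_job_titles_from_crm, predict_persona, needs_human_review, and write_persona_to_crm are hypothetical placeholders standing in for the Marketo connector, the trained model, the validation flow, and the pipeline output, not real Talend or Marketo APIs.

# Conceptual outline of the write-back flow; all functions below are
# hypothetical placeholders, not real Marketo or Pipeline Designer calls.
def read_job_titles_from_crm():
    """Placeholder: pull leads that have a job title but no persona yet."""
    return [{"lead_id": 101, "job_title": "Sr. Data Engineer"},
            {"lead_id": 102, "job_title": "VP Marketing"}]

def predict_persona(job_title):
    """Placeholder for the trained model from the previous section."""
    return "Practitioner" if "engineer" in job_title else "Business leader"

def needs_human_review(job_title):
    """Sample a subset of predictions for the human validation loop."""
    return hash(job_title) % 10 == 0   # e.g. roughly 10% of records

def write_persona_to_crm(lead_id, cleaned_title, persona):
    """Placeholder: in the blog's setup this is a Pipeline Designer output."""
    print(f"update lead {lead_id}: {cleaned_title} -> {persona}")

for lead in read_job_titles_from_crm():
    cleaned = lead["job_title"].strip().lower()
    persona = predict_persona(cleaned)
    if not needs_human_review(cleaned):
        write_persona_to_crm(lead["lead_id"], cleaned, persona)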

 

Once you have laid out your pipeline, you have to define who can get into it. This part is done in Marketo. We chose to push the prospects and customers who had a job title and a certain amount of basic information, but no persona segmentation in their details. Based on these criteria, it is quite easy to push these people into lists in your database that the Marketo connector for Pipeline Designer connects to. Then you have to set up a pace and a rhythm.

There were two different options: go batch or go stream. Batch allows you to write data to the database on a defined schedule, while streaming is a steady, continuous flow that writes data while the rest of the data is still being received. We chose batch mode, as we receive an inconsistent number of leads per day and week. It allows us to keep control of the model, whereas with a streaming mode we couldn’t manage how heavily we would load the system.

 

To ensure the best performance for your integrations, when performing inserts or updates, records should be grouped into as few transactions as possible. When retrieving records from a data store for submission, the records should always be aggregated prior to submission, rather than submitting a request for each individual change. We call updates for the persona predictions at a pace of 50,000 updates per week maximum.
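That grouping advice can be sketched in a few lines; this is a hypothetical example where updates are accumulated and submitted in chunks rather than one call per record. The bulk_update function is a placeholder for whatever bulk endpoint or connector you use, and the batch size of 300 is purely illustrative, not a limit we are asserting for Marketo.

# Hypothetical batching sketch; bulk_update stands in for a bulk endpoint
# or connector, and BATCH_SIZE is an illustrative value only.
BATCH_SIZE = 300

def chunked(records, size):
    """Yield successive batches instead of one request per record."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def bulk_update(batch):
    """Placeholder: submit one grouped transaction containing many records."""
    print(f"submitting {len(batch)} updates in a single call")

updates = [{"lead_id": i, "persona": "Practitioner"} for i in range(1, 1001)]
for batch in chunked(updates, BATCH_SIZE):
    bulk_update(batch)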

The results? I work in Marketing, and I can tell you, this is almost magical. With an extended dataset like ours, the machine learning model is smarter than you. It perceives slight differences in the words better than you actually can. It understands the similarities between words within job titles in different languages. The segments we drafted are filled with homogeneous types of people. From a business perspective, it is a game-changer for our Marketing organization, as it allows us to better understand our personas based on the analytics we ran on the groups of business leads that fell into each bundle.

 

Just to give a few examples, you can look at the differences in web traffic behavior among the personas, or at how the sales team is converting them into customers, so you can spot the cross-sell or upsell opportunities to work on. It is also a great asset for account-based marketing (ABM): you can target accounts and personas simultaneously and deliver highly personalized messaging based on the line of business and the person you are talking to.

 

With this, I hope you enjoyed reading about my approach to segmentation, and it would be great to hear if anyone has any thoughts or comments on this approach!

 

Credits: Sebastiao Correia, Raphaël Nedellec, Tarek Benkhelif, Thibaut Gourdel, Maedeh Afshari

 

This article was written by William Prunier and originally appeared on: https://www.linkedin.com/pulse/marketing-database-segmentation-using-machine-learning-prunier/

 

The post Talend on Talend: How to use machine learning for your marketing database segmentation appeared first on Talend Real-Time Open Source Data Integration Software.
