Channel: Blog – Talend Real-Time Open Source Data Integration Software

Air France-KLM: Change is in the air to delight customers with “made-just-for-me” travel experiences


Air France-KLM serves 90 million passengers per year and 27 million Flying Blue reward program members. The data explosion has been a clear game changer in the last few years, owing especially to social media: Air France-KLM has 16 million fans on Facebook and 3 million Twitter followers.

Travel as you like

Many airlines fly to the same destinations, so a key differentiator is the level of service you can provide. Customers will not ask their carrier to “make them travel,” but “to travel as they like”.

Air France-KLM’s biggest challenge was to re-centralize all client information to be able to share it with all Air France-KLM group agents. The company implemented a big data platform developed in-house on Hadoop.

Data Management inside

To create a 360-degree view, Air France-KLM introduced Talend to organize this data platform and address data management challenges.

The airline has implemented a complete data quality policy, using Talend Data Quality to verify its data; up to one million data entries are corrected per month with the tool. Data security is also a concern, since Air France-KLM processes clients’ personal data, which must be protected: data masking anonymizes part of the client data for analytical needs.

Air France-KLM also implemented Talend Metadata Manager to share information within the company, which now makes it possible to determine where data is located, where it comes from and where it is going, ten times faster than before.

360-degree customer vision for a better travel experience

“Our goal, by compiling all our clients’ data, is to reassure them throughout their entire trip with us,” explains Gauthier Le Masne, Chief Customer Data Officer. “We will start from home, where, by geolocating them, we will let them know how much time they need to get to the airport. When clients reach the airport, their gate may have changed; they will automatically receive a Facebook Messenger message informing them of the new departure gate.”

Each time you travel with Air France-KLM, the airline studies your history and preferences, and can also send you a deal on the next destination that might interest you.

Gauthier Le Masne concludes: “Our wish is to be systematically in contact with each client. It will involve face-to-face contact with our teams and their tablets, direct contact with our clients if they have the smartphone application, and, if they don’t have the application, connecting with our clients via social media.”

The goal is to have several tens of millions of unique experiences with this data platform.



What Exactly is Data Stewardship and Why Do You Need It?


What does your next data-driven project have to do with data stewardship? 

Well, actually, a lot if you want to get the most out of your data. Many companies today are filling the data lake with vast amounts of structured and unstructured data. But they tend to forget an important fact: on average, organizations believe that 32 percent of their data is inaccurate. Addressing this data quality issue before your data lake turns into a data swamp sounds like a must, not an option, right? That is where data stewardship comes into play.

Data stewardship is becoming a critical requirement for successful data-driven insight across the enterprise. And cleaner data will lead to more use, while reducing the costs associated with “bad data quality” such as decisions made using incorrect analytics.

What is Data Stewardship?

If you think of all the data you need to work with each day, you know that often it is incomplete and sometimes incorrect. You may be able to fix it since you know it, but that process does not scale when dealing with vast amounts of data, or when other groups “bring their own data” and know what it should look like. Also, let’s not forget that using email or Excel to resolve data quality issues one by one is not very efficient, not to mention the risks that come with the proliferation of uncontrolled copies of potentially sensitive data across file folders everywhere in the enterprise. You need purpose-built tools, processes and policies to effectively and sustainably manage data quality.

As a critical component of data governance, data stewardship is the process of managing the lifecycle of data from curation to retirement.  Data stewardship is about defining and maintaining data models, documenting the data, cleansing the data, and defining the rules and policies. It enables the implementation of well-defined data governance processes covering several activities including monitoring, reconciliation, refining, deduplication, cleansing and aggregation to help deliver quality data to applications and end users.

In addition to improved data integrity, data stewardship helps ensure that data is used consistently throughout the organization, and reduces data ambiguity through metadata and semantics. Simply put, data stewardship reduces “bad data” in your company, which translates to better decision-making and the elimination of the costs incurred when using incorrect information.

Traditionally, data stewardship tasks are assigned to a staff of data experts, the so-called data stewards. But the challenge is that there are few data stewards in a company and they are generally dedicated to high risk projects, such as regulatory compliance. In the absence of data stewards, nobody knows who is accountable for data quality, and that is what leads to a frustrating situation where organizations are fully aware that almost one third of their data assets are not accurate, but nobody acts on it.

Data Stewardship, Now a Team Activity

With more data-driven projects, “bring your own data” projects by the lines of business, and increased use of data by data workers such as data scientists, marketing and operations, there is a need to rethink data stewardship. Next-generation data stewardship tools need to evolve to support:

  • Self-service – so that any user from IT to the business can solve data quality issues in a controlled way 
  • Team collaboration – including workflow and task orchestration
  • Manual interaction – in the case of data arbitration or certification where human intervention is required to validate, certify, tag, or select a dataset
  • Integration with data preparation – defining a process for “bring your own data” 
  • Built in privacy – empowering the data protection officer and compliance teams to address new industry regulations for maintaining privacy such as GDPR (General Data Protection Regulation)

Introducing Talend Data Stewardship

With Talend Winter ’17, we are proud to launch a new capability, the Talend Data Stewardship app: a comprehensive tool for configuring and managing data assets that addresses the quality challenges holding your data-driven projects back.


Talend Data Stewardship is more than a tool for data stewards with specific data expertise: IT can use it to empower business users to curate their data through a point-and-click, Excel-like interface. With Talend Data Stewardship you can manage and quickly resolve any data integrity issue to achieve “trusted” data across the enterprise. With the tool, you define the common data models, semantics, and rules needed to cleanse and validate data, then define user roles, workflows, and priorities, and delegate tasks to the people who know the data best. Productivity improves across your data curation tasks: matching and merging data, resolving data errors, certifying, or arbitrating on content.

Delegating tasks that used to be done by data professionals, such as data experts, to the operational workers who know the data best is called self-service. It requires workflow-driven, easy-to-use tools with an Excel-like user experience and smart guidance. In this respect, Talend Data Stewardship uses the same user interface as Talend Data Preparation, and the tools are bundled together in a unified suite for self-service data access, preparation, integration and curation. While Talend Data Preparation empowers business users to get clean, useful data in minutes, not hours, in an ad-hoc way, Talend Data Stewardship orchestrates the collaborative work of fixing, merging and certifying data with self-service data curation. Much as office workers use Excel and Word together for office automation, data workers get access to these two tools with a consistent user experience and use whichever fits their use case.

Because it is fully integrated with the Talend platform, Talend Data Stewardship can be associated with any data flow and integration style that Talend can manage, so you can embed governance and stewardship into data integration flows, MDM initiatives, and matching processes.

Tools For Everyone

The core concepts of Talend Data Stewardship are campaigns and tasks, and the product comes with two predefined roles: campaign owners and data stewards.

  • Campaign owners can define different campaigns (Arbitration, Resolution or Merging); engage the data stewards who will contribute to each campaign; define the structure of the data used by the campaigns; refer to Talend Jobs to load tasks into the campaigns; retrieve tasks from the campaigns; and assign tasks in the campaigns to different data stewards.

  • Data stewards can explore the data that relates to their tasks, resolve tasks one at a time or for a whole set of records, delegate tasks to colleagues, and monitor and audit stewardship campaigns and data error resolution.

Additionally, Talend Data Stewardship can trigger validation workflows for tasks that should be double-checked. Because it is workflow-driven and easy to use through a guided user experience, anyone can participate in data curation efforts, with clear responsibilities and efficient tools to carry them out.

CRM Example Use Case

Consider a use case where you want to improve the quality of data in your CRM system, as it has incorrect data and many duplicates. As the campaign owner using Talend Data Stewardship, you would define a Resolution campaign and objective (e.g. resolve incorrect addresses) and quarantine the data that needs attention, typically the records with invalid or empty contact data, or the potential duplicates. You would then define the participants in the campaign, for example all regional marketing managers, digital marketing managers, and the sales admin. Then you would assign tasks; for example, the error resolution tasks for the German marketing contacts are assigned to the German marketing managers, because they know this data best and can certify it, correct it, or reconcile it against multiple versions of the truth. And they will benefit from the cleansed data through higher conversion rates in their marketing campaigns. As each stakeholder updates the data, you can track the changes made, e.g. marketing verified mailing addresses, telephone numbers and email addresses.

Next, a merging campaign is created to match and merge duplicate records and the sales admin can merge the duplicate records. 


Take It For A Test Drive

In summary, as companies consume more data and start providing self-service access to data, there is a clear requirement for self-service data quality tools to get the most out of your data. The business benefits from increased data usage and more informed decisions using better data.  IT also benefits by delegating data cleaning tasks to data workers.

Are you starting to fill the data lake and realize that you now need to manage it?  Take Talend Data Stewardship for a test drive!

With Talend Winter ’17, you get 2 free licenses of Talend Data Stewardship and Talend Data Preparation with your Talend subscription. Contact your local Talend sales representative for your Talend Winter ’17 evaluation.


The Future of Apache Beam, Now a Top-Level Apache Software Foundation Project


Our journey to this day started 10 months ago, and what an exciting road it has been.

In February 2016, Google, Talend, Cloudera, dataArtisans, PayPal and Slack joined efforts to propose Apache Beam (see “Introduction to Apache Beam”) to the Apache Incubator, the entry path into The Apache Software Foundation (ASF).

The numbers are pretty impressive. During the incubation period for Apache Beam we saw:

  • More than 1600 pull requests created, resulting in more than 4500 commits.
  • More than 1000 tickets created and fixed.
  • More than 100 contributors to the code.
  • Three releases, made by three different release managers.

Beyond the impressive appeal shown by the numbers above, it was great to see how the Apache Beam community grew and rallied behind this project. After the “legacy” players got involved, new actors also joined the Apache Beam community.

Thanks to the design and approach of Apache Beam, we interacted with a large range of other projects, including Apache (Kafka, Cassandra, Avro, and Parquet) and non-Apache projects such as Elasticsearch, Kinesis, and Google. We are eager to see additional contributions and feedback that will keep improving the capabilities of Apache Beam.  As a mentor to the Apache incubation podling, I can honestly say the team is awesome to work with. They are very open minded, eager to help and committed to this project. I am equally committed to contributing to the Apache Beam project daily. I truly believe that Apache Beam is the next level of streaming analytics and data processing. It is a great choice for both batch and stream processing and can handle bounded and unbounded data sets.

Talend began to evaluate Google Dataflow in 2015 and immediately knew we wanted to get involved because we see Beam as a natural extension to our code-generating platform and a way to provide even greater agility to our customers. By updating the Beam “runner” for any new API changes (including adopting a brand new framework like Flink or Apex), we get 100 percent full fidelity support across the product suite. It’s not surprising then that Talend has been a very active contributor to the Apache Beam community over the last two years.

Stay tuned for more information on Apache Beam project developments and forthcoming Talend product enhancements in the near future.

About Jean-Baptiste (@jbonofre)

ASF Member, PMC for Apache Karaf, PMC for Apache ServiceMix, PMC for Apache ACE, PMC for Apache Syncope, Committer for Apache ActiveMQ, Committer for Apache Archiva, Committer for Apache Camel, Contributor for Apache Falcon



Talend Data Masters 2016: How the ICIJ Decoded the Panama Papers with Talend


The Panama Papers is the biggest data leak and cross-border investigation in journalism history. For one year, around 400 reporters across almost 80 countries dived into a massive trove of information that exposed how the offshore economy works. Talend Big Data was instrumental in bringing that information into the public domain.

Founded in 1997, the International Consortium of Investigative Journalists (ICIJ) is a global network of more than 190 independent journalists in more than 65 countries who collaborate on exposing big investigative stories of global social interest. In May 2015, ICIJ obtained from the German newspaper Süddeutsche Zeitung an encrypted hard drive with leaked data from the Panamanian law firm Mossack Fonseca.

Massive Mountains of Information

The 2.6 terabytes and 11.5 million files of Panama Papers data were made up of more than 320,000 text documents, 1.1 million images, 2.15 million PDF files, 3 million database excerpts and 4.8 million emails. The entire set of printed documents would weigh 3,200 tons and take more than 41 years of nonstop operation to print on an office laser printer, consuming a small forest of 80,000 trees as paper.


Open-Source Technology Inside 

The ICIJ used Talend Big Data to reconstruct Mossack Fonseca’s client database from the database excerpts and convert it into a Neo4j graph database. They visualized it with Linkurious, a graph visualization platform to organize and access the information.

They knew from the beginning that they ultimately wanted to make this database open to the public. The data quality requirements were therefore high: millions of people would see the information, and a mistake could be catastrophic for the ICIJ in terms of brand reputation and lawsuits. Talend was key in enabling the ICIJ’s data team to work efficiently and remotely across two continents and to document each step of the preparation process.

Data Democratization

On April 3, 2016, more than 100 media organizations published the results of the year-long investigation. The list of over 210,000 companies across 21 jurisdictions included activity tied to the ongoing Syrian war, the looting of resources in Africa, and individual offshore transactions from billionaires, sports players and other celebrities. The report also linked company relationships to 140 politicians in more than 50 different countries, including 12 current or former world leaders.

The political reaction came almost immediately. Iceland’s prime minister resigned two days after the revelations, France put Panama back on its tax haven list, and U.S. President Barack Obama called for international tax reform. Swiss police conducted two raids, including one on the headquarters of UEFA, the body that oversees professional soccer in Europe. A member of FIFA’s ethics committee was forced to resign.  

ICIJ’s Panama Papers investigation produced a daily drumbeat of regulatory moves, follow-up stories and calls for more action to combat offshore financial secrecy. At least 150 inquiries, audits or investigations into its revelations have been announced in 79 countries around the world, the result of pioneering the use of data to help uncover illegal activity.


Apache Beam Your Way to Greater Data Agility


If you are Captain Kirk or Mr. Spock and you need to get somewhere in a hurry, then you “beam” there, it’s just what you do. If you are a company and you want to become more data driven, then as surprising as it may sound, the answer there could be beam as well, Apache® Beam™.

This week, the Apache Software Foundation announced that Apache Beam has become a top-level Apache project. Essentially, becoming a top-level project formalizes and legitimizes it and indicates the project has strong community support. For those of you not familiar with Beam, it’s a unified programming model for batch and streaming data processing. Beam includes software development kits in Java and Python for defining data processing pipelines, as well as runners to execute the pipelines on a range of engines, such as Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. So, with Beam, you no longer have to worry about the actual runtime where your processes will be deployed and executed. We see this as massive for IT teams looking to keep up with both data technology innovation and the increasing pace of business. There’s an introduction to Beam overview here, if you wish to learn more.
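
To make the unified-model idea concrete, here is a minimal, hedged sketch of a Beam pipeline in Java. The input and output paths are placeholder assumptions, and the exact SDK packaging can differ slightly between Beam releases; the point is that the pipeline code itself says nothing about which engine runs it.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

public class MinimalBeamPipeline {
  public static void main(String[] args) {
    // The runner (Direct, Spark, Flink, Apex, Dataflow, ...) is picked via options
    // on the command line; nothing in the pipeline code is tied to an engine.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("hdfs:///input/events-*.txt")) // placeholder path
     .apply("CountPerLine", Count.perElement())
     .apply("Format", MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
       @Override
       public String apply(KV<String, Long> kv) {
         return kv.getKey() + "," + kv.getValue();
       }
     }))
     .apply("WriteCounts", TextIO.write().to("hdfs:///output/counts")); // placeholder path

    p.run().waitUntilFinish();
  }
}

Launching the same class with a different --runner option (plus that runner's dependencies) is what the portability argument below boils down to.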

All organizations are racing to transform themselves into digital businesses and use data as the basis for growth and innovation ahead of the competition. The problem is, getting there is anything but straightforward, and the shot clock is on. Modern information needs demand far more complex data integration and management to support things like greater market responsiveness, and real-time, personalized relationships with customers, partners, and suppliers. With the data needs of the business increasing exponentially, CIOs are being forced to make strategic technology bets even as the market continues its dramatic transformation. This can be a major issue, as a technology choice made today to fuel progress can easily become an anchor to advancement tomorrow.

Helping companies get over this hurdle has been a major focus for Talend and is why we designed our Data Integration solutions the way we did, as native code generators. Even ten years ago when we first introduced Talend, we knew that there had to be a better, faster, and cheaper way than hand-coding to manage data integration projects. Our head of sales in Europe, Francois Mero, recently wrote an entire piece detailing the advantages of code-generating tools over hand-coding, so I won’t go into a lot of detail here. Net/net, ten years ago code generation provided strong economies of skills and scale over hand coding because it is quicker and more cost effective; however, today with the velocity of modern data use cases, it’s an absolute no-brainer. It’s really about creating greater agility through the portability or re-usability of projects. To explain, in 2014 and 2015, MapReduce was the standard, but by the end of 2016 Spark emerged to replace it. Spark offered such significant advantages over its predecessor that it was a competitive advantage to make the switch as quickly as possible. If companies were using hand-coding to develop MapReduce projects, then they had to recreate everything in Spark, costing them a tremendous amount of time and money. In the case of companies leveraging code-generating tools, the change was as simple as a couple of clicks of the mouse.

Enter Apache Beam. Here is what our CTO, Laurent Bride stated in the Apache Software Foundation announcement about Beam’s move to becoming a top-level project:

“The graduation of Apache Beam as a top-level project is a great achievement and, in the fast-paced Big Data world we live in, recognition of the importance of a unified, portable, and extensible abstraction framework to build complex batch and streaming data processing pipelines. Customers don’t like to be locked-in, so they will appreciate the runtime flexibility Apache Beam provides. With four mature runners already available and I’m sure more to come, Beam represents the future and will be a key element of Talend’s strategic technology stack moving forward.”

Talend has chosen to embrace Beam because we see it as a natural extension to our code-generating platform and a way to provide even greater agility to our customers. By updating the Beam “runner” for any new API changes (including adopting a brand new framework like Flink or Apex), we get 100% full fidelity support across the product suite. In contrast, what do those still using custom code do when Spark changes their APIs? They have to rewrite large chunks of it or the entire thing. Again, this isn’t theoretical; Spark made big disruptive changes to their APIs going from 1.6 to 2.0. It’s a particularly tricky situation now since Spark 2.0 isn’t ready for production use. What do you do? If you write to the version that works now you know that you’re digging yourself into a huge hole going forward. Or perhaps you roll the dice and write to the new one and hope that it’s ready for real production when you need to go live. Perhaps you guess right, and that’s good for you; however, it’s only going to keep happening – over and over again. It’s not just about Spark. If you decide that your streaming use cases are latency sensitive enough that micro-batching isn’t good enough then you’ll want to look at using Flink, Apex, or something else. Again, with a code-generation tool like Talend, that’s a couple of clicks for the increasing number of frameworks that Beam supports, or at worst a new runner away from supporting something brand new. With hand coding, it’s a complete ditch and restart. 

So, what say you, ready to get beamed up?


Accelerate Data Lake Creation and Software Development Lifecycles with Talend Integration Cloud Winter ’17


Today, we announced the general availability of the Winter ’17 release of Talend Integration Cloud, our integration platform-as-a-service (iPaaS). This release helps customers build data lakes dramatically faster using AWS S3 and reduce data security vulnerabilities through controlled access. The new release also enables customers to continuously deliver integration projects, accelerate the software development lifecycle with separate environments, and speed widespread Salesforce adoption.

Create Data Lakes Faster

There are several reasons why the Winter ‘17 release should be extremely attractive to our customers and prospects. For starters, if customer data lakes are built on S3, or they are considering moving their data lakes to S3, Talend Winter ’17 offers improved support for AWS functionalities that help customers create data lakes quickly using:

* Fast, Easy Uploading of Large Files: This feature is extremely useful for customers that need to upload file objects greater than 100 MB. You can now break larger objects into ‘chunks’ and upload these file parts in parallel—a tremendous time savings! If the upload of a certain file part fails, you can easily restart just that part. Simply put, customers can upload massive volumes of data without having to worry about an unreliable network connection (see the sketch after this list).

* Low Cost, Reliable Message Transmission with Amazon Simple Queue Service (SQS): Amazon SQS is a fast and reliable message queuing service that is often used to transmit large volumes of data without losing communications or requiring other services to be available. With Talend’s new native connector for Amazon SQS, customers can incorporate a highly scalable messaging cluster into the integration workflow.
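
To illustrate the multipart-upload idea outside of the Talend component, here is a minimal sketch using the AWS SDK for Java; the bucket name, key and local file path are placeholder assumptions, and the SDK's TransferManager handles splitting the object into parts and uploading them in parallel.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;

public class MultipartUploadSketch {
  public static void main(String[] args) throws Exception {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // TransferManager switches to multipart upload above the configured threshold
    // and uploads the parts in parallel; a failed part can be retried on its own.
    TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(s3)
        .withMultipartUploadThreshold(100L * 1024 * 1024) // multipart above ~100 MB
        .build();

    Upload upload = tm.upload("my-data-lake-bucket", "raw/2017/big-extract.csv",
        new File("/data/exports/big-extract.csv")); // placeholder bucket, key and file
    upload.waitForCompletion();

    tm.shutdownNow();
  }
}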


Accelerate Integration Projects with Continuous Delivery

Most customers are challenged by the wide range of adoption and customization work required for SaaS apps and platforms. Additionally, the demands of modern enterprises require multiple QA checks to ensure software will perform at its utmost efficiency. Talend Winter ’17 includes capabilities for development teams to incorporate continuous delivery into their software development lifecycle (SDLC), enabling enterprises to plan and execute integration projects frequently, quickly, and in a frictionless manner, which speeds time to market.

* Create Separate Environments for Each Specific Stage of the SDLC

The SDLC process includes three simple steps:

  1. Development – A developer creates a job in Talend Studio (with environment creation capabilities).
  2. QA – Upon creation of a job, a developer publishes it to the cloud in the QA environment. A QA engineer then receives the job, configures and tests it. If the job passes the test, the QA engineer notifies the Ops engineer and creates a specific workspace for it.
  3. Production – The Ops engineer receives the notification and promotes the job to the production environment by either selecting publication to all environments or putting each job in a particular workspace.

Simple and seamless, right? Customers can even further streamline their SDLC process by creating and managing role-specific security and configuration policies, another new feature in this release. This dramatically enhances software development security, prevents unintentional or unauthorized access, and greatly improves operational efficiency.


Connect with Partners and Customers Faster using Salesforce

The Winter ’17 release of Talend Integration Cloud provides support for the Salesforce Sales, Salesforce Service, and Wave Summer ’16 APIs. This new functionality enables Talend Integration Cloud customers and prospects to connect with customers, employees, and partners more efficiently by taking full advantage of all Salesforce Summer ’16 features.

Where to Learn More

In addition to the new features cited above, there are some additional product enhancements in Talend Winter ’17 that customers may find beneficial for integration projects. For more detailed information about the new features in the Winter ’17 release of Talend Integration Cloud, check out the following resources:


Power to The People – Creating Trust in Data with Collaborative Governance


Today’s enterprise IT organizations are once again experiencing a massive upheaval due to pressure from employee forces.

It’s a familiar story. Just think of the turmoil caused by the dawning of the bring-your-own-device (BYOD) era, with employees demanding to use their beloved, personal mobile phones or tablets for work. If IT balked at their requests for mobile, some resourceful users would resort to workarounds – creating ‘shadow IT’ – to get access to corporate systems on their personal devices.  Of course, in the process, those employees also unknowingly put sensitive company information at risk.

Now even more IT agitations are on the way, once again being generated by employee demand. This time, users demand access to the growing pools of big data companies have amassed, and the insights they likely contain. If IT can’t deliver the tools to access the information residing in corporate data lakes, employees will—just as they did in the BYOD era—find a workaround, which will likely put enterprise information at risk. Thus, there is no other option for IT than to deliver data via self-service access across all lines of business. However, IT must find the proper way to do so in order to prevent exposing company assets to unnecessary risk. They must adopt a model of collaborative governance.

The transition from authoritative to collaborative governance of company data might be hard, but there’s an opportunity for corporate IT departments to create a system of trust around enterprise data stores, wherein employees collaborate with IT to maintain and/or increase the quality, governance and security of data. The good news is that IT professionals have a blueprint from the companies that pioneered the use of the World Wide Web for collaborative data governance. Just as Web 2.0 evolved around trends that focused on the idea of user collaboration, sharing of user-generated content, and social networking, so too does the concept of collaborative governance. Collaborative governance breaks down the technological and psychological barriers between enterprise data keepers and information consumers, allowing everyone within an organization to share the responsibility of securing enterprise data.  This concept has the power to transform entire industries.

Wikipedia is a good example of user collaboration in action. Launched in 2001, it is the world’s sixth most popular website in terms of overall visitor traffic. Everyone can contribute or edit entries – a mixed blessing when it comes to reliability and trust.

Or take TripAdvisor, an American travel website company that provides reviews of travel-related content and interactive travel forums.  It was an early adopter of user-generated content. 

Airbnb is another excellent example of collaborative governance in action. Founded in 2008 as a “trusted community marketplace for people to list, discover, and book unique accommodations around the world,” including, as the website states, “an apartment for a night, a castle for a week, or a villa for a month,” it is the users themselves that provide the venues, and the company that provides a platform which owners and travelers can leverage to share and book venues. 

The greatest challenge – and enabler – for this model has always been trust. Users place their trust in others to accurately update content and information (ratings), meaning consumers are putting their trust solely in the information presented to them. The system works because the data in the system is bountiful and the platform it resides within is designed specifically to enhance user experience.

Now let’s consider the typical IT landscape in an enterprise. Information used to be designed and published by a very small number of data professionals targeting their efforts to “end-users”, or consumers, who were ingesting the information. Today, the proliferation of information within companies is uncontrollable, just like it was on the Web. We’re all experiencing the rise of a growing number of cloud applications coming through sales, marketing, HR, operations or finance to complement centrally designed, legacy IT apps, such as ERP, data warehousing or CRM. Digital and mobile applications connect IT systems to the external world. To manage these new data streams, we are watching new data-focused roles emerging within corporations, such as data analysts, data scientists or data stewards, which are blurring the lines between enterprise data consumers and providers. Just like the adoption of BYOD, these new roles are presenting challenges of corporate data quality, reliability, and trust that must be addressed by IT organizations. 

As the Web 2.0 model evolved, trust between consumers and their service providers was established by crowdsourced mechanisms for rating, ranking and establishing a digital reputation (think Yelp). One lesson learned in the consumer world is that the rewards of trust are huge. These same positive returns can be realized by enterprise IT departments that adopt selected strategies embraced by their more freewheeling consumer counterparts.   

Delivering a system of trust through collaborative data governance and self-service is just one of the opportunities available to evolving IT organizations.  Through self-service, line of business users become more involved with the actual collection, cleansing, and qualification of data from a variety of sources, so that they can then analyze that data and use it for more informed decision-making. Currently, many companies—in their mad rush to become data-driven—are increasingly making decisions based on incomplete and inaccurate data. In fact, according to The Data Warehousing Institute, ‘dirty data’ is costing businesses $600B a year. Companies will continue to experience extreme loss and possible failure if they don’t have a sound data governance system in place.

Collaborative data governance is an easy way for IT to help ensure that the quality, security, and accuracy of enterprise information is preserved in a self-service environment. Collaborative governance allows employees in an organization to correct, qualify and cleanse enterprise information. This helps IT because the master data records are being updated by those most familiar with or closest to the data itself (i.e. the marketing analyst who cleanses tradeshow leads, or the financial analyst who rectifies a budget spreadsheet).

Additionally, fostering the crucial shift to more business user involvement with an organization’s critical data leads to numerous other benefits.  For example, users save time and increase productivity when they work with trusted data. Marketing departments improve their campaigns. Call centers work with more reliable, accurate customer information, much to everyone’s satisfaction. And the enterprise gets better control over its most valuable asset: data.  

So my simple message to companies looking to become more data driven: digital transformation can be achieved—it’s all just a matter of trust.

 

 

 


Getting Started with Big Data


Big data is here to stay

After social media, the Internet of Things is the next big driving force behind the increase in data worldwide, which is doubling in size every two years.  [1] At the same time, data processing speeds and capabilities are becoming increasingly important because—much like food—data loses relevance after a certain date.  Additionally,  the increasing variety of structured, unstructured and semi-structured data (such as pictures, text, videos and sound) is now becoming easier to capture and analyze. The three main factors defining big data are: volume, velocity and variety. [2] Companies which are in command of these three classes of data can derive great value from them and will be more successful in comparison to their less digitally adept counterparts. [3]

Digital leadership today is best demonstrated by companies such as Amazon, Netflix, and Uber, that have successfully implemented a comprehensive data acquisition strategy that helps differentiate them from competitors. Having the right big data technology platform also needs to be part of that strategy. According to O’Reilly’s report on “The Big Data Market 2016” [4] ,“larger enterprises (those with more than 5,000 employees) are adopting big data technologies [such as Hadoop and Spark] much faster than smaller companies.”

However, there are lots of opportunities for smaller organizations to become ‘digital leaders’ using today’s modern data platforms. According to Cloudera, maturing Hadoop ecosystems can not only help achieve cost savings, but can also open up new business opportunities by making it possible to use data more strategically. The main applications of big data are better understanding customers, improving products and services, achieving more effective processes, and reducing risks with improved quality assurance and better problem detection. [5]

How to get off to a good start

At the outset, it can seem difficult to get started with big data projects. In our day to day work we see many medium-sized companies (in the German-speaking market) thinking about big data technologies in principle, but not managing to get things off the ground with specific projects.  So what’s really the best way to get started?

In our experience, the most successful approach is usually to start small, with a clearly defined project plan that is relevant to your business. Many of our customers are currently facing the challenge of having to connect to novel data sources and store ever larger quantities of data. This is often machine data, and occasionally social media data. In principle, it would be possible to accommodate this data in a relational database, or possibly in an existing data warehouse. But this is usually expensive in the context of substantial projects, so it is well worth considering alternatives. To put it simply, it just doesn’t feel quite right.

Typically, smaller projects are an excellent way to gain experience with big data technologies. A relatively small, manageable and isolated project can often provide a low-risk way to get started with a new technology. It doesn’t matter whether all three Vs (volume, velocity, variety) are completely fulfilled or not. But it is important to find a controllable, appropriate and relevant big data use case with measurable factors of success in order to ensure it can be transferred quickly from a pilot to a production environment. [6]

Even if you could address a big data project with the tried and tested technologies you currently own, taking a chance to get started with big data technologies should not be missed. Otherwise, you might find yourself unable to deal with the complexity of a larger project, if you haven’t already experimented with new technologies on a more controllable scale.

The architecture of a big data project is usually quite manageable, as the technologies are already mature and much more accessible. A Hadoop distribution is used for data storage. First, data has to be collected at the source, potentially transformed (although for big data it is advisable to store raw data without transforming it) and then loaded into Hadoop. The Talend Big Data platform provides everything you need to implement such a link based on a model, generating high-performance native code that helps your team get up and running with Apache Hadoop, Apache Spark, Spark Streaming and NoSQL technologies quickly.

In the end, the data is usually evaluated, either directly at the raw data level or via a detour through a data mart with preprocessed data. The data mart can in turn be filled with Talend. Evaluations can then be carried out with suitable tools already in use, although this is also a good opportunity to introduce new tools, typically from the fields of data visualization, discovery and advanced analytics.

[Figure: typical big data architecture, from data sources through Hadoop to data marts and analytics tools]

Start, grow and create opportunities

Big data and traditional data warehouses are growing closer together. Theoretically, a complete data warehouse can be modernized with the help of big data technologies, such as Hadoop, something that frequently leads to significant cost savings while simultaneously opening up new opportunities. But it is also possible for the world of big data to merge with traditional data warehouses at a more leisurely pace. Once the big data infrastructure is there, it is simple to link Hadoop with the data warehouse – potentially in both directions as well. The data warehouse can serve as a source of data which is stored in Hadoop.  By the same token, data from Hadoop can be read, transformed and finally stored in the data warehouse. The two worlds don’t have to be isolated from one another, instead they merge together and, in the end, you have a data warehouse based on big data technologies.

Great journeys always begin with a first small step. Big data technologies are more mature and accessible now, but, as so often in life, you can only progress after you get started on something specific. That is why we recommend you actively look for such projects. Once you have set out on your journey and have investigated the technologies and set up the infrastructure, you will quickly benefit from the new opportunities it presents. All this means you can continue to stay competitive in this age of data-based companies.

References

[1] https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

[2] https://en.wikipedia.org/wiki/Big_data

[3] https://www.idc.com/getdoc.jsp?containerId=prAP40943216

[4] https://www.oreilly.com/ideas/the-big-data-market

[5] http://www.cloudera.com/content/dam/www/marketing/resources/whitepapers/the-business-value-of-an-enterprise-data-hub.pdf.landing.html

[6] http://www.gartner.com/newsroom/id/3466117

[7] https://talend.com/products/big-data

About the author Dr. Gero Presser

Dr. Gero Presser is a co-founder and managing partner of Quinscape GmbH in Dortmund. Quinscape has positioned itself on the German market as a leading system integrator for the Talend, Jaspersoft/Spotfire, Kony and Intrexx platforms and, with its 100 members of staff, takes care of renowned customers including SMEs, large corporations and the public sector.

Gero Presser did his doctorate in decision-making theory in the field of artificial intelligence, and at Quinscape he is responsible for building up the Business Intelligence line of business with a focus on analytics and integration.




How to Offload Oracle and MySQL Databases into Hadoop using Apache Spark and Talend


In the big data space, a common pattern is offloading a traditional data warehouse into a Hadoop environment. Whether it is for primary use or only to store “cold” data, Talend makes the offload painless.

Many organizations trying to optimize their data architecture have leveraged Hadoop for their cold data or to maintain an archive. With the native code generation for Hadoop, Talend can make this process easy.

Talend already provides out-of-the-box connectors to support this paradigm using Sqoop; here we are going to focus on how to do the same using Apache Spark.

Apache Spark is a fast and general engine for large-scale data processing. The engine is available in most of the latest Hadoop distributions (Cloudera, Hortonworks, MapR, AWS EMR, etc.). Built for massively parallel, in-memory processing, it allows you to massively parallelize a data flow to handle any enterprise workload.

The fastest and best-known solution today for bringing data from your databases into Hadoop is Sqoop (which uses a MapReduce process underneath to perform the offload from the RDBMS to Hadoop). Here I want to introduce you to something that serves the same purpose as Sqoop but uses Spark as the framework/engine.

In this blog post, I’m going to address first how to use Spark to move one table from Oracle or MySQL into Hadoop. Then, once we have a working job for this task, we will see how to turn it into a generic job controlled by a list of tables to move from your database server to Hadoop.

For simplicity, we will key in on the following two scenarios:

  • How to move a Table into HDFS from a Spark job.
  • How to automate and turn the job above into a Metadata-driven ingestion framework to work on a list of tables.

Moving a Table into HDFS from a Talend Spark Job

[Figure: Talend Spark job that extracts a database table and writes it to HDFS]

In this scenario, we created a very generic job that extracts from a database table and moves the data into HDFS using Apache Spark and a generic query statement such as:

"SELECT concat_ws('" + context.FIELD_SEPARATOR + "', " + context.column_list + ") as my_data FROM my_table"

context.FIELD_SEPARATOR is a context variable at the job level, set to ‘,’ or ‘;’ or ‘|’ or another separator. context.column_list is a context variable holding the comma-separated list of fields to be extracted (for example: field1, field2, field3, etc.).

The Offload piece will execute the query statement natively on Hadoop using Spark. The generated code is deployed directly through YARN.
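
For readers who want to see the underlying pattern outside of the generated Talend code, here is a minimal, hedged sketch written directly against the Spark Java API. The JDBC URL, credentials, column names and HDFS path are placeholder assumptions, and the selectExpr call mirrors the concat_ws query shown above.

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TableOffloadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("offload-my_table").getOrCreate();

    Properties props = new Properties();
    props.setProperty("user", "etl_user");      // placeholder credentials
    props.setProperty("password", "secret");

    // Read the source table over JDBC (MySQL shown; an Oracle URL works the same way).
    Dataset<Row> source = spark.read()
        .jdbc("jdbc:mysql://dbserver:3306/sales", "my_table", props);

    // Equivalent of SELECT concat_ws(<separator>, <column_list>) AS my_data FROM my_table.
    String separator = "|";
    Dataset<Row> flattened = source.selectExpr(
        "concat_ws('" + separator + "', field1, field2, field3) AS my_data");

    // Land the result in HDFS as plain text, one delimited record per line.
    flattened.write().text("hdfs:///datalake/offload/my_table");

    spark.stop();
  }
}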

Automating and turning the Job above into a Metadata driven ingestion framework to work on a list of tables

[Figure: metadata-driven framework iterating over a list of tables and calling the generic Offload job]

The offload-preparation process starts at the database. Next, the table list is pulled and contextualized, along with the list of columns in each table (preparing the variables to be sent to the Offload job). Once this has been completed, the Offload job is simply called in an iteration over the tables to offload to Hadoop. The Offload process is the job described in the section “Moving a Table into HDFS from a Talend Spark Job” above.
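
A hedged sketch of that driver logic is below: it reads the table list from the database catalog, builds the column list for each table, and hands the parameters to the generic offload job. The schema name, connection details and the runOffloadJob method are hypothetical stand-ins for what the Talend parent job does with context variables and a child-job call.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class MetadataDrivenOffloadSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://dbserver:3306/sales", "etl_user", "secret"); // placeholder source
         Statement stmt = conn.createStatement();
         ResultSet tables = stmt.executeQuery(
             "SELECT table_name FROM information_schema.tables WHERE table_schema = 'sales'")) {

      while (tables.next()) {
        String table = tables.getString("table_name");

        // Build the column list for this table (what becomes context.column_list).
        List<String> columns = new ArrayList<>();
        try (Statement colStmt = conn.createStatement();
             ResultSet cols = colStmt.executeQuery(
                 "SELECT column_name FROM information_schema.columns "
                 + "WHERE table_schema = 'sales' AND table_name = '" + table + "'")) {
          while (cols.next()) {
            columns.add(cols.getString("column_name"));
          }
        }

        // Hand the contextualized parameters to the generic offload job;
        // runOffloadJob is a hypothetical stand-in for the Talend child-job call.
        runOffloadJob(table, String.join(",", columns), "|");
      }
    }
  }

  static void runOffloadJob(String table, String columnList, String fieldSeparator) {
    System.out.printf("Offloading %s with columns [%s] and separator '%s'%n",
        table, columnList, fieldSeparator);
  }
}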


Edge Analytics – The Pros and Cons of Immediate, Local Insight


A number of data scientists reached out to me about data storage and processing as discussed in my last blog on IoT. Their questions largely fell into the same bucket: they are puzzled about what to do with their data, whether they should store or discard it and, if they store it, what the best approach is to making that data a strategic asset for their company.

Despite the widespread proliferation of sensors, the majority of industrial internet of things (IIoT) data collected is never analysed, which is tragic. Many existing IoT platform solutions are painfully slow, expensive and a drain on resources, which makes analysing the rest extremely difficult. Gartner has said that 90% of deployed data will be useless, and Experian found that about 32% of data in US firms is inaccurate. The key takeaway is that data is the most valuable asset for any company, so it would be a shame to completely discard it or let it lie dormant in an abandoned data lake somewhere. It’s imperative that all data scientists tap into their swelling pools of IoT data to make sense of the various endpoints of information and help develop conclusions that will ultimately deliver business outcomes. I am totally against discarding data without processing it.

As mentioned in my IoT blog, in a few years there will be an additional 15 to 40 billion devices generating data at the edge compared with what we have today[1]. That brings new challenges. Just imagine an infrastructure transferring all of this data to data lakes and processing hubs. The load will continue to rise exponentially over the coming months and years, stretching the limits of your infrastructure. The real benefit of this data comes from analysis, whether it is traffic data from “things” or footage from surveillance cameras. In time-critical situations, delaying this analysis might make it “too late”. The delay could be due to many reasons, such as limited network availability or overloaded central systems.


A relatively new approach, “edge analytics”, is being used to address these issues. Basically, it is as simple as it sounds: perform the analysis at the point where the data is being generated, in real time and on site. The architectural design of “things” should consider built-in analysis. For example, sensors in trains or at stop lights that provide intelligent monitoring and management of traffic should be powerful enough to raise an alarm to nearby fire or police departments based on their analysis of the local surroundings. Another good example is security cameras. Transmitting the live video unchanged is pretty much useless. There are algorithms that detect change, and if a new image can be generated from the previous image, they send only the changes. These kinds of events make more sense to process locally rather than sending them over the network for analysis. It is very important to understand where edge analytics makes sense and, if “devices” do not support local processing, how we can architect a connected network to make sense of data generated by sensors and devices at the nearest location. Companies like Cisco, Intel and others are proponents of edge computing and are promoting their gateways as edge computing devices. IBM Watson IoT, a joint IBM and Cisco project, is reshaping analytics architecture by offering powerful analytics anywhere. Dell, a typical server hardware vendor, has developed special devices (Dell Edge Gateway) to support analytics at the edge. Dell has built a complete system, hardware and software, for analytics that allows an analytics model to be created in one location or in the cloud and deployed to other parts of the ecosystem.

However, there are some compromises that must be considered with edge analytics. Only a subset of the data is processed and analysed, and only the analysis result is transmitted over the network. This means we are effectively discarding some of the raw data and potentially missing some insights. The question is whether this “loss” is bearable: do we need the whole dataset, or is the result generated by the local analysis enough for us? What impact will it have? There is no generalised answer to this. An airplane system cannot afford to miss any data, so all data should be transferred and analysed to detect any kind of pattern that could lead to an abnormality; but since transferring data during a flight is not convenient, collecting data for offline analysis and running edge analytics during the flight is a better approach. Systems with more fault tolerance can accept that not everything is analysed. This is where we will have to learn by experience as organizations begin to get involved in this new field of IoT analytics and review the results.

Again, data is valuable. All data should be analysed to detect patterns and support market analysis. Data-driven companies are making a lot more progress compared to traditional ones. IoT edge analytics is an exciting space and an answer to the maintenance and usability of data, and many big companies are investing in it. An IDC FutureScape report for IoT predicted that by 2018, 40 percent of IoT data will be stored, processed, analysed and acted upon close to where it is created, before it is transferred to the network[2]. Transmitting data costs money, and we need to cut that cost without impacting the quality and timeliness of decisions; edge analytics is definitely an answer to that.

 

 

Sources:

  1. [1] “The Data of Things: How Edge Analytics and IoT Go Hand in Hand,” September 2015.
  2. [2] Forbes article by Bernard Marr, “Will Analytics on the Edge Be the Future of Big Data?”, August 2016.
  3. http://www.forbes.com/sites/teradata/2016/07/01/is-your-data-lake-destined-to-be-useless
  4. http://www.kdnuggets.com/2016/09/evolution-iot-edge-analytics.html
  5. https://www.datanami.com/2015/09/22/the-data-of-things-how-edge-analytics-and-iot-go-hand-in-hand
  6. https://developer.ibm.com/iotplatform/2016/08/03/introducing-edge-analytics/
  7. http://www.forbes.com/sites/bernardmarr/2016/08/23/will-analytics-on-the-edge-be-the-future-of-big-data/#7eb654402b09
  8. http://www.ibm.com/internet-of-things/iot-news/announcements/ibm-cisco/
  9. https://www.experianplc.com/media/news/2015/new-experian-data-quality-research-shows-inaccurate-data-preventing-desired-customer-insight/


How to Load Data into Microsoft Azure SQL Data Warehouse using PolyBase & Talend ETL


Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on a massively parallel processing (MPP) architecture, SQL Data Warehouse can handle any enterprise workload.

With the increasing focus on real-time business decisions, there has been a paradigm shift toward not only keeping data warehouse systems up to date but also reducing load times. The fastest and most optimal way to load data into SQL Data Warehouse is to use PolyBase to load data from Azure Blob storage. PolyBase uses SQL Data Warehouse’s massively parallel processing (MPP) design to load data in parallel from Azure Blob storage.

One of Talend’s key differentiators is its open source nature and the ability to leverage custom components, developed either in-house or by the open source community at Talend Exchange. Today our focus will be on one such custom component, tAzureSqlDWBulkExec, and how it enables Talend to use PolyBase to load data into SQL Data Warehouse.

For simplicity, we will key in on the following two scenarios:

  • Load data from any source into SQL DW
  • Load data into SQL DW while leveraging Azure HDInsight and Spark

Load data from any source into SQL DW

[Figure: Talend job loading data from any source into Azure Blob storage and then into SQL Data Warehouse]

In this scenario data can be ingested from one or more sources as part of a Talend job.  If needed, data will be transformed, cleansed and enriched using various processing and data quality connectors that Talend provides out of the box.  The output will need to conform to a delimited file format using tFileOutputDelimited.

The output file will then be loaded into Azure Blob Storage using tAzureStoragePut.  Once the file is loaded into blob, tAzureSqlDWBulkExec will be utilized to bulk load the data from the delimited file into a SQL Data Warehouse table.
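
To show what tAzureSqlDWBulkExec is doing conceptually, here is a hedged sketch of an equivalent hand-written PolyBase load executed over plain JDBC. The server, credential, table and column definitions are illustrative assumptions, and it presumes a database-scoped credential for the storage account already exists.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PolyBaseLoadSketch {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:sqlserver://mydwserver.database.windows.net:1433;database=mydw"; // placeholder
    try (Connection conn = DriverManager.getConnection(url, "loader", "secret");
         Statement stmt = conn.createStatement()) {

      // External data source pointing at the blob container that tAzureStoragePut wrote to.
      stmt.execute("CREATE EXTERNAL DATA SOURCE staging_blob WITH ("
          + " TYPE = HADOOP,"
          + " LOCATION = 'wasbs://staging@mystorageacct.blob.core.windows.net',"
          + " CREDENTIAL = blob_credential)");

      // Delimited text format matching the tFileOutputDelimited output.
      stmt.execute("CREATE EXTERNAL FILE FORMAT pipe_delimited WITH ("
          + " FORMAT_TYPE = DELIMITEDTEXT,"
          + " FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))");

      // External table over the staged file(s); columns are illustrative.
      stmt.execute("CREATE EXTERNAL TABLE ext_orders"
          + " (order_id INT, amount DECIMAL(18,2), country NVARCHAR(10))"
          + " WITH (LOCATION = '/orders/', DATA_SOURCE = staging_blob, FILE_FORMAT = pipe_delimited)");

      // CTAS pulls the files in through PolyBase, loading the MPP table in parallel.
      stmt.execute("CREATE TABLE dbo.orders WITH (DISTRIBUTION = HASH(order_id))"
          + " AS SELECT * FROM ext_orders");
    }
  }
}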

Load data into SQL DW while leveraging Azure HDInsight and Spark

[Figure: Talend Big Data job running on an HDInsight Spark cluster and writing output to Azure Blob storage]

 

As data volumes have increased so has the need to process data faster.  Apache Spark, a fast and general processing engine compatible with Hadoop, has become the go-to big data processing framework for several data-driven enterprises.  Azure HDInsight is a fully-managed cloud Hadoop offering that provides optimized open source analytic clusters for Spark (Please refer to the following link, How to Utilize Talend with Microsoft HDInsight, for instructions on how to connect to an HDInsight cluster using Talend Studio).

Talend Big Data Platform (Enterprise version) provides graphical tools and wizards to generate native Spark code that combines in-memory analytics, machine learning and caching to deliver optimal performance and increased efficiency over hand-coding.  The generated Spark code can be run natively on an HDInsight cluster directly from Talend Studio.

In this scenario, a Talend Big Data job will be set up to leverage an HDInsight Spark cluster to ingest data from one or more sources, apply transformations and output the results to HDFS (Azure Blob storage).  The output file format of the Talend Big Data job can be any of the following formats supported by PolyBase:

  • Delimited Text – using tFileOutputDelimited
  • Hive ORC – using tHiveOutput
  • Parquet – using tHiveOutput / tFileOutputParquet

After the completion of the Spark job, a standard job will be executed that bulk loads the data from the Spark output file into a SQL Data Warehouse table using tAzureSqlDWBulkExec.
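For readers who prefer to see the moving parts, here is a minimal hand-written Spark (Java) sketch of the kind of step the generated job performs: read raw data, apply a transformation, and land Parquet files in Blob storage for PolyBase to pick up. The storage paths and the filter expression are assumptions for illustration; in practice Talend generates this code for you.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PrepareForSqlDw {
    public static void main(String[] args) {
        // Assumes the job runs on an HDInsight Spark cluster with access to the storage account.
        SparkSession spark = SparkSession.builder().appName("PrepareForSqlDw").getOrCreate();
        Dataset<Row> orders = spark.read()
                .option("header", "true")
                .csv("wasbs://input@yourstorageaccount.blob.core.windows.net/raw/orders/");
        // Stand-in for the real transformation/enrichment logic.
        Dataset<Row> cleaned = orders.filter("amount is not null");
        // Parquet output lands in Blob storage, ready to be bulk loaded via PolyBase.
        cleaned.write()
                .mode(SaveMode.Overwrite)
                .parquet("wasbs://staging@yourstorageaccount.blob.core.windows.net/orders/");
    }
}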

Performance Benchmark

tAzureSqlDWBulkExec utilizes native PolyBase capability and therefore fully extends the performance benefits of loading data into Azure SQL Data Warehouse.  In-house tests have shown this approach to provide a 10x throughput improvement versus standard JDBC.

 

 

The post How to Load Data into Microsoft Azure SQL Data Warehouse using PolyBase & Talend ETL appeared first on Talend Real-Time Open Source Data Integration Software.

Stripping Websites and Translating Text using Talend and Google Translate API


Disclaimer: While I work for AstraZeneca, all opinions expressed in this article are my own and do not necessarily reflect the position of my employer.

Recently I developed a proof of concept ETL job to strip product information from a website, translate it, and create a data set comprising the original text and the translated text. 

For my example, I chose to strip drug product names out of a French health authority website, and translate them to English.  This is not a detailed instructional article, just a high-level overview of the process of solving this problem in Talend using third party java libraries and calling on external translation services.

A sample URL to perform a search of this website is below; the zero value is an offset.  The search returns 20 products at a time in a fairly simple HTML table.  For this example we will do a keyword search of "Glucose", but we could choose the marketing authorisation holder or a number of other attributes.

This returns the first 20 products (offset of 0):

http://ansm.sante.fr/searchengine/search/(offset)/0?keyword=Glucose

This returns the next 20 products (offset now 20):

http://ansm.sante.fr/searchengine/search/(offset)/20?keyword=Glucose

This is a good fit for Talend, because you have the ability to drop in jar files to add additional functionality to the workflow.  For HTML parsing I am using the excellent jsoup Java library.  I added the jsoup jar file as a code routine into the Talend Routines library and used this in a tJavaFlex component to parse the products in the HTML table and create a flow for each product in the list.

For language translation, I am making a REST API call to Google Translate – using the Talend tRest component.  I needed to set up a Google Cloud account, with billing enabled to use the Google Translate API, but for this volume of translation, the costs will be in the pennies range.

The rest of the workflow is processed using out of the box Talend components.

  • tSetProxy component (x2) to set the corporate proxies for the job (one for HTTP and one for HTTPS, required to make the Google API calls).
  • tForEach component to fabricate a set of URLs using offset values of 0, 20, 40, 60, 80, etc. (see above URL).  This offset value is used to construct the URL for a tHttpRequest component.
  • tHttpRequest component to make the calls to the health authority website, using URLs fabricated with the different offset values sent from the tForEach component.  Each HTTP request writes the HTML response to a numbered file for later HTML parsing by jsoup.
  • tJavaFlex component, which is where the jsoup HTML parser is used to split the product list in the HTML into separate rows for passing down the flow.  Jsoup has a powerful jquery-type selector syntax to make it easy to target the HTML nodes you need to process.
  • tRest component to make a call to the Google Translate API using my application key.  We pass, for translation, the product name which jsoup parsed out of the HTML.
  • tExtractJSONFields component to extract the translated product name from the Google Translate API JSON response from the tRest call.
  • A few other components such as tLogRow and tFileOutputExcel to capture the output, which is effectively just 2 columns: the product name (in French) and the translated product name courtesy of Google Translate.

Download >> Talend Open Studio for Data Integration

The end result is scraping all these French product names from the website:

…and we end up with the French product name and an English translated product name in a nice spreadsheet format which we can work with.

And the job isn't that complicated to look at; in fact, there is some redundancy in here (a tJavaRow I probably don't actually need).

The first step is to harvest the HTML out into a set of files. We need to fabricate URLs with different offsets to get the first 20 products, then the next 20 products etc.  An offset of 180 is enough to retrieve ~200 product names as a test, so we need to fabricate these URL calls:

http://ansm.sante.fr/searchengine/search/(offset)/0?keyword=Glucose

http://ansm.sante.fr/searchengine/search/(offset)/20?keyword=Glucose 

http://ansm.sante.fr/searchengine/search/(offset)/180?keyword=Glucose

I used a tForEach component to loop generating the offsets 0,20,40 etc. 

Then the tHttpRequest uses this variable from tForEach to generate a URL with the offset, and writes the response to a file we can process later.

Note the tHttpRequest component uses context variables where the base URL and file locations are maintained:

You can see the result is a set of text files which contains the HTML responses for parsing.  The numeric prefix matches the tForEach variable, and _0.txt represents the URL response with a zero offset, _20.txt the URL response with a 20 offset etc.
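As a rough illustration of what the job is doing at this stage (outside of Talend), the standalone sketch below fabricates the same URLs and file names. The base URL pattern comes from the examples above, while the output folder and file-name prefix are assumptions.

public class UrlBuilderSketch {
    public static void main(String[] args) {
        // In the real job the base URL and output folder live in Talend context variables,
        // and the offset comes from the tForEach component; these literal values are assumptions.
        String baseUrl = "http://ansm.sante.fr/searchengine/search/(offset)/";
        String outputPrefix = "C:/temp/ansm";
        for (int offset = 0; offset <= 180; offset += 20) {   // what tForEach iterates over
            String url = baseUrl + offset + "?keyword=Glucose";
            String outputFile = outputPrefix + "_" + offset + ".txt";   // matches the _0.txt, _20.txt naming
            System.out.println(url + "  ->  " + outputFile);
        }
    }
}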

The bulk of the clever stuff is handled in tJavaFlex which uses the jsoup HTML parsing library to target the table cells which contain the product names:

An HTML text file created earlier in the workflow, as a result of tHttpRequest, is opened and then a jquery type selector is used to home in on the table cell required. 

Elements products = doc.select("td.first > a > strong");

This selector basically states “go and return – in a collection – all the HTML nodes, which are <strong> tags, which are descendants of an <a> tag which are descendants of a table cell <td> tag with the CSS class called ‘first’…”.

We can see from the HTML of the website how the product names are structured, and why this selector works.  Each page presents 20 results, so the tJavaFlex generates 20 rows of data for the flow, one for each product.
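Outside of Talend, the equivalent jsoup logic is only a few lines. This is a hedged, standalone sketch; the file path is illustrative and the selector is the one shown above.

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ProductNameScraper {
    public static void main(String[] args) throws Exception {
        // Parse one of the HTML response files written earlier by tHttpRequest (path is illustrative).
        Document doc = Jsoup.parse(new File("C:/temp/ansm_0.txt"), "UTF-8");
        // The same jquery-style selector used inside the tJavaFlex component.
        Elements products = doc.select("td.first > a > strong");
        for (Element product : products) {
            System.out.println(product.text());   // one French product name per row
        }
    }
}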

We then make a call to the Google Translate API using tRest (note: this is not my real API key in the screenshot).  We specify a source language (French) and a target language (English), and we pass in the product name.

Note that I am reading this product name from the globalMap, as I had to turn the flow into an iterate – so I could feed the product name into the tRest component.  I am also URL encoding the product name, as this caused me problems calling the Google Translate API.

It is then a case of reading the response from Google Translate and parsing out the JSON field from the response.  For this, we use a tExtractJSONFields component and an XPath query.  A link to the Google Translate API documentation is provided here, followed by the configuration of the JSON Talend component.

You can see how this works by looking at the JSON response sample from the Google Translate documentation:
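The response sample itself isn't reproduced here, but as a point of reference, a minimal hand-rolled call to the v2 REST endpoint looks roughly like the sketch below. The API key and product name are placeholders, and the response-shape comment reflects the public v2 documentation rather than anything specific to this Talend job.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class TranslateSketch {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY";                                    // placeholder: your own Google Cloud key
        String productName = "GLUCOSE 5 %, solution pour perfusion";       // illustrative French product name
        String query = "https://translation.googleapis.com/language/translate/v2"
                + "?key=" + apiKey
                + "&source=fr&target=en"
                + "&q=" + URLEncoder.encode(productName, "UTF-8");         // same URL-encoding step as in the job
        HttpURLConnection con = (HttpURLConnection) new URL(query).openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            // Expected v2 response shape: {"data":{"translations":[{"translatedText":"..."}]}}
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}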

And that is, at a very high level, how Talend enables you to parse HTML, translate it from one language to another – and produce a nicely formatted data set…despite not supporting either natively.

Download >> Talend Open Studio for Data Integration

The post Stripping Websites and Translating Text using Talend and Google Translate API appeared first on Talend Real-Time Open Source Data Integration Software.

When It Comes to Big Data and Cloud, Continuous Innovation is the Model


During the 2008 election campaign, Barack Obama denounced his opponents for advocating "change" as an electoral argument while belonging to the government party that "had changed nothing." He approached the situation with humor, employing an expression well-known across the Atlantic: "You can put lipstick on a pig, but it's still a pig."

This expression could also be applied to some software companies. With the success of the Cloud and Software as a Service models on the one hand and the increase of subscription services on the other, it seems that software offered in the form of a subscription is becoming the new standard. In fact, as early as 2015, Gartner estimated that by 2020, 80% of vendors would adopt a subscription model.

This change in the way companies use software has frequently been said to reflect users’ demand for flexibility. Indeed, companies are no longer willing to lay out a major investment to get equipped. They are looking to prioritize variability in their spending based on usage and to ensure they benefit from the value of the software before making a long-term commitment.

But while business models are moving toward subscription services, if technical developments don't keep pace, a piece of the puzzle will be missing, and it's a crucial piece.

Subscription Services: Beyond Cost, Bringing Innovation as Close to the Market as Possible 

Let's forget for a moment how the software is billed and take a look at another essential aspect: the value that it offers the user. This is where the real challenge is: the ability to provide users with frequent releases (to be as close as possible to current technological innovation and customer demands) becomes a means of differentiating between two commercial subscription offers.

A perpetual license model also allows for software updates. However, the rhythm of updates and the frequency at which they are available to users cannot be compared with the ongoing agility and innovation offered by providers of subscription services. This is not related to how their software is marketed, but to the vendor's intrinsic organization and its ability to establish a continuous cycle of innovation and to transfer this innovation to its customers.

Big Data and Cloud: Continuous Innovation is the Model

The world of Big Data and Cloud is an obvious illustration of the need for continuous innovation. The speed at which these technologies become obsolete requires users to adapt at an unprecedented rate. The platforms adopted by customers today can become obsolete in just a few months (Spark replaced MapReduce in record time; Spark 2.0 is a revolution compared to Spark 1.6). It is essential for the integration, processing, and operations software vendors responsible for these massive volumes of data to get as close to the market as possible. That means complying with key standards such as Hadoop, Spark, and Beam, and aligning themselves with the open source communities defining them. In practical terms, a company needs to anticipate the product roadmaps of these innovative technologies in order to adapt that of its own products. This makes upstream preparation possible for integrating new capabilities, which will benefit end users as soon as the platforms put them into production.

While open source—based on technological openness, community and collaboration between various partners—is particularly well suited to this model, it’s not a secret that the new generation of vendors has developed a specific organization designed to offer continuous updates. And subscription services are a way to finance a policy of continuous innovation. In the former model, with a so-called “proprietary” software solution and a perpetual license format, it took 18 to 24 months to benefit from new features. At the rhythm of machine learning advances, IoT, and real-time and streaming data analysis capabilities, the model that consists of delivering new versions every 18 months is simply not viable for the user.

Supporting the Emergence of New Data Uses

Modern solutions for big data and cloud integration must be at the front lines of technology innovation, not only to address customers' various and rapidly evolving challenges, including innovation, sustainability, agility, and economies of scale, but in particular to encourage the emergence of new data uses—streaming, real-time and self-service—themselves a vehicle for competitive advantage.

The speed at which these platforms become obsolete is palpable. In the past, a technological feature could last for years without risk of becoming obsolete (e.g. SQL). Today, technologies become outdated far more quickly. Competition is fierce between companies using digital transformation as a strategic lever for performance and competitiveness. The result: users of these technologies need the ability to easily adapt from one standard to another. That's why it's so vital to select a vendor that is in line with the times so that you may continuously benefit from market innovation, and of course, easily recognize when it's just a pig in makeup.

The post When It Comes to Big Data and Cloud, Continuous Innovation is the Model appeared first on Talend Real-Time Open Source Data Integration Software.

What’s Blockchain and Can It Help You Trust Your Data?


It first appeared in 2008 with the Bitcoin currency, and this year Blockchain technology reached the summit of Gartner's "Hype Cycle." While many economists and policy actors have expressed their interest in using the technology (e.g. the governments of Honduras, Ghana and Georgia wish to secure their land titles in a Blockchain and, in the private sector, several financial institutions have begun to experiment), concrete, real-world applications are still not commonplace, even though Gartner estimates the market associated with this technology will reach 10 billion dollars by 2022.

Uberizing Uber

However, Blockchain's potential for disruption is unprecedented. By negating the need for a trusted third party to set up a direct relationship between two groups, by ensuring the security of this relationship and by generating an unfalsifiable history (thanks to its distributed character), Blockchain may well contribute to uberizing Uber! A system based on "smart contracts" would, in fact, make it possible to place drivers and customers in direct contact, while securing payment. The collaborative economy, currently dominated by intermediaries, i.e. Blablacar, AirBnB, Drivy, and Uber, would enter into a second phase of disintermediation.

For certain observers, it's only a matter of time. While it took 30 years to go from the first e-mail to the advent of online banking, Big Data needed only ten years to appear at the top of marketing, logistics, and even HR priorities. Digital transformation now draws on Artificial Intelligence and predictive analysis that, only a short while ago, were Hollywood clichés but are now very real and in fact at the heart of the fight against terrorism.

Ensuring traceability of data

One of the primary concerns associated with Big Data resides in its governance. What data do we use? Where do we store it? How can we ensure usage is compliant with the regulations? Who updates it? And so on. The failure of initial projects is often explained by the eternal silos slowing down business agility. But recently, the appearance of "data lakes" has helped to break down those silos, finally giving business users, or even partners and customers, access to the data that is relevant to them in real time.

In the same way, the appearance of Blockchain could make it possible to secure some processes in a Big Data approach (e.g. the authentication and traceability of data). The prospects are endless: in the field of health, first and foremost, where confidentiality issues are tied to personal data, but also in the financial sector, where disintermediation is already in progress yet is still coming up against security and regulation issues. Another prospect is the insurance sector, where Blockchain is giving new momentum to the first peer-to-peer models and establishing the foundations of the automated insurance contract.

Very small businesses and SMBs are also concerned: a library can administer its book loans and subscription fees; a startup can manage its financing, etc.  With the emergence of Smart Cities, consumers can even manage the use and distribution of the electricity produced by their solar panels.

Generating trust

Beyond the business world, society itself can take advantage of this technology (e.g. to secure online voting and set up a framework of trust which would make it possible to multiply direct consultations with citizens, such as in the case of a referendum), thus reinforcing participatory democracy.

Trust is the key word here. According to IDC, around 30% of decision-makers decline to use their company's data due to a lack of trust and governance. With Blockchain, a trust catalyst, the use of data could be considerably amplified. Like artificial intelligence, industry 4.0 and the collaborative economy, it may, as we have seen, bring about major changes, both business and social. And it is organizations that must contribute to it, through their experiments and by discovering new uses that will forge the society of tomorrow.

The post What’s Blockchain and Can It Help You Trust Your Data? appeared first on Talend Real-Time Open Source Data Integration Software.

Using Talend to Gather Data About Data


This article was developed using the free, open source version of Talend Open Studio for Data Integration which is available here.

What's more exciting than data? Data about data!

Recently I had to assess the impact of data model changes within a transactional system feeding our data warehouse.  I directed our SQL scripts and ETL processes to the pre-production environment, ran the ETL jobs, and did some basic regression analysis.  For every table and column, I wanted to know whether there were any significant changes that warranted further investigation:

  • Number of rows
  • Number of nulls per column
  • Min and max length of each column
  • Distinct values per column

The idea was that any large shifts in data volume or null values could indicate something worth investigating.

I could use Oracle dictionary tables like “ALL_TAB_COL_STATISTICS” to get some of this information, but wanted more control over the final output.  The end result needed to be a data set suitable for analysis in Tibco Spotfire.

I built a simple table within the schema to gather some information for each table/column, and then created a simple Talend job.  The job gathers stats using dynamic queries, retrieving table and column names from the Oracle system table, “ALL_TAB_COLUMNS”.  The stats table created to store the final output is below.

dataaboutdata1

I ran the job before and after the ETL change, and then used Spotfire to look for outliers in the metrics.  Below, for example, there is a big difference in the “number of nulls” metric for the “STATE_PROVINCE” column when comparing “before” vs “after”.

This is a manufactured example, using Oracle Express HR database.  The Talend Job was straightforward, the most complicated part is the creation of the dynamic SQL statements.

dataaboutdata2

Step 1: Create the Oracle connection

dataaboutdata3

Step 2: Iterate over the Oracle tables

Use a “tOracleTableList” component to iterate through the tables. I am excluding the STATS table itself (DB_STATS) using a WHERE clause in the component.

This component iterates over the table names.  During each iteration I can grab the table name from the globalMap.  The table name is used to build the dynamic SQL statements:

dataaboutdata4

Step 3: Get the columns names for the current table iteration

Now that we have the table name, we can create a query against the "ALL_TAB_COLUMNS" system table in Oracle.  The query simply returns the column names for the table.  We pull the table name out of the globalMap, courtesy of the "tOracleTableList".

It is not obvious from the screenshot, but there are single quotes and double quotes next to each other. This is a theme for most of the dynamic SQL statements in this article.

dataaboutdata5
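As an aside, the query typed into a tOracleInput component is just a Java string expression, which is where the single-quote/double-quote pairing comes from. A hypothetical version of this step's query is shown below; the globalMap key assumes the list component is named tOracleTableList_1, so adjust it to match your own job.

// Hypothetical content of the tOracleInput "Query" field for this step.
"SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH FROM ALL_TAB_COLUMNS WHERE TABLE_NAME = '"
    + ((String) globalMap.get("tOracleTableList_1_CURRENT_TABLE")) + "'"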

Step 4: Iterate over the column names

Later we will need both the table name AND a column name to create SQL statements, so I added a “tFlowToIterate” component.  This will let me iterate over the column names. The column name for each iteration will get added to the globalMap as a variable.

At any point I can now pull out both the table name and column name from the globalMap. Note I also have access to other fields I am not using in this example, such as “DATA_TYPE” and “DATA_LENGTH” which are available in the “ALL_TAB_COLUMNS” table.

dataaboutdata6

Step 5: Execute the SQL needed to gather the metrics

This may need some explaining.  If, for a particular iteration, the table name is “DEPARTMENTS” and the column name is “DEPARTMENT_NAME”, to gather some basic stats and produce one row of output I need to create a query like this:

dataaboutdata7

We are selecting from the special Oracle table “DUAL” which is a dummy table with one row and one column, which is great for when you need to do this kind of query to produce a single row of output.  You can see above, we have generated some basic stats for the table and column. 

Note:  There is some redundancy here, as for each column I am calculating the total number of rows in the table, and really should do this once per table.

To create this SQL, we have to generate a lot of the values dynamically, by extracting the table and column name from the globalMap variables created earlier in the job.

The next component to add is a “tOracleInput” component.  Here we construct and execute the dynamic SQL above, for each table and column.  Again, we do get into a little bit of “single quote/double quote” hell, but it isn’t too bad.

I am also adding a “Before” or “After” value from the Talend context to indicate if this is a view of the stats before or after we implemented the data model changes.

It isn’t as bad as it looks, we are basically trying to recreate the SQL above by pulling table and column names out of the globalMap:

dataaboutdata8
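For clarity, here is a hedged reconstruction of how that dynamic SQL might be assembled. The globalMap keys and the context.run_label variable are assumptions that depend on how your components and context variables are named; the metrics mirror the ones listed at the start of this article.

// Hypothetical sketch of the dynamic stats SQL built inside the tOracleInput component.
String tableName  = (String) globalMap.get("tOracleTableList_1_CURRENT_TABLE");
String columnName = (String) globalMap.get("row2.COLUMN_NAME");   // set by the tFlowToIterate step
String runLabel   = context.run_label;                            // "Before" or "After"

String query =
      "SELECT '" + runLabel + "' AS RUN_LABEL, "
    + "'" + tableName + "' AS TABLE_NAME, "
    + "'" + columnName + "' AS COLUMN_NAME, "
    + "(SELECT COUNT(*) FROM " + tableName + ") AS NUM_ROWS, "
    + "(SELECT COUNT(*) FROM " + tableName + " WHERE " + columnName + " IS NULL) AS NUM_NULLS, "
    + "(SELECT MIN(LENGTH(" + columnName + ")) FROM " + tableName + ") AS MIN_LENGTH, "
    + "(SELECT MAX(LENGTH(" + columnName + ")) FROM " + tableName + ") AS MAX_LENGTH, "
    + "(SELECT COUNT(DISTINCT " + columnName + ") FROM " + tableName + ") AS DISTINCT_VALUES "
    + "FROM DUAL";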

The schema is straightforward:

dataaboutdata9

Step 6: Congratulate Ourselves 😉

You are now querying a database (Oracle system catalog table), to get data about the database (table and column names), so you can create dynamic SQL on the fly – to get more data about the database (metrics)…

Step 7: Insert the record into the stats table

This is a simple insert of the stats record we just created through dynamic SQL, using a “tOracleOutput” component to write the results into the DB_STATS table:

dataaboutdata10

Step 8: Commit (or rollback)

If the subjob completes without error – we perform an Oracle commit. This is necessary as I didn’t choose auto-commit in my tOracleConnection in step 1:

dataaboutdata11

If you have hundreds of tables, a trillion rows, and lots of columns, then think carefully about pulling the trigger. It did take me a few hours to execute the job against approximately 150 tables (200 million rows across all), but when I put the data into Spotfire it gave me a view of my data which was incredibly useful, and it did pick up some changes I would have missed otherwise.

The changes in the transactional system had in fact included logic changes to several key PL/SQL procedures used during our ETL extractions.

Download >> Talend Open Studio for Data Integration

Disclaimer: All opinions expressed in this article are my own and do not necessarily reflect the position of my employer. 

The post Using Talend to Gather Data About Data appeared first on Talend Real-Time Open Source Data Integration Software.


[VIDEO] Modern Data Management Needs a Governed, Self-Service Approach


IT Pros, have you felt left out of technical projects lately? Yes? You are not the only one. According to the Qlik-CXP 2016 Barometer, 34 percent of IT and BI executives are no longer involved in data-related projects.  Discover our video that illustrates in real life how governed self-service helps IT take more initiative.

It’s generally understood that data is at the heart of the digital transformation. However, much of today’s data lies unused. In fact, IDC estimates that less than 1 percent of the data produced in the world is analyzed. And business users have a lot of progress to make in terms of utilization: according to McKinsey, only 7 percent of employees’  digital potential is currently achieved. According to Blue Hill Research, up to 80 percent of a data analysts’ time is spent preparing data, leaving only 20 percent of time for analysis, which is where they add the most value.

IT, excluded from digital transformation?

In the face of massive expectations, IT does not always provide the right answer despite its traditional dominance in data management: a small number of data specialists design the data models, define access policies, oversee data management, quality and data protection, and monitor data usage.

This centralized governance is not adapted to the new situation: data management is the business of several stakeholders rather than that of a single specialist. In 2016, the Experian Group highlighted the new data experts now involved, such as data analysts (42%), the Chief Data Officer (22%), the Chief Financial Officer (22%) and the Chief Marketing Officer (14%). The same study notes that a CDO is involved in 42% of cases.

These new roles need data and want "everything right away". They want the ability to be autonomous and to take control of data with intuitive tools to collect, optimize, analyze and integrate it in order to extract value. They no longer want to be passive in front of an IT department that feeds them predefined reports. It sounds like self-service, right?

IT, at the heart of digital transformation

But these business users also want to be protected, guided and boosted. For example, an HR user will work more freely on sensitive SAP data that he knows well if IT has given him access only to the data to which he is entitled: time records, but not salaries.

Alternatively, a marketing user will launch his campaign serenely if IT ensures that his campaign’s Salesforce lists of leads and contacts respect the GDPR privacy regulations because they have all given their explicit prior consent.

Alternatively, a department's employees will share data files more confidently if IT validates them, anonymizes confidential information and automates their injection into NetSuite without human manipulation—but all of this creates more work for IT.

It sounds like governance, right?

Governed self-service, a must-have for IT departments

Indeed, a new approach to data governance is emerging: governed self-service, wherein IT frames, protects and transforms the freedom of business users by multiplying their power over the data.

Business users immediately access the reliable data sources they need, then clean, transform, enrich and share them with easy-to-use tools. IT provides self-service access and clean-up of data without compromising compliance. It propagates a single version of the truth across the business, encouraging the exploitation of business expertise where it resides and extending it to the entire enterprise.

A new IT-business collaboration is established on the basis of governed self-service. It avoids the eternal "ping-pong" over datasets between data analysts and IT, and allows IT to take the initiative in its organization and sit in the driver's seat of the digital transformation.

Discover our video that illustrates governed, self-service in real-life situations.

The post [VIDEO] Modern Data Management Needs a Governed, Self-Service Approach appeared first on Talend Real-Time Open Source Data Integration Software.

A First for Apache Beam


At Talend, we like to be first. Back in 2014, we made a bet on Apache Spark for our Talend Data Fabric platform, a bet which paid off beyond our expectations. Since then, most of our competitors have tried to catch up…

Last year we announced that we were joining efforts with Google, Paypal, DataTorrent, dataArtisans and Cloudera to work on Apache Beam, which has since become an Apache Top-Level Project.

On January 23, 2017, we released Winter '17, our latest integration platform, which included Talend Data Preparation on Big Data. In this blog, I'd like to drill a little deeper into the technology and architecture behind it, as well as how we are leveraging Apache Beam for scale and runtime agility.

1) Architecture

   a) Overview

Figure 1 below represents a high-level architecture of Talend Data Preparation Big Data with both the application layer and the backend server side.

You'll notice the Beam JobServer part and more specifically the Beam Compiler (which allows the generation of an Apache Beam pipeline out of the JSON document), as well as the Beam runners, where we specify the set of properties for the target Apache Beam runner (Spark, Flink, Apex or Google Dataflow).

Note that in our Winter '17 version, the only Apache Beam runner we support for the full run is Spark.

 

Figure 1. Talend Data Preparation with Apache Beam runtime

 

b) Workflow

Figure 2. From preparation DSL to Apache Beam pipeline

 

The Beam Compiler is invoked to transform the DSL into an optimized Beam Pipeline where the source, sink, and various actions are defined.

2) Details: What Gets Generated

       a) Preparation DSL: JSON Document

Figure 3. Build your Data Preparation Recipe

As you apply your cleaning and enrichment steps, Talend Data Preparation generates a recipe which then gets transformed into a JSON document.

In the JSON example below, the input is a .csv file stored in HDFS. The file contains only two string columns, and we applied the "uppercase" function to the first column:

{
  "input": {
    "dataset": {
      "format": "CSV",
      "path": "/tmp/input",
      "fieldDelimiter": ";",
      "recordDelimiter": "\n",
      "type": "HdfsDataset",
      "@definitionName": "HdfsDataset"
    },
    "datastore": {
      "@definitionName": "HdfsDatastore",
      "username": "testuser",
      "type": "HdfsDatastore"
    },
    "properties": {
      "type": "HdfsInput"
    }
  },
  "preparation": {
    "name": "sample_prep",
    "rowMetadata": {
      "columns": [
        {
          "id": "0000",
          "name": "a1",
          "type": "string"
        },
        {
          "id": "0001",
          "name": "a2",
          "type": "string"
        }
      ]
    },
    "actions": [
      {
        "action": "uppercase",
        "parameters": {
          "column_id": "0000",
          "scope": "column",
          "column_name": "a1"
        }
      }
    ]
  },
  "output": {
    "dataset": {
      "format": "CSV",
      "path": "/tmp/output",
      "fieldDelimiter": ";",
      "recordDelimiter": "\n",
      "@definitionName": "HdfsDataset",
      "type": "HdfsDataset"
    },
    "datastore": {
      "@definitionName": "HdfsDatastore",
      "username": "testuser",
      "type": "HdfsDatastore"
    },
    "properties": {
      "type": "HdfsOutput"
    }
  },
  "authentication": {
    "principal": "USER@REALM.COM",
    "realm": "REALM.COM",
    "useKeytab": true,
    "keytabPath": "/keytabs/mykeytab.keytab",
    "kinitPassword": "nothing"
  }
}
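To make the DSL concrete, here is a hedged, hand-written Beam pipeline that does roughly what this document describes: read the delimited file, uppercase column 0000, and write the result back out. It is written against the current Beam Java SDK with the Spark runner option for illustration only; it is not the code the Beam Compiler actually generates, and the paths are simply the sample values from the JSON above.

import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class UppercasePrepPipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        options.setRunner(SparkRunner.class);   // Spark is the only runner supported for a full run in Winter '17
        Pipeline p = Pipeline.create(options);
        p.apply("ReadInput", TextIO.read().from("hdfs:///tmp/input"))
         .apply("UppercaseFirstColumn", ParDo.of(new DoFn<String, String>() {
             @ProcessElement
             public void processElement(ProcessContext c) {
                 String[] fields = c.element().split(";", -1);
                 if (fields.length > 0) {
                     fields[0] = fields[0].toUpperCase();   // the "uppercase" action on column 0000
                 }
                 c.output(String.join(";", fields));
             }
         }))
         .apply("WriteOutput", TextIO.write().to("hdfs:///tmp/output"));
        p.run().waitUntilFinish();
    }
}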

b) The Beam Compiler

Below is a snapshot of the code that creates the Apache Beam pipeline based on the various Talend components:

public class RuntimeFlowBeamCompiler {

    public Pipeline compile(RuntimeFlowBeamCompilerContext bcc) {
        RuntimeFlow runtimeFlow = bcc.getRuntimeFlow();
        // Start to create Beam pipeline from RuntimeFlow
        PipelineSpec<RuntimeComponent, RuntimeLink, RuntimePort> pipelineSpec =
                bcc.getPipelineSpec();
        ...
        // Create Beam pipeline to build the job into.
        Pipeline pipeline = Pipeline.create(bcc.getBeamPipelineOptions());
        ...
        RuntimeFlowBeamJobContext ctx = new RuntimeFlowBeamJobContext(pipelineSpec, pipeline, ...);
        // Compile the components in topological order.
        Iterator<RuntimeComponent> components = pipelineSpec.topologicalSort().toIterator();
        components.forEachRemaining(component -> compileComponent(ctx, component));
        // Return the resulting Beam pipeline
        return ctx.getPipeline();
    }
}

 c) The Beam Jobserver

Below is a snapshot of the code that validates and runs the actual Apache Beam pipeline:

public class BeamJobController {

    private RuntimeFlowBeamCompilerContext pipelineSpecContext = null;

    public JobValidation validate(Config config) {
        Config dslConfig = config.getConfig("job");
        String dsl = dslConfig.root().render(renderOpts);
        // Create the compiler context
        pipelineSpecContext = new RuntimeFlowBeamCompilerContext(dsl);
        // Validate before execution
        try {
            pipelineSpecContext.validate();
            return JobValid;
        } catch (...) {
            return SparkJobInvalid(e.getCause().toString());
        }
    }

    public void runJob(Config config) {
        RuntimeFlowBeamCompiler bc = ...
        // pre compilation
        RuntimeFlowBeamCompilerContext optimizedPipelineSpecContext =
                bc.preCompile(pipelineSpecContext);
        // compilation
        Pipeline compiledBeamPipeline = bc.compile(optimizedPipelineSpecContext);
        // post compilation
        Pipeline optimizedCompiledBeamPipeline = bc.postCompile(
                optimizedPipelineSpecContext, compiledBeamPipeline);
        // run Beam Pipeline
        try {
            optimizedCompiledBeamPipeline.run().waitUntilFinish();
        } catch {
            ...
        }
    }
}

Talend Data Preparation is the first Talend Big Data application that leverages the portability and richness of Apache Beam. As we move forward, Apache Beam's footprint will continue to grow as part of Talend's technology strategy, and the backend presented in this blog will be reused by other applications in both batch and streaming contexts, where the essence of Apache Beam and its runners will be used to their full extent. Stay tuned for more information!

 

The post A First for Apache Beam appeared first on Talend Real-Time Open Source Data Integration Software.

How to Use Click Stream Analysis to Optimize your Company’s Social Outreach


In this blog, I’ll be discussing how I expanded the recommendation demo provided in Talend’s Big Data Sandbox to influence my promotional Twitter campaign.

Enterprises are now taking data-oriented approaches when defining their social strategy as they find new and interesting influencers around their business. It is critical to implement plans that utilize this information to optimize social cadence, enabling companies to stay top of mind without blowing their marketing budget. A great example of this is the work that was done over at Molson Coors Brewing Co. By correlating their brand outreach to specific weather conditions, they were able to increase the visibility of their posts by 93% while reducing the cost-per-click by 67% compared to their generic ads.

With these analytical approaches taking hold within some of our customers, I wanted to use Talend to accomplish something similar.

As mentioned, I did a bit of hacking on the recommendation demo that is part of the Talend Sandbox to complete the job below. For those of you who haven’t tried out our big data sandbox, you can find it here.  In the meantime, here’s a little background on the use case.  In the example, Talend is collecting clickstream information in real time from a cycling e-commerce store. This information is then routed through a recommendation engine that scores the likelihood of the visitor purchasing an item. Results are then displayed on a web page for analysis.   

My addition identifies the product grouping (frames, components, misc) that is driving the most interest on the site and highlights it on social media.  Specifically, it creates a tweet that includes a promotional code that could be used to convert visitors from viewers to purchasers. A screenshot of my mapping is below.

The job collects the scored views from the Cassandra table that the previous steps in the demo populate. It filters out lower level scores and then joins the relevant items with their product group. I then average the groups to determine which group had the highest chance of being purchased, sort the information and then find the most relevant group. Finally, I join the main flow with a file containing the preformatted tweets I want to send out.

If I wanted to take the example further, like the Molson Coors use case mentioned above, I could add weather data, locational-based events or additional social feeds to enrich the decision and message of my tweet.

All in all, pretty simple with the most time-consuming part being the Twitter app setup.

Here’s a quick walkthrough on how to do that and set up the tTwitterOutput Component.

Whether you’re sourcing or targeting Twitter, the first thing you need to do is go to https://apps.twitter.com and create an app.

Fill out the application details and go into the application settings to find your keys; you'll need them to set up the tTwitterOutput component.

Click on manage keys and access tokens which will get you the consumer secret key as seen below.

Drop down to the Access token section and generate your tokens.

Once you have the keys, just fill them into the corresponding fields of the tTwitterOutput Component.
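Under the hood, a custom component like this presumably wraps a standard Twitter client. As a point of reference only, posting a status with the Twitter4J library using those same four credentials looks roughly like the sketch below; the keys and tweet text are placeholders.

import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class PromoTweetSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders: paste in the consumer key/secret and access token/secret from your Twitter app.
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("CONSUMER_KEY")
                .setOAuthConsumerSecret("CONSUMER_SECRET")
                .setOAuthAccessToken("ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();
        // The preformatted tweet joined into the flow, e.g. for the winning product group.
        Status status = twitter.updateStatus("Frames are flying off the shelves! Use code FRAME10 for 10% off this week.");
        System.out.println("Posted tweet id " + status.getId());
    }
}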

With that, you’re off to the races creating tweets using Talend.

I hope this gives you an idea or two on things you could do with Talend as well as our big data sandbox. If you make any modifications and want to share what you did, we’d love to hear about it.  Feel free to tweet @Nick_Piette and @Talend about your story. I’m sure there is some swag somewhere around here that I can send your way. 🙂

The post How to Use Click Stream Analysis to Optimize your Company’s Social Outreach appeared first on Talend Real-Time Open Source Data Integration Software.

Unlocking Data Preparation for Business Intelligence (BI)


We live in a world surrounded by data. From our daily grocery shopping, to our mobile phone usage, fitness regime tracker, bank accounts, social media etc., practically everything we do is either driven by or a contributor to data volumes. In this blog I would like to reiterate the importance of data and data preparation in the rapidly growing and demanding data warehousing world. This is only a single use case in a wide variety of applications for data preparation tools in today’s business environments. In later blogs, I’ll also cover some best practices and various potential use cases of Talend Data Preparation tool.

Journey of Data

Let's start with the journey of data. Data has evolved significantly in the last decade. It has grown in size, content, value and state. Today data comes in a variety of shapes, sizes and volumes. It may range from a small sample set to millions, billions or even trillions of pieces of data in varied forms such as text, voice, video, tapes, etc.

For many years, the data warehouse was believed to be static because data didn’t change that often. However, in today’s world data warehouses are either real-time or near real-time, dealing with rapidly changing data. Today businesses are becoming data driven and they are investing heavily in data preparation either with self-driven tools or with their data warehouse.

Importance of Data

Data, at its core, is basically the raw details of transactions/events/statistics/recordings collected for a reason, primarily business improvement, competition or product feedback. This raw data is not necessarily transparent, but it is very important as it provides the foundation for reporting the business metrics and trends used to make crucial decisions or run operations. Having the right data is important for an organization to have insight into the criteria needed to ensure optimal business performance, uncover areas of improvement and drive other key aspects of the business. For example, for an organization like Talend, it is important to measure the number of active clients, expiring licenses, revenue and upsell/downsell from each client, etc. Having accurate data about the health of your business is important in order to make informed decisions and ultimately keep ahead in today's data-driven competitive landscape.

From Data To Insight

Now that we know the importance of data, let’s look at how to convert the raw data into meaningful insights. Data that is in a very raw form is not going to be actionable for a business.  Usually raw data is not in a readable format, has missing values and might have errors or invalid information, etc. Hence it becomes extremely important to put raw data into a consumable format.

Preparing Data for Insights

Data preparation is a process where the raw data undergoes multiple phases. It needs to be assembled/integrated (if coming from multiple sources), cleansed, formatted, organized, completed and checked for accuracy and consistency so it can be analyzed using business intelligence or business analytics programs and be a valid input to the decision support system. The data preparation process also focuses on business users' requirements, improving data quality and completeness, and transforming data into a format that meets their needs.

Let's look at an example use case to get a better understanding. In the diagram given below, the raw data has details pertaining to two movie theaters. It has details like which movie was playing and how many customers bought tickets.

At first glance, this data set doesn't give us any meaningful information. It is not consistent, has typos and is incomplete.  However, once it is prepared it gives us clean data which we can use to determine business performance. Business Intelligence (BI) professionals or business users could take this clean data and derive meaningful information from it, such as which theater has the most customers or which movie had the most ticket sales.  When such analytical data is given to the business, it helps them decide whether they might want to stop playing Movie 1, play Movie 2 for another week, open another screen at theater 2, etc.

Data Preparation for Business Intelligence (BI)

Now that we understand that data preparation is very important for organizations making decisions utilizing data, let’s have a look at the various techniques available for data preparations in the BI world.

  • Manual Data Preparation: Performing manual data preparation using Excel or a similar tool would be too time consuming, error prone and mostly would not work for repeated tasks. This might work perfectly for small data sets, however it wouldn't be appropriate for dealing with large and complex data like video, for instance. Typically, in such scenarios data preparation and analysis would be done by the same person or team, thereby spending more time on preparation and less time on the actual analysis. Eventually manual data preparation turns out to be high in cost with no reusability features and ultimately ends up creating silos.
  • Build a large data warehousing BI team: This team would build a time consuming, sometimes expensive data warehouse. Typically, the team would follow the systems development life cycle (SDLC). End users have to be very thoughtful while giving requirements to the BI team, as any changes in the requirements might affect the outcome. Typically, this approach is inflated and iterative in nature because of support, maintenance and sometimes changing requirements. However, this method would ensure that the data analytics team gets the right input to act on.
  • Use a self-sufficient, governed self-service data preparation tool that anyone can use, like Talend Data Preparation: Using such tools is fast, avoids manual errors and gives the analytics team one roof under which to prepare data in a collaborative and controlled way.

All three methods listed above are widely used; however, the choice of data preparation method depends solely on individual or organizational needs and data availability.

Talend – Data preparation

Talend Data Preparation allows business users and IT to do more with data in less time. Just a few of the activities the Talend Data Preparation tool supports are:

  • Data Discovery
  • Data Cleansing
  • Data Visualization


The post Unlocking Data Preparation for Business Intelligence (BI) appeared first on Talend Real-Time Open Source Data Integration Software.

Data Matching 101: What Tools Does Talend Have?


This blog is the second part of a three-part series looking at Data Matching. In the first part, we looked at the theory behind data matching. In this second part, we will look at the tools Talend provides in its suite to enable you to do Data Matching, and how the theory is put into practice.

If you remember, we discussed how you match by first blocking data into similar groups, things that are unlikely to change, and then match within those groups using various matching parameters. This is basically what the Data Matching Components in the Talend suite do.

To start, we will look at the tools available in the Talend Data Quality (TDQ) toolset. TDQ provides components grouped into various sections; the one of interest to us is the Matching section. We will look at the major components individually. In using the components, we are again doing just what we described in the first blog in this series: we are choosing features that are unlikely to change (blocking) and then matching within those features (blocks).

tFirstnameMatch is a component that checks first names against an index file embedded in the component itself. This component searches through first names in the index file according to the input gender and input country you specify in the component settings. The index file has reference first names for about 162 countries, and some of the countries listed in the index have a huge number of reference first names. It also contains a 'Fuzzy' search option, which allows approximate matches.

tFuzzyJoin joins two tables by doing a 'fuzzy' match on several columns. It compares columns from the main flow with reference columns from the lookup flow and outputs the main flow data and/or the rejected data. This component allows you to define a matching type column and select from a list the method to be used to check the incoming data against the reference data. The types of matches available are Exact Match, Metaphone, Double Metaphone and Levenshtein. Exact Match is self-explanatory; Metaphone and Double Metaphone are based on a phonetic algorithm for indexing entries by their pronunciation. Levenshtein is more involved: it measures the distance between two words as the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. You can set minimum and maximum distances, where the distance is the number of character changes needed for the entry to fully match the reference. For example, if you set the minimum distance to 0 and the maximum distance to 2, the component will output all entries that match exactly or that differ by at most two character changes. There are many other methods available.
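To make the distance idea concrete, here is a small standalone Levenshtein implementation; the example names are arbitrary.

public class LevenshteinExample {

    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,        // deletion
                                            d[i][j - 1] + 1),       // insertion
                                   d[i - 1][j - 1] + cost);         // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Jon", "John"));     // 1: one insertion, so it passes a maximum distance of 2
        System.out.println(distance("Jon", "Joanne"));   // 3: too far away, so it would be rejected
    }
}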

Download >> Talend Open Studio for Data Integration

tMatchGroup is a component that compares columns in both standard input data flows and in MapReduce input data flows by using matching methods, and groups similar encountered duplicates together. Several tMatchGroup components can be used sequentially to match data against different blocking keys. This refines the groups received by each of the tMatchGroup components by creating different data partitions that overlap the previous data blocks, and so on. When defining a group, the first processed record of each group becomes the master record of the group. The distance of every other record from the master record is computed, and each record is then assigned to the appropriate master record accordingly. The matching algorithms used by this component are Exact Match, Soundex, Metaphone, Double Metaphone, Levenshtein, Jaro (which matches processed entries according to spelling deviations by counting the number of matched characters between two strings; the higher the value, the more similar the strings are), Jaro-Winkler (a variant of Jaro that gives more importance to the beginning of the string), Fingerprint key (which matches entries after normalizing them, for example by removing whitespace and control characters), q-grams (which matches processed entries by dividing strings into letter blocks of length q in order to create a number of q-length grams; the matching result is given as the number of q-gram matches over possible q-grams), and finally Custom. Custom matching enables you to load an external matching algorithm from a Java library using the Custom Matcher column.

tFuzzyMatch is a component which compares a column from the main flow with a reference column from the lookup flow and outputs the main flow data displaying the ‘distance’ between them. In this component, the Match Types are Metaphone, Double Metaphone, and Levenshtein.

tFuzzyUniqRow is a component which compares columns in the input flow by using a defined matching method and collects the encountered duplicates. In this component the Match Types are Exact, Metaphone, Double Metaphone, and Levenshtein.

tGenKey is a Big Data component that generates a functional key from the input columns by applying different types of algorithms on each column and grouping the computed results into one key. It outputs this key together with the input columns.  This component helps narrow down your data filtering and matching results using the generated functional key. It can be used as a Standard component or in MapReduce jobs. The algorithms can format data as well as perform matching, such as Exact, Soundex (a simple phonetic match based on the removal of vowels), Metaphone, Double Metaphone and ColognePhonetic, a Soundex-based phonetic algorithm optimized for the German language that encodes a string into a Cologne phonetic value. This code represents the character string that will be included in the functional key.

These are the major components that can do Matching; there are others, but they follow the same methods and use the same Match Types and algorithms as above.

In the final blog in this series, we will look at how to tune your matching to obtain the best possible matching with your data.

Download >> Talend Open Studio for Data Integration

The post Data Matching 101: What Tools Does Talend Have? appeared first on Talend Real-Time Open Source Data Integration Software.
