
Talend Cloud on Azure is Coming! Create A Faster, More Connected Cloud


Today during Microsoft Build 2019, we announced that Talend Cloud, our Integration Platform as a Service (iPaaS), would soon be available on Microsoft Azure starting in Q3 2019. For those who have selected Azure as their cloud platform of choice, Talend Cloud running natively on Azure will provide enhanced connectivity options for modern data integration needs as well as better performance.

Cloud services are already (or will soon be) a critical piece of every organization’s digital business strategy. By moving to the cloud, companies are reaping benefits such as faster provisioning, quicker time to market, flexibility and agility, instant scalability, and reduced overall IT and business costs, to name a few. Making the cloud data migration and integration experience simple is key to delivering trusted data at speed.

Let’s dive into a few more benefits of Talend Cloud on Microsoft Azure.

Accelerating Cloud Migration

Before we talk about building end-to-end integration, let’s first talk about getting your data to the cloud. Data migration into the cloud is perhaps one of the bigger hurdles to cloud adoption. If you’ve selected Azure as your cloud of choice, having a scalable iPaaS running on Azure is a must to get data into your cloud and start maximizing your investment quickly.

End-to-End Integration on Azure Cloud

What does Talend Cloud running natively on Azure mean for you? Put simply, a faster and easier integration experience. Whether you are loading data into Azure SQL Data Warehouse or running it through HDInsight, Talend Cloud running natively on Azure helps you create an end-to-end data experience faster.

Faster Time to Analytics and Business Transformation

Customers who use Talend Cloud services natively on Azure will experience faster extract, transform, and load times regardless of the data volume. Running natively on Azure also boosts performance for customers using Azure services such as Azure SQL Data Warehouse, Azure Data Lake Store, Cosmos DB, HDInsight, Azure Databricks, and more.

Reduced Compliance and Operational Risks

Because the new data infrastructure offers an instance of Talend Cloud that is deployed on Azure, companies can maintain higher standards regarding their data stewardship, data privacy, and operational best practices.

What’s Next?

  • If you are a Talend customer, keep an eye out for the announcement of the general availability date of Talend Cloud on Azure in Q3 2019.
  • Not a current Talend Cloud customer? Test drive Talend Cloud free of charge or learn how Talend Cloud can help you connect your data from 900+ data sources to deliver big data cloud analytics instantly.
  • See three real-world use cases (data architectures included) of companies using Talend Cloud and Azure today by downloading this white paper.

The post Talend Cloud on Azure is Coming! Create A Faster, More Connected Cloud appeared first on Talend Real-Time Open Source Data Integration Software.


The Fundamentals of Data Governance – Part 2


Introduction

In part 1 of my post on data governance fundamentals, I introduced the “5 Ws and 1 H” of problem-solving—“What”, “Why”, “Who”, “When”, “Where”, and “How”—and applied the first three to data governance. This part covers how you can apply the last three and suggests some next steps. Let’s get started!

The “When” of Data Governance

In 2001, Doug Laney, then an analyst at META Group (later acquired by Gartner), defined the now-ubiquitous “3-Vs” of data: volume, velocity, and variety. Laney’s original context was e-commerce, but most discussions of the 3-Vs today are within the milieu of big data. Laney argued that increases in each of the Vs necessitate changes in how data is managed. Since he wrote his paper, other “Vs” have been added, such as value and veracity. I’ve included veracity since it speaks to data quality.

When organizations reach a threshold in data volumes and varieties where existing ad hoc, reactive methods no longer work, a more disciplined approach to data becomes necessary. That disciplined approach is data governance.

Volumes and varieties grow as companies gain market share, acquire companies, or offer additional products and services. In this way, a company’s success may compel a governance practice. As to veracity—or truth—data quality can also be a driver of data governance. As more data is generated and combined, there is more opportunity for data quality issues to arise. If a lot of people within your organization are dedicated to fixing data, that’s a sure sign that data governance is warranted. The right time to start governing your data is likely now.

The “Where” of Data Governance

There are many opinions as to who should own data governance and where it should reside. Some are:

  1. The office of the Chief Data Officer: After all, data strategy is their remit.
  2. The business: Having the business own data governance necessitates that they take an active role in it and makes it more strategically aligned. But which “business”? Compliance?
  3. IT: Data Governance typically originates with them, and they’ll likely administer its enabling technologies, to say nothing of the systems and data stores they already run. Maybe they should do it?
  4. It doesn’t matter: Establishing who does “what” is more important than “where” as data governance can succeed irrespective of who manages it.

The best answer, though, is that once data governance is in steady state, the data governance “organization” should be a federation of business and IT personnel ubiquitous throughout the organization with no single owner. Consider a fabric metaphor: To make fabric, you need two threads—the warp and the weft.

In like fashion (see what I did there?), the best kind of data governance is the one woven into the organization chart. You won’t see a data governance department on there, nor will you see human resources titles like “data steward.” Data governance is most successful when its functions are put into place within the existing organizational hierarchy, as an overlay on people’s “regular” jobs. As the DMBoK puts it, data governance should be embedded within existing organizational practices.

<<ebook: Download our full Definitive Guide to Data Governance>>

The “How” of Data Governance

Data governance exists at the intersection of people, processes, and technologies (fig. 2). As I said earlier, governance is not achieved through technology alone (but technology is critical to its success).

Figure 2

People

There are two principal roles in data governance: stewards and owners. Data stewards are responsible and accountable for data, particularly its control and use as they pertain to the data’s fitness for its intended purpose(s). Again, data steward isn’t a position on an organization chart. Rather, it’s a function people perform as part of their daily work, with stewards assigned based on a person’s existing relationship to data. Further, there might be different levels of stewards, for example, domain data stewards. I can’t say much beyond that, as the details of stewardship are highly specific to the organization.

Data owners are subject matter experts and approve what data stewards do. By subject, I mean data subject areas, e.g., customer, product, loan, location: the things against which transactions are executed. Another term sometimes used for a data owner is data custodian.

Finally, there’s the data governance council, which may sound ominous but needn’t be. The council consists of stewards, owners, and IT staff who meet regularly to discuss and resolve escalated data issues. Think of it as governance’s “supreme court.”

The roles just described together form the data governance organization framework (fig. 3).

Figure 3

Its hierarchy isn’t an aggregation but consists of escalation paths for approvals in cases where consensus can’t be reached at lower levels. Let me hasten to add that the DMBoK estimates that 80-85% of data governance issues can be resolved at these lower levels, with the council needing to arbitrate only around 5% of all issues.

There are many ways to layer the framework roles; I’ve just shown a particularly flexible one. Often, the leaf levels are lines of business (LoBs).

Federation is a key component of the data governance organization framework. According to Ladley, federation describes the extent to which data governance permeates a given subject area. It is the means by which you blend and stratify the various governance roles and functions across the organization. This matters because some aspects of some subjects, for example the creation of a new product entity, will likely be more centrally controlled than others. Greater central control might also be justified by an organization’s relative lack of data management maturity.

Processes

When I say “processes” I’m using the term rather loosely as a catch-all for things people need to codify and document. These “things” are, in fact, principles, policies, and standards (fig. 4).

Referring again to Ladley:

  • Principles are statements that guide conduct; they are the foundation of the other two and of the behaviors all three are meant to guide. An example of a principle might be “Data is an asset that needs to be formally managed.”
  • Policies are a type of process. If principles answer “why”, then policies address “how”. The DMBoK defines policies as “codify[ing] requirements by describing…guidelines for action.” Policies operationalize principles and are enforceable (i.e., following them is required).
  • Standards, a kind of policy, establish norms or criteria against which to be evaluated, such as a business glossary term definition standard.

Figure 4

Having established roles and processes, the next step is to map the two. A RACI matrix is ideal for this (fig. 5). The letters stand for responsible, accountable, consulted, and informed: the responsible role owns the process, the accountable role signs off on the work, consulted roles provide the input needed to do the work, and informed roles are notified of the results.

Figure 5

Tools and Technology

I’ve mentioned on several occasions that data governance cannot be achieved through technology alone. A governance program is often precipitated by a sponsoring effort that has a significant technology component in the form of data quality and/or metadata. These two technology suites are not only complementary but are instrumental in furthering data governance’s goals. (Remember, part of control—a key aspect of DG—is monitoring.) Talend has best-in-class offerings in both of these spaces, which I’ll be covering in my future sessions.

Making the Case 

Recall the principal determinants of business value: revenue, cost, and risk exposure. When making the case for governance, keep these in mind. Having said that, though, don’t just say "we should govern our data because doing so will increase revenue, reduce costs, and mitigate risk" (although it will 🙂) and expect to leave it at that. What’s critical to making the case for governance is tying it to business goals. Data governance is intended to be an enabling function, and you may recall that Talend’s definition of data governance includes the phrase "enabling an organization to achieve its goals."

A good way to begin tying governance to business goals is to do a gap analysis. Seiner suggests keeping in mind the question "What can’t the organization do with the data it has now?" when looking for gaps. Ladley echoes this by recommending a business alignment exercise (a good next step), which links business processes to data requirements. If you can tie data issues to business needs, and cash flows to those business needs, then transitively you can put a number on the business value added by governance.

Conclusion

Successful governance is achievable only if an organization is committed to changing its data management behaviors. Once this commitment is made, the thoughtful orchestration of people, business processes, and tools operationalizes the new behaviors. This two-part blog introduced the fundamentals of data governance by addressing what it is, why do it, who should do it, when it should be done, where it should live, and how to do it. Thank you for reading.

References

DAMA International. DAMA-DMBOK: Data Management Body of Knowledge. 2nd ed. Basking Ridge: Technics Publications, 2017.

Ladley, John. Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program. Waltham: Morgan Kaufmann, 2012.

Sarsfield, Steve. The Data Governance Imperative. Cambridgeshire: IT Governance Publishing, 2009.

Seiner, Robert. Non-Invasive Data Governance. Basking Ridge: Technics Publications, 2014.

Talend. The Definitive Guide to Data Governance. https://www.talend.com/resources/definitive-guide-data-governance/.

The post The Fundamentals of Data Governance – Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

How Social Media Network XING is Connecting Systems for Better Business Networking


With more than 15 million members, XING is the leading online business network for professional contacts in German-speaking countries. XING has an advantage in providing local news and job information.

Because the services XING provides are based on acquiring, storing and manipulating high-quality, timely data, it’s essential for the company to be able to efficiently integrate the many sources feeding data into XING’s systems.

But XING’s legacy systems required it to script its data integration, which in turn made it hard to keep track of where data sets were generated. The file format also presented a challenge: the data was in Apache Avro format, which traditional data processing tools do not usually support.

Handling a vast amount of data in a time-pressured event-streaming environment

XING evaluated several potential solutions for its data integration needs and selected Talend for multiple reasons: its open source approach, its wide range of connectors, its capabilities for metadata management and automated documentation, its fast adoption of emerging technologies, and its ability to enable the implementation of new use cases.

“Data analysis is key for the success of an online network. Talend helps us find in real-time the signals from our data to support decision-making process for a superior user experience.” Mustafa Engin Soezer, Senior Business Intelligence and Big Data Architect, XING SE

XING now uses Talend as the bridge between a 150TB on-premises MapR-DB NoSQL database and a 60TB Exasol database used for analytics.

Connecting professionals to make them more productive and successful

Key benefits of XING’s new integration architecture include a better understanding of the business, now that data is consolidated on one platform, and the ability to run the analytics and reports that support better decision-making more efficiently.

“In addition,” says Soezer, “maintenance costs are reduced and productivity and efficiency have increased.” “Talend is helping us find insights and measure performance against KPIs,” says Soezer. “ For example, we can now more quickly and accurately analyze data and extract metrics and KPIs that are used to drive business strategies all across XING. We also have better statistics on the number of daily and weekly active users, new job postings, number of users who clicked on specific jobs, and more.”

More than 15 million users entrust their personal data to XING. XING therefore has a special responsibility to its members, who all expect the social network to keep their data safe and to handle sensitive information confidentially. Talend is also helping XING adhere to strict standards of corporate governance, data protection, and GDPR compliance.

“Online business networking is based on trust,” says Soezer. “So it’s critical for compliance to centralize and track metadata,“ “With Talend, we’re centralizing all source and target systems, and can analyze data to determine which one is relevant to what requirement. We can determine if data is private or not. And we can take full control of our data as well as metadata.“

The post How Social Media Network XING is Connecting Systems for Better Business Networking appeared first on Talend Real-Time Open Source Data Integration Software.

How to Query a Redshift Table With Talend Cloud


Talend Cloud enables the use of several prebuilt connectors and components for different services running on cloud platforms like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform.

This article explores a use case of querying an Amazon Redshift table as part of a Talend Job developed in Talend Studio and executed as a task on Talend Cloud. The data used is a sample of orders placed by customers. The requirement is to (1) fetch the order quantity of each customer from a table on Redshift and (2) use a lookup of state codes (a CSV file containing only state ID and state name columns) to identify the state name for each customer order. A Talend Cloud task is then executed to accomplish this.

Preparation for the use case:

For the implementation of the use case, a demo environment of Amazon Redshift has been prepared. As part of this preparation, the steps followed are:

  1. Creating a Redshift cluster (single node used here)
  2. Creating a table ‘dbinfo’ with columns for: customer ID, customer first name, customer last name, state ID (as part of customer address), order number, order quantity. Refer to the image below of the ‘Query editor’ for Redshift on AWS console.

The table is created in the public schema. When the above CREATE TABLE statement succeeds, the table appears in the list; refer to the screen capture below. Note that no state names are available as part of the data on Redshift; only state IDs are stored there.

Query Redshift Table with Talend

3. To populate the table with sample data, the sample CSV available in S3 is used. Run the COPY command shown in the screen capture below.

4. Verify that the sample data has been populated. (A scripted sketch of these preparation steps is shown below.)
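
If you prefer to script these preparation steps rather than run them in the Query editor, a minimal JDBC sketch is shown below. The cluster endpoint, credentials, column names, S3 path, and IAM role are placeholders rather than values from this article, and the Amazon Redshift JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RedshiftPrep {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint, database, and credentials
        String url = "jdbc:redshift://example-cluster.abc123xyz.us-east-1.redshift.amazonaws.com:5439/dev";
        try (Connection con = DriverManager.getConnection(url, "awsuser", "password");
             Statement stmt = con.createStatement()) {

            // Step 2: create the 'dbinfo' table (column names are assumed from the description above)
            stmt.executeUpdate(
                "CREATE TABLE public.dbinfo ("
              + " customer_id INT, firstname VARCHAR(50), lastname VARCHAR(50),"
              + " stateid INT, ordernum INT, quantity INT)");

            // Step 3: load the sample CSV from S3 with the COPY command (bucket and IAM role are placeholders)
            stmt.executeUpdate(
                "COPY public.dbinfo FROM 's3://my-sample-bucket/orders.csv'"
              + " IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'"
              + " FORMAT AS CSV IGNOREHEADER 1");

            // Step 4: verify the sample data has been populated
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM public.dbinfo");
            if (rs.next()) {
                System.out.println("Rows loaded: " + rs.getLong(1));
            }
        }
    }
}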

 

These preparation steps are part of the demonstration for this article.

In a real-world scenario, the use case could be a larger extension of this demo that requires you to do further complex analysis/querying on one or multiple tables populated in Redshift.

Talend job for querying Redshift table:

A Talend standard Job has prebuilt components to connect to Amazon Redshift and to fetch data from Redshift. The sample job created to demonstrate the use case here looks like the image below.

Talend Studio, available with Talend Cloud Real-Time Big Data Platform version 7.1.1, is used to develop this sample job.

Steps to create this job include:

  1. Create a DB connection in the Talend Studio metadata repository.
  2. Either drag the connection definition from the repository into the designer and select the tRedshiftConnection component when prompted, or use tRedshiftConnection from the Palette and enter the Redshift cluster, database, and table information manually. The element named ‘blog_redshift’ in the image above is the tRedshiftConnection component.
  3. Create a new subjob starting with the tRedshiftInput component. The element named ‘dbinfo’ is the tRedshiftInput component. The configuration for this component looks like the image below.

Using ‘Guess Query’ populates the ‘Query’ property with a SELECT statement, as displayed in the image. The demo here uses the default query thus populated. The query could be edited as needed; for example, it could be trimmed to fetch only the necessary columns, ‘quantity’ and ‘stateid’.
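
For illustration, assuming the table and column names used in the sketch above, the trimmed-down contents of the tRedshiftInput ‘Query’ property might look like the following (in Talend Studio this property holds a Java string expression):

// Contents of the tRedshiftInput "Query" property, reduced to the two columns the Job needs
"SELECT quantity, stateid FROM public.dbinfo"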

The tMap component combines the Redshift table and CSV data and filters down to the necessary columns: in this use case, ‘quantity’ from the Redshift table and ‘statename’ from the lookup CSV file.

  4. tLogRow is used to output the two columns, one from Redshift and the other from the CSV, after joining both inputs on their respective state ID columns. Again, in a real-world scenario, this part of the Talend Job could include whatever complex logic the required analysis demands.
  5. It is good practice to close any connections created as part of a Talend Job; tRedshiftClose is used to close the connection created by tRedshiftConnection.

Thus, the Job that implements the requirements of the use case is complete and ready to be deployed to Talend Cloud, where it will be executed as a task.

Create a task in Talend Cloud to run the job with Redshift query:

To deploy the Talend Job to Talend Cloud, right-click the Job in the repository and click the ‘Publish to Cloud’ option. Refer to the image below.

The Publish to Cloud option requires you to select a workspace, where the Job will be deployed as an artifact. A version number is associated with the artifact; every publish automatically increments the version.

The Talend Job gets deployed as an artifact, and a task is automatically created. Both the artifact and the task are available under the ‘Management’ left-hand menu of Talend Management Console (TMC). The image below displays the workspace ‘Personal’ tile under the ‘Default’ environment, which contains links to the artifacts list and the tasks list.

Artifacts list:

Tasks and plans list:

 

Note: A plan is the step-by-step execution of multiple tasks depending on specified conditions. Each step in a plan is associated with one task.

Click on the task to edit it, and use the pencil icon within the Configuration section, as highlighted with a green box in the image below.

Editing a task includes selecting the artifact for the task (pre-populated here) and specifying go-live attributes:

Clicking the Go Live button executes the task based on the run type.

The runtime here allows the use of a Cloud Engine or a Remote Engine. In this demo, a pre-defined Remote Engine called ‘demo-csm-re’ is used. This article on architecture in the Talend Community provides details regarding Cloud Engines and Remote Engines.

The Remote Engines are listed under the ‘Engines’ left-hand menu of the Talend Management Console in Talend Cloud. Refer to the image below.

Clicking on View Logs:

Conclusion

Talend Cloud makes it easier to integrate data from different kinds of sources, such as other cloud platforms, SaaS applications, or on-premises systems, and it empowers users to perform complex transformations and/or analysis on the integrated data. Give Talend Cloud a try today.

The post How to Query a Redshift Table With Talend Cloud appeared first on Talend Real-Time Open Source Data Integration Software.

Join Us At Snowflake Summit


For those of you in the Snowflake world, I am sure you have heard about Snowflake Summit. This is Snowflake’s inaugural customer conference, and not surprisingly it is a sold-out show.

With over 2,000 attendees from over 700 companies from all reaches of the world, Talend gets to talk about the hundreds of customers that are using Talend to ingest, transform, improve, and govern the data going into their Snowflake data warehouses.

 

 

Some of the sessions where you will be able to learn more include:

Accelerate and Scale Your Cloud Analytics

Brought to you by Talend

Building a data warehouse has never been easier with cloud-native data warehouses like Snowflake. As a longtime partner, Talend has been providing tools to ingest, transform, cleanse, and govern data in Snowflake to provide trusted data at the speed and scale for business. This session will review how Talend can quickly get your initial data warehouse project up and running with the data needed as well as how to scale that from a departmental solution to an enterprise data warehouse.

SPEAKER
Jake Stein
Senior VP, Stitch

iDirect: How To Democratize Modern Data Analytics for All Your Users

Brought to you by Talend

iDirect, one of the world’s largest manufacturers of satellite network and communications hardware, wanted to enable big data analytics in the cloud. By deploying a joint solution from Datalytyx, Talend, and Snowflake, iDirect business users, including sales and marketing, now have direct access to relevant data analytics and data science capabilities. Any business user can manipulate data and spot business opportunities without a background in statistics or technology, and without incurring additional IT costs or needing dedicated resources.

SPEAKER
Guy Adams
CTO of Datalytyx

 

Modern Reusable Data Load Accelerator for Snowflake using Talend

Brought to you by Slalom

The world is moving towards Data Discovery, Analysis, Machine learning and Predictive Analytics. However, a constraint to perform data analysis is the need to load and maintain your data efficiently, requiring extensive efforts and resources. Slalom’s Data Load Accelerator provides you a Talend & Snowflake based re-usable Data Ingestion framework that productionalizes datasets in days instead of weeks/months!

SPEAKER
Ricky Sharma
Solution Architect, Slalom

Soumya Ghosh
Solution Principal, Slalom

Salesforce = Snowforce … Cleaned Salesforce Data with Stitch and Snowflake

Brought to you by Trianz

Doing sales forecasting out of Salesforce was problematic due to manually entered data. Learn how Salesforce used a combination of Stitch, Looker, and Snowflake as its data provider to fix data inconsistencies, incorporate other data, and increase the accuracy of its sales forecasts.

SPEAKER
Andrew Crider
Director of Analytics, Trianz

 

 

Talend will be showcasing three types of integration:

  • Fast Ingest with Talend’s Stitch Data Loader

 

  • Pipeline Design and Data Quality for BI Analysts

  • Data integration and quality for Data Engineers

Talend is the only data integration company that spans the full range of needs and user audiences, from getting started quickly with your Snowflake data warehouse to scaling it to meet your enterprise and data governance needs.

Come see us at booth P2 on the main expo floor to get a live demo and talk with our technical experts! Also, we are hosting an exclusive networking reception. Stop by the booth to get your invitation.

 

The post Join Us At Snowflake Summit appeared first on Talend Real-Time Open Source Data Integration Software.

4 Best Practices For Utilizing Talend Data Catalog in Your ETL/ELT Processes


Introduction

Talend Data Catalog provides intelligent data discovery that delivers a single source of trusted data in a centralized data catalog. It also provides the capability to do impact analysis and/or trace lineage by harvesting Talend data integration Jobs. For example, you can trace the use of a specific attribute or column from the source to the destination of the data flow within the scope of a Talend data integration Job.

This blog post explores methods for designing Talend data integration Jobs that help maximize the benefits of using Talend Data Catalog, focusing primarily on a file system (Windows) and relational databases as either the data source or the target storage. Please note, however, that Talend Data Catalog supports harvesting data from a variety of sources, not just file systems and relational databases.

If you’re interested in basic guidelines on how to develop better Talend data integration Jobs, Dale Anderson has written a four-part series on “Talend Job Design Patterns and Best Practices.” This blog post offers some guidelines that are particularly relevant when it is necessary to harvest DI Jobs in Talend Data Catalog.

Note: This article uses Talend Studio 7.1.1 and Talend Data Catalog version 7.1 

Best practice #1: Using the “Repository” Property Type instead of “Built-In”

For most of the components in Talend Studio, there is a Property Type attribute that determines where the definition of a data input’s physical source lives. It can be defined as part of the component (‘Built-In’) or as part of the metadata repository (‘Repository’; the repository makes the data input reusable in other DI Jobs). For Talend Data Catalog, I suggest configuring the Property Type as ‘Repository’ for better results. Let’s take the example of the tMap component. The sample Job below is a very simple Job that stores data from a CSV file into a MySQL database.

Data Catalog harvesting ETL job

tMap is expanded here:

Ensure that the tFileInputDelimited and tDBOutput components have the Property Type defined as ‘Repository’ (as highlighted in the image below). This is necessary for a correct lineage or impact trace.

Data Catalog Harvesting ETL jobs

Below is an example of what this would look like in Talend Data Catalog. 

Below is a view of the data flow to and from the Talend DI Job, with the connections from the data source and data storage (target) models highlighted.

Lastly, let’s look at a working data impact analysis of the column ‘lastname’ getting stored as ‘customername’ due to the tMap configuration in the Talend DI Job, highlighting the tMap concatenation function.

Best practice #2: Using data mapping specifications in custom code components like tJavaRow 

Talend data integration components like tJava or tJavaRow allow users to write custom code to implement whatever logic is needed. In Talend Data Catalog, this custom code can either break lineage or produce everything-to-everything dependencies (a Cartesian product). To avoid such consequences, it is recommended to create data mapping specifications in the Documentation section of the component. Here is an example explaining the recommended process, with a sample Job that uses tJavaRow for a simple conditional check on the quantity ordered by a customer: if the quantity for any row is less than 2, a null value is assigned (as such quantities are considered negligible in the given use case). A sketch of what that custom code might look like follows.
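
As a rough illustration, and assuming hypothetical column names, the custom code inside the tJavaRow might look like the following. Here input_row and output_row are the standard tJavaRow variables, and the quantity column is assumed to be a nullable type such as Integer so that the null assignment compiles.

// tJavaRow body (a sketch, not the article's exact code): pass columns through unchanged
output_row.customerid = input_row.customerid;
output_row.ordernum   = input_row.ordernum;

// Quantities below 2 are considered negligible in this use case and are replaced with null
output_row.quantity = (input_row.quantity != null && input_row.quantity < 2)
        ? null
        : input_row.quantity;

Talend Data Catalog cannot infer column-level mappings from this Java code on its own, which is exactly why the data mapping specification in the component’s Documentation section matters.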

If the data mappings are not included, the results depend on the custom code: either lineage could break or a Cartesian product of mappings could form. The image below depicts the possible scenarios when mapping specifications are missing.

Let’s understand the Cartesian product with an example. Consider a Job without mappings, just custom code for the example use case.

For this Job, leaving the mappings undefined (and not all columns are used for processing) results in a Cartesian product. See the image below.

That, in turn, breaks lineage.

You can add the data mapping specification to the Documentation section of the component, as highlighted in the image below. 

This solves the lineage issue; refer to the images below.

 

Best practice #3: Using context variables for dynamically generated SQL queries 

It is a very common practice to use string concatenation when forming SQL queries programmatically. Talend DI Jobs allow such string concatenations in SQL-related components like tDBRow and in custom code components like tJavaRow. Depending on the component where the concatenation is done to form a SQL query, this can create difficulties for tracing lineage. The solution here is to use context variables for the dynamic part of the SQL query within the SQL components, instead of custom code. To understand this scenario better, let’s use an example Talend integration Job. The first Job below is a simple Job with only a tDBInput component, using a context variable defined for the table name; refer to the image below.

And the data lineage works well without issues. 
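
For illustration, the ‘Query’ property of that tDBInput might contain an expression like the one below. The context variable and column names are assumptions; context.tablename is simply how a context variable is referenced in a Talend Java expression.

// tDBInput "Query" property: only the table name is dynamic, supplied by a context variable
"SELECT customerid, quantity, stateid FROM " + context.tablename

Keeping the dynamic part inside the SQL component, rather than building the whole statement in custom code, is what preserves the lineage here.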

The second sample Job contains a tJavaRow where the SQL query is generated using concatenation; refer to the image below. It uses the same context variable for the table name as the first Job.

The tDBRow component then takes the tJavaRow’s output column ‘sql’ and uses it as its query. Refer to the image below:

Though the data flow depicts links throughout, the data lineage breaks. Refer to the images below:

Best practice #4: Using files instead of lists for iterating through lists of datasets or data inputs 

Consider a use case where a set of SQL queries needs to be executed as part of a Talend data integration Job. Such a situation requires iterating through the list of SQL queries, either listed manually (for example, using the tFixedFlowInput component) or supplied by an external template. This can break lineage because Talend Data Catalog cannot access the SQL queries. A list of datasets presents a similar problem.

The recommended approach here is to use tFileOutput components to save the individual SQL statements in separate SQL script files. You then harvest those SQL scripts to trace lineage or do impact analysis.

Let us walk through an example with a set of SQL update statements defined in tFixedFlowInput and executed using tDBRow. Notice the flow of the Job in the images below. There is no data input/output to the Talend DI Job here; this Job is intended only to execute a set of SQL statements. For this kind of scenario, it is recommended to store the SQL statements in .sql files and harvest the resulting SQL scripts for lineage.
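
In Studio this is done with tFileOutput components, but the underlying idea is simple enough to sketch in plain Java: each statement the Job would execute is also written out as a .sql script that Talend Data Catalog can harvest. The statements and file path below are hypothetical.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SqlScriptWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical statements standing in for the rows defined in tFixedFlowInput
        List<String> statements = Arrays.asList(
            "UPDATE public.dbinfo SET quantity = 0 WHERE quantity IS NULL;",
            "UPDATE public.dbinfo SET stateid = -1 WHERE stateid IS NULL;");

        // Write each statement into its own .sql script so it can be harvested for lineage
        for (int i = 0; i < statements.size(); i++) {
            Files.write(Paths.get("/tmp/update_dbinfo_" + i + ".sql"),
                        statements.get(i).getBytes(StandardCharsets.UTF_8));
        }
    }
}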

Note that the tDBRow is deactivated in the images below. It is not necessary to have an activated component while harvesting the DI Job in Talend Data Catalog, though it is needed for successful DI Job execution. 

There are different bridges supported by Talend Data Catalog to harvest SQL scripts written for a specific database like Teradata, Microsoft SQL Server, etc. We will use the Oracle bridge for the MySQL database used in this example. Refer to the images below with the import settings: 

Best Practices for Data Integration Jobs with Talend Data Catalog

 

Though a DI Job without the .sql file would execute the SQL statements (and implement a use case), it would be difficult to trace lineage for it. Refer to the image below:

To conclude, these four guidelines are basic ways to start a better data governance journey with Talend Data Catalog and to use it effectively. There are other best practices that might be useful, depending on the use case or the particular scenario. Sometimes certain guidelines help with a particular use case, perhaps for successfully tracing lineage or doing an impact analysis, and at other times a different set of practices fits a different use case. This blog is a starting point for exploring ways to make use of Talend Data Catalog for harvesting Talend data integration Jobs.

 

 

 

The post 4 Best Practices For Utilizing Talend Data Catalog in Your ETL/ELT Processes appeared first on Talend Real-Time Open Source Data Integration Software.

Stitch and Talend: Working Together To Make Data Integration Easy


Data integration is one of the hardest things developers have to do, but the talented members of the Talend User Group have lots of ideas on how to make it easier. Stitch, now part of Talend, recently hosted the inaugural meeting of the Philadelphia Talend User Group in their office. It was a great time characterized by interesting presentations, tons of food, and a chance to meet and talk with peers who use Talend products.

Stitch’s offices are on the 15th floor of a hundred-year-old office building in Center City, right across the street from City Hall. User group members began arriving shortly after 4:00. Many grabbed soft pretzels, mini cupcakes, soda, and beer in the Stitch kitchen as they introduced themselves to other attendees.

Talend User Group cupcakes

The official program began half an hour later. Talend Director of Community Laura Ventura started things off by welcoming the guests, then introduced Jonathan Samberg, Talend’s regional vice president for the Northeast. He thanked the hosts, in the person of former Stitch CEO and current Talend Senior Vice President Jake Stein, and welcomed Talend partners Qubole and Snowflake. We then dove right into the presentations.

Talend at the University of Pennsylvania

University of Pennsylvania Data Warehouse Architect Katie Staley Faucett is responsible for the enterprise data warehouse as a member of the Penn Information Systems and Computing (ISC) department. The department was an early adopter of Talend Cloud and has been using it for about two years.

University of Pennsylvania provides a presentation on Talend Cloud at Talend User Group

ISC provides shared technology services to all of the University’s schools and centers. A few years ago, the organization decided to modernize and transform its processes and technology with a cloud-first initiative. Now it has dozens of SaaS applications and platforms and uses the Amazon PaaS environment: AWS services including Aurora, S3, and Lambda. It also supports hundreds of server instances. ISC uses GitLab for collaborative development with Talend.

Faucett gave a few examples of problems Penn needed to solve and for which they used Talend. In one, the department needed to simplify and scale a fundraising application. Part of the solution involved a custom connector Faucett found in Talend Exchange that let them push BLOB database objects to Amazon S3. Using Talend let the application development team focus on building application features and functionality instead of data integration infrastructure.

After a couple of other examples, Faucett shared a list of lessons learned and best practices from the University of Pennsylvania’s experiences with Talend Cloud. She suggested learning about and experimenting with ELT for better data integration performance. She also had advice on using Talend with GitLab: Learn to understand the repository structure, including how objects are shared. Parameterize things as much as you can, and use generic database components when you can, so you don’t have to redo a lot if things change.

Stitch and Talend

Next up was Chris Merrick, vice president of engineering for Stitch, who introduced Stitch to the Talend users and talked about how the two platforms will work together.

Stitch VP of Engineering Chris Merrick describes data ingestion pipelines

Merrick talked about a data ingestion pipeline with a value chain that involves data collection, data governance, data transformation, and sharing. Stitch is focused on enabling the collection stage and providing accelerated ingestion for more than 90 data sources.

This year, Stitch developers are working on integrating Stitch into the Talend Cloud Platform to give users of all Talend tools a unified experience. Next year, the road map calls for seamless ingestion by Stitch and handoff to Talend tools.

The meeting wrapped up with shawarma for all and a chance to meet and talk to other Talend users, Talend and Stitch staff, and partners in attendance.

We hope to see you at our next Stitch and Talend User Group. Find one in your area here.

 

The post Stitch and Talend: Working Together To Make Data Integration Easy appeared first on Talend Real-Time Open Source Data Integration Software.

Talend Pipeline Designer – Avro schema considerations when working with Java and Python


When working with Talend Pipeline Designer, you may come across situations where you need to write an Avro schema. Go to the Apache Avro site to read more about this serialization technology. This blog has come about thanks to some experiences I’ve had working with Avro schemas and getting them to work when serializing using Python and Java. Spoiler alert: they don’t necessarily work the same way when used with Python and Java.

Since Pipeline Designer processes Avro using Python, we need to ensure that our Avro schemas work with Python. However, I felt it important to demonstrate Pipeline Designer being used with Talend ESB, which is a Java tool. It was here that I spotted the differences in behavior. Rather than let you struggle through the same confusion I had, I figured this blog would be a nice prequel to the blog series I planned to write when I stumbled upon this. It includes examples of the issues I have found as well as useful code to serialize JSON to Avro using Python and Java.

The differences between serializing JSON to Avro when using Java and Python

Let’s say you wish to build an Avro schema to serialize the following JSON using Java, but to supply it to Talend Pipeline Designer for processing….

{ "name": "Richard", "age": 40, "siblings": 2 }

It’s a very basic example of a person record, with “name”, “age” and the number of “siblings” the person has. A very basic Avro schema for the above can be seen below….

{
  "type" : "record",
  "name" : "Demo",
  "namespace" : "com.talend.richard.demo",
  "fields" : [ {
    "name" : "name",
    "type" : "string"
  }, {
    "name" : "age",
    "type" : "int"
  }, {
    "name" : "siblings",
    "type" : "int"
  } ]
}

This schema identifies the “name” field as expecting a String, the “age” field as expecting an int and the “siblings” field as expecting an int. This Avro schema, used to serialize the JSON above that, will work exactly the same way in both Java and Python. Awesome…..so what is the problem? The problem is that sometimes your JSON may not hold every value you initially specify it to hold. If your JSON MUST hold every value, then it is right for it to fail. But this will seldom be the case.

 

What if your JSON is missing some values, like this…

{ "name": "Richard", "age": null, "siblings": null }

Notice that both “age” and “siblings” are set to “null”. The Avro schema that we have just specified will not handle this for either Java or Python. An easy fix for Python would be to add “Unions” to the fields that can be null. A Union is essentially a type option. We can specify multiple types in Unions. The important thing to remember is that “null” is always listed first in the Union.  

 

So, the Avro schema might look like below for a JSON String where only “name” is essential, but all other fields are optional …

{
  "type" : "record",
  "name" : "Demo",
  "namespace" : "com.talend.richard.demo",
  "fields" : [ {
    "name" : "name",
    "type" : "string"
  }, {
    "name" : "age",
    "type" : [ "null", "int"] 
  }, {
    "name" : "siblings",
    "type" : [ "null", "int" ]
  } ]
}

 

The changes (represented in red) are the union types added to the "age" and "siblings" fields. This is important. As I said, this will work when being processed using Python. However, to get it to work with Java, you need to slightly alter the JSON to give each of the optional fields a type. Examples that would work with Java can be seen below…

 

{ "name": "Richard", "age": {"int":40}, "siblings":{"int": 3} }

 

…if values are supplied, or this…

 

{ "name": "Richard", "age": {"int":40}, "siblings": null }

 

…if “name” and “age” have values, but “siblings” is null.

 

However, if we take the examples of JSON above and try to process them using Python, we will get an error. Python will see the "types" specified ("int", "string", etc.) as objects in their own right. Since they are not represented in the Avro schema, this will cause problems. Java understands that these are indicating a "type" for the value that follows.

Now, if we are using Python to serialize our JSON, we can also completely omit optional fields. Given the last Avro schema we looked at, all of the following JSON Strings would be acceptable when serializing using Python…

{ "name": "Richard", "age": 40, "siblings": 3 }
{ "name": "Richard", "age": 40, "siblings": null }
{ "name": "Richard", "age": 40 }
{ "name": "Richard"}

 

This, unfortunately, is not the case for Java. Given the examples I have shown above (factoring in the “type” requirements for Java), only …

 

{ "name": "Richard"}, "age": {"int":40}, "siblings":{"int": 3} }

&nbsp;

…and…

 

{ "name": "Richard"}, "age": {"int":40}, "siblings": null }

 

…will work when serializing using Java.

 

Let’s distill the above into a set of rules.

  1. If ALL objects in a JSON String will always be supplied, a basic Avro schema with no Unions will work equally well for both Python and Java.
  2. If there will be optional values (but the keys will still be supplied), the Avro schema will need to include Unions for the optional values. No further changes are needed for the JSON when being serialized by Python. When being serialized by Java, a “type” is required before the value, but ONLY for those objects where Unions are used. The changes required for Java will cause Python to fail when serializing.
  3. If there are optional objects (no “key” or “value”), this will ONLY work with Python.

 

So, what does this mean for working with Pipeline Designer? Essentially, if you are serializing your JSON using Python, once you have it working in Python, it will almost certainly be readable by Pipeline Designer. Python serialization for Pipeline Designer is always going to be relatively straightforward. However, many Talend developers will want to serialize their JSON using Java. What do we need to consider?

 

  1. All object keys in the JSON MUST be supplied regardless of whether there is data.
  2. Where Unions are used in the Avro schema, we must supply a “type” before the value in the JSON (if there is a value, otherwise “null” will suffice).
  3. When listing types for a Union, “null” should always come first.
  4. Even though the Avro data will be consumed by Python, the presence of the "type" information in the JSON before serialization will not cause problems when Python de-serializes it.

 

Why not try this out for yourself with some pre-built code? 

It’s great having the rules, but there will be times when you just want to try it out for yourself. This is why I am including some Python code to serialize and deserialize JSON, and some Java methods with a bit of code to do the same. Much of what I have learnt to write this blog came from serializing using Java, saving the byte array in a file, then deserializing using Python…and vice versa. Give it a go.

 

Python Code 

Below is some code which will serialize JSON into Avro using Python. You will need to change the file paths to point to your own schemas and JSON files. The code below will serialize the JSON without including the Avro schema in the output. 

import io
import json
import avro.io
import avro.schema

def deserialize_json(bytes):
    schema = avro.schema.Parse(avro_schema)
    reader = avro.io.DatumReader(schema)
    bytes_reader = io.BytesIO(bytes)
    decoder = avro.io.BinaryDecoder(bytes_reader)
    # Deserialize JSON
    deserialized_json = reader.read(decoder)
    json_str = json.dumps(deserialized_json)
    return json_str

def serialize_json(json_data):
    schema = avro.schema.Parse(avro_schema)
    # Create Avro encoder
    writer = avro.io.DatumWriter(schema)
    bytes_writer = io.BytesIO()
    encoder = avro.io.BinaryEncoder(bytes_writer)
    # Serialize JSON
    writer.write(json.loads(json_data), encoder)
    return bytes_writer.getvalue()

#Read schema file into String
filename = '/Users/richardhall/Documents/Avro.txt'
f = open(filename, "r")
avro_schema = f.read()
f.close()
print('Avro Schema')
print(avro_schema)

#Read JSON file into String
filename = '/Users/richardhall/Documents/PYTHON_JSON.txt'
f = open(filename, "r")
input_json = f.read()
f.close()
print('Python JSON')
print(input_json)

#Read the schema
schema = avro.schema.Parse(avro_schema)

#Serialize Python JSON
output_bytes = serialize_json(input_json)
print('Serialized Python JSON')
print(output_bytes.decode("latin-1"))

#Write serialized content to file
filename = '/Users/richardhall/Documents/serialized_python.txt'
f = open(filename, 'w+b')
binary_format = bytearray(output_bytes)
f.write(binary_format)
f.close()

#De-serialize JSON
output_str = deserialize_json(output_bytes)
print('De-Serialized Python JSON')
print(output_str)

#Read JAVA serialized file into byte array - Comment out if not present
filename = '/Users/richardhall/Documents/serialized_json.txt'
f = open(filename, "rb")
binary_read = f.read()
f.close()
print('Serialized JAVA JSON')
print(binary_read.decode("latin-1"))


#De-serialize JSON from file - Comment out if not present
output_str = deserialize_json(binary_read)
print('De-Serialized Java JSON')
print(output_str)

Java Code 

Below is some code which will serialize JSON into Avro using Java. There are a couple of parts to it: a routine containing some static methods, and some code making use of those methods. This has been written to be run in a tJava component in a Talend Job, but it can be run in any Java IDE.

The routine below requires the following Java libraries. These can all be found packaged with Talend Studio 7.1. Alternatives may work, but this has not been tested with other libraries…

The routine using these libraries can be seen below…

 

package routines;


import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;



/*
 * user specification: the function's comment should contain keys as follows: 1. write about the function's comment.but
 * it must be before the "{talendTypes}" key.
 * 
 * 2. {talendTypes} 's value must be talend Type, it is required . its value should be one of: String, char | Character,
 * long | Long, int | Integer, boolean | Boolean, byte | Byte, Date, double | Double, float | Float, Object, short |
 * Short
 * 
 * 3. {Category} define a category for the Function. it is required. its value is user-defined .
 * 
 * 4. {param} 's format is: {param} <type>[(<default value or closed list values>)] <name>[ : <comment>].
 * 
 * <type> 's value should be one of: string, int, list, double, object, boolean, long, char, date. <name>'s value is the
 * Function's parameter name. the {param} is optional. so if you the Function without the parameters. the {param} don't
 * added. you can have many parameters for the Function.
 * 
 * 5. {example} gives a example for the Function. it is optional.
 */
public class AVROUtils {

    
	//Static schema variable
    static String schemaStr = null;
    

    /**
     * setSchema: sets the schemaStr variable.
     * 
     * 
     * {Category} User Defined
     * 
     * {param} string("{\"type\":\"record\",\"name\":\"Demo\",\"namespace\":\"com.talend.richard.demo\"}") schema: 
     * The schema to be set.
     * 
     * {example} setSchema("{\"type\":\"record\",\"name\":\"Demo\",\"namespace\":\"com.talend.richard.demo\"}") 
     */

    public static void setSchema(String schema){
    	schemaStr = schema;
    	
    }


    
    /**
     * jsonToAvroWithSchema: Serialize a JSON String to an Avro byte array including a copy of the schema in the
     * serialized output
     *	 
     * {talendTypes} byte[]
     * 
     * {Category} User Defined
     * 
     * {param} string("{ \"name\": {\"string\":\"Richard\"}}") json: The JSON to be serialized
     * 
     * {example} jsonToAvroWithSchema("{ \"name\": {\"string\":\"Richard\"}}") .
     */
    public static byte[] jsonToAvroWithSchema(String json) throws IOException {
        InputStream input = null;
        DataFileWriter<GenericRecord> writer = null;
        Encoder encoder = null;
        ByteArrayOutputStream output = null;
        try {
            Schema schema = new Schema.Parser().parse(schemaStr);
            DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
            input = new ByteArrayInputStream(json.getBytes());
            output = new ByteArrayOutputStream();
            DataInputStream dis = new DataInputStream(input);
            writer = new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
            writer.create(schema, output);
            Decoder decoder = DecoderFactory.get().jsonDecoder(schema, dis);
            GenericRecord datum;
            while (true) {
                try {
                    datum = reader.read(null, decoder);
                } catch (EOFException eofe) {
                    break;
                }
                writer.append(datum);
            }
            writer.flush();
            writer.close();
               
            return output.toByteArray();
        } finally {
            try { input.close(); } catch (Exception e) { }
        }
    }


    
    /**
     * avroToJsonWithSchema: De-Serialize a JSON String from an Avro byte array including a copy of the schema 
     * inside the serialized Avro output
     *  
     * {talendTypes} String
     * 
     * {Category} User Defined
     * 
     * {param} byte[](bytes) avro: The Avro byte array to be de-serialized
     * 
     * {example} avroToJsonWithSchema(bytes) .
     */
    public static String avroToJsonWithSchema(byte[] avro) throws IOException {
    	
        boolean pretty = false;
        GenericDatumReader<GenericRecord> reader = null;
        JsonEncoder encoder = null;
        ByteArrayOutputStream output = null;
        try {
            reader = new GenericDatumReader<GenericRecord>();
            InputStream input = new ByteArrayInputStream(avro);
            DataFileStream<GenericRecord> streamReader = new DataFileStream<GenericRecord>(input, reader);
            output = new ByteArrayOutputStream();
            Schema schema = streamReader.getSchema();
            DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
            encoder = EncoderFactory.get().jsonEncoder(schema, output, pretty);
            for (GenericRecord datum : streamReader) {
                writer.write(datum, encoder);
            }
            streamReader.close();
            encoder.flush();
            output.flush();
            return new String(output.toByteArray());
        } finally {
            try { if (output != null) output.close(); } catch (Exception e) { }
        }
    }

    
    /**
     * avroToJsonWithoutSchema: De-Serialize a JSON String from an Avro byte array without a copy of the schema in 
     * the serialized Avro output
     *  
     * {talendTypes} String
     * 
     * {Category} User Defined
     * 
     * {param} byte[](bytes) avro: The Avro byte array to be de-serialized
     * 
     * {example} avroToJsonWithoutSchema(bytes) .
     */
    public static String avroToJsonWithoutSchema(byte[] avro) throws IOException {
        boolean pretty = false;
        GenericDatumReader<GenericRecord> reader = null;
        JsonEncoder encoder = null;
        ByteArrayOutputStream output = null;
        try {
        	Schema schema = new Schema.Parser().parse(schemaStr);
            reader = new GenericDatumReader<GenericRecord>(schema);
            InputStream input = new ByteArrayInputStream(avro);
            output = new ByteArrayOutputStream();
            DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
            encoder = EncoderFactory.get().jsonEncoder(schema, output, pretty);
            Decoder decoder = DecoderFactory.get().binaryDecoder(input, null);
            GenericRecord datum;
            while (true) {
                try {
                    datum = reader.read(null, decoder);
                } catch (EOFException eofe) {
                    break;
                }
                writer.write(datum, encoder);
            }
            encoder.flush();
            output.flush();
            return new String(output.toByteArray());
        } finally {
            try {
                if (output != null) output.close();
            } catch (Exception e) {
            }
        }
    }
    
    
    /**
     * jsonToAvroWithoutSchema: Serialize a JSON String to an Avro byte array excluding a copy of the schema in 
     * the serialized Avro output
     *
     *  
     * {talendTypes} byte[]
     * 
     * {Category} User Defined
     * 
     * {param} string("{ \"name\": {\"string\":\"Richard\"}}") json: The JSON to be serialized
     * 
     * {example} jsonToAvroWithoutSchema("{ \"name\": {\"string\":\"Richard\"}}") .
     */
    public static byte[] jsonToAvroWithoutSchema(String json) throws IOException {
        InputStream input = null;
        GenericDatumWriter<GenericRecord> writer = null;
        Encoder encoder = null;
        ByteArrayOutputStream output = null;
        try {
            Schema schema = new Schema.Parser().parse(schemaStr);
            DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
            input = new ByteArrayInputStream(json.getBytes());
            output = new ByteArrayOutputStream();
            DataInputStream dis = new DataInputStream(input);
            writer = new GenericDatumWriter<GenericRecord>(schema);
            Decoder decoder = DecoderFactory.get().jsonDecoder(schema, dis);
            encoder = EncoderFactory.get().binaryEncoder(output, null);
            GenericRecord datum;
            while (true) {
                try {
                    datum = reader.read(null, decoder);
                } catch (EOFException eofe) {
                    break;
                }
                writer.write(datum, encoder);
            }
            encoder.flush();
            return output.toByteArray();
        } finally {
            try { input.close(); } catch (Exception e) { }
        }
    }
}

 

The code below shows how you can use the methods above in a tJava component. You will need to change the file paths to point to your own schemas and JSON files. This code will serialize the JSON without including the Avro schema in the output.

 

//Set Avro schema
String avro_schema = "";
String avroFilePath = "/Users/richardhall/Documents/Avro.txt";
  
try{
	avro_schema = new String ( java.nio.file.Files.readAllBytes( java.nio.file.Paths.get(avroFilePath) ) );
}catch (IOException e){
	e.printStackTrace();
}
System.out.println("Avro Schema");
System.out.println(avro_schema);
routines.AVROUtils.setSchema(avro_schema);

//Set the JSON configured for JAVA
String json = "";
String jsonFilePath = "/Users/richardhall/Documents/JAVA_JSON.txt";    
try{
	json = new String ( java.nio.file.Files.readAllBytes( java.nio.file.Paths.get(jsonFilePath) ) );
}catch (IOException e){
	e.printStackTrace();
}
System.out.println("JSON");
System.out.println(json);

//Serialize the JSON
byte[] byteArray = routines.AVROUtils.jsonToAvroWithoutSchema(json);
System.out.println("Serialized JSON");
System.out.println(new String(byteArray));

//Write the Serialized JSON to a file
String serializedFileJava = "/Users/richardhall/Documents/serialized_json.txt"; 
java.io.FileOutputStream fos = new java.io.FileOutputStream(serializedFileJava);
fos.write(byteArray);
fos.close();

//De-serialize the JSON
String deserializedJson = routines.AVROUtils.avroToJsonWithoutSchema(byteArray);
System.out.println("De-Serialized JSON");
System.out.println(deserializedJson);

//Load the serialized JSON file produced by Python -- Comment out if not present
String serializedFilePython = "/Users/richardhall/Documents/serialized_python.txt"; 
java.io.File file = new java.io.File(serializedFilePython);
byte[] serializedJsonFromPythonFileBytes = new byte[(int) file.length()]; 

java.io.FileInputStream fis = new java.io.FileInputStream(file);
fis.read(serializedJsonFromPythonFileBytes); //read file into bytes[]
fis.close();

//De-serialize the JSON file produced by Python -- Comment out if not present
deserializedJson = routines.AVROUtils.avroToJsonWithoutSchema(serializedJsonFromPythonFileBytes);
System.out.println("De-Serialized JSON from Python");
System.out.println(deserializedJson);    
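
If you want a quick sanity check of the routines without pointing at schema and JSON files, a minimal round trip along the following lines can also be dropped into a tJava component. The schema and JSON here are hypothetical (a single nullable string field), but they follow the Java-style type-prefix convention discussed in this post.

//Minimal round-trip test using a hypothetical one-field schema
String testSchema = "{\"type\":\"record\",\"name\":\"person\",\"fields\":[{\"name\":\"name\",\"type\":[\"null\",\"string\"]}]}";
routines.AVROUtils.setSchema(testSchema);

//Serialize and immediately de-serialize the JSON
byte[] testBytes = routines.AVROUtils.jsonToAvroWithoutSchema("{\"name\":{\"string\":\"Richard\"}}");
System.out.println(routines.AVROUtils.avroToJsonWithoutSchema(testBytes));
//Expected output: {"name":{"string":"Richard"}}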

 

Running the code above (Python and Java) will allow you to build an Avro schema and test it against both Python and Java. It will also allow you to serialize using one language and deserialize using the other. This should help you deal with Avro issues before you get anywhere near trying to use it with Talend Pipeline Designer.

The post Talend Pipeline Designer – Avro schema considerations when working with Java and Python appeared first on Talend Real-Time Open Source Data Integration Software.


Making Sense of the Tableau and Looker Acquisitions


We all know that data is important and that becoming a data-driven enterprise is critical to future enterprise success. But recent events threw into sharp relief just how critical data is to business. Google announced its intention to buy Looker for $2.6 billion. Several days later, Salesforce announced that it would be purchasing Tableau for $15 billion. What can we make of these acquisitions? The answer is simple: business insight is the top priority and data is what fuels this insight!

What these acquisitions suggest to me is just how central business insight is to today’s enterprise. BI is clearly a hot topic in today’s data-driven world. Large players are making bold moves to ensure they own and control BI capabilities within their product suites. What is clear is that BI has become a critical component in both the SaaS application stack and in cloud platform services.

It also seems clear that these players are evolving from SaaS applications to full-fledged SaaS platforms to answer the increasing need for continuous customer engagement. Talend director of product marketing Nick Piette says, “For many companies, digital transformation is about digital customer engagement. The battle for the best customer experience will be waged with the best customer insight. It will require every customer touchpoint to be captured, collected and analyzed, be it brick and mortar, mobile, web, etc. BI is the insight – data is the fuel.”

Tableau and Looker: It’s All About the Analytics

The Salesforce acquisition of Tableau in particular highlights the importance of understanding customers and their journeys through better analytics. Talend works with Salesforce, Amazon, Microsoft, Google, and many more cloud players. Each and every one of them is investing in advanced analytics and business intelligence applying machine learning, real-time data processing, artificial intelligence and more to gain insights and put them into action in real time.

But as we all know, customer data doesn’t just exist in a CRM. It’s everywhere and in an ever-increasing number of systems. Data is in every business application, customer interaction, and stakeholder engagement; however, without an independent data integration and data management vendor to span all these systems, the data becomes locked into silos with no way to tap its potential value. 

Great Analytics Depends on Data You Can Trust

Behind every single BI report or dashboard, or AI solution for that matter, resides the data that you need to trust. Data is the single most important thing that feeds business intelligence. Without proper data integration, you get an incomplete view of your customers, products, and markets – gaps that create reasons to distrust your data. And without clean and governed data, the business misses opportunities and is exposed to risk.

One thing is clear: we are unlikely to be finished with consolidation in the BI and data analytics market. And there is no question that an independent data integration platform that spans all applications and cloud platforms will be an even more critical piece of enterprise data infrastructure.

For more information about data integration platforms, take a look at our Definitive Guide to Data Integration.

The post Making Sense of the Tableau and Looker Acquisitions appeared first on Talend Real-Time Open Source Data Integration Software.

Finding Patterns in Data with Talend


The main purpose of machine learning is to perform learning tasks on unseen data sets, having previously built up experience using training and testing data. Often those tasks include looking for patterns and relationships between variables within the data. In supervised learning, one of the three main types of machine learning, we will have some idea of the types of input and output that we are looking for, and in order to quickly and efficiently build a supervised model, it helps to understand some of the relationships between the variables within the data.

As an example, imagine we are an automotive trader. We want to build a machine learning model that can work out the value of second-hand cars. We know from experience that the value depends upon the model of car, the condition, the mileage, the service history and so on; even the color of a car can affect the resale price. What we don’t know is the exact form of the relationships between these variables, and this is where machine learning comes in. We can use a training data set of, say, tens of thousands of sale records to train our model. If it includes all the variables that affect the value of second-hand cars, we can build a model and let it learn those relationships from our training and testing data. Once we are happy that the model is performing correctly (we can test its accuracy using test data), we let the model run.

Now imagine a situation where we don’t know, or are not certain about, the relationships between our variables. What can we do? We need tools that can help us understand those relationships, which could then help us build a model.

This is what we call unsupervised machine learning. That is, we don’t really understand how elements within the data are related, and we can’t yet classify or categorize those data, so we need some way to do so.

The most common type of unsupervised learning is called Cluster Analysis, or Clustering. Clustering is the task of grouping together a set of objects (whatever they are) in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups (or clusters). It is one of the main tasks of exploratory data mining and a common technique for statistical data analysis. It is used in many, many fields, and there are lots of different algorithms which can be used to perform that analysis.

The most well-known clustering algorithm is K-means clustering. This is a type of unsupervised learning that is used when you have unlabelled data (data without any defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of the K groups based on the features that are provided, so data points are clustered based on feature similarity. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem, and it provides a simple way to classify a given data set into a certain number of clusters. The diagram below shows a simple illustration of how clustering works.

Finding Patterns in Data With Talend Screenshot 1

In this diagram we can see data clustered into three main groups: red, blue and green. Data points that are near the center of each cluster are referred to as ‘well-clustered’ data, and those on the outside are referred to as ‘loose-clustered’ data.
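
To make the iterative “assign, then update” idea concrete, below is a minimal, self-contained K-means sketch on a handful of made-up 2D points (the data, the value of K and the iteration count are all invented for illustration). This is plain Java rather than anything Talend-specific; the Talend components described next do the equivalent work for you, at scale, on Spark.

import java.util.Arrays;
import java.util.Random;

public class SimpleKMeans {

    public static void main(String[] args) {
        // Hypothetical 2D data points (x, y)
        double[][] points = {
            {1.0, 1.1}, {1.2, 0.9}, {0.8, 1.0},   // roughly around (1, 1)
            {5.0, 5.2}, {5.1, 4.9}, {4.8, 5.0},   // roughly around (5, 5)
            {9.0, 1.0}, {9.2, 1.1}, {8.9, 0.8}    // roughly around (9, 1)
        };
        int k = 3;
        Random random = new Random(42);

        // Initialise centroids by picking k random points from the data
        double[][] centroids = new double[k][2];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[random.nextInt(points.length)].clone();
        }

        int[] assignment = new int[points.length];
        for (int iteration = 0; iteration < 10; iteration++) {
            // Assignment step: attach each point to its nearest centroid
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[p][0] - centroids[c][0];
                    double dy = points[p][1] - centroids[c][1];
                    double dist = dx * dx + dy * dy;
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each centroid to the mean of its assigned points
            for (int c = 0; c < k; c++) {
                double sumX = 0, sumY = 0;
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        sumX += points[p][0];
                        sumY += points[p][1];
                        count++;
                    }
                }
                if (count > 0) {
                    centroids[c][0] = sumX / count;
                    centroids[c][1] = sumY / count;
                }
            }
        }
        System.out.println("Cluster assignments: " + Arrays.toString(assignment));
        System.out.println("Centroids: " + Arrays.deepToString(centroids));
    }
}

Real implementations add smarter initialisation (such as k-means++) and a convergence check instead of a fixed number of iterations, but the assign/update loop above is the heart of the algorithm.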

In Talend we have clustering components in our set of Machine Learning components. These consist of three components as shown below.  

The tKMeansModel component analyzes sets of data by applying the K-means algorithm. It analyzes feature vectors, usually pre-processed by the tModelEncoder component, to generate a clustering model, which it writes either to memory or to a file. The tPredict and tPredictCluster components then predict which cluster an element belongs to, based on the clustering model generated by the model training component (tKMeansModel). These Talend components are all available in the Talend Platform products with Big Data and in Talend Data Fabric, and they are available for both the Spark Batch and Spark Streaming frameworks.

As well as the clustering components shown above, there are three other sets of machine learning components available in Talend. There are Classification components, which are used to identify which of a set of categories a new observation belongs to, on the basis of a training set of data containing observations whose category membership is already known. There are Recommendation components, which seek to predict the “rating” or “preference” that a user would give to a certain item. Finally, there are Regression components; regression analysis is a process for estimating the relationships among variables, and it includes techniques for modeling and analyzing several variables when the main focus is on the relationship between a dependent variable and one or more independent variables.

So, we can see there are a number of sets of Talend machine learning components that can help you look for patterns in data. These can help you discover hitherto unknown relationships in your data, find patterns, classify your data, and build models that can predict future patterns and behavior. For more information on Talend and machine learning, refer to the following page, which includes a video introduction to machine learning that I presented:

 https://www.talend.com/resources/machine-learning-platform/  

The post Finding Patterns in Data with Talend appeared first on Talend Real-Time Open Source Data Integration Software.

Creating Avro schemas for Pipeline Designer with Pipeline Designer


I have had the privilege of playing with and following the progress of Pipeline Designer for a while now. I am really excited about this new tool. If you haven’t seen it yet, then don’t delay and get your free trial now…..actually, maybe read this blog first 😉

Pipeline Designer is an incredibly intuitive, web-based, batch and stream processing integration tool. The real genius of this tool is that batch and stream processing “Pipelines” can be built in exactly the same way. This might not sound that significant, but consider the differences between processing a batch of data with a start and an end, and processing an on-going stream of data with a known start, but no known end. Without going into too much detail, just contemplate the differences you’d need to take into account when considering aggregates, for batch and stream data sources. But batch and stream processing is not the only cool thing about this product. There is also the schema-on-read and data preview functionality. I won’t go into too much detail reiterating what my colleague Stephanie Yanaga has discussed here, but the schema functionality is particularly special. In most cases, you do not need to understand how this works, but in a couple of cases, it is really useful to have an insight into this. This blog discusses those cases and focuses on how to work with the tool in the most efficient way.     

The schema functionality is largely provided by Apache AVRO. In several scenarios, this is completely hidden from users unless they choose to output their data in AVRO format. However, there are a couple of scenarios where you need to supply your data in AVRO format. Two examples of these use cases are pushing JSON data to Kafka and Kinesis. Unluckily for me (luckily for you?) I had to work around this having never heard of AVRO before. My use case was consuming Twitter data to display it on a geographical heatmap. This use case is one which I will elaborate on in a future blog. The problem I had here was that I could retrieve the Twitter data in JSON format really easily, but I didn’t know how to convert it into AVRO format. I then discovered a nice way of getting Pipeline Designer to do the majority of the work for you. I will explain….

Pipeline Designer supports several connection types (many of which I am sure you have come across) …..Amazon Kinesis, Amazon S3, Database, Elasticsearch, HDFS, Kafka, Salesforce and the Test Connection. It is the Test Connection which is really useful to us here. With this connection, we can simulate the consumption of JSON data and the AVRO schema is calculated for us. We can then create a Pipeline which will output this file to an Amazon S3 bucket or an HDFS filesystem, open the file and get a copy of the schema we need to use.

 

Building a Pipeline to generate your AVRO schema

The above is a massively simplified description of what you need to do. There are other steps involved, particularly if you want to use the schema to create an AVRO String in Java. This section of the blog will describe in detail how you can achieve the above for your JSON data.

  1. Open up Pipeline Designer and click on the “Connections” option on the menu on the left-hand side. Then click the “Add Connection” button at the top.

  2. You will be presented with a screen like the one shown in the simple example below. Add a “Name”, “Description”, select a “Remote Engine” and select “Test Connection” as the “Type”. Click “Validate” once completed.

  3. You now need to create a Dataset which uses the Test Connection you have just created. Click on the “Datasets” left menu option and then “Add Dataset”

  4. In the “Add a New Dataset” screen, fill out the “Name” you wish to use, the “Connection” type (the one we have just created), and select JSON for the “Format”.

  5. Below the “Dataset” option, you will see the “Values” box. This is where we will add our example JSON. I will go into a bit more detail about the JSON we need to use later on in this blog. Once this is done, click on “Validate”.

  6. In this example we will be using an Amazon S3 bucket as the target for the AVRO file. You can use HDFS if you prefer, but Amazon S3 is just easier and quicker for me to use. If you do not know how to set up an Amazon S3 bucket, take a look here.
  7. Once you have created your S3 bucket, we need to create a connection to it. Click on “Add Connection” as demonstrated in step 1.
  8. In the “Add a New Connection” screen, add a “Name” for your S3 Connection, select a “Remote Engine”, select the “Type” (Amazon S3) and fill out your access credentials (make sure you keep yours hidden as I have, otherwise these credentials could cost you a lot of money in the wrong hands). Check the connection by clicking on the “Check Connection” button. If all is OK, then click the “Validate” button.
  9. We now need to create our S3 bucket dataset. To do this, repeat step 3 and click on “Add Dataset”.
  10. In the “Add a New Dataset” screen, select the “Name” of your dataset, select the “Connection” type (the Connection we have just created), select the “Bucket” you have created in S3, decide upon an “Object” name (essentially a folder as far as we are concerned here) and select the “Format” to be “Avro”. Click on “Validate” once completed.


    We now have the source and target created. The next step is to create our Pipeline.

  11. Go to the left-hand side menu and select “Pipelines”. Then click on “Add Pipeline”.

  12. You will see the Pipeline Designer development window open up. This is where we can build a Pipeline. This one will be a very easy one. First, select our Pipeline name. This is done where the red box is in the screenshot. Next, select the Run Profile. This is done where the pink box is. Add your source by clicking on the “Add Source” option in the development screen and selecting the “Dummy JSON File” source we created earlier (where the green box is). This will reveal an “Add Destination” option on the development screen. Click on this and select your S3 bucket target (where the blue box is).


    At the bottom of the screen you will see the data preview area. This is where (orange box) you can look at your data while building your Pipeline.

    Once everything is built and looking similar to the above, click on the “Save” button (in the yellow box).

  13. Once the Pipeline is built, we are ready to run it. Click on the Run button (in the red box) and you should see stats starting to generate where the blue box is. This might take between 30 and 40 seconds to start up.

  14. Once the Pipeline has finished running, we are ready to take a look at our Avro file.

    Go to your Amazon account (or HDFS filesystem if you have tried this using HDFS) and find the file that was generated. You can see the file that I generated in my S3 bucket in the screen shot below.


    Download this file and open it.

  15. Once the file is opened, you will see something like the screenshot below

    We are interested in the text between “Objavro.schema<” and “avro.codec”. This is our Avro schema. We *may* need to go through some extra processes, but I will talk about these (and the details on the example JSON that I mentioned in step 5) in the following section.
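
    As an aside, if you would rather not pick the schema out of the raw file by hand, the Avro Java library can read it for you, since the schema is embedded in the file header. A minimal sketch (the file path is a made-up placeholder):

    java.io.File avroFile = new java.io.File("/path/to/downloaded_avro_file.avro"); //hypothetical path to the downloaded file
    org.apache.avro.generic.GenericDatumReader<org.apache.avro.generic.GenericRecord> datumReader =
            new org.apache.avro.generic.GenericDatumReader<org.apache.avro.generic.GenericRecord>();
    org.apache.avro.file.DataFileReader<org.apache.avro.generic.GenericRecord> fileReader =
            new org.apache.avro.file.DataFileReader<org.apache.avro.generic.GenericRecord>(avroFile, datumReader);
    System.out.println(fileReader.getSchema().toString(true)); //prints the embedded schema, pretty-printed
    fileReader.close();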

 

JSON Data Format

An example of the JSON we will be working with here (and in my next blog on Pipeline Designer) can be seen below.

 

{  
   "geo_bounding_box":[  
      {  
         "latitude":{  
            "double":36.464085
         },
         "longitude":{  
            "double":139.74277
         }
      },
      {  
         "latitude":{  
            "double":36.464085
         },
         "longitude":{  
            "double":139.74277
         }
      },
      {  
         "latitude":{  
            "double":36.464085
         },
         "longitude":{  
            "double":139.74277
         }
      },
      {  
         "latitude":{  
            "double":36.464085
         },
         "longitude":{  
            "double":139.74277
         }
      }
   ],
   "gps_coords":{  
      "latitude":{  
         "double":36.464085
      },
      "longitude":{  
         "double":139.74277
      }
   },
   "created_at":{  
      "string":"Fri Apr 26 14:08:41 BST 2019"
   },
   "text":{  
      "string":"Hello World"
   },
   "id":{  
      "string":"1121763025612988417"
   },
   "type":{  
      "string":"tweet"
   }
}

This is a rather generic schema that is built to hold rough bounding box location data (essentially GPS coordinates forming a box covering an area), exact GPS coordinates, a timestamp, some text, a record Id and a type. The purpose of this is to allow several sources of real-time data with GPS information to be processed by a Pipeline. The example above is using Twitter data.

The eagle-eyed amongst you will notice that the JSON above does not match the JSON in the screenshot I presented in point 5. The reason for this is that Pipeline Designer uses Python to serialize and deserialize the JSON to and from AVRO format. Python does not need the “type” of the data to precede the data when dealing with potentially nullable fields; Java does. I have written another blog explaining the differences between Java and Python AVRO serialization here. Since the objective of this project is to serialize the JSON using Java to be deserialized by Pipeline Designer, the schema I am using must accommodate both Java and Python. FYI Python can interpret the JSON above, but Pipeline Designer will generate an AVRO schema which is not suitable for Java. So, the JSON that is used in this example is changed to remove the type prefixes. Like so…..

{  
   "geo_bounding_box":[  
      {  
         "latitude":36.464085,
         "longitude":139.74277
      },
      {  
         "latitude":36.464085,
         "longitude":139.74277
      },
      {  
         "latitude":36.464085,
         "longitude":139.74277
      },
      {  
         "latitude":36.464085,
         "longitude":139.74277
      }
   ],
   "gps_coords":{  
      "latitude":36.464085,
      "longitude":139.74277
   },
   "created_at":"Fri Apr 26 14:08:41 BST 2019",
   "text":"Hello World",
   "id":"1121763025612988417",
   "type":"tweet"
}
 

The above JSON will generate an AVRO schema somewhat similar to below….

{  
   "type":"record",
   "name":"outer_record_1952535649249628006",
   "namespace":"org.talend",
   "fields":[  
      {  
         "name":"geo_bounding_box",
         "type":{  
            "type":"array",
            "items":{  
               "type":"record",
               "name":"subrecord_1062623557832651415",
               "namespace":"",
               "fields":[  
                  {  
                     "name":"latitude",
                     "type":[  
                        "double",
                        "null"
                     ]
                  },
                  {  
                     "name":"longitude",
                     "type":[  
                        "double",
                        "null"
                     ]
                  }
               ]
            }
         }
      },
      {  
         "name":"gps_coords",
         "type":"subrecord_1062623557832651415"
      },
      {  
         "name":"created_at",
         "type":[  
            "string",
            "null"
         ]
      },
      {  
         "name":"text",
         "type":[  
            "string",
            "null"
         ]
      },
      {  
         "name":"id",
         "type":[  
            "string",
            "null"
         ]
      },
      {  
         "name":"type",
         "type":[  
            "string",
            "null"
         ]
      }
   ]
}
 

The autogenerated “record types” will almost certainly be named differently to the above. In the blog I previously wrote about this subject (link to blog), I describe how the above schema may need to be adjusted to make it as bullet proof as possible.

First, let’s look at the type named “subrecord_1062623557832651415”. This is used in two places: the “geo_bounding_box” and the “gps_coords” records. These are exactly the same, so let’s give them a more meaningful name. I have chosen “gps_coordinates” and have changed this in both locations.

Since we have changed the above record type to give it a meaningful name, let’s do the same with the name of the outer record as well. This is called “outer_record_1952535649249628006”; I’ll change this to “geo_data_object”.

These changes can be seen below (in red)….

{  
   "type":"record",
   "name":"geo_data_object",
   "namespace":"org.talend",
   "fields":[  
      {  
         "name":"geo_bounding_box",
         "type":{  
            "type":"array",
            "items":{  
               "type":"record",
               "name":"gps_coordinates",
               "namespace":"",
               "fields":[  
                  {  
                     "name":"latitude",
                     "type":[  
                        "double",
                        "null"
                     ]
                  },
                  {  
                     "name":"longitude",
                     "type":[  
                        "double",
                        "null"
                     ]
                  }
               ]
            }
         }
      },
      {  
         "name":"gps_coords",
         "type":"gps_coordinates"
      },
      {  
         "name":"created_at",
         "type":[  
            "string",
            "null"
         ]
      },
      {  
         "name":"text",
         "type":[  
            "string",
            "null"
         ]
      },
      {  
         "name":"id",
         "type":[  
            "string",
            "null"
         ]
      },
      {  
         "name":"type",
         "type":[  
            "string",
            "null"
         ]
      }
   ]
}
 

Once the above changes have been made, there is one more change required to enable nullable fields when serializing and deserializing using Java and Python. You should notice that there are fields where the “type” has two options in a square bracket. These are known as “Unions”. These essentially tell AVRO what data to expect. Pipeline Designer has output these with the data type (string or double) preceding the “null”. We need to change these around to ensure that these fields can be nullable. So, the above schema will change to what is seen below…

{  
   "type":"record",
   "name":"geo_data_object",
   "namespace":"org.talend",
   "fields":[  
      {  
         "name":"geo_bounding_box",
         "type":{  
            "type":"array",
            "items":{  
               "type":"record",
               "name":"gps_coordinates",
               "namespace":"",
               "fields":[  
                  {  
                     "name":"latitude",
                     "type":[  
                        "null",
                        "double"
                     ]
                  },
                  {  
                     "name":"longitude",
                     "type":[  
                        "null",
                        "double"
                     ]
                  }
               ]
            }
         }
      },
      {  
         "name":"gps_coords",
         "type":"gps_coordinates"
      },
      {  
         "name":"created_at",
         "type":[  
            "null",
            "string"
         ]
      },
      {  
         "name":"text",
         "type":[  
            "null",
            "string"
         ]
      },
      {  
         "name":"id",
         "type":[  
            "null",
            "string"
         ]
      },
      {  
         "name":"type",
         "type":[  
            "null",
            "string"
         ]
      }
   ]
}
 

The above change is VERY important, so don’t forget that one.

Once you have got to this point, you are ready to start building/collecting and serializing your JSON using Talend Studio.

The next blog I will write on this subject will demonstrate how to use the above schema to serialize Tweets with location data, send them to AWS Kinesis, consume and process the data using Pipeline Designer, and then send that data to Elasticsearch so that a real-time heatmap of worldwide Twitter data can be generated.

The post Creating Avro schemas for Pipeline Designer with Pipeline Designer appeared first on Talend Real-Time Open Source Data Integration Software.

Talend and Qubole Serverless Platform for Machine Learning: Choosing Between a Cab vs Your Own Car


Before going into the world of integration, machine learning, etc., I would like to discuss a scenario many of you might experience if you live in a megacity. I lived in the London suburbs for almost 2 years (and it’s a city quite close to my heart too), so let me use London as this story’s background. When I moved to London, one question which came to my mind was whether I should buy a car or not. The public transport system in London is dense and amazing (Oh!!! I just love the London Underground and I miss it in Toronto). Occasionally, I really needed a car, especially when I was traveling and had to bring heavy bags to the airport. The question was: did I need to spend a considerable amount of money on a down payment for a car, insurance, maintenance costs, queueing at gas stations, vacuuming the car, getting a parking space…for the few times that I really needed a car? I decided against it; it was easier and cheaper to call an Uber or the famous London black cab and finish the job quickly, rather than trying to sort out all the things mentioned in the previous long list 😊.

Talend and Qubole Serverless Platform

The choice of a cab vs your own car in Big Data processing

            Coming back to the world of Big Data and machine learning, the question remains the same! Do we really need an array of heavy-duty servers running 24/7, with a huge army of engineers, to manage big data and machine learning processing or can we do something different?  This thought has led to the concept of serverless processing in the Big Data arena, where you can save the costs related to computation associated with idle clusters. The new technology also helps in automatic management of cluster upscaling, downscaling, and rebalancing based on various factors like the context of the workload, SLA, and priority for each job running on the cluster.

            Talend is actively collaborating with industry leaders in big data and machine learning serverless technology. In this blog, I am going to tell the story of the friendship between Talend and Qubole.

Tell me more about Qubole

Many readers who have yet to get into this space of IT might not have heard about Qubole.  Qubole is one of the market leaders in serverless big data technology:

Qubole provides you with the flexibility to access, configure, and monitor your big data clusters in the cloud of your choice. Data users get self-service access to data using their interface of choice. Users can query the data through the web-based console in the programming language of choice, build integrated products using the REST API, use the SDK to build applications with Qubole, and connect to third-party BI tools through ODBC/JDBC connectors

Talend and Qubole are a good example of the phrase “match made in heaven.” Talend helps customers build complex data jobs and pipelines through its signature graphical user interface, and Qubole automatically handles the infrastructure in a seamless fashion.

(Picture courtesy: Qubole)

 

How can you perform machine learning tasks using Talend and Qubole?

Many of you might be thinking that you have heard these types of stories about seamless data integration numerous times in your IT career. The million-dollar question in your mind might be whether machine learning data processing using Talend and Qubole is really seamless. The answer is an emphatic yes.

Instead of explaining the theory, let us create a quick Talend Job and see the steps involved in the flow. The prerequisites and steps for setting up a Qubole account, which will use Amazon Web Services (AWS) to interact with Talend 7.1, are described in detail in the Qubole documentation link.

Our story will start from the point where a Cluster has been created in Qubole where we can run Spark jobs. The examples for both a Talend Standard job and a big data Batch job using Spark are also available in the Qubole documentation, but our interest is to see how we can create a machine learning Job easily using the two tools.

Once the cluster is created for Spark processing, its status can be verified from the Qubole dashboard as shown below.

Talend and Qubole Serverless Machine Learning Platform Screenshot 3

In this blog, I am going to use a simple Zoo data set containing a classification of animals provided by UCI (University of California, Irvine) for prediction.

Acknowledgements

UCI Machine Learning: https://archive.ics.uci.edu/ml/datasets/Zoo

Source Information — Creator: Richard Forsyth — Donor: Richard S. Forsyth, 8 Grosvenor Avenue, Mapperley Park, Nottingham, NG3 5DX, 0602-621676 — Date: 5/15/1990

 

 

The dataset provided was split into two groups for our example Job. The main dataset, zoo_training.csv, will be used for training the model, and the second dataset, zoo_predict.csv, will be used as the input for prediction. A third file, class.csv, will act as a lookup file to get the description of each code value of the animal categories. All three datasets will be loaded to S3 as shown below.

 The next step is to create a Talend Job to process the files and to get the prediction output. In three easy stages, we will be able to create the Job as shown below.

 

  1. The first stage captures the configuration of S3 from where we will be reading the data.
  2. The data processing for machine learning model training and decision tree creation is done in the second stage.
  3. The third stage reads the input data, calculates the prediction, performs the lookup for each prediction value code, and prints the output to the console.

For those who are curious to try the sample Job shown in the blog, please download the attached Talend Job and sample input files (Click here to download).

The configuration of Qubole in the Talend Job is quite easy, as shown below.

 

The input and output values for the prediction stage are shown below.

  • Input file for Prediction

  • Output from Talend job using Qubole

Everything happened like a breeze, right? Now imagine the effort you would have to spend if you had to create a big data cluster of your own and build the same machine learning logic by hand coding. I am sure the long list of tasks for owning a car in London is coming to your mind, and how calling a cab can make your life easy.

Is this the end of the story?

Absolutely not! We just saw the fairy-tale ending of the story for machine learning flow development using Talend and Qubole. But like any Marvel movie, let me give you some post-credit scenes to keep up the interest until I complete the next blog post in this series. What we have done is just create a sample Job. We still have to get answers to a lot of other questions, like how we can operationalize the Job through methods other than Talend Cloud. Did someone hear the word “Docker”?? Is it possible to build a continuous integration flow to move the Jobs seamlessly? We will soon meet again in the next part of the blog to find the answers.

 

The post Talend and Qubole Serverless Platform for Machine Learning: Choosing Between a Cab vs Your Own Car appeared first on Talend Real-Time Open Source Data Integration Software.

How to Unlock Your SAP Data Potential for Accelerated Analytics – Part 1


Many SAP customers have been running SAP on premises for decades and have struggled to harness the full potential of the business process data inside SAP, along with other enterprise and external data, to gain augmented insight and become more agile in this digital era, where everything keeps moving at an exponential pace with no sign of slowing down.

Talend has been helping customers tap into the huge potential of their SAP, Salesforce, Marketo, SaaS data and more for many years. Now with cloud modernization projects, customers can accelerate their way to cloud analytics for their SAP data while benefiting from the huge cost savings of running their analytics processing in the cloud.

 

Brewing Up The Best Customer Experience at AB InBev

AB InBev is the world’s largest beer maker, with a diverse portfolio of over 500 beer brands, including Budweiser, Corona, Stella Artois, Beck’s, Hoegaarden, and Leffe. Because the company has grown through acquisitions, its data ecosystem comprised Salesforce, 15 SAP instances, 27 ERP systems, 23 ETL tools, and a host of brewers operating as independent entities with their own internal systems. When Harinder Singh assumed leadership for data strategy and solution architecture at AB InBev in 2017, he set out to change things—starting with the creation of an enterprise data hub.

AB InBev extracts data from a vast range of sources to better understand customer tastes, improve store and bar experiences, optimize the supply chain, support product development, and more. With Microsoft Azure Cloud and Talend Data Fabric, AB InBev consolidated its data into a single repository and optimized its data pipelines. This strategy has freed up the data science team from menial, time-consuming tasks so that they can focus on testing new data models and analyzing data. The increased speed has had a major impact on production and demand forecasting, which in turn helps with operations planning. All of the newly gained insights also feed into AB InBev’s new product development, enabling the teams to create best-selling beverages from the get-go.

Harinder Singh presented their digital transformation journey at the Strata Data Conference in March, and what he said about empowering self-service analytics for the business was an aha moment for many in the room: by the fourth time a business person asks a question, he or she often comes up with a new business idea or innovation.

Unlocking your SAP data for Cloud Analytics

Talend provides out of the box visual components to connect to your SAP data and to your preferred cloud data warehouse so you can easily offload data for faster and more cost-effective analytics.

Users can integrate data between SAP systems and any other business-critical applications. By integrating the power of SAP interfaces (IDoc, BAPI or Data Extractor), Talend facilitates complex data-related tasks; any module of the enterprise SAP structure can be accessed.

  • Reading data from the SAP environment at any level: from tables, BAPIs or IDocs. These operations include, for example, extracting a partial data set for reuse in a third-party application.
  • Transforming data from or to any SAP or non-SAP system: transformations include data quality operations such as deduplication or aggregation over heterogeneous SAP data environments or over data to be integrated into an SAP system.
  • Integrating data in any format: data may come from databases, binary files or flat files, or web streams, etc.

Talend’s SAP-dedicated features include:

  • A set of connectors for reading and writing; these connectors enable efficient development and deployment of bi-directional data transformations from and to SAP systems, and between SAP and non-SAP systems.
  • Wizards that simplify the function calling as well as the data handling through a convenient visual interface. This allows users to reuse their preferred settings in all their jobs, dramatically accelerating and facilitating development and maintenance operations.
  • Native support of SAP at any level: Table, IDOC and BAPI level (including custom BAPI, ZBAPI, Table and Ztable).

Some of the more recent additions to Talend SAP components include:

  • SAP bulk extraction, allowing you to pull large amounts of batch data out of SAP Business Suite and SAP S/4HANA.
  • A Business Content Extractor delivering semantic views on top of SAP data sources; it also allows you to integrate with SAP Business Warehouse without needing external logic.

 

Learn more about our components for SAP and Redshift, and components for SAP and Microsoft Azure.

For more information about integrating your enterprise data, take a look at our Definitive Guide to Data Integration

 

The post How to Unlock Your SAP Data Potential for Accelerated Analytics – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.

Seacoast Bank is banking on higher-quality data


Chartered in 1926, Seacoast Bank is the operating subsidiary of Seacoast Banking Corporation of Florida, which is one of the largest community banks headquartered in the state.  

Seacoast Bank improves data quality

Seacoast Bank relies heavily on data to be able to provide customers the best solution for their needs, and to develop a deeper understanding of who customers are and how they want to work with their bank. But being heavily regulated, Seacoast Bank also understands the need for trusted data.

Delivering data quality and compliance 

Early in the process of building a data quality framework, Seacoast engaged Talend Professional Services to help define requirements, design an architecture, install components, and provide training and technical support. The framework specifies data quality activities at each stage of a data lifecycle whose steps are to discover, profile, measure, monitor and remediate data. 

The index, which is part of Seacoast’s overarching data governance program, is an innovative approach that starts with an out-of-the-box solution—Talend Data Quality—and adds customized tables and structures to produce a tool that makes a complex topic like data quality understandable to business users throughout the organization.

 Measuring data quality with an innovative data quality index  

 The index measures data quality on six dimensions: accuracy, completeness, conformity, consistency, uniqueness, and validity. Using a scale of 0 to 100, the index is a single measure that provides an overall indication of the level of data quality at Seacoast. It also measures data quality over time to track how it improves or degrades as the bank acquires other banks, and as data sources, processes, and the technical environment change. 

 In the loan area, for instance, the framework automates a suite of tests Seacoast Bank performs. And in the credit department, the bank has automated many of the data quality rules, which frees the credit administration staff to concentrate on the credit-worthiness of the portfolio rather than on monitoring data quality. 

 Talend Data Quality helps Seacoast Bank be responsible data caretakers who can rely on high-quality data to better know and serve their customers. 

The result: a full-time employee’s time saved by automating data quality tasks.

Seacoast Bank is banking on a data quality index to measure data quality on six dimensions and track how it improves or degrades as the bank acquires other banks, and as data sources, processes, and the technical environment change. 

Learn more about Seacoast Bank’s story and how they were able to achieve this approach to data quality. 

The post Seacoast Bank is banking on higher-quality data appeared first on Talend Real-Time Open Source Data Integration Software.

Generating a Heat Map with Twitter data using Pipeline Designer – Part 1


For me, the most exciting thing about Pipeline Designer is the way that it makes working with streaming data easy. Traditionally this has required a completely different way of thinking if you have come from a “batch” world. So when Pipeline Designer was released, the first thing I wanted to do was to find a good streaming data source and do something fun and interesting with the data.

Twitter was my first choice of streaming data. The data is easy to acquire, constantly being produced and is not limited to a specific genre or domain. The challenge I set myself was to build a solution using Talend products to acquire and process the Twitter data, and AWS tools to store and present the data. This is the first of a couple of blogs I will write to demonstrate exactly how I have taken the Twitter data, processed it and presented it in a heat map.

 

The tools and services I will be using for this project are as follows…

Twitter  – To supply the data

Talend ESB – To retrieve the data, serialize it into an Apache Avro schema, and send it to AWS Kinesis

AWS Kinesis – To collect and stream the data to Talend Pipeline Designer

Talend Pipeline Designer – To process the data and output it to an AWS Elasticsearch Service

AWS Elasticsearch Service – To analyse the data and produce the heat map

 

In this first blog, I will focus on acquiring the data from Twitter, serializing it to Apache Avro and sending it to AWS Kinesis.

 

Creating a Twitter App

The first thing we need to do is to configure a developer account with Twitter. I could spend ages taking screenshots of how I did this, only for Twitter to change the process in a few months’ time. So instead of doing that, I will give you the objectives you need to achieve with Twitter and list where you can get the information using Twitter’s own documentation.

The first step is to apply for a Twitter developer account.

The second step is to create a Twitter App and generate tokens for your app. Once you’ve read the linked page and followed a few links from there, you may still be a little confused about the settings you need to set, so this short step-by-step guide should fill in the blanks. Hopefully the gist of this will not change too much if Twitter does evolve its developer environment.

  1. Click on the “Create an app” button to reveal the following screen. The fields you need to populate (shown in the image) are the “App name”, the “Application description”, and the “Website URL”. You can essentially use anything you want for these values. The “Website URL” can be completely made up. However, the “App name” must be unique.

    The bottom half of the “Create an app” form can be seen below. I’ve only filled in the required fields. In the bottom half of the form only an application description is required.  You can then click on “Create”.

  2. Assuming that everything you entered is OK, you will see the next screen. If there was a problem, you will need to fix the problem before getting to this screen.

    At the top of the screen you will see a link with the title “Keys and tokens”. Click that.

  3. This is where you can configure your keys and tokens. These are the whole point of going through this process of creating a Twitter app. They will be used as Context Variable values for the Route we will create.

    The “API key” and the “API secret key” will already be generated. You can regenerate these if you wish. However, before you are finished with this process you will need to create your “Access token & Access token secret”. Click on the “Create” button to generate these. Once finished, you will see the screen below.

    Copy these keys and tokens ready to be used later. These MUST be kept secret otherwise you run the risk of somebody being able to attack your Twitter account.

 

Configuring an AWS Kinesis Stream

If the Twitter data stream is the source for this subset of the project, an AWS Kinesis stream is the target. Since it always makes sense to get your source and target configured before working on the “bit in the middle”, we will configure our AWS Kinesis stream before I get to the Talend ESB Route, which will join the dots. First you will need an AWS account. As with the section on configuring Twitter, I will point you to official documentation on this here.

The next thing you need to do is to create your Kinesis Stream. For this I will point you towards the AWS documentation, but I will also share some screenshots of what I did to configure mine. It is pretty straightforward, and hopefully those of you who already have AWS accounts will be fully configured in the time it takes you to read this section.

 

  1. Click on the “Services” link (in the blue box) at the top of the AWS dashboard to reveal the screen you see below. Then click on the “Kinesis” link (in the red box).

  2. Then select your region (where the yellow box is). I have selected London here. Once you have selected that, click on the “Create data stream” button (in the red box).

  3. We now have to fill out the “Kinesis stream name” (the red box), the “Number of shards” (the blue box) and click on “Create Kinesis stream” (the green box). I have chosen 1 shard for this project as that is all it will need for the data we will be processing. 
  4. Our Kinesis stream is now configured. Remember the stream name and the region for when we are creating the Talend Route. We will also need this information when we get to creating our Pipeline in the next blog.
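
As an aside, if you prefer creating the stream from code rather than from the console, the AWS SDK for Java used later in this post can do it as well. A minimal sketch, assuming an AmazonKinesis client has already been built with your credentials and region (as in the AWSKinesisUtils routine shown further down) and using a made-up stream name:

//Hypothetical: create a one-shard stream programmatically with the AWS SDK for Java 1.11.x
amazonKinesis.createStream("twitter_stream", 1); //stream name, shard count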

Now that we have our source and target configured, we can look at building our Talend Route.

 

Creating a Talend Route to send Tweet data to our Kinesis Stream

There are 4 processes we need to carry out with this Talend Route:

  1. Retrieve a stream of Tweets from Twitter
  2. Convert those Tweets into a JSON String
  3. Serialize that JSON String into Apache Avro
  4. Send the Apache Avro data message to our AWS Kinesis stream

Before I started writing this blog I realised that Avro serialization was a potentially massive subject. I have written a couple of blogs on this already. I will be referring to those blogs in this section, as the JSON format I shall be using here is the format I used in those blogs. I will link to them when they are needed, but you can also see the links below in case you are interested in getting an understanding before we get to that point in this project.

The second link is where I talk about the JSON format we shall be using here.

As far as this Talend Route is concerned, it is pretty simple. It consists of 3 components and 2 code routines. I will start by explaining the code routines. Once they are explained, we have all of the pieces we need and can just fit them together in a few easy steps.

 

Code Routines

There are two code routines I use with this Talend Route. 

I have re-used a code routine from the Talend Pipeline Designer – Avro schema considerations when working with Java and Python blog. This will be used unchanged, so I won’t go over this again. This routine is called “AVROUtils”.

In order to send messages to AWS Kinesis I have created a new code routine. I will describe this routine here.

 

AWSKinesisUtils

The “AWSKinesisUtils” routine is a relatively basic routine which simply creates a connection to our AWS Kinesis stream and sends messages to it. The routine can be seen below.

Create a routine by the same name and add the following code….

 

package routines;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.*;

import java.nio.ByteBuffer;
import java.util.List;


public class AWSKinesisUtils {

        static BasicAWSCredentials credentials = null;
        static AmazonKinesis amazonKinesis = null;
        
        
        /**
         * setAWSCredentials: set the "credentials" static object
         * 
         * 
         * {talendTypes} String
         * 
         * {Category} User Defined
         * 
         * {param} string("JHGSAHJTR%^%TRYJSJ") accessKey: The AWS Access Key.
         * {param} string("JHGSAHJHGFHR$%£RTIUTTR%^%TRYJSJ") secretKey: The AWS Secret Key.
         * 
         * {example} setAWSCredentials("JHGSAHJTR%^%TRYJSJ","JHGSAHJHGFHR$%£RTIUTTR%^%TRYJSJ") 
         */
        public static void setAWSCredentials(String accessKey, String secretKey){
        	credentials = new BasicAWSCredentials(accessKey, secretKey);      	
        }
        
        
        /**
         * createAmazonKinesisConnection: creates and stores the "amazonKinesis" static object
         * 
         * 
         * {talendTypes} String
         * 
         * {Category} User Defined
         * 
         * {param} string("eu-west-2") region: The AWS Region name, e.g. "eu-west-2".
         * 
         * {example} createAmazonKinesisConnection("eu-west-2") 
         */
        public static void createAmazonKinesisConnection(String region){
        	
        	amazonKinesis = AmazonKinesisClientBuilder
                    .standard().withRegion(region)
                    .withCredentials(new AWSStaticCredentialsProvider(credentials))
                    .build();
        }
        
        /**
         * putMessage: adds message to Kinesis stream
         * 
         * 
         * {talendTypes} String
         * 
         * {Category} User Defined
         * 
         * {param} byte[]("#EA132EA") rawMessage: A message as a byte array
         * {param} string("twitter_stream") streamName: The AWS Stream Name.
         * 
         * {example} putMessage(rawMessage, "twitter_stream") 
         */
        public static void putMessage(byte[] rawMessage, String streamName) {
        	
            
            PutRecordRequest putRecordRequest = new PutRecordRequest();
            putRecordRequest.setStreamName(streamName); 
            putRecordRequest.setPartitionKey("filler");
            putRecordRequest.withData(ByteBuffer.wrap(rawMessage));
            PutRecordResult putRecordResult = amazonKinesis.putRecord(putRecordRequest);

 
        }
    }

This routine requires the following Java libraries. Some of these can be found packaged with Talend Studio 7.1. Unfortunately not all are packaged. I will give sources for the ones that are not packaged. You may find alternative versions packaged with Talend, but they are not guaranteed to work. The Jars I am listing are guaranteed to work.

All of the Jars are listed below. The Jars that are not included are listed as links to where they can be obtained…

  • httpcore-4.4.9.jar (present in Talend v7.1)
  • httpclient-4.5.5.jar (present in Talend v7.1)
  • jackson-core-2.9.8.jar (not included)
  • jackson-databind-2.9.8.jar (not included)
  • jackson-annotations-2.9.8.jar (not included)
  • aws-java-sdk-core-1.11.333.jar (not included)
  • aws-java-sdk-kinesis-1.11.333.jar (not included)
  • aws-java-sdk-1.11.333.jar (not included)
  • jackson-dataformat-cbor-2.9.8.jar (not included)
  • joda-time-2.9.jar (present in Talend v7.1)

 

Once this routine has been created and the Jars have been added to it, we are ready to go. Remember to also create the “AVROUtils” routine mentioned above, if you have not done so already.
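
Before wiring the routine into the Route, you can give it a quick smoke test from any tJava component. A minimal sketch, assuming the Context Variables described in the next section have been populated and the Kinesis stream already exists:

//Smoke test for the AWSKinesisUtils routine (assumes valid credentials, region and an existing stream)
routines.AWSKinesisUtils.setAWSCredentials(context.AWSAccessKeyID, context.AWSSecretAccessKey);
routines.AWSKinesisUtils.createAmazonKinesisConnection(context.AWSRegion);
routines.AWSKinesisUtils.putMessage("hello from Talend".getBytes(), context.AWSKinesisStream);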

 

Talend Route

As I mentioned earlier, this is a pretty simple route. The layout can be seen in the screenshot below.

 

Context Variables

Before discussing the components numbered above, I will briefly talk about the Context Variables that you will need for this. The code displayed in the next section will refer to several Context Variables which must be configured before they are used. The Context Variables used by this Route can be seen below.

You will notice that some of the values have been blanked out. This is because I do not want anyone to be “borrowing” my resources 🙂

Each of the Context Variables I use are explained below.

  • ConsumerKey: The Twitter Consumer Key created in the Twitter app section above
  • ConsumerSecret: The Twitter Consumer Secret created in the Twitter app section above
  • AccessToken: The Twitter Access Token created in the Twitter app section above
  • AccessTokenSecret: The Twitter Access Token Secret created in the Twitter app section above
  • Locations: A set of GPS coordinates specifying a bounding box location; -180, -90; 180, 90 specifies the whole world
  • AWSAccessKeyID: The AWS Access Key for your AWS account
  • AWSSecretAccessKey: The AWS Secret Access Key for your AWS account
  • AWSRegion: The AWS region in which your Kinesis stream is located
  • AWSKinesisStream: The name of the AWS Kinesis stream we set up above

 

Component Configuration

The configuration of each of the components used in the Route is detailed below.

  1. cConfig_1
    This component is used to set up an AWS Kinesis connection and to set the Avro schema to be used. You can simply copy the code below, but you may want to play around with this project to achieve a slightly different outcome, so I will explain what is taking place below.

    routines.AVROUtils.setSchema("{  \r\n   \"type\":\"record\",\r\n   \"name\":\"geo_data_object\",\r\n   \"namespace\":\"org.talend\",\r\n   \"fields\":[  \r\n      {  \r\n         \"name\":\"geo_bounding_box\",\r\n         \"type\":{  \r\n            \"type\":\"array\",\r\n            \"items\":{  \r\n               \"type\":\"record\",\r\n               \"name\":\"gps_coordinates\",\r\n               \"namespace\":\"\",\r\n               \"fields\":[  \r\n                  {  \r\n                     \"name\":\"latitude\",\r\n                     \"type\":[  \r\n                        \"null\",\r\n                        \"double\"\r\n                     ]\r\n                  },\r\n                  {  \r\n                     \"name\":\"longitude\",\r\n                     \"type\":[  \r\n                        \"null\",\r\n                        \"double\"\r\n                     ]\r\n                  }\r\n               ]\r\n            }\r\n         }\r\n      },\r\n      {  \r\n         \"name\":\"gps_coords\",\r\n         \"type\":[\"null\",\"gps_coordinates\"]\r\n      },\r\n      {  \r\n         \"name\":\"created_at\",\r\n         \"type\":[  \r\n            \"null\",\r\n            \"string\"\r\n         ]\r\n      },\r\n      {  \r\n         \"name\":\"text\",\r\n         \"type\":[  \r\n            \"null\",\r\n            \"string\"\r\n         ]\r\n      },\r\n      {  \r\n         \"name\":\"id\",\r\n         \"type\":[  \r\n            \"null\",\r\n            \"string\"\r\n         ]\r\n      },\r\n      {  \r\n         \"name\":\"type\",\r\n         \"type\":[  \r\n            \"null\",\r\n            \"string\"\r\n         ]\r\n      }\r\n   ]\r\n}");
    
    
    routines.AWSKinesisUtils.setAWSCredentials(context.AWSAccessKeyID, context.AWSSecretAccessKey);
    routines.AWSKinesisUtils.createAmazonKinesisConnection(context.AWSRegion);

    The first block of code is where the Avro schema required to serialize the JSON is set. The schema that we will be using for this project is the one that I described (and showed how to generate) here. The difference between the schema built in the linked blog and the code you can see above is that the schema text has been "escaped" so that it can be used in a Java String. A nice tool for doing that can be found here.

    The next two lines of code are used to configure the AWS credentials and to create an Amazon Kinesis connection. All of the code in this component makes use of the routines described above.

     

  2. cMessagingEndpoint_1
    This component is configured using both the “Basic settings” and “Advanced settings”. The cMessagingEndpoint component allows us to use any of the Apache Camel Components. With this component, we are using the Twitter Apache Camel Component. The following screenshot shows the “Basic settings”.


    Notice that the URI is actually a piece of Java code generating a String making use of Context Variables. The endpoint that is used can be copied from below.

    "twitter://streaming/filter?consumerKey="+context.ConsumerKey+"&consumerSecret="+context.ConsumerSecret+"&accessToken="+context.AccessToken+"&accessTokenSecret="+context.AccessTokenSecret+"&locations="+context.Locations
    

    The above endpoint will return a Twitter stream of messages which have location data within the boundary specified by the Locations context variable. For this project, I have set the boundary to be the whole world.

    The next screenshot shows the “Advanced settings” of this component.

    Here we simply click on the green plus button and select “twitter”. This is used to add the appropriate library for the Camel Component we wish to use.

  3. cProcessor_1
    This component is used to dissect the content of each message from the cMessagingEndpoint. Each message will hold Twitter data in a Twitter4J object. This data is extracted and built into a JSON object which matches the Avro schema we used in the cConfig_1 component. Since the cProcessor component is essentially a container for Java code, a screenshot would not add much here; instead, I will simply show the Java that is used in its "Code" section.

    The code below should be commented well enough for you to figure out what is going on. However, I will summarise here. The first thing that is done is to retrieve the Twitter4J Status object from the Apache Camel Exchange object. The rest of the code is used to build a JSONObject which holds the data that we require to meet the Avro schema that we are working to. The fields that are retrieved from Twitter4J Status are the Place, GeoLocation, CreatedAt, Text and Id fields. These are all explained in the Twitter4J documentation linked above.

    After the JSONObject has been created, it is printed to the output window as a String so that we can check it looks OK (this can be commented out later). It is then serialized using the AVROUtils routine and finally sent to our AWS Kinesis stream using the AWSKinesisUtils routine.

    Code

    //Get access to the Twitter4j Status object from the Exchange
    twitter4j.Status tweet = exchange.getIn().getBody(twitter4j.Status.class);
    
    //Create new JSON object
    JSONObject json = new JSONObject();
    
    //Create the Geo Bounding Box JSON Array
    JSONArray jsonGeoBoundingBox = new JSONArray();
    
    if(tweet.getPlace()!=null){
    	if(tweet.getPlace().getBoundingBoxCoordinates().length==1){
    		
    		for(int i = 0; i<tweet.getPlace().getBoundingBoxCoordinates()[0].length; i++){
    			JSONObject point = new JSONObject();
    			JSONObject lat = new JSONObject();
    			lat.put("double", tweet.getPlace().getBoundingBoxCoordinates()[0][i].getLatitude());
    			JSONObject lon = new JSONObject();
    			lon.put("double", tweet.getPlace().getBoundingBoxCoordinates()[0][i].getLongitude());
    			point.put("latitude",lat);
    			point.put("longitude",lon);
    			jsonGeoBoundingBox.put(point);
    		}
    		
    	}
    }
    
    //Add the Geo Bounding Box to the JSON object
    json.put("geo_bounding_box",jsonGeoBoundingBox);
    
    //Create a JSON object to hold Coords 
    JSONObject jsonCoords = new JSONObject();
    
    //If the Tweet has a GeoLocation add to the gps_coords object
    if(tweet.getGeoLocation()!=null){
    	JSONObject lat2 = new JSONObject();
    	JSONObject lon2 = new JSONObject();
    	lat2.put("double", tweet.getGeoLocation().getLatitude());
    	jsonCoords.put("latitude", lat2);	
    	lon2.put("double", tweet.getGeoLocation().getLongitude());
    	jsonCoords.put("longitude", lon2);
    	JSONObject coordComplexType = new JSONObject();
    	coordComplexType.put("gps_coordinates",jsonCoords);
    	json.put("gps_coords",coordComplexType);
    }else{ //Add an empty gps_coords object
    	json.put("gps_coords",JSONObject.NULL);
    }
    
    //Add a created_at object
    JSONObject createdAt = new JSONObject();
    createdAt.put("string", tweet.getCreatedAt());
    json.put("created_at", createdAt);
    
    //Add a text object
    JSONObject text = new JSONObject();
    text.put("string", tweet.getText());
    json.put("text", text);
    
    //Add a type object
    JSONObject type = new JSONObject();
    type.put("string", "tweet");
    json.put("type", type);
    
    //Add an id object
    JSONObject id = new JSONObject();
    id.put("string", tweet.getId()+"");
    json.put("id", id);
    
    //Print the String JSON to the output window
    System.out.println(json.toString());
    
    //Serialize the JSON to an AVRO byte array
    byte[] rawMessage = routines.AVROUtils.jsonToAvroWithoutSchema(json.toString());
    
    //Send the AVRO byte array to Kinesis
    routines.AWSKinesisUtils.putMessage(rawMessage, context.AWSKinesisStream);
    

 

Once you have got to this point, you are ready for the next stage, which is to build the Pipeline that consumes the data. I will talk about this in my next blog. However, since you now have a streaming source which can easily be consumed by Pipeline Designer, you may want to give it a try and see what you can produce with this data.

If you have any questions related to this blog, please feel free to raise them below. I will check periodically to ensure that I answer as many questions as I can. Alternatively, you can raise your questions in the Pipeline Designer board on Talend Community.

The post Generating a Heat Map with Twitter data using Pipeline Designer – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.


Data Readiness and Quality: The Big New Challenges for all Companies

$
0
0

We live in a digital age which is increasingly being driven by algorithms and data. All of us, whether at home or work, increasingly relate to one another via data. It’s a systemic restructuring of society, our economy and institutions the like of which we haven’t seen since the industrial revolution. In the business world we commonly refer to it as digital transformation. 

In this algorithmic world, data governance is becoming a major challenge. Data is fueling the entire transformative process which means good data quality is vital. Without it, we could find ourselves running into problems of process execution and bad customer service.

This chart illustrates this new landscape. Business relies on technology, processes and people. Data governance lies at the heart of these interactions. It is the cog in the wheel which makes everything else go round. If it has a problem, the entire system breaks down.

 

All of this is very easy to say, in theory, but what does it mean and how will we deal with it in practice?

 

What is data governance?

The standard definition of data governance refers to the ability to manage every aspect of enterprise data, including integrity, security, availability and usability. Data sources are everywhere, in many formats and many standards, and we need to extract and process that data into insights that feed daily enterprise decisions.

I’m currently conducting research with Talend (NASDAQ: TLND), a global leader in data integration, to investigate how companies can harness the benefits of a comprehensive data governance strategy in an extended scenario where data drives business decisions. If those decisions are bad it means poor outcomes, higher costs and lower revenue.

During our collaboration, I discovered many aspects of the data governance framework that ambitious companies need to refer to and many real-life case studies of companies that rely on data governance as a strategic asset. As usual, I like to look at how data governance is being applied in the real world. So, let’s go through it industry by industry.

 

Data Governance in the Financial Services

EURONEXT is a European stock exchange that combines five stock exchanges (the Netherlands, Belgium, Ireland, Portugal and France) and deals with more than 100TB of transactions. You can imagine the huge data problem they are dealing with.

Previously, to analyse all that data they had to wait six to twelve hours after closing. Using Talend Big Data and Talend Data Catalog, they boosted their performance while maintaining integrity and availability and, most importantly, meeting the demands of this highly regulated market.

 

Data Governance in Energy

UNIPER is a global energy provider that generates, markets and trades energy on a large scale. They are dealing with more than 100 data sources, including IoT devices, from various external and internal sources. Using Talend Data Catalog and Talend Cloud Real-time Big Data, they achieved astounding results: the cost of integrating data was reduced by 80% and the speed of integrating data increased by 75%. Those are some seriously impressive numbers.

 

Data Governance and Bookmakers

PointsBet is an online bookmaker in Australia offering both traditional fixed odds markets and private betting. As they prepared to launch their online sports betting products in the United States, they were concerned about the time needed to become fully operational. Using Talend Cloud Data Integration, they managed to go live within days, in line with the company's plan. Now that's what I call availability!

 

Data Governance and Pizza

Of course, some of you may be thinking, 'Antonio is Italian and he will not be able to resist talking about pizza', and you'd be right. But how can data work with pizza? Well, it can if we talk about one of the biggest names in pizza delivery, which is leveraging data to raise its performance and customer experience.

I’m talking about none other than Domino’s, the largest pizza delivery company in the world. They are pursuing an ambitious omni-channel strategy to allow their customers to order pizza from any device they like including phone, PC, tablet, smartwatch or social media. They are striving to offer a truly distinctive customer experience which is why they turned to Talend Data Fabric and Talend Data Quality. Using these, they can manage 85,000 data sources of structured and unstructured data and 17TB of data while enhancing customer engagement and business performance.

What a pizza!

But please, try to refrain from eating pizza with cappuccino. It’s not fair to Italians.

 

In conclusion, whether you work in energy, financial services or pizza, you need to handle huge amounts of data, all of which must be available, secured, usable and integrated. A good data governance framework can enhance your business performance across the board, and your customers will be the first to feel the benefits. If they're happy, your revenue will start to look pretty good too.

So I’d like to finish by giving a big ‘thank you’ to Talend for all the material they’ve provided me for this research. It has really helped me demonstrate the importance of a good data governance strategy in the world of digital transformation.

The post Data Readiness and Quality: The Big New Challenges for all Companies appeared first on Talend Real-Time Open Source Data Integration Software.

Using the Spark Machine Learning Library in Talend Components

$
0
0

Talend provides a family of Machine Learning components which are available in the Palette of the Talend Studio if you have subscribed to any Talend Platform product with Big Data or Talend Data Fabric.  

These components provide a whole bunch of tools and technologies to help integrate Machine Learning concepts for your use cases. These out-of-the-box components can perform various Machine Learning techniques such as Classification, Clustering, Recommendation and Regression. They leverage Spark for scale and performance (i.e. for working with large data sets) and also provide a faster time to gain insight and value. These components focus on business outcomes, not development tasks, so there is no need to learn complex skills such as R, Python or Java.

However, if you do wish to take advantage of some of the complex ML resources and algorithms available within Apache Spark, it is possible to do so.

As a recap, Apache Spark is an open-source cluster-computing framework. Originally developed at UC Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation. Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. Spark's Machine Learning library, known as MLlib, is a powerful, fast, distributed machine learning framework that sits on top of Spark Core. Many common Machine Learning and statistical algorithms have been implemented and are shipped with MLlib, and these can be utilised inside some of Talend's Machine Learning components.

One important component is the tModelEncoder component. This component performs operations which transform data into the format expected by the Talend model training components such as tLogisticRegressionModel or tRandomForestModel. It receives data from its preceding components, applies a wide range of processing algorithms to transform given columns of this data, and then sends the result to the model training component that follows, which eventually trains and creates a predictive model. Depending on the Talend solution you are using, this component can be used in Spark Batch mode, Spark Streaming mode or both.

The specific algorithms available in the Transformation column vary depending on the type of the input schema columns that make up the data to be processed. This is where you can utilise the transformations available in the Spark MLlib Machine Learning library. There are a large number of transformations you can use, some in batch mode, some in streaming mode and some in both.

There are text processing transformations which perform a number of functions, such as the hashing and unhashing of data. There are algorithms to identify similarities in text and extract frequent terms, and algorithms to bucket data. There are mathematical algorithms that work on vectors and time series data, and algorithms that work on image data. There are algorithms to expand or quantise data. You can work with regular expressions or tokenise data. There is an algorithm to transform SQL statements, an algorithm to do statistical analysis, one to index strings and one to assemble vectors. Finally, there is RFormula, a very useful algorithm which allows you to define a formula representing the relationship between variables in your data and then model its output. Overall, there are plenty of algorithms in the MLlib library to suit most needs and use cases, and new ones are added all the time.

Using these algorithms from within a Talend component is easy. Illustrated below is a screenshot showing how you can select an algorithm from within the 'basic settings' configuration section of a tModelEncoder component, in this case to build a Random Forest model. As noted above, the choice of algorithms available in the Transformation column varies depending on the type of the input schema columns to be processed.
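
To make this more concrete, below is a minimal, standalone Spark (Java) sketch of two of the MLlib text-processing transformations mentioned above, Tokenizer and HashingTF. This is not code generated by Talend Studio, nor the internal implementation of tModelEncoder; the sample rows, column names, class name and feature size are assumptions made purely for illustration, and it assumes Spark and MLlib are on the classpath.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class MLlibTransformSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MLlibTransformSketch").master("local[*]").getOrCreate();

        // A tiny, made-up data set: a label plus a free-text field
        List<Row> data = Arrays.asList(
                RowFactory.create(0.0, "late delivery and damaged packaging"),
                RowFactory.create(1.0, "great service and fast delivery"));
        StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("text", DataTypes.StringType, false, Metadata.empty())});
        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Tokenizer: one of the text-processing transformations; splits the text into words
        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        Dataset<Row> words = tokenizer.transform(df);

        // HashingTF: hashes the words into a fixed-length feature vector
        // that a model training step could then consume
        HashingTF hashingTF = new HashingTF()
                .setInputCol("words").setOutputCol("features").setNumFeatures(1000);
        Dataset<Row> features = hashingTF.transform(words);

        features.select("label", "features").show(false);
        spark.stop();
    }
}

In a Talend Spark job, the equivalent result would typically be achieved by selecting these transformations in tModelEncoder rather than writing any of this code yourself.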

Using these algorithms in the tModelEncoder component allows you to build a wide range of models which can be used for many use cases: modelling who your gold customers are, whether fraud is occurring in your organisation, the suitability of drug treatments, or whether some event may happen or not. All of these use cases, and many, many more, can be modelled using the different modelling components and model types which are available. In the diagram below, we can see a Talend job which has been built to predict outcomes using a model created in a previous job. In this job we take data from the HDFS file system, use the model to make predictions with the Talend tPredict component, and then output the results to a file.
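
Conceptually, the scoring step in that job does something similar to the following plain Spark (Java) sketch. This is not the code Talend generates for tPredict; the HDFS paths, the model type and the class name are assumptions made for illustration only.

import org.apache.spark.ml.PipelineModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PredictSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("PredictSketch").getOrCreate();

        // Hypothetical HDFS locations for the data to score, the saved model and the output
        String inputPath = "hdfs:///data/customers_to_score.parquet";
        String modelPath = "hdfs:///models/random_forest_pipeline";
        String outputPath = "hdfs:///data/scored_customers";

        // Read the data to be scored from HDFS
        Dataset<Row> toScore = spark.read().parquet(inputPath);

        // Load a pipeline model that was trained and saved by a previous job
        PipelineModel model = PipelineModel.load(modelPath);

        // Apply the model; this appends a "prediction" column to the data set
        Dataset<Row> scored = model.transform(toScore);

        // Write the scored records back out
        scored.write().mode("overwrite").parquet(outputPath);

        spark.stop();
    }
}

In essence, tPredict wraps this pattern for you: you point it at a model produced by one of the model training components and it appends the prediction to the incoming flow.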

One important point to note is that Talend's Machine Learning components are 'out of the box' components, and they are configurable. You do not need to be an R programmer or have skills in Python. You should be able to design, build and configure a Machine Learning job without having to write lots of complex code.

So, as we have shown, Talend is leveraging the power of Spark, Big Data and Machine Learning to allow our customers to do things which just a few years ago we thought were not possible. In many industries and verticals, Talend can empower and enable you to quickly and simply build Machine Learning jobs.

 

More information on leveraging Spark and Talend's Machine Learning components can be found on the Talend website, or you can speak to your Talend account executive.

The post Using the Spark Machine Learning Library in Talend Components appeared first on Talend Real-Time Open Source Data Integration Software.

Outra – Increasing value and predictability of big data

$
0
0

Outra is a UK predictive data science business that helps companies increase the power and precision of data through a modern, science-led approach which delivers actionable insight at speed.  

Many of Outra's clients have either limited or no data science capabilities in house. Outra matches its proprietary property data with clients' customer data and any relevant third-party data. Then, by applying deep learning, it creates segmentations to help its clients make smarter, faster decisions and improve customer experiences.

 

Building a fast, comprehensive Cloud data platform 

The cloud is a cost-effective way to get things up and running quickly for a startup such as Outra. Outra chose Amazon Web Services (AWS) as its cloud provider. Talend Cloud Big Data enables data ingestion into Snowflake. Talend Cloud Data Quality matches and cleans data from the various sources.

Now Outra can ingest over a billion rows in less than 60 seconds. This new Cloud data platform helps onboard new data sources in one to seven days and reduces the time to deliver insight from months to days.

 

Staying competitive with fast client delivery times 

Although it has been in existence for less than two years, Outra already counts among its clients well-known companies in industries such as insurance, utilities, broadband and TV, telecom, energy, retail, restaurants, and more.

Through data science, Outra can predict, for example, what age group visits a particular restaurant at a particular time, the risk factors and scores of members for potential insurance claims, or when people are going to move to a new house and will need broadband or an energy supply.

Customers are impressed by how fast Outra can process their data. While every customer is different, Outra can access the data, run analytics, and provide them with insight typically within days. This is a very fast turnaround for the market, especially considering the volumes of data processed. 

 

 

Over a billion rows ingested in less than 60 seconds

The start-up Outra, a predictive data science business, is experiencing a big influx of new customers thanks to the way it provides its clients with actionable insight at speed, using its new Talend, Snowflake and AWS Cloud data platform.

 

 

 

The post Outra – Increasing value and predictability of big data appeared first on Talend Real-Time Open Source Data Integration Software.

One Year After GDPR: Three Common Mistakes Businesses Still Make

$
0
0

May 25, 2019 marked the one-year anniversary of the European Union’s (EU) General Data Protection Regulation (GDPR) coming into full effect.

This milestone serves as a timely reminder for any business in the EU or doing business with EU residents on both the implications of failing to protect data and the procedures needed to prevent this from happening.

Here are the three common misconceptions that businesses – big and small – still have about the GDPR:

1) Data Subject Access Rights is many companies’ Achilles’ heel

With GDPR violations now attracting large fines, you might think businesses would be bending over backwards to ensure compliance, but this isn’t always the case.

Most businesses have improved accountability by appointing a Data Protection Officer. They have devised (or refreshed) a legal framework for data privacy, improved their lines of defense against data breaches and even managed identity and access more rigorously. And yet, our recent research reveals that mistakes are still being made under the GDPR: 74% of UK organizations are failing to respond to consumers’ personal data requests within the required one-month time period.

Despite it being very easy for consumers to request their data now, most businesses still struggle to provide it within the time demanded of them. One thing is certain: if regulators put a focus on enforcing breaches in this area, then many more companies could be held accountable over the next twelve months for failing on Data Subject Access Rights.  

2) Data privacy or protection is not the same as cybersecurity

When most UK businesses hear the phrase ‘data privacy’ or ‘data protection’ they immediately think ‘cybersecurity threat’. This is a broad misconception. Rather than putting the correct processes and IT systems in place to respond to data privacy issues like data access requests, they look at building stricter security systems.

Google has had to pay a €50 million (roughly $57 million) fine for GDPR violations, and class action lawsuits have been filed against streaming services. These events mean organizations must begin to realise that cybersecurity is only one aspect of GDPR compliance. In fact, the biggest fine to date has been imposed for a violation of data consent, while the largest class action suits currently being heard by regulators are focusing on data subject access requests. The GDPR has presented organizations with an opportunity to re-think the current relationship between business processes, data transparency, and customer privacy needs.

3) The GDPR is more than a legal requirement between customer and business

Over the past twelve months, businesses have been busy asking themselves if they comply with the GDPR. However, when faced with this question, most have taken a defensive approach, considering only legal and security implications on the business. Herein lies another misconception – the view that the GDPR is nothing more than an issue of legality.  

The GDPR is a contract between the organization and its customers, detailing how the business plans to store, process and protect customers' personal data. Every contract has a legal dimension, but the scope of the GDPR is much broader than that. It is also about building better customer relationships and experiences through trust. This is a vital distinction because trust is a pivotal commodity for businesses today. If you do not have a contract that your customers like or trust, they will begin to withhold their data or abandon your company altogether.

GDPR breaches and the publicity they have attracted have done a lot to damage consumer trust in recent months. The organizations which succeed will be those which are willing to put consumer privacy concerns at the heart of the business and to prioritize the customer experience – for example, establishing privacy portals where their customers can access their data and give their consent for the personalized services they find valuable.

Regulation is always a minimum standard, so companies must aim to comply with the GDPR and then go beyond it. With all data, organizations should act as stewards, making sure data is used, stored and shared in a way that does not allow its misuse by unauthorized third parties. In doing so, they will win more trust in their own data, and from their customers.

Take a look at the full article in Gigabit Magazine here. For more information about GDPR compliance, take a look at our webinar 5 Pillars for GDPR Compliance.

The post One Year After GDPR: Three Common Mistakes Businesses Still Make appeared first on Talend Real-Time Open Source Data Integration Software.

Modern Data Architecture with Delta Lake Using Talend

$
0
0

Data lakes: smooth sailing or choppy waters?

In May, Talend announced its support for Databricks’ open source Delta Lake, “a storage layer that sits on top of data lakes to ensure reliable data sources for machine learning and other data science-driven pursuits.” What does this mean for your company, and is Delta Lake right for you?

Since data lakes came into existence nearly a decade ago, they have been touted as a panacea for companies seeking a single repository to house data from all sources, whether internal or external to the organization, cloud-based or on-premises, batch or streaming.

Data lakes remain an ideal repository for storing all types of historical and transactional data to be ingested, organized, stored, assessed, and analyzed. Never before have business analysts been able to access all this data in one place. All of this was not sustainable in traditional data warehouses due to the high volume, cost, latency, complexity, and performance requirements. So yes, data lakes are a cure-all for many of our data woes.

But over time, data lakes have grown exponentially, often to the extent that the volumes of raw, granular data have become overwhelming for analytical purposes, even though they’re intended to make it easy to mine and analyze data.

In fact, the term “data swamp” has emerged to create the perfect visualization of a data lake gone bad. Data swamps are data lakes that have no curation or data life cycle management and minimal to no contextual metadata and data governance. Because of the way the data is stored, it becomes hard to use, or even unusable.

Now Delta Lake offers a solution to restore reliability to data lakes “by managing transactions across streaming and batch data and across multiple simultaneous readers and writers.” Here, we’ll discuss how Delta Lake overcomes common challenges with data lakes and how you can leverage this technology with Talend to get more value from your data.

Common challenges with data lakes

Whether or not you would classify your data lake as a swamp, you may notice end users struggling with data quality, query performance, and reliability as a result of the volume and raw nature of data in data lakes. Specifically:

Performance:

  • Too many small (or very big) files mean more time is spent opening and closing files than reading their content (this is even worse with streaming data)
  • Partitioning or indexing breaks down when data has many dimensions and/or high-cardinality columns
  • Neither storage systems nor processing engines are great at handling a very large number of subdirectories/files

Data quality and reliability: 

  • Failed production jobs leave data in a corrupt state, requiring tedious recovery
  • Lack of consistency makes it hard to mix appends, deletes, and upserts and still get consistent reads
  • Lack of schema enforcement creates inconsistent and low-quality data

Generating analytics from data lakes

As organizations set up their data lake solutions, often migrating from traditional data warehousing environments to cloud solutions, they need an analytics environment that can quickly access accurate and consistent data for business applications and reports. For data lakes to serve the analytic needs of the organization, you must follow these key principles:

  • Data cataloging and metadata management: To present the data to business, create a catalog or inventory of all data, so business users can search data in simple business terminology and get what they need. But with high volumes of new data added every day, it’s easy to lose control of indexing and cataloging the contents of the data lake.
  • Governance and multi-tenancy: Authorizing and granting access to subsets of data requires security and data governance. Delineating who can see which data, and at what granularity level, requires multi-tenancy features. Without these capabilities, data is controlled by only a few data scientists instead of the broader organization and business users.
  • Operations: For a data lake to become a key operational business platform, build in high availability, backup, and constant recovery.
  • Self-service: To offer a data lake with value, build a consistent ingestion of data with all the metadata and schema captured. In many cases, business users want to blend their own data with the data from the data lake.

Yet as data lakes continue to grow in size, including increasing volumes of unstructured data, these principles become increasingly complex to design and implement. Delta Lake was created to simplify this process.  

Delta Lake improves reliability and speed of analytics

Talend has committed to seamlessly integrate with Delta Lake, “leveraging its ACID compliance, Time Travel (data versioning), and unified batch and streaming processing. In addition to connecting to a broad range of data sources, including popular SaaS apps and cloud platforms, Talend will empower Delta Lake users with comprehensive data quality and governance features to support machine learning and advanced analytics, natively supporting the full power of the Apache Spark technology underneath Delta Lake.”

The benefits of Delta Lake include:

  • Reliability: Failed write jobs do not update the commit log, so partial or corrupt files are not visible to readers
  • Consistency: Changes to tables are stored as ordered, atomic commits, and each commit is a set of actions filed in a directory. Readers read the log in atomic units, and thus read consistent snapshots. In practice, most writes don't conflict, and isolation levels are tunable.
  • Performance: Compaction is performed on transactions using OPTIMIZE, which can also optimize layout using multi-dimensional clustering on multiple columns
  • Reduced system complexity: Delta is able to handle both batch and streaming data (via a direct integration with structured streaming for low-latency updates), including the ability to concurrently write batch and streaming data to the same data table, as the sketch below illustrates
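
As an illustration of that last point, here is a minimal Spark (Java) sketch, with hypothetical staging paths, of a batch writer and a streaming writer appending to the same Delta table. It assumes the Delta Lake library is available on the cluster (as it is on Databricks) and that the staging locations already contain suitable data; it is a sketch of the pattern, not production code.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;

public class DeltaBatchAndStreamSketch {
    public static void main(String[] args) throws StreamingQueryException {
        SparkSession spark = SparkSession.builder().appName("DeltaBatchAndStreamSketch").getOrCreate();

        // Hypothetical locations; any Delta-capable storage (DBFS, S3, ADLS) would do
        String deltaPath = "/delta_sample/clicks";
        String checkpointPath = "/delta_sample/_checkpoints/clicks";

        // 1) Batch write: append a one-off backfill of historical clicks to the Delta table
        Dataset<Row> historicalClicks = spark.read().parquet("/staging/clicks_backfill");
        historicalClicks.write()
                .format("delta")
                .mode("append")
                .save(deltaPath);

        // 2) Streaming write: continuously append live clicks to the same Delta table;
        //    the Delta transaction log keeps the concurrent writers consistent
        Dataset<Row> liveClicks = spark.readStream()
                .format("delta")
                .load("/staging/clicks_stream");
        StreamingQuery query = liveClicks.writeStream()
                .format("delta")
                .outputMode("append")
                .option("checkpointLocation", checkpointPath)
                .start(deltaPath);

        query.awaitTermination();
    }
}

It is the Delta transaction log that allows these two writers to run concurrently without corrupting the table.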

 

Architecting a modern Delta Lake platform with Talend

The architecture diagram below shows how Talend supports Delta Lake integration. Using Talend's rich base of built-in connectors, as well as MQTT and AMQP to connect to real-time streams, you can easily ingest real-time, batch, and API data into your data lake environment. Using a data lake accelerator makes it easier to onboard new sources at a greater pace than hand coding for every requirement. An accelerator allows you to ingest data in a consistent way by capturing all the required metadata and schemas of the ingested systems, which is the first principle of deploying a successful data lake.

Talend integrates well with all cloud solution providers. In this architecture diagram, we're showing the data lake on the Microsoft Azure cloud platform. The storage layer is Azure Data Lake Store (ADLS), and the analytics layer consists of two components: Azure Data Lake Analytics and HDInsight. An alternative option in Azure is to use Azure Blob storage, which provides storage only, with no compute attached to it.

Alternatively, if you’re using Amazon Web Services, the data lake can be built based on Amazon S3 with all other analytical services sitting on top of S3.

The Talend Big Data Platform integrates with Databricks Delta Lake, where you can take advantage of several features that enable you to query large volumes of data for accurate, reliable analytics:

  • Scalable storage: Data is stored as Parquet files on a big data filesystem or on storage layers such as S3 or Azure Blob
  • Metadata: A sequence of metadata files tracks the operations made on the table, and is stored in scalable storage along with the table
  • Schema check and validation: Delta provides the ability to infer the schema from input data. This reduces the effort of dealing with the schema impact of changing business needs at multiple levels of the pipeline/data stack.

Here is a diagram of an architecture that shows how Talend supports Delta Lake implementation, followed by instructions for converting a data lake project to Delta Lake with Talend.

Creating or converting a data lake project to Delta Lake through Talend

Below are instructions that highlight how to use Delta Lake through Talend.

Configuration: Set up the Big Data Batch job with the Spark Configuration under the Run tab. Select Databricks as the distribution, along with the corresponding version.

Under the Databricks section, update the Databricks Endpoint (it could be Azure or AWS), the Cluster ID and the Authentication Token.

Sample Flow: In this sample job, click events are collected from a mobile app, joined against the customer profile, and loaded as a Parquet file into DBFS. This DBFS file will be used in the next step to create the Delta table.

Create Delta Table: Creating a Delta table requires the keyword "USING DELTA" in the DDL, and in this case, since the file is already in DBFS, a LOCATION is specified so the table fetches its data from there.
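
As a rough illustration of the shape of that DDL (the table name and location follow the rest of this walkthrough; the statement assumes the data at that location is already in Delta format, and if it is still plain Parquet it should be converted first, as shown in the next bullet):

SQL :

CREATE TABLE clicks
USING delta
LOCATION '/delta_sample/clicks'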

 

  • Convert to Delta table: If the source files are in Parquet format, we can use the SQL CONVERT TO DELTA statement to convert the files in place and create an unmanaged table:

SQL: CONVERT TO DELTA parquet.`/delta_sample/clicks`

  • Partition data: Delta Lake supports partitioning of tables. Partitioning the data can speed up queries that have predicates involving the partition columns.

SQL :

CREATE TABLE clicks (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING delta
PARTITIONED BY (date)

 

  • Batch upserts: To merge a set of updates and inserts into an existing table, we can use the MERGE INTO statement. For example, the following statement takes a stream of updates and merges it into the clicks table. If a click event is already present with the same eventId, Delta Lake updates the data column using the given expression. When there is no matching event, Delta Lake adds a new row.

SQL :

MERGE INTO clicks
USING updates
ON clicks.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET clicks.data = updates.data
WHEN NOT MATCHED
  THEN INSERT (date, eventId, data) VALUES (updates.date, updates.eventId, updates.data)

 

  • Read Table: All Delta tables can be accessed either by referencing the file location or by using the Delta table name.

 

SQL :     Either SELECT * FROM delta.`/delta_sample/clicks` or SELECT * FROM clicks

 

Talend in data egress, analytics and machine learning, at a high level:

  • Data Egress: Using Talend API Services, you can create APIs faster by eliminating the need to use multiple tools or code manually. Talend covers the complete API development lifecycle, from design, test, documentation and implementation through to deployment, using simple, visual tools and wizards.

 

  • Machine learning: With the Talend toolset, machine learning components are ready to use off the shelf. This ready-made ML software allows data practitioners, no matter their level of experience, to easily work with algorithms, without needing to know how an algorithm works or how it was constructed. At the same time, experts can fine-tune those algorithms as desired. Talend's machine learning components include tALSModel, tRecommend, tClassifySVM, tClassify, tDecisionTreeModel, tPredict, tGradientBoostedTreeModel, tSVMModel, tLogisticRegressionModel, tNaiveBayesModel and tRandomForestModel.

 

 

The post Modern Data Architecture with Delta Lake Using Talend appeared first on Talend Real-Time Open Source Data Integration Software.
