
The Need for a Frictionless Enterprise in a Digital-Native World


The word “frictionless” has emerged as a term to describe an action achieved with little or no difficulty; it is about effortlessness. The term is most commonly associated with customer experience and payments.

Amazon-like expectations

Consumers are already familiar with this term without realizing it. Retail is the perfect illustration of what being frictionless means. Driven by tech-native companies like Amazon or eBay, the purchasing journey has completely changed for customers. In today’s digital society, consumers are more likely to be actively engaged with their favourite brands (via social media and specially designed apps), whilst also keeping a level of autonomy. According to a McKinsey report, “Digitizing the Consumer Decision Journey”, creating frictionless experiences supports the optimisation of digital channels.

Frictionless payment
In our new “zero touch” world, consumers no longer need to have their credit cards with them at all times. Now they can purchase or access almost anything in one click via their smartphones – clothes, cars, entertainment, you name it. The idea behind this practice, which is now second nature to millions, stems from being able to do things in a way that presents no problem for the do-er. Our world is becoming frictionless and almost every single action is now as easy as ABC.

Of course, there is still work to be done when it comes to delivering an omni-channel experience online and off-line and according to published reports, “48 percent of US consumers believe companies need to do a better job of integrating their online and off-line experiences.” That said, retailers are getting close to making this a reality.
Today, our frictionless world has developed to such an extent that it would be hard to think of a world without such convenience and the possibilities for its future expansion are exciting.

The key to frictionless is data

Frictionless has exploded with the emergence of new technologies – smartphones, the cloud, machine learning and virtual assistants to name a few. All of these technologies produce, store or analyse vast quantities of data to provide seamless experiences in the 21st century. As a result, industries such as travel, finance, retail and many more have been touched by the frictionless effect, which enables people to have a seamless experience whether they are purchasing goods, travelling or interacting with their bank or insurance company. Even governments and the public sector are embracing frictionless experiences – online tax returns, “no paper” programmes and online procedures are now commonplace and encouraged across the board.

Data-driven generations

This has had a knock-on effect on the generations that have grown up with these technologies. The “data-driven” generations, Millennials and Generation Z, are happy to embrace the idea of a fully connected society and are expecting a seamless experience from the start. They purchase the latest technology, use services that take advantage of digital concepts and technology, have multiple social media accounts and understand that data is a powerful tool – especially concerning their experiences. Expectations are high. Consumers are more engaged and more autonomous.
While the term is appearing across industries, it is hardly new. The European Union’s Schengen Agreement, signed in 1985, sought to abolish borders between European nations. The purpose: to make the movement of people and goods as frictionless as possible. Brexit, of course, threatens to reintroduce friction, and it will be the job of the UK parliament to ensure that this does not happen – the potential consequences include major disruptions to the UK’s trade agreements and outrage from British sun-seekers.

But what does this mean for IT teams within enterprise organisations?

What’s become accepted and expected in the consumer world is making its way to the enterprise. On one hand, this directly translates to external enterprise delivery: whether the counterpart is a customer or a supplier, engagement must be seamless – in a digital world, the level of frictionlessness is linked to reputation. IT teams want to use software that’s as easy as streaming a video on YouTube or listening to music on Spotify. They want to be able to start data loading projects easily, and to connect cloud sources to data warehouses or data lakes in minutes. And they need it to be as simple as purchasing on Amazon.

The PAYG movement

The digital-native generations are pushing new ways of consuming data within an organisation. One of the major trends driven by this generation is the “pay-as-you-go” model enabled by the explosion of cloud applications. According to IDC, millennial-led companies have adopted cloud applications such as travel, invoice, expense management and human capital management — as well as desktop as a service — at a more than 20% higher rate than the average midmarket firm.
Cloud has been revolutionising the way organisations work with data integration. These new data workers now want to consume technology as a service rather than just as products, which reduces costs – installation and administration are no longer managed internally, but by the provider. Data workers can thus focus on the real benefits of their job – data ingestion and integration in the cloud, providing analysis to improve the business in real time. Cloud integration is not an option for this generation, it is a no-brainer, and they are leading the way to cloud and self-service adoption within their organisations. Making use of data integration and cloud management platforms is pivotal to the success of your business. Platforms and services that allow you to seamlessly and instantly move large amounts of data to their final destination must be taken advantage of in a society where frictionless is the new norm.

The impact on GDPR

The more frictionless an interaction between customer and vendor becomes, the more legitimate the questions about data privacy and protection. Indeed, data is at the heart of an organisation’s frictionless strategy. Data volumes and flows are exploding, as is the number of companies dealing with individuals’ data. And, in the event of a data breach, the integrity of individuals is compromised, which is one of the major concerns when going frictionless.
However, just like frictionless experiences empower people by giving them more autonomy, new data protection regulations now empower them with data privacy.

The EU General Data Protection Regulation (GDPR) has set up new foundations for data protection, especially with Article 15, which gives EU citizens the right to access their personal data. Now, individuals own their data and can decide whether it can be used by organisations. The European data protection regulation has also increased data consciousness among people, who now pay more attention to their data, how it is used and by whom.
Regulations like GDPR thus play a role of “frictionless enablers”; companies respecting the regulations’ principles are more trusted by individuals and so are able to use data to make individuals’ lives and business users’ daily jobs easier.
The reality is, no matter what audience we are talking about – be it customers, IT teams, suppliers, or desk-residing employees – enterprise organisations must fully embrace being frictionless in every part of their business.



5 best practices to deliver trust in your data project: Tip #3 Take ownership around a single source of trusted data


Over the summer, the Talend blog team will take turns sharing fruitful tips to help you securely kick off your data project. This week, we’ll cover the third capability: make sure you create a single source of trusted data and foster data ownership.

Create a single source of trusted data and foster data ownership

Would you imagine e-commerce without an electronic catalog or the web without search engines? Digital transformation requires single points of access to enable a wider range of people to access a wider range of information.

Why it’s important

When data is siloed, users can’t produce value out of cross-referenced datasets. For example, it becomes complicated to match customers and prospects and provide useful recommendations on the next product to buy, especially as the recommendation might depend on context: if you are a fashion retailer and the customer is currently in a store, you had better recommend a product that is available in that store and fits the customer’s size. Creating a single source of data and being able to propagate it at speed allows both better control and a wider audience.

Moreover, having a single source of trusted data is also a way to reap the benefits of deploying your digital transformation initiatives.

According to the Gartner Magic Quadrant for Business Intelligence and Analytics Platforms, 2017: “By 2020, organizations that offer users access to a curated catalog of internal and external data will realize twice the business value from analytics investments than those that do not.”

When it’s important

Apply a single source of trusted data from the get-go. Organizations need to establish and execute a data management strategy at the very start of their digital transformation journey in order to reconcile data sources and data sets before going further. This includes establishing accountabilities for data protection, remediation, and publishing so that data is widely and securely shared.

Our recent data trust readiness report reveals that 74% of operational data workers believe their organizations do not always put in place a single source of trusted data.

Download Data Trust Readiness Report now.

How Talend tools can help

Talend Data Catalog helps you to create a central, governed catalog of enriched data that can be shared and collaborated on easily. It can automatically discover, profile, organize and document your metadata and make it easily searchable.

This starts by establishing a single point of trust; that is to say, collecting all the data sets together in a single control point that will be the cornerstone of your data governance framework. Then, you need to select the identified datasets and assign roles and responsibilities directly in your single point of control to operationalize your governance from the get-go.

It is one of the advantages of data cataloging: regrouping all the trusted data in one place and giving access to members so that everybody can immediately use it, protect it, curate it and allow a wide range of people and apps to take advantage of it.

Talend Data Stewardship

The benefit of centralizing trusted data into a shareable environment is that, once operationalized, it will save your organization time and resources, so you can start fostering data ownership easily. Using Talend Data Stewardship, you can then easily delegate error resolution to your stewards and keep your dataset consistent, updated and curated over time, directly within your data catalog.

How to get started

Start by answering the “Who can access what” question. Apply collaborative and controlled governance to enable role-based applications that will allow assigned data stewards and the entire stakeholder community to harness the power of data, with governance principles put in place at the very beginning of the project. Create a data inventory where shared data can be referenced, documented, and published.

 Watch our webinar series to discover how Talend can help you to create a single source of trusted data.

Want to explore more capabilities?

This is the third of ten trust & speed capabilities. Can’t wait to discover the next one?

Go and download our Trust Data Readiness Report to discover other findings and the other 9 capabilities, or stay tuned for next week’s Capability #4: Get control over your organization’s data.

 


Use Cases for Machine Learning


Talend provides a number of Machine Learning components that can be used for a variety of purposes. I have previously described some of these components, some in more detail than others, as well as outlining what they can do. However, one question remains: what use cases can be solved by using these Machine Learning components?

Machine learning components

Firstly, a quick overview. Talend provides a set of ‘out of the box’ components for various ML techniques. These can be classified into four groups:

  • Classification – Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations.
  • Clustering – Clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
  • Recommendation – Recommendation is a class of information filtering that tries to predict the “rating” or “preference” that a user would give to an item. 
  • Regression – Regression is a process for estimating the relationships among variables

All of the above leverage Apache Spark for scale and performance; they enable a faster time to insight and value, focus on business outcomes rather than development, and present a lower skills barrier to use.

So, given that I have a number of components in each of the above groups, how do I use them to suit my own use cases? Let us take a look at each of the four groups mentioned above in turn.

Classification

As mentioned above, classification is the problem of identifying where new observations belong. In practice, this can cover a wide range of things. Some of the most common use cases for classification algorithms involve predicting whether certain events will happen in the future. Contained within the area of ‘classification’ are lots of different algorithms that can do lots of different things, but the common thread is that they use data sets to make predictions about future events.

 


As for use cases, the most common example here is ‘what will likely happen next?’. If you would like to predict what your customer will do next, what their next behaviour could be, or what your suppliers will do, then classification algorithms are what you need. In all fields from life sciences to retail, large amounts of data are being used to build predictive models. These can do such things as predicting clinical outcomes, predicting customer behaviour, predicting the next move in a business process, or predicting the likely increase in sales over Christmas. The choices are many. Within the classification group in Talend there are different types of component that allow you to build different models, such as Decision Trees, Regression models and Random Forest models, and these models can be used to best fit your ‘what will happen next?’ use cases.
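Because Talend’s classification components run on Apache Spark under the hood, a minimal PySpark sketch of the same idea may help make it concrete. This is not a Talend job, and the dataset path and column names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("next-action-classification").getOrCreate()

# Hypothetical labelled history: past customer behaviour plus the observed outcome.
history = spark.read.parquet("s3://my-bucket/customer_history")  # assumed path

# Assemble the predictive columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["visits_last_30d", "basket_value", "days_since_purchase"],
    outputCol="features")

# Train a decision tree on the labelled history ("churned" is the event we predict).
model = DecisionTreeClassifier(labelCol="churned", featuresCol="features") \
    .fit(assembler.transform(history))

# Score new customers to predict what will likely happen next.
new_customers = spark.read.parquet("s3://my-bucket/current_customers")  # assumed path
model.transform(assembler.transform(new_customers)) \
    .select("customer_id", "prediction").show()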

Clustering

Clustering is the task of grouping together a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. Clustering is really useful for identifying separate groups and is therefore used to solve use cases such as “who are my premium customers?”.


As a simple example, if you plot your customers versus how much they spend on a graph, then you can easily identify separate groups and therefore the premium customers that you want to hold on to. Clustering can also be used for other business cases. In life sciences, clustering can be used in drug discovery; in climate science, cluster analysis is used to analyse weather patterns. Network analysis is another good example: how is my network being used, or where is it being hit? In economics, clustering is used widely for a variety of applications.
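Again leveraging Spark underneath, a minimal PySpark sketch of the “who are my premium customers?” example could look like the following (the dataset path and column names are illustrative, not from the article):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("customer-segments").getOrCreate()

# Hypothetical customer dataset with yearly spend and number of orders.
customers = spark.read.parquet("s3://my-bucket/customers")  # assumed path

features = VectorAssembler(
    inputCols=["yearly_spend", "order_count"], outputCol="features"
).transform(customers)

# Group customers into three segments; the "premium" cluster is the one whose
# centre shows the highest spend.
model = KMeans(k=3, featuresCol="features", predictionCol="segment").fit(features)

for centre in model.clusterCenters():
    print(centre)                      # average spend and order count per segment
model.transform(features).groupBy("segment").count().show()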

Recommendation

Recommendation algorithms are often self-explanatory and we are all familiar with them. If you use Amazon, Netflix, eBay etc. you will be all too aware of them recommending all sorts of things, based on your previous purchases and behaviour. They simply work by predicting the “rating” or “preference” that a user would give to an item.


We have a couple of Talend components in this space and they work by analysing data from a preceding model component, and then making that prediction. The potential use cases are many here. If you sell items to customers, buy from a supplier, provide services etc., then recommendation algorithms are for you, and you can fit your recommendation use cases in here.
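As a rough illustration of the underlying idea (not the Talend components themselves), a minimal PySpark ALS sketch with assumed column names that learns ratings and recommends items per user could be:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("product-recommendations").getOrCreate()

# Hypothetical purchase/rating history: one row per (user, item, rating).
ratings = spark.read.parquet("s3://my-bucket/ratings")  # assumed path

# Train an alternating-least-squares model on the observed ratings.
als = ALS(userCol="user_id", itemCol="product_id", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 5 products for every user.
model.recommendForAllUsers(5).show(truncate=False)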

Regression

Finally, regression analysis is a statistical process for estimating the relationships among variables. It includes various techniques for modelling and analysing several variables at once, when the focus is on the relationship between a dependent variable and one or more independent variables. Regression analysis can thus be used to run various analyses on your data, and the results are then used to build a model.


Within Talend we have one component, tModelEncoder, which is used to do just that. It performs various featurisation operations that can transform data into the format expected by the Talend model training components. As regression can be used to find the relationship between variables, it can be used in a number of business use cases.

Examples could be the relationship between temperature and sales, useful if you sell air conditioning units. You could look for relationships between advertising spend and sales through various channels – which ones generate the best sales? In life sciences you could find relationships between various drugs and clinical outcomes. The possibilities are wide and varied, but the bottom line is this: if you have data in which you think there may be relationships between variables, then regression analysis can help you find and quantify those relationships. From that you can build a predictive model which you can use to help your business.
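For instance, a minimal PySpark sketch of regressing sales against temperature and advertising spend might look like this (again with hypothetical paths and column names):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("sales-regression").getOrCreate()

# Hypothetical daily sales data with candidate explanatory variables.
sales = spark.read.parquet("s3://my-bucket/daily_sales")  # assumed path

features = VectorAssembler(
    inputCols=["temperature_c", "ad_spend"], outputCol="features"
).transform(sales)

# Fit a linear regression of units sold against the features.
model = LinearRegression(labelCol="units_sold", featuresCol="features").fit(features)

# The coefficients quantify each relationship; R squared says how strong the fit is.
print(model.coefficients, model.intercept)
print(model.summary.r2)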

Watch Fundamentals of Machine Learning now.

Talend and machine learning

As we have seen, there are a number of components that fit into four main areas. Each of those areas covers a large number of use cases that you could apply within your business. Ultimately, although we have mentioned a few use cases, there are always more. The point is to give some examples so that you can think about your own use cases and how Talend’s Machine Learning components can help your company bridge the gap between business, IT, and data scientists to seamlessly deploy critical machine learning models.

 

 


Talend Connect Europe 2019: Advance your Data Journey



 

Save the date!

Talend Connect will be back in London and Paris in October. Join Talend executives, data experts, and thought leaders to discover new paths for delivering trusted data at speed.

Talend will welcome customers, partners, and influencers to its annual company conference, Talend Connect, taking place in two cities, London and Paris, in October. A must-attend event for business decision makers, CIOs, data scientists, chief architects, and developers, Talend Connect will share innovative approaches to cloud integration and data integrity, such as streaming data, data governance, DevOps, CI/CD, serverless, API, containers and data processing in the cloud.

<< Reserve your spot for Talend Connect 2019: Coming to London and Paris >>

Hear Stories from World Class Customers

Talend customers from different industries including AstraZeneca, L’Oréal, Hermes Parcelnet and Kiloutou will go on stage to explain how they are using Talend’s solutions to achieve data and business excellence, deliver optimum customer experience and drive industry innovation. Our customers now see making faster decisions and monetizing data as a strategic competitive advantage. They are faced with the opportunity and challenge of having more data than ever, spread across a growing range of environments that change at increasing speed, combined with the pressure to manage this growing complexity whilst simultaneously reducing operational costs.

At Talend Connect attendees can learn how Talend customers leverage more data across more environments, bridge the gap between IT and business departments to put more data to work and enable more informed and impactful business decisions.

Uniper: Building a Digital Platform for the Data Economy.

Talend Data Master Awards

The winners of the Talend Data Master Awards will be announced at Talend Connect London. Talend Data Master Awards is a program designed to highlight and reward the most innovative uses of Talend solutions. The winners will be selected based on a range of criteria including market impact and innovation, project scale and complexity as well as the overall business value achieved.

<< Meet Last Year’s Talend Data Masters Award Winners >>

Shaping the Future of Cloud Integration

Talend Connect provides an ideal opportunity to discover the leading innovations and best practices of cloud integration and to learn how Talend is changing the game for data integration and management to deliver trusted data at speed across your organization. Attendees will also learn about Talend’s roadmap and discover new product features and enhancements to support organizations’ digital transformation journey.

Special Thanks to Our Sponsors

Talend Connect benefits from the support of partners including Snowflake, Databricks, Datalytyx, Microsoft, Accenture, Business & Decision, Orange Business Services, CIMT AG, Hardis Group, Keyrus, Jems Group, Micropole, Virtusa, DataValue Consulting, Smile, VO2 Group and Ysance.

We are looking forward to welcoming users, customers, partners and all the members of the community to Talend Connect in London and Paris!

 


5 best practices to deliver trust in your data project: Tip #4 Empower organizations with modern tools


Over the summer, the Talend blog team will take turns sharing fruitful tips to help you securely kick off your data project. This week, we’ll cover the fourth capability: empower organizations with modern tools and systems to manage and monitor data.

 

Traditional tools for managing data integrity, such as data quality, governance and stewardship tools, were targeted at the most skilled data experts. With the advent of social networks, machine learning and smart pattern recognition technologies, these tools are getting simpler at every release. They now allow anyone with market or customer knowledge to contribute and collaborate  in a data governance effort.

Empower organizations with modern tools and systems to manage and monitor data

Why it’s important

People who know the data best are generally at the edge of data supply chains. They aren’t data professionals who can design a model or customize a data quality rule. But once guided by smart tools or supervised by machine learning that can turn their tacit knowledge into an algorithm, they become strategic contributors to digital transformations, while cutting repetitive tasks out of their daily jobs. Equipping data citizens with simple but powerful, easy-to-use modern tools also has a clear benefit: it puts your data governance strategy into play very rapidly by letting almost anyone be accountable for the data they manage.

When it’s important

Embedding controls, transparency, and monitoring along your data journey, rather than as a separate discipline, is required to monitor your progress over time. With a common data platform and collaborative tools that fit the role of each contributor, data quality and governance become a team sport, while rules can be integrated automatically into data workflows.

 

Our recent data trust readiness report reveals that 37% of operational data workers are confident that they have the right tools in place to efficiently manage and monitor data, whereas 50% of executives believe they do.

Download Data Trust Readiness Report now.

How Talend tools can help

Talend provides modern tools so that anyone in a company can embrace data management the modern way.

For business experts (data stewards): Talend Data Stewardship is an application you can use to manage data assets in the cloud. It organizes the interactions on data whenever human intervention is required to collaborate on data deduplication, classification or curation.
Talend Data Stewardship makes it easy for anyone to clean, certify and reconcile data. Using a team-based, workflow approach, curation tasks can be assigned to data experts across the organization and tracked for progress or audit.

In a customer-driven scenario, imagine that the sales team has found a number of duplicates in the Contacts objects in the enterprise Customer Relationship Management (CRM) system and this hurts their productivity. A Merging campaign in Talend Data Stewardship enables you to resolve the duplicates by surviving only the appropriate data, i.e. keeping a single golden record.

For further information, see Adding a Merging campaign to deduplicate records 

For Business analysts: Talend Data Preparation empowers anyone to quickly prepare data for trusted insights throughout the organization. Talend Data Preparation combines intuitive self-service data preparation and data curation functionality with collaboration capabilities, allowing lines of business and IT to work together to create data the entire company can trust. 

For data engineers: Talend provides you with a range of open source and subscription Studios you can use to create your projects and manage data of any type or volume. Using the graphical user interface and hundreds of pre-built components and connectors, you can design your Jobs or Routes with a drag-and-drop interface and native code generation.

All your data practitioners, whoever they are, will find an app suitable to their needs and role in the data management area. The combination of UI-friendly, role-based applications and a powerful technical environment built on a common unified platform will drastically enhance the way modern organizations embrace data management.

How to get started:

Why not give Talend Cloud and Talend Data Stewardship a try?

Go to the Talend Cloud trial and see for yourself: put yourself in a data citizen’s shoes and practice 3 key user scenarios. To govern data at scale, the right systems need the mechanisms and features to automate processes and rules over time, from data ingestion to destination. Don’t consider data integration, data quality, and data governance as separate functions, but as key pillars of your data-driven strategy. Download the Definitive Guide to Data Governance to improve how you deliver data you can trust.

 

 Want to explore more capabilities?

This is the fourth of ten trust & speed capabilities. Can’t wait to discover our next capability?

Go and download our Trust Data Readiness Report to discover other findings and the other capabilities, or stay tuned for next week’s Capability #5.


ELK with Talend cloud


Overview

ELK is the acronym for three open source projects where E stands for Elasticsearch, L stands for Logstash and K stands for Kibana. ELK is a robust solution for log management and data analysis. These open source projects have specific roles in ELK as follows:

  • Elasticsearch handles storage and provides a RESTful search and analytics endpoint.
  • Logstash is a server-side data processing pipeline that ingests, transforms and loads data.
  • Kibana lets you visualize your Elasticsearch data and navigate the Elastic Stack.
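To give a feel for what “RESTful search and analytics endpoint” means in practice, here is a minimal Python sketch of querying an index over HTTP. The endpoint URL and index name are placeholders, and it assumes an access policy that permits the unsigned request (Amazon ES domains often require signed requests instead, as in the Lambda sketch further below):

import requests

# Placeholder endpoint and index; substitute your own Elasticsearch/Amazon ES domain.
ES_ENDPOINT = "https://search-my-domain.eu-west-1.es.amazonaws.com"
INDEX = "talend-task-logs"

# Simple full-text search against the index's _search REST endpoint.
query = {"query": {"match": {"message": "error"}}, "size": 10}
resp = requests.get(f"{ES_ENDPOINT}/{INDEX}/_search", json=query)
resp.raise_for_status()

for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])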

In this blog, I am going to show you how to configure ELK while working with Talend Cloud. The blog will focus on loading streaming data into Amazon ES from Amazon S3. Refer to this help document from AWS for more details.

 

Process Flow

Talend Cloud enables you to save the execution logs automatically to an Amazon S3 bucket. The flow for making Talend Cloud logs work with ELK is as follows: logs land in the S3 bucket, S3 triggers a Lambda function, and the Lambda function indexes the logs into the Amazon ES domain.


Once you have configured Talend Cloud logs to be saved to the Amazon S3 bucket, a Lambda function is written. Lambda is used to send data from S3 to the Amazon ES domain. As soon as a log arrives in S3, the S3 bucket triggers an event notification to Lambda, which then runs the custom code to perform the indexing. The custom code in this blog is written in Python.
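To make that flow concrete, here is a minimal, illustrative sketch of what such a Lambda handler might look like. It is not the exact attached code used later in this post, and the region, domain endpoint, index name and document shape are assumptions:

import urllib.parse

import boto3
import requests
from requests_aws4auth import AWS4Auth

REGION = "eu-west-1"                                                  # assumed region
ES_ENDPOINT = "https://search-my-domain.eu-west-1.es.amazonaws.com"   # placeholder endpoint
INDEX = "talend-task-logs"                                            # assumed index name

s3 = boto3.client("s3")
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   REGION, "es", session_token=credentials.token)

def handler(event, context):
    """Triggered by S3: read the new log object and index each line into Amazon ES."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        for line in filter(None, body.splitlines()):
            doc = {"message": line, "source_key": key}                # assumed document shape
            requests.post(f"{ES_ENDPOINT}/{INDEX}/_doc",
                          auth=awsauth, json=doc,
                          headers={"Content-Type": "application/json"})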

Prerequisite

To configure ELK with Talend Cloud logs, you need:

  • Talend Cloud account with log configuration in TMC – refer to this help document for Talend Cloud log configuration
  • Amazon S3 bucket – refer to this Amazon page on Amazon S3
  • AWS Lambda function – refer to this Amazon page on AWS Lambda functions
  • Amazon Elasticsearch domain – refer to this Amazon page on Amazon Elasticsearch domains

Steps

This section outlines the steps needed for loading streaming Talend Cloud logs into the Amazon ES domain.

Step 1: Configure Talend Cloud

  • Download the CloudFormation template. Open your AWS account in a new tab and start the Create Stack wizard on the AWS CloudFormation Console.

In the Select Template step, select Upload a template to Amazon S3 and pick the template provided by Talend Cloud.


 

In the Specify Details section, define the External ID, S3BucketName, and S3 prefix.


 

Click Create. The stack is created. If you select the stack, you can find the RoleARN key value in the Outputs.


 

In the Review step, select I acknowledge that AWS CloudFormation might create IAM resources.

 

Go back to the Talend Cloud Management Console and enter the details.

 

Step 2: Create an Amazon Elasticsearch domain

Refer to this document : https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-gsg-create-domain.html

For this KB, I am selecting Development and Testing.

 

Give a domain name.

 

For the rest of the options, set values as needed by the organization and click on Create.

 

Step 3: Create the Lambda function

    • There are multiple ways to create a Lambda function. For this blog, I am using an Amazon Linux machine with the AWS CLI configured.
    • Using PuTTY, log in to the EC2 instance.

     

    Install the Python tooling using these commands:

    yum -y install python-pip zip

    pip install virtualenv


     

    • Run the next set of commands

    # Prepare the log ingestor virtual environment

    mkdir -p /var/s3-to-es && cd /var/s3-to-es

    virtualenv /var/s3-to-es

    cd /var/s3-to-es && source bin/activate

    pip install requests_aws4auth -t .

    pip freeze > requirements.txt


     

    Validate that the files needed are installed


     

    Create a file named s3-to-es.py and paste the attached code into it.


     

    Change the file permissions to 754.


     

    Run the command to package

    # Package the lambda runtime

    zip -r /var/s3-to-es.zip *


     

    Send the package to S3 bucket

    aws s3 cp /var/s3-to-es.zip s3://rsree-tcloud-eu-logs/log-ingester/


     

    Validate the upload in the S3 bucket


     

    Create Lambda function


     

    In the function code, select ‘Upload a file from Amazon S3’ as shown below and click on save


     

    Add a trigger by selecting the S3 bucket.

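    If you prefer to script this step instead of clicking through the console, a hedged boto3 sketch of the same wiring could look like this (the Lambda ARN and account ID are placeholders; the bucket name is the one used earlier):

    import boto3

    BUCKET = "rsree-tcloud-eu-logs"                                         # bucket used earlier
    LAMBDA_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:s3-to-es"  # placeholder ARN

    lambda_client = boto3.client("lambda")
    s3 = boto3.client("s3")

    # Allow S3 to invoke the function (the console adds this permission for you).
    lambda_client.add_permission(
        FunctionName=LAMBDA_ARN,
        StatementId="s3-invoke-log-ingester",
        Action="lambda:InvokeFunction",
        Principal="s3.amazonaws.com",
        SourceArn=f"arn:aws:s3:::{BUCKET}",
    )

    # Register the bucket notification so every new log object triggers the function.
    s3.put_bucket_notification_configuration(
        Bucket=BUCKET,
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [
                {"LambdaFunctionArn": LAMBDA_ARN, "Events": ["s3:ObjectCreated:*"]}
            ]
        },
    )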

     

    Validate that the trigger is added to the s3 bucket


     

    Now let’s execute a Talend job so that the log is routed to S3. You can see from the Lambda Monitoring tab that the log is being pulled in. You can also view the logs in CloudWatch.


     

    Step 4: Create visualizations in Kibana

    Navigate to the Elasticsearch domain and notice that a new index has been created.


     

    You can also search for this index in the Kibana dashboard.


     

    Click on Discover to view the sample data.

     

     

    You can now create visualizations and see them in the dashboard.

    Conclusion

    In this blog, we saw how to leverage the power of ELK with Talend Cloud. Once you have ELK configured, you can use it for diagnosing and resolving bugs and production issues, or for metrics about the health and usage of jobs and resources. Well, that’s all for now – keep watching this space for more blogs and, until then, happy reading!

     


The University of Sydney uses data to drive better decision making


Universities generate enormous amounts of data that are used for a plethora of reasons, including planning and management of revenue and donations, fees, growth and development; institutional reporting to the government; and tracking progress towards achieving goals.  

The University of Sydney is regularly ranked among the top research universities in Australia and the top 50 universities worldwide. With over 70,000 students and 7,500 staff, the University generates vast quantities of data that can be harnessed to drive informed decision-making to support the institution’s future success. More than 2,000 staff at the University regularly interact with institutional data.

A new data and analytics environment

As data was sometimes siloed in departments and faculties, data analysis was slow and inconsistent.

Recognising the importance of bringing together their 12 sources of data to support future decision making, the team embarked on a three-year plan to establish a Modern Data Environment (MDE) consisting of Cloudera’s Apache Hadoop distribution running on AWS infrastructure. The University also chose Talend Cloud for its ability to move high volumes of data sources into the cloud quickly, enabling the team to explore and innovate faster without significant transformation upfront.

A true analytics partner

This has strengthened their ability to partner on analytics projects across the University. One project involved the provision of hot data to support real-time learning analytics. Another project involved collaborating with the University’s Library to understand how resources such as online journals and databases are being used by students and academic staff. The team were able to use the MDE to measure the role the library plays in supporting students’ education, as well as to provide data-driven insights that influence aspects such as the future resources the library may procure and its engagement with the wider University community.


2 billion records and 12 Data sources

The University of Sydney chose Talend Cloud for its ability to move high volumes of data sources into the AWS cloud quickly, enabling the University to explore and innovate faster.

 

 

 


Have you checked out Talend’s 2019 summer release yet?


Have you had a chance to take a look at Talend’s summer 2019 product release? Our 2019 release has some exciting features that not only will help improve your productivity but will help you scale data projects across your organization. We are all about helping you do your work faster, and we think you’ll find the new features in this latest product release pretty great.

Changes to Pipeline Designer

Some big changes are coming to our next-generation cloud data integration design environment, Pipeline Designer. First of all, if you haven’t tried Pipeline Designer, it’s now really simple to try out and purchase the product. You can save money and pay as you go using a credit card, so you can easily stick to your budget and only pay for what you use.

Plus, it’s going to be a lot easier to get support with the built-in product chat. You can debug, request help or simply ask a product-related question to a Talend employee from directly within Pipeline Designer.

Improved Connectivity between Pipeline Designer, Microsoft Azure, and Databricks

Talend’s summer release features improved connectivity to Azure Data Warehouse, Blob Storage and Azure Data Lake Store Gen2. In addition, you can scale and leverage the power of Databricks directly from any pipeline, while Talend Studio for Big Data now features connectivity to Databricks’ Delta Lake. The idea behind all these features is to increase the number of integration possibilities. With Pipeline Designer, you can increasingly connect anything to anything!

Improving Data Quality and Data Governance

Pipeline Designer isn’t the only Talend product that got improvements in this release. We also improved Talend’s data quality and data governance capabilities. Check out these new features:

  • MagicFill
  • Reversible Masking
  • Campaign Monitoring

MagicFill utilizes machine learning to automatically suggest transformations to data based on sample input from the user, minimizing mistakes and saving you time. Reversible masking increases your data security even further by enabling sensitive data to be protected across any data flow.  Finally, Campaign Monitoring is a cutting-edge dashboard contained within Talend Data Stewardship. It can track the effectiveness of data remediation across your organization. 

 

Containerization and Native Docker Support

As your organization adopts containers and microservices, you can use Talend’s new native Docker support for Data Services and Routes. Companies can scale up or down depending on the needs of the business, and the containers created by Talend work with any Containerized or Kubernetes environment. 

The new Zero-Config CI Plugin scales DevOps with enterprise grade automation without the complexity. There’s no command line to install or configure. Automatically install, configure, and build to maximize CI with Talend. And everything is optimized for cloud CI/CD managed services. 

Take a look at Talend’s new features

This video gives you an up-close look at Talend’s new features and capabilities:

Give Talend’s new release a try

Do these features sound enticing? Thought so. See everything that’s included in this summer’s release and give all the new products and features a try.

 

 

 

 



9 top trends that are driving AI and software investments


IT and data leaders are constantly challenged to keep up with new trends in emerging and disruptive technologies, and to determine how each can best aid the organization. In the midst of all the changes going on in 2019, it gets increasingly hard to know where to invest in all this new technology.

To help add clarity, here are my thoughts on some of the most important trends that will shape data management and software development for the next couple of years.

Cloud computing

The business multi-verse expands through multi-cloud as data inefficiencies are solved: Multi-cloud promises tremendous reward if it can be used properly, but data inefficiencies and complicated compliance policies hinder progress for many.

Expect to see some of those data inefficiencies fade away as effective data strategies are implemented and new technologies unleash true multi-cloud functionality to the masses.

AI / machine learning / ML trust/ethics/bias

Questions around data morality will slow innovation in artificial intelligence and machine learning: Last year saw the hype around AI/ML explode, and data ethics, trust, bias and fairness have all surfaced to combat inequalities in the process to make everything intelligent.

There are many layers to data morality, and while ML advancements won’t cease — they’ll slow down as researchers try to hash out a fair, balanced approach to machine-made decisions.

The black box of algorithms becomes less opaque: Part of the issue with data morality in AI and machine learning is that numbers and scenarios are crunched without insight into how the subsequent answers came to be. Even researchers can have a hard time sorting it out after the fact.


In the coming years, while it won’t lead to complete transparency with proprietary algorithms, the black box will still become less opaque as end users become increasingly educated about data and how it’s used.

GDPR / CA consumer privacy laws / Data privacy

The “G” in GDPR will soon stand for “Global”: Data privacy regulations are going to become more widespread. For example, California, Japan and China are already working on their own regulations to adopt rules similar to the EU’s GDPR.

Additionally, companies like Facebook, Google and Twitter have all severely mishandled consumer data, showing the need for increased and widespread data privacy regulations — even prompting Apple CEO Tim Cook to call for global privacy regulations. With consumers now viewing data privacy as a human right, increased data governance policies are sure to follow.

As privacy regulations spread, organizations will mistake data governance for data harassment: Based on what consumers do online, companies are able to determine, through their data, their demographics, interests and even what’s going on in their personal lives. This results in marketing so hyper targeted, it could feel like harassment.

While organizations struggle to comply with privacy regulations and create more well rounded and informed views of each of their consumers, the lines between governance and harassment will blur and there will be rocky roads as best practices are formed.

Social media is officially too big to fail: Social media companies have become the biggest publishing media brands and they are finally coming under scrutiny this year. However, there were no real repercussions for advertising fiascos and data privacy controversies despite Congress’s involvement, and the reality is that social media brands have become too big to fail.

While there will still be fights to remedy it — and there should be work done on this end — 2019 will solidify how social media companies are now too big to fail (or become regulated).

Data skills / Data as a team sport

The data skills gap will increase – but so will data literacy: Data is both the problem and the answer for businesses. It’s a problem because businesses manage to collect more data than they know how to use, yet it’s the answer because it can predict forecasts and offer insight into how the business should run.

Expect to see the data skills gap continue to increase — users need to be able to analyze properly where data comes from and how to use it, and it only gets more complicated as more data is made available and as algorithms enter the fray. But at the same time, business users will also grow more data literate as they seek to approach data as a team, and help one another get what they need from their data.

Serverless and open source

Serverless will move beyond the hype as developers take hold: Last year was all about understanding what serverless is, but as more developers learn the benefits and begin testing in serverless environments, more tools will be created to allow them to take full advantage of the architecture and to leverage functions-as-a-service.

Serverless will create new application ecosystems where startups can thrive off the low-cost architecture and creatively solve deployment challenges.

The market will double down on open source technologies: Last year saw $53 billion in deals involving open source following the Cloudera/Hortonworks merger and acquisitions of Red Hat, GitHub and others.

Expect to see businesses double down on open source technologies — more investments and deals will get done, and open source communities will also pour more effort and energy into projects after having seen the opportunity for open source in the marketplace.

To date, open source has still functioned with a freemium model, but the coming years may see that shift as the enterprise finds value in conventional open source technologies.

 

This article originally appeared on Information Management. 


How next-gen DI works


Overview

Data integration in the ‘Age of Digital’ brings the need for ETL development to happen at the ‘Speed of Business’ rather than at ‘IT Speed’.

The data integration layer is the important ‘glue’ between the user engagement apps at the EDGE and the systems of record at the CORE of the IT landscape. Application development for the Experience Layer happens at the ‘Speed of Business’ while changes in the Integration Layer move at ‘IT Speed’. Data integration projects in the Age of Digital need to be delivered much faster and more intuitively to meet the demands of enterprises’ digital transformation.

Needs of Automation

  • Automation interfaces for rapid ETL development.
  • Self-service ETL development based on pre-defined data integration patterns.
  • Emphasis on ETL patterns that integrate cloud based apps, un/semi-structured data formats, NoSQL data stores with enterprise systems.

 

Solution Overview

Wipro’s NextGen DI automation with Talend and pre-built pattern libraries helps kick-start Big Data, digital and traditional data integration projects from Day 1, including accelerating cloud deployment and the building of cloud applications for business analytics. While responding to current data integration needs beyond ETL development, this IP tool also provides other critical modules such as pattern discovery for design identification of existing ETL, batch analysis of data flow dependencies for batch optimization, and source-to-target data lineage document generation. With the combination of all these modules in one platform, NextGen DI is a complete solution for today’s data integration needs.


 

Deployment

After deploying NextGen DI and running it in Tomcat, you will see the application home page.


Once successfully logged in, users can access the different modules of the application.


Workflow Generation

Wipro’s Next Generation Data Integration Platform uses a design-pattern-based approach with a rich GUI to automate the Talend development process, thus reducing development effort and improving code quality very significantly.

It provides the following functionalities

  • Automated Talend job mapping
  • session and workflow generation
  • Extensible hierarchical pattern library
  • Rich intuitive user interface
  • Creation of patterns from pre-existing mappings
  • Bulk generation of Talend jobs

 

Expandable Library

NextGen DI follows a pattern-based approach for generating the mappings. Commonly used Talend logic is created and stored as a pattern in the web application, and each pattern can be reused any number of times to create mappings.

  • Similar patterns are grouped under a collective category, and similar categories are in turn grouped under a library.
  • The library structure is therefore a 3-level hierarchy: Library -> Category -> Pattern.
  • The application ships with 50 patterns grouped under 20 categories and 4 libraries.


  • This library structure is expandable: users can easily create a new library containing categories and patterns without any code change.
  • Once a particular library is chosen, the user can see the available categories inside that library.


  • Once a particular category is chosen, the user can see the available patterns inside that category.


 

Mapping parameter screen 

In this screen, the user feeds in the required source and target metadata and the parameters for the chosen pattern to create a mapping.

  • The user can upload all the files needed to generate a mapping, such as source and target metadata, mapplets and dictionary files if there are any. Multiple files can be uploaded at once.
  • The user can either directly upload the XMLs, or an SQL file containing the table structure of the source and target metadata.
  • If an SQL file is uploaded, the user can choose the database type of the source and target metadata to be created.


Download the individual job or download as a project into Talend Studio.


 

Features

  • Automated ETL job creation for Talend.
  • Portability to Big Data Edition and Cloud DI.
  • Extensible hierarchical pattern library.
  • Pre-built library of ETL patterns for various use cases such as data ingestion for different databases and Big Data ecosystems such as HDFS, Hive, etc.
  • End to end patterns for Digital integration.
  • Pre-built patterns for Snowflake on Next Gen DI with Talend help to jump-start Cloud Analytics and deploy modernized Cloud Data Warehouse faster.
  • Rich intuitive user interface.
  • Graph based analysis of batch data flows and dependencies.
  • Bulk generation of Talend code.

 

Benefits

  • Leverage existing code to detect and extract patterns in order to fast track future development.
  • Create an org level Data Integration Pattern Library to bring in higher level of compliance and standardization.
  • Accelerates integration and adaptation of new age technologies using pre-built Digital pattern library.
  • Improve quality of code and reduce defects through automation. Defect reduction of over 50%.
  • Bring back focus of developers into Analysis and Design.
  • Drastically cuts down development effort by ~ 40 to 50% improving time to market and reducing costs.

 

About the Authors

Ganesh Arunasalam, Senior Architect – Data, Analytics & AI, Wipro. Ganesh has over 19 years of data warehouse experience. He is currently focusing on open source integration technologies and has successfully executed large engagements for global companies. He is a TOGAF certified Enterprise Architect. He is also certified in different database technologies and supports the practice in managing cloud native ETL tools.

Purushottam Joshi, Senior Architect – Data, Analytics & AI, Wipro. Purushottam has over 21 years of data warehouse and ETL experience. He is currently focused on open source integration technologies and has successfully executed and delivered large engagements for Fortune 500 companies across domains such as Healthcare, Manufacturing and Telecom. He is TOGAF certified.

 


Can data privacy compliance create more business opportunities?


CCPA is the latest regulation which should be a concern for all companies with California residents in their databases. There are less than 90 days left to be ready to comply with CCPA, on top of the other regulations.

 

The spread of data privacy regulations

According to Graham Greenleaf, Professor of Law & Information Systems, UNSW Australia and author of Global Data Privacy Laws 2019: 132 National Laws & Many Bills, the number of countries that have enacted data privacy laws in 2017-18 has risen from 120 to 132, a 10% increase. These 132 jurisdictions have data privacy laws covering both the private sector and public sectors in most cases, and which meet at least minimum formal standards based on international agreements.

Graham goes on to explain that at least 28 other countries have official Bills for such laws in various stages of progress, including 9 that have introduced or replaced Bills in 2017-18. Many others, in the wake of the GDPR and ‘modernization’ of Convention 108, are updating or replacing existing laws.

In my last blog, “CCPA will be live in less than 3 months. Do you have a plan?”, we covered why governments felt the need to create regulations. Your company can’t avoid them.

 

Data privacy: new rules for a new game

If you thought data privacy was the problem of your legal department only, you would be surprised how quickly it could come back to you. As a business person, what would be the impact on the image of your company and your sales if you were fined? As a data person, how long would it take your teams to answer 200 simultaneous requests for data access? Could you meet the deadlines?

Looking at solutions seems costly at first. Linking and governing all the data sources appears complicated, long and expensive. Adopting a data privacy mindset also means changing people’s behavior, enabling new standards and adding more rules.

Stay with me, it is not as painful as it sounds.

 

What to tell to the C-level suite

Transitioning your company to a data privacy routine will be a success with the right motivation. And what will make the C-suite happy? More revenue and a data-driven company. So, what does data privacy have to do with it?

Implementing Data Governance will require changing the way you manage data. After you discover and understand the diverse sources within your data landscape, you will use a data catalogue. The opportunity of a data catalogue lies in data inventory automation, data integrity, data lineage and documentation. While moving your data into a data lake and a data warehouse with the right technology, you will be ensuring data quality, data access rights and anonymization at the speed your business needs.

When all your trusted data is available in one place, you will be able to extract the insights needed to create customized experiences for your customers. Besides, your customers will be willing to spend more for a personalized experience because, according to McKinsey research, personalization can deliver 5 to 8 times the ROI on marketing spend and lift sales by 10%.

So, who said data privacy was a burden? Protecting your company while increasing revenue can’t wait. Let’s start now! We have a solution, developed with our partner Prolifics, that you may find valuable.

 

Learn the 3 Steps to Turning CCPA & Data Privacy into Personalized Customer Experiences

 


The Smart Data Cloud Approach


To maximize the benefits of cloud, data and machine learning and make them available to innovative digital teams and customers, the data lake is the definitive reference model. How can we accelerate its implementation and scope without losing control or jeopardizing its sustainability?

Many data lake initiatives lead to failure by using artisanal approaches such as specific coding, and by putting aside issues such as quality, security, governance and compliance. Data from a data lake is only useful if it helps make the right decisions at the right time. An agile and governed approach based on open, integrated, collaborative, and secure platforms leads to success.

The major challenge in managing information is governance, control and processes.

40% of hours worked are wasted on:

  • Managing growing volume of data
  • Managing access and information sharing
  • Workload dedicated to collect, consolidate and validate data

Organizations’ digital transformation creates new constraints, such as the ability to process:

  • Billions of rows and data volumes of several tens of terabytes
  • Variable data structures (CSV, JSON, XML, etc.)
  • High-speed transactions with rapid response times
  • Complex analyses in real time

The digital transformation can be divided into 5 main topics:

Reconcile information used internally and externally

  • To ensure information reliability
  • To increase business activities comprehension

Produce quicker information

  • Improve processes
  • Improve tools used to produce information

Reduce costs of producing information

  • Automate non-value-added tasks
  • Decrease length of production cycle

Improve information quality

  • Gain more time to analyze and transform data
  • Facilitate information publication

Master information access

  • Master information visibility
  • Give information access to users outside the company

The solution that helps solve these challenges is the Smart Data Cloud approach (3 players, 1 solution).

With its partners, Keyrus has created an approach that merges best practices in data integration with two leading market solutions: The Data Integration platform (Talend) and The Data Storage platform (Snowflake).

Talend is the leading organization in cloud and Big Data Deployments. With a combination of products built on the same platform, with one design environment and one set of management tools, Talend gives customers the flexibility of addressing different types of integration problems with the same product at a fraction of the cost of alternative solutions.


Snowflake is a leading company in cloud data warehousing, providing a unique approach to ingest, store, share and make structured and semi-structured data available. Snowflake’s unique engine reduces processing, access and management times for data warehouses.

 

Keyrus helps enterprises take advantage of the Digital and Data paradigm to enhance their performance, assist them with their transformation and generate new levers of growth and competitiveness. Keyrus specializes in performance management consulting and the integration of innovative technological solutions in the Data and Digital fields. Over the last twenty years, the Group has developed its expertise and skills in helping companies optimize performance and meet transformation challenges.

 

The Smart Data Cloud approach

The Smart Data Cloud approach is based on three components that give our customers the possibility of implementing their strategy in a rapid, progressive and secure way.

1 – SaaS Architecture

The first step is the SaaS (Software as a Service) architecture selection for both Talend and Snowflake. With this architecture, customers can develop, implement and grow their platform without major investment. The pricing model is on a pay-per-use basis.

This is a plus in terms of management, as the architecture cost is predictable.

Another advantage of the solution is scalability. The two solutions are designed to scale without requiring additional investment to sustain growing activity.

2 – Agility

Keyrus uses an agile methodology to implement and build data warehouses for its customers.

Using a user-workshop approach to understand and prioritize business needs, Keyrus guarantees iterative and rapid delivery with visible results.

Instead of waiting several months for project delivery results, multiple project milestones will be monitored in order to control project delivery dates and performances. Results are published giving users the ability to extract and present valuable information to the organization.

3 – Turnkey delivery

The delivery model managed by Keyrus gives our customers the confidence of a single contact point that understands both business goals and challenges. The Keyrus contact center manages and supports the product, development and maintenance of the global solution.

Overall Architecture

The architecture will allow data collection, platform scalability and solution flexibility. As a major Talend and Snowflake integrator, Keyrus offers custom connectors and APIs with several external and internal data sources. Keyrus consultants build industrial workflows (DevOps) and implement continuous integration and development processes (CI/CD).

 

Stakeholder Stages

There are three phases, delivered for each organization, that allow customers to benefit from the full Keyrus added-value solution.

  • Ability to connect to classic Databases / Cloud or API sources using either Talend Data Integration Tool or Snowflake Integration Query Language.
  • Ability to simplify and automate feed processes (either Real Time, CDC or ETL/ELT Processes).
  • Ability to integrate your historical data.

  • Ability to implement Data Quality processes using Data Integration solutions provided by Talend, or by implementing ETL business processes.
  • Ability to implement data reconciliation processes, to gather reliable data from internal or external sources.

  • Ability to expose and share data in your overall organization by creating DataMart views and API’s inside the Snowflake architecture.
  • Ability to expose data to your suppliers, customers and prospective customers using APIs.
  • Ability to resell your data to create a Data-as-a-Service offer using APIs.
  • Ability to process data using Advanced Data Analytics solution such as Machine Learning, Data Analysis or other tools.

Conclusion

The offer is based on our technical accelerators (starter kit / DevOps), which make it possible to deliver a progressive and scalable data warehouse.

Keyrus will deliver data ready to be exposed and shared (Open Data, APIs, etc.), combined with human knowledge and IT processes.

Combining the leading technical solutions Talend and Snowflake with Keyrus’s experience ensures a complete and reliable solution delivery for your organization.

 

The post The Smart Data Cloud Approach appeared first on Talend Real-Time Open Source Data Integration Software.

5 best practices to innovate at speed in the Cloud: Tip #2 Get instant access to data whenever it is needed


Starting in September, the Talend Blog Team began sharing tips and tricks to securely kick off your data project in the cloud at speed. This week, we’ll cover the second capability: get instant access to data whenever it is needed.

 

Data enablement doesn’t stop at delivering data in a data lake or data warehouse. Data must easily reach its point of consumption. Whether it is accessed as self-service by a business user or integrated into an application, making trusted data available to all when it’s needed is of utmost importance.

 

Why it’s important

Traditionally, organizations have established what IDC calls “governance with the no”, which means that business users have to come to central IT with requests and wait until they are fulfilled and authorized. This has created a gap between business and IT when it comes to data ownership, a gap that is only widening with the realities of data sprawl.

 

When it’s important

Beyond the timeliness of data, competitive differentiators can be achieved with the ability to deliver data to the proper audience, at the right time and in the right place. Think about recommending the next product to buy to a retail customer. Pushing this recommendation in real time while the customer is online will make it more appealing, and much more profitable, than pushing it in an email as part of an outbound marketing campaign.

 

Our recent data trust readiness report reveals that less than half of respondents (42%) are confident about having access to data at any time.

Download Data Trust Readiness Report now.

How Talend tools can help

Research conducted in 2017 by The Data Warehousing Institute shows that nearly half of the organizations surveyed (48%) are planning a replacement project for their data warehouse platform by 2019. A lot of these organizations are moving to cloud-based data warehousing, which gives them virtually unlimited capacity and scalability, a more economical way to leverage warehousing and, in many cases, cost savings. But cloud data warehousing alone won’t solve the trust and speed issue.

 

Managing multi-cloud integration and access at speed is critical

Organizations need not only a single platform with multi-cloud integration capabilities, but also modern cloud-based apps to transform, match and cleanse data at speed.

By helping companies modernize their data lake approach with the right cloud platform (AWS, Snowflake or Azure), Talend provides what is needed to share trusted data at speed.

 

Big Data Processing at Speed on Azure:

Talend recently announced a partnership with Microsoft in which Talend provides fast development of Big Data ETL processing, cloud data lakes, cloud data warehousing, and real-time analytics projects on the Microsoft Azure Cloud Platform.

As an example, Talend Big Data Platform quickly integrates, cleanses, and profiles the ingested data stored on ADL Store, while the customer adds requirements for data governance, business rules, and compliance rules. The data is then sent to Azure HDInsight, a service that enables clusters of managed Hadoop instances and is commonly used for easy, fast, and cost-effective big data processing. The process of ingesting data into Azure using Talend with this architecture is 50% faster than with their existing ETL architecture.


Faster trusted Analytics with Talend & Azure DataBricks

Azure Databricks provides the latest versions of Apache Spark and allows you to seamlessly integrate with open source libraries so you can spin up clusters and build quickly in a fully managed Apache Spark environment with the global scale and availability of Azure. With Talend and Databricks, it’s easy to turn massive amounts of data into trusted insights at cloud scale. You can design and deploy Spark data processing, complex transformations and machine learning jobs in the cloud with ease whilst saving up to 80% of the data processing cost.

 

How to get started

Making data accessible is one of the pre-requisites to start producing value.

Data roles in most organizations have radically evolved over the last few years. Gartner notes, “Key roles such as the data steward are shifting from the IT group to placement either purely in business units or in an IT-business hybrid combination.” It is then essential to provide easy and fast access to data through the right tools and channels.

 

Download the Definitive Guide to Cloud Data Warehouses and Cloud Data Lakes to learn:

  • What you need to look for when starting to create your cloud data warehouse or data lake
  • Your three-step plan to make sure your data warehouse investment succeeds
  • Real-world case studies of the tech stacks companies use to achieve their business goals with cloud data warehouse solutions

 

Download the Cloud Architect’s Handbook to learn:

  • How to leverage a cloud data warehouse to deliver real-time tracking services
  • How to use Microsoft Azure and Talend together to optimize social media for GDPR-compliant marketing campaigns
  • Real-life case studies of organizations using Microsoft Azure and Talend together to successfully deliver big data analytics

 

 

 Want to explore more capabilities?

This is the second out of five speed capabilities. Cannot wait to discover our third capability?

Go and download our Data Trust Readiness Report to discover other findings and the other 9 trust and speed capabilities.

 

Download Talend Data Trust Readiness Report

 

The post 5 best practices to innovate at speed in the Cloud: Tip #2 Get instant access to data whenever it is needed appeared first on Talend Real-Time Open Source Data Integration Software.

Talend’s vault-based secrets revealed: Open-sourcing Vault Sidecar Injector

$
0
0

Handling secrets has always been a challenging and critical task within organizations. As production workloads deployed in remote clouds or following hybrid patterns keep increasing, this problem becomes more complex: we still want to manage secrets from a central place, using state-of-the-art security practices (encryption at rest, secrets rotation), with as little dependence on the underlying technology as possible, to maximize component reusability across deployment topologies and ease testability.

Today we are open sourcing the code of our Vault Sidecar Injector component. You can start exploring it on GitHub right now.

This component is part of our hybrid secrets management solution, relying on sidecar injection and a Kubernetes admission controller to provide a secure, seamless and dynamic experience.

Benefits to the Ecosystem

Kubernetes provides an effective but somewhat basic way to handle secrets. By default, encryption at rest is not enabled and advanced key management operations are not available. There is also no signaling mechanism upon secret changes, which forces a file-polling pattern on applications that want to use up-to-date credentials.
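To make that last point concrete, here is a minimal sketch of the file-polling pattern. The mount path, poll interval and reload routine are hypothetical placeholders to adapt to your own application, not anything prescribed by Kubernetes or by Vault Sidecar Injector.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Minimal sketch: poll a mounted Kubernetes Secret file and react to rotation.
// The mount path is hypothetical; adjust it to your Pod's volumeMount.
public class SecretWatcher {
    private static final Path SECRET = Paths.get("/etc/secrets/db-password");

    public static void main(String[] args) throws Exception {
        byte[] current = Files.readAllBytes(SECRET);
        while (true) {
            Thread.sleep(30_000); // poll every 30 seconds; Kubernetes sends no change notification
            byte[] latest = Files.readAllBytes(SECRET);
            if (!Arrays.equals(current, latest)) {
                current = latest;
                reloadClients(new String(latest).trim());
            }
        }
    }

    // Placeholder for application-specific reload logic (e.g. rebuilding a connection pool).
    private static void reloadClients(String newSecret) {
        System.out.println("Secret rotated, re-initializing clients");
    }
}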

But, paradoxically, the biggest issue may lie in the “Kubernetes-native” nature of this solution: it is not the best candidate in architectures mixing legacy/non-containerized applications with Cloud Native ones. It may also not be a good fit if you want to handle secrets dispatched over a set of Kubernetes clusters.

For organizations looking to centralize secrets management among Kubernetes clusters and traditional applications, it is more relevant to adopt a technology/stack agnostic solution with full-fledged encryption and authentication options.

The Vault Sidecar Injector component allows organizations to securely and continuously fetch and push secrets to applications while still delivering a developer experience on par with Kubernetes Secrets.

What’s wrong with Kubernetes Secrets?

Kubernetes comes with the Secret resource to allow secret creation, retrieval and deletion. As a native Kubernetes concept, it is best used to handle secrets for applications living in the cluster. Containers consuming Secrets mounted in a volume will receive updates whenever a change is deployed in the cluster (your application should be designed to be aware of secrets rotation).

The secret itself is stored in etcd, the key/value store that also serves as the internal database for all Kubernetes configuration and administration operations.

  • Before Kubernetes 1.7, secrets were stored in cleartext in etcd.
  • From Kubernetes 1.7, encryption of secret data at rest is supported but not applied by default. This is still the case with latest Kubernetes version (1.16 as of writing this blog post).

To sum up, Kubernetes Secrets

  • offer a Kubernetes native, off-the-shelf, solution for your in-cluster apps but cannot be leveraged by out-of-cluster apps
  • do not support storing or retrieving secret data from external secret management systems
  • are not securely stored by default
  • seamlessly provide access to secrets, in cleartext, to your apps, but without any means of notifying them about changes (secret rotation is, however, supported if the Secret is mounted as a volume)

Unless your security requirements are low, the built-in Kubernetes Secret resource is not a good fit, even more so for hybrid deployments. It also lacks advanced secret management features (admin console, secrets revocation & renewal, leases …) and cannot cope with dynamic secrets (secrets generated on the fly at read time).

Introducing Vault Sidecar Injector

At Talend, we are dealing with both on-premises legacy applications and cloud applications hosted in managed or unmanaged environments on multiple cloud providers. The need for a centralized secrets management solution arose early on. Such a solution should tick the following boxes:

  • same secret infrastructure for both Kubernetes/Cloud-native apps and legacy/non-Kubernetes/on-premises apps
  • avoid cloud vendor lock in (no cloud managed secret stores)
  • secure storage by design (encryption of data at rest)
  • support secrets renewal & revocation operations
  • handle dynamic secrets
  • administration capabilities
  • … and Kubernetes-ready to be deployed as a service for in-clusters and out-of-clusters apps

Moreover, in order to minimize downtime, cloud-native apps should be alerted of any secret change so that they have a chance to reload and apply new values without incurring a restart (what we call “dynamic handling of secrets”).

As we were assessing several technologies and products, including Kubernetes Secrets, we finally selected HashiCorp Vault as the core component of our solution. Key advantages with Vault are that it is cloud agnostic and can be deployed in Kubernetes as well as on its own. Then, using a Vault agent in combination with a template engine and leveraging some advanced Kubernetes features, we were able to give birth to what became the Vault Sidecar Injector to provide developers with a solution as easy to use as Kubernetes Secrets but a lot more powerful.

In a nutshell, Vault Sidecar Injector

  • registers as a Kubernetes Admission Controller webhook and defines a set of Kubernetes annotations to easily invoke it in any workloads requiring access to some secrets
  • injects HashiCorp’s Vault Agent and Consul Template as sidecars to connect to any Vault server, issue/renew/revoke tokens and fetch secrets through flexible templates

As of today, it supports the following features:

  • handle both Kubernetes deployment and job workloads. In the latter case, an additional sidecar is injected to monitor for job termination and properly shutdown Vault Agent and Consul Template sidecars
  • authentication on Vault Servers using either Kubernetes or AppRole methods
  • continuously renew Vault access tokens and fetch up-to-date secrets values
  • able to notify applications of any secrets change
  • secrets path in Vault, templates to fetch secrets, output filename for secrets, Vault roles to use are all made customizable through dedicated annotations

After this short introduction, time to delve into implementation details!

Going deeper with some technical insights

Vault Sidecar Injector leverages Kubernetes’ Admission Controllers:

Definition excerpted from Kubernetes Blog: “Kubernetes admission controllers are plugins […] that intercept API requests and may change the request object or deny the request altogether. The admission control process has two phases: the mutating phase is executed first, followed by the validating phase. Consequently, admission controllers can act as mutating or validating controllers or as a combination of both.”

In detail, Vault Sidecar Injector is a webhook registered against the Mutating Admission Webhook controller. This Kubernetes controller does nothing on its own but call each registered webhook with the manifests of intercepted resources. Those webhooks are then responsible for the whole logic of how to mutate the objects.

So, all it takes is to register and implement a webhook admission server that will receive pods’ creation requests and dynamically inject Vault Agent and Consul Template as sidecars in requesting pods. Mutations are performed using JSON Patch structs.
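As an illustration of the kind of mutation such a webhook returns, here is a minimal sketch built with the standard JSON-P (javax.json) API. The container definition is a deliberately simplified placeholder, not the actual patch produced by Vault Sidecar Injector, and running it requires a JSON-P implementation on the classpath.

import javax.json.Json;
import javax.json.JsonObject;
import javax.json.JsonPatch;

// Minimal sketch of a JSON Patch that appends a sidecar container to an
// intercepted Pod spec. The container below is a simplified placeholder.
public class SidecarPatchExample {
    public static void main(String[] args) {
        JsonObject vaultAgent = Json.createObjectBuilder()
                .add("name", "vault-agent")
                .add("image", "vault:latest")
                .build();

        JsonPatch patch = Json.createPatchBuilder()
                .add("/spec/containers/-", vaultAgent) // "-" appends to the containers array
                .build();

        // The admission response carries this patch (base64-encoded) back to the API server.
        System.out.println(patch.toJsonArray());
    }
}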

Figure below depicts what’s happening when you submit a manifest with custom annotations managed by the Vault Sidecar Injector:

Note: in the picture above, Vault Server is deployed out of cluster. Of course, you can also deploy it inside (even in another Kubernetes cluster), only constraint is for the injected containers (Vault Agent and Consul Template) to be able to contact it.

Getting Started

To help you install and use Vault Sidecar Injector, and for a detailed description, take a look at the README on GitHub.

The repository also comes with samples of annotated workloads to allow you to quickly test the component.

Community

Vault Sidecar Injector code is published under the very permissive Apache 2.0 license. We are already using this component internally and would eagerly listen for any improvements and suggestions from the open source community.

Here’s how you can contribute to Vault Sidecar Injector:

Tags: #technology #kubernetes #secrets #security #vault

 

The post Talend’s vault-based secrets revealed: Open-sourcing Vault Sidecar Injector appeared first on Talend Real-Time Open Source Data Integration Software.

Experience the magic of shuffling columns in Talend Dynamic Schema


If you are a magician specializing in Talend magic, you have surely heard the key phrase "Dynamic ingestion" of data from various sources to target systems, instead of creating an individual Talend job for each data flow. In this blog, we will do a quick recap of the concept of Dynamic schema and see how we can reorder or shuffle columns when employing Dynamic schema in ingestion operations.

There are multiple methods (or magical spells) available for shuffling. Below, I will share a simple spell which can be cast for the most commonly used files for dynamic ingestion, i.e. delimited files. (Statutory warning: if you are not a Harry Potter fan, I would recommend opening a new tab with your favorite search engine to quickly check the different magical terms. 😉)

Magic shuffling Talend dynamic schema

Before going to the details regarding shuffling of columns, let us see what Dynamic schema is and what makes this spell more exciting compared to traditional ways of data ingestion.

Dynamic Ingestion – A quick recap about the basics of magic

To make the concept more interesting and to help our new members of Talend developer community, I will explain the concepts with a simple game of arranging a deck of cards (Experienced Talend magicians who know the “Dynamic” spell can safely skip to next heading).

Let us imagine that each row of your data file is a combination of cards, where each column of the row is an individual card. Now we are going to move these combinations from source to target using different methods. Sounds interesting? OK, let’s proceed 😊

In the traditional method, if you must move the card combinations from source to target, you have to move one combination at a time. In the diagram below, you have one task to move the first combination from source to target and another task to move the second combination from source to target.

Magic shuffling Talend dynamic schema

Coming to Talend context, imagine you have two files with different data types and number of columns as shown below.

Magic shuffling Talend dynamic schema

Magic shuffling Talend dynamic schema

In common scenarios, Talend developers might think of creating two separate Talend jobs to move data from the source system to the target system. But for big organizations, this approach means creating hundreds of Talend jobs just to ingest and process different files with different column combinations.

Let us come back to our card magic game. Imagine you had a magic wand to move all the different card combinations from source to target, instead of moving one combination at a time.

Magic shuffling Talend dynamic schema

For the Harry Potter fans, just take your magic wand and cast the spell “Riddikulus”, where the scary task to rearrange all the card combinations magically get converted to a silly and easy task! Well, you can also use the spell Accio, in case you want to summon the cards automatically to target area 😉

In the Talend world of magic, data ingestion tasks can be made easy by using the spell “Dynamic”! It magically ingests data from files or databases even if there is huge variation in the underlying column schema of each file.

Magic shuffling Talend dynamic schema

The Talend secret book of magic (help.talend.com) has a specific section for Dynamic schema, and you can refer to some sample magic spells at this link. The book of Talend magic gives samples to load data from multiple files from source to target (either a database or a file system).

Decoding the magic of shuffling columns in Dynamic schema

The experienced Talend magicians already know the above trick, and they must have cast this spell many times when building an ingestion framework using Talend jobs. So, let us step up the magic to the next level.

Some Talend magicians have asked me how we can shuffle columns while using a dynamic ingestion framework for delimited files in source and target. Coming back to the card magic, you can see the difference in the diagram below, where the positions of the cards have been changed for each row.

Magic shuffling Talend dynamic schema

Well, it can be achieved in multiple ways (you know by now that you can use both the Riddikulus and Accio spells to create the magical effect). I am going to show one simple trick for when your source and target systems are delimited files and you just have to reorder the column patterns. (Note: in the case of databases, the columns will magically be identified by name based on the Dynamic metadata and populated to the correct columns in the target tables.) If you go to the Diagon Alley of Talend (community.talend.com), you can see examples and details of various other spells (data parsing methods) created by some of the Ace magicians of the Talend community.

Ingredients for magic potion (Creation of metadata table)

Since you are going to create a generic ingestion framework to shuffle the columns while using Dynamic schema, the first step is to create a configuration table with the columns below. This metadata table holds the input and output file load paths, the input and output file names, and the target column order after shuffling.

Magic shuffling Talend dynamic schema
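As an illustration only, one row of such a configuration table could be modeled like this; the field names mirror the columns described above, while the sample values (paths, file names, column list) are hypothetical.

// Illustrative sketch of one configuration-table row; the actual table layout
// and values in your database may differ.
public record IngestionConfig(
        String inputFilePath,   // directory holding the source delimited file
        String inputFileName,
        String outputFilePath,  // directory for the reshuffled target file
        String outputFileName,
        String colList          // target column order, e.g. "CUST_ID;CUST_NAME;COUNTRY"
) {
    public static IngestionConfig sample() {
        return new IngestionConfig("/data/in", "customers.csv",
                                   "/data/out", "customers_shuffled.csv",
                                   "CUST_ID;CUST_NAME;COUNTRY");
    }
}

The colList value is exactly what the tJavaRow code shown later splits on ";" to rebuild each row in the shuffled order.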

The actual spell to load data dynamically after shuffling!

The Talend job to load the data dynamically to target delimited files after column shuffling is as shown below.

Magic shuffling Talend dynamic schema

The Metadata Extraction subjob extracts the data from the configuration table and converts it into multiple iterations using the tFlowToIterate component.

Magic shuffling Talend dynamic schema

A dummy tJava component will be used to orchestrate multiple “On Component Ok” control flows.

The first On Component Ok leads to the header-processing subjob for the target file. The target schema details for each input row are copied from the tFlowToIterate col_list column and assigned as the input value for the data column of tRowGenerator.

Magic shuffling Talend dynamic schema

The header data will be populated to the target file with the configuration shown below. We also need to make sure that the Advanced settings option “Create directory if does not exist” is selected.

Magic shuffling Talend dynamic schema

The next stage is to pump the actual data from the source delimited file to the target delimited file. The input file path directory and input file name variables are extracted from tFlowToIterate and added to the file name property of the tFileInputDelimited component. The schema is marked as dynamic to maintain schema neutrality for different types of files.

Magic shuffling Talend dynamic schema

Column shuffling is handled in a tJavaRow component, where the input data in the Dynamic schema is parsed by matching the column names from the input Dynamic column metadata against the list of target columns captured earlier in the tFlowToIterate component. The output data is transmitted as a String data type to the next component.

Magic shuffling Talend dynamic schema

The Java code snippet used within tJavaRow is shown below.

 

// Target column order, captured earlier from the metadata table by tFlowToIterate
String[] strArray = ((String) globalMap.get("row3.col_list")).split(";");
// Output buffer holding the values in the new order (split again simply to get an array of the right size)
String[] tarArray = ((String) globalMap.get("row3.col_list")).split(";");

// Dynamic schema and values coming from the source delimited file
Dynamic columns = row1.data;
String out = "";

// For each incoming column, place its value at the matching position in the target order
for (int i = 0; i < columns.getColumnCount(); i++) {
    DynamicMetadata columnMetadata = columns.getColumnMetadata(i);
    String inp_col = columnMetadata.getName();
    String col_val = (String) row1.data.getColumnValue(inp_col);

    for (int j = 0; j < strArray.length; j++) {
        if (inp_col.equals(strArray[j])) {
            tarArray[j] = col_val;
        }
    }
}

// Rebuild the row as a semicolon-delimited string in the shuffled order
for (int t = 0; t < tarArray.length; t++) {
    if (t == 0) {
        out = tarArray[0];
    } else {
        out = out + ";" + tarArray[t];
    }
}

row2.data = out;
 

 

 

The output data in the new shuffled format will be loaded to the target delimited file in Append mode.

Magic shuffling Talend dynamic schema

The output data in the target files will be according to the reshuffled column list.

Magic shuffling Talend dynamic schema

You can customize the above job further to incorporate additional data validations and data quality checks, but that is beyond the scope of the current magical spell.

Conclusion

I would highly recommend reading the interesting blogs below about Dynamic ingestion with Talend before closing your current tab!

How To Operationalize Meta-Data in Talend with Dynamic Schemas

How to Migrate Your Data From On-premise to the Cloud: Amazon S3

 

Till we meet again to discuss some other Talend magic, it’s your time to play around with the new spell. Alohomora!!! 😊

 

The post Experience the magic of shuffling columns in Talend Dynamic Schema appeared first on Talend Real-Time Open Source Data Integration Software.


How to use your data skills to keep a step ahead


While the impetus for transforming to a data-driven culture needs to come from the top of the organisation, all levels of the business should participate in learning new data skills. 

Assuring data availability and integrity must be a team sport in modern data-centric businesses, rather than being the responsibility of one individual or department. Everyone must buy in and be held accountable throughout the process.   

Effectively enabling greater data access among the workforce while maintaining oversight and quality is the challenge that today’s businesses face, and one they must meet. 

The evolution of the Data Team

The value and opportunities that data creates are now being recognised by enterprises. There is an understanding that the data needs to be handled and processed efficiently. For some companies, this has led to the formation of a new department comprised of data analysts and scientists.

The data team is led by a Chief Data Officer (CDO) – a role that is set to become intrinsic to business success in the digital era, according to recent research from Gartner. While earlier iterations of roles within the data team centered on data governance, data quality and regulatory issues, the focus is shifting. Data analysts and scientists are now expected to contribute and deliver a data-driven culture across the company, while also driving business value. According to the Gartner survey, the skills required for roles within the data team have expanded to span data management, analytics, data science, ethics, and digital transformation.

Investment in such data teams is growing as businesses recognise the importance of their functions. Office budgets for the data team increased by an impressive 23% between 2016 and 2017, according to Gartner. What’s more, some 15% of the CDOs that took part in the study revealed that their budgets were more than $20 million for their departments, compared with just 7% who said the same in 2016. 

The increasing popularity and evolution of these new data roles have largely been driven by GDPR in Europe and by new data protection regulations in the US. Evidence suggests that the position will be essential for ensuring the successful transfer of data skills throughout businesses of all sizes. 

The data skills shortage

Businesses can only unlock the full potential of their data if they have the talent to analyse it and produce actionable insights that help them to better understand their customers’ needs. But companies are already struggling to cope with the big data ecosystem due to a skills shortage, and the problem shows little sign of improving. 

The rapidly evolving digital landscape is partly to blame as the skills required have changed radically in recent years. The required data science skills needed at today’s data-driven companies are more wide-ranging than ever before. The modern workforce is now required to have a firm grasp of computer science including everything from databases to the cloud, according to strategic advisor Bernard Marr. 

In addition, analytical skills are essential to make sense of the ever-increasing data gathered by enterprises, while mathematical skills are also vital, as much of the data captured will be numerical, largely due to IoT and sensor data. These skills must also sit alongside more traditional business and communication skills, as well as the ability to be creative and adapt to developing technologies. 

The need for these skills is set to increase, with IBM predicting that the number of jobs for data professionals will rise by a massive 28% by 2020. The good news is that businesses are already recognising the importance of digital skills in the workforce, with the role of Data Scientist taking the number one spot in Glassdoor’s Best Jobs in America for the past three years, with a staggering 4,524 positions available in 2018.

Data training employees

Data quality management is essential for all areas of a business. It, therefore, makes sense to provide the employees in the specialist departments with tools to ensure data quality in self-service. Cloud-based tools that can be rolled out quickly and easily in the departments are vital, as companies can gradually improve their data quality whilst also increasing the value of their data. 

To remain competitive, businesses must think of good data management as a team sport. Investing in the Chief Data Officer role and data skills now will enable forward-thinking businesses to reap the rewards, both in the short-term and further into the future.

This article originally appeared on: IT Brief 

The post How to use your data skills to keep a step ahead appeared first on Talend Real-Time Open Source Data Integration Software.

Speed and Trust with Azure Synapse Analytics


As a Microsoft partner, we’re excited by the announcement of the Azure Synapse Analytics platform.

Why? Because it furthers the ability of businesses to leverage data-driven insights and decision making at all levels in an organization. (And we love that!)

Together, our joint customers are already leveraging data in amazing ways to tackle everything from creating customer 360 views to reducing project times for data analytics from 6 months to 6 weeks. The Azure Synapse platform will help take this even further with increased insights on their data.

So, what is Azure Synapse?

Azure Synapse is the evolution of Azure SQL Data Warehouse, bringing traditional data warehousing and big data analytics together into a single offering – with an integrated security, management, and monitoring platform. Azure Synapse Analytics offers a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning applications. As a result, Azure Synapse delivers insights from all your data, across data warehouses and big data analytics systems, with blazing speed.

Azure Synapse uses deeply integrated Power BI and Azure Machine Learning to greatly expand the discovery of insights from data and apply machine learning models to deliver intelligent apps. This deep integration makes it quicker to create new experiences with data.

From an experience perspective, Azure Synapse studio unifies the experience for everyone including data engineers, data scientists, and database administrators. This allows business analysts to securely access datasets and use Power BI to build dashboards quickly with the same analytics service.

How does Talend make Azure Synapse even better?

We’re excited about the possibilities of the Azure Synapse service. Talend Cloud runs natively on the Azure environment, which means users of Azure Synapse will be able to easily leverage any data they need – whether in the cloud or on-premises – and ensure the data is correct and governed. Talend helps deliver trusted analytics at speed to different users across an organization using the new Azure Synapse platform.

With Talend, Azure Synapse users can quickly ingest data with over 900+ connectors and components, apply pervasive data quality at every step of the journey, and ensure that each analytics user gets the trusted and correct data they need whenever they need it. We allow data analysts and business users to provide integrated feedback and insights on data. Quite simply, Talend makes Azure Synapse even better.

Microsoft Azure Synapse Analytics and Talend combine the best of both worlds. Azure Synapse allows data analysis in a new powerful environment. Talend delivers speed and trust to these data analytics. We can’t wait to see what our customers do with these two great technologies.

Find out more about Azure Synapse and Talend Cloud on Microsoft Azure here.

 

The post Speed and Trust with Azure Synapse Analytics appeared first on Talend Real-Time Open Source Data Integration Software.

Quick wins for modern analytics projects with Amazon Redshift and Stitch Data Loader


It’s no secret that the cloud data warehouse space is exploding. Driven by the need for on-demand, performant data warehousing solutions, businesses are turning to public cloud providers to modernize their analytics infrastructure and help them make better business decisions.

Among the leading data warehouse options from the public cloud providers is Amazon Redshift. Redshift offers a petabyte-scale, fully managed data warehouse service in the cloud. With cloud data warehouses like Redshift, businesses can make data-driven decisions faster and more frequently. Implementing a cloud data warehouse can be game-changing for businesses of all sizes, but a data warehouse is only as useful as the data it contains.

At Talend, we’ve helped over 600 customers like Heap, a behavioral analytics platform, leverage the power of Amazon Redshift with Stitch Data Loader, a cloud-native data ingestion app that consolidates data from over 100 sources into Amazon Redshift. While helping customers bring together data from Salesforce, Postgres databases and more, we’ve learned a few things along the way. Here are three tips for securing quick wins for your analytics projects with Amazon Redshift and Stitch Data Loader:

Get moving, and quickly

For data-driven organizations, speed and agility of decision making is critical. Stitch Data Loader offers customers access to their most important data sources in just a few clicks with a lightweight and easy-to-use UI. No complicated configurations or hand-coded ETL pipelines, just near-instant access to over 100 data sources, all available for analysis in Redshift. With unified data in one place, businesses can make the real-time decisions required to stay competitive.

 

Empower your data experts

When business groups start new analytics projects, access to relevant data is paramount. Time to value for analytics is essential, but relying on IT bandwidth to access new data sources can grind projects to a halt. As cloud data warehousing becomes cheaper and more scalable, analytics domain experts can consolidate their own data, eliminating IT bottlenecks and reducing time to insight.

Take marketing analysts, for example. They have the technical skill and the domain knowledge to understand the data they need and how to use it, all that’s missing is the data itself. With Stitch Data Loader, marketing analysts can consolidate sources like Marketo, Google Analytics, and more into Amazon Redshift for analysis. No more manual processes or piecing reports together, just relevant and recent data in one place.

For full control over your warehousing costs, Stitch Data Loader offers advanced configuration settings to let you pick what data is loaded and when. Combined with on-demand pricing from Amazon Redshift, you can put data back in the hands of your domain experts without risking your budget.

 

Spend time on high value activities

According to Gartner’s 2018 marketing analytics survey, marketing leaders reported that data wrangling is the second-most time-consuming activity done by their teams. Creating manual processes to wrangle data is static and error-prone, and it distracts from high-value analysis. Each hour spent aggregating and cleaning data is an hour less spent uncovering trends, understanding the customer, and doing other activities that propel the business forward.

By integrating Stitch Data Loader and Amazon Redshift in your analytics stack, you can leverage the power of a modern cloud data warehouse and spend less time thinking about data wrangling and infrastructure.

 

Get going, already!

To get started moving data to Amazon Redshift, try out Stitch Data Loader, free for 14 days with unlimited data ingestion. Stitch Data Loader is also available for purchase on the AWS Marketplace.

 

Talend will be at AWS RE:Invent 2019 from December 2-6 in Las Vegas, Nevada. Come say hello and learn about how we’ve helped hundreds of customers supercharge their cloud data warehousing strategy.

The post Quick wins for modern analytics projects with Amazon Redshift and Stitch Data Loader appeared first on Talend Real-Time Open Source Data Integration Software.

5 best practices to innovate at speed in the Cloud: Tip #3 Enable access to and use of self-service applications


Starting in September, the Talend Blog Team began sharing tips and tricks to securely kick off your data project in the cloud at speed. This week, we’ll cover the third capability: enable access to and use of self-service applications.

 

Data professionals face an efficiency gap: they spend too much time getting access to the data they need and then putting it into the appropriate business context. The capacity to deliver trusted data to business experts at the point of need is critical if you want to liberate data value within your company.

Why it’s important

Reduced time and effort mean reduced costs and more value to be extracted from data. A recent IDC study found that only 19% of data professionals’ time is spent analyzing information and delivering valuable business outcomes, whereas 81% of their time is spent preparing and protecting data. That includes duplicated efforts and wasted time, which IDC estimates at 12 hours per week. The challenge is to overcome these obstacles by bringing clarity, transparency, and accessibility to your data assets.

 

 When it’s important

Data professionals are scarce resources. Your data engineers, data scientists and data analysts won’t stay long if they don’t feel empowered with modern tools that fit their roles and make them more efficient.

 

Our recent data trust readiness report reveals that 29% of operational data workers still believe their companies do not excel in liberating data access and giving access to trusted data through self-service.

Download Data Trust Readiness Report now.

How Talend tools can help

To speed up data adoption and the use of self-service applications, two components basically make the difference: simplicity and accessibility.

Simplicity is brought by modern and intuitive user interfaces that put any data citizen at ease with simple yet powerful data operations. Simplicity also means intelligent software that minimizes data operations, suggests assistance, and offers help and tutorials to learn and grow at speed.

Accessibility can be brought by a cloud platform that allows potentially anyone to get better access to data, while removing the hurdles of migration and update phases.

Both components are key to delivering self-service applications to data citizens, who often have no time to lose in complex operations and difficult-to-operate technical interfaces.

Over the past years, Talend integrated its customer requirements to extend and simplify data usage with modern cloud-based applications. Talend developed simple and accessible role-based applications for data preparation and data stewardship with the aim to deliver data at the point of need for any role within the organization.

 

Self Service Data Curation with Data Stewardship

Talend Data Stewardship makes it easy for anyone to clean, certify and reconcile data. Simply put, it’s the best way to bring business expertise back into your data pipelines. Think about customer records that need to be enriched with segmentation. Only the business analyst (not the data engineer) can help determine which segment best suits each customer. Using a team-based workflow approach, curation tasks can be assigned to data experts across the organization and tracked for progress or audit.

With the Data Stewardship app, you can instantly take remote control over any data, whether it’s stored on-premises, in the cloud, or in a hybrid infrastructure. Get instant updates and minimize downtime as part of Talend Cloud. This video explains how.

 

Self Service Data Cleansing with Data Preparation

Talend Data Preparation is a self-service application that enables information workers to cut hours out of their workday by simplifying and expediting the laborious and time-consuming process of preparing data for analysis or other data-driven tasks. It fosters collaboration between businesspeople who know the data best and central organizations, like IT or Risk Management, that define the rules and policies for data accessibility and governance. Talend Data Preparation has great new AI-enabled features that help data workers deliver better data at speed. One of them is called Magic Fill.

This new function is part of our latest Summer Release (some other great features here). It allows you to define a pattern based on a handful of examples and, via a machine learning algorithm, apply the transformation to a whole column. Magic Fill gives you many formatting possibilities, on any data type. It is the perfect example of a speed component that illustrates how Talend can help deliver data access through AI assistance and self-service. Only with powerful AI features such as Magic Fill can you dramatically accelerate the way data is used by business operations.


Figure 1. Magic Fill: it’s still magic even if you know how it’s done.

How to get started:

The real-time challenge is not trivial. It also relies on the organization’s willingness to invest in modern cloud-based systems that can capture data out of traditional back-office, batch-oriented systems and deliver it in real time across the front office. Download our Definitive Guide to Data Integration to learn what kind of systems are needed to integrate new kinds of data sources, from edge to core.

 

Try Talend Cloud. Talend also developed built-in tutorials so you can learn at speed. If you want to discover Data Preparation, feel free to practice using our free Data Preparation training.

Watch the free online training “Get Clean Data in Minutes!” now.

 Want to explore more capabilities?

 

This is the third out of five speed capabilities. Cannot wait to discover our next capability?

Go and download our Data Trust Readiness Report to discover other findings and the other 9 trust and speed capabilities.

 

The post 5 best practices to innovate at speed in the Cloud: Tip #3 Enable access to and use of self-service applications appeared first on Talend Real-Time Open Source Data Integration Software.

How to do Snowflake query pushdown in Talend


In a typical, traditional data warehouse solution, the data is read into ETL memory and processed or transformed in memory before being loaded into the target database. With growing data volumes, the cost of compute also increases, and it becomes vital to look for an alternate design. Welcome to pushdown query processing. The basic idea of pushdown is that certain parts of SQL queries, or the transformation logic, can be "pushed" to where the data resides, in the form of generated SQL statements. So instead of bringing the data to the processing logic, we take the logic to where the data resides. This is very important for performance reasons.

Snowflake supports Query pushdown with v2.1 and later. This pushdown can help you transition from a traditional ETL process to a more flexible and powerful ELT model. In this blog I will be showcasing how Talend leverages Snowflake query pushdown via ELT.

ETL VS ELT

Before we get into advanced details, let's refresh the basics. With traditional ETL (Extract, Transform, Load), the data is first extracted, then transformed, and then loaded into a target like Snowflake. Here, most data transformations, like filtering, sorting and aggregation, take place in the ETL tool's memory before the data is loaded into the target.

With ELT (Extract, Load, Transform), the data is first extracted, then loaded, and then the data transformations are performed. With the ELT model, all the data is loaded into Snowflake and the data transformations are then performed directly in Snowflake. Snowflake offers powerful SQL capabilities via query pushdown, thereby enabling a more effective ELT model. During development using ELT, it is possible to view the code as it will be executed by Snowflake.

TALEND ELT JOB DESIGN  

In Talend, there are native components to configure pushdown optimization. These components convert the transformation logic into an SQL query and send the query to the Snowflake database, which then runs it. In Talend, query pushdown can be leveraged using the ELT components tELTInput, tELTMap and tELTOutput. These components are available under ELT -> Map -> DB JDBC.


Let’s take a quick look at these components

  • tELTInput: This component adds input tables for the SQL statement. There is no restriction on the number of input tables. One can add as many Input tables as required by the SQL statement to be executed.
  • tELTMap: this is the mapping component where the transformations are defined. It uses the table(s) provided as input to feed the parameters in the built SQL statement, and converts the transformations into SQL statements.
  • tELTOutput: Carries out the action on the table specified along with the action on the data as specified according to the output schema defined.

Now, let's build a job that uses these components and utilizes Snowflake query pushdown. I will explain it with an example. Let's assume that I have two tables in Snowflake named SALES and CITY. The SALES table contains details of items sold, units sold, sales channel (online or offline), cost per unit, total revenue and total profit by region and country. The CITY table is a dimension table which has the country code and the population of the country. The metric I need to calculate is the total profit for online sales for each item at the region and country level. The result must be inserted into the ONLINE_AGG table.

Now, to implement this logic in ELT format, my job looks as given below:

snowflake query pushdown in talend

Let’s look at this job in more detail. As a best practice, I have used tPrejob to open the Snowflake connection and tPostjob to close the connection. I have also used tDie to handle exceptions at various components. The next few sections explain in detail the parts marked in the image above (A, B, C and D).

  • A (tELTInput): this component uses the open Snowflake connection and reads data from the SALES table. The configuration is given below.

snowflake query pushdown in talend

  • B (tELTInput): this component uses the same connection and reads data from the COUNTRY table. The detailed configuration is given below.

snowflake query pushdown in talend

  • C (tELTMap): this is an important component, as it transforms the mapping into SQL. Click on the ELT Map Editor to do the join and transformation.

snowflake query pushdown in talend

 

After adding the input tables, I perform an INNER JOIN on the SALES and CITY tables, as shown in the image below. This editor can also be used to provide additional where, group by and order by clauses. In this example, I have performed the following transformations:

  • A where condition: PUBLIC.SALES.SALES_CHANNEL = 'Online'
  • An aggregation: sum(PUBLIC.SALES.TOTAL_PROFIT)
  • Since I am aggregating on Total_Profit, a group by on the PUBLIC.SALES.REGION, SALES.COUNTRY, PUBLIC.SALES.ITEM_TYPE and PUBLIC.SALES.SALES_CHANNEL columns
  • Null handling for the ITEM_TYPE column

 

These transformations are highlighted in the image below.

snowflake query pushdown in talend

Now, the beauty of this component is that as you write the transformations, the SQL gets generated. Click on ‘Generated SQL Select query for table2 output’ to see the generated SQL.

snowflake query pushdown in talend

To validate the results, I copied this SQL and ran it in a Snowflake worksheet.

snowflake query pushdown in talend

  • D (tELTOutput): this component is used to push the data to the ONLINE_AGG table. The detailed configuration is given below.

snowflake query pushdown in talend

Now that the job design is complete, let’s run the job. At runtime, you will see that the records are not brought into Talend memory.

snowflake query pushdown in talend

Instead, the query is executed in Snowflake.

snowflake query pushdown in talend

To confirm the execution, let’s query the history in Snowflake.

snowflake query pushdown in talend

Expanded view of the query executed

snowflake query pushdown in talend
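Before concluding: if you want to validate the same aggregation outside Talend, a small JDBC program against Snowflake does the trick. This is only a sketch: the account URL and credentials are placeholders, the join key between SALES and CITY is assumed, and the SQL is an approximation of what the ELT components generate rather than the exact text.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

// Sketch only: runs an approximation of the pushed-down query directly in Snowflake.
public class PushdownCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "MY_USER");          // placeholder credentials
        props.put("password", "MY_PASSWORD");
        props.put("db", "MY_DB");
        props.put("schema", "PUBLIC");

        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com/"; // placeholder account

        String sql =
            "SELECT PUBLIC.SALES.REGION, PUBLIC.SALES.COUNTRY, PUBLIC.SALES.ITEM_TYPE, " +
            "PUBLIC.SALES.SALES_CHANNEL, SUM(PUBLIC.SALES.TOTAL_PROFIT) AS TOTAL_PROFIT " +
            "FROM PUBLIC.SALES " +
            "INNER JOIN PUBLIC.CITY ON PUBLIC.SALES.COUNTRY = PUBLIC.CITY.COUNTRY " + // assumed join key
            "WHERE PUBLIC.SALES.SALES_CHANNEL = 'Online' " +
            "GROUP BY PUBLIC.SALES.REGION, PUBLIC.SALES.COUNTRY, " +
            "PUBLIC.SALES.ITEM_TYPE, PUBLIC.SALES.SALES_CHANNEL";

        try (Connection con = DriverManager.getConnection(url, props);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) { // the heavy lifting happens inside Snowflake
            while (rs.next()) {
                System.out.println(rs.getString("REGION") + " -> " + rs.getDouble("TOTAL_PROFIT"));
            }
        }
    }
}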

 

CONCLUSION

In this blog we saw how we can leverage the power of query pushdown with Talend while working with Snowflake. This job design method enables high utilization of Snowflake clusters for processing data. Well, that’s all for now; keep watching this space for more blogs, and until then, happy reading!

 

The post How to do Snowflake query pushdown in Talend appeared first on Talend Real-Time Open Source Data Integration Software.
