
How AB InBev is Using Data to Brew up the Best Customer Experience


AB InBev, headquartered in Belgium, is one of the largest fast-moving consumer goods (FMCG) companies in the world with a diverse portfolio of well over 500 beer brands, including Budweiser, Corona, Stella Artois, Beck’s, Hoegaarden and Leffe.

When companies grow via external acquisitions, integrating the systems and data from acquired companies is always a challenge. For AB InBev, that challenge included a hybrid environment with both on-premises and cloud systems and a host of brewers operating as independent entities with their own internal systems. Also, like other alcoholic beverage producers, AB InBev must abide by strict regulations regarding gathering consumer information.

Integrating systems and data from acquired companies

AB InBev wanted to embark on a cloud journey, and Talend was built for that world. Talend extracts data from over 100 source systems (real-time and batch, cloud and on-premises, ERP systems, data from IoT devices) and stores it in a data lake on Microsoft Azure. All data management work has to be done for multiple companies under the AB InBev umbrella, and among the biggest benefits of the new IT architecture are simplification and reusability of processes to rapidly extract and provide access to data.

Selling the best beers and making people happy

Because AB InBev is leveraging reusable code, what used to take six months now takes six weeks. That translates into faster decisions.

Now, with Talend Data Preparation, internal users spend only about 30 percent of their time gathering data and can spend 70 percent analyzing it. Data helps the company understand drinker tastes, analyze new consumer demands (for low-calorie beers, for example), and determine beer preferences according to seasonality. Data also helps improve store and bar experiences, supply chain optimization, product development and more. Learn more



Are APIs becoming the keys to customer experience?


In recent years, APIs have encouraged the emergence of new services by facilitating collaboration between the applications and databases of one or more companies. Beyond catalyzing innovation, APIs have also revolutionized the customer-company relationship, allowing companies to build an accurate and detailed picture of the consumer at a time when a quality customer experience counts as much as the price or capabilities of the product.

APIs: A Bridge Between the Digital and Physical World

Over the years, customer relationship channels have multiplied: consumers can now interact with their brands through stores, voice, email, mobile applications, the web or chatbots. These multiple points of interaction have made the customer journey more complex, forcing companies to consider data from many channels to deliver the most seamless customer experience possible. To do this, they must synchronize data from one channel to another and cross-reference it with data on the customer’s history with the brand. This is where APIs come into play. These interfaces allow data processing to refine customer knowledge and deliver a personalized experience.

Thanks to a 360° customer view, the digital experience can be extended in store. The API acts as a bridge between the digital and physical world.

APIs also allow organizations to work with data in a more operational way, and especially in real time. However, many companies still treat their loyal customers as if they had never interacted before. It is therefore not uncommon for customers to have to repeat information across several requests or to retrace the history of previous interactions themselves, which can seriously damage the customer relationship.

The challenge for companies is to deliver a seamless, consistent and personalized experience through real-time analysis. This provides relevant information to account managers during interactions and gives them guidance on the next best action to take, in line with the client’s expectations.

Even better, with APIs, companies can predict the customer’s buying behavior and suggest services or products that meet their needs. With the data collected, and thanks to artificial intelligence, cross-tabulations and instant analysis make it possible to refine the selection and offer an increasingly relevant and fluid experience, increasing customer loyalty and, with it, the economic performance of companies.

The Importance of APIs with GDPR

Recently, there has been a trend to empower consumers to control their data, after new regulations such as the European Payment Services Directive (PSD2) and GDPR came into force in May 2018.

What do they have in common? They both give individuals control over their personal data with the ability to request, delete or share it with other organizations. Thus, within the framework of PSD2, it is now possible to manage your bank account or issue payments through an application that is not necessarily that of your bank. Through this, APIs provide companies the opportunity to offer a dedicated portal to their customers to enable them to manage their data autonomously and offer new, innovative payment services.

For their part, companies will be able to better manage governance and the risks of fraudulent access to data. With an API, a company can proactively detect abnormal or even suspicious data access behaviors in near real time.

APIs are the gateways between companies and their business data, and they answer real needs that the market is only beginning to address around customer experience. However, many organizations have not yet understood the importance of implementing an API strategy, an essential part of digital transformation alongside the cloud and the emergence of increasingly data-driven organizations. APIs are the missing link between data and customer experience, a key companies need to start using.

Ready to Learn More? 

<< Watch the webinar on-demand “APIs for Dummies” >>


Mastering Data and Self-Development at Lenovo


In 2012 when I worked at Lenovo, the company set out on a journey to create the Lenovo Unified Customer Intelligence (LUCI) platform. The decisions we made with regard to people and technology involved in that project helped to shape my self-development, relationships with others on the team, and relationship with executives.

Data management leaders today are still facing a problem that has been around for years:

How do we create systems and processes to move, transform, and deliver trusted data insights at the speed of business?

To provide an understanding of how we solved this problem at Lenovo, it would be helpful if I shared a bit of background about myself. I come from a non-traditional background in that data and analytics is not where I started.

My first position at Lenovo was as a digital analytics implementation manager, responsible for ensuring that all of the data collected at Lenovo.com integrated with digital solutions. I used quality assurance programs to establish trust. My web analytics team quickly realized that in order to create the value we wanted, we would need to integrate with many online and off-line data sources.

Building Your Data Team & Self-Development

This realization and the understanding that the team needed to build a new kind of platform was the beginning of a multi-year self-development journey. As I began to evaluate our needs against our internal platforms, I realized that none of them were capable of supporting our key requirements.

“This realization and the understanding that the team needed to build a new kind of platform was the beginning of a multi-year self-development journey.”

We needed an analytical platform that supported batch and streaming on 10-plus terabytes per year. We chose Tableau, R, and Python for the analytics layer and leveraged Amazon Web Services cloud databases for the storage layer. But we still needed to make a decision on the data integration layer.

The 80/20 rule of data management came to mind: I refused to accept that 80% of the time would be spent on data wrangling and 20% spent on analysis. Our program had more than 60 endpoints, and change management needed to occur within one business day. We wanted 30% of our resources focused on data wrangling and 70% focused on business intelligence (BI) and analytics.

To achieve this, we selected Talend for our integration technology, established one-week agile sprints, and leveraged our people to be integrators, implementers, and administrators.

Building the IT and Business Relationship

Organizational support for data integration was decentralized and often leveraged different vendors. It was considered an IT function and put in the background several layers removed from the business. I wanted to grow my team, and the only way I could do this was by creating value with business stakeholders.

At this point, I created data architect roles that would become masters of their domains and cross-trained in others. These roles would be business facing so that architects would be working directly with the stakeholders. They were responsible for architecting, developing, and maintaining their own data solutions.

A single data solution such as the voice of the customer pipeline could have more than 10 data sources, structured and unstructured data, varying volumes and velocities, translation and natural language processing loops, and multiple analytics and visualization outputs.

Empowering data architects over such a large scope enabled them and the business to move at the pace that was needed for success. Working hand-in-hand, analysts began to understand the data wrangling processes, improving both the performance and quality of these processes.

Most important, it helped them understand the value of an efficient data integration team.

Relationships with Executives

Business executives didn’t understand, nor were they interested in understanding, how a good data management practice can help drive the business forward.

The first two to three years of my role was focused on delivering insights more efficiently. We tackled challenges such as having a dashboard that required six people over the course of a month to copy and paste in Excel to get an executive a view once a month. We got that down to half a person, automated, daily, and with quality checks.

These wins gave us the credibility and momentum to then connect data sets in different ways and to experiment with new analytics models. The larger business impacts and analytical wins came after we had a strong data integration and management practice. Today, many of those business executives understand what ETL is and why it’s important for their business.

Some of my key learnings throughout this experience have been to drive a sense of ownership and business accessibility with the data architect function. The most important was to help my team to understand the “why”.

Oftentimes the “why” of a business case is return on investment (ROI). I would vigorously enforce that the architects and engineers had to articulate how their actions were impacting a business objective, regardless of how far removed they were from the problem.

This focus on ROI, understanding the why, empowering technical resources to interface with the business, and giving them more end-to-end ownership of these data processes are, in my opinion, the keys to building a successful data integration practice.


Businesses must integrate Artificial Intelligence (AI) now or fall further behind


This article was originally published on Venture Beat

Artificial intelligence became one of the hottest tech topics in 2017 and is still attracting attention and investments. Although scientists have been working on the technology and heralding its numerous anticipated benefits for more than four decades, it’s only in the past few years that society’s artificial intelligence dreams have come to fruition.

The impact AI applications stand to have on both consumer and business operations is profound. For example, a New York-based Harley Davidson dealer incorporated the Albert Algorithm AI-driven marketing platform into his marketing mix and saw a 2,930 percent increase in sales leads that helped triple his business over the previous year.

Unfortunately, success stories like this aren’t as common as the more prevalent failed AI pilot projects. However, with growing volumes of raw data about people, places, and things, plus increasing compute power and real-time processing speeds, immediate AI applicability and business benefits are becoming a reality.

According to a survey by Cowen and Company, 81 percent of IT leaders are currently investing in or planning to invest in AI, as CIOs have mandated that their companies need to integrate AI into their entire technology stacks. Another 43 percent are evaluating and doing an AI proof of concept, and 38 percent already have operational AI applications and are planning to invest more.

Additionally, McKinsey research estimates tech giants spent $20 to $30 billion on AI in 2016, with 90 percent of it going to R&D and deployment and 10 percent to AI acquisitions. Industry analyst firm IDC predicts artificial intelligence will grow to be a $47 billion market by 2020, with a CAGR of 55 percent. Of that market, IDC forecasts companies will spend some $18 billion on software applications, $5 billion on software platforms, and $24 billion on services and hardware.

With this level of investment, if your business doesn’t already have a strategy to incorporate AI or machine learning (ML) into your development efforts by 2019, then you risk irrelevancy.

The AI race is heating up

Google, Amazon, and Facebook lead the AI race, with Microsoft Corp. investing a lot of time and resources to catch up. These companies already have thousands of researchers on staff and billions of dollars set aside to invest in capturing the next generation of leading data scientists — giving them a huge head start over the rest of the market. For example:

  • Of Google’s 25,000 engineers, currently only a few thousand are proficient in machine learning — roughly 10 percent — but Jeff Dean, Google Senior Fellow, would like that number to be closer to 100 percent.
  • In its first year of operation, the AI and Research group at Microsoft grew by 60 percent through hiring and acquisitions.
  • Over 750 Facebook engineers and 40 different product teams are using a piece of software called FBLearner Flow, which helps them leverage AI and ML. The company has trained more than a million models on the new software.

These tech giants are just a few of the serious artificial intelligence contenders in the market today. There is only so much talent to go around, making it hard and very expensive for smaller companies to attract and retain the skilled workers required to make their AI dreams a reality. This doesn’t just impact recruiting efforts, but also the time required to conduct new employee onboarding, training, and supervised learning to effectively scale AI programs.

<< Free On-Demand Webinar “The Fundamentals of Machine Learning” >>

Most companies lack the connected, analytical infrastructure and general knowledge needed to apply AI and ML to their fullest extent. Engineers must be able to securely access data without having to deal with multiple layers of authentication, which is often the case if a company has several siloed data warehouses or enterprise resource planning systems. Before IT leaders attempt to deploy an enterprise-wide AI strategy, they must have the ability to bring large data sets together from several disparate and varied data sources into a centralized, scalable, and governed data repository.

An Artificial Intelligence services marketplace is developing

While it’s clear that the use of AI is becoming more prominent, not all companies have the IT budgets needed to recruit the talent required to build AI-fueled applications in-house. Thus, what we can expect to see more immediately is the emergence of an AI services marketplace.

We’re already seeing examples of this emerge. Many companies are beginning to offer AI self-service tools that have become both easier to use for the non-data scientist and less expensive to acquire. Much like mobile app stores, these new AI marketplaces will resell specialized AI services and algorithms that companies can instantly buy and implement within their businesses. This model makes it easier for companies with a more modest budget to keep some skin in the game and remain competitive in the race for AI.


How Bayer Pharmaceuticals Found the Right Prescription for Clinical Data Access


Like other pharmaceutical companies, Bayer Pharmaceuticals conducts research to discover new drugs, test and validate their effectiveness and safety, and introduce them to the market. That process requires accumulating, analyzing and storing vast amounts of clinical data coming from patients and healthy volunteers, which is recorded on an electronic case report form (eCRF). Data is also collected from laboratories and electronic devices and all data is automatically anonymized at the point of collection.

Bayer wanted to gather more data about its drugs to comply with documentation requirements such as those in GxP, a collection of quality guidelines and regulations created by the U.S. Food and Drug Administration (FDA).

Building a microservices-based architecture

To make it faster and easier for researchers to analyze drug development data, the company deployed a microservices-based architecture for its data platform.

German software developer QuinScape GmbH helped Bayer deploy the Talend-powered Karapit framework, which the company is using to integrate several clinical databases and support the pharmacokinetics dataflow and biosampling parts of the drug development process.

“Through Talend microservices, we can obtain clinical pharmacokinetic data more rapidly in order to determine drug doses and better characterize compounds and adhere to quality processes” – Dr. Ivana Adams, Project Manager, Translational Science Systems

Understanding pharmacokinetics data to optimize drug doses

The role of pharmacokinetics (PK) in drug discovery can be described simply as the study of “what a body does to a drug.” It includes the rate and extent to which drugs are absorbed into the body and distributed to the body tissues, as well as the rate and pathways by which drugs are eliminated from the body by metabolism and excretion. Understanding these processes is extremely important for prescribers because they form the basis for the optimal dose regimen and explain the inter-individual variations in the response to drug therapy.

Having clean, verified traceable clinical data saves development time, accelerates the process of determining proper dosage and drug interactions, and leaves a clear audit trail as per FDA GxP guidelines.

Read the full case study here. 


How Euronext is Utilizing Real-Time Data to Become a “Data Trader”


Following its split from the New York Stock Exchange in 2014, Euronext became the first pan-European exchange in the eurozone, fusing together the stock markets of Amsterdam, Brussels, Dublin, Lisbon, and Paris.

Euronext uses Optiq, its trading platform, which holds the active memory of 100 TB of transactions, with systems that work practically in nanoseconds. But for analytics, Euronext sometimes had to wait six to twelve hours after market close on days with important events before it could send the data to business units and clients. Euronext’s storage needs also continued to grow, especially following several acquisitions and with regulators expecting Euronext to store more and more data.

Migrating to a Governed Cloud

Euronext chose Talend Big Data to absorb real-time data in an AWS data lake, including internal data from its own trading platform and external data, such as from Reuters and Bloomberg. In an ultra-regulated world, Talend Data Catalog has also proven to be highly adept at meeting the challenges of data lake governance and regulatory compliance.

“In the stock exchange sector, we follow three watchwords: integrity, because it is impossible to lose a single order; permanent availability; and governance in a highly-regulated market. Talend has met these expectations.” – Abderrahmane Belarfaoui, Chief Data Officer

Making the Most of Stock Market Data

Beyond the improved architecture, the migration is also positioning Euronext to become a “data trader.” In fact, the sale of data already brings in 20% of Euronext’s revenues. Traders actually sell, buy, and make their investment decisions in milliseconds. They have a huge appetite for aggregated data in real time.

In addition to clients, this project also involves giving data scientists and business units self-service access to this data, which they can analyze in data sandboxes for tasks such as market monitoring.

Watch the full case study below:


Spinning up Cloud-scale Analytics is now Even More Compelling with Talend and Microsoft


Today, we’re excited to share two announcements that make adopting Microsoft Azure SQL Data Warehouse (ADW) a no-brainer.

First, ADW significantly increased its lead over the competition with new price-performance benchmarks published by GigaOm, which show substantial price-performance improvements over similar solutions. This should be especially interesting to large enterprises with legacy, on-premises data warehouse deployments that can benefit from the best-in-class performance, security, and unmatched cost advantages of Azure SQL Data Warehouse.

Second, Stitch Data Loader, our recent addition to help support our small and mid-market customers, now supports Microsoft Azure SQL Data Warehouse destinations. With Stitch Data Loader, customers can load 5 million rows/month into Azure SQL Data Warehouse for free, or scale up to an unlimited amount of rows with a subscription.

All across the industry, there is a rapid shift to the cloud. Utilizing a fast, flexible, and secure cloud data warehouse is an important first step in that journey. With Microsoft Azure SQL Data Warehouse and Stitch Data Loader, companies can get started faster than ever. The fact that ADW can be up to 14x faster and 94% less expensive than similar options in the marketplace should only help further accelerate the adoption of cloud-scale analytics by customers of all sizes.

An intro to Microsoft Azure SQL Data Warehouse

Azure SQL Data Warehouse is a fully managed cloud data warehouse for enterprises of any size that combines lightning-fast query performance with industry-leading data security. With ADW, users are billed for compute and storage resources independently. You can increase storage when you need to without being forced to increase compute capacity simultaneously, so you pay for only what you need.

Azure SQL Data Warehouse’s elastic scalability also makes it fast and cost-effective to scale compute and storage resources with latency measured in seconds or minutes. That means you no longer have to perform the preload transformations required with ETL. Instead, you can load all of your raw data into your data warehouse, then define transformations in SQL and run them in the data warehouse at query time. This new sequence has changed ETL into ELT.
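To make the ELT sequence concrete, here is a minimal sketch in Java of the “transform after loading” step, run over JDBC against the warehouse. It assumes the Microsoft JDBC driver is on the classpath, and the connection string, staging table and CTAS statement are hypothetical placeholders rather than a prescribed pattern.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EltTransformSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for an Azure SQL Data Warehouse instance.
        String url = "jdbc:sqlserver://myserver.database.windows.net:1433;"
                   + "databaseName=mydw;user=loader;password=<secret>";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {

            // The "EL" part is assumed already done: raw rows were bulk-loaded unchanged
            // into stg_orders (for example by PolyBase or a loader such as Stitch).

            // The "T" part runs inside the warehouse: a CTAS-style statement builds the
            // transformed table from the raw staging data at query time.
            stmt.executeUpdate(
                "CREATE TABLE dw_daily_revenue WITH (DISTRIBUTION = ROUND_ROBIN) AS "
              + "SELECT order_date, SUM(amount) AS revenue "
              + "FROM stg_orders GROUP BY order_date");
        }
    }
}

The point of the sketch is the ordering: nothing is reshaped before the load, and the transformation is expressed in SQL that the warehouse itself executes.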

Building pipelines to the cloud with Stitch Data Loader

The Stitch team built the Azure SQL Data Warehouse integration with the help of Microsoft engineers. The solution leverages Azure Blob Storage and PolyBase to get data into the Azure cloud and ultimately loaded to SQL Data Warehouse. We take care of all issues with data type transformation between source and destination, schema changes, and bulk loading.

To start moving data, just specify your host address and database name and provide authentication credentials. Stitch will then start loading data from all of your sources in minutes.

Stitch Data Loader enables Azure SQL Data Warehouse users to analyze data from more than 90 data sources, including databases, SaaS tools, and ad networks. We also sponsor and integrate with the Singer open source ETL project, which makes it easy to get additional or custom data sources into Azure SQL Data Warehouse.

Stitch’s destination switching feature also makes it easy for existing Stitch users to take their existing integrations and start loading them into Azure SQL Data Warehouse right away.

Going further with Talend Cloud and Azure SQL Data Warehouse

What if you’re ready to scale out your data warehousing efforts and layer on data transformation, profiling, and quality? Talend Cloud offers many more sources as well as more advanced data processing and data quality features that work with ADW and the Azure platform. With over 900 connectors available, you’ll be able to move all your data, no matter the format or source. With data preparation and additional security features built in, you can get Azure-ready in no time.

Take Uniper, for instance. Using Azure and Talend Cloud, they built a cloud-based data analytics platform to integrate over 100 data sources, including temperature and IoT sensors, from various external and internal sources. They constructed the full flow of business transactions (spanning market analytics, trading, asset management, and post-trading) while enabling data governance and self-service, reducing integration costs by 80% and achieving ROI in six months.

What next?

Learn more about how you can use Stitch and Azure SQL Data Warehouse to build your data analytics stack.


Data Warehouse Modernization and the Journey to the Cloud


To say that organizations today are facing a complex data landscape is really an understatement. Data exists in on-premises systems and in the cloud; data is used across applications and accessed across departments.

Information is being exchanged in ever-growing volumes with customers and business partners. Websites and social media platforms are constantly adding data to the mix. And now there’s even more data coming from new sources such as the Internet of Things (IoT) via sensors and smart, connected devices.

This proliferation of data sources is leading to a chaotic, “accidental architecture”, where organizations can’t get the right data to the right people at the right time. That means users such as business analysts and data scientists can’t adequately analyze relevant data and get the most value out of it to enhance the business.

While they’re dealing with growing sources and volumes of data that are increasingly difficult to manage, enterprises are also grappling with emerging business demands:

  • Increasing expectations, both internally and externally: They’re expected to be more agile, with faster time to market for products and services and more rapid response times for customer inquiries. Any delays can mean the loss of business to competitors that are more agile.
  • Constantly changing compliance requirements: They need to comply with a growing number of government and industry regulations related to data security, privacy, and management, which is part of the broader issue of data governance. The latest examples are the GDPR regulation governing data privacy for citizens of the European Union and the new California Consumer Privacy Act.
  • Growing demand for accessible data: They must meet growing demands for self-service, as more and more business users demand immediate access to data and to the tools to analyze the data. A new generation of workers expects to have continuous access to the resources they need.

Data Warehouses Moving to the Cloud

Not too long ago, many were saying these challenges could be addressed by data warehouses. But traditional data warehouses present a number of problems. For one thing, they lead to the existence of data silos, with companies using a datamart for one project, a data warehouse for another project, and other warehouses for still other projects.

<< ebook: Cloud Data Warehouse Trends for 2019>>

 

This, in turn, increases complexity and makes data management more difficult. The challenge becomes even greater as the volume and sources of data increase. There are multiple systems to integrate and manage, which requires specialized skills and tools.

And in order to meet performance and capacity demands, organizations need to make investments in all kinds of proprietary hardware and then maintain that legacy hardware over time. Companies are forced to do a lot of capacity planning and try to control costs while dealing with the rapid increase in data.

Perhaps because of these shortcomings, many organizations are looking to make changes to their data warehouse strategy. Research conducted in 2017 by the Data Warehouse Institute shows that nearly half of the organizations surveyed (48%) are planning a replacement project for their data warehouse platform by 2019.

A lot of these organizations are moving to cloud-based data warehousing, which gives them virtually unlimited capacity and scalability, a more economical way to leverage warehousing, and in many cases cost savings.

A move to cloud data warehousing has its own set of challenges, however. When companies make this shift, they’re not just moving their databases to the cloud, but analytics and visualization as well. They’re transitioning to business intelligence as a service. So, one of the major issues that arise is data integration.

Organizations that are moving their data warehousing initiatives to the cloud and using integration tools, however, are seeing benefits. The two key use cases are lift and shift, where a company takes an existing legacy data warehouse and moves it into a new, cloud-based data warehouse, and entirely new projects where the company doesn’t really care about the legacy data warehouse.

One example of a successful lift and shift strategy is a large financial and data services firm. The company had challenges with poor performance and meeting load times with a legacy data warehouse. In addition, its reporting functionality was taking too long for users to get results back.

With a large variety of data sources, the firm moved off its legacy platform to modernize its data warehouse to ensure it could access all the connectors it would need now and in the future. Among the results of the move are nine times faster load performance and eight times faster query performance. This improved end-user efficiency enabled the IT organization to meet its load window SLA.

An excellent example of a new project use case is a large provider of healthcare analytics that relies on timely and accurate data to provide insights and ultimately to improve efficiency in the healthcare chain. The company has to integrate and manage a great number of diverse healthcare data sources, which is labor-intensive and time-consuming. It needed to build a big data warehouse for faster analytics.

The company takes a lot of healthcare data, bundles it up, enriches the information and provides it back to pharmaceutical, biotech, and medical technology companies. Today, the firm has to provide insights that are a lot more metrics-driven and therefore manage data at scale. What used to be about gigabytes of their own data has become petabytes of external/real-world data.

Data management was getting too complex, and it was difficult to integrate all the various data sources. The company decided to move to a single repository where it could process data from many suppliers as well as its own. Considering the sheer amount of data at play, it needed mature data management and quality capabilities without a “data tax” (a costly and unpredictable pricing model based on the number and type of connectors or source and target systems being connected). After moving to a cloud data warehouse, the company experienced robust connectivity and easy access.

For many companies, it’s not a question of if they’ll be moving their data warehouse function to the cloud, but when. It’s important for organizations to find new ways to work with data more efficiently. Without this, they can’t compete in today’s business environment.

Next Steps: 

Watch this webinar from Talend and Snowflake on modern data warehousing: 

7 Tips for Modern Data Warehousing

Reprinted with permission from Datanami



Best Practices for Using Context Variables with Talend – Part 1


A question I was regularly asked when working on different customer sites and answering questions on forums was “What is the best practice when using context variables?”

My years of working with Talend have led me to work with context variables in a way that minimizes the effort I need to put into ongoing maintenance and moving them between environments. This blog series is intended to give you an insight into the best practices I use as well as highlight the potential pitfalls that can arise from using the Talend context variable functionality without fully understanding it.

Contexts, Context Variables and Context Groups

To start, I want to ensure that we are all on the same page with regard to terminology. There are 3 ways “Context” is used in Talend:

  • Context variable: A variable which can be set either at compile time or runtime. It can be changed and allows variables which would otherwise be hardcoded to be more dynamic.
  • Context: The environment or category of the value held by the context variable. Most of the time Contexts are DEV, TEST, PROD, UAT, etc. This allows you to set up one context variable and assign a different value per environment.
  • Context Group: A group of context variables which are packaged together for ease of use. Context Groups can be dragged and dropped into jobs so that you do not have to set up the same context variables in different jobs. They can also be updated (added to) in one location and then the changes can be distributed to the jobs that use those Context Groups.

I’ve found that many people refer to “context variables” as “contexts”. When these terms are used interchangeably online, it really can confuse the issue in discussions. So, now that we have a common set of definitions, let’s move forward.
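Before we do, here is a quick sketch of how these terms surface in the Java code that Talend Studio generates, as seen from inside a tJava component. The variable names (dbHost, dbPort) are hypothetical and would typically come from a Context Group; only the context.<name> access pattern is the point.

// Inside a tJava component, context variables defined in the job (or dragged in
// from a Context Group) are exposed as fields of the generated "context" object.
String host = context.dbHost;   // e.g. a local server in the DEV Context, another host in PROD
Integer port = context.dbPort;  // the value changes per Context, the variable name does not

System.out.println("Connecting to " + host + ":" + port);

// Components can also reference the same variables directly in their settings,
// for example a database component host field set to: context.dbHost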

Potential Pitfalls with Contexts

While context variables are incredibly useful when working with Talend, they can also introduce some unforeseen problems if not fully understood. The biggest cause of problems, in my experience, is the Contexts themselves. Quite simply, I do not use anything but a Default Context.

At the beginning of your Talend journey, Contexts come across as a genius idea: they allow developers to build against one environment, using that environment’s context variable values, and then, when the code is ready to test, to change the context at the flick of a switch. That is true (kind of), but mainly for smaller data integration jobs. More often than not, though, they open up developers and testers to horrible and time-consuming unexpected behavior. Below is just one scenario demonstrating this.

<< Get Your Free Book: The Definitive Guide to Data Integration>>

Let’s say a developer has built a job which uses a Context Group configured to supply database connection parameters. She has set up 4 Contexts (DEV1, DEV2, TEST and PROD) and has configured the different Context Variable values for each Context. In her main job, she reads from the database and then passes some of the data to Child Jobs using tRunJob components. Some of these Child Jobs have their own Child Jobs and all Child Jobs make use of the database. Thus, all jobs make use of the Context Group holding the database credentials. While she is developing, she sets the Context within the tRunJobs to DEV1. This is great. She can debug her Job until she is happy that it is working. However, she needs to test on DEV2 because it has a slightly cleaner environment. When she runs the Parent Job she changes the default Context from DEV1 to DEV2 and runs the Job. It seems to work, but she cannot see the database updates in her DEV1 database. Why? She then realizes that her Child Jobs are all defaulted to use DEV1 and not DEV2.

Now there are ways around this, she could ensure that all of her tRunJobs are set with the correct Context. But what if she has dozens of them? How long will that take? She could ensure that “Transmit whole context” is set in each tRunJob. But what happens if a Child Job is using a Context variable or Context Group that is not used by any of the Parent Jobs? We are back to the same problem of having to change all of the tRunJob Contexts. But this doesn’t affect us outside of the Talend Studio, right? Wrong.

If the developer compiled that job to use on the command-line, even if she sets “Apply Context to children jobs” on the Build Job page, all this does is hardcode all of the Child Jobs’ Contexts to that selected in the Context scripts drop down. When you run it, if you change the Context that the Job needs to run for, the Child Jobs stick with the one that has been compiled. The same thing happens in the Talend Administration Center (TAC) as well.

Now, this does have some uses. Maybe your Contexts are not for environments and you want to be able to use different Contexts within the same environment? That is a legitimate (if slightly unusual) scenario. There are other examples of these sorts of problems, but I think you get the idea.

In the early days of Talend, Contexts were brilliant. But these days (unless you have a particular use case where multiple Contexts are used within a single environment), there are better ways of handling context variables for multiple environments. I’ll cover those ways and best practices in parts two and three of this blog series, coming out next week. Until next time!


Sabre Airline Solutions Gives Airline Data a Critical Upgrade


Sabre Airline Solutions (Sabre) supplies applications to airlines that enable them to manage a variety of planning tasks and strategic operations, including crew schedules, flight paths, and weight and balance for aircraft.

The challenge for Sabre was that many airlines had not implemented the proper upgrades. That meant some large customers were as many as five versions behind. And moving them to the new suite would have been a time-consuming, expensive, version-by-version process. Customers, understandably, were nervous about tackling that process.

<< Learn how to deliver data you can trust across your business.>>

Reducing upgrade time and costs for customers

Talend was given a deadline of two weeks to complete the migration for an important customer and surpassed the company’s expectations, enabling Sabre to complete the needed migrations in just a matter of hours.

“To help our airline customers succeed in a very competitive industry, we need a way to migrate data more efficiently. Talend is the solution for data mobility.” – Dave Gebhart, Software Development Principal

Replicating a process to save time and money

As a result of the shorter, more cost-efficient process, Sabre can now easily replicate it. The new process reduced the cost of doing migrations by 80 percent, and it enabled Sabre to do as many as 25 upgrades in a year, whereas previously it could manage only about 10. That means Sabre has more than doubled the upgrade slots it is able to serve because of the benefits of using Talend.

What’s next? Sabre is currently working on a project that uses Talend for a more complex task. “We’re integrating three legacy applications, and we’re using Talend to extract data and transform it into objects that can be converted into XML service requests, which are then processed so the data can be loaded via web services into a new system,” says Gebhart. “Talend is the engine we’re using to drive this multi-step process.”

<< Download the full Sabre case study>>


Automated Machine Learning: is it the Holy Grail?


Authored by Gero Presser, co-founder and managing partner of Quinscape GmbH in Dortmund (@gero_presser)

Machine learning is in the ascendancy. Particularly when it comes to pattern recognition, machine learning is the method of choice. Tangible examples of its applications include fraud detection, image recognition, predictive maintenance, and train delay prediction systems. In day-to-day machine learning (ML) and the quest to deploy the knowledge gained, we typically encounter three main problems (though not the only ones).

Data Quality – Data from multiple sources across multiple time frames can be difficult to collate into clean and coherent data sets that will yield the maximum benefit from machine learning. Typical issues include missing data, inconsistent data values, autocorrelation and so forth.

<< Download the Definitive Guide to Data Quality>>

Business Relevance – While much of the technology underpinning the machine learning revolution has been progressing more rapidly than ever, much of its application today occurs without much thought given to business value.

Operationalizing Models – Once models have gone through the build and tuning cycle, it is critical to deploy the results of the machine learning process into the wider business. This is a difficult bridge to cross as predictive modelers are typically not IT solution experts and vice versa.

There is also a whole toolbox of algorithms behind machine learning, each of which can be adjusted for greater accuracy using so-called hyperparameters. With the popular k-nearest neighbors algorithm, for example, k refers to the number of neighbors we want to take into account. In a neural network, hyperparameters can cover the entire architecture of the network.
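To make the idea of a hyperparameter concrete, here is a minimal, dependency-free k-nearest neighbors sketch in Java where k is the only hyperparameter; the tiny two-feature data set and the choice k = 3 are invented for the example.

import java.util.Arrays;
import java.util.Comparator;

public class KnnSketch {

    // Classify a point by majority vote among its k nearest neighbors (Euclidean distance).
    static int classify(double[][] features, int[] labels, double[] query, int k) {
        Integer[] idx = new Integer[features.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;

        // Sort training points by their distance to the query point.
        Arrays.sort(idx, Comparator.comparingDouble(i ->
                Math.hypot(features[i][0] - query[0], features[i][1] - query[1])));

        // Majority vote among the k closest points (binary labels 0/1 assumed).
        int votesForOne = 0;
        for (int i = 0; i < k; i++) votesForOne += labels[idx[i]];
        return votesForOne * 2 > k ? 1 : 0;
    }

    public static void main(String[] args) {
        double[][] features = {{1, 1}, {1.2, 0.9}, {0.8, 1.1}, {5, 5}, {5.2, 4.9}, {4.8, 5.1}};
        int[] labels        = { 0,      0,          0,          1,      1,          1        };

        int k = 3; // the hyperparameter: how many neighbors to take into account
        System.out.println(classify(features, labels, new double[]{1.1, 1.0}, k)); // expected 0
        System.out.println(classify(features, labels, new double[]{4.9, 5.0}, k)); // expected 1
    }
}

Changing k changes the model’s behavior without changing a single line of the algorithm, which is exactly why tuning such values is a job in its own right.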

A key task for data scientists today is finding the right algorithm for a given problem and “setting” it correctly. In reality, however, the range of tasks is much larger. A data scientist has to understand the business perspective of a problem, address the data situation, prepare the data appropriately and arrive at a model that lends itself to evaluation. This is typically a cyclical process that follows the cross-industry standard process for data mining (CRISP-DM) [1].

Correspondingly, projects in the field of machine learning are complex and demand the time of multiple people qualified in a range of fields (business, IT, data science). Furthermore, it is often unclear at the outset what the outcome will be; in this sense, such projects are risky.

The Relevance of AutoML

To this day, data science projects cannot be automated. There are cases, however, where certain steps of a project can be automated: this is what lies behind the concept of automated machine learning (AutoML). AutoML can, for example, assist in the choice of algorithm. A data scientist usually compares the results of several algorithms on the problem and selects one after weighing a range of factors (e.g. quality, complexity/duration, robustness). Another aspect that may be automated in certain cases is the setting of hyperparameters: many algorithms can be adjusted by means of parameters and their quality optimized in relation to the specific problem.

AutoML is a resource that can accelerate data science projects where parts or individual steps are automated, leading to an increase in productivity. AutoML is extremely useful, for instance, in the evaluation of algorithms. Because of this, many libraries and tools have adopted AutoML as a supplementary function. Notable examples include auto-sklearn (in the Python community) or DataRobot, which specializes in AutoML. RapidMiner, for example, uses assistants to compare different algorithms and very quickly find the best one for a specific problem [2].

Nevertheless, AutoML should not be understood as a one-size-fits-all solution capable of fully automating data science projects and dispensing with the need for data scientists. In this sense, it, unfortunately, is not the Holy Grail.

As in other specialist fields, automation is useful first and foremost for tedious technical tasks where highly skilled professionals would otherwise spend most of their time systematically trying out certain parameter sets and then comparing the results – a job that really can be better left to machines.
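As an illustration of that “systematically trying out parameter sets” work, here is a hedged sketch of the simplest possible hyperparameter search: looping over candidate values of k for the k-nearest neighbors classifier sketched earlier (it reuses that KnnSketch.classify method) and keeping the value that scores best on a held-out validation set. The data split and candidate values are invented for the example.

public class KnnGridSearchSketch {

    public static void main(String[] args) {
        // Hypothetical training and validation splits (two features, binary labels).
        double[][] trainX = {{1, 1}, {1.2, 0.9}, {0.8, 1.1}, {5, 5}, {5.2, 4.9}, {4.8, 5.1}};
        int[] trainY      = { 0,      0,          0,          1,      1,          1        };
        double[][] validX = {{0.9, 1.2}, {5.1, 5.0}, {1.1, 0.8}, {4.7, 5.2}};
        int[] validY      = { 0,          1,          0,          1        };

        int bestK = -1;
        double bestAccuracy = -1;

        // The part a machine can do mechanically: try each candidate hyperparameter value
        // and compare the results, instead of picking k by hand.
        for (int k : new int[]{1, 3, 5}) {
            int correct = 0;
            for (int i = 0; i < validX.length; i++) {
                if (KnnSketch.classify(trainX, trainY, validX[i], k) == validY[i]) correct++;
            }
            double accuracy = (double) correct / validX.length;
            System.out.println("k=" + k + " -> validation accuracy " + accuracy);
            if (accuracy > bestAccuracy) { bestAccuracy = accuracy; bestK = k; }
        }
        System.out.println("Selected k=" + bestK);
    }
}

Real AutoML tools search far larger spaces (algorithms, preprocessing steps, many hyperparameters at once) and use smarter strategies than a plain loop, but the division of labor is the same: the machine does the systematic trying, the human still frames the problem.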

What remains is a wealth of challenges that still have to be addressed by humans. This begins with understanding the actual problem itself and covers diverse, mostly very time-consuming, tasks ranging from data engineering to deployment. AutoML is a useful tool, but it’s not the Holy Grail yet.

[1] By Kenneth Jensen – Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610

[2] http://www.rapidminer.com/

About the author Dr. Gero Presser

Dr. Gero Presser is a co-founder and managing partner of Quinscape GmbH in Dortmund. Quinscape has positioned itself on the German market as a leading system integrator for the Talend, Jaspersoft/Spotfire, Kony and Intrexx platforms and, with their 100 members of staff, they take care of renowned customers including SMEs, large corporations and the public sector.

Gero Presser did his doctorate in decision-making theory in the field of artificial intelligence and at Quinscape he is responsible for setting up the business field of Business Intelligence with a focus on analytics and integration.


Best Practices for Using Context Variables with Talend – Part 2


First off, a big thank you to all those who have read the first part of this blog series! If you haven’t read it, I invite you to read it now before continuing, as Part 2 will build upon it and dive a bit deeper. Ready to get started? Let’s kick things off by discussing the Implicit Context Load.

The Implicit Context Load

The Implicit Context Load is one of those pieces of functionality that can very easily be ignored but is incredibly valuable.

Simply put, the Implicit Context Load is just a way of linking your jobs to a hardcoded file path or database connection to retrieve your context variables. That’s great, but you still have to hardcode your file path/connection settings, so how is it of any use here if we want a truly environment-agnostic configuration?

Well, what is not shouted about as much as it probably should be is that the Implicit Context Load configuration fields can not only be hardcoded but can also be populated by Talend routine methods. This opens up a whole new world of environment-agnostic functionality and makes Contexts completely redundant for configuring context variables per environment.

You can find the Talend documentation for the Implicit Context Load here. You will notice that it doesn’t say (at the moment…maybe an amendment is due :)) that each of the Implicit Context Load fields can be populated by Talend routine methods instead of being hardcoded.
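To give a flavour of the idea before Part 3, below is a hedged sketch of the kind of Talend routine (a plain Java class of static methods in the routines package) whose return values could be plugged into those fields instead of a hardcoded path. The TALEND_ENV environment variable and the /opt/talend/context directory layout are assumptions for the example, not a prescribed convention.

package routines;

public class ContextLoadUtils {

    /*
     * Returns the folder holding the context files for the current environment.
     * The TALEND_ENV OS environment variable and the /opt/talend/context layout
     * are illustrative assumptions; any machine-specific lookup would work.
     */
    public static String getContextFileDir() {
        String env = System.getenv("TALEND_ENV");  // e.g. DEV, TEST or PROD, set per server
        if (env == null || env.trim().isEmpty()) {
            env = "DEV";                           // fall back to DEV on developer machines
        }
        return "/opt/talend/context/" + env;
    }

    /*
     * Convenience method for the Implicit Context Load "From file" setting:
     * every job reads the same file name from an environment-specific folder,
     * so nothing about the environment is hardcoded in the job itself.
     */
    public static String getContextFilePath() {
        return getContextFileDir() + "/context.properties";
    }
}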

JASYPT

Before I go any further, it makes sense to jump onto a slight tangent and mention JASYPT. JASYPT is a Java library which allows developers to add basic encryption capabilities to their projects with minimum effort, and without needing deep knowledge of how cryptography works. JASYPT is supplied with Talend, so there is no need to hunt around and download all sorts of Jars to use it here. All you need to do is write a little Java to obfuscate your values and prevent others from being able to read them in clear text.

Now, you won’t necessarily want all of your values to be obfuscated. This might actually be a bit of a pain. However, JASYPT makes this easy as well. JASYPT comes with built-in functionality which allows it to ingest a file of parameters and decrypt only the values which are surrounded by ….

ENC(………)

This means a file with values such as below (example SQL server connection settings)…..

TalendContextAdditionalParams=instance=TALEND_DEV

TalendContextDbName=context_db

TalendContextEnvironment=DEV

TalendContextHost=MyDBHost

TalendContextPassword=ENC(4mW0zXPwFQJu/S6zJw7MIJtHPnZCMAZB)

TalendContextPort=1433

TalendContextUser=TalendUser

…..will only have the “TalendContextPassword” variable decrypted, the rest will be left as they are.
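For reference, one way to read such a file in plain Java is JASYPT’s EncryptableProperties class, sketched below; the file path and the encryptor password are placeholders, and this is only an indicative sketch rather than the full Talend wiring covered later in this series.

import java.io.FileInputStream;
import java.util.Properties;

import org.jasypt.encryption.pbe.StandardPBEStringEncryptor;
import org.jasypt.properties.EncryptableProperties;

public class EncryptedContextFileSketch {
    public static void main(String[] args) throws Exception {
        // Use the same encryptor settings that produced the ENC(...) value in the file.
        StandardPBEStringEncryptor encryptor = new StandardPBEStringEncryptor();
        encryptor.setAlgorithm("PBEWithMD5AndDES");
        encryptor.setPassword("BOB"); // in practice, supplied from outside the code

        // EncryptableProperties behaves like java.util.Properties, except that values
        // wrapped in ENC(...) are decrypted transparently when they are read.
        Properties props = new EncryptableProperties(encryptor);
        try (FileInputStream in = new FileInputStream("/path/to/context.properties")) {
            props.load(in);
        }

        System.out.println(props.getProperty("TalendContextHost"));     // printed as stored
        System.out.println(props.getProperty("TalendContextPassword")); // printed decrypted
    }
}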

This piece of functionality is really useful in a lot of ways and often gets overlooked by people looking to hide values which need to be made easily available to Talend Jobs. I will demonstrate precisely how to make use of this functionality later, but first I’ll show you how simple using JASYPT is if you simply want to encrypt and decrypt a String.

Simple Encrypt/Decrypt Talend Job

In the example I will give you in Part 3 of this blog series (I have to have something to keep you coming back), the code will be a little more involved than this. Below is an example job showing how simple it is to use the JASYPT functionality. This job could be used for encrypting whatever values you may wish to encrypt manually. Its layout is shown below….

 

It uses two components: a tLibraryLoad to load the JASYPT Jar and a tJava to carry out the encryption/decryption.

The tLibraryLoad is configured as below. Your included version of JASYPT may differ from the one I have used. Use whichever comes with your Talend version.

The tJava needs to import the relevant class we are using from the JASYPT Jar. This import is shown below…..

The actual code is….

import org.jasypt.encryption.pbe.StandardPBEStringEncryptor;

Now to make use of the StandardPBEStringEncryptor I used the following configuration….

The actual code (so you can copy it) is shown below….

//Configure encryptor class
StandardPBEStringEncryptor encryptor = new StandardPBEStringEncryptor();
encryptor.setAlgorithm("PBEWithMD5AndDES");
encryptor.setPassword("BOB");

//Set the String to encrypt and print it
String stringToEncrypt = "Hello World";
System.out.println(stringToEncrypt);

//Encrypt the String and store it as the cipher String. Then print it
String cipher = encryptor.encrypt(stringToEncrypt);
System.out.println(cipher);

//Decrypt the String just encrypted and print it out
System.out.println(encryptor.decrypt(cipher));

In the above it is all hardcoded. I am encrypting the String “Hello World” using the password “BOB” and the algorithm “PBEWithMD5AndDES”. When I run the job, I get the following output….

Starting job TestEcryption at 07:47 19/03/2018.

[statistics] connecting to socket on port 3711
[statistics] connected
Hello World
73bH30rffMwflGM800S2UO/fieHNMVdB
Hello World
[statistics] disconnected
Job TestEcryption ended at 07:47 19/03/2018. [exit code=0]

These snippets of information are useful, but how do you knit them together to provide an environment-agnostic Context framework to base your jobs on? I’ll dive into that in Part 3 of my best practices blog. Until next week!


Cloud Data Warehouse Trends You Should Know in 2019


In October 2018, TDWI and Talend surveyed over 200 architects, IT and analytics managers, directors and VPs, and a mix of data professionals about their cloud data warehouse strategy. We wanted to get real answers about how companies are moving to the cloud, especially with the recent rise of cloud data warehouse technologies. For instance, we wanted to know if a cloud data warehouse (CDW) is seen as a key driver of digital transformation. Which use cases are driving CDW adoption? Does a cloud data warehouse help companies become more data-driven?

We heard the feedback and organized the results in our latest report with TDWI into three key areas: progress, challenges, and what’s next.

Complex Cloud Data Warehouse Environments

Survey respondents noted that their cloud data warehouses need to do complicated work. They have to cross hybrid environments as well as accommodate a larger organizational shift to the cloud. In addition to all that, respondents wanted their cloud data warehouse to work for functions throughout the company, not just a select few technical teams.

The cloud data warehouse environment is getting more complex. The largest share of respondents (36%) indicated that they would be deploying their CDW in a hybrid environment.

CDW business use cases are spread throughout the organization. Interestingly, 62% of respondents in the process of implementing CDWs want them to complement a data lake for analytics.

Challenges for Successful Data Warehousing in the Cloud

However, there are still major roadblocks to organizations adopting CDWs successfully, and the challenges go beyond the CDWs themselves.

Getting data into the CDW is only the beginning. Besides the complexity of getting various data ingested into a CDW, there are many more major challenges. The top challenges indicated by the survey respondents are governance (50%), integrating data across multiple sources (>40%), and getting data into the warehouse (38%). Some respondents told us “Cloud databases have their limitations, and on-premises will never go away completely, so the different environments just complicate everything”; and “We have hundreds of legacy systems without good master data. A cloud data warehouse does not fix a data structure problem”.

Meanwhile, the analytics organizations need to perform in a CDW are increasingly complex. Companies require a number of additional processing capabilities and methodologies, the top three being in-memory processing (>35%), support for structured and unstructured data (>35%), and integration with third-party analytics tools (>35%).

Survey respondents also indicated that they want data processing both BEFORE and AFTER data is loaded into a CDW, along with a common set of transformation and integration needs across their sources.

Conclusion:

Organizations today are fully on board with adopting a cloud data warehouse, but recognize that what a cloud data warehouse must do has evolved with changes in cloud computing, automation, machine learning, and other important trends. A CDW is no longer seen as an end in itself, but rather a stage in the data-driven journey, which has to involve managing a data lifecycle, ensuring data quality, providing a data governance framework, among other considerations.

To find out more details on the survey report, click here to read the full report.


6 Dos and Don’ts of Data Governance – Part 1


Set clear expectations from the start

One big mistake I see organizations make when starting out on their data governance journey is forgetting the rationale behind governing their data. Don’t just govern to govern. Whether you need to minimize risks or maximize benefits, link your data governance projects to clear and measurable outcomes. Because data governance is not a departmental initiative but a company-wide one, you will need to prove its value from the start to convince leaders to prioritize it and allocate resources.

<<ebook: Download our full Definitive Guide to Data Governance>>

What is your “Emerald City”? Define your meaning of success

In the Wonderful Wizard of Oz, the “Emerald City” is Dorothy’s ultimate destination, the end of the famous yellow brick road. In your data governance project, success can take different forms: reinforcing data control, mitigating risks or data breaches, reducing time spent by business teams, monetizing your data or producing new value from your data pipelines. Meeting compliance standards to avoid penalties is also crucial to consider. Ensure you know where you are headed and what the destination is.

Secure your funding

As you’re building the fundamentals of your projects and you’re defining your criteria for success, you will explain the why, the what, and the how. However, make sure you don’t forget to ask “how much” to identify associated costs and the necessary resources to be successful. If you’re a newly assigned Data Protection Officer (DPO), make sure you have a minimum secured operating fund.

If you’re a Chief Data Officer (CDO), align with the Chief Technology Officer (CTO) to secure your funding together. Then pitch your proposal to the finance team jointly, explaining the value of your data strategy and all the hidden potential behind data, so that they understand the company risks linked to failed compliance. Make sure you present data to them as a financial asset.

Don’t go in alone

As you know, and it cannot be said often enough, a data journey is not just another isolated, IT-specific project. Even if you can quickly get to grips with the tools and take advantage of powerful apps, delivering trusted data is a team sport. Gather your colleagues from various departments and start a discussion group around the data challenges they’re facing. Try to identify what kinds of issues they have.

Frequent complaints are: 

  • “I can’t find the right data I am looking for”
  • “I cannot access datasets easily”
  • “Salesforce data is polluted”
  • “How can I make sure it’s trusted?”
  • “We spend too much time removing duplicates manually”

You will soon discover that one of the biggest challenges is to build a data value chain that various profiles can leverage to get trusted data into the data pipelines. Work with peers to clarify, document and see together how to remove these pains. Bring people along on your data journey and give them responsibilities so the project won’t be your project but rather a team project. Show that the success will not just be for you, but for all team members to enjoy together.

Apply Governance with a “Yes!”

Avoid too much control through a top-down approach. On the contrary, apply a collaborative yet controlled model of data governance: enable role-based applications that allow your data stakeholders and the wider stakeholder community to harness the power of data, with governance in place from the get-go.

Make sure that the business understands the benefits, but also that they are ready to participate in the effort of delivering trusted data at the speed of the business.

Start with your Data

Traditional governance strategies often apply a non-negotiable top-down approach to assigning accountability for data. While you should spend time setting the direction of your data governance, the truth is that this approach alone won’t be highly efficient, as you’ll often confront high levels of resistance. Start with your data and, more importantly, with the people using it. Listen to business experts and collaborators, dig into your data sets to detect business value and potential business risks, then identify who is using each data set the most. They will often be the ones most inclined to protect and maintain a high level of integrity in your data sets.

Keep an eye out for my second post of six more do’s and don’ts of Data Governance. For more information about how to deliver data you can trust, don’t hesitate to download our Definitive Guide to Data Governance here.

The post 6 Dos and Don’ts of Data Governance – Part 1 appeared first on Talend Real-Time Open Source Data Integration Software.

Delivering on Data Science with Talend: Getting Quality Data


Today, we are in the information age, with a tremendous amount of data being created (as much as 90% of the world’s data was created in the last two years alone). This data comes from a wide range of sources and takes many different forms: human-generated documents and social media communications; transactional data that we use to run our businesses; and an ever-increasing proliferation of sensors producing streams of data.

It has been said that data is the new soil in which discoveries grow, and the potential for Data Scientists to make breakthroughs and to drive positive outcomes using machine learning and deep learning is unprecedented. But new opportunities always come with challenges.

In this blog series, I’ll show you how Talend can help solve common challenges with data science. First, let’s start by focusing on how to get clean and relevant data available.

Breaking the 80/20 rule of Data Science

One of the things you might have heard a hundred times while working on Data Science projects is the 80/20 rule, where 80% of a data science effort is spent on data compilation (getting clean, relevant data in the right format and where it’s needed) and only 20% on actual analysis. According to the 80/20 rule of data science, four days of each business week are spent gathering data, while only one day is spent running algorithmic models. This rule has been confirmed by data scientists themselves in a recent report from CrowdFlower (now known as Figure Eight).

But what if data scientists already had the data they needed?

 

 

Ingest any type of data

Let’s start with this thought: even the most experienced data scientist won’t be much help without access to data. Moreover, they must be able to get all the data, meaning that if there are 20 years of customer data sitting in a mainframe, or an MQTT topic where sensor data is published, they must be able to collect it in order to unlock the value and potential of those information systems.

According to the CrowdFlower survey, a data scientist spends at least one day of the week just collecting data. That’s where Talend’s data integration capabilities come in handy, with more than 900 connectors and components that allow you to connect to databases, business and cloud applications, data formats and metadata, protocols and messaging, cloud services, and much more.

 

 

And it’s not going to stop there: Talend recently announced the acquisition of Stitch, which has developed a simple, frictionless way for users to move data from cloud sources to a cloud data warehouse quickly and easily.

Stitch enables Talend to immediately compete in a new and rapidly growing market segment for low-cost, self-service cloud data warehouse ingestion services, and current customers will benefit from using both Talend and Stitch products in the near future. If you want to know more about Stitch, visit the Stitch website and get started connecting your data in less than two minutes.

Driving Data Quality

Now that data scientists can access and collect the data they need with data integration tools, they’ll have to face the data quality challenge because many organizations’ data lakes have turned into dumping grounds. Coming back to the CrowdFlower report, data scientists often spend two to three days of their week working on cleaning and preparing their data.

A data scientist’s time is precious; Talend Data Quality can help them work to their full potential with our Data Quality suite, which includes Data Masking capabilities as well as self-service applications for Data Preparation and Data Stewardship. Talend brings a unified platform to make data integration and data quality a team sport, through collaboration and by empowering business users to build up the company’s ground truth.

But more importantly, with Talend you will be able to automate, scale, and industrialize data integration, quality, and anonymization processes, making your data scientists’ lives easier by providing consistently high-quality data.

Because in the end, having a lot of data is good but not enough; for data science, the quality of the data is key to building performant machine learning models.

 

 

Cataloging Your Data With a … Data Catalog

Even when they can get their hands on the right data, data scientists need to spend time exploring and understanding it. For example, they might not know what a set of fields in a table is referring to at first glance, or data may be in a format that can’t be easily understood or analyzed. There is usually little to no metadata to help, and they may need to seek advice from the data’s owners to make sense of it.

Talend offers tools to automate and simplify data discovery, curation, and governance. Intelligent search capabilities help data scientists find the data they need, while metadata such as tags, comments, and quality metrics help them decide whether a data set will be useful to them and how best to extract value from it.

With Talend Data Catalog, our goal is to deliver trusted data at scale in the digital era. We do this by empowering organizations to create a single source of trusted data. Talend Data Catalog achieves this objective by:

  • First, it crawls your data landscape and uses machine learning and smart semantics to automatically discover all your data
  • Second, it orchestrates data governance, so data curation becomes a team sport where you can collaborate to improve data accessibility, accuracy, protection, and business relevance
  • Third, it lets data consumers find, understand, use, and share trusted data faster. Data Catalog makes it easy to search for data and visually present data relationships, then verify its validity before sharing with peers.

Integrated data governance gives data scientists confidence that they are permitted to use a given data set and that the models and results they produce are used responsibly by others in the organization.

Conclusion

All this will help break the 80/20 rule and flip it to 20/80. Data scientists could reclaim much of the time they’re currently spending on cleansing and spend more time on what they do best: building more efficient predictive models, as they’ll have more time to refine and compare them.

And all of this can be done from the Talend Cloud and in a serverless fashion for the execution using Talend Cloud engines. Stay tuned for my next post where we will discuss how to scale and reduce costs using serverless technologies and how to deploy and leverage Machine Learning models in enterprise solutions with Talend and Databricks.

The post Delivering on Data Science with Talend: Getting Quality Data appeared first on Talend Real-Time Open Source Data Integration Software.


Best Practices for Using Context Variables with Talend – Part 3


Hello and welcome to Part 3 of my best practices guide on context variables! Before I get started, I just want to inform you that this blog builds on concepts discussed in Part 1 and Part 2. Read those before you get started. 

Let’s say that we decide that we want our Talend Jobs to be able to run on any environment after they have been compiled. We do not want to have to compile them again. We want to maintain our Context variable values in a database (which at design time we have no idea of its location) and we want to keep the database connections details hidden so that they cannot easily be found by someone who might get access to the servers. How can this be done? Will this make the framework incredibly complicated?

What if I said you can do this and keep the system incredibly dynamic and developer friendly, using nothing more than a couple of Operating System Environment Variables, a flat file, and a relatively simple Talend Routine? All you would need to do is configure the Environment Variables on the servers that jobs will be run on and place a flat file on those servers. After that, the jobs will automatically pick up the Context variable values when the jobs start, regardless of which environment they run on, and without any need to change the individual jobs. Would that be useful?

The Solution

The solution I will give to the above problem is one I have evolved over several years and several projects, where some or all of the requirements above (and sometimes more complicated ones) have had to be met.

The Context variable Table

The first thing you will need to do is to set up a database table to hold your context variable values. The schema that I generally use can be seen below (this was written for SQL Server):

CREATE TABLE [context_variables](
        [id] [bigint] NOT NULL,
        [env] [varchar](255) NULL,
        [key] [varchar](255) NULL,
        [value] [varchar](255) NULL,
        [description] [varchar](255) NULL
)

This is a bare-bones table. I’ve not added any primary keys (although “id” would be the one I’d use), indexes or any other potentially useful columns. You can do that and configure this as you wish. The important columns used in this example are “key”, “value” and “env”. “key” and “value” MUST be named this way. The Implicit Context Load will need the key column (the column holding the Context variable name) to be called “key” and the value column (the column holding the Context variable value) to be called “value”. Both of these columns are Varchars. The Implicit Context Load will implicitly cast (convert) the values into an object of the correct class. The “env” column I am using to demonstrate how you can have different environment’s Context variables in the same table if you wish. I will get to this later.
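
To make this concrete, here are a couple of purely hypothetical rows showing the same Context variable stored once per environment (the variable name and values below are examples only, not part of the framework):

INSERT INTO [context_variables] ([id], [env], [key], [value], [description])
VALUES (1, 'DEV',  'targetDbHost', 'dev-db.internal.example.com',  'Target database host for DEV');

INSERT INTO [context_variables] ([id], [env], [key], [value], [description])
VALUES (2, 'PROD', 'targetDbHost', 'prod-db.internal.example.com', 'Target database host for PROD');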

The Operating System Environment Variables

In order to enable this solution, every server that Talend jobs might be run on (including development environments) will require two Operating System Environment Variables: “FILEPATH” and “ENCRYPTIONKEY”. The “FILEPATH” variable points to a flat file (a .properties file) containing the database connection settings (encrypted where required), and the “ENCRYPTIONKEY” variable holds the encryption/decryption key. These must be System Environment Variables and they must be set before the Talend components (Studio, JobServers, etc.) are started.
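
For illustration only (the paths and key below are placeholders), setting these as System Environment Variables might look like this on Linux and on Windows respectively:

# Linux example (e.g. in /etc/profile.d/talend.sh) -- placeholder values
export FILEPATH=/opt/talend/config/context_db.properties
export ENCRYPTIONKEY=MySecretEncryptionKey

:: Windows example (run in an elevated prompt; /M sets machine-level variables) -- placeholder values
setx FILEPATH "C:\Talend\config\context_db.properties" /M
setx ENCRYPTIONKEY "MySecretEncryptionKey" /M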

The Properties File

This is a simple flat file which holds the connection details of the database holding the Context variables. It is pointed to by the “FILEPATH” Operating System Environment Variable. The keys (variable names) I am using in this example are pretty generic and are loosely related to the required Implicit Context Load parameters. Since they are referenced elsewhere it makes sense to keep them and just change the values, but if you do want to rename them, be sure to change any names I have hardcoded in the Routine (“TalendContextEnvironment” has been hardcoded, for example). The file format can be seen below:

TalendContextAdditionalParams=instance=TALEND_DEV
TalendContextDbName=context_db
TalendContextEnvironment=DEV
TalendContextHost=MyDBHost
TalendContextPassword=ENC(4mW0zXPwFQJu/S6zJw7MIJtHPnZCMAZB)
TalendContextPort=1433
TalendContextUser=TalendUser

The ImplicitContextUtils Routine

I have written a basic routine which allows the Implicit Context Load variables to be set to values supplied in your properties file (pointed to by the “FILEPATH” Operating System Variable). The routine can be seen below:

package routines;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Properties;

import org.jasypt.encryption.pbe.StandardPBEStringEncryptor;
import org.jasypt.properties.EncryptableProperties;

/*
 * This routine is used to point the Implicit Context Load to the correct database and select the
 * correct Context variable environment. It is also used to automatically decide whether to supply
 * Context variables (for Parent jobs only).
 */
public class ImplicitContextUtils {

    // A static Properties object used to hold the context variables in memory after they have been read.
    private static Properties properties;

    /**
     * getImplicitContextParameterValue: used to return the appropriate parameter for the Implicit Context Load
     * configuration.
     *
     * {talendTypes} String
     *
     * {Category} Implicit Context Load
     *
     * {param} string("TalendContextDbName") parameter: the parameter name to be returned
     * {param} string("agh565") rootPID: the root process id
     * {param} string("agh565") jobPID: the job process id
     *
     * {example} getImplicitContextParameterValue("TalendContextDbName", "adfr54", "adfr54") # returns "Talend_DB"
     */
    public static String getImplicitContextParameterValue(String parameter, String rootPID, String jobPID) {
        String returnVal = "";

        // If the properties are null, call the getProperties method to populate them
        if (properties == null) {
            getProperties();
        }
        returnVal = properties.getProperty(parameter);

        // Handles formatting the environment WHERE clause when the TalendContextEnvironment parameter
        // is requested. This must return a WHERE clause.
        if (parameter.trim().compareToIgnoreCase("TalendContextEnvironment") == 0) {
            // If the jobPID does not equal the rootPID (not a parent job),
            // ensure no data will be returned
            if (!jobPID.equals(rootPID)) {
                returnVal = "env='" + returnVal + "' AND 1=0";
            } else {
                returnVal = "env='" + returnVal + "'";
            }
        }
        return returnVal;
    }

    /**
     * getProperties: used to populate the properties variable
     *
     * {talendTypes} void
     *
     * {Category} Implicit Context Load
     *
     * {example} getProperties() # populates the properties variable from the .properties file
     */
    private static void getProperties() {
        String propFile = getEnvironmentVariable("FILEPATH");
        String encryptionKey = getEnvironmentVariable("ENCRYPTIONKEY");

        if (propFile != null) {

            try {
                /*
                 * First, create the encryptor for decrypting the values in the .properties file.
                 */
                StandardPBEStringEncryptor encryptor = new StandardPBEStringEncryptor();
                encryptor.setAlgorithm("PBEWithMD5AndDES");
                encryptor.setPassword(encryptionKey);

                /*
                 * Create our EncryptableProperties object. This is used to decrypt
                 * any variables surrounded by "ENC(" and ")"
                 */
                properties = new EncryptableProperties(encryptor);
                File file = new File(propFile);
                FileInputStream fileInput = new FileInputStream(file);
                properties.load(fileInput);
                fileInput.close();

            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /**
     * getEnvironmentVariable: used to retrieve Environment Variables
     *
     * {talendTypes} String
     *
     * {Category} Implicit Context Load
     *
     * {param} string("TalendContextPassword") variableName: the parameter name to be returned
     *
     * {example} getEnvironmentVariable("TalendContextPassword") # returns "My Password"
     */
    public static String getEnvironmentVariable(String variableName) {
        String returnVal = System.getenv(variableName);

        if (returnVal == null) {
            System.out.println(variableName + " does not exist or holds no value");
        }
        return returnVal;
    }
}

There are 4 key parts to this routine which need some explaining.

The “getEnvironmentVariable” Method

This method is a public static method and is used solely to retrieve Operating System Environment Variables. We have two Operating System Environment Variables configured in this example: “FILEPATH” and “ENCRYPTIONKEY”. This method is used by the “getProperties” method to retrieve those values and use them to locate your database connection settings properties file.

The “getProperties” Method

This method is a private static method (mainly because it would not be expected to be used on its own by anything outside of this Routine) and is used to populate the Properties variable with decrypted property values in key/value pairs. It uses the “getEnvironmentVariable” method to retrieve the Operating System Environment Variables we have set up (the names have been hardcoded within this routine, so they MUST be named the same if you use this, or the hardcoded names need changing).

You will also notice that I am using some slightly more complicated JASYPT code here. This code uses the FILEPATH to read the properties file into an EncryptableProperties object. This decrypts those parameters which need decrypting and makes them available via the Properties variable.
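
If you are wondering how to produce an ENC(...) value for the properties file in the first place, here is a minimal sketch using the same Jasypt classes the routine relies on; the plain-text password below is a placeholder, and you could equally use Jasypt's own command-line tools:

import org.jasypt.encryption.pbe.StandardPBEStringEncryptor;

public class EncryptContextPassword {
    public static void main(String[] args) {
        // Must use the same algorithm and ENCRYPTIONKEY that the routine uses to decrypt
        StandardPBEStringEncryptor encryptor = new StandardPBEStringEncryptor();
        encryptor.setAlgorithm("PBEWithMD5AndDES");
        encryptor.setPassword(System.getenv("ENCRYPTIONKEY"));

        // Prints the ciphertext; wrap it in ENC( ) in the .properties file
        System.out.println("ENC(" + encryptor.encrypt("MyDatabasePassword") + ")");
    }
}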

The “properties” Variable

The “properties” variable is a private static variable used to keep the database connection properties stored in memory so that the file only needs to be read once per job.

The “getImplicitContextParameterValue” Method

This method is a public static method used to retrieve a value by key from the “properties” variable if it is populated, or to call the “getProperties” method first and then retrieve the value by key. This method is added to each of the Implicit Context Load parameter boxes, with the name of the Context variable supplied as the key, along with the root PID (Process ID) and the Job PID (Process ID).

The reason for the Root PID and the Job PID is that we do not want to retrieve Context variables in child jobs. If we are passing our Context variables from the Parent Job to the Child Jobs, we may want to keep any changes which may have taken place along the way. To allow this, I have put a little “hack” into the code above. The Implicit Context Load functionality has a “Query Condition” parameter.

I used this to filter by our “env” (environment) column in our context_variables table. The property in the properties file for this is called “TalendContextEnvironment”. Again, I have hardcoded this into my Routine since it will be consistent throughout my entire project, but you can change this if you want. Now when the value “TalendContextEnvironment” is passed into the “getImplicitContextParameterValue” method with the same rootPID and jobPID, the method will know that this is the Parent Job (rootPID and jobPID will only be the same for the Parent Job) and that it needs to retrieve the value for the “Query Condition”. In this case, it will supply a String representing a WHERE CLAUSE for the table we built earlier. Something like below:

env='DEV'

This will allow the Implicit Context Load to query the database against the correct “env” value. But if the rootPID and jobPID are different, this means that we are dealing with a child job. In which case we do not want ANY Context variables returned. In this case, the method would return:

env='DEV' AND 1=0

This is what gets returned for the “TalendContextEnvironment” property. You will notice the addition of “AND 1=0”. This is ALWAYS false and, as such, no Context variables are returned. The Job will therefore accept the Context values from the Parent Job.

Now you may not wish to do this, in which case you can either modify the code or (more easily) just add the same hardcoded String to both the rootPID and jobPID parameters.

Hooking it all together

Now you should have your Operating System Environment Variables, your Properties file, your Talend Routine, and your Context variable database table. If you have all of these set up, you just need to configure your Implicit Context Load. You can do this per Job, but it makes much more sense to do it for your whole project. To do that, go to “File” in your Studio and select “Edit Project Properties”. The following screen should pop up.

I have partially configured this in the screenshot above. Notice that I have ticked the “Implicit tContextLoad” box to reveal “From File” and “From Database”. I have also selected “From Database” and set the “Property Type”, “Db Type” and “Db Version” for SQL Server (set yours to whichever database type you are using). These values cannot be dynamic, unfortunately.

To configure the rest of the parameters you can use the values below (tweaked to your configuration if you have made changes). If you want to force the Implicit Context Load to run for child jobs then replace rootPid and pid with “” and “”.

Note: rootPid and pid are variables used internally by ALL Talend jobs. They must be used exactly as shown. rootPid clearly corresponds to rootPID, but pid is not so clear. This corresponds to jobPID.

 

Host:

routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextHost",rootPid,pid)

Port:

routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextPort",rootPid,pid)

DB Name:

routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextDbName",rootPid,pid)

Additional Parameters:

routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextAdditionalParams",rootPid,pid)

Password:

routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextPassword",rootPid,pid)

Note: this value is stored encrypted since it normally represents an unencrypted password. To add it, simply click on the password ellipsis button and enter the code above without quotes surrounding the text.

Table Name:

"context_variables"

NB: At this point, I noticed that the table name had not been configured in the Properties file example I put together above. In my eagerness to get this final blog post of the series out, I had hardcoded it in the Implicit Context Load settings. I decided to leave it hardcoded to show that Routine methods and hardcoded values can be used side by side in the Implicit Context Load settings. This was not in any way left because I didn’t want to revisit everything I had already written… honestly. If you would rather not hardcode it, simply add a variable to the Properties file to hold the table name (for example, TalendContextTableName), make sure a matching Context variable exists, and then replace the hardcoded value above with code like below:

routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextTableName",rootPid,pid)

Query Condition:

routines.ImplicitContextUtils.getImplicitContextParameterValue("TalendContextEnvironment",rootPid,pid)

Running a Job

Once your Implicit Context Load settings are populated, you can run the Job from the Studio, compile (build) the Job and run it from the command line (if the machine has the properties file and Operating System Environment Variables configured), or run it on a JobServer via TAC (again, so long as the properties file and Operating System Environment Variables are configured on the JobServer machine). You simply have to ensure that wherever the Job is run, the server has the correct environment variables and Properties file on it.
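
As a quick illustration, assuming a job named CustomerLoad and the variables and properties file described earlier, running the built job from a Linux command line might look something like this (the launcher script name depends on your build; Talend typically generates a <JobName>_run.sh):

# Placeholder paths and job name -- adjust to your own build location
export FILEPATH=/opt/talend/config/context_db.properties
export ENCRYPTIONKEY=MySecretEncryptionKey

cd /opt/talend/jobs/CustomerLoad
./CustomerLoad_run.sh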

At first glance, this might appear to be quite complicated, but once it is set up, this solution allows you to build your jobs without caring about environments. You will KNOW that your jobs will run against the environment that is configured for the machine that they are running on.

There you have it! That’s close to everything you need to know about using Context Variables with Talend. We have one more part coming out next week. Come back to finish the series!

The post Best Practices for Using Context Variables with Talend – Part 3 appeared first on Talend Real-Time Open Source Data Integration Software.

Building Successful Governed Data Lakes with Agile Data Lake Methodology – Volume 2


This post was authored by Venkat Sundaram from Talend and Ramu Kalvakuntla from Clarity Insights.

This is the second part in a series of blogs that discuss how to successfully build governed data lakes. To all those who have read the first part, thank you!  If you haven’t read it, we invite you to read it before continuing, as this second part will build upon it and dive a bit deeper. In this blog, we are going to help you understand how to build a successful data lake with proper governance using metadata-driven architectures and frameworks.

What is a Metadata-Driven Architecture?

First, let’s talk about what I mean by “metadata-driven” architecture. A metadata-driven architecture allows developers to use metadata (data about data) to abstract function from logic. Metadata-driven frameworks will allow us to create generic templates and pass the metadata as parameters at run time – this allows us to write logic once and re-use it many times. This type of architecture will allow a single consistent method of ingesting data into the data lake, improve speed to market and provide the ability to govern what goes into the lake.

How does the Clarity Insights Data Ingestion Framework work?

 
Clarity Insights has successfully built many governed data lake solutions for clients, and as part of this effort, we have created a data ingestion framework using metadata-driven architecture on the Talend Big Data Platform. We picked Talend because it is lightweight, open source and a code generator which gives us the flexibility to design generic components for both data ingestion and transformation processes.

The diagram below illustrates the functionality of the ingestion framework built by Clarity Insights.

Diagram1: Ingestion Framework Functionality

What are the core components of the Data Ingestion Framework?

Framework Database – this metadata database stores:

  • Global parameters — such as your Hadoop environment details, Hive database and IP addresses
  • Configuration metadata — such as what ingestions to run, in what order to run, which templates to use and how many parallel processes to run
  • Operational Metadata — such as what jobs ran at what time, for how long, how many records were processed and job status

This database that stores the metadata can be set up on any RDBMS; a simplified, purely illustrative sketch of what its configuration tables might look like is shown below.
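
The sketch below is illustrative only (it is not Clarity Insights' actual schema); it simply shows one way the configuration side of such a framework database could be modeled:

-- Illustrative only: a minimal process/parameter model for a metadata-driven ingestion framework
CREATE TABLE process (
    process_id      BIGINT       NOT NULL,
    process_name    VARCHAR(255) NOT NULL,   -- e.g. 'clarity demo for DF'
    is_active       CHAR(1)      NOT NULL    -- whether the master process should pick it up
);

CREATE TABLE process_parameter (
    process_id      BIGINT       NOT NULL,   -- FK to process
    param_key       VARCHAR(255) NOT NULL,   -- e.g. 'file_location', 'delimiter'
    param_value     VARCHAR(255) NULL
);

CREATE TABLE process_module (
    process_id      BIGINT       NOT NULL,   -- FK to process
    run_sequence    INT          NOT NULL,   -- order in which the templates run
    template_name   VARCHAR(255) NOT NULL    -- e.g. 'Build Object List', 'Ingest Objects into S3'
);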

Reusable Templates/Components – Some templates that are built in Talend include:

  • Object Discovery — to identify the number of objects that need to be ingested from a database, or files from a given directory
  • Metadata Definitions — to pull metadata from RDBMS database or delimited files or Excel mappings for fixed length files
  • Database Ingestions — Sqoop components to ingest data from RDBMS sources such as Oracle, SQL Server, MySQL, AS400, DB2 etc.
  • File Ingestion — template for fixed length, delimited files, XML, JSON files etc.
  • Change Data Capture — component to identify changes since the data was ingested in the last run along with metadata changes on source tables or source files

Common Services – the framework leverages services including:

  • Restartability — the framework is completely re-startable based on the run history collected at the most granular level in the framework database
  • Parallel processing — determine the optimal number of parallel jobs to run based on the configurations within the metadata store
  • Dependency Management — the sequence in which the jobs should be run based on the dependency defined in the Metadata store
  • Indexing/Cataloging — create index and catalog using metadata management tools such as Talend Data Catalog/Atlas/Cloudera Navigator etc.

Master Process

The master process is the job that will be set up to run from the enterprise scheduler by providing the Process ID. The master process will pull all the jobs, dependencies, parameters, etc. and run them in the order in which they are configured within the metadata store. Everything is controlled through this one process at run time. None of the child Talend jobs will know what they are processing until the master process provides input to them at run time.

Governance process

When a request comes in to ingest a new source system, the request goes to the governance council. The council reviews the request and checks the data catalog to see if this data already exists in the data lake. If it is a new dataset, they enter the details in the framework database. The governance process is completely integrated with the data ingestion framework, thus creating a fully governed data lake.

How is the Data Ingestion Framework implemented in Talend?

Let’s say we receive a whole bunch of pipe-delimited files from an external vendor into a DMZ server daily. The data consumer asks the governance council to set up ingestion for this new source. The council checks the data catalog and finds that this data does not yet exist. So, they set up a new process and its details in the framework database.

A new process called ‘clarity demo for DF’ has been created in the process table.

Parameters for this process have been entered with the location of files, type of files, delimiter and schema information.

The last step in the setup of the process metadata is to create modules and the sequence in which they need to run. In this example, the first template – “Build Object List” – will get the list of files located in the inbound directory; the second template – “Get Object Definition” – will get the metadata of each of these files; the third template – “Ingest Objects into S3” –  is for ingesting data into S3; and the fourth template – “Hive Usable” – will create Hive compressed tables.

Now that we have set up the process metadata in the database, we will use it in a Talend template for ingesting these files. Here’s what the Talend job would look like:

The above template has three subjobs – (i) a pre-job, (ii) main subjob, and (iii) a post-job. The pre-job gets the required metadata from the framework database for the process and the list of modules that need to run. The main subjob is responsible for ingestion of data into S3 buckets. The post-job loads the run history for tracking and restartability. As you can see, no new code has been written to ingest new datasets. When the process completes successfully, data will be available within Hive for data stewards to analyze the data, profile the results to understand the anomalies and apply business glossary rules and definitions.

The framework will capture all the metadata of every file and every table. Anytime the metadata changes on the source system, the framework dynamically detects it and creates a new metadata definition within Hive and tags that with the original tables. It will create a view on top of the Hive tables for business to query and in many cases, users will not even notice that things changed underneath the table.

If we have ingestion requirements around new file formats such as JSON, XML or industry standard formats such as HL7, all we have to do is build a new generic template in Talend and the rest will all flow through the same ingestion process. Once we do this, a data lake will be more like a data library where every dataset is being indexed and cataloged.

Diagram Two: Data Ingestion Framework / Funnel

Conclusion

A robust and scalable Data Ingestion Framework needs to have the following characteristics:

  • Single framework to perform all data ingestions consistently into the governed data lake
  • Metadata-driven architecture that captures what datasets need to be ingested, when and how often to ingest them, how to capture the metadata of those datasets, and what credentials are needed to connect to the data source systems
  • Template design architecture to build generic templates that can read the metadata supplied in the framework and automate the ingestion process for different formats of data, both in batch and real-time
  • Tracking metrics, events and notifications for all data ingestion activities
  • Single consistent method to capture all data ingestion along with technical metadata, data lineage and governance
  • Proper data governance with “search and catalogue” to find data within the data lake
  • Data Profiling to collect the anomalies in the datasets so data stewards can look at them and come up with data quality and transformation rules

In the next part of this series, we’ll discuss how to tie the data governance process into the data ingestion framework, and we’ll see how it enables organizations to build governed data lake solutions.  While we are not getting into nitty-gritty details in these blog posts, the key takeaway is this: it’s vital to have frameworks like these to be successful in your data lake initiatives. Check back soon for the next installment!

The post Building Successful Governed Data Lakes with Agile Data Lake Methodology – Volume 2 appeared first on Talend Real-Time Open Source Data Integration Software.

How OTTO Utilizes Big Data to Deliver Personalized Experiences


OTTO is one of Europe’s most successful e-commerce companies and Germany’s biggest online retailer. In the past financial year, 6.6 million customers ordered online from OTTO.

Like virtually every other retailer with an online presence, OTTO competes against Amazon for customers and mindshare. To hold their own, they knew they needed to use and analyze data to know their customers better and to sharpen their pricing strategy for bidding on onsite ads.

Struggling with silos of data

OTTO’s previous IT architecture, however, made that difficult because it consisted of a large Teradata database and many silos of data. OTTO decided to standardize on Talend and built a completely new data lake on a Cloudera cluster, with Talend acting as the bridge between the new data lake — which stores data from the OTTO website and from partner websites — and the company’s 250TB Exasol database.  

“Talend helps enable our strategy to get closer to the shoppers who come to our website, understand the market, and get ideas for promoting product purchases” – Michael van Ryswyk, Product Owner, Brain Real-Time

Talend also gave OTTO the option to use the Continuous Integration/ Continuous Delivery (CI/CD) process from their partner cimt ag, which none of the competing solutions did. Using CI/CD in conjunction with Talend, OTTO can continuously integrate data sources and continuously develop and deliver new, small upgrades, updates, and applications.

Sharpening pricing strategy and stock

Developing software using this CI/CD methodology makes it possible to respond to requests from business units in hours instead of weeks.

Complying with new GDPR requirements, OTTO is developing 360-degree customer profiles and gaining visibility into what products a customer is currently searching for on the company site. In addition, OTTO is using machine learning that relies on integrated customer data and product descriptions and photos to make suggestions to customers.

Download the full case study here. 

The post How OTTO Utilizes Big Data to Deliver Personalized Experiences appeared first on Talend Real-Time Open Source Data Integration Software.

6 Dos and Don’ts of Data Governance – Part 2


In my last post, I gave you the first six Do’s and Don’ts of Data Governance and promised to bring together an additional six to consider when making a data governance plan for your organization. 

Here are six more dos and don’ts when building your data governance framework.

Do: Consider the cloud on your route to trust

Gartner predicts that “by 2023, 75% of all databases will be on a cloud platform, increasing complexity for data governance and integration”. The move to the cloud is accelerating as organizations are collecting more data, including new datasets that are created beyond their firewalls. The need to deliver data in real-time to a wider audience, and seeking for more agility and on-demand processing capabilities is also driving this shift.

<<ebook: Download our full Definitive Guide to Data Governance>>

Because your data can be off-premises, the cloud might mandate stronger data governance principles. Take the example of data privacy, where regulations mandate that:

  • You establish controls for cross-border exchange of data;
  • You manage policies for notification of data breaches, and establish key privacy principles such as data portability, retention policies, or the right to be forgotten;
  • You establish more rigorous practices for managing relationships with the vendors who process your personal data.

The cloud brings new challenges for your data governance practices, but it brings many opportunities as well. At Talend, we see that a majority of our customers are now selecting the cloud to establish their single source of trusted data. DRG is a great example of this.

Depending on your context, there is a good chance that the cloud is the perfect place to capture the footprint of all your data in a data landscape, further empowering all the stakeholders in your data-driven processes with ready-to-use applications to take control of and consume data.

Do: Be prepared to explain “data”

Employees often lack digital literacy. That’s one part of the problem. As data becomes more predominant in organizations, you should also consider that they often lack data literacy as well.

You’ll likely find some employees reluctant to learn how to use sophisticated tools. To combat this, use a data catalog to make your data more meaningful, connected to its business context, and easy to find. Leverage cloud-based apps such as Talend Data Prep or Data Stewardship so that employees can access data in a few clicks, without needing specific training before they can start.

Do: Prove the data value

As you move along with your data governance project, it’s highly likely that you will come across skeptical users. They will challenge you on your ability to control and solve their problems. 

You will need to prove to them that they will save resources and money by delivering trusted data. Start by taking a simple data sample like a Salesforce or Marketo dataset. Use data preparation tools to show how easy it is to remove duplicates and identify data quality issues. Show the recipe function that lets you effortlessly reproduce the prep work on other data sets. That’s data quality as a first step. Another quick win is to show them how easy it is to mask data with Talend Data Preparation.

Don’t: Expect executive sponsorship to be secured

Once you prove business value with small proofs of concept (POCs) and gain some support from the business, ask for a meeting with your executives. Then, present your plan to make data better for the entire organization.

Be clear and concise so that anybody can understand the value of your data governance project. Explain that they will gain visibility by endorsing you and hence, improve the entire organization’s efficiency. You will gain the confidence you need to have your project supported, and your work will get easier.

Do: Be hands-on

As you begin to meet with different people in the organization, listen to their challenges and offer your assistance. Make sure all your actions are genuinely effective. As the old saying goes, “You have to plan the work and work the plan.” Follow up and outline the next milestones of the project. You will confront obstacles and realigned priorities as your organization adapts to changing business conditions. Don’t give up; adapt your planning if needed, but keep convincing people and (re)explain how your project will overcome the company’s challenges.

Also, ensure your data governance is really connected with your data. Too many data governance programs have established policies, workflows, and procedures, but fail to connect with the actual data. For example, a Talend survey has shown that, of the 98% of companies surveyed that claim GDPR compliance in their legal notices, only 30% could deliver on that promise and fulfill data access requests when customers asked to exercise their right of access. This means that most companies have established strong governance principles but are failing to operationalize them.

Do: Practice your data challenges

Work your data governance framework through some practice scenarios. Let’s say you want to practice as if you had experienced an internal data breach or a data leak, to see if your framework holds up in a worst-case scenario. Consider running a team drill. Make up a breaking-news scenario and see how well your plan works, then use the lessons learned to improve it. As you go through and practice this scenario around your framework, ask yourself:

  • “Is all sensitive data properly masked?”
  • “Can I track and trace all of my data?”
  • “Do the data owners feel accountable for the data they’re responsible for?”

Put yourself in the shoes of a customer who wants to exercise their right to data access or their right to be forgotten.

It’s always better to be proactive than to experience a privacy incident for real, with all the consequences that entails. Practicing will also make data governance more concrete, turning it into operational challenges rather than high-level principles.

To know more about how to deliver data you can trust, don’t hesitate to download our definitive guide to data governance.

The post 6 Dos and Don’ts of Data Governance – Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

How to deploy Talend Jobs as Docker images to Amazon, Azure and Google Cloud registries


Since the release of Talend 7.1, users can build Talend Jobs as Docker images and publish them to Docker registries. In this blog post, I am going to run through the steps to publish to the major cloud providers’ container registries (AWS, Azure and Google Cloud). Before I dig into publishing container images to registries, I am going to remind you of the basics of building Talend Jobs as Docker images from Talend Studio, as well as point out the difference between a local build and a remote build.

Requirements

  • Talend Studio 7.1.1 or higher
  • Platform license
  • Docker software installed and accessible from Studio

What is a container registry?

First, let’s explain the concept of a container registry. For those of you familiar with this, feel free to skip ahead.

A container registry is basically a set of repositories for Docker images. This is where you store and distribute your Docker images for further use. Most registries also offer access control over who can view and download images, as well as CI/CD integration and vulnerability scanning.

Let’s take a look at the major Docker registries available:

  • DockerHub is the world’s largest library and community for container images. This is the in-house registry of Docker.
  • Amazon ECR is the Amazon Web Services registry.
  • Google GCR is the Google Cloud Platform registry.
  • Azure ACR is the Microsoft Azure registry.

How do I build a Talend Job as a Docker image?

Before we publish to a registry, let me remind you of the ways to perform a build of a Docker image in Talend Studio:

Local build

To build a job as a Docker Image:

  • Right-Click on a Job
  • Select “Build Job”.
  • Select “Docker Image” as build type and fill in the form with your own settings.

The Docker Host is the Docker daemon currently running on the machine where you want to build your image. You can either build the image on your local machine or on a remote host. In the example above, the Docker daemon is running on the same machine as the Studio, hence we use the local mode.

Remote Build

In most cases, Docker is installed on the machine where you run Talend Studio. However, you might want to build your Docker image on a remote host such as a local virtual machine or a virtual machine in the cloud. If this is the case, you need to select the remote mode.

In my case, I have a Windows laptop where I run a Linux virtual machine in VMware. I am more comfortable running Docker on a Linux machine. That is why, to build my image on my Linux VM, I need to select the remote mode and specify its IP address using the TCP protocol. We also need to open a port to be able to access the remote Docker daemon from outside; an example is shown below. Please refer to the Docker documentation for the details of enabling TCP socket access to Docker.
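
The Docker documentation remains the reference here, but as an illustration, one common way to expose the daemon on a systemd-based Linux VM is a drop-in override like the one below (port 2375 is unencrypted, so only do this on a trusted network or secure it with TLS):

# /etc/systemd/system/docker.service.d/override.conf -- example only
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H unix:///var/run/docker.sock -H tcp://0.0.0.0:2375

# Then reload the unit files and restart the daemon
sudo systemctl daemon-reload
sudo systemctl restart docker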

How to publish a Talend Docker Image to a Registry

With Talend Studio you can also build your Docker images and push them to a registry with only one action called publishing.

Dockerhub

Let’s start with DockerHub. To publish your Talend Job as an image into a Docker registry, right-click on your job and select Publish:

Compared to the build function you now have 3 more fields. As you are also pushing your image to a Docker registry you need to specify your registry and the credentials used to access it.

Registry: docker.io/<DOCKER_USERNAME>

Username: <DOCKER_USERNAME>

Password: <DOCKER_PASSWORD>

Amazon ECR

If you want to publish your Docker image into your own AWS account, you can use Amazon Elastic Container Registry (ECR), a private registry within your AWS account. With Amazon ECR you only get a single registry per account and per region. You also need to create the image repositories beforehand. So, in my case, I created the repository “talendjob” (a CLI example follows):
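
If you prefer the command line over the AWS console, the repository can also be created with the AWS CLI; the region below is just an example:

# Example: create the ECR repository used above (adjust the name and region to your setup)
aws ecr create-repository --repository-name talendjob --region eu-west-1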

Then you can fill the publish form as follows:

The image name will be the name you have given to your image repository and the registry will be the URI composed of your account number and region.

Image Name: Amazon ECR repository name

Registry: <AWS_ACCOUNT_NUMBER>.dkr.ecr.<REGION>.amazonaws.com

Username: <AWS_ACCESS_KEY>

Password: <AWS_SECRET_KEY>

Note that, unusually, the username and password here are AWS credentials. Talend Studio uses the Fabric8 Maven plugin and, as you can see in its documentation, a custom connection mechanism has been developed to ease authentication to Amazon ECR.

Azure ACR

Unlike Amazon ECR, you can create as many registries as you want in your Azure Portal. In my case, I created a registry called “tgourdel” (see the CLI example below):
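
If you would rather script this than use the portal, a registry with the admin account enabled can be created along these lines (the resource group name and SKU are placeholders):

# Example: create an ACR registry with the admin account enabled (adjust names and SKU)
az acr create --resource-group my-resource-group --name tgourdel --sku Basic --admin-enabled true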

You can get your registry credentials using the Azure CLI as follows:

Command:

$ az acr credential show --name tgourdel
{
  "passwords": [
    {
      "name": "password",
      "value": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    },
    {
      "name": "password2",
      "value": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    }
  ],
  "username": "tgourdel"
}

Then you can build and publish your job in Talend Studio:

Image Name: Name of your repository image (can be created on the fly)

Registry: <AZURE_REGISTRY_NAME>.azurecr.io

Username: <AZURE_ACR_USERNAME>

Password: <AZURE_ACR_PASSWORD>

Google GCR

On GCP, authentication is again slightly different from the other clouds. You need to use “oauth2accesstoken” as the username and an access token as the password. Use the gcloud CLI to get your token:

$ gcloud auth print-access-token
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Your registry address is linked to your GCP Project name:

Registry: gcr.io/<GCP_PROJECT_NAME>

(note that you can also use located registry DNS such as eu.gcr.io or us.gcr.io)

Finally, you can see your image in GCR:
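
You can also verify the push from the command line; something along these lines should list the image and its tags (the project and image names below are placeholders):

# Example: list images and tags pushed to your project's registry
gcloud container images list --repository=gcr.io/my-gcp-project
gcloud container images list-tags gcr.io/my-gcp-project/myjob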

Summary

As we have seen in this article, each cloud provider’s registry has its own authentication mechanism. That is why I have gathered all the needed information into the tables below, which I hope will help you publish to any cloud registry:

Build

  Mode   | Docker Host                                                                                   | Image Name                                | Image Tag
  Local  | local, if the Docker daemon is installed on the same machine as Talend Studio                | Created on the fly on your machine        | Any tag
  Remote | tcp://IP_ADDRESS_DOCKER_DAEMON:DOCKER_DAEMON_PORT, if the Docker daemon is on a remote machine | Created on the fly on the remote machine  | Any tag

Publish

  Registry type | Registry                                  | Username          | Password
  DockerHub     | docker.io/DOCKER_USER                     | DOCKER_USER       | DOCKER_PWD
  Amazon ECR    | AWS_ACCOUNT.dkr.ecr.REGION.amazonaws.com  | AWS_ACCESS_KEY    | AWS_SECRET_KEY
  Azure ACR     | AZ_REGISTRY.azurecr.io                    | AZURE_ACR_USER    | AZURE_ACR_PWD
  Google GCR    | gcr.io/GCP_PROJECT                        | oauth2accesstoken | GCLOUD_AUTH_ACCESS_TOKEN

If you are looking for information on integrating the publishing of your jobs to Docker registries into your CI/CD process, you can follow my previous blog article, where I show how to achieve this.

The post How to deploy Talend Jobs as Docker images to Amazon, Azure and Google Cloud registries appeared first on Talend Real-Time Open Source Data Integration Software.
