
Data Mapping Essentials in the GDPR Era


A recent survey shows that organizations are still working toward GDPR compliance in 2018, and for many the initial step is getting a full view of their entire information chain. It’s the first pillar of data management best practices that businesses need to put into place as they get ready for the new regulation.

To be fully prepared, organizations need to know where privacy-related data is today: where it comes from, where it goes, how it is processed, and who is consuming it. Given today’s increasingly demanding data management environment, navigating this data requires an increasingly sophisticated mapping capability.

To use an analogy, before the advent of sophisticated mobile computing, people bought paper-based maps when they visited a new region. It was the only way to find places they had never been before – but this kind of map was static. It quickly became outdated, as it lacked dynamic context – in other words, there was no way of gauging roadworks, traffic problems, newly-built roads, etc. Making a change required redoing the whole map.

There was a lack of transparency; for example, without any way to track and trace, passengers wouldn’t know if their taxi driver was taking the fastest route to their destination. The advent of GPS changed everything, giving travelers a more accurate and dynamic view of a region, with details of traffic, road and weather conditions constantly updated. Today, people often use GPS even on familiar trips because it can instantly update them on any issues they may encounter during their journey.

“Organisations will need to have a more precise view of their data. Not just where it is stored but also the overall context; a dynamic real-time view of where all the data is located.”

We see a similar shift taking place in data management today. In the past, for many businesses, there was no need for dynamic data mapping. A high-level, paper-based view of their landscape was sufficient. That’s changing today with the explosion of data, and the change is accelerating further as data privacy regulations such as GDPR come into force. Organizations now need a more precise and up-to-date view of their data: not just where it is stored, but also the overall context, with a dynamic, real-time view of where all the data is located. They need to provide transparency for the rights of the data subject, such as the right to be forgotten, the right of access, and the right to rectification.

Navigating In-depth Data Management

This explains the rationale behind metadata management. But from that 360° umbrella view, let’s now drill down to discover how the concept of in-depth data management applies in the new world. Once again, we find a recurring theme: drawing connections between disparate data sets is key to the kind of data management that GDPR demands.

One of the biggest impacts of GDPR is that businesses must take a more holistic view of their private data and its management. In the past, organizations managed privacy-related data, and sometimes processed opt-ins, but this was typically done in a specific context and limited to one department. If people working in the marketing department were responsible for managing a list of customers that potentially contained private data, they might have had to inform the local authorities about it. Equally, the HR department would take on exclusive responsibility for the privacy of employee data.

That’s all changed. Today, with GDPR in the offing, businesses need a comprehensive view of the private data they are managing. One business may know an individual in many different contexts. If that individual has bought products or services, the business knows them as a customer and stores their details in the CRM system. If they are also under contract, however, they will be in the financial system; if they have taken out a subscription, their details will be stored in the support system; and for digital products or services, such as connected objects in the Internet of Things, everything they do might be tracked somewhere.

This highlights the broader view of data that compliance with GDPR will require businesses to achieve. The emphasis can no longer be on a single department, such as marketing, for example, managing data for its own requirements. Instead, the focus must be on managing all the private data that relates to a customer or an employee across the entire enterprise. That’s clearly a complex undertaking, so how can businesses most effectively go about it?

To see how metadata management can bring clarity to your data landscape and support GDPR compliance, watch this online webinar with a demo.

Gaining a Holistic Data View

The first stage of the process is to create a full segmentation of your data, in other words a data taxonomy. At this point, the focus should be on creating a high-level view of the private data that needs to be managed. In the case of GDPR, that’s likely to be some data related to customers and some to employees. Drilling down into the latter, that’s likely to include information about their performance, salary, benefits and even health or family data. High-end business tools, such as a business glossary, might be needed to complete this task.
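To make this concrete, here is a minimal sketch of what such a taxonomy might look like, expressed as a simple Python dictionary. The category and field names are illustrative assumptions, not a prescribed GDPR structure:

# Illustrative GDPR data taxonomy; categories and fields are examples only
personal_data_taxonomy = {
    "customer": {
        "identity": ["passport_number", "date_of_birth", "gender"],
        "contact": ["email", "phone", "postal_address"],
    },
    "employee": {
        "identity": ["employee_id", "date_of_birth"],
        "compensation": ["salary", "benefits"],
        "performance": ["review_scores"],
        "health_and_family": ["sick_leave_records", "dependents"],
    },
}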

The next stage for the business is to assign responsibility for the different data areas. This involves deciding who takes care of employees’ health data, for example, or who looks after their performance details. In parallel, organizations can start to define the foundations of their approach to data policy, something which typically includes outlining their data retention strategy – how long they need to keep certain types of data before archiving or deleting it.

Businesses will, of course, also need to start drilling down into the data more. If they are dealing specifically with identity data, for example, they will need to identify and process all the critical data elements. In the case of identity, that may mean the passport number of the individual, their date of birth, gender, how many children they have and whether they are married, for example.

Once this whole process has been undertaken, the business will understand the datasets it needs to control in this context. It doesn’t necessarily know where all this data resides, but it does at least understand what information needs to be managed and what data will need to be considered when a customer asks for information to be changed or deleted. The business may also need to implement technology to connect to the data in order to maintain its quality and ensure it is kept consistently accurate and up-to-date.

Making the Right Connections

When it comes to connecting to the data, businesses will have to carry out a metadata management technique known as ‘stitching,’ which involves connecting the data element in question to the physical system that manages it. If the organization concerned is looking at identity data specifically, they should connect to the HR system, but maybe also to the payroll system. Beyond that, they might also need to consider that identity data will be in the recruitment system, because before the person in question became an employee, they were a candidate. They should also consider the travel and expense management system, which might hold sensitive information such as credit card numbers.

In order to ensure compliance, businesses will need to carry out the ‘stitching’ process referenced above. In other words, they will need to make a physical connection to the actual data that they are managing. Some tools are now coming out that enable this process to be carried out semi-automatically: taking the high-level definition we referenced earlier, they can map directly to the file if the attribute name is identical, or alternatively they can connect by creating correspondences that link the logical, high-level data to the physical data. This also means that when a candidate becomes an employee, a data integration project can be run that takes data about the candidate residing in the recruitment system and brings it into the HR system, effectively helping to draw the lineage between the different bits of data.
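To give a rough feel for what stitching produces, the sketch below links one logical data element to the physical systems and fields that hold it. The system, table, and column names are hypothetical, and real tools capture this mapping as metadata rather than code:

# Hypothetical stitching of a logical data element to its physical locations
stitching = {
    "employee.date_of_birth": [
        {"system": "HR",          "table": "EMPLOYEES",  "column": "BIRTH_DATE"},
        {"system": "Payroll",     "table": "PAYROLL",    "column": "DOB"},
        {"system": "Recruitment", "table": "CANDIDATES", "column": "DATE_OF_BIRTH"},
    ]
}

# Once the links exist, impact analysis becomes a simple lookup
def systems_holding(element):
    return [link["system"] for link in stitching.get(element, [])]

print(systems_holding("employee.date_of_birth"))  # ['HR', 'Payroll', 'Recruitment']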

Foundations in Place

At this point, the business has come a long way in its metadata management journey. It has developed the kind of dynamic mapping that we referenced in our earlier GPS analogy. All the finer-grained data elements have been defined and linked to all the systems that use them, and the dependency or relationship between each of the systems has been established. This solid mapping foundation makes it easier to make adjustments further down the line. If the business needs to change the format of its data in any way, using four digits for the year instead of two, for example, it is far easier to achieve. It can get answers to questions like: where does this data appear first? If I change it in the HR system, what is the impact elsewhere? Should I change the data integration job that takes the data from the recruitment system, or should I just propagate the four-digit year down to the HR application?

The same principles apply to data masking. The organization can leverage its mapping and data integration capability to start applying guidelines to the data. It might want to disguise the exact birth date of a given individual within the system, for example, or, to avoid any segregation of younger and older candidates in the recruitment system, mask the date of birth information completely.
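As a minimal sketch of such a masking rule (assuming a plain Python date as input rather than any particular masking tool), a date of birth could be coarsened to the year only, or hidden entirely:

from datetime import date

def mask_birth_date(birth_date, mode="year_only"):
    """Illustrative masking: keep only the year, or hide the value completely."""
    if mode == "year_only":
        return f"{birth_date.year}-XX-XX"  # coarsen the exact date to its year
    return None                            # full masking: expose no date of birth at all

print(mask_birth_date(date(1985, 6, 14)))          # 1985-XX-XX
print(mask_birth_date(date(1985, 6, 14), "full"))  # None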

As we have seen then, good metadata management is about having a dynamic view of the data. So, to use the GPS analogy once again, you need to be able to see the route to your customer, roughly where their offices are and how long it will take for you to drive there. But you also need to be able to act whenever an exception occurs.  Metadata management is not simply about mapping and visualizing the data; it is also about knowing how to act when there is a problem, and it’s about helping to guide that action.  Today, after all, the latest GPS systems don’t just tell you that there is a traffic jam, they also suggest another route to take. That’s the same kind of benefit that the business can attain with metadata management – whenever a change or a new regulation is introduced the metadata management tool should guide the business to apply the right action to its data.

Technology Whose Time Has Come

In the past, despite the rapid growth in data volumes affecting multiple industry sectors, the market for metadata management remained largely restricted to banking, financial services and other highly regulated industries. The advent of GDPR and the demands it places on companies of all sizes and all types has raised metadata management up the priority list for all businesses.

What this means will vary from company to company. Some businesses will use existing software to document their data and then focus on keeping records accurate and up-to-date by evolving their systems over time. However, as time goes by and the importance of this approach becomes increasingly clear, more and more companies will opt to commit themselves fully to a metadata management approach and to the growing portfolio of technologies being brought in to support it.

To see how you can operationalize your data governance with Metadata Management, MDM and Data Quality, watch this online webinar with a demo.



Adopting a Multi-Cloud Strategy: Challenges vs. Benefits


Cloud-based platforms have become a standard component within most enterprise IT infrastructures; however, many organizations find they cannot source everything they need from a single provider. As a result, increasing numbers are adopting a multi-cloud strategy in an effort to better meet their business requirements.

According to the Voice of the Enterprise (VotE) Digital Pulse survey[1] produced by analyst company 451 Research, 60 percent of enterprises will run the majority of their IT outside the confines of enterprise datacentres by the end of 2019. Meanwhile, in the Asia-Pacific region, research company IDC predicts that more than 70 percent of enterprises will have a multi-cloud strategy by the end of this year[2].

Taking such an approach can deliver significant benefits, but it also creates a range of challenges. Carefully assessing both is important before a multi-cloud strategy is implemented. 

The Multi-Cloud Journey

The task of designing and building a multi-cloud infrastructure will be different for every organization. For most, the process will involve linking legacy, on-premise equipment, applications and data stores with a range of resources and services provided by external cloud providers.

Alternatives that will need to be evaluated include the growing range of Infrastructure-as-a-Service (IaaS) providers that offer raw compute and storage capacity on which an organization can deploy and run its applications and data stores.

Another alternative is to make use of a Platform-as-a-Service (PaaS) provider. In this scenario, the organization is offered a suite of managed computing resources that run within an external data center. These resources can be used to replace or extend the IT capabilities that the organization has in-house.

A third option is to adopt one or more Software-as-a-Service (SaaS) offerings. SaaS providers deliver managed applications that are used as needed by the client organization. These could be anything from a simple hosted desktop productivity tool to a global sales team management suite.

The Benefits

Regardless of the options that are chosen, it’s likely an organization will end up with a multi-cloud infrastructure in which different IT components and services are delivered by different cloud providers.

One of the biggest benefits this provides is flexibility. The most appropriate resources can be sourced and deployed in such a way that they precisely match the organization’s particular requirements. As a result, it will find itself much more able to deal with changes in demand, scaling IT capacity up and down as needed.

A multi-cloud strategy can also help an organization shift some applications and data stores to an external service provider while retaining core systems within its existing data center. Termed a hybrid-cloud approach, this can help to reduce operational costs as well as the need for large capital investments as requirements grow.

Such a strategy can also be a precursor to the adoption of a cloud-only strategy. Applications and data can be gradually migrated over time, rather than requiring a ‘big bang’ approach that could be seen as too risky.

This approach can safeguard you from vendor lock-in, and far more importantly, ensure you won’t get locked out of leveraging the unique strengths and future innovations of each cloud provider as they continue to evolve at a breakneck pace in the years to come.

The Challenges

The benefits are significant, but it’s important not to overlook the challenges that might arise from adopting this strategy.

The first is the effective assessment of the cloud providers themselves. The organization must be sure the data centers from which services will be provided are well managed, secure, and provide the levels of performance that are expected.  Such assessment can be time-consuming and require a technical understanding of how the various platforms can be linked together.

Another challenge is complexity. Rather than needing to manage only in-house resources, an organization will find itself having to deal with multiple external parties. This can cause challenges if problems arise and finger pointing occurs between the various providers.

There may also be challenges around the data networks needed to link chosen cloud platforms with internal IT resources. Slow or unreliable links will have a significant and detrimental impact on application performance and cause flow-on problems for users. An organization should carefully evaluate the network links offered by cloud providers and ensure redundant alternatives are in place should they be needed.

A final challenge is likely to come in the form of compliance and governance. Once corporate data is stored on external cloud platforms, rather than within an on-premise data center, ensuring it remains secure becomes more complex.

This issue is particularly relevant when you consider the way in which organizations are bound by Australia’s mandatory data breach notification laws and the European Union’s General Data Protection Regulation (GDPR). Both will require organizations to have constant oversight of their data stores and be responsible for ensuring effective security is in place at all times.

Enterprise IT decision makers and architects alike are increasingly adopting multi-cloud strategies as they look to increase use of existing IT infrastructures, deploy new capabilities at scale, reduce costs, streamline resources and avoid cloud vendor lock-in.

A multi-cloud strategy can deliver significant business benefits as long as thorough planning and evaluation are completed prior to adoption. By carefully considering all factors, an organization can enjoy the benefits offered by the strategy while keeping challenges and risks to a minimum.

[1] https://451research.com/services/customer-insight/voice-of-the-enterprise?&utm_campaign=2018_digpulse_pr&utm_source=press_release&utm_medium=press&utm_content=product_interest&utm_term=2018_01_digpulse_pr

[2] https://www.idc.com/getdoc.jsp?containerId=prAP43008417


Office Depot Stitches Together the Customer Journey Across Multiple Touchpoints


In January 2017, the AURELIUS Group (Germany) acquired the European operations of Office Depot, creating Office Depot Europe. Today, Office Depot Europe is the leading reseller of workplace products and services, with customers in 14 countries throughout Europe, selling everything from paper, pens and flip charts to office furniture and computers.

Centralizing Data to Respond to Retail Challenges

Traditionally, Office Depot’s European sales were primarily sourced through an offline, mail-order catalog model driven by telemarketing activities. The company has since moved to a hybrid retail model, combining offline and online shopping, which required a data consolidation strategy that optimized the different channels. Additionally, the company’s myriad backend systems and disparate supply chain data collected from across Europe had become difficult to analyze.

Using Talend, Office Depot can now ingest data from its vast collection of operational systems. The architecture includes an on-premises Hadoop cluster using Hortonworks, with Talend Data Integration and Data Quality performing checks and quality control on data before ingesting it into the data hub’s data lake.

Powering Use Cases from Supply Chain to Finance

Integrating online and offline data results in a unified, 360-degree view of the customer and a clear picture of the customer journey. Office Depot can now create more specific audience segments based on how customers prefer to buy, and tailor strategies to reach the most valuable consumers whether they buy online or in-store. It can compare different offline customer experiences to see how they are influenced by digital ads. Customer service operators have complete information on a customer, so they can talk to customers already knowing their details.

Office Depot’s data hub approach also provides high-quality data to all back-office functions throughout the organization, including supply chain and finance. Office Depot can now integrate data from the range of supply chain back-end systems in use in various countries, and answer questions such as which distribution center has the most efficient pick-line and why; or which center is in the risky position of having the least amount of stock for the best-selling products.


Why Data Scientists Love Python (And How to Use it with Talend)


In the last few years, Python has become the go-to programming language for Data Scientists. In a way, this is kind of surprising. Python was not originally developed for analytical tasks or data science, but it has evolved to become the ‘Swiss-army’ tool in the Data Scientist’s toolbox. The reason for this is the large number of third-party packages available to Data Scientists. For example, there is ‘Pandas’ for manipulation of heterogeneous and labeled data, ‘SciPy’ for common scientific computing tasks, ‘Matplotlib’ for visualizations, ‘NumPy’ for the manipulation of array-based data, and many, many others.

Why Data Scientists <3 Python

Nowadays Python is used for everything from data handling to visualization to web development. It has become one of the most important and most popular open source programming languages in use today. Many people think of it as a new language, but it is older than both Java and R: Python was created by Guido van Rossum of the Dutch CWI research institute in 1989. One of its main strengths is how easily it can be extended, along with its support for multiple platforms. Python’s ability to work with many different file formats and libraries makes it very useful and is the main reason Data Scientists use it today.

For programmers, Python is not a difficult language to learn. In fact, most experienced programmers regard Python as an easy language to learn. Many now even recommend Python as the first language anyone should learn, which says a lot. The syntax of the language itself is very easy to pick up. Write a ‘Hello World’ program in any language: Java and C take no fewer than three lines of code, whereas Python takes just one. Now, it’s not all quite that easy; learning how to use libraries, for example, takes time. But it’s an easy language to start coding with, easier than most.
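For example, the entire program in Python is a single line, whereas the Java equivalent needs a class declaration and a main method wrapped around the same statement:

print("Hello World")  # a complete Python program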

Talend & Python

This year, Talend introduced a new, cloud-first app called Talend Data Streams. With Data Streams, everything is a “stream”, like a flow; even batch processing is a stream that is time-bound. This means we have one architecture for both batch and real-time stream processing. Data Streams has a live preview, so developers know their design is right every step along the way. When they drop the final target connector on the canvas, they can instantly see that their design is complete. Data Quality relies on complex mathematics to solve the problems of data deduplication, matching, and standardization, and Data Streams is designed to let anyone easily add snippets of Python using an embedded code editor that provides code auto-completion as well as intuitive syntax highlighting. We want to put the power of Python in the user’s hands.
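As an illustration of the kind of row-level snippet you might embed, here is a generic Python standardization function. It is written as plain Python rather than against the exact variables the Data Streams editor exposes, and the record structure and field names are assumptions for the example:

# Generic standardization logic of the sort you might embed as a Python snippet
def standardize(record):
    record["email"] = record.get("email", "").strip().lower()
    country = record.get("country", "").strip().lower()
    record["country"] = {"us": "United States", "uk": "United Kingdom"}.get(country, record.get("country"))
    return record

print(standardize({"email": "  Jane.Doe@Example.COM ", "country": "US"}))
# {'email': 'jane.doe@example.com', 'country': 'United States'}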

Now, sometimes it’s just easier to code, and we developers often go straight to it, depending upon the user and the task at hand. And this is where Python comes on board. Talend Data Streams has native support for Python built in. So we at Talend are investing in Python. We think it offers great functionality as well as ease of programming. We invite you to give Talend Data Streams a try and see how you can easily extend your data pipelines with embedded Python coding components.


Key considerations for choosing a cloud integration (iPaaS) solution


Digital Transformation, the application of digital capabilities across the organization to uncover new monetization opportunities, is the path that any company wishing to survive in today’s world must follow. Sectors like agriculture, healthcare, banking, retail, and transportation are exploring challenges and opportunities that have come with the digital revolution. As a result of this, new business models have emerged while IT departments have become the focus of digital transformation.

Today, SaaS and PaaS applications are being combined with social, mobile, web, big data, and IoT. With this, an organization’s ability to integrate legacy data from siloed on-premises applications with new data from emerging digital technologies like the cloud has become a deciding factor for a successful digital transformation.

The reality is that many organizations still use hand-coding or ad hoc integration tools to solve immediate and specific needs, often leading to an integration nightmare. They suffer the consequences of uncontrolled access, potential risks in regulatory compliance or audit practices, an inability to scale organically with the surge in data volume, and so on. Organizations need an enterprise strategy to address the changes brought on by the hybrid era of cloud and on-premises integration scenarios; through this, they are able to unlock the true value of their data to accelerate digital transformation. A key enabling component of such a strategy is a cloud integration platform (iPaaS).

iPaaS has started to go mainstream in recent years. Research firm MarketsandMarkets predicts that the market will be worth US$2,998.3 million by 2021, with 41.5% market growth between 2016 and 2021.

So what exactly is an iPaaS?

iPaaS (integration platform-as-a-service), in simplest terms, is a cloud-based platform that supports any combination of on-premises, cloud data, and application integration scenarios. With an iPaaS, users can develop, deploy, execute, and govern cloud and hybrid integration flows without installing or managing any hardware or middleware. There are many benefits iPaaS offers: faster time to value, reduced TCO, accelerated time to market, bridging the skillset gap, reduced DevOps headaches, etc.

However, if your organization has integration needs or challenges, how do you pick the right iPaaS? I’ll explore a few key questions to ask when choosing the right solution.

Key areas for considerations when choosing an iPaaS

Does it support Big Data and Data Lake?

If you are embarking on an initiative to utilize your big data, you will need a solution that can handle the volume, velocity, and variety requirements of big data integration use cases. The solution should be able to support a data lake with complex data ingestion and transformation, and work well with the Hadoop ecosystem.

Does it support broad hybrid integration scenarios?

Although SaaS and cloud apps are widely adopted across organizations, on-premises apps are not likely to go away anytime soon. Because hybrid integration scenarios are still the norm, it’s important to have a solution that supports cloud-to-cloud, cloud-to-ground, ground-to-ground, multi-point, and multi-cloud integrations.

Can it empower my LoB and analyst teams in the project?

In most cases, an integration project is initiated to better serve business units: enabling new business models, serving customers better, making marketing and sales strategies more effective, and so on. LoBs are increasingly involved in integration work, and therefore, to get your projects moving forward quickly, you need a solution with an easy-to-use UI and self-service capabilities like data preparation and data governance that don’t require extensive technical training.

Will multiple teams collaborate on the integration project?

If your integration projects require multiple teams, you may want to consider SDLC (software development life cycle) capabilities. This will allow you to create separate environments for each stage of SDLC like development, QA, production, etc. So your teams can plan and execute integrations more frequently and efficiently.

How do I ensure real-time data delivery?

Business agility, real-time decision making, and customer and partner interactions depend on analytics from data delivered in real time. To make fast and timely decisions, you will want a solution that delivers data anytime, with both batch and real-time streaming data processing.

What about data quality?

A critical capability that you also want in an iPaaS is built-in data quality that allows you to create clean and trusted data for analytics and decision making. You will want a solution that can handle data of different types and formats and ensure consistent quality throughout your data’s lifecycle.

Talend Cloud is Talend’s iPaaS, offering broad connectivity, native big data support, built-in data quality, enterprise SDLC support, and much more. If you are interested in jumpstarting your integration projects with Talend Cloud, sign up for a 30-day free trial.


Don’t Let your Data Lake become a Data Swamp


In an always-on, competitive business environment, organizations are looking to gain an edge through digital transformation. Subsequently, many companies feel a sense of urgency to transform across all areas of their enterprise—from manufacturing to business operations—in the constant pursuit of continuous innovation and process efficiency.

Data is at the heart of all these digital transformation projects. It is the critical component that helps generate smarter, better decision-making by empowering business users to eliminate gut feelings, unclear hypotheses, and false assumptions. As a result, many organizations believe building a massive data lake is the ‘silver bullet’ for delivering real-time business insights. In fact, according to a survey by CIO review from IDG, 75 percent of business leaders believe their future success will be driven by their organization’s ability to make the most of their information assets. However, only four percent of these organizations said they are set up for a data-driven approach that successfully benefits from their information.

Is your Data Lake becoming more of a hindrance than an enabler?

The reality is that all these new initiatives and technologies come with a unique set of generated data, which creates additional complexity in the decision-making process. To cope with the growing volume and complexity of data and alleviate IT pressure, some are migrating to the cloud.

But this transition, in turn, creates other issues. For example, once data is made more broadly available via the cloud, more employees want access to that information. Growing numbers and varieties of business roles are looking to extract value from increasingly diverse data sets, faster than ever, putting pressure on IT organizations to deliver real-time data access that serves the diverse needs of business users looking to apply real-time analytics to their everyday jobs. However, it’s not just about better analytics: business users also frequently want tools that allow them to prepare, share, and manage data.

To minimize tension and friction between IT and business departments, moving raw data to one place where everybody can access it sounded like a good move. The concept of the data lake, a term coined by James Dixon in 2010, envisioned a large body of raw data in a more natural state, where different users come to examine it, delve into it, or extract samples from it. However, organizations are increasingly beginning to realize that all the time and effort spent building massive data lakes has frequently made things worse due to poor data governance and management, resulting in the formation of so-called “Data Swamps”.

Bad data clogging up the machinery

In the same way that data warehouses failed to manage data analytics a decade ago, data lakes will undoubtedly become “Data Swamps” if companies don’t manage them in the correct way. Putting all your data in a single place won’t, in and of itself, solve a broader data access problem. Leaving data uncontrolled, unenriched, unqualified, and unmanaged will dramatically hamper the benefits of a data lake, as it will still only be usable by a limited number of experts with a unique set of skills.

A successful system of real-time business insights starts with a system of trust. To illustrate the negative impact of bad data and bad governance, let’s take a look at Dieselgate. The Dieselgate emissions scandal highlighted the difference between real-world and official air pollutant emissions data. In this case, the issue was not a problem of data quality, but of ethics, since some car manufacturers misled the measurement system by injecting fake data. This resulted in fines for car manufacturers exceeding tens of billions of dollars and consumers losing faith in the industry. After all, how can consumers trust the performance of cars now that they know the system of measure has been intentionally tampered with?

The takeaway in the context of an enterprise data lake is that its value will depend on the level of trust employees have in the data contained in the lake. Failing to control data accuracy and quality within the lake will create mistrust amongst employees, seed doubt about the competency of IT, and jeopardize the whole data value chain, which then negatively impacts overall company performance.

A cloud data warehouse to deliver trusted insights for the masses

Leading firms believe governed cloud data lakes represent an adequate solution for overcoming some of these more traditional data lake stumbling blocks. The following four-step approach helps modernize the cloud data warehouse while providing better insight across the entire organization.

  1. Unite all data sources and reconcile them: Make sure the organization has the capacity to integrate a wide array of data sources, formats and sizes. Storing a wide variety of data in one place is the first step, but it’s not enough. Bridging data pipelines and reconciling them is another way to gain the capacity to manage insights. Verify the company has a cloud-enabled data management platform combining rich integration capabilities and cloud elasticity to process high data volumes at a reasonable price.
  2. Accelerate trusted insights to the masses: Efficiently manage data with cloud data integration solutions that help prepare, profile, cleanse, and mask data while monitoring data quality over time regardless of file format and size.  When coupled with cloud data warehouse capabilities, data integration can enable companies to create trusted data for access, reporting, and analytics in a fraction of the time and cost of traditional data warehouses. 
  3. Collaborative data governance to the rescue: The old schema of a data value chain where data is produced solely by IT in data warehouses and consumed by business users is no longer valid.  Now everyone wants to create content, add context, enrich data, and share it with others. Take the example of the internet and a knowledge platform such as Wikipedia where everybody can contribute, moderate and create new entries in the encyclopedia. In the same way Wikipedia established collaborative governance, companies should instill a collaborative governance in their organization by delegating the appropriate role-based, authority or access rights to citizen data scientists, line-of-business experts, and data analysts.
  4. Democratize data access and encourage users to be part of the Data Value Chain: Without making people accountable for what they’re doing, analyzing, and operating, there is little chance that organizations will succeed in implementing the right data strategy across business lines. Thus, you need to build a continuous Data Value Chain where business users contribute, share, and enrich the data flow in combination with a cloud data warehouse multi-cluster architecture that will accelerate data usage by load balancing data processing across diverse audiences.

In summary, think of data as the next strategic asset. Right now, it’s more like a hidden treasure at the bottom of many companies. Once modernized, shared and processed, data will reveal its true value, delivering better and faster insights to help companies get ahead of the competition.


Build an Enterprise Level Password Vault with Talend and CyberArk


Major data breaches are becoming more common. Big data breaches may include password data (hopefully hashed), potentially allowing attackers to login to and steal data from our other accounts or worse.

The majority of people use very weak passwords and reuse them. A password manager assists in generating and retrieving complex passwords, potentially storing such passwords in an encrypted database or calculating them on demand.

In this article, we are going to show how easy it is for Talend to integrate with enterprise password vaults like CyberArk. By doing this, no developer has direct access to sensitive passwords, Talend does not need to know the password at compile (design) time, and password management (changing/updating passwords) is done outside of Talend. This saves administrators time as well as improves the security of the overall environment.

An Introduction to Password Security with CyberArk

CyberArk is an information security company offering Privileged Account Security, which is designed to discover, secure, rotate, and control access to privileged account passwords used to access systems throughout the enterprise IT environment.

Create a CyberArk Safe

To get started, we first need to build our safe. A safe is a container for storing passwords. Safes are typically created based on who will need access to the privileged accounts and whose passwords will be stored within the safe.

For instance, you might create a safe for a business unit or for a group of administrators. The safes are collectively referred to as the vault. Here’s a step-by-step guide on how to do that within CyberArk.

Login to CyberArk with your credentials.

Navigate to Policies -> Access Controls(Safes) -> Click on “Add Safe”.

Create a safe.

You will need a CyberArk safe to store objects.

Creating the Application

Next, we need to create the application we will use in order to retrieve the credential from the vault. Applications connect to CyberArk using their application ID and application characteristics. Application characteristics are additional authentication factors for the applications created in CyberArk.

Each application should have a unique application ID (application name). We can change it later, but it may also require code change on the client side where this application is used to get the password. In our case, it’s Talend Code. Here’s how to do this. 

Navigate to Applications and click on “Add Application”.

Now we have the CyberArk account and application ID. Next, give the application ID permission to retrieve credentials. Click on the Allowed Machines tab and enter the IPs of the servers from which Talend will retrieve credentials.

Access to Applications from Safe

Navigate to Policies -> Access Controls (Safes) -> select your safe and click on Members.

Click on Add Member, search for your application, select it, check the appropriate access rights, and add it.

Now the next step is to install the credential provider in the development environment.

Installing the Credential Provider (CP)

In order to retrieve credentials, we need to install a CyberArk module on the same box where your application (the client) is running. This delivers a Java API that your application calls; the credential provider then talks to the CyberArk vault over CyberArk’s own proprietary protocol, retrieves the credentials you need, and delivers them back to your application through the Java API.

You will need to login to the CyberArk Support Vault in order to download the Credential Provider.

Retrieving the Password from Talend Using the Java API

Last but not least, we need to build the password retrieval mechanism with Talend. Create a Talend job with a tLibraryLoad and a tJavaFlex component.

Configure tLibraryLoad with the path to “JavaPasswordSDK.jar”. This makes sure that “JavaPasswordSDK.jar” is added to the classpath by Talend during compilation.

On the tJavaFlex, navigate to the Advanced settings tab and make sure you import the necessary CyberArk SDK classes for the implementation.

In the Basic settings of the tJavaFlex, the code below calls CyberArk using the Java API.

try {
    // Build the request describing which credential to fetch
    PSDKPasswordRequest passRequest = new PSDKPasswordRequest();
    PSDKPassword password = null;

    passRequest.setAppID("Talend_FS");   // application ID created in CyberArk
    passRequest.setSafe("Test");         // safe that stores the credential object
    passRequest.setFolder("root");
    passRequest.setObject("Operating System-UnixSSH-myserver.mydomain.com-root");
    passRequest.setReason("This is a demo job for password retrieval");

    // Send the request to get the password
    password = javapasswordsdk.PasswordSDK.getPassword(passRequest);

    // Analyze the response
    System.out.println("The password UserName is : " + password.getUserName());
    System.out.println("The password Address is : " + password.getAddress());
    context.dummy_password = password.getContent();
    System.out.println("password retrieved from cyberark's vault is -- context.dummy_password ==>> " + context.dummy_password);
}
catch (PSDKException ex) {
    System.out.println(ex.toString());
}

Save the job and execute it.

Conclusion

The CyberArk Application Identity Manager™, integrated with Talend, provides secured credentials for Talend to conduct in-depth data reporting and analytics. 

Below are a few key benefits that an enterprise would get by leveraging CyberArk with Talend:

  • Eliminating hard-coded credentials
  • Securely storing and rotating application credentials
  • Authenticating applications
  • Delivering enterprise-level scaling and availability

Above all, Talend integrates seamlessly with CyberArk and helps customers leverage all the benefits provided by CyberArk to build an enterprise-level password vault.


Where does GDPR sit in finance’s regulatory puzzle?


For any financial service organization, failure to comply with regulations is front page news, which can majorly impact brand reputation, customer loyalty, and the bottom line.

The drive for greater transparency over customers’ finance data has led to a number of regulations and legal standards such as PSD2, Open Banking and, most recently, GDPR being introduced to the mix. In this article, I will discuss how we should view regulations as an opportunity rather than a barrier to innovation.

The regulatory minefield known as 2018…

This year has been a milestone one for regulatory changes in financial services. Open Banking launched in January 2018 with more of a whimper than a bang. One possible explanation for this was a reluctance to cause a panic among consumers. Research by Ipsos MORI found that while almost two thirds (63%) of UK consumers see the services enabled by Open Banking as ‘unique’, just 13% of them would be comfortable allowing third parties to access their bank data. These figures are likely to have been impacted by high-profile breaches affecting the finance industry, which soured attitudes towards data protection policies.

Open Banking is built on the second Payment Services Directive, more commonly known as PSD2. Despite its fame being somewhat dwarfed by that of the General Data Protection Regulation (GDPR), PSD2 is a data revolution in the banking industry across Europe. By opening up banks’ APIs to third-parties, consumers will be able to take advantage of smoother transactions, innovative new services and greater transparency in terms of fees and surcharges. In the UK, this is partly enabled through the Competition and Markets Authority’s (CMA) requirement for the largest current account providers to implement Open Banking.

Creating these experiences for consumers requires APIs which seamlessly draw together information from multiple datasets and sources. Enter GDPR, which has tightened the controls consumers have over their data and introduced greater financial ramifications for companies and organizations that do not adhere to it: the penalty for non-compliance is €20,000,000 or 4% of global revenue, whichever is higher. One of the fundamental principles of GDPR compliance is providing greater transparency over where personal data is and how it is being used at all times.

PSD2 and Open Banking align with this because it is the consumer that has the control over whether their data is shared with third parties, as well as the power to stop it being shared. In addition, the concept of the ‘right-to-be-forgotten’ enshrined in GDPR means that consumers can demand that any data held by the third-party service provider be permanently deleted. Similarly, because GDPR puts the onus of data protection on both data controllers (i.e., banks) and data processors (i.e., PISPs and AISPs) it is in the interests of both to ensure that their data governance strategies and technology are fit for purpose. As has been pointed out by Deloitte and Accenture, there might be contradictions within these regulations, but the overriding message is that transparency and consent are key for banks who need good quality data to provide more innovative services.

Regulating the world’s most valuable commodity

Having untangled the web of data regulations facing the finance industry, we must remember that with the rise of big data, the cloud, and analytics based on machine learning, data is no longer something which clogs up your internal systems until it needs to be disposed of. Data is the world’s most valuable commodity – the rocket fuel that has powered the rise of Internet giants like Facebook, hyperscalers like AWS, and industry disruptors like Uber. To the finance industry, data is a matter of boom or bust, and given the vital role they play in society, consumers and businesses need banks to have data. This is why banks must take a proactive view towards data governance and treat it as an opportunity rather than a necessary evil.

EY’s 2018 annual banking regulatory outlook stresses the importance of banks staying on the front foot when it comes to regulatory compliance. It lists five key actions: achieving good governance, creating a culture of compliance, exerting command over data, investing in the ability to analyze data, and developing strategic partnerships. As these key points suggest, a proactive view of data governance does not stop at compliance. It’s about creating a virtuous cycle in which data is analyzed and the insight gleaned from that analysis is turned into services which customers appreciate. This will make customers want to share their data, as they can see the hyper-personalized and customized services which they get as a result.

As a rule of thumb, the more information you give your bank, the more personalized the service it can provide. This is true across an entire range of services, such as calculating credit ratings and advising on savings and borrowing. However, this scenario works both ways, and regulations such as Open Banking, PSD2, and GDPR put the power firmly in consumers’ hands. So, the more data organizations ask for, the higher the expectation of personalized services from customers. Customers need to see what their data is being used for, so transparency is key if financial firms are to build and maintain trust with customers. Furthermore, to offer highly personalized products and services based on complex analysis of big data, organizations should already know where data is stored and how it is being used.

Data-driven finance

In summary, data protection regulations such as Open Banking, PSD2, and GDPR must be viewed as opportunities for financial services organizations to re-establish trust with consumers, which may have been eroded by high-profile data breaches in 2017. In a way, this brings us back to the basics of what financial services are all about: being a steward of people’s assets. “When it comes to customer trust, financial leaders shouldn’t wait on regulators to keep their companies in check.”

Understanding where data is and that it is managed correctly is not only fundamental to regulatory compliance and customer trust, but also to providing the highly personalized and predictive services that customers crave. Therefore, the requirements of the regulation are by no means at odds with the strategies of data-driven finance firms, but in actual fact perfectly aligned.



Conducting Effective Talend Job Design Reviews – A Primer


A common practice in any development team is to conduct code reviews, or at least it should be.  This is a process where multiple developers inspect written code and discuss its design, implementation, and structure to increase quality and accuracy.  Whether you subscribe to the notion of formal reviews or a more lightweight method (such as pair programming), code reviews have proven to be effective at finding defects and/or insufficiencies before they hit production.

In addition, code reviews can help ensure teams are following established best practices. The collaboration can also help identify new best practices along the way! Not only that, regular code reviews allow for a level of information sharing and give every developer an opportunity to learn from the others. This is especially true for the more junior developers, but even senior developers learn a thing or two from this process.

While you aren’t writing actual code with Talend (as it is a code generator), developing jobs does share many characteristics with line-by-line coding. After all, Talend is a very flexible platform that allows developers to build jobs in many ways. All the benefits of code reviews still apply, even if you are only reviewing job designs, settings, and orchestration flow.

The “Why” of Talend Job Reviews

If you had to summarize the goals of doing Talend job reviews in a single word, I would say it is ‘Quality’. Quality in the tactical sense that your reviewed jobs will probably perform better, tend to have fewer defects, and likely be much easier to maintain over time. Quality in the more strategic sense that, over time, these job reviews will naturally improve your developers’ skills and hone best practices, so that future jobs also perform better, have fewer defects, and are easier to maintain even before hitting job reviews!

Have I sold you on Talend job design reviews yet? Attitudes vary on this; after all, there is a certain amount of ego involved with the jobs that developers build… and that’s OK. It’s important, however, to focus on the positive: treat reviews as an opportunity to learn from each other and to improve everyone’s skills. It’s important to always be thoughtful and respectful toward the developers involved during the process. Depending on your team culture, paired reviews may work better than formal full-team reviews. I do recommend meeting face-to-face, however, and not creating offline review processes, as there is a lot to be gained by collaborating during these reviews. Be pragmatic. Solicit the team for input into how to improve your code review process.

Quantitative & Qualitative

When reviewing Talend jobs, it’s important to think both Quantitatively and Qualitatively.

Qualitatively speaking, you want to adapt and adopt best practices. If you haven’t read Talend’s recommended best practices, then I highly recommend doing so. Currently, the four-part blog series can be found on the Talend blog.

In these documented best practices, we discuss some qualities of good job design, with recommendations on how to make your jobs easy to read, easy to write, and easy to maintain. In addition, you’ll find other foundational precepts for building the best possible jobs. Consider these:

  • Functionality
  • Reusability
  • Scalability
  • Consistency
  • Performance
  • And several others

Choosing and balancing these precepts is key!

You might also find interesting the follow-on series (2 parts so far) on successful methodologies and actual job design patterns.

I think you’ll find all these blogs worth the read.

OK, great, now how should we quantitatively review jobs? In code review projects, there are often metrics tools that teams may use to assess the complexity and maintainability of your code. Did you know that Talend has an audit tool that covers much of this? You’ll find it in the Talend Administration Center (TAC).

Here you will find all sorts of really great information on your jobs!  This is a very often overlooked feature of TAC, so be sure to check it out. I recommend reading our Project Audit User Guide.

Talend Project Audit provides several functions for auditing a project through investigating different elements in Jobs designed in a Studio. Talend Project Audit reviews:

  • Degree of difficulty of Jobs and components used in jobs
  • Job error management
  • Documentation and Job versioning
  • Usage of metadata items in Job designs
  • Layout-related issues in Job and subjob graphical designs
  • Job analysis

I’ve been asked in the past whether it is possible to use standard code metrics tools with Talend. The answer is yes, although with some caveats. Code metrics tools can be run after the code generation phase of the build, so you could perhaps create a Jenkins job for this or inject the metrics right into your standard builds. It’s likely that a complex job in Talend would generate more complex code.

However, keep in mind that you are maintaining this code through a graphical user interface with prebuilt and reusable components. The complexity and maintainability metrics are fine for comparing jobs with each other; however, it is not always an apples-to-apples comparison with hand-coded programs. For that reason, I generally recommend using our Audit functionality.

7 Best Practice Guidelines for Reviewing Jobs

Now that I’ve (hopefully) convinced you of the concept of Talend Job reviews, I want to conclude with some best practices and guidelines as you start to put this concept into practice. 

  1. Capture metrics and when you find areas of jobs to change, count them.
    • Use t-shirt size classification.
    • It’s important to try and quantify the impact of your review sessions.
    • Track your defects and see if your review efforts are paying off with fewer defects making it past the review process. Capture how long your reviews take.
  2. Don’t review too many jobs at one time.
    • In fact, don’t even review too much of the same job at one time if it is complex.
    • Take your time to understand the job design and patterns.
  3. Keep your jobs well described and annotated to ease the review process.
  4. Label your components, data flows and subjobs.
  5. Document poor patterns and create a watch list for future job reviews.
  6. Create and share your discovered best practices.
  7. Build a feedback mechanism into your process so that recommended changes and defect resolutions can be implemented on schedule.
    • Talend job reviews should be part of your development process, and should not be rushed in as an afterthought.

Above all, keep it positive! Until next time. 


5 Hidden Gems in Talend Spring ’18 (Talend 7.0)


Talend Spring ’18 (Talend 7.0) was Talend’s biggest release yet. With improvements to cloud, big data, governance, and developer productivity, Talend Spring ’18 is everything you need to manage your company’s data lifecycle. For those that missed the live webinar, you can view it here. But as highlighted in the webinar, I realize that not everyone is going serverless today or looking to ingest streaming data… so what are some of the things in this release that most Talend users will start using immediately? In other words, the hidden gems that did not make the headlines.

#1 Hidden Gem – Unified database selection

For this scenario, you wired your database components into an integration pipeline, e.g. Oracle and Salesforce into AWS Redshift, and then a few months later you want to switch to a different database like Snowflake. Of course, this needs to be done across the many pipelines that you’ve implemented. With Talend Spring ’18, several unified database components, labeled tDB***, have been added to Talend Studio as an entry point to a variety of databases. Instead of deleting the Redshift component, adding the Snowflake component and configuring the Snowflake connection, you just click on the Redshift component, a drop-down list appears, and you select Snowflake.  Big time savings!

#2 Hidden Gem – Smart tMap fuzzy auto-mapping

For this scenario, you are using your favorite component, tMap. You select two data sources to map, and as long as the input and output column names are the same, the auto-mapping works great! But most of the time, since different developers created the tables, the column names are different, e.g. First_Name or FirstName or FirstName2 or FirstName_c (a custom field in Salesforce).  With Talend Spring ’18 and Smart tMap Fuzzy Auto-mapping, tMap uses data quality algorithms (Levenshtein, Jaccard) to do fuzzy matching of the names, saving you time as similarly named columns are matched for you. Just imagine how much faster this will be when dealing with tables with hundreds of columns!

Example:

  • Input column Name, output column Name: auto-map works in Talend 6.5 (exact match).
  • Input column First_Name, output column FirstName_c: auto-map works in Talend 7.0 (fuzzy match).
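To make the idea concrete, here is a minimal sketch of Levenshtein-based column matching in plain Java. This is not Talend’s internal implementation (which also combines other data quality algorithms such as Jaccard); the class name, threshold, and example columns are purely illustrative:

import java.util.*;

public class FuzzyColumnMatcher {

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Pair each input column with its closest output column, if the names are close enough.
    static Map<String, String> autoMap(List<String> inputs, List<String> outputs) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String in : inputs) {
            String best = null;
            int bestDistance = Integer.MAX_VALUE;
            for (String out : outputs) {
                int distance = levenshtein(in.toLowerCase(), out.toLowerCase());
                if (distance < bestDistance) { bestDistance = distance; best = out; }
            }
            // Illustrative threshold: accept only if fewer than half the characters differ.
            if (best != null && bestDistance <= Math.max(in.length(), best.length()) / 2) {
                mapping.put(in, best);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Prints {First_Name=FirstName_c, Name=Name}
        System.out.println(autoMap(Arrays.asList("First_Name", "Name"),
                                   Arrays.asList("FirstName_c", "Name")));
    }
}

An exact match has a distance of zero, which is why identical names were already auto-mapped in Talend 6.5; the fuzzy step in 7.0 simply tolerates a bounded number of differences.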

#3 Hidden Gem – Dynamic distribution support for Cloudera CDH

For this scenario, you just installed the latest release of Talend 7.0, which has Cloudera CDH 5.13 support, and you want to use some of the new features in Cloudera CDH 5.14. Bummer! Talend Spring ’18 enables you to quickly access the latest distros without upgrading Talend through a feature called Dynamic Distributions (technical preview). Maven is used to decouple the job from the Big Data platform version, making it quicker and easier to adopt new Hadoop distributions, so you can switch to the latest release of a Big Data distribution as soon as it becomes available. Initial support is provided for Cloudera CDH (our R&D team is working on others).

How does it work?  When a new version of Cloudera CDH is released by Cloudera, Talend Studio will automatically list available versions. If you want to use it within your Talend jobs, then Studio will download all the related dependencies. This is an option in the project properties of the Studio (not in TAC or TMC) as this is only needed at design-time.

Not only are you able to use new platform features sooner, this update can also save you days of administration and upgrade time.

#4 Hidden Gem – Continuous Integration updates

For this scenario, your DevOps team operates like a well-oiled machine firing on all cylinders. While projects are getting completed faster, there were some inconsistencies across the Talend CI/CD commands, and if you changed a referenced job, you had to recompile all of the connected jobs. With Talend Spring ’18, the CI features have been rewritten to use Maven standards and CI best practices.

You can now do incremental builds in Talend Studio, so only the updated job is rebuilt, not all of them. Other features include broader Git support including Bitbucket Server 5.x and Nexus 3 support for the Talend Artifact Repository, integration with standard Maven commands (technical preview), and the ability to easily extend the build process through Maven plug-ins and custom Project Object Models (POMs).  Translation: we have seen a 50% improvement in build time!  How cool is that?

#5 Hidden Gem – Remote Cloud Debug

For this final scenario, we look at the process of debugging Talend Cloud integration jobs. Everything works in Talend Studio, and the next step is to publish to Talend Cloud and test in the QA environment. Talend Spring ’18 provides a free test engine and the ability to remotely debug jobs (data integration or big data) on either Talend Cloud Engines or Remote Engines. The new feature allows integration developers to run their pipelines from the Studio onto the cloud engine or remote engine and see logs and debug designs locally, which increases productivity by cutting test and debug cycles from minutes to seconds.

Now that you know all the hidden gems that Talend Spring ’18 can bring you, go put more data to work today. Enjoy!

To learn more about all the Talend Spring ’18 hidden gems, check out What’s New, or take it for a test drive.

 

The post 5 Hidden Gems in Talend Spring ’18 (Talend 7.0) appeared first on Talend Real-Time Open Source Data Integration Software.

[Step-by-Step] Using Talend to Bulk Load Data in Snowflake Cloud Data Warehouse

In the last 3 years, IT has moved to the center of many business-critical discussions taking place in the boardroom.  Chief among these discussions are those focused on digital transformation, especially those around operational effectiveness and customer centricity.  A critical component for the success of these initiatives is the ability for the business to make timely business decisions and to do so with confidence. To do this, you must have 100% confidence in your ability to access and analyze all the relevant data.

The Rise of Cloud

Among the major technology shifts that are fostering this digital transformation, the rise of cloud has provided companies with innovative and compelling ways to interact with their teams, customers and business partners.

However, organizations often find themselves limited by their capabilities to work with the vast amount of data available. Most IT organizations don’t have the ability to access all the useful data available in a timely manner to fulfill business organizations’ requests. This gap between IT teams and business groups is limiting organizations’ ability to fully take advantage of the digital transformation.

One way to solve those challenges is to create a data lake with a Cloud Data Warehouse.

Talend Cloud & Snowflake

Through the economics of a Cloud Data Warehouse, you can store significantly more information for less budget, dramatically expand the types of data that can be analyzed to drive deeper insight, and give users self-service access to data.

Talend Cloud and Snowflake in the cloud enable you to connect to a broad set of sources and targets, including structured and unstructured data.

In the video below, you’ll learn how you can use Talend to quickly and easily load your data coming from various sources and bulk load that data directly into tables in a Snowflake data warehouse. Once the data has been bulk loaded into Snowflake, you can further use Talend to perform ELT (Extract/Load/Transform) functions on that data within Snowflake. This allows you to take full advantage of Snowflake’s processing power and ease-of-use SQL querying interface to transform data in place.

If you handle large volumes of data (up to petabytes) and you need a fast and scalable data warehouse, Talend will be the right solution to access and load data into a Cloud data warehouse and then give everyone in your organization access to the data when and where they need it! Start your free 30-day trial here.

The post [Step-by-Step] Using Talend to Bulk Load Data in Snowflake Cloud Data Warehouse appeared first on Talend Real-Time Open Source Data Integration Software.

How to containerize your integration jobs with one click with Talend and Docker

This blog post is part of a series on serverless architecture and containers. The first post of this series discussed the impact of containers on DevOps.

Talend Data Integration is an enterprise data integration platform that provides visual design while generating simple Java.  This lightweight, modular design approach is a great fit for containers.  In this blog post, we’ll walk you through how to containerize your Talend job with a single click.  All of the code examples in this post can be found on our Talend Job2Docker Git repository.  The git readme also includes step-by-step instructions.

Building Job2Docker

There are two parts to Job2Docker.  The first part is a very simple Bash script that packages your Talend job zip file as a Docker image.  During this packaging step, the script tweaks your Talend job launch command so it will run as PID 1.  It does not modify the job in any other way.  As a result, your Talend Job will be the only process in the container.  This approach is consistent with the spirit and best practices for Docker.  When you create a container instance, your Talend job will automatically run. 

When your job is finished, the container will shut down.  Your application logic is safely decoupled from the hosting compute infrastructure. You can now leverage container orchestration tools such as Kubernetes, OpenShift, Docker Swarm, EC2 Container Services, or Azure Container Instances to manage your job containers efficiently.  Whether operating in the Cloud or on-premises, you can leverage the improved elasticity to reduce your total costs of ownership.

Running Job2Docker

The second part of Job2Docker is a simple utility Job written in Talend itself.  It monitors a shared directory.  Whenever you build a Job from Studio to this directory, the Job2Docker listener invokes the job packaging script to create the Docker image.

All you need to run the examples is an instance of Talend Studio 7.0.1 (which you can download for free here) and a server running Docker. 

  • If you run Studio on Linux, you can simply install Docker and select a directory to share your Talend Jobs with Docker.
  • If you run Studio on Windows, then you can either try Docker on Windows, or you can install Linux on a VM.

The examples here were run on a Linux VM while Talend Studio ran in Windows on the host OS.  For the Studio and Docker to communicate, you will need to share a folder using your VM tool of choice.

Once you have installed these Job2Docker scripts and listener, the workflow is transparent to the Studio user.

1. Start the Job2Docker_listener job monitoring the shared directory.

2. Click “Build” in Talend Studio to create a Talend job zip file in the shared directory.  That’s it.

3. The Talend Job2Docker_listener triggers the Job2Docker script to convert the Talend zip file to a .tgz file ready for Docker.

4. The Talend Job2Docker_listener triggers the Job2Docker_build script, creating a Docker image.

5. Create a container with the Docker Run command. Your job runs automatically.
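For example, if the generated image were tagged hello_world:0.1 (the image name and tag here are purely illustrative), running the job is a one-liner:

    docker run --rm hello_world:0.1

The --rm flag removes the stopped container afterwards, which suits the run-to-completion nature of a batch job.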

The Job2Docker repository includes some basic Hello World examples, and it also walks you through how you can easily pass parameters to your Job.  The step-by-step process is detailed in a short video available on YouTube.

Once you have your Docker image you can publish it to your Docker registry of choice and stage your containerized application as part of your Continuous Integration workflow.

While this process is deceptively simple, it has big implications for how you can manage your workflows at scale.  In our next post, we’ll show you how to orchestrate multiple containerized jobs using container orchestration tools like Kubernetes, EC2 Container Services, Fargate, or Azure Container Instances.

The post How to containerize your integration jobs with one click with Talend and Docker appeared first on Talend Real-Time Open Source Data Integration Software.

How to be Data-Driven when Data Economics are Broken

The day an IBM scientist invented the relational database in 1970 completely changed the nature of how we use data. For the first time, data became readily accessible to business users.  Businesses began to unlock the power of data to make decisions and increase growth. Fast-forward 48 years to 2018, and all the leading companies have one thing in common: they are intensely data-driven.

The world has woken up to the fact that data has the power to transform everything that we do in every industry from finance to retail to healthcare, if we use it the right way. And businesses that win are maximizing their data to create better customer experiences, improve logistics, and derive valuable business intelligence for future decision-making. But right now, we are at a critical inflection point. Data is doubling each year, and the amount of data available for use in the next 48 years is going to take us to dramatically different places than the world has ever seen.

Let’s explore the confluence of events that have brought us to this turning point, and how your enterprise can harness all this innovation – at a reasonable cost.

Today’s Data-driven Landscape

We are currently experiencing a “perfect storm” of data. The incredibly low cost of sensors, ubiquitous networking, cheap processing in the Cloud, and dynamic computing resources are not only increasing the volume of data, but also the enterprise imperative to do something with it. We can do things in real time, and the number of self-service practitioners is tripling annually. The emergence of machine learning and cognitive computing has blown up the data possibilities to completely new levels.

Machine learning and cognitive computing allows us to deal with data at an unprecedented scale and find correlations that no amount of brain power could conceive.  Knowing we can use data in a completely transformative way makes the possibilities seem limitless.  Theoretically, we should all be data-driven enterprises. Realistically, however, there are some roadblocks that make it seem difficult to take advantage of the power of data:

Trapped in the Legacy Cycle with a Flat Budget

The “perfect storm” of data is driving a set of requirements that is dramatically outstripping what most IT shops can do. Budgets are nearly flat, increasing only 4.5% annually, leaving companies to feel locked into a set of technology choices and vendors. In other words, they’re stuck in the “legacy cycle”.  Many IT teams are still spending most of their budget just trying to keep the lights on. The remaining budget is spent trying to modernize and innovate, and then a few years later, all that new modern stuff that you bought is legacy all over again, and the cycle repeats. That’s the cycle of pain that we’ve all lived through for the last 20 years.

Lack of Data Quality and Accessibility

Most enterprise data is bad. Incorrect, inconsistent, inaccessible… these factors hold enterprises back from extracting value from data. In a Harvard Business Review study, only 3% of the data surveyed was found to be of “acceptable” quality. That is why data analysts are spending 80% of their time preparing data as opposed to doing the analytics that we’re paying them for. If we can’t ensure data quality, let alone access the data we need, how will we ever realize its value?

Increasing Threats to Data

The immense power of data also increases the threat of its exploitation. Hacking and security breaches are on the rise; the global cost of cybercrime fallout is expected to reach $6 trillion by 2021, double the $3 trillion cost in 2015. In light of the growing threat, security and privacy regulations are multiplying.  Given the issues with data integrity, organizations want to know: Is my data both correct and secure? How can data security be ensured in the middle of this data revolution?

Vendor Competition is Intense

The entire software industry is being reinvented from the ground up and all are in a race to the cloud. Your enterprise should be prepared to take full advantage of these innovations and choose vendors most prepared to liberate your data, not just today, but tomorrow, and the year after that.

Meet the Data Disruptors

It might seem impossible to harness all this innovation at a reasonable cost. Yet, there are companies that are thriving amid this data-driven transformation. Their secret? They have discovered a completely disruptive way, a fundamentally new economic way, to embrace this change.

We are talking about the data disruptors, and their strategy is not as radical as it sounds. These are the ones who have found a way to put more data to work with the same budget. For the data disruptors, success doesn’t come from investing more budget in the legacy architecture. These disruptors adopt a modern data architecture that allows them to liberate their data from the underlying infrastructure.

Put More of Your Data to Work

The organizations that can quickly put the right data to work will have a competitive advantage. Modern technologies make it possible to liberate your data and thrive in today’s hybrid, multi-cloud, real-time, machine learning world.  Here are three prime examples of innovations that you need to know about:

  • Cloud Computing: The cloud has created new efficiencies and cost savings that organizations never dreamed would be possible. Cloud storage is remote and scales to deliver only the capacity that is needed. It eliminates the time and expense of maintaining on-premise servers, and gives business users real-time, self-service access to data, anytime, anywhere. There is no hand-coding required, so business users can create integrations between any SaaS and on-premise application in the cloud without requiring IT help. Cloud offers cost, capability and productivity gains that on-premise can’t compete with, and the data disruptors have already entrusted their exploding data volumes to the cloud.
  • Containers: Containers are quickly overtaking virtual machines. According to a recent study, the adoption of application containers will grow by 40% annually through 2020. Virtual machines require costly overhead and time-consuming maintenance, with full hardware and an operating system (OS) that need to be managed. Containers are portable with few moving parts and minimal maintenance required. A company using stacked container layers pays only for a small slice of the OS and hardware on which the containers are stacked, giving data disruptors unlimited operating potential at a huge cost savings.
  • Serverless Computing: Deploying and managing big data technologies can be complicated and costly, and requires expertise that is hard to find. Research by Gartner states, “Serverless platform-as-a-service (PaaS) can improve the efficiency and agility of cloud services, reduce the cost of entry to the cloud for beginners, and accelerate the pace of organizations’ modernization of IT.”

Serverless computing allows users to run code without provisioning or managing any underlying system or application infrastructure. Instead, the systems automatically scale to support increasing or decreasing workloads on-demand as data becomes available.

Its name is a misnomer; serverless computing still requires servers, but the cost is only for the actual server capacity used. Companies are only charged for what they are running at any given time, eliminating the waste associated with on-premise servers.  The platform scales up as much as it needs to solve the problem at hand, runs it, then scales back down and turns off. The future is serverless, and its potential to liberate your data is limitless.

Join the Data Disruptors

Now is the time to break free from the legacy trap and liberate your data so its potential can be maximized by your business. In the face of growing data volumes, the data disruptors have realized the potential of the latest cloud-based technologies. Their business and IT teams can work together in a collaborative way, finding an end-to-end solution to the problem, all in a secure and compliant fashion. Harness this innovation and create a completely disruptive set of data economics so your organization can efficiently surf the tidal wave of data.

 

The post How to be Data-Driven when Data Economics are Broken appeared first on Talend Real-Time Open Source Data Integration Software.

Talend Summer’18 Release: Under the Hood of Talend Cloud

Today on July 19, we released Talend Summer ’18, which is jam-packed with cloud features and capabilities. We know you are going to love the Talend Cloud automated integration pipelines, Okta Single Sign-On, and the enhanced data preparation and data stewardship functions…there is so much to explore!

Taking DevOps to the Next Level with the Launch of Jenkins Maven Plug-in Support

DevOps has become a widely adopted practice that streamlines and automates the processes between Development (Dev) and IT operations (Ops), so that they can design, build, test, and deliver software in a more agile, frictionless, and reliable fashion. However, the conventional challenge is that when it comes to DevOps, customers are not only tasked with finding the right people and culture, but also the right technology.

Data integration fits into DevOps when it comes to building continuous data integration flows, as well as governing apps to support seamless data flows between apps and data stores. Selecting an integration tool that automates the process is critical. It will not only allow for more frequent deployment and testing of integration flows against different environments, increase code quality, and reduce downtime, but also free up the DevOps team’s time to work on new code.

Talend Cloud has transformed the way developers and ops teams collaborate to release software in the past few years. With the launch of Winter ’17, Talend Cloud accelerated the continuous delivery of integration projects by allowing teams to create, promote, and publish jobs in separate production environments. An increasing number of customers recognize the value that Talend Cloud brings for implementing DevOps practice. And now they can use the Talend Cloud Jenkins Maven plug-in in this Summer ’18 release, a feature that lets you automate and orchestrate the full integration process by building, testing, and pushing jobs to all Talend Cloud environments. This in turn further boosts the productivity of your DevOps team and reduces time-to-market.

Security and Compliance made Simple: Enterprise Identity and Access Management (IAM) with 1 Click

If you are an enterprise customer, you are likely faced with the growing demands of managing thousands of users and partners who need access to your cloud applications, at any time and from any device. This adds to the complexity of Enterprise Identity and Access Management (IAM) requirements: meeting security and compliance regulations and audit policies, minimizing IT tickets, and only giving the right users access to the right apps. A Single Sign-On (SSO) feature helps address this challenge.

In the Summer ’18 release, Talend Cloud introduced Okta Single Sign-On (SSO) support. SSO permits a user to use one set of company login credentials to access multiple applications at once. This update ensures greater compliance with your company security and audit policies as well as improving user convenience. If you are with another identity management provider, you can simply download a plug-in to leverage this SSO feature. 

The other security and compliance features worth mentioning in this release are Hadoop User Impersonation for Jobs in the cloud integration app, and a feature that enables fine-grained permissions on semantic type definitions. Both provide greater data and user visibility for better compliance and audit; see the release notes for details.

Better Data Governance at Your Finger Tips: New Features in Talend Data Preparation and Data Stewardship Cloud Apps

The Summer ’18 release introduces several new data preparation and data stewardship functions. These include:

  • More data privacy and encryption functions with the new “hash data” function.
  • Finer grained access control in the dictionary service for managing and accessing the semantic types.
  • Improved management in Data Stewardship, now that you can perform mass import, export and remove actions on your data models and campaigns, allowing you to promote, back up or reset your entire environment configuration in just two clicks.
  • Enhancements in the Salesforce.com connectivity that allow you to filter the data in the source module by defining a condition directly in your Salesforce dataset and focus on the data you need. This reduces the amount of data to be extracted and processed, making the use case of self-service cleansing and preparation of Salesforce.com data even more compelling.

Those functionalities make cloud data governance a lot simpler and easier.

To learn more, please visit the Talend Cloud product page or sign up for a Talend Cloud 30-day free trial.

For more exciting updates, you can pre-register for Talend Connect 2019.

The post Talend Summer’18 Release: Under the Hood of Talend Cloud appeared first on Talend Real-Time Open Source Data Integration Software.

[Step-by-step] Using Talend for cloud-to-cloud deployments and faster analytics in Snowflake

For the past two years, Snowflake and Talend have joined forces developing deep integration capabilities and high-performance connectors so that companies can easily move legacy on-premises data to a built-for-the-cloud data warehouse.

Snowflake, which runs on Amazon Web Services (AWS), is a modern data-warehouse-as-a-service built from the ground up for the cloud, for all an enterprise’s data, and all their users. In the first part of this two-part blog series, we discussed the use of Talend to bulk load data into Snowflake. We also showcased the ability to use Talend to perform ELT (Extract, Load, Transform) functions on Snowflake data, allowing you to take full advantage of Snowflake’s processing power and ease-of-use SQL querying interface to transform data in place.

This second video highlights Talend’s ability to harness the power of pure cloud deployments, making it possible to keep your data in place until it’s needed in Snowflake. By running Talend jobs directly in the cloud, no data is ever processed client-side and no local processing is required. As a result, the governance and restrictions that you have implemented to secure your data remain intact. In addition to the security benefits, you get the full computing performance of Snowflake. Once that data is required in Snowflake, it is moved or copied seamlessly and directly into Snowflake from your cloud provider location.

The video walks you through the entire process, from extracting data from an Amazon S3 bucket to moving that data into a Snowflake data warehouse using Talend Cloud. Talend Cloud provides an easy-to-use platform to process this data in the cloud, and then leverage the power and ease-of-use of Snowflake to access and analyze that data in the cloud.

Talend Cloud with Snowflake delivers cloud analytics 2 to 3 times faster, in a governed way.

Start your free 30-day trial of Talend Cloud here.

 

The post [Step-by-step] Using Talend for cloud-to-cloud deployments and faster analytics in Snowflake appeared first on Talend Real-Time Open Source Data Integration Software.


Data Preparation and Wrangling Best Practices – Part 2

This is the second part of my blog series on data preparation and data wrangling best practices. For those of you who are new to this blog, please refer to Part 1 of the same series ‘Data Preparation and Wrangling Best Practices – Part 1’ and for those who are following my blog series, a big thank you for all the support and feedback! In my last blog, we saw the first ten best practices when working with your data sets in Talend Data Preparation. In this blog, I want to touch on some advanced best practices. So with that, let’s jump right in!

Best Practice 11 – Identify Outliers

Outliers are problematic because they can seriously compromise the outcome of any analysis of a data set. Outliers are data points that are distant from the rest of the distribution in your data set; they are either very large or very small values compared with the rest of the dataset. When faced with outliers, the most common strategy is to delete them. However, it depends on the individual project requirements.

The image below shows data preparation identification of outliers on the Talend Data Preparation UI:

Figure 1: Quick identification of outliers in Talend Data Preparation

A single outlier can have a huge impact on the value of the mean. Because the mean is supposed to represent the center of the data, in a sense, this one outlier renders the mean useless. Keep an eye out for these!

Best Practice 12 – Deal with Missing Values

A missing value in your dataset poses a risk to the quality of the data being analyzed.  As a best practice, I always recommend handling missing values rather than ignoring them. The best way to resolve missing values in your dataset really depends on the project, but you could:

  • Replace the missing values with a proper value.
  • Replace it with a flag to indicate a blank.
  • Delete the row/record entirely.

Figure 2: Dealing with missing values

Best Practice 13 – Share and Reuse Preparations

Re-usability is music to the ears of anyone who lives in the coding world. It saves a lot of effort and time and eases the whole Software Development Life Cycle (SDLC). With Talend Data Preparation, users can share preparations and datasets with individual users or a group of users. As a best practice, always create a shared folder and place all the shareable data preparations in that folder.

Figure 3: folder ‘ASIA_DEFAULTERS’ is being shared with ‘User1’ by owner ‘Rekha Sree’

Best Practice 14 – Mask Sensitive Data

When manipulating sensitive data, such as names, addresses, credit card numbers or social security numbers, you will probably want to mask the data. To protect the original data while having a functional substitute, you can use the mask data (obfuscation) function directly in Talend Data Preparation.

Figure 4: Masking function using Talend Data Preparation

Best Practice 15 –  Use Versioning

Adding versions to your preparation is a good way to see the changes that have been made to the dataset over time. It also ensures that your Talend Jobs always use the same fixed state of a preparation, even if the preparation is still being worked on. Versions can be used in Data Integration as well as Big Data Jobs.

Figure 5: Versioning in Data Preparation

Best Practice 16 – Check the Log File Location

Using logs in Talend Data Preparation allows you to analyze and debug the activity of Talend Data Preparation. By default, Talend Data Preparation logs to two different places: the console and a log file. The location of this log file depends on the version of Talend Data Preparation that you are using:

  • <Data_Preparation_Path>/data/logs/app.log for Talend Data Preparation.
  • AppData/Roaming/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on Windows.
  • Library/Application Support/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on MacOS.

As a best practice, it is recommended to change the default locations. The location can be configured by editing the logging.file property of the application.properties file.
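For example, on a Linux server you might point the log at a dedicated directory (the path below is purely illustrative) by setting the following line in application.properties:

    logging.file=/var/log/talend/dataprep/app.log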

Best Practice 17 –  Know the Data Storage Location

Depending on the version of Talend Data Preparation that you are using, your data is stored in different locations. Here is a quick overview:

  • Talend Data Preparation
    • If you are a subscription user, nothing is saved directly on your computer. Sample data is cached temporarily on the remote Talend Data Preparation server, in order to improve the product responsiveness. In addition, CSV and Excel datasets are stored permanently on the remote Talend Data Preparation server.
  • Talend Data Preparation Free Desktop
    • Talend Data Preparation Free Desktop is meant to be able to work locally on your computer, without the need of an internet connection. Therefore, when using a dataset from a local file such as a CSV or Excel file, the data is copied locally, in one of the following folders depending on your operating system:
    • Windows: C:\Users\<your_user_name>\AppData\Roaming\Talend\dataprep\store
    • OS X: /Users/<your_user_name>/Library/Application Support/Talend/dataprep/store

Best Practice 18 – Always Backup

Backing up Talend Data Preparation and Talend Data Dictionary on a regular basis is important to recover from a data loss scenario or any other cause of data corruption or deletion. Here is how to back up both:

  • Talend Data Preparation
    • To have a copy of a Talend Data Preparation instance, back up MongoDB, the folders containing your data, the configuration files and the logs.
  • Talend Data Dictionary
    • Talend Data Dictionary Service stores all the predefined semantic types used in Talend Data Preparation. It also stores all the custom types created by users and all the modifications done on existing ones. To back up a Talend Dictionary Service instance, you need to back up the MongoDB database and the changes made to the predefined semantic types.

Best Practice 19 – Know Your Dataset

As you will be dealing with raw data, probably from very varied kinds of input, it is recommended to build up knowledge while you are analyzing the data. Here are some suggestions:

  • Discover and learn data relationships within and across sources; find out how the data fits together; use analytics to discover patterns
  • Define the data. Collaborate with other business users to define shared rules, business policies, and ownership
  • Build knowledge with a catalog, glossary, or metadata repository
  • Gain high-level insights. Get the big picture of the data and its context

Best Practice 20 – Document Knowledge

While it is important to build and enhance this knowledge, it is equally important to document it. In particular, every project must maintain a document covering:

  • Business terminology
  • Source data lineage
  • History of changes applied during cleansing
  • Relationships to other data
  • Data usage recommendations
  • Associated data governance policies
  • Identified data stewards

Best Practice 21 – Create a Data Dictionary

A data dictionary is a metadata description of the features included in the dataset. As you analyze and understand the data, it is recommended to record that understanding in the data dictionary. It will help other users identify the data they are working with and establish the relationships between various data sets.

Figure 6: Data Dictionary in Data Preparation

Best Practice 22 – Set Up a Promotion Pipeline

When using Talend Data Preparation, you should consider setting up one instance for each environment of your production chain. Talend only supports promoting a preparation between identical product versions. To promote your preparation from one environment to the other, you have to export it from the source environment, and then import it back into your target environment. For the import to work, a dataset with the same name and schema as the one the export was based on must exist in the target environment.

Best Practice 23 – Hybrid Preparation Environments

Sometimes the transformations are either too complex or too bulky to be created in a simple form. To help you in such scenarios, Talend offers a hybrid preparation environment: you can use either the dedicated Talend Data Preparation service or Talend Jobs to create data preparations. Leverage the tDatasetOutput component as the output, in Create mode.

Figure 7: Creating a dataset from Talend Studio

Best Practice 24 – Operationalizing a Recipe

The tDataprepRun component allows you to reuse an existing preparation made in Talend Data Preparation, directly in a data integration Job. In other words, you can operationalize the process of applying a preparation to input files that have the same model.

Figure 8: Executing Talend data preparation in a Talend Studio job

Figure 9: Executing Talend data preparation in a Talend Studio job using dynamic selection

Note: In order to use the tDataprepRun component with Talend Data Preparation Cloud, you must have the 6.4.1 version of Talend Studio.

Best Practice 25 – Live Dataset

What if your business doesn’t need sample data but real live data for analysis? Because the Job is designed in Talend Studio, you can take advantage of the full components palette and their Data Quality or Big Data capabilities. Unlike a local file import, where the data is stored in the Talend Data Preparation server for as long as the file exists, a live dataset only retrieves this sample data temporarily.

It is possible to retrieve the result of Talend Integration Cloud flows that were executed on a Talend Cloud engine, as well as on remote engines. With live datasets you can:

  • Use a preparation as part of a data integration flow or a Talend Spark Batch or Streaming job in Talend Studio.
  • Create a Job in Talend Studio, execute it on demand via Talend Integration Cloud as a flow, and retrieve a dataset with the sample data directly in Talend Data Preparation Cloud.

Note: In order to create Live datasets, you must have the 6.4.1 version of Talend Studio, patched with at least the 0.19.3 version of the Talend Data Preparation components.

Conclusion

And with this, I come to the end of this two-part blog series. I hope these best practices are helpful and that you will embed them in your work with data preparation.

The post Data Preparation and Wrangling Best Practices – Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

Talend & Apache Spark: Debugging & Logging

So far, our journey of using Apache Spark with Talend has been a fun and exciting one. The first three posts in my series provided an overview of how Talend works with Apache Spark, some similarities between Talend and Spark Submit, the configuration options available for Spark jobs in Talend, and how to tune Spark jobs for performance. If you haven’t already read them, you should do so before getting started here. Start with “Talend & Apache Spark: A Technical Primer”, “Talend vs. Spark Submit Configuration: What’s the Difference?” and “Apache Spark and Talend: Performance and Tuning”.

To finish this series, we’re going to talk about logging and debugging. When starting your journey with Talend and Apache Spark, you may have run into an error like the one below printed out in your console log:

    “org.apache.spark.SparkContext - Error initializing SparkContext. 
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master”

How do you find out what caused this? Where should you be looking for more information? Did your Spark job even run? In the following sections, I will go over how you can find out what happened with your Spark job, and where you should look for more information that will help you resolve your issue.

Resource Manager Web UI

When you get an error message like the one above, the first place you should always visit is the Resource Manager Web UI, to locate your application and see what errors may be reported there.


Once you locate your application, you’ll see in the bottom right corner an option for retrieving the container logs: click the “logs” link provided next to each attempt to get more information about what happened.

In my experience, I never get all the logging information that I need from the Web UI alone, so it is better to log in to one of your cluster edge nodes and then use the YARN command-line tool to grab all the logging information for your containers and output it into a file, as shown below.
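A typical invocation looks like this, reusing the application ID reported by the Resource Manager (the output file name is up to you):

    yarn logs -applicationId application_1563062490123_0020 > application_1563062490123_0020.log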

Interpreting the Spark Logs (Spark Driver)

Once you have gotten the container logs through the command shown above and have the logs from your Studio, you need to interpret them and see where your job may have failed. The first place to start is with the Studio logs, which contain the logging information for the Apache Spark driver. The first few lines of these logs state that the Spark driver has started:

[INFO ]: org.apache.spark.util.Utils - Successfully started service 'sparkDriver' on port 40238.
[INFO ]: org.apache.spark.util.Utils - Successfully started service 'sparkDriverActorSystem' on port 34737.

The next thing you should look for in the Studio log is the address of the Spark Web UI that is started by the Spark driver:

[INFO ]: org.apache.spark.ui.SparkUI - Started SparkUI at http://<ip_address>:4040


This is the Spark Web UI that is launched by the driver. The next step that you should see in the logs is the libraries needed by the executors being uploaded to the Spark cache:


[INFO ]: org.apache.spark.SparkContext - Added JAR ../../../cache/lib/1223803203_1491142638/talend-mapred-lib.jar at  spark://<ip_address>:40238/jars/talend-mapred-lib.jar with timestamp 1463407744593

Once all the information that will be needed by the executors for the job is uploaded to the Spark cache, you will then see, in the Studio log, the request for the Application Master:

[INFO ]: org.apache.spark.deploy.yarn.Client - Will allocate AM container, with 896 MB memory including 384 MB overhead
[INFO ]: org.apache.spark.deploy.yarn.Client - Submitting application 20 to ResourceManager
[INFO ]: org.apache.spark.deploy.yarn.Client - Application report for application_1563062490123_0020 (state: ACCEPTED)
[INFO ]: org.apache.spark.deploy.yarn.Client - Application report for application_1563062490123_0020 (state: RUNNING)

After this, the executors are registered and the processing starts:

[INFO ]: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Registered executor NettyRpcEndpointRef(null) 
(hostname2:41992) with ID 2
[INFO ]: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend - Registered executor NettyRpcEndpointRef(null) 
(hostname1:45855) with ID 1
[INFO ]: org.apache.spark.scheduler.TaskSetManager - Finished task 1.0 in stage 1.0 (TID 3) in 59 ms on hostname1 (1/2)
[INFO ]: org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 2) in 79 ms on hostname2 (2/2)

At the end of the Spark driver log, you’ll see the last stage, which is the shutdown and cleanup:

[INFO ]: org.apache.spark.util.ShutdownHookManager - Shutdown hook called
[INFO ]: org.apache.spark.util.ShutdownHookManager - Deleting directory /tmp/spark-5b19fa47-96df-47f5-97f0-cf73375e59e1
[INFO ]: akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.

In your log, you may see every stage being reported, but depending on the issue some of them may not get executed.

This exercise should give you a very good indicator of where the actual failure happened, and whether it was an issue encountered by the Spark driver. As a note, the Spark driver logs shown above will only be available through the Studio console log if the job is run using YARN-client mode. If YARN-cluster mode is used, this logging information will not be available in the Studio, and you will have to use the YARN command mentioned earlier to get that logging.

Interpreting the Spark Logs (Container Logs)

Now, let’s move on to reviewing the container logs (if your application started running at the cluster level) and start interpreting the information. The first step that we will see in these logs is the Application Master starting:

05/05/18 16:09:13 INFO ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]

Then you will see the connectivity happening with the Spark Driver:

05/05/18 16:09:14 INFO ApplicationMaster: Waiting for Spark driver to be reachable.
05/05/18 16:09:14 INFO ApplicationMaster: Driver now available: <spark_driver>:40238

It will then proceed with requesting resources:

05/05/18 16:09:15 INFO YarnAllocator: Will request 2 executor containers,  
each with 1 cores and 1408 MB memory including 384 MB overhead
05/05/18 16:09:15 INFO YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
05/05/18 16:09:15 INFO YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)

Then once it gets the resources it will start launching the containers:

05/05/18 16:09:15 INFO YarnAllocator: Launching container container_e04_1463062490123_0020_01_000002 for on host hostname1
05/05/18 16:09:15 INFO YarnAllocator: Launching container container_e04_1463062490123_0020_01_000003 for on host hostname2

It then proceeds with printing the container classpath:

CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$YARN_HOME/*<CPS>$YARN_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
{{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms1024m -Xmx1024m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.driver.port=40238' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@163.172.14.98:40238 --executor-id 1 --hostname hostname1 --cores 1 --app-id application_1463062490123_0020 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr

You’ll now see our executors starting up:

05/05/18 16:09:19 INFO Remoting: Starting remoting
05/05/18 16:09:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@tbd-bench-09:38635]

And then the executor communicating back to the Spark Driver:

05/05/18 16:09:20 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@163.172.14.98:40238
05/05/18 16:09:20 INFO CoarseGrainedExecutorBackend: Successfully registered with driver

Then proceeding to retrieve the libraries from the Spark cache and updating the classpath:

05/05/18 16:09:21 INFO Utils: Fetching spark://<ip_address>:40238/jars/snappy-java-1.0.4.1.jar to /data/yarn/nm/usercache /appcache/application_1563062490123_0020/spark-7e7a084d-e9e2-4c94-9174-bb8f4a0f47e9/fetchFileTemp7487566384982349855.tmp
05/05/18 16:09:21 INFO Utils: Copying /data/yarn/nm/usercache /appcache/application_1563062490123_0020/spark-7e7a084d-e9e2-4c94-9174-bb8f4a0f47e9/1589831721463407744593_cache to /data2/yarn/nm/usercache /appcache/application_1563062490123_0020/container_e04_1563062490123_0020_01_000002/./snappy-java-1.0.4.1.jar
05/05/18 16:09:21 INFO Executor: Adding file:/data2/yarn/nm/usercache /appcache/application_1563062490123_0020/container_e04_1563062490123_0020_01_000002/./snappy-java-1.0.4.1.jar to class loader

Next you will see the Spark executor start running:

05/05/18 16:09:22 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)

Finally, the Spark executor shutdowns and cleans up once the processing is done:

05/05/18 16:09:23 INFO Remoting: Remoting shut down
05/05/18 16:09:23 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

As I mentioned in the previous section, you might see your Spark job go through all those stages, or it might not, depending on the issue encountered. Understanding all the different steps involved, and being able to interpret all this information in the logging, will give you a much better understanding of why a Spark job may have failed.

Spark History Web UI

In previous blogs, I mentioned that, as a best practice, you should always enable the Spark event logging in your jobs, so that the information in the Spark History Web Interface is available even after the job ends.

This Web Interface is the next location that you should always check for more information regarding the processing of your job. The Spark History Web UI does a great job of giving a visual representation of the processing of a Spark Job. When you navigate to that web interface, you will see the following tabs:

First, I suggest starting with the environment tab, to verify all the environment information that was passed to our job:

The next step is to look at the timeline of events in Spark. When you click on the Jobs tab, you will see how the application was executed after the executors got registered:

As you can see in the image above, it shows the addition of all the executors that were requested, then the execution of the jobs that the application was split into, whether any of them ran in parallel, and whether there was a failure in any of them. You can now proceed by clicking on one of those jobs to get further information (example below):

Here you’ll see the stages that run in parallel and don’t depend on each other, as well as the stages that depend on others to finish and don’t start until the first ones are done. The Spark History Web UI also allows you to look further inside those stages and see the different tasks that are executed:

Here you are looking at the different partitions (in this case we have 2) and how they are distributed among the different executors. You’ll also see how much of the execution time was spent on shuffling and on actual computation. Furthermore, if you are also doing joins in your Spark job, you will notice that each job in the Web UI also shows the execution DAG visualization, which allows you to easily determine the type of join that was used:

As a final step, make sure to check the Executors tab, which will give you an overview of all the executors that were used by your Spark job, how many tasks each one of them processed, the amount of data processed, the amount of shuffling, and how much time was spent on tasks and in garbage collection:

All this information is important, as it will lead you to a better understanding of the root cause of a potential issue and of the corrective action you should take.

Conclusion

This concludes my blog series on Talend with Apache Spark. I hope you enjoyed this journey, and had as much fun reading the blogs as I had putting all this information together! I would love to hear about your experience with Spark and Talend, and whether the information in these blogs was useful, so feel free to post your thoughts and comments below.

References

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-history-server.html

The post Talend & Apache Spark: Debugging & Logging appeared first on Talend Real-Time Open Source Data Integration Software.

Making Sense of the 2018 Gartner Magic Quadrant for Data Integration Tools

It’s an exciting time to be part of the data market.  Never before have we seen so much innovation and change in a market, especially in the areas of cloud, big data, machine learning and real-time data streaming.  With all of this market innovation, we are especially proud that Talend was recognized by Gartner as a leader for the third time in a row in their 2018 Gartner Magic Quadrant for Data Integration Tools and remains the only open source vendor in the leaders quadrant.

According to Gartner’s updated forecast for the Enterprise Infrastructure Software market, data integration and data quality tools are the fastest growing sub-segment, growing at 8.6%. Talend is rapidly taking market share in the space with a 2017 growth rate of 40%, more than 4.5 times faster than the overall market.

The Data Integration Market: 2015 vs. 2018

Making the move from challengers to market leaders from 2015 to today was no easy feat for an emerging leader in cloud and big data integration. It takes a long time to build a sizeable base of trained and skilled users while maturing product stability, support and upgrade experiences. 

While Talend still has room to improve, it’s exciting to see our score improve like that in recognition of all the investments Talend has made.

Today’s Outlook in the Gartner Magic Quadrant

Mark Byer, Eric Thoo, and Etisham Zaidi are not afraid to change things up in the Gartner Magic Quadrant as the market changes, and their 2018 report is proof of that.  Overall, Gartner continued to raise their expectations for the cloud, big data, machine learning, IoT and more.  If you read each vendor’s write up carefully and take close notes, as I did, you start to see some patterns. 

In my opinion, the latest report from Gartner indicates that, in general, you have to pick your poison: you can have a point solution with less mature products and support and a very limited base of trained users in the market, or go with a vendor that has product breadth, maturity and a large base of trained users, but with expensive, complex and hard-to-deploy solutions.

Talend’s Take on the 2018 Gartner Magic Quadrant for Data Integration Tools

In our minds, this has left a really compelling spot in the market for Talend as the leader in the new cloud and big data use cases that are increasingly becoming the mainstream market needs. For the last 10+ years, we’ve been on a mission to help our customers liberate their data. As data volumes continue to grow exponentially, along with growth in the number of business users needing access to that data, this mission has never been more important. This means continuing to invest in our native architecture to enable customers to be the first to adopt new cutting-edge technologies like serverless and containers, which significantly reduce total cost of ownership and can run on any cloud.

Talend also strongly believes that data must become a team sport for businesses to win, which is why governed self-service data access tools like Talend Data Preparation and Talend Data Streams are such important investments for Talend.  It’s because of investments like these that we believe Talend will quickly become the overall market leader in data integration and data quality. As I said at the beginning of the blog, our evolution has been a journey and we invite you to come along with us. I encourage you to download a copy of the report,  try Talend for yourself and become part of the community.

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. 
GARTNER is a federally registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and is used herein with permission. All rights reserved.

 

The post Making Sense of the 2018 Gartner Magic Quadrant for Data Integration Tools appeared first on Talend Real-Time Open Source Data Integration Software.

Talend and Splunk: Aggregate, Analyze and Get Answers from Your Data Integration Jobs

Log management solutions play a crucial role in an enterprise’s layered security framework; without them, firms have little visibility into the actions and events occurring inside their infrastructures that could either lead to data breaches or signify a security compromise in progress.

Splunk is the “Google for log files”: a heavyweight enterprise tool that was the first log analysis software and has been the market leader ever since. So lots of customers will be interested in seeing how Talend can integrate with their enterprise Splunk and leverage Splunk’s out-of-the-box features.

Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards, and visualizations. It has an API that allows for data to be captured in a variety of ways.

Splunk’s core offering collects and analyzes high volumes of machine-generated data. It uses a standard API to connect directly to applications and devices. It was developed in response to the demand for comprehensible and actionable data reporting for executives outside a company’s IT department.

Splunk has several products but in this blog, we will only be working with Splunk Enterprise to aggregate, analyze and get answers from your Talend job logs. I’ll also cover an alternative approach where developers can also log customized events to a specific index using the Splunk Java SDK. Let’s get started!

Intro to the Talend Log Server

Let’s start by introducing you to the Talend Log Server. Simply put, this is a logging engine based on Elasticsearch which is developed alongside a data-collection and log-parsing engine called Logstash, and an analytics and visualization platform called Kibana (or ELK).

These technologies are used to streamline the capture and storage of logs from Talend Administration Center, MDM Server, ESB Server and Tasks running through the Job Conductor. It is a tool for managing events and Job logs. Talend supports the basic installation, but features such as high availability and read/write APIs are beyond Talend's scope of support.
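
Although reading from this store directly falls outside Talend's supported scope, it can be handy to know what sits underneath. The sketch below queries the Log Server's Elasticsearch backend with the Elasticsearch low-level REST client (version 6.4 or later); the host, port, index name and field name are assumptions for illustration, so check your own Log Server configuration before trying anything like this.

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class LogServerQuery {
    public static void main(String[] args) throws Exception {
        // Host, port and index name are assumptions -- adjust to your Log Server setup.
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("GET", "/talend_logs/_search");
            // Full-text match on the log message field (field name assumed).
            request.setJsonEntity("{\"query\":{\"match\":{\"message\":\"ERROR\"}}}");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}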

To learn how to configure the Talend logging modules with an external Elastic stack, please read this article.

Configure Splunk to Monitor Job Logs

Now that you have a good feel for the Talend Log Server, let's set up Splunk to monitor and collect data integration job logs. After you log into your Splunk deployment, the Home page appears; click Add Data to open the Add Data page. If your Splunk deployment is a self-service Splunk Cloud deployment, click Settings > Add Data from the system bar.

The Monitor option lets you monitor one or more files, directories, network streams, scripts, Event Logs (on Windows hosts only), performance metrics, or any other type of machine data that the Splunk Enterprise instance has access to. When you click Monitor, Splunk Web loads a page that starts the monitoring process.

Select a source from the left pane by clicking it once; the page displayed depends on the source you selected. In our case we want to monitor Talend job execution logs, so select "Files & Directories". The page updates with a field to enter a file or directory name and lets you specify how the Splunk software should monitor it. Follow the on-screen prompts to complete the selection of the source object that you want to monitor, then click Next to proceed to the next step of the Add Data process.
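
If you would rather script this step than click through Splunk Web, the Splunk Java SDK can create the same monitor input. The sketch below is only illustrative: the connection settings and the log directory path are assumptions and need to be replaced with your own values.

import com.splunk.InputCollection;
import com.splunk.InputKind;
import com.splunk.Service;
import com.splunk.ServiceArgs;

public class AddTalendLogMonitor {
    public static void main(String[] args) {
        // Connection settings are placeholders -- point them at your Splunk instance.
        ServiceArgs loginArgs = new ServiceArgs();
        loginArgs.setHost("splunk.example.com");
        loginArgs.setPort(8089);
        loginArgs.setUsername("admin");
        loginArgs.setPassword("changeme");

        Service service = Service.connect(loginArgs);

        // Create a monitor input on the directory where Talend writes its
        // job execution logs (the path below is an assumption).
        InputCollection inputs = service.getInputs();
        inputs.create("/opt/talend/jobserver/logs", InputKind.Monitor);
    }
}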

Creating a Simple Talend Spark Job

To start, log in to your Talend Studio and create a simple job that reads a string via a context variable, extracts the first three characters and displays both the original and the extracted string.
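
To make that concrete, the logic of the job boils down to something like the few lines below, which could sit in a tJava component; the context variable name inputString is an assumption for this example.

// Read the value supplied through a context variable
// (the variable name "inputString" is an assumption for this sketch).
String original = context.inputString;

// Extract the first three characters, guarding against shorter strings.
String extracted = original.substring(0, Math.min(3, original.length()));

System.out.println("Original value : " + original);
System.out.println("Extracted value: " + extracted);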

Creating Custom Log Events from Talend Spark Job

Now that we’ve gotten everything set up, we’ll want to leverage the Splunk SDK to create custom (based on each flow in the Talend job) events and send it back to Splunk server. A user routine is written to make Splunk calls and register the event to an index. The Splunk SDK jar is set up as a dependency to the user routines so that leverage Splunk SDK methods

Here is how to quickly build the sample Talend Job below; a sketch of the tJava call that logs each event follows the list:

  • The Splunk configuration is created as context variables and passed to the routine via a tJava component
  • The job is started and its respective event is logged
  • Employee data is read and its respective event is logged
  • Department data is read and its respective event is logged
  • The Employee and Department datasets are joined to form a denormalized dataset and its respective event is logged
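
Calling the routine from each tJava component could then look like the snippet below. The context variable names (splunkHost, splunkPort, splunkUser, splunkPassword, splunkIndex) and the event message are assumptions for illustration; adapt them to the context group you actually define.

// tJava component: log a custom event for this step of the job.
// Context variables splunkHost (String), splunkPort (int), splunkUser,
// splunkPassword and splunkIndex are assumed to exist in the job's context.
routines.SplunkLogger.logEvent(
        context.splunkHost,
        context.splunkPort,
        context.splunkUser,
        context.splunkPassword,
        context.splunkIndex,
        "Talend job step: employee dataset read");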

Switch back to Splunk and search on the index used in the above job; you'll be able to see the events published from the job.

Conclusion:

Using the exercise and process above, it is clear that Talend can seamlessly connect to a customer's enterprise Splunk deployment and push both customized events and complete job log files to Splunk.

The post Talend and Splunk: Aggregate, Analyze and Get Answers from Your Data Integration Jobs appeared first on Talend Real-Time Open Source Data Integration Software.

Making Data a Team Sport: Muscle your Data Quality by Challenging IT & Business to Work Together


Data Quality is often perceived as the sole task of a data engineer. In fact, nothing could be further from the truth. People close to the business are eager to work on and resolve data-related issues, as they're the first to be impacted by bad data. But they are often reluctant to update data, either because Data Quality apps are not really made for them or simply because they are not allowed to use them. That's one of the reasons bad data keeps increasing. According to Gartner, the cost of poor data quality rose by 50% in 2017, reaching an average of $15 million per year per company. This cost will explode in the upcoming years if nothing is done.

But things are changing: Data Quality is now increasingly becoming a company-wide strategic priority involving professionals from different horizons. A sports team is a fair analogy for the key ingredients needed to win any data quality challenge:

  • As in team sports, you will hardly succeed with a solo approach; the challenge has to be tackled from all angles
  • As in team sports, it takes practice for the team to succeed and win
  • As in team sports, Business and IT teams need the right tools, the right approach and the right people to tackle the data quality challenge

That said, it is not as difficult as one might imagine. You just need to take up the challenge and do things the right way from the get-go.

The Right Tools: Fight Complexity with Simple but Interconnected Apps

There is a plethora of data quality tools on the market. Register for a big data tradeshow and you will discover plenty of data preparation, stewardship and other tools offering various ways to fight bad data. But only a few of them cover Data Quality for all. On one side, you will find sophisticated tools requiring deep expertise for a successful deployment.

These tools are often complex and require in-depth training to be deployed. Their user interface is not suitable for everyone, so only IT people can really manage them. If you have short-term data quality priorities, you will miss your deadlines. That would be like trusting a rookie to pilot a jumbo jet whose flight instruments are far too sophisticated for them; it will not end well.

On the other side, you will find simple and powerful apps that are often too siloed to be injected into a data quality process. Even if they successfully focus on business people with a simple UI, they miss a big piece of the puzzle: collaborative data management. And that's precisely the challenge: success lies not only in the tools and capabilities themselves but in their ability to simply talk to one another. For that, you need a platform-based solution that shares, operates and transfers data, actions and models together. That's precisely what Talend provides.

You will confront multiple use cases where it will be next to impossible to manage your data successfully alone. By working together, users empower themselves across the full data lifecycle, giving your business the power to overcome traditional obstacles such as cleaning, reconciling, matching or resolving your data.

The 3-Step Approach to Making Data a Team Sport

It all starts with a simple three-step approach to managing data better together: analyze, improve and control.

Analyze your Data Environment:

Start by getting the big picture and identifying your key data quality challenges. Rather than profiling data on their own with data profiling in Talend Studio, a data engineer could simply delegate that task to a business analyst who knows the customers best. In that case, Data Preparation offers simple yet powerful features that help the team get a glimpse of data quality with in-flight indicators, such as the quality of every column in a dataset. Data Preparation allows you to easily create a preparation based on a dataset.

Let’s take the example of a team wishing to prepare a marketing campaign together with sales but suffering from bad data in the SalesForce CRM System. With Data Preparation, you have the ability to automatically as well as interactively profile and browse business data coming from SalesForce. Connected to Salesforce thru DataPrep, you will get a clear picture of your data quality. Once you identified the problem, you can solve it on your own with simple but powerful operations. But you’ve only just scratched the surface. That’s where you would need the expertise of a Data Engineer to go deeper and improve your data quality flows.

Improve your data with in-depth tools and start remediation by designing stewardship campaigns

Using Talend Studio as your data quality engine, the data engineers in your IT department get access to a wide array of very powerful features. You can, for example, separate the wheat from the chaff using a simple data filter operation such as tFilterRow to identify wrong email patterns or to exclude improper domain addresses from your domain list. At that stage, you need to make sure you isolate bad data within your data quality process. Once filtering is done, you will continue to improve your data, and for that you will call on others for help. Talend Studio works as the pivot of your data quality process: from Talend Studio, you can log in with your credentials to Talend Cloud and extend data quality to users close to the business. Whether you're a business user or a data engineer, Talend Data Stewardship, now in the cloud, allows you to launch cleaning campaigns and solve the bad data challenge with your extended team. This starts with designing your campaign.
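
As an illustration of that filtering step, an advanced-mode condition in a tFilterRow component for catching malformed email addresses might look like the expression below; the column name and the regular expression are assumptions rather than a universal validation rule.

// tFilterRow advanced condition: keep only rows whose email column is
// non-null and matches a basic email pattern (column name and pattern assumed).
input_row.email != null
    && input_row.email.matches("^[\\w.+-]+@[\\w-]+\\.[A-Za-z]{2,}$")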

Using the same UI look and feel as Talend Data Preparation, Talend Data Stewardship offers the same easy-to-use capabilities that business users love. As it's fully operationalized and connected to Talend Studio, it enables IT or business-process people to extend data quality operations to people who are unfamiliar with technical tools but keen on cleaning data, using simple apps that rely on their business knowledge and experience.

That’s the essence of collaborative data management: one app for each dedicated operation but seamlessly connected on a single platform that manages your data from ingestion to consumption.

As an example, feel free to view this webinar to learn how to use cloud-based tools to make data better for all:  https://info.talend.com/en_tld_better_dataquality.html

 

Control your data quality process to the last mile with the whole network of stewards

Once you have designed your stewardship campaign, you need to call on stewards for help and conduct the campaign so that they check the data at their disposal. Talend Data Stewardship plays a massive role here. Unlike other tools on the market, its ability to extend data quality to stewards through UI-friendly applications makes it easier to resolve your data and ensures you have engaged key business contributors in an extended data resolution campaign. They will feel comfortable resolving business data using simple apps.

Engaging business people in your data quality process brings several benefits too. You will get more accurate results, as business analysts have the experience and skills required to choose the proper data. You will soon realize that they feel committed and are eager to cooperate and work with you, as they are ultimately the people most concerned by data quality.

Machine learning acts here as a virtual companion to your data-driven strategy: as stewards complete missing details, the machine learning capabilities of Talend Data Quality solutions learn from them and predict future matching records based on the records the stewards have already resolved. As the system learns from users, it frees you up to pursue other stewardship campaigns and reinforce the impact and control of your data processes.

Finally, you will build a data flow from your stewardship campaign back to your Salesforce CRM system, so that bad data cleaned and resolved by stewards is reinjected into Salesforce. Such operations can only be achieved with simplicity if your apps are connected together on a single platform. You'll also have the opportunity to mark datasets as certified directly in a business app like Data Preparation, so that users accessing the data get cleaned and trusted data to work with.

Remember, this three-step approach is a continuous improvement process that will only get better with time.

To learn more about Data Quality, please download our Definitive Guide to Data Quality

The post Making Data a Team Sport: Muscle your Data Quality by Challenging IT & Business to Work Together appeared first on Talend Real-Time Open Source Data Integration Software.
