
Data Quality in the Real-World: 6 Examples


As more and more businesses brace for the data tsunami, data quality is becoming an ever-more-critical success factor. Indeed, 84% of CEOs are concerned about the quality of the data they base critical business decisions on, according to KPMG’s “2016 Global CEO Outlook.”

Procedures and perceptions about data quality at many organizations—what it is, how to improve it, and how to institutionalize it—haven’t kept pace with its growing importance. Many Talend customers, however, are the exception—they’ve made data quality a priority and are receiving the benefits.

Here are examples from six vertical industries illustrating how a focus on data quality has made a positive impact on business results.

Retail – Right Product, Right Place, Right Time

When Travis Perkins started their data quality journey, company data was siloed and not maintained or validated in any consistent way. As the company moved into a multichannel world with a growing focus on online sales, data quality was key. Relying on assorted employees and suppliers to enter product information resulted in incomplete and inconsistent details—and while data was supposed to be manually reviewed and approved, that didn’t always happen.

Travis Perkins adopted Talend Data Quality to provide a data quality firewall that checks for duplicates, confirms that check digits are valid for barcodes, standardizes data, and maintains a consistent master list of values.
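For readers curious what a “check digits are valid” rule actually involves: GTIN/EAN barcodes carry a final digit computed with a simple mod-10 weighting, so a data quality firewall can reject mistyped codes before they reach the product master. Here is a minimal, illustrative Python sketch of that validation (not Talend’s implementation, just the kind of rule such a firewall enforces):

```python
def gtin_check_digit_valid(code: str) -> bool:
    """Validate the check digit of a GTIN-8/12/13/14 barcode (EAN/UPC mod-10 rule)."""
    if not code.isdigit() or len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    payload, check = digits[:-1], digits[-1]
    # Weights alternate 3, 1, 3, 1, ... starting from the digit nearest the check digit.
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(payload)))
    return (10 - total % 10) % 10 == check

print(gtin_check_digit_valid("4006381333931"))  # True: valid EAN-13
print(gtin_check_digit_valid("4006381333932"))  # False: wrong check digit
```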

Since deploying the solution, 500,000 product descriptions have been checked for data quality by Talend. In addition, 10,000 updates to product entries were made in the first month after Talend went live, and Travis Perkins saw a 30 percent boost in website conversions, due in part to having consistent, accurate product descriptions.

Travis Perkins’ success in standardizing and validating data quality helps dispel the general misperception that “it’s hard to control data quality.”

Construction – Optimizing human capital management

In any international group, communication with employees, collaboration, and identifying and sharing expertise are essential. But at VINCI, the global player in concessions and construction, managing employees turned out to be a matter of consolidating information held in highly complex and diverse IT systems. With as many as 30% of internal emails failing to be sent, VINCI became aware of the need to better manage and continually update data on its 157,000 employees.

VINCI selected Talend MDM to create a common language shared across all divisions, as well as Talend Data Quality to correct this data and return it to all divisions and employees.

Since each division operates independently, it was important to make each division responsible for governing its own data. A support team was put in place to monitor the quality of the data and to identify errors in a monthly report. The group’s HRMs regularly intervene to identify errors and alert employees. Today, the error rate is as low as 0.05%, compared to nearly 8% in the past, and employee information is always up to date.

VINCI’s successful centralized solution counters the misperception that “it’s still hard to make all data operations work together.”

Marketing – Optimizing marketing campaigns with quality data

For DMD Marketing, a pioneer in healthcare digital communications and connectivity, data quality is a key differentiator. Because the principal service DMD provides—emails to health care professionals—is essentially a commodity that can be supplied more cheaply by competitors, DMD needs to maintain its edge in data quality.

The company’s client base needs to know they are targeting the proper healthcare professionals, so having clean data for names, addresses, titles and more is vital. DMD has deployed Talend Cloud Data Preparation and Data Quality to help deliver that.

DMD also chose Talend for its data stewardship and self-service functionality. The company felt it was important to enable its internal users and, to a certain extent, its clients, to get in and see the email data and web tracking data on their own—without needing advanced technical skills. The company is also moving away from manual processes with manual data checks and is instead automating as much as possible, then providing human users access so they can augment and enhance the data.

The ROI for DMD Marketing includes raising the email deliverability rate to a verified 95 percent, reducing email turnaround time from three days to one, and achieving a 50 percent faster time to insight.

DMD Marketing’s success in empowering internal users and clients to monitor data quality proves it’s not true that “data quality software is complicated and just for experts.”

Media – Exposing data to the public

In early 2016, the International Consortium of Investigative Journalists (ICIJ) published the Panama Papers – one of the biggest tax-related data leaks in recent history, involving 2.6 terabytes (TB) of information. It exposed the widespread use of offshore tax havens and shell companies by thousands of wealthy individuals and political officials, including the British and Icelandic Prime Ministers. If that wasn’t fascinating or mind-blowing enough, shortly after came the Paradise Papers, in which 1.4 terabytes of documents were leaked to two reporters at the German newspaper Süddeutsche Zeitung.

To make public a database containing millions of documents, ICIJ raised its requirements for data quality and for documenting data integration procedures. Since millions of people would see the information, a mistake could be catastrophic for ICIJ in terms of reputation and lawsuits.

Talend Data Quality became ICIJ’s preferred solution for cleaning, transforming, and integrating the data it received. It was key for ICIJ’s data team to work efficiently and remotely across two continents and to have each step of the preparation process documented.

The Panama and Paradise Papers investigations found an extraordinary global audience, which was unprecedented for ICIJ and its media partners. Within two months of the Panama Papers’ publication, ICIJ’s digital products received more than 70 million page views from countries all around the world. In the six weeks after public disclosure of the Paradise Papers, Facebook users viewed posts about the project a staggering 182 million times. On Twitter, people liked or retweeted content related to the Paradise Papers more than 1.5 million times.

This series of investigations, in which ICIJ and its partners used mass data to examine offshore-related matters, advances public knowledge to yet another level.

ICIJ’s public database, with unimpeachable data quality, shows it’s not true that “data quality is just for traditional data warehouses.”

Charity – Saving lives

Save the Children UK (SCUK) saves lives by preparing for and responding to humanitarian emergencies. The charity has been using Talend Data Quality to dedupe data being imported into the database of donors, and to review existing CRM data.

By reducing duplicates and improving data quality, the charity ensures the information it has on an individual is as accurate as possible. That, in turn, aids in ensuring that donors receive only the information they ask for, and allows SCUK to manage the flow of messages to them in a truly relevant manner.

Improved data quality also helps SCUK avoid alienating its donors. For example, if the charity holds three records, for a J. Smith, a John Smith, and a J. Smyth, with slight variations in the addresses on file, and it’s all the same person, it might mail him three times for the same campaign. That costs SCUK money and may prompt the donor to say they do not wish to be contacted anymore. In addition, efficiently importing higher-quality data supports better, faster reporting by analysts and gives SCUK greater insight into donor behavior and motivations.
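As an illustration of what that duplicate detection involves (a hedged sketch, not SCUK’s or Talend’s matching logic), one common approach is to normalize the names, compare them with a similarity score, and only flag candidates that also share another attribute such as a postcode:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lower-case and strip punctuation so 'J. Smith' and 'J Smith' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def likely_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Flag two donor records as probable duplicates when the postcodes match
    and the normalized names are sufficiently similar."""
    if rec_a["postcode"] != rec_b["postcode"]:
        return False
    score = SequenceMatcher(None, normalize(rec_a["name"]), normalize(rec_b["name"])).ratio()
    return score >= threshold

print(likely_duplicate({"name": "J. Smith", "postcode": "SW1A 1AA"},
                       {"name": "J. Smyth", "postcode": "SW1A 1AA"}))   # True
print(likely_duplicate({"name": "John Smith", "postcode": "SW1A 1AA"},
                       {"name": "Jane Brown", "postcode": "SW1A 1AA"}))  # False
```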

SCUK’s commitment to an ongoing campaign to maintain a clean, accurate donor record shows that ensuring data quality is a process and not an event, and that “once you solve your data quality problem, you’re done,” is an outdated misperception.

Transportation – Being compliant with regulations

A world leader in transporting passengers and cargo, Air France-KLM needs high-quality data to meet its goal of catering to every one of its customers. It needs accurate phone numbers and emails, which are essential for booking flights, and it needs to reconcile online and offline information, since visitors to its sites and applications are most often signed in to their personal accounts. Data management was a challenge, and Air France-KLM resolved to get organized in order to ensure data quality, respect the privacy of its customers, and offer customers and employees a clear benefit.

In addition, Air France-KLM collects and processes personal data (PII, or personally identifiable information) concerning passengers who use the services available on its website, its mobile site, and its mobile applications. The company is committed to respecting privacy protection regulations regarding its passengers, loyalty program members, prospects and website visitors. All personal data processing is therefore carried out with Talend Data Masking, which makes it possible to anonymize certain sensitive data and make it unidentifiable in order to prevent unauthorized access.
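As a generic illustration of what such masking can involve (not the Talend Data Masking component itself, and the example values below are made up): sensitive fields can be pseudonymized with a salted one-way hash so records remain joinable, or partially redacted so they stay useful for analysis without exposing the person.

```python
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a salted one-way hash: records stay joinable,
    but the original value cannot be read back."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Hide the local part of an email while keeping the domain for analysis."""
    local, _, domain = email.partition("@")
    return "*" * len(local) + "@" + domain

print(pseudonymize("FF123456", salt="s3cr3t"))   # a 16-character hex token
print(mask_email("a.martin@example.com"))        # ********@example.com
```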

Every month, a million pieces of data are corrected with Talend Data Quality—proof of the essential role the Talend solution plays in ensuring Air France-KLM can deliver on its promise to cater to every customer. Their success also proves that it’s no longer true that “data quality falls under IT responsibility”; rather, it’s a business priority with such goals as respecting customer privacy.

Conclusion

All these Talend customers have moved beyond existing misperceptions about data quality and have implemented solutions that deliver a competitive advantage, enable them to meet their responsibilities as data stewards, and contribute to the success of their organizations.

 

Related links

The Importance Of Data Quality — Good, Bad Or Ugly – Forbes

We need a new era of data responsibility – World Economic Forum Annual Meeting

Better Data Quality Starts Today

What Exactly is Talend Data Stewardship and Why Do You Need It?



The Rise of Ad Hoc and Citizen Integrators


In the past few years, there has been a shift in the data industry, leading to the emergence of a new category of data citizens: the ‘ad hoc’ or ‘citizen’ integrators. With these new personas adding to the (already long) list of data workers with access to corporate information, companies need to rethink the way they approach their data security and data governance strategies. Unlike data engineers, this new class of ‘citizens’ doesn’t necessarily use data integration as part of their day-to-day jobs, but it still comes into play every so often.

So who exactly are these integrators?  According to Gartner, ad hoc integrators can include line-of-business developers, such as application developers, who may need to integrate data for a portion of their development, but data integration may not necessarily be an ongoing daily task. Citizen integrators include data scientists, data analysts, and other data experts within the business who may need to integrate data for their main job: analytics.

Innovation Brings the Opportunity for Insight

With an increasing number of data sources, there is a corresponding increase in the number of use cases that allow every part of the business to become data-driven. Large-scale adoption of streaming data processing technologies like Kafka has increased the velocity at which data is absorbed by businesses, making IoT and clickstream data analysis feasible.

At the same time, open source big data technologies like Apache Spark have provided a framework for processing and analyzing growing volumes of data. Even cloud service providers such as Amazon Web Services and Microsoft Azure have played a part in encouraging company-wide data-driven practices by enabling businesses of every size to store, process, and explore more data without the monetary and resource investment previously required with on-premises technologies.

All of these innovations have created an environment that nurtures and encourages nearly every segment of the business—from supply chain to marketing—to utilize their data to inform choices that range from daily tasks to major strategic initiatives. Furthermore, deriving accurate insights and acting on those insights has become a competitive advantage in every market. In other words, having inaccurate analytics or even delayed insights can put you in a vulnerable position—a position where your competitors can overtake your market share by acting on industry and customer needs that you haven’t yet identified.

In order to remain competitive, companies have hired armies of data scientists and data analysts who are tasked with identifying trends and articulating actionable insights to each segment of the business.

The Non-Traditional Integrator is Born

IT is often unable to manage both the increasing number of data sources and the exponentially growing number of requests from ad hoc and citizen integrators to prepare data sets for analytics. Unfortunately, IT typically becomes a bottleneck where ad hoc and citizen integrators can be left waiting days or even weeks to access the data they need, which is not acceptable in a world where faster time-to-insight separates an industry’s leaders from its stragglers.

As a result, line-of-business developers, data scientists, and data analysts are left to integrate and prepare their own data if they want to work with it immediately instead of waiting days or weeks to start their analytics.

The Future of Integration: Enabling New Integrators Without Losing Oversight

Now that we understand ad hoc and citizen integrators are assuming an increasingly important role in the world of data integration, where do we go from here?

Since ad hoc and citizen integrators have often been neglected or flat out ignored by IT due to lack of resources, many have resorted to finding their own tools for the collection, cleansing, and preparation of their data for analysis. And often, they use tools that are outside IT’s purview and thus outside of IT’s oversight and governance. While this gives the ad hoc and citizen integrators the ability to prepare their data quickly, it also opens up the risk that comes with unmonitored, ungoverned data.

In a time when data security and data governance are often the subject of front-page news, it is paramount that every company control who is accessing what data, what each person is doing with that data, and how they are storing it. To achieve this level of governance, companies need to focus their attention on three things: people, process, and products.

First, people working with data within your organization need to understand that data management has become a team sport, and just as in any team, every person needs to understand their role. Just as important, each person needs to understand how to interact with and contribute to the team to produce the most accurate data (and, subsequently, insights) possible. Because data management often spans different teams, it is important that those teams agree on a process for these interactions. Last, it is important to find a product that helps you enable all of your data team members and operationalize the processes agreed between them, all while being governed by IT.

With the changing data integration and analytics landscapes, the people who are interacting with the data, the processes they use to interact with each other, and the products that can support them are all changing. Is your organization prepared to enable and empower your new integrators?


5 Strategies CIOs Should Consider for a Successful Cloud Migration


With the growing adoption of cloud-based IT infrastructures, the proliferation of mobile and IoT devices, and the rise of social media, companies of all sizes, across all industries, are amassing huge quantities of data of differing variety, velocity, veracity, and validity.

Shifting to the Cloud

For this reason, as more organizations consider shifting their entire data platforms to the cloud, IT leaders need a carefully charted approach that ensures all data is well managed, governed, cleansed, and protected. Oftentimes, companies operate under the misconception that once their data is in the cloud, the chosen cloud service provider assumes responsibility and accountability for that data. This couldn’t be further from the truth. When migrating to the cloud, companies need to take even more ownership of not only their data’s quality but also its compliance, protection, accessibility, and trustworthiness.

Another misconception about moving data to the cloud is that it all lives in a single, easily accessible place—but that’s not always the case. No doubt, data is a company’s most valuable asset, but it provides little insight when stored in organizational or technology silos. Data only becomes highly valuable when combined with existing customer, product, and partner data to drive informed decision making.

Ready For More? Download How Leading Enterprises Achieve Business Transformation with Talend and AWS User Guide now.


5 Strategies CIOs Should Consider for Cloud Migration

With that in mind, here are a few key concepts CIOs should keep in mind as they strategize their transition to the cloud:

How do you transition to the cloud at scale (moving beyond the POC stage)?

Many get caught flat-footed by thinking short term rather than long term. A short-term approach often handicaps innovation and incurs increasing costs over time; it could mean, for example, running a cloud POC using a single, custom-coded application rather than a suite of tool-built apps that can be quickly and easily applied across different teams and systems. The latter approach gives organizations a far better indication of what it will take for a data platform to perform successfully in the cloud.

How do you keep your environment on the cutting edge?

Innovation in the cloud is incredibly rapid, and CIOs need an approach that enables agility to continually leverage the latest advances in big data cloud technology over time. It’s important to think about building cloud systems in a portable way, so emerging technologies can be plugged in as they arise, and allow cloud environments to keep pace with the growing needs of a business. For example, advances in machine learning, AI, and database technologies are accelerating time-to-outcome, and CIOs need to ensure that data platforms can adapt with new capabilities and to new standards as the latest technologies come to market and the industry continues to evolve.

How do you maintain necessary resources & control maintenance costs?

As you begin your journey to the cloud, you need to think about your support skills and maintenance costs over time. Will your developers be able to maintain the cloud environment you built over the long term? What skills and knowledge will they need to maintain the cloud environments over time? An important takeaway is that you wouldn’t want to do the initial cloud migration with the top five percent of coders at an organization—you need an approach that will scale. The great thing about easy-to-use tools like Talend is that your data engineers can be up and running with very little training. However, CIOs should also be thinking about creating standardized best practices that can be reused over time, which will make cloud environments easier to maintain in the long run. It will also make the implementation more scalable because you’ll be less at risk of being bottlenecked by your own resources.

Make sure to have a separate strategy for your data science team. 

Data scientists are incredibly data-hungry and eager to run the kind of highly complex machine learning algorithms that demand time and cloud resources. While it’s important for that work to happen as they figure out which metrics matter most to the business, it is notable for two reasons:

As a CIO, you should strongly consider having a sandbox environment for data scientists to run their tests in a contained environment, otherwise, it may impact performance and resources for everyone else.

Be sure to have a data retirement/elimination plan. You don’t want to create a single cloud data lake containing all your data with machine learning tools on top—that just multiplies your data and creates a data swamp. We’ve seen a healthcare research and consulting company increase data volumes by 50 percent within just 10 days because of the ML algorithms they were running on their data lake. That amount of data and resources isn’t sustainable—not to mention expensive. Thus, it’s important to retire data at a rate that makes sense for your business objectives and compliance regulations.

Make data governance and quality the cornerstones of your cloud data strategy.

It’s critical to take a business-outcome-based approach to your cloud journey, and that requires data governance and data quality strategies to sit at the core. I think we can all agree that if you’re sitting on a trove of data that is not qualified, governed, or trusted, then you’re working with garbage. You can’t possibly scale any kind of enterprise data strategy if you don’t have quality data at the start—but qualifying data can be a time-consuming process. In fact, current industry stats indicate organizations spend more than 60 percent of their time qualifying or preparing data, leaving little time for actual analysis. Hence, a new wave of self-service tools has emerged that enables business users to access, merge, cleanse, and qualify data faster than ever before. Using self-service tools can help companies take a more collaborative approach to data quality, wherein the business stakeholders who are most familiar with the information can help organize, cleanse, govern, and keep it up to date.

These are just a few of the key considerations IT leaders should bear in mind as they look to migrate their data to the cloud. What’s most important is to start your cloud strategy with your desired business outcome in mind and work your way back from there. In doing so, IT leaders can better ensure that the end result will be an environment that eliminates silos and leads to real-time, trusted insights that can help anticipate customer needs and keep pace with a constantly changing, dynamic marketplace.


Introducing Talend Data Streams: Self-service Streaming Data Integration for Everyone


I am very excited to introduce Talend Data Streams, a brand new, cloud-native application that enables you to get streaming data integrations up and running in minutes, all while providing unparalleled portability powered by Apache Beam.

<<Download Talend Data Streams for AWS Now>>

Why Talend Data Streams?

One of the biggest challenges that businesses face today is working with all types of streaming paradigms, as well as dealing with new types of data coming from everywhere: social media, the web, sensors, the cloud, and so on. Companies see real-time data as a game changer, but it’s still a challenge to actually get there.

Take IoT data, for example. Data from sensors or internet-connected “things” is always on; the stream of data is non-stop and flowing all the time. The typical batch approach to ingesting or processing this data is obsolete, as the data has no start or stop.

Different devices also produce heterogeneous data formats. For example, there could be hundreds of sensors on a single wind turbine to monitor and collect data about oil level, the position and sway of the tower, pressure on the blades, temperatures, and so on. The sensors themselves could run different firmware or come from different manufacturers; there is often no standard for IoT devices. And because of this mix of devices, the schema of the data is prone to change unpredictably, which can easily break data pipelines. Even if you do get through all of that, IT still needs to deliver the data to business owners.

In a recent data scientist survey, over 35% of data scientists reported that the unavailability of data and the difficulty of accessing it are top challenges for them.1 Many business users would likely agree, and they often resort to an ad hoc approach to work with their cloud applications and data sources when IT can’t.

Introducing Talend Data Streams

As we saw this scenario play out over and over again with customers and prospects, we knew we could help. That’s why we built Talend Data Streams. So what is it?

Talend Data Streams is a self-service web UI, built in the cloud, that makes streaming data integration faster, easier, and more accessible, not only for data engineers, but also for data scientists, data analysts and other ad hoc integrators so they can collect and access data easily.

It is built with the goal of helping our customers further close the gap between IT and business teams, so they can enable more users and more use cases.

<<Download Talend Data Streams for AWS Now>>

So what makes Data Streams so unique? Here are a few highlights I really wanted to share with Talend users:

Live Preview

The live preview in Talend Data Streams allows you to do incremental data integration design, which we call “continuous design”.

You no longer need to design the whole pipeline, compile, deploy, run, and then test and debug to see if it actually works. It is similar to the Read-Evaluate-Print-Loop (REPL) concept often used in data science. You can see your data change in real time, at every step of your design process, in the exact same design canvas. This dramatically reduces development time and shortens the design cycle.

Schemaless Design

Talend Data Streams is completely schemaless. And that brings benefits for both design time and run time.

Designers can create and refine pipelines more easily because schemas are discovered dynamically, and enforcement is optional. Pipelines are also more resilient to schema changes. For example, imagine a scenario where you are streaming from a message queue. Several message structures may co-exist, such as sensor and machine data. A schemaless design allows those pipelines to automatically adapt to multiple data variants during ingestion, as opposed to creating as many pipelines as there are variants.
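A hedged sketch of the idea in plain Python (not the Data Streams engine, and the field names are invented for illustration): instead of binding a pipeline to one fixed schema, map whichever known payload variants arrive onto a single canonical record, so one pipeline can absorb several message structures.

```python
def to_canonical(record: dict) -> dict:
    """Map several possible sensor payload variants onto one canonical shape,
    so a single pipeline can ingest them all."""
    return {
        "device_id": record.get("device_id") or record.get("sensor") or "unknown",
        "temperature": record.get("temperature") or record.get("temp_c"),
        "timestamp": record.get("timestamp") or record.get("event_time"),
    }

print(to_canonical({"sensor": "turbine-7", "temp_c": 41.3, "event_time": "2018-05-01T12:00:00Z"}))
print(to_canonical({"device_id": "turbine-8", "temperature": 39.8, "timestamp": "2018-05-01T12:00:05Z"}))
```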

Unparalleled Portability with Apache Beam

Talend has long been a leader in big data, and our open source approach allows us to help our customers run on the best data framework of their choice, and to help them move to the next great framework when it comes along. A typical example is when we switched our code generator from MapReduce to Spark. Now we are pushing this model to a whole new level by embracing Apache Beam.

Apache Beam is an open source framework led by Google, Talend, data Artisans, PayPal, and others. At its core, Apache Beam is an abstraction layer that provides a portable data pipelining framework. It decouples design from runtime and merges batch and streaming into a single data pipeline semantic. Because Talend Data Streams is powered by Apache Beam, it gives customers unparalleled portability. [Click here to learn more about Apache Beam]

So you can plug the same pipeline into a bounded source, like a SQL query, or an unbounded source, such as a message queue, and it will work as a batch pipeline or a streaming pipeline based simply on the source of the data. Beyond that, you can even choose to run natively in the cloud platform where your data resides, truly achieving “design once and run anywhere” and gaining portability across multiple clouds.
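For readers who want to see what that portability looks like in code, here is a minimal Apache Beam pipeline in Python (an illustrative sketch, not something generated by Data Streams; the file names and the temperature threshold are assumptions). The same pipeline definition runs on the local DirectRunner or on Spark, Flink, or Dataflow simply by changing the `--runner` option.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(runner: str = "DirectRunner") -> None:
    """Build one pipeline; the runner option decides where it executes."""
    options = PipelineOptions(["--runner", runner])
    with beam.Pipeline(options=options) as p:
        (p
         | "Read"    >> beam.io.ReadFromText("sensors.csv")            # bounded source
         | "Parse"   >> beam.Map(lambda line: line.split(","))
         | "Too hot" >> beam.Filter(lambda rec: float(rec[2]) > 75.0)  # keep warning readings
         | "Format"  >> beam.Map(",".join)
         | "Write"   >> beam.io.WriteToText("hot_sensors"))

if __name__ == "__main__":
    run()
```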

Embedded Python Component

Last but not least, we wanted Talend Data Streams to be an app that could embrace the data scientist and coder community. So we embedded a Python component to allow them to script or code with Python for customizable transformations.
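The exact component configuration isn’t shown here, but the row-level logic you would script is ordinary Python. A hedged sketch, with field names assumed purely for illustration:

```python
def transform(record: dict) -> dict:
    """Custom row-level transformation: convert units and tidy an identifier."""
    out = dict(record)
    out["temperature_c"] = round((record["temperature_f"] - 32) * 5.0 / 9.0, 2)
    out["device_id"] = record["device_id"].strip().upper()
    return out

print(transform({"device_id": " turbine-7 ", "temperature_f": 104.0}))
# {'device_id': 'TURBINE-7', 'temperature_f': 104.0, 'temperature_c': 40.0}
```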

Looking to bridge the IT & Business Gap, and put more data to work?

What’s even better is that Data Streams isn’t a standalone app or a single point solution. It is part of the Talend Data Fabric platform, helping companies break down barriers, collaborate like never before, deliver data they can trust, and make data a team sport. How so?

Data pipelines, data sets, and metadata can all be shared across the Talend platform and with other apps. This dramatically increases the reusability of your data and, more importantly, brings your IT and business teams together to achieve collaborative data management and better governance.

For ad hoc integrators, users like data scientists can ingest data they need more easily without going to IT all the time.

And of course, IT gets all the other benefits of Talend Data Fabric, to be in control of data usage, so it’s easy to audit, ensure privacy, security and data quality, and so on.

We are excited to bring a free edition of the product to the market via AWS Marketplace, so anyone with an AWS account can launch and use it immediately, with zero software cost. You can find more details of the product features on https://www.talend.com/products/data-streams/data-streams-free-edition/

Launch now: www.talend.com/datastreams-aws/

Source:

  1. The State of Data Science & Machine Learning 2017 https://www.kaggle.com/surveys/2017


Building data lakes for GDPR compliance


Authored by Jean-Michel Franco, Sr. Director of Product Marketing at Talend and Pinakin Patel, Senior Director EMEA Solutions Engineering at MapR.

If there’s one key phenomenon that business leaders across all industries have latched onto in recent years, it’s the value of data. The business intelligence and analytics market continues to grow at a massive rate, with Gartner forecasting it will reach $18.3 billion in 2017, as organizations invest in the solutions they hope will enable them to harvest the potential of that data and disrupt their industries.

But while companies continue to hoard data and invest in analytics tools that they hope will help them determine and drive additional value, the General Data Protection Regulation (GDPR) is forcing best practices in the capture, management and use of personal data.

The European Union’s GDPR stipulates stringent rules around how data must be handled. It impacts the entire data lifecycle: organizations must have an end-to-end understanding of their personal data, from its collection and processing, to its storage and, finally, its destruction.

As companies scramble to make the May 25th deadline, data governance is a key focus. But organizations cannot treat the new regulations as just a box to check. Continuous compliance is required, and most organizations are having to create new policies that will help them achieve a privacy-by-design model.

<<Your Single Source For GDPR Preparedness – Visit the Talend GDPR Page>>

Diverse data assets

One of the great challenges in securely managing data is the rapid adoption of data analytics across businesses, as it moves from an IT office function to a core asset for business units. As a result, data often flows in many directions across the business, so it becomes difficult to understand the data about the data, such as its lineage (where it was created and how it got there).

Organizations may have personal data in many different formats and types (both structured and unstructured), across many different locations. Under the GDPR, it will be crucial for them to know and manage where personal data sits across the business. While no one is certain exactly how the GDPR will be enforced, organizations will need to be able to demonstrate, at a moment’s notice, that their data management processes are continually in compliance.

With the diverse sources and banks of data that many organizations have, consolidating this data will be key to effectively managing their compliance with the GDPR. With the numerous different types of data that must be held across an organization, data lakes are a clear solution to the challenge of storing and managing disparate data.

Pool your data

A data lake is a storage method that holds raw data, including structured, semi-structured and unstructured data. The structure and requirements of the data are only defined once the data is needed. Increasingly, we’re seeing data lakes used to centralize enterprise information, including personal data that originates from a variety of sources, such as sales, CX, social media, digital systems and more.

Data lakes, which use tools like Hadoop to track data within the environment, help organizations bring all their data together in one place where it can be maintained and governed collectively. The ability to store structured, semi-structured, and unstructured data is crucial to the value of this approach for consolidating data assets, compared to data warehouses, which primarily maintain structured, processed data. Enabling organizations to discover, integrate, cleanse, and protect data that can then be shared safely is essential for effective data governance.

Beyond the view across the full expanse of the data lake, organizations can look upstream to identify the sources of data from before they flowed into the lake. That way, organizations can track specific data back to its source – like the CX or marketing applications – providing end-to-end visibility across the entire data supply chain so that it can be scrutinized and identified as necessary.

This end-to-end view of personal data is crucial under the GDPR, enabling businesses to identify the quality and point of origin for all their information. Further to enabling organizations to store, manage and identify the source of all their data, data lakes provide a cost-effective means for organizations to store all their data in one place. On the other hand, managing this large volume of data in a data warehouse has a far higher TCO.

Setting the foundations

While data lakes currently present the best approach to data management and governance for GDPR compliance, they will not be the last stop on organizations’ journey towards innovative, efficient, and compliant data management. The data storage approaches of the future will be built with the new regulatory climate in mind and will be created to serve and adhere to the challenges it presents.

However, with the demand on organizations to create data policies and practices that will support the compliance of their future data storage and analytics endeavors, it is clear that businesses need to start refining processes and policies that will lay the foundations for compliant data innovation in the future. Being able to quickly and easily identify and access all data, with a clear understanding of its source and stewardship, is now the minimum standard for the management of personal data.

The clock is ticking

Time is running out for many organizations on achieving GDPR compliance, with just weeks until its enforcement. However, companies must take a long-term view and build a data storage model that will enable them to consolidate, harmonize and identify the source of their data in compliance with the GDPR.

GDPR is adding new dimensions to customer demands: customers now value trust and transparency and will vote with their feet. They will follow companies that can deliver personalized interactions while letting them take full control over their personal data. Ultimately, companies that establish a system of trust at the core of their customer and/or employee relationships will win in the digital economy.



 

Next Steps: Free GDPR Assessment Tool

On-demand webinar: Using a GDPR Data Hub to Protect Personal Data

Download the solution brief: Get Ahead Of General Data Protection Regulation (GDPR) With MapR And Talend

 


Successful Methodologies with Talend – Part 2


Let’s continue our conversation about Job Design Patterns with Talend.  In my last blog, Successful Methodologies with Talend, I discussed the importance of a well-balanced approach:

  • Having a clearly defined ‘Use Case’,
  • Incorporating the proper ‘Technology’
  • Defining the way everyone works together with a ‘Methodology’ of choice (Waterfall, Agile, JEP+, or maybe some hybrid of these).

Your Talend projects should be no exception.  Also, when you follow and adopt Best Practices (many are discussed in my previous blogs), you dramatically increase the opportunity for successful Talend Job Designs.  This leads to successful Talend projects, and joyful cheers!

With all these fundamentals in place, it seems like a good time for me to elaborate on Job Design Patterns themselves.  Brief descriptions of several common patterns are listed in Part 1 of this series.  Now, let’s take a deeper dive.  First, however, to augment our discussion on the ‘Just-Enough-Process’ methodology, I want to reinforce the importance of team communication.  An Agile team (called a Scrum team) is composed of several players:

  • A Scrum-Master
  • Stakeholders
  • Business Analysts
  • Developers
  • Testers

A successful ‘Sprint’ follows the agile process with defined milestones and communicates at the Scrum level through well-defined tasks, using tracking tools like Jira from Atlassian.  I’ll assume you know some basics about the Agile Methodology; in case you want to learn more, here are some good links:

Agile Modeling

The Agile Unified Process

Disciplined Agile Delivery

 

JEP+ Communication Channels

To understand how effective communication can propel the software development life cycle, JEP+ defines an integrated approach. Let’s walk through the ideas behind the diagram on the left. The Software Quality Assurance (SQA) team provides the agnostic hub of any project sprint conversation.  This is effective because the SQA team’s main purpose is to test and measure results.  As the communication hub, the SQA team can effectively, and objectively, become the epicenter of all Scrum communications.  This has worked very well for me on both small and large projects.

As shown, all key milestones and deliverables are managed effectively.  I like this stratagem, not because I defined it, but because it makes sense.  Adopt this style of communication across your Scrum team, using tools of your choice, and it will likely increase your team’s knowledge and understanding across any software development project, Talend or otherwise.  Let me know if you want to learn more about JEP+; maybe another blog?

Talend Job Design Patterns

Ok, so let’s get into Talend Job Design Patterns.  My previous blog suggested that, of the many elements in a successful approach or methodology for Talend developers, one key element is Job Design Patterns.  What do I mean by that?  Is it a template-based approach to creating jobs?  Well, yes, sort of!  Is it a skeleton, or jumpstart job?  Yeah, that too!  Yet, for me, it is more about the business use case that defines the significant criteria.

Ask yourself: what is the job’s purpose?  What does it need to do?  From there you can devise a proper approach (or pattern) for the job’s design.  Since there are many common use cases, several patterns have emerged for me, where the job design depends greatly upon what result I seek.  Unlike some other ETL tools available, Talend integrates both the process and data flow into a single job.  This allows us to take advantage of building reusable code, resulting in sophisticated and pliable jobs.  Creating reusable code is therefore about the orchestration of intelligent code modules.

It is entirely possible, of course, that job design patterns will vary greatly from one use case to another.  This reality should force us to think carefully about job designs and how we should build them.  It should also promote consideration of what can be built as common code modules, reusable across different design patterns.  These can get a bit involved, so let’s examine them individually.  We’ll start with some modest ones:

LIFT-N-SHIFT: Job Design Pattern #1

This job design pattern is perhaps the easiest to understand.  Extract the data from one system and then directly place a copy of the data into another.  Few (if any) transformations are involved; it’s simply a 1:1 mapping of source to target data stores.  Examples of possible transformations may include a data type change, a column length variation, or perhaps adding or removing an operative column or two.  Still, the idea of a ‘Lift-n-Shift’ is to copy data from one storage system to another, quickly.  Best practices assert that using tPreJob and tPostJob components, appropriate use of tWarn and tDie components, and a common exception handler ‘Joblet’ are highly desirable.

Here is what a simple ‘Lift-n-Shift’ job design pattern may look like:

 

Let’s go through some significant elements of this job design pattern:

  • The layout follows a Top-to-Bottom / Left-to-Right flow: Best Practice #1
  • Use of a common ‘Joblet’ for exception handling: Best Practice #3
  • Entry/Exit Points are well defined: Best Practice #4
    • tPreJob and tPostJob components are in place to initialize and wrap up job design
    • tDie and tWarn components are used effectively (not excessively)
    • Also, notice the green highlighted components; these entry points must be understood
  • Exception handling is incorporated: Best Practice #5
  • The process completes all selected data and captures rejected rows: Best Practice #6
  • Finally, the Main-Loop is identified as the main sub-job: Best Practice #7

If you haven’t read my blog series on Job Design Patterns & Best Practices, you should.  These call outs will make more sense!

It is also fair to say that this job design pattern may become more complex depending upon the business use case and the technology stack in place.  Consider using a Parent/Child orchestration job design (Best Practice #2).  In most cases, your job design patterns will likely keep this orchestration to a minimum, by instead using the technique I describe for the tSetDynamicSchema component (Best Practice #29).  This approach, with the source information schema details, may even address some of the limited transformations (i.e., data type & size) required.  Keep it simple; make it smart!

MIGRATION: Job Design Pattern #2

A ‘Migration’ job design pattern is a lot like the ‘Lift-n-Shift’ pattern, with one important difference: transformations!  Essentially, when you are copying and/or moving data from source systems to target data stores, significant changes to that data may be required, thus becoming a migration process.  The source and target systems may be completely different (e.g., migrating an Oracle database to MS SQL Server).  Therefore, the job design pattern expands on the ‘Lift-n-Shift’ and must accommodate some or many of the critical steps involved in converting the data.  This may include splitting or combining tables, and accounting for differences between the systems in data types and/or features (new or obsolete).  Plus, you may need to apply specific business rules to the process and data flow in order to achieve the full migration effect.

Here is what a simple ‘Migration’ job design pattern may look like:

Take notice of the same elements and best practices from the ‘Lift-n-Shift’ job design plus some important additions:

  • I’ve created an additional, separate DB connection for the lookup!
    • This is NOT an option; you can’t share the SELECT or INSERT connections simultaneously
    • You may, optionally, define the connection directly in the lookup component instead
    • When multiple lookups are involved I prefer the direct connection strategy
  • The tMap component opts for the correct Lookup Model: Best Practice #25
    • Use the default ‘Load Once’ model for best performance and when there is enough memory on the Job Server to execute the lookup of all rows in the lookup table
    • Use the ‘Reload at Each Row’ model to eliminate any memory limitations, and/or when the lookup table is very large, and/or not all records are expected to be looked-up

It is reasonable to expect that ‘Migration’ job designs may become quite complex.  Again, the Parent/Child orchestration approach may prove invaluable.  The tSetDynamicSchema approach may dramatically reduce the overall number of jobs required.  Remember also that parallelism techniques may be needed for a migration job design.  Review Best Practice #16 for more on those options.
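Talend’s tMap configures these lookup models graphically, but the trade-off is easy to see in plain code. A minimal sketch (illustrative field names, not generated Talend code): ‘Load Once’ caches the whole lookup table in memory, while ‘Reload at Each Row’ queries per row and so tolerates lookup tables that do not fit in memory.

```python
def enrich_load_once(rows, lookup_rows):
    """'Load Once' style: read the lookup table once into memory, then join in the loop."""
    by_key = {r["product_id"]: r["category"] for r in lookup_rows}
    for row in rows:
        row["category"] = by_key.get(row["product_id"], "UNKNOWN")
        yield row

def enrich_reload_each_row(rows, query_one):
    """'Reload at Each Row' style: fetch one key at a time, trading speed for a
    tiny memory footprint when the lookup table is very large."""
    for row in rows:
        row["category"] = query_one(row["product_id"]) or "UNKNOWN"
        yield row
```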

COMMAND LINE: Job Design Pattern #3

The ‘Command Line’ job design pattern is very different.  The idea here is that the job works like a command line executable having parameters which control the job behavior.  This can be very helpful in many ways, most of which will be highlighted in subsequent blogs from this series.  Think of the parent job as being a command.  This parent job validates argument values and determines what the next steps are. 

In the job design pattern below we can see that the tPreJob component parses the arguments for required values and exits when they are missing.  That’s all.  The main body of the job then checks for specific argument values and determines the process flow.  Using the ‘RunIF’ trigger we can control the execution of a particular child job.  Clearly, you might need to pass down these arguments into the child jobs where they can incorporate additional validation and execution controls (see Best Practice #2).

Here is what a simple ‘Command Line’ job design pattern may look like:

There are several critical elements of this job design pattern:

  • There are no database connections in this top-level orchestration job
    • Instead, the tPreJob checks to see if any arguments have been passed in
    • You may validate values here, but I think that should be handled in the job body
  • The job body checks the arguments individually using the ‘RunIF’ trigger, branching the job flow
  • The ‘RunIF’ check in the tPreJob triggers a tDie component and exits the job directly
    • Why continue if required argument values are missing?
  • The ‘RunIF’ check on the tJava component triggers a tDie component but does not exit the job
    • This allows the tPostJob component to wrap things up properly
  • The ‘RunIF’ checks on the tRunJob components trigger only if the return code is greater than 5000 (see Best Practice #5: Return Codes) but do not exit the job either

In a real world ‘Command Line’ use case, a considerable intelligence factor can be incorporated into the overall job design, Parent/Child orchestration, and exception handling.  A powerful approach!
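The pattern itself is implemented with Talend components (tPreJob, ‘RunIF’ triggers, tRunJob) rather than hand-written code, but the control flow it encodes is easy to express. Here is a minimal Python sketch of the equivalent logic, with argument names invented purely for illustration:

```python
import sys

REQUIRED = ("mode", "run_date")

def parse_args(argv):
    """Parse --key=value arguments into a dict (stand-in for Talend context variables)."""
    return dict(a.lstrip("-").split("=", 1) for a in argv if "=" in a)

def main(argv):
    args = parse_args(argv)
    missing = [k for k in REQUIRED if not args.get(k)]
    if missing:                                   # tPreJob + RunIF + tDie: exit immediately
        sys.exit(f"Missing required argument(s): {', '.join(missing)}")
    if args["mode"] == "extract":                 # RunIF branches choose which child job runs
        return run_child("extract_job", args)
    if args["mode"] == "load":
        return run_child("load_job", args)
    return warn_and_continue(f"Unknown mode '{args['mode']}'")  # tWarn: log, let tPostJob wrap up

def run_child(name, args):
    print(f"running {name} with {args}")
    return 0

def warn_and_continue(message):
    print(f"WARN: {message}")
    return 1

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```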

DUMP-LOAD: Job Design Pattern #4

The ‘Dump-Load’ job design pattern is a two-step strategy. It’s not too different from a ‘Lift-n-Shift’ and/or ‘Migration’ job design. This approach is focused upon the efficient transfer of data from a source to a target data store.  It works quite well on large data sets, where replacing SELECT/INSERT queries with write/read of flat files using a ‘BULK/INSERT’ process is likely a faster option.

 

Take notice of several critical elements for the 1st part of this job design pattern:

  • A single database connection is used for reading a CONTROL table
    • This is a very effective design allowing for the execution of the job based upon externally managed ‘metadata’ records
    • A CONTROL record would contain a ‘Source’, ‘RunDate’, ‘Status’, and other useful process/data state values
      • The ‘Status’ values might be: ‘ready to dump’ and ‘dumped’
      • It might even include the actual SQL query needed for the extract
      • It may also incorporate a range condition for limiting the number of records
      • This allows external modification of the extraction code without modifying the job directly
    • Key variables are initialized to craft a meaningful, unique ‘dump’ file name
      • I like the format {drv:}/{path}/YYYYMMDD_{tablename}.dmp (see the sketch after this list)
    • With this job design pattern, it is possible to control multiple SOURCE data stores in the same execution
      • The main body of the job design will read from the CONTROL table for any source ready to process
      • Then using a tMap, separated data flows can handle different output formats
    • A common ‘Joblet’ updates the CONTROL record values of the current data process
      • The ‘Joblet’ will perform an appropriate UPDATE and manage its own database connection
        • Setting the ‘Run Date’, current ‘Status’, and ‘Dump File Name’
      • I have also used an ‘in process’ status to help with exceptions that may occur
        • If you choose to set the 1st state to ‘in process’ an additional use of the ‘Joblet’ after the SELECT query has processed successfully is required to update the status to ‘dumped’ for that particular CONTROL record
        • In this case, external queries of the CONTROL table will show which SOURCE failed as the status will remain ‘in process’ after the job completes its execution
      • Whatever works best for your use case: it’s a choice of requisite sophistication
      • Note that this job design allows ‘at-will re-execution’
    • The actual ‘READ’ or extract then occurs and the resulting data is written to a flat file
      • The extraction component creates its own DB connection directly
        • You can choose to create the connection 1st and reuse it if preferable
      • This output file is likely a delimited CSV file
        • You have many options
      • Once all the CONTROL records are processed the job completes using the ‘tPostJob’, closing the database connection and logging its successful completion
      • As the ‘Dump’ process is decoupled from the ‘Load’ process it can be run multiple times before loading any dumped files
        • I like this as anytime you can decouple the data process you introduce flexibility
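
As a concrete illustration of the CONTROL record and the dump file naming convention mentioned above, here is a small, hedged Python sketch; the field names, values, and path are assumptions for illustration only, not the actual Talend implementation:

# Hypothetical CONTROL record and dump file naming convention (illustrative only).
from datetime import date

control_record = {
    "Source": "CRM_ORDERS",          # which SOURCE data store to extract
    "RunDate": None,                 # set by the 'Joblet' when the dump runs
    "Status": "ready to dump",       # 'ready to dump' -> 'in process' -> 'dumped' -> 'loaded'
    "ExtractSQL": "SELECT * FROM orders WHERE order_date >= '2018-01-01'",
    "DumpFileName": None,
}

def build_dump_file_name(drive, path, table_name, run_date=None):
    # Builds a unique name of the form {drv:}/{path}/YYYYMMDD_{tablename}.dmp
    run_date = run_date or date.today()
    return "{0}/{1}/{2}_{3}.dmp".format(drive, path, run_date.strftime("%Y%m%d"), table_name)

control_record["DumpFileName"] = build_dump_file_name("D:", "dumps", "orders")
print(control_record["DumpFileName"])   # e.g. D:/dumps/20180601_orders.dmp
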

Let’s take notice of several critical elements for the 2nd part of this job design pattern:

  • Two database connections are used
    • One for reading the CONTROL table and one for writing to the TARGET data store
  • The same key variables are initialized to set up the unique ‘dump’ file name to be read
    • This may actually be stored in the CONTROL record, yet you still may need to initialize some things
  • This step of the job design pattern controls multiple TARGET data stores within the same execution
    • The main body of the job design will read the CONTROL table for any dump files ready to process
    • Their status will be ‘dumped’ and other values can be retrieved for processing like the actual data file name
    • Then using a tMap, separated data flows can handle the different output formats
  • The same ‘Joblet’ is reused to update the CONTROL record values of the current data process
    • This time the ‘Joblet’ will again UPDATE the current record in process
      • Setting the ‘Run Date’ and current ‘Status’: ‘loaded’ (see the sketch after this list)
    • Note that this job design also allows ‘at-will re-execution’
  • The actual ‘BULK/INSERT’ then occurs and the data file is written to the TARGET table
    • The insertion component can create its own DB connection directly
      • I’ve created it in the ‘tPreJob’ flow
      • The trade-off is how you want the ‘Auto Commit’ setting
    • The data file being processed may also require further splitting based upon some business rules
      • These rules can be incorporated into the CONTROL record
      • A ‘tMap’ would handle the expression to split the output flows
      • And as you may guess, you might need to incorporate a lookup before writing the final data
      • Beware, these additional features may determine if you can actually use the host db Bulk/Insert
    • Finally, the process either saves the processed data or captures rejected rows
    • Again, once all the CONTROL records are processed the job completes using the ‘tPostJob’, closing the database connections and logging its successful completion
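
To tie the two steps together, here is a hedged Python sketch of what the load side does with the CONTROL table: pick up records whose status is ‘dumped’, bulk load each file, then mark the record ‘loaded’. It uses sqlite3 purely as a stand-in database with a hypothetical schema; the real pattern would use the ‘Joblet’ and the TARGET data store’s bulk loader.

# Load-side CONTROL handling, sketched with sqlite3 as a stand-in (hypothetical schema).
import sqlite3
from datetime import datetime

def bulk_load(dump_file_name):
    # Placeholder for the actual BULK/INSERT of the dump file into the TARGET table.
    print("loading %s" % dump_file_name)

conn = sqlite3.connect("control.db")

# Find every dump file that is ready to be loaded.
ready = conn.execute(
    "SELECT id, dump_file_name FROM control WHERE status = 'dumped'"
).fetchall()

for control_id, dump_file_name in ready:
    bulk_load(dump_file_name)
    # Same idea as the 'Joblet': update Run Date and Status on the current CONTROL record.
    conn.execute(
        "UPDATE control SET status = 'loaded', run_date = ? WHERE id = ?",
        (datetime.now().isoformat(), control_id),
    )
    conn.commit()

conn.close()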

Conclusion

This is just the beginning.  There will be more to follow, yet I hope this blog post gets you thinking about all the possibilities now.

Talend is a versatile technology and, coupled with sound methodologies, job design patterns, and best practices, it can deliver cost-effective, process-efficient, and highly productive data management solutions.  These SDLC practices and Job Design Patterns represent important building blocks for implementing successful methodologies.  In my next blog, I will augment these methodologies with additional segments you may find helpful, PLUS I will share more Talend Job Design Patterns!

Till next time…

The post Successful Methodologies with Talend – Part 2 appeared first on Talend Real-Time Open Source Data Integration Software.

CIOs: How Can You Stretch Your Data Dollar?


Broken data economics

For the past few years, big data has been all the rage, but it has not delivered on all its promise. We were expecting omniscience at our fingertips, and we’re getting moderately well-targeted advertising instead.

Don’t get confused though, big data is not going anywhere. On the contrary, data keeps growing at a dizzying rate, increasing in volume, variety, velocity, and veracity. IDC estimates that in 2025, the world will create and replicate 163 zettabytes of data (yes, a zettabyte is a trillion gigabytes), representing a tenfold increase from the amount of data created in 2016. This is the big data era for which companies have been preparing for years.

And IT is undergoing transformative changes to manage big data and embrace technology innovation. Five years ago, the most common big data use case was “ETL offload” or “data warehouse optimization”, but now more projects have requirements to:

  • Focus on velocity, variety, and veracity of data, e.g., moving to real-time, ingesting streaming data, supporting more data sources in the cloud, and making sure data can be trusted.
  • Add automation and intelligence such as machine learning to data pipelines
  • Run across different cloud platforms leveraging containers and serverless computing

Almost concurrently, because companies have all recognized the value of data, there is a multiplication of data consumers across the enterprise. Everyone wants access to data so they can provide better insight, but there are also new data roles, such as data scientists, data engineers, and data stewards to analyze and manage data. A recent report by Gartner states that by 2020, the number of data and analytics experts in business units will grow at three times the rate of experts in IT departments. And big data becomes addictive: once you use it in your analytics, you want more of it to provide even deeper insight.

And this is the challenge – companies, and IT departments in particular, can no longer keep pace with data demands or the users requesting access to it. To harness data, companies have made ever bigger annual investments in software solutions, in their infrastructures, and in their IT teams. However, simply increasing the IT budget and resources is not a sustainable strategy, especially as data keeps on growing and more users pop up eager to get their hands on it. It’s clear current data economics are broken, and the looming challenge for companies looks particularly daunting with the rise of hybrid cloud and the multiplication of applications.

Gartner stated that “Through 2020, integration will consume 60% of the time and cost of building a digital platform”. At Talend, we believe we can completely change this statistic. Instead of throwing more money at big data, how about being smarter in how you enable users and use technology to manage data? Talend enables just that.

Enabling More People

If companies are to realize more value from more of their data, they need to make data management a team sport. Rather than each group going their own way, business and IT need to collaborate, so everyone can access, work with, and share trusted data. This is the balancing act between collaboration and self-service. At Talend, we believe that the productivity gains and cost savings are significant if users across the organization collaborate and manage data from the same platform, just like it is best to standardize on a productivity suite like Microsoft Office vs using different tools for word processing, spreadsheets and presentations. Talend Data Fabric supports this through a range of new self-service apps for developers, data scientists, and other data workers.  With these new apps, the business and IT can collaborate on integration, transformation and governance tasks more efficiently, and easily share work between apps.

In addition to business productivity gains through persona-based apps, IT improves efficiencies as the apps are all governed and managed through the Talend platform, i.e., there is a central management console for managing users, projects and licenses; a common way to share data, data pipelines and metadata across on-premises and cloud deployments; a single DevOps framework; and one method for implementing security and privacy across all your data. So instead of spending time integrating and managing each of your integration tools, you spend more time delivering data-driven insight.

By implementing governed, self-service, collaborative data management, trusted data flows faster through your business and everyone becomes more confident in the decisions they make.

Today, Talend provides the following apps:

  • Talend Studio – a developer power tool to quickly build cloud and big data integration pipelines
  • Talend Cloud Data Preparation – an easy to use, self-service tool where IT, marketing, HR, finance and other business users can easily access, cleanse and transform data.

And we just announced:

  • Talend Cloud Data Stewardship – a team-based, self-service data curation and validation app where data users quickly identify, manage, and resolve any data integrity issue.  
  • Talend Cloud Data Streams – a self-service web UI, built in the cloud, that makes streaming data integration faster, easier, and more accessible, not only for data engineers, but also for data scientists, data analysts and other ad hoc integrators so they can collect and access data easily.

Talend Data Fabric and its set of applications for different data workers are all about enabling everyone to work on data together.

Embracing Technology Innovation

Enabling more users is the first lever for companies; embracing innovation is the second.  

The collective innovation of the entire technology ecosystem has been focused on reimagining what we do with data and how we do it. Everything in the data world gets continually re-invented from the bottom up at an accelerating pace. There is an endless supply of new technologies to consume data that improve both performance and costs.

Companies need to be enabled to embrace all this innovation and get the benefits from cloud computing, containers, machine learning, and whatever comes next. This is why Talend made the choice to build on an open and native architecture from day one. Talend Studio is a model-driven code generator, where you graphically design and deploy a data pipeline once, and then can easily switch technologies to support new data requirements. With support for over 900 components, including databases, big data, messaging systems and cloud technologies, it is easy to change a data source or target in your pipeline, and then let Talend generate the optimized code for that platform. This abstraction layer between design and runtime, combined with Talend’s continued support for new technologies, means IT teams don’t have to worry about big migration projects anymore, and can easily gain all the performance, security, functionality, and management capabilities of the underlying platform, such as Hadoop.  Today, Talend includes support for AWS, Microsoft Azure, Google Cloud Platform, Apache Spark, Apache Beam, Docker and serverless computing, providing the ultimate flexibility and portability to run your data pipelines anywhere.

The 2018 imperative for companies is to put more data to work for their business faster. Because data is sprawling and users are multiplying, incremental improvement won’t make the cut. Companies must be exponentially better at putting data to work for their business. The only solution to this data challenge is a force multiplier to enable more users at a lower cost with more innovative technologies.

At Talend, we are excited about the next release of the Talend Data Fabric and Talend Spring ’18 because it serves those exact needs, a true game changer. This is the most complete and most exciting release in Talend’s history and provides the speed and efficiency for everyone in your company to put more data to work right now.

The post CIOs: How Can You Stretch Your Data Dollar? appeared first on Talend Real-Time Open Source Data Integration Software.


The Rise of Agritech: How the Agriculture Industry is Leveraging Cloud, Big Data and IoT


How can we meet the world’s food needs while respecting the environment? With the population set to grow to over 9 billion people by 2050, answering this question is becoming more urgent.

As arable land decreases by 100,000 hectares per year, global agricultural production needs to double in the next 30 years to cope with demand. Add to this the impact of climate change and mass urbanization, and it is clear the agricultural sector faces a monumental challenge. Beyond investments and the political will to support it, part of the solution lies in the digital transformation of the entire agricultural supply chain.

Data at the Heart of Farming

Technology and data can open new opportunities and help solve problems with production, traceability, and the preservation of natural resources. Despite its traditional image, agriculture is adopting new technological innovations and leveraging the cloud, big data, and the Internet of Things (IoT) solutions to increase productivity while protecting our environment.

From 2013 to 2015, the use of professional agricultural mobile applications jumped by 110%, a sign that the sector was adopting digital technology and seeing the benefits. Today, precision farming allows more accurate and efficient crop monitoring through the rationalization of plot management. This makes it possible to optimize yields, taking into account the different environmental factors that can affect growth, such as soil pH, irrigation, fertilizer, and sunshine. That’s alongside other benefits such as increased food safety and better respect for the environment.

Data is at the heart of this technique. IoT, satellite and drone imagery, weather data and historical yield data can be brought together to inform and speed up the decision-making process that will make farming more efficient. For example, it can optimize the use of pesticides or fertilizers based on the data gathered.

Of course, all of this will not be possible without support from government and the agricultural sector to help UK farmers, suppliers, cooperatives, traders, and manufacturers take advantage of advances in agritech. In 2015, the then Defra Secretary Liz Truss announced that the Government and the Open Data Institute would support and work towards the adoption of Open Data strategies in the UK agricultural sector to take advantage of farming-specific data.

Similarly, as part of its Industrial Strategy, the Government is to make £90m available to help British farmers capitalize on developments in agritech with artificial intelligence, robotics, and satellite-powered earth observation. The Government hopes that this will help agriculture realize the benefits of innovations such as the Croprotect app, which helps farmers prevent pest-related crop damage, and the Ordnance Survey’s use of satellite imagery and digital data collection to map farmland.

Cloud-based data collection and real-time analysis can improve product traceability – traders benefit from better allocation planning and simplified logistics, while cooperatives have the information they need to monitor and advise farmers from production to the point of sale and to accurately calculate losses.

 

It can also improve quality. Analysing data on the entire production chain can help track potentially contaminated foods through the supply chain and stop their distribution. Food safety and consumer information will be enhanced with this detailed insight into the journey of each food product until it reaches the shelves. Carbon footprint, quantities of pesticides or water used are all elements that can be communicated and will make the entire chain more responsible by enabling consumers to make informed choices.

Precision Farming

Precision farming is expected to contribute 30% of the growth needed in agricultural production to feed the world by 2050. The potential gain is tremendous, while technologies such as artificial intelligence should give new impetus to intelligent crop management in the coming years. IoT will increasingly invade the fields and drones will monitor the land and crops in hard-to-reach places.

This revolution will awaken new vocations, particularly among the digitally literate generations. Technological innovation will profoundly transform a sector still perceived as very traditional and make it possible to solve the food supply challenges of the 21st century.

The post The Rise of Agritech: How the Agriculture Industry is Leveraging Cloud, Big Data and IoT appeared first on Talend Real-Time Open Source Data Integration Software.

GDPR Is Here – And Only 19% Of Companies Are Fully Ready


May 25th, 2018. Here we are! The European General Data Protection Regulation (GDPR) has come into effect. So are we finally finished and ready to help consumers take back control of their data? The newly released survey from BARC/CXP clearly hints that for most companies, May 25th, 2018 looks more like a beginning than a deadline. Here is why.

Ready! For What?

Now that GDPR is a reality for all businesses that control or process data about or relating to EU citizens, including businesses and organizations headquartered outside the European Union, what are the consequences if your organization is not fully prepared?

“Even if you’re not finished preparing for the GDPR on May 25th, this is not a problem. This is a learning curve, and we will consider, of course, that this is a learning curve. The role of the regulator is to be very pragmatic and to be proportionate. However, it’s important that you start today, not tomorrow,” says Isabelle Falque-Pierrotin, Chairman of the CNIL, the French Data Protection Authority.

A recent BARC survey of 200+ CXO respondents shows that organizations understand that guidance; only 19% consider themselves GDPR ready, with most of the others still in the planning (17%) or development (30%) phase.

<Get the Full Report>

Does that mean that slow and steady wins the race? Probably not. May 25th is the D-day when data subjects, your customers, and your employees gained new rights: the right of access, the right of rectification, the right of portability, and the right to be forgotten. For them, GDPR is more of a liberation than a constraint, and there is no doubt that they will want to try out your system of trust, the one that can guarantee a fair and safe usage of their personal data.

Some privacy activists, like the French association “La Quadrature du Net” or the Austrian lawyer Max Schrems, are already pushing for group actions, pressuring the data protection authorities to ensure that the regulation is respected.

When a customer asks to exercise a data subject right that a company is not able to fulfill, it’s embarrassing. And what will companies do if thousands of customers ask?

Data Management to the Rescue

Enacting the rights for the data subject, managing consent, and minimizing personal data when it is used beyond the scope of legitimate interest and consent — these are the kinds of mandates that bring GDPR beyond a box-ticking exercise and require data management technologies.

That doesn’t mean you should underestimate the organizational challenges. The survey highlights those traditional hurdles, such as tackling organizational issues and addressing the lack of expertise or availability of resources.

But it also shows the need to get hands-on with the data using proper tools.

On that front, the BARC survey brings us back to the basics: 57% of participants plan to expand their use of Data Integration to comply. Creating a consistent, trusted, and auditable 360° view of their data subjects with Data Quality, Master Data Management, and reporting is also on the radar for almost half of the participants. And then there is the Data Governance dimension to track and trace the origin and usage of personal data.

Beyond GDPR: trusted data for personalized experiences

Many organizations say they struggle to implement complex things like consent management, the right of portability, or the right to be forgotten. The survey reveals that significant sums are being spent on GDPR compliance in 2018, with the large majority of companies spending more than €250,000, including 16% spending up to €5 million.

But our survey also highlights the business benefits. Indeed, the regulation forces an organization to finally take total control over their critical data. Thanks to the potentially very costly fines, data governance has now become a board discussion. That’s why close to 80% of participants agree that GDPR is helping to improve data trust, control, and relevance.

Even more importantly, the survey reminds us of the ultimate goal. Personal data at its best delivers personal experiences. It allows companies to better know their customers, and turn this knowledge into increased sales, customer satisfaction, and innovative services.

Today is just a starting point for organizations to take control of their data. For more about how organizations are implementing GDPR, and to benchmark your own progress, take a look at our whitepaper, 16 Steps to GDPR Compliance.

To learn more about the BARC/CXP survey, you can download the survey or the infographic.

The post GDPR Is Here – And Only 19% Of Companies Are Fully Ready appeared first on Talend Real-Time Open Source Data Integration Software.

[Step-by-Step] Introduction to Talend Master Data Management


With increasing needs for data analytics and ever more stringent laws concerning data security and privacy (looking at you, GDPR), data governance has become a business imperative for any data-driven company. Master Data Management (MDM) is an essential part of any data governance strategy; it is commonly defined as the comprehensive method used to consistently define and manage the critical data of an organization to provide a single point of reference.

Master Data Management is often perceived as an IT responsibility, but because data governance is a company-wide responsibility, IT teams need MDM solutions that enable everyone in the company. Talend MDM provides an intuitive and accessible solution to manage and master critical data.

The video series below explains how IT teams can use Talend MDM to implement a sustainable Master Data Management strategy that works for everyone in the company.

The first video gives an overview of the complete MDM solution from data models to data integration jobs to data stewardship workflows. The intuitive User Interface of Talend MDM makes it easy to interact with charts and deep dive into specific Master data. The highly customizable dashboard provides users with an end-to-end view of their critical data.

The second video presents the step-by-step process to create and deploy MDM data models using the concrete use case of an HR department managing employee data. Users can create any number of MDM domains and, for each one, define the appropriate data model by adding specific business elements such as social security numbers, hiring dates, etc. When creating the view, the user can decide which business elements should be visible, searchable, or hidden for each role within the organization. Permissions to read and write in the data container, data model, and view can then be assigned to different sets of users before deployment. Finally, the deployment of all your MDM objects is done with just a few simple clicks!

The last video of the series focuses on one of the key roles of data stewards: matching data to remove duplicates and cleanse data sets. Match rules are used to define how MDM decides whether two or more data records match, and how to handle them if they do. In Talend MDM, match rules can be directly created and accessed from the repository tree on the left-hand side. When creating a new matching rule, it’s easy to define the elements to consider for data matching (e.g. first and last names) and the confidence weight of each element. The new match rule can then be used in data integration jobs. And YES, I understand that this is a very simple example, but it was to keep the video short.  I could build hours of videos on how you can build very complex matching algorithms using advanced standardizations, phonetics, and generated blocking keys to do high-performing matches, but I want you to see in these videos how Talend’s integrated platform works together in a truly unified fashion.

And with this, you’re all set to successfully start implementing a sustainable, phased approach to your Master Data Management strategy.

Talend’s approach is how many of our customers can claim success within the first 6 months of starting their MDM projects!  Carhartt, for instance, went live with their MDM project in just 4 months.

Learn More about Talend MDM

The post [Step-by-Step] Introduction to Talend Master Data Management appeared first on Talend Real-Time Open Source Data Integration Software.

Big Data in Healthcare: 4 Big Benefits


In this episode of Craft Beer and Data, Mark and I travel just south of the Bay Area to chug some beer and talk data.

Devil’s Canyon Brewery started in San Carlos, California, 15 years ago. We met with their Production Lead, Ryan Edmonds, to learn more about the Silicon Valley brewery. The motto of Devil’s Canyon is “Crafting Beer, Building Community,” and it’s apparent why. The brewery building (that was once a Tesla factory) hosts all kinds of events to bring people together, and, of course, they make delicious beer!

To hear more about Devil’s Canyon and their brewing process, watch the rest of the conversation in the video above!

After Mark crushed me in the chugging contest, we both grabbed another glass to discuss how data is changing the healthcare world and saving countless lives in the process.

What Is Big Data Doing for Healthcare?

Few industries have been revolutionized by big data as much as healthcare. In this episode, we talked about:

  1. Gene mapping
  2. Creating new drugs
  3. Real-time health data
  4. Organ donation

Ready For More? Download How Leading Enterprises Achieve Business Transformation with Talend and AWS User Guide now.

Download Now

1. Gene Mapping

You may have heard of The Human Genome Project. This process of mapping genes is essentially identifying what makes us human. Recently, there has been a huge explosion in the capabilities to map genes, and a lot of that has come from advancements in processing data at scale using technologies like Apache Spark. It’s probably the most important project we’ll see in our lifetime.

Here are a few facts:

  • In the early stages, sequencing a genome cost 3 billion dollars. Now, it costs only about $1,000 per patient.
  • It’s estimated that by 2025 we will need the processing power to work with zettabytes of sequencing data per year.
  • By 2025, storage needs will be in the range of two to four exabytes. (That’s a lot of data!)

Because of data processing, there is huge potential for understanding who we are as humans. Gene mapping is the lead-in to individualized medicine, which—in some cases—can be a matter of life or death.

2. Creating New, Life-Saving Drugs

It is a long, expensive process to make a new drug. The first phase is the discovery and testing of molecules.

Discovery and testing can cost $500 million or more, and requires a lot of data. Ten to 15 years ago, it was too much data to process, so data scientists sampled. At that time, decisions were made on roughly seven to 10% of the available data. With today’s capabilities we can make stronger data-driven decisions that encompass the full picture, giving researchers a higher chance of success when determining which molecules are worth spending another $200 million on. It’s no surprise that many companies are reviewing old datasets to see if anything was missed.

3. Real-Time Health Data

No longer do doctors have to waste precious time waiting on a patient’s health history. Watch the full episode (above) for a list of ways that big data is becoming real-time in healthcare. With so much change happening, it will be fascinating to see how other aspects of the healthcare industry evolve to support and take advantage of real-time data updates.

4. Organ Donation

When it comes to saving lives, one Talend customer knows firsthand the importance of having the capacity to handle a lot of data. UNOS, the “United Network for Organ Sharing,” is a 35-year-old network that matches organ donors with recipients.

UNOS is now able to match patients with organs in a 24 to 48 hour window! The program also creates an organ offer report, which is sent to all the hospitals in their network. It is critical to handle the velocity of that data so patients can survive.

Big Data in Healthcare Needs an 11th “V”

Big data is crucial for modern healthcare, but there is a potential dark side. There is a lot of personal information involved in all that data, and we must be careful to treat it with respect.

I’ve said this before, but I’ll say it again: it’s time to add a new “V” to the big data list—virtue. We have the data, but how are we going to use it?

Watch the full episode for more on how big data is transforming healthcare. You can also check out how Talend is helping the healthcare industry optimize their big data processes, and catch up on the entire first season of Craft Beer & Data.

The post Big Data in Healthcare: 4 Big Benefits appeared first on Talend Real-Time Open Source Data Integration Software.

Building Ingestion Pipelines for IoT Data with Talend Data Streams


Introduction

This month, Talend released a new product called Talend Data Streams. Talend Data Streams was designed for data scientists, analysts and engineers to make streaming data integration faster, easier and more accessible.  I was incredibly excited when it was finally released on the AWS Marketplace and have been testing out a few use cases.

Today, I want to walk you through a simple use case of building ingestion pipelines for IoT data. I’m going to show you how to connect to your Kafka queue from Talend Data Streams, collect data from an IoT device, transform that raw data and then store it in an S3 bucket. Let’s dive in!

Pre-requisites

If you want to build your pipelines along with me, here’s what you’ll need:

Let’s Get Started: IoT with Talend Data Streams

Kafka

Network setup

I’m going to start with the network setup. Here I have an Amazon Web Services EC2 Windows instance and I’ve installed Apache Zookeeper and Kafka using the default settings and ports (Zookeeper: 2181; Kafka: 9092) as mentioned in this article: Install & setup Kafka on Windows.

A couple of things to remember as you are setting up your Kafka network:

  • On the Kafka machine, make sure that all firewalls are turned off.
  • On the AWS management console, make sure that you’ve created inbound rules that allow all TCP and UDP traffic from your Data Streams AMI (using the security group Id of your Data Streams AMI).

Launch Kafka 

Run Zookeeper then Kafka as described in the article:

  • Launch Zookeeper
zkserver
  • Launch Kafka
.\bin\windows\kafka-server-start.bat .\config\server.properties

Zookeeper and Kafka up and running

Create Kafka topic

To finish up, let’s create a Kafka topic from the command line tools. I’m going to label the topic “mytopic”.

kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic mytopic

Your Kafka broker is now up and running!

Setting Up Your S3 Bucket

Create your S3 bucket

Now, we need somewhere to put the incoming data. Let’s create that place using Amazon S3. To get started, log in to your AWS account and look for the S3 service in the “Storage” category.

AWS Management Console

Create a new bucket from the user interface.

Next, fill in your bucket name and select a region.

Leave the default setting by clicking twice on Next then “Create Bucket”.

Create IAM role for an S3 bucket

In order to access the bucket from a third-party application, we need to create an IAM role that has access to S3 and generates the Access Key ID and Secret Access Key. It’s a critical setup step in this use case. Here’s how you get that done.

From the AWS management console, look for IAM.

Select the User section and click on Add user.

Choose a username and tick the box “Programmatic Access” then click on “Next”.

To make this easy we will use existing policies for S3 with full access. To do this, select “Attach existing policies” and check the AmazonS3FullAccess (you can change the policy setting afterward).

Make sure your setup is correct and click on “Create User”.

Now write down your access key and click on “Show” to see your secret key (as you will see it just once).

IoT device: Raspberry Pi & Arduino Uno

Set up your device

I have a Raspberry Pi with internet connectivity over wifi; I’ve set it up using this tutorial: Install Raspbian using NOOBS. An Arduino Uno is connected to the Raspberry Pi over serial. It has one RGB LED, a temperature and humidity sensor and a distance sensor (if you are interested in learning more about it, contact me and I’ll share my setup with you).

The Arduino Uno reads temperature, humidity and distance values from the sensors, and the RGB LED color changes based on the distance measured.

Send Sensor Data to Kafka

The Raspberry Pi acts as a cloud gateway and a hub that collects sensor values from the Arduino. I’m using Node-red (embedded with Raspbian) on the Pi in order to read sensor values from serial and send them to the Kafka broker.

Node-red Flow on the Raspberry Pi
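
If you prefer not to use Node-red, a rough Python equivalent of that flow is sketched below: it reads one line from the Arduino over serial and publishes it to the Kafka topic as a semicolon-delimited message (temperature;humidity;distance), which is the format the Data Streams dataset will expect later. The serial port, broker address, and message layout are assumptions based on my setup, and the sketch relies on the pyserial and kafka-python packages.

# Rough Python equivalent of the Node-red flow: serial in -> Kafka out (assumed ports and paths).
import serial                      # pyserial
from kafka import KafkaProducer    # kafka-python

ser = serial.Serial("/dev/ttyACM0", 9600)                 # Arduino over USB serial (assumed path)
producer = KafkaProducer(bootstrap_servers="your-kafka-host:9092")

while True:
    # The Arduino is assumed to print one 'temperature;humidity;distance' line per second.
    line = ser.readline().strip()
    if line:
        producer.send("mytopic", line)                    # same topic the pipeline reads from
        producer.flush()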

Talend Data Streams

As a reminder, if you don’t have your own Data Streams AMI follow this tutorial: Getting Started on Talend Data Streams

Talend Data Streams is free of charge in this open source version, and you only pay for AWS storage and compute consumption.

Create Connections

When your Data Streams AMI is up and running you can access it using the public DNS.

On the login screen use “admin” for user and your AMI ID (found in the EC2 console) for the password.

Now select the Connections section and click on Add Connection.

Let’s first create our Kafka connection. Give it a name, select Kafka as the type, fill in your broker DNS with its port, and click on Check Connection, then Validate.

Now create the S3 bucket connection, check the connection and Validate.

Create your Dataset

Click on Datasets then Add Dataset. From there, select the Kafka connection that we’ve just created, write the topic where the IoT device is sending over data (mytopic), choose the value format (CSV) and the field delimiter (;) and click on View Sample.

Do the same for your Amazon S3 instance.

Create a Pipeline

Now that our source and target are set, it’s time to create a Data Streams pipeline. Here’s how to do that:

  • Click on Pipeline and press ADD PIPELINE.
 
  • Now edit the name of your pipeline.

  • Now on the canvas, click on “Create Source”.

  • Select the dataset we created on top of our Kafka broker and press SELECT DATASET.

Now you have your first source defined on your canvas, and you can press the refresh button to retrieve new data from Kafka.

 

Now, click on the green + button next to your Kafka component and select a Python Row processor.

The Python processor is used to rename the columns, change data types, and create an extra field based on the value of the distance sensor. Copy and paste the Python code below and click on Save.

# input and outputList are provided by the Data Streams Python Row processor;
# json is available here without an explicit import.
output = json.loads("{}")

# field2 holds the distance reading; cast it, since the raw Kafka CSV values arrive as strings.
distance = int(input['field2'])

# Derive an extra LedColor field from the measured distance.
if distance <= 20:
    led = "Red"
elif distance > 80:
    led = "Green"
else:
    led = "Blue"

# Rename the fields and convert them to integers.
output['Temperature'] = int(input['field0'])
output['Humidity'] = int(input['field1'])
output['Distance'] = distance
output['LedColor'] = led

outputList.append(output)

Let’s now add a sink. I’m going to use the S3 bucket connection I created earlier. Click on Create sink in the canvas.

Select the loT dataset from Amazon S3.

Now that we are all set, press the run button on the top to run your pipeline.

Just like that, we’ve built our first Talend Data Streams pipeline that reads from Kafka, uses Python Row to process the data that is then stored on Amazon S3.

In my next blog post, we will dig deeper into Talend Data Streams components and capabilities by leveraging this pipeline to create a real-time anomalies detection model on the humidity sensor, using the Window component and by calculating the Z-Score for each sensor value in a Python processor.

Happy Streaming!

The post Building Ingestion Pipelines for IoT Data with Talend Data Streams appeared first on Talend Real-Time Open Source Data Integration Software.

Data provenance: Be a star of GDPR


The big data ecosystem is constantly changing, and with the General Data Protection Regulation (GDPR) due to go into effect in May 2018, businesses need to deploy a coordinated approach to data management. Aging data architecture will not be able to meet the needs of the new EU regulation, which prescribes much more stringent and specific rules on the use and storage of personal data than ever before.

The good news is that shifting to a modern data architecture can help to liberate data and make it more useful for your business. According to IDC research there is almost a 50:50 split between organizations that view GDPR as an opportunity and those that see it as a threat. While it certainly presents a challenge, GDPR provides a great opportunity to develop more transparent relationships with clients while simultaneously cleansing the data businesses hold to make it more usable. The opt-in rule that GDPR will introduce requires that companies know if, how and when each customer has opted in. All companies will need to have a comprehensive view of the private data they hold so that a clear digital paper trail for each customer can be identified. While this has clear privacy and transparency benefits for customers, it can also help businesses put more of the right data to work. Credit Agricole Consumer Finance, for example, has been successful in capturing personal data from their customers to improve the customer experience.

<Get the BARC GDPR Preparedness Report>

Creating a personal data hub

To be fully compliant with GDPR, data-driven enterprises will need to create a Personal Data Hub where all relevant data can be collated, making it easily traceable and readable. Updating legacy data architecture to enable the establishment of a personal data hub is a must.

The first step is to bring all the data into a data lake so that it can be connected to the personal data hub where it can be cleansed, discovered and shared. Cleansing legacy contact lists will eliminate any out-of-date information that is no longer useful to the business. Some 65% of organizations cleanse their customer data just once a year, have no processes in place at all, or simply don’t know how often their data is cleansed, according to Royal Mail Data Services research.

In the multi-cloud era, information may well need to be collated from several sources, with data quality tools helping to match disparate data. The data trail for one customer could cover numerous parts of a business. For example, private details could be held by the subscriptions department, as well as by sales, marketing and finance. Not only is the data often siloed in different departments, it may also be held in non-compatible formats. Bringing all this data together to build a “single version of the truth” will not only help companies to meet GDPR requirements but also enable easier collaboration between departments. This will help businesses to build trust with customers over the privacy and use of their personal information.

<<16 Steps to GDPR Compliance>>

Building trust through transparency

Master data management (MDM) will be needed for reconciling data and will also be essential for managing opt-ins, because these will apply across multiple applications. It will help to create a personal audit trail for each customer and then apply it across all the enterprise’s applications. Creating a healthier opt-in relationship with consumers will benefit both sides and the modern data infrastructure required will need to be built on a platform that runs on any cloud, so that all areas of the enterprise are covered. Infrastructure for data-driven companies will also require governance that is built-in and not merely an afterthought. It will be essential for companies to introduce the right data governance policies to implement parameters around opt-in periods.

Data portability is also important as GDPR requires that consumers can request access to all the personal data that a company holds for them at any time. There are many ways that companies could deal with this, including creating a tool to enable customers to easily download all their data. Some 82% of European consumers plan to exercise their new rights to view, limit, or erase the information businesses collect about them, according to a study from customer engagement software firm Pegasystems.

Ultimately, all companies will need the right data platform in place to ensure that GDPR does not have a negative effect on their business. While the new regulations may seem like a complex challenge, it’s not just about ensuring compliance. GDPR may also prove to be a useful catalyst in delivering data that you can trust, by cleansing and consolidating what you already have, while also helping to nurture a more trusting and transparent relationship with consumers.

The post Data provenance: Be a star of GDPR appeared first on Talend Real-Time Open Source Data Integration Software.

How to Manage Access to 3rd Party Resources in Kubernetes with Helm


Post By Sébastien Gandon and Iosif Igna

Kubernetes provides the ability to easily deploy and run containerized applications in cloud, on-premise, or hybrid environments. Kubernetes has gained a lot of attention recently and it has become a platform for innovation in containerized applications. One technology which has probably helped Kubernetes grow a lot is Helm, which provides the means to package, install, and manage Kubernetes applications.

At Talend, we are using Kubernetes and Helm for our cloud applications. In this post, we will show how we have used Kubernetes resources and a Helm chart to address a specific deployment challenge. 

The specific use case is a Kubernetes application that needs to connect to a PostgreSQL database. First, we create a Helm chart to manage the deployment of the application. At this point, we don’t yet know how PostgreSQL will be provisioned or how the application will connect to it.

Although the answers might seem trivial at first, it becomes more complicated when we consider different deployment strategies for PostgreSQL. Therefore, let’s start by looking at several possible PostgreSQL deployment scenarios, which we have tried over the course of our journey with Kubernetes and Helm.

Embedded Deployment Scenario

In this scenario, PostgreSQL is deployed alongside the application inside the Kubernetes cluster, as shown in the following diagram.

Figure 1 – Embedded Postgresql provisioning

While it might not be the perfect scenario for a production system, it provides an easy and flexible way to get a PostgreSQL up and running in a very short time.

The official Kubernetes Helm charts repository provides a PostgreSQL chart which installs and configures a PostgreSQL database inside the cluster. The database name, database user name, and database password can be provided in the values.yaml file or as input parameters at install time. The chart stores the database password in a Kubernetes secret which is then used by the pod that hosts the PostgreSQL container and by the applications which need to connect to the database.

OSBA Deployment Scenario

OSBA (Open Service Broker API) enables service providers to deliver services to applications running in a cloud-native platform, such as Kubernetes. The idea is to provision resources that are managed by a cloud provider using Kubernetes manifests.

In this scenario, we are using the Kubernetes service catalog to connect to a Microsoft Azure service broker and provision a PostgreSQL database.

Figure 2 – OSBA Postgresql provisioning

The Azure service catalog offers 3 different PostgreSQL configurations:

  • Provision only the cluster
  • Provision the cluster and the DB
  • Provision a new DB on an existing cluster

We have chosen to provision both the cluster and the DB to have a setup similar to the embedded model.

Once the cluster/database has been provisioned, the service broker creates a secret inside the Kubernetes cluster which contains all parameters required to access the PostgreSQL database, such as: host, port, database name, user, password, etc. Like in the embedded scenario, this secret can be used by the applications which need to connect to the database.

It is important to note that the format of the created secret is highly dependent on how the service broker provider decides to structure it. The keys of the secret may vary from one cloud provider to another.

Learn more about the Open Service Broker for Azure and PostgreSQL.

External Deployment Scenario

This scenario implies that a PostgreSQL database is provisioned and managed outside the Kubernetes cluster. This could be a managed service in a cloud environment (i.e. Amazon RDS) or a self-managed PostgreSQL cluster in a cloud or on-premise environment.

Kubernetes provides a special service called a “service without selector” to enable communication from resources inside the cluster to resources outside the cluster. In this scenario, we are using a service without selector to connect from our application deployed inside the cluster to an external PostgreSQL database.

Figure 3 – External Postgresql Provisioning

The database host and access credentials are stored in a Kubernetes secret which can be accessed by our application inside the cluster. In this case the secret attributes are freely chosen to match our service’s required parameters.

The Problem

When we look at these three deployment scenarios, we see that in each case the application that uses PostgreSQL needs to be aware of the different secrets with their own names and content. The embedded scenario uses values and a secret provided by the official PostgreSQL Helm chart, the OSBA scenario uses a secret with vendor-specific attribute names, and in the external scenario we are free to define the secret name and its content.

This is fine if we only need to work with one deployment scenario. However, this might not always be the case. For instance, in the development phase we might want to start with an embedded deployment, but later in production, we may need to use a managed service from a cloud provider – therefore needing an OSBA deployment. We can see that there is a need for abstraction here, so that the application can access the database in the same way, independent of how and where the database is deployed.

The Solution

The solution we have chosen consists of a generic secret that provides the abstraction layer required by an application to connect to a database without having any knowledge of its whereabouts. The advantage of a secret, apart from being the resource to handle sensitive data, is that it also provides a way to synchronize the launch of pods. If you have a pod depending on a secret file or some environment variables depending on the secret, the pod will not even start before the secret is available.

Figure 4 – Generic Secret abstracts custom provisionings

Below is an example of the generic secret we create for accessing the PostgreSQL cluster and database.

  • postgresql.database – the name of the DB
  • postgresql.host – host name of the cluster; can be an IP or a K8s service name
  • postgresql.password – master password
  • postgresql.port – cluster port, usually 5432
  • postgresql.user – master user
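
To show what this abstraction buys us, here is a hedged Python sketch of how an application pod might consume the generic secret, assuming its keys are exposed to the container as environment variables (the variable names below are our own mapping for illustration, not something the chart defines):

# Sketch: connect to PostgreSQL with values from the generic secret,
# assumed to be injected into the pod as environment variables.
import os
import psycopg2

conn = psycopg2.connect(
    host=os.environ["POSTGRESQL_HOST"],                       # postgresql.host
    port=int(os.environ.get("POSTGRESQL_PORT", "5432")),      # postgresql.port
    dbname=os.environ["POSTGRESQL_DATABASE"],                 # postgresql.database
    user=os.environ["POSTGRESQL_USER"],                       # postgresql.user
    password=os.environ["POSTGRESQL_PASSWORD"],               # postgresql.password
)
# The application code stays the same whether the database is embedded,
# provisioned through OSBA, or running outside the cluster.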

To create this generic secret for each of the deployment scenarios above, we use different mechanisms as described below.

Embedded scenario

The generic secret is created by a Kubernetes job from the values.yaml file and the PostgreSQL secret is created by the official PostgreSQL chart. We use the environment variables of the job to wait for the embedded PostgreSQL secret to be created. You can find an example here.

OSBA scenario

The OSBA provisioning process involves two Kubernetes resources: a service instance and a service binding. The service binding describes the name of the secret to be created after the successful provisioning. So, just like in the embedded scenario, we are using a Kubernetes job that waits for the OSBA secret to be created and then creates our generic secret out of it. You can find an example here.

External scenario

This is the easiest scenario because the credentials data comes from outside, and therefore can be provided during the helm install process. So, a simple secret template is enough to create the generic secret. You can find an example here.

A New Level of Managing Connections

Kubernetes has taken the orchestration of containerized applications to a different level and is helping software vendors reduce the gap between development, QA, and production environments. At the same time, the options for 3rd party resources/services have increased significantly, and software vendors are faced with the challenge of building their applications in a way that provides enough flexibility to switch between a locally managed service and a cloud-based service.

In our example, we have shown how you can leverage Kubernetes and Helm to manage the connection from a Kubernetes application to a PostgreSQL database. This could be deployed either alongside the application inside the cluster, on-demand in a cloud environment, or pre-deployed outside the cluster. We have used a Helm chart to create a single generic secret which provides an abstraction layer between the application and the database. In this way, the application should never know where the database is deployed or change the way it connects to the database.

You may find the related helm charts that were created using this approach here: https://github.com/sgandon/helm-postgresql-multi/tree/master/tpostgresql

This post was inspired by a workshop that Talend did with Microsoft. We’d like to thank Gil Isaacs for organizing this workshop and Julien Corioland for his great knowledge and skills around Kubernetes and Azure.

 

The post How to Manage Access to 3rd Party Resources in Kubernetes with Helm appeared first on Talend Real-Time Open Source Data Integration Software.


Building Real-Time IoT Data Pipelines with Python and Talend Data Streams


Introduction

I was incredibly excited when Talend finally released a new product called Talend Data Streams that you can access for free on the AWS Marketplace. From day one, I have been testing out a few use cases.

<<Download Talend Data Streams for AWS Now>>

Following my previous post on Data Streams, today I want to walk you through a simple anomaly detection pipeline for IoT data. I’m going to show you how to connect to your Kafka queue from Talend Data Streams, collect data from an IoT device, transform the data, create a streaming window and build a Z-Score model with a Python processor to detect anomalies in humidity readings.

Pre-requisites

If you want to build your pipelines along with me, here’s what you’ll need:

Let’s Get Started: IoT with Talend Data Streams

Kafka

In my previous post, I’ve described how to install, run and create a topic in Kafka. Feel free to reference that post if you need more detail on how to use Kafka here. In this post, I’ll focus on Talend Data Streams itself.

IoT device: Raspberry Pi & Arduino Uno

I will be using the same device as my previous post, but this time we will analyze humidity values. As I mentioned before, you could use any device simulator.

Humidity sensor attached to the Arduino; the Raspberry Pi reads sensor values from serial.

Anomaly detection: Z-Score model

Anomaly detection

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. It has many applications in business, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.

One of the most popular and simplest methods for outlier detection in the IoT world is the Z-Score.

You can forget about importing the NumPy, SciPy or scikit-learn libraries, since Data Streams can only access the Python 2.7 standard library for now (PythonDoc2.7). This is not a problem we can’t fix. Let’s start with a statistics lesson and calculate the Z-Score manually.

Z-Score

The z-score or standard score of an observation is a metric that indicates how many standard deviations a data point is from the sample’s mean, assuming a Gaussian (normal) distribution. This makes z-score a parametric method.

Let’s do the Math

After making the appropriate transformations to the selected feature space of the dataset (in our case, creating a sliding window), the z-score of any data point can be calculated as z = (x - m) / s, where x is the sensor value, m is the mean of the sensor values within the window, and s is the standard deviation.

So what is the standard deviation? The simple definition, according to MathIsFun, is that the standard deviation is a measure of how spread out numbers are. It can be calculated as s = sqrt((1/N) * Σ(xi - m)²), where xi is the series of sensor values within the defined window and N is the number of values within the window.

Coming back to the Z-Score with a simple example, the goal is to find how many standard deviations a value is from the mean. 

In this example, the values have a mean of 1.4 and a standard deviation of 0.15, so the value 1.7 is 2 standard deviations away from the mean and has a z-score of 2. Similarly, 1.85 has a z-score of 3. So to convert a value to a z-score, first subtract the mean, then divide by the standard deviation.

In IoT use cases, we usually consider that an anomaly is detected when the sensor value is outside the -2s to +2s range. The z-score is a simple yet powerful method to get rid of outliers in data if you are dealing with parametric distributions in a low-dimensional feature space.
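
Before rebuilding this inside Data Streams, here is a minimal sketch using only the Python standard library that reproduces the toy example above: the window values are chosen so that the mean is 1.4 and the standard deviation is 0.15, and we then score two new readings against that window (the sample values are illustrative).

# Minimal standard-library z-score sketch (illustrative sample values).
import math

window = [1.25, 1.55, 1.25, 1.55]            # toy window: mean = 1.4, stdev = 0.15
m = sum(window) / float(len(window))
s = math.sqrt(sum((x - m) ** 2 for x in window) / float(len(window)))

for value in [1.7, 1.85]:
    z = (value - m) / s
    # 1.7 -> z ~= 2.0, 1.85 -> z ~= 3.0; flag anything strictly outside the -2s/+2s range
    print("%s z=%.2f anomaly=%s" % (value, z, abs(z) > 2))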

Data Streams: Getting Started

I’ll assume for this part that you have a Data Streams AMI up and running. If you don’t, follow this tutorial: Getting Started on Talend Data Streams.

When your Data Streams AMI is up and running, you can access it using the public DNS. On the login screen, use admin for the user and the AMI ID for the password.

Additionally, you can have a look at the Talend Data Streams Documentation.

Create a Connection

Let’s create a connection to the Kafka cluster that I used in my previous blog post: select the Connections section and click on Add Connection.

Create a Kafka connection: give it a name, set the type to Kafka, fill in your broker DNS with its port, and click on CHECK CONNECTION, then VALIDATE.

Create a Dataset

Click on DATASETS then ADD DATASET and select the Kafka connection that we’ve just created. Next, write the topic where the IoT device is sending over data (humidity), choose the value format (CSV), specify the field delimiter (semicolon) and click on VIEW SAMPLE.

Data Streams Pipeline

Create your Pipeline

Now click on Pipeline and press the button ADD PIPELINE.

Rename your Pipeline (we have chosen PiedPiper for this example).

On the canvas click on Create source, select the dataset we’ve just created and click on SELECT DATASET.

Our PiedPiper Pipeline is now ready to ingest data from the Kafka queue.

Data Preparation

Since the incoming IoT messages are really raw at this point, let’s convert the current value types (string) to numbers: click on the green + sign next to the Kafka component and select the Type Converter processor.

Let’s convert all our fields to “Integer”. To do that, select the first field (.field0) and change the output type to Integer. To change the field type on the next fields, click on NEW ELEMENT. Once you have done this for all of the fields, click on SAVE.

If you look at the Data Preview at the bottom you’ll see the new output.

Next to the Type Converter processor on your canvas, click on the green + sign and add a Window processor—remember, in order to calculate a Z-Score, we need to create a processing window.

Now let’s set up our window. My Arduino sends sensor values every second, and I want to create a Fixed Time window that contains more or less 20 values, so I’ll choose Window duration = Window slide length = 20000 ms. Don’t forget to click Save.

Since I’m only interested in humidity, which I know is in field1, I’ll make things easier for myself later by converting the humidity values in my window into a list of values (an array in Python), aggregating on the field1 (humidity) data. To do this, add an Aggregate processor next to the Window processor. Within the Aggregate processor, choose .field1 as your Field and List as the Operation (since you will be aggregating field1 into a list).
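Conceptually, this aggregation just collapses the records of one window into a single list of humidity values. In plain Python, with made-up field values, it amounts to something like this:

window_records = [
    {"field0": 25, "field1": 41},
    {"field0": 25, "field1": 42},
    {"field0": 26, "field1": 60},
]
humidity = [record["field1"] for record in window_records]
print(humidity)   # [41, 42, 60]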

On the Aggregate processor click on the preview button to see the list.

Calculate the Z-Score

In order to create a more advanced transformation, we need to use the Python processor, so next to the Aggregate processor add a Python Row processor.

Remember we can’t import non-standard Python libraries yet, so we need to break up the Z-Score calculation into a few steps.

Even though the code below is simple and fairly self-explanatory, let me sum up the different steps:

  • Calculate the average humidity within the window
  • Find the number of sensor values within the window
  • Calculate the variance
  • Calculate the standard deviation
  • Calculate Z-Score
#Import standard Python libraries
import math
import json   # used to build the output record; it may already be available in the Data Streams Python context

#Average of a list of numbers
def mean(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)

#Initialize variables
std = 0

#Initialize the output record as an empty JSON object
output = json.loads("{}")

#Average humidity value for the window
avg = mean(input['humidity'])

#Number of sensor values within the window
mylist = input['humidity']
lon = len(mylist)

#Scale by 100 to work around integer division (1/lon would truncate to 0 in Python 2)
lon100 = 100 / lon

#Sum of squared deviations from the mean (used for the variance)
for i in range(len(mylist)):
    std = std + math.pow(mylist[i] - avg, 2)

#Calculate the standard deviation
stdev = math.sqrt(lon100 * std / 100)

#Copy all sensor values within the window
myZscore = list(input['humidity'])

#Calculate the Z-Score for every sensor value within the window
for j in range(len(myZscore)):
    myZscore[j] = (myZscore[j] - avg) / stdev

#Output results
output['HumidityAvg'] = avg
output['stdev'] = stdev
output['Zscore'] = myZscore

Change the Map type from FLATMAP to MAP, click on the four arrows to open up the Python editor, paste the code above, and click SAVE. In the Data Preview, you can see what we’ve calculated in the Python processor: the average humidity, the standard deviation and the Z-Score array for the current window.

If you open up the Z-Score array, you’ll see the Z-Score for each sensor value.
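If you’d like to sanity-check those numbers outside of Data Streams, a standalone sketch that applies the same arithmetic to a made-up window of humidity readings would look like this:

import math

window = [41, 42, 40, 43, 41, 60]   # made-up readings; 60 is an obvious outlier

avg = sum(window) / float(len(window))
stdev = math.sqrt(sum((x - avg) ** 2 for x in window) / len(window))
zscores = [(x - avg) / stdev for x in window]

print({"HumidityAvg": round(avg, 2),                 # 44.5
       "stdev": round(stdev, 2),                     # ~6.99
       "Zscore": [round(z, 2) for z in zscores]})    # the outlier scores ~2.22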

Let’s go one step further and select only the anomalies. If you remember, anomalies are values whose Z-Score falls outside the -2s to +2s range; in our case, -0.88 and +0.88.

Next to the Python processor, add a Normalize processor to flatten the Python array into individual records: in the column to normalize, type Zscore, select the Is list option, then save.

Now add a FilterRow processor. The product doesn’t yet allow us to filter on a range of values, so we will filter on the absolute value of the Zscore being greater than 0.88; we test the absolute value because a Z-Score can be negative.
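Expressed in plain Python, the filter condition is just an absolute-value test, along these lines:

THRESHOLD = 0.88

zscores = [-0.5, -0.36, -0.64, -0.21, -0.5, 2.22]   # e.g. the made-up window from the sketch above
anomalies = [z for z in zscores if abs(z) > THRESHOLD]
print(anomalies)   # [2.22]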

The last output shows that 14 records out of the 50 are anomalies.

Conclusion

There you go! We’ve built our first anomaly detection pipeline with Data Streams: it reads from Kafka, uses the Type Converter, Window and Aggregate processors to transform our raw data, and then uses a Python Row processor to calculate the average, standard deviation and Z-Score for each individual humidity sensor reading. Finally, we filtered out the normal values and kept the anomalies.

The next step, of course, would be to act on the anomalies detected, for example by shutting down a system before it breaks down.

Again, Happy Streaming!

The post Building Real-Time IoT Data Pipelines with Python and Talend Data Streams appeared first on Talend Real-Time Open Source Data Integration Software.

4 Possible Ways a Blockchain Can Impact Data Management


We all know we are at the peak of the hype cycle for…wait for it – Blockchain! We are also already aware of some of the benefits of blockchain – but can blockchain be applied to traditional data management? Though real-world blockchain implementations in the enterprise are minimal so far, I do believe there is a ton of potential to solve some of the problems that businesses face. But as implementations increase across the industry, can we, as data management practitioners, take advantage of the inherent qualities of a blockchain?

The answer is – maybe! Let me explain by digging a little deeper into data management concepts in relation to the blockchain.

Data Quality

Blockchain inherently provides validation of its blocks of data. The data needs to fit into a specific structure, and only then can the block be inserted into the blockchain. If the data block doesn’t fit the rigid requirements of the blockchain, it will be rejected. There are many types of blockchain, but overall the validation provides consistency for the data blocks. But consistency is just one dimension of data quality. What about accuracy? Blockchain data is only as accurate as the application allows it to be. In other words, there is no inherent check on the data itself. The garbage-in, garbage-out syndrome still applies, so the application needs to have good checks to make sure inaccurate data doesn’t get into the block. What about the remaining DQ dimensions, such as completeness and timeliness? Those issues still remain with data in the blockchain.

Reference Data

The distributed ledger paradigm of blockchain could actually be used to manage reference data. It would help two non-competing parties collaborate when they need to maintain contractual data between them. This applies particularly well to financial companies that have to share data with regulatory agencies. It could lead to accurate, automated blockchain-based reference data, reducing costs and operational risks.

Master Data

Blockchains will create more data silos, complicating master data management processes even further. They are inherently used for transactional data, and as traditional apps are replaced by blockchains, there is a danger that the data becomes even more siloed. The move to the cloud to run blockchain node processes doesn’t ease the concern either. There is therefore an even greater need to manage master data properly; success with master data could lead to a successful blockchain implementation.

Data Warehousing

Due to the transactional nature of blockchain, we really cannot see it being used to store historical data, which is a requirement for building a Data Warehouse. In fact, there will be even more need to integrate the data from the private blockchains that IT organizations will develop. Also, blockchain is no replacement for databases: it solves a completely different use case, and databases will still be needed for reporting and predictive analytics. Some enterprise blockchain platforms, such as Corda, make a database available where users can actually run SQL statements. These databases will potentially become source systems for integration into a Data Lake or a Data Warehouse.

Blockchain is a next-generation technology, and it needs to mature before corporations can use it for internal business operations, let alone for an asset as important as data. But it will be exciting to see how the technology evolves over the next decade. Hopefully, this blog helps readers better understand the impact of blockchains on traditional data management practices.

The post 4 Possible Ways a Blockchain Can Impact Data Management appeared first on Talend Real-Time Open Source Data Integration Software.

How Dominos Pizza is Mastering Data, One Pizza at a Time


In an era where food delivery competition ranges from your local grocery store to data-driven companies like Amazon, you need to stake out a competitive differentiator. Domino’s Pizza, founded in 1960, recognizes this, and that’s why it remains the largest pizza company in the world, with a significant business in both delivery and carryout pizza.

Relying on technology as a competitive differentiator helped Domino’s achieve more than 50 percent of all global retail sales in 2017 from digital channels, primarily online ordering and mobile applications. Here’s how they did it.

Data everywhere

Domino’s AnyWare is the company’s name for its digital platform that allows customers to order pizzas via smartwatches, TVs, car entertainment systems and social media platforms. All those pizza orders add up to an information tsunami, which the company recognized as a potentially critical competitive advantage.

Domino’s wanted to integrate information from every channel (85,000 structured and unstructured data sources in all) to get a single view of its customers and global operations. Adding yet more complexity was the fact that the company was using multiple tools for data capture and external agencies for marketing campaigns, all while managing 17TB of data.

Enabling one-to-one buying experience across multiple touchpoints

With Talend, Domino’s has built an infrastructure that collects information from all the company’s point-of-sale systems and 26 supply chain centers, and through all its channels, including text messages, Twitter, Pebble, Android, and Amazon Echo. Data is fed into Domino’s Enterprise Management Framework, where it’s combined with enrichment data from a large number of third-party sources, such as the United States Postal Service, as well as geocode, demographic and competitive information.

With its modern data platform in place, Domino’s now has a trusted, single source of the truth that it can use to improve business performance from logistics to financial forecasting while enabling one-to-one buying experiences across multiple touchpoints.

Related links

IT Science Case Study: Domino’s Becomes an E-Commerce Pizza Maker

Big Data-Driven Decision-Making At Domino’s Pizza

Domino’s Pizza: fun facts

The post How Dominos Pizza is Mastering Data, One Pizza at a Time appeared first on Talend Real-Time Open Source Data Integration Software.

Surfing the Big Data Wave – Our 3 Key Takeaways from The Forrester Wave™: Big Data Fabric, Q2 2018


This edition of the Forrester Wave is yet more proof of the profound changes in the data market. New innovations are helping companies extract more value from data faster than ever before. In just 18 months, a new wave is breaking. Here are our three takeaways.

<< Download the Forrester Wave: Big Data Fabric, Q2 2018 Here>>

1) Legacy is out

In a world where big data and cloud are colliding, the pace of innovation rewards agility and penalizes complexity. With big data workloads now moving to the cloud, data-driven companies can reach virtually infinite scale, but the economics of legacy platforms are not sustainable at that scale. That’s what makes incumbents and new players drift apart: the usual incumbents mostly capitalize on their legacy, while new players can adopt innovations more quickly and easily at a fraction of the cost. Legacy is out.

2) Continuous innovation is in

According to Forrester, “[Big data fabric] minimizes complexity by automating processes, workflows, and pipelines, generating code automatically, and streamlining data to simplify deployment” and Talend’s platform “simplifies the process of working with Hadoop and Spark distributions as well as new technologies like serverless computing and containers.”

Here at Talend, we believe that continuous innovation is the key to a modern data architecture, and we view Forrester’s recognition of Talend as a Leader as validation of our position as the next-generation leader.

3) The Cloud makes everything possible

The cloud has also opened the door to new workloads, making a broader array of use cases possible. Data workloads are more numerous and more complex, data types are more varied, and workloads are also processed differently and increasingly automated.

The future is in the cloud, and a big data fabric enables you to take full advantage of what the cloud has to offer. As per the Forrester report: “The Leaders […] identified support a broader set of use cases, enhanced AI and machine learning capabilities, and offer good scalability features. The Strong Performers have turned up the heat on the incumbent Leaders to offer more data management features and deployment options.”

Conclusion

Today, we are proud to announce that we are a Leader in the Forrester Wave for Big Data Fabric! Talend earned the highest scores of any vendor in both the Current Offering and Strategy categories, two of the three high-level categories Forrester evaluated in this report.

We strongly believe that these scores are a result of what we provide to data-driven companies. Our unified data integration platform helps companies liberate all their data to deliver faster and greater insights for their business. But don’t take our word for it: read the report.

The post Surfing the Big Data Wave – Our 3 Key Takeaways from The Forrester Wave™: Big Data Fabric, Q2 2018 appeared first on Talend Real-Time Open Source Data Integration Software.

Podcast – How Talend Is Helping Companies Liberate Their Data


Earlier this year, I had the opportunity to chat with Nathan Latka on his podcast “The Top.” It’s a broad, sweeping discussion that covers everything from Talend’s origins through our current market approach and growth. We talk about Talend’s financial model, acquisition strategy, product focus, and plan for competing in a $16B market opportunity. Hope you enjoy!

 

The post Podcast – How Talend Is Helping Companies Liberate Their Data appeared first on Talend Real-Time Open Source Data Integration Software.
