Channel: Blog – Talend Real-Time Open Source Data Integration Software

Air France-KLM: Change is in the air to delight customers with “made-just-for-me” travel experiences


Air France-KLM serves 90 million passengers per year and 27 million Flying Blue reward program members. The data explosion has been a clear game changer in the last few years, owing especially to social media: Air France-KLM has 16 million fans on Facebook and 3 million Twitter followers.

Travel as you like

Many airlines fly to the same destinations, so a key differentiator is the level of service you can provide. Customers will not ask their carrier to “make them travel,” but “to travel as they like”.

Air France-KLM’s biggest challenge was to re-centralize all client information to be able to share it with all Air France-KLM group agents. The company implemented a big data platform developed in-house on Hadoop.

Data Management inside

To create a 360-degree view, Air France-KLM introduced Talend to organize this data platform and address data management challenges.

The airline has implemented a complete data quality policy, using Talend Data Quality to verify its data; up to one million data entries are corrected per month with the tool. Data security is also a concern, since Air France-KLM processes clients’ personal data, which must be protected: data masking anonymizes part of the client data for analytical needs.

Air France-KLM also implemented Talend Metadata Manager to share information within the company, which now makes it possible to determine where data is located, where it comes from and where it is going, ten times faster than before.

360-degree customer vision for a better travel experience

“Our goal, by compiling all our clients’ data, is to reassure them throughout their entire trip with us,” explains Gauthier Le Masne, Chief Customer Data Officer. “We will start from home, where, by geolocating them, we will let them know how much time they need to get to the airport. When clients reach the airport, their gate may have changed; they will automatically receive a Facebook Messenger message informing them of the new departure gate.”

Each time you travel with Air France-KLM, the airline studies your history and preferences, and can also send you a deal on the next destination that might interest you.

Gauthier Le Masne concludes: “Our wish is to be systematically in contact with each client. It will involve face-to-face contact with our teams and their tablets, direct contact with our clients if they have the smartphone application, and, if they don’t have the application, connecting with our clients via social media.”

The goal is to have several tens of millions of unique experiences with this data platform.



What Exactly is Data Stewardship and Why Do You Need It?


What does your next data-driven project have to do with data stewardship? 

Well, actually, a lot if you want to get the most out of your data. Many companies today are filling the data lake with vast amounts of structured and unstructured data. But they tend to forget an important fact: on average, organizations believe that 32 percent of their data is inaccurate. Addressing this data quality issue before your data lake turns into a data swamp sounds like a must, not an option, right? That is where data stewardship comes into play.

Data stewardship is becoming a critical requirement for successful data-driven insight across the enterprise. And cleaner data will lead to more use, while reducing the costs associated with “bad data quality” such as decisions made using incorrect analytics.

What is Data Stewardship?

If you think of all the data you need to work with each day, you know that often it is incomplete and sometimes incorrect. You may be able to fix it since you know it, but that process does not scale when dealing with vast amounts of data, or when other groups “bring their own data” and know what it should look like. Also, let’s not forget that using email or Excel to resolve data quality issues one by one is not very efficient, not to mention the risks that come with the proliferation of uncontrolled copies of potentially sensitive data across file folders everywhere in the enterprise. You need purpose-built tools, processes and policies to effectively and sustainably manage data quality.

As a critical component of data governance, data stewardship is the process of managing the lifecycle of data from curation to retirement.  Data stewardship is about defining and maintaining data models, documenting the data, cleansing the data, and defining the rules and policies. It enables the implementation of well-defined data governance processes covering several activities including monitoring, reconciliation, refining, deduplication, cleansing and aggregation to help deliver quality data to applications and end users.

In addition to improved data integrity, data stewardship helps ensure that data is used consistently throughout the organization, and reduces data ambiguity through metadata and semantics. Simply put, data stewardship reduces “bad data” in your company, which translates to better decision-making and the elimination of the costs incurred when using incorrect information.

Traditionally, data stewardship tasks are assigned to a staff of data experts, the so-called data stewards. But the challenge is that there are few data stewards in a company and they are generally dedicated to high risk projects, such as regulatory compliance. In the absence of data stewards, nobody knows who is accountable for data quality, and that is what leads to a frustrating situation where organizations are fully aware that almost one third of their data assets are not accurate, but nobody acts on it.

Data Stewardship, Now a Team Activity

With more data-driven projects, “bring your own data” projects by the lines of business, and increased use of data by data workers such as data scientists, marketing and operations, there is a need to rethink data stewardship. Next-generation data stewardship tools need to evolve to support:

  • Self-service – so that any user from IT to the business can solve data quality issues in a controlled way 
  • Team collaboration – including workflow and task orchestration
  • Manual interaction – in the case of data arbitration or certification where human intervention is required to validate, certify, tag, or select a dataset
  • Integration with data preparation – defining a process for “bring your own data” 
  • Built in privacy – empowering the data protection officer and compliance teams to address new industry regulations for maintaining privacy such as GDPR (General Data Protection Regulation)

Introducing Talend Data Stewardship

With Talend Winter ’17, we are proud to launch a new capability, the Talend Data Stewardship app: a comprehensive tool for configuring and managing data assets that addresses the quality challenges holding your data-driven projects back.


Talend Data Stewardship is more than a tool for data stewards with specific data expertise: IT can use it to empower business users to curate their data through a point-and-click, Excel-like interface. With Talend Data Stewardship you can manage and quickly resolve any data integrity issue to achieve “trusted” data across the enterprise. With the tool, you define the common data models, semantics, and rules needed to cleanse and validate data, then define user roles, workflows, and priorities, and delegate tasks to the people who know the data best. Productivity improves across your data curation tasks: matching and merging data, resolving data errors, certifying, or arbitrating on content.

Delegating tasks that used to be done by data professionals, such as data experts, to the operational workers who know the data best is called self-service. It requires workflow-driven, easy-to-use tools with an Excel-like user experience and smart guidance. In this respect, Talend Data Stewardship uses the same user interface as Talend Data Preparation, and the tools are bundled together in a unified suite for self-service data access, preparation, integration and curation. While Talend Data Preparation empowers business users to get clean, useful data in minutes, not hours, in an ad-hoc way, Talend Data Stewardship orchestrates the collaborative work of fixing, merging and certifying data with self-service data curation. Much as office workers use Excel and Word together for office automation, data workers get access to these two tools with a consistent user experience and use whichever fits their use case.

Because it is fully integrated with the Talend platform, Talend Data Stewardship can be associated with any data flow and integration style that Talend can manage, so you can embed governance and stewardship into data integration flows, MDM initiatives, and matching processes.

Tools For Everyone

The core concepts of Talend Data Stewardship are campaigns and tasks, and the product comes with two predefined roles: campaign owners and data stewards.

  • Campaign owners can define different campaigns (Arbitration, Resolution or Merging); engage the data stewards who will contribute to each campaign; define the structure of the data used by the campaigns; refer to Talend Jobs to load tasks into the campaigns; retrieve tasks from the campaigns; and assign tasks in the campaigns to different data stewards.

  • Data stewards can explore the data that relates to their tasks, resolve tasks one at a time or for a whole set of records, delegate tasks to colleagues, and monitor and audit stewardship campaigns and data error resolution.

Additionally, Talend Data Stewardship can trigger validation workflows for tasks that should be double-checked. Because it is workflow-driven and easy to use through a guided user experience, anyone can participate in data curation efforts, with clear responsibilities and efficient tools to carry them out.

CRM Example Use Case

Consider a use case where you want to improve the quality of data in your CRM system, as it has incorrect data and many duplicates. As the campaign owner using Talend Data Stewardship, you would define a Resolution campaign and objective (e.g. resolve incorrect addresses) and quarantine the data that needs attention, typically the records with invalid or empty contact data, or the potential duplicates. You would then define the participants in the campaign, for example all regional marketing managers, digital marketing managers, and the sales admin. Then you would assign tasks; for example, the error resolution tasks for the German marketing contacts are assigned to the German marketing managers, because they know this data best and can certify it, correct it, or reconcile it against multiple versions of the truth. And they will benefit from the cleansed data through higher conversion rates in their marketing campaigns. As each stakeholder updates the data, you can track the changes made, e.g. marketing verified mailing addresses, telephone numbers and email addresses.

Next, a merging campaign is created to match and merge duplicate records and the sales admin can merge the duplicate records. 


Take It For A Test Drive

In summary, as companies consume more data and start providing self-service access to data, there is a clear requirement for self-service data quality tools to get the most out of your data. The business benefits from increased data usage and more informed decisions using better data.  IT also benefits by delegating data cleaning tasks to data workers.

Are you starting to fill the data lake and realize that you now need to manage it?  Take Talend Data Stewardship for a test drive!

With Talend Winter ’17, you get 2 free licenses of Talend Data Stewardship and Talend Data Preparation with your Talend subscription. Contact your local Talend sales representative for your Talend Winter ’17 evaluation.


The Future of Apache Beam, Now a Top-Level Apache Software Foundation Project


Our journey to this day started 10 months ago, and what an exciting road it has been.

In February 2016, Google, Talend, Cloudera, dataArtisans, PayPal and Slack joined efforts to propose Apache Beam (see “Introduction to Apache Beam”) to the Apache Incubator, the entry path into The Apache Software Foundation (ASF).

The numbers are pretty impressive. During the incubation period for Apache Beam we saw:

  • More than 1600 pull requests created, resulting in more than 4500 commits.
  • More than 1000 tickets created and fixed.
  • More than 100 contributors to the code.
  • Three releases, made by three different release managers.

Beyond the impressive appeal shown by the numbers above, it was great to see how the Apache Beam community grew and rallied behind this project. After the “legacy” players got involved, new actors also joined the Apache Beam community.

Thanks to the design and approach of Apache Beam, we interacted with a large range of other projects, including Apache (Kafka, Cassandra, Avro, and Parquet) and non-Apache projects such as Elasticsearch, Kinesis, and Google. We are eager to see additional contributions and feedback that will keep improving the capabilities of Apache Beam.  As a mentor to the Apache incubation podling, I can honestly say the team is awesome to work with. They are very open minded, eager to help and committed to this project. I am equally committed to contributing to the Apache Beam project daily. I truly believe that Apache Beam is the next level of streaming analytics and data processing. It is a great choice for both batch and stream processing and can handle bounded and unbounded data sets.

Talend began to evaluate Google Dataflow in 2015 and immediately knew we wanted to get involved because we see Beam as a natural extension to our code-generating platform and a way to provide even greater agility to our customers. By updating the Beam “runner” for any new API changes (including adopting a brand new framework like Flink or Apex), we get 100 percent full fidelity support across the product suite. It’s not surprising then that Talend has been a very active contributor to the Apache Beam community over the last two years.

Stay tuned for more information on Apache Beam project developments and forthcoming Talend product enhancements in the near future.

About Jean-Baptiste (@jbonofre)

ASF Member, PMC for Apache Karaf, PMC for Apache ServiceMix, PMC for Apache ACE, PMC for Apache Syncope, Committer for Apache ActiveMQ, Committer for Apache Archiva, Committer for Apache Camel, Contributor for Apache Falcon



Talend Data Masters 2016: How the ICIJ Decoded the Panama Papers with Talend


The Panama Papers is the biggest data leak and cross-border investigation in journalism history. For one year, around 400 reporters across almost 80 countries dived into a massive trove of information that exposed how the offshore economy works. Talend Big Data was instrumental in bringing that information into the public domain.

Founded in 1997, the International Consortium of Investigative Journalists (ICIJ) is a global network of more than 190 independent journalists in more than 65 countries who collaborate on exposing big investigative stories of global social interest. In May 2015, ICIJ obtained from the German newspaper Süddeutsche Zeitung an encrypted hard drive with leaked data from the Panamanian law firm Mossack Fonseca.

Massive Mountains of Information

The 2.6 terabytes and 11.5 million files of Panama Papers data were made up of more than 320,000 text documents, 1.1 million images, 2.15 million PDF files, 3 million database excerpts and 4.8 million emails. The entire set of printed documents would weigh 3,200 tons and take more than 41 years of nonstop operation to print on an office laser printer, consuming a small forest of 80,000 trees as paper.


Open-Source Technology Inside 

The ICIJ used Talend Big Data to reconstruct Mossack Fonseca’s client database from the database excerpts and convert it into a Neo4j graph database. They visualized it with Linkurious, a graph visualization platform to organize and access the information.

They knew from the beginning that they ultimately wanted to make this database open to the public. The data quality requirements were therefore high: millions of people would see the information, and a mistake could be catastrophic for the ICIJ in terms of brand reputation and lawsuits. Talend was key in enabling the ICIJ’s data team to work efficiently and remotely across two continents and to document each step of the preparation process.

Data Democratization

On April 3, 2016, more than 100 media organizations published the results of the year-long investigation. The list of over 210,000 companies across 21 jurisdictions included activity tied to the ongoing Syrian war, the looting of resources in Africa, and individual offshore transactions from billionaires, sports players and other celebrities. The report also linked company relationships to 140 politicians in more than 50 different countries, including 12 current or former world leaders.

The political reaction came almost immediately. Iceland’s prime minister resigned two days after the revelations, France put Panama back on its tax haven list, and U.S. President Barack Obama called for international tax reform. Swiss police conducted two raids, including one on the headquarters of UEFA, the body that oversees professional soccer in Europe. A member of FIFA’s ethics committee was forced to resign.  

ICIJ’s Panama Papers investigation produced a daily drumbeat of regulatory moves, follow-up stories and calls for more action to combat offshore financial secrecy. At least 150 inquiries, audits or investigations into its revelations have been announced in 79 countries around the world, the result of pioneering the use of data to help uncover illegal activity.


Apache Beam Your Way to Greater Data Agility


If you are Captain Kirk or Mr. Spock and you need to get somewhere in a hurry, then you “beam” there, it’s just what you do. If you are a company and you want to become more data driven, then as surprising as it may sound, the answer there could be beam as well, Apache® Beam™.

This week, the Apache Software Foundation announced that Apache Beam has become a top-level Apache project. Essentially, becoming a top-level project formalizes and legitimizes it and indicates the project has strong community support. For those of you not familiar with Beam, it’s a unified programming model for batch and streaming data processing. Beam includes software development kits in Java and Python for defining data processing pipelines, as well as runners to execute the pipelines on a range of engines, such as Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. So, with Beam, you no longer have to worry about the actual runtime where your processes will be deployed and executed. We see this as massive for IT teams looking to keep up with both data technology innovation and the increasing pace of business. There’s an introduction to Beam overview here, if you wish to learn more.
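
To make the unified-model idea concrete, here is a minimal, hedged sketch of a Beam pipeline in Java. The input and output paths are placeholder assumptions, and the exact SDK packaging can differ slightly between Beam releases; the point is that the pipeline code itself says nothing about which engine runs it.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

public class MinimalBeamPipeline {
  public static void main(String[] args) {
    // The runner (Direct, Spark, Flink, Apex, Dataflow, ...) is picked via options
    // on the command line; nothing in the pipeline code is tied to an engine.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("hdfs:///input/events-*.txt")) // placeholder path
     .apply("CountPerLine", Count.perElement())
     .apply("Format", MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
       @Override
       public String apply(KV<String, Long> kv) {
         return kv.getKey() + "," + kv.getValue();
       }
     }))
     .apply("WriteCounts", TextIO.write().to("hdfs:///output/counts")); // placeholder path

    p.run().waitUntilFinish();
  }
}

Launching the same class with a different --runner option (plus that runner's dependencies) is what the portability argument below boils down to.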

All organizations are racing to transform themselves into digital businesses and use data as the basis for growth and innovation ahead of the competition. The problem is, getting there is anything but straightforward, and the shot clock is on. Modern information needs demand far more complex data integration and management to support things like greater market responsiveness, and real-time, personalized relationships with customers, partners, and suppliers. With the data needs of the business increasing exponentially, CIOs are being forced to make strategic technology bets even as the market continues its dramatic transformation. This can be a major issue, as a technology choice made today to fuel progress can easily become an anchor to advancement tomorrow.

Helping companies get over this hurdle has been a major focus for Talend and is why we designed our Data Integration solutions the way we did, as native code generators. Even ten years ago when we first introduced Talend, we knew that there had to be a better, faster, and cheaper way than hand-coding to manage data integration projects. Our head of sales in Europe, Francois Mero, recently wrote an entire piece detailing the advantages of code-generating tools over hand-coding, so I won’t go into a lot of detail here. Net/net, ten years ago code generation provided strong economies of skills and scale over hand coding because it is quicker and more cost effective; however, today with the velocity of modern data use cases, it’s an absolute no-brainer. It’s really about creating greater agility through the portability or re-usability of projects. To explain, in 2014 and 2015, MapReduce was the standard, but by the end of 2016 Spark emerged to replace it. Spark offered such significant advantages over its predecessor that it was a competitive advantage to make the switch as quickly as possible. If companies were using hand-coding to develop MapReduce projects, then they had to recreate everything in Spark, costing them a tremendous amount of time and money. In the case of companies leveraging code-generating tools, the change was as simple as a couple of clicks of the mouse.

Enter Apache Beam. Here is what our CTO, Laurent Bride stated in the Apache Software Foundation announcement about Beam’s move to becoming a top-level project:

“The graduation of Apache Beam as a top-level project is a great achievement and, in the fast-paced Big Data world we live in, recognition of the importance of a unified, portable, and extensible abstraction framework to build complex batch and streaming data processing pipelines. Customers don’t like to be locked-in, so they will appreciate the runtime flexibility Apache Beam provides. With four mature runners already available and I’m sure more to come, Beam represents the future and will be a key element of Talend’s strategic technology stack moving forward.”

Talend has chosen to embrace Beam because we see it as a natural extension to our code-generating platform and a way to provide even greater agility to our customers. By updating the Beam “runner” for any new API changes (including adopting a brand new framework like Flink or Apex), we get 100% full fidelity support across the product suite. In contrast, what do those still using custom code do when Spark changes their APIs? They have to rewrite large chunks of it or the entire thing. Again, this isn’t theoretical; Spark made big disruptive changes to their APIs going from 1.6 to 2.0. It’s a particularly tricky situation now since Spark 2.0 isn’t ready for production use. What do you do? If you write to the version that works now you know that you’re digging yourself into a huge hole going forward. Or perhaps you roll the dice and write to the new one and hope that it’s ready for real production when you need to go live. Perhaps you guess right, and that’s good for you; however, it’s only going to keep happening – over and over again. It’s not just about Spark. If you decide that your streaming use cases are latency sensitive enough that micro-batching isn’t good enough then you’ll want to look at using Flink, Apex, or something else. Again, with a code-generation tool like Talend, that’s a couple of clicks for the increasing number of frameworks that Beam supports, or at worst a new runner away from supporting something brand new. With hand coding, it’s a complete ditch and restart. 

So, what say you, ready to get beamed up?


Accelerate Data Lake Creation and Software Development Lifecycles with Talend Integration Cloud Winter ’17


Today, we announced the general availability of the Winter ’17 release of Talend Integration Cloud, our integration platform-as-a-service (iPaaS). This release helps customers build data lakes dramatically faster using AWS S3 and reduce data security vulnerabilities through controlled access. The new release also enables customers to continuously deliver integration projects, accelerate the software development lifecycle with separate environments, and speed widespread Salesforce adoption.

Create Data Lakes Faster

There are several reasons why the Winter ‘17 release should be extremely attractive to our customers and prospects. For starters, if customer data lakes are built on S3, or they are considering moving their data lakes to S3, Talend Winter ’17 offers improved support for AWS functionalities that help customers create data lakes quickly using:

* Fast, Easy Uploading of Large Files: This feature is extremely useful for customers that need to upload file objects greater than 100 MB. You can now break larger objects into ‘chunks’ and upload these file parts in parallel—a tremendous time savings! If the upload of a certain file part fails, you can easily restart just that part. Simply put, customers can upload massive volumes of data without having to worry about an unreliable network connection (see the sketch after this list).

* Low Cost, Reliable Message Transmission with Amazon Simple Queue Service (SQS): Amazon SQS is a fast and reliable message queuing service that is often used to transmit large volumes of data without losing communications or requiring other services to be available. With Talend’s new native connector for Amazon SQS, customers can incorporate a highly scalable messaging cluster into the integration workflow.
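
To illustrate the multipart-upload idea outside of the Talend component, here is a minimal sketch using the AWS SDK for Java; the bucket name, key and local file path are placeholder assumptions, and the SDK's TransferManager handles splitting the object into parts and uploading them in parallel.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;

public class MultipartUploadSketch {
  public static void main(String[] args) throws Exception {
    AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // TransferManager switches to multipart upload above the configured threshold
    // and uploads the parts in parallel; a failed part can be retried on its own.
    TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(s3)
        .withMultipartUploadThreshold(100L * 1024 * 1024) // multipart above ~100 MB
        .build();

    Upload upload = tm.upload("my-data-lake-bucket", "raw/2017/big-extract.csv",
        new File("/data/exports/big-extract.csv")); // placeholder bucket, key and file
    upload.waitForCompletion();

    tm.shutdownNow();
  }
}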


Accelerate Integration Projects with Continuous Delivery

Most customers are challenged by the wide range of adoption and customization work required for SaaS apps and platforms. Additionally, the demands of modern enterprises require multiple QA checks to ensure software will perform at its utmost efficiency. Talend Winter ’17 includes capabilities for development teams to incorporate continuous delivery into their software development lifecycle (SDLC), enabling enterprises to plan and execute integration projects frequently, quickly, and in a frictionless manner, which speeds time to market.

* Create Separate Environments for Each Specific Stage of the SDLC

The SDLC process includes three simple steps:

  1. Development – A developer creates a job in Talend Studio (with environment creation capabilities).
  2. QA – Upon creation of a job, a developer publishes it to the cloud in the QA environment. A QA engineer then receives the job, configures and tests it. If the job passes the test, the QA engineer notifies the Ops engineer and creates a specific workspace for it.
  3. Production – The Ops engineer receives the notification and promotes the job to the production environment by either selecting publication to all environments or putting each job in a particular workspace.

Simple and seamless, right? Customers can even further streamline their SDLC process by creating and managing role-specific security and configuration policies, another new feature in this release. This dramatically enhances software development security, prevents unintentional or unauthorized access, and greatly improves operational efficiency.


Connect with Partners and Customers Faster using Salesforce

The Winter ’17 release of Talend Integration Cloud provides support for the Salesforce Sales, Salesforce Service, and Wave Summer ’16 APIs. This new functionality enables Talend Integration Cloud customers and prospects to connect with customers, employees, and partners more efficiently by taking full advantage of all Salesforce Summer ’16 features.

Where to Learn More

In addition to the new features cited above, there are some additional product enhancements in Talend Winter ’17 that customers may find beneficial for integration projects. For more detailed information about the new features in the Winter ’17 release of Talend Integration Cloud, check out the following resources:


Power to The People – Creating Trust in Data with Collaborative Governance


Today’s enterprise IT organizations are once again experiencing a massive upheaval due to pressure from employee forces.

It’s a familiar story. Just think of the turmoil caused by the dawning of the bring-your-own-device (BYOD) era, with employees demanding to use their beloved, personal mobile phones or tablets for work. If IT balked at their requests for mobile, some resourceful users would resort to workarounds – creating ‘shadow IT’ – to get access to corporate systems on their personal devices.  Of course, in the process, those employees also unknowingly put sensitive company information at risk.

Now even more IT agitations are on the way, once again being generated by employee demand. This time, users demand access to the growing pools of big data companies have amassed, and the insights they likely contain. If IT can’t deliver the tools to access the information residing in corporate data lakes, employees will—just as they did in the BYOD era—find a workaround, which will likely put enterprise information at risk. Thus, there is no other option for IT than to deliver data via self-service access across all lines of business. However, IT must find the proper way to do so in order to prevent exposing company assets to unnecessary risk. They must adopt a model of collaborative governance.

The transition from authoritative to collaborative governance of company data might be hard, but there’s an opportunity for corporate IT departments to create a system of trust around enterprise data stores, wherein employees collaborate with IT to maintain and/or increase the quality, governance and security of data. The good news is that IT professionals have a blueprint from the companies that pioneered the use of the World Wide Web for collaborative data governance. Just as Web 2.0 evolved around trends that focused on the idea of user collaboration, sharing of user-generated content, and social networking, so too does the concept of collaborative governance. Collaborative governance breaks down the technological and psychological barriers between enterprise data keepers and information consumers, allowing everyone within an organization to share the responsibility of securing enterprise data.  This concept has the power to transform entire industries.

Wikipedia is a good example of user collaboration in action. Launched in 2001, it is the world’s sixth most popular website in terms of overall visitor traffic. Everyone can contribute or edit entries – a mixed blessing when it comes to reliability and trust.

Or take TripAdvisor, an American travel website company that provides reviews of travel-related content and interactive travel forums.  It was an early adopter of user-generated content. 

Airbnb is another excellent example of collaborative governance in action. Founded in 2008 as a “trusted community marketplace for people to list, discover, and book unique accommodations around the world,” including, as the website states, “an apartment for a night, a castle for a week, or a villa for a month,” it is the users themselves that provide the venues, and the company that provides a platform which owners and travelers can leverage to share and book venues. 

The greatest challenge – and enabler – for this model has always been trust. Users place their trust in others to accurately update content and information (ratings), meaning consumers are putting their trust solely in the information presented to them. The system works because the data in the system is bountiful and the platform it resides within is designed specifically to enhance user experience.

Now let’s consider the typical IT landscape in an enterprise. Information used to be designed and published by a very small number of data professionals targeting their efforts to “end-users”, or consumers, who were ingesting the information. Today, the proliferation of information within companies is uncontrollable, just like it was on the Web. We’re all experiencing the rise of a growing number of cloud applications coming through sales, marketing, HR, operations or finance to complement centrally designed, legacy IT apps, such as ERP, data warehousing or CRM. Digital and mobile applications connect IT systems to the external world. To manage these new data streams, we are watching new data-focused roles emerging within corporations, such as data analysts, data scientists or data stewards, which are blurring the lines between enterprise data consumers and providers. Just like the adoption of BYOD, these new roles are presenting challenges of corporate data quality, reliability, and trust that must be addressed by IT organizations. 

As the Web 2.0 model evolved, trust between consumers and their service providers was established by crowdsourced mechanisms for rating, ranking and establishing a digital reputation (think Yelp). One lesson learned in the consumer world is that the rewards of trust are huge. These same positive returns can be realized by enterprise IT departments that adopt selected strategies embraced by their more freewheeling consumer counterparts.   

Delivering a system of trust through collaborative data governance and self-service is just one of the opportunities available to evolving IT organizations.  Through self-service, line of business users become more involved with the actual collection, cleansing, and qualification of data from a variety of sources, so that they can then analyze that data and use it for more informed decision-making. Currently, many companies—in their mad rush to become data-driven—are increasingly making decisions based on incomplete and inaccurate data. In fact, according to The Data Warehousing Institute, ‘dirty data’ is costing businesses $600B a year. Companies will continue to experience extreme loss and possible failure if they don’t have a sound data governance system in place.

Collaborative data governance is an easy way for IT to help ensure that the quality, security, and accuracy of enterprise information is preserved in a self-service environment. Collaborative governance allows employees in an organization to correct, qualify and cleanse enterprise information. This helps IT because the master data records are being updated by those most familiar with or closest to the data itself (i.e. the marketing analyst who cleanses tradeshow leads, or the financial analyst who rectifies a budget spreadsheet).

Additionally, fostering the crucial shift to more business user involvement with an organization’s critical data leads to numerous other benefits.  For example, users save time and increase productivity when they work with trusted data. Marketing departments improve their campaigns. Call centers work with more reliable, accurate customer information, much to everyone’s satisfaction. And the enterprise gets better control over its most valuable asset: data.  

So my simple message to companies looking to become more data driven: digital transformation can be achieved—it’s all just a matter of trust.

 

 

 


Getting Started with Big Data


Big data is here to stay

After social media, the Internet of Things is the next big driving force behind the increase in data worldwide, which is doubling in size every two years.  [1] At the same time, data processing speeds and capabilities are becoming increasingly important because—much like food—data loses relevance after a certain date.  Additionally,  the increasing variety of structured, unstructured and semi-structured data (such as pictures, text, videos and sound) is now becoming easier to capture and analyze. The three main factors defining big data are: volume, velocity and variety. [2] Companies which are in command of these three classes of data can derive great value from them and will be more successful in comparison to their less digitally adept counterparts. [3]

Digital leadership today is best demonstrated by companies such as Amazon, Netflix, and Uber, that have successfully implemented a comprehensive data acquisition strategy that helps differentiate them from competitors. Having the right big data technology platform also needs to be part of that strategy. According to O’Reilly’s report on “The Big Data Market 2016” [4] ,“larger enterprises (those with more than 5,000 employees) are adopting big data technologies [such as Hadoop and Spark] much faster than smaller companies.”

However, there are lots of opportunities for smaller organizations to become ‘digital leaders’ using today’s modern data platforms. According to Cloudera, maturing Hadoop ecosystems can not only help achieve cost savings, but can also open up new business opportunities by making it possible to use data more strategically. The main applications of big data are better understanding customers, improving products and services, achieving more effective processes, and reducing risks with improved quality assurance and better problem detection. [5]

How to get off to a good start

At the outset, it can seem difficult to get started with big data projects. In our day to day work we see many medium-sized companies (in the German-speaking market) thinking about big data technologies in principle, but not managing to get things off the ground with specific projects.  So what’s really the best way to get started?

In our experience, the most successful approach is usually to start small, with a clearly defined project plan that is relevant to your business. Many of our customers are currently facing the challenge of having to connect to novel data sources and store ever larger quantities of data. This is often machine data, and occasionally social media data. In principle, it would be possible to accommodate this data in a relational database, or possibly in an existing data warehouse. But this is usually expensive in the context of substantial projects, so it is well worth considering alternatives. To put it simply, it just doesn’t feel quite right.

Typically, smaller projects are an excellent way to gain experience with big data technologies. A relatively small, manageable and isolated project can often provide a low-risk way to get started with a new technology. It doesn’t matter whether all three Vs (volume, velocity, variety) are completely fulfilled or not. But it is important to find a controllable, appropriate and relevant big data use case with measurable factors of success in order to ensure it can be transferred quickly from a pilot to a production environment. [6]

Even if you could address a big data project with the tried and tested technologies you currently own, taking a chance to get started with big data technologies should not be missed. Otherwise, you might find yourself unable to deal with the complexity of a larger project, if you haven’t already experimented with new technologies on a more controllable scale.

The architecture of a big data project is usually quite manageable, as the technologies are already mature and much more accessible. A Hadoop distribution is used for data storage. First, data has to be collected at the source, potentially transformed (although for big data it is advisable to store raw data without transforming it) and then loaded into Hadoop. The Talend Big Data platform provides everything you need to implement such a link based on a model, generating high-performance native code that helps your team get up and running with Apache Hadoop, Apache Spark, Spark Streaming and NoSQL technologies quickly.

In the end, the data is usually evaluated, either directly at the raw data level or via a detour through a data mart with preprocessed data. The data mart can in turn be filled with Talend. Evaluations can then be carried out with suitable tools already in use, although this is also a good opportunity to introduce new tools, typically from the fields of data visualization, discovery and advanced analytics.

[Figure: typical big data architecture, from data sources through Hadoop to data marts and analytics tools]

Start, grow and create opportunities

Big data and traditional data warehouses are growing closer together. Theoretically, a complete data warehouse can be modernized with the help of big data technologies, such as Hadoop, something that frequently leads to significant cost savings while simultaneously opening up new opportunities. But it is also possible for the world of big data to merge with traditional data warehouses at a more leisurely pace. Once the big data infrastructure is there, it is simple to link Hadoop with the data warehouse – potentially in both directions as well. The data warehouse can serve as a source of data which is stored in Hadoop.  By the same token, data from Hadoop can be read, transformed and finally stored in the data warehouse. The two worlds don’t have to be isolated from one another, instead they merge together and, in the end, you have a data warehouse based on big data technologies.

Great journeys always begin with a first small step. Big data technologies are more mature and accessible now, but, as so often in life, you can only progress after you get started on something specific. That is why we recommend you actively look for such projects. Once you have set out on your journey and have investigated the technologies and set up the infrastructure, you will quickly benefit from the new opportunities it presents. All this means you can continue to stay competitive in this age of data-based companies.

References

[1] https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm

[2] https://en.wikipedia.org/wiki/Big_data

[3] https://www.idc.com/getdoc.jsp?containerId=prAP40943216

[4] https://www.oreilly.com/ideas/the-big-data-market

[5] http://www.cloudera.com/content/dam/www/marketing/resources/whitepapers/the-business-value-of-an-enterprise-data-hub.pdf.landing.html

[6] http://www.gartner.com/newsroom/id/3466117

[7] https://talend.com/products/big-data

About the author Dr. Gero Presser

Dr. Gero Presser is a co-founder and managing partner of Quinscape GmbH in Dortmund. Quinscape has positioned itself on the German market as a leading system integrator for the Talend, Jaspersoft/Spotfire, Kony and Intrexx platforms and, with its 100 members of staff, takes care of renowned customers including SMEs, large corporations and the public sector.

Gero Presser did his doctorate in decision-making theory in the field of artificial intelligence, and at Quinscape he is responsible for building up the Business Intelligence line of business with a focus on analytics and integration.




How to Offload Oracle and MySQL Databases into Hadoop using Apache Spark and Talend


In the big data space, a common pattern is offloading a traditional data warehouse into a Hadoop environment. Whether it is for primary use or only to store “cold” data, Talend makes the offload painless.

Many organizations trying to optimize their data architecture have leveraged Hadoop for their cold data or to maintain an archive. With the native code generation for Hadoop, Talend can make this process easy.

Talend already provides out-of-the-box connectors to support this paradigm using Sqoop; here we are going to focus on how to do the same using Apache Spark.

Apache Spark is a fast and general engine for large-scale data processing. The engine is available in most of the latest Hadoop distributions (Cloudera, Hortonworks, MapR, AWS EMR, etc.). Built for massively parallel, in-memory processing, it allows you to massively parallelize a data flow to handle any enterprise workload.

The fastest and best-known solution today for bringing data from your databases into Hadoop is Sqoop (which uses a MapReduce process underneath to perform the offload from the RDBMS to Hadoop). Here I want to introduce you to something that serves the same purpose as Sqoop but uses Spark as the framework/engine.

In this blog post, I’m going to address first how to use Spark to move one table from Oracle or MySQL into Hadoop. Then, once we have a working job for this task, we will see how to turn it into a generic job controlled by a list of tables to move from your database server to Hadoop.

For simplicity, we will key in on the following two scenarios:

  • How to move a Table into HDFS from a Spark job.
  • How to automate and turn the job above into a Metadata-driven ingestion framework to work on a list of tables.

Moving a Table into HDFS from a Talend Spark Job

[Figure: Talend Spark job that extracts a database table and writes it to HDFS]

In this scenario, we created a very generic job that extracts from a database table and moves the data into HDFS using Apache Spark and a generic query statement such as:

"SELECT concat_ws('" + context.FIELD_SEPARATOR + "', " + context.column_list + ") as my_data FROM my_table"

context.FIELD_SEPARATOR is a context variable at the job level, set to ‘,’ or ‘;’ or ‘|’ or another separator. context.column_list is a context variable holding the comma-separated list of fields to be extracted (for example: field1, field2, field3, etc.).

The Offload piece will execute the query statement natively on Hadoop using Spark. The generated code is deployed directly through YARN.
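
For readers who want to see the underlying pattern outside of the generated Talend code, here is a minimal, hedged sketch written directly against the Spark Java API. The JDBC URL, credentials, column names and HDFS path are placeholder assumptions, and the selectExpr call mirrors the concat_ws query shown above.

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TableOffloadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("offload-my_table").getOrCreate();

    Properties props = new Properties();
    props.setProperty("user", "etl_user");      // placeholder credentials
    props.setProperty("password", "secret");

    // Read the source table over JDBC (MySQL shown; an Oracle URL works the same way).
    Dataset<Row> source = spark.read()
        .jdbc("jdbc:mysql://dbserver:3306/sales", "my_table", props);

    // Equivalent of SELECT concat_ws(<separator>, <column_list>) AS my_data FROM my_table.
    String separator = "|";
    Dataset<Row> flattened = source.selectExpr(
        "concat_ws('" + separator + "', field1, field2, field3) AS my_data");

    // Land the result in HDFS as plain text, one delimited record per line.
    flattened.write().text("hdfs:///datalake/offload/my_table");

    spark.stop();
  }
}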

Automating and turning the Job above into a Metadata driven ingestion framework to work on a list of tables

[Figure: metadata-driven framework iterating over a list of tables and calling the generic Offload job]

The offload-preparation process starts at the database. Next, the table list is pulled and contextualized, along with the list of columns in each table (preparing the variables to be sent to the Offload job). Once this has been completed, the Offload job is simply called in an iteration over the tables to offload to Hadoop. The Offload process is the job described in the section “Moving a Table into HDFS from a Talend Spark Job” above.
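
A hedged sketch of that driver logic is below: it reads the table list from the database catalog, builds the column list for each table, and hands the parameters to the generic offload job. The schema name, connection details and the runOffloadJob method are hypothetical stand-ins for what the Talend parent job does with context variables and a child-job call.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class MetadataDrivenOffloadSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://dbserver:3306/sales", "etl_user", "secret"); // placeholder source
         Statement stmt = conn.createStatement();
         ResultSet tables = stmt.executeQuery(
             "SELECT table_name FROM information_schema.tables WHERE table_schema = 'sales'")) {

      while (tables.next()) {
        String table = tables.getString("table_name");

        // Build the column list for this table (what becomes context.column_list).
        List<String> columns = new ArrayList<>();
        try (Statement colStmt = conn.createStatement();
             ResultSet cols = colStmt.executeQuery(
                 "SELECT column_name FROM information_schema.columns "
                 + "WHERE table_schema = 'sales' AND table_name = '" + table + "'")) {
          while (cols.next()) {
            columns.add(cols.getString("column_name"));
          }
        }

        // Hand the contextualized parameters to the generic offload job;
        // runOffloadJob is a hypothetical stand-in for the Talend child-job call.
        runOffloadJob(table, String.join(",", columns), "|");
      }
    }
  }

  static void runOffloadJob(String table, String columnList, String fieldSeparator) {
    System.out.printf("Offloading %s with columns [%s] and separator '%s'%n",
        table, columnList, fieldSeparator);
  }
}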


Edge Analytics – The Pros and Cons of Immediate, Local Insight


A number of data scientists reached out to me about data storage and processing as discussed in my last blog on IoT. Their questions largely fell into the same bucket: they are puzzled about what to do with their data, whether they should store or discard it and, if they store it, what the best approach is to making that data a strategic asset for their company.

Despite the widespread proliferation of sensors, the majority of industrial internet of things (IIoT) data collected is never analysed, which is tragic. Many existing IoT platform solutions are painfully slow, expensive and a drain on resources, which makes analysing the rest extremely difficult. Gartner has said that 90% of deployed data will be useless, and Experian found that about 32% of data in US firms is inaccurate. The key takeaway is that data is the most valuable asset for any company, so it would be a shame to completely discard it or let it lie dormant in an abandoned data lake somewhere. It’s imperative that all data scientists tap into their swelling pools of IoT data to make sense of the various endpoints of information and help develop conclusions that will ultimately deliver business outcomes. I am totally against discarding data without processing it.

As mentioned in my IoT blog, in a few years there will be an additional 15 to 40 billion devices generating data at the edge compared with what we have today[1]. That brings new challenges. Just imagine an infrastructure transferring all of this data to data lakes and processing hubs. The load will continue to rise exponentially over the coming months and years, stretching the limits of your infrastructure. The real benefit of this data comes from analysis, whether it is traffic data from “things” or footage from surveillance cameras. In time-critical situations, delaying this analysis might make it “too late”. The delay could be due to many reasons, such as limited network availability or overloaded central systems.


A relatively new approach, “edge analytics”, is being used to address these issues. Basically, it is as simple as it sounds: perform the analysis at the point where the data is being generated, in real time and on site. The architectural design of “things” should consider built-in analysis. For example, sensors in trains or at stop lights that provide intelligent monitoring and management of traffic should be powerful enough to raise an alarm to nearby fire or police departments based on their analysis of the local surroundings. Another good example is security cameras. Transmitting the live video unchanged is pretty much useless. There are algorithms that detect change, and if a new image can be generated from the previous image, they send only the changes. These kinds of events make more sense to process locally rather than sending them over the network for analysis. It is very important to understand where edge analytics makes sense and, if “devices” do not support local processing, how we can architect a connected network to make sense of data generated by sensors and devices at the nearest location. Companies like Cisco, Intel and others are proponents of edge computing and are promoting their gateways as edge computing devices. IBM Watson IoT, a joint IBM and Cisco project, is reshaping analytics architecture by offering powerful analytics anywhere. Dell, a typical server hardware vendor, has developed special devices (Dell Edge Gateway) to support analytics at the edge. Dell has built a complete system, hardware and software, for analytics that allows an analytics model to be created in one location or in the cloud and deployed to other parts of the ecosystem.

However, there are some compromises that must be considered with edge analytics. Only a subset of the data is processed and analysed, and only the analysis result is transmitted over the network. This means we are effectively discarding some of the raw data and potentially missing some insights. The question is whether this “loss” is bearable: do we need the whole dataset, or is the result generated by the local analysis enough for us? What impact will it have? There is no generalised answer to this. An airplane system cannot afford to miss any data, so all data should be transferred and analysed to detect any kind of pattern that could lead to an abnormality; but since transferring data during a flight is not convenient, collecting data for offline analysis and running edge analytics during the flight is a better approach. Systems with more fault tolerance can accept that not everything is analysed. This is where we will have to learn by experience as organizations begin to get involved in this new field of IoT analytics and review the results.

Again, data is valuable. All data should be analysed to detect patterns and support market analysis. Data-driven companies are making a lot more progress compared to traditional ones. IoT edge analytics is an exciting space and an answer to the maintenance and usability of data, and many big companies are investing in it. An IDC FutureScape report for IoT predicted that by 2018, 40 percent of IoT data will be stored, processed, analysed and acted upon close to where it is created, before it is transferred to the network[2]. Transmitting data costs money, and we need to cut that cost without impacting the quality and timeliness of decisions; edge analytics is definitely an answer to that.

 

 

Sources:

  1. [1] “The Data of Things: How Edge Analytics and IoT Go Hand in Hand,” September 2015.
  2. [2] Forbes article by Bernard Marr, “Will Analytics on the Edge Be the Future of Big Data?”, August 2016.
  3. http://www.forbes.com/sites/teradata/2016/07/01/is-your-data-lake-destined-to-be-useless
  4. http://www.kdnuggets.com/2016/09/evolution-iot-edge-analytics.html
  5. https://www.datanami.com/2015/09/22/the-data-of-things-how-edge-analytics-and-iot-go-hand-in-hand
  6. https://developer.ibm.com/iotplatform/2016/08/03/introducing-edge-analytics/
  7. http://www.forbes.com/sites/bernardmarr/2016/08/23/will-analytics-on-the-edge-be-the-future-of-big-data/#7eb654402b09
  8. http://www.ibm.com/internet-of-things/iot-news/announcements/ibm-cisco/
  9. https://www.experianplc.com/media/news/2015/new-experian-data-quality-research-shows-inaccurate-data-preventing-desired-customer-insight/


How to Load Data into Microsoft Azure SQL Data Warehouse using PolyBase & Talend ETL


Azure SQL Data Warehouse is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on a massively parallel processing (MPP) architecture, SQL Data Warehouse can handle any enterprise workload.

With the increasing focus on real-time business decisions, there has been a paradigm shift toward not only keeping data warehouse systems up to date but also reducing load times. The fastest and most optimal way to load data into SQL Data Warehouse is to use PolyBase to load data from Azure Blob storage. PolyBase uses SQL Data Warehouse’s massively parallel processing (MPP) design to load data in parallel from Azure Blob storage.

One of Talend’s key differentiators is its open source nature and the ability to leverage custom components, developed either in-house or by the open source community at Talend Exchange. Today our focus will be on one such custom component, tAzureSqlDWBulkExec, and how it enables Talend to use PolyBase to load data into SQL Data Warehouse.

For simplicity, we will key in on the following two scenarios:

  • Load data from any source into SQL DW
  • Load data into SQL DW while leveraging Azure HDInsight and Spark

Load data from any source into SQL DW

[Figure: Talend job loading data from any source into Azure Blob storage and then into SQL Data Warehouse]

In this scenario data can be ingested from one or more sources as part of a Talend job.  If needed, data will be transformed, cleansed and enriched using various processing and data quality connectors that Talend provides out of the box.  The output will need to conform to a delimited file format using tFileOutputDelimited.

The output file will then be loaded into Azure Blob Storage using tAzureStoragePut.  Once the file is loaded into blob, tAzureSqlDWBulkExec will be utilized to bulk load the data from the delimited file into a SQL Data Warehouse table.
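
To show what tAzureSqlDWBulkExec is doing conceptually, here is a hedged sketch of an equivalent hand-written PolyBase load executed over plain JDBC. The server, credential, table and column definitions are illustrative assumptions, and it presumes a database-scoped credential for the storage account already exists.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PolyBaseLoadSketch {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:sqlserver://mydwserver.database.windows.net:1433;database=mydw"; // placeholder
    try (Connection conn = DriverManager.getConnection(url, "loader", "secret");
         Statement stmt = conn.createStatement()) {

      // External data source pointing at the blob container that tAzureStoragePut wrote to.
      stmt.execute("CREATE EXTERNAL DATA SOURCE staging_blob WITH ("
          + " TYPE = HADOOP,"
          + " LOCATION = 'wasbs://staging@mystorageacct.blob.core.windows.net',"
          + " CREDENTIAL = blob_credential)");

      // Delimited text format matching the tFileOutputDelimited output.
      stmt.execute("CREATE EXTERNAL FILE FORMAT pipe_delimited WITH ("
          + " FORMAT_TYPE = DELIMITEDTEXT,"
          + " FORMAT_OPTIONS (FIELD_TERMINATOR = '|'))");

      // External table over the staged file(s); columns are illustrative.
      stmt.execute("CREATE EXTERNAL TABLE ext_orders"
          + " (order_id INT, amount DECIMAL(18,2), country NVARCHAR(10))"
          + " WITH (LOCATION = '/orders/', DATA_SOURCE = staging_blob, FILE_FORMAT = pipe_delimited)");

      // CTAS pulls the files in through PolyBase, loading the MPP table in parallel.
      stmt.execute("CREATE TABLE dbo.orders WITH (DISTRIBUTION = HASH(order_id))"
          + " AS SELECT * FROM ext_orders");
    }
  }
}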

Load data into SQL DW while leveraging Azure HDInsight and Spark

[Figure: Talend Big Data job running on an HDInsight Spark cluster and writing output to Azure Blob storage]

 

As data volumes have increased so has the need to process data faster.  Apache Spark, a fast and general processing engine compatible with Hadoop, has become the go-to big data processing framework for several data-driven enterprises.  Azure HDInsight is a fully-managed cloud Hadoop offering that provides optimized open source analytic clusters for Spark (Please refer to the following link, How to Utilize Talend with Microsoft HDInsight, for instructions on how to connect to an HDInsight cluster using Talend Studio).

Talend Big Data Platform (Enterprise version) provides graphical tools and wizards to generate native Spark code that combines in-memory analytics, machine learning and caching to deliver optimal performance and increased efficiency over hand-coding.  The generated Spark code can be run natively on an HDInsight cluster directly from Talend Studio.

In this scenario, a Talend Big Data job will be set up to leverage an HDInsight Spark cluster to ingest data from one or more sources, apply transformations and output the results to HDFS (Azure Blob storage).  The output file format of the Talend Big Data job can be any of the following formats supported by PolyBase:

  • Delimited Text – using tFileOutputDelimited
  • Hive ORC – using tHiveOutput
  • Parquet – using tHiveOutput / tFileOutputParquet

After the completion of the Spark job, a standard job will be executed that bulk loads the data from the Spark output file into a SQL Data Warehouse table using tAzureSqlDWBulkExec.
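For readers who prefer to see the moving parts, here is a minimal hand-written Spark (Java) sketch of the kind of step the generated job performs: read raw data, apply a transformation, and land Parquet files in Blob storage for PolyBase to pick up. The storage paths and the filter expression are assumptions for illustration; in practice Talend generates this code for you.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PrepareForSqlDw {
    public static void main(String[] args) {
        // Assumes the job runs on an HDInsight Spark cluster with access to the storage account.
        SparkSession spark = SparkSession.builder().appName("PrepareForSqlDw").getOrCreate();
        Dataset<Row> orders = spark.read()
                .option("header", "true")
                .csv("wasbs://input@yourstorageaccount.blob.core.windows.net/raw/orders/");
        // Stand-in for the real transformation/enrichment logic.
        Dataset<Row> cleaned = orders.filter("amount is not null");
        // Parquet output lands in Blob storage, ready to be bulk loaded via PolyBase.
        cleaned.write()
                .mode(SaveMode.Overwrite)
                .parquet("wasbs://staging@yourstorageaccount.blob.core.windows.net/orders/");
    }
}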

Performance Benchmark

tAzureSqlDWBulkExec utilizes native PolyBase capability and therefore fully extends the performance benefits of loading data into Azure SQL Data Warehouse.  In-house tests have shown this approach to provide a 10x throughput improvement versus standard JDBC.

 

 

The post How to Load Data into Microsoft Azure SQL Data Warehouse using PolyBase & Talend ETL appeared first on Talend Real-Time Open Source Data Integration Software.

Stripping Websites and Translating Text using Talend and Google Translate API


Disclaimer: While I work for AstraZeneca, all opinions expressed in this article are my own and do not necessarily reflect the position of my employer.

Recently I developed a proof of concept ETL job to strip product information from a website, translate it, and create a data set comprising the original text and the translated text. 

For my example, I chose to strip drug product names out of a French health authority website, and translate them to English.  This is not a detailed instructional article, just a high-level overview of the process of solving this problem in Talend using third party java libraries and calling on external translation services.

A sample URL to perform a search of this website is below; the zero value is an offset.  The search returns 20 products at a time in a fairly simple HTML table.  For this example we will do a keyword search of "Glucose", but we could choose the marketing authorisation holder or a number of other attributes.

This returns the first 20 products (offset of 0):

http://ansm.sante.fr/searchengine/search/(offset)/0?keyword=Glucose

This returns the next 20 products (offset now 20):

http://ansm.sante.fr/searchengine/search/(offset)/20?keyword=Glucose

This is a good fit for Talend, because you have the ability to drop in jar files to add additional functionality to the workflow.  For HTML parsing I am using the excellent jsoup Java library.  I added the jsoup jar file as a code routine into the Talend Routines library and used this in a tJavaFlex component to parse the products in the HTML table and create a flow for each product in the list.

For language translation, I am making a REST API call to Google Translate – using the Talend tRest component.  I needed to set up a Google Cloud account, with billing enabled to use the Google Translate API, but for this volume of translation, the costs will be in the pennies range.

The rest of the workflow is processed using out of the box Talend components.

  • tSetProxy component (x2) to set the corporate proxies for the job (one for HTTP and one for HTTPS, required to make the Google API calls).
  • tForEach component to fabricate a set of URLs using offset values of 0, 20, 40, 60, 80, etc. (see above URL).  This offset value is used to construct the URL for a tHttpRequest component.
  • tHttpRequest component to make the calls to the health authority website, using URLs fabricated with the different offset values sent from the tForEach component.  Each HTTP request writes the HTML response to a numbered file for later HTML parsing by jsoup.
  • tJavaFlex component, which is where the jsoup HTML parser is used to split the product list in the HTML into separate rows for passing down the flow.  Jsoup has a powerful jquery-type selector syntax to make it easy to target the HTML nodes you need to process.
  • tRest component to make a call to the Google Translate API using my application key.  We pass, for translation, the product name which jsoup parsed out of the HTML.
  • tExtractJSONFields component to extract the translated product name from the Google Translate API JSON response from the tRest call.
  • A few other components such as tLogRow and tFileOutputExcel to capture the output, which is effectively just 2 columns: the product name (in French) and the translated product name courtesy of Google Translate.

Download >> Talend Open Studio for Data Integration

The end result is scraping all these French product names from the website:

…and we end up with the French product name and an English translated product name in a nice spreadsheet format which we can work with.

And the job isn't that complicated to look at; in fact, there is some redundancy in here (a tJavaRow I probably don't actually need).

The first step is to harvest the HTML out into a set of files. We need to fabricate URLs with different offsets to get the first 20 products, then the next 20 products etc.  An offset of 180 is enough to retrieve ~200 product names as a test, so we need to fabricate these URL calls:

http://ansm.sante.fr/searchengine/search/(offset)/0?keyword=Glucose

http://ansm.sante.fr/searchengine/search/(offset)/20?keyword=Glucose 

http://ansm.sante.fr/searchengine/search/(offset)/180?keyword=Glucose

I used a tForEach component to loop generating the offsets 0,20,40 etc. 

Then the tHttpRequest uses this variable from tForEach to generate a URL with the offset, and writes the response to a file we can process later.

Note the tHttpRequest component uses context variables where the base URL and file locations are maintained:

You can see the result is a set of text files which contains the HTML responses for parsing.  The numeric prefix matches the tForEach variable, and _0.txt represents the URL response with a zero offset, _20.txt the URL response with a 20 offset etc.
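As a rough illustration of what the job is doing at this stage (outside of Talend), the standalone sketch below fabricates the same URLs and file names. The base URL pattern comes from the examples above, while the output folder and file-name prefix are assumptions.

public class UrlBuilderSketch {
    public static void main(String[] args) {
        // In the real job the base URL and output folder live in Talend context variables,
        // and the offset comes from the tForEach component; these literal values are assumptions.
        String baseUrl = "http://ansm.sante.fr/searchengine/search/(offset)/";
        String outputPrefix = "C:/temp/ansm";
        for (int offset = 0; offset <= 180; offset += 20) {   // what tForEach iterates over
            String url = baseUrl + offset + "?keyword=Glucose";
            String outputFile = outputPrefix + "_" + offset + ".txt";   // matches the _0.txt, _20.txt naming
            System.out.println(url + "  ->  " + outputFile);
        }
    }
}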

The bulk of the clever stuff is handled in tJavaFlex which uses the jsoup HTML parsing library to target the table cells which contain the product names:

An HTML text file created earlier in the workflow, as a result of tHttpRequest, is opened and then a jquery type selector is used to home in on the table cell required. 

Elements products = doc.select("td.first > a > strong");

This selector basically states “go and return – in a collection – all the HTML nodes, which are <strong> tags, which are descendants of an <a> tag which are descendants of a table cell <td> tag with the CSS class called ‘first’…”.

We can see from the HTML of the website how the product names are structured, and why this selector works.  Each page presents 20 results, so the tJavaFlex generates 20 rows of data for the flow, one for each product.
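Outside of Talend, the equivalent jsoup logic is only a few lines. This is a hedged, standalone sketch; the file path is illustrative and the selector is the one shown above.

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ProductNameScraper {
    public static void main(String[] args) throws Exception {
        // Parse one of the HTML response files written earlier by tHttpRequest (path is illustrative).
        Document doc = Jsoup.parse(new File("C:/temp/ansm_0.txt"), "UTF-8");
        // The same jquery-style selector used inside the tJavaFlex component.
        Elements products = doc.select("td.first > a > strong");
        for (Element product : products) {
            System.out.println(product.text());   // one French product name per row
        }
    }
}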

We then make a call to the Google Translate API using tRest (note: this is not my real API key in the screenshot).  We specify a source language (French) and a target language (English), and we pass in the product name.

Note that I am reading this product name from the globalMap, as I had to turn the flow into an iterate – so I could feed the product name into the tRest component.  I am also URL encoding the product name, as this caused me problems calling the Google Translate API.

It is then a case of reading the response from Google Translate and parsing out the JSON field from the response.  For this, we use a tExtractJSONFields component and an XPath query.  A link to the Google Translate API documentation is provided here, followed by the configuration of the JSON Talend component.

You can see how this works by looking at the JSON response sample from the Google Translate documentation:
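The response sample itself isn't reproduced here, but as a point of reference, a minimal hand-rolled call to the v2 REST endpoint looks roughly like the sketch below. The API key and product name are placeholders, and the response-shape comment reflects the public v2 documentation rather than anything specific to this Talend job.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class TranslateSketch {
    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY";                                    // placeholder: your own Google Cloud key
        String productName = "GLUCOSE 5 %, solution pour perfusion";       // illustrative French product name
        String query = "https://translation.googleapis.com/language/translate/v2"
                + "?key=" + apiKey
                + "&source=fr&target=en"
                + "&q=" + URLEncoder.encode(productName, "UTF-8");         // same URL-encoding step as in the job
        HttpURLConnection con = (HttpURLConnection) new URL(query).openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            // Expected v2 response shape: {"data":{"translations":[{"translatedText":"..."}]}}
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}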

And that is, at a very high level, how Talend enables you to parse HTML, translate it from one language to another – and produce a nicely formatted data set…despite not supporting either natively.

Download >> Talend Open Studio for Data Integration

The post Stripping Websites and Translating Text using Talend and Google Translate API appeared first on Talend Real-Time Open Source Data Integration Software.

When It Comes to Big Data and Cloud, Continuous Innovation is the Model


During the 2008 election campaign, Barack Obama denounced his opponents for advocating "change" as an electoral argument while belonging to the government party that "had changed nothing." He approached the situation with humor, employing an expression well-known across the Atlantic: "You can put lipstick on a pig, but it's still a pig."

This expression could also be applied to some software companies. With the success of the Cloud and Software as a Service models on the one hand and the increase of subscription services on the other, it seems that software offered in the form of a subscription is becoming the new standard. In fact, as early as 2015, Gartner estimated that by 2020, 80% of vendors would adopt a subscription model.

This change in the way companies use software has frequently been said to reflect users’ demand for flexibility. Indeed, companies are no longer willing to lay out a major investment to get equipped. They are looking to prioritize variability in their spending based on usage and to ensure they benefit from the value of the software before making a long-term commitment.

But while business models are moving toward subscription services, if technical developments don't keep pace, a piece of the puzzle will be missing, and it's a crucial piece.

Subscription Services: Beyond Cost, Bringing Innovation as Close to the Market as Possible 

Let's forget for a moment how the software is billed and take a look at another essential aspect: the value that it offers the user. This is where the real challenge is: the ability to provide users with frequent releases (to be as close as possible to current technological innovation and customer demands) becomes a means of differentiating between two commercial subscription offers.

A perpetual license model also allows for software updates. However, the rhythm of updates and the frequency at which they are available to users cannot be compared with the ongoing agility and innovation offered by providers of subscription services. This is not related to how their software is marketed, but to the vendor's intrinsic organization and its ability to establish a continuous cycle of innovation and to transfer this innovation to its customers.

Big Data and Cloud: Continuous Innovation is the Model

The world of Big Data and Cloud is an obvious illustration of the need for continuous innovation. The speed at which these technologies become obsolete requires users to adapt at an unprecedented rate. The platforms adopted by customers today can become obsolete in just a few months (Spark replaced MapReduce in record time; Spark 2.0 is a revolution compared to Spark 1.6). It is essential for the integration, processing, and operations software vendors responsible for these massive volumes of data to get as close to the market as possible. That means complying with key standards such as Hadoop, Spark, and Beam, and aligning themselves with the open source communities defining them. In practical terms, a company needs to anticipate the product roadmaps of these innovative technologies in order to adapt that of its own products. This makes upstream preparation possible for integrating new capabilities, which will benefit end users as soon as the platforms put them into production.

While open source—based on technological openness, community and collaboration between various partners—is particularly well suited to this model, it’s not a secret that the new generation of vendors has developed a specific organization designed to offer continuous updates. And subscription services are a way to finance a policy of continuous innovation. In the former model, with a so-called “proprietary” software solution and a perpetual license format, it took 18 to 24 months to benefit from new features. At the rhythm of machine learning advances, IoT, and real-time and streaming data analysis capabilities, the model that consists of delivering new versions every 18 months is simply not viable for the user.

Supporting the Emergence of New Data Uses

Modern solutions for big data and cloud integration must be at the front lines of technology innovation, not only to address customers' various and rapidly evolving challenges, including innovation, sustainability, agility, and economies of scale, but in particular to encourage the emergence of new data uses—streaming, real-time and self-service—themselves a vehicle for competitive advantage.

The speed at which these platforms become obsolete is palpable. In the past, a technological feature could last for years without risk of becoming obsolete (e.g. SQL). Today, technologies become outdated far more quickly. Competition is fierce between companies using digital transformation as a strategic lever for performance and competitiveness. The result: users of these technologies need the ability to easily adapt from one standard to another. That's why it's so vital to select a vendor that is in line with the times so that you may continuously benefit from market innovation, and of course, easily recognize when it's just a pig in makeup.

The post When It Comes to Big Data and Cloud, Continuous Innovation is the Model appeared first on Talend Real-Time Open Source Data Integration Software.

What’s Blockchain and Can It Help You Trust Your Data?


It first appeared in 2008 with the Bitcoin currency, and this year Blockchain technology reached the summit of Gartner's "Hype Cycle." While many economists and policy actors have expressed their interest in using the technology (e.g. the governments of Honduras, Ghana and Georgia wish to secure their land titles in a Blockchain and, in the private sector, several financial institutions have begun to experiment), concrete, real-world applications are still not commonplace, even though Gartner estimates the market associated with this technology will reach 10 billion dollars by 2022.

Uberizing Uber

However, Blockchain's potential for disruption is unprecedented. By negating the need for a trusted third party to set up a direct relationship between two groups, by ensuring the security of this relationship and by generating an unfalsifiable history (thanks to its distributed character), Blockchain may well contribute to uberizing Uber! A system based on "smart contracts" would, in fact, make it possible to place drivers and customers in direct contact, while securing payment. The collaborative economy, currently dominated by intermediaries, i.e. Blablacar, AirBnB, Drivy, and Uber, would enter into a second phase of disintermediation.

For certain observers, it's only a matter of time. While it took 30 years to go from the first e-mail to the advent of online banking, Big Data needed only ten years to appear at the top of marketing, logistics, and even HR priorities. Digital transformation now draws on Artificial Intelligence and predictive analysis that, only a short while ago, were Hollywood clichés but are now very real and in fact at the heart of the fight against terrorism.

Ensuring traceability of data

One of the primary concerns associated with Big Data resides in its governance. What data do we use? Where do we store it? How can we ensure usage is compliant with the regulations? Who updates it? And so on. The failure of initial projects is often explained by the eternal silos slowing down business agility. But recently, the appearance of "data lakes" has helped to break down those silos, finally giving business users, or even partners and customers, access to the data that is relevant to them in real time.

In the same way, the appearance of Blockchain could make it possible to secure some processes in a Big Data approach (e.g. the authentication and traceability of data). The prospects are endless: in the field of health, first and foremost, where confidentiality issues are tied to personal data, but also in the financial sector, where disintermediation is already in progress yet is still coming up against security and regulation issues. Another prospect is the insurance sector, where Blockchain is giving new momentum to the first peer-to-peer models and establishing the foundations of the automated insurance contract.

Very small businesses and SMBs are also concerned: a library can administer its book loans and subscription fees; a startup can manage its financing, etc.  With the emergence of Smart Cities, consumers can even manage the use and distribution of the electricity produced by their solar panels.

Generating trust

Beyond the business world, society itself can take advantage of this technology (e.g. to secure online voting and set up a framework of trust which would make it possible to multiply direct consultations with citizens, such as in the case of a referendum), thus reinforcing participatory democracy.

Trust is the key word here. According to IDC, around 30% of decision-makers decline to use their company's data due to a lack of trust and governance. With Blockchain, a trust catalyst, the use of data could be considerably amplified. Like artificial intelligence, industry 4.0 and the collaborative economy, it may, as we have seen, bring about major changes, both business and social. And it is organizations that must contribute to it, through their experiments and by discovering new uses that will forge the society of tomorrow.

The post What’s Blockchain and Can It Help You Trust Your Data? appeared first on Talend Real-Time Open Source Data Integration Software.

Using Talend to Gather Data About Data


This article was developed using the free, open source version of Talend Open Studio for Data Integration which is available here.

What's more exciting than data? Data about data!

Recently I had to assess the impact of data model changes within a transactional system feeding our data warehouse.  I directed our SQL scripts and ETL processes to the pre-production environment, ran the ETL jobs, and did some basic regression analysis.  For every table and column, I wanted to know whether there were any significant changes that warranted further investigation:

  • Number of rows
  • Number of nulls per column
  • Min and max length of each column
  • Distinct values per column

The idea was that any large shifts in data volume or null values could indicate something worth investigating.

I could use Oracle dictionary tables like “ALL_TAB_COL_STATISTICS” to get some of this information, but wanted more control over the final output.  The end result needed to be a data set suitable for analysis in Tibco Spotfire.

I built a simple table within the schema to gather some information for each table/column, and then created a simple Talend job.  The job gathers stats using dynamic queries, retrieving table and column names from the Oracle system table, “ALL_TAB_COLUMNS”.  The stats table created to store the final output is below.

dataaboutdata1

I ran the job before and after the ETL change, and then used Spotfire to look for outliers in the metrics.  Below, for example, there is a big difference in the “number of nulls” metric for the “STATE_PROVINCE” column when comparing “before” vs “after”.

This is a manufactured example, using Oracle Express HR database.  The Talend Job was straightforward, the most complicated part is the creation of the dynamic SQL statements.

dataaboutdata2

Step 1: Create the Oracle connection

dataaboutdata3

Step 2: Iterate over the Oracle tables

Use a “tOracleTableList” component to iterate through the tables. I am excluding the STATS table itself (DB_STATS) using a WHERE clause in the component.

This component iterates over the table names.  During each iteration I can grab the table name from the globalMap.  The table name is used to build the dynamic SQL statements:

dataaboutdata4

Step 3: Get the columns names for the current table iteration

Now that we have the table name, we can create a query against the "ALL_TAB_COLUMNS" system table in Oracle.  The query simply returns the column names for the table.  We pull the table name out of the globalMap, courtesy of the "tOracleTableList".

It is not obvious from the screenshot, but there are single quotes and double quotes next to each other. This is a theme for most of the dynamic SQL statements in this article.

dataaboutdata5
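As an aside, the query typed into a tOracleInput component is just a Java string expression, which is where the single-quote/double-quote pairing comes from. A hypothetical version of this step's query is shown below; the globalMap key assumes the list component is named tOracleTableList_1, so adjust it to match your own job.

// Hypothetical content of the tOracleInput "Query" field for this step.
"SELECT COLUMN_NAME, DATA_TYPE, DATA_LENGTH FROM ALL_TAB_COLUMNS WHERE TABLE_NAME = '"
    + ((String) globalMap.get("tOracleTableList_1_CURRENT_TABLE")) + "'"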

Step 4: Iterate over the column names

Later we will need both the table name AND a column name to create SQL statements, so I added a “tFlowToIterate” component.  This will let me iterate over the column names. The column name for each iteration will get added to the globalMap as a variable.

At any point I can now pull out both the table name and column name from the globalMap. Note I also have access to other fields I am not using in this example, such as “DATA_TYPE” and “DATA_LENGTH” which are available in the “ALL_TAB_COLUMNS” table.

dataaboutdata6

Step 5: Execute the SQL needed to gather the metrics

This may need some explaining.  If, for a particular iteration, the table name is “DEPARTMENTS” and the column name is “DEPARTMENT_NAME”, to gather some basic stats and produce one row of output I need to create a query like this:

dataaboutdata7

We are selecting from the special Oracle table “DUAL” which is a dummy table with one row and one column, which is great for when you need to do this kind of query to produce a single row of output.  You can see above, we have generated some basic stats for the table and column. 

Note:  There is some redundancy here, as for each column I am calculating the total number of rows in the table, and really should do this once per table.

To create this SQL, we have to generate a lot of the values dynamically, by extracting the table and column name from the globalMap variables created earlier in the job.

The next component to add is a “tOracleInput” component.  Here we construct and execute the dynamic SQL above, for each table and column.  Again, we do get into a little bit of “single quote/double quote” hell, but it isn’t too bad.

I am also adding a “Before” or “After” value from the Talend context to indicate if this is a view of the stats before or after we implemented the data model changes.

It isn’t as bad as it looks, we are basically trying to recreate the SQL above by pulling table and column names out of the globalMap:

dataaboutdata8
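For clarity, here is a hedged reconstruction of how that dynamic SQL might be assembled. The globalMap keys and the context.run_label variable are assumptions that depend on how your components and context variables are named; the metrics mirror the ones listed at the start of this article.

// Hypothetical sketch of the dynamic stats SQL built inside the tOracleInput component.
String tableName  = (String) globalMap.get("tOracleTableList_1_CURRENT_TABLE");
String columnName = (String) globalMap.get("row2.COLUMN_NAME");   // set by the tFlowToIterate step
String runLabel   = context.run_label;                            // "Before" or "After"

String query =
      "SELECT '" + runLabel + "' AS RUN_LABEL, "
    + "'" + tableName + "' AS TABLE_NAME, "
    + "'" + columnName + "' AS COLUMN_NAME, "
    + "(SELECT COUNT(*) FROM " + tableName + ") AS NUM_ROWS, "
    + "(SELECT COUNT(*) FROM " + tableName + " WHERE " + columnName + " IS NULL) AS NUM_NULLS, "
    + "(SELECT MIN(LENGTH(" + columnName + ")) FROM " + tableName + ") AS MIN_LENGTH, "
    + "(SELECT MAX(LENGTH(" + columnName + ")) FROM " + tableName + ") AS MAX_LENGTH, "
    + "(SELECT COUNT(DISTINCT " + columnName + ") FROM " + tableName + ") AS DISTINCT_VALUES "
    + "FROM DUAL";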

The schema is straightforward:

dataaboutdata9

Step 6: Congratulate Ourselves 😉

You are now querying a database (Oracle system catalog table), to get data about the database (table and column names), so you can create dynamic SQL on the fly – to get more data about the database (metrics)…

Step 7: Insert the record into the stats table

This is a simple insert of the stats record we just created through dynamic SQL, using a “tOracleOutput” component to write the results into the DB_STATS table:

dataaboutdata10

Step 8: Commit (or rollback)

If the subjob completes without error – we perform an Oracle commit. This is necessary as I didn’t choose auto-commit in my tOracleConnection in step 1:

dataaboutdata11

If you have hundreds of tables, a trillion rows, and lots of columns, then think carefully about pulling the trigger. It did take me a few hours to execute the job against approximately 150 tables (200 million rows across all), but when I put the data into Spotfire it gave me a view of my data which was incredibly useful, and it did pick up some changes I would have missed otherwise.

The changes in the transactional system had in fact included logic changes to several key PL/SQL procedures used during our ETL extractions.

Download >> Talend Open Studio for Data Integration

Disclaimer: All opinions expressed in this article are my own and do not necessarily reflect the position of my employer. 

The post Using Talend to Gather Data About Data appeared first on Talend Real-Time Open Source Data Integration Software.


[VIDEO] Modern Data Management Needs a Governed, Self-Service Approach


IT Pros, have you felt left out of technical projects lately? Yes? You are not the only one. According to the Qlik-CXP 2016 Barometer, 34 percent of IT and BI executives are no longer involved in data-related projects.  Discover our video that illustrates in real life how governed self-service helps IT take more initiative.

It’s generally understood that data is at the heart of the digital transformation. However, much of today’s data lies unused. In fact, IDC estimates that less than 1 percent of the data produced in the world is analyzed. And business users have a lot of progress to make in terms of utilization: according to McKinsey, only 7 percent of employees’  digital potential is currently achieved. According to Blue Hill Research, up to 80 percent of a data analysts’ time is spent preparing data, leaving only 20 percent of time for analysis, which is where they add the most value.

IT, excluded from digital transformation?

In the face of massive expectations, IT does not always provide the right answer despite its traditional dominance in data management: a small number of data specialists design the data models, define access policies, oversee data management, quality and data protection, and monitor data usage.

This centralized governance is not adapted to the new situation: data management is the business of several stakeholders rather than that of a single specialist. In 2016, the Experian Group highlighted the new data experts now involved, such as data analysts (42%), the Chief Data Officer (22%), the Chief Financial Officer (22%) and the Chief Marketing Officer (14%). The same study notes that a CDO is involved in 42% of cases.

These new roles need data and want "everything right away". They want the ability to be autonomous and to take control of data with intuitive tools to collect, optimize, analyze and integrate it in order to extract value. They no longer want to be passive in front of an IT department that feeds them predefined reports. It sounds like self-service, right?

IT, at the heart of digital transformation

But these business users also want to be protected, guided and boosted. For example, an HR user will work more freely on sensitive SAP data that he knows well if IT has given him access only to the data to which he is entitled: time records, but not salaries.

Alternatively, a marketing user will launch his campaign serenely if IT ensures that his campaign’s Salesforce lists of leads and contacts respect the GDPR privacy regulations because they have all given their explicit prior consent.

Alternatively, a department's employees will share data files more confidently if IT validates them, anonymizes confidential information and automates their injection into NetSuite without human manipulation—but all of this creates more work for IT.

It sounds like governance, right?

Governed self-service, a must-have for IT departments

Indeed, a new approach to data governance is emerging: governed self-service, wherein IT frames, protects and transforms the freedom of business users by multiplying their power over the data.

Business users immediately access the reliable data sources they need, then clean, transform, enrich and share them with easy-to-use tools. IT provides self-service access and clean-up of data without compromising compliance. It propagates a single version of the truth across the business, encouraging the exploitation of business expertise where it resides and extending it to the entire enterprise.

A new IT-business collaboration is established on the basis of governed self-service. It avoids the eternal "ping-pong" over datasets between data analysts and IT, and allows IT to take the initiative in its organization and sit in the driver's seat of the digital transformation.

Discover our video that illustrates governed, self-service in real-life situations.

The post [VIDEO] Modern Data Management Needs a Governed, Self-Service Approach appeared first on Talend Real-Time Open Source Data Integration Software.

A First for Apache Beam


At Talend, we like to be first. Back in 2014, we made a bet on Apache Spark for our Talend Data Fabric platform, a bet which paid off beyond our expectations. Since then, most of our competitors have tried to catch up…

Last year we announced that we were joining efforts with Google, Paypal, DataTorrent, dataArtisans and Cloudera to work on Apache Beam, which has since become an Apache Top-Level Project.

On January 23, 2017, we released Winter '17, our latest integration platform, which included Talend Data Preparation on Big Data. In this blog, I'd like to drill a little deeper into the technology and architecture behind it, as well as how we are leveraging Apache Beam for scale and runtime agility.

1) Architecture

   a) Overview

Figure 1 below represents a high-level architecture of Talend Data Preparation Big Data with both the application layer and the backend server side.

You'll notice the Beam JobServer part and more specifically the Beam Compiler (which allows the generation of an Apache Beam pipeline out of the JSON document), as well as the Beam runners, where we specify the set of properties for the target Apache Beam runner (Spark, Flink, Apex or Google Dataflow).

Note that in our Winter '17 version, the only Apache Beam runner we support for the full run is Spark.

 

Figure 1. Talend Data Preparation with Apache Beam runtime

 

b) Workflow

Figure 2. From preparation DSL to Apache Beam pipeline

 

The Beam Compiler is invoked to transform the DSL into an optimized Beam Pipeline where the source, sink, and various actions are defined.

2) Details: What Gets Generated

       a) Preparation DSL: JSON Document

Figure 3. Build your Data Preparation Recipe

As you apply your cleaning and enrichment steps, Talend Data Preparation generates a recipe which then gets transformed into a JSON document.

In the JSON example below, the input is a .csv file stored in HDFS. The file contains only two string columns, and we applied the "uppercase" function to the first column:

{
  "input": {
    "dataset": {
      "format": "CSV",
      "path": "/tmp/input",
      "fieldDelimiter": ";",
      "recordDelimiter": "\n",
      "type": "HdfsDataset",
      "@definitionName": "HdfsDataset"
    },
    "datastore": {
      "@definitionName": "HdfsDatastore",
      "username": "testuser",
      "type": "HdfsDatastore"
    },
    "properties": {
      "type": "HdfsInput"
    }
  },
  "preparation": {
    "name": "sample_prep",
    "rowMetadata": {
      "columns": [
        {
          "id": "0000",
          "name": "a1",
          "type": "string"
        },
        {
          "id": "0001",
          "name": "a2",
          "type": "string"
        }
      ]
    },
    "actions": [
      {
        "action": "uppercase",
        "parameters": {
          "column_id": "0000",
          "scope": "column",
          "column_name": "a1"
        }
      }
    ]
  },
  "output": {
    "dataset": {
      "format": "CSV",
      "path": "/tmp/output",
      "fieldDelimiter": ";",
      "recordDelimiter": "\n",
      "@definitionName": "HdfsDataset",
      "type": "HdfsDataset"
    },
    "datastore": {
      "@definitionName": "HdfsDatastore",
      "username": "testuser",
      "type": "HdfsDatastore"
    },
    "properties": {
      "type": "HdfsOutput"
    }
  },
  "authentication": {
    "principal": "USER@REALM.COM",
    "realm": "REALM.COM",
    "useKeytab": true,
    "keytabPath": "/keytabs/mykeytab.keytab",
    "kinitPassword": "nothing"
  }
}
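To make the DSL concrete, here is a hedged, hand-written Beam pipeline that does roughly what this document describes: read the delimited file, uppercase column 0000, and write the result back out. It is written against the current Beam Java SDK with the Spark runner option for illustration only; it is not the code the Beam Compiler actually generates, and the paths are simply the sample values from the JSON above.

import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class UppercasePrepPipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        options.setRunner(SparkRunner.class);   // Spark is the only runner supported for a full run in Winter '17
        Pipeline p = Pipeline.create(options);
        p.apply("ReadInput", TextIO.read().from("hdfs:///tmp/input"))
         .apply("UppercaseFirstColumn", ParDo.of(new DoFn<String, String>() {
             @ProcessElement
             public void processElement(ProcessContext c) {
                 String[] fields = c.element().split(";", -1);
                 if (fields.length > 0) {
                     fields[0] = fields[0].toUpperCase();   // the "uppercase" action on column 0000
                 }
                 c.output(String.join(";", fields));
             }
         }))
         .apply("WriteOutput", TextIO.write().to("hdfs:///tmp/output"));
        p.run().waitUntilFinish();
    }
}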

b) The Beam Compiler

Below is a snapshot of the code that creates the Apache Beam pipeline based on the various Talend components:

public class RuntimeFlowBeamCompiler {

    public Pipeline compile(RuntimeFlowBeamCompilerContext bcc) {
        RuntimeFlow runtimeFlow = bcc.getRuntimeFlow();
        // Start to create Beam pipeline from RuntimeFlow
        PipelineSpec<RuntimeComponent, RuntimeLink, RuntimePort> pipelineSpec =
                bcc.getPipelineSpec();
        ...
        // Create Beam pipeline to build the job into.
        Pipeline pipeline = Pipeline.create(bcc.getBeamPipelineOptions());
        ...
        RuntimeFlowBeamJobContext ctx = new RuntimeFlowBeamJobContext(pipelineSpec, pipeline, ...);
        // Compile the components in topological order.
        Iterator<RuntimeComponent> components = pipelineSpec.topologicalSort().toIterator();
        components.forEachRemaining(component -> compileComponent(ctx, component));
        // Return the resulting Beam pipeline
        return ctx.getPipeline();
    }
}

 c) The Beam Jobserver

Below is a snapshot of the code that validates and runs the actual Apache Beam pipeline:

public class BeamJobController {

    private RuntimeFlowBeamCompilerContext pipelineSpecContext = null;

    public JobValidation validate(Config config) {
        Config dslConfig = config.getConfig("job");
        String dsl = dslConfig.root().render(renderOpts);
        // Create the compiler context
        pipelineSpecContext = new RuntimeFlowBeamCompilerContext(dsl);
        // Validate before execution
        try {
            pipelineSpecContext.validate();
            return JobValid;
        } catch (...) {
            return SparkJobInvalid(e.getCause().toString());
        }
    }

    public void runJob(Config config) {
        RuntimeFlowBeamCompiler bc = ...
        // pre compilation
        RuntimeFlowBeamCompilerContext optimizedPipelineSpecContext =
                bc.preCompile(pipelineSpecContext);
        // compilation
        Pipeline compiledBeamPipeline = bc.compile(optimizedPipelineSpecContext);
        // post compilation
        Pipeline optimizedCompiledBeamPipeline = bc.postCompile(
                optimizedPipelineSpecContext, compiledBeamPipeline);
        // run Beam Pipeline
        try {
            optimizedCompiledBeamPipeline.run().waitUntilFinish();
        } catch {
            ...
        }
    }
}

Talend Data Preparation is the first Talend Big Data application that leverages the portability and richness of Apache Beam. As we move forward, Apache Beam's footprint will continue to grow as part of Talend's technology strategy, and the backend presented in this blog will be reused by other applications in both batch and streaming contexts, where the essence of Apache Beam and its runners will be used to their full extent. Stay tuned for more information!

 

The post A First for Apache Beam appeared first on Talend Real-Time Open Source Data Integration Software.

How to Use Click Stream Analysis to Optimize your Company’s Social Outreach


In this blog, I’ll be discussing how I expanded the recommendation demo provided in Talend’s Big Data Sandbox to influence my promotional Twitter campaign.

Enterprises are now taking data-oriented approaches when defining their social strategy as they find new and interesting influencers around their business. It is critical to implement plans that utilize this information to optimize social cadence, enabling companies to stay top of mind without blowing their marketing budget. A great example of this is the work that was done over at Molson Coors Brewing Co. By correlating their brand outreach to specific weather conditions, they were able to increase the visibility of their posts by 93% while reducing the cost-per-click by 67% compared to their generic ads.

With these analytical approaches taking hold within some of our customers, I wanted to use Talend to accomplish something similar.

As mentioned, I did a bit of hacking on the recommendation demo that is part of the Talend Sandbox to complete the job below. For those of you who haven’t tried out our big data sandbox, you can find it here.  In the meantime, here’s a little background on the use case.  In the example, Talend is collecting clickstream information in real time from a cycling e-commerce store. This information is then routed through a recommendation engine that scores the likelihood of the visitor purchasing an item. Results are then displayed on a web page for analysis.   

My addition identifies the product grouping (frames, components, misc) that is driving the most interest on the site and highlights it on social media.  Specifically, it creates a tweet that includes a promotional code that could be used to convert visitors from viewers to purchasers. A screenshot of my mapping is below.

The job collects the scored views from the Cassandra table that the previous steps in the demo populate. It filters out lower level scores and then joins the relevant items with their product group. I then average the groups to determine which group had the highest chance of being purchased, sort the information and then find the most relevant group. Finally, I join the main flow with a file containing the preformatted tweets I want to send out.

If I wanted to take the example further, like the Molson Coors use case mentioned above, I could add weather data, locational-based events or additional social feeds to enrich the decision and message of my tweet.

All in all, pretty simple with the most time-consuming part being the Twitter app setup.

Here’s a quick walkthrough on how to do that and set up the tTwitterOutput Component.

Whether you’re sourcing or targeting Twitter, the first thing you need to do is go to https://apps.twitter.com and create an app.

Fill out the application details and go into the application settings to find your keys; you'll need them to set up the tTwitterOutput component.

Click on manage keys and access tokens which will get you the consumer secret key as seen below.

Drop down to the Access token section and generate your tokens.

Once you have the keys, just fill them into the corresponding fields of the tTwitterOutput Component.
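Under the hood, a custom component like this presumably wraps a standard Twitter client. As a point of reference only, posting a status with the Twitter4J library using those same four credentials looks roughly like the sketch below; the keys and tweet text are placeholders.

import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class PromoTweetSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders: paste in the consumer key/secret and access token/secret from your Twitter app.
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("CONSUMER_KEY")
                .setOAuthConsumerSecret("CONSUMER_SECRET")
                .setOAuthAccessToken("ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();
        // The preformatted tweet joined into the flow, e.g. for the winning product group.
        Status status = twitter.updateStatus("Frames are flying off the shelves! Use code FRAME10 for 10% off this week.");
        System.out.println("Posted tweet id " + status.getId());
    }
}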

With that, you’re off to the races creating tweets using Talend.

I hope this gives you an idea or two on things you could do with Talend as well as our big data sandbox. If you make any modifications and want to share what you did, we’d love to hear about it.  Feel free to tweet @Nick_Piette and @Talend about your story. I’m sure there is some swag somewhere around here that I can send your way. 🙂

The post How to Use Click Stream Analysis to Optimize your Company’s Social Outreach appeared first on Talend Real-Time Open Source Data Integration Software.

Unlocking Data Preparation for Business Intelligence (BI)


We live in a world surrounded by data. From our daily grocery shopping, to our mobile phone usage, fitness regime tracker, bank accounts, social media etc., practically everything we do is either driven by or a contributor to data volumes. In this blog I would like to reiterate the importance of data and data preparation in the rapidly growing and demanding data warehousing world. This is only a single use case in a wide variety of applications for data preparation tools in today’s business environments. In later blogs, I’ll also cover some best practices and various potential use cases of Talend Data Preparation tool.

Journey of Data

Let's start with the journey of data. Data has evolved significantly in the last decade. It has grown in size, content, value and state. Today data comes in a variety of shapes, sizes and volumes. It may range from a small sample set to millions, billions or even trillions of pieces of data in varied forms such as text, voice, video, tapes, etc.

For many years, the data warehouse was believed to be static because data didn’t change that often. However, in today’s world data warehouses are either real-time or near real-time, dealing with rapidly changing data. Today businesses are becoming data driven and they are investing heavily in data preparation either with self-driven tools or with their data warehouse.

Importance of Data

Data, at its core, is basically the raw details of transactions/events/statistics/recordings collected for a reason, primarily business improvement, competition or product feedback. This raw data is not necessarily transparent, but it is very important as it provides the foundation for reporting the business metrics and trends used to make crucial decisions or run operations. Having the right data is important for an organization to have insight into the criteria needed to ensure optimal business performance, uncover areas of improvement and drive other key aspects of the business. For example, for an organization like Talend, it is important to measure the number of active clients, expiring licenses, revenue and upsell/downsell from each client, etc. Having accurate data about the health of your business is important in order to make informed decisions and ultimately keep ahead in today's data-driven competitive landscape.

From Data To Insight

Now that we know the importance of data, let’s look at how to convert the raw data into meaningful insights. Data that is in a very raw form is not going to be actionable for a business.  Usually raw data is not in a readable format, has missing values and might have errors or invalid information, etc. Hence it becomes extremely important to put raw data into a consumable format.

Preparing Data for Insights

Data preparation is a process where the raw data undergoes multiple phases. It needs to be assembled/integrated (if coming from multiple sources), cleansed, formatted, organized, completed and checked for accuracy and consistency so it can be analyzed using business intelligence or business analytics programs and be a valid input to the decision support system. The data preparation process also focuses on business users' requirements, improving data quality and completeness, and transforming data into a format that meets their needs.

Let's look at an example use case to get a better understanding. In the diagram given below, the raw data has details pertaining to two movie theaters. It has details like which movie was playing and how many customers bought tickets.

At first glance, this data set doesn't give us any meaningful information. It is not consistent, has typos and is incomplete.  However, once it is prepared it gives us clean data which we can use to determine business performance. Business Intelligence (BI) professionals or business users could take this clean data and derive meaningful information from it, such as which theater has the most customers or which movie had the most ticket sales.  When such analytical data is given to the business, it helps them decide whether they might want to stop playing Movie 1, play Movie 2 for another week, open another screen at theater 2, etc.

Data Preparation for Business Intelligence (BI)

Now that we understand that data preparation is very important for organizations making decisions utilizing data, let’s have a look at the various techniques available for data preparations in the BI world.

  • Manual Data Preparation: Performing manual data preparation using Excel or a similar tool would be too time consuming, error prone and mostly would not work for repeated tasks. This might work perfectly for small data sets, however it wouldn't be appropriate for dealing with large and complex data like video, for instance. Typically, in such scenarios data preparation and analysis would be done by the same person or team, thereby spending more time on preparation and less time on the actual analysis. Eventually manual data preparation turns out to be high in cost with no reusability features and ultimately ends up creating silos.
  • Build a large data warehousing BI team: This team would build a time consuming, sometimes expensive data warehouse. Typically, the team would follow the systems development life cycle (SDLC). End users have to be very thoughtful while giving requirements to the BI team, as any changes in the requirements might affect the outcome. Typically, this approach is inflated and iterative in nature because of support, maintenance and sometimes changing requirements. However, this method would ensure that the data analytics team gets the right input to act on.
  • Use a self-sufficient, governed self-service data preparation tool that anyone can use, like Talend Data Preparation: Using such tools is fast, avoids manual errors and gives the analytics team one roof under which to prepare data in a collaborative and controlled way.

All three methods listed above are widely used; however, the choice of data preparation method depends solely on individual or organizational needs and data availability.

Talend – Data preparation

Talend Data Preparation allows business users and IT to do more with data in less time. Just a few of the activities the Talend Data Preparation tool supports are:

  • Data Discovery
  • Data Cleansing
  • Data Visualization


The post Unlocking Data Preparation for Business Intelligence (BI) appeared first on Talend Real-Time Open Source Data Integration Software.

Data Matching 101: What Tools Does Talend Have?


This blog is the second part of a three-part series looking at Data Matching. In the first part, we looked at the theory behind data matching. In this second part, we will look at the tools Talend provides in its suite to enable you to do Data Matching, and how the theory is put into practice.

If you remember, we discussed how you match by first blocking data into similar groups, things that are unlikely to change, and then match within those groups using various matching parameters. This is basically what the Data Matching Components in the Talend suite do.

To start, we will look at the tools available in the Talend Data Quality (TDQ) toolset. TDQ provides components grouped into various sections; the one of interest to us is the Matching section. We will look at the major components individually. In using the components, we are again doing just what we described in the first blog in this series: we are choosing features that are unlikely to change (blocking) and then matching within those features (blocks).

tFirstnameMatch is a component that checks first names against an index file embedded in the component itself. This component searches through first names in the index file according to the input gender and input country you specify in the component settings. The index file has reference first names for about 162 countries, and some of the countries listed in the index have a huge number of reference first names. It also contains a 'Fuzzy' search option, which allows approximate matches.

tFuzzyJoin joins two tables by doing a 'fuzzy' match on several columns. It compares columns from the main flow with reference columns from the lookup flow and outputs the main flow data and/or the rejected data. This component allows you to define a matching type column and select from a list the method to be used to check the incoming data against the reference data. The types of matches available are Exact Match, Metaphone, Double Metaphone and Levenshtein. Exact Match is self-explanatory; Metaphone and Double Metaphone are based on a phonetic algorithm for indexing entries by their pronunciation. Levenshtein is more involved: it measures the distance between two words as the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. You can set minimum and maximum distances, where the distance is the number of character changes needed for the entry to fully match the reference. For example, if you set the minimum distance to 0 and the maximum distance to 2, the component will output all entries that match exactly or that differ by at most two character changes. There are many other methods available.
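To make the distance idea concrete, here is a small standalone Levenshtein implementation; the example names are arbitrary.

public class LevenshteinExample {

    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,        // deletion
                                            d[i][j - 1] + 1),       // insertion
                                   d[i - 1][j - 1] + cost);         // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Jon", "John"));     // 1: one insertion, so it passes a maximum distance of 2
        System.out.println(distance("Jon", "Joanne"));   // 3: too far away, so it would be rejected
    }
}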

Download >> Talend Open Studio for Data Integration

tMatchGroup is a component that compares columns in both standard input data flows and in MapReduce input data flows by using matching methods, and groups similar encountered duplicates together. Several tMatchGroup components can be used sequentially to match data against different blocking keys. This refines the groups received by each of the tMatchGroup components by creating different data partitions that overlap the previous data blocks, and so on. When defining a group, the first processed record of each group becomes the master record of the group. The distance of every other record from the master record is computed, and each record is then assigned to the appropriate master record accordingly. The matching algorithms used by this component are Exact Match, Soundex, Metaphone, Double Metaphone, Levenshtein, Jaro (which matches processed entries according to spelling deviations by counting the number of matched characters between two strings; the higher the value, the more similar the strings are), Jaro-Winkler (a variant of Jaro that gives more importance to the beginning of the string), Fingerprint key (which matches entries after normalizing them, for example by removing whitespace and control characters), q-grams (which matches processed entries by dividing strings into letter blocks of length q in order to create a number of q-length grams; the matching result is given as the number of q-gram matches over possible q-grams), and finally Custom. Custom matching enables you to load an external matching algorithm from a Java library using the Custom Matcher column.

tFuzzyMatch is a component which compares a column from the main flow with a reference column from the lookup flow and outputs the main flow data displaying the ‘distance’ between them. In this component, the Match Types are Metaphone, Double Metaphone, and Levenshtein.

tFuzzyUniqRow is a component which compares columns in the input flow by using a defined matching method and collects the encountered duplicates. In this component the Match Types are Exact, Metaphone, Double Metaphone, and Levenshtein.

tGenKey is a Big Data component that generates a functional key from the input columns by applying different types of algorithms on each column and grouping the computed results into one key. It outputs this key together with the input columns.  This component helps narrow down your data filtering and matching results using the generated functional key. It can be used as a Standard component or in MapReduce jobs. The algorithms can format data as well as perform matching, such as Exact, Soundex (a simple phonetic match based on the removal of vowels), Metaphone, Double Metaphone and ColognePhonetic, a Soundex-based phonetic algorithm optimized for the German language that encodes a string into a Cologne phonetic value. This code represents the character string that will be included in the functional key.

These are the major components that can do Matching; there are others, but they follow the same methods and use the same Match Types and algorithms as above.

In the final blog in this series, we will look at how to tune your matching to obtain the best possible matching with your data.

Download >> Talend Open Studio for Data Integration

The post Data Matching 101: What Tools Does Talend Have? appeared first on Talend Real-Time Open Source Data Integration Software.
