
Talend Data Masters 2016: Lenovo’s Data-Driven Retail Transformation


Last year, Talend introduced its annual Data Masters awards program, which recognizes companies that have dramatically transformed their businesses through innovative use of big data and cloud integration technologies. This year’s Data Masters blog series focuses on the winners of the 2016 Data Masters Awards, which were revealed on November 17th during a ceremony at the annual Talend Connect conference in Paris, France.

How can digital marketing drive the most ROI? What is the most popular product and how do we sell more of it? How can we better understand and deliver what the customer is looking for within our online store? These are challenging questions on the minds of both online and brick-and-mortar retailers of all sizes. Retail is a world awash in customer data, yet businesses still struggle to anticipate customer needs and deliver real-time, personalized recommendations. It’s still not an exact science, but with the help of big data, cloud and real-time information technologies, it’s starting to get easier. Lenovo’s story of digital transformation and its elastic hybrid-cloud platform is a good example of how a 360-degree customer management platform can dramatically improve business success.

Getting In-Tune with Customer Data

Lenovo is the world’s largest PC vendor[1] and a $46 billion personal computing technology company. It analyzes over 22 billion transactions of structured and unstructured data annually in order to achieve a 360° view of each of its millions of customers worldwide. Even with all this data at its fingertips, Lenovo still faced the problem of quickly turning rows and rows of customer information into real business insight. This is where the vision for LUCI Sky, an elastic hybrid-cloud platform supporting real-time business intelligence and operational analytics, came into view.

Pradeep Kumar G.S., a big data architect at Lenovo, explains that LUCI Sky helps Lenovo perform “real-time operational analytics on top of data sets” drawn from more than 60 data sources every day. Using their new big data environment, powered by Amazon Web Services, Lenovo servers, Talend Big Data and other technologies, they were able to run analytics to understand which products and configurations customers preferred most. Through that exercise, Lenovo reduced costs by $1 million and increased per-unit revenue by 11 percent in the first year, gains that have been sustained over the last three years.

A 180° Business Transformation from a 360° View of the Customer

Marc Gallman, senior manager of Big Data Architecture and Global Business Intelligence at Lenovo said, “With a 360° view of the customer that we’ve been able to achieve over the past several years we’ve been able to really quickly understand the needs of our customers and… get answers and insights, faster.”

Talend would like to congratulate Lenovo on winning this year’s Data Masters award for Business ROI. Watch the video below to hear the full story on the company’s dynamic business transformation.

[1] Gartner Research, http://www.gartner.com/newsroom/id/3373617, July 2016.

 


Sensors, Environment and Internet of Things (IoT)


According to Jacob Morgan, “Anything that can be connected, will be connected.” At one time, the terms cloud and big data were regarded as just ‘hype’, but we have all since witnessed the dramatic impact both of these key technologies have had on businesses across every industry around the globe. Now people are beginning to ask whether the same is true of IoT: is this all just hype? In my opinion, hype is certainly not the word I would use to describe a trend or technology that has the potential to profoundly change the world as we know it today.

In 2006, I was doing research on the use of RFID, and we introduced a way to organize chaotic office documents with an automatic data organizer and reminder technology. I published a paper on this topic called “Document Tracking and Collaboration Support using RFID.” That was my first interaction with sensors; we focused on M2M (machine-to-machine) communication and then on integration with collaborative environments, which pushed us toward the Subnet of Things. Ideas that were emerging at that time, such as smart homes, smart cars and smart cities, are now being realized. According to IDC, there will be 28 billion sensors in use by 2020, with an economic worth of $1.7 trillion. Before we jump into some applied scenarios, let’s outline the scope of sensor communication and how sensors are organized into three groups.

M2M

Machine-to-machine is different from IoT in scope and domain. M2M systems are normally restricted in availability and come with pre-defined operational bindings based on data. A suitable example would be manufacturing units and their communication with built-in sensors. A more localized example would be a heating sensor in a car that has a very definitive purpose and is limited to that car.

SoT                        

SoT (Subnets of Things) can be scoped at the organization or enterprise level. As in the example above, where RFIDs are attached to each file or book, SoT deployments live inside organizations using collaborative platforms. Another example would be a car sending data on the quality and utilization of its components in order to deliver better operations and a more stable experience.

IoT

In 2013 the Global Standards Initiative on Internet of Things (IoT-GSI) defined the IoT as "the infrastructure of the information society." The Internet of Things is a system of interrelated computing devices, mechanical and digital machines, objects, animals or people that are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction. IoT has evolved from the convergence of wireless technologies, micro-electromechanical systems (MEMS), micro-services and the Internet.

 

Sensor Data Types and Network Challenges

During a trip to San Francisco, I met SAP’s R&D lead for IoT. It was very interesting to hear how big data is already shaping our daily lives. For example, he mentioned an airplane system where thousands of sensors deliver data and calculations have to be made to 8 to 16 decimal places in a fraction of a second. Thanks to sensors, we have so much data that our current technology infrastructure is the only gating factor preventing companies from harnessing the power of real-time, big data insight.

IT decision makers should build infrastructures and use solutions that can handle data that had to be held back because no network was available. Sometimes there is no telecommunication network, or the surroundings do not allow a connection (a signal jammer, for example), so IT leaders need to focus on infrastructure that can hold data until a network is available and transfer is possible. A minimal store-and-forward sketch follows.
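
To make the store-and-forward idea concrete, here is a minimal sketch in Python. It assumes a local SQLite file as durable storage and uses placeholder connectivity and transfer functions (the host, table and payload shapes are illustrative, not prescriptions): readings are always written locally first and only forwarded once a network check succeeds.

```python
import json
import socket
import sqlite3
import time

DB = sqlite3.connect("sensor_buffer.db")
DB.execute("CREATE TABLE IF NOT EXISTS buffer (ts REAL, payload TEXT)")

def network_available(host="example.com", port=443, timeout=2) -> bool:
    """Crude connectivity probe; swap in whatever check fits the deployment."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def send_to_backend(payload: str) -> None:
    """Stand-in for the real transfer (HTTP POST, MQTT publish, ...)."""
    print("sent:", payload)

def record_reading(reading: dict) -> None:
    """Persist locally first, so nothing is lost while offline."""
    DB.execute("INSERT INTO buffer VALUES (?, ?)", (time.time(), json.dumps(reading)))
    DB.commit()

def flush_buffer() -> None:
    """Forward buffered readings, oldest first, once a network is available."""
    if not network_available():
        return
    for rowid, payload in DB.execute("SELECT rowid, payload FROM buffer ORDER BY ts").fetchall():
        send_to_backend(payload)
        DB.execute("DELETE FROM buffer WHERE rowid = ?", (rowid,))
    DB.commit()

record_reading({"sensor": "temp-01", "value": 21.7})
flush_buffer()
```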

Sensor data is like any other data coming from different sources: it needs cleansing, analysis and governance. At the same time, it has some distinct properties. Normally, sensor data is a stream of values juxtaposed against time. Not all of it is meaningful, but we cannot simply discard information, because all data is destined to have some value, even if that value is not known at the time of collection. So you should not use loose or questionable algorithms for data compression. There are a few ways to handle a situation like this (a short sketch follows the list):

• Send filtered data only if it is required

• Only transmit abnormal values

• Compress the data without using unfounded/untested algorithms
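
A short Python sketch of the options above, assuming a fixed acceptable range defines “abnormal” for the sensor in question and using the standard library’s zlib as an example of lossless compression (the range and field names are illustrative):

```python
import json
import zlib

NORMAL_RANGE = (10.0, 35.0)  # assumed acceptable range for this particular sensor

def is_abnormal(value: float) -> bool:
    low, high = NORMAL_RANGE
    return not low <= value <= high

def filter_abnormal(readings):
    """Options 1 and 2: forward only readings that fall outside the normal range."""
    return [r for r in readings if is_abnormal(r["value"])]

def compress_lossless(readings) -> bytes:
    """Option 3: lossless compression, so every original value can be recovered."""
    return zlib.compress(json.dumps(readings).encode("utf-8"))

def decompress(blob: bytes):
    return json.loads(zlib.decompress(blob).decode("utf-8"))

readings = [{"ts": i, "value": v} for i, v in enumerate([21.4, 21.5, 80.2, 21.6])]
print(filter_abnormal(readings))        # only the 80.2 spike would be transmitted
blob = compress_lossless(readings)
assert decompress(blob) == readings     # nothing was lost in compression
```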

As my discussion with him went on, he said, “Just look around this place. It is generating so much data. If something happens or if any component fails, the data generated before that failure is very important. It’s a complete ecosystem. If we lose that data, then we will not be able to predict any such event in the future. So our only option is to compress the data using lossless compression to keep cost down (the only alternative is one we can’t afford). There are some situations where we have duplication of data or the frequency of data is overwhelming, which results in a lot of overhead. But again, our back-end system should be capable of digesting all incoming information. At the sensor level we can apply curve-fitting techniques like the Fast Fourier Transform, so that we keep getting aggregated values.” Sensor data is best stored in low-cost, scale-out systems called Data Lakes. Raw data normally doesn’t get pulled into the data warehouse, as it lacks proven value, but it definitely requires governance, such as security.
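
To illustrate the kind of curve-fitting aggregation mentioned in the quote, here is a hedged NumPy sketch (the simulated signal and the choice of three components are assumptions): a window of readings is summarized by its strongest frequency components, which can be transmitted as an aggregate and used to reconstruct an approximation of the original curve.

```python
import numpy as np

def dominant_components(window: np.ndarray, k: int = 3):
    """Summarize a window of readings by its k strongest frequency components."""
    spectrum = np.fft.rfft(window)
    freqs = np.fft.rfftfreq(len(window))
    top = np.argsort(np.abs(spectrum))[-k:]   # indices of the k largest magnitudes
    return [(float(freqs[i]), complex(spectrum[i])) for i in top]

def reconstruct(components, length: int) -> np.ndarray:
    """Rebuild an approximate signal from the retained components only."""
    spectrum = np.zeros(length // 2 + 1, dtype=complex)
    freqs = np.fft.rfftfreq(length)
    for f, c in components:
        spectrum[np.argmin(np.abs(freqs - f))] = c
    return np.fft.irfft(spectrum, n=length)

# Simulated sensor window: a slow oscillation plus noise.
t = np.linspace(0, 1, 256, endpoint=False)
window = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.randn(256)
summary = dominant_components(window)
approx = reconstruct(summary, len(window))
print(f"{len(window)} samples summarized by {len(summary)} components")
```

Unlike the zlib example earlier, this summary is lossy, so it fits the “aggregated values” use case rather than archival storage.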

To perform analysis and get value from the data, it should be stored in a Data Warehouse, and I would definitely recommend reading our Data Vault blog, which explains how to store data effectively.

Sensors’ Impact on Our Daily Lives

Sensors are impacting almost every aspect of our lives. From movement sensors to manufacturing units, everything is getting smarter day by day. Let’s walk through some very basic and brief examples of how IoT is going to impact us in the near future.

Smart Manufacturing

Manufacturing is a $12 trillion-a-year global industry. It is an industry where robots transfer goods from one place to another and every action of the metal-stamping machines and assembly lines in an auto-parts factory is tracked by sensors. Also known as industrial manufacturing, this sector has already shown remarkable growth rates. Data collected from its different processes can help prevent unplanned downtime and predict supply needs based on sales forecasts.

Smart Transport

Automobile integration is being touted as the next great frontier in consumer electronics. According to Gartner, driverless vehicles will represent approximately 25 percent of the passenger vehicle population in mature markets by 2030. Driverless cars may keep a steering wheel for legacy design and ‘machine override’ purposes, but ultimately there will be no manual controls in the car at all. Traditional controls such as brakes, speed and function indicators, and acceleration will be handled by sensors, radar, GPS mapping, and a variety of artificial intelligence to enable self-driving, parking and choosing the safest route to a destination to avoid accidents. This impressive technology will focus not only on controlling cars but on communication with other vehicles on the road, based on each car’s road-condition status or its relation to the driver’s vehicle. Communications that require external networks will enable Internet-based services such as route calculation, vehicle status reporting, e-call and b-call, usage-based insurance, and data backup.

Smart Energy

Electric power grids, which started operating in the 1890s, were highly centralized and isolated. Later, the electricity network was extended and power stations were connected with load-shifting technologies in case of technical shutdown (backup and recovery). Many small generating units, such as windmills and solar parks, now produce electricity at varying capacity depending on conditions. Today, electrical distribution networks accommodate many more sources, as households are also generating electricity, which puts pressure on ‘the grid’ and makes a centralized system difficult to manage.

Based on system conditions, utility leaders will be able to decide which source of energy is cheapest at any given moment and shut down coal- or fuel-based sources to avoid unnecessary generation and cost. Similarly, smart meters already bill clients based on usage, but in the future, as they become more connected, those sensors might be capable of deciding which energy source is cheapest.

Smart Homes and Cities

The smart city is a very interesting application of the IoT, and smart homes and smart cities are connected to each other. Smart cities provide great infrastructure for communication and are ideal candidates from which to extract benefits. There are already projects in Chicago, Rio de Janeiro, and Stockholm where governments, in collaboration with private-sector companies, are taking advantage of IoT to collect data from city street assets, such as street lights, to determine whether or not they need repair. From school bus monitoring to garbage collection, IoT is changing the scope of how society functions.

"Smart Home" is the term commonly used to define a residence that has appliances, lighting, heating, air conditioning, TVs, computers, entertainment audio & video systems, security, and camera systems that are capable of communicating with one another and can be controlled remotely by a time schedule, from any room in the home, as well as remotely from any location in the world by phone or internet. The next iteration of today’s Smart Home will be capable of conducting inventory management on your fridge, where it automatically places an order for milk or eggs after determining your supplies of each are low. Or devices capable of recognizing interruption in electricity, water or even network connectivity and can inform service providers with needs for repair. Imagine smart trash bins that automatically notify garbage collectors that you’re your trashcans are full so they can pick it up.

To summarize, our data-driven world is now taking another shift, toward smart systems. Our current infrastructure has empowered us to overcome major challenges from data generation and ingestion through to analysis. In addition to the collective smart scenarios above, personal body sensors such as GPS chips and health and activity monitors have raised living standards. I expect rapid growth in IoT over the coming months and years.

http://www.wsj.com/articles/how-u-s-manufacturing-is-about-to-get-smarter-1479068425?mod=e2fb

https://blog.fortinet.com/2016/10/27/driverless-cars-a-new-way-of-life-brings-a-new-cybersecurity-challenge

http://www.idc.com/getdoc.jsp?containerId=prUS25658015

http://data-informed.com/how-the-internet-of-things-changes-big-data-analytics/

http://internetofthingsagenda.techtarget.com/definition/Internet-of-Thin...

 

IT: How to Survive in a Self-Service World


This article aims to explain to IT and data professionals how to satisfy users' expectations about data access management while ensuring data control and data security. It provides advice for how to stimulate the use of governed, self-service data preparation.

8:33 AM, coffee in hand (the third since the alarm clock, anyway), Executive Management floor. Dominique says to Jean-Christophe: "Don’t tell me we need to wait for IT to have access to this data!" Or the more rational scenario may go something like this: "We need clean, complete and trusted data on consumer behavior for the meeting at 12:00 PM: no exceptions!"

Getting complete data sets on consumer behavior may be doable, except that in this scenario IT is you. Who has not felt this mixed feeling, between rising stress and wanting to take up the challenge?

A quick diagnosis of the data landscape shows:

- The expected data exists but is disseminated throughout the organization in silos: various applications, data repositories, etc.

- The quality of the data is questionable and therefore it needs to be reviewed, cleaned and organized

- The ability to associate this data with other data sets requires a familiarity with the information (often known only by business users)

- Some of the employees using these applications can access the data themselves and will often export it into Excel

Data management: Should concessions be made?

It is widely proven and accepted that a data-driven organization optimizes its performance. For example, per the McKinsey Datamatics Survey, data-driven companies have a 23x greater customer acquisition, 6x better customer retention, and 19x larger profits. But how does a company establish a culture of being data-driven?

Download>>Try Talend Data Preparation for Free

The CDO 2016 barometer estimates that 80 percent of current projects aim to help a company become 'data-driven'. The use of data is spreading rapidly at all levels of every organization and across all departments. As the volumes and varieties of data continue to rise, the use of data becomes more frequent, and the demand for real-time data to drive business decisions grows with it.

The aspiration created by the consumerization of data gives more power to business users. Refusing to consider business users’ demands for more agility, autonomy, and collaboration is dangerous. Whether they are data experts or business users, if IT denies them access, they will find a workaround, otherwise known as ‘shadow IT’, which may jeopardize your enterprise information.

Having an overbearing IT department goes against the core mission corporate IT is supposed to fulfill, which is to enable employees to be productive and successful in their roles. The challenge is finding a way to make data available to employees in a safe, governed, secure way. The 2016 CXP Group - Qlik Barometer found that 34 percent of IT and BI managers are no longer involved in data-related projects. The idea of allowing employees self-service data access and management scares IT. The misconception is that by implementing self-service data preparation, IT will lose control over the sanctity and security of enterprise data.

Is your organization currently challenged by the fact that, despite centralized Business Intelligence tools and information repositories, data still circulates in the form of Excel files? If so, the question is no longer how to maintain control of enterprise information, because that control has already been lost, but how to regain it. The increasing risk of cyber-attacks and data leaks makes this repossession of control even more crucial and urgent. Cybercriminals no longer need to target application systems; they simply target messaging systems. For example, on average a company has 6,097 files that contain the word 'salary' and reside outside primary business applications.

BIG IT is Not the Answer but IT has the Answer to Deliver Reliable and Secure Data

Because business demands are too numerous and varied to be satisfied by a single department, centralizing data management in the hands of IT does not scale. Centralized IT only results in frustration for both IT and business users.

Because no one is better qualified than an accountant to clean up, enrich and reconcile billing and supplier portfolio data, or a marketing manager to do the same for leads generated by a given event, the competence and the ability to make sense of data are, by definition, distributed across each line of business.

But because a clean vendor billing file or an enriched file of new marketing leads creates value for the business and deserves to be shared and reused within the organization, it is important to have an IT framework that enables this type of collaboration, with the appropriate data management and security rules.

Because individual employee data preparations will be even more useful if they can draw on all the company's data sources and feed all its target applications, IT is justified in orchestrating these integrations centrally.

Self-Service, Governed Data Preparation is the Holy Grail for IT and Business Users

The traditional border between the data producer and user has faded: we are all producers and users.

IT has a unique opportunity in front of it to keep (or regain) its role as a catalyst for change in the company: take the initiative to set up and operate collaborative enterprise data repositories, and rely on data preparation solutions that are delivered to everyone in self-service mode while remaining centrally governed.

The best self-service data preparation platforms enable IT to deliver massive individual productivity gains for all employees, as well as collective insight through collaboration. They allow the integration of all data sources and targets. They ensure the consistency, compliance and security of data management.

Self-service data preparation tools represent an opportunity for IT professionals to meet employees' needs for immediacy, autonomy and collaboration while maintaining data governance, and to escape the pitfalls of maintaining Big IT.

Matt Aslett, Research Director for the Data Platforms and Analytics group at 451 Research, said,  "We anticipate greater interest in 2017 and beyond in products that deliver data governance and data management capabilities as an enabler for self-service analytics."

11:12 AM, Executive Management floor, Jean-Christophe to Dominique: "The customer data is ready! I have provided a self-service data preparation platform to marketing employees. As a result, the profiling data from our CRM system has been standardized and made reliable by Anne-Sophie in Marketing Operations, then enriched with the latest online activities of our customers by Andrew of the Digital department, and finally approved by their manager before feeding our BI application. We’re all set for today’s meeting!"

If you are a Jean-Christophe at Talend, you're in luck! Learn how your Talend platform already allows you to implement governed, self-service data preparation.

Not Jean-Christophe? No need to change your first name. You can learn how to adopt Talend technologies by clicking here.

About the Author

Francois Lacas is Product Marketing Manager at Talend. A hands-on strategist, he has been passionate about driving Digital Transformation and Modern Marketing to increase their contribution to the business since the early 2010s. Prior to joining Talend, he was Director of Marketing & Digital Transformation at Wallix, a public cybersecurity software vendor, and at ITESOFT, a public ECM / BPM software vendor. In his various roles he regularly writes blog articles and presents at events and tradeshows. Get in touch with him on LinkedIn and on Twitter @francoislacas.

An IT Survival Guide for a Self-Service World


This article explains to IT and data professionals how to satisfy their users' expectations around working with data while guaranteeing its control and security: by encouraging the use of governed, self-service data preparation.

8:33 AM, coffee in hand (the third since the alarm clock, anyway), Executive Management floor. Dominique, the marketing director, calls out to Jean-Christophe: "Surely we're not going to have to wait for IT to get hold of this customer data!" Or the more rational variant (two fewer coffees, I mean... I promise we'll cover how managers handle stress with their teams, but on another blog): "We need to work on clean, complete and certified consumer behavior data during the 12:00 PM meeting: no alternative."

Message received... all the more so because Jean-Christophe is us, the IT manager! Who hasn't felt that mixed feeling, somewhere between rising stress and the urge to take up the challenge?

Once the emergencies have been dealt with and the rest delegated, a quick diagnosis of the situation shows:

  • That the expected data exists... scattered in silos: applications, data repositories, etc.
  • That the quality of the data is imperfect, so its reliability still has to be established
  • That associating these data sets with one another requires business skills and understanding
  • That some users of these applications can access the data themselves, each on their own, by exporting it to Excel
  • ... And that time is short

Data management: should IT loosen its grip?

It is widely proven and accepted that a 'data-driven' organization optimizes its performance: for example, according to the McKinsey Datamatics Survey, data-driven companies multiply their customer acquisition power by 23, their customer retention capacity by 6 and their profits by 19. That debate is settled. What are the consequences of this observation inside organizations?

The 2016 CDO barometer estimates that 80% of current projects aim to make the company 'data-driven'. The use of data is spreading rapidly at every hierarchical level and in every department of the company. The use of data is increasingly frequent, to the point of becoming omnipresent. The use of real-time data is becoming imperative. The use of data is massive: dizzying volumes and varieties of data are available.

The aspiration created by the consumerization of data puts power in the hands of its users. Opposing it is futile. Refusing to take into account these impatient, voracious expectations of agility, autonomy and collaboration, as well as these local initiatives, is dangerous: whether they are data experts or business users, they will find their own solution, through subscriptions to pay-per-use cloud services, through individual import-export and manipulation of data, and so on. So do we just let it happen?

What I am describing here is a disintermediated IT function... does that remind you of anything? An 'uberized' IT function that has lost sight of its primary mission of making business evolution possible and is no longer at the center of the game. A figment of the imagination? The 2016 CXP Group - Qlik barometer found that 34% of IT and BI managers are aware that they are no longer indispensable in data-related projects. And opening up data manipulation in self-service mode is frightening, because IT can see it as a further loss of control.

I am also describing a situation where, despite centralized Business Intelligence projects, data still circulates as Excel files. If that is yours, the question is no longer how to keep control, but how to take it back. The challenge is to make data available 'at your fingertips' in complete security. The growing risk of cyber-attacks and data leaks makes this recovery of control crucial and urgent. Cybercriminals no longer need to target application systems. They only need to target email: on average, a company has 6,097 files containing the word 'salary' that reside outside its applications.

IT is not the answer, but IT has the answer for delivering reliable and secure data

Because business demands are too numerous and varied to be satisfied by a single department, centralizing data manipulation in the hands of IT lacks responsiveness. It only creates frustration, on the IT side as much as among users.

Because no one is better placed than an accounting manager to clean, enrich and reconcile billing and supplier portfolio data, or a marketing manager to handle the lead data generated by a given event, the competence and the ability to make sense of data is, by definition, distributed across each line of business.

But because a cleaned supplier billing file or an enriched file of new marketing leads creates value for the company and deserves to be shared and reused within the organization, it is legitimate for IT to frame this collaboration with management and security rules.

And because users' individual data preparations will be all the more useful if they can draw on all the company's data sources and feed all its target applications, IT is the legitimate party to orchestrate these integrations centrally.

Governed self-service data preparation: the El Dorado for IT and for users

The traditional border between data producer and data user has faded: we are all producers and users.

IT has a clear opportunity to keep (or regain) its role as a catalyst for change in the company: take the initiative to set up and operate genuine data co-working spaces, and rely on data preparation solutions that are distributed to everyone in self-service mode and governed centrally.

The best self-service data preparation platforms allow IT to deliver colossal individual productivity gains for every user, as well as collective productivity gains through collaboration. They allow the integration of all data sources and destinations. They guarantee the consistency, compliance and security of data management.

They represent an opportunity for all IT professionals to meet users' needs for immediacy, autonomy and collaboration, to champion a renewed form of data governance and, ultimately, to escape their own uberization under the repeated blows of users' local workarounds.

This is the moment for IT professionals to seize that opportunity. Matt Aslett, analyst at 451 Research, predicts: "We anticipate greater interest in 2017 and beyond in products that deliver data governance and data management capabilities as an enabler for self-service analytics."

11:12 AM, Executive Management floor, Jean-Christophe to Dominique: "The customer data is ready! I gave marketing users access to a governed self-service data preparation platform. The profiling data from our CRM was standardized and made reliable by Anne-Sophie in Marketing Operations, then enriched with our customers' latest online activities by Andrew in the Digital department, and finally approved by their manager before feeding our BI application. Have a good meeting!"

If you are a Jean-Christophe and a Talend customer, you're in luck! Find out how your Talend platform already lets you implement governed, self-service data preparation.

Not a Jean-Christophe? No need to change your first name to find out how to adopt Talend technologies here.

And you, what strategy have you put in place to satisfy your users' demands for data? Your feedback would be valuable; leave us a comment!

About the Author

Francois Lacas is Product Marketing Manager at Talend. A hands-on strategist, he has been passionate about driving Digital Transformation and Modern Marketing to increase their contribution to the business since the early 2010s. Prior to joining Talend, he was Director of Marketing & Digital Transformation at Wallix, a public cybersecurity software vendor, and at ITESOFT, a public ECM / BPM software vendor. In his various roles he regularly writes blog articles and presents at events and tradeshows. Get in touch with him on LinkedIn and on Twitter @francoislacas.

Data Matching 101: How Does Data Matching Work?


This blog is the first in a series of three looking at Data Matching and how this can be done within the Talend toolset. This first blog will look at the theory behind Data Matching, what is it and how it works. The second blog will look at the use of the Talend toolset for actually doing Data Matching. Finally, the last blog in the series will look at how you can tune the Data Matching algorithms to achieve the best possible Data Matching results.

First, what is Data Matching? Basically it is the ability to identify duplicates in large data sets. These duplicates could be people with multiple entries in one or many databases. It could also be duplicate items, of any description, in stock systems. Data Matching allows you to identify duplicates, or possible duplicates, and then allows you to take actions such as merging the two identical or similar entries into one. It also allows you to identify non-duplicates, which can be equally important to identify, because you want to know that two similar things are definitely not the same.

So, how does Data Matching actually work? What are the mathematical theories behind it? OK, let’s go back to first principles. How do you know that two ‘things’ are actually the same ‘thing’? Or, how do you know if two ‘people’ are the same person? What is it that uniquely identifies something? We do it intuitively ourselves. We recognise features in things or people that are similar, and acknowledge they could be, or are, the same. In theory this can apply to any object, be it a person, an item of clothing such as a pair of shorts, a cup or a ‘widget’.

This problem has actually been around for over 60 years. It was formalised in the 1960s in the seminal work of the statisticians Fellegi and Sunter. The first use was for the US Census Bureau. It’s called ‘record linkage’, i.e. how are records from different data sets linked together? For duplicate records it is sometimes called de-duplication: the process of identifying duplicates and linking them. So, what properties help identify duplicates?

Well, we need ‘unique identifiers’. These are properties that are unlikely to change over time. We can associate a probability with each property, for example the probability that two records agreeing on that property actually refer to the same thing, and weigh these probabilities accordingly. This applies to both people and things.

The problem, however, is that things can and do change, or they get misidentified. The trick is to identify what can change, e.g. a name, address, or date of birth. Some things are less likely to change than others. For objects, this could be size, shape, color, etc.

NOTE: Record linkage is highly sensitive to the quality of the data being linked. Data should first be ‘standardized’ so it is all of a similar quality.

Now there are two sorts of record linkage:

1. Deterministic Record Linkage

a. This is based on a number of identifiers that match

2. Probabilistic Record Linkage

a. This is based on the probability that a number of identifiers match

The vast majority of Data Matching is Probabilistic Data Matching. Deterministic links are too inflexible.

So, just how do you match? First, you do what is called blocking: you sort the data into similarly sized blocks that share the same attribute, choosing attributes that are unlikely to change, such as surname, date of birth, color, volume or shape. Next you do the matching. First, assign a match type for each attribute (there are lots of different ways to match attributes: names can be matched phonetically, dates by similarity). Next you calculate the RELATIVE weight for each match attribute, which is similar to a measure of ‘importance’. Then you calculate the probabilities of matching, and of accidentally un-matching, those fields. Finally, you apply an algorithm that adjusts the relative weight for each attribute to produce what is called a ‘Total Match Weight’. That is the probabilistic match between two things.
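
Below is a minimal, hedged Python sketch of these steps (it is not the Talend implementation, which the next article covers): records are blocked on the first letter of the surname, each attribute is compared with a simple exact-agreement test, and agreement/disagreement weights derived from assumed m-probabilities (chance of agreement for true matches) and u-probabilities (chance of agreement for non-matches) are summed into a total match weight.

```python
import math
from collections import defaultdict

# Assumed per-attribute m-probabilities (P(agree | same entity)) and
# u-probabilities (P(agree | different entities)); in practice these are estimated.
PARAMS = {
    "surname": (0.95, 0.01),
    "dob":     (0.90, 0.005),
    "city":    (0.80, 0.10),
}

def attribute_weight(attr: str, agrees: bool) -> float:
    m, u = PARAMS[attr]
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

def total_match_weight(a: dict, b: dict) -> float:
    """Sum per-attribute weights into a single probabilistic match score."""
    return sum(
        attribute_weight(attr, a.get(attr, "").lower() == b.get(attr, "").lower())
        for attr in PARAMS
    )

def block_by_surname_initial(records):
    """Blocking: only compare records that share a cheap, stable key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["surname"][:1].upper()].append(rec)
    return blocks

records = [
    {"id": 1, "surname": "Franczuk", "dob": "1970-01-01", "city": "Glasgow"},
    {"id": 2, "surname": "Franczuk", "dob": "1970-01-01", "city": "London"},
    {"id": 3, "surname": "Smith",    "dob": "1985-06-30", "city": "Leeds"},
]

for block in block_by_surname_initial(records).values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            a, b = block[i], block[j]
            print(a["id"], b["id"], round(total_match_weight(a, b), 2))
```

In a real pipeline the exact-equality test would be replaced by phonetic or similarity comparators and the weights would be tuned against known matches, which is the subject of the third article in this series.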

To summarize:

•       Standardize the Data

•       Pick Attributes that are unlikely to change

•       Block, sort into similar sized blocks

•       Match via probabilities (remember there are lots of different match types)

•       Assign Weights to the matches

•       Add it all up – get a TOTAL weight

The final step is to tune your matching algorithms so that you can obtain better and better matches. This will be covered in the third article in this series.

The next question then is what tools are available in the Talend tool set and how can I use them to do Data Matching? This will be covered in the next article in this series of Data Matching blogs.

About the Author

Stefan Franczuk is a Customer Success Architect at Talend, based in the UK. Having moved into IT over 25 years ago, after starting his working career in engineering and aviation and then switching to academia, Dr. Franczuk has a wide range of experience in many different IT disciplines. For the last 15 years he has architected integration solutions for clients all over the world. He also has extensive experience in Big Data, Data Science and Data Analytics. Dr. Franczuk holds a PhD in Experimental Nuclear Photo-Physics from the University of Glasgow.

Talend Data Masters 2016 – UNOS: How many lives can you save?


Did you know that a single organ donor can save up to 8 lives? The United Network for Organ Sharing (UNOS) is a private, non-profit organization that manages the United States’ national organ transplant system under contract with the federal government. On average, 85 people receive transplants every day, but another 22 people die each day while waiting for an organ transplant due to the lack of donors.

Each organ has a short shelf life

Today, there are nearly 120,000 people on the organ waiting list. There were only 9,000 deceased donors last year, so optimizing the use of organs is essential – and a challenge. When a transplant hospital accepts a transplant candidate and an organ procurement organization gets consent from an organ donor, both enter medical data into UNOS’ computerized network. Using the combination of candidate and donor information, the UNOS system generates a “match run,” or a rank-order list of candidates for each organ. There is a tremendous sense of urgency as a doctor has just one hour to decide whether or not to accept an organ for their patient on the list. Organs have a limited time frame in which they must be transplanted.  A heart, for example, must be transplanted within four hours but kidneys can tolerate 24 to 36 hours outside the body.

Looking back to drive future decisions

UNOS wanted to make more information available via a self-service model. A first offering in this model was an Organ Offer Report: transplant centers that access it can now see what they’ve transplanted over the last three months. They can also see the outcomes of the organs they did not accept, so they can analyze why they turned them down, how those organs were successfully used by other centers, and whether they should consider accepting similar offers in the future. It is critical for surgeons to look back at decisions they made about accepting or rejecting organs to determine whether their future decisions should be altered. Data helps make better decisions, and better decisions mean more saved lives.

A six-fold reduction in the time needed to generate the Organ Offer Report

UNOS uses Talend Big Data to extract and organize data used to generate the Organ Offer Report provided to transplant centers. Transplant professionals can now see more information than ever before to help with quality improvement efforts. Using Talend enabled UNOS to automate the process of integrating systems and processing data as well as reduce the time required for this essential task from 18 hours down to three or four hours.

UNOS’ databases currently contain approximately three terabytes of data. UNOS is using Talend to integrate both structured data from Microsoft and Oracle databases with JSON data from the Web. UNOS is using Talend’s ability to generate Spark code to accelerate integration jobs, with Talend data pipelines feeding three separate Hadoop clusters. Talend outputs the results to a source system where Tableau data visualization software can read them. Tableau then serves up the Organ Offer Report, which gives transplant centers a list of recent transplant activity at their hospital.

Almost 31,000 transplants were performed last year in the United States. For its role in supporting this life-saving work, Talend named UNOS a grand prize winner of the 2016 Talend Data Masters Awards.

View the full story below:

The Role of Statistics in Business Decision Making


The use of statistics in business can be traced back hundreds of years. As early as the 12th century, statistics were used by Gerald of Wales to complete the first population census of Wales (1). It wasn’t long before merchants realized that statistics could be used to measure and quantify trade. The first record of this was in Florence, documented in Giovanni Villani’s “Nuova Cronica” in 1346 (1). Statistical methods were later adopted to help drive quality and, in doing so, helped contribute to the advancement of statistics itself. In 1908, William Sealy Gosset, chief brewer for Guinness in Dublin, devised the t-test (2) to measure consistency between batches of stout (1).

With the rise of big data, organizations are looking to extract deep insights from their data using advanced analytical techniques.  With big data, new roles like Data Scientists are being developed within organizations. But no matter the title of the role, be it quantitative analyst or data scientist, they all share one thing in common.  Mathematical statistics and probability are at the heart of these disciplines and they are seen as critical to the success of a business.

Currently you would be hard pressed to find a business that does not perform some level of statistical analysis on their data.  Most of these analyses are performed under the general term of Business Intelligence (BI) (3). BI can mean many things but in general, BI is used to run a company’s day-to-day operations and includes software, process, and technology (4).  BI enables organizations to make data driven decisions and effect change.

The term “data driven” is synonymous with companies that leverage their data and analytics to unearth hidden insights that have a real and measurable impact on their business (5).

“FedEx and UPS are well known for using data to compete. UPS’s data led to the realization that, if its drivers took only right turns (limiting left turns), it would see a large improvement in fuel savings and safety, while reducing wasted time. The results were surprising: UPS shaved an astonishing 20.4 million miles off routes in a single year.”(5)

By applying statistical and probabilistic methods to their data, organizations can unlock patterns and insights that otherwise would have gone unnoticed.  These insights, as in the case with UPS, can lead to significant increases in revenue while driving down costs to the business.

The use of statistics in business is currently undergoing a paradigm shift in its scope and application. Today, data scientists are leading the charge in applying statistics and probability to help businesses use their most important organizational asset: their data (6).

[Figure: the work of a statistician compared with that of a data scientist (6)]

Above we see a comparison between the work of a statistician and that of a data scientist. A data scientist deals with data in its raw form, including structured, semi-structured, and unstructured data. The outputs of the data scientist are generally data applications or data products. Data-driven applications are changing how companies generate revenue; examples include Facebook, LinkedIn, and Google (7). Data-driven applications are creating what is known as the “smart enterprise.” Smart enterprises give not only management but also the rank and file the ability to have analytics at their fingertips. We see this on LinkedIn with its recommendations for connections, and the same on Facebook with friend recommendations: the “data application” is constantly looking for people who may enhance a user’s network.

[Figure: traditional BI compared with data science (6)]

Above is a comparison between traditional BI and data science. The biggest difference is that BI is generally backward looking (simple descriptive statistics) while data science is forward looking (inferential statistics). BI will always be a part of the enterprise; traditional EDWs aren’t going away anytime soon. However, these traditional systems are being complemented by emerging technologies (Hadoop, in-memory databases, and others) to support big and fast data analytics.

Throughout history, statistics have been recognized as an indispensable tool of business operations. Starting with the population census of Wales in the 12th century, statistics have been applied to many facets of business. The need for quality and consistency has been the major driver for the adoption of statistics.

Data has been touted as “the new oil” of the era of Big Data (8). Companies looking for a competitive advantage need to embrace statistics and probability in the form of advanced analytics. There is no doubt that data will continue to be the control point of success for business, and mathematical statistics and probability will be a critical underpinning of any winning data strategy.

References

[1] Statslife.org.  Timeline of Statistics.

[2] Kopf, D.  The Guinness Brewer Who Revolutionized Statistics.

[3] Blackman, A.  (March, 2015)  What is Business Intelligence?

[4] Heinze, J.  (November, 2014).  Business Intelligence vs. Business Analytics: What’s the Difference?

[5] Mason, H. Patil, J. (January, 2015) Data Driven: Creating a Data Culture.

[6] Smith, D.  (January, 2013).  Statistics vs. Data Science vs. BI.

[7] Accel Partners. (Summer, 2013).  The Last Mile in Big Data:  How Data Driven Software (DDS) Will Empower the Intelligent Enterprise.

[8] Yonego, J.  (July, 2014).  Data is the new oil of the digital economy. 

 

4 Considerations for Delivering Data Quality on Hadoop


Organizations are increasingly trying to become data driven. In my last blog, I outlined steps organizations should take to become data driven using the Kotter Model. In this blog I want to highlight a key aspect of the digital transformation journey, which is Data Governance specifically in the era of Big Data.

As companies adopt Big Data technologies, they will have to understand how the standard Data Governance practices, such as Data Quality and Stewardship, that they have used to build Enterprise Data Warehouses (EDW), Master Data Management (MDM) and Business Intelligence (BI) reporting will apply to the new source systems that will be processed through Hadoop.

The types of source systems that fed data warehouses and MDM were limited in terms of the 3Vs (velocity, variety, volume), because there was no requirement to access more complex data structures for BI reporting. With the rise of the Hadoop ecosystem, a new term, the Data Lake, was coined to describe a repository that accommodates all the diverse data Hadoop is able to support. Gartner cautions that “Data Lakes carry substantial risks.” One of the concerns when it comes to Data Lakes is how to manage Data Governance. Until now, companies have had some form of governance implemented in their existing data architectures, but the growing volumes of unstructured and streaming data in today’s Data Lakes are forcing companies to revisit their processes for maintaining data governance and stewardship.

Here are some of the key elements to consider for governing data in Hadoop:

Analytics needs to be part of the Data Organization: As the technological landscape around data changes, companies need to rethink the IT organizational structure that supports and monitors data. Companies should consider having a center of excellence (COE) for data (EDW, BI, Master Data), and analytics (data scientists et al.) needs to be part of that COE. This will allow the COE leadership to retain some control over Data Governance. Data Scientists will have access to the raw and transformed data that EDWs and MDM use, which they can draw on in their analytics. Data Quality rules and Stewardship processes can be applied to the data that Data Scientists use wherever applicable.

Prioritize what data needs to be cleansed: Though all data could be important, not all data is equal. It is important to define where the data came from, how the data will be used and how the data will be consumed. Data that is being consumed by your customers or vendors from your business ecosystems will need to be cleansed, matched and survived. Stringent data quality rules might be needed and applied to data that requires strict audit trails and carries regulatory compliance guidelines. On the other hand, we would not get much value in cleansing social media data or data coming from sensors. We should also consider having the data cleansed on the consumption side rather than on the acquisition side. Therefore a single governance architecture might not apply for all types of data.

Balance governance vs. results: We have to keep in mind that the results of analytics could be affected if the data is ‘cleansed’. By applying data quality rules to the data, you could damage its analytical value. Traditional data quality practices always insist on ‘correcting’ the data, but in analytics that data could be an outlier and could signify a change in a pattern. In these scenarios it makes more sense to have data quality methods that describe the quality of the dataset as a whole rather than the content of individual records. Checks such as ‘the data load is 10x smaller or larger than expected’ or ‘more than half of all the values are empty or null’ are a better fit for data scientists who want to apply machine learning models to the data (a short sketch follows).
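
As a hedged illustration of such dataset-level checks, here is a short Python/pandas sketch (the thresholds, expected row count and sample columns are assumptions): rather than correcting individual values, it simply flags whether the batch as a whole looks trustworthy.

```python
import pandas as pd

def dataset_level_checks(df: pd.DataFrame, expected_rows: int) -> dict:
    """Flag batch-level anomalies without altering (and possibly distorting) individual records."""
    issues = {}

    # 'The data load is 10x smaller or larger than expected'
    if not expected_rows / 10 <= len(df) <= expected_rows * 10:
        issues["row_count"] = f"got {len(df)} rows, expected around {expected_rows}"

    # 'More than half of all the values are empty or null'
    null_fraction = float(df.isna().mean().mean())
    if null_fraction > 0.5:
        issues["null_fraction"] = f"{null_fraction:.0%} of all values are null"

    return issues

batch = pd.DataFrame({"sensor_id": [1, 2, None], "reading": [0.4, None, None]})
print(dataset_level_checks(batch, expected_rows=1000))   # flags the suspicious row count
```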

Make use of big data processing engines: Data Quality tasks such as profiling and matching inherently involve processing record by record and/or doing aggregations on the source data, and hence can be processed in Hadoop. Data Quality can therefore take advantage of the distributed computing in Hadoop and cloud architectures, as sketched below. Additionally, we can apply machine learning to Data Quality functions such as data matching.
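
As a sketch of pushing that profiling work into a distributed engine, here is a hedged PySpark example (the file path and the use of a CSV source are assumptions): per-column null counts and distinct counts are computed across the cluster rather than on a single machine.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-profiling").getOrCreate()

# Assumed input: a file landed in the data lake; any Spark-readable source works.
df = spark.read.option("header", True).csv("hdfs:///lake/raw/customers.csv")

# Per-column profile computed in parallel across the cluster.
profile = df.select(
    *[F.count(F.when(F.col(c).isNull() | (F.col(c) == ""), c)).alias(f"{c}_nulls")
      for c in df.columns],
    *[F.countDistinct(F.col(c)).alias(f"{c}_distinct") for c in df.columns],
)
profile.show(truncate=False)
```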

Data Governance is becoming a key area of focus for CIOs, and it will be a challenge for them to extend governance to Big Data analytics. Making analytics part of the Data COE team and analyzing how existing governance policies can be applied to Hadoop would be a good start.


Your ‘Resolution List’ for 2017: 5 Best Practices for Unleashing the Power of Your Data Lakes


As we all prepare for the New Year, what are the top priorities on your agenda for 2017? Are data lakes part of it? Are you looking for ways to do it right? Then we might be able to help you go through the holiday break with some food for thought.

Data lakes are an important cornerstone for companies needing to manage and exploit new forms of data in order to fuel their digital transformation. Data lakes allow employees to do more with data, faster, in order to drive business results. So far, many IT teams are still trying to figure out how to get return from their initial data lake investments. 

Starting with the ‘Why’ instead of the ‘What’: Why do you need a data lake?

Let’s look at the case of GE. GE’s CEO Jeff Immelt once said: “All industrial companies are in the information business whether they like it or not.” GE was founded in 1892 and since that time the company has been able to remain relevant by evolving their business models and focus throughout the years in order to keep pace with the market.

One GE division, GE Power and Water, was only able to use two percent of its gas turbine data for analytics. The business unit (BU) therefore decided to create a data lake containing all the data related to the gas turbines. It had to integrate data from over 100 apps, numerous different ERP systems, and more. The BU then enabled widespread access to its data lake using a data-as-a-service model, or as Don Perigo, VP of Architecture at GE Power Services, calls it, ‘cafeteria-style’. Using this model, GE gave over 200,000 people across 80-plus countries access to the data lake and was able to perform machine and equipment diagnoses, reliability management, and maintenance optimization.

In the past, companies thought they’d gain full 360-degree visibility into their enterprise information with a data warehouse. However, the advent of big data has put these systems under distress, pushing them to capacity, and driving up the costs of storage.  As a result, some companies have started moving some of their data (often times less utilized data) off to a new set of systems like those run in Hadoop, NoSQL databases or the Cloud.

As a result of this migration, companies also came to realize that they can actually do more with Hadoop, NoSQL and Cloud vs. using enterprise data warehouses. Thus, they started adding new sources of data like sensor, mobile, social and big data to these systems, ultimately transforming their Hadoop, NoSQL and Cloud systems into data lakes.

So what is a data lake?

According to Nick Heudecker at Gartner, “Data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organization."[1]

The data lake metaphor emerged because ‘lakes’ are a great concept to explain one of the basic principles of big data. That is, the need to collect all data and detect exceptions, trends and patterns using analytics and machine learning. This is because one of the basic principles of data science is the more data you get, the better your data model will ultimately be. With access to all the data, you can model using the entire set of data versus just a sample set, which reduces the number of false positives you might get.

And what’s the difference between a data warehouse and a data lake?

Data lakes provide the flexibility to store anything without having to worry about preformatting data. However, this flexibility has also led to a new set of challenges: because much less structure is imposed up front, you have to figure out the structure of the data when you read it.

And with the overwhelming amount of data flowing into organizations today, there are concerns over what data can be accessed by employees, and what shouldn’t be shared. Due to the lack of tools, there is also confusion around what data lies where, and limited understanding of where the data came from or what has been done with it thus far.

As a result, until now only a limited number of people have been able to access the information residing in corporate data lakes. These individuals tend to be those who know how to work with data science tools and can deal with the volume and complexity of the data. The rest of the organization is simply drowning in the data lake.

This gap between those who could utilize enterprise data lakes and those who couldn’t led to gridlock, causing most data lakes to fail at delivering on their true promise: business ROI.

So here are five best practices to successfully unleash the power of your data lakes.

1) Accelerate Data Ingestion

Most organizations end up with a disjointed architecture: numerous enterprise data silos from point solutions, plus new data from cloud, big data and IoT applications. The prerequisites for creating a solid platform for data ingestion should be:

►  Wide connectivity – Connect all your data big and small, on premises, hybrid and in the cloud.

►  Batch & streaming ubiquity – Ensure it can handle both historical and real-time data ingestion and can process data pipelines as they arrive, including for advanced analytics (a minimal streaming sketch follows the pitfalls below).

►  Scale with volume and variety – It should have the ability to quickly onboard new data sources, such as data from Web clickstreams, Social or Smart devices.

Pitfalls to watch for:

  • Hand coding – This will limit the system’s ability to scale and deliver on business needs in a timely fashion.
  • Fragmented tools – Using too many of these will create even more silos.
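
As an illustration of the batch-and-streaming point above, here is a hedged Spark Structured Streaming sketch in Python; the Kafka broker, topic, schema and lake paths are assumptions rather than details from this article. It lands raw events in the lake in their original granularity as they arrive.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("lake-ingestion").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Real-time path: read clickstream/sensor events from Kafka as they arrive.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "device-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Land the raw events in the lake, partitioned by day, for later batch analytics.
query = (
    events.withColumn("ingest_date", F.to_date("event_time"))
    .writeStream.format("parquet")
    .option("path", "hdfs:///lake/raw/device_events")
    .option("checkpointLocation", "hdfs:///lake/_checkpoints/device_events")
    .partitionBy("ingest_date")
    .start()
)
```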

2) Understand & govern your data

The lack of data governance is preventing many organizations from fully opening up data lakes for all employees to use, because more often than not, data lakes contain sensitive data, such as social security numbers, dates of birth and credit card numbers, that needs to be protected. These organizations will not reap the full benefits of, or get their full return on, data lake investments without a thorough information governance strategy. Here are some of the things to consider:

►  Add context to data (provenance, semantics…) – Where is the data coming from, and what‘s the relationship between various data sets?

►  Optimize data with curation, stewardship, and preparation– Involve the right people to help clean and qualify data.

►  Use a collaborative data governance process- Get IT and the business to work together on ensuring enterprise information can be trusted.

Pitfalls to watch for:

  • Authoritative governance – A top-down approach to data governance never really works well for user engagement. Instead you need a bottom-up approach wherein users can model the data at will, but central IT still certifies, protects and governs the data.
  • Fragmented tooling – Use of fragmented tools leads to an inconsistent governance framework.

3) Remove data silos and unify data management

In order to get a single version of the truth, you need a unified framework for all data management tasks with:

►  Pervasive data quality, data masking- These need to be part of the data platform.

►  Consistent operationalization– To increase data trust and agility.

►  Single platform for all use cases & personas– Increases productivity and collaboration across teams.

Pitfalls to watch for:

  • Fragmented tools – Use of fragmented tools leads to unpredictable and exponential costs.
  • Hand coding – Will prevent your system from being scalable and easy to deploy.
  • Shadow IT – Employees will find workarounds to access data lakes, which creates chaos and puts enterprise information at risk.

4) Deliver data to a wide audience

Your data lake will only gain its full power if you get the data into the hands of more employees.

►  Make data accessible – IT needs to deploy easy to use tools for less tech savvy line of business users, who are making the business decisions using data from the lake.

►  Governed self-service – More general access to corporate information without chaos or risk.

►  Scalable operationalization – Allows you to industrialize projects.

Pitfalls to watch for:

  • Unmanaged autonomy – Use of isolated, unmanaged tools.
  • Self-service tools for the tech savvy – Supplying only a handful of data-savvy users with access to the data lake.

5) Get ready for change

The pace of change just keeps on accelerating and data volumes are growing exponentially. So you need a modern data platform that can deliver data for real-time, more informed decision making that can take your organization through its digital transformation and on to future success.  

Want to learn more? Check out our recent Talend Connect presentation on SlideShare.
Read the blog: CIO: 3 Questions to Ask about your Enterprise Data Lake



[1]“Beware of the Data Lake Fallacy,” July 28, 2014, http://www.biztechafrica.com/article/gartner-beware-data-lake-fallacy/8517/

 

Top 6 Technology Market Predictions for 2017


Big Data Will Transform Every Element of the Healthcare Supply Chain

The entire healthcare supply chain has been undergoing digitization for the last several years. We’ve already witnessed the use of big data to improve not only patient care but also payer-provider systems: reducing wasted overhead, predicting epidemics, curing diseases, improving quality of life and avoiding preventable deaths. Add to this the mass adoption of edge technologies that improve patient care and wellbeing, such as wearables, mobile imaging devices and mobile health apps. The use of data across the entire healthcare supply chain is now about to reach a critical inflection point, where the payoff from these initial big data investments will be bigger and come more quickly than ever before. As we move into 2017, healthcare leaders will find new ways to harness the power of big data to identify and uncover new areas for business process improvement, diagnose patients faster, and drive better, more personalized preventative programs by integrating personally generated data with broader healthcare provider systems.

 

Your Next Dream Job: Chief Data Architect

By 2020, an estimated 20.8 billion connected “things” will be in use, up from just 6.4 billion in 2016. As Jeff Immelt, chairman and CEO of GE, famously said, "If you went to bed last night as an industrial company, you're going to wake up today as a software and analytics company.” Mainstream businesses will face ongoing challenges in adopting big data and analytics practices. As the deluge of IoT continues to flood enterprise data centers, the new ‘coveted’ or critical role on the IT team will no longer be the Data Scientist—instead, it will be the Data Architect. The growing role of the Data Architect is a natural evolution from the Data Scientist or business analyst and underscores the growing need to integrate data from numerous unrelated, structured and unstructured data sources—exactly what the IoT era demands. If a company makes the misstep of tying its IT environment to the wrong big data platform, or establishes a system that lacks agility, it could be paralyzed in this data-driven age. Two other challenges presented by the rise of the data architect are the shortage of qualified talent needed to fill data architect roles and the difficulty of driving the cultural shift necessary to make the entire company more data-driven.

 

The AI Hypecycle and Trough of Disillusionment, 2017

IDC predicts that by 2018, 75 percent of enterprise and ISV development will include cognitive/AI or machine learning functionality in at least one application. While dazzling POCs will continue to capture our imaginations, companies will quickly realize that AI is a lot harder than it appears at first blush and a more measured, long-term approach to AI is needed. AI is only as intelligent as the data behind it, and we are not yet at a point where enough organizations can harvest their data well enough to fulfill their AI dreams.

 

At Least one Major Manufacturing Company will go belly up by not utilizing IoT/big data

The average lifespan of an S&P 500 company has dramatically decreased over the last century, from 67 years in the 1920s to just 15 years today. The average lifespan will continue to decrease as companies ignore or lag behind changing business models ushered in by technological evolutions. It is imperative that organizations find effective ways to harness big data to remain competitive. Those that have not already begun their digital transformations, or have no clear vision for how to do so, have likely already missed the boat—meaning they will soon be a footnote in a long line of once-great S&P 500 players.

 

The Rate of Technology Obsolescence Will Accelerate

In 2017, we are going to see an increasing number of companies shift from simply ‘kicking the tires’ on cloud and big data technologies to actually implementing enterprise deployments and deriving significant ROI from them. At the same time, the rate of technology innovation is at an all-time high, with new solutions replacing existing ones roughly every 12 to 18 months. As companies continue to drive new uses of big data and related technologies, the rate of technology obsolescence will accelerate. This creates a new challenge for businesses: the solutions and tools they are using today may need to be updated or entirely replaced within a matter of months.

 

The increased use of public information for public good

Under the current administration, the White House has set a precedent for government transparency through data.gov, making hundreds of thousands of datasets available online for public use. Data is inherently dumb, or ‘dirty’, on its own, so we must use the algorithmic economy to define action and make sense of the data in order to power the next great discovery. In 2017, organizations will find ways to apply machine learning to this public data lake in order to contribute to the greater good. For example, Uber might use this data to determine where accidents frequently occur on the roads and create new routes for drivers in order to improve passenger safety.

 


Sensors, Environment and Internet of Things (IoT)


 

According to Jacob Morgan, “Anything that can be connected, will be connected.” At one time, the terms cloud and big data were regarded as just ‘hype’, but now we’ve all witnessed the dramatic impact both of these key technologies have had on businesses across every industry around the globe. Now people are beginning to ask if the same is true of IoT—is this all just hype? In my opinion, hype is certainly not the word I would use to describe a trend or technology that has implications of profoundly changing the world as we know it today.

In 2006, I was doing research on the use of RFID, and we introduced a way to organize chaotic office documents with an automatic data organizer and reminder technology. I published a paper on this topic called “Document Tracking and Collaboration Support using RFID”. That was my first interaction with sensors; we focused on M2M (machine-to-machine) communication and, later, on integration with collaborative environments, which pushed us toward the Subnet of Things. The ideas emerging at that time, such as smart homes, smart cars and smart cities, are now being realized. According to IDC, there will be 28 billion sensors in use by 2020, with an economic value of $1.7 trillion. Before we jump into some applied scenarios, let’s outline the scope of sensor communication and how it is organized into three groups.

M2M

Machine-to-machine is different from IoT in scope and domain. M2M deployments are normally restricted in availability and come with pre-defined operational bindings based on data. A suitable example would be manufacturing units and their communication with the different sensors built into them. A more localized example would be a heating sensor in a car that has a very definite purpose and is limited to that car.

SoT                        

SoT (Subnets of Things) can be scoped at the organization or enterprise level. As in the example above, where each file or book carries an RFID tag, an SoT can live inside an organization using collaborative platforms. Another example is a car sending data that measures the quality and utilization of its components in order to deliver better operations and a more stable experience.

IoT

In 2013 the Global Standards Initiative on Internet of Things (IoT-GSI) defined the IoT as “the infrastructure of the information society.” The Internet of Things (IoT) is a system of interrelated computing devices, mechanical and digital machines, objects, animals or people that are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction. IoT has evolved from the convergence of wireless technologies, micro-electromechanical systems (MEMS), micro-services and the Internet.

 

Sensor(s) Data Type and Network Challenges

During my travels to San Francisco, I met SAP’s R&D lead for IoT. It was very interesting to hear how big data is already shaping our daily lives. For example, he mentioned an airplane system where thousands of sensors deliver data and calculations must be carried out to 8 to 16 decimal places in a fraction of a second. Thanks to sensors, we have so much data that our current technology infrastructure is the only gating factor to companies being able to harness the power of real-time, big data insight.

IT decision makers should create infrastructures and choose solutions that can handle data that had to be held back because no network was available. Sometimes there is no telecommunication network, and sometimes the surroundings do not allow a connection due to a signal jammer or similar interference, so IT leaders need to focus on infrastructure that can hold data until a network is available and the transfer can complete.
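
One common pattern for this is store-and-forward: readings are written to a small local store first and flushed once connectivity returns. Below is a minimal Python sketch; the network_available() and send_batch() functions are placeholders for whatever uplink the device actually has.

import json, sqlite3, time

# Placeholder transport functions -- stand-ins for the device's real uplink.
def network_available():
    return False  # replace with a real connectivity check

def send_batch(readings):
    print("uploading", len(readings), "readings")

db = sqlite3.connect("sensor_buffer.db")
db.execute("CREATE TABLE IF NOT EXISTS buffer (ts REAL, payload TEXT)")

def record(reading):
    # Always land the reading locally first, so nothing is lost while offline.
    db.execute("INSERT INTO buffer VALUES (?, ?)", (time.time(), json.dumps(reading)))
    db.commit()
    flush_if_possible()

def flush_if_possible():
    if not network_available():
        return
    rows = db.execute("SELECT rowid, payload FROM buffer ORDER BY ts").fetchall()
    if rows:
        send_batch([json.loads(payload) for _, payload in rows])
        db.execute("DELETE FROM buffer WHERE rowid <= ?", (rows[-1][0],))
        db.commit()

record({"sensor": "temp-01", "value": 21.7})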

Sensor data is just like any other data coming from different sources: it needs cleansing, analysis and governance. At the same time, it has some distinct properties. Normally, sensor data is a stream of values juxtaposed against time. Not all of it is meaningful, but we cannot simply discard information, because all data is destined to have some value, even if that value is not known at the time of collection. So you should not use loose or questionable algorithms for data compression. There are a few ways to handle a situation like this (a small filter sketch follows the list):

·      Send filtered data only if it is required

·      Only transmit abnormal values

·      Compress the data without using unfounded/untested algorithms
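
As a small illustration of the second option, transmitting only abnormal values, here is a report-by-exception filter in Python; the dead-band threshold and the transmit() callback are assumptions rather than part of any particular platform.

# Report-by-exception: only forward a reading when it moves outside a dead band
# around the last transmitted value.
def make_filter(transmit, threshold=0.5):
    last_sent = None
    def push(value):
        nonlocal last_sent
        if last_sent is None or abs(value - last_sent) > threshold:
            transmit(value)
            last_sent = value
    return push

push = make_filter(transmit=lambda v: print("send", v))
for v in [20.0, 20.1, 20.2, 21.0, 21.1, 25.3]:
    push(v)  # only 20.0, 21.0 and 25.3 are transmitted

Tuning the threshold is a trade-off between bandwidth and fidelity, which is why the back-end still needs to tolerate bursts of raw data.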

As my discussion with him went on, he said: “Just look around this place. It is generating so much data. If something happens or if any component fails, the data generated before that failure is very important. It’s a complete ecosystem. If we lose that data, then we will not be able to predict any such event in the future. So our only option is to compress the data using lossless compression to keep cost down; the alternative is one we can’t afford. There are some situations where we have duplication of data or the frequency of data is overwhelming, which results in a lot of overhead. But again, our back-end system should be capable of digesting all incoming information. At the sensor level we can apply some curve-fitting techniques, like the Fast Fourier Transform, so that we keep getting aggregated values.” Sensor data is best stored in low-cost, scale-out systems called data lakes. Raw data normally doesn’t get pulled into the data warehouse, as it lacks proven value, but it still definitely requires governance, such as security.

To perform analysis and get value from data, it should be stored in a data warehouse, and I would definitely recommend reading our Data Vault blog, which explains how to store data effectively.

Sensors Impact on our Daily Lives

Sensors are impacting almost every aspect of our life. From movement sensors to manufacturing units, everything is going to get smarter day by day. Let’s walk through some very basic and brief examples of IoT that are going to impact us in the near future.

Smart Manufacturing

Manufacturing is a $12 trillion-a-year global industry. It is an industry where robots transfer goods from one place to another, and every action of the metal-stamping machines and assembly lines in an auto-parts factory is tracked by sensors. Otherwise known as industrial manufacturing, this sector has already shown remarkable growth rates. Data collected from different processes can help prevent unplanned downtime and/or predict supply needs based on sales forecasts.

Smart Transport

Automobile integration is being touted as the next great frontier in consumer electronics. According to Gartner, driverless vehicles will represent approximately 25 percent of the passenger vehicle population in mature markets by 2030. Driverless cars may have a steering wheel just for legacy design and ‘machine override’ purposes, but ultimately there will be no manual controls in the car at all. All traditional controls, such as brakes, speed and function indicators, and acceleration, will be based on sensors, radar, GPS mapping and a variety of artificial intelligence to enable self-driving, self-parking and the safest route to the destination in order to avoid accidents. This impressive technology will focus not only on controlling the car, but also on communication with other vehicles on the road, based on each car’s road-condition status and its position relative to the driver’s vehicle. Communications that require external networks will enable Internet-based services such as route calculation, vehicle status, e-call and b-call, usage-based insurance, and data backup.

Smart Energy

Electric power grids, which were highly centralized and isolated, started operating in the 1890s. Later, the electricity network was extended and power houses were connected with load-shifting technologies in case of technical shutdown (backup and recovery). Many small power-generating units, like windmills and solar energy parks, now generate electricity at varying capacity depending on conditions. Today, electrical distribution networks accommodate many more sources, as households also generate electricity, putting pressure on ‘the grid’ and making a centralized system difficult to manage.

Based on system conditions, utility leaders will be able to decide which source of energy is cheapest at any given moment and shut down coal- or fuel-based energy sources to avoid unnecessary generation and cost. Similarly, smart meters now bill clients based on usage, but in the future, as they get more connected, those sensors might be capable of deciding which energy source is cheapest.

Smart Homes and Cities

The smart city is a very interesting possible application of the IoT. Smart homes and smart cities are connected to each other. Smart cities provide great infrastructure for communication and are an ideal candidate from which to extract benefits. There are already some projects in Chicago, Rio de Janeiro, and Stockholm where governments, in collaboration with private sector companies, are taking advantage of IoT to collect data from city street assets to determine whether or not they need repair, i.e. street lights. From school bus monitoring to garbage collection, IoT is changing the scope of how society functions.

“Smart Home” is the term commonly used to describe a residence with appliances, lighting, heating, air conditioning, TVs, computers, entertainment audio and video systems, security, and camera systems that are capable of communicating with one another and can be controlled remotely on a time schedule, from any room in the home, or from any location in the world by phone or Internet. The next iteration of today’s smart home will be capable of managing the inventory of your fridge, automatically placing an order for milk or eggs after determining that your supplies of each are low. Or think of devices capable of recognizing an interruption in electricity, water or even network connectivity and informing service providers of the need for repair. Imagine smart trash bins that automatically notify garbage collectors that your trash cans are full so they can be picked up.

I would summarize the whole article by stating that our data-driven world is now taking another shift, towards smart systems. Our current infrastructure has empowered us to overcome major challenges from data generation and ingestion to analysis. In addition to the collective smart scenarios above, personal body sensors like GPS chips and health and activity monitors have raised living standards. I see rapid growth in IoT in the coming months and years.

http://www.wsj.com/articles/how-u-s-manufacturing-is-about-to-get-smarter-1479068425?mod=e2fb

https://blog.fortinet.com/2016/10/27/driverless-cars-a-new-way-of-life-brings-a-new-cybersecurity-challenge

http://www.idc.com/getdoc.jsp?containerId=prUS25658015

http://data-informed.com/how-the-internet-of-things-changes-big-data-analytics/

http://internetofthingsagenda.techtarget.com/definition/Internet-of-Thin…

The post Sensors, Environment and Internet of Things (IoT) appeared first on Talend Real-Time Open Source Data Integration Software.

IT: How to Survive in a Self-Service World


 

This article aims to explain to IT and data professionals how to satisfy users’ expectations about data access management while ensuring data control and data security. It provides advice for how to stimulate the use of governed, self-service data preparation.

8:33 AM, coffee in hand (the third since the alarm clock anyway), executive management floor. Dominique says to Jean-Christophe: “Don’t tell me we need to wait for IT to have access to this data!” Or the more rational scenario may go something like this: “We need clean, complete and trusted data on consumer behavior for the meeting at 12:00 PM: no exceptions!”

Getting complete data sets on consumer behavior may be doable, except that in this scenario IT is you. Who has not felt this mixed feeling, between rising stress and wanting to take up the challenge?

A quick diagnosis of the data landscape shows the following (a small cleanup sketch follows this list):

– The expected data exists but is disseminated throughout the organization in silos: various applications, data repositories, etc.

– The quality of the data is questionable and therefore it needs to be reviewed, cleaned and organized

– The ability to associate this data with other data sets requires a familiarity with the information (often held only by business users)

– Some of the employees using these applications can access the data themselves and will often export it into Excel
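
As a rough illustration of the kind of cleanup such a diagnosis leads to (the file and column names here are hypothetical), a few lines of Python can standardize and reconcile two of those silos:

import pandas as pd

# Hypothetical exports from two silos: a CRM extract and a web-analytics export.
crm = pd.read_csv("crm_customers.csv")
web = pd.read_excel("web_activity.xlsx")

# Standardize the join key and fix obvious quality problems before combining.
crm["email"] = crm["email"].str.strip().str.lower()
web["email"] = web["email"].str.strip().str.lower()
crm = crm.drop_duplicates(subset="email")

# One consumer-behavior view, ready for the noon meeting.
behavior = crm.merge(web, on="email", how="left")
behavior.to_csv("consumer_behavior_clean.csv", index=False)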

Data management: Should concessions be made?

It is widely proven and accepted that a data-driven organization optimizes its performance. For example, per the McKinsey Datamatics survey, data-driven companies are 23x more likely to acquire customers, 6x more likely to retain them, and 19x more likely to be profitable. But how does a company establish a data-driven culture?

Download>>Try Talend Data Preparation for Free

The CDO 2016 barometer estimates that 80 percent of current projects aim to help a company become ‘data-driven’. The use of data is spreading rapidly at all levels of every organization and across all departments. As the volume and variety of data continue to rise, data is used more and more frequently, and the demand for real-time data to support business decisions grows with it.

The aspiration created by the consumerization of data gives more power to business users. Refusing to consider  business users’ demands for more agility, autonomy, and collaboration, is dangerous. Whether they are experts in data or business users, if IT denies them access, they will find a workaround—otherwise known as ‘shadow IT’, which may jeopardize your enterprise information.

Having an overbearing IT department goes against the core mission corporate IT is supposed to fulfill, which is to enable employees to be productive and successful in their roles. The challenge is finding a way to make data available to employees in a safe, governed, secure way. The 2016 CXP Group – Qlik Barometer found that 34 percent of IT and BI managers are no longer involved in data-related projects. The idea of allowing employees self-service data access and management scares IT. The misconception is that IT will lose control over the sanctity and security of enterprise data by implementing self-service data preparation.

Is your organization currently challenged by the fact that, despite centralized Business Intelligence tools and information repositories, data still circulates in the form of Excel files? If this is the case, the question is no longer about how to maintain control of enterprise information, because it has already been lost. The increasing risk of cyber-attacks and data leaks makes this repossession of control even more crucial and urgent. Cybercriminals no longer need to target application systems; they simply target the messaging systems. For example, on average a company has 6,097 files containing the word ‘salary’ that reside outside primary business applications.

BIG IT is Not the Answer but IT has the Answer to Deliver Reliable and Secure Data

Because business demands are too numerous and varied to be satisfied by a single department, centralizing data management in the hands of IT cannot scale. Centralized IT will only result in frustration for both IT and business users.

Because no one is better qualified than an accountant to clean up, enrich and reconcile billing and supplier portfolio data, or a marketing manager to do the same on leads generated by a given event, the competencies and the ability to make sense of data are – by definition – distributed across each line of business.

But because clean vendor billing files or enriched new-lead marketing files create value for the business and deserve to be shared and reused within the organization, it is important to have an IT framework that allows for this type of collaboration, with the appropriate data management and security rules.

Because individual employee data preparations will be even more useful if they can be shared and accessed by  all the company’s data sources and feed all its target applications, IT is justified in orchestrating  these centralized repositories.

Self-service, Governed Data Preparation is the Holy Grail for  IT and Business Users

The traditional border between the data producer and user has faded: we are all producers and users.

IT has a unique opportunity in front of it to keep (or regain) its role as a catalyst for change in the company. Take the initiative to set up and exploit collaborative enterprise repositories and rely on self-service data-processing solutions vs. those that are centrally managed.

The best self-service data preparation platforms enable IT to deliver massive individual productivity gains for all employees, as well as collective insight through collaboration. They allow the integration of all data sources and targets. They ensure the consistency, compliance and security of data management.

Self-service data preparation tools represent an opportunity for all IT professionals to meet immediacy, autonomy and collaboration needs of employees, while maintaining data governance, and escape the pitfalls of maintaining Big IT.

Matt Aslett, Research Director for the Data Platforms and Analytics group at 451 Research, said,  “We anticipate greater interest in 2017 and beyond in products that deliver data governance and data management capabilities as an enabler for self-service analytics.”

11:12 AM, Executive Management floor, Jean-Christophe to Dominique: “The customer data is ready! I have provided a self-service data preparation platform to marketing employees. As a result, the profiling data from our CRM system has been standardized and rehabilitated by Anne-Sophie in Marketing Operations and then enriched with the latest online activities of our customers by Andrew of the Digital department, and finally approved by their manager before feeding our BI application. We’re all set for today’s meeting! “

If you are a Jean-Christophe at Talend, you’re in luck! Learn how your Talend platform already allows you to implement governed, self-service data preparation.

Not Jean-Christophe? No need to change your first name. You can still learn how to adopt Talend technologies by clicking here.

About the Author

Francois Lacas is Product Marketing Manager at Talend. A hands-on strategist, he has been passionate about driving digital transformation and modern marketing to increase their contribution to the business since the early 2010s. Prior to joining Talend, he was Director of Marketing & Digital Transformation at Wallix, a public cybersecurity software vendor, and at ITESOFT, a public ECM/BPM software vendor. As part of his various roles, he regularly writes blog articles and presents at events and tradeshows. Get in touch with him on LinkedIn and on Twitter @francoislacas.

The post IT: How to Survive in a Self-Service World appeared first on Talend Real-Time Open Source Data Integration Software.

Data Matching 101: How Does Data Matching Work?


 

This blog is the first in a series of three looking at Data Matching and how this can be done within the Talend toolset. This first blog will look at the theory behind Data Matching, what is it and how it works. The second blog will look at the use of the Talend toolset for actually doing Data Matching. Finally, the last blog in the series will look at how you can tune the Data Matching algorithms to achieve the best possible Data Matching results.

First, what is Data Matching? Basically it is the ability to identify duplicates in large data sets. These duplicates could be people with multiple entries in one or many databases. It could also be duplicate items, of any description, in stock systems. Data Matching allows you to identify duplicates, or possible duplicates, and then allows you to take actions such as merging the two identical or similar entries into one. It also allows you to identify non-duplicates, which can be equally important to identify, because you want to know that two similar things are definitely not the same.

So, how does Data Matching actually work? What are the mathematical theories behind it? OK, let’s go back to first principles. How do you know that two ‘things’ are actually the same ‘thing’? Or, how do you know if two ‘people’ are the same person? What is it that uniquely identifies something? We do it intuitively ourselves. We recognise features in things or people that are similar, and acknowledge they could be, or are, the same. In theory this can apply to any object, be it a person, an item of clothing such as a pair of shorts, a cup or a ‘widget’.

This problem has actually been around for over 60 years. It was formalised in the 1960s in the seminal work of the statisticians Fellegi and Sunter, and the US Census Bureau was among its first large-scale users. It’s called ‘Record Linkage’, i.e. how are records from different data sets linked together? For duplicate records it is sometimes called De-duplication: the process of identifying duplicates and linking them. So, what properties help identify duplicates?

Well, we need ‘unique identifiers’. These are properties that are unlikely to change over time. We can associate a probability with each property and weight it, reflecting how likely it is that two records sharing that property really refer to the same thing. This can then be applied to both people and things.

The problem, however, is that things can and do change, or they get misidentified. The trick is to identify what can change, i.e. a name, address, or date of birth. Some things are less likely to change than others. For objects, this could be size, shape, color, etc.

NOTE: Record linkage is highly sensitive to the quality of the data being linked. Data should first be ‘standardized’ so it is all of a similar quality.

Now there are two sorts of data linkage:

1. Deterministic Record Linkage – based on a number of identifiers that match exactly.

2. Probabilistic Record Linkage – based on the probability that a number of identifiers match.

The vast majority of Data Matching is Probabilistic Data Matching. Deterministic links are too inflexible.

So, just how do you match? First, you do what is called blocking: you sort the data into similar-sized blocks which share the same attribute. You identify attributes that are unlikely to change, such as surname, date of birth, color, volume or shape. Next you do the matching. First, assign a match type for each attribute (there are lots of different ways to match attributes: names can be matched phonetically; dates can be matched by similarity). Next you calculate the RELATIVE weight for each match attribute; it is similar to a measure of ‘importance’. Then you calculate the probabilities of matching, and of accidentally mismatching, those fields. Finally, you apply an algorithm that adjusts the relative weight for each attribute to get what is called a ‘Total Match Weight’. That is the probabilistic match between two things.
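
To make the weighting step concrete, here is a minimal Python sketch of probabilistic scoring in the Fellegi-Sunter style. The m- and u-probabilities (the chance that an attribute agrees for true matches versus non-matches) are illustrative assumptions; in practice they are estimated from training data or expectation-maximization, and pairs are only compared within their blocks.

from math import log2

# Illustrative m/u probabilities per attribute: P(agree | match), P(agree | non-match).
PROBS = {
    "surname":       (0.95, 0.01),
    "date_of_birth": (0.97, 0.005),
    "postcode":      (0.90, 0.05),
}

def attribute_weight(attr, agrees):
    m, u = PROBS[attr]
    # Agreement adds log2(m/u); disagreement adds log2((1-m)/(1-u)).
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))

def total_match_weight(rec_a, rec_b):
    return sum(attribute_weight(a, rec_a[a] == rec_b[a]) for a in PROBS)

a = {"surname": "SMITH", "date_of_birth": "1980-03-14", "postcode": "G12 8QQ"}
b = {"surname": "SMITH", "date_of_birth": "1980-03-14", "postcode": "G12 8QW"}
print(total_match_weight(a, b))
# Above an upper threshold -> match; below a lower threshold -> non-match;
# anything in between is a possible match that goes to a data steward.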

To summarize:

•       Standardize the Data

•       Pick Attributes that are unlikely to change

•       Block, sort into similar sized blocks

•       Match via probabilities (remember there are lots of different match types)

•       Assign Weights to the matches

•       Add it all up – get a TOTAL weight

The final step is to tune your matching algorithms so that you can obtain better and better matches. This will be covered in the third article in this series.

The next question then is what tools are available in the Talend tool set and how can I use them to do Data Matching? This will be covered in the next article in this series of Data Matching blogs.

About the Author

Stefan Franczuk is a Customer Success Architect at Talend, based in the UK. Dr Franczuk moved into IT over 25 years ago, after starting his working career in engineering and aviation and then switching to academia and IT, and he has a wide range of experience in many different IT disciplines. For the last 15 years he has architected integration solutions for clients all over the world. Dr. Franczuk also has a wide range of experience in Big Data, Data Science and Data Analytics, and holds a PhD in Experimental Photo-Nuclear Physics from The University of Glasgow.

The post Data Matching 101: How Does Data Matching Work? appeared first on Talend Real-Time Open Source Data Integration Software.

Talend Data Masters 2016 – UNOS: How many lives can you save?


 

Did you know that a single organ donor can save up to 8 lives? The United Network for Organ Sharing (UNOS) is a private, non-profit organization that manages the United States’ national organ transplant system under contract with the federal government. On average, 85 people receive transplants every day, but another 22 people die each day while waiting for an organ transplant due to the lack of donors.

Each organ has a short shelf life

Today, there are nearly 120,000 people on the organ waiting list. There were only 9,000 deceased donors last year, so optimizing the use of organs is essential – and a challenge. When a transplant hospital accepts a transplant candidate and an organ procurement organization gets consent from an organ donor, both enter medical data into UNOS’ computerized network. Using the combination of candidate and donor information, the UNOS system generates a “match run,” or a rank-order list of candidates for each organ. There is a tremendous sense of urgency as a doctor has just one hour to decide whether or not to accept an organ for their patient on the list. Organs have a limited time frame in which they must be transplanted.  A heart, for example, must be transplanted within four hours but kidneys can tolerate 24 to 36 hours outside the body.

Looking back to drive future decisions

UNOS wanted to make more information available via a self-service model. A first offering in this model was an Organ Offer Report: transplant centers accessing it can now see what they have transplanted over the last three months. They can also see the outcome of the organs they did not accept, so they can analyze why they turned them down, how those organs were successfully used by other centers, and whether they should consider accepting similar organs in the future. It is critical for surgeons to look back at the decisions they made about accepting or rejecting organs to determine whether their future decisions should change. Data helps make better decisions, and better decisions mean more lives saved.

A five- to six-fold reduction in the time needed to generate the organ offer report

UNOS uses Talend Big Data to extract and organize data used to generate the Organ Offer Report provided to transplant centers. Transplant professionals can now see more information than ever before to help with quality improvement efforts. Using Talend enabled UNOS to automate the process of integrating systems and processing data as well as reduce the time required for this essential task from 18 hours down to three or four hours.

UNOS’ databases currently contain approximately three terabytes of data. UNOS is using Talend to integrate both structured data from Microsoft and Oracle databases with JSON data from the Web. UNOS is using Talend’s ability to generate Spark code to accelerate integration jobs, with Talend data pipelines feeding three separate Hadoop clusters. Talend outputs the results to a source system where Tableau data visualization software can read them. Tableau then serves up the Organ Offer Report, which gives transplant centers a list of recent transplant activity at their hospital.
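
As a generic illustration only (this is not UNOS’ actual pipeline, and the connection details, table and column names are hypothetical), combining a relational extract with JSON data in Spark and publishing a result for a visualization tool looks roughly like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("organ-offer-report").getOrCreate()

# Relational extract via JDBC (requires the matching JDBC driver on the classpath).
offers = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://dbhost;databaseName=transplant")
          .option("dbtable", "dbo.organ_offers")
          .load())

# Semi-structured web data landed as JSON files.
web_events = spark.read.json("hdfs:///landing/web/offer_events/")

report = (offers.join(web_events, "offer_id", "left")
          .groupBy("center_id", "organ_type")
          .count())

# Write somewhere a visualization tool such as Tableau can read.
report.write.mode("overwrite").parquet("hdfs:///reports/organ_offer_report/")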

Almost 31,000 transplants were performed last year in the United States.  For this reason, Talend named UNOS a grand prize winner of the 2016 Talend Data Masters Awards.

View the full story below:

The post Talend Data Masters 2016 – UNOS: How many lives can you save? appeared first on Talend Real-Time Open Source Data Integration Software.


The Role of Statistics in Business Decision Making


 

The use of statistics in business can be traced back hundreds of years. As early as 744 AD, statistics were used by Gerald of Wales to complete the first population census of Wales (1). It wasn’t long before merchants realized that statistics could be used to measure and quantify trade; the first record of this, in Giovanni Villani’s “Nuova Cronica”, dates from Florence in 1346 (1). Statistical methods were later adopted to help drive quality, and in doing so contributed to the advancement of statistics itself. In 1908, William Sealy Gosset, chief brewer for Guinness in Dublin, devised the t-test (2) to measure consistency between batches of stout (1).
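
To give a feel for what Gosset’s test does, here is a small Python example comparing two made-up batches of measurements; a low p-value from the two-sample t-test suggests the batches genuinely differ rather than varying by chance.

from scipy import stats

# Hypothetical quality measurements from two production batches.
batch_a = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.1]
batch_b = [5.4, 5.6, 5.3, 5.5, 5.7, 5.4, 5.6, 5.5]

t_stat, p_value = stats.ttest_ind(batch_a, batch_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests the batch means genuinely differ.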

With the rise of big data, organizations are looking to extract deep insights from their data using advanced analytical techniques.  With big data, new roles like Data Scientists are being developed within organizations. But no matter the title of the role, be it quantitative analyst or data scientist, they all share one thing in common.  Mathematical statistics and probability are at the heart of these disciplines and they are seen as critical to the success of a business.

Currently you would be hard pressed to find a business that does not perform some level of statistical analysis on their data.  Most of these analyses are performed under the general term of Business Intelligence (BI) (3). BI can mean many things but in general, BI is used to run a company’s day-to-day operations and includes software, process, and technology (4).  BI enables organizations to make data driven decisions and effect change.

The term “data driven” is synonymous with companies that leverage their data and analytics to unearth hidden insights that have a real and measurable impact on their business (5).

“FedEx and UPS are well known for using data to compete. UPS’s data led to the realization that, if its drivers took only right turns (limiting left turns), it would see a large improvement in fuel savings and safety, while reducing wasted time. The results were surprising: UPS shaved an astonishing 20.4 million miles off routes in a single year.”(5)

By applying statistical and probabilistic methods to their data, organizations can unlock patterns and insights that otherwise would have gone unnoticed.  These insights, as in the case with UPS, can lead to significant increases in revenue while driving down costs to the business.

Statistics use in business is currently undergoing a paradigm shift in its scope and application. Today, data scientists are leading the charge in the application of statistics and probability to help businesses use their most important organizational asset: their data (6).

[Figure: the work of a statistician compared with that of a data scientist (6)]

Above we see a comparison between the work of a statistician and that of a data scientist. A data scientist deals with data in its raw form, including structured, semi-structured and unstructured data. The outputs of the data scientist are generally data applications or data products. Data-driven applications are driving how companies generate revenue; examples include Facebook, LinkedIn and Google (7). Data-driven applications are creating what is known as the “smart enterprise.” Smart enterprises give not only management but also the rank and file the ability to have analytics at their fingertips. We see this on LinkedIn with its recommendations for connections, and the same on Facebook with friend recommendations. The “data application” is constantly looking for people who may enhance a user’s network.

[Figure: a comparison of traditional BI and data science (6)]

Above is a comparison between traditional BI and data science. The biggest difference is that BI is generally backward-looking (simple descriptive statistics) while data science is forward-looking (inferential statistics). BI will always be a part of the enterprise; traditional EDWs aren’t going away anytime soon. However, these traditional systems are being complemented with emerging technologies (Hadoop, in-memory databases and others) to support big and fast data analytics.
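
The distinction is easy to see in a few lines of Python: descriptive statistics summarize what has already happened, while an inferential step fits a model and projects forward. The monthly figures below are made up for illustration.

import numpy as np

monthly_sales = np.array([120, 132, 129, 141, 150, 158, 163, 171])  # hypothetical

# Backward-looking (BI-style) descriptive statistics.
print("mean:", monthly_sales.mean(), "std:", monthly_sales.std())

# Forward-looking (data-science-style) inference: fit a trend, project next month.
months = np.arange(len(monthly_sales))
slope, intercept = np.polyfit(months, monthly_sales, 1)
print("forecast for next month:", slope * len(monthly_sales) + intercept)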

Throughout history, statistics have been recognized as an indispensable tool of business operations. Starting with the population census of Wales in 744 AD, statistics have been applied to many facets of business. The need for quality and consistency has been the major driver for the adoption of statistics.

Data has been touted as “The New Oil” in the era of Big Data (8). Companies looking to gain a competitive advantage need to embrace statistics and probability in the form of advanced analytics. There is no doubt that data will continue to be the control point of success for business, and mathematical statistics and probability will be a critical underpinning of any winning data strategy.

References

[1] Statslife.org.  Timeline of Statistics.

[2] Kopf, D.  The Guinness Brewer Who Revolutionized Statistics.

[3] Blackman, A.  (March, 2015)  What is Business Intelligence?

[4] Heinze, J.  (November, 2014).  Business Intelligence vs. Business Analytics: What’s the Difference?

[5] Mason, H. Patil, J. (January, 2015) Data Driven: Creating a Data Culture.

[6] Smith, D.  (January, 2013).  Statistics vs. Data Science vs. BI.

[7] Accel Partners. (Summer, 2013).  The Last Mile in Big Data:  How Data Driven Software (DDS) Will Empower the Intelligent Enterprise.

[8] Yonego, J.  (July, 2014).  Data is the new oil of the digital economy. 

The post The Role of Statistics in Business Decision Making appeared first on Talend Real-Time Open Source Data Integration Software.

4 Considerations for Delivering Data Quality on Hadoop


 

Organizations are increasingly trying to become data driven. In my last blog, I outlined steps organizations should take to become data driven using the Kotter Model. In this blog I want to highlight a key aspect of the digital transformation journey, which is Data Governance specifically in the era of Big Data.

As companies adopt Big Data technologies, they will have to understand how the standard Data Governance practices, such as Data Quality and Stewardship, that they have used in building Enterprise Data Warehouses, Master Data Management (MDM) and Business Intelligence (BI) reporting will apply to the new source systems that will be processed through Hadoop.

The types of source systems that were sourced for data warehouses and MDM had limitations in terms of the 3V’s (Velocity, Variety, Volume), because there was no requirement to access more complex data structures for BI reporting. With the rise of the Hadoop ecosystem, a new term, Data Lake, was coined to describe a store that can accommodate all the diverse data Hadoop is able to support. Gartner cautions that “Data Lakes carry substantial risks,” and one of the chief concerns is how to manage Data Governance. Until now, companies have had some sort of governance implemented in their existing data architectures, but the growing volumes of unstructured and streaming data in today’s Data Lakes are forcing companies to revisit their processes for maintaining data governance and stewardship.

Here are some of the key elements to consider for governing data in Hadoop:

Analytics needs to be part of the Data Organization: As the technological landscape around data changes, companies need to rethink the IT organizational structure that supports and monitors data. Companies should consider having a center of excellence (COE) for Data (EDW, BI, Master Data), and analytics teams (Data Scientists and others) need to be part of that COE. This will allow the COE leadership to have some control over Data Governance. Data Scientists will have access to the raw and transformed data that EDWs and MDM use, which they can draw on in their analytics, and Data Quality rules and Stewardship processes can be applied to the data that Data Scientists use wherever applicable.

Prioritize what data needs to be cleansed: Though all data could be important, not all data is equal. It is important to define where the data came from, how the data will be used and how the data will be consumed. Data that is being consumed by your customers or vendors from your business ecosystems will need to be cleansed, matched and survived. Stringent data quality rules might be needed and applied to data that requires strict audit trails and carries regulatory compliance guidelines. On the other hand, we would not get much value in cleansing social media data or data coming from sensors. We should also consider having the data cleansed on the consumption side rather than on the acquisition side. Therefore a single governance architecture might not apply for all types of data.

Balance governance vs. results: We have to keep in mind that the results used for analytics could be affected if the data is ‘cleansed’. By applying data quality rules to the data, you could damage its analytical value. Traditional data quality practices always insist on ‘correcting’ the data, but in analytics that data could be an outlier and could signify a change in the pattern. In these scenarios it makes sense to have data quality methods that signify the quality of the dataset as a whole, as opposed to individual record content. Data quality checks such as ‘the data load is 10x smaller or larger than expected’ or ‘more than half of all the values are empty or null’ are a better fit for data scientists who want to apply machine learning models to the data.
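
As a rough sketch of the dataset-level checks described above (PySpark, with a hypothetical path, expected volume and thresholds), the logic is only a few lines:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataset-level-dq").getOrCreate()
df = spark.read.parquet("hdfs:///lake/raw/sensor_events/")  # hypothetical path

EXPECTED_ROWS = 10000000  # e.g. a rolling average of previous loads
row_count = df.count()
if row_count > 10 * EXPECTED_ROWS or row_count < EXPECTED_ROWS / 10:
    print(f"WARNING: load size {row_count} is more than 10x off the expected volume")

# Flag columns where more than half of the values are null
# (add empty-string checks for string columns as needed).
for c in df.columns:
    null_share = df.filter(F.col(c).isNull()).count() / max(row_count, 1)
    if null_share > 0.5:
        print(f"WARNING: column {c} is {null_share:.0%} null")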

Make use of big data processing engines: Data Quality tasks such as profiling and matching inherently involve processing record by record and/or doing aggregations on the source data and hence can be processed in Hadoop. Therefore, Data Quality can also take advantage of the distributed computing in Hadoop and cloud architectures. Additionally, we can also apply machine learning to Data Quality functions such as data matching.

Data Governance is becoming a key area of focus for CIOs, and extending governance to Big Data analytics will be a challenge for them. Making analytics part of the Data COE team and analyzing how existing governance policies can be applied to Hadoop would be a good start.

The post 4 Considerations for Delivering Data Quality on Hadoop appeared first on Talend Real-Time Open Source Data Integration Software.

Your ‘Resolution List’ for 2017: 5 Best Practices for Unleashing the Power of Your Data Lakes


 

As we all prepare for the New Year, what are the top priorities on your agenda for 2017? Is a data lake part of it? Are you looking for ways to do it right? Then we might be able to help you go through the holiday break with some food for thought.

Data lakes are an important cornerstone for companies needing to manage and exploit new forms of data in order to fuel their digital transformation. Data lakes allow employees to do more with data, faster, in order to drive business results. So far, many IT teams are still trying to figure out how to get return from their initial data lake investments. 

Starting with the ‘Why’ instead of the ‘What’: Why do you need a data lake?

Let’s look at the case of GE. GE’s CEO Jeff Immelt once said: “All industrial companies are in the information business whether they like it or not.” GE was founded in 1892 and since that time the company has been able to remain relevant by evolving their business models and focus throughout the years in order to keep pace with the market.

One GE division, GE Power and Water, was able to use only two percent of its gas turbine data for analytics. The business unit (BU) decided to create a data lake containing all the data related to its gas turbines. It had to integrate data from over 100 apps, numerous different ERP systems, and more. The BU then enabled widespread access to its data lake using a data-as-a-service model—or as Don Perigo, VP of Architecture at GE Power Services, calls it, ‘cafeteria-style’. Using this model, GE was able to give over 200,000 people across more than 80 countries access to the data lake and to perform machine and equipment diagnoses, reliability management, and maintenance optimization.

In the past, companies thought they’d gain full 360-degree visibility into their enterprise information with a data warehouse. However, the advent of big data has put these systems under distress, pushing them to capacity, and driving up the costs of storage.  As a result, some companies have started moving some of their data (often times less utilized data) off to a new set of systems like those run in Hadoop, NoSQL databases or the Cloud.

As a result of this migration, companies also came to realize that they can actually do more with Hadoop, NoSQL and Cloud vs. using enterprise data warehouses. Thus, they started adding new sources of data like sensor, mobile, social and big data to these systems, ultimately transforming their Hadoop, NoSQL and Cloud systems into data lakes.

So what is a data lake?

According to Nick Huedecker at Gartner, “Data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”[1]

The data lake metaphor emerged because ‘lakes’ are a great way to explain one of the basic principles of big data: the need to collect all data and detect exceptions, trends and patterns using analytics and machine learning. One of the basic principles of data science is that the more data you have, the better your data model will ultimately be. With access to all the data, you can model using the entire data set rather than just a sample, which reduces the number of false positives you might get.

And what’s the difference between a data warehouse and a data lake?

Data lakes provide the flexibility to store anything without having to worry about preformatting the data. However, this flexibility has also led to a new set of challenges: because much less structure is imposed up front, the structure of the data has to be figured out when the data is read.
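
In practice, ‘figuring out the data structure when reading’ often means schema-on-read: raw files are landed untouched and a schema is inferred only when the data is queried. Here is a minimal PySpark sketch of that pattern; the path and field names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The files were landed in their original form; no schema was imposed on write.
clicks = spark.read.json("hdfs:///lake/raw/clickstream/2017/01/")  # hypothetical path

clicks.printSchema()  # structure is inferred only now, at read time
clicks.createOrReplaceTempView("clicks")
spark.sql("SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page").show()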

And with the overwhelming amount of data flowing into organizations today, there are concerns over what data can be accessed by employees, and what shouldn’t be shared. Due to the lack of tools, there is also confusion around what data lies where, and limited understanding of where the data came from or what has been done with it thus far.

As a result, until now only a limited number of people have been able to access the information residing in corporate data lakes. These individuals tend to be the ones who know how to work with data science tools and can deal with the volume and complexity of the data. The rest of the organization has simply been drowning in the data lake.

This gap between those who could utilize enterprise data lakes and those who couldn’t has led to gridlock, causing most data lakes to fail to deliver on their true promise: business ROI.

So here are five best practices to successfully unleash the power of your data lakes.

1) Accelerate Data Ingestion

Most organizations end up with a disjointed architecture, with numerous enterprise data silos coming from point solutions plus new data from cloud, big data and IoT applications. So the prerequisites for creating a solid platform for data ingestion should be (a short ingestion sketch follows this list):

►  Wide connectivity – Connect all your data big and small, on premises, hybrid and in the cloud.

►  Batch & streaming ubiquity – Ensure it has the ability to handle both historical and real-time data ingestion, and can process data pipelines as they come in, including for advanced analytics.

►  Scale with volume and variety – It should have the ability to quickly onboard new data sources, such as data from Web clickstreams, Social or Smart devices.
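
As a rough illustration of the batch-and-streaming prerequisite (a generic Spark sketch, not a Talend job; hosts, topics and paths are hypothetical), the same platform should be able to land a batch extract and a live clickstream side by side:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingestion").getOrCreate()

# Batch load of an existing extract into the lake.
orders = spark.read.csv("s3a://lake/landing/orders.csv", header=True)
orders.write.mode("append").parquet("s3a://lake/raw/orders/")

# Streaming ingestion of clickstream events from Kafka into the same lake
# (requires the spark-sql-kafka package on the classpath).
clicks = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load())
query = (clicks.selectExpr("CAST(value AS STRING) AS event")
         .writeStream.format("parquet")
         .option("path", "s3a://lake/raw/clickstream/")
         .option("checkpointLocation", "s3a://lake/_checkpoints/clickstream/")
         .start())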

Pitfalls to watch for:

• Hand coding – This will prevent the system from scaling and delivering on business needs in a timely fashion.

• Fragmented tools – Using too many of these will create even more silos.

2) Understand & govern your data

The lack of data governance is preventing many organizations from fully opening up data lakes for all employees to use, because more often than not, data lakes contain sensitive data like social security numbers, dates of birth and credit card numbers that need to be protected. These organizations will not reap the full benefits of, or get their full return on, their data lake investment without a thorough information governance strategy. So here are some of the things to consider:

►  Add context to data (provenance, semantics…) – Where is the data coming from, and what‘s the relationship between various data sets?

►  Optimize data with curation, stewardship, and preparation– Involve the right people to help clean and qualify data.

►  Use a collaborative data governance process– Get IT and the business to work together on ensuring enterprise information can be trusted.

Pitfalls to watch for:

• Authoritative governance – A top-down approach to data governance never really works well for user engagement. Instead you need a bottom-up approach wherein users can model the data at will, but central IT still certifies, protects and governs the data.

• Fragmented tooling – Use of fragmented tools leads to an inconsistent governance framework.

3) Remove data silos and unify data management

In order to get a single version of the truth, you need a unified framework for all data management tasks with:

►  Pervasive data quality and data masking – These need to be built into the data platform (see the masking sketch after this list).

►  Consistent operationalization – To increase data trust and agility.

►  Single platform for all use cases & personas – Increases productivity and collaboration across teams.
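
As a concrete illustration of the masking point above, here is a minimal Java sketch that hides all but the last four characters of a sensitive value before it reaches downstream users. The field format and masking rule are assumptions for illustration; in practice this belongs in your platform's data masking capabilities rather than in hand-rolled code.

public class MaskingSketch {

    // Mask every alphanumeric character except the last four; separators are kept.
    static String mask(String value) {
        if (value == null || value.length() <= 4) {
            return "****";
        }
        String head = value.substring(0, value.length() - 4);
        String tail = value.substring(value.length() - 4);
        return head.replaceAll("[0-9A-Za-z]", "*") + tail;
    }

    public static void main(String[] args) {
        System.out.println(mask("123-45-6789"));      // prints ***-**-6789
        System.out.println(mask("4111111111111111")); // prints ************1111
    }
}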

Pitfalls to watch for:

Ø Fragmented tools – Use of fragmented tools leads to unpredictable and exponential costs.

Ø Hand coding – Will prevent your system from being scalable and easy to deploy.

Ø Shadow IT – Employees will find workarounds to access data lakes, which creates chaos and puts enterprise information at risk.

4) Deliver data to a wide audience

Your data lake will only gain its full power if you get the data into the hands of more employees.

►  Make data accessible – IT needs to deploy easy-to-use tools for less tech-savvy line-of-business users, who make business decisions using data from the lake.

►  Governed self-service – More general access to corporate information without chaos or risk.

►  Scalable operationalization – Allows you to industrialize projects.

Pitfalls to watch for:

Ø Unmanaged autonomy – Use of isolated, unmanaged tools.

Ø Self-service tools for the tech-savvy – Supplying only a handful of data-savvy users with access to the data lake.

5) Get ready for change

The pace of change keeps accelerating and data volumes are growing exponentially. So you need a modern data platform that delivers data for real-time, better-informed decision making and can carry your organization through its digital transformation and on to future success.

Want to learn more? Check out our recent Talend Connect presentation on SlideShare.
Read the blog: CIO: 3 Questions to Ask about your Enterprise Data Lake




The post Your ‘Resolution List’ for 2017: 5 Best Practices for Unleashing the Power of Your Data Lakes appeared first on Talend Real-Time Open Source Data Integration Software.

Top 6 Technology Market Predictions for 2017


Big Data Will Transform Every Element of the Healthcare Supply Chain

The entire healthcare supply chain has been undergoing digitization for the last several years. We've already witnessed big data being used to improve not only patient care but also payer-provider systems, reduce wasted overhead, predict epidemics, cure diseases, improve quality of life, and avoid preventable deaths. Add to this the mass adoption of edge technologies that improve patient care and wellbeing, such as wearables, mobile imaging devices, and mobile health apps. The use of data across the entire healthcare supply chain is now approaching a critical inflection point where the payoff from these initial big data investments will be bigger and come more quickly than ever before. As we move into 2017, healthcare leaders will find new ways to harness the power of big data to identify and uncover new areas for business process improvement, diagnose patients faster, and drive better, more personalized, preventative programs by integrating personally generated data with broader healthcare provider systems.

Your Next Dream Job: Chief Data Architect

By 2020, an estimated 20.8 billion connected “things” will be in use, up from just 6.4 billion in 2016. As Jeff Immelt, chairman and CEO of GE, famously said, “If you went to bed last night as an industrial company, you’re going to wake up today as a software and analytics company.” Mainstream businesses will face ongoing challenges in adopting big data and analytics practices. As the deluge of IoT continues to flood enterprise data centers, the new ‘coveted’ or critical role on the IT team will no longer be the Data Scientist; instead, it will be the Data Architect. The growing role of the Data Architect is a natural evolution from Data Scientist or business analyst and underscores the growing need to integrate data from numerous unrelated, structured and unstructured data sources, which is exactly what the IoT era demands. If a company makes the misstep of tying its IT environment to the wrong big data platform, or of establishing a system that lacks agility, it could be paralyzed in this data-driven age. Two other challenges presented by the rise of the data architect are the shortage of qualified talent to fill these roles and the need to drive the cultural shift necessary to make the entire company more data driven.

The AI Hype Cycle and Trough of Disillusionment, 2017

IDC predicts that by 2018, 75 percent of enterprise and ISV development will include cognitive/AI or machine learning functionality in at least one application. While dazzling POCs will continue to capture our imaginations, companies will quickly realize that AI is a lot harder than it appears at first blush, and that a more measured, long-term approach to AI is needed. AI is only as intelligent as the data behind it, and we are not yet at a point where enough organizations can harvest their data well enough to fulfill their AI dreams.

At Least One Major Manufacturing Company Will Go Belly Up by Not Utilizing IoT/Big Data

The average lifespan of an S&P 500 company has dramatically decreased over the last century, from 67 years in the 1920s to just 15 years today. The average lifespan will continue to decrease as companies ignore or lag behind changing business models ushered in by technological evolutions. It is imperative that organizations find effective ways to harness big data to remain competitive. Those that have not already begun their digital transformations, or have no clear vision for how to do so, have likely already missed the boat—meaning they will soon be a footnote in a long line of once-great S&P 500 players.

The Rate of Technology Obsolescence Will Accelerate

In 2017, we are going to see an increasing number of companies shift from simply ‘kicking the tires’ on cloud and big data technologies to actually implementing and deriving significant ROI from enterprise deployments. At the same time, the rate of technology innovation is at an all-time high, with new offerings replacing existing solutions nearly every 12-18 months. As companies continue to drive new uses of big data and related technologies, the rate of technology obsolescence will accelerate, creating a new challenge for businesses: the possibility that the solutions and tools they are using today may need to be updated or entirely replaced within a matter of months.

The Increased Use of Public Information for Public Good

Under the current administration, the White House has set a precedent in creating transparency of government through data.gov, making hundreds of thousands of datasets available online for public use. Data is inherently dumb, or ‘dirty’, so we must use the algorithmic economy to define action and make sense of the data being generated in order to power the next great discovery. In 2017, organizations will find ways to apply machine learning to this public data lake in order to contribute to the greater good. For example, Uber might use this data to determine where accidents frequently occur on the roads and create new routes for drivers in order to improve passenger safety.

The post Top 6 Technology Market Predictions for 2017 appeared first on Talend Real-Time Open Source Data Integration Software.

Talend “Job Design Patterns” & Best Practices ~ Part 4


Our journey in Talend Job Design Patterns & Best Practices is reaching an exciting juncture. My humble endeavor at providing useful content has taken on a life of its own. The continued success of the previous blogs in this series (please read Part 1, Part 2, and Part 3 if you haven’t already), plus Technical Boot Camp presentations (thanks to those of you who attended) and delivering this material directly to customers, has led to an internal request for a transformation. Our plan to create several webinars around this series and make them available in the near future is now underway. Please be a little patient, however, as it will take some time and coordinated resources, but my hope is to see the first such webinar available sometime in early 2017. I am certainly looking forward to this and welcome your continued interest and readership.

As promised, however, it is time to dive into an additional set of best practices for job design patterns with Talend. First, let me remind you of a simple and often ignored fact: Talend is a Java code generator, so crafting developer guidelines fortifies and streamlines the Java code being generated through your job design patterns. It seems obvious, and it is, but well-designed jobs that generate clean Java code, by painting your canvas using these concepts, are the best way I know to achieve great results. I call it ‘Success-Driven Projects’.

Success-Driven Talend Projects

Building Talend jobs can be very straightforward; they can also become quite complex. The secret to their successful implementation is to adopt and adapt the good habits and discipline required.

From the ‘Foundational Precepts’ discussed at the start of this series to now, my goal has always been to foster an open discussion on these best practices in order to achieve solid, useful job design patterns with Talend. Most use cases will benefit from atomic job design and parent/child orchestration, and when projects contain significant reusable code, overall success can be accelerated. Of course, choose your own path, but at the very least all I ask is: be consistent!

Database Development Life Cycle – DDLC

But hold on! Maybe it is not just about job design; what about the data? We are processing data, aren’t we? For the most part, data resides in a database. So I ask you: do databases need best practices? A rhetorical question? Data models (schemas) change over time, so database designs must have a life cycle too! It just makes sense.

Databases evolve, and we developers need to accommodate this fact. We have embraced our SDLC process, so it should not be hard to accept that we need a Database Development Life Cycle alongside it. It is actually quite straightforward in my mind. For any environment (DEV/TEST/PROD), a database needs to support:

  • A fresh INSTALL based upon the current version of the schema
  • An UPGRADE that drops/creates/alters DB objects, upgrading one version to the next
  • A data MIGRATION where a disruptive ‘upgrade’ occurs (like splitting tables)

Understanding the database life cycle and its impact on job design becomes very important. Versioning your database model is key. Follow a prescribed design process. Use graphical diagrams to illustrate the designs. Create a ‘Data Dictionary’ or ‘Glossary’ and track lineage for historical changes. I will be writing a separate blog on this topic in more detail soon, so watch for it. Meanwhile, please consider the following process when crafting database models (a small upgrade sketch follows the diagram below). It is a higher discipline, but it works!

[Figure: DDLC database design flow]
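
Since the DDLC discussion above is conceptual, here is a minimal Java sketch of what the UPGRADE step could look like in practice, assuming a schema_version tracking table and numbered upgrade scripts on disk. Every connection detail, table name, and path here is an assumption for illustration, not part of Talend or of any prescribed tooling, and each script is assumed to hold a single statement.

import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DdlcUpgradeSketch {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/appdb", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Which schema version is this environment currently on?
            ResultSet rs = stmt.executeQuery("SELECT COALESCE(MAX(version), 0) FROM schema_version");
            rs.next();
            int current = rs.getInt(1);

            // Collect versioned scripts named upgrade_0001.sql, upgrade_0002.sql, ...
            List<Path> scripts = new ArrayList<>();
            try (DirectoryStream<Path> dir =
                     Files.newDirectoryStream(Paths.get("db/upgrades"), "upgrade_*.sql")) {
                dir.forEach(scripts::add);
            }
            Collections.sort(scripts);

            // Apply only the scripts newer than the recorded version, then record each one.
            for (Path script : scripts) {
                int version = Integer.parseInt(
                    script.getFileName().toString().replaceAll("\\D", ""));
                if (version > current) {
                    stmt.execute(new String(Files.readAllBytes(script)));
                    stmt.executeUpdate("INSERT INTO schema_version (version) VALUES (" + version + ")");
                }
            }
        }
    }
}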

More Job Design Best Practices

Ok.  Here are more job design patterns & best practices for your immediate delight and consumption!  These begin to dive deeper into Talend features that may be common for you or perhaps less frequently used.  My hope is that you will find them helpful.

8 more Best Practices:

tMap Lookups

As many of you already know, the essential tMap component is widely used within Talend jobs. This is due to its powerful transformation capabilities. The most common use for the tMap component is to map a data flow schema from a source input to a target output: simple, right? Sure it is! We know that we can also incorporate multiple source and target schema data flows, offering us complex opportunities for joining and/or splitting these data flows as needed, possibly incorporating transformation expressions that control what and how the incoming data becomes dispersed downstream. Expressions within a tMap component can occur on the source or the target schemas. They can also be applied using variables defined within the tMap component. You can read up on how to accomplish all this in the Talend Components Reference Guide. Just remember: great power comes with great responsibility!

Another very compelling use for the tMap component is incorporating lookups that join with the source data flow.  While there is no physical limit to how many lookups you can apply to a tMap component, or what comprises the lookup data, there are real and practical considerations to make.

Look at this basic example: Two row generators; one is the source, the other the lookup.  At runtime, the lookup data generation occurs first and then source data is processed. 

Because the join setting on the lookup data is set to ‘Load Once’, all its records are loaded into memory and then processed against the source data result set.  This default behavior provides high performance joins and can be quite efficient.

[Screenshot: tMap job with a ‘Load Once’ lookup]

Alternatively, you can imagine that loading up millions of lookup rows, or dozens of columns, may impose considerable memory requirements. It probably will. And what if multiple lookups, each having millions of rows, are required? How much memory will that need? Carefully consider your lookups when many records or hundreds of columns are involved.

Let us examine a trade-off: memory vs. performance. There are three lookup models available:

  • Load Once reads all qualifying records into memory
  • Reload at each Row reads the qualifying row for each source record only
  • Reload at each Row (cache) reads the qualifying row for each source record, caching it

Clearly, lookup data that has been loaded into memory to join with the source can be quite fast. However, when memory constraints prevent loading enormous amounts of lookup data, or when you simply do not want to load ALL the lookup data because the use case does not need it, use the ‘Reload at each Row’ lookup model. Note that there is a trick you need to understand to make this work.

First, inside the tMap component, change the lookup model to ‘Reload at each Row’. Notice that the area below expands to allow input of the ‘Key(s)’ you will need for the lookup. Add the keys, which effectively define global variables available outside the tMap component.

[Screenshot: tMap lookup keys defined for ‘Reload at each Row’]

For the lookup component, use the (datatype)globalMap.get(“key”) function in the ‘WHERE’ clause of your SQL syntax to apply the saved key value defined in the tMap to the lookup dataset. This completes the lookup retrieval for each record processed from the source.

[Screenshot: lookup query using globalMap.get in the WHERE clause]
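
To make the trick concrete, here is a small, self-contained Java sketch of the kind of query expression that goes into the lookup component. The table, columns, and key name ("row1.customer_id") are assumptions, and globalMap is simulated here because the real map only exists inside a running Talend job.

import java.util.HashMap;
import java.util.Map;

public class LookupQuerySketch {

    public static void main(String[] args) {
        // Simulate the globalMap Talend populates at runtime; inside a job the
        // tMap stores the key value for you on each source record.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("row1.customer_id", 1042);

        // The Query field of the lookup input component would hold an expression
        // like this, so each source record triggers a single-row lookup.
        String lookupQuery =
              "SELECT c.customer_id, c.customer_name, c.region "
            + "FROM customers c "
            + "WHERE c.customer_id = " + (Integer) globalMap.get("row1.customer_id");

        System.out.println(lookupQuery);
    }
}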

There you are, efficient lookups, either way!

Global Variables

There are several aspects to the definition and use of what we think of as ‘Global Variables’.  Developers create and use them in Talend Jobs all the time and we refer to them as ‘Context Variables’.  Sometimes these are ‘Built-In’ (local to a job), and sometimes they are found in the ‘Project Repository’ as Context Groups, which allow them to be reused across multiple jobs. 

Either way, these are all ‘Global Variables’ whose values are determined at runtime and are available for use anywhere within the job that defines them. You know you are using one whenever a context.varname is embedded in a component, expression, or trigger. Please remember to place commonly used variables in a ‘Reference Project’ to maximize access across projects.

Talend also provides the tSetGlobalVar and tGlobalVarLoad components, which can define, store, and use ‘Global Variables’ at runtime. The tSetGlobalVar component stores a key-value pair within a job, which is analogous to using a ‘Context Variable’ but provides greater control (like error handling). Look at my example where a single MAX(date) value is retrieved and then used in a subsequent SQL query to filter a second record set retrieval. To access the global variable, use the (datatype)globalMap.get(“key”) function in the SQL ‘WHERE’ clause. Get very familiar with this function, as you will likely use it a lot once you know its power!

The tGlobalVarLoad component provides similar capabilities for Big Data jobs, where the tSetGlobalVar component is not available. Look at my example where an aggregated value is calculated and then used in a subsequent read to qualify which records to return.

We are not quite done on this topic. Hidden in plain sight is a set of ‘System Global Variables’ available within a job, whose values are determined by the components themselves. We talked about two of them in the Error Handling best practice way back in Part 1 of this series: CHILD_RETURN_CODE and ERROR_MESSAGE. These system global variables are typically available for use immediately after the component that sets their value has executed. Depending upon the component, different system variables are available. Here is a partial list (a short sketch of how to read them follows it):

  • ERROR_MESSAGE / DIE_MESSAGE / WARN_MESSAGE
  • CHILD_RETURN_CODE / DIE_CODE / WARN_CODE / CHILD_EXCEPTION_STACK
  • NB_LINE / NB_LINE_OK / NB_LINE_REJECT
  • NB_LINE_UPDATED / NB_LINE_INSERTED / NB_LINE_DELETED
  •  global.projectName / global.jobName (these are system level; their use is obvious)
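
As a quick illustration of reading these, here is a minimal Java sketch of the sort of thing you might place in a tJava right after a component finishes. The component names (and therefore the exact keys) are assumptions, and globalMap is simulated since the real map is only populated inside a running job.

import java.util.HashMap;
import java.util.Map;

public class SystemGlobalVarSketch {

    public static void main(String[] args) {
        // Simulate the runtime map; in a real job you would only write the two
        // globalMap.get(...) lines below, inside a tJava component.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileInputDelimited_1_NB_LINE", 1250); // rows read (assumed component name)
        globalMap.put("tDBOutput_1_NB_LINE_REJECT", 3);       // rows rejected (assumed component name)

        Integer read = (Integer) globalMap.get("tFileInputDelimited_1_NB_LINE");
        Integer rejected = (Integer) globalMap.get("tDBOutput_1_NB_LINE_REJECT");
        System.out.println("Rows read: " + read + ", rows rejected: " + rejected);
    }
}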

Loading Contexts

‘Context Groups’ support highly reusable job designs, yet there are still times when we want even more flexibility. For example, suppose you want to maintain the context variables’ default values externally. Sometimes having them stored in a file, or even a database, makes more sense. The ability to maintain their values externally can prove quite effective and can even address some security concerns. This is where the tContextLoad component comes in.

[Screenshot: job initializing context variables with tContextLoad]

The example above shows a simple way to design your job to initialize context variables at runtime. The external file it loads contains comma-delimited key-value pairs which, as they are read in, override the current values of the corresponding context variables within the job. In this case, the database connection details are loaded to ensure the desired connection. Notice that you have some control over error handling, and in fact this presents another place where a job can programmatically exit immediately: ‘Die on Error’. There are so few of these. Of course, the tContextLoad component can use a database query just as easily, and I know of several customers who do just that.
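
To give a feel for what this amounts to at runtime, here is a minimal Java sketch that mimics the behavior: design-time defaults overridden by a comma-delimited key,value file. The file name, keys, and parsing rules are assumptions for illustration, not the component's actual implementation.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class ContextLoadSketch {

    public static void main(String[] args) throws Exception {
        // Design-time defaults, as defined in the job's Context tab.
        Map<String, String> context = new HashMap<>();
        context.put("db_host", "localhost");
        context.put("db_port", "3306");

        // Override them from an external file of key,value pairs (assumed name).
        try (BufferedReader reader = new BufferedReader(new FileReader("prod_context.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] pair = line.split(",", 2);
                if (pair.length == 2) {
                    context.put(pair[0].trim(), pair[1].trim());
                }
            }
        }

        System.out.println("Connecting to " + context.get("db_host") + ":" + context.get("db_port"));
    }
}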

There is a corresponding tContextDump component available, which will write out the current context variable values to a file or database.  This can be useful when crafting highly adaptable job designs.

Using Dynamic Schemas

Frequently I am asked how to build jobs that cope with dynamic schemas. In reality, this is a loaded question, as there are various use cases where dynamic schemas come up. The most common seems to be when you have many tables whose data you want to move to another corresponding set of tables, perhaps in a different database system (say from Oracle to MS SQL Server). Creating a job to move this data is straightforward, yet almost immediately we conclude that building one job for each table is not practical. What if there are hundreds of tables? Can we not simply build a single job that handles ALL the tables? Unfortunately, this remains a limitation in Talend. However, do not be dismayed: we can do it with TWO jobs, one to dump the data and one to load it. Acceptable?

[Screenshot: DUMP job using the tSetDynamicSchema component]

Here is my sample job. It establishes three connections: the first two retrieve the TABLE and COLUMN lists, and the third retrieves the actual data. By simply iterating over each table and saving its columns, I can read and write data to a positional flat file (the DUMP process) using the tSetDynamicSchema component. A similar job would do the same thing, except that the third connection would read the positional file and write to the target data store (the LOAD process).

In this scenario, developers must understand a little bit about the inner workings of their host database.  Most systems like Oracle, MS SQL Server, and MySQL have system tables, often called an ‘Information Schema’, which contain object metadata about a database, including tables and their columns.  Here is a query that extracts a complete table/column list from my TAC v6.1 MySQL database (do you like my preferred SQL syntax formatting?):

SELECT tbl.table_name
      ,col.column_name
      ,col.data_type
      ,col.is_nullable
      ,col.character_maximum_length
      ,col.numeric_precision
  FROM information_schema.tables  tbl
      ,information_schema.columns col
 WHERE tbl.table_type   = 'BASE TABLE'
   AND tbl.table_schema = 'tac621'
   AND col.table_schema = tbl.table_schema
   AND col.table_name   = tbl.table_name
 ORDER BY tbl.table_name
      ,col.ordinal_position;

Be sure to use connection credentials having ‘SELECT’ permissions to this usually protected database.

Notice my use of the tJavaFlex component to iterate through the table names found. I save each ‘Table Name’ and establish a ‘Control Break’ flag, then iterate over each table found and retrieve its sorted column list. After adjusting for any nulls in column lengths, the saved ‘Dynamic Schema’ is complete. The conditional ‘IF’ checks the ‘Control Break’ flag when the table name changes and begins the dump process for the current table. Voilà!
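
If the ‘Control Break’ idea is new to you, here is a small, self-contained Java sketch of the pattern outside of Talend. The table and column values are made up, and in the actual job this logic is spread across the tJavaFlex sections and the conditional ‘IF’ trigger.

import java.util.Arrays;
import java.util.List;

public class ControlBreakSketch {

    public static void main(String[] args) {
        // Sorted table/column pairs, standing in for the information_schema rows.
        List<String[]> rows = Arrays.asList(
            new String[]{"customers", "customer_id"},
            new String[]{"customers", "customer_name"},
            new String[]{"orders", "order_id"},
            new String[]{"orders", "order_date"});

        String currentTable = null;
        StringBuilder columns = new StringBuilder();
        for (String[] row : rows) {
            boolean controlBreak = currentTable != null && !currentTable.equals(row[0]);
            if (controlBreak) {
                // Table name changed: dump the table we just finished collecting.
                System.out.println("Dump " + currentTable + " with columns: " + columns);
                columns.setLength(0);
            }
            currentTable = row[0];
            columns.append(columns.length() == 0 ? "" : ",").append(row[1]);
        }
        System.out.println("Dump " + currentTable + " with columns: " + columns);
    }
}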

Dynamic SQL Components

Dynamic code is awesome!  Talend provides several ways to implement it. In the previous job design, I took a direct approach to retrieving table and column lists from a database. Talend actually provides host-system-specific components that do the same thing. These t{DB}TableList and t{DB}ColumnList components (where {DB} is replaced by the host component name) provide direct access to the ‘Information Schema’ metadata without your having to know anything about it. Using these components instead for the DUMP/LOAD process previously described could work just as well: but what is the fun in that?

Not all SQL queries need to retrieve or store data. Sometimes other database operations are required. Enlist the t{DB}Row and t{DB}SP components for these requirements. The former allows you to execute almost any SQL statement that does not return a result set, like ‘DROP TABLE’. The latter allows you to execute a ‘Stored Procedure’.
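
For a sense of what such a component does under the covers, here is a rough JDBC sketch of the equivalent of a t{DB}Row configured with a DDL statement. The connection URL, credentials, and table name are all assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DbRowSketch {

    public static void main(String[] args) throws Exception {
        // Run a statement that returns no result set, as a t{DB}Row would.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/tac621", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute("DROP TABLE IF EXISTS staging_orders");
        }
    }
}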

Last but not least is the t{DB}LastInsertId component which retrieves the most recently inserted ‘ID’ from a database output component; very useful on occasion.

CDC

Another common question is: does Talend support CDC (‘Change Data Capture’)? The answer is YES, of course: through ‘Publish/Subscribe’ mechanisms tied directly to the host database system involved. It is important to note that not all database systems support CDC. Here is the definitive ‘current’ list of databases supported for CDC in Talend jobs:

  • Oracle
  • MySQL
  • MS SQL Server
  • PostgreSQL
  • Sybase
  • Informix
  • Ingres
  • DB2
  • Teradata
  • AS/400

There are three CDC modes available, which include:

  • Trigger (default) – uses DB host triggers that track Inserts, Updates, & Deletes
  • Redo/Archive Log – used with Oracle 11g and earlier versions only
  • XStream – used with Oracle 12 and OCI only

Since the ‘Trigger’ mode is the one you are most likely to use, let us peek at its architecture:

[Figure: Trigger-mode CDC process flow]

The Talend User Guide, Chapter 11, provides a comprehensive discussion of the CDC process, its configuration, and its usage within the Studio and in coordination with your host database system. While quite straightforward conceptually, there is considerable setup required. Fully understand your requirements, CDC modes, and job design parameters up front, and document them well in your Developer Guidelines!

Once established, the CDC environment provides a robust mechanism for keeping downstream targets (usually a data warehouse) up to date. Use the t{DB}CDC components within your Talend jobs to extract data that has changed since the last extraction. While CDC takes time and diligence to configure and operationalize, it is a very useful feature!

Custom Components

While Talend now provides well over 1,000 components on the palette, there are still many reasons to build your own. Talend developers often encapsulate specialized functionality within a custom component. Some have built and productized their components, while others publish them with free access on the recently modernized Talend Exchange. When a component is not available on the Talend palette, search there instead; you may find exactly what you need. A ‘Talend Forge’ account is required, but you have probably already created one.

To start, ensure that the directory where custom components will be stored is set properly. Do this from the ‘Preferences’ menu and choose a common location all developers would use. Click ‘Apply’, then ‘OK’.

Find the ‘Exchange’ link on the menu bar, which allows the selection and installation of components. The first time you do this, check ‘Always run in Background’ and click the ‘Run in Background’ button, as it takes time to load the extensive list of available objects. From this list, you can ‘View/Download’ objects of interest. After completing a component download, click on ‘Downloaded Extensions’ to actually install them in your Studio. Once completed, the component will show as ‘Installed’ and will be available from the palette.

[Screenshot: Talend Exchange component download and installation]

A component and its associated files, once installed, can be hard to find. Look in two places:

{talend}/studio/plugins/org.talend.designer.components.exchange{v}
{talend}/studio/plugins/org.talend.designer.components.localprovider{v}

If you want to create a custom component yourself, switch to the ‘Component Designer’ perspective within the Studio. Most custom components utilize ‘JavaJet’, which is the file extension for encapsulating Java code for the ‘Eclipse IDE’. A decent tutorial on ‘How to create a custom component’ is available for beginners. While a bit dated (circa 2013), it presents the basics of what you need to know. There are third-party tutorials out there as well (some are listed in that tutorial). Here is a good one: Talend by Example: Custom Components. Also try Googling to find even more information on creating ‘Custom Components’.

JobScript API

Normally we use the ‘Designer’ to paint our Talend job, which then generates the underlying Java code. Have you ever wondered whether a Talend job can be generated automatically? Well, there is a way! Open up any of your jobs. There are three tabs at the bottom of the canvas: DESIGNER, CODE, & JOBSCRIPT. Hmm, that is interesting. You have probably clicked on the CODE tab to inspect the generated Java code. Have you clicked on the JOBSCRIPT tab? If you did, were you aware of what you were looking at? I bet not, for most of you. This tab shows the script that represents the job design. Take a closer look next time. Do you see anything familiar as it pertains to your job design? Sure you do…

So what, you say! Well, let me tell you what! Suppose you create and maintain metadata about your job designs somewhere and run it through a process engine (that you create), generating a properly formatted JobScript, perhaps adjusting key elements to create multiple permutations of the job. Now that is interesting!

[Screenshot: JobScript tab contents]
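
As a rough sketch of that ‘process engine’ idea, the Java snippet below takes a JobScript previously exported from the Studio, treats a couple of values in it as placeholders, and writes out one permutation per table. The file names, placeholder tokens, and table list are all assumptions; the JobScript content itself is whatever your Studio generated.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

public class JobScriptTemplater {

    public static void main(String[] args) throws Exception {
        // A JobScript exported from an existing job, with tokens such as __TABLE__
        // and __JOB_NAME__ edited in by hand beforehand (an assumption of this sketch).
        String template = new String(Files.readAllBytes(Paths.get("template.jobscript")));

        Map<String, String> tables = Map.of(
            "customers", "dump_customers",
            "orders", "dump_orders");

        for (Map.Entry<String, String> e : tables.entrySet()) {
            String script = template.replace("__TABLE__", e.getKey())
                                    .replace("__JOB_NAME__", e.getValue());
            Files.write(Paths.get(e.getValue() + ".jobscript"), script.getBytes());
        }
    }
}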

Look in the ‘Project Repository’ under the CODE>ROUTINES section to find the ‘Job Scripts’ folder. Create a new JobScript (I called mine ‘test_JobScript’). Open any of your jobs, copy the JOBSCRIPT tab contents, paste it into the JobScript file, and save. Right-click on the JobScript and choose ‘Generate Job’. Now look in the ‘Job Designs’ folder and you will find a newly minted job. Imagine what you can do now! Nice!

Conclusion

Whew! That about does it. Not that there are no more best practices involved in creating and maintaining Talend job designs; I am sure there are many more. Instead, let me leave that to a broader community dialog for now and surmise that this collection (32 in all) offers comprehensive breadth and depth for success-driven projects using Talend.

Look for my next blog in this series, where we’ll shift gears and discuss how to apply all these best practices to a conventional use case. Applied technology, solid methodologies, and solutions that achieve results: the backbone of where to use these best practices and job design patterns. Cheers!

The post Talend “Job Design Patterns” & Best Practices ~ Part 4 appeared first on Talend Real-Time Open Source Data Integration Software.
