Step-by-Step: Constructing a Job in Talend Open Studio

Given the rapid pace of change in SaaS applications and cloud platforms, the cloud integration space is in constant flux. Years ago, cloud integration was regarded as a utility that served a simple use case, such as replicating SaaS data to an on-premises database for analysis. However, innovations taking place in the cloud in analytics, big data, and application development are changing the very nature of cloud integration. Here are three important cloud shifts in 2016 that will affect enterprises' cloud integration strategies.

Central IT takes over cloud analytics initiatives

The cloud analytics race began in October 2014, when salesforce.com announced Wave Analytics with the tagline “Analytics for the Rest of Us”. That same month, Birst announced its partnership with SAP HANA and touted an architecture that would give end users instant analytics. Barely a year later, in October 2015, Amazon Web Services announced its cloud BI service QuickSight, aimed particularly at data analysts.

Several other interesting events took place in the cloud analytics space in 2015:

In January 2015, it was reported that a basic Salesforce Wave license would cost about US$40,000, on top of other site licenses or per-user costs.
In April 2015, Domo announced its latest funding round along with another interesting detail: customers would have to contact Domo directly if they wanted a data integration set up, so that the company could maintain secrecy around its cloud analytics stack.
A few days later, the former bitter rivals Tableau and Birst formed a partnership that allowed Birst users to connect directly to Tableau. Birst contributed a central enterprise data store from which new data marts could be created for business users, while Tableau contributed its data discovery and visualization capabilities.
Tata Consultancy Services (TCS) and Tableau announced a partnership under which TCS would focus on developing “delivery capabilities at scale” for Tableau's data visualization features.

These events show that although cloud analytics originally addressed mainly the self-service needs of business users within lines of business, many of these deployments are more complex than first thought. The need to contact vendors directly to set up data integrations, enterprise-IT-level pricing, the realization that data visualization is only one layer of the cloud analytics stack, and the involvement of global systems integrators all indicate that central IT, rather than the individual lines of business, will be a major driver of cloud analytics projects in 2016.

Your takeaway for cloud in 2016

Organizations looking for a cloud integration solution to integrate the many different data sources required for an enterprise-class initiative should choose a cloud integration platform that offers a unified data foundation across all data sources, transformations, and integration patterns.

Big data processing moves to the cloud

As big data grows, Spark is replacing MapReduce as the data processing standard. The industry is moving toward real-time streaming data and experimenting increasingly with machine learning. To implement big data use cases successfully at scale in production, the entire associated infrastructure for provisioning, deployment, logging, and monitoring, along with ingestion technologies for data streaming and workflow management tools, must reside on a single cloud platform. Leading cloud companies such as Amazon Web Services have perfected the art of minimizing latency and optimizing clusters for processing large data sets. The recent announcement in January 2016 by Chinese e-commerce giant Alibaba that it would introduce 20 new big data online services as part of its AliCloud offering only underscores the importance of a cloud-centric big data strategy. This trend will accelerate further in 2016.

Your takeaway for cloud in 2016

Using cloud integration as the hub for big data processing within cloud offerings is a strategy enterprise architects should consider. Because the future of big data is real-time, organizations need to recognize that there are several real-time streaming technologies, such as Kafka, Storm, and Spark Streaming, to name just a few. Each streaming technology suits different use cases. A cloud integration platform must be able to connect to a variety of such streaming technologies.

Hybrid integration is redefined yet again

The term “hybrid integration” is one of the most confusing in IT. In recent years, middleware and integration vendors have used it to mean many different things. While it is clear that a hybrid integration scenario involves both cloud and on-premises data sources, the approach taken to solving the challenges of integrating these applications will determine how successful enterprises are at deriving value from their investments.

Your takeaway for cloud in 2016

The ever-changing definition of hybrid integration will continue to evolve because of the increasingly diverse data in the Internet of Things, new open source big data technologies, and the emerging field of microservices. Ultimately, what matters is that enterprises recognize it is in their own interest to move all their apps, databases, and infrastructure to the cloud. A hybrid integration strategy that supports this end goal will lead to a faster pace of innovation.

What do you think the cloud holds for you this year? Let me know on Twitter at @Ash__V, or leave your comments below.

Related Resources

Is Cloud Integration Best for Your Company?

Products Mentioned

Jump-Start Your Big Data Projects Now with Talend

“Hey, ‘Big Data’ is just a big fuzzy word for me,” said a vice president of an innovation center at a big logistics company back in early 2015. Just one year later, he not only has to admit that big data is the next big revolution, but has already applied big data technology to dramatic effect, significantly growing the business and reducing costs. In fact, he has been so successful that he has been given the funding for a new machine learning department!

UPS’ 1 billion investment in big data signals to logistics companies around the globe that they must get very serious about becoming data-driven or risk being wiped out. The gap being opened up by those who were quick to embrace big data is growing rapidly.

Logistics companies have initiated many prototype projects to exploit big data analysis, and a few amazing projects will be part of our daily lives very soon. These include real-time analysis using Spark on Hadoop to assess large volumes of data stored in register logs, Excel files, databases, or HDFS, which has changed the business dynamics completely. Here are several big data projects related to the logistics sector:

Volume Analysis

Predicting parcel volume for a particular day of the week, month, or year has always been a major concern for logistics companies looking to optimize resource allocation and budget. Many logistics companies are investing in this area to identify the patterns that help predict peak volumes. This is an ideal use case for data scientists to run batch analysis and generate recommendations.
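
As a rough illustration of the kind of batch analysis described here, the sketch below averages historical parcel counts by weekday to get a naive baseline for expected volume. The function name `weekday_baseline` and the sample figures are hypothetical; a real forecast would also account for seasonality, holidays, and trends.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_baseline(history):
    """Return the mean parcel count per weekday from (date, count) pairs."""
    buckets = defaultdict(list)
    for day, count in history:
        # date.weekday() returns 0 for Monday through 6 for Sunday
        buckets[WEEKDAYS[day.weekday()]].append(count)
    return {name: mean(counts) for name, counts in buckets.items()}


# Hypothetical daily parcel counts, for illustration only
history = [
    (date(2016, 1, 4), 1200),   # Monday
    (date(2016, 1, 5), 900),    # Tuesday
    (date(2016, 1, 11), 1400),  # Monday
    (date(2016, 1, 12), 1000),  # Tuesday
]
baseline = weekday_baseline(history)
print(baseline)
```

A baseline like this is mainly useful as a reference point against which more sophisticated models can be compared.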

Parcel Health Data

It is crucial for medicines in particular, and for some other commodities in general, to be transported in a controlled environment. For example, some medicines must be stored between 2 and 8 degrees Celsius. Some equipment is fragile and requires extra care in handling. Managing the entire process is very expensive for logistics companies and for end customers, so companies are investing in finding alternate routes that ensure the safest delivery within the required parameters.

Researchers and data scientists are deploying IoT sensors in order to monitor temperature, shock level and other factors on parcels. The analysis of this data in offline mode is used to define the safest and most economical passage for “fragile” commodities.
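
To make the offline analysis of sensor data more concrete, here is a minimal sketch that flags parcels whose readings fall outside allowed limits. The 2 to 8 degrees Celsius range comes from the example above; the `Reading` structure, the shock threshold, and the sample values are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    parcel_id: str
    temperature_c: float
    shock_g: float


TEMP_RANGE_C = (2.0, 8.0)  # storage range quoted above for some medicines
MAX_SHOCK_G = 5.0          # hypothetical shock limit, not from the article


def find_violations(readings):
    """Return (parcel_id, reasons) for every reading outside the limits."""
    violations = []
    for r in readings:
        reasons = []
        if not TEMP_RANGE_C[0] <= r.temperature_c <= TEMP_RANGE_C[1]:
            reasons.append("temperature")
        if r.shock_g > MAX_SHOCK_G:
            reasons.append("shock")
        if reasons:
            violations.append((r.parcel_id, reasons))
    return violations


# Hypothetical sensor readings
readings = [
    Reading("P-1", 4.5, 1.0),   # within limits
    Reading("P-2", 9.2, 0.5),   # too warm
    Reading("P-3", 3.0, 7.5),   # excessive shock
]
violations = find_violations(readings)
print(violations)
```

Aggregating such violations per route would be one way to compare how safe different passages are for fragile goods.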

Routing Economy

Should our company organize our own plane to transport parcels, or should we leverage the existing infrastructure? Which provider has better facilities, routing paths, costs, and transparency? Data scientists are working on prototypes to answer these types of questions. They are using big data analysis to study massive amounts of parcel data to predict which routes are most reliable, cost-effective, and viable for future growth. The outcome for management is accurate data on which to base decisions such as which airlines to choose or which warehouse services best match their needs.

Transparency for Management

Risk analysis of the current situation, along with possible resolutions, is high on the wish list for all product owners. Whether the cause is political unrest, a major union strike in a particular region, or just a vehicle breakdown, management always wants a clear view of the problem and the potential remedies.

One of the strategic projects currently in the prototyping stage at some logistics firms is the enablement of management to proactively look for problems and system-generated suggestions on alternate route or strategy options. The challenge is dealing with the enormous volumes and variety of unstructured data coming from everything from social media to IoT sensors on transportation vehicles to both assess risk and define resolutions.

Apache Spark on Hadoop is a strong choice for this type of real-time data processing, which enables management to anticipate and handle crises before they even happen. Spark's ability to analyze data from different streams makes it well suited to predicting events and identifying alternatives.

Transparency for End Client

Customers are very conscious of parcel delivery time. Predicting exactly when parcels will arrive is a nightmare for logistics companies. Timing depends on a range of variables: the number of packages, traffic conditions, the reliability of the vehicle, and even the driver's health. Millions of packages are delivered daily. Only by analyzing large volumes of data, such as vehicle health, driver efficiency, and current traffic conditions, can companies provide an approximate time window for when a package will be delivered.

These pilot projects, currently in the prototype stage, require real-time data analysis. With real-time data analysis tools, logistics companies will be able to provide highly accurate information about package status, with complete transparency for their clients.

Campaigns and Web Analytics

One current use of big data in many logistics companies is the analysis of web traffic. A leading logistics company in Germany, for example, is using big data analysis to provide more personalized services for web visitors. Based on the individual interests of the visitor, different campaigns and services can be offered automatically.

HelpLine

Another interesting project aims to streamline and personalize the customer experience on customer service calls. Typically, packages are shipped with receiver and sender phone numbers. This information can be tied to customer service so that callers from these numbers are automatically notified about the status of their shipments and the expected arrival time. Expected times are calculated from real-time analysis of data from different sources about that location (traffic, volume, etc.). This project is designed to help reduce the overall load on the help center.

The amount of data we have today is far beyond the processing power of conventional systems. Big data allows us to do things that were not possible before. The quantitative shift is now leading to a qualitative shift. With Spark and real-time NoSQL databases such as Cassandra and MongoDB, everything is changing rapidly. Companies adopting big data are moving fast and filling gaps that could not be addressed before, for example by taking proactive measures before an ‘event’ even occurs. In short, “predictive analytics with big data” will be the new norm.

Self-service data preparation, which we define as the ability for business users and analysts to prepare data themselves as needed before analysis, is often called the next big thing. Gartner even predicts that “by 2017, most business users and analysts in organizations will have access to self-service tools to prepare data for analysis.” To my ears, that sounds like a wake-up call for IT professionals.

What is new about self-service data preparation is that it no longer concerns only the CIO. Rather, it is a giant step for data-driven companies as a whole. Done well, it will let every employee put data to productive use in their own working context, whether in IT, accounting, marketing, or elsewhere. McKinsey even found recently that digital and mobile self-service tools can increase employee effectiveness. We call this the “last mile of business intelligence and data integration”. With self-service, a company can not only greatly extend the reach of its data but also optimize its use across a wide range of operational processes.

Talend Data Preparation

Talend Data Preparation is our solution for businesses and IT teams that want to be at the forefront of this important trend. With Talend Data Preparation, any decision-maker in your company can prepare data in no time, leaving more time for analysis and action.

Look at modern companies and you will find data practically everywhere; yet it is relatively useless until it is viewed in the context in which it will be used. Business analysts spend far too much time preparing data for analysis: data sources have to be found and connected to other data points, and their contents have to be captured, cleansed, and standardized for analysis. Eighty percent of the time spent on data analysis goes to repetitive tasks, and only 20 percent to developing and sharing insights from the data. With Talend Data Preparation, data analysts can increase their effectiveness and productivity and generate clean, valuable data sets in minutes rather than hours.
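
To give a sense of the repetitive cleansing and standardization work described here, the following sketch trims whitespace, normalizes capitalization, and removes duplicate records. It is a generic hand-written illustration and does not use or represent the Talend Data Preparation API; the `prepare` function and the sample records are hypothetical.

```python
def prepare(rows):
    """Cleanse and standardize contact records, dropping blanks and duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse repeated whitespace and standardize capitalization
        name = " ".join(row["name"].split()).title()
        email = row["email"].strip().lower()
        # Skip records without an email or with one already seen
        if not email or email in seen:
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical raw records as they might arrive from a spreadsheet export
raw = [
    {"name": "  ada   lovelace ", "email": "ADA@example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "Grace Hopper", "email": ""},
]
result = prepare(raw)
print(result)
```

Steps like these are exactly the kind of routine work that tends to be repeated by hand in spreadsheets.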

However, Talend Data Preparation is aimed not only at business analysts but at a wider audience. After all, digital transformation is not about making just a few people happy. Practically any operational task, in companies across all kinds of industries, can benefit from data support. During the rollout of this product, we talked with business users from every department at Talend, from marketing campaign managers to accountants, from digital marketing managers to sales specialists. We found that all of them spend hour after hour performing the same data preparation steps over and over, and that they use Excel rather than Tableau or a similar BI tool. Data Preparation can considerably reduce the time such tasks require, so marketers and other business users can finally do what they do best instead of struggling with Excel.

This is what sets Talend Data Preparation apart from most comparable competing products, which are generally designed for a single target group, such as business analysts or data scientists. Restricting a tool to one target group sooner or later creates a new data silo, which makes data access problems worse rather than solving them. By contrast, Talend Data Preparation not only caters to the needs of different roles and functions but also fosters collaboration between them. More importantly, Talend Data Preparation enables close coordination between IT and the lines of business so that they can work with data together. Self-service here does not mean “do it yourself”: data democratization should not lead to anarchy. It needs control, rules, and oversight; otherwise failure is inevitable.

Gartner even predicts that many companies' current approach to self-service is heading for a dead end: “In 2016, fewer than 10 percent of self-service BI initiatives will be governed well enough to completely avoid inconsistencies that adversely affect the business.”

Clean data, just a mouse click away

But don't worry: you are already just a click away from benefiting from the promise of self-service data governance. Data Preparation Free Desktop is available now. It is the first open source tool on the market, and you can download it free of charge. As the name suggests, the program runs on the desktop. Installation is quick, and you can get started after just a few clicks. We have recorded getting-started videos and written a training guide, so you can master your data in just a few minutes.

In addition to the features of the free desktop version, the commercial edition (to be released in Q2 2016) offers enterprise-grade capabilities: a role-based multi-user access model, collaboration and data governance, a shared inventory of published certified data and other enterprise data, support for hundreds of data sources and targets, and data processing on high-performance servers. And because these capabilities are delivered as core components of Talend Data Fabric, all Talend integration scenarios, from data integration to big data and from real-time integration to MDM, gain self-service functionality.

Take my word for it: Talend Data Preparation is quick to install, makes your daily work easier, helps you reach new milestones on your data journey quickly, and is fun to use as well! We would love to have you on board and to hear your feedback. (It matters so much to us that we even built a nice little feedback form into the product for exactly this purpose.)

Let us know what you think of Talend Data Preparation in the comments below, or send me a tweet at @jmichel_franco!

Related Resources

Products Mentioned

With Talend, Speed Up Your Big Data Integration Projects

This article was originally published on the O'Reilly Media Ideas Blog

When organizations seek the benefits of a data-driven culture, they require more efficient approaches to uncovering answers and insights. Self-service analytics can help address that need for speedy understanding. Self-service analytics provides data access to more people within a company—along with the autonomy to explore connections between disparate data sources. But that doesn’t mean self-service is a completely do-it-yourself data experience. Instead, it requires new ways of collaborating with IT, to help ensure accuracy and security don't suffer in the pursuit of efficiency.

“Self-service done right is about digitalizing the workplace. Data needs to be fully accessible to a wide audience and customizable to any context, but at the same time it has to meet quality standards and be protected against fraudulent use,” says Jean-Michel Franco, director of product marketing for Talend. “Self-service enables a win-win situation wherein workers can do their jobs more efficiently while IT can still have governance over the data.”

Today's business landscape changes rapidly and companies need to stay nimble. Self-service analytics can help in that regard, as Yahoo discovered. As we explore in detail in our new report Self-Service Analytics, some of Yahoo’s more complex reports used to take several months to complete because those requests were sent to the company’s central reporting team. But then the company switched to a self-service model that focused on ease-of-use and found that employees could generate their own customized reports in seconds.

That’s just one of the cases covered in the report, which also reveals how several other organizations have maximized the benefits of a self-service approach. Their experiences highlight the importance of metadata, ideas for creating a culture of data literacy, and ways to ensure employees have the appropriate analytics tools available.

When workers have more autonomy to explore data and delve deeper into the realm of “what if,” organizations reap vital insights. But more freedom to create customized analysis doesn’t mean a free-for-all. That’s why a thorough assessment of data governance should precede any foray into self-service analytics. Most organizations find they have plenty of room for improvement in data governance. It’s better to learn that before employees dive into oceans of data, because self-service environments can create significant risks in terms of report accuracy, as Kurt Schlegel, research vice president at Gartner, advises. One approach he recommends is to build a cultural understanding that ad hoc reports and analyses are not meant to be used as systems-of-record to run the business. This means enabling users to create their own reports and share them in public folders, but only if they share a fundamental understanding about the data quality. Until the reports go through a rigorous validation process (usually carried out by the central BI team), they should treat the results as preliminary.

Download the free O’Reilly report Self-Service Analytics: Making the Most of Data Access for an overview and case studies on instituting self-service analytics in several different types of organizations.

About the Author - Sandra Swanson

Sandra Swanson jumped out of a plane once, for the sake of journalism (it was a skydiving story). Since then, she's found plenty of invigorating work that doesn't require plummeting toward the ground at 120 mph. As a Chicago-based writer, she's covered technology, science, and business for dozens of publications. She’s also keen on topics that intersect in unexpected ways. One example: "Toys That Inspired Scientific Breakthroughs,” a story she wrote for ScientificAmerican.com. Connect with her on Twitter (@saswanson) or at www.saswanson.com...

A Talend Community Coders post brought to you by: Sergey Beryozkin

JAX-RS 2.1 specification work has finally started after a rather quiet year, and this is good news for JAX-RS users at large and CXF JAX-RS users in particular.
JAX-RS 2.1 is entirely Java 8 based, and a number of new enhancements are on the way. I was concerned earlier on that requiring Java 8 would slow down adoption, but I now think the spec leads were right: Java 8 is so rich, and JAX-RS needs to be open to accepting the latest Java features. Ultimately, this is what will excite users.

The main new features are support for Server-Sent Events (something CXF users will enjoy experimenting with, while also keeping in mind that CXF has some great WebSocket support done by Aki), enhanced NIO support, and the introduction of a reactive mode into the Client API.
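
For readers unfamiliar with Server-Sent Events, the sketch below shows the plain-text wire format that an SSE endpoint streams to the client. It is a language-neutral illustration written in Python, not the JAX-RS 2.1 or CXF API, and the helper name `format_sse` is hypothetical.

```python
def format_sse(data, event=None, event_id=None):
    """Serialize one Server-Sent Events message in the text/event-stream format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    # Multi-line payloads are sent as one "data:" field per line
    for chunk in data.split("\n"):
        lines.append(f"data: {chunk}")
    # A blank line terminates the message
    return "\n".join(lines) + "\n\n"


message = format_sse("a\nb", event="tick", event_id="7")
print(message, end="")
```

A server keeps the HTTP response open and writes one such message per event, which is what makes SSE a simple one-way alternative to WebSocket.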

I've mentioned before that the JAX-RS 2.0 AsyncResponse API is IMHO very impressive, as it makes the fairly complex task of dealing with suspended invocations rather trivial. Marek and Santiago are doing it again with the new 2.1 proposals. Of course there will be some minor disagreements here and there, but overall I'm very positive about this new JAX-RS project.

We now have a CXF Java 8 master branch to support the future JAX-RS 2.1 features, and having a Java 8 trunk is great for the whole CXF community.

What is really good is that there appears to be no obvious end to the new requirements coming into the JAX-RS space. The HTTP services space is wide open, with new ideas being generated around security, faster processing, and so on, and all of it will eventually be available as future JAX-RS features. I'm confident JAX-RS 3.0 will be coming in due time too.

More about Sergey Beryozkin: a committer of Apache CXF and Apache Aries.

Sergey has focused on working with web services and XML technologies for over twenty years. He is currently the leader of the Apache CXF JAX-RS and OAuth2 implementation projects. As a software architect in the Application Integration Division of Talend, he focuses on Talend Service Factory and works with Talend colleagues on creating better support for REST.

Related Resources

Products Mentioned

Download>> Try Talend Data Preparation for Free

Self-service data preparation, which lets business users and analysts prepare data before analyzing it, is often seen as the next big trend. According to Gartner's predictions, “by 2017, most business users and analysts will have access to dedicated self-service tools”. I think this prediction should sound like a wake-up call for IT professionals.

The most distinctive feature of the self-service data preparation trend is that it extends beyond the IT department. It represents a great leap forward for data-driven companies in general. Approached correctly, it will allow IT teams, finance, marketing, and other departments to use their data in their own professional context. A recent McKinsey study, for instance, shows that using digital and mobile self-service technologies increases employee productivity. That is why this trend is regarded as “the last mile of business intelligence and data integration”. The self-service model lets organizations considerably extend the reach of their data and also ensure that it is usable for any operational task.

The new Talend Data Preparation

Talend Data Preparation is our offering to help business and IT teams lead this major trend. With this solution, decision-makers in your organization can speed up data preparation and so spend more time analyzing and using the data.

Although data is everywhere in organizations around the world, it remains relatively unusable until it has been put into context. Business analysts spend far too much time preparing data for analysis: they have to locate the data source, discover its contents, cleanse the data, and standardize it so that it aggregates correctly and can be connected to other data points. Repetitive tasks thus take up 80% of the time devoted to data analysis, leaving only 20% for extracting and sharing insights. Talend Data Preparation increases analysts' efficiency and productivity by letting them create clean, usable data in minutes instead of hours.

But our solution is aimed at a wider audience than business analysts alone. Digital transformation is meant to benefit everyone: every operational task in every division can benefit from a more data-driven approach. During the rollout of this product, we spoke with business users in all our departments: marketing campaign managers, financial controllers, digital marketing managers, sales administrators, and others. We found that operational teams spent hours performing repetitive tasks on their data, and that they used Excel to do so rather than Tableau or similar BI tools. Talend Data Preparation can considerably reduce the time spent on these tasks, letting marketers and other business users get quickly to what they do best rather than slogging through Excel!

This is precisely what sets our solution apart from those of our competitors, most of which are designed for a specific audience, such as business analysts and data analysts. Targeting one particular audience creates a new form of partitioning (a data silo), making data harder to access rather than easier. Talend Data Preparation, by contrast, is designed to meet the needs of different user profiles while encouraging collaboration between them. Above all, our solution brings IT teams and lines of business together so that they can tap the potential of their data collaboratively. But “self-service” does not mean “do it yourself”, and democratizing data management does not mean anarchy: it takes control mechanisms, rules, and governance, without which the approach will fail.

Gartner in fact predicts that many organizations are on course to fail at this: “in 2016, fewer than 10% of business intelligence initiatives will be governed sufficiently to avoid the inconsistencies that typically affect business operations”.

Clean data just a click away

The good news is that with the new Data Preparation Free Desktop, you are just one click away from delivering on the promise of self-service data governance. It is the first open source tool on the market, and is therefore available as a free download. As its name suggests, it is desktop software, so you can install it and get started in a few clicks. We have created getting-started videos and tutorials that explain how to become a data preparation expert in just a few minutes.

In addition, a commercial version of Talend Data Preparation will be released in the second quarter. It will offer additional, enterprise-class features: role-based, multi-user access; collaboration and data governance; a shared inventory of certified and published data, along with other enterprise data; support for hundreds of data sources and targets; and high-performance, server-based data processing. These features will also be delivered as key components of the Talend Data Fabric platform. The self-service approach can thus be extended to every integration scenario built with Talend software, from data integration to big data processing, including real-time integration and master data management.

Soyez certains d'une chose : Talend Data Preparation n’est pas seulement facile à prendre en main et convivial : notre solution vous apportera immédiatement une valeur ajoutée au quotidien, et vous permettra de franchir rapidement des caps en matière de gestion de données. Nous serions enchantés de vous faire découvrir et d’avoir votre opinion sur notre produit (nous y avons même intégré un formulaire à cet effet) !

N’hésitez pas à nous dire ce que vous pensez de Talend Data Preparation ci-dessous ou à m’envoyer un tweet à @jmichel_franco !

Related Resources

Speed Up Your Big Data Projects with Talend

Products Mentioned

Download>> Test Drive Talend Real-Time Big Data for Free

A new Talend survey gauging the views of regular users of betting services in the UK indicates that betting companies are missing a trick when it comes to driving and maintaining customer loyalty. Reducing customer churn is a concern in any sector, but it is heightened among bookmakers and gaming companies, where competition is fierce both on the high street and online.

So how can they go about delivering enhanced customer engagement and increasing customer loyalty? At Talend, we believe the ability to personalise service offerings is potentially a key differentiator for any betting services provider. Our survey supports this view: more than two-thirds of respondents (67%) said they would stay loyal to a brand if offered more personalised service options, such as tailored odds and offers, exclusive offers in real time, or targeted push notifications to their smartphone.

However, the survey results also highlight the fact that bookmakers are behind on personalising their messaging: 72% of the sample say that their preferred bookmaker does not offer them a personalised service.

Betting services providers are surely missing out on an opportunity here. If they want to retain existing business and gain a competitive edge in the future, they need to deliver the tailored experience customers are increasingly looking for. It’s a point highlighted by the 20% of our sample who, when asked what would make them loyal to a particular bookmaker, ranked ‘personalised customer experience’ first, a share second only to the 39% of respondents who cited ‘competitive odds’.

In an environment where customers are increasingly capricious, these figures also highlight the potential benefits all gaming companies could achieve by making better use of their data through the latest data analytics tools to drive individualised campaigns or targeted promotional offers.

Today, most gaming operators are collecting vast volumes of high-value data but so far very few are making optimum use of it. The potential is huge, however. There’s a great opportunity for bookmakers to dig deep into the data they have and use it to segment customers and then target them with promotions, incentives and personalised betting offers.

It’s important to remember, though, that having access to data is not sufficient in itself to do this effectively. Operators understand this, of course, but for many, actually doing it is both challenging and expensive, especially given the increasing volumes and complexity of information coming into the business. If gaming operators and bookmakers want to embrace big data to improve customer engagement and drive up loyalty, they need a scalable, value-added integration platform for processing data and transforming it into a format suited to analysis and smart application. It’s a significant undertaking, but the potential benefits are just as significant.

After all, if they have the means to pull all this data together and analyse it rapidly, these businesses can begin to create hyper-tailored offers that steer high-value customers in the right direction at just the right moment in their gaming activities – for example in near real-time to their mobile platforms.

This ability to integrate all datasets can also help businesses tackle the high churn rates that are typical across the industry, particularly in online and social formats. Once an operator has a holistic view of its customers through this mix, it can determine which customers matter most and target them with offers to ensure they stay.

After all, while accessing the data itself is key to achieving business insight, it is the ability to run analytics on that data that gives gaming companies the opportunity to find out much more about both prospects and customers, and to use this knowledge to personalise their service and reduce customer churn.

Check out our infographic below with the rest of the survey findings:

[Infographic] Are Bookmakers Betting on Big Data? from Talend

Related Resources

With Talend, Speed Up Your Big Data Integration Projects

Products Mentioned

Talend Big Data

Last December, more than 190 countries gathered in Paris to seal a global agreement on reducing greenhouse gas emissions; the consequences of global warming are no longer in doubt. The good news is that a consensus was reached, and that digital technology, led by Big Data technologies, is a major lever for developing solutions for our planet.

Sensors for more energy-efficient solutions

COP 21 opened against the backdrop of a revolution in the energy sector, notably with the emergence of the first smart grids. The ERDF initiative to install a so-called “smart” electricity meter in 35 million French homes, combining several technologies (electrical engineering, information technology and telecommunications) to optimize the management of electricity resources, is unmatched anywhere in the world.

In energy and transport, Big Data combined with connected objects can significantly reduce the carbon footprint. By cross-referencing the data produced by different players (electricity producers, transport networks, municipalities, individuals, open data and so on), it is possible to model the behavior of a city and reduce its impact on the environment. That is the whole point of the “smart city” or “connected city” concept championed by players such as m2ocity, whose infrastructure makes it possible both to streamline the collection of meter data and to industrialize its analysis, with the aim of helping cities and local authorities find more energy-efficient solutions.

m2ocity, a remote meter-reading operator, chose Talend to orchestrate the collection and processing of data from the meters connected to its network of more than 1.6 million meters spread across 2,000 municipalities. The data collected (water, electricity, fuel, gas, pressure, humidity, environment, etc.) makes it possible not only to assess and anticipate risks such as leaks, but also to set up alert systems that help consumers better manage their water or electricity consumption.

From action to anticipation

The connected meter example proves the validity of a model that combines the Internet of Things with Big Data technologies. From there, it is easy to imagine the full range of possibilities, such as smart energy (smart grids), pollution measurement and control, and the management of public infrastructure: lighting, roads, waste, water networks and so on.

To get the most out of these immense data flows, they must be processed and analyzed in real time so that behavior can be predicted. One example would be estimating the probability of pollution peaks in large cities from weather data, traffic data and so on, and then taking the necessary measures to prevent the phenomenon from occurring.

Real time and anticipation matter not only at the scale of cities and local authorities but also at the scale of the individual, for example by alerting a consumer to unusual pressure in their water pipes, a sign of a possible leak to come.

This becomes possible thanks to Big Data technologies such as streaming and real-time processing of data flows.
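To make the water-pressure example a little more concrete, here is a minimal sketch in Python of how a stream of readings could be checked against its own recent history. It is only an illustration under stated assumptions: the function name `pressure_alerts`, the window size and the threshold are hypothetical and do not describe how m2ocity or Talend actually process meter data.

```python
from collections import deque
from statistics import mean, pstdev


def pressure_alerts(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far above the recent rolling average."""
    recent = deque(maxlen=window)  # keeps only the last `window` readings
    for index, value in enumerate(readings):
        if len(recent) == window:
            avg, spread = mean(recent), pstdev(recent)
            # Flag values more than `threshold` standard deviations above average.
            if spread > 0 and value > avg + threshold * spread:
                yield index, value
        recent.append(value)
```

Because the function is a generator, it can consume readings one at a time as they arrive, which is the essence of the streaming approach described above.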

To learn more about the m2ocity use case and how the company uses Big Data, come to its presentation at the Big Data Paris show on March 7, from 3:00 pm to 3:30 pm in room C.

To learn about other use cases and discover how Talend can help you get value from your Big Data, come and meet us at the Big Data Paris show on March 7 and 8, stand 308.

My first blog post is on a topic that connects me to Talend in a special way: custom components. The open source nature of Talend and the concept of custom components give consultants and developers like me a chance to open up new possibilities and make Talend a solution that does not end at the limits set by the vendor.

Developers can publish these components or extensions on the Talend Exchange platform for other users to leverage and benefit from.

Today, I want to talk about a few ways to enable Talend to work with Google services (Analytics, YouTube, AdWords) using custom components found on Talend Exchange. Google is one of the largest providers of web analytics services. With its substantial knowledge of the Internet and a large pool of data, Google has carved out a leading market position in this area, and it offers a very good web interface for accessing web data. It has also developed RESTful web service interfaces for clients that want to use this data in conjunction with their own in-house data and in a data warehouse.

Here are 9 ways you can use Google’s services with Talend right now:

1. Google Analytics: Core Reporting API

This is a classic among the Google APIs. Web analytics is a core competence of Google and the API is well tested and well engineered. Google Analytics Reports are described by a compilation of dimensions, metrics, filter conditions (etc.) and a start and end date. These reports are based on data collected daily.

Component: tGoogleAnalyticsInput

Special Features:

- Automatic repetition of requests for certain error types

- Can use service accounts

- Normalized output of the results
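The "automatic repetition of requests for certain error types" that these components advertise can be pictured as a retry loop with a growing delay. The following Python sketch is not the component's implementation; `TransientApiError`, `call_with_retry` and the default attempt and delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class TransientApiError(Exception):
    """Hypothetical stand-in for an error type that is worth retrying."""


def call_with_retry(request, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `request()` and repeat it only when a transient error occurs."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientApiError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            # Wait longer after each failure: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Any other exception type passes straight through, which mirrors the idea that only "certain error types" (for example quota or temporary server errors) should trigger a repeat.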

2. Google Analytics: Unsampled Reports API

If you are a Google Analytics user who deals with large quantities of data (for example, web clicks) through the Core Reporting API, Google may build your reports from a subset of the data rather than from the full data set. This is called sampling. The deviation is usually small, less than 5 per cent, but it can reach double-digit rates if filters are applied.

However, there is a way to get to all that data! Google supplies an API that creates such reports asynchronously from the full data set.

The necessary steps to do so are:

1. Start the report
2. Check the processing status
3. Download the results (Google Drive)
4. Import the results

Component: tGoogleAnalyticsUnsampledReports

This component supports steps 1, 2 and 4, each with a dedicated operating mode.
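The asynchronous start, poll, download and import sequence can be sketched in a few lines of Python. This is an illustration only: the `client` object and its methods `start_report`, `get_status` and `download_result`, as well as the status strings, are hypothetical placeholders and not the Google API or the Talend component.

```python
import csv
import io
import time


def run_unsampled_report(client, definition, poll_seconds=30, sleep=time.sleep):
    """Run the four steps against a hypothetical `client` object."""
    report_id = client.start_report(definition)        # step 1: start the report
    while True:                                        # step 2: check the status
        status = client.get_status(report_id)
        if status == "COMPLETED":
            break
        if status == "FAILED":
            raise RuntimeError(f"report {report_id} failed")
        sleep(poll_seconds)
    csv_text = client.download_result(report_id)       # step 3: download the CSV
    return list(csv.DictReader(io.StringIO(csv_text)))  # step 4: import the rows
```

In a Talend Job, steps 1, 2 and 4 would map to the component's operating modes, with the download in step 3 handled separately through Google Drive.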

Special Features:

- Automatic repetition of requests for certain error types

- Can use service accounts

- Normalized output of the results

3. Google Drive API

Google offers very secure cloud storage with hardly any limits on space.

It is relevant to Google Analytics because the results of unsampled reports are stored there, preferably as CSV files in Google Drive.

With the appropriate component, we can perform the following file operations in Talend:

- Upload (including the setting of rights)

- Download

- Delete

- Move/Copy

- List (including directories and user rights)

Component: tGoogleDrive

Special Features:

- Automatic repetition of requests for certain error types

- Can use service accounts

- Normalized output of the results

Talend has also developed its own components for Google Drive. The custom component differs from them mainly in the special features listed above, so for the time being there is still a reason to use it.

4. Google Analytics: Real Time API

To collect web metrics very soon after the data has been processed, Google offers a Real Time API. It supplies fewer metrics than the Core Reporting API but, unlike that API, it returns real-time results and results from the preceding minutes. This is especially useful for live monitoring of marketing campaigns.

Component: tGoogleAnalyticsRealtimeInput

Special Features:

- Automatic repetition of requests for certain error types

- It is possible to use service accounts

- Normalized output of the results

- The minutesAgo dimension is additionally provided as a timestamp (real time)
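Turning a relative minutesAgo value into an absolute timestamp is simple arithmetic, sketched below in Python. The function name `minutes_ago_to_timestamp` is hypothetical, and the sketch assumes the value arrives as a string such as "03"; it does not reproduce the component's own code.

```python
from datetime import datetime, timedelta, timezone


def minutes_ago_to_timestamp(minutes_ago, now=None):
    """Turn a `minutesAgo` value such as "03" into an absolute UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    # Truncate to the full minute, since the dimension has minute resolution.
    current_minute = now.replace(second=0, microsecond=0)
    return current_minute - timedelta(minutes=int(minutes_ago))
```

Storing the absolute timestamp rather than the relative value keeps rows comparable when the same report is fetched repeatedly during live monitoring.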

5. Google Analytics: Management API

All website management data is available via this API:

- Accounts

- Web properties

- Views

- Segments

- Goals

- User rights

- Descriptions for all dimensions and metrics

This data helps you keep an overview of your website performance. For example, you can load a report on all users and their rights for each account into your own data warehouse and keep its history as well, which may be of interest to larger companies.

Component: tGoogleAnalyticsManagement

Special Features:

- Automatic repetition of requests for certain error types

- It is possible to use service accounts

6. Google AdWords: AdWords Report API

These days, Google AdWords is THE standard for digital advertising on the web. Google offers various reports on advertising performance and campaign data. Reviewing these reports regularly is essential for spotting trends and placing advertisements effectively.

Component: tGoogleAdWordsReports

Special Features:

- Automatic repetition of requests for certain error types

- Possible to use service accounts

- Reports can be defined with the AdWords Query Language or as regular reports controlled by attributes

- Download of the results as flow or allocation as input flow

7. Google Analytics: Multi-Channel-Funnel Analysis Reports

Google offers functionality for analyzing how potential customers convert across so-called multi-channel funnels. The Talend component collects data from the Google Analytics Multi-Channel Funnels API.

Component: tGoogleAnalyticsMCFInput

8. Google Analytics: Upload

Google offers a function for return-on-investment (ROI) calculations that can integrate external data sources into the reports. These data sources are fed from CSV files.

With Talend, the files can be uploaded to the data sources conveniently and automatically. Talend Jobs can both prepare the data and perform the upload.

Component: tGoogleAnalyticsUpload

9. Google YouTube Analytics

Analyses and reports are also available for YouTube channels. Typically, advertising departments ask for this data to be stored for longer in the data warehouse and expect detailed reports about trends.

Component: tYoutubeAnalyticsInput

In this post we have examined how Talend can connect to several services including Google Analytics, AdWords, Drive and YouTube. Visit Talend Exchange to download and try these extensions or to explore other useful components, connectors, jobs, templates, patterns, data models and more contributed by Talend and the broader community.

Remember our latest Step-by-Step blog post around the construction of a Job in Talend Open Studio (seen here)? Well, now we’ve taken some time to write another post around running, testing and debugging a job in Talend Open Studio to show you how easy it is to troubleshoot your Job when any issues pop up while it’s running. For those of you who have run Jobs in the past, you understand that issues can happen when working on databases and tables. That’s exactly when our Debug mode comes in handy: you can take an in-depth look at your Talend Job and see precisely how it’s running.

In the video below, we cover the different ways to run Jobs, how to test with smaller datasets, and how to debug with the logging features and the built-in debug feature, all within Talend Studio. Our Job is a simple process: it reads data from a large file, aggregates it and writes it into a table.

The video gives you insight into debugging Jobs. It takes you through each debug step, from building the Job to running the process. You will then see the data, with all its attributes, as it passes through each transformation, which helps you identify any bugs within your Job.

The debug process is detailed in two different ways: 1) Talend users can reduce the dataset to see a smaller volume of data and go through the Job to debug at their convenience; or 2) Users are able to add a screen output to the Studio using the tLogRow component and link the output from the main component (such as the tMap in the video) into the tLogRow. Once users run the Job in regular mode, the data output that's been aggregated can be seen on the screen.
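Talend Jobs are built graphically, so the two approaches cannot be shown as Job code here. As a rough analogy only, the Python sketch below does the same two things on a toy aggregation: it limits the number of input rows and prints the aggregated output in the spirit of tLogRow. The function `aggregate_sample` and the `region` and `amount` columns are hypothetical.

```python
from collections import defaultdict
from itertools import islice


def aggregate_sample(rows, limit=None, log=print):
    """Sum `amount` per `region`, optionally on only the first `limit` rows."""
    totals = defaultdict(float)
    for row in islice(rows, limit):        # approach 1: work on a smaller dataset
        totals[row["region"]] += row["amount"]
    for region, total in sorted(totals.items()):
        log(f"{region}|{total}")           # approach 2: tLogRow-style screen output
    return dict(totals)
```

Passing `limit=None` processes every row, so the same code path serves both the reduced test run and the full run.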

Now watch this useful video to understand new ways to debug and run processes within Talend Open Studio. Remember, if you don’t already have Talend installed, you can still play along with the video by downloading Talend Open Studio for free.

Guest Blog by Bernard Marr, Founder and CEO of The Advanced Performance Institute

A quiet revolution has been taking place in the technology world in recent years. The popularity of open source software has soared as more and more businesses have realized the value of moving away from walled-in, proprietary technologies of old.

And it’s no coincidence that this transformation has taken place in parallel with the explosion of interest in big data and analytics. The modular, fluid and constantly evolving nature of open source is in sync with the needs of cutting-edge analytics projects for faster, more flexible and, vitally, more secure systems and platforms with which to implement them.

Open Source and Big Data

So what exactly is open source, and what is it that makes it such a good fit for big data projects? Well, like big data, open source is really nothing new – it’s a concept which has existed since the early days of computing. However, it’s only more recently, with the huge growth in the number of people, and amount of data online, that its full potential is starting to be explored.

The lazy description of open source is often that it is “free” software. Certainly that’s how you will hear the more popular open source consumer and business products (such as the Microsoft Office alternative LibreOffice, or the web browser Firefox) described. But there’s much more to it than that. Generally, truly open source products are distributed under one of many different open source licenses, such as the GNU General Public License or the Apache License. As well as granting the user the right to freely download and use the software, these licenses allow it to be modified and redistributed. Software developers can even take useful parts from one open source project to use in their own products, which could either be open source themselves or proprietary. In general, the only stipulation is that they must acknowledge where open source material has been used in their own products, and include the relevant licensing documentation in their distribution.

Advantages of Open Source

Open source development has many advantages over its alternative – proprietary development. Because anyone can contribute to the projects, the most popular have huge teams of enthusiastic volunteers constantly working to refine and improve the end product.

In fact, Justin Kestelyn, senior director of technical evangelism and developer relations at leading open source vendor Cloudera, tells me that proprietary solutions are no longer the default choice for data management platforms.

He says “Emerging data management platforms are just never proprietary any more. Most customers would simply see them as too risky for new applications.

“There are multiple – and at this point in history, thoroughly validated – business benefits to using open source software.”

Among those reasons, he says, are the lack of fees allowing customers to evaluate and test products and technologies at no expense, the enthusiasm of the global development community, the appeal of working in an open source environment to developers, and the freedom from “lock in”.

This last one has one caveat, though, Kestelyn explains – “Be careful, though, of open source software that leaves you on an architectural island, with commercial support only available from a single vendor. This can make the principle moot.”

The literal meaning of open source is that the raw source code behind the project is available for anyone to inspect, scrutinize and improve. This brings big security benefits – flaws which could lead to the loss of valuable or personal data are more likely to be spotted when hundreds or thousands of people are examining the code in its raw form. In contrast, in the world of proprietary development, only the handful of people whose job it is to write and then test the code will ever see the exact nature of the nuts and bolts holding it all together.

It also makes it far more difficult for software developers to hide or obfuscate exactly what it is that their programs are doing, while they are running on a user’s computer. Consumers are growing ever more aware of the importance of knowing what their computers are doing with their personal data “behind the scenes”. This was proven by the recent outcry over what many saw as excessive snooping built into the latest upgrade to Microsoft’s Windows. Increasingly, customers are aware that running open source gives them the confidence of knowing that their software has been heavily scrutinized by a large community of non-affiliated developers. Anything that the software is attempting to do with data which could be seen as unethical or deceptive will be spotted and will not be tolerated by the open source community. Even if the open source software you are using is not one of the many larger packages, in theory you can still examine the source code yourself to find out exactly what it does (or pay an independent expert to audit it for you.)

Who Uses Open Source?

Don’t make the mistake of thinking that because it is free, open source software is amateur software. As well as the armies of volunteers who work on the projects in their spare time, large numbers of employed professionals are paid to do so, too. Tech giants such as IBM, Microsoft and Google are now some of the keenest contributors, in terms of man-hours, to the biggest open source projects such as Apache Hadoop and Spark.

Of the involvement of these “internet scale” businesses in open source, Ciaran Dynes, vice president of products at vendor Talend, says “What’s interesting is that their business models are not dependent on ‘owning’ the software. The open sourcing of the software is a by-product of their need to innovate to address a market gap they’ve identified – for example Google Search.

“Open sourcing is a part of their branding and being recognized as a good company to join. This is quite different from vendors, such as Talend or Red Hat, where the use of open source has been to seed the market with our technology to upset the status quo of proprietary vendors.”

Many popular big data open source projects actually started out as in-house initiatives at tech companies. One example is the Presto query engine, which was developed at Facebook before being released into the wild and adopted by, among others, Netflix and Airbnb to handle back-end analytics tasks.

Open source can often be more flexible than proprietary software, too. Because the code, pored over and optimized by thousands of contributors, is often highly efficient, it is often less demanding on computing resources and power than proprietary software that does the same job. This means there is less need to constantly update hardware and operating systems to make sure you can run your software.

The Internet is built on open source, and at the same time it enabled open source to begin to reach its potential by bringing together programmers from around the world and enabling them to collaborate with each other. An entire industry has sprung up around some of the most popular open source products (in the case of big data, that would include Hadoop and Spark), aimed at helping businesses get the most from them. These businesses typically produce enterprise distributions of open source products which, for a fee, come adapted for specific markets, or with packaged consulting services to help their customers get the most from them.

All this means that it is easier than ever to get involved with open source, and in many cases it is becoming the mainstream rather than the alternative choice. Last year a survey of 13,000 professionals across all industries found that 78% relied on open source technology to run their companies. This represents a 100% increase since 2010. As Cloudera’s Justin Kestelyn puts it, “We are quite literally living in a Golden Age, right now.” I couldn’t agree more.

Related Resources

With Talend, Speed Up Your Big Data Integration Projects

Easier Data Integration: 5 Steps to Success

Related Products

Talend Big Data

Today’s digitally-driven businesses intensify the data integration challenges that IT departments face. With a greater number of SaaS and on-premises applications, machine data, and mobile apps, we are seeing the rise of complex value-chain ecosystems that are proliferating to support these digital business initiatives. IT leaders need to incorporate a portfolio-based approach and combine cloud and on-premises deployment models in order to sustain a competitive advantage. Improving the scale and flexibility of data integration in the cloud is necessary to provide the right data to the right people at the right time.

Over time, the definition of hybrid integration has changed frequently, ranging from integrating mission-critical on-premises apps with SaaS apps, or presenting B2B data sources from partners and suppliers as a mashup in a portal, to integrating big data and backend systems in order to present the data via mobile applications. The evolution of hybrid integration approaches creates requirements and opportunities for converging application and data integration. As time goes on, the definition of hybrid integration will continue to morph, but what's important to realize is that eventually, everything will move to the cloud. In fact, according to IDC, cloud IT infrastructure spending will grow at a compound annual growth rate (CAGR) of 15.6 percent between now and 2019, at which point it will reach $54.6 billion and account for 46.5 percent of total spending on IT infrastructure. This technology evolution will drive inherent synergies between data and application integration as a unified and pervasive set of disciplines. This being the case, customers should understand how to advance their hybrid integration strategy so that they are well positioned to leverage the power of the cloud.
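For readers who want to see what a 15.6 percent CAGR implies, the compounding arithmetic can be sketched in Python. The IDC figures come from the text above; the number of compounding years is not stated there, so the four-year span used in the usage note (2015 to 2019) is purely an assumption, and the function names are hypothetical.

```python
def future_value(start, rate, years):
    """Compound `start` at `rate` per year for `years` years."""
    return start * (1 + rate) ** years


def implied_start(end, rate, years):
    """Invert the compounding to find the starting value."""
    return end / (1 + rate) ** years
```

Under that four-year assumption, `implied_start(54.6, 0.156, 4)` gives the spending level, in billions of dollars, from which a 15.6 percent annual growth rate would reach $54.6 billion in 2019.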

Let’s now examine the five phases of hybrid integration, starting from one of the oldest and most mature phases to some of the more recent, bleeding edge, disruptive scenarios. We’ll look at the first two phases of hybrid integration and cover the remaining three phases in the second installment of this blog. You can easily assess for yourself which phase of hybrid integration you’re in, and identify how you can propel yourself to a more advanced phase.

Phase 1: Replicating SaaS Apps to On-Premises Databases

The very first step in developing a hybrid integration platform is to replicate SaaS applications to on-premises databases. Companies in this stage are doing one of two things:

1. They need analytics on some of the business-critical information contained in their SaaS apps; or

2. They are sending SaaS data to a staging database so that it can be picked up by other on-premises apps.

This hybrid integration scenario has been around for several years, and first started when SaaS apps such as Salesforce really started growing in leaps and bounds. However, in order to increase the scalability of your infrastructure, it’s best to actually move to a cloud-based data warehouse service within AWS, Azure, or Google Cloud. The scalability of these cloud-based services means that you don't need to spend cycles refining and tuning the databases. Additionally, you get all the benefits of utility-based pricing. With SaaS apps generating more data, you will also need to adopt a cloud analytics solution such as Tableau or Wave as part of your hybrid integration strategy.

Phase 2: Integrating SaaS Apps Directly with On-Premises Apps

Each line of business within an organization has its preferred SaaS app: Sales has Salesforce, Marketing has Marketo, HR has Workday, and Finance has NetSuite.

However, these SaaS apps still need to connect to a back-office ERP on-premises system.

Due to the complexity of back-office systems, there isn't yet a widespread SaaS solution that can serve as a replacement for ERP systems such as SAP R/3 and Oracle EBS. In fact, SAP and Oracle are still in the early stages of on-boarding customers to their own pure-play cloud ERP platforms. My advice here is to not "boil the ocean" by trying to integrate with every single object and table in these back-office systems – but rather to accomplish a few use cases really well so that your business can continue running, while also experiencing the benefits of agility afforded by the cloud. For SAP and Oracle ERP environments, you should quickly accomplish the following use cases: Account Synch, Product Synch, and Opportunity-to-Order when integrating with your Salesforce application. Make sure the Service Cloud module within Salesforce also synchronizes with the Products information within SAP or Oracle so that you can track returns of products, as well as customer satisfaction. Also, to lower operating costs while expanding your business faster, consider having a 2-tiered ERP deployment where your corporate headquarters run SAP or Oracle, and your subsidiaries run NetSuite.

In part two of this blog, we’ll look at some of the more advanced phases of hybrid integration that customers are deploying within their enterprise.

Related Resources

Is Cloud Integration Best for Your Organization?

Products Mentioned

With Talend, Speed Up Your Big Data Integration Projects

Guest blog post from The Digital Group. The Digital Group (T/DG) has been working on Talend-based data source integration for quite some time. It recently launched the 3RDi (Third Eye) Enterprise Search Discovery and Analytics Platform, which uses Talend's capabilities for all data integration layers. 3RDi is a comprehensive suite of products that addresses enterprise search needs, offering best-in-class solutions for Content Discovery, Semantic Enrichment, Governance, Analytics, Relevancy Management and Automated Testing. You can read more about it here: http://www.3rdisearch.com/

Problem

3RDi uses Apache Solr, the open source enterprise search platform, as the backbone for its search features. Apache Solr’s major features include full-text search, hit highlighting, faceted search, real-time indexing, and dynamic clustering. Providing distributed search and index replication, Solr is designed for scalability and Fault tolerance.

As a part of its evaluation and testing of the Solr platform, the T/DG team faced many challenges in attempting to index millions of large complex XML documents at high speed and with minimum errors. Recognizing these limitations, the team realized it could help users by helping cleanse, transform and enrich XML documents with semantic information before they are indexed in the Solr platform. While Apache Solr provides many integration tools like Data Import Handlers and RESTful APIs; these solutions have their own challenges. Solr’s Data Import Handler needs to be loaded in the same JVM, which results in heavy footprint of Solr’s JVM. For large data transfers, this becomes an even bigger challenge and requires the use of a mature integration (or rich ETL) solution like Talend to support higher data transfers, with excellent error handling. Additionally, Talend provides greater flexibility at the data integration layer and provides out of box components to read data from various databases, read data from files in various formats like xml, csv, etc., and transform data, process data in parallel and in batch. However, the challenge is that Talend 6 does not provide a direct way of Apache Solr integration.

Solution:

The solution? New high-speed data integration plugins that make it possible to use Talend capabilities for Apache Solr data integration use cases. Today, Talend Exchange hosts three different free plugins for Talend-Solr integration:

1. Talend-Solr Update High Speed Plugin (link)

2. Talend-Solr Query High Speed Plugin (link)

3. Talend-Solr Insert High Speed Plugin (link)

These plugins were developed and contributed by T/DG. Together they address most cases of migrating data in different forms from various data sources to Apache Solr. The plugins run in Talend’s environment and do not consume Solr’s memory. They are built on top of the latest Apache Solr release line (5.x) and use Solr’s concurrent APIs for high-speed data transfer in concurrent threads and in batches.
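
To make the idea of transferring data "in concurrent threads and in batches" more concrete, here is a minimal Java sketch. It is not the plugins' source code and it does not call the Solr API. The class name BatchedSender, its methods, and the sender callback are hypothetical names chosen for this illustration; the callback simply stands in for whatever call pushes one batch of documents to the search server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical illustration only: this is not how the T/DG plugins are implemented.
class BatchedSender {

    // Split the documents into consecutive batches of at most batchSize items.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    // Hand every batch to a fixed-size thread pool and return the number of documents sent.
    static <T> int sendAll(List<T> docs, int batchSize, int threads, Consumer<List<T>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                results.add(pool.submit(() -> {
                    sender.accept(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> result : results) {
                sent += result.get(); // waits for the batch; fails if the worker failed
            }
            return sent;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("A batch could not be sent", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc-" + i);
        }
        int sent = sendAll(docs, 4, 2,
                batch -> System.out.println("sending " + batch.size() + " docs"));
        System.out.println("total sent: " + sent);
    }
}
```

The batch size and thread count in this sketch play the same role as the concurrency settings a component of this kind would expose: larger batches mean fewer round trips, and more threads mean more batches in flight at once.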

Now users can define complex workflows with the help of Talend Open Studio, by plugging in any of these Talend components. They can perform complex data transformation, process data concurrently in batches, and eventually push or pull them from Apache Solr.

These components were benchmarked for performance and error handling against the older plugins and outperformed them in speed, error handling and flexibility of integration. The data ingestion workflow used for benchmarking involved complex data transformation and external web service calls for semantic enrichment of data before indexing. The plugins are easy for any Talend developer to use, and they provide advanced settings to control the concurrency level and other Solr parameters. They are maintained by The Digital Group, which provides technical support for any issues with them. Below are screenshots of these plugins in action.

Figure 1: Talend-Solr High Speed Data Insert Plugin

Figure 2: Talend Solr High Speed Data Update Plugin

Figure 3: Talend-Solr High Speed Data Query Plugin

Related Resources

Easier Data Integration: 5 Steps to Success

Products Mentioned

Download>>Try Talend Data Preparation for Free

Guest blog by Charles Parat, Strategy & Innovation Director, Micropole Group

In today’s world, business workers must utilize a range of enterprise applications in order to complete some of the more routine tasks associated with their job title. When these applications work well and help streamline repetitive tasks, workers have time to focus on more strategic endeavors that add great value to their organization.

The Marketing Data Deluge

When it comes to data, marketing teams constantly handle large amounts of information, both internal and external, while being pressed to process these huge data sets efficiently within very short timelines. Setting up and managing an increasing number of marketing campaigns is also a major demand. Had their CRM tools delivered on the promise of seamless data transformation and transfer processes, everything would have been fine. However, such is not the case, and since Marketing is on an ongoing quest for greater efficiency, where quicker processing of increasing amounts of data can become a key competitive differentiator, existing systems and tools need to evolve.

What was planned for yesterday can quickly become obsolete today so it’s essential for firms to rapidly reconcile data captured from the increasing number of customer and prospect interactions that are happening across a growing number of channels.

These are daily challenges that cannot be delegated to over-worked IT teams.

Data Quality is Key

Today, from both an operational and an analytical standpoint, customer and prospect data quality is key. Increasingly refined prospect targeting, as well as mandatory reductions on database demands, leaves no room for guesswork. What’s more, customers can’t bear personal data mistakes, privacy intrusions or inaccurate targeting.

For a long time, marketing teams hired trainee "data crunchers" or fell back on colleagues who work with heavy statistical tools. This approach has shown its limitations and has often brought more risks and delays than efficiency gains.

While Marketing staff generally understand their business data environment quite well, they are not really trained in computer science, and are often discouraged by the lack of business understanding demonstrated by some of their well-intentioned colleagues who offer to support them in a business emergency. Thus, the marketing team is condemned to face a data crisis alone, with only the help of available office tools such as local databases and spreadsheets.

However complex the data handling, marketing managers will pile up an incredible number of transformations, cut-and-paste and correction steps before sending an untraceable set of key data into diverse software applications (such as CRM sales apps like Salesforce, marketing campaign tools such as Eloqua, phone calling lists, etc.).

Data Management + Business Intelligence = Marketing Excellence

Therefore, it was high time that Business Intelligence techniques joined with Data Management features to offer solutions favoring data governance, and delivering optimized software platforms to help marketeers better face their data challenges. But of course it’s not only marketing teams facing these challenges. Marketing’s colleagues in HR and finance face similar data-wrangling struggles and will benefit greatly from tools that empower them to extract value out of their business data as directly, securely and as quickly and efficiently as possible on their own. You can experience the potential yourself by downloading Talend’s Free Desktop version of Data Preparation. The getting started guide shows you exactly how to cleanse and standardize leads and contact data: http://bit.ly/1QnGPal .

Talend Data Preparation

To summarize, I would strongly recommend that you consider Data Preparation as a key initiative in your marketing operations! This is advice that I would share not only as innovation director of a consulting business, but also as a leader in a company which aspires to “eat its own dog food” while transforming into a data-driven business.

Related Resources

Products Mentioned

This blog addresses operationalizing meta-data usage in data management using Talend. To explain further: files and tables have schema definitions, and those schemas hold information such as name, data type, and length. This information can be imported or keyed in with a visual editor like Talend Studio at design time; however, this can add a lot of extra work and is prone to errors. If you could extract this information from a known and governed source and use it at run time, you could automate the creation of tables or file structures. Once tested and verified, the tables can be manually imported if stored meta-data schemas are desired.

What is schema?

Schema is the definition of the data formats, fields, types and lengths.

Persisted Schema

This is an example of what a persisted meta-data schema looks like in Talend. This is created at design time in Talend.

Data Dictionary File Example

This is what a data dictionary file could look like. It can be used instead of defining a static schema in Talend Meta-Data as displayed above. The data dictionaries used in this process will have a static meta-data schema from a data modeling tool like Erwin or a database schema. Your dictionaries should be transformed to a common format, which allows for more code reuse.

Schema for a Data Dictionary

This is a basic layout for a data dictionary using Column Name, Type and Length. An exhaustive list of Talend dictionary items is listed below.

ColumnName
DBName
Type
DBType
Precision
Nullable
Key
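
As a rough illustration, one row of such a data dictionary could be held in a small Java class like the one below. This is a sketch under assumptions, not a Talend API: the class name DictionaryEntry, the Java field types, and the semicolon-separated line layout are all hypothetical and chosen only to mirror the seven items listed above.

```java
// Hypothetical holder for one data dictionary row; the fields mirror the items listed above.
class DictionaryEntry {
    final String columnName;
    final String dbName;
    final String type;
    final String dbType;
    final int precision;
    final boolean nullable;
    final boolean key;

    DictionaryEntry(String columnName, String dbName, String type, String dbType,
                    int precision, boolean nullable, boolean key) {
        this.columnName = columnName;
        this.dbName = dbName;
        this.type = type;
        this.dbType = dbType;
        this.precision = precision;
        this.nullable = nullable;
        this.key = key;
    }

    // Parse one line such as "customer_id;CUSTOMER_ID;Integer;INT;10;false;true".
    // The semicolon-separated layout is an assumption made for this sketch.
    static DictionaryEntry parse(String line) {
        String[] f = line.split(";", -1);
        if (f.length != 7) {
            throw new IllegalArgumentException(
                    "Expected 7 fields but found " + f.length + ": " + line);
        }
        return new DictionaryEntry(
                f[0].trim(), f[1].trim(), f[2].trim(), f[3].trim(),
                Integer.parseInt(f[4].trim()),
                Boolean.parseBoolean(f[5].trim()),
                Boolean.parseBoolean(f[6].trim()));
    }

    public static void main(String[] args) {
        DictionaryEntry e = parse("customer_id;CUSTOMER_ID;Integer;INT;10;false;true");
        System.out.println(e.columnName + " -> " + e.dbType + "(" + e.precision + ")");
    }
}
```

Converting every source dictionary into one common structure like this is what makes the downstream code reusable across files, tables and other targets.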

What is the conceptual process?

The diagram below shows the logical process for handling dynamic schemas with a data dictionary at run time. Talend has components for handling dynamic schemas for positional non-Big-Data files, but this pattern could apply to all other types of data sources and targets.

The process:

Load the data dictionary
Define input and output as a single data item
- String for Big Data
- Dynamic for files, tables and internal schemas
Use Java APIs to operationalize the use of data dictionaries
- For non-Big-Data, use the Talend API, which will be demonstrated in subsequent examples.
- For Big Data, Java string utilities will be used.

This process can be used to:

Virtualize schemas for files
- Fixed (NOTE: There are Talend components to do this as well)

Virtualize schemas for Tables
Virtualize schemas for Big Data elements like HDFS or Hive Schemas
Virtualize Internal Schemas used for Data services or Queues

Load the Dictionary

Below is the code to load the data dictionary from a file or table. The values read from the dictionary are moved into memory in the form of an ArrayList. This ArrayList can then be used throughout the data management process to operationalize the processing of data.

// Number of data dictionary rows (one per column) loaded so far.
// Assumes "rowCount" was set to 0 in globalMap before the first row arrives.
int rowCnt = (Integer) globalMap.get("rowCount");

// Three lists hold the data dictionary columns: name, type and length
List<Object> nameList = new ArrayList<Object>();
List<Object> typeList = new ArrayList<Object>();
List<Object> lengthList = new ArrayList<Object>();

// After the first row, continue with the lists already stored in globalMap
if (rowCnt > 0) {
    nameList = (List<Object>) globalMap.get("nameList");
    typeList = (List<Object>) globalMap.get("typeList");
    lengthList = (List<Object>) globalMap.get("lengthList");
}

// Append the current data dictionary row (from the file or table) to the lists
nameList.add(row3.name);
typeList.add(row3.type);
lengthList.add(row3.length);

// Store the updated count and lists back in globalMap
globalMap.put("rowCount", rowCnt + 1);
globalMap.put("nameList", nameList);
globalMap.put("typeList", typeList);
globalMap.put("lengthList", lengthList);
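
Once the three lists are in memory, they can drive the automated creation of tables mentioned at the start of this post. The following standalone Java sketch shows one way this could look. It runs outside Talend, and everything in it is an assumption made for illustration: the class name DictionaryDdl, the method createTable, and the premise that the dictionary's type column already contains SQL type names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a CREATE TABLE statement from dictionary lists.
class DictionaryDdl {

    // Assumes types already hold SQL type names and that a length of 0 or null means
    // "no length". Identifiers are not escaped, so use governed dictionaries only.
    static String createTable(String table, List<String> names,
                              List<String> types, List<Integer> lengths) {
        if (names.size() != types.size() || names.size() != lengths.size()) {
            throw new IllegalArgumentException("Dictionary lists must have the same size");
        }
        StringBuilder sql = new StringBuilder("CREATE TABLE ").append(table).append(" (");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append(names.get(i)).append(' ').append(types.get(i));
            Integer length = lengths.get(i);
            if (length != null && length > 0) {
                sql.append('(').append(length).append(')');
            }
        }
        return sql.append(')').toString();
    }

    public static void main(String[] args) {
        String ddl = createTable("customer",
                Arrays.asList("id", "name"),
                Arrays.asList("INT", "VARCHAR"),
                Arrays.asList(0, 50));
        System.out.println(ddl); // prints: CREATE TABLE customer (id INT, name VARCHAR(50))
    }
}
```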

Define Input and Output

Dynamic Schema for non-Big-Data Components

This is an example of a dynamic schema definition for a delimited file. One field of the type Dynamic is used for the entire record and the schema will be determined at runtime.

Dynamic Schema definition for Big-Data Components

This is an example of a dynamic schema for Big Data. Notice that a data type of String is used instead of Dynamic. Talend doesn’t support dynamic types for Big Data components.
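
Because the whole record travels as a single String in the Big Data case, the data dictionary has to be applied with plain Java string utilities. The sketch below shows the general idea for a delimited record. It is a simplified illustration rather than a Talend component: the class StringRecordParser and its parse method are hypothetical names, and quoting or escaping of delimiters inside values is not handled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: maps a delimited String record onto dictionary column names.
class StringRecordParser {

    // Split the record on the literal delimiter and pair each value with its column name.
    static Map<String, String> parse(String record, String delimiter, List<String> names) {
        String[] values = record.split(Pattern.quote(delimiter), -1);
        if (values.length != names.size()) {
            throw new IllegalArgumentException("Expected " + names.size()
                    + " fields but found " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>(); // keeps dictionary column order
        for (int i = 0; i < values.length; i++) {
            row.put(names.get(i), values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "name", "city");
        System.out.println(parse("42|Ada|London", "|", names));
        // prints: {id=42, name=Ada, city=London}
    }
}
```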

Java APIs

In the following weeks I will go into specific usages of dynamic schemas and the implementation of Talend jobs, components and Java APIs.

Dynamic Schemas for traditional files
Dynamic Schemas for Big Data Files
Dynamic Schemas for NoSQL Tables
Operationalizing with the Meta Data Bridge (Available in a future release of Talend)

Conclusion:

The usage of dynamic schemas can save on maintenance and development in the data management layer. Dynamic schemas can also be used to create tables or files that can be imported as persisted meta-data schemas. This article is intended to propose some complementary technologies around meta-data management. How you use them will depend on the priority of your architectural objectives. These objectives can be somewhat opposing, such as code maintainability vs. strictly governed meta-data.

To find the meta-data approach that works you can ask questions such as:

Does your Talend use case favor code reuse and the use of governed third-party data dictionaries?
Use Dynamic Schemas
Does your Talend use case favor persisted meta-data? Is this meta-data the vehicle for schema governance?
Use Persisted Schemas
Do you want to use dynamic schemas to create meta-data for testing and POCs which will eventually end up as persisted meta-data?
Use a Hybrid Approach by dynamically creating files or tables that can be imported as persisted schemas.

Related Resources

Products Mentioned

3 Steps to Easier Cloud Integration

The challenges emerging from digital business, and demand for greater agility, are forcing changes to integration approaches. Integration leaders can assess technology providers by understanding the five key phases of hybrid integration projects and their needs as an organization. In our last installment, we looked at the very first two phases of hybrid integration, which by now are widespread in most organizations. In this piece, we’ll discuss some of the more advanced hybrid integration patterns: data warehousing in the cloud, real-time analytics and machine learning.

Phase 3: Hybrid Data Warehousing with the Cloud

As the volume and variety of data grow, you need a strategy to move your data from on-premises data warehouses to newer Big Data resources. But, you ask, with so many different Big Data processing protocols out there, how does one choose the right solution?

While you take the time to decide which Big Data protocols best serve the variety of integration use cases present within your enterprise, start by creating a Data Lake in the cloud with a cloud-based service such as AWS S3 or Azure Blobs. These cloud-based services can relieve the cost pressures imposed by on-premises relational databases and can be your "staging area" while you decide which Big Data protocol you want to use moving forward, whether it's MapReduce, Spark, or something else. The primary goal of establishing this staging area is to process all this raw data, be it unstructured or structured, using your Big Data protocol of choice and then transfer it into a cloud-based data warehouse such as AWS Redshift or Microsoft Azure SQL Data Warehouse. Once your enterprise data has been aggregated, you can equip your line-of-business analysts with Data Preparation tools to organize and cleanse this data prior to analysis with a cloud analytics tool such as AWS QuickSight, Tableau, or Salesforce Wave Analytics.

Phase 4: Real-time Analytics on Streaming Data

In today’s highly competitive marketplace, companies can no longer afford to work with information that is weeks or even days old. They need insight at their fingertips in real time. To reap the benefits of real-time analytics, you need a hybrid integration infrastructure to support it. These infrastructure needs may change depending on your use case, whether it be to support weblogs, clickstream data, sensor data, database logs, or social media sentiment.

There are many real-time Big Data messaging protocols such as Kafka, Storm, AWS Kinesis, and Flume, and to keep latency as low as possible, some organizations choose to keep their infrastructure on-premises. However, while latency with streaming use cases is definitely a concern, organizations should not let this be a barrier to moving to the cloud. The best course of action is to first assess all your data sources in order to judge which ones truly need to remain on-premises and which should be moved to the cloud. For example, most IoT use cases involving sensors with industrial equipment are on-premises, so it’s best to keep your streaming analytics infrastructure on-premises. The same goes for high-availability databases whose logs you want to collect. For use cases where you're collecting streaming data about systems that are already in the cloud, it’s probably best to keep your infrastructure in the cloud as well and use existing services within those ecosystems, such as AWS Kinesis or DynamoDB, to set up your streaming infrastructure. That way you are far ahead in your journey towards eventually moving everything to the cloud.

Phase 5: Machine Learning for Optimized App Experiences

Machine learning can bring tremendous value to the applications you build. In the future, every experience will be delivered as an app through mobile devices, whether it's a consumer mobile app, or an enterprise mobile app. The correct hybrid integration infrastructure needs to be architected to provide the ability to discover patterns buried deep within data through machine learning so that these applications can be more responsive to users’ needs. Well-tuned algorithms allow value to be extracted from immense and disparate data sources that go beyond the limits of human analysis. For developers, machine learning offers the promise of applying business-critical analytics to any application in order to accomplish everything from improving customer experience to providing product recommendations to serving up hyper-personalized content.

To make this happen, developers need to:

Be "all-in" with the use of Big Data technologies and the latest streaming protocols
Have large enough datasets for the machine-learning algorithm to be able to recognize patterns
Create segment-specific datasets using machine-learning algorithms to target diverse customer segments
Ensure that whatever mobile app they build has a robust API to draw upon those datasets and provide the end user with whatever information they are looking for in the correct context

In order for companies to reach this level of ‘application nirvana’, they will need to have first implemented each of the four previous phases of hybrid integration. The right iPaaS solution can guide a company through the various phases of hybrid integration so that it is able to successfully reach phase five. When machine learning becomes prevalent within an organization, we can expect to see a lot more data-driven decisions taking place, but to get there, it all starts with the right hybrid integration strategy.

Related Resources

Products Mentioned

3 Steps to Easier Cloud Integration

Several years ago, the cloud was a concept that many forward-looking businesses were just beginning to think about and which many feared. In fact, as recently as three years ago, particularly for enterprises, there were looming concerns around security, reluctance to relinquish one’s data to a third party, periodic incidents of massive cloud provider outages, and vendor lock-in. Now that all seems like a glimmer in the rear-view mirror.

Today, cloud computing has matured as a go-to platform for a large portion of enterprise applications and data. Don’t believe it? Take a look at Amazon Web Services (AWS), which has been cited by many as an unstoppable force in the cloud market, turning $7B in profit and growing at 81 percent a year. But it’s not just AWS that’s fueling the growth of the cloud market. IDC reports that purchases of SaaS apps are growing 5x faster than on-premises software and that by 2018 the cloud software model will account for $1 of every $5 spent on business software. The research firm also indicates that spending on Public IT Cloud Services is expected to grow from $70 billion in 2015 to $141 billion in 2019, a compound annual growth rate of 19 percent.
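
For readers who want to check the growth figure, the compound annual growth rate follows from the start value, the end value and the number of years between them (four, from 2015 to 2019). The short Java sketch below applies the standard CAGR formula to the IDC numbers quoted above; the class and method names are just for illustration.

```java
// Illustrative check of the compound annual growth rate quoted above.
class Cagr {

    // CAGR = (end / start)^(1 / years) - 1
    static double cagr(double start, double end, int years) {
        return Math.pow(end / start, 1.0 / years) - 1.0;
    }

    public static void main(String[] args) {
        double rate = cagr(70.0, 141.0, 2019 - 2015);
        // Rounds to 19 percent, in line with the figure attributed to IDC.
        System.out.println("CAGR in percent (rounded): " + Math.round(rate * 100));
    }
}
```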

With numbers like that, it’s clear that the cloud computing market is showing no signs of slowing down. Combine all of this with trying to turn data from multiple SaaS, on-premises and social media applications, as well as sensor data, into meaningful business insights, and the IT strategy behind cloud deployment suddenly starts to get messy. This is why Talend has been keenly focused on designing an easy-to-deploy solution that can integrate big data workloads in the cloud with ease, delivered in our latest release of Talend Integration Cloud.

What’s New in Talend Integration Cloud Spring ‘16

Big Data Integration with AWS

Remember that AWS juggernaut I mentioned earlier? The Spring ‘16 release of Talend Integration Cloud is specifically designed to help companies more easily manage big data or data warehousing workloads in their AWS Redshift or AWS Elastic Map Reduce (EMR) environments. Talend is the first integration Platform-as-a-Service (iPaaS) solution to enable users to automatically provision and terminate AWS Redshift and EMR clusters when integration jobs are ready to run. This functionality significantly eases the management burden on IT, while enabling businesses to unlock the full potential of their environments at a reduced cost.

For a full rundown of what you can do with Talend on AWS, click here.

Support for Apache Spark and Kafka

It’s no surprise that real-time big data in the cloud is now a reality for any business trying to keep pace with its competition. Transforming real-time streaming data for analysis requires a streaming engine for ingestion, such as AWS Kinesis, Apache Kafka, or Flume, amongst others. The Spring release of Talend Integration Cloud adds support for Apache Spark and Kafka and makes it easy to build intelligent pipelines from a variety of big data sources. Moreover, Talend Integration Cloud makes use of the common architecture of Talend Data Fabric, which means that users can access over 900 connectors and components instantaneously.

Enterprise Class Capabilities to Optimize Hybrid Integration

With the explosion of data coming from sensors, social media, on-premises apps and more, hybrid integration is becoming increasingly tricky. Connectivity to data alone does not make an organization efficient. IT must automate integrations across on-premises and cloud data services for greater productivity. Talend Integration Cloud now provides enterprise class tools to help you manage your hybrid integration scenarios successfully. This includes improved job scheduling, monitoring and clustering as well as the ability to monitor the status of your integration jobs to quickly pinpoint and correct errors.

Unleash the Full Power of Your AWS Cloud

We've already established that the growth and potential for cloud in the enterprise is, and will continue to be, unstoppable for the foreseeable future. However, there are still a few key hurdles to overcome. Mick Bass, CEO of 47Lining said, "One of the major challenges companies face today is integrating silos of disparate data found on-premises and in the cloud to improve customer insights. Businesses that are looking to make their organizations more data-driven need to extend integration processes to the cloud."

The good news is that today you can take full advantage of these new features in Talend Integration Cloud and get a jump start on your journey to success with big data in the cloud.

To learn more, register for our Talend Integration Cloud Spring '16 launch webinar which takes place on March 31, 2016 at 10am PDT.

Leave your comments on what you think about the new release of Talend Integration Cloud below or tweet me at @Ash__V!

Related Resources

Products Mentioned

Customer data is everywhere. In some organizations, it can become the root cause for serious business inefficiencies: Undeliverable outbound e-mails, returned shipping, unaddressed customer claims, lack of privacy regulation compliance or data breaches. However, if customer data is managed correctly, it can fuel new levels of customer acquisition, company performance, sales conversion rates and the overall customer lifetime value.

Talend commissioned a survey from Enterprise Management Associates to understand the difference between the winners and the losers in the data-driven economy. Here are five takeaways from this survey:

1. Master Data Management is a foundation for data driven, customer centric organizations. The survey clearly highlights that levels of maturity in those two disciplines are closely interrelated.

2. A 360° view of the customer is a moving picture. Because customer touchpoints are exponentially expanding over time, a more agile integration approach is required for success.

3. Turning customer data into a well-managed and trusted source of information across the organization is no longer an option; it’s a must-have. But once this mandate has been achieved, you can derive far more value from your reconciled 360° customer view through actionable customer-facing applications that directly interact with customers or with customer-facing employees and business partners.

4. Shifting the data accountability to the lines of business is clearly a key differentiator for the most successful organizations. Enterprise data architects have a major role to play in achieving this shift, especially in large organizations.

5. Finally, mature organizations succeed in their ability to fund their MDM initiatives based on clear and quantifiable business benefits that overcome the inherent complexities of most MDM projects. Linking MDM initiatives with clear business outcomes, once considered as a difficult if not elusive exercise, has become straightforward now that lines of business understand the value and the urgency of being data driven.

Over time, we will continue to share findings from this survey. Stay tuned for a webinar where we will present the key takeaways in more detail and invite some of our customers to share their experiences with MDM. I'd be happy to get your feedback as well, through this blog or through my Twitter account ( @jmichel_franco ).

Related Resources

Creating the Golden Record that Makes Every Click Personal

Products Mentioned

Download>> Gartner’s Top Data Priorities for Chief Data Officers in 2016

Talend MDM

Gartner research recently posted the results of its inaugural survey of chief data officers (CDOs), which revealed that the primary mandate and objective of today’s CDO is to manage, govern and utilize information as an organizational asset. Well, at Talend we see ‘enterprise information’ as more than just ‘an asset’ – we believe it is a company’s most valuable commodity. In keeping with this philosophy, we view the CDO as one of the most important roles in today’s C-suite.

While Gartner indicates that CDO appointments have been accelerating in the last three years, the role is still somewhat nascent within most companies today. Because this role is still emerging, there is a high degree of variability in what CDOs do, to whom they report, and how they are measured, which can be three of the most critical defining aspects of a person’s job. I thought I’d dig a bit deeper into how Gartner’s survey results can help guide CDOs on the steps they need to take in order to make themselves highly-valued members of their organizations.

What does CDO even stand for and what do they do?

It may be a newly defined position, but the role of a CDO is a serious job—particularly in today’s data-filled world. It’s no secret that data is growing at an alarming rate. IDC reports that the digital universe is doubling in size every two years. A CDO holds the huge responsibility of spearheading enterprise-wide governance, analytics and utilization of corporate information as an asset, via data processing, analysis, data mining, information trading and other means. This gives them a significant amount of influence over what kinds of information the enterprise selects to capture, retain and devise a way to utilize in their organization. They are in charge of how companies reinvent themselves—in what Gartner would term, ‘the digital transformation’—in order to be data driven. The CDO needs to lead the way in terms of teaching the business how to understand the best ways to use their data and then how to implement the insights from that data at scale—both in terms of keeping pace with the increasing volumes of data as well as how to make use of corporate data lakes enterprise wide. One of the ways in which a CDO can do this is to put data into the hands of EVERY person within the company via self-service, but to do so with the appropriate amount of governance. Thus, the CDO needs to be closely aligned with IT in order to make sure that utilizing all corporate data for the purposes of bettering the business is done with security and governance in mind.

How do you measure the success of a CDO?

The first thing a CDO needs to do is help the company build its “vision” for becoming a data-driven organization: something that looks across all of the company’s data silos and that is aggressive but achievable. Once that is established, the company needs to come up with a series of milestones that each deliver value along the way towards achieving the goal. CDOs should also create benchmarks and measurements for information-related programs and activities. This means always taking into account the business outcomes they want to achieve as a result of more effectively managing and utilizing data within their organization, e.g. increasing customer sales and revenue by x amount. They should then map those objectives to measurable activities that can contribute to success; in this case it might mean increasing the success of marketing campaigns by raising email open rates or increasing lead conversions. That way, at the end of the project a CDO can easily prove whether or not they’ve effectively impacted the desired business outcome. Measured against their established milestones, the CDO’s ability to lead cross-functionally and achieve identifiable business objectives will determine their success.

Where does the CDO fit within the corporate reporting structure?

One of the key findings of Gartner’s survey is that there is “a lack of maturity and understanding [among today’s companies] of how to embrace, support and exploit the CDO role.” Chief Data Officers shouldn’t be mistaken for “lone wolf saviors”, as Gartner describes. They can’t exactly parachute in like a Navy Seal team and strike at the heart of an organization’s perplexing data-related challenges. As mentioned above, once a CDO has devised the company’s vision and the milestones needed to help the business become data driven, it will be easier to determine through which chain of command the CDO should report. Regardless of that reporting structure, they will require the support and collaboration of other C-level executives and extended layers of the organization. Just as important as determining where within the organization the CDO sits, the staff and budget that a CDO has at their disposal to help drive change will need to be determined by the organization as a whole. Only with this support can they hope to understand the holistic mission of the company and devise a way for the business to use data strategically in order to accomplish transformative goals.

To learn more about the challenges and opportunities facing today’s CDOs, download this free copy of Gartner’s first CDO survey here.

Related Resources