
Talend “Job Design Patterns” & Best Practices - Part 3


It appears that my previous blogs on this topic have been very well received.  Let me express my continued delight and thanks to you, my (dedicated or avid) readers.

If you are new here and haven’t read any of my previous posts, please start by reading these first (Talend “Job Design Patterns” & Best Practices Part 1 & Part 2) and then come back to read more; they all build upon a theme.  The popularity of this series has in fact necessitated translations, and if you are so inclined, please read the new French versions (simply click on the French flag).  Thanks go out to Gaël Peglissco, ‘Chef de Projets’ of Makina Corpus for his persistence, patience, and professionalism in helping us get these translations crafted and published.

Before I dive in on more Job Design Patterns and best practices, please note that the previous content has been encapsulated into a 90-minute technical presentation.  This presentation is being delivered at ‘Talend Technical Boot Camps’ popping up on my calendar across the globe.  Please check the Talend Website for an upcoming event in your area.  I’d love to see you there!

As we continue on this journey, please feel free to comment, question, and/or debate my guidance, as expanding the discussion out to the Talend Community is, in fact, my ‘not-so-obvious’ end game.  ‘Guidelines,’ not ‘Standards,’ remember?  It is fair for me to expect that you have contributions and opinions to share; am I right?  I do hope so…

Building upon a Theme

By now it should be clear that I believe establishing ‘Developer Guidelines’ is essential to the success of any software life cycle, including Talend projects; but let’s be very sure: establishing developer guidelines, attaining team adoption, and instilling discipline incrementally are the keys to delivering exceptional results with Talend.  LOL; I hope you all agree!  Building Talend jobs can take many twists and turns (and I am not talking about the new curvy lines), so understanding the foundation of “Business Use Case”, “Technology”, and “Methodology” dramatically improves your odds of doing it right.  I believe that taking the time to craft your team’s guidelines is well worth the effort; I know that you’ll be glad you did!

Many of the use cases challenging Talend customers are wrapped around some form of data integration process, Talend’s core competency: moving data from one place to another.  Data flow comes in many forms, and what we do with it and how we manipulate it matters.  It matters so much that it becomes the essence of almost every job we create.  So if moving business data is the use case, and Talend is integral to the technology stack, what is the methodology?  It is, of course, the SDLC best practice we’ve already discussed.  However, it is more than that.  Methodologies, in the context of data, encompass Data Modeling, a topic I am very passionate about.  I’ve been a database architect for over 25 years and have designed and built more database solutions than I can count, so it’s a practical matter to me that database systems have a life cycle too!  Irrespective of flat-file, EDI, OLTP, OLAP, STAR, Snowflake, or Data Vault schemas, ignoring the cradle-to-grave process for data and their corresponding schemas is at best an Achilles’ heel, at worst – disaster!

While Data Modeling Methodologies are not the subject of this blog, adopting appropriate data structural design and utilization is highly important.  Take a look at my blog series on Data Vault and watch for upcoming blogs on Data Modeling.  We’ll just need to take it at face value for now, but DDLC, the ‘Data Development Life Cycle’, is a Best Practice!  Think about it; you may discover that I’m on to something.

More Job Design Best Practices

Ok, time to present you with some more ‘Best Practices’ for Talend Job Designs.  We’ve covered 16 so far.  Here are eight more (and I am sure there is likely going to be a Part 4 in this series, as I find I am unable to get to everything in here and keep the blog somewhat digestible).  Enjoy!

Eight more Best Practices to consider:

Code Routines

On occasion, Talend components just don’t satisfy a particular programmatical need.  That’s OK; Talend is a Java code generator, right?  Sure it is, and there are even Java components available you can place on your canvas to incorporate pure Java into the process and/or data flow.  But what happens if even that is not enough?  Let me introduce you to my little friend: Code Routines!  These are actual Java methods you can add to your project repository: essentially, user-defined Java functions that you code and utilize in various places throughout your jobs.

Talend provides many Java functions you’ve probably already utilized, like:

- getCurrentDate()

- sequence(String seqName, int startValue, int step)

- ISNULL(object variable)

There are many things you can do with code routines when you consider the big picture of your job, project, and use case.  Reusable code is my mantra here and whenever you can craft a useful code routine that helps streamline a job in a generic way you’re doing something right.  Make sure you incorporate proper comments as they show up when selecting the function as ‘helper’ text.
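To make this concrete, here is a minimal sketch of what a custom code routine might look like.  The class name, method, and comment tags are illustrative assumptions on my part (modeled loosely on the comment style you’ll find in Talend’s built-in routines), not a prescribed template:

```java
package routines;

// Hypothetical example of a reusable code routine.  The comment block above the
// method is the text that surfaces as 'helper' text when the function is selected.
public class StringHelper {

    /**
     * safeTrim: trims a string, returning an empty string when the input is null.
     *
     * {talendTypes} String
     * {Category} User Defined
     * {param} string("  hello  ") input: the value to trim
     * {example} safeTrim("  hello  ") # returns "hello"
     */
    public static String safeTrim(String input) {
        // Guard against nulls so downstream expressions never throw a NullPointerException.
        return input == null ? "" : input.trim();
    }
}
```

Once saved in the repository, an expression such as StringHelper.safeTrim(row1.customer_name) could then be used in a tMap or tJava component, giving every job in the project the same null-safe behavior.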

 

Repository Schemas

The Metadata section of the project repository provides a fortuitous opportunity to create reusable objects, a significant development guideline.  Remember?  Repository Schemas present a powerful technique for creating reusable objects for your jobs.  This includes:

- File Schemas              -   used for mapping a variety of flat file formats, including:

  • Delimited
  • Positional
  • Regex
  • XML
  • Excel
  • JSON

- Generic Schemas     -   used for mapping a variety of record structures

- WSDL Schemas          -   used for mapping Web Service method structures

- LDAP Schemas          -   used for mapping an LDAP structure (LDIF also available)

- UN/EDIFACT              -   used for mapping a wide variety of EDI transaction structures

When you create a schema you give it an object name, purpose, and description, plus a metadata object name which is referenced in job code.  By default this is called ‘metadata’; take time to define a naming convention for these objects or everything in your code appears to have the same name.  Perhaps ‘md_{objectname}’ is sensible.  Take a look at the example.

Generic schemas are of particular importance, as this is where you create data structures that focus on particular needs.  Take as an example a Db Connection (as seen in the same example) which has reverse-engineered table schemas from a physical database connection.  The ‘accounts’ table has 12 columns, yet a matching generic schema defined below has 16 columns.  The extra columns account for added-value elements to the ‘accounts’ table and are used in a job data flow process to incorporate additional data.  In reverse, perhaps a database table has over 100 columns and for a particular job data flow only ten are needed.  A generic schema can be defined for those ten columns and used for a query against the table with the matching ten columns: a very useful capability.  My advice: use Generic Schemas a lot, except perhaps for single-column structures, which make more sense to me as simple built-in schemas.

Note that other connection types like SAP, Salesforce, NoSQL, and Hadoop clusters all have the ability to contain schema definitions too.

Log4J

Apache Log4J has been available since Talend v6.0.1 and provides a robust Java logging framework.  All Talend components now fully support Log4J services enhancing the error handling methodology discussed in my previous blogs.  I am sure you’ve all now incorporated those best practices into your projects; at least I hope you have.  Now enhance them with Log4J!

To utilize Log4J it must be enabled.  Do this in the project properties section.  There, you can also adapt your team’s logging guidelines to provide a consistent messaging paradigm for the Console (stderr/stdout) and LogStash appenders.  Having this single location to define these appenders provides a simple way to incorporate Log4J functionality in Talend Jobs.  Notice that the level values incorporated in the Log4J syntax match up with the already familiar priorities of INFO/WARN/ERROR/FATAL.

On the Talend Administration Center (TAC), when you create a task to run a job, you can select which priority level Log4J will log to.  Ensure that you set this appropriately for DEV/TEST and PROD environments.  The best practice is to set DEV/TEST to the INFO level, UAT to WARN, and PROD to ERROR.  Any level at or above that threshold will be included as well.
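As a quick illustration of how those priorities filter messages, here is a minimal, self-contained sketch assuming the Log4J 1.x API (org.apache.log4j) bundled with Talend v6+; the class name and messages are made up for the example, and in a real job you would typically make such calls from a tJava component or rely on the components’ built-in Log4J output:

```java
import org.apache.log4j.Logger;

// Minimal sketch of how Log4J priority levels filter messages.
// Class name and messages are illustrative only.
public class LoggingSketch {

    private static final Logger log = Logger.getLogger(LoggingSketch.class);

    public static void main(String[] args) {
        log.info("Rows processed so far: 1000");                    // visible only when the level is INFO or lower
        log.warn("Lookup returned no match; using default value");  // visible at WARN and INFO
        log.error("Could not connect to the target database");      // visible at ERROR, WARN, and INFO
        log.fatal("Unrecoverable failure; aborting job");            // visible at every configured level
    }
}
```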

Working together with tWarn and tDie components and the new Log Server, Log4J can really enhance the monitoring and tracking of job executions.  Use this feature and establish a development guideline for your team.

Activity Monitoring Console (AMC)

Talend provides an integrated add-on tool for enhanced monitoring of job execution, which consolidates the collected activity of detailed processing information into a database.  A graphical interface is included, accessed from the Studio and the TAC.  This facility helps developers and administrators understand component and job interactions, prevent unexpected faults, and support important system management decisions.  But you need to install the AMC database and web app; it is an optional feature.  The Talend Activity Monitoring Console User Guide provides details on the AMC component installation, so I’ll not bore you all here with that.  Let’s focus on the best practices for its use.

The AMC database contains three tables which include:

- tLogCatcher               -   captures data sent from Java exceptions or the tWarn/tDie components

- tStatCatcher              -   captures data sent from tStatCatcher Statistics check box on individual components

- tFlowMeterCatcher -   captures data sent from the tFlowMeter component

These tables store the data for the AMC UI which provides a robust visualization of a job’s activity based on this data.  Make sure to choose the proper log priority settings on the project preferences tab and consider carefully any data restrictions placed on job executions for each environment, DEV/TEST/PROD.  Use the Main Chart view to help identify and analyze bottlenecks in the job design before pushing a release into PROD environments.  Review the Error Report view to analyze the proportion of errors occurring for a specified timeframe.

While quite useful, this is not the only use for these tables.  As they are indeed tables in a database, SQL queries can be written to pull valuable information externally.  Set up with scripting tools, it is possible to craft automated queries and notifications when certain conditions occur and are logged in the AMC database.  Using an established return-code technique, as described in my first blog on Job Design Patterns, these queries can programmatically trigger automated operations that can prove quite useful.
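For illustration, such a monitoring check might look something like the JDBC sketch below.  Be aware that the database URL, credentials, table name, and column names here (amc_logcatcher, priority, origin, message, moment) are hypothetical placeholders of my own; check the actual AMC schema in your installation before building anything on it:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative sketch only: every AMC table and column name below is hypothetical.
public class AmcAlertCheck {

    public static void main(String[] args) throws Exception {
        // Hypothetical AMC database; the JDBC driver must be on the classpath.
        String url = "jdbc:mysql://amc-host:3306/amc";

        try (Connection conn = DriverManager.getConnection(url, "amc_user", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT origin, message FROM amc_logcatcher "
                   + "WHERE priority = ? AND moment > NOW() - INTERVAL 1 HOUR")) {

            stmt.setString(1, "FATAL");  // hypothetical priority value to alert on
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // In a real setup this is where a script would raise a notification
                    // (email, chat webhook, or a return code for a calling process).
                    System.out.println("ALERT from " + rs.getString("origin")
                            + ": " + rs.getString("message"));
                }
            }
        }
    }
}
```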

Recovery Checkpoints

So you have a long-running job?  Perhaps it involves several critical steps, and if any particular step fails, starting over can become very problematic.  It would certainly be nice to minimize the effort and time needed to restart the job at a specified point in the job flow, just before where an error occurred.  Well, the TAC provides a specialized execution restoration facility for when a job encounters errors.  Placed strategically and with forethought, jobs designed with these ‘recovery checkpoints’ can pick up execution without starting over and continue processing.

When a failure occurs, use the TAC ‘Error Recovery Management’ tab to determine the error, and from there you can relaunch the job for continued processing.  Great stuff, right?

Joblets

We’ve discussed what Joblets are: reusable job code that can be ‘included’ in one or many jobs as needed.  But what are they really?  In fact, there are not many use cases for Joblets; however, when you find one, use it, as it is likely a gem.  There are different ways you can construct and use Joblets.  Let’s take a look, shall we?

When you create a new Joblet, Input/Output components are automatically added to your canvas.  This jumpstart allows you to assign the schemas coming in from, and going out to, the job workflow utilizing the Joblet.  This typical use of Joblets provides for the passing of data through the Joblet, and what you do inside it is up to you.  In the following example, a row is passed in, a database table is updated, the row is logged to stdout, and then the same row is passed out unchanged (in this case).

A non-typical use can remove the input component, the output component, or both, to provide special-case data/process flow handling.  In the following example, nothing is passed in or out of the Joblet.  Instead, a tLogCatcher component watches for various selected exceptions for subsequent processing (you’ve seen this before in the error handling best practices).

Clearly using Joblets can dramatically enhance code reusability which is why they are there.  Place these gems in a Reference Project to expand their use across projects.  Now you’ve got something useful.

Component Test Cases

Well, if you are still using a release of Talend prior to v6.0.1, then you can ignore this.  LOL, or simply upgrade!  One of my favorite new features is the ability to create test cases in a job.  Now, these are not exactly ‘unit tests’; rather, they are component tests: actual jobs tied into the parent job, and specifically into the component being tested.  Not all components support test cases, yet wherever a component takes a data flow in and pushes one out, a test case is possible.

To create a component test case, simply right-click the selected component and find the menu option at the bottom, ‘create test case’.  After selecting this option, a new job is generated and opens up, presenting a functional template for the test case.  The component under test is there, along with built-in INPUT and OUTPUT components wrapped up by a data flow that simply reads an ‘Input File’, processes the data from it, and passes the records into the component under test, which then does what it does and writes out the result to a new ‘Result File’.  Once completed, that file is compared with an expected result, or ‘Reference File’.  It either matches or it doesn’t: Pass or Fail!  Simple, right?

Well let’s take a look, shall we?

Here is a job we’ve seen before; it has a tJavaFlex component that manipulates the data flow passing it downstream for further processing.

A Test Case job has been created, which looks like this.  No modifications are required (but I did clean up the canvas a bit).

It is important to know that while you can modify the test case job code, changing the component under test should only occur in the parent job.  Say, for instance, the schema needs to be changed.  Change it in the parent job (or repository) and the test case will inherit the change.  They are inextricably connected and therefore coupled by their schema.

Note that once a test case ‘instance’ is created, multiple ‘input’ and ‘reference’ files can be created to run through the same test case job.  This enables testing of good, bad, small, large, and/or specialized test data.  The recommendation here is to evaluate carefully not only what to test but also what test data to use.

Finally, when the Nexus Artifact Repository is utilized and test case jobs are stored there along with their parent job, it is possible to use tools like Jenkins to automate the execution of these tests, and thus the determination of whether a job is ready to promote into the next environment.

Data Flow ‘Iterations’

Surely, having done any Talend code development, you have noticed that you link components together with a ‘trigger’ process or a ‘row’ data flow connector.  By right-clicking on the starting component and connecting the link ‘line’ to the next component, you establish this linkage.  Process Flow links are either ‘OnSubJobOk/ERROR’, ‘OnComponentOK/ERROR’, or ‘RunIF’, and we covered these in my previous blog.  Data Flow links, when connected, are dynamically named ‘row{x}’ where ‘x’, a number, is assigned dynamically by Talend to create a unique object/row name.  These data flow links can have custom names of course (a naming convention best practice), but establishing this link essentially maps the data schema from one component to the other and represents the ‘pipeline’ through which data is passed.  At runtime, data passed over this linkage is often referred to as a dataset.  Depending upon downstream components, the complete dataset is processed end-to-end within the encapsulated sub job.

Not all dataset processing can be done all at once like this, and it is necessary sometimes to control the data flow directly.  This is done through the control of ‘row-by-row’ processing, or ‘iterations.’ Review the following nonsensical code:

Notice the components tIterateToFlow and tFlowToIterate.  These specialized components allow you to place control over data flow processing by allowing datasets to be iterated over, row by row.  This ‘list-based’ processing can be quite useful when needed.  Be careful, however: in many cases, once you break a data flow into row-by-row iterations, you may have to re-collect it back into a full dataset before processing can continue (like the tMap shown).  This is due to the requirement that some components force a ‘row’ dataset flow and are unable to handle an ‘iterative’ dataset flow.  Note also that t{DB}Input components offer both a ‘main’ and an ‘iterate’ data flow option on the row menu.
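To picture the difference away from the Studio canvas, here is a rough conceptual analogy in plain Java (my own illustration, not generated Talend code): a ‘main’ row link streams the whole dataset through one pipeline, while an ‘iterate’ link hands each record to the next step one at a time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Conceptual analogy only: this is not what Talend actually generates.
public class FlowVersusIterate {

    public static void main(String[] args) {
        List<String> dataset = Arrays.asList("alpha", "beta", "gamma");

        // 'Main' row flow: the downstream step works on the dataset as one stream,
        // roughly the way a tMap processes every row within a single sub job.
        List<String> transformed = dataset.stream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
        System.out.println("Full dataset result: " + transformed);

        // 'Iterate' flow: each record triggers the downstream step independently,
        // the way tFlowToIterate drives one execution per row.
        for (String record : dataset) {
            processSingleRecord(record);
        }
    }

    // Stand-in for whatever sub job an iterate link would trigger per row.
    private static void processSingleRecord(String record) {
        System.out.println("Processing record: " + record);
    }
}
```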

Take a look at the sample scenarios ‘Transforming data flow to a list’ and ‘Transforming a list of files as a data flow’, found in the Talend Help Center and Component Reference Guide.  These provide useful explanations of how you may use this feature.  Use it as needed, and be sure to provide readable labels to describe your purpose.

Conclusion

Digest that!  We’re almost done.  Part 4 in the blog series will get to the last set of Job Design Patterns and Best Practices that ensure a solid foundation for building good Talend code.  But I have promised to discuss “Sample Use Cases”, and I will.  I think getting all these best practices under your belt will serve you well when we start talking about abstract applications of them.  As always, I welcome all comments, questions, and/or debate.  Bonsai!


6 Steps that will Pave the way for your Hadoop Journey with Data Governance and Metadata Management


This is the first installment of a two-part blog series focused on governing Big Data and Hadoop.

So, you’re ready to embark on your data-driven journey, huh?  The business case and project blueprint are well defined and you’ve already secured executive sponsorship for your digital transformation. You’re ready to run a modern data platform based on Hadoop and your team is set-up on the starting blocks to deliver the promises of Big Data to the wider organization.

But then you feel some hesitation as you envision a whole new set of challenges.  Are you ready to operate at the fast pace of Big Data?  To control the risks that will inevitably arise from the proliferation of data in your data lake?  To scale a data lab that is currently accessible only to a few data scientists into a broadly shared, self-service center of excellence that anyone can access and that seamlessly connects with your critical business processes?

Like it or not, you’re not equipped for success until you address the legacy enterprise challenges related to security, documentation, auditing and traceability. But the good news is that there is a modern way to harness the power of your Hadoop initiative with data governance in order to bring you significant business benefits.

Tackling the Six Most Pressing Issues in Governing Various Types of New Big Data

To get a full understanding of the potential benefits and best practices related to Data Governance on Hadoop, Talend commissioned a report by TDWI, which outlines six pillars to ensure the success of your Big Data project:

1.  Deliver Big Data accessibility to a wide audience, without putting data at risk. Self-service approaches and tools allow IT leaders to empower data workers and analysts to do their own data provisioning in an autonomous way. But one cannot just throw data preparation tools into the hands of business users without first having a governance framework to deliver this service in a managed and scalable way.   

2.  Accelerate data ingestion with smart discovery and exploration. It takes weeks, sometimes months, to onboard new sets of data and publish it to the right audience(s) using traditional data platforms. Now, with new “schema-on-read” approaches, IT and data experts can onboard data as it comes. As soon as it is done, data is accessible on tap to a whole community of data workers that gain the flexibility to further discover, model, connect and refine data in an ad-hoc way, at any time.

3.  Capture metadata for the fullest use and governance. Metadata is the crown jewel of data-driven applications. It increases data accessibility by embedding documentation, brings context on top of raw data for better interpretation and draws the connection between disparate data points to turn data into meaning and insights. Last but not least, it brings control and traceability over the information supply chain. Modern data platforms provide new ways to capture, stitch, crowdsource and curate metadata.

4.  Unify the disciplines of data management into a common platform. Silos destroy the value of enterprise data and bring both quality and security risks. There’s a need to establish a single point of control and access to data across integration styles, while decentralizing responsibilities across data citizens.

5.  Consider Hadoop for its flexibility, but beware of its governance challenge. Hadoop can process bigger and more diverse data faster, and delivers it to a wider audience in a more agile way.  But, now that you can operate at extreme scale, speed and reach, there’s a mandate to master data traceability and auditability, protection, documentation, policy enforcement, etc. Consider environments like Apache Atlas or Cloudera Navigator, together with metadata driven platforms, to fully address those challenges.

6.  Get ready for change, continuous innovation and diversity. IT systems are evolving from monolithic to multi-platform. SQL databases are no longer a one-size-fits-all environment where data is modeled, stored, linked, processed and accessed.  Metadata-driven approaches help simplify data access across disparate data stores, provide data lineage and traceability, and accelerate data migration and movement.

In part 2 of this series, we will see how Talend can guide you through addressing these challenges with Talend Big Data, Metadata Manager, Talend Data Preparation, and Talend Data Fabric.

Hand Coding vs. Tools: Our Take on Gartner’s Report


As the CMO of a data integration software company, I’ve invested a lot of time convincing IT managers that they should switch from hand coding to a tool-based approach. Hence, I was excited by and impressed with the practical advice outlined in Gartner’s recently published report, “Does Custom-Coded Data Integration Stack Up to Tools?”[1]

If you are a customer working with new cloud and big data technologies, Gartner’s report offers several great points to contemplate, but there are also other questions you should take into consideration before initiating any custom-coded projects. In this blog, I’ve summarized my key takeaways from Gartner’s research and translated them into a checklist of questions every IT manager should ask as they evaluate the trade-offs between hand coding and a tool-based approach to data integration.

My key takeaways from “Does Custom-Coded Data Integration Stack Up to Tools?”

·  Be sure to look at both short and long term costs: While your deployment costs may be reduced by 20 percent with a custom coded approach, the maintenance costs will increase by 200 percent.

·  There is a time and place for hand coding, but only in very specific situations: Custom coding can make sense for very targeted, simple projects that will not require a lot of maintenance. It could also be necessary for situations where there are no tools capable of doing the work required. Additionally, only 11 percent of the organizations surveyed by Gartner were pursuing custom development.

· Data integration projects requiring multiple developers will benefit from the visual design environments provided by tools. This will make it easier to re-use prior development elements and can lead to more efficient data integration flows. This is because a single job can feed multiple targets instead of having a series of data integration flows all doing roughly the same thing over and over.

· It’s important to understand the maintenance and support costs that will go along with any projects.  If different people maintain and support the code once it’s in production, their learning curve with a hand-coded approach will be high, and if the code is in production for years, turnover will lead to far higher costs in the long run.

Given the above points, I’ve come up with the checklist of questions below to help IT managers make a more informed decision about which approach to take.

The Hand Coding Checklist:

1.  Does my development team have the expertise to do this using hand coding?

If you’re using a new technology like Hadoop or a cloud platform, who is going to do the work and how much ramp-up time will they need?

2.  Is this what I want my hand coding experts spending time on?

Hand coding experts are typically about a quarter of the full development team, making them a scarce resource.  If a non-expert could do the same work using a tool—and save hours of time doing so—wouldn’t you rather have the experts doing something where their unique skills are required?

3.  Can I do this same work with a tool cheaper and faster than my team can hand code it?

Most IT teams are constantly being asked to do more with less.  A tool-based approach often allows a lower cost per developer to do the work, and accomplish it quickly.

4.  Is this a one-off, stand-alone project or is this an area where I plan to continue doing more and more development over time?

If you are embarking on an initiative using a big data or cloud platform, chances are you are going to want to do more and more on that platform over time. If so, relying on expert hand coders will be a very hard approach to scale given the scarcity of these resources.

5.  How portable will this code be if I want to repurpose it on a new technology platform like Spark or Flink?

Portability is an easily overlooked point.  The Apache Hadoop ecosystem is moving incredibly quickly.  For example, in 2014 and 2015, MapReduce was the standard, but by the end of 2016 Spark had emerged as the new standard. If you had taken a hand-coding approach, it was impossible to port that code from MapReduce to Spark.  Leading data integration tools allow you to do this, eliminating legacy code situations.

6.  Will multiple developers be collaborating on this project?

As covered in the summary above, there are multiple benefits that come from a tools-based approach when you have multiple developers working together including easy reuse and code sharing, visual design environments, even wizards and experts to advise the developer.

7.  How long will this code be in production?

When embarking on a new project, it’s tempting to focus on the time needed to develop and forget how long something will be in production.  Often something that takes six months to develop will be in production for five years or longer.  If that is the case, the support and maintenance costs of that code will continue for ten times longer than the initial development work, making it critical that you understand your support and maintenance costs.

8.  Who will own the maintenance of this code?

If you only have a handful of Spark developers, then they will be the ones forced to maintain and support their code.  Eventually, support and maintenance will consume all of their capacity, making it impossible for them to take on new projects that could potentially be tasks that help your organization gain a competitive edge.

9.  How often will the code need to be updated to accommodate new business needs or changes in the data sources or targets?

Data sources, targets, and business needs are constantly evolving. If it’s reasonable to expect this constant stream of changes, then the cost of maintenance and support will be significantly higher.


[1]“Does Custom Coding Stack Up to Data Integration Tools,” Mark A. Beyer, Ehtisham Zaidi, Eric Thoo, Gartner Research, September 2016. 

 


The Industrial Internet of Things: Why You Need to Get up to Speed Fast


Originally published on Oreilly.com

There’s been a lot of buzz over the last two years around the “Internet of Things,” or IoT. However, more recently a subcategory of IoT, the Industrial Internet of Things (IIoT), has been getting a lot of well-deserved attention. For those who’ve not yet heard of this trend, the IIoT is basically the use of Internet of Things (IoT) technologies in manufacturing. It brings together many key technologies—including machine learning, big data, sensors, machine-to-machine (M2M) computing, and more—in an orchestrated fashion within manufacturing operations. The IIoT promises to drive massive economic transformations in the coming years across multiple industries, including manufacturing, health care, and mining.

O’Reilly’s Data Science for Modern Manufacturing is a new report that lays out the fundamentals of the Industrial Internet—what it is, how government initiatives are promoting it, and how it’s being driven by big data and cloud technologies. It begins with the following statement: “The world’s leading nations are standing at the precipice of the next great manufacturing revolution, and their success or failure at overhauling the way goods are produced will likely determine where they stand in the global economy for the next several decades.”

That opening grabbed my attention because it makes a bold statement about where the industry is going. It has a Tony Montana simplicity to it: “First you launch the devices, then you collect the data, then you get the insight.” OK, so Tony never actually said that in the movie Scarface, but in many ways adopting the IIoT has the blunt, straightforward logic that Tony would appreciate, although in other ways, it is very hard to implement.

There are many benefits that the IIoT promises to deliver, like shorter production cycles, more timely responses to supplier orders, the ability to predict consumer shifts and optimize supply chains to meet new demands, and the ability to quickly retool for design changes. But the IIoT also has the capacity to transform companies—and even countries—in several other ways, opening up a new era of economic growth and competitiveness.

At its best, the IIoT combines people, data, and intelligent machines to improve productivity, efficiency, and operations across a wide range of global industries. Let’s dig into a few examples:

  • Fuel efficiency: Fuel is typically the largest operating expense for any airline. Over the past 10 years, fuel costs have risen an average of 19% per year. By introducing big data analytics and more flexible production techniques, manufacturers stand a chance to boost their overall company productivity by as much as 30%.
  • Predictive maintenance: Predictive maintenance helps identify equipment issues for early and proactive action, creating better-functioning equipment that lowers overall emissions. It can also result (in the example of GE) in saving up to 12% in scheduled repairs, reducing overall maintenance costs by up to 30%, and eliminating up to 70% of breakdowns.
  • Better patient care: In the health care industry, technology tools are enabling providers to collect health data in real time and use advanced predictive analytics to help uncover how each patient’s condition may change. By proactively measuring, monitoring, and managing this data, providers can improve care management, address risk factors and symptoms of chronic disease early, and provide positive reinforcement in new and more effective ways—in some cases, literally saving lives.
  • Smarter farming and agriculture: Agricultural organizations have been using data to determine crop rotation, water allocation, and fertilizer usage to not only help increase each season’s agricultural yields—ultimately helping meet growing global food demands—but also to gather data that they can then sell to commodities traders, essentially acting as an information broker. Commodities traders use this data to predict which companies, crops, and agricultural assets will perform well in the year ahead.

How do you prepare for the IIoT?

Connected manufacturing will not only create huge opportunities for growth, but it will bring change and upheaval to IT and operations teams. There is no simple “cookbook” for implementing an effective IIoT strategy and infrastructure; however, there are some key things to bear in mind as you ready your organization to embrace and reap the benefits from the IIoT:

  • Creating and managing an IIoT infrastructure within an organization requires a unique set of skills and knowledge that is incredibly difficult to find and will become increasingly in demand as this space grows.
  • IT leaders and executives need to make sure they have the right mix of talent that understands how to collect, analyze, and react to data, and knows how to effectively put it to work.
  • In the data science field, there is much more demand than available talent. This will only become more exaggerated over time. Thus, it’s important to also look for data integration, IoT, and big data solutions that are intuitive and easy to deploy and adopt by those who may not be full-fledged data scientists.
  • Businesses will need to look for people with the ability to not only design a product, but to also rapidly re-calibrate both the processes and pace of an agile manufacturing cycle.

The business benefits that the IIoT promises will far outweigh the challenges that may need to be overcome in order to get there. To stay competitive, today’s companies will need to staff up, strategize, and plan for the newly connected world, or risk being outperformed.

To read more about the Industrial Internet of Things, download the free report "Data Science for Modern Manufacturing."

Looking Back at Ten Years of Growth


At a time when Talend is celebrating its ten years of existence and the success of its IPO, I think it’s interesting to revisit the rich history of the company and pay tribute to those who have contributed to making it a global leader in cloud and big data software. Even if this argument is overused, Talend’s success is essentially due to the women and men who joined in the adventure and who, for the most part, are still part of “the family.” Beyond the choices of its managers, Talend has enjoyed tremendous support from its investors, its board of directors, and, naturally, its customers, partners, and employees.

The Early Years: A Different Approach

We started in France, and at that time, the data integration global market was at a major crossroads. There were spectacular changes taking place in terms of the complexity, volume, and types of data needing to be integrated, as well as varying companies' needs, particularly around analytics. As we began to develop the Talend platform, our intuition told us that information management would become without question the cornerstone of the corporate information system. Faced with trying to solve for legacy, monolithic IT solutions such as SAP, IBM, Oracle, and the like, we understood that business users' integration needs would become more and more exacting and specific. However, the solutions proposed were not heading in the direction of affordability. Quite the contrary, they were becoming increasingly costly and unwieldy to deploy.

Confronted with firmly established vendors developing proprietary and thus closed approaches, we put our money on disrupting the established order: the reigning technologies, business models, and market positioning.  Contrary to the practices at the time, we chose to align ourselves with open source and a distributed rather than centralized architecture.  We opted for pricing based on the number of users rather than data volumes or CPU usage.  And from the very beginning, we adopted a global rather than local approach.  In hindsight, these choices seem obvious.  Trust me, that was not the case in 2005, at a time when Red Hat and Salesforce.com were just emerging as market innovators.

Although we were convinced that our vision was right, two individuals were key in validating and supporting this approach – Régis Saleur at Galileo Partners and Jean-François Galloüin at AGF Private Equity (now IdInvest), our first investors. Contrary to customary market practices (to my knowledge, no open source publisher at the time received financial support equivalent to what we got) they truly fought to get the deal and their initial support was critical to our long-term success.

As well as the support of our first investors, customers also played a decisive role in the early years at Talend.  One of our best references will undoubtedly remain Citi, the retail banking operation of Citibank.  They saw a great opportunity in the area of analytics, with the ability to examine far greater volumes of data than ever before thanks to Hadoop.  Our experience working with Citi was highly valuable and marked the beginning of establishing Talend’s reputation as a leader on Hadoop.

Maturity and More (Continuous) Choices

Following the launch of our first solution in October 2006, we became aware that a large proportion of our downloads were coming from the U.S., which we viewed as a golden opportunity to gain momentum in the region. To explain, unlike traditional technology companies that need to spend money on promotions in order to gain sales leads, open source companies such as Talend gain these leads by offering free open source and trial versions of their products.

Of course, beyond the commercial opportunity of having download leads, setting up shop in Silicon Valley was also a necessity for building a supporting ecosystem. So, by 2008, we opened our first office in Silicon Valley with the idea of furthering our core of corporate strategy rather than simply as a way to raise funds.

As we continued to grow, we tried to recruit the best - strong individuals that had the ability to work together. Demanding as we were, our employees in return were joining a company with a very strong corporate culture, personality, and environment based on respect, ambition, and personal excellence. The result: The first 15 developers we recruited in Talend's history are still with the company today. Beyond skills alone, we are just as attached to the mindset of our future employees now as we were then. This recipe has proven to be successful.

This human component is also found in the choices of our investors, as noted above, and in those of the members of our board of directors.  There too, we were seeking the very best.  Bernard Liautaud, the founder of Business Objects, clearly and uniquely fits this mold, having created a company in France and fully developed it in the United States.  In addition, we had other strong characters by our side, such as Peter Gyenes, the founder of Ascential Software and a leading authority on our market.  This diversity of talent engendered complexity in terms of management, challenges, and ambitions, but it also allowed us to develop rapidly (on average, the company approximately doubled in size every year).

Finally, we immediately “sensed” the importance of the technological breakthrough brought by Big Data, having anticipated a few years earlier that sooner or later the Big Data market opportunity would materialize.  This latest shift was particularly important for us and for partners like Apache.  Together we have contributed to democratizing access to analytics and data integration solutions.

An Eye on the Future

People often ask me if I have any regrets. The answer that first comes to mind is "no." Of course, we made a few mistakes, as does every innovative enterprise. Our path could certainly have been smoother at times. We preferred to be humble enough to recognize our mistakes and correct them immediately. We were not striving to become supermen, but simply, attempting to excel.

Talend is now listed on the NASDAQ stock exchange. This major event in the company's existence is far from the end of the adventure. It is a crucial step that, on the one hand, validates the relevance of our initial vision, and on the other, creates a huge opportunity moving forward. The foundation we created in the early days is very strong, and there's no need to completely re-invent the wheel as a public company.

In my opinion, the key is for Talend to stay the course of innovation given how rapidly the market is evolving. Business cases are no longer as important as they once were. Data forms the core of all business processes, but most companies have yet to fully recognize the benefits, so the future market potential is significant.

By creating Talend in late 2005, we opened up a range of possibilities. These past ten years have been unique, and I would like to say thanks to all those who have contributed and will continue to contribute to our success. I believe that, even with all the company has achieved to date, the next ten years can be even more successful as the company realizes its full market potential.


Applying Big Data Analytics to Clickstream Data


If you are a retailer, how well do you know your products and how well do you know your customers?  You may know which products are most popular based on their purchase history because you keep records of those transactions.  But do you know which of your products are the most and least viewed?  Do you know what is driving traffic to certain products and what type of customers are most interested in those products?  Are you able to provide intelligent and meaningful product recommendations to your customers and increase potential sales revenues?  This type of insight is becoming increasingly valuable to online retailers looking to gain an edge over the competition.

In this blog, I will discuss how you can utilize Talend Platform for Big Data to simplify a full-scale analysis of clickstream data, which is a recorded series of clicks within a website, to glean valuable insight into customer trends and behaviors.

Download >> Get Talend’s New Big Data Sandbox Here

What is Clickstream Data?

Clickstream data is nothing new.  It has been recorded and analyzed for years to track and understand an individual’s online behavior.  Recently, analysis of clickstream data has become increasingly popular in the online retail space.  But the common problem many companies face with clickstream analysis is that the sheer volume of data makes it almost impossible to process through standard ETL methods in a timely fashion.  While Hadoop and MapReduce make the analysis much more feasible, there is still a very sharp learning curve to the technology.  With Talend Platform for Big Data, you can now build quick and simple jobs that use native Hadoop and MapReduce technology to analyze these enormous datasets within minutes, and produce impressive graphical dashboards to present to the executive team members who are making critical business decisions that drive the success of the company.  All of this can be done through Talend’s easy-to-use graphical interface, without a single line of MapReduce code being written.

In this example, we demonstrate the use of native MapReduce to enrich a dataset and aggregate the results for different web-based dashboards.
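For readers who want to peek under the hood, here is a minimal PySpark sketch of the same enrich-and-aggregate idea. The file paths and column names (clicks.csv, products.csv, product_id, category, state) are assumptions for illustration only; in the sandbox, Talend generates the equivalent native code from its graphical job designer, so nothing below has to be hand-written.

```python
# Hypothetical sketch: enrich raw clicks with a product catalog, then count
# views per state and category. Paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-demo").getOrCreate()

clicks = spark.read.option("header", True).csv("hdfs:///demo/clicks.csv")
products = spark.read.option("header", True).csv("hdfs:///demo/products.csv")

# Enrich each click with its product category, then aggregate by state.
enriched = clicks.join(products, on="product_id", how="left")
views_by_state = (enriched
                  .groupBy("state", "category")
                  .count()
                  .withColumnRenamed("count", "views")
                  .orderBy(F.desc("views")))

# The aggregated result is what feeds the web-based dashboards.
views_by_state.write.mode("overwrite").parquet("hdfs:///demo/views_by_state")
```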

Data-Driven Retail, Right Now

When you download and try the new Talend Big Data Sandbox, you will get hands-on experience with a simple clickstream analysis job that demonstrates the value such analysis can bring to the success of your company.  In a matter of minutes, you will know which product categories are the most popular and which are the least popular across the country and within each state, allowing you to pinpoint exactly where your focus should be.

Clickstream analysis is the perfect example of the benefits of using Hadoop and MapReduce to make sense out of what would otherwise seem to be a mass of meaningless data.  And Talend Platform for Big Data will simplify your transition into Big Data Analysis by making sense out of Hadoop and MapReduce. You may be thinking you can’t afford this type of analysis, but in today’s data-driven retail economy, I would suggest you can’t afford to be without it. To take it a step further, in an upcoming blog, I will be discussing how you can further use this clickstream data to produce real-time, intelligent recommendations to customers for increased sales potential.

Download your own Sandbox here and get your Big Data project started instantly using our step-by-step cookbook and video.


Setting Up an Apache Spark Powered Recommendation Engine


One of the easiest ways for retailers to increase their customers’ average shopping cart size is to suggest items to prospective buyers at the point of sale.  We see it every day in the physical world: at the grocery store or hardware store, the packs of gum, beef jerky, batteries, magazines, and the like line the checkout lanes, tempting us with impulse items or things we may have forgotten we needed.  This practice of suggestive selling is nothing new. But now, with online shopping at an all-time high, there is a much more effective way to target specific items to specific customers at this most crucial point in the digital shopping experience.

As we discussed in a previous post, clickstream data analysis is a very valuable asset in the retail industry and a perfect example of the benefits of Big Data analysis.  When clickstream data is combined with customer demographic data and passed into a machine learning algorithm, retailers are able to make reasonable predictions about which products their customers are most interested in, which makes the analysis much more valuable.  The problem is that this analysis could take minutes to calculate with a standard MapReduce configuration.  By the time the analysis is done, the online shopper is long gone and there is no guarantee they will be coming back.  With the introduction of Apache’s Spark Streaming technology, combined with Talend’s native Spark architecture, the same analysis can now be done in seconds to provide real-time, intelligent recommendations at any point throughout the shopper’s experience.

Download >> Get Talend’s New Big Data Sandbox Here

Let’s take a deeper look into an example of this scenario, which you can experience first-hand with the new Talend Big Data Sandbox.  In the sandbox, you have the option to run this scenario right within Talend Studio using a stand-alone Spark Engine that comes embedded within Studio, or you can choose to run it on either of the latest Cloudera or Hortonworks Hadoop Distributions through their respective YARN Clients.  You can read more about the cool technology behind the new sandbox that makes all this possible in a separate blog.  For now, I am going to focus on creating this Real-time Recommendation Pipeline to show how you can increase your online sales.

First, A Recommendation…

To begin building an intelligent recommendation pipeline, you first need a Recommendation Model.  The model is what drives the recommendations presented to your customers as they browse through your website.  The Alternating Least Squares (ALS) algorithm is the most widely accepted method of generating such a model, and it sounds very complex.  That’s because it is.  But with Talend Studio, you have a single, simple component that does all that nasty calculation for you.  All you have to do is feed the model with data you already have at your fingertips from your Hadoop cluster, data warehouse and/or operational data store.  Give it some historical clickstream data (the more the better, to produce more accurate recommendations) combined with user demographic data from your registered user community, and then link it to your product catalog.  Let the algorithm do the rest.  If you have a rocket scientist in your IT department, they may want to tweak the algorithm for even more accurate results.  Talend gives you the ability to do that as well, with parameters for Training Percentage, Latent Factors, Number of Iterations and Regularization Factor, which can all be factored into your final Recommendation Model.  I’m not a rocket scientist, so I have no idea how those parameters affect the model.  But that is OK: the Talend component is configured with default values that provide a good starting point for a first-time Recommendation Model!  What is produced is a sort of mini database of information, stored on your Hadoop cluster, that can be referenced at a moment’s notice.  Pretty cool, huh?  But this is just the beginning.  Let’s put this model to work!
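To make those four parameters a little more concrete, here is a hedged sketch of what an ALS training step looks like when written directly against Spark MLlib. The input path and column names (user_id, product_id, clicks) are assumptions, and the Talend component wires all of this up for you; the sketch simply shows where training percentage, latent factors, iterations and regularization plug in.

```python
# Illustrative ALS training with Spark MLlib; schema and paths are assumed.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-demo").getOrCreate()
ratings = spark.read.parquet("hdfs:///demo/user_product_clicks")  # user_id, product_id, clicks

train, test = ratings.randomSplit([0.8, 0.2], seed=42)  # the "training percentage"

als = ALS(userCol="user_id", itemCol="product_id", ratingCol="clicks",
          rank=10,             # latent factors
          maxIter=10,          # number of iterations
          regParam=0.1,        # regularization factor
          implicitPrefs=True,  # clicks are implicit feedback, not explicit ratings
          coldStartStrategy="drop")

model = als.fit(train)
# The saved model is the "mini database" kept on the cluster for instant lookups.
model.write().overwrite().save("hdfs:///demo/recommendation_model")
```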

Apache Spark Streaming & Real-Time Retail

The way we put this model to work is through a Spark Streaming job that reads real-time web traffic “clicks” from a Kafka topic, enriches the data with customer demographic information and sends it to the Recommendation Model to generate a list of products to present to the user.  Confused?  Let me break it down another way.  When a user enters your retail website and starts navigating your product pages, each click of the mouse is captured in your clickstream data.  By utilizing Spark Streaming and Kafka queuing, each click can be immediately analyzed to identify the customer making the click and the product they are clicking on.  That single click now contains enough information to pass to the Recommendation Model, where it can be instantly compared not only against that user’s historical click history but also against the click history of other users who share some of the same demographic characteristics.  The generated recommendations can be stored in a NoSQL database for quick reference when presenting the information to the customer immediately upon their next click or page refresh.  On a personal level, maybe a previous shopper viewed a necklace and earrings on your site but didn’t buy the items; when they return days later, you can identify that this same shopper is about to click “purchase” on a dress and heels.  At this critical moment in the shopping experience, they are much more likely to add recommended accessories they have already viewed and shown an interest in, such as that specific necklace and earrings, rather than random accessories they would almost surely overlook.  Now you have made a personal connection with that shopper through a digital transaction, and thereby increased not only the potential revenue of the sale but also the likelihood they will return.  From a dollars and cents perspective, instead of offering a shopper a pack of gum or beef jerky at checkout in the hope of cashing in on a $2 impulse buy, now you can offer them the perfect pair of $50 wool socks that similar customers bought when also buying the same pair of $200 hiking boots.  Cha-Ching!  You just increased the potential of this sale by 25%!!  Do I have your attention now?
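For the technically curious, the pipeline described above can be sketched in a few lines of Spark Structured Streaming code. The Kafka topic name, message schema, lookup table and output path are all assumptions made for illustration; the Talend job assembles the equivalent pipeline graphically, and the real sandbox persists recommendations to a NoSQL store rather than Parquet.

```python
# Illustrative sketch: consume clicks from Kafka, enrich with demographics,
# score against the saved ALS model, and persist recommendations for lookup.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, IntegerType
from pyspark.ml.recommendation import ALSModel

spark = SparkSession.builder.appName("realtime-reco-demo").getOrCreate()

click_schema = (StructType()
                .add("user_id", IntegerType())
                .add("product_id", IntegerType())
                .add("ts", StringType()))

clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "web_clicks")
          .load()
          .select(F.from_json(F.col("value").cast("string"), click_schema).alias("c"))
          .select("c.*"))

demographics = spark.read.parquet("hdfs:///demo/customer_demographics")  # static lookup
model = ALSModel.load("hdfs:///demo/recommendation_model")

def score_batch(batch_df, batch_id):
    # Enrich the micro-batch, then ask the model for 5 products per active user.
    users = batch_df.join(demographics, "user_id").select("user_id").distinct()
    recos = model.recommendForUserSubset(users, 5)
    recos.write.mode("append").parquet("hdfs:///demo/live_recommendations")

query = clicks.writeStream.foreachBatch(score_batch).start()
query.awaitTermination()
```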

Understanding Your Customers in an Instant

Let’s bring this full circle.  Not only do you have your clickstream analysis to understand product visibility and popularity across different regions, you can also combine that information with a specific customer’s demographic information and retrieve their full history of clicks to understand their individual interests and compare them across different demographic metrics.  Now, with a real-time recommendation pipeline presenting instant, intelligent and meaningful recommendations, you are creating a personal relationship through a digital transaction and ultimately increasing your potential revenue.  Further, you can track the recommendations presented to your customers in Hadoop for later Big Data analysis, for example by cross-referencing their purchase history with the recommendations they were shown to identify which recommendations provide the most value.  As you continue to collect clicks and grow your customer base, you can update and improve your Recommendation Model for more accurate recommendations, and even predict with a certain level of accuracy what your customers will want and need, depending on age, gender, income and even region, climate and season, among others.

Download your own Sandbox here and get your Big Data project started instantly using our step-by-step cookbook and video.

Which Flavor of Talend Data Preparation is Best for You?


In its latest Market Guide for Self-Service Data Preparation, Gartner predicts that “by 2019, data and analytics organizations that provide agile, curated internal and external datasets for a range of content authors will realize twice the business benefits of those that do not”[1].

Organizations today are swimming in data, but most companies are only utilizing a fraction of what they collect. By implementing a self-service data preparation strategy, companies can enable more widespread use of data throughout their organization and move towards creating a data-driven culture. But it’s not easy. IT leaders who aspire to become data heroes and transform their organization into a data-driven business tend to feel threatened when they learn that data workers spend most of their time—an estimated 500 hours and $22,000 per year—collecting, correcting and formatting data before they can turn it into insights. Additionally, there is concern about the risk of uncontrolled proliferation of data sources.

The newest release of Talend Data Preparation provides an alternative. It not only empowers customers to achieve demonstrable business benefits faster and easier but also enables them to expand the reach of their modern data platform to a broader audience. This is designed to allow everyone within the organization—from IT developers to business information workers, including data analysts, stewards or scientists—to benefit from increased access to corporate information to inform their day-to-day tasks. Additionally, Talend Data Preparation is designed to help balance the need for broader data access and collaboration to improve employee productivity and business insight, with IT-controlled data governance.

Desktop, On Demand or Enterprise?

I’m regularly asked about the differences between the various versions of Data Preparation that Talend offers. So I’d like to take this opportunity to provide some guidance on the various ‘flavors’ of Talend Data Preparation, so you can decide which is the best fit for your business information workers’ needs.

In a nutshell:

-       If you want to get hands-on with self-service data preparation, and/or to be self-sufficient in your data-driven tasks, look at the Free Desktop version, or alternatively the On-Demand AWS version if you have a cloud-first strategy.

-       If you are working as a team to maximize the value of data in your activities, or if you aim to establish managed self-service access to data for a community of data workers, you should definitely consider the subscription version.

Desktop: A Personal Productivity Booster!

Think of the Free Desktop version as a personal productivity tool. Business workers install it on their Mac or PC to fix and work on personal datasets they have at their disposal (e.g. a tradeshow leads list, a monthly financial forecast, a compensation measurement tracker). This type of data is typically available in an Excel or CSV file. Once the Excel file is ‘cleaned up’ using Talend Data Preparation, it can be exported as a CSV or Excel file, or into Tableau. As long as your desktop resources can handle the data volume, you’ll be fine with the open source version of Talend Data Preparation. In our experience, business users are typically able to work interactively with tens of thousands of rows, which is why we have set a 30,000-row limit by default. But you might fall short on desktop resources when trying to handle larger datasets.

Download it now here

On-Demand: Access through the Cloud!

We also recently introduced a version of Talend Data Preparation for Amazon Web Services. It is a free, single-user edition that doesn’t require any installation on your desktop: you just connect to it remotely through your browser, and it provides capabilities that are very similar, if not identical, to the Free Desktop version. If you’re familiar with Amazon Web Services and have an active user account, this version is worth a try. Stay tuned for an upcoming blog that dives deeper into the capabilities of the AWS version.

Access it now here

Enterprise: Enabling Data Collaboration and Governance

The enterprise or subscription version of Talend Data Preparation delivers a governed, self-service platform for the entire company. This version provides role-based access and collaboration capabilities for sharing and reusing dataset preparations between data workers. You can see it in action in this video.

Through the Talend platform, it can connect to almost any data source in your enterprise and expose that data as a self-service dataset in batch or real time. As mentioned earlier in this article, the enterprise edition of Talend Data Preparation can work on large datasets through server-based processing and sampling. Last but not least, any user-defined preparation can be pushed back to the Talend Data Fabric platform, where it can be connected to every cloud or on-premises data source across the enterprise, combined with high-end capabilities provided by the Fabric such as data masking, advanced mapping or complex matching, and then run on a scheduled basis or applied to real-time data flows.

Onboarding with this version is pretty straightforward. If you are an existing Talend customer, you are entitled to two free named-user licenses as part of your Talend subscription.  We are also offering a half-day on-demand training session and a 2-day quick start consulting package to learn how to implement, administer and use Talend Data Preparation. As a special offer for early adopters, the on-demand training session is free of charge throughout 2016.

If you are a new user and you wish to discover the software, you can download and experiment with the Free Desktop version of Talend Data Preparation on our website.

Whatever version you choose to utilize, there is a laundry list of business benefits that your organization stands to gain. Let’s now take a deeper look into some of these.

Interact with Large Data Volumes through Selective Sampling

Data preparation is an interactive experience. Because data is exposed to data workers in a spreadsheet-like user interface, they can easily and rapidly identify the actions needed to fix its quality, and enrich and shape it to fit their context.

This experience works fine with relatively small sets of data, but the challenge is to make it scale with larger sets. Data sampling is critical to address this challenge, and this is a feature that we introduced in our commercial version. The latest release of Talend Data Preparation brings this capability to a new level with selective sampling. It allows the data worker to specify the sample that they want to interact with.

Suppose, for example, you want to cleanse your 32,000-row contact dataset from Salesforce.com, and more particularly the US state column. By default, Talend Data Preparation will retrieve a sample of the dataset for interactive preparation. Through its semantic dictionary, not only does it understand that one column refers to a state, it also draws the user’s attention to the invalid values for that datatype. The user can then select the rows with an invalid state within that sample, correct ‘Texas’ to ‘TX’ in a single cell and apply the fix to all matching rows. But there might be other invalid values in the state column that were not included in the sample. Through selective sampling, Talend Data Preparation selects more rows that match the current filter on invalid states so that the preparation can be refined: this makes it possible to correct all invalid data, for example by highlighting a data quality issue related to the state of Iowa (IA).  Selective sampling: optimized data accuracy.
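For comparison, here is roughly what that cleansing logic looks like when written out by hand in pandas. The file name, the column name and the abbreviated list of valid states are all assumptions; in Talend Data Preparation the same steps are point-and-click, and selective sampling keeps the work interactive even across the full 32,000 rows.

```python
# Hypothetical hand-coded equivalent of the state cleansing described above.
import pandas as pd

contacts = pd.read_csv("salesforce_contacts.csv")  # ~32,000 rows, assumed export

# Propagate the single-cell fixes to every matching row.
contacts["state"] = (contacts["state"]
                     .str.strip()
                     .replace({"Texas": "TX", "Iowa": "IA"}))

# Surface whatever is still invalid, the way selective sampling would.
valid_states = {"TX", "IA", "CA", "NY"}  # abbreviated list for the sketch
still_invalid = contacts[~contacts["state"].isin(valid_states)]
print(still_invalid["state"].value_counts())
```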

Fix Data Across Columns Faster 

Because Talend Data Preparation can automatically discover the semantics of your data (for example, understanding that the first column of your dataset is a first name; the second, a last name; the third, an e-mail address; and the fourth, a phone number), it can automatically highlight the invalid data that doesn’t conform to those data types. This capability can be very helpful in improving the productivity of data workers when fixing errors in their datasets.

The latest release of Talend Data Preparation lets you immediately isolate the set of rows that needs to be fixed by applying a filter on all the rows with invalid or empty values in one simple action. When combined with smart sampling, this function is extremely useful for managing data quality in large datasets.

In the following video, the user wishes to keep only business e-mails in a marketing leads list. After extracting the e-mail parts, he deletes every ‘gmail.com’ and ‘yahoo.com’ e-mail address from the dataset in a single operation. Multi-filter: time saved, personal productivity increased.

Another productivity accelerator provided by Talend Data Preparation is the ability to avoid repetitive actions when you need to apply the same standardization to multiple columns. This is a feature that many of our 30,000 early adopters had on their wishlist: the ability to select multiple columns using <Ctrl>+Click or <Shift>+Click and apply functions across those columns.

In the following video, the user notes that two columns are date columns and that both contain unnormalized data. Talend Data Preparation allows the user to standardize both columns at once: the user selects both columns and applies the “change date format” function a single time. Cleansing time divided by two.
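The hand-coded equivalent of that two-column standardization is a simple loop; a hedged pandas sketch follows, with the file and column names invented for illustration. In the tool, the multi-column selection replaces the loop entirely.

```python
# Hypothetical sketch: normalize two date columns to one canonical format.
import pandas as pd

leads = pd.read_csv("leads.csv")  # assumed file with two messy date columns
for col in ["created_date", "last_contact_date"]:
    # Parse whatever formats are present, then write a single canonical format.
    leads[col] = pd.to_datetime(leads[col], errors="coerce").dt.strftime("%Y-%m-%d")
```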

Work with Locations, IBAN and Temperatures

When working with ISO2 country codes (in the commercial version), your data is displayed in the form of a world map in the chart tab. Like any chart in this tab, it is interactive, which means that you can click on a value to drill down. We also introduced an interactive map of the United States in the commercial version for working with two-letter US states.

IBANs are supported, and we deliver more than just pattern checking and format standardization: the IBAN validation algorithm itself is embedded. Our data masking capability also fully applies to this very sensitive data.

For those working with weather data or sensor data, there’s also a new “convert temperature” function to switch the measurement unit of your temperature data between Celsius, Fahrenheit, and Kelvin.
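The conversion itself is just unit arithmetic; the tiny sketch below illustrates the underlying math, not Talend’s implementation.

```python
# Temperature conversion via Celsius as the pivot unit.
def convert_temperature(value, from_unit="C", to_unit="F"):
    to_c = {"C": lambda v: v, "F": lambda v: (v - 32) * 5 / 9, "K": lambda v: v - 273.15}
    from_c = {"C": lambda v: v, "F": lambda v: v * 9 / 5 + 32, "K": lambda v: v + 273.15}
    return from_c[to_unit](to_c[from_unit](value))

print(convert_temperature(100, "C", "F"))  # 212.0
print(convert_temperature(0, "C", "K"))    # 273.15
```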

Design and Maintain your Dataset Preparation

Designing a preparation is an ad-hoc experience. In some cases, especially when working with a preparation that needs dozens of steps, you might add a new step and then realize that it needs to be applied earlier in your preparation sequence. Now you can dynamically move this step up to the right position, and even reorder the preparation steps at any time while maintaining your preparation. This makes the maintenance of complex preparations with dozens of steps much easier, and is particularly useful when standardizing data against a lookup file.

Here the user wants to identify the store-brand products in a full list of products. As usual, he uses the lookup function to blend the core product catalog dataset with an external dataset listing the store-brand products. In theory, a single step is needed to get the requested result. But in this case there are still unmatched values after the lookup runs, due to white space in some cells. So the user cleanses the white space and then reorders the steps of the recipe so that the cleansing happens first. Reorder: an optimized combination of cleansing steps.
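Written out by hand, the same recipe is a trim followed by a merge; the pandas sketch below uses invented file and column names. The point of the reorder feature is exactly this ordering: the whitespace cleansing has to run before the lookup for the join keys to match.

```python
# Hypothetical hand-coded version of the lookup-and-cleanse recipe.
import pandas as pd

products = pd.read_csv("product_catalog.csv")
store_brands = pd.read_csv("store_brand_products.csv")

# The step that had to be moved earlier in the recipe: trim stray white space.
products["product_name"] = products["product_name"].str.strip()
store_brands["product_name"] = store_brands["product_name"].str.strip()

# The lookup itself: flag catalog products that appear in the store-brand list.
flagged = products.merge(store_brands[["product_name"]].assign(store_brand=True),
                         on="product_name", how="left")
flagged["store_brand"] = flagged["store_brand"].fillna(False)
```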


[1] Gartner Research, Inc., ‘Market Guide for Self-Service Data Preparation,’ Rita Sallam, Paddy Forry, Ehtisham Zaidi, Shubhangi Vashisth, August 2016.

 

Talend Connect 2016: Unlock Your Data for Unlimited Possibilities


In August 2016, Talend was recognized as a "Leader" in data integration by Gartner's "Magic Quadrant for Data Integration Tools." Talend is the first new vendor in five years to be added to the Leaders Quadrant, improving its position over 2015, moving up and to the right within the Magic Quadrant based on its completeness of vision and ability to execute. I believe this reflects a broader shift taking place in the market with the introduction of new cloud and big data platforms. Talend has fully embraced these technologies and today offers a compelling solution for organizations looking to become data driven.

2016 ushered Talend into a new era, not only with our IPO on the NASDAQ this summer but also with our expansion in the European market. Already having a strong presence in France, UK, and Germany, we have decided to set up shop in Italy, Spain, and the Netherlands as well as in Sweden to cover the Nordic region, where the adoption of Big Data and Cloud technologies is growing by leaps and bounds. This is why this year we are happy to be welcoming customers and users from all over Europe so that they can share their feedback with us.

Just Imagine the Opportunities You'll Have!

Talend Connect is THE forum that will allow you to exchange ideas and best approaches for tackling the challenges presented by the cloud, big data, self-service, and the Internet-of-things (IoT) to name but a few. You will also learn about our product road map and the newest innovations that will be available starting in 2017. Some of the specific topics being covered:

· We'll present the best practices and reference architecture to build a Data Lake with Talend Big Data. We'll also show you the latest data streaming capabilities that will be introduced in the next version of our solutions.

· We'll review the best practices to migrate legacy data warehouses to Cloud data warehouses, such as Amazon Redshift. All attendees will receive a free eBook and an architectural blueprint to help them get started. We will also be revealing some of the exciting features coming in Talend Integration Cloud Winter '17 around Continuous Delivery.  

· By attending the self-service session, you'll also find out how Talend Data Fabric addresses data governance issues and get a look at the new self-service data features.

Real-World Use Cases From Talend Customers

International customers from different industries, such as Vinci, Veolia, Unos, Lenovo, Air France, and Wejo, will take the podium to explain how Talend helped them innovate and tap into the full potential of their data. New customer approaches and services, retail optimization, and the creation of predictive analytics applications are just some of the best practices that will be shared during Talend Connect.

The winners of the Talend Data Masters Awards will be revealed during Talend Connect. This program celebrates companies that are using Talend solutions in unique ways to help make their businesses more agile, effective and data-driven. The winners will be chosen from Talend’s more than 1,300 global customers representing large enterprises and private firms across major industries. Last year’s award recipients included companies like Citi, GE Healthcare, Hess and Travis Perkins.

Discover, Share, Network

Beyond the keynotes, Talend Connect is also a fantastic opportunity for you to network and connect with professionals who share some of your challenges and opportunities. There will also be a partner area available, allowing you to discover new things and interact with Talend experts.

I am truly thrilled and delighted to welcome our users, customers, analysts and all our community members for this first European-wide edition of Talend Connect. More than 450 users and customers are expected to attend this year and we hope you'll be among them!

A big thank you goes out to our sponsors

Talend Connect enjoys the kind support of Keyrus, JEMS Group, Micropole, Accenture, VO2 Group, Business & Decision, Tableau, Datalytyx, CGI, MapR, Quinscape, Smile, Cloudera, Ysance, Audaxis, Contentserv, EXL group and ADBI, sponsors of the 2016 edition.

Talend Connect: Unlock Your Data for Unlimited Potential


Since August 2016, Talend has been recognized as a "Leader" in data integration in Gartner's "Magic Quadrant for Data Integration Tools." Talend is the first new vendor in five years to achieve "Leader" status. We improved our position on the strength of our overall vision and our ability to execute it. All of this reflects the vast shift we are currently witnessing in the market with the launch of new Cloud and Big Data management platforms. Talend has fully embraced these technologies and today offers an ideal solution for organizations looking to adopt a data-driven approach.

2016 also marks the start of a new era for the company, with our IPO on the NASDAQ this summer as well as our expansion across the European continent. Already strongly established in France, the United Kingdom and Germany, we have decided to set up in Italy, Spain and the Netherlands, as well as in Sweden to cover the Nordic countries, where the adoption of Big Data and Cloud technologies keeps growing. That is why, this year, we wanted to welcome our customers and users from all over Europe so that they can share their experience with us.

Imagine the Possibilities That Will Open Up to You!

Talend Connect is THE forum for exchanging ideas and best practices for tackling these challenges, whether Cloud, Big Data, self-service data or the Internet of Things (IoT). You will also discover our product roadmap and our latest innovations, available in early 2017:

· We will present the best practices and reference architecture for building a Data Lake with Talend Big Data. We will also show you the new data streaming capabilities offered by the next version of our solutions.

· We will review the best practices for migrating traditional data warehouses to a cloud data warehouse such as Amazon Redshift. All attendees will receive a free eBook and an architecture blueprint to help them get started. We will also reveal some of the exciting Continuous Delivery features included in the next version of Talend Integration Cloud.

· By attending the self-service session, you will discover how Talend Data Fabric tackles the challenge of data governance, and you will get a preview of the new self-service data features.

High-Profile Use Cases

International and French Talend customers from different industries, such as Vinci, Veolia, Unos, Lenovo, Air France and Wejo, will also take the stage to explain how Talend helped them innovate and get the most out of their data. New approaches and services for their customers, the optimization of retail outlets and e-commerce sites, the creation of predictive analytics applications and the reorganization of their information systems around data will be some of the topics covered during these sessions.

The winners of the Talend Data Masters Awards will also be announced at Talend Connect. This program rewards customers who use their data in innovative ways to become more agile and effective. The winners will be carefully selected from more than 1,300 customers worldwide, representing large public and private organizations across all industries. Last year's winners included Citi, GE Healthcare, Hess and Travis Perkins.

Discover, Share, Network

Beyond the presentations, Talend Connect is also an opportunity for you to network and build relationships with professionals who share your perspectives. A partner area will also be at your disposal to explore and interact with Talend experts.

I am truly excited and delighted to welcome our users, customers, analysts and all our community members to this first European edition of Talend Connect. More than 450 users and customers are expected this year, and we hope to count you among them!

A Big Thank You to Our Sponsors

Talend Connect enjoys the kind support of Keyrus, JEMS Group, Micropole, Accenture, VO2 Group, Business & Decision, Tableau, Datalytyx, CGI, MapR, Quinscape, Smile, Cloudera, Ysance, Audaxis, Contentserv, EXL group and ADBI, sponsors of the 2016 edition.

Views from the Top: 5 Key Pieces of Advice from Talend CTO on the Future of Cloud


At Talend, we always have our eye on future trends in technology. Cloud—once just a buzzword—is now a widely-adopted technology across all businesses. No longer is it a secret that Cloud is more secure, cost-effective and scalable than the majority of alternatives. However, we also acknowledge that cloud solutions will not entirely replace on-premises infrastructures. Rather, they will largely enhance and extend the processing power and accessibility of those data sources.

FREE TRIAL: Test Talend Integration Cloud for 30 Days 

This is why we decided to sit down with Talend’s Chief Technology Officer, Laurent Bride, to get his views on what’s ahead for the cloud and why having a solid hybrid cloud integration strategy is shaping up to be the way of the future for most companies.

In this two-part Q&A series, we’ll explore these topics. We also welcome your comments and additional questions in our TalendForge Community Forum.

Q1: What are some of the main disruptions you are seeing in the cloud marketplace today?

LB: From an infrastructure standpoint, I see more evolutions than disruptions. When I look at the cloud technology movement, I really look at it from four different benefit standpoints:

· Elasticity

· Scalability

· Security

· Time-to-market

We aren’t really seeing massive disruption in any of these areas, but we do see a lot of incremental improvements that add business value and help companies accelerate their overall time-to-market. Take security, for example, which was at one point a major concern for businesses wanting to migrate to the cloud. The cloud that we have today is likely much more secure than most on-premises IT environments. While it’s not a massive change, it is a necessary evolution of the technology. Where I’m really seeing the disruption is from the rise of cloud services that are now more packaged than they have been in the past because of the layers of abstraction that make these services more consumable.

Q2: What about disruptions for Big Data in the Cloud Use Cases?

LB: I see a lot of changes as well as challenges happening in the next 1-2 years with Big Data use cases in the cloud. Advanced Big Data concepts like artificial intelligence, Internet of Things, machine learning and predictive analytics all require a massive investment, not only in infrastructure but also in the ability to package these complex technologies in a way that enables all business users to take advantage of them. Therefore, most companies are missing out on reaping the full benefits of their data because they are unable to deploy Big Data projects focused on these use cases. Major cloud players such as Amazon, Google and Microsoft Azure will make it possible for these use cases to be easily deployed in the cloud and will push the economic boundaries of these projects. Take image recognition, or computer vision, as an example: its benefits range from helping the visually impaired and powering safety features in cars that detect large animals, to auto-organizing untagged photo collections and extracting business insights from socially shared pictures. Google’s TensorFlow is used to develop many of the company’s AI initiatives, from autonomous cars and translation to Google Now and Google Photos.

The service uses open source picture sets to feed its machine learning models, and Google has the advantage of access to millions of user-labeled images from apps such as Google Photos. Have you ever wondered why Google allows you to upload so many pictures for free? It’s because those pictures are used to train their deep learning networks to become more accurate.

But TensorFlow is hardly the first — or only — open-source framework. UC Berkeley’s Caffe has been around for years and remains popular because of its ease of customization and large community of innovators, not to mention heavy use by Pinterest and Yahoo!/Flickr. Even Google turns to Caffe for certain projects such as DeepDream.

Q3: It seems like every software or infrastructure vendor has some kind of cloud strategy. What questions should customers be asking when evaluating one vendor's approach versus another's? 

LB: Years ago, when the cloud really started gaining traction, companies like Salesforce pushed the idea of “no software”. In fact, that was their tagline for a long time. Salesforce had an absolutely cloud-first mentality in their business. The tagline really didn’t mean that Salesforce wasn’t software, but instead meant that you wouldn’t experience the type of IT issues that were typical of licensed, on-premises software.

As legacy software and infrastructure companies started increasingly shifting to a cloud strategy, they quickly realized that you couldn’t develop cloud products the same way as you would with on-premises software. They ran into integration, speed and security issues such as resource-sharing and multi-tenancy that were critical to success in the cloud. This isn’t to say that all legacy vendors couldn’t make the shift (look at Microsoft for instance), but that it really takes the mentality of “cloud first” to be successful in the web-based, on-demand computing world.

When IT teams are looking at cloud solutions, there is a simple checklist of 5 questions that they should ask themselves to guide their evaluation and design a “cloud first” approach:

1.     Has the app been designed for the cloud up-front, or is it just supporting something that is actually on-premises software?

2.     Is the product portfolio truly integrated or will you experience problems moving from one application to the other?

3.     Can you avoid vendor lock-in with your data?

4.     Will the service have top performance anywhere in the world that your business is?

5.     As your business needs scale, will the pricing of the solution scale disproportionately? 

Continue Reading Part 2 here.

 

Views from the Top: 5 Key Pieces of Advice from Talend CTO on the Future of Cloud - Part 2


In the last installment of our chat with Laurent Bride, CTO for Talend, we covered how the cloud market has evolved to become a mainstream technology and the questions customers should bear in mind as they evaluate cloud solutions. In this installment, we’ll look at how to bridge the gap between IT and the business when selecting a cloud solution that’s right for long-term business success, making sure cloud is part of a company’s overall business strategy instead of being part of shadow IT, and why open source is a critical component to an effective cloud solution.

Q4: Today there is a bit of a debate on which cloud initiatives should be led by the business, and which ones should be led by IT. What’s your opinion on who is better equipped to lead a cloud initiative?

LB: I don’t think it’s entirely up to one side or the other. In an ideal world, both business and IT would be involved and work together to find a mutually beneficial solution that meets both business and IT objectives.

For example, when you look at cloud analytics tools, you don’t really need much IT involvement as long as the data stays within a business silo. Very quickly, however, you will find that you are going to need data from other areas of the business to make the tool viable. That’s where you run into integration and governance issues; and you cannot solve integration and governance challenges without the involvement of IT.

Collaboration is definitely important to success in the cloud. It’s widely understood that our world is becoming digital—consumers, business workers and leaders alike want information and services to be delivered ‘as they like, when and wherever they want.’  In order to meet this demand, companies need to become more data-driven—using enterprise information to make educated decisions—in order to improve customer service and overall competitive advantage. However, research from McKinsey estimates that the US economy as a whole is realizing only 18 percent of its digital potential. If IT doesn’t become more relevant to the business in assisting with this digital transformation, it risks becoming simply a cost center. Today, more IT departments are shifting closer to the business so that they can help their companies become more data-driven and remain competitive.

FREE TRIAL: Test Talend Integration Cloud for 30 Days 

Q5: Where does cloud integration play a role in adding value to an overall cloud strategy?

LB: Going back to some of the early examples of SaaS deployments I can recall, often the businesses wanted a best-of-breed for each application. For example, Salesforce.com is seen as the best CRM solution, while Marketo is seen as the best for marketing automation. As different lines of business moved forward with these tools, they eventually reached a point where they required some type of integration in order to get more insights out of the solution. This could be pushing marketing campaign data into a cloud-based CRM system or moving employee feedback data into a financial tool to link it with a budget.

Since the data from these applications is already in the cloud, a hybrid cloud integration solution will allow you to achieve faster processing times with cloud data and to make it more cost-effective to manage as the business only pays for the compute power needed.

Q6: Finally, we want to ask you about a topic that's near and dear to your heart. That, of course, is open source. Do you believe that open source changes the way in which the cloud operates today?

LB: Absolutely. I think that open source is one of the primary accelerators for cloud technology. When you look at the big technology players that have invested heavily in cloud infrastructure (Uber, Google, Facebook, Netflix, for example), all of them contributed to open source projects. When you look at the internal technical architecture of these services, most of them are based on open source. There are of course some proprietary layers but really the backbone is open source. Open source is really fueling the cloud and companies in the cloud are adding key differentiators on top of it in order to generate business value.

Open source and the cloud are accelerators of innovation. Today a startup can get up and running in minutes using cloud and open-source software while 10-15 years ago, a business had to make large upfront investments in technology. It is really helping lead the way to major technology shifts. 

Catch the Big Data Wave - Talend Named Leader in Forrester Wave™: Big Data Fabric, Q4 2016


I am proud to announce that for the second time this year, Talend has been recognized by a leading independent research firm as a Leader in Big Data. Today Forrester Research named Talend a “Leader” in its new vendor ranking report, “The Forrester Wave™: Big Data Fabric, Q4 2016.”

Forrester describes Big Data Fabric as “a platform that helps [enterprise architects] discover, prepare, curate, orchestrate, and integrate data sources by leveraging big data technologies in an automated manner.” The analyst firm examined past research, user requirements, and vendor interviews to develop a comprehensive set of 26 criteria to evaluate the 11 top commercial big data fabric vendors, and to help EA pros better understand all available offerings and select the right solution for their organization.

The Forrester Wave: Big Data Fabric report positions Talend as a Leader with the highest possible score in the Data Management criterion, which includes metadata, data lineage, data quality, data governance, and data security sub criteria. Talend also earned the highest possible score in each of the following sub criteria: data connectivity, data transformation, self-service, and collaboration.

In the March 2016 report entitled Big Data Fabric Drives Innovation and Growth, Forrester defines Big Data Fabric as: “Bringing together disparate big data sources automatically, intelligently, and securely, and processing them in a big data platform technology, such as Hadoop and Apache Spark, to deliver a unified, trusted, and comprehensive view of customer and business data.”

All the basic concepts are there, i.e. you need to connect and move data, and transform and manage data. But connect, move, transform, and manage have new meaning when you go from gigabytes to petabytes, batch to real-time, on-premises to cloud, a few data sources to hundreds, and IT-service to self-service.

The report highlights that “Enterprises of all types and sizes are embracing big data, but the gap between business expectations and the challenges of supporting big data technology (such as Hadoop) has become the primary motivation to innovate with big data fabric.”  We see time and again how having a comprehensive big data platform can save a big data project.  Big data platforms, such as our Talend Data Fabric, turn corporate data “swamps” into qualified pools of information for use by any employee.

Big data and cloud integration technologies have transformed the market and the way companies approach their businesses. And companies across every industry worldwide are discovering how cloud and big data technologies can help them reinvent product and service delivery. In fact, just last week at our annual user conference in Paris—Talend Connect—we recognized the 19 winners of our Talend Data Masters Awards—a program designed to recognize the unique and innovative ways our customers are using data to transform their business.

These companies exemplify the broader shift taking place in the market wherein companies are transitioning from experimenting with or exploring the benefits of using cloud and big data solutions, to implementing these technologies to drive widespread business ROI, and in some cases also benefit society at large.

Our mission at Talend is to help companies become more data-driven. One can only do this by efficiently collecting and combining data from a broad variety of sources so companies can gain the insight needed to optimize any part of their business – from identifying fraud to improving customer service. This is what our software does exceptionally well. Not only has it been proven by our more than 1,300 Global 2000 customers worldwide and by our leadership in the 2016 Gartner Magic Quadrant for Data Integration, but we believe it has been confirmed once again today with our recognition as a Leader in the Forrester Wave: Big Data Fabric, Q4 2016.

I encourage you to download a complimentary copy here and find out for yourself how Talend excels in Big Data Integration and Management.


Helping Data Driven Companies Advance to Artificial Intelligence


Everyone is talking about artificial intelligence (AI) and machine learning these days. This is not just of strategic relevance for companies the likes of Google, Apple, Amazon, Facebook or Salesforce.com. AI is now a term that all companies should be familiarizing themselves with (if they’re not already) because it will have a profound impact on their business in the near future. We have already witnessed vehicles operating autonomously and a proliferation of robotic counterparts and automated means for accomplishing a variety of tasks, which has all given rise to a flurry of people claiming that the AI revolution is upon us.

What is Driving This Next Wave of Change?

It is interesting to observe that very little has changed as far as the basic theories of Artificial Intelligence are concerned. Overall we still have the same approaches and concepts in the field of machine learning as we have had for decades. Concepts such as artificial neural networks, back propagation, support vector machines, Bayesian networks or hidden Markov models have all been developed in the last century.

So, it is even more important than ever to ask why now is the right time for your business to take action on devising a plan for implementing AI effectively within your organization.

As far as I am concerned, the answer to the successful utilization of AI in any business is all about the availability of big data and a company’s IT infrastructure. The idea that data is a valuable commodity has long since been established. More and more companies are now basing their business models on data and analytics. We are living in the age of data driven organizations. When it comes to infrastructure, the cloud with its great scalability has become firmly established: resources are available at the touch of a button to scale to the extent required, and can be invoiced according to use—saving cost.  

Nearly all companies have big data, but not all of them have learned how to effectively leverage it. Before any meaningful business decisions can be made using data, it first needs to be collected and integrated from a broad variety of sources, “cleansed“ to ensure it’s consistent and accurate, and rapidly moved to where it’s needed for real-time decision making. For example, data from a medical imaging process without image processing is still just plain binary data. The same applies to an online shop’s click-streams or the operating data from a machine. Data can only lead to relevant outcomes once it has been evaluated.  

Artificial intelligence takes a significant step forward in terms of enabling real-time decision making by incorporating the use of machine learning.  Machines can learn a model based on data, which enables them to make decisions faster using new data received. In its spectacular performance on Jeopardy, Watson gave its answers all by itself (even though the answers were actually questions there). There are many other situations where machines making decisions on their own have far-reaching consequences: for example, when someone is recommended as a dating partner (with the consequence that a machine contributes to decisions about life-long relationships), when a machine determines who is creditworthy (with the dramatic consequences this can have, e.g. being able to afford a house), or when a machine predicts potential criminal activities. Many such smaller machine decisions already have an impact on our everyday lives: determining who sees what on Facebook, who gets which product suggestions from Amazon, who gets which adverts from Google, who pays which price for a flight, or who gets which answers from Siri.

If you look at things from a different angle, you will see the opportunities which can be opened up by using artificial intelligence processes and especially by using machine learning: complex systems learn from experience (historical data) and make decisions more cleverly on their own to determine outcomes such as how relevant is a lead? Who should we address with which campaign? When should a technician service a system? When might an existing customer consider moving to the competition or when might a member of staff leave the company? Which information is most useful for my user right now vs. in six months, and which services should we recommend?
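To make one of those questions concrete (say, "when might an existing customer consider moving to the competition?"), here is a deliberately simplified scikit-learn sketch of learning from historical data and scoring current customers. The file names and feature columns are invented for illustration, and a production churn model would need far more care around feature engineering, validation and fairness.

```python
# Hypothetical churn-prediction sketch: learn from history, score the present.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

history = pd.read_csv("customer_history.csv")  # past customers with a known outcome
features = ["tenure_months", "monthly_spend", "support_tickets", "days_since_login"]

X_train, X_test, y_train, y_test = train_test_split(
    history[features], history["churned"], test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Score current customers so the business can act before they leave.
current = pd.read_csv("current_customers.csv")
current["churn_risk"] = model.predict_proba(current[features])[:, 1]
```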

As noted above, if you’ve not done so already, now is the time for your business to take action on devising a plan for implementing AI effectively within your organization.

My recommendation is very simple: AI processes will change our lives to an even greater extent than we can foresee today. The same applies to business models and the rules of competition. Anyone who has their data under control today will be ready to take the next step tomorrow. But those who do not start to become data-driven organizations today will miss out completely when the next step needs to be taken. So what steps should companies—who aren’t yet data driven—take in order to become data driven?  We will address this topic in detail in part II of this blog post.

About the author Dr. Gero Presser

Dr. Gero Presser is a co-founder and managing partner of Quinscape GmbH in Dortmund. Quinscape has positioned itself on the German market as a leading system integrator for the Talend, Jaspersoft/Spotfire, Kony and Intrexx platforms and, with their 100 members of staff, they take care of renowned customers including SMEs, large corporations and the public sector. 

Gero Presser did his doctorate in decision-making theory in the field of artificial intelligence and at Quinscape he is responsible for setting up the business field of Business Intelligence with a focus on analytics and integration.

News from AWS re:Invent - How do you solve the complex data problem?


Yesterday at Amazon re:Invent, Werner Vogels, Amazon’s Chief Technology Officer outlined a vision for a modern data architecture that spans data ingestion, lifecycle management, data governance, orchestration and job scheduling.  This vision included the announcement of Amazon Glue. Glue is really two things – a Data Catalog that provides metadata information about data stored in Amazon or elsewhere and an ETL service, which is largely a successor to Amazon Data Pipeline that first launched in 2012.

Anyone familiar with Talend knows that we’ve been executing toward a similar vision for some time with our platform and solutions such as Talend Big Data Integration and recent innovations in Talend Data Preparation. One of the key differences with our vision is that whereas Amazon is looking to meet the needs of their platform users, Talend is addressing user requirements across a number of platforms, including multiple clouds, Hadoop, SaaS applications and traditional data warehouses.

While we share a similar vision, Talend and Amazon are addressing different needs within the data management market.  Amazon is approaching the problem from the perspective of a platform provider working to simplify the development of custom applications on their platform.  The easier it is for developers to build applications on AWS, the more value customers will get and the more they will use the AWS platform. Our focus is on the data integration developer that is solving more complex data integration problems.  These challenges typically span many data sources within and outside of Amazon and require deep, rich transformation, cleansing and governance capabilities. 

Amazon Glue is focused on Python developers hand coding applications on top of Amazon.  It provides useful tools aimed at streamlining data movement on top of the Amazon platform.  Its catalog tool gives developers visibility into where data lives and its ETL service allows developers to generate some starter Python code that they can then edit in their favorite IDE such as PyCharm.  While Amazon Glue should be a good productivity boost for developers building analytic applications on top of EMR or Redshift, it is not intended to address the deeper data integration needs of a dedicated data integration developer.

Our view is that over the next decade, enterprises will need to solve data challenges in a deeply hybrid world, where different workloads will run on a given platform based on the services provided. It’s also clear that many enterprises will hedge their bets on being too dependent on any single platform and use a mix of platforms such as AWS, IBM Watson, GE Predix, Salesforce, SAP, Google Cloud and Microsoft Azure.

Talend users are typically knowledgeable in Java, not Python, and they have a far deeper set of integration needs.  Our users will typically need to connect to tens or hundreds of data sources, both on-premises and in the cloud.  For example, Lenovo is using Talend to connect to 60 different data sources, including web logs, e-commerce, customer service and social media data.  Once these developers have access to this data, they typically need to do far richer data transformation.  For example, with EMR (Hadoop), Amazon would expect developers to program their data quality rules by hand in Python.  With Talend, we provide them with pre-built data quality components that run natively inside EMR or any other instance of Hadoop.

In the long run, we expect that Amazon will continue to expand its services, but with a focus on the analytic application developer, while Talend will continue to specialize in the deeper integration challenges.  Solving these deeper problems will require strong multi-cloud and on-premises capabilities, continued innovation around data quality, including business-friendly user interfaces for data preparation and rich data governance and lineage capabilities.  These are the requirements we are hearing from our customers and where we plan to focus our product development efforts.   

Amazon is a key partner for Talend.  In fact, earlier this year we announced that Talend is “All-in” on Amazon.  There are several opportunities for Talend and Amazon to work together to improve how these two types of developers collaborate.  Our first opportunity is likely to be integrating Talend’s data management tools on top of Amazon’s data catalog, and there may well be more mutually beneficial opportunities in the future.

Singapore Big Data Survey


Talend ran an online survey asking consumers in Singapore a series of questions related to data usage and privacy. The results provide some good insight into how big data impacts consumers’ lives in the region as well as their sometimes conflicted views on information sharing and confidentiality.

1.     When it comes to online shopping experiences, do you prefer (or would you prefer):

a.     A personalized experience that showcases products matching your style, shopping habits and tastes (61%)

b.     A standard shopping experience that showcases all the products the retailer has to offer, regardless of whether they are of interest to you (39%)

2.     Under what circumstances might you be willing to share personal information?

a.     I would provide my birth date for a 15% discount from my favorite retailer on my birthday (53%)

b.     I would provide my email address and clothing preferences to a retailer for highly personalized shopping recommendations (52%)

c.     I would let my insurance company monitor my driving to receive a 20% discount on my car insurance (44%)

d.     I would let my supermarket collect information about my grocery purchases if it meant I’d receive coupons for my favorite items (33%)

e.     I would share my health history to improve the preventative care recommendations I receive from my physicians (24%)

3.     What common applications do you use on your mobile phone that use big data?

a.     Dining recommendations (29%)

b.     Public transport app (bus schedules/arrival times) (59%)

c.     Uber / Grab (51%)

d.     Smart home applications (32%)

e.     Health/fitness applications (28%)

f.      Online shopping applications: Zalora, Lazada, Qoo10 (57%)

g.     Travel websites (43%)

4.     Which of the following personal data capture apps/devices do you own?

a.     Fitness Watch e.g. Nike/Fitbit (40%)

b.     Apple Watch (22%)

c.     Samsung Smart Watch (19%)

d.     Smartphone Health/fitness apps (49%)

5.     To what extent do you agree or disagree with the following statement?

It is okay for a store to use my past purchase history to create a profile of me that improves the services I receive.

a.     Strongly Agree (10%)

b.     Agree (44%)

c.     Somewhat agree (31%)

d.     Disagree (11%)

e.     Strongly disagree (7%)

6.     When you download a new app, what is your general approach when asked to allow information access and/or grant permissions (e.g. location)?

a.     I typically provide all the information access and/or permissions requested (33%)

b.     I’m selective about the information access and/or permissions I provide (67%)

7.     Is it important for you to know what happens with your personal data (e.g. age or credit card information) after you have shared it with a company?

a.     Very important (71%)

b.     Important (21%)

c.     Somewhat Important (6%)

d.     Not important at all (2%)

8.     Which of the following companies would you most trust with your personal information?

a.     Apple (36%)

b.     Google (42%)

c.     Amazon (4%)

d.     Netflix (3%)

e.     Facebook (7%)

f.      LinkedIn (8%)

9.     To what extent do you agree or disagree with the following statement? Companies should face fines if it is discovered that their poor data security led to a data security breach (i.e. the loss of personal consumer information such as credit card or banking details)?

a.     Strongly Agree (56%)

b.     Agree (35%)

c.     Somewhat Agree (4%)

d.     Disagree (2%)

e.     Strongly disagree (3%)

10.  To what extent do you agree or disagree with the following statement? Individual employees at companies should face penalties if it is discovered that their careless actions led to a data security breach of personal consumer information?

a.     Strongly Agree (56%)

b.     Agree (35%)

c.     Somewhat Agree (4%)

d.     Disagree (2%)

e.     Strongly disagree (3%)

11.  To what extent do you agree or disagree with the following statement? There should be national standards/regulations regarding data loss (e.g. companies must inform customers of data loss within a given period)?

a.     Strongly Agree (52%)

b.     Agree (32%)

c.     Somewhat Agree (12%)

d.     Disagree (1%)

e.     Strongly disagree (3%)

12.  How do you judge a brand that continues to target you with products that you have already purchased?

a.     It discourages me from buying anything else from that brand (20%)

b.    It annoys me, but it won’t change my buying habits (46%)

c.     It detracts from their brand image in my mind (17%)

d.     I just accept that it’s difficult for brands to target accurately (17%)

 

Top 5 Takeaways from AWS re:Invent 2016


The AWS re:Invent 2016 cloud conference, held last week in Las Vegas, was chock-full of product announcements. With more than 32,000 people attending, it became one of the largest IT conferences to date.

As a strategic partner, Talend went “all in” with AWS earlier this year. So, while I was attending the keynotes and sessions, I couldn’t help but think what all this meant to organizations interested in working with cloud-based solutions, particularly in Big Data, Integration, and Analytics. After some reflection, here are my top 5 takeaways:

#1: Data warehouse migration to Amazon Redshift continues to pick up steam 

Over the week, AWS held numerous migration sessions, with more than 20 of them focused on Amazon Redshift alone. Amazon Redshift is not new. It is a fast, simple, fully managed, petabyte-scale data warehouse. Clearly, AWS has made a commitment to ensure businesses understand how they can minimize costs and increase business agility. During these sessions, the speakers explored ways to select projects for the cloud, to plan and execute migrations, as well as to manage the IT transformation at a large scale. These sessions were packed, with additional attendees lining up and sitting on the floor outside. It’s great to see that cloud data warehouse migration continues to pick up steam among businesses. At Talend, we offer Talend Integration Cloud, built on AWS infrastructure for enterprise-scale, mission-critical data migration projects onto Redshift. And just two weeks ago, Talend’s Data Architect shared the ten lessons learned for building a better data warehouse in the cloud.

FREE TRIAL: Test Talend Integration Cloud for 30 Days 

#2: Big Data cloud migration made easy with Amazon Snowmobile and Snowball Edge

Last year, AWS announced Amazon Snowball, an appliance that can move terabytes of data from legacy systems. This year, AWS showcased the AWS Snowmobile, a truck that comes in handy if you need to move exabytes of on-premises data to the cloud (Amazon S3 or Amazon Glacier) in a matter of weeks. In the keynote, AWS CEO Andy Jassy also announced the general availability of a new Snowball data transfer appliance, the AWS Snowball Edge, which comes with onboard storage and compute capabilities. With it, businesses can easily transport up to 100 terabytes of data, which is more than twice the amount of data that the original AWS Snowball can handle. As adoption of public cloud continues, tools like AWS Snowmobile and Snowball Edge are key to enable enterprises to migrate Big Data with ease.

#3: More powerful compute and analytics tools are available in AWS

AWS announced a broad spectrum of compute capabilities in Jassy’s keynote. These new capabilities, along with several new virtual machine instance types, offer better performance for processing workloads. In fact, Talend customers have used several of these capabilities to address their needs. You can learn more about it from this AWS+Talend architecture. They also announced a new analytics tool, Amazon Athena, a serverless query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. In my view, it is a great tool that complements the existing Amazon EMR and Redshift services; Talend already integrates with Amazon S3, EMR, and Redshift, and the combination will significantly lower the bar for businesses to merge all of their data into one repository and to glean insights from high-quality data. And lastly, AWS announced another analytics tool, AWS Glue. Our CTO shared in his blog how Talend shares a similar vision yet addresses different market needs.
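For a sense of how lightweight the service is, here is a minimal sketch, with a hypothetical database, table and results bucket, of submitting an Athena query over data sitting in S3 using boto3.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit a standard SQL query against files in S3; "analytics_db", "web_logs"
# and the results bucket are all hypothetical.
response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the serverless query finishes; there is no cluster to manage.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
print("Query", query_id, "finished with state", state)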

#4: Continuous Integration and Deployment (CI/CD) is a key enabler for business agility

There were some exciting announcements for DevOps. AWS announced several new developer tools, such as AWS CodeBuild, which lets developers build and test code in the cloud, and AWS X-Ray, which helps developers identify the cause of performance issues and errors. These new tools, along with the existing AWS developer tools, enable IT teams to implement more rapid continuous integration and deployment processes, so organizations can ship application updates more frequently. This is a big boon for businesses seeking agility and improved team productivity by building technology that gives them a competitive advantage. At Talend, we also offer CI/CD to help our customers become more agile with their cloud and big data integration processes.
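As a small illustration of what this looks like from a pipeline script, and assuming a hypothetical CodeBuild project named "talend-job-ci", a build step could be triggered with a few lines of boto3.

import boto3

codebuild = boto3.client("codebuild", region_name="us-east-1")

# Kick off a build of a hypothetical project; CodeBuild provisions the build
# environment, runs the build and tests, and reports status back.
build = codebuild.start_build(projectName="talend-job-ci")
print("Started build:", build["build"]["id"])
print("Initial status:", build["build"]["buildStatus"])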

#5: Artificial Intelligence (AI) powers the next wave of Big Data Analytics

When Andy Jassy announced the first batch of AWS AI products at his keynote, I was amazed by the level of convenience they could bring to our daily lives. For example, Amazon Lex, which powers Amazon Alexa, lets you book a flight using voice commands; Amazon Polly transforms text into lifelike speech; and Amazon Rekognition recognizes faces in images. I then realized these services will in fact help companies build more customer-centric businesses. These services will process large batches of images and rich media files and improve their performance as they do. It is not surprising that as these services become more widely adopted, they will power new use cases for Big Data analytics.
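To give a flavor of how these services are consumed, here is a hedged sketch using boto3; the voice, the text, the bucket and the image file are all made up for illustration.

import boto3

# Text-to-speech with Amazon Polly: write the synthesized audio to a local file.
polly = boto3.client("polly", region_name="us-east-1")
speech = polly.synthesize_speech(
    Text="Your flight to Singapore departs at nine a.m.",
    OutputFormat="mp3",
    VoiceId="Joanna",
)
with open("confirmation.mp3", "wb") as audio_file:
    audio_file.write(speech["AudioStream"].read())

# Face detection with Amazon Rekognition on an image stored in S3.
rekognition = boto3.client("rekognition", region_name="us-east-1")
faces = rekognition.detect_faces(
    Image={"S3Object": {"Bucket": "example-media-bucket", "Name": "customer_photo.jpg"}}
)
print("Faces detected:", len(faces["FaceDetails"]))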

To sum up AWS re:Invent 2016: it was exciting and encouraging to see so many cloud players working together to help customers solve Big Data and Cloud challenges, powered by an ever-increasing array of dazzling and innovative technologies from AWS.

Where’s a Russian Linesman When You Need One? Talend Scores Highest Position in Visionaries Quadrant for Data Quality


As some might recall, in the 1966 World Cup, Sir Geoff Hurst, a former English international football (A.K.A. ‘soccer’) player, scored a controversial goal during England’s World Cup final match against West Germany at Wembley Stadium that helped his team secure a 4–2 victory. For many, the ball hadn’t crossed the goal line and therefore wasn’t a score; however, the Russian linesman saw things differently and ruled the goal ‘Good.'

When this year’s Gartner Magic Quadrant for Data Quality Tools was published this past week, for a moment, I thought “where’s a Russian linesman when you need one?” But that moment quickly passed as I recalled my statement from earlier this year—when we received the results of the Gartner Magic Quadrant for Data Integration Tools—that it’s not about the dot, it’s about the journey.

I’m sure my American colleagues will appreciate the words of NFL football player and coach, Vince Lombardi, who said: “The price of success is hard work, dedication to the job at hand, and the determination that whether we win or lose, we have applied the best of ourselves to the task at hand.”

When you apply this to Talend’s positioning in the 2016 Magic Quadrant for Data Quality Tools, the takeaway is: it’s all about moving the ball down the field—making a significant jump within the Visionaries quadrant vs. our 2015 placement, which is an overall triumph for the betterment of our customers and the market.

When describing the market in this year’s report, Gartner notes that increasing numbers of “organizations seek to monetize their information assets, curate external with internal data using a trust-based governance model, and apply machine learning as they explore the value of the Internet of Things (IoT).”[1] I believe our jump in this year’s Magic Quadrant for Data Quality report can be largely attributed to our focus on providing machine learning data quality capabilities for Big Data, and self-service data preparation capabilities with governance.

Over the last year, Talend has augmented our Data Quality capabilities in three key areas: Big Data, Self-Service, and Metadata Management.  We believe these are the three necessary ingredients for enabling the next generation of workers with tools that help them do their jobs faster and that are presented in a more consumable, simple, and integrated way. This approach is in line with the same shifts Gartner sees taking place in the market.

Let’s look at each one of these areas of innovation in turn:

1.     Data Quality on Big Data—Having high-quality data is a prerequisite for guaranteeing the business value that can be generated from analyzing and using the steadily increasing volumes of enterprise information. Otherwise, you just end up with a ‘garbage in, garbage out’ scenario, which results in poor decision making. Our latest data quality product innovations utilize machine learning to capture and automatically apply a lot of human decision making to data sets to cleanse, organize, and verify big data information stores at scale (see the sketch after this list for a simplified illustration of the idea).

2.     Self-Service—Many business workers today can’t get the information or data sets they need from IT in a timely enough manner to get their jobs done. Add to this the fact that more and more millennials are entering today’s workforce, nearly a third of whom—according to a survey of 1,050 Americans conducted by Conversion Research—would rather clean a toilet than talk to customer service (or in this case, IT). The millennial generation has no desire to wait on anyone to do anything for them—they’d rather be able to do things for themselves (i.e. not involve IT to get the information they need to do their jobs faster). By introducing self-service data preparation, we’ve put data quality in the hands of business users vs. IT. Not only does this self-service access to information make more workers happy—particularly millennials—but it also allows informed decision-making to take place enterprise-wide and enables IT to scale with emerging business needs.

3.     Governance—As businesses grow and enable more workers to access enterprise data lakes, IT needs to have tighter information security and auditing policies to ensure the quality of enterprise information is maintained over time. One way to do this at scale is to institute collaborative data governance, wherein all users make updates to the metadata to keep it up to date and accurate at all times.
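As promised above, here is a simplified, purely illustrative sketch of the general idea behind learning from human data quality decisions; it is not Talend’s implementation, and the pair features and labels are invented.

from sklearn.linear_model import LogisticRegression

# Each row describes a pair of customer records as similarity features:
# [name_similarity, address_similarity, phone_exact_match]
labelled_pairs = [
    [0.95, 0.90, 1.0],
    [0.30, 0.20, 0.0],
    [0.88, 0.75, 0.0],
    [0.15, 0.40, 0.0],
    [0.99, 0.97, 1.0],
    [0.10, 0.05, 0.0],
]
# 1 = a data steward judged the pair to be the same customer, 0 = distinct.
human_decisions = [1, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(labelled_pairs, human_decisions)

# New, unlabelled pairs can now be matched automatically, at scale,
# with the model's confidence available for review.
new_pairs = [[0.91, 0.85, 1.0], [0.22, 0.10, 0.0]]
print(model.predict(new_pairs))
print(model.predict_proba(new_pairs))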

With the many different expectations being placed on IT by today’s evolving workforce, coupled with the demands of big data, how should IT adapt or provision their services to succeed? At Talend, we believe the answer lies in the right combination of all three ingredients listed above—and that our current Data Quality solution embodies this perfect recipe.

So to bring it back to what I learned from the results of this year’s Gartner Magic Quadrant for Data Quality Tools report, as Vince Lombardi also said, “Perfection is not attainable, but if we chase perfection we can catch excellence.”

Again, it’s all about moving the ball down the field.


[1]“Magic Quadrant for Data Quality Tools,” by Saul Judah, Mei Selvage, and Ankush Jain, Gartner Research, November 2016. 

 
