
Open Integration Meets Metadata With The New Talend Metadata Bridge


Great news! Talend Metadata Bridge reached general availability (GA) on March 8th. Talend customers with an active subscription to any of our Enterprise or Platform products can download and install it as an add-on to our latest 5.6.1 Talend Studio. From 5.6.2 onwards, it will be installed automatically.

So, what does it bring to Talend developers, data architects and designers?

Metadata is data about data. Business metadata generally includes information such as the definition of business objects (a customer, for instance), their attributes (for example, a customer ID), the relationships between objects (a contract related to a customer), the business rules that apply to that information (an active customer has at least one open contract), the roles with regard to that information, and so on. It brings clarity to information systems, making them more usable and accessible as self-service resources for business users, and it brings auditability too, a key capability especially needed in heavily regulated industries.

Technical metadata is created by any tool that deals with data: databases, data modelling tools, Business Intelligence tools, development tools, enterprise applications, etc. In fact, metadata is a core capability for solutions and platforms that can bring a high level of abstraction to the IT technical layer, for example, for visual programming or Business Intelligence.

Talend is a perfect example: metadata is the cornerstone of our visual design capabilities, so metadata is not new to Talend. What Talend Metadata Bridge adds is the ability to exchange Talend’s metadata with metadata from other tools. In addition, the Excel Data Mapping tool allows Talend’s data transformation capabilities, such as mappings and transformations, to be exposed and authored directly in Excel.

Let’s run through some of the new capabilities. Please refer also to our new web page or online webinar for a more exhaustive overview. 

Faster design with the Talend Metadata Bridge

In many organizations, developers, application designers and data architects may not use the same tools when designing, implementing or maintaining systems. Designers may use tools that provide a very high level of abstraction but don’t deal with the technical details: they may use data, object or process modeling tools, such as CA ERwin Data Modeler, Embarcadero ER/Studio, SAP Sybase PowerDesigner, IBM InfoSphere Data Architect, etc. Developers use other tools such as a database, an ETL or a Business Intelligence tool. The lack of integration between these tools leads to inefficiencies during the implementation phase.

What Talend Metadata Bridge does is seamlessly integrate Talend with higher-level tools. It can also reverse-engineer existing Talend data jobs into the modeling tools and keep them in sync during the project life cycle. In addition, it not only synchronizes data models with Talend’s physical models, but it also synchronizes metadata across all tools because of its ability to export the metadata across databases and BI tools.

The aforementioned modeling tools are very good at designing and managing data models and data relationships inside a system. However, they don’t provide similar capabilities to manage the relationships between systems, which is the typical problem you are addressing when you use Talend. Although Talend Studio provides a high level of abstraction for those data integration processes, some stakeholders involved in the design of a system may still find it too complex for their design job.

(Excel screenshot – Talend Metadata Bridge)

This is where our new Excel Bridge for data mapping comes into play. It is an Excel add-in, delivered as part of the Talend Metadata Bridge, that allows you to design mappings with simple data transformations between data sources and targets, in a simple spreadsheet. Designers will enjoy it for prototyping, documenting, auditing, or applying quick changes to the transformation process, directly from the Excel interface they are familiar with. The Excel add-in includes a new “ribbon” with helper functions to format the sheet. It also provides drop-down lists in the cells for easy access to the source or target metadata. Through this new tool, collaboration between the designers and the developers becomes a matter of import-export, eliminating the traditional specifications / implementation / acceptance cycle. The developer deals with the connectivity and other technicalities of the job, while the designer or a subject-matter expert uses Excel as a front end to complete the mappings.

So what are the benefits of the Talend Metadata Bridge? It reduces implementation times and maintenance costs, increases data quality and compliance through better documentation and information consistency, and improves agility for change.

At the same time, it empowers designers and business users with simple authoring capabilities for mappings and transformations in Talend, accelerates development by using common formats for specifications and development, and avoids runtime delays by enabling quick fixes when unforeseen changes occur.

Re-platforming and ETL offloading

Data platforms are at the core of any information system. Changing the core of a system often seems like a daunting task, which is one reason why data platforms don’t change much over time. But there are times when change is needed. It happened twenty years ago when relational databases outperformed their alternatives for information management. It happened more recently, though to a much lesser extent, when alternatives to the traditional relational databases arrived, such as open source databases and data warehouse appliances. And now, as we engage with the Big Data trend, we see a new generation of innovative open source databases such as NoSQL stores, and data management environments such as Hadoop, that can manage more volume, variety and velocity of information at a fraction of the cost of their predecessors.

But re-platforming may appear to be a risky and costly project that could erode the benefits of the new technologies. Accelerators and a well-managed approach are needed to address this challenge. In this respect, Talend Metadata Bridge enhances Talend’s capabilities to handle migration projects. The bridge’s metadata connectors complement the existing Talend data connectors to automatically create the data structures in the new environment before moving the data itself. It therefore allows you to renew your platforms without losing your previous design and implementation investments, and it preserves existing development standards such as naming conventions.

When used in conjunction with Talend Big Data, it also dramatically streamlines the offloading of ETL processes to Hadoop: Existing ETL jobs can be converted into native Hadoop jobs that run without even requiring a proprietary run-time engine. In this scenario, the Talend Metadata Bridge can replicate the metadata from the legacy system to the one needed in the new Big Data platform.  Note that it is an accelerator, but not a magic box:  re-platforming is a project that needs a well-defined approach and methodology. This is something that we are investigating with System Integrators in the context of a new conversion program.

And that’s not all…

Metadata Management holds a lot more promise. Beyond the capabilities mentioned in this article, the Talend Metadata Bridge will drive new best practices in our Talend community, as we envisioned through the feedback of some of the experts who participated in our early adoption program. In addition, the Talend Metadata Bridge provides the foundation for future capabilities within the Talend Unified Platform. Stay tuned...


Talend – Implementation in the ‘real world’: what is this blog all about?


Let me introduce myself: My name is Adam Pemble and I am a Principal Professional Services Consultant based in the UK. My area of expertise is MDM (Master Data Management), along with the related disciplines of DI (Data Integration or what was traditionally known as ETL – Extract Transform Load) and DQ (Data Quality).

Now a little background: Talend has four main grades of consultant: Analyst, Consultant, Senior and Principal, with each grade bringing different levels of experience, technical knowledge and of course different price points. I have been with Talend for around four and a half years now, starting as a consultant (the first consultant in the UK team!) and working my way up. Before that, I worked for a competitor for three and a half years as an Analyst, then Consultant. I have two main roles: consulting for Talend customers and what we call ‘practice contribution’ – business development, defining best practices, training our consultants etc. When I am not consulting, I like to race cars – sponsored by Talend!

Talend implementation in the real world

So why am I telling you this? 

A little while ago I was asked by the Talend management team to start writing a blog for the website. I was given free rein to write about anything I liked (except cars – sadly). When I looked through the blogs and the bloggers that we already have on the site I realised that many of my colleagues were already doing a really good job of writing about the marketplace, the challenges faced by organisations, and where the industry is heading. All great stuff indeed; as an MDM practitioner, I’d particularly recommend reading the blogs of Mark (http://www.talend.com/blogger/mark-balkenende), Christophe (http://www.talend.com/blog/ctoum), Sebastiao (http://www.talend.com/blogger/sebastiao-correia) and Jean-Michel (http://www.talend.com/blogger/jean-michel-franco).

Given that I like to pick my own “lane” (pun intended), I thought I should use my blog to discuss real-world problems and use cases, as well as provide some practical examples of how these may be overcome. I thought this type of content might also serve to augment some of the other information we make available to current and potential customers – documentation, training, forums, etc. – which focuses on ‘how’ to do something, but not necessarily the ‘when’ and the ‘why’.

Let’s think about how most DI / DQ / MDM developers begin their journey with Talend. The practical reality is that if you have a decent sized project, your company will have chosen one of our Enterprise or Platform products. As a developer you may have been involved in the pre-sales process, but this is not a given. Perhaps you might have even downloaded and used one of the Talend Open Studio products, which was the catalyst for your company considering the purchase (if this is the case, and not that I am biased, well done!) Perhaps there is a Systems Integrator / partner involved in your project that you will work alongside or maybe you have your own development team. You may have used another DI tool in the past, perhaps worked extensively with Databases, or come from a coding background. Then again, maybe none of the above fits your particular situation. The truth is everyone starts at a different level and progresses at different speeds – this is only natural. Perhaps some people in your company think that Talend solves all their problems with no thought / effort required (encouraged by our marketing team I am sure). However, as practitioners, we know the reality is not quite as simple. What are the key factors in delivering a successful DI / DQ / MDM project?

- The right tools (aka Talend!)

- The right people at the right time – technical and business experts

- Experience

- Best practices / standards

- Analysis / Requirements / Design (i.e. a methodology which delivers results)

- Realistic expectations

All this can be rather daunting - so where do we begin?

The Talend training courses are a great place to start – incidentally, if you opt for an instructor-led course, this may be your first interaction with someone from our Professional Services team. The courses are a great first step that will put you on your way, but Talend is a powerful solution with a lot of depth, so mastering it will take time. Of course the disciplines of DI / DQ / MDM are complex, so no matter your background, it will take time and hands-on experience to be able to build truly ‘production-quality’ logic. I can’t quantify how long this will take because everyone is different (the quickest learners tend to be experienced practitioners who have used similar tools), but you are not alone as there are numerous resources available to you, some of which I have mentioned already.

You should also consider utilising our Professional Services consultants – most customers use us for architecture design and installation of Talend, but we can also help mentor you through the development journey. We live and breathe the tool on a daily basis and have been through the project lifecycle many times. In most cases, we will have implemented something similar before (not always though and we love a challenge!). No one Talend consultant is an expert in all Talend products as the platform is too big for one person to know everything. Given this, we tend to specialise – for example I only deal with our ESB and Big Data products at a high level – MDM, DI and DQ are my specialisms. Regardless of your needs though, I can guarantee we have staff on hand that can help.

My hope is that my blog entries can be a practical guide to real-world problems using Talend, and that it will give you a little more insight into the way we work in Talend Professional Services.

Thanks for reading!

Next time: A practical guide to Data Quality Matching.

Talend – Implementation in the ‘Real World’: Data Quality Matching (Part 1)


Data Quality (DQ) is an art form. If I was to define two simple rules for a project involving some element of DQ, they would be:

  1. Don’t underestimate the time and effort required to get it right
  2. Talk to the people who own and use the data and get continuous feedback!

In many cases the DQ role on a project is a full-time role. It is a continuous process of rule refinement, results analysis and discussion with the users and owners of the data. Typically the DI (Data Integration) developers will build a framework into which DQ rules can be inserted, adjusted and tested constantly without the need to rewrite the rest of the DI processes each time.

This blog focuses on the process of matching data, but many of the principles can also be used in other DQ tasks.

First of all, let’s understand what we are trying to achieve – why might we want to match?

- To find the same ‘entity’ (be it product, customer, etc.) in different source systems

- To de-duplicate data within a single system – or at least identify duplicates within a system for a Data Steward to be able to take some sort of action

- As part of a process of building a single version of the truth, possibly as part of an MDM (Master Data Management) initiative

- As part of a data-entry process to avoid the creation of duplicate data

To help us with these tasks I propose a simple DQ methodology:

As you can see, this is an iterative process and, as I said earlier, you are unlikely to find that just one ‘round’ of matching achieves the desired results.

Let’s look at each step on the diagram:

Profiling

Before we can attempt to match, we must first understand our data. We employ two different strategies to achieve this:

  1. Consulting:

- Reading relevant documentation

- Talking to stakeholders, end users, system administrators, data stewards, etc.

- Source system demonstrations

- Discussing the change in data over time

  2. Technical Profiling:

- Using Talend tools to test assumptions and explore the data

- Analyse the actual change in data over time!

Both strategies must be employed to ensure success. One thing I have found is that end users of systems are constantly finding ways to ‘bend’ systems to do things that business teams need to do, but that the system perhaps wasn’t designed to do. A simple example of this would be a system that doesn’t include a tick box that lets call centre operators know that a customer does not want to be contacted by phone. You may find that another field which allows free text has been co-opted for this purpose, e.g.

               Adam Pemble ****DO NOT PHONE****

This is why we cannot rely on the system documentation alone. A combination of consulting the users and technical profiling would help us identify this unexpected use of a field.

Typically for this step I would start by listing every assumption and query about the data – you should have at least one thing for every field of every table and file, plus row and table level rules. Next, design Talend Profiler Analyses and/or Data Integration Jobs to test those assumptions. These results will then be discussed with the business users and owners of the data. The reports produced by the DQ profiler can be a great way to share information with these business users, especially if they are not very technical. DI can also produce results in formats familiar to business users, e.g. spreadsheets.

Specific to the task of matching, some examples of assumptions we may wish to test:

  1. “In source system A, every Customer has at least one address and every address has at least one associated customer”
  2. “Every product should have a colour, perhaps we can use that in our matching rules? The colour field is free text in the source system.”
  3. “Source systems A and B both have an email address field – can we match on that?”
  4. “Source system X contains a lot of duplicates in the Customer table”

It is also important to analyse the lineage of each piece of data. For example, say we had an email address field. We may profile it and discover that it contains 100% correctly formatted email addresses. Is this because the front-end system is enforcing this, or is it just by chance? If it is the latter, our DI jobs may need to be written to cope with the possibility of an incorrect or missing email, even though none currently exist.
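To make this concrete, here is a minimal sketch of the kind of format check such a profiling pass or defensive DI job might apply. It is written as plain Java rather than as a Talend job, and the regular expression and sample values are purely illustrative assumptions:

```java
import java.util.List;
import java.util.regex.Pattern;

public class EmailProfilingSketch {
    // Deliberately simple pattern for profiling purposes, not a full RFC 5322 validator
    private static final Pattern EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    public static void main(String[] args) {
        // Hypothetical sample of values pulled from the source system's email column
        List<String> emails = List.of("adam@example.com", "no-reply@talend.com", "not an email", "");

        long invalid = emails.stream()
                .filter(e -> !EMAIL.matcher(e).matches())
                .count();
        System.out.println(invalid + " of " + emails.size() + " values fail the format check");
    }
}
```

The point is not the regex itself but the habit: even a column that profiles as 100% clean today deserves a guard like this in the job that consumes it.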

Note: I may write a future blog going into more detail about the importance of analysis before beginning to write any process logic.

Standardisation

Whilst performing the Analysis stage, it is likely that we will notice things about our data that will have an impact on our ability to match records. For example, we might profile a ‘colour’ column for a product and find results similar to those shown below:

What do we notice?

- Blk is an abbreviation of Black

- Chr is an abbreviation of Chrome

- Aqua MAY be a synonym of Blue

- Blu is a typo of Blue

- Etc.

If we were to do an exact match based on colour, some matches could be missed. Fuzzy matching could introduce false positives (more on this later).

To improve our matching accuracy, we need to standardise our data BEFORE attempting to match. Talend allows you to apply a number of different standardisation techniques including:

- Synonym indexes

- Reference data lookups

- Phone Number Standardisation

- Address Validation tools

- Other external validation / enrichment tools

- Grammar-based parsers

Let’s look at each of these in turn:

Synonym indexes

Our scenario with colours would be a classic use case for a synonym index. Simply put, a synonym index is a list of ‘words’ (i.e. our master values) and synonyms (related terms that we would like to standardise or convert to our master value). For example:

The above is an excerpt from one of the synonym indexes that Talend provides ‘out of the box’ (https://help.talend.com/display/TalendPlatformUniversalStudioUserGuide55EN/E.2++Description+of+available+indexes), in this case one that deals with names and nicknames. Talend also provides components to build your own indexes (the index itself is a Lucene index, stored on the file system) and to standardise data against these indexes. The advantages of using Lucene are that it is fast, it is an open standard, and we can leverage Lucene’s fuzzy search capabilities, so in some cases we can cater for synonyms that we can’t predict at design time (e.g. typos).

These jobs are in the Talend demo DQ project if you want to play with them. The indexes can also be utilised in the tStandardizeRow component, which we will discuss shortly.
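To illustrate the underlying idea, here is a minimal sketch using Apache Lucene directly (Lucene 8+ style), outside of Talend: an in-memory index that maps synonyms to master values, queried with a fuzzy search so that an unforeseen variant such as “bleu” still resolves to “Blue”. The field names and colour entries are illustrative assumptions, not the schema Talend’s components use internally:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SynonymIndexSketch {

    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index, fine for a sketch
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            addEntry(writer, "Black", "blk");
            addEntry(writer, "Blue", "blu");
            addEntry(writer, "Chrome", "chr");
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Fuzzy lookup: "bleu" is within edit distance 1 of the stored synonym "blu"
            FuzzyQuery query = new FuzzyQuery(new Term("synonym", "bleu"), 1);
            for (ScoreDoc hit : searcher.search(query, 1).scoreDocs) {
                System.out.println("Master value: " + searcher.doc(hit.doc).get("word"));
            }
        }
    }

    // One document per (master value, synonym) pair
    private static void addEntry(IndexWriter writer, String word, String synonym) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("word", word, Field.Store.YES));
        doc.add(new StringField("synonym", synonym.toLowerCase(), Field.Store.YES));
        writer.addDocument(doc);
    }
}
```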

Reference data lookups

A classic DI lookup to a database table or other source of reference data e.g. an MDM system. Typically used as a join in a tMap.
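Conceptually, this is just a keyed join against the reference set. Below is a minimal sketch in plain Java, with a hypothetical in-memory map standing in for the reference table that a tMap lookup would join against:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReferenceLookupSketch {
    public static void main(String[] args) {
        // Hypothetical reference data, e.g. loaded from a database table or an MDM system
        Map<String, String> colourReference = new HashMap<>();
        colourReference.put("BLK", "Black");
        colourReference.put("CHR", "Chrome");
        colourReference.put("BLU", "Blue");

        // Source values to standardise; unknown codes are passed through for later review
        List<String> sourceColours = List.of("Blk", "Chr", "Aqua");
        for (String raw : sourceColours) {
            String standardised = colourReference.getOrDefault(raw.toUpperCase(), raw);
            System.out.println(raw + " -> " + standardised);
        }
    }
}
```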

Phone Number Standardisation

There is a handy component in the Talend DQ palette that you should know about: tStandardizePhoneNumber.

It uses a Google library to try to standardise a phone number into one of a number of available formats, based on the country of origin. If it can’t do this, it lets you know that the data supplied is not a valid phone number. Take the example of the following French phone numbers:

+33147045670 

0147045670

They both standardise to:

01 47 04 56 70

This uses the ‘National’ format option. In their original form, we would not have been able to match on these records – after standardisation, we can make an exact match.
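The component is based on a Google phone number library (libphonenumber), which you can also call directly. Here is a minimal sketch reproducing the example above; treating “FR” as the default region for numbers without a country prefix is an assumption for illustration:

```java
import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

public class PhoneStandardiseSketch {
    public static void main(String[] args) throws NumberParseException {
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();

        for (String raw : new String[] {"+33147045670", "0147045670"}) {
            // "FR" is only used as the default region when the number carries no country prefix
            PhoneNumber parsed = util.parse(raw, "FR");
            // Both inputs standardise to the same national format: 01 47 04 56 70
            System.out.println(raw + " -> " + util.format(parsed, PhoneNumberFormat.NATIONAL));
        }
    }
}
```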

Address Validation tools

Talend provides components with our DQ offerings that allow you to interface with an external address validation tool such as Experian QAS, Loqate or MelissaData. Why would you want to do this? Well, if you are pulling address data from different systems, the likelihood is that the address data will be held in differing formats, e.g. Address1, Address2, etc. vs Building, Street, Locality, etc. These formats may have different rules governing the input of addresses – from no rules (free text) to validating against an external source. Even if two addresses are held in the same structure, there is no guarantee that the individual ‘tokens’ of the address will be in the same place or have the same level of completeness. This is where address validation tools come in. They take in an address in its raw form from a source and then, using sophisticated algorithms, standardise and match the address against an external reference source like a PAF file (Post Office Address file) – a file from the postal organisation of a country that contains all addresses, updated on a regular basis. The result is returned in a well-structured and, most importantly, consistent format, with additional information such as geospatial data and match scores. Take the example below:

Two addresses that are quite different from each other to a computer; however, to a human, we can see that they are the same address. Running the addresses through an address validation tool (in this case Loqate) we get the same, standardised address as an output. Now our matching problem is much simpler.

You might ask – can I not build something like this with Talend rather than buy another tool? I was once part of a project where this was attempted (not with Talend, it was a different tool) on data with quite poor-quality addresses. The issue is that addresses were designed to allow a human to deliver a letter to a given place, and there is a great deal of variation in how addresses can be represented. Six months of consultancy later, we had something that worked in most cases, but of course it was then realised that it would have been cheaper to buy a tool… Why is address validation not built into Talend, you might ask? There are a number of reasons:

- Not all customers require Address Validation – it would make the product more expensive

- Those customers that do may already be using one of the major vendors, and they don’t want a different, Talend-proprietary system

- Different tools on the market suit different needs – e.g. MelissaData is centred on the US

- Why should Talend re-invent the wheel, when we could just allow you to utilise existing ‘best of breed’ solutions?

Other external validation / enrichment tools

There are many tools available on the market, most of which are easy to interface with using Talend (typically via a web service or API). For example, Dun & Bradstreet is a popular source of company data, and Experian provides credit information on individuals. All of this data could be useful to an organisation in general and also potentially useful in matching processes.

Grammar-based parsers

Sometimes we will come across a data field that has been entered as free text, but which could contain multiple bits of useful information that we could use for matching, if only we could extract them consistently. Take for example a field that holds a product description:

34-9923  Monolithic Membrane  4' x  8' 26 lbs

 4' x  8' Monolithic Membrane  26lbs Green

Now again, as a human, we can see that there is a high likelihood that these two descriptions are referring to the same product. What we need to be able to do is identify all of the different ‘tokens’ – product code, name, dimensions, weight and colour – and create a set of rules to ‘parse’ out these tokens, no matter the order or variations in representation (e.g. the spaces in 26 lbs but not in 26lbs). Essentially, what we are defining is a set of simple rules for a language or ‘grammar’. Talend includes a variety of parsing components which can help you, from simple regular expressions through to tStandardiseRow, which lets you construct an ANTLR grammar.
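Before reaching for a full grammar, it can help to see the idea with plain regular expressions. The sketch below is hand-written Java with hypothetical token rules for this particular description format; it is not a tStandardiseRow/ANTLR grammar, just an illustration of token extraction:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DescriptionParserSketch {
    // Hypothetical token rules, tuned only for this particular description format
    private static final Pattern CODE   = Pattern.compile("\\b\\d{2}-\\d{4}\\b");
    private static final Pattern DIMS   = Pattern.compile("\\d+'\\s*x\\s*\\d+'");
    private static final Pattern WEIGHT = Pattern.compile("\\d+\\s*lbs");

    public static void main(String[] args) {
        String[] descriptions = {
            "34-9923  Monolithic Membrane  4' x  8' 26 lbs",
            " 4' x  8' Monolithic Membrane  26lbs Green"
        };
        for (String d : descriptions) {
            // Each token is matched independently, so order and spacing no longer matter
            System.out.printf("code=%s dims=%s weight=%s%n",
                    first(CODE, d), first(DIMS, d), first(WEIGHT, d));
        }
    }

    private static String first(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : "";
    }
}
```

Real product data quickly outgrows this approach, which is exactly why a proper grammar becomes attractive.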

A warning though: this is a hard task for even experienced professionals to get right. We are looking to include some additional intelligence in the tool to help you with building these sorts of rules in the future.

Next time: This blog continues with part 2: matching and survivorship / stewardship

Agatha Christie and the Challenge of Cloud Integration


Many classic detective novels progress in a familiar way: the hero, Hercule Poirot or Miss Marple for instance, has an incomplete understanding of how the crime played out and must painstakingly collect information from witnesses - filtering out lies from truth - until the big picture falls into place. In Agatha Christie’s best-selling novels, such as Murder on the Orient Express, this often leads to bringing all the characters together in the same room for the “reveal”. In this setting, an integrated story is told that ties everything together, the truth becomes obvious to all, and the police can take action.

Most businesses today are in the same predicament as those famous detectives. The data needed to compose a “complete view of the truth” is spread across numerous repositories and tools. Not only is data disconnected, it is also often contradictory. Customer data may be in a CRM solution, needing to be mapped to a Marketing Automation Platform. As well as lead lists coming in from events, user data may also reside in Customer Support tools and forums or on social media platforms.

Trying to bring together these disconnected data silos isn’t new for IT executives, but the situation is made more challenging today by the growing use of Cloud and SaaS solutions across the enterprise. Gartner reports that SaaS is now used on mission-critical projects and in production (Survey Analysis: Buyers Reveal Cloud Application Adoption Plans Through 2017). Talend research shows that half of businesses regularly use 10 or more SaaS solutions. This situation, more problematic than data silos, actually creates “islands of data”, each with its own data policies, access rules and regulations.

Lower overall cost, operational agility and increased speed – this is the promise of the Hybrid Cloud model, which orchestrates SaaS and on-premises systems with public and private clouds. However, this “new normal” is threatened by the lack of scalable, future-proof integration solutions that seamlessly connect SaaS applications with on-premises solutions. Executives who want to grow their business by becoming more data driven are in the same situation as our characters from the Agatha Christie novels: how do they quickly bring together all the information needed and resolve inconsistencies in order to gain a single, actionable view?

Connecting the data-driven enterprise with Cloud Integration

Leading companies are looking at iPaaS (integration Platform-as-a-Service) as a way to connect the data-driven enterprise – to connect all their data sources, cloud-to-cloud as well as cloud-to-ground. They are inevitably approaching a critical phase where the number of SaaS deployments, both controlled and rogue, is becoming unwieldy and needs to be brought together before the organization can fully leverage promising technologies like enterprise analytics or big data.

The benefits of an iPaaS solution to your data-driven enterprise:

  1. Drive Growth by connecting SaaS & On-premises applications and data: Use vendor-provided, community-designed or in-house connectors to bring together all the key applications inside your organization, connecting both applications rolled out as part of your IT strategy and those resulting from ad-hoc projects: CRM, marketing automation, but also file sharing sites, HR, ERP platforms, etc. Enable big-picture analytics and predictable forecasting, and fuel growth by identifying new opportunities, markets, customers, and products.
  2. Boost Security, Governance and Quality for richer data: Leverage controlled data integration to enforce proper standards, data formats, and security protocols. Leaps in data quality can also be achieved by eliminating manual errors and identifying inconsistencies, duplications or junk data before it pollutes your repositories.
  3. Streamline Business Processes through Automation: Increase team efficiency by slashing time wasted on manual operations, boosting employee motivation while encouraging the roll-out of continuously improved best practice templates. Automate data cleansing, standardization, and enrichment while helping data flow faster through the enterprise, making it actionable sooner.
  4. Trigger Breakthrough Innovation by empowering Citizen Integrators: Enable business users to try new ways to simplify their day job with simple, Web-based interfaces to connect data sources in a controlled manner without needing IT expertise. Encourage them to think outside the box and innovate.

Hybrid Cloud Integration is the key to improving your organization’s bottom line by realizing the full benefits of Cloud & SaaS. Talend offers one such solution, Talend Integration Cloud. As you would expect from Talend, our solution offers powerful, yet easy-to-use tools along with prebuilt components and connectors to make it easier to connect, enrich, and share data. This approach is designed to allow less skilled staff to complete everyday integration tasks so key personnel can remain focused on more strategic projects. With capabilities for batch, real-time and big data integration, we believe you’ll find Talend Integration Cloud a great way to deliver on your vision of a business-oriented and future-proof Hybrid Cloud strategy.

Of course if only Agatha Christie’s heroes had such easy and fast access to consolidated information, knowing exactly what happened and why, most of her novels would only be a few pages long, perhaps even shorter. Maybe there is a new genre ready to explode here? The 140-character #crimenovel!

MDM and the Chief Marketing Officer: Made for Each Other


When it comes to CMOs, I’m about as data-centric as they get. Early in my career, I worked as an economist for a consulting firm in Washington, D.C. I was happily awash in data and found myself analyzing such hot topics as the difference in prices of power tools in Japan and the United States.

Years later when I became a CMO, I thought to myself, “Here’s where I can use my love of working with lots of data to drive decision making and performance in marketing.” I was in for a rude surprise – the data spigot was badly broken. 

The reason for this data logjam quickly became apparent.  Most marketers are dependent on systems that were built to automate individual business functions such as sales, finance, and customer service, to name just a few.  And, despite advances in CRM, e-commerce, BI and marketing applications, very few CMOs can see across these siloed systems to get the insights they need to do their job. 

It’s a frustrating dilemma – marketers are unable to resolve this problem because they do not own the internal sales, finance and customer service systems, and do not control the processes that collect the relevant data. Each of these systems was designed to automate a specific function – none were created with the entire IT landscape in mind or built to inform marketing decisions. In most companies, no one is charged with pulling all this information together so the data remains in silos – solving specific functional problems, but not addressing the larger opportunities within the business.

I recently talked to the CMO of an eyewear company in the UK about this very problem. His is a classic example – given the company’s siloed systems, he is unable to analyze SKUs to identify such essential sales patterns as how well various colors and styles of frames are selling in different regions of the country. He’s not just frustrated, he’s angry about being handcuffed because of silo creep and the influx of unstructured, dirty and largely inaccessible data. If he does not fix this problem, there is no way that he can get what he needs out of his business intelligence initiatives to market effectively.

MDM to the Rescue

Master Data Management (MDM) is the answer. MDM was created to work across all of your enterprise’s systems – to pull together all your data, clean it and categorize it, providing you with a 360-degree view of your customers and insight into every aspect of your business.

MDM helps solve three major problems:

  1. Analytic MDM allows you to analyze and understand your entire customer base in order to segment customers and identify new opportunities and trends.
  2. Customer-360 MDM gathers all of your data about a single customer or product, including transactional information (e.g. site navigation path and past purchases). This allows a salesperson or customer service representative to leverage this 360-degree view and better sell to or service their customers on a day-to-day basis.
  3. Operational MDM enables all sales and service systems to work together on behalf of the customer. Systems are connected in real time to improve data quality and streamline the customer experience – for example, when a customer registers on a web site all other relevant systems are automatically updated.

To help implement the MDM solution, Talend has launched a new consulting services package, Passport for Master Data Management (MDM) Success. The service helps establish the foundation needed to ensure MDM projects are delivered on time, within budget, and address the needs of a company’s various lines of business – including marketing.

MDM’s Beneficial Impact

From a CMO’s perspective, MDM solves a lot of problems and alleviates a lot of frustration. 

MDM can help you build your business by:

  • Improving product characterization in order to track and understand what lines are selling and why
  • Uniting customer information so you can understand which customers have purchased which specific products and services, and launch successful new offerings
  • Improving marketing database segmentation so you can better target customers based on role, title, product interest and past purchases
  • Eliminating duplicates and reducing market spend
  • Improving ROI tracking accuracy by gaining more insights into the interaction between marketing spend, touches and sales
  • Improving the customer experience by tying all of your systems together

The efficacy of MDM was brought home to me during another customer visit – this one far more upbeat.  Based on projections from a recently installed MDM system, this customer forecasts an increase in e-commerce revenues of 11% because the system allows the customer service representatives to do a better job of cross-selling.  And, because the MDM system provides sales people on the floor with more insights into customer past purchases, in store sales are expected to jump by 7%.

And here’s what TUI UK and Ireland had to say about their MDM implementation: “This modernization project is a key enabler for improved customer experience, enhanced multi-channel opportunities, and a reduction of contacts with our contact center,” stated Louise Williams, General Manager of Customer Engagement at TUI UK and Ireland. “Talend is used to automatically merge customers to create a single golden record for each customer.”

For a marketing manager, a purpose-built MDM solution is the royal road out of the data management morass and an end to siloed systems.  It’s enough to make any CMO smile.

 

Self-Service is Great for Fro-Yo but is it Right for Integration?


I had cause to visit a self-service frozen yogurt wonder emporium on a recent visit to the U.S. It was delightful and, at first, a tad overwhelming – so many flavors to choose from, so many toppings. Needless to say, I over-indulged (goodbye, ideal running weight). Based on the number of similar establishments I saw during the rest of my stay, people seem to like the control and convenience of the self-service business model. And, whether you look at banking, booking travel, or personal tax filings, self-service or DIY certainly appears to be a broader trend.

It’s also starting to happen in IT. For some IT teams, this change is happening too fast; for some business users this change can’t happen quickly enough. This tension is evident in the case of data integration and analytics. Because today’s business decisions are increasingly data driven, there is a growing strain between data-hungry business users and cost-constrained IT organizations.  Users want to be able to tap into multiple data sources and employ a variety of integrated applications to further their initiatives, but many IT teams can’t keep up.

Here’s a typical scenario: A marketing team returns from a trade show with a gold mine of fresh leads sitting in Box or Dropbox.  The leads need to be moved into a marketing tool like Marketo to make them actionable, then on to Salesforce.com for additional CRM activity. Finally, leads are imported into a data lake where the information can be combined with other data sources and analyzed using a tool like Tableau.

More easily said than done: Making the connections between the various systems is often a highly manual activity, as is cleansing the prospect contact information, which is typically riddled with data inconsistencies, duplicate names and other quality challenges. Naturally, the business users turn to IT for help.

But that help may not be as forthcoming as the users would like – IT organizations saddled with budget cuts and diminishing headcounts have a lot on their plate.  The process might be further protracted if the information delivered isn’t exactly what is needed or if the business teams want to probe further based on the initial insights.   

This delay in access to information has encouraged some business users to work outside the boundaries of IT control, sometimes referred to as Shadow IT. Business users may select and try to integrate SaaS applications on their own, which contributes to rogue silos that do not adhere to company standards for quality, security, compliance and governance. 

In a recent Talend survey, companies told us that 50% of them have five or more SaaS apps. This figure is on the rise with IDC reporting that purchases of SaaS apps are growing 5X faster than on-premise. At the same time, Gartner predicts that 66% of integration flows will extend beyond the firewall by 2017 and 65% will be developed by the line of business.

Like my overflowing cup of frozen yogurt, independence can be costly in the traditional world of self-service integration. Users must spend hours each week cleaning and manually entering data – a mind-numbing, error prone activity that not only leads to inaccurate data, but also creates the perfect environment for procrastination. Dirty data starts to pile up like unwashed dishes in the kitchen sink.

Compounding the problem is the fact that the tools used to build integration solutions are complex, costly and require frequent updating. Most business users understandably lack the deep programming skills required to effectively code these systems.

Best of Both Worlds

The difficulties I’ve referred to actually present a great opportunity for IT to empower its users, streamline its own operations, and bring a new level of agility to the enterprise. Think “controlled self-service”.  Help has arrived in the form of the recently introduced Talend Integration Cloud – a secure cloud integration platform that combines the power of Talend’s enterprise integration capabilities with the agility of the cloud.

To be specific, with Talend Integration Cloud we provide four key capabilities:

- The platform makes it fast and easy to cleanse, enrich and share data from all your on-premises and cloud applications,

- We provide the flexibility to run those jobs anywhere – in the cloud or in your own data center,

- We support cloud and on-premise big data platforms allowing you to deliver next generation analytics at a fraction of the cost of traditional warehouse solutions,

- We enable IT to empower business users with easy-to-use self-service integration tools, so they can move and access data faster than ever before.

With regard to that last point, IT is able to turn its business users into “citizen integrators,” a term used by Gartner. Business users get the tools they need to automate processes that were once handled by IT, but IT retains a high level of visibility and control.

Talend Integration Cloud allows IT to use the full power of Talend Studio and apply more than 1,000 connectors and components to simplify application integration by its citizen integrators. IT either prepares all connectors and components, or validates connectors that are available through Talend Exchange, our community marketplace. This ensures that all connectors are in compliance with enterprise governance and security guidelines. 

And best of all, those problematic, standalone silos can be shelved. Talend Integration Cloud provides a whole new style of integration that allows IT to partner with business users to meet their integration needs. It may make both teams so happy that they decide a group outing is in order – self-serve fro-yo anyone?

                                                                                                                         

Infogroup and Talend Integration Cloud

Infogroup is a marketing services and analytics provider offering data and SaaS to help a range of companies – including Fortune 100 enterprises – increase sales and customer loyalty. The company provides both digital and traditional marketing channel expertise that is enhanced by real-time client access to proprietary data on 235 million individuals and 24 million businesses.

“We use Talend’s big data and data integration products to help our customers access vast amounts of contextually-relevant data in real-time, so they can be more responsive to their client’s demands,” said Purandar Das, chief technology officer, Enterprise Solutions, Infogroup. “Our work requires that we constantly innovate and perfect the systems driving topline revenue for clients. Talend Integration Cloud, which we’ve been beta testing, will enable us to integrate more data from more cloud and mobile sources leveraging our existing skills and prior Talend project work.”
                                                                                                                         

Bringing Benefits to IT, Users and the Enterprise

Among the many benefits associated with deploying Talend Integration Cloud are:

- Increased business agility and performance, thanks to easy-to-use data integration tools provided to business users,

- Increased governance and security within the compliance guidelines established by IT,

- Easier collaboration among business users, who can think more strategically and engage in dialogues that foster creativity,

- Simplified and dramatically reduced time to deployment, because all servers are in the cloud, minimizing the need for on-premises infrastructure,

- The ability to shift workloads from on-premises systems to a secure cloud, with cloud-to-cloud or cloud-to-ground connectivity,

- Lower cost and effort for maintaining and evolving these integrations, thanks to up-to-date visual representations of their status.

Plus you enjoy all the typical cloud benefits such as reduced cost, increased agility, faster speed, and lower total cost of ownership (maintenance of integration flows and connections is so much easier).

Talend Integration Cloud provides IT with an unparalleled opportunity to meet business users’ needs for data integration and self-provisioning, while maintaining the highest standards of governance and security.  It’s the best of both worlds.

The Power of One


When we last spoke, I talked about how Talend is working with data-driven companies to define and implement their One-Click data strategies. 1-Click, introduced by Amazon.com in 1999, allows customers to make on-line purchases with a single click – and is a showcase of how well they can turn massive volumes of shopper, supplier and product data into a customer convenience and competitive advantage.

Recently I’ve noticed some customers, particularly in the area of Internet of Things (IoT), are using a similar but different metric – “1 percent.” In this instance, “1” is used to describe the significant impact a fractional improvement in efficiency can have on major industries.  

The Positive Power of 1%

For example, take GE Water & Power, a $28 billion unit of the parent company. For nearly a decade, GE has been monitoring its industrial turbines in order to predict maintenance and part replacement.  Recently, the company has dramatically increased its ability to capture massive amounts of data from these sources and blend it with additional large data sets.

GE Water & Power CIO Jim Fowler, in a speech last year in Las Vegas, said that GE now has 100 million hours of operating, maintenance and part-swap data across the 1,700 turbines its customers have in operation. Each sensor-equipped turbine is producing a terabyte of data per day. That information has been combined and processed with external data such as weather forecasts. And the results clearly illustrate the power of one.

According to Fowler, GE is using the data to help its customers realize a tiny 1% improvement in output that adds up to huge savings – about $2 to $5 million per turbine per year.  While that’s impressive enough, when one considers the total savings across all 1,700 turbines over the next 15 years, 1% efficiency savings could equate to a staggering figure in the range of $66 billion.

Another excellent example of the Power of One comes from the OTTO Group in Europe, also a Talend customer. OTTO is the world’s second-largest online retailer in the end-consumer (B2C) business, reporting 6 billion euros in online sales in 2013. The company is using the Talend platform to “…make quicker and smarter decisions around product lines, improve forecasts, reduce leftover merchandise and importantly, improve our customer experience,” says Rupert Steffner, Chief BI Platform Architect of the company’s IT organization.

For OTTO, like any other online retailer, shopping cart abandonment is a major challenge. Industry reports note that $4 trillion worth of merchandise will be abandoned in online shopping carts this year. By applying solutions made possible by access to extensive customer data, OTTO estimates it can predict with 90% accuracy customers who are likely to abandon a cart. It’s not hard to see how this information could be used to send incentives and promotions to these customers before they leave the site and their carts. Even if such activities were to net only a 1% change of fortune, when you consider it’s a $4 trillion issue, such a shift would be extremely meaningful.

Another European customer exemplifying the Power of One is a financial services company. A 1% improvement in cross selling its insurance policies to their existing customer base (a tiny fraction of their overall business) has resulted in a return of 600,000 euros over the past year.

GE – A Data Driven Enterprise

GE, by the way, is totally committed to making the most of Big Data.  It has coined the term “The Industrial Internet” – a combination of Big Data analytics with the Internet of Things (IoT). The challenge, says CIO Fowler, is to build an open platform for ingesting and sharing Big Data to build new, highly secure applications. Talend is lending a hand.

Talend 5.6, our latest release, sets new benchmarks for Big Data productivity and profiling, provides new Master Data Management capabilities with advanced efficiency controls, and broadens IoT device connectivity.

Working with machine-generated content or IoT devices is enhanced by the platform’s support for two related protocols – MQTT and AMQP – that allow sensor data to be ingested into a company’s real-time data integration workflows. It supports the latest Hadoop extensions, Apache Spark and Apache Storm, providing significant performance advantages and supporting real-time and operational data projects.
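As a rough illustration of what ingesting such sensor data looks like at the MQTT protocol level, here is a minimal sketch using the open source Eclipse Paho Java client directly, not Talend’s components; the broker URL, client id and topic filter are hypothetical:

```java
import org.eclipse.paho.client.mqttv3.IMqttMessageListener;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class SensorIngestSketch {
    public static void main(String[] args) throws MqttException {
        // Hypothetical broker and client id; a real deployment would add auth/TLS options
        MqttClient client = new MqttClient("tcp://broker.example.com:1883", "iot-ingest-sketch");
        client.connect();

        // Each arriving reading would be handed to the downstream integration flow
        IMqttMessageListener listener = (String topic, MqttMessage message) ->
                System.out.println(topic + " -> " + new String(message.getPayload()));

        client.subscribe("sensors/turbine/+/temperature", listener);
    }
}
```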

Talend 5.6 is ready to help any data-driven enterprise realize the Power of One – and that’s just for openers. For smaller companies, the Talend solution may generate even greater returns – efficiencies on the order of 10% to 20%. Now that’s Power!

Spaghetti alla Cloud: Prevent IT Indigestion today! (Part I)


Spaghetti alla Cloud? It’s what’s on the menu for most organizations today. With the explosion in popularity of SaaS applications, as well as PaaS (Cloud platforms) and IaaS (Cloud infrastructures), most IT architectures and business flows resemble a moving, tangled mess of noodles. I’m pretty sure that if you dig deeper, you’ll find some pretty old, legacy meatballs in there too.

The idea that our IT architectures remind us of a bowl of spaghetti isn’t new; in fact, the complexity of integrating on-premises applications and legacy systems has been a challenge for decades, leading many companies to try out SOA (Service oriented architecture) as well as ESB platforms (Enterprise Service Bus). Unfortunately, a number of new challenges are putting even greater strain on our existing architectures.

- The explosion of SaaS (Software-as-a-Service): Cloud solutions are increasingly popular with IT teams given their promise of agility, cost reduction and speed. Today, the average company uses 923 cloud services. This means incessant waves of new applications, each with disconnected islands of data,

- Even more SaaS when you consider how easy it is for business users to start using a tool without informing IT. It’s not rare to find “free” versions used in important business processes (survey tools for instance), as well as add-ons and extensions acquired directly from cloud platforms such as Salesforce AppExchange or AWS Marketplace. Some sources report that the average person now uses 28 cloud apps regularly.

How many Cloud or SaaS applications do organizations run? At least twice the number they think they run.

- Cloud churn is a fact of today’s fast-paced cloud market, where cloud businesses come and go rapidly, meaning that you might lose access to your provider and therefore your data. Users don’t hesitate to move to a better or cheaper solution, and seem less attached to Cloud solutions than their on-premises equivalents,

- Hybrid, Private and Public clouds, different implementation flavors of cloud computing, add an additional challenge of integrating through the firewall,

- The changing nature of modern business means that organizations are continually adapting, adjusting and adopting new technologies and practices, 

- And of course, let’s not even talk about the Internet of Things quite yet, but the size of your plate is about to explode.

The result? Spaghetti alla Cloud that can leave your business bloated, when cloud computing was supposed to be a liberating, game-changing paradigm instead.

Untangling your spaghetti architecture

Overloading on Spaghetti alla Cloud can have a significant impact on your organization’s competitive edge, just as there is a fine line between an athlete carb-loading before a race and a couch-potato with stomach cramps.

TechTarget's 2015 IT Priorities Survey provides insight into the top IT projects: Big data integration, data warehousing, business intelligence and reporting all depend on having a high quality, up-to-date data feed. As the IT saying goes, “garbage in, garbage out,” meaning that bad input will always result in bad output.

Developing your own point-to-point enterprise application integration (EAI) patterns is, naturally, a bad idea. The cost of development, implementation and maintenance would be prohibitive; it’s important to think about the long run and the total cost of operations. Unfortunately, the traditional alternative, a SOA + ESB approach, also has a very high cost of implementation, meaning it’s rarely rolled out properly and even less frequently kept up to par with business needs.

Your application integration solution must be robust while not hindering your agility. Flexibility and scalability are the key for both software and the supporting hardware and this is where Cloud options really shine. They lower the total cost of ownership for servers, bringing on-demand scalability and capacity, while improving performance for globally distributed users.

In part II of this post, we’ll explore how solutions for cloud integration, such as Talend Integration Cloud, help prevent IT indigestion and we will uncover four key tips for selecting your next Cloud Service Integration solution.


The Union of Real-Time and Batch Integration Opens Up New Development Possibilities


Hadoop's Big Data processing platforms feature two integration modes that correspond to different types of usage, but are being used interchangeably with increasing frequency. "Batch" or "asynchronous" mode enables the programming of typically overnight processing. Examples of using batch mode include a bank branch integrating the day's deposits into its books, a distributor using or updating a new product nomenclature, or a business owner consolidating sales for all branches for a given period. The primary advantages of using batch mode include the ability to process huge data sets and meet most traditional corporate analytics needs (business management, client and marketing expertise, decision-making support, etc.).

However, one of the limits of batch processing is its latency, which makes any real-time integration impossible. This poses a delicate problem for companies that need to meet client demands on the spot: making a recommendation to an Internet user in the middle of a purchase (think Amazon), posting an ad on a website aimed at a specific Internet user within a matter of milliseconds, taking immediate stock of variable factors in order to improve decision-making (such as weather or traffic conditions), or detecting fraud.

In the Hadoop ecosystem, a new solution to this problem has emerged: Spark, developed by the Apache Foundation, is now offering a synchronous integration mode (in near real-time), also referred to as "streaming". This multifunction analysis engine is well adapted to fast processing of large data sets and includes the same functions as MapReduce, albeit with vastly superior performance. Namely, it enables the management of both data acquisition and processing, all while offering a processing speed that is 50 to 100 times greater than that of MapReduce.
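For a feel of what streaming mode means in practice, here is a minimal Spark Streaming sketch in Java, written directly against the open source Spark API rather than generated by Talend; the socket source and the five-second micro-batch interval are arbitrary choices for illustration:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]");
        // Micro-batches every 5 seconds: near real-time ("streaming") processing
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Hypothetical source: one click/sensor event per line arriving on a socket
        JavaReceiverInputDStream<String> events = jssc.socketTextStream("localhost", 9999);

        // Count events per micro-batch; the same transformation style applies to batch RDDs
        JavaDStream<Long> counts = events.count();
        counts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```

The same counting logic expressed over a batch RDD would run as an overnight job; the streaming context is what turns it into a continuous, near real-time process.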

Today, Talend supports both of these integration modes (while making it possible to switch from one to the other in a transparent manner, whereas the majority of solutions on the market will require a total overhaul of the data integration layer). Not only does it simplify processing development, it also simplifies the management of the overall life cycle (updates, changes, re-use). In the face of increasing complexity when it comes to big data-related technological offerings, Talend strove to ensure its support of all Hadoop market distributions (especially the most recent versions), while masking their complexity through a simple and intuitive interface. Spark is now at the heart of Talend's batch & real-time integration offer.

What's more, Spark keeps adding new functions that, against this real-time backdrop, give companies expanding options. One example is support for "machine learning", now a native Spark feature (MLlib), whose primary advantage is to improve processing based on learning. Combining batch and real-time processing to meet today's corporate needs is also just around the corner: think of a processing chain that uses weekly (batch) sales figures to build predictive models, then applies them in real-time mode to speed up decision-making and avoid missing opportunities as they arise.

The advantages are obvious for e-commerce sites (recommendations), and for marketing in general: combining browsing history data with the very latest information from social networks. For banks, the creation of a "data lake" where all market data (internal and external) is compiled with no volume restrictions enables the development of predictive programs that integrate other types of data. The same approach lets huge volumes of data be mined for pertinent information in order to anticipate a range of scenarios, such as predictive maintenance.

At the end of the day, this concerns every business sector: agriculture, wholesale distribution, services, digital service providers, manufacturing, the public sector, and so on. The advent of this new type of tool gives companies unprecedented analytical potential and will help them align more accurately with the current reality of their business. Talend is the only player in the big data arena to, on the one hand, offer a transformation and data processing solution designed specifically to capitalize on both batch and real-time data integration and, on the other hand, to combine Big Data with all of the traditional integration functions (Data Quality, MDM, Data Governance, etc.), addressing the needs of the largest IT organizations for whom an enterprise-ready solution is simply not optional.

Why Everyone Will Become a Part-Time Data Scientist


Your job description just changed. 

Take a look around you – Big Data is no longer a buzzword. Data volumes are exploding and so are the opportunities to understand your customers, create new business, and optimize your existing operations.

No matter what your current core competencies, if you’re not a part-time data scientist now, you will be.

The ability to do light data science (you don’t have to become a full bore PhD data maven in this new environment) will be as powerful a career tool as an MBA. Whether you’re in finance, marketing, manufacturing or supply chain management, unless you take on the mantle of part-time data scientist in addition to your other duties, your career growth might be stymied.

Successful companies today are data-driven.  Your role is to be one of those drivers. As a “data literate” employee comfortable slicing and dicing data in order to understand your business and make timely, innovative decisions, you can have a positive impact on your company’s operations and its bottom line.

For example, I’ve personally found that being able to drill into Talend’s marketing data has yielded critical insights. Analysis indicates that the adoption of Big Data – a key driver for part of our business – is much further along in some countries than in others. As a result, different marketing messages resonate better in certain locales than others. Combine that data with web traffic activity, the impact of holiday schedules (France, for example, has a rash of holidays in May), weather patterns and other factors, and we come up with a much clearer picture of how these various elements impact our marketing efforts.  I can’t just look at global trends or make educated guesses – I need to drill into campaign data on a country-by-country basis.

So my recommendation to you is to dive in and get dirty with data. The good news is that you can become data literate now without spending years in graduate school. 

Start by becoming comfortable with Excel and pivot tables – a data summarization tool that lets you quickly summarize and analyze large amounts of data in lists and tables. (Microsoft has put quite a lot of work into its pivot tables to make them easier to use.)

Learn how to group, filter, and chart data in various ways to unearth and understand different patterns.

Now, once you’ve mastered these basics, you’ll feel comfortable bringing new data sources into the mix – like web traffic data or social media sentiment.  You will realize that you can aggregate this data in much the same way as you are able to analyze basic inventory levels or discounting trends.

In the case of Talend’s marketing operation, we are using the Talend Integration Cloud to bring together data from our financial, sales and marketing systems. This allows us to better understand and serve our customers and determine who should be targeted for new products and services. By taking this approach, you don’t have to wait for weeks or months for IT to conduct the analysis – these new tools provide results in hours or even minutes.

In the future, with the introduction of new data visualization tools, working with big data will become far easier for the growing ranks of your part-time data scientists.  If you’re already comfortable with spreadsheets and statistics and have the core competence to spot different patterns in your data as you roll it up by week or by month, spotting trends using data visualization will be 10 times easier as you make the transition from a spreadsheet.

And, be sure to update your job description.  You’ve just joined the growing ranks of smart business users who have earned their part-time data scientist chops. Today this is a highly desirable option; tomorrow it will be mandatory.

Sporting Lessons to Kick-Start Big Data Success


As football teams around the world enjoy their pre-season break, these can be exciting but anxious times, especially for the teams and players making the step up to a higher division after the success of promotion.

The potential rewards are substantial, but making the leap to the higher echelons will be challenging, and preparation will be everything.  Businesses migrating from traditional data management to big data implementations will be experiencing similar feelings – trepidation mixed with determination to make the most of the opportunity. As with pre-season, the time spent in the run-up to a new project will be equally important for businesses as they plan methodically to ensure long-term success.

Not everything will be new of course. The basics of traditional and big data management approaches are similar. Both are essentially about migrating data from point A to point B. However, when businesses move to embrace big data, they often encounter new challenges. 

With the summer transfer window now open, clubs and managers focus on bringing on board new talent and skills to ensure they give themselves the best opportunity for success.  Businesses too will concentrate on ensuring they have the skills and tools in place. Over time, businesses will increasingly need to deliver data in real-time on demand, often to achieve a range of business goals from enhanced customer engagement to gaining greater insight into customer sentiment or tapping into incremental revenue streams.

It won’t always be straightforward. The volume of enterprise data is increasing exponentially. Estimates indicate it doubles every 18 months. The variety is growing too, with new data sources, many unstructured, coming on stream continuously. Finally, with the advance of social media and the Internet of Things, data is being distributed faster than ever and businesses need to respond in line with that increasing speed. 

These trends are driving the compelling need for organizations to migrate to big data implementations. But as traditional approaches to data management increasingly struggle to manage in this new digital world, businesses look for new ways to avoid driving costs sky-high or taking too long to reach viable results.

The emergence of big data requires businesses to move to a completely new architecture based on new technologies: from the MapReduce programming model, to the real-time, in-memory streaming capabilities of Apache Spark and Apache Storm, to the latest high-powered analytics solutions.

There is much for businesses to do. From learning new technical languages and building new skills, to governance, funding and technology integration. Getting this right isn’t going to be an overnight success and businesses need to set realistic expectations and goals – just as in sport, managers whose teams are new to the top flight need to take a pragmatic approach and not be too dispirited if they fail to match the top teams at the first attempt.

This is where testing environments can play a key role too. At Talend, we’ve developed a free Big Data Sandbox to help get people started with big data – without the need for coding. In this ready-to-run virtual environment, users can experience going from zero to big data in under 10 minutes! 

We have also identified five key stages to ensuring big data readiness:

  • The exploratory phase
  • The initial concept
  • The project deployment
  • Enterprise-wide adoption
  • And finally, optimization.

Here are some key goals businesses will need to accomplish at each stage of their journey in order to ultimately achieve big data success:

In the initial exploratory phase, the focus should be on driving awareness of the opportunities across the business. Organizations therefore first need to become familiar with big data technology and the vendor landscape; second, find a suitable use case e.g. handling increasing data volumes and third, provide guidance to management on next steps.

The second phase is around the design and development of a proof of concept. The overarching aim should be IT cost reduction but the key landmark goals along the way will typically include building more experience in big data across the business, not least in order to better understand project complexity and risks; evaluating the impact of big data on the current information architecture and starting to track and quantify costs, schedules and functionality.  

The next stage moves the project on from theory to practical reality. The project deployment phase specifically targets improved performance. Key goals include achieving greater business insight; establishing and measuring ROI and KPI metrics; and developing data governance policies and practices for big data.

Enterprise-wide adoption drives broader business transformation. It is here that businesses should look to ensure that business units and IT can respond faster to market conditions; that processes are measured and controlled and ultimately become repeatable.  The final level of readiness is business optimization. To achieve this, organizations should look to use the insight they have gained to pursue new opportunities and/or to pivot the existing business.

My final recommendation is to make sure you build a clear and pragmatic execution plan, detailing what you want to achieve with big data success. Failure to do this may mean you don’t get the funding or support for a second project. It’s a bit like getting relegated at the end of your first season.

Fancy yourself as a data rock star? Find out how ready you are for big data success with our fun online quiz.

Spaghetti alla Cloud: Prevent IT Indigestion today! (Part II)


In part I of this two-part post, we learned why the IT architecture supporting modern business feels bloated when cloud computing was supposed to be a liberating, game-changing paradigm instead, and why it is critical to address this issue as soon as possible.

Solutions for Cloud Integration

Cloud Service Integration solutions, also known as iPaaS (integration Platform-as-a-Service), are a new generation of cloud-native offerings that sit at the intersection of data integration and application integration, bringing the best of both worlds with hybrid integration across cloud and on-premises data sources and applications.

Cloud Service Integration solutions are typically built around connectors and actions. The former connect to applications and data sources, in the cloud and on-premises, implementing the service call and processing the input/output content. Most solutions provide pre-built connectors, as well as a development environment to build native connectors to applications instead of having to implement custom web services. Integration actions provide additional control over the data: they check data quality, convert to different formats, and offer additional activities for managing, merging or cleansing your data.
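To make the connector/action split more concrete, here is a deliberately simplified sketch in Java. The interfaces, method names and the "email" field are hypothetical and do not correspond to any particular vendor's SDK; they only illustrate how connectors move records while actions inspect and reshape them.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Hypothetical shapes only -- real iPaaS SDKs differ by vendor.
    interface Connector {
        List<Map<String, Object>> read(String resource);                // pull records from an app or data source
        void write(String resource, List<Map<String, Object>> records); // push records back out
    }

    interface Action {
        // An action sits between connectors to check, convert or enrich the records flowing through.
        List<Map<String, Object>> apply(List<Map<String, Object>> records);
    }

    // Example: a trivial data quality action that drops records missing an email address.
    class RequireEmailAction implements Action {
        @Override
        public List<Map<String, Object>> apply(List<Map<String, Object>> records) {
            return records.stream()
                          .filter(r -> r.get("email") != null)
                          .collect(Collectors.toList());
        }
    }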

Enterprise offerings will include additional key functionality such as administration and monitoring to check for data loss and errors, schedule job execution, and assist with the set-up of environments and templates.

4 tips for selecting your Cloud Service Integration solution

As with all Cloud solutions, it can be daunting to sift through all the different offerings and identify which is right for you. The following tips will help you pick the right future-proof, cost-effective solution.

  1. Future-proof support of Data Quality and Big Data Integration: Connecting without improving quality is a waste of time, but many vendors do not provide out-of-the-box data quality actions. Likewise, your top priorities most definitely should include the roll out of Data Warehousing, Analytics or Big Data, meaning you will need support for MapReduce and Hadoop integration;
  2. User interfaces adapted to different needs: Your successful deployment will require adoption by multiple categories of users, each with different needs and expectations. For instance, developers will want a powerful IDE, while business users (“citizen integrators”) might prefer a simplified Web interface;
  3. Avoid lock-in with proprietary technologies and languages: prioritize solutions built on open source projects with existing communities (Apache Software Foundation projects are a great starting point), and using popular technologies such as Java (making it easier to find development and support resources). If the components that are developed can be reused across your other top initiatives (Big Data, Master Data Management…) this would of course be a major boost.
  4. Check for hidden costs: beware of packages that seem attractive but don’t include the actual connectors and actions! Having to pay extra for “premium connectors” (SAP, Salesforce…) will significantly increase your total cost of ownership as your architecture grows over time, hurt your ability to evolve with agility, punish success, or even stall the project as extra budget isn’t available to fuel your growth.

Cloud Service Integration might just be your recipe for success, turning your Spaghetti alla Cloud into the fuel for your organization’s future success. It can enable you to realize the full benefits of Cloud & SaaS while successfully implementing your top priority projects, such as Big Data integration, data warehousing, business intelligence and enterprise reporting.

One such option to consider is…prepare for shameless plug…Talend Integration Cloud.  Affectionately referred to as “TIC” internally, Talend Integration Cloud is a secure and managed cloud integration platform that we believe will make it easy for you to connect, cleanse and share cloud and on-premises data. As a service managed by Talend, the platform provides rapid, elastic and secure capacity so you can easily shift workloads between the ground and cloud, helping increase your agility and lower operational costs. Or, in other words, it’s like going beyond a solid shot of Pepto-Bismol to soothe your IT indigestion and making sure you are fueling your organization’s performance like an elite athlete.

Data Preparation: Empowering The Business User


A growing number of business users with limited knowledge of computer programming are taking an interest in integration and data quality functions as companies become more and more “data-driven”. From marketing to logistics, customer service to executive management, HR to finance, data analysis has become a means for all company departments to improve their process productivity and efficiency. However, even with the cloud-based graphical development capabilities offered by software vendors like Talend, these tools remain largely reserved for computer scientists and specialists in the development of data integration jobs.

For example, today a marketing manager wanting to launch a campaign has to go to his or her IT department in order to obtain specifically targeted and segmented data. The marketing manager has to spend time describing their needs in detail, the IT department has to set aside time to develop the project, and then both have to conduct initial tests to validate the relevance of the development. In this day and age, when reaction time means everything and against a backdrop of global competition in which real-time has become the norm, this process is no longer a valid option.

And yet, business managers simply don't have the time to waste and need shared self-service tools to help them reach their goals. The widespread use of Excel is proof. Business users manage to the best of their ability to make their data usable, which means they spend 70 to 80% of their time preparing this data, without the assurance of even having quality data. Furthermore, the lack of centralized governance represents a risk in terms of the very use of the data including privacy and compliance issues, even problems with data use (such as licensing issues).

These are very common constraints, and users need specific tools to manage enrichment, quality and problem-detection issues. Intended for business users, this new type of data preparation solution must be based on a shared, Excel-like interface and must offer a broad spectrum of data quality functions. In addition, it must offer visualization and, it goes without saying, transformation functions that are easier to use than Excel macros and specialized for the most commonly used domains, in order to ensure adoption by business users.

For example, by offering a semantic recognition function, the solution could enable automatic pattern detection and categorization, while simultaneously indicating potentially missing or non-compliant values. By also offering a visual representation mode based on color codes and pictograms, the user is able to better understand his or her data. In addition, an automatic data class recognition function (personal information, social security number or credit card, URL, email, etc.) further facilitates the user's task.
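As a minimal illustration of that kind of rule-based data class recognition, here is a small Java sketch. The patterns are deliberately simplified and hypothetical; real solutions combine dictionaries, checksums (such as the Luhn check for card numbers) and statistical models.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class DataClassRecognizer {

        // Simplified patterns for illustration only; production rules would be far stricter.
        private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
        static {
            PATTERNS.put("EMAIL", Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"));
            PATTERNS.put("US_SSN", Pattern.compile("^\\d{3}-\\d{2}-\\d{4}$"));
            PATTERNS.put("CREDIT_CARD", Pattern.compile("^\\d{13,19}$"));
            PATTERNS.put("URL", Pattern.compile("^https?://\\S+$"));
        }

        public static String classify(String value) {
            if (value == null || value.trim().isEmpty()) {
                return "MISSING";                     // flag empty cells for the user
            }
            for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
                if (e.getValue().matcher(value.trim()).matches()) {
                    return e.getKey();
                }
            }
            return "UNKNOWN";                         // candidate for manual review
        }

        public static void main(String[] args) {
            System.out.println(classify("jane.doe@example.com")); // EMAIL
            System.out.println(classify("123-45-6789"));          // US_SSN
            System.out.println(classify(""));                     // MISSING
        }
    }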

But if the company settles for providing self-service tools, it is only addressing part of the challenge and is “neglecting” the issues related to the lack of data governance. The IT department, as competent as it may be, generally retains control of the data, which can give rise to a “Tower of Babel” when users extract their own versions of the original data. A data inventory function would therefore enable the data sets the company opens up to “self-service” to be itemized and certified by the IT department, while being directly managed by business users. This would enable the implementation of a truly centralized and collaborative platform, giving access to secure and reliable data while reducing the proliferation of different versions.

What's more, this shared and centralized platform can help IT control the use of data by way of indicators like the popularity of data sets and the monitoring of their use, or by programming alerts to detect problems with data quality, compliance or privacy as early as possible. Tracking is the first step in a good governance plan. All in all, it is a win-win situation for everyone: the business user is happy to have access to self-service data sets and to be self-reliant and agile in carrying out the data transformations necessary for his or her business; in the same breath, IT delegates more effectively to its users while putting good data governance conditions in place.

However, a new pitfall of the “self-service” model is that it encourages a new type of proliferation: that of personal preparation scripts. In reality, many preparations can be automated, such as recurring operations that have to be conducted every month or quarter, like the accounting close. This is what we refer to as “operationalization”: the IT department puts the preparation of recurring data into production, so that it can be consumed as a verified, certified and official flow of information. By operationalizing their preparations, users benefit from information system reliability guarantees, including for very large volumes, and at a fraction of the cost thanks to Hadoop. In the end, this virtuous circle meets the twofold needs highlighted by companies: the reactivity (even pro-activity) of business users who have to make decisions in less and less time, and the IT department's need for governance and a well-structured information system.

Hadoop Summit 2015 Takeaway: The Lambda Architecture


It has been a couple of weeks since I got back from the Hadoop Summit in San Jose and I wanted to share a few highlights that I believe validate the direction Talend has taken over the past couple of years.

Coming out of the Summit I really felt that as an industry we were beginning to move beyond the delivery of exciting innovative technologies for Hadoop insiders, to solutions that address real business problems. These next-generation solutions emphasize a strong focus on Enterprise requirements in terms of scalability, elasticity, hybrid deployment, security and robust overall governance.

From my perspective (biased of course!), the dominant themes at the Summit gravitated around:

- Lambda Architecture and typical use cases it enables

- Cloud, tools and ease of dealing with Big Data

- Machine Learning

In this blog post, I’ll focus on the first one: the Lambda Architecture.

Business use cases that require a mix of machine learning, batch and real-time data processing are not new; they have been around for many years.  For example:

- How do I stop fraud before it occurs?

- How can I make my customers feel like “royalty” and push personalized offers to reduce shopping cart abandonment?

- How can I prevent driving risks based on real-time hazards and driver profiles?

The good news is that technologies have greatly improved and with almost endless computing power at a fraction of yesterday’s cost, they are not science fiction anymore.

The Lambda architecture (see below) is a typical architecture to address some of those use cases.

Lambda Architecture. (Based on Nathan Marz design)

Within the Big Data ecosystem, Apache Spark (https://spark.apache.org/) and Apache Flink (https://flink.apache.org/) are two major solutions that fit this architecture well.
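In essence, the Lambda Architecture keeps a batch layer that periodically recomputes complete views and a speed layer that covers only the events arriving since the last batch run; queries merge the two. Below is a minimal, framework-free sketch of that merge in Java; the class and method names are hypothetical, and the swap/clear coordination is simplified for illustration.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical serving layer: answers a query by merging the precomputed batch view
    // with the increments accumulated by the speed layer since the last batch run.
    public class ServingLayer {

        private volatile Map<String, Long> batchView = new HashMap<>();           // rebuilt periodically by the batch layer
        private final Map<String, Long> realtimeView = new ConcurrentHashMap<>(); // updated per event by the speed layer

        // Called by the batch layer when a fresh batch view has been recomputed.
        // (A real system would coordinate the swap and the clear to avoid double counting.)
        public void swapBatchView(Map<String, Long> freshView) {
            batchView = freshView;
            realtimeView.clear();   // the new batch view already covers these events
        }

        // Called by the speed layer for each incoming event.
        public void increment(String key, long delta) {
            realtimeView.merge(key, delta, Long::sum);
        }

        // Query-time merge of the batch and real-time views.
        public long query(String key) {
            return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
        }
    }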

Spark (the champion) stands out from the crowd because of its ability to address both batch and near real-time (micro batch in the case of Spark) data processing with great performance through its in-memory approach.

Spark is also continuously improving its platform by adding key components to appeal to more Data Scientists (on top of MLlib for machine learning, SparkR was added in the 1.4 release) and expand its Hadoop footprint.

Spark projects in the Enterprise are on the rise and slowly replacing Map/Reduce for Batch Processing in the mind of developers. IBM’s recent endorsement and commitment to put 3500 researchers and developers on Spark related projects will probably accelerate Spark adoption in the hearts of Enterprise architects.

But, because there’s a champion, there must also be a contender.

This year, I was particularly impressed by the new Apache Flink project, which attempts to address some of Spark’s drawbacks like:

- Not being a YARN first class citizen yet

- Being Micro Batch (good in 95% of the cases) versus pure streaming

- Memory management that could be improved and made easier

 

If you look at Flink’s “marchitecture”, you can almost draw a one-for-one link between its modules and Spark’s. It’s the same story when it comes to their APIs: they are very similar.

(Diagram: the Apache Flink component stack – https://flink.apache.org/img/flink-stack-small.png)

 

 

So where is Talend in all of this?

With our Talend 5.6 platform, we delivered a few Spark components in Tech Preview. Since then, we have doubled down on our Spark investments, and our upcoming 6.0 release will add many new components to support almost any use case, batch or real-time. From a batch perspective, with 6.0 it will be easier to convert your MapReduce jobs into Spark jobs and gain significant performance improvements along the way.

It’s worth highlighting that the very famous and advanced tMap component will be available for Spark Batch and Streaming, allowing advanced Spark transformation, filtering and data routing from single or multiple sources to single or multiple destinations.
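For a feel of what that kind of join/filter/route logic boils down to at the Spark level, here is a hand-written sketch in Java. It is written purely for illustration and is not Talend-generated output; the file paths and record layouts are hypothetical.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class CustomerOrderJoin {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("CustomerOrderJoin").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {

                // (customerId, customerName)
                JavaPairRDD<String, String> customers = sc.textFile("hdfs:///in/customers.csv")
                        .mapToPair(l -> new Tuple2<>(l.split(",")[0], l.split(",")[1]));

                // (customerId, orderAmount)
                JavaPairRDD<String, Double> orders = sc.textFile("hdfs:///in/orders.csv")
                        .mapToPair(l -> new Tuple2<>(l.split(",")[0], Double.parseDouble(l.split(",")[2])));

                // Join on the customer id, then route large orders to one output and the rest to another.
                JavaPairRDD<String, Tuple2<String, Double>> joined = customers.join(orders);
                joined.filter(t -> t._2()._2() > 1000.0).saveAsTextFile("hdfs:///out/large-orders");
                joined.filter(t -> t._2()._2() <= 1000.0).saveAsTextFile("hdfs:///out/other-orders");
            }
        }
    }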

As always, and because we believe native code running directly on the cluster is better than going through proprietary layers, we are generating native Spark code, allowing our customers to benefit from the continuous performance improvements of their Hadoop data processing frameworks.

Surprising Data Warehouse Lessons from a Scrabble Genius


Lessons in data warehousing and cloud data integration can come from unexpected places. Consider: the latest French-language Scrabble champion doesn’t speak French.

New Zealander Nigel Richards just won the international French World Scrabble® Championship without speaking the language at all, even using many obscure words that the average French speaker doesn’t know. The secret to his success seems to be twofold: his ability to memorize the full 368,000 words in the official authorized French list, known as the Officiel du Scrabble (ODS), along with his expertise in finding the best strategic placement on the game board.

Surprisingly, this achievement highlights a few lessons for any company wishing to become a data-driven enterprise and get ahead of their competitors.

It’s all about consistent, trusted data

Nigel Richards essentially turned his brain into a Data Warehouse. It contained all the approved words, a “single version of the truth” that he based his play on. This is no small accomplishment: typical memorization techniques, like the Memory Palace, aren’t adapted to this sort of task, where the word has no meaning for the mental athlete.

Modern businesses face similar challenges. To become a true data-driven enterprise, you need a very solid data management strategy and infrastructure, so that your data is trusted and that the resulting actions are effective. This means extracting data from multiple sources (for instance Salesforce, Marketo, Netsuite, SAP, Excel as well as your different databases), then taking necessary steps to consolidate, cleanse, transform and load the data into your data warehouse. Unlike the ODS, your data isn’t a relatively frozen set. It needs to be kept up to date and synchronized frequently.

Getting the most out of your data

The next lesson is that you need to focus on key high-level patterns and metrics in order to act. Nigel Richards didn’t need to understand the words in order to play his tiles. He just analyzed the number of points each word would bring, how it would help his future moves, and how it would hinder his opponent’s options. In doing this, he was applying best practices learned throughout his experience as a Scrabble champion in other languages. The words may be different, but the patterns he would seek out are the same.

Before you can produce your analytics and business insights, for instance bringing all your data into AWS Redshift and using Tableau Software or QlikView, or directly importing all data into the Salesforce Analytics Cloud, the foundation needs to be rock solid, bringing us back to data integration.

What this means for you

Mr. Richards single-handedly imported nearly 400,000 words in just nine weeks. I know many managers who would be happy to see that sort of timeline for their projects! Luckily, Cloud Data Integration solutions provide a new, agile approach that allows organizations to integrate their applications and data sources rapidly, without having to worry about provisioning hardware or administering the platform. The best tools provide out-of-the-box connectors to extract and load your data without needing to hand-code, as well as components to de-duplicate, validate, standardize and enrich for higher data quality.

If you are looking for this type of solution, I would definitely recommend you check out Talend Integration Cloud.  As you would expect from Talend, the solution offers powerful, yet easy-to-use tools and prebuilt components and connectors that make it easier to connect, enrich and share data. This approach is designed to allow less skilled staff to complete everyday integration tasks so key personnel can remain focused on more strategic projects. We think you’ll find Talend Integration Cloud can help you become a world-class, champion data-driven enterprise faster than you thought possible!


OSGI Service Containers


The first post in this series provided a look at the definition of a Container.  The second post in the series explored how Platforms leverage Containers to deliver SOA design patterns at internet scale in the Cloud.  This post presents a simplified example of applying Container architecture for extensible business infrastructure.  It then addresses when and where to use the power of micro-service containers like OSGI.

Use Case

Consider a B2B company seeking to add a new trading partner.  The B2B partner may wish to subscribe to a data feed, so the partner will need to adapt its internal APIs to the trading network’s published API.

Rather than an elaborate IT project, this should be a simple Cloud self-service on-demand scenario.  This not only increases agility, it maximizes the market for the B2B company.  And it ensures business scalability since B2B IT staff will not be on the information supply chain critical path.

The Partner will probably want extensions to the base API so the B2B platform needs to observe the open-closed principle.  It needs to be closed from modification while being open to extension.  Extensions could be additional validation and business rules or additional schema extensions.  Schema extensions in particular will impact multiple workflow stages.  In addition, transformations for message schemas might be required along with data level filtering for fine grained access control.  In order to realize the Self-Service On-Demand level expected from Cloud providers, the Platform must allow this type of mediation to be dynamically provisioned without rebuilding the application.

 

Figure 3: Dynamic Subscription Mediation

 

In the diagram above, the partner submits a message to subscribe to a data feed (1) via, say, a REST web service.  The subscription message is received by the RouteBuilder, which dynamically instantiates a new route (2) that consumes from a JMS topic.  The route filters (3) messages based on access privileges, provides custom subscription mediation logic (4), and then sends the message using WS-RM (5).
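A rough sketch of steps (2) through (5) as an Apache Camel RouteBuilder in Java is shown below. The topic name, bean names and delivery endpoint are illustrative only, and WS-RM would be configured on the actual web service endpoint.

    import org.apache.camel.builder.RouteBuilder;

    public class SubscriptionRouteBuilder extends RouteBuilder {

        private final String partnerId;

        public SubscriptionRouteBuilder(String partnerId) {
            this.partnerId = partnerId;
        }

        @Override
        public void configure() {
            from("jms:topic:dataFeed")                                  // (2) consume from the JMS topic
                .filter(header("accessLevel").isEqualTo("PARTNER"))     // (3) filter on access privileges
                .to("bean:mediation-" + partnerId + "?method=apply")    // (4) partner-specific mediation logic
                .to("cxf:bean:partnerEndpoint-" + partnerId);           // (5) deliver via the partner's web service
        }
    }

    // Registered at runtime when the subscription message arrives, e.g.:
    //   camelContext.addRoutes(new SubscriptionRouteBuilder("acme"));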

Where should this mediation logic be hosted?  Creating a service for each partner is not too difficult.  But as the number of services in a composite service increases, the overhead of inter-process communication (IPC) becomes a problem.  As an extreme case, consider what the performance impact would be if the subscribe, filter, and custom mediation logic each required a separate service invocation.

In many cases modularity and extensibility are even more important than performance.  When partners extend the API the impact may not be easily isolated to a single stage in the processing.  In such cases the extension points need to be decoupled from the core business flow.

Likewise, when the core service evolves we need to ensure consistent implementation across different B2B partners.  Regardless of variation, some requirements remain common.  We want to be sure that these requirements are implemented consistently.  A copy-paste approach will not be manageable.

Finally, using external processes to implement variation may undermine efficient resource pooling.  Each partner ends up with its own unique set of endpoints and supporting applications.  In the diagram above, mediation logic belongs to a pool of routes running in the same process to improve efficiency.

So we want granular composability for managed variation as well as modularity for extensibility of business logic.  This is in tension with IPC overhead and resource pooling.  

Sample Architecture

This post is focused on the role of the Service Container in resolving the design forces.  It is used in the context of the Application Container and ESB Containers shown in the sample architecture below.

 

Figure 1: SOA Whiteboard Logical Architecture

 

The Application Container hosts the actual business services. The business service is a plain old Java object (POJO).  It does not know or care about integration logic or inter-process communication. That is addressed by the Service Container. The Service Container runs in the Application Container process. The exposed service is called a Basic Service.

The Service container also runs in the ESB Container.  The ESB container provides additional integration including security, exactly-once messaging, monitoring and control, transformation, etc.  It provides a proxy of the Basic Service with the same business API but different non-functional characteristics.

Service Container

The Service Container is a logical concept that is language dependent.  Since the Service Container runs inside both the ESB and the Application Container it has to be compatible with the languages supported by the ESB and the Application Container.  It may well have multiple implementations since there may be multiple Application Containers used by the enterprise.

For purposes of discussion we will focus on Java.  We can think of Tomcat as a typical Application Container and Apache Karaf as the ESB container.  The Service Container depends on a Dependency Injection framework.  We might use Spring for dependency injection in Tomcat; in Karaf we might choose Blueprint.  The Service Container itself might be implemented in Apache Camel.  Camel works with both Spring and Blueprint.  The actual service implementation is a plain old Java object (POJO).

 

Figure 2: Containerized Services

 

The Service Container is non-invasive in the sense that it has a purely declarative API based on XML and annotations.  The service developer does not need to make any procedural calls.  Adoption of the Service Container by organizations is supported by an SDK that provides a cookbook for using it.  But it should be very simple, hence no tooling is required.  The SDK should address Continuous Integration (CI) and DevOps use cases for every stage of development.  As such, the Service Container can encapsulate any lower-level complexity introduced by other containers.

The service container adds functionality beyond the basic dependency injection framework to address endpoint encapsulation, mediation and routing, and payload marshalling.  Using the Service Container provides a flexible contract between the integration team and the service provider which allows performance optimization while maintaining logical separation of concerns.
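To make that separation concrete, here is a minimal sketch using the Camel Java DSL (endpoint names are illustrative): the business logic is an ordinary POJO, while the route owned by the Service Container handles endpoints, transport and payload binding. In Tomcat the same wiring would typically be declared with Spring, and in Karaf with Blueprint.

    import org.apache.camel.builder.RouteBuilder;

    // The business service: a plain old Java object with no integration code whatsoever.
    class GreetingService {
        public String greet(String name) {
            return "Hello, " + name;
        }
    }

    // The Service Container's contribution: endpoints, transport and payload binding live here.
    public class GreetingRoute extends RouteBuilder {
        @Override
        public void configure() {
            from("jms:queue:greetings.in")              // transport concern, invisible to the POJO
                .bean(GreetingService.class, "greet")   // the message body is bound to the method parameter
                .to("jms:queue:greetings.out");         // the reply goes back out over JMS
        }
    }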

But in some cases this requires the platform to be able to deploy the new jars dynamically at runtime to an existing, running container.  Indeed, there may be many Containers that will need to host the new extension points or adaptor services.  All such concerns should be transparent to the service provider.

This could be implemented by the service provider team, but the same mediation will be used by many service providers.  So it is preferable to delegate this functionality to the Platform.  This has the added benefit that service providers can focus on business logic rather than creating and managing efficient resource pools that deliver reliable, secure throughput.  Business logic and IT logic are often orthogonal skill sets.  So separation of concerns also leads to improved efficiency.

Having this dealt with by the Platform is good, but it raises the question: how are custom mediation jars resolved, and how are conflicts with custom logic from other partners managed?

Micro Service Containers

There is a key difference in the selection of the Dependency Injection framework used in the example architecture.  The Application Container uses Apache Tomcat and the ESB Container uses Apache Karaf.  Apache Karaf supports Blueprint for dependency injection, but it also supports OSGI micro-services whereas traditional Spring running in Tomcat does not. 

Flexible deployment of business logic can be achieved with Dependency Injection frameworks like Spring, but two problems arise.  The first is dependency management and classpath conflicts.  The second is managing the dynamic modules.

OSGI is a mature specification that manages dependencies at the package level.  What this means for the enterprise is that we can manage multiple versions of a module within the same runtime.  In turn, this means we can dynamically deploy new service modules to runtime containers without having to worry about conflicting libraries.  The concept is to achieve the same pluggable ease-of-use for enterprise services that you get with your smart phone’s App Store.

In addition to dependency management, the OSGI specification provides a micro-service architecture.  Micro-service refers to the fact that we are only talking about services and consumers within the same JVM.  Micro-services go beyond dependency injection to provide a framework for dynamic services that can come and go during the course of execution.  This supports elasticity of services in the cloud.
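A minimal sketch of an OSGI micro-service in Java follows: the bundle registers a service implementation when it starts and withdraws it when it stops, so mediation modules can appear and disappear in a running container. The service interface, implementation and the "partner" property are hypothetical.

    import java.util.Hashtable;

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceRegistration;

    // Hypothetical service contract and implementation carried by the bundle.
    interface MediationService {
        String mediate(String payload);
    }

    class AcmeMediationService implements MediationService {
        @Override
        public String mediate(String payload) {
            return payload.toUpperCase();   // placeholder mediation logic
        }
    }

    // The bundle registers its micro-service on start and withdraws it on stop,
    // so mediation modules can come and go while the container keeps running.
    public class MediationActivator implements BundleActivator {

        private ServiceRegistration<MediationService> registration;

        @Override
        public void start(BundleContext context) {
            Hashtable<String, Object> props = new Hashtable<>();
            props.put("partner", "acme");   // hypothetical property consumers can filter on
            registration = context.registerService(MediationService.class, new AcmeMediationService(), props);
        }

        @Override
        public void stop(BundleContext context) {
            registration.unregister();      // the service disappears without restarting the JVM
        }
    }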

OSGI is the same technology used by Eclipse plugins.  So it is very mature and stable. Moreover, as an open standard it is appropriate for use in the enterprise. But there is some additional complexity with OSGI. OSGI complexity is merited when used to host dynamic modules which need to be composed in-process to encapsulate variation or re-use.  This is the case with the ESB Containers.  But it is not always the case for Basic Services running in the Application Containers.

For example, consider a simple transformation service.  It is visually designed and published as a web service.  B2B partners can use the transformation service, and if the additional latency does not impact their SLA, then the additional complexity of OSGI is not merited. 

As a general rule, mediation logic in ESB Containers should use OSGI to provide for flexible deployment of mediation modules.  Mediation is more likely to vary and it varies over a broader set of stakeholders.  These stakeholders may require diverse and potentially overlapping libraries.  Since they are all on their own lifecycle, it requires the dependency management capability of OSGI.  Moreover, mediation services are more likely to be highly dynamic.

In contrast, Basic Service logic can run in the Application Container and is usually delivered by a single organization along with other related services as part of a single Application deployment lifecycle.  Unlike the mediation use case, the service provider team has control and can resolve any library dependency issues during development.  As such it can be run in a lightweight container but it does not necessarily need OSGI.

In summary, Apache Camel provides a Service Container on top of the dependency injection framework, so it can run in Spring or OSGI.  Blueprint is the dependency injection framework for OSGI. OSGI should be considered for composite services and mediation and routing to provide the flexibility and extensibility needed for self-service on-demand, elastic SaaS. 

The next post in the series will explore the reference architecture in greater detail regarding dynamic provisioning and data driven mediation and routing.

On the Road to MDM


Think big, but start small.  This is particularly good advice if you plan on implementing a master data management (MDM) system any time in the near future.

MDM is an extremely powerful technology that can yield astonishing results.  But like any complex, highly effective discipline it is best approached systematically and incrementally.

First, just what are we talking about?  Here’s an excellent definition of MDM from Gartner’s IT Glossary: “MDM is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets.”

Gartner goes on to say that “Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts.”

The metadata residing in the system includes information on:

- The means of creation of the data

- Purpose of the data

- Technical data – data type, format, length, etc.

- Time, date and location of creation

- Creator or author of the data

- Standards

That’s a lot.  And it can lead to problems. For example, when moving data from one place to another, you have to know how it transforms, who owns it, where it comes from and what rules govern the data.  This may mean going back into the ETL integration and trying to unscramble initial problems in data mapping and compliance.  You can try tackling these problems using a spreadsheet (that way lies madness) or turn to IT for an answer that may take months in coming – not a particularly attractive solution if your business is attempting to become more agile.

The fact is, that for many organizations, a full-bore MDM deployment right from the start is overkill.  The massive effort required proves to be too complicated, expensive and inevitably winds up being put on hold.

To avoid this kind of quagmire, there is a better way.  As I mentioned above, start small and think big.  To be more specific, start with a data dictionary and ease your way into MDM over time.  (A data dictionary is a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.)

The poster child for this approach is one of our valued customers – the Irish Office of Revenue Commissioners, known simply as “Revenue.”  Here’s their story.

Revenue’s core business is the assessment and collection of taxes and duties for the Republic of Ireland. With more than 65 offices countrywide, the agency has a staff of over 5,700. Revenue’s authorized users have access to operational business and third-party data in its data warehouse for query, reporting and analytical purposes.

The data is complex and growing rapidly. Historically the agency’s metadata has been accessed from multiple sources using spreadsheets and other business documentation, a fragmentary and ineffective solution.

Fortunately, Talend’s integration solutions are already being used by Revenue as the corporate ETL tool.  Because much of the technical metadata around data manipulation and transformation was already being captured by these Talend solutions, the implementation of Talend’s MDM unified platform made a lot of sense.

The implementation included the initial use of a data dictionary within the MDM platform in keeping with the “start small, think big” dictum. In addition, Revenue was able to leverage the existing skills and knowledge of their business and data analyst employees who are familiar with Talend Studio. 

Overall, the agency avoided additional costs for their metadata solution, reduced operational costs, and solved the business problem of knowing where to find pertinent data for reporting purposes.

Through their use of Talend MDM, we anticipate that Revenue’s metadata solution will increase the understanding of data throughout the organization.  This, in turn, will lead to improved decision making and improved data quality over time.  Plus, the solution’s web user interface will help cut metadata management and deployment costs. It provides the Revenue business analysts with access to one centralized location to gather metadata that had previously been scattered around the organization in various formats and residing in individual business and technical silos.

For many companies dealing with today’s influx of big data, the Revenue incremental approach is a good one.  They can start by building a data dictionary for free and then upgrade to handle more users and provide additional functionality.

After all, MDM is a journey, not a destination.  Companies that elect to follow this path will achieve cost effective and satisfying results by starting small and then moving ahead with all deliberate speed.

Talend and the Gartner Magic Quadrant for Data Integration Tools – Less than a Whisker from the Leaders Quadrant


Why you should consider the emerging leader

When it comes to Talend, prospective customers have a favorite question they like to ask analysts:  “If I can get all of this from Talend at a fraction of the cost of the big guys, what am I really missing by not going with bigger vendors?”

And, what the analysts likely say is something along the lines of:  “Talend isn’t as mature as Informatica or IBM, so their support and how-to tools won’t be as robust.” 

As you might anticipate, my answer would be different.

I’d tell prospects that they’ll never regret taking a deeper look at Talend and assessing for themselves how we stack up.

For my part, here’s how I’d categorize the Data Integration players:

Megavendors (Informatica, IBM, Oracle) – These vendors have mostly grown through acquisitions, which means they all have different design and management environments.  While they offer a broad set of capabilities, they also require you to deploy and learn multiple, complex and proprietary products that come with a high total cost of ownership.

Stovepipes (SAP, SAS, Microsoft) – These vendors all have solid solutions, but they mostly win in their own ecosystems and we don’t see them in the market more broadly.  So, if you are ONLY a Microsoft shop or ONLY a SAP shop, then these could be for you.  Of course though, these vendors still require you to buy multiple products with a high TCO. 

Point Players – (Syncsort, Adeptia, Denodo, Cisco, Actian, Information Builders) – If you have a specific need, these players can fit the bill and they tend to be less expensive. However they lack the full set of capabilities and they aren’t players in the broader integration market.

Talend (Yes, shameless I know, but I truly believe we are currently in a category of one) – If you’re looking for a modern integration platform that can give you agility and meet all your integration needs, including big data, with a great cost of ownership, Talend really is your only option.

A Good Choice for Today and Tomorrow

Gartner outlines seven trends that are shaping the market, and Talend is investing in all of them.  We are setting the bar on support for the key platforms of the future like Hadoop, NoSQL and the cloud.  As the complexity of use cases grows and the need to operate in real time intensifies, our enterprise service bus and upcoming support for Spark Streaming mean that we will deliver on more real-time use cases than anyone else.

Here are all seven of the trends highlighted by Gartner in the Magic Quadrant for Data Integration Tools*:

1. Growing interest in business moments and recognition of the required speed of digital business

2. Intensifying pressure for enterprises to modernize and enlarge their data integration strategy

3. Requirements to balance cost-effectiveness, incremental functionality, time-to-value and growing interest in self-service

4. Expectations for high-quality customer support and services

5. Increasing traction of extensive use cases

6. Extension of integration architectures through a combination of cloud and on-premises deployments

7. Need for alignment with application and information infrastructure

Of course, don’t take my word for it. Add Talend to your due diligence list and assess for yourself how a company that’s less than a whisker away from the Leaders Quadrant matches up. You might want to start by taking a closer look at the report, which you can download free here: https://info.talend.com/gartnermqdi.html

*[1] Gartner, Inc., "Magic Quadrant for Data Integration Tools” by Eric Thoo and Mark A. Beyer, July 29, 2015

Beyond “The Data Vault”


In my last blog, “What is ‘The Data Vault’ and why do we need it?”, I introduced a fresh, compelling methodology for data warehouse modeling invented by Dan Linstedt (http://danlinstedt.com) called ‘the Data Vault’.  I discussed how the Data Vault solves many of the characteristic and inherent problems found in crafting an Enterprise Data Warehouse (EDW): its high adaptability simplifies business ontologies and incorporates Big Data, resulting in the durable yet flexible solutions that most engineering departments dream of.

Before we get into the Data Vault EDW aspects however, I think we need to cover some basics.

Data Storage Systems

Certainly by now everyone has heard about Big Data – including, no doubt, the hype, disinformation, and misunderstandings about what Big Data is, which are regrettably just as pervasive.  So let’s back up a moment and confine the discussion to a sensible level.  Setting Relational 3NF and STAR schemas aside, and ignoring e-commerce, business intelligence, and data integration, let’s look instead at the main storage facilities that encompass data technologies.  These are:

  • Database Engines
    • ROW: your traditional Relational Database Management System (RDBMS)
    • COLUMN: relatively new, widely misunderstood, feels like a normal  RDBMS
  • NoSQL: new kid on the block; really means ‘NOT ONLY SQL’
  • File Systems: everything else under the sun (ASCII/EBCDIC, CSV, JSON, XML, HTML, etc.)

Database Engines

The ROW based database storage methodology is one most of us are already familiar with.  Depending upon your vendor of choice (like Oracle, Microsoft, IBM, Postgres, etc.), Data Definition Language (DDL) and Data Manipulation Language (DML) syntax, collectively called SQL, creates tables for the storage and retrieval of structured records, row by row.  Commonly based upon some form of ‘key’ identifier, the Relational 3rd Normal Form (3NF) Data Model thrives upon the ROW based database engine and is widely used for many Transactional (OLTP) and/or Analytic (OLAP) systems and/or applications.  Highly efficient in complex schema designs and data queries, ROW based database engines offer a tried and true way to build solid data-‘based’ solutions.  We should not throw this away; I won’t!

The COLUMN based database storage methodology has been around, quietly, for a while as an alternative to ROW based databases where aggregations are essential.  Various vendors (like InfoBright, Vertica, SAP HANA, Sybase IQ, etc.) generally use DDL and DML syntax similar to ROW based databases, yet under the hood things are radically different: a highly efficient engine for processing structured records, column by column; perfect for aggregations (SUM/MIN/MAX/COUNT/AVG/PCT)!  This is the main factor that sets it apart from ROW based engines.  Some of these column based technologies also provide high data storage compression, which allows for a much smaller disk footprint – in some cases as much as 80:1 over their row based counterparts.  We should adopt this where appropriate; I do!

Big Data

The NoSQL based storage methodology (notice I don’t call this a database) is the newer kid on the block, for which many vendors are vying for your immediate attention (like Cassandra, Cloudera, Hortonworks, MapR, MongoDB, etc.).  Many people think that NoSQL technologies are here to replace ROW or COLUMN based databases; that is simply not the case. Instead, as a highly optimized, highly scalable, high performance distributed ‘file system’ (see HDFS below), NoSQL storage offers striking features simply not practical with ROW or COLUMN databases.  Dramatically enhancing file I/O technologies, NoSQL opens up new opportunities that were previously unavailable, impractical, or both.  Let’s dive a bit deeper on this; OK?

There are three main variations of NoSQL technologies.  These include:

  • Key Value: supports fast transactional inserts (like an internet shopping cart); generally stores data in memory and is great for web applications that need considerable in/out data operations;
  • Document Store: stores highly unstructured data as named value pairs; great for web traffic analysis, detailed information, and applications that look at user behavior, actions, and logs in real time;
  • Column Store: focused upon massive amounts of unstructured data across distributed systems (think Facebook & Google); great for shallow but wide data relationships, yet fails miserably at ad-hoc queries.

(Note: Column Store NoSQL is not the same as a COLUMN based RDBMS.)

Most NoSQL vendors support structured, semi-structured, or non-structured data, which can be very useful.  The real value, I believe, comes from the fact that NoSQL technologies ingest HUGE amounts of data, very FAST.  Forget Megabytes or Gigabytes, or even Terabytes, we are talking Petabytes and beyond!  Gobs and gobs of data!  With clustering and multi-threaded inner-workings scaling to future-proof the expected explosion of data, a NoSQL environment is an apparent ‘no-brainer’.  Sure, let’s get excited, but let’s also temper it with the understanding that NoSQL is complementary, not competitive, to more traditional database systems.  Also note that NoSQL is NOT A DATABASE but a highly distributed, parallelized ‘file system’, really great at dealing with lots of non-structured data; did I say BIG DATA?
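To visualize the difference between the three families, here is a purely conceptual sketch using plain Java collections, not any vendor's API: the same customer modeled as a key-value entry, a document, and a wide-column row. Keys, field names and values are hypothetical.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class NoSqlShapes {
        public static void main(String[] args) {
            // Key-value: one opaque value per key -- fast to get/put, nothing to query inside.
            Map<String, String> keyValue = new HashMap<>();
            keyValue.put("cart:42", "{\"items\":[\"sku-1\",\"sku-2\"]}");

            // Document store: the value is a structured, schema-less document of named fields.
            Map<String, Map<String, Object>> documents = new HashMap<>();
            documents.put("customer:42", Map.of("name", "Jane", "visits", 17, "tags", List.of("vip")));

            // Column store (wide-column): rows hold sparse sets of columns grouped into families.
            Map<String, Map<String, Map<String, String>>> wideColumn = new HashMap<>();
            wideColumn.put("customer:42", Map.of(
                    "profile", Map.of("name", "Jane"),
                    "clicks",  Map.of("2015-07-01T10:00", "/home", "2015-07-01T10:02", "/pricing")));

            System.out.println(documents.get("customer:42").get("name")); // Jane
        }
    }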

NoSQL technologies have both strengths and weaknesses.  Let’s look at these too:

  • NoSQL Strengths
    • A winner when you need the ability to store and look up Big Data
    • Commodity hardware based
    • Fast data ingestion (loads)
    • Fast lookup speeds (across clusters)
    • Streaming data
    • Multi-threaded
    • Scalable data capacity & distributed storage
    • Application focused
  • NoSQL Weaknesses
    • Conceivably an expensive infrastructure (CPU/RAM/DISK)
    • Complexities are hard to understand
    • Lack of a native SQL interface
    • Limited programmatic interfaces
    • Poor performance on update/delete operations
    • Good engineering talent still hard to find
    • Inadequate for analytic queries (aggregations, metrics, BI)

 

File I/O

The FILE SYSTEM data storage methodology is really straightforward and easy.  Fundamentally, file systems rely upon a variety of storage media (like local disks, RAID, NAS, FTP, etc.) and are managed by an Operating System (Windows/Linux/MacOS) supporting a variety of file access technologies (like FAT, NTFS, XFS, EXT3, etc.).  Files can comprise almost anything, be formatted in many ways, and be utilized in a wide variety of applications and/or systems.  Usually files are organized into folders and/or sub-folders, making the file system an essential element of almost all computing today.  But then you already know this; right?

Hadoop/HDFS

So where does Hadoop fit in, and what is HDFS?  The ‘Hadoop Distributed File System’ (HDFS) is a highly fault-tolerant file system that runs on low-cost, commodity servers.  Spread across multiple ‘nodes’ in a hardware cluster (sometimes hundreds or even thousands of nodes), 64MB ‘chunks’ or data segments are processed using a ‘MapReduce’ programming model that takes advantage of a highly efficient parallel, distributed algorithm.

HDFS is focused on high throughput (fast) data access and support for very large files.  To enable data streaming HDFS has relaxed a few restrictions imposed by POSIX (https://en.wikipedia.org/wiki/POSIX) standards to allow support for batch processing applications targeting HDFS.
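The canonical illustration of the MapReduce model is the word-count job below, written in Java against the standard Hadoop API: mappers run in parallel against HDFS blocks spread over the cluster and emit (word, 1) pairs, and reducers aggregate the counts per word. Input and output paths are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);      // emit (word, 1) for every token
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();                // aggregate the counts for this word
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }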

The Apache Hadoop Project is an open-source framework written in Java that is made up of the following modules:

  • Hadoop Common: which contains libraries and utilities
  • HDFS: the distributed file system
  • YARN: a resource manager responsible for cluster utilization & job scheduling (Apache YARN)
  • MapReduce: a programming model for large scale data processing
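To make that stack a little more concrete, here is a minimal sketch of reading a file through the HDFS Java API. The file path and cluster configuration are hypothetical; the point is that the block-level distribution across DataNodes is handled transparently beneath an ordinary stream:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at the cluster NameNode, e.g. hdfs://namenode:8020
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/landing/orders.csv"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each block is fetched from whichever DataNode holds it
                System.out.println(line);
            }
        }
    }
}
```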

Collectively, this Hadoop ‘package’ has become the basis for several commercially available and enhanced products.

 

So let’s call them all: Data Stores

Let’s bring these three very different data storage technologies into a conjoined perspective; I think it behooves us all to consider that essentially all three offer certain value and benefits across multiple use cases.  They are collectively and generically therefore: Data Stores! 

Regardless of what type of system you are building, I’ve always subscribed to the notion that you use the right tool for the job.  This logic applies to data storage too.  Each of these data storage technologies offer specific features and benefits therefore should be used in specific ways appropriate to the requirements.  Let’s review:

  1. ROW based databases should prevail when you want a complex, but not too-huge, data set that requires efficient storage, retrieval, update, & delete for OLTP and even some OLAP usage;
  2. COLUMN based databases are clearly aimed at analytics; optimized for aggregations coupled with huge data compression, they should be adopted for most business intelligence usage;
  3. NoSQL based data solutions step in when you need to ingest BIG DATA, FAST, Fast, fast… and when you only really need to make correlations across the data quickly;
  4. File Systems are the underlying foundation upon which all these others are built.  Let’s not forget that!

 

The Enterprise Data ‘Vault’ Warehouse

Now that we have discussed where and how we might store data, let’s look at the process for crafting an Enterprise Data Warehouse (an obvious Big Data use case) based on a Data Vault model.

Architecture

An EDW is generally comprised of data originating from a ‘Source’ data store; likely an e-commerce system, or Enterprise Application, or perhaps even generated from machine ‘controls’.  The simple desire is to provide useful reporting on metrics aggregated from this ‘Source’ data.  Yet ‘IT’ engineering departments often struggle with the large volume and veracity of the data and often fail at the construction of an effective, efficient, and pliable EDW Architecture.  The complexities are not the subject of this Blog; however anyone who has been involved in crafting an EDW knows what I am talking about.  To be fair, it is harder than it may seem.

Traditionally an enterprise funnels ‘Source’ data into some form of staging area, often called an ‘Operational Data Store’ or ODS.  From there, the data is processed further into either a Relational 3NF or STAR data model in the EDW, where aggregated processing produces the business metrics desired.  We learned from my previous Blog that this is problematic and time consuming, causing tremendous pressure on data cleansing, transformations, and re-factoring when systems up-stream change.

This is where the Data Vault shines!

Design

After constructing a robust ODS (which I believe is sound architecture for staging data prior to populating an EDW), designing the Data Vault is the next task.  Let’s look at a simple ODS schema:

Let’s start with the HUB table design.  See if you can identify the business keys and the ‘Static’ attributes from the ‘Source’ data structures to include into the HUB tables.  Remember also that HUB tables define their own unique surrogate Primary Key and should contain record load date and source attributes.

 

The LNK ‘link’ tables capture relationships between HUB tables and may include specific ‘transactional’ attributes (there are none in this example).  These LNK tables also have a unique surrogate Primary Key and should record the linkage load date.

Finally, the SAT ‘satellite’ tables capture all the variable attributes.  These are constructed from all the remaining, valuable ‘Source’ data attributes that may change over time.  The SAT tables do not define their own unique surrogate keys; instead they incorporate either the HUB or the LNK table surrogates plus the record load date combined as the Primary Key.

Additionally the SAT tables include a record load end date column which is designed to contain a NULL value for one and only one instance of a satellite row representing the ‘current’ record.  When SAT attribute values change up-stream, a new record is inserted into the SAT table, updating the previously ‘current’ record by setting the record load end date to the date of the newly loaded record.
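As an illustration of that load pattern, here is a minimal JDBC sketch. The sat_customer table, its hub_customer_id key and the email/phone attributes are hypothetical; only the rec_load_date / rec_load_end_date mechanics mirror the description above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Instant;

public class SatLoadSketch {

    /** End-dates the 'current' satellite row for a hub key, then inserts the new version. */
    public static void loadCustomerSatRow(Connection cn, long hubCustomerId,
                                          String email, String phone) throws Exception {
        Timestamp loadDate = Timestamp.from(Instant.now());

        // 1. Close the previously 'current' record (the one with a NULL rec_load_end_date)
        try (PreparedStatement close = cn.prepareStatement(
                "UPDATE sat_customer SET rec_load_end_date = ? " +
                "WHERE hub_customer_id = ? AND rec_load_end_date IS NULL")) {
            close.setTimestamp(1, loadDate);
            close.setLong(2, hubCustomerId);
            close.executeUpdate();
        }

        // 2. Insert the new 'current' record with a NULL rec_load_end_date
        try (PreparedStatement insert = cn.prepareStatement(
                "INSERT INTO sat_customer " +
                "(hub_customer_id, rec_load_date, rec_load_end_date, email, phone) " +
                "VALUES (?, ?, NULL, ?, ?)")) {
            insert.setLong(1, hubCustomerId);
            insert.setTimestamp(2, loadDate);
            insert.setString(3, email);
            insert.setString(4, phone);
            insert.executeUpdate();
        }
    }
}
```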

One very cool result of using this Data Vault model is that it becomes easy to create queries that go “Back-In-Time”: you can check the ‘rec_load_date’ and ‘rec_load_end_date’ values to determine what the record attribute values were at any point in the past.  Those who have tried know that this is very hard to accomplish using a STAR schema.
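A ‘Back-In-Time’ lookup then becomes a simple range predicate on those two columns. A rough sketch, reusing the hypothetical sat_customer table from the previous example:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class BackInTimeQuery {

    /** Prints the satellite attributes that were 'current' at the given point in time. */
    public static void printCustomerAsOf(Connection cn, long hubCustomerId, Timestamp asOf)
            throws Exception {
        String sql = "SELECT email, phone FROM sat_customer " +
                     "WHERE hub_customer_id = ? " +
                     "  AND rec_load_date <= ? " +
                     "  AND (rec_load_end_date IS NULL OR rec_load_end_date > ?)";
        try (PreparedStatement ps = cn.prepareStatement(sql)) {
            ps.setLong(1, hubCustomerId);
            ps.setTimestamp(2, asOf);
            ps.setTimestamp(3, asOf);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("email") + " / " + rs.getString("phone"));
                }
            }
        }
    }
}
```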

AGGREGATIONS

Eventually data aggregations (MIN/MAX/SUM/AVG/COUNT/PCT), often called ‘Business Metrics’ or ‘Data Points’, must be generated from the Data Vault tables.  Reporting systems could query the tables directly, which is a viable solution.  I think, however, that this methodology puts a strain on any EDW, so a column-based database could be utilized instead.  As previously discussed, these column-based database engines are very effective for storing and retrieving data aggregations.  The design of these column-based tables could be highly de-normalized (consider this example) and versioned to account for history.  This solution effectively replaces the FACT/DIMENSION relationship requirements and the potentially complex (populate and retrieve) data queries of the STAR schema data model.

Yes, for those who can read between the lines, this does impose an additional data processing step which must be built, tested, and incorporated into an execution schedule.  The benefits of doing this extra work are huge.  Once the business metrics are stored pre-aggregated, reporting systems will always provide fast retrieval, consistent values, and a managed storage footprint.  These become the ‘Data Marts’ for the business user and are worth the effort!

ETL/ELT

Getting data from ‘Source’ data systems, to Data Vault tables, to column-based Data Marts, requires tools.  Data Integration tools.  Yes, Talend Tools!  As a real-world example, before I joined Talend, I successfully designed and built such an EDW platform.  I used Talend and Infobright.  The originating ‘Source’ data merged 15 identical databases, with 45 tables each, into an ODS; synchronized data from the ODS to a Data Vault model with 30 tables; and further populated 9 Infobright tables.  The only real transformation requirements were to map the ‘Source’ tables to Data Vault tables, then to de-normalized tables.  After processing over 30 billion records, the resulting column-based ‘Data Marts’ could execute aggregated query results, across 6 Billion records, in under 5 seconds.  Not bad!

Conclusion

Wow -- A lot of information has been presented here.  I can attest that there is a lot more that could have been covered.  I will close by saying, humbly, that an Enterprise Data ‘Vault’ Warehouse, using any of the data stores discussed above, is worth your consideration.  The methodology is sound and highly adaptable.  The real work then becomes how you define your business metrics and the ETL/ELT data integration process.  We believe, here at Talend, not too surprisingly, that our tools are well suited for building fast, pliable, maintainable, operational code that takes you “Beyond the Data Vault”.

 

Talend – Implementation in the ‘Real World’: Data Quality Matching (Part 2)


First an apology: It has taken me a while to write this second part of the matching blog and I am sure that the suspense has been killing you. In my defence, the last few months have been incredibly busy for the UK Professional Services team, with a number of our clients’ high-profile projects going live, including two UK MDM projects that I am involved with. On top of that, the interest in Talend MDM at the current time is phenomenal, so I have been doing a lot of work with our pre-sales / sales teams in Europe whenever their opportunities require deep MDM expertise. Then of course, Talend is gearing up for a major product release later this year, so I recently joined a group of experts from around the world at ‘Hell Week’ at our Paris office – where we do our utmost to break the Beta with real world use cases as well as test out all the exciting new features.[1]

Anyway – when I left you last time we had discussed the importance of understanding (profiling) our data, then applying standardisation techniques to the data in order to give ourselves the best chance of correctly identifying matches / duplicates within our data sets. As I stated last time, this is unlikely to be optimised in a single pass, but more likely to be an iterative process over the duration of the project (and beyond – I will discuss this later). Now we are ready to discuss the mechanics of actually matching data with Talend. At a high level there are two strategies for matching:

1. Deterministic Matching

This is by far the simpler of the two approaches. If our data set(s) contain appropriate data we can either:

  1. Use one or more fields as a key or unique identifier – where identical values exist we have a match. For example the primary / foreign keys of a record, national id numbers, Unique Property Reference Numbers (UPRN), tax registration numbers etc.
  2. Do an exact comparison of the contents of some or all of the fields

In essence with deterministic matching we are doing a lookup or a join – hopefully you should all be familiar with doing this within Talend and within relational databases.  Of course even this strategy can bring its own technical challenges – for example joining two very large sets of data efficiently, but this is a topic for another blog.
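Outside of a tMap or database join, the same idea can be expressed in a few lines of plain Java – a hash lookup keyed on whatever deterministic identifier the two data sets share. The Customer record and its fields below are invented purely for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeterministicMatch {

    // Hypothetical record shape - in reality this would be whatever your schema defines
    record Customer(String systemId, String nationalId, String first, String last) {}

    /** Index one data set by its key, then probe it with the other - a classic hash join. */
    public static void matchOnNationalId(List<Customer> crm, List<Customer> billing) {
        Map<String, Customer> byNationalId = new HashMap<>();
        for (Customer c : crm) {
            if (c.nationalId() != null && !c.nationalId().isBlank()) {
                byNationalId.put(c.nationalId(), c);
            }
        }
        for (Customer c : billing) {
            Customer match = byNationalId.get(c.nationalId());
            if (match != null) {
                System.out.println("Match: " + c.systemId() + " <-> " + match.systemId());
            }
        }
    }
}
```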

Everyone’s favourite component, tMap, joining data from two sources – and a sneak preview of new features:

 

2. Probabilistic Matching

 

 

The issue with deterministic matching is that it will not necessarily identify all matches / duplicates:

 

In the example above, the system IDs (the primary keys) are different – i.e. the record has been keyed in twice, so this doesn’t help us. The National ID could be a contender for helping us match, but it appears to be an optional field. Besides, even if it was mandatory, what if a mistake was made typing in the ID? Finally we have the name fields, but again a deterministic match doesn’t help us here, due to the typo in the ‘Last’ field. The example also illustrates that even if we had some way of addressing these issues, it may not be possible to accurately determine whether the two records are a match, either by an automatic algorithm or even human intervention – we simply might not have enough information in the four columns to make a decision.

Now let’s say we had a real world data set (or multiple real world data sets) with a far greater breadth of information about an entity. This is where it gets interesting. Probabilistic or ‘Fuzzy’ matching allows us to match data in situations where deterministic matching is not possible or does not give us the full picture. Simplistically, it is the application of algorithms to various fields within the data, the results of which are combined together using weighting techniques to give us a score (a simple scoring sketch follows the list below). This score can be used to categorise the likelihood of a match into one of three categories: Match, Possible Match and Unmatched:

  • Match – automatic matching
  • Possible Match – records requiring a human Data Steward to make a decision
  • Unmatched – no match found
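Here is a rough sketch of that weighted scoring, using (purely for illustration) the open-source Apache Commons Text implementation of one of the fuzzy algorithms discussed below. The weights and thresholds are invented and would normally be tuned iteratively for each data set:

```java
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

public class WeightedScoreSketch {

    enum Outcome { MATCH, POSSIBLE_MATCH, UNMATCHED }

    private static final JaroWinklerSimilarity JW = new JaroWinklerSimilarity();

    // Hypothetical field weights and thresholds
    static Outcome classify(String first1, String first2,
                            String last1, String last2,
                            String postcode1, String postcode2) {
        double score = 0.5 * JW.apply(last1, last2)
                     + 0.3 * JW.apply(first1, first2)
                     + 0.2 * JW.apply(postcode1, postcode2);
        if (score >= 0.90) return Outcome.MATCH;            // automatic matching
        if (score >= 0.75) return Outcome.POSSIBLE_MATCH;   // route to a human Data Steward
        return Outcome.UNMATCHED;                           // no match found
    }
}
```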

Within the Talend Platform products, we supply a number of Data Quality components that utilise these ‘fuzzy’ algorithms. I cannot stress enough the importance of understanding, at least at a high level, how each algorithm works and what its strengths and weaknesses are. Broadly, they are split into two categories: Edit Distance and Phonetic.

Edit Distance Algorithms

From Wikipedia:

In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.

From a DQ matching perspective, this technique is particularly useful for identifying the small typographical errors that are common when data is entered into a system by hand. Let’s look at the edit distance algorithms available within Talend, all of which are known industry standard algorithms:

Levenshtein distance

Most useful for matching single word strings. You can find a detailed description of the algorithm here: https://en.wikipedia.org/wiki/Levenshtein_distance, but in essence it works by calculating the minimum number of substitutions required to transform one string into another.

Example (again from Wikipedia):

The Levenshtein distance between ‘kitten’ and ‘sitting’ is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of "s" for "k")
  2. sitten → sittin (substitution of "i" for "e")
  3. sittin → sitting (insertion of "g" at the end).

As we look at each algorithm it is important to understand the weaknesses. Let’s return to our ‘Pemble’ vs ‘pembel’ example:

  1. Pemble → Pembll
  2. Pembll → Pemble
  3. Pemble → pemble

Yes – that’s right – the algorithm is case sensitive and the distance is 3 – the same as ‘kitten’ and ‘sitting’! Once again, this is a nice illustration of the importance of standardisation before matching: For example, standardising so the first letter is upper case would immediately reduce the distance to 2. Later, I will show how these distance scores translate into scores in the DQ components.

Another example: ‘Adam’ vs ‘Alan’

  1. Adam → Alam
  2. Alam → Alan

Here the Levenshtein distance is 2. However consider the fact that ‘Adam’ and ‘Alan’ may be the same person (because the name was misheard) or they may be different people. This illustrates why we need to consider as much information as possible when deciding if two ‘entities’ are the same – the first name in isolation in this example is not enough information to make a decision. It also demonstrates that we need to consider the possibility of our fuzzy matching introducing false positives.
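If you want to experiment with these distances outside of the Talend components, the Apache Commons Text library provides a standard Levenshtein implementation. Here is a small sketch reproducing the examples above; the final conversion into a 0..1 score (1 minus distance over the longer string length) is a common normalisation and is only an assumption about how a distance becomes a score:

```java
import org.apache.commons.text.similarity.LevenshteinDistance;

public class LevenshteinExamples {
    public static void main(String[] args) {
        LevenshteinDistance lev = LevenshteinDistance.getDefaultInstance();

        System.out.println(lev.apply("kitten", "sitting"));   // 3
        System.out.println(lev.apply("Pemble", "pembel"));    // 3 - case counts as an edit
        System.out.println(lev.apply("Pemble", "Pembel"));    // 2 - after standardising case
        System.out.println(lev.apply("Adam", "Alan"));        // 2

        // One common way to normalise a distance into a 0..1 score
        int d = lev.apply("Pemble", "Pembel");
        double score = 1.0 - (double) d / Math.max("Pemble".length(), "Pembel".length());
        System.out.println(score);                            // 0.666...
    }
}
```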

Jaro-Winkler

From Wikipedia: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995), a type of string edit distance, and was developed in the area of record linkage (duplicate detection) (Winkler, 1990). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

Personally I use Jaro-Winkler as my usual edit distance algorithm of choice as I find it delivers more accurate results than Levenshtein. I won’t break down detailed examples as before as the mathematics are a little more complex (and shown in detail on the Wikipedia page). However, let’s try running the same examples we looked at for Levenshtein through Jaro-Winkler:

  • ‘kitten’ and ‘sitting’ -> Jaro-Winkler score: 0.7460317611694336
  • ‘Pemble’ and ‘Pembel’ -> Jaro-Winkler score: 0.9666666507720947
  • ‘Pemble’ and ‘pembel’ -> Jaro-Winkler score: 0.8222222328186035
  • ‘Adam’ and ‘Alan’ -> Jaro-Winkler score: 0.7000000178813934

All Jaro-Winkler scores are between 0 and 1. Case is still skewing our results.
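Again, for experimentation outside of the Studio, recent versions of Apache Commons Text ship a JaroWinklerSimilarity class. The exact figures it returns may differ slightly from the scores quoted above, since implementations differ in small details:

```java
import org.apache.commons.text.similarity.JaroWinklerSimilarity;

public class JaroWinklerExamples {
    public static void main(String[] args) {
        JaroWinklerSimilarity jw = new JaroWinklerSimilarity();

        System.out.println(jw.apply("kitten", "sitting"));
        System.out.println(jw.apply("Pemble", "Pembel"));
        System.out.println(jw.apply("Pemble", "pembel"));   // case still drags the score down
        System.out.println(jw.apply("Adam", "Alan"));
    }
}
```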

Jaro

Essentially Jaro-Winkler without the final ‘Winkler’ modification step; more details can be found online, see http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html: ‘Step3: Winkler Modification’. I don’t generally use it as Jaro-Winkler is considered more accurate in most cases. In case you are wondering, our test cases score as follows:

  • ‘kitten’ and ‘sitting’ -> Jaro score: 0.7460317611694336
  • ‘Pemble’ and ‘Pembel’ -> Jaro score: 0.9444444179534912
  • ‘Pemble’ and ‘pembel’ -> Jaro score: 0.8222222328186035
  • ‘Adam’ and ‘Alan’ -> Jaro score: 0.6666666865348816

Note where the scores are the same as Jaro-Winkler and where they are different. If you are interested, these variations can be explained by the Winkler Modification’s preferential treatment of the initial part of the string, and you can even find edge cases where you could argue that this isn’t desirable:

  • ‘francesca’ and ‘francis’ -> Jaro score: 0.8412697911262512
  • ‘francesca’ and ‘francis’ -> Jaro-Winkler score: 0.9206348955631256

To a human, we can see that these are obviously two different names (doesn’t mean that the name has not been misheard though), but Jaro-Winkler skews based on the initial part of the string - ‘fran’. Remember though that these scores would not usually be used in isolation, other fields will also be matched.

Q-grams (often referred to as n-grams)

This algorithm matches processed entries by dividing strings into letter blocks of length q in order to create a number of q-length grams. The matching result is given as the number of q-gram matches over the number of possible q-grams. At the time of writing, the q-grams algorithm that Talend provides is actually a character-level tri-grams algorithm, so what does this mean exactly?

https://en.wikipedia.org/wiki/Trigram

Imagine ‘sliding a window’ over a string and splitting out all the combinations of the consecutive characters. Let’s take our ‘kitten’ and ‘sitting’ example and understand what actually happens:

‘kitten’ produces the following set of trigrams:

(#,#,k), (#,k,i), (k,i,t), (i,t,t), (t,t,e), (t,e,n), (e,n,#), (n,#,#)

‘sitting’ produces the following set of trigrams:

(#,#,s), (#,s,i), (s,i,t), (i,t,t), (t,t,i), (t,i,n), (i,n,g), (n,g,#), (g,#,#)

Where ‘#’ denotes a pad character appended to the beginning and end of each string. This allows:

  • The first character of each string to potentially match even if the subsequent two characters are different. In this example (#,#,k) does not equal (#,#,s).
  • The first two characters of each string to potentially match even if the subsequent character is different. In this example (#,k,i) does not equal (#,s,i).
  • The last character of each string to potentially match even if the preceding two characters are different. In this example (n,#,#) does not equal (g,#,#).
  • The last two characters of each string to potentially match even if the preceding character is different. In this example (e,n,#) does not equal (n,g,#).

There are two things to note from this:

  1. The pad character ‘#’ is treated differently to a whitespace. This means the strings ‘Adam Pemble’ and ‘Pemble Adam’ will get a good score, but not a perfect match score, which is a desirable result.
  2. We should remove any ‘#’ characters from our strings before using this algorithm!

The algorithm in Talend uses the following formula to calculate a score:

normalisedScore =

(maxQGramsMatching - getUnNormalisedSimilarity(str1Tokens, str2Tokens)) / maxQGramsMatching

I won’t delve into the full details of each variable and function here, but essentially the score for our ‘kitten’ and ‘sitting’ example would be calculated as follows:

normalisedScore = (17 – 15) / 17 = 0.1176470588235294 – a low score

Once you understand the q-grams algorithm, you can see why it is particularly suited to longer strings or multi-word strings.  For example, if we used q-grams to compare:

“The quick brown fox jumps over the lazy dog”

to

“The brown dog quick jumps over the lazy fox”

We would get a reasonably high score (0.8222222328186035) due to the strings containing the same words, but in a different order (remember the whitespace vs ‘#’). A Levenshtein score (not distance) for these strings would be 0.627906976744186. However, it is important to note that scores from different algorithms are NOT directly comparable – we will come back to this point later. We can say, though, that relatively speaking the q-grams algorithm will give us more favourable results in this ‘same words, different order’ scenario – if that’s what we were looking for.
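To make the formula above concrete, here is a small, self-contained sketch of the padded character tri-gram scoring described in this section. It reproduces the 2 / 17 ≈ 0.1176 result for ‘kitten’ and ‘sitting’, but it is an illustration rather than Talend’s internal implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrigramScoreSketch {

    /** Builds character trigrams with '#' padding, as described above. */
    static List<String> trigrams(String s) {
        String padded = "##" + s + "##";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }

    /** Matched trigram instances (counted on both sides) over total trigram instances. */
    static double score(String a, String b) {
        List<String> ta = trigrams(a);
        List<String> tb = trigrams(b);

        Map<String, Integer> remaining = new HashMap<>();
        for (String g : ta) remaining.merge(g, 1, Integer::sum);

        int matched = 0;
        for (String g : tb) {
            Integer left = remaining.get(g);
            if (left != null && left > 0) {
                remaining.put(g, left - 1);
                matched += 2;                      // one instance from each string
            }
        }
        int total = ta.size() + tb.size();         // maxQGramsMatching in the formula above
        return (double) matched / total;
    }

    public static void main(String[] args) {
        System.out.println(score("kitten", "sitting"));   // 2 / 17, approximately 0.1176
    }
}
```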

Phonetic Algorithms

Once again from Wikipedia: https://en.wikipedia.org/wiki/Phonetic_algorithm

A phonetic algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result.

They are necessarily complex algorithms with many rules and exceptions, because English spelling and pronunciation is complicated by historical changes in pronunciation and words borrowed from many languages.

In Talend we include four phonetic algorithms, again all industry standards:

Soundex

Once more credit to Wikipedia (no point in reinventing the wheel) https://en.wikipedia.org/wiki/Soundex . The Soundex algorithm generates a code that represents the phonetic pronunciation of a word. This is calculated as follows:

The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.

The correct value can be found as follows:

  1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
  2. Replace consonants with digits as follows (after the first letter):
    • b, f, p, v → 1
    • c, g, j, k, q, s, x, z → 2
    • d, t → 3
    • l → 4
    • m, n → 5
    • r → 6
  3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
  4. Iterate the previous step until you have one letter and three numbers. If you have too few letters in your word that you can't assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261" and not "A226" (the chars 's' and 'c' in the name would receive a single number of 2 and not 22 since an 'h' lies in between them). "Tymczak" yields "T522" not "T520" (the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once as 'P').

A ‘score’ in Talend is generated based on the similarity of two codes. E.g. in the example above, ‘Robert’ and ‘Rupert‘ generate the same Soundex code of ‘R163’, so Talend would assign a score of 1. ‘Robert’ and ‘Rupern‘ (typo in Rupert, code = R165) would get a score of 0.75 as three of the four digits match. Also, it is worth noting that Soundex is not case sensitive.

Key point: As you can see, phonetic algorithms are a useful tool, especially where data may have been spelt ‘phonetically’ rather than with the correct spelling. English is also full of words that sound the same but are spelt completely differently (they are called homophones https://www.oxford-royale.co.uk/articles/efl-homophones.html) – consider ‘buy’ and ‘bye’: both words will generate the same Soundex code ‘B000’. Being able to match phonetically can be a very powerful tool; however, phonetic algorithms also tend to ‘overmatch’, e.g. ‘Robert’ and ‘Rupert’ or ‘Lisa’ and ‘Lucy’ generate the same code. Why not have a play yourself? There are plenty of online tools that generate Soundex codes, e.g. http://www.gedpage.com/soundex.html
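If you would rather experiment in code than in an online tool, the Apache Commons Codec library includes a standard Soundex implementation. Its difference() method counts how many of the four code characters agree, which maps naturally onto the 0.25-step scores described above – an illustration, not necessarily the internals of the Talend components:

```java
import org.apache.commons.codec.language.Soundex;

public class SoundexExamples {
    public static void main(String[] args) throws Exception {
        Soundex soundex = new Soundex();

        System.out.println(soundex.encode("Robert"));   // R163
        System.out.println(soundex.encode("Rupert"));   // R163 - same code as Robert
        System.out.println(soundex.encode("Rubin"));    // R150

        // difference() returns how many of the 4 code characters agree (0..4)
        System.out.println(soundex.difference("Robert", "Rupert") / 4.0);   // 1.0
        System.out.println(soundex.difference("Robert", "Rupern") / 4.0);   // 0.75
    }
}
```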

Soundex FR

A variation of Soundex optimised for French language words. Talend began as a French company after all!

Metaphone / Double Metaphone

Once more to Wikipedia: https://en.wikipedia.org/wiki/Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation.[1] It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.

The original author later produced a new version of the algorithm, which he named Double Metaphone. Contrary to the original algorithm whose application is limited to English only, this version takes into account spelling peculiarities of a number of other languages.

….

Original Metaphone contained many errors and was superseded by Double Metaphone

The rules for Metaphone / Double Metaphone are too complex to reproduce here, but are available online. Suffice to say, if you are going to use a phonetic algorithm with Talend, it is likely that Double Metaphone will be your algorithm of choice. Once again though, be aware that even with Double Metaphone, the ‘over matching’ problem exists and you should handle it appropriately in your DQ processes. This could mean lower thresholds for automatic matching or stewardship processes that allow false positives to be unmatched.
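For completeness, here is a quick sketch using the Apache Commons Codec Double Metaphone implementation – illustrative only, and not necessarily identical to what the Talend components do internally:

```java
import org.apache.commons.codec.language.DoubleMetaphone;

public class DoubleMetaphoneExamples {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();

        // Robert and Rupert typically share the same primary code - the 'over matching' point above
        System.out.println(dm.doubleMetaphone("Robert"));
        System.out.println(dm.doubleMetaphone("Rupert"));

        // Primary and alternate encodings for the same word
        System.out.println(dm.doubleMetaphone("Smith"));
        System.out.println(dm.doubleMetaphone("Smith", true));

        // Convenience check that two words share an encoding
        System.out.println(dm.isDoubleMetaphoneEqual("Smith", "Smythe"));   // true
    }
}
```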

This concludes our brief tour of Talend’s matching algorithms. It should be noted that we also support an algorithm type of ‘custom’ that allows your own algorithm to be plugged in. Another important point is that the algorithms supplied by Talend are focused on ‘character’ / ‘alphabet’ based languages (for edit distance) and specific languages (phonetic). Non-character based languages like Chinese will require different matching strategies / algorithms (have a look online if you want to know more on this topic).

It is at this point I shall have to apologise once more. Last time I promised that we would discuss the actual mechanics of matching with Talend and Survivorship in this blog. However I think this post is long enough as it is and I shall continue with these topics next time.



[1] This obviously took all of our time. We definitely didn’t spend any time eating amazing French food, drinking beer and swapping war stories.  

 
