First an apology: It has taken me a while to write this second part of the matching blog and I am sure that the suspense has been killing you. In my defence, the last few months have been incredibly busy for the UK Professional Services team, with a number of our client’s high profile projects going live, including two UK MDM projects that I am involved with. On top of that, the interest in Talend MDM at the current time is phenomenal, so I have been doing a lot of work with our pre-sales / sales teams in Europe whenever their opportunities require deep MDM expertise. Then of course, Talend is gearing up for a major product release later this year so I recently joined a group of experts from around the world at ‘Hell Week’ at our Paris office – where we do our utmost to break the Beta with real world use cases as well as test out all the exciting new features.[1]
Anyway – when I left you last time we had discussed the importance of understanding (profiling) our data, then applying standardisation techniques to the data in order to give ourselves the best chance of correctly identifying matches / duplicates within our data sets. As I stated last time, this is unlikely to be optimised in a single pass, but more likely to be an iterative process over the duration of the project (and beyond – I will discuss this later). Now we are ready to discuss the mechanics of actually matching data with Talend. At a high level there are two strategies for matching:
1. Deterministic Matching
This is by far the simpler of the two approaches. If our data set(s) contain appropriate data we can either:
- Use one or more fields as a key or unique identifier – where identical values exist we have a match. For example the primary / foreign keys of a record, national id numbers, Unique Property Reference Numbers (UPRN), tax registration numbers etc.
- Do an exact comparison of the contents of some or all of the fields
In essence with deterministic matching we are doing a lookup or a join – hopefully you should all be familiar with doing this within Talend and within relational databases. Of course even this strategy can bring its own technical challenges – for example joining two very large sets of data efficiently, but this is a topic for another blog.
Everyone’s favourite component, tMap joining data from two sources and a sneak preview of new features:
2. Probabilistic Matching
The issue with deterministic matching is that it will not necessarily identify all matches / duplicates:
In the example above, the system ID’s (the primary key) are different – i.e. the record has been keyed in twice, so this doesn’t help us. The National ID could be a contender for helping us match, but it appears to be an optional field. Besides, even if it was mandatory, what if a mistake was made typing in the ID? Finally we have the name fields, but again a deterministic match doesn’t help us here, due to the typo in the ‘Last’ field. The example also illustrates that even if we had some way of addressing these issues it may not be possible to accurately determine if the two records are a match by an automatic algorithm or even human intervention – we simply might not have enough information in the four columns to make a decision.
Now let’s say we had a real world data set (or multiple real world data sets) with a far greater breadth of information about an entity. This is where it gets interesting. Probabilistic or ‘Fuzzy’ matching allows us to match data in situations where deterministic matching is not possible or does not give us the full picture. Simplistically it is the application of algorithms to various fields within the data, the results of which are combined together using weighting techniques to give us a score. This score can be used to categorise the likelihood of a match into one of three categories: Match, Possible Match and Unmatched:
· Match – automatic matching
· Possible Match – records requiring a human Data Steward to make a decision
· Unmatched – no match found
Within the Talend Platform products, we supply a number of Data Quality components that utilise these ‘fuzzy’ algorithms. I cannot stress enough the importance of understanding, at least at a high level, how each algorithm works and what its strengths and weaknesses are. Broadly, they are split into two categories: Edit Distance and Phonetic.
Edit Distance Algorithms
From Wikipedia:
In computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.
From a DQ matching perspective, this technique is particularly useful for identifying the small typographical errors that are common when data is entered into a system by hand. Let’s look at the edit distance algorithms available within Talend, all of which are known industry standard algorithms:
Levenshtein distance
Most useful for matching single word strings. You can find a detailed description of the algorithm here: https://en.wikipedia.org/wiki/Levenshtein_distance, but in essence it works by calculating the minimum number of substitutions required to transform one string into another.
Example (again from Wikipedia):
The Levenshtein distance between ‘kitten’ and ‘sitting’ is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
1. kitten → sitten (substitution of "s" for "k")
2. sitten → sittin (substitution of "i" for "e")
3. sittin → sitting (insertion of "g" at the end).
As we look at each algorithm it is important to understand the weaknesses. Let’s return to our ‘Pemble’ vs ‘pembel’ example:
- Pemble → Pembll
- Pembll → Pemble
- Pemble → pemble
Yes – that’s right – the algorithm is case sensitive and the distance is 3 – the same as ‘kitten’ and ‘sitting’! Once again, this is a nice illustration of the importance of standardisation before matching: For example, standardising so the first letter is upper case would immediately reduce the distance to 2. Later, I will show how these distance scores translate into scores in the DQ components.
Another example: ‘Adam’ vs ‘Alan’
- Adam → Alam
- Alam → Alan
Here the Levenshtein distance is 2. However consider the fact that ‘Adam’ and ‘Alan’ may be the same person (because the name was misheard) or they may be different people. This illustrates why we need to consider as much information as possible when deciding if two ‘entities’ are the same – the first name in isolation in this example is not enough information to make a decision. It also demonstrates that we need to consider the possibility of our fuzzy matching introducing false positives.
Jaro-Winkler
From Wikipedia: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995), a type of string edit distance, and was developed in the area of record linkage (duplicate detection) (Winkler, 1990). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
Personally I use Jaro-Winkler as my usual edit distance algorithm of choice as I find it delivers more accurate results than Levenshtein. I won’t break down detailed examples as before as the mathematics are a little more complex (and shown in detail on the Wikipedia page). However, let’s try running the same examples we looked at for Levenshtein through Jaro-Winkler:
· ‘kitten’ and ‘sitting’ -> Jaro-Winkler score: 0.7460317611694336
· ‘Pemble’ and ‘Pembel’ -> Jaro-Winkler score: 0.9666666507720947
· ‘Pemble’ and ‘pembel’ -> Jaro-Winkler score: 0.8222222328186035
· ‘Adam’ and ‘Alan’ -> Jaro-Winkler score: 0.7000000178813934
All Jaro-Winkler scores are between 0 and 1. Case is still skewing our results.
Jaro
Essentially a special case implementation of Jaro-Winkler, more details of which can be found on the internet see http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/JaroWinklerDistance.html: ‘Step3: Winkler Modification’. I don’t generally use it as Jaro-Winkler is considered more accurate in most cases. In case you are wondering, our test cases score as follows:
· ‘kitten’ and ‘sitting’ -> Jaro score: 0.7460317611694336
· ‘Pemble’ and ‘Pembel’ -> Jaro score: 0.9444444179534912
· ‘Pemble’ and ‘pembel’ -> Jaro score: 0.8222222328186035
· ‘Adam’ and ‘Alan’ -> Jaro score: 0.6666666865348816
Note where the scores are the same as Jaro-Winkler and where they are different. If you are interested, these variations can be understood by the Winkler Modifications preferential treatment to the initial part of the string and you can even find edge cases where you could argue that this isn’t desirable:
· ‘francesca’ and ‘francis’ -> Jaro score: 0.8412697911262512
· ‘francesca’ and ‘francis’ -> Jaro-Winkler score: 0.9206348955631256
To a human, we can see that these are obviously two different names (doesn’t mean that the name has not been misheard though), but Jaro-Winkler skews based on the initial part of the string - ‘fran’. Remember though that these scores would not usually be used in isolation, other fields will also be matched.
Q-grams (often referred to as n-grams)
Matches processed entries by dividing strings into letter blocks of length q in order to create a number of q length grams. The matching result is given as the number of q-gram matches over possible q-grams. At the time of writing, the q-grams algorithm that Talend provides is actually a character level tri-grams algorithm, so what does this mean exactly?
https://en.wikipedia.org/wiki/Trigram
Imagine ‘sliding a window’ over a string and splitting out all the combinations of the consecutive characters. Let’s take our ‘kitten’ and ‘sitting’ example and understand what actually happens:
‘kitten’ produces the following set of trigrams:
(#,#,k), (#,k,i), (k,i,t), (i,t,t), (t,t,e), (t,e,n), (e,n,#), (n,#,#)
‘sitting’ produces the following set of trigrams:
(#,#,s), (#,s,i), (s,i,t), (i,t,t), (t,t,i), (t,i,n), (i,n,g), (n,g,#), (g,#,#)
Where ‘#’ denotes a pad character appended to the beginning and end of each string. This allows:
· The first character of each string to potentially match even if the subsequent two characters are different. In this example (#,#,k) does not equal (#,#,s).
· The first two characters of each string to potentially match even if the subsequent character is different. In this example (#,k,i) does not equal (#,s,i).
· The last character of each string to potentially match even if the preceding two characters are different. In this example (n,#,#) does not equal (g,#,#).
· The last two characters of each string to potentially match even if the preceding character is different. In this example (e,n,#) does not equal (n,g,#).
There are two things to note from this:
- The pad character ‘#’ is treated differently to a whitespace. This means the strings ‘Adam Pemble’ and ‘Pemble Adam’ will get a good score, but not a perfect match score which is a desirable result.
- We should remove any ‘#’ characters from our strings before using this algorithm!
The algorithm in Talend uses the following formula to calculate a score:
normalisedScore =
(maxQGramsMatching - getUnNormalisedSimilarity(str1Tokens, str2Tokens)) / maxQGramsMatching
I won’t delve into the full details of each variable and function here, but essentially the score for out ‘kitten’ and ‘sitting’ example would be calculated as follows:
normalisedScore = (17 – 15) / 17 = 0.1176470588235294 - a low score
Once you understand the q-grams algorithm, you can see why it is particularly suited to longer strings or multi work strings. For example if we used q-grams to compare:
“The quick brown fox jumps over the lazy dog”
to
“The brown dog quick jumps over the lazy fox”
We would get a reasonably high score (0.8222222328186035) due to the strings containing the same words, but in a different order (remember the whitespace vs ‘#’). A Levenshtein score (not distance) for these strings would be 0.627906976744186. However it is important to note that scores from different algorithms are NOT directly comparable– we will come back to this point later. However we can say that relatively the q-grams algorithm will give us more favourable results in this - same words, different order scenario - if that’s what we were looking for.
Phonetic Algorithms
Once again from Wikipedia: https://en.wikipedia.org/wiki/Phonetic_algorithm
A phonetic algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result.
They are necessarily complex algorithms with many rules and exceptions, because English spelling and pronunciation is complicated by historical changes in pronunciation and words borrowed from many languages.
In Talend we include four phonetic algorithms, again all industry standards:
Soundex
Once more credit to Wikipedia (no point in reinventing the wheel) https://en.wikipedia.org/wiki/Soundex . The Soundex algorithm generates a code that represents the phonetic pronunciation of a word. This is calculated as follows:
The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Consonants at a similar place of articulation share the same digit so, for example, the labial consonants B, F, P, and V are each encoded as the number 1.
The correct value can be found as follows:
- Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
- Replace consonants with digits as follows (after the first letter):
- b, f, p, v → 1
- c, g, j, k, q, s, x, z → 2
- d, t → 3
- l → 4
- m, n → 5
- r → 6
- If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
- Iterate the previous step until you have one letter and three numbers. If you have too few letters in your word that you can't assign three numbers, append with zeros until there are three numbers. If you have more than 3 letters, just retain the first 3 numbers.
Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261" and not "A226" (the chars 's' and 'c' in the name would receive a single number of 2 and not 22 since an 'h' lies in between them). "Tymczak" yields "T522" not "T520" (the chars 'z' and 'k' in the name are coded as 2 twice since a vowel lies in between them). "Pfister" yields "P236" not "P123" (the first two letters have the same number and are coded once as 'P').
A ‘score’ in Talend is generated based on the similarity of two codes. E.g. In the example above, ‘Robert’ and ‘Rupert‘ generate the same Soundex code of ‘R163’ so Talend would assign a score of 1. ‘Robert’ and ‘Rupern‘ (typo in Rupert, code = R165) would get a score of 0.75 as three of the four digits match. Also it is worth nothing Soundex is not case sensitive.
Key point: As you can see, phonetic algorithms are a useful tool especially where data may have been spelt ‘phonetically’ rather than the correct spelling. English is also full of words that can sound the same but be spelt completely differently (they are called homophones https://www.oxford-royale.co.uk/articles/efl-homophones.html) – consider ‘buy’ and ‘bye’, both words will generate the same Soundex code ‘B000’. Being able to match phonetically can be a very powerful tool, however, they also tend to ‘overmatch’ e.g. ‘Robert’ and ‘Rupert’ or ‘Lisa’ and ‘Lucy’ generate the same code. Why not have a play yourself? There are plenty of online tools that generate Soundex codes e.g. http://www.gedpage.com/soundex.html
Soundex FR
A variation of Soundex optimised for French language words. Talend began as a French company after all!
Metaphone / Double Metaphone
Once more to Wikipedia: https://en.wikipedia.org/wiki/Metaphone
Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation.[1] It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.
The original author later produced a new version of the algorithm, which he named Double Metaphone. Contrary to the original algorithm whose application is limited to English only, this version takes into account spelling peculiarities of a number of other languages.
….
Original Metaphone contained many errors and was superseded by Double Metaphone
The rules for Metaphone / Double Metaphone are too complex to reproduce here, but are available online. Suffice to say, if you are going to use a phonetic algorithm with Talend, it is likely that Double Metaphone will be your algorithm of choice. Once again though, be aware that even with Double Metaphone, the ‘over matching’ problem exists and you should handle it appropriately in your DQ processes. This could mean lower thresholds for automatic matching or stewardship processes that allow false positives to be unmatched.
This concludes our brief tour of Talend’s matching algorithms. It should be noted that we also support an algorithm type of ‘custom’ that allows your own algorithm to be plugged in. Another important point is that the algorithms supplied by Talend are focused on ‘character’ / ‘alphabet’ based languages (for edit distance) and specific languages (phonetic). Non-character based languages like Chinese will require different matching strategies / algorithms (have a look online if you want to know more on this topic).
It is at this point I shall have to apologise once more. Last time I promised that we would discuss the actual mechanics of matching with Talend and Survivorship in this blog. However I think this post is long enough as it is and I shall continue with these topics next time.
[1] This obviously took all of our time. We definitely didn’t spend any time eating amazing French food, drinking beer and swapping war stories.