This blog is the first in a series of three looking at Data Matching and how it can be done within the Talend toolset. This first blog looks at the theory behind Data Matching: what it is and how it works. The second blog will look at using the Talend toolset to actually do Data Matching. Finally, the last blog in the series will look at how you can tune the Data Matching algorithms to achieve the best possible results.
First, what is Data Matching? Basically, it is the ability to identify duplicates in large data sets. These duplicates could be people with multiple entries in one or many databases, or duplicate items of any description in stock systems. Data Matching allows you to identify duplicates, or possible duplicates, and then take actions such as merging two identical or similar entries into one. It also allows you to identify non-duplicates, which can be equally important, because you want to know that two similar things are definitely not the same.
So, how does Data Matching actually work? What are the mathematical theories behind it? OK, let’s go back to first principles. How do you know that two ‘things’ are actually the same ‘thing’? Or, how do you know if two ‘people’ are the same person? What is it that uniquely identifies something? We do it intuitively ourselves. We recognise features in things or people that are similar, and acknowledge they could be, or are, the same. In theory this can apply to any object, be it a person, an item of clothing such as a pair of shorts, a cup or a ‘widget’.
This problem has actually been around for over 60 years. It was formalised in the 1960s in the seminal work of the statisticians Ivan Fellegi and Alan Sunter, and one of its earliest large-scale uses was at the US Census Bureau. The technique is called ‘Record Linkage’, i.e. how are records from different data sets linked together? When applied to duplicate records it is sometimes called De-duplication: the process of identifying duplicates and linking them. So, what properties help identify duplicates?
Well, we need ‘unique identifiers’. These are properties that are unlikely to change over time. We can assign a probability to each property: the probability that, if two records agree on that property, they actually refer to the same thing. This applies equally to people and to objects.
The problem, however, is that things can and do change, or they get misidentified. The trick is to identify what can change, e.g. a name, an address or a date of birth. Some things are less likely to change than others. For objects, this could be size, shape, color, etc.
NOTE: Record linkage is highly sensitive to the quality of the data being linked. Data should first be ‘standardized’ so it is all of a similar quality.
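As a simple illustration of standardization, here is a minimal Python sketch; the field names and formatting rules are hypothetical, chosen only to show the idea of bringing all records to a common format before any matching is attempted:

```python
import re
from datetime import datetime

def standardize_record(record):
    """Normalise a raw record so all records share a common format.
    The field names here ('name', 'dob', 'postcode') are purely illustrative."""
    clean = {}
    # Trim whitespace, collapse repeated spaces and upper-case the name
    clean["name"] = re.sub(r"\s+", " ", record.get("name", "")).strip().upper()
    # Parse the date of birth into a single ISO format, if possible
    dob = record.get("dob", "").strip()
    for fmt in ("%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"):
        try:
            clean["dob"] = datetime.strptime(dob, fmt).date().isoformat()
            break
        except ValueError:
            clean["dob"] = dob  # leave as-is if no known format matches
    # Remove spaces from the postcode and upper-case it
    clean["postcode"] = record.get("postcode", "").replace(" ", "").upper()
    return clean

print(standardize_record({"name": "  stefan   franczuk ", "dob": "01/02/1990", "postcode": "g12 8qq"}))
# {'name': 'STEFAN FRANCZUK', 'dob': '1990-02-01', 'postcode': 'G128QQ'}
```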
Now, there are two sorts of record linkage:
1. Deterministic Record Linkage
a. Based on a set of identifiers that must match exactly
2. Probabilistic Record Linkage
a. Based on the probability that a number of identifiers match
The vast majority of Data Matching is probabilistic: deterministic links are too inflexible for real-world data, where typos and variations are common.
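To make the distinction concrete, here is a minimal Python sketch (not Talend code; the records, field names and the use of difflib are purely illustrative) contrasting the two approaches on a single pair of records:

```python
from difflib import SequenceMatcher

record_a = {"surname": "Franczuk", "dob": "1990-02-01", "postcode": "G128QQ"}
record_b = {"surname": "Franczuck", "dob": "1990-02-01", "postcode": "G128QQ"}  # typo in surname

# Deterministic linkage: the records link only if every identifier agrees exactly.
deterministic_match = all(record_a[f] == record_b[f] for f in ("surname", "dob", "postcode"))
print(deterministic_match)  # False - one typo breaks the link

# Probabilistic linkage: score how similar each identifier is and combine the scores.
def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()  # 1.0 means identical

score = sum(similarity(record_a[f], record_b[f]) for f in ("surname", "dob", "postcode")) / 3
print(round(score, 2))  # roughly 0.98 - very likely the same person
```

A single typo is enough to break the deterministic link, whereas the probabilistic score still flags the pair as a very likely match.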
So, just how do you match? First, you do what is called blocking: you sort the data into similarly sized blocks of records that share the same value for a chosen attribute, picking ‘attributes’ that are unlikely to change. This could be surname, date of birth, color, volume or shape. Next you do the matching. First, assign a match type to each attribute (there are lots of different ways to match attributes: names can be matched phonetically, dates by similarity, and so on). Then you calculate the RELATIVE weight for each match attribute, which is essentially a measure of its ‘importance’. This involves estimating, for each field, the probability that it agrees when two records are a true match, and the probability that it agrees purely by chance when they are not. Finally, you apply an algorithm that combines the relative weights across all the attributes into what is called a ‘Total Match Weight’. That is the probabilistic match score for two things.
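Here is a hedged Python sketch of the blocking and weighting idea. It is not how Talend implements it; the m- and u-probabilities are invented for illustration, and difflib stands in for the phonetic and similarity matchers a real tool would provide. The agreement and disagreement weights follow the usual Fellegi-Sunter form, log2(m/u) and log2((1-m)/(1-u)).

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher

# Illustrative m/u probabilities per attribute (invented values):
#   m = probability the attribute agrees when the records truly match
#   u = probability the attribute agrees purely by chance when they do not
M_U = {"surname": (0.95, 0.01), "dob": (0.98, 0.002), "postcode": (0.90, 0.05)}

def agrees(a, b, threshold=0.85):
    """Treat two values as agreeing if they are sufficiently similar."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def total_match_weight(rec_a, rec_b):
    """Sum Fellegi-Sunter style agreement/disagreement weights over all attributes."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if agrees(rec_a[field], rec_b[field]):
            total += math.log2(m / u)              # agreement adds weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts weight
    return total

def block_by(records, key):
    """Blocking: only records sharing the same blocking key are ever compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks

records = [
    {"surname": "Franczuk", "dob": "1990-02-01", "postcode": "G128QQ"},
    {"surname": "Franczuck", "dob": "1990-02-01", "postcode": "G128QQ"},
    {"surname": "Smith", "dob": "1985-07-14", "postcode": "G128QQ"},
]

for block in block_by(records, "postcode").values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            w = total_match_weight(block[i], block[j])
            print(block[i]["surname"], "vs", block[j]["surname"], "->", round(w, 1))
```

Running this, the pair with the near-identical surnames produces a large positive total weight, while the pairs involving ‘Smith’ come out strongly negative; choosing the thresholds that separate ‘match’, ‘possible match’ and ‘non-match’ is exactly the tuning problem covered later in the series.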
To summarize:
• Standardize the Data
• Pick Attributes that are unlikely to change
• Block: sort the data into similarly sized blocks
• Match via probabilities (remember there are lots of different match types)
• Assign Weights to the matches
• Add it all up – get a TOTAL weight
The final step is to tune your matching algorithms so that you can obtain better and better matches. This will be covered in the third article in this series.
The next question, then, is what tools are available in the Talend toolset and how they can be used to do Data Matching. This will be covered in the next article in this series of Data Matching blogs.
About the Author
Stefan Franczuk is a Customer Success Architect at Talend, based in the UK. He moved into IT over 25 years ago, after starting his working career in engineering and aviation and then switching to academia and IT, and has a wide range of experience across many different IT disciplines. For the last 15 years he has architected many integration solutions for clients all over the world. Dr. Franczuk also has a wide range of experience in Big Data, Data Science and Data Analytics, and holds a PhD in Experimental Nuclear Photo-Physics from The University of Glasgow.