Data Quality follows the same principles as other well-defined quality processes: it is all about engaging in an improvement cycle to Define & Detect, Measure, Analyze, Improve and Control quality.
This doesn’t happen at one point in time, or in one place. It should be an ongoing effort, something that is often neglected when dealing with data quality. Think about the big ERP, CRM or IT consolidation projects where data quality gets a lot of attention during the roll-out, and then fades away once the project is delivered.
The ubiquitous nature of quality is key as well. We have all experienced it in the physical world: if you are a car manufacturer, you had better have many quality checks across your manufacturing and supply chain, and you had better identify problems and their root causes as early as possible in the chain. Think about the cost of recalling a vehicle at the end of the chain, once the product has been shipped to the customer, as Toyota recently experienced when it recalled 6 million vehicles for an estimated cost of $600 million. Quality should be a moving picture, too: as you progress through your quality cycle, you have the opportunity to move upstream in your process. Take the example of General Electric, known for years as best in class for turning quality methodologies such as Six Sigma into the core of its business strategy. GE is now pioneering the use of Big Data for the maintenance process in manufacturing. Through this initiative, the company has moved beyond detecting quality defects as they happen: it can now predict them and perform preventive maintenance to avoid them.
What has been experienced in the physical world of manufacturing applies in the digital world of information management as well. This means you need to be able to position data quality controls and corrections everywhere in your information supply chain. I see six usage scenarios for that.
The first one is applying data quality when data needs to be repurposed. This usage scenario is not new: it gave birth to the principles of data quality in IT systems. Most companies adopted it in the context of their business intelligence initiatives: it requires consolidating data from multiple sources, typically operational systems, and getting it ready for analysis. To support this usage scenario, data quality tools can be provided as stand-alone tools with their own data marts or, even better, tightly bundled with data integration tools.
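To make this first scenario concrete, here is a minimal sketch in Python of a data quality step embedded in a batch consolidation job: records pulled from an operational source are checked and standardized before being loaded into an analytical data mart, and rejects are routed aside for review. The column names and rules are assumptions chosen for the example, not a description of any particular tool.

```python
# A minimal sketch of a data quality step inside a batch consolidation job:
# validate and standardize source records, route the rest to a reject file.
import csv
import re
from datetime import datetime

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_and_standardize(row):
    """Return (clean_row, errors) for a single source record."""
    errors = []
    row = {k: (v or "").strip() for k, v in row.items()}
    if not EMAIL_RE.match(row.get("email", "")):
        errors.append("invalid email")
    row["country"] = row.get("country", "").upper()          # standardize country codes
    try:                                                      # normalize dates to ISO format
        row["signup_date"] = datetime.strptime(
            row.get("signup_date", ""), "%d/%m/%Y").date().isoformat()
    except ValueError:
        errors.append("unparseable signup_date")
    return row, errors

def consolidate(source_path, target_path, reject_path):
    """Copy valid, standardized rows to the target; route the rest to rejects."""
    with open(source_path, newline="") as src, \
         open(target_path, "w", newline="") as tgt, \
         open(reject_path, "w", newline="") as rej:
        reader = csv.DictReader(src)
        ok_writer = csv.DictWriter(tgt, fieldnames=reader.fieldnames)
        ko_writer = csv.DictWriter(rej, fieldnames=reader.fieldnames + ["dq_errors"])
        ok_writer.writeheader()
        ko_writer.writeheader()
        for row in reader:
            clean, errors = check_and_standardize(row)
            if errors:
                clean["dq_errors"] = "; ".join(errors)
                ko_writer.writerow(clean)
            else:
                ok_writer.writerow(clean)
```

In a real data integration job, the reject flow would typically feed a data stewardship console rather than a flat file, but the principle is the same: quality checks sit inside the consolidation pipeline itself.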
A similar usage scenario, but “on steroids”, happens in the context of Big Data. Here, the role of data quality is to add a fourth V, Veracity, to the well-known three V’s defining Big Data: Volume, Variety and Velocity. We will come back to Velocity later in the article. Managing extreme Volumes mandates new approaches for processing data quality: quality controls have to move to where the data is, rather than the other way around. Technically speaking, this means that data quality should run natively on big data environments such as Hadoop and leverage their native distributed processing capabilities, rather than operate on top as a separate processing engine. Variety is also an important consideration: data may come in different forms, such as files, logs, databases, documents, or data interchange formats like XML or JSON messages. Data quality then needs to turn the “oddly” structured data often seen in Big Data environments into something more structured that can be connected to the traditional enterprise business objects: customers, products, employees, organizations and so on. Data quality solutions should therefore provide strong capabilities in terms of profiling, parsing, standardization and entity resolution. Those capabilities can be applied before the data is stored, and designed by IT professionals; this is the traditional way to deal with data quality. Alternatively, data preparation can be delivered on an ad-hoc basis at run time, by data scientists or business users. This is sometimes referred to as data wrangling or data blending.
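As a hypothetical illustration of moving the processing to where the data lives, here is a Hadoop Streaming-style mapper in Python that parses semi-structured JSON log lines on the cluster nodes themselves and emits standardized, tabular records that can later be joined to traditional business objects. The field names and log layout are assumptions made for the example.

```python
#!/usr/bin/env python3
# Sketch of a Hadoop Streaming-style mapper: parse raw JSON log lines where they
# are stored and emit standardized records keyed by a (hypothetical) customer id.
import json
import sys

def standardize(event):
    """Turn one raw log event into a flat, standardized record (or None)."""
    customer_id = str(event.get("customer_id") or "").strip()
    if not customer_id:
        return None                        # unusable without a join key
    return {
        "customer_id": customer_id,
        "event_type": str(event.get("type") or "unknown").lower(),
        "timestamp": str(event.get("ts") or ""),
        "country": str(event.get("country") or "").upper(),
    }

def main():
    for line in sys.stdin:                 # Hadoop Streaming feeds input splits on stdin
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue                       # parse failure: skip (or count it for profiling)
        if not isinstance(event, dict):
            continue
        record = standardize(event)
        if record:
            # key<TAB>value output, ready for a reducer or a load into a warehouse
            print(record["customer_id"] + "\t" + json.dumps(record))

if __name__ == "__main__":
    main()
```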
The third usage scenario lies in the ability to create data quality services, which allow data quality controls to be applied on demand. An example could be a website with a web form to capture customer contact information. Instead of letting a visitor type whatever data they want into the form, a data quality service can apply data quality checks in real time. This gives the opportunity to check information such as the e-mail address, postal address, company name, phone number, and so on. Even better, it can automatically identify a customer without requiring them to explicitly log on and/or provide contact information, as social networks and best-in-class websites or mobile applications like Amazon.com already do. Going back to our automotive example, this scenario provides a way to cut the costs of data quality: such controls can be applied at the earliest steps of the information chain, even before erroneous data enters the system. Marketing managers may be the best people to understand the value of this usage scenario: they struggle with the poor quality of the contact data they get through the internet. Once it has entered the marketing database, poor data quality becomes very costly and badly impacts key activities such as segmenting, targeting and calculating customer value. Of course, the data can be cleansed at later steps than when a prospect or customer fills in a web form, but this requires significant resolution effort, and the related cost is then much higher than if the data is quality-proofed at the entry point.
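Here is a minimal sketch of what such a data quality service could look like, using Python and Flask purely for illustration; this is not Talend’s data quality service, and the endpoint name, fields and rules are assumptions.

```python
# Sketch of an on-demand data quality service a web form could call in real time.
import re
from flask import Flask, jsonify, request

app = Flask(__name__)

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
PHONE_RE = re.compile(r"^\+?[0-9 ().-]{7,20}$")

def field(contact, key):
    """Read a field defensively as a trimmed string."""
    return str(contact.get(key) or "").strip()

@app.route("/dq/validate-contact", methods=["POST"])
def validate_contact():
    """Check a contact record on demand and report field-level issues."""
    contact = request.get_json(force=True) or {}
    issues = {}
    if not EMAIL_RE.match(field(contact, "email")):
        issues["email"] = "does not look like a valid e-mail address"
    if not PHONE_RE.match(field(contact, "phone")):
        issues["phone"] = "does not look like a valid phone number"
    if not field(contact, "company"):
        issues["company"] = "company name is missing"
    return jsonify({"valid": not issues, "issues": issues})

if __name__ == "__main__":
    app.run(port=8080)   # the web form calls this endpoint before submitting
```

The form’s JavaScript would call this endpoint while the visitor types or before submitting, and the same checks could be reused server-side so that no erroneous record reaches the marketing database in the first place.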
Then there is quality for data in motion. It applies to data that flows from one application to another, for example an order that goes from sales to finance and then to logistics. As explained in the third usage scenario, it is a best practice for each system to implement gatekeepers at the point of entry, in order to reject data that doesn’t match its data quality standards. Data quality then needs to be applied in real time, under the control of an Enterprise Service Bus. This scenario can happen inside the enterprise, behind its firewall; this is the fourth usage scenario. Alternatively, data quality may also run in the cloud, and this is the fifth scenario.
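A minimal sketch of this gatekeeper pattern might look like the following; the message fields, rules and in-memory queues are assumptions, standing in for the channels and mediation logic an Enterprise Service Bus would actually provide.

```python
# Sketch of a gatekeeper for data in motion: validate an order message before
# handing it to the next application, and route rejects to an error channel.
from queue import Queue

REQUIRED_FIELDS = ("order_id", "customer_id", "amount", "currency")

def validate_order(message):
    """Return a list of data quality violations for an order message."""
    violations = [f for f in REQUIRED_FIELDS if not message.get(f)]
    amount = message.get("amount")
    if isinstance(amount, (int, float)) and amount <= 0:
        violations.append("amount must be positive")
    if message.get("currency") not in ("EUR", "USD", "GBP"):
        violations.append("unsupported currency")
    return violations

def gatekeeper(message, downstream: Queue, rejects: Queue):
    """Pass clean messages downstream; route the rest to the reject channel."""
    violations = validate_order(message)
    if violations:
        rejects.put({"message": message, "violations": violations})
    else:
        downstream.put(message)

if __name__ == "__main__":
    # One well-formed order flows through, one broken order is rejected.
    downstream, rejects = Queue(), Queue()
    gatekeeper({"order_id": "42", "customer_id": "C7",
                "amount": 99.5, "currency": "EUR"}, downstream, rejects)
    gatekeeper({"order_id": "", "customer_id": "C8",
                "amount": -3, "currency": "XXX"}, downstream, rejects)
    print(downstream.qsize(), "accepted,", rejects.qsize(), "rejected")
```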
The last scenario is data quality for Master Data Management (MDM). In that context, data is standardized into a golden record, while the MDM system acts as a single point of control. Applications and business users share a common view of the data related to entities such as customers, employees, products, the chart of accounts, and so on. Data quality then needs to be fully embedded in the master data environment and to provide deep capabilities in terms of matching and entity resolution.
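To illustrate the kind of matching and survivorship logic involved, here is a deliberately simplified sketch in Python that groups near-duplicate customer records with a crude fuzzy name comparison and merges each group into a golden record. Real MDM matching is far more sophisticated, with blocking, multi-attribute scoring, tuning and stewardship workflows; the records and rules below are invented for the example.

```python
# Sketch of matching and survivorship for a golden record: fuzzy-group records
# by name, then keep the most complete value per field in each group.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Crude fuzzy match on normalized names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio() >= threshold

def build_golden_records(records):
    """Cluster records by fuzzy name match and merge each cluster."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if similar(rec["name"], cluster[0]["name"]):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])

    golden = []
    for cluster in clusters:
        merged = {}
        for field in ("name", "email", "phone", "city"):
            # survivorship rule: keep the longest (most complete) non-empty value
            merged[field] = max((r.get(field, "") for r in cluster), key=len)
        golden.append(merged)
    return golden

if __name__ == "__main__":
    records = [
        {"name": "Acme Corp.", "email": "sales@acme.example", "phone": "", "city": "Paris"},
        {"name": "ACME Corp", "email": "", "phone": "+33 1 23 45 67 89", "city": ""},
        {"name": "Globex", "email": "info@globex.example", "phone": "", "city": "Lyon"},
    ]
    for rec in build_golden_records(records):
        print(rec)
```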
Designing data quality solutions so that they can run across all those scenarios is a driver for our design at Talend. Because one of the core capabilities of our unified platform is to generate code that can run everywhere, our data quality processing can run in any context, which we believe is a key differentiator. Our data quality component is delivered as a core component in all our platforms: it can be embedded into a data integration process, deployed natively in Hadoop as a MapReduce job, and exposed as a data quality service to any application that needs to consume it in real time. It is also delivered as a key capability of our application integration platform, our upcoming cloud platform and, indeed, our MDM platform. Even more importantly, data quality controls can move up the information chain over time. Think about customer data that is initially quality-proofed in the context of a data warehouse through our data integration capabilities. Later, through MDM, this unified customer data can be shared across applications. In this context, data stewards can learn more about the data and be alerted when it is erroneous. This will then help them identify the root cause of bad data quality, for example a web form that brings junk e-mails into the customer database. Data services can then come to the rescue to prevent erroneous data inputs on the web form, and to reconcile the entered data with the MDM through real-time matching. Finally, Big Data could provide an innovative approach to identity resolution so that the customer can be automatically recognized by a cookie after opting in, sending the web form into retirement.
Indeed, such a process doesn’t happen in one day. But remember the key principle of quality mentioned at the beginning of this post: continuous improvement is the target!
Jean-Michel