
Data health, from prognosis to treatment


When we talk about a healthy lifestyle, we know it takes more than diet and exercise. A lifelong practice of health requires discipline, logistics, and equipment. It is the same for data health: if you don’t have the infrastructure that supports all your health programs, those programs become moot. To establish healthy data practices, roles and responsibilities must be clear, tracking and auditing must be extensive without too much friction, and regulations must be seamlessly integrated in core processes — instead of being reluctantly endured.

In my role as Talend’s CTO, I dedicate a lot of time to thinking about how to solve data health problems. At Talend, we’re working on the whole data quality cycle: assessment, improvement, indicators and tracking for prevention… then right back to assessment, because good data is a process that never ends. It involves tools, of course, but also processes and people. Just like patients are themselves key actors of any health system, the data professionals and other users who interact with data are part of the solution to data health. Data health concerns every employee who has contact with data, therefore the approach to data health must be pervasive.

By understanding all the aspects of data quality, you set yourself up for the long-term practice of good data health. And the more you practice good data health within your organization, the less risk you have of data issues leading to bad decisions or security breaches.

What goes into “good” data?

Data quality is essential to data health. Traditionally, data originates from human entries or the integration of third-party data, both of which are prone to errors and extremely difficult to control. In addition, the data that works beautifully for its intended applications can give rise to objective quality problems when extracted for another use — typically analytics. Outside of its intended environment, the data is removed from the business logic that puts it in context, and from the habits, workarounds, and best practices of regular users, which often go undocumented.

Integration and analytics call on data sets from a wide range of applications or databases. But organizations often have inconsistent standards across apps and databases, varied embeds and optimization techniques, or even historical workarounds that make sense inside the source but become undocumented alterations when removed from their original context. So even when a data format or content is not objectively a quality issue within its original silo, it will almost certainly become one when extracted and combined with others for an integration or an analytics project.

Data quality covers the discipline, methodology, techniques, and software that counteract these issues. The first step is establishing a well-defined and efficient set of metrics that allow users to assess the quality of the data objectively. The second is taking action to prevent quality issues in the first place and improving the data to make it even more effective for its intended use.

When data quality becomes a company-wide priority, analytics won’t have to face the specific challenge of combining these disparate sources, and instead can focus on driving some of the most important decisions of the organization.

Measuring data quality

The category of data quality dimensions covers a number of metrics that indicate the overall quality of files, databases, data lakes, and data warehouses. Academic research describes up to 10 data quality dimensions — sometimes more — but, in practice, there are five that are critical to most users: completeness, accuracy, timeliness, consistency, and accessibility.

  • Completeness: Is the data sufficiently complete for its intended use?
  • Accuracy: Is the data correct, reliable, and/or certified by some governance body? Data provenance and lineage — where data originates and how it has been used — may also fall in this dimension, as certain sources are deemed more accurate or trustworthy than others.
  • Timeliness: Is this the most recent data? Is it recent enough to be relevant for its intended use?
  • Consistency: Does the data maintain a consistent format throughout the dataset? Does it stay the same between updates and versions? Is it sufficiently consistent with the other datasets to allow joins or enrichments?
  • Accessibility: Is the data easily retrievable by the people who need it?

Each of these dimensions corresponds to a challenge for an analytics group: if the data doesn’t provide a clear and accurate picture of reality, it will lead to poor decisions, missed opportunities, increased cost, or compliance risks.

In addition to these common dimensions, business-domain specific dimensions are usually added as well, typically for compliance.

In the end, measuring data quality is a complex, multidimensional problem. To add to the challenge, the volume and diversity of data sources have long surpassed what human curation alone can handle. This is why, for each of these dimensions, data quality methodologies define metrics that can be computed, and then combined, to automate an objective measure of the quality of the data.
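
To make these dimensions concrete, here is a minimal sketch in Python of how record-level metrics such as completeness and timeliness could be computed and averaged into dataset-level figures. It is only an illustration, not Talend's implementation; the sample records, field names, and freshness window are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical customer records; in practice these would come from a file,
# a database extract, or a pipeline stage.
records = [
    {"name": "Acme Corp", "email": "sales@acme.example", "updated_at": "2024-05-01T10:00:00+00:00"},
    {"name": "Globex", "email": None, "updated_at": "2021-01-15T08:30:00+00:00"},
]

REQUIRED_FIELDS = ["name", "email", "updated_at"]   # completeness rule (assumed)
FRESHNESS_WINDOW = timedelta(days=365)              # timeliness rule (assumed)

def completeness(record):
    """Share of required fields that are actually populated."""
    filled = sum(1 for f in REQUIRED_FIELDS if record.get(f) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)

def timeliness(record, now=None):
    """1.0 if the record was updated within the freshness window, else 0.0."""
    now = now or datetime.now(timezone.utc)
    try:
        updated = datetime.fromisoformat(record["updated_at"])
    except (KeyError, TypeError, ValueError):
        return 0.0
    return 1.0 if now - updated <= FRESHNESS_WINDOW else 0.0

# Dataset-level metrics are simple averages of the record-level ones.
dimension_scores = {
    "completeness": sum(completeness(r) for r in records) / len(records),
    "timeliness": sum(timeliness(r) for r in records) / len(records),
}
print(dimension_scores)  # completeness ~0.83 here; timeliness depends on today's date
```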

More subjective measures can be added to the mix as well, typically by asking users to provide a rating, or through governance workflows. But even this manual work tends to be increasingly complemented by machine learning and artificial intelligence.

 

Putting data on the right track

Data quality assessment must be a continuous process, as more data flows into the organization all the time. The assessment of data quality typically starts by observing the data and computing the relevant data quality metrics. To get a more exhaustive view, many organizations implement traditional quality control techniques, such as sampling, random testing, and, of course, extensive automation. Any reliable data quality measurement will involve complex, computationally intensive algorithms.

But companies should also be looking at quality metrics that can be aggregated across dimensions, such as the Talend Trust Score™. Static or dynamic reports, dashboards, and drill-down explorations that focus on data quality issues and how to resolve them (not to be confused with BI) provide perspective on overall data quality. For more fine-grained insight, issues will be tagged or highlighted with various visualization techniques. And good data quality software will add workflow techniques, such as notifications or triggers, for timely remediation of data quality issues as they arise.
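
As a follow-on to the per-dimension sketch above, the snippet below shows one way such metrics could be aggregated into a single score and wired to a remediation trigger. The weights and threshold are hypothetical and the aggregation is deliberately naive; it is not the Talend Trust Score™ formula.

```python
# Hypothetical weights and threshold; the aggregation used by a real product
# is far more sophisticated than this weighted average.
WEIGHTS = {"completeness": 0.4, "accuracy": 0.3, "timeliness": 0.2, "consistency": 0.1}
ALERT_THRESHOLD = 0.8

def overall_score(dimension_scores):
    """Weighted average of per-dimension scores, each expected in [0, 1]."""
    return sum(WEIGHTS[d] * dimension_scores.get(d, 0.0) for d in WEIGHTS)

def notify_stewards(dataset, score):
    # Placeholder: a real workflow would open a ticket or message the
    # data stewards responsible for this dataset.
    print(f"ALERT: {dataset} quality score {score:.2f} is below {ALERT_THRESHOLD}")

score = overall_score({"completeness": 0.95, "accuracy": 0.6, "timeliness": 0.6, "consistency": 0.9})
if score < ALERT_THRESHOLD:
    # 0.4*0.95 + 0.3*0.6 + 0.2*0.6 + 0.1*0.9 = 0.77, so the alert fires.
    notify_stewards("crm_contacts", score)
```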

Traditionally data quality assessment has been done on top of the applications, databases, data lakes, or data warehouses where data lives. Many data quality products must actually collect data in their own system before they can run the assessment like an audit, as part of a data governance workflow. But the sheer volume of data most companies work with makes that data duplication inefficient. And, perhaps more importantly, organizations soon realize that assessing data quality after it has already been brought into the system opens them up to needless risk and additional costs.

A more modern approach is pervasive data quality, integrated directly into the data supply chain. The more upstream the assessment is made, the earlier risks are identified, and the less costly the remediation will be. This is why Talend has always used a push-down approach — without moving the data from the data lake or data warehouse — and integrated the data quality improvement processors right into the integration pipelines.
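
To illustrate the push-down idea, the sketch below computes a completeness metric inside the warehouse with a single aggregate SQL query instead of extracting the rows first. The table and column names are hypothetical, and a generic DB-API-style connection is assumed.

```python
def column_completeness(conn, table, column):
    """Non-null ratio for one column, computed without moving the data:
    the aggregation runs inside the warehouse and only one row comes back."""
    # Note: the table and column names are interpolated for brevity here;
    # a real job would validate or parameterize them.
    sql = (
        f"SELECT COUNT({column}) * 1.0 / NULLIF(COUNT(*), 0) "
        f"FROM {table}"
    )
    cur = conn.cursor()  # any DB-API 2.0 style connection
    try:
        cur.execute(sql)
        (ratio,) = cur.fetchone()
    finally:
        cur.close()
    return ratio

# Usage, assuming `conn` is an open connection to the data warehouse:
# print(column_completeness(conn, "sales.orders", "customer_id"))
```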

 

Getting better all the time

Too often data quality is viewed only through the lens of assessment, as a sort of necessary evil similar to a security or financial audit. But where the value truly lies is in continuous improvement. Data quality should be a cycle: the assessment runs regularly (or, even better, continuously), automation is refined all the time, and new actions are taken at the source, before bad data enters the system.

Reacting to problems after they happen remains very costly, and companies that are reactive rather than proactive about data issues will continue to suffer from questionable decisions and missed opportunities. Systematic data quality assessment would clearly be a big step toward avoiding bad decisions and compliance liabilities. Assessment is a prerequisite, but continuous improvement is the endgame. That’s why it is at the core of Talend’s approach to providing comprehensive data health products.

In reality, there is always a tradeoff between a correction at the source, revealed by root cause analysis, and a correction at the destination, typically the data lake or data warehouse. Organizations can be reluctant to change inefficient data entry, application, or business processes if they “work.” Operations are hard to change: nobody wants to break the billing or shipping machine, even in the service of more effective and efficient processes in the long term. However, in recent years, as organizations become more data-driven and untrusted data is increasingly identified as a risk factor, this culture is starting to change. At Talend, we see an opportunity to run data quality improvement processes beyond BI, such as data standardization or deduplication in the CRM for better customer service or higher sales efficiency — just to take one of many operational examples.

Data quality assessment and improvement are tightly intertwined. Imagine if your data quality assessment were sufficiently precise and accurate, with advanced reverse-engineering techniques like semantics extraction. A deviation from the quality standard could then automatically trigger the corresponding improvement. For instance, if a data format is inconsistent, the standardization process relevant for that data type (e.g., a company name or phone number) would be applied, resulting in clean, consistent data entering the workflow. The more precise and complete the assessment, the more options there are for applying similar automation.
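
As a minimal sketch of that kind of type-driven standardization, the snippet below applies a toy phone-number rule to fields whose semantic type has been detected as "phone". The rules and helper names are hypothetical and far simpler than a production standardizer or Talend's own processors.

```python
import re

def standardize_phone(value, default_country="+1"):
    """Normalize assorted phone formats to a single E.164-style string.
    This is a toy rule set, not a full phone-number library."""
    digits = re.sub(r"\D", "", value or "")
    if not digits:
        return None
    if value.strip().startswith("+"):
        return "+" + digits
    return default_country + digits.lstrip("0")

# One standardizer per detected semantic type; extend with company names, dates, ...
standardizers = {"phone": standardize_phone}

def apply_standardization(record, detected_types):
    """Apply the standardizer matching the semantic type detected for each field."""
    for field, semantic_type in detected_types.items():
        fn = standardizers.get(semantic_type)
        if fn and field in record:
            record[field] = fn(record[field])
    return record

print(apply_standardization({"contact": "(415) 555-0101"}, {"contact": "phone"}))
# -> {'contact': '+14155550101'}
```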

As with any governance process, data quality improvement is a balance between tools, processes, and people. Tools are not everything, and not every process can — or should — be automated. But with Data Fabric, Talend has taken a big step toward facilitating data-driven decisions you can trust.

And Talend does not ignore the people side of the equation. After all, human experience and expertise provide crucial insight and nuance, and insert necessary checks in an increasingly AI-driven world. Putting humans in the loop — people who are experts on the data but not experts on data quality — requires a highly specialized workflow and user experience that few products are able to provide. Talend is leading the way here, with tools including the Trust Score™ formula, Data Inventory, and Data Stewardship that allow for collaborative curation of data with human-generated metadata, such as ratings and tagging.

 

A prescription for data health

When it comes to data health, the analogy of physical wellness works well. Both notions of health encompass a complete lifecycle and a set of actors. Healthcare providers and the patients themselves must be responsible for prognosis and treatment, hygiene, and prevention. But infrastructure, regulation, and coverage play an essential part in a health system, too.

So what does it take to build a good data health system?

  • Identification of risk factors. Some risks are endogenous, such as the company’s own applications, processes, and employees, while others (partners, suppliers, customers) come from the outside. By recognizing the areas that present the most risk, we can more effectively prevent dangers before they arise.
  • Prevention programs. Good data hygiene requires good data practices and disciplines. Consider nutrition labels: standardized nutrition facts and nutrition scores educate consumers on how a given meal will affect their overall health. Similarly, the Talend Trust Score™ lets us assess and control the intake of data, producing information that is easier to understand and harder to ignore.
  • Proactive inoculation. Vaccines teach the body to recognize and fight a pathogen before an infection begins. For our data infrastructure, machine learning serves a similar function, training our systems to recognize bad data and suspect sources before they can take hold and contaminate our programs, applications, or analytics (see the sketch after this list).
  • Regular monitoring. In the medical realm, the annual checkup used to be the primary method of monitoring a patient’s health over time. With the advent of medical wearables that collect a range of indicators, from standard measures such as activity or heart rate to more specific functions such as monitoring blood sugar levels in a person with diabetes, the human body has become observable. In the data world, we use terms like assessment or profiling, but it is basically the same idea — and continuous observability might soon become a reality here, as well. The sooner an issue is detected, the higher the chances of an effective treatment. In medicine it can be a matter of life and death (the Apple Watch has already saved lives). The risks are different, of course, but data quality observability could save corporate lives, too.
  • Protocols for continuous prognosis. Doctors can only prescribe the right therapy when they know what to treat. But — and this is another analogy with data health — medicine isn’t purely a hard science. The prognosis is a model that requires constant revision and improvement. It is fair to set this expectation in data health too: it is a continuously improving model, but you can’t afford not to have it.
  • Efficient treatments. Any medical treatment is always a risk/benefit assessment. A treatment is recommended when the benefits outweigh the potential side effects — but that doesn’t mean you only move ahead when there is zero risk. In data, there are tradeoffs as well. Data quality can introduce extra steps into the process. Crucial layers of security can also slow things down. And there is a long tail of edge-case data quality problems that can’t be solved with pure automation and still require a human touch, despite the potential for human error. Good data health professionals like Talend master this balancing act just like doctors do.
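
Returning to the proactive inoculation point above, here is a minimal sketch of one common approach: training an anomaly detector on records known to be healthy and quarantining incoming records it flags. It uses scikit-learn's IsolationForest on hypothetical numeric features and is an illustration only, not Talend's mechanism.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical numeric features extracted from incoming records,
# e.g. (order_amount, items_per_order, days_since_last_update).
history = np.array([
    [120.0, 2, 1], [80.5, 1, 3], [200.0, 4, 2], [95.0, 2, 5],
    [110.0, 3, 1], [130.0, 2, 2], [75.0, 1, 4], [150.0, 3, 3],
])

# Fit on data already known to be (mostly) healthy.
detector = IsolationForest(contamination=0.1, random_state=42).fit(history)

incoming = np.array([
    [115.0, 2, 2],      # resembles the history
    [9000.0, 50, 400],  # clearly out of line
])

# predict() returns +1 for inliers and -1 for anomalies.
for row, label in zip(incoming, detector.predict(incoming)):
    status = "ok" if label == 1 else "quarantine for review"
    print(row, "->", status)
```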

As in medicine, we may never have a perfect picture of all the factors that affect our data health. But by establishing a culture of continuous improvement, backed by people equipped with the best tools and software available for data quality, we can protect ourselves from the biggest and most common risks. And if we embed quality functionality into the data lifecycle before it enters the pipeline, while it flows through the system, and as it is used by analysts and applications, we can make data health a way of life.

 



AutoZone: Exceeding customer expectations with speed of service


“Talend is amazing because it’s open, flexible, and visual. The robustness and reliability of Talend have made it an integral part of our solution set. It’s easy to learn and fast to ramp up.”

– Jason Vogel, IT Manager, AutoZone

 

AutoZone is America’s #1 vehicle solutions provider. It was founded in 1979 and has since expanded to more than 6,400 stores across three countries, with over 96,000 employees. It earns nearly $12 billion in annual revenue and is a Fortune 500 company.

 

Talend customer: AutoZone

 

AutoZone’s key mission is to put its customers first. The company serves everyone from everyday consumers and DIY auto enthusiasts to repair shops and other retailers, including dealerships.

For AutoZone, agility is essential. “Repairs are often emergencies,” says Jason Vogel, IT manager for AutoZone. “When your car is out of business, you are out of business. Customers expect to receive their products nearly as fast as they can order them. Parts need to be in stores, but we can’t stock an unlimited amount of every part at every store, so the supply chain is simply critical and must be optimized.”

Amidst these challenges, AutoZone is pursuing an aggressive growth strategy, opening between 50 and 150 stores annually. And it’s accelerating the growth of its commercial business, supporting local repair shops that depend on fast service to get their clients’ cars fixed. Since repair shops have many supplier options, speed is the name of the game for AutoZone, and Vogel sees fast data as the key to success.

 

Delivering complete, compliant, and timely data to the right people

Over its 40+ years in business, AutoZone has accumulated a massive amount of data, ranging from pricing to inventory, spread across 20 different types of databases and thousands of instances, in many data formats. To bring all of that data together and get it to the people who need it as fast as possible, the company relies on Talend. “We don’t just hand them the data and walk away,” says Vogel. “We need to provide our users the data in a way that they can use it and trust it so that they can drive their analysis and focus on quickly generating important business decisions. This flawless execution through Talend has been critical.”

Getting trustworthy data to the right people at the right time enables AutoZone to serve its customers in countless ways. It provides retail employees the means to instantly locate and provide parts to customers, along with the knowledge of how to install those parts. It optimizes the supply chain to ensure agile, efficient operations, so stores and repair shops always have the parts they need. And it allows AutoZone to provide innovative programs like their Next Day Delivery service, which covers 85% of the US population and lets customers order up to 100,000 parts or products as late as 10 p.m. for delivery the next day. “Talend was an integral part of that Next Day Delivery program,” says Vogel. “We helped provide the data that went into that. It was magnificent and wonderful to be a part of such a highly successful and important solution.”

 

A 75x increase in order capacity in 2 days

 “Reliability is one of the things that makes Talend such a strong solution for us,” notes Vogel. “We have thousands of processes that run every day, all day long, and they have to run reliably and robustly. It’s critical to our business, it’s critical to our users, and it’s critical to our customers.” He recalls a recent challenge where a complex process was not scaling to the level needed. Over the course of 48 hours, Vogel and his team redesigned and rebuilt that process.

The result? A 75x increase in order capacity.

“We went from processing approximately 2,000 records per hour to nearly 5,000 records per minute. We haven’t been able to stress it beyond that because we haven’t had the volume to stress it to that level. We met the SLA and we blew it out of the water. Talend’s ability to visualize what was going on in that system in two days was phenomenal.”

 

The road ahead

Vogel credits this kind of agility with supporting AutoZone’s growth strategy. “We continue to expand and compete, and the only way we can do that is by accelerating how we develop our systems. Speed of service means that customers are able to get their parts quickly. If they’re able to get their parts quickly, they’re likely to be repeat customers and depend on us. This allows us to further optimize our inventory, to provide operational excellence, and to grow our business and support new stores and regions.” This expansion allows AutoZone to provide even faster service to both consumers and business customers, increasing the company’s revenues and growing its market presence.

In the future, Vogel plans to use Talend to expand real-time processing to do more in less time and get products to customers faster. “We have to continue growing, and the only way to accomplish that is by moving more data, having more insight into how our data is used, and accomplishing it all faster and better. We use Talend to collect, govern, transform, and share our data, and it continually evolves into even better solutions and answers every day.”

 


Is your data healthy?


It’s no secret that what companies need from their data and what they can actually get from their data are two very different things. According to our recent survey, most executives work with data every day, but only 40% of them always trust the data they work with. We also discovered that 78% of them have challenges making data-driven decisions. Virtually every business is collecting more data than ever before, so lack of data can’t be the issue. The problem is that there’s just too much data that isn’t ready to act on.

When data isn’t accessible, reliable, or well understood, it’s no wonder that more than a third of business leaders say they go with their gut – rather than their data – to make decisions.

We think there’s a better way. Through a concept we call “data health,” organizations can develop a deeper understanding of their data and achieve a common language of data quality. With this new approach, everyone can participate in maintaining the well-being of corporate data and trust it to guide their decisions.

 

Talend Data Health infographic

 


How Alan Turing inspires me as a computer scientist and an ally


You know those internet rabbit holes you can fall down, where you think you will spend five minutes briefly reading a few tweets, and then two hours later you’re reading something that awakens something in you? That rarely happens to me. Normally I end up wasting a couple of hours going from translating slang I’ve heard from my niece to finding out all about one of the Kardashian pets’ pampering preferences. However, the other day I had that rare, special experience while otherwise wasting neurons flicking through my Twitter feed.

 

Alan Turing

© Fine Art Images—Heritage Images/age fotostock

I came across a tweet about Alan Turing that read, “Alan Turing killed himself 67 years ago today. He broke the Nazi Enigma code in WW2, leading to the victory over Hitler. Turing was chemically castrated by the British state for being gay. His suicide warns us today of where intolerance and hatred can lead to. #PrideMonth”.

Little of this was news to me. I’ve been fascinated by Turing for as long as I can remember. I’ve seen practically every documentary about him and his work, and I’ve read whatever I have found about him. I even studied Computer Science with Artificial Intelligence at university in part because of him… and I still get caught in internet rabbit holes when he is used as bait. The man was a fascinating and brilliant, yet tragic, character.

 

We are currently in the middle of celebrating Pride month. Pride takes place in June each year to commemorate the Stonewall riots, which occurred at the end of June 1969. Ironically (and this is the piece of information that got me looking into this), Pride month also encompasses the anniversaries of Turing’s suicide toward the start of the month (June 7, 1954) and of his birth towards the end (June 23, 1912). I’ve not found any evidence of a formal link between Turing’s birth or death and Pride month, but it got me thinking: this coincidence makes Turing an especially fitting figure for people working in computer science to reflect on during this period.

While we are standing on the shoulders of many intellectual giants at Talend, Alan Turing’s shoulders carry an awful lot of the theoretical weight. Most of us know of Turing as the guy who cracked the Enigma code during World War II. The top secret work he and his colleagues accomplished is thought to have shortened the war by as many as two to four years. It’s fair to say that at least some of us at Talend may have had our family trees pruned long before we came to be, were it not for that unassuming Hut 8 at Bletchley Park. The film The Imitation Game covers much of that history (with an awful lot of dramatic license), but doesn’t really go into his importance in computer science and artificial intelligence.

 

In 1936, Turing published an idea on “automatic machines”. These were purely theoretical at the time, but were an incredibly innovative concept. Turing envisioned a machine that could be “programmed” to perform any algorithmic function on numbers without having to be rewired or physically changed. The machine would read a tape with a series of characters or instructions one at a time, process it according to a coded algorithm, then move the tape backwards or forwards as required.

Researchers in computer science still use these “automatic machines” today as a simple model of what happens inside a CPU. Along with his doctoral supervisor Alonzo Church (who dubbed them “Turing machines”), Turing hypothesised a universal Turing machine: a theoretical machine capable of reading and performing any algorithmic function. This concept underlies “Turing completeness”, a defining feature of modern computing.
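
To make the tape-reading idea concrete, here is a tiny Turing machine simulator in Python, a modern illustration rather than anything from Turing's 1936 paper. The transition table plays the role of the "coded algorithm"; this particular machine adds one to a binary number written on the tape.

```python
# Transition table: (state, symbol) -> (symbol to write, head move, next state).
RULES = {
    ("right", "0"): ("0", +1, "right"),
    ("right", "1"): ("1", +1, "right"),
    ("right", "_"): ("_", -1, "carry"),  # hit the blank after the number
    ("carry", "1"): ("0", -1, "carry"),  # 1 + carry -> 0, keep carrying
    ("carry", "0"): ("1",  0, "halt"),   # absorb the carry
    ("carry", "_"): ("1",  0, "halt"),   # overflow: grow the number by one digit
}

def run(tape, state="right", head=0, blank="_", max_steps=1000):
    """Simulate the machine on a dict-based, unbounded tape."""
    cells = dict(enumerate(tape))
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        write, move, state = RULES[(state, symbol)]
        cells[head] = write
        head += move
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, blank) for i in range(lo, hi + 1)).strip(blank)

print(run("1011"))  # binary 11 + 1 -> "1100" (binary 12)
```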

 

Today, we rely on artificial intelligence practically everywhere: from identifying subjects in images, to a smart assistant understanding a question you ask and nailing the context, to searching for a specific segment in a video via image recognition and/or speech recognition with natural language processing, to heading towards a world of driverless vehicles. All of this has come from enabling computers to “learn from experience”.

In a lecture in 1947, Turing said, “what we want is a machine that can learn from experience”. The mechanism for this, he added, was the “possibility of letting the machine alter its own instructions”. This is arguably the first public lecture to mention computer intelligence. It’s astounding to think about what Turing was theorising at a time when Colossus, the first programmable electronic computer, was only four years old. Colossus is a far cry from the smartwatch you may be wearing now.

 

Turing’s theories are still used in modern day computer science. The “Turing test” as it is currently known, or the “imitation game” as Turing named it in 1950, is a test that he devised to see how well a machine exhibits intelligent behaviour. The test is incredibly insightful given that it would be decades before it could be put to use. It works on the premise that human intelligence should be sufficient to identify a machine imposter.

Turing proposed that a test involving three participants would take place. The three participants would be made up of two humans and a machine. The three would converse using natural language, via text only, from separate locations. They would not know anything about the other participants. One of the human participants would act as the evaluator, and it would be up to them to decide which of the other two participants was human and which was a machine. If the evaluator could not reliably tell the machine from the human, then the machine is said to have passed the test.

 

We’ve all interacted with chatbots when dealing with service organizations. While few, if any, even come close to passing the Turing test, huge leaps have been made in natural language processing and context awareness. I do like to test chatbots, which has led to some awkward situations… as it turns out, they aren’t all chatbots. Sometimes humans do not pass the Turing test! So, if you are working with Talend’s in-product chat and you get some weird questions, it could very well be because you have someone like me trying to figure out if you are a machine.

Turing wasn’t just prominent in cryptography and computing, and arguably the father of artificial intelligence. He also lent his expertise to mathematical biology. He was a true polymath who could have offered so much more to the world were it not for the fact that he did not conform to what the British establishment (in fact, most of the world at the time) believed to be “normal”.

Turing had to hide a fundamental part of his identity for fear of being labelled a criminal. It wasn’t until 13 years after his death that he could have been free to live and love with rights similar to those people enjoy today in England and Wales (Scotland and Northern Ireland didn’t change their laws until the early eighties). In order to contribute towards saving an untold number of lives in World War II, Turing had to sacrifice a large part of himself to “fit in” with those who would benefit from his work.

 

I can’t claim to be able to even imagine what that was like. Being a heterosexual, white male identifying as a man, born in an affluent area of the UK, I am about as far as you can get from being excluded. Equity has never been something that I have had to fight for, or hide a part of myself to keep. In fact, I feel quite uncomfortable writing this. I can try and empathise, but I can’t possibly know what I am talking about and I certainly don’t want to be even close to coming across like this chap.

The only part of Turing that I can really relate to is his introversion. I’d love to claim that I could keep up with his level of geek, but I’d be kidding myself. So, what can I do? From talking with friends and listening to people whose experiences may be more closely aligned to Turing’s than mine, I have learnt that I can listen and allow what I hear to educate me. I can’t and I don’t want to lecture. That is not my role in this. However, I hope that I can contribute towards building up and strengthening the diversity of all groups I take part in; championing equity, fighting for inclusivity, and doing my best to make sure that every individual that I cross paths with has the opportunity to not just feel like they belong, but to know they belong. 

 

If you’re interested in working at an inclusive company, we’d love to have you at Talend. Learn more on our Careers page.

 

 

