The Three Levels of Data Quality Checks
The three levels of validation every data team needs to master data quality.
By: Oxana Urdaneta
Data Engineers and Data Analysts bear the responsibility of delivering reliable, valuable data that drives decision-making and uncovers new insights. Organizations rely on them to provide accurate, high-quality data that informs critical business strategies. Yet ensuring consistent data quality remains one of their most complex challenges: data quality is a moving target, shaped by business logic, implicit rules, and constantly changing use cases.
Data Quality Isn’t Generic—It’s Contextual
Data quality is never a one-size-fits-all concept; it is profoundly contextual and depends heavily on specific use cases and business requirements. What constitutes “quality” for one team may be irrelevant to another. Data quality issues often arise when the data fails to meet stakeholders’ expectations shaped by how the data is used, which columns are critical, and the specific business rules applied. This complexity requires a tailored approach when implementing data quality validation.
Even teams working with the same data at different pipeline stages may have different expectations regarding data quality based on their understanding of its use. For instance, “Data Lake” teams responsible for loading data may have a more limited view of the data’s end use than the “Analytics Engineering” team, which does the last-mile preparation for decision-making and analysis. This difference in perspective is normal. However, we can define good quality by understanding the expectations around the data.
To address this challenge, let’s view data quality as a spectrum of expectations, ranging from basic, universal checks to more advanced, business-specific validations. These can be broken down into three levels, each building on the last to create a comprehensive strategy to ensure reliable data.
Let’s explore these three levels, beginning with the most basic and progressing to the more advanced checks.
Level 1: Did the Data Land?
The first, most basic, yet crucial check ensures the data landed and that the processes ran successfully. Did the pipelines refresh? Were there any broken pipelines due to access errors or memory issues? This validation is a simple “yes or no,” but it’s essential. Data teams should have automated checks at scale to spot any failed pipelines quickly. However, confirming data exists doesn’t guarantee its accuracy or correctness.
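To make this concrete, a Level 1 check can be as simple as a query against a freshness column. The sketch below assumes a hypothetical orders table with a loaded_at timestamp; in practice this kind of check usually lives in the orchestrator or an observability tool rather than in ad hoc SQL.

    -- Level 1 sketch: did today's data land?
    -- The orders table and loaded_at column are hypothetical names.
    SELECT
      CASE
        WHEN MAX(loaded_at) >= CURRENT_DATE THEN 'pass'
        ELSE 'fail'
      END AS data_landed_check
    FROM orders;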
Level 2: Is the Data Right?
Once you’ve confirmed that the data landed, the next step is to validate its accuracy. Simple automated checks like monitoring volume growth, null rates, and value ranges can reveal if the newly loaded data behaves as expected. For instance, if the number of rows fluctuates wildly or nulls suddenly appear in critical columns, those are immediate red flags.
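As a rough sketch of what these checks can look like in plain SQL, the query below profiles today's load by computing the row count and the null rate of a critical column. The orders table, customer_id column, and loaded_at timestamp are hypothetical, and the resulting values would be compared against expected ranges, whether fixed thresholds or ranges learned by an observability tool.

    -- Level 2 sketch: volume and null-rate profiling of today's load.
    -- Table and column names are hypothetical; acceptable ranges are defined elsewhere.
    SELECT
      COUNT(*) AS rows_loaded_today,
      AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0.0 END) AS customer_id_null_rate
    FROM orders
    WHERE loaded_at >= CURRENT_DATE;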
Many data observability platforms (including Konstellation) offer built-in features powered by machine learning that learn typical data patterns, such as volume growth or column null rates, and raise alerts when new data deviates from them. This second level provides deeper insight than Level 1, but these are still surface-level statistical checks: data can pass all of them and still contain undetected issues.
Why Do Data Quality Problems Happen?
Before proceeding to the final level of checks, it’s crucial to understand why data quality issues arise. Data is dynamic—it can be impacted by system updates, human error, or changes in upstream processes. For example, a system update might abruptly alter the telemetry data, or a user might input incorrect information, causing downstream issues.
In today’s complex data pipelines, many teams and applications must coordinate precisely to produce a consistent output. With so many moving parts, even a small error or miscommunication upstream can snowball into a larger issue that affects the entire pipeline.
Additionally, the increasing volume of data processed today introduces new challenges. Teams must balance accuracy with the operational reality that not all data needs to be perfect. The critical data that informs key business decisions or powers important algorithms should be prioritized for the highest quality. With organizations leveraging only 10-30% of the data they collect, prioritizing quality checks for this subset is crucial.
Level 3: Advanced Business Logic Checks
At this stage, data quality validation delves into specific business logic that can’t be generalized or captured by basic checks. These advanced checks validate business-critical rules, ensuring the data aligns with the organization’s operations and decision-making processes. For example, you might verify that a product_id in a fact table matches the corresponding product_id in a dimension table or ensure that dates (e.g., manufactured_date < shipped_date < delivery_date) follow a logical and expected sequence.
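A minimal sketch of those two examples as custom SQL checks, assuming hypothetical fact_sales and dim_product tables; a non-zero count from either query signals a business-rule violation.

    -- Level 3 sketch: business-logic checks; table and column names are hypothetical.
    -- Referential integrity: every product_id in the fact table must exist in the dimension table.
    SELECT COUNT(*) AS orphaned_product_ids
    FROM fact_sales f
    LEFT JOIN dim_product d
      ON f.product_id = d.product_id
    WHERE d.product_id IS NULL;

    -- Date sequence: manufactured_date < shipped_date < delivery_date.
    -- (How NULL dates should be treated is itself a business rule.)
    SELECT COUNT(*) AS out_of_sequence_rows
    FROM fact_sales
    WHERE NOT (manufactured_date < shipped_date AND shipped_date < delivery_date);

Checks like these typically run after each load and feed an alerting system, so a violation surfaces before stakeholders ever see the data.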
This is where deep business domain knowledge becomes essential. Collaboration across Data Engineers, Analysts, Data Product Managers, Data Scientists, and stakeholders is key to understanding the context and nuances of how the data will be used. This shared knowledge is crucial for designing these custom checks.
Unlike basic validations, these checks often require custom SQL queries to confirm that the data meets the underlying business rules. They are more challenging to develop as they depend on an intimate understanding of business logic, workflows, and how the data interacts across different systems. Because these checks are more tailored and complex, they require careful planning and deeper contextual awareness to ensure accuracy and relevance.
Balancing Perfection and Pragmatism
Data quality checks should focus on what matters most. While some columns, like item_description, may not need perfect accuracy, key metrics like sales_per_product require meticulous validation. The goal isn’t 100% perfection on all tables and columns. Instead, it’s about identifying and addressing the data elements that have the most significant impact on decision-making and business outcomes.
These quality checks evolve as your understanding of how the data is used deepens. When new stakeholders start using a dataset for a fresh use case, their needs will often lead to new quality checks. It’s an iterative process in which checks are refined over time to target the areas that matter most.
Achieving Confidence in Data Quality
Implementing the three levels of validation—ensuring the data landed, checking if it’s accurate, and verifying advanced business logic—builds confidence in your data quality approach. Konstellation simplifies this by offering out-of-the-box checks for Levels 1 and 2 through machine learning and automation while empowering teams to define custom checks for Level 3.
Konstellation summarizes these results into a single Table Reliability Score so you can quickly diagnose data quality issues. This comprehensive approach ensures you monitor what truly matters and keeps your infrastructure manageable.
Ensuring reliable data quality is an ongoing, iterative process. Modern tools can provide robust checks that evolve alongside your data, offering continuous peace of mind.
By embracing these three levels of data quality validation, your team can deliver reliable, valuable data that drives business insights and powers decision-making across your organization.