Top 5 Essential Data Quality Checks- QACraft

07 Sep

We’ve discussed the fundamental aspects of data quality and how to approach them intentionally. We have described how to include things into a framework driven by your company. This page will discuss the fundamental data quality checks you should do on your data and how dbt can make these checks easier and more automated.

We’ll discuss the ideas of uniqueness, non-nullness, acceptance, referential integrity, and freshness, and how these intertwine in dbt to form the data quality foundation.

Uniqueness:

There is nothing that compares to discovering a duplicate primary key in a table and the chill it causes a data professional to experience. You can keep your data, your business users, and your spine happy by using a uniqueness test to ensure that all the rows for a given column are unique.

In your data warehouse tables, you’ll often check the uniqueness of the main and substitute keys. Every time you update a table manually in SQL, you’d likely create a little query similar to this one:

select

from your_table

group by id

having count(*) > 1

Non-unique IDs will all be returned by this query.

Even though it’s enjoyable to rewrite and run this query whenever your tables are changed, there is a much quicker and simpler approach to finding duplicates with dbt: the unique test. You may add a unique test using the uniqueness test to any column in your dbt models:

models:

– name: orders

columns:

– name: id

tests:

– unique

In the aforementioned example, each time you run your tests against your orders table in dbt, the id column will be examined for uniqueness. Behind the scenes, dbt is creating a test to run on your data warehouse that is remarkably similar to the original SQL presented above.

The days of having count(*) > 1 scattered across your SQL spreadsheets are over, as is the far worse situation of failing to see a duplicate entering your data. Instead, single lines of code-based, version-controlled tests that notify you of errors are now the norm.

Non-nullness:

The existence of certain column values should be verified by non-null tests. Here are a few instances of these columns:

Primary keys
Signup dates
Email addresses
Order amounts
…and any other column that if it was null or contained a missing value, would lead to incorrect metric calculations.

Similar to the unique test, a not-null test in dbt may be defined using just one line of YAML:

models:

– name: orders

columns:

– name: id

tests:

– unique

– not_null

As with any main key, the id field is currently verified for uniqueness and non-nullness. This testing framework is especially helpful in light of the fact that several contemporary data warehouses acknowledge primary keys but do not need them.

Not all column values must be non-null; there should be null values in your tables, Thus it’s crucial to perform this test carefully to prevent a flood of false-positive results.

Accepted values:

Data teams are the recipients of backend databases, a variety of cloud data sources, and on rare occasions, Google Sheets; they have little influence over the quality of their source data. However, we can influence how well our modified data is produced.

Data teams will clean, combine, and aggregate the raw data to create relevant business entities. It’s crucial to ensure that column values fit your expectations when dimensions and measures are updated and established during the transformation process.

For instance, if you run a US-based online store, you would anticipate that your data would be like the following:

The only true order statuses that matter are placed, shipped, delivered, and returned.
Only USD can be used in orders’ currencies.

How would you verify the accuracy of these two claims in your orders table?

You might create some quite straightforward queries in a more ad hoc manner.

select

order_id,

order_status

from orders

where order_status not in (‘placed’, ‘shipped’, ‘delivered’, ‘ returned’)

select

order_id,

currency

from orders

where currency != ‘USD’

But if your data changes and increases over time, it makes little sense to keep writing these ad hoc queries repeatedly.

You may quickly specify expected column values in a version-controlled YAML file using the dbt tests for accepted_values, not_accepted_values, and accepted_range, and have them report errors when your data deviates from them.

A few lines of Csv would simply be used in place of the ad hoc queries from above:

– name: orders

columns:

– name: order_status

tests:

– accepted_values:

values: [‘placed’, ‘shipped’, ‘delivered’, ‘returned’]

– name: currency

tests:

– accepted_values:

values: [‘usd’]

Relationships and referential integrity:

Your data will probably change format as you turn your raw data into useful models, adding new dimensions, metrics, and joins. Testing for connections in your data models is crucial since each stage of the transformation process leaves the possibility for the data to vary from your expectations.

Data transformations should be modular, referencing one another to eventually construct a DAG—a lineage of your data—on which dbt is built. Every time you build a new data transformation—what we call a model—in dbt, you may perform a relationship test that will determine if a particular column in your new model has a valid mapping to a given upstream model.

This test is especially helpful if you’re combining different models, maybe adding new rows, and altering the link between upstream and downstream models.

Freshness and Recency:

Let’s all agree that data is only helpful to business users when it is current and end users are aware that it is current. Analyses, insights, and ultimately choices are delayed by stale data, which is data that has been delayed in getting to your data warehouse or BI tool.

If a stakeholder has asked you, “This dashboard hasn’t refreshed in over a day—what’s going on here?” you know how frustrating it can be. You are aware that it is difficult to keep track of your data’s freshness and, more importantly, the reliability of your data pipelines. Your data team is always chosen to chase in a never-ending game of duck, duck, go because there are no methods to detect whether data is new.

Source freshness and recency tests, both of which can assist you in more automatically determining the freshness of your data, are supported by dbt:

Source freshness: By using the source freshness features of dbt, you can take a snapshot of the “freshness” of your source data and get alerts and error messages when it doesn’t meet your standards. You may instantly receive notifications when something goes wrong and visually monitor the state of your data pipelines inside the dbt Cloud UI.
Recency tests: To ensure that data is arriving at the predetermined intervals that your team selects, you can also construct recency tests using dbt against certain timestamp fields. This is especially helpful for event-driven, large-volume data sets like site and email analytics.

The usage of dashboard status tiles, which show the quality and freshness of data shown in BI dashboards, is also encouraged by dbt. These tiles enable business users to instantly determine if the data they are viewing is current and correct, all from the comfort of their own workspace.

Start with Simplicity and Work Towards Complexity:

Even though they appear straightforward, these five fundamental data checks will get you far. When primary keys are truly unique and data quality issues are promptly identified (before being found by a business stakeholder), it’s remarkable what may occur.

Although not the be-all and end-all of your data quality checks, the tests we’ve described in this piece are a wonderful beginning to start laying the groundwork for your data quality. Your data quality checks will be constructed for volume and complexity with the help of a tool like dbt, which supports these generic tests, custom tests, and testing-specific packages.

No Comments

TAGS : data quality

Software Testing

Top 5 Essential Data Quality Checks.

07 Sep

Uniqueness:

Non-nullness:

Accepted values:

Relationships and referential integrity:

Freshness and Recency:

Start with Simplicity and Work Towards Complexity:

Preeti Singh

Related Post

08 Nov

Real-Time Testing with LambdaTest

26 Feb

User Acceptance Testing: Tips for a Smoo

06 Sep

Test Data in Software Testing

Leave a Comment Cancel reply

Recent Posts

Categories

Testing Expertise

Domain Expertise

Contact