We’ve discussed the fundamental aspects of data quality, how to approach them intentionally, and how to fold them into a framework driven by your company’s needs. This page covers the fundamental data quality checks you should run on your data and how dbt can make these checks easier and more automated.
We’ll discuss uniqueness, non-nullness, accepted values, referential integrity, and freshness, and how these intertwine in dbt to form a data quality foundation.
Nothing sends a chill down a data professional’s spine quite like discovering a duplicate primary key in a table. You can keep your data, your business users, and your spine happy by using a uniqueness test to ensure that every value in a given column is unique.
In your data warehouse tables, you’ll most often check the uniqueness of primary and surrogate keys. If you were checking a table manually in SQL, you’d likely write a little query similar to this one:
select
id
from your_table
group by id
having count(*) > 1
This query returns every id that appears more than once.
As enjoyable as it is to rewrite and run this query whenever your tables change, there is a much quicker and simpler way to find duplicates with dbt: the unique test. You can add a unique test to any column in your dbt models:
models:
  - name: orders
    columns:
      - name: id
        tests:
          - unique
In the example above, the id column will be checked for uniqueness every time you run your tests against the orders table in dbt. Behind the scenes, dbt generates and runs a query against your data warehouse that is remarkably similar to the SQL shown above.
The days of having count(*) > 1 scattered across your SQL scripts are over, as is the far worse situation of failing to notice a duplicate entering your data. Instead, one-line, code-based, version-controlled tests that alert you to errors are the norm.
Non-null tests verify that values are always present in columns that should never be empty, such as primary keys and other required fields.
Like the unique test, a not-null test in dbt can be defined with a single line of YAML:
models:
  - name: orders
    columns:
      - name: id
        tests:
          - unique
          - not_null
The id column is now checked for both uniqueness and non-nullness, as any primary key should be. This testing framework is especially helpful because many modern data warehouses recognize primary key declarations but do not enforce them.
Not every column must be non-null; some columns legitimately contain null values, so apply this test deliberately to avoid a flood of false-positive results.
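As a sketch of what deliberate application can look like (the column names here are hypothetical), you might test only the columns that must always be populated and soften the test where occasional gaps are expected:
models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - not_null            # must always be present
      - name: coupon_code       # nullable by design, so no not_null test
      - name: shipped_at
        tests:
          - not_null:
              config:
                severity: warn  # flag gaps without failing the run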
Data teams are on the receiving end of backend databases, a variety of cloud data sources, and the occasional Google Sheet; they have little influence over the quality of their source data. They can, however, control how well their transformed data is produced.
Data teams clean, join, and aggregate raw data to create meaningful business entities. As dimensions and measures are created and updated during the transformation process, it’s crucial to ensure that column values match your expectations.
For instance, if you run a US-based online store, you would expect every order_status to be one of a fixed set of values (placed, shipped, delivered, or returned) and every order to be charged in USD.
How would you verify these two expectations against your orders table? In a more ad hoc workflow, you might write some fairly straightforward queries.
select
order_id,
order_status
from orders
where order_status not in ('placed', 'shipped', 'delivered', 'returned')

select
order_id,
currency
from orders
where currency != 'USD'
But as your data changes and grows over time, it makes little sense to keep rewriting these ad hoc queries.
With the accepted_values test built into dbt, and the not_accepted_values and accepted_range tests from the dbt_utils package, you can specify expected column values in a version-controlled YAML file and have dbt report errors when your data deviates from them.
The ad hoc queries above can be replaced with a few lines of YAML:
models:
  - name: orders
    columns:
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'returned']
      - name: currency
        tests:
          - accepted_values:
              values: ['USD']
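The not_accepted_values and accepted_range tests from dbt_utils follow the same pattern. A sketch, assuming a hypothetical order_total column and some hypothetical internal-only statuses:
models:
  - name: orders
    columns:
      - name: order_total
        tests:
          - dbt_utils.accepted_range:
              min_value: 0                 # no negative order totals
              inclusive: true
      - name: order_status
        tests:
          - dbt_utils.not_accepted_values:
              values: ['test', 'internal'] # placeholder statuses that should never ship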
Your data will likely change shape as you turn raw data into useful models, adding new dimensions, metrics, and joins. Testing the relationships in your data models is crucial, since each stage of the transformation process is an opportunity for the data to drift from your expectations.
Data transformations should be modular, referencing one another to eventually form a DAG, the lineage of your data on which dbt is built. Every time you build a new data transformation (what dbt calls a model), you can add a relationships test that checks whether a given column in your new model maps to a valid value in a given upstream model.
This test is especially helpful if you’re joining different models, potentially adding new rows and altering the links between upstream and downstream models.
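A minimal sketch of a relationships test, assuming a hypothetical customer_id column on orders that should always point to an id in an upstream customers model:
models:
  - name: orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')   # upstream model that must contain the value
              field: id              # column in customers to match against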
Let’s agree that data is only helpful to business users when it is current, and when end users know it is current. Stale data, data that has been delayed on its way to your data warehouse or BI tool, delays analyses, insights, and ultimately decisions.
If a stakeholder has ever asked you, “This dashboard hasn’t refreshed in over a day. What’s going on here?”, you know how frustrating that can be. Keeping track of your data’s freshness, and more importantly the reliability of your data pipelines, is hard. Without a way to detect whether data is fresh, your data team ends up in a never-ending game of duck, duck, goose, and it’s always the one chosen to chase.
dbt supports source freshness checks and recency tests, both of which help you determine the freshness of your data more automatically.
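A sketch of both, with hypothetical source, table, and column names: a freshness configuration on a source (checked with dbt source freshness) and a recency test from the dbt_utils package on a model:
sources:
  - name: shop_backend                 # hypothetical raw source
    loaded_at_field: _etl_loaded_at    # timestamp used to measure freshness
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders

models:
  - name: orders
    tests:
      - dbt_utils.recency:             # fail if the newest row is too old
          datepart: day
          field: created_at
          interval: 1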
dbt also encourages the use of dashboard status tiles, which display the quality and freshness of the data shown in BI dashboards. These tiles let business users determine at a glance whether the data they are viewing is current and correct, right from their own workspace.
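These tiles are typically backed by an exposure, dbt’s way of registering a downstream dashboard in the DAG. A minimal sketch with a hypothetical dashboard and owner:
exposures:
  - name: orders_dashboard             # hypothetical dashboard exposure
    type: dashboard
    owner:
      name: Analytics Team
      email: analytics@example.com
    depends_on:
      - ref('orders')                  # models the dashboard is built on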
Though they look straightforward, these five fundamental data checks will take you far. It’s remarkable what becomes possible when primary keys are truly unique and data quality issues are identified promptly, before a business stakeholder finds them.
The tests described in this piece are not the be-all and end-all of data quality checks, but they are a great place to start laying the groundwork. With a tool like dbt, which supports these generic tests alongside custom tests and testing-focused packages, your data quality checks can scale in volume and complexity.
Preeti works as a QA expert at QACraft, covering Salesforce and other commerce and e-learning platforms. She is a computer science engineer by degree and has 3 years of experience in both manual and automation testing. In her free time, she loves to sing.