Authored by – Honey Pokar
Introductions
In the realm of data management and analytics, Snowflake stands out as a comprehensive cloud-based platform that offers a wide range of functionalities to handle and analyze large volumes of data efficiently. One of the key features that Snowflake has recently added is : its suite of Data Metric Functions (DMFs), which are essential for maintaining and ensuring data quality within the ecosystem.
Types of DMFs
DMFs in Snowflake are built-in system functions designed to measure various aspects of data quality for tables and views. Each function serves a specific purpose, from identifying duplicate records to assessing the freshness of the data, thereby enabling organizations to monitor and improve the integrity of their data.
The built-in system DMFs in Snowflake include:
- DATA_METRIC_SCHEDULED_TIME: This function helps determine the scheduled time for the next execution of a data metric.
- DUPLICATE_COUNT: As the name suggests, this function counts the number of duplicate entries within a dataset.
- FRESHNESS: This metric measures the newness of the data, indicating how recently it has been updated or added.
- NULL_COUNT: This function tallies the number of null values present in a dataset, which is crucial for assessing data completeness.
- ROW_COUNT: It provides the total number of rows in a table or view, giving a quick overview of the dataset size.
- UNIQUE_COUNT: This function counts the number of unique entries, which is useful for understanding the diversity of data.
Moreover, Snowflake allows the creation of user-defined DMFs using the ‘CREATE DATA METRIC FUNCTION’ command, giving users the flexibility to tailor data quality measurements to their specific needs. This feature is particularly useful for organizations with unique data quality requirements not covered by the system DMFs.
DMFs are not limited to data quality measurement alone; they also play a significant role in data governance. By leveraging DMFs, organizations can set up alerts and automate responses to changes in data quality metrics, ensuring that any issues are promptly addressed.
Utilizing Data Metric Functions in Snowflake: A Practical Example
Imagine a retail company that uses Snowflake to manage its customer data. You want to ensure that the NAME column contains no NULL values, which is critical information for customer records.
Here’s how you might use the ‘NULL_COUNT’ DMF to check for NULL values in the NAME column:
SELECT SNOWFLAKE.CORE.NULL_COUNT(
SELECT name
FROM customer.tables.info
);
Interpreting the Results of a DMF Query in Snowflake
Interpreting the results when working with Data Metric Functions (DMFs) in Snowflake is crucial for understanding the health and quality of your data.
For example, suppose you’ve used the ‘NULL_COUNT’ DMF to identify the number of NULL values in a particular column. In that case, the result is a straightforward numerical value indicating the count of NULL entries. A higher number suggests a potential issue with data completeness that may require further investigation or remediation efforts.
Similarly, the ‘DUPLICATE_COUNT’ DMF returns the number of duplicate records within a dataset.
After running a DMF query, you can also query the event table to view more detailed results, which include additional metrics and timestamps related to the data quality measurement. This can help you track changes over time and assess the impact of any data quality initiatives you have implemented.
Benefits
- Custom Quality Metrics: DMFs allow you to create custom quality metric rules as reusable functions. You can define specific metrics that matter to your data quality assessment.
- Application to Columns: You can apply these rules to one or more columns in your tables or views. This flexibility enables targeted quality assessments based on specific data attributes.
- Frequency of Calculation: Specify how frequently you want the quality metrics to be calculated. Whether it’s daily, weekly, or on a custom schedule, you can tailor the frequency to your needs.
- Ad-Hoc Execution and Pipelines: Execute metric functions ad-hoc for testing purposes or incorporate them into your data pipelines. This adaptability ensures that you can assess data quality at different stages of your workflow.
- Serverless Compute Resources: When you call a DMF, Snowflake uses serverless compute resources. You don’t need to keep your warehouse running continuously, which helps manage costs and resource allocation.
- Centralized Management: View and manage all your quality metrics in a single place. This centralized approach simplifies monitoring and reporting on data quality across your organization.
Limitations
- DMFs can only be applied to regular tables within Snowflake. Other types of tables, such as dynamic or external tables, are not yet supported.
- DMFs cannot be set on object tags.
- The target object does not inherit DMF assignments when using the CLONE operation and the CREATE TABLE … LIKE operation.
- Creating a dynamic table or a stream based on the data in the event table is not supported.
- DMFs cannot be set on objects in Reader accounts.
- Setting a DMF on a shared table or view is not supported.