Validate data freshness with GX
Data freshness refers to how up-to-date, or current, data is relative to its source system or the real-world events it represents. Freshness is about the recency of data relative to its generation; its primary metric is the time since data was generated or updated.
Data informs many time-sensitive tasks and decision-making processes in today's business environment. Stale data can lead to significant consequences in applications across industries. For example, in e-commerce, outdated inventory data can result in oversold products and frustrated customers. In financial trading, stale market data can lead to missed opportunities and potential compliance issues. These scenarios underscore the critical importance of data freshness validation to ensure accurate insights and timely decision-making.
Great Expectations (GX) can be used to monitor and validate the freshness of your organization's data. Through use of the Expectations and features offered by GX Core, you can establish checks that ensure your datasets maintain the expected levels of freshness, catch staleness issues early, and prevent downstream issues in your data workflows. This article will discuss how to effectively manage and validate data freshness using GX, ensuring your organization has high-quality, up-to-date datasets for accurate insights and timely actions.
Prerequisite knowledge
This article assumes basic familiarity with GX components and workflows. If you're new to GX, start with the GX Core overview to familiarize yourself with key concepts and setup procedures.
Data preview
The examples presented in this article use a synthetic sensor reading dataset; a sample of this dataset is shown below. The data is available in a public Postgres database table, as described in the examples, and is also available in CSV format.
| reading_id | sensor_id | temperature_k | reading_ts | created_at |
|---|---|---|---|---|
| 1001 | factory-1 | 305.00 | 2024-11-22 13:38:07 | 2024-11-22 13:47:00 |
| 1002 | factory-2 | 310.00 | 2024-11-22 13:38:54 | 2024-11-22 13:47:00 |
| 1003 | factory-3 | 308.00 | 2024-11-22 13:39:54 | 2024-11-22 13:47:00 |
| 1004 | factory-1 | 303.75 | 2024-11-22 13:40:10 | 2024-11-22 13:47:00 |
| 1005 | factory-2 | 311.00 | 2024-11-22 13:40:53 | 2024-11-22 13:47:00 |
This dataset is representative of a scenario in which environment readings are captured by remote IOT sensors and reported back through the cloud to a central database. Freshness needs might center on timely reporting of temperature in a factory, to monitor that rooms maintain required ranges for temperature-sensitive manufacturing processes. If fresh sensor data is not received, analysts and machine learning models are not able to determine if action needs to be taken on factory premises. Most notably:
reading_tscaptures the timestamp that the reading was taken by the sensorcreated_atis the creation timestamp of the row in the database, indicating when data became available for analysis
Freshness checks can be run against both the reading_ts and created_at columns to ensure that the sensors are reporting fresh data and that the infrastructure responsible for delivering the readings is functioning in an expected, timely manner.
Key freshness Expectations
GX Core provides several options to use Expectations to validate freshness:
- Built-in Expectations, such as those discussed in this section, can be used on timestamp columns to create freshness checks.
- Custom SQL Expectations can be used to define and check data freshness based on SQL logic.
- You can customize Expectation classes to create Expectations that use Python logic to define and validate data freshness.
Expect column maximum to be between
This Expectation can be used on a timestamp column to expect that the maximum, or the most recent, timestamp, in a column is within an expected range of time.
Example: Validate that the most recent timestamp occurred after a specified time.
gxe.ExpectColumnMaxToBeBetween(
column="reading_ts",
min_value="2024-11-22 14:42:00",
)
ExpectColumnMaxToBeBetween in the Expectation Gallery.
Expect column minimum to be between
Checking for data freshness may also center on validating how old the data is. This Expectation can be used on a timestamp column to validate the minimum, or oldest, timestamp in a column is within an expected range of time.
Example: Validate that the oldest data is no older than a certain time.
gxe.ExpectColumnMinToBeBetween(
column="reading_ts",
min_value="2024-11-22 00:00:00",
)
ExpectColumnMinToBeBetween in the Expectation Gallery.
Use ExpectColumnMaxToBeBetween to validate the most recent data point, ensuring your dataset is up-to-date. Use ExpectColumnMinToBeBetween to check the oldest data point, which is useful for verifying data retention policies or identifying unexpectedly old records. Using both in combination can provide a comprehensive freshness check.
Examples
Data freshness is often calculated relative to the current point in time. For instance, using the sample data shown above, a baseline freshness check might be to validate that data has arrived in the database within the last 5 minutes.
Data freshness might also be validated relative to multiple events. For instance, using the sample data above, another freshness check might be that sensor readings are available in the database no later than 10 minutes after they are captured by the sensor.
The examples in this section showcase how to use available features in GX Core to create and run Expectations that accommodate the dynamic nature of nature of freshness checks, including the use of now()-type functions and timestamp differences.
Create a freshness custom Expectation class using GX Core
Goal: Create a custom Expectation class in GX Core to validate data freshness relative to the current point in time.
Run the following GX Core workflow:
import datetime
import great_expectations as gx
import great_expectations.expectations as gxe
# Create Data Context.
context = gx.get_context()
# Connect to sample data and create Data Source, Data Asset, Batch Definition, and Batch.
CONNECTION_STRING = "postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_learn_data_quality"
data_source = context.data_sources.add_postgres(
"postgres database", connection_string=CONNECTION_STRING
)
data_asset = data_source.add_table_asset(
name="sensor readings", table_name="freshness_sensor_readings"
)
batch_definition = data_asset.add_batch_definition_whole_table("batch definition")
batch = batch_definition.get_batch()
# Define the custom Expectation class by subclassing the built-in ExpectColumnMaxToBeBetween Expectation.
class ExpectSensorDataToBeFresh(gxe.ExpectColumnMaxToBeBetween):
"""Custom Expectation class to validate the freshness of sensor readings in the database."""
column: str = "created_at"
min_value: datetime.datetime = datetime.datetime.now() - datetime.timedelta(
minutes=5
)
description: str = "New sensor readings should have arrived in the database within the last 5 minutes."
# Validate the sample data with the custom freshness Expectation.
validation_result = batch.validate(ExpectSensorDataToBeFresh())
print(f"Freshness check passed: {validation_result['success']}")
print(f"Most recent reading timestamp: {validation_result['result']['observed_value']}")
Result:
Freshness check passed: False
Most recent reading timestamp: 2024-11-22 14:49:00
The Expectation fails because the sample data represents readings from a past point in time, the sample data is not being continually refreshed in Postgres. However, were new sensor readings to be arriving in the database, the custom Expectation could be used to check that new readings had arrived within the desired time frame.
GX solution: Use custom, Python-based logic to define and validate data freshness.
Identifying and setting freshness thresholds
Effective data freshness validation requires an understanding of appropriate thresholds based on usage of the delivered data. Use the following steps to guide the process for identifying and codifying freshness thresholds.
-
Start with business requirements: Identify downstream data consumers, assess impact of stale data, and define acceptable freshness levels with stakeholders.
-
Evaluate data source characteristics: Determine expected update frequency, identify sources with inherent delays, and align thresholds with source capabilities.
-
Analyze pipeline complexity: Map data flow through pipeline, estimate processing time per stage, and account for total pipeline lag.
-
Incorporate temporal patterns: Identify cyclical patterns (daily, weekly, etc.), determine expected gaps or inactivity, and adjust for known patterns.
-
Ensure regulatory compliance: Research industry-specific freshness and timeliness regulations and verify that thresholds meet mandatory requirements.
For example, consider a case of validating freshness for sales transaction data generated by a hypothetical retail store. The business impact of stale data is high—it can lead to stockouts and lost revenue. The source systems (in-store point of sales terminals and an e-commerce platform) provide near real-time data. The data pipeline has moderate complexity with several transformation steps. There are clear daily and weekly sales cycles, with lower volumes sold overnight and on Sundays.
Balancing these factors, appropriate freshness thresholds for this retail sales data could be:
- During peak hours: The latest transaction timestamp occurred within the last 30 minutes.
- During off-hours: The latest transaction timestamp occurred within the last 2 hours.
The thresholds are aggressive enough to quickly detect stale data that could impact the business, while accommodating source system realities and known temporal patterns. Thresholds could be set using the GX ExpectColumnMaxToBeBetween Expectation on the transaction timestamp column, and scheduled to run on an hourly basis.
By systematically working through the above steps and considering a holistic set of factors, you can define optimal freshness thresholds for your specific data assets and use cases. GX enables codifying these thresholds through data freshness validation to consistently to drive high data quality.
Scenarios
To further understand the wide-ranging impact of data freshness validation across different industries, consider the following scenarios, and how GX can be used to monitor data freshness.
Social media analytics
Context: A marketing agency provides social media monitoring and analytics services to its clients. They collect real-time data from various social media platforms to track brand mentions, sentiment, and engagement metrics.
GX Solution: By applying freshness Expectations like ExpectColumnMaxToBeBetween and ExpectColumnMinToBeBetween to the timestamp columns in the social media data, the agency can verify that they are capturing and analyzing the most recent interactions and conversations. GX enables the agency to set appropriate freshness thresholds based on their clients' needs and the dynamic nature of social media.
Smart manufacturing and predictive maintenance
Context: A manufacturing company has implemented IoT sensors on its production lines to collect real-time data on machine performance, vibration levels, temperature, and other parameters. This data is used to monitor equipment health, predict maintenance needs, and prevent unplanned downtime.
GX Solution: GX allows the company to validate the freshness of the sensor data using Expectations like ExpectColumnMaxToBeBetween and ExpectColumnMinToBeBetween. By setting up these Expectations in their data pipelines, the company can ensure that the data accurately reflects the current state of the machines, enabling predictive maintenance models to identify potential issues early and schedule maintenance proactively.
Connected vehicles and real-time fleet management
Context: A fleet management company equips its vehicles with IoT devices that collect data on location, speed, fuel consumption, and driver behavior. This data is used to optimize routes, monitor vehicle performance, and ensure driver safety.
GX Solution: GX empowers the company to set up Expectations like ExpectColumnMinToBeBetween and ExpectColumnMaxToBeBetween on the timestamp columns in the vehicle data. By incorporating these Expectations into their data pipelines, the company can verify that the data from the connected vehicles is consistently fresh, enabling real-time fleet tracking and quick response to deviations from planned routes or aggressive driving patterns.
Avoid common freshness validation pitfalls
While data freshness is critical across various domains, teams often encounter common pitfalls when implementing freshness validation:
-
Not defining clear SLAs for data freshness: Teams should establish and document clear service level agreements (SLAs) for how fresh different datasets need to be. Without clear targets, freshness can degrade over time.
-
Not considering end-to-end pipeline freshness: Data can become stale at any stage, from ingestion through transformation. Account for total pipeline lag, not just source system freshness. Add validation checks after key processing steps.
-
Averaging timestamps can conceal issues: Calculating the average timestamp can make the overall data appear misleadingly fresh. Instead, use minimum and maximum values instead to bound the full range.
-
Sampling data inappropriately for freshness checks: When validating freshness, check the full dataset or use a statistically valid sampling method. Cherry-picking rows can mask staleness issues.
The path forward
By adhering to best practices and understanding common freshness validation pitfalls, you can ensure that your data pipelines consistently deliver fresh data to drive insights and timely decision-making. Implementing freshness validation using GX empowers you to catch staleness issues early, prevent downstream problems, and maintain the high standards of data quality.
As you progress on your data quality journey, remember that freshness is just one critical aspect of a comprehensive data quality strategy. By exploring our broader data quality series, you'll gain valuable insights into how other essential dimensions of data quality, such as integrity, completeness, distribution, and schema, can be seamlessly integrated into your workflows using GX.