Developing a Custom Validation
In certain situations, the default validations may not be sufficient to check the quality of your data, especially when dealing with complex cases. To handle these scenarios, you can develop your own custom validation that encapsulates the specific logic required for your data quality checks.
This guide will walk you through the process of developing a custom validation using the Validoopsie library.
1. Define the Validation Class
To create a custom validation, start by defining a new class that:
- Inherits from
BaseValidationParameters
. - Uses the
@base_validation_wrapper
decorator.
Here's how you can begin:
from validoopsie.base import BaseValidationParameters, base_validation_wrapper
from narwhals.typing import FrameT
@base_validation_wrapper
class MyCustomValidation(BaseValidationParameters):
pass
2. Add a Docstring
While not strictly required, adding a docstring is best practice, especially for team collaboration and future maintenance. The docstring should describe the purpose of the validation and explain the parameters.
Example:
"""Custom validation that filters temperature data based on grouped dates.
Args:
column (str): The column name used in the validation and as an identifier in results.
threshold (float, optional): The threshold for the validation. Defaults to 0.0.
impact (str, optional): The impact level of the validation. Defaults to "low".
kwargs (dict): Additional keyword arguments.
"""
3. Define the __init__
Method
The __init__
method initializes your validation class. It should include at
least the column
parameter, which serves as a secondary name for the
validation result. Remember to pass *args
and **kwargs
to the base class.
If your validation doesn't inherently require a column, you can assign a
default value within the __init__
method.
Example:
def __init__(self, column: str, *args, **kwargs) -> None:
super().__init__(column, *args, **kwargs)
4. Add a Fail Message
The fail message is used in the output report if the validation fails. Define it using a property method to provide a meaningful message that aids in diagnosing issues.
Example:
@property
def fail_message(self) -> str:
"""Return the fail message used in the report."""
return "Custom validation failed: Temperature readings are outside acceptable ranges."
5. Define the Validation Logic (__call__
Method)
The __call__
method contains the core logic of your validation. This method
should return only the records that do not meet the validation criteria.
Example:
def __call__(self, frame: FrameT) -> FrameT:
"""Execute the custom validation logic.
Args:
frame (FrameT): The data frame to validate.
Returns:
FrameT: A data frame containing records that failed the validation.
"""
return (
frame.group_by(self.column)
.agg(nw.col("temperature").mean().alias("mean_temperature_farenheit"))
.with_columns(
((nw.col("mean_temperature_farenheit") - 32) * 5 / 9).alias(
"mean_temperature_celsius",
),
)
.filter(
# Every tempearture above 60 degrees celsius is considered as an error
nw.col("mean_temperature_celsius") > 60,
# Every tempearture below -40 degrees celsius is considered as an error
nw.col("mean_temperature_celsius") < -40,
# Every mean tempearture below -10 and above 30 degrees celsius is considered as an error
nw.col("mean_temperature_celsius").is_between(-10, 30) == False,
)
.group_by(self.column)
.agg(nw.col(self.column).count().alias(f"{self.column}-count"))
)
Notes:
- Ensure that the logic correctly identifies and returns only the failing records.
6. Add the Validation to the Pipeline
To integrate your custom validation into the data validation pipeline, use the
add_validation
method of the Validate
class.
Example:
import pandas as pd
from validoopsie import Validate
# Sample data frame
df = pd.DataFrame({
"date": ['2025-01-01', '2025-02-01', '2025-03-01', '2025-04-01'],
"temperature": [50, 60, 70, 80],
})
# Initialize the Validate object
validator = Validate(df)
# Add your custom validation
validator.add_validation(MyCustomValidation(column="date"))
# Execute the validations and get the result
result = validator.result
print(result)
7. Example Output
After running the validation, you can expect an output similar to the following:
OUTPUT
{
"Summary": {
"passed": true,
"validations": [
"MyCustomValidation_date"
]
},
"MyCustomValidation_date": {
"validation": "MyCustomValidation",
"impact": "low",
"timestamp": "2025-01-10T17:54:14.035378+01:00",
"column": "date",
"result": {
"status": "Success",
"threshold pass": true,
"message": "All items passed the validation.",
"frame row number": 4,
"threshold": 0.0
}
}
}
With this guide, you should be able to create custom validations tailored to your specific data quality requirements. Remember to thoroughly test your custom validation to ensure it behaves as expected in all scenarios.