Understand and utilize RawData in ValidMind tests
Test functions in ValidMind can return a special object called RawData, which holds intermediate or unprocessed data produced somewhere in the test logic but not returned as part of the test’s visible output, such as in tables or figures.
- The RawData feature allows you to customize the output of tests, making it a powerful tool for creating custom tests and post-processing functions. RawData is useful when running post-processing functions with tests to recompute tabular outputs, redraw figures, or even create new outputs entirely.
In this notebook, you’ll learn how to access, inspect, and utilize RawData from ValidMind tests.
Setup
Before we can run our examples, we’ll need to set the stage to enable running tests with the ValidMind Library. Since the focus of this notebook is on the RawData object, this section only summarizes the steps rather than covering them in detail.
To learn more about running tests with ValidMind: Run tests and test suites
Installation and initialization
First, let’s make sure that the ValidMind Library is installed and ready to go, and that our Python environment is set up for data analysis:
# Install the ValidMind Library
%pip install -q validmind

# Initialize the ValidMind Library
import validmind as vm

# Import the `xgboost` library with an alias
import xgboost as xgb
Load the sample dataset
Then, we’ll import a sample ValidMind dataset and preprocess it:
# Import the `customer_churn` sample dataset
from validmind.datasets.classification import customer_churn

raw_df = customer_churn.load_data()

# Preprocess the raw dataset
train_df, validation_df, test_df = customer_churn.preprocess(raw_df)

# Separate features and targets
x_train = train_df.drop(customer_churn.target_column, axis=1)
y_train = train_df[customer_churn.target_column]
x_val = validation_df.drop(customer_churn.target_column, axis=1)
y_val = validation_df[customer_churn.target_column]

# Create an `XGBClassifier` object
model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)

# Train the model using the validation set
model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)
Initialize the ValidMind objects
Before you can run tests, you’ll need to initialize a ValidMind dataset object, as well as a ValidMind model object that can be passed to other functions for analysis and tests on the data:
# Initialize the dataset object
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column=customer_churn.target_column,
    class_labels=customer_churn.class_labels,
    __log=False,
)

# Initialize the datasets into their own dataset objects
vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=customer_churn.target_column,
    __log=False,
)
vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset",
    target_column=customer_churn.target_column,
    __log=False,
)

# Initialize a model object
vm_model = vm.init_model(
    model,
    input_id="model",
    __log=False,
)

# Assign predictions to the datasets
vm_train_ds.assign_predictions(
    model=vm_model,
)
vm_test_ds.assign_predictions(
    model=vm_model,
)
RawData usage examples
Once you’re set up to run tests, you can then try out the following examples:
- Using RawData from the ROC Curve Test
- Pearson Correlation Matrix
- Precision-Recall Curve
- Using RawData in custom tests
Using RawData from the ROC Curve Test
In this introductory example, we run the ROC Curve test, inspect its RawData output, and then create a custom ROC curve using the raw data values.
First, let’s run the default ROC Curve test for comparison with later iterations:
from validmind.tests import run_test

# Run the ROC Curve test normally
result_roc = run_test(
    "validmind.model_validation.sklearn.ROCCurve",
    inputs={"dataset": vm_test_ds, "model": vm_model},
    generate_description=False,
)
Now let’s assume we want to create a custom version of the above figure. First, let’s inspect the raw data that this test produces so we can see what we have to work with.
RawData objects have an inspect() method that pretty-prints the object’s attributes so you can quickly see the data and its types:
# Inspect the RawData output from the ROC test
print("RawData from ROC Curve Test:")
result_roc.raw_data.inspect()
As we can see, the ROC Curve returns a RawData object with the following attributes:
- fpr: A list of false positive rates
- tpr: A list of true positive rates
- auc: The area under the curve
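As an aside, for intuition about what these three attributes contain, the sketch below computes ROC points and a trapezoidal AUC from labels and scores in plain Python. It is not ValidMind’s implementation, and the library’s own computation may differ in details such as which thresholds it keeps. The names `roc_points` and `trapezoid_auc` and the sample data are hypothetical, chosen only for illustration.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists by sweeping a threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Visit samples from highest to lowest score
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once every sample sharing this score is counted
        is_last = rank == len(order) - 1
        if is_last or y_score[order[rank + 1]] != y_score[i]:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
    return area


# Made-up labels and scores, purely for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, y_score)
auc = trapezoid_auc(fpr, tpr)
```

The test’s RawData object exposes the same three kinds of quantities, already computed for the model and dataset passed to the test.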
This should be enough to create our own custom ROC curve via a post-processing function without having to create a whole new test from scratch and without having to recompute any of the data:
import matplotlib.pyplot as plt

from validmind.vm_models.result import TestResult


def custom_roc_curve(result: TestResult):
    # Extract raw data from the test result
    fpr = result.raw_data.fpr
    tpr = result.raw_data.tpr
    auc = result.raw_data.auc

    # Create a custom ROC curve plot
    fig = plt.figure()
    plt.plot(fpr, tpr, label=f"Custom ROC (AUC = {auc:.2f})", color="blue")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Random Guess")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Custom ROC Curve from RawData")
    plt.legend()

    # close the plot to avoid it automatically being shown in the notebook
    plt.close()

    # remove existing figure
    result.remove_figure(0)

    # add new figure
    result.add_figure(fig)

    return result


# test it on the existing result
modified_result = custom_roc_curve(result_roc)

# show the modified result
modified_result.show()
Now that we have created a post-processing function and verified that it works on our existing test result, we can use it directly in run_test() from now on:
result = run_test(
    "validmind.model_validation.sklearn.ROCCurve",
    inputs={"dataset": vm_test_ds, "model": vm_model},
    post_process_fn=custom_roc_curve,
    generate_description=False,
)
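The pattern used here, where a test stores its intermediate values and a callback rebuilds the visible output from them, can be sketched without ValidMind at all. Everything in this block (`StubRawData`, `StubResult`, `stub_test`, `run_stub_test`, `add_summary`) is a hypothetical stand-in written for illustration, not part of the ValidMind API, and the real run_test() does considerably more.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Callable, Optional


class StubRawData(SimpleNamespace):
    """Hypothetical stand-in: keyword arguments become attributes."""


@dataclass
class StubResult:
    """Hypothetical stand-in for a test result with outputs and raw data."""
    raw_data: StubRawData
    outputs: list = field(default_factory=list)


def stub_test() -> StubResult:
    # Pretend these values were expensive to compute
    raw = StubRawData(fpr=[0.0, 0.5, 1.0], tpr=[0.0, 0.75, 1.0])
    return StubResult(raw_data=raw, outputs=["default figure"])


def run_stub_test(
    test_fn: Callable[[], StubResult],
    post_process_fn: Optional[Callable[[StubResult], StubResult]] = None,
) -> StubResult:
    """Run the test, then hand its result to the optional callback."""
    result = test_fn()
    if post_process_fn is not None:
        result = post_process_fn(result)
    return result


def add_summary(result: StubResult) -> StubResult:
    # Replace the default output using only the stored raw values
    best_tpr = max(result.raw_data.tpr)
    result.outputs = [f"custom summary: max TPR = {best_tpr:.2f}"]
    return result


default_result = run_stub_test(stub_test)
custom_result = run_stub_test(stub_test, post_process_fn=add_summary)
```

Because the callback reads only `raw_data`, it can also be applied directly to a result that already exists, which is how `custom_roc_curve(result_roc)` was checked earlier.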
Pearson Correlation Matrix
In this next example, try commenting out the post_process_fn argument in the following cell and see what happens between different runs:
import plotly.graph_objects as go


def custom_heatmap(result: TestResult):
    corr_matrix = result.raw_data.correlation_matrix

    heatmap = go.Heatmap(
        z=corr_matrix.values,
        x=list(corr_matrix.columns),
        y=list(corr_matrix.index),
        colorscale="Viridis",
    )
    fig = go.Figure(data=[heatmap])
    fig.update_layout(title="Custom Heatmap from RawData")

    plt.close()

    result.remove_figure(0)
    result.add_figure(fig)

    return result


result_corr = run_test(
    "validmind.data_validation.PearsonCorrelationMatrix",
    inputs={"dataset": vm_test_ds},
    generate_description=False,
    # COMMENT OUT `post_process_fn`
    post_process_fn=custom_heatmap,
)
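As a reminder of what a correlation matrix holds, here is a small plain-Python sketch of the Pearson coefficient applied pairwise to named columns. The helper names and the sample columns are made up for illustration. The ValidMind test computes its matrix from the dataset itself, and the `correlation_matrix` attribute used in the heatmap function is accessed like a pandas DataFrame rather than the nested dict shown here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # A constant column would make this denominator zero
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up columns, purely for illustration
columns = {
    "age": [1, 2, 3, 4, 5],
    "balance": [2, 4, 6, 8, 10],  # rises in step with age
    "tenure": [5, 4, 3, 2, 1],    # falls in step with age
}
matrix = correlation_matrix(columns)
```

The row and column labels of such a matrix are what the heatmap function passes as `y` and `x`, with the coefficients themselves as `z`.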
Precision-Recall Curve
Then, let’s try the same thing with the Precision-Recall Curve test:
def custom_pr_curve(result: TestResult):
    precision = result.raw_data.precision
    recall = result.raw_data.recall

    fig = plt.figure()
    plt.plot(recall, precision, label="Precision-Recall Curve")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Custom Precision-Recall Curve from RawData")
    plt.legend()

    plt.close()

    result.remove_figure(0)
    result.add_figure(fig)

    return result


result_pr = run_test(
    "validmind.model_validation.sklearn.PrecisionRecallCurve",
    inputs={"dataset": vm_test_ds, "model": vm_model},
    generate_description=False,
    # COMMENT OUT `post_process_fn`
    post_process_fn=custom_pr_curve,
)
Using RawData in custom tests
These examples demonstrate some very simple ways to use the RawData feature of ValidMind tests. The majority of ValidMind-developed tests return some form of raw data that can be used to customize the output of the test, but you can also create your own tests that return RawData objects and use them in the same way.
Let’s take a look at how this can be done in custom tests. To start, define and run your custom test:
import pandas as pd

from validmind import test, RawData
from validmind.vm_models import VMDataset, VMModel


@test("custom.MyCustomTest")
def MyCustomTest(dataset: VMDataset, model: VMModel) -> tuple[go.Figure, RawData]:
    """Custom test that produces a figure and a RawData object"""
    # pretend we are using the dataset and model to compute some data
    # ...

    # create some fake data that will be used to generate a figure
    data = pd.DataFrame({"x": [10, 20, 30, 40, 50], "y": [10, 20, 30, 40, 50]})

    # create the figure (scatter plot)
    fig = go.Figure(data=go.Scatter(x=data["x"], y=data["y"]))

    # now let's create a RawData object that holds the "computed" data
    raw_data = RawData(scatter_data_df=data)

    # finally, return both the figure and the raw data
    return fig, raw_data


my_result = run_test(
    "custom.MyCustomTest",
    inputs={"dataset": vm_test_ds, "model": vm_model},
    generate_description=False,
)
We can see that the test result shows the figure. But since we returned a RawData object, we can also inspect the contents and see how we could use it to customize or regenerate the figure in the post-processing function:
my_result.raw_data.inspect()
We can see that we get a nicely-formatted preview of the dataframe we stored in the raw data object. Let’s go ahead and use it to re-plot our data:
def custom_plot(result: TestResult):
    data = result.raw_data.scatter_data_df

    # use something other than a scatter plot
    fig = go.Figure(data=go.Bar(x=data["x"], y=data["y"]))
    fig.update_layout(title="Custom Bar Chart from RawData")
    fig.update_xaxes(title="X Axis")
    fig.update_yaxes(title="Y Axis")

    result.remove_figure(0)
    result.add_figure(fig)

    return result


result = run_test(
    "custom.MyCustomTest",
    inputs={"dataset": vm_test_ds, "model": vm_model},
    post_process_fn=custom_plot,
    generate_description=False,
)