ValidMind for validation 4 — Finalize testing and reporting

Learn how to use ValidMind for your end-to-end validation process with our series of four introductory notebooks. In this last notebook, finalize the compliance assessment process and have a complete validation report ready for review.

This notebook will walk you through how to supplement ValidMind tests with your own custom tests and include them as additional evidence in your validation report. A custom test is any function that takes a set of inputs and parameters as arguments and returns one or more outputs:

The function can be as simple or as complex as you need it to be — it can use external libraries, make API calls, or do anything else that you can do in Python.
The only requirement is that the function signature and return values can be "understood" and handled by the ValidMind Library. As such, custom tests offer added flexibility by extending the default tests provided by ValidMind, enabling you to document any type of record (model) or use case.
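As a rough illustration of that shape, the sketch below is a plain Python function that takes one input (rows) and one parameter (max_ratio) and returns a table as a list of dictionaries. The function name, column names, and sample data are hypothetical, and the function is not registered with the ValidMind Library here; it only shows the inputs, parameters, and outputs pattern.

```python
def missing_value_ratio(rows, max_ratio=0.1):
    """Report the share of missing (None) values per column.

    `rows` plays the role of an input; `max_ratio` plays the role of a parameter.
    """
    table = []
    for column in rows[0].keys():
        missing = sum(1 for row in rows if row[column] is None)
        ratio = missing / len(rows)
        table.append(
            {
                "Column": column,
                "Missing ratio": ratio,
                "Pass/Fail": "Pass" if ratio <= max_ratio else "Fail",
            }
        )
    return table


# Made-up sample data, for illustration only
sample_rows = [
    {"Balance": 100.0, "Tenure": 3},
    {"Balance": None, "Tenure": 5},
    {"Balance": 250.0, "Tenure": 1},
    {"Balance": 80.0, "Tenure": 7},
]

for line in missing_value_ratio(sample_rows, max_ratio=0.2):
    print(line)
```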

For a more in-depth introduction to custom tests, refer to our Implement custom tests notebook.

Learn by doing

Our course tailor-made for validators new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform: Validator Fundamentals

Prerequisites

In order to finalize validation and reporting, you'll need to first have:

Registered a model within the ValidMind Platform and granted yourself access to the model as a validator
Installed the ValidMind Library in your local environment, allowing you to access all its features
Learned how to import and initialize datasets and models for use with ValidMind
Understood the basics of how to identify and run validation tests
Run validation tests for your champion and challengers, and logged the results of those tests to the ValidMind Platform
Inserted your logged test results into your validation report
Added some preliminary artifacts (findings) to your validation report

Need help with the above steps?

Refer to the first three notebooks in this series:

Setting up

This section should be very familiar to you by now, as we performed the same actions in the previous two notebooks in this series.

Initialize the ValidMind Library

As usual, let's first connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

On the left sidebar that appears for your model, select Getting Started and select Validation from the Document drop-down menu.
Click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="validation-report",
)

Note: you may need to restart the kernel to use updated packages.

2026-07-14 05:37:26,791 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model validation (ID: cmalguc9y02ok199q2db381ib)
📁 Document Type: validation_report

Import the sample dataset

Next, we'll load in the same sample Bank Customer Churn Prediction dataset used to develop the champion that we will independently preprocess:

# Load the sample dataset
from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()

Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

# Initialize the raw dataset for use in ValidMind tests
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
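In the cell above, only the final shuffle passes random_state, so the rows drawn for the majority class can differ between runs. The sketch below shows the same downsampling idea with the standard library and a single seeded random generator. The helper name and the toy rows are hypothetical, and it assumes the positive class is the minority, as it is in this churn dataset.

```python
import random


def downsample_to_balance(rows, target="Exited", seed=42):
    """Return a shuffled list with equal numbers of positive and negative rows."""
    rng = random.Random(seed)  # one seeded generator drives every random step
    positives = [row for row in rows if row[target] == 1]
    negatives = [row for row in rows if row[target] == 0]
    # Assumes the positive class is the minority class
    sampled_negatives = rng.sample(negatives, k=len(positives))
    balanced = positives + sampled_negatives
    rng.shuffle(balanced)
    return balanced


# Toy data: 20 positives and 80 negatives
rows = [{"id": i, "Exited": 1 if i % 5 == 0 else 0} for i in range(100)]
balanced = downsample_to_balance(rows)
counts = {label: sum(1 for row in balanced if row["Exited"] == label) for label in (0, 1)}
print(counts)
```

With pandas, the comparable change would be to pass random_state to each sample() call.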

Let’s also quickly remove highly correlated features from the dataset using the output from a ValidMind test:

# Register the balanced data; 'balanced_raw_dataset' is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

# Run HighPearsonCorrelation test with our balanced dataset as input and return a result object
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify potentially redundant or highly collinear variable pairs. The result table reports the top 10 feature pairs by Pearson correlation coefficient, along with Pass/Fail status against the configured absolute threshold of 0.3. Observed coefficients range from -0.1764 to 0.3261, and only one pair exceeds the threshold. The strongest reported relationship is between Age and Exited, while all other listed pairs remain within the threshold.

Key insights:

Single threshold breach identified: The pair (Age, Exited) has a Pearson correlation coefficient of 0.3261, which exceeds the configured threshold of 0.3 and is the only result marked Fail.
Remaining correlations are modest: All other reported feature pairs have absolute correlation values below 0.18, with the next largest magnitudes being -0.1764 for (IsActiveMember, Exited) and -0.1758 for (Balance, NumOfProducts).
Top relationships are concentrated near zero: Most reported coefficients are small in magnitude, including 0.1444 for (Balance, Exited) and values between approximately -0.06 and 0.05 for the remaining listed pairs.

The reported correlation structure is characterized by one pair above the specified threshold and a broad set of other pairwise relationships with relatively low magnitudes. The most pronounced observed linear relationship is between Age and Exited, while the remaining top-ranked correlations are modest and all pass the configured test criterion. Overall, the table indicates limited concentration of high pairwise linear dependence within the reported results.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Age, Exited)	0.3261	Fail
(IsActiveMember, Exited)	-0.1764	Pass
(Balance, NumOfProducts)	-0.1758	Pass
(Balance, Exited)	0.1444	Pass
(Age, NumOfProducts)	-0.0586	Pass
(NumOfProducts, Exited)	-0.0557	Pass
(HasCrCard, IsActiveMember)	-0.0472	Pass
(Age, Balance)	0.0466	Pass
(Age, IsActiveMember)	0.0448	Pass
(NumOfProducts, IsActiveMember)	0.0378	Pass
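To make the Pass/Fail column concrete, the sketch below computes a Pearson coefficient by hand and compares its absolute value with max_threshold, which matches the "absolute threshold" wording in the result description. It is not the ValidMind implementation; the helper names and the small age and exited lists are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def check_pair(xs, ys, max_threshold=0.3):
    """Return the rounded coefficient and a Pass/Fail status for one feature pair."""
    coefficient = pearson(xs, ys)
    status = "Fail" if abs(coefficient) > max_threshold else "Pass"
    return round(coefficient, 4), status


# Made-up values, not taken from the churn dataset
age = [25, 32, 47, 51, 62, 38]
exited = [0, 0, 1, 1, 1, 0]
print(check_pair(age, exited))
```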

# From result object, extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df

	Columns	Coefficient	Pass/Fail
0	(Age, Exited)	0.3261	Fail
1	(IsActiveMember, Exited)	-0.1764	Pass
2	(Balance, NumOfProducts)	-0.1758	Pass
3	(Balance, Exited)	0.1444	Pass
4	(Age, NumOfProducts)	-0.0586	Pass
5	(NumOfProducts, Exited)	-0.0557	Pass
6	(HasCrCard, IsActiveMember)	-0.0472	Pass
7	(Age, Balance)	0.0466	Pass
8	(Age, IsActiveMember)	0.0448	Pass
9	(NumOfProducts, IsActiveMember)	0.0378	Pass

# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features

['(Age, Exited)']

# Extract feature names from the list of strings
high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features

['Age']
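The list comprehension above keeps the first name in each pair, which works here because the target column Exited is listed second. A slightly more defensive sketch parses both names and never selects the target for removal. The helper names parse_pair and features_to_drop are hypothetical.

```python
def parse_pair(pair):
    """Split a string such as '(Age, Exited)' into its two column names."""
    first, second = (name.strip() for name in pair.strip("()").split(","))
    return first, second


def features_to_drop(failed_pairs, target="Exited"):
    """Pick one feature per failed pair, never the target column."""
    to_drop = []
    for pair in failed_pairs:
        first, second = parse_pair(pair)
        candidate = second if first == target else first
        if candidate not in to_drop:
            to_drop.append(candidate)
    return to_drop


print(features_to_drop(["(Age, Exited)"]))                    # ['Age']
print(features_to_drop(["(Exited, Age)", "(Age, Balance)"]))  # ['Age']
```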

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)

# Re-run the test with the reduced feature set
corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify potentially redundant variables or multicollinearity. The results table lists the top 10 strongest feature-pair correlations after removing duplicate and self-correlations, along with each pair’s Pearson coefficient and pass/fail status against the configured absolute threshold of 0.3. In this run, all reported correlations are relatively small in magnitude, ranging from -0.1764 to 0.1444, and every listed pair is marked as Pass.

Key insights:

No threshold breaches observed: All 10 reported feature pairs pass the test threshold of 0.3. None of the observed absolute correlation coefficients exceed the configured cutoff.
Largest relationship remains modest: The strongest reported correlation is between IsActiveMember and Exited at -0.1764. This is followed closely by Balance and NumOfProducts at -0.1758, both remaining well below the threshold.
Correlations are concentrated near zero: Reported coefficients span a narrow range from -0.1764 to 0.1444. Several of the listed relationships, including NumOfProducts and IsActiveMember (0.0378), CreditScore and Balance (0.0284), and Tenure and Exited (-0.0253), are very close to zero.
Both positive and negative relationships appear: The table includes negative correlations such as Balance with NumOfProducts (-0.1758) and positive correlations such as Balance with Exited (0.1444). The observed relationships are limited in magnitude in both directions.

The reported correlation structure shows no high linear dependency among the top-ranked feature pairs under the 0.3 threshold used in this test. The strongest observed associations are modest and all pairs remain within passing range. Overall, the results indicate a weak pairwise linear correlation pattern among the listed features.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(IsActiveMember, Exited)	-0.1764	Pass
(Balance, NumOfProducts)	-0.1758	Pass
(Balance, Exited)	0.1444	Pass
(NumOfProducts, Exited)	-0.0557	Pass
(HasCrCard, IsActiveMember)	-0.0472	Pass
(NumOfProducts, IsActiveMember)	0.0378	Pass
(CreditScore, IsActiveMember)	0.0331	Pass
(CreditScore, Balance)	0.0284	Pass
(Balance, HasCrCard)	-0.0269	Pass
(Tenure, Exited)	-0.0253	Pass

Split the preprocessed dataset

With our raw dataset rebalanced and its highly correlated features removed, let's now split our dataset into train and test sets in preparation for model evaluation testing:

# Encode categorical features in the dataset
balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()

	CreditScore	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited	Geography_Germany	Geography_Spain	Gender_Male
3107	643	9	150840.03	2	1	0	155516.35	0	False	True	False
2507	643	3	167949.48	1	1	0	143162.34	0	False	False	False
7252	711	6	0.00	2	1	1	72276.24	0	False	False	False
5773	697	7	0.00	1	1	0	129188.18	1	False	True	True
4678	732	9	136576.02	1	0	1	3268.17	1	False	False	True

from sklearn.model_selection import train_test_split

# Split the dataset into train and test
train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]

# Initialize the split datasets
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)
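The split above does not pass random_state, so the train and test rows can change between runs. The following sketch illustrates a seeded 80/20 split using only the standard library. The split_rows helper is hypothetical and is not how scikit-learn implements train_test_split; with scikit-learn, passing random_state to train_test_split serves the same purpose.

```python
import random


def split_rows(rows, test_size=0.20, seed=42):
    """Shuffle a copy of the rows with a fixed seed and split off a test share."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(50))
train_a, test_a = split_rows(rows)
train_b, test_b = split_rows(rows)

print(len(train_a), len(test_a))
print(test_a == test_b)  # the same seed gives the same partition
```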

Import the champion model

With our raw dataset assessed and preprocessed, let's go ahead and import the champion submitted by the development team in the format of a .pkl file: lr_model_champion.pkl

# Import the champion model
import pickle as pkl

with open("lr_model_champion.pkl", "rb") as f:
    log_reg = pkl.load(f)

/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/base.py:525: InconsistentVersionWarning: Trying to unpickle estimator LogisticRegression from version 1.3.2 when using version 1.9.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(

Train potential challenger model

We'll also train our random forest classification challenger to see how it compares:

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(X_train, y_train)

RandomForestClassifier(n_estimators=50, random_state=42)

Initialize the ValidMind models

In addition to the initialized datasets, you'll also need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data for each of our two models:

# Initialize the champion logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

# Assign predictions to Champion — Logistic regression model
vm_train_ds.assign_predictions(model=vm_log_model)
vm_test_ds.assign_predictions(model=vm_log_model)

# Assign predictions to Challenger — Random forest classification model
vm_train_ds.assign_predictions(model=vm_rf_model)
vm_test_ds.assign_predictions(model=vm_rf_model)

2026-07-14 05:37:36,651 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:37:36,653 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:37:36,653 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:37:36,657 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-07-14 05:37:36,659 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:37:36,661 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:37:36,662 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:37:36,663 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-07-14 05:37:36,666 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:37:36,689 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:37:36,691 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:37:36,714 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-07-14 05:37:36,718 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:37:36,732 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:37:36,734 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:37:36,747 - INFO(validmind.vm_models.dataset.utils): Done running predict()

Implementing custom tests

Thanks to the documentation (Learn more: ValidMind for development), we know that the development team implemented a custom test to further evaluate the performance of the champion.

In a typical validation scenario, you would load a saved custom test provided by the development team. In the following section, we'll have you implement the same custom test and make it available for reuse, to familiarize you with the process.

Want to learn more about custom tests?

Refer to our in-depth introduction to custom tests: Implement custom tests

Implement a custom inline test

Let's implement the same custom inline test that calculates the confusion matrix for a binary classification model that the development team used in their performance evaluations.

An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
You'll note that the custom test function is just a regular Python function that can include and require any Python library as you see fit.

Create a confusion matrix plot

Let's first create a confusion matrix plot using the confusion_matrix function from the sklearn.metrics module:

import matplotlib.pyplot as plt
from sklearn import metrics

# Get the predicted classes
y_pred = log_reg.predict(vm_test_ds.x)

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix, display_labels=[False, True]
)
cm_display.plot()

Next, create a @vm.test wrapper that will allow you to create a reusable test. Note the following changes in the code below:

The function confusion_matrix takes two arguments, dataset and model, which are a VMDataset object and a VMModel object respectively.
- VMDataset objects allow you to access the dataset's true (target) values through the .y attribute.
- VMDataset objects allow you to access the predictions for a given record (model) by calling the .y_pred() method.
The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.
The function body calculates the confusion matrix using the sklearn.metrics.confusion_matrix function, as we just did above.
The function then returns the ConfusionMatrixDisplay.figure_ object. This is important, as the ValidMind Library expects the output of the custom test to be a plot or a table.
The @vm.test decorator creates a wrapper around the function that allows it to be run by the ValidMind Library. It also registers the test so it can be found by the ID my_custom_tests.ConfusionMatrix.

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself
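The registration step performed by the decorator can be pictured with a small stand-alone sketch. This is not how the ValidMind Library is implemented; TEST_REGISTRY, register_test, run_registered_test, and the RowCount test are hypothetical names used only to show how a decorator can store a function under an ID and how a runner can later look it up and call it.

```python
TEST_REGISTRY = {}  # maps a test ID to the function that implements it


def register_test(test_id):
    """Decorator factory that stores the decorated function under `test_id`."""
    def decorator(func):
        TEST_REGISTRY[test_id] = func
        return func  # the function itself is returned unchanged
    return decorator


@register_test("my_custom_tests.RowCount")
def row_count(dataset):
    """Count the rows in a dataset."""
    return [{"Rows": len(dataset)}]


def run_registered_test(test_id, **inputs):
    """Look up a test by ID, call it, and bundle the docstring with the output."""
    func = TEST_REGISTRY[test_id]
    return {"test_id": test_id, "description": func.__doc__, "output": func(**inputs)}


result = run_registered_test("my_custom_tests.RowCount", dataset=[1, 2, 3])
print(result)
```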

You can now run the newly created custom test on both the training and test datasets for both models using the run_test() function:

# Champion train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_train_ds,vm_test_ds],
        "model" : [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates classification performance by comparing predicted labels against true labels for the training and test datasets. The results are presented as 2x2 matrices showing counts for true negatives, false positives, false negatives, and true positives. For the training dataset, the matrix contains 811 true negatives, 489 false positives, 475 false negatives, and 810 true positives. For the test dataset, the matrix contains 214 true negatives, 102 false positives, 132 false negatives, and 199 true positives.

Key insights:

Training predictions are closely balanced: On the training dataset, correct negative predictions (811) and correct positive predictions (810) are nearly identical. Misclassification counts are also similar, with 489 false positives and 475 false negatives.
Test set shows lower true positive count: On the test dataset, true positives total 199 compared with 214 true negatives. The model records 132 false negatives and 102 false positives on the same dataset.
False negatives exceed false positives in test data: In the test dataset, false negatives (132) are higher than false positives (102). This differs from the training dataset, where false positives (489) are slightly higher than false negatives (475).
Correct classifications exceed errors in both datasets: In the training dataset, correct classifications sum to 1,621 versus 964 errors. In the test dataset, correct classifications sum to 413 versus 234 errors.

The confusion matrices show that the model produces more correct than incorrect classifications in both the training and test datasets. Training results are highly symmetric across positive and negative classes, both for correct predictions and misclassifications. In the test dataset, correct predictions remain higher than errors, with a somewhat larger contribution from false negatives than false positives.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:champion:8cf7

ValidMind Figure my_custom_tests.ConfusionMatrix:champion:ffb7

2026-07-14 05:37:42,133 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:champion does not exist in model's document

# Challenger train and test
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_train_ds, vm_test_ds],
        "model": [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix test evaluates classification outcomes by comparing predicted labels against true labels and summarizing the counts of true positives, true negatives, false positives, and false negatives. The results are shown separately for train_dataset_final and test_dataset_final using 2x2 confusion matrices. In the training dataset, all observations fall on the diagonal with 1,300 true negatives and 1,285 true positives, and both error cells equal 0. In the test dataset, the matrix contains 236 true negatives, 216 true positives, 80 false positives, and 115 false negatives.

Key insights:

Perfect training classification: The training confusion matrix contains no off-diagonal counts, with 1,300 true negatives and 1,285 true positives, indicating zero observed misclassifications on the training sample.
Test errors in both classes: The test confusion matrix shows both false positives and false negatives, with 80 negative cases classified as positive and 115 positive cases classified as negative.
False negatives exceed false positives: Misclassification on the test sample is more concentrated in false negatives than false positives, with 115 false negatives versus 80 false positives.
Correct test predictions remain dominant: On the test dataset, diagonal counts remain larger than off-diagonal counts in both classes, with 236 true negatives versus 80 false positives and 216 true positives versus 115 false negatives.

The confusion matrices show a clear contrast between the training and test samples. Training performance is error-free in the observed sample, while test performance includes material misclassification in both directions. Within the test set, false negatives occur more frequently than false positives, although correct predictions still outnumber errors for both the negative and positive classes.

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:40b7

ValidMind Figure my_custom_tests.ConfusionMatrix:challenger:2536

2026-07-14 05:37:47,453 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:challenger does not exist in model's document

Note the log output indicating that a test-driven block doesn't currently exist in your document for these test IDs.

That's expected: when you run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform.
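
The counts reported in the test descriptions above can be turned into the summary metrics that the test's docstring mentions. The following is a minimal sketch in plain Python and is not part of the ValidMind Library: the helper name `summarize` is hypothetical, and the counts are copied from the test-dataset results for the champion and challenger.

```python
def summarize(tn, fp, fn, tp):
    """Derive common metrics from the four confusion matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Test-dataset counts reported in the results above
champion = summarize(tn=214, fp=102, fn=132, tp=199)
challenger = summarize(tn=236, fp=80, fn=115, tp=216)

for name, scores in [("champion", champion), ("challenger", challenger)]:
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

On the test dataset, these counts give the challenger a higher accuracy than the champion (about 0.70 versus 0.64).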

Add parameters to custom tests

Custom tests can take parameters just like any other function. To demonstrate, let's modify the confusion_matrix function to take an additional parameter normalize that will allow you to normalize the confusion matrix:

@vm.test("my_custom_tests.ConfusionMatrix")
def confusion_matrix(dataset, model, normalize=False):
    """The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.

    The confusion matrix is a 2x2 table that contains 4 values:

    - True Positive (TP): the number of correct positive predictions
    - True Negative (TN): the number of correct negative predictions
    - False Positive (FP): the number of incorrect positive predictions
    - False Negative (FN): the number of incorrect negative predictions

    The confusion matrix can be used to assess the overall performance of a classification model, since accuracy, precision, recall, and F1 score can all be derived from the counts shown in a single figure.
    """
    y_true = dataset.y
    y_pred = dataset.y_pred(model=model)

    if normalize:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred, normalize="all")
    else:
        confusion_matrix = metrics.confusion_matrix(y_true, y_pred)

    cm_display = metrics.ConfusionMatrixDisplay(
        confusion_matrix=confusion_matrix, display_labels=[False, True]
    )
    cm_display.plot()

    plt.close()  # close the plot to avoid displaying it

    return cm_display.figure_  # return the figure object itself
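
To make the effect of normalize=True concrete: with scikit-learn's normalize="all", every cell is divided by the total number of observations, so the four cells sum to 1. The sketch below reproduces that arithmetic in plain Python using the champion's test-dataset counts reported earlier; the variable names are illustrative only.

```python
# Champion counts on the test dataset, laid out as scikit-learn does:
# rows are true labels, columns are predicted labels
counts = [
    [214, 102],  # true negatives, false positives
    [132, 199],  # false negatives, true positives
]

total = sum(sum(row) for row in counts)

# normalize="all" divides every cell by the grand total
normalized = [[cell / total for cell in row] for row in counts]

for row in normalized:
    print([round(cell, 2) for cell in row])
```

Rounded to two decimals, the cells come out as 0.33, 0.16, 0.20, and 0.31.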

Pass parameters to custom tests

You can pass parameters to custom tests by providing a dictionary of parameters to the run_test() function.

These parameters override any defaults set in the custom test definition. Note that dataset and model are still passed as inputs, not as parameters.
Because they are VMDataset and VMModel inputs, they have a special meaning.
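
The override behavior mirrors ordinary Python keyword defaults. The sketch below is not the ValidMind Library's implementation: `resolve_params` is a hypothetical helper, and the stub function stands in for the custom test only to show how supplied values replace declared defaults.

```python
import inspect


def confusion_matrix(dataset, model, normalize=False):
    """Stub with the same signature as the custom test above."""


def resolve_params(func, params=None):
    """Merge a function's keyword defaults with caller-supplied params."""
    defaults = {
        name: parameter.default
        for name, parameter in inspect.signature(func).parameters.items()
        if parameter.default is not inspect.Parameter.empty
    }
    # Later keys win, so supplied params replace the defaults
    return {**defaults, **(params or {})}


print(resolve_params(confusion_matrix))
print(resolve_params(confusion_matrix, {"normalize": True}))
```

Because dataset and model declare no defaults, they never appear in the merged parameters, which is consistent with passing them separately as inputs.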

Re-running and logging the custom confusion matrix with normalize=True for both models and our testing dataset looks like this:

# Champion with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Champion

The Confusion Matrix test evaluates classification outcomes by comparing predicted labels against true labels, and this result presents the normalized confusion matrix for test_dataset_final using log_model_champion. The matrix is shown with true labels on the vertical axis and predicted labels on the horizontal axis, with normalized cell values summing to 1 across the full table. The four displayed proportions are 0.33 for true negatives, 0.16 for false positives, 0.20 for false negatives, and 0.31 for true positives.

Key insights:

Correct classifications dominate overall: The diagonal cells sum to 0.64, comprising 0.33 true negatives and 0.31 true positives, while the off-diagonal error cells sum to 0.36.
False positives are lower than false negatives: The false positive cell is 0.16, compared with 0.20 for false negatives, indicating slightly more missed positive cases than incorrect positive assignments.
Negative and positive correct predictions are similar: The two correct classification cells are close in magnitude, with 0.33 for true negatives and 0.31 for true positives, showing relatively balanced correct identification across both classes.
Class prediction errors are distributed across both classes: Misclassification is present in both directions rather than concentrated in a single error type, with both off-diagonal cells materially above zero.

The normalized confusion matrix shows that most observations fall into correct classifications, with 64% on the diagonal and 36% in misclassified cells. Correct predictions are distributed fairly evenly between the negative and positive classes, and the two error types are also relatively close, though false negatives exceed false positives by 0.04. Overall, the result indicates balanced classification behavior with nontrivial error present in both directions.

Parameters:

{
  "normalize": true
}

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_champion:4af0

2026-07-14 05:37:52,807 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_champion does not exist in model's document

# Challenger with test dataset and normalize=True
vm.tests.run_test(
    test_id="my_custom_tests.ConfusionMatrix:test_normalized_challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model]
    },
    params={"normalize": True}
).log()

Confusion Matrix Test Normalized Challenger

The ConfusionMatrix test evaluates classification outcomes by comparing predicted labels against true labels, and this result presents the normalized confusion matrix for the challenger model on the test dataset. The matrix shows the proportion of observations in each outcome category across true negatives, false positives, false negatives, and true positives. The displayed normalized values are 0.36 for true negatives, 0.12 for false positives, 0.18 for false negatives, and 0.33 for true positives.

Key insights:

Correct classifications dominate: The diagonal cells sum to 0.69, comprising 0.36 true negatives and 0.33 true positives, while the off-diagonal misclassifications sum to 0.30.
True negatives are the largest cell: The highest single proportion in the matrix is the true negative cell at 0.36, slightly above the true positive cell at 0.33.
False negatives exceed false positives: The false negative proportion is 0.18 compared with 0.12 for false positives, indicating more missed positive cases than incorrect positive predictions.
Positive and negative outcomes are both represented: Both diagonal cells are of similar magnitude, with 0.36 for correctly predicted negatives and 0.33 for correctly predicted positives, showing classification performance across both classes rather than concentration in only one class.

The normalized confusion matrix indicates that most observations fall into correct classification cells, with true negatives and true positives accounting for the largest shares of the matrix. Misclassifications are lower in aggregate, though errors are more concentrated in false negatives than false positives. Overall, the result shows balanced representation of correct predictions across both classes with a modest asymmetry in error type.

Parameters:

{
  "normalize": true
}

Figures

ValidMind Figure my_custom_tests.ConfusionMatrix:test_normalized_challenger:4a98

2026-07-14 05:37:58,280 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_custom_tests.ConfusionMatrix:test_normalized_challenger does not exist in model's document

Use external test providers

Sometimes you may want to reuse the same set of custom tests across multiple records (models) and share them with others in your organization, as the development team would have done with you in the example workflow featured in this series of notebooks. In this case, you can create an external custom test provider that allows you to load custom tests from a local folder or a Git repository.

In this section you will learn how to declare a local filesystem test provider that allows loading tests from a local folder, following these high-level steps:

Create a folder of custom tests from existing inline tests (tests that exist in your active Jupyter Notebook)
Save an inline test to a file
Define and register a LocalTestProvider that points to that folder
Run test provider tests
Add the test results to your documentation

Create custom tests folder

Let's start by creating a new folder that will contain reusable custom tests from your existing inline tests.

The following code snippet creates a new my_tests directory in the current working directory if it doesn't exist, and clears out any previously saved tests:

import os
import shutil

tests_folder = "my_tests"

# create tests folder
os.makedirs(tests_folder, exist_ok=True)

# remove existing tests
for f in os.listdir(tests_folder):
    path = os.path.join(tests_folder, f)
    # remove test files and the pycache folder
    if f.endswith(".py"):
        os.remove(path)
    elif f == "__pycache__":
        shutil.rmtree(path)

After running the code above, confirm that a new my_tests directory was created successfully. For example:

~/notebooks/tutorials/validation/my_tests/

Save an inline test

The @vm.test decorator we used in Implement a custom inline test above to register one-off custom tests also adds a convenience method to the function object: call <func_name>.save() to save the test to a Python file at a specified path.

While save() gets you started by creating the file and saving the function code under the correct name, it won't automatically include imports, or any other functions or variables defined outside the test function, that the test needs to run. To solve this, pass in the optional imports argument to ensure the necessary imports are added to the file.

The confusion_matrix test requires the following additional imports:

import matplotlib.pyplot as plt
from sklearn import metrics

Let's pass these imports to the save() method to ensure they are included in the file with the following command:

confusion_matrix.save(
    # Save it to the custom tests folder we created
    tests_folder,
    imports=["import matplotlib.pyplot as plt", "from sklearn import metrics"],
)

2026-07-14 05:37:58,698 - INFO(validmind.tests.decorator): Saved to /home/runner/work/documentation/documentation/site/notebooks/EXECUTED/validation/my_tests/ConfusionMatrix.py! Be sure to add any necessary imports to the top of the file.
2026-07-14 05:37:58,699 - INFO(validmind.tests.decorator): This metric can be run with the ID: <test_provider_namespace>.ConfusionMatrix

Confirm that the save() method saved the confusion_matrix function to a file named ConfusionMatrix.py in the my_tests folder.

Note that the new file provides some context on the origin of the test, which is useful for traceability:

# Saved from __main__.confusion_matrix
# Original Test ID: my_custom_tests.ConfusionMatrix
# New Test ID: <test_provider_namespace>.ConfusionMatrix

Additionally, the new test function has been stripped of its decorator, as it now resides in a file that will be loaded by the test provider:
```
def ConfusionMatrix(dataset, model, normalize=False):
```

Register a local test provider

Now that your my_tests folder has a sample custom test, let's initialize a test provider that will tell the ValidMind Library where to find your custom tests:

ValidMind offers out-of-the-box test providers for local tests (tests in a folder) and for tests in a GitHub repository.
You can also create your own test provider by creating a class that has a load_test method that takes a test ID and returns the test function matching that ID.

Want to learn more about test providers?

An extended introduction to test providers can be found in: Integrate external test providers
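
As a rough illustration of the custom provider idea described above, the sketch below defines a class with a load_test method that returns the function matching a test ID. Everything here is hypothetical: the `DictTestProvider` class, the toy `row_count` function, and the `list_tests` method are additions for this illustration, and the sketch assumes the test ID reaches load_test without its namespace prefix.

```python
class DictTestProvider:
    """Hypothetical provider that serves test functions from a dictionary."""

    def __init__(self, tests):
        # Maps a test ID to the function implementing it
        self._tests = dict(tests)

    def list_tests(self):
        # Convenience for discovering the available IDs (an assumption here)
        return sorted(self._tests)

    def load_test(self, test_id):
        # Return the function matching the requested test ID
        try:
            return self._tests[test_id]
        except KeyError:
            raise KeyError(f"Unknown test ID: {test_id}") from None


def row_count(rows):
    """Toy test function used only for this illustration."""
    return len(rows)


provider = DictTestProvider({"RowCount": row_count})
loaded = provider.load_test("RowCount")
print(loaded([1, 2, 3]))
```

An instance like this would be registered under a namespace of your choice with vm.tests.register_test_provider(), in the same way as a LocalTestProvider.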

Initialize a local test provider

For most use cases, using a LocalTestProvider that allows you to load custom tests from a designated directory should be sufficient.

The most important attribute for a test provider is its namespace. This is a string that will be used to prefix test IDs in documentation. This allows you to have multiple test providers with tests that can even share the same ID, but are distinguished by their namespace.

Let's go ahead and load the custom tests from our my_tests directory:

from validmind.tests import LocalTestProvider

# initialize the test provider with the tests folder we created earlier
my_test_provider = LocalTestProvider(tests_folder)

vm.tests.register_test_provider(
    namespace="my_test_provider",
    test_provider=my_test_provider,
)
# `my_test_provider.load_test()` will be called for any test ID that starts with `my_test_provider`
# e.g. `my_test_provider.ConfusionMatrix` will look for a function named `ConfusionMatrix` in the `my_tests/ConfusionMatrix.py` file

Run test provider tests

Now that we've set up the test provider, we can run any test that's located in the tests folder by using the run_test() method as with any other test:

For tests that reside in a test provider directory, the test ID will be the namespace specified when registering the provider, followed by the path to the test file relative to the tests folder.
For example, the Confusion Matrix test we created earlier will have the test ID my_test_provider.ConfusionMatrix. You could organize the tests in subfolders, say classification and regression, and the test ID for the Confusion Matrix test would then be my_test_provider.classification.ConfusionMatrix.
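
The mapping from test ID to file location described above can be sketched as follows. This is an illustration of the naming convention only, not the LocalTestProvider's actual code; the helper name `resolve_test_path` is hypothetical.

```python
from pathlib import PurePosixPath


def resolve_test_path(test_id, namespace, tests_folder):
    """Map a provider test ID to the file expected to hold the test."""
    # Drop an optional result suffix such as ":champion"
    base_id = test_id.split(":", 1)[0]
    prefix = namespace + "."
    if not base_id.startswith(prefix):
        raise ValueError(f"{test_id!r} is not in namespace {namespace!r}")
    # Remaining dotted parts are subfolders followed by the file name
    parts = base_id[len(prefix):].split(".")
    return PurePosixPath(tests_folder, *parts).with_suffix(".py")


print(resolve_test_path(
    "my_test_provider.ConfusionMatrix:champion", "my_test_provider", "my_tests"
))
print(resolve_test_path(
    "my_test_provider.classification.ConfusionMatrix", "my_test_provider", "my_tests"
))
```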

Let's go ahead and re-run the confusion matrix test with our testing dataset for our two models by using the test ID my_test_provider.ConfusionMatrix. This should load the test from the test provider and run it as before.

# Champion with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:champion",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_log_model]
    }
).log()

Confusion Matrix Champion

The Confusion Matrix test evaluates classification performance by comparing predicted labels with observed labels. The matrix shown for test_dataset_final and log_model_champion reports counts across the four outcome types in a 2x2 layout. The observed counts are 214 for true negatives, 102 for false positives, 132 for false negatives, and 199 for true positives. These values show how predictions are distributed across correctly and incorrectly classified negative and positive cases.

Key insights:

Correct negatives are the largest cell: The model records 214 true negatives, which is the highest count among the four confusion matrix cells.
True positives remain substantial: The model correctly identifies 199 positive cases, indicating a large volume of correct positive classifications alongside the true negative count.
False negatives exceed false positives: False negatives total 132, compared with 102 false positives. This indicates more missed positive cases than incorrect positive classifications.
Errors are material in both classes: Misclassifications appear in both directions, with over 100 observations in each error cell, showing that neither false positives nor false negatives are negligible in the test sample.

The confusion matrix indicates that the model produces meaningful volumes of both correct negative and correct positive classifications, with true negatives slightly exceeding true positives. Misclassification is present in both directions, and the larger false negative count relative to false positives shows that missed positive cases occur more frequently than incorrect positive assignments. Overall, the result reflects a mixed classification pattern rather than dominance by a single outcome type.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:champion:8623

2026-07-14 05:38:03,293 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:champion does not exist in model's document

# Challenger with test dataset and test provider custom test
vm.tests.run_test(
    test_id="my_test_provider.ConfusionMatrix:challenger",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_rf_model]
    }
).log()

Confusion Matrix Challenger

The Confusion Matrix test evaluates classification performance by comparing predicted labels against true labels for the challenger model on the test dataset. The matrix reports counts for correct and incorrect classifications across the two classes. In this result, the cells show 236 true negatives, 80 false positives, 115 false negatives, and 216 true positives. These values provide a direct view of how model predictions are distributed between correct identifications and misclassifications for both the positive and negative classes.

Key insights:

Correct classifications exceed errors: The model records 236 true negatives and 216 true positives, for 452 correct classifications in total, compared with 80 false positives and 115 false negatives, for 195 misclassifications.
Negative class identified more accurately: True negatives number 236 versus 80 false positives, indicating fewer errors when the true class is False than when predicting positives for false cases.
False negatives exceed false positives: The model produces 115 false negatives compared with 80 false positives, showing more missed positive cases than incorrect positive assignments.
True positive detection remains substantial: The confusion matrix shows 216 true positives, which is close in magnitude to the 236 true negatives, indicating that both classes are captured in meaningful volumes.

Overall, the confusion matrix shows that the challenger model produces more correct classifications than errors on the test dataset, with substantial counts in both the true negative and true positive cells. The error profile is asymmetric, with false negatives occurring more often than false positives. This indicates that the model misses positive cases more frequently than it incorrectly assigns positive labels to negative cases.

Figures

ValidMind Figure my_test_provider.ConfusionMatrix:challenger:de11

2026-07-14 05:38:08,150 - INFO(validmind.vm_models.result.result): Test driven block with result_id my_test_provider.ConfusionMatrix:challenger does not exist in model's document

Verify test runs

Our final task is to verify that all the tests provided by the development team were run and reported accurately. Note the result_id appended to each relevant test ID, which identifies the dataset the test was run with.

Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note that inputs and input_grid expect the input_id of the dataset or model as the value, rather than the name of the variable we assigned it to:

test_config = {
    # Run with the raw dataset
    'validmind.data_validation.DatasetDescription:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.DescriptiveStatistics:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.MissingValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percentage_threshold': 1}
    },
    'validmind.data_validation.ClassImbalance:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.Duplicates:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.HighCardinality:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {
            'num_threshold': 100,
            'percent_threshold': 0.1,
            'threshold_type': 'percent'
        }
    },
    'validmind.data_validation.Skewness:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_threshold': 1}
    },
    'validmind.data_validation.UniqueRows:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TooManyZeroValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_percent_threshold': 0.03}
    },
    'validmind.data_validation.IQROutliersTable:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'threshold': 5}
    },
    # Run with the preprocessed dataset
    'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.MissingValues:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'min_percentage_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'}
    },
    'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'raw_dataset_preprocessed'},
        'params': {'default_column': 'loan_status'}
    },
    # Run with the training and test datasets
    'validmind.data_validation.DescriptiveStatistics:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.TabularDescriptionTables:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.ClassImbalance:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.UniqueRows:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.MutualInformation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'min_threshold': 0.01}
    },
    'validmind.data_validation.PearsonCorrelationMatrix:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']}
    },
    'validmind.data_validation.HighPearsonCorrelation:development_data': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final']},
        'params': {'max_threshold': 0.3, 'top_n_correlations': 10}
    },
    'validmind.model_validation.ModelMetadata': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ModelParameters': {
        'input_grid': {'model': ['log_model_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ROCCurve': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']}
    },
    'validmind.model_validation.sklearn.MinimumROCAUCScore': {
        'input_grid': {'dataset': ['train_dataset_final', 'test_dataset_final'], 'model': ['log_model_champion']},
        'params': {'min_threshold': 0.5}
    }
}
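
To get a feel for how many individual runs an input_grid entry implies, the sketch below expands a grid into one inputs dictionary per combination. It assumes the grid is expanded as every combination of its values, which is consistent with the two figures produced by the earlier two-dataset, one-model runs; `expand_input_grid` is a hypothetical helper, not a ValidMind function.

```python
from itertools import product


def expand_input_grid(input_grid):
    """Return one inputs dictionary per combination of grid values."""
    keys = list(input_grid)
    value_lists = [input_grid[key] for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


# Same shape as the ROCCurve entry in test_config
grid = {
    "dataset": ["train_dataset_final", "test_dataset_final"],
    "model": ["log_model_champion"],
}

runs = expand_input_grid(grid)
for inputs in runs:
    print(inputs)
```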

Then batch run and log our tests in test_config:

for test_id, config in test_config.items():
    print(test_id)
    try:
        # Each entry holds `inputs` or `input_grid`, plus optional `params`;
        # these are all keyword arguments of run_test(), so unpack them directly
        vm.tests.run_test(test_id, **config).log()
    except Exception as e:
        print(f"Error running test {test_id}: {e}")

validmind.data_validation.DatasetDescription:raw_data

Dataset Description Raw Data

The Dataset Description test evaluates the structure, completeness, and column-level characteristics of the raw dataset used by the model. The results provide a summary for 11 variables, including inferred data type, non-missing count, missingness, and distinct-value counts. The dataset contains numeric and categorical fields, with each column showing 8,000 observed records and reported missingness of 0.0%. Distinct-value counts range from 2 for several categorical variables to 8,000 for EstimatedSalary.

Key insights:

No missing values observed: All 11 variables report a count of 8,000 with 0 missing values and 0.0% missingness, indicating complete population of the raw dataset within this test result.
Mixed feature types are present: The dataset includes six numeric variables (CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary) and five categorical variables (Geography, Gender, HasCrCard, IsActiveMember, Exited), reflecting a mixed tabular structure.
EstimatedSalary is fully unique: EstimatedSalary has 8,000 distinct values out of 8,000 records, corresponding to a distinct proportion of 1.0, making it the highest-cardinality field in the dataset.
Balance also shows high cardinality: Balance contains 5,088 distinct values, or 63.6% of records, which is materially higher than the remaining numeric fields apart from EstimatedSalary.
Several fields have very low cardinality: Geography has 3 distinct values, Gender, HasCrCard, IsActiveMember, and Exited each have 2, NumOfProducts has 4, and Tenure has 11, indicating multiple variables are represented by a limited set of discrete levels.
CreditScore and Age show moderate granularity: CreditScore contains 452 distinct values and Age contains 69, placing both variables between the highly discrete fields and the near-record-level unique fields.

The raw dataset summary indicates complete record coverage across all reported variables with no observed missingness. Variable structure is mixed, combining low-cardinality categorical fields with numeric fields that vary substantially in granularity, from highly discrete measures such as Tenure and NumOfProducts to highly unique measures such as EstimatedSalary and Balance. Collectively, the result documents a dataset with full population completeness and heterogeneous feature cardinality across its columns.

Tables

Dataset Description

Name	Type	Count	Distinct	Distinct %
CreditScore	Numeric	8000.0	452	0.0565
Geography	Categorical	8000.0	3	0.0004
Gender	Categorical	8000.0	2	0.0002
Age	Numeric	8000.0	69	0.0086
Tenure	Numeric	8000.0	11	0.0014
Balance	Numeric	8000.0	5088	0.6360
NumOfProducts	Numeric	8000.0	4	0.0005
HasCrCard	Categorical	8000.0	2	0.0002
IsActiveMember	Categorical	8000.0	2	0.0002
EstimatedSalary	Numeric	8000.0	8000	1.0000
Exited	Categorical	8000.0	2	0.0002

2026-07-14 05:38:14,039 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DatasetDescription:raw_data does not exist in model's document

validmind.data_validation.DescriptiveStatistics:raw_data

Descriptive Statistics Raw Data

The Descriptive Statistics test evaluates the distributional characteristics of the raw dataset across numerical and categorical variables. The results summarize central tendency, dispersion, percentile structure, and observed ranges for eight numerical fields, alongside category concentration for two categorical fields. All listed variables have a count of 8,000 observations, and the tables show variation in spread, concentration, and category dominance across the dataset.

Key insights:

Balance shows substantial dispersion: Balance has a mean of 76,434.10, a median of 97,264, and a standard deviation of 62,612.25, with the 25th percentile at 0 and the maximum at 250,898. This reflects a wide distribution and a strong concentration at the lower end of the range.
EstimatedSalary is broadly distributed: EstimatedSalary spans from 12 to 199,992, with a mean of 99,790.19 and a median of 99,505. The 25th and 75th percentiles of 50,857 and 149,216 indicate a broad spread around the center.
CreditScore and Age are comparatively centered: CreditScore has a mean of 650.16 and median of 652, while Age has a mean of 38.95 and median of 37. Their central tendency measures are relatively close, with interquartile ranges of 583 to 717 for CreditScore and 32 to 44 for Age.
Product holding is concentrated in lower values: NumOfProducts has a median of 1, a 75th percentile of 2, and a maximum of 4, with a mean of 1.53. This indicates that most observations are concentrated in the lower product-count range.
Binary indicators are imbalanced to different degrees: HasCrCard has a mean of 0.7026, indicating that the value 1 is more prevalent, while IsActiveMember has a mean of 0.5199, indicating a more even split between 0 and 1. Both variables have medians of 1 and observed values limited to 0 and 1.
Geography is moderately concentrated; Gender is near-balanced: Geography contains 3 unique values, with France as the top category at 4,010 observations (50.12%). Gender contains 2 unique values, with Male as the top category at 4,396 observations (54.95%).

The descriptive statistics show a dataset with complete observation counts across the reported variables and mixed distributional behavior across features. CreditScore, Age, and EstimatedSalary display relatively centered distributions based on mean-median alignment, while Balance exhibits substantially wider dispersion and a lower-end concentration reflected by a 25th percentile of 0. The categorical variables show moderate concentration in the leading category, and several discrete or binary fields, including NumOfProducts, HasCrCard, and IsActiveMember, are concentrated in a limited set of values.

Tables

Numerical Variables

Name	Count	Mean	Std	Min	25%	50%	75%	90%	95%	Max
CreditScore	8000.0	650.1596	96.8462	350.0	583.0	652.0	717.0	778.0	813.0	850.0
Age	8000.0	38.9489	10.4590	18.0	32.0	37.0	44.0	53.0	60.0	92.0
Tenure	8000.0	5.0339	2.8853	0.0	3.0	5.0	8.0	9.0	9.0	10.0
Balance	8000.0	76434.0965	62612.2513	0.0	0.0	97264.0	128045.0	149545.0	162488.0	250898.0
NumOfProducts	8000.0	1.5325	0.5805	1.0	1.0	1.0	2.0	2.0	2.0	4.0
HasCrCard	8000.0	0.7026	0.4571	0.0	0.0	1.0	1.0	1.0	1.0	1.0
IsActiveMember	8000.0	0.5199	0.4996	0.0	0.0	1.0	1.0	1.0	1.0	1.0
EstimatedSalary	8000.0	99790.1880	57520.5089	12.0	50857.0	99505.0	149216.0	179486.0	189997.0	199992.0

Categorical Variables

Name	Count	Number of Unique Values	Top Value	Top Value Frequency	Top Value Frequency %
Geography	8000.0	3.0	France	4010.0	50.12
Gender	8000.0	2.0	Male	4396.0	54.95

2026-07-14 05:38:20,508 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:raw_data does not exist in model's document

validmind.data_validation.MissingValues:raw_data

✅ Missing Values Raw Data

The Missing Values test evaluates dataset completeness by measuring the proportion of missing entries in each feature against the configured 1% threshold. The results table reports the number and percentage of missing values for each column in the raw dataset, together with a pass/fail outcome. Across the 11 reported columns, all features show 0 missing values and 0.0% missingness, and each column is marked as Pass.

Key insights:

No missing values detected: All 11 columns report 0 missing values, indicating complete observed records across the raw dataset for the features assessed.
All features passed threshold: Every column is marked Pass against the 1% missing-value threshold, with observed missingness of 0.0% in each case.
Completeness is uniform across variables: Missingness results are identical across CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, with no variation by feature.

The test results show fully complete raw data for all evaluated features under the applied missingness criterion. No column exhibits any observed missing values, and no feature approaches or exceeds the 1% threshold. The assessed dataset therefore presents a uniform completeness profile across all reported variables.

Parameters:

{
  "min_percentage_threshold": 1
}

Tables

Column	Number of Missing Values	Percentage of Missing Values (%)	Pass/Fail
CreditScore	0	0.0	Pass
Geography	0	0.0	Pass
Gender	0	0.0	Pass
Age	0	0.0	Pass
Tenure	0	0.0	Pass
Balance	0	0.0	Pass
NumOfProducts	0	0.0	Pass
HasCrCard	0	0.0	Pass
IsActiveMember	0	0.0	Pass
EstimatedSalary	0	0.0	Pass
Exited	0	0.0	Pass
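
The pass/fail logic in the table above can be sketched in plain Python. This is a minimal illustration, not the ValidMind implementation: the helper name `missing_value_report`, the toy data, and the assumption that a column passes when its missing percentage is strictly below the threshold are all hypothetical.

```python
# Hypothetical helper for illustration; not part of the ValidMind Library.
def missing_value_report(columns, min_percentage_threshold=1):
    """Map each column name to (n_missing, pct_missing, passed)."""
    report = {}
    for name, values in columns.items():
        n_missing = sum(1 for v in values if v is None)
        pct_missing = 100 * n_missing / len(values)
        # Assumption: pass when the missing share is below the threshold.
        passed = pct_missing < min_percentage_threshold
        report[name] = (n_missing, pct_missing, passed)
    return report


# Toy data, made up for illustration: one complete column, one with a gap.
toy = {
    "CreditScore": [619, 608, 502, 699],
    "Balance": [0.0, None, 159660.8, 0.0],
}
report = missing_value_report(toy)
for name, (n, pct, passed) in report.items():
    print(f"{name}: {n} missing ({pct:.1f}%) -> {'Pass' if passed else 'Fail'}")
```

On the raw dataset every column reports 0 missing values, so every column lands on the passing side regardless of how the boundary case is handled.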

2026-07-14 05:38:23,661 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:raw_data does not exist in model's document

validmind.data_validation.ClassImbalance:raw_data

✅ Class Imbalance Raw Data

The Class Imbalance test evaluates the distribution of target classes in the dataset by measuring the percentage of records in each class against a minimum threshold. For the Exited target, the results table and bar chart show two classes: class 0 at 79.80% of rows and class 1 at 20.20% of rows. The configured minimum percentage threshold is 10%, and the reported outcome is shown separately for each class.

Key insights:

Both classes pass threshold: Class 0 and class 1 both exceed the 10% minimum percentage threshold, and both are marked as Pass.
Majority class is non-exited: The Exited = 0 class represents 79.80% of rows, making it the dominant class in the dataset.
Minority class remains material: The Exited = 1 class represents 20.20% of rows, which is substantially above the 10% threshold despite being the smaller class.
Distribution is uneven but within test bounds: The observed class proportions differ by 59.60 percentage points, with the imbalance visible in both the table and plot, while remaining within the test’s defined pass criteria.

The test result shows an asymmetric class distribution for the Exited target, with class 0 comprising the majority of observations and class 1 representing approximately one-fifth of the dataset. Under the configured 10% threshold, both classes satisfy the minimum representation requirement and receive passing outcomes. Collectively, the results indicate that the dataset is not evenly distributed across classes, but no class falls below the threshold used in this test.

Parameters:

{
  "min_percent_threshold": 10
}

Tables

Exited Class Imbalance

Exited	Percentage of Rows (%)	Pass/Fail
0	79.80%	Pass
1	20.20%	Pass
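
As a rough sketch of how the class shares and the 59.60-point gap follow from label counts, the snippet below uses a synthetic label list with the same 79.80% / 20.20% proportions. The function name `class_shares` and the strict "greater than threshold" comparison are assumptions for illustration, not the ValidMind API.

```python
from collections import Counter


def class_shares(labels, min_percent_threshold=10):
    """Map each class to (percentage of rows, passes threshold)."""
    counts = Counter(labels)
    total = len(labels)
    shares = {}
    for cls, n in sorted(counts.items()):
        pct = 100 * n / total
        # Assumption: a class passes when its share exceeds the threshold.
        shares[cls] = (pct, pct > min_percent_threshold)
    return shares


# Synthetic labels mirroring the reported proportions (not the real data).
labels = [0] * 798 + [1] * 202
shares = class_shares(labels)
gap = shares[0][0] - shares[1][0]  # difference in percentage points

for cls, (pct, passed) in shares.items():
    print(f"Exited={cls}: {pct:.2f}% -> {'Pass' if passed else 'Fail'}")
print(f"Gap between classes: {gap:.2f} percentage points")
```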

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:raw_data:f9ee

2026-07-14 05:38:31,516 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:raw_data does not exist in model's document

validmind.data_validation.Duplicates:raw_data

✅ Duplicates Raw Data

The Duplicates test evaluates the dataset for exact duplicate rows to quantify redundancy in the input data. The results table reports both the absolute count of duplicate rows and their share of the dataset. For the raw data assessed in this test, the table shows 0 duplicate rows and a duplicate rate of 0.0%.

Key insights:

No duplicate rows detected: The dataset contains 0 exact duplicate rows based on the test output.
Duplicate rate is zero: The reported percentage of duplicate rows is 0.0%, indicating no observed row-level duplication in the evaluated data.
Threshold condition is satisfied: With min_threshold set to 1 and 0 duplicates observed, the result falls below the configured threshold.

The test result indicates that no exact row duplicates were identified in the evaluated raw dataset. Both the absolute duplicate count and the duplicate percentage are zero, and the observed value is below the configured threshold. These results document an absence of detected row-level redundancy under the scope of this exact-duplicate check.

Parameters:

{
  "min_threshold": 1
}

Tables

Duplicate Rows Results for Dataset

Number of Duplicates	Percentage of Rows (%)
0	0.0
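
For readers who want to reproduce an exact-duplicate count outside the library, here is a small plain-Python sketch. The helper `duplicate_summary` and the toy rows are hypothetical; it counts every row that repeats an earlier row, which is one common definition of a duplicate and may differ in detail from what the test does internally.

```python
# Hypothetical helper for illustration; not part of the ValidMind Library.
def duplicate_summary(rows):
    """Return (number of rows repeating an earlier row, percentage of rows)."""
    seen = set()
    n_duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            n_duplicates += 1
        else:
            seen.add(key)
    return n_duplicates, 100 * n_duplicates / len(rows)


# Toy rows, made up for illustration: the last row repeats the first.
toy_rows = [
    (619, "France", "Female"),
    (608, "Spain", "Female"),
    (502, "France", "Male"),
    (619, "France", "Female"),
]
n_dup, pct_dup = duplicate_summary(toy_rows)
print(f"{n_dup} duplicate row(s), {pct_dup:.1f}% of rows")
```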

2026-07-14 05:38:36,572 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Duplicates:raw_data does not exist in model's document

validmind.data_validation.HighCardinality:raw_data

✅ High Cardinality Raw Data

The High Cardinality test evaluates the number of unique values in categorical columns to identify features with large numbers of distinct categories. The results table reports, for each categorical column, the number of distinct values, the percentage of distinct values, and the associated pass/fail outcome against the configured threshold. In this run, two categorical columns were evaluated: Geography with 3 distinct values and 0.0375% distinctness, and Gender with 2 distinct values and 0.025% distinctness. Both columns are marked as passing the test.

Key insights:

All evaluated categorical fields passed: Both categorical columns in scope, Geography and Gender, received a Pass result under the configured high-cardinality threshold.
Distinct counts are low: Geography contains 3 distinct values and Gender contains 2 distinct values, indicating limited category dispersion across the evaluated categorical features.
Distinctness percentages remain small: The reported percentages of distinct values are 0.0375% for Geography and 0.025% for Gender, both below the configured percent-based threshold.

The observed results show that the categorical columns assessed in this test have low distinct-value counts and low distinctness percentages relative to the configured threshold. No evaluated categorical field is identified as high cardinality in this run, and the test output is uniformly passing across the reported columns.

Parameters:

{
  "num_threshold": 100,
  "percent_threshold": 0.1,
  "threshold_type": "percent"
}

Tables

Column	Number of Distinct Values	Percentage of Distinct Values (%)	Pass/Fail
Geography	3	0.0375	Pass
Gender	2	0.0250	Pass

2026-07-14 05:38:40,209 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighCardinality:raw_data does not exist in model's document

validmind.data_validation.Skewness:raw_data

❌ Skewness Raw Data

The Skewness test evaluates the asymmetry of numerical variable distributions against the configured maximum threshold of 1. The results table reports skewness values and pass/fail outcomes for nine numeric columns in the raw dataset. Observed skewness values range from -0.8867 to 1.4847, with seven columns passing the threshold check and two columns failing. The reported results cover CreditScore, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited.

Key insights:

Two variables exceed the threshold: Age records skewness of 1.0245 and Exited records skewness of 1.4847, making them the only columns that fail the maximum threshold of 1.
Exited shows the highest skewness: Exited has the largest absolute skewness value in the table at 1.4847, exceeding the threshold by the widest margin among all assessed variables.
Most variables are near symmetric: CreditScore (-0.062), Tenure (0.0077), Balance (-0.1353), IsActiveMember (-0.0796), and EstimatedSalary (0.0095) all show skewness values close to zero and pass the test.
Moderate skewness remains within threshold: NumOfProducts (0.7172) and HasCrCard (-0.8867) exhibit more pronounced asymmetry than the near-zero variables, but both remain within the defined pass range.

The results indicate that skewness is limited for most numeric variables in the dataset, with seven of nine columns remaining within the configured threshold. The largest departures from the threshold are concentrated in Exited and Age, while the remaining variables show either near-zero skewness or moderate asymmetry that still passes the test. Overall, the distributional asymmetry identified by this test is localized rather than broad across the numeric feature set.

Parameters:

{
  "max_threshold": 1
}

Tables

Skewness Results for Dataset

Column	Skewness	Pass/Fail
CreditScore	-0.0620	Pass
Age	1.0245	Fail
Tenure	0.0077	Pass
Balance	-0.1353	Pass
NumOfProducts	0.7172	Pass
HasCrCard	-0.8867	Pass
IsActiveMember	-0.0796	Pass
EstimatedSalary	0.0095	Pass
Exited	1.4847	Fail
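
The skewness of a binary column is fully determined by its share of ones, which helps explain why Exited fails while IsActiveMember passes. The sketch below uses the textbook population formula for a 0/1 variable, (1 - 2p) / sqrt(p(1 - p)), with p taken from the class shares and means reported earlier in this output. The function name is hypothetical, and the library's sample-based estimator may differ slightly in the last decimals.

```python
import math


def bernoulli_skewness(p):
    """Population skewness of a 0/1 variable whose share of ones is p."""
    return (1 - 2 * p) / math.sqrt(p * (1 - p))


# Shares of ones taken from the reported class split and column means.
exited = bernoulli_skewness(0.2020)
has_cr_card = bernoulli_skewness(0.7026)
is_active_member = bernoulli_skewness(0.5199)

# Share of ones at which a binary column's skewness equals exactly 1.
critical_share = (5 - math.sqrt(5)) / 10

print(f"Exited: {exited:.4f}")
print(f"HasCrCard: {has_cr_card:.4f}")
print(f"IsActiveMember: {is_active_member:.4f}")
print(f"Skewness reaches 1 at a share of ones of {critical_share:.4f}")
```

Under this formula, any binary column whose share of ones falls below roughly 27.6% has skewness above 1, so a failure on a target such as Exited reflects class proportion rather than a distributional tail.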

2026-07-14 05:38:44,632 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.Skewness:raw_data does not exist in model's document

validmind.data_validation.UniqueRows:raw_data

❌ Unique Rows Raw Data

The UniqueRows test evaluates dataset diversity by comparing the percentage of unique values in each column against the configured minimum threshold of 1%. The results table reports the number of unique values, the corresponding percentage of unique values, and the pass/fail outcome for each raw data column. Across the eleven evaluated columns, three columns pass the threshold and eight fail, with observed uniqueness percentages ranging from 0.025% to 100.0%.

Key insights:

Only three columns pass: CreditScore (5.65%), Balance (63.6%), and EstimatedSalary (100.0%) exceed the 1% minimum uniqueness threshold and are the only columns marked as passing.
EstimatedSalary is fully unique: EstimatedSalary contains 8,000 unique values, corresponding to 100.0% uniqueness, which is the highest observed value in the test output.
Balance shows high diversity: Balance has 5,088 unique values and a uniqueness rate of 63.6%, making it the second most diverse column in the dataset based on this test.
Several columns have very low uniqueness: Geography (0.0375%), Gender (0.025%), NumOfProducts (0.05%), Tenure (0.1375%), HasCrCard (0.025%), IsActiveMember (0.025%), and Exited (0.025%) all fall materially below the 1% threshold.
Age falls just below threshold: Age records 69 unique values and a uniqueness rate of 0.8625%, placing it below the threshold despite having more distinct values than several other failing columns.

The results indicate that uniqueness is concentrated in a small subset of columns, with EstimatedSalary, Balance, and CreditScore providing the only values above the configured threshold. Most variables exhibit low uniqueness percentages and are classified as failing under this test, including Age, which is the closest failing column to the threshold. Overall, the test output shows a mixed diversity profile across the raw dataset under the 1% minimum uniqueness criterion.

Parameters:

{
  "min_percent_threshold": 1
}

Tables

Column	Number of Unique Values	Percentage of Unique Values (%)	Pass/Fail
CreditScore	452	5.6500	Pass
Geography	3	0.0375	Fail
Gender	2	0.0250	Fail
Age	69	0.8625	Fail
Tenure	11	0.1375	Fail
Balance	5088	63.6000	Pass
NumOfProducts	4	0.0500	Fail
HasCrCard	2	0.0250	Fail
IsActiveMember	2	0.0250	Fail
EstimatedSalary	8000	100.0000	Pass
Exited	2	0.0250	Fail
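
The percentages in this table are simply distinct counts divided by the 8,000 rows. The sketch below recomputes a few of them from the reported counts and shows how many distinct values a column would need to clear the 1% threshold. Variable names are illustrative, and the strict "greater than" comparison is an assumption about how the boundary is handled.

```python
N_ROWS = 8000
MIN_PERCENT_THRESHOLD = 1

# Distinct-value counts copied from the table above.
unique_counts = {
    "CreditScore": 452,
    "Age": 69,
    "Tenure": 11,
    "Balance": 5088,
    "EstimatedSalary": 8000,
    "Exited": 2,
}

results = {}
for column, n_unique in unique_counts.items():
    pct_unique = 100 * n_unique / N_ROWS
    # Assumption: a column passes when its unique share exceeds the threshold.
    results[column] = (pct_unique, pct_unique > MIN_PERCENT_THRESHOLD)

# Distinct values needed to reach the threshold on a dataset of this size.
min_distinct_needed = MIN_PERCENT_THRESHOLD * N_ROWS / 100

for column, (pct, passed) in results.items():
    print(f"{column}: {pct:.4f}% -> {'Pass' if passed else 'Fail'}")
print(f"Distinct values needed to reach 1%: {min_distinct_needed:.0f}")
```

With 8,000 rows the threshold corresponds to 80 distinct values, so binary and low-cardinality columns cannot pass this check by construction.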

2026-07-14 05:38:49,110 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:raw_data does not exist in model's document

validmind.data_validation.TooManyZeroValues:raw_data

❌ Too Many Zero Values Raw Data

The TooManyZeroValues test evaluates numerical columns for zero-value concentrations that exceed a specified threshold. The results table reports the row count, number of zero values, percentage of zero values, and pass/fail status for each assessed numerical variable. In this run, four variables were evaluated across 8,000 rows, and each variable exceeded the configured maximum percent threshold of 0.03%, resulting in a fail status for all reported columns.

Key insights:

All assessed variables failed: Tenure, Balance, HasCrCard, and IsActiveMember each exceeded the 0.03% threshold for zero values and were flagged as Fail in the test output.
IsActiveMember has the highest zero share: IsActiveMember contains 3,841 zero values out of 8,000 rows, corresponding to 48.0125%, which is the highest percentage among the evaluated variables.
Balance shows substantial zero concentration: Balance records 2,912 zero values, representing 36.4% of observations, indicating a large concentration of zeros within this variable.
Binary-style variables also contain large zero proportions: HasCrCard has 2,379 zero values (29.7375%) and IsActiveMember has 3,841 zero values (48.0125%), both showing zero values in a substantial share of records.
Tenure has the lowest but still elevated zero rate: Tenure contains 323 zero values, equal to 4.0375% of rows, which is the smallest zero-value proportion reported but still above the configured threshold.

The test results show that zero values are present above the configured threshold in every numerical variable included in the output. The largest concentrations occur in IsActiveMember and Balance, followed by HasCrCard, while Tenure has the lowest observed zero share. Collectively, the results indicate that zero-value prevalence is widespread across the assessed numerical fields in this dataset.

Parameters:

{
  "max_percent_threshold": 0.03
}

Tables

Variable	Row Count	Number of Zero Values	Percentage of Zero Values (%)	Pass/Fail
Tenure	8000	323	4.0375	Fail
Balance	8000	2912	36.4000	Fail
HasCrCard	8000	2379	29.7375	Fail
IsActiveMember	8000	3841	48.0125	Fail
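
For the binary columns, the zero share can be cross-checked against the means in the Descriptive Statistics table, because the mean of a 0/1 column is its share of ones. The sketch below does that arithmetic with the reported figures; the variable names are illustrative only.

```python
N_ROWS = 8000

# Zero counts copied from the table above.
zero_counts = {
    "Tenure": 323,
    "Balance": 2912,
    "HasCrCard": 2379,
    "IsActiveMember": 3841,
}
# Means of the binary columns, from the raw-data Descriptive Statistics table.
binary_means = {"HasCrCard": 0.7026, "IsActiveMember": 0.5199}

zero_pct = {col: 100 * n / N_ROWS for col, n in zero_counts.items()}
# For a 0/1 column, the share of zeros is one minus the mean.
implied_zero_pct = {col: 100 * (1 - mean) for col, mean in binary_means.items()}

for col in binary_means:
    print(f"{col}: counted {zero_pct[col]:.4f}%, implied by mean {implied_zero_pct[col]:.2f}%")
```

The two routes agree to within the rounding of the reported means, which indicates that the zeros in HasCrCard and IsActiveMember are simply the "0" category of each indicator.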

2026-07-14 05:38:54,657 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TooManyZeroValues:raw_data does not exist in model's document

validmind.data_validation.IQROutliersTable:raw_data

IQR Outliers Table Raw Data

The Interquartile Range Outliers Table test evaluates numerical features for observations falling outside the IQR-based outlier thresholds. The result is presented as a summary table titled "Summary of Outliers Detected by IQR Method." In this test run, the returned raw data table is empty, so no feature-level outlier counts or summary statistics are displayed.

Key insights:

Empty outlier summary table: The result table contains no rows, indicating that the output does not list any numerical features with detected outliers or associated summary statistics.
No feature-level outlier detail shown: Because the table is empty, no minimum, quartile, median, or maximum values for outlier observations are reported for any feature.
Threshold parameter recorded: The test parameters show a threshold value of 5 for the IQR-based outlier detection run.

The documented result consists of an empty IQR outlier summary table under a threshold setting of 5. As presented, the output provides no feature-specific outlier counts or distribution summaries, so the result is limited to the absence of reported rows in the table.

Parameters:

{
  "threshold": 5
}

Tables

Summary of Outliers Detected by IQR Method
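
One way to see why the table is empty is to rebuild the IQR fences from the quartiles in the raw-data Descriptive Statistics table. The sketch below assumes the conventional fence rule (Q1 - k × IQR, Q3 + k × IQR) with k set to the test's `threshold` parameter; that rule and the helper name are assumptions, not taken from the ValidMind source.

```python
# (min, Q1, Q3, max) copied from the raw-data Numerical Variables table.
summary = {
    "CreditScore": (350.0, 583.0, 717.0, 850.0),
    "Age": (18.0, 32.0, 44.0, 92.0),
    "Tenure": (0.0, 3.0, 8.0, 10.0),
    "Balance": (0.0, 0.0, 128045.0, 250898.0),
    "NumOfProducts": (1.0, 1.0, 2.0, 4.0),
    "HasCrCard": (0.0, 0.0, 1.0, 1.0),
    "IsActiveMember": (0.0, 0.0, 1.0, 1.0),
    "EstimatedSalary": (12.0, 50857.0, 149216.0, 199992.0),
}


def columns_with_outliers(stats, k):
    """List columns whose min or max falls outside the k * IQR fences."""
    flagged = []
    for name, (lo, q1, q3, hi) in stats.items():
        iqr = q3 - q1
        if lo < q1 - k * iqr or hi > q3 + k * iqr:
            flagged.append(name)
    return flagged


flagged_at_5 = columns_with_outliers(summary, k=5)      # setting used in this run
flagged_at_1_5 = columns_with_outliers(summary, k=1.5)  # common textbook multiplier

print("Flagged with threshold 5:", flagged_at_5)
print("Flagged with threshold 1.5:", flagged_at_1_5)
```

Under these assumptions no column's minimum or maximum falls outside the fences at a threshold of 5, which is consistent with the empty table; a tighter multiplier such as 1.5 would flag some columns.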

2026-07-14 05:38:57,726 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.IQROutliersTable:raw_data does not exist in model's document

validmind.data_validation.DescriptiveStatistics:preprocessed_data

Descriptive Statistics Preprocessed Data

The Descriptive Statistics test evaluates the distributional characteristics of numerical and categorical variables in the preprocessed dataset. The results summarize central tendency, dispersion, percentile ranges, and observed minima and maxima for seven numerical variables, alongside category counts, unique-value counts, and top-category frequencies for two categorical variables. All reported variables have a count of 3,232 observations, and the tables show both continuous-valued fields such as Balance and EstimatedSalary and low-cardinality fields such as Geography and Gender.

Key insights:

Complete coverage across variables: Every numerical and categorical variable reports a count of 3,232, indicating that the summarized dataset contains the same number of observations for all listed fields.
Balance shows the widest spread: Balance has a mean of 82,162.3758 and a standard deviation of 61,537.5307, with values ranging from 0.0 to 250,898. The 25th percentile is 0.0 while the median is 103,460, indicating a large spread across the distribution.
CreditScore is centered and broadly distributed: CreditScore has a mean of 645.7454 and a median of 646.0, with a standard deviation of 97.2503 and a range from 350.0 to 850.0. The close alignment of mean and median indicates a centered distribution in the reported summary.
EstimatedSalary spans a wide range: EstimatedSalary ranges from 12.0 to 199,992.0, with a mean of 99,930.9342 and median of 99,463.0. Its standard deviation of 58,369.1263 indicates substantial dispersion relative to several other variables.
Several variables are low-cardinality or binary: NumOfProducts ranges from 1 to 4 with a median of 1.0, while HasCrCard and IsActiveMember are binary fields bounded between 0 and 1. HasCrCard has a mean of 0.6962, compared with 0.4573 for IsActiveMember.
Categorical concentration is moderate: Geography contains 3 unique values, with France as the top category at 1,513 observations (46.81%). Gender contains 2 unique values, with Male as the top category at 1,657 observations (51.27%).

The descriptive summary shows a dataset with consistent observation counts across all reported variables and a mix of continuous, discrete, and binary features. The largest numerical dispersion is observed in Balance and EstimatedSalary, while CreditScore appears comparatively centered based on the reported mean and median. The categorical variables show limited cardinality with moderate concentration in the most frequent categories, particularly France for Geography and Male for Gender.

Tables

Numerical Variables

Name	Count	Mean	Std	Min	25%	50%	75%	90%	95%	Max
CreditScore	3232.0	645.7454	97.2503	350.0	580.0	646.0	712.0	774.0	811.0	850.0
Tenure	3232.0	5.0251	2.9235	0.0	3.0	5.0	8.0	9.0	10.0	10.0
Balance	3232.0	82162.3758	61537.5307	0.0	0.0	103460.0	129756.0	150773.0	164155.0	250898.0
NumOfProducts	3232.0	1.5096	0.6717	1.0	1.0	1.0	2.0	2.0	3.0	4.0
HasCrCard	3232.0	0.6962	0.4600	0.0	0.0	1.0	1.0	1.0	1.0	1.0
IsActiveMember	3232.0	0.4573	0.4983	0.0	0.0	0.0	1.0	1.0	1.0	1.0
EstimatedSalary	3232.0	99930.9342	58369.1263	12.0	49731.0	99463.0	150661.0	180149.0	190280.0	199992.0

Categorical Variables

Name	Count	Number of Unique Values	Top Value	Top Value Frequency	Top Value Frequency %
Geography	3232.0	3.0	France	1513.0	46.81
Gender	3232.0	2.0	Male	1657.0	51.27

2026-07-14 05:39:03,006 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:preprocessed_data does not exist in model's document

validmind.data_validation.TabularDescriptionTables:preprocessed_data

Tabular Description Tables Preprocessed Data

The TabularDescriptionTables test evaluates the composition, completeness, and basic distributional characteristics of the preprocessed dataset across numerical and categorical fields. The results show summary statistics for eight numerical variables and two categorical variables, each with 3,232 observations. For numerical fields, the table reports means, minimums, maximums, missing-value percentages, and data types; for categorical fields, it reports the number of unique values, listed categories, missing-value percentages, and data types. All reported variables show 0.0% missing values.

Key insights:

No reported missing values: All eight numerical variables and both categorical variables have 0.0% missing values across 3,232 observations, indicating complete coverage in the reported preprocessed dataset.
Binary indicators are numerically encoded: HasCrCard, IsActiveMember, and Exited are stored as int64 with minimum and maximum values of 0 and 1. Their means are 0.6962, 0.4573, and 0.5000 respectively, reflecting the proportion of records in the 1 category for each field.
Target variable is evenly split: Exited has a mean of 0.5 with values bounded between 0 and 1, indicating an even class balance in the reported sample.
Core numeric ranges vary substantially: CreditScore ranges from 350.0 to 850.0 with a mean of 645.7454, Tenure ranges from 0.0 to 10.0 with a mean of 5.0251, and Balance ranges from 0.0 to 250,898.09 with a mean of 82,162.3758. EstimatedSalary spans from 11.58 to 199,992.48 with a mean of 99,930.9342, showing broad scale differences across continuous inputs.
Categorical structure is low-cardinality: Geography contains 3 unique values (Spain, France, Germany) and Gender contains 2 unique values (Female, Male), both stored as object and both with 0.0% missing values.

The reported preprocessed dataset is complete across all listed variables and consists of a compact set of eight numerical and two categorical features. Several variables are binary indicators encoded as integers, including the Exited outcome, which is evenly distributed in the sample. The remaining numerical variables span distinct value ranges, while the categorical fields are limited to a small number of observed classes.

Tables

Numerical Variable	Num of Obs	Mean	Min	Max	Data Type
CreditScore	3232	645.7454	350.00	850.00	int64
Tenure	3232	5.0251	0.00	10.00	int64
Balance	3232	82162.3758	0.00	250898.09	float64
NumOfProducts	3232	1.5096	1.00	4.00	int64
HasCrCard	3232	0.6962	0.00	1.00	int64
IsActiveMember	3232	0.4573	0.00	1.00	int64
EstimatedSalary	3232	99930.9342	11.58	199992.48	float64
Exited	3232	0.5000	0.00	1.00	int64

Categorical Variable	Num of Obs	Num of Unique Values	Unique Values	Missing Values (%)	Data Type
Geography	3232.0	3.0	['Spain' 'France' 'Germany']	0.0	object
Gender	3232.0	2.0	['Female' 'Male']	0.0	object

2026-07-14 05:39:08,141 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:preprocessed_data does not exist in model's document

validmind.data_validation.MissingValues:preprocessed_data

✅ Missing Values Preprocessed Data

The Missing Values test evaluates dataset completeness by measuring the proportion of missing values in each feature against the configured 1% threshold. The result table reports the number and percentage of missing values for each column in the preprocessed dataset, along with a pass/fail outcome for the threshold comparison. Across the 10 reported columns, all features show 0 missing values and 0.0% missingness, resulting in a pass status for every field.

Key insights:

No missing values detected: All 10 columns report 0 missing values, indicating complete observed data across the tested preprocessed dataset.
Universal threshold compliance: Every feature records 0.0% missingness, which is below the configured 1% threshold, and each column is marked as Pass.
Completeness is consistent across features: Missingness results are uniform across numeric, categorical, and target fields listed in the table, with no variation in completeness by column.

The test result shows full observed completeness across all reported features in the preprocessed dataset. No column exceeds the configured missing-value threshold, and the pass/fail outcomes are uniformly positive. The missingness profile is therefore flat across the dataset, with no feature-level concentration of absent values shown in this test.

Parameters:

{
  "min_percentage_threshold": 1
}

Tables

Column	Number of Missing Values	Percentage of Missing Values (%)	Pass/Fail
CreditScore	0	0.0	Pass
Geography	0	0.0	Pass
Gender	0	0.0	Pass
Tenure	0	0.0	Pass
Balance	0	0.0	Pass
NumOfProducts	0	0.0	Pass
HasCrCard	0	0.0	Pass
IsActiveMember	0	0.0	Pass
EstimatedSalary	0	0.0	Pass
Exited	0	0.0	Pass

2026-07-14 05:39:11,486 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MissingValues:preprocessed_data does not exist in model's document

validmind.data_validation.TabularNumericalHistograms:preprocessed_data

Tabular Numerical Histograms Preprocessed Data

The TabularNumericalHistograms test evaluates the distribution of numerical input features by plotting a histogram for each variable in the preprocessed dataset. The result includes histograms for CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and EstimatedSalary. These plots show the shape, concentration, and spread of each feature, including continuous, discrete, and binary variables.

Key insights:

CreditScore is broadly unimodal: CreditScore displays a single central concentration, with the highest density in the mid-range and thinner tails at lower and higher values.
Tenure is fairly even across categories: Tenure values from 1 through 9 appear at broadly similar frequencies, while 0 and 10 are visibly less frequent than the middle tenure values.
Balance shows a zero-heavy pattern: Balance has a pronounced spike at 0 and a separate broad concentration centered roughly in the 100k to 150k range, indicating a mixed distribution rather than a single continuous shape.
NumOfProducts is concentrated at lower counts: NumOfProducts is dominated by values of 1 and 2, with substantially fewer observations at 3 and very few at 4.
Binary features differ in balance: HasCrCard is concentrated at 1 relative to 0, while IsActiveMember is split more evenly, with a modestly higher count at 0 than at 1.
EstimatedSalary is approximately uniform: EstimatedSalary is distributed relatively evenly across its range, with no strong central peak or visible concentration in a narrow interval.

Overall, the histograms show a mix of distribution types across the numerical inputs. Several variables are discrete or binary by construction, CreditScore appears centrally concentrated, EstimatedSalary is relatively flat across its range, and Balance stands out for its strong mass at zero combined with a separate nonzero concentration. The most prominent concentration effects are therefore observed in Balance, NumOfProducts, and the binary indicators.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:34f1

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:71ed

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:0a99

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:0e17

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:357e

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:3f88

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:preprocessed_data:1e37

2026-07-14 05:39:41,620 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:preprocessed_data does not exist in model's document

validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data

Tabular Categorical Bar Plots Preprocessed Data

The TabularCategoricalBarPlots test evaluates the composition of categorical features by displaying the count of observations in each category. The result consists of bar plots for two categorical variables, Geography and Gender, showing the relative frequency of each category in the preprocessed dataset. In the Geography plot, France has the highest count, followed by Germany and then Spain. In the Gender plot, Male and Female appear at similar frequencies, with Male slightly higher.

Key insights:

Geography is unevenly distributed: France is the largest geography category at approximately 1,500 observations, compared with about 1,000 for Germany and roughly 700 for Spain. This indicates a visibly higher concentration of records in France than in the other two categories.
Gender distribution is close to balanced: The Gender feature shows similar counts for Male and Female. Male is marginally more frequent, with both categories appearing near the 1,600 level.
Category cardinality remains low: Geography contains three categories and Gender contains two categories. The plots therefore show compact categorical structures without a large number of distinct levels.

The categorical composition in the preprocessed dataset is characterized by differing concentration patterns across features. Geography shows a clear imbalance, with France represented more heavily than Germany and Spain, while Gender remains comparatively balanced between its two categories. Overall, the categorical variables shown are limited to a small number of levels and are easily interpretable from the plotted distributions.

Figures

ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:be61

ValidMind Figure validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data:e45a

2026-07-14 05:40:05,972 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data does not exist in model's document

validmind.data_validation.TargetRateBarPlots:preprocessed_data

Target Rate Bar Plots Preprocessed Data

The TargetRateBarPlots test evaluates the distribution of categorical feature values alongside the observed target rate within each category. The result presents paired bar plots for the categorical features Geography and Gender, with one chart showing category counts and the other showing mean target rate by category. For Geography, the categories shown are France, Germany, and Spain; for Gender, the categories shown are Male and Female. The plots enable direct comparison between category prevalence and the corresponding target-rate levels.

Key insights:

Germany has the highest target rate: Among the geography categories, Germany shows the highest target rate at approximately 0.65, compared with about 0.41 for France and 0.45 for Spain.
France is the largest geography segment: France has the highest count at roughly 1,500 observations, followed by Germany at about 1,000 and Spain at about 700, indicating uneven category representation across geographies.
Female target rate exceeds male: The gender target-rate plot shows a higher rate for Female at approximately 0.56 versus about 0.43 for Male.
Gender counts are relatively balanced: Male and Female counts are close in magnitude, at roughly 1,650 and 1,580 respectively, indicating similar representation across the two gender categories.

The results show that target rates vary materially across both categorical features included in the test output. The largest separation is observed within Geography, where Germany combines a mid-sized count with the highest target rate among the three categories. Gender displays comparatively balanced category counts but a clear difference in target rate between Female and Male.

Figures

ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:3db3

ValidMind Figure validmind.data_validation.TargetRateBarPlots:preprocessed_data:26d1

2026-07-14 05:40:21,201 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TargetRateBarPlots:preprocessed_data does not exist in model's document
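
The target-rate view above pairs each category's count with the mean of the binary target within that category. A minimal sketch of that aggregation in plain Python follows. The function name `target_rate_by_category` and the toy rows are hypothetical; they are not taken from the ValidMind Library or from this dataset.

```python
from collections import defaultdict


def target_rate_by_category(categories, targets):
    """Return {category: (count, mean target)} for a binary 0/1 target."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for category, target in zip(categories, targets):
        counts[category] += 1
        positives[category] += target
    return {c: (counts[c], positives[c] / counts[c]) for c in counts}


# Hypothetical toy rows, not the notebook's data.
geography = ["France", "France", "Germany", "Germany", "Spain", "France"]
exited = [0, 1, 1, 1, 0, 0]
rates = target_rate_by_category(geography, exited)
```

Reading the count and the rate side by side, as the paired plots do, shows whether a high target rate rests on many observations or only a few.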

validmind.data_validation.DescriptiveStatistics:development_data

Descriptive Statistics Development Data

The Descriptive Statistics test evaluates the distributional characteristics of the development dataset by summarizing central tendency, spread, and percentile structure for each numerical variable. The results report separate statistics for train_dataset_final and test_dataset_final, including count, mean, standard deviation, minimum, quartiles, upper percentiles, and maximum values. Across the seven reported variables, the tables show how the train and test partitions compare in sample size and in the location and dispersion of each feature.

Key insights:

Train and test distributions are broadly aligned: For most variables, train and test medians and quartiles are close. CreditScore has identical medians of 646 in both datasets, Tenure has a median of 5 in both, Balance medians are 103,583 and 102,967, and NumOfProducts has a median of 1 in both partitions.
Balance shows pronounced lower-tail concentration: Balance has a 25th percentile of 0 in both train and test, while medians are substantially higher at 103,583 and 102,967 respectively. This indicates that at least one quarter of observations are at zero despite a broad upper range extending to 222,268 in train and 250,898 in test.
Binary indicators differ modestly across partitions: IsActiveMember has a mean of 0.4453 in train versus 0.5054 in test, with the median shifting from 0 in train to 1 in test. HasCrCard also increases from a mean of 0.6909 in train to 0.7172 in test, while retaining a median of 1 in both datasets.
EstimatedSalary is widely dispersed but similarly centered: EstimatedSalary shows means of 99,436.8489 in train and 101,904.9846 in test, with medians of 98,913 and 100,834 and standard deviations of 58,303.7513 and 58,633.2587. The spread is therefore large in both partitions, but the central values remain close.
CreditScore and Tenure are stable across samples: CreditScore means are 646.1122 in train and 644.2798 in test, with similar dispersion (96.4753 vs. 100.3500) and identical maximum values of 850. Tenure also remains closely matched, with means near 5.0, standard deviations near 2.9, and the same observed range from 0 to 10.

The descriptive statistics indicate that the train and test partitions are generally similar in central tendency and spread across the reported numerical variables. The most visible distributional feature is the concentration of zero balances alongside high median and upper-percentile Balance values, which reflects a mixed lower-tail structure within that variable. Differences between partitions are limited overall, though the binary indicators—especially IsActiveMember—show somewhat higher values in the test dataset than in the training dataset.

Tables

dataset	Name	Count	Mean	Std	Min	25%	50%	75%	90%	95%	Max
train_dataset_final	CreditScore	2585.0	646.1122	96.4753	350.0	581.0	646.0	713.0	774.0	809.0	850.0
train_dataset_final	Tenure	2585.0	5.0290	2.9141	0.0	3.0	5.0	8.0	9.0	10.0	10.0
train_dataset_final	Balance	2585.0	82307.6416	61498.1936	0.0	0.0	103583.0	129829.0	150996.0	163915.0	222268.0
train_dataset_final	NumOfProducts	2585.0	1.4979	0.6619	1.0	1.0	1.0	2.0	2.0	3.0	4.0
train_dataset_final	HasCrCard	2585.0	0.6909	0.4622	0.0	0.0	1.0	1.0	1.0	1.0	1.0
train_dataset_final	IsActiveMember	2585.0	0.4453	0.4971	0.0	0.0	0.0	1.0	1.0	1.0	1.0
train_dataset_final	EstimatedSalary	2585.0	99436.8489	58303.7513	12.0	48682.0	98913.0	150402.0	179618.0	189386.0	199971.0
test_dataset_final	CreditScore	647.0	644.2798	100.3500	373.0	573.0	646.0	711.0	775.0	815.0	850.0
test_dataset_final	Tenure	647.0	5.0093	2.9631	0.0	2.0	5.0	8.0	9.0	10.0	10.0
test_dataset_final	Balance	647.0	81581.9860	61738.7022	0.0	0.0	102967.0	129293.0	149212.0	164513.0	250898.0
test_dataset_final	NumOfProducts	647.0	1.5564	0.7079	1.0	1.0	1.0	2.0	2.0	3.0	4.0
test_dataset_final	HasCrCard	647.0	0.7172	0.4507	0.0	0.0	1.0	1.0	1.0	1.0	1.0
test_dataset_final	IsActiveMember	647.0	0.5054	0.5004	0.0	0.0	1.0	1.0	1.0	1.0	1.0
test_dataset_final	EstimatedSalary	647.0	101904.9846	58633.2587	236.0	53181.0	100834.0	153343.0	183023.0	192247.0	199992.0

2026-07-14 05:40:31,861 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.DescriptiveStatistics:development_data does not exist in model's document
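
The Balance pattern noted above (a 25th percentile of 0 alongside a much higher median) follows directly from how percentiles are read off the sorted values. The sketch below uses linear interpolation between neighbouring ranks, which is the default in NumPy and pandas; whether the test uses the same method is an assumption. The `percentile` helper and the toy balances are hypothetical.

```python
from math import floor


def percentile(values, q):
    """Percentile for 0 <= q <= 1 using linear interpolation between ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical balances: more than a quarter of the accounts hold zero.
balances = [0, 0, 0, 90_000, 100_000, 110_000, 120_000, 130_000]
p25 = percentile(balances, 0.25)  # falls inside the block of zeros
p50 = percentile(balances, 0.50)  # falls among the non-zero balances
```

Once at least a quarter of the values are zero, the 25th percentile lands inside the zero block regardless of how large the remaining balances are.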

validmind.data_validation.TabularDescriptionTables:development_data

Tabular Description Tables Development Data

The TabularDescriptionTables test evaluates the composition, completeness, and basic distributional properties of the development data by summarizing numerical and categorical variables. The results cover both train_dataset_final and test_dataset_final, reporting observation counts, means, minimum and maximum values, missing-value percentages, and data types for numerical fields, alongside unique-value counts and encodings for categorical fields. Across the reported variables, the tables show the range and central tendency of each feature as represented in the train and test partitions.

Key insights:

No missing values reported: All listed numerical and categorical variables in both train_dataset_final and test_dataset_final show Missing Values (%) of 0.0, indicating complete observations for the reported development fields.
Train and test sample sizes differ materially: The numerical and categorical summaries show 2,585 observations in train_dataset_final and 647 observations in test_dataset_final, establishing the relative size of each development partition.
Feature means are broadly similar across partitions: Several numerical variables have closely aligned means between train and test, including CreditScore (646.1122 vs. 644.2798), Tenure (5.0290 vs. 5.0093), Balance (82,307.6416 vs. 81,581.9860), and Exited (0.4971 vs. 0.5116).
Some variables show higher relative mean differences: IsActiveMember increases from 0.4453 in the training set to 0.5054 in the test set, HasCrCard from 0.6909 to 0.7172, NumOfProducts from 1.4979 to 1.5564, and EstimatedSalary from 99,436.8489 to 101,904.9846.
Binary-encoded fields are represented consistently: HasCrCard, IsActiveMember, and Exited are stored as int64 with minimum 0 and maximum 1 in both datasets, while Geography_Germany, Geography_Spain, and Gender_Male are stored as boolean variables with exactly two unique values: True and False.
Observed ranges vary across partitions for some variables: Maximum values differ between train and test for selected fields, including Balance (222,267.63 in train vs. 250,898.09 in test) and EstimatedSalary (199,970.74 in train vs. 199,992.48 in test), while CreditScore shares the same maximum of 850 across both partitions.

The descriptive statistics indicate that the development data is fully populated for the reported variables and uses consistent data typing across train and test partitions. Central tendencies are broadly aligned for several core numerical features, while selected variables show moderate differences in mean values between partitions. The variable encodings are uniformly binary for the reported indicator fields, and the observed value ranges show some partition-level differences, particularly in maximum values for Balance and EstimatedSalary.

Tables

dataset	Numerical Variable	Num of Obs	Mean	Min	Max	Data Type
train_dataset_final	CreditScore	2585	646.1122	350.00	850.00	int64
train_dataset_final	Tenure	2585	5.0290	0.00	10.00	int64
train_dataset_final	Balance	2585	82307.6416	0.00	222267.63	float64
train_dataset_final	NumOfProducts	2585	1.4979	1.00	4.00	int64
train_dataset_final	HasCrCard	2585	0.6909	0.00	1.00	int64
train_dataset_final	IsActiveMember	2585	0.4453	0.00	1.00	int64
train_dataset_final	EstimatedSalary	2585	99436.8489	11.58	199970.74	float64
train_dataset_final	Exited	2585	0.4971	0.00	1.00	int64
test_dataset_final	CreditScore	647	644.2798	373.00	850.00	int64
test_dataset_final	Tenure	647	5.0093	0.00	10.00	int64
test_dataset_final	Balance	647	81581.9860	0.00	250898.09	float64
test_dataset_final	NumOfProducts	647	1.5564	1.00	4.00	int64
test_dataset_final	HasCrCard	647	0.7172	0.00	1.00	int64
test_dataset_final	IsActiveMember	647	0.5054	0.00	1.00	int64
test_dataset_final	EstimatedSalary	647	101904.9846	236.45	199992.48	float64
test_dataset_final	Exited	647	0.5116	0.00	1.00	int64

dataset	Categorical Variable	Num of Obs	Num of Unique Values	Unique Values	Data Type
train_dataset_final	Geography_Germany	2585.0	2.0	[False True]	bool
train_dataset_final	Geography_Spain	2585.0	2.0	[ True False]	bool
train_dataset_final	Gender_Male	2585.0	2.0	[ True False]	bool
test_dataset_final	Geography_Germany	647.0	2.0	[False True]	bool
test_dataset_final	Geography_Spain	647.0	2.0	[False True]	bool
test_dataset_final	Gender_Male	647.0	2.0	[ True False]	bool

2026-07-14 05:40:37,767 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularDescriptionTables:development_data does not exist in model's document

validmind.data_validation.ClassImbalance:development_data

✅ Class Imbalance Development Data

The Class Imbalance test evaluates the distribution of target classes in the dataset by measuring the percentage share of each class and comparing it with the configured minimum threshold of 10%. Results are reported separately for train_dataset_final and test_dataset_final for the target Exited. In the training dataset, class 0 represents 50.29% of rows and class 1 represents 49.71%; in the test dataset, class 1 represents 51.16% and class 0 represents 48.84%. The accompanying plots show near-equal bar heights for both classes in each dataset, consistent with the tabulated percentages.

Key insights:

Both classes pass the threshold: All observed class proportions exceed the 10% minimum threshold, and every class-level result is marked as Pass in both training and test datasets.
Training data is nearly balanced: In train_dataset_final, the class distribution is 50.29% for Exited=0 and 49.71% for Exited=1, indicating a difference of 0.58 percentage points between classes.
Test data is also nearly balanced: In test_dataset_final, the class distribution is 51.16% for Exited=1 and 48.84% for Exited=0, indicating a difference of 2.32 percentage points between classes.
Train and test distributions are closely aligned: The majority class shifts slightly between datasets, but both remain close to a 50/50 split, with only small changes in class proportions across the train and test samples.

The test results show that the target variable Exited is distributed evenly across both development datasets relative to the configured 10% threshold. No class falls below the minimum percentage in either dataset, and the observed class shares remain close to parity in both training and test samples. Collectively, these results indicate an absence of material class concentration in the evaluated datasets.

Parameters:

{
  "min_percent_threshold": 10
}

Tables

dataset	Exited	Percentage of Rows (%)	Pass/Fail
train_dataset_final	0	50.29%	Pass
train_dataset_final	1	49.71%	Pass
test_dataset_final	1	51.16%	Pass
test_dataset_final	0	48.84%	Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:development_data:97ee

ValidMind Figure validmind.data_validation.ClassImbalance:development_data:2505

2026-07-14 05:40:50,527 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.ClassImbalance:development_data does not exist in model's document
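
The class-share arithmetic reported above can be reproduced with a few lines of plain Python. This is only a sketch: the function name `class_shares` is hypothetical, the class counts (1,300 and 1,285) are reconstructed from the reported percentages and row count rather than read from the data, and treating a share equal to the threshold as a pass is an assumption.

```python
from collections import Counter


def class_shares(labels, min_percent_threshold=10):
    """Return {class: (percent_of_rows, passed)} for each class label."""
    total = len(labels)
    shares = {}
    for cls, count in Counter(labels).items():
        percent = round(100 * count / total, 2)
        # Assumption: a class passes when its share is at or above the threshold.
        shares[cls] = (percent, percent >= min_percent_threshold)
    return shares


# Assumed class counts (1,300 zeros, 1,285 ones), reconstructed from the
# reported 50.29% / 49.71% split over 2,585 training rows.
train_labels = [0] * 1300 + [1] * 1285
train_shares = class_shares(train_labels)
```

With these counts the sketch returns 50.29% for class 0 and 49.71% for class 1, both well above the 10% threshold.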

validmind.data_validation.UniqueRows:development_data

❌ Unique Rows Development Data

The UniqueRows test evaluates dataset diversity by comparing the percentage of unique values in each column against the configured minimum threshold of 1%. Results are reported for both train_dataset_final and test_dataset_final, with each column showing its number of unique values, percentage of unique values, and pass/fail status. In the training dataset, 3 of 11 columns pass the threshold, while in the test dataset, 4 of 11 columns pass. The reported percentages range from 0.0774% to 100.0% in training and from 0.3091% to 100.0% in testing.

Key insights:

EstimatedSalary is fully unique: EstimatedSalary records 100.0% unique values in both datasets, with 2,585 unique values in training and 647 in testing, and passes in both cases.
Balance and CreditScore pass in both datasets: Balance shows 68.1238% unique values in training and 68.1607% in testing, while CreditScore shows 16.441% in training and 48.068% in testing; both columns pass in each dataset.
Tenure differs between datasets: Tenure fails in training at 0.4255% unique values with 11 unique values, but passes in testing at 1.7002% with the same 11 unique values.
Binary and low-cardinality fields consistently fail: HasCrCard, IsActiveMember, Geography_Germany, Geography_Spain, Gender_Male, and Exited each have 2 unique values and fail in both datasets. NumOfProducts, with 4 unique values, also fails in both datasets.
Training dataset has more failing columns: The training dataset has 8 failing columns versus 7 in the test dataset, driven by the additional failure of Tenure in training.

The results show that uniqueness is concentrated in a small subset of columns, with EstimatedSalary, Balance, and CreditScore meeting the threshold in both datasets and Tenure meeting it only in the test dataset. Most remaining fields have very low uniqueness percentages, particularly binary indicator variables and other low-cardinality features, resulting in failure under the 1% threshold. Overall, the test outcome reflects a mixed diversity profile across columns, with consistent passes for high-cardinality fields and failures for low-cardinality variables, apart from Tenure, which passes only in the smaller test dataset.

Parameters:

{
  "min_percent_threshold": 1
}

Tables

dataset	Column	Number of Unique Values	Percentage of Unique Values (%)	Pass/Fail
train_dataset_final	CreditScore	425	16.4410	Pass
train_dataset_final	Tenure	11	0.4255	Fail
train_dataset_final	Balance	1761	68.1238	Pass
train_dataset_final	NumOfProducts	4	0.1547	Fail
train_dataset_final	HasCrCard	2	0.0774	Fail
train_dataset_final	IsActiveMember	2	0.0774	Fail
train_dataset_final	EstimatedSalary	2585	100.0000	Pass
train_dataset_final	Geography_Germany	2	0.0774	Fail
train_dataset_final	Geography_Spain	2	0.0774	Fail
train_dataset_final	Gender_Male	2	0.0774	Fail
train_dataset_final	Exited	2	0.0774	Fail
test_dataset_final	CreditScore	311	48.0680	Pass
test_dataset_final	Tenure	11	1.7002	Pass
test_dataset_final	Balance	441	68.1607	Pass
test_dataset_final	NumOfProducts	4	0.6182	Fail
test_dataset_final	HasCrCard	2	0.3091	Fail
test_dataset_final	IsActiveMember	2	0.3091	Fail
test_dataset_final	EstimatedSalary	647	100.0000	Pass
test_dataset_final	Geography_Germany	2	0.3091	Fail
test_dataset_final	Geography_Spain	2	0.3091	Fail
test_dataset_final	Gender_Male	2	0.3091	Fail
test_dataset_final	Exited	2	0.3091	Fail

2026-07-14 05:40:59,525 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.UniqueRows:development_data does not exist in model's document
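
The Tenure result above shows that the percentage of unique values depends on the number of rows as much as on the column itself: the same 11 distinct values fail over 2,585 rows and pass over 647 rows. A minimal sketch of that arithmetic follows. The helper `unique_value_check` and the synthetic columns are hypothetical, and using a strict comparison against the threshold is an assumption.

```python
def unique_value_check(values, min_percent_threshold=1):
    """Return (n_unique, percent_unique, passed) for one column."""
    n_unique = len(set(values))
    percent = round(100 * n_unique / len(values), 4)
    # Assumption: a column passes when its percentage exceeds the threshold.
    return n_unique, percent, percent > min_percent_threshold


# Hypothetical columns: the same 11 tenure values over 2,585 and 647 rows.
train_tenure = [i % 11 for i in range(2585)]
test_tenure = [i % 11 for i in range(647)]
train_result = unique_value_check(train_tenure)
test_result = unique_value_check(test_tenure)
```

Because the denominator is the row count, a binary or low-cardinality column falls below a 1% threshold on any dataset with more than a few hundred rows, so such failures say little about data quality on their own.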

validmind.data_validation.TabularNumericalHistograms:development_data

Tabular Numerical Histograms Development Data

The Tabular Numerical Histograms test evaluates the univariate distributions of numerical features through histograms for the development data. The result set includes plots for the train and test datasets covering CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Geography_Germany, Geography_Spain, and Gender_Male. The histograms show the relative concentration of observations across value ranges for each feature and allow direct visual comparison of distribution shape between the train and test samples.

Key insights:

CreditScore is broadly bell-shaped: CreditScore in both train and test is concentrated in the mid-range, with the highest density around roughly 600 to 700 and visibly thinner tails toward the lower and upper ends.
Balance shows a zero spike: Balance displays a pronounced mass at 0 in both train and test, alongside a separate broad concentration centered approximately in the 100k to 140k range, indicating a mixed distribution rather than a single continuous shape.
NumOfProducts is heavily concentrated at 1 and 2: NumOfProducts is discrete and strongly concentrated at values 1 and 2 in both datasets, while values 3 and especially 4 occur much less frequently.
Tenure is relatively even across categories: Tenure is distributed across integer values from 0 to 10 with broadly similar bar heights through most categories, though 0 and 10 appear somewhat lower than the mid-range categories.
EstimatedSalary is approximately uniform: EstimatedSalary is spread fairly evenly across its range in both train and test, without a dominant central peak or pronounced tail concentration.
Binary indicators are consistent across train and test: HasCrCard has more observations at 1 than 0 in both datasets, IsActiveMember appears close to balanced with a modest difference between classes, Geography_Germany and Geography_Spain each show more false than true observations, and Gender_Male is close to evenly split.

The histograms indicate that train and test distributions are visually aligned across the reviewed numerical features. The most notable structural patterns are the zero-inflated Balance distribution, the discrete concentration of NumOfProducts at lower integer values, and the class imbalance present in several binary indicator variables. Other features, including CreditScore, Tenure, EstimatedSalary, and Gender_Male, exhibit stable and comparable shapes between the two datasets.

Figures

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:20fb

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:24b7

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:0c7f

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:f1fe

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:5f4d

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:756f

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:db18

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:25b1

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:f5de

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:abb0

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:621d

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:399f

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:3657

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:ce37

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:e5b5

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:7482

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:3686

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:a596

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:2084

ValidMind Figure validmind.data_validation.TabularNumericalHistograms:development_data:e56b

2026-07-14 05:42:16,824 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.TabularNumericalHistograms:development_data does not exist in model's document

validmind.data_validation.MutualInformation:development_data

Mutual Information Development Data

The Mutual Information test evaluates the statistical dependency between each feature and the target to quantify feature relevance. The results are shown as feature-level mutual information scores for both train_dataset_final and test_dataset_final, with a minimum threshold of 0.01 indicated by a dashed reference line. In both datasets, the features are ordered by decreasing score, allowing direct comparison of which variables exceed the threshold and which contribute little or no measured mutual information relative to the target.

Key insights:

NumOfProducts is the strongest feature: NumOfProducts has the highest mutual information score in both datasets, at approximately 0.11 in train_dataset_final and 0.084 in test_dataset_final, standing clearly above all other features.
Four features exceed the threshold in each dataset: NumOfProducts, Geography_Germany, and IsActiveMember exceed the 0.01 threshold in both datasets, joined by one additional feature that differs by split. In training, that fourth feature is Balance, while in testing it is EstimatedSalary.
Geography_Germany and IsActiveMember remain consistently informative: Geography_Germany and IsActiveMember are above the threshold in both datasets, with scores around 0.039 and 0.013 in training and about 0.026 and 0.025 in testing, respectively.
Feature rankings differ between train and test: Balance is above threshold in training at roughly 0.028 but is near zero in testing, while EstimatedSalary is near zero in training but above threshold in testing at roughly 0.023. This indicates a material shift in relative feature informativeness across the two samples.
Several features show minimal or zero mutual information: CreditScore, HasCrCard, and Geography_Spain are at or near zero in both datasets. Tenure and Gender_Male remain below the 0.01 threshold in both cases, with Gender_Male in the test dataset sitting close to the threshold.

The mutual information profile is concentrated in a small subset of features, with NumOfProducts as the dominant predictor in both train and test samples. A limited group of variables consistently clears the 0.01 threshold, while several features contribute little measured dependency with the target. The comparison between training and testing results also shows noticeable shifts for Balance and EstimatedSalary, indicating that feature relevance is not fully uniform across the two datasets.

Parameters:

{
  "min_threshold": 0.01
}

Figures

ValidMind Figure validmind.data_validation.MutualInformation:development_data:66a1

ValidMind Figure validmind.data_validation.MutualInformation:development_data:d722

2026-07-14 05:43:08,441 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.MutualInformation:development_data does not exist in model's document
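
For intuition about the scores above, mutual information between two discrete variables can be computed directly from their joint and marginal frequencies. The sketch below is a plug-in estimate in nats for discrete inputs only; the estimator the test itself uses is not shown in this output, so this is not presented as its implementation. The function name and toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical toy data: one feature copies the target, one is independent.
target = [0, 0, 1, 1]
copy_of_target = [0, 0, 1, 1]
independent = [0, 1, 0, 1]
mi_copy = mutual_information(copy_of_target, target)
mi_independent = mutual_information(independent, target)
```

A feature that fully determines a balanced binary target scores log 2 (about 0.693 nats), while an independent feature scores 0, which gives a scale for reading values such as 0.11 or 0.01.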

validmind.data_validation.PearsonCorrelationMatrix:development_data

Pearson Correlation Matrix Development Data

The PearsonCorrelationMatrix test evaluates linear dependency between numerical variables in the development dataset using pairwise Pearson correlation coefficients. The result is presented as heat maps for both the train and test subsets, with coefficients ranging from -1 to 1 for each variable pair. Across both matrices, most correlations are close to zero, with a limited number of moderate positive or negative relationships visible among Balance, Geography indicators, NumOfProducts, IsActiveMember, and Exited.

Key insights:

No high pairwise correlations observed: No correlation shown in either heat map reaches the stated high-correlation threshold of 0.7 in absolute value. The largest observed magnitudes are 0.42 in the train subset and 0.37 in the test subset.
Balance and Geography_Germany show the strongest positive relationship: The highest positive correlation is between Balance and Geography_Germany, at 0.42 in train and 0.37 in test. This is the most prominent positive association in both subsets.
Geography indicators are moderately negatively related: Geography_Germany and Geography_Spain show correlations of -0.36 in train and -0.35 in test. This is the strongest negative relationship displayed in both matrices.
Exited has weak correlations with individual predictors: In the train subset, Exited correlates most with Geography_Germany (0.21), IsActiveMember (-0.17), Balance (0.16), and Gender_Male (-0.13). In the test subset, the largest associations with Exited are Geography_Germany (0.20), IsActiveMember (-0.20), Gender_Male (-0.13), and Balance (0.10).
Train and test correlation structures are broadly consistent: The main correlation patterns are similar across the two subsets, including Balance with Geography_Germany, Geography_Germany with Geography_Spain, NumOfProducts with Balance (-0.18 train, -0.17 test), and Exited with IsActiveMember (-0.17 train, -0.20 test). Differences between subsets are present but small in magnitude.

The correlation analysis indicates a largely low-dependency feature set, with no pairwise relationships approaching the high-correlation threshold defined for this test. The most material structure is confined to a small number of moderate associations, particularly those involving Balance and geography indicators. The similarity between train and test matrices shows that the observed linear dependency patterns are stable across the development split.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:97ca

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:development_data:7976

2026-07-14 05:43:28,182 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.PearsonCorrelationMatrix:development_data does not exist in model's document

validmind.data_validation.HighPearsonCorrelation:development_data

❌ High Pearson Correlation Development Data

The High Pearson Correlation test evaluates pairwise linear relationships between features to identify potentially redundant variables or concentrated dependence within the dataset. The results list the top 10 strongest correlations for both train_dataset_final and test_dataset_final, using an absolute correlation threshold of 0.3 to assign Pass or Fail outcomes. Across both datasets, most reported feature pairs are below the threshold, while two pairs exceed it in each dataset and are marked as Fail. The reported coefficients include both positive and negative relationships, with magnitudes ranging from 0.0526 to 0.4154 in absolute value.

Key insights:

Two failed pairs in each dataset: In both train_dataset_final and test_dataset_final, exactly two feature pairs exceed the 0.3 threshold. These are (Balance, Geography_Germany) and (Geography_Germany, Geography_Spain) in both datasets.
Strongest correlation is stable across splits: (Balance, Geography_Germany) is the largest observed correlation in both datasets, with coefficients of 0.4154 in training and 0.3743 in testing. This indicates a consistent positive relationship across the two samples.
Negative geography relationship also exceeds threshold: (Geography_Germany, Geography_Spain) shows correlations of -0.3606 in training and -0.3501 in testing, exceeding the threshold in both datasets. The relationship is negative and similar in magnitude across the two splits.
Remaining correlations are materially lower: All other reported feature pairs are below the threshold, with the next-largest absolute values at 0.2145 in training and 0.2031 in testing for (Geography_Germany, Exited). The rest of the listed correlations are below 0.20 in absolute value in both datasets.
Correlation pattern is broadly consistent between train and test: The same two failed pairs appear in both datasets, and several additional relationships recur with similar magnitudes, including (IsActiveMember, Exited), (Balance, NumOfProducts), (Gender_Male, Exited), and (Balance, Exited). This indicates that the strongest observed linear relationships are not isolated to one split.

The results show a limited set of feature pairs with correlations above the configured threshold, concentrated in two relationships that appear consistently in both training and test data. The strongest observed association is between Balance and Geography_Germany, followed by the negative relationship between Geography_Germany and Geography_Spain. Outside these pairs, reported correlations are lower and remain below the threshold, indicating that elevated linear dependence is localized rather than widespread among the top reported relationships.

Parameters:

{
  "max_threshold": 0.3,
  "top_n_correlations": 10
}

Tables

dataset	Columns	Coefficient	Pass/Fail
train_dataset_final	(Balance, Geography_Germany)	0.4154	Fail
train_dataset_final	(Geography_Germany, Geography_Spain)	-0.3606	Fail
train_dataset_final	(Geography_Germany, Exited)	0.2145	Pass
train_dataset_final	(Balance, NumOfProducts)	-0.1767	Pass
train_dataset_final	(IsActiveMember, Exited)	-0.1715	Pass
train_dataset_final	(Balance, Exited)	0.1553	Pass
train_dataset_final	(Balance, Geography_Spain)	-0.1378	Pass
train_dataset_final	(Gender_Male, Exited)	-0.1341	Pass
train_dataset_final	(NumOfProducts, Exited)	-0.0745	Pass
train_dataset_final	(Geography_Spain, Exited)	-0.0526	Pass
test_dataset_final	(Balance, Geography_Germany)	0.3743	Fail
test_dataset_final	(Geography_Germany, Geography_Spain)	-0.3501	Fail
test_dataset_final	(Geography_Germany, Exited)	0.2031	Pass
test_dataset_final	(IsActiveMember, Exited)	-0.1997	Pass
test_dataset_final	(Balance, NumOfProducts)	-0.1724	Pass
test_dataset_final	(Gender_Male, Exited)	-0.1348	Pass
test_dataset_final	(Balance, Geography_Spain)	-0.1304	Pass
test_dataset_final	(Tenure, EstimatedSalary)	0.1023	Pass
test_dataset_final	(Balance, Exited)	0.1014	Pass
test_dataset_final	(NumOfProducts, IsActiveMember)	0.0876	Pass

2026-07-14 05:43:38,759 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:development_data does not exist in model's document
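
To make the pass/fail logic in the table above concrete, here is a minimal sketch of a correlation threshold check in plain Python. It is not ValidMind's implementation: the helper names (`pearson`, `flag_high_correlations`) and the toy data are hypothetical, and treating a pair as a failure only when the absolute coefficient is strictly greater than `max_threshold` is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_high_correlations(columns, max_threshold=0.3, top_n_correlations=10):
    """Rank column pairs by |r| and mark pairs above max_threshold as "Fail"."""
    rows = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        rows.append((name_a, name_b, round(pearson(a, b), 4)))
    # Strongest relationships first, regardless of sign.
    rows.sort(key=lambda row: abs(row[2]), reverse=True)
    # Assumption: strictly greater than the threshold counts as a failure.
    return [
        (a, b, r, "Fail" if abs(r) > max_threshold else "Pass")
        for a, b, r in rows[:top_n_correlations]
    ]


# Hypothetical toy data, not the notebook's dataset.
toy = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],   # perfectly correlated with x
    "z": [1, -1, 0, -1, 1],  # uncorrelated with x
}
results = flag_high_correlations(toy)
for row in results:
    print(row)
```

The default arguments mirror the `max_threshold` and `top_n_correlations` values listed under Parameters, so only the strongest pairs are reported and each one is judged against the same cutoff.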

validmind.model_validation.ModelMetadata

Model Metadata

The ModelMetadata test evaluates model metadata across models to compare architecture, framework, framework version, and programming language. The results are presented as a summary table for two models, log_model_champion and rf_model, with columns for Modeling Technique, Modeling Framework, Framework Version, and Programming Language. Both rows report the same metadata values, enabling a direct comparison of implementation attributes across the models.

Key insights:

Metadata is fully aligned across models: log_model_champion and rf_model both report SKlearnModel as the Modeling Technique, sklearn as the Modeling Framework, version 1.9.0 as the Framework Version, and Python as the Programming Language.
No cross-model version differences observed: The Framework Version is 1.9.0 for both models, indicating no version variation in the reported framework metadata.
No missing fields in reported metadata: All displayed metadata fields in the summary table contain populated values for both models.

The result shows a consistent metadata profile across the two documented models. Modeling technique, framework, framework version, and programming language are identical in the reported output, and all displayed fields are populated. This indicates that the compared models are documented with the same high-level implementation metadata in the test result.

Tables

model	Modeling Technique	Modeling Framework	Framework Version	Programming Language
log_model_champion	SKlearnModel	sklearn	1.9.0	Python
rf_model	SKlearnModel	sklearn	1.9.0	Python

2026-07-14 05:43:42,967 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.ModelMetadata does not exist in model's document

validmind.model_validation.sklearn.ModelParameters

Model Parameters

The Model Parameters test extracts and displays estimator configuration values to document model setup and support reproducibility. The results present parameter tables for two models, log_model_champion and rf_model, with each row showing a parameter name and its recorded value. For log_model_champion, the output includes regularization, optimization, and convergence-related settings, while for rf_model it includes tree construction, sampling, and ensemble configuration settings.

Key insights:

Two model configurations are documented: The result contains parameter listings for log_model_champion and rf_model, indicating that the test captured configuration metadata for both a logistic regression model and a random forest model.
Logistic regression uses L1 regularization: log_model_champion is configured with penalty = l1, solver = liblinear, and C = 1, with fit_intercept = True, max_iter = 100, and tol = 0.0001 also recorded in the output.
Random forest uses 50 trees: rf_model is configured with n_estimators = 50, criterion = gini, bootstrap = True, and max_features = sqrt, identifying the ensemble size and core tree-splitting settings.
Random state is recorded for rf_model: The random forest configuration includes random_state = 42, along with warm_start = False and oob_score = False, providing explicit values for repeatability-related and sampling-related settings.
Tree growth constraints shown explicitly: The random forest parameter table records min_samples_split = 2, min_samples_leaf = 1, min_impurity_decrease = 0.0, and ccp_alpha = 0.0, showing that these threshold and pruning-related controls were explicitly captured.

The result documents the full extracted parameter set made available by the test for both model instances. The logistic regression configuration is characterized by l1 regularization with the liblinear solver and standard optimization controls, while the random forest configuration records a 50-tree ensemble with bootstrap sampling, gini splitting, and an explicit random_state value. Collectively, the output provides a structured record of model settings relevant to transparency and reproducibility.

Tables

model	Parameter	Value
log_model_champion	C	1
log_model_champion	dual	False
log_model_champion	fit_intercept	True
log_model_champion	intercept_scaling	1
log_model_champion	max_iter	100
log_model_champion	penalty	l1
log_model_champion	solver	liblinear
log_model_champion	tol	0.0001
log_model_champion	verbose	0
log_model_champion	warm_start	False
rf_model	bootstrap	True
rf_model	ccp_alpha	0.0
rf_model	criterion	gini
rf_model	max_features	sqrt
rf_model	min_impurity_decrease	0.0
rf_model	min_samples_leaf	1
rf_model	min_samples_split	2
rf_model	min_weight_fraction_leaf	0.0
rf_model	n_estimators	50
rf_model	oob_score	False
rf_model	random_state	42
rf_model	verbose	0
rf_model	warm_start	False

2026-07-14 05:43:47,542 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ModelParameters does not exist in model's document
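
If you want to cross-check a parameter table like the one above by hand, scikit-learn estimators expose their configuration through `get_params()`. The sketch below is an illustration only, not the test's actual code: the estimator is a fresh, unfitted stand-in configured to match the `n_estimators` and `random_state` values the table reports for rf_model, and dropping parameters whose value is `None` is an assumption made to mirror the rows shown.

```python
from sklearn.ensemble import RandomForestClassifier

# Stand-in estimator using the two settings the table reports for rf_model.
rf = RandomForestClassifier(n_estimators=50, random_state=42)

# get_params() returns the constructor arguments as a dict. Filtering out
# None values is an assumption made here to mirror the rows in the table.
params = {
    name: value
    for name, value in sorted(rf.get_params().items())
    if value is not None
}

for name, value in params.items():
    print(f"rf_model\t{name}\t{value}")
```

Default values can differ between scikit-learn versions, so compare the output against the framework version recorded in the model metadata.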

validmind.model_validation.sklearn.ROCCurve

ROC Curve

The ROCCurve test evaluates binary classification performance by plotting the relationship between true positive rate and false positive rate across classification thresholds and summarizing discrimination with the AUC metric. The results include ROC curves for log_model_champion on both train_dataset_final and test_dataset_final, each benchmarked against a random-classification reference line with AUC = 0.5. In both figures, the model ROC curve remains above the random baseline, and the reported AUC is 0.68 for the training dataset and 0.68 for the test dataset.

Key insights:

Train and test AUCs match at the reported precision: The reported AUC is 0.68 on both train_dataset_final and test_dataset_final, indicating the same aggregate ROC discrimination measure, to two decimal places, across the two evaluated datasets.
Performance exceeds random classification: In both plots, the ROC curve lies above the diagonal random baseline, and the AUC of 0.68 is above the reference value of 0.5 shown in the figures.
ROC shape is consistent across datasets: The training and test ROC curves show similar upward progression across false positive rate levels, with no visible divergence between the two datasets in overall curve profile.

Taken together, the ROC results show that log_model_champion achieves the same measured discrimination level on the training and test datasets, with AUC = 0.68 in both cases. The ROC curves remain consistently above the random baseline throughout the plotted range, indicating positive class separation beyond random ranking. The similarity between the two plots indicates stable ROC-based performance across the evaluated samples.

Figures

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:7ff5

ValidMind Figure validmind.model_validation.sklearn.ROCCurve:98d8

2026-07-14 05:43:58,889 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.ROCCurve does not exist in model's document
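
For readers who want to see how a ROC curve and its AUC are derived, the following is a minimal plain-Python sketch under stated assumptions. The function names (`roc_points`, `auc_trapezoid`) and the toy labels and scores are hypothetical, and the notebook's test may compute the curve differently.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points, one per distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sort by descending score so each step lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical toy labels (1 = positive class) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
auc = auc_trapezoid(points)
print(points, auc)
```

A curve that hugs the diagonal from (0, 0) to (1, 1) gives an area of 0.5, which is why that value serves as the random-classification reference in the figures.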

validmind.model_validation.sklearn.MinimumROCAUCScore

✅ Minimum ROCAUC Score

The Minimum ROC AUC Score test evaluates whether the model’s ROC AUC score meets or exceeds a predefined minimum threshold on the evaluated datasets. The results table reports ROC AUC scores for both train_dataset_final and test_dataset_final, alongside the common threshold of 0.5 and the corresponding pass/fail outcome. Observed scores are 0.6764 for the training dataset and 0.6773 for the test dataset, with both datasets recorded as passing the test.

Key insights:

Both datasets passed threshold: train_dataset_final and test_dataset_final each exceeded the minimum ROC AUC threshold of 0.5, and both are marked as Pass.
Scores are closely aligned: The ROC AUC scores are 0.6764 on train_dataset_final and 0.6773 on test_dataset_final, a difference of 0.0009 between datasets.
Test performance is consistent across datasets: The near-identical ROC AUC values across training and test datasets indicate that the measured discrimination level was stable in this evaluation.

The test result shows that the model satisfied the minimum ROC AUC requirement on both evaluated datasets. Performance was nearly unchanged between train_dataset_final and test_dataset_final, with both scores remaining above the 0.5 threshold and both outcomes recorded as passing.

Parameters:

{
  "min_threshold": 0.5
}

Tables

dataset	Score	Threshold	Pass/Fail
train_dataset_final	0.6764	0.5	Pass
test_dataset_final	0.6773	0.5	Pass

2026-07-14 05:44:05,936 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.model_validation.sklearn.MinimumROCAUCScore does not exist in model's document
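
The threshold comparison behind this result can be sketched in a few lines of plain Python. This is an illustration rather than ValidMind's implementation: `roc_auc` and `check_min_auc` are hypothetical helper names, and the AUC is computed here with the pairwise ranking definition, which may differ from the method the test uses internally.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def check_min_auc(score, min_threshold=0.5):
    """Pass when the score meets or exceeds the threshold."""
    return "Pass" if score >= min_threshold else "Fail"


# Scores as reported in the results table above.
reported = {"train_dataset_final": 0.6764, "test_dataset_final": 0.6773}
outcomes = {name: check_min_auc(score) for name, score in reported.items()}
print(outcomes)

# Hypothetical toy data to exercise roc_auc itself.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc, check_min_auc(toy_auc))
```

Because 0.5 corresponds to random ranking, a `min_threshold` of 0.5 is a permissive bar; a stricter value can be passed as the second argument to `check_min_auc` if your organization's standards call for one.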

In summary

In this final notebook, you learned how to:

Implement a custom inline test
Run and log your custom inline tests
Use external custom test providers
Run and log tests from your custom test providers
Re-run tests provided by your development team to verify that they were run and reported accurately

With our ValidMind for validation series of notebooks, you learned how to validate a record (model) end-to-end with the ValidMind Library by running through some common scenarios in a typical validation setting:

Verifying the data quality steps performed by the development team
Independently replicating the champion's results and conducting additional tests to assess performance, stability, and robustness
Setting up test inputs and a challenger for comparative analysis
Running validation tests, analyzing results, and logging artifacts to ValidMind

Next steps

Work with your validation report

Now that you've logged all your test results and verified the work done by the development team, head to the ValidMind Platform to wrap up your validation report. Continue to work on your validation report by:

Inserting additional test results: Click Link Evidence under any Evidence panel of 2. Validation in your validation report. (Learn more: Link evidence to reports)
Making qualitative edits to your test descriptions: Expand any linked evidence under Validator Evidence and click See evidence details to review and edit the ValidMind-generated test descriptions for quality and accuracy. (Learn more: Preparing validation reports)
Adding more findings: Click Link Finding to Report in any validation report section, then click + Create New Finding. (Learn more: Add and manage artifacts)
Adding risk assessment notes: Click under Risk Assessment Notes in any validation report section to access the text editor and content editing toolbar, including an option to generate a draft with AI. Once generated, edit the draft to adhere to your organization's requirements. (Learn more: Work with content blocks)
Assessing compliance: Under the Guideline for any validation report section, click Assessment and select the compliance status from the drop-down menu. (Learn more: Assign compliance assessments)
Collaborating with other stakeholders: Use the ValidMind Platform's real-time collaborative features to work seamlessly together with the rest of your organization, including developers. Propose suggested changes in the documentation, work with versioned history, and use comments to discuss specific portions of the documentation. (Learn more: Collaborate with others)

When your validation report is complete and ready for review, submit it for approval from the same ValidMind Platform where you made your edits and collaborated with the rest of your organization, ensuring transparency and a thorough validation history. (Learn more: Submit documents)

Learn more

Now that you're familiar with the basics, you can explore the following notebooks to get a deeper understanding of how the ValidMind Library assists you in streamlining validation:

Use cases

Validate an application scorecard model

Discover more learning resources

Learn more about the ValidMind Library tools we used in this notebook:

We also offer many interactive notebooks to help you use the ValidMind Library to streamline your work:

Or, visit our documentation to learn more about ValidMind.