ValidMind for development 2 — Start the development process

Learn how to use ValidMind for your end-to-end documentation process with our series of four introductory notebooks. In this second notebook, you'll run tests and investigate results, then add the results or evidence to your documentation.

You'll become familiar with the individual tests available in ValidMind, as well as how to run them and change parameters as necessary. Using ValidMind's repository of individual tests as building blocks helps you ensure that a record (model) is being built appropriately.

For a full list of out-of-the-box tests and descriptions, use the interactive ValidMind test sandbox.

Learn by doing

Our course tailor-made for developers new to ValidMind combines this series of notebooks with a more in-depth introduction to the ValidMind Platform: Developer Fundamentals

Prerequisites

In order to log test results or evidence to your documentation with this notebook, you'll need to first have:

Registered a model within the ValidMind Platform with a predefined documentation template
Installed the ValidMind Library in your local environment, allowing you to access all its features

Need help with the above steps?

Refer to the first notebook in this series: 1 — Set up the ValidMind Library

Setting up

Initialize the ValidMind Library

First, let's connect the ValidMind Library to the model we previously registered in the ValidMind Platform:

On the left sidebar that appears for your model, select Getting Started and select Development from the Document drop-down menu.
Click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Make sure the ValidMind Library is installed

%pip install -q validmind

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    document="documentation",
)

Note: you may need to restart the kernel to use updated packages.

2026-07-14 05:26:19,517 - INFO(validmind.api_client): 🎉 Connected to ValidMind!
📊 Model: [ValidMind Academy] Model development (ID: cmalgf3qi02ce199qm3rdkl46)
📁 Document Type: model_documentation
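If the connection fails, it can help to confirm that the .env file actually supplied the credentials before calling vm.init(). The sketch below is a minimal check, assuming the library reads the variables VM_API_HOST, VM_API_KEY, VM_API_SECRET, and VM_API_MODEL. Those names are an assumption here, so match them against the snippet you copied from the platform. missing_credentials is a hypothetical helper, not part of the ValidMind Library.

```python
import os

# Assumed variable names; check them against your copied code snippet.
EXPECTED_VARS = ("VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_API_MODEL")


def missing_credentials(env=None, expected=EXPECTED_VARS):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in expected if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"VM_API_HOST": "https://example.invalid", "VM_API_KEY": "key"}
print(missing_credentials(example_env))
```

Calling missing_credentials() with no arguments checks the real environment, which is useful right after the %dotenv line has run.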

Import sample dataset

Then, let's import the public Bank Customer Churn Prediction dataset from Kaggle.

In the example below, note that:

The target column, Exited, has a value of 1 when a customer has churned and 0 otherwise.
The ValidMind Library provides a wrapper to automatically load the dataset as a pandas DataFrame object. A pandas DataFrame is a two-dimensional tabular data structure that makes use of rows and columns.

from validmind.datasets.classification import customer_churn as demo_dataset

print(
    f"Loaded demo dataset with: \n\n\t• Target column: '{demo_dataset.target_column}' \n\t• Class labels: {demo_dataset.class_labels}"
)

raw_df = demo_dataset.load_data()
raw_df.head()

Loaded demo dataset with: 

    • Target column: 'Exited' 
    • Class labels: {'0': 'Did not exit', '1': 'Exited'}

	CreditScore	Geography	Gender	Age	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
0	619	France	Female	42	2	0.00	1	1	1	101348.88	1
1	608	Spain	Female	41	1	83807.86	1	0	1	112542.58	0
2	502	France	Female	42	8	159660.80	3	1	0	113931.57	1
3	699	France	Female	39	1	0.00	2	0	0	93826.63	0
4	850	Spain	Female	43	2	125510.82	1	1	1	79084.10	0

Identify qualitative tests

Next, let's say we want to do some data quality assessments by running a few individual tests.

Use the vm.tests.list_tests() function introduced in the first notebook of this series, in combination with vm.tests.list_tags() and vm.tests.list_tasks(), to find which prebuilt tests are relevant for data quality assessment:

tasks represent the kind of modeling task associated with a test. Here we'll focus on classification tasks.
tags are free-form descriptions providing more details about the test, for example, what category the test falls into. Here we'll focus on the data_quality tag.

# Get the list of available task types
sorted(vm.tests.list_tasks())

['classification',
 'clustering',
 'data_validation',
 'feature_extraction',
 'monitoring',
 'nlp',
 'regression',
 'residual_analysis',
 'text_classification',
 'text_generation',
 'text_qa',
 'text_summarization',
 'time_series_forecasting',
 'visualization']

# Get the list of available tags
sorted(vm.tests.list_tags())

['AUC',
 'analysis',
 'anomaly',
 'anomaly_detection',
 'bias_and_fairness',
 'binary_classification',
 'calibration',
 'categorical_data',
 'classification',
 'classification_metrics',
 'clustering',
 'correlation',
 'credit_risk',
 'data_analysis',
 'data_distribution',
 'data_quality',
 'data_validation',
 'descriptive_statistics',
 'dimensionality_reduction',
 'distribution',
 'embeddings',
 'feature_importance',
 'feature_selection',
 'few_shot',
 'forecasting',
 'frequency_analysis',
 'kmeans',
 'linear_regression',
 'llm',
 'logistic_regression',
 'metadata',
 'model_comparison',
 'model_diagnosis',
 'model_explainability',
 'model_interpretation',
 'model_performance',
 'model_predictions',
 'model_selection',
 'model_training',
 'model_validation',
 'multiclass_classification',
 'nlp',
 'normality',
 'numerical_data',
 'outlier',
 'outliers',
 'qualitative',
 'rag_performance',
 'ragas',
 'regression',
 'retrieval_performance',
 'scorecard',
 'seasonality',
 'senstivity_analysis',
 'sklearn',
 'stationarity',
 'statistical_test',
 'statistics',
 'statsmodels',
 'tabular_data',
 'text_data',
 'threshold_optimization',
 'time_series_data',
 'unit_root_test',
 'visualization',
 'zero_shot']

You can pass tags and tasks as parameters to the vm.tests.list_tests() function to filter the tests based on the tags and task types.

For example, to find tests related to tabular data quality for classification models, you can call list_tests() like this:

vm.tests.list_tests(task="classification", tags=["tabular_data", "data_quality"])

ID	Name	Description	Has Figure	Has Table	Required Inputs	Params	Tags	Tasks
validmind.data_validation.ClassImbalance	Class Imbalance	Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model....	True	True	['dataset']	{'min_percent_threshold': {'type': 'int', 'default': 10}}	['tabular_data', 'binary_classification', 'multiclass_classification', 'data_quality']	['classification']
validmind.data_validation.DescriptiveStatistics	Descriptive Statistics	Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's...	False	True	['dataset']	{}	['tabular_data', 'time_series_data', 'data_quality']	['classification', 'regression']
validmind.data_validation.Duplicates	Duplicates	Tests dataset for duplicate entries, ensuring model reliability via data quality verification....	False	True	['dataset']	{'min_threshold': {'type': '_empty', 'default': 1}}	['tabular_data', 'data_quality', 'text_data']	['classification', 'regression']
validmind.data_validation.HighCardinality	High Cardinality	Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting....	False	True	['dataset']	{'num_threshold': {'type': 'int', 'default': 100}, 'percent_threshold': {'type': 'float', 'default': 0.1}, 'threshold_type': {'type': 'str', 'default': 'percent'}}	['tabular_data', 'data_quality', 'categorical_data']	['classification', 'regression']
validmind.data_validation.HighPearsonCorrelation	High Pearson Correlation	Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity....	False	True	['dataset']	{'max_threshold': {'type': 'float', 'default': 0.3}, 'top_n_correlations': {'type': 'int', 'default': 10}, 'feature_columns': {'type': 'list', 'default': None}}	['tabular_data', 'data_quality', 'correlation']	['classification', 'regression']
validmind.data_validation.MissingValues	Missing Values	Evaluates dataset quality by ensuring missing value percentage across all features does not exceed a set threshold....	False	True	['dataset']	{'min_percentage_threshold': {'type': 'float', 'default': 1.0}}	['tabular_data', 'data_quality']	['classification', 'regression']
validmind.data_validation.MissingValuesBarPlot	Missing Values Bar Plot	Assesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on...	True	False	['dataset']	{'threshold': {'type': 'int', 'default': 80}, 'fig_height': {'type': 'int', 'default': 600}}	['tabular_data', 'data_quality', 'visualization']	['classification', 'regression']
validmind.data_validation.Skewness	Skewness	Evaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data...	False	True	['dataset']	{'max_threshold': {'type': '_empty', 'default': 1}}	['data_quality', 'tabular_data']	['classification', 'regression']
validmind.plots.BoxPlot	Box Plot	Generates customizable box plots for numerical features in a dataset with optional grouping using Plotly....	True	False	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'group_by': {'type': 'Optional', 'default': None}, 'width': {'type': 'int', 'default': 1800}, 'height': {'type': 'int', 'default': 1200}, 'colors': {'type': 'Optional', 'default': None}, 'show_outliers': {'type': 'bool', 'default': True}, 'title_prefix': {'type': 'str', 'default': 'Box Plot of'}}	['tabular_data', 'visualization', 'data_quality']	['classification', 'regression', 'clustering']
validmind.plots.HistogramPlot	Histogram Plot	Generates customizable histogram plots for numerical features in a dataset using Plotly....	True	False	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'bins': {'type': 'Union', 'default': 30}, 'color': {'type': 'str', 'default': 'steelblue'}, 'opacity': {'type': 'float', 'default': 0.7}, 'show_kde': {'type': 'bool', 'default': True}, 'normalize': {'type': 'bool', 'default': False}, 'log_scale': {'type': 'bool', 'default': False}, 'title_prefix': {'type': 'str', 'default': 'Histogram of'}, 'width': {'type': 'int', 'default': 1200}, 'height': {'type': 'int', 'default': 800}, 'n_cols': {'type': 'int', 'default': 2}, 'vertical_spacing': {'type': 'float', 'default': 0.15}, 'horizontal_spacing': {'type': 'float', 'default': 0.1}}	['tabular_data', 'visualization', 'data_quality']	['classification', 'regression', 'clustering']
validmind.stats.DescriptiveStats	Descriptive Stats	Provides comprehensive descriptive statistics for numerical features in a dataset....	False	True	['dataset']	{'columns': {'type': 'Optional', 'default': None}, 'include_advanced': {'type': 'bool', 'default': True}, 'confidence_level': {'type': 'float', 'default': 0.95}}	['tabular_data', 'statistics', 'data_quality']	['classification', 'regression', 'clustering']
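To make the filtering logic concrete, here is a minimal sketch over a hypothetical mini-catalog. It assumes that a test must support the requested task and carry every requested tag, which is consistent with the table above but is not taken from the library's source. filter_tests and the two example.* entries are made up for illustration.

```python
# Hypothetical mini-catalog; the real catalog lives inside the ValidMind Library.
CATALOG = [
    {
        "id": "validmind.data_validation.ClassImbalance",
        "tags": ["tabular_data", "binary_classification",
                 "multiclass_classification", "data_quality"],
        "tasks": ["classification"],
    },
    {
        "id": "validmind.data_validation.MissingValues",
        "tags": ["tabular_data", "data_quality"],
        "tasks": ["classification", "regression"],
    },
    {
        "id": "validmind.data_validation.HighPearsonCorrelation",
        "tags": ["tabular_data", "data_quality", "correlation"],
        "tasks": ["classification", "regression"],
    },
    # Made-up entries that should be filtered out
    {"id": "example.text_only_test",
     "tags": ["text_data", "data_quality"], "tasks": ["classification"]},
    {"id": "example.regression_only_test",
     "tags": ["tabular_data", "data_quality"], "tasks": ["regression"]},
]


def filter_tests(catalog, task=None, tags=None):
    """Keep entries that support `task` and carry every tag in `tags`."""
    wanted = set(tags or [])
    return [
        entry["id"]
        for entry in catalog
        if (task is None or task in entry["tasks"]) and wanted <= set(entry["tags"])
    ]


matches = filter_tests(CATALOG, task="classification", tags=["tabular_data", "data_quality"])
print(matches)
```

The text-only entry is dropped because it lacks the tabular_data tag, and the regression-only entry is dropped because it does not support the classification task.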

Want to learn more about navigating ValidMind tests?

Refer to our notebook outlining the utilities available for viewing and understanding available ValidMind tests: Explore tests

Initialize the ValidMind dataset

With the individual tests we want to run identified, the next step is to connect your data with a ValidMind Dataset object. This step is required whenever you want to connect a dataset to documentation and produce test results through ValidMind, but you only need to do it once per dataset.

Initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module. For this example, we'll pass in the following arguments:

dataset — The raw dataset that you want to provide as input to tests.
input_id — A unique identifier that allows tracking what inputs are used when running each individual test.
target_column — A required argument if tests require access to true values. This is the name of the target column in the dataset.

# vm_raw_dataset is now a VMDataset object that you can pass to any ValidMind test
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column="Exited",
)

Running tests on datasets

Now that we know how to initialize a ValidMind dataset object, we're ready to run some tests!

You run individual tests by calling the run_test function provided by the validmind.tests module. For the examples below, we'll pass in the following arguments:

test_id — The ID of the test to run, as seen in the ID column when you run list_tests.
params — A dictionary of parameters for the test. These will override any default_params set in the test definition.

Run tabular data tests

The inputs expected by a test can also be found in the test definition — let's take validmind.data_validation.DescriptiveStatistics as an example.

Note that the output of the describe_test() function below shows that this test expects a dataset as input:

vm.tests.describe_test("validmind.data_validation.DescriptiveStatistics")

▶ Test: Descriptive Statistics ('validmind.data_validation.DescriptiveStatistics')

Descriptive Statistics

Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's dataset.

Purpose

The purpose of the Descriptive Statistics metric is to provide a comprehensive summary of both numerical and categorical data within a dataset. This involves statistics such as count, mean, standard deviation, minimum and maximum values for numerical data. For categorical data, it calculates the count, number of unique values, most common value and its frequency, and the proportion of the most frequent value relative to the total. The goal is to visualize the overall distribution of the variables in the dataset, aiding in understanding the model's behavior and predicting its performance.

Test Mechanism

The testing mechanism utilizes two in-built functions of pandas dataframes: describe() for numerical fields and value_counts() for categorical fields. The describe() function pulls out several summary statistics, while value_counts() accounts for unique values. The resulting data is formatted into two distinct tables, one for numerical and another for categorical variable summaries. These tables provide a clear summary of the main characteristics of the variables, which can be instrumental in assessing the model's performance.

Signs of High Risk

Skewed data or significant outliers can represent high risk. For numerical data, this may be reflected via a significant difference between the mean and median (50% percentile).
For categorical data, a lack of diversity (low count of unique values), or overdominance of a single category (high frequency of the top value) can indicate high risk.

Strengths

Provides a comprehensive summary of the dataset, shedding light on the distribution and characteristics of the variables under consideration.
It is a versatile and robust method, applicable to both numerical and categorical data.
Helps highlight crucial anomalies such as outliers, extreme skewness, or lack of diversity, which are vital in understanding model behavior during testing and validation.

Limitations

While this metric offers a high-level overview of the data, it may fail to detect subtle correlations or complex patterns.
Does not offer any insights on the relationship between variables.
Alone, descriptive statistics cannot be used to infer properties about future unseen data.
Should be used in conjunction with other statistical tests to provide a comprehensive understanding of the model's data.

Required Inputs: dataset

How to Run:

Code:

        
import validmind as vm

# inputs dictionary maps your inputs to the expected input names
# keys are the expected input names and values are the actual inputs
# values may be string input_ids or the actual VMDataset or VMModel objects
inputs = {
    "dataset": "my_vm_dataset"
}
params = {}

# to run and view the result of this test, run the following code:
result = vm.tests.run_test(
  "validmind.data_validation.DescriptiveStatistics", inputs=inputs, params=params
)

# To see the result of the test, ensure that you have called `vm.init()` and then run:
result.log()

Now, let's run a few tests to assess the quality of the dataset:

result = vm.tests.run_test(
    test_id="validmind.data_validation.DescriptiveStatistics",
    inputs={"dataset": vm_raw_dataset},
)

Descriptive Statistics

The Descriptive Statistics test evaluates the distributional characteristics of numerical and categorical variables in the dataset. The results provide summary statistics for eight numerical variables, including counts, central tendency, dispersion, and percentile ranges, along with category frequency summaries for Geography and Gender. All reported variables have a count of 8,000 observations, and the tables show the spread of values across continuous, discrete, and binary fields as well as the concentration of the most common categorical values.

Key insights:

Complete coverage across variables: Every numerical and categorical variable reports a count of 8,000, indicating no missing observations in the fields included in this summary.
Balance shows strong asymmetry: Balance has a mean of 76,434.10 and a median of 97,264.00, with the 25th percentile at 0.00. This combination indicates substantial concentration at zero together with a broad upper range extending to 250,898.00.
Salary is broadly dispersed but centered: EstimatedSalary has a mean of 99,790.19 and a median of 99,505.00, with values spanning from 12.00 to 199,992.00 and a standard deviation of 57,520.51. The close alignment of mean and median indicates a relatively centered distribution despite the wide range.
Credit score and age are moderately spread: CreditScore ranges from 350 to 850 with mean 650.16 and median 652.00, while Age ranges from 18 to 92 with mean 38.95 and median 37.00. In both variables, mean and median are close, indicating limited central tendency distortion.
Product holdings are concentrated at low counts: NumOfProducts has a median of 1.00, a 75th percentile of 2.00, and a maximum of 4.00. The distribution is concentrated in the lower values, with most observations at one or two products.
Categorical concentration is moderate: Geography contains three unique values, with France representing 50.12% of observations, and Gender contains two unique values, with Male representing 54.95%. The top category in each field accounts for slightly more than half of the dataset.

The descriptive summary indicates a complete dataset with varied distributional profiles across variables. Several fields, including CreditScore, Age, and EstimatedSalary, show mean and median values that are closely aligned, while Balance stands out for its zero lower quartile and lower mean relative to median, reflecting a more uneven distribution. The categorical variables exhibit moderate concentration in their most frequent classes without a single category dominating the dataset to an extreme degree.

Tables

Numerical Variables

Name	Count	Mean	Std	Min	25%	50%	75%	90%	95%	Max
CreditScore	8000.0	650.1596	96.8462	350.0	583.0	652.0	717.0	778.0	813.0	850.0
Age	8000.0	38.9489	10.4590	18.0	32.0	37.0	44.0	53.0	60.0	92.0
Tenure	8000.0	5.0339	2.8853	0.0	3.0	5.0	8.0	9.0	9.0	10.0
Balance	8000.0	76434.0965	62612.2513	0.0	0.0	97264.0	128045.0	149545.0	162488.0	250898.0
NumOfProducts	8000.0	1.5325	0.5805	1.0	1.0	1.0	2.0	2.0	2.0	4.0
HasCrCard	8000.0	0.7026	0.4571	0.0	0.0	1.0	1.0	1.0	1.0	1.0
IsActiveMember	8000.0	0.5199	0.4996	0.0	0.0	1.0	1.0	1.0	1.0	1.0
EstimatedSalary	8000.0	99790.1880	57520.5089	12.0	50857.0	99505.0	149216.0	179486.0	189997.0	199992.0

Categorical Variables

Name	Count	Number of Unique Values	Top Value	Top Value Frequency	Top Value Frequency %
Geography	8000.0	3.0	France	4010.0	50.12
Gender	8000.0	2.0	Male	4396.0	54.95
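The test description above says the tables come from the pandas describe() and value_counts() functions. The sketch below reproduces that idea on the five sample rows shown earlier in this notebook. It is an approximation for illustration, not the library's implementation, and the toy_df, numeric_summary, and categorical_summary names are our own.

```python
import pandas as pd

# First five rows of the demo dataset, restricted to three columns
toy_df = pd.DataFrame(
    {
        "CreditScore": [619, 608, 502, 699, 850],
        "Balance": [0.0, 83807.86, 159660.80, 0.0, 125510.82],
        "Geography": ["France", "Spain", "France", "France", "Spain"],
    }
)

# Numerical summary, including the extra percentiles shown in the table above.
# describe() skips non-numeric columns by default.
numeric_summary = toy_df.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95]).T

# Categorical summary: unique count, top value, and its share of rows
counts = toy_df["Geography"].value_counts()
categorical_summary = {
    "unique": int(counts.size),
    "top": counts.index[0],
    "top_freq": int(counts.iloc[0]),
    "top_pct": round(100 * counts.iloc[0] / len(toy_df), 2),
}

print(numeric_summary[["count", "mean", "50%", "max"]])
print(categorical_summary)
```

Comparing the mean and the 50% column in numeric_summary is a quick way to spot the kind of asymmetry the test flags for Balance.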

result2 = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_raw_dataset},
    params={"min_percent_threshold": 30},
)

❌ Class Imbalance

The Class Imbalance test evaluates the distribution of target classes in the dataset by comparing each class frequency against a minimum percentage threshold. In this result, the target variable Exited is shown across two classes, with class 0 representing 79.80% of rows and class 1 representing 20.20% of rows. The applied threshold is 30%, and the table reports a pass/fail outcome for each class based on that cutoff. The accompanying bar chart visually reflects the same class distribution, with a substantially larger share for class 0 than for class 1.

Key insights:

Class distribution is uneven: The target classes are split 79.80% for Exited = 0 and 20.20% for Exited = 1, indicating a large difference in class representation.
Minority class fails threshold: Exited = 1 falls below the configured 30% minimum threshold and is marked as Fail in the test output.
Majority class exceeds threshold: Exited = 0 is above the 30% threshold at 79.80% and is marked as Pass.
Binary target shows concentrated majority: In the two-class target structure, most observations are concentrated in the non-exited class, as shown consistently in both the table and the plot.

The result shows that the Exited target distribution is concentrated in class 0, with class 1 materially less represented. Under the configured 30% minimum class threshold, the majority class passes and the minority class fails. Taken together, the table and plot document a binary class distribution with one underrepresented class relative to the applied test criterion.

Parameters:

{
  "min_percent_threshold": 30
}

Tables

Exited Class Imbalance

Exited	Percentage of Rows (%)	Pass/Fail
0	79.80%	Pass
1	20.20%	Fail

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:53ad

The output above shows that the validmind.data_validation.ClassImbalance test did not pass according to the value we set for min_percent_threshold.
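To see why the minority class fails, it helps to work through the arithmetic by hand. The following sketch computes each class's share of rows and compares it with the threshold. It assumes a class passes when its share is strictly greater than min_percent_threshold, which matches the output above but is our assumption about the rule. class_imbalance is a hypothetical helper, and the labels are synthetic values chosen to reproduce the reported 79.80% / 20.20% split.

```python
from collections import Counter


def class_imbalance(labels, min_percent_threshold=10):
    """Return (label, percentage, 'Pass' or 'Fail') rows, one per class."""
    counts = Counter(labels)
    total = sum(counts.values())
    rows = []
    for label in sorted(counts):
        pct = 100 * counts[label] / total
        # Assumed rule: a class passes when its share exceeds the threshold
        status = "Pass" if pct > min_percent_threshold else "Fail"
        rows.append((label, round(pct, 2), status))
    return rows


# Synthetic labels that reproduce the split reported above
labels = [0] * 798 + [1] * 202
for row in class_imbalance(labels, min_percent_threshold=30):
    print(row)
```

With the default threshold of 10 both classes would pass, which is why raising min_percent_threshold to 30 changes the outcome.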

To address this issue, we'll re-run the test on some processed data. In this case let's apply a very simple rebalancing technique to the dataset:

import pandas as pd

raw_copy_df = raw_df.sample(frac=1)  # Create a shuffled copy of the raw dataset

# Create a balanced dataset with the same number of exited and not exited customers
exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 1]
not_exited_df = raw_copy_df.loc[raw_copy_df["Exited"] == 0].sample(n=exited_df.shape[0])

balanced_raw_df = pd.concat([exited_df, not_exited_df])
balanced_raw_df = balanced_raw_df.sample(frac=1, random_state=42)
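Note that the sampling steps above are unseeded, so the selected rows differ from run to run. If you need repeatable results, one option is to seed every sampling step. The sketch below shows this on toy data. undersample_majority is a hypothetical helper, and unlike the cell above it also resets the index.

```python
import pandas as pd


def undersample_majority(df, target, seed=42):
    """Downsample every class to the size of the smallest class."""
    smallest = df[target].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data: 8 non-churned rows and 2 churned rows
toy_df = pd.DataFrame({"Age": range(30, 40), "Exited": [0] * 8 + [1] * 2})
balanced = undersample_majority(toy_df, target="Exited")
print(balanced["Exited"].value_counts().to_dict())
```

Because the seed is fixed, calling the helper twice on the same input returns the same rows in the same order.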

With this new balanced dataset, you can re-run the individual test to see if it now passes the class imbalance test requirement.

As this is technically a different dataset, remember to first initialize a new ValidMind Dataset object to pass in as input as required by run_test():

# Register the balanced data; `vm_balanced_raw_dataset` is now the dataset object of interest
vm_balanced_raw_dataset = vm.init_dataset(
    dataset=balanced_raw_df,
    input_id="balanced_raw_dataset",
    target_column="Exited",
)

# Pass the initialized `vm_balanced_raw_dataset` as input into the test run
result = vm.tests.run_test(
    test_id="validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_balanced_raw_dataset},
    params={"min_percent_threshold": 30},
)

✅ Class Imbalance

The Class Imbalance test evaluates the distribution of target classes in the dataset by measuring the share of records in each class against a defined minimum percentage threshold. In this run, the target variable Exited is summarized across two classes, with the results table and bar chart showing the proportion of rows for each class and the associated pass/fail outcome. The test was executed with a minimum class percentage threshold of 30%, and both observed classes are reported at 50.00%.

Key insights:

Classes are evenly distributed: Exited = 0 and Exited = 1 each represent 50.00% of the dataset, indicating an exact split between the two target classes.
Both classes pass the threshold: With the minimum percentage threshold set to 30%, both classes are marked as Pass because each exceeds the threshold by 20 percentage points.
No underrepresented target class observed: The test output does not identify any class below the configured minimum percentage level.

The test result shows a balanced binary target distribution for Exited, with both classes occurring at equal frequency and both passing the 30% minimum threshold. The observed class proportions indicate that the dataset used in this test does not exhibit class imbalance under the configured test criterion.

Parameters:

{
  "min_percent_threshold": 30
}

Tables

Exited Class Imbalance

Exited	Percentage of Rows (%)	Pass/Fail
0	50.00%	Pass
1	50.00%	Pass

Figures

ValidMind Figure validmind.data_validation.ClassImbalance:071d

Utilize test output

You can reuse the output from a ValidMind test in later steps, for example to remove highly correlated features. Removing highly correlated features helps make the model simpler, more stable, and easier to understand.

Below we demonstrate how to retrieve the list of features with the highest correlation coefficients and use them to reduce the final list of features for modeling.

First, we'll run validmind.data_validation.HighPearsonCorrelation on the unmodified balanced_raw_dataset we initialized previously, so that we have a baseline to compare with later runs:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)

❌ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships between features to identify potentially redundant variables or concentrations of linear dependence. The result table reports the top feature pairs ranked by Pearson correlation coefficient together with Pass/Fail status based on the configured absolute threshold of 0.3. The reported coefficients range from -0.1909 to 0.3479 across the ten listed pairs, and only one pair exceeds the threshold. The strongest reported relationship is between Age and Exited, while the remaining listed pairs are below the threshold and are marked as passing.

Key insights:

Single threshold breach identified: The pair (Age, Exited) has a Pearson correlation coefficient of 0.3479, making it the only listed relationship that exceeds the 0.3 threshold and receives a Fail result.
All other reported pairs pass: The remaining nine reported feature pairs have absolute correlation coefficients below 0.3, with Pass results across relationships including (Balance, NumOfProducts) at -0.1909 and (IsActiveMember, Exited) at -0.1898.
Correlation magnitudes are otherwise limited: Excluding (Age, Exited), the reported coefficients are relatively small in magnitude, with the largest absolute value among passing pairs equal to 0.1909.
Both positive and negative relationships appear: The listed results include positive correlations such as (Balance, Exited) at 0.1642 and negative correlations such as (CreditScore, Exited) at -0.0470, indicating mixed linear association directions across the reported pairs.

The reported correlation structure is concentrated in a single pair above the configured threshold, with Age and Exited showing the strongest observed linear association at 0.3479. All other listed pairwise relationships remain below the threshold and are materially smaller in absolute magnitude. Taken together, the results indicate that the top reported Pearson correlations are limited in scale aside from the single failing pair.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Age, Exited)	0.3479	Fail
(Balance, NumOfProducts)	-0.1909	Pass
(IsActiveMember, Exited)	-0.1898	Pass
(Balance, Exited)	0.1642	Pass
(NumOfProducts, IsActiveMember)	0.0525	Pass
(NumOfProducts, Exited)	-0.0524	Pass
(Age, Balance)	0.0492	Pass
(CreditScore, Exited)	-0.0470	Pass
(Age, NumOfProducts)	-0.0444	Pass
(Tenure, IsActiveMember)	-0.0387	Pass

The output above shows that the test did not pass according to the value we set for max_threshold.
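The pass/fail logic can be sketched in plain Python: compute the Pearson coefficient for every pair of columns, rank the pairs by absolute value, and flag any pair whose absolute coefficient exceeds the threshold. This is a simplified illustration, not the library's implementation. The strict comparison is an assumption, the pearson and high_correlation_pairs helpers are hypothetical, and the column values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def high_correlation_pairs(columns, max_threshold=0.3, top_n=10):
    """Rank column pairs by |r| and flag those above the threshold."""
    rows = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        # Assumed rule: fail when the absolute coefficient exceeds the threshold
        status = "Fail" if abs(r) > max_threshold else "Pass"
        rows.append(((a, b), round(r, 4), status))
    rows.sort(key=lambda row: abs(row[1]), reverse=True)
    return rows[:top_n]


# Made-up columns: y is exactly 2 * x, while z is only weakly related to x
columns = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [2, 5, 1, 4, 3],
}
for row in high_correlation_pairs(columns):
    print(row)
```

Only the (x, y) pair is flagged, which mirrors the single failing (Age, Exited) pair in the output above.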

corr_result is an object of type TestResult. We can inspect the result object to see what the test has produced:

print(type(corr_result))
print("Result ID: ", corr_result.result_id)
print("Params: ", corr_result.params)
print("Passed: ", corr_result.passed)
print("Tables: ", corr_result.tables)

<class 'validmind.vm_models.result.result.TestResult'>
Result ID:  validmind.data_validation.HighPearsonCorrelation
Params:  {'max_threshold': 0.3}
Passed:  False
Tables:  [ResultTable]

Let's remove the highly correlated features and create a new VM dataset object.

We'll begin by checking out the table in the result and extracting a list of features that failed the test:

# Extract table from `corr_result.tables`
features_df = corr_result.tables[0].data
features_df

	Columns	Coefficient	Pass/Fail
0	(Age, Exited)	0.3479	Fail
1	(Balance, NumOfProducts)	-0.1909	Pass
2	(IsActiveMember, Exited)	-0.1898	Pass
3	(Balance, Exited)	0.1642	Pass
4	(NumOfProducts, IsActiveMember)	0.0525	Pass
5	(NumOfProducts, Exited)	-0.0524	Pass
6	(Age, Balance)	0.0492	Pass
7	(CreditScore, Exited)	-0.0470	Pass
8	(Age, NumOfProducts)	-0.0444	Pass
9	(Tenure, IsActiveMember)	-0.0387	Pass

# Extract list of features that failed the test
high_correlation_features = features_df[features_df["Pass/Fail"] == "Fail"]["Columns"].tolist()
high_correlation_features

['(Age, Exited)']

Next, extract the first feature name from each string in the list (for example, (Age, Exited) becomes Age):

high_correlation_features = [feature.split(",")[0].strip("()") for feature in high_correlation_features]
high_correlation_features

['Age']
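The one-liner above always keeps the first name in each pair, which works here because the target column, Exited, is listed second. A slightly more defensive variant, sketched below, never selects the target for removal and avoids duplicates. It assumes the labels are strings formatted like (A, B), as in the output above, and that column names contain no commas or parentheses. features_to_drop is a hypothetical helper.

```python
def features_to_drop(pair_labels, target_column):
    """Pick one feature to drop from each failing pair label like '(A, B)'."""
    to_drop = []
    for label in pair_labels:
        first, second = (name.strip() for name in label.strip("()").split(","))
        # Never drop the target; otherwise drop the first name in the pair
        candidate = second if first == target_column else first
        if candidate != target_column and candidate not in to_drop:
            to_drop.append(candidate)
    return to_drop


print(features_to_drop(["(Age, Exited)"], target_column="Exited"))
```

For this dataset the result is the same list as above, so the rest of the notebook is unaffected.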

Now, it's time to re-initialize the dataset with the highly correlated features removed.

Note the use of a different input_id. This allows tracking the inputs used when running each individual test.

# Remove the highly correlated features from the dataset
balanced_raw_no_age_df = balanced_raw_df.drop(columns=high_correlation_features)

# Re-initialize the dataset object
vm_raw_dataset_preprocessed = vm.init_dataset(
    dataset=balanced_raw_no_age_df,
    input_id="raw_dataset_preprocessed",
    target_column="Exited",
)

With the reduced feature set, the test should now pass when we re-run it:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships between features to identify potentially redundant variables or multicollinearity. The result table reports the top feature pairs by absolute Pearson correlation coefficient using a threshold of 0.3 for pass/fail classification. Ten feature pairs are listed, with coefficients ranging from -0.1909 to 0.1642, and all reported pairs are classified as Pass. The strongest absolute relationships in the reported output are between Balance and NumOfProducts, IsActiveMember and Exited, and Balance and Exited.

Key insights:

No correlations exceed threshold: All reported feature pairs pass against the 0.3 threshold. The largest absolute coefficient is 0.1909 for Balance and NumOfProducts, which remains below the configured limit.
Observed relationships are weak: Reported coefficients are small in magnitude, spanning from -0.1909 to 0.1642. This indicates that the strongest linear associations in the displayed output are limited.
Top associations are concentrated in few pairs: The largest absolute correlations are for Balance and NumOfProducts (-0.1909), IsActiveMember and Exited (-0.1898), and Balance and Exited (0.1642). The remaining reported pairs are all closer to zero, with absolute values at or below 0.0525.
Both positive and negative correlations appear: The output includes negative and positive coefficients, with negative relationships more prominent among the strongest reported pairs. Positive correlations in the table are comparatively smaller, except for Balance and Exited at 0.1642.

The reported correlation structure shows no pairwise linear relationship exceeding the test threshold of 0.3. The strongest observed associations are weak in magnitude and are limited to a small number of feature pairs, while most listed relationships are close to zero. Based on the reported output, the test does not identify high pairwise Pearson correlation among the displayed feature combinations.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Balance, NumOfProducts)	-0.1909	Pass
(IsActiveMember, Exited)	-0.1898	Pass
(Balance, Exited)	0.1642	Pass
(NumOfProducts, IsActiveMember)	0.0525	Pass
(NumOfProducts, Exited)	-0.0524	Pass
(CreditScore, Exited)	-0.0470	Pass
(Tenure, IsActiveMember)	-0.0387	Pass
(CreditScore, IsActiveMember)	0.0384	Pass
(Tenure, HasCrCard)	0.0308	Pass
(HasCrCard, IsActiveMember)	-0.0276	Pass
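
To make the pass/fail logic above concrete, here is a minimal pandas sketch of the kind of check this test performs: rank feature pairs by absolute Pearson correlation and flag any pair whose magnitude reaches the threshold. It runs on synthetic data and is not ValidMind's implementation. The helper name `high_correlation_pairs`, the toy columns `a`, `b`, and `c`, and the strict `<` comparison against `max_threshold` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, max_threshold=0.3, top_n=10):
    """Rank feature pairs by |Pearson r| and flag pairs at or above the threshold.

    Hypothetical helper for illustration only, not part of the ValidMind Library.
    """
    corr = df.select_dtypes("number").corr(method="pearson")
    cols = list(corr.columns)
    rows = []
    # Visit each unordered pair once (upper triangle of the matrix).
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            coef = float(corr.iloc[i, j])
            rows.append(
                {
                    "Columns": f"({cols[i]}, {cols[j]})",
                    "Coefficient": round(coef, 4),
                    # Assumption: a pair passes when |r| is strictly below the threshold.
                    "Pass/Fail": "Pass" if abs(coef) < max_threshold else "Fail",
                }
            )
    table = pd.DataFrame(rows)
    table = table.sort_values("Coefficient", key=lambda s: s.abs(), ascending=False)
    return table.head(top_n).reset_index(drop=True)


# Synthetic data: `b` is almost a copy of `a`; `c` is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "a": a,
        "b": a + 0.1 * rng.normal(size=500),
        "c": rng.normal(size=500),
    }
)

table = high_correlation_pairs(toy, max_threshold=0.3)
print(table)
```

In this toy example only the near-duplicate pair fails, which mirrors the effect of removing a highly correlated column from the dataset before re-running the test.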

You can also plot the correlation matrix to visualize the updated correlations between features:

corr_result = vm.tests.run_test(
    test_id="validmind.data_validation.PearsonCorrelationMatrix",
    inputs={"dataset": vm_raw_dataset_preprocessed},
)

Pearson Correlation Matrix

The PearsonCorrelationMatrix test evaluates linear dependency among numerical variables using pairwise Pearson correlation coefficients. The result is presented as a symmetric heat map covering CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, with coefficients ranging from -1 to 1 and 1.0 values along the diagonal. Off-diagonal values are generally close to zero, and the largest observed pairwise relationships in magnitude are visible between Balance and NumOfProducts, IsActiveMember and Exited, and Balance and Exited.

Key insights:

Correlations are broadly weak: Most off-diagonal correlation coefficients are near zero, indicating limited linear dependency across the variables included in the matrix.
No high-correlation pairs observed: The largest absolute correlations shown are -0.19 between Balance and NumOfProducts, -0.19 between IsActiveMember and Exited, and 0.16 between Balance and Exited. All observed values remain well below the 0.7 threshold referenced in the test description.
Exited has limited linear association: Exited shows weak relationships with the other variables, including 0.16 with Balance, -0.19 with IsActiveMember, -0.05 with CreditScore, -0.05 with NumOfProducts, -0.01 with Tenure, 0.0 with HasCrCard, and -0.01 with EstimatedSalary.
Feature-to-feature relationships are minimal: Among predictors, the most notable associations are -0.19 between Balance and NumOfProducts, 0.05 between NumOfProducts and IsActiveMember, 0.04 between CreditScore and IsActiveMember, and 0.03 between Tenure and HasCrCard or EstimatedSalary, with the remainder effectively negligible.

The correlation structure is sparse and dominated by weak linear relationships. No variable pair exhibits a strong positive or negative Pearson correlation, and the largest observed magnitudes are limited to 0.19 in absolute value. Collectively, the results indicate low linear redundancy among the numerical variables represented in the matrix.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:e9ad

Documenting test results

Now that we've analyzed two different datasets, we can use ValidMind to document why we made specific changes to our raw data, with test results as supporting evidence.

Every test result returned by the run_test() function has a .log() method that can be used to send the test results to the ValidMind Platform:

When using run_documentation_tests(), documentation sections will be automatically populated with the results of all tests registered in the documentation template.
When logging individual test results to the platform, you'll need to manually add those results to the desired section of the documentation.

To demonstrate how to add test results to your documentation, we'll populate the entire Data Preparation section of the documentation using the clean vm_raw_dataset_preprocessed dataset as input, and then document an additional individual result for the highly correlated dataset vm_balanced_raw_dataset.

Run and log multiple tests

run_documentation_tests() allows you to run multiple tests at once and automatically log the results to your documentation. Below, we'll run the tests using the previously initialized vm_raw_dataset_preprocessed as input — this will populate the entire Data Preparation section for every test that is part of the documentation template.

For this example, we'll pass in the following arguments:

inputs: Any inputs to be passed to the tests.
config: A dictionary <test_id>:<test_config> that allows configuring each test individually. Each test config requires the following:
- params: Individual test parameters.
- inputs: Individual test inputs. This overrides any inputs passed from the run_documentation_tests() function.

When including explicit configuration for individual tests, you'll need to specify the inputs even if they mirror what is included in your global configuration.

# Individual test config with inputs specified
test_config = {
    "validmind.data_validation.ClassImbalance": {
        "params": {"min_percent_threshold": 30},
        "inputs": {"dataset": vm_raw_dataset_preprocessed},
    },
    "validmind.data_validation.HighPearsonCorrelation": {
        "params": {"max_threshold": 0.3},
        "inputs": {"dataset": vm_raw_dataset_preprocessed},
    },
}

# Global test config
tests_suite = vm.run_documentation_tests(
    inputs={
        "dataset": vm_raw_dataset_preprocessed,
    },
    config=test_config,
    section=["data_preparation"],
)

Test suite complete!

26/26 (100.0%)

Test Suite Results: Binary Classification V2

Check out the updated documentation on ValidMind.

Template for binary classification models.

▶ Data Preparation

▶ Test Result: Dataset Description (validmind.data_validation.DatasetDescription)

Dataset Description

The Dataset Description test evaluates the structure, completeness, and column-level characteristics of the dataset used by the model. The results summarize 10 columns across numeric and categorical types, reporting count, missingness, and distinct-value statistics for each field. All columns contain 3,232 observations, with no missing values recorded in any column. Distinct-value counts vary substantially across variables, from 2-category binary fields to fully unique numeric values in EstimatedSalary.

Key insights:

No missing values observed: Every column reports a count of 3,232 with 0 missing values and 0.0% missingness, indicating complete population of all fields included in the summary.
Mixed numeric and categorical structure: The dataset contains five numeric variables (CreditScore, Tenure, Balance, NumOfProducts, EstimatedSalary) and five categorical variables (Geography, Gender, HasCrCard, IsActiveMember, Exited), reflecting a combination of continuous, discrete, and binary inputs.
EstimatedSalary is fully unique: EstimatedSalary has 3,232 distinct values out of 3,232 observations, corresponding to a distinct ratio of 1.0, making it the only field with a fully unique value per record in this sample.
Balance shows high cardinality: Balance has 2,181 distinct values, representing 67.48% of observations, which is substantially higher than other numeric variables aside from EstimatedSalary.
Several variables have very low cardinality: Gender, HasCrCard, IsActiveMember, and Exited each contain 2 distinct values, while Geography contains 3 and NumOfProducts contains 4, indicating multiple binary or limited-category fields.
Tenure and CreditScore differ in granularity: Tenure has 11 distinct values, whereas CreditScore has 437 distinct values, showing that numeric fields vary materially in the level of value granularity represented in the dataset.

The dataset summary indicates a fully populated input table with no observed missingness across all 10 columns. The column set combines low-cardinality categorical fields with numeric variables that range from limited discrete values (Tenure, NumOfProducts) to high-cardinality and fully unique values (Balance, EstimatedSalary). Overall, the result documents a dataset with complete records and substantial variation in feature cardinality across variables.

Tables

Dataset Description

Name	Type	Count	Distinct	Distinct %
CreditScore	Numeric	3232.0	437	0.1352
Geography	Categorical	3232.0	3	0.0009
Gender	Categorical	3232.0	2	0.0006
Tenure	Numeric	3232.0	11	0.0034
Balance	Numeric	3232.0	2181	0.6748
NumOfProducts	Numeric	3232.0	4	0.0012
HasCrCard	Categorical	3232.0	2	0.0006
IsActiveMember	Categorical	3232.0	2	0.0006
EstimatedSalary	Numeric	3232.0	3232	1.0000
Exited	Categorical	3232.0	2	0.0006

▶ Test Result: Class Imbalance (validmind.data_validation.ClassImbalance)

▶ Test Result: Duplicates (validmind.data_validation.Duplicates)

Number of Duplicates	Percentage of Rows (%)
0	0.0

▶ Test Result: High Cardinality (validmind.data_validation.HighCardinality)

Column	Number of Distinct Values	Percentage of Distinct Values (%)	Pass/Fail
Geography	3	0.0928	Pass
Gender	2	0.0619	Pass

▶ Test Result: Missing Values (validmind.data_validation.MissingValues)

Column	Number of Missing Values	Percentage of Missing Values (%)	Pass/Fail
CreditScore	0	0.0	Pass
Geography	0	0.0	Pass
Gender	0	0.0	Pass
Tenure	0	0.0	Pass
Balance	0	0.0	Pass
NumOfProducts	0	0.0	Pass
HasCrCard	0	0.0	Pass
IsActiveMember	0	0.0	Pass
EstimatedSalary	0	0.0	Pass
Exited	0	0.0	Pass

▶ Test Result: Skewness (validmind.data_validation.Skewness)

❌ Skewness

The Skewness test evaluates the asymmetry of numerical feature distributions by comparing each column’s skewness against the maximum threshold of 1. The results table reports skewness values and pass/fail outcomes for eight numeric columns in the dataset. Seven columns are marked as passing the threshold, while one column is marked as failing. Reported skewness values range from -0.8408 to 1.2399.

Key insights:

Single threshold breach observed: NumOfProducts is the only column that fails the test, with skewness of 1.2399, exceeding the threshold of 1.
Most variables show limited skewness: CreditScore (-0.0758), Tenure (0.0402), Balance (-0.2519), IsActiveMember (0.144), EstimatedSalary (-0.0315), and Exited (0.0) all remain close to zero and pass the test.
Largest negative skew still passes: HasCrCard records the most negative skewness at -0.8408 and remains within the acceptance threshold.
Skewness profile is mixed in direction: The dataset contains both negatively and positively skewed variables, with negative skewness observed in CreditScore, Balance, HasCrCard, and EstimatedSalary, and positive skewness observed in Tenure, NumOfProducts, and IsActiveMember.

Overall, the skewness results indicate that distributional asymmetry is limited across most numeric columns based on the defined threshold. The only material exception is NumOfProducts, which exceeds the threshold and stands apart from the remaining variables. The rest of the dataset shows skewness values that remain within the test’s acceptance range, including one exact zero value for Exited.

Tables

Skewness Results for Dataset

Column	Skewness	Pass/Fail
CreditScore	-0.0758	Pass
Tenure	0.0402	Pass
Balance	-0.2519	Pass
NumOfProducts	1.2399	Fail
HasCrCard	-0.8408	Pass
IsActiveMember	0.1440	Pass
EstimatedSalary	-0.0315	Pass
Exited	0.0000	Pass
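
As a rough illustration of the comparison described in this result, the sketch below computes per-column sample skewness with pandas and checks its magnitude against a threshold of 1. It uses synthetic data and is not ValidMind's implementation. The helper name `skewness_table`, the toy column names, and the use of the absolute value with a strict `<` comparison are assumptions.

```python
import numpy as np
import pandas as pd


def skewness_table(df, max_threshold=1.0):
    """Compute per-column sample skewness and compare its magnitude to a threshold.

    Hypothetical helper for illustration only, not part of the ValidMind Library.
    """
    skew = df.select_dtypes("number").skew()
    return pd.DataFrame(
        {
            "Column": skew.index,
            "Skewness": skew.round(4).to_numpy(),
            # Assumption: a column passes when |skewness| is strictly below the threshold.
            "Pass/Fail": np.where(skew.abs() < max_threshold, "Pass", "Fail"),
        }
    )


rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "symmetric": rng.normal(size=2000),           # roughly symmetric
        "right_skewed": rng.exponential(size=2000),   # long right tail
        "balanced_flag": [0, 1] * 1000,               # 50/50 binary column
    }
)

skew_table = skewness_table(toy)
print(skew_table)
```

A binary column with an exact 50/50 split has zero skewness, which is consistent with the 0.0000 value reported for Exited in the balanced dataset.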

▶ Test Result: Unique Rows (validmind.data_validation.UniqueRows)

❌ Unique Rows

The UniqueRows test evaluates dataset diversity by comparing the number and percentage of unique values in each column against a prescribed minimum threshold. The results are reported at the column level, showing the count of distinct values, the corresponding percentage of unique values, and a pass/fail outcome for each variable. In this run, 3 columns passed the test and 7 columns failed. Reported uniqueness percentages range from 0.0619% to 100.0% across the evaluated fields.

Key insights:

Uniqueness is concentrated in few columns: EstimatedSalary shows 3,232 unique values and 100.0% uniqueness, Balance shows 2,181 unique values and 67.4814%, and CreditScore shows 437 unique values and 13.521%; these are the only columns that passed the test.
Most columns have very low uniqueness: Seven columns failed, with uniqueness percentages below 0.5% in each case: Geography (0.0928%), Gender (0.0619%), Tenure (0.3403%), NumOfProducts (0.1238%), HasCrCard (0.0619%), IsActiveMember (0.0619%), and Exited (0.0619%).
Several variables are binary: Gender, HasCrCard, IsActiveMember, and Exited each contain 2 unique values, corresponding to 0.0619% uniqueness, and all are marked as failed.
Categorical cardinality varies materially across fields: Among failed columns, distinct value counts range from 2 to 11, with Geography containing 3 unique values, NumOfProducts 4, and Tenure 11.

The test results show a mixed uniqueness profile across the dataset, with high uniqueness limited to EstimatedSalary, Balance, and CreditScore, and low uniqueness across the remaining variables. The failed columns are characterized by small numbers of distinct values and correspondingly low uniqueness percentages. Overall, the observed diversity is unevenly distributed across features, with most variables exhibiting limited distinctness under the applied threshold.

Tables

Column	Number of Unique Values	Percentage of Unique Values (%)	Pass/Fail
CreditScore	437	13.5210	Pass
Geography	3	0.0928	Fail
Gender	2	0.0619	Fail
Tenure	11	0.3403	Fail
Balance	2181	67.4814	Pass
NumOfProducts	4	0.1238	Fail
HasCrCard	2	0.0619	Fail
IsActiveMember	2	0.0619	Fail
EstimatedSalary	3232	100.0000	Pass
Exited	2	0.0619	Fail
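
The percentages in this table can be reproduced by dividing each distinct-value count by the 3,232 rows in the dataset. The short sketch below redoes that arithmetic with the counts reported above. The output does not state the minimum threshold, so the 1% value used here is an assumption chosen only because it is consistent with the reported Pass/Fail column, as is the strict `>` comparison.

```python
# Distinct-value counts copied from the result table above.
n_rows = 3232
distinct_counts = {
    "CreditScore": 437,
    "Geography": 3,
    "Gender": 2,
    "Tenure": 11,
    "Balance": 2181,
    "NumOfProducts": 4,
    "HasCrCard": 2,
    "IsActiveMember": 2,
    "EstimatedSalary": 3232,
    "Exited": 2,
}

# Assumption: the threshold is not shown in the output; 1% is illustrative.
min_percent_threshold = 1.0

unique_pct = {col: round(100 * n / n_rows, 4) for col, n in distinct_counts.items()}
passed = [col for col, pct in unique_pct.items() if pct > min_percent_threshold]

for col, pct in unique_pct.items():
    status = "Pass" if col in passed else "Fail"
    print(f"{col:<16} {pct:>9.4f}  {status}")
```

Because the check is applied per column, low-cardinality fields such as binary flags will fail it by construction, regardless of data quality.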

▶ Test Result: Too Many Zero Values (validmind.data_validation.TooManyZeroValues)

❌ Too Many Zero Values

The TooManyZeroValues test evaluates numerical columns for excessive concentrations of zero values relative to a defined threshold. The results table reports four numerical variables—Tenure, Balance, HasCrCard, and IsActiveMember—along with row count, zero-value count, zero-value percentage, and pass/fail status. All four variables have 3,232 observations and are marked as Fail, with zero-value percentages ranging from 4.0842% to 53.5891%.

Key insights:

All evaluated variables failed: Each of the four reported numerical variables exceeded the configured zero-value threshold and received a Fail status.
IsActiveMember has the highest zero share: IsActiveMember contains 1,732 zero values out of 3,232 rows, corresponding to 53.5891%, which is the largest zero concentration among the reported variables.
Balance and HasCrCard show substantial zeros: Balance has 1,052 zero values (32.5495%) and HasCrCard has 990 zero values (30.6312%), indicating that roughly one-third of observations are zero in both variables.
Tenure has the lowest zero concentration: Tenure records 132 zero values, equivalent to 4.0842%, making it the lowest among the failed variables but still above the test threshold.

The test result shows that every numerical variable reported in this output exceeded the zero-value threshold. Zero concentration varies materially across variables, with the highest levels observed in IsActiveMember and the lowest in Tenure. Collectively, the result indicates that zero values are present at nontrivial levels across all evaluated numerical fields in this dataset slice.

Tables

Variable	Row Count	Number of Zero Values	Percentage of Zero Values (%)	Pass/Fail
Tenure	3232	132	4.0842	Fail
Balance	3232	1052	32.5495	Fail
HasCrCard	3232	990	30.6312	Fail
IsActiveMember	3232	1732	53.5891	Fail
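
The check behind this table amounts to counting zeros per numeric column and comparing the share against a maximum percentage. The sketch below shows that logic on a small toy DataFrame. It is not ValidMind's implementation: the helper name `zero_value_table`, the toy columns, and the 3% threshold are assumptions (the output above does not state the threshold, only that 4.0842% exceeds it).

```python
import pandas as pd


def zero_value_table(df, max_percent_threshold=3.0):
    """Report the share of zero values per numeric column.

    Hypothetical helper for illustration only, not part of the ValidMind Library.
    """
    numeric = df.select_dtypes("number")
    rows = []
    for col in numeric.columns:
        n_zero = int((numeric[col] == 0).sum())
        pct = 100 * n_zero / len(numeric)
        rows.append(
            {
                "Variable": col,
                "Row Count": len(numeric),
                "Number of Zero Values": n_zero,
                "Percentage of Zero Values (%)": round(pct, 4),
                # Assumption: a column fails when its zero share exceeds the threshold.
                "Pass/Fail": "Fail" if pct > max_percent_threshold else "Pass",
            }
        )
    return pd.DataFrame(rows)


toy = pd.DataFrame(
    {
        "tenure": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        "balance": [0.0, 0.0, 0.0, 1200.5, 880.0, 0.0, 43.2, 0.0, 910.0, 15.0],
        "credit_score": [610, 702, 655, 590, 720, 688, 640, 705, 599, 731],
    }
)

zero_table = zero_value_table(toy)
print(zero_table)
```

Note that binary indicator columns naturally contain many zeros, so a failure for fields like HasCrCard or IsActiveMember reflects how the column is encoded rather than a data defect.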

▶ Test Result: IQR Outliers Table (validmind.data_validation.IQROutliersTable)

IQR Outliers Table

The IQROutliersTable test identifies and summarizes outliers in numerical features using the interquartile range method. The results table reports the variables with detected outliers, the total outlier count for each variable, the mean value of the full variable, and summary statistics for the outlier values. In this result, outliers were detected for CreditScore and NumOfProducts, with counts of 9 and 44 respectively. The reported outlier summaries show the range and concentration of the detected values within each feature.

Key insights:

Outliers are concentrated in two features: Outliers were identified only for CreditScore and NumOfProducts in the reported results, indicating that the detected IQR exceptions are limited to these variables.
NumOfProducts has the highest outlier count: NumOfProducts has 44 outliers, compared with 9 for CreditScore, making it the feature with the larger concentration of detected outlier observations.
NumOfProducts outliers are uniform: All reported outlier summary values for NumOfProducts are 4, with the minimum, quartiles, median, and maximum all equal to 4, indicating that every detected outlier in this feature takes the same value.
CreditScore outliers are low-valued: For CreditScore, the outlier values range from 350 to 373, with the 25th percentile and median both at 350 and the 75th percentile at 365, showing that the detected outliers are concentrated in the lower end of the variable’s observed outlier range.
Outlier values are separated from variable means: The mean values reported for the full variables are 649.9081 for CreditScore and 1.5074 for NumOfProducts, while the detected outlier values fall in the ranges 350–373 and exactly 4 respectively.

The IQR outlier assessment shows that detected outliers are limited to two numerical features, with a notably higher count in NumOfProducts than in CreditScore. The NumOfProducts outliers are entirely concentrated at a single value, while CreditScore outliers occupy a narrow low-value range. Together, these results indicate a localized outlier pattern rather than a broad distribution of IQR exceptions across the reported features.

Tables

Summary of Outliers Detected by IQR Method

Variable	Total Count of Outliers	Mean Value of Variable	Minimum Outlier Value	Outlier Value at 25th Percentile	Outlier Value at 50th Percentile	Outlier Value at 75th Percentile	Maximum Outlier Value
CreditScore	9	649.9081	350	350.0	350.0	365.0	373
NumOfProducts	44	1.5074	4	4.0	4.0	4.0	4
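
The outlier values above can be sanity-checked against the quartiles reported in the Descriptive Statistics result for the same dataset (CreditScore: 581.0 and 718.0; NumOfProducts: 1.0 and 2.0). The sketch below computes the IQR fences from those quartiles. The 1.5 multiplier is the conventional choice for the IQR rule and is an assumption here, since the output does not state the value used; `iqr_fences` is a hypothetical helper name.

```python
def iqr_fences(q1, q3, threshold=1.5):
    """Return the (lower, upper) fences of the IQR outlier rule.

    Values below the lower fence or above the upper fence count as outliers.
    Assumption: threshold=1.5 is the conventional multiplier, not a value
    confirmed by the test output.
    """
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


# Quartiles (25% and 75%) taken from the Descriptive Statistics table.
credit_low, credit_high = iqr_fences(581.0, 718.0)
products_low, products_high = iqr_fences(1.0, 2.0)

print(f"CreditScore fences:   ({credit_low}, {credit_high})")      # (375.5, 923.5)
print(f"NumOfProducts fences: ({products_low}, {products_high})")  # (-0.5, 3.5)
```

Under this assumption, the reported CreditScore outliers (350 to 373) fall below the lower fence of 375.5, and the only NumOfProducts value above the upper fence of 3.5 is 4, which matches the table.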

▶ Test Result: IQR Outliers Bar Plot (validmind.data_validation.IQROutliersBarPlot)

IQR Outliers Bar Plot

The IQROutliersBarPlot test evaluates the distribution of IQR-based outliers across percentile bands for numeric features. The results shown cover the CreditScore and NumOfProducts variables, with bar heights representing the count of outliers identified within each percentile interval. For CreditScore, outliers appear in the 50–75 and 75–100 percentile bands, while for NumOfProducts, outliers are concentrated entirely in the 75–100 percentile band.

Key insights:

CreditScore outliers are upper-half concentrated: CreditScore shows no outliers in the 0–25 or 25–50 percentile bands, with 6 outliers in the 50–75 band and 3 outliers in the 75–100 band. The observed outlier pattern is therefore limited to the upper half of the distribution.
NumOfProducts outliers are isolated at the top end: NumOfProducts shows 0 outliers in the 0–25, 25–50, and 50–75 percentile bands, and 44 outliers in the 75–100 percentile band. All detected outliers for this feature are concentrated in the highest percentile range.
Outlier concentration differs materially by feature: The total observed outlier count is 9 for CreditScore and 44 for NumOfProducts. In addition, CreditScore outliers are split across two upper percentile bands, whereas NumOfProducts outliers are confined to a single band.

The results indicate that detected outliers are not distributed uniformly across the examined features or percentile ranges. CreditScore exhibits a moderate number of outliers confined to the upper half of the distribution, while NumOfProducts shows a larger concentration entirely within the highest percentile band. Across both displayed features, no outliers are present in the lower two percentile intervals.

Figures

ValidMind Figure validmind.data_validation.IQROutliersBarPlot:16bb

ValidMind Figure validmind.data_validation.IQROutliersBarPlot:673c

▶ Test Result: Descriptive Statistics (validmind.data_validation.DescriptiveStatistics)

Descriptive Statistics

The Descriptive Statistics test evaluates the distributional characteristics of numerical and categorical variables in the dataset. The results provide summary statistics for seven numerical variables and frequency-based summaries for two categorical variables, each with 3,232 observations. The numerical table reports central tendency, dispersion, and percentile ranges, while the categorical table reports the number of unique values, the most frequent category, and its share of the sample. Together, these tables describe the overall spread, concentration, and category composition of the input data.

Key insights:

Complete coverage across variables: All reported numerical and categorical variables have a count of 3,232, indicating that the summarized fields were populated for the full set of observations included in this result.
Balance shows the widest dispersion: Balance has a mean of 80,951.1917, a median of 102,780.0, a standard deviation of 61,514.5938, and a minimum of 0.0. The gap between the 25th percentile at 0.0 and the median at 102,780.0 indicates substantial concentration at zero alongside a broad positive range up to 250,898.0.
EstimatedSalary is broadly distributed: EstimatedSalary spans from 12.0 to 199,953.0, with a mean of 100,717.0003 and median of 102,083.0. Its quartiles at 51,468.0 and 150,672.0, together with a standard deviation of 58,057.4116, show a wide spread across the variable range.
CreditScore is centered near its median: CreditScore has a mean of 649.9081 and a median of 651.0, with quartiles of 581.0 and 718.0. The close alignment of mean and median indicates limited asymmetry in central tendency within the reported summary.
Product holdings are concentrated at low counts: NumOfProducts has a median of 1.0, a mean of 1.5074, a 75th percentile of 2.0, and a maximum of 4.0. This indicates that most observations are concentrated in the lower product-count values.
Binary indicators show uneven class balance: HasCrCard has a mean of 0.6937, indicating that the value 1 accounts for approximately 69.37% of observations, while IsActiveMember has a mean of 0.4641, indicating a more even split with the value 1 representing approximately 46.41% of observations.
Categorical concentration is moderate: Geography contains 3 unique values, with France as the top category at 1,475 observations or 45.64% of the sample. Gender contains 2 unique values, with Male as the top category at 1,664 observations or 51.49%, showing limited dominance by the most frequent category in both fields.

The descriptive statistics indicate full observed coverage across the reported variables and a mix of distributional patterns across inputs. The most pronounced numerical concentration appears in Balance, where zero values are present through the 25th percentile despite a substantially higher median and upper-tail values, while EstimatedSalary and CreditScore show broad but more centrally anchored distributions. Categorical variables show limited concentration in their top categories, and the binary indicators differ in balance, with HasCrCard more concentrated in the positive class than IsActiveMember.

Tables

Numerical Variables

Name	Count	Mean	Std	Min	25%	50%	75%	90%	95%	Max
CreditScore	3232.0	649.9081	99.2814	350.0	581.0	651.0	718.0	781.0	821.0	850.0
Tenure	3232.0	4.9873	2.8875	0.0	3.0	5.0	8.0	9.0	9.0	10.0
Balance	3232.0	80951.1917	61514.5938	0.0	0.0	102780.0	128814.0	149706.0	162678.0	250898.0
NumOfProducts	3232.0	1.5074	0.6726	1.0	1.0	1.0	2.0	2.0	3.0	4.0
HasCrCard	3232.0	0.6937	0.4610	0.0	0.0	1.0	1.0	1.0	1.0	1.0
IsActiveMember	3232.0	0.4641	0.4988	0.0	0.0	0.0	1.0	1.0	1.0	1.0
EstimatedSalary	3232.0	100717.0003	58057.4116	12.0	51468.0	102083.0	150672.0	180193.0	189978.0	199953.0

Categorical Variables

Name	Count	Number of Unique Values	Top Value	Top Value Frequency	Top Value Frequency %
Geography	3232.0	3.0	France	1475.0	45.64
Gender	3232.0	2.0	Male	1664.0	51.49

▶ Test Result: Pearson Correlation Matrix (validmind.data_validation.PearsonCorrelationMatrix)

Pearson Correlation Matrix

The PearsonCorrelationMatrix test evaluates linear dependency among numerical variables using pairwise Pearson correlation coefficients. The result is presented as a symmetric heat map across CreditScore, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and Exited, with coefficients ranging from -1 to 1 and diagonal values of 1.0. Off-diagonal correlations are generally close to zero, with the largest observed magnitudes at -0.19, 0.16, and -0.05 across selected variable pairs.

Key insights:

Correlations are uniformly weak: All off-diagonal Pearson correlation coefficients are small in magnitude, with the observed range approximately from -0.19 to 0.16. No pair approaches the 0.7 absolute-value threshold described in the test methodology.
Largest negative relationship is limited: The most negative correlation shown is -0.19 between Balance and NumOfProducts, with another -0.19 correlation between IsActiveMember and Exited. These values indicate only weak inverse linear relationships.
Largest positive relationship remains modest: The strongest positive correlation involving the target is 0.16 between Balance and Exited. Other positive relationships, such as 0.05 between NumOfProducts and IsActiveMember and 0.04 between CreditScore and IsActiveMember, are weaker still.
EstimatedSalary is largely independent of other variables: EstimatedSalary shows near-zero correlations with the remaining variables, including -0.02 with CreditScore, 0.03 with Tenure, 0.02 with Balance, approximately 0.00 with NumOfProducts, -0.02 with HasCrCard, -0.01 with IsActiveMember, and -0.01 with Exited.

The correlation matrix shows a broadly low-correlation structure across the numerical variables in scope. The largest pairwise relationships are weak and remain well below the high-correlation threshold stated in the test description. Collectively, the results indicate limited linear dependency among the variables represented in the heat map.

Figures

ValidMind Figure validmind.data_validation.PearsonCorrelationMatrix:3352

▶ Test Result: High Pearson Correlation (validmind.data_validation.HighPearsonCorrelation)

✅ High Pearson Correlation

The High Pearson Correlation test evaluates pairwise linear relationships among features to identify potentially redundant or highly collinear variable pairs. The result table lists the top reported feature pairs with their Pearson correlation coefficients and Pass/Fail status under a maximum threshold of 0.3. In this run, all reported coefficients are below the threshold in absolute value, with values ranging from -0.1909 to 0.1642. The table includes both positive and negative relationships across the reported feature pairs, and each pair is marked as Pass.

Key insights:

No threshold breaches observed: All reported feature pairs pass the test against the 0.3 threshold. The largest absolute coefficient is 0.1909 for the Balance and NumOfProducts pair, which remains below the configured limit.
Observed relationships are weak: The reported coefficients are clustered close to zero, ranging from -0.1909 to 0.1642. This indicates limited linear association among the top-ranked pairs shown in the output.
Strongest negative pair is Balance and NumOfProducts: The most pronounced negative relationship in the table is between Balance and NumOfProducts at -0.1909. No other reported negative coefficient exceeds this magnitude.
Strongest positive pair is Balance and Exited: The largest positive coefficient reported is 0.1642 for Balance and Exited. The remaining positive correlations are smaller, with the next highest positive value at 0.0525 for NumOfProducts and IsActiveMember.

The reported correlation structure does not show any feature pair exceeding the configured Pearson correlation threshold. Across the top reported pairs, linear relationships are weak in magnitude and all results receive a Pass designation. Based on the pairs shown in the output, the test result indicates limited evidence of high pairwise linear dependence within the reported features.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Balance, NumOfProducts)	-0.1909	Pass
(IsActiveMember, Exited)	-0.1898	Pass
(Balance, Exited)	0.1642	Pass
(NumOfProducts, IsActiveMember)	0.0525	Pass
(NumOfProducts, Exited)	-0.0524	Pass
(CreditScore, Exited)	-0.0470	Pass
(Tenure, IsActiveMember)	-0.0387	Pass
(CreditScore, IsActiveMember)	0.0384	Pass
(Tenure, HasCrCard)	0.0308	Pass
(HasCrCard, IsActiveMember)	-0.0276	Pass

Run and log an individual test

Next, we'll use the previously initialized vm_balanced_raw_dataset (that still has a highly correlated Age column) as input to run an individual test, then log the result to the ValidMind Platform.

When running individual tests, you can use a custom result_id to tag the individual result with a unique identifier:

This result_id can be appended to test_id with a : separator.
The balanced_raw_dataset result identifier will correspond to the balanced_raw_dataset input, the dataset that still has the Age column.

result = vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset",
    params={"max_threshold": 0.3},
    inputs={"dataset": vm_balanced_raw_dataset},
)
result.log()

❌ High Pearson Correlation Balanced Raw Dataset

The High Pearson Correlation test evaluates pairwise linear relationships between features to identify potentially redundant or highly collinear variables. The result table lists the top reported feature pairs from the balanced raw dataset, along with their Pearson correlation coefficients and Pass/Fail status against the configured absolute threshold of 0.3. Across the 10 reported pairs, coefficients range from -0.1909 to 0.3479, with one pair exceeding the threshold and the remaining pairs marked as passing.

Key insights:

Single threshold breach observed: The pair (Age, Exited) records the highest absolute correlation at 0.3479 and is the only reported relationship classified as Fail under the 0.3 threshold.
All other reported pairs remain below threshold: The remaining nine reported feature pairs have absolute correlations at or below 0.1909, including (Balance, NumOfProducts) at -0.1909 and (IsActiveMember, Exited) at -0.1898, and are all marked as Pass.
Observed relationships are mostly weak: Aside from (Age, Exited), the reported coefficients are clustered close to zero, with the smallest listed absolute value equal to 0.0387 for (Tenure, IsActiveMember).
Both positive and negative associations appear: The reported pairs include positive correlations such as (Balance, Exited) at 0.1642 and negative correlations such as (CreditScore, Exited) at -0.0470, indicating mixed directional relationships within the top listed pairs.

The reported correlation structure is limited to one pair exceeding the configured threshold, while the rest of the listed relationships remain below the cutoff and are relatively small in magnitude. The strongest observed linear association is between Age and Exited, and the remaining top reported pairs reflect comparatively weak positive or negative relationships. Overall, the result indicates a largely low-correlation pattern among the reported feature pairs, with one identified exception under the test criterion.

Parameters:

{
  "max_threshold": 0.3
}

Tables

Columns	Coefficient	Pass/Fail
(Age, Exited)	0.3479	Fail
(Balance, NumOfProducts)	-0.1909	Pass
(IsActiveMember, Exited)	-0.1898	Pass
(Balance, Exited)	0.1642	Pass
(NumOfProducts, IsActiveMember)	0.0525	Pass
(NumOfProducts, Exited)	-0.0524	Pass
(Age, Balance)	0.0492	Pass
(CreditScore, Exited)	-0.0470	Pass
(Age, NumOfProducts)	-0.0444	Pass
(Tenure, IsActiveMember)	-0.0387	Pass
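To make the pass/fail logic in the table above concrete, here is a minimal pandas sketch of a thresholded pairwise Pearson check. It is not ValidMind's implementation: the helper name `high_pearson_pairs`, the toy data, and the strict "greater than" comparison against `max_threshold` are all assumptions made for illustration.

```python
import itertools

import pandas as pd


def high_pearson_pairs(df, max_threshold=0.3, top_n=10):
    """Hypothetical helper: list the strongest pairwise Pearson correlations
    and flag pairs whose absolute coefficient exceeds max_threshold."""
    corr = df.corr(method="pearson", numeric_only=True)
    rows = []
    # Visit each unordered pair of numeric columns exactly once
    for a, b in itertools.combinations(corr.columns, 2):
        coefficient = float(corr.loc[a, b])
        rows.append(
            {
                "Columns": f"({a}, {b})",
                "Coefficient": round(coefficient, 4),
                # Assumption: a pair fails only when |r| is strictly above the threshold
                "Pass/Fail": "Fail" if abs(coefficient) > max_threshold else "Pass",
            }
        )
    table = pd.DataFrame(rows)
    # Order by absolute strength so the strongest relationships come first
    order = table["Coefficient"].abs().sort_values(ascending=False).index
    return table.loc[order].head(top_n).reset_index(drop=True)


# Toy data (not the notebook's dataset): y is an exact multiple of x
toy_df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5, 6],
        "y": [2, 4, 6, 8, 10, 12],
        "z": [1, -1, 1, -1, 1, -1],
    }
)
result_table = high_pearson_pairs(toy_df, max_threshold=0.3)
print(result_table)
```

In this toy example only the (x, y) pair exceeds the threshold, mirroring how a single pair fails in the result above.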

2026-07-14 05:27:46,259 - INFO(validmind.vm_models.result.result): Test driven block with result_id validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset does not exist in model's document

Note that the output indicates a test-driven block doesn't currently exist in your documentation for this particular test ID.

That's expected: when we run individual tests, the logged results need to be added manually to your documentation within the ValidMind Platform.
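The `result_id` in the log line above can be read as a test ID followed by a suffix after the colon. The short sketch below splits that string with plain Python; it assumes the part after the colon is the custom suffix given when the test was run, and the variable names are illustrative only.

```python
# The result_id exactly as it appears in the log output above
result_id = "validmind.data_validation.HighPearsonCorrelation:balanced_raw_dataset"

# Split once on the colon: everything before it is the test ID,
# everything after it is assumed to be the custom suffix
test_id, _, suffix = result_id.partition(":")

# The final dotted component is the short test name
test_name = test_id.rsplit(".", 1)[-1]

print(test_id)
print(test_name)
print(suffix)
```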

Add individual test results to documentation

With the test results logged, let's head to the model we connected to at the beginning of this notebook and insert our test results into the documentation (Learn more: Work with test results):

From the Inventory in the ValidMind Platform, go to the model you connected to earlier.
In the left sidebar that appears for your model, click Development under Documents.
Locate the Data Preparation section and click on 2.3. Correlations and Interactions to expand that section.
Hover under the Pearson Correlation Matrix content block until a horizontal dashed line with a + button appears, indicating that you can insert a new block.
Click + and then select Test-Driven Block under FROM LIBRARY:
- Click on VM Library under TEST-DRIVEN in the left sidebar.
- In the search bar, type in HighPearsonCorrelation.
- Select HighPearsonCorrelation:balanced_raw_dataset as the test.
A preview of the test result is shown.
Finally, click Insert 1 Test Result to Document to add the test result to the documentation.

Confirm that the individual results for the high correlation test have been correctly inserted into section 2.3. Correlations and Interactions of the documentation.
Finalize the documentation by editing the test result's description block to explain the changes you made to the raw data and the reasons behind them.

Running model evaluation tests

So far, we've focused on the data assessment and pre-processing that usually occurs prior to any models being built. Now, let's instead assume we have already built a model and we want to incorporate some model results into our documentation.

Train simple logistic regression model

We'll train a simple logistic regression model on our dataset using the LogisticRegression class from sklearn.linear_model, then evaluate its performance with ValidMind tests.

To start, let's grab the first few rows from the balanced_raw_no_age_df dataset we initialized earlier, which has the highly correlated features removed:

balanced_raw_no_age_df.head()

	CreditScore	Geography	Gender	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited
4770	753	France	Female	6	106596.29	1	0	0	91305.77	0
6399	826	France	Male	5	142662.68	1	0	0	60285.30	0
1698	660	France	Female	6	100768.77	1	1	0	19199.61	0
1541	565	France	Male	4	118803.35	2	1	1	128124.70	1
7499	771	France	Female	4	0.00	1	0	0	85876.67	1

Before training the model, we need to encode the categorical features in the dataset:

Use the get_dummies function from pandas to one-hot encode the categorical features, dropping the first category of each feature to avoid redundant columns.
The categorical features in the dataset are Geography and Gender.

balanced_raw_no_age_df = pd.get_dummies(
    balanced_raw_no_age_df, columns=["Geography", "Gender"], drop_first=True
)
balanced_raw_no_age_df.head()

	CreditScore	Tenure	Balance	NumOfProducts	HasCrCard	IsActiveMember	EstimatedSalary	Exited	Geography_Germany	Geography_Spain	Gender_Male
4770	753	6	106596.29	1	0	0	91305.77	0	False	False	False
6399	826	5	142662.68	1	0	0	60285.30	0	False	False	True
1698	660	6	100768.77	1	1	0	19199.61	0	False	False	False
1541	565	4	118803.35	2	1	1	128124.70	1	False	False	True
7499	771	4	0.00	1	0	0	85876.67	1	False	False	False
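As a small illustration of what `drop_first=True` does in the encoding step above, the sketch below applies `pd.get_dummies` to a toy frame (made-up rows, not the notebook's dataset). The first category of each feature in sorted order (France, Female) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy rows for illustration only
toy_df = pd.DataFrame(
    {
        "Geography": ["France", "Germany", "Spain", "France"],
        "Gender": ["Female", "Male", "Male", "Female"],
        "Exited": [0, 1, 0, 1],
    }
)

# One indicator column per category, minus the first category of each feature
encoded_df = pd.get_dummies(toy_df, columns=["Geography", "Gender"], drop_first=True)

print(encoded_df.columns.tolist())
# → ['Exited', 'Geography_Germany', 'Geography_Spain', 'Gender_Male']
```

A row where both Geography_Germany and Geography_Spain are false therefore represents France, which is why no Geography_France column appears in the encoded output.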

We'll split our preprocessed dataset into training and test sets to help assess how well the model generalizes to unseen data:

We start by dividing our balanced_raw_no_age_df dataset into training and test subsets using train_test_split, with 80% of the data allocated to training (train_df) and 20% to testing (test_df).
From each subset, we separate the features (all columns except "Exited") into X_train and X_test, and the target column ("Exited") into y_train and y_test.

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(balanced_raw_no_age_df, test_size=0.20)

X_train = train_df.drop("Exited", axis=1)
y_train = train_df["Exited"]
X_test = test_df.drop("Exited", axis=1)
y_test = test_df["Exited"]
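The split above is random, so the rows in each subset change between runs. If you want a repeatable split, one optional variation (not what this notebook does) is to pass `random_state` and `stratify` to `train_test_split`. The sketch below uses a toy frame, and the seed value of 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame for illustration: 100 rows with a perfectly balanced target
toy_df = pd.DataFrame({"feature": range(100), "Exited": [0, 1] * 50})

toy_train_df, toy_test_df = train_test_split(
    toy_df,
    test_size=0.20,
    random_state=42,  # fixed seed so the same rows are selected on every run
    stratify=toy_df["Exited"],  # keep the class balance identical in both subsets
)

print(len(toy_train_df), len(toy_test_df))
print(toy_test_df["Exited"].mean())
```

Stratifying matters most when the target is imbalanced. On a balanced dataset it mainly guards against an unlucky split.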

Then using GridSearchCV, we'll find the best-performing hyperparameters or settings and save them:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for logistic regression
log_reg_params = {
    "penalty": ["l1", "l2"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "solver": ["liblinear"],
}

# Run the grid search (5-fold cross-validation by default) on the training data
grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params)
grid_log_reg.fit(X_train, y_train)

# Keep the estimator refit with the best-performing hyperparameters
log_reg = grid_log_reg.best_estimator_

/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1403: FutureWarning: 'penalty' was deprecated in version 1.8 and will be removed in 1.10. To avoid this warning, leave 'penalty' set to its default value and use 'l1_ratio' or 'C' instead. Use l1_ratio=0 instead of penalty='l2', l1_ratio=1 instead of penalty='l1', l1_ratio set to a float between 0 and 1 instead of penalty='elasticnet', and C=np.inf instead of penalty=None.
  warnings.warn(
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1429: UserWarning: Inconsistent values: penalty=l1 with l1_ratio=0.0. penalty is deprecated. Please use l1_ratio only.
  warnings.warn(

Initialize ValidMind datasets

The last step before evaluating the model's performance is to initialize the ValidMind Dataset and Model objects so that model predictions can be assigned to each dataset.

# Initialize the datasets into their own dataset objects
vm_train_ds = vm.init_dataset(
    input_id="train_dataset_final",
    dataset=train_df,
    target_column="Exited",
)

vm_test_ds = vm.init_dataset(
    input_id="test_dataset_final",
    dataset=test_df,
    target_column="Exited",
)

Initialize a ValidMind model

You'll also need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data.

Despite the naming convention, ValidMind model objects can be any type of record you want to test, document, validate, or monitor with the ValidMind Library.
From classical statistical and machine learning models, to generative and agentic AI systems and more, the ValidMind model object provides a consistent wrapper around your record so it can be passed as a unified input to any ValidMind test or test suite, with results sent directly to the ValidMind Platform.

Initialize your model object with vm.init_model():

# Initialize the ValidMind model object from the trained classifier
vm_model = vm.init_model(log_reg, input_id="log_reg_model_v1")

Assign predictions

Once the model object has been initialized, you can assign predictions to the training and testing datasets.

The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

vm_train_ds.assign_predictions(model=vm_model)
vm_test_ds.assign_predictions(model=vm_model)

2026-07-14 05:27:48,031 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:27:48,034 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:27:48,034 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:27:48,036 - INFO(validmind.vm_models.dataset.utils): Done running predict()
2026-07-14 05:27:48,038 - INFO(validmind.vm_models.dataset.utils): Running predict_proba()... This may take a while
2026-07-14 05:27:48,039 - INFO(validmind.vm_models.dataset.utils): Done running predict_proba()
2026-07-14 05:27:48,040 - INFO(validmind.vm_models.dataset.utils): Running predict()... This may take a while
2026-07-14 05:27:48,040 - INFO(validmind.vm_models.dataset.utils): Done running predict()
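
The fallback described above (use supplied predictions when they exist, otherwise compute them from the model) can be pictured with a small stand-in. This is only a sketch: ToyModel and resolve_predictions are hypothetical names invented for illustration and are not part of the ValidMind Library.

```python
# Illustrative stand-in only: these names are hypothetical and are not
# part of the ValidMind Library.
class ToyModel:
    """Minimal model exposing predict_proba() and predict()."""

    def predict_proba(self, rows):
        # Pretend the first feature is already a probability of class 1.
        return [min(max(row[0], 0.0), 1.0) for row in rows]

    def predict(self, rows):
        # Predict class 1 when the probability reaches 0.5.
        return [int(p >= 0.5) for p in self.predict_proba(rows)]


def resolve_predictions(model, rows, prediction_values=None):
    """Use the supplied predictions if given, otherwise compute them."""
    if prediction_values is not None:
        return list(prediction_values)
    return model.predict(rows)


rows = [[0.9], [0.2], [0.5]]
computed = resolve_predictions(ToyModel(), rows)
supplied = resolve_predictions(ToyModel(), rows, prediction_values=[0, 0, 1])
print(computed)  # [1, 0, 1]
print(supplied)  # [0, 0, 1]
```

In the notebook itself no prediction values are passed, which is why the log shows predict_proba() and predict() being run for each dataset.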

Run the model evaluation tests

In this next example, we'll focus on running the tests within the Model Development section of the documentation. Only tests associated with this section will be executed, and the corresponding results will be updated in the documentation.

Note the additional config that is passed to run_documentation_tests() — this allows you to override inputs or params in certain tests.
In our case, we want to explicitly use the vm_train_ds for the validmind.model_validation.sklearn.ClassifierPerformance:in_sample test, since it's supposed to run on the training dataset and not the test dataset.
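
Conceptually, a per-test entry in config takes precedence over the default inputs for that one test, and every other test keeps the defaults. The sketch below illustrates that idea with plain dictionaries. resolve_inputs is a hypothetical helper, the shortened test IDs are placeholders, and how the library resolves inputs internally is an assumption.

```python
# Conceptual sketch only: resolve_inputs is a hypothetical helper, not a
# ValidMind function. Strings stand in for the dataset and model objects.
default_inputs = {"dataset": "vm_test_ds", "model": "vm_model"}

overrides = {
    "ClassifierPerformance:in_sample": {
        "inputs": {"dataset": "vm_train_ds", "model": "vm_model"},
    }
}


def resolve_inputs(test_id, default_inputs, overrides):
    """Return the inputs for one test, letting per-test config win."""
    resolved = dict(default_inputs)
    resolved.update(overrides.get(test_id, {}).get("inputs", {}))
    return resolved


in_sample = resolve_inputs(
    "ClassifierPerformance:in_sample", default_inputs, overrides
)
out_of_sample = resolve_inputs(
    "ClassifierPerformance:out_of_sample", default_inputs, overrides
)
print(in_sample["dataset"])      # vm_train_ds
print(out_of_sample["dataset"])  # vm_test_ds
```

The actual configuration and call for this notebook follow: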

test_config = {
    "validmind.model_validation.sklearn.ClassifierPerformance:in_sample": {
        "inputs": {
            "dataset": vm_train_ds,
            "model": vm_model,
        },
    }
}
results = vm.run_documentation_tests(
    section=["model_development"],
    inputs={
        "dataset": vm_test_ds,  # Any test that requires a single dataset will use vm_test_ds
        "model": vm_model,
        "datasets": (
            vm_train_ds,
            vm_test_ds,
        ),  # Any test that requires multiple datasets will use vm_train_ds and vm_test_ds
    },
    config=test_config,
)

Test suite complete!

34/34 (100.0%)

Test Suite Results: Binary Classification V2

Check out the updated documentation on ValidMind.

Template for binary classification models.

▶ Model Development

▶ Test Result: Model Metadata (validmind.model_validation.ModelMetadata)

Modeling Technique	Modeling Framework	Framework Version	Programming Language
SKlearnModel	sklearn	1.9.0	Python

▶ Test Result: Dataset Split (validmind.data_validation.DatasetSplit)

Dataset	Size	Proportion
train_dataset_final	2585	79.98%
test_dataset_final	647	20.02%
Total	3232	100%
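
As a quick check, the proportions in this table follow directly from the two dataset sizes. The snippet below is a minimal sketch that uses only the row counts reported above.

```python
# Row counts copied from the Dataset Split table.
train_rows, test_rows = 2585, 647
total_rows = train_rows + test_rows

train_share = train_rows / total_rows
test_share = test_rows / total_rows
print(f"{total_rows} rows: train {train_share:.2%}, test {test_share:.2%}")
```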

▶ Test Result: Population Stability Index (validmind.model_validation.sklearn.PopulationStabilityIndex)

Population Stability Index

The Population Stability Index (PSI) test evaluates the stability of the distribution being compared between the initial and new datasets. The results are presented across 10 bins, showing count and percentage shares for the initial dataset (train_dataset_final) and the new dataset (test_dataset_final), along with the PSI contribution from each bin. Bin-level percentages in the initial dataset range from 4.6035% to 14.7002%, while the new dataset ranges from 4.9459% to 16.2287%. The total PSI reported across all bins is 0.0145.

Key insights:

Overall PSI is low: The total PSI across the compared distributions is 0.0145, based on 2,585 observations in the initial dataset and 647 observations in the new dataset.
Largest shift occurs in Bin 0: Bin 0 contributes the highest individual PSI value at 0.0051, with the population share decreasing from 6.8472% in the initial dataset to 5.1005% in the new dataset.
Middle bins show modest increases: Bins 2 through 4 have higher population shares in the new dataset than in the initial dataset, increasing from 10.6383%, 13.4623%, and 14.7002% to 12.0556%, 14.8377%, and 16.2287%, respectively.
Upper-middle bins decline in the new dataset: Bins 5 through 7 show lower shares in the new dataset, moving from 13.5010%, 11.6828%, and 10.6770% in the initial dataset to 12.2102%, 10.9737%, and 9.7372%, respectively.
Bin-level PSI contributions are small: Aside from Bin 0, all individual bin PSI values are between 0.0000 and 0.0018, indicating that the total PSI is composed of small contributions spread across multiple bins.

The PSI result indicates that the compared distributions are closely aligned overall, with a total PSI of 0.0145 and only small bin-level contributions across most of the range. The observed differences are concentrated in a reduced share in the first bin and moderately higher shares in the central bins of the new dataset, offset by lower shares in the upper-middle bins. Taken together, the result reflects limited distributional change between train_dataset_final and test_dataset_final in this comparison.

Tables

Population Stability Index for train_dataset_final and test_dataset_final Datasets

Bin	Count Initial	Percent Initial (%)	Count New	Percent New (%)	PSI
0	177	6.8472	33	5.1005	0.0051
1	215	8.3172	55	8.5008	0.0000
2	275	10.6383	78	12.0556	0.0018
3	348	13.4623	96	14.8377	0.0013
4	380	14.7002	105	16.2287	0.0015
5	349	13.5010	79	12.2102	0.0013
6	302	11.6828	71	10.9737	0.0004
7	276	10.6770	63	9.7372	0.0009
8	119	4.6035	35	5.4096	0.0013
9	144	5.5706	32	4.9459	0.0007
Total	2585	100.0000	647	100.0000	0.0145
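
To see where the per-bin values come from, the sketch below recomputes them from the bin counts in the table. It assumes the standard PSI formula, (p_new - p_initial) * ln(p_new / p_initial) summed over bins. The library's implementation may differ in details such as how empty bins are handled.

```python
import math

# Bin counts copied from the table above (initial = train, new = test).
initial_counts = [177, 215, 275, 348, 380, 349, 302, 276, 119, 144]
new_counts = [33, 55, 78, 96, 105, 79, 71, 63, 35, 32]


def psi_contributions(initial, new):
    """Per-bin PSI terms: (p_new - p_initial) * ln(p_new / p_initial)."""
    total_initial, total_new = sum(initial), sum(new)
    terms = []
    for n_initial, n_new in zip(initial, new):
        p_initial = n_initial / total_initial
        p_new = n_new / total_new
        terms.append((p_new - p_initial) * math.log(p_new / p_initial))
    return terms


terms = psi_contributions(initial_counts, new_counts)
total_psi = sum(terms)
print(f"Bin 0: {terms[0]:.4f}, total: {total_psi:.4f}")
```

Under that assumption, the Bin 0 term and the total come out at approximately 0.0051 and 0.0145, in line with the table. Each term is non-negative, so the total can only grow as the two distributions diverge.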

Figures

ValidMind Figure validmind.model_validation.sklearn.PopulationStabilityIndex:4570

▶ Test Result: Confusion Matrix (validmind.model_validation.sklearn.ConfusionMatrix)

Confusion Matrix

The ConfusionMatrix test evaluates classification performance by comparing predicted class labels with observed class labels and summarizing the results across true positives, true negatives, false positives, and false negatives. The heatmap shows counts for each outcome type across the two classes. In this result, the matrix contains 222 true negatives, 200 true positives, 119 false positives, and 106 false negatives. These values provide a direct view of correct and incorrect classifications for both negative and positive classes.

Key insights:

Correct classifications exceed errors: The matrix shows 222 true negatives and 200 true positives, compared with 119 false positives and 106 false negatives. Correct classifications total 422 versus 225 misclassifications.
Negative class is identified slightly more often: True negatives (222) are higher than true positives (200). This indicates more correct predictions for class 0 than for class 1 in absolute count terms.
False positives slightly exceed false negatives: The model produces 119 false positives and 106 false negatives. Misclassification counts are therefore relatively balanced, with a modestly higher number of negative cases predicted as positive than positive cases predicted as negative.
Observed class counts are similar: Based on the matrix totals, actual class 0 observations equal 341 (222 TN + 119 FP) and actual class 1 observations equal 306 (106 FN + 200 TP). The evaluated sample therefore shows comparable class representation across the two observed labels.

The confusion matrix indicates that the model produces more correct than incorrect classifications, with both classes contributing materially to correct prediction counts. Error counts are present in both directions and are of similar magnitude, with false positives marginally higher than false negatives. Overall, the result reflects a balanced pattern of classification outcomes without a single error type dominating the matrix.

Figures

ValidMind Figure validmind.model_validation.sklearn.ConfusionMatrix:60e4
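
The four counts reported above are enough to derive accuracy and the per-class precision and recall. The following sketch uses the standard definitions of those metrics and only the counts quoted in this result.

```python
# Counts reported in the confusion matrix above (test dataset).
tn, fp, fn, tp = 222, 119, 106, 200

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Class 1 (positive) metrics.
precision_1 = tp / (tp + fp)  # share of predicted class 1 that is correct
recall_1 = tp / (tp + fn)     # share of actual class 1 that is found

# Class 0 (negative) metrics, treating class 0 as the class of interest.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.4f}")
print(f"class 1: precision={precision_1:.4f}, recall={recall_1:.4f}")
print(f"class 0: precision={precision_0:.4f}, recall={recall_0:.4f}")
```

These values correspond to the out-of-sample ClassifierPerformance figures, since both results are computed on the test dataset.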

▶ Test Result: Classifier Performance In Sample (validmind.model_validation.sklearn.ClassifierPerformance:in_sample)

Classifier Performance In Sample

The Classifier Performance test evaluates classification model performance using precision, recall, F1-score, accuracy, and ROC AUC. The in-sample results are presented for both classes, along with macro and weighted averages across classes. Class-level precision ranges from 0.6300 to 0.6341, recall ranges from 0.6157 to 0.6481, and F1 ranges from 0.6228 to 0.6410. Overall summary metrics show accuracy of 0.6321 and ROC AUC of 0.6868.

Key insights:

Similar class-level performance: Precision is 0.6300 for class 0 and 0.6341 for class 1, while F1-scores are 0.6228 and 0.6410 respectively. The model shows closely aligned performance across the two classes rather than a large divergence in class-specific results.
Class 1 recall is higher: Recall is 0.6481 for class 1 compared with 0.6157 for class 0. This indicates that the model identifies class 1 observations more frequently than class 0 within the in-sample evaluation.
Aggregate metrics are tightly aligned: Weighted-average precision, recall, and F1 are 0.6321, 0.6321, and 0.6320, while macro averages are 0.6320, 0.6319, and 0.6319. The near-equality of macro and weighted averages indicates minimal dispersion between the class-specific metric contributions in this result set.
ROC AUC exceeds accuracy: Accuracy is 0.6321, whereas ROC AUC is 0.6868. The ranking-based discrimination measure is higher than the threshold-dependent accuracy reported for the same in-sample test.

Overall, the in-sample classification results show moderate and internally consistent performance across precision, recall, and F1-score, with limited separation between class-specific and aggregate metrics. Performance is balanced across classes, although recall is somewhat higher for class 1 than for class 0. The ROC AUC of 0.6868 indicates stronger ranking discrimination than the single-threshold accuracy value of 0.6321.

Tables

Precision, Recall, and F1

Class	Precision	Recall	F1
0	0.6300	0.6157	0.6228
1	0.6341	0.6481	0.6410
Weighted Average	0.6321	0.6321	0.6320
Macro Average	0.6320	0.6319	0.6319

Accuracy and ROC AUC

Metric	Value
Accuracy	0.6321
ROC AUC	0.6868

▶ Test Result: Classifier Performance Out Of Sample (validmind.model_validation.sklearn.ClassifierPerformance:out_of_sample)

Classifier Performance Out Of Sample

The Classifier Performance test evaluates classification performance using precision, recall, F1-score, accuracy, and ROC AUC. In this out-of-sample result, the report presents class-level precision, recall, and F1 values for classes 0 and 1, together with macro and weighted averages. Separate summary metrics report overall accuracy of 0.6522 and ROC AUC of 0.6954. The results therefore provide both per-class and aggregate views of predictive performance on the evaluation sample.

Key insights:

Balanced class-level performance: Class 0 and class 1 show similar results across metrics. F1-scores are 0.6637 for class 0 and 0.6400 for class 1, while recall is 0.6510 and 0.6536 respectively, indicating limited divergence in performance between the two classes.
Aggregate metrics are closely aligned: Weighted-average and macro-average scores are nearly identical across precision, recall, and F1. Weighted-average F1 is 0.6525 and macro-average F1 is 0.6518, indicating that aggregate performance is not materially altered by class weighting.
Precision and recall are comparable: Precision and recall are closely matched at both the class and aggregate levels. For example, weighted-average precision is 0.6532 versus weighted-average recall of 0.6522, showing no pronounced imbalance between false-positive and false-negative tendencies in the reported metrics.
ROC AUC exceeds accuracy: The reported ROC AUC is 0.6954 compared with accuracy of 0.6522, indicating stronger ranking performance than point-classification performance at the applied decision threshold.

Overall, the out-of-sample classification results show moderate and internally consistent performance across the reported metrics. Class-level results are similar between classes 0 and 1, and the close alignment of macro and weighted averages indicates stable aggregate behavior across classes. The reported ROC AUC of 0.6954 is higher than the accuracy of 0.6522, while precision, recall, and F1 remain concentrated around 0.65 across both class-specific and summary views.

Tables

Precision, Recall, and F1

Class	Precision	Recall	F1
0	0.6768	0.6510	0.6637
1	0.6270	0.6536	0.6400
Weighted Average	0.6532	0.6522	0.6525
Macro Average	0.6519	0.6523	0.6518

Accuracy and ROC AUC

Metric	Value
Accuracy	0.6522
ROC AUC	0.6954
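
The macro and weighted averages differ only in how the two classes are weighted: macro averages treat each class equally, while weighted averages weight each class by its support (the number of actual observations). The sketch below uses the standard definitions and derives the per-class values from the test-set confusion matrix counts reported earlier (222 TN, 119 FP, 106 FN, 200 TP).

```python
# Per-class precision and recall from the test-set confusion matrix counts.
# Support is the number of actual observations in each class.
per_class = {
    0: {"precision": 222 / 328, "recall": 222 / 341, "support": 341},
    1: {"precision": 200 / 319, "recall": 200 / 306, "support": 306},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Mean weighted by class support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [m["support"] for m in per_class.values()]
precisions = [m["precision"] for m in per_class.values()]
f1_values = [f1_score(m["precision"], m["recall"]) for m in per_class.values()]

macro_precision = macro_average(precisions)
weighted_precision = weighted_average(precisions, supports)
macro_f1 = macro_average(f1_values)
weighted_f1 = weighted_average(f1_values, supports)

print(f"precision: macro={macro_precision:.4f}, weighted={weighted_precision:.4f}")
print(f"F1:        macro={macro_f1:.4f}, weighted={weighted_f1:.4f}")
```

Because the two classes have similar support (341 and 306), the macro and weighted averages stay close to each other, which is what the table shows.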

▶ Test Result: Precision Recall Curve (validmind.model_validation.sklearn.PrecisionRecallCurve)

Precision Recall Curve

The PrecisionRecallCurve test evaluates the trade-off between precision and recall for this binary classification model across decision thresholds. The result is presented as a precision-recall curve with recall on the x-axis and precision on the y-axis. The plotted curve shows high precision at very low recall, followed by a gradual decline in precision as recall increases. Across most of the recall range, the curve remains above 0.5 precision and trends downward toward approximately 0.48 near full recall.

Key insights:

High initial precision at low recall: At very low recall levels, precision rises to above 0.8, indicating that the highest-confidence positive predictions are comparatively accurate.
Gradual precision decline across thresholds: After the initial low-recall region, precision generally decreases as recall increases, moving from roughly the low-0.7 range through the mid-0.6 range and ending below 0.5 near recall of 1.0.
Mid-range recall retains moderate precision: Through the central portion of the curve, approximately around 0.3 to 0.7 recall, precision remains relatively stable in the mid-0.6 to low-0.7 range.
No abrupt collapse in curve shape: The curve declines progressively rather than showing a sharp drop over most thresholds, indicating a continuous precision-recall trade-off as the classification threshold is relaxed.

The precision-recall result shows that the model delivers its strongest precision at low recall and then exhibits a steady reduction in precision as broader positive coverage is pursued. Most of the curve occupies a moderate precision range, with the decline becoming more pronounced at higher recall levels. Overall, the result reflects a typical threshold trade-off in which increased recall is associated with lower precision, without a sudden deterioration across the majority of the operating range.
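
To make the threshold trade-off described above concrete, here is a minimal sketch of how the points on a precision-recall curve can be computed. The labels and scores are made-up toy values, not the notebook's data, and `precision_recall_points` is an illustrative helper rather than a ValidMind or scikit-learn function.

```python
def precision_recall_points(y_true, y_score):
    """Return (threshold, precision, recall) for each distinct score, highest first."""
    total_positives = sum(y_true)
    points = []
    for threshold in sorted(set(y_score), reverse=True):
        predicted = [score >= threshold for score in y_score]
        tp = sum(1 for p, t in zip(predicted, y_true) if p and t == 1)
        fp = sum(1 for p, t in zip(predicted, y_true) if p and t == 0)
        # tp + fp >= 1 because every threshold is an observed score.
        precision = tp / (tp + fp)
        recall = tp / total_positives
        points.append((threshold, precision, recall))
    return points


# Toy example: 4 positives and 4 negatives, already sorted by score.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.95, 0.85, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]

points = precision_recall_points(y_true, y_score)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f}  precision={precision:.3f}  recall={recall:.3f}")
```

At the lowest threshold every record is predicted positive, so recall is 1.0 and precision equals the share of positives in the data. That is why a curve like the one above ends near the positive-class rate at full recall.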

Figures

ValidMind Figure validmind.model_validation.sklearn.PrecisionRecallCurve:6584

▶ Test Result: ROC Curve (validmind.model_validation.sklearn.ROCCurve)

▶ Test Result: Training Test Degradation (validmind.model_validation.sklearn.TrainingTestDegradation)

✅ Training Test Degradation

The TrainingTestDegradation test evaluates whether model performance degradation from the training dataset to the test dataset remains within a predefined threshold across classification metrics. The results table reports train and test scores for precision, recall, and F1-score separately for classes 1 and 0, along with the corresponding degradation percentages and pass/fail outcomes. Observed degradation values range from -7.4305% to 1.1191%, and all six metric evaluations are marked as Pass.

Key insights:

All evaluated metrics passed: Each class-metric combination is marked as Pass, indicating that all reported degradation percentages remained within the configured threshold used by the test.
Class 1 performance is nearly unchanged: For class 1, precision declines from 0.6341 to 0.6270 with 1.1191% degradation, recall increases from 0.6481 to 0.6536 with -0.8491% degradation, and F1-score changes from 0.6410 to 0.6400 with 0.1555% degradation.
Class 0 test performance is higher than training: For class 0, all three metrics are higher on the test dataset than on the training dataset, producing negative degradation values: precision improves from 0.6300 to 0.6768, recall from 0.6157 to 0.6510, and F1-score from 0.6228 to 0.6637.
Largest change occurs in class 0 precision: The largest absolute degradation value reported is -7.4305% for class 0 precision, reflecting the biggest train-to-test score difference among the evaluated metrics.

The results indicate limited train-to-test performance change across the evaluated classification metrics, with all observed degradation values falling within the test acceptance threshold. Class 1 metrics remain closely aligned between training and test datasets, while class 0 shows higher scores on the test dataset for precision, recall, and F1-score. Overall, the reported outcomes show no threshold breaches in the measured training-to-test performance comparison.

Tables

Class	Metric	train_dataset_final Score	test_dataset_final Score	Degradation (%)	Pass/Fail
1	Precision	0.6341	0.6270	1.1191	Pass
1	Recall	0.6481	0.6536	-0.8491	Pass
1	F1-Score	0.6410	0.6400	0.1555	Pass
0	Precision	0.6300	0.6768	-7.4305	Pass
0	Recall	0.6157	0.6510	-5.7400	Pass
0	F1-Score	0.6228	0.6637	-6.5688	Pass
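
The degradation percentages in the table are consistent with a relative drop from the training score to the test score. The sketch below recomputes them under that assumption. The formula and the 10% cutoff are assumptions made for illustration (the configured threshold is not shown in this output), and small differences from the table come from the scores being rounded to four decimals.

```python
def degradation_pct(train_score: float, test_score: float) -> float:
    """Relative drop from train to test, in percent (negative means test is higher)."""
    return (train_score - test_score) / train_score * 100


# Assumed cutoff for illustration only; the output above does not state it.
MAX_DEGRADATION_PCT = 10.0

# (class, metric, train score, test score, reported degradation %) from the table.
rows = [
    (1, "Precision", 0.6341, 0.6270, 1.1191),
    (1, "Recall", 0.6481, 0.6536, -0.8491),
    (1, "F1-Score", 0.6410, 0.6400, 0.1555),
    (0, "Precision", 0.6300, 0.6768, -7.4305),
    (0, "Recall", 0.6157, 0.6510, -5.7400),
    (0, "F1-Score", 0.6228, 0.6637, -6.5688),
]

recomputed = []
for label, metric, train, test, reported in rows:
    pct = degradation_pct(train, test)
    status = "Pass" if pct <= MAX_DEGRADATION_PCT else "Fail"
    recomputed.append((label, metric, pct, status))
    print(f"class {label} {metric:<9} recomputed={pct:8.4f}  reported={reported:8.4f}  {status}")
```

A negative value means the test score is higher than the training score, which is the case for all three class 0 metrics.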

▶ Test Result: Minimum Accuracy (validmind.model_validation.sklearn.MinimumAccuracy)

Score	Threshold	Pass/Fail
0.6522	0.7	Fail

▶ Test Result: Minimum F1 Score (validmind.model_validation.sklearn.MinimumF1Score)

Score	Threshold	Pass/Fail
0.64	0.5	Pass

▶ Test Result: Minimum ROCAUC Score (validmind.model_validation.sklearn.MinimumROCAUCScore)

Score	Threshold	Pass/Fail
0.6954	0.5	Pass
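
The three minimum-score tests above share the same logic: compare one score with a configured floor. The following sketch reproduces the reported outcomes in plain Python. It assumes a score passes when it is greater than or equal to the threshold, which this output does not confirm, and `passes` is an illustrative helper rather than a ValidMind function.

```python
# (score, threshold) pairs copied from the three result tables above.
results = {
    "Minimum Accuracy": (0.6522, 0.7),
    "Minimum F1 Score": (0.64, 0.5),
    "Minimum ROCAUC Score": (0.6954, 0.5),
}


def passes(score: float, threshold: float) -> bool:
    """Assumed rule: a score passes when it meets or exceeds the threshold."""
    return score >= threshold


outcomes = {
    name: "Pass" if passes(score, threshold) else "Fail"
    for name, (score, threshold) in results.items()
}

for name, outcome in outcomes.items():
    score, threshold = results[name]
    print(f"{name}: score={score} threshold={threshold} -> {outcome}")
```

Only the accuracy check fails, because 0.6522 is below the 0.7 floor. The F1 and ROC AUC floors of 0.5 are low enough that the same model clears them.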

▶ Test Result: Permutation Feature Importance (validmind.model_validation.sklearn.PermutationFeatureImportance)

Permutation Feature Importance

The Permutation Feature Importance test evaluates the significance of each model input by measuring the change in model performance after randomly permuting individual feature values. The result is presented as a ranked horizontal bar chart of permutation importances across the model features. The chart shows a clear ordering of feature contribution, with the largest bars associated with Geography_Germany and IsActiveMember, followed by a smaller group of mid-level contributors and several features with near-zero importance. One feature, NumOfProducts, appears slightly negative in the plotted importance scale.

Key insights:

Geography_Germany is the dominant feature: Geography_Germany has the largest permutation importance in the chart, with an importance value near 0.08, making it the strongest single contributor among the plotted variables.
IsActiveMember is the second-largest driver: IsActiveMember shows the next highest importance, at approximately 0.06, indicating a material but lower contribution than Geography_Germany.
Importance drops sharply after the top two: After the two leading features, the remaining variables show substantially smaller importance values. Gender_Male is the largest of this next group at roughly 0.03, followed by Balance at around 0.02.
Several features have limited incremental contribution: CreditScore, HasCrCard, Tenure, EstimatedSalary, and Geography_Spain all have low positive importances clustered close to zero relative to the leading features.
NumOfProducts shows negative importance: NumOfProducts is plotted slightly below zero, indicating that permuting this feature did not reduce model performance in this test and was instead associated with a marginal improvement in the measured score.

Overall, the permutation importance profile is concentrated in a small number of features, with Geography_Germany and IsActiveMember contributing the largest observed performance impact when permuted. A second tier of features, including Gender_Male and Balance, shows moderate influence, while most remaining inputs have limited incremental importance in the plotted results. The presence of a slightly negative importance for NumOfProducts indicates that not all included features contributed positively under this test outcome.
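
The idea behind permutation importance can be sketched in a few lines: shuffle one column, re-score the model, and record how much the score drops. The example below is a toy illustration in plain Python with a made-up two-feature dataset and a hand-written rule as the "model". The function names are illustrative and this is not the implementation the test uses.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one column, over n_repeats shuffles."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


# Toy data: column 0 fully determines the label, column 1 is unrelated noise.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [1 if signal > 0.5 else 0 for signal, _ in rows]


def model(row):
    """Hand-written rule that only looks at column 0."""
    return 1 if row[0] > 0.5 else 0


importances = {
    name: permutation_importance(model, rows, labels, column)
    for column, name in enumerate(["signal", "noise"])
}
print(importances)
```

With a real fitted model, a feature that contributes little can end up with a slightly negative value simply because a random shuffle happened to score marginally better than the baseline.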

Figures

ValidMind Figure validmind.model_validation.sklearn.PermutationFeatureImportance:8181

▶ Test Result: SHAP Global Importance (validmind.model_validation.sklearn.SHAPGlobalImportance)

SHAP Global Importance

The SHAP Global Importance test evaluates global feature importance using absolute SHAP values and visualizes how individual features contribute to model output across observations. The results include a normalized feature-importance bar chart and a SHAP summary plot. The bar chart ranks features by mean absolute SHAP value, while the summary plot shows the direction and spread of feature-level effects on model output, with color indicating relative feature magnitude. Across both views, the features are ordered from highest to lowest contribution, allowing comparison of relative importance and effect dispersion.

Key insights:

IsActiveMember is the dominant driver: IsActiveMember has the highest normalized SHAP importance at 100, clearly above all other features. In the summary plot, its SHAP values form two separated clusters on opposite sides of zero, indicating a strong directional shift in model output across its observed values.
Geography_Germany and Gender_Male are also highly influential: Geography_Germany and Gender_Male are the next two most important features, with normalized importance levels visibly close to 80–85 and about 70–75 respectively. Both features show distinct separation between low and high feature values in the summary plot, with contributions spanning both negative and positive SHAP regions.
Importance declines sharply after the top four features: After IsActiveMember, Geography_Germany, Gender_Male, and Balance, feature importance drops materially. Balance is the fourth-ranked feature at roughly 40 on the normalized scale, while CreditScore and all remaining variables are well below that level.
Balance shows predominantly positive contributions at higher values: In the summary plot, higher Balance values are concentrated on the positive SHAP side, while lower values are positioned on the negative side. The observed spread is narrower than for the top three features but remains materially wider than for the lower-ranked variables.
Lower-ranked features have limited marginal impact: Tenure, Geography_Spain, and EstimatedSalary appear at the bottom of the importance ranking, with EstimatedSalary near zero relative importance. Their SHAP values are tightly concentrated around zero in the summary plot, indicating comparatively small contribution magnitudes across observations.

The SHAP results show that model behavior is primarily driven by a small set of features, led by IsActiveMember, Geography_Germany, and Gender_Male, with Balance contributing at a secondary level. CreditScore has moderate influence, while the remaining variables contribute relatively little in comparison. The summary plot further indicates that the most important features exhibit clearer directional separation and wider SHAP dispersion, whereas lower-ranked features remain concentrated near zero.
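
The normalized scale in the bar chart can be read as each feature's mean absolute SHAP value expressed relative to the top feature, which is set to 100. The sketch below shows that calculation on made-up SHAP values with placeholder feature names. It is an assumption about how the normalization works, inferred from the top feature being reported at 100, and it does not use the notebook's data.

```python
# Made-up signed SHAP values per observation, for illustration only.
shap_values = {
    "feature_a": [0.30, -0.20, 0.25, -0.25],
    "feature_b": [0.10, -0.10, 0.05, -0.15],
    "feature_c": [0.01, -0.01, 0.00, 0.02],
}

# Global importance: mean of the absolute per-observation contributions.
mean_abs = {
    name: sum(abs(value) for value in values) / len(values)
    for name, values in shap_values.items()
}

# Normalize so the most important feature sits at 100.
top = max(mean_abs.values())
normalized = {
    name: 100 * value / top
    for name, value in sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)
}

for name, score in normalized.items():
    print(f"{name}: {score:.1f}")
```

Taking absolute values first matters: a feature whose contributions are large but split between positive and negative would otherwise average out to roughly zero.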

Figures

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:5f92

ValidMind Figure validmind.model_validation.sklearn.SHAPGlobalImportance:5cc4

▶ Test Result: Weakspots Diagnosis (validmind.model_validation.sklearn.WeakspotsDiagnosis)

❌ Weakspots Diagnosis

The WeakspotsDiagnosis test evaluates model performance across feature-space slices by comparing accuracy, precision, recall, and F1 between the training and test datasets within each bin. The results are reported for multiple features, including continuous variables such as Balance, CreditScore, EstimatedSalary, and Tenure, binary indicators such as Gender_Male, Geography_Germany, Geography_Spain, HasCrCard, and IsActiveMember, and the discrete count NumOfProducts. The plots additionally show metric-specific thresholds at 0.75 for accuracy, 0.5 for precision and recall, and 0.7 for F1, allowing direct identification of bins where test performance falls below those levels.

Key insights:

Weak spots are concentrated in accuracy and F1: Test accuracy is below the 0.75 threshold in nearly all reported slices across features. Test F1 is also below the 0.7 threshold in many bins, including most Balance, HasCrCard, IsActiveMember, Gender_Male, and several EstimatedSalary, CreditScore, Tenure, and Geography_Spain slices.
Balance shows the widest variability: Across populated Balance test bins, accuracy ranges from 0.3000 to 0.7500, precision from 0.3333 to 1.0000, recall from 0.3649 to 0.8571, and F1 from 0.3636 to 0.8571; the empty bin (225808.281, 250898.09] reports 0.0000 for every metric. The slice (175628.663, 200718.472] is the weakest populated test bin, with 10 records and accuracy 0.3000, precision 0.3333, recall 0.4000, and F1 0.3636.
Sparse bins correspond to extreme metric values: Several slices have very small record counts and highly variable metrics, including Balance bins with 8, 4, 0, and 10 test records; CreditScore (349.5, 400.0] with 3 test records; and NumOfProducts (3.7, 4.0] with 7 test records. These sparse slices include boundary values such as Balance (225808.281, 250898.09] with 0 test records and all metrics reported as 0.0000, and small-sample bins with precision reported as 1.0000.
CreditScore contains a pronounced low-performance slice: The CreditScore bin (400.0, 450.0] has test accuracy 0.4545, precision 0.4286, recall 0.6000, and F1 0.5000 on 11 records, which is the lowest test accuracy among CreditScore slices. By contrast, the (700.0, 750.0] test bin reaches accuracy 0.7500 and F1 0.7500.
Tenure performance is uneven across bins: Test Tenure accuracy ranges from 0.5263 in (2.0, 3.0] to 0.8889 in (9.0, 10.0]. The (2.0, 3.0] slice also has the lowest Tenure test F1 at 0.4194, while the (9.0, 10.0] slice has recall 1.0000 and F1 0.8889 on 27 records.
IsActiveMember splits show strong recall asymmetry: For test data, the inactive slice (-0.001, 0.1] has recall 0.8214 and F1 0.7124, while the active slice (0.9, 1.0] has recall 0.3545 and F1 0.4509. This same directional pattern is also present in training data, with recall 0.8104 versus 0.3737.
Geography indicators show contrasting slice behavior: For Geography_Germany, the Germany slice (0.9, 1.0] has materially stronger test performance than the non-Germany slice, with recall 0.9360 and F1 0.7905 versus recall 0.4586 and F1 0.5046. For Geography_Spain, the Spain slice (0.9, 1.0] has higher test accuracy at 0.7248 than the non-Spain slice at 0.6305, but lower recall at 0.5283 versus 0.6798.
NumOfProducts degrades at higher product counts: Test accuracy declines from 0.6366 for (0.997, 1.3] and 0.7040 for (1.9, 2.2] to 0.5172 for (2.8, 3.1] and 0.4286 for (3.7, 4.0]. The highest-count bins also have small test samples, with 29 and 7 records respectively.
EstimatedSalary is comparatively stable but not uniform: Most EstimatedSalary test bins cluster between roughly 0.63 and 0.71 accuracy, but the slice (159964.98, 179959.155] drops to accuracy 0.4590, precision 0.5152, recall 0.5000, and F1 0.5075. The strongest F1 occurs in (20005.755, 39999.93] at 0.7606.

Overall, the weak spot analysis shows that model performance varies materially across slices of several features rather than remaining uniform throughout the feature space. The most pronounced dispersion appears in Balance, Tenure, CreditScore, IsActiveMember, and the geography indicators, with some bins showing substantially lower test accuracy, recall, or F1 than adjacent slices. A number of the most extreme outcomes occur in sparsely populated bins, while several broader and better-populated slices also exhibit persistent metric differences between groups.

Tables

Slice	Number of Records	Feature	Accuracy	Precision	Recall	F1	Dataset
(-250.898, 25089.809]	209	Balance	0.6746	0.5625	0.3649	0.4426	test_dataset_final
(25089.809, 50179.618]	8	Balance	0.6250	0.5000	0.6667	0.5714	test_dataset_final
(50179.618, 75269.427]	29	Balance	0.6207	0.6000	0.4615	0.5217	test_dataset_final
(75269.427, 100359.236]	72	Balance	0.6111	0.5641	0.6667	0.6111	test_dataset_final
(100359.236, 125449.045]	153	Balance	0.6993	0.7158	0.7816	0.7473	test_dataset_final
(125449.045, 150538.854]	113	Balance	0.5929	0.5823	0.7797	0.6667	test_dataset_final
(150538.854, 175628.663]	49	Balance	0.6939	0.6857	0.8571	0.7619	test_dataset_final
(175628.663, 200718.472]	10	Balance	0.3000	0.3333	0.4000	0.3636	test_dataset_final
(200718.472, 225808.281]	4	Balance	0.7500	1.0000	0.7500	0.8571	test_dataset_final
(225808.281, 250898.09]	0	Balance	0.0000	0.0000	0.0000	0.0000	test_dataset_final
(-250.898, 25089.809]	846	Balance	0.6312	0.5370	0.3537	0.4265	train_dataset_final
(25089.809, 50179.618]	22	Balance	0.5455	0.6000	0.2727	0.3750	train_dataset_final
(50179.618, 75269.427]	90	Balance	0.5444	0.6562	0.4118	0.5060	train_dataset_final
(75269.427, 100359.236]	284	Balance	0.6021	0.6101	0.6554	0.6319	train_dataset_final
(100359.236, 125449.045]	600	Balance	0.6917	0.7066	0.8164	0.7575	train_dataset_final
(125449.045, 150538.854]	492	Balance	0.6362	0.6493	0.7943	0.7145	train_dataset_final
(150538.854, 175628.663]	186	Balance	0.5591	0.5512	0.7368	0.6306	train_dataset_final
(175628.663, 200718.472]	48	Balance	0.5417	0.5714	0.7407	0.6452	train_dataset_final
(200718.472, 225808.281]	15	Balance	0.6000	0.8000	0.6667	0.7273	train_dataset_final
(225808.281, 250898.09]	2	Balance	0.5000	1.0000	0.5000	0.6667	train_dataset_final
(349.5, 400.0]	3	CreditScore	0.6667	1.0000	0.6667	0.8000	test_dataset_final
(400.0, 450.0]	11	CreditScore	0.4545	0.4286	0.6000	0.5000	test_dataset_final
(450.0, 500.0]	22	CreditScore	0.7727	0.6923	0.9000	0.7826	test_dataset_final
(500.0, 550.0]	63	CreditScore	0.6508	0.5806	0.6667	0.6207	test_dataset_final
(550.0, 600.0]	91	CreditScore	0.6374	0.7000	0.5714	0.6292	test_dataset_final
(600.0, 650.0]	130	CreditScore	0.5846	0.5441	0.6167	0.5781	test_dataset_final
(650.0, 700.0]	113	CreditScore	0.6106	0.6000	0.6429	0.6207	test_dataset_final
(700.0, 750.0]	96	CreditScore	0.7500	0.7500	0.7500	0.7500	test_dataset_final
(750.0, 800.0]	69	CreditScore	0.6957	0.7097	0.6471	0.6769	test_dataset_final
(800.0, 850.0]	49	CreditScore	0.6939	0.4737	0.6429	0.5455	test_dataset_final
(349.5, 400.0]	11	CreditScore	0.7273	1.0000	0.7273	0.8421	train_dataset_final
(400.0, 450.0]	50	CreditScore	0.6600	0.6875	0.7586	0.7213	train_dataset_final
(450.0, 500.0]	125	CreditScore	0.5920	0.5802	0.7344	0.6483	train_dataset_final
(500.0, 550.0]	264	CreditScore	0.6591	0.6954	0.7047	0.7000	train_dataset_final
(550.0, 600.0]	367	CreditScore	0.6322	0.6204	0.7166	0.6650	train_dataset_final
(600.0, 650.0]	469	CreditScore	0.6077	0.6016	0.6525	0.6260	train_dataset_final
(650.0, 700.0]	511	CreditScore	0.6536	0.6301	0.6432	0.6366	train_dataset_final
(700.0, 750.0]	377	CreditScore	0.5968	0.6257	0.5487	0.5847	train_dataset_final
(750.0, 800.0]	229	CreditScore	0.6725	0.6727	0.6549	0.6637	train_dataset_final
(800.0, 850.0]	182	CreditScore	0.6319	0.6324	0.5059	0.5621	train_dataset_final
(-188.362, 20005.755]	70	EstimatedSalary	0.7143	0.6750	0.7941	0.7297	test_dataset_final
(20005.755, 39999.93]	58	EstimatedSalary	0.7069	0.7941	0.7297	0.7606	test_dataset_final
(39999.93, 59994.105]	56	EstimatedSalary	0.6250	0.5714	0.5000	0.5333	test_dataset_final
(59994.105, 79988.28]	64	EstimatedSalary	0.6875	0.7600	0.5758	0.6552	test_dataset_final
(79988.28, 99982.455]	65	EstimatedSalary	0.6462	0.5946	0.7333	0.6567	test_dataset_final
(99982.455, 119976.63]	61	EstimatedSalary	0.6721	0.6333	0.6786	0.6552	test_dataset_final
(119976.63, 139970.805]	74	EstimatedSalary	0.6892	0.6000	0.7000	0.6462	test_dataset_final
(139970.805, 159964.98]	71	EstimatedSalary	0.6338	0.5882	0.6250	0.6061	test_dataset_final
(159964.98, 179959.155]	61	EstimatedSalary	0.4590	0.5152	0.5000	0.5075	test_dataset_final
(179959.155, 199953.33]	67	EstimatedSalary	0.6716	0.5333	0.6667	0.5926	test_dataset_final
(-188.362, 20005.755]	262	EstimatedSalary	0.6603	0.6594	0.6842	0.6716	train_dataset_final
(20005.755, 39999.93]	252	EstimatedSalary	0.6468	0.6281	0.6333	0.6307	train_dataset_final
(39999.93, 59994.105]	251	EstimatedSalary	0.6813	0.6767	0.7087	0.6923	train_dataset_final
(59994.105, 79988.28]	269	EstimatedSalary	0.6320	0.6525	0.6479	0.6502	train_dataset_final
(79988.28, 99982.455]	237	EstimatedSalary	0.5949	0.6015	0.6504	0.6250	train_dataset_final
(99982.455, 119976.63]	267	EstimatedSalary	0.6067	0.6107	0.5970	0.6038	train_dataset_final
(119976.63, 139970.805]	257	EstimatedSalary	0.6226	0.6343	0.6391	0.6367	train_dataset_final
(139970.805, 159964.98]	254	EstimatedSalary	0.6496	0.6260	0.6417	0.6337	train_dataset_final
(159964.98, 179959.155]	274	EstimatedSalary	0.6168	0.6389	0.6345	0.6367	train_dataset_final
(179959.155, 199953.33]	262	EstimatedSalary	0.6107	0.6099	0.6466	0.6277	train_dataset_final
(-0.001, 0.1]	329	Gender_Male	0.6505	0.6364	0.7733	0.6982	test_dataset_final
(0.9, 1.0]	318	Gender_Male	0.6541	0.6091	0.5000	0.5492	test_dataset_final
(-0.001, 0.1]	1239	Gender_Male	0.6279	0.6469	0.7997	0.7153	train_dataset_final
(0.9, 1.0]	1346	Gender_Male	0.6360	0.6081	0.4608	0.5243	train_dataset_final
(-0.001, 0.1]	458	Geography_Germany	0.6441	0.5608	0.4586	0.5046	test_dataset_final
(0.9, 1.0]	189	Geography_Germany	0.6720	0.6842	0.9360	0.7905	test_dataset_final
(-0.001, 0.1]	1791	Geography_Germany	0.6097	0.5653	0.4545	0.5039	train_dataset_final
(0.9, 1.0]	794	Geography_Germany	0.6826	0.6948	0.9338	0.7968	train_dataset_final
(-0.001, 0.1]	498	Geography_Spain	0.6305	0.6255	0.6798	0.6515	test_dataset_final
(0.9, 1.0]	149	Geography_Spain	0.7248	0.6364	0.5283	0.5773	test_dataset_final
(-0.001, 0.1]	1960	Geography_Spain	0.6347	0.6411	0.6970	0.6679	train_dataset_final
(0.9, 1.0]	625	Geography_Spain	0.6240	0.5972	0.4657	0.5233	train_dataset_final
(-0.001, 0.1]	201	HasCrCard	0.6418	0.5526	0.7500	0.6364	test_dataset_final
(0.9, 1.0]	446	HasCrCard	0.6570	0.6683	0.6171	0.6417	test_dataset_final
(-0.001, 0.1]	789	HasCrCard	0.5995	0.6103	0.6341	0.6220	train_dataset_final
(0.9, 1.0]	1796	HasCrCard	0.6464	0.6451	0.6544	0.6498	train_dataset_final
(-0.001, 0.1]	348	IsActiveMember	0.6264	0.6289	0.8214	0.7124	test_dataset_final
(0.9, 1.0]	299	IsActiveMember	0.6823	0.6190	0.3545	0.4509	test_dataset_final
(-0.001, 0.1]	1384	IsActiveMember	0.6264	0.6488	0.8104	0.7207	train_dataset_final
(0.9, 1.0]	1201	IsActiveMember	0.6386	0.5852	0.3737	0.4561	train_dataset_final
(0.997, 1.3]	388	NumOfProducts	0.6366	0.6820	0.6727	0.6773	test_dataset_final
(1.9, 2.2]	223	NumOfProducts	0.7040	0.4096	0.6667	0.5075	test_dataset_final
(2.8, 3.1]	29	NumOfProducts	0.5172	0.9375	0.5357	0.6818	test_dataset_final
(3.7, 4.0]	7	NumOfProducts	0.4286	1.0000	0.4286	0.6000	test_dataset_final
(0.997, 1.3]	1487	NumOfProducts	0.6301	0.6992	0.6837	0.6914	train_dataset_final
(1.9, 2.2]	895	NumOfProducts	0.6525	0.3704	0.5909	0.4553	train_dataset_final
(2.8, 3.1]	166	NumOfProducts	0.5843	0.9560	0.5724	0.7160	train_dataset_final
(3.7, 4.0]	37	NumOfProducts	0.4324	1.0000	0.4324	0.6038	train_dataset_final
(-0.01, 1.0]	93	Tenure	0.7419	0.7200	0.7826	0.7500	test_dataset_final
(1.0, 2.0]	70	Tenure	0.6000	0.6286	0.5946	0.6111	test_dataset_final
(2.0, 3.0]	76	Tenure	0.5263	0.4062	0.4333	0.4194	test_dataset_final
(3.0, 4.0]	58	Tenure	0.6034	0.5926	0.5714	0.5818	test_dataset_final
(4.0, 5.0]	69	Tenure	0.5797	0.5588	0.5758	0.5672	test_dataset_final
(5.0, 6.0]	51	Tenure	0.6863	0.7391	0.6296	0.6800	test_dataset_final
(6.0, 7.0]	63	Tenure	0.7460	0.6875	0.7857	0.7333	test_dataset_final
(7.0, 8.0]	67	Tenure	0.6418	0.5588	0.6786	0.6129	test_dataset_final
(8.0, 9.0]	73	Tenure	0.6438	0.6486	0.6486	0.6486	test_dataset_final
(9.0, 10.0]	27	Tenure	0.8889	0.8000	1.0000	0.8889	test_dataset_final
(-0.01, 1.0]	372	Tenure	0.6344	0.6699	0.6699	0.6699	train_dataset_final
(1.0, 2.0]	258	Tenure	0.6124	0.5755	0.6612	0.6154	train_dataset_final
(2.0, 3.0]	280	Tenure	0.6429	0.6467	0.6736	0.6599	train_dataset_final
(3.0, 4.0]	282	Tenure	0.6454	0.6503	0.6503	0.6503	train_dataset_final
(4.0, 5.0]	253	Tenure	0.6561	0.6429	0.7087	0.6742	train_dataset_final
(5.0, 6.0]	248	Tenure	0.5968	0.6080	0.5984	0.6032	train_dataset_final
(6.0, 7.0]	250	Tenure	0.5920	0.5285	0.5963	0.5603	train_dataset_final
(7.0, 8.0]	266	Tenure	0.6466	0.6508	0.6212	0.6357	train_dataset_final
(8.0, 9.0]	244	Tenure	0.6598	0.6692	0.6850	0.6770	train_dataset_final
(9.0, 10.0]	132	Tenure	0.6288	0.7193	0.5541	0.6260	train_dataset_final
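
Reading a table this long is easier with a small filter that lists which metrics fall below the plotted thresholds in each slice. The sketch below applies the thresholds quoted in the description (0.75 accuracy, 0.5 precision and recall, 0.7 F1) to a handful of test rows copied from the table. Treating "strictly below" as a failure and using 20 records as a sparse-bin cutoff are both assumptions made for illustration.

```python
THRESHOLDS = {"accuracy": 0.75, "precision": 0.5, "recall": 0.5, "f1": 0.7}
MIN_RECORDS = 20  # hypothetical cutoff for calling a bin sparse

# (feature, slice, records, accuracy, precision, recall, f1) from test_dataset_final rows.
test_slices = [
    ("Balance", "(175628.663, 200718.472]", 10, 0.3000, 0.3333, 0.4000, 0.3636),
    ("CreditScore", "(700.0, 750.0]", 96, 0.7500, 0.7500, 0.7500, 0.7500),
    ("Tenure", "(9.0, 10.0]", 27, 0.8889, 0.8000, 1.0000, 0.8889),
    ("IsActiveMember", "(0.9, 1.0]", 299, 0.6823, 0.6190, 0.3545, 0.4509),
    ("Geography_Germany", "(0.9, 1.0]", 189, 0.6720, 0.6842, 0.9360, 0.7905),
]


def failing_metrics(accuracy, precision, recall, f1):
    """Names of metrics that fall strictly below their threshold."""
    values = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    return [name for name, value in values.items() if value < THRESHOLDS[name]]


report = {}
for feature, bin_label, records, *metrics in test_slices:
    failures = failing_metrics(*metrics)
    sparse = records < MIN_RECORDS
    report[(feature, bin_label)] = (failures, sparse)
    note = " (sparse bin)" if sparse else ""
    print(f"{feature} {bin_label}: below threshold on {failures or 'nothing'}{note}")
```

Flagging sparse bins alongside the failures helps separate weak spots backed by hundreds of records, such as the active-member slice, from extreme values driven by a few rows.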

Figures

ValidMind Figures validmind.model_validation.sklearn.WeakspotsDiagnosis (40 figures)

▶ Test Result: Overfit Diagnosis (validmind.model_validation.sklearn.OverfitDiagnosis)

Overfit Diagnosis

The Overfit Diagnosis test evaluates differences between training and test performance across feature-level segments to identify regions where the AUC gap exceeds the configured threshold of 0.04. The results are presented as feature slices with training and test record counts, training AUC, test AUC, and the resulting gap, alongside bar charts showing the direction and magnitude of these gaps. The reported overfit regions span numeric variables including CreditScore, Tenure, Balance, EstimatedSalary, and NumOfProducts, as well as the binary indicator Geography_Germany.

Key insights:

Largest gaps occur in Balance slices: The highest reported gaps are in Balance segments (175628.663, 200718.472] and (200718.472, 225808.281], with gaps of 0.2803 and 0.2778 respectively. These slices also have small sample counts, with 48/10 and 15/4 training/test records.
NumOfProducts shows a concentrated overfit region: The slice (2.8, 3.1] for NumOfProducts has a gap of 0.2462, with training AUC of 0.7820 and test AUC of 0.5357 across 166 training and 29 test records. No other NumOfProducts slice is reported in the table.
Tenure has the most flagged slices: Within Tenure, the slice (2.0, 3.0] shows the largest gap at 0.1444, based on training AUC of 0.7220 and test AUC of 0.5775. Three additional Tenure slices exceed the threshold with smaller gaps of 0.0550, 0.0417, and 0.0470.
CreditScore contains multiple flagged segments: Three CreditScore intervals are listed above threshold: (400.0, 450.0] with a gap of 0.1580, (500.0, 550.0] with 0.0668, and (650.0, 700.0] with 0.0705. The largest of these, (400.0, 450.0], is associated with 50 training and 11 test records.
EstimatedSalary and Geography_Germany show smaller but above-threshold gaps: EstimatedSalary has three flagged slices with gaps of 0.0525, 0.0567, and 0.0855, while Geography_Germany (0.9, 1.0] shows a gap of 0.0542 with 794 training and 189 test records.

The test identifies several feature regions where training AUC exceeds test AUC by more than the 0.04 threshold, with the largest observed discrepancies concentrated in specific Balance and NumOfProducts slices. CreditScore and Tenure contribute multiple flagged segments, indicating that the performance gap is distributed across more than one interval within those features. EstimatedSalary and Geography_Germany also show above-threshold gaps, but at lower magnitudes than the most extreme regions.

Tables

Overfit Diagnosis

Feature	Slice	Number of Training Records	Number of Test Records	Training AUC	Test AUC	Gap
CreditScore	(400.0, 450.0]	50	11	0.6913	0.5333	0.1580
CreditScore	(500.0, 550.0]	264	63	0.7212	0.6543	0.0668
CreditScore	(650.0, 700.0]	511	113	0.7159	0.6454	0.0705
Tenure	(2.0, 3.0]	280	76	0.7220	0.5775	0.1444
Tenure	(3.0, 4.0]	282	58	0.7003	0.6452	0.0550
Tenure	(4.0, 5.0]	253	69	0.6940	0.6524	0.0417
Tenure	(7.0, 8.0]	266	67	0.7000	0.6529	0.0470
Balance	(125449.045, 150538.854]	492	113	0.6991	0.6347	0.0645
Balance	(175628.663, 200718.472]	48	10	0.5203	0.2400	0.2803
Balance	(200718.472, 225808.281]	15	4	0.2778	0.0000	0.2778
NumOfProducts	(2.8, 3.1]	166	29	0.7820	0.5357	0.2462
EstimatedSalary	(39999.93, 59994.105]	251	56	0.7400	0.6875	0.0525
EstimatedSalary	(139970.805, 159964.98]	254	71	0.7065	0.6498	0.0567
EstimatedSalary	(159964.98, 179959.155]	274	61	0.6716	0.5861	0.0855
Geography_Germany	(0.9, 1.0]	794	189	0.6643	0.6101	0.0542
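
The Gap column is the training AUC minus the test AUC, and a slice is reported when that gap exceeds 0.04. The sketch below recomputes the gap for a few rows copied from the table and ranks them. It is a plain-Python illustration, not ValidMind's implementation, and the recomputed gaps can differ from the table in the fourth decimal because the AUC values shown are rounded.

```python
GAP_THRESHOLD = 0.04  # the configured threshold quoted in the description above

# (feature, slice, training records, test records, training AUC, test AUC) from the table.
slices = [
    ("CreditScore", "(400.0, 450.0]", 50, 11, 0.6913, 0.5333),
    ("Tenure", "(2.0, 3.0]", 280, 76, 0.7220, 0.5775),
    ("Tenure", "(4.0, 5.0]", 253, 69, 0.6940, 0.6524),
    ("Balance", "(175628.663, 200718.472]", 48, 10, 0.5203, 0.2400),
    ("NumOfProducts", "(2.8, 3.1]", 166, 29, 0.7820, 0.5357),
    ("Geography_Germany", "(0.9, 1.0]", 794, 189, 0.6643, 0.6101),
]

flagged = []
for feature, bin_label, n_train, n_test, train_auc, test_auc in slices:
    gap = train_auc - test_auc
    if gap > GAP_THRESHOLD:
        flagged.append((feature, bin_label, n_test, gap))

# Largest gaps first, with the test record count kept for context.
flagged.sort(key=lambda item: item[3], reverse=True)
for feature, bin_label, n_test, gap in flagged:
    print(f"{feature} {bin_label}: gap={gap:.4f} on {n_test} test records")
```

Keeping the test record count next to each gap is a reminder that the largest gaps here come from slices with very few test records, where a single misranked record can move the AUC substantially.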

Figures

ValidMind Figures validmind.model_validation.sklearn.OverfitDiagnosis (10 figures)

▶ Test Result: Robustness Diagnosis (validmind.model_validation.sklearn.RobustnessDiagnosis)

✅ Robustness Diagnosis

The Robustness Diagnosis test evaluates model robustness by measuring AUC under increasing Gaussian noise applied to numeric input features. Results are reported for the training and test datasets across perturbation sizes from 0.0 to 0.5 standard deviations, together with the associated performance decay and pass status. Baseline AUC values are 0.6868 for train_dataset_final and 0.6954 for test_dataset_final, and subsequent rows show how these values change as perturbation size increases. The accompanying plot visualizes these trajectories relative to fixed threshold lines for each dataset.

Key insights:

All perturbation runs passed: Every evaluated perturbation level for both train_dataset_final and test_dataset_final is marked as passed, including all noise scales from 0.1 through 0.5.
Training AUC declines progressively: Training-set AUC decreases from 0.6868 at baseline to 0.6675 at perturbation size 0.5. The corresponding performance decay rises from 0.0009 at 0.1 to 0.0193 at 0.5, indicating a monotonic deterioration as noise increases.
Test AUC remains comparatively stable: Test-set AUC starts at 0.6954 and remains within a narrow range from 0.6811 to 0.6970 across all perturbation levels. Performance decay varies between -0.0016 and 0.0143, with the largest decay observed at perturbation size 0.3.
Small non-monotonic variation on test data: Unlike the training trajectory, the test-set series does not decline steadily with increasing noise. AUC increases slightly above baseline at perturbation size 0.1, declines through 0.3, and then recovers to 0.6912 and 0.6948 at perturbation sizes 0.4 and 0.5.
Observed AUC levels stay above plotted thresholds: In the figure, both dataset trajectories remain above their respective dashed threshold lines across the full perturbation range shown.

The robustness results show that model performance under Gaussian perturbation differs by dataset. The training dataset exhibits a steady reduction in AUC as perturbation size increases, while the test dataset shows more limited and non-monotonic variation with recovery at higher perturbation levels. Across all evaluated scenarios, the reported runs pass and the plotted AUC values remain above the displayed threshold lines.
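The mechanism described above (add Gaussian noise to numeric features, then re-measure AUC) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the ValidMind implementation: the toy data, the `toy_score` stand-in for a model, and the helper names `auc` and `perturb` are all hypothetical, and it assumes that a perturbation size of `s` means noise with standard deviation `s` times each feature's own standard deviation.

```python
import random
import statistics


def auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation, averaging ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def perturb(rows, scale, rng):
    """Add zero-mean Gaussian noise to every column, with sigma = scale * column std."""
    stds = [statistics.pstdev(col) for col in zip(*rows)]
    return [[x + rng.gauss(0.0, scale * s) for x, s in zip(row, stds)] for row in rows]


def toy_score(row):
    # Hypothetical stand-in for a fitted model's predicted probability.
    return 0.8 * row[0] + 0.2 * row[1]


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(500)]
# Toy features: the first is weakly informative, the second is pure noise.
rows = [[y + rng.gauss(0, 1.5), rng.gauss(0, 1.0)] for y in labels]

baseline_auc = auc(labels, [toy_score(r) for r in rows])
results = []
for scale in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]:
    noisy_rows = perturb(rows, scale, rng)
    noisy_auc = auc(labels, [toy_score(r) for r in noisy_rows])
    results.append((scale, noisy_auc, baseline_auc - noisy_auc))
    print(f"scale={scale:.1f}  auc={noisy_auc:.4f}  decay={baseline_auc - noisy_auc:+.4f}")
```

Because each perturbation level draws fresh random noise, the AUC does not have to fall steadily as the scale grows, which is consistent with the non-monotonic test-set series reported above.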

Tables

Perturbation Size	Dataset	Row Count	AUC	Performance Decay	Passed
Baseline (0.0)	train_dataset_final	2585	0.6868	0.0000	True
Baseline (0.0)	test_dataset_final	647	0.6954	0.0000	True
0.1	train_dataset_final	2585	0.6860	0.0009	True
0.1	test_dataset_final	647	0.6970	-0.0016	True
0.2	train_dataset_final	2585	0.6818	0.0050	True
0.2	test_dataset_final	647	0.6872	0.0081	True
0.3	train_dataset_final	2585	0.6808	0.0060	True
0.3	test_dataset_final	647	0.6811	0.0143	True
0.4	train_dataset_final	2585	0.6740	0.0128	True
0.4	test_dataset_final	647	0.6912	0.0041	True
0.5	train_dataset_final	2585	0.6675	0.0193	True
0.5	test_dataset_final	647	0.6948	0.0006	True
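The Performance Decay column is consistent with baseline AUC minus perturbed AUC for the same dataset, which the following sketch checks against the rows above. The values are copied from the table; `ASSUMED_DECAY_THRESHOLD` is a hypothetical value chosen for illustration, since the threshold used in this run is not stated in the output. Small differences of about 0.0001 are expected because the table shows values rounded to four decimal places.

```python
# (perturbation size, dataset, AUC, reported decay), copied from the table above.
table = [
    (0.0, "train_dataset_final", 0.6868, 0.0000),
    (0.0, "test_dataset_final", 0.6954, 0.0000),
    (0.1, "train_dataset_final", 0.6860, 0.0009),
    (0.1, "test_dataset_final", 0.6970, -0.0016),
    (0.2, "train_dataset_final", 0.6818, 0.0050),
    (0.2, "test_dataset_final", 0.6872, 0.0081),
    (0.3, "train_dataset_final", 0.6808, 0.0060),
    (0.3, "test_dataset_final", 0.6811, 0.0143),
    (0.4, "train_dataset_final", 0.6740, 0.0128),
    (0.4, "test_dataset_final", 0.6912, 0.0041),
    (0.5, "train_dataset_final", 0.6675, 0.0193),
    (0.5, "test_dataset_final", 0.6948, 0.0006),
]

# Assumption for illustration only: the real threshold is not shown in the output.
ASSUMED_DECAY_THRESHOLD = 0.05
ROUNDING_TOLERANCE = 2e-4  # allows for four-decimal rounding in the table

baseline = {dataset: auc for size, dataset, auc, _ in table if size == 0.0}

recomputed = []
for size, dataset, auc, reported_decay in table:
    decay = baseline[dataset] - auc  # positive means performance dropped
    agrees = abs(decay - reported_decay) <= ROUNDING_TOLERANCE
    passed = decay <= ASSUMED_DECAY_THRESHOLD
    recomputed.append((size, dataset, decay, agrees, passed))

# Largest decay per dataset and the perturbation size where it occurs.
worst = {}
for size, dataset, decay, _, _ in recomputed:
    if dataset not in worst or decay > worst[dataset][1]:
        worst[dataset] = (size, decay)

for dataset, (size, decay) in worst.items():
    print(f"{dataset}: largest decay {decay:.4f} at perturbation size {size}")
```

Under this assumed threshold every row passes, and the largest decay falls at perturbation size 0.5 for the training data and 0.3 for the test data, matching the key insights.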

Figures

ValidMind Figure validmind.model_validation.sklearn.RobustnessDiagnosis:780b

In summary

In this second notebook, you learned how to:

Import a sample dataset
Identify which tests you might want to run with ValidMind
Initialize ValidMind datasets and model objects
Run individual tests
Utilize the output from tests you've run
Log results from individual tests or sets of tests as evidence to the ValidMind Platform
Add supplementary individual test results to your documentation
Assign predictions to your ValidMind model objects

Next steps

Integrate custom tests

Now that you're familiar with the basics of using the ValidMind Library to run and log tests to provide evidence for your documentation, let's learn how to incorporate your own custom tests into ValidMind: 3 — Integrate custom tests
