Validate an application scorecard model

Learn how to independently assess an application scorecard model developed using the ValidMind Library as a validator. You'll evaluate the development of the model by conducting thorough testing and analysis, including the use of challenger models to benchmark performance.

An application scorecard model is a type of statistical model used in credit scoring to evaluate the creditworthiness of potential borrowers by generating a score based on various characteristics of an applicant such as credit history, income, employment status, and other relevant financial data.

This score assists lenders in making informed decisions about whether to approve or reject loan applications, as well as in determining the terms of the loan, including interest rates and credit limits.
Effective validation of application scorecard models ensures that lenders can manage risk efficiently while maintaining a fast and transparent loan application process for applicants.

This interactive notebook provides a step-by-step guide for:

Verifying the data quality steps performed by the model development team
Independently replicating the champion model's results and conducting additional tests to assess performance, stability, and robustness
Setting up test inputs and challenger models for comparative analysis
Running validation tests, analyzing results, and logging findings to ValidMind

About ValidMind

ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Library to automate comparison and other validation tests, and then use the ValidMind Platform to submit compliance assessments of champion models via comprehensive validation reports. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between you and model developers.

Before you begin

This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with pip install, and then re-run the notebook. For more help, refer to Installing Python Modules.

New to ValidMind?

If you haven't already seen our documentation on the ValidMind Library, we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.

For access to all features available in this notebook, you'll need access to a ValidMind account.

Register with ValidMind

Key concepts

Validation report: A comprehensive and structured assessment of a model’s development and performance, focusing on verifying its integrity, appropriateness, and alignment with its intended use. It includes analyses of model assumptions, data quality, performance metrics, outcomes of testing procedures, and risk considerations. The validation report supports transparency, regulatory compliance, and informed decision-making by documenting the validator’s independent review and conclusions.

Validation report template: Serves as a standardized framework for conducting and documenting model validation activities. It outlines the required sections, recommended analyses, and expected validation tests, ensuring consistency and completeness across validation reports. The template helps guide validators through a systematic review process while promoting comparability and traceability of validation outcomes.

Tests: Functions contained in the ValidMind Library, each designed to run a specific quantitative test on a dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets.

Metrics: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.

Custom metrics: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with the ValidMind Library to be used in the ValidMind Platform.

Inputs: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:

model: A single model that has been initialized in ValidMind with vm.init_model().
dataset: A single dataset that has been initialized in ValidMind with vm.init_dataset().
models: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.
datasets: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. (Learn more: Run tests with multiple datasets)

Parameters: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.

Outputs: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.
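As a minimal sketch of the table output shape described above, the function below returns a list of dictionaries, one per row. The function name class_balance_table and its logic are hypothetical and are not part of the ValidMind Library; registering such a function with ValidMind is not shown here.

```python
from collections import Counter


def class_balance_table(labels):
    """Hypothetical custom metric: one row per class with its count and share."""
    counts = Counter(labels)
    total = len(labels)
    # Each dictionary represents one row of the resulting table
    return [
        {"class": cls, "count": n, "share": n / total}
        for cls, n in sorted(counts.items())
    ]


rows = class_balance_table([0, 0, 0, 1])
for row in rows:
    print(row)
```

The same rows could equally be returned as a pandas DataFrame if that is more convenient for downstream use.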

Setting up

Register a sample model

In a usual model lifecycle, a champion model will have been independently registered in your model inventory and submitted to you for validation by your model development team as part of the effective challenge process. (Learn more: Submit for approval)

For this notebook, we'll have you register a dummy model in the ValidMind Platform inventory and assign yourself as the validator to familiarize you with the ValidMind interface and circumvent the need for an existing model:

In a browser, log in to ValidMind.
In the left sidebar, navigate to Inventory and click + Register Model.
Enter the model details and click Continue. (Need more help?)

For example, to register a model for use with this notebook, select:
- Documentation template: Credit Risk Scorecard
- Use case: Credit Risk — CECL
You can fill in other options according to your preference.

Assign validator credentials

In order to log tests as a validator instead of as a developer, on the model details page that appears after you've successfully registered your sample model:

Remove yourself as a model owner:
- Click on the OWNERS tile.
- Click the x next to your name to remove yourself from that model's role.
- Click Save to apply your changes to that role.
Remove yourself as a developer:
- Click on the DEVELOPERS tile.
- Click the x next to your name to remove yourself from that model's role.
- Click Save to apply your changes to that role.
Add yourself as a validator:
- Click on the VALIDATORS tile.
- Select your name from the drop-down menu.
- Click Save to apply your changes to that role.

Install the ValidMind Library

Recommended Python versions

Python 3.8 <= x <= 3.11

To install the library:

%pip install -q validmind

Initialize the ValidMind Library

ValidMind generates a unique code snippet for each registered model to connect with your validation environment. You initialize the ValidMind Library with this code snippet, which ensures that your test results are uploaded to the correct model when you run the notebook.

Get your code snippet

In a browser, log in to ValidMind.
In the left sidebar, navigate to Inventory and select the model you registered for this notebook.
Go to Getting Started and click Copy snippet to clipboard.

Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)

Importing the champion model

With the ValidMind Library set up and ready to go, let's go ahead and import the champion model submitted by the model development team in the format of a .pkl file: xgb_model_champion.pkl

import xgboost as xgb

# Load the saved model
xgb_model = xgb.XGBClassifier()
xgb_model.load_model("xgb_model_champion.pkl")
xgb_model

# Ensure that the feature names from the champion model and the dataset are in the same order
cols_when_model_builds = xgb_model.get_booster().feature_names

Load the sample dataset

Let's next import the public Lending Club dataset from Kaggle, which was used to develop the dummy champion model.

We'll use this dataset to review steps that should have been conducted during the initial development and documentation of the model to ensure that the model was built correctly.
By independently performing steps such as preprocessing and feature engineering, we can confirm whether the model was built using appropriate and properly processed data.

To be able to use the dataset, you'll need to import the dataset and load it into a pandas DataFrame, a two-dimensional tabular data structure that makes use of rows and columns:

from validmind.datasets.credit_risk import lending_club

df = lending_club.load_data(source="offline")
df.head()

Preprocess the dataset

We'll first quickly preprocess the dataset for data quality testing purposes using lending_club.preprocess. This function performs the following operations:

Filters the dataset to include only loans for debt consolidation or credit card purposes
Removes loans classified under the riskier grades "F" and "G"
Excludes uncommon home ownership types and standardizes employment length and loan terms into numerical formats
Discards unnecessary fields and any entries with missing information to maintain a clean and robust dataset for modeling

preprocess_df = lending_club.preprocess(df)

Apply feature engineering to the dataset

Feature engineering improves the dataset's structure to better match what our model expects, and ensures that the model performs optimally by leveraging additional insights from raw data.

We'll apply the following transformations using the lending_club.feature_engineering() function to optimize the dataset for predictive modeling in our application scorecard:

WoE encoding: Converts both numerical and categorical features into Weight of Evidence (WoE) values. WoE is a statistical measure used in scorecard modeling that quantifies the relationship between a predictor variable and the binary target variable. It is calculated as the natural logarithm of the ratio of the distribution of good outcomes to the distribution of bad outcomes for each category or bin of a feature. This transformation helps to ensure that the features are predictive and consistent in their contribution to the model.
Integration of WoE bins: Ensures that the WoE transformed values are integrated throughout the dataset, replacing the original feature values while excluding the target variable from this transformation. This transformation is used to maintain a consistent scale and impact of each variable within the model, which helps make the predictions more stable and accurate.

fe_df = lending_club.feature_engineering(preprocess_df)
fe_df.head()
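To make the WoE calculation described above concrete, here is a small standalone sketch using only the Python standard library. It is not the implementation used by lending_club.feature_engineering(). The helper weight_of_evidence, the toy data, and the convention that a target of 1 marks a bad outcome are all assumptions for illustration.

```python
import math
from collections import defaultdict


def weight_of_evidence(bins, targets):
    """Return {bin: WoE}, where WoE = ln(share of goods / share of bads).

    Assumption: target 1 = bad outcome (e.g. default), target 0 = good outcome.
    """
    goods = defaultdict(int)
    bads = defaultdict(int)
    for b, t in zip(bins, targets):
        if t == 1:
            bads[b] += 1
        else:
            goods[b] += 1

    total_good = sum(goods.values())
    total_bad = sum(bads.values())

    woe = {}
    for b in set(bins):
        # Assumes every bin holds at least one good and one bad record;
        # real implementations usually add smoothing to avoid log(0).
        dist_good = goods[b] / total_good
        dist_bad = bads[b] / total_bad
        woe[b] = math.log(dist_good / dist_bad)
    return woe


# Toy data: grade "A" is mostly good, grade "B" is mostly bad
grades = ["A", "A", "A", "A", "B", "B", "B", "B"]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]
woe_by_grade = weight_of_evidence(grades, defaulted)
print(woe_by_grade)
```

Under this convention, a positive WoE marks a bin with a higher share of good outcomes than the population overall, and a negative WoE marks a riskier bin. Some libraries use the opposite sign convention.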

Split the feature engineered dataset

With our dummy model imported and our independently preprocessed and feature engineered dataset ready to go, let's now split our dataset into training and testing sets to start the validation testing process.

Splitting our dataset into training and testing is essential for proper validation testing, as this helps assess how well the model generalizes to unseen data:

We begin by dividing our data, which is based on Weight of Evidence (WoE) features, into training and testing sets (train_df, test_df).
With lending_club.split, we employ a simple random split, randomly allocating data points to each set to ensure a mix of examples in both.

# Split the data
train_df, test_df = lending_club.split(fe_df, test_size=0.2)

x_train = train_df.drop(lending_club.target_column, axis=1)
y_train = train_df[lending_club.target_column]

x_test = test_df.drop(lending_club.target_column, axis=1)
y_test = test_df[lending_club.target_column]

# Now let's apply the order of features from the champion model construction
x_train = x_train[cols_when_model_builds]
x_test = x_test[cols_when_model_builds]

cols_use = [
    "annual_inc_woe",
    "verification_status_woe",
    "emp_length_woe",
    "installment_woe",
    "term_woe",
    "home_ownership_woe",
    "purpose_woe",
    "open_acc_woe",
    "total_acc_woe",
    "int_rate_woe",
    "sub_grade_woe",
    "grade_woe",
    "loan_status",
]


train_df = train_df[cols_use]
test_df = test_df[cols_use]
test_df.head()
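For intuition, the simple random split described above can be sketched in a few lines of standard-library Python. The helper random_split and its seed parameter are hypothetical; lending_club.split may differ in its details.

```python
import random


def random_split(rows, test_size=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle reproducible across runs
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # The first n_test shuffled rows become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = random_split(range(10), test_size=0.2)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two sets, so the test set remains unseen during training.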

Developing potential challenger models

Train potential challenger models

We're curious how alternate models compare to our champion model, so let's train two challenger models as a basis for our testing.

Our selected options below offer decreased implementation complexity, such as less manual preprocessing, which can reduce implementation risk. However, model risk is not determined by any single factor in isolation. It must be weighed together with trade-offs in predictive performance, ease of interpretability, and overall alignment with business objectives.

Random forest classification model

A random forest classification model is an ensemble machine learning algorithm that uses multiple decision trees to classify data. In ensemble learning, multiple models are combined to improve prediction accuracy and robustness.

Random forest classification models often achieve higher accuracy than simpler models because they can capture complex, non-linear relationships, but the combined output of many trees makes their predictions harder to interpret.

# Import the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Create the model instance with 50 decision trees
rf_model = RandomForestClassifier(
    n_estimators=50,
    random_state=42,
)

# Train the model
rf_model.fit(x_train, y_train)

Logistic regression model

A logistic regression model is a statistical machine learning algorithm that uses a linear equation (straight-line relationship between variables) and the logistic function (or sigmoid function, which maps any real-valued number to a range between 0 and 1) to classify data. In statistical modeling, a single equation is used to estimate the probability of an outcome based on input features.

Logistic regression models are simple and interpretable because they provide clear probability estimates and feature coefficients (numerical value that represents the influence of a particular input feature on the model's prediction), but they may struggle with capturing complex, non-linear relationships in the data.

# Import the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Logistic Regression grid params
log_reg_params = {
    "penalty": ["l1", "l2"],
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "solver": ["liblinear"],
}

# Grid search for Logistic Regression
from sklearn.model_selection import GridSearchCV

grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params)
grid_log_reg.fit(x_train, y_train)

# Logistic Regression best estimator
log_reg = grid_log_reg.best_estimator_
log_reg
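The relationship between the linear equation and the logistic function described above can be sketched as follows. The function predict_probability and the coefficient values are illustrative assumptions, not values taken from the fitted log_reg model.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Combine features linearly, then squash the result into (0, 1)."""
    # Linear part: intercept plus the weighted sum of the input features
    linear_score = intercept + sum(c * x for c, x in zip(coefficients, features))
    # Logistic (sigmoid) function maps any real number to a probability
    return 1.0 / (1.0 + math.exp(-linear_score))


# The weighted sum here is 0.5 * 1.0 + (-0.25) * 2.0 = 0.0
p_zero = predict_probability([1.0, 2.0], [0.5, -0.25], 0.0)
print(p_zero)
```

A linear score of zero maps to a probability of 0.5. Larger positive scores push the probability toward 1 and larger negative scores toward 0, which is why each coefficient can be read as the direction and strength of a feature's influence.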

Extract predicted probabilities

With our challenger models trained, let's extract the predicted probabilities from our three models:

# Champion — Application scorecard model
train_xgb_prob = xgb_model.predict_proba(x_train)[:, 1]
test_xgb_prob = xgb_model.predict_proba(x_test)[:, 1]

# Challenger — Random forest classification model
train_rf_prob = rf_model.predict_proba(x_train)[:, 1]
test_rf_prob = rf_model.predict_proba(x_test)[:, 1]

# Challenger — Logistic regression model
train_log_prob = log_reg.predict_proba(x_train)[:, 1]
test_log_prob = log_reg.predict_proba(x_test)[:, 1]

Compute binary predictions

Next, we'll convert the probability predictions from our three models into binary predictions, based on a threshold of 0.3:

If the probability is greater than 0.3, the prediction becomes 1 (positive).
Otherwise, it becomes 0 (negative).

cut_off_threshold = 0.3

# Champion — Application scorecard model
train_xgb_binary_predictions = (train_xgb_prob > cut_off_threshold).astype(int)
test_xgb_binary_predictions = (test_xgb_prob > cut_off_threshold).astype(int)

# Challenger — Random forest classification model
train_rf_binary_predictions = (train_rf_prob > cut_off_threshold).astype(int)
test_rf_binary_predictions = (test_rf_prob > cut_off_threshold).astype(int)

# Challenger — Logistic regression model
train_log_binary_predictions = (train_log_prob > cut_off_threshold).astype(int)
test_log_binary_predictions = (test_log_prob > cut_off_threshold).astype(int)

Initializing the ValidMind objects

Initialize the ValidMind datasets

Before you can run tests, you'll need to connect your data with a ValidMind Dataset object. This step is necessary whenever you want to connect a dataset to documentation and produce test results through ValidMind, but you only need to do it once per dataset.

Initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module. For this example, we'll pass in the following arguments:

dataset — The raw dataset that you want to provide as input to tests.
input_id — A unique identifier that allows tracking what inputs are used when running each individual test.
target_column — A required argument if tests require access to true values. This is the name of the target column in the dataset.

# Initialize the raw dataset
vm_raw_dataset = vm.init_dataset(
    dataset=df,
    input_id="raw_dataset",
    target_column=lending_club.target_column,
)

# Initialize the preprocessed dataset
vm_preprocess_dataset = vm.init_dataset(
    dataset=preprocess_df,
    input_id="preprocess_dataset",
    target_column=lending_club.target_column,
)

# Initialize the feature engineered dataset
vm_fe_dataset = vm.init_dataset(
    dataset=fe_df,
    input_id="fe_dataset",
    target_column=lending_club.target_column,
)

# Initialize the training dataset
vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=lending_club.target_column,
)

# Initialize the test dataset
vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset",
    target_column=lending_club.target_column,
)

After initialization, you can pass the ValidMind Dataset objects vm_raw_dataset, vm_preprocess_dataset, vm_fe_dataset, vm_train_ds, and vm_test_ds into any ValidMind tests.

Initialize the model objects

You'll also need to initialize a ValidMind model object (vm_model) for each of our three models, so that each model can be passed to other functions for analysis and testing.

You simply initialize this model object with vm.init_model():

# Initialize the champion application scorecard model
vm_xgb_model = vm.init_model(
    xgb_model,
    input_id="xgb_model_developer_champion",
)

# Initialize the challenger random forest classification model
vm_rf_model = vm.init_model(
    rf_model,
    input_id="rf_model",
)

# Initialize the challenger logistic regression model
vm_log_model = vm.init_model(
    log_reg,
    input_id="log_model",
)

Assign predictions

With our models registered, we'll move on to assigning both the predictive probabilities coming directly from each model's predictions, and the binary prediction after applying the cutoff threshold described in the Compute binary predictions step above.

The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

# Champion — Application scorecard model
vm_train_ds.assign_predictions(
    model=vm_xgb_model,
    prediction_values=train_xgb_binary_predictions,
    prediction_probabilities=train_xgb_prob,
)

vm_test_ds.assign_predictions(
    model=vm_xgb_model,
    prediction_values=test_xgb_binary_predictions,
    prediction_probabilities=test_xgb_prob,
)

# Challenger — Random forest classification model
vm_train_ds.assign_predictions(
    model=vm_rf_model,
    prediction_values=train_rf_binary_predictions,
    prediction_probabilities=train_rf_prob,
)

vm_test_ds.assign_predictions(
    model=vm_rf_model,
    prediction_values=test_rf_binary_predictions,
    prediction_probabilities=test_rf_prob,
)


# Challenger — Logistic regression model
vm_train_ds.assign_predictions(
    model=vm_log_model,
    prediction_values=train_log_binary_predictions,
    prediction_probabilities=train_log_prob,
)

vm_test_ds.assign_predictions(
    model=vm_log_model,
    prediction_values=test_log_binary_predictions,
    prediction_probabilities=test_log_prob,
)

Compute credit risk scores

Finally, we'll translate model predictions into actionable scores using the probability estimates generated by each of our trained models:

# Compute the scores
train_xgb_scores = lending_club.compute_scores(train_xgb_prob)
test_xgb_scores = lending_club.compute_scores(test_xgb_prob)
train_rf_scores = lending_club.compute_scores(train_rf_prob)
test_rf_scores = lending_club.compute_scores(test_rf_prob)
train_log_scores = lending_club.compute_scores(train_log_prob)
test_log_scores = lending_club.compute_scores(test_log_prob)

# Assign scores to the datasets
vm_train_ds.add_extra_column("xgb_scores", train_xgb_scores)
vm_test_ds.add_extra_column("xgb_scores", test_xgb_scores)
vm_train_ds.add_extra_column("rf_scores", train_rf_scores)
vm_test_ds.add_extra_column("rf_scores", test_rf_scores)
vm_train_ds.add_extra_column("log_scores", train_log_scores)
vm_test_ds.add_extra_column("log_scores", test_log_scores)
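For intuition about how a probability can become a score, the sketch below uses a common scorecard scaling based on the log-odds. It does not necessarily match what lending_club.compute_scores does. The function probability_to_score and the parameter values (a target score of 600 at odds of 50:1, with 20 points to double the odds) are assumptions chosen for illustration.

```python
import math


def probability_to_score(prob_bad, target_score=600, target_odds=50, pdo=20):
    """Scale the log-odds of a good outcome to a points-based score.

    Assumed convention: higher score = lower risk.
    pdo = points needed to double the odds of a good outcome.
    """
    factor = pdo / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    # Odds of a good outcome relative to a bad one
    odds_good = (1 - prob_bad) / prob_bad
    return offset + factor * math.log(odds_good)


# Odds of 50:1 map to the target score; doubling the odds adds pdo points
print(probability_to_score(1 / 51))
print(probability_to_score(1 / 101))
```

With this scaling, every doubling of the good-to-bad odds adds a fixed number of points, which keeps score differences easy to interpret.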

Run data quality tests

With everything ready to go, let's explore some of ValidMind's available tests. Using ValidMind’s repository of tests streamlines your validation testing, and helps you ensure that your models are being validated appropriately.

To narrow down the tests we want to run from the selection provided by ValidMind, we'll use the vm.tests.list_tasks_and_tags() function to list which tags are associated with each task type:

tasks represent the kind of modeling task associated with a test. Here we'll focus on classification tasks.
tags are free-form descriptions providing more details about the test, for example, what category the test falls into. Here we'll focus on the data_quality tag.

vm.tests.list_tasks_and_tags()

Then we'll call the vm.tests.list_tests() function to list all the data quality tests for classification:

vm.tests.list_tests(
    tags=["data_quality"], task="classification"
)

Want to learn more about navigating ValidMind tests?

Refer to our notebook outlining the utilities available for viewing and understanding available ValidMind tests: Explore tests

Run and log an individual data quality test

Next, we'll use our previously initialized preprocessed dataset (vm_preprocess_dataset) as input to run an individual test, then log the result to the ValidMind Platform.

You run validation tests by calling the run_test function provided by the validmind.tests module.
Every test result returned by the run_test() function has a .log() method that can be used to send the test results to the ValidMind Platform.

Here, we'll use the HighPearsonCorrelation test as an example:

vm.tests.run_test(
    test_id="validmind.data_validation.HighPearsonCorrelation",
    inputs={
        "dataset": vm_preprocess_dataset
    }
).log()

Note the output returned indicating that a test-driven block doesn't currently exist in your model's documentation for some test IDs.

That's expected: when we run validation tests, the logged results need to be manually added to your report as part of your compliance assessment process within the ValidMind Platform. You'll continue to see this message throughout this notebook as we run and log more tests.

Log multiple data quality tests

Now that we understand how to run a test with ValidMind, we want to run all the tests that were returned for our classification tasks focusing on data_quality.

We'll store the identified tests in dq in preparation for batch running these tests and logging their results to the ValidMind Platform:

dq = vm.tests.list_tests(tags=["data_quality"], task="classification", pretty=False)
dq

With our data quality tests stored, let's run our first batch of tests using the same preprocessed dataset (vm_preprocess_dataset) and log their results.

for test in dq:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_preprocess_dataset
        }
    ).log()

Run data quality comparison tests

Next, let's reuse the tests in dq to perform comparison tests between the raw (vm_raw_dataset) and preprocessed (vm_preprocess_dataset) dataset, again logging the results to the ValidMind Platform:

for test in dq:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_raw_dataset, vm_preprocess_dataset]
        }
    ).log()
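Conceptually, an input grid like the one above can be thought of as expanding into one set of inputs per combination of the listed values. The sketch below only illustrates that idea and is not how the ValidMind Library implements input_grid. The helper expand_input_grid and the string placeholders standing in for dataset and model objects are hypothetical.

```python
from itertools import product


def expand_input_grid(input_grid):
    """Return one inputs dictionary per combination of the grid values."""
    keys = list(input_grid)
    value_lists = [input_grid[k] for k in keys]
    # The Cartesian product yields every combination of one value per key
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


# Placeholder strings stand in for ValidMind dataset and model objects
combos = expand_input_grid(
    {"dataset": ["test_dataset"], "model": ["xgb", "log", "rf"]}
)
for inputs in combos:
    print(inputs)
```

A grid with one dataset and three models therefore yields three runs, while a grid with two datasets and no models yields two.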

Run performance tests

We'll also run some performance tests, beginning with independent testing of our champion application scorecard model, then moving on to our potential challenger models.

Identify performance tests

This time, use vm.tests.list_tests() to identify all the model performance tests for classification:


vm.tests.list_tests(tags=["model_performance"], task="classification")

Run and log an individual performance test

Before we run our batch of performance tests, we'll use our previously initialized testing dataset (vm_test_ds) as input to run an individual test, then log the result to the ValidMind Platform.

When running individual tests, you can tag a result with a unique identifier by appending a custom result_id to the test_id with a : separator.

Here, we'll use the ClassifierPerformance test as an example and append an identifier for our champion model (xgboost_champion):

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.ClassifierPerformance:xgboost_champion",
    inputs={
        "dataset": vm_test_ds, "model": vm_xgb_model
    }
).log()
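The naming convention above, a test ID followed by a : separator and a result ID, can be illustrated with a short helper. The function split_test_id is hypothetical and only demonstrates the convention. It is not how the ValidMind Library parses identifiers.

```python
def split_test_id(full_id):
    """Split 'test_id:result_id' into its two parts (result_id may be absent)."""
    test_id, _, result_id = full_id.partition(":")
    # partition returns an empty string when there is no separator
    return test_id, result_id or None


tagged = split_test_id(
    "validmind.model_validation.sklearn.ClassifierPerformance:xgboost_champion"
)
untagged = split_test_id("validmind.data_validation.HighPearsonCorrelation")
print(tagged)
print(untagged)
```

Tagging results this way lets the same test appear more than once in a report, for example once for the champion alone and once for a comparison against challengers.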

Log multiple performance tests

We only want to run a few of the other tests that were returned for our classification tasks focusing on model_performance, so we'll isolate the specific tests we want to batch run in mpt.

Note the custom result_ids appended to the test_ids for our champion model (xgboost_champion):

mpt = [
    "validmind.model_validation.sklearn.ClassifierPerformance:xgboost_champion",
    "validmind.model_validation.sklearn.ConfusionMatrix:xgboost_champion",
    "validmind.model_validation.sklearn.MinimumAccuracy:xgboost_champion",
    "validmind.model_validation.sklearn.MinimumF1Score:xgboost_champion",
    "validmind.model_validation.sklearn.ROCCurve:xgboost_champion"
]

Evaluate performance of the champion model

Now, let's run and log our batch of model performance tests using our testing dataset (vm_test_ds) for our champion model:

for test in mpt:
    vm.tests.run_test(
        test,
        inputs={
            "dataset": vm_test_ds, "model": vm_xgb_model
        },
    ).log()

Evaluate performance of challenger models

We've now run tests on our champion model similar to those conducted by the model development team, with the aim of verifying their test results.

Next, let's see how our challenger models compare. We'll use the same batch of tests here as we did in mpt, but append a different result_id to indicate that these results should be associated with our challenger models:

mpt_chall = [
    "validmind.model_validation.sklearn.ClassifierPerformance:xgboost_champion_vs_challengers",
    "validmind.model_validation.sklearn.ConfusionMatrix:xgboost_champion_vs_challengers",
    "validmind.model_validation.sklearn.MinimumAccuracy:xgboost_champion_vs_challengers",
    "validmind.model_validation.sklearn.MinimumF1Score:xgboost_champion_vs_challengers",
    "validmind.model_validation.sklearn.ROCCurve:xgboost_champion_vs_challengers"
]

Enable custom context for test descriptions

When you run ValidMind tests, test descriptions are automatically generated by an LLM using the test results, the test name, and the static test definitions provided in the test’s docstring. While this metadata offers valuable high-level overviews of tests, insights produced by the LLM-based descriptions may not always align with your specific use cases or incorporate organizational policy requirements.

Before we run our next batch of tests, we'll include some custom use case context to focus on comparison testing going forward, improving the relevancy, insight, and format of the test descriptions returned. By default, custom context for LLM-generated descriptions is disabled, meaning that the output will not include any additional context. To enable custom use case context, set the VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED environment variable to 1.

This is a global setting that will affect all tests for your linked model:

import os
os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1"

Enabling use case context allows you to pass in additional context to the LLM-generated text descriptions within context:

import os
os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT_ENABLED"] = "1"

context = """
FORMAT FOR THE LLM DESCRIPTIONS: 
    **<Test Name>** is designed to <begin with a concise overview of what the test does and its primary purpose, 
    extracted from the test description>.

    The test operates by <write a paragraph about the test mechanism, explaining how it works and what it measures. 
    Include any relevant formulas or methodologies mentioned in the test description.>

    The primary advantages of this test include <write a paragraph about the test's strengths and capabilities, 
    highlighting what makes it particularly useful for specific scenarios.>

    Users should be aware that <write a paragraph about the test's limitations and potential risks. 
    Include both technical limitations and interpretation challenges. 
    If the test description includes specific signs of high risk, incorporate these here.>

    **Key Insights:**

    The test results reveal:

    - **<insight title>**: <comprehensive description of one aspect of the results>
    - **<insight title>**: <comprehensive description of another aspect>
    ...

    Based on these results, <conclude with a brief paragraph that ties together the test results with the test's 
    purpose and provides any final recommendations or considerations.>

ADDITIONAL INSTRUCTIONS:

    The champion model as the basis for comparison is called "xgb_model_developer_champion" and emphasis should be on the following:
    - The metrics for the champion model compared against the challenger models
    - Which model potentially outperforms the champion model based on the metrics, this should be highlighted and emphasized


    For each metric in the test results, include in the test overview:
    - The metric's purpose and what it measures
    - Its mathematical formula
    - The range of possible values
    - What constitutes good/bad performance
    - How to interpret different values

    Each insight should progressively cover:
    1. Overall scope and distribution
    2. Complete breakdown of all elements with specific values
    3. Natural groupings and patterns
    4. Comparative analysis between datasets/categories
    5. Stability and variations
    6. Notable relationships or dependencies

    Remember:
    - Champion model (xgb_model_developer_champion) is the selection and challenger models are used to challenge the selection
    - Keep all insights at the same level (no sub-bullets or nested structures)
    - Make each insight complete and self-contained
    - Include specific numerical values and ranges
    - Cover all elements in the results comprehensively
    - Maintain clear, concise language
    - Use only "- **Title**: Description" format for insights
    - Progress naturally from general to specific observations

""".strip()

os.environ["VALIDMIND_LLM_DESCRIPTIONS_CONTEXT"] = context

Want to learn more about setting custom context for LLM-generated test descriptions?

Refer to our extended walkthrough notebook: Add context to LLM-generated test descriptions

Run performance comparison tests

With the use case context set, we'll run each test in mpt_chall once for each model with the same vm_test_ds dataset to compare them:

for test in mpt_chall:
    vm.tests.run_test(
        test,
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_xgb_model, vm_log_model, vm_rf_model],
        },
    ).log()

Based on the performance metrics, we can conclude that the random forest classification model is not a viable candidate for our use case and can be disregarded in our tests going forward.

In the next section, we'll dive a bit deeper into some tests comparing our champion application scorecard model and our remaining challenger logistic regression model, including tests that will allow us to customize parameters and thresholds for performance standards.
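
As a rough illustration of what the comparison loop above does, an input_grid can be thought of as a Cartesian product of its values: one run per combination of dataset and model. The sketch below uses plain strings as stand-ins for the ValidMind objects and a hypothetical helper, expand_grid. It reflects an assumption about the general idea, not ValidMind's actual implementation.

```python
from itertools import product


def expand_grid(grid):
    """Expand a dict of lists into one dict per combination of values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


# Plain strings stand in for vm_test_ds and the three model objects
input_grid = {
    "dataset": ["vm_test_ds"],
    "model": ["vm_xgb_model", "vm_log_model", "vm_rf_model"],
}

runs = expand_grid(input_grid)
for run in runs:
    print(run)
```

With one dataset and three models, the grid yields three runs, which is why each test in the loop produces one result per model.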

Adjust a ValidMind test

Let's dig deeper into the MinimumF1Score test we ran previously in Run performance tests to ensure that the models maintain a minimum acceptable balance between precision and recall. Precision is the share of the model's positive predictions that were actually correct, and recall is the share of actual positive cases that the model correctly identified.
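
To make these definitions concrete, here is a minimal sketch in plain Python that computes precision, recall, and their harmonic mean (the F1 score) from confusion-matrix counts, then checks the score against a minimum threshold. The counts, the helper name precision_recall_f1, and the two threshold values are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 20 false positives, 30 false negatives
precision, recall, f1 = precision_recall_f1(tp=30, fp=20, fn=30)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")

# A minimum-threshold check passes only if F1 reaches the threshold
for min_threshold in (0.5, 0.6):
    print(min_threshold, "pass" if f1 >= min_threshold else "fail")
```

The same F1 score can pass or fail depending on the threshold chosen, so the threshold should reflect the performance standard you actually need.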

Use run_test() with our testing dataset (vm_test_ds) to run the test in isolation again for our two remaining models. We won't log the result this time, since we only need the output as a baseline to compare against a subsequent iteration:

vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumF1Score:xgboost_champion_vs_challengers",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_xgb_model, vm_log_model]
    },
)

As MinimumF1Score allows us to customize parameters and thresholds for performance standards, let's set the minimum threshold to 0.35 and see how the test outcome changes:

result = vm.tests.run_test(
    "validmind.model_validation.sklearn.MinimumF1Score:AdjThreshold",
    input_grid={
        "dataset": [vm_test_ds],
        "model": [vm_xgb_model, vm_log_model],
    },
    # Test parameters are passed separately from the inputs
    params={"min_threshold": 0.35},
).log()

Run diagnostic tests

Next, we'll compare the robustness and stability of our champion and challenger models.

Use list_tests() to list all available diagnosis tests applicable to classification tasks:

vm.tests.list_tests(tags=["model_diagnosis"], task="classification")

Let's use the OverfitDiagnosis test to see whether the models show signs of overfitting, and to identify any sub-segments of the data where problems are concentrated.

Overfitting occurs when a model learns the training data too well, capturing not only the true pattern but also noise and random fluctuations, resulting in excellent performance on the training dataset but poor generalization to new, unseen data.

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.OverfitDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_xgb_model, vm_log_model],
    },
).log()
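
To illustrate the idea behind a segment-level overfit check, the sketch below compares training and test accuracy within each segment and flags segments where the gap exceeds a cut-off. Everything here is hypothetical: the rows, the segment names, the helper functions, and the 0.04 cut-off. It uses accuracy for simplicity and is not the implementation of OverfitDiagnosis, which may use a different metric and binning.

```python
def accuracy(rows):
    """Share of rows where the prediction matches the label."""
    return sum(r["pred"] == r["label"] for r in rows) / len(rows)


def overfit_gaps(train_rows, test_rows, segment_key, cut_off=0.04):
    """Return {segment: (train_acc, test_acc, gap, flagged)} per segment."""
    report = {}
    for seg in sorted({r[segment_key] for r in train_rows}):
        train_seg = [r for r in train_rows if r[segment_key] == seg]
        test_seg = [r for r in test_rows if r[segment_key] == seg]
        if not test_seg:
            continue  # no test rows to compare against
        train_acc, test_acc = accuracy(train_seg), accuracy(test_seg)
        gap = train_acc - test_acc
        report[seg] = (train_acc, test_acc, gap, gap > cut_off)
    return report


def make_rows(segment, n_correct, n_wrong):
    """Build hypothetical rows with a given number of right and wrong predictions."""
    rows = [{"segment": segment, "label": 1, "pred": 1} for _ in range(n_correct)]
    rows += [{"segment": segment, "label": 1, "pred": 0} for _ in range(n_wrong)]
    return rows


train_rows = make_rows("low_income", 4, 0) + make_rows("high_income", 3, 1)
test_rows = make_rows("low_income", 2, 2) + make_rows("high_income", 3, 1)

report = overfit_gaps(train_rows, test_rows, "segment")
for seg, (train_acc, test_acc, gap, flagged) in report.items():
    print(f"{seg}: train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f} flagged={flagged}")
```

A model can look fine in aggregate while overfitting badly in one segment, which is why a per-segment breakdown is useful.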

Let's also conduct robustness and stability testing of the two models with the RobustnessDiagnosis test.

Robustness refers to a model's ability to maintain consistent performance when its inputs are perturbed or noisy, and stability refers to a model's ability to produce consistent outputs over time and across different data subsets.

vm.tests.run_test(
    test_id="validmind.model_validation.sklearn.RobustnessDiagnosis:Champion_vs_LogRegression",
    input_grid={
        "datasets": [[vm_train_ds, vm_test_ds]],
        "model": [vm_xgb_model, vm_log_model],
    },
).log()
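
One common way to probe robustness is to add random noise to a numeric feature and watch how performance decays as the noise grows. The sketch below shows that idea with a toy decision rule in plain Python. The data, the rule, the noise scales, and the helper names are all hypothetical, and this is only an assumption about the general technique rather than a description of how RobustnessDiagnosis works internally.

```python
import random
import statistics


def perturb(values, scale, rng):
    """Add Gaussian noise with standard deviation = scale * std(values)."""
    sigma = scale * statistics.pstdev(values)
    return [v + rng.gauss(0.0, sigma) for v in values]


def rule_accuracy(values, labels, cut_off):
    """Accuracy of a toy rule that predicts 1 when the value exceeds cut_off."""
    preds = [1 if v > cut_off else 0 for v in values]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical feature: labels are 1 exactly when the value exceeds 50
values = [rng.uniform(0, 100) for _ in range(500)]
labels = [1 if v > 50 else 0 for v in values]

decay = {}
for scale in (0.0, 0.1, 0.3, 0.5):
    noisy = perturb(values, scale, rng)
    decay[scale] = rule_accuracy(noisy, labels, cut_off=50)
    print(f"noise scale {scale}: accuracy {decay[scale]:.3f}")
```

A model whose performance drops sharply at small noise scales is more fragile than one that degrades gradually.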

Run feature importance tests

We also want to verify the relative influence of different input features on our models' predictions, and to compare our champion and challenger models to see whether one of them offers more understandable or logical feature importance scores.

Use list_tests() to identify all the feature importance tests for classification:

# Store the feature importance tests
FI = vm.tests.list_tests(tags=["feature_importance"], task="classification", pretty=False)
FI

# Run and log our feature importance tests for both models for the testing dataset
for test in FI:
    vm.tests.run_test(
        # Append a result_id so these comparison results are logged under their own name
        f"{test}:Champion_vs_LogisticRegression",
        input_grid={
            "dataset": [vm_test_ds],
            "model": [vm_xgb_model, vm_log_model],
        },
    ).log()

Implement a custom test

Let's finish up testing by implementing a custom inline test that analyzes a FICO-style credit score. An inline test refers to a test written and executed within the same environment as the code being tested (in this case, right in this Jupyter Notebook) without requiring a separate test file or framework.

The @vm.test decorator allows you to turn a function into a reusable test:

import numpy as np
import pandas as pd
import plotly.graph_objects as go

@vm.test("my_custom_tests.ScoreToOdds")
def score_to_odds_analysis(dataset, score_column='score', score_bands=[410, 440, 470]):
    """
    Analyzes the relationship between score bands and odds (good:bad ratio).
    Good odds = (1 - default_rate) / default_rate
    
    Higher scores should correspond to higher odds of being good.

    If multiple scores are provided through score_column, they come from different models, with each score reflecting one model.

    In that case, focus the assessment on the differences between the scores and indicate, with evidence, which one is preferred.
    """
    # Work on a copy so that adding score_band cannot modify the dataset's own DataFrame
    df = dataset.df.copy()
    
    # Create score bands
    df['score_band'] = pd.cut(
        df[score_column],
        bins=[-np.inf] + score_bands + [np.inf],
        labels=[f'<{score_bands[0]}'] + 
               [f'{score_bands[i]}-{score_bands[i+1]}' for i in range(len(score_bands)-1)] +
               [f'>{score_bands[-1]}']
    )
    
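    # Calculate metrics per band; observed=False keeps empty bands in the output
    # and makes the pandas groupby behavior for categorical keys explicit
    results = df.groupby('score_band', observed=False).agg({
        dataset.target_column: ['mean', 'count']
    })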
    
    results.columns = ['Default Rate', 'Total']
    results['Good Count'] = results['Total'] - (results['Default Rate'] * results['Total'])
    results['Bad Count'] = results['Default Rate'] * results['Total']
    results['Odds'] = results['Good Count'] / results['Bad Count']
    
    # Create visualization
    fig = go.Figure()
    
    # Add odds bars
    fig.add_trace(go.Bar(
        name='Odds (Good:Bad)',
        x=results.index,
        y=results['Odds'],
        marker_color='blue'
    ))
    
    fig.update_layout(
        title='Score-to-Odds Analysis',
        yaxis=dict(title='Odds (Good:Bad)'),
        showlegend=False
    )
    
    return fig
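
As a quick check on the arithmetic in the docstring above, the sketch below applies the odds formula, (1 - default_rate) / default_rate, to a few made-up default rates and confirms that the odds rise with the score band. The default rates and the helper name good_bad_odds are hypothetical and are not results from this notebook.

```python
def good_bad_odds(default_rate):
    """Odds of good to bad: (1 - default_rate) / default_rate."""
    if default_rate == 0:
        return float("inf")  # no bads observed in the band
    return (1 - default_rate) / default_rate


# Hypothetical default rates per score band, lowest band first
band_default_rates = {"<410": 0.5, "410-440": 0.25, "440-470": 0.2, ">470": 0.1}

band_odds = {band: good_bad_odds(rate) for band, rate in band_default_rates.items()}
for band, odds in band_odds.items():
    print(f"{band}: {odds:.1f} to 1")

# Higher bands should have higher odds of being good
odds_in_order = list(band_odds.values())
is_monotonic = all(a < b for a, b in zip(odds_in_order, odds_in_order[1:]))
print("odds increase with score:", is_monotonic)
```

If the odds do not increase from one band to the next, the score is not rank-ordering risk as a scorecard should, which is worth raising as a finding.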

With the custom test available, run and log the test for our champion and challenger models with our testing dataset (vm_test_ds):

result = vm.tests.run_test(
    "my_custom_tests.ScoreToOdds:Champion_vs_Challenger",
    inputs={
        "dataset": vm_test_ds,
    },
    param_grid={
        "score_column": ["xgb_scores","log_scores"],
        "score_bands": [[500, 540, 570]],
    },
).log()

Want to learn more about custom tests?

Refer to our in-depth introduction to custom tests: Implement custom tests

Verify test runs

Our final task is to verify that all the tests provided by the model development team were run and reported accurately. Note the result_id appended to the relevant test IDs (the part after the colon), which indicates the dataset each test was run with.

Here, we'll specify all the tests we'd like to independently rerun in a dictionary called test_config. Note here that inputs and input_grid expect the input_id of the dataset or model as the value rather than the variable name we specified:

test_config = {
    # Run with the raw dataset
    'validmind.data_validation.DatasetDescription:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.DescriptiveStatistics:raw_data': {
        'inputs': {'dataset': 'raw_dataset'}
    },
    'validmind.data_validation.MissingValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.ClassImbalance:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.Duplicates:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.HighCardinality:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {
            'num_threshold': 100,
            'percent_threshold': 0.1,
            'threshold_type': 'percent'
        }
    },
    'validmind.data_validation.Skewness:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_threshold': 1}
    },
    'validmind.data_validation.UniqueRows:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TooManyZeroValues:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'max_percent_threshold': 0.03}
    },
    'validmind.data_validation.IQROutliersTable:raw_data': {
        'inputs': {'dataset': 'raw_dataset'},
        'params': {'threshold': 5}
    },
    # Run with the preprocessed dataset
    'validmind.data_validation.DescriptiveStatistics:preprocessed_data': {
        'inputs': {'dataset': 'preprocess_dataset'}
    },
    'validmind.data_validation.TabularDescriptionTables:preprocessed_data': {
        'inputs': {'dataset': 'preprocess_dataset'}
    },
    'validmind.data_validation.MissingValues:preprocessed_data': {
        'inputs': {'dataset': 'preprocess_dataset'},
        'params': {'min_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:preprocessed_data': {
        'inputs': {'dataset': 'preprocess_dataset'}
    },
    'validmind.data_validation.TabularCategoricalBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'preprocess_dataset'}
    },
    'validmind.data_validation.TargetRateBarPlots:preprocessed_data': {
        'inputs': {'dataset': 'preprocess_dataset'},
        'params': {'default_column': 'loan_status'}
    },
    # Run with the training and test datasets
    'validmind.data_validation.DescriptiveStatistics:development_data': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset']}
    },
    'validmind.data_validation.TabularDescriptionTables:development_data': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset']}
    },
    'validmind.data_validation.ClassImbalance:development_data': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset']},
        'params': {'min_percent_threshold': 10}
    },
    'validmind.data_validation.UniqueRows:development_data': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset']},
        'params': {'min_percent_threshold': 1}
    },
    'validmind.data_validation.TabularNumericalHistograms:development_data': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset']}
    },
    'validmind.data_validation.MutualInformation:development_data': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset']},
        'params': {'min_threshold': 0.01}
    },
    'validmind.data_validation.PearsonCorrelationMatrix:development_data': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset']}
    },
    'validmind.data_validation.HighPearsonCorrelation:development_data': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset']},
        'params': {'max_threshold': 0.3, 'top_n_correlations': 10}
    },
    'validmind.model_validation.ModelMetadata': {
        'input_grid': {'model': ['xgb_model_developer_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ModelParameters': {
        'input_grid': {'model': ['xgb_model_developer_champion', 'rf_model']}
    },
    'validmind.model_validation.sklearn.ROCCurve': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset'], 'model': ['xgb_model_developer_champion']}
    },
    'validmind.model_validation.sklearn.MinimumROCAUCScore': {
        'input_grid': {'dataset': ['train_dataset', 'test_dataset'], 'model': ['xgb_model_developer_champion']},
        'params': {'min_threshold': 0.5}
    }
}

Then batch run and log our tests in test_config:

for t in test_config:
    print(t)
    try:
        # Each entry holds either `input_grid` or `inputs`, plus optional `params`,
        # so the configuration can be passed straight through as keyword arguments
        vm.tests.run_test(t, **test_config[t]).log()
    except Exception as e:
        print(f"Error running test {t}: {e}")
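
In a long batch, error messages printed mid-run are easy to miss. One possible variation, sketched below, collects failures in a dictionary so they can be reviewed once the batch finishes. The names run_all and fake_run and the example.* test IDs are hypothetical. fake_run is a stand-in used only so the sketch runs on its own, and in the notebook you would pass a function that calls vm.tests.run_test(...).log() instead.

```python
def run_all(test_config, run):
    """Run every configured test and return ({test_id: result}, {test_id: error})."""
    results, errors = {}, {}
    for test_id, config in test_config.items():
        try:
            results[test_id] = run(test_id, **config)
        except Exception as exc:  # keep going so one failure doesn't stop the batch
            errors[test_id] = str(exc)
    return results, errors


def fake_run(test_id, inputs=None, input_grid=None, params=None):
    """Hypothetical stand-in for a test runner; fails when no inputs are given."""
    if inputs is None and input_grid is None:
        raise ValueError("no inputs provided")
    return {"test_id": test_id, "params": params or {}}


sample_config = {
    "example.TestA:raw_data": {"inputs": {"dataset": "raw_dataset"}},
    "example.TestB:development_data": {
        "input_grid": {"dataset": ["train_dataset", "test_dataset"]},
        "params": {"min_threshold": 1},
    },
    "example.TestC": {"params": {"min_threshold": 1}},  # deliberately incomplete
}

results, errors = run_all(sample_config, fake_run)
print("succeeded:", sorted(results))
print("failed:", errors)
```

An empty errors dictionary at the end is a simple signal that every test in the configuration was rerun successfully.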

Next steps

Work with your validation report

Now that you've logged all your test results and verified the work done by the model development team, head to the ValidMind Platform to wrap up your validation report:

From the Inventory in the ValidMind Platform, go to the model you connected to earlier.
In the left sidebar that appears for your model, click Validation Report.

Include your logged test results as evidence, create risk assessment notes, add findings, and assess compliance, then submit your report for review when it's ready. Learn more: Preparing validation reports

Discover more learning resources

All notebook samples can be found in the ValidMind Library GitHub repository.

Or, visit our documentation to learn more about ValidMind.

Upgrade ValidMind

After installing ValidMind, you’ll want to periodically make sure you are on the latest version to access any new features and other enhancements.

Retrieve the information for the currently installed version of ValidMind:

%pip show validmind

If the version returned is lower than the version indicated in our production open-source code, restart your notebook and run:

%pip install --upgrade validmind

You may need to restart your kernel after running the upgrade command for the changes to be applied.
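
If you prefer to check the installed version from Python rather than with %pip show, the standard library's importlib.metadata can report it. The sketch below wraps that lookup in a hypothetical helper, installed_version, which returns None when the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


current = installed_version("validmind")
print("validmind is not installed" if current is None else f"validmind {current}")
```

After an upgrade and kernel restart, rerunning this cell should show the new version.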