Customize test result descriptions
When you run ValidMind tests, test descriptions are automatically generated by an LLM using the test results, the test name, and the static test definitions provided in the test's docstring. While this metadata offers a valuable high-level overview of each test, the LLM-generated descriptions may not always align with your specific use cases or incorporate your organization's policy requirements.
In this notebook, you'll learn how to take complete control over the context that drives test description generation. ValidMind provides a context parameter in run_test that accepts a dictionary with three complementary keys for comprehensive context management:
- instructions: Overwrites ValidMind's default result description structure. If you provide custom instructions, they take full priority over the built-in ones. This parameter controls how the final description is structured and presented. Use this to specify formatting requirements, target different audiences (executives vs. technical teams), or ensure consistent report styles across your organization.
- test_description: Overwrites the test's built-in docstring if provided. This parameter contains the technical mechanics of how the test works. However, for generic tests where the methodology isn't the focus, you may use this to describe what's actually being analyzed: the specific variables, features, or metrics being plotted and their business meaning rather than the statistical mechanics. You can also override ValidMind's built-in test documentation if you prefer different structure or language.
- additional_context: Does not overwrite the instructions or test description, but instead adds to them. This parameter provides any background information you want the LLM to consider when analyzing results. It could include business priorities, acceptance thresholds, regulatory requirements, domain expertise, use case details, model purpose, or stakeholder concerns: any information that helps the LLM better understand and interpret your specific situation.
Together, these context parameters allow you to manage every aspect of how the LLM interprets and presents your test results. Whether you need to align descriptions with regulatory requirements, target specific audiences, incorporate organizational policies, or ensure consistent reporting standards, this context management approach gives you the flexibility to generate descriptions that perfectly match your needs while still leveraging the analytical power of AI-generated insights.
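To make this concrete, here is a minimal sketch of how the three keys are passed together through the context argument of run_test. It reuses the classifier example from later in this notebook and assumes the setup below has already been run (the vm_test_ds and vm_model objects are created in the Setup and Model development sections), and the context values shown are illustrative placeholders only:
# Minimal sketch: all three context keys passed together to run_test.
# The context values below are placeholders for illustration only.
vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,  # initialized later in this notebook
        "model": vm_model,  # initialized later in this notebook
    },
    context={
        "instructions": "Write a concise, executive-level summary.",
        "test_description": "Describe what this test analyzes for our churn use case.",
        "additional_context": "An AUC above 0.80 is required for deployment approval.",
    },
)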
Setup
This section covers the basic setup required to run the examples in this notebook. We'll install ValidMind, connect to the platform, and create a customer churn model that we'll use to demonstrate the context parameters (instructions, test_description, and additional_context) throughout the examples.
Install the ValidMind Library
To install the library:
%pip install -q validmind
Initialize the ValidMind Library
ValidMind generates a unique code snippet for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.
Get your code snippet
In a browser, log in to ValidMind.
In the left sidebar, navigate to Model Inventory and click + Register Model.
Enter the model details and click Continue. (Need more help?)
For example, to register a model for use with this notebook, select:
- Documentation template: Binary classification
- Use case: Marketing/Sales - Attrition/Churn Management
You can fill in other options according to your preference.
Go to Getting Started and click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:
# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env
# Or replace with your code snippet
import validmind as vm
vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
)
Initialize the Python environment
After you've connected to your registered model in the ValidMind Platform, let's import the necessary libraries and set up your Python environment for data analysis:
import xgboost as xgb
import os
%matplotlib inline
Model development
Now we'll build the customer churn model using XGBoost and ValidMind's sample dataset. This trained model will generate the test results we'll use to demonstrate the context parameters.
Load data
First, we'll import a sample ValidMind dataset and load it into a pandas dataframe:
# Import the sample dataset from the library
from validmind.datasets.classification import customer_churn
print(
f"Loaded demo dataset with: \n\n\t• Target column: '{customer_churn.target_column}' \n\t• Class labels: {customer_churn.class_labels}"
)
raw_df = customer_churn.load_data()
raw_df.head()
Fit the model
Then, we prepare the data and model by first splitting the DataFrame into training, validation, and test sets, then separating features from targets. An XGBoost classifier is initialized with early stopping, evaluation metrics (error, logloss, and auc) are defined, and the model is trained on the training data with validation monitoring.
train_df, validation_df, test_df = customer_churn.preprocess(raw_df)

x_train = train_df.drop(customer_churn.target_column, axis=1)
y_train = train_df[customer_churn.target_column]
x_val = validation_df.drop(customer_churn.target_column, axis=1)
y_val = validation_df[customer_churn.target_column]

model = xgb.XGBClassifier(early_stopping_rounds=10)
model.set_params(
    eval_metric=["error", "logloss", "auc"],
)
model.fit(
    x_train,
    y_train,
    eval_set=[(x_val, y_val)],
    verbose=False,
)
Initialize the ValidMind objects
Before you can run tests, you'll need to initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module.
We'll include the following arguments:
- dataset — the raw dataset that you want to provide as input to tests
- input_id — a unique identifier that allows tracking what inputs are used when running each individual test
- target_column — a required argument if tests require access to true values. This is the name of the target column in the dataset
- class_labels — an optional value to map predicted classes to class labels
With all datasets ready, you can now initialize the raw, training, and test datasets (raw_df, train_df and test_df) created earlier into their own dataset objects using vm.init_dataset():
vm_raw_dataset = vm.init_dataset(
    dataset=raw_df,
    input_id="raw_dataset",
    target_column=customer_churn.target_column,
    class_labels=customer_churn.class_labels,
)

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=customer_churn.target_column,
)

vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset",
    target_column=customer_churn.target_column,
)
Additionally, you'll need to initialize a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data. Simply initialize this model object with vm.init_model():
vm_model = vm.init_model(
    model,
    input_id="model",
)
We can now use the assign_predictions() method from the Dataset object to link existing predictions to any model. If no prediction values are passed, the method will compute predictions automatically:
vm_train_ds.assign_predictions(
    model=vm_model,
)

vm_test_ds.assign_predictions(
    model=vm_model,
)
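If you already have precomputed predictions, you can link them explicitly instead of letting the method compute them. The sketch below is an alternative to the cells above; the prediction_values argument name is an assumption, so verify it against the assign_predictions() signature in your installed ValidMind version:
# Alternative sketch: link precomputed predictions to the test dataset.
# NOTE: `prediction_values` is an assumed argument name; check the
# assign_predictions() signature in your ValidMind version.
x_test = test_df.drop(customer_churn.target_column, axis=1)
y_test_pred = model.predict(x_test)

vm_test_ds.assign_predictions(
    model=vm_model,
    prediction_values=y_test_pred,
)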
Understanding test result descriptions
Before diving into custom instructions, let's understand how ValidMind generates test descriptions by default.
Default LLM-generated descriptions
When you run a test without custom instructions, ValidMind's LLM analyzes:
- The test results (tables, figures)
- The test's built-in documentation (docstring)
When ValidMind generates test descriptions automatically (without custom instructions), the LLM follows a series of standardized sections designed to provide comprehensive, objective analysis of test results:
Test purpose: This section opens with a clear explanation of what the test does and why it exists. It draws from the test’s documentation and presents the purpose in accessible, straightforward language.
Test mechanism: Here the description outlines how the test works, including its methodology, what it measures, and how those measurements are derived. For statistical tests, it also explains the meaning of each metric, how values are typically interpreted, and what ranges are expected.
Test strengths: This part highlights the value of the test by pointing out its key strengths and the scenarios where it is most useful. It also notes the kinds of insights it can provide that other tests may not capture.
Test limitations: Limitations focus on both technical constraints and interpretation challenges. The text notes when results should be treated with caution and highlights specific risk indicators tied to the test type.
Results interpretation: The results section explains how to read the outputs, whether tables or figures, and clarifies what each column, axis, or metric means. It also points out key data points, units of measurement, and any notable observations that help frame interpretation.
Key insights: Insights are listed in bullet points, moving from broad to specific. Each one has a clear title, includes relevant numbers or ranges, and ensures that all important aspects of the results are addressed.
Conclusions: The conclusion ties the insights together into a coherent narrative. It synthesizes the findings into objective technical takeaways and emphasizes what the results reveal about the model or data.
Let's see a default description:
vm.tests.run_test("validmind.model_validation.sklearn.ClassifierPerformance",
={
inputs"dataset": vm_test_ds,
"model": vm_model,
}, )
Customizing results structure with instructions
While the default descriptions are designed to be comprehensive, there are many cases where you might want to tailor them for your specific context. Customizing test results allows you to shape descriptions to fit your organization’s standards and practical needs. This can involve adjusting report formats, applying specific risk rating scales, adding mandatory disclaimer text, or emphasizing particular metrics.
The instructions parameter is what enables this flexibility by adapting the generated descriptions to different audiences and test types. Executives often need concise summaries that emphasize overall risk, data scientists look for detailed explanations of the methodology behind tests, and compliance teams require precise language that aligns with regulatory expectations. Different test types also demand different emphases: performance metrics may benefit from technical breakdowns, while validation checks might require risk-focused narratives.
Simple instruction example
Let's start with simple examples of the instructions parameter. Here's how to provide basic guidance to the LLM-generated descriptions:
= """
simple_instructions Please focus on business impact and provide a concise summary.
Include specific actionable recommendations.
"""
vm.tests.run_test("validmind.model_validation.sklearn.ClassifierPerformance",
={
inputs"dataset": vm_test_ds,
"model": vm_model,
},={
context"instructions": simple_instructions,
}, )
Structured format instructions
You can request specific formatting and structure:
= """
structured_instructions Please structure your analysis using the following format:
### Executive Summary
- One sentence overview of the test results
### Key Findings
- Bullet points with the most important insights
- Include specific percentages and thresholds
### Risk Assessment
- Classify risk level as Low/Medium/High
- Explain reasoning for the risk classification
### Recommendations
- Specific actionable next steps
- Priority level for each recommendation
"""
vm.tests.run_test("validmind.model_validation.sklearn.ClassifierPerformance",
={
inputs"dataset": vm_test_ds,
"model": vm_model,
},={
context"instructions": structured_instructions,
}, )
Template with LLM fill-ins
One of the most powerful features is combining hardcoded text with LLM-generated content using placeholders. This allows you to ensure specific information is always included while still getting intelligent analysis of the results.
Create a template where specific sections are filled by the LLM:
= """
template_instructions Please generate the description using this exact template.
Fill in the [PLACEHOLDER] sections with your analysis:
---
**VALIDATION REPORT: CLASSIFIER PERFORMANCE ASSESSMENT**
**Dataset ID:** test_dataset
**Validation Type:** Classification Performance Analysis
**Reviewer:** ValidMind AI Analysis
**EXECUTIVE SUMMARY:**
[PROVIDE_2_SENTENCE_SUMMARY_OF_RESULTS]
**KEY FINDINGS:**
[ANALYZE_AND_LIST_TOP_3_MOST_IMPORTANT_FINDINGS_WITH_VALUES]
**CLASSIFICATION PERFORMANCE ASSESSMENT:**
[DETAILED_ANALYSIS_OF_CLASSIFICATION_PERFORMANCE_PATTERNS_AND_IMPACT]
**RISK RATING:** [ASSIGN_LOW_MEDIUM_HIGH_RISK_WITH_JUSTIFICATION]
**RECOMMENDATIONS:**
[PROVIDE_SPECIFIC_ACTIONABLE_RECOMMENDATIONS_NUMBERED_LIST]
**VALIDATION STATUS:** [PASS_CONDITIONAL_PASS_OR_FAIL_WITH_REASONING]
---
*This report was generated using ValidMind's automated validation platform.*
*For questions about this analysis, contact the Data Science team.*
---
Important: Use the exact template structure above and fill in each [PLACEHOLDER] section.
"""
vm.tests.run_test("validmind.model_validation.sklearn.ClassifierPerformance",
={
inputs"dataset": vm_test_ds,
"model": vm_model,
},={
context"instructions": template_instructions,
}, )
Mixed static and dynamic content
Combine mandatory text with intelligent analysis:
# Mixed static and dynamic content
mixed_content_instructions = """
Return ONLY the assembled content in plain Markdown paragraphs and lists.
Do NOT include any headings or titles (no lines starting with '#'), labels,
XML-like tags (<MANDATORY>, <PLACEHOLDER>), variable names, or code fences.
Do NOT repeat or paraphrase these instructions. Start the first line with the
first mandatory sentence below—no preface.

You MUST include all MANDATORY blocks verbatim (exact characters, spacing, and punctuation).
You MUST replace PLACEHOLDER blocks with the requested content.
Between blocks, include exactly ONE blank line.

MANDATORY BLOCK A (include verbatim):
This data validation assessment was conducted in accordance with the
XYZ Bank Model Risk Management Policy (Document ID: MRM-2024-001).
All findings must be reviewed by the Model Validation Team before
model deployment.

PLACEHOLDER BLOCK B (replace with prose paragraphs; no headings):
[Provide detailed analysis of the test results, including specific values,
interpretations, and implications for model quality. Focus on classification performance quality
aspects and potential issues that could affect model performance.]

MANDATORY BLOCK C (include verbatim):
IMPORTANT: This automated analysis is supplementary to human expert review.
All high-risk findings require immediate escalation to the Chief Risk Officer.
Model deployment is prohibited until all Medium and High risk items are resolved.

PLACEHOLDER BLOCK D (replace with a numbered list only):
[Create a numbered list of specific action items with responsible parties
and suggested timelines for resolution.]

MANDATORY BLOCK E (include verbatim):
Validation performed using ValidMind Platform v2.0 |
Next review required: [30 days from test date] |
Contact: model-risk@xyzbank.com

Compliance checks BEFORE you finalize your answer:
- No headings or titles present (no '#' anywhere).
- No tags (<MANDATORY>, <PLACEHOLDER>) or labels (e.g., "BLOCK A") in the output.
- All three MANDATORY blocks included exactly as written.
- PLACEHOLDER B replaced with prose; PLACEHOLDER D replaced with a numbered list.
- Exactly one blank line between each block.
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": mixed_content_instructions,
    },
)
Enriching results with additional context
While the instructions parameter controls how your test descriptions are formatted and structured, the additional_context parameter provides background information about what the results mean for your specific business situation. Think of instructions as the "presentation guide" and additional_context as the "business background" that helps the LLM understand what matters most in your organization and how to interpret the results in your specific context.
Understanding the additional context parameter
The additional_context parameter can be used to add any background information that helps put the test results into context. For example, you might include business priorities and constraints that shape how results are interpreted, risk tolerance levels or acceptance criteria specific to your organization, regulatory requirements that influence what counts as acceptable performance, or details about the intended use case of the model in production. These are just examples: the parameter is flexible and can capture whatever context is most relevant to your needs.
Key difference:
- instructions: "Write a 3-paragraph executive summary"
- additional_context: "If Accuracy is above 0.85 but Class 1 Recall falls below 0.60, the model should be considered high risk"
When used together, these parameters create descriptions that don’t just report the Recall or Accuracy measures for Class 1, but explain that because Accuracy is above 0.85 while Recall falls below 0.60, the model should be treated as high risk for your business.
Basic additional context usage
Here's how business context transforms the interpretation of our classifier results:
= """
simple_context MODEL CONTEXT:
- Class 0 = Customer stays (retains banking relationship)
- Class 1 = Customer churns (closes accounts, leaves bank)
DECISION RULES:
- ROC AUC >0.9: APPROVE deployment
- ROC AUC <0.9: REJECT model
CHURN DETECTION RULES:
- Recall >50% for churning customers: Good - use high-touch retention
- Recall <50% for churning customers: Poor - retention program will fail
"""
vm.tests.run_test("validmind.model_validation.sklearn.ClassifierPerformance",
={
inputs"dataset": vm_test_ds,
"model": vm_model,
},={
context"additional_context": simple_context,
}, )
Combining instructions and additional context
Here's how combining both parameters creates targeted analysis of our churn model performance, using additional_context to pass both static business rules and dynamic real-time information like analysis dates:
from datetime import datetime

# Get today's date
today = datetime.now().strftime("%B %d, %Y")

# Executive decision instructions with date placeholder
executive_instructions = """
Create a GO/NO-GO decision memo following this template:

<TEMPLATE>
**DATE:** [Use analysis date from context]
**THRESHOLD ANALYSIS:** [Pass/Fail against specific thresholds]
**BUSINESS IMPACT:** [Revenue impact of current performance]
**DEPLOYMENT DECISION:** [APPROVE/CONDITIONAL/REJECT]
**REQUIRED ACTIONS:** [Specific next steps with timelines]
</TEMPLATE>

Be definitive - use the thresholds to make clear recommendations.
"""

# Retail banking with hard thresholds including date
retail_thresholds = f"""
RETAIL BANKING CONTEXT (Analysis Date: {today}):
- Class 0 = Customer retention (keeps checking/savings accounts)
- Class 1 = Customer churn (closes accounts, switches banks)

REGULATORY THRESHOLDS:
- AUC >0.80: Meets regulatory model standards
- Churn Recall >55%: Adequate churn detection
- Churn Precision >65%: Cost-effective targeting

DEPLOYMENT CRITERIA:
- All 3 Pass: FULL DEPLOYMENT
- 2 Pass: CONDITIONAL DEPLOYMENT
- <2 Pass: REJECT MODEL
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": executive_instructions,
        "additional_context": retail_thresholds,
    },
)
Overriding test documentation with the test description parameter
Each test, whether built-in or custom, includes a docstring that serves as its default documentation. This docstring usually explains what the test does and what it outputs. In many cases, especially for specialized tests with well-defined purposes, the default docstring is already useful and sufficient.
Structure of ValidMind built-in test docstrings
Every ValidMind built-in test includes a docstring that serves as its default documentation. This docstring follows a consistent structure so that both users and the LLM can rely on a predictable format. While the content varies depending on the type of test—for example, highly specific tests like SHAP values or PSI provide technical detail, whereas generic tests like descriptive statistics or histograms are more general—the overall layout remains the same.
A typical docstring contains the following sections:
Overview: A short description of what the test does and what kind of output it generates.
Purpose: Explains why the test exists and what it is designed to evaluate. This section provides the context for the test’s role in model documentation, often describing the intended use cases or the kind of insights it supports.
Test mechanism: Describes how the test works internally. This includes the approach or methodology, what inputs are used, how results are calculated or visualized, and the logic behind the test’s implementation.
Signs of high risk: Outlines risk indicators that are specific to the test. These highlight situations where results should be interpreted with caution—for example, imbalances in distributions or errors in processing steps.
Strengths: Highlights the capabilities and benefits of the test, explaining what makes it particularly useful and what kinds of insights it provides that may not be captured elsewhere.
Limitations: Discusses the constraints of the test, including technical shortcomings, interpretive challenges, and situations where the results might be misleading or incomplete.
This structure ensures that all built-in tests provide a comprehensive explanation of their purpose, mechanics, strengths, and limitations. For more generic tests, the docstring may read as boilerplate information about the test's mechanics. In these cases, the test_description parameter can be used to override the docstring with context that is more relevant to the dataset, feature, or business use case under analysis.
Understanding the test description parameter
Overriding the docstring with the test_description parameter is particularly valuable for more generic tests, where the default text often focuses on the mechanics of producing an output rather than the data or variable being analyzed. For example, instead of documenting the details of the methodology used to compute a histogram, you may want to document the business meaning of the feature being visualized, its expected distribution, or what to pay attention to. Similarly, when generating a descriptive statistics table, you may prefer documentation that describes the dataset under review.
Customizing the test description allows you to shift the focus of the explanation from the test machinery to the aspects of the data that matter most to your audience, while still relying on the built-in docstring in cases where the default detail is already fit for purpose.
When to override
For tests like histograms or descriptive statistics where the statistical methodology is standard and uninteresting, replace the generic documentation with meaningful descriptions of the variables being analyzed. Also use this to customize ValidMind's built-in test documentation when you want different terminology, structure, or emphasis than what's provided by default.
Basic test description usage
= """
custom_description This test evaluates customer churn prediction model performance specifically
for retail banking applications. The analysis focuses on classification
metrics relevant to customer retention programs and regulatory compliance
requirements under our internal Model Risk Management framework.
Key metrics analyzed:
- Precision: Accuracy of churn predictions to minimize wasted retention costs
- Recall: Coverage of actual churners to maximize retention program effectiveness
- F1-Score: Balanced measure considering both precision and recall
- ROC AUC: Overall discriminatory power for regulatory model approval
Results inform deployment decisions for automated retention campaigns.
"""
= vm.tests.run_test(
result "validmind.model_validation.sklearn.ClassifierPerformance",
={
inputs"model": vm_model,
"dataset": vm_test_ds
},={
context"test_description": custom_description,
}, )
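The same pattern tends to be most useful for generic tests, where the default docstring describes plotting mechanics rather than your data. As a hedged sketch, the histogram test ID below is assumed to exist in your installed ValidMind version (it isn't used elsewhere in this notebook), and the feature names refer to columns in the demo churn dataset:
# Hedged sketch: override the docstring of a generic plotting test.
# NOTE: the test ID is an assumption; use vm.tests.list_tests() to confirm
# the exact ID available in your ValidMind version.
feature_description = """
Histograms of the numerical features in the raw customer churn dataset.
Features such as CreditScore, Age, Balance, and EstimatedSalary drive our
retention offers, so unexpected skew or spikes in these distributions matter
more than the plotting methodology itself.
"""

vm.tests.run_test(
    "validmind.data_validation.TabularNumericalHistograms",
    inputs={"dataset": vm_raw_dataset},
    context={"test_description": feature_description},
)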
Combining test description with instructions and additional context
# All three parameters working together
banking_test_description = """
Customer Churn Risk Assessment Test for Retail Banking.
Evaluates model's ability to identify customers likely to close accounts
and switch to competitor banks within 12 months.

- Class 0 = Customer retention (maintains banking relationship)
- Class 1 = Customer churn (closes primary accounts)
"""

executive_instructions = """
Format as a risk committee briefing:

**TEST DESCRIPTION:** [Test description]
**RISK ASSESSMENT:** [Model risk level]
**REGULATORY STATUS:** [Compliance with banking regulations]
**BUSINESS RECOMMENDATION:** [Deploy/Hold/Reject with rationale]
"""

banking_context = """
REGULATORY CONTEXT:
- OCC guidance requires AUC >0.80 for model approval
- Our threshold: Churn recall >50% for retention program viability
"""

result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "model": vm_model,
        "dataset": vm_test_ds,
    },
    context={
        "test_description": banking_test_description,
        "instructions": executive_instructions,
        "additional_context": banking_context,
    },
)
Best practices for managing context
When using the instructions, additional_context, and test_description parameters together, follow these guidelines to create effective, consistent, and maintainable test descriptions.
Choose the right parameter for each need:
- Use test_description for technical corrections when you need to fix or clarify test methodology, override ValidMind's built-in documentation with your preferred structure or terminology, replace generic test mechanics with meaningful descriptions of the variables and features being analyzed, or provide domain-specific context for regulatory compliance.
- Apply additional_context for business rules like performance thresholds and decision criteria, business context such as customer economics and operational constraints, threshold-driven decision logic, regulatory requirements, real-time information like dates or risk indicators, stakeholder priorities, or any background information that helps the LLM interpret results in your specific context.
- Leverage instructions for audience targeting to control format and presentation style, create structured templates with specific sections and placeholders for LLM fill-ins, combine hardcoded mandatory text with dynamic analysis, and ensure consistent organizational reporting standards across different stakeholder groups.
Avoid redundancy:
Don't repeat the same information across multiple parameters, as each parameter should add unique value to the description generation. If content overlaps, choose the most appropriate parameter for that information to maintain clarity and prevent conflicting or duplicate guidance in your test descriptions.
Increasing consistency and grounding:
Since LLMs can produce variable responses, use hardcoded sections in your instructions for content that requires no variability, combined with specific placeholders for data you trust the LLM to generate. For example, include mandatory disclaimers, policy references, and fixed formatting exactly as written, while using placeholders like [ANALYZE_PERFORMANCE_METRICS] for dynamic content. This approach ensures critical information appears consistently while still leveraging the LLM's analytical capabilities.
Use the test_description and additional_context parameters to anchor test result descriptions in your specific domain and business context, preventing the LLM from generating generic or inappropriate interpretations. Then use instructions to explicitly direct the LLM to ground its analysis in this provided context, such as "Base all recommendations on the thresholds specified in the additional context section" or "Interpret all metrics according to the test description provided."
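As a closing sketch, here is one way to wire the three parameters together so that the instructions explicitly point back at the supplied context. It reuses the churn model and test from this notebook, and the thresholds shown are illustrative only:
# Closing sketch: instructions explicitly reference the other context keys,
# keeping the generated description grounded in the supplied thresholds.
grounded_instructions = """
Base all recommendations on the thresholds specified in the additional context.
Interpret all metrics according to the test description provided.
"""

grounded_description = """
Evaluates the retail banking churn model's classification performance
(Class 0 = customer stays, Class 1 = customer churns).
"""

grounded_context = """
- Churn recall above 50% is required for the retention program to be viable.
- ROC AUC above 0.80 is required for internal model approval.
"""

vm.tests.run_test(
    "validmind.model_validation.sklearn.ClassifierPerformance",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
    },
    context={
        "instructions": grounded_instructions,
        "test_description": grounded_description,
        "additional_context": grounded_context,
    },
)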