Document an agentic AI system

Build and document an agentic AI system with the ValidMind Library. Construct a LangGraph-based banking agent, assign AI evaluation metric scores to your agent, and run accuracy, RAGAS, and safety tests, then log those test results to the ValidMind Platform.

An AI agent is an autonomous system that interprets inputs, selects from available tools or actions, and executes multi-step behaviors to achieve defined goals. In this notebook, the agent acts as a banking assistant that analyzes user requests and automatically selects and invokes the appropriate specialized banking tool to deliver accurate, compliant, and actionable responses.

This agent enables financial institutions to automate complex banking workflows where different customer requests require different specialized tools and knowledge bases.
Effective validation of agentic AI systems reduces the risks of agents misinterpreting inputs, failing to extract required parameters, or producing incorrect assessments or actions — such as selecting the wrong tool.

For the LLM components in this notebook to function properly, you'll need access to OpenAI.

Before you continue, ensure that a valid OPENAI_API_KEY is set in your .env file.

About ValidMind

ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.

Before you begin

This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with pip install, and then re-run the notebook. For more help, refer to Installing Python Modules.

New to ValidMind?

If you haven't already seen our documentation on the ValidMind Library, we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.

For access to all features available in this notebook, you'll need access to a ValidMind account.

Register with ValidMind

Key concepts

Model documentation: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.

Documentation template: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.

Tests: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.

Metrics: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.

Custom metrics: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with the ValidMind Library to be used in the ValidMind Platform.

Inputs: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:

model: A single model that has been initialized in ValidMind with vm.init_model().
dataset: Single dataset that has been initialized in ValidMind with vm.init_dataset().
models: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.
datasets: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. (Learn more: Run tests with multiple datasets)

Parameters: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.

Outputs: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.

Test suites: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.

Example: the classifier_full_suite test suite runs tests from the tabular_dataset and classifier test suites to fully document the data and model sections for binary classification model use-cases.

Setting up

Install the ValidMind Library

Recommended Python versions

Python 3.9 <= x <= 3.11

Let's begin by installing the ValidMind Library with large language model (LLM) support:

%pip install -q "validmind[llm]" "langgraph==0.3.21"

Initialize the ValidMind Library

Register sample model

Let's first register a sample model for use with this notebook.

In a browser, log in to ValidMind.
In the left sidebar, navigate to Inventory and click + Register Model.
Enter the model details and click Next > to continue to assignment of model stakeholders. (Need more help?)
Select your own name under the MODEL OWNER drop-down.
Click Register Model to add the model to your inventory.

Apply documentation template

Once you've registered your model, let's select a documentation template. A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.

In the left sidebar that appears for your model, click Documents and select Documentation.
Under TEMPLATE, select Agentic AI.
Click Use Template to apply the template.

Can't select this template?

Your organization administrators may need to add it to your template library:

Get your code snippet

ValidMind generates a unique code snippet for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

On the left sidebar that appears for your model, select Getting Started and click Copy snippet to clipboard.
Next, load your model identifier credentials from an .env file or replace the placeholder with your own code snippet:

# Load your model identifier credentials from an `.env` file

%load_ext dotenv
%dotenv .env

# Or replace with your code snippet

import validmind as vm

vm.init(
    # api_host="...",
    # api_key="...",
    # api_secret="...",
    # model="...",
    # document="documentation",
)

Preview the documentation template

Let's verify that you have connected the ValidMind Library to the ValidMind Platform and that the appropriate template is selected for your model.

You will upload documentation and test results unique to your model based on this template later on. For now, take a look at the default structure that the template provides with the vm.preview_template() function from the ValidMind library and note the empty sections:

vm.preview_template()

Verify OpenAI API access

Verify that a valid OPENAI_API_KEY is set in your .env file:

# Load environment variables if using .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("dotenv not installed. Make sure OPENAI_API_KEY is set in your environment.")

Initialize the Python environment

Let's import all the necessary libraries to prepare for building our banking LangGraph agentic system:

Standard libraries for data handling and environment management.
pandas, a Python library for data manipulation and analytics, as an alias. We'll also configure pandas to show all columns and all rows at full width for easier debugging and inspection.
LangChain components for LLM integration and tool management.
LangGraph for building stateful, multi-step agent workflows.
Banking tools for specialized financial services as defined in banking_tools.py.

from typing import TypedDict, Annotated, Sequence

from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END, START
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

# LOCAL IMPORTS FROM banking_tools.py
from banking_tools import AVAILABLE_TOOLS

import pandas as pd
# Configure pandas to show all columns and all rows at full width
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', None)

Building the LangGraph agent

Test available banking tools

We'll use the demo banking tools defined in banking_tools.py that provide use cases of financial services:

Credit Risk Analyzer - Loan applications and credit decisions
Customer Account Manager - Account services and customer support
Fraud Detection System - Security and fraud prevention

print(f"Available tools: {len(AVAILABLE_TOOLS)}")
print("\nTool Details:")
for i, tool in enumerate(AVAILABLE_TOOLS, 1):
    print(f"   - {tool.name}")

Let's test each banking tool individually to ensure they're working correctly before integrating them into our agent:

# Test 1: Credit Risk Analyzer
print("TEST 1: Credit Risk Analyzer")
print("-" * 40)
try:
    # Access the underlying function using .func
    credit_result = AVAILABLE_TOOLS[0].func(
        customer_income=75000,
        customer_debt=1200,
        credit_score=720,
        loan_amount=50000,
        loan_type="personal"
    )
    print(credit_result)
    print("Credit Risk Analyzer test PASSED")
except Exception as e:
    print(f"Credit Risk Analyzer test FAILED: {e}")

print("" + "=" * 60)


# Test 2: Customer Account Manager
print("TEST 2: Customer Account Manager")
print("-" * 40)
try:
    # Test checking balance
    account_result = AVAILABLE_TOOLS[1].func(
        account_type="checking",
        customer_id="12345",
        action="check_balance"
    )
    print(account_result)

    # Test getting account info
    info_result = AVAILABLE_TOOLS[1].func(
        account_type="all",
        customer_id="12345", 
        action="get_info"
    )
    print(info_result)
    print("Customer Account Manager test PASSED")
except Exception as e:
    print(f"Customer Account Manager test FAILED: {e}")

print("" + "=" * 60)


# Test 3: Fraud Detection System
print("TEST 3: Fraud Detection System")
print("-" * 40)
try:
    fraud_result = AVAILABLE_TOOLS[2].func(
        transaction_id="TX123",
        customer_id="12345",
        transaction_amount=500.00,
        transaction_type="withdrawal",
        location="Miami, FL",
        device_id="DEVICE_001"
    )
    print(fraud_result)
    print("Fraud Detection System test PASSED")
except Exception as e:
    print(f"Fraud Detection System test FAILED: {e}")

print("" + "=" * 60)

Create LangGraph banking agent

With our tools ready to go, we'll create our intelligent banking agent with LangGraph that automatically selects and uses the appropriate banking tool based on a user request.

Define system prompt

We'll begin by defining our system prompt, which provides the LLM with context about its role as a banking assistant and guidance on when to use each available tool:


# Enhanced banking system prompt with tool selection guidance
system_context = """You are a professional banking AI assistant with access to specialized banking tools.
            Analyze the user's banking request and directly use the most appropriate tools to help them.
            
            AVAILABLE BANKING TOOLS:
            
            credit_risk_analyzer - Analyze credit risk for loan applications and credit decisions
            - Use for: loan applications, credit assessments, risk analysis, mortgage eligibility
            - Examples: "Analyze credit risk for $50k personal loan", "Assess mortgage eligibility for $300k home purchase"
            - Parameters: customer_income, customer_debt, credit_score, loan_amount, loan_type

            customer_account_manager - Manage customer accounts and provide banking services
            - Use for: account information, transaction processing, product recommendations, customer service
            - Examples: "Check balance for checking account 12345", "Recommend products for customer with high balance"
            - Parameters: account_type, customer_id, action, amount, account_details

            fraud_detection_system - Analyze transactions for potential fraud and security risks
            - Use for: transaction monitoring, fraud prevention, risk assessment, security alerts
            - Examples: "Analyze fraud risk for $500 ATM withdrawal in Miami", "Check security for $2000 online purchase"
            - Parameters: transaction_id, customer_id, transaction_amount, transaction_type, location, device_id

            BANKING INSTRUCTIONS:
            - Analyze the user's banking request carefully and identify the primary need
            - If they need credit analysis → use credit_risk_analyzer
            - If they need financial calculations → use financial_calculator
            - If they need account services → use customer_account_manager
            - If they need security analysis → use fraud_detection_system
            - Extract relevant parameters from the user's request
            - Provide helpful, accurate banking responses based on tool outputs
            - Always consider banking regulations, risk management, and best practices
            - Be professional and thorough in your analysis

            Choose and use tools wisely to provide the most helpful banking assistance.
            Describe the response in user friendly manner with details describing the tool output. 
            Provide the response in at least 500 words.
            Generate a concise execution plan for the banking request.
        """

Initialize the LLM

Let's initialize the LLM that will power our banking agent:

# Initialize the main LLM for banking responses
main_llm = ChatOpenAI(
    model="gpt-5-mini",
    reasoning={
        "effort": "low",
        "summary": "auto"
    }
)

Then bind the available banking tools to the LLM, enabling the model to automatically recognize and invoke each tool when appropriate based on request input and the system prompt we defined above:

# Bind all banking tools to the main LLM
llm_with_tools = main_llm.bind_tools(AVAILABLE_TOOLS)

Define agent state structure

The agent state defines the data structure that flows through the LangGraph workflow. It includes:

messages — The conversation history between the user and agent
user_input — The current user request
session_id — A unique identifier for the conversation session
context — Additional context that can be passed between nodes

Defining this state structure maintains the structure throughout the agent's execution and allows for multi-turn conversations with memory:

# Banking Agent State Definition
class BankingAgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]
    user_input: str
    session_id: str
    context: dict

Create agent workflow function

We'll build the LangGraph agent workflow with two main components:

LLM node — Processes user requests, applies the system prompt, and decides whether to use tools.
Tools node — Executes the selected banking tools when the LLM determines they're needed.

The workflow begins with the LLM analyzing the request, then uses tools if needed — or ends if the response is complete, and finally returns to the LLM to generate the final response.

def create_banking_langgraph_agent():
    """Create a comprehensive LangGraph banking agent with intelligent tool selection."""
    def llm_node(state: BankingAgentState) -> BankingAgentState:
        """Main LLM node that processes banking requests and selects appropriate tools."""
        messages = state["messages"]
        # Add system context to messages
        enhanced_messages = [SystemMessage(content=system_context)] + list(messages)
        # Get LLM response with tool selection
        response = llm_with_tools.invoke(enhanced_messages)
        return {
            **state,
            "messages": messages + [response]
        }
    
    def should_continue(state: BankingAgentState) -> str:
        """Decide whether to use tools or end the conversation."""
        last_message = state["messages"][-1]
        # Check if the LLM wants to use tools
        if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
            return "tools"
        return END
        
    # Create the banking state graph
    workflow = StateGraph(BankingAgentState)
    # Add nodes
    workflow.add_node("llm", llm_node)
    workflow.add_node("tools", ToolNode(AVAILABLE_TOOLS))
    # Simplified entry point - go directly to LLM
    workflow.add_edge(START, "llm")
    # From LLM, decide whether to use tools or end
    workflow.add_conditional_edges(
        "llm",
        should_continue,
        {"tools": "tools", END: END}
    )
    # Tool execution flows back to LLM for final response
    workflow.add_edge("tools", "llm")
    # Set up memory
    memory = MemorySaver()
    # Compile the graph
    agent = workflow.compile(checkpointer=memory)
    return agent

Instantiate the banking agent

Now, we'll create an instance of the banking agent by calling the workflow creation function.

This compiled agent is ready to process banking requests and will automatically select and use the appropriate tools based on user queries:

# Create the banking intelligent agent
banking_agent = create_banking_langgraph_agent()

print("Banking LangGraph Agent Created Successfully!")
print("\nFeatures:")
print("   - Intelligent banking tool selection")
print("   - Comprehensive banking system prompt")
print("   - Streamlined workflow: LLM → Tools → Response")
print("   - Automatic tool parameter extraction")
print("   - Professional banking assistance")

Integrate agent with ValidMind

To integrate our LangGraph banking agent with ValidMind, we need to create a wrapper function that ValidMind can use to invoke the agent and extract the necessary information for testing and documentation, allowing ValidMind to run validation tests on the agent's behavior, tool usage, and responses.

Import ValidMind components

We'll start with importing the necessary ValidMind components for integrating our agent:

Prompt from validmind.models for handling prompt-based model inputs
extract_tool_calls_from_agent_output and _convert_to_tool_call_list from validmind.scorers.llm.deepeval for extracting and converting tool calls from agent outputs

from validmind.models import Prompt
from validmind.scorers.llm.deepeval import extract_tool_calls_from_agent_output, _convert_to_tool_call_list
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

Create agent wrapper function

We'll then create a wrapper function that:

Accepts input in ValidMind's expected format (with input and session_id fields)
Invokes the banking agent with the proper state initialization
Captures tool outputs and tool calls for evaluation
Returns a standardized response format that includes the prediction, full output, tool messages, and tool call information
Handles errors gracefully with fallback responses

@observe(type="agent")
def banking_agent_fn(input):
    """
    Invoke the banking agent with the given input.
    """
    try:
        # Initial state for banking agent
        initial_state = {
            "user_input": input["input"],
            "messages": [HumanMessage(content=input["input"])],
            "session_id": input["session_id"],
            "context": {}
        }
        session_config = {"configurable": {"thread_id": input["session_id"]}}
        result = banking_agent.invoke(initial_state, config=session_config)

        from utils import capture_tool_output_messages

        # Capture all tool outputs and metadata
        captured_data = capture_tool_output_messages(result)
    
        # Access specific tool outputs, this will be used for RAGAS tests
        tool_message = ""
        for output in captured_data["tool_outputs"]:
            tool_message += output['content']
        
        tool_calls_found = []
        messages = result['messages']
        for message in messages:
            if hasattr(message, 'tool_calls') and message.tool_calls:
                for tool_call in message.tool_calls:
                    # Handle both dictionary and object formats
                    if isinstance(tool_call, dict):
                        tool_calls_found.append(tool_call['name'])
                    else:
                        # ToolCall object - use attribute access
                        tool_calls_found.append(tool_call.name)

        prediction_text = result['messages'][-1].content[0]['text']
        tools_called_value = _convert_to_tool_call_list(extract_tool_calls_from_agent_output(result))
        expected_tools_value = _convert_to_tool_call_list(input.get("expected_tools", []))

        # Feed trace data for DeepEval metrics (e.g. PlanQuality) that require tracing
        update_current_span(
            test_case=LLMTestCase(
                input=input["input"],
                actual_output=prediction_text,
                tools_called=tools_called_value,
                expected_tools=expected_tools_value
            )
        )

        return {
            "prediction": prediction_text,
            "output": result,
            "tool_messages": [tool_message],
            # "tool_calls": tool_calls_found,
            "tool_called": tools_called_value
        }
    except Exception as e:
        # Return a fallback response if the agent fails
        error_message = f"""I apologize, but I encountered an error while processing your banking request: {str(e)}.
        Please try rephrasing your question or contact support if the issue persists."""
        return {
            "prediction": error_message, 
            "output": {
                "messages": [HumanMessage(content=input["input"]), SystemMessage(content=error_message)],
                "error": str(e)
            }
        }

Initialize the ValidMind model object

We'll also need to register the banking agent as a ValidMind model object (vm_model) that can be passed to other functions for analysis and tests on the data.

You simply initialize this model object with vm.init_model() that:

Associates the wrapper function with the model for prediction
Stores the system prompt template for documentation
Provides a unique input_id for tracking and identification
Enables the agent to be used with ValidMind's testing and documentation features

# Initialize the agent as a model
vm_banking_model = vm.init_model(
    input_id="banking_agent_model",
    predict_fn=banking_agent_fn,
    prompt=Prompt(template=system_context)
)

Store the agent reference

We'll also store a reference to the original banking agent object in the ValidMind model. This allows us to access the full agent functionality directly if needed, while still maintaining the wrapper function interface for ValidMind's testing framework.

# Add the banking agent to the vm model
vm_banking_model.model = banking_agent

Verify integration

Let's confirm that the banking agent has been successfully integrated with ValidMind:

print("Banking Agent Successfully Integrated with ValidMind!")
print(f"Model ID: {vm_banking_model.input_id}")

Validate the system prompt

Let's get an initial sense of how well our defined system prompt meets a few best practices for prompt engineering by running a few tests — we'll run evaluation tests later on our agent's performance.

You run individual tests by calling the run_test function provided by the validmind.tests module. Passing in our agentic model as an input, the tests below rate the prompt on a scale of 1-10 against the following criteria:

Clarity — How clearly the prompt states the task.
Conciseness — How succinctly the prompt states the task.
Delimitation — When using complex prompts containing examples, contextual information, or other elements, is the prompt formatted in such a way that each element is clearly separated?
NegativeInstruction — Whether the prompt contains negative instructions.
Specificity — How specific the prompt defines the task.

vm.tests.run_test(
    "validmind.prompt_validation.Clarity",
    inputs={
        "model": vm_banking_model,
    },
).log()

vm.tests.run_test(
    "validmind.prompt_validation.Conciseness",
    inputs={
        "model": vm_banking_model,
    },
).log()

vm.tests.run_test(
    "validmind.prompt_validation.Delimitation",
    inputs={
        "model": vm_banking_model,
    },
).log()

vm.tests.run_test(
    "validmind.prompt_validation.NegativeInstruction",
    inputs={
        "model": vm_banking_model,
    },
).log()

vm.tests.run_test(
    "validmind.prompt_validation.Specificity",
    inputs={
        "model": vm_banking_model,
    },
).log()

Initializing the ValidMind dataset

After validation our system prompt, let's import our sample dataset (banking_test_dataset.py), which we'll use in the next section to evaluate our agent's performance across different banking scenarios:

from banking_test_dataset import banking_test_dataset

The next step is to connect your data with a ValidMind Dataset object. This step is always necessary every time you want to connect a dataset to documentation and produce test results through ValidMind, but you only need to do it once per dataset.

Initialize a ValidMind dataset object using the init_dataset function from the ValidMind (vm) module. For this example, we'll pass in the following arguments:

input_id — A unique identifier that allows tracking what inputs are used when running each individual test.
dataset — The raw dataset that you want to provide as input to tests.
text_column — The name of the column containing the text input data.
target_column — A required argument if tests require access to true values. This is the name of the target column in the dataset.

vm_test_dataset = vm.init_dataset(
    input_id="banking_test_dataset",
    dataset=banking_test_dataset,
    text_column="input",
    target_column="possible_outputs",
)

print("Banking Test Dataset Initialized in ValidMind!")
print(f"Dataset ID: {vm_test_dataset.input_id}")
print(f"Dataset columns: {vm_test_dataset._df.columns}")
vm_test_dataset._df

Assign predictions

Now that both the model object and the datasets have been registered, we'll assign predictions to capture the banking agent's responses for evaluation:

The assign_predictions() method from the Dataset object can link existing predictions to any number of models.
This method links the model's class prediction values and probabilities to our vm_train_ds and vm_test_ds datasets.

If no prediction values are passed, the method will compute predictions automatically:

vm_test_dataset.assign_predictions(vm_banking_model)

print("Banking Agent Predictions Generated Successfully!")
print(f"Predictions assigned to {len(vm_test_dataset._df)} test cases")
vm_test_dataset._df.head()

Running accuracy tests

Using @vm.test, let's implement some reusable custom inline tests to assess the accuracy of our banking agent:

An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.
You'll note that the custom test functions are just regular Python functions that can include and require any Python library as you see fit.

Response accuracy test

We'll create a custom test that evaluates the banking agent's ability to provide accurate responses by:

Testing against a dataset of predefined banking questions and expected answers.
Checking if responses contain expected keywords and banking terminology.
Providing detailed test results including pass/fail status.
Helping identify any gaps in the agent's banking knowledge or response quality.


@vm.test("my_custom_tests.banking_accuracy_test")
def banking_accuracy_test(model, dataset, list_of_columns):
    """
    The Banking Accuracy Test evaluates whether the agent’s responses include 
    critical domain-specific keywords and phrases that indicate accurate, compliant,
    and contextually appropriate banking information. This test ensures that the agent
    provides responses containing the expected banking terminology, risk classifications,
    account details, or other domain-relevant information required for regulatory compliance,
    customer safety, and operational accuracy.
    """
    df = dataset._df
    
    # Pre-compute responses for all tests
    y_true = dataset.y.tolist()
    y_pred = dataset.y_pred(model).tolist()

    # Vectorized test results
    test_results = []
    for response, keywords in zip(y_pred, y_true):
        # Convert keywords to list if not already a list
        if not isinstance(keywords, list):
            keywords = [keywords]
        test_results.append(any(str(keyword).lower() in str(response).lower() for keyword in keywords))
        
    results = pd.DataFrame()
    column_names = [col + "_details" for col in list_of_columns]
    results[column_names] = df[list_of_columns]
    results["actual"] = y_pred
    results["expected"] = y_true
    results["passed"] = test_results
    results["error"] = None if test_results else f'Response did not contain any expected keywords: {y_true}'
    
    return results

Now that we've defined our custom response accuracy test, we can run the test using the same run_test() function we used earlier to validate the system prompt using our sample dataset and agentic model as input, and log the test results to the ValidMind Platform with the log() method:

result = vm.tests.run_test(
    "my_custom_tests.banking_accuracy_test",
    inputs={
        "dataset": vm_test_dataset,
        "model": vm_banking_model
    },
    params={
        "list_of_columns": ["input"]
    }
)
result.log()

Let's review the first five rows of the test dataset to inspect the results to see how well the banking agent performed. Each column in the output serves a specific purpose in evaluating agent performance:

Column header	Description	Importance
`input`	Original user query or request	Essential for understanding the context of each test case and tracing which inputs led to specific agent behaviors.
`expected_tools`	Banking tools that should be invoked for this request	Enables validation of correct tool selection, which is critical for agentic AI systems where choosing the right tool is a key success metric.
`expected_output`	Expected output or keywords that should appear in the response	Defines the success criteria for each test case, enabling objective evaluation of whether the agent produced the correct result.
`session_id`	Unique identifier for each test session	Allows tracking and correlation of related test runs, debugging specific sessions, and maintaining audit trails.
`category`	Classification of the request type	Helps organize test results by domain and identify performance patterns across different banking use cases.
`banking_agent_model_output`	Complete agent response including all messages and reasoning	Allows you to examine the full output to assess response quality, completeness, and correctness beyond just keyword matching.
`banking_agent_model_tool_messages`	Messages exchanged with the banking tools	Critical for understanding how the agent interacted with tools, what parameters were passed, and what tool outputs were received.
`banking_agent_model_tool_called`	Specific tool that was invoked	Enables validation that the agent selected the correct tool for each request, which is fundamental to agentic AI validation.
`possible_outputs`	Alternative valid outputs or keywords that could appear in the response	Provides flexibility in evaluation by accounting for multiple acceptable response formats or variations.

vm_test_dataset.df.head(5)

Tool selection accuracy test

We'll also create a custom test that evaluates the banking agent's ability to select the correct tools for different requests by:

Testing against a dataset of predefined banking queries with expected tool selections.
Comparing the tools actually invoked by the agent against the expected tools for each request.
Providing quantitative accuracy scores that measure the proportion of expected tools correctly selected.
Helping identify gaps in the agent's understanding of user needs and tool selection logic.

First, we'll define a helper function that extracts tool calls from the agent's messages and compares them against the expected tools. This function handles different message formats (dictionary or object) and calculates accuracy scores:

def validate_tool_calls_simple(messages, expected_tools):
    """Simple validation of tool calls without RAGAS dependency issues."""
    
    tool_calls_found = []
    
    for message in messages:
        if hasattr(message, 'tool_calls') and message.tool_calls:
            for tool_call in message.tool_calls:
                # Handle both dictionary and object formats
                if isinstance(tool_call, dict):
                    tool_calls_found.append(tool_call['name'])
                else:
                    # ToolCall object - use attribute access
                    tool_calls_found.append(tool_call.name)
    
    # Check if expected tools were called
    accuracy = 0.0
    matches = 0
    if expected_tools:
        matches = sum(1 for tool in expected_tools if tool in tool_calls_found)
        accuracy = matches / len(expected_tools)
    
    return {
        'expected_tools': expected_tools,
        'found_tools': tool_calls_found,
        'matches': matches,
        'total_expected': len(expected_tools) if expected_tools else 0,
        'accuracy': accuracy,
    }

Now we'll define the main test function that uses the helper function to evaluate tool selection accuracy across all test cases in the dataset:

@vm.test("my_custom_tests.BankingToolCallAccuracy")
def BankingToolCallAccuracy(dataset, agent_output_column, expected_tools_column):
    """
    Evaluates the tool selection accuracy of a LangGraph-powered banking agent.

    This test measures whether the agent correctly identifies and invokes the required banking tools
    for each user query scenario.
    For each case, the outputs generated by the agent (including its tool calls) are compared against an
    expected set of tools. The test considers both coverage and exactness: it computes the proportion of
    expected tools correctly called by the agent for each instance.

    Parameters:
        dataset (VMDataset): The dataset containing user queries, agent outputs, and ground-truth tool expectations.
        agent_output_column (str): Dataset column name containing agent outputs (should include tool call details in 'messages').
        expected_tools_column (str): Dataset column specifying the true expected tools (as lists).

    Returns:
        List[dict]: Per-row dictionaries with details: expected tools, found tools, match count, total expected, and accuracy score.

    Purpose:
        Provides diagnostic evidence of the banking agent's core reasoning ability—specifically, its capacity to
        interpret user needs and select the correct banking actions. Useful for diagnosing gaps in tool coverage,
        misclassifications, or breakdowns in agent logic.

    Interpretation:
        - An accuracy of 1.0 signals perfect tool selection for that example.
        - Lower scores may indicate partial or complete failures to invoke required tools.
        - Review 'found_tools' vs. 'expected_tools' to understand the source of discrepancies.

    Strengths:
        - Directly tests a core capability of compositional tool-use agents.
        - Framework-agnostic; robust to tool call output format (object or dict).
        - Supports batch validation and result logging for systematic documentation.

    Limitations:
        - Does not penalize extra, unnecessary tool calls.
        - Does not assess result quality—only correct invocation.

    """
    df = dataset._df
    
    results = []
    for i, row in df.iterrows():
        result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])
        results.append(result)
         
    return results

Finally, we can call our function with run_test() and log the test results to the ValidMind Platform:

result = vm.tests.run_test(
    "my_custom_tests.BankingToolCallAccuracy",
    inputs={
        "dataset": vm_test_dataset,
    },
    params={
        "agent_output_column": "banking_agent_model_output",
        "expected_tools_column": "expected_tools"
    }
)
result.log()

Assigning AI evaluation metric scores

AI agent evaluation metrics are specialized measurements designed to assess how well autonomous LLM-based agents reason, plan, select and execute tools, and ultimately complete user tasks by analyzing the full execution trace — including reasoning steps, tool calls, intermediate decisions, and outcomes, rather than just single input–output pairs. These metrics are essential because agent failures often occur in ways traditional LLM metrics miss — for example, choosing the right tool with wrong arguments, creating a good plan but not following it, or completing a task inefficiently.

In this section, we'll evaluate our banking agent's outputs and add scoring to our sample dataset against metrics defined in DeepEval’s AI agent evaluation framework which breaks down AI agent evaluation into three layers with corresponding subcategories: reasoning, action, and execution.

Together, these three metrics enable granular diagnosis of agent behavior, help pinpoint where failures occur (reasoning, action, or execution), and support both development benchmarking and production monitoring.

Identify relevant DeepEval scorers

Scorers are evaluation metrics that analyze model outputs and store their results in the dataset:

Each scorer adds a new column to the dataset with format: {scorer_name}_{metric_name}
The column contains the numeric score (typically 0-1) for each example
Multiple scorers can be run on the same dataset, each adding their own column
Scores are persisted in the dataset for later analysis and visualization
Common scorer patterns include:
- Model performance metrics (accuracy, F1, etc.)
- Output quality metrics (relevance, faithfulness)
- Task-specific metrics (completion, correctness)

Use list_scorers() from validmind.scorers to discover all available scoring methods and their IDs that can be used with assign_scores(). We'll filter these results to return only DeepEval scorers for our desired three metrics in a formatted table with descriptions:

# Load all DeepEval scorers
llm_scorers_dict = vm.tests.load._load_tests([s for s in vm.scorer.list_scorers() if "deepeval" in s.lower()])

# Categorize scorers by metric layer
reasoning_scorers = {}
action_scorers = {}
execution_scorers = {}

for scorer_id, scorer_func in llm_scorers_dict.items():
    tags = getattr(scorer_func, "__tags__", [])
    scorer_name = scorer_id.split(".")[-1]

    if "reasoning_layer" in tags:
        reasoning_scorers[scorer_id] = scorer_func
    elif "action_layer" in tags:
        action_scorers[scorer_id] = scorer_func
    elif "TaskCompletion" in scorer_name:
        execution_scorers[scorer_id] = scorer_func

# Display scorers by category
print("=" * 80)
print("REASONING LAYER")
print("=" * 80)
if reasoning_scorers:
    reasoning_df = vm.tests.load._pretty_list_tests(reasoning_scorers, truncate=True)
    display(reasoning_df)
else:
    print("No reasoning layer scorers found.")

print("\n" + "=" * 80)
print("ACTION LAYER")
print("=" * 80)
if action_scorers:
    action_df = vm.tests.load._pretty_list_tests(action_scorers, truncate=True)
    display(action_df)
else:
    print("No action layer scorers found.")

print("\n" + "=" * 80)
print("EXECUTION LAYER")
print("=" * 80)
if execution_scorers:
    execution_df = vm.tests.load._pretty_list_tests(execution_scorers, truncate=True)
    display(execution_df)
else:
    print("No execution layer scorers found.")

Assign reasoning scores

Reasoning evaluates planning and strategy generation:

Plan quality – How logical, complete, and efficient the agent’s plan is.
Plan adherence – Whether the agent follows its own plan during execution.

Plan quality score

Let's measure how well our banking agent generates a plan before acting. A high score means the plan is logical, complete, and efficient.

vm_test_dataset.assign_scores(
    metrics = "validmind.scorers.llm.deepeval.PlanQuality",
    model = vm_banking_model,
    input_column = "input",
)
vm_test_dataset._df[['banking_agent_model_PlanQuality_score','banking_agent_model_PlanQuality_reason']]

Plan adherence score

Let's check whether our banking agent follows the plan it created. Deviations lower this score and indicate gaps between reasoning and execution.

vm_test_dataset.assign_scores(
    metrics = "validmind.scorers.llm.deepeval.PlanAdherence",
    input_column = "input",
    model = vm_banking_model,
)
vm_test_dataset._df[['banking_agent_model_PlanAdherence_score','banking_agent_model_PlanAdherence_reason']]

Assign action scores

Action assesses tool usage and argument generation:

Tool correctness – Whether the agent selects and calls the right tools.
Argument correctness – Whether the agent generates correct tool arguments.

Tool correctness score

Let's evaluate if our banking agent selects the appropriate tool for the task. Choosing the wrong tool reduces performance even if reasoning was correct.

vm_test_dataset.assign_scores(
    metrics = "validmind.scorers.llm.deepeval.ToolCorrectness",
    input_column = "input",
    model = vm_banking_model,
    expected_tools_called_column = "expected_tools",
    actual_tools_called_column = "banking_agent_model_tool_called",
)
vm_test_dataset._df[['banking_agent_model_ToolCorrectness_score','banking_agent_model_ToolCorrectness_reason']]

Argument correctness score

Let's assesses whether our banking agent provides correct inputs or arguments to the selected tool. Incorrect arguments can lead to failed or unexpected results.

vm_test_dataset.assign_scores(
    metrics = "validmind.scorers.llm.deepeval.ArgumentCorrectness",
    input_column = "input",
    model = vm_banking_model,
    actual_tools_called_column = "banking_agent_model_tool_called",
)
vm_test_dataset._df[['banking_agent_model_ArgumentCorrectness_score','banking_agent_model_ArgumentCorrectness_reason']]

Assign execution score

Execution measures end-to-end performance:

Task completion – Whether the agent successfully completes the intended task.

Task completion score

Let's evaluate whether our banking agent successfully completes the requested tasks. Incomplete task execution can lead to user dissatisfaction and failed banking operations.

vm_test_dataset.assign_scores(
    metrics = "validmind.scorers.llm.deepeval.TaskCompletion",
    input_column = "input",
    model = vm_banking_model,
    actual_tools_called_column = "banking_agent_model_tool_called",
)
vm_test_dataset._df[['banking_agent_model_TaskCompletion_score','banking_agent_model_TaskCompletion_reason']]

As you recall from the beginning of this section, when we run scorers through assign_scores(), the return values are automatically processed and added as new columns with the format {scorer_name}_{metric_name}. Note that the task completion scorer has added a new column TaskCompletion_score to our dataset.

We'll use this column to visualize the distribution of task completion scores across our test cases through the BoxPlot test:

vm.tests.run_test(
    "validmind.plots.BoxPlot",
    inputs={"dataset": vm_test_dataset},
    params={
        "columns": "banking_agent_model_TaskCompletion_score",
        "title": "Distribution of Task Completion Scores",
        "ylabel": "Score",
        "figsize": (8, 6)
    }
).log()

Running RAGAS tests

Next, let's run some out-of-the-box Retrieval-Augmented Generation Assessment (RAGAS) tests available in the ValidMind Library. RAGAS provides specialized metrics for evaluating retrieval-augmented generation systems and conversational AI agents. These metrics analyze different aspects of agent performance by assessing how well systems integrate retrieved information with generated responses.

Our banking agent uses tools to retrieve information and generates responses based on that context, making it similar to a RAG system. RAGAS metrics help evaluate the quality of this integration by analyzing the relationship between retrieved tool outputs, user queries, and generated responses.

These tests provide insights into how well our banking agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to banking users while maintaining fidelity to retrieved information.

Identify relevant RAGAS tests

Let's explore some of ValidMind's available tests. Using ValidMind’s repository of tests streamlines your development testing, and helps you ensure that your models are being documented and evaluated appropriately.

You can pass tasks and tags as parameters to the vm.tests.list_tests() function to filter the tests based on the tags and task types:

tasks represent the kind of modeling task associated with a test. Here we'll focus on text_qa tasks.
tags are free-form descriptions providing more details about the test, for example, what category the test falls into. Here we'll focus on the ragas tag.

We'll then run three of these tests returned as examples below.

vm.tests.list_tests(task="text_qa", tags=["ragas"])

Faithfulness

Let's evaluate whether the banking agent's responses accurately reflect the information retrieved from tools. Unfaithful responses can misreport credit analysis, financial calculations, and compliance results—undermining user trust in the banking agent.

vm.tests.run_test(
    "validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "response_column": ["banking_agent_model_prediction"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
    },
).log()

Response Relevancy

Let's evaluate whether the banking agent's answers address the user's original question or request. Irrelevant or off-topic responses can frustrate users and fail to deliver the banking information they need.

vm.tests.run_test(
    "validmind.model_validation.ragas.ResponseRelevancy",
    inputs={"dataset": vm_test_dataset},
    params={
        "user_input_column": "input",
        "response_column": "banking_agent_model_prediction",
        "retrieved_contexts_column": "banking_agent_model_tool_messages",
    }
).log()

Context Recall

Let's evaluate how well the banking agent uses the information retrieved from tools when generating its responses. Poor context recall can lead to incomplete or underinformed answers even when the right tools were selected.

vm.tests.run_test(
    "validmind.model_validation.ragas.ContextRecall",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
        "reference_column": ["banking_agent_model_prediction"],
    },
).log()

Running safety tests

Finally, let's run some out-of-the-box safety tests available in the ValidMind Library. Safety tests provide specialized metrics for evaluating whether AI agents operate reliably and securely. These metrics analyze different aspects of agent behavior by assessing adherence to safety guidelines, consistency of outputs, and resistance to harmful or inappropriate requests.

Our banking agent handles sensitive financial information and user requests, making safety and reliability essential. Safety tests help evaluate whether the agent maintains appropriate boundaries, responds consistently and correctly to inputs, and avoids generating harmful, biased, or unprofessional content.

These tests provide insights into how well our banking agent upholds standards of fairness and professionalism, ensuring it operates reliably and securely for banking users.

AspectCritic

Let's evaluate our banking agent's responses across multiple quality dimensions — conciseness, coherence, correctness, harmfulness, and maliciousness. Weak performance on these dimensions can degrade user experience, fall short of professional banking standards, or introduce safety risks.

We'll use the AspectCritic we identified earlier:

vm.tests.run_test(
    "validmind.model_validation.ragas.AspectCritic",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "response_column": ["banking_agent_model_prediction"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
    },
).log()

Bias

Let's evaluate whether our banking agent's prompts contain unintended biases that could affect banking decisions. Biased prompts can lead to unfair or discriminatory outcomes — undermining customer trust and exposing the institution to compliance risk.

We'll first use list_tests() again to filter for tests relating to prompt_validation:

vm.tests.list_tests(filter="prompt_validation")

And then run the identified Bias test:

vm.tests.run_test(
    "validmind.prompt_validation.Bias",
    inputs={
        "model": vm_banking_model,
    },
).log()

Next steps

You can look at the output produced by the ValidMind Library right in the notebook where you ran the code, as you would expect. But there is a better way — use the ValidMind Platform to work with your model documentation.

Work with your model documentation

From the Inventory in the ValidMind Platform, go to the model you registered earlier. (Need more help?)
In the left sidebar that appears for your model, click Documentation under Documents.

What you see is the full draft of your model documentation in a more easily consumable version. From here, you can make qualitative edits to model documentation, view guidelines, collaborate with validators, and submit your model documentation for approval when it's ready. Learn more ...
Click into any section related to the tests we ran in this notebook, for example: 4.3. Prompt Evaluation to review the results of the tests we logged.

Customize the banking agent for your use case

You've now built an agentic AI system designed for banking use cases that supports compliance with supervisory guidance such as SR 11-7 and SS1/23, covering credit and fraud risk assessment for both retail and commercial banking. Extend this example agent to real-world banking scenarios and production deployment by:

Adapting the banking tools to your organization's specific requirements
Adding more banking scenarios and edge cases to your test set
Connecting the agent to your banking systems and databases
Implementing additional banking-specific tools and workflows

Discover more learning resources

Learn more about the ValidMind Library tools we used in this notebook:

We also offer many more interactive notebooks to help you document models:

Or, visit our documentation to learn more about ValidMind.

Upgrade ValidMind

After installing ValidMind, you’ll want to periodically make sure you are on the latest version to access any new features and other enhancements.

Retrieve the information for the currently installed version of ValidMind:

%pip show validmind

If the version returned is lower than the version indicated in our production open-source code, restart your notebook and run:

%pip install --upgrade validmind

You may need to restart your kernel after running the upgrade package for changes to be applied.